ropensci / bib2df

Parse a BibTeX file to a tibble

Home Page: https://docs.ropensci.org/bib2df

R 100.00%

bibtex peer-reviewed r r-package rstats

bib2df's Introduction

rOpenSci

This repository has been archived. The former README is now in README-NOT.md.

bib2df's Issues

Cannot install from CRAN?

I reinstalled R recently and was reinstalling my packages but get this when I attempted to install bib2df.

Warning in install.packages :
  package ‘bib2df’ is not available (for R version 3.5.0)

The README indicates that v1.0.0 is on CRAN?

Is there a way to transform latex accents to plain text?

When a .bib file is read into a data.frame, a LaTeX accent sequence such as {\\'{e}} is kept as-is. Can it be converted to the plain character (é, etc.)?
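One possible approach, sketched in base R: a small lookup table that maps LaTeX accent commands to Unicode characters. The function name latex_to_unicode and the table contents are assumptions made for illustration. They are not part of bib2df, and the table covers only a handful of accents.

```r
# Hypothetical helper (not part of bib2df): convert a few LaTeX accent
# sequences to their Unicode characters.
latex_to_unicode <- function(x) {
  # Deliberately incomplete table: accent command + letter -> character.
  accents <- c(
    "\\'e" = "\u00e9", "\\`e" = "\u00e8", "\\'a" = "\u00e1",
    "\\\"o" = "\u00f6", "\\\"u" = "\u00fc", "\\~n" = "\u00f1"
  )
  for (cmd in names(accents)) {
    accent <- substr(cmd, 1, 2)  # e.g. \'
    letter <- substr(cmd, 3, 3)  # e.g. e
    # Try the most heavily braced spelling first, the bare one last.
    variants <- c(
      paste0("{", accent, "{", letter, "}}"),  # {\'{e}}
      paste0("{", accent, letter, "}"),        # {\'e}
      paste0(accent, "{", letter, "}"),        # \'{e}
      cmd                                      # \'e
    )
    for (v in variants) {
      x <- gsub(v, accents[[cmd]], x, fixed = TRUE)
    }
  }
  x
}

latex_to_unicode("Caf{\\'{e}}")
```

Applied to a column, this would look like latex_to_unicode(df$TITLE). A complete solution needs a far larger table or a dedicated LaTeX decoder.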

New release?

Quite a lot of work has been done since the last release, which itself dates back more than two years.
Is there anything particular holding back a new version?

#ropensci: Avoid warning message

Using bib2df without specifying separate_names results in a warning message. Set default to FALSE with the option to set to TRUE.

ropensci/software-review#124 (comment)

Deprecation Warning from tibble 2.0.0

When using bib2df (version 1.1.1) bib2df:::bib2df_gather(bib = myBib) I get the following warning:

Warning (test-occCitePrint.R:19:3): regular print
as_data_frame() was deprecated in tibble 2.0.0.
Please use as_tibble() instead.
The signature and semantics have changed, see ?as_tibble.
This warning is displayed once every 8 hours.
Call lifecycle::last_lifecycle_warnings() to see where this warning was generated.

It looks like this could be fixed by replacing dat <- as_data_frame(dat) with dat <- as_tibble(dat).

parsing problems: = symbol

Hi, I've run into a problem reading in references where the abstract field contains an equals sign (=): the preceding abstract text is read in as a column header.
e.g.
"high genetic differentiation (F st = 0.043"
ends up as a new column header "HIGH.GENETIC.DIFFERENTIATION..F.ST"
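A minimal sketch of one way to avoid this, assuming each field sits on a single line: split the line at the first "=" only, so that any later "=" stays inside the value. The helper name split_field is hypothetical and this is not the package's actual parsing code.

```r
# Hypothetical helper: split "key = value" at the FIRST "=" only.
split_field <- function(line) {
  pos <- regexpr("=", line, fixed = TRUE)  # position of the first "="
  if (pos < 0) return(NULL)                # not a "key = value" line
  key <- toupper(trimws(substr(line, 1, pos - 1)))
  value <- trimws(substr(line, pos + 1, nchar(line)))
  list(key = key, value = value)
}

field <- split_field("abstract = {high genetic differentiation (F st = 0.043)},")
field$key    # the field name, upper-cased
field$value  # the raw value, still containing the second "="
```

A "=" on a continuation line of a multi-line abstract would still be misread by this sketch, so lines would have to be joined per field first.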

Editor field is lost when reading and writing with `separate_names = TRUE`

My example bibtex file has an entry of an edited volume, for which there is an editor field but not an author field.

@book{DeBruijn2011,
address = {Hoboken, NJ, USA},
booktitle = {Handb. Mol. Microb. Ecol. II Metagenomics Differ. Habitats},
doi = {10.1002/9781118010549},
editor = {de Bruijn, Frans J.},
file = {:home/michael/articles/Unknown - 2011 - Handbook of Molecular Microbial Ecology II.pdf:pdf},
isbn = {9781118010549},
month = {sep},
publisher = {John Wiley {\&} Sons, Inc.},
title = {{Handbook of Molecular Microbial Ecology II}},
url = {http://doi.wiley.com/10.1002/9781118010549},
year = {2011}
}

When I read and write the bibtex file with separate_names, the editor field is dropped, and an author field with "," is added. But it works ok without separate_names. If I run

tb <- bib2df::bib2df("/tmp/library.bib", separate_names = TRUE)
bib2df::df2bib(tb, "/tmp/library-1.bib")

tb <- bib2df::bib2df("/tmp/library.bib", separate_names = FALSE)
bib2df::df2bib(tb, "/tmp/library-2.bib")

Then in library-1.bib I get

@Book{DeBruijn2011,
  Address = {Hoboken, NJ, USA},
  Author = {,},
  Booktitle = {Handb. Mol. Microb. Ecol. II Metagenomics Differ. Habitats},
  Month = {sep},
  Publisher = {John Wiley {\&} Sons, Inc.},
  Title = {{Handbook of Molecular Microbial Ecology II}},
  Year = {2011},
  Doi = {10.1002/9781118010549},
  File = {:home/michael/articles/Unknown - 2011 - Handbook of Molecular Microbial Ecology II.pdf:pdf},
  Url = {http://doi.wiley.com/10.1002/9781118010549},
  Isbn = {9781118010549}
}

and in library-2.bib,

@Book{DeBruijn2011,
  Address = {Hoboken, NJ, USA},
  Booktitle = {Handb. Mol. Microb. Ecol. II Metagenomics Differ. Habitats},
  Editor = {de Bruijn, Frans J.},
  Month = {sep},
  Publisher = {John Wiley {\&} Sons, Inc.},
  Title = {{Handbook of Molecular Microbial Ecology II}},
  Year = {2011},
  Doi = {10.1002/9781118010549},
  File = {:home/michael/articles/Unknown - 2011 - Handbook of Molecular Microbial Ecology II.pdf:pdf},
  Url = {http://doi.wiley.com/10.1002/9781118010549},
  Isbn = {9781118010549}
}

Support for parsing .bib from scopus

I tried to parse .bib files exported from scopus today but ended up with a total mess of column names (see below).

bib_string <- "@ARTICLE{Brulc20091948,
author={Brulc, J.M. and Antonopoulos, D.A. and Berg Miller, M.E. and Wilson, M.K. and Yannarell, A.C. and Dinsdale, E.A. and Edwards, R.E. and Frank, E.D. and Emerson, J.B. and Wacklin, P. and Coutinho, P.M. and Henrissat, B. and Nelson, K.E. and White, B.A.},
title={Gene-centric metagenomics of the fiber-adherent bovine rumen microbiome reveals forage specific glycoside hydrolases},
journal={Proceedings of the National Academy of Sciences of the United States of America},
year={2009},
doi={10.1073/pnas.0806191105},
url={https://www.scopus.com/inward/record.uri?eid=2-s2.0-60549114321&doi=10.1073%2fpnas.0806191105&partnerID=40&md5=8d70a27545328d4cbb538bdb4757335b},
affiliation={Department of Animal Sciences, University of Illinois, Urbana, IL 61801, United States; Institute for Genomics and Systems Biology, Argonne National Laboratory, Argonne, IL 60439, United States; Department of Biology, San Diego State University, San Diego, CA 92813, United States; School of Biological Sciences, Flinders University, Adelaide, SA 5001, Australia; Center for Microbial Sciences, San Diego State University, San Diego, CA 92813, United States; Department of Computer Sciences, San Diego State University, San Diego, CA 92813, United States; Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439, United States; J. Craig Venter Institute, 9712 Medical Center Drive, Rockville, MD 20850, United States; Architecture et Fonction des Macromolecules Biologiques, Unité Mixte de Recherche 6098, Universites Aix-Marseille I and II, Case 932, 163 Avenue de Luminy, 13288 Marseille, France; Institute for Genomic Biology, University of Illinois, Urbana, IL 61801, United States},
abstract={The complex microbiome of the rumen functions as an effective system for the conversion of plant cell wall biomass to microbial protein, short chain fatty acids, and gases. As such, it provides a unique genetic resource for plant cell wall degrading microbial enzymes that could be used in the production of biofuels. The rumen and gastrointestinal tract harbor a dense and complex microbiome. To gain a greater understanding of the ecology and metabolic potential of this microbiome, we used comparative metagenomics (phylotype analysis and SEED subsystems-based annotations) to examine randomly sampled pyrosequence data from 3 fiber-adherent microbiomes and 1 pooled liquid sample (a mixture of the liquid microbiome fractions from the same bovine rumens). Even though the 3 animals were fed the same diet, the community structure, predicted phylotype, and metabolic potentials in the rumen were markedly different with respect to nutrient utilization. A comparison of the glycoside hydrolase and cellulosome functional genes revealed that in the rumen microbiome, initial colonization of fiber appears to be by organisms possessing enzymes that attack the easily available side chains of complex plant polysaccharides and not the more recalcitrant main chains, especially cellulose. Furthermore, when compared with the termite hindgut microbiome, there are fundamental differences in the glycoside hydrolase content that appear to be diet driven for either the bovine rumen (forages and legumes) or the termite hindgut (wood). © 2009 by The National Academy of Sciences of the USA.},
author_keywords={CAZymes;  Cellulases;  Plant cell wall;  Pyrosequencing},
Isoptera},
document_type={Article},
source={Scopus},
}"
fil <- tempfile("data")
write(bib_string, fil)
bib2df::bib2df(fil)
#> Column `YEAR` contains character strings.
#>               No coercion to numeric applied.
#> # A tibble: 1 x 37
#>   CATEGORY BIBTEXKEY ADDRESS ANNOTE AUTHOR BOOKTITLE CHAPTER CROSSREF
#>   <chr>    <chr>     <chr>   <chr>  <list> <chr>     <chr>   <chr>   
#> 1 ARTICLE  Brulc200~ <NA>    <NA>   <chr ~ <NA>      <NA>    <NA>    
#> # ... with 29 more variables: EDITION <chr>, EDITOR <list>,
#> #   HOWPUBLISHED <chr>, INSTITUTION <chr>, JOURNAL <chr>, KEY <chr>,
#> #   MONTH <chr>, NOTE <chr>, NUMBER <chr>, ORGANIZATION <chr>,
#> #   PAGES <chr>, PUBLISHER <chr>, SCHOOL <chr>, SERIES <chr>, TITLE <chr>,
#> #   TYPE <chr>, VOLUME <chr>, YEAR <chr>, AUTHOR..BRULC. <chr>,
#> #   TITLE..GENE.CENTRIC <chr>, JOURNAL..PROCEEDINGS <chr>,
#> #   YEAR..2009.. <chr>, DOI..10.1073.PNAS.0806191105.. <chr>,
#> #   URL..HTTPS...WWW.SCOPUS.COM.INWARD.RECORD.URI.EID.2.S2.0.60549114321.DOI.10.1073.2FPNAS.0806191105.PARTNERID.40.MD5.8D70A27545328D4CBB538BDB4757335B.. <chr>,
#> #   AFFILIATION..DEPARTMENT <chr>, ABSTRACT..THE <chr>,
#> #   AUTHOR_KEYWORDS..CAZYMES. <chr>, DOCUMENT_TYPE..ARTICLE.. <chr>,
#> #   SOURCE..SCOPUS.. <chr>

Created on 2019-08-06 by the reprex package (v0.3.0)

Ability to separate surname and given name

This might help: https://cran.r-project.org/web/packages/humaniformat/index.html

problem with quotes

Hi,

I found a problem parsing .bib file with quotes instead of curly brackets:

Here is a file

@BOOK{hoehlig97, 
  author = "Monika Hoehlig", 
  title = "Kontaktbedingter {S}prachwandel in der adygeischen {U}mgangssprache im {K}aukasus und in der {T}uerkei", 
  series = "LINCOM Studies in Caucasian Linguistics 03", 
  year = "1997", 
  publisher = "Lincom GmbH", 
  address = "Muenchen", 
}

Here is the code:

bib2df("test.bib") %>% 
   unlist() %>% 
   na.omit() %>% 
   View()

Here is the result:

CATEGORY	BOOK
BIBTEXKEY	hoehlig97
ADDRESS	Muenchen",
AUTHOR	Monika Hoehlig",
PUBLISHER	Lincom GmbH",
SERIES	LINCOM Studies in Caucasian Linguistics 03",
TITLE	Kontaktbedingter {S}prachwandel in der adygeischen {U}mgangssprache im {K}aukasus und in der {T}uerkei",
YEAR	1997",

As you see the problem is in the final ",.

I'm using bib2df v. 1.1.1
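As a sketch of the expected behaviour (not the package's implementation): drop the trailing field separator, then remove exactly one pair of outer delimiters, whether they are double quotes or curly brackets. The name strip_delimiters is an assumption for illustration.

```r
# Hypothetical helper: remove the trailing comma and ONE pair of outer
# delimiters ("..." or {...}) from a raw BibTeX field value.
strip_delimiters <- function(value) {
  value <- trimws(value)
  value <- trimws(sub(",$", "", value))  # trailing field separator
  n <- nchar(value)
  first <- substr(value, 1, 1)
  last <- substr(value, n, n)
  quoted <- first == "\"" && last == "\""
  braced <- first == "{" && last == "}"
  if (n >= 2 && (quoted || braced)) {
    value <- substr(value, 2, n - 1)     # inner braces are left untouched
  }
  value
}

strip_delimiters("\"Monika Hoehlig\", ")
strip_delimiters("\"Kontaktbedingter {S}prachwandel\",")
```

Bare values such as 1999 pass through unchanged, because they have no outer delimiters to strip.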

Import bibtex from scopus generates thousands of variables

Whenever I import a Scopus export, the resulting data frame is completely messed up and has thousands of columns. Apparently this should have been fixed by #33 or #34, but I'm afraid it is not.

Steps:

Make a query in Scopus
Export a BibTeX file (see here: 20210604_scopus_urban_commons.zip)
Install the development version of bib2df (devtools::install_github("ropensci/bib2df") - 28th November 2021)
Run testbib <- bib2df::bib2df("<attached file>")

Result:

 testbib
# A tibble: 307 × 55
   CATEGORY   BIBTEXKEY ADDRESS ANNOTE AUTHOR BOOKTITLE CHAPTER CROSSREF EDITION EDITOR HOWPUBLISHED INSTITUTION JOURNAL KEY   MONTH
   <chr>      <chr>     <chr>   <chr>  <list> <chr>     <chr>   <chr>    <chr>   <list> <chr>        <chr>       <chr>   <chr> <chr>
 1 ARTICLE    Köpper20… NA      NA     <chr … NA        NA      NA       NA      <chr … NA           NA          Resear… NA    NA   
 2 CONFERENCE Manfredi… NA      NA     <chr … NA        NA      NA       NA      <chr … NA           NA          IOP Co… NA    NA   
 3 ARTICLE    Avdikos2… NA      NA     <chr … NA        NA      NA       NA      <chr … NA           NA          Geofor… NA    NA   
 4 ARTICLE    Petrescu… NA      NA     <chr … NA        NA      NA       NA      <chr … NA           NA          Enviro… NA    NA   
 5 ARTICLE    Parikh20… NA      NA     <chr … NA        NA      NA       NA      <chr … NA           NA          Enviro… NA    NA   
 6 ARTICLE    Dekeyser… NA      NA     <chr … NA        NA      NA       NA      <chr … NA           NA          Enviro… NA    NA   
 7 BOOK       Stuber20… NA      NA     <chr … NA        NA      NA       NA      <chr … NA           NA          Balanc… NA    NA   
 8 ARTICLE    Wang2021… NA      NA     <chr … NA        NA      NA       NA      <chr … NA           NA          Americ… NA    NA   
 9 ARTICLE    Marino20… NA      NA     <chr … NA        NA      NA       NA      <chr … NA           NA          Territ… NA    NA   
10 ARTICLE    Sardeshp… NA      NA     <chr … NA        NA      NA       NA      <chr … NA           NA          Cities  NA    NA   
# … with 297 more rows, and 40 more variables: NOTE <chr>, NUMBER <chr>, ORGANIZATION <chr>, PAGES <chr>, PUBLISHER <chr>,
#   SCHOOL <chr>, SERIES <chr>, TITLE <chr>, TYPE <chr>, VOLUME <chr>, YEAR <dbl>, DOI <chr>, URL <chr>, AFFILIATION <chr>,
#   ABSTRACT <chr>, AUTHOR_KEYWORDS <chr>, REFERENCES <chr>, ISSN <chr>, LANGUAGE <chr>, ABBREV_SOURCE_TITLE <chr>,
#   DOCUMENT_TYPE <chr>, SOURCE <chr>, ART_NUMBER <chr>, KEYWORDS <chr>, FUNDING_DETAILS <chr>, FUNDING_TEXT <chr>,
#   CORRESPONDENCE_ADDRESS1 <chr>, SPONSORS <chr>, FUNDING_TEXT.1 <chr>, FUNDING_DETAILS.1 <chr>, FUNDING_DETAILS.2 <chr>,
#   ISBN <chr>, FUNDING_DETAILS.3 <chr>, FUNDING_DETAILS.4 <chr>, FUNDING_TEXT.2 <chr>, CODEN <chr>, FUNDING_DETAILS.5 <chr>,
#   PUBMED_ID <chr>, PAGE_COUNT <chr>, CHEMICALS_CAS <chr>

sessioninfo:

R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: KDE neon User - Plasma 25th Anniversary Edition

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3

locale:
 [1] LC_CTYPE=ca_ES.UTF-8       LC_NUMERIC=C               LC_TIME=es_ES.UTF-8        LC_COLLATE=ca_ES.UTF-8    
 [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=ca_ES.UTF-8    LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7         rstudioapi_0.13    magrittr_2.0.1     tidyselect_1.1.1   R6_2.5.1           rlang_0.4.12      
 [7] fansi_0.5.0        stringr_1.4.0      httr_1.4.2         dplyr_1.0.7        tools_4.1.2        humaniformat_0.6.0
[13] utf8_1.2.2         cli_3.1.0          DBI_1.1.1          ellipsis_0.3.2     assertthat_0.2.1   tibble_3.1.5      
[19] lifecycle_1.0.1    crayon_1.4.2       purrr_0.3.4        vctrs_0.3.8        glue_1.4.2         stringi_1.7.5     
[25] compiler_4.1.2     pillar_1.6.4       generics_0.1.1     renv_0.13.2        bib2df_1.1.2       pkgconfig_2.0.3

#ropensci: Do not export internal functions

These are:

bib2df_read()
bib2df_gather()
bib2df_tidy()

Also, remove documentation for these functions.

ropensci/software-review#124 (comment)

Write df back to .bib

It might be useful to be able to write the imported data frame back to a .bib file. This would enable users to do processing in R to programmatically modify a BibTeX database.
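Other issues on this page call df2bib() for exactly this purpose. Purely to illustrate the idea, here is a base R sketch that serialises one entry, held as a named character vector, back to BibTeX text. The name entry_to_bib and the input layout are assumptions, not the package's code.

```r
# Hypothetical helper: turn one entry (named character vector with
# CATEGORY and BIBTEXKEY plus arbitrary fields) into BibTeX text.
entry_to_bib <- function(entry) {
  fields <- entry[!names(entry) %in% c("CATEGORY", "BIBTEXKEY")]
  fields <- fields[!is.na(fields)]  # skip empty fields
  body <- paste0("  ", names(fields), " = {", fields, "}", collapse = ",\n")
  paste0("@", entry[["CATEGORY"]], "{", entry[["BIBTEXKEY"]], ",\n", body, "\n}")
}

entry <- c(CATEGORY = "Book", BIBTEXKEY = "DeBruijn2011",
           TITLE = "Handbook of Molecular Microbial Ecology II",
           YEAR = "2011", AUTHOR = NA)
cat(entry_to_bib(entry), "\n")
```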

#ropensci: Improve the vignette

Vignette is not indexed
Update %\VignetteIndexEntry{}
Link to the humaniformat package

ropensci/software-review#124 (comment)

Add option to change encoding assumed for input strings (UTF-8)

I use UTF-8 characters in a bibliography, which bib2df doesn't seem to import properly. A way to fix this issue is to add an additional argument to the bib2df function that allows users to specify the encoding of the .bib file (default: encoding = "unknown"), and to feed that argument to the readLines call inside the bib2df_read function. This way I could import the bibliography with bib2df("references.bib", encoding = "UTF-8").

I'd be happy to make a pull request. Thanks for this useful package!
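A minimal sketch of the proposal above, assuming the file is read with readLines(): the encoding argument is simply passed through. The wrapper name read_bib_lines is hypothetical and stands in for the reading step inside bib2df_read.

```r
# Hypothetical stand-in for the reading step: forward `encoding` to
# readLines(), which marks non-ASCII strings with that encoding.
read_bib_lines <- function(file, encoding = "unknown") {
  readLines(file, encoding = encoding, warn = FALSE)
}

# Write a small UTF-8 file and read it back with the encoding declared.
f <- tempfile(fileext = ".bib")
writeLines(enc2utf8("author = {Jos\u00e9 Garc\u00eda},"), f, useBytes = TRUE)
lines <- read_bib_lines(f, encoding = "UTF-8")
Encoding(lines)
```

Note that readLines(encoding = ) only marks the strings and does not re-encode them. A file in another encoding would need file(..., encoding = ) or iconv() instead.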

UTF-8 encoding

I have difficulties reading some of my reference files. R Markdown reports an error in the UTF-8 encoding of my BibTeX files. Please add support for UTF-8 encoding. This would be great!

bib variables with '.' in name

Hi, thanks for the great package.

I have found an edge-case that causes an issue with writing valid .bib files with df2bib. Specifically, I had a .bib entry that had 2 DOI values, like the following:

@Article{test2022,
doi = {DOI_PLACEHOLDER},
doi = {DOI_PLACEHOLDER2}
}

Parsing this with bib2df and then rewriting it with df2bib results in something like:

@Article{test2022,
doi = {DOI_PLACEHOLDER},
doi.1 = {DOI_PLACEHOLDER2}
}

However, '.' is not a valid character in a field name in .bib files (I think; at least I had issues knitting an .Rmd file).

I suspect this would be an easy fix, where you could substitute '.' in variable names with '_' or something?

Cheers
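The suggested substitution could look like the sketch below. It assumes the ".1" suffix comes from making duplicate column names unique, which base R's make.unique() reproduces here. The name sanitize_field_names is hypothetical, and whether "_" is accepted by every BibTeX tool is itself an assumption.

```r
# Duplicate names made unique gain a ".1" suffix, as in the output above.
fields <- make.unique(c("DOI", "DOI"))

# Hypothetical helper: replace "." in field names before writing.
sanitize_field_names <- function(nms) {
  gsub(".", "_", nms, fixed = TRUE)  # fixed = TRUE: "." is a literal dot
}

sanitize_field_names(fields)
```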

bib2df_gather strips braces incorrectly

Some regex bugs exist in bib2df_gather, e.g.:

cat('@Article{mykey,
  Author = {me},
  Title = {{FOO} bar {bAZ}},
  Year = {2011}
}
', file=f <- tempfile())

bib <- bib2df::bib2df(f)
bib$TITLE
#> [1] "FOO} bar {bAZ"

Created on 2019-11-13 by the reprex package (v0.3.0.9000)

Session info

sessioninfo::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.6.0 (2019-04-26)
#>  os       macOS Mojave 10.14.3        
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_AU.UTF-8                 
#>  ctype    en_AU.UTF-8                 
#>  tz       Australia/Melbourne         
#>  date     2019-11-13                  
#> 
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  package      * version date       lib source                          
#>  assertthat     0.2.1   2019-03-21 [2] CRAN (R 3.6.0)                  
#>  bib2df         1.1.1   2019-11-13 [1] Github (ROpenSci/bib2df@e151772)
#>  cli            1.1.0   2019-03-19 [2] CRAN (R 3.6.0)                  
#>  crayon         1.3.4   2017-09-16 [2] CRAN (R 3.6.0)                  
#>  digest         0.6.22  2019-10-21 [1] CRAN (R 3.6.0)                  
#>  dplyr          0.8.3   2019-07-04 [2] CRAN (R 3.6.0)                  
#>  evaluate       0.14    2019-05-28 [2] CRAN (R 3.6.0)                  
#>  glue           1.3.1   2019-03-12 [2] CRAN (R 3.6.0)                  
#>  highr          0.8     2019-03-20 [2] CRAN (R 3.6.0)                  
#>  htmltools      0.4.0   2019-10-04 [1] CRAN (R 3.6.0)                  
#>  httr           1.4.1   2019-08-05 [2] CRAN (R 3.6.0)                  
#>  humaniformat   0.6.0   2016-04-24 [1] CRAN (R 3.6.0)                  
#>  knitr          1.25    2019-09-18 [1] CRAN (R 3.6.0)                  
#>  magrittr       1.5     2014-11-22 [2] CRAN (R 3.6.0)                  
#>  pillar         1.4.2   2019-06-29 [2] CRAN (R 3.6.0)                  
#>  pkgconfig      2.0.3   2019-09-22 [1] CRAN (R 3.6.0)                  
#>  purrr          0.3.3   2019-10-18 [1] CRAN (R 3.6.0)                  
#>  R6             2.4.0   2019-02-14 [2] CRAN (R 3.6.0)                  
#>  Rcpp           1.0.3   2019-11-08 [1] CRAN (R 3.6.0)                  
#>  rlang          0.4.1   2019-10-24 [1] CRAN (R 3.6.0)                  
#>  rmarkdown      1.16    2019-10-01 [1] CRAN (R 3.6.0)                  
#>  sessioninfo    1.1.1   2018-11-05 [2] CRAN (R 3.6.0)                  
#>  stringi        1.4.3   2019-03-12 [2] CRAN (R 3.6.0)                  
#>  stringr        1.4.0   2019-02-10 [2] CRAN (R 3.6.0)                  
#>  tibble         2.1.3   2019-06-06 [2] CRAN (R 3.6.0)                  
#>  tidyselect     0.2.5   2018-10-11 [2] CRAN (R 3.6.0)                  
#>  withr          2.1.2   2018-03-15 [2] CRAN (R 3.6.0)                  
#>  xfun           0.10    2019-10-01 [1] CRAN (R 3.6.0)                  
#>  yaml           2.2.0   2018-07-25 [2] CRAN (R 3.6.0)                  
#> 
#> [1] /Users/jbau/Library/R/3.6/library
#> [2] /Library/Frameworks/R.framework/Versions/3.6/Resources/library

parsing problems: `@` in field (email addresses) and multi-line fields

I've run into a few problems parsing my .bib files with this. First, parsing fails for any field that has an @ anywhere in it (e.g., as part of an email address). Second, it fails for multi-line fields (like the annote fields that are auto-exported by Mendeley). Neither of these causes bibtex itself to complain so they are at least de facto supported. Fixing the first probably isn't so difficult, but fixing the second might be more challenging given how the reading function works (need a pass to combine lines with un-terminated strings/fields).
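Both points can be sketched in base R. For the first, an entry start is recognised only when "@" begins the line. For the second, a pre-pass joins lines until the curly brackets opened by a field are closed again. The helper names are hypothetical, and the sketch handles brace-delimited values only, not quoted ones.

```r
# Hypothetical helper: "@" starts an entry only at the beginning of a line.
is_entry_start <- function(line) grepl("^\\s*@[A-Za-z]+\\s*[{(]", line)

# Count occurrences of a literal character in one line.
count_char <- function(x, ch) lengths(regmatches(x, gregexpr(ch, x, fixed = TRUE)))

# Hypothetical pre-pass: merge continuation lines of multi-line fields.
# Depth 1 means "inside an entry, but not inside a field value".
join_multiline_fields <- function(lines) {
  out <- character(0)
  buffer <- NULL
  depth <- 0
  for (line in lines) {
    buffer <- if (is.null(buffer)) line else paste(buffer, trimws(line))
    depth <- depth + count_char(line, "{") - count_char(line, "}")
    if (depth <= 1) {  # field (or entry) is complete
      out <- c(out, buffer)
      buffer <- NULL
    }
  }
  if (!is.null(buffer)) out <- c(out, buffer)  # unterminated trailing field
  out
}

bib <- c("@article{key,",
         "  annote = {first line",
         "    second line},",
         "  email = {me@example.org}",
         "}")
joined <- join_multiline_fields(bib)
```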

`as_data_frame()` was deprecated in tibble 2.0.0.

You probably have already seen this, but if not: I'm getting:
Warning messages:
1: as_data_frame() was deprecated in tibble 2.0.0.
ℹ Please use as_tibble() (with slightly different semantics) to convert to a tibble, or
as.data.frame() to convert to a data frame.
ℹ The deprecated feature was likely used in the bib2df package.
Please report the issue to the authors.
This warning is displayed once every 8 hours.
Call lifecycle::last_lifecycle_warnings() to see where this warning was generated.
2: In bib2df_tidy(bib, separate_names) : NAs introduced by coercion

Parsing .bib fails when field separator is on the next line

bib2df::bib2df() fails to load fields when the field separator (",") is preceded by a newline, as in the following example:

@article{SHBP
,title = "Efficient DC Analysis of RVJ Circuits for Moment and Derivative Commutations of Interconnect Networks"
,author = " S. H. Batterywala and H. Narayanan "
,journal = "12th International Conference on VLSI Design"
,pages = "169-174"
,year = 1999
}

reprex:

f <- tempfile()
download.file('https://www.ee.iitb.ac.in/~trivedi/LatexHelp/Docs/ref.bib', f)
bib2df::bib2df(f)

With version 1.1.1 it loads in new columns "X.≪fieldname≫":

# A tibble: 9 × 41
  CATEGORY    BIBTE…¹ ADDRESS ANNOTE AUTHOR BOOKT…² CHAPTER CROSS…³ EDITION EDITOR HOWPU…⁴ INSTI…⁵ JOURNAL KEY   MONTH NOTE  NUMBER ORGAN…⁶
  <chr>       <chr>   <chr>   <chr>  <list> <chr>   <chr>   <chr>   <chr>   <list> <chr>   <chr>   <chr>   <chr> <chr> <chr> <chr>  <chr>  
1 ARTICLE     SHBP    NA      NA     <chr>  NA      NA      NA      NA      <chr>  NA      NA      NA      NA    NA    NA    NA     NA     
2 ARTICLE     SIE     NA      NA     <chr>  NA      NA      NA      NA      <chr>  NA      NA      NA      NA    NA    NA    NA     NA     
3 BOOK        HN      NA      NA     <chr>  NA      NA      NA      NA      <chr>  NA      NA      NA      NA    NA    NA    NA     NA     
4 BOOK        DON     NA      NA     <chr>  NA      NA      NA      NA      <chr>  NA      NA      NA      NA    NA    NA    NA     NA     
5 MASTERSTHE… GAK     NA      NA     <chr>  NA      NA      NA      NA      <chr>  NA      NA      NA      NA    NA    NA    NA     NA     
6 MASTERSTHE… GT      NA      NA     <chr>  NA      NA      NA      NA      <chr>  NA      NA      NA      NA    NA    NA    NA     NA     
7 MASTERSTHE… NJB     NA      NA     <chr>  NA      NA      NA      NA      <chr>  NA      NA      NA      NA    NA    NA    NA     NA     
8 MANUAL      PVM     NA      NA     <chr>  NA      NA      NA      NA      <chr>  NA      NA      NA      NA    NA    NA    NA     NA     
9 MISC        PVMS    NA      NA     <chr>  NA      NA      NA      NA      <chr>  NA      NA      NA      NA    NA    NA    NA     NA     
# … with 23 more variables: PAGES <chr>, PUBLISHER <chr>, SCHOOL <chr>, SERIES <chr>, TITLE <chr>, TYPE <chr>, VOLUME <chr>, YEAR <chr>,
#   X.TITLE <chr>, X.AUTHOR <chr>, X.JOURNAL <chr>, X.PAGES <chr>, X.YEAR <chr>, X.VOLUME <chr>, X.NUMBER <chr>, X.PUBLISHER <chr>,
#   X.MONTH <chr>, X.SCHOOL <chr>, X.ORGANIZATION <chr>, X.ADDRESS <chr>, X.NOTE <chr>, X.KEY <chr>, X.HOWPUBLISHED <chr>, and abbreviated
#   variable names ¹BIBTEXKEY, ²BOOKTITLE, ³CROSSREF, ⁴HOWPUBLISHED, ⁵INSTITUTION, ⁶ORGANIZATION

With version 1.1.2 it doesn't load at all (all values are either NA, character(0) or an empty string):

# A tibble: 9 × 26
  CATEGORY    BIBTE…¹ ADDRESS ANNOTE AUTHOR BOOKT…² CHAPTER CROSS…³ EDITION EDITOR HOWPU…⁴ INSTI…⁵ JOURNAL KEY   MONTH NOTE  NUMBER ORGAN…⁶
  <chr>       <chr>   <chr>   <chr>  <list> <chr>   <chr>   <chr>   <chr>   <list> <chr>   <chr>   <chr>   <chr> <chr> <chr> <chr>  <chr>  
1 ARTICLE     SHBP    NA      NA     <chr>  NA      NA      NA      NA      <chr>  NA      NA      ""      NA    NA    NA    NA     NA     
2 ARTICLE     SIE     NA      NA     <chr>  NA      NA      NA      NA      <chr>  NA      NA      ""      NA    NA    NA    ""     NA     
3 BOOK        HN      NA      NA     <chr>  NA      NA      NA      NA      <chr>  NA      NA      NA      NA    NA    NA    NA     NA     
4 BOOK        DON     NA      NA     <chr>  NA      NA      NA      NA      <chr>  NA      NA      NA      NA    NA    NA    NA     NA     
5 MASTERSTHE… GAK     NA      NA     <chr>  NA      NA      NA      NA      <chr>  NA      NA      NA      NA    ""    NA    NA     NA     
6 MASTERSTHE… GT      NA      NA     <chr>  NA      NA      NA      NA      <chr>  NA      NA      NA      NA    ""    NA    NA     NA     
7 MASTERSTHE… NJB     NA      NA     <chr>  NA      NA      NA      NA      <chr>  NA      NA      NA      NA    ""    NA    NA     NA     
8 MANUAL      PVM     ""      NA     <chr>  NA      NA      NA      NA      <chr>  NA      NA      NA      NA    ""    ""    NA     ""     
9 MISC        PVMS    NA      NA     <chr>  NA      NA      NA      NA      <chr>  ""      NA      NA      ""    NA    NA    NA     NA     
# … with 8 more variables: PAGES <chr>, PUBLISHER <chr>, SCHOOL <chr>, SERIES <chr>, TITLE <chr>, TYPE <chr>, VOLUME <chr>, YEAR <chr>,
#   and abbreviated variable names ¹BIBTEXKEY, ²BOOKTITLE, ³CROSSREF, ⁴HOWPUBLISHED, ⁵INSTITUTION, ⁶ORGANIZATION

I am not sure how common this is (probably not at all), but this did happen on the first example .bib I found online and it seems like a basic parsing error.
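Until this is handled by the parser, one workaround is to normalise the file first by moving each leading separator to the end of the previous line. The sketch below is base R, and the function name normalize_leading_commas is an assumption for illustration.

```r
# Hypothetical pre-processing step: turn
#   @article{SHBP
#   ,title = "..."
# into the more common trailing-comma layout.
normalize_leading_commas <- function(lines) {
  lead <- grepl("^\\s*,", lines)                   # lines that start with ","
  lines[lead] <- sub("^\\s*,\\s*", "", lines[lead])
  prev <- which(lead) - 1                          # lines that now need a ","
  prev <- prev[prev >= 1]
  lines[prev] <- paste0(lines[prev], ",")
  lines
}

bib <- c("@article{SHBP",
         ",title = \"Efficient DC Analysis\"",
         ",year = 1999",
         "}")
fixed <- normalize_leading_commas(bib)
```

The result could be written to a temporary file with writeLines() and passed to bib2df::bib2df() from there.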

#ropensci: Remove `plyr` dependency

plyr is no longer actively developed, and dplyr is already imported and provides bind_rows(), so the plyr dependency can be dropped.

ropensci/software-review#124 (comment)

Another format in .bib

I encountered issues when parsing a file with entries like e.g.

@Article{RJournal:2011-1:Cook,
  author       = {Dianne Cook},
  title        = {Tips for Presenting Your Work},
  journal      = {The R Journal},
  year         = 2011,
  volume       = 3,
  number       = 1,
  pages        = {72--74},
  month        = jun,
  url          = {http://journal.r-project.org/archive/2011-1/RJournal_2011-1_Cook.pdf}
}

Fields like year, volume were NA in the final table.

I solved this rather inelegantly in https://github.com/masalmon/bib2df/commit/5bbf89d4c168eaddcc7c43ad7f3e300f9101400e (I wasn't able to find something with str_extract).

I guess my new code is not usable because I don't use str_extract (if I had I would have done a PR), do you have an idea how to solve this issue for all users?
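One regex-based possibility, sketched with base R's regexec() rather than str_extract: capture the key and the raw value while tolerating any amount of padding around "=", then strip outer delimiters only if they are present. Bare tokens such as 2011 or jun are kept as they are. The function name parse_field_line is hypothetical and this is not the package's code.

```r
# Hypothetical helper: parse one "key = value," line whose value may be
# braced, quoted, or bare (a number or a month macro).
parse_field_line <- function(line) {
  pattern <- "^\\s*([A-Za-z][A-Za-z0-9_-]*)\\s*=\\s*(.*?)\\s*,?\\s*$"
  m <- regmatches(line, regexec(pattern, line, perl = TRUE))[[1]]
  if (length(m) == 0) return(NULL)  # not a field line
  value <- m[3]
  # Strip one pair of outer braces or quotes; bare tokens stay as-is.
  if (grepl("^\\{.*\\}$", value) || grepl("^\".*\"$", value)) {
    value <- substr(value, 2, nchar(value) - 1)
  }
  list(key = toupper(m[2]), value = value)
}

parse_field_line("  year         = 2011,")
parse_field_line("  month        = jun,")
parse_field_line("  pages        = {72--74},")
```

The same pattern also accepts assignments with no whitespace at all around "=", as in author={...}.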

Problems parsing .bib from ORCID

When reading a .bib file exported from an ORCID profile (Export works), bib2df() has some problems parsing it.

The same file can be imported in zotero without problems.

See attached bib file: works_G.zip

bib2df and curly brackets

Hi,

I create my bibtex files from Zotero.
For some reason (that is unclear to me) Zotero puts all capitalized words in brackets in the title object.

When I use bib2df, if the last word in the title has brackets then the final brackets are ignored, which causes parsing problems later when I want to use df2bib (for example).

e.g.

bibtex:

@book{patz_grammar_2002,
	address = {Canberra},
	series = {Pacific linguistics},
	title = {A grammar of the {Kuku} {Yalanji} language of north {Queensland}},
	isbn = {978-0-85883-534-4},
	number = {527},
	publisher = {Research School of Pacific and Asian Studies, Australian National University},
	author = {Patz, Elisabeth},
	collaborator = {{Australian National University}},
	year = {2002},
	note = {OCLC: ocm51721900},
	keywords = {CS, PN5, kinbank, kuku1273},
	file = {Patz_2002_A grammar of the Kuku Yalanji language of north Queensland.pdf:files/2186/Patz_2002_A grammar of the Kuku Yalanji language of north Queensland.pdf:application/pdf}
}

and the title object after reading in with bib2df:

zotero_subset[1,]$TITLE
"A grammar of the {Kuku} {Yalanji} language of north {Queensland"

Here we see two end brackets have been removed from the title (rather than one).

And the bibtex object after using df2bib (with emphasis on the problem)

@Book{patz_grammar_2002,
  Address = {Canberra},
  Author = {Patz, Elisabeth},
  Note = {OCLC: ocm51721900},
  Number = {527},
  Publisher = {Research School of Pacific and Asian Studies, Australian National University},
  Series = {Pacific linguistics},
  Title = **{A grammar of the {Kuku} {Yalanji} language of north {Queensland},**
  Year = {2002},
  File = {Patz_2002_A grammar of the Kuku Yalanji language of north Queensland.pdf:files/2186/Patz_2002_A grammar of the Kuku Yalanji language of north Queensland.pdf:application/pdf},
  Isbn = {978-0-85883-534-4},
  Collaborator = {Australian National University},
  Keywords = {CS, PN5, kinbank, kuku1273},
  sourceid = {PN5}
}

From this, it looks like the problem lies in bib2df (rather than df2bib), in that it finds the end of the line as any number of end brackets (and removes them) rather than a single bracket.
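The suspected mechanism can be reproduced in isolation. I have not checked the regular expressions below against the package source. They are assumptions that merely reproduce the reported symptom and show the one-brace alternative.

```r
raw <- "{A grammar of the {Kuku} {Yalanji} language of north {Queensland}},"
inner <- sub("^\\{", "", raw)        # drop the opening delimiter

# Removing a RUN of closing brackets also eats the one that closes {Queensland}.
greedy <- sub("\\}+,?$", "", inner)

# Removing exactly ONE closing bracket keeps the inner pair balanced.
single <- sub("\\},?$", "", inner)

greedy
single
```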

field names in lower case

I find that bib2df does not correctly parse when field names are "title", "author" etc. Has anyone else faced this problem?

problem with whitespaces around =

I've discovered that when I have an entry like this:

@book{fassberg2019modern,
  title      = {Languages of the Eastern Section: Great Lakes to Indian Ocean},
author={Fassberg, Steven E},
  lgcode={west2763},
  hhtype={overview},
  pages={632652},
  year={2019},
  publisher={Routledge}
}

I get a table that looks like this from bib2df::bib2df()

CATEGORY	BIBTEXKEY	ADDRESS	ANNOTE	AUTHOR	BOOKTITLE	CHAPTER	CROSSREF	EDITION	EDITOR	HOWPUBLISHED	INSTITUTION	JOURNAL	KEY	MONTH	NOTE	NUMBER	ORGANIZATION	PAGES	PUBLISHER	SCHOOL	SERIES	TITLE	TYPE	VOLUME	YEAR	AUTHOR..FASSBERG.	LGCODE..WEST2763..	HHTYPE..OVERVIEW..	PAGES..632652..	YEAR..2019..	PUBLISHER..ROUTLEDGE.
BOOK	fassberg2019modern																					Languages of the Eastern Section: Great Lakes to Indian Ocean				Fassberg, Steven E	west2763	overview	632652	2019	Routledge

I've narrowed the problem down to the lack of whitespace before and after the equals sign in the field assignment. The workaround is easy (I just inserted whitespace before and after every equals sign that precedes a curly bracket), but it was a bit frustrating to debug. Can this be included in the documentation, or fixed?

#ropensci: Formatting and Code Conventions

Avoid line widths > 80 characters
spell check on the whole package
Commas should always have a space after, use lintr::lint_package()
Variables and function names should all be lowercase
Improve formatting of the bib2df-package help file
The title field in .Rd files should not end in a period
Update and improve formatting of NEWS.md

ropensci/software-review#124 (comment)

Package removed from CRAN

It looks like the package was removed from CRAN for some reason. Is there a plan to resubmit the package in the near future?

Update to Tibble Package

I receive the following warning indicating that the package is out of date:

Warning message:
`as_data_frame()` was deprecated in tibble 2.0.0.
Please use `as_tibble()` instead.
The signature and semantics have changed, see `?as_tibble`.

Everything appears to work, but a simple update should resolve the warning?

Problem parsing double quoted tokens

In the following BibTeX file:

@phdthesis{Yang2011Lalo,
    author = {Yang, Cathryn},
    address = {Bundoora},
    language = {English},
    school = {La Trobe University},
    shorttitle = {Lalo regional varieties},
    title = {Lalo regional varieties: {Phylogeny}, dialectometry and sociolinguistics},
    type = {{PhD} dissertation},
    year = {2011}
}

... the field type gets parsed to PhD} dissertation (i.e. the first curly brace protecting the casing in 'PhD' gets eaten).

The culprit is this statement in bib_gather. I'm not quite sure what this regex is doing, so I don't want to fiddle with it to fix it.

submitting bib2df to rOpenSci?

This is not academic spam, I promise. 😄

Have you considered submitting this package to rOpenSci onboarding process? More info here + I can answer any question.

In brief, onboarding is an open review process, often with two reviewers looking at the package according to the guidelines. I'm a co-editor now, so I might sound biased, but I submitted a few packages before that, learned a ton, and improved the packages. Onboarded packages then live in the ropensci organization on GitHub, but you remain the maintainer and keep admin rights to the repository. Your package seems to fit in the data extraction category.

And obviously no problem if you prefer not to submit bib2df!

Article TITLE truncated when parsed by bib2df

Hello -

When I parse the attached .bib file, the article TITLE is truncated. How can I import the entire TITLE?
Thanks!

monitoring_library.txt

library(bib2df)
library(dplyr)  # provides %>% and select()
bib <- bib2df("Desktop/monitoring_library.txt") %>%
  select(TITLE)

bib <- structure(list(TITLE = c("Exploring perspectives, preferences and needs of a", 
"Short-Term} Postpartum Blood Pressure {Self-Management} and", 
"Blood pressure after {PREeclampsia/HELLP} by {SELF} monitoring", 
"Pregnancy outcomes following home blood pressure monitoring in", 
"A randomised controlled trial of blood pressure self-monitoring"
)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"
))

spell check

Spell check the whole package + vignette

ropensci/software-review#124

Improper link in documentation?

Hi Philipp,
I was reading the documentation for df2bib this evening and think perhaps you have a mistake?

Line 3 of df2bib.R has:

#' @param x \code{tibble}, returned by \code{\link{df2bib}}.

Should it not be?

#' @param x \code{tibble}, returned by \code{\link{bib2df}}.

v 1.1.1 bib2df() loses 1st bib entry

I'm seeing the bib2df() function lose or skip the first BibTeX entry in the files I'm using (v1.1.1).

Example file attached (rename it to .bib); it has 100 records:
17 water and demand.txt

library(bib2df)
path2file <- "17 water and demand.bib"
bib <- bib2df(file=path2file, separate_names = FALSE)

nrow(bib)
[1] 99

#ropensci: Tests

Increase code coverage, especially for df2bib.R and bib2df_tidy.R.

ropensci/software-review#124 (comment)

Line widths

Avoid line widths > 80 characters.

ropensci/software-review#124

Error when reading bib file with one reference

When reading a bib file with a single reference, bib2df gives the error:

Error in x[, 1] : incorrect number of dimensions

I downloaded the file you share in the vignette: LiteratureOnCommonKnowledgeInGameTheory and erased all but the first reference.

As you will see, bib2df works fine with the complete file but fails with the single-reference file. I had the same problem with other .bib files created using rcrossref::cr_cn().

bib2df::bib2df("bib2df.bib") # Works
bib2df::bib2df("bib2df_single.bib") # Error in x[, 1] : incorrect number of dimensions

I attach both files: bib2df.zip
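The message `incorrect number of dimensions` is what base R reports when two-dimensional indexing is applied to an object that has no dimensions. A common way to end up there is that a one-row matrix is silently collapsed to a vector. Whether this is what happens inside bib2df is an assumption, and the data below is made up for the sketch.

```r
# A single parsed key/value pair stored as a 1 x 2 character matrix
# (illustrative data, not taken from the attached files).
pairs <- matrix(c("TITLE", "A title"), nrow = 1)

collapsed <- pairs[1, ]           # default drop = TRUE: plain character vector
kept <- pairs[1, , drop = FALSE]  # still a 1 x 2 matrix

# Two-dimensional indexing fails on the collapsed vector ...
failed <- inherits(try(collapsed[, 1], silent = TRUE), "try-error")

# ... but works when the matrix shape was preserved.
first_column <- kept[, 1]
print(failed)
print(first_column)
```

Passing `drop = FALSE` when subsetting is the usual guard against this class of single-row failures.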

Thanks!

Error "Invalid URL: File is not readable." when trying to read `.bib` file in a subfolder with name `www`

I happened to save a .bib file in a subfolder of my project called www. When trying to read it, the pattern www. is matched by www/ in line 6 of bib2df(), which makes it try to "GET" a remote file from a URL, e.g.:

bib2df::bib2df("www/a_bibliography.bib")
#> Error: Invalid URL: File is not readable.

Created on 2021-07-15 by the reprex package (v2.0.0)

Session info

sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 4.1.0 (2021-05-18)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  Spanish_Spain.1252          
#>  ctype    Spanish_Spain.1252          
#>  tz       Europe/Paris                
#>  date     2021-07-15                  
#> 
#> - Packages -------------------------------------------------------------------
#>  package      * version date       lib source        
#>  assertthat     0.2.1   2019-03-21 [1] CRAN (R 4.1.0)
#>  bib2df         1.1.1   2019-05-22 [1] CRAN (R 4.1.0)
#>  cli            3.0.0   2021-06-30 [1] CRAN (R 4.1.0)
#>  crayon         1.4.1   2021-02-08 [1] CRAN (R 4.1.0)
#>  curl           4.3.2   2021-06-23 [1] CRAN (R 4.1.0)
#>  DBI            1.1.1   2021-01-15 [1] CRAN (R 4.1.0)
#>  digest         0.6.27  2020-10-24 [1] CRAN (R 4.1.0)
#>  dplyr          1.0.7   2021-06-18 [1] CRAN (R 4.1.0)
#>  ellipsis       0.3.2   2021-04-29 [1] CRAN (R 4.1.0)
#>  evaluate       0.14    2019-05-28 [1] CRAN (R 4.1.0)
#>  fansi          0.5.0   2021-05-25 [1] CRAN (R 4.1.0)
#>  fs             1.5.0   2020-07-31 [1] CRAN (R 4.1.0)
#>  generics       0.1.0   2020-10-31 [1] CRAN (R 4.1.0)
#>  glue           1.4.2   2020-08-27 [1] CRAN (R 4.1.0)
#>  highr          0.9     2021-04-16 [1] CRAN (R 4.1.0)
#>  htmltools      0.5.1.1 2021-01-22 [1] CRAN (R 4.1.0)
#>  httr           1.4.2   2020-07-20 [1] CRAN (R 4.1.0)
#>  humaniformat   0.6.0   2016-04-24 [1] CRAN (R 4.1.0)
#>  knitr          1.33    2021-04-24 [1] CRAN (R 4.1.0)
#>  lifecycle      1.0.0   2021-02-15 [1] CRAN (R 4.1.0)
#>  magrittr       2.0.1   2020-11-17 [1] CRAN (R 4.1.0)
#>  pillar         1.6.1   2021-05-16 [1] CRAN (R 4.1.0)
#>  pkgconfig      2.0.3   2019-09-22 [1] CRAN (R 4.1.0)
#>  purrr          0.3.4   2020-04-17 [1] CRAN (R 4.1.0)
#>  R6             2.5.0   2020-10-28 [1] CRAN (R 4.1.0)
#>  Rcpp           1.0.7   2021-07-07 [1] CRAN (R 4.1.0)
#>  reprex         2.0.0   2021-04-02 [1] CRAN (R 4.1.0)
#>  rlang          0.4.11  2021-04-30 [1] CRAN (R 4.1.0)
#>  rmarkdown      2.9     2021-06-15 [1] CRAN (R 4.1.0)
#>  rstudioapi     0.13    2020-11-12 [1] CRAN (R 4.1.0)
#>  sessioninfo    1.1.1   2018-11-05 [1] CRAN (R 4.1.0)
#>  stringi        1.6.2   2021-05-17 [1] CRAN (R 4.1.0)
#>  stringr        1.4.0   2019-02-10 [1] CRAN (R 4.1.0)
#>  tibble         3.1.2   2021-05-16 [1] CRAN (R 4.1.0)
#>  tidyselect     1.1.1   2021-04-30 [1] CRAN (R 4.1.0)
#>  utf8           1.2.1   2021-03-12 [1] CRAN (R 4.1.0)
#>  vctrs          0.3.8   2021-04-29 [1] CRAN (R 4.1.0)
#>  withr          2.4.2   2021-04-18 [1] CRAN (R 4.1.0)
#>  xfun           0.24    2021-06-15 [1] CRAN (R 4.1.0)
#>  yaml           2.2.1   2020-02-01 [1] CRAN (R 4.1.0)
#> 
#> [1] C:/Users/Mori.P16/Documents/R/win-library/4.1
#> [2] C:/Program Files/R/R-4.1.0/library

I think this could be solved simply by changing the pattern in line 6 to "http://|https://|www\\." (with the backslash doubled, as an R string literal requires), so it only matches "www" followed by a literal dot instead of "www" followed by any character.
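The difference between the two patterns can be checked with `grepl()` alone. The "old" pattern below is the one implied by this report (an unescaped dot after `www`), and `www.example.org` is a made-up host used only for the comparison.

```r
local_path <- "www/a_bibliography.bib"
remote_like <- "www.example.org/refs.bib"  # hypothetical URL-like input

old_pattern <- "http://|https://|www."     # "." matches any character
new_pattern <- "http://|https://|www\\."   # "\\." matches a literal dot only

old_matches_local <- grepl(old_pattern, local_path)    # TRUE: treated as a URL
new_matches_local <- grepl(new_pattern, local_path)    # FALSE: treated as a file
new_matches_remote <- grepl(new_pattern, remote_like)  # TRUE: still a URL

print(c(old_matches_local, new_matches_local, new_matches_remote))
```

Anchoring the pattern to the start of the string (for example `"^(https?://|www\\.)"`) would be stricter still, since it would also ignore `www.` appearing later in a local path.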

Problems parsing .bib from Web of Science

I downloaded a bib file from Web of Science (savedrecs.zip) and there are multiple issues when reading it. The solution shown in #21 doesn't work here :(

Most of them seem to be related to what you, @ottlngr, mentioned in #21 (key-value pairs not separated by line breaks):

AUTHORS: The authors not in the first line are lost
ABSTRACT: Only the first line of the abstract is imported

But other issues seem to have a different cause:

A bunch of extra columns appear (for a simplified case, see [A] below)

[A] single_reference.zip
When reading this bib reference, the following lines of the abstract are creating new columns (the first-word of the line is the column title, and the text in the cell is whatever comes after the "="):

benefits and harms; n = 451) or non-evidence-based (e.g., relative risks
on benefits only; n = 446) patient information about a cancer screening
non-evidence-based patient information (n = 446), a mean of 33.1% of
whereas with evidence-based patient information (n = 451), only half as

So, the first of those creates a BENEFITS column with the text "451) or non-evidence-based (e.g., relative risks"
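To illustrate why these continuation lines get mistaken for fields, and one possible way to tell them apart, the sketch below treats a line as the start of a field only if a single field name is immediately followed by `=` and an opening delimiter. `is_field_start()` is a hypothetical helper, not bib2df code, and the first sample line is a made-up placeholder for the opening of the abstract.

```r
# Hypothetical check (not bib2df's implementation): a line starts a new
# field only if it begins with one field name directly followed by "="
# and an opening "{" or double quote.
is_field_start <- function(lines) {
  grepl("^\\s*[A-Za-z][A-Za-z0-9_-]*\\s*=\\s*[{\"]", lines)
}

wos_lines <- c(
  "Abstract = {{First line of the abstract",  # placeholder text
  "benefits and harms; n = 451) or non-evidence-based (e.g., relative risks",
  "on benefits only; n = 446) patient information about a cancer screening"
)

starts <- is_field_start(wos_lines)
print(starts)
```

Lines that are not field starts could then be appended to the value of the preceding field instead of being split on `=`. Bare values such as `year = 2011` are not covered by this pattern and would need a separate rule.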

Please, let me know if I can be of any help testing/debugging this.
