Comments (3)

GegznaV commented on June 16, 2024

I can still confirm that there is an encoding issue in bibtex::do_read_bib() and bibtex::read.bib() on Windows:

file <- "book.bib"
encoding <- "UTF-8"
out <- bibtex::do_read_bib(file, encoding = encoding, srcfile(file, encoding = encoding))
out[[1]]

##                                                      address 
##                                                      "Vilnius" 
##                                                         author 
##   "{\\v{C}}ekanavi{\\v{c}}ius, Vydas and Murauskas, Gediminas" 
##                                                          title 
##      "{Taikomoji regresinÄ— analizÄ— socialiniuose tyrimuose}" 

The contents of the "book.bib" file:

@book{Cekanavicius2014,
	address = {Vilnius},
	author = {{\v{C}}ekanavi{\v{c}}ius, Vydas and Murauskas, Gediminas},
	title = {{Taikomoji regresinė analizė socialiniuose tyrimuose}},
	year = {2014}
}

An RStudio project for further experimentation: bib-file--UTF-8--issue.zip

@romainfrancois This is quite an old issue. What can be done towards solving it? A fix would also resolve issues in packages that depend on bibtex, including ropensci/RefManageR#66 and crsh/citr#67.

mrustl commented on June 16, 2024

I have a similar problem with my bib file (kwb_dummy.txt) on Windows:

### Importing file with default encoding
bibtex::read.bib(file = "kwb_dummy.txt")

Grützmacher G, Kumar P, Rustler M, Hannappel S, Sauer U (2013). “Geogenic
groundwater contamination – definition, occurrence and relevance for drinking
water production.” _Zbl. Geol. Paläont. Teil I_, *1*(1), 69-75.

### Setting encoding to UTF-8 does not change result
bibtex::read.bib(file = "kwb_dummy.txt", encoding = "UTF-8")
Grützmacher G, Kumar P, Rustler M, Hannappel S, Sauer U (2013). “Geogenic
groundwater contamination – definition, occurrence and relevance for drinking
water production.” _Zbl. Geol. Paläont. Teil I_, *1*(1), 69-75.

### Correct import with readLines
readLines("kwb_dummy.txt", n = 3, encoding = "UTF-8")
[1] "@article{RN7335,"                                                                                                     
[2] "   author = {Grützmacher, Gesche and Kumar, P.J.Sajil and Rustler, Michael and Hannappel, Stephan and Sauer, U.},"    
[3] "   title = {Geogenic groundwater contamination – definition, occurrence and relevance for drinking water production},"
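
For reference, the "Ã¼" in the read.bib() output is what the UTF-8 bytes of "ü" look like when they are decoded as a single-byte Windows code page. A minimal base-R sketch of that effect follows; it uses latin1 as a stand-in for Windows-1252, which is an assumption for illustration (the two agree for these byte values), and it does not call bibtex at all:

```r
# "ü" (U+00FC) is stored in a UTF-8 file as the two bytes c3 bc.
utf8_bytes <- charToRaw(enc2utf8("\u00fc"))

# Reinterpret those same bytes as a single-byte encoding. latin1 is used
# here as a stand-in for Windows-1252 (assumption for illustration).
misread <- rawToChar(utf8_bytes)
Encoding(misread) <- "latin1"

# Converting to UTF-8 for display yields two characters, U+00C3 and U+00BC,
# which print as "Ã¼" instead of "ü".
enc2utf8(misread)
```

This suggests the bytes on disk are fine and only the interpretation of them goes wrong, which matches readLines() giving the correct result when told the file is UTF-8.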

### System
sessioninfo::session_info()
- Session info ----------------------------------------------------------------------------
 setting  value                       
 version  R version 3.5.1 (2018-07-02)
 os       Windows 7 x64 SP 1          
 system   x86_64, mingw32             
 ui       RStudio                     
 language (EN)                        
 collate  English_United Kingdom.1252 
 ctype    English_United Kingdom.1252 
 tz       Europe/Berlin               
 date     2018-12-11                  

- Packages --------------------------------------------------------------------------------
 package     * version date       lib source        
 assertthat    0.2.0   2017-04-11 [1] CRAN (R 3.5.0)
 bibtex        0.4.2   2017-06-30 [1] CRAN (R 3.5.1)
 cli           1.0.1   2018-09-25 [1] CRAN (R 3.5.1)
 crayon        1.3.4   2017-09-16 [1] CRAN (R 3.5.0)
 digest        0.6.18  2018-10-10 [1] CRAN (R 3.5.1)
 evaluate      0.12    2018-10-09 [1] CRAN (R 3.5.1)
 htmltools     0.3.6   2017-04-28 [1] CRAN (R 3.5.0)
 httr          1.3.1   2017-08-20 [1] CRAN (R 3.5.0)
 jsonlite      1.6     2018-12-07 [1] CRAN (R 3.5.1)
 knitr         1.20    2018-02-20 [1] CRAN (R 3.5.0)
 lubridate     1.7.4   2018-04-11 [1] CRAN (R 3.5.0)
 magrittr      1.5     2014-11-22 [1] CRAN (R 3.5.1)
 packrat       0.4.9-3 2018-06-01 [1] CRAN (R 3.5.1)
 plyr          1.8.4   2016-06-08 [1] CRAN (R 3.5.1)
 R6            2.3.0   2018-10-04 [1] CRAN (R 3.5.1)
 Rcpp          1.0.0   2018-11-07 [1] CRAN (R 3.5.0)
 RefManageR    1.2.0   2018-04-25 [1] CRAN (R 3.5.1)
 rmarkdown     1.11    2018-12-08 [1] CRAN (R 3.5.1)
 rstudioapi    0.8     2018-10-02 [1] CRAN (R 3.5.1)
 sessioninfo   1.1.0   2018-09-25 [1] CRAN (R 3.5.1)
 stringi       1.2.4   2018-07-20 [1] CRAN (R 3.5.1)
 stringr       1.3.1   2018-05-10 [1] CRAN (R 3.5.1)
 withr         2.1.2   2018-03-15 [1] CRAN (R 3.5.0)
 xml2          1.2.0   2018-01-24 [1] CRAN (R 3.5.1)

[1] C:/Users/mrustl.KWB/Documents/R/win-library/3.5
[2] C:/Program Files/R/R-3.5.1/library

hongyuanjia commented on June 16, 2024

Some findings on this:

bibtex::read.bib() can read bib files on Windows if they were written with the native.enc encoding:

Sys.setlocale(locale = "Chinese")
#> [1] "LC_COLLATE=Chinese (Simplified)_China.936;LC_CTYPE=Chinese (Simplified)_China.936;LC_MONETARY=Chinese (Simplified)_China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_China.936"

bib_text <-
"
@misc{text,
    title = {{你好}},
    author = {{你好}},
    year = 2020
}
"

# native encoding which is the default on Windows
options(encoding = "native.enc")
writeLines(bib_text, "native.enc.bib")

readLines("native.enc.bib")
# [1] ""                       "@misc{text,"
# [3] "    title = {{你好}},"  "    author = {{你好}},"
# [5] "    year = 2020"        "}"
# [7] ""

# default encoding option "unknown" which is equivalent to "native.enc"
bibtex::read.bib("native.enc.bib", encoding = "unknown") 
# 你好 (2020). "你好."

bibtex::read.bib() is not able to read bib files on Windows if bib files were written with UTF-8 encoding:

# UTF-8 encoding
# NOTE:
# the 'native.enc' encoding option is still necessary on Windows to ensure
# the file is written as UTF-8. useBytes should also be set to TRUE to prevent
# writeLines() from re-encoding the text in the file() connection
# See https://kevinushey.github.io/blog/2018/02/21/string-encoding-and-r/
# and https://github.com/yihui/xfun/blob/12e77f58cbee106bfdfb0b288282f47cbf537937/R/io.R#L32
options(encoding = 'native.enc')
writeLines(enc2utf8(bib_text), "utf8.bib", useBytes = TRUE)

readLines("utf8.bib", encoding = "UTF-8")
# [1] ""                           "    @misc{text,"
# [3] "        title = {{你好}},"  "        author = {{你好}},"
# [5] "        year = 2020"        "    }"
# [7] ""

bibtex::read.bib("utf8.bib", encoding = "UTF-8")
# 浣犲ソ (2020). "浣犲ソ

The issue here is that even when UTF-8 is selected as the encoding, bibtex::do_read_bib() still returns the parsed text marked as native encoded:

out_native.enc <- .External( "do_read_bib", file = "native.enc.bib", encoding = "unknown", srcfile = srcfile("native.enc.bib", "native.enc") )
out_native.enc
# [[1]]
#    title   author     year 
# "{你好}" "{你好}"   "2020" 
# attr(,"entry")
# [1] "misc"
# attr(,"key")
# [1] "text"
# 
# attr(,"include")
# character(0)
# attr(,"strings")
# named character(0)
# attr(,"preamble")
# character(0)

# native encoded which is expected
lapply(out_native.enc, Encoding)
# [[1]]
# [1] "unknown" "unknown" "unknown"
#

out_utf8 <- .External( "do_read_bib", file = "utf8.bib", encoding = "UTF-8", srcfile = srcfile("utf8.bib", "UTF-8") )
out_utf8
# [[1]]
#      title     author       year
# "{浣犲ソ}" "{浣犲ソ}"     "2020"
# attr(,"entry")
# [1] "misc"
# attr(,"key")
# [1] "text"
#
# attr(,"include")
# character(0)
# attr(,"strings")
# named character(0)
# attr(,"preamble")
# character(0)

# this is also native encoded
lapply(out_utf8, Encoding)
# [[1]]
# [1] "unknown" "unknown" "unknown"
#

Forcing the encoding to UTF-8 fixes this issue.

# change to UTF-8
lapply(out_utf8, `Encoding<-`, "UTF-8")
# [[1]]
#    title   author     year
# "{你好}" "{你好}"   "2020"
# attr(,"entry")
# [1] "misc"
# attr(,"key")
# [1] "text"
#
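
Note that lapply() returns a plain list, so the top-level attributes (include, strings, preamble) are not carried over in the output above. A minimal sketch that keeps them is shown here. mark_utf8() is a hypothetical helper name, not part of bibtex, and the parsed object is a hand-built stand-in for the .External("do_read_bib", ...) result so the snippet runs without the package:

```r
# Hypothetical helper (not part of bibtex): mark every field of every parsed
# entry as UTF-8 while keeping the attributes of the top-level list.
mark_utf8 <- function(entries) {
  marked <- lapply(entries, function(entry) {
    Encoding(entry) <- "UTF-8"  # only relabels the bytes, no conversion
    entry
  })
  # lapply() keeps names but drops other attributes, so copy them back
  attributes(marked) <- attributes(entries)
  marked
}

# Stand-in for the parser output: the UTF-8 bytes of "{你好}" with no
# encoding mark, as in out_utf8 above.
raw_field <- rawToChar(as.raw(c(0x7b, 0xe4, 0xbd, 0xa0, 0xe5, 0xa5, 0xbd, 0x7d)))
entry <- c(title = raw_field, author = raw_field, year = "2020")
attr(entry, "entry") <- "misc"
attr(entry, "key") <- "text"
parsed <- structure(list(entry), preamble = character(0))

fixed <- mark_utf8(parsed)

# The two non-ASCII fields now report "UTF-8"; the ASCII-only "2020"
# always reports "unknown".
Encoding(fixed[[1]])
```

This only works when the file really is UTF-8 on disk: Encoding<- changes the label on the bytes and never converts them.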

Since do_read_bib() is written in C, it is possible that the default encoding of the input stream is set to the "C" locale and falls back to the native encoding on Windows. Unfortunately I know little about C, so this is just my guess. It can be checked by changing the encoding option for do_read_bib(), which results in the same parsed text and encoding:

Encoding(.External( "do_read_bib", file = "native.enc.bib", encoding = "latin1", srcfile = srcfile("native.enc.bib", "native.enc"))[[1]])
# [1] "unknown" "unknown" "unknown"

In summary, on Windows it is better to always use native.enc. Downstream packages that use bibtex::do_read_bib(), such as RefManageR::ReadBib(), should set the default encoding to "unknown" instead of "UTF-8".
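
Until a fix lands, one possible user-side workaround along these lines is to copy a UTF-8 file into the native encoding before parsing it. The sketch below is only an illustration under that assumption: utf8_to_native_copy() is a hypothetical helper, not part of bibtex or RefManageR, and it is not necessarily what the PR will do.

```r
# Hypothetical helper (not part of bibtex): copy a UTF-8 encoded file to a
# temporary file in the session's native encoding and return the new path.
utf8_to_native_copy <- function(path) {
  # Mark the lines as UTF-8 on input so R knows how to convert them
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE)
  tmp <- tempfile(fileext = ".bib")
  # enc2native() converts to the native code page; characters that code page
  # cannot represent cannot be preserved by this step.
  writeLines(enc2native(lines), tmp)
  tmp
}

# Usage, assuming "utf8.bib" from the example above exists:
# bibtex::read.bib(utf8_to_native_copy("utf8.bib"), encoding = "unknown")
```

The limitation is the same as for native.enc in general: text outside the active Windows code page (for example Lithuanian letters under a 1252 locale) is lost, so this helps only when the native encoding covers the file's characters.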

I will send a PR to provide a possible fix on the R side.
