Comments (3)
I can still confirm that there is an encoding issue in bibtex::do_read_bib() and bibtex::read.bib() on Windows:
file <- "book.bib"
encoding <- "UTF-8"
out <- bibtex::do_read_bib(file, encoding = encoding, srcfile(file, encoding = encoding))
out[[1]]
## address
## "Vilnius"
## author
## "{\\v{C}}ekanavi{\\v{c}}ius, Vydas and Murauskas, Gediminas"
## title
## "{Taikomoji regresinÄ— analizÄ— socialiniuose tyrimuose}"
The contents of the "book.bib" file:
@book{Cekanavicius2014,
address = {Vilnius},
author = {{\v{C}}ekanavi{\v{c}}ius, Vydas and Murauskas, Gediminas},
title = {{Taikomoji regresinė analizė socialiniuose tyrimuose}},
year = {2014}
}
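The garbled title in the output above is consistent with UTF-8 bytes being interpreted as Windows-1252 (the `ctype` of many Western European Windows locales). The following is a minimal base-R sketch of that mechanism for the single character "ė"; it assumes the platform's `iconv()` knows the name "CP1252" and does not involve bibtex itself.

```r
# UTF-8 byte sequence for "ė" (U+0117), as stored in a UTF-8 .bib file.
utf8_bytes <- as.raw(c(0xC4, 0x97))
s <- rawToChar(utf8_bytes)

# Misreading: treat each byte as a Windows-1252 character and convert to UTF-8.
misread <- iconv(s, from = "CP1252", to = "UTF-8")

# Correct reading: declare the bytes as UTF-8 without changing them.
correct <- s
Encoding(correct) <- "UTF-8"

utf8ToInt(misread)  # 196 8212, i.e. U+00C4 followed by U+2014
utf8ToInt(correct)  # 279, i.e. U+0117
```

The two code points produced by the misreading are exactly the pair that replaces each "ė" in the title returned by do_read_bib().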
An RStudio project for further experimentation: bib-file--UTF-8--issue.zip
@romainfrancois This is quite an old issue. What can be done towards solving it? A fix would also resolve issues in packages that depend on bibtex, including ropensci/RefManageR#66 and crsh/citr#67.
I have a similar problem with my bib file (kwb_dummy.txt) on Windows:
### Importing file with default encoding
bibtex::read.bib(file = "kwb_dummy.txt")
Grützmacher G, Kumar P, Rustler M, Hannappel S, Sauer U (2013). “Geogenic
groundwater contamination – definition, occurrence and relevance for drinking
water production.” _Zbl. Geol. Paläont. Teil I_, *1*(1), 69-75.
### Setting encoding to UTF-8 does not change the result
bibtex::read.bib(file = "kwb_dummy.txt", encoding = "UTF-8")
Grützmacher G, Kumar P, Rustler M, Hannappel S, Sauer U (2013). “Geogenic
groundwater contamination – definition, occurrence and relevance for drinking
water production.” _Zbl. Geol. Paläont. Teil I_, *1*(1), 69-75.
### Correct import with readLines
readLines("kwb_dummy.txt", n = 3, encoding = "UTF-8")
[1] "@article{RN7335,"
[2] " author = {Grützmacher, Gesche and Kumar, P.J.Sajil and Rustler, Michael and Hannappel, Stephan and Sauer, U.},"
[3] " title = {Geogenic groundwater contamination – definition, occurrence and relevance for drinking water production},"
### System
sessioninfo::session_info()
- Session info ----------------------------------------------------------------------------
setting value
version R version 3.5.1 (2018-07-02)
os Windows 7 x64 SP 1
system x86_64, mingw32
ui RStudio
language (EN)
collate English_United Kingdom.1252
ctype English_United Kingdom.1252
tz Europe/Berlin
date 2018-12-11
- Packages --------------------------------------------------------------------------------
package * version date lib source
assertthat 0.2.0 2017-04-11 [1] CRAN (R 3.5.0)
bibtex 0.4.2 2017-06-30 [1] CRAN (R 3.5.1)
cli 1.0.1 2018-09-25 [1] CRAN (R 3.5.1)
crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.0)
digest 0.6.18 2018-10-10 [1] CRAN (R 3.5.1)
evaluate 0.12 2018-10-09 [1] CRAN (R 3.5.1)
htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.5.0)
httr 1.3.1 2017-08-20 [1] CRAN (R 3.5.0)
jsonlite 1.6 2018-12-07 [1] CRAN (R 3.5.1)
knitr 1.20 2018-02-20 [1] CRAN (R 3.5.0)
lubridate 1.7.4 2018-04-11 [1] CRAN (R 3.5.0)
magrittr 1.5 2014-11-22 [1] CRAN (R 3.5.1)
packrat 0.4.9-3 2018-06-01 [1] CRAN (R 3.5.1)
plyr 1.8.4 2016-06-08 [1] CRAN (R 3.5.1)
R6 2.3.0 2018-10-04 [1] CRAN (R 3.5.1)
Rcpp 1.0.0 2018-11-07 [1] CRAN (R 3.5.0)
RefManageR 1.2.0 2018-04-25 [1] CRAN (R 3.5.1)
rmarkdown 1.11 2018-12-08 [1] CRAN (R 3.5.1)
rstudioapi 0.8 2018-10-02 [1] CRAN (R 3.5.1)
sessioninfo 1.1.0 2018-09-25 [1] CRAN (R 3.5.1)
stringi 1.2.4 2018-07-20 [1] CRAN (R 3.5.1)
stringr 1.3.1 2018-05-10 [1] CRAN (R 3.5.1)
withr 2.1.2 2018-03-15 [1] CRAN (R 3.5.0)
xml2 1.2.0 2018-01-24 [1] CRAN (R 3.5.1)
[1] C:/Users/mrustl.KWB/Documents/R/win-library/3.5
[2] C:/Program Files/R/R-3.5.1/library
Some findings on this:
bibtex::read.bib() is able to read bib files on Windows if the bib files were written with the native.enc encoding:
Sys.setlocale(locale = "Chinese")
#> [1] "LC_COLLATE=Chinese (Simplified)_China.936;LC_CTYPE=Chinese (Simplified)_China.936;LC_MONETARY=Chinese (Simplified)_China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_China.936"
bib_text <-
"
@misc{text,
title = {{你好}},
author = {{你好}},
year = 2020
}
"
# native encoding which is the default on Windows
options(encoding = "native.enc")
writeLines(bib_text, "native.enc.bib")
readLines("native.enc.bib")
# [1] "" "@misc{text,"
# [3] " title = {{你好}}," " author = {{你好}},"
# [5] " year = 2020" "}"
# [7] ""
# default encoding option "unknown" which is equivalent to "native.enc"
bibtex::read.bib("native.enc.bib", encoding = "unknown")
# 你好 (2020). "你好."
bibtex::read.bib() is not able to read bib files on Windows if the bib files were written with UTF-8 encoding:
# UTF-8 encoding
# NOTE:
# The 'native.enc' encoding option is still necessary on Windows to ensure
# the file is written as UTF-8. useBytes should also be set to TRUE to prevent
# writeLines() from re-encoding the text in the file() connection.
# See https://kevinushey.github.io/blog/2018/02/21/string-encoding-and-r/
# and https://github.com/yihui/xfun/blob/12e77f58cbee106bfdfb0b288282f47cbf537937/R/io.R#L32
options(encoding = 'native.enc')
writeLines(enc2utf8(bib_text), "utf8.bib", useBytes = TRUE)
readLines("utf8.bib", encoding = "UTF-8")
# [1] "" " @misc{text,"
# [3] " title = {{你好}}," " author = {{你好}},"
# [5] " year = 2020" " }"
# [7] ""
bibtex::read.bib("utf8.bib", encoding = "UTF-8")
# 浣犲ソ (2020). "浣犲ソ
The issue here is that even when UTF-8 is selected as the encoding, bibtex::do_read_bib() still returns the parsed text marked as native-encoded:
out_native.enc <- .External( "do_read_bib", file = "native.enc.bib", encoding = "unknown", srcfile = srcfile("native.enc.bib", "native.enc") )
out_native.enc
# [[1]]
# title author year
# "{你好}" "{你好}" "2020"
# attr(,"entry")
# [1] "misc"
# attr(,"key")
# [1] "text"
#
# attr(,"include")
# character(0)
# attr(,"strings")
# named character(0)
# attr(,"preamble")
# character(0)
# native encoded which is expected
lapply(out_native.enc, Encoding)
# [[1]]
# [1] "unknown" "unknown" "unknown"
#
out_utf8 <- .External( "do_read_bib", file = "utf8.bib", encoding = "UTF-8", srcfile = srcfile("utf8.bib", "UTF-8") )
out_utf8
# [[1]]
# title author year
# "{浣犲ソ}" "{浣犲ソ}" "2020"
# attr(,"entry")
# [1] "misc"
# attr(,"key")
# [1] "text"
#
# attr(,"include")
# character(0)
# attr(,"strings")
# named character(0)
# attr(,"preamble")
# character(0)
# this is also native encoded
lapply(out_utf8, Encoding)
# [[1]]
# [1] "unknown" "unknown" "unknown"
#
Forcing the encoding to UTF-8 fixes this issue.
# change to UTF-8
lapply(out_utf8, `Encoding<-`, "UTF-8")
# [[1]]
# title author year
# "{你好}" "{你好}" "2020"
# attr(,"entry")
# [1] "misc"
# attr(,"key")
# [1] "text"
#
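Note that the plain lapply() call above returns a new list, so the list-level attributes ("include", "strings", "preamble") are not carried over. A minimal sketch of a helper that keeps them is shown below. `mark_utf8()` is a hypothetical name, not part of bibtex, and the `parsed` object is a hand-built stand-in for the structure printed earlier, so the sketch runs without the package.

```r
# Hypothetical helper (not part of bibtex): mark every string in a parsed
# result as UTF-8 without touching the bytes, keeping all attributes.
mark_utf8 <- function(parsed) {
  # `parsed[] <-` replaces the elements but keeps the list's own attributes.
  parsed[] <- lapply(parsed, function(entry) {
    Encoding(entry) <- "UTF-8"  # re-labels only; no re-encoding happens
    entry
  })
  parsed
}

# Stand-in for the do_read_bib() result: "{你好}" as raw UTF-8 bytes with
# encoding "unknown", which is what the parser was observed to return.
braced <- rawToChar(as.raw(c(0x7B, 0xE4, 0xBD, 0xA0, 0xE5, 0xA5, 0xBD, 0x7D)))
entry <- c(title = braced, author = braced, year = "2020")
attr(entry, "entry") <- "misc"
attr(entry, "key") <- "text"
parsed <- structure(list(entry), include = character(0),
                    strings = character(0), preamble = character(0))

fixed <- mark_utf8(parsed)
Encoding(fixed[[1]])  # the ASCII-only "2020" always reports "unknown"
```

This only relabels the strings, so it is appropriate only when the file really is UTF-8; applying it to a natively encoded file would produce invalid strings.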
Since do_read_bib() is written in C, it is possible that the default encoding of the input stream is set to the "C" locale and falls back to the native encoding on Windows. Unfortunately I know little about C, so this is just my guess. It is supported by the fact that changing the encoding option for do_read_bib() results in the same parsed text and encoding:
Encoding(.External( "do_read_bib", file = "native.enc.bib", encoding = "latin1", srcfile = srcfile("native.enc.bib", "native.enc"))[[1]])
# [1] "unknown" "unknown" "unknown"
So in summary, on Windows, it is better to always use native.enc. For downstream packages that use bibtex::do_read_bib(), such as RefManageR::ReadBib(), the default encoding should be set to unknown instead of UTF-8.
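Until a fix lands, one user-side workaround in the same spirit is to convert a UTF-8 bib file to the native encoding first and then read the copy with encoding = "unknown". The sketch below uses only base R; `utf8_bib_to_native()` is a hypothetical helper name, and characters that the native code page cannot represent are turned into `<U+XXXX>` escapes by enc2native(), so this is lossy for such files.

```r
# Hypothetical helper (not part of bibtex): copy a UTF-8 encoded .bib file
# into a temporary file written in the session's native encoding.
utf8_bib_to_native <- function(path, out = tempfile(fileext = ".bib")) {
  # Read the bytes as-is and mark the strings as UTF-8.
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE)
  # Convert to the native encoding; unrepresentable characters are escaped.
  native <- enc2native(lines)
  # useBytes = TRUE stops writeLines() from re-encoding the text again.
  writeLines(native, out, useBytes = TRUE)
  out
}

# Usage, assuming the bibtex package is installed:
# bibtex::read.bib(utf8_bib_to_native("utf8.bib"), encoding = "unknown")
```

On a system whose native encoding is already UTF-8 the copy is byte-identical to the input, so the helper is harmless there.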
I will send a PR to provide a possible fix on the R side.