I have difficulties reading some of my reference files. R markdown shows there is an e

UTF-8 encoding about bib2df HOT 7 OPEN

ropensci commented on September 27, 2024

UTF-8 encoding

from bib2df.

Comments (7)

mpr1255 commented on September 27, 2024

Second this! It would make it really really great. I work with a lot of Chinese-language bibliographical data, and df2bib always breaks the characters... I just spent 40 minutes trying to tinker with the code, randomly throwing some fields <- map(fields, ~as_utf8(.x)) just to see if it would do anything... No luck. I would love for this to work. Unfortunately I don't yet know what I don't know in order to try to help.

Note that when using the locale "chs", characters come out like this: ÄÏ¾©¹ÄÂ¥Ò½Ôº¸ÎÔàÒÆÖ²ÖÐÐÄ

When I use the "C" locale, they come out like this: <U+5357><U+4EAC><U+9F13><U+697C><U+533B> (and I can't really figure out how to turn them back into characters easily.)
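For the `<U+XXXX>` form shown above, one possible way back to real characters is to parse each code point and rebuild the string. This is only a base R sketch: `unescape_unicode` is a hypothetical helper name, not part of bib2df, and it assumes the escapes look exactly like `<U+5357>`.

```r
# Hypothetical helper (not part of bib2df): turn "<U+5357><U+4EAC>" style
# escapes back into the characters they stand for.
unescape_unicode <- function(x) {
  vapply(x, function(s) {
    if (is.na(s)) return(NA_character_)
    m <- gregexpr("<U\\+[0-9A-Fa-f]{4,6}>", s)
    codes <- regmatches(s, m)[[1]]
    if (length(codes) == 0) return(s)
    # Strip the "<U+" prefix and ">" suffix, then parse the hex code point.
    chars <- vapply(codes, function(code) {
      intToUtf8(strtoi(substr(code, 4, nchar(code) - 1), base = 16L))
    }, character(1))
    regmatches(s, m) <- list(chars)
    s
  }, character(1), USE.NAMES = FALSE)
}

restored <- unescape_unicode("<U+5357><U+4EAC><U+9F13><U+697C><U+533B>")
print(Encoding(restored))
print(utf8ToInt(restored))
```

In a non-UTF-8 locale the restored string may still print as escapes, even though it now holds the actual characters.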


GilmourR commented on September 27, 2024

My tests some time ago suggested the problem was in humaniformat rather than in bib2df itself.


mpr1255 commented on September 27, 2024

I see. Any idea how to fix it? Current workarounds are getting clunky, if they work at all...

On Feb 21, 2021, 16:01, Ross wrote: My tests some time ago, suggested the problem was in Humaniformat, rather than bib2df itself.


GilmourR commented on September 27, 2024

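I clean the file like this:
library(stringr)
library(bib2df)

# Read the .bib file and mark its contents as UTF-8
bsf.df <- readLines(filename.df, encoding = "UTF-8")
# Replace anything that is not a visible character with a space
bsf.df <- str_replace_all(bsf.df, "[^[:graph:]]", " ")
# Transliterate to ASCII
bsf.df <- iconv(bsf.df, from = "UTF-8", to = "ASCII//TRANSLIT")
outfile <- "bsfdf.bib"
writeLines(bsf.df, con = outfile)
Then:
# Read the cleaned file written above
bib2df(outfile, separate_names = TRUE)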
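Because transliterating to ASCII is lossy, and what happens to characters with no ASCII counterpart depends on the platform's iconv, it may help to see which lines would be affected before cleaning. The following is a minimal base R sketch; `non_ascii_lines` and the sample entry are made up for illustration.

```r
# Hypothetical helper: indices of lines that contain non-ASCII characters.
# iconv() returns NA for any element it cannot convert to plain ASCII.
non_ascii_lines <- function(lines) {
  which(is.na(iconv(lines, from = "UTF-8", to = "ASCII")))
}

# Made-up BibTeX entry used only for this illustration.
bib_lines <- c(
  "@book{example2020,",
  "  author = {M\u00fcller, Anna},",
  "  publisher = {\u5357\u4eac\u9f13\u697c\u533b\u9662},",
  "  year = {2020}",
  "}"
)

affected <- non_ascii_lines(bib_lines)
print(affected)
```

Lines reported here are the ones whose content will change, or be lost, when the file is transliterated.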


harkanatta commented on September 27, 2024

What worked for me was to make small changes in two functions of the package:
bib2df_read and bib2df_tidy
In the former function I set the encoding argument to UTF-8: readLines(file,encoding = "UTF-8")
In the latter function there are two lapply calls, and I added enc2native() %>% to both of them, like so:
bib$AUTHOR <- lapply(bib$AUTHOR, function(x) x %>% enc2native() %>% format_reverse() %>% format_period() %>% parse_names())

To do this you have to download the code and look for the functions in a folder named R. You will also need the function text_between_curly_brackets :
text_between_curly_brackets <- function(string) {
  # Position of the first "{" and of the last "}"
  min <- min(gregexpr("\\{", string)[[1]])
  max <- max(gregexpr("\\}", string)[[1]])
  content <- substring(string, min + 1, max - 1)
  return(content)
}
Hope this helps.
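As background to the `readLines(file, encoding = "UTF-8")` change: according to the R documentation, that argument only marks the strings as UTF-8 and does not re-encode the input; re-encoding is done by opening a connection with `file(path, encoding = "UTF-8")`. The sketch below shows both on a temporary file. The entry and file are made up, and what the second option returns depends on the session locale.

```r
tmp <- tempfile(fileext = ".bib")
entry <- "@book{example2020, publisher = {\u5357\u4eac}}"

# Write the entry as raw UTF-8 bytes, independent of the session locale.
writeBin(charToRaw(enc2utf8(paste0(entry, "\n"))), tmp)

# Option 1: read the bytes as they are and mark them as UTF-8.
marked <- readLines(tmp, encoding = "UTF-8")

# Option 2: let the connection re-encode from UTF-8 to the native encoding.
# In a locale that cannot represent the characters this may warn or lose text.
con <- file(tmp, encoding = "UTF-8")
converted <- readLines(con)
close(con)

print(Encoding(marked))
print(identical(marked, entry))
```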


harkanatta commented on September 27, 2024

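library(plyr)   # load plyr before dplyr so that dplyr's verbs are not masked
library(dplyr)
library(ggplot2)
library(tidyr)
library(humaniformat)
library(stringr)

# text_between_curly_brackets() from the previous comment must be defined before running this script.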

file <- "A:/path/to/file.bib"

bib2df_read <- function(file) {
  bib <- readLines(file, encoding = "UTF-8")
  bib <- str_replace_all(bib, "[^[:graph:]]", " ")
  return(bib)
}
bib2df_gather <- function(bib) {

  from <- which(str_extract(bib, "[:graph:]") == "@")
  to <- c(from[-1] - 1, length(bib))
  if (!length(from)) {
    return(empty)
  }
  itemslist <- mapply(
    function(x, y) return(bib[x:y]),
    x = from,
    y = to - 1,
    SIMPLIFY = FALSE
  )
  # The brace has to be escaped as \\{ inside an R string; a single \{ is a parse error.
  keys <- lapply(itemslist,
    function(x) {
      str_extract(x[1], "(?<=\\{)[^,]+")
    }
  )
  fields <- lapply(itemslist,
    function(x) {
      str_extract(x[1], "(?<=@)[^\\{]+")
    }
  )
  fields <- lapply(fields, toupper)

  categories <- lapply(itemslist,
    function(x) {
      str_extract(x, "[[:alnum:]_-]+")
    }
  )

  dupl <- sum(
    unlist(
      lapply(categories, function(x) sum(duplicated(x[!is.na(x)])))
    )
  )

  if (dupl > 0) {
    message("Some BibTeX entries may have been dropped.
The result could be malformed.
Review the .bib file and make sure every single entry starts
with a '@'.")
  }

  values <- lapply(itemslist,
    function(x) {
      str_extract(x, "(?<==).*")
    }
  )

  values <- lapply(values,
    function(x) {
      sapply(x, text_between_curly_brackets, simplify = TRUE, USE.NAMES = FALSE)
    }
  )

  values <- lapply(values, trimws)
  items <- mapply(cbind, categories, values, SIMPLIFY = FALSE)
  items <- lapply(items,
    function(x) {
      x <- cbind(toupper(x[, 1]), x[, 2])
    }
  )
  items <- lapply(items,
    function(x) {
      x[complete.cases(x), ]
    }
  )
  items <- mapply(function(x, y) {
      rbind(x, c("CATEGORY", y))
    },
    x = items, y = fields, SIMPLIFY = FALSE)

  items <- lapply(items, t)
  items <- lapply(items,
    function(x) {
      colnames(x) <- x[1, ]
      x <- x[-1, ]
      return(x)
    }
  )
  items <- lapply(items,
    function(x) {
      x <- t(x)
      x <- data.frame(x, stringsAsFactors = FALSE)
      return(x)
    }
  )
  dat <- bind_rows(c(list(empty), items))
  dat <- as_tibble(dat)
  dat$BIBTEXKEY <- unlist(keys)
  dat
}

empty <- data.frame(
  CATEGORY = character(0L),
  BIBTEXKEY = character(0L),
  ADDRESS = character(0L),
  ANNOTE = character(0L),
  AUTHOR = character(0L),
  BOOKTITLE = character(0L),
  CHAPTER = character(0L),
  CROSSREF = character(0L),
  EDITION = character(0L),
  EDITOR = character(0L),
  HOWPUBLISHED = character(0L),
  INSTITUTION = character(0L),
  JOURNAL = character(0L),
  KEY = character(0L),
  MONTH = character(0L),
  NOTE = character(0L),
  NUMBER = character(0L),
  ORGANIZATION = character(0L),
  PAGES = character(0L),
  PUBLISHER = character(0L),
  SCHOOL = character(0L),
  SERIES = character(0L),
  TITLE = character(0L),
  TYPE = character(0L),
  VOLUME = character(0L),
  YEAR = character(0L),
  stringsAsFactors = FALSE
)

bib2df_tidy <- function(bib, separate_names = FALSE) {

  if (dim(bib)[1] == 0) {
    return(bib)
  }

  AUTHOR <- EDITOR <- YEAR <- CATEGORY <- NULL
  if ("AUTHOR" %in% colnames(bib)) {
    bib <- bib %>%
      mutate(AUTHOR = strsplit(AUTHOR, " and ", fixed = TRUE))
    if (separate_names) {
      bib$AUTHOR <- lapply(bib$AUTHOR, function(x) x %>%
        enc2native() %>%
        format_reverse() %>%
        format_period() %>%
        parse_names())
    }
  }
  if ("EDITOR" %in% colnames(bib)) {
    bib <- bib %>%
      mutate(EDITOR = strsplit(EDITOR, " and ", fixed = TRUE))
    if (separate_names) {
      bib$EDITOR <- lapply(bib$EDITOR, function(x) x %>%
        enc2native() %>%
        format_reverse() %>%
        format_period() %>%
        parse_names())
    }
  }
  if ("YEAR" %in% colnames(bib)) {
    if (sum(is.na(as.numeric(bib$YEAR))) == 0) {
      bib <- bib %>%
        mutate(YEAR = as.numeric(YEAR))
    } else {
      message("Column YEAR contains character strings.
No coercion to numeric applied.")
    }
  }
  bib <- bib %>%
    select(CATEGORY, dplyr::everything())
  return(bib)
}

bib <- bib2df_read(file)
bib <- bib2df_gather(bib)
bib <- bib2df_tidy(bib, separate_names = TRUE)

bib %>%
  select(YEAR, AUTHOR) %>%
  unnest(cols = c(AUTHOR)) %>%
  ggplot() +
  aes(x = YEAR, y = reorder(full_name, desc(YEAR))) +
  geom_point()
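The entry-splitting idea in `bib2df_gather` (every entry starts at a line whose first visible character is `@`) can be tried in isolation with base R only. This is a simplified sketch, not the package's code: the sample entries are made up, and it keeps every line of each entry rather than using the `to - 1` bound from the script.

```r
# Made-up .bib content used only for this illustration.
bib <- c(
  "@book{nanjing2020,",
  "  title = {\u5357\u4eac},",
  "  year = {2020}",
  "}",
  "@article{smith2019,",
  "  author = {Smith, Jane},",
  "  year = {2019}",
  "}"
)

# A line starts a new entry when its first non-space character is "@".
starts <- grepl("^[[:space:]]*@", bib)

# cumsum() gives each line the running number of the entry it belongs to.
# Lines before the first "@" would land in a group numbered 0.
entries <- unname(split(bib, cumsum(starts)))

# The key is the text between the opening brace and the first comma.
keys <- vapply(
  entries,
  function(e) sub("^[[:space:]]*@[^{]+[{]([^,]+),.*$", "\\1", e[1]),
  character(1)
)
print(keys)
```

Running a step like this on a problem file can show whether the non-ASCII text is already damaged after reading, before any name parsing happens.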


HedvigS commented on September 27, 2024

I think that this isn't something that can be addressed in the bib2df package itself. I've made a PR with a warning message if the file isn't ASCII, UTF-8 or UTF-16 which should help users address this on their own.
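For readers who want a similar check in their own scripts, here is a rough base R sketch of the idea. It is not the code from the PR: `check_bib_encoding` is a hypothetical name, UTF-16 is only recognised by its byte order mark, and plain ASCII is reported together with UTF-8 because ASCII is valid UTF-8.

```r
# Hypothetical check (not the PR's implementation): guess whether a file is
# UTF-16 (by byte order mark) or valid UTF-8/ASCII, and warn otherwise.
check_bib_encoding <- function(path) {
  bom <- readBin(path, what = "raw", n = 2L)
  if (identical(bom, as.raw(c(0xff, 0xfe))) ||
      identical(bom, as.raw(c(0xfe, 0xff)))) {
    return("UTF-16")
  }
  lines <- readLines(path, warn = FALSE)
  if (all(validUTF8(lines))) {
    return("UTF-8 or ASCII")
  }
  warning("File does not look like ASCII, UTF-8 or UTF-16: ", path,
          call. = FALSE)
  "unknown"
}

# Usage on a temporary UTF-8 file.
tmp <- tempfile(fileext = ".bib")
writeBin(charToRaw(enc2utf8("title = {\u5357\u4eac}\n")), tmp)
print(check_bib_encoding(tmp))
```

A file that triggers the warning can then be converted with `iconv()` or re-saved as UTF-8 in an editor before it is passed to bib2df.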

