Comments (7)
Seconding this! It would be really great. I work with a lot of Chinese-language bibliographic data, and df2bib always breaks the characters. I just spent 40 minutes tinkering with the code, randomly adding things like fields <- map(fields, ~as_utf8(.x)) just to see if it would do anything. No luck. I would love for this to work. Unfortunately I don't yet know what I don't know, so I can't really try to help.
Note that when using the locale "chs", characters come out like this: ÄϾ©¹ÄÂ¥Ò½Ôº¸ÎÔàÒÆÖ²ÖÐÐÄ
When I use the "C" locale, they come out like this: <U+5357><U+4EAC><U+9F13><U+697C><U+533B> (and I can't really figure out how to turn them back into characters easily.)
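For the `<U+XXXX>` output under the "C" locale, one way to get the characters back is to parse the hex code points and rebuild the string. The sketch below uses only base R; `unescape_unicode` is a hypothetical helper name, not part of bib2df, and whether the result displays correctly still depends on the locale and font of the console.

```r
# Hypothetical helper (not part of bib2df): convert "<U+5357>"-style
# escapes, as printed by R in a non-UTF-8 locale, back into characters.
unescape_unicode <- function(x) {
  # Locate every <U+XXXX> token (4 to 6 hex digits) in each string.
  m <- gregexpr("<U\\+[0-9A-Fa-f]{4,6}>", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(tokens) {
    # Drop the leading "<U+" and the trailing ">" to keep the hex digits.
    hex <- substring(tokens, 4, nchar(tokens) - 1)
    # Hex digits -> integer code point -> UTF-8 character.
    vapply(strtoi(hex, 16L), intToUtf8, character(1))
  })
  x
}

escaped <- "<U+5357><U+4EAC><U+9F13><U+697C><U+533B>"
restored <- unescape_unicode(escaped)
utf8ToInt(restored)  # integer code points, one per restored character
```

This only undoes the display escaping. If the strings were already damaged on import (as in the "chs" output above), the original bytes have to be re-read with the right encoding instead.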
from bib2df.
My tests some time ago suggested that the problem was in humaniformat rather than in bib2df itself.
I clean the file like this:
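bsf.df <- readLines(filename.df, encoding = "UTF-8")
# Replace every non-printing character with a space
bsf.df <- str_replace_all(bsf.df, "[^[:graph:]]", " ")
# Transliterate to ASCII (non-Latin characters are lost in this step)
bsf.df <- iconv(bsf.df, from = "UTF-8", to = "ASCII//TRANSLIT")
outfile <- "bsfdf.bib"
writeLines(bsf.df, con = outfile)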
Then:
bib2df(outfile, separate_names = TRUE)
What worked for me was to make small changes in two functions of the package: bib2df_read and bib2df_tidy.
In the former function I set the encoding argument to UTF-8: readLines(file,encoding = "UTF-8")
In the latter function there are two lapply calls, and I added enc2native() %>% to both of them, like so:
bib$AUTHOR <- lapply(bib$AUTHOR, function(x) x %>%
  enc2native() %>%
  format_reverse() %>%
  format_period() %>%
  parse_names())
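To see what the `encoding = "UTF-8"` argument changes on its own, the sketch below writes two Chinese characters to a temporary file as UTF-8 bytes and reads them back. It uses only base R, and the temporary file is an assumption made for illustration. Note that `readLines()` does not re-encode its input: the argument only marks the resulting strings as UTF-8, so it helps only when the file really is UTF-8.

```r
# Write two CJK characters (U+5357, U+4EAC) to a temporary file as raw
# UTF-8 bytes, so the example does not depend on the session locale.
tmp <- tempfile(fileext = ".bib")
utf8_bytes <- as.raw(c(0xE5, 0x8D, 0x97, 0xE4, 0xBA, 0xAC, 0x0A))
writeBin(utf8_bytes, tmp)

# Without `encoding`, the string is read in the native encoding and its
# declared encoding can be inspected here.
unmarked <- readLines(tmp)
Encoding(unmarked)

# With `encoding = "UTF-8"`, the same bytes are marked as UTF-8.
marked <- readLines(tmp, encoding = "UTF-8")
Encoding(marked)

# The marked string decodes to the two expected code points.
utf8ToInt(marked)

unlink(tmp)
```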
To do this you have to download the code and look for the functions in a folder named R. You will also need the function text_between_curly_brackets:
text_between_curly_brackets <- function(string) {
  min <- min(gregexpr("\\{", string)[[1]])
  max <- max(gregexpr("\\}", string)[[1]])
  content <- substring(string, min + 1, max - 1)
  return(content)
}
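As a quick check of what this helper does, the sketch below repeats the function quoted above and applies it to two made-up field values. It keeps everything between the first `{` and the last `}`, so nested braces survive. The sample strings are assumptions for illustration, not taken from a real .bib file.

```r
# Same logic as the helper quoted above, with the local variables renamed
# so they do not shadow base::min() and base::max().
text_between_curly_brackets <- function(string) {
  first <- min(gregexpr("\\{", string)[[1]])
  last <- max(gregexpr("\\}", string)[[1]])
  substring(string, first + 1, last - 1)
}

# Made-up values, shaped like the text after "=" in a .bib field line.
plain_value <- text_between_curly_brackets(" {Nanjing Drum Tower Hospital},")
nested_value <- text_between_curly_brackets(" {The {R} Journal},")
print(plain_value)
print(nested_value)
```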
Hope this helps.
library(dplyr)
library(ggplot2)
library(tidyr)
library(humaniformat)
library(plyr)
library(stringr)
file <- "A:/path/to/file.bib"
bib2df_read <- function(file) {
  bib <- readLines(file, encoding = "UTF-8")
  bib <- str_replace_all(bib, "[^[:graph:]]", " ")
  return(bib)
}
bib2df_gather <- function(bib) {
  # Each entry starts on a line whose first visible character is "@"
  from <- which(str_extract(bib, "[:graph:]") == "@")
  to <- c(from[-1] - 1, length(bib))
  if (!length(from)) {
    return(empty)
  }
  itemslist <- mapply(
    function(x, y) return(bib[x:y]),
    x = from,
    y = to - 1,
    SIMPLIFY = FALSE
  )
  # BibTeX key: text between "{" and the first comma on the entry's first line
  keys <- lapply(itemslist,
    function(x) {
      str_extract(x[1], "(?<=\\{)[^,]+")
    }
  )
  # Entry type: text between "@" and "{"
  fields <- lapply(itemslist,
    function(x) {
      str_extract(x[1], "(?<=@)[^\\{]+")
    }
  )
  fields <- lapply(fields, toupper)
  categories <- lapply(itemslist,
    function(x) {
      str_extract(x, "[[:alnum:]_-]+")
    }
  )
  dupl <- sum(
    unlist(
      lapply(categories, function(x) sum(duplicated(x[!is.na(x)])))
    )
  )
  if (dupl > 0) {
    message(
      "Some BibTeX entries may have been dropped.\n",
      "The result could be malformed.\n",
      "Review the .bib file and make sure every single entry starts\n",
      "with a '@'."
    )
  }
  values <- lapply(itemslist,
    function(x) {
      str_extract(x, "(?<==).*")
    }
  )
  # text_between_curly_brackets() must already be defined (see earlier in this thread)
  values <- lapply(values,
    function(x) {
      sapply(x, text_between_curly_brackets, simplify = TRUE, USE.NAMES = FALSE)
    }
  )
  values <- lapply(values, trimws)
  items <- mapply(cbind, categories, values, SIMPLIFY = FALSE)
  items <- lapply(items,
    function(x) {
      x <- cbind(toupper(x[, 1]), x[, 2])
    }
  )
  items <- lapply(items,
    function(x) {
      x[complete.cases(x), ]
    }
  )
  items <- mapply(function(x, y) {
      rbind(x, c("CATEGORY", y))
    },
    x = items, y = fields, SIMPLIFY = FALSE)
  items <- lapply(items, t)
  items <- lapply(items,
    function(x) {
      colnames(x) <- x[1, ]
      x <- x[-1, ]
      return(x)
    }
  )
  items <- lapply(items,
    function(x) {
      x <- t(x)
      x <- data.frame(x, stringsAsFactors = FALSE)
      return(x)
    }
  )
  dat <- bind_rows(c(list(empty), items))
  dat <- as_tibble(dat)
  dat$BIBTEXKEY <- unlist(keys)
  dat
}
empty <- data.frame(
  CATEGORY = character(0L),
  BIBTEXKEY = character(0L),
  ADDRESS = character(0L),
  ANNOTE = character(0L),
  AUTHOR = character(0L),
  BOOKTITLE = character(0L),
  CHAPTER = character(0L),
  CROSSREF = character(0L),
  EDITION = character(0L),
  EDITOR = character(0L),
  HOWPUBLISHED = character(0L),
  INSTITUTION = character(0L),
  JOURNAL = character(0L),
  KEY = character(0L),
  MONTH = character(0L),
  NOTE = character(0L),
  NUMBER = character(0L),
  ORGANIZATION = character(0L),
  PAGES = character(0L),
  PUBLISHER = character(0L),
  SCHOOL = character(0L),
  SERIES = character(0L),
  TITLE = character(0L),
  TYPE = character(0L),
  VOLUME = character(0L),
  YEAR = character(0L),
  stringsAsFactors = FALSE
)
bib2df_tidy <- function(bib, separate_names = FALSE) {
  if (dim(bib)[1] == 0) {
    return(bib)
  }
  AUTHOR <- EDITOR <- YEAR <- CATEGORY <- NULL
  if ("AUTHOR" %in% colnames(bib)) {
    bib <- bib %>%
      mutate(AUTHOR = strsplit(AUTHOR, " and ", fixed = TRUE))
    if (separate_names) {
      bib$AUTHOR <- lapply(bib$AUTHOR, function(x) x %>%
        enc2native() %>%
        format_reverse() %>%
        format_period() %>%
        parse_names())
    }
  }
  if ("EDITOR" %in% colnames(bib)) {
    bib <- bib %>%
      mutate(EDITOR = strsplit(EDITOR, " and ", fixed = TRUE))
    if (separate_names) {
      bib$EDITOR <- lapply(bib$EDITOR, function(x) x %>%
        enc2native() %>%
        format_reverse() %>%
        format_period() %>%
        parse_names())
    }
  }
  if ("YEAR" %in% colnames(bib)) {
    if (sum(is.na(as.numeric(bib$YEAR))) == 0) {
      bib <- bib %>%
        mutate(YEAR = as.numeric(YEAR))
    } else {
      message(
        "Column YEAR\n",
        "contains character strings.\n",
        "No coercion to numeric applied."
      )
    }
  }
  bib <- bib %>%
    select(CATEGORY, dplyr::everything())
  return(bib)
}
bib <- bib2df_read(file)
bib <- bib2df_gather(bib)
bib <- bib2df_tidy(bib,separate_names = TRUE)
bib %>%
  select(YEAR, AUTHOR) %>%
  unnest(cols = c(AUTHOR)) %>%
  ggplot() +
  aes(x = YEAR, y = reorder(full_name, desc(YEAR))) +
  geom_point()
I think that this isn't something that can be addressed in the bib2df package itself. I've made a PR with a warning message if the file isn't ASCII, UTF-8 or UTF-16 which should help users address this on their own.
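A similar pre-flight check can also be run by hand before calling bib2df(). The sketch below is not the code from the PR; it is a minimal base-R approximation that only tests whether every line is valid UTF-8 (ASCII is a subset of UTF-8, and UTF-16 is not covered). `check_bib_encoding` is a hypothetical name, and the two temporary files are made up for the demo.

```r
# Hypothetical helper (not the PR's implementation): warn when a file
# contains byte sequences that are not valid UTF-8.
check_bib_encoding <- function(path) {
  lines <- readLines(path, warn = FALSE)
  ok <- all(validUTF8(lines))
  if (!ok) {
    warning("File does not look like UTF-8; consider converting it first.")
  }
  invisible(ok)
}

# Demo with two temporary files: valid UTF-8 bytes vs. Latin-1 bytes.
good <- tempfile(fileext = ".bib")
bad <- tempfile(fileext = ".bib")
writeBin(as.raw(c(0xE5, 0x8D, 0x97, 0x0A)), good)        # U+5357 in UTF-8
writeBin(as.raw(c(0x63, 0x61, 0x66, 0xE9, 0x0A)), bad)   # "caf" + 0xE9 (Latin-1)

good_ok <- check_bib_encoding(good)
bad_ok <- suppressWarnings(check_bib_encoding(bad))
print(c(good = good_ok, bad = bad_ok))
```

If the check fails, converting the file first (for example with iconv() and the file's actual source encoding) is one option before parsing.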