Comments (7)

mpr1255 avatar mpr1255 commented on September 27, 2024

Second this! It would make it really really great. I work with a lot of Chinese-language bibliographical data, and df2bib always breaks the characters... I just spent 40 minutes trying to tinker with the code, randomly throwing some fields <- map(fields, ~as_utf8(.x)) just to see if it would do anything... No luck. I would love for this to work. Unfortunately I don't yet know what I don't know in order to try to help.

Note that when using the locale "chs", characters come out like this: ÄϾ©¹ÄÂ¥Ò½Ôº¸ÎÔàÒÆÖ²ÖÐÐÄ

When I use the "C" locale, they come out like this: <U+5357><U+4EAC><U+9F13><U+697C><U+533B> (and I can't really figure out how to turn them back into characters easily.)
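The <U+XXXX> form is typically how R writes out characters it cannot represent in the current locale, so the code points themselves are still recoverable. A minimal sketch of turning them back into characters, assuming the strings contain the literal text <U+XXXX>; unescape_u_plus is a hypothetical helper name, not part of bib2df:

```r
# Hypothetical helper (not part of bib2df): convert literal "<U+XXXX>"
# escapes back into the UTF-8 characters they stand for.
unescape_u_plus <- function(x) {
  vapply(x, function(s) {
    if (is.na(s)) return(NA_character_)
    # Locate every "<U+XXXX>" escape (4 to 6 hex digits).
    m <- gregexpr("<U\\+[0-9A-Fa-f]{4,6}>", s)
    codes <- regmatches(s, m)[[1]]
    if (length(codes) == 0L) return(s)
    # Strip the "<U+" prefix and ">" suffix, leaving only the hex digits.
    hex <- gsub("^<U\\+|>$", "", codes)
    # Turn each hex code point into a one-character UTF-8 string.
    chars <- vapply(hex, function(h) intToUtf8(strtoi(h, 16L)),
                    character(1), USE.NAMES = FALSE)
    # Substitute the decoded characters for the escapes, in order.
    regmatches(s, m) <- list(chars)
    s
  }, character(1), USE.NAMES = FALSE)
}

decoded <- unescape_u_plus("<U+5357><U+4EAC><U+9F13><U+697C><U+533B>")
```

Whether the decoded text then prints as characters still depends on the console and locale; in a "C" locale R may escape it again on output even though the string itself is correct.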

from bib2df.

GilmourR avatar GilmourR commented on September 27, 2024

My tests some time ago suggested the problem was in humaniformat rather than in bib2df itself.

mpr1255 avatar mpr1255 commented on September 27, 2024

GilmourR avatar GilmourR commented on September 27, 2024

I clean the file like this:
bsf.df <- readLines(filename.df, encoding = "UTF-8")
bsf.df <- str_replace_all(bsf.df, "[^[:graph:]]", " ")
bsf.df <- iconv(bsf.df, from = "UTF-8", to = "ASCII//TRANSLIT")
outfile <- "bsfdf.bib"
writeLines(bsf.df, con = outfile)
Then I read the cleaned file:
bib2df(outfile, separate_names = TRUE)

harkanatta avatar harkanatta commented on September 27, 2024

What worked for me was to make small changes in two functions of the package: bib2df_read and bib2df_tidy.
In the former, I set the encoding argument to UTF-8: readLines(file, encoding = "UTF-8")
In the latter, there are two lapply calls, and I added enc2native() %>% to both of them, like so:
bib$AUTHOR <- lapply(bib$AUTHOR, function(x) x %>% enc2native() %>% format_reverse() %>% format_period() %>% parse_names())

To do this you have to download the source code and look for the functions in the folder named R. You will also need the function text_between_curly_brackets:
text_between_curly_brackets <- function(string) {
  min <- min(gregexpr("\\{", string)[[1]])
  max <- max(gregexpr("\\}", string)[[1]])
  content <- substring(string, min + 1, max - 1)
  return(content)
}
Hope this helps.

harkanatta avatar harkanatta commented on September 27, 2024

library(dplyr)
library(ggplot2)
library(tidyr)
library(humaniformat)
library(plyr)
library(stringr)

file <- "A:/path/to/file.bib"

bib2df_read <- function(file) {
  # Read the file as UTF-8 and replace non-printing characters with spaces.
  bib <- readLines(file, encoding = "UTF-8")
  bib <- str_replace_all(bib, "[^[:graph:]]", " ")
  return(bib)
}

bib2df_gather <- function(bib) {
  # Every entry starts on a line whose first visible character is "@".
  from <- which(str_extract(bib, "[:graph:]") == "@")
  to <- c(from[-1] - 1, length(bib))
  if (!length(from)) {
    return(empty)
  }
  itemslist <- mapply(
    function(x, y) return(bib[x:y]),
    x = from,
    y = to - 1,
    SIMPLIFY = FALSE
  )
  # The BibTeX key is the text between "{" and the first comma.
  keys <- lapply(itemslist,
    function(x) {
      str_extract(x[1], "(?<=\\{)[^,]+")
    }
  )
  # The entry type (ARTICLE, BOOK, ...) is the text between "@" and "{".
  fields <- lapply(itemslist,
    function(x) {
      str_extract(x[1], "(?<=@)[^\\{]+")
    }
  )
  fields <- lapply(fields, toupper)

  categories <- lapply(itemslist,
    function(x) {
      str_extract(x, "[[:alnum:]_-]+")
    }
  )

  dupl <- sum(
    unlist(
      lapply(categories, function(x) sum(duplicated(x[!is.na(x)])))
    )
  )

  if (dupl > 0) {
    message(
      "Some BibTeX entries may have been dropped.\n",
      "The result could be malformed.\n",
      "Review the .bib file and make sure every single entry starts\n",
      "with a '@'."
    )
  }

  values <- lapply(itemslist,
    function(x) {
      str_extract(x, "(?<==).*")
    }
  )

  # text_between_curly_brackets() is the helper given in the previous comment.
  values <- lapply(values,
    function(x) {
      sapply(x, text_between_curly_brackets, simplify = TRUE, USE.NAMES = FALSE)
    }
  )

  values <- lapply(values, trimws)
  items <- mapply(cbind, categories, values, SIMPLIFY = FALSE)
  items <- lapply(items,
    function(x) {
      x <- cbind(toupper(x[, 1]), x[, 2])
    }
  )
  items <- lapply(items,
    function(x) {
      x[complete.cases(x), ]
    }
  )
  items <- mapply(function(x, y) {
      rbind(x, c("CATEGORY", y))
    },
    x = items, y = fields, SIMPLIFY = FALSE)

  # Turn each entry into a one-row data frame with field names as columns.
  items <- lapply(items, t)
  items <- lapply(items,
    function(x) {
      colnames(x) <- x[1, ]
      x <- x[-1, ]
      return(x)
    }
  )
  items <- lapply(items,
    function(x) {
      x <- t(x)
      x <- data.frame(x, stringsAsFactors = FALSE)
      return(x)
    }
  )
  dat <- bind_rows(c(list(empty), items))
  dat <- as_tibble(dat)
  dat$BIBTEXKEY <- unlist(keys)
  dat
}

empty <- data.frame(
  CATEGORY = character(0L),
  BIBTEXKEY = character(0L),
  ADDRESS = character(0L),
  ANNOTE = character(0L),
  AUTHOR = character(0L),
  BOOKTITLE = character(0L),
  CHAPTER = character(0L),
  CROSSREF = character(0L),
  EDITION = character(0L),
  EDITOR = character(0L),
  HOWPUBLISHED = character(0L),
  INSTITUTION = character(0L),
  JOURNAL = character(0L),
  KEY = character(0L),
  MONTH = character(0L),
  NOTE = character(0L),
  NUMBER = character(0L),
  ORGANIZATION = character(0L),
  PAGES = character(0L),
  PUBLISHER = character(0L),
  SCHOOL = character(0L),
  SERIES = character(0L),
  TITLE = character(0L),
  TYPE = character(0L),
  VOLUME = character(0L),
  YEAR = character(0L),
  stringsAsFactors = FALSE
)

bib2df_tidy <- function(bib, separate_names = FALSE) {

  if (dim(bib)[1] == 0) {
    return(bib)
  }

  AUTHOR <- EDITOR <- YEAR <- CATEGORY <- NULL
  if ("AUTHOR" %in% colnames(bib)) {
    bib <- bib %>%
      mutate(AUTHOR = strsplit(AUTHOR, " and ", fixed = TRUE))
    if (separate_names) {
      # enc2native() is the added step; the rest is the original pipeline.
      bib$AUTHOR <- lapply(bib$AUTHOR, function(x) x %>%
        enc2native() %>%
        format_reverse() %>%
        format_period() %>%
        parse_names())
    }
  }
  if ("EDITOR" %in% colnames(bib)) {
    bib <- bib %>%
      mutate(EDITOR = strsplit(EDITOR, " and ", fixed = TRUE))
    if (separate_names) {
      bib$EDITOR <- lapply(bib$EDITOR, function(x) x %>%
        enc2native() %>%
        format_reverse() %>%
        format_period() %>%
        parse_names())
    }
  }
  if ("YEAR" %in% colnames(bib)) {
    if (sum(is.na(as.numeric(bib$YEAR))) == 0) {
      bib <- bib %>%
        mutate(YEAR = as.numeric(YEAR))
    } else {
      message(
        "Column YEAR contains character strings.\n",
        "No coercion to numeric applied."
      )
    }
  }
  bib <- bib %>%
    select(CATEGORY, dplyr::everything())
  return(bib)
}

bib <- bib2df_read(file)
bib <- bib2df_gather(bib)
bib <- bib2df_tidy(bib,separate_names = TRUE)

bib %>%
  select(YEAR, AUTHOR) %>%
  unnest(cols = c(AUTHOR)) %>%
  ggplot() +
  aes(x = YEAR, y = reorder(full_name, desc(YEAR))) +
  geom_point()

HedvigS avatar HedvigS commented on September 27, 2024

I think that this isn't something that can be addressed in the bib2df package itself. I've made a PR that adds a warning message if the file isn't ASCII, UTF-8, or UTF-16, which should help users address this on their own.
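Until such a warning is available, a similar check can be run by hand before calling bib2df. The following is only a sketch of the idea, not the code from the PR: is_valid_utf8_file is a hypothetical helper name, and it only distinguishes ASCII/UTF-8 from everything else (it does not detect UTF-16).

```r
# Hypothetical helper (not the PR's implementation): returns TRUE if every
# line of the file is valid UTF-8 (plain ASCII counts as valid UTF-8),
# otherwise warns and returns FALSE.
is_valid_utf8_file <- function(path) {
  lines <- readLines(path, warn = FALSE)
  bad <- which(!validUTF8(lines))
  if (length(bad) > 0L) {
    warning(
      sprintf(
        "%s: %d line(s) are not valid UTF-8 (first at line %d); consider re-encoding the file.",
        basename(path), length(bad), bad[1]
      ),
      call. = FALSE
    )
    return(FALSE)
  }
  TRUE
}

# Example: one UTF-8 file and one with a stray Latin-1 byte (0xE9).
good <- tempfile(fileext = ".bib")
writeBin(charToRaw("@book{k,\n  title = {\u5357\u4eac}\n}\n"), good)

bad <- tempfile(fileext = ".bib")
writeBin(c(charToRaw("title = {caf"), as.raw(0xe9), charToRaw("}\n")), bad)

good_ok <- is_valid_utf8_file(good)
bad_ok <- suppressWarnings(is_valid_utf8_file(bad))
```

If the check fails and the original encoding is known, converting the lines with iconv() to UTF-8 and writing them to a new file before calling bib2df is one way to proceed.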
