Comments (7)
from googlelanguager.
library(googleAuthR)
library(httr)
library(RCurl)
library(rjson)
service_token <- gar_auth_service(json_file = jsonfile,
                                  scope = "https://www.googleapis.com/auth/cloud-platform")
text <- "Ich würde die Mäuse mit Käse füttern"
text <- iconv(text, "latin1", "UTF-8")
Encoding(text)
[1] "UTF-8"
nchar(text)
[1] 36
baseurl <- "https://language.googleapis.com/v1beta1/documents:analyzeSyntax?key="
outJSON <- getURL(paste0(baseurl, googleAPI),
                  .opts = curlOptions(
                    postfields = toJSON(list(
                      document = list(type = "PLAIN_TEXT",
                                      language = "DE",
                                      content = text),
                      encodingType = "UTF8")),
                    httpheader = c("Content-Type" = "application/json",
                                   Authorization = service_token)))
tokens <- fromJSON(outJSON)[[2]]
tokens <- do.call("rbind", lapply(tokens, function(x) data.frame(t(unlist(x)), stringsAsFactors = FALSE)))
tokens[,c(1:3)]
  text.content text.beginOffset partOfSpeech.tag
1          Ich                0             PRON
2        würde                4             VERB
3          die               11              DET
4        Mäuse               15             NOUN
5          mit               22              ADP
6         Käse               26             NOUN
7      füttern               32             VERB
The offset is correct for the first and second tokens, but the third token ("die") should have offset 10, "Mäuse" should be 14, "mit" should be 20, and so on. Every umlaut adds one extra count to all following offsets. The same problem occurs for the NER (entities) analysis. My locale should not be the problem: Norwegian accepts umlauts, and if I change the locale to German I get the same issue. Part of the code is adapted from a blog post (I can't find it again); might it have been a blog of yours?
Thanks a lot for your help!
R version 3.4.4 (2018-03-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=Norwegian (Bokmål)_Norway.1252 LC_CTYPE=Norwegian (Bokmål)_Norway.1252 LC_MONETARY=Norwegian (Bokmål)_Norway.1252
[4] LC_NUMERIC=C LC_TIME=Norwegian (Bokmål)_Norway.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] coreNLP_0.4-2 dplyr_0.7.4 stringr_1.2.0 wordcloud_2.6 RColorBrewer_1.1-2 readr_1.1.1 rjson_0.2.15
[8] RCurl_1.95-4.8 bitops_1.0-6 httr_1.3.1 googleAuthR_0.7.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.19 bindr_0.1 magrittr_1.5 hms_0.3 R6_2.2.2 rlang_0.3.4 tools_3.4.4 openssl_0.9.6
[9] yaml_2.1.16 assertthat_0.2.0 digest_0.6.12 tibble_1.3.4 bindrcpp_0.2 rJava_0.9-11 curl_2.8.1 glue_1.1.1
[17] memoise_1.1.0 stringi_1.4.3 compiler_3.4.4 XML_3.98-1.19 jsonlite_1.5 pkgconfig_2.0.1
from googlelanguager.
The package does replicate your problem, so perhaps it's an issue with the API response itself rather than with your code, since the encoding and response parsing you do are slightly different from the package's method. For example:
library(googleLanguageR)
nlp <- gl_nlp("Ich würde die Mäuse mit Käse füttern", language = "de")
nlp$tokens[[1]][, c("content","beginOffset")]
  content beginOffset
1     Ich           0
2   würde           4
3     die          11
4   Mäuse          15
5     mit          22
6    Käse          26
7 füttern          32
So it does look like a bug, perhaps worth reporting to the Google API team.
from googlelanguager.
I think I get the right offsets when I use "UTF16" encoding though?
nlp <- gl_nlp("Ich würde die Mäuse mit Käse füttern",
              language = "de", encodingType = "UTF16")
#>2019-05-07 10:34:39 -- annotateText: 36 characters
nlp$tokens[[1]][, c("content","beginOffset")]
  content beginOffset
1     Ich           0
2   würde           4
3     die          10
4   Mäuse          14
5     mit          20
6    Käse          24
7 füttern          29
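A quick way to check those UTF16 offsets locally, as a minimal sketch: the offsets and tokens below are copied by hand from the output above, and the variable names are made up for illustration. It assumes the text contains only characters from the Basic Multilingual Plane (true for German umlauts), where one UTF-16 code unit corresponds to one R character; characters such as emoji take two code units, so this check would not carry over to them.

```r
# Sentence written with \u escapes so the example does not depend on the
# encoding of the source file or the session locale.
text <- "Ich w\u00fcrde die M\u00e4use mit K\u00e4se f\u00fcttern"

# Values copied from the UTF16 response shown above (zero-based offsets).
tokens  <- c("Ich", "w\u00fcrde", "die", "M\u00e4use", "mit", "K\u00e4se", "f\u00fcttern")
offsets <- c(0, 4, 10, 14, 20, 24, 29)

# R strings are one-based, so shift the offsets by one before slicing.
extracted <- substring(text, offsets + 1, offsets + nchar(tokens, type = "chars"))

# TRUE when every offset points at the start of its token.
identical(extracted, tokens)
```

If the last line returns FALSE for your own text, the offsets are probably not character positions (for example because they were requested as UTF8).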
from googlelanguager.
Interesting, UTF16 seems to work. Thanks for looking into it. I will try to figure out how to report it to Google's NLP developers. Cheers from Oslo
from googlelanguager.
Update:
Here is the response from the Google Cloud support team:
"I have now heard from our Cloud Natural Language API team. According to them, the behavior is intended for UTF-8 as it counts at a byte-level and those characters (Umlauts) are read 2 bytes each for UTF-8"
So the solution here is to use UTF-16 encoding to avoid this problem. Hope that helps :)
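For anyone who already has UTF8 results stored, the byte-level counting described in the quote can be reproduced and undone in base R. This is only a sketch under assumptions: the byte offsets are copied from the UTF8 output earlier in this thread, the variable names are hypothetical, and it assumes every offset falls on the start of a character.

```r
# Same sentence as in the thread, written with \u escapes to be locale-safe.
text <- "Ich w\u00fcrde die M\u00e4use mit K\u00e4se f\u00fcttern"

# beginOffset values returned with encodingType = "UTF8" (byte offsets).
byte_offsets <- c(0, 4, 11, 15, 22, 26, 32)

# Split into single characters and count the UTF-8 bytes each one occupies;
# the umlauts take 2 bytes, plain ASCII letters take 1.
chars <- strsplit(enc2utf8(text), "")[[1]]
bytes_per_char <- nchar(chars, type = "bytes")

# Zero-based byte position at which each character starts.
char_start_byte <- cumsum(c(0, head(bytes_per_char, -1)))

# Map each byte offset back to a zero-based character offset.
char_offsets <- match(byte_offsets, char_start_byte) - 1
char_offsets

# The sentence is 36 characters long but 40 bytes in UTF-8 (four umlauts).
nchar(text, type = "chars")
nchar(text, type = "bytes")
```

The converted values agree with the offsets the API returned for UTF16 above (0, 4, 10, 14, 20, 24, 29). A byte offset that does not land on a character boundary would come back as NA from `match()`.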
from googlelanguager.
Good to know! Thanks for following this up @fkrauer
from googlelanguager.