
Comments (7)

MarkEdmondson1234 avatar MarkEdmondson1234 commented on September 15, 2024

from googlelanguager.

fkrauer avatar fkrauer commented on September 15, 2024
library(googleAuthR)
library(httr)
library(RCurl)
library(rjson)

service_token <- gar_auth_service(json_file= jsonfile, 
                                  scope = "https://www.googleapis.com/auth/cloud-platform")

text <- "Ich würde die Mäuse mit Käse füttern"
text <- iconv(text, "latin1", "UTF-8")
Encoding(text)
#> [1] "UTF-8"
nchar(text)
#> [1] 36

baseurl <- "https://language.googleapis.com/v1beta1/documents:analyzeSyntax?key="
outJSON <- getURL(
  paste0(baseurl, googleAPI),
  .opts = curlOptions(
    postfields = toJSON(list(
      document = list(type = "PLAIN_TEXT", language = "DE", content = text),
      encodingType = "UTF8"
    )),
    httpheader = c("Content-Type" = "application/json",
                   Authorization = service_token)
  )
)

tokens <- fromJSON(outJSON)[[2]]
tokens <- do.call("rbind", lapply(tokens, function(x) data.frame(t(unlist(x)), stringsAsFactors = FALSE)))
tokens[,c(1:3)]

  text.content text.beginOffset partOfSpeech.tag
1          Ich                0             PRON
2        würde                4             VERB
3          die               11              DET
4        Mäuse               15             NOUN
5          mit               22              ADP
6         Käse               26             NOUN
7      füttern               32             VERB

The offsets are correct for the first and second tokens, but the third token ("die") should have offset 10, "Mäuse" should have 14, "mit" should have 20, and so on. Every umlaut adds one extra count to all subsequent offsets. The same problem occurs for the NER (entities) analysis. My locale should not be the problem: the Norwegian locale accepts umlauts, and if I change the locale to German I get the same issue. Part of the code is adapted from a blog post (I can't find it again); might it have been one of yours?

Thanks a lot for your help!

R version 3.4.4 (2018-03-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=Norwegian (Bokmål)_Norway.1252  LC_CTYPE=Norwegian (Bokmål)_Norway.1252    LC_MONETARY=Norwegian (Bokmål)_Norway.1252
[4] LC_NUMERIC=C                               LC_TIME=Norwegian (Bokmål)_Norway.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] coreNLP_0.4-2      dplyr_0.7.4        stringr_1.2.0      wordcloud_2.6      RColorBrewer_1.1-2 readr_1.1.1        rjson_0.2.15      
 [8] RCurl_1.95-4.8     bitops_1.0-6       httr_1.3.1         googleAuthR_0.7.0 

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.19     bindr_0.1        magrittr_1.5     hms_0.3          R6_2.2.2         rlang_0.3.4      tools_3.4.4      openssl_0.9.6   
 [9] yaml_2.1.16      assertthat_0.2.0 digest_0.6.12    tibble_1.3.4     bindrcpp_0.2     rJava_0.9-11     curl_2.8.1       glue_1.1.1      
[17] memoise_1.1.0    stringi_1.4.3    compiler_3.4.4   XML_3.98-1.19    jsonlite_1.5     pkgconfig_2.0.1 


MarkEdmondson1234 avatar MarkEdmondson1234 commented on September 15, 2024

The package does replicate your problem, so perhaps it's an issue with the API response itself, as the encoding and response parsing you do are slightly different from the package's method. For example:

library(googleLanguageR)

nlp <- gl_nlp("Ich würde die Mäuse mit Käse füttern", language = "de")
nlp$tokens[[1]][, c("content","beginOffset")]
  content beginOffset
1     Ich           0
2   würde           4
3     die          11
4   Mäuse          15
5     mit          22
6    Käse          26
7 füttern          32

So it does look like a bug, perhaps worth reporting to the Google API team.


MarkEdmondson1234 avatar MarkEdmondson1234 commented on September 15, 2024

I think I get the right offsets when I use "UTF16" encoding though?

nlp <- gl_nlp("Ich würde die Mäuse mit Käse füttern", 
                      language = "de", encodingType = "UTF16")
#>2019-05-07 10:34:39 -- annotateText: 36 characters

nlp$tokens[[1]][, c("content","beginOffset")]
  content beginOffset
1     Ich           0
2   würde           4
3     die          10
4   Mäuse          14
5     mit          20
6    Käse          24
7 füttern          29
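
A local check, without calling the API, of why "UTF16" offsets can line up with R's character positions for this sentence. This is only a sketch: `utf16_units` is a hypothetical helper (not part of googleLanguageR), and it assumes the offsets are counted in UTF-16 code units. Characters such as ü and ä take one code unit each, but characters outside the Basic Multilingual Plane (for example many emoji) take two, so the match with `nchar()` is not guaranteed for every text.

```r
# Hypothetical helper: number of UTF-16 code units in a single string.
# UTF-16LE is used so that no byte-order mark is added to the output.
utf16_units <- function(x) {
  raw_utf16 <- iconv(enc2utf8(x), from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
  length(raw_utf16) / 2  # two bytes per code unit
}

# \u escapes keep the example independent of the file and locale encoding
text <- "Ich w\u00fcrde die M\u00e4use mit K\u00e4se f\u00fcttern"

nchar(text, type = "chars")  # 36 characters
nchar(text, type = "bytes")  # 40 bytes in UTF-8 (each umlaut is 2 bytes)
utf16_units(text)            # 36 code units, same as the character count

# A character outside the BMP is 1 character but 2 UTF-16 code units
emoji <- "\U0001F600"
nchar(emoji, type = "chars")
utf16_units(emoji)
```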


fkrauer avatar fkrauer commented on September 15, 2024

Interesting, UTF16 seems to work. Thanks for looking into it. I will try to figure out how to report it to Google's NLP developers. Cheers from Oslo


fkrauer avatar fkrauer commented on September 15, 2024

Update:
Here is the response from the Google Cloud support team:

"I have now heard from our Cloud Natural Language API team. According to them, the behavior is intended for UTF-8 as it counts at a byte-level and those characters (Umlauts) are read 2 bytes each for UTF-8"

So the solution here would be to use UTF-16 to avoid this problem. Hope that helps :)
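
If the request has to stay on "UTF8", the byte offsets can also be mapped back to character offsets on the R side. The following is a minimal sketch using base R only; `byte_to_char_offset` is a hypothetical helper, not part of googleLanguageR or the API, and it assumes every offset falls on a character boundary (which should hold for token start offsets).

```r
# Hypothetical helper: convert 0-based UTF-8 byte offsets into
# 0-based character offsets for a single string.
byte_to_char_offset <- function(text, byte_offsets) {
  bytes <- charToRaw(enc2utf8(text))
  vapply(byte_offsets, function(b) {
    # Take the first b bytes, mark them as UTF-8, and count characters
    prefix <- rawToChar(bytes[seq_len(b)])
    Encoding(prefix) <- "UTF-8"
    nchar(prefix, type = "chars")
  }, integer(1))
}

text <- "Ich w\u00fcrde die M\u00e4use mit K\u00e4se f\u00fcttern"

# beginOffset values reported in this thread with encodingType = "UTF8"
utf8_offsets <- c(0, 4, 11, 15, 22, 26, 32)

char_offsets <- byte_to_char_offset(text, utf8_offsets)
char_offsets
#> [1]  0  4 10 14 20 24 29
```

The result matches the "UTF16" offsets shown earlier in the thread. Since `substr()` is 1-based, add 1 to these values before using them to slice the text.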


MarkEdmondson1234 avatar MarkEdmondson1234 commented on September 15, 2024

Good to know! Thanks for following this up @fkrauer
