false character count when using Umlauts (googleLanguageR issue #58, closed)

ropensci commented on September 15, 2024

false character count when using Umlauts

from googlelanguager.

Comments (7)

MarkEdmondson1234 commented on September 15, 2024

Hi Fabienne, I don’t recognise that issue, as I’ve been translating Danish text containing æ / ø / å with no problems. Perhaps it’s the way I process the text before sending it to the API. If you have a reproducible example of your request, we could check whether the package code differs from yours and whether it potentially has the same problem. Yours sincerely, Mark

From: Fabienne Krauer
Sent: Monday, May 6, 2019 5:33 PM
To: ropensci/googleLanguageR
Subject: [ropensci/googleLanguageR] false character count when using Umlauts (#58)

Hi all

I have not used your package so far; I wrote my own script to call the REST API with rjson. However, I have noticed a problem when using Google's analyzeSyntax and analyzeEntities, and I was wondering if you have the same issue: I am analyzing a German text with Umlauts (ä, ö, ü, ...) and Google seems to count every Umlaut as two characters, which means that all offset values are wrong after the first Umlaut in the text. So basically I can't map the tokens or the entities back to the original text.

Have you had this issue (maybe with other special characters), and how do you deal with it? I am encoding with UTF-8, so it should not be an encoding problem. I am working in a Windows environment.

Thanks and best wishes from Oslo
Fabienne

fkrauer commented on September 15, 2024

library(googleAuthR)
library(httr)
library(RCurl)
library(rjson)

service_token <- gar_auth_service(json_file= jsonfile, 
                                  scope = "https://www.googleapis.com/auth/cloud-platform")

text <- "Ich würde die Mäuse mit Käse füttern"
text <- iconv(text, "latin1", "UTF-8")
Encoding(text)
[1] "UTF-8"
nchar(text)
[1] 36

baseurl <- "https://language.googleapis.com/v1beta1/documents:analyzeSyntax?key="
outJSON <- getURL(paste0(baseurl, googleAPI),
                                 .opts = curlOptions(postfields = toJSON(list(document=list(type ="PLAIN_TEXT",
                                                                                            language="DE",
                                                                                            content = text), 
                                                                              encodingType = "UTF8")),
                                                     httpheader = c("Content-Type" = "application/json",
                                                                    Authorization = service_token)))

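# The second element of the parsed response holds the token list
tokens <- fromJSON(outJSON)[[2]]
# Flatten each token into a one-row data frame and stack the rows
tokens <- do.call("rbind", lapply(tokens, function(x) data.frame(t(unlist(x)), stringsAsFactors = FALSE)))
tokens[, 1:3]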

  text.content text.beginOffset partOfSpeech.tag
1          Ich                0             PRON
2        würde                4             VERB
3          die               11              DET
4        Mäuse               15             NOUN
5          mit               22              ADP
6         Käse               26             NOUN
7      füttern               32             VERB

The offset is correct for the first and second token, but the third token ("die") should have offset 10, "Mäuse" should be 14, "mit" should be 20, and so on. Every umlaut shifts all subsequent offsets by one more. The same problem occurs for the NER (entities) analysis. My locale should not be the problem: Norwegian accepts Umlauts, and if I change the locale to German I have the same issue. Part of the code is adapted from a blog (I can't find it again); it might have been a blog of yours?

Thanks a lot for your help!

R version 3.4.4 (2018-03-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=Norwegian (Bokmål)_Norway.1252  LC_CTYPE=Norwegian (Bokmål)_Norway.1252    LC_MONETARY=Norwegian (Bokmål)_Norway.1252
[4] LC_NUMERIC=C                               LC_TIME=Norwegian (Bokmål)_Norway.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] coreNLP_0.4-2      dplyr_0.7.4        stringr_1.2.0      wordcloud_2.6      RColorBrewer_1.1-2 readr_1.1.1        rjson_0.2.15      
 [8] RCurl_1.95-4.8     bitops_1.0-6       httr_1.3.1         googleAuthR_0.7.0 

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.19     bindr_0.1        magrittr_1.5     hms_0.3          R6_2.2.2         rlang_0.3.4      tools_3.4.4      openssl_0.9.6   
 [9] yaml_2.1.16      assertthat_0.2.0 digest_0.6.12    tibble_1.3.4     bindrcpp_0.2     rJava_0.9-11     curl_2.8.1       glue_1.1.1      
[17] memoise_1.1.0    stringi_1.4.3    compiler_3.4.4   XML_3.98-1.19    jsonlite_1.5     pkgconfig_2.0.1

MarkEdmondson1234 commented on September 15, 2024

The package does replicate your problem, even though the encoding and response parsing you do are slightly different from the package's method, so perhaps it's an issue with the API response itself. For example:

library(googleLanguageR)

nlp <- gl_nlp("Ich würde die Mäuse mit Käse füttern", language = "de")
nlp$tokens[[1]][, c("content","beginOffset")]
  content beginOffset
1     Ich           0
2   würde           4
3     die          11
4   Mäuse          15
5     mit          22
6    Käse          26
7 füttern          32

So it does look like a bug, perhaps worth reporting to the Google API team.

MarkEdmondson1234 commented on September 15, 2024

I think I get the right offsets when I use "UTF16" encoding though?

nlp <- gl_nlp("Ich würde die Mäuse mit Käse füttern", 
                      language = "de", encodingType = "UTF16")
#>2019-05-07 10:34:39 -- annotateText: 36 characters

nlp$tokens[[1]][, c("content","beginOffset")]
  content beginOffset
1     Ich           0
2   würde           4
3     die          10
4   Mäuse          14
5     mit          20
6    Käse          24
7 füttern          29
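
One way to confirm that these offsets map back to the original text is to slice the sentence with them. The snippet below is only a sketch: it hard-codes the sentence and the UTF16 offsets from the output above rather than calling the API, and it assumes all characters are in the Basic Multilingual Plane (true for German umlauts), where one UTF-16 code unit corresponds to one R character.

```r
# Sentence and UTF16 offsets copied from the output above (no API call here).
# \u escapes keep the string UTF-8 regardless of the session locale.
text <- "Ich w\u00fcrde die M\u00e4use mit K\u00e4se f\u00fcttern"
tokens <- data.frame(
  content = c("Ich", "w\u00fcrde", "die", "M\u00e4use", "mit",
              "K\u00e4se", "f\u00fcttern"),
  beginOffset = c(0, 4, 10, 14, 20, 24, 29),
  stringsAsFactors = FALSE
)

# The API offsets are 0-based, substring() is 1-based, so shift by one.
start <- tokens$beginOffset + 1
recovered <- substring(text, start, start + nchar(tokens$content) - 1)

# TRUE when every token is found exactly where its offset says it is.
all(recovered == tokens$content)
```

For text containing characters outside the Basic Multilingual Plane (for example most emoji), a UTF-16 offset counts two code units per character, so this direct slicing would drift again.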

fkrauer commented on September 15, 2024

Interesting, UTF16 seems to work. Thanks for looking into it. I will try to figure out how to report it to Google's NLP developers. Cheers from Oslo

fkrauer commented on September 15, 2024

Update:
Here is the response from the Google Cloud support team:

"I have now heard from our Cloud Natural Language API team. According to them, the behavior is intended for UTF-8 as it counts at a byte-level and those characters (Umlauts) are read 2 bytes each for UTF-8"

So the solution here is to request the "UTF16" encoding type, which avoids the problem for these characters. Hope that helps :)
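
The byte-level counting can be reproduced locally in R, and offsets that were already collected with the UTF8 encoding type can be converted after the fact. This is a minimal sketch, not part of googleLanguageR: `byte_to_char_offset()` is a hypothetical helper written for illustration, and the offsets are the ones reported earlier in this thread.

```r
text <- "Ich w\u00fcrde die M\u00e4use mit K\u00e4se f\u00fcttern"

# Each of the four umlauts takes two bytes in UTF-8, hence the difference.
nchar(text, type = "chars")  # 36
nchar(text, type = "bytes")  # 40

# Hypothetical helper: convert 0-based UTF-8 byte offsets into
# 0-based character offsets for the same string.
byte_to_char_offset <- function(text, byte_offset) {
  text <- enc2utf8(text)
  chars <- strsplit(text, "")[[1]]
  # 0-based byte position at which each character starts
  starts <- cumsum(c(0, nchar(chars, type = "bytes")))[seq_along(chars)]
  # match() returns the 1-based character index; subtract 1 for 0-based
  match(byte_offset, starts) - 1
}

# beginOffset values returned with encodingType = "UTF8"
utf8_offsets <- c(0, 4, 11, 15, 22, 26, 32)
byte_to_char_offset(text, utf8_offsets)
# 0 4 10 14 20 24 29, the same values the UTF16 request returned
```

Requesting UTF16 directly is simpler when the API call can be repeated; a conversion like this is mainly useful for results that were already stored with byte offsets.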

MarkEdmondson1234 commented on September 15, 2024

Good to know! Thanks for following this up @fkrauer
