pemistahl / lingua-go

The most accurate natural language detection library for Go, suitable for short text and mixed-language text

License: Apache License 2.0

Go 100.00%
natural-language-processing language-detection language-recognition language-classification language-identification language-processing nlp nlp-machine-learning golang-library go

lingua-go's Introduction

Hello, thank you for visiting my profile. 🖖🏻🤓

My name is Peter. There are actually more people than I thought who have the same first and family name, so I bother you with my middle name Michael as well. Were Type O Negative really so popular back then? I haven't got a clue...

I hold a Master's degree in computational linguistics from Saarland University in Saarbrücken, Germany. After my graduation in 2013, I decided against a research career because I like building things that help people now and not in the unforeseeable future.

Currently, I work for Riege, a leading provider of cloud-based software for the logistics industry. In my free time, I like working on open source projects in the fields of computational linguistics and string processing in general.

I have a special interest in modern programming languages and green computing. I believe that the software industry should make more significant contributions towards environmental protection. Great advances have been made in decreasing the energy consumption and emissions of hardware. However, these are often canceled out by poorly optimized software and resource-intensive runtime environments.

This is why I'm especially interested in the Rust programming language, which allows writing performant and memory-safe applications without the need for a garbage collector or a virtual runtime environment, while making use of modern syntax abstractions at the same time.

For those of you interested in how Rust and related technology can accomplish the goal of more eco-friendly software, I strongly recommend reading the dissertation Energyware Engineering: Techniques and Tools for Green Software Development, published in 2018 by Rui Pereira at the University of Minho in Portugal.


lingua-go's People

Contributors

dependabot[bot], dsxack, pemistahl


lingua-go's Issues

panic: runtime error: slice bounds out of range [:10] with length 9

When I run the code from this example:
https://github.com/pemistahl/lingua-go#96-detection-of-multiple-languages-in-mixed-language-texts

go run . test.txt

I get this error if the text contains only one word:

English 0 10 :
panic: runtime error: slice bounds out of range [:10] with length 9

goroutine 1 [running]:
main.main()
	/home/rom/w/kube/apps/tts/split/split-text.go:49 +0x3ce
exit status 2

How to reproduce:
cat test.txt

testword

cat ./split-text.go

package main

import (
	"fmt"
	"os"

	"github.com/pemistahl/lingua-go"
)

func getFileContent(filename string) string {
	testData, err := os.ReadFile(filename)
	if err != nil {
		panic(err.Error())
	}
	return string(testData)
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("Missing parameter, provide file name!")
		return
	}
	filename := os.Args[1]

	languages := []lingua.Language{
		lingua.English,
		lingua.Finnish,
	}

	detector := lingua.NewLanguageDetectorBuilder().
		FromLanguages(languages...).
		Build()

	sentence := getFileContent(filename)
	for _, result := range detector.DetectMultipleLanguagesOf(sentence) {
		fmt.Printf("%s %d %d :\n", result.Language(), result.StartIndex(), result.EndIndex())
		// This is the line that panics: EndIndex can exceed len(sentence).
		fmt.Printf("%s: '%s'\n", result.Language(), sentence[result.StartIndex():result.EndIndex()])
	}
}
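A possible workaround until the byte/rune indexing is clarified (see the "Detection of multiple languages: bytes, runes" issue further below): treat the indices as rune positions and clamp them to the text length before slicing. The helper below is purely illustrative and not part of lingua-go.

// sliceByRunes is a hypothetical helper that interprets the result indices
// as rune (code point) positions and clamps them to the text length,
// avoiding the "slice bounds out of range" panic shown above.
func sliceByRunes(text string, start, end int) string {
	runes := []rune(text)
	if start < 0 {
		start = 0
	}
	if end > len(runes) {
		end = len(runes)
	}
	if start >= end {
		return ""
	}
	return string(runes[start:end])
}

// In main, the slicing line then becomes:
// fmt.Printf("%s: '%s'\n", result.Language(), sliceByRunes(sentence, result.StartIndex(), result.EndIndex()))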

`go get` error: `unknown revision serialization/v1.2.0`

I hope this is related to the library and not something with my Go installation.

When I run: go get github.com/pemistahl/lingua-go
I get the following error:

go: downloading github.com/pemistahl/lingua-go/serialization v1.2.0
github.com/pemistahl/lingua-go imports
        github.com/pemistahl/lingua-go/serialization: reading github.com/pemistahl/lingua-go/serialization/go.mod at revision serialization/v1.2.0: unknown revision serialization/v1.2.0

Compile-time language inclusion

If the generated model data is split up per language, it becomes possible to strip down the bundled language models immensely.

Tag handling

Each generated file can carry a //go:build directive that controls whether its language is included.

By default, all generated files would include //go:build !lingua_ignore, which means "include this file unless built with -tags lingua_ignore". That is the same behaviour as now.

A file guarded by the build constraint //go:build (!lingua_ignore && !lingua_no<language>) || lingua_<language> is then compiled when either -tags lingua_<language> is specified or -tags lingua_no<language> is NOT specified.

Thus, if you want all languages included, you simply do nothing; when you want to reduce the language set to a minimum, you use build tags like -tags lingua_ignore,lingua_en,lingua_es, etc.

If you only want to exclude a few languages, you add -tags lingua_noge without adding lingua_ignore.

Model loading

For now, the models are loaded from a single point in detector.go through embed.FS.
Instead, each language-model/<language> directory could contain a .go file that carries the aforementioned build constraints.

That file can also embed all *.zip files into a separate embed.FS, which is then passed to the "main" filesystem in the language-model package.

The language-model package can then implement the fs.SubFS interface.
It could be as simple as a generated file with a switch/case over all available languages that imports all language-model/* packages.
Or, if you don't want to use code generation, it should be simple enough to add a Register method that the init function of each language-model/<language> package then calls. It won't be called at all if the language package is excluded.
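A rough sketch of what such a per-language file might look like; the package layout, the tag names and the Register hook are assumptions for illustration, not existing lingua-go API:

//go:build (!lingua_ignore && !lingua_nogerman) || lingua_german

// Hypothetical file: language-model/german/german.go
package german

import (
	"embed"

	languagemodel "github.com/pemistahl/lingua-go/language-model" // assumed registry package
)

// The zipped model files for this language are only embedded when this
// file survives the build constraints above.
//go:embed *.zip
var models embed.FS

func init() {
	// Register is the assumed hook on the language-model package.
	// It is never reached when the package is excluded via build tags.
	languagemodel.Register("german", models)
}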

Add absolute confidence metric

I do see that the README has this example:

package main

import (
    "fmt"
    "github.com/pemistahl/lingua-go"
)

func main() {
    languages := []lingua.Language{
        lingua.English,
        lingua.French,
        lingua.German,
        lingua.Spanish,
    }

    detector := lingua.NewLanguageDetectorBuilder().
        FromLanguages(languages...).
        Build()

    confidenceValues := detector.ComputeLanguageConfidenceValues("languages are awesome")

    for _, elem := range confidenceValues {
        fmt.Printf("%s: %.2f\n", elem.Language(), elem.Value())
    }

    // Output:
    // English: 1.00
    // French: 0.79
    // German: 0.75
    // Spanish: 0.72
}

But if I call detector.ComputeLanguageConfidenceValues("yo bebo ein large quantity of tasty leche"), English still comes out at 1.0. How do I get something like an absolute certainty or probability that the text is English? A value of 1.0 doesn't seem very helpful in that case. It might just be my lack of math experience; I assume this is possible with the values from the example above, but I don't see exactly how.
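As a stopgap, the relative values returned by ComputeLanguageConfidenceValues can be normalized so that they sum to 1, which gives a rough, probability-like share per language. This is only a heuristic on top of the existing API, not a true absolute confidence:

	// Continuing the example above: normalize the relative confidence values.
	confidenceValues := detector.ComputeLanguageConfidenceValues("yo bebo ein large quantity of tasty leche")

	var sum float64
	for _, elem := range confidenceValues {
		sum += elem.Value()
	}
	if sum > 0 {
		for _, elem := range confidenceValues {
			fmt.Printf("%s: %.2f\n", elem.Language(), elem.Value()/sum)
		}
	}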

Support absolute language confidence metric

Hi,
In my scenario, the goal is to detect whether the input text is in English or in another language. I'm not sure how to use the library to accomplish this. For instance, if the input text is in another specified language, such as Vietnamese, I expect it to be detected as non-English:

package main

import (
	"fmt"

	"github.com/pemistahl/lingua-go"
)

func main() {
	languages := []lingua.Language{
		lingua.English,
		lingua.Vietnamese,
		lingua.Unknown,
	}

	sentence := "Thông tin tài khoản của bạn"

	detector := lingua.NewLanguageDetectorBuilder().
		FromLanguages(languages...).
		WithMinimumRelativeDistance(0.9).
		Build()

	confidenceValues := detector.ComputeLanguageConfidenceValues(sentence)

	for _, elem := range confidenceValues {
		fmt.Printf("%s: %.2f\n", elem.Language(), elem.Value())
	}
}

output:

Vietnamese: 1.00
English: 0.00

When I remove lingua.Vietnamese from the expected language list, the program outputs English: 1.00, but I would like the result to be some other language type rather than English.
Please help me figure out how to do this.
Thanks in advance.
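A sketch of one possible approach, under the assumption (my reading, please verify) that DetectLanguageOf reports exists == false when no language stands out by the configured minimum relative distance: build the detector from a broad language set and only accept English when it is both the detected language and reliably separated from the runner-up.

package main

import (
	"fmt"

	"github.com/pemistahl/lingua-go"
)

func main() {
	// Build from all languages so that e.g. Vietnamese text is not forced into English.
	detector := lingua.NewLanguageDetectorBuilder().
		FromAllLanguages().
		WithMinimumRelativeDistance(0.25). // the threshold is an arbitrary example value
		Build()

	sentence := "Thông tin tài khoản của bạn"

	if language, exists := detector.DetectLanguageOf(sentence); exists && language == lingua.English {
		fmt.Println("English")
	} else {
		// Either another language won, or no language was reliably detected.
		fmt.Println("not English")
	}
}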

Build to WASM runs slowly

I built the simple example to detect the word 'app' and it works. But when I build it to WebAssembly, it takes about 6 seconds to finish. Is there any way to solve this problem?

Language detection is sometimes non-deterministic

To reproduce the issue:

package main

import (
	"log"

	"github.com/pemistahl/lingua-go"
)

func main() {
	detectorAll := lingua.NewLanguageDetectorBuilder().FromAllLanguages().WithPreloadedLanguageModels().Build()
	for i := 0; i < 1000; i++ {
		lang, _ := detectorAll.DetectLanguageOf("Az elmúlt hétvégén 12-re emelkedett az elhunyt koronavírus-fertőzöttek száma Szlovákiában. Mindegyik szociális otthon dolgozóját letesztelik, Matovič szerint az ingázóknak még várniuk kellene a teszteléssel")
		log.Println(lang.IsoCode639_1().String())
	}
}

Thank you for the amazing work!

Panics at loadJson

Code to reproduce:

package main

import (
    "fmt"
    "github.com/pemistahl/lingua-go"
)

func main() {
    languages := []lingua.Language{
        lingua.English,
        lingua.French,
        lingua.German,
        lingua.Spanish,
    }

    detector := lingua.NewLanguageDetectorBuilder().
        FromLanguages(languages...).
        Build()

    confidenceValues := detector.ComputeLanguageConfidenceValues("languages are awesome")

    for _, elem := range confidenceValues {
        fmt.Printf("%s: %.2f\n", elem.Language(), elem.Value())
    }

    // Output:
    // English: 1.00
    // French: 0.79
    // German: 0.75
    // Spanish: 0.72
}

go.mod

module lingua

go 1.16

require github.com/pemistahl/lingua-go v1.0.0

go env:

❯ go env
GO111MODULE="on"
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/dmitriysmotrov/Library/Caches/go-build"
GOENV="/Users/dmitriysmotrov/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/dmitriysmotrov/.gvm/pkgsets/go1.16.5/global/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/dmitriysmotrov/.gvm/pkgsets/go1.16.5/global"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/Users/dmitriysmotrov/.gvm/gos/go1.16.5"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/Users/dmitriysmotrov/.gvm/gos/go1.16.5/pkg/tool/darwin_amd64"
GOVCS=""
GOVERSION="go1.16.5"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/dmitriysmotrov/space/dsxack/lingua/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/z5/8ts06jv92yjc5sp5mdsdzr2h0000gn/T/go-build2817996487=/tmp/go-build -gno-record-gcc-switches -fno-common"

Expected: no panics

Actual:

panic: runtime error: invalid memory address or nil pointer dereference
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x10e9c82]

goroutine 22 [running]:
archive/zip.(*ReadCloser).Close(0x0, 0x0, 0x0)
	/Users/dmitriysmotrov/.gvm/gos/go1.16.5/src/archive/zip/reader.go:161 +0x22
panic(0x11841e0, 0x12c0160)
	/Users/dmitriysmotrov/.gvm/gos/go1.16.5/src/runtime/panic.go:965 +0x1b9
github.com/pemistahl/lingua-go.loadJson(0x18, 0x5, 0x0, 0x0, 0x0)
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/json.go:32 +0x18e
github.com/pemistahl/lingua-go.loadFivegrams(...)
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/fivegrams.go:925
github.com/pemistahl/lingua-go.germanFivegramModel.func1.1()
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/fivegrams.go:368 +0x45
sync.(*Once).doSlow(0xc0000b0f30, 0xc000094b68)
	/Users/dmitriysmotrov/.gvm/gos/go1.16.5/src/sync/once.go:68 +0xec
sync.(*Once).Do(...)
	/Users/dmitriysmotrov/.gvm/gos/go1.16.5/src/sync/once.go:59
github.com/pemistahl/lingua-go.germanFivegramModel.func1(0x11831c0, 0xc00009afc0)
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/fivegrams.go:367 +0xbb
github.com/pemistahl/lingua-go.languageDetector.lookUpNgramProbability(0xc0000d49c0, 0x4, 0x4, 0x0, 0xc00012c080, 0x2, 0x2, 0xc00009b0b0, 0xc00009b050, 0xc00009af90, ...)
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/detector.go:530 +0x1cb
github.com/pemistahl/lingua-go.languageDetector.computeSumOfNgramProbabilities(0xc0000d49c0, 0x4, 0x4, 0x0, 0xc00012c080, 0x2, 0x2, 0xc00009b0b0, 0xc00009b050, 0xc00009af90, ...)
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/detector.go:516 +0xf7
github.com/pemistahl/lingua-go.languageDetector.computeLanguageProbabilities(0xc0000d49c0, 0x4, 0x4, 0x0, 0xc00012c080, 0x2, 0x2, 0xc00009b0b0, 0xc00009b050, 0xc00009af90, ...)
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/detector.go:474 +0xca
github.com/pemistahl/lingua-go.languageDetector.lookUpLanguageModels(0xc0000d49c0, 0x4, 0x4, 0x0, 0xc00012c080, 0x2, 0x2, 0xc00009b0b0, 0xc00009b050, 0xc00009af90, ...)
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/detector.go:442 +0xca
created by github.com/pemistahl/lingua-go.languageDetector.ComputeLanguageConfidenceValues
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/detector.go:170 +0x525
panic: runtime error: invalid memory address or nil pointer dereference
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x10e9c82]

goroutine 18 [running]:
archive/zip.(*ReadCloser).Close(0x0, 0x0, 0x0)
	/Users/dmitriysmotrov/.gvm/gos/go1.16.5/src/archive/zip/reader.go:161 +0x22
panic(0x11841e0, 0x12c0160)
	/Users/dmitriysmotrov/.gvm/gos/go1.16.5/src/runtime/panic.go:965 +0x1b9
github.com/pemistahl/lingua-go.loadJson(0x18, 0x1, 0x0, 0x0, 0x0)
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/json.go:32 +0x18e
github.com/pemistahl/lingua-go.loadUnigrams(...)
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/unigrams.go:925
github.com/pemistahl/lingua-go.germanUnigramModel.func1.1()
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/unigrams.go:368 +0x45
sync.(*Once).doSlow(0xc0000b1d40, 0xc000064b68)
	/Users/dmitriysmotrov/.gvm/gos/go1.16.5/src/sync/once.go:68 +0xec
sync.(*Once).Do(...)
	/Users/dmitriysmotrov/.gvm/gos/go1.16.5/src/sync/once.go:59
github.com/pemistahl/lingua-go.germanUnigramModel.func1(0x11831c0, 0xc00009b050)
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/unigrams.go:367 +0xbb
github.com/pemistahl/lingua-go.languageDetector.lookUpNgramProbability(0xc0000d49c0, 0x4, 0x4, 0x0, 0xc00012c080, 0x2, 0x2, 0xc00009b0b0, 0xc00009b050, 0xc00009af90, ...)
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/detector.go:538 +0x128
github.com/pemistahl/lingua-go.languageDetector.computeSumOfNgramProbabilities(0xc0000d49c0, 0x4, 0x4, 0x0, 0xc00012c080, 0x2, 0x2, 0xc00009b0b0, 0xc00009b050, 0xc00009af90, ...)
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/detector.go:516 +0xf7
github.com/pemistahl/lingua-go.languageDetector.computeLanguageProbabilities(0xc0000d49c0, 0x4, 0x4, 0x0, 0xc00012c080, 0x2, 0x2, 0xc00009b0b0, 0xc00009b050, 0xc00009af90, ...)
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/detector.go:474 +0xca
github.com/pemistahl/lingua-go.languageDetector.lookUpLanguageModels(0xc0000d49c0, 0x4, 0x4, 0x0, 0xc00012c080, 0x2, 0x2, 0xc00009b0b0, 0xc00009b050, 0xc00009af90, ...)
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/detector.go:442 +0xca
created by github.com/pemistahl/lingua-go.languageDetector.ComputeLanguageConfidenceValues
	/Users/dmitriysmotrov/.gvm/pkgsets/go1.16rc1/global/pkg/mod/github.com/pemistahl/[email protected]/detector.go:170 +0x525

`lingua.Unknown` is not handled appropriately if included in the set of input languages

I started testing the library a few days ago and just saw my first nil pointer panic like this:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x4314fd8]

goroutine 22414 [running]:
github.com/pemistahl/lingua-go.loadJson(0xa834340, 0x45537a0)
	/Users/marian/go/pkg/mod/github.com/pemistahl/[email protected]/json.go:37 +0x178
github.com/pemistahl/lingua-go.languageDetector.loadLanguageModels({{0xc01622a700, 0x1f, 0x1f}, 0x0, {0xc017283f80, 0xc, 0x10}, 0xc01312f320, 0xa834340, 0xa834240, ...}, ...)
	/Users/marian/go/pkg/mod/github.com/pemistahl/[email protected]/detector.go:612 +0x8d
github.com/pemistahl/lingua-go.languageDetector.lookUpNgramProbability({{0xc01622a700, 0x1f, 0x1f}, 0x0, {0xc017283f80, 0xc, 0x10}, 0xc01312f320, 0xa834340, 0xa834240, ...}, ...)
	/Users/marian/go/pkg/mod/github.com/pemistahl/[email protected]/detector.go:552 +0x146
github.com/pemistahl/lingua-go.languageDetector.computeSumOfNgramProbabilities({{0xc01622a700, 0x1f, 0x1f}, 0x0, {0xc017283f80, 0xc, 0x10}, 0xc01312f320, 0xa834340, 0xa834240, ...}, ...)
	/Users/marian/go/pkg/mod/github.com/pemistahl/[email protected]/detector.go:526 +0x145
github.com/pemistahl/lingua-go.languageDetector.computeLanguageProbabilities({{0xc01622a700, 0x1f, 0x1f}, 0x0, {0xc017283f80, 0xc, 0x10}, 0xc01312f320, 0xa834340, 0xa834240, ...}, ...)
	/Users/marian/go/pkg/mod/github.com/pemistahl/[email protected]/detector.go:484 +0xcb
github.com/pemistahl/lingua-go.languageDetector.lookUpLanguageModels({{0xc01622a700, 0x1f, 0x1f}, 0x0, {0xc017283f80, 0xc, 0x10}, 0xc01312f320, 0xa834340, 0xa834240, ...}, ...)
	/Users/marian/go/pkg/mod/github.com/pemistahl/[email protected]/detector.go:452 +0xb7
created by github.com/pemistahl/lingua-go.languageDetector.ComputeLanguageConfidenceValues
	/Users/marian/go/pkg/mod/github.com/pemistahl/[email protected]/detector.go:176 +0x455

It seems as if even reading from an embedded file can fail at some point.

I'm using go version go1.17.2 darwin/amd64.
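Until this is handled in the library, a defensive workaround is to strip lingua.Unknown from the configured language set before building the detector. A minimal sketch, assuming the panic is indeed triggered by Unknown being part of the input set:

// filterUnknown removes lingua.Unknown from a language slice so that it
// never reaches the detector builder.
func filterUnknown(languages []lingua.Language) []lingua.Language {
	filtered := make([]lingua.Language, 0, len(languages))
	for _, language := range languages {
		if language != lingua.Unknown {
			filtered = append(filtered, language)
		}
	}
	return filtered
}

// detector := lingua.NewLanguageDetectorBuilder().
// 	FromLanguages(filterUnknown(languages)...).
// 	Build()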

How do you generate the ngram probabilities?

Hi, how do you actually generate the ngram probabilities?

I suspect that there might be something rotten in the state of Denmark, because I just tried to load the data for English trigrams and received these:

...
afd:0.0006082185532001166 afe:0.1033211267248698 aff:0.2505226878191564 afg:0.018677378071186912 afh:0.00011404097872502186 afi:0.008679785602959997 afj:6.335609929167881e-05 afk:0.00022808195745004371 afl:0.00424485865254248 afm:0.0002914380567417225 afn:0.000532191234050102 afo:0.010098962227093602
...
aôn:1 aõs:0.8 aúc:0.1 aúj:0.1 aúl:0.8 aül:1 aýt:1 aÿe:1 aća:1 ača:0.5 ači:0.5 ağl:0.4 ağr:0.2 ała:0.14285714285714285 ałb:0.14285714285714285 ało:0.14285714285714285 ały:0.14285714285714285 ałę:0.2857142857142857 ańs:0.6363636363636364 aşa:0.16666666666666666 aşg:0.16666666666666666 aşı:0.16666666666666666 aši:0.4 ašk:0.2 ašm:0.2 ašp:0.2 aţi:1 aźm:1 aża:1 aži:0.6666666666666666 ažs:0.3333333333333333 ași:1 aʔi:1 affo:1
...

And it certainly does not seem right.

Strange results for Chinese with Japanese

To reproduce:

package main

import (
	"fmt"

	"github.com/pemistahl/lingua-go"
)

func main() {
	detector := lingua.NewLanguageDetectorBuilder().
		FromAllLanguages().
		Build()

	text := "上海大学是一个好大学. わー!"
	if language, exists := detector.DetectLanguageOf(text); exists {
		fmt.Println(language.String()) // Japanese
	}
}

Expected:
Chinese for this case.

https://github.com/pemistahl/lingua-go/blob/main/detector.go#L467

It is because the code there returns Japanese as soon as any character from japaneseCharacterSet is present; I'm unsure whether this is intended.

Thanks for the awesome work!

Find more memory-efficient data structure for language models

Currently, the language models are loaded into simple maps at runtime. Even though accessing the maps is pretty fast, they consume a significant amount of memory. The goal is to investigate whether there are more suitable data structures available that require less storage space in memory, something like NumPy for Python.

One promising candidate could be Gonum.

Detect multiple languages in mixed-language text

Currently, for a given input string, only the single most likely language is returned. However, if the input contains contiguous sections in multiple languages, it would be desirable to detect all of them and return an ordered sequence of items, where each item consists of a start index, an end index and the detected language.

Input:
He turned around and asked: "Entschuldigen Sie, sprechen Sie Deutsch?"

Output:

[
  {"start": 0, "end": 27, "language": ENGLISH}, 
  {"start": 28, "end": 69, "language": GERMAN}
]
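For reference, this is roughly how the feature is exercised elsewhere in this list once DetectMultipleLanguagesOf exists; the printed indices depend on the models and should not be taken as exact:

package main

import (
	"fmt"

	"github.com/pemistahl/lingua-go"
)

func main() {
	detector := lingua.NewLanguageDetectorBuilder().
		FromLanguages(lingua.English, lingua.German).
		Build()

	sentence := `He turned around and asked: "Entschuldigen Sie, sprechen Sie Deutsch?"`

	// Each result carries the detected language plus start and end indices into the sentence.
	for _, result := range detector.DetectMultipleLanguagesOf(sentence) {
		fmt.Printf("{\"start\": %d, \"end\": %d, \"language\": %s}\n",
			result.StartIndex(), result.EndIndex(), result.Language())
	}
}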

panic: decimal division by 0

There is a panic in the latest 1.3.1 version.

panic: decimal division by 0

goroutine 41191 [running]:
github.com/shopspring/decimal.Decimal.QuoRem({0xc0248f08e0, 0xffff9e58}, {0xc058a1f080, 0xffff9e58}, 0x10)
	/home/ec2-user/go/pkg/mod/github.com/shopspring/[email protected]/decimal.go:565 +0x2c5
github.com/shopspring/decimal.Decimal.DivRound({0xc0248f08e0?, 0x58a1e100?}, {0xc058a1f080?, 0x7f1272d8?}, 0x10)
	/home/ec2-user/go/pkg/mod/github.com/shopspring/[email protected]/decimal.go:607 +0x56
github.com/shopspring/decimal.Decimal.Div(...)
	/home/ec2-user/go/pkg/mod/github.com/shopspring/[email protected]/decimal.go:552
github.com/pemistahl/lingua-go.languageDetector.computeConfidenceValues({{0xc000852500, 0x4b, 0x4b}, 0x0, 0x0, {0xc00060c100, 0x14, 0x20}, 0xc0004b68a0, 0xb104080, ...}, ...)
	/home/ec2-user/go/pkg/mod/github.com/pemistahl/[email protected]/detector.go:615 +0x1af
github.com/pemistahl/lingua-go.languageDetector.ComputeLanguageConfidenceValues({{0xc000852500, 0x4b, 0x4b}, 0x0, 0x0, {0xc00060c100, 0x14, 0x20}, 0xc0004b68a0, 0xb104080, ...}, ...)
	/home/ec2-user/go/pkg/mod/github.com/pemistahl/[email protected]/detector.go:351 +0x8b4
github.com/pemistahl/lingua-go.languageDetector.DetectLanguageOf({{0xc000852500, 0x4b, 0x4b}, 0x0, 0x0, {0xc00060c100, 0x14, 0x20}, 0xc0004b68a0, 0xb104080, ...}, ...)
	/home/ec2-user/go/pkg/mod/github.com/pemistahl/[email protected]/detector.go:147 +0x58

Strange matching for Spanish phrase detected as Finnish

Hey! I've been messing with this library and most of it seems great! There is one issue I've run into: a Spanish phrase is detected as Finnish with a confidence of 1, and I'm unsure whether this is intended.

Phrase: ¿les gustan los pokemon?

With the following code:

package main

import (
	"log"

	"github.com/pemistahl/lingua-go"
)

func main() {
	detector := lingua.
		NewLanguageDetectorBuilder().
		FromAllSpokenLanguages().
		WithPreloadedLanguageModels().
		Build()

	content := "¿les gustan los pokemon?"
	lang, reliable := detector.DetectLanguageOf(content)
	log.Println(lang.String(), reliable)

	log.Println(" --- ")

	confidences := detector.ComputeLanguageConfidenceValues(content)
	for _, langConf := range confidences {
		log.Println(langConf.Language().String(), langConf.Value())
	}
}

The following output is produced:

2022/04/20 00:22:43 Finnish true
2022/04/20 00:22:43  --- 
2022/04/20 00:22:43 Finnish 1
2022/04/20 00:22:43 English 0.9883978684270469
2022/04/20 00:22:43 Indonesian 0.978563900119626
2022/04/20 00:22:43 Spanish 0.9747851212151981
2022/04/20 00:22:43 Croatian 0.9724182360849759
2022/04/20 00:22:43 Lithuanian 0.9647225277871057
2022/04/20 00:22:43 Estonian 0.9641581778214242
2022/04/20 00:22:43 Esperanto 0.9606587809451471
2022/04/20 00:22:43 Polish 0.9594230676987932
2022/04/20 00:22:43 Slovene 0.9546050214213473
2022/04/20 00:22:43 Malay 0.9541465232681227
2022/04/20 00:22:43 Albanian 0.9524198444722406
2022/04/20 00:22:43 Italian 0.9486618781887298
2022/04/20 00:22:43 Catalan 0.946963416607054
2022/04/20 00:22:43 Danish 0.9403916449998727
2022/04/20 00:22:43 Bosnian 0.9269675882527444
2022/04/20 00:22:43 Portuguese 0.9261989417434195
2022/04/20 00:22:43 German 0.919921338933763
2022/04/20 00:22:43 Sotho 0.9152876229202939
2022/04/20 00:22:43 Dutch 0.9145928120132025
2022/04/20 00:22:43 French 0.9140644855054184
2022/04/20 00:22:43 Slovak 0.9125324543349711
2022/04/20 00:22:43 Latvian 0.9119548274103094
2022/04/20 00:22:43 Tswana 0.9030296447404719
2022/04/20 00:22:43 Romanian 0.8980252449808623
2022/04/20 00:22:43 Nynorsk 0.8962667914904449
2022/04/20 00:22:43 Tagalog 0.8961041054613276
2022/04/20 00:22:43 Swedish 0.8861739698250194
2022/04/20 00:22:43 Hungarian 0.8860583424196719
2022/04/20 00:22:43 Bokmal 0.8860501842325473
2022/04/20 00:22:43 Swahili 0.8855438630695021
2022/04/20 00:22:43 Czech 0.877987508198549
2022/04/20 00:22:43 Welsh 0.8706583132077192
2022/04/20 00:22:43 Turkish 0.8635506224236865
2022/04/20 00:22:43 Yoruba 0.8618678522282041
2022/04/20 00:22:43 Basque 0.8587542505212317
2022/04/20 00:22:43 Afrikaans 0.8435800177987139
2022/04/20 00:22:43 Maori 0.8429171795365868
2022/04/20 00:22:43 Ganda 0.8407646218672701
2022/04/20 00:22:43 Icelandic 0.8248853640378799
2022/04/20 00:22:43 Tsonga 0.8245248538291974
2022/04/20 00:22:43 Irish 0.817982923494266
2022/04/20 00:22:43 Zulu 0.8175325635441859
2022/04/20 00:22:43 Shona 0.8008811823165958
2022/04/20 00:22:43 Xhosa 0.7829601259301775
2022/04/20 00:22:43 Vietnamese 0.774240344355879
2022/04/20 00:22:43 Azerbaijani 0.7541427903961347
2022/04/20 00:22:43 Somali 0.7538078988192347

I'm not sure why Spanish is only ranked 4th. Is there a good way to get around this? Unfortunately, given my use case, I need to detect from a wide range of languages like this.

This library is overall awesome, I'm using the latest stable release, thank you for this!
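One partial mitigation is to require a minimum relative distance between the top candidates, so that near-ties like the one above are reported as unreliable instead of confidently Finnish. The comment describes my understanding of the option, so please verify the behaviour:

	// Adapted from the program above. With a minimum relative distance set,
	// DetectLanguageOf should refuse to commit (ok == false) when the leading
	// candidates are as close together as in the confidence list above.
	detector := lingua.
		NewLanguageDetectorBuilder().
		FromAllSpokenLanguages().
		WithMinimumRelativeDistance(0.1). // example threshold, tune as needed
		WithPreloadedLanguageModels().
		Build()

	if lang, ok := detector.DetectLanguageOf("¿les gustan los pokemon?"); ok {
		log.Println(lang.String())
	} else {
		log.Println("no language could be reliably detected")
	}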

Library size optimization

Thanks for this very efficient library (it's the best I've tested so far).

Unfortunately, I struggle with its size: whatever parameters I choose, it adds around 120 MiB to my app (which is 50 MiB, assets included). Since I am using Kubernetes, the Docker image size matters.

I am only interested in checking a few languages, but it seems that whatever languages (or options) I choose, the whole package is still compiled. Maybe I am missing something ...

If not, it would be nice to be able to provide the languages as imports (lingua.English, ..., lingua.Languages) in order to keep the binary small.

Detection of multiple languages: bytes, runes

Detection of multiple languages sometimes returns indices in bytes, but sometimes in runes (code points):
To reproduce:

package main

import (
	"fmt"

	"github.com/pemistahl/lingua-go"
)

func main() {
	fmt.Printf("--- this will return indices in bytes:")
	sentence := "Parlez çççç? I would like"
	split(sentence)

	fmt.Printf("\n\n")
	fmt.Printf("--- this will return indices in code points (runes):")
	sentence = "ççççfran"
	split(sentence)
}

func split(sentence string) {
	languages := []lingua.Language{
		lingua.English,
		lingua.French,
	}

	detector := lingua.NewLanguageDetectorBuilder().
		FromLanguages(languages...).
		// WithLowAccuracyMode().
		Build()
	detectionResults := detector.DetectMultipleLanguagesOf(sentence)

	fmt.Printf("\ninput str:\n%s\n", sentence)

	// Dump the raw bytes of the input so the offsets can be compared.
	for i := 0; i < len(sentence); i++ {
		fmt.Printf("% x", sentence[i])
		// fmt.Printf("%q", sentence[i])
	}
	fmt.Printf("\n")

	for _, result := range detectionResults {
		fmt.Printf("\n%s %d %d :\n", result.Language(), result.StartIndex(), result.EndIndex())

		// Interpret the indices as byte offsets:
		fmt.Printf("%s: '%s'\n", result.Language(), sentence[result.StartIndex():result.EndIndex()])

		// Interpret the indices as rune offsets:
		fmt.Printf("%s: '%s'\n", result.Language(), string([]rune(sentence)[result.StartIndex():result.EndIndex()]))
	}
}

output:

--- this will return indices in bytes:
input str:
Parlez çççç? I would like
 50 61 72 6c 65 7a 20 c3 a7 c3 a7 c3 a7 c3 a7 3f 20 49 20 77 6f 75 6c 64 20 6c 69 6b 65

French 0 17 :
French: 'Parlez çççç? '
French: 'Parlez çççç? I wo'

English 17 29 :
English: 'I would like'
English: 'uld like'


--- this will return indices in code points (runes):
input str:
ççççfran
 c3 a7 c3 a7 c3 a7 c3 a7 66 72 61 6e

French 0 8 :
French: 'çççç'
French: 'ççççfran'

I tested with a long Italian text but the output is "17", which is English. How do I make it work correctly?

I tested an adaptation of basic.go that calls the Go function from C code:

package main

import "C"

import (
        "fmt"
        "github.com/pemistahl/lingua-go"
)

var lan lingua.Language

//var lan int

//export Langdetectfunct
func Langdetectfunct(text *C.char) int {

    textS := C.GoString(text);

    detector := lingua.NewLanguageDetectorBuilder().
        FromAllLanguages().
        Build()

    if language, exists := detector.DetectLanguageOf(textS); exists {
        lan = language
    }
    lan = lingua.English

    return int(lan)

}

func main() {
    // https://github.com/pemistahl/lingua-go/blob/main/language.go 

    testo := "Il liceo classico, noto in passato anche come ginnasio, è una scuola secondaria di secondo grado quinquennale a ciclo unico del sistema scolastico italiano incentrata sugli studi umanistici. Fu istituito come scuola d'élite con la riforma Gentile nel 1923, traendo origini dal ginnasio-liceo>

    ctesto := C.CString(testo);

    res := Langdetectfunct(ctesto);
    fmt.Println(res)
}

Although the Italian text is long, it outputs "17" which, according to the language codes list here: https://github.com/pemistahl/lingua-go/blob/main/language.go, is "English":

raphy@raohy:~/go-lang-detect$ go run basic.go
17

Why? Did I make a mistake? How do I make it work correctly?
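The likely culprit is in the adapted code itself rather than in the library: the line lan = lingua.English after the if block runs unconditionally, so whatever DetectLanguageOf returned is immediately overwritten with English, whose enum value is 17. A minimal correction of that part of Langdetectfunct:

	// Only fall back to English when no language could be detected;
	// previously lan was overwritten unconditionally after the if block.
	if language, exists := detector.DetectLanguageOf(textS); exists {
		lan = language
	} else {
		lan = lingua.English
	}

	return int(lan)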

How to get back an ISO 639-1 code

How do I get back the ISO 639-1 code for a language? My use case is a text that is turned to HTML and I want to set the lang attribute, e.g. lang="en" or lang="de".

Sadly, I'm new to Go; this is what I have in a test, and I always get back EU for English.

func TestLanguage(t *testing.T) {
	language := lingua.English
	if lingua.IsoCode639_1(language) != lingua.EN {
		t.Logf("Language: %s, ISO 639-1: %s", language.String(), lingua.IsoCode639_1(language).String())
		t.Fail()
	}
}

Result:

    page_test.go:84: Language: English, ISO 639-1: EU

I guess I'm confused about the IsoCode639_1 func and type?
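For what it's worth, IsoCode639_1 is a method on Language (as used in other issues above via language.IsoCode639_1()), whereas lingua.IsoCode639_1(language) is a type conversion that reinterprets English's integer value as an IsoCode639_1 constant, which is why an unrelated code (EU in your run) comes back. A corrected sketch of the test:

func TestLanguage(t *testing.T) {
	language := lingua.English
	// Call the method on the Language value instead of converting the type.
	if language.IsoCode639_1() != lingua.EN {
		t.Logf("Language: %s, ISO 639-1: %s", language.String(), language.IsoCode639_1().String())
		t.Fail()
	}
	// For an HTML lang attribute, lowercase the code, e.g. strings.ToLower(...) gives "en" or "de".
}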

Go type not supported in export: lingua.Language

I would like to call lingua-go functions from C++ code

I tried to generate the .h and .so files for this code (following the instructions found here: https://github.com/vladimirvivien/go-cshared-examples):

basic.go :

package main

import "C"

import (
        "github.com/pemistahl/lingua-go"
)

var lan lingua.Language

//export Langdetectfunct
func Langdetectfunct(text string) lingua.Language {

    detector := lingua.NewLanguageDetectorBuilder().
        FromAllLanguages().
        Build()

    if language, exists := detector.DetectLanguageOf(text); exists {
        lan = language
    }
    lan = lingua.English

    return lan

}

func main() {}

Doing:

raphy@raohy:~/go-lang-detect$ go build -o basic.so -buildmode=c-shared basic.go

I get :

raphy@raohy:~/go-lang-detect$ go build -o basic.so -buildmode=c-shared basic.go
# command-line-arguments
./basic.go:14:35: Go type not supported in export: lingua.Language
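cgo can only export C-compatible types in function signatures, so the enum has to cross the boundary as an integer (this mirrors what the other C-interop issue above already does). A minimal sketch, not a full solution:

package main

import "C"

import "github.com/pemistahl/lingua-go"

//export Langdetectfunct
func Langdetectfunct(text *C.char) C.int {
	detector := lingua.NewLanguageDetectorBuilder().
		FromAllLanguages().
		Build()

	// Fall back to English only if nothing could be detected.
	lan := lingua.English
	if language, exists := detector.DetectLanguageOf(C.GoString(text)); exists {
		lan = language
	}
	// lingua.Language is an integer-based enum, so its value survives the C boundary.
	return C.int(lan)
}

func main() {}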

Add low accuracy mode

Lingua's high detection accuracy comes at the cost of being noticeably slower than other language detectors. The large language models also consume significant amounts of memory. These requirements might not be feasible for systems running low on resources.

For users who want to classify mostly long texts or need to save resources, a so-called low accuracy mode will be implemented that loads only a small subset of the language models into memory. The API will be as follows:

lingua.NewLanguageDetectorBuilder().FromAllLanguages().WithLowAccuracyMode().Build()

The downside of this approach is that detection accuracy for short texts consisting of less than 120 characters will drop significantly. However, detection accuracy for texts which are longer than 120 characters will remain mostly unaffected.

Reduce "bloat"

Hi,

Thanks for the excellent work first and foremost, but may I suggest keeping metadata (e.g. 61c7054) separate, outside this repository? You could create another repo, e.g. github.com/pemistahl/lingua-go-accuracy-reports or similar.

The comparisons are useful, but they also currently bloat the repository, and they introduce quite a few extra dependencies, see https://github.com/pemistahl/lingua-go/blob/main/go.sum.

What do you think?

Detection of multiple languages strange results

I ran a few tests with lingua because I am interested in the "Detection of multiple languages in mixed-language texts" feature.

I checked the following texts:

Hello, I told you the house is green. Hallo, ich habe dir gesagt, das Haus ist grün.

Hallo, ich sage, das Haus ist grün. Hello, I told you the house is green.

Lingua returned the following to me:

First text:

English: 'Hello, I told you the house is green. Hallo, '
German: 'ich habe dir gesagt, das Haus ist grün.'

Second text:

German: 'Hallo, ich sage das Haus ist grün. Hello, I '
English: 'told you the house is green.'

However, neither "Hello" should be accepted as a correct German word, nor "Hallo" as an English one.

Perhaps a parameter could be added to DetectMultipleLanguagesOf that ensures punctuation marks are taken into account and only one language is returned per sentence.

Add possibility to select language by ISO string as part of this library

I would really appreciate the possibility to select a language by its ISO string as part of this library. I plan to load some configuration, including the ISO string, from JSON, and keeping the mapping myself is kind of a pain. Something along these lines would be great:

var stringToIsoCode639_1 = map[string]IsoCode639_1{
	"AF": AF,
	...
}

func GetLanguageFromStringIsoCode639_1(code string) Language {
	for _, language := range AllLanguages() {
		if language.IsoCode639_1() == stringToIsoCode639_1[code] {
			return language
		}
	}
	return -1
}

Also, I noticed that this function is not exactly optimal: it has linear complexity with regard to the number of languages. That is probably not noticeable given the relatively small number of languages, but it could still be optimized with a lookup map:

var IsoCode639_1ToLanguage = map[IsoCode639_1]Language{
	AF: Afrikaans,
	...
}

func GetLanguageFromIsoCode639_1(isoCode IsoCode639_1) Language {
	if val, ok := IsoCode639_1ToLanguage[isoCode]; ok {
		return val
	}
	return -1
}
