ai-ku / morse.jl

Paper: Morphological Analysis Using a Sequence Decoder

Home Page: https://www.transacl.org/ojs/index.php/tacl/article/view/1654

License: MIT License

nlp machine-learning morphological-disambiguator morphological-analyser

morse.jl's Introduction

Morse

Morse is the morphological analysis model described in:

Akyürek, Ekin, Erenay Dayanık, and Deniz Yuret. "Morphological Analysis Using a Sequence Decoder." Transactions of the Association for Computational Linguistics 7 (2019): 567-579. (TACL, arXiv).

Dependencies

Julia 1.x
Network connection

Installation

   git clone https://github.com/ai-ku/Morse.jl
   cd Morse.jl

Note: The Setup and Data steps are optional, because running an experiment from the scripts directory automatically sets up the environment and downloads the required data when needed. However, if you are working on a cluster node without an internet connection, you may need to perform these steps manually. To get the pkg> prompt in Julia for package commands, press the ']' key. Backspace returns to the original julia> prompt.

Setup (Optional)

   julia> # Press the `]` key to get the `pkg>` prompt
   (v1.1) pkg> activate .
   (v1.1) Morse> instantiate # only needed the first time
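
On a cluster or in a batch script, the same two steps can be run without the interactive pkg> prompt through Julia's standard Pkg API. The sketch below assumes the working directory is the cloned Morse.jl folder; `setup_project` is a hypothetical helper name, not something Morse provides.

```julia
using Pkg

# Hypothetical helper (not part of Morse.jl): the non-interactive
# equivalent of `activate .` followed by `instantiate`.
function setup_project(path::AbstractString=".")
    Pkg.activate(path)   # use the Project.toml found at `path`
    Pkg.instantiate()    # install the dependencies it lists (needs network access)
    return nothing
end

# Usage, from inside the cloned repository:
# setup_project(".")
```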

Data (Optional)

   julia> using Morse
   julia> download(TRDataSet)
   julia> download(UDDataSet)

Experiments

To verify the results presented in the paper, you can run the scripts that train the models and ablations. During training, logs are written to the logs/ folder.

Detailed information about experiments can be found in scripts/

Note: An Nvidia GPU is required to train the models in a reasonable amount of time.

Tagging

Available Pre-Trained Models

trained(MorseModel, TRDataSet);
trained(MorseModel, UDDataSet, lang="ru"); # Russian
trained(MorseModel, UDDataSet, lang="da"); # Danish
trained(MorseModel, UDDataSet, lang="fi"); # Finnish
trained(MorseModel, UDDataSet, lang="pt"); # Portuguese
trained(MorseModel, UDDataSet, lang="es"); # Spanish
trained(MorseModel, UDDataSet, lang="hu"); # Hungarian
trained(MorseModel, UDDataSet, lang="bg"); # Bulgarian
trained(MorseModel, UDDataSet, lang="sv"); # Swedish

How To Use

Note: Please use lowercased and tokenized inputs.

   julia> using Knet, KnetLayers, Morse
   julia> model, vocabulary, parser = trained(MorseModel, TRDataSet);
   julia> predictions = model("annem sana yardım edemez .", v=vocabulary, p=parser)
   annem anne+Noun+A3sg+P1sg+Nom
   sana sen+Pron+Pers+A2sg+Pnon+Dat
   yardım yardım+Noun+A3sg+Pnon+Nom
   edemez et+Verb^DB+Verb+Able+Neg+Aor+A3sg
   . .+Punct
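
Since the tagger expects lowercased, tokenized input, raw text needs a small preprocessing step first. The following is only a minimal sketch: `simple_preprocess` is a hypothetical helper, not part of Morse, and it separates just a handful of common punctuation marks. Note also that Julia's `lowercase` is not locale-aware, so an uppercase Turkish `I` becomes `i` rather than `ı`; a real pipeline may need to handle that case itself.

```julia
# Hypothetical helper (not part of Morse.jl): lowercase the sentence and
# put spaces around a few common punctuation marks so they become tokens.
function simple_preprocess(sentence::AbstractString)
    spaced = replace(sentence, r"([.,!?;:])" => s" \1 ")
    # split() with no arguments splits on any whitespace and drops empty pieces
    return join(split(lowercase(spaced)), " ")
end

println(simple_preprocess("Annem sana yardım edemez."))
```

The returned string can then be passed to the model in place of the hand-tokenized sentence used above.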

Customized Training

Note: An Nvidia GPU is required to train the models in a reasonable amount of time.

   julia> using Knet, KnetLayers, Morse
   julia> config = Morse.intro(split("--logFile nothing --lemma --dataSet TRDataSet")) # the program arguments can be modified here
   julia> dataFiles = ["train.txt", "test.txt"] # make sure these files exist at the given paths
   julia> data, vocab, parser = prepareData(dataFiles,TRDataSet) # or UDDataSet
   julia> data = miniBatch(data,vocab) # sentence-level minibatching is required to process each sentence correctly
   julia> model = MorseModel(config,vocab)
   julia> setoptim!(model, SGD(;lr=1.6,gclip=60.0))
   julia> trainmodel!(model,data,config,vocab,parser) # can take hours or more, depending on your data
   julia> predictions = model("annem sana yardım edemez .", v=vocab, p=parser)

morse.jl's Issues

can we add time estimates to the instructions in scripts/README?

Todo list before EMNLP

Once it reaches its final form, let's push this to the ai-ku repo; that is the address we gave in the paper. Nobody should have to watch a video to use it: the instructions should be as compact as possible and live in the README.

I have temporarily added a note to the ai-ku repo along the lines of "this repo is for reproducing the results exactly; if you want to use the model, look here". Once it reaches its final form, we need to re-test the experiments before pushing to ai-ku.

No need. Let's tag a version on ai-ku (for replicability) and concentrate on ease of use.

While following the steps in the video, I am still getting stuck on this error:

julia> using Knet, KnetLayers, Morse
[ Info: Recompiling stale cache file /home/deniz/.julia/compiled/v1.2/Knet/f4vSz.ji for Knet [1902f260-5fb4-5aff-8c31-6271790ab950]
ERROR: LoadError: LoadError: syntax: malformed "import" statement

You got this error because you ran it on 1.2. It turns out the Knet version we used for the experiments does not work on 1.2.

I understand the motivation for freezing versions with Manifest.toml, but I think it pins Knet and KnetLayers versions that were never released (and therefore never properly tested). That is OK while we work among ourselves, but not for a public release. Let's update the code to support standard (if possible the latest) released versions, and express any supported version limits in Project.toml. As far as I know, Manifest.toml is a file that should not be released; you can ask the Julia folks what the right procedure is.

For motivation, think of it this way: people will take this code, use it as a subroutine, and build things on top of it, for example a Turkish parser or an NMT system. It is not appropriate to force them to work with an old Julia version and unreleased, unstable package versions. Getting the code to work with the latest Julia and the latest released Knet/KnetLayers, how much

I got past the error in #2 by using Julia 1.1.

It looks like you ran it on v1.2?

No, I was able to run it on v1.1.

But let's not force people to use an old Julia.

I agree, but this is purely about reproducibility. For example, when the Knet version changes, the position of dropout in the rnn or the rnn initialization type can change. Things like that can change our results, even if only slightly. What will we say when someone tells us our code does not give the same result? Right now, thanks to Manifest.toml, we have code that always runs the same way as long as you use the right Julia.

In its current state, the chance that anyone will make the effort to use this code is close to 0. Even if they manage to replicate it, the chance that they can build something on top of it and do research is close to 0 (nobody builds new code on non-standard Julia/package versions).

I think the easiest way to address the concern I raised is to assign a version to the current state. Or, as is the case now, to freeze the ai-ku repo, redirect to the repo under my account, and make that one user-friendly.

Let's assign a version and mention it in the README. We pointed to ai-ku in the paper, so let's do all development there; to avoid confusion it is better not to have two repos.

The warning below shows a Base function being overridden. Don't do this (I think I wrote about it before); if you must, do it only for your own types:
julia> using Morse
[ Info: Precompiling Morse [1ee78b96-f10b-11e8-2f7c-e192173d3e61]
WARNING: Method definition show(IO, Union{Float32, Float64}) in module Grisu at grisu/grisu.jl:158 overwritten in module Morse at /home/deniz/.julia/projects/Morse.jl/src/util.jl:5.
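
As an illustration of the "only for your own types" advice, a package can wrap the value in a type it owns and extend `Base.show` for that type, which leaves the display of every other `Float32`/`Float64` untouched. This is a minimal sketch; `Score` is a hypothetical type, not something defined in Morse.

```julia
# Hypothetical wrapper type owned by the package (not from Morse.jl).
struct Score
    value::Float64
end

# Extending Base.show for our own type is safe: it cannot change how
# plain floats print elsewhere in the user's session.
Base.show(io::IO, s::Score) = print(io, round(s.value; digits=2))

println(Score(0.98765))   # prints the rounded value
println(0.98765)          # plain floats are unaffected
```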

You're right, I got lazy there.

@doc MorseModel (your main export) does not say much; at the very least the API should be explained in detail. For example, the method you use in the video, model(sentence::String), is undocumented. I don't know what the third element of the output tuples is. You refer to the paper, but the paper contains the theory, not the API. Make sure every usage is explained in a docstring. Give examples for typical use cases (for example, tagging all the sentences in a file). Functions like predict/loss should also be documented.
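
To make the request concrete, here is a rough sketch of what a documented "tag every sentence in a file" helper could look like. Both `tag_file` and `dummy_tagger` are hypothetical names used only for illustration; with Morse, the tagger argument would presumably be a closure around the trained model, assuming its result can be written with `println`.

```julia
"""
    tag_file(tagger, inpath, outpath) -> Int

Apply `tagger` to every non-empty line of `inpath`, write one result per
line to `outpath`, and return the number of sentences processed.
`tagger` is any function that maps a sentence string to printable output.
"""
function tag_file(tagger, inpath::AbstractString, outpath::AbstractString)
    count = 0
    open(outpath, "w") do out
        for line in eachline(inpath)
            sentence = strip(line)
            isempty(sentence) && continue   # skip blank lines
            println(out, tagger(sentence))
            count += 1
        end
    end
    return count
end

# Stand-in tagger used only to exercise the helper.
dummy_tagger(s) = join((w * "+Unk" for w in split(s)), " ")
```

Keeping the tagger as a plain function argument means the helper does not depend on any particular model type.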

Yes, the repo is not a Julia pkg yet. Let's make it usable first, then we will write this.

Documentation is a general software engineering requirement, not just something for a Julia package :)

By the way, exporting names like predict/loss could be dangerous: say I combine 5 different models to build something; if every package tries to export predict/loss, things may not go well (although it is OK as long as the argument types are model specific). Let's think and talk about what kind of standard API we could use to combine multiple models. I would like to be able to say using Morse, VGG, Transformer, BERT, ResNet, YoLo and so on, and define a model that uses them together in 5 minutes, with each of them having a standard API for training or for loading pretrained weights, etc.

You're right.

Let's talk about this when you have time.

Why does the README say limited support under Tagging?

Because they cannot download every language.

Let's make the other languages available too.

The tagger processes a file line by line very slowly (2-3 sentences per second). Is there a better method for bulk processing, or is the model slow in general?

There is no batching anywhere in the code. So performance can be improved, but I don't think it can scale. Are you on a GPU?

Yes, on a GPU, and a V100 at that. Even without batching, couldn't we do something by processing all the words in a sentence in parallel as much as possible? The character based word embeddings can all be computed in parallel, the context embeddings can be computed in parallel; only the output embedding has to go one step at a time.

Ideally, let's finish these tasks before EMNLP.

I have two exams this week; I am not sure about afterwards, but we'll talk. We also have the poster to prepare, so it may be an ambitious goal. If you open PRs, I will try to contribute through those as well.

ok. For now let's tag a version and get master working with the latest julia/package versions. Then downloadable models. Then efficiency. I will take a first pass and write up the problems.

So that I can start working, let's consolidate everything on the ai-ku repo. Let's delete the ekinakyurek repo. Let's make a valid release (v1.0.0) for replication.
There are errors in the ai-ku README: the paper link is wrong (let's use the tacl link; arxiv v2 is also correct, but v1 is not).
The first sentence is wrong: This repo is not a collection of morphological taggers. Anyway, I will rewrite the readme from scratch once the tool is stable.
I will update the code for the latest julia/Knet/KnetLayers. If replication still works, let's delete the version we created in #1 and recreate it at this point.
Let's make the trained weights of all models easily downloadable. Is MorseModel(url) suitable for this?
I opened an issue and a PR in KnetLayers. In this state it passes the tests and runs Morse. Let's tag a new version if there is nothing else serious; Morse does not work with the currently released 0.1.0.
I opened an issue and a PR in Morse (ai-ku). In this state it compiles and runs. So it will most likely work with the latest package versions once KnetLayers is tagged. I did not run training to the end and check the results, but we can do that for a table or two and give the "v1.0.0" tag to this new state.

Fails during installation

I'm getting the following error during installation:

This happens regardless of whether I use
add https://github.com/ai-ku/Morse.jl
or
dev https://github.com/ai-ku/Morse.jl

Getting `TypeError: in Type, in typeassert, expected Int32, got Int64` when trying to load trained models.

Hello,

I am trying to use Morse's pre-trained models, without luck.
I have installed Morse, then tried to load a pre-trained model as you show in the readme file:
model, vocabulary, parser = trained(MorseModel, TRDataSet);
but I got the following error:

TypeError: in Type, in typeassert, expected Int32, got Int64

Stacktrace:
 [1] (::getfield(Core, Symbol("#kw#Type")))(::NamedTuple{(:input, :output),Tuple{Int32,Int64}}, ::Type{Multiply}) at ./none:0
 [2] WordEncoder(::Dict{Symbol,Any}, ::Vocabulary) at /path/to/Morse.jl/src/models.jl:62
 [3] MorseModel(::Dict{Symbol,Any}, ::Vocabulary) at /path/to/Morse.jl/src/models.jl:250
 [4] loadModel(::String) at /path/to/Morse.jl/src/util.jl:81
 [5] #trained#73(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Type, ::Vararg{Type,N} where N) at /path/to/Morse.jl/src/util.jl:68
 [6] trained(::Type, ::Vararg{Type,N} where N) at /path/to/Morse.jl/src/util.jl:67
 [7] top-level scope at In[3]:1

To work around that, I applied a quick but not clean fix: I replaced every possible Int64 with convert(Int32, o[:some_symbol]) in the models.jl file. For example:

WordEncoder(o::Dict,v::Vocabulary) = WordEncoder(
    Embed(input=convert(Int32, length(v.chars)), output=convert(Int32, o[:embedSizes][1])),
    Dropout(p=o[:dropouts]),
    LSTM(input=convert(Int32, o[:embedSizes][1]), hidden=convert(Int32, o[:hiddenSizes][1]), seed=convert(Int32, o[:seed])) |> fgbias!
)
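
For readers unfamiliar with this class of error: one way a TypeError like this can arise in Julia is that type-annotated keyword arguments are checked with a typeassert rather than converted, so passing an Int64 where Int32 is annotated fails. Whether this is exactly what happens inside KnetLayers is an assumption; the toy definitions below (`ToyLayer`, `strict_layer`, `lenient_layer`) are hypothetical and only demonstrate the mechanism and a looser alternative.

```julia
# Hypothetical toy type; fields are stored as Int32.
struct ToyLayer
    input::Int32
    output::Int32
end

# Strict keyword annotations: values are type-asserted, not converted.
strict_layer(; input::Int32, output::Int32) = ToyLayer(input, output)

# Looser annotations: any Integer is accepted, and the default
# constructor converts each value to the Int32 field type.
lenient_layer(; input::Integer, output::Integer) = ToyLayer(input, output)

err = try
    strict_layer(input=Int32(3), output=Int64(4))
    nothing
catch e
    e
end
println(typeof(err))

layer = lenient_layer(input=Int32(3), output=Int64(4))
println(layer)
```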

After that fix, I got the pre-trained models loaded. However, I got an error when trying to use them.
predictions = model("annem sana yardım edemez .", v=vocabulary, p=parser)

MethodError: no method matching EncodedAnalysis(::Int64, ::Int32)
Closest candidates are:
  EncodedAnalysis(::Any, ::Any, !Matched::Any, !Matched::Any) at /path/to/Morse.jl/src/parser.jl:66
  EncodedAnalysis(!Matched::Int32, ::Int32) at /path/to/Morse.jl/src/parser.jl:72

Stacktrace:
 [1] #EncodedIO#6(::Vocabulary, ::String, ::Int32, ::Type, ::Array{Char,1}) at /path/to/Morse.jl/src/parser.jl:113
 [2] (::getfield(Core, Symbol("#kw#Type")))(::NamedTuple{(:v, :mask, :len),Tuple{Vocabulary,String,Int32}}, ::Type{EncodedIO}, ::Array{Char,1}) at ./none:0
 [3] (::getfield(Morse, Symbol("##40#42")){Vocabulary,Int32})(::SubString{String}) at /path/to/Morse.jl/src/models.jl:23
 [4] _collect(::Array{SubString{String},1}, ::Base.Generator{Array{SubString{String},1},getfield(Morse, Symbol("##40#42")){Vocabulary,Int32}}, ::Base.EltypeUnknown, ::Base.HasShape{1}) at ./generator.jl:47
 [5] collect_similar(::Array{SubString{String},1}, ::Base.Generator{Array{SubString{String},1},getfield(Morse, Symbol("##40#42")){Vocabulary,Int32}}) at ./array.jl:561
 [6] map at ./abstractarray.jl:1987 [inlined]
 [7] #predict#39(::Vocabulary, ::Parser{TRDataSet}, ::Int32, ::Function, ::MorseModel, ::String) at /path/to/Morse.jl/src/models.jl:23
 [8] #predict at ./none:0 [inlined]
 [9] #call#53 at /path/to/Morse.jl/src/models.jl:329 [inlined]
 [10] (::getfield(Morse, Symbol("#kw#MorseModel")))(::NamedTuple{(:v, :p),Tuple{Vocabulary,Parser{TRDataSet}}}, ::MorseModel, ::String) at ./none:0
 [11] top-level scope at In[4]:1

I am on Ubuntu 18.04 with Julia 1.0.4. Could you help me with that?

Error While Running the Script trmor2018_full.sh

I get the following errors when I try to run trmor2018_full.sh.

nothing in update!

I got:

julia> Morse.main(split("--dataSet TRDataSet --version 2006 --epochs 1 --lemma --dropouts 0.3 --patience 6"))
ERROR: MethodError: update!(::Knet.KnetArray{Float32,3}, ::Nothing, ::Knet.SGD) is ambiguous. Candidates:                            
  update!(w::Knet.KnetArray{Float32,N} where N, g, p::Knet.SGD) in Knet at /kuacc/users/dyuret/.julia/dev/Knet/src/update.jl:475     
  update!(w::Knet.KnetArray{Float32,N} where N, g::Nothing, p) in Knet at /kuacc/users/dyuret/.julia/dev/Knet/src/update.jl:547      
Possible fix, define                                                                                                                 
  update!(::Knet.KnetArray{Float32,N} where N, ::Nothing, ::Knet.SGD)

which I worked around by defining:

Knet.update!(w::KnetArray{Float32}, ::Nothing, ::Knet.SGD) = w

We can open an issue to Knet about this later. But here my question is: are we expecting nothing gradients in normal operation or is this pointing to a bug?
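
The ambiguity itself can be reproduced outside Knet with two toy methods, which may help when discussing it upstream. In this sketch `ToySGD` and `toy_update!` are hypothetical stand-ins, not Knet's types or functions: one method is more specific in the gradient argument, the other in the optimizer argument, so neither wins for a `nothing` gradient with a `ToySGD` optimizer until a method covering exactly that combination is added.

```julia
# Hypothetical stand-ins for the optimizer type and update function.
struct ToySGD end

# More specific in the third argument (optimizer type).
toy_update!(w::Vector{Float32}, g, p::ToySGD) = (w .-= 0.5f0 .* g; w)
# More specific in the second argument (missing gradient).
toy_update!(w::Vector{Float32}, g::Nothing, p) = w

w = Float32[1, 2]

# Neither method is more specific than the other for this call,
# so Julia raises a MethodError reporting the ambiguity.
ambiguous = try
    toy_update!(w, nothing, ToySGD())
    false
catch err
    err isa MethodError
end
println("ambiguous call failed: ", ambiguous)

# Same shape of fix as the workaround above: cover the intersection explicitly.
toy_update!(w::Vector{Float32}, ::Nothing, ::ToySGD) = w
println(toy_update!(w, nothing, ToySGD()))
```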

Morse.main gives no progress feedback

I started training for one epoch on TrMor2006. The logfile stopped at the vocabLengths line. I have no idea if the code is stuck on an infinite loop, whether I should wait one hour or one week, or whether the loss is improving. Some periodic feedback would be nice.
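
As a rough idea of the kind of feedback meant here, a training loop could print the running average loss and elapsed time every few batches. This is only a sketch with hypothetical names (`train_with_progress`, `step!`, `report_every`); it does not reflect how Morse.main is actually structured.

```julia
# Hypothetical sketch: `step!(i)` processes batch `i` and returns its loss.
function train_with_progress(step!, nbatches::Integer;
                             report_every::Integer=100, io::IO=stdout)
    total = 0.0
    start = time()
    for i in 1:nbatches
        total += step!(i)
        if i % report_every == 0 || i == nbatches
            elapsed = round(time() - start; digits=1)
            avg = round(total / i; digits=4)
            println(io, "batch $i/$nbatches  avg loss $avg  elapsed $(elapsed)s")
            flush(io)   # make the line visible immediately in log files
        end
    end
    return total / nbatches
end

# Usage with a dummy step that always reports a loss of 1.0:
train_with_progress(i -> 1.0, 250; report_every=100)
```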

Speed Issues

1. CuDNN-RNN sorting requirement:

To call CuDNN, we need to sort the sequences by length. This breaks the order of the words in the sentence, so we need to reorder the hidden states after the call. This might be a slow procedure, and we might come up with an alternative. Any ideas @denizyuret? I guess we could pad the sequences on the right to make them the same length and process them all at once, set the embeddings of the pads to 0, and expect the RNN to learn to ignore those 0s.

Morse.jl/src/models.jl

Line 76 in 2dfe277

return out.hidden[:,:,end][:,indices], out.memory[:,:,end][:,indices]
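
For reference, the sort-then-restore step and the right-padding alternative can be sketched in plain Julia without CuDNN. Everything below is illustrative: the variable names and `rightpad` are hypothetical, and the "hidden states" are just a stand-in matrix with one column per word.

```julia
words = ["sana", "yardım", ".", "annem"]
lens  = length.(words)

# Sort by decreasing length and remember how to undo the sort.
order   = sortperm(lens; rev=true)
sorted  = words[order]
restore = invperm(order)            # inverse permutation

# Stand-in for RNN output: one column per (sorted) word.
hidden = reshape(Float32.(lens[order]), 1, :)
hidden_original_order = hidden[:, restore]

# Alternative: right-pad index sequences to a common length.
function rightpad(seqs::Vector{Vector{Int}}, pad::Int)
    maxlen = maximum(length, seqs)
    batch = fill(pad, length(seqs), maxlen)
    for (i, s) in enumerate(seqs)
        batch[i, 1:length(s)] = s   # copy the sequence, leave pads on the right
    end
    return batch
end

println(sorted[restore] == words)
println(rightpad([[1, 2, 3], [4]], 0))
```

With padding, no reordering is needed after the RNN call, at the cost of computing over pad positions.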

2. Minibatching at decoding time:

1- We can't process all words at once because of the output encoding. Should we stick to the version of the model without output encoding for a faster public model? @denizyuret

2- I believe we should be able to eliminate the for loop over the characters of a word (for t=1:timeSteps). Is there any obstacle to that? I guess this is only helpful in training.