
vecto's People

Contributors

annargrs, bakarov, dependabot[bot], libofang, ookimi, undertherain, vatai, windj007

vecto's Issues

Fail to reproduce the result in "Subcharacter Information in Japanese Embeddings: When Is It Worth It?"

Hi.

I tried to use this project to reproduce the results in

Karpinska, Marzena, et al. "Subcharacter Information in Japanese Embeddings: When Is It Worth It?." Proceedings of the Workshop on the Relevance of Linguistic Structure in Neural Architectures for NLP. 2018.

I used the openly shared Japanese word vectors (without character or subcharacter information).

However, for each subset the result I get is much lower than reported in the paper (method = 3CosAdd).

My code to inspect the result is as follows:

import os
import json

jBATS_folder = '../analogy_results/word_analogy/JBATS_1.0/'
result_file = os.path.join(jBATS_folder, '19.05.29_21.14.41', 'results.json')
with open(result_file, 'r') as f:
    data = json.load(f)
for subset in data:
    print(len(subset['details']))
    print(subset['result'])

The result is as follows:

2450
{'cnt_questions_correct': 1090, 'cnt_questions_total': 2450, 'accuracy': 0.4448979591836735}
2550
{'cnt_questions_correct': 2558, 'cnt_questions_total': 5000, 'accuracy': 0.5116}
3192
{'cnt_questions_correct': 3537, 'cnt_questions_total': 8192, 'accuracy': 0.4317626953125}
2450
{'cnt_questions_correct': 5636, 'cnt_questions_total': 10642, 'accuracy': 0.529599699304642}
2450
{'cnt_questions_correct': 6832, 'cnt_questions_total': 13092, 'accuracy': 0.5218454017720745}
3192
{'cnt_questions_correct': 7826, 'cnt_questions_total': 16284, 'accuracy': 0.48059444853844263}
2450
{'cnt_questions_correct': 8885, 'cnt_questions_total': 18734, 'accuracy': 0.47427137824276716}
2450
{'cnt_questions_correct': 10426, 'cnt_questions_total': 21184, 'accuracy': 0.49216389728096677}
2450
{'cnt_questions_correct': 11819, 'cnt_questions_total': 23634, 'accuracy': 0.500084623847}
2450
{'cnt_questions_correct': 12853, 'cnt_questions_total': 26084, 'accuracy': 0.49275417880693145}
2450
{'cnt_questions_correct': 13217, 'cnt_questions_total': 28534, 'accuracy': 0.46320179435059927}
2450
{'cnt_questions_correct': 14000, 'cnt_questions_total': 30984, 'accuracy': 0.4518461141234185}
2450
{'cnt_questions_correct': 14365, 'cnt_questions_total': 33434, 'accuracy': 0.42965244960220134}
2450
{'cnt_questions_correct': 14621, 'cnt_questions_total': 35884, 'accuracy': 0.40745178909820534}
2450
{'cnt_questions_correct': 14955, 'cnt_questions_total': 38334, 'accuracy': 0.3901236500234779}
2652
{'cnt_questions_correct': 15478, 'cnt_questions_total': 40986, 'accuracy': 0.3776411457570878}
2450
{'cnt_questions_correct': 15695, 'cnt_questions_total': 43436, 'accuracy': 0.36133621880467814}
2450
{'cnt_questions_correct': 15926, 'cnt_questions_total': 45886, 'accuracy': 0.347077539990411}
2450
{'cnt_questions_correct': 16339, 'cnt_questions_total': 48336, 'accuracy': 0.33802962595167163}
2450
{'cnt_questions_correct': 17286, 'cnt_questions_total': 50786, 'accuracy': 0.3403693931398417}
2450
{'cnt_questions_correct': 17980, 'cnt_questions_total': 53236, 'accuracy': 0.33774137801487714}
2352
{'cnt_questions_correct': 18442, 'cnt_questions_total': 55588, 'accuracy': 0.3317622508455062}
2162
{'cnt_questions_correct': 19739, 'cnt_questions_total': 57750, 'accuracy': 0.3418008658008658}
2450
{'cnt_questions_correct': 19949, 'cnt_questions_total': 60200, 'accuracy': 0.33137873754152825}
2450
{'cnt_questions_correct': 20158, 'cnt_questions_total': 62650, 'accuracy': 0.321755786113328}
2450
{'cnt_questions_correct': 20252, 'cnt_questions_total': 65100, 'accuracy': 0.3110906298003072}
2450
{'cnt_questions_correct': 20253, 'cnt_questions_total': 67550, 'accuracy': 0.2998223538119911}
2450
{'cnt_questions_correct': 20472, 'cnt_questions_total': 70000, 'accuracy': 0.29245714285714286}
2450
{'cnt_questions_correct': 20640, 'cnt_questions_total': 72450, 'accuracy': 0.28488612836438926}
2550
{'cnt_questions_correct': 20657, 'cnt_questions_total': 75000, 'accuracy': 0.27542666666666665}
2450
{'cnt_questions_correct': 20741, 'cnt_questions_total': 77450, 'accuracy': 0.26779857972885734}
2450
{'cnt_questions_correct': 20869, 'cnt_questions_total': 79900, 'accuracy': 0.26118898623279097}
2450
{'cnt_questions_correct': 21030, 'cnt_questions_total': 82350, 'accuracy': 0.2553734061930783}
2450
{'cnt_questions_correct': 21086, 'cnt_questions_total': 84800, 'accuracy': 0.24865566037735848}
2450
{'cnt_questions_correct': 21170, 'cnt_questions_total': 87250, 'accuracy': 0.24263610315186246}
2450
{'cnt_questions_correct': 21242, 'cnt_questions_total': 89700, 'accuracy': 0.23681159420289855}
2450
{'cnt_questions_correct': 21375, 'cnt_questions_total': 92150, 'accuracy': 0.23195876288659795}
2450
{'cnt_questions_correct': 21636, 'cnt_questions_total': 94600, 'accuracy': 0.22871035940803383}
2450
{'cnt_questions_correct': 21859, 'cnt_questions_total': 97050, 'accuracy': 0.2252344152498712}
2450
{'cnt_questions_correct': 22208, 'cnt_questions_total': 99500, 'accuracy': 0.2231959798994975}

The command used to run the task is:

python -m vecto benchmark analogy Karpinska/word/vectors Karpinska/JBATS_1.0 --path_out analogy_results/ --method 3CosAdd

The embeddings and jBATS set are from
http://vecto.space/projects/jBATS/

Could you please tell me what the result in the output file means and how to get the accuracy for each subset correctly?
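Looking at the numbers above, cnt_questions_total appears to be cumulative across subsets, so per-subset accuracy could presumably be recovered by differencing consecutive entries. A rough sketch under that assumption, reusing result_file from the snippet above:

import json

# result_file as defined in the snippet above
with open(result_file, 'r') as f:
    data = json.load(f)

prev_correct = prev_total = 0
for subset in data:
    correct = subset['result']['cnt_questions_correct']
    total = subset['result']['cnt_questions_total']
    # difference against the previous (cumulative) entry to isolate this subset
    print((correct - prev_correct) / (total - prev_total))
    prev_correct, prev_total = correct, total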

Thank you.

Cannot run intrinsic evaluation

When I ran the command line like this:
python3 -m vecto.benchmarks.analogy /path/to/config_analogy.yaml

It raised the following error:
No module named vecto.benchmarks.analogy.__main__; 'vecto.benchmarks.analogy' is a package and cannot be directly executed

When I tried to evaluate from code, like this:

import vecto

path_model = "./test/data/embeddings/text/plain_no_file_header"
model = vecto.model.load_from_dir(path_model)
options = {}
options["path_dataset"] = "./test/data/benchmarks/analogy/"
options["path_results"] = "/tmp/vecto/analogy"
options["name_method"] = "3CosAdd"
vecto.benchmarks.analogy.analogy.run(model, options)

It raised an error: AttributeError: module 'vecto' has no attribute 'model'

Am I doing it right? I cannot evaluate with either method.
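For what it's worth, the first issue above runs the benchmark through the package-level CLI (python -m vecto benchmark analogy ...) rather than the subpackage, and a later issue loads embeddings via vecto.embeddings instead of vecto.model. A minimal sketch along those lines, assuming that API (untested here):

import json
from vecto.embeddings import load_from_dir
from vecto.benchmarks.analogy.analogy import Analogy

# load_from_dir / Analogy().run as used in the k-fold issue further down
model = load_from_dir("./test/data/embeddings/text/plain_no_file_header")
results = Analogy().run(model, "./test/data/benchmarks/analogy/")
with open("results.json", "w") as f:
    json.dump(results, f)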

vecto/examples/test_benchmarks.ipynb not working

Hi,

I tried to run the notebook test_benchmarks.ipynb because I need to perform all three types of intrinsic evaluation that are tested in that notebook (categorization, outlier detection, synonymy detection), but I ran into some issues.

For categorization I get an AttributeError: 'KMeansCategorization' object has no attribute 'get_result'.

For outliers and synonymy_detection, using the test datasets in /tests/data/benchmarks/outliers and /tests/data/benchmarks/synonymy_detection, I get a ValueError: not enough values to unpack (expected 4, got 1).

Thank you,
Anna

Interpreting intrinsic analogy evaluation result

Hi, I'm having some trouble interpreting the output I get for my analogy run on the BATS dataset.
What do landing_a, landing_a_prime, landing_b, and landing_b_prime mean?

Also, the dataset only contains pairs of values like able->unable, so are these b and b_prime? How does the program know which are a and a_prime, respectively?

Thanks a lot in advance.

incompatible array types Error

Hi,

I'm trying to train embeddings on multilingual data and get the following incompatible-array exception with these parameters:
--subword bilstm --dimension 300 --verbose --gpu

Traceback (most recent call last):
  File "/usr/lib64/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib64/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/mount/arbeitsdaten31/studenten1/raufmz/virtual/vecto/lib/python3.7/site-packages/vecto/embeddings/train_word2vec.py", line 260, in <module>
    main()
  File "/mount/arbeitsdaten31/studenten1/raufmz/virtual/vecto/lib/python3.7/site-packages/vecto/embeddings/train_word2vec.py", line 256, in main
    run(args)
  File "/mount/arbeitsdaten31/studenten1/raufmz/virtual/vecto/lib/python3.7/site-packages/vecto/embeddings/train_word2vec.py", line 242, in run
    model = train(args)
  File "/mount/arbeitsdaten31/studenten1/raufmz/virtual/vecto/lib/python3.7/site-packages/vecto/embeddings/train_word2vec.py", line 235, in train
    model = create_model(args, model, vocab)
  File "/mount/arbeitsdaten31/studenten1/raufmz/virtual/vecto/lib/python3.7/site-packages/vecto/embeddings/train_word2vec.py", line 120, in create_model
    model.matrix = cuda.to_cpu(net.getEmbeddings())
  File "/mount/arbeitsdaten31/studenten1/raufmz/virtual/vecto/lib/python3.7/site-packages/vecto/embeddings/utils/subword.py", line 488, in getEmbeddings
    return self.getEmbeddings_f()
  File "/mount/arbeitsdaten31/studenten1/raufmz/virtual/vecto/lib/python3.7/site-packages/vecto/embeddings/utils/subword.py", line 528, in getEmbeddings_f
    e_batch = self.f(tokenIdsList_merged, tokenIdsList_merged_b, argsort, argsort_reverse, pList)
  File "/mount/arbeitsdaten31/studenten1/raufmz/virtual/vecto/lib/python3.7/site-packages/vecto/embeddings/utils/subword.py", line 304, in __call__
    self.rnn(tokenIdsList_ordered[:, i])
  File "/mount/arbeitsdaten31/studenten1/raufmz/virtual/vecto/lib/python3.7/site-packages/vecto/embeddings/utils/subword.py", line 375, in rnn
    x = self.embed(cur_word)
  File "/mount/arbeitsdaten31/studenten1/raufmz/virtual/vecto/lib/python3.7/site-packages/chainer/link.py", line 242, in __call__
    out = forward(*args, **kwargs)
  File "/mount/arbeitsdaten31/studenten1/raufmz/virtual/vecto/lib/python3.7/site-packages/chainer/links/connection/embed_id.py", line 70, in forward
    return embed_id.embed_id(x, self.W, ignore_label=self.ignore_label)
  File "/mount/arbeitsdaten31/studenten1/raufmz/virtual/vecto/lib/python3.7/site-packages/chainer/functions/connection/embed_id.py", line 164, in embed_id
    return EmbedIDFunction(ignore_label=ignore_label).apply((x, W))[0]
  File "/mount/arbeitsdaten31/studenten1/raufmz/virtual/vecto/lib/python3.7/site-packages/chainer/function_node.py", line 237, in apply
    ', '.join(str(type(x)) for x in in_data)))
TypeError: incompatible array types are mixed in the forward input (EmbedIDFunction).
Actual: <class 'numpy.ndarray'>, <class 'cupy.core.core.ndarray'>
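The traceback suggests a plain numpy index array being handed to an EmbedID whose weight matrix lives on the GPU. Below is a small standalone illustration of the mismatch and one way around it (an untested guess at the cause; it requires a GPU and cupy, and the real fix would presumably belong inside subword.py's rnn(), where x = self.embed(cur_word) is called):

import numpy as np
from chainer.backends import cuda
from chainer.links import EmbedID

embed = EmbedID(10, 4)        # toy embedding table
embed.to_gpu()                # weights become cupy arrays, as in the --gpu run

ids = np.array([1, 2, 3], dtype=np.int32)
# embed(ids) would reproduce the "incompatible array types" TypeError above;
# moving the index array to the same device first avoids it:
out = embed(cuda.to_gpu(ids))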

Visualization

I have looked at the visualization example available at the following link:

https://vecto.readthedocs.io/en/docs/tutorial/visualization.html

I couldn't find the function in the vecto GitHub repository, so after installing, the call obviously fails with a "not found" error.

>>> from vecto import visualize as vz
>>> vz.draw_features(vsm, ["apple", "pear", "cat", "dog"], num_features=20)

In the above code, what is vsm? Also, how do I run the visualization?
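Other issues in this tracker load embeddings with vecto.embeddings.load_from_dir, so vsm is presumably such a loaded vector space model. A guess at what the docs snippet intends (the visualize import is exactly the piece reported missing, so this may not run against the released package):

import vecto.embeddings
from vecto import visualize as vz   # module referenced by the docs; reportedly absent from the installed package

vsm = vecto.embeddings.load_from_dir("/path/to/embedding/dir")   # hypothetical path
vz.draw_features(vsm, ["apple", "pear", "cat", "dog"], num_features=20)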

ValueError: k-fold cross-validation requires at least one train/test split by setting n_splits=2 or more, got n_splits=0.

I want to evaluate my model using JBATS_1.0 (http://vecto.space/projects/jBATS/), like this:

import vecto
import json
from vecto.embeddings import load_from_dir
from vecto.benchmarks.analogy.analogy import Analogy

path_model = "./plain_no_file_header/"
model = load_from_dir(path_model)

results = Analogy().run(model, "./JBATS_1.0")
with open("out.json", "w") as f:
    json.dump(results, f)

but a ValueError occurred:

INFO:vecto.embeddings:./plain_no_file_header/Detected VSM in plain text format
INFO:vecto.benchmarks.analogy.analogy:processing ./JBATS_1.0/.Rhistory
Traceback (most recent call last):
  File "run_eval.py", line 9, in <module>
    results = Analogy().run(model, "./JBATS_1.0")
  File "/opt/conda/lib/python3.7/site-packages/vecto/benchmarks/analogy/analogy.py", line 185, in run
    result_for_category = self.run_category(pairs)
  File "/opt/conda/lib/python3.7/site-packages/vecto/benchmarks/analogy/analogy.py", line 109, in run_category
    kfold = sklearn.model_selection.KFold(n_splits=len(pairs) // self.size_cv_test)
  File "/opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 419, in __init__
    super(KFold, self).__init__(n_splits, shuffle, random_state)
  File "/opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_split.py", line 284, in __init__
    " got n_splits={0}.".format(n_splits))
ValueError: k-fold cross-validation requires at least one train/test split by setting n_splits=2 or more, got n_splits=0.

Could you tell me how to run it correctly?
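The log shows the benchmark also trying to parse ./JBATS_1.0/.Rhistory; a stray non-dataset file like that would yield few or zero pairs, making n_splits = len(pairs) // self.size_cv_test come out as 0. A quick check for such files (only a guess at the cause):

import os

# list anything in the dataset directory that does not look like a BATS category file
for root, _, files in os.walk("./JBATS_1.0"):
    for name in files:
        if not name.endswith(".txt"):
            print("possible stray file:", os.path.join(root, name))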

use cached matrix of pairwise dot products where possible

Currently the 3CosAdd and 3CosMul methods run very slowly. It should be possible to speed up both operations with the strategy used by hyperwords: precalculate all pairwise cosine similarity scores once, and simply add, subtract, multiply, or divide the resulting scalars for the tests.

This is possible because cos(b*, a* − a + b) can be distributed out to cos(b*, a*) − cos(b*, a) + cos(b*, b). (The second looked worse to me until I realized you can precompute all possible pairwise dot products and never have to calculate any more, whereas every new combination of a*, a, b in the first case requires executing a new dot product.)

Given the current architecture this might not be a trivial operation, but at least for smaller vocabularies, it should definitely be possible, and could yield a ~100x speedup by leveraging fast linear algebra routines.
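A rough numpy sketch of the cached-similarity idea for 3CosAdd, assuming unit-normalized embedding rows so that dot products are cosines (the random matrix is just a stand-in for a real vocabulary):

import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 300))                 # stand-in for the real embedding matrix
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize so dot product == cosine

sims = emb @ emb.T                                 # all pairwise cosines, computed once

def three_cos_add(a, a_prime, b, exclude=()):
    # cos(b*, a') - cos(b*, a) + cos(b*, b), read straight out of the cache
    scores = sims[:, a_prime] - sims[:, a] + sims[:, b]
    scores[list(exclude)] = -np.inf                # never answer with a query word
    return int(np.argmax(scores))

print(three_cos_add(1, 2, 3, exclude=(1, 2, 3)))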

This is something I'd be interested in working on, though I may not be able to devote time to it in the near term. If that would be of interest, though, please let me know.

Similarity Check Not working

Hello, I have run the analogy tests successfully, but I have issues with the similarity benchmark. It keeps expecting a JSON file, but the documentation makes no mention of JSON, and the metadata file is not found.

Kindly assist me

Extrinsic Evaluation Methods

Hi, thanks for your great work, it has helped me a lot! I would like to know when the extrinsic evaluation methods will be implemented.

import error

I was able to use vecto without problems last week, but today when I try to import vecto and load a model I get the following error:
import vecto
path_to_vsm = "/Users/Y/Downloads/word_linear_sg_500d"
my_vsm = vecto.model.load_from_dir(path_to_vsm)

module 'vecto' has no attribute 'model'

I looked at the package's __init__.py and it seems to only load the version:
from ._version import VERSION
and _version.py only defines the version:
VERSION = "0.1.7"

Thankfully, I am still able to use vsmlib. Is there a technical reason that vsmlib was replaced with vecto?
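Other issues here load embeddings through vecto.embeddings rather than vecto.model; assuming that is the intended replacement for the old vsmlib-style entry point, the call would presumably become:

from vecto.embeddings import load_from_dir

path_to_vsm = "/Users/Y/Downloads/word_linear_sg_500d"
my_vsm = load_from_dir(path_to_vsm)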

Unclear documentation and examples; how to evaluate embeddings on BATS?

Hi,

I tried to follow the documentation on readthedocs, but the API there seems to be outdated.
What I want to do is evaluate my trained embeddings on the BATS analogy task. What I have so far is:
a directory dir with a .npy and a .vocab file, containing embeddings for a SUBSET of the words in BATS (I can't learn embeddings for all words in BATS, as my training corpus doesn't contain all of them), and their corresponding names.
I also created a folder for the original BATS dataset, which has subfolders like '1_Inflectional_morphology' that each contain the individual text files of that dataset.

I then did:

import vecto.embeddings
from vecto.benchmarks.analogy.analogy import Analogy

path_to_my_vsm_directory = 'path_to_dir'
model = vecto.embeddings.load_from_dir(path_to_my_vsm_directory)

options = {}
options['path_dataset'] = 'path_to_BATS'
options["path_results"] = path_to_my_vsm_directory
options["name_method"] = '3CosAdd'

analogy = Analogy(model, options)

This generates no output. What am I supposed to do here?
Calling analogy.get_result(), as described in 'vecto/examples/analogy.ipynb', gives: AttributeError: 'Analogy' object has no attribute 'get_result'.
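For comparison, the k-fold issue above calls the benchmark as Analogy().run(model, path) instead of constructing Analogy(model, options); a sketch along those lines, assuming that newer signature (how to select 3CosAdd through this interface is unclear to me; the CLI exposes it as --method):

import json
import vecto.embeddings
from vecto.benchmarks.analogy.analogy import Analogy

model = vecto.embeddings.load_from_dir('path_to_dir')   # the directory with the .npy and .vocab files
results = Analogy().run(model, 'path_to_BATS')          # signature as used in the k-fold issue
with open('analogy_results.json', 'w') as f:
    json.dump(results, f)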

Analogies

Hello,
I'm trying to evaluate some word embeddings with analogies; however, the examples from the Jupyter notebooks and the documentation are not working, probably because they are outdated...

import vecto.embeddings
from vecto.benchmarks.analogy.analogy import Analogy

embeddings = vecto.embeddings.load_from_dir("/storage/data/NLP/embeddings/6b.wiki_giga")

analogy = Analogy()
analogy.get_result()

or

path_model = "./UC/SINONIMO_N_8_2_100_50.txt"

options = {}
options["path_dataset"] = "./UC/SINONIMO_N_8_2_100_50.txt"
options["path_results"] = "UC/results"
options["name_method"] = "3CosAdd"
vecto.benchmarks.analogy.run(model, options)

Neither of these approaches works.

How can I use my analogies to test my embeddings?

Unstable results with LRCos

Re-running the LRCos benchmark produces differences of 1 or 2 correct answers out of the 50 questions in some individual BATS tests, so the benchmark accuracy varies by up to 4 percentage points on the same data, which is a concern for the reproducibility of experiments. The differences tend to average out across test categories, so the overall difference is smaller, though still problematic.

It seems plausible that the randomization used in the LogisticRegression from sklearn.linear_model could cause this problem. But seeding the random number generator with np.random.seed(1), random.seed(1), or calling LogisticRegression with random_state=1 does not help.

Scikit-learn (sklearn) missing from requirements.txt

Just installed vecto via pip, motivated by the tutorial at LREC 2018.

The installation works, but Scikit-learn (sklearn) seems to be missing from requirements.txt.

This causes some imports to fail, for example:

from vecto.benchmarks.analogy import Analogy

results in a ModuleNotFoundError.
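Until the dependency is declared, installing it manually should work around the failing import:

pip install scikit-learn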

How can we interpret the results?

Hi there,
I have finished a first evaluation on the BATS dataset.
The result file is a JSON file containing 5,840,964 lines.
How can we interpret it?
What I am looking for is a single number that indicates the performance of my word embeddings.
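If the file has the same structure as the results.json shown in the first issue above (a list of per-category entries whose 'result' dict holds cumulative counts), then the last entry already carries the overall totals; a sketch under that assumption (the filename is a placeholder):

import json

with open('results.json') as f:
    data = json.load(f)

# the counts appear to be cumulative, so the final entry gives one overall accuracy
last = data[-1]['result']
print('overall accuracy:', last['cnt_questions_correct'] / last['cnt_questions_total'])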
