cisnlp / simalign
Obtain Word Alignments using Pretrained Language Models (e.g., mBERT)
License: MIT License
Lines 172-200 of examples/align_files.py are shown below. embed_loader.get_embed_list(...) returns tensors, whereas SentenceAligner.get_similarity requires NumPy arrays.
vectors = embed_loader.get_embed_list(list(sent_pair))
if convert_to_words:
    w2b_map = []
    cnt = 0
    w2b_map.append([])
    for wlist in l1_tokens:
        w2b_map[0].append([])
        for x in wlist:
            w2b_map[0][-1].append(cnt)
            cnt += 1
    cnt = 0
    w2b_map.append([])
    for wlist in l2_tokens:
        w2b_map[1].append([])
        for x in wlist:
            w2b_map[1][-1].append(cnt)
            cnt += 1
    new_vectors = []
    for l_id in range(2):
        w_vector = []
        for word_set in w2b_map[l_id]:
            w_vector.append(vectors[l_id][word_set].mean(0))
        new_vectors.append(np.array(w_vector))
    vectors = np.array(new_vectors)
all_mats = {}
sim = SentenceAligner.get_similarity(vectors[0], vectors[1])
sim = SentenceAligner.apply_distortion(sim, args.distortion)
This is problematic when --token-type = word, since sklearn.metrics.pairwise.cosine_similarity cannot convert tensors to NumPy arrays directly (because they also carry gradients).
This is the exact error:
  File "/home/ishan/simalign/simalign/simalign.py", line 110, in get_similarity
    return (cosine_similarity(X, Y) + 1.0) / 2.0
  File "/home/ishan/.local/lib/python3.6/site-packages/sklearn/metrics/pairwise.py", line 1179, in cosine_similarity
    X, Y = check_pairwise_arrays(X, Y)
  File "/home/ishan/.local/lib/python3.6/site-packages/sklearn/utils/validation.py", line 72, in inner_f
    return f(**kwargs)
  File "/home/ishan/.local/lib/python3.6/site-packages/sklearn/metrics/pairwise.py", line 134, in check_pairwise_arrays
    X, Y, dtype_float = _return_float_dtype(X, Y)
  File "/home/ishan/.local/lib/python3.6/site-packages/sklearn/metrics/pairwise.py", line 45, in _return_float_dtype
    X = np.asarray(X)
  File "/home/ishan/.local/lib/python3.6/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/home/ishan/.local/lib/python3.6/site-packages/torch/tensor.py", line 492, in __array__
    return self.numpy()
RuntimeError: Can't call numpy() on Variable that requires grad. Use var.detach().numpy() instead.
A quick workaround is to add the line vectors = np.array(vectors.detach()) in an else clause on if convert_to_words.
I've integrated simalign into my Heroku app, but as soon as I did, I got the error "Compiled slug size: 1G is too large (max is 500M)."
Reviewing the files, I can see that PyTorch, which simalign depends on, accounts for 600 MB on its own. I also tried DigitalOcean, but seem to be running into similar issues.
Any ideas for working around this and reducing the footprint that simalign requires because of PyTorch?
Hi, congratulations on your paper!
I am working on word alignment between en and hi. I found there are two en-hi test sets provided by this link, i.e., en-hi.wa, en-hi.wa.nonnullalign. Which test set is used in the paper?
My test results on en-hi (using subword embeddings):
en-hi.wa.nonnullalign:
XLM-R Argmax prec=85.62 rec=46.91 f1=60.61 AER=39.39
XLM-R IterMax prec=75.36 rec=51.88 f1=61.45 AER=38.55
en-hi.wa:
XLM-R Argmax prec=85.62 rec=36.32 f1=51.00 AER=49.00
The results reported in the paper are:
XLM-R Argmax f1=60 AER=40
So, do you use en-hi.wa.nonnullalign as the test set?
Hey! I have a list of options of sentences, as opposed to only 2. I'd like to align all of them, and iterate over them step by step. What do you think is the best way to go about this with simalign?
Thanks for this project, it's really useful.
It would be even more useful if you could do alignments in batches. This could make doing lots of alignments on a GPU much faster.
I noticed that install_requires pins a specific version of networkx: networkx==2.4
However, networkx 2.4 is not compatible with the latest numpy (1.24.0), which leads to AttributeError: module 'numpy' has no attribute 'int' when trying to import SentenceAligner.
Possible Fix
We should move to a newer version of networkx, but pinning numpy to the older 1.23.4 also works as a workaround.
Hi, I was wondering whether simalign supports extracting alignments at the BPE level.
I need this feature in order to find possibly misaligned words. Thanks!
When running the example code you provide in README, I run into this error:
ValueError: too many values to unpack (expected 2)
I used this code a few weeks ago and it was working fine. I don't understand why now it's not working.
edit: This error was solved by installing simalign and all dependencies in another environment.
Hi,
Thank you for open-sourcing a word alignment tool that performs so well. When running simalign, I ran into a problem with speed. It takes about 160 seconds to process a batch (100 sentence pairs) on CPU. When I use the GPU by setting the device to "cuda", it still takes about 160 seconds per batch. I think the data is loaded onto the GPU, because part of the GPU memory is in use. Do you have any insights?
Why is the default layer 8?
I understand that the model returns a set of layers, but other models return different numbers of layers. How did the 8th layer get picked as a default?
Line 26 in 249a7f3
Hi, for the ENG-FAS dataset (Tavakoli et al. 2014 | Gold Alignment), I can't reach this link (http://eceold.ut.ac.ir/en/node/940). Could anyone send me this dataset? My e-mail: [email protected]
In the simalign.py file, the inputs aren't converted to CUDA tensors when the flag -device is set to cuda.
def get_embed_list(self, sent_pair: List[List[str]]) -> torch.Tensor:
    if self.emb_model is not None:
        inputs = self.tokenizer(sent_pair, is_pretokenized=True, padding=True, truncation=True, return_tensors="pt")
        outputs = self.emb_model(**inputs)[2][self.layer]
        return outputs[:, 1:-1, :]
    else:
        return None
This can be fixed by adding the line inputs = inputs.to(self.device) before passing it to self.emb_model.
I tried using Meta's facebook/nllb-200-distilled-600M model, but it seems that hidden_states is not being set on the self.emb_model output (line 65). I'm getting:
ValueError: You have to specify either decoder_input_ids or decoder_inputs_embeds
Any suggestions for how to use NLLB?
I am facing problems when aligning sentences where one contains spelling mistakes. With the ArgMax method, the result is missing indices.
For Example:
2 sentences:
['Ds', 'ist', 'en', 'Test', '.']
['This', 'is', 'a', 'test', '.']
Method ArgMax --> [(1, 1), (2, 2), (3, 3), (4, 4)] (is missing (0, 0))
Method Match --> [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)] (is correct)
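A possible reading of this behaviour, going by the paper's description of Argmax: a pair (i, j) is kept only when j is the most similar target token for i and i is the most similar source token for j. If the misspelled token's best match is not mutual, it stays unaligned, whereas Match still assigns it a partner. The sketch below illustrates that rule only; it is not simalign's code, and the function name argmax_alignment and the similarity values are made up for illustration.

```python
from typing import List, Set, Tuple


def argmax_alignment(sim: List[List[float]]) -> Set[Tuple[int, int]]:
    """Keep (i, j) only if the choice is mutual in both directions.

    Hypothetical helper for illustration; sim[i][j] is the similarity
    between source token i and target token j.
    """
    n_src, n_trg = len(sim), len(sim[0])
    # Forward direction: best target index for every source token.
    fwd = {(i, max(range(n_trg), key=lambda j: sim[i][j])) for i in range(n_src)}
    # Reverse direction: best source index for every target token.
    rev = {(max(range(n_src), key=lambda i: sim[i][j]), j) for j in range(n_trg)}
    # Only mutual best matches survive the intersection.
    return fwd & rev


# Made-up similarities: source token 0 prefers target 1,
# but target 1 prefers source 1, so source 0 ends up unaligned.
sim = [
    [0.40, 0.55, 0.10],
    [0.20, 0.90, 0.15],
    [0.10, 0.20, 0.80],
]
aligned = sorted(argmax_alignment(sim))
print(aligned)
```

In this toy case source token 0 drops out of the result, which mirrors the missing (0, 0) reported above.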
When I run python align_example.py, it gets stuck at the following message:
2020-08-23 21:51:40.688388: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll
I installed per the pip command line, and I use Windows 10 + Python 3.7.5.
get_word_aligns returns a mapping (list of tuples). Some applications call for different matching policies, e.g. a one-to-one or one-to-many mapping.
It would be useful to separate out the first part into a new function, e.g. get_word_align_matrices, that is essentially this part of get_word_aligns. This would allow users to implement their own matching algorithm on the matrix however they want.
It's a very easy change that would add a lot of value.
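To make the request concrete, here is a minimal sketch of the kind of user-defined matching that access to the raw matrix would allow. It assumes the similarity matrix is available as a plain list of rows; get_word_align_matrices is only the name proposed in this issue, and greedy_one_to_one and the threshold parameter are hypothetical, not part of simalign.

```python
from typing import List, Tuple


def greedy_one_to_one(sim: List[List[float]], threshold: float = 0.0) -> List[Tuple[int, int]]:
    """Greedy one-to-one matching on a similarity matrix (illustrative only).

    Repeatedly takes the highest-scoring remaining pair whose source and
    target tokens are both still free, stopping below the threshold.
    """
    candidates = sorted(
        ((sim[i][j], i, j) for i in range(len(sim)) for j in range(len(sim[0]))),
        reverse=True,
    )
    used_src, used_trg, pairs = set(), set(), []
    for score, i, j in candidates:
        if score < threshold:
            break  # all remaining candidates score even lower
        if i not in used_src and j not in used_trg:
            pairs.append((i, j))
            used_src.add(i)
            used_trg.add(j)
    return sorted(pairs)


# Made-up matrix with 2 source tokens and 3 target tokens.
sim = [
    [0.90, 0.80, 0.10],
    [0.85, 0.30, 0.20],
]
print(greedy_one_to_one(sim))
print(greedy_one_to_one(sim, threshold=0.5))
```

Greedy selection is just one possible policy; a one-to-many variant would only drop the used_trg check.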
Hi,
There is a version mismatch between the tag (0.2) and setup.py (0.1).
Some Python tools might be confused :)
If I were you, I would:
Thank you.
Hi,
This is a very useful tool. Thank you for making it available. I was reading your paper and code, and I was wondering if you could clarify which method names in the code correspond to those in the paper. In the code, the matching methods are {"a": "inter", "m": "mwmf", "i": "itermax", "f": "fwd", "r": "rev"}. In the paper, I can only see 3: argmax, itermax, and match. So, besides the obvious itermax, which method in the code corresponds to argmax and which to match? From the code, I think inter refers to argmax and mwmf to match, but I ask to be sure. Thanks for your help!
Hi!
Thank you for making your tool available! I want to align sentences in two different files. I guess that align_files.py can do exactly this. Could you clarify the format that each file must be in?
I am confused by "Lines in the file should be indexed separated by TABs."
Sorry in advance if this might be trivial.
Thank you in advance!
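One reading of "indexed separated by TABs" is that every line holds a sentence ID, a TAB, and then the whitespace-tokenized sentence. That interpretation is an assumption and is not confirmed in this thread. Under it, a file could be parsed roughly as follows; read_indexed_sentences is a hypothetical helper, not a simalign function.

```python
from typing import Dict, Iterable, List


def read_indexed_sentences(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Parse '<id>TAB<sentence>' lines into {id: tokens}.

    Assumed format only; check align_files.py for what it really expects.
    """
    sentences: Dict[str, List[str]] = {}
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        sent_id, sep, text = line.partition("\t")
        if not sep:
            raise ValueError(f"line {line_no}: expected '<id><TAB><sentence>'")
        sentences[sent_id] = text.split()
    return sentences


# Two example lines as they might appear in one of the two files.
example = ["0\tDas ist ein Test .\n", "1\tHallo Welt\n"]
parsed = read_indexed_sentences(example)
print(parsed)
```

With this layout, the same ID in the source file and the target file would identify the pair to be aligned.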
I have the following example:
Sentence A: a # 9.8 m deficit recorded for 2014/15 at an essex hospital is to be investigated by a health service watchdog.
Sentence B: A £9.8m deficit recorded for 2014/15 at an Essex hospital is to be investigated by a health service watchdog.
When I run the following:
myaligner = simalign.SentenceAligner(token_type="word")
aligns = myaligner.get_word_aligns(sentence_A, sentence_B)['itermax']
This produces alignments of the form:
[(0, 0), (2, 1), (4, 2), (5, 3), (6, 4), (7, 5), (8, 6), (9, 7), (10, 8), (11, 9), (12, 10), (13, 11), (14, 12), (15, 13), (16, 14), (17, 15), (18, 16), (19, 17), (20, 18)]
I cannot figure out how you then produce a matching of the form:
[(0, 0), (1, 1), (2, 1), (3, 1), (4, 2), (5, 3), (6, 4), (7, 5), (8, 6), (9, 7), (10, 8), (11, 9), (12, 10), (13, 11), (14, 12), (15, 13), (16, 14), (17, 15), (18, 16), (19, 17), (20, 18)]
This is done on the interactive website in order to produce the graphs but I cannot find where you do something of this form in the code provided.
Thanks in advance!
I know this might sound irrelevant, but can the logic of aligning words in two sentences be used to align sentences in two articles?
Hi! Do you happen to know the time complexity of aligning 2 sentences?
Hi! Thanks for your work. A small question here.
For a multilingual language model, why can it be guaranteed that words with the same meaning in different languages have similar representations in the word vector space? What theory is this hypothesis based on? Is there a relevant article for theoretical analysis?
Hi, I liked your paper and found that I need the "null" option. Could you please include its implementation?
Thanks for the impressive work of SimAlign.
Currently I am working on reproducing the results in your paper, but I don't know how to compute the redefined precision and recall measures presented by Och & Ney, 2003, i.e., which labels should be S labels and which P labels?
Could you please provide the evaluation scripts for the test sets used in the paper?
Thanks.
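For reference, Och & Ney (2003) define the measures over a hypothesis set A, sure gold links S, and possible gold links P (with S contained in P): precision = |A ∩ P| / |A|, recall = |A ∩ S| / |S|, and AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|). The sketch below implements these formulas only; it is not the authors' evaluation script, and alignment_scores is a hypothetical name. How sure and possible links are marked in a given gold file is a separate question that this sketch does not answer.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

Link = Tuple[int, int]


def alignment_scores(hypothesis: Iterable[Link],
                     sure: Iterable[Link],
                     possible: Optional[Iterable[Link]] = None) -> Dict[str, float]:
    """Precision, recall, F1 and AER following Och & Ney (2003).

    Illustrative helper; if no possible links are given, P equals S.
    """
    a: Set[Link] = set(hypothesis)
    s: Set[Link] = set(sure)
    p: Set[Link] = s | set(possible or ())  # sure links count as possible too
    precision = len(a & p) / len(a) if a else 0.0
    recall = len(a & s) / len(s) if s else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    total = len(a) + len(s)
    aer = 1.0 - (len(a & s) + len(a & p)) / total if total else 0.0
    return {"prec": precision, "rec": recall, "f1": f1, "aer": aer}


# Made-up links: two sure, one extra possible, one wrong prediction.
scores = alignment_scores(
    hypothesis=[(0, 0), (1, 1), (2, 2), (3, 2)],
    sure=[(0, 0), (1, 1)],
    possible=[(2, 2)],
)
print(scores)
```

When a gold standard has no separate possible links (S = P), these formulas reduce to AER = 1 - F1.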
It seems that my script for getting alignments fails to use CUDA, so it runs very slowly.
How can I make sure that CUDA devices are used? I have also set CUDA_VISIBLE_DEVICES.
Hi Jalili Sabet et al.,
When I run simalign for the first time, it downloads certain model files for several minutes, then gives the info below:
"simalign.simalign - INFO - Initialized the EmbeddingLoader with model: bert-base-multilingual-cased"
My question is, can I download those model files beforehand and load them from my local disk when initializing simalign? If so, how?
Thank you very much!
Bao
I have paragraphs in German and English, and I am looking for a way to find which sentences in the source language map to which sentences in the target language. Can you advise me on that?
Hi,
I am getting the following error while executing the align_example.py
/home/sriram/anaconda3/envs/simalign/lib/python3.7/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
return torch._C._cuda_getDeviceCount() > 0
2020-12-29 12:04:07,522 - simalign.simalign - INFO - Initialized the EmbeddingLoader with model: bert-base-multilingual-cased
Traceback (most recent call last):
  File "align_example.py", line 7, in <module>
    result = model.get_word_aligns(source_sentence, target_sentence)
  File "/home/sriram/anaconda3/envs/simalign/lib/python3.7/site-packages/simalign/simalign.py", line 211, in get_word_aligns
    vectors = self.embed_loader.get_embed_list([src_sent, trg_sent]).cpu().detach().numpy()
  File "/home/sriram/anaconda3/envs/simalign/lib/python3.7/site-packages/simalign/simalign.py", line 66, in get_embed_list
    inputs = self.tokenizer(sent_batch, is_pretokenized=True, padding=True, truncation=True, return_tensors="pt")
  File "/home/sriram/anaconda3/envs/simalign/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2371, in __call__
    **kwargs,
  File "/home/sriram/anaconda3/envs/simalign/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2556, in batch_encode_plus
    **kwargs,
  File "/home/sriram/anaconda3/envs/simalign/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 526, in _batch_encode_plus
    ids, pair_ids = ids_or_pair_ids
ValueError: too many values to unpack (expected 2)
Can you please help to resolve this issue?
Thanks,
Sriram
I modified simalign to use LaBSE (or "pvl/labse_bert") as the underlying multilingual model for calculating embeddings. It showed better precision and recall on the alignments than either mBERT or XLM-RoBERTa, and I think it would be a useful additional option for simalign.
After running the example code provided I get this error:
>>> import simalign
>>>
>>> source_sentence = "Sir Nils Olav III. was knighted by the norwegian king ."
>>> target_sentence = "Nils Olav der Dritte wurde vom norwegischen König zum Ritter geschlagen ."
>>> model = simalign.SentenceAligner()
2020-09-13 18:02:40,806 - simalign.simalign - INFO - Initialized the EmbeddingLoader with model: bert-base-multilingual-cased
I0913 18:02:40.806071 4394976704 simalign.py:47] Initialized the EmbeddingLoader with model: bert-base-multilingual-cased
>>> result = model.get_word_aligns(source_sentence.split(), target_sentence.split())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/simalign/simalign.py", line 181, in get_word_aligns
    vectors = self.embed_loader.get_embed_list(list(bpe_lists))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/simalign/simalign.py", line 65, in get_embed_list
    outputs = [self.emb_model(in_ids.to(self.device)) for in_ids in inputs]
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/simalign/simalign.py", line 65, in <listcomp>
    outputs = [self.emb_model(in_ids.to(self.device)) for in_ids in inputs]
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/transformers/modeling_bert.py", line 806, in forward
    extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/transformers/modeling_utils.py", line 248, in get_extended_attention_mask
    input_shape, attention_mask.shape
ValueError: Wrong shape for input_ids (shape torch.Size([18])) or attention_mask (shape torch.Size([18]))
I wonder if this is due to my recent update of transformers. If so, that's going to be difficult for me to solve because the newest version of transformers has a fill-mask feature that was not available in previous versions that I'm going to need in conjunction with simalign's invaluable functionality.
Hopefully, this is unrelated. I did cancel the download then restart it again (and it seemed to restart from a fresh file though I could be wrong).