cisnlp / simalign Goto Github PK

Obtain Word Alignments using Pretrained Language Models (e.g., mBERT)

License: MIT License

Python 100.00%

simalign's Issues

Error in examples/align_files.py when token-type is word

Lines 172-200 of examples/align_files.py are shown below. embed_loader.get_embed_list(...) returns tensors, whereas SentenceAligner.get_similarity requires NumPy arrays.

		vectors = embed_loader.get_embed_list(list(sent_pair))
		if convert_to_words:
			w2b_map = []
			cnt = 0
			w2b_map.append([])
			for wlist in l1_tokens:
				w2b_map[0].append([])
				for x in wlist:
					w2b_map[0][-1].append(cnt)
					cnt += 1
			cnt = 0
			w2b_map.append([])
			for wlist in l2_tokens:
				w2b_map[1].append([])
				for x in wlist:
					w2b_map[1][-1].append(cnt)
					cnt += 1
			new_vectors = []
			for l_id in range(2):
				w_vector = []
				for word_set in w2b_map[l_id]:
					w_vector.append(vectors[l_id][word_set].mean(0))
				new_vectors.append(np.array(w_vector))
			vectors = np.array(new_vectors)

		all_mats = {}
		sim = SentenceAligner.get_similarity(vectors[0], vectors[1])
		sim = SentenceAligner.apply_distortion(sim, args.distortion)

This is a problem when --token-type is word, since sklearn.metrics.pairwise.cosine_similarity cannot convert the tensors to NumPy arrays directly (they still require gradients).

This is the exact error

  File "/home/ishan/simalign/simalign/simalign.py", line 110, in get_similarity
    return (cosine_similarity(X, Y) + 1.0) / 2.0
  File "/home/ishan/.local/lib/python3.6/site-packages/sklearn/metrics/pairwise.py", line 1179, in cosine_similarity
    X, Y = check_pairwise_arrays(X, Y)
  File "/home/ishan/.local/lib/python3.6/site-packages/sklearn/utils/validation.py", line 72, in inner_f
    return f(**kwargs)
  File "/home/ishan/.local/lib/python3.6/site-packages/sklearn/metrics/pairwise.py", line 134, in check_pairwise_arrays
    X, Y, dtype_float = _return_float_dtype(X, Y)
  File "/home/ishan/.local/lib/python3.6/site-packages/sklearn/metrics/pairwise.py", line 45, in _return_float_dtype
    X = np.asarray(X)
  File "/home/ishan/.local/lib/python3.6/site-packages/numpy/core/_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/home/ishan/.local/lib/python3.6/site-packages/torch/tensor.py", line 492, in __array__
    return self.numpy()
RuntimeError: Can't call numpy() on Variable that requires grad. Use var.detach().numpy() instead.

A quick workaround is to add an else clause to if convert_to_words containing the line vectors = np.array(vectors.detach()).
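The word-level pooling in the snippet quoted above can be illustrated in isolation. The following is a minimal sketch in plain Python lists rather than tensors, assuming every word has at least one subword piece; the helper names and the sample tokenization are hypothetical and not part of simalign.

```python
from typing import List


def build_word_to_subword_map(tokens: List[List[str]]) -> List[List[int]]:
    """Map each word to the positions of its subword pieces in the flat sequence."""
    mapping, cnt = [], 0
    for pieces in tokens:
        mapping.append(list(range(cnt, cnt + len(pieces))))
        cnt += len(pieces)
    return mapping


def mean_pool(vectors: List[List[float]], mapping: List[List[int]]) -> List[List[float]]:
    """Average the subword vectors that belong to each word."""
    pooled = []
    for idxs in mapping:
        dim = len(vectors[idxs[0]])
        pooled.append([sum(vectors[i][d] for i in idxs) / len(idxs) for d in range(dim)])
    return pooled


# Hypothetical tokenization: the last word is split into two subword pieces.
l1_tokens = [["This"], ["is"], ["test", "##ing"]]
subword_vectors = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]

w2b = build_word_to_subword_map(l1_tokens)
word_vectors = mean_pool(subword_vectors, w2b)
print(w2b)
print(word_vectors)
```

With real model output, the same averaging would run on detached tensors or NumPy arrays, which is where the conversion discussed in this issue matters.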

Any workarounds for pytorch being too large?

I've integrated simalign into my Heroku app, but as soon as I did, I started getting the "Compiled slug size: 1G is too large (max is 500M)." error.

Reviewing the files, I can see that PyTorch, which simalign depends on, is alone responsible for 600MB. I also tried DigitalOcean, but I seem to be running into similar issues.

Any ideas for working around this to reduce the memory consumption required by simalign because of pytorch?

Question on en-hi test set

Hi, congratulations on your paper!

I am working on word alignment between en and hi. I found that two en-hi test sets are provided at this link: en-hi.wa and en-hi.wa.nonnullalign. Which test set is used in the paper?

My test results on en-hi (using subword embeddings):

en-hi.wa.nonnullalign:

XLM-R Argmax prec=85.62 rec=46.91 f1=60.61 AER=39.39
XLM-R IterMax prec=75.36 rec=51.88 f1=61.45 AER=38.55

en-hi.wa:

XLM-R Argmax prec=85.62 rec=36.32 f1=51.00 AER=49.00

The results reported in the paper are:

XLM-R Argmax f1=60 AER=40

So, do you use en-hi.wa.nonnullalign as the test set?

Any tips on how to align multiple (not only 2) sentences?

Hey! I have a list of several sentences, as opposed to only 2. I'd like to align all of them, and iterate over them step by step. What do you think is the best way to go about this with simalign?

Batching alignments

Thanks for this project, it's really useful.

It would be even more useful if you could do alignments in batches. This could make doing lots of alignments on a GPU much faster.

The install_requires may need update

I noticed that install_requires pins a specific version of networkx:
networkx==2.4
However, networkx 2.4 is not compatible with the latest numpy 1.24.0,
which leads to an AttributeError: module 'numpy' has no attribute 'int' when trying to import SentenceAligner.

Possible Fix
We should move to a newer version of networkx, but pinning an older numpy (1.23.4) also works as a workaround.
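The np.int alias was removed in NumPy 1.24, which is why the older networkx fails on import. As a rough illustration, an environment could be checked before importing simalign with something like the sketch below; the helper names are hypothetical, and the version strings are passed in by hand so the example does not depend on any installed package.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '1.24.0' into (1, 24, 0), keeping only leading numeric components."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def numpy_lacks_int_alias(numpy_version: str) -> bool:
    """np.int was removed in NumPy 1.24, so code that still uses it fails from there on."""
    return parse_version(numpy_version) >= (1, 24)


print(numpy_lacks_int_alias("1.24.0"))
print(numpy_lacks_int_alias("1.23.4"))
```

In practice the installed version string could come from importlib.metadata.version("numpy").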

Alignments for BPE token

Hi, I was wondering whether simalign supports extracting alignments at the BPE level?

How can I get the confidence of a specific alignment?

I need this feature in order to detect possibly misaligned words. Thanks!

Is there a way to adjust the threshold τ?

I'd like to reduce the threshold so that fewer words are aligned since sometimes the words are aligned improperly. Is there a way to adjust this threshold?
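One way to approach both this question and the confidence question above is a post-hoc filter. The sketch below assumes the caller has access to the similarity matrix behind the alignments (whether the library exposes it is not stated here); filter_by_score, tau, and the numbers are hypothetical and only for illustration.

```python
from typing import List, Tuple


def filter_by_score(
    aligns: List[Tuple[int, int]], sim: List[List[float]], tau: float
) -> List[Tuple[int, int, float]]:
    """Keep aligned pairs whose similarity is at least tau, and report their scores."""
    return [(i, j, sim[i][j]) for i, j in aligns if sim[i][j] >= tau]


# Hypothetical similarity matrix (rows: source words, columns: target words).
sim = [
    [0.91, 0.40, 0.35],
    [0.38, 0.55, 0.42],
    [0.30, 0.41, 0.88],
]
aligns = [(0, 0), (1, 1), (2, 2)]

kept = filter_by_score(aligns, sim, tau=0.6)
print(kept)
```

The score attached to each pair can serve as a rough confidence value, and raising tau leaves fewer pairs aligned.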

ValueError: too many values to unpack (expected 2)

When running the example code you provide in README, I run into this error:

ValueError: too many values to unpack (expected 2)

I used this code a few weeks ago and it was working fine. I don't understand why now it's not working.

edit: This error was solved by installing simalign and all dependencies in another environment.

The speed is not improved when using gpu

Hi,
Thank you for open-sourcing a well-performing word alignment tool. When running simalign, I ran into a problem with speed. It takes about 160 seconds to process a batch (100 sentence pairs) on CPU. When I use the GPU by setting the device to "cuda", it still takes about 160 seconds per batch. I think the data is loaded onto the GPU, because part of the GPU memory is in use. Do you have any insights?

Question re Default Layer

Why is the default layer 8?

I understand that the model returns a set of layers, but other models return different numbers of layers. How did the 8th layer get picked as a default?

simalign/simalign/simalign.py

Line 26 in 249a7f3

 def __init__(self, model: str="bert-base-multilingual-cased", device=torch.device('cpu'), layer: int=8): 

dataset ENG-FAS

Hi, for the ENG-FAS dataset (Tvakoli et al. 2014 | Gold Alignment), I can't reach this link (http://eceold.ut.ac.ir/en/node/940). Could anyone send me this dataset? My e-mail: [email protected]

Inputs not converted to cuda tensors when -device is cuda

In the simalign.py file, the inputs aren't converted to cuda tensors when the -device flag is set to cuda.

	def get_embed_list(self, sent_pair: List[List[str]]) -> torch.Tensor:
		if self.emb_model is not None:
			inputs = self.tokenizer(sent_pair, is_pretokenized=True, padding=True, truncation=True, return_tensors="pt")
			outputs = self.emb_model(**inputs)[2][self.layer]

			return outputs[:, 1:-1, :]
		else:
			return None

This can be fixed by adding the line inputs = inputs.to(self.device) before passing inputs to self.emb_model.

ValueError when using NLLB model

I tried using Meta's facebook/nllb-200-distilled-600M model, but it seems that hidden_states is not being set on the self.emb_model output (line 65). I'm getting:

ValueError: You have to specify either decoder_input_ids or decoder_inputs_embeds

Any suggestions for how to use NLLB?

Indices missing (just for ArgMax)

I am facing problems when aligning sentences where one contains spelling mistakes. With the ArgMax method, the result is missing indices.

For Example:
2 sentences:
['Ds', 'ist', 'en', 'Test', '.']
['This', 'is', 'a', 'test', '.']
Method ArgMax --> [(1, 1), (2, 2), (3, 3), (4, 4)] (is missing (0, 0))

Method Match --> [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)] (is correct)
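A possible explanation, assuming ArgMax keeps a pair only when each word is the other's best match (the intersection of the forward and reverse argmax, as the paper describes it): a misspelled word may have no mutual best match and is then left unaligned. The sketch below uses a made-up similarity matrix, not the values simalign computes for this example.

```python
from typing import List, Tuple


def argmax_alignments(sim: List[List[float]]) -> List[Tuple[int, int]]:
    """Return (i, j) where j is the best column for row i and i is the best row for column j."""
    n_rows, n_cols = len(sim), len(sim[0])
    row_best = [max(range(n_cols), key=lambda j: sim[i][j]) for i in range(n_rows)]
    col_best = [max(range(n_rows), key=lambda i: sim[i][j]) for j in range(n_cols)]
    return [(i, j) for i, j in enumerate(row_best) if col_best[j] == i]


# Hypothetical scores: row 0 prefers column 1, but column 1 prefers row 1.
sim = [
    [0.50, 0.55, 0.10],
    [0.20, 0.90, 0.15],
    [0.10, 0.20, 0.80],
]

pairs = argmax_alignments(sim)
print(pairs)  # source word 0 ends up unaligned
```

A matching-based method assigns pairs globally instead, which would be consistent with Match still returning (0, 0).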

I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll

When I run python align_example.py, it gets stuck at the following message:
2020-08-23 21:51:40.688388: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_101.dll

I installed per the pip command line, and I use Windows 10 + Python 3.7.5.

Function to get match matrices

get_word_aligns returns a mapping (list of tuples). Some applications call for different matching policies, e.g. a one-to-one or one-to-many mapping.

It would be useful to separate out the first part into a new function e.g. get_word_align_matrices, that is essentially this part of get_word_aligns. This would allow the user to implement their own matching algorithm on the matrix however they want.

It's a very easy change that would add a lot of value.
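To illustrate what a user-defined policy on such a matrix could look like, here is a minimal sketch of a greedy one-to-one matching. It is not the paper's Match method and not part of simalign; greedy_one_to_one and the sample scores are hypothetical.

```python
from typing import List, Tuple


def greedy_one_to_one(sim: List[List[float]]) -> List[Tuple[int, int]]:
    """Repeatedly take the highest remaining cell, then block its row and column."""
    cells = sorted(
        ((sim[i][j], i, j) for i in range(len(sim)) for j in range(len(sim[0]))),
        key=lambda cell: -cell[0],
    )
    used_rows, used_cols, pairs = set(), set(), []
    for _score, i, j in cells:
        if i not in used_rows and j not in used_cols:
            pairs.append((i, j))
            used_rows.add(i)
            used_cols.add(j)
    return sorted(pairs)


# Hypothetical similarity matrix (rows: source words, columns: target words).
sim = [
    [0.50, 0.55, 0.10],
    [0.20, 0.90, 0.15],
    [0.10, 0.20, 0.80],
]

matching = greedy_one_to_one(sim)
print(matching)
```

Because every row and column is used at most once, each word is aligned to at most one word on the other side.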

Version mismatch

Hi,
there is version mismatch between tag (0.2) and setup.py (0.1).
Some python tools might be confused :)

If I were you, I would:

change version to 0.3 in setup.py
make new tag "v0.3"

Thank you.

Looking for clarifications

Hi,
This is a very useful tool. Thank you for making it available. I was reading your paper and code, and I was wondering if you could clarify which method names in the code correspond to those in the paper. In the code, the matching methods are {"a": "inter", "m": "mwmf", "i": "itermax", "f": "fwd", "r": "rev"}. In the paper, I can only see 3: argmax, itermax, and match. So, besides the obvious itermax, which method in the code corresponds to argmax and which to match? From the code, I think inter refers to argmax and mwmf to match, but I am asking to be sure. Thanks for your help!

Clarification in align_files.py

Hi!

Thank you for making your tool available! I want to align sentences in two different files. I guess that align_files.py can do exactly this. Do you think you can clarify the format that each file must be in?

I am confused by "Lines in the file should be indexed separated by TABs."

Sorry in advance if this might be trivial.

Thank you in advance!
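One reading of that help text, stated as an assumption rather than confirmed from the script, is that every line holds an index, a TAB, and then the whitespace-tokenized sentence. A minimal sketch of parsing such lines follows; parse_indexed_lines and the sample lines are hypothetical.

```python
from typing import List, Tuple


def parse_indexed_lines(lines: List[str]) -> List[Tuple[str, List[str]]]:
    """Split 'index<TAB>sentence' lines into (index, tokens) pairs, skipping blank lines."""
    parsed = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        index, sentence = line.split("\t", 1)
        parsed.append((index, sentence.split()))
    return parsed


# Assumed file content: the same indices would appear in both language files.
sample = ["0\tThis is a test .\n", "1\tAnother sentence .\n"]

rows = parse_indexed_lines(sample)
print(rows)
```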

How to get matchings from alignment

I have the following example:
Sentence A: a # 9.8 m deficit recorded for 2014/15 at an essex hospital is to be investigated by a health service watchdog.
Sentence B: A £9.8m deficit recorded for 2014/15 at an Essex hospital is to be investigated by a health service watchdog.

When I run the following:
myaligner = simalign.SentenceAligner(token_type="word")
aligns = myaligner.get_word_aligns(sentence_A, sentence_B)['itermax']

This produces an aligns of the form:
[(0, 0), (2, 1), (4, 2), (5, 3), (6, 4), (7, 5), (8, 6), (9, 7), (10, 8), (11, 9), (12, 10), (13, 11), (14, 12), (15, 13), (16, 14), (17, 15), (18, 16), (19, 17), (20, 18)]

I cannot figure out how you then produce a matching of the form:
[(0, 0), (1, 1), (2, 1), (3, 1), (4, 2), (5, 3), (6, 4), (7, 5), (8, 6), (9, 7), (10, 8), (11, 9), (12, 10), (13, 11), (14, 12), (15, 13), (16, 14), (17, 15), (18, 16), (19, 17), (20, 18)]

This is done on the interactive website in order to produce the graphs but I cannot find where you do something of this form in the code provided.

Thanks in advance!

similarity alignment of sentences

I know this might sound irrelevant, but can the logic of aligning words in two sentences be used to align sentences in two articles?

Time Complexity

Hi! Do you happen to know the time complexity of aligning 2 sentences?

A question about the premise assumptions of the algorithm

Hi! Thanks for your work. A small question here.
For a multilingual language model, why can it be guaranteed that words with the same meaning in different languages have similar representations in the word vector space? What theory is this hypothesis based on? Is there a relevant article for theoretical analysis?

Null extension

Hi, I liked your paper and found that I need the "null" option. Could you please include its implementation?

Evaluation Scripts & Results Reproduction

Thanks for the impressive work of SimAlign.

Currently I am working on reproducing the results in your paper, but I don't know how to compute the redefined precision and recall measures presented by Och & Ney, 2003, i.e., which labels should be S labels or P labels?

Could you please provide the evaluation scripts for the test sets used in the paper?

Thanks.
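For reference, the measures from Och & Ney (2003) can be written down directly once the sure (S) and possible (P) gold links are known. The sketch below assumes that S is treated as a subset of P and that both the prediction and S are non-empty; how the labels are assigned in each test set is not answered here, and the function name and sample links are hypothetical.

```python
from typing import Iterable, Tuple

Link = Tuple[int, int]


def alignment_scores(
    predicted: Iterable[Link], sure: Iterable[Link], possible: Iterable[Link]
) -> Tuple[float, float, float, float]:
    """Return precision, recall, F1, and AER for a set of predicted links."""
    a, s = set(predicted), set(sure)
    p = set(possible) | s  # sure links also count as possible
    precision = len(a & p) / len(a)
    recall = len(a & s) / len(s)
    total = precision + recall
    f1 = 0.0 if total == 0 else 2 * precision * recall / total
    aer = 1 - (len(a & s) + len(a & p)) / (len(a) + len(s))
    return precision, recall, f1, aer


# Hypothetical gold annotation: three sure links and one extra possible link.
predicted = [(0, 0), (1, 1), (2, 3)]
sure = [(0, 0), (1, 1), (2, 2)]
possible = [(2, 3)]

precision, recall, f1, aer = alignment_scores(predicted, sure, possible)
print(precision, recall, f1, aer)
```

When a test set carries no separate P labels, passing the sure links as possible reduces AER to 1 - F1.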

seems like not using CUDA

It seems like my script for getting alignments fails to use CUDA, so it runs very slowly.
How can I make sure it uses CUDA devices? I have also set CUDA_VISIBLE_DEVICES.

How to init simalign with local model files?

Hi Jalili Sabet etc,

When I run simalign for the first time, it will download certain model files for several minutes, then give the info below:

"simalign.simalign - INFO - Initialized the EmbeddingLoader with model: bert-base-multilingual-cased"

My question is: can I download those model files beforehand and load them from my local disk when initializing simalign? If so, how?

Thank you very much!

Bao

paragraph alignments

I have paragraphs in German and English, and I want to find out which sentences in the source language map to which sentences in the target language. Can you advise me on that?

Error in running align_example.py

Hi,

I am getting the following error while executing the align_example.py

/home/sriram/anaconda3/envs/simalign/lib/python3.7/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
2020-12-29 12:04:07,522 - simalign.simalign - INFO - Initialized the EmbeddingLoader with model: bert-base-multilingual-cased
Traceback (most recent call last):
  File "align_example.py", line 7, in <module>
    result = model.get_word_aligns(source_sentence, target_sentence)
  File "/home/sriram/anaconda3/envs/simalign/lib/python3.7/site-packages/simalign/simalign.py", line 211, in get_word_aligns
    vectors = self.embed_loader.get_embed_list([src_sent, trg_sent]).cpu().detach().numpy()
  File "/home/sriram/anaconda3/envs/simalign/lib/python3.7/site-packages/simalign/simalign.py", line 66, in get_embed_list
    inputs = self.tokenizer(sent_batch, is_pretokenized=True, padding=True, truncation=True, return_tensors="pt")
  File "/home/sriram/anaconda3/envs/simalign/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2371, in __call__
    **kwargs,
  File "/home/sriram/anaconda3/envs/simalign/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2556, in batch_encode_plus
    **kwargs,
  File "/home/sriram/anaconda3/envs/simalign/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 526, in _batch_encode_plus
    ids, pair_ids = ids_or_pair_ids
ValueError: too many values to unpack (expected 2)

Can you please help to resolve this issue?

Thanks,
Sriram

Incorporate LaBSE as a model option

I modified simalign to use LaBSE (or "pvl/labse_bert") as the underlying multilingual model to calculate embeddings. It showed better precision and recall on the alignments than either mBERT or XLM-RoBERTa, and I think it would be a useful additional option for simalign.

ValueError: Wrong shape for input_ids (shape torch.Size([18])) or attention_mask (shape torch.Size([18]))

After running the example code provided I get this error:

>>> import simalign
>>> 
>>> source_sentence = "Sir Nils Olav III. was knighted by the norwegian king ."
>>> target_sentence = "Nils Olav der Dritte wurde vom norwegischen König zum Ritter geschlagen ."
>>> model = simalign.SentenceAligner()
2020-09-13 18:02:40,806 - simalign.simalign - INFO - Initialized the EmbeddingLoader with model: bert-base-multilingual-cased
I0913 18:02:40.806071 4394976704 simalign.py:47] Initialized the EmbeddingLoader with model: bert-base-multilingual-cased
>>> result = model.get_word_aligns(source_sentence.split(), target_sentence.split())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/simalign/simalign.py", line 181, in get_word_aligns
    vectors = self.embed_loader.get_embed_list(list(bpe_lists))
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/simalign/simalign.py", line 65, in get_embed_list
    outputs = [self.emb_model(in_ids.to(self.device)) for in_ids in inputs]
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/simalign/simalign.py", line 65, in <listcomp>
    outputs = [self.emb_model(in_ids.to(self.device)) for in_ids in inputs]
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/transformers/modeling_bert.py", line 806, in forward
    extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/transformers/modeling_utils.py", line 248, in get_extended_attention_mask
    input_shape, attention_mask.shape
ValueError: Wrong shape for input_ids (shape torch.Size([18])) or attention_mask (shape torch.Size([18]))

I wonder if this is due to my recent update of transformers. If so, that's going to be difficult for me to solve because the newest version of transformers has a fill-mask feature that was not available in previous versions that I'm going to need in conjunction with simalign's invaluable functionality.

Hopefully, this is unrelated. I did cancel the download then restart it again (and it seemed to restart from a fresh file though I could be wrong).
