anhaidgroup / deepmatcher
Python package for performing Entity and Text Matching using Deep Learning.
License: BSD 3-Clause "New" or "Revised" License
I have trained a model and would like to continue training it on other data. Is that possible?
I tried it with model.run_train(...)
All I got back was:
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/torch/lib/THC/generic/THCTensorMath.cu:26
Thankful for advice 👍
hello world
Please, I need your help: the date for presenting my graduation project is approaching.
I get an error when I try to save the model.
My code is:
import deepmatcher as dm
import pandas as pd
import numpy as np
import os
train_set, validation_set = dm.data.process(
    path="deepmatcher_model/",
    cache='train_cache.pth',
    train='train.csv',
    validation='valid.csv',
    embeddings='fasttext.fr.bin',
    embeddings_cache_path="deepmatcher_model/",
    ignore_columns=['id', '', 'ltable_index', 'rtable_index'],
    id_attr='_id',
    label_attr='label',
    left_prefix='ltable_',
    right_prefix='rtable_')
model = dm.MatchingModel(attr_summarizer='hybrid')
model.initialize(train_set)
model.run_train(
    train_dataset=train_set,
    validation_dataset=validation_set,
    epochs=20,
    batch_size=16,
    best_save_path='deepmatcher_model/hybrid_model.pth',
    pos_neg_ratio=3)
model.save_state('hybrid_model.pth', include_meta=True)
the error message :
AttributeError Traceback (most recent call last)
in ()
----> 1 model.save_state('hybrid_model.pth',include_meta=True)
4 frames
/usr/local/lib/python3.6/dist-packages/torch/serialization.py in _save(obj, f, pickle_module, pickle_protocol)
196 pickler = pickle_module.Pickler(f, protocol=pickle_protocol)
197 pickler.persistent_id = persistent_id
--> 198 pickler.dump(obj)
199
200 serialized_storage_keys = sorted(serialized_storages.keys())
AttributeError: Can't pickle local object 'MatchingDataset.finalize_metadata.<locals>.<lambda>'
Please tell me how I can solve this.
I use Google Colab.
Thanks, and I am sorry for my English.
@thodrek @sidharthms @hanli91 @anhaidgroup
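For readers hitting the same message, here is a minimal sketch of the general mechanism behind "Can't pickle local object", using only the standard library. It does not touch deepmatcher; the function names are hypothetical, and the idea that a lambda created inside finalize_metadata ends up in the saved object is only an inference from the error text.

```python
import pickle
from collections import defaultdict


def build_counts_with_lambda():
    # The lambda is created inside a function, so it is a "local object".
    # pickle stores functions by name and cannot look this one up again.
    return defaultdict(lambda: 0)


def build_counts_picklable():
    # `int` is importable by name, so the same structure pickles fine.
    return defaultdict(int)


def can_pickle(obj):
    """Return True if pickle.dumps succeeds on obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError):
        return False


print(can_pickle(build_counts_with_lambda()))  # the local lambda blocks pickling
print(can_pickle(build_counts_picklable()))    # a named factory does not
```

The usual remedy in plain Python is to replace the local lambda with something importable by name. Whether a similar change is feasible for the saved model metadata is not something this thread confirms.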
It looks like the links on the data page, and the ref to the paper are dead https://github.com/anhaidgroup/deepmatcher/blob/master/Datasets.md
I've seen examples that use CSV files for both training and testing.
How can we use deepmatcher directly on strings, without using files?
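One workaround, sketched here under assumptions, is to write the in-memory records to a temporary CSV and point the file-based API at it. The helper name write_pairs_to_csv is hypothetical; the '_id', 'ltable_' and 'rtable_' names are taken from the example code elsewhere on this page and would have to match whatever the model was trained with.

```python
import csv
import os
import tempfile


def write_pairs_to_csv(pairs, id_attr="_id",
                       left_prefix="ltable_", right_prefix="rtable_"):
    """Write (left_record, right_record) dict pairs to a temporary CSV file.

    Every left record is assumed to share one set of keys, and likewise
    every right record. Returns the path of the file that was written.
    """
    left_keys = sorted(pairs[0][0])
    right_keys = sorted(pairs[0][1])
    header = ([id_attr]
              + [left_prefix + key for key in left_keys]
              + [right_prefix + key for key in right_keys])
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        for row_id, (left, right) in enumerate(pairs):
            writer.writerow([row_id]
                            + [left[key] for key in left_keys]
                            + [right[key] for key in right_keys])
    return path
```

The returned path could then be handed to a call such as dm.data.process_unlabeled(path=..., trained_model=model), which appears in another report on this page; remember to delete the temporary file afterwards.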
It would be better if one could specify the separator used to split the CSV text. I think it currently supports only the default pandas separator, which is the comma.
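Until such an option exists, a file with another separator can be rewritten as comma-separated text first. This is a minimal standard-library sketch; convert_delimiter is a hypothetical helper and is not part of deepmatcher.

```python
import csv
import io


def convert_delimiter(text, source_delimiter=";"):
    """Re-serialise delimiter-separated text as comma-separated CSV."""
    reader = csv.reader(io.StringIO(text), delimiter=source_delimiter)
    output = io.StringIO()
    # The writer quotes any field that itself contains a comma.
    writer = csv.writer(output, lineterminator="\n")
    writer.writerows(reader)
    return output.getvalue()


sample = 'id;left_name;right_name\n1;"a;b";c, d\n'
print(convert_delimiter(sample))
```

Going through the csv module rather than a plain string replace keeps quoted fields intact when they contain either separator.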
Hi, I have just tested deepmatcher on the walmart-amazon dataset.
However, I cannot achieve the precision/recall/F1 reported in SIGMOD18.
import deepmatcher as dm
import logging
import torch
logging.getLogger('deepmatcher.core').setLevel(logging.INFO)
model = dm.MatchingModel(attr_summarizer=dm.attr_summarizers.RNN(word_aggregator='birnn-last-pool'))
# model = dm.MatchingModel()
print(model)
model.initialize(train_dataset) # Explicitly initialize model.
model = model.cuda()
lr_decay = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
batch_size = [16, 32]
pos_neg_ratio = [10, 7, 5, 4, 3, 2, 1]
best_f1 = -1
best_params = {'lr_decay': -1, 'batch_size': -1, 'pos_neg_ratio':-1}
for alpha in lr_decay:
    for b in batch_size:
        for rho in pos_neg_ratio:
            optimizer = dm.optim.Optimizer(lr_decay=alpha)
            model.run_train(
                train_dataset,
                validation_dataset,
                epochs=15,
                batch_size=b,
                pos_neg_ratio=rho,
                optimizer=optimizer,
                best_save_path='../output/walmart-amazon/rnn_model.pth')
            print(f'lr_decay: {alpha}, batch_size:{b}, pos_neg_ratio:{rho}')
            f1 = model.run_eval(test_dataset)
            if f1 > best_f1:
                best_f1 = f1
                best_params['lr_decay'] = alpha
                best_params['batch_size'] = b
                best_params['pos_neg_ratio'] = rho
                print(f'best f1-score: {best_f1}, {best_params}')
print(f'best f1-score: {best_f1}, {best_params}')
I haven't yet tested all the hyperparameters, but the available results show that f1-score is only about 37%.
However, the f1 reported in SIGMOD18 is 67.6% for RNN on the structured walmart-amazon dataset.
I use the data downloaded from the link in this repository.
Could you tell me what the problem is?
What are the hyper-parameters selected for the experiments in SIGMOD 18 (RNN)?
Hi, I have just read your code.
I found that many scripts use 'masked_fill_' for tensor computation, e.g.,
word_aggregators.py:139-140
word_comparators.py:173-175
word_contextualizers.py:157-159
Could you please tell me why you use masked_fill for tensors?
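A common reason for masking in batched sequence models is that sequences are padded to a common length, and the padded positions must not influence attention weights or pooling. Whether that is the purpose at each of the lines cited above is an assumption. The sketch below shows the idea in plain Python rather than PyTorch: padded scores are filled with negative infinity before a softmax, so they receive zero weight.

```python
import math


def masked_softmax(scores, mask):
    """Softmax over scores, ignoring positions where mask is False.

    Assumes at least one position is kept (mask has at least one True).
    """
    # Equivalent in spirit to filling masked positions with -inf.
    filled = [s if keep else float("-inf") for s, keep in zip(scores, mask)]
    peak = max(filled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in filled]
    total = sum(exps)
    return [e / total for e in exps]


# The third position is padding: its large score must not dominate.
weights = masked_softmax([1.0, 2.0, 5.0], [True, True, False])
print(weights)
```

Without the mask, the padded position would take most of the weight simply because of its score.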
RuntimeError raised when using SIF as attribute summarizer
torch==0.3.1
torchtext==0.2.3
Traceback (most recent call last):
  File "matching.py", line 107, in <module>
    best_save_path=path.join(RESULTS_DIR, 'models', 'rnn_sif_fasttext_model.pth')
  File "/home/belerico/.local/share/virtualenvs/prova-Z0l7suQy/src/deepmatcher/deepmatcher/models/core.py", line 183, in run_train
    return Runner.train(self, *args, **kwargs)
  File "/home/belerico/.local/share/virtualenvs/prova-Z0l7suQy/src/deepmatcher/deepmatcher/runner.py", line 300, in train
    model.initialize(train_dataset)
  File "/home/belerico/.local/share/virtualenvs/prova-Z0l7suQy/src/deepmatcher/deepmatcher/models/core.py", line 357, in initialize
    self.forward(init_batch)
  File "/home/belerico/.local/share/virtualenvs/prova-Z0l7suQy/src/deepmatcher/deepmatcher/models/core.py", line 427, in forward
    embeddings[right])
  File "/home/belerico/.local/share/virtualenvs/prova-Z0l7suQy/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/belerico/.local/share/virtualenvs/prova-Z0l7suQy/src/deepmatcher/deepmatcher/models/modules.py", line 122, in forward
    return self._forward(input, *args, **kwargs)
  File "/home/belerico/.local/share/virtualenvs/prova-Z0l7suQy/src/deepmatcher/deepmatcher/models/core.py", line 548, in _forward
    left_aggregated = self.word_aggregator(left_compared, left_aggregator_context)
  File "/home/belerico/.local/share/virtualenvs/prova-Z0l7suQy/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/belerico/.local/share/virtualenvs/prova-Z0l7suQy/src/deepmatcher/deepmatcher/models/modules.py", line 122, in forward
    return self._forward(input, *args, **kwargs)
  File "/home/belerico/.local/share/virtualenvs/prova-Z0l7suQy/src/deepmatcher/deepmatcher/models/modules.py", line 286, in _forward
    return self.module.forward(*args, **kwargs)
  File "/home/belerico/.local/share/virtualenvs/prova-Z0l7suQy/src/deepmatcher/deepmatcher/models/modules.py", line 260, in forward
    inputs = module(inputs)
  File "/home/belerico/.local/share/virtualenvs/prova-Z0l7suQy/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/belerico/.local/share/virtualenvs/prova-Z0l7suQy/src/deepmatcher/deepmatcher/models/modules.py", line 122, in forward
    return self._forward(input, *args, **kwargs)
  File "/home/belerico/.local/share/virtualenvs/prova-Z0l7suQy/src/deepmatcher/deepmatcher/models/modules.py", line 745, in _forward
    pc = Variable(input_with_meta.pc).unsqueeze(0).repeat(v.shape[0], 1)
RuntimeError: dimension specified as 0 but tensor has no dimensions
When training a model on my own data, the consumption of RAM steadily increases during each epoch, until there is a crash caused by lack of memory.
Also, when the state of the model is saved at the end of each epoch, I noticed that the temporary increase in RAM (probably needed to create temp files / memmaps / something like that) is larger each time than it was the previous time.
I have a similar problem when I use a model for prediction: when the input data file is too big, I split it and run the model on one part at a time. After each run, the RAM consumption has increased. I found that I can mitigate this by calling the 'deepmatcher.data.reset_vector_cache' function, which releases at least part of the extra RAM.
As such, I tend to also interpret the RAM consumption problem met during the training of the model, as a problem of ever increasing caching for vectors. But I don't really understand how this should be possible, since:
So then, if the hypothesis according to which the caching of vectors is linked to the increase of RAM that causes the problem, why does the RAM usage continue to increase during epochs following the first epoch? Why does the quantity of temporary additional RAM needed during the saving of the state of the model increase from one epoch to another?
Here is the code I use.
'train' and 'validation' are the data used during an epoch.
'pos_neg_ratio' is computed directly from the training dataset.
I first ran the code on Google Colab with a GPU, and then on a full GCP VM without GCP, and the problem appeared on both machines.
With my data, I cannot get past epoch n°4.
model = dm.MatchingModel(attr_summarizer='hybrid')
model.run_train(
train,
validation,
epochs=10,
batch_size=16,
pos_neg_ratio=pos_neg_ratio,
device="cuda:0")
I haven't seen a solution to this issue in the comments. First, it errors when trying to install Torch 0.3.1; attempting to pip install torch separately also results in a build error. Any ideas?
The error mentions something about needing a GPU. Could that be the issue?
Also this: ModuleNotFoundError: No module named 'tools.nnwrap'
While working on patching the project to accept in-memory files, I found myself refactoring the same inner logic inside the project again and again.
In the end, I came up with a "Schema Class", a simple abstraction that hides all the details of the underlying dataset.
This also aligns with recent developments in PyTorch NLP engineering that use Apache Arrow in-memory backends (see https://github.com/huggingface/nlp).
What is the roadmap for this project? Is there interest in continuing to evolve it?
from deepmatcher.data import MatchingField


class ExampleSchema(object):
    def __init__(self,
                 header,
                 id_attr,
                 label_attr,
                 left_prefix,
                 right_prefix,
                 ignore_columns=(),
                 tokenize='nltk',
                 lower=True,
                 include_lengths=True):
        """
        Checks that:
        * There is a label column
        * There is an ID column
        * All columns except the label and ID columns, and ignored columns start with either
          the left table attribute prefix or the right table attribute prefix.
        * The number of left and right table attributes are the same.
        Create field metadata, i.e., attribute processing specification for each
        attribute.
        This includes fields for label and ID columns.
        Returns:
            list(tuple(str, MatchingField)): A list of tuples containing column name
                (e.g. "left_address") and corresponding :class:`~data.MatchingField` pairs,
                in the same order that the columns occur in the CSV file.
        """
        # assert id_attr in header
        if label_attr:
            assert label_attr in header
        for attr in header:
            if attr not in (id_attr,
                            label_attr) and attr not in ignore_columns:
                if not attr.startswith(left_prefix) and not attr.startswith(
                        right_prefix):
                    raise ValueError(
                        'Attribute ' + attr +
                        ' is not a left or a right table '
                        'column, not a label or id and is not ignored. Not sure '
                        'what it is...')
        num_left = sum(attr.startswith(left_prefix) for attr in header)
        num_right = sum(attr.startswith(right_prefix) for attr in header)
        assert num_left == num_right, "left,right attributes mismatch"
        # One shared field object per column type.
        text_field = MatchingField(lower=lower,
                                   tokenize=tokenize,
                                   init_token='<<<',
                                   eos_token='>>>',
                                   batch_first=True,
                                   include_lengths=include_lengths)
        numeric_field = MatchingField(sequential=False,
                                      preprocessing=int,
                                      use_vocab=False)
        id_field = MatchingField(sequential=False, use_vocab=False, id=True)
        fields = []
        for attr in header:
            if attr == id_attr:
                fields.append((attr, id_field))
            elif attr == label_attr:
                fields.append((attr, numeric_field))
            elif attr in ignore_columns:
                fields.append((attr, None))
            else:
                fields.append((attr, text_field))
        self.fields = dict(fields)
Hi, how would I go about viewing (and potentially modifying) the actual embedding vectors that are generated by dm.data.process? I'm having a hard time figuring out where they are located.
The fastText binaries download does not work unless you select the English language.
Looking at the code, I presume this is because the English binaries are hosted in a Google Drive folder, while the others are downloaded automatically from an AWS server that seems to be unreachable.
Hi
Can you please tell me how deepmatcher scales (how many times faster it gets) as the number of CPU cores increases, for the same process and data?
Thanks
Sumon
When I try to make predictions over unlabeled data with
unlabeled_data = dm.data.process_unlabeled( path='sample_data/itunes-amazon/unlabeled.csv',trained_model=model)
predictions = model.run_prediction(unlabeled_data)
predictions.head()
I get
NameError: name 'fn' is not defined
I want to install deepmatcher with pip:
pip3 install deepmatcher
or
pip install deepmatcher
and I get this error:
Could not find a version that satisfies the requirement torch==0.3.1 (from deepmatcher) (from versions: 0.1.2, 0.1.2.post1)
No matching distribution found for torch==0.3.1 (from deepmatcher)
DeepMatcher is slower than BERT. How can we speed up predictions?
Hello,
I have a problem when I load the model that I saved yesterday.
My code:
model2 = dm.MatchingModel()
model2.load_state('/content/drive/My Drive/recommandersystem/deepmatcher_model/hybrid_model.pth')
The error message:
EOFError Traceback (most recent call last)
in ()
1 model2 = dm.MatchingModel()
----> 2 model2.load_state('/content/drive/My Drive/recommandersystem/deepmatcher_model/hybrid_model.pth')
2 frames
/usr/local/lib/python3.6/dist-packages/torch/serialization.py in _load(f, map_location, pickle_module)
418 unpickler = pickle_module.Unpickler(f)
419 unpickler.persistent_load = persistent_load
--> 420 result = unpickler.load()
421
422 deserialized_storage_keys = pickle_module.load(f)
EOFError: Ran out of input
thanks
import deepmatcher as dm
import torch
torch.cuda.is_available()
import pandas as pd
pd.read_csv(r'E:\project\deepmatcher\examples\sample_data\itunes-amazon\train.csv').head()
train, validation, test = dm.data.process(
    # Raw string, so the backslashes in the Windows path are kept literally.
    path=r'E:\project\deepmatcher\examples\sample_data\itunes-amazon',
    train='train.csv', validation='validation.csv', test='test.csv')
model = dm.MatchingModel()
Windows system
How to deal with this problem?
test/test_datasets folder does not contain required vectors to run tests, and travis could not find nltk.
Hi, I'm having an issue with calculating predictions and I managed to reproduce it with your itunes-amazon dataset. The issue is as follows:
When following your Getting Started notebook the predictions for the first 5 records are:
id | match_score |
---|---|
141999 | 0.094718 |
302034 | 0.283064 |
126354 | 0.794840 |
714676 | 0.282333 |
659997 | 0.175172 |
These are different from your results but I can imagine that is due to my model (with a f1-score of 70.77 on the test set) being different.
However when I feed in record 714676 separately into the processing and prediction function I get a match_score of 0.272498. That may not be a big difference but in my own datasets the differences are much larger (> 0.1).
Any idea what might be causing this difference? Is it the preprocessing step or the prediction step?
Some information about my environment:
System
Google Cloud Compute Engine VM with 1 Tesla K80 GPU
Linux 4.9.0-9-amd64 #1 SMP Debian 4.9.168-1+deb9u4 (2019-07-19) x86_64 GNU/Linux
Software
deepmatcher 0.1.0.post1
fasttext 0.9.1
pandas 0.25.1
torch 0.3.1
torchtext 0.2.3
cuda 10.0
After upgrading to torch==1.0.1 and torchtext>=0.2.3, I receive the following error. Is there a way I can fix it? I've tried setting the device on the modules that might be involved, but nothing has worked.
Traceback (most recent call last):
File "matching.py", line 42, in <module>
model.initialize(train)
File "/home/belerico/.local/share/virtualenvs/ceneje_prodmatch-Z0l7suQy/src/deepmatcher/deepmatcher/models/core.py", line 357, in initialize
self.forward(init_batch)
File "/home/belerico/.local/share/virtualenvs/ceneje_prodmatch-Z0l7suQy/src/deepmatcher/deepmatcher/models/core.py", line 421, in forward
embeddings[name] = self.embed[name](attr_input)
File "/home/belerico/.local/share/virtualenvs/ceneje_prodmatch-Z0l7suQy/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/belerico/.local/share/virtualenvs/ceneje_prodmatch-Z0l7suQy/src/deepmatcher/deepmatcher/models/modules.py", line 187, in forward
results = self.module(*module_args)
File "/home/belerico/.local/share/virtualenvs/ceneje_prodmatch-Z0l7suQy/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/belerico/.local/share/virtualenvs/ceneje_prodmatch-Z0l7suQy/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 117, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File "/home/belerico/.local/share/virtualenvs/ceneje_prodmatch-Z0l7suQy/lib/python3.6/site-packages/torch/nn/functional.py", line 1506, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected object of backend CPU but got backend CUDA for argument #3 'index'
Hi I am attempting to use deepmatcher on Mac OS. I have installed using
pip3 install deepmatcher
which seems to have worked.
Installing collected packages: torch, torchtext, deepmatcher
Successfully installed deepmatcher-0.1.0.post1 torch-0.3.1 torchtext-0.3.1
I am currently using
python3 --version
Python 3.6.5
When attempting to load deepmatcher
using import deepmatcher as dm
I get the following error:
>>> import deepmatcher as dm
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.6/site-packages/deepmatcher/__init__.py", line 10, in <module>
from .data import process as data_process
File "/usr/local/lib/python3.6/site-packages/deepmatcher/data/__init__.py", line 1, in <module>
from .field import MatchingField, reset_vector_cache
File "/usr/local/lib/python3.6/site-packages/deepmatcher/data/field.py", line 11, in <module>
from torchtext import data, vocab
File "/usr/local/lib/python3.6/site-packages/torchtext/__init__.py", line 1, in <module>
from . import data
File "/usr/local/lib/python3.6/site-packages/torchtext/data/__init__.py", line 4, in <module>
from .field import RawField, Field, ReversibleField, SubwordField, NestedField, LabelField
File "/usr/local/lib/python3.6/site-packages/torchtext/data/field.py", line 61, in <module>
class Field(RawField):
File "/usr/local/lib/python3.6/site-packages/torchtext/data/field.py", line 118, in Field
torch.float32: float,
AttributeError: module 'torch' has no attribute 'float32'
Any help would be greatly appreciated.
When I try to install the deepmatcher package, I get this error message:
Could not find a version that satisfies the requirement torch==0.3.1 (from deepmatcher==0.1.0.post1) (from versions: 0.1.2, 0.1.2.post1)
No matching distribution found for torch==0.3.1 (from deepmatcher==0.1.0.post1)
Any solution, please?
I have Windows 10 and use Python 3.6.
Also, how can I install it with conda?
Hi,
I'm trying to figure out if you have any code for processing data like the fodors-zagat dataset from the form at the data link (i.e., ltable_id,rtable_id,label) to the structure you identify as being required for model input (i.e., ltable_id,lfeature_1,lfeature_2,rtable_id,rfeature_1,rfeature_2), along with an unlabeled.csv. My data (different from the fodors-zagat dataset) is structured like the linked data above; I looked through the repository but couldn't find any code to accomplish this processing. Just wanted to make sure it wasn't there before I write it myself... Thanks!
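In case it helps others writing this step themselves, here is a minimal sketch of the join in plain Python. The function name expand_pairs, the 'id' row identifier, and the assumption that each table is available as a dict keyed by record id are all hypothetical; only the ltable_id, rtable_id and label column names come from the question.

```python
def expand_pairs(pairs, table_a, table_b,
                 left_prefix="ltable_", right_prefix="rtable_"):
    """Turn (ltable_id, rtable_id, label) rows into full attribute rows.

    pairs:   list of dicts with keys 'ltable_id', 'rtable_id', 'label'
    table_a: dict mapping a left record id to its attribute dict
    table_b: dict mapping a right record id to its attribute dict
    """
    rows = []
    for row_id, pair in enumerate(pairs):
        left = table_a[pair["ltable_id"]]
        right = table_b[pair["rtable_id"]]
        row = {"id": row_id, "label": pair["label"]}
        # Prefix every attribute so left and right columns stay distinct.
        row.update({left_prefix + key: value for key, value in left.items()})
        row.update({right_prefix + key: value for key, value in right.items()})
        rows.append(row)
    return rows


table_a = {"1": {"id": "1", "name": "Cafe A"}}
table_b = {"9": {"id": "9", "name": "Cafe A Paris"}}
pairs = [{"ltable_id": "1", "rtable_id": "9", "label": "1"}]
print(expand_pairs(pairs, table_a, table_b))
```

The resulting ltable_id and rtable_id columns are record keys rather than text attributes, so it may make sense to list them in ignore_columns, as one of the examples on this page does for 'ltable_index' and 'rtable_index'.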
How can I set parameter pos_neg_ratio in run_train()? Should I set it as the ratio of the number of negative examples to that of positive examples in the training set? @sidharthms
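For reference, the ratio described in the question can be computed as below. Another report on this page also computes pos_neg_ratio directly from the training data, but whether this is the recommended value is exactly what the question leaves open; the function name is hypothetical.

```python
def negative_to_positive_ratio(labels):
    """Return (#negatives / #positives) for a sequence of 0/1 labels."""
    positives = sum(1 for label in labels if int(label) == 1)
    negatives = sum(1 for label in labels if int(label) == 0)
    if positives == 0:
        raise ValueError("the training set contains no positive examples")
    return negatives / positives


print(negative_to_positive_ratio([1, 0, 0, 0]))
```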
try:
    import deepmatcher
except ImportError:
    !pip install -qqq deepmatcher
While running the above code in Python 3.6, I get the error below:
File "/home/vikrant/anaconda2/lib/python2.7/site-packages/deepmatcher/data/field.py", line 163
def build_vocab(self, *args, vectors=None, cache=None, **kwargs):
^
SyntaxError: invalid syntax
Showing error at vectors
Hi,
I am trying to train on a dataset that has a date-of-birth attribute with values such as '01-01-1970'. Can the fastText character embedding generate a vector for this kind of token, or for strings containing digits, during deepmatcher's data processing? As far as I know, fastText works with alphabetic characters only.
I would appreciate a reply with any information.
Thanks
Hi,
I am trying to reproduce the experiments from the SIGMOD 2018 paper: http://pages.cs.wisc.edu/~anhai/papers1/deepmatcher-sigmod18.pdf. I am having a hard time finding the right setup and I get results far poorer than the ones reported in the paper for most of the datasets. Can you please give me a hint regarding the right setup? For example, what are the parameters for the hybrid setup? Using the defaults leads to poor results and following the existing guides in the repository did not help much.
As an example, for the (complete) iTunes-Amazon scenario the best I could obtain was F1: 35.09 | Prec: 33.33 | Rec: 37.04. But the paper reports better results.
Thank you!
I am getting this error while running code->
Code-> train, validation, test = dm.data.process(
path='sample_data/itunes-amazon',
train='train.csv',
validation='validation.csv',
test='test.csv')
Error->
Reading and processing data from "sample_data/itunes-amazon/train.csv"
0% [############################# ] 100% | ETA: 00:00:00
Reading and processing data from "sample_data/itunes-amazon/validation.csv"
0% [############################# ] 100% | ETA: 00:00:00
Reading and processing data from "sample_data/itunes-amazon/test.csv"
0% [############################# ] 100% | ETA: 00:00:00
ValueError Traceback (most recent call last)
in ()
3 train='train.csv',
4 validation='validation.csv',
----> 5 test='test.csv')
7 frames
/usr/local/lib/python3.6/dist-packages/fastText/FastText.py in __init__(self, model)
35 self.f = fasttext.fasttext()
36 if model is not None:
---> 37 self.f.loadModel(model)
38
39 def is_quantized(self):
ValueError: /root/.vector_cache/wiki.en.bin has wrong file format!
The train/test/validation.csv downloaded from https://github.com/anhaidgroup/deepmatcher/blob/master/Datasets.md are different from the csv files under examples/sample_data.
Also, it was not mentioned where the ids in the train/test/validation.csv under examples/sample_data come.
Could you kindly clarify these?
the specified version of library torch (0.3.1) is no longer available via pip
I get a value error, tried to uninstall and install several packages but nothing worked :(
I prepared the sets as they are supposed to be. Including the 'left' and right prefixes, id, label and so on
train, validation, test = dm.data.process(
path='/home/censored/quora-question-pairs',
train='train.csv',
validation='validation.csv',
test='test.csv')
including the Error
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-14-989ecb797b84> in <module>()
4 train='train.csv',
5 validation='validation.csv',
----> 6 test='test.csv')
/home/censored/.local/lib/python3.6/site-packages/deepmatcher/data/process.py in process(path, train, validation, test, unlabeled, cache, check_cached_data, auto_rebuild_cache, tokenize, lowercase, embeddings, embeddings_cache_path, ignore_columns, include_lengths, id_attr, label_attr, left_prefix, right_prefix, use_magellan_convention, pca)
195
196 _maybe_download_nltk_data()
--> 197 _check_header(header, id_attr, left_prefix, right_prefix, label_attr, ignore_columns)
198 fields = _make_fields(header, id_attr, label_attr, ignore_columns, lowercase,
199 tokenize, include_lengths)
/home/censored/.local/lib/python3.6/site-packages/deepmatcher/data/process.py in _check_header(header, id_attr, left_prefix, right_prefix, label_attr, ignore_columns)
32 if attr not in (id_attr, label_attr) and attr not in ignore_columns:
33 if not attr.startswith(left_prefix) and not attr.startswith(right_prefix):
---> 34 raise ValueError('Attribute ' + attr + ' is not a left or a right table '
35 'column, not a label or id and is not ignored. Not sure '
36 'what it is...')
ValueError: Attribute is not a left or a right table column, not a label or id and is not ignored. Not sure what it is...
How can I avoid this?
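The message prints an empty attribute name ("Attribute  is not a left or a right table column"), which suggests the CSV header contains a column with no name. A possible cause, stated here as an assumption, is an index column written without a header when the file was exported. The sketch below mirrors the check visible in the traceback so that a header can be inspected before calling dm.data.process; the function name and the default prefix values are hypothetical.

```python
def find_unexpected_columns(header, id_attr="id", label_attr="label",
                            left_prefix="left_", right_prefix="right_",
                            ignore_columns=()):
    """Return the header entries that the header check would reject."""
    unexpected = []
    for attr in header:
        if attr in (id_attr, label_attr) or attr in ignore_columns:
            continue
        if not attr.startswith(left_prefix) and not attr.startswith(right_prefix):
            unexpected.append(attr)
    return unexpected


header = ["", "id", "label", "left_question", "right_question"]
print(find_unexpected_columns(header))
print(find_unexpected_columns(header, ignore_columns=[""]))
```

Either dropping the unnamed column from the file or adding '' to ignore_columns, as another example on this page does, should get past this particular check.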
When I run the data-processing code
train, validation, test = dm.data.process( path='sample_data/itunes-amazon', train='train.csv', validation='validation.csv', test='test.csv')
I get the error "Expected object of backend CPU but got backend CUDA for argument #3 'index'".
My device supports CUDA: torch.cuda.is_available() is True, and torch.device('cuda:0' if torch.cuda.is_available() else 'cpu') returns device(type='cuda', index=0).
How can I fix this problem? Hoping for a solution.
Hi, I get a runtime error when I run dm.data.process(): RuntimeError: Invalid device, must be cuda device. I have already specified my environment variable:
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
cuda_device = torch.device('cuda:0')
I am wondering where I can pass in the device variable in dm.data.process().
I'm trying to run the tutorial notebook on my GPU and when running the tutorial script:
train, validation, test = dm.data.process(
path='sample_data/itunes-amazon',
train='train.csv',
validation='validation.csv',
test='test.csv')
I get the following output:
Reading and processing data from "sample_data/itunes-amazon/train.csv"
0% [############################# ] 100% | ETA: 00:00:00
Reading and processing data from "sample_data/itunes-amazon/validation.csv"
0% [############################# ] 100% | ETA: 00:00:00
Reading and processing data from "sample_data/itunes-amazon/test.csv"
0% [############################# ] 100% | ETA: 00:00:00
tcmalloc: large alloc 2013274112 bytes == 0x55dd7d8ac000 @ 0x7ff96db501e1 0x7ff96df237e8 0x7ff91ede0f7b 0x7ff91ede1647 0x7ff91ede7058 0x7ff91ede7d99 0x7ff91edaf775 0x7ff91edd247e 0x55dcf69a9114 0x55dcf69a9231 0x55dcf6a0de8f 0x55dcf69626f9 0x55dcf6963805 0x55dcf697e943 0x55dcf69bd06a 0x55dcf69bde28 0x55dcf6a0d626 0x55dcf69a868b 0x55dcf6a0d6c9 0x55dcf69626f9 0x55dcf69a8917 0x55dcf6a0a0a6 0x55dcf69626f9 0x55dcf6963a30 0x55dcf697e943 0x55dcf69bd06a 0x55dcf69bde28 0x55dcf6a0e3d7 0x55dcf69a868b 0x55dcf6a0d6c9 0x55dcf69629da
tcmalloc: large alloc 4026540032 bytes == 0x55ddf58ae000 @ 0x7ff96db501e1 0x7ff96df237e8 0x7ff91ede0f7b 0x7ff91ede1647 0x7ff91ede7058 0x7ff91ede7d99 0x7ff91edaf775 0x7ff91edd247e 0x55dcf69a9114 0x55dcf69a9231 0x55dcf6a0de8f 0x55dcf69626f9 0x55dcf6963805 0x55dcf697e943 0x55dcf69bd06a 0x55dcf69bde28 0x55dcf6a0d626 0x55dcf69a868b 0x55dcf6a0d6c9 0x55dcf69626f9 0x55dcf69a8917 0x55dcf6a0a0a6 0x55dcf69626f9 0x55dcf6963a30 0x55dcf697e943 0x55dcf69bd06a 0x55dcf69bde28 0x55dcf6a0e3d7 0x55dcf69a868b 0x55dcf6a0d6c9 0x55dcf69629da
And then it hangs.
I checked if the GPU is visible through torch and it is.
Is there a way of reducing the memory allocation?
I'm using deepmatcher for product matching. I came across a scenario where I want to give different weights to different features.
Ex: I have 5 columns (A, B, C, D, E) in each of the left and right sets. I want to give more importance to columns A and B than to the rest. Is there any way to do this in deepmatcher?
Something like a weight vector [0.4, 0.2, 0.1, 0.1, 0.1, 0.1]
Hi, I've read your code.
If I am not mistaken, the embedding is built from fastText vectors, which are downloaded from
https://drive.google.com/uc?export=download&id=1Vih8gAmgBnuYDxfblbT94P6WjB7s1ZSh
However, I am not sure how this fastText embedding deals with numerical values. I can get the embedding of some numeric values directly from this compressed file.
My code is pasted below:
import fastText
model = fastText.load_model(path)
model.get_word_vector('1.29')
The output looks like this:
array([-1.36406599e-02, -2.27776930e-01, -3.18875402e-01, 7.06118524e-01,
-3.12648743e-01, 2.43467003e-01, -1.02767743e-01, -3.95509809e-01,
2.36076638e-01, 3.91702265e-01, 1.61795065e-01, 3.05263754e-02,
1.05867624e-01, 1.74811155e-01, 2.04814345e-01, -1.86537594e-01,
2.55494360e-02, -1.98053271e-01, 8.03033933e-02, 1.76812127e-01,
4.34105396e-02, 1.29547983e-01, -3.17463964e-01, -5.20383835e-01,
-2.43617725e-02, -7.41683841e-02, -8.27122629e-02, 1.17084600e-01,
1.25584468e-01, 6.53219968e-02, -6.95866272e-02, 3.29474926e-01,
-3.14294517e-01, 2.79334456e-01, -2.46289775e-01, 5.94363129e-03,
-3.22101228e-02, -9.52130780e-02, 1.58929497e-01, -1.14688493e-01,
-3.16427574e-02, -1.76016539e-01, -1.87940076e-01, -1.46237537e-01,
-3.98945883e-02, -1.22498609e-01, 9.15530622e-02, 6.35887161e-02,
8.42858702e-02, 4.53437585e-03, -3.56534153e-01, 5.57965748e-02,
-3.38896476e-02, 1.23230398e-01, -1.83964506e-01, 2.08228782e-01,
4.10810187e-02, 1.98457673e-01, -1.58749551e-01, 2.27738291e-01,
-7.31550083e-02, -2.98727542e-01, 2.16822296e-01, -4.91270497e-02,
-1.06079698e-01, -2.47507811e-01, -4.14096743e-01, -1.95570275e-01,
1.56012088e-01, 3.75828221e-02, -4.57418412e-01, 4.39721309e-02,
2.05725119e-01, -2.20915675e-01, 1.10607982e-01, -1.37868347e-02,
5.35952687e-01, 1.82460938e-02, -5.61991110e-02, -1.59084573e-01,
2.89105892e-01, -3.88202481e-02, -1.57112852e-02, 9.92794558e-02,
-5.02652168e-01, -4.81455810e-02, -1.25789657e-01, 1.05233647e-01,
1.36335000e-01, -1.60364717e-01, 4.15847301e-02, 5.85811920e-02,
3.02284360e-02, -9.40186158e-02, -1.46366581e-01, 1.04580402e-01,
1.44954011e-01, -8.10427219e-02, -2.52872854e-01, 2.61546880e-01,
5.68291619e-02, 3.77267972e-02, 1.76686642e-03, 9.93822962e-02,
6.99491519e-03, 5.82082570e-02, 7.09483027e-02, -2.14030035e-02,
-1.68091357e-01, 3.08654398e-01, 2.23699287e-01, -4.38281268e-01,
2.84722336e-02, -1.44860461e-01, -3.60347666e-02, -2.86953785e-02,
3.16625744e-01, 3.14383358e-01, -8.37426037e-02, 1.59112439e-02,
3.63424510e-01, -2.88430214e-01, -5.00641167e-01, 1.58014759e-01,
1.60332680e-01, 4.05774731e-03, -2.23307282e-01, -8.44741017e-02,
8.90725628e-02, 2.42359862e-01, 1.04806900e-01, 1.37488708e-01,
6.99754432e-02, -8.64755437e-02, 1.30111221e-02, 8.31332207e-02,
-4.22606841e-02, 1.11255124e-01, -1.57022268e-01, 1.78717270e-01,
-2.79926240e-01, 1.30639657e-01, -2.83100437e-02, 1.12825038e-03,
2.09572345e-01, -1.69247791e-01, -8.15944299e-02, -8.19083378e-02,
3.97944562e-02, 4.19981182e-02, 2.88578361e-01, -4.15978402e-01,
-3.40752304e-05, 8.56315866e-02, 1.16690114e-01, 1.64606094e-01,
6.05781339e-02, -1.87327877e-01, 2.52706140e-01, -5.35242222e-02,
2.21059874e-01, 5.93100078e-02, -1.46213770e-02, 1.78334340e-01,
3.03723902e-01, 7.16771334e-02, 5.16987927e-02, -1.03483133e-01,
9.72970948e-02, 7.37599423e-03, -9.61232856e-02, -7.14995861e-02,
-1.90513685e-01, 1.72579419e-02, -1.89847827e-01, 2.04318315e-01,
-4.94804990e-04, -1.00425832e-01, -2.45145261e-02, 1.63981542e-01,
-1.45653874e-01, 2.69459605e-01, -2.49348000e-01, -4.23478693e-01,
1.72101721e-01, -2.33451411e-01, -1.80588961e-02, 1.44084945e-01,
-1.00001099e-03, 1.98338985e-01, 1.92507535e-01, -1.30623341e-01,
-7.05276728e-02, -1.66634411e-01, -1.49547324e-01, 1.85020670e-01,
1.95668384e-01, -1.56739816e-01, -1.06223106e-01, -1.60616800e-01,
2.39245847e-01, 1.47702649e-01, 1.21150471e-01, -2.49310836e-01,
-1.10669866e-01, -8.79117772e-02, 1.82536706e-01, 3.26546691e-02,
1.69249430e-01, 1.70959141e-02, 4.93817814e-02, 7.77583430e-03,
-1.84142128e-01, 1.33904340e-02, 1.10010996e-01, -4.86998521e-02,
5.05935326e-02, -2.94754207e-01, 1.26824811e-01, -2.62340099e-01,
-5.71499839e-02, 2.18661547e-01, -5.79368174e-02, 2.14305073e-01,
2.28410965e-04, 1.37265712e-01, 1.12893008e-01, -3.38227987e-01,
-7.97498375e-02, 2.89683323e-02, -1.54006824e-01, -1.32086128e-01,
1.03580654e-01, 4.06690761e-02, -1.92804530e-01, 9.63973999e-03,
-6.48700222e-02, -6.88762739e-02, 4.79521938e-02, 3.03832978e-01,
-2.59389788e-01, 3.46401095e-01, -1.06269956e-01, -7.59946853e-02,
1.15125896e-02, 5.88730052e-02, -7.50966091e-03, -1.37814924e-01,
-1.35907531e-03, 1.28101259e-01, 2.76466906e-01, 5.75332232e-02,
-2.88491875e-01, -1.55083895e-01, -1.80856034e-01, -1.27918571e-01,
5.49529456e-02, -2.76215613e-01, -7.87807554e-02, 8.41445550e-02,
4.81731407e-02, 3.26566786e-01, -5.92513792e-02, -1.39655858e-01,
-1.49621695e-01, -2.57521514e-02, -2.72838682e-01, 7.26688057e-02,
1.03472546e-01, -1.62503228e-01, 2.39889622e-02, 2.57862747e-01,
-1.53003678e-01, -1.23614408e-01, -6.24490380e-02, 3.78105789e-01,
2.08718851e-01, 2.09210403e-02, 1.83302522e-01, -4.95651364e-02,
1.95827410e-01, -2.01158553e-01, -2.34887339e-02, 1.55502364e-01,
-6.06963746e-02, 1.41840905e-01, -4.55948375e-02, 7.59486184e-02,
-1.65563941e-01, 4.15077597e-01, -5.43421283e-02, -3.51880491e-02,
2.68175632e-01, -3.60581696e-01, 5.03844058e-04, 1.37063740e-02,
-1.10333703e-01, -1.05074406e-01, -1.10602118e-01, -1.41570672e-01],
dtype=float32)
Could you please tell me how to deal with numeric data embeddings?
Thanks!
Is torch==0.3.1 a mandatory install? I mean can I install torch==1.0.1 instead?
Stack trace:
File "<path>/deepmatcher/models/core.py", line 184, in run_train
return Runner.train(self, *args, **kwargs)
File "<path>/deepmatcher/runner.py", line 300, in train
model.initialize(train_dataset)
File "<path>/deepmatcher/models/core.py", line 366, in initialize
self.forward(init_batch)
File "<path>/deepmatcher/models/core.py", line 436, in forward
embeddings[right])
TypeError: type() takes 2 positional arguments but 3 were given
Temporary workaround:
Change the name of the attribute causing the problem, e.g., you can change "left_type" and "right_type" to "left_entity_type" and "right_entity_type" respectively.
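The renaming workaround above can be applied to the CSV files before they are passed to deepmatcher. Below is a minimal sketch using only the standard library; `rename_columns` and the sample data are hypothetical and not part of deepmatcher, and the column names are the ones mentioned in the workaround.

```python
import csv
import io

# Hypothetical helper (not part of deepmatcher): copies a CSV from src to dst,
# renaming header columns according to the `renames` mapping.
def rename_columns(src, dst, renames):
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader)
    # Columns missing from `renames` keep their original name.
    writer.writerow([renames.get(name, name) for name in header])
    writer.writerows(reader)

# Mapping taken from the workaround text above.
RENAMES = {"left_type": "left_entity_type", "right_type": "right_entity_type"}

# Made-up sample data, for illustration only.
sample = io.StringIO("id,label,left_type,right_type\n1,0,song,album\n")
fixed = io.StringIO()
rename_columns(sample, fixed, RENAMES)
print(fixed.getvalue())
```

With real files, open the source and destination using `open(path, newline="")` and pass the file objects in place of the `StringIO` buffers.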
I use Google Colab.
When I tried to process data with French fastText embeddings, I set it up like this:
train_set,validation_set = dm.data.process(
path='drive/My Drive/recommandersystem/deepmatcher_model',
cache='train_cache.pth',
train='train.csv',
validation='valid.csv',
embeddings='fasttext.fr.bin',
embeddings_cache_path='drive/My Drive/recommandersystem/deepmatcher_model',
ignore_columns=['id',''],
id_attr='_id',
label_attr='label',
left_prefix='ltable_',
right_prefix='rtable_')
and I get this error message:
HTTPError Traceback (most recent call last)
in ()
11 label_attr='label',
12 left_prefix='ltable_',
---> 13 right_prefix='rtable_')
13 frames
/usr/lib/python3.6/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
648 class HTTPDefaultErrorHandler(BaseHandler):
649 def http_error_default(self, req, fp, code, msg, hdrs):
--> 650 raise HTTPError(req.full_url, code, msg, hdrs, fp)
651
652 class HTTPRedirectHandler(BaseHandler):
HTTPError: HTTP Error 403: Forbidden
Please, how can I solve this?
The table generated by get_raw_table() in dataset.py does not have the same column order as the original input table. This is because "fields" is a dict rather than a list, so it carries no column order information.
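One way to work around the ordering issue on the caller's side is to keep the original header as an explicit list and reorder against it. This is only a sketch under that assumption: `restore_column_order`, the column names, and the sample row are hypothetical, and this is not how dataset.py is implemented.

```python
# Hypothetical helper: returns (name, value) pairs of `row` in the order
# given by `original_columns`, which is assumed to be the input file's header.
def restore_column_order(row, original_columns):
    ordered = [(name, row[name]) for name in original_columns if name in row]
    # Columns that were not in the original header are appended at the end.
    extras = [(name, row[name]) for name in row if name not in original_columns]
    return ordered + extras

# Made-up header and row, for illustration only.
original_columns = ["id", "left_title", "right_title", "label"]
row = {"label": 1, "id": 7, "right_title": "b", "left_title": "a"}

ordered = restore_column_order(row, original_columns)
print([name for name, _ in ordered])
```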
Hi,
during the execution of
train, validation, test = dm.data.process(
path='sample_data/itunes-amazon2',
train='train.csv',
validation='validate.csv',
test='test.csv')
a KeyError: tensor(2) is raised.
Are there any solutions? Could the problem be caused by a misconfigured deepmatcher installation?
Hi, I have successfully run the code by following the Getting Started tutorial.
I then tried to run deepmatcher on the structured walmart_amazon dataset, after reformatting the CSV files to match the itunes-amazon dataset from the Getting Started tutorial. But when I run this code:
train, validation, test = dm.data.process(
path='.',
train='train.csv',
validation='validation.csv',
test='test.csv')
it reports this error:
Computing principal components
0% [######] 100% | ETA: 00:00:00
Total time elapsed: 00:00:01
*** Error in `~/anaconda3/bin/python': double free or corruption (!prev): 0x00005613ec89a910 ***
Aborted (core dumped)
Do you have any idea what is wrong?
I am getting this error while trying to run the model on a platform other than Google Colab. It seems the model is not being saved. Is there a way around this? Also, is this a dependency issue?
try:
    import deepmatcher
except:
    !pip install -qqq deepmatcher
When running the above code in Python 3.6, I get the error below:
"Could not find a version that satisfies the requirement torch==0.3.1 (from deepmatcher) (from versions: 0.1.2, 0.1.2.post1)
No matching distribution found for torch==0.3.1 (from deepmatcher)"
And if I run pip install deepmatcher, I get:
Note: you may need to restart the kernel to use updated packages.
'C:\Users\ABC' is not recognized as an internal or external command,
operable program or batch file.
What should I do to resolve this issue?
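The "'C:\Users\ABC' is not recognized" message looks like a command line being split at a space in a path, though that is an assumption based only on the message. A minimal sketch that avoids shell quoting is to call pip through the running interpreter with an argument list. The helper names below are hypothetical, and this would not by itself address the separate "No matching distribution found for torch==0.3.1" error.

```python
import subprocess
import sys

# Hypothetical helper: builds the pip command as a list, so paths that
# contain spaces are passed as single arguments rather than split by a shell.
def pip_install_command(package):
    return [sys.executable, "-m", "pip", "install", package]

# Hypothetical helper: runs the command; raises CalledProcessError on failure.
def pip_install(package):
    subprocess.check_call(pip_install_command(package))

# Only builds the command here; call pip_install("deepmatcher") to run it.
cmd = pip_install_command("deepmatcher")
print(cmd)
```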