
deepmatcher's People

Contributors

jorgedc, sidharthms, xingkailiu


deepmatcher's Issues

Continue Training?

I have trained a model and I would love to continue training it on other data. Is that possible?
I tried it with model.run_train(...), but all I got back was:
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/torch/lib/THC/generic/THCTensorMath.cu:26

Thankful for advice 👍
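
A minimal sketch of what resuming training could look like, assuming the new data is processed with the same settings (and vocabulary) as the original training data; all file names here are hypothetical. A vocabulary mismatch between the new data and the embeddings the model was initialized with is one plausible cause of device-side asserts like the one above:

import deepmatcher as dm

# Process the new data exactly like the original training data, so that
# vocabulary and embedding indices line up with what the model expects.
new_train, new_valid = dm.data.process(
    path='data/',
    train='new_train.csv',
    validation='new_valid.csv')

model = dm.MatchingModel(attr_summarizer='hybrid')
model.load_state('best_model.pth')  # restore previously trained weights

# Calling run_train again continues optimizing the restored weights.
model.run_train(
    new_train,
    new_valid,
    epochs=5,
    batch_size=16,
    best_save_path='best_model_continued.pth')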

can't save model state

Hello,
please help me, the date for presenting my graduation project is nearing. I get an error when I try to save the model.
My code is:

import deepmatcher as dm
import pandas as pd
import numpy as np
import os
train_set,validation_set = dm.data.process(
    path="deepmatcher_model/",
    cache='train_cache.pth',
    train='train.csv',
    validation='valid.csv',
    embeddings='fasttext.fr.bin',
    embeddings_cache_path="deepmatcher_model/",
    ignore_columns=['id','','ltable_index','rtable_index'],
    id_attr='_id', 
    label_attr='label',
    left_prefix='ltable_', 
    right_prefix='rtable_')
model=dm.MatchingModel(attr_summarizer='hybrid')
model.initialize(train_set)
model.run_train(
    train_dataset=train_set,
    validation_dataset=validation_set,
    epochs=20,
    batch_size=16,
    best_save_path='deepmatcher_model/hybrid_model.pth',
    pos_neg_ratio=3)
model.save_state('hybrid_model.pth',include_meta=True)

The error message:

AttributeError                            Traceback (most recent call last)
<ipython-input-...> in <module>()
----> 1 model.save_state('hybrid_model.pth', include_meta=True)

4 frames
/usr/local/lib/python3.6/dist-packages/torch/serialization.py in _save(obj, f, pickle_module, pickle_protocol)
    196     pickler = pickle_module.Pickler(f, protocol=pickle_protocol)
    197     pickler.persistent_id = persistent_id
--> 198     pickler.dump(obj)
    199
    200     serialized_storage_keys = sorted(serialized_storages.keys())

AttributeError: Can't pickle local object 'MatchingDataset.finalize_metadata.<locals>.<lambda>'

Please tell me how I can solve this.
I am using Google Colab.
Thanks, and sorry for my English.
@thodrek @sidharthms @hanli91 @anhaidgroup

Inference on direct strings

I've seen examples using CSV files for both training and testing.
How can we use deepmatcher on strings directly, without going through files?
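
A workaround sketch, since dm.data.process_unlabeled reads from a file path (see the prediction issues elsewhere on this page): build the rows in memory with pandas and write them to a temporary CSV. The column names are assumptions about a model trained with left_/right_ prefixes and an id column, and model is assumed to be an already trained MatchingModel:

import os
import tempfile

import pandas as pd
import deepmatcher as dm

# Hypothetical single pair; the column layout must match the trained model.
rows = pd.DataFrame([{
    'id': 0,
    'left_name': 'iPhone 11 64GB black',
    'right_name': 'Apple iPhone 11 (64 GB) - Black',
}])

tmp = tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False)
rows.to_csv(tmp.name, index=False)
tmp.close()

unlabeled = dm.data.process_unlabeled(path=tmp.name, trained_model=model)
predictions = model.run_prediction(unlabeled)
os.unlink(tmp.name)  # clean up the temporary file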

Read tables with separator

It would be better if one could specify the data separator used to split text from the CSV. I think it currently only supports the default pandas separator, the comma.
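
Until such an option exists, a simple workaround is to rewrite the file with pandas before processing; a minimal sketch (file names and the semicolon separator are hypothetical):

import pandas as pd

# Rewrite a semicolon-separated file as a standard comma-separated CSV
# before handing it to dm.data.process.
df = pd.read_csv('train_semicolon.csv', sep=';')
df.to_csv('train.csv', index=False)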

Cannot achieve the precision/recall/F1 reported in SIGMOD18

Hi, I have just tested deepmatcher on the walmart-amazon dataset.
However, I cannot achieve the precision/recall/F1 reported in SIGMOD18.

import deepmatcher as dm
import logging
import torch

logging.getLogger('deepmatcher.core').setLevel(logging.INFO)

model = dm.MatchingModel(attr_summarizer=dm.attr_summarizers.RNN(word_aggregator='birnn-last-pool'))
# model = dm.MatchingModel()
print(model)
model.initialize(train_dataset)  # Explicitly initialize model.

model = model.cuda()
lr_decay = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
batch_size = [16, 32]
pos_neg_ratio = [10, 7, 5, 4, 3, 2, 1]

best_f1 = -1
best_params = {'lr_decay': -1, 'batch_size': -1, 'pos_neg_ratio':-1}
for alpha in lr_decay:
    for b in batch_size:
        for rho in pos_neg_ratio:
            optimizer = dm.optim.Optimizer(lr_decay=alpha)
            model.run_train(
                train_dataset,
                validation_dataset,
                epochs=15,
                batch_size = b,
                pos_neg_ratio=rho,
                optimizer=optimizer,
                best_save_path='../output/walmart-amazon/rnn_model.pth')
            print(f'lr_decay: {alpha}, batch_size:{b}, pos_neg_ratio:{rho}')
            f1 = model.run_eval(test_dataset)
            if f1 > best_f1:
                best_f1 = f1
                best_params['lr_decay'] = alpha
                best_params['batch_size'] = b
                best_params['pos_neg_ratio'] = rho
                print(f'best f1-score: {best_f1}, {best_params}')
print(f'best f1-score: {best_f1}, {best_params}')

I haven't yet tested all the hyperparameters, but the results so far show an f1-score of only about 37%.
However, the f1 reported in SIGMOD18 is 67.6% for RNN on the structured walmart-amazon dataset.
I use the data downloaded from the link in this repository.

Could you tell me what the problem is?
What are the hyperparameters selected for the experiments in SIGMOD18 (RNN)?

Why use mask_fill_ ?

Hi, I have just read your code.
I found that in many scripts you use masked_fill_ for tensor computation, e.g.:
word_aggregators.py:139-140
word_comparators.py:173-175
word_contextualizers.py:157-159

Could you please tell me why you use masked_fill_ for tensors?
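
For context (an illustrative sketch, not deepmatcher's actual code): a common reason is padding. Sequences in a batch are padded to equal length, and masked_fill_ overwrites the padded positions in place, for example with -inf before a softmax, so they contribute nothing to the result:

import torch
import torch.nn.functional as F

# Attention scores over a padded sequence: true length 3, padded to 5.
scores = torch.tensor([2.0, 1.0, 0.5, 0.0, 0.0])
pad_mask = torch.tensor([False, False, False, True, True])

# In-place fill of padded positions with -inf ...
scores.masked_fill_(pad_mask, float('-inf'))
# ... so softmax assigns them exactly zero weight.
weights = F.softmax(scores, dim=0)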


RuntimeError: dimension specified as 0 but tensor has no dimensions

A RuntimeError is raised when using SIF as the attribute summarizer.
torch==0.3.1
torchtext==0.2.3

Traceback (most recent call last):
  File "matching.py", line 107, in <module>
    best_save_path=path.join(RESULTS_DIR, 'models', 'rnn_sif_fasttext_model.pth')
  File "/home/belerico/.local/share/virtualenvs/prova-Z0l7suQy/src/deepmatcher/deepmatcher/models/core.py", line 183, in run_train
    return Runner.train(self, *args, **kwargs)
  File "/home/belerico/.local/share/virtualenvs/prova-Z0l7suQy/src/deepmatcher/deepmatcher/runner.py", line 300, in train
    model.initialize(train_dataset)
  File "/home/belerico/.local/share/virtualenvs/prova-Z0l7suQy/src/deepmatcher/deepmatcher/models/core.py", line 357, in initialize
    self.forward(init_batch)
  File "/home/belerico/.local/share/virtualenvs/prova-Z0l7suQy/src/deepmatcher/deepmatcher/models/core.py", line 427, in forward
    embeddings[right])
  File "/home/belerico/.local/share/virtualenvs/prova-Z0l7suQy/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/belerico/.local/share/virtualenvs/prova-Z0l7suQy/src/deepmatcher/deepmatcher/models/modules.py", line 122, in forward
    return self._forward(input, *args, **kwargs)
  File "/home/belerico/.local/share/virtualenvs/prova-Z0l7suQy/src/deepmatcher/deepmatcher/models/core.py", line 548, in _forward
    left_aggregated = self.word_aggregator(left_compared, left_aggregator_context)
  File "/home/belerico/.local/share/virtualenvs/prova-Z0l7suQy/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/belerico/.local/share/virtualenvs/prova-Z0l7suQy/src/deepmatcher/deepmatcher/models/modules.py", line 122, in forward
    return self._forward(input, *args, **kwargs)
  File "/home/belerico/.local/share/virtualenvs/prova-Z0l7suQy/src/deepmatcher/deepmatcher/models/modules.py", line 286, in _forward
    return self.module.forward(*args, **kwargs)
  File "/home/belerico/.local/share/virtualenvs/prova-Z0l7suQy/src/deepmatcher/deepmatcher/models/modules.py", line 260, in forward
    inputs = module(inputs)
  File "/home/belerico/.local/share/virtualenvs/prova-Z0l7suQy/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/belerico/.local/share/virtualenvs/prova-Z0l7suQy/src/deepmatcher/deepmatcher/models/modules.py", line 122, in forward
    return self._forward(input, *args, **kwargs)
  File "/home/belerico/.local/share/virtualenvs/prova-Z0l7suQy/src/deepmatcher/deepmatcher/models/modules.py", line 745, in _forward
    pc = Variable(input_with_meta.pc).unsqueeze(0).repeat(v.shape[0], 1)
RuntimeError: dimension specified as 0 but tensor has no dimensions

Unexplained increase of RAM consumption at each epoch during training, until crash

When training a model on my own data, the consumption of RAM steadily increases during each epoch, until there is a crash caused by lack of memory.
Also, when the state of the model is saved at the end of each epoch, I noticed that the temporary increase in RAM (probably needed to create temp files / memmaps / something like that) is larger each time than it was the time before.

I also have a similar problem when I use a model for prediction: when the input data file is too big, I split it and run the model on one part at a time. After each run, the RAM consumption has increased. I found that I can mitigate this by calling the deepmatcher.data.reset_vector_cache function, which resets at least part of the increase in RAM consumption.
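
For reference, that chunked prediction loop looks roughly like this (a sketch; file and variable names are hypothetical, and the model is assumed to be already trained):

import deepmatcher as dm

for i, chunk_file in enumerate(['part_0.csv', 'part_1.csv', 'part_2.csv']):
    unlabeled = dm.data.process_unlabeled(path=chunk_file, trained_model=model)
    predictions = model.run_prediction(unlabeled)
    predictions.to_csv(f'predictions_{i}.csv')
    # Release the word-vector cache between chunks, which mitigates
    # (part of) the RAM growth described above.
    dm.data.reset_vector_cache()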

As such, I tend to interpret the RAM consumption problem during training as a problem of ever-increasing vector caching as well. But I don't really understand how this is possible, since:

  • The vectors that are constant are the underlying word or character vectors. After one training epoch, all of those needed to process the training set have been loaded into the cache, so there should be no further increase in cache space during subsequent epochs.
  • The vectors that are not constant are the ones created during a forward pass of the model, and these should not be cached, since they change from one epoch to another.

So, if the hypothesis that vector caching is behind the RAM increase is correct, why does RAM usage continue to grow during the epochs after the first one? And why does the amount of temporary additional RAM needed while saving the model state grow from one epoch to the next?

Here is the code I use.
'train' and 'validation' are the data used during an epoch.
'pos_neg_ratio' is computed directly from the training dataset.
I first ran the code on Google Colab with a GPU, and then on a full GCP VM without a GPU, and the problem appeared on both machines.
With my data, I cannot get past epoch n°4.

model = dm.MatchingModel(attr_summarizer='hybrid')

model.run_train(
    train,
    validation,
    epochs=10,
    batch_size=16,
    pos_neg_ratio=pos_neg_ratio,
    device="cuda:0")

cannot install on windows using Python 3.6.8

I haven't seen a solution to this issue in the comments. First it errors trying to install torch 0.3.1; attempting to pip install torch separately also results in a build error. Any ideas?

The error mentions something about needing a GPU. Could that be the issue?

Also this: ModuleNotFoundError: No module named 'tools.nnwrap'

Schema: an abstraction that I think is missing in the project

While working on patching the project to accept in-memory files, I found myself refactoring the same inner logic inside the project over and over.

In the end, I came up with a Schema class, a simple abstraction that hides all the details of the underlying dataset.

This also aligns with recent developments in PyTorch NLP engineering with Apache Arrow in-memory backends (see https://github.com/huggingface/nlp).

What is the roadmap for this project? Is there an interest to keep evolving it?

from deepmatcher.data import MatchingField  # field abstraction used below


class ExampleSchema(object):
    def __init__(self,
                 header,
                 id_attr,
                 label_attr,
                 left_prefix,
                 right_prefix,
                 ignore_columns=(),
                 tokenize='nltk',
                 lower=True,
                 include_lengths=True):
        """
        Checks that:
        * There is a label column
        * There is an ID column
        * All columns except the label and ID columns, and ignored columns start with either
            the left table attribute prefix or the right table attribute prefix.
        * The number of left and right table attributes are the same.
        
        Create field metadata, i.e., attribute processing specification for each
        attribute.

        This includes fields for label and ID columns.

        Returns:
            list(tuple(str, MatchingField)): A list of tuples containing column name
                (e.g. "left_address") and corresponding :class:`~data.MatchingField` pairs,
                in the same order that the columns occur in the CSV file.
        """
        # assert id_attr in header
        if label_attr:
            assert label_attr in header

        for attr in header:
            if attr not in (id_attr,
                            label_attr) and attr not in ignore_columns:
                if not attr.startswith(left_prefix) and not attr.startswith(
                        right_prefix):
                    raise ValueError(
                        'Attribute ' + attr +
                        ' is not a left or a right table '
                        'column, not a label or id and is not ignored. Not sure '
                        'what it is...')

        num_left = sum(attr.startswith(left_prefix) for attr in header)
        num_right = sum(attr.startswith(right_prefix) for attr in header)

        assert num_left == num_right, "left,right attributes mismatch"

        text_field = MatchingField(lower=lower,
                                   tokenize=tokenize,
                                   init_token='<<<',
                                   eos_token='>>>',
                                   batch_first=True,
                                   include_lengths=include_lengths)
        numeric_field = MatchingField(sequential=False,
                                      preprocessing=int,
                                      use_vocab=False)
        id_field = MatchingField(sequential=False, use_vocab=False, id=True)

        fields = []
        for attr in header:
            if attr == id_attr:
                fields.append((attr, id_field))
            elif attr == label_attr:
                fields.append((attr, numeric_field))
            elif attr in ignore_columns:
                fields.append((attr, None))
            else:
                fields.append((attr, text_field))

        self.fields = dict(fields)
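
A usage sketch for the class above (header and parameters are hypothetical):

# Column names follow the ltable_/rtable_ convention used elsewhere on
# this page.
header = ['_id', 'label', 'ltable_name', 'ltable_price',
          'rtable_name', 'rtable_price']
schema = ExampleSchema(
    header,
    id_attr='_id',
    label_attr='label',
    left_prefix='ltable_',
    right_prefix='rtable_')
print(schema.fields.keys())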

Can't download fastText Slovenian binaries

The fastText binaries download does not work unless you select the English language.
Looking at the code, I presume that is because the English binaries are located in a Google Drive folder, while the others are downloaded automatically from an AWS server that seems unreachable.

CPU cores to speed up

Hi,
Can you please tell how well deepmatcher scales (X times faster) as the number of CPU cores increases, for the same process and data?

Thanks
Sumon

Unable to predict

When I try to make predictions over unlabeled data with

unlabeled_data = dm.data.process_unlabeled(path='sample_data/itunes-amazon/unlabeled.csv', trained_model=model)
predictions = model.run_prediction(unlabeled_data)
predictions.head()

I get
NameError: name 'fn' is not defined

problem when installing package deepmatcher

I want to install deepmatcher with pip:
pip3 install deepmatcher
or
pip install deepmatcher
and I get this error:

Could not find a version that satisfies the requirement torch==0.3.1 (from deepmatcher) (from versions: 0.1.2, 0.1.2.post1)
No matching distribution found for torch==0.3.1 (from deepmatcher)

i have problem when i tried to load model

Hello,
I have a problem when I load the model that I saved yesterday.
My code:

model2 = dm.MatchingModel()
model2.load_state('/content/drive/My Drive/recommandersystem/deepmatcher_model/hybrid_model.pth')

The error message:

EOFError                                  Traceback (most recent call last)
<ipython-input-...> in <module>()
      1 model2 = dm.MatchingModel()
----> 2 model2.load_state('/content/drive/My Drive/recommandersystem/deepmatcher_model/hybrid_model.pth')

2 frames
/usr/local/lib/python3.6/dist-packages/torch/serialization.py in _load(f, map_location, pickle_module)
    418     unpickler = pickle_module.Unpickler(f)
    419     unpickler.persistent_load = persistent_load
--> 420     result = unpickler.load()
    421
    422     deserialized_storage_keys = pickle_module.load(f)

EOFError: Ran out of input

thanks

OverflowError: Python int too large to convert to C long

import deepmatcher as dm
import torch

torch.cuda.is_available()
import pandas as pd
pd.read_csv(r'E:\project\deepmatcher\examples\sample_data\itunes-amazon\train.csv').head()

train, validation, test = dm.data.process(
    path=r'E:\project\deepmatcher\examples\sample_data\itunes-amazon',
    train='train.csv',
    validation='validation.csv',
    test='test.csv')
model = dm.MatchingModel()

I am on a Windows system.
How can I deal with this problem?

Update tests

The test/test_datasets folder does not contain the vectors required to run the tests, and Travis could not find nltk.

Difference in prediction when using individual unlabeled records

Hi, I'm having an issue with calculating predictions and I managed to reproduce it with your itunes-amazon dataset. The issue is as follows:

When following your Getting Started notebook the predictions for the first 5 records are:

id       match_score
141999   0.094718
302034   0.283064
126354   0.794840
714676   0.282333
659997   0.175172

These are different from your results, but I can imagine that is due to my model (with an f1-score of 70.77 on the test set) being different.

However, when I feed record 714676 separately into the processing and prediction functions, I get a match_score of 0.272498. That may not seem like a big difference, but in my own datasets the differences are much larger (> 0.1).

Any idea what might be causing this difference? Is it the preprocessing step or the prediction step?

Some information about my environment:
System
Google Cloud Compute Engine VM with 1 Tesla K80 GPU
Linux 4.9.0-9-amd64 #1 SMP Debian 4.9.168-1+deb9u4 (2019-07-19) x86_64 GNU/Linux
Software
deepmatcher 0.1.0.post1
fasttext 0.9.1
pandas 0.25.1
torch 0.3.1
torchtext 0.2.3
cuda 10.0

RuntimeError: Expected object of backend CPU but got backend CUDA for argument #3 'index' upgrade to torch 1.0.1

After upgrading to torch==1.0.1 and torchtext>=0.2.3 I receive the following error. Is there a way I can fix it? I've tried setting the device on the modules in question, but nothing has worked.

Traceback (most recent call last):
  File "matching.py", line 42, in <module>
    model.initialize(train)
  File "/home/belerico/.local/share/virtualenvs/ceneje_prodmatch-Z0l7suQy/src/deepmatcher/deepmatcher/models/core.py", line 357, in initialize
    self.forward(init_batch)
  File "/home/belerico/.local/share/virtualenvs/ceneje_prodmatch-Z0l7suQy/src/deepmatcher/deepmatcher/models/core.py", line 421, in forward
    embeddings[name] = self.embed[name](attr_input)
  File "/home/belerico/.local/share/virtualenvs/ceneje_prodmatch-Z0l7suQy/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/belerico/.local/share/virtualenvs/ceneje_prodmatch-Z0l7suQy/src/deepmatcher/deepmatcher/models/modules.py", line 187, in forward
    results = self.module(*module_args)
  File "/home/belerico/.local/share/virtualenvs/ceneje_prodmatch-Z0l7suQy/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/belerico/.local/share/virtualenvs/ceneje_prodmatch-Z0l7suQy/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 117, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/home/belerico/.local/share/virtualenvs/ceneje_prodmatch-Z0l7suQy/lib/python3.6/site-packages/torch/nn/functional.py", line 1506, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected object of backend CPU but got backend CUDA for argument #3 'index'

AttributeError: module 'torch' has no attribute 'float32'

Hi, I am attempting to use deepmatcher on macOS. I have installed it using

pip3 install deepmatcher

which seems to have worked:

Installing collected packages: torch, torchtext, deepmatcher
Successfully installed deepmatcher-0.1.0.post1 torch-0.3.1 torchtext-0.3.1

I am currently using

python3 --version
Python 3.6.5

When attempting to load deepmatcher using import deepmatcher as dm I get the following error:

>>> import deepmatcher as dm
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/site-packages/deepmatcher/__init__.py", line 10, in <module>
    from .data import process as data_process
  File "/usr/local/lib/python3.6/site-packages/deepmatcher/data/__init__.py", line 1, in <module>
    from .field import MatchingField, reset_vector_cache
  File "/usr/local/lib/python3.6/site-packages/deepmatcher/data/field.py", line 11, in <module>
    from torchtext import data, vocab
  File "/usr/local/lib/python3.6/site-packages/torchtext/__init__.py", line 1, in <module>
    from . import data
  File "/usr/local/lib/python3.6/site-packages/torchtext/data/__init__.py", line 4, in <module>
    from .field import RawField, Field, ReversibleField, SubwordField, NestedField, LabelField
  File "/usr/local/lib/python3.6/site-packages/torchtext/data/field.py", line 61, in <module>
    class Field(RawField):
  File "/usr/local/lib/python3.6/site-packages/torchtext/data/field.py", line 118, in Field
    torch.float32: float,
AttributeError: module 'torch' has no attribute 'float32'

Any help would be greatly appreciated.

cannot install the package torch==0.3.1

When I want to install the deepmatcher package I get this error message:
Could not find a version that satisfies the requirement torch==0.3.1 (from deepmatcher==0.1.0.post1) (from versions: 0.1.2, 0.1.2.post1)
No matching distribution found for torch==0.3.1 (from deepmatcher==0.1.0.post1)

Any solution please?
I have Windows 10 and use Python 3.6.
Also, how can I install it with conda?
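
Not an official answer, but a sketch of one possible workaround: as the error shows, PyPI only ever carried torch 0.1.2, so pip alone can never satisfy the torch==0.3.1 pin; PyTorch has to come from elsewhere first. Installing a PyTorch build that exists for your platform and then installing deepmatcher without letting pip resolve its dependencies may work, though newer torch versions can surface other incompatibilities (see the torch 1.0.1 issue above). Untested:

conda install pytorch -c pytorch
pip install deepmatcher --no-deps
pip install torchtext nltk

This may still hit version conflicts; the simplest path may be a Linux environment (e.g., Colab) with Python 3.6.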

trying to run "getting started"

Hello there,

I am trying to run your Getting Started notebook in Colab, configured with Python 3 + GPU. Could you help me figure out what is going on there?

(screenshot of the error, 2019-10-10, attached)

Thanks.

Getting from data link points to correctly formatted data

Hi,

I'm trying to figure out whether you have any code for processing data like the fodors-zagat dataset from the data link (i.e., ltable_id, rtable_id, label) into the structure you identify as required for model input (i.e., ltable_id, lfeature_1, lfeature_2, rtable_id, rfeature_1, rfeature_2), and for producing an unlabeled.csv. My data (different from the fodors-zagat dataset) is structured like the linked data above, and I looked through the repository but couldn't find anything to accomplish this processing. Just wanted to make sure it isn't there before I write it myself... Thanks!
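
For reference, a hedged sketch of the merge this would involve, assuming Magellan-style files pairs.csv, ltable.csv and rtable.csv, each table with an id column (all file and column names are assumptions):

import pandas as pd

pairs = pd.read_csv('pairs.csv')    # ltable_id, rtable_id, label
ltable = pd.read_csv('ltable.csv')  # id, feature_1, feature_2, ...
rtable = pd.read_csv('rtable.csv')

# add_prefix renames 'id' to 'ltable_id'/'rtable_id', which the pair
# file already uses as join keys.
merged = (pairs
          .merge(ltable.add_prefix('ltable_'), on='ltable_id')
          .merge(rtable.add_prefix('rtable_'), on='rtable_id'))
merged.insert(0, 'id', range(len(merged)))
merged.to_csv('train.csv', index=False)

The leftover ltable_id/rtable_id columns can then be passed to dm.data.process via ignore_columns, as other issues on this page do.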

Issue in installing deepmatcher library

try:
    import deepmatcher
except ImportError:
    !pip install -qqq deepmatcher

While running the above code in Python 3.6 I am getting the error mentioned below:

File "/home/vikrant/anaconda2/lib/python2.7/site-packages/deepmatcher/data/field.py", line 163
    def build_vocab(self, *args, vectors=None, cache=None, **kwargs):
                                 ^
SyntaxError: invalid syntax

The error points at vectors.

fasttext embedding vector for token having numerics only

Hi,

I am trying to train on a data set that has a date-of-birth attribute with values like '01-01-1970'. Can the fastText character-level embedding generate a vector for such a token, or for strings containing numerics, during deepmatcher's data processing? As far as I know, fastText works with alphabetic text only.

I would appreciate any information you can share.

Thanks
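
For context, a hedged sketch: fastText composes word vectors from character n-grams, so it can produce a vector for tokens containing digits too (the same API appears in the numeric-embeddings issue further down; the model path is hypothetical):

import fastText

model = fastText.load_model('wiki.en.bin')
# Built from character n-grams, so digit-heavy tokens still get a vector.
vec = model.get_word_vector('01-01-1970')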

SIGMOD experiments reproducibility

Hi,
I am trying to reproduce the experiments from the SIGMOD 2018 paper: http://pages.cs.wisc.edu/~anhai/papers1/deepmatcher-sigmod18.pdf. I am having a hard time finding the right setup, and I get results far poorer than the ones reported in the paper for most of the datasets. Can you please give me a hint about the right setup? For example, what are the parameters for the hybrid setup? Using the defaults leads to poor results, and following the existing guides in the repository did not help much.

As an example, for the (complete) iTunes-Amazon scenario the best I could obtain was F1: 35.09 | Prec: 33.33 | Rec: 37.04. But the paper reports better results.

Thank you!

Error in running dm.data.process

I am getting this error while running this code:

train, validation, test = dm.data.process(
    path='sample_data/itunes-amazon',
    train='train.csv',
    validation='validation.csv',
    test='test.csv')

The error:

Reading and processing data from "sample_data/itunes-amazon/train.csv"
0% [############################# ] 100% | ETA: 00:00:00
Reading and processing data from "sample_data/itunes-amazon/validation.csv"
0% [############################# ] 100% | ETA: 00:00:00
Reading and processing data from "sample_data/itunes-amazon/test.csv"
0% [############################# ] 100% | ETA: 00:00:00

ValueError                                Traceback (most recent call last)
<ipython-input-...> in <module>()
      3     train='train.csv',
      4     validation='validation.csv',
----> 5     test='test.csv')

7 frames
/usr/local/lib/python3.6/dist-packages/fastText/FastText.py in __init__(self, model)
     35         self.f = fasttext.fasttext()
     36         if model is not None:
---> 37             self.f.loadModel(model)
     38
     39     def is_quantized(self):

ValueError: /root/.vector_cache/wiki.en.bin has wrong file format!

Value error

I get a ValueError; I tried uninstalling and reinstalling several packages, but nothing worked :(
I prepared the sets as they are supposed to be, including the 'left' and 'right' prefixes, id, label and so on.


train, validation, test = dm.data.process(
    path='/home/censored/quora-question-pairs',
    train='train.csv',
    validation='validation.csv',
    test='test.csv')

including the Error

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-989ecb797b84> in <module>()
      4     train='train.csv',
      5     validation='validation.csv',
----> 6     test='test.csv')

/home/censored/.local/lib/python3.6/site-packages/deepmatcher/data/process.py in process(path, train, validation, test, unlabeled, cache, check_cached_data, auto_rebuild_cache, tokenize, lowercase, embeddings, embeddings_cache_path, ignore_columns, include_lengths, id_attr, label_attr, left_prefix, right_prefix, use_magellan_convention, pca)
    195 
    196     _maybe_download_nltk_data()
--> 197     _check_header(header, id_attr, left_prefix, right_prefix, label_attr, ignore_columns)
    198     fields = _make_fields(header, id_attr, label_attr, ignore_columns, lowercase,
    199                           tokenize, include_lengths)

/home/censored/.local/lib/python3.6/site-packages/deepmatcher/data/process.py in _check_header(header, id_attr, left_prefix, right_prefix, label_attr, ignore_columns)
     32         if attr not in (id_attr, label_attr) and attr not in ignore_columns:
     33             if not attr.startswith(left_prefix) and not attr.startswith(right_prefix):
---> 34                 raise ValueError('Attribute ' + attr + ' is not a left or a right table '
     35                                  'column, not a label or id and is not ignored. Not sure '
     36                                  'what it is...')

ValueError: Attribute  is not a left or a right table column, not a label or id and is not ignored. Not sure what it is...

How can I avoid this?
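
A hedged diagnosis: the doubled space in "Attribute  is not..." suggests an empty (unnamed) column header in the CSV, e.g., a stray index column. Other issues on this page pass '' via ignore_columns; alternatively, drop the column before processing. A sketch (file name hypothetical):

import pandas as pd

df = pd.read_csv('train.csv')
# Drop empty-named columns and pandas' auto-generated 'Unnamed: 0' columns.
df = df.loc[:, [c for c in df.columns if c and not c.startswith('Unnamed')]]
df.to_csv('train.csv', index=False)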

Error: Expected object of backend CPU but got backend CUDA for argument #3 'index'

When I run the data processing code
train, validation, test = dm.data.process(path='sample_data/itunes-amazon', train='train.csv', validation='validation.csv', test='test.csv')
I get the error "Expected object of backend CPU but got backend CUDA for argument #3 'index'".
My device supports CUDA: torch.cuda.is_available() is True, and torch.device('cuda:0' if torch.cuda.is_available() else 'cpu') yields device(type='cuda', index=0).

How can I fix this problem? Hoping for a solution.

Invalid device, must be cuda device

Hi, I get a runtime error when I run dm.data.process() saying RuntimeError: Invalid device, must be cuda device. I have already set my environment variable: os.environ['CUDA_VISIBLE_DEVICES'] = '0'
cuda_device = torch.device('cuda:0')

I am wondering where I can pass the device variable in dm.data.process().

GPU tcmalloc: large alloc error while dm.data.process()

I'm trying to run the tutorial notebook on my GPU and when running the tutorial script:

train, validation, test = dm.data.process(
    path='sample_data/itunes-amazon',
    train='train.csv',
    validation='validation.csv',
    test='test.csv')

I get the following output:

Reading and processing data from "sample_data/itunes-amazon/train.csv"
0% [############################# ] 100% | ETA: 00:00:00
Reading and processing data from "sample_data/itunes-amazon/validation.csv"
0% [############################# ] 100% | ETA: 00:00:00
Reading and processing data from "sample_data/itunes-amazon/test.csv"
0% [############################# ] 100% | ETA: 00:00:00tcmalloc: large alloc 2013274112 bytes == 0x55dd7d8ac000 @  0x7ff96db501e1 0x7ff96df237e8 0x7ff91ede0f7b 0x7ff91ede1647 0x7ff91ede7058 0x7ff91ede7d99 0x7ff91edaf775 0x7ff91edd247e 0x55dcf69a9114 0x55dcf69a9231 0x55dcf6a0de8f 0x55dcf69626f9 0x55dcf6963805 0x55dcf697e943 0x55dcf69bd06a 0x55dcf69bde28 0x55dcf6a0d626 0x55dcf69a868b 0x55dcf6a0d6c9 0x55dcf69626f9 0x55dcf69a8917 0x55dcf6a0a0a6 0x55dcf69626f9 0x55dcf6963a30 0x55dcf697e943 0x55dcf69bd06a 0x55dcf69bde28 0x55dcf6a0e3d7 0x55dcf69a868b 0x55dcf6a0d6c9 0x55dcf69629da
tcmalloc: large alloc 4026540032 bytes == 0x55ddf58ae000 @  0x7ff96db501e1 0x7ff96df237e8 0x7ff91ede0f7b 0x7ff91ede1647 0x7ff91ede7058 0x7ff91ede7d99 0x7ff91edaf775 0x7ff91edd247e 0x55dcf69a9114 0x55dcf69a9231 0x55dcf6a0de8f 0x55dcf69626f9 0x55dcf6963805 0x55dcf697e943 0x55dcf69bd06a 0x55dcf69bde28 0x55dcf6a0d626 0x55dcf69a868b 0x55dcf6a0d6c9 0x55dcf69626f9 0x55dcf69a8917 0x55dcf6a0a0a6 0x55dcf69626f9 0x55dcf6963a30 0x55dcf697e943 0x55dcf69bd06a 0x55dcf69bde28 0x55dcf6a0e3d7 0x55dcf69a868b 0x55dcf6a0d6c9 0x55dcf69629da

And then it hangs.

I checked if the GPU is visible through torch and it is.

Is there a way of reducing the memory allocation?
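
Not an official fix, but a possibly relevant knob (untested sketch): dm.data.process exposes a pca flag, visible in its signature in the traceback of the "Value error" issue above. Disabling it may avoid the large allocation that appears to happen during the "Computing principal components" step shown in the double-free issue below:

train, validation, test = dm.data.process(
    path='sample_data/itunes-amazon',
    train='train.csv',
    validation='validation.csv',
    test='test.csv',
    pca=False)  # assumption: pca defaults to on and drives the allocation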

Way to specify weights for the features

I'm using deepmatcher for product matching. I came across a scenario where I want to give different weights to different features.
E.g., I have 5 columns (A, B, C, D, E) in each of the left and right sets. I want to give more importance to columns A and B than to the rest. Is there any way to do this in deepmatcher?
Something like a weight vector [0.4, 0.2, 0.1, 0.1, 0.1, 0.1].

About the embeddings of numerical data

Hi, I've read your code.
If I am not mistaken, the embedding is computed from the fastText model downloaded from
https://drive.google.com/uc?export=download&id=1Vih8gAmgBnuYDxfblbT94P6WjB7s1ZSh

However, I am not so sure how this fastText embedding deals with numerical values. I can get the embedding of some numeric values from this compressed file directly.

My code is pasted below:

import fastText

model = fastText.load_model(path)
model.get_word_vector('1.29')

The output looks like this:

array([-1.36406599e-02, -2.27776930e-01, -3.18875402e-01, 7.06118524e-01,
-3.12648743e-01, 2.43467003e-01, -1.02767743e-01, -3.95509809e-01,
2.36076638e-01, 3.91702265e-01, 1.61795065e-01, 3.05263754e-02,
1.05867624e-01, 1.74811155e-01, 2.04814345e-01, -1.86537594e-01,
2.55494360e-02, -1.98053271e-01, 8.03033933e-02, 1.76812127e-01,
4.34105396e-02, 1.29547983e-01, -3.17463964e-01, -5.20383835e-01,
-2.43617725e-02, -7.41683841e-02, -8.27122629e-02, 1.17084600e-01,
1.25584468e-01, 6.53219968e-02, -6.95866272e-02, 3.29474926e-01,
-3.14294517e-01, 2.79334456e-01, -2.46289775e-01, 5.94363129e-03,
-3.22101228e-02, -9.52130780e-02, 1.58929497e-01, -1.14688493e-01,
-3.16427574e-02, -1.76016539e-01, -1.87940076e-01, -1.46237537e-01,
-3.98945883e-02, -1.22498609e-01, 9.15530622e-02, 6.35887161e-02,
8.42858702e-02, 4.53437585e-03, -3.56534153e-01, 5.57965748e-02,
-3.38896476e-02, 1.23230398e-01, -1.83964506e-01, 2.08228782e-01,
4.10810187e-02, 1.98457673e-01, -1.58749551e-01, 2.27738291e-01,
-7.31550083e-02, -2.98727542e-01, 2.16822296e-01, -4.91270497e-02,
-1.06079698e-01, -2.47507811e-01, -4.14096743e-01, -1.95570275e-01,
1.56012088e-01, 3.75828221e-02, -4.57418412e-01, 4.39721309e-02,
2.05725119e-01, -2.20915675e-01, 1.10607982e-01, -1.37868347e-02,
5.35952687e-01, 1.82460938e-02, -5.61991110e-02, -1.59084573e-01,
2.89105892e-01, -3.88202481e-02, -1.57112852e-02, 9.92794558e-02,
-5.02652168e-01, -4.81455810e-02, -1.25789657e-01, 1.05233647e-01,
1.36335000e-01, -1.60364717e-01, 4.15847301e-02, 5.85811920e-02,
3.02284360e-02, -9.40186158e-02, -1.46366581e-01, 1.04580402e-01,
1.44954011e-01, -8.10427219e-02, -2.52872854e-01, 2.61546880e-01,
5.68291619e-02, 3.77267972e-02, 1.76686642e-03, 9.93822962e-02,
6.99491519e-03, 5.82082570e-02, 7.09483027e-02, -2.14030035e-02,
-1.68091357e-01, 3.08654398e-01, 2.23699287e-01, -4.38281268e-01,
2.84722336e-02, -1.44860461e-01, -3.60347666e-02, -2.86953785e-02,
3.16625744e-01, 3.14383358e-01, -8.37426037e-02, 1.59112439e-02,
3.63424510e-01, -2.88430214e-01, -5.00641167e-01, 1.58014759e-01,
1.60332680e-01, 4.05774731e-03, -2.23307282e-01, -8.44741017e-02,
8.90725628e-02, 2.42359862e-01, 1.04806900e-01, 1.37488708e-01,
6.99754432e-02, -8.64755437e-02, 1.30111221e-02, 8.31332207e-02,
-4.22606841e-02, 1.11255124e-01, -1.57022268e-01, 1.78717270e-01,
-2.79926240e-01, 1.30639657e-01, -2.83100437e-02, 1.12825038e-03,
2.09572345e-01, -1.69247791e-01, -8.15944299e-02, -8.19083378e-02,
3.97944562e-02, 4.19981182e-02, 2.88578361e-01, -4.15978402e-01,
-3.40752304e-05, 8.56315866e-02, 1.16690114e-01, 1.64606094e-01,
6.05781339e-02, -1.87327877e-01, 2.52706140e-01, -5.35242222e-02,
2.21059874e-01, 5.93100078e-02, -1.46213770e-02, 1.78334340e-01,
3.03723902e-01, 7.16771334e-02, 5.16987927e-02, -1.03483133e-01,
9.72970948e-02, 7.37599423e-03, -9.61232856e-02, -7.14995861e-02,
-1.90513685e-01, 1.72579419e-02, -1.89847827e-01, 2.04318315e-01,
-4.94804990e-04, -1.00425832e-01, -2.45145261e-02, 1.63981542e-01,
-1.45653874e-01, 2.69459605e-01, -2.49348000e-01, -4.23478693e-01,
1.72101721e-01, -2.33451411e-01, -1.80588961e-02, 1.44084945e-01,
-1.00001099e-03, 1.98338985e-01, 1.92507535e-01, -1.30623341e-01,
-7.05276728e-02, -1.66634411e-01, -1.49547324e-01, 1.85020670e-01,
1.95668384e-01, -1.56739816e-01, -1.06223106e-01, -1.60616800e-01,
2.39245847e-01, 1.47702649e-01, 1.21150471e-01, -2.49310836e-01,
-1.10669866e-01, -8.79117772e-02, 1.82536706e-01, 3.26546691e-02,
1.69249430e-01, 1.70959141e-02, 4.93817814e-02, 7.77583430e-03,
-1.84142128e-01, 1.33904340e-02, 1.10010996e-01, -4.86998521e-02,
5.05935326e-02, -2.94754207e-01, 1.26824811e-01, -2.62340099e-01,
-5.71499839e-02, 2.18661547e-01, -5.79368174e-02, 2.14305073e-01,
2.28410965e-04, 1.37265712e-01, 1.12893008e-01, -3.38227987e-01,
-7.97498375e-02, 2.89683323e-02, -1.54006824e-01, -1.32086128e-01,
1.03580654e-01, 4.06690761e-02, -1.92804530e-01, 9.63973999e-03,
-6.48700222e-02, -6.88762739e-02, 4.79521938e-02, 3.03832978e-01,
-2.59389788e-01, 3.46401095e-01, -1.06269956e-01, -7.59946853e-02,
1.15125896e-02, 5.88730052e-02, -7.50966091e-03, -1.37814924e-01,
-1.35907531e-03, 1.28101259e-01, 2.76466906e-01, 5.75332232e-02,
-2.88491875e-01, -1.55083895e-01, -1.80856034e-01, -1.27918571e-01,
5.49529456e-02, -2.76215613e-01, -7.87807554e-02, 8.41445550e-02,
4.81731407e-02, 3.26566786e-01, -5.92513792e-02, -1.39655858e-01,
-1.49621695e-01, -2.57521514e-02, -2.72838682e-01, 7.26688057e-02,
1.03472546e-01, -1.62503228e-01, 2.39889622e-02, 2.57862747e-01,
-1.53003678e-01, -1.23614408e-01, -6.24490380e-02, 3.78105789e-01,
2.08718851e-01, 2.09210403e-02, 1.83302522e-01, -4.95651364e-02,
1.95827410e-01, -2.01158553e-01, -2.34887339e-02, 1.55502364e-01,
-6.06963746e-02, 1.41840905e-01, -4.55948375e-02, 7.59486184e-02,
-1.65563941e-01, 4.15077597e-01, -5.43421283e-02, -3.51880491e-02,
2.68175632e-01, -3.60581696e-01, 5.03844058e-04, 1.37063740e-02,
-1.10333703e-01, -1.05074406e-01, -1.10602118e-01, -1.41570672e-01],
dtype=float32)

Could you please tell me how to deal with numeric data embeddings?
Thanks!

Deepmatcher raises an error when dataset contains an attribute named "type".

Stack trace:

  File "<path>/deepmatcher/models/core.py", line 184, in run_train
    return Runner.train(self, *args, **kwargs)
  File "<path>/deepmatcher/runner.py", line 300, in train
    model.initialize(train_dataset)
  File "<path>/deepmatcher/models/core.py", line 366, in initialize
    self.forward(init_batch)
  File "<path>/deepmatcher/models/core.py", line 436, in forward
    embeddings[right])
TypeError: type() takes 2 positional arguments but 3 were given

Temporary workaround:

Change the name of the attribute causing the problem, e.g., you can change "left_type" and "right_type" to "left_entity_type" and "right_entity_type" respectively.
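
The rename is straightforward with pandas; a sketch of the workaround above (file name hypothetical):

import pandas as pd

df = pd.read_csv('train.csv')
# Rename the offending "type" columns before handing the file to deepmatcher.
df = df.rename(columns={'left_type': 'left_entity_type',
                        'right_type': 'right_entity_type'})
df.to_csv('train.csv', index=False)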

getting error when training model in google colab

I am using Google Colab.
When I tried to process data with fastText in French, I set it up like this:

train_set,validation_set = dm.data.process(
    path='drive/My Drive/recommandersystem/deepmatcher_model',
    cache='train_cache.pth',
    train='train.csv',
    validation='valid.csv',
    embeddings='fasttext.fr.bin', 
    embeddings_cache_path='drive/My Drive/recommandersystem/deepmatcher_model',
    ignore_columns=['id',''],
    id_attr='_id', 
    label_attr='label',
    left_prefix='ltable_', 
    right_prefix='rtable_')

and I get this error message:

HTTPError                                 Traceback (most recent call last)
<ipython-input-...> in <module>()
     11     label_attr='label',
     12     left_prefix='ltable_',
---> 13     right_prefix='rtable_')

13 frames

/usr/lib/python3.6/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
    648 class HTTPDefaultErrorHandler(BaseHandler):
    649     def http_error_default(self, req, fp, code, msg, hdrs):
--> 650         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    651
    652 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 403: Forbidden

Please, how can I solve this?

Problem in dm.data.process

Hi,
during the execution of

train, validation, test = dm.data.process(
    path='sample_data/itunes-amazon2',
    train='train.csv',
    validation='validate.csv',
    test='test.csv')

a KeyError: tensor(2) is raised.
Are there any solutions? Could the problem arise from a wrong configuration of the deepmatcher installation?

double free or corruption (!prev): 0x00005613ec89a910 ***

Hi, I have successfully run the code following the Getting Started tutorial.
I then tried to run deepmatcher on the structured walmart-amazon dataset, reformatting the CSV files like the itunes-amazon dataset from the Getting Started tutorial. But when I run this code:

train, validation, test = dm.data.process(
    path='.',
    train='train.csv',
    validation='validation.csv',
    test='test.csv')

it reports an error:

Computing principal components
0% [######] 100% | ETA: 00:00:00
Total time elapsed: 00:00:01
*** Error in `~/anaconda3/bin/python': double free or corruption (!prev): 0x00005613ec89a910 ***
Aborted (core dumped)

Do you have any ideas of what is wrong?

can't pickle PyCapsule objects

I am getting this error while trying to run the model on a platform other than Google Colab. It seems the model is not getting saved. Is there any way around this? Also, is this a dependency issue?

Unable to install deepmatcher in python 3.6

try:
    import deepmatcher
except ImportError:
    !pip install -qqq deepmatcher

While running the above code in Python 3.6 I am getting the error mentioned below:

"Could not find a version that satisfies the requirement torch==0.3.1 (from deepmatcher) (from versions: 0.1.2, 0.1.2.post1)
No matching distribution found for torch==0.3.1 (from deepmatcher)"

And if I run pip install deepmatcher:

Note: you may need to restart the kernel to use updated packages.
'C:\Users\ABC' is not recognized as an internal or external command,
operable program or batch file.

So what should I do to resolve this issue?
