Discussion from the gaglia88/sparker repository

Concern with weight calculation using BLAST and entropies

This library is pretty incredible; I just have a concern I wanted to report.

My use case is as follows:

Take two CSVs of customer data that share one or more matchable fields (an identifier, for example).

customers1.csv:

id	name	random_field_1	random_field_2	random_field_3	etc...
1	google	555-333-222	...	...	...
2	facebook	222-555-111	...	...	...
3	microsoft	333-111-888	...	...	...

customers2.csv:

identifier	customer_name	random_field_1	random_field_2	random_field_3	etc...
5	google inc	555	...	...	...
10	facebook corp	111	...	...	...
300	microsoft industries	555	...	...	...

create profiles
cluster_similar_attributes

[
    {'cluster_id': 1, 'keys': ['1_name', '2_customer_name'], 'entropy': 1.4},
    {'cluster_id': 2, 'keys': ['1_id', '2_id', '1_random_field_1', '2_random_field_1', '1_random_field_2', '2_random_field_2', (etc...)], 'entropy': 9.5},
]
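For context, here is a minimal sketch of how an entropy value like the ones above could arise. It assumes the cluster entropy is the Shannon entropy of the token frequencies in the clustered attributes; I have not verified that the library computes it exactly this way. The `shannon_entropy` helper and the token lists are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Shannon entropy (in bits) of the token frequency distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical token lists, not taken from the real CSVs.
# A cluster whose tokens repeat across profiles has a low entropy...
name_tokens = ["google", "facebook", "microsoft",
               "google", "facebook", "microsoft"]
# ...while a cluster where every token is distinct has a high one.
mixed_tokens = ["1", "2", "3", "5", "10", "300", "555", "111"]

low_entropy = shannon_entropy(name_tokens)
high_entropy = shannon_entropy(mixed_tokens)
print(low_entropy, high_entropy)
```

Under this assumption, the catch-all cluster of unrelated fields ends up with the higher entropy simply because it contains many distinct values.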

create_block_clusters

[
    {'block_id': 0, 'profiles': [{0}, {0}], 'entropy': 1.4, 'cluster_id': -1, 'blocking_key': ''},
    {'block_id': 1, 'profiles': [{1,2}, {1}], 'entropy': 9.5, 'cluster_id': -1, 'blocking_key': ''}
]

block purging
block filtering
WNP
I get a few mismatches because the weight of matches from cluster_id 2 is greater than the weight of matches from cluster_id 1.
Assuming each dataset has 100 rows, with 100 as the separator ID (200 profiles in total), the output edges would look something like:

[[0, 100, 10.5],
 [1, 101, 10.5],
 [2, 101, 20.8]]
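To show how these weights turn into a mismatch, here is a small sketch that keeps only the highest-weight edge for each profile of the second dataset. This is a simplification for illustration, not WNP's actual pruning rule, and `best_match_per_target` is a hypothetical helper.

```python
# Edges as (profile from dataset 1, profile from dataset 2, weight),
# copied from the example output above.
edges = [[0, 100, 10.5], [1, 101, 10.5], [2, 101, 20.8]]


def best_match_per_target(edges):
    """Keep, for each target profile, the source with the highest weight."""
    best = {}
    for source, target, weight in edges:
        if target not in best or weight > best[target][1]:
            best[target] = (source, weight)
    return best


matches = best_match_per_target(edges)
print(matches)
```

Profile 101 ends up paired with profile 2 (weight 20.8) rather than profile 1 (weight 10.5), which is the kind of mismatch I am seeing.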

You will notice that the higher weight goes to the match that has the higher entropy.
This doesn't seem correct to me since lower entropy should give higher weight.

Using the unmodified library, I got around 80-90 perfect matches. After I edited the calc_weights function in common_node_pruning.py, changing calc_chi_square(...) * entropies[neighbor_id] to calc_chi_square(...) / entropies[neighbor_id], I got 100 perfect one-to-one matches.
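The effect of that one-character change can be sketched as follows. The chi-square values are made up for illustration (chosen so that the products land near the weights in the example above), and `weight_multiply`, `weight_divide`, and `top_edge` are hypothetical stand-ins, not the library's functions.

```python
def weight_multiply(chi_square, entropy):
    """Original behaviour as I understand it: chi-square scaled up by entropy."""
    return chi_square * entropy


def weight_divide(chi_square, entropy):
    """My modification: chi-square scaled down by entropy."""
    return chi_square / entropy


# Illustrative (chi_square, entropy) pairs; the chi-square values are assumed.
candidates = {
    "name_cluster_edge": (7.5, 1.4),   # shares a name token, entropy 1.4
    "mixed_cluster_edge": (2.2, 9.5),  # shares a catch-all token, entropy 9.5
}


def top_edge(candidates, weight_fn):
    """Return the candidate edge with the largest weight under weight_fn."""
    return max(candidates, key=lambda name: weight_fn(*candidates[name]))


print(top_edge(candidates, weight_multiply))
print(top_edge(candidates, weight_divide))
```

With multiplication the edge from the high-entropy cluster wins; with division the edge from the name cluster wins, which matches the change in results I observed.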

Does dividing instead of multiplying make sense here, and is my assumption correct that lower entropy should yield a stronger match?

Please let me know :)
