catalyst-cooperative / ccai-entity-matching

An exploration of generalizable approaches to unsupervised entity matching for use in linking tabular public energy data sources.

License: MIT License
The entity matching process breaks down into two steps: blocking and matching.
After cleaning and standardizing the data in both the FERC and EIA datasets, we perform a process called blocking, in which we remove record pairs that are unlikely to match from the candidate set of record pairs. This reduces computational complexity and boosts model performance, as we no longer need to evaluate all n² candidate pairs and instead evaluate only a set of record pairs that are more likely to match. The goal of blocking is to create a set of candidate record pairs that is as small as possible while still containing all correctly matched pairs.
The simplest way we tried to create blocks of candidate pairs is rule-based blocking. This involves creating a set of heuristics that, when applied disjunctively, create "blocks" of record pairs that together form a complete candidate set of pairs. This approach was too simple for our problem, and it was difficult to capture the training data matches without creating a very large candidate set.
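As a toy illustration of disjunctive rule-based blocking (the records and column names here are hypothetical, not the actual FERC or EIA schemas), each heuristic produces a block of candidate pairs, and the candidate set is the union of the blocks:

```python
import pandas as pd

# Toy stand-ins for cleaned FERC and EIA records (hypothetical columns).
ferc = pd.DataFrame({
    "ferc_id": [0, 1, 2],
    "plant": ["comanche", "barry", "big bend"],
    "state": ["CO", "AL", "FL"],
})
eia = pd.DataFrame({
    "eia_id": [10, 11, 12],
    "plant": ["comanche peak", "barry", "bend big"],
    "state": ["TX", "AL", "FL"],
})

# Heuristic 1: candidate pairs share a state.
by_state = ferc.merge(eia, on="state")

# Heuristic 2: candidate pairs share the first token of the plant name.
ferc["token"] = ferc["plant"].str.split().str[0]
eia["token"] = eia["plant"].str.split().str[0]
by_token = ferc.merge(eia, on="token")

# Applying the heuristics disjunctively = union of the blocks.
candidates = (
    pd.concat([by_state[["ferc_id", "eia_id"]], by_token[["ferc_id", "eia_id"]]])
    .drop_duplicates()
)
print(len(candidates))  # 3 candidate pairs, versus 3 * 3 = 9 possible
```

The tension described above shows up even here: looser heuristics recover more true matches but inflate the candidate set back toward all n² pairs.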
It's worth noting that the output of rule-based blocking can be combined with the output of the embedding vector approach described below to increase recall while increasing the blocking output size only modestly (Thirumuruganathan, Li).
Instead of creating heuristics for blocking, we can create embedding vectors that represent the tuples in the FERC and EIA datasets and find the most similar pairs of embedding vectors to create a candidate set. This process involves three main steps: for each tuple t in the FERC and EIA datasets, compute an embedding vector for each attribute (column) in t; aggregate the attribute embeddings into a single tuple embedding for t; and pair the most similar tuple embeddings to form the candidate set.

Attribute Embedding
There are multiple methods for embedding the string value attributes of the tuples:

- TF-IDF
- Word embeddings (word2vec, GloVe)
- Character-level (fastText) or sub-word embeddings (bi-grams)
The numeric attributes can be normalized within each column. (Or should they go through the same embedding process as the string columns? In the case of TF-IDF, does it matter if the numeric columns aren't on the same scale as the string columns?)
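As a sketch of the TF-IDF option (toy values; the attribute names are hypothetical), string attributes can be vectorized with scikit-learn while numeric attributes are normalized within their column:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# A hypothetical string attribute (plant name) across four records.
names = ["comanche peak", "comanche", "barry", "big bend"]

# TF-IDF turns each string attribute value into a sparse weight vector
# over the vocabulary seen in that column.
vec = TfidfVectorizer(analyzer="word")
name_embeddings = vec.fit_transform(names)  # shape: (n_records, n_vocab)

# A hypothetical numeric attribute, normalized within its column.
capacity_mw = np.array([2300.0, 765.0, 55.0, 1809.0])
capacity_scaled = (capacity_mw - capacity_mw.mean()) / capacity_mw.std()

print(name_embeddings.shape, capacity_scaled.round(2))
```

This keeps the open question above visible: the scaled numeric column and the TF-IDF columns live on different scales unless they're reconciled before aggregation.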
Tuple Embedding
Equal Weight Aggregation: Attribute embeddings are averaged together into one tuple embedding.
Weighted Aggregation: A weighted average is used to combine the attribute embeddings together into one tuple embedding. The weights of the attribute embeddings can optionally be learned.
Note: with aggregation methods, order is not considered: "Generator 10" has the same embedding as "10 Generator" (which could be good or bad).
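A minimal sketch of the two aggregation options, using made-up attribute embeddings and arbitrary weights:

```python
import numpy as np

# Three attribute embeddings for one tuple (hypothetical 4-dim vectors).
attr_embs = np.array([
    [0.2, 0.1, 0.0, 0.7],   # e.g. plant name embedding
    [0.0, 0.9, 0.1, 0.0],   # e.g. utility name embedding
    [0.5, 0.5, 0.5, 0.5],   # e.g. scaled numeric attributes
])

# Equal weight aggregation: plain average of the attribute embeddings.
tuple_emb_equal = attr_embs.mean(axis=0)

# Weighted aggregation: a weighted average; in practice the weights
# could be learned rather than fixed like these.
weights = np.array([0.6, 0.3, 0.1])
tuple_emb_weighted = weights @ attr_embs

print(tuple_emb_equal, tuple_emb_weighted)
```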
Roughly speaking, the autoencoder and seq2seq methods take a tuple t and feed it into a neural network (NN) to output a compact embedding vector u<sub>t</sub>, such that if we feed u<sub>t</sub> into a second NN, we can recover the original tuple t (or a good approximation of t). If this happens, u<sub>t</sub> can be viewed as a good compact summary of tuple t, and can be used as the tuple embedding of t.
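A toy sketch of that idea using a linear autoencoder in plain NumPy (the real methods use richer NNs; the data, dimensions, and learning rate here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "tuples": 100 records whose 8 numeric features lie in a 3-D subspace.
X = rng.normal(size=(100, 3)) @ rng.normal(scale=0.5, size=(3, 8))

d, k = X.shape[1], 3                         # input width, embedding width
W_enc = rng.normal(scale=0.1, size=(d, k))   # first NN: tuple -> u_t
W_dec = rng.normal(scale=0.1, size=(k, d))   # second NN: u_t -> tuple

def mse():
    """Mean squared error of reconstructing X from its embeddings."""
    return ((X @ W_enc @ W_dec - X) ** 2).mean()

start = mse()
lr = 0.02
for _ in range(2000):
    U = X @ W_enc                 # compact tuple embeddings u_t
    err = U @ W_dec - X           # reconstruction error
    # Gradient descent on the mean squared reconstruction error.
    W_dec -= lr * U.T @ err / len(X)
    W_enc -= lr * X.T @ (err @ W_dec.T) / len(X)

print(start, mse())  # reconstruction error drops as u_t learns to summarize t
```

Once trained, `X @ W_enc` plays the role of the tuple embeddings fed into vector pairing.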
Vector Pairing
For all combinations of attribute and tuple embedding, we will use KNN cosine similarity to choose the vector pairs in the candidate set.
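A sketch of KNN vector pairing with plain NumPy on random toy embeddings: normalize, take dot products to get cosine similarity, then keep the k most similar "right side" tuples per "left side" tuple:

```python
import numpy as np

rng = np.random.default_rng(0)
ferc_embs = rng.normal(size=(5, 16))   # toy "left side" tuple embeddings
eia_embs = rng.normal(size=(8, 16))    # toy "right side" tuple embeddings

# Cosine similarity = dot product of L2-normalized vectors.
f = ferc_embs / np.linalg.norm(ferc_embs, axis=1, keepdims=True)
e = eia_embs / np.linalg.norm(eia_embs, axis=1, keepdims=True)
sims = f @ e.T                          # (5, 8) similarity matrix

k = 2
# For each left tuple, indices of its k most similar right tuples.
topk = np.argsort(-sims, axis=1)[:, :k]
print(topk.shape)  # each left record contributes k candidate pairs
```

At scale, an approximate nearest-neighbor index replaces the dense similarity matrix, but the pairing logic is the same.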
Evaluation Metric
These metrics work best for a rule-based blocking method, where you can't adjust the size of the candidate set. Metrics for blocking where the vector pairing step is done at the end to retain the k most similar vector pairs still need to be added.
Experiment Matrix
(Note: there could probably be more experimentation with the way that numeric attributes are embedded and concatenated onto the record tuple embedding.)
Attribute Embedding Method | Tuple Embedding Method | % of Training Matches Retained |
---|---|---|
Rule Based Blocking | | |
TF-IDF | Equal Weight Aggregation | |
TF-IDF | Weighted Aggregation | |
TF-IDF | autoencoder | |
TF-IDF | seq2seq | |
word2vec | Equal Weight Aggregation | |
word2vec | Weighted Aggregation | |
word2vec | autoencoder | |
word2vec | seq2seq | |
fastText | Equal Weight Aggregation | |
fastText | Weighted Aggregation | |
fastText | autoencoder | |
fastText | seq2seq | |
The code in this repo was developed specifically for the FERC-EIA matching problem, but should ideally be usable for other matching problems. The core underlying modelling is not dependent on this problem, but the implementation is currently tooled specifically to work with these inputs.
We've got a bunch of (potential) experiments that we want to compare, so setting up a framework for running them all in a repeatable way will be helpful.
This epic lists the combinations of techniques that we want to explore for performing the FERC-EIA record linkage. The categories include:
The blocking step dramatically reduces the number of pairs of records that need to be compared, making the problem computationally feasible. There are several parts:
These operate on the subset of record pairs that were identified as potential matches in the blocking step. The options we're exploring are:
As we start to integrate the CCAI modelling work back into PUDL, we need to have a concrete understanding of its performance versus our baselines. Here are some comparisons that may lead to understanding and improvement:
The final output of the CCAI project should be the complete replacement of the FERC-EIA matching in PUDL. Once #106 is complete we should be prepared to drop the framework developed here directly into PUDL (we could also add a dependency to this repo, but I believe it will be much more maintainable in PUDL).
Out of scope: `faiss` and comparison to existing blocking column results will be handled in a separate PR (try running the modeling on blocks output from the `faiss` clustering step and see if there's a score improvement).

Note from Katie:
There are several metrics all over the FERC to EIA matching model module that would be good to track, like accuracy, checks for the coverage of certain types of plants in the matches, and the consistency of model-generated FERC plant IDs across time (see `pudl.analysis.record_linkage.eia_ferc1_record_linkage._log_match_coverage`, `pudl.analysis.record_linkage.eia_ferc1_record_linkage.check_match_consistency`, and `pudl.analysis.record_linkage.eia_ferc1_record_linkage.overwrite_bad_predictions`). If you think it's in scope, these metrics should probably be consolidated into one place and tracked.
We've decided the first place to begin integrating the CCAI entity matching into PUDL will be the inter-year FERC-FERC matching. This matching process uses an almost identical approach to the blocking step, so it will hopefully be a straightforward place to start.
Run the FERC1-EIA record linkage process using TF-IDF for string feature vectorization with naive equal weighting of features, and Splink to do the record linkage.
`plant_id_pudl` assignments or training data. The blocking step itself is already fairly generalized, and I've done work refactoring to improve configuration and hopefully streamline the logic flow, but the inputs are prepared by the `InputManager` class, which does a fair amount of preprocessing. The majority of the remaining work for the blocking step generalization is deciding how much (if any) of the input preparation can be generalized, and ensuring there is a clear delineation between the generalized framework and problem-specific work.
Use TF-IDF to vectorize string features, and then test standard linkage performance with:
After the tuples are embedded into vectors for each record, we run a similarity function to decide what the best record pair candidates are. This set of good record pair candidates is then fed into the matching model.
KNN cosine similarity is widely accepted for this step. This involves choosing the K best "right side" candidate match tuples for each "left side" tuple based on the cosine similarity of the tuple embeddings. To start, we'll use a threshold similarity.
Packages like `faiss` will be helpful for creating this functionality.
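For modest candidate set sizes, scikit-learn's `NearestNeighbors` gives the same K-best-by-cosine-plus-threshold behavior without an ANN index (toy random embeddings; the threshold value is arbitrary):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
left = rng.normal(size=(6, 8))    # toy "left side" tuple embeddings
right = rng.normal(size=(20, 8))  # toy "right side" tuple embeddings

# K best right-side candidates per left-side tuple by cosine similarity.
nn = NearestNeighbors(n_neighbors=3, metric="cosine").fit(right)
dist, idx = nn.kneighbors(left)   # cosine distance = 1 - cosine similarity

# Keep only the pairs at or above a similarity threshold.
threshold = 0.1
pairs = [(i, j) for i, row in enumerate(idx)
         for j, d in zip(row, dist[i]) if 1 - d >= threshold]
print(len(pairs))
```

Swapping in a `faiss` index changes only how the neighbors are found, not how the candidate pairs are assembled.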
We'll want to run one or more notebooks in CI to make sure that they're up to date with the existing modules.
Create a CI test that uses `nbconvert` to run our designated notebook(s).
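A minimal sketch of such a CI step (the notebook path is hypothetical); `nbconvert` exits nonzero if any cell raises, which fails the job:

```shell
# Execute the notebook top to bottom, writing the run output back in place.
jupyter nbconvert --to notebook --execute --inplace notebooks/matching_demo.ipynb
```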
Functions to check in integration tests
Currently the matching model I've built with Splink lives in a notebook. I'm going to integrate it into a matching module in the repo so that, as we develop blocking methods, we can run the candidate set through Splink to evaluate how well it's performing.