catalyst-cooperative / ccai-entity-matching

An exploration of generalizable approaches to unsupervised entity matching for use in linking tabular public energy data sources.

License: MIT License
The entity matching process breaks down into two steps: blocking and matching.
After cleaning and standardizing the data in both the FERC and EIA datasets, we perform a process called blocking, in which we remove record pairs that are unlikely to match from the candidate set of record pairs. This reduces computational complexity and boosts model performance, as we no longer need to evaluate all n² candidate pairs and instead evaluate only a set of record pairs that are more likely to match. The goal of blocking is to create a set of candidate record pairs that is as small as possible while still containing all correctly matched pairs.
The simplest way we tried to create blocks of candidate pairs is rule-based blocking. This involves creating a set of heuristics that, when applied disjunctively, create "blocks" of record pairs that together form a complete candidate set of pairs. This approach was too simple for our problem, and it was difficult to capture the training data matches without creating a very large candidate set.
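As a toy illustration of disjunctive rule-based blocking (the records and column names here are hypothetical, not the actual FERC or EIA schemas), each heuristic produces a block of candidate pairs, and the candidate set is the union of the blocks:

```python
import pandas as pd

# Toy stand-ins for cleaned FERC and EIA records (hypothetical columns).
ferc = pd.DataFrame({
    "ferc_id": [0, 1, 2],
    "plant": ["comanche", "barry", "big bend"],
    "state": ["CO", "AL", "FL"],
})
eia = pd.DataFrame({
    "eia_id": [10, 11, 12],
    "plant": ["comanche peak", "barry", "bend big"],
    "state": ["TX", "AL", "FL"],
})

# Heuristic 1: candidate pairs share a state.
by_state = ferc.merge(eia, on="state")

# Heuristic 2: candidate pairs share the first token of the plant name.
ferc["token"] = ferc["plant"].str.split().str[0]
eia["token"] = eia["plant"].str.split().str[0]
by_token = ferc.merge(eia, on="token")

# Applying the heuristics disjunctively = union of the blocks.
candidates = (
    pd.concat([by_state[["ferc_id", "eia_id"]], by_token[["ferc_id", "eia_id"]]])
    .drop_duplicates()
)
print(len(candidates))  # 3 candidate pairs, versus 3 * 3 = 9 possible
```

The tension described above shows up even here: looser heuristics recover more true matches but inflate the candidate set back toward all n² pairs.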
It's worth noting that the output of rule-based blocking can be combined with the output of the embedding vector approach described below to increase recall while increasing the blocking output size only modestly (Thirumuruganathan, Li).
Instead of creating heuristics for blocking, we can create embedding vectors that represent the tuples in the FERC and EIA datasets and find the most similar pairs of embedding vectors to create a candidate set. This process involves three main steps: for each tuple t in the FERC and EIA datasets, compute an embedding vector for each attribute (column) in t; aggregate the attribute embeddings into a single tuple embedding for t; and pair the most similar tuple embeddings to form the candidate set.

Attribute Embedding
There are multiple methods for embedding the string value attributes of the tuples:

- TF-IDF
- Word embeddings (word2vec, GloVe)
- Character-level (fastText) or sub-word embeddings (bi-grams)
The numeric attributes can be normalized within each column. (Or should they go through the same embedding process as the string columns? In the case of TF-IDF, does it matter if the numeric columns aren't on the same scale as the string columns?)
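As a sketch of the TF-IDF option (toy values; the attribute names are hypothetical), string attributes can be vectorized with scikit-learn while numeric attributes are normalized within their column:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# A hypothetical string attribute (plant name) across four records.
names = ["comanche peak", "comanche", "barry", "big bend"]

# TF-IDF turns each string attribute value into a sparse weight vector
# over the vocabulary seen in that column.
vec = TfidfVectorizer(analyzer="word")
name_embeddings = vec.fit_transform(names)  # shape: (n_records, n_vocab)

# A hypothetical numeric attribute, normalized within its column.
capacity_mw = np.array([2300.0, 765.0, 55.0, 1809.0])
capacity_scaled = (capacity_mw - capacity_mw.mean()) / capacity_mw.std()

print(name_embeddings.shape, capacity_scaled.round(2))
```

This keeps the open question above visible: the scaled numeric column and the TF-IDF columns live on different scales unless they're reconciled before aggregation.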
Tuple Embedding
Equal Weight Aggregation: Attribute embeddings are averaged together into one tuple embedding.
Weighted Aggregation: A weighted average is used to combine the attribute embeddings together into one tuple embedding. The weights of the attribute embeddings can optionally be learned.
Note: with aggregation methods, order is not considered: "Generator 10" has the same embedding as "10 Generator" (which could be good or bad).
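A minimal sketch of the two aggregation options, using made-up attribute embeddings and arbitrary weights:

```python
import numpy as np

# Three attribute embeddings for one tuple (hypothetical 4-dim vectors).
attr_embs = np.array([
    [0.2, 0.1, 0.0, 0.7],   # e.g. plant name embedding
    [0.0, 0.9, 0.1, 0.0],   # e.g. utility name embedding
    [0.5, 0.5, 0.5, 0.5],   # e.g. scaled numeric attributes
])

# Equal weight aggregation: plain average of the attribute embeddings.
tuple_emb_equal = attr_embs.mean(axis=0)

# Weighted aggregation: a weighted average; in practice the weights
# could be learned rather than fixed like these.
weights = np.array([0.6, 0.3, 0.1])
tuple_emb_weighted = weights @ attr_embs

print(tuple_emb_equal, tuple_emb_weighted)
```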
Roughly speaking, the autoencoder and seq2seq methods take a tuple t and feed it into a neural network (NN) to output a compact embedding vector u<sub>t</sub>, such that if we feed u<sub>t</sub> into a second NN, we can recover the original tuple t (or a good approximation of t). If this happens, u<sub>t</sub> can be viewed as a good compact summary of tuple t, and can be used as the tuple embedding of t.
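A toy sketch of that idea using a linear autoencoder in plain NumPy (the real methods use richer NNs; the data, dimensions, and learning rate here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "tuples": 100 records whose 8 numeric features lie in a 3-D subspace.
X = rng.normal(size=(100, 3)) @ rng.normal(scale=0.5, size=(3, 8))

d, k = X.shape[1], 3                         # input width, embedding width
W_enc = rng.normal(scale=0.1, size=(d, k))   # first NN: tuple -> u_t
W_dec = rng.normal(scale=0.1, size=(k, d))   # second NN: u_t -> tuple

def mse():
    """Mean squared error of reconstructing X from its embeddings."""
    return ((X @ W_enc @ W_dec - X) ** 2).mean()

start = mse()
lr = 0.02
for _ in range(2000):
    U = X @ W_enc                 # compact tuple embeddings u_t
    err = U @ W_dec - X           # reconstruction error
    # Gradient descent on the mean squared reconstruction error.
    W_dec -= lr * U.T @ err / len(X)
    W_enc -= lr * X.T @ (err @ W_dec.T) / len(X)

print(start, mse())  # reconstruction error drops as u_t learns to summarize t
```

Once trained, `X @ W_enc` plays the role of the tuple embeddings fed into vector pairing.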
Vector Pairing
For all combinations of attribute and tuple embedding, we will use KNN cosine similarity to choose the vector pairs in the candidate set.
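A sketch of KNN vector pairing with plain NumPy on random toy embeddings: normalize, take dot products to get cosine similarity, then keep the k most similar "right side" tuples per "left side" tuple:

```python
import numpy as np

rng = np.random.default_rng(0)
ferc_embs = rng.normal(size=(5, 16))   # toy "left side" tuple embeddings
eia_embs = rng.normal(size=(8, 16))    # toy "right side" tuple embeddings

# Cosine similarity = dot product of L2-normalized vectors.
f = ferc_embs / np.linalg.norm(ferc_embs, axis=1, keepdims=True)
e = eia_embs / np.linalg.norm(eia_embs, axis=1, keepdims=True)
sims = f @ e.T                          # (5, 8) similarity matrix

k = 2
# For each left tuple, indices of its k most similar right tuples.
topk = np.argsort(-sims, axis=1)[:, :k]
print(topk.shape)  # each left record contributes k candidate pairs
```

At scale, an approximate nearest-neighbor index replaces the dense similarity matrix, but the pairing logic is the same.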
Evaluation Metric
These metrics work best for a rule-based blocking method, where you can't adjust the size of the candidate set. Metrics for blocking where the vector pairing step is done at the end to retain the k most similar vector pairs still need to be added.
Experiment Matrix
(Note: there could probably be more experimentation with the way that numeric attributes are embedded and concatenated onto the record tuple embedding.)
Attribute Embedding Method | Tuple Embedding Method | % of Training Matches Retained |
---|---|---|
Rule Based Blocking | | |
TF-IDF | Equal Weight Aggregation | |
TF-IDF | Weighted Aggregation | |
TF-IDF | autoencoder | |
TF-IDF | seq2seq | |
word2vec | Equal Weight Aggregation | |
word2vec | Weighted Aggregation | |
word2vec | autoencoder | |
word2vec | seq2seq | |
fastText | Equal Weight Aggregation | |
fastText | Weighted Aggregation | |
fastText | autoencoder | |
fastText | seq2seq | |
The code in this repo was developed specifically for the FERC-EIA matching problem, but should ideally be usable for other matching problems. The core underlying modelling is not dependent on this problem, but the implementation is currently tooled specifically to work with these inputs.
We've got a bunch of (potential) experiments that we want to compare, so setting up a framework for running them all in a repeatable way will be helpful.
This epic lists the combinations of techniques that we want to explore for performing the FERC-EIA record linkage. The categories include:
The blocking step dramatically reduces the number of pairs of records that need to be compared, making the problem computationally feasible. There are several parts:
These operate on the subset of record pairs that were identified as potential matches in the blocking step. The options we're exploring are:
As we start to integrate the CCAI modelling work back into PUDL, we need to have a concrete understanding of its performance versus our baselines. Here are some comparisons that may lead to understanding and improvement:
The final output of the CCAI project should be the complete replacement of the FERC-EIA matching in PUDL. Once #106 is complete we should be prepared to drop the framework developed here directly into PUDL (we could also add a dependency to this repo, but I believe it will be much more maintainable in PUDL).
Out of scope: `faiss` and comparison to existing blocking column results will be handled in a separate PR (try running the modeling on blocks output from the `faiss` clustering step and see if there's a score improvement).

Note from Katie:
There are several metrics all over the FERC to EIA matching model module that would be good to track, like accuracy, checks for the coverage of certain types of plants in the matches, and the consistency of model-generated FERC plant IDs across time (see `pudl.analysis.record_linkage.eia_ferc1_record_linkage._log_match_coverage`, `pudl.analysis.record_linkage.eia_ferc1_record_linkage.check_match_consistency`, and `pudl.analysis.record_linkage.eia_ferc1_record_linkage.overwrite_bad_predictions`). If you think it's in scope, these metrics should probably be consolidated into one place and tracked.
We've decided the first place to begin integrating the CCAI entity matching into PUDL will be the inter-year FERC-FERC matching. This matching process uses an almost identical approach to the blocking step, so it will hopefully be a straightforward place to start.
Run the FERC1-EIA record linkage process using TF-IDF for string feature vectorization with naive equal weighting of features, and Splink to do the record linkage.
`plant_id_pudl` assignments or training data. The blocking step itself is already fairly generalized, and I've done work refactoring to improve configuration and hopefully streamline the logic flow, but the inputs are prepared by the `InputManager` class, which does a fair amount of preprocessing. The majority of the remaining work for the blocking step generalization is deciding how much (if any) of the input preparation can be generalized, and ensuring there is a clear delineation between the generalized framework and problem-specific work.
Use TF-IDF to vectorize string features, and then test standard linkage performance with:
After the tuples are embedded into vectors for each record, we run a similarity function to decide what the best record pair candidates are. This set of good record pair candidates is then fed into the matching model.
KNN cosine similarity is widely accepted for this step. This involves choosing the K best "right side" candidate match tuples for each "left side" tuple based on the cosine similarity of the tuple embeddings. To start, we'll use a threshold similarity.
Packages like `faiss` will be helpful for creating this functionality.
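For modest candidate set sizes, scikit-learn's `NearestNeighbors` gives the same K-best-by-cosine-plus-threshold behavior without an ANN index (toy random embeddings; the threshold value is arbitrary):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
left = rng.normal(size=(6, 8))    # toy "left side" tuple embeddings
right = rng.normal(size=(20, 8))  # toy "right side" tuple embeddings

# K best right-side candidates per left-side tuple by cosine similarity.
nn = NearestNeighbors(n_neighbors=3, metric="cosine").fit(right)
dist, idx = nn.kneighbors(left)   # cosine distance = 1 - cosine similarity

# Keep only the pairs at or above a similarity threshold.
threshold = 0.1
pairs = [(i, j) for i, row in enumerate(idx)
         for j, d in zip(row, dist[i]) if 1 - d >= threshold]
print(len(pairs))
```

Swapping in a `faiss` index changes only how the neighbors are found, not how the candidate pairs are assembled.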
We'll want to run one or more notebooks in CI to make sure that they're up to date with the existing modules.
Create a CI test that uses `nbconvert` to run our designated notebook(s).
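A minimal sketch of such a CI step (the notebook path is hypothetical); `nbconvert` exits nonzero if any cell raises, which fails the job:

```shell
# Execute the notebook top to bottom, writing the run output back in place.
jupyter nbconvert --to notebook --execute --inplace notebooks/matching_demo.ipynb
```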
Functions to check in integration tests
Currently the matching model I've built with Splink lives in a notebook. I'm going to integrate it into a matching module in the repo so that, as we develop blocking methods, we can run the candidate set through Splink to evaluate how well it's performing.