
ccai-entity-matching's Issues

Overview Of Experiments and Progress Checklist

The entity matching process breaks down into two steps: blocking and matching.

Blocking

After cleaning and standardizing the data in both the FERC and EIA datasets, we perform a process called blocking, in which we remove record pairs that are unlikely to be matched from the candidate set of record pairs. This reduces computational complexity and boosts model performance, as we no longer need to evaluate n² candidate match pairs and instead only evaluate a set of record pairs that are more likely to be matched. The goal of blocking is to create a set of candidate record pairs that is as small as possible while still containing all correctly matched pairs.

Rule Based Blocking

The simplest way we tried to create blocks of candidate pairs is rule-based blocking. This involves creating a set of heuristics that, when applied disjunctively, create "blocks" of record pairs that form a complete candidate set of pairs. This approach proved too simple for our problem: it was difficult to capture the training data matches without creating a very large candidate set.
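As a concrete illustration, a disjunctive rule set can be built as a union of simple pandas merges. This is only a sketch: the column names (`record_id`, `report_year`, `plant_name`) are hypothetical, not the repo's actual schema.

```python
import pandas as pd

def rule_based_blocks(ferc: pd.DataFrame, eia: pd.DataFrame) -> pd.DataFrame:
    """Union (disjunction) of simple blocking rules; returns candidate id pairs."""
    ids = ["record_id_ferc", "record_id_eia"]
    # Rule 1: only compare records from the same report year.
    by_year = ferc.merge(eia, on="report_year", suffixes=("_ferc", "_eia"))[ids]
    # Rule 2: only compare records whose plant names share a first token.
    f = ferc.assign(tok=ferc["plant_name"].str.split().str[0])
    e = eia.assign(tok=eia["plant_name"].str.split().str[0])
    by_token = f.merge(e, on="tok", suffixes=("_ferc", "_eia"))[ids]
    # The candidate set is the de-duplicated union of all blocks.
    return pd.concat([by_year, by_token]).drop_duplicates(ignore_index=True)
```

Adding more rules to the disjunction raises recall but also grows the candidate set, which is exactly the tension described above.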

It's worth noting that the output of rule-based blocking can be combined with the output of the embedding vector approach described below to increase recall, while only modestly increasing the blocking output size (Thirumuruganathan, Li).

Embedding Vectors for Blocking

Instead of creating heuristics for blocking, we can create embedding vectors that represent the tuples in the FERC and EIA datasets and find the most similar pairs of embedding vectors to create a candidate set. This process involves three main steps.

  1. Attribute Embedding: For each tuple t in the FERC and EIA datasets, compute an embedding vector for each attribute (column) in t.
  2. Tuple Embedding: Combine each attribute embedding vector into one embedding vector for the tuple t.
  3. Vector Pairing: Find similar vector pairs from the FERC and EIA datasets using a similarity metric and add the tuple pairs represented by these embedding vectors to the candidate set.

Attribute Embedding
There are multiple methods for embedding the string-valued attributes of the tuples.

  • TF-IDF
  • Word Embeddings (word2vec, GloVe)
    • can be trained on the domain instead of pre-trained
  • Character-Level (fastText) or Sub-Word Embeddings (bi-grams)
    • can handle similarities in words like "generator" and "generation"
    • handle typos better
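For example, sub-word robustness can be approximated with character n-gram TF-IDF in scikit-learn. A rough sketch (the plant names are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

names = ["barton plant", "barton station", "comanche generating"]
# char_wb n-grams give sub-word robustness to typos and shared stems
# ("generator" vs "generation"); analyzer="word" would be plain TF-IDF.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
embeddings = vectorizer.fit_transform(names)  # sparse (3, vocab_size) matrix
```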

The numeric attributes can be normalized within each column. (or should they go through the same embedding process as the string columns? in the case of TF-IDF does it matter if the numeric columns aren't on the same scale as the string columns?)
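One possible answer to the scaling question above: min-max normalize each numeric column to [0, 1] before concatenating it onto the TF-IDF features (whose values are also bounded), so no single numeric column dominates the similarity. A minimal sketch:

```python
import numpy as np

def scale_numeric(col: np.ndarray) -> np.ndarray:
    """Min-max scale a numeric column to [0, 1] so it lands on a scale
    comparable to TF-IDF features before concatenation."""
    rng = col.max() - col.min()
    # A constant column carries no information; map it to zeros.
    return (col - col.min()) / rng if rng else np.zeros_like(col, dtype=float)
```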

Tuple Embedding

  • Equal Weight Aggregation: Attribute embeddings are averaged together into one tuple embedding.

  • Weighted Aggregation: A weighted average is used to combine the attribute embeddings together into one tuple embedding. The weights of the attribute embeddings can optionally be learned.

Note: With aggregation methods, order is not considered: "Generator 10" has the same embedding as "10 Generator" (could be good or bad)

  • Self-Reproduction: Autoencoder or seq2seq (write this up)

Roughly speaking, these methods take a tuple t and feed it into a neural network (NN) that outputs a compact embedding vector u<sub>t</sub>, such that feeding u<sub>t</sub> into a second NN recovers the original tuple t (or a good approximation of it). If this works, u<sub>t</sub> can be viewed as a good compact summary of tuple t, and can be used as the tuple embedding of t.

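The two aggregation variants above can be sketched in a few lines of numpy (the weights here are illustrative, set a priori rather than learned):

```python
import numpy as np

def tuple_embedding(attr_vecs, weights=None) -> np.ndarray:
    """Combine per-attribute embeddings into a single tuple embedding.

    weights=None gives equal-weight aggregation; a weights sequence gives
    weighted aggregation (the weights could also be learned).
    """
    stacked = np.vstack(attr_vecs)  # (n_attrs, dim)
    return np.average(stacked, axis=0, weights=weights)
```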

Vector Pairing
For all combinations of attribute and tuple embedding, we will use KNN cosine similarity to choose the vector pairs in the candidate set.
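A minimal sketch of KNN vector pairing using scikit-learn's cosine metric (the embeddings and k are made up for illustration):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_candidate_pairs(ferc_emb, eia_emb, k=2):
    """For each FERC tuple embedding, keep its k nearest EIA embeddings
    (by cosine distance) as candidate pairs of row indices."""
    nn = NearestNeighbors(n_neighbors=k, metric="cosine").fit(eia_emb)
    _, idx = nn.kneighbors(ferc_emb)
    return [(i, j) for i, row in enumerate(idx) for j in row]
```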

Evaluation Metric

  • Reduction Ratio (percentage that the candidate set has been reduced from n x n comparisons)
  • Pairs Completeness (percentage of matching record pairs contained within the reduced comparison space after blocking)
  • Harmonic Mean of Reduction Ratio and Pairs Completeness: 2 * RR * PC / (RR + PC)

These metrics work best for a rule-based blocking method, where you can't adjust the size of the candidate set. When the vector pairing step is done at the end to retain the k most similar vector pairs, these metrics should also be reported for the blocking output (e.g. as a function of k).
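All three metrics can be computed directly from the candidate set and the labeled matches; a small sketch, where pairs are represented as (left_id, right_id) tuples:

```python
def blocking_metrics(n_left, n_right, candidate_pairs, true_pairs):
    """Reduction ratio, pairs completeness, and their harmonic mean."""
    candidate_pairs, true_pairs = set(candidate_pairs), set(true_pairs)
    # RR: fraction of the n_left x n_right comparison space eliminated.
    rr = 1 - len(candidate_pairs) / (n_left * n_right)
    # PC: fraction of true matches that survive blocking.
    pc = len(candidate_pairs & true_pairs) / len(true_pairs)
    hm = 2 * rr * pc / (rr + pc) if (rr + pc) else 0.0
    return rr, pc, hm
```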

Experiment Matrix
(Note: There could probably be more experimentation added with the way that numeric attributes are embedded and concatenated onto the record tuple embedding)

| Attribute Embedding Method | Tuple Embedding Method | % of Training Matches Retained |
| --- | --- | --- |
| Rule Based Blocking | — | |
| TF-IDF | Equal Weight Aggregation | |
| TF-IDF | Weighted Aggregation | |
| TF-IDF | autoencoder | |
| TF-IDF | seq2seq | |
| word2vec | Equal Weight Aggregation | |
| word2vec | Weighted Aggregation | |
| word2vec | autoencoder | |
| word2vec | seq2seq | |
| fastText | Equal Weight Aggregation | |
| fastText | Weighted Aggregation | |
| fastText | autoencoder | |
| fastText | seq2seq | |

Generalize Entity Matching Framework

Background

The code in this repo was developed specifically for the FERC-EIA matching problem, but it should ideally be usable for other matching problems. The core underlying modelling is not dependent on this problem, but the implementation is currently tooled specifically to work with these inputs.

Tasks

Set up record-linkage experiment infrastructure

We've got a bunch of (potential) experiments that we want to compare, so setting up a framework for running them all in a repeatable way will be helpful.

Tasks

  1. katie-lamb
  2. katie-lamb

FERC-EIA Record Linkage Experiments

This epic lists the combinations of techniques that we want to explore for performing the FERC-EIA record linkage. The categories include:

Blocking Strategies

The blocking step dramatically reduces the number of pairs of records that need to be compared, making the problem computationally feasible. There are several parts:

  • String attribute embedding methods that are used to turn text like plant or utility names into numerical features (TF-IDF, word2vec, and fastText)
  • Tuple embedding methods that are used to combine distinct vectorized features into a single vector representing the whole tuple. The relative weights (importance) of the various feature vectors can either be set a priori or learned, or a neural network can be used to reduce the dimensionality of the feature vector. Options include seq2seq and AutoEncoders (TensorFlow example)
  • Choice of threshold: Once we've got the tuple embedding, how do we pick a subset of records to compare to each other? E.g. k-nearest neighbors (KNN) or some minimum threshold value like cosine similarity >= 0.75.
  • There's also the old-school rule based blocking, where we pick some heuristics that split up the records along reasonable lines (e.g. only compare records from the same report year or state). This can potentially be used in combination with the above strategies as a pre-filter.

Record Linkage Models

These operate on the subset of record pairs that were identified as potential matches in the blocking step. The options we're exploring are:

  • Splink: a logistic regression model that can be run supervised or unsupervised.
  • Probabilistic Graphical Models (PGMs), which can be used to find a consensus among several noisy labeling functions (aka weak supervision)

Experiments to Run

  1. splink tf-idf
    katie-lamb
  2. splink word2vec
    katie-lamb

Evaluate and compare current performance of models

As we start to integrate the CCAI modelling work back into PUDL, we need a concrete understanding of its performance versus our baselines. Here are some comparisons that could lead to insights and improvements:

Tasks

Apply PUDL entity matching framework to FERC-EIA

The final output of the CCAI project should be the complete replacement of the FERC-EIA matching in PUDL. Once #106 is complete we should be prepared to drop the framework developed here directly into PUDL (we could also add a dependency to this repo, but I believe it will be much more maintainable in PUDL).

Out of scope:

  • This does NOT include bringing the Splink model into PUDL, but does lay the groundwork for doing so easily in the future.
  • Experiment tracking will be handled in a separate PR.
  • Blocking with faiss and comparison to existing blocking column results will be handled in a separate PR (try running the matching model on blocks output from the faiss clustering step and see if there's a score improvement)

Success criteria

TODO

Integrate experiment tracking with `mlflow`.

Note from Katie:

There are several metrics all over the FERC to EIA matching model module that would be good to track, like accuracy, checks for the coverage of certain types of plants in the matches, and the consistency of model-generated FERC plant IDs across time (see pudl.analysis.record_linkage.eia_ferc1_record_linkage._log_match_coverage, pudl.analysis.record_linkage.eia_ferc1_record_linkage.check_match_consistency, pudl.analysis.record_linkage.eia_ferc1_record_linkage.overwrite_bad_predictions). If you think it's in scope, these metrics should probably be consolidated into one place and tracked.

Integrate blocking step to FERC-FERC matching process in PUDL

We've decided the first place to begin integrating the CCAI entity matching into PUDL will be the inter-year FERC-FERC matching. This matching process uses an almost identical approach to the blocking step, so it will hopefully be a straightforward place to start.

TF-IDF + Splink + Equal Weights

Run the FERC1-EIA record linkage process using TF-IDF for string feature vectorization with naive equal weighting of features, and Splink to do the record linkage.

Parameters to vary

  • Choice of min/max lengths for n-grams generated by TF-IDF.
  • Vary the value of k in KNN or the minimum allowable cosine similarity used in blocking
  • Try using Splink supervised (based on our manual training data) and also unsupervised
  • Is there any useful exploration to be done in how we encode the non-string (numerical & categorical) features?

Evaluation criteria / outputs

  • Run time for the whole process.
  • Proportion of training data pairs excluded by the blocking strategy.
  • Reduction in the number of tuple pairs that need to be compared after blocking.
  • Proportion of identified matches that violate e.g. manual plant_id_pudl assignments or training data.
  • What fraction of the manually assigned training data matches has been recovered by the model?

Generalize blocking step to take two arbitrary dataframes and produce candidate sets

The blocking step itself is already fairly generalized, and I've done refactoring work to improve configuration and hopefully streamline the logic flow. However, the inputs are prepared by the InputManager class, which does a fair amount of preprocessing. The majority of the remaining generalization work is deciding how much (if any) of the input preparation can be generalized, and ensuring there is a clear delineation between the generalized framework and problem-specific work.

Evaluate TF-IDF Attribute Embedding

Use TF-IDF to vectorize string features, and then test standard linkage performance with:

Tuple Embedding Methods

  1. splink tf-idf
    katie-lamb

Create KNN Cosine Similarity Function

After the tuples are embedded into vectors for each record, we run a similarity function to decide which record pairs are the best candidates. This set of good record pair candidates is then fed into the matching model.

KNN cosine similarity is widely accepted for this step. This involves choosing the K best "right side" candidate match tuples for each "left side" tuple based on the cosine similarity of the tuple embeddings. To start, we'll use a threshold similarity.

Packages like faiss will be helpful for creating this functionality.
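faiss performs fast inner-product search; on row-normalized vectors, inner product equals cosine similarity. A plain numpy sketch of the threshold variant described above (at scale, faiss's IndexFlatIP would replace the matrix product):

```python
import numpy as np

def cosine_threshold_pairs(left, right, threshold=0.75):
    """Normalize rows so inner products equal cosine similarity (the same
    trick used with faiss's IndexFlatIP), then keep every left/right
    index pair whose similarity clears the threshold."""
    l = left / np.linalg.norm(left, axis=1, keepdims=True)
    r = right / np.linalg.norm(right, axis=1, keepdims=True)
    sims = l @ r.T                    # (n_left, n_right) cosine matrix
    li, ri = np.nonzero(sims >= threshold)
    return list(zip(li.tolist(), ri.tolist()))
```

Swapping the threshold for a row-wise top-k selection gives the KNN variant.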

Get CI set up to run notebooks

We'll want to run one or more notebooks in CI to make sure that they're up to date with the existing modules.

Create a CI test that uses nbconvert to run our designated notebook(s).

Run integration tests on FERC & EIA input generation

Functions to check in integration tests

Tasks

Integrate splink matching model into pipeline

Currently the matching model I've built with splink is in a notebook. I'm going to integrate this into a matching module in the repo so that as we develop blocking methods we can run the candidate set through splink to evaluate how well it's performing.
