dchud / ddbench

Benchmarking suite for dedupe

License: Apache License 2.0

Python 100.00%

ddbench's Issues

Sub-commands command structure

Let's create an argparse command that uses sub-commands for primary functionality (similar to git and its related sub-commands). We can use something like Commis if we want a bit more structure, or do straight-up argparse.
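A minimal sketch of the straight-up argparse option, to make the proposal concrete. The sub-command names (`run`, `report`) and their arguments are hypothetical placeholders, not something ddbench defines today.

```python
import argparse


def build_parser():
    """Build a top-level parser with git-style sub-commands."""
    parser = argparse.ArgumentParser(prog="ddbench")
    subparsers = parser.add_subparsers(dest="command")
    # Exit with a usage message if no sub-command is given.
    subparsers.required = True

    # Hypothetical "run" sub-command: execute one or more benchmark runs.
    run = subparsers.add_parser("run", help="execute benchmark runs")
    run.add_argument("dataset", help="name of the dataset to benchmark")
    run.add_argument("--runs", type=int, default=1, help="number of runs")

    # Hypothetical "report" sub-command: summarize earlier output.
    report = subparsers.add_parser("report", help="summarize run output")
    report.add_argument("path", help="directory holding run output")
    return parser
```

Calling `build_parser().parse_args()` from the script's entry point would then dispatch on `args.command`.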

Did this ever produce any results?

Did this project ever produce any results on dedupe's performance?

Thanks!

Determine what data to collect and add to reports

Each dedupe run has only very minimal data captured for it; consider what each run should capture and add it to the report output. Some possibilities include:

verify that the components needed to compute the F1 score are handled correctly; include the F1 score too
matched and unmatched pairs
correct and incorrect matches
execution timing by gross category: load/prep, sampling, labelling, training, matching, review
execution timing by finer category (probably requires changes within dedupe)
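For the F1 item, here is a rough sketch of how the components could be computed from matched pairs. It assumes each match is a pair of record ids and that order within a pair does not matter; the function name and the shape of the returned dict are illustrative only.

```python
def pair_metrics(predicted_pairs, true_pairs):
    """Compute precision, recall, and F1 over unordered record-id pairs."""
    # frozenset makes (a, b) and (b, a) compare equal.
    predicted = {frozenset(pair) for pair in predicted_pairs}
    truth = {frozenset(pair) for pair in true_pairs}

    true_pos = len(predicted & truth)   # correct matches
    false_pos = len(predicted - truth)  # incorrect matches
    false_neg = len(truth - predicted)  # missed matches

    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    recall = true_pos / (true_pos + false_neg) if truth else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "true_positives": true_pos,
        "false_positives": false_pos,
        "false_negatives": false_neg,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Reporting the raw counts alongside the score would make it easier to verify that the F1 value is being derived correctly.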

Reconsider report output format/location/strategy

The current version writes JSON output to files with common prefixes. We also discussed generating multiple SQLite databases or one centralized database, and there are other options to consider, like something optimized for time series / event data.

If we roll ahead with the current strategy, we could also move all output runs from a single execution into their own folder, change the folder/file naming strategy, etc.
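To help compare the options, this is a sketch of what the "one centralized db" idea might look like with the standard-library `sqlite3` module. The table layout and the `batch_id` / `run_id` names are assumptions for illustration; each report is stored as a JSON string so the current report dicts could be reused unchanged.

```python
import json
import sqlite3


def save_run(conn, batch_id, run_id, report):
    """Store one run's report dict as JSON, keyed by batch and run."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs ("
        "batch_id TEXT, run_id TEXT, report TEXT, "
        "PRIMARY KEY (batch_id, run_id))"
    )
    conn.execute(
        "INSERT OR REPLACE INTO runs VALUES (?, ?, ?)",
        (batch_id, run_id, json.dumps(report)),
    )
    conn.commit()


def load_batch(conn, batch_id):
    """Return all report dicts for a batch, ordered by run id."""
    rows = conn.execute(
        "SELECT report FROM runs WHERE batch_id = ? ORDER BY run_id",
        (batch_id,),
    )
    return [json.loads(row[0]) for row in rows]
```

A connection from `sqlite3.connect("ddbench.db")` (file name hypothetical) would replace the per-run files; `sqlite3.connect(":memory:")` is handy for trying it out.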

Decide org/location for this repo

We should make sure we're using the same license as dedupe; hopefully it's also Apache!

We can also decide whether or not to move this into the DDL org or keep it with Dan's code, or maybe even move to Datamade's org?

Summarize report output per batch

Pull together the output data from multiple runs into a single batch, computing at least basic descriptive statistics for all key values.
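A possible starting point, assuming each run's report is a flat dict and that only its numeric values need summarizing. The function name is a placeholder, and the choice of statistics is just a guess at what "basic" should cover.

```python
import statistics


def summarize_batch(reports):
    """Compute descriptive statistics for each numeric key across runs."""
    values_by_key = {}
    for report in reports:
        for key, value in report.items():
            # Skip non-numeric values; bool is excluded because it
            # is a subclass of int.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                values_by_key.setdefault(key, []).append(value)

    summary = {}
    for key, values in values_by_key.items():
        summary[key] = {
            "n": len(values),
            "min": min(values),
            "max": max(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            # Sample standard deviation needs at least two values.
            "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        }
    return summary
```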

Review fields in sample datasets

Review fields from the 4 sample datasets that are nominated for use in deduplication.

Getting ready for team development

To avoid conflicts, I suggest we set up the repo according to http://nvie.com/posts/a-successful-git-branching-model/ or the cactus model: https://barro.github.io/2016/02/a-succesful-git-branching-model-considered-harmful/

We can also add issue labels according to https://mediocre.com/forum/topics/how-we-use-labels-on-github-issues-at-mediocre-laboratories, or some other method.

rationalize debug output

Currently too much low-level debug output is generated. Review the output from each dependency, and add, remove, or update handlers as needed.
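One way this could go, sketched with the standard `logging` module: keep ddbench's own output at the chosen level and raise the threshold for chatty dependencies. The logger names `dedupe` and `rq.worker` are assumptions about what those packages log under and should be checked against the actual output.

```python
import logging


def configure_logging(level=logging.INFO, noisy=("dedupe", "rq.worker")):
    """Set a base log level and quiet selected dependency loggers."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
    )
    for name in noisy:
        # Child loggers (e.g. "dedupe.api") inherit this threshold
        # unless they set a level of their own.
        logging.getLogger(name).setLevel(logging.WARNING)
```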

Reconsider queue / multiprocessing

It's running with a Redis queue now because it was fast and familiar. It requires a running Redis server, which is easy but requires an extra step. If we stick with Redis, perhaps we could use it for more functionality such as storing report data and computing summary results. If we switch to multiprocessing or something else, we could remove a batteries-not-included installation requirement.
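For comparison, a sketch of the no-extra-server route using the standard-library `concurrent.futures`. `run_once` is a stand-in for a real benchmark run, and the seed-based interface is purely illustrative.

```python
from concurrent.futures import (
    ProcessPoolExecutor,
    ThreadPoolExecutor,
    as_completed,
)


def run_once(seed):
    """Placeholder for a single benchmark run; returns a small report."""
    return {"seed": seed, "result": seed * seed}


def run_batch(seeds, max_workers=2, executor_cls=ProcessPoolExecutor):
    """Run one job per seed in a worker pool and collect the reports.

    executor_cls can be swapped for ThreadPoolExecutor, which has the
    same interface, when debugging without separate processes.
    """
    reports = []
    with executor_cls(max_workers=max_workers) as pool:
        futures = [pool.submit(run_once, seed) for seed in seeds]
        # Collect results as jobs finish, in completion order.
        for future in as_completed(futures):
            reports.append(future.result())
    return sorted(reports, key=lambda report: report["seed"])
```

With a process pool, the calling code should sit under an `if __name__ == "__main__":` guard so that worker start-up does not re-run it on platforms that spawn rather than fork.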

Review license

To make sure it's the same one that dedupe has.

Encoding/char issue with dblp-scholar data

The id field in one of the dblp-scholar source files isn't coming through correctly. Note the different file formats:

data/dblp-scholar/DBLP-Scholar_perfectMapping.csv: ASCII text, with very long lines, with CRLF line terminators
data/dblp-scholar/DBLP1.csv:                       ISO-8859 English text, with very long lines, with CRLF line terminators
data/dblp-scholar/Scholar.csv:                     UTF-8 Unicode (with BOM) English text, with very long lines, with CRLF line terminators
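One plausible cause, not confirmed here, is the byte-order mark in Scholar.csv: if the file is decoded as plain UTF-8, the BOM ends up attached to the first header, so a lookup on `id` fails. The sketch below decodes with an explicit codec per file. The `latin-1` choice for DBLP1.csv is an assumption, since `file` reports only "ISO-8859".

```python
import csv
import io

# Assumed codec per source file, based on the `file` output above.
ENCODINGS = {
    "DBLP-Scholar_perfectMapping.csv": "ascii",
    "DBLP1.csv": "latin-1",       # assumption: ISO-8859-1
    "Scholar.csv": "utf-8-sig",   # strips the leading BOM
}


def read_rows(raw_bytes, encoding):
    """Decode CSV bytes with an explicit codec and return dict rows."""
    text = raw_bytes.decode(encoding)
    # newline="" lets the csv module handle the CRLF terminators.
    return list(csv.DictReader(io.StringIO(text, newline="")))
```

For a file on disk, the same idea applies with `open(path, encoding=ENCODINGS[name], newline="")`.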

Queue status reporting is too simplistic

[assuming we keep the queue]

The current code checking the status of jobs on the queue (https://github.com/dchud/ddbench/blob/master/ddbench.py#L273-L277) is awfully simplistic. Check status more granularly to provide more useful updates.
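A small sketch of what a more granular update could report: a tally per job status rather than a single done / not-done check. It assumes the statuses arrive as plain strings such as `queued`, `started`, `finished`, and `failed` (with RQ, these would presumably be collected from each job); the function name is hypothetical.

```python
from collections import Counter


def format_status(statuses):
    """Summarize job statuses as a one-line progress message."""
    counts = Counter(statuses)
    total = sum(counts.values())
    # Treat both finished and failed jobs as no longer pending.
    done = counts["finished"] + counts["failed"]
    parts = ", ".join(f"{name}={counts[name]}" for name in sorted(counts))
    return f"{done}/{total} done ({parts})"
```

Printing this line on each polling pass would show how many jobs are queued, running, or failed, not just whether the batch has ended.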
