ddbench's People

Contributors

dchud

ddbench's Issues

Sub-commands command structure

Let's create an argparse command that uses sub-commands for primary functionality (similar to git and its related sub-commands). We can use something like Commis if we want a bit more structure, or just use plain argparse.
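A minimal sketch of the plain-argparse option follows. The sub-command names (`run`, `report`) and their arguments are hypothetical placeholders, not something this issue has settled on.

```python
import argparse


def build_parser():
    """Build a top-level parser with one sub-parser per sub-command."""
    parser = argparse.ArgumentParser(prog="ddbench")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # Hypothetical "run" sub-command: execute a number of benchmark runs.
    run = subparsers.add_parser("run", help="execute benchmark runs")
    run.add_argument("dataset", help="name of the dataset to benchmark")
    run.add_argument("--runs", type=int, default=1, help="number of runs")

    # Hypothetical "report" sub-command: summarize earlier output.
    report = subparsers.add_parser("report", help="summarize run output")
    report.add_argument("path", help="folder holding run output")

    return parser
```

Calling `build_parser().parse_args()` from an entry point would then dispatch on `args.command`, much as git dispatches on its first argument.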

Determine what data to collect and add to reports

Each dedupe run currently captures only minimal data. Consider what each run should capture, and add it to the report output. Some possibilities include:

  • verify that the components needed to compute the F1 score are handled correctly; include the F1 score too
  • matched and unmatched pairs
  • correct and incorrect matches
  • execution timing by gross category: load/prep, sampling, labelling, training, matching, review
  • execution timing by finer category (probably requires changes within dedupe)
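As a rough sketch of two of the items above, the following computes precision, recall, and F1 from sets of record-id pairs and accumulates wall-clock time per gross category. The function names and the report keys are hypothetical, and the sketch assumes pairs are already ordered consistently (for example, always `(dblp_id, scholar_id)`).

```python
import time
from contextlib import contextmanager


def pair_metrics(predicted, actual):
    """Compute precision, recall, and F1 from two collections of id pairs."""
    predicted = set(predicted)
    actual = set(actual)
    true_pos = len(predicted & actual)  # correct matches
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {
        "correct_matches": true_pos,
        "incorrect_matches": len(predicted - actual),
        "missed_matches": len(actual - predicted),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


@contextmanager
def timed(timings, category):
    """Add the elapsed seconds of the enclosed block to timings[category]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[category] = timings.get(category, 0.0) + elapsed
```

Wrapping each phase in `with timed(timings, "training"):` (and likewise for the other categories) would give the gross timing breakdown without changes inside dedupe; the finer breakdown would still need hooks there.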

Reconsider report output format/location/strategy

The current version writes JSON output to files with common prefixes. We also discussed generating multiple SQLite databases or one centralized database, and there are other options to consider, such as a store optimized for time series / event data.

If we roll ahead with the current strategy, we could also move all output runs from a single execution into their own folder, change the folder/file naming strategy, etc.
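If the current JSON strategy stays, grouping the runs of one execution into a folder could look roughly like this. The folder and file naming scheme (a UTC timestamp folder, `run-001.json` and so on) is only an assumption for illustration.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_execution(reports, base_dir="output"):
    """Write each run report as JSON into one folder per execution."""
    # Hypothetical naming: one UTC-timestamped folder per execution.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    folder = Path(base_dir) / stamp
    folder.mkdir(parents=True, exist_ok=True)
    for index, report in enumerate(reports, start=1):
        path = folder / f"run-{index:03d}.json"
        path.write_text(json.dumps(report, indent=2), encoding="utf-8")
    return folder
```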

Decide org/location for this repo

We should make sure we're using the same license as dedupe; hopefully it's also Apache!

We can also decide whether or not to move this into the DDL org or keep it with Dan's code, or maybe even move to Datamade's org?

Summarize report output per batch

Pull together the output data from multiple runs into a single batch, computing at least basic descriptive statistics for all key values.
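A minimal sketch of such a summary, using only the standard library. It assumes each run report is a flat dict of numeric values; the key names passed in are up to the caller.

```python
import statistics


def summarize(reports, keys):
    """Compute basic descriptive statistics per key across run reports."""
    summary = {}
    for key in keys:
        values = [report[key] for report in reports if key in report]
        if not values:
            continue  # no run recorded this key
        summary[key] = {
            "n": len(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            "min": min(values),
            "max": max(values),
            # Sample standard deviation needs at least two values.
            "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        }
    return summary
```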

Rationalize debug output

Too much low-level debug output is currently generated. Review the output from each dependency, and add, remove, or modify logging handlers as needed.
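One way to quiet noisy dependencies is to raise the level on their loggers while keeping our own output at a lower level. The logger names below (`dedupe`, `rq.worker`) are assumptions about what the dependencies use; they would need checking against the actual output.

```python
import logging


def configure_logging(level=logging.INFO, quiet=("dedupe", "rq.worker")):
    """Set up root logging and raise the level of noisy dependency loggers."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
    )
    # Assumed logger names; child loggers inherit the raised level.
    for name in quiet:
        logging.getLogger(name).setLevel(logging.WARNING)
```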

Reconsider queue / multiprocessing

It's running with a Redis queue now because that was fast and familiar. It requires a running Redis server, which is easy to set up but adds an extra step. If we stick with Redis, perhaps we could use it for more functionality, such as storing report data and computing summary results. If we switch to multiprocessing or something else, we could remove a batteries-not-included installation requirement.
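For comparison, a standard-library version of the fan-out might look like the sketch below. `run_once` is a hypothetical stand-in for a single dedupe run, and the report it returns is a placeholder.

```python
import multiprocessing


def run_once(seed):
    """Placeholder for a single dedupe run; returns a minimal report."""
    return {"seed": seed, "status": "ok"}


def run_batch(seeds, workers=2):
    """Run one job per seed across a pool of worker processes."""
    with multiprocessing.Pool(processes=workers) as pool:
        return pool.map(run_once, seeds)
```

On platforms that start workers by spawning a fresh interpreter, `run_batch` should be called from under an `if __name__ == "__main__":` guard. The trade-off is that results live only in the parent process, so report storage would need its own answer.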

Encoding/char issue with dblp-scholar data

The id field in one of the dblp-scholar source files isn't coming through correctly. Note the different file formats:

data/dblp-scholar/DBLP-Scholar_perfectMapping.csv: ASCII text, with very long lines, with CRLF line terminators
data/dblp-scholar/DBLP1.csv:                       ISO-8859 English text, with very long lines, with CRLF line terminators
data/dblp-scholar/Scholar.csv:                     UTF-8 Unicode (with BOM) English text, with very long lines, with CRLF line terminators
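One plausible cause, not confirmed here, is that the byte-order mark in Scholar.csv ends up glued to the first column name when the file is read as plain UTF-8. A sketch of reading each file with an explicit encoding follows; `latin-1` for DBLP1.csv is an assumption, since `file` only reports "ISO-8859".

```python
import csv

# Encoding per file, based on the `file` output above.
ENCODINGS = {
    "DBLP-Scholar_perfectMapping.csv": "ascii",
    "DBLP1.csv": "latin-1",      # assumption: `file` only says "ISO-8859"
    "Scholar.csv": "utf-8-sig",  # utf-8-sig strips the leading BOM
}


def read_rows(path, encoding):
    """Read a CSV file into a list of dicts using an explicit encoding."""
    # newline="" lets the csv module handle the CRLF line terminators.
    with open(path, newline="", encoding=encoding) as handle:
        return list(csv.DictReader(handle))
```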
