dchud / ddbench
Benchmarking suite for dedupe
License: Apache License 2.0
Let's create an argparse command that uses sub-commands for primary functionality (similar to git and its related sub-commands). We can use something like Commis if we want a bit more structure, or use plain argparse.
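A minimal sketch of the plain argparse route, for discussion only. The sub-command names (`run`, `report`) and their arguments are hypothetical placeholders, not anything ddbench defines today.

```python
import argparse


def build_parser():
    """Build a top-level parser with git-style sub-commands."""
    parser = argparse.ArgumentParser(prog="ddbench")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # Hypothetical "run" sub-command: execute benchmark runs.
    run = subparsers.add_parser("run", help="execute benchmark runs")
    run.add_argument("dataset", help="name of the dataset to benchmark")
    run.add_argument("--runs", type=int, default=1, help="number of runs")

    # Hypothetical "report" sub-command: summarize earlier output.
    report = subparsers.add_parser("report", help="summarize run output")
    report.add_argument("path", help="directory holding run output")

    return parser


# Parse an explicit argument list instead of sys.argv for illustration.
args = build_parser().parse_args(["run", "dblp-scholar", "--runs", "3"])
print(args.command, args.dataset, args.runs)
```

Each sub-command gets its own parser, so options stay scoped to the command that needs them, much like `git commit` and `git log` accept different flags.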
Did this project ever produce any results on dedupe's performance?
Thanks!
Each dedupe run captures only minimal data; consider what each run should capture and add it to the report output. Some possibilities include:
The current version writes JSON output to files with common prefixes. We also discussed generating multiple SQLite databases or one centralized database, and there are other options to consider, like something optimized for time series / event data.
If we roll ahead with the current strategy, we could also move all output runs from a single execution into their own folder, change the folder/file naming strategy, etc.
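If we keep JSON files, one way to group all runs from a single execution could look like the sketch below. The timestamped folder name and the `run-NNN.json` file pattern are assumptions for illustration, not the current ddbench layout.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_execution(reports, base_dir, started=None):
    """Write each run's report as JSON inside one folder per execution."""
    started = started or datetime.now(timezone.utc)
    # Hypothetical naming scheme: one timestamped folder per execution.
    out_dir = Path(base_dir) / started.strftime("%Y%m%dT%H%M%SZ")
    out_dir.mkdir(parents=True, exist_ok=False)
    for i, report in enumerate(reports, start=1):
        # Zero-padded run numbers keep the files sorted in run order.
        path = out_dir / f"run-{i:03d}.json"
        path.write_text(json.dumps(report, indent=2), encoding="utf-8")
    return out_dir
```

Passing `exist_ok=False` makes a second execution with the same timestamp fail loudly instead of silently mixing its files with an earlier batch.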
We should make sure we're using the same license as dedupe; hopefully it's also Apache!
We can also decide whether or not to move this into the DDL org or keep it with Dan's code, or maybe even move to Datamade's org?
Pull together the output data from multiple runs into a single batch, computing at least basic descriptive statistics for all key values.
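As a starting point, the batch summary could be sketched with the standard library alone. This assumes each run's report has already been loaded as a dict of numeric values; the key names passed in are up to the caller and are not taken from the current report format.

```python
import statistics


def summarize_runs(runs, keys):
    """Compute basic descriptive statistics per key across runs."""
    summary = {}
    for key in keys:
        # Skip runs that did not record this key.
        values = [run[key] for run in runs if key in run]
        if not values:
            continue
        summary[key] = {
            "n": len(values),
            "min": min(values),
            "max": max(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            # Sample standard deviation needs at least two values.
            "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        }
    return summary
```

Keys that no run recorded are left out of the summary rather than reported with placeholder values.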
Review fields from the 4 sample datasets that are nominated for use in deduplication.
To avoid conflicts, I suggest we set up the repo according to http://nvie.com/posts/a-successful-git-branching-model/ or the cactus model: https://barro.github.io/2016/02/a-succesful-git-branching-model-considered-harmful/
We can also add issue labels according to https://mediocre.com/forum/topics/how-we-use-labels-on-github-issues-at-mediocre-laboratories or some other method.
Currently too much low-level debug output is generated. Review all the output from each dependency, add/remove/update and modify handlers as needed.
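One possible approach, sketched with the standard `logging` module: keep our own output at the chosen level and raise the threshold for noisy dependencies. The logger names in `quiet` are assumptions; we would need to check which names each dependency actually logs under.

```python
import logging


def configure_logging(level=logging.INFO, quiet=("rq.worker", "dedupe")):
    """Set the root level and raise the threshold for chatty dependencies."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
    )
    # Assumed logger names; verify against each dependency before relying on them.
    for name in quiet:
        logging.getLogger(name).setLevel(logging.WARNING)


configure_logging()
```

Note that `logging.basicConfig` does nothing if the root logger already has handlers, so this should be called once, early in startup.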
It's running with redis queue now because it was fast and familiar. It requires a running redis server, which is easy but requires an extra step. If we stick with redis, perhaps we could use it for more functionality such as storing report data and computing summary results. If we switch to multiprocessing or something else, we could remove a batteries-not-included installation requirement.
To make sure it's the same one that dedupe has.
The id field in one of the dblp-scholar source files isn't coming through correctly. Note the different file formats:
data/dblp-scholar/DBLP-Scholar_perfectMapping.csv: ASCII text, with very long lines, with CRLF line terminators
data/dblp-scholar/DBLP1.csv: ISO-8859 English text, with very long lines, with CRLF line terminators
data/dblp-scholar/Scholar.csv: UTF-8 Unicode (with BOM) English text, with very long lines, with CRLF line terminators
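One plausible cause, not confirmed here, is the byte order mark in Scholar.csv: decoded as plain UTF-8, the BOM stays attached to the first header name, so a lookup by `id` fails. The sketch below shows the effect on a made-up two-line sample (not real data from the file) and how the `utf-8-sig` codec avoids it.

```python
import csv
import io


def read_rows(raw_bytes, encoding):
    """Decode raw CSV bytes and return the rows as dicts keyed by header."""
    text = raw_bytes.decode(encoding)
    return list(csv.DictReader(io.StringIO(text, newline="")))


# Hypothetical sample: UTF-8 with BOM and CRLF line endings.
sample = b"\xef\xbb\xbfid,title\r\nabc,Some title\r\n"

plain = read_rows(sample, "utf-8")      # BOM is kept in the first header
fixed = read_rows(sample, "utf-8-sig")  # BOM is stripped while decoding

print(list(plain[0].keys()))
print(list(fixed[0].keys()))
```

If this is the cause, each source file would need its own encoding when opened, for example `utf-8-sig` for Scholar.csv and an ISO-8859 codec such as `latin-1` for DBLP1.csv, going by the file types listed above.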
[assuming we keep the queue]
The current code checking the status of jobs on the queue (https://github.com/dchud/ddbench/blob/master/ddbench.py#L273-L277) is awfully simplistic. Check status more granularly to provide more useful updates.
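A rough sketch of a more granular update: tally jobs by status and report the breakdown instead of a single done/not-done check. It takes plain status strings, however they are gathered (for RQ, presumably from each job's `get_status()`), and treats `finished` and `failed` as terminal. The function name and message format are hypothetical.

```python
from collections import Counter


def summarize_statuses(statuses):
    """Return a one-line progress summary from an iterable of job statuses."""
    counts = Counter(statuses)
    total = sum(counts.values())
    # Jobs that will not change state again count as done.
    done = counts["finished"] + counts["failed"]
    parts = ", ".join(f"{name}={n}" for name, n in sorted(counts.items()))
    return f"{done}/{total} done ({parts})"


print(summarize_statuses(["queued", "started", "finished", "finished", "failed"]))
```

Reporting failed jobs separately from finished ones would also make it easier to notice a run that ends early because of errors.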