
rusty-von-humboldt's People

Contributors

ddanielr, matthewkmayer

Forkers

pombredanne

rusty-von-humboldt's Issues

Use S3 Select when Rusoto supports it

Much of the runtime is spent waiting for files to download from S3, even when running inside AWS. S3 Select will let us query the files on S3 and pull back only the fields we want for commits and repository ID mappings.

Having this functionality should vastly reduce runtime. 🎉
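Rusoto doesn't expose S3 Select yet, so this is only a sketch of the kind of expression we'd eventually send; the field names are assumptions about the GitHub Archive event JSON, not anything in the code today.

```rust
// Hypothetical S3 Select expression: pull only the fields we need for commits
// and the repo id to name mapping, instead of downloading whole hourly files.
const SELECT_EXPRESSION: &str =
    "SELECT s.id, s.type, s.repo.id, s.repo.name, s.actor.id FROM S3Object s";
```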

Retry on S3 error

A `Couldn't GET object` failure currently panics and brings the whole house down when it happens.

Retrying is a better way.
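Roughly what the retry could look like; the helper, attempt count, and backoff are placeholders, not anything the code does today:

```rust
use std::{thread, time::Duration};

// Run a fallible S3 operation (the GET, in our case) a few times before
// giving up, instead of panicking on the first error.
fn with_retries<T, E, F>(max_attempts: u64, mut op: F) -> Result<T, E>
where
    F: FnMut() -> Result<T, E>,
    E: std::fmt::Display,
{
    let mut attempt = 1;
    loop {
        match op() {
            Ok(val) => return Ok(val),
            Err(e) if attempt < max_attempts => {
                eprintln!("S3 GET failed ({}), attempt {} of {}", e, attempt, max_attempts);
                // Back off a little more each time before trying again.
                thread::sleep(Duration::from_secs(2 * attempt));
                attempt += 1;
            }
            Err(e) => return Err(e),
        }
    }
}
```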

Bring back clap for CLI args

Add things like "number of files/hours to process" and "year to start." That way we don't have to recompile every time we change those settings.
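A rough sketch with clap 2's builder API; the flag names and defaults are just suggestions:

```rust
use clap::{App, Arg};

fn main() {
    let matches = App::new("rusty-von-humboldt")
        .arg(
            Arg::with_name("year")
                .long("year")
                .takes_value(true)
                .help("Year of GitHub Archive data to start with"),
        )
        .arg(
            Arg::with_name("hours")
                .long("hours")
                .takes_value(true)
                .help("Number of files/hours to process"),
        )
        .get_matches();

    let year = matches.value_of("year");
    let hours: u32 = matches
        .value_of("hours")
        .unwrap_or("24")
        .parse()
        .expect("hours must be a number");
    println!("year: {:?}, hours: {}", year, hours);
}
```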

Actor count on repo

Time for the rubber to hit the road.

Output SQL similar to the repo ID to name mapping, containing:

  • event_id
  • repo_id
  • actor_id

If we put it all in one big table it could look like:

  • auto-increment bigserial as PK
  • repo_id (index here)
  • actor_id (unique constraint on repo_id + actor_id to avoid duplicates)

Upserts would be DO NOTHING on conflict. 👍
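Since RvH emits SQL as text, a sketch of what it could write for this; the table and column names here are made up:

```rust
// Schema sketch: bigserial PK, index on repo_id, unique (repo_id, actor_id).
const CREATE_REPO_ACTORS: &str = "
CREATE TABLE IF NOT EXISTS repo_actors (
    id       BIGSERIAL PRIMARY KEY,
    event_id BIGINT NOT NULL,
    repo_id  BIGINT NOT NULL,
    actor_id BIGINT NOT NULL,
    UNIQUE (repo_id, actor_id)
);
CREATE INDEX IF NOT EXISTS repo_actors_repo_id_idx ON repo_actors (repo_id);";

// Upsert sketch: DO NOTHING on conflict so duplicates are silently dropped.
fn actor_on_repo_sql(event_id: i64, repo_id: i64, actor_id: i64) -> String {
    format!(
        "INSERT INTO repo_actors (event_id, repo_id, actor_id) VALUES ({}, {}, {}) \
         ON CONFLICT (repo_id, actor_id) DO NOTHING;",
        event_id, repo_id, actor_id
    )
}
```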

Report time spent

Instead of wrapping execution in the `time` command, RvH should time itself and report how long it took, perhaps including which year and how many hours were processed.
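A minimal self-timing sketch with std::time::Instant; the year/hours values would come from whatever CLI args we end up with:

```rust
use std::time::Instant;

fn main() {
    let start = Instant::now();

    // ... download, parse, and write everything ...

    println!(
        "Processed year {} ({} hours) in {:?}",
        2011,
        24,
        start.elapsed()
    );
}
```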

Reducing memory usage ideas

  • Have serde convert the event ID directly to i64 instead of having another field in the event struct (see serde-rs/json#317) #14 ✅ (a sketch follows this list)
  • For determining up to date repo names, use types that are only what we need (event id, repo id, repo name) #15
  • For determining up to date repo names, don't keep them all in memory: use a data store that can update if the incoming event ID is higher (newer event). Redis, Postgres, Dynamo?
  • Committer count: filter events out right after deserialization to avoid having lots of events we don't care about
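A sketch of the first item, assuming the event ID arrives as a JSON string like it does in the GitHub Archive data; the struct and field names are illustrative:

```rust
use serde::{Deserialize, Deserializer};

// Parse the event id straight into an i64 during deserialization so we don't
// carry a second string field on the event struct.
fn id_as_i64<'de, D: Deserializer<'de>>(deserializer: D) -> Result<i64, D::Error> {
    let raw = String::deserialize(deserializer)?;
    raw.parse().map_err(serde::de::Error::custom)
}

#[derive(Deserialize)]
struct Event {
    #[serde(deserialize_with = "id_as_i64")]
    id: i64,
    // ... repo, actor, type, etc.
}
```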

Do better pipelining

Instead of reading in huge chunks and processing them as big chunks, let's reduce memory usage by taking a pipeline approach.

Rough idea (sketched in code below):

Make four channels. Each channel gets a thread that:

  • listens for an S3 file
  • downloads and parses the file, converting to the ultimate end type (mostly parallelized right now)
  • collects a certain number of the end types (CommitEvent or RepoIdToName)
  • dedupes items if possible
  • once the "certain number" has been hit after deduplication, writes the SQL, compresses it, and uploads it to the destination bucket. Maybe add another set of channels for this step so we can make sure it isn't the bottleneck

Advantages:

  • (much) less memory consumption
  • the more items we collect the more we can dedupe them before the destination database has to
  • backpressure from the slowest steps will prevent us from going hog-wild on the upstream side
  • lets us add parallelism where needed
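A very rough shape of the pipeline, collapsed to two channels and one worker per stage for brevity; the key format, batch size, and item types are placeholders:

```rust
use crossbeam_channel::bounded;
use std::thread;

const BATCH_SIZE: usize = 10_000;

fn main() {
    // Bounded channels between stages so a slow step applies backpressure upstream.
    let (key_tx, key_rx) = bounded::<String>(100);          // S3 keys to fetch
    let (event_tx, event_rx) = bounded::<String>(BATCH_SIZE); // parsed end types

    // Stage 1: list S3 files and feed the key channel.
    let lister = thread::spawn(move || {
        for hour in 0..24 {
            key_tx.send(format!("2011-01-01-{}.json.gz", hour)).unwrap();
        }
    });

    // Stage 2: download + parse, pushing end types downstream.
    let parser = thread::spawn(move || {
        for key in key_rx {
            // download `key`, deserialize, convert to CommitEvent/RepoIdToName...
            event_tx.send(format!("event from {}", key)).unwrap();
        }
    });

    // Stage 3: collect a batch, dedupe, then write SQL / compress / upload.
    let writer = thread::spawn(move || {
        let mut batch = Vec::with_capacity(BATCH_SIZE);
        for event in event_rx {
            batch.push(event);
            if batch.len() >= BATCH_SIZE {
                batch.sort();
                batch.dedup();
                // write to SQL, compress, upload to the destination bucket...
                batch.clear();
            }
        }
        // flush whatever is left in `batch` here
    });

    lister.join().unwrap();
    parser.join().unwrap();
    writer.join().unwrap();
}
```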

Dedupe 'em

Repo ID to repo name mappings have lots of duplication in them. Deduplicate them as much as we can before making the destination database deal with it.
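One way to do it, keeping only the newest name we've seen for each repo ID; the RepoIdToName fields here are assumed:

```rust
use std::collections::HashMap;

struct RepoIdToName {
    event_id: i64,
    repo_id: i64,
    repo_name: String,
}

// Keep only the mapping with the highest (newest) event id per repo id.
fn dedupe(mappings: Vec<RepoIdToName>) -> Vec<RepoIdToName> {
    let mut newest: HashMap<i64, RepoIdToName> = HashMap::new();
    for mapping in mappings {
        match newest.get(&mapping.repo_id) {
            Some(existing) if existing.event_id >= mapping.event_id => {}
            _ => {
                newest.insert(mapping.repo_id, mapping);
            }
        }
    }
    newest.into_iter().map(|(_, mapping)| mapping).collect()
}
```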

Are we blocking on sending events?

The download threads download files from S3, parse them into events, then send those events to a bounded crossbeam channel.

It'd be nice to see if we ever fill that channel: that would tell us whether the bottleneck is downloading/parsing or handling the events. With that info we can decide what to run RvH on: something with more bandwidth or more compute power.

See is_full or len in https://github.com/crossbeam-rs/crossbeam/blob/master/crossbeam-channel/src/channel.rs .
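A sketch of what that check could look like on crossbeam's bounded channel; where and how often we log it is up for grabs:

```rust
use crossbeam_channel::bounded;

fn main() {
    let (tx, rx) = bounded::<String>(1_000);

    // ... give `tx` to the download/parse threads and `rx` to the event handler ...

    // If the channel sits at capacity, handling events is the bottleneck;
    // if it sits near empty, downloading/parsing is.
    println!(
        "event channel: {} of {:?} slots used, full: {}",
        tx.len(),
        tx.capacity(),
        tx.is_full()
    );

    let _ = rx;
}
```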

Process all years

Right now our request to S3 is limited to a single year because we specify the key must start with the year.

Support not providing the year argument and processing everything. 🤘

Keep the year arg as optional.
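A sketch with Rusoto's ListObjectsV2Request: when the year is present keep today's prefix behavior, when it's absent list everything.

```rust
use rusoto_s3::ListObjectsV2Request;

fn list_request(bucket: &str, year: Option<u32>) -> ListObjectsV2Request {
    ListObjectsV2Request {
        bucket: bucket.to_owned(),
        // Some(year) keeps the current "key must start with the year" behavior;
        // None lists every object in the bucket.
        prefix: year.map(|y| y.to_string()),
        ..Default::default()
    }
}
```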

Obfuscate actor names

We don't need an amazing, impossible-to-break hash here, just something harder than "look at the database contents" to find the actor name.

SHA1 without a salt may be sufficient for this.
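Something like this with the `sha1` crate would do; the function name is made up:

```rust
use sha1::Sha1;

// Unsalted SHA-1 of the login: not strong, but harder than reading the raw
// actor name straight out of the database.
fn obfuscate_actor(login: &str) -> String {
    Sha1::from(login.as_bytes()).digest().to_string()
}
```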

Use a BTreeMap for storing committers?

Right now we store everything in a vector and sort and deduplicate it, a lot.

How about using a BTreeMap and its entry API instead? Automatic deduplication.

Use CommitEvent as the key; the value could be the number of commits if we want to track how many times they committed, or just a sentinel value like a boolean we don't really use.
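A sketch of the entry-API version, with CommitEvent slimmed down to illustrative fields; the value here counts commits, but a bool would also work:

```rust
use std::collections::BTreeMap;

#[derive(PartialEq, Eq, PartialOrd, Ord)]
struct CommitEvent {
    repo_id: i64,
    actor_id: i64,
}

// The map dedupes for us; no more repeated sort-and-dedup passes over a Vec.
fn count_commits(events: Vec<CommitEvent>) -> BTreeMap<CommitEvent, u64> {
    let mut commits = BTreeMap::new();
    for event in events {
        *commits.entry(event).or_insert(0) += 1;
    }
    commits
}
```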

Use `log` crate instead of printlns

Still got some println! macros hanging around, some commented out. Use log and env_logger to do things like debug- and info-level logging as needed.
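Roughly what that looks like with log + env_logger; log levels are then controlled through RUST_LOG instead of recompiling:

```rust
use log::{debug, info};

fn main() {
    // e.g. RUST_LOG=rusty_von_humboldt=debug turns the chatty output on at runtime.
    env_logger::init();

    info!("starting up");
    debug!("downloaded {} files so far", 42);
}
```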

Update to latest version of Rusoto

Some bugs around AWS closing connections have been fixed in the latest version of Rusoto. Updating lets us remove some of the wait-and-retry blocks we have.

Bleed over from short years to next year

When processing exactly one year's worth of hours starting in 2011, we sometimes process items from 2012, since there isn't a full year of data in 2011.

Let's not do that: if we're processing 2011, only process 2011.

Filter on the S3 file list step?
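Filtering at the file-list step seems like the simplest fix; a sketch, assuming we have the listed keys as strings:

```rust
// Drop any key that isn't from the requested year before we ever download it.
fn keep_only_year(mut keys: Vec<String>, year: &str) -> Vec<String> {
    keys.retain(|key| key.starts_with(year));
    keys
}
```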
