matthewkmayer / rusty-von-humboldt
Exploring GitHub Archive with Rust
License: MIT License
Much of the runtime is spent waiting for files to download from S3, even inside AWS. S3 Select will let us take the files on S3 and select only the fields we want for commits and repository ID mappings.
Having this functionality should vastly reduce runtime.
A "Couldn't GET object" error currently panics and brings the whole house down when it happens.
Retrying is a better way.
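A minimal sketch of what retrying could look like, with a hypothetical download_object standing in for the real Rusoto GetObject call:

```rust
use std::{thread, time::Duration};

// Hypothetical wrapper around the Rusoto GetObject call; the real code
// would return the object body or a Rusoto error.
fn download_object(key: &str) -> Result<Vec<u8>, String> {
    Err(format!("Couldn't GET object {}", key)) // placeholder
}

// Retry the download a few times with a growing delay instead of panicking.
fn download_with_retries(key: &str, max_attempts: u32) -> Result<Vec<u8>, String> {
    let mut last_err = String::new();
    for attempt in 1..=max_attempts {
        match download_object(key) {
            Ok(bytes) => return Ok(bytes),
            Err(e) => {
                last_err = e;
                // Simple linear backoff: 1s, 2s, 3s, ...
                thread::sleep(Duration::from_secs(u64::from(attempt)));
            }
        }
    }
    Err(last_err)
}

fn main() {
    match download_with_retries("2016-01-01-0.json.gz", 3) {
        Ok(bytes) => println!("got {} bytes", bytes.len()),
        Err(e) => eprintln!("gave up after retries: {}", e),
    }
}
```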
Per https://aws.amazon.com/about-aws/whats-new/2018/07/amazon-s3-announces-increased-request-rate-performance/ we don't need to randomize keys to get better read throughput.
Right now we do sequential reads of the files which makes hotspots in the S3 service. Best practices for S3 performance says we should distribute requests across the object hashes equally.
We can do this by randomizing the list of files we get to spread the load.
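A sketch of that randomization using the rand crate (assuming it, or something like it, is available):

```rust
use rand::seq::SliceRandom;
use rand::thread_rng;

fn main() {
    let mut keys = vec![
        "2016-01-01-0.json.gz".to_string(),
        "2016-01-01-1.json.gz".to_string(),
        "2016-01-01-2.json.gz".to_string(),
    ];
    // Shuffle so workers don't fetch lexicographically adjacent keys in
    // order, which concentrates load on one S3 partition.
    keys.shuffle(&mut thread_rng());
    println!("{:?}", keys);
}
```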
Crossbeam channels are more performant than the standard library's mpsc channels.
https://docs.rs/crossbeam/0.7.2/crossbeam/channel/index.html
It'd be rad if we had an idea of how many files have been processed. https://github.com/matthewkmayer/release-party-BR uses a progress indicator to show that.
Recently released!
Add things like "number of files/hours to process" and "year to start." This will save us from recompiling every time we change those settings.
Like #94.
Time for the rubber to hit the road.
Output SQL containing similar things to the repo ID to name mapping.
If we put it all in one big table, upserts would be DO NOTHING on conflict.
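A sketch of generating such an upsert; the repo_mapping table and column names here are illustrative, not the actual schema:

```rust
// Build an upsert statement for the repo id to name mapping.
// ON CONFLICT DO NOTHING makes repeated inserts of the same repo harmless.
fn upsert_sql(repo_id: i64, repo_name: &str) -> String {
    format!(
        "INSERT INTO repo_mapping (repo_id, repo_name) VALUES ({}, '{}') \
         ON CONFLICT (repo_id) DO NOTHING;",
        repo_id,
        repo_name.replace('\'', "''") // naive quote-escaping for the sketch
    )
}

fn main() {
    println!("{}", upsert_sql(1234, "matthewkmayer/rusty-von-humboldt"));
}
```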
Can we look at repo mappings and committer counts in a single run? That way we don't have to run RvH twice.
We'd probably have to be careful with memory usage.
Instead of wrapping execution in the time command, RvH should time itself and output how long it took. Perhaps include what year and how many hours were processed.
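A minimal sketch with std::time::Instant; the year/hours values are placeholders for whatever the real config holds:

```rust
use std::time::Instant;

fn main() {
    let start = Instant::now();

    // ... download and process event files here ...

    // Placeholder values; the real run would report its actual settings.
    let (year, hours) = (2016, 8760);
    println!(
        "Processed {} hours of {} in {}s",
        hours,
        year,
        start.elapsed().as_secs()
    );
}
```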
Use i64 instead of having another field in the event struct (see serde-rs/json#317). #14
Instead of reading in huge chunks and processing them as big chunks, let's reduce memory usage by taking a pipeline approach.
Rough idea:
Make four channels. Each channel gets a thread that handles one stage of the pipeline, ending in the event type we need (CommitEvent or RepoIdToName). Advantages: memory usage stays low, since we never hold a whole chunk of events at once. A rough sketch of the shape is below.
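A simplified two-stage sketch of that shape using crossbeam channels (the real version would have four channels and use serde_json for parsing; CommitEvent here is a stand-in struct):

```rust
use crossbeam_channel::bounded;
use std::thread;

// Illustrative event type standing in for CommitEvent / RepoIdToName.
struct CommitEvent {
    repo_id: i64,
}

fn main() {
    // Downloader -> parser: raw JSON lines from downloaded files.
    let (raw_tx, raw_rx) = bounded::<String>(1_000);
    // Parser -> writer: parsed events ready to be written out.
    let (event_tx, event_rx) = bounded::<CommitEvent>(1_000);

    // Downloader: in the real program this streams files from S3.
    let downloader = thread::spawn(move || {
        for i in 0..100 {
            raw_tx.send(format!("{{\"repo_id\": {}}}", i)).unwrap();
        }
        // raw_tx drops here, which ends the parser's loop.
    });

    // Parser: turn raw lines into events; serde_json would do this for real.
    let parser = thread::spawn(move || {
        for _line in raw_rx {
            event_tx.send(CommitEvent { repo_id: 42 }).unwrap();
        }
    });

    // Writer: consume events as they arrive instead of batching everything.
    for event in event_rx {
        let _ = event.repo_id; // emit SQL here in the real program
    }

    downloader.join().unwrap();
    parser.join().unwrap();
}
```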
Right now we have to recompile to switch between committer counts and repo id mapping. Let's use env vars instead.
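A sketch of reading the mode from an env var; the variable name MODE is an assumption, not the repo's actual config:

```rust
use std::env;

// Pick the run mode from an environment variable instead of recompiling.
fn main() {
    let mode = env::var("MODE").unwrap_or_else(|_| "committer_count".to_string());
    match mode.as_str() {
        "committer_count" => println!("counting committers"),
        "repo_mapping" => println!("mapping repo ids to names"),
        other => eprintln!("unknown MODE: {}", other),
    }
}
```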
Right now our SQL output is pretty naive. We could do things to improve insert performance, such as multi-row inserts (insert into repo_mapping (col_foo, col_bar) values (a, b), (a, c)). Reference: https://www.depesz.com/2007/07/05/how-to-insert-data-to-database-as-fast-as-possible/ .
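A sketch of batching rows into a single multi-row insert, with illustrative table and column names:

```rust
// Batch many rows into one INSERT statement instead of one statement per row.
fn multi_row_insert(rows: &[(i64, &str)]) -> String {
    let values: Vec<String> = rows
        .iter()
        .map(|(id, name)| format!("({}, '{}')", id, name.replace('\'', "''")))
        .collect();
    format!(
        "INSERT INTO repo_mapping (repo_id, repo_name) VALUES {};",
        values.join(", ")
    )
}

fn main() {
    let rows = [(1, "foo/bar"), (2, "baz/quux")];
    println!("{}", multi_row_insert(&rows));
}
```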
Repo id to repo name mappings have lots of duplication in them. Deduplicate them as much as we can before making the end database deal with it.
Less unwrap and expect, more use of the failure crate.
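A sketch of the failure-based style: return Result and let errors bubble up with ? instead of unwrapping:

```rust
use failure::{format_err, Error};

// Return a Result instead of calling unwrap/expect, so callers can
// retry or report the problem instead of crashing the whole run.
fn parse_repo_id(raw: &str) -> Result<i64, Error> {
    raw.trim()
        .parse::<i64>()
        .map_err(|e| format_err!("bad repo id {:?}: {}", raw, e))
}

fn main() -> Result<(), Error> {
    let id = parse_repo_id("12345")?;
    println!("repo id: {}", id);
    Ok(())
}
```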
The download threads download files from S3 and parse them into events, then send those to a bounded crossbeam channel.
It'd be nice to see if we ever fill that channel: that would tell us whether the bottleneck is downloading/parsing or handling the events. With this info we can decide how to run RvH: on something with more bandwidth or with more compute power.
See is_full or len in https://github.com/crossbeam-rs/crossbeam/blob/master/crossbeam-channel/src/channel.rs .
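A small illustration of those methods on a bounded channel:

```rust
use crossbeam_channel::bounded;

fn main() {
    let (tx, rx) = bounded::<u32>(4);
    for i in 0..4 {
        tx.send(i).unwrap();
    }
    // len() and is_full() tell us whether consumers are keeping up:
    // a consistently full channel means event handling is the bottleneck,
    // not downloading/parsing.
    println!("queued: {} / full: {}", rx.len(), rx.is_full());
}
```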
Pre-2015 GitHub events look different. We should handle those too.
Right now our request to S3 is limited to a single year because we specify the key must start with the year.
Support not providing the year argument and processing everything.
Keep the year arg optional.
We don't need to make this an amazing, impossible to break hash, just something harder than "look at the database contents" to find the actor name.
SHA1 without a salt may be sufficient for this.
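A sketch assuming the sha1 crate's 0.6-era API; unsalted, per the issue:

```rust
// Hash the actor name so raw names aren't sitting in the database.
// Unsalted SHA-1: not cryptographically strong, just harder than
// reading the table contents directly.
fn hash_actor_name(name: &str) -> String {
    sha1::Sha1::from(name.as_bytes()).digest().to_string()
}

fn main() {
    println!("{}", hash_actor_name("octocat"));
}
```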
Right now we store everything in a vector and sort and deduplicate it, a lot.
How about using a BTreeMap and its entry API instead? Automatic deduplication.
Use CommitEvent as the key; the value could be the number of commits, which would even let us track how many times someone committed, or just a sentinel value like a boolean we don't really use.
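A sketch of the entry-API approach, using an (actor, repo id) tuple as a stand-in for CommitEvent (which would need to implement Ord to be a key):

```rust
use std::collections::BTreeMap;

fn main() {
    // Keys deduplicate automatically; the value tracks the commit count.
    let mut commits: BTreeMap<(String, i64), u64> = BTreeMap::new();

    for &(actor, repo_id) in [("alice", 1i64), ("bob", 1), ("alice", 1)].iter() {
        *commits.entry((actor.to_string(), repo_id)).or_insert(0) += 1;
    }

    for ((actor, repo_id), count) in &commits {
        println!("{} made {} commit(s) to repo {}", actor, count, repo_id);
    }
}
```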
It's Not Great™️ to find out we can't upload an hour or two of number crunching right at the end.
Try uploading a test file to the destination bucket before we start working on the analysis.
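A sketch of a canary upload using Rusoto's 0.3x-era blocking sync() API; the bucket and key names are illustrative:

```rust
use rusoto_core::Region;
use rusoto_s3::{PutObjectRequest, S3, S3Client};

// Upload a tiny canary object before doing hours of work, so missing
// permissions surface immediately instead of at the end.
fn check_upload_permissions(bucket: &str) -> Result<(), String> {
    let client = S3Client::new(Region::UsEast1);
    let req = PutObjectRequest {
        bucket: bucket.to_string(),
        key: "rvh-upload-canary".to_string(),
        body: Some(b"canary".to_vec().into()),
        ..Default::default()
    };
    client
        .put_object(req)
        .sync()
        .map(|_| ())
        .map_err(|e| format!("can't write to {}: {}", bucket, e))
}

fn main() {
    if let Err(e) = check_upload_permissions("my-results-bucket") {
        eprintln!("aborting before analysis: {}", e);
    }
}
```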
GHAs aren't really used right now, but this is a small enough repo to experiment with them.
We can publish this as a Docker container and run via ECS or Fargate.
This experiment has achieved its goal and I'm not planning on working on it any more.
Still got some println macros hanging around, some commented out. Use log and env_logger to do debug- and info-level logging as needed.
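A minimal sketch of the log + env_logger setup; the log level comes from the RUST_LOG env var:

```rust
use log::{debug, info};

fn main() {
    // Run with RUST_LOG=debug to see both lines; env_logger reads the
    // level filter from the environment.
    env_logger::init();

    info!("starting run");
    debug!("this only shows up at debug level");
}
```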
Can we Dockerize this in https://github.com/frol/docker-alpine-rust ?
If nothing else, can we compile a Linux binary on an OSX machine without dealing with the openssl cross-compilation nightmare?
Some bugs around AWS closing connections have been fixed in the latest version of Rusoto. This allows us to remove some wait and retry blocks we have.
We pull down a lot of data we throw away. S3 Select will let us greatly reduce the amount of data we transfer from S3 to the instance running Rusty von Humboldt.
This should be supported in Rusoto 0.35.0: https://rusoto.github.io/rusoto/rusoto_s3/struct.SelectObjectContentRequest.html .
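A sketch of building that request against the Rusoto API linked above; the SQL expression and field paths are illustrative, and GH Archive files are gzipped JSON lines, hence the serialization settings:

```rust
use rusoto_s3::{
    InputSerialization, JSONInput, JSONOutput, OutputSerialization,
    SelectObjectContentRequest,
};

// Ask S3 to return only the fields we care about, instead of whole files.
fn select_request(bucket: &str, key: &str) -> SelectObjectContentRequest {
    SelectObjectContentRequest {
        bucket: bucket.to_string(),
        key: key.to_string(),
        expression: "SELECT s.repo.id, s.repo.name FROM S3Object s".to_string(),
        expression_type: "SQL".to_string(),
        input_serialization: InputSerialization {
            json: Some(JSONInput { type_: Some("LINES".to_string()) }),
            compression_type: Some("GZIP".to_string()),
            ..Default::default()
        },
        output_serialization: OutputSerialization {
            json: Some(JSONOutput::default()),
            ..Default::default()
        },
        ..Default::default()
    }
}
```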
Rusoto 0.30.0 has the sweet, sweet streaming Read trait!
When processing exactly one year's worth of hours for 2011, we sometimes process items from 2012, since there isn't a full year of data in 2011.
Let's not do that: if we're processing 2011, only process 2011.
Filter on the S3 file list step?
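A sketch of that filter; GH Archive keys look like 2011-01-01-0.json.gz, so a prefix check on the year works:

```rust
// Keep only keys for the requested year.
fn filter_by_year(keys: Vec<String>, year: i32) -> Vec<String> {
    let prefix = format!("{}-", year);
    keys.into_iter().filter(|k| k.starts_with(&prefix)).collect()
}

fn main() {
    let keys = vec![
        "2011-12-31-23.json.gz".to_string(),
        "2012-01-01-0.json.gz".to_string(),
    ];
    // Only the 2011 key survives.
    println!("{:?}", filter_by_year(keys, 2011));
}
```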
Couldn't deserialize event file.: ErrorImpl { code: Message("invalid type: string \"RickCHodgin\", expected struct Actor")
Seen in 2014 year of data.
If the show_progress_bar feature isn't turned on, we shouldn't compile the progress bar crate or any code related to it.
May be worth testing/compiling this configuration in Travis.
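A sketch of gating the code behind the feature with cfg attributes (the progress bar dependency itself would also be marked optional in Cargo.toml):

```rust
// With the feature off, this body and its crate dependency aren't compiled.
#[cfg(feature = "show_progress_bar")]
fn make_progress_bar(total: u64) {
    // The real code would construct a progress bar from the crate here.
    println!("progress bar over {} items", total);
}

// No-op stand-in when the feature is disabled.
#[cfg(not(feature = "show_progress_bar"))]
fn make_progress_bar(_total: u64) {}

fn main() {
    make_progress_bar(100);
}
```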
S3 limits the number of objects returned in a single call to 1,000. We can get the continuation token from ListObjectsV2Output and supply it to ListObjectsV2Request.
See the Rusoto S3 integration test for an example of doing this.
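A sketch of the pagination loop against Rusoto's 0.3x-era sync() API:

```rust
use rusoto_core::Region;
use rusoto_s3::{ListObjectsV2Request, S3, S3Client};

// Page through all keys in a bucket by feeding each response's
// next_continuation_token back into the next request.
fn list_all_keys(bucket: &str) -> Result<Vec<String>, String> {
    let client = S3Client::new(Region::UsEast1);
    let mut keys = Vec::new();
    let mut continuation_token: Option<String> = None;

    loop {
        let req = ListObjectsV2Request {
            bucket: bucket.to_string(),
            continuation_token: continuation_token.clone(),
            ..Default::default()
        };
        let output = client
            .list_objects_v2(req)
            .sync()
            .map_err(|e| e.to_string())?;
        if let Some(objects) = output.contents {
            keys.extend(objects.into_iter().filter_map(|o| o.key));
        }
        match output.next_continuation_token {
            Some(token) => continuation_token = Some(token),
            None => break, // no more pages
        }
    }
    Ok(keys)
}
```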
Lots of big functions that do a lot of things. Refactor and maybe split things into different files.