matthewkmayer / rusty-von-humboldt
Exploring GitHub Archive with Rust
License: MIT License
Much of the runtime is spent waiting for files to download from S3, even inside AWS. S3 Select will let us take the files on S3 and select only the fields we want for commits and repository ID mappings.
Having this functionality should vastly reduce runtime.
A "Couldn't GET object" error currently panics and brings the whole house down when it happens.
Retrying is a better way.
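A minimal sketch of what retrying could look like, with a hypothetical download_object standing in for the real Rusoto GetObject call:

```rust
use std::{thread, time::Duration};

// Hypothetical wrapper around the Rusoto GetObject call; the real code
// would return the object body or a Rusoto error.
fn download_object(key: &str) -> Result<Vec<u8>, String> {
    Err(format!("Couldn't GET object {}", key)) // placeholder
}

// Retry the download a few times with a growing delay instead of panicking.
fn download_with_retries(key: &str, max_attempts: u32) -> Result<Vec<u8>, String> {
    let mut last_err = String::new();
    for attempt in 1..=max_attempts {
        match download_object(key) {
            Ok(bytes) => return Ok(bytes),
            Err(e) => {
                last_err = e;
                // Simple linear backoff: 1s, 2s, 3s, ...
                thread::sleep(Duration::from_secs(u64::from(attempt)));
            }
        }
    }
    Err(last_err)
}

fn main() {
    match download_with_retries("2016-01-01-0.json.gz", 3) {
        Ok(bytes) => println!("got {} bytes", bytes.len()),
        Err(e) => eprintln!("gave up after retries: {}", e),
    }
}
```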
Per https://aws.amazon.com/about-aws/whats-new/2018/07/amazon-s3-announces-increased-request-rate-performance/ we don't need to randomize keys to get better read throughput.
Right now we do sequential reads of the files which makes hotspots in the S3 service. Best practices for S3 performance says we should distribute requests across the object hashes equally.
We can do this by randomizing the list of files we get to spread the load.
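A sketch of that randomization using the rand crate (assuming it, or something like it, is available):

```rust
use rand::seq::SliceRandom;
use rand::thread_rng;

fn main() {
    let mut keys = vec![
        "2016-01-01-0.json.gz".to_string(),
        "2016-01-01-1.json.gz".to_string(),
        "2016-01-01-2.json.gz".to_string(),
    ];
    // Shuffle so workers don't fetch lexicographically adjacent keys in
    // order, which concentrates load on one S3 partition.
    keys.shuffle(&mut thread_rng());
    println!("{:?}", keys);
}
```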
Crossbeam channels are more performant than the standard library's mpsc channels.
https://docs.rs/crossbeam/0.7.2/crossbeam/channel/index.html
It'd be rad if we had an idea of how many files have been processed. https://github.com/matthewkmayer/release-party-BR uses a progress indicator to show that.
Recently released!
Add things like "number of files/hours to process" and "year to start." This will save us from recompiling every time we change those settings.
Like #94.
Time for the rubber to hit the road.
Output SQL containing similar things to the repo ID to name mapping.
If we put it all in one big table, upserts would be DO NOTHING on conflict.
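A sketch of generating such an upsert; the repo_mapping table and column names here are illustrative, not the actual schema:

```rust
// Build an upsert statement for the repo id to name mapping.
// ON CONFLICT DO NOTHING makes repeated inserts of the same repo harmless.
fn upsert_sql(repo_id: i64, repo_name: &str) -> String {
    format!(
        "INSERT INTO repo_mapping (repo_id, repo_name) VALUES ({}, '{}') \
         ON CONFLICT (repo_id) DO NOTHING;",
        repo_id,
        repo_name.replace('\'', "''") // naive quote-escaping for the sketch
    )
}

fn main() {
    println!("{}", upsert_sql(1234, "matthewkmayer/rusty-von-humboldt"));
}
```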
Can we look at repo mappings and committer counts in a single run? That way we don't have to run RvH twice.
We'd probably have to be careful with memory usage.
Instead of wrapping execution in the time command, RvH should time itself and output how long it took. Perhaps include what year and how many hours were processed.
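A minimal sketch with std::time::Instant; the year/hours values are placeholders for whatever the real config holds:

```rust
use std::time::Instant;

fn main() {
    let start = Instant::now();

    // ... download and process event files here ...

    // Placeholder values; the real run would report its actual settings.
    let (year, hours) = (2016, 8760);
    println!(
        "Processed {} hours of {} in {}s",
        hours,
        year,
        start.elapsed().as_secs()
    );
}
```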
Use i64 instead of having another field in the event struct (see serde-rs/json#317). #14
Instead of reading in huge chunks and processing them as big chunks, let's reduce memory usage by taking a pipeline approach.
Rough idea:
Make four channels. Each channel gets a thread that handles one stage of the pipeline, ending in the event type we need (CommitEvent or RepoIdToName). Advantages: memory usage stays low, since we never hold a whole chunk of events at once. A rough sketch of the shape is below.
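A simplified two-stage sketch of that shape using crossbeam channels (the real version would have four channels and use serde_json for parsing; CommitEvent here is a stand-in struct):

```rust
use crossbeam_channel::bounded;
use std::thread;

// Illustrative event type standing in for CommitEvent / RepoIdToName.
struct CommitEvent {
    repo_id: i64,
}

fn main() {
    // Downloader -> parser: raw JSON lines from downloaded files.
    let (raw_tx, raw_rx) = bounded::<String>(1_000);
    // Parser -> writer: parsed events ready to be written out.
    let (event_tx, event_rx) = bounded::<CommitEvent>(1_000);

    // Downloader: in the real program this streams files from S3.
    let downloader = thread::spawn(move || {
        for i in 0..100 {
            raw_tx.send(format!("{{\"repo_id\": {}}}", i)).unwrap();
        }
        // raw_tx drops here, which ends the parser's loop.
    });

    // Parser: turn raw lines into events; serde_json would do this for real.
    let parser = thread::spawn(move || {
        for _line in raw_rx {
            event_tx.send(CommitEvent { repo_id: 42 }).unwrap();
        }
    });

    // Writer: consume events as they arrive instead of batching everything.
    for event in event_rx {
        let _ = event.repo_id; // emit SQL here in the real program
    }

    downloader.join().unwrap();
    parser.join().unwrap();
}
```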
Right now we have to recompile to switch between committer counts and repo id mapping. Let's use env vars instead.
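A sketch of reading the mode from an env var; the variable name MODE is an assumption, not the repo's actual config:

```rust
use std::env;

// Pick the run mode from an environment variable instead of recompiling.
fn main() {
    let mode = env::var("MODE").unwrap_or_else(|_| "committer_count".to_string());
    match mode.as_str() {
        "committer_count" => println!("counting committers"),
        "repo_mapping" => println!("mapping repo ids to names"),
        other => eprintln!("unknown MODE: {}", other),
    }
}
```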
Right now our SQL output is pretty naive. We could do things to improve insert performance, such as multi-row inserts (insert into repo_mapping (col_foo, col_bar) values (a, b), (a, c)). Reference: https://www.depesz.com/2007/07/05/how-to-insert-data-to-database-as-fast-as-possible/ .
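A sketch of batching rows into a single multi-row insert, with illustrative table and column names:

```rust
// Batch many rows into one INSERT statement instead of one statement per row.
fn multi_row_insert(rows: &[(i64, &str)]) -> String {
    let values: Vec<String> = rows
        .iter()
        .map(|(id, name)| format!("({}, '{}')", id, name.replace('\'', "''")))
        .collect();
    format!(
        "INSERT INTO repo_mapping (repo_id, repo_name) VALUES {};",
        values.join(", ")
    )
}

fn main() {
    let rows = [(1, "foo/bar"), (2, "baz/quux")];
    println!("{}", multi_row_insert(&rows));
}
```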
Repo id to repo name mappings have lots of duplication in them. Deduplicate them as much as we can before making the end database deal with it.
Less unwrap and expect, more use of the failure crate.
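A sketch of the failure-based style: return Result and let errors bubble up with ? instead of unwrapping:

```rust
use failure::{format_err, Error};

// Return a Result instead of calling unwrap/expect, so callers can
// retry or report the problem instead of crashing the whole run.
fn parse_repo_id(raw: &str) -> Result<i64, Error> {
    raw.trim()
        .parse::<i64>()
        .map_err(|e| format_err!("bad repo id {:?}: {}", raw, e))
}

fn main() -> Result<(), Error> {
    let id = parse_repo_id("12345")?;
    println!("repo id: {}", id);
    Ok(())
}
```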
The download threads download files from S3 and parse them into events, then send those to a bounded crossbeam channel.
It'd be nice to see if we ever fill that channel: that would tell us whether the bottleneck is downloading/parsing or handling the events. With this info we can decide how to run RvH: on something with more bandwidth or with more compute power.
See is_full or len in https://github.com/crossbeam-rs/crossbeam/blob/master/crossbeam-channel/src/channel.rs .
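A small illustration of those methods on a bounded channel:

```rust
use crossbeam_channel::bounded;

fn main() {
    let (tx, rx) = bounded::<u32>(4);
    for i in 0..4 {
        tx.send(i).unwrap();
    }
    // len() and is_full() tell us whether consumers are keeping up:
    // a consistently full channel means event handling is the bottleneck,
    // not downloading/parsing.
    println!("queued: {} / full: {}", rx.len(), rx.is_full());
}
```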
Pre-2015 GitHub events look different. We should handle those too.
Right now our request to S3 is limited to a single year because we specify the key must start with the year.
Support not providing the year argument and processing everything.
Keep the year arg optional.
We don't need to make this an amazing, impossible to break hash, just something harder than "look at the database contents" to find the actor name.
SHA1 without a salt may be sufficient for this.
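A sketch assuming the sha1 crate's 0.6-era API; unsalted, per the issue:

```rust
// Hash the actor name so raw names aren't sitting in the database.
// Unsalted SHA-1: not cryptographically strong, just harder than
// reading the table contents directly.
fn hash_actor_name(name: &str) -> String {
    sha1::Sha1::from(name.as_bytes()).digest().to_string()
}

fn main() {
    println!("{}", hash_actor_name("octocat"));
}
```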
Right now we store everything in a vector and sort and deduplicate it, a lot.
How about using a BTreeMap and its entry API instead? Automatic deduplication.
Use CommitEvent as the key; the value could be the number of commits, which would even let us track how many times someone committed, or just a sentinel value like a boolean we don't really use.
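A sketch of the entry-API approach, using an (actor, repo id) tuple as a stand-in for CommitEvent (which would need to implement Ord to be a key):

```rust
use std::collections::BTreeMap;

fn main() {
    // Keys deduplicate automatically; the value tracks the commit count.
    let mut commits: BTreeMap<(String, i64), u64> = BTreeMap::new();

    for &(actor, repo_id) in [("alice", 1i64), ("bob", 1), ("alice", 1)].iter() {
        *commits.entry((actor.to_string(), repo_id)).or_insert(0) += 1;
    }

    for ((actor, repo_id), count) in &commits {
        println!("{} made {} commit(s) to repo {}", actor, count, repo_id);
    }
}
```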
It's Not Great™️ to find out we can't upload an hour or two of number crunching right at the end.
Try uploading a test file to the destination bucket before we start working on the analysis.
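A sketch of a canary upload using Rusoto's 0.3x-era blocking sync() API; the bucket and key names are illustrative:

```rust
use rusoto_core::Region;
use rusoto_s3::{PutObjectRequest, S3, S3Client};

// Upload a tiny canary object before doing hours of work, so missing
// permissions surface immediately instead of at the end.
fn check_upload_permissions(bucket: &str) -> Result<(), String> {
    let client = S3Client::new(Region::UsEast1);
    let req = PutObjectRequest {
        bucket: bucket.to_string(),
        key: "rvh-upload-canary".to_string(),
        body: Some(b"canary".to_vec().into()),
        ..Default::default()
    };
    client
        .put_object(req)
        .sync()
        .map(|_| ())
        .map_err(|e| format!("can't write to {}: {}", bucket, e))
}

fn main() {
    if let Err(e) = check_upload_permissions("my-results-bucket") {
        eprintln!("aborting before analysis: {}", e);
    }
}
```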
GHAs aren't really used right now, but this is a small enough repo to experiment with them.
We can publish this as a Docker container and run via ECS or Fargate.
This experiment has achieved its goal and I'm not planning on working on it any more.
Still got some println macros hanging around, some commented out. Use log and env_logger to do debug- and info-level logging as needed.
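A minimal sketch of the log + env_logger setup; the log level comes from the RUST_LOG env var:

```rust
use log::{debug, info};

fn main() {
    // Run with RUST_LOG=debug to see both lines; env_logger reads the
    // level filter from the environment.
    env_logger::init();

    info!("starting run");
    debug!("this only shows up at debug level");
}
```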
Can we Dockerize this in https://github.com/frol/docker-alpine-rust ?
If nothing else, can we compile a Linux binary on an OSX machine without dealing with the openssl cross-compilation nightmare?
Some bugs around AWS closing connections have been fixed in the latest version of Rusoto. This allows us to remove some wait and retry blocks we have.
We pull down a lot of data we throw away. S3 Select will let us greatly reduce the amount of data we transfer from S3 to the instance running Rusty von Humboldt.
This should be supported in Rusoto 0.35.0: https://rusoto.github.io/rusoto/rusoto_s3/struct.SelectObjectContentRequest.html .
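A sketch of building that request against the Rusoto API linked above; the SQL expression and field paths are illustrative, and GH Archive files are gzipped JSON lines, hence the serialization settings:

```rust
use rusoto_s3::{
    InputSerialization, JSONInput, JSONOutput, OutputSerialization,
    SelectObjectContentRequest,
};

// Ask S3 to return only the fields we care about, instead of whole files.
fn select_request(bucket: &str, key: &str) -> SelectObjectContentRequest {
    SelectObjectContentRequest {
        bucket: bucket.to_string(),
        key: key.to_string(),
        expression: "SELECT s.repo.id, s.repo.name FROM S3Object s".to_string(),
        expression_type: "SQL".to_string(),
        input_serialization: InputSerialization {
            json: Some(JSONInput { type_: Some("LINES".to_string()) }),
            compression_type: Some("GZIP".to_string()),
            ..Default::default()
        },
        output_serialization: OutputSerialization {
            json: Some(JSONOutput::default()),
            ..Default::default()
        },
        ..Default::default()
    }
}
```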
Rusoto 0.30.0 has the sweet, sweet streaming Read trait!
When processing exactly one year's worth of hours for 2011, we sometimes process items from 2012, since there isn't a full year of data in 2011.
Let's not do that: if we're processing 2011, only process 2011.
Filter on the S3 file list step?
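A sketch of that filter; GH Archive keys look like 2011-01-01-0.json.gz, so a prefix check on the year works:

```rust
// Keep only keys for the requested year.
fn filter_by_year(keys: Vec<String>, year: i32) -> Vec<String> {
    let prefix = format!("{}-", year);
    keys.into_iter().filter(|k| k.starts_with(&prefix)).collect()
}

fn main() {
    let keys = vec![
        "2011-12-31-23.json.gz".to_string(),
        "2012-01-01-0.json.gz".to_string(),
    ];
    // Only the 2011 key survives.
    println!("{:?}", filter_by_year(keys, 2011));
}
```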
Couldn't deserialize event file.: ErrorImpl { code: Message("invalid type: string \"RickCHodgin\", expected struct Actor")
Seen in 2014 year of data.
If the show_progress_bar feature isn't turned on, we shouldn't compile the progress bar crate or any code related to it.
May be worth testing/compiling this configuration in Travis.
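A sketch of gating the code behind the feature with cfg attributes (the progress bar dependency itself would also be marked optional in Cargo.toml):

```rust
// With the feature off, this body and its crate dependency aren't compiled.
#[cfg(feature = "show_progress_bar")]
fn make_progress_bar(total: u64) {
    // The real code would construct a progress bar from the crate here.
    println!("progress bar over {} items", total);
}

// No-op stand-in when the feature is disabled.
#[cfg(not(feature = "show_progress_bar"))]
fn make_progress_bar(_total: u64) {}

fn main() {
    make_progress_bar(100);
}
```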
S3 limits the number of objects returned in a single call to 1,000. We can get the continuation token from ListObjectsV2Output and supply it to ListObjectsV2Request.
See the Rusoto S3 integration test for an example of doing this.
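A sketch of the pagination loop against Rusoto's 0.3x-era sync() API:

```rust
use rusoto_core::Region;
use rusoto_s3::{ListObjectsV2Request, S3, S3Client};

// Page through all keys in a bucket by feeding each response's
// next_continuation_token back into the next request.
fn list_all_keys(bucket: &str) -> Result<Vec<String>, String> {
    let client = S3Client::new(Region::UsEast1);
    let mut keys = Vec::new();
    let mut continuation_token: Option<String> = None;

    loop {
        let req = ListObjectsV2Request {
            bucket: bucket.to_string(),
            continuation_token: continuation_token.clone(),
            ..Default::default()
        };
        let output = client
            .list_objects_v2(req)
            .sync()
            .map_err(|e| e.to_string())?;
        if let Some(objects) = output.contents {
            keys.extend(objects.into_iter().filter_map(|o| o.key));
        }
        match output.next_continuation_token {
            Some(token) => continuation_token = Some(token),
            None => break, // no more pages
        }
    }
    Ok(keys)
}
```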
Lots of big functions that do a lot of things. Refactor and maybe split things into different files.