onchainification / candlestick_retriever

Retrieve all historical candlestick data from crypto exchange Binance and upload it to Kaggle.

Home Page: https://www.kaggle.com/jorijnsmit/binance-full-history

License: GNU General Public License v3.0

Python 100.00%

candlesticks cryptocurrencies market-data kaggle-dataset klines binance-api defi

candlestick_retriever's Introduction

candlestick_retriever

Retrieve all historical candlestick data from crypto exchange Binance and upload it to Kaggle.

Dependencies

pandas
requests
pyarrow
kaggle

Running

Simply run ./main.py to either download or update every single pair available:

[...]
2020-08-22 17:44:24.178846 959/970 Wrote 83000 new lines to file for DOGE-BTC 
2020-08-22 17:45:13.963455 960/970 Wrote 83000 new lines to file for NULS-ETH 
2020-08-22 17:45:14.573595 961/970 Already up to date with BTCB-BTC
2020-08-22 17:46:06.781870 962/970 Wrote 83000 new lines to file for ATOM-BTC 
2020-08-22 17:46:08.669972 963/970 Already up to date with LSK-BNB
[...]

Once that completes, you should end up with a directory containing one Parquet file per pair: currently 970 files totaling ~12 GB.

candlestick_retriever's Issues

consider getting rid of csvs completely and working with parquets only

This would also make it possible to continue working with the kaggle data instead of needing two separate directories.

dockerise the whole thing and microservice it

most preferably with #13 in place due to storage requirements

storage size can be drastically decreased

use .parquet files
convert to a lower precision dtype, e.g. float32 and int16.
drop some unnecessary columns

This brought a 170 MB CSV file down to 52 MB, i.e. ~30% of its original size. At that ratio, the 50 GB dataset (all pairs) would shrink to roughly 15 GB!
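The last two steps in the list above could look roughly like the sketch below. The function name `shrink` and the dropped column `ignore` are hypothetical placeholders, not names taken from this repo, and `pd.to_numeric(..., downcast="integer")` picks the smallest integer type that fits rather than always `int16`.

```python
import pandas as pd


def shrink(df: pd.DataFrame, drop_columns=("ignore",)) -> pd.DataFrame:
    """Return a copy with unneeded columns dropped and numeric columns downcast."""
    # Drop only the columns that are actually present ("ignore" is a placeholder).
    out = df.drop(columns=[c for c in drop_columns if c in df.columns])
    float_cols = out.select_dtypes(include="float64").columns
    int_cols = out.select_dtypes(include="int64").columns
    # float64 -> float32 halves the storage for price and volume columns.
    out = out.astype({col: "float32" for col in float_cols})
    # Integers are downcast to the smallest signed type that holds every value.
    for col in int_cols:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    return out


# Small made-up frame, only to show the dtypes before and after.
sample = pd.DataFrame(
    {"open": [0.1, 0.2], "number_of_trades": [1000, 2000], "ignore": [0, 0]}
)
print(sample.dtypes)
print(shrink(sample).dtypes)
```

The result can then be written with `shrink(df).to_parquet(...)`. Keep in mind that float32 carries only about 7 significant digits, so it is worth checking that this is enough for the price columns.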

quick data integrity checker needed to make sure no duplicates or missing values are in the uploaded dataset
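A minimal sketch of such a checker, assuming each pair is loaded into a pandas DataFrame with an `open_time` column (the column name appears in the metadata example further down; the function name `check_integrity` is made up here). It only counts duplicate timestamps and empty cells, and does not detect gaps between candles.

```python
import pandas as pd


def check_integrity(df: pd.DataFrame, time_col: str = "open_time") -> dict:
    """Count duplicate timestamps and missing values in one pair's frame."""
    return {
        # Rows whose timestamp already occurred earlier in the frame.
        "duplicate_rows": int(df.duplicated(subset=[time_col]).sum()),
        # Total number of NaN/None cells across all columns.
        "missing_values": int(df.isna().sum().sum()),
    }


# Made-up frame with one repeated timestamp and one empty cell.
sample = pd.DataFrame(
    {"open_time": [1, 2, 2, 3], "open": [1.0, None, 2.0, 3.0]}
)
print(check_integrity(sample))
```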

delete smallest files to fit dataset within kaggle's 1000 files limit
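One way this could be done, sketched with the standard library only. The helper name `files_to_drop` and the `*.parquet` pattern are assumptions; the function only returns the candidates so they can be reviewed before anything is deleted.

```python
from pathlib import Path


def files_to_drop(directory, limit=1000, pattern="*.parquet"):
    """Return the smallest files that exceed the limit, without deleting them."""
    # Sort ascending by size so the smallest files come first.
    files = sorted(Path(directory).glob(pattern), key=lambda p: p.stat().st_size)
    excess = max(len(files) - limit, 0)
    return files[:excess]


# Deleting is left as an explicit second step:
# for path in files_to_drop("data"):
#     path.unlink()
```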

automatically include correct headers in all files, always

This creates better readable csv files and displays them better on the kaggle page.

automate kaggle upload with api

can't find docs

https://stackoverflow.com/questions/55934733/documentation-for-kaggle-api-within-python

Didn't get further than this:

import kaggle
kaggle.api.authenticate()

Right now using CLI: kaggle datasets version -p . -m "<NOTE>"
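Until the Python API is figured out, the CLI call above can at least be automated from the script. The sketch below simply wraps that same command with `subprocess`; the function names are made up, and it assumes the `kaggle` CLI is installed and credentials are already configured.

```python
import subprocess


def build_version_command(folder=".", note="automated update"):
    """Build the same CLI call as above: kaggle datasets version -p <folder> -m <note>."""
    return ["kaggle", "datasets", "version", "-p", str(folder), "-m", note]


def upload_new_version(folder=".", note="automated update"):
    """Run the CLI; raises CalledProcessError if kaggle exits with an error."""
    subprocess.run(build_version_command(folder, note), check=True)
```

Calling `upload_new_version(".", "daily update")` at the end of `main.py` would then push a new dataset version after each run.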

generate descriptions for all files in metadata.json automatically

e.g.:

    "data": [
        {
            "description": null,
            "name": "ADA-BTC.csv",
            "totalBytes": 126269896,
            "columns": [
                {
                    "name": "open_time",
                    "description": null,
                    "type": "Uuid"
                },
                {
                    "name": "open",
                    "description": null,
                    "type": "Uuid"
                },

etc.
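A rough sketch of how the `null` descriptions could be filled in, assuming the metadata has the shape shown in the excerpt above. The description texts and the names `COLUMN_DESCRIPTIONS` and `fill_descriptions` are placeholders, not part of the repo.

```python
import json

# Placeholder wording; extend with one entry per column.
COLUMN_DESCRIPTIONS = {
    "open_time": "Start time of the candlestick",
    "open": "Price at the start of the candlestick",
}


def fill_descriptions(metadata, column_descriptions=COLUMN_DESCRIPTIONS):
    """Fill in missing file and column descriptions in place and return the dict."""
    for entry in metadata.get("data", []):
        if entry.get("description") is None:
            # "ADA-BTC.csv" -> "ADA-BTC"
            pair = entry["name"].rsplit(".", 1)[0]
            entry["description"] = f"Candlestick data for {pair}"
        for column in entry.get("columns", []):
            if column.get("description") is None:
                # Columns without a known description stay None.
                column["description"] = column_descriptions.get(column["name"])
    return metadata


# Trimmed copy of the excerpt above, used only as a demo input.
sample = {
    "data": [
        {
            "description": None,
            "name": "ADA-BTC.csv",
            "totalBytes": 126269896,
            "columns": [
                {"name": "open_time", "description": None, "type": "Uuid"},
                {"name": "open", "description": None, "type": "Uuid"},
            ],
        }
    ]
}
print(json.dumps(fill_descriptions(sample), indent=4))
```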

inactive currency pairs are always redownloaded completely

Currency pairs that do not have candlesticks on the day of running the script have to be redownloaded again, due to this check: https://github.com/gosuto-ai/candlestick_scraper/blob/b18eadce3f42f9f764dd4cd44bf156817b8d7140/main.py#L62

This applies to maybe 30 out of the ~200 BTC pairs.
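The linked check is not reproduced here. One possible way around the problem, purely as a sketch: compare the last stored candle against the most recent candle the exchange actually offers for that pair, rather than against the day the script runs. The function `needs_download` and both of its parameters are hypothetical, and how the latest open time is obtained is left open.

```python
def needs_download(last_stored_open_time, latest_exchange_open_time):
    """True only when the exchange has a candle newer than what is stored."""
    if last_stored_open_time is None:
        # Nothing stored yet, so a full download is needed.
        return True
    # An inactive pair has no newer candle and is therefore skipped.
    return latest_exchange_open_time > last_stored_open_time
```

With this kind of comparison, a pair whose last candle is weeks old would count as up to date as long as the exchange has nothing newer.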

use zstd for improved compression

The pandas to_parquet() method uses snappy compression by default. You can get significantly better compression (20% smaller files or better) and keep good decompression speed by passing compression='zstd' when saving to Parquet.
It's worth noting that zstd supports many compression levels, and I'm not sure whether pandas automatically chooses the highest one. Importing the DataFrame into a DuckDB table and writing the Parquet file from there with zstd compression could result in smaller files.