
ConduitDB

ConduitDB is a horizontally scalable blockchain parser and indexer. It uses ScyllaDB with a tight database schema optimized for write throughput and makes limited use of Redis for managing some of the state between worker processes.

ConduitDB is designed to be a flexible, value-added service that is capable of doing all of the indexing and data storage work entirely on its own (without running a node at all). This can include storing and indexing the full 10TB+ blockchain with:

  • Raw transaction lookups
  • Merkle proof lookups
  • UTXO lookups
  • Historic address-based or pushdata-based lookups from genesis to chain tip. This is mainly to support seed-based restoration of wallets, but could also be used for _unwriter protocol namespace lookups and querying
  • Tip filter notification API gives notifications when pushdatas of 20, 33 or 65 bytes in length are seen, i.e. the pubkey hash (20 bytes) used in P2PKH addresses, or the public key used in P2PK (33 bytes compressed, 65 bytes uncompressed). Other use cases in token and diverse namespace protocols are also likely (a short pushdata sketch follows this list).
  • Output spend notifications API gives notifications when a particular UTXO is spent. This is useful for knowing that a broadcast transaction has indeed propagated to the mempool and subsequently been included in a block, or that it has been affected by a subsequent reorg. The SPV wallet can react accordingly to each of these events and fetch the new merkle proof, updating its local wallet database.
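
For concreteness, here is a minimal sketch of the 20/33/65-byte pushdatas the tip filter matches on: the 20-byte pubkey hash found in a P2PKH locking script and the 33-byte compressed public key found in a P2PK script. This is not ConduitDB's client API, and keying registrations by SHA-256 of the raw pushdata bytes is an assumption made for illustration only.

import hashlib

# Sketch only: the 20/33/65-byte pushdatas the tip filter watches for.
# Assumption: registrations are keyed by SHA-256 of the raw pushdata bytes.

# 33-byte compressed public key (hypothetical value) - the pushdata in a P2PK script.
compressed_pubkey = bytes.fromhex(
    "0279be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798")

# 20-byte pubkey hash, i.e. HASH160 of a public key (hypothetical value) - the
# pushdata in a P2PKH script.
pubkey_hash = bytes.fromhex("751e76e8199196d454941c45d1b3a323f1433bd6")

for pushdata in (compressed_pubkey, pubkey_hash):
    assert len(pushdata) in (20, 33, 65)
    print(len(pushdata), hashlib.sha256(pushdata).hexdigest())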

It is the opinion of the author that these APIs cover all of the basic requirements for an SPV wallet or application.

You can also turn off unnecessary features if these APIs can already be covered by other services such as a local bitcoin node, or if you want to fetch raw transactions from a 3rd party API such as conduitdb.com / whatsonchain.com / TAAL. This flexibility avoids storing or indexing the 10TB+ blockchain twice.

ConduitDB therefore fits into your existing stack to add value in a way where you only pay for what you use. It can either be run in a lightweight mode providing the bare minimum required to support SPV wallets / applications like ElectrumSV, or it can be configured to store and index absolutely everything, or anything in between - perhaps only scanning for a token sub-protocol or bitcom namespace of interest to you and your application.

An instance that indexes and stores everything is running at conduit-db.com and is free for public use but may need to be rate-limited depending on demand.

Licence MIT
Language Python 3.10
Author Hayden Donnelly (AustEcon)

Getting Started

ConduitDB deployment is Docker-based only. ScyllaDB only loses 3% performance in Docker when properly configured.

ConduitDB connects to the p2p Bitcoin network so it doesn't technically require you to run your own full node. However, ideally you will have a bare metal, localhost bitcoind instance. You can get the latest version from here (https://www.bsvblockchain.org/svnode). See the notes below on using a non-localhost node.

Once you have a node to connect to, update the .env.docker.production config file where it says:

NODE_HOST=127.0.0.1
NODE_PORT=8333

Set the other configuration options in .env.docker.production such as:

SCYLLA_DATA_DIR=./scylla/data
CONDUIT_RAW_DATA_HDD=./conduit_raw_data_hdd
CONDUIT_RAW_DATA_SSD=./conduit_raw_data_ssd
CONDUIT_INDEX_DATA_HDD=./conduit_index_data_hdd
CONDUIT_INDEX_DATA_SSD=./conduit_index_data_ssd
REFERENCE_SERVER_DIR=./reference_server

ConduitDB makes deliberate use of slow (HDD) vs fast (SSD/NVMe) storage volumes for the data directories. Raw blocks and long arrays of transaction hashes are written sequentially to HDD to economise on storage costs. SSD/NVMe is used for memory-mapped files and of course ScyllaDB.

By default all directories will be bind mounted (by Docker Compose) into the above locations at the root directory of this cloned repository. If you're running in prune mode (deleting raw block data after parsing), then the defaults will work well for you on NVMe storage. This is the recommended configuration.

Running the production configuration is only supported on Linux. Building the python_base image only needs to be done once (and again every time requirements.txt changes).

docker build -f ./contrib/python_base/Dockerfile . -t python_base

If you really want to test out the production configuration on Windows you could use WSL.

./run_production.sh

To tail the docker container logs:

./tail_production_logs.sh

Development

Python Version

Currently Python 3.10 is required

Build all of the images

docker build -f ./contrib/python_base/Dockerfile . -t python_base
docker-compose -f docker-compose.yml build --parallel --no-cache

Run static analysis checks

./run_static_checks.bat

Or on Unix:

./run_static_checks.sh

Run all functional tests and unittests locally

./run_all_tests_fresh.bat

Or on Unix:

./run_all_tests_fresh.sh

Running ConduitRaw & ConduitIndex

Windows cmd.exe:

git clone https://github.com/conduit-db/conduit.git
cd conduit-db
set PYTHONPATH=.

Now install packages and run ConduitRaw (in one terminal)

py -m pip install -r .\contrib\requirements.txt
py .\conduit_raw\run_conduit_raw.py

And ConduitIndex (in another terminal)

py -m pip install -r .\contrib\requirements.txt
py .\conduit_index\run_conduit_index.py

Unix Bash terminal:

git clone https://github.com/conduit-db/conduit.git
cd conduit-db
export PYTHONPATH=.

Now install packages and run ConduitRaw (in one terminal)

python3 -m pip install -r ./contrib/requirements.txt
python3 ./conduit_raw/conduit_server.py

And ConduitIndex (in another terminal)

python3 -m pip install -r ./contrib/requirements.txt
python3 ./conduit_index/conduit_server.py

Configuration

All configuration is done via the .env files in the top level directory.

  • .env is for bare metal instances of ConduitDB services when iterating in active development
  • .env.docker.development is for the docker-compose.yml which is used for automated testing and the CI/CD pipeline
  • .env.docker.production is for the docker-compose.production.yml which is used for production deployments.

Notes on using a non-localhost node

Indexing from a non-localhost node is an experimental feature at present, but I hope to improve upon this at a later date.

If you still want to go ahead with connecting to a remote node, ideally your IP address should be whitelisted on that node and it should be a low-latency connection (i.e. ideally in the same data center or at least in the same geographical region). Remote nodes that have not whitelisted you will likely throttle your initial block download.

Even with a local bitcoin node, if its main raw block storage is on a magnetic hard drive, it will max out the sequential read capacity of the drive at around 200MB/sec. This becomes the speed limit for everything.
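
A quick back-of-the-envelope calculation illustrates why this matters for initial block download (the figures are order-of-magnitude assumptions):

chain_size_bytes = 10 * 10**12   # ~10TB+ blockchain
hdd_read_rate = 200 * 10**6      # ~200MB/sec sustained sequential read
hours = chain_size_bytes / hdd_read_rate / 3600
print(f"~{hours:.0f} hours just to read the raw blocks once")   # ~14 hours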

Acknowledgements

  • This project makes heavy use of the bitcoinx bitcoin library created by kyuupichan for tracking headers and chain forks in a memory-mapped file
  • The idea for indexing pushdata hashes came from discussions with Roger Taylor, the lead maintainer of ElectrumSV

Issues

Handle mempool table overflow

See max_heap_table_size in the MariaDB my.cnf file. Generally I set this to 10GB.

It should ideally function as a cache where lower fee txs get evicted and the higher fee txs remain...
But it may not need to be that sophisticated for the first iteration.

The number one priority is just to prevent OOM'ing.
A secondary objective would be to match the node's eviction criteria (a rough sketch of such a cache follows below).
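
A minimal sketch (names and structure are assumptions, not ConduitDB code) of the fee-rate-bounded cache described above:

import heapq

class FeeBoundedMempoolCache:
    # Sketch only: keeps at most max_entries txs and evicts the lowest
    # fee-rate entries first. The real criteria would ideally mirror the node's.
    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries
        self.txs: dict[bytes, bytes] = {}            # tx_hash -> rawtx
        self.heap: list[tuple[float, bytes]] = []    # (fee_rate, tx_hash) min-heap

    def add(self, tx_hash: bytes, rawtx: bytes, fee_rate: float) -> None:
        self.txs[tx_hash] = rawtx
        heapq.heappush(self.heap, (fee_rate, tx_hash))
        while len(self.txs) > self.max_entries:
            _, evict_hash = heapq.heappop(self.heap)
            self.txs.pop(evict_hash, None)           # lowest fee rate goes first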

Prune Mode

This would allow ConduitRaw to still function normally while relying on the node to store the raw block data (e.g. a miner we could co-locate with).

It should be rather simple to implement. It would basically just redirect HTTP requests to the node instead and make ConduitRaw prune unwanted raw block data (but only after ConduitIndex has synchronised it). A rough sketch of this pruning rule follows.
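
This sketch assumes raw blocks are stored as flat files named by block hash; the file layout and the "already indexed" check are assumptions, not ConduitDB's actual implementation.

from pathlib import Path

def maybe_prune_raw_block(block_hash_hex: str, raw_blocks_dir: Path,
        indexed_block_hashes: set[str]) -> None:
    # Sketch only: delete the raw block file, but only once ConduitIndex has
    # synchronised that block; the raw data could then be re-fetched from the
    # node over HTTP if it were ever needed again.
    if block_hash_hex not in indexed_block_hashes:
        return
    block_file = raw_blocks_dir / block_hash_hex
    if block_file.exists():
        block_file.unlink()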

Todo list

  • Get mempool txs from the node directly (p2p) not via conduit_raw
  • Batch mempool txs for bulk flushing to MySQL
  • Test conduit_raw on HDD -> For full archive of mainnet I think it makes a lot of sense but should segregate raw block data
    from LMDB and store raw blocks as flat files outside of LMDB. (Result: Performs very well)
  • Add back the post-IBD mode event to conduit_index if the unnecessary mempool lookups show any slowdown for IBD
    (may be unnecessary now with fully in-memory mempool lookup table)
  • Upgrade db tables from height -> block num (with a separate table for block hash lookups)
  • Tune rocksdb for write-heavy loads
  • Script for printing out blockchain metrics such as cumulative blockchain size and total tx count etc., for translating logged timestamp progress into meaningful metrics about what to expect under ever-increasing load.
  • ShortHashing - essential for cutting footprint of index on NVMe storage and in-memory caches / network IO.
  • Return full tx hashes and pushdata hashes in the API
  • Test coverage of internal aiohttp API (json)
  • TSC merkle proof endpoint
  • Run tests in Azure pipeline
  • Reorg handling
  • Db repair on startup needs to do index-only lookups for the deletes so that it scales.
  • Add reorg functional testing
  • Pylint checks passing
  • Mypy checks passing
  • Tx offset data - move to flat files

Top priority post-MVP launch

  • Add db integrity check for conduit_raw (the same as is already done and working well for conduit_index).
  • Swagger documentation page (use Redoc https://github.com/Redocly/redoc)
  • Nginx proxy to add https://

Not required for MVP launch:

  • Simplify p2p network client
  • Dynamically adjust size of main batch size as the moving average block size increases (otherwise python dictionary caches etc. will get overly large if storing metadata for numerous large blocks at a time)
  • Make all configuration via a single top-level .env file

Parallel block download over p2p (in order to eventually ditch the node)

We need a HeaderSV-like "Peer manager" & "Chain tip tracker". It could be named something like: ConduitConnect

Its most important function would be to act as a pre-fetcher for raw blocks in parallel on behalf of ConduitRaw.

ConduitRaw would send it a request for blocks from e.g. height 1000 - 1500.

ConduitConnect would now divide up the work into contiguous sections of blocks.
If there were 10 peers there would be 10 x 50 header chunks, e.g. heights 1000 - 1050, 1050 - 1100 and so on.
There would be 10 open p2p socket connections - one for each node.
There would be a max network buffer of 128MB per connection, therefore a max memory usage of 10 x 128MB = 1280MB for streaming raw blocks.

A block < 128MB in size is considered a SMALL_BLOCK type, and so ConduitConnect would respond to ConduitRaw with this raw block immediately (keeping the small block in memory / network buffers the whole time).

A block > 128MB in size is considered a LARGE_BLOCK type, and so ConduitConnect would stream the raw block to disc first to avoid OOM'ing. The filename would be the block hash in hex and the file would be written to a kind of "staging area" (the file can simply be renamed and moved later to the proper location).

  • If LARGE_BLOCK type then ConduitRaw would receive a different response: it would be given the location of the raw block (already written to disc), and to process it (the merkle tree info, tx offsets etc.) it would read it from disc. Once again, this processing should be streams-based and should not need to load the whole block into memory at once. At the end it would NOT re-write the block to disc; it would just move / rename the file, which should be instantaneous. (A rough sketch of the chunking and size classification follows below.)
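
A rough sketch of the chunking and size classification described above (all names and the exact chunking rule are assumptions, not settled design):

NETWORK_BUFFER_SIZE = 128 * 1024**2   # 128MB per-peer network buffer

def allocate_chunks(start_height: int, stop_height: int, peer_count: int) -> list[tuple[int, int]]:
    # Split [start_height, stop_height) into contiguous chunks, one per peer,
    # e.g. heights 1000-1500 over 10 peers -> 10 chunks of 50 headers each.
    total = stop_height - start_height
    chunk_size = -(-total // peer_count)          # ceiling division
    chunks = []
    for i in range(peer_count):
        lo = start_height + i * chunk_size
        hi = min(lo + chunk_size, stop_height)
        if lo < hi:
            chunks.append((lo, hi))
    return chunks

def classify_block(block_size: int) -> str:
    # SMALL_BLOCKs fit in the network buffer and are relayed from memory;
    # LARGE_BLOCKs are streamed to disc first to avoid OOM'ing.
    return "SMALL_BLOCK" if block_size < NETWORK_BUFFER_SIZE else "LARGE_BLOCK"

assert allocate_chunks(1000, 1500, 10)[0] == (1000, 1050)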

Full hashes for pushdata search

To continue returning full hashes in the response there are a few tasks that need to be completed:

  • Change tx table from height -> block num
  • Use the block num + tx position in the block for either the output or input tx_hashX to fetch the full tx_hash from LMDB (using the merkle tree db with an array of ordered tx_hashes for each block).
  • This endpoint could be provided either via the socket server or via an aiohttp server. I think I lean towards socket server for now to keep everything under "one roof" - however, the main bottleneck will be random seeks on HDD for the tx hashes.
  • The pushdata hash is already given in the request so can be matched up again - no need to re-parse the rawtx for it.
  • Internally ConduitRaw would keep a small LRU cache for hashX -> full hash, given the likelihood of recent batches overlapping on the same txs (a sketch follows below).
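
A minimal sketch of that LRU cache; the fetch callable stands in for the LMDB merkle-tree lookup and all names are assumptions:

from collections import OrderedDict
from typing import Callable

class HashXToFullHashCache:
    # Sketch only: small LRU cache mapping a truncated hashX to the full
    # 32-byte tx hash, refilled via a lookup keyed by block num + tx position.
    def __init__(self, fetch: Callable[[int, int], bytes], max_entries: int = 10_000) -> None:
        self.fetch = fetch                            # e.g. LMDB lookup
        self.max_entries = max_entries
        self.cache: OrderedDict[bytes, bytes] = OrderedDict()

    def get(self, hashX: bytes, block_num: int, tx_position: int) -> bytes:
        full_hash = self.cache.get(hashX)
        if full_hash is not None:
            self.cache.move_to_end(hashX)             # mark as recently used
            return full_hash
        full_hash = self.fetch(block_num, tx_position)
        self.cache[hashX] = full_hash
        if len(self.cache) > self.max_entries:
            self.cache.popitem(last=False)            # evict least recently used
        return full_hash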
