
autoretrieve's Introduction

Autoretrieve

Autoretrieve is a standalone Graphsync-to-Bitswap proxy server, which allows IPFS clients to retrieve data which may be available on the Filecoin network but not on IPFS (such as data that has become unpinned, or data that was uploaded to a Filecoin storage provider but was never sent to an IPFS provider).

What problem does Autoretrieve solve?

Protocol Labs develops two decentralized data transfer protocols - Bitswap and GraphSync, which back the IPFS and Filecoin networks, respectively. Although built on similar principles, these networks are fundamentally incompatible and separate - a client of one network cannot retrieve data from a provider of the other. [TODO: link more info about differences.] This raises the following issue: what if there is data that exists on Filecoin, but not on IPFS?

Autoretrieve is a "translational proxy" that allows data to be transferred from Filecoin to IPFS in an automated fashion. The existing alternatives for Autoretrieve's Filecoin-to-IPFS flow are the Boost IPFS node that providers may optionally enable, and manual transfer.

Boost IPFS node is not always a feasible option for several reasons:

  • Providers are not incentivized to enable this feature
  • Only free retrievals are supported

In comparison, Autoretrieve:

  • Is a dedicated node, and operational burden/cost does not fall on storage provider operators
  • Supports paid retrievals (the Autoretrieve node operator covers the payment)

How does Autoretrieve work?

Autoretrieve is, at its core, a Bitswap server. When a Bitswap request comes in, Autoretrieve queries an indexer for Filecoin storage providers that have the requested CID. The candidate providers are sorted, and retrieval is attempted sequentially until one can be opened successfully. As the GraphSync data lands in Autoretrieve from the storage provider, it is streamed live back to the IPFS client.
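
In rough Go pseudocode, the flow looks like the sketch below. Every identifier here is illustrative only, not autoretrieve's actual API:

// Hypothetical sketch of the Bitswap-to-GraphSync proxy flow described above.
func handleBitswapRequest(ctx context.Context, c cid.Cid) error {
	// 1. Ask the indexer which Filecoin storage providers hold this CID.
	candidates, err := indexer.FindProviders(ctx, c)
	if err != nil {
		return err
	}

	// 2. Sort the candidates (e.g. by expected reliability or speed).
	sortCandidates(candidates)

	// 3. Attempt retrieval from each provider in turn until one opens.
	for _, provider := range candidates {
		blks, err := openGraphsyncRetrieval(ctx, provider, c)
		if err != nil {
			continue // try the next provider
		}
		// 4. Stream blocks back to the Bitswap client as they arrive,
		// rather than waiting for the whole transfer to finish.
		for blk := range blks {
			bitswapServer.Send(ctx, blk)
		}
		return nil
	}
	return fmt.Errorf("no provider could serve %s", c)
}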

In order for IPFS clients to be able to retrieve Filecoin data using Autoretrieve, they must be connected to Autoretrieve. Currently, Autoretrieve can be advertised to the indexer (and by extension the DHT) by Estuary. Autoretrieve does not currently have an independent way to advertise its own data.

If an Autoretrieve node is not advertised, clients may still download data from it if a connection is established either manually, or by chance while walking through the DHT searching for other providers.

Usage

Autoretrieve uses Docker with BuildKit for build caching. Docker rebuilds are quite fast, which makes it usable for local development. Check the docker-compose documentation for more help.

$ DOCKER_BUILDKIT=1 docker-compose up

You may optionally set FULLNODE_API_INFO to a custom fullnode's WebSocket address. The default is FULLNODE_API_INFO=wss://api.chain.love.

By default, config files and cache are stored at ~/.autoretrieve. When using docker-compose, a binding is created to this directory. This location can be configured by setting AUTORETRIEVE_DATA_DIR.

Internally, the Docker volume's path on the image is /root/.autoretrieve. Keep this in mind when using the Docker image directly.
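
For example, to set both variables explicitly before starting (the FULLNODE_API_INFO value shown is the documented default; the data dir path is just an example):

$ export FULLNODE_API_INFO=wss://api.chain.love
$ export AUTORETRIEVE_DATA_DIR=/mnt/autoretrieve
$ DOCKER_BUILDKIT=1 docker-compose up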

Configuration

Some CLI flags and corresponding environment variables are available for basic configuration.

For more advanced configuration, config.yaml may be used. It lives in the autoretrieve data directory, and will be automatically generated by running autoretrieve. It may also be manually generated using the gen-config subcommand.

Configurations are applied in the following order, from least to most important:

  • YAML config
  • Environment variables
  • CLI flags
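
For example, if all three sources set the same option, the CLI flag wins. The flag and environment variable names below are assumptions that mirror the YAML key; check autoretrieve --help for the real names:

# config.yaml contains: max-bitswap-workers: 1
$ MAX_BITSWAP_WORKERS=2 autoretrieve --max-bitswap-workers=4
# effective value: 4 (CLI flag beats env var, which beats YAML)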

YAML Example

advertise-endpoint-url: # leave blank to disable, example https://api.estuary.tech/autoretrieve/heartbeat (must be registered)
advertise-endpoint-token: # leave blank to disable
lookup-endpoint-type: indexer # indexer | estuary
lookup-endpoint-url: https://cid.contact # for estuary endpoint-type: https://api.estuary.tech/retrieval-candidates
max-bitswap-workers: 1
routing-table-type: dht
prune-threshold: 1GiB # 1000000000, 1 GB, etc. Uses go-humanize for parsing. Table of valid byte sizes can be found here: https://github.com/dustin/go-humanize/blob/v1.0.0/bytes.go#L34-L62
pin-duration: 1h # 1h30m, etc.
log-resource-manager: false
log-retrieval-stats: false
disable-retrieval: false
cid-blacklist:
  - QmCID01234
  - QmCID56789
  - QmCIDabcde
miner-blacklist:
  - f01234
  - f05678
miner-whitelist:
  - f01234
default-miner-config:
  retrieval-timeout: 1m
  max-concurrent-retrievals: 1
miner-configs:
  f01234:
    retrieval-timeout: 2m30s
    max-concurrent-retrievals: 2
  f05678:
    max-concurrent-retrievals: 10
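
Byte sizes like prune-threshold are parsed with go-humanize, linked in the comment above. A minimal standalone example of what that parser accepts:

package main

import (
	"fmt"

	"github.com/dustin/go-humanize"
)

func main() {
	for _, s := range []string{"1GiB", "1000000000", "1 GB"} {
		n, err := humanize.ParseBytes(s)
		if err != nil {
			fmt.Println(s, "->", err)
			continue
		}
		fmt.Println(s, "->", n, "bytes") // 1GiB -> 1073741824 bytes
	}
}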

autoretrieve's People

Contributors

elijaharita, galargh, github-actions[bot], gmelodie, hannahhoward, kylehuntsman, masih, rvagg, willscott


autoretrieve's Issues

Separate Bedrock deployment configuration into its own repository

The Bedrock deployment configuration doesn't make sense for cross-team use cases, and should therefore be pulled out into a Bedrock-specific repository for managing the Bedrock development instance.

  • Autoretrieve container building to ECR on push to master will stay in the autoretrieve repository. For now we'll leave the ECR in the STI account. We might want to move this later to something ARG owns?
  • Kubernetes deployment manifests and CD will be moved to a new Bedrock-specific repository. The CD pipeline will then just make PRs to that repo's cd/dev branch and deploy to the same sti-dev cluster it's being deployed to now.

Add additional metrics

Goals

Autoretrieve can produce useful statistics about the performance of retrieval on the network

Requested Metrics

  1. Deal Success Rate
  2. Deal Acceptance Rate
  3. Deal Volume
  4. Time to First Byte (TTFB)
  5. Average transfer speed
  6. Number of unique SPs
  7. Number of unique IPFS clients
  8. Total data volume transferred
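
Assuming these end up exported as Prometheus metrics (grafana dashboards are mentioned elsewhere in this tracker), a minimal client_golang sketch covering two of the requested metrics (the metric names are invented for illustration):

import "github.com/prometheus/client_golang/prometheus"

var (
	dealSuccesses = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "autoretrieve_deal_successes_total",
		Help: "Retrieval deals that completed successfully.",
	})
	timeToFirstByte = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "autoretrieve_ttfb_seconds",
		Help:    "Time from retrieval start to the first byte received.",
		Buckets: prometheus.DefBuckets,
	})
)

func init() {
	prometheus.MustRegister(dealSuccesses, timeToFirstByte)
}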

Integration Test

An integration test of the flow of autoretrieve:

  • A Bitswap node makes a request to autoretrieve
  • The autoretrieve Bitswap provider converts it to a Filecoin retrieval
  • Autoretrieve pings an indexer node to find the content
  • Autoretrieve fetches the data via Filecoin retrieval from a miner

The trick here will be properly setting up all the players, ideally using simple versions or mocked values for the services.

  • can use itests for Lotus miner
  • probably can manually construct libp2p + bitswap node and just pair it
  • not sure how to get the indexer but it's just an http endpoint so we could mock it
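
A rough Go test skeleton for this flow, with everything nontrivial mocked or left as a comment (all scaffolding here is hypothetical):

func TestBitswapToFilecoinRetrieval(t *testing.T) {
	// Mock indexer: it's just an HTTP endpoint, so an httptest server
	// returning a canned provider record is enough. The JSON shape
	// below is illustrative only.
	indexer := httptest.NewServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprint(w, `{"providers": [{"peer": "...", "addrs": ["..."]}]}`)
		}))
	defer indexer.Close()

	// 1. Start a Lotus miner itest node serving a known payload.
	// 2. Start autoretrieve pointed at the mock indexer.
	// 3. Manually construct a libp2p host + bitswap client and pair it
	//    with autoretrieve.
	// 4. Request the payload's root CID over bitswap and assert that
	//    all blocks arrive.
}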

Add ops metrics

Add some system operational metrics to monitor the status of the Autoretrieve service, which can help identify scaling issues and help with capacity planning:

  • Uptime graph (how often is the service restarting)
  • Storage usage (ensure pruning is working)
  • Bandwidth usage (check if we're maxing out on connections or bandwidth, etc.)
  • Memory and Process Usage (can the machine handle the request load)

cc @hannahhoward @kylehuntsman

Add automatic datastore pruning

In order to launch autoretrieve at scale, the service must periodically clear out retrieval downloads so that it does not run out of space on the machine.

per-provider rate limit

We should be able to have a knob that limits the number of concurrent retrievals and/or the rate of data we pull from a given provider.
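
The per-miner max-concurrent-retrievals option in the config example above already covers the concurrency half of this; the data-rate half would be a new option. A sketch of what the combined config could look like (the rate key is hypothetical, not implemented):

miner-configs:
  f01234:
    max-concurrent-retrievals: 2
    # hypothetical new knob:
    # max-retrieval-rate: 10MiB/s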

Global retrieval limit

We need a config property for limiting the number of active retrievals for an autoretrieve instance.

Separate CLI from main module

What

The main logic should not live directly inside the file that parses CLI args and reads config. The CLI logic should move to a cmd directory.
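
One conventional Go layout for that split (illustrative):

autoretrieve/
  cmd/
    autoretrieve/
      main.go    # CLI arg parsing and config loading only
  ...            # retrieval, bitswap, and config logic as importable packages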

Add funnel metrics

After discussing with @kylehuntsman, here are some additional metrics to monitor on grafana, measuring the overall autoretrieve funnel (feel free to break each metric out into its own issue as needed):

  • Bitswap requests
  • Cache Hits (i.e. previously retrieved by Autoretrieve, available to serve immediately)
  • Cache Misses (this should result in a Filecoin retrieval request)
  • Indexer Requests (should map 1:1 to a cache miss)
  • Indexer Response Codes (200, 404, errors)
  • Index Misses (either the content is not indexed, or there's an error)
  • Filecoin retrieval attempt (should map 1:1 to a 200 indexer response)
  • Filecoin retrieval stages (connect, query ask, deal accept, transfer, complete, etc.)
  • Filecoin retrieval successes and failures

cc @hannahhoward

Blockstore deadlock

from hannah - "I think you’ve got a deadlock in your code after this commit to autoretrieve — be0a8b3. Put takes the lock and doesn’t release it till the end, but then calls notifyWaitCallback, which attempts to take the lock from within the lock Put has taken. Fix incoming I think, though I’m not the expert here."
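
Reduced to a sketch, the pattern and one possible fix (names follow the quote; the real code differs):

// Before: deadlocks, because Go's sync.Mutex is not reentrant.
func (bs *Blockstore) Put(blk blocks.Block) error {
	bs.lock.Lock()
	defer bs.lock.Unlock() // held until Put returns...

	bs.blocks[blk.Cid()] = blk
	bs.notifyWaitCallback(blk.Cid()) // ...but this tries to Lock() again
	return nil
}

// After: release the lock before invoking the callback.
func (bs *Blockstore) Put(blk blocks.Block) error {
	bs.lock.Lock()
	bs.blocks[blk.Cid()] = blk
	bs.lock.Unlock()

	bs.notifyWaitCallback(blk.Cid())
	return nil
}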

Validate config structure on instance startup

If we're going to let users manually create configs, we might need to start verifying the structure of the config file to ensure we're not accepting bad state as valid. Verification of the config file would need to happen on instance startup, after applying any CLI option overrides. If verification of the config structure fails, we should report an error to the user and stop processing; only start executing the instance if verification succeeds.
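
A minimal sketch of such a startup check, using field names guessed from the YAML example above:

// Validate reports the first invalid config field so startup can abort
// with a clear error instead of running with bad state.
func (cfg *Config) Validate() error {
	if cfg.MaxBitswapWorkers < 1 {
		return fmt.Errorf("max-bitswap-workers must be >= 1, got %d", cfg.MaxBitswapWorkers)
	}
	switch cfg.LookupEndpointType {
	case "indexer", "estuary":
	default:
		return fmt.Errorf("unknown lookup-endpoint-type %q", cfg.LookupEndpointType)
	}
	return nil
}

// In main, after CLI overrides are applied:
//   if err := cfg.Validate(); err != nil { log.Fatal(err) }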

Create default config if no config file is found in given autoretrieve path

Create a default config file at the given AUTORETRIEVE_PATH if autoretrieve is started and no config file is found. Possibly add a warning that a config file could not be found and a default will be generated. I could also see the case for making this an error and guiding the user to create a config file and restart the instance.

Automatically prune local blockstore

Goals

If I run autoretrieve, especially with FullRT on, I'm going to make a lot of requests. It would be great to be able to set a maximum size for the local blockstore cache, and start to prune it (perhaps the least recently requested block first?) once it reaches that size and I need to make room for more data.
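
A sketch of a least-recently-requested eviction pass, assuming a hypothetical blockstore that tracks per-block size and last access time:

// pruneToThreshold evicts the least recently requested blocks until the
// total size drops below threshold. All identifiers are illustrative.
func (bs *Blockstore) pruneToThreshold(threshold uint64) {
	bs.lock.Lock()
	defer bs.lock.Unlock()

	// Oldest access first.
	cids := bs.allCids()
	sort.Slice(cids, func(i, j int) bool {
		return bs.lastAccess[cids[i]].Before(bs.lastAccess[cids[j]])
	})

	for _, c := range cids {
		if bs.totalSize <= threshold {
			break
		}
		bs.totalSize -= bs.sizeOf(c)
		bs.delete(c)
	}
}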

Metrics Review

Once #80 and #18 are done, we need to review #47 & #46 to verify what we have and don't have, and what is correct.

Add command for generating default config file

Add a command for generating a default config file using the given AUTORETRIEVE_PATH. This would allow users to easily create an autoretrieve instance config file, and then configure it, before running the instance.

Look into miner failing most retrievals

There is one miner that autoretrieve is connecting to that is constantly failing retrievals. Let's look into why this miner is doing this, and possibly reach out to gain a better understanding.

TODO: Add miner PeerID

[Placeholder] Index Estuary Content

As an improvement to serving online retrieval requests via autoretrieve (for the bitswap nodes that happen to be peered with the service), this placeholder ticket will track the work to proactively index Estuary content so that all content queries (beyond those from peered bitswap nodes) can be served by autoretrieve.

cc @hannahhoward

Grafana Dashboards Review

Once #85 is done, we need to review the grafana dashboards for correctness, and add additional dashboards as needed.

Metric: Distribution of observed accepted requests by provider

I think the plot we end up wanting is: at a given time, how many providers are we doing one retrieval from, how many two, etc. We’re worried that there are defaults in a couple of places, and we may see that even as we do a lot of transfers, we have a bunch of providers who rate limit us to no more than e.g. 20 simultaneous transfers.

  • Underlying data can be short lived for now
  • Grafana metrics don't need the provider ID, just the distribution of number of requests per provider

Send retrieval DSR metrics to Pando

Send the network level DSR metrics, as measured by autoretrieve, to the Pando service.

  • For each deal, record:
    • success or fail (possibly why?)
    • miner who made deal
  • periodically, package up recent DSR records into an IPLD data structure written to a store, with a CID
  • publish the record to Pando using similar code as used in Dealbot

Pando Spec: https://kencloud.com/pando.html
Dealbot Pando integration: https://github.com/filecoin-project/dealbot/blob/main/controller/publisher/pando_publisher.go
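
The per-deal record might look roughly like this (a sketch; the real IPLD schema would follow the Pando/Dealbot integration linked above):

// One DSR observation per retrieval deal. Field names are illustrative.
type RetrievalRecord struct {
	DealID  string
	Miner   string // e.g. "f01234"
	Success bool
	Error   string // empty on success; failure reason if known
	At      time.Time
}

// Periodically: batch recent records into an IPLD data structure, write
// it to a store to obtain a CID, then publish that CID to Pando the same
// way Dealbot's pando_publisher does.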

When doing multiple retrievals at once, all retrievals fail after the first one succeeds.

In our testing, we observe the following:

  1. When we do a single retrieval with a miner, it finishes
  2. When we do multiple retrievals with the same miner at once, the first retrieval succeeds, but then all the other retrievals fail shortly afterwards. Automatic data transfer restarts do not seem to solve the issue.

Inspecting the logs, we see the GraphSync stream was reset:

2022-01-27T01:52:23.174Z	INFO	graphsync	impl/graphsync.go:468	Graphsync ReceiveError from 12D3KooWNTiHg8eQsTRx8XV7TiJbq3379EgwG6Mo3V3MdwAfThsx: stream reset

and also on autoretrieve:

2022-01-27T01:52:23.104Z	DEBUG	filclient	[email protected]/filclient.go:1256	retrieval event	{"dealID": "134345", "rootCid": "bafybeihpbyg4xizni7voackqxomab6444fhwl2gkk7a2tmuphyqg5ip6mq", "miner": "f08403", "name": "ReceiveDataError", "code": 27, "message": "stream reset", "blocksIndex": 10994, "totalReceived": 2291825565}

Miner logs show the miner attempts several times to re-establish the stream unsuccessfully (this is just reattempting to connect, not yet restarting the data transfer itself):

2022-01-27T01:52:28.179Z	INFO	graphsync	messagequeue/messagequeue.go:219	cant open message sender to peer 12D3KooWNTiHg8eQsTRx8XV7TiJbq3379EgwG6Mo3V3MdwAfThsx: failed to dial 12D3KooWNTiHg8eQsTRx8XV7TiJbq3379EgwG6Mo3V3MdwAfThsx:
2022-01-27T01:52:28.180Z	INFO	graphsync	messagequeue/messagequeue.go:219	cant open message sender to peer 12D3KooWNTiHg8eQsTRx8XV7TiJbq3379EgwG6Mo3V3MdwAfThsx: failed to dial 12D3KooWNTiHg8eQsTRx8XV7TiJbq3379EgwG6Mo3V3MdwAfThsx:
2022-01-27T01:52:28.180Z	INFO	graphsync	messagequeue/messagequeue.go:219	cant open message sender to peer 12D3KooWNTiHg8eQsTRx8XV7TiJbq3379EgwG6Mo3V3MdwAfThsx: failed to dial 12D3KooWNTiHg8eQsTRx8XV7TiJbq3379EgwG6Mo3V3MdwAfThsx:
2022-01-27T01:52:28.180Z	INFO	graphsync	messagequeue/messagequeue.go:219	cant open message sender to peer 12D3KooWNTiHg8eQsTRx8XV7TiJbq3379EgwG6Mo3V3MdwAfThsx: failed to dial 12D3KooWNTiHg8eQsTRx8XV7TiJbq3379EgwG6Mo3V3MdwAfThsx:
2022-01-27T01:52:28.180Z	INFO	graphsync	messagequeue/messagequeue.go:219	cant open message sender to peer 12D3KooWNTiHg8eQsTRx8XV7TiJbq3379EgwG6Mo3V3MdwAfThsx: failed to dial 12D3KooWNTiHg8eQsTRx8XV7TiJbq3379EgwG6Mo3V3MdwAfThsx:

Miner logs show at the same time, multiple other streams between the miner and autoretrieve:

2022-01-27T01:52:29.306Z	INFO	net/identify	identify/id.go:372	failed negotiate identify protocol with peer	{"peer": "12D3KooWNTiHg8eQsTRx8XV7TiJbq3379EgwG6Mo3V3MdwAfThsx", "error": "stream reset"}
2022-01-27T01:52:31.175Z	INFO	net/identify	identify/id.go:372	failed negotiate identify protocol with peer	{"peer": "12D3KooWNTiHg8eQsTRx8XV7TiJbq3379EgwG6Mo3V3MdwAfThsx", "error": "stream reset"}
2022-01-27T01:52:33.785Z	INFO	net/identify	identify/id.go:372	failed negotiate identify protocol with peer	{"peer": "12D3KooWNTiHg8eQsTRx8XV7TiJbq3379EgwG6Mo3V3MdwAfThsx", "error": "stream reset"}
2022-01-27T01:52:35.349Z	INFO	net/identify	identify/id.go:372	failed negotiate identify protocol with peer	{"peer": "12D3KooWNTiHg8eQsTRx8XV7TiJbq3379EgwG6Mo3V3MdwAfThsx", "error": "stream reset"}
2022-01-27T01:52:37.225Z	INFO	net/identify	identify/id.go:372	failed negotiate identify protocol with peer	{"peer": "12D3KooWNTiHg8eQsTRx8XV7TiJbq3379EgwG6Mo3V3MdwAfThsx", "error": "stream reset"}
2022-01-27T01:52:37.935Z	INFO	net/identify	identify/id.go:372	failed negotiate identify protocol with peer	{"peer": "12D3KooWNTiHg8eQsTRx8XV7TiJbq3379EgwG6Mo3V3MdwAfThsx", "error": "stream reset"}
2022-01-27T01:52:39.002Z	INFO	net/identify	identify/id.go:372	failed negotiate identify protocol with peer	{"peer": "12D3KooWNTiHg8eQsTRx8XV7TiJbq3379EgwG6Mo3V3MdwAfThsx", "error": "stream reset"}
2022-01-27T01:52:39.866Z	INFO	net/identify	identify/id.go:372	failed negotiate identify protocol with peer	{"peer": "12D3KooWNTiHg8eQsTRx8XV7TiJbq3379EgwG6Mo3V3MdwAfThsx", "error": "stream reset"}
2022-01-27T01:52:40.507Z	INFO	net/identify	identify/id.go:372	failed negotiate identify protocol with peer	{"peer": "12D3KooWNTiHg8eQsTRx8XV7TiJbq3379EgwG6Mo3V3MdwAfThsx", "error": "stream reset"}
2022-01-27T01:52:41.069Z	INFO	net/identify	identify/id.go:372	failed negotiate identify protocol with peer	{"peer": "12D3KooWNTiHg8eQsTRx8XV7TiJbq3379EgwG6Mo3V3MdwAfThsx", "error": "stream reset"}

Additional notes:

  • Autoretrieve also shows other stream errors that appear to reference the resource manager (autoretrieve is running libp2p v0.18 -- the miner is not). It shows these messages several times, long before the failure:
2022-01-27T01:52:29.236Z	ERROR	autoretrieve	metrics/basic.go:52	Failed to query miner f08403 for Qme5eFv4f1SY7WyuQ3gG6CPDR2cY68JVZjjdweU4Bgi1Mu (root QmezTdYeKyjPFoREStJQQbvATUP8yRJdHMMZx2rZ86p9g9): failed to open stream to peer: peer:12D3KooWBwUERBhJPtZ7hg5N3q1DesvJ67xx9RLdSaStBz9Y6Ny8: resource limit exceeded
  • The connection between the miner and autoretrieve should have Protect called on it, once per transfer, with a different tag for each transfer. When the first transfer finishes, one of these tags is removed, but the other two should still be there.

  • One additional temporary stream is established at the end of the transfer in order for the miner to ack to autoretrieve that it's finished the transfer. The message is sent from the miner successfully, but does not appear to be received on the client. It appears sending this message happens right before the failure.

  • Autoretrieve is running the FullRT DHT client (https://github.com/libp2p/go-libp2p-kad-dht/tree/master/fullrt), which crawls the network to establish a DHT index. As a result, autoretrieve is connected to a LOT of peers. We did not expect to get any bitswap requests on autoretrieve until we started publishing records in the DHT, but we actually get lots, presumably because we're in the swarms of every peer we connect to while building the DHT index. This presents a potential issue for bitswap chatter should FullRT be deployed widely.

Assume default location for autoretrieve path

We discussed assuming a default location of ~/.autoretrieve for the data dir. This can be overridden via an AUTORETRIEVE_PATH environment variable at run time. File I/O will use this path to locate files and directories.
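
Resolving that default could be as simple as the following sketch:

// dataDir returns AUTORETRIEVE_PATH if set, otherwise ~/.autoretrieve.
func dataDir() (string, error) {
	if p := os.Getenv("AUTORETRIEVE_PATH"); p != "" {
		return p, nil
	}
	home, err := os.UserHomeDir()
	if err != nil {
		return "", err
	}
	return filepath.Join(home, ".autoretrieve"), nil
}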

Monitor and adjust autoretrieve real-time metrics post-launch

Once in production, monitor the grafana dashboard to better understand autoretrieve usage patterns, and tune the dashboard as necessary. Other metrics to consider: the number of cache hits vs. Filecoin retrievals, a performance comparison between cached and non-cached responses, etc.
