
bifrost-gateway

A lightweight IPFS Gateway daemon backed by a remote data store.

Maintainers

IPFS Stewards

About

bifrost-gateway provides a single binary daemon implementation of HTTP+Web Gateway Specs.

It is capable of serving path (/ipfs/, /ipns/), subdomain, and DNSLink gateway requests.

Supported response types include both deserialized flat files and verifiable Block/CAR responses.

For more information about IPFS Gateways, see the specifications at https://specs.ipfs.tech/http-gateways/.

Usage

Local build

$ go build
$ ./bifrost-gateway --help

Configuration

See ./bifrost-gateway --help and ./docs/environment-variables.md

Docker

Official Docker images are provided at hub.docker.com/r/ipfs/bifrost-gateway.

  • 🟢 Releases
    • latest and release always point at the latest release
    • vN.N.N point at a specific release tag
  • 🟠 Developer builds
    • main-latest always points at the HEAD of the main branch
    • main-YYYY-MM-DD-GITSHA points at a specific commit from the main branch
  • ⚠️ Experimental, unstable builds
    • staging-latest always points at the HEAD of the staging branch
    • staging-YYYY-MM-DD-GITSHA points at a specific commit from the staging branch
    • This tag is used by developers for internal testing, not intended for end users

When using Docker, make sure to pass necessary config via -e:

$ docker pull ipfs/bifrost-gateway:release
$ docker run --rm -it --net=host -e PROXY_GATEWAY_URL=http://127.0.0.1:8080  ipfs/bifrost-gateway:release

See ./docs/environment-variables.md.

FAQ

How to use another gateway as a block backend

All you need is a trustless gateway endpoint that supports verifiable response types. The minimum requirement is support for GET /ipfs/cid with application/vnd.ipld.raw (block by block).
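
For illustration, this is roughly the flow such a backend enables (a minimal Go sketch under stated assumptions, not bifrost-gateway's actual code): fetch the raw block over HTTP and verify that it hashes to the requested CID.

import (
	"context"
	"fmt"
	"io"
	"net/http"

	"github.com/ipfs/go-cid"
)

// fetchRawBlock asks a trustless gateway for a single block and verifies it.
func fetchRawBlock(ctx context.Context, gatewayURL string, c cid.Cid) ([]byte, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, gatewayURL+"/ipfs/"+c.String(), nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "application/vnd.ipld.raw")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status: %s", resp.Status)
	}

	data, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}

	// Verify: re-hash the payload with the CID's prefix and compare.
	got, err := c.Prefix().Sum(data)
	if err != nil {
		return nil, err
	}
	if !got.Equals(c) {
		return nil, fmt.Errorf("block %s failed verification (got %s)", c, got)
	}
	return data, nil
}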

To run against a compatible, local trustless gateway provided by Kubo or IPFS Desktop:

$ PROXY_GATEWAY_URL="http://127.0.0.1:8080" ./bifrost-gateway

See Proxy Backend in ./docs/environment-variables.md

How to run with Saturn CDN backend

Saturn is a CDN that provides a pool of trustless gateways.

bifrost-gateway supports it via the Caboose backend, which takes care of discovering and evaluating Block/CAR gateways (in Saturn called L1 nodes/peers) for increased availability.

See Saturn Backend in ./docs/environment-variables.md

How to debug

See GOLOG_LOG_LEVEL.

How to use tracing

For tracing configuration, please check boxo/docs/tracing.md for how to generate the traceparent HTTP header, which makes it easy to identify specific requests.

How could this work for hosting a public IPFS gateway

This is a WIP, but the high-level plan is to move from the old Kubo-based architecture to the new Rhea architecture, as illustrated below.

Old Kubo-based architecture:

graph LR
    A(((fa:fa-person HTTP<br>clients)))
    K[[Kubo]]
    N(((BGP Anycast,<br>HTTP load-balancers,<br>TLS termination)))

    D(((DHT)))

    P((( IPFS<br>Peers)))

    A -->| Accept: text/html, *| N
    A -->| Accept: application/vnd.ipld.raw | N
    A -->| Accept: application/vnd.ipld.car | N
    A -->| Accept: application/vnd.ipld.dag-json | N
    A -->| Accept: application/vnd.ipld.dag-cbor | N
    A -->| Accept: application/json | N
    A -->| Accept: application/cbor | N
    A -->| Accept: application/x-tar | N
    A -->| Accept: application/vnd.ipfs.ipns-record | N
    A -->| DNSLink Host: en.wikipedia-on-ipfs.org | N
    A -->| Subdomain Host: cid.ipfs.dweb.link | N

    N ==>| fa:fa-link HTTP GET <br> Content Path | K

    K -.- D
    K ===|fa:fa-cube bitswap | P
    P -.- D

New Rhea architecture:

graph LR
    A(((fa:fa-person HTTP<br>clients)))
    B[[bifrost-gateway]]
    N(((BGP Anycast,<br>HTTP load-balancers,<br>TLS termination)))
    S(((Saturn<br>CDN)))
    I[[IPNI]]
    D(((DHT)))

    P((( IPFS<br>Peers)))

    A -->| Accept: text/html, *| N
    A -->| Accept: application/vnd.ipld.raw | N
    A -->| Accept: application/vnd.ipld.car | N
    A -->| Accept: application/vnd.ipld.dag-json | N
    A -->| Accept: application/vnd.ipld.dag-cbor | N
    A -->| Accept: application/json | N
    A -->| Accept: application/cbor | N
    A -->| Accept: application/x-tar | N
    A -->| Accept: application/vnd.ipfs.ipns-record | N
    A -->| DNSLink Host: en.wikipedia-on-ipfs.org | N
    A -->| Subdomain Host: cid.ipfs.dweb.link | N

    N ==>| fa:fa-link HTTP GET <br> Content Path | B
    
    B ==>|fa:fa-cube HTTP GET <br> Blocks | S
    S -.- I 
    I -.- D 
    D -.- P -.- I
  
    P ===|fa:fa-cube the best block/dag <br> transfer protocol | S

bifrost-gateway nodes are responsible for processing the gateway requests shown in the diagram above (path, subdomain, and DNSLink requests, across all supported response types).

Caveats:

  • IPFS Gateway interface based on reference implementation from boxo/gateway.
  • IPFS Backend based on https://saturn.tech and HTTP client talking to it via caboose with STRN_LOGGER_SECRET.
  • Remaining functional gaps facilitated by:
    • (initially) temporary delegation to legacy Kubo RPC (/api/v0) at https://node[0-3].delegate.ipfs.io infra (legacy nodes used by js-ipfs, in the process of being deprecated).
    • (long-term) IPNS_RECORD_GATEWAY_URL endpoint capable of resolving GET /ipns/{name} with Accept: application/vnd.ipfs.ipns-record

What does the high-level overview look like

Some high level areas:

mindmap
  root[bifrost-gateway]
    (boxo/gateway.IPFSBackend)
        Block Backend
        CAR Backend
    Ephemeral Storage
        Block Cache
        Exchange Backend
            Plain HTTP Fetch
            Caboose Saturn Fetch
    Resolving Content Paths
        Raw
        CAR
        UnixFS
        IPLD Data Model
            [DAG-JSON]
            [DAG-CBOR]
        Web
            HTTP Host Header
            HTML dir listings
            index.html
            _redirects
            HTTP Range Requests
        Namesys
            DNSLink
                EoDoH<br>ENS over DNS over HTTPS
            IPNS Records
    Metrics and Tracing
        Prometheus
            Counters
            Histograms
        OpenTelemetry
            Spans
            Exporters
            Trace Context

Contributing

Contributions are welcome! This repository is part of the IPFS project and therefore governed by our contributing guidelines.

License

SPDX-License-Identifier: Apache-2.0 OR MIT

bifrost-gateway's People

Contributors

aarshkshah1992, ameanasad, aschmahmann, galargh, guanzo, hacdias, laurentsenta, lidel, web-flow, web3-bot, willscott


bifrost-gateway's Issues

Add metric to track partial responses

Clear visibility into how many responses have headers written with a 200 response code but then time out or don't fully respond will be useful as part of our understanding of correctness.

CAR based Gateway implementation

Done Criteria

There is an implementation of gateway.IPFSBackend that can leverage retrievals of CAR files with the relevant data in them.

It should implement the proposed version of the API here, which shouldn't have major changes before the above PR lands.

Implementation stages

Why Important

Implementation Phases

  • (1) Fetch CAR into per-request memory blockstore and serve response (a minimal sketch of this phase follows after this list)
  • (2) Fetch CAR into shared memory blockstore and serve response along with a blockservice that does block requests for missing data
  • (3) Start doing the walk locally and then if a path segment is incomplete send a request for a CAR/blocks and upon every received block try to continue using the blockservice
  • (4) Start doing the walk locally and keep a list of "plausible" blocks, if after issuing a request we get a non-plausible block then report them and attempt to recover by redoing the last segment
  • (5) Don't redo the last segment fully if it's part of a UnixFS file and we can do range requests
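
A minimal sketch of phase (1), assuming go-car/v2 for reading the CAR stream and boxo's in-memory blockstore; this shows the general shape only, not the actual implementation:

import (
	"context"
	"io"

	"github.com/ipfs/boxo/blockstore"
	"github.com/ipfs/go-datastore"
	dssync "github.com/ipfs/go-datastore/sync"
	car "github.com/ipld/go-car/v2"
)

// loadCARIntoMemory reads a CAR stream (e.g. the HTTP response body of a
// ?format=car request) into a per-request in-memory blockstore that the
// gateway handler can then serve the response from.
func loadCARIntoMemory(ctx context.Context, r io.Reader) (blockstore.Blockstore, error) {
	bs := blockstore.NewBlockstore(dssync.MutexWrap(datastore.NewMapDatastore()))

	br, err := car.NewBlockReader(r)
	if err != nil {
		return nil, err
	}
	for {
		blk, err := br.Next()
		if err == io.EOF {
			return bs, nil
		}
		if err != nil {
			return nil, err
		}
		if err := bs.Put(ctx, blk); err != nil {
			return nil, err
		}
	}
}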

Details and Dependencies

ECD: 2023-03-27

  • ipfs/boxo#173 (resolved by ipfs/boxo#176 ). It will now be possible to build an IPFS HTTP Gateway implementation where individual HTTP requests are more closely tied to Go API calls into a configurable backend.

Blockers for mirroring traffic for Rhea

ECD: 2023-03-29

  • Resolve memory issues
  • Add more metrics tracking to the new implementation

The work is happening in #61. See there for more details

Blockers for production traffic for Rhea

ECD: TBD - Date for a date/plan: 2023-03-30

We need to have sufficient testing of the bifrost-gateway code given we aren't able to run Kubo's battery of sharness tests against it (per #58 ).

Options being considered:

  • Enough of testing in #66 that we can be reasonably confident in the new implementation
    • Note: we may want to be cautious in some of our implementation work here to increase the chance that kubo sharness tests will catch errors while the conformance tests improve (i.e. use something like the current strategy with the same BlocksGateway implementation kubo uses but with DAG prefetching of blocks happening underneath)
  • Can happen alongside some confidence building by comparing production ipfs.io/dweb.link traffic status codes + response sizes to Rhea ones.

Completion tasks to mark this done-done-done

  • Turning an inbound gateway.IPFSBackend request into a CAR request (should be relatively straightforward)
  • Doing incremental verification of the responses
  • Handle what happens if the CAR response sends back bad data (e.g. for Caboose report the problem upstream)
  • Handle what happens if the CAR response dies in the middle (i.e. resumption or restarting of download)
  • Handle OOM/out-of-disk-space errors
    • because the CAR responses do not include duplicate blocks, but a block may be reused in a graph traversal, either the entire graph needs to be buffered/stored before the blocks are thrown away, or it needs to be possible to re-issue block requests for data we recently received but might have thrown away

Additional Notes

There already is an implementation of gateway.IPFSBackend that uses the existing tooling for block-based storage/retrieval here (and related to #57).

Some details related to Caboose:

  • Since Caboose is in charge of selecting which Saturn peers to ask for which content there may be some affinity information (perhaps just what already exists) that it wants in order to optimize which nodes it sends requests to (e.g. for a given CAR request that fulfills an IPFS HTTP Gateway request understanding if it wants to split the load, send it all to a specific L1, send it to a set of L1s, etc.).
  • IIUC the current plan is to send all data for a given high-level IPFS HTTP Gateway request to a single L1, which shouldn't be too bad. Note: it may not be exactly 1 IPFS HTTP Gateway request -> 1 CAR file request due to various optimizations; however, the total number of requests should certainly go down dramatically

If we need to make some compromises in the implementation here in order to start collecting some data that's doable, but if so they should be explicitly called out and issues filed. Additionally, it should continue to be possible to use a blocks gateway implementation here via config.

cc @Jorropo @aarshkshah1992

Expose the same logging capability as Kubo

I believe our infra logs request details and then processes them; we need to keep that ability after switching to the bifrost-gateway binary.

Kubo can be run with an adjusted log level, and we could do the same here,
but I don't know whether the HTTP access logs come from Kubo or from the Nginx in front of it.

Rename ?format=car URL params to match IPIP-402

This issue is based on discussions I had with @aschmahmann and with @hannahhoward (slack) about depth=0|1|all.
cc ipfs/specs#348

Problem

The depth=0|1|all parameter was provisionally introduced as part of the Rhea project, and we quickly realized we need to improve it if we want to upstream it to https://specs.ipfs.tech/http-gateways/trustless-gateway/

The name is unfortunate, because after all the Graph API iterations we've ended up with a "depth" that lost its original meaning ("how deep to fetch a DAG (or path)") and ended up meaning "logical depth from the perspective of an end user thinking in terms of a single block, a file or a directory enumeration".

In that framing, the name makes very little sense:

  • a block can be resolved only via depth=0
  • things that support byte-range seeking, like unixfs files, can have max depth=1
  • hypothetical depth=2 makes sense only for directories and DAGs built out of IPLD things like DAG-JSON/CBOR documents (but we don't use this for Rhea, nor need it for the gateway atm).

I donโ€™t think we should have an integer range where there are only two possible values with well-defined behavior.

I agree, we ended up with three states:

  • 0 - blocks for path + only the root block for terminal CID
    • Gateway uses this for IPFSBackend.GetBlock and for IPFSBackend.ResolvePath
  • 1 - blocks for path + all blocks for a file, or a minimal set of blocks to enumerate directory or HAMT
    • Gateway uses this for all IPFSBackend.Get requests, as we don't want to over-fetch directories, and we don't know which path ends with a dir
  • all - blocks for path + all blocks for entire DAG
    • Gateway uses this only for TAR (IPFSBackend.GetAll) and CAR (IPFSBackend.GetCAR) responses, but this is also the implicit default for ?format=car, so we don't really need this.

I think we both want to clean this up, and we agree "depth" is simply an invalid term here. I also agree with you that strings are more meaningful than numbers for this abstraction layer.

What we want

  • this is not "depth" → this is more a predefined "scope" with a specific meaning
  • hard to reason what integers mean, or what the parameter is related to → include "car" in the name, avoid magical IPLD/ADL terms and other PL-specific words
  • we don't accept integers >1 → make it an opaque string
  • we may want to use "depth" for selecting actual DAG depth in the future → rename

Solution

Based on my notes and discussion with Hannah, we agreed car-scope=block|file|all is a better framing:

  • depth=0 → car-scope=block
  • depth=1 → car-scope=file (non-files, like directories and IPLD Maps, only return blocks necessary for their enumeration)
    • we use file here to make more sense to a non-PL user. "file" is a synonym for "the blocks needed to interact with the ADL in the transformed view", and the pathing on /ipfs/ defines the implicit ADLs to use.
  • depth=all → car-scope=all (implicit default when car-scope is unset, send everything)

This is way easier to reason about, no need to read docs, hard to make a mistake,
and a better fit for a future IPIP that updates https://specs.ipfs.tech/http-gateways/trustless-gateway/ which we want to do this year.

Transition plan.

Updated plan

  • bifrost-gateway starts sending requests with depth, car-scope and dag-scope AND both bytes and entity-bytes as noted in IPIP-402 (#108)
  • new parameters released and deployed to Rhea prod
  • Lassie implements new parameters filecoin-project/lassie#238
  • Saturn L1 implements new parameters
  • dotStorage implements new parameters
  • Live with it for some time
  • remove old params from bifrost-gateway #144

2023-02-17-3e0550b hangs after a while

Upstream issue: filecoin-saturn/caboose#31

filecoin-saturn/caboose#30 fixed panic, but we see a different problem now:

bifrost-bank1-ny:/data# curl http://127.0.0.1:8080/ipfs/bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  116k  100  116k    0     0  1228k      0 --:--:-- --:--:-- --:--:-- 1231k

It runs for a few minutes, but then the entire binary dies around ~4k requests:

bifrost-bank1-ny:/data# curl http://127.0.0.1:8041/debug/metrics/prometheus -s | grep caboose_fetch_err
# HELP ipfs_caboose_fetch_errors Errors fetching from Caboose Peers
# TYPE ipfs_caboose_fetch_errors counter
ipfs_caboose_fetch_errors{code="0"} 3563
ipfs_caboose_fetch_errors{code="200"} 369
bifrost-bank1-ny:/data# curl http://127.0.0.1:8080/ipfs/bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (56) Recv failure: Connection reset by peer

If there is no easy fix, I'd have to revert caboose updates before the EOD and run with the old version.

Adjust size of in-memory block cache

bifrost-gateway runs with an in-memory 2Q cache with its size set to 1024 blocks.

2Q is an enhancement over the standard LRU cache in that it tracks both frequently and recently used entries separately. This avoids a burst in access to new entries from evicting frequently used entries.
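
For reference, a cache like this can be built with hashicorp/golang-lru's 2Q implementation; the sketch below is illustrative wiring (the type and function names are assumptions), not the exact bifrost-gateway code:

import (
	lru "github.com/hashicorp/golang-lru"
	blocks "github.com/ipfs/go-block-format"
	"github.com/ipfs/go-cid"
)

// blockCache is a fixed-size 2Q cache keyed by CID.
type blockCache struct {
	cache *lru.TwoQueueCache
}

// newBlockCache builds the cache; size would come from configuration
// (e.g. via an env variable, as discussed in the tasks below).
func newBlockCache(size int) (*blockCache, error) {
	c, err := lru.New2Q(size)
	if err != nil {
		return nil, err
	}
	return &blockCache{cache: c}, nil
}

func (bc *blockCache) Get(c cid.Cid) (blocks.Block, bool) {
	v, ok := bc.cache.Get(c)
	if !ok {
		return nil, false
	}
	return v.(blocks.Block), true
}

func (bc *blockCache) Put(b blocks.Block) {
	bc.cache.Add(b.Cid(), b)
}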

Current cache performance: ~50% cache HIT

Cache metrics from bifrost-stage1-ny after one day (~48%):

ipfs_http_blockstore_cache_hit 7.273594e+06
ipfs_http_blockstore_cache_requests 1.515003e+07

And second sample from other day (~50%):

ipfs_http_blockstore_cache_hit 2.7508843e+07
ipfs_http_blockstore_cache_requests 5.4966088e+07

IIUC the above means that an in-memory "frecency" cache of 1024 blocks produces a cache HIT ~50% of the time.
This is not that surprising: every website will cause the same parent blocks to be read multiple times for every subresource on a page.

We run on machines that have 64GiB of memory and bifrost-gateway only utilizes ~5GiB.

Proposal: increase cache size

Improving the cache hit rate here won't improve things like video seeking or fetching big files, but it will have an impact on how fast popular websites and directory enumerations load, avoiding thrashing of the most popular content.

Tasks

  • refactor cache size configuration: remove CLI parameter, and use ENV variable instead (to match plan from #43 and new configuration convention agreed with George)
  • with the ability to tweak cache size via an env variable, run experiments on bifrost-stage1-ny and increase the block cache size, let's say initially x5 (to 5120 blocks), and see if it improves cache hit or produces diminishing returns.
  • Once we find the optimal cache size on staging, update implicit default

Set up Github permissions and CI automation

@galargh need your help/sanity check 🙏

This repo will be Go code that produces a single binary and a Docker image for running it (similar to Kubo, but leaner).

https://github.com/protocol/github-mgmt does not exist; for now I've added permissions to this repo manually.

What is the proper way of doing the below tasks?

If any of the above is too difficult, we can move it to ipfs/bifrost-gateway for now, but I would like to keep it in protocol/ because it's code specific to bifrost-infra.

panic in [email protected]/gateway/handler_car.go

Seems to be related to library introduced in #71 (not critical, seems to happen sporadically)

2023-04-03T19:17:38.277Z	ERROR	core/server	gateway/handler.go:319	A panic occurred in the gateway handler!
2023-04-03T19:17:38.277Z	ERROR	core/server	gateway/handler.go:320	runtime error: invalid memory address or nil pointer dereference
goroutine 323 [running]:
runtime/debug.Stack()
	/usr/local/go/src/runtime/debug/stack.go:24 +0x65
runtime/debug.PrintStack()
	/usr/local/go/src/runtime/debug/stack.go:16 +0x19
github.com/ipfs/boxo/gateway.panicHandler({0x1229fe0, 0xc0006563c0})
	/go/pkg/mod/github.com/ipfs/[email protected]/gateway/handler.go:321 +0xfc
panic({0xd2c760, 0x181bf40})
	/usr/local/go/src/runtime/panic.go:884 +0x212
github.com/ipfs/boxo/gateway.(*handler).serveCAR(0xc0008b0770, {0x122a690, 0xc002c010e0}, {0x1229fe0?, 0xc0006563c0}, 0xc0005350a8, {{0x122a8c0?, 0xc0003a4c60?}}, {0x122a8c0, 0xc0003a4c60}, ...)
	/go/pkg/mod/github.com/ipfs/[email protected]/gateway/handler_car.go:87 +0xbc1
github.com/ipfs/boxo/gateway.(*handler).getOrHeadHandler(0xc0008b0770, {0x1229fe0, 0xc0006563c0}, 0xc0005352c8)
	/go/pkg/mod/github.com/ipfs/[email protected]/gateway/handler.go:292 +0x13fa
github.com/ipfs/boxo/gateway.(*handler).ServeHTTP(0xc0008b0770, {0x1229fe0, 0xc0006563c0}, 0xc000339500)
	/go/pkg/mod/github.com/ipfs/[email protected]/gateway/handler.go:167 +0x234
net/http.(*ServeMux).ServeHTTP(0xc000535478?, {0x1229fe0, 0xc0006563c0}, 0xc000339500)
	/usr/local/go/src/net/http/server.go:2487 +0x149
main.withConnect.func1({0x1229fe0?, 0xc0006563c0?}, 0xc002afc615?)
	/go/src/github.com/ipfs/bifrost-gateway/handlers.go:57 +0x73
net/http.HandlerFunc.ServeHTTP(0x122a690?, {0x1229fe0?, 0xc0006563c0?}, 0xc000b68f00?)
	/usr/local/go/src/net/http/server.go:2109 +0x2f
github.com/ipfs/boxo/gateway.WithHostname.func1({0x1229fe0, 0xc0006563c0}, 0xc000339500)
	/go/pkg/mod/github.com/ipfs/[email protected]/gateway/hostname.go:241 +0x4e2
net/http.HandlerFunc.ServeHTTP(0x0?, {0x1229fe0?, 0xc0006563c0?}, 0x0?)
	/usr/local/go/src/net/http/server.go:2109 +0x2f
github.com/mitchellh/go-server-timing.Middleware.func1({0x7fade48c69f8, 0xc000656320}, 0xc000339400)
	/go/pkg/mod/github.com/mitchellh/[email protected]/middleware.go:74 +0x32b
net/http.HandlerFunc.ServeHTTP(0x1229320?, {0x7fade48c69f8?, 0xc000656320?}, 0xc0009b79a0?)
	/usr/local/go/src/net/http/server.go:2109 +0x2f
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerResponseSize.func1({0x1229320?, 0xc00183a380?}, 0xc000339400)
	/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/promhttp/instrument_server.go:288 +0xc5
net/http.HandlerFunc.ServeHTTP(0xc0009b7a50?, {0x1229320?, 0xc00183a380?}, 0xc00081a6e0?)
	/usr/local/go/src/net/http/server.go:2109 +0x2f
main.withRequestLogger.func1({0x1229320, 0xc00183a380}, 0xc000339400)
	/go/src/github.com/ipfs/bifrost-gateway/handlers.go:65 +0x171
net/http.HandlerFunc.ServeHTTP(0xc00067a463?, {0x1229320?, 0xc00183a380?}, 0x46d96e?)
	/usr/local/go/src/net/http/server.go:2109 +0x2f
net/http.serverHandler.ServeHTTP({0xc000658630?}, {0x1229320, 0xc00183a380}, 0xc000339400)
	/usr/local/go/src/net/http/server.go:2947 +0x30c
net/http.(*conn).serve(0xc00041c280, {0x122a690, 0xc0009ca120})
	/usr/local/go/src/net/http/server.go:1991 +0x607
created by net/http.(*Server).Serve
	/usr/local/go/src/net/http/server.go:3102 +0x4db

Don't die when strn.pl DNS fails to resolve

This is a placeholder for avoiding the incident from today:

$ curl https://orchestrator.strn.pl/nodes/nearby/
curl: (6) Could not resolve host: orchestrator.strn.pl

A single DNS name failure effectively broke the entire gateway.

pass in context for related calls to allow for caboose affinity

Caboose was coded, in its current blockstore incarnation, to expect that requests would have a key on the context for the 'root cid' of the request, so that it could better group related inbound requests.

This is set via an affinityKey (https://github.com/filecoin-saturn/caboose/blob/main/caboose.go#L134) in the config, and then context.WithValue() is used to set that key to a value by the time the request makes its way to the blockstore.

We can probably have that value just be the inbound request URL and things will get much better
(e.g. the caching of blocks on L1s will actually work)
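
For illustration, a minimal sketch of the idea; the key name and wiring are assumptions and must match whatever affinity key the Caboose config was created with:

import (
	"context"
	"net/http"
)

// Must match the affinity key the Caboose config was created with.
const cabooseAffinityKey = "bifrost-affinity" // hypothetical name, for illustration

// withAffinity attaches the inbound request URL to the context so related
// block/CAR fetches can be grouped and routed to the same L1.
// A plain string key is used here because the value is looked up by a
// configured string key on the Caboose side.
func withAffinity(r *http.Request) *http.Request {
	ctx := context.WithValue(r.Context(), cabooseAffinityKey, r.URL.String())
	return r.WithContext(ctx)
}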

http: superfluous response.WriteHeader call from [..] prometheus/client_golang

main-2023-02-18-d66784d prints the same prometheus error I've seen years ago while working on subdomain gateways in go-ipfs:

2023/02/18 01:29:30 http: superfluous response.WriteHeader call from github.com/prometheus/client_golang/prometheus/promhttp.(*responseWriterDelegator).WriteHeader (delegator.go:65)

IIRC this is low priority; it does not impact metrics in a meaningful way, it just pollutes stdout. Something to clean up in a spare moment (try to find the go-ipfs fix first, this was already fixed there before).

Expose /debug/pprof

To investigate issues like #92 we need to port some features from Kubo and expose them on the "metrics" port, potential candidates:

  • goroutine dump
    • curl localhost:5001/debug/pprof/goroutine\?debug=2 > ipfs.stacks
  • 30 second cpu profile
    • curl localhost:5001/debug/pprof/profile > ipfs.cpuprof
  • heap trace dump
    • curl localhost:5001/debug/pprof/heap > ipfs.heap
  • memory statistics (in json, see "memstats" object)
    • curl localhost:5001/debug/vars > ipfs.vars
  • system information
    • ipfs diag sys > ipfs.sysinfo

Ref. https://github.com/ipfs/kubo/blob/master/docs/debug-guide.md
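
A minimal sketch, assuming the metrics port is served by a plain net/http.ServeMux, of wiring up the standard library's pprof and expvar handlers:

import (
	"expvar"
	"net/http"
	"net/http/pprof"
)

// registerDebugHandlers exposes the same kind of debug endpoints Kubo has on
// its RPC port, but on the bifrost-gateway metrics mux.
func registerDebugHandlers(mux *http.ServeMux) {
	mux.HandleFunc("/debug/pprof/", pprof.Index)          // goroutine, heap, etc.
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile) // 30s CPU profile by default
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	mux.Handle("/debug/vars", expvar.Handler()) // memstats and other expvars as JSON
}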

Decide how to handle CAR and Block requests sent to bifrost-gateway

Responses to requests for block and CAR can be verified by clients.

This means, for verification purposes, bifrost-gateway does not need to act as a proxy, and could return HTTP 302 response with URL to Saturn endpoint capable of responding to Accept: application/vnd.ipld.raw and Accept: application/vnd.ipld.car requests.

Open questions / tasks

  • Decide initial behavior
    • Initial demo should have no special handling for these response types, do the usual proxying.
  • Is Saturn capable of handling these requests directly at some point? Are there any Saturn-specific reasons to not do this?
    • Saturn needs logs for accounting. If client follows HTTP 302 redirect, then we no longer have them (unless logging 302 is enough?).
    • If so, do we need to change the default response in go-libipfs/gateway so that /ipfs/cid/a/b returns the full DAG for /b + all blocks for the cid/a/b path, to align behavior with Saturn?

Delegate resolution of DNSlink and IPNS records

Initial idea (for Feb 17 binary)

  • For resolving /ipns/id content paths to /ipfs/cid ones
    • Delegate RPC at https://node[0-2].delegate.ipfs.io/api/v0/resolve (it handles both DNSlink and IPNS)
    • (TBD) if we need local cache per id to avoid sending lookup, we can do it in two stages:
      • a naive MVP can be a 1-minute LRU cache, matching current behavior in Kubo 0.18 (which we want to fix long term)
      • correct cache is based on record TTL, so we reuse resolution for subsequent requests
        • this may require us to switch away from /api/v0/resolve and handle IPNS and DNSLink separately
          • DNSLink: run own DNS lookup for TXT record, and cache based on TTL of TXT record
          • IPNS record: use /api/v0/routing/get and cache based on TTL of the IPNS record
  • For resolving raw IPNS record
    • Delegate RPC at https://node[0-2].delegate.ipfs.io/api/v0/routing/get (cc @hacdias to confirm, I believe we already do this in tests)
  • Ensure resolution is as fast as possible:

Long term, we want to use delegated HTTP routing for IPNS (follow-up for IPIP-337, when that API supports IPNS lookups), and DoH for DNSLink (https://github.com/libp2p/go-doh-resolver/ ?)

Investigate why success rate gets worse over time

Restarting bifrost-gateway on staging produces a very close success rate, but over time it erodes into a worse and worse state:

Inspect yourself:

Summary from the latter:

(screenshot: bifrost-gw staging metrics dashboard, Project Rhea, Grafana)

Some ideas/thoughts why:

  • an in-memory block cache perf regression is unlikely: the cache size is symbolic and aims to limit roundtrips per request.
    Staging runs with BLOCK_CACHE_SIZE=16k (#47 (comment)), the slowness happens way after that cache has been filled up multiple times, and we see on the next graph that the duration increase of the CAR fetch happens on the Caboose side:

    (screenshots: bifrost-gw staging metrics dashboards, Project Rhea, Grafana)

  • Saturn L1 pool health gets worse for some reason:

    (screenshot: bifrost-gw staging metrics dashboard, Project Rhea, Grafana)

  • Saturn per-L1 CAR fetch durations increase while other durations stay the same:

    (screenshot: per-L1 CAR fetch duration panel)

  • HTTP 499s suggest clients are giving up before they get our response, which is consistent with things getting slower over time and more and more clients giving up waiting for a response. This is not specific to Rhea; the old mirrored node that runs Kubo is also seeing more 499s over time, but it is less prominent:

    (screenshot: bifrost-gw staging metrics dashboard, Project Rhea, Grafana)

Any feedback / thoughts / hypotheses are welcome. 🙏

Fix CAR and other metrics

Some things I've noticed while adding CAR panels to bifrost-gw staging (see "bifrost-gateway daemon" section)

  • CAR metrics (histograms) look to be capped at 1m – need to verify in filecoin-saturn/caboose#76
    • *_car_peer_success|failure* (caboose)
    • *_car_success|failure* (caboose)
  • CAR size metric (*_fetch_size) looks to be capped at 4MiB – need to verify in filecoin-saturn/caboose#75
  • boxo/gateway – ipfs/boxo#265
    • IPFSBackend API calls capped at 1m
    • GET durations (global and per type) capped at 1m

I think the first step is to add more buckets beyond 1m, let's say until 5m.
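
For example, extending a histogram's buckets past one minute with the Prometheus Go client could look like the sketch below (the metric name is a placeholder, not the exact caboose/boxo definition):

import "github.com/prometheus/client_golang/prometheus"

// Durations in seconds; the last buckets extend past 1m up to 5m so that slow
// CAR fetches are no longer lumped into the +Inf bucket.
var carFetchDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: "ipfs",
		Subsystem: "caboose",
		Name:      "car_fetch_duration_seconds", // placeholder name, for illustration
		Help:      "Time to fetch a CAR from a Saturn L1.",
		Buckets:   []float64{0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 30, 60, 120, 180, 240, 300},
	},
	[]string{"code"},
)

func init() {
	prometheus.MustRegister(carFetchDuration)
}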

Bifrost makes calls to the Caboose Block and CAR fetch API with an already cancelled context

CAR Fetches
https://protocollabs.grafana.net/d/6g0_YjBVk/bifrost-caboose-staging?orgId=1&from=now-1h&to=now&editPanel=35

~ 41% of the requests by Bifrost to Caboose for CAR Fetches are made with a context that has already been cancelled.
~ 22% of the requests fail because Bifrost cancels the context while the CAR fetch is in progress.

Block Fetches
https://protocollabs.grafana.net/d/6g0_YjBVk/bifrost-caboose-staging?orgId=1&from=now-1h&to=now&editPanel=36

~ 18% of the requests by Bifrost to Caboose for Block API(Has, Get, Size) are made with a context that has already been cancelled.
~ 21% of the requests fail because Bifrost cancels the context while the Block fetch is in progress.

Add metrics specific to bifrost-gateway setup

This is meta-issue about useful metrics in bifrost-gateway.
We may ship only a subset of the below for the project Rhea.

Overview

The go-libipfs/gateway library will provide some visibility into incoming requests (1),
but we need to add metrics to track performance of saturn client (block provider) (2)
and other internals like resolution costs for different content path types and any in-memory caches we may add (3).

graph LR
    A(((fa:fa-person HTTP<br>clients)))
    B[bifrost-gateway]
    N[[fa:fa-hive bifrost-infra:<br>HTTP load-balancers<br> nginx, TLS termination]]
    S(((saturn.pl<br>CDN)))
    M0[( 0 <br>NGINX/LB<br/>LOGS&METRICS)]
    M1[( 1 <br>HTTP<br/>METRICS:<br/> ipfs_http_*)]
    M2[( 2 <br>BLOCK<br/>PROVIDER<br/>METRICS <br/>???)]
    M3[( 3 <br>INTERNAL<br/>METRICS<br/>???)]


   A -->| Accept: .. <br>?format=<br>Host:| N


    N --> M1 --> B
    N -.-> M0
    
    B --> M2 ---> S
    B -.-> M3

(0) are metrics tracked before bifrost-gateway and are out of scope.

Proposed metrics [WIP]

Below is a snapshot / brain dump. It is not ready yet; we want to do internal analysis/discussion before we start.

For (1)

  • Per request type
    • Duration Histogram per request type
      • We want global variant, and per namespace (/ipfs/ or /ipns/)
      • See the Appendix below for an example of what a histogram looks like
      • Why?
        • We need to measure each request type, informed by ?format= and the Accept header, because
          • They have different complexity involved, and will have different latency costs
          • We want to be able to see which ones are most popular, and comparing _sum from histograms will allow us to see % distribution
        • We need to measure /ipfs/ and /ipns/ separately to see the impact the additional resolution step (IPNS or DNSLink) has.
    • Response Size Histogram per request type
      • We want global variant, and per namespace (/ipfs/ or /ipns/)
      • Why?
        • Understanding the average response size allows us to correctly interpret the Duration. Without it, the Duration of a UnixFS response does not tell us whether the file was big or our stack was slow.
  • Count GET vs HEAD requests
    • Per each, count requests with Cache-Control: only-if-cached
      • Open question (can be answered later, after we see initial counts): should we exclude these requests from totals? My initial suggestion is to exclude them. If they become popular, they will skew numbers, as a request for a 4GB file will be "insanely fast"
  • Count 200 vs 2xx vs 3xx vs 400 vs 500 response codes

For (2)

  • Initially, we will only request raw blocks (application/vnd.ipld.raw) from Saturn:

    • Duration Histogram for block request
    • Response Size Histogram for block request
    • Count 200 vs non-200 response codes
  • TBD: Future (fancy application/vnd.ipld.car)

    • All requests will be for resolved /ipfs/
    • We will most likely want to track:
      • Duration and response size per original request type (histograms)
      • If we support sub-paths, then we will need to track Requested Content Path length (histogram)
  • TBD: if we put some sort of block cache in front of it, track HIT/MISS, probably per request type

For (3)

Place for additional internal metrics to give us more visibility into details, if we ever need to zoom-in.

  • Duration Histogram for /ipfs resolution
    • Why? Allows us to eyeball when resolution became the source of general slowness / regression in TTFB
  • Requested Content Path length Histogram for /ipfs
    • Why? We want to know the % of direct requests for a CID vs requests for longer content paths
  • Duration Histograms for /ipns resolutions of DNSLink, IPNS Record, both single lookup and recursive until /ipfs/ is hit
    • Why?
      • bifrost-gateway will be delegating resolution to remote HTTP endpoint
      • Both can be recursive, so the metrics will be skewed unless we measure both a single lookup and recursive
      • We want to be able to see which ones are most popular, and how often recursive values are present. Comparing _sum from histograms will allow us to see % distribution.

Appendix: how histogram from go-libipfs/gateway look like

When I say "histogram", I mean the _sum and _buckets we use in Kubo's /debug/metrics/prometheus:

# HELP ipfs_http_gw_raw_block_get_duration_seconds The time to GET an entire raw Block from the gateway.
# TYPE ipfs_http_gw_raw_block_get_duration_seconds histogram
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="0.05"} 927
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="0.1"} 984
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="0.25"} 1062
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="0.5"} 1067
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="1"} 1068
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="2"} 1068
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="5"} 1068
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="10"} 1068
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="30"} 1068
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="60"} 1068
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="+Inf"} 1068
ipfs_http_gw_raw_block_get_duration_seconds_sum{gateway="ipfs"} 19.696413685999993
ipfs_http_gw_raw_block_get_duration_seconds_count{gateway="ipfs"} 1068

We can change the bucket distribution if that gives us better data, but it should be done on both ends.

Run sharness tests from Kubo with PROXY_GATEWAY_URL="http://127.0.0.1:8080"

Until we have coverage and confidence in https://github.com/ipfs/gateway-conformance, the *gateway*.sh sharness tests in https://github.com/ipfs/kubo/tree/master/test/sharness are the only E2E regression and correctness tests we have.

Proposal: let's run them in this repo

We now have PROXY_GATEWAY_URL which enables us to run bifrost-gateway against any HTTP backend that supports GET /ipfs/cid?format=raw requests.

This means we could run bifrost-gateway against regular Kubo gateway and use it as the remote blockstore (and in the future, fetch graphs as CARs via ?format=car).

This would add basic end-to-end tests that catch problems in glue code that is not in go-libipfs, and also allow us to test graph requests in both Kubo and bifrost-gateway, when the time comes.

TODO

  • deterministic CAR fixtures ipfs/kubo#9657 to remove surface for bugs caused by change in the way fixtures are generated by a specific Kubo version (now, we only care about static, immutable blocks)
  • Github action that checks out Kubo master or a specific commit and runs kubo/tests/sharness/*gateway*.sh against the GWAY_PORT of bifrost-gateway
  • Kubo's ipfs daemon should run on unique port, and endpoint should be exported as PROXY_GATEWAY_URL
  • (tbd) Surgical tweak of kubo/tests/sharness/lib logic to override GWAY_PORT with value from PROXY_GATEWAY_URL if that env is present.
    • This means requests will hit bifrost-gateway that is in front of Kubo, but all the fixture and assert preparation based on kubo CLI like ipfs dag import or ipfs block get will still work, because these do not use GWAY_PORT

Support DNSLink on ENS (move namesys defaults out of Kubo)

Problem

bifrost-gateway must have feature and behavior parity with Kubo, but we have a bug around ENS:

Potential cause

Seems that we are running different namesys config than Kubo.

It should use the same set of defaultResolvers to support ENS and UD, as noted in the config docs, but *.eth resolution does not work atm.

Cleanup proposal

Move defaultResolvers to one of upstream libraries, and make them implicit default to avoid ENS failures like this one.

Option A: move to go-namesys (global)

This applies defaults to both DNSLinks and Multiaddrs, allowing use of *.eth in /dnsaddr too

Option B: move to go-libipfs/gateway (minimal, scoped to DNSLinks)

This would only apply to DNSLinks, but that is enough given the utility of ENS today (just DNSLinks).


@hacdias @aschmahmann Thoughts?

I feel (B) may be cleaner:

  • WithNameSystem allows users to pass custom resolvers
  • if no WithNameSystem is passed we apply defaultResolvers on the library level as implicit default

Smart handling of legacy /api/v0 endpoints

/api/v0 is not part of Gateway, it is RPC specific to Kubo, but ipfs.io exposes a subset of it for legacy reasons.

Based on @aschmahmann's research: https://www.notion.so/pl-strflt/API-v0-Distribution-9342e803ecee49619989427d62dd0f42

resolves: name/resolve, resolve, dag resolve, dns

These are the majority of requests and need to remain fast, as they are used by various tools, including ipfs-companion users that have no local node but still want to copy a CID. These are the only things we need to support inside bifrost-gateway, and we should have sensible caching for them.

We will require routing these to a Kubo RPC at a box with an accelerated DHT client and IPNS. See https://github.com/protocol/bifrost-infra/issues/2327

gets: cat, dag get, block/get

  • cat return HTTP 302 redirect to ipfs.io/ipfs/cid
  • dag get return HTTP 302 redirect to ipfs.io/ipfs/cid?format=dag-json (or json/cbor/dag-cbor, if they passed explicit one in &output-codec=
  • block/get return HTTP 302 redirect to ipfs.io/ipfs/cid?format=raw
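
A rough sketch of how such a redirect could look for /api/v0/cat (the handler name and query parsing are assumptions for illustration):

import (
	"net/http"
	"strings"
)

// redirectAPIv0Cat translates /api/v0/cat?arg=<cid-or-path> into a 302
// redirect to the path gateway, so the data is served (and cached) there.
func redirectAPIv0Cat(w http.ResponseWriter, r *http.Request) {
	arg := r.URL.Query().Get("arg")
	if arg == "" {
		http.Error(w, "argument \"ipfs-path\" is required", http.StatusBadRequest)
		return
	}
	// Assumption for illustration: arg is either a bare CID or an /ipfs/ path.
	target := "https://ipfs.io/ipfs/" + strings.TrimPrefix(arg, "/ipfs/")
	http.Redirect(w, r, target, http.StatusFound)
}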

everything else

Return HTTP 501 Not Implemented with a text/plain body explaining that the /api/v0 RPC is being removed from gateways, and that if someone needs it they can self-host a Kubo instance, plus a link to https://docs.ipfs.tech/install/command-line/.

If there is a good reason to special-handle some additional endpoints, we can. Drop comment below.

Update Saturn logger configuration

Extracting tasks for bifrost-gateway from:
https://www.notion.so/pl-strflt/Saturn-Payments-Logging-8db8d7b691a9471aaf786947a6b8b0a6

Changes requested by Saturn:

  • change the default endpoint to https://logs.strn.network (HTTP POST), make it implicit (remove from CLI, but allow override via ENV STRN_LOGGER_URL)
  • read the logger token via ENV STRN_LOGGER_SECRET – confirmed this convention with George
  • switch to the latest caboose version
  • create a separate http client for log reporting, and add the secret token to every log report (Authorization: Bearer <base64_jwt>) – see how here: #51 (comment)
  • don't break local dev: send logs only when the secret token is present, support local dev without it, or figure out how devs can get tokens if they are mandatory for caboose to operate correctly

log warning about "URL query contains semicolon"

2023/02/20 14:08:03 http: URL query contains semicolon, which is no longer a supported separator; parts of the query may be stripped when parsed; see golang.org/issue/25192
2023/02/20 14:08:36 http: URL query contains semicolon, which is no longer a supported separator; parts of the query may be stripped when parsed; see golang.org/issue/25192
2023/02/20 14:09:07 http: URL query contains semicolon, which is no longer a supported separator; parts of the query may be stripped when parsed; see golang.org/issue/25192
2023/02/20 14:29:58 http: URL query contains semicolon, which is no longer a supported separator; parts of the query may be stripped when parsed; see golang.org/issue/25192

Support Trace Context HTTP headers

Ref.

Quick thoughts:

  • Whatever we establish here, others in the IPFS ecosystem will see and assume is "standard best practice", so we should plan for the long term
  • This should be a turn-key solution for everyone running boxo/gateway, not just bifrost-gateway
  • bifrost-gateway should ensure tracing info is always present
    • If it is missing in request, bifrost-gateway would create one. (spec)
    • If it is present in request, we update ids in the chain.
      • For Rhea, we will be creating trace parent on LBs (slack thread), but if there is no load balancer in front of bifrost-gateway, then it should initialize trace chain.
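
As a sketch of the direction, using the OpenTelemetry Go SDK (the middleware below is illustrative, not existing bifrost-gateway code, and assumes a W3C TraceContext propagator has been registered globally):

import (
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

// withTraceContext extracts an incoming W3C traceparent if present, or starts
// a new root span if not, so downstream calls always have tracing info.
func withTraceContext(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))
		ctx, span := otel.Tracer("bifrost-gateway").Start(ctx, "gateway.request")
		defer span.End()
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}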

Tasks

  1. hacdias
  2. hacdias
  3. guanzo

error: Saturn data did not match given hash

@hacdias mind looking into why stdout gets spammed with messages like the ones below?

got QmPciteMQZnB6swbx35qA9geqe4AmESFJAM3Pa9WoxdcNc vs QmVv6aC9qcfEr7db155WUWvzE8dj9nD7oKAkwAJXGVnCij
got QmPciteMQZnB6swbx35qA9geqe4AmESFJAM3Pa9WoxdcNc vs QmR6hSR7jQygGCd9FjmTWF2nZgVsjQjEGqoPnWHHbvxp3W
got QmPciteMQZnB6swbx35qA9geqe4AmESFJAM3Pa9WoxdcNc vs QmbnP2sCk6AEqGWrM8DBtUNfQfoZBnEgKX5gceAxJFnj44
got QmPciteMQZnB6swbx35qA9geqe4AmESFJAM3Pa9WoxdcNc vs QmcUSpZNzHGUnrdFGr4Mowsq3izzrExic8j45Lo6q9Stgu
got QmPciteMQZnB6swbx35qA9geqe4AmESFJAM3Pa9WoxdcNc vs QmT7Qe9MLouG1dfqkBx1CG391L58hLsHRMbrw6qCLXYjR7
got QmPciteMQZnB6swbx35qA9geqe4AmESFJAM3Pa9WoxdcNc vs QmT7Qe9MLouG1dfqkBx1CG391L58hLsHRMbrw6qCLXYjR7
got QmPciteMQZnB6swbx35qA9geqe4AmESFJAM3Pa9WoxdcNc vs QmcUSpZNzHGUnrdFGr4Mowsq3izzrExic8j45Lo6q9Stgu
got QmPciteMQZnB6swbx35qA9geqe4AmESFJAM3Pa9WoxdcNc vs QmcUSpZNzHGUnrdFGr4Mowsq3izzrExic8j45Lo6q9Stgu
got QmPciteMQZnB6swbx35qA9geqe4AmESFJAM3Pa9WoxdcNc vs QmbBiyK1Uf7naVeWH5CgacnNYmpHoKdNCkKhNF3JZxTvoK
got QmPciteMQZnB6swbx35qA9geqe4AmESFJAM3Pa9WoxdcNc vs QmbBiyK1Uf7naVeWH5CgacnNYmpHoKdNCkKhNF3JZxTvoK
got QmPciteMQZnB6swbx35qA9geqe4AmESFJAM3Pa9WoxdcNc vs QmT7Qe9MLouG1dfqkBx1CG391L58hLsHRMbrw6qCLXYjR7
got QmPciteMQZnB6swbx35qA9geqe4AmESFJAM3Pa9WoxdcNc vs QmbBiyK1Uf7naVeWH5CgacnNYmpHoKdNCkKhNF3JZxTvoK
got QmPciteMQZnB6swbx35qA9geqe4AmESFJAM3Pa9WoxdcNc vs QmT7Qe9MLouG1dfqkBx1CG391L58hLsHRMbrw6qCLXYjR7
got QmPciteMQZnB6swbx35qA9geqe4AmESFJAM3Pa9WoxdcNc vs QmbfKn7MJjetHCRVjj2qQL7vKsdQ8wKuHe3pZjT79YEdx6
got QmPciteMQZnB6swbx35qA9geqe4AmESFJAM3Pa9WoxdcNc vs QmbfKn7MJjetHCRVjj2qQL7vKsdQ8wKuHe3pZjT79YEdx6
...

Are these related to hash verification errors? If you browse the IPFS website long enough, Saturn will fail and we get this:

(screenshot: gateway error page)

Suggested improvements

  • don't spam stdout 🙃 if we really need to, write a summary once a minute (N of M blocks returned by Saturn were invalid, print which PoPs were at fault)
  • bifrost-gateway should not fail like this. It must be robust enough to retry the block read using a different PoP, and blacklist PoPs that return garbage for some time.

Expose configuration for necessary HTTP endpoints

This is a quick dump of things that need to be configurable.

My initial proposal is to use env variables, which removes the need for babysitting config file.

So far, things we need to be able to configure are URLs of HTTP endpoints.

Saturn Orchestrator endpoint

SATURN_ORCHESTRATOR="https://orchestrator.strn.pl"

We don't want it hardcoded like this, switching to alternative endpoint should not require recompiling binary.

Kubo RPC /api/v0 endpoint(s)

I propose a space-separated pool; we use the endpoints in random order to spread the load:

KUBO_RPC="https://node0.delegate.ipfs.io https://node1.delegate.ipfs.io https://node2.delegate.ipfs.io https://node3.delegate.ipfs.io"

RPC endpoints that will be used:

  • /api/v0/resolve for resolving /ipns/{id} to /ipfs/{cid} (until we have delegated IPNS lookups over IPNI)
  • we will be redirecting / proxying legacy /api/v0 endpoints exposed on ipfs.io (to minimize blast radius)
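
A small sketch of how the proposed KUBO_RPC variable could be parsed and used to spread the load (illustrative only, not existing code):

import (
	"math/rand"
	"os"
	"strings"
)

// kuboRPCEndpoints parses the proposed space-separated KUBO_RPC pool.
func kuboRPCEndpoints() []string {
	return strings.Fields(os.Getenv("KUBO_RPC"))
}

// pickKuboRPC returns a random endpoint from the pool to spread the load.
func pickKuboRPC(endpoints []string) string {
	if len(endpoints) == 0 {
		return ""
	}
	return endpoints[rand.Intn(len(endpoints))]
}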

error: browsing HAMT dirs extremely slow

Opening /ipns/en.wikipedia-on-ipfs.org/wiki/ requires enumerating a HAMT-sharded UnixFS directory with ~20 million entries.

This means multiple blocks need to be fetched before we can find the CID of /ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze/wiki/index.html

How to reproduce

Try browsing http://localhost:8080/ipfs/bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze/wiki/ and then going to some pages that are not cached on saturn.

OR bafybeiggvykl7skb2ndlmacg2k5modvudocffxjesexlod2pfvg5yhwrqm, which is a directory with 10k JPGs.

Potential fixes

sane:

  • basic LRU cache for blocks may help #15
  • maybe additional cache for blocks that belong to unixfs HAMT nodes?

insane:

  • cheat, and delegate ResolvePath of expensive HAMTs to /api/v0/resolve
    • less insane: race ResolvePath backed by local LRU cache vs remote RPC and use whichever returns faster result

Research / improve timeouts on stalled L1 responses

(this is a stub – needs research, more details will follow)

Current HTTP Client

Caboose uses a dedicated HTTP client with its own configuration; it includes raised timeouts, namely:

  • Timeout for the entire connection – set to 30m (caboose.DefaultSaturnCarRequestTimeout)
  • IdleConnTimeout for how long to keep a connection without an ongoing request/response

saturnRetrievalClient := &http.Client{
	Timeout: caboose.DefaultSaturnCarRequestTimeout,
	Transport: &customTransport{
		RoundTripper: &http.Transport{
			// Increasing concurrency defaults from http.DefaultTransport
			MaxIdleConns:        1000,
			MaxConnsPerHost:     100,
			MaxIdleConnsPerHost: 100,
			IdleConnTimeout:     90 * time.Second,
			DialContext:         cdns.dialWithCachedDNS,
			// Saturn Weltschmerz
			TLSClientConfig: &tls.Config{
				// Saturn uses TLS in controversial ways, which sooner or
				// later will force them to switch away to a different domain
				// name and certs, in which case they will break us. Since
				// we are fetching raw blocks and don't really care about
				// the TLS cert being legitimate, let's disable verification
				// to save CPU and to avoid catastrophic failure when
				// Saturn L1s suddenly switch to certs with a different DNS name.
				InsecureSkipVerify: true,
				// ServerName: "strn.pl",
			},
		},
	},
}

Problems

  • A CAR response can stall after the initial few blocks, and we waste time waiting for it for the next 28m
    • in practice, usually some other timeout kicks in, either in bifrost-gateway or the L1's nginx; we see the P95 cut-off happening around ~8m, but this is still too long a wait

Solution

The CAR stream will include blocks, so we could introduce a "NewBlockTimeout", which is the amount of time without any NEW block arriving from the server as a response to a ?format=car request. This composes nicely: we could apply the same per-block timeout here as we already do for ?format=raw, and keep them uniform.

We could also (or instead) track the time since any bytes arrived per request, but that needs research into the best way of doing this in modern Go.
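
A sketch of the NewBlockTimeout idea, assuming the CAR body is read with go-car/v2's BlockReader and that the supplied abort function cancels the underlying HTTP request (so a stalled read returns an error):

import (
	"io"
	"time"

	blocks "github.com/ipfs/go-block-format"
	car "github.com/ipld/go-car/v2"
)

// readCARWithNewBlockTimeout aborts the CAR download if no NEW block arrives
// within d. abort is expected to cancel the underlying HTTP request, which
// unblocks a stalled br.Next() with an error.
func readCARWithNewBlockTimeout(r io.Reader, d time.Duration, abort func(), onBlock func(blocks.Block) error) error {
	br, err := car.NewBlockReader(r)
	if err != nil {
		return err
	}

	stall := time.AfterFunc(d, abort)
	defer stall.Stop()

	for {
		blk, err := br.Next()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		stall.Reset(d) // a new block arrived; reset the stall timer
		if err := onBlock(blk); err != nil {
			return err
		}
	}
}

The same helper shape could wrap ?format=raw fetches with a per-block timeout, keeping the two paths uniform.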

panic on staging: runtime error: slice bounds out of range [:-1]

Happened twice, around boxo/gateway/blocks_gateway.go → getPathRoots → pathRoots[:len(pathRoots)-1]

2023/04/05 21:20:56 Starting bifrost-gateway 2023-04-05-d8d2849
...
2023-04-06T02:07:08.766Z	ERROR	core/server	gateway/handler.go:315	A panic occurred in the gateway handler!
2023-04-06T02:07:08.766Z	ERROR	core/server	gateway/handler.go:316	runtime error: slice bounds out of range [:-1]
...
2023-04-06T02:07:08.799Z	ERROR	core/server	gateway/handler.go:315	A panic occurred in the gateway handler!
2023-04-06T02:07:08.800Z	ERROR	core/server	gateway/handler.go:316	runtime error: slice bounds out of range [:-1]
goroutine 8209872 [running]:
runtime/debug.Stack()
	/usr/local/go/src/runtime/debug/stack.go:24 +0x65
runtime/debug.PrintStack()
	/usr/local/go/src/runtime/debug/stack.go:16 +0x19
github.com/ipfs/boxo/gateway.panicHandler({0x123d680, 0xc1cb717680})
	/go/pkg/mod/github.com/ipfs/[email protected]/gateway/handler.go:317 +0xfc
panic({0xdfa960, 0xc148178b40})
	/usr/local/go/src/runtime/panic.go:884 +0x213
github.com/ipfs/boxo/gateway.(*BlocksGateway).getPathRoots(0x0?, {0x123da38, 0xc1afb4d890}, {{0x123dc68?, 0xc247a56a10?}})
	/go/pkg/mod/github.com/ipfs/[email protected]/gateway/blocks_gateway.go:325 +0x5c6
github.com/ipfs/boxo/gateway.(*BlocksGateway).getNode(0xc204cf1e00, {0x123da38, 0xc1afb4d890}, {{0x123dc68?, 0xc247a56a10?}})
	/go/pkg/mod/github.com/ipfs/[email protected]/gateway/blocks_gateway.go:261 +0x6b
github.com/ipfs/boxo/gateway.(*BlocksGateway).Get(0xc204cf1e00, {0x123da38, 0xc1afb4d890}, {{0x123dc68?, 0xc247a56a10?}}, {0xd09fe0?, 0x1231950?, 0xe1ede0?})
	/go/pkg/mod/github.com/ipfs/[email protected]/gateway/blocks_gateway.go:143 +0x7a
github.com/ipfs/bifrost-gateway/lib.(*GraphGateway).Get(0xc000184d80, {0x123da38, 0xc1afb4d890}, {{0x123dc68?, 0xc247a56a10?}}, {0xc247a56ad0, 0x1, 0x1})
	/go/src/github.com/ipfs/bifrost-gateway/lib/graph_gateway.go:366 +0x5e5
github.com/ipfs/boxo/gateway.(*ipfsBackendWithMetrics).Get(0xc000129a28, {0x123da38, 0xc1afb4d860}, {{0x123dc68?, 0xc247a56a10?}}, {0xc247a56ad0, 0x1, 0x1})
	/go/pkg/mod/github.com/ipfs/[email protected]/gateway/metrics.go:69 +0x4b7
github.com/ipfs/boxo/gateway.(*handler).serveDefaults(0xc00087ecc0, {0x123da38, 0xc1afb4d3b0}, {0x123d680, 0xc1cb717680}, 0xc013ae50b0, {{0x123dc68?, 0xc247a56a10?}}, {{0x123dc68, 0xc247a56a10}}, ...)
	/go/pkg/mod/github.com/ipfs/[email protected]/gateway/handler_defaults.go:68 +0x469
github.com/ipfs/boxo/gateway.(*handler).getOrHeadHandler(0xc00087ecc0, {0x123d680, 0xc1cb717680}, 0xc013ae52d0)
	/go/pkg/mod/github.com/ipfs/[email protected]/gateway/handler.go:281 +0xc34
github.com/ipfs/boxo/gateway.(*handler).ServeHTTP(0xc00087ecc0, {0x123d680, 0xc1cb717680}, 0xc1824ad500)
	/go/pkg/mod/github.com/ipfs/[email protected]/gateway/handler.go:163 +0x234
net/http.(*ServeMux).ServeHTTP(0xc013ae5480?, {0x123d680, 0xc1cb717680}, 0xc1824ad500)
	/usr/local/go/src/net/http/server.go:2500 +0x149
main.withConnect.func1({0x123d680?, 0xc1cb717680?}, 0xc0e2e1be20?)
	/go/src/github.com/ipfs/bifrost-gateway/handlers.go:57 +0x73
net/http.HandlerFunc.ServeHTTP(0x123da38?, {0x123d680?, 0xc1cb717680?}, 0xc000184d80?)
	/usr/local/go/src/net/http/server.go:2122 +0x2f
github.com/ipfs/boxo/gateway.WithHostname.func1({0x123d680, 0xc1cb717680}, 0xc1824ad500)
	/go/pkg/mod/github.com/ipfs/[email protected]/gateway/hostname.go:241 +0x4e2
net/http.HandlerFunc.ServeHTTP(0x0?, {0x123d680?, 0xc1cb717680?}, 0x0?)
	/usr/local/go/src/net/http/server.go:2122 +0x2f
github.com/mitchellh/go-server-timing.Middleware.func1({0x7fc9001d18e0, 0xc1cb717590}, 0xc1824ad400)
	/go/pkg/mod/github.com/mitchellh/[email protected]/middleware.go:74 +0x32b
net/http.HandlerFunc.ServeHTTP(0x123c810?, {0x7fc9001d18e0?, 0xc1cb717590?}, 0xdd95a0?)
	/usr/local/go/src/net/http/server.go:2122 +0x2f
github.com/prometheus/client_golang/prometheus/promhttp.InstrumentHandlerResponseSize.func1({0x123c810?, 0xc11958e1c0?}, 0xc1824ad400)
	/go/pkg/mod/github.com/prometheus/[email protected]/prometheus/promhttp/instrument_server.go:288 +0xc5
net/http.HandlerFunc.ServeHTTP(0xc042f88a58?, {0x123c810?, 0xc11958e1c0?}, 0xc0000e04e0?)
	/usr/local/go/src/net/http/server.go:2122 +0x2f
main.withRequestLogger.func1({0x123c810, 0xc11958e1c0}, 0xc1824ad400)
	/go/src/github.com/ipfs/bifrost-gateway/handlers.go:65 +0x171
net/http.HandlerFunc.ServeHTTP(0x0?, {0x123c810?, 0xc11958e1c0?}, 0x46f24e?)
	/usr/local/go/src/net/http/server.go:2122 +0x2f
net/http.serverHandler.ServeHTTP({0xc1afb4c8a0?}, {0x123c810, 0xc11958e1c0}, 0xc1824ad400)
	/usr/local/go/src/net/http/server.go:2936 +0x316
net/http.(*conn).serve(0xc173373200, {0x123da38, 0xc000a1c150})
	/usr/local/go/src/net/http/server.go:1995 +0x612
created by net/http.(*Server).Serve
	/usr/local/go/src/net/http/server.go:3089 +0x5ed

internalWebError: %!s(<nil>)

Opening bafybeiggvykl7skb2ndlmacg2k5modvudocffxjesexlod2pfvg5yhwrqm errored badly:

(screenshot: gateway error page)

We should track down what produces errors like this and improve them.

Expose the same metrics as Kubo

Kubo Gateway exposes a bunch of metrics on RPC port: http://127.0.0.1:5001/debug/metrics/prometheus

bifrost-gateway should do the same

Details

We already have some code extracted from Kubo:

https://github.com/ipfs/go-libipfs/blob/302b2799386dea7afb72ba0b4c32a5c427215d06/gateway/handler.go#L220-L262

We need to also move things from MetricsCollectionOption:

https://github.com/ipfs/kubo/blob/14649aa8ba8d7612ce9e35bba776fe7e7498b343/core/corehttp/metrics.go#L79

While at it, we should fill the gaps mentioned in ipfs/boxo#154

Support Server-Timing header

TODO

This feature may be useful enough to be promoted upstream:

  • support timing info from caboose (#71)
  • add gateway-related timing here
  • evaluate usefulness
  • decision to move the middleware to the upstream boxo/gateway library

What and why

Ref. https://www.w3.org/TR/server-timing/

TLDR: we want to leverage the UI in modern browsers to give more info about where time in a Gateway request was spent:


Source: https://ma.ttias.be/server-timings-chrome-devtools/

How

TBD, there is a library at https://github.com/mitchellh/go-server-timing + slack thread

Middleware for injecting the server timing struct into the request Context and writing the Server-Timing header.

there is a slight risk the middleware will not be compatible with how we do subdomains, but that is TBD;
we can give it a try in bifrost-gateway and see how it goes.

on the surface level it sounds sensible,

  • only normal header, no Trailers (mdn/browser-compat-data#14703)
  • context thing does not require another API rewrite, caboose can set whatever it wants, we would return it
  • if we find this useful, boxo/gateway could add own metrics and/or we could make this part of Gateway spec
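
Based on the go-server-timing README, the wiring could look roughly like this (a sketch to evaluate, not a decision on the final shape):

import (
	"net/http"

	servertiming "github.com/mitchellh/go-server-timing"
)

// withServerTiming wraps the gateway handler so a Server-Timing header is
// written on responses.
func withServerTiming(gateway http.Handler) http.Handler {
	return servertiming.Middleware(gateway, nil)
}

// recordFetchTiming starts a timing metric inside a request (e.g. around a
// Saturn block fetch) and returns a function that stops it.
func recordFetchTiming(r *http.Request) func() {
	timing := servertiming.FromContext(r.Context())
	if timing == nil { // middleware not installed (e.g. in tests)
		return func() {}
	}
	m := timing.NewMetric("block-fetch").WithDesc("fetch from Saturn").Start()
	return func() { m.Stop() }
}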

meta: GRAPH_BACKEND fixes and latency improvements

This is a meta issue for tracking things we can do to improve correctness/latency.

Spec conformance

Tasks

Perf. Improvements

Tasks

  1. lidel
  2. lidel

Note: these things aren't blocking for enabling GRAPH_BACKEND=true. The blocker is #92
