ipni / go-indexer-core

Core go datastructure of a cid index

License: Other

Go 100.00%

go-indexer-core's Issues

Add CBOR value codec

Add a CBOR value codec and benchmarks that compare it with the existing JSON and custom binary formats. Set CBOR as the default if its performance is comparable with the custom binary format, in favour of using more common data formats.

storethehash is taking a long time to initialize

With a 179G value store on disk it is taking 12 minutes to initialize storethehash

2022-03-17T18:36:40.832Z        INFO    indexer command/daemon.go:95    Valuestore initializing/opening {"type": "sth", "path": "/data/valuestore-sth"}
2022-03-17T18:48:15.657Z        INFO    indexer command/daemon.go:100   Valuestore initialized

See: ipld/go-storethehash#26

Use multihash instead of CID

Since only the multihash portion of a CID is used to index content, this should be made clear by replacing CID with multihash in all function signatures.

Support parallel read of current MH values during put

Currently during a put of a batch of multihashes, the core reads the current value of each hash sequentially as part of updating those values.

To support large batches from a single provider, we should support sharding these reads across multiple threads so that we can parallelize waiting on the open/read syscalls for these reads.
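A minimal sketch of what sharding the reads could look like, assuming a fixed pool of goroutines pulling key indexes from a channel. `getFunc`, `readResult`, and `parallelGet` are hypothetical names for illustration and are not part of the go-indexer-core API.

```go
package main

import (
	"fmt"
	"sync"
)

// getFunc is a hypothetical stand-in for the value store's read of one
// multihash; it is not the go-indexer-core API.
type getFunc func(key string) ([]byte, error)

// readResult holds the outcome of reading one key.
type readResult struct {
	Value []byte
	Err   error
}

// parallelGet reads all keys using at most `workers` goroutines and returns
// the results in the same order as keys.
func parallelGet(keys []string, workers int, get getFunc) []readResult {
	if workers < 1 {
		workers = 1
	}
	results := make([]readResult, len(keys))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each index is handed to exactly one worker, so writes to
			// results[i] never race with each other.
			for i := range jobs {
				v, err := get(keys[i])
				results[i] = readResult{Value: v, Err: err}
			}
		}()
	}
	for i := range keys {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	store := map[string][]byte{"mh1": []byte("a"), "mh2": []byte("b")}
	get := func(k string) ([]byte, error) {
		v, ok := store[k]
		if !ok {
			return nil, fmt.Errorf("not found: %s", k)
		}
		return v, nil
	}
	for i, r := range parallelGet([]string{"mh1", "mh2", "mh3"}, 2, get) {
		fmt.Println(i, string(r.Value), r.Err)
	}
}
```

The worker count would presumably be an engine option, since the best value depends on how the underlying store handles concurrent reads.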

dhstore should be another value store implementation, not an engine option

Creating the indexer core engine takes a valueStore interface used for storing unencrypted index data. Options can also be specified to provide a dhstore for storing encrypted index data. I do not think the same core should be doing both. Instead, a core engine instance should be given only a valueStore interface when it is created. That interface can be either an unencrypted or an encrypted implementation.

If we really want to support both, then the indexer should use multiple cores, but there should be no reason since we do not need to look up unencrypted results to then store them in an encrypted DB. We only needed that during our transition from one to the other.
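A rough sketch of the proposed shape, assuming a trimmed-down store interface. `ValueStore`, `memStore`, `hashedKeyStore`, and `Engine` here are hypothetical and much smaller than the real types. `hashedKeyStore` is only a placeholder for an encrypted implementation: it hashes keys with SHA-256 and leaves values untouched, which is not what dhstore does.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// ValueStore is a hypothetical, trimmed-down store interface.
type ValueStore interface {
	Put(key, value []byte) error
	Get(key []byte) ([]byte, bool, error)
}

// memStore is a plain in-memory implementation.
type memStore struct{ m map[string][]byte }

func newMemStore() *memStore { return &memStore{m: make(map[string][]byte)} }

func (s *memStore) Put(key, value []byte) error {
	s.m[string(key)] = value
	return nil
}

func (s *memStore) Get(key []byte) ([]byte, bool, error) {
	v, ok := s.m[string(key)]
	return v, ok, nil
}

// hashedKeyStore is a placeholder for an encrypted implementation. It only
// hashes keys before delegating; real value encryption is omitted.
type hashedKeyStore struct{ inner ValueStore }

func (s hashedKeyStore) Put(key, value []byte) error {
	h := sha256.Sum256(key)
	return s.inner.Put(h[:], value)
}

func (s hashedKeyStore) Get(key []byte) ([]byte, bool, error) {
	h := sha256.Sum256(key)
	return s.inner.Get(h[:])
}

// Engine depends on exactly one ValueStore and does not know which kind.
type Engine struct{ store ValueStore }

func NewEngine(vs ValueStore) *Engine { return &Engine{store: vs} }

func (e *Engine) Put(key, value []byte) error          { return e.store.Put(key, value) }
func (e *Engine) Get(key []byte) ([]byte, bool, error) { return e.store.Get(key) }

func main() {
	plain := NewEngine(newMemStore())
	hashed := NewEngine(hashedKeyStore{inner: newMemStore()})
	for _, e := range []*Engine{plain, hashed} {
		_ = e.Put([]byte("mh"), []byte("provider-1"))
		v, ok, _ := e.Get([]byte("mh"))
		fmt.Println(string(v), ok)
	}
}
```

With this shape the engine needs no dhstore-specific option; the caller picks the implementation when constructing the engine.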

Engine panics with `segmentation violation code=0x1`

After upgrading the x/sys dependency, the panic still occurs on Put when values retrieved from the cache are checked for equality.

[signal SIGSEGV: segmentation violation code=0x1 addr=0x7f7f772e6999 pc=0x410af1]

goroutine 275 [running]:
runtime.throw({0x15d88b4?, 0xc00536a000?})
	/usr/local/go/src/runtime/panic.go:992 +0x71 fp=0xc00903e898 sp=0xc00903e868 pc=0x4473b1
runtime.sigpanic()
	/usr/local/go/src/runtime/signal_unix.go:825 +0x305 fp=0xc00903e8e8 sp=0xc00903e898 pc=0x45d785
memeqbody()
	/usr/local/go/src/internal/bytealg/equal_amd64.s:108 +0xd1 fp=0xc00903e8f0 sp=0xc00903e8e8 pc=0x410af1
bytes.Equal(...)
	/usr/local/go/src/bytes/bytes.go:20
github.com/filecoin-project/go-indexer-core.Value.Match(...)
	/go/pkg/mod/github.com/filecoin-project/[email protected]/value.go:58
github.com/filecoin-project/go-indexer-core.Value.Equal(...)
	/go/pkg/mod/github.com/filecoin-project/[email protected]/value.go:63
github.com/filecoin-project/go-indexer-core/engine.(*Engine).Put(0xc002dc0b00, {{0xc0103faed0, 0x22}, {0xc013080d40, 0x3d, 0x3f}, {0xc002fe2db8, 0x2, 0x2}}, {0xc00a1f4000, ...})
	/go/pkg/mod/github.com/filecoin-project/[email protected]/engine/engine.go:93 +0x645 fp=0xc00903eae8 sp=0xc00903e8f0 pc=0xa28945
github.com/filecoin-project/storetheindex/internal/ingest.(*Ingester).indexAdMultihashes(0xc00052c3c0, {{0x1a54eb0, 0xc0009c1060}, {0xc013080f00, 0x3b}, {0xc0009c1240, 0x1, 0x1}, {0xc009343400, 0x272, ...}, ...}, ...)
	/storetheindex/internal/ingest/linksystem.go:534 +0x705 fp=0xc00903ed18 sp=0xc00903eae8 pc=0xde7765
github.com/filecoin-project/storetheindex/internal/ingest.(*Ingester).ingestEntryChunk(0xc00052c3c0, {0x1a55828?, 0xc0084d8660?}, {{0x1a54eb0, 0xc0009c1060}, {0xc013080f00, 0x3b}, {0xc0009c1240, 0x1, 0x1}, ...}, ...)
	/storetheindex/internal/ingest/linksystem.go:475 +0x13e fp=0xc00903ee50 sp=0xc00903ed18 pc=0xde6d7e

track bitsize used in storethehash setup

Currently a bitsize option is allowed, but if the index is subsequently opened with a different bitsize set, bad things will happen.
The bitsize in use should be persisted somewhere alongside the storethehash datastore.
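One possible approach, sketched under the assumption that a small marker file in the store directory is acceptable. The file name `bitsize`, `checkBitSize`, and `ErrBitSizeMismatch` are hypothetical; storethehash itself may prefer to keep this in an existing header.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// bitSizeFile is a hypothetical file name used only in this sketch.
const bitSizeFile = "bitsize"

// ErrBitSizeMismatch is returned when the requested bit size differs from
// the one recorded when the store was first created.
var ErrBitSizeMismatch = errors.New("index bit size does not match the stored value")

// checkBitSize records bitSize on first open and returns an error on later
// opens that request a different value.
func checkBitSize(dir string, bitSize uint8) error {
	path := filepath.Join(dir, bitSizeFile)
	data, err := os.ReadFile(path)
	if errors.Is(err, os.ErrNotExist) {
		return os.WriteFile(path, []byte(strconv.Itoa(int(bitSize))), 0o644)
	}
	if err != nil {
		return err
	}
	stored, err := strconv.ParseUint(strings.TrimSpace(string(data)), 10, 8)
	if err != nil {
		return fmt.Errorf("corrupt bit size file: %w", err)
	}
	if uint8(stored) != bitSize {
		return fmt.Errorf("%w: stored %d, requested %d", ErrBitSizeMismatch, stored, bitSize)
	}
	return nil
}

// openTwice is a demo helper: it records bit size `first` in a fresh
// temporary directory, then re-checks the directory with `second`.
func openTwice(first, second uint8) error {
	dir, err := os.MkdirTemp("", "sth-bitsize")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)
	if err := checkBitSize(dir, first); err != nil {
		return err
	}
	return checkBitSize(dir, second)
}

func main() {
	fmt.Println(openTwice(24, 24)) // same bit size on reopen
	fmt.Println(openTwice(24, 16)) // different bit size on reopen
}
```

Failing fast at open time seems preferable to silently reading an index with the wrong bit size.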

Add dealID (or alike) to Value

Per ipni/index-provider#15, we'll need to update indexer.Value to include a contextual ID (like a dealID) to identify updates over metadata for the same CID.

This will also mean slightly changing how the indexer-core behaves on Put. A Put for the same dealID shouldn't append a new entry but update the existing one.

// Value is the value of an index entry that is stored for each CID in the indexer.
type Value struct {
	// DealID is a contextual ID used to identify different entries.
	DealID cid.Cid
	// ProviderID is the peer ID of the provider of the CID.
	ProviderID peer.ID
	// Metadata is serialized data that provides information about retrieving
	// data, for the indexed CID, from the identified provider.
	Metadata []byte
}

// cc @gammazero
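A minimal sketch of the update-instead-of-append behaviour described above. It uses plain string fields so it has no external dependencies, and it assumes an entry is identified by the provider ID together with the contextual ID; whether the provider should be part of the match is an open question here. `upsert` is a hypothetical helper.

```go
package main

import "fmt"

// Value mirrors the struct proposed above with dependency-free field types.
// ContextID stands in for DealID.
type Value struct {
	ContextID  string
	ProviderID string
	Metadata   []byte
}

// upsert adds v to values. If an entry with the same ProviderID and
// ContextID already exists, its metadata is replaced instead of appending.
func upsert(values []Value, v Value) []Value {
	for i := range values {
		if values[i].ProviderID == v.ProviderID && values[i].ContextID == v.ContextID {
			values[i].Metadata = v.Metadata
			return values
		}
	}
	return append(values, v)
}

func main() {
	var vals []Value
	vals = upsert(vals, Value{ContextID: "deal-1", ProviderID: "p1", Metadata: []byte("v1")})
	vals = upsert(vals, Value{ContextID: "deal-1", ProviderID: "p1", Metadata: []byte("v2")})
	vals = upsert(vals, Value{ContextID: "deal-2", ProviderID: "p1", Metadata: []byte("v1")})
	for _, v := range vals {
		fmt.Println(v.ContextID, v.ProviderID, string(v.Metadata))
	}
}
```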

RemoveProvider implementation in persistence

If we need to remove all data for a provider in the persistence layer there's no implementation for it yet.

https://github.com/filecoin-project/go-indexer-core/blob/913aef796c01b8da2188d0ad18b83ed279deb412/store/storethehash/storethehash.go#L186

It may require an offline process that iterates through memory checking all CIDs that have metadata for that provider. Maybe we can also design a new index that keeps track of providers and CIDs to optimize this process. It may add significant storage overhead, so we'll need to benchmark it.
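To make the secondary-index idea concrete, here is an in-memory sketch. The `store` type and its methods are hypothetical and ignore persistence entirely; the point is only that a provider-to-multihash index lets RemoveProvider skip the full scan, at the cost of storing every multihash a second time.

```go
package main

import "fmt"

// store is a hypothetical in-memory value store with a secondary index.
type store struct {
	// values maps multihash -> provider ID -> metadata.
	values map[string]map[string][]byte
	// byProvider maps provider ID -> set of multihashes it has values for.
	byProvider map[string]map[string]struct{}
}

func newStore() *store {
	return &store{
		values:     make(map[string]map[string][]byte),
		byProvider: make(map[string]map[string]struct{}),
	}
}

// Put stores metadata for (multihash, provider) and updates the index.
func (s *store) Put(mh, provider string, metadata []byte) {
	if s.values[mh] == nil {
		s.values[mh] = make(map[string][]byte)
	}
	s.values[mh][provider] = metadata
	if s.byProvider[provider] == nil {
		s.byProvider[provider] = make(map[string]struct{})
	}
	s.byProvider[provider][mh] = struct{}{}
}

// RemoveProvider deletes every value belonging to provider and returns the
// number of values removed. Only that provider's multihashes are visited.
func (s *store) RemoveProvider(provider string) int {
	removed := 0
	for mh := range s.byProvider[provider] {
		delete(s.values[mh], provider)
		if len(s.values[mh]) == 0 {
			delete(s.values, mh)
		}
		removed++
	}
	delete(s.byProvider, provider)
	return removed
}

func main() {
	s := newStore()
	s.Put("mh1", "provA", []byte("x"))
	s.Put("mh1", "provB", []byte("y"))
	s.Put("mh2", "provA", []byte("z"))
	fmt.Println(s.RemoveProvider("provA"), len(s.values))
}
```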

Add additional metrics to core engine

Add metrics to report:

Put latency normalised by number of multihashes (similar to the existing latency measure in Get) -- so that we know how much time was spent exclusively on core layer to put indices.
Average Put batch size of multihashes -- so that we can reason about optimisation of batching multihashes across multiple ads.
Average size of keys and values -- so that we can reason about key-value store optimisation parameters.
Average metadata size
Removal rate by context ID and by provider -- so that we can estimate garbage size.
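As an illustration of the first two metrics, the sketch below keeps running totals and derives the normalised latency and average batch size from them. `putStats` is a hypothetical type that does not use any metrics library; a real implementation would presumably record these through whatever metrics framework the engine already uses.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// putStats is a hypothetical accumulator for Put measurements.
type putStats struct {
	mu          sync.Mutex
	puts        int64
	multihashes int64
	elapsed     time.Duration
}

// Record adds one Put call that stored n multihashes in the given time.
func (s *putStats) Record(n int, elapsed time.Duration) {
	if n <= 0 {
		return
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.puts++
	s.multihashes += int64(n)
	s.elapsed += elapsed
}

// AvgBatchSize returns the mean number of multihashes per Put.
func (s *putStats) AvgBatchSize() float64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.puts == 0 {
		return 0
	}
	return float64(s.multihashes) / float64(s.puts)
}

// LatencyPerMultihash returns total Put time divided by multihashes stored.
func (s *putStats) LatencyPerMultihash() time.Duration {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.multihashes == 0 {
		return 0
	}
	return s.elapsed / time.Duration(s.multihashes)
}

func main() {
	var stats putStats
	start := time.Now()
	// A Put of three multihashes would run here.
	stats.Record(3, time.Since(start))
	stats.Record(5, 40*time.Millisecond)
	fmt.Println(stats.AvgBatchSize(), stats.LatencyPerMultihash())
}
```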

Populate README

Other teams are starting to use this code, so I think we should consider populating the README with a brief description of the project (bearing in mind that our interfaces may still be subject to minor changes).

Not all processed records stored in Pebble are found

After advertisements are ingested by an indexer instance backed by Pebble, not all records are found when multihashes are looked up via the finder API.

Default value codec to binary with fallback to JSON

Considering the efficiency gain of using binary value codec, make it the default choice and "do the right thing" in case the value codec is JSON.

The idea is to automatically detect the encoding and migrate values on the fly to binary format. That means we will avoid running a lengthy migration and would opportunistically move values while taking advantage of binary efficiency for any newly stored values.
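A sketch of the detect-and-migrate idea. Everything in it is an assumption made for illustration: the trimmed `Value` type, the marker byte that starts every binary value, the length-prefixed binary layout, and the rule that a leading `{` means JSON. The actual codec formats in go-indexer-core may differ.

```go
package main

import (
	"encoding/binary"
	"encoding/json"
	"errors"
	"fmt"
)

// Value is a trimmed, hypothetical stand-in for the stored value.
type Value struct {
	ProviderID string `json:"p"`
	Metadata   []byte `json:"m"`
}

// binaryMarker is an assumed first byte for binary values, chosen so that
// it can never be confused with the '{' that starts a JSON object.
const binaryMarker = 0x01

// marshalBinary encodes v as marker, then two length-prefixed fields.
func marshalBinary(v Value) []byte {
	buf := []byte{binaryMarker}
	buf = binary.AppendUvarint(buf, uint64(len(v.ProviderID)))
	buf = append(buf, v.ProviderID...)
	buf = binary.AppendUvarint(buf, uint64(len(v.Metadata)))
	buf = append(buf, v.Metadata...)
	return buf
}

// readChunk reads one uvarint length prefix and that many bytes.
func readChunk(b []byte) (chunk, rest []byte, err error) {
	n, size := binary.Uvarint(b)
	if size <= 0 || uint64(len(b)-size) < n {
		return nil, nil, errors.New("truncated binary value")
	}
	return b[size : size+int(n)], b[size+int(n):], nil
}

// unmarshal decodes either format and reports whether the data was JSON.
func unmarshal(data []byte) (v Value, wasJSON bool, err error) {
	if len(data) == 0 {
		return v, false, errors.New("empty value")
	}
	if data[0] == '{' {
		err = json.Unmarshal(data, &v)
		return v, true, err
	}
	if data[0] != binaryMarker {
		return v, false, errors.New("unknown value encoding")
	}
	pid, rest, err := readChunk(data[1:])
	if err != nil {
		return v, false, err
	}
	md, _, err := readChunk(rest)
	if err != nil {
		return v, false, err
	}
	return Value{ProviderID: string(pid), Metadata: append([]byte(nil), md...)}, false, nil
}

// getAndMigrate reads a value and, if it was JSON, rewrites it as binary.
func getAndMigrate(store map[string][]byte, key string) (Value, error) {
	data, ok := store[key]
	if !ok {
		return Value{}, errors.New("not found")
	}
	v, wasJSON, err := unmarshal(data)
	if err != nil {
		return Value{}, err
	}
	if wasJSON {
		store[key] = marshalBinary(v)
	}
	return v, nil
}

func main() {
	legacy, _ := json.Marshal(Value{ProviderID: "prov", Metadata: []byte{1, 2}})
	store := map[string][]byte{"mh": legacy}
	v, err := getAndMigrate(store, "mh")
	fmt.Println(v.ProviderID, err, store["mh"][0] == binaryMarker)
}
```

Migrating on read means values that are never read stay in JSON, so the JSON decoding path would need to remain in place indefinitely.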

manual release created (v0.7.1)

@ischasny just pushed a release tag: v0.7.1.
Please manually verify validity (using `gorelease`), and update `version.json` to reflect the manually released version, if necessary.
In the future, please use the automated process.

Define a dedicated type for metadata that has receivers for marshalling/unmarshalling

I think we should define a dedicated type for metadata in go-indexer-core that has receivers for marshalling/unmarshalling. This is a type used in both indexer node and provider and seems like it belongs to core.

The functionality is already implemented as part of indexer.Value. It just needs to be refactored into its own type.

It would also make it easier to write go-doc for metadata and spell out its dependency on the transport protocol.

@gammazero what are your thoughts? I am happy to pick this up if the suggestion makes sense.

Originally posted by @masih in ipni/index-provider#48 (comment)
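A sketch of what such a type could look like, assuming metadata is a protocol ID followed by opaque protocol-specific bytes and that the ID is encoded as an unsigned varint. The field names and wire layout are assumptions for illustration, not the existing indexer.Value encoding.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// Metadata is a hypothetical dedicated type: a transport protocol ID plus
// protocol-specific data.
type Metadata struct {
	ProtocolID uint64
	Data       []byte
}

// MarshalBinary encodes the protocol ID as a uvarint followed by the data.
// It satisfies encoding.BinaryMarshaler.
func (m Metadata) MarshalBinary() ([]byte, error) {
	buf := make([]byte, 0, binary.MaxVarintLen64+len(m.Data))
	buf = binary.AppendUvarint(buf, m.ProtocolID)
	return append(buf, m.Data...), nil
}

// UnmarshalBinary decodes data produced by MarshalBinary.
// It satisfies encoding.BinaryUnmarshaler.
func (m *Metadata) UnmarshalBinary(b []byte) error {
	id, n := binary.Uvarint(b)
	if n <= 0 {
		return errors.New("metadata: invalid protocol ID varint")
	}
	m.ProtocolID = id
	// Copy so the Metadata does not alias the caller's buffer.
	m.Data = append([]byte(nil), b[n:]...)
	return nil
}

func main() {
	in := Metadata{ProtocolID: 300, Data: []byte("retrieval-info")}
	b, _ := in.MarshalBinary()
	var out Metadata
	err := out.UnmarshalBinary(b)
	fmt.Println(out.ProtocolID, string(out.Data), err)
}
```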

Implement an iterator over the Index

@adlrocha confirmed offline that both the backends support iteration. We should expose and implement an iteration API so users can use it for introspection/debugging.
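One possible shape for the iteration API, sketched with an in-memory backend. The `Iterator` interface and `memIterator` are hypothetical; the sketch assumes `Next` signals the end of iteration with `io.EOF`, and a real backend would stream from its own store rather than snapshotting keys up front.

```go
package main

import (
	"fmt"
	"io"
	"sort"
)

// Iterator is a hypothetical iteration API over the index.
type Iterator interface {
	// Next returns the next multihash and its values, or io.EOF when done.
	Next() (multihash string, values []string, err error)
}

// memIterator iterates over an in-memory map in sorted key order.
type memIterator struct {
	data map[string][]string
	keys []string
	pos  int
}

func newMemIterator(data map[string][]string) *memIterator {
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return &memIterator{data: data, keys: keys}
}

func (it *memIterator) Next() (string, []string, error) {
	if it.pos >= len(it.keys) {
		return "", nil, io.EOF
	}
	k := it.keys[it.pos]
	it.pos++
	return k, it.data[k], nil
}

func main() {
	var it Iterator = newMemIterator(map[string][]string{
		"mh2": {"provB"},
		"mh1": {"provA", "provB"},
	})
	for {
		mh, vals, err := it.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		fmt.Println(mh, vals)
	}
}
```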