
go-indexer-core's People

Contributors

adlrocha, dependabot[bot], gammazero, ischasny, masih, web3-bot

go-indexer-core's Issues

Add CBOR value codec

Add a CBOR value codec and benchmarks that compare it with the existing JSON and custom binary formats. Set CBOR as the default if its performance is comparable with the custom binary format, in favour of using more common data formats.

storethehash is taking a long time to initialize

With a 179G value store on disk it is taking 12 minutes to initialize storethehash

2022-03-17T18:36:40.832Z        INFO    indexer command/daemon.go:95    Valuestore initializing/opening {"type": "sth", "path": "/data/valuestore-sth"}
2022-03-17T18:48:15.657Z        INFO    indexer command/daemon.go:100   Valuestore initialized

See: ipld/go-storethehash#26

Use multihash instead of CID

Since only the multihash portion of a CID is used to index content, this should be made clear by replacing CID with multihash in all function signatures.

Support parallel read of current MH values during put

Currently during a put of a batch of multihashes, the core reads the current value of each hash sequentially as part of updating those values.

To support large batches from a single provider, we should support sharding these reads across multiple threads so that we can parallelize waiting on the open/read syscalls for these reads.
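
A minimal sketch of sharding the reads across goroutines is shown here. The `getFunc` type, the `readResult` struct, and the `parallelGet` helper are hypothetical names used only for illustration, and multihashes are simplified to strings; this is not the engine's actual API.

```go
package main

import (
	"fmt"
	"sync"
)

// getFunc is a hypothetical stand-in for the value store read of one multihash.
type getFunc func(mh string) ([]string, error)

// readResult holds the outcome of reading the current values for one multihash.
type readResult struct {
	values []string
	err    error
}

// parallelGet reads the current values of every multihash in mhs using up to
// `workers` goroutines. Results are returned in the same order as mhs.
func parallelGet(mhs []string, workers int, get getFunc) []readResult {
	results := make([]readResult, len(mhs))
	if workers < 1 {
		workers = 1
	}
	idx := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each index is handed to exactly one worker, so writes to
			// results never touch the same element twice.
			for i := range idx {
				vals, err := get(mhs[i])
				results[i] = readResult{values: vals, err: err}
			}
		}()
	}
	for i := range mhs {
		idx <- i
	}
	close(idx)
	wg.Wait()
	return results
}

func main() {
	store := map[string][]string{"mh1": {"providerA"}, "mh2": {"providerB"}}
	get := func(mh string) ([]string, error) { return store[mh], nil }
	for i, r := range parallelGet([]string{"mh1", "mh2", "mh3"}, 2, get) {
		fmt.Println(i, r.values, r.err)
	}
}
```

The worker count bounds how many reads wait on syscalls at once; the right value would need to be benchmarked against the actual store.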

dhstore should be another value store implementation, not an engine option

Creating the indexer core engine takes a valueStore interface to use for storing unencrypted index data. Options can also be specified to give a dhstore to use for storing encrypted index data. I do not think the same core should be doing both. Instead, when creating a core engine instance it should only be given a valueStore interface. That interface can be either an unencrypted or an encrypted implementation.

If we really want to support both, then the indexer should use multiple cores, but there should be no reason to, since we do not need to look up unencrypted results and then store them in an encrypted DB. We only needed that during our transition from one to the other.
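
The proposed shape could look roughly like the sketch below. The `ValueStore` interface here is a simplified, hypothetical one, and `hashedStore` is only a stand-in for an encrypted implementation: it hashes keys with SHA-256 and does not encrypt values or talk to a real dhstore. The point is only that the engine depends on a single interface.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// ValueStore is a hypothetical, simplified store interface for illustration.
type ValueStore interface {
	Put(key string, value []byte) error
	Get(key string) ([]byte, bool, error)
}

// plainStore keeps keys and values exactly as given.
type plainStore struct{ m map[string][]byte }

func newPlainStore() *plainStore { return &plainStore{m: make(map[string][]byte)} }

func (s *plainStore) Put(key string, value []byte) error {
	s.m[key] = value
	return nil
}

func (s *plainStore) Get(key string) ([]byte, bool, error) {
	v, ok := s.m[key]
	return v, ok, nil
}

// hashedStore stands in for an encrypted store: it only hashes the key before
// delegating to another ValueStore. It does not encrypt values.
type hashedStore struct{ inner ValueStore }

func hashKey(key string) string {
	sum := sha256.Sum256([]byte(key))
	return hex.EncodeToString(sum[:])
}

func (s *hashedStore) Put(key string, value []byte) error {
	return s.inner.Put(hashKey(key), value)
}

func (s *hashedStore) Get(key string) ([]byte, bool, error) {
	return s.inner.Get(hashKey(key))
}

// Engine depends only on ValueStore and has no separate dhstore option.
type Engine struct{ store ValueStore }

func NewEngine(vs ValueStore) *Engine { return &Engine{store: vs} }

func (e *Engine) Put(key string, value []byte) error { return e.store.Put(key, value) }

func (e *Engine) Get(key string) ([]byte, bool, error) { return e.store.Get(key) }

func main() {
	for _, vs := range []ValueStore{newPlainStore(), &hashedStore{inner: newPlainStore()}} {
		e := NewEngine(vs)
		if err := e.Put("mh1", []byte("provider-info")); err != nil {
			panic(err)
		}
		v, found, _ := e.Get("mh1")
		fmt.Printf("%T: found=%v value=%s\n", vs, found, v)
	}
}
```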

Engine panics with `segmentation violation code=0x1`

After upgrading the x/sys dependency, the panic still occurs on put when values retrieved from the cache are checked for equality.

[signal SIGSEGV: segmentation violation code=0x1 addr=0x7f7f772e6999 pc=0x410af1]

goroutine 275 [running]:
runtime.throw({0x15d88b4?, 0xc00536a000?})
	/usr/local/go/src/runtime/panic.go:992 +0x71 fp=0xc00903e898 sp=0xc00903e868 pc=0x4473b1
runtime.sigpanic()
	/usr/local/go/src/runtime/signal_unix.go:825 +0x305 fp=0xc00903e8e8 sp=0xc00903e898 pc=0x45d785
memeqbody()
	/usr/local/go/src/internal/bytealg/equal_amd64.s:108 +0xd1 fp=0xc00903e8f0 sp=0xc00903e8e8 pc=0x410af1
bytes.Equal(...)
	/usr/local/go/src/bytes/bytes.go:20
github.com/filecoin-project/go-indexer-core.Value.Match(...)
	/go/pkg/mod/github.com/filecoin-project/[email protected]/value.go:58
github.com/filecoin-project/go-indexer-core.Value.Equal(...)
	/go/pkg/mod/github.com/filecoin-project/[email protected]/value.go:63
github.com/filecoin-project/go-indexer-core/engine.(*Engine).Put(0xc002dc0b00, {{0xc0103faed0, 0x22}, {0xc013080d40, 0x3d, 0x3f}, {0xc002fe2db8, 0x2, 0x2}}, {0xc00a1f4000, ...})
	/go/pkg/mod/github.com/filecoin-project/[email protected]/engine/engine.go:93 +0x645 fp=0xc00903eae8 sp=0xc00903e8f0 pc=0xa28945
github.com/filecoin-project/storetheindex/internal/ingest.(*Ingester).indexAdMultihashes(0xc00052c3c0, {{0x1a54eb0, 0xc0009c1060}, {0xc013080f00, 0x3b}, {0xc0009c1240, 0x1, 0x1}, {0xc009343400, 0x272, ...}, ...}, ...)
	/storetheindex/internal/ingest/linksystem.go:534 +0x705 fp=0xc00903ed18 sp=0xc00903eae8 pc=0xde7765
github.com/filecoin-project/storetheindex/internal/ingest.(*Ingester).ingestEntryChunk(0xc00052c3c0, {0x1a55828?, 0xc0084d8660?}, {{0x1a54eb0, 0xc0009c1060}, {0xc013080f00, 0x3b}, {0xc0009c1240, 0x1, 0x1}, ...}, ...)
	/storetheindex/internal/ingest/linksystem.go:475 +0x13e fp=0xc00903ee50 sp=0xc00903ed18 pc=0xde6d7e

track bitsize used in storethehash setup

Currently a bitsize option is allowed, but if the index is subsequently opened with a different bitsize set, bad things will happen.
The bitsize used should be persisted somewhere with a storethehash datastore.
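
One way to persist and check the bitsize is sketched below, assuming a small marker file kept in the store directory. The file name `bitsize`, the `checkBitSize` helper, and the `uint8` type are assumptions for illustration; storethehash may choose to record this elsewhere, such as in an index header.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// bitSizeFile is a hypothetical marker file stored next to the index data.
const bitSizeFile = "bitsize"

// checkBitSize records the bit size on first open and rejects later opens
// that request a different bit size.
func checkBitSize(dir string, want uint8) error {
	path := filepath.Join(dir, bitSizeFile)
	data, err := os.ReadFile(path)
	if errors.Is(err, os.ErrNotExist) {
		// First open: remember the bit size the store is created with.
		return os.WriteFile(path, []byte(strconv.Itoa(int(want))), 0o644)
	}
	if err != nil {
		return err
	}
	got, err := strconv.ParseUint(strings.TrimSpace(string(data)), 10, 8)
	if err != nil {
		return fmt.Errorf("corrupt bit size file %s: %w", path, err)
	}
	if uint8(got) != want {
		return fmt.Errorf("store was created with bit size %d, cannot open with %d", got, want)
	}
	return nil
}

// tempStoreDir creates a scratch directory for the demo and returns a cleanup function.
func tempStoreDir() (string, func()) {
	dir, err := os.MkdirTemp("", "sth-bitsize")
	if err != nil {
		panic(err)
	}
	return dir, func() { os.RemoveAll(dir) }
}

func main() {
	dir, cleanup := tempStoreDir()
	defer cleanup()
	fmt.Println("first open:", checkBitSize(dir, 24))
	fmt.Println("same bit size:", checkBitSize(dir, 24))
	fmt.Println("different bit size:", checkBitSize(dir, 28))
}
```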

Add dealID (or alike) to Value

Per ipni/index-provider#15, we'll need to update indexer.Value to include a contextual ID (like a dealID) to identify updates over metadata for the same CID.

This will also mean slightly changing how the indexer-core behaves on Put. A Put for the same dealID shouldn't append a new entry but update the existing one.

// Value is the value of an index entry that is stored for each CID in the indexer.
type Value struct {
	// DealID is a contextual ID used to identify different entries.
	DealID cid.Cid
	// ProviderID is the peer ID of the provider of the CID.
	ProviderID peer.ID
	// Metadata is serialized data that provides information about retrieving
	// data, for the indexed CID, from the identified provider.
	Metadata []byte
}

// cc @gammazero
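
The change in Put behaviour described above might look like the following sketch. To keep it self-contained, the IDs are plain strings rather than `cid.Cid` and `peer.ID`, and the `put` helper is hypothetical. Matching on both provider and context ID is an assumption; the issue only says that a Put for the same dealID should update the existing entry.

```go
package main

import (
	"bytes"
	"fmt"
)

// Value mirrors the proposed struct, with string stand-ins for cid.Cid and peer.ID.
type Value struct {
	ContextID  string
	ProviderID string
	Metadata   []byte
}

// put updates the entry that has the same provider and context ID, or appends
// a new entry if none exists. It reports whether the stored values changed.
func put(existing []Value, v Value) ([]Value, bool) {
	for i := range existing {
		if existing[i].ProviderID == v.ProviderID && existing[i].ContextID == v.ContextID {
			if bytes.Equal(existing[i].Metadata, v.Metadata) {
				return existing, false // identical entry, nothing to do
			}
			existing[i].Metadata = v.Metadata // update in place instead of appending
			return existing, true
		}
	}
	return append(existing, v), true
}

func main() {
	var vals []Value
	vals, _ = put(vals, Value{ContextID: "deal-1", ProviderID: "p1", Metadata: []byte("v1")})
	vals, _ = put(vals, Value{ContextID: "deal-1", ProviderID: "p1", Metadata: []byte("v2")})
	vals, _ = put(vals, Value{ContextID: "deal-2", ProviderID: "p1", Metadata: []byte("v1")})
	for _, v := range vals {
		fmt.Printf("%s %s %s\n", v.ProviderID, v.ContextID, v.Metadata)
	}
}
```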

RemoveProvider implementation in persistence

If we need to remove all data for a provider in the persistence layer there's no implementation for it yet.

https://github.com/filecoin-project/go-indexer-core/blob/913aef796c01b8da2188d0ad18b83ed279deb412/store/storethehash/storethehash.go#L186

It may require an offline process that iterates through the stored index, checking all CIDs that have metadata for that provider. Maybe we can also design a new index that keeps track of providers and CIDs to optimize this process (it may add significant storage overhead, so we'll need to benchmark it).
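
As a rough illustration of the full-scan approach, the sketch below walks every key and drops the values that belong to one provider. The in-memory map, the simplified `Value` type, and the `removeProvider` helper are stand-ins for illustration only; a real implementation would iterate the persistent store.

```go
package main

import "fmt"

// Value is a simplified index value with a string provider ID.
type Value struct {
	ProviderID string
	Metadata   []byte
}

// removeProvider scans every key in the index and removes all values for the
// given provider. Keys left without values are deleted. It returns the number
// of values removed.
func removeProvider(index map[string][]Value, providerID string) int {
	removed := 0
	for key, values := range index {
		kept := values[:0] // reuse the backing array while filtering
		for _, v := range values {
			if v.ProviderID == providerID {
				removed++
				continue
			}
			kept = append(kept, v)
		}
		if len(kept) == 0 {
			delete(index, key)
		} else {
			index[key] = kept
		}
	}
	return removed
}

func main() {
	index := map[string][]Value{
		"cid1": {{ProviderID: "p1"}, {ProviderID: "p2"}},
		"cid2": {{ProviderID: "p1"}},
	}
	fmt.Println("removed:", removeProvider(index, "p1"))
	fmt.Println("remaining keys:", len(index))
}
```

The cost of this scan grows with the whole index rather than with the provider's share of it, which is the motivation for the provider-to-CID index mentioned above.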

Add additional metrics to core engine

Add metrics to report:

  • Put latency normalised by number of multihashes (similar to the existing latency measure in Get) -- so that we know how much time was spent exclusively in the core layer to put indices.
  • Average Put batch size of multihashes -- so that we can reason about optimisation of batching multihashes across multiple ads.
  • Average size of keys and values -- so that we can reason about key-value store optimisation parameters.
  • Average metadata size
  • Removal rate by context ID and by provider -- so that we can estimate garbage size.
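
The first two metrics could be derived from a few counters, as in the sketch below. The `putStats` type and the `timedPut` wrapper are hypothetical and do not use any particular metrics library; the sketch is also not safe for concurrent use without a lock.

```go
package main

import (
	"fmt"
	"time"
)

// putStats accumulates the counters needed for the Put metrics.
type putStats struct {
	puts        int           // number of Put calls
	multihashes int           // total multihashes across all Put calls
	total       time.Duration // total time spent in Put
}

// record adds one Put of n multihashes that took elapsed.
func (s *putStats) record(n int, elapsed time.Duration) {
	if n == 0 {
		return
	}
	s.puts++
	s.multihashes += n
	s.total += elapsed
}

// latencyPerMultihash is the Put latency normalised by number of multihashes.
func (s *putStats) latencyPerMultihash() time.Duration {
	if s.multihashes == 0 {
		return 0
	}
	return s.total / time.Duration(s.multihashes)
}

// avgBatchSize is the average number of multihashes per Put.
func (s *putStats) avgBatchSize() float64 {
	if s.puts == 0 {
		return 0
	}
	return float64(s.multihashes) / float64(s.puts)
}

// timedPut wraps a put function and records its latency and batch size.
func timedPut(s *putStats, mhs []string, put func([]string) error) error {
	start := time.Now()
	err := put(mhs)
	s.record(len(mhs), time.Since(start))
	return err
}

func main() {
	stats := &putStats{}
	put := func(mhs []string) error {
		time.Sleep(time.Duration(len(mhs)) * time.Millisecond) // simulated store work
		return nil
	}
	_ = timedPut(stats, []string{"mh1", "mh2", "mh3"}, put)
	_ = timedPut(stats, []string{"mh4"}, put)
	fmt.Println("avg batch size:", stats.avgBatchSize())
	fmt.Println("latency per multihash:", stats.latencyPerMultihash())
}
```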

Populate README

Other teams are starting to use this code, so I think we should start populating the README with a brief description of the project (bearing in mind that our interfaces may still be subject to minor changes).

Default value codec to binary with fallback to JSON

Considering the efficiency gain of using the binary value codec, make it the default choice and "do the right thing" in case the value codec is JSON.

The idea is to automatically detect the encoding and migrate values on the fly to binary format. That way we avoid running a lengthy migration and opportunistically migrate existing values, while taking advantage of binary efficiency for any newly stored values.
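
A minimal sketch of detect-and-migrate-on-read follows. The binary layout here is invented for illustration: a tag byte `0x01` (which can never begin a JSON document), then a uvarint length, the provider ID, and the metadata. The JSON field names, the `binaryTag` constant, and the `getAndMigrate` helper are likewise assumptions and do not describe the codecs in this repository.

```go
package main

import (
	"encoding/binary"
	"encoding/json"
	"errors"
	"fmt"
)

// Value is a simplified index value; the JSON field names are hypothetical.
type Value struct {
	ProviderID string `json:"p"`
	Metadata   []byte `json:"m"`
}

// binaryTag marks values in the hypothetical binary format.
const binaryTag = 0x01

// marshalBinary encodes v as: tag | uvarint(len(ProviderID)) | ProviderID | Metadata.
func marshalBinary(v Value) []byte {
	buf := make([]byte, 1+binary.MaxVarintLen64)
	buf[0] = binaryTag
	n := binary.PutUvarint(buf[1:], uint64(len(v.ProviderID)))
	buf = append(buf[:1+n], v.ProviderID...)
	return append(buf, v.Metadata...)
}

// unmarshalBinary decodes the payload that follows the tag byte.
func unmarshalBinary(b []byte) (Value, error) {
	n, size := binary.Uvarint(b)
	if size <= 0 || uint64(len(b)-size) < n {
		return Value{}, errors.New("malformed binary value")
	}
	b = b[size:]
	return Value{ProviderID: string(b[:n]), Metadata: append([]byte(nil), b[n:]...)}, nil
}

// unmarshal detects the encoding and reports whether the value was legacy JSON.
func unmarshal(b []byte) (v Value, legacy bool, err error) {
	if len(b) > 0 && b[0] == binaryTag {
		v, err = unmarshalBinary(b[1:])
		return v, false, err
	}
	err = json.Unmarshal(b, &v)
	return v, true, err
}

// getAndMigrate reads a value and rewrites it in binary if it was stored as JSON.
func getAndMigrate(store map[string][]byte, key string) (Value, error) {
	raw, ok := store[key]
	if !ok {
		return Value{}, errors.New("not found")
	}
	v, legacy, err := unmarshal(raw)
	if err != nil {
		return Value{}, err
	}
	if legacy {
		store[key] = marshalBinary(v) // opportunistic migration
	}
	return v, nil
}

func main() {
	old, _ := json.Marshal(Value{ProviderID: "prov", Metadata: []byte("hi")})
	store := map[string][]byte{"mh1": old}
	v, err := getAndMigrate(store, "mh1")
	fmt.Println(v.ProviderID, string(v.Metadata), err)
	fmt.Println("now binary:", store["mh1"][0] == binaryTag)
}
```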

Define a dedicated type for metadata that has receivers for marshalling/unmarshalling

I think we should define a dedicated type for metadata in go-indexer-core that has receivers for marshalling/unmarshalling. This is a type used in both indexer node and provider and seems like it belongs to core.

The functionality is already implemented as part of indexer.Value. It just needs to be refactored into its own type.

It would also make it easier to write go-doc for metadata and spell out its dependency on the transport protocol.
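
Such a type might look like the sketch below, which implements the standard `encoding.BinaryMarshaler` and `encoding.BinaryUnmarshaler` interfaces. The field names and the wire layout (a uvarint protocol ID followed by the raw data) are assumptions for illustration, not the format used by indexer.Value.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

// Metadata is a hypothetical dedicated metadata type. ProtocolID identifies
// the transport protocol that Data is meant for.
type Metadata struct {
	ProtocolID uint64
	Data       []byte
}

// MarshalBinary encodes the metadata as uvarint(ProtocolID) followed by Data.
func (m Metadata) MarshalBinary() ([]byte, error) {
	buf := make([]byte, binary.MaxVarintLen64)
	n := binary.PutUvarint(buf, m.ProtocolID)
	return append(buf[:n], m.Data...), nil
}

// UnmarshalBinary decodes data produced by MarshalBinary.
func (m *Metadata) UnmarshalBinary(b []byte) error {
	id, size := binary.Uvarint(b)
	if size <= 0 {
		return errors.New("metadata: cannot read protocol ID")
	}
	m.ProtocolID = id
	m.Data = append([]byte(nil), b[size:]...) // copy so the input can be reused
	return nil
}

// Equal reports whether two metadata values are identical.
func (m Metadata) Equal(other Metadata) bool {
	return m.ProtocolID == other.ProtocolID && bytes.Equal(m.Data, other.Data)
}

func main() {
	in := Metadata{ProtocolID: 42, Data: []byte("retrieval-info")}
	enc, err := in.MarshalBinary()
	if err != nil {
		panic(err)
	}
	var out Metadata
	if err := out.UnmarshalBinary(enc); err != nil {
		panic(err)
	}
	fmt.Println(out.ProtocolID, string(out.Data), in.Equal(out))
}
```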

@gammazero what are your thoughts? I am happy to pick this up if the suggestion makes sense.

Originally posted by @masih in ipni/index-provider#48 (comment)
