
singularity's Introduction

Singularity


The new pure-Go implementation of Singularity provides everything you need to onboard your own data, or your clients' data, to the Filecoin network.

Documentation

Read the Doc

Related projects

License

Dual-licensed under MIT + Apache 2.0

singularity's People

Contributors

xinaxu, hannahhoward, masih, gammazero, elijaharita, rvagg, web3-bot, ianconsolata, jcace, criadoperez, parkan, zorlin, peeja, willscott


singularity's Issues

For Retrieval: How to link back to deals from a DataSetID / Source?

I am looking at the data model, and it's not clear for the retrieval case how source/directory/item/itempart/chunk/car etc get linked to deals that are made.

A CAR model is passed to the MakeDeal command, but there's no link stored to the CAR in the model for Deal.

The reason this matters is I assume that for retrieval, we're eventually going to have to have essentially an "Unpack" process where the requested retrieval is a DataSetID, Source, Directory, or Item.

Am I missing something about how you imagine the data model to work?

Metadata API

Description

The Metadata API describes how a piece can be assembled from the original data sources, so that a retriever can reconstruct the CAR file using this metadata, as well as fetch the original content from the data source.

Acceptance Criteria

Follow the test flow from inline preparation (#3).
Run the API using

./singularity run api

Metadata is offered in two ways. One comes from the content provider:

./singularity run content-provider
curl http://127.0.0.1:7777/piece/metadata/<piece_cid>

The other comes from the API:

./singularity run api
curl http://127.0.0.1:9090/api/piece/<piece_cid>/metadata

Check that the result is valid JSON and that its content makes sense.
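
A quick way to run this check, assuming both services are running locally on the ports shown above (jq exits with an error on invalid JSON):

./singularity run content-provider &
curl -s http://127.0.0.1:7777/piece/metadata/<piece_cid> | jq .

./singularity run api &
curl -s http://127.0.0.1:9090/api/piece/<piece_cid>/metadata | jq .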

Content Provider / Bitswap

Description

Make Singularity act as booster-bitswap while offloading data to the storage provider.

The database backed block store can be used to serve blocks.
If the data source has been exported to CAR files, those blocks will be served directly from the CAR.
Otherwise, block retrieval is translated into HTTP range requests against the original data source, which may be inefficient.

Data source / handling of big files

Currently, big files are handled as follows:

  • The file is first split into multiple 1GiB chunks, which fill up sectors until they are full
  • The protobuf blocks used to stitch the chunks back together are stored in the final DAG CAR file

This creates an issue: if the user chooses not to seal the DAG, then we are only storing each chunk of the file, rather than the whole file.

To solve this issue, all of those blocks should be stored within the same CAR file that stores the actual file chunks. Since we want to scale dataset preparation, the worker that prepares the last chunk should pick up those blocks.

A drawback is that the prepared CAR files may end up with different piece CIDs when multiple dataset worker threads are used.

Address Switching

Description

If one dataset is associated with multiple addresses, Singularity should always use an address with enough datacap.

Acceptance Criteria

For non-verified deals, Singularity chooses randomly from the wallet addresses associated with the dataset.

For verified deals, Singularity chooses randomly from the addresses that have enough datacap. "Enough datacap" means the current remaining datacap on chain, minus all proposed but not yet published verified deals for that verified client. This can be verified by repeatedly sending verified deals to a client without publishing them: the address should no longer be used for deal making once the unpublished deals exceed the remaining datacap.

Dashboard / Deal view

A view to show how deals are made over time:

  • time series
  • breakdown by provider
  • breakdown by deal state

Util / Monitor

Description

Expose an API to emit metrics that can be consumed by Prometheus.
Metrics will include

  • Currently running dataset workers and their status, time to pack, number of items scanned and packed
  • Deal tracking worker and its status, number of deals updated by state
  • Deal making worker and its status, number of deals made, deal proposal latency
  • Content provider, with similar metrics to Boost, plus number of retrieval requests, latency, etc.
  • API and the usage of different APIs by type
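
A rough sketch of how these metrics could be scraped once exposed; the /metrics path and the metric name prefix below are hypothetical, since this issue only proposes the feature:

./singularity run api &
# hypothetical Prometheus-format endpoint; adjust the path once the feature lands
curl -s http://127.0.0.1:9090/metrics | grep '^singularity_'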

mock environment for testing

there should be mock services representing the behavior of a storage provider that can be used for testing singularity packages.

for many of the components, we should be able to abstract the deal making internals and filecoin chain interactions as that's specific to the deal making package code.

It is probably worth building out a sufficient mock environment that we can simulate:

  • the SP pulling a deal from singularity
  • checking the status / seeing an event indicating faults

Proper handling of item hash

Currently, we only get the hash of the file if the file is not local. In fact, we can use Fs.Features.SlowHash to decide whether a given fs backend supports fast hashing.
We also need to explore how we can utilize other features, e.g. SlowModTime, IsLocal, CanHaveEmptyDirectories.

WebDAV API

Description

Offer a WebDAV API so it can be integrated with frontend solutions such as filestash, nextcloud, or owncloud.
This will offer a Dropbox-like experience.

Acceptance Criteria

The user should be able to set up an integration between the WebDAV API and filestash, and all basic WebDAV operations should work:

  • list directory
  • rm directory
  • rm file
  • create file
  • update file
  • delete file

For deletion, a file that has already been sealed will not be removed from the SP; it will simply be marked as deleted in the Singularity database and removed from the folder structure DAG.

The user can choose any backend storage service supported by Singularity (local file system, Google Cloud, etc.).
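
A minimal sketch of exercising the basic operations above with curl's WebDAV verbs; the server address and port are placeholders, since this issue does not specify where the WebDAV API would be served:

# list directory
curl -X PROPFIND -H 'Depth: 1' http://127.0.0.1:<webdav_port>/<path>/
# create or update a file
curl -T ./example.txt http://127.0.0.1:<webdav_port>/<path>/example.txt
# create a directory
curl -X MKCOL http://127.0.0.1:<webdav_port>/<path>/new-folder/
# delete a file
curl -X DELETE http://127.0.0.1:<webdav_port>/<path>/example.txt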

Data source / HTTP site

Description

Allow the user to scan through a website hosted by Nginx and prepare all files hosted on the website.

Data source / Folder structure

Description

By default, the folder structure will not be exported to a CAR file because

  • Singularity does not know whether the user wants to store the folder structure, especially if encryption is enabled.
  • Singularity does not know when the final state has been reached, especially when data source rescan is enabled.

Singularity will offer a utility to trigger folder structure generation, which exports the folder structure to a CAR file.

Acceptance Criteria

Run singularity datasource daggen to trigger folder structure DAG generation, which will export to a new CAR file. The dataset worker will pick up and process the request.

  • [Once #15 is done] The user will be able to retrieve the whole directory or a sub directory of a dataset using the rootCID of the dataset and any subpath
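
For example, assuming the test dataset from #64 has already been prepared, the sequence might look like the following; the data source argument to daggen is a placeholder, since the exact arguments are not spelled out in this issue:

./singularity datasource daggen <source_id>
./singularity run dataset-worker --exit-on-complete --exit-on-error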

Basic test flow

This issue serves as a general instruction for how to perform testing with Singularity

Find edge cases

This instruction only serves as a base testing flow; it does not cover all edge cases. Please use your own judgment to identify edge cases as well as error cases.

Installation from source

git clone https://github.com/data-preservation-programs/singularity.git
cd singularity

For an already-cloned repo, always pull the latest master for testing

git pull

Build the software

make build

Web API test

For all CLI commands (except for run commands), there is a web API equivalent. Make sure you also test the Web API

./singularity run api

Then go to http://127.0.0.1:9090/swagger/index.html to try API requests

Reset the database (this drops all tables and recreates them)

./singularity admin reset

Create a test dataset

./singularity dataset create -o <car_output_path> test

Use local folder as data source

./singularity datasource add local test <folder_path>

Run the dataset worker to prepare the data

./singularity run dataset-worker --exit-on-complete --exit-on-error

Make deals to storage providers

TBD

Delete original file after exporting to CAR

Description

Per PikNik's request, delete the original files once they have been exported to CAR files.

Acceptance Criteria

Follow the general test flow guidance from #64.
When adding the data source, use "--delete-after-export" (a rough command sequence is sketched after the list below).

  • Verify that the data source file is deleted whenever it is exported to CAR file
  • Verify the exported CAR file is not corrupted and resolves to the same hash value as the non-deleted case
  • Verify with the local file system and a remote system such as S3
  • Use big files (>32GiB) which will only be deleted after all chunks are exported
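
A rough command sequence, reusing the commands from #64 with the flag added; comparing sha256sum output against a run without the flag is just one way to check the hash:

./singularity dataset create -o <car_output_path> test
./singularity datasource add local test <folder_path> --delete-after-export
./singularity run dataset-worker --exit-on-complete --exit-on-error
# files under <folder_path> should now be deleted; compare the CAR hashes with a non-deleted run
sha256sum <car_output_path>/*.car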

Encryption / File level encryption using Age

Description

Allow the user to encrypt the file using their provided key. This will be the default built-in solution for encryption.

Acceptance Criteria

https://protocol-labs-2.gitbook.io/singularity-1/topics/encryption

The user should be able to use age to generate a public/private key pair. The user can then supply the public key to Singularity to enable file encryption (a short age CLI sketch follows the list below).

  • Verify encryption with small files and large files (>32GiB). Because large file encryption is not parallelisable, we want to make sure the file is correctly encrypted.
  • Verify encryption with multiple recipients (public keys)
  • [Depending on #14] Download and decrypt the file, then verify it resolves to the same hash
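
A minimal sketch of the key handling with the age CLI; how the public key is supplied to Singularity is described in the gitbook page above and is not repeated here:

# generate a key pair; the public key (age1...) is printed and stored in key.txt
age-keygen -o key.txt
# after downloading the encrypted file (see #14), decrypt it with the private key
age --decrypt -i key.txt -o <original_file> <downloaded_encrypted_file>
# verify the decrypted file resolves to the same hash as the original
sha256sum <original_file>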

Deal Tracking

Description

Singularity should be able to track the status of deals proposed by the relevant client addresses with reasonable delay.

Acceptance Criteria

When running the Singularity deal tracking service:

singularity run deal-tracking

It should proactively track deal status

  • The delay should be ~1 hour, as the on-chain status is downloaded each hour
  • Should work with the default chain status source (GLIF)
  • Should work with a custom lotus API (by using --lotus-api and --lotus-token)
  • Deals can be listed using singularity deal list
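
For example, tracking against a custom lotus endpoint and then listing the tracked deals; the URL and token are placeholders, and it is assumed the flags attach to the deal-tracking command:

./singularity run deal-tracking --lotus-api <lotus_api_url> --lotus-token <lotus_token>
./singularity deal list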

Deal statuses and their meanings

  • Deal Proposed - deal proposed by Singularity
  • Deal Published - deal published by the SP and observed on chain
  • Deal Active - deal sealed by the SP and active on chain
  • Deal Expired - deal that was sealed by the SP but has exceeded its end epoch (expired)
  • Deal Proposal Expired - deal was proposed but the SP did not seal it before the deal start epoch
  • Deal Rejected - not used
  • Deal Slashed - the deal has been slashed
  • Deal Errored - not used

All of the deals below are tracked:

  • Deals active from other sources (outside of singularity)
  • Deals made by the Singularity deal scheduler
  • Deals made using singularity deal send-manual
  • Deals made using the singularity deal self-service

Data source / S3

Description

The user should be able to prepare a dataset that's stored on AWS S3. This applies to both public and private datasets.

Acceptance Criteria

The user should be able to prepare a dataset stored on AWS S3.
Use #64 as the basic test flow.

Check https://protocol-labs-2.gitbook.io/singularity-1/cli-reference/datasource/add/s3 for how to add an S3 data source.
Both private and public datasets need to be tested.

Expect to see CAR files generated.

  • Try cases with millions of files or folders with nesting
  • Try cases with empty file, large file (>32GiB)
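
A rough sketch of adding a public S3 data source, following the pattern of the local data source command from #64; the exact arguments are documented in the CLI reference linked above, so treat the form below as a placeholder:

./singularity datasource add s3 test <bucket_or_path>
./singularity run dataset-worker --exit-on-complete --exit-on-error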

[Epic] Singularity App

Description

ETA: 2023-11-30

A dashboard to explore the prepared datasets. At the first stage the dashboard is read-only, but the implementation needs to account for potential future features that make it interactive, e.g. creation / update of datasets, data sources, etc.

Global View

  • Instance ID
  • Version
  • About page

Wallet List View

  • Show a list of wallets
  • For each wallet, show associated datasets
  • For each wallet, can also show associated deal schedules
  • Add wallet using private key
  • Remove wallet

Global Dashboard View

  • Charts for deals, i.e. deals breakdown by state, SP, dataset, time series
  • Charts for data prep, i.e. number of files, number of CARs, total size of files

Workers view

  • List of workers and their status

Dataset Selection

  • Select a dataset to get to Per Dataset view
  • Create / Remove dataset - Dataset Creation View

Dataset Creation View

  • Name
  • maxSize, pieceSize
  • Encryption key

Dataset Connect to destination View

  • Output Destination - can be any supported type, e.g. local path, S3, Dropbox, etc.

Per Dataset view

Datasource List View

  • List of datasource
  • Add / remove datasource - deleteAfterExport, rescanInterval, etc

Wallet association View

  • Add / remove wallet association

Piece List View

  • List of pieces that have been prepared
    • each piece has link to PieceFileListView and PieceDealView
  • Add new piece manually (piece_cid, piece_size, [root_cid])

Deal Schedule List View

  • List of deal schedules associated with this dataset and their status, link to Deal Schedule View
  • A button to create new deal schedule
  • Pause / Unpause

Deal Schedule Creation View

  • see database schema

Per Dataset Dashboard

  • Charts for deals, i.e. deals breakdown by state, SP, dataset, time series
  • Charts for data prep, i.e. number of files, number of CARs, total size of files

Deal Schedule View

  • parameters, i.e. provider, client, cron pattern, restriction, deal parameters
  • Chart for deals made by this schedule, time series, pie chart, etc

Per Datasource View

Onboarding View

  • Explore datasource by path and enqueue items from datasource
  • List of enqueued items

Exploration View

  • Explore prepared files by path
  • Each file goes to File View

File view

Defaults to the latest version of the file; the user can see and choose previous versions

  • Download button which renders a File Download View
  • List of pieces that includes this file (1 for small files, and more for large files)
  • Show all relevant deals and charts for distribution


Exception from manual `dataset add-piece`

I'm trying to manually add a piece to this dataset, however I keep getting this error panic: invalid cid: cid too short. From the stack trace it appears to be coming from the --root-cid, but I'm not sure why it's saying "cid too short"

singularity dataset add-piece singularity-test baga6ea4seaqblmkqfesvijszk34r3j6oairnl4fhi2ehamt7f3knn3gwkyylmlq 34359738368 --root-cid bafybeidylyizmuhqny6dj5vblzokmrmgyq5tocssps3nw3g22dnlty7bhy --file-size 18010019221                 
panic: invalid cid: cid too short                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                                                                                            
goroutine 1 [running]:                                                                                                                                                                                                                                                                                                      
github.com/ipfs/go-cid.MustParse(...)                                                                                                                                                                                                                                                                                       
        /go/pkg/mod/github.com/ipfs/[email protected]/cid.go:210                                                                                                                                                                                                                                                                
github.com/data-preservation-programs/singularity/handler/dataset.AddPieceHandler(0xc0006195f0, {0x7ffc8b627ec7, 0x10}, {{0x7ffc8b627ed8, 0x40}, {0x7ffc8b627f19, 0xb}, {0x0, 0x0}, 0x0, ...})                                                                                                                              
        /app/handler/dataset/addpiece.go:113 +0x1485                                                                                                                                                                                                                                                                        
github.com/data-preservation-programs/singularity/cmd/dataset.glob..func1(0xc0007d0480)                                                                                                                                                                                                                                     
        /app/cmd/dataset/addpiece.go:34 +0x2c7                                                                                                                                                                                                                                                                              
github.com/urfave/cli/v2.(*Command).Run(0x5065720, 0xc0007d0480, {0xc0001f0b00, 0x8, 0x8})                                                                                                                                                                                                                                  
        /go/pkg/mod/github.com/urfave/cli/[email protected]/command.go:274 +0xa42                                                                                                                                                                                                                                                  
github.com/urfave/cli/v2.(*Command).Run(0xc0001ccdc0, 0xc0007d0340, {0xc000e7d200, 0x9, 0x9})                                                                                                                                                                                                                               
        /go/pkg/mod/github.com/urfave/cli/[email protected]/command.go:267 +0xc97                                                                                                                                                                                                                                                  
github.com/urfave/cli/v2.(*Command).Run(0xc0001cd8c0, 0xc0007d0180, {0xc000000140, 0xa, 0xa})                                                                                                                                                                                                                               
        /go/pkg/mod/github.com/urfave/cli/[email protected]/command.go:267 +0xc97                                                                                                                                                                                                                                                  
github.com/urfave/cli/v2.(*App).RunContext(0xc0007ce000, {0x3cfaf50?, 0xc000058038}, {0xc000000140, 0xa, 0xa})                                                                                                                                                                                                              
        /go/pkg/mod/github.com/urfave/cli/[email protected]/app.go:332 +0x616                                                                                                                                                                                                                                                      
github.com/urfave/cli/v2.(*App).Run(...)                                                                                                                                                                                                                                                                                    
        /go/pkg/mod/github.com/urfave/cli/[email protected]/app.go:309                                                                                                                                                                                                                                                             
main.main()                                                                                                                                                                                                                                                                                                                 
        /app/singularity.go:160 +0x1005                                                                                                                                                                                                                                                                                     

Data source / file system

Description

Prepare dataset that is stored on local file system

Acceptance Criteria

The user should be able to prepare a dataset stored on the local file system.
Use #64 as a reference for how to test this scenario.

Expect to see CAR files generated.

  • Try cases with millions of files or folders with nesting
  • Try cases with empty file, large file (>32GiB)

Content Provider / GraphSync

Description

Make Singularity act as Boost, serving GraphSync retrievals while offloading data to storage providers.

The database backed block store can be used to serve blocks.

If the data source has been exported to CAR files, those blocks will be served directly from the CAR.
Otherwise, block retrieval is translated into HTTP range requests against the original data source, which may be inefficient.

Datasource / Rescan

Description

Allow rescanning of the datasource to discover new files. Deletion is not in the scope of this item.

Acceptance criteria

Use #64 as the basic test flow.
The user can use --rescan-interval when adding the data source to enable data source rescanning (a rough sequence is sketched after the list below).

  • Run dataset worker, after data preparation ends, create a new file in the data source
  • The dataset worker will pick up the new file once the rescan interval has passed.
  • New CAR file containing the new file will be exported
  • Rescan can also be triggered using singularity datasource rescan
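
A rough sequence for the rescan flow; the interval value and the data source reference are placeholders:

./singularity datasource add local test <folder_path> --rescan-interval <interval>
./singularity run dataset-worker
# in another shell, add a new file to the data source, then wait for the interval
# or trigger the rescan manually
touch <folder_path>/new-file.txt
./singularity datasource rescan <source_id>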

Dashboard / Piece explorer

Description

Display a piece listing view:

  • List all pieces (CARs) associated with a dataset
  • List the files and folders included in each piece
  • List the deals relevant to each piece
  • Download the piece as CAR file

Content Provider / HTTP piece

Description

Allow Singularity to stand up an HTTP server to serve CAR file retrieval

Acceptance Criteria

The user should be able to download the CAR file from the HTTP server:

singularity run content-provider
wget http://127.0.0.1:7777/piece/<piece_CID>
  • Test with multithreaded downloading clients (aria2c, axel), as sketched below
  • Inline preparation download covered in #3
  • Non-inline preparation should also work with the CAR file deleted (it will act as inline preparation)
  • Verify the hash of the CAR file
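
For the multithreaded download check, either client below works against the documented endpoint; eight connections is an arbitrary choice:

aria2c -x 8 http://127.0.0.1:7777/piece/<piece_CID>
axel -n 8 http://127.0.0.1:7777/piece/<piece_CID>
# one way to verify the hash of the downloaded CAR file
sha256sum <downloaded_file>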

Remote Signer

Description

Implement a solution to allow remote signing so the client does not need to provide the private key to Singularity (the data preparer). We can reuse the relay signing service and client that's being built for Spade.

The client will run a server to accept and sign proposals. Singularity will send deal proposals to the client's server through a centralized relay service.

https://github.com/data-preservation-programs/filsigner-relayed

Preparation Worker with Inline Preparation

Description

https://protocol-labs-2.gitbook.io/singularity-1/topics/inline-preparation

Acceptance Criteria

In general, follow #64 for testing data preparation.
During dataset creation, do not specify -o so that CAR files are not exported.

After preparation, run singularity run content-provider and download the CAR file using:
wget http://127.0.0.1:7777/piece/<piece_cid>

  • verify the CAR file resolves to the same hash as the case without inline preparation
  • verify multithreaded download works (e.g. using axel or aria2c)
  • measure / monitor the CPU and RAM overhead of the content-provider service
  • use at least two different data sources, e.g. S3 and the local file system
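
Putting the steps together, a minimal end-to-end run might look like this, reusing commands from #64 (a local folder source is shown; repeat with S3 per the last bullet):

./singularity dataset create test
./singularity datasource add local test <folder_path>
./singularity run dataset-worker --exit-on-complete --exit-on-error
./singularity run content-provider &
wget http://127.0.0.1:7777/piece/<piece_cid>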

Data source / manual push API

Description

This is a manual push API to tell Singularity that something needs to be grabbed for preparation, e.g.
a file in the local file system or the path of an S3 object.

Acceptance Criteria

A user should first create a dataset and add a data source (without rescanning).
Then, call the push API to tell Singularity that a new item is present in that data source.
The API is available using

singularity run api

http://127.0.0.1:9090/swagger/index.html

  • Try with both local file system and remote storage solution such as S3

Data preparation overview docs

I think it would be helpful to write up an overview of how data preparation works.

I've spent a day code diving to get it in my head and I think this would save others some work.

Specifically, I'd like to run through:

  • Scanning (finding all the files and making Directories/Items/ItemParts/Chunks -- including that we chunk files into 1GB parts)
  • Packing (Putting Chunks into CAR Files)
  • DAGGen (this was the hardest to understand but if I have it correctly I believe you are making one more CAR file that links the item parts into items and then links the items into directories up to a root directory CID)

I'm happy to take this on.

I will obviously need help if we're doing Chinese translation :)

Metadata download client

Description

The download client is a utility that first queries the metadata API to get a plan for how to assemble a CAR file from the original data source, then downloads and assembles the file locally.

Acceptance Criteria

Following the testing in #17,
now run the command below to download the CAR file (this only requires content-provider to be running):

singularity download <piece_cid>
  • Try a dataset prepared without inline preparation and with the CAR file deleted - it should also work
  • Try the concurrency flag -j (see the sketch after this list)
  • Verify the downloaded CAR file resolves to the correct hash value
  • Try a data source that does not need auth (e.g. local file system) as well as one that requires auth (private S3 bucket) - note the credentials need to be supplied on the client side, as the client retrieves the data from the source directly
  • Measure and monitor the CPU/RAM overhead
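
For example, with the concurrency flag; the value 4 is arbitrary, and content-provider is assumed to be running as in #17:

singularity download -j 4 <piece_cid>
# the output filename below is an assumption; verify the CAR resolves to the expected hash
sha256sum <piece_cid>.car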

Data source / File deletion handling

Description

Deleted files should be removed from the folder DAG and also marked as deleted in the database.
Ways to trigger file deletion include:

  • A data source rescan finds that the file has been deleted
  • An exposed API (webdav, S3, etc) that deletes the file

Deal Making Scheduler

Description

Allow pushing deals to specific storage providers with certain restrictions and schedules:

  • Boost specific options - httpheader, ipni announcement, URL to download CAR file
  • Deal proposal options - duration, price, verified, keepunsealed
  • Scheduling options - cron pattern, number per trigger, total number of deals, max pending deals
  • Others from user asks - notes, allowedCIDlist
