blockset's People

Contributors

sergey-shandar

blockset's Issues

`integrity check` command

The command could check the repository for:

  • unreferenced block parts,
  • invalid blocks,
  • invalid file names in the directory.

Modes for the `add` command.

  • default: quick, no checks, does not overwrite existing blocks,
  • validation mode: if the file exists, validate it; if it is broken, overwrite it,
  • always overwrite.
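
The three modes above can be sketched as a small decision function. This is only an illustration: the names `AddMode` and `should_write` are hypothetical and not taken from the project, and the validity check is passed in as a closure so that it only runs in validation mode.

```rust
/// Hypothetical names for the three modes listed above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum AddMode {
    /// Quick: no checks, never overwrite an existing block.
    Default,
    /// Validate an existing block and overwrite it only if it is broken.
    Validate,
    /// Always overwrite.
    Overwrite,
}

/// Decides whether a block file should be written.
/// `is_valid` is called only when the file exists and the mode is `Validate`.
pub fn should_write(mode: AddMode, exists: bool, is_valid: impl FnOnce() -> bool) -> bool {
    match (mode, exists) {
        // A missing file is always written.
        (_, false) => true,
        (AddMode::Default, true) => false,
        (AddMode::Validate, true) => !is_valid(),
        (AddMode::Overwrite, true) => true,
    }
}
```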

address standard

  • 45 characters. Latin letters and digits, except u and U. Case-insensitive.
    • normalized: only lowercase letters, no u, i, l and o.

Validation output:

  • error:
    • invalid address (e.g. only a few symbols provided, or invalid symbols),
    • invalid parity bit.
  • warning: not normalized. The address will not work in a file system.
  • ok
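
A minimal sketch of the validation output described above, assuming the rules exactly as listed. The names `Validation` and `validate` are hypothetical. The parity-bit check is left out because this issue does not say how the parity bit is computed.

```rust
/// Possible validation results (hypothetical names).
#[derive(Debug, PartialEq, Eq)]
pub enum Validation {
    Ok,
    /// Warning: valid, but not in the normalized form.
    NotNormalized,
    /// Error: wrong length or a forbidden symbol.
    InvalidAddress,
}

const ADDRESS_LEN: usize = 45;

/// Checks the length and the alphabet of an address.
/// The parity bit is NOT checked here (its definition is not given above).
pub fn validate(address: &str) -> Validation {
    if address.chars().count() != ADDRESS_LEN {
        return Validation::InvalidAddress;
    }
    let mut normalized = true;
    for c in address.chars() {
        match c {
            // `u` and `U` are never allowed.
            'u' | 'U' => return Validation::InvalidAddress,
            // Accepted, but not part of the normalized alphabet.
            'i' | 'l' | 'o' | 'A'..='Z' => normalized = false,
            'a'..='z' | '0'..='9' => {}
            _ => return Validation::InvalidAddress,
        }
    }
    if normalized {
        Validation::Ok
    } else {
        Validation::NotNormalized
    }
}
```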

Show progress during store/retrieve

  • use processed / file_length to calculate progress during adding.
  • use current_block_level0# / total_blocks_level0# + current_block_level1# / (total_blocks_level1# * total_blocks_level0#) + ... during getting.
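
Both formulas above can be written down directly. The function names and the `(current, total)` tuple layout are assumptions made for this sketch.

```rust
/// Progress of `add`: bytes processed over the file length.
pub fn add_progress(processed: u64, file_length: u64) -> f64 {
    if file_length == 0 {
        1.0 // an empty file is complete immediately
    } else {
        processed as f64 / file_length as f64
    }
}

/// Progress of `get`. `levels[i]` is `(current_block, total_blocks)` for
/// level `i`, with level 0 first. Each level contributes
/// `current_i / (total_0 * total_1 * ... * total_i)`.
pub fn get_progress(levels: &[(u64, u64)]) -> f64 {
    let mut scale = 1.0;
    let mut progress = 0.0;
    for &(current, total) in levels {
        if total == 0 {
            break; // nothing known about this level yet
        }
        scale *= total as f64;
        progress += current as f64 / scale;
    }
    progress
}
```

For example, being at block 1 of 2 on level 0 and block 1 of 4 on level 1 gives 1/2 + 1/8 = 0.625.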

Split the program into state machines

Currently, we use push_ireator instead of a formal state machine.

A formal state machine will allow us to switch to asynchronous I/O.

According to this post, we may use async-std instead of Tokio. To avoid dependencies, we can split the project into two parts: a library and a CLI. The CLI will depend on async-std.

Splitting storage to folders

Discussed in https://github.com/orgs/datablockset/discussions/71

Firstly, these folders:

  • roots/
  • parts/

Then we would like to reduce the number of files in the folder. We split further:

https://stackoverflow.com/questions/466521/how-many-files-can-i-put-in-a-directory

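| # of symbols | # of folders | one level | # of files for one level | two levels | # of files for two levels |
|---|---|---|---|---|---|
| 1 | 32 | a/ | 1_024 | a/b/ | 32_768 |
| 2 | 1_024 | ab/ | 1_048_576 | ab/cd/ | 1_073_741_824 |
| 3 | 32_768 | abc/ | 1_073_741_824 | abc/def/ | 35_184_372_088_832 |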

Git uses two hex symbols, which gives 256 folders.

I like two base32 symbols ab/, and two levels ab/cd/. With a limit of about 1024 files and folders in one folder, we can keep about 1 billion files, which will occupy several TB of space.
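
The numbers in the table follow from powers of 32. A short sketch, assuming that a leaf folder holds at most as many files as a folder holds subfolders (`folders_per_level` and `capacity` are hypothetical names):

```rust
/// Size of the base32 alphabet.
const BASE: u64 = 32;

/// Number of folders on one level when each folder name has `symbols` characters.
pub fn folders_per_level(symbols: u32) -> u64 {
    BASE.pow(symbols)
}

/// Number of files reachable through `levels` nested folder levels,
/// assuming every leaf folder holds at most `folders_per_level(symbols)` files.
pub fn capacity(symbols: u32, levels: u32) -> u64 {
    folders_per_level(symbols).pow(levels + 1)
}
```

With two symbols and two levels (`ab/cd/`), `capacity(2, 2)` is 1_073_741_824, the "about 1 billion files" mentioned above.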

Q: should we do it for roots as well?

Note: filenames can also be reduced.

Disadvantages:

  • to synchronize two repositories, we need to copy the files recursively.
  • if the file name is reduced, we cut 20 bits from the hash. It becomes 204 bits (41 characters) if someone simply copies the file. However, the hash of the file can be easily restored, so it shouldn't be an issue. We can have an application that restores hashes and puts the files into the right folder. The algorithm can even determine whether the file is a block or a part.
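
A sketch of the `ab/cd/` layout discussed above, with and without reduced file names. The function names are hypothetical, and `path_to_address` only reassembles an address from a reduced-name path; it does not recompute a hash from file content.

```rust
/// Maps an address to `<kind>/ab/cd/<name>`, where `ab` and `cd` are the
/// first four characters of the address. With `reduce_name`, those four
/// characters (20 bits) are cut from the file name.
pub fn address_to_path(kind: &str, address: &str, reduce_name: bool) -> Option<String> {
    const PREFIX: usize = 4; // two levels of two symbols
    if !address.is_ascii() || address.len() <= PREFIX {
        return None;
    }
    let (a, rest) = address.split_at(2);
    let (b, tail) = rest.split_at(2);
    let name = if reduce_name { tail } else { address };
    Some(format!("{kind}/{a}/{b}/{name}"))
}

/// Reassembles the full address from a path that uses reduced file names.
pub fn path_to_address(path: &str) -> Option<String> {
    let mut it = path.rsplit('/');
    let tail = it.next()?;
    let b = it.next()?;
    let a = it.next()?;
    Some(format!("{a}{b}{tail}"))
}
```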

Garbage collector.

A garbage collector can check whether a part is reachable from the root blocks. If it is not, we can delete the part. This way, a user can delete a file from the repository by deleting its root block and running GC.
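
The reachability check can be sketched as a mark phase over an in-memory reference graph. How references are actually read from blocks is not covered here; the `children` map and the function name are assumptions for illustration.

```rust
use std::collections::{HashMap, HashSet};

/// Returns the parts that are not reachable from any root.
/// `children` maps a root or part to the parts it references (assumed input).
pub fn unreachable_parts<'a>(
    roots: &[&'a str],
    parts: &[&'a str],
    children: &HashMap<&'a str, Vec<&'a str>>,
) -> Vec<String> {
    // Mark: walk the graph from the roots.
    let mut reachable: HashSet<&str> = HashSet::new();
    let mut stack: Vec<&str> = roots.to_vec();
    while let Some(id) = stack.pop() {
        if let Some(refs) = children.get(id) {
            for &child in refs {
                // `insert` returns true only the first time a part is seen.
                if reachable.insert(child) {
                    stack.push(child);
                }
            }
        }
    }
    // Sweep candidates: every part that was never marked.
    parts
        .iter()
        .filter(|p| !reachable.contains(**p))
        .map(|p| p.to_string())
        .collect()
}
```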

Recognize a text file and replace \r\n with \n on Windows

This applies only to Windows:

Methods:

  1. file extension. This is more error-prone.
  2. UTF-8 compatibility.

add command

The command can scan a file to see if it's UTF-8 compatible. It's more important than the get command. Options:

  1. during add, the program scans a file to see if it's a UTF-8 file. If yes, then it will add with conversion.
  2. a command that will convert all UTF-8 files into UNIX format.

Note: when using option 1, be cautious about sequences like \r\r\n. We shouldn't convert it to \r\n, because the get command will not revert it back to \r\r\n.
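
One possible reading of option 1, as a sketch over an in-memory buffer. Two assumptions are made here that the issue does not state: `\r\r\n` is stored unchanged, and a file containing a bare `\n` is not converted at all, since `get` would turn it into `\r\n`. The name `to_unix` is hypothetical.

```rust
/// Converts `\r\n` to `\n` for storage. Returns `None` when the file should
/// be stored unchanged: it is not UTF-8, or it contains a bare `\n`.
pub fn to_unix(input: &[u8]) -> Option<Vec<u8>> {
    // Only UTF-8 files are treated as text.
    std::str::from_utf8(input).ok()?;
    let mut out = Vec::with_capacity(input.len());
    for (i, &b) in input.iter().enumerate() {
        if b == b'\n' {
            let prev = i.checked_sub(1).map(|j| input[j]);
            let prev2 = i.checked_sub(2).map(|j| input[j]);
            match (prev2, prev) {
                // A bare `\n` would come back as `\r\n`: refuse to convert.
                (_, p) if p != Some(b'\r') => return None,
                // `\r\r\n` is kept as-is so that `get` leaves it alone.
                (Some(b'\r'), _) => {}
                // A plain `\r\n`: drop the `\r` that was already pushed.
                _ => {
                    out.pop();
                }
            }
        }
        out.push(b);
    }
    Some(out)
}
```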

get command

During get, we can set a flag if the file is UTF-8. We also count the newlines \n in the file that have no preceding \r. If the file is UTF-8 and there are such newlines, we convert the temporary file to Windows format and then copy the file. We convert the file by extending it by the number of counted \n and filling it from the back.
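
The back-to-front extension can be sketched on an in-memory buffer; the issue describes it for a temporary file, so treat `to_windows_in_place` as a hypothetical illustration of the same idea rather than the actual implementation.

```rust
/// Converts every `\n` without a preceding `\r` to `\r\n`, in place.
pub fn to_windows_in_place(buf: &mut Vec<u8>) {
    // Count the `\n` bytes that need a `\r` inserted before them.
    let extra = (0..buf.len())
        .filter(|&i| buf[i] == b'\n' && (i == 0 || buf[i - 1] != b'\r'))
        .count();
    let old_len = buf.len();
    // Extend the buffer by the number of missing `\r` bytes.
    buf.resize(old_len + extra, 0);
    // Walk from the back so that unread bytes are never overwritten.
    let mut write = buf.len();
    for read in (0..old_len).rev() {
        let b = buf[read];
        write -= 1;
        buf[write] = b;
        if b == b'\n' && (read == 0 || buf[read - 1] != b'\r') {
            write -= 1;
            buf[write] = b'\r';
        }
    }
}
```

Because the write position never falls below the read position, the conversion needs no second buffer.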

`info` command

Info will go through all folders and display the result:

roots: {}, {} B.
parts: {}, {} B.
total: {}, {} B.

For example:

roots: 3, 4_567 B.
parts: 124, 1_567_345 B.
total: 127, 1_571_912 B.

The command can display progress and estimation, for example:

roots: 3, 4_567 B.
parts: ~124, ~1_567_345 B.
total: ~127, ~1_571_912 B.
45%
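
The output format above can be produced with a small formatter. The names below are hypothetical, and `estimate` uses plain linear extrapolation from the scanned fraction, which is an assumption; the issue does not say how the estimate is computed.

```rust
/// Formats an integer with `_` between groups of three digits.
pub fn with_underscores(n: u64) -> String {
    let digits = n.to_string();
    let mut out = String::new();
    for (i, c) in digits.chars().enumerate() {
        // Insert a separator when a multiple of three digits remains.
        if i > 0 && (digits.len() - i) % 3 == 0 {
            out.push('_');
        }
        out.push(c);
    }
    out
}

/// Builds one line of the `info` output; `estimated` adds the `~` marks.
pub fn info_line(name: &str, count: u64, bytes: u64, estimated: bool) -> String {
    let mark = if estimated { "~" } else { "" };
    format!(
        "{name}: {mark}{}, {mark}{} B.",
        with_underscores(count),
        with_underscores(bytes)
    )
}

/// Extrapolates a total from what has been seen so far.
/// `progress` is the scanned fraction in the range (0, 1].
pub fn estimate(observed: u64, progress: f64) -> u64 {
    if progress <= 0.0 {
        return observed; // no basis for extrapolation yet
    }
    (observed as f64 / progress).round() as u64
}
```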

add/get folders

We don't have to use TAR.

{
  "fileCatalog": {
     "a/fileA.txt": "cwjisj449...",
     "b/c/file.exe": {
        "ref": "dijoi3j....",
        "mode": ["executable", ....],
     },
     "a/c": {
         // "type": "directory"
         "dirFiles": "dsdiouh48ds....",
     }
  },

  /// .......
  "signature": ....
  "timestamp": ....
  "message": ...
  "previous": ...
}

Better info estimation.

We should merge roots and parts into one.

struct Path {
    roots: Vec<String>,
    path: String,
}

We should merge the two sets of hashes as two sets of random numbers with different sizes.

A block recognition application

The block recognition application should be able to recognize blocks from storage without any additional information. The application may use a cache, but the cache is not the source of truth.

Recognize these formats:

The recognizer should be able to work on a stream of bytes.

Change the tree building algorithm to variable levels.

Instead of using fixed numbers like 8 and 4 (for data and node levels), we can use a threshold and decrease/increase the level (resolution) according to the threshold.

Test cases:

  • we can increase a level for constant-value data,
  • we should decrease a level when a level node reaches the threshold.
    • we may go to sublevel nodes.

Support for other hash formats

We can do it using additional folders, like sha256/, and keep files there that contain references to 'cdt0' roots. For example,

cdt0/
  roots/...
  parts/...
  sha256/ab/cd/efg...
