
smt's Introduction

smt

A Go library that implements a Sparse Merkle tree for a key-value map. The tree implements the same optimisations specified in the Libra whitepaper, to reduce the number of hash operations required per tree operation to O(k) where k is the number of non-empty elements in the tree.


Example

package main

import (
	"crypto/sha256"
	"fmt"

	"github.com/celestiaorg/smt"
)

func main() {
	// Initialise two new key-value stores to hold the nodes and values of the tree
	nodeStore := smt.NewSimpleMap()
	valueStore := smt.NewSimpleMap()
	// Initialise the tree
	tree := smt.NewSparseMerkleTree(nodeStore, valueStore, sha256.New())

	// Update the key "foo" with the value "bar"
	_, _ = tree.Update([]byte("foo"), []byte("bar"))

	// Generate a Merkle proof for foo=bar
	proof, _ := tree.Prove([]byte("foo"))
	root := tree.Root() // We also need the current tree root for the proof

	// Verify the Merkle proof for foo=bar
	if smt.VerifyProof(proof, root, []byte("foo"), []byte("bar"), sha256.New()) {
		fmt.Println("Proof verification succeeded.")
	} else {
		fmt.Println("Proof verification failed.")
	}
}

smt's People

Contributors

adlerjohn, cuonglm, jbowen93, liamsi, musalbas, odeke-em, tac0turtle, tzdybal


smt's Issues

(*SparseMerkleTree).th.parseNode should check that data retrieved is within limits else panic

If we look at this code here

smt/treehasher.go

Lines 68 to 70 in a99c0f5

func (th *treeHasher) parseNode(data []byte) ([]byte, []byte) {
	return data[len(nodePrefix) : th.pathSize()+len(nodePrefix)], data[len(nodePrefix)+th.pathSize():]
}

we notice that the authors assumed we'd always have data with a length of at least 33 bytes. However, this code overlooks that to create a SparseMerkleTree we pass in a caller-supplied nodes MapStore as well as a values MapStore, so the stored node data cannot be assumed to be well-formed. If we run this code, we'll get a panic:

func TestInvalidNodesValues(t *testing.T) {
        smn, smv := NewSimpleMap(), NewSimpleMap()
        smt := NewSparseMerkleTree(smn, smv, sha256.New())
        smn.Set([]byte("key"), []byte("v"))
        smn.Set([]byte("not my root"), []byte("v"))

        smt.UpdateForRoot([]byte("key"), []byte("values"), []byte("not my root"))
}  
go test -run=InvalidNodes
--- FAIL: TestInvalidNodesValues (0.00s)
panic: runtime error: slice bounds out of range [:33] with capacity 1 [recovered]
	panic: runtime error: slice bounds out of range [:33] with capacity 1

goroutine 18 [running]:
testing.tRunner.func1.2({0x11232a0, 0xc00010e000})
	/Users/emmanuelodeke/go/src/go.googlesource.com/go/src/testing/testing.go:1209 +0x24e
testing.tRunner.func1()
	/Users/emmanuelodeke/go/src/go.googlesource.com/go/src/testing/testing.go:1212 +0x226
panic({0x11232a0, 0xc00010e000})
	/Users/emmanuelodeke/go/src/go.googlesource.com/go/src/runtime/panic.go:814 +0x214
github.com/celestiaorg/smt.(*treeHasher).parseNode(0xc0000a41e0, {0xc0000a80dc, 0x1, 0x1})
	/Users/emmanuelodeke/go/src/github.com/celestiaorg/smt/treehasher.go:69 +0x139
github.com/celestiaorg/smt.(*SparseMerkleTree).sideNodesForRoot(0xc0000a41e0, {0xc0000ea040, 0x20, 0x20}, {0xc0000a8100, 0x100c78b, 0xb}, 0x0)
	/Users/emmanuelodeke/go/src/github.com/celestiaorg/smt/smt.go:339 +0x485
github.com/celestiaorg/smt.(*SparseMerkleTree).UpdateForRoot(0xc0000a41e0, {0xc0000a80eb, 0x3, 0x3}, {0xc0000a80f0, 0x1220d00, 0x6}, {0xc0000a8100, 0xb, 0xb})
	/Users/emmanuelodeke/go/src/github.com/celestiaorg/smt/smt.go:119 +0xc5
github.com/celestiaorg/smt.TestInvalidNodesValues(0x0)
	/Users/emmanuelodeke/go/src/github.com/celestiaorg/smt/proofs_test.go:289 +0x305
testing.tRunner(0xc000083380, 0x11386a0)
	/Users/emmanuelodeke/go/src/go.googlesource.com/go/src/testing/testing.go:1259 +0x102
created by testing.(*T).Run
	/Users/emmanuelodeke/go/src/go.googlesource.com/go/src/testing/testing.go:1306 +0x35f
exit status 2
FAIL	github.com/celestiaorg/smt	0.129s

Suggestion

Given that this package is going to be general purpose, we need to be defensive even about code whose limits we think we control: we should return errors rather than panic whenever we access data with unchecked bounds.

/cc @cuonglm @adlerjohn @liamsi
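The suggestion above could be sketched as follows. Everything here is illustrative, not the library's actual code: the prefix and path sizes are hypothetical constants matching the 33-byte minimum implied by the panic, and parseNodeSafe is a hypothetical name. The point is simply validating lengths before slicing.

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative layout: a one-byte node prefix followed by a 32-byte path,
// giving the 33-byte minimum seen in the panic message above.
var nodePrefix = []byte{1}

const pathSize = 32

// parseNodeSafe validates the input length before slicing, returning an
// error instead of letting the runtime panic on short data.
func parseNodeSafe(data []byte) (path, hash []byte, err error) {
	minLen := len(nodePrefix) + pathSize
	if len(data) < minLen {
		return nil, nil, errors.New("smt: malformed node data")
	}
	return data[len(nodePrefix) : len(nodePrefix)+pathSize], data[len(nodePrefix)+pathSize:], nil
}

func main() {
	// A 1-byte blob, like the corrupted node in the test above, now
	// yields an error instead of a panic.
	_, _, err := parseNodeSafe([]byte("v"))
	fmt.Println(err) // smt: malformed node data
}
```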

Sparse/Incremental merkle tree, on-chain append

Hi,

I am working on an incremental merkle tree, aiming to reduce the on-chain cost of adding items to it.

See my notes at: HarryR/ethsnarks#16

The main problem is, if the merkle tree is built on-chain using on-chain storage for all of the nodes, then it is expensive (e.g. 750k gas for 29 levels). e.g. think of every database store costing money, and every database access costing money, but passing arguments to the program (e.g. a couple of proof paths) is much cheaper.

Do you have any ideas about this?

Release new version

The latest release isn't usable as it still uses lazyledger in the module name.

Start versioning

Right now there are no versions/releases. And we will definitely need to have it in place.

Use raw key as path instead of hash(key)

Note: this is a backwards-incompatible change in the interface, but shouldn't change the commitment so long as the library is used the same way.

Currently (as of #37), values are stored in a map hash(key) -> value. This allows constant-time reads. However, an outstanding change is using the raw key instead of the hash of the key: #14 (comment)

This is needed to allow applications to store certain entries at specific keys in the tree, which also allows for algorithms that operate over subtrees (see above-linked comment for more context). Users who do not want to use this functionality can simply hash their keys before passing them to the library, resulting in an identical tree as before the change proposed here.

Coverage tracking broken

Coverage tracking was broken by #23. New GitHub actions file seems to use codecov, whereas the repo was originally using coveralls. Codecov does not appear to be configured.

Introduce Delete and Has methods

Right now it's not obvious how to execute Delete operation and check value existence (you need to take a look in the code).
I think Delete and Has methods should be added.

Alternatively, defaultValue should be exported and comments to Get and Update added to explicitly state this.
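The semantics being asked for could look like this sketch. The kv type is a self-contained stand-in for the tree (on the real SparseMerkleTree these would wrap Get/Update), and defaultValue mirrors the unexported sentinel mentioned above; none of these wrappers exist in the library.

```go
package main

import (
	"bytes"
	"fmt"
)

// defaultValue mirrors the unexported sentinel discussed above: an empty
// value means "absent".
var defaultValue []byte

// kv is a stand-in for the tree, so the sketch is runnable as-is.
type kv map[string][]byte

func (m kv) Get(key []byte) []byte    { return m[string(key)] }
func (m kv) Update(key, value []byte) { m[string(key)] = value }

// Has reports whether key holds a non-default value.
func (m kv) Has(key []byte) bool { return !bytes.Equal(m.Get(key), defaultValue) }

// Delete is just an Update to defaultValue, per the convention above.
func (m kv) Delete(key []byte) { m.Update(key, defaultValue) }

func main() {
	m := kv{}
	m.Update([]byte("foo"), []byte("bar"))
	fmt.Println(m.Has([]byte("foo"))) // true
	m.Delete([]byte("foo"))
	fmt.Println(m.Has([]byte("foo"))) // false
}
```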

Explore optimisation to only require one database read per key access

In this Merkle AVL implementation, the following optimisation has been made:

In the backing key/value store, nodes are stored using their key/value pair key as the database key, and a binary encoding that contains the fields in the above Node structure - minus the key field since that is already implied by the database entry.

Storing nodes by key rather than by hash is an important optimization, and is the reason why inner nodes each have a key/value pair. The implication is that reading a key does not require traversing through the tree structure but only requires a single read in the backing key/value store, meaning there is practically no overhead versus using the backing store without a tree structure. Additionally, we can efficiently iterate through nodes in the tree in their in-order traversal just by iterating by key in the backing store (which RocksDB and LevelDB are optimized for).

I think this should be doable for Sparse Merkle trees too.

Add string version for SimpleMap methods

Currently, all methods of SimpleMap take a []byte and convert it to a string to use as the map key. This is fine for the Get/Delete operations, since the compiler can optimize the map lookup to avoid allocating a new string in the string([]byte) conversion.

For the Set operation, which involves a map assignment, the string([]byte) conversion will allocate a new string. If we instead do the conversion on the caller's side and pass in the string, we can save an allocation.

I did a quick version:

func (sm *SimpleMap) SetString(key string, value []byte) error {
	sm.m[key] = value
	return nil
}

For following benchmarks:

package smt

import (
	"testing"
)

func BenchmarkSimpleMap_Set(b *testing.B) {
	sm := NewSimpleMap()
	s := "foobar"
	bs := []byte(s)

	for i := 0; i < b.N; i++ {
		sm.Set(bs, bs)
	}
}

func BenchmarkSimpleMap_SetString(b *testing.B) {
	sm := NewSimpleMap()
	s := "foobar"
	bs := []byte(s)

	for i := 0; i < b.N; i++ {
		sm.SetString(s, bs)
	}
}

The result:

name                   time/op
SimpleMap_Set-8        18.0ns ± 0%
SimpleMap_SetString-8  8.98ns ± 0%

name                   alloc/op
SimpleMap_Set-8         8.00B ± 0%
SimpleMap_SetString-8   0.00B

name                   allocs/op
SimpleMap_Set-8          1.00 ± 0%
SimpleMap_SetString-8    0.00

Don't recompute the root on every update; only on commit

We should pull @roysc's rewrite of smt.go upstream here (https://github.com/vulcanize/smt), as it provides a substantial and well-written performance improvement by not recomputing the root on every update, while still producing the same roots as this smt.

In order to do so we need to first:

  1. Re-implement deepsubtree;
  2. Re-add value storage.

Given that the Cosmos SDK store v2 may be abandoned, we are looking into resuming the work of integrating SMT into store v1 (celestiaorg/cosmos-sdk#12) by @tzdybal. The way that we re-add value storage will depend on the requirements of store v1 and in particular its requirements for historical state access, rollback, and pruning, so this needs to be analyzed first.

The root calculated by an empty Sparse Merkle Tree does not match the expected output defined by the specification

The Celestia specification of the Sparse Merkle Tree defines the root of an empty tree with the following:

The base case (an empty tree) is defined as the hash of the empty string:

node.v = 0xe3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855

However, the implementation, as currently written, yields a different result. Calling for the root of an empty SMT gives the 32 byte 0 array, i.e.:

smt.Root() = 0x0000000000000000000000000000000000000000000000000000000000000000

This is because the default root of an SMT is a placeholder, where placeholders are simply the 32-byte 0 array. The root is not the hash of the placeholder.

I verified this behaviour with the following test code:

package smt

import (
	"crypto/sha256"
	"encoding/binary"
	"testing"
)

func TestEmptyRoot(t *testing.T) {
	smt := NewSparseMerkleTree(NewSimpleMap(), NewSimpleMap(), sha256.New())
	t.Logf("ROOT: %x", smt.Root())
}

This outputs the following:

=== RUN   TestEmptyRoot
    brandon_test.go:11: ROOT: 0000000000000000000000000000000000000000000000000000000000000000
--- PASS: TestEmptyRoot (0.00s)
PASS

Process finished with the exit code 0

It's not immediately clear which value is, in fact, the "correct" one: the value defined in the spec, or the value output by the test.

Batch writes

Currently during Update/Delete operations, every write on the path from leaf to root is done in a separate call.
Those operations could be executed as a batch.

Pros:

  • improved consistency (no orphaned nodes in case of a crash)
  • (most probably) improved performance

Cons:

  • a more complicated MapStore interface

Most KV stores implement batching, and quite often even a single operation is wrapped in a batch (AFAIK this is true for LevelDB, RocksDB, and Badger).
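A batch-capable store might look something like this sketch. The Batch interface and memBatch implementation are hypothetical, modeled loosely on the batch APIs of LevelDB/Badger; they are not part of the current MapStore interface.

```go
package main

import "fmt"

// Batch queues writes and applies them atomically on Write, so a crash
// mid-update leaves no orphaned nodes. Hypothetical interface.
type Batch interface {
	Set(key, value []byte)
	Delete(key []byte)
	Write() error
}

type op struct {
	key, value []byte
	del        bool
}

// memBatch is a toy in-memory implementation of the idea above.
type memBatch struct {
	m   map[string][]byte
	ops []op
}

func (b *memBatch) Set(key, value []byte) { b.ops = append(b.ops, op{key, value, false}) }
func (b *memBatch) Delete(key []byte)     { b.ops = append(b.ops, op{key, nil, true}) }
func (b *memBatch) Write() error {
	for _, o := range b.ops {
		if o.del {
			delete(b.m, string(o.key))
		} else {
			b.m[string(o.key)] = o.value
		}
	}
	b.ops = nil
	return nil
}

func main() {
	m := map[string][]byte{}
	b := &memBatch{m: m}
	b.Set([]byte("node1"), []byte("a"))
	b.Set([]byte("node2"), []byte("b"))
	fmt.Println(len(m)) // 0: nothing visible before Write
	b.Write()
	fmt.Println(len(m)) // 2
}
```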

Delete incorrectly tries to read side nodes

#22 uncovered a bug whereby deleteWithSideNodes tries to Get side nodes, but those nodes may not exist in the tree. The case where these nodes don't exist in the tree is when a branch is added with AddBranch (which only includes the side nodes hashes, but not their values).

This is mostly an issue with the DSMST, where it is constructed by adding branches rather than updating.

Specifically, here. This Get only happens during the first iteration of the loop, i.e. it only applies to the sibling of the leaf node to delete. This node can either be another leaf node or an internal node, but cannot be a placeholder.

Automated CHANGELOG.md check for smt

There should be an automated step that ensures that a CHANGELOG.md file is appropriately updated for all merges to the main branch.

There should be a proper description in an issue and corresponding PR as well as a summary in the CHANGELOG.md file. This will be enforced through an automated action. In rare cases for small PRs it may be acceptable to skip this.

Benchmark different appending schemes

Currently, the SMT appends new side nodes to a slice at the end:

sideNodes = append(sideNodes, sideNode)
// ...
return reverseSideNodes(sideNodes)

This requires two allocations (one initially and one for the reverse slice) and one copy (to copy to the reverse slice). Alternatives would be to insert the new node at the front,

sideNodes = append(sideNodes, nil)
copy(sideNodes[1:], sideNodes)
sideNodes[0] = sideNode

or to create a new slice and append the old one to the new one

sideNodes = append([][]byte{sideNode}, sideNodes...)

Both of these alternatives should result in a new allocation and copy for every insertion, and in an equal number of allocations and copies. However, this SO answer suggests the second alternative would result in fewer allocations than the first; this is most likely an erroneous interpretation, comparing the second alternative against another, less efficient alternative (not shown here).

The current scheme and two alternatives should be benchmarked, as generating a list of side nodes is along the hot path of any write or prove operation.

Consider changing key bit traversal order to be prefix-based

Ref: the secondary change of celestiaorg/celestia-specs#129.

Currently, keys are traversed in a way that jumps between bits when looking at the key:

https://github.com/lazyledger/smt/blob/2b025d36fc57c4f0a008d49bb77fdf13c051b0cc/utils.go#L4

The reason is that arrays are ordered 0 to length-1 (left to right), while bits are ordered length-1 to 0 (left to right). As an example, consider the following 2-byte key. Byte indices are on the top and bit indices are on the bottom.

[     byte 0    ][     byte 1    ]
[7 6 5 4 3 2 1 0][7 6 5 4 3 2 1 0]

Here are the positions of different bit indices:

bit = 0
[     byte 0    ][     byte 1    ]
[7 6 5 4 3 2 1 0][7 6 5 4 3 2 1 0]
               ^
               this one!
bit = 1
[     byte 0    ][     byte 1    ]
[7 6 5 4 3 2 1 0][7 6 5 4 3 2 1 0]
             ^
             this one!
bit = 7
[     byte 0    ][     byte 1    ]
[7 6 5 4 3 2 1 0][7 6 5 4 3 2 1 0]
 ^
 this one!
bit = 8
[     byte 0    ][     byte 1    ]
[7 6 5 4 3 2 1 0][7 6 5 4 3 2 1 0]
                                ^
                                this one!

Now, a tree that traverses keys in this bit order can certainly be consistent with itself, but it won't be intuitive to use and will require non-standard verification logic. In terms of intuitiveness, consider two keys in binary representation: one starting with 0b0... and the other with 0b1.... You would expect the two keys to be on the left and right subtrees, respectively, from the root. Currently, this is not the case: keys on the left subtree begin with 0bxxxxxxx0... and keys on the right subtree with 0bxxxxxxx1.... In other words, leaves are not actually ordered by key (if you interpret the 32-byte key as a big-endian unsigned integer), which makes things confusing.

I propose we should instead traverse both bits and bytes in the same order, and ideally most-significant-first (i.e. most significant on the left). This change would be backwards incompatible.
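The difference between the two orders can be illustrated with two small accessors. These are sketches written for this issue, not the library's actual utils.go code:

```go
package main

import "fmt"

// getBitLSBWithinByte mirrors the current scheme described above: bytes
// are walked left-to-right but bits within each byte are indexed
// least-significant-first, so bit 0 of a key is the LAST bit of byte 0.
func getBitLSBWithinByte(data []byte, position int) int {
	return int(data[position/8]>>(uint(position)%8)) & 1
}

// getBitMSBFirst is the proposed order: bit 0 is the most significant
// bit of byte 0, so keys sort left-to-right as big-endian integers.
func getBitMSBFirst(data []byte, position int) int {
	return int(data[position/8]>>(7-uint(position)%8)) & 1
}

func main() {
	key := []byte{0b10000000} // 0x80
	fmt.Println(getBitLSBWithinByte(key, 0)) // 0: reads the low bit of byte 0
	fmt.Println(getBitMSBFirst(key, 0))      // 1: reads the high bit of byte 0
}
```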

replace mapstore with tm-db

Proposal:

Replace the mapstore with tm-db. tm-db supports 5 different key value stores. This change is trivial and I can pick it up if it's accepted.

tm-db

(*SparseMerkleTree).HasDescend should firstly check the error from GetDescend before value comparisons

A minor nit, but if we look at the code in

smt/deepsubtree.go

Lines 123 to 126 in 2c90766

func (smt *SparseMerkleTree) HasDescend(key []byte) (bool, error) {
	val, err := smt.GetDescend(key)
	return !bytes.Equal(defaultValue, val), err
}

it can potentially return (true, non-nil error), which is an odd return composite; we'd expect either:
(true, nil error)
(false, non-nil error)

If at any point defaultValue changes, the current assumption that bytes.Equal((*bytes)(nil), []byte{}) holds will fail, so it is safer to foolproof that code by checking the error from GetDescend first.

DeleteForRoot is not right

DeleteForRoot does not work correctly: after calling DeleteForRoot, the value can still be retrieved. In fact, the MapStore's Delete method is never called by DeleteForRoot.

Add contribution guidelines

Besides the usual foo, clarify which parts of the API are supposed to be non-breaking. Additionally, clarify how contributors can still propose breaking changes (and what increases their chances to land such changes in this library).
If there are any relevant specs contributors should be aware of, these should be mentioned in the contribution guidelines, too.

SMT v0.4 potential wishlist

Storage

  • For value store, store the values by key, rather than path. (This will allow iterating over keys efficiently, as may be expected in Cosmos SDK store v1, though not for historical trees. We should determine if this is sufficient.)
  • Store nodes as an n-ary tree using tiles, allowing for more efficient updates to the tree.
  • Store nodes in memory as an actual node tree, and only recompute the root when the tree is committed.

Versioning and pruning

  • Enable the option to keep the historical versions of the tree.
  • Add pruning support (i.e. only keep X historical versions) using reference-counting and a "deletion pending" database.
  • Add rollback support using a journal.

Note: this would allow for O(1) reading of keys from the latest tree, but O(log(n)) for historical trees where n is the keys in the tree, unless versioning is also implemented for the value store. Note that IAVL's reads are always O(log(n)) anyway. Furthermore, rollbacks would be O(o) where o is the number of operations being rolled back (this is the same as Ethereum's state trie). We should understand Cosmos SDK store requirements for this.

Batching

  • Add support for batch writes/commit. (#13)

Documentation

  • Document storage layer architecture if tiling or pruning is added.
  • Document tree algorithms in more detail than current specs, or move specs here.

Remove or move bench test

Does it make sense to have a benchmark test as part of the main test suite? It doesn't test any software assurances, and it just makes running the tests slower.

IMO this should be removed, or moved to a different package.

#42

Tree sharding to process updates in parallel

At the moment, the tree can only safely handle one update at a time. It would be desirable to shard the tree into multiple subtrees and allow parallel updates to the subtrees.

sanityCheck: only calculate siblingHash if len(proof.SideNodes) > 0, but also return early and unnest

If we look at the code in here

smt/proofs.go

Lines 44 to 54 in a99c0f5

// Check that the sibling data hashes to the first side node if not nil
if proof.SiblingData != nil {
	siblingHash := th.digest(proof.SiblingData)
	if proof.SideNodes != nil && len(proof.SideNodes) > 0 {
		if !bytes.Equal(proof.SideNodes[0], siblingHash) {
			return false
		}
	}
}
return true

Notice

  • We are calculating a cryptographic hash even when the len(proof.SideNodes) checks could cause an early return

    smt/proofs.go

    Lines 46 to 48 in a99c0f5

    siblingHash := th.digest(proof.SiblingData)
    if proof.SideNodes != nil && len(proof.SideNodes) > 0 {
    	if !bytes.Equal(proof.SideNodes[0], siblingHash) {
  • Those nested conditions could be further simplified from

Before

smt/proofs.go

Lines 45 to 52 in a99c0f5

if proof.SiblingData != nil {
	siblingHash := th.digest(proof.SiblingData)
	if proof.SideNodes != nil && len(proof.SideNodes) > 0 {
		if !bytes.Equal(proof.SideNodes[0], siblingHash) {
			return false
		}
	}
}

After

        // Check that the sibling data hashes to the first side node if not nil
        if proof.SiblingData == nil {
                return true
        }
        
        if len(proof.SideNodes) == 0 {
                return true
        }
        
        siblingHash := th.digest(proof.SiblingData)
        return bytes.Equal(proof.SideNodes[0], siblingHash)
