topling / toplingdb

ToplingDB is a cloud native LSM Key-Value Store with searchable compression algo and distributed compaction

License: GNU General Public License v2.0

CMake 0.38% Makefile 0.80% Python 1.56% Shell 0.96% Perl 0.99% PowerShell 0.06% C++ 84.18% C 1.64% Java 9.32% Dockerfile 0.01% Assembly 0.06% BitBake 0.03%
kvstore rocksdb distributed-database compaction nosql database

toplingdb's Introduction

ToplingDB: A Persistent Key-Value Store for External Storage

ToplingDB is developed and maintained by Topling Inc. It is built on top of RocksDB. See ToplingDB Branch Name Convention.

ToplingDB's submodule rockside is the entry point of ToplingDB; see the SidePlugin wiki.

ToplingDB has many key features beyond RocksDB:

  1. SidePlugin enables users to write a json (or yaml) file to define DB configs
  2. Embedded Http Server enables users to view almost all DB info on the web; this is a component of SidePlugin
  3. Embedded Http Server enables users to change db/cf options and all db meta objects (such as MemTableFactory, TableFactory, WriteBufferManager ...) online, without restarting the running process
  4. Many improvements and refactorings of RocksDB, aimed at performance and extensibility
  5. Topling transaction lock management, 5x faster than rocksdb's
  6. MultiGet with concurrent IO by fiber/coroutine + io_uring, much faster than RocksDB's async MultiGet
  7. Topling de-virtualization: de-virtualizes hotspot (virtual) functions and adds key prefix caches; benchmarks
  8. Topling zero copy for point search (Get/MultiGet) and Iterator
  9. Builtin SidePlugins for existing RocksDB components (Cache, Comparator, TableFactory, MemTableFactory ...)
  10. Builtin Prometheus metrics support; this is based on the Embedded Http Server
  11. Many bugfixes for RocksDB; a small part of these fixes was pull-requested to upstream RocksDB
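As an illustration of item 1 above, a SidePlugin DB config is just a yaml (or json) file. The snippet below is a hypothetical sketch only: the key names are placeholders I made up, not the real schema; see the SidePlugin wiki and the files under sideplugin/rockside/sample-conf/ for working configs.

```yaml
# Hypothetical sketch only: key names below are placeholders, NOT the real
# SidePlugin schema; consult the SidePlugin wiki for the actual format.
http:
  listening_ports: '2011'   # embedded web server (features 2 and 3)
MemTable:
  default:
    class: cspp             # select a MemTable plugin by name
databases:
  mydb:
    path: /tmp/mydb         # DB directory
```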

ToplingDB cloud native DB services

  1. MyTopling(MySQL on ToplingDB), MyTopling on aliyun
  2. Todis(Redis on ToplingDB)

ToplingDB Components

With the SidePlugin mechanism, plugins/components can be physically separated from core toplingdb:

  1. They can be compiled into a separate dynamic lib and loaded at runtime
  2. User code needs no changes; just change the json/yaml files
  3. Topling's non-open-source enterprise plugins/components are delivered this way

Repository dir structure

toplingdb
 \__ sideplugin
      \__ rockside                 (submodule , sideplugin core and framework)
      \__ topling-zip              (auto clone, zip and core lib)
      \__ cspp-memtab              (auto clone, sideplugin component)
      \__ cspp-wbwi                (auto clone, sideplugin component)
      \__ topling-sst              (auto clone, sideplugin component)
      \__ topling-rocks            (auto clone, sideplugin component)
      \__ topling-zip_table_reader (auto clone, sideplugin component)
      \__ topling-dcompact         (auto clone, sideplugin component)
           \_ tools/dcompact       (dcompact-worker binary app)
Repository Permission Description (and components)
  • ToplingDB (public): Top repository, forked from RocksDB with our fixes, refactorings and enhancements
  • rockside (public): A submodule that contains:
      • SidePlugin framework and builtin SidePlugins
      • Embedded Http Server and Prometheus metrics
  • cspp-wbwi (WriteBatchWithIndex, public): With CSPP and careful coding, CSPP_WBWI is 20x faster than rocksdb's SkipList-based WBWI
  • cspp-memtable (public): (CSPP is Crash Safe Parallel Patricia trie) A CSPP MemTable which outperforms SkipList in all aspects: 3x lower memory usage, 7x single-thread performance, perfect multi-thread scaling
  • topling-sst (public): 1. SingleFastTable (designed for L0 and L1); 2. VecAutoSortTable (designed for MyTopling bulk_load); 3. Deprecated: ToplingFastTable, CSPPAutoSortTable
  • topling-dcompact (public): Distributed Compaction with a general dcompact_worker application; offloads compactions to elastic computing clusters, much more powerful than RocksDB's Remote Compaction
  • topling-rocks (private): ToplingZipTable, an SST implementation optimized for RAM and SSD space, aimed at L2+ level compaction, using Topling's dedicated searchable in-memory data compression algorithms
  • topling-zip_table_reader (public): Lets community users read ToplingZipTable; the builder of ToplingZipTable is in topling-rocks

To simplify compilation, these repos are auto-cloned by ToplingDB's Makefile. Community users will auto-clone the public repos successfully but fail on the private ones, so ToplingDB is built without the private components; this is the so-called community version.

Run db_bench

ToplingDB requires C++17; gcc 8.3 or newer is recommended, and clang also works.

Even without ToplingZipTable, ToplingDB is much faster than upstream RocksDB:

sudo yum -y install git libaio-devel gcc-c++ gflags-devel zlib-devel bzip2-devel libcurl-devel liburing-devel
#sudo apt-get update -y && sudo apt-get install -y libjemalloc-dev libaio-dev libgflags-dev zlib1g-dev libbz2-dev libcurl4-gnutls-dev liburing-dev libsnappy-dev libbz2-dev liblz4-dev libzstd-dev
git clone https://github.com/topling/toplingdb
cd toplingdb
make -j`nproc` db_bench DEBUG_LEVEL=0
cp sideplugin/rockside/src/topling/web/{style.css,index.html} ${/path/to/dbdir}
cp sideplugin/rockside/sample-conf/db_bench_*.yaml .
export LD_LIBRARY_PATH=`find sideplugin -name lib_shared`
# edit db_bench_community.yaml to your needs
# 1. keep the default path (/dev/shm) if you have no fast disk (such as on a cloud server)
# 2. change max_background_compactions to your cpu core count
# 3. if you have permissions for the github repo topling-rocks, you can use db_bench_enterprise.yaml
# 4. db_bench with db_bench_community.yaml is faster than upstream RocksDB
# 5. db_bench with db_bench_enterprise.yaml is much faster than with db_bench_community.yaml
# the command option -json accepts both json and yaml files; a yaml file is used here for readability
./db_bench -json=db_bench_community.yaml -num=10000000 -disable_wal=true -value_size=20 -benchmarks=fillrandom,readrandom -batch_size=10
# you can access http://127.0.0.1:2011 to see webview
# you can see this db_bench is much faster than RocksDB

Configurable features

For performance and simplicity, ToplingDB disables some RocksDB features by default:

Feature Control MACRO
Dynamic creation of ColumnFamily ROCKSDB_DYNAMIC_CREATE_CF
User level timestamp on key TOPLINGDB_WITH_TIMESTAMP
Wide Columns TOPLINGDB_WITH_WIDE_COLUMNS

Note: Dynamic creation of ColumnFamily is not supported by SidePlugin

To enable these features, add -D${MACRO_NAME} to the make variable EXTRA_CXXFLAGS; for example, to build ToplingDB for Java with dynamic ColumnFamily creation:

make -j`nproc` EXTRA_CXXFLAGS='-DROCKSDB_DYNAMIC_CREATE_CF' rocksdbjava

License

To conform to the open source license, the following term disallowing bytedance was deleted on 2023-04-24; that is to say, bytedance using ToplingDB is no longer illegal and is not a shame.

The deleted term read: "We disallow bytedance using this software"; all other terms are identical to the upstream rocksdb license, see LICENSE.Apache, COPYING and LICENSE.leveldb.

The terms disallowing bytedance were also deleted from LICENSE.Apache, COPYING and LICENSE.leveldb.




RocksDB: A Persistent Key-Value Store for Flash and RAM Storage

CircleCI Status

RocksDB is developed and maintained by Facebook Database Engineering Team. It is built on earlier work on LevelDB by Sanjay Ghemawat ([email protected]) and Jeff Dean ([email protected])

This code is a library that forms the core building block for a fast key-value server, especially suited for storing data on flash drives. It has a Log-Structured-Merge-Database (LSM) design with flexible tradeoffs between Write-Amplification-Factor (WAF), Read-Amplification-Factor (RAF) and Space-Amplification-Factor (SAF). It has multi-threaded compactions, making it especially suitable for storing multiple terabytes of data in a single database.

Start with example usage here: https://github.com/facebook/rocksdb/tree/main/examples

See the github wiki for more explanation.

The public interface is in include/. Callers should not include or rely on the details of any other header files in this package. Those internal APIs may be changed without warning.

Questions and discussions are welcome on the RocksDB Developers Public Facebook group and email list on Google Groups.

License

RocksDB is dual-licensed under both the GPLv2 (found in the COPYING file in the root directory) and Apache 2.0 License (found in the LICENSE.Apache file in the root directory). You may select, at your option, one of the above-listed licenses.

toplingdb's People

Contributors

adamretter, agiardullo, ajkr, akankshamahajan15, cbi42, dhruba, emayanke, fyrz, haoboxu, hx235, igorcanadi, islamabdelrahman, jay-zhuang, joelmarcey, lightmark, liukai, ltamasi, maysamyabandeh, mdcallag, miasantreble, mrambacher, pdillinger, riversand963, rockeet, rven1, sagar0, siying, yhchiang, yuslepukhin, zhichao-cao


toplingdb's Issues

SstFileWriter: Allow unsorted input Key Value sequence

In MyTopling, bulk load is relaxed to allow unsorted input; this is implemented using Topling's VecAutoSortTable.

Using VecAutoSortTable, we avoid MyRocks's MergeTree, which is very slow.

Relevant commits

1e4fbdc leipeng 2022-10-04 12:59:37 +0800 Add TableBuilder::GetBoundaryUserKey() for sst file writer
6a85877 leipeng 2022-10-04 11:59:03 +0800 sst_file_writer.cc: auto sort assert(internal_comparator.IsBytewise())
db3cb9e leipeng 2022-09-20 11:29:31 +0800 db_iter.cc,sst_file_writer.cc,write_batch_with_index_internal.cc: ROCKSDB_FLATTEN, final, UNLIKELY
4775f2b leipeng 2022-09-19 19:27:13 +0800 sst_file_writer.cc: for TOPLINGDB_WITH_TIMESTAMP
592f75b leipeng 2022-09-19 14:01:19 +0800 sst_file_writer: AddImpl: use alloca & SetInternalKey(char* buf, ...)
9327631 leipeng 2022-07-21 15:06:55 +0800 Merge branch 'sideplugin-7.04.0-415200d7' into sideplugin-7.06.0-a0c63083
e949439 leipeng 2022-06-19 17:16:15 +0800 Add fixed_value_len, details --
aa3e822 leipeng 2022-06-17 17:21:08 +0800 SstFileWriter: adapt AutoSort TableFactory - use EstimatedFileSize
92198bb leipeng 2022-06-17 17:17:15 +0800 SstFileWriter: adapt AutoSort TableFactory
4c3f449 rockeet 2016-09-18 13:30:43 +0800 Add TableBuilderOptions::level and relevant changes (#1335)

Add Trace support to web REST api

# StartTrace
curl -d '{"file": "trace.txt", "filter": "kTraceFilterNone"}' "http://somehost:port/db/mydb?cmd=StartTrace"
# EndTrace
curl -d '{}' "http://somehost:port/db/mydb?cmd=EndTrace"
Start End
StartTrace EndTrace
StartIOTrace EndIOTrace
StartBlockCacheTrace EndBlockCacheTrace

Build Issue

  • Following the steps in the wiki, an error happens:
port/port_posix.cc:31:10: fatal error: terark/util/fast_getcpu.hpp: No such file or directory
 #include <terark/util/fast_getcpu.hpp>
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
  • Besides, the MADV_COLD flag for madvise seems to be unavailable in Linux kernel version 5.4.0-120-generic. Which kernel version is appropriate? Or are there other dependencies?

RocksIterator can not get the data on secondary instance

Note: Please use Issues only for bug reports. For questions, discussions, feature requests, etc. post to dev group: https://groups.google.com/forum/#!forum/rocksdb or https://www.facebook.com/groups/rocksdb.dev

One of the RocksDB secondary instances can not get the correct data after 16 hours.

Expected behavior
start one primary and 3 secondary instances on k8s. Every instance should get the same result for the same query.

Actual behavior
after 16 hours, the primary and 2 secondary instances got 1000 records for a user; one instance got 500 records.

Steps to reproduce the behavior
start one primary and 3 secondary instances on k8s.
secondary instances invoke tryCatchUpWithPrimary every 3 seconds.
everything works fine at first; the items can be read correctly on the 3 secondary instances.
after one day, one secondary instance becomes abnormal and cannot read all the data compared to the other instances; no exception is thrown by tryCatchUpWithPrimary. Example: the other 3 instances (primary and the other 2 secondaries) can get 1000 records, but this instance can only get 500 records, and it cannot read the newest items inserted by the primary either.
first seen on Java RocksdbJni 8.8.1; downgrading to 6.29.4 gave the same result.
env:
Java + Spring Boot
aws k8s/aws efs
Rocksdb version: 6.29.4
Java RocksdbJni: 6.29.4.1

Any suggestions for this issue?

I filed this issue on RocksDB as well; I'm also posting it in the Chinese community. Has anyone run into a similar situation?

ForwardIterator: used newly added VersionStorageInfo::FindFileInRange

This was a PR to upstream facebook/rocksdb#10592

Copied from facebook/rocksdb#10592

  1. FindFileInRange is defined in version_set.cc; ForwardIterator should use it to reduce code duplication
  2. The single remaining FindFileInRange can then be optimized in one place
    • We have optimized FindFileInRange via devirtualization and prefix caching for known Comparators (currently just BytewiseComparator and ReverseBytewiseComparator), with a 30x+ speed-up; if rocksdb likes this optimization, I'll create a PR.

Relevant commits

aaff2e1

ToplingZipTable Builder: support distributed compressing

Compaction needs a CompactionFilter, which may use DB::Get for metadata (as in pika/todis/kvrocks). In distributed compaction, the compact_worker has no DB object and thus cannot support such compactions.

ToplingZipTable's builder uses two-pass scanning: it saves decompressed kv data into tmp files and reads the data back from the tmp files in the second pass. Thus we can run the first pass on the DB side (local compaction) and run the second pass in the compaction worker to compress the data -- compressing consumes 80+% of CPU time for ToplingZipTable.

MergingIterator: use union for minHeap_ and maxHeap_

This is a PR to upstream facebook/rocksdb#9035

Copied from facebook/rocksdb#9035

The current code holds maxHeap_ as a unique_ptr; while maxHeap_ is in use, minHeap_ is still a live object consuming memory.

This PR uses a C++ union for minHeap_ and maxHeap_, which reduces memory usage.

Since MergingIterator is a very low-level component, all existing tests passing ensures correctness.

Relevant commits

4ffd72b use union for minHeap_ and maxHeap_
4c277ab MergingIterator: rearrange fields to reduce paddings (#9024)

Feature Request: Add DB::ApproximateKeyAnchors

This feature request is picked from facebook/rocksdb#10888

Now RocksDB has TableReader::ApproximateKeyAnchors for sampling key boundaries for sub compaction.

It would be better to expose ApproximateKeyAnchors at the DB level for applications. For example:

In MyRocks, DDL operations such as create index can use this function to partition the input data for processing with multiple threads. (InnoDB has innodb_ddl_threads for this purpose)

I had filed a feature request for MyRocks about this feature: facebook/mysql-5.6#1245

Ubuntu 22 build error

echo 'Libs: -L${libdir}  -Wl,-rpath -Wl,'$ORIGIN' -lrocksdb' >> rocksdb.pc                                                                                                            [12/586]
echo 'Libs.private: -lterark-zbs-r -lterark-fsa-r -lterark-core-r ' >> rocksdb.pc
echo 'Cflags: -I${includedir} -march=haswell  -isystem third-party/gtest-1.8.1/fused-src' >> rocksdb.pc
echo 'Requires: ' >> rocksdb.pc
install -d /root/git/topling/lib
install -d /root/git/topling/lib/pkgconfig
for header_dir in ` "include/rocksdb" -type d`; do \
        install -d //usr/local/$header_dir; \
done
/usr/bin/bash: line 1: include/rocksdb: Is a directory
for header in ` "include/rocksdb" -type f -name *.h`; do \
        install -C -m 644 $header //usr/local/$header; \
done
/usr/bin/bash: line 1: include/rocksdb: Is a directory
for header in ; do \
        install -d //usr/local/include/rocksdb/`dirname $header`; \
        install -C -m 644 $header //usr/local/include/rocksdb/$header; \
done
install -d                                  //usr/local/include/topling
install -C -m 644 sideplugin/rockside/src/topling/json.h     //usr/local/include/topling
install -C -m 644 sideplugin/rockside/src/topling/json_fwd.h //usr/local/include/topling
install -C -m 644 sideplugin/rockside/src/topling/builtin_table_factory.h //usr/local/include/topling
install -C -m 644 sideplugin/rockside/src/topling/side_plugin_repo.h      //usr/local/include/topling
install -C -m 644 sideplugin/rockside/src/topling/side_plugin_factory.h   //usr/local/include/topling
install -d //usr/local/include/terark
install -d //usr/local/include/terark/io
install -d //usr/local/include/terark/succinct
install -d //usr/local/include/terark/thread
install -d //usr/local/include/terark/util
install -d //usr/local/include/terark/fsa
install -d //usr/local/include/terark/fsa/ppi
install -d //usr/local/include/terark/zbs
install -C -m 644 sideplugin/topling-zip/src/terark/*.hpp          //usr/local/include/terark
install -C -m 644 sideplugin/topling-zip/src/terark/io/*.hpp       //usr/local/include/terark/io
install -C -m 644 sideplugin/topling-zip/src/terark/succinct/*.hpp //usr/local/include/terark/succinct
install -C -m 644 sideplugin/topling-zip/src/terark/thread/*.hpp   //usr/local/include/terark/thread
install -C -m 644 sideplugin/topling-zip/src/terark/util/*.hpp     //usr/local/include/terark/util
install -C -m 644 sideplugin/topling-zip/src/terark/fsa/*.hpp      //usr/local/include/terark/fsa
install -C -m 644 sideplugin/topling-zip/src/terark/fsa/*.inl      //usr/local/include/terark/fsa
install -C -m 644 sideplugin/topling-zip/src/terark/fsa/ppi/*.hpp  //usr/local/include/terark/fsa/ppi
install -C -m 644 sideplugin/topling-zip/src/terark/zbs/*.hpp      //usr/local/include/terark/zbs
cp -ar sideplugin/topling-zip/boost-include/boost  //usr/local/include
Linking ... build/Linux-x86_64-g++-11.3-bmi2-1/rls/dcompact_worker.exe
g++  -Wl,-unresolved-symbols=ignore-in-shared-libs -o build/Linux-x86_64-g++-11.3-bmi2-1/rls/dcompact_worker.exe build/Linux-x86_64-g++-11.3-bmi2-1/rls/dcompact_worker.o -L../../../.. -lrock
sdb -L../../../topling-zip/build/Linux-x86_64-g++-11.3-bmi2-1/lib_shared -lterark-{zbs,fsa,core}-g++-11.3-r -lrt -lpthread
/usr/bin/ld: cannot find -lrocksdb: No such file or directory
/usr/bin/ld: cannot find -lterark-zbs-g++-11.3-r: No such file or directory
/usr/bin/ld: cannot find -lterark-fsa-g++-11.3-r: No such file or directory
/usr/bin/ld: cannot find -lterark-core-g++-11.3-r: No such file or directory
collect2: error: ld returned 1 exit status
make[1]: *** [exe-common.mk:325: build/Linux-x86_64-g++-11.3-bmi2-1/rls/dcompact_worker.exe] Error 1
make[1]: Leaving directory '/root/git/topling/toplingdb/sideplugin/topling-dcompact/tools/dcompact'

Add enum_reflection.h & preproc.h

This was a PR to upstream facebook/rocksdb#10665

2 years ago, I created PR facebook/rocksdb#7081, which failed on old MSVC and thus was rejected by rocksdb.

Now that rocksdb has upgraded to C++17, PR facebook/rocksdb#7081 will no longer fail in CI, and we can incrementally migrate the existing enum/string conversion code to this enum reflection.

Below is a brief introduction to enum reflection (copied from PR facebook/rocksdb#7081):


With enum reflection, we can convert enum to/from string

For example:

ROCKSDB_ENUM_PLAIN(CompactionStyle, char,
  kCompactionStyleLevel = 0x0,
  kCompactionStyleUniversal = 0x1,
  kCompactionStyleFIFO = 0x2,
  kCompactionStyleNone = 0x3 // comma(,) can not be present here
);
assert(enum_name(kCompactionStyleUniversal) == "kCompactionStyleUniversal");
assert(enum_name(CompactionStyle(100)).size() == 0);
CompactionStyle cs= kCompactionStyleLevel;
assert(enum_value("kCompactionStyleUniversal", &cs) && cs == kCompactionStyleUniversal);
assert(!enum_value("bad", &cs) && cs == kCompactionStyleUniversal); // cs is not changed

There are 4 macros to define an enum with reflection

// plain old enum defined in a namespace(not in a class/struct)
ROCKSDB_ENUM_PLAIN(EnumType, IntRep, e1 = 1, e2 = 2);
// this generates:
enum EnumType : IntRep { e1 = 1, e2 = 2 };
// enum reflection supporting code ...
// ...
// the supporting code makes template function enum_name and
// enum_value works for this EnumType

// other three macros are similar with some difference:

// enum class defined in a namespace(not in a class/struct)
ROCKSDB_ENUM_CLASS(EnumType, IntRep, Enum values ...);

// plain old enum defined in a class/struct(not in a namespace)
ROCKSDB_ENUM_PLAIN_INCLASS(EnumType, IntRep, Enum values ...);

// enum class defined in a class/struct (not in a a namespace)
ROCKSDB_ENUM_CLASS_INCLASS(EnumType, IntRep, Enum values ...);

Comparator: Add func: IsBytewiseComparator & IsReverseBytewiseComparator

This was a PR to upstream facebook/rocksdb#10645

Copied from facebook/rocksdb#10645

Virtual function calls to the comparator are very frequent and thus a hot spot.
In most use cases, the default BytewiseComparator or ReverseBytewiseComparator is used.
This PR provides the basic support for our later PRs for FindFileInRange and MergingIterator:

devirtualize such virtual function calls for BytewiseComparator and ReverseBytewiseComparator
add a prefix cache to omit most memcmp calls and indirect memory accesses to the key
Performance of FindFileInRange was improved 20x+, and MergingIterator 3x+.

see PR: facebook/rocksdb#10646 FindFileInRange devirtualization and prefix cache

The MergingIterator PR depends on PR facebook/rocksdb#9035, so we will create it later.

Relevant commits

331715c leipeng 2022-09-19 19:26:36 +0800 Add Comparator::opt_cmp_type()
9327631 leipeng 2022-07-21 15:06:55 +0800 Merge branch 'sideplugin-7.04.0-415200d7' into sideplugin-7.06.0-a0c63083
27a169f leipeng 2022-06-20 21:40:41 +0800 IsBytewiseComparator: optimize add cmp type to Comparator
f033dac leipeng 2022-06-16 22:56:59 +0800 Add IsReverseBytewiseComparator()
5c62088 leipeng 2022-06-09 18:37:24 +0800 Merge branch 'sideplugin-7.01.0-a5e51305' into sideplugin-7.03.0-f85b31a2
b65e06f leipeng 2022-03-30 11:21:17 +0800 Merge branch 'sideplugin-6.28.0-677d2b4a' into sideplugin-7.01.0-a5e51305
de07870 leipeng 2021-12-31 14:35:01 +0800 Merge branch 'sideplugin-6.26.0-28bab0ef' into sideplugin-6.28.0-677d2b4a
8287e70 leipeng 2021-12-11 12:53:05 +0800 Move IsBytewiseComparator ... from topling-rocks to toplingdb repo

Omit compare userkey in memtable get

CSPPMemTable does not need to compare userkeys to detect a switch to a different userkey, because CSPPMemTable is realized in 2 dimensions; it knows userkey boundaries naturally.

relevant commits:
9980087 leipeng 2023-04-24 21:21:17 +0800 memtable.cc: SaveValue: omit load ucmp if possible - fix comment
0541a65 leipeng 2023-04-24 21:05:34 +0800 Add MemTableRep::NeedsUserKeyCompareInGet() and relavant changes

set level0_file_num_compaction_trigger=-1 should disable intra L0 compaction

Expected behavior

set level0_file_num_compaction_trigger=-1 should disable intra L0 compaction

Actual behavior

set level0_file_num_compaction_trigger=-1 triggers infinite write stop

Steps to reproduce the behavior

set write_buffer_size=1G and target_file_size_base=1M, then write data to DB.


RocksDB will schedule intra-L0 compactions in some cases. ToplingDB now treats level0_file_num_compaction_trigger=-1 as a flag to disable intra-L0 compactions, but it triggers this infinite write stop.

Omit L0 Flush

Description

Here, Omit L0 Flush means: definitely reduce IO, memory and CPU. Value content should not be stored in the MemTable; instead, store the value's offset (in the WAL log) and size in the MemTable. So the WAL also needs to be mmap'ed. The complexity is:

  1. some single WAL log entries contain padding bytes when the entry straddles a page boundary
    • so we need to implement a new WAL log format which has no padding in any single entry
    • truncate and mmap the WAL log file during its creation
  2. rocksdb can have multiple column families, which share the WAL log but do not share MemTables and SSTs
    • SST, MemTable and WAL log mapping and management are required
  3. many changes to the DB write code path are required

Related feature

We have realized the feature Convert MemTable to L0 SST. This feature needs MemTableRep to implement a new method ConvertToSST; CSPPMemTab realizes it by writing its data to a file mmap.

The issue is: to be reliable, writing data to a file mmap does not reduce IO; it just spreads the IO pressure evenly over the lifetime of the MemTable.

In the best case, we set CSPPMemTab.sync_sst_file=false and let the operating system perform the sync as appropriate; then, when the file is deleted after an L0->L1 compaction while the corresponding page caches have not yet been written back to the device, that write-back can be omitted.

Lazy Load Value for DBIter::Prev

Now DBIter::Next supports lazy value loading, but DBIter::Prev does not when the first visible entry is of kValueType.

DBIter::Prev needs to call the underlying iter->Prev to reach the position of the first visible kValueType entry; this requires backing up iter->value(), which prevents lazy loading.

ToplingZipTable can load a value by ValueID, so we can back up the ValueID instead of the value content to realize lazy loading. -- If zero copy is applicable, lazy loading is not needed.

autovector: performance improves

This was a PR to upstream facebook/rocksdb#10230

Copied from facebook/rocksdb#10230

  1. use a union member values_ instead of a pointer values_ pointing to the internal buf_
    a. this reduces the autovector object size
    b. this removes one memory load for the pointer values_
    c. this makes autovector relocatable (memmove-able) because no pointer points into the object itself
  2. rearrange fields to be CPU-cache friendly
  3. two exception-safety fixes
  4. delete ~iterator_impl
  5. this PR also fixes two bugs (operator= and assign) where old code could call operator= on uninitialized memory

Relevant commits

8b353cf leipeng 2022-07-30 23:06:18 +0800 autovector.h: perf improve & exception-safe fix
58d069c leipeng 2022-06-25 13:31:18 +0800 autovector: optimize front() and back()
f08745c leipeng 2022-06-24 17:15:50 +0800 autovector.h: fix a typo destory -> destroy
cfc7f1a leipeng 2022-06-23 16:27:27 +0800 autovector: add missing std::move
a4ab12e leipeng 2022-06-23 16:23:50 +0800 autovector: optimize copy-cons & move-cons
8216629 leipeng 2022-06-22 22:26:16 +0800 autovector.h: pick fixes from pull request to rocksdb
58d43b3 leipeng 2022-06-22 21:20:56 +0800 MemTable::Get: mark as attribute flatten
505b5b2 leipeng 2022-06-22 21:10:51 +0800 autovector: performance improves
7ae3109 leipeng 2022-06-22 19:39:20 +0800 autovector.h: add cons with initial size
c0aad3c leipeng 2021-09-27 18:07:20 +0800 Add autovector::reserve()

Add WBWIIterator to MergingIterator

The Transaction DB currently uses BaseDeltaIterator, but it needs to compare the base_iter key and delta_iter key on each Next()/Prev(), which wastes CPU.

We can add the delta iter to the (heap of the) underlying MergingIterator of DBIter, thus improving performance -- the delta iter is unlikely to reach the heap top.

MergingIterator.inline.comparator

This is a PR to upstream facebook/rocksdb#10151

Copied from facebook/rocksdb#10151

MergingIterator is a performance-critical class; one of the CPU hot spots is the InternalComparator & UserComparator, which are both called through virtual functions.
Since a rocksdb Comparator has a Name, this PR inlines the bytewise comparator code by checking the comparator Name.

This PR is based on PR facebook/rocksdb#9035

Relevant commits

9f20ffc merging_iterator.cc: fix FORCE_INLINE
02458c4 merging_iterator.cc: add override
6bb244c merging_iterator.cc: ignore forceinline fail
7314205 merging_iterator.cc: format code
b91733d MergingIterator inline bytewise comparator

Feature Request: Iterator::Refresh() with a snapshot

This feature request is picked from facebook/rocksdb#10487

The current Iterator::Refresh() does not support snapshots, so we have no way to refresh an iterator to a specified snapshot; instead we must create a new iterator, but creating a new iterator is heavy.

By refreshing an iterator to a specified snapshot, we can avoid creating a new iterator, thus improving performance.

CompactionJob::FinishCompactionOutputFile: sync FileMeta with TableProperties

This is PR to upstream facebook/rocksdb#9018

Copied from facebook/rocksdb#9018

The output file meta may be used later.

We have an in-house branch of rocksdb which added a compaction.output.file.raw.size histogram, using meta.raw_key_size + meta.raw_value_size as the histogram value. We then saw the histogram was always zero, which was caused by this bug.

Relevant commits

2e93acd CompactionJob::FinishCompactionOutputFile: sync FileMeta with TableProperties

version 7.09: MergingIterator devirtualization and prefix cache

In branch sideplugin-7.09.0-2022-10-27-5fef34fd, MergingIterator uses upstream RocksDB's version, because MergingIterator was greatly changed in 2022-09 to speed up DeleteRange, so it is hard to merge with ToplingDB's devirtualization and prefix cache.

Prior to devirtualization and prefix cache, ToplingDB used a union to store minHeap and maxHeap; this is also an improvement over upstream RocksDB.

This issue is a task to accommodate ToplingDB's MergingIterator:

  1. union of minHeap and maxHeap
  2. devirtualization
  3. key prefix cache

Reduce max/min key computation in SstFileWriter

VecAutoSortTable needs to compute the max/min key for building an SST.

SstFileWriter also needs the max/min key for FileMetaData.

These two computations are redundant; we remove the computation from SstFileWriter and fetch the result after TableBuilder::Finish(), at which point VecAutoSortTable returns the computed result.

Relevant commits

1e4fbdc leipeng 2022-10-04 12:59:37 +0800 Add TableBuilder::GetBoundaryUserKey() for sst file writer

and relevant commits in topling-rocks:

  1. GetBoundaryUserKey: Add missing return Status::OK()
  2. AutoSort: Add override GetBoundaryUserKey

Improve MultiCFSnapshot

This was a PR to upstream facebook/rocksdb#10365

Copied from facebook/rocksdb#10365

struct MultiGetColumnFamilyData was defined in db_impl.h, and the function MultiCFSnapshot had a param iter_deref_func which can be optimized out.

This PR moves struct MultiGetColumnFamilyData into an anonymous namespace in db_impl.cc and deletes the param iter_deref_func; this change both improves performance and greatly simplifies the code.

This PR also hoists read_options.timestamp out of the loop.

Relevant commits

656481c MultiGet: simplify and improve MultiCFSnapshot

The problem of compilation failure after importing `port/likely.h`

When compiling my demo, I found that the installed public headers include port/likely.h, even though upstream rocksdb has removed this dependency. This forces users to add another include search path (compare facebook/rocksdb#2008). Is this behavior expected?

Expected behavior

demo CMakeLists.txt:

......
find_path(ROCKSDB_INCLUDE_DIR rocksdb/db.h)
include_directories(${ROCKSDB_INCLUDE_DIR})

add_executable(rocksdb_test script.cc)
target_link_libraries(rocksdb_test rocksdb lz4 -lpthread -lz -lsnappy -lbz2 -lzstd -ldl)

Including rocksdb/db.h should compile successfully.

Actual behavior

In file included from /workspaces/toplingdb/script/main.cc:1:
In file included from /usr/local/include/rocksdb/db.h:21:
In file included from /usr/local/include/rocksdb/listener.h:15:
In file included from /usr/local/include/rocksdb/advanced_options.h:13:
In file included from /usr/local/include/rocksdb/cache.h:30:
In file included from /usr/local/include/rocksdb/compression_type.h:9:
In file included from /usr/local/include/rocksdb/enum_reflection.h:4:
/usr/local/include/rocksdb/preproc.h:473:10: fatal error: 'port/likely.h' file not found
#include "port/likely.h"
         ^~~~~~~~~~~~~~~
1 error generated.
make[2]: *** [CMakeFiles/rocksdb_test.dir/build.make:76: CMakeFiles/rocksdb_test.dir/main.cc.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:83: CMakeFiles/rocksdb_test.dir/all] Error 2
make: *** [Makefile:91: all] Error 2

Steps to reproduce the behavior

Include rocksdb/db.h (and write a simple Put/Get), then build your project with cmake .. && make -j20.
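A hedged workaround, assuming the missing header lives in the ToplingDB source tree: since the installed rocksdb/preproc.h includes the internal header port/likely.h, put the source tree itself on the include path (the path below is illustrative; adjust it to your checkout).

```cmake
find_path(ROCKSDB_INCLUDE_DIR rocksdb/db.h)
include_directories(${ROCKSDB_INCLUDE_DIR})
# The installed headers reference internal headers such as port/likely.h,
# so the toplingdb source root must also be searched (hypothetical path):
include_directories(/path/to/toplingdb)

add_executable(rocksdb_test script.cc)
target_link_libraries(rocksdb_test rocksdb lz4 -lpthread -lz -lsnappy -lbz2 -lzstd -ldl)
```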

Git clone fails when we compile the project using the Makefile.

Dependent information

Branch: all
OS: Ubuntu 20.04

Expected behavior

The Makefile should clone dependent repositories via the https:// scheme, not the ssh scheme.

Actual behavior

In the toplingdb Makefile, git uses the ssh scheme to clone repositories. On a machine without an SSH key we get errors:

+ cd sideplugin
+ git clone git@github.com:topling/cspp-memtable
Cloning into 'cspp-memtable'...
git@github.com: Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
+ cd sideplugin
+ git clone git@github.com:topling/cspp-wbwi
Cloning into 'cspp-wbwi'...
git@github.com: Permission denied (publickey).
fatal: Could not read from remote repository.

Steps to reproduce the behavior

After resolving dependencies

cd toplingdb/
make -j 20
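One workaround until the Makefile switches schemes (a standard git feature, not a ToplingDB-specific fix): tell git to rewrite ssh-style GitHub URLs to https before fetching, so the Makefile's clones succeed without SSH keys.

```shell
export HOME="$(mktemp -d)"   # isolate the config for this demo; drop in real use
# After this, `git clone git@github.com:topling/cspp-memtable` fetches
# over https and needs no SSH key.
git config --global url."https://github.com/".insteadOf "git@github.com:"
# Verify the rewrite rule was recorded:
git config --global --get url."https://github.com/".insteadOf   # prints git@github.com:
```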

Report Fee Support

As a cloud-native DB service, distributed compaction needs fee charging, which involves some code changes.

RLE compression on column

For both Key and Value, the bytes at the same column position across different rows can be RLE-compressed, with a Rank-Select structure mapping rows to their RLE runs in the compressed form.

make check failed

Two Unit Tests failed.

Fixes

e12b69d Fix for UT: DBCompactionTestBlobError/DBCompactionTestBlobError.CompactionError/1
dda8b8e //ASSERT_GT(compaction_stats[1].bytes_written, 0); // ToplingDB, known issue

MergingIterator: add key_prefix cache

By using a key prefix cache, FindFileInRange gained a 10x+ speedup.

MergingIterator can also benefit from a key prefix cache.

Relevant commit: 8011db2 merging_iterator.cc: add key_prefix cache

Add ReadOptions::cache_sst_file_iter

This was a PR to upstream facebook/rocksdb#10593

Copied from facebook/rocksdb#10593

This PR resolves facebook/rocksdb#10591:

A DB Iterator is a heavy object and should be reused when possible; when a DB Iterator is reused, the underlying SST iterators should also be reused, so we added an SST iterator cache to LevelIterator.

Previously failing unit tests now all pass

With the env var CACHE_SST_FILE_ITER set to 1, ReadOptions::cache_sst_file_iter is set to true by default, so all unit tests exercise the cached-iterator path.

Two unit tests still failed because the SST file iterator is cached; I skip the corresponding ASSERTs when cache_sst_file_iter is set to 1.

Now all related bugs are fixed.

Relevant commits

d043644 cache_sst_file_iter: fix iter leak for cache_sst_file_iter = false
e86b8d7 cache_sst_file_iter: change relavent code being similar to upstream
e80fe1b cache_sst_file_iter: bugfix for pinned_iters_mgr_
96a20be LevelIterator: Add ReadOptions::cache_sst_file_iter

Enhancement: FixedLenKeyIndex: remove same bytes in the middle

We have already removed the common prefix in FixedLenKeyIndex.

In some cases there are also many common bytes in the middle of keys, for example:

In MyTopling, a secondary key has the primary key appended at the end, so a common prefix among the primary keys becomes common middle bytes of the secondary keys; these bytes can be removed too.

CREATE TABLE a (
  id BIGINT NOT NULL PRIMARY KEY AUTO_INCREMENT,
  ts DATETIME,
  INDEX(ts)
);

The encoded key of the secondary index on ts has the form ts(DATETIME) id(BIGINT), where id has many leading zero bytes.
