tikv / titan Goto Github PK

View Code? Open in Web Editor NEW

476.0 36.0 164.0 742 KB

A RocksDB plugin for key-value separation, inspired by WiscKey.

Home Page: https://pingcap.com/blog/titan-storage-engine-design-and-implementation/

License: Apache License 2.0

C++ 96.47% CMake 1.63% Shell 1.58% C 0.31%

database rocksdb

titan's Introduction

Titan: A RocksDB Plugin to Reduce Write Amplification

Titan is a RocksDB Plugin for key-value separation, inspired by WiscKey. For introduction and design details, see our blog post.

Build and Test

Titan relies on RocksDB source code to build. You need to checkout RocksDB source code locally, and provide the path to Titan build script.

# To build:
mkdir -p build
cd build
cmake ..
make -j<n>

# To specify custom rocksdb
cmake .. -DROCKSDB_DIR=<rocksdb_source_dir>
# or
cmake .. -DROCKSDB_GIT_REPO=<git_repo> -DROCKSDB_GIT_BRANCH=<branch>

# Build static lib (i.e. libtitan.a) only:
make titan -j<n>

# Release build:
cmake .. -DROCKSDB_DIR=<rocksdb_source_dir> -DCMAKE_BUILD_TYPE=Release

# Building with sanitizer (e.g. ASAN):
cmake .. -DROCKSDB_DIR=<rocksdb_source_dir> -DWITH_ASAN=ON

# Building with compression libraries (e.g. snappy):
cmake .. -DROCKSDB_DIR=<rocksdb_source_dir> -DWITH_SNAPPY=ON

# Run tests after build. You need to filter tests by "titan" prefix.
ctest -R titan

# To format code, install clang-format and run the script.
bash scripts/format-diff.sh

Compatibility with RocksDB

Current version of Titan is developed and tested with TiKV's fork of RocksDB 6.29. Another version that is based off TiKV's fork of RocksDB 6.4 can be found in the tikv-6.1 branch.

titan's People

Contributors

Stargazers

Watchers

Forkers

yiwu-arbug zz198808 connor1996 tabokie prcyangli glitterisme zhouyuan jiayuzzz little-wallace wangshuil lhy1024 phoenixhadoop yangkeao windshg plutolove gogoyao nrc sre-bot foeb wayslog yupan1010-h3c liumh8 haoxiang47 skyitachi xiaming9880 dorianzheng murolo fredchenbj conflux-chain gotoxu ti-srebot lday0321 dongdaoguang panda-sheep jiyingtk chux0519 anglenet lsmdb rpbear88 githubprogramman rickif dorokhov hicqu sticnarf hustergs zhenhangong lidaohang zhuwenshen beyyes sinhasantos yiqinxiong yuan-luo thegaram caipengbo sz-npe rainays hustwp233 baronstack zaorangyang zrealshadow perrynzhou ibixiong jasonjoo2010 wiltonlazary jayson-huang lihuibng cxytt apple-ouyang qysheida xiaoyong-z busyjay gzhuflyer pengpengsir borelset simkxa yiyanwannian dragonfruit9 compressionduck lroethan waynexia kevin-xianliu ethercflow distributed-system-lab paynie ysjyx7 revantk kinwaiyuen nsq974487195 o4thkeeper pdsl-dpu-kv llforever yzhang2495 chejinge ljluestc flowbehappy peiqi0714 yujuncen topscrew smxr v01dstar

titan's Issues

`TitanDBImpl::CloseImpl` dead on waiting for GC condvar

While use titandb_bench to bench titan, TitanDBImpl::CloseImpl cannot finish and is waiting for a CondVar.

Bug Command:

./titandb_bench --benchmarks=fillseq --use_existing_db=0 --sync=0 --db=./db --wal_dir=./wal --num=$((3*1024*1024)) --num_levels=6 --key_size=20 --value_size=400 --block_size=8192 --cache_size=1073741824 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=snappy --level_compaction_dynamic_level_bytes=true --bytes_per_sync=8388608 --cache_index_and_filter_blocks=0 --pin_l0_filter_and_index_blocks_in_cache=1 --benchmark_write_rate_limit=0 --hard_rate_limit=3 --rate_limit_delay_max_milliseconds=1000000 --write_buffer_size=134217728 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=1 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --use_titan=true --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=12 --level0_stop_writes_trigger=20 --max_background_jobs=20 --max_write_buffer_number=8 --min_level_to_compress=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --seed=1565596221

Backtrace:

#0  0x00007f63e204d0fc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00005582b8c1608d in rocksdb::port::CondVar::Wait() ()
#2  0x00005582b897c768 in rocksdb::titandb::TitanDBImpl::CloseImpl() ()
#3  0x00005582b897c98c in rocksdb::titandb::TitanDBImpl::Close() ()
#4  0x00005582b897e817 in rocksdb::titandb::TitanDBImpl::~TitanDBImpl() ()
#5  0x00005582b897eec1 in rocksdb::titandb::TitanDBImpl::~TitanDBImpl() ()
#6  0x00005582b89557d2 in rocksdb::Benchmark::~Benchmark() ()
#7  0x00005582b8945271 in rocksdb::db_bench_tool(int, char**) ()
#8  0x00005582b890a61e in main ()

It is waiting for bg_gc_scheduled_ condvar.

At the same time, background gc thread dead loop in

for (iter.Next();
      iterated_size < sample_size_window && iter.status().ok() && iter.Valid();
      iter.Next()) {
  BlobIndex blob_index = iter.GetBlobIndex();
  uint64_t total_length = blob_index.blob_handle.size;
  iterated_size += total_length;
  bool discardable = false;
  printf("SAMPLE HERE %d %" PRIu64 " %" PRIu64 "\n", size++, iterated_size, sample_size_window);
  s = DiscardEntry(iter.key(), blob_index, &discardable);
  if (!s.ok()) {
    return s;
  }
  if (discardable) {
    discardable_size += total_length;
  }
}

Maybe it's not dead, but it cannot stop in a reasonable time.

performance enhancements on RaftMsgCollector

3 machines :
CPU: 2 x Intel Xeon Gold 5118 @ 2.30GHz, 24 Cores 48 Threads in total
Memory: 196GB
SSD: 8 x INTEL SSDSC2KB96 (960GB)
topology:
one pd-server and on tikv-server on every machine, no tidb-servers
workload:
there are 500M keys, the values of key are from 1 to 500M. all the values are 1k size. 100K QPS to read.
titan engine is used.
profile:

from the flamegraph the 'Collect' method‘s width is far larger then 'rallocx' mothed which is revoked by itself.
i think it corresponds 'BatchCollector<Vec, RaftMessage> for RaftMsgCollector'.
so is precomputing the RaftMessage's size a good idea?

titan_stress failure

[root@b57112560f33 /]# /usr/local/titandb/bin/titandb_stress --ops_per_thread=10000 --threads=40 --value_size_mult=33 --set_options_one_in=1000 --compact_range_one_in=1000 --acquire_snapshot_one_in=1 --ingest_external_file_one_in=1000 --compact_files_one_in=0 --checkpoint_one_in=0 --backup_one_in=0 --nooverwritepercent=0 --delrangepercent=10  --delpercent=15 --readpercent=15 --writepercent=30 --iterpercent=15 --prefixpercent=15 --max_background_compactions=20 --max_background_flushes=20 --enable_pipelined_write=true --min_blob_size=64
2019/09/03-08:42:52  Initializing db_stress
RocksDB version           : 5.18
Format version            : 2
TransactionDB             : false
Column families           : 10
Clear CFs one in          : 1000000
Number of threads         : 40
Ops per thread            : 10000
Time to live(sec)         : unused
Read percentage           : 15%
Prefix percentage         : 15%
Write percentage          : 30%
Delete percentage         : 15%
Delete range percentage   : 10%
No overwrite percentage   : 0%
Iterate percentage        : 15%
DB-write-buffer-size      : 0
Write-buffer-size         : 67108864
Iterations                : 10
Max key                   : 1048576
Ratio #ops/#keys          : 0.381470
Num times DB reopens      : 10
Batches/snapshots         : 0
Do update in place        : 0
Num keys per lock         : 4
Compression               : Snappy
Checksum type             : kCRC32c
Max subcompactions        : 1
Memtablerep               : prefix_hash
Test kill odd             : 0
------------------------------------------------
DB path: [/tmp/rocksdbtest-0/dbstress]
Choosing random keys with no overwrite
Creating 2621440 locks
2019/09/03-08:42:52  Initializing worker threads
2019/09/03-08:42:52  Starting database operations
2019/09/03-08:42:54  Reopening database for the 1th time
DB path: [/tmp/rocksdbtest-0/dbstress]
2019/09/03-08:42:56  Reopening database for the 2th time
DB path: [/tmp/rocksdbtest-0/dbstress]
[CF 2] Dropping and recreating column family. new name: 10
delete error: Operation aborted: Column family drop
put or merge error: Operation aborted: Column family drop
terminate called recursively
put or merge error: Operation aborted: Column family drop
terminate called without an active exception
terminate called recursively
Aborted (core dumped)

[CF 2] Dropping and recreating column family. new name: 10
put or merge error: Operation aborted: Column family drop
terminate called without an active exception

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fff1ffff700 (LWP 123692)]
0x00007ffff6b76277 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56        return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
Missing separate debuginfos, use: debuginfo-install libatomic-4.8.5-36.el7_6.2.x86_64 scylla-libgcc73-7.3.1-1.2.el7.centos.x86_64 scylla-libstdc++73-7.3.1-1.2.el7.centos.x86_64 snappy-1.1.0-3.el7.x86_64
(gdb) bt
#0  0x00007ffff6b76277 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007ffff6b77968 in __GI_abort () at abort.c:90
#2  0x00007ffff74b8575 in __gnu_cxx::__verbose_terminate_handler() () from /opt/scylladb/lib64/libstdc++.so.6
#3  0x00007ffff74b6166 in ?? () from /opt/scylladb/lib64/libstdc++.so.6
#4  0x00007ffff74b61b1 in std::terminate() () from /opt/scylladb/lib64/libstdc++.so.6
#5  0x0000000000a797f6 in rocksdb::NonBatchedOpsStressTest::TestPut(rocksdb::(anonymous namespace)::ThreadState*, rocksdb::WriteOptions&, rocksdb::ReadOptions const&, std::vector<int, std::allocator<int> > const&, std::vector<long, std::allocator<long> > const&, char (&) [1048576], std::unique_ptr<rocksdb::MutexLock, std::default_delete<rocksdb::MutexLock> >&) ()
#6  0x0000000000a76aeb in rocksdb::StressTest::OperateDb(rocksdb::(anonymous namespace)::ThreadState*) ()
#7  0x0000000000a8d406 in rocksdb::StressTest::ThreadBody(void*) ()
#8  0x0000000000dbfae4 in rocksdb::(anonymous namespace)::StartThreadWrapper(void*) ()
#9  0x00007ffff77b8e25 in start_thread (arg=0x7fff1ffff700) at pthread_create.c:308
#10 0x00007ffff6c3ebad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Implement PrioritizedCache

Implement PrioritizedCache, which wraps a single LRUCache, but have additional API to return child Cache instances that:

HighPriCache: always insert into the cache with high-priority, regardless of user provided option
LowPriCache: always insert into the cache with low-priority, regardless of user provided option

Part of tikv/tikv#5742

Fix Mac build

Mac build is failing. Fixing it. Also add mac build to Travis.

Rate limit GC speed with rate_limter

Limit GC and Titan read IO with rate_limiter, to avoid burst of IO starve online read and write.

error when include titan library in my project

when I add "titan.h", I saw the below issue :
**

/usr/include/titan/options.h:6:10: fatal error: logging/logging.h: No such file or directory

**
Where is the location of "logging/logging.h" ?
Thanks!

OnFlushCompleted should update file discardable size

Currently we only update file discardable size in OnCompactionCompleted. We assume blob file data would not be discarded during flush, which is not true. Consider the case:

There's a key "foo" in blob file b1.
GC rewrite b1 into b2 and insert "foo" to memtable.
Before memtable is flushed, user delete "foo".
After flush, "foo" is removed from the output of flush, since it's deleted. The fact that "foo" is discarded is not reflected in b2's discardable size.

GC is not triggered as expected

titan/src/blob_gc_picker.cc

Line 47 in 4dc4ba8

estimate_output_size >= cf_options_.blob_file_target_size) {

titan/src/blob_gc_picker.cc

Line 69 in 4dc4ba8

if (blob_files.empty() || batch_size < cf_options_.min_gc_batch_size) {

In some cases, "estimate_output_size" is larger than "cf_options_.blob_file_target_size" , but "batch_size" is less than "cf_options_.max_gc_batch_size", then BasicBlobGCPicker will return nullptr, but there are many blob files whose "gc_score" is larger than "cf_options_.blob_file_discardable_ratio" and need to be gc in fact. In one of my tests, Titan use disk 190GB after writing some random data, then I remove the condition "estimate_output_size >= cf_options_.blob_file_target_size" and in the same test, Titan use disk 170GB.

Reduce locking for blob cache hit queries

Description

Currently we query blob cache in BlobFileReader. It is done after getting file metadata (BlobStorage::FindFile) and getting the file reader from BlobFileCache, both of which require mutex lock. If we move blob cache query to before those two calls, we saves the two mutex lock and other overheads for a cache hit query. Demo can be find here: https://github.com/yiwu-arbug/titan/tree/cache titandb_bench shows 6% improvement in throughput for in-memory workload.

Benchmark:

# Fill 10G DB
./titandb_bench --db="/dev/shm/titan_bench" --use_existing_db=false --titan_min_blob_size=0 --value_size=1024 --num=10000000 --compression_type=none --benchmarks="fillseq"
# Readrandom with 20G blob cache
./titandb_bench --db="/dev/shm/titan_bench" --use_existing_db=true --titan_min_blob_size=0 --value_size=1024 --num=10000000 --compression_type=none --benchmarks="readrandom" --threads=40 --duration=300 --titan_blob_cache_size=20000000000

Result:

master: readrandom   :      65.210 micros/op 613374 ops/sec;  608.4 MB/s (4599999 of 4599999 found)
demo: readrandom   :      61.398 micros/op 651462 ops/sec;  646.1 MB/s (4823999 of 4823999 found)

what's left to be done on top of the demo:

Do the same for Iterator path. The demo only change point-get path.
Allocate blob cache memory using memory allocator: https://github.com/facebook/rocksdb/blob/master/include/rocksdb/cache.h#L78
Remove OwnedSlice wrapper. Cache raw char* and use the deleter from CacheAllocationPtr as cache item deleter

Score

3000

SIG slack channel（must）:

#sig-engine

Mentor

@Connor1996

Recommended Skills：

c++

Learning Materials（optional）

Consider removing blob index deletion marker

As mentioned in the comment, before we use the deletion marker as a workaround for the empty result after merge. But actually, the result shouldn't be empty because it would expose stale versions of the key. Instead, it should output the delete type which provides two profits:

reduce code complexity
rocksdb compaction can help remove the delete record in the bottommost level instead of keeping the deletion marker forever

As we already make the merge operator support changing value type https://github.com/tikv/rocksdb/blob/6.4.tikv/include/rocksdb/merge_operator.h#L121,
seems we can remove the blob index deletion marker.

/cc @tabokie, not sure whether there was any other consideration before.
Also, please push facebook/rocksdb#6447 to be merged in upstream

build error

I use rocksdb rocksdb-6.0.2, when build titan, it gives me an error:
/proj/titan/src/util.cc:16:15: error: ‘const class rocksdb::CompressionContext’ has no member named ‘type’
*type = ctx.type();

So, I want to know which rocksdb version does titan use?

Turn Titan into read-only mode if encountered critical error

RocksDB will turn itself into read-only mode if it encounter unrecoverable error. We should do the same in Titan. Let's do so for the following case:

During compaction and GC when Titan generate new blob file, it will call version_set->LogAndApply() to log the existence of the file in manifest. If it return non-okay status, we should turn DB into read only mode. Moreover, we need to call OnBackgroundError() for all options.listeners, so that user of Titan can handle the error.

Please also add unit test for the change.

[Master Task] Reduce locking on read path

Currently for a point-lookup, Titan acquire 7 locks:

rocksdb db_mutex: to get snapshot
table cache mutex: to get block-based table reader
block cache mutex: to read blob index
titan db mutex: to get pointer to BlobStorage
BlobStorage mutex: to get blob file metadata
BlobFileCache mutex: to get file reader
blob cache mutex: to read cached blob

Compare with vanilla rocksdb, only 2 locks are acquired:

table cache mutex: to get block-based table reader
block cache mutex: to read blob index

It gives opportunities to optimize for CPU bounded scenarios. For example, by just reordering blob cache access to before getting blob file metadata, we get 6% throughput improvement in titandb_bench of in-memory workload. Here we list some of the ideas to reduce locking overhead:

Move blob cache access to `BlobStorage` level

#140

Currently we check blob cache in BlobFileReader. If we move the check to BlobStorage level, before BlobStorage::FindFile call, we can avoid BlobStorage mutex lock and BlobFileCache mutex lock for cache hit case.

The change is safe because, when we read a blob index from rocksdb, we assume the blob file it points to always exists (otherwise we should return Status::Corruption). So if we have a blob cache hit, we don't need to check blob file existence.

Move `BlobFileCache` check to before `BlobStorage::FindFile` call

Similarly, we can query BlobFileCache to get blob file reader, and if it's a hit, skip BlobStorage::FindFile call. That way for a cache hit (which is common) we saves a BlobStorage mutex lock.

Store `BlobStorage` pointer in `ColumnFamilyHandle`

rocksdb embed cfd pointer in ColumnFamilyHandleImpl. One of the benefit is, with the handle, a read doesn't need to acquire db_mutex to obtain cfd pointer. Similarly, we can embed BlobStorage pointer in the handle to save mutex lock to obtain BlobStorage pointer. We can define TitanColumnFamilyHandle as following and return it to caller. The struct is safe to be used as rocksdb::ColumnFamilyHandle when calling rocksdb methods.

struct TitanColumnFamilyHandle : public rocksdb::ColumnFamilyHandleImpl {
    std::shared_ptr<BlobStorage> blob_storage;
};

Avoid use `weak_ptr`

In the code we use the pattern a lot:

std::weak_ptr<Something> FindSomthing() {
    ...
    return ptr; // ptr is a std::shared_ptr
}

std::shared_ptr<Something> something = FindSomething().lock();

This is not necessary. We can have FindSomething return shared_ptr. Using weak_ptr have no benefit, and incur one more atomic ref-count operation.

Making `rocksdb::DBImpl::GetSnapshot` lock-free

It is also possible to make rocksdb GetSnapshot lock-free, though that's very involving.

Titan support write without WAL

Current GC implementation assume WAL is always enabled for both user writes and GC writes. However, if user disable WAL, it could lead to data inconsistency after GC.

Example:

There's two version of a key, (k, v1), and (k, v2). (k, v1) has been flushed and persisted in SST file and blob file b1. (k, v2) is in memtable.
A GC job kinks in and use b1 as input. It skip overwriting (k, v1) to new blob file, since there's a newer version of the key.
After the GC job, b1 is deleted.
db restart. Since there's no WAL, (k, v2) is dropped since its in memtable, which is expected. However, (k, v1) is missing since b1 is deleted, which is not expected.

We need to find a way to allow user write without WAL.

Segment fault when creates TitanCompactionFilter

version: master. I don't know how to reproduce it and the core dump is in [email protected]:/data2/core.133685

(gdb) bt
#0  0x0000555557099f06 in TitanCompactionFilter (skip_value=false, blob_storage=..., owned_filter=<synthetic pointer>,
    original=0x0, cf_name=..., db=<optimized out>, this=0x7fffd722a190)
    at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/d60144d/librocksdb_sys/libtitan_sys/titan/src/compaction_filter.h:27
#1  rocksdb::titandb::TitanCompactionFilterFactory::CreateCompactionFilter (this=0x7ffff6b90fd8, context=...)
    at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/d60144d/librocksdb_sys/libtitan_sys/titan/src/compaction_filter.h:162
#2  0x00005555571e61ef in rocksdb::Compaction::CreateCompactionFilter (this=<optimized out>)
    at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/d60144d/librocksdb_sys/rocksdb/db/compaction/compaction.cc:528
#3  0x00005555571fa3c1 in rocksdb::CompactionJob::ProcessKeyValueCompaction (this=this@entry=0x7fffd8d58fc0,
    sub_compact=0x7fffd723c000)
    at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/d60144d/librocksdb_sys/rocksdb/db/compaction/compaction_job.cc:820
#4  0x00005555571fc8a0 in rocksdb::CompactionJob::Run (this=this@entry=0x7fffd8d58fc0)
    at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/d60144d/librocksdb_sys/rocksdb/db/compaction/compaction_job.cc:590
#5  0x0000555556f1e1f1 in rocksdb::DBImpl::BackgroundCompaction (this=this@entry=0x7ffff6b5c400,
    made_progress=made_progress@entry=0x7fffd8d5942e, job_context=job_context@entry=0x7fffd8d59450,
    log_buffer=log_buffer@entry=0x7fffd8d59620, prepicked_compaction=prepicked_compaction@entry=0x0,
    thread_pri=thread_pri@entry=rocksdb::Env::LOW)
    at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/d60144d/librocksdb_sys/rocksdb/db/db_impl/db_impl_compaction_flush.cc:2759
#6  0x0000555556f22bb2 in rocksdb::DBImpl::BackgroundCallCompaction (this=this@entry=0x7ffff6b5c400,
    prepicked_compaction=prepicked_compaction@entry=0x0, bg_thread_pri=bg_thread_pri@entry=rocksdb::Env::LOW)
    at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/d60144d/librocksdb_sys/rocksdb/db/db_impl/db_impl_compaction_flush.cc:2317
#7  0x0000555556f230aa in rocksdb::DBImpl::BGWorkCompaction (arg=<optimized out>)
    at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/d60144d/librocksdb_sys/rocksdb/db/db_impl/db_impl_compaction_flush.cc:2092
#8  0x000055555722ae9a in rocksdb::ThreadPoolImpl::Impl::BGThread (this=this@entry=0x7fffe7292180,
    thread_id=thread_id@entry=1)
    at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/d60144d/librocksdb_sys/rocksdb/util/threadpool_imp.cc:266
#9  0x000055555722b08e in rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper (arg=0x7ffff6a3d4b0)
    at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/d60144d/librocksdb_sys/rocksdb/util/threadpool_imp.cc:307
#10 0x0000555557177330 in execute_native_thread_routine ()
#11 0x00007ffff7bc6ea5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007ffff72cd8dd in clone () from /lib64/libc.so.6

Titan db_bench fillrandom slower than rocksdb

The write patch of Titan is mostly the same as rocksdb (except for the TitanDB wrapper), so their performance should be similar. However titandb_bench show Titan is slower than rocksdb in fillrandom benchmark.

db_bench command:

/titandb_bench_6.4 --db=/data2/yiwu/db_bench --use_existing_db=false --benchmarks=fillrandom --num=10000000 --duration=300 --use_titan=true|false

with rocksdb 5.18 (commit 19ea115)
with titan

fillrandom   :       6.627 micros/op 150901 ops/sec;   16.7 MB/s

without titan

fillrandom   :       3.978 micros/op 251408 ops/sec;   27.8 MB/s

with rocksdb 6.4 (commit c99cf9d)
with titan

fillrandom   :       4.684 micros/op 213505 ops/sec;   23.6 MB/s

without titan

fillrandom   :       4.011 micros/op 249314 ops/sec;   27.6 MB/s

Need to check if other benchmarks (e.g. read path) has similar perf regression compare to rocksdb.

live-blob-size is much larger than total size of blob file

Value of metric live-blob-size is much larger than total size of blob file, and when most of data has been removed, it is still so larger (ten times than its actual size).

The reason is Titan would add live-blob-size by the length of raw value every time it adds a new key into Blob file. , rather than encoded value. And it collects size of encoded value in BlobFileSizeCollector. When Titan deleted files, it would subtract live-blob-size by size stored in BlobFileSizeCollector. So the value it subtracted is always less than the value it added.

[feature request] titan support merge operator

In our business application, our tikv branch uses rocksdb merge operator to implement a table-style colomn's update logic. If titan could support fundamental merge operator function, we could use this great engine, and improve performance.

This feature request doesn't modify the community tikv repo. And if you are interested in with our application, see a closed pr tikv/tikv#4095

Some statistics are not correct after reopen

After reopen, live blob file num/size is zero.

It is because we initialize the stats_ after vset_->Open.

Discardable ratio collected by compactions are not accurate

Have tried to downgrade Titan to vanilla RocksDB, found there are some blob files still with 0 discardable ratio so that they can't be cleaned by GC.
Whereas after restart, the discardable ratio stats are recalculated as 100% as expected which is constructed by scanning SST property. So later the blob files can be cleaned by GC and full downgraded to vanilla RocksDB. It indicates that we have missed some points so that the discardable ratio collected by compactions are not accurate.

RocksDB makes sure the order of notifying OnFlushCompleted and OnCompactionCompleted

In this PR facebook/rocksdb#6342, it fixes the issue that OnFlushCompleted and OnCompactionCompleted are called out of order which is the root cause of what #172 fixes.

After it is merged into our branch, we can remove the extra logic in Titan

Could titan support file IO with spdk user space driven mode？

Using BOTTOM priority threadpool to schedule GC will influence bottom level compaction

Currently we use BOTTOM priority threadpool to schedule gc, but rocksdb will schedule bottom level compaction with BOTTOM priority too if that threadpool is not empty. So

If we set low max_background_gc but high max_background_jobs, data will be prevented from compacting to last level especially if we enabled level merge.
There will be resource competition between gc thread and compaction thread.

Drop CF while GC is running causing background error

Found by db_stress #64. On DropColumnFamilies we should wait for any running GC job before proceed.

Need pick up #181 PR into tikv-4.x branch

Can you pick up 9df8f05 into the tikv-4.x branch?

We have encountered this problem in our system and need to be fixed!

Restrict buffered value size for blob zstd dictionary compression

When blob zstd dictionary compression is enabled, all the values will be buffered and replayed after the compression dictionary is finalized. So if there are multiple concurrent flushes and compactions, the memory footprint would be considerable considering the blob file size is 256MB by default.
For a small RAM instance, it may cause OOM easily. It's better to add a config to control the max concurrent buffered size.

Titan should return user value to compaction filter

Currently when user using compaction filter, the compaction filter is pass to underlying rocksdb as-is. As a result, the value returned is the blob index rather than the user value. Titan should wrap around the user compaction filter to return user value to the compaction filter. Also it needs to allow user compaction filter co-exists with the compaction filter used by Titan itself when gc_merge_rewrite=true.

Add more metrics

Make configs can be changed dynamically

By SetOptions, we should able to change following configs dynamically

PCP-23-1: Update blob file format to accommodate zstd dictionary

This is subtask of tikv/tikv#5743.

We need to add a new version to blob file format. If version = 2, it means the blob file come with a zstd dictionary and the dictionary is stored in the meta section (pointed by meta index handle in the footer). Update BlobFileHeader and BlobFileFooter and their comments accordingly.

Also update BlobEncoder and BlobDecoder to allow passing in a dictionary and use it to compress/decompress. Add unit test for it.

This should be a good warmup task for you to get familiar with Titan development.

Inline large value at the highest level?

Can we inline large value at the highest level for improving range scan performance at the cost of some potential write performance regression.

concurrent flush cause data loss after GC

Titan use event listener to hook OnFlushCompleted. When the event triggered, Titan assume corresponding memtable has been converted to SST, and mark blob file generated from the flush as normal file, so that GC can pick the file up. However, the assumption is not correct. If there are concurrent flush happened, OnFlushCompleted can fire before memtable being converted to SST (facebook/rocksdb#5892). If GC pick up the blob file before memtable flush finished, it can receive is_blob_index=false for all keys in the blob file (since they are still in memtable), and drop all data in the file by mistake, causing data loss.

Reproduced with unit test: yiwu-arbug@5172201

test sync issue

Support portable build

Newer CPU instruction will be compiled into titan's object file, it may cause coredump if it is executed in an older CPU.

Program received signal SIGILL, Illegal instruction.
0x000055555671dad2 in _M_bkt_for_elements (this=0x555557191560 <rocksdb::titandb::TitanOptionsHelper::blob_run_mode_string_map+32>, __n=<optimized out>)
    at /usr/include/c++/4.8.2/bits/hashtable_policy.h:373
373  /usr/include/c++/4.8.2/bits/hashtable_policy.h:.
Missing separate debuginfos, use: debuginfo-install glibc-2.17-196.el7.x86_64 libgcc-4.8.5-16.el7.x86_64
(gdb) disassemble
Dump of assembler code for function std::_Hashtable<std::string, std::pair<std::string const, rocksdb::titandb::TitanBlobRunMode>, std::allocator<std::pair<std::string const, rocksdb::titandb::TitanBlobRunMode> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_Hashtable<std::pair<std::string const, rocksdb::titandb::TitanBlobRunMode> const*>(std::pair<std::string const, rocksdb::titandb::TitanBlobRunMode> const*, std::pair<std::string const, rocksdb::titandb::TitanBlobRunMode> const*, unsigned long, std::hash<std::string> const&, std::__detail::_Mod_range_hashing const&, std::__detail::_Default_ranged_hash const&, std::equal_to<std::string> const&, std::__detail::_Select1st const&, std::allocator<std::pair<std::string const, rocksdb::titandb::TitanBlobRunMode> > const&):
   0x000055555671da70 <+0>:   push   %rbp
   0x000055555671da71 <+1>:   mov    %rdx,%rax
   0x000055555671da74 <+4>:   sub    %rsi,%rax
   0x000055555671da77 <+7>:   mov    %rsp,%rbp
   0x000055555671da7a <+10>:  push   %r15
   0x000055555671da7c <+12>:  sar    $0x4,%rax
   0x000055555671da80 <+16>:  push   %r14
   0x000055555671da82 <+18>:  mov    %rdi,%r14
   0x000055555671da85 <+21>:  push   %r13
   0x000055555671da87 <+23>:  mov    %rsi,%r13
   0x000055555671da8a <+26>:  push   %r12
   0x000055555671da8c <+28>:  push   %rbx
   0x000055555671da8d <+29>:  mov    %rcx,%rbx
   0x000055555671da90 <+32>:  sub    $0x38,%rsp
   0x000055555671da94 <+36>:  movq   $0x0,0x8(%rdi)
   0x000055555671da9c <+44>:  movq   $0x0,0x10(%rdi)
   0x000055555671daa4 <+52>:  movq   $0x0,0x18(%rdi)
   0x000055555671daac <+60>:  movl   $0x3f800000,0x20(%rdi)
   0x000055555671dab3 <+67>:  test   %rax,%rax
   0x000055555671dab6 <+70>:  movq   $0x0,0x28(%rdi)
   0x000055555671dabe <+78>:  mov    %rax,-0x40(%rbp)
   0x000055555671dac2 <+82>:  mov    %rdx,-0x58(%rbp)
   0x000055555671dac6 <+86>:  fildll -0x40(%rbp)
   0x000055555671dac9 <+89>:  js     0x55555671dc3b <std::_Hashtable<std::string, std::pair<std::string const, rocksdb::titandb::TitanBlobRunMode>, std::allocator<std::pair<std::string const, rocksdb::titandb::TitanBlobRunMode> >, std::__detail::_Select1st, std::equal_to<std::string>, std::hash<std::string>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_Hashtable<std::pair<std::string const, rocksdb::titandb::TitanBlobRunMode> const*>(std::pair<std::string const, rocksdb::titandb::TitanBlobRunMode> const*, std::pair<std::string const, rocksdb::titandb::TitanBlobRunMode> const*, unsigned long, std::hash<std::string> const&, std::__detail::_Mod_range_hashing const&, std::__detail::_Default_ranged_hash const&, std::equal_to<std::string> const&, std::__detail::_Select1st const&, std::allocator<std::pair<std::string const, rocksdb::titandb::TitanBlobRunMode> > const&)+459>
   0x000055555671dacf <+95>:  fstpl  -0x38(%rbp)
=> 0x000055555671dad2 <+98>:  vmovsd -0x38(%rbp),%xmm0
   0x000055555671dad7 <+103>: callq  0x5555557807d0
...

Blob file may has been deleted twice

Found by https://asktug.com/t/topic/37263
But the log is missing, Titan can not restart due to blob file has been deleted twice

May remove blob storage too early

When calling DestroyColumnFamilyHandle we may remove related blob storage directly.
But if there are multiples column family handles, Titan removes related blob storage on the first time to call DestroyColumnFamilyHandle, so using other non-destroyed column family handles to get will encounter blob-storage-not-found.

TitanDB::Open(db_options, dbname_, descs, &cf_handles_, &db_));

// ... put some data

std::vector<ColumnFamilyHandle*> cf_handles_tmp;
for (auto& handle : cf_handles_) {
    cf_handles_tmp.push_back(db_impl_->GetColumnFamilyHandleUnlocked(handle->GetID()));
 }

// ... get some data. Success!

for (auto& handle : cf_handles_tmp) {
    ASSERT_TRUE(handle);
    db_->DestroyColumnFamilyHandle(handle);
 }

// ... get some data. Fail!

But it's okay for normal usage, cause GetColumnFamilyHandleUnlocked can't be called exteriorly.

PCP-23-2: Build blob file using zstd dictionary compression

This is subtask of tikv/tikv#5743

Add CompressionOptions blob_file_compression_options to TitanCFOptions. BlobFileBuilder should respect blob_file_compression_options.enabled. Also if max_dict_bytes > 0, encode the content using zstd dictionary. If zstd_max_train_bytes = 0, it should use all the content as samples to train dictionary; otherwise, the first zstd_max_train_bytes is used. The samples should be store in memory before the dictionary is trained, and then flush to file after that. The dictionary will be stored in blob file as meta block.

For now, update BlobFileReader and BlobFileIterator to return Status::NotSupported if they encounter a file come with dictionary compression. We will update them in the next task.

Update table_builder_test to add tests for the change.

Build error on MacOS with apple clang 11.0

see diem/diem#168

I have tested on MacOS Mojave 10.14.5 with clang 10.0, everything works fine.

Cannot open a titandb instance by following your example in the blog

I cloned titan by git, then I used the following commands to compile it.

mkdir build
cd build
cmake ..
make

Then I got two library files libtitan.a and librocksdb.a.
I wrote the following code to run an example from your blog:

#include <titan/db.h>

int main() {
    rocksdb::titandb::TitanDB* db;
    rocksdb::titandb::TitanOptions options;
    options.create_if_missing = true;
    rocksdb::Status status = rocksdb::titandb::TitanDB::Open(options, "/tmp/testdb", &db);
    return 0;
}

The sample is compiled by using the following command, and we can get a executable file named a.out.

g++ -std=c++11 -pthread test.cc -I/root/git/titan/include -I/root/git/titan/build/rocksdb -L /root/git/titan/build/ -ltitan -L /root/git/titan/build/rocksdb/ -lrocksdb

But when ./a.out is used, it shows segmentation fault.
So I want to ask whether the blog post is too old to use.
How can I run a correct example?

Thank you so much!

add monitor item - blob cache usage rate

when testing the titan engine ,i want to know blob cache usage rate to evaluate the effect of blob cache. i want to add the monitoring item on grafana

Corruption may occur for GC with a concurrent DeleteFilesInRange.

tikv panic due to titan deletes a already deleted blob

[2020/05/27 21:31:06.177 +08:00] [FATAL] [lib.rs:480] ["rocksdb background error. db: kv, reason: compaction, error: Corruption: B
lob file 256 has been deleted already"] [backtrace="stack backtrace:\n   0: tikv_util::set_panic_hook::{{closure}}\n             a
t components/tikv_util/src/lib.rs:479\n   1: std::panicking::rust_panic_with_hook\n             at src/libstd/panicking.rs:475\n
 2: rust_begin_unwind\n             at src/libstd/panicking.rs:375\n   3: std::panicking::begin_panic_fmt\n             at src/lib
std/panicking.rs:326\n   4: <engine_rocks::event_listener::RocksEventListener as rocksdb::event_listener::EventListener>::on_backg
round_error\n             at components/engine_rocks/src/event_listener.rs:66\n   5: rocksdb::event_listener::on_background_error\
n             at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/98aea25/src/event_listener.rs:254\n   6: _ZN24crocksdb_eventlis
tener_t17OnBackgroundErrorEN7rocksdb21BackgroundErrorReasonEPNS0_6StatusE\n             at crocksdb/c.cc:2140\n   7: _ZN7rocksdb7t
itandb11TitanDBImpl10SetBGErrorERKNS_6StatusE\n             at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/98aea25/librocksd
b_sys/libtitan_sys/titan/src/db_impl.cc:1323\n   8: _ZN7rocksdb7titandb11TitanDBImpl12BackgroundGCEPNS_9LogBufferEj\n
at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/98aea25/librocksdb_sys/libtitan_sys/titan/src/db_impl_gc.cc:237\n   9: _ZN7ro
cksdb7titandb11TitanDBImpl16BackgroundCallGCEv\n             at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/98aea25/librocks
db_sys/libtitan_sys/titan/src/db_impl_gc.cc:139\n  10: _ZN7rocksdb14ThreadPoolImpl4Impl8BGThreadEm\n             at /rust/git/chec
kouts/rust-rocksdb-a9a28e74c6ead8ef/98aea25/librocksdb_sys/rocksdb/util/threadpool_imp.cc:266\n  11: _ZN7rocksdb14ThreadPoolImpl4I
mpl15BGThreadWrapperEPv\n             at /rust/git/checkouts/rust-rocksdb-a9a28e74c6ead8ef/98aea25/librocksdb_sys/rocksdb/util/thr
eadpool_imp.cc:307\n  12: execute_native_thread_routine\n  13: start_thread\n  14: __clone\n"] [location=components/engine_rocks/s
rc/event_listener.rs:66] [thread_name=<unnamed>]

Titan log:

LOG.old.1590593095234019:2020/05/27-21:31:05.600112 7f6b93739480 [lob_file_set.cc:117]
 Blob files for CF 0 found: 234, 233, 231, 230, 229, 226, 225, 224, 223, 221, 219, 215, 216, 214, 
209, 200, 199, 198, 197, 196, 195, 194, 193, 192, 190, 188, 187, 182, 176, 170, 172, 171, 161, 155, 
141, 140, 134, 79, 78, 77, 205, 103, 109, 237, 98, 239, 240, 100, 241, 101, 242, 44, 243, 45, 244, 
104, 201, 245, 51, 247, 53, 107, 248, 108, 249, 55, 254, 253, 266, 267, 69, 268, 105, 246, 71, 270, 
76, 271, 272, 74, 273, 75, 274, 80, 251, 269, 70, 122, 263, 260, 265, 66, 60, 259, 264, 65, 117, 258,
 262, 63, 255, 256, 261, 62, 97, 96, 95, 94, 93, 92, 91, 90, 89, 88, 87, 86, 85, 84, 213, 19, 218, 106,
 9, 208, 13, 212, 227, 33, 232, 217, 23, 222, 238, 39, 102, 5, 257, 58, 29, 36, 99

LOG.old.1590593095234019:2020/05/27-21:31:05.860018 7f6b93739480 [lob_storage.cc:65] Get 1 blob files [256] in the range 
[7A7480000000000000FF335F698000000000FF0000010175736572FF39303236FF383231FF3
335343235FF3730FF363533393600FE00FE, 7A7480000000000000FF335F728000000000FF06A5690000000000FA)

LOG.old.1590593095234019:2020/05/27-21:31:06.126564 7f6b62bff700 [lob_gc_job.cc:530] 
Titan add obsolete file [256] range [7A7480000000000000FF335F728000000000FF03B0A70000000000FAFA36BA00B443FF26,
 7A7480000000000000FF335F728000000000FF06A5680000000000FAFA36B9FD63D3FF84]

LOG.old.1590593095234019:2020/05/27-21:31:06.126582 7f6b62bff700 [ERROR] [dit_collector.h:165] blob file 256 has been deleted already

LOG.old.1590593095234019:2020/05/27-21:31:06.126588 7f6b62bff700 (Original Log Time 
2020/05/27-21:31:05.731084) [lob_gc_job.cc:138] [default] Titan GC candidates[256]

From the log, we can see that Titan GC has already selected the blob file 256 as a candidate. Whereas the DeleteFilesInRange deletes the blob file 256 directly. So when the GC is finished and to delete blob file 256, it may find blob file 256 is already deleted and panic.

Influence:

make tikv panic, but doesn't affect data correctness. After a restart(need to delete panic_mark manually), everything will work well.

how to set the blob_file_compression modle

I am a freshman in titan!
I want test different compression strategy，I change the option,h file
from CompressionType blob_file_compression{kNoCompression}
to CompressionType blob_file_compression{kLZ4Compression} but it don't work,
how can I do it? Thanks for your help

Flush memtable after GC

Currently Titan GC enables WAL to update blob index to persist GC result. For users not using WAL (e.g. tikv plan to remove kvdb WAL), even using WAL on GC does not guarantee data consistency after recovery. To avoid the issue, we should provide an option where GC does not write WAL, but flush memtable at the end to persistent its result.

Rewrite valid keys by ingesting SST files during GC

Since the write callback is removed by #121 , it's possible to generate SST files of valid keys and to ingest these SST files by IngestExternalFile

titan_thread_safety_test failed

10/14 Test #10: titan_thread_safety_test ...............Child aborted***Exception:   5.00 sec
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from TitanThreadSafetyTest
[ RUN      ] TitanThreadSafetyTest.Insert
[       OK ] TitanThreadSafetyTest.Insert (4173 ms)
[ RUN      ] TitanThreadSafetyTest.InsertAndDelete
Assertion failed: (false), function OnCompactionCompleted, file /Users/travis/build/tikv/titan/src/db_impl.cc, line 1239.

occurs in #169

blob file picker

if gc_score is smaller than cf_options_.blob_file_discardable_ratio, use continue but not break!

for (auto& gc_score : blob_storage->gc_score()) {
    if (gc_score.score < cf_options_.blob_file_discardable_ratio) {
      break;
    }
    auto blob_file = blob_storage->FindFile(gc_score.file_number).lock();
    if (!CheckBlobFile(blob_file.get())) {
      RecordTick(stats_, TitanStats::GC_NO_NEED, 1);
      // Skip this file id this file is being GCed
      // or this file had been GCed
      ROCKS_LOG_INFO(db_options_.info_log, "Blob file %" PRIu64 " no need gc",
                     blob_file->file_number());
      continue;
    }
    if (!stop_picking) {
      blob_files.push_back(blob_file.get());
      batch_size += blob_file->file_size();
      estimate_output_size +=
          (blob_file->file_size() - blob_file->discardable_size());
      if (batch_size >= cf_options_.max_gc_batch_size ||
          estimate_output_size >= cf_options_.blob_file_target_size) {
        // Stop pick file for this gc, but still check file for whether need
        // trigger gc after this
        stop_picking = true;
      }
    } else {
      next_gc_size += blob_file->file_size();
      if (next_gc_size > cf_options_.min_gc_batch_size) {
        maybe_continue_next_time = true;
        RecordTick(stats_, TitanStats::GC_REMAIN, 1);
        ROCKS_LOG_INFO(db_options_.info_log,
                       "remain more than %" PRIu64
                       " bytes to be gc and trigger after this gc",
                       next_gc_size);
        break;
      }
    }
  }

After enabling titan, whether the value obtained in the compaction filter is BlobIndex

我们在TiKV中定制了一个Compaction Filter，逻辑是根据保存的value中的内容做一些过滤。假如现在我改用titan引擎，那么我在Filter API中拿到的value是否就变成BlobIndex了？

如果的确如此，那么：

如何判断value是真实值还是BlobIndex
如果是BlobIndex，那么如何通过BlobIndex获取BlobRecord

is HasBGError() strong enough when some error happened in LogAndApply

we can see that a bg_error_ will be set when db_->blob_file_set_->LogAndApply(edit) return some error at db_impl.cpp:87. and the corresponding has_bg_error_ will be set to true as well to indicate that some bg error happend.

The lightweight flag has_bg_error_ is used to indicate whether some bg error happend and turn the whole db into read only mode and forbid subsequence read/write.

However, in HasBGError() do not hold the mutex_, but return the has_bg_error_ directly. I think there is a chance that some error has been happedn in LogAndApply and bg_error_ has been set to an error code, but has_bg_error_ is still false. In this scenario, the put/write from user is still allowed since HasBGError() return false at this point, then some unpredictable behavior will be happend, such as subsequent version edit will be apply to the manifest file and the manifest info on the disk and the manifest info in the memory (BlobFileSet) is not the same.

Could this be a bug in the titan? HasBGError() should hold the mutex_ as well?

Should update gc mark after sampling

if gc mark is not updated, the picker will always select the same blob files and do not make progress.