speedb-io / speedb
A RocksDB-compliant, high-performance, scalable embedded key-value store
Home Page: https://www.speedb.io/
License: Apache License 2.0
When using benchmark_rate_limit, the response time includes the wait time. This bug prevents using db_bench to measure response time under different loads.
Need to define a memory manager that will allow the system to work at maximum performance while limiting memory usage
As part of #7 (and later #27 as well) I made some changes to customizable_test
because it was failing to build using Clang 12 and 13 with the following error:
options/customizable_test.cc:230:7: error: offset of on non-standard-layout type 'struct SimpleOptions' [-Werror,-Winvalid-offsetof]
{offsetof(struct SimpleOptions, b), OptionType::kBoolean,
^ ~
/usr/lib/clang/12.0.1/include/stddef.h:104:24: note: expanded from macro 'offsetof'
#define offsetof(t, d) __builtin_offsetof(t, d)
^ ~
options/customizable_test.cc:233:20: error: offset of on non-standard-layout type 'struct SimpleOptions' [-Werror,-Winvalid-offsetof]
offsetof(struct SimpleOptions, cu),
^ ~~
/usr/lib/clang/12.0.1/include/stddef.h:104:24: note: expanded from macro 'offsetof'
#define offsetof(t, d) __builtin_offsetof(t, d)
^ ~
options/customizable_test.cc:236:20: error: offset of on non-standard-layout type 'struct SimpleOptions' [-Werror,-Winvalid-offsetof]
offsetof(struct SimpleOptions, cs),
^ ~~
/usr/lib/clang/12.0.1/include/stddef.h:104:24: note: expanded from macro 'offsetof'
#define offsetof(t, d) __builtin_offsetof(t, d)
^ ~
options/customizable_test.cc:239:21: error: offset of on non-standard-layout type 'struct SimpleOptions' [-Werror,-Winvalid-offsetof]
offsetof(struct SimpleOptions, cp),
^ ~~
/usr/lib/clang/12.0.1/include/stddef.h:104:24: note: expanded from macro 'offsetof'
#define offsetof(t, d) __builtin_offsetof(t, d)
^ ~
4 errors generated.
I also opened facebook/rocksdb#8525 to track the issue on RocksDB's side.
Updating to Clang 14 fixed the issue on my system, which indicates that this is likely a compiler bug that was fixed, so let's revert the changes back to the original version.
Name space requirements:
Why:
Improve write performance (for multiple and single threads)
Allow parallelism when working with the memtable
What:
Inserting data into the memtable without sorting it first eliminates the locking constraints of the write process.
Since data is inserted unsorted, a method for fast reads is needed: generate 1M buckets and add each key to one of them based on a hash; statistically there will be about one key per bucket.
Who:
Write and read operations under write-oriented workloads.
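The bucket idea above can be sketched as follows. This is a minimal illustration, not the actual Speedb design: the BucketedMemTable name, the bucket count, and the use of std::hash are all assumptions made for the sketch.

```cpp
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Keys are appended unsorted; a hash spreads them across buckets so a
// read only scans one small bucket. With ~1M buckets, statistically
// each bucket holds about one entry.
class BucketedMemTable {
 public:
  explicit BucketedMemTable(size_t num_buckets = 1 << 20)
      : buckets_(num_buckets) {}

  // Insert without sorting: just append to the bucket chosen by the hash,
  // so writers do not contend on a global sorted structure.
  void Put(const std::string& key, const std::string& value) {
    buckets_[BucketOf(key)].emplace_back(key, value);
  }

  // Read scans only the single bucket the key hashes to.
  std::optional<std::string> Get(const std::string& key) const {
    const auto& bucket = buckets_[BucketOf(key)];
    // Scan backwards so the most recent write for a key wins.
    for (auto it = bucket.rbegin(); it != bucket.rend(); ++it) {
      if (it->first == key) return it->second;
    }
    return std::nullopt;
  }

 private:
  size_t BucketOf(const std::string& key) const {
    return std::hash<std::string>{}(key) % buckets_.size();
  }
  std::vector<std::vector<std::pair<std::string, std::string>>> buckets_;
};
```

A real implementation would still need per-bucket synchronization for concurrent writers and a merge path for iteration/flush, which this sketch omits.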
Why:
Improve performance of multithreaded writes
What:
Redesign the write flow to spend less time on locks, mainly in a multithreaded environment
The write flow feature improves the overall performance by:
Who:
Multi threaded write environments
Why:
What:
Hold as much metadata as possible in the cache by dynamically switching between pinned and LRU while prioritizing upper LSM levels
Create a read-intensity grade for each CF: based on monitoring of memory consumption, if there is not enough space to pin the full index/filter data, move lower levels and the least read-intensive CFs into the LRU and evict them from the bottom up.
Once a threshold is reached, evict pinned data to the LRU, starting with the least read-intensive CFs and the lower levels.
At the moment - there are several unit tests failing in 'main'.
4 of them are mentioned here:
#9 (comment)
Since those 4 are failing in RocksDB 7.2.2 we should decide whether to fix them or remove them.
EnvTest.GenerateRawUniqueIdTrackEnvDetailsOnly
EventListenerTest.OnBlobFileOperationTest
EventListenerTest.OnFileOperationTest
EventListenerTest.ReadManifestAndWALOnRecovery
additional one:
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from PerfContextTest
[ RUN ] PerfContextTest.DBMutexLockCounter
db/perf_context_test.cc:611: Failure
Expected: (get_perf_context()->db_mutex_lock_nanos) > (0), actual: 0 vs 0
terminate called after throwing an instance of 'testing::internal::GoogleTestFailureException'
what(): db/perf_context_test.cc:611: Failure
Expected: (get_perf_context()->db_mutex_lock_nanos) > (0), actual: 0 vs 0
Aborted (core dumped)
As long as those tests keep failing we will not be able to make the automation process work as expected.
Is your feature request related to a problem? Please describe.
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
Describe the solution you'd like
A clear and concise description of what you want to happen.
Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
Additional context
Add any other context or screenshots about the feature request here.
Why:
Improve write performance by actively changing the memtable size, flush schedules, and speed of writes.
What:
Introduce a new concept: memtable-to-L0 flush speed in MB/s
Introduce a decision algorithm based on:
Current disk IO vs disk capabilities - system wide (% of possible)
Current memory utilization vs quota - system wide and CF specific (% of dirty allocated vs taken)
CF specific usage per allocation
Current user write rate
In the algorithm the user write rate should be the goal, and flushes / memtable sizes / flush speed should be determined by the utilization.
The algorithm should take into consideration under-utilized CFs and flushes that can be slowed down
Testing:
Heavy (maximum) writes to one CF while writing 3 MB/s to 3 other CFs
Verify full-system performance is better than RocksDB
Verify heavy-CF performance is better than RocksDB
Verify small-CF performance does not degrade over time
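A minimal sketch of how such a decision algorithm might weigh its inputs. The linear scaling, the 1 MB/s floor, and the function name are illustrative assumptions for the sketch, not the actual Speedb algorithm:

```cpp
// Pick a memtable-to-L0 flush speed (MB/s) from system-wide disk
// utilization and dirty-memory pressure. Memory pressure pushes the
// flush speed up; disk pressure caps what is available.
double TargetFlushSpeedMBps(double disk_io_utilization,   // 0.0 - 1.0
                            double dirty_mem_utilization, // 0.0 - 1.0
                            double max_disk_mbps) {
  double wanted = dirty_mem_utilization * max_disk_mbps;
  double available = (1.0 - disk_io_utilization) * max_disk_mbps;
  double speed = wanted < available ? wanted : available;
  // Keep a small floor so under-utilized CFs still drain slowly
  // instead of stalling entirely.
  const double kFloorMBps = 1.0;
  return speed > kFloorMBps ? speed : kFloorMBps;
}
```

The real algorithm would also fold in the per-CF allocation usage and the current user write rate as the goal, per the list above.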
The delayed write option in RocksDB is used to slow down writes when flush or compaction can’t keep up with the incoming write rate. This mechanism ensures the write rate will be aligned with the HW capabilities. This mechanism monitors three parameters to identify stalls (number of unflushed memtables, number of files in L0, and the urgency level of pending flushes). In case a threshold of one of them is exceeded, the delayed write limits the write rate to the constant delayed_write_rate value.
When delayed write is enabled and active, the write rate is decreased gradually to moderate the write stalls. It currently does not take memory limitation under consideration and does not stop writes when a memory limit is reached. This might cause OOM error, which can be fixed with a memory monitoring feature.
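The gradual slowdown idea can be sketched as below; the 0.8 decay factor and the 1 MB/s floor are illustrative assumptions, not RocksDB's actual constants:

```cpp
#include <cstdint>

// Each time one of the stall triggers fires (too many unflushed
// memtables, too many L0 files, urgent pending flushes), cut the
// allowed write rate by a factor instead of stopping writes outright.
struct DelayedWriteState {
  uint64_t rate_bytes_per_sec;

  void OnStallTrigger() {
    rate_bytes_per_sec = static_cast<uint64_t>(rate_bytes_per_sec * 0.8);
    const uint64_t kFloor = 1ull << 20;  // never drop below 1 MB/s
    if (rate_bytes_per_sec < kFloor) rate_bytes_per_sec = kFloor;
  }

  void OnStallCleared(uint64_t delayed_write_rate) {
    // When all triggers clear, restore the configured delayed_write_rate.
    rate_bytes_per_sec = delayed_write_rate;
  }
};
```

As the issue notes, none of this consults memory limits, which is the gap a memory monitoring feature would close.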
Why:
Stabilizing write performance.
Slight increase in write performance due to stabilization
What:
Reduce the number of immutable memtables per flush
Flushes will happen in smaller steps that finish faster, allowing the proper write delay configuration to maintain more stable performance
Who:
Write-intensive workloads that suffer from jittery write delays
Why: Improve seek and reduce CPU usage.
When there are many levels and many SST files in each level, the seek operation is slow. This is relevant when hybrid compaction is enabled.
What: Seek only in the relevant SSTs that are in the seek criteria range and avoid opening SSTs that are not in the required range.
Who: Hybrid compaction, when many levels and many SST files/holes exist
Technical details:
The Hmap is a linked list of overlap ranges between levels. The linked list should be updated after every flush/compaction completed.
Cost:
Note: This feature should work when the ORC has not completed (hence, there are still many levels, holes, sst files)
Need to finalize the criteria for when to start/stop the hmap.
Need to support the Speedb plugin in the automation make process
We are now going to use the Speedb plugin for various Speedb options (memtablerep, bloomfilter), so to be able to use it we need to add ROCKSDB_PLUGINS=speedb to the compile command
import general bug fixes from speedb
Why:
Improve performance during read heavy snapshot workload
What:
In a read-only workload, when taking a snapshot during a read, don’t create a new snapshot if nothing has changed; reuse the last one.
In a read-only workload, don’t use a mutex when the last snapshot is still valid.
Why:
Stabilizing write performance.
Slight increase in write performance due to stabilization
What:
Always take a maximum of 50% (or lower) of the L0 files to be compacted with L1
Compaction steps will be smaller and finish faster, freeing up both L0 and L1 for flush/compaction
Who:
Write-intensive workloads that suffer from jittery performance due to L0 flushes
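The picking rule above can be sketched in a few lines; the function name and the always-pick-at-least-one fallback are assumptions made for the sketch:

```cpp
#include <cstddef>

// Instead of compacting every L0 file with L1 at once, cap each
// compaction at 50% of the current L0 files so steps are smaller
// and finish faster, leaving room for concurrent flushes.
size_t PickL0FilesForCompaction(size_t l0_file_count) {
  size_t picked = l0_file_count / 2;  // at most 50% of L0
  // If there is exactly one L0 file, still compact it rather than stall.
  return picked > 0 ? picked : (l0_file_count > 0 ? 1 : 0);
}
```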
Why:
Reduce CPU and improve performance of read and seek operation, especially when data resides in memory
What:
Accelerate the index search by using a more sophisticated searching algorithm than binary search.
By utilizing a probabilistic search algorithm (to be explained in a separate scientific paper), save on search time and CPU cycles (do fewer searches than binary search).
In order to select between binary search and probabilistic search, another step in the SST creation needs to be done so as not to waste time “trying” the right search algorithm.
The README file should include the following sections:
With the changes in #5, when we need to purge a large amount of files, or large files which don't have their blocks allocated contiguously, or on bitmap-based filesystems (even when we have a contiguous range of blocks for each file), we may incur a high load on the disk when purging obsolete files that would eat into the bandwidth available to user operations and compactions.
Change the implementation of the default Env's DeleteFile() function to delete by truncating in modestly sized chunks (preliminary tests showed that around 500MiB might be a good place to start and tune from there), allowing the disk to recover in between.
This needs to have extensive performance tests in conjunction with #5.
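A POSIX sketch of the delete-by-truncation idea, assuming a fixed chunk size passed by the caller (the issue suggests starting around 500MiB and tuning). Error handling and the recovery pause are minimal for brevity:

```cpp
#include <cstdio>
#include <string>
#include <sys/stat.h>
#include <unistd.h>

// Shrink the file in steps with truncate() so block deallocation is
// spread over time instead of hitting the disk all at once, then unlink.
bool DeleteFileInChunks(const std::string& path, off_t chunk_bytes) {
  struct stat st;
  if (stat(path.c_str(), &st) != 0) return false;
  for (off_t size = st.st_size; size > 0;) {
    size = size > chunk_bytes ? size - chunk_bytes : 0;
    if (truncate(path.c_str(), size) != 0) return false;
    // A real implementation would sleep/yield here to let the disk recover.
  }
  return std::remove(path.c_str()) == 0;
}
```

Note that on extent-based filesystems with contiguous allocation the benefit is smaller, which is exactly why the issue calls for extensive performance tests alongside #5.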
Why:
Enforce the soft and hard limits of the memory and issue relevant stalls / delays / evictions based on the other memory management capabilities to make sure we don't go OOM (Out Of Memory)
What:
Respect and hard enforce the memory quota assignment (cache size)
Accept 2 parameters, soft and hard.
In Speedb (proprietary) we made some changes to db_stress to fix bugs and support desired behaviours, so we need to port them to the OSS version.
db_stress is behaving as in Speedb (proprietary), without bugs.
There are some outstanding bugs, and missing features that need to be merged.
Compare with the Speedb (proprietary) version.
Running make blackbox_ubsan_crash_test on RocksDB v7.2.2 results in the following errors:
monitoring/perf_step_timer.h:19:31: runtime error: load of null pointer of type 'PerfLevel'
file/writable_file_writer.cc:523:9: runtime error: member access within null pointer of type 'struct IOStatsContext'
It seems these variables are not initialized when reached.
Accessing the members through a getter instead of directly fixes these errors, e.g. GetPerfLevel() instead of perf_level directly.
perf_context is declared and defined in a very similar way (though with thread_local instead of __thread), but no error is raised from it. That's probably just a coincidence where the first access to perf_context is through get_perf_context() and the rest already see an initialized member.
no errors
errors
make blackbox_ubsan_crash_test
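The getter pattern described above looks roughly like this. The enum values here are simplified placeholders, not RocksDB's actual PerfLevel definition:

```cpp
// Reading through a getter in the same translation unit as the
// thread-local guarantees the variable is initialized before the first
// load, whereas a direct extern __thread access from another translation
// unit may be reached before initialization (the UBSan null-load above).
enum PerfLevel { kDisable = 0, kEnableCount = 2 };

thread_local PerfLevel perf_level = kEnableCount;

PerfLevel GetPerfLevel() { return perf_level; }
```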
Add build fixes from Speedb into OSS
The code compiles.
The code breaks the build.
Compile with GCC on CentOS 7.9, with GCC 12, and with Clang 13.
MockMemTable members are not fully initialized and have garbage values that cause an assertion failure once every several runs
Purges are deletions of obsolete compaction files. By doing the cleanup of these files in the HIGH priority thread pool, the flushes are hindered. We’d like to avoid disturbing the flushes and instead do this cleaning job at LOW priority.
This change is also favored when using the SPDB-680 (delete a file in chunks) feature, since with it the deletion takes longer, which would further stall the flushes.
Also consider changing the priority here as well, in db/db_impl/db_impl_compaction_flush.cc:
void DBImpl::BGWorkPurge(void* db) {
IOSTATS_SET_THREAD_POOL_ID(Env::Priority::HIGH);
TEST_SYNC_POINT("DBImpl::BGWorkPurge:start");
reinterpret_cast<DBImpl*>(db)->BackgroundCallPurge();
TEST_SYNC_POINT("DBImpl::BGWorkPurge:end");
}
Owner: @isaac-io
This issue handles updates needed to the CONTRIBUTING guide, as per the Speedb development process
Why:
Improve performance of multiple column families during writes
Improve memory utilization of multiple CF by distributing it dynamically
Eliminate IO stalls due to flushes
Framework for memory quota enforcement
What:
Proactively trigger flushes to different CF in order to free up used memory.
Better utilize overall memory usage and improve write performance.
Who (is it for):
Multiple DBs or multiple CFs where the DBs/CFs are not evenly used (some CFs/DBs are more write intensive than others)
Why:
Seek is a commonly used operation in RocksDB. Improving it can significantly improve the overall performance of the application.
Improve short-seek performance with a large number of levels that don’t have much overlap
What:
Improve the creation of range filters to hold fewer irrelevant items, reducing seek times
these params are:
table_cache_numshardbits is 4 while default is 6.
new table reader for comp inputs
block cache 8mb
index shortening mode
enable_pipelined_write - true in db bench and false otherwise.
delayed_write_rate = 8Mb and default value is 0.
and possibly more.
Fix by making db_bench take the default values
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Building Speedb on my machine using Clang 14 fails with the following errors:
/home/isaac/projects/speedb-io/options/db_options.h:120:21: error: definition of implicit copy constructor for 'MutableDBOptions' is deprecated because it has a user-declared copy assignment operator [-Werror,-Wdeprecated-copy]
MutableDBOptions& operator=(const MutableDBOptions&) = default;
^
/home/isaac/projects/speedb-io/db/compaction/compaction_job.cc:439:7: note: in implicit copy constructor for 'rocksdb::MutableDBOptions' first required here
mutable_db_options_copy_(mutable_db_options),
^
mkdir build
CC=clang CXX=clang++ cmake .. -GNinja
ninja
The build completes successfully.
uname -omsr (and distribution name and version): Arch Linux with Linux 5.18.10-arch1-1 x86_64 GNU/Linux
gcc --version: clang version 14.0.6
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
Why:
Reduce false positives rate while using the same amount of memory.
What:
Develop a filter which is fast and low on CPU consumption on the one hand, but with a better memory-footprint/FPR trade-off on the other hand.
Technical detail:
In the traditional bloom filter there is a tradeoff between memory usage and performance. The RocksDB blocked bloom filter takes less time but consumes extra memory.
Ribbon filter, on the other hand, takes ~30% less memory but is much slower than the bloom filter (factor of 4).
The idea is to improve the bloom filter's memory consumption while keeping it highly performant.
Who:
The proposed filter should be most beneficial when there is a need for a very small FPR. Typically this happens when the penalty of a false positive is very big compared to the filter test time (database on the disk), and when true positives are rare.
Integrate a new type of filter policy: Paired Block Bloom Filter
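To make the memory/FPR trade-off discussed above concrete, the textbook bloom filter approximation can be computed directly. This is the standard formula FPR ≈ (1 − e^(−k·n/m))^k, shown for illustration only; it says nothing about the paired-block design itself:

```cpp
#include <cmath>

// With m/n bits per key and k hash functions, the false positive rate
// of a classic bloom filter is approximately (1 - e^(-k*n/m))^k.
double BloomFPR(double bits_per_key, int num_hashes) {
  double k = static_cast<double>(num_hashes);
  return std::pow(1.0 - std::exp(-k / bits_per_key), k);
}
```

For example, 10 bits per key with 7 hashes gives roughly a 0.8% false positive rate, which is the baseline any improved filter is measured against.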
As part of the OSS plan, let's extract the relevant parts of the delayed write mechanism from the proprietary version and port it (as is) to the open source version, gated behind a flag to decide whether to use the RocksDB mechanism (based on the WriteController logic) or use the Speedb one. If all tests pass, we'll set the default value of the flag so that the Speedb logic is used.
Note that a flag needs to be added to db_bench as well in order to allow setting the new option in performance tests and allow comparing runs with and without it.
Currently db_bench, db_stress, and related tools don't allow providing plugin URIs for the memtable factory. This precludes the use of the new memtable (#22) in these tools during testing. Add support for providing a memtable factory URI for the use of plugins.
needs_flush_speedup is considered in a branch of the code that's effectively dead code.
DBImpl::MaybeScheduleFlushOrCompaction() {
...
bool is_flush_pool_empty =
env_->GetBackgroundThreads(Env::Priority::HIGH) == 0;
if (!is_flush_pool_empty) {
...
} else {
// this code is never reached
while (unscheduled_flushes_ > 0 &&
(bg_flush_scheduled_ + bg_compaction_scheduled_ <
bg_job_limits.max_flushes ||
(needs_flush_speedup_ &&
bg_flush_scheduled_ <= bg_job_limits.max_flushes))) {
bg_flush_scheduled_++;
FlushThreadArg* fta = new FlushThreadArg;
fta->db_ = this;
fta->thread_pri_ = Env::Priority::LOW;
env_->Schedule(&DBImpl::BGWorkFlush, fta, Env::Priority::LOW, this,
&DBImpl::UnscheduleFlushCallback);
--unscheduled_flushes_;
}
}
The reason is that env_->GetBackgroundThreads() returns the pool capacity, not the amount of available slots, so the flush pool will never actually be empty (the intention in RocksDB here was to support a case where flushes are supposed to have the same priority as compactions and thus max_background_flushes is set to 0, but our use of needs_flush_speedup needs to know if there are high priority jobs available, regardless of capacity allocation).
Available capacity is considered for scheduling based on flush urgency.
Allocated capacity is considered.
N/A
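The capacity-vs-availability distinction at the heart of this bug can be illustrated with a hypothetical pool view; the struct and its fields are assumptions for the sketch, not the RocksDB Env API:

```cpp
#include <cstddef>

// A pool's configured capacity vs. its currently free slots.
// GetBackgroundThreads() reports the former; the flush-speedup
// logic needs the latter.
struct ThreadPoolView {
  size_t capacity;      // what Env::GetBackgroundThreads() returns
  size_t busy_threads;  // jobs currently running
  size_t queued_jobs;   // jobs waiting for a slot

  // The check from the quoted code: a non-zero capacity pool never
  // looks empty, so the else branch is dead.
  bool LooksEmptyByCapacity() const { return capacity == 0; }

  // What the speedup logic actually needs: is a HIGH-pri slot free?
  bool HasFreeSlot() const {
    return busy_threads < capacity && queued_jobs == 0;
  }
};
```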
What
Provide an interface to change configurable parameters on the fly even if the application does not support it
Why
Who
Functional Requirements:
error:
[ RUN ] CustomizableTest.PrepareOptionsTest
options/customizable_test.cc:667: Failure
base->ConfigureFromString(prepared, "unique={id=P; can_prepare=true}")
NotFound: Missing configurable object: pointer
terminate called after throwing an instance of 'testing::internal::GoogleTestFailureException'
what(): options/customizable_test.cc:667: Failure
base->ConfigureFromString(prepared, "unique={id=P; can_prepare=true}")
NotFound: Missing configurable object: pointer
Received signal 6 (Aborted)
./customizable_test
Improve memtable switch latency with memtables that require large upfront allocation, such as the hash-table-based memtable
In db_test, the write buffer size is changed dynamically, but this is not aligned with the fact that we have an active memtable and a pending memtable.
So the check of the L0 files (meaning the memtables that have completed flush) depending on the increased/decreased write buffer size is incorrect, because the pending memtable was created with the previous write buffer size.
When there's no data in the buffer, there's nothing to drop anyway, and providing 0 to the rand->Uniform() function results in a division by zero error (SIGFPE), so just check and do nothing in that case.
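The guard described above can be sketched as follows; Uniform() here is a hypothetical stand-in that mirrors the hazard (modulo by zero), and the function names are illustrative:

```cpp
#include <cstdint>

// Stand-in for rand->Uniform(n): n must be > 0 or the modulo traps.
uint32_t Uniform(uint32_t seed, uint32_t n) { return seed % n; }

uint32_t PickDropOffset(uint32_t seed, uint32_t buffer_size) {
  // Nothing to drop from an empty buffer; also avoids division by zero.
  if (buffer_size == 0) return 0;
  return Uniform(seed, buffer_size);
}
```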
Why:
Have a single component that knows the memory status of the system.
Know the status for each CF, for reporting and decision support for other features
What:
Track both clean and dirty memory consumption per CF
The memory monitor should register each CF during build and remove it during destroy
The memory monitor should be able to calculate, when needed, the clean, dirty, and flush-pending memory size for each CF
The memory monitor should have a method to report individual CF status and system wide status on clean, dirty, sum
User interface for getting this report - logs
Who:
Reporting and monitoring.
Decision support for other features (clean / dirty / quota managers)
Changing the number of compaction threads from 1 to 8 showed dramatic performance changes, especially, but not only, in the write tests.
We would like research to be done as to:
We want to understand the theory behind it before we start the testing, to pinpoint our testing efforts (and the release in which this will be active) accordingly.
Currently we have one set of performance tests, including 4 benchmarks with the same 11 tests each.
We mainly use this set, which is good and comprehensive but does not necessarily fit all our purposes.
We are creating four levels of testing:
The Daily level should be bigger, vaster, and more comprehensive than the Regression level. The same goes for Weekly vs. Daily and Release vs. Weekly.
Regression level should be short, no longer than 2 hours, and give RnD the basic approval that nothing was broken and no degradation was found. In any case, that same night the Daily run will take place to provide a more profound verification.
Daily can be several hours - in any case no more than 10 hours.
Weekly should be longer and might include huge DB and threads/CF, features and important non default configurations.
Release is the biggest one. It should include all our features - including all that are set to false by default, different non default configurations, all mentions in https://speedb.atlassian.net/browse/SPDB-552 and other edge cases.
We would like to rethink the necessity of the current 11 tests for each of the current benchmarks (maybe the tests should differ between the different benchmarks) and the ones to come, and consider adding new ones (e.g. delete). In addition, we should consider having images instead of the "fillrandom" test.
The outcome for this ticket should be a document for all the four levels - what should be tested in each level - spread to specific benchmarks and tests.
#29 is the issue to implement the new filter.
The new filter needs documentation - Wiki page + benchmark results
This issue is to track the documentation.