Comments (22)
450GB is obviously a very large SQLite database.
So what happens is this -- on startup the SQLite database must be rebuilt by the Raft system, which involves making a copy of your 450GB database. It's possible this could be optimized so that the copy is avoided; I have some ideas, but have yet to implement them.
So the question (for now) is "it's taking rqlited over an hour to perform a 450GB copy on my system. Is that too long?". I have some questions that will help us answer that.
- Is your disk operating at maximum capacity during this startup?
- How long does it take to copy the 450GB file if you simply run `cp <sqlite.file> <somewhere else>`? That is our baseline performance.
I do think long-term for such large databases rqlite needs to avoid the copy. But it needs careful design and implementation.
from rqlite.
ok thank you
In principle it could be solved like this (for my own records, mostly):
- if the SQLite database under management is the same as that in the Snapshot store (do a checksum match) then skip the copy. This could be automatically handled via the Hashicorp flag `NoSnapshotRestoreOnStart`.
For this to be safe, the following would need to change in the code:
- fsync the SQLite file under management in the function fsmSnapshot() once the WAL file has been checkpointed into the file.
- disable compacting WAL snapshotting so that the SQLite file under management and the one in the Store are (presumably) byte-for-byte identical. The compacting scanner would need to change to a simple copy of the entire WAL file into the Snapshot store. This is the hardest trade-off to make -- not technically difficult, but it would need testing to know its concrete impact. It may not make a huge difference for most use cases, and might be the right trade-off.
- on start-up, instead of unconditionally deleting the SQLite file and any WAL file, delete just the WAL file. Then do the checksum compare. If it passes, we don't delete the SQLite file, and we skip the restore, i.e. skip the extra copy.
The key concern I need to address is the following: the copy of the SQLite database present at restart time (not the one in the snapshot store, but the one under management) may not be correct -- it could have been corrupted somehow. One way around this could be the following:
- at fsmSnapshot() time, calculate a checksum of the SQLite file under management. This doesn't need to block writes, as it could be done in the Persist stage. This checksum is then written to the Snapshot store along with the snapshot.
- on startup, read the checksum from the Snapshot store. If it matches what's under management, then don't delete the SQLite file, and set `NoSnapshotRestoreOnStart` before Raft starts.
Is this check done only at first node startup? For example: say I have a running cluster of 3 nodes, one node goes offline, and I add a new node. Of course this node has to resync, but the cluster remains operational, right? Is the snapshot reloaded from scratch only when the first node of a cluster is started?
and about this feature:
> Log Compaction and Truncation
>
> rqlite automatically performs log compaction, so that disk usage due to the log remains bounded. After a configurable number of changes rqlite snapshots the SQLite database, and truncates the Raft log. This is a technical feature of the Raft consensus system, and most users of rqlite need not be concerned with this.
does the snapshot lock the database? and does this mean the disk usage is double the size of the database?
> For example: let's say I have a running cluster of 3 nodes, one node goes offline, I add a new node. Of course this node has to resync, but the cluster remains operational, right?

Assuming I follow your question, yes, the cluster remains operational.

> does the snapshot lock the database?

Yes, for a brief period of time, but usually too short to notice unless your write load is continuously very high.

> and does this mean the disk usage is double the size of the database?

Basically yes. This is core to the way Raft works; it is not something special that rqlite does.
The PR linked to this issue fixes the start-up issue. It actually appears much easier to fix than I realised before. While the initial boot will take as long as it's going to take, restarts will skip the copy.
One cost is that graceful shutdowns of a node may take a little longer, but shouldn't be more than a second or two.
Wow! Great! I'm going to test it in the next few days.
And about the periodic snapshot, just to be sure I understand, is it a full copy? Going back to my situation, would the snapshot of a ~450GB database take the same time as copying the whole file?
No, snapshotting takes an incremental approach. See:
https://www.philipotoole.com/rqlite-8-0-large-dataset-support-and-core-feature-upgrades/
"A new approach to Raft snapshotting"
> Going back to my situation, the snapshot of a ~450GB database would take the same time as copying the whole file?

Snapshotting works in an incremental manner: it copies just the changes that have taken place since the last snapshot. So once your initial Boot operation is done, Snapshotting doesn't copy the entire 450GB of data every time. If you boot your system and then write, say, 1MB of data to rqlite, then at most Snapshotting will copy 1MB of data.
The issue you hit is just a restart thing. In that case the entire 450GB dataset is copied from the Raft subsystem back into rqlite on node restart. However, I believe I have an approach that can avoid even that copy. It looks easier than I initially thought (see the PR) -- its presumed difficulty is why I never worked on it before.
I may have an experimental version you could test in a day or two. Are you able to build from source?
> I may have an experimental version you could test in a day or two. Are you able to build from source?
sure
Thinking about this some more, it's given me some ideas for more fundamental changes that will solve this issue better -- and save disk space. So it may actually be some time (weeks) before I decrease start-up times for such large databases.
I built the branch reuse-sqlite-db, and the startup time has decreased from more than 1 hour to about 10 minutes. That's very good -- do you think it will get even faster? Or were you talking about further code changes?
I might be able to make it faster. Can you show me the logs from your startup? I wonder where the 10 mins is being spent -- it must be in the sha256sum check.
Background: to be sure the "main" SQLite database can be used without getting a copy from the Snapshot store, rqlite first calculates a SHA256 hash of the SQLite file. I guess it's taking 10 mins to calculate that sum?
How long does it take to calculate such a sum of a 450GB file on the machine? Can you check?
```
$ time sha256sum output_file
ac13bb3e699306c49ff43e054804aa04a4c6656642db2c57efc5292cd79ec870 output_file

real    0m41.938s
user    0m15.180s
sys     0m15.071s
```
This is for a ~20GB file, so it takes ~40 seconds on a VM running on modern hardware (yeah, I know this is kinda vague). A file 20x as large would take well over 10 minutes, so a 10-minute start-up with a 450GB SQLite file seems about right.
Simply skipping the checksum check would probably get your startup down to seconds. It could be an optional check.
We could also consider a much faster integrity check, such as CRC32. Might be good for rqlite, where I am not concerned about malicious tampering. A CRC32 of the same data takes 13 seconds to run.
Yes the time spent at startup is due to the sha256sum check. Could it be done somehow on chunks of data, to parallelize and make it faster?
> Yes the time spent at startup is due to the sha256sum check.

How do you know? Did you add logging?

> Could it be done somehow on chunks of data, to parallelize and make it faster?

Yes: use a different hash technique (see above), or just skip it.
rqlited startup told me this:

```
[snapshot-store] 2024/06/12 08:42:32 store initialized using /media/NVMEdisk2/rqlite/file/data/rsnapshots
[snapshot-store] 2024/06/12 08:42:32 checking consistency of snapshot store at /media/NVMEdisk2/rqlite/file/data/rsnapshots
[snapshot-store] 2024/06/12 08:42:32 check complete
[store] 2024/06/12 08:42:32 1 preexisting snapshots present
[store] 2024/06/12 08:58:37 reusing existing database at /media/NVMEdisk2/rqlite/file/data/db.sqlite, skipping copy from Snapshot store
```

and looking at your code in store.go, it seems to me that the only time-consuming operation is the checksum in the createOnDisk function.
Could probably get a ~4x speed up by using this function:
```go
package main

import (
	"encoding/hex"
	"fmt"
	"hash/crc32"
	"io"
	"os"
)

// computeCRC32 returns the CRC-32 (IEEE) checksum of a file as a hex string,
// streaming the file so memory use stays constant regardless of file size.
func computeCRC32(filePath string) (string, error) {
	file, err := os.Open(filePath)
	if err != nil {
		return "", fmt.Errorf("failed to open file: %v", err)
	}
	defer file.Close()

	hash := crc32.NewIEEE()
	if _, err := io.Copy(hash, file); err != nil {
		return "", fmt.Errorf("failed to compute checksum: %v", err)
	}

	checksum := hex.EncodeToString(hash.Sum(nil))
	return checksum, nil
}
```
instead of the existing function -- going forward that is.
My plan is to fix this in 9.0.
I've already started development of 9.0: #1810