
Comments (22)

otoolep commented on June 29, 2024

450GB is a very large SQLite database obviously.

So what happens is this -- on startup the SQLite database must be rebuilt by the Raft system. This involves making a copy of your 450GB database. It's possible this could be optimized so that the copy is avoided; I have some ideas but have yet to implement them.

So the question (for now) is: "It's taking rqlited over an hour to perform a 450GB copy on my system. Is that too long?" I have some questions that will help us answer it.

  • Is your disk operating at maximum capacity during this startup?
  • How long does it take to copy the 450GB file if you simply run cp <sqlite.file> <somewhere else>? That is our baseline performance.

I do think that, long term, rqlite needs to avoid the copy for such large databases. But that needs careful design and implementation.

softwaremarkdahl commented on June 29, 2024

ok thank you

otoolep commented on June 29, 2024

In principle it could be solved like this (for my own records, mostly):

  • if the SQLite database under management is the same as the one in the Snapshot store (determined via a checksum match), then skip the copy. This could be handled automatically via the Hashicorp raft flag NoSnapshotRestoreOnStart.

For this to be safe, the following would need to change in the code:

  • fsync the SQLite file under management in the function fsmSnapshot() once the WAL file has been checkpointed into the file.
  • disable compacting WAL snapshotting so that both the SQLite file under management and the one in the Snapshot store are (presumably) byte-for-byte the same. The compacting scanner would need to change to a simple file copy of the entire WAL file into the Snapshot store. This is the hardest trade-off to make -- not technically difficult, but it would need testing to know its concrete impact. It may not make a huge difference for most use cases, and might be the right trade-off.
  • on startup, instead of unconditionally deleting the SQLite file and any WAL file, delete just the WAL file. Then do the checksum comparison. If it passes, we don't delete the SQLite file, and we skip the restore, i.e. the extra copy (see the sketch below).
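
A minimal sketch of that startup path in Go, assuming a hypothetical helper fileChecksum and a want value read from the Snapshot store -- rqlite's actual code will differ, though NoSnapshotRestoreOnStart is a real hashicorp/raft Config field:

// Sketch only: fileChecksum is a hypothetical helper, and want is the
// checksum recorded in the Snapshot store at snapshot time.
func maybeReuseDatabase(dbPath, walPath, want string, cfg *raft.Config) error {
	// The WAL file is always discarded; only the main SQLite file is a
	// candidate for reuse.
	if err := os.Remove(walPath); err != nil && !os.IsNotExist(err) {
		return err
	}

	got, err := fileChecksum(dbPath)
	if err != nil {
		return err
	}
	if got == want {
		// Checksums match: keep the file and skip the restore copy.
		cfg.NoSnapshotRestoreOnStart = true
		return nil
	}
	// Mismatch: delete the file and let Raft restore from the snapshot.
	return os.Remove(dbPath)
}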

otoolep commented on June 29, 2024

The key concern I need to address is the following: the copy of the SQLite database that is present at restart time (not the one in the Snapshot store, but the one under management) may not be correct -- it may have been corrupted somehow. One way around this could be the following:

  • at fsmSnapshot() time calculate a checksum for the SQLite file under management. This doesn't need to block writes, as it could be done in the Persist stage. This checksum is then written to the Snapshot store along with the snapshot.
  • on startup read the checksum from the Snapshot store. If it matches the checksum of the file under management, then don't delete the SQLite file, and set NoSnapshotRestoreOnStart before Raft starts. (A sketch of the hashing step follows this list.)
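
The hashing step itself can be a plain streaming SHA-256 of the file. A minimal, runnable sketch -- the function name is mine, not rqlite's:

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// fileSHA256 streams the whole file through SHA-256 and returns the
// hex-encoded digest. Per the comment above, it could run during the
// Persist stage, so it need not block writes.
func fileSHA256(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	sum, err := fileSHA256(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(sum)
}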

softwaremarkdahl commented on June 29, 2024

Is this check done only at first node startup? For example: let's say I have a running cluster of 3 nodes, one node goes offline, I add a new node. Of course this node has to resync, but the cluster remains operational, right? The snapshot is reloaded from scratch only when the first node of a cluster is started?

softwaremarkdahl commented on June 29, 2024

and about this feature:

Log Compaction and Truncation
rqlite automatically performs log compaction, so that disk usage due to the log remains bounded. After a configurable number of changes rqlite snapshots the SQLite database, and truncates the Raft log. This is a technical feature of the Raft consensus system, and most users of rqlite need not be concerned with this.

does the snapshot lock the database? and does this mean the disk usage is double the size of the database?

otoolep commented on June 29, 2024

For example: let's say I have a running cluster of 3 nodes, one node goes offline, I add a new node. Of course this node has to resync, but the cluster remains operational, right?

Assuming I follow your question, yes, the cluster remains operational.

does the snapshot lock the database?

Yes, for a brief period of time, but usually too short to notice unless your write load is continuously very high.

and does this mean the disk usage is double the size of the database?

Basically yes. This is core to the way Raft works; it is not something special that rqlite does.

otoolep commented on June 29, 2024

The PR linked to this issue fixes the start-up issue. It actually appears much easier to fix than I realised before. While the initial boot will take as long as it's going to take, restarts will skip the copy.

One cost is that graceful shutdowns of a node may take a little longer, but shouldn't be more than a second or two.

softwaremarkdahl commented on June 29, 2024

Wow! Great! I'm going to test it in the next few days.

And about the periodic snapshot, just to be sure I understand: is it a full copy? Going back to my situation, would the snapshot of a ~450GB database take the same time as copying the whole file?

otoolep commented on June 29, 2024

No, snapshotting takes an incremental approach. See:

https://www.philipotoole.com/rqlite-8-0-large-dataset-support-and-core-feature-upgrades/

"A new approach to Raft snapshotting"

otoolep commented on June 29, 2024

Going back to my situation, would the snapshot of a ~450GB database take the same time as copying the whole file?

Snapshotting works in an incremental manner. It copies just the changes that have taken place since the last snapshot. So once your initial Boot operation is done, Snapshotting doesn't copy the entire 450GB of data every time. If you boot your system, and then write, say, 1MB of data to rqlite, then at most Snapshotting will copy 1MB of data.

The issue you hit is just a restart thing. In that case the entire 450GB dataset is copied from the Raft subsystem back into rqlite on node restart. However I believe I have an approach that can avoid even that copy -- it looks easier than I initially thought (see the PR). I had assumed it was harder, which is why I never worked on it before.

I may have an experimental version you could test in a day or two. Are you able to build from source?

softwaremarkdahl commented on June 29, 2024

I may have an experimental version you could test in a day or two. Are you able to build from source?

sure

otoolep commented on June 29, 2024

Thinking about this some more, it's given me some ideas for more fundamental changes that will solve this issue better -- and save disk space. So it may actually be some time (weeks) before I decrease start-up times for such large databases.

softwaremarkdahl commented on June 29, 2024

I built the reuse-sqlite-db branch; the startup time has decreased from more than 1 hour to about 10 minutes... It's very good. Do you think it will get even faster? Or are you talking about better code?

otoolep commented on June 29, 2024

I might be able to make it faster. Can you show me the logs from your startup? I wonder where the 10 mins is being spent -- it must be in the sha256sum check.

Background: to be sure the "main" SQLite database can be used without getting a copy from the Snapshot store, rqlite first calculates a SHA256 hash of the SQLite file. I guess it's taking 10 mins to calculate that sum?

How long does it take to calculate such a sum of a 450GB file on the machine? Can you check?

otoolep commented on June 29, 2024
$ time sha256sum output_file 
ac13bb3e699306c49ff43e054804aa04a4c6656642db2c57efc5292cd79ec870  output_file

real	0m41.938s
user	0m15.180s
sys	0m15.071s

This is for a ~20GB file. So it takes ~40 seconds on a VM, which is running on modern hardware (yeah, I know this is kinda vague). A file 20x as large would therefore take on the order of 15 minutes. So a 10-minute start-up with a 450GB SQLite file seems about right.

Simply skipping the checksum check would probably get your startup down to seconds. It could be an optional check.

otoolep commented on June 29, 2024

We could also consider a much faster integrity check, such as CRC32. That might be fine for rqlite, since I am not concerned about malicious tampering. A CRC32 of the same data takes 13 seconds to run.
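
If you want to reproduce the comparison on your own hardware, here is a throwaway Go program (my sketch, not part of rqlite) that times both hashes over the same file. Note the second hash may benefit from the OS page cache after the first read:

package main

import (
	"crypto/sha256"
	"fmt"
	"hash"
	"hash/crc32"
	"io"
	"os"
	"time"
)

// timeHash streams the file through the given hash and reports the
// digest and the wall-clock time taken.
func timeHash(name string, h hash.Hash, path string) {
	f, err := os.Open(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	start := time.Now()
	if _, err := io.Copy(h, f); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%s: %x (%s)\n", name, h.Sum(nil), time.Since(start))
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: hashtime <file>")
		os.Exit(1)
	}
	timeHash("sha256", sha256.New(), os.Args[1])
	timeHash("crc32", crc32.NewIEEE(), os.Args[1])
}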

softwaremarkdahl commented on June 29, 2024

Yes the time spent at startup is due to the sha256sum check. Could it be done somehow on chunks of data, to parallelize and make it faster?

otoolep commented on June 29, 2024

Yes the time spent at startup is due to the sha256sum check.

How do you know? Did you add logging?

Could it be done somehow on chunks of data, to parallelize and make it faster?

Yes, use a different hash technique (see above), or just skip it.
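
For what it's worth, the hashing can be parallelized if both the writer and the checker agree on a chunked scheme. Below is a sketch of one such scheme -- hash each fixed-size chunk concurrently, then hash the ordered per-chunk digests. This is not what rqlite does, the result is deliberately not a plain SHA-256 of the whole file, and whether it is actually faster depends on the storage (NVMe is more likely to benefit than spinning disk):

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"sync"
)

const chunkSize = 1 << 30 // 1GiB per chunk; an arbitrary choice

// chunkedSHA256 hashes each chunk of the file concurrently, then hashes
// the ordered per-chunk digests into one final digest.
func chunkedSHA256(path string) (string, error) {
	fi, err := os.Stat(path)
	if err != nil {
		return "", err
	}
	n := int((fi.Size() + chunkSize - 1) / chunkSize)

	digests := make([][]byte, n)
	errs := make([]error, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		// Unbounded concurrency for brevity; a real implementation would
		// cap the number of in-flight goroutines.
		go func(i int) {
			defer wg.Done()
			f, err := os.Open(path) // each goroutine gets its own handle
			if err != nil {
				errs[i] = err
				return
			}
			defer f.Close()
			h := sha256.New()
			sec := io.NewSectionReader(f, int64(i)*chunkSize, chunkSize)
			if _, err := io.Copy(h, sec); err != nil {
				errs[i] = err
				return
			}
			digests[i] = h.Sum(nil)
		}(i)
	}
	wg.Wait()

	final := sha256.New()
	for i := 0; i < n; i++ {
		if errs[i] != nil {
			return "", errs[i]
		}
		final.Write(digests[i])
	}
	return hex.EncodeToString(final.Sum(nil)), nil
}

func main() {
	sum, err := chunkedSHA256(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(sum)
}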

softwaremarkdahl commented on June 29, 2024

Rqlited startup told me this:

[snapshot-store] 2024/06/12 08:42:32 store initialized using /media/NVMEdisk2/rqlite/file/data/rsnapshots
[snapshot-store] 2024/06/12 08:42:32 checking consistency of snapshot store at /media/NVMEdisk2/rqlite/file/data/rsnapshots
[snapshot-store] 2024/06/12 08:42:32 check complete
[store] 2024/06/12 08:42:32 1 preexisting snapshots present
[store] 2024/06/12 08:58:37 reusing existing database at /media/NVMEdisk2/rqlite/file/data/db.sqlite, skipping copy from Snapshot store

Looking at your code in store.go, it seems to me that the only time-consuming operation is the checksum in the createOnDisk function.

otoolep commented on June 29, 2024

Could probably get a ~4x speed-up by using this function:

import (
	"encoding/hex"
	"fmt"
	"hash/crc32"
	"io"
	"os"
)

// computeCRC32 streams the file through the IEEE CRC-32 polynomial and
// returns the hex-encoded checksum.
func computeCRC32(filePath string) (string, error) {
	file, err := os.Open(filePath)
	if err != nil {
		return "", fmt.Errorf("failed to open file: %v", err)
	}
	defer file.Close()

	hash := crc32.NewIEEE()
	if _, err := io.Copy(hash, file); err != nil {
		return "", fmt.Errorf("failed to compute checksum: %v", err)
	}
	checksum := hex.EncodeToString(hash.Sum(nil))
	return checksum, nil
}

instead of the existing function -- going forward that is.

otoolep commented on June 29, 2024

My plan is to fix this in 9.0.

I've already started development of 9.0: #1810
