
sqlite-s3vfs's Introduction

sqlite-s3vfs


Python virtual filesystem for SQLite to read from and write to S3.

No locking is performed, so client code must ensure that writes do not overlap with other writes or reads. If multiple writes happen at the same time, the database will probably become corrupted and data will be lost.

Based on simonwo's gist, and inspired by phiresky's sql.js-httpvfs, dacort's Stack Overflow answer, and michalc's sqlite-s3-query.

How does it work?

sqlite-s3vfs stores the SQLite database in fixed-size blocks, and each block is stored as a separate object in S3. SQLite stores its data in fixed-size pages, and always writes exactly a page at a time. This virtual filesystem translates page reads and writes to block reads and writes. When SQLite pages are the same size as blocks, which is the case by default, each page write results in exactly one block write.

Separate objects are required since S3 does not support the partial replace of an object; to change even 1 byte, it must be re-uploaded in full.
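As an illustration of this translation, here is a minimal sketch (not the library's actual code) of how a read at a given byte offset and length maps onto block indexes, assuming the default 4096-byte block size. The helper name is purely illustrative.

BLOCK_SIZE = 4096

def blocks_for_read(offset, length, block_size=BLOCK_SIZE):
    # Yield (block_index, start, end) byte ranges that together cover the
    # requested range, one tuple per block touched
    while length > 0:
        block_index = offset // block_size
        start = offset % block_size
        to_read = min(length, block_size - start)
        yield block_index, start, start + to_read
        offset += to_read
        length -= to_read

# A page-aligned 4096-byte read touches exactly one block
print(list(blocks_for_read(offset=8192, length=4096)))  # [(2, 0, 4096)]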

Installation

sqlite-s3vfs can be installed from PyPI using pip.

pip install sqlite-s3vfs

This will automatically install boto3, APSW, and any of their dependencies.

Usage

sqlite-s3vfs is an APSW virtual filesystem that requires boto3 for its communication with S3.

import apsw
import boto3
import sqlite_s3vfs

# A boto3 bucket resource
bucket = boto3.Session().resource('s3').Bucket('my-bucket')

# An S3VFS for that bucket
s3vfs = sqlite_s3vfs.S3VFS(bucket=bucket)

# sqlite-s3vfs stores many objects under this prefix
# Note that it's not typical to start a key prefix with '/'
key_prefix = 'my/path/cool.sqlite'

# Connect, insert data, and query
with apsw.Connection(key_prefix, vfs=s3vfs.name) as db:
    cursor = db.cursor()
    cursor.execute('''
        CREATE TABLE foo(x,y);
        INSERT INTO foo VALUES(1,2);
    ''')
    cursor.execute('SELECT * FROM foo;')
    print(cursor.fetchall())

See the APSW documentation for more examples.

Serializing (getting a regular SQLite file out of the VFS)

The bytes corresponding to a regular SQLite file can be extracted with the serialize_iter function, which returns an iterable,

for chunk in s3vfs.serialize_iter(key_prefix=key_prefix):
    print(chunk)

or with serialize_fileobj, which returns a non-seekable file-like object. This can be passed to Boto3's upload_fileobj method to upload a regular SQLite file to S3.

target_obj = boto3.Session().resource('s3').Bucket('my-target-bucket').Object('target/cool.sqlite')
target_obj.upload_fileobj(s3vfs.serialize_fileobj(key_prefix=key_prefix))
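The serialized bytes can also be written to a local file; a minimal sketch, with an illustrative local path:

# Write the serialized database out as a regular local SQLite file
with open('cool.sqlite', 'wb') as f:
    for chunk in s3vfs.serialize_iter(key_prefix=key_prefix):
        f.write(chunk)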

Deserializing (getting a regular SQLite file into the VFS)

# Any iterable that yields bytes can be used. In this example, bytes come from
# a regular SQLite file already in S3
source_obj = boto3.Session().resource('s3').Bucket('my-source-bucket').Object('source/cool.sqlite')
bytes_iter = source_obj.get()['Body'].iter_chunks()

s3vfs.deserialize_iter(key_prefix='my/path/cool.sqlite', bytes_iter=bytes_iter)
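Any other source of bytes works just as well. For example, a regular SQLite file on local disk can be streamed in; a minimal sketch, with an illustrative local path and chunk size:

def local_file_chunks(path, chunk_size=65536):
    # Yield the file's bytes in chunks rather than loading it all into memory
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk

s3vfs.deserialize_iter(key_prefix='my/path/cool.sqlite', bytes_iter=local_file_chunks('cool.sqlite'))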

Block size and page size

SQLite writes data in pages, which are 4096 bytes by default. sqlite-s3vfs stores data in blocks, which are also 4096 bytes by default. If you change one you should change the other to match for performance reasons.

s3vfs = sqlite_s3vfs.S3VFS(bucket=bucket, block_size=65536)
with apsw.Connection(key_prefix, vfs=s3vfs.name) as db:
    cursor = db.cursor()
    cursor.execute('''
        PRAGMA page_size = 65536;
    ''')
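Note that SQLite only applies a new page_size before the database's first table is created (or after a VACUUM), so set it on a fresh database. As a quick check, not part of the original example, you can confirm the page size SQLite is actually using:

# Reports the page size in effect for this database
with apsw.Connection(key_prefix, vfs=s3vfs.name) as db:
    print(db.cursor().execute('PRAGMA page_size;').fetchall())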

Tests

The tests require the dev dependencies to be installed and MinIO to be started

pip install -e ".[dev]"
./start-minio.sh

They can then be run with pytest

pytest

and finally MinIO can be stopped

./stop-minio.sh

sqlite-s3vfs's People

Contributors

josefsmith, michalc, simonwo


sqlite-s3vfs's Issues

Consider decoupling the block size from the page size

This may be a documentation issue, or an implementation change may be needed.

From the docs:

Block size and page size

SQLite writes data in pages, which are 4096 bytes by default. sqlite-s3vfs stores data in blocks, which are also 4096 bytes by default. If you change one you should change the other to match for performance reasons.

4096 is a good number locally: it matches the default memory page size on Linux and Windows and the block size on ext3/ext4 filesystems.

On S3, or over HTTP generally, 4096 bytes seems small, as per-request overhead will be a large fraction of the total time, and fewer HTTP requests is probably optimal.

Q: Are block size and page size really coupled, or can the block size be a multiple of the page size and still be performant?

If it's fine as a multiple, can the docs be updated to mention this?

Action:

Consider recommending a larger block size (may need testing to find an optimal size, but the 65536 from the docs doesn't seem like a bad start).

It may be worth using uktrade/tamato to run these tests, as it is a user of sqlite-s3vfs.

Advertise correct permissions in xAccess

xAccess is meant to advise SQLite whether it can read or write to a file, depending on the permissions it asks for. At the moment it is only capable of telling SQLite whether the file already exists.

xAccess should hook into the AWS ACL stuff and work out whether files are readable or writable as well.

Return correct file sizes from xFileSize

The file size handling could probably be better. I had an idea that the last block could be made its actual size, instead of requiring all blocks to be 64 KB, and then the file size could easily be worked out by summing the sizes of all the blocks from AWS HEAD calls, but I didn't implement that.

This probably won't cause many problems, but it would be confusing if SQLite were to xTruncate a file to a certain size and then read back a size different from the one expected.
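A minimal sketch of the idea above, not the current implementation, assuming each block is stored as an object under the database's key prefix (as described in "How does it work?") and using a listing rather than per-block HEAD calls:

def file_size_from_blocks(bucket, key_prefix):
    # Sum the sizes of all block objects stored under the database's prefix
    return sum(obj.size for obj in bucket.objects.filter(Prefix=key_prefix + '/'))

print(file_size_from_blocks(bucket, 'my/path/cool.sqlite'))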
