- concept of file based database for partitioning and event sourced data
- like a local file storage version of a Kafka database
- 20230930041146 fsdb developing ideas
- run build script `build-fsdb` to build optimized components.
- I tried to make the whole thing with just scripts, but it is just too slow to handle billions of rows.
- fsdb can load roughly 30 million rows of real data per hour on a 12-core machine with an SSD
- 20231004133128 an optimized hashcode generator for partitioning work into multiple processes
- the first field of the data is regarded as an ID
- data stored as TSV or CSV in two partition files:
- a compressed file holds all existing data
- an uncompressed file holds recently added data.
- the uncompressed data can be compressed and appended to the compressed file periodically as rows are inserted, so the overall size of the database doesn't grow too fast
- data is stored in partition files named after the partition number
- optionally store a timestamp with each row
- timestamps in fsdb
- should printing ignore the timestamp unless a command line option asks for it?
- initialize database with `-t` to enable timestamps
- search for timestamps using the `searchtime` subcommand
- bloom filter could be optional feature implemented with hooks
- 20231002021919 bloom filter for fsdb
- can use as a large lookup table like DynamoDB
- 20231003062001 fsdb use case - using as a set
- join with another file or stream piped to standard input. this is possible if the ID is the first column.
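
The join and set ideas above can be sketched with plain awk, since the ID is always the first column. This is only an illustration under assumptions, not fsdb's actual code: the function names `fsdb_join` and `fsdb_missing` are hypothetical, and the database rows are assumed to be available as one uncompressed TSV file passed as an argument.

```shell
# Hypothetical helpers; the names are illustrative, not real fsdb subcommands.

# Print database rows whose ID (first field) appears on standard input.
# $1 = file holding database rows (TSV, ID in column 1)
fsdb_join() {
    awk -F '\t' '
        pass == 1 { want[$1] = 1; next }   # stdin: collect the IDs asked for
        $1 in want                          # database: print matching rows
    ' pass=1 - pass=2 "$1"
}

# Print IDs from standard input that are NOT in the database (set difference).
# $1 = file holding database rows (TSV, ID in column 1)
fsdb_missing() {
    awk -F '\t' '
        pass == 1 { have[$1] = 1; next }   # database: remember stored IDs
        !($1 in have) { print $1 }          # stdin: print IDs never seen
    ' pass=1 "$1" pass=2 -
}
```

Usage would look like `cut -f1 other.tsv | fsdb_missing rows.tsv`. The `pass=N` assignments between file operands tell awk which input it is reading, which keeps the logic correct even when one of the inputs is empty.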
- basic set of features / subcommands needed for database
- initialize and set up number of partitions
- search for one ID or multiple
- search for IDs missing from the database - set difference
- ingest data - pipe it into standard input and an awk script will put it where it belongs
- print all data
- compress subcommand - compress the uncompressed text file and append the resulting gzip stream to the partition's compressed file. called by ingest when a partition gets too large
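
A rough sketch of how ingest and compress could fit together. Everything specific here is an assumption made for illustration: the function names, the `DIR/<n>.tsv` and `DIR/<n>.tsv.gz` file layout, and the simple polynomial hash (the optimized hashcode generator mentioned in these notes is not shown). It relies on the fact that a gzip file may consist of several concatenated members, so a freshly compressed chunk can simply be appended.

```shell
# Hypothetical sketch; layout assumed: DIR/<n>.tsv (recent rows, plain text)
# and DIR/<n>.tsv.gz (older rows, concatenated gzip members).

# Read TSV rows on stdin and append each one to the partition chosen by
# hashing the ID in the first field.  $1 = database dir, $2 = partition count
ingest() {
    awk -F '\t' -v dir="$1" -v n="$2" '
        BEGIN { for (i = 1; i < 256; i++) ord[sprintf("%c", i)] = i }
        {
            h = 0
            for (i = 1; i <= length($1); i++)
                h = (h * 31 + ord[substr($1, i, 1)]) % 1000003
            print >> (dir "/" (h % n) ".tsv")   # file named after partition
        }
    '
}

# Compress the recent rows of one partition and append them to its gzip file.
# $1 = database dir, $2 = partition number
compress_partition() {
    f="$1/$2.tsv"
    [ -s "$f" ] || return 0               # nothing to compress
    gzip -c "$f" >> "$f.gz" && : > "$f"   # append new member, then truncate
}

# Print every row of one partition: compressed rows first, then recent ones.
print_partition() {
    f="$1/$2.tsv"
    [ -f "$f.gz" ] && gzip -dc "$f.gz"
    [ -f "$f" ] && cat "$f"
    return 0
}
```

Because the same ID always hashes to the same partition number, a search for one ID only needs to read one pair of files.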
- testing timestamps for data
awk '
BEGIN {
print systime()
}
'
1696194880
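
Building on that test, timestamps could be attached at ingest time and hidden by default when printing. This is a minimal sketch under assumptions, not how fsdb does it: the function names and the `--with-time` argument are hypothetical, and storing the timestamp as the last field is a guess, since the notes don't say where it goes. The time is taken from `date +%s` so the sketch doesn't depend on awk providing `systime()`.

```shell
# Hypothetical helpers; storing the timestamp as the LAST field is an assumption.

# Append the current Unix time as an extra field to every row on stdin.
add_timestamp() {
    awk -v OFS='\t' -v now="$(date +%s)" '{ print $0, now }'
}

# Print rows from stdin without their trailing timestamp field,
# unless the (hypothetical) argument --with-time is given.
print_rows() {
    if [ "${1:-}" = "--with-time" ]; then
        cat
    else
        awk '{ sub(/\t[0-9]+$/, ""); print }'
    fi
}
```

For example, `printf 'id1\tvalue\n' | add_timestamp | print_rows` prints the row as it was ingested, while adding `--with-time` keeps the stored timestamp column.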
zet/20230929145418/README.md
- 20221008042814 WIP
- 20230928171946 data analysis scripting hub
- 20230929153207 computer science experiments hub
- 20230929194846 testing if shortcuts will work and not pick up extra files
- 20230930041146 fsdb developing ideas
- 20230930224454 problems encountered with zkvr while testing this environment
- 20231001151606 hashcodes for fsdb partitioning
- 20231001222624 test gzip append functionality
- 20221009192000 stuff to put on main page
- 20231003022851 data engineering hub
- 20231003062001 fsdb use cases
- 20231003063630 adding timestamps to fsdb
- 20231004133128 an optimized hashcode generator for partitioning work into multiple processes
- 20230905015223 install scripts
- 20231116151546 how to implement a modular subcommand with lightweight scripts - ideas for a blog post
- 20231122053807 fsdb implementation details
- 20231122231545 publishing fsdb to github as a standalone project in a repository
Tags:
#data #file #database #project #shortcmd