fstpackage / fstlib
A C++ library for lightning fast multi-threaded serialization of tabular data. Home to the `fst` file format.
License: Mozilla Public License 2.0
Explaining the goals behind the fstlib library and the differences with the arrow/parquet philosophy.
I was looking into lib/factor/factor_v7.cpp and see code like if (*nrOfLevels < 128). The comment says
// use 1 byte per int (Na encoding takes 1 bit)
which seems to be "wasting" the other 7 bits once that one bit is used; technically a byte can support up to 256 distinct values (including NA and NaN).
Without much background, I assume it has to do with how R encodes the values, so they are always stored as int instead of unsigned int. I don't know if it would be too expensive in terms of performance to relax this to 256 by converting to unsigned int. I know Julia supports unsigned values with its UInt8 type.
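To make the suggestion concrete, here is a minimal sketch of how one unsigned byte could hold up to 255 level codes plus NA. It is not how fstlib stores factors. The names `kNaInt`, `kNaByte`, `EncodeLevel` and `DecodeLevel` are hypothetical, and the sketch assumes that in-memory level codes are 1-based and that NA is `INT_MIN`, as in R integer vectors.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// Assumption: NA is represented in memory as INT_MIN, as R does for integers.
constexpr int kNaInt = INT_MIN;

// Hypothetical on-disk sentinel: byte value 0 marks NA. This is free to use
// because the level codes are assumed to start at 1.
constexpr std::uint8_t kNaByte = 0;

// Pack a 1-based level code (1..255) or NA into a single unsigned byte.
inline std::uint8_t EncodeLevel(int level) {
    assert(level == kNaInt || (level >= 1 && level <= 255));
    return level == kNaInt ? kNaByte : static_cast<std::uint8_t>(level);
}

// Unpack a stored byte back into an int level code or NA.
inline int DecodeLevel(std::uint8_t stored) {
    return stored == kNaByte ? kNaInt : static_cast<int>(stored);
}
```

With a reserved sentinel value instead of a reserved bit, the one-byte path would cover factors with up to 255 levels rather than 127. Whether the extra branch per element costs measurable performance would need to be benchmarked.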
To include only the lib subdirectory. Also, add a coveralls banner to the homepage.
Is Linux supported out of the box?
If yes, what is the recommended way to compile on Linux?
See for example here. To lower memory requirements, fstlib allocates larger blocks of memory that are written to by several threads. In such cases, cache line pollution must be avoided.
A solution is to make sure each sub-block has a size that is a multiple of the cache line size (64 bytes on most modern Intel processors).
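The padding idea can be sketched as follows. This is an illustration under stated assumptions, not code from fstlib: the names `kCacheLineSize`, `PaddedBlockSize`, `IsCacheLineAligned` and `SubBlockStart` are hypothetical, and the 64-byte line size is taken from the text above. The effect being avoided, two threads writing to the same cache line, is commonly called false sharing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache line size in bytes (64 on most modern Intel processors).
constexpr std::size_t kCacheLineSize = 64;

// Round a requested sub-block size up to the next multiple of the cache line size.
constexpr std::size_t PaddedBlockSize(std::size_t requested_bytes) {
    return (requested_bytes + kCacheLineSize - 1) / kCacheLineSize * kCacheLineSize;
}

// True if the pointer sits on a cache line boundary.
inline bool IsCacheLineAligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kCacheLineSize == 0;
}

// Start of the sub-block owned by `thread_index` inside one shared buffer.
// Each thread writes only inside its own padded sub-block.
inline char* SubBlockStart(char* buffer, std::size_t thread_index,
                           std::size_t requested_bytes) {
    assert(IsCacheLineAligned(buffer));  // the shared buffer itself must be aligned
    return buffer + thread_index * PaddedBlockSize(requested_bytes);
}
```

Padding the sub-block size is only half of the requirement: the large buffer also has to start on a cache line boundary (for example by declaring it with `alignas(64)`), otherwise neighbouring sub-blocks can still share a line.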
I know it's going to be a bit of work, but a full description of the fst format would help build connectors to it from Julia, Python, and any other programming language. The potential is huge for such an awesome on-disk data manipulation framework!
I will try to help when I know enough C++. I secretly hope that once the format is well known, there can be independent implementations in Julia and Rust (at the risk of running out of sync with the C++ one); native implementations would be fun. But calling into C++ is also a good option.
With the metadata intact
Currently, there is no clear documentation on how to set up a C++ project using the fstlib library.
A sample C++ project would be a good starting point for potential users. Some ideas:
... fst file
... fst file
... fst file
... parquet files to fst files using the arrow library as the in-memory tabular representation. That would really showcase the flexibility of the fstlib library.
Also, it would be interesting to compare the read and write speed of a pure C++ consumer to that of the fst package.
To enable client applications to effectively implement the fstlib library
First, thanks for the library!
What is the recommended approach to writing large datasets (e.g. 20+ GB CSV files)? Is there any way to stream reading and writing?
I have a hard time finding documentation on how to use it. The only documentation I found uses data frames. I am not an expert on R, but I think those are in-memory only.
Also, I would ideally like to use it in a Rust program, which means I'll probably need to write a Rust binding for the required parts. Happy to share it if you want!