fstpackage / fst
Lightning Fast Serialization of Data Frames for R
Home Page: http://www.fstpackage.org/fst/
License: GNU Affero General Public License v3.0
Not sure where else to leave this, but wanted to provide some benchmark feedback.
My data isn't huge, but it does need to be loaded into a Shiny app, so every millisecond counts. The data is also periodically transferred from a remote server to shinyapps.io, where the app is hosted, so file size is also a concern.
pryr::object_size(fbHistoricResultsSQL)
> 143 MB
Writing the object to the files:
microbenchmark(times=10,
write.csv(fbHistoricResultsSQL, "fbHistoricResultsSQL.csv"),
data.table::fwrite(fbHistoricResultsSQL, "fbHistoricResultsSQL_fwrite.csv"),
saveRDS(fbHistoricResultsSQL, "fbHistoricResultsSQL.Rds"),
feather::write_feather(fbHistoricResultsSQL, "fbHistoricResultsSQL.feather"),
fst::write.fst(fbHistoricResultsSQL, "fbHistoricResultsSQL.fst"),
fst::write.fst(fbHistoricResultsSQL, "fbHistoricResultsSQL_compressed.fst", compress = 100)
)
Rds is left at default compression settings.
Speed to write isn't a huge concern for me, as it's done on a remote server controlled by a cron job. fst is fastest, while fst with 100 compression is still quicker than Rds, but slower than feather and fwrite.
Unit: milliseconds
expr min lq mean median uq max neval
write.csv(fbHistoricResultsSQL, "fbHistoricResultsSQL.csv") 43327.6840 44443.1966 51946.9457 51962.6121 58955.6631 62506.981 10
data.table::fwrite(fbHistoricResultsSQL, "fbHistoricResultsSQL_fwrite.csv") 499.8059 517.6073 643.5355 525.9915 672.7274 1077.391 10
saveRDS(fbHistoricResultsSQL, "fbHistoricResultsSQL.Rds") 11399.7927 13771.2299 17093.8948 18078.2103 20357.4133 20481.531 10
feather::write_feather(fbHistoricResultsSQL, "fbHistoricResultsSQL.feather") 354.8188 416.0689 632.8047 505.7227 878.3790 1132.388 10
fst::write.fst(fbHistoricResultsSQL, "fbHistoricResultsSQL.fst") 340.8930 362.4958 591.8688 418.2773 1050.4093 1059.357 10
fst::write.fst(fbHistoricResultsSQL, "fbHistoricResultsSQL_compressed.fst", compress = 100) 1655.4058 2262.0219 3968.6550 4650.9530 4881.5290 6272.078 10
Rds does create the smallest file sizes, with 100 compressed fst a little way behind.
> file.size("fbHistoricResultsSQL.csv")
[1] 114849990
> file.size("fbHistoricResultsSQL_fwrite.csv")
[1] 90092546
> file.size("fbHistoricResultsSQL.Rds")
[1] 26563307
> file.size("fbHistoricResultsSQL.feather")
[1] 148798120
> file.size("fbHistoricResultsSQL.fst")
[1] 148459382
> file.size("fbHistoricResultsSQL_compressed.fst")
[1] 34119811
For ease of viewing, the Rds file is about 25.3 MB and the 100 compressed fst is 32.5 MB.
Now, reading the data back in from the saved files:
microbenchmark(times=10,
csv <- read.csv("fbHistoricResultsSQL.csv", sep=",", stringsAsFactors = FALSE),
fread_csv <- fread("fbHistoricResultsSQL_fwrite.csv"),
Rds_in <- readRDS("fbHistoricResultsSQL.Rds"),
feather_in <- feather::read_feather("fbHistoricResultsSQL.feather"),
fst_in <- fst::read.fst("fbHistoricResultsSQL.fst"),
fst_compressed_in <- fst::read.fst("fbHistoricResultsSQL_compressed.fst")
)
fst wins again, but crucially, 100 compressed fst is significantly faster than Rds.
Unit: milliseconds
expr min lq mean median uq max neval
csv <- read.csv("fbHistoricResultsSQL.csv", sep = ",", stringsAsFactors = FALSE) 17274.9592 18683.0098 20032.4658 19753.4444 21068.422 23651.957 10
fread_csv <- fread("fbHistoricResultsSQL_fwrite.csv") 2784.1564 3508.4621 3729.2320 3723.2384 4078.078 4672.571 10
Rds_in <- readRDS("fbHistoricResultsSQL.Rds") 3381.8752 5061.3580 5383.2858 5359.9506 5845.889 6525.150 10
feather_in <- feather::read_feather("fbHistoricResultsSQL.feather") 356.4536 437.6393 821.1549 741.2754 1121.273 1700.306 10
fst_in <- fst::read.fst("fbHistoricResultsSQL.fst") 355.2809 363.5862 712.1544 594.1401 1088.736 1438.211 10
fst_compressed_in <- fst::read.fst("fbHistoricResultsSQL_compressed.fst") 667.0044 853.3418 1357.8370 1489.1858 1627.346 1957.504 10
For my use case, which combines a desire for small binary file size with the fastest read time, the 100 compressed fst looks to be just about the right fit.
Of course, any advances in compression that bring the 100 compressed fst below the Rds file size, without a dramatic read-speed impact, would be fantastic.
The built-in dictionary could speed up compression of character columns significantly. Alternatively, we could use the ZSTD compressor with a pre-trained dictionary.
Amazing package, excellent for building a cache mechanism. I have a matrix of time series with dates in the rows (700) and entities in the columns (2 million). When I coerce this matrix to a data.table and write/read with fst, I lose many columns:
write.fst: Classes ‘data.table’ and 'data.frame': 559 obs. of 2191021 variables:
read.fst: Classes ‘data.table’ and 'data.frame': 559 obs. of 28333 variables:
Just saw your new package. Looks very nice! Thanks for the praise on fwrite.
I'm finding the headline chart hard to grasp, though. For example, what does "0" for fread mean compared to a speed of 3,000 for fst? Is this a test on 1e7 rows, and what are the actual timings? Here's the chart and below it my suggestion:
I'd much prefer to see a size chosen and then the amount of time reported, like I did in this article:
Where does fst fit into that table, for example?
From Jean-Luc by email: during a write.fst operation there is no error message when the storage device runs out of space and data is only partially written. A subsequent read.fst does not detect the erroneous file, which leads to undefined behavior.
A list of currently planned milestones for fst with some key features:
format-complete: the fst format allows for:
a) row binding of data frames
b) column binding of data frames
c) persisting (custom) column attributes
d) persisting and indexing table keys
e) a range of compression algorithms
f) storing hashes for each data block (#49)
stand-alone C++ core library:
a) the core code for fst is available as a separate C++ library
interface: a fst streaming object can be used like a data.frame:
a) (simple) on-the-fly sub-setting (requires far less memory)
b) selection of columns
c) append columns
d) append rows
e) rbind several fst files, or rbind fst files with in-memory data sets
multi-threading:
a) multi-threaded compression/decompression and multi-threaded IO using RcppParallel or tinythread++
b) benchmark suite tracking performance for each column type. Should be run after each commit to monitor performance after future changes and further enhancements.
added functionality:
a) lapply-like functionality creating a fst file from a list of inputs (csv's, custom methods, etc.)
b) directly convert csv to fst without memory overhead
interoperability:
a) import data from Apache Parquet files
b) the types used in the fst C++ core library are close to Apache Arrow
c) Python interface?
advanced operations:
a) on the fly sequential and parallel grouping using custom methods
b) binary search on table key columns (extremely fast sub-setting of a key range)
c) adding columns using a merge operation (with a fst file acting as the right-join data set)
d) a fst file can be sorted into a new fst file using a merge-sort algorithm
e) multiple fst files represent a single data set
f) operations can be performed on the set of fst files in parallel
g) a set of fst files can be sorted in parallel into a new set of fst files. This avoids the slow end-phase of sorting algorithms like merge sort.
h) user-defined map-reduce operations that can be used on the fst file(s) in parallel. Simple example: a custom mean method using 1) sum and count over each chunk, 2) take the results from 1) to calculate the mean.
i) fill a data set range with specific rows from a fst file, overwriting data in-memory (#29).
performance enhancements:
a) encryption
b) SIMD upgrades to the bit-shifters and pre-serialization filters used in fst
c) a plug-in system (C++) for custom compressors to allow users to come up with faster or better compressors
d) test compressing character columns with Brotli (Brotli packs a pre-built dictionary)
e) high compression mode for slow IO (network) speeds (#23).
This list is subject to a lot of change depending on features and issues requested/reported by users of the fst package :-)
By setting the first parameter to a vector of file names.
By specifying a condition on one or more columns of the stored table, data can be read using far less memory than a full read combined with a selection of rows. Related to issue #15 and issue #16: data can be read using a stream object and selection can be done on chunks of data, rather than the complete data set. Restrictions:
median(ColA) / sum(ColA)
(e.g. data.table's rbindlist). This will have an effect on performance.
This is a great package -- it halved my data loading time, though with some effort. I frequently group data into lists (e.g. a time series "dataset" with data in a data.table, inventory in a small data.frame and xts dates/representations) of the form mydata = list("x" = data.table(..), "y" = data.table, "z" = chr) etc.
I was able to write a wrapper around these to parse component datatables to separate .fst files, but it would be great if you generalized the read and write to more general data structures. Eventually, I think this can really be a replacement for save and load.
All attributes are stored in the fst file format. Complex types can be stored using R's native serialization mechanism, compressed with the fast LZ4 or ZSTD for large attribute objects. It might be a good idea to store the col.names attribute as a normal column for speed. Allowing attributes would facilitate adding user-defined metadata to a fst file.
Something I would find incredibly useful is being able to run select-like queries when reading from fst. Given that data.tables have keys, I was thinking that either this data.table feature could be leveraged, or by re-using some of its code we might get this feature.
Just to be clear: say we have a data.table with date, id, col1, col2, col3 saved as a fst file.
I'd like to be able to do something like read.fst(path=myPath, columns=myColumns, select="date==2017-01-01 & id %like% 'fst*'")
I realize that this almost makes fst a database, and I don't know if it is doable, but that's my 2 cents.
You might ask what this brings over loading the whole file and sub-selecting; I was thinking that for people like me, working remotely over networks, it could make sense.
Regards and thanks
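A workaround sketch until such a feature exists: narrow the read with the existing columns argument, then apply the row condition in memory. This assumes the data.table package; "myPath.fst" and the column names are illustrative, and the select string syntax from the request is replaced by an ordinary data.table filter.

```r
# Sketch, not a built-in feature: fst narrows a read by column (and
# row range); the select-style condition is applied after the
# column-reduced read.
library(data.table)

dt <- data.table(date = rep(c("2017-01-01", "2017-01-02"), 5),
                 id   = rep(c("fst_a", "other"), 5),
                 col1 = 1:10)
fst::write.fst(dt, "myPath.fst")

res <- setDT(fst::read.fst("myPath.fst", columns = c("date", "id", "col1")))
res <- res[date == "2017-01-01" & id %like% "fst"]
```

Over a network this still transfers the selected columns in full, but avoids reading columns that the query never touches.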
Is it possible to append to an fst without having to load it (completely)?
To perform a binary search on the key columns of a fst file, a key index is required when using compression. If a key index is not present, we have to decompress many complete blocks (16 kB each) to use a single row from each block in our binary search, which has a high cost. Instead, if we write the first row of each compression block to a separate index in the fst file format, we can perform a binary search on the key index and only decompress a single block to get to the actual row we are searching for. This will increase the fst file size by less than 0.05 percent for doubles (1/2000), so perfectly acceptable. A binary search on key values will probably be implemented in the version after the next.
Methods fst.rbind and fst.cbind will be added to the package in the next version.
Processing character columns is by far the slowest of all data types. For character columns (that are not completely random) we can solve this problem by first converting the vector into a factor. Factors can be efficiently serialized, provided the number of levels is significantly smaller than the number of rows. Random access will suffer because we have to load all levels even for a small subset of data. This can be partly solved by reading with a streaming object that caches the levels after a first read. Subsequent reads will then be faster.
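A minimal sketch of that idea, assuming low-cardinality text: convert the character column to a factor before writing, so only the few level strings plus integer codes need to be serialized (the column name and file name here are illustrative).

```r
# Sketch: serialize a repetitive character column as a factor.
# The three level strings are stored once instead of a million times.
df <- data.frame(course = sample(c("Exeter", "Ascot", "York"), 1e6, replace = TRUE),
                 stringsAsFactors = FALSE)
df$course <- factor(df$course)   # 3 levels instead of 1e6 strings
fst::write.fst(df, "courses.fst")
df2 <- fst::read.fst("courses.fst")
```

This only pays off when the number of levels is much smaller than the number of rows, exactly as described above.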
write.fst(df, fname)  # imagine df$date is a Date column
df <- read.fst(fname) # df$date comes back as "int"
This would reduce the memory footprint of writing to a fst file. It is also possible to use a parLapply approach, where data is generated in parallel but serialized to a fst file sequentially. In the latter scenario, we have to test the speed of data transfer from the nodes to the master, because it might be too slow for practical purposes (for example on an MPI cluster on Windows).
Hi Marcus,
I've just learned about your package, and its performance on the benchmarks looks absolutely impressive!
However could you please clarify some details about the test environment:
Thanks, and keep up the good work!
No matter what I choose, the columns option doesn't seem to work.
myDT <- read.fst("myfile.fst", columns = "firstcol",as.data.table = TRUE)
It says
Error in fstRead(fileName, columns, from, to) :
Selected column not found.
But I know it exists.
Using a vector of names doesn't work either.
But if I just use it without the columns option, it works well:
myDT <- read.fst("myfile.fst" ,as.data.table = TRUE)
That would involve creating a fst file-connection object (similar to base R's file method). With that object, data can be streamed row-by-row until the file is depleted (or the connection is closed). The binary file format allows for streaming from compressed fst files as well. In addition, a connection object could also be used to stream to a fst file. The fst format needs to accommodate multiple chunks for that option, and the connection object needs a (custom-sized) buffer for maximum performance.
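Until a real connection object exists, chunked streaming can be approximated with read.fst()'s from/to arguments. In this sketch the file name, chunk size, and total row count are all assumed; in practice the row count would come from prior knowledge or a metadata call.

```r
# Chunked-read sketch using the existing from/to arguments.
path <- "big.fst"
fst::write.fst(data.frame(x = 1:1e5), path)

chunk <- 1e4L
total <- 1e5L  # assumed known row count
for (start in seq(1L, total, by = chunk)) {
  block <- fst::read.fst(path, from = start, to = min(start + chunk - 1L, total))
  # process `block` here, e.g. accumulate running sums
}
```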
Hello.
read.fst is able to read columns selected by name.
It would be great (and I think easy) to also be able to select columns by their numeric position and, more importantly, with a vector of TRUEs and FALSEs.
It's common to have such vectors when performing other operations, such as greps, or when reading from Excel files with annotated information for every column.
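In the meantime, a workaround sketch is to translate positions or a logical vector into names first. A cheap one-row read is used here just to obtain the column names; myfile.fst and the columns are illustrative.

```r
# Workaround: read.fst() accepts only names, so map positions or a
# logical vector to names before the real read.
path <- "myfile.fst"
fst::write.fst(data.frame(a = 1:3, b = 4:6, c = 7:9), path)

cols <- names(fst::read.fst(path, from = 1, to = 1))  # cheap 1-row read
by_position <- fst::read.fst(path, columns = cols[c(1, 3)])
keep <- grepl("^[ac]$", cols)                         # e.g. a grep-derived logical
by_logical  <- fst::read.fst(path, columns = cols[keep])
```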
By following the steps taken in this dockerfile on a fresh Ubuntu install.
I have a column with UTF-8 (european accented) strings in a data frame. After I do a write.fst() and retrieve the df back via read.fst() the strings become mangled. Are you supporting multiple string encodings here?
I can't save a fst file to my home directory when using ~/ instead of the full path.
library(fst)
# Save file to home directory - works
write.fst(mtcars, "/home/USER/dummy.fst")
# Save file to home directory - produces error
write.fst(mtcars, "~/dummy.fst")
Error:
Error in fstStore(path, x, as.integer(compress)) :
There was an error creating the file. Please check for a correct filename.
On the other hand, both read.fst("~/dummy.fst") and read.fst("/home/USER/dummy.fst") work fine.
Session info:
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] fst_0.7.2 memuse_3.0-2
loaded via a namespace (and not attached):
[1] clisymbols_1.1.0 tools_3.3.2 Rcpp_0.12.10 data.table_1.10.4
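Until tilde expansion is handled inside fstStore(), a workaround is to expand the path in R before passing it on:

```r
# Workaround sketch: expand "~" before handing the path to write.fst().
library(fst)
write.fst(mtcars, path.expand("~/dummy.fst"))
```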
The conversion needs very little memory, as we can use the rbind functionality of fst to append chunks from the csv file. The resulting fst file would have random row and column access and could be used to perform calculations on data sets that are too big to fit into memory.
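A rough sketch of that conversion loop, assuming a hypothetical append = TRUE argument on write.fst() standing in for the planned rbind capability (it does not exist yet); re-applying the header to chunks after the first is omitted for brevity.

```r
# Sketch only: `append = TRUE` is the *planned* rbind capability and
# is hypothetical here. Each chunk is read with fread() and appended,
# so the full table never has to fit in memory.
library(data.table)

chunk <- 1e6L
skip  <- 0L
repeat {
  part <- fread("big.csv", skip = skip, nrows = chunk)
  if (nrow(part) == 0L) break
  fst::write.fst(part, "big.fst", append = skip > 0L)  # hypothetical argument
  skip <- skip + nrow(part)
}
```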
For example: https://stackoverflow.com/questions/42379995/bit64-integers-with-fst
library(bit64)
library(data.table)
library(fst)
library(magrittr)
# Prepare example csvs
DT64_orig <- data.table(x = (c(2345612345679, 1234567890, 8714567890)))
fwrite(DT64_orig, "DT64_orig.csv")
# Read and move to fst
DT64 <- fread("DT64_orig.csv")
write.fst(DT64, "DT64_fst.fst")
DT_fst2 <-
read.fst("DT64_fst.fst") %>%
setDT
# bit64 integers not preserved:
identical(DT_fst2, DT64)
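A heavily hedged workaround sketch: if, and only if, the 8-byte payload of the integer64 column survives the round trip unchanged, re-attaching the class restores the bit64 interpretation. Verify with identical() against the original before relying on this; if fst converted the values numerically instead, it will produce garbage.

```r
# Hedged workaround: re-attach the "integer64" class after reading.
# Only valid if the raw 8-byte payload was preserved by the round
# trip -- verify against the original data first.
library(bit64)
library(data.table)
DT_fst2 <- setDT(fst::read.fst("DT64_fst.fst"))
oldClass(DT_fst2$x) <- "integer64"
```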
With this feature you can populate, say, rows 1001:2000 of a 1e6-row data.table with a 1000-row read from read.fst. All this is done in memory. This feature is very useful for combining data from multiple (fst) sources into a single result table without the overhead of copies. For example, when performing the merge sort algorithm on a set of data files, you need to:
This can be performed efficiently in R by using data.table's fast sorting and populating the result table in memory. With such an algorithm operating on a collection of fst files, we basically have a method of sorting arbitrarily large fst files without running out of memory (and it can be done with multiple threads!).
Is it possible to support tilde expansion in read.fst and write.fst? I find that write.fst doesn't support it at all, but read.fst does sporadically. I haven't been able to reproduce an instance of it not working.
library(fst)
stopifnot(dir.exists("~/sandbox"))
z <- as.data.frame(x = 1)
write.fst(z, "~/sandbox/z.fst")
Error in fstStore(path, x, as.integer(compress)) :
There was an error creating the file. Please check for a correct filename.
write.fst(z, file.path(path.expand("~"), "sandbox", "z.fst")) # Works
zz <- read.fst("~/sandbox/z.fst") # Works
When trying to save a data.frame with 0 rows, the fst R package throws an error with the message:
Error in fstStore(normalizePath(path, mustWork = FALSE), x, as.integer(compress)) :
The dataset contains no data.
And the resulting .fst file is malformed; read.fst() reports that the file format is not recognized.
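Until empty tables are handled (or rejected cleanly before the file is created), a small guard avoids leaving a malformed file behind; safe_write_fst is just an illustrative wrapper name.

```r
# Defensive wrapper sketch: refuse to write zero-row frames so no
# malformed .fst file is created on disk.
safe_write_fst <- function(x, path, ...) {
  if (nrow(x) == 0L) stop("refusing to write a zero-row data frame to ", path)
  fst::write.fst(x, path, ...)
}
```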
3.7M observations of 257 variables. The data wrote to fst format without error, but when trying to read the file back I got "Error in fstRead(fileName, columns, from, to) :
embedded nul in string:"
base load/save and feather both handle reads and writes of this data without issue.
Here is a link to the Stata version of the data set available via the web: https://pl-garlock.s3.amazonaws.com/05-Databases/GST-8002/Garlock%20Analytical%20Database/exposure%20data%20-%20all%20disease-REDACTED.dta
Hi, nice package!
It would be a great competitor to the feather package if it were compatible with Python pandas dataframes.
Any plan to make it available in python?
Cheers,
Benoit
PS: my own benchmarks
> r_bench <- microbenchmark(
+ read_f = {dt1 <- read_feather(path = filename)},
+ read_dt = {dt1 <- fread(file = gsub(".feather", ".csv", filename), showProgress = FALSE)},
+ read_fst = {dt2 <- read.fst(path = gsub(".feather", ".fst", filename))},
+ read_fstc = {dt2 <- read.fst(path = gsub(".feather", ".fstc", filename))},
+ read_rds = {dt2 <- readRDS(file = gsub(".feather", ".rds", filename))},
+ read_rdsc = {dt2 <- readRDS(file = gsub(".feather", ".rdsc", filename))},
+ times = 3)
>
> print(r_bench)
Unit: milliseconds
expr min lq mean median uq max neval
read_f 73.49535 74.38310 74.80852 75.27085 75.46511 75.65938 3
read_dt 409.07989 410.28315 411.33413 411.48641 412.46125 413.43609 3
read_fst 67.21488 69.68649 74.13367 72.15810 77.59306 83.02803 3
read_fstc 113.58359 113.87905 114.01423 114.17451 114.22955 114.28458 3
read_rds 363.55270 366.95543 370.44090 370.35816 373.88500 377.41183 3
read_rdsc 571.20738 571.27464 575.87312 571.34189 578.20598 585.07008 3
> w_bench <- microbenchmark(
+ write_f = {write_feather(x = dt, path = filename)},
+ write_dt = {fwrite(dt, file = gsub(".feather", ".csv", filename))},
+ write_fst = {write.fst(x = dt, path = gsub(".feather", ".fst", filename))},
+ write_fstc = {write.fst(x = dt, path = gsub(".feather", ".fstc", filename),compress = 100)},
+ write_rds = {saveRDS(object = dt, file = gsub(".feather", ".rds", filename),compress = FALSE)},
+ write_rdsc = {saveRDS(object = dt, file = gsub(".feather", ".rdsc", filename),compress = TRUE)},
+ times = 3)
>
> print(w_bench)
Unit: milliseconds
expr min lq mean median uq max neval
write_f 77.57399 81.01968 84.72863 84.46536 88.30596 92.14655 3
write_dt 65.89461 69.54576 538.90557 73.19692 775.41105 1477.62517 3
write_fst 73.60318 75.90385 626.80981 78.20452 903.41312 1728.62172 3
write_fstc 202.33712 211.38273 220.21007 220.42834 229.14654 237.86473 3
write_rds 329.07046 3128.41469 4061.86755 5927.75891 5928.26610 5928.77328 3
write_rdsc 2436.99475 2443.04194 2447.12685 2449.08913 2452.19291 2455.29668 3
And provide fast compression with random access to the matrix. Check if there is a use-case for such a feature.
From Jean-Luc by email: when trying to read a file that was not saved with fst, I got a message that is not self-explanatory. Would it be possible to have something more explicit (like 'Wrong file type')? I suppose this suggests putting a signature in the header of the file. It is possibly too late now to change the structure.
Error in fstRead(fileName, columns, from, to) : std::bad_array_new_length
Hi! Thank you for great package!
I have a problem: when I try to install it on Ubuntu 14.04 on Microsoft R Open 3.3.1, from GitHub or CRAN, it fails (recipe for target 'FastStore.o' failed).
It appears that if an object is a data.table, saving to and restoring from fst loses the object class, reverting to data.frame.
fst version 0.7.2
library("pryr")
library("fst")
library("data.table")
pryr::object_size(fbHistoricResultsSQL)
437 MB
str(fbHistoricResultsSQL)
Classes ‘data.table’ and 'data.frame': 962306 obs. of 64 variables:
$ DATE : Date, format: "2009-01-01" "2009-01-01" "2009-01-01" "2009-01-01" ...
$ TIME : chr "1605" "1605" "1605" "1605" ...
$ FBRr : int 1 1 1 1 1 1 1 1 1 1 ...
$ FBR : int 9 9 9 9 9 9 9 9 9 9 ...
$ FBR Vodds : num 15.2 15.2 15.2 15.2 15.2 ...
$ FBR V% : num 13.99 3.74 5.59 -0.43 -0.79 ...
$ RSF : num 0.85 0.85 0.85 0.85 0.85 0.85 0.85 0.85 0.85 0.85 ...
$ POW : num 1 7.4 3 10.2 10 8.6 10.2 9.8 9.4 8 ...
$ POWr : int 12 9 11 2 4 7 2 5 6 8 ...
$ POW Vodds : num 25.5 16.1 22.1 13.1 13.3 ...
$ POW V% : num 7.91 3.49 3.53 -0.34 -0.76 0.77 0.41 0.93 -0.09 1.86 ...
$ RSP : num 0.49 0.8 0.57 0.99 0.98 0.88 0.99 0.96 0.93 0.84 ...
$ PACEr : int NA NA NA NA NA NA NA NA NA NA ...
$ PACE : num 0 0 0 0 0 0 0 0 0 0 ...
$ PACE% : num NA NA NA NA NA NA NA NA NA NA ...
$ PACELTO : num 0 0 0 0 0 0 0 0 0 0 ...
$ PACELTO% : num NA NA NA NA NA NA NA NA NA NA ...
$ PACEIMP : num NA NA NA NA NA NA NA NA NA NA ...
$ PACEIMPr : int NA NA NA NA NA NA NA NA NA NA ...
$ DSLR : num NA NA NA NA NA NA NA NA NA NA ...
$ COMr : int 1 1 1 1 1 1 1 1 1 1 ...
$ COM : int 0 0 0 0 0 0 0 0 0 0 ...
$ SKFr : int 7 7 7 1 1 7 1 7 1 7 ...
$ TRr : int 1 1 1 1 1 1 1 1 1 1 ...
$ JRr : int 1 1 1 1 1 1 1 1 1 1 ...
$ SRr : int 1 1 1 1 1 1 1 1 1 1 ...
$ bfwinr : int 12 10 11 3 1 7 6 7 4 9 ...
$ DRr : int 1 1 1 1 1 1 1 1 1 1 ...
$ POS : chr "5" "7" "12" "4" ...
$ SP : num 50 33 50 6.5 2 20 12 20 10 25 ...
$ BFOddsWin : num 227.41 71.97 100 8.6 3.15 ...
$ BFOddsPlace: num 32 14 18.5 2.7 1.51 5.4 4.6 5.56 3.35 8.86 ...
$ IPMIN : num 10 16.5 50 5 3.05 13 6 1.01 7.6 25 ...
$ IPMAX : num 570 120 1000 44 1000 1000 22 32 90 110 ...
$ AMWAP : num 1 21.26 1 9.64 3.93 ...
$ RTA : chr "NHFT" "NHFT" "NHFT" "NHFT" ...
$ rta2 : chr "Maiden" "Maiden" "Maiden" "Maiden" ...
$ TYPE : chr "4YO to 6YO" "4YO to 6YO" "4YO to 6YO" "4YO to 6YO" ...
$ TRAINER : chr "Karen George" "George Baker" "S C Burrough" "N J Hawke" ...
$ JOCKEY : chr "E Dehdashti" "A Tinkler" "R Greene" "Christian Williams" ...
$ COURSE : chr "Exeter" "Exeter" "Exeter" "Exeter" ...
$ DISTANCE : num 17 17 17 17 17 17 17 17 17 17 ...
$ GOING : chr "Good" "Good" "Good" "Good" ...
$ DRAW : chr "" "" "" "" ...
$ HCAP : chr "No" "No" "No" "No" ...
$ STAT : chr "" "D1" "" "" ...
$ HEADGEAR : chr "" "" "" "" ...
$ CLASS : chr "5" "5" "5" "5" ...
$ RUNNERS : num 13 13 13 13 13 13 13 13 13 13 ...
$ SIRE : chr "Karinga Bay" "Midnight Legend" "King O' The Mana" "Trempolino" ...
$ DAMSIRE : chr "Infantry" "Timeless Times" "Shahrastani" "Cadoudal" ...
$ CLMove : num NA NA NA NA NA NA NA NA NA NA ...
$ DIMove : num NA NA NA NA NA NA NA NA NA NA ...
$ WEMove : num NA NA NA NA NA NA NA NA NA NA ...
$ WEIGHT : chr "140" "147" "140" "147" ...
$ SEX : chr "m" "g" "m" "g" ...
$ SKF : int 1 1 1 3 3 1 3 1 3 1 ...
$ prob : num NA NA NA NA NA NA NA NA NA NA ...
$ ability : int NA NA NA NA NA NA NA NA NA NA ...
$ cons : int 0 0 0 0 0 0 0 0 0 0 ...
$ HORSE : chr "Prickles" "Shut The Bar" "Jackies Dream" "Maggio" ...
$ FBRVPERC : num 1399 374 559 -43 -79 ...
$ PACEPERC : num NA NA NA NA NA NA NA NA NA NA ...
$ Year : chr "2009" "2009" "2009" "2009" ...
- attr(*, ".internal.selfref")=<externalptr>
saveRDS(fbHistoricResultsSQL, "fbHistoricResultsSQL.Rds")
fst::write.fst(fbHistoricResultsSQL, "fbHistoricResultsSQL_compressed.fst", compress = 100)
Rds_in <- readRDS("fbHistoricResultsSQL.Rds")
fst_compressed_in <- fst::read.fst("fbHistoricResultsSQL_compressed.fst")
str(fst_compressed_in)
'data.frame': 962306 obs. of 64 variables:
$ DATE : num 14245 14245 14245 14245 14245 ...
 (remaining columns identical to the str() output above)
str(Rds_in)
Classes ‘data.table’ and 'data.frame': 962306 obs. of 64 variables:
$ DATE : Date, format: "2009-01-01" "2009-01-01" "2009-01-01" "2009-01-01" ...
 (remaining columns identical to the str() output above)
Data integrity is pretty important in the organization I work for.
Accountants ask for hash values of datasets when they perform audits. It must be certain that the right dataset is used and that it has not changed.
Is it possible to calculate a hash value and save it as metadata (in the same file) when you write the data to disk? When you read the file, it would be nice if the hash value were calculated on the fly and compared with the hash value stored as metadata. That way, you are always certain that you have the same data.
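Until hashes live inside the fst format itself, the check can be approximated externally with the digest package. write_audited/read_audited are illustrative names, and the comparison assumes the round trip preserves the object (including attributes) exactly; otherwise the check fails even on unchanged data.

```r
# External integrity-check sketch using the digest package.
library(digest)

write_audited <- function(x, path) {
  fst::write.fst(x, path)
  writeLines(digest(x, algo = "sha256"), paste0(path, ".sha256"))
}

read_audited <- function(path) {
  x <- fst::read.fst(path)
  stored <- readLines(paste0(path, ".sha256"))
  # Fails if the data (or any attribute) changed since it was written.
  stopifnot(identical(stored, digest(x, algo = "sha256")))
  x
}
```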
And add a section on benchmarking on https://fstpackage.github.io
When a sorted data set is stored as a fst binary file, sorting metadata is stored alongside the data. Using this metadata, a binary search can be performed on the key columns before actually reading the data. For example, only 32 random seeks are needed in the binary file to search 4 billion rows for the begin and end values of a selected range. The performance penalty will be very small (seeking with modern SSDs is very fast).
When saving to a network drive without compression I get about 50 Mbps, and with compression at 100 I get 8 Mbps. Saving to a local drive, however, is much faster, so the CPU doesn't seem to be the bottleneck.
I don't know if this has been requested already, but it would be helpful to provide functions that give back some basic info about the data frame in a saved fst file without reading the whole file. For example:
Thanks.
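For reference, later releases of the package expose exactly this kind of inspection via fst::metadata_fst() (fst.metadata() in older versions), which reads only the file header. A hedged sketch, guarded so it is a no-op when fst is not installed:

```r
# Sketch: inspect a .fst file (row count, column names and types) without
# loading its contents. Assumes a reasonably recent fst package; the file
# name is illustrative.
if (requireNamespace("fst", quietly = TRUE)) {
  path <- tempfile(fileext = ".fst")
  fst::write_fst(iris, path)
  meta <- fst::metadata_fst(path)   # header-only read
  print(meta)
}
```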
fst is not preserving Date columns; I hope that is an easy fix.
> u1=data.frame(DT=Sys.Date()-1:10,variable=rep(c("A","B"),5), value=1:10)
> write.fst(u1,"c:\\t\\u1.fst")
> u1
DT variable value
1 2017-02-05 A 1
2 2017-02-04 B 2
3 2017-02-03 A 3
4 2017-02-02 B 4
5 2017-02-01 A 5
6 2017-01-31 B 6
7 2017-01-30 A 7
8 2017-01-29 B 8
9 2017-01-28 A 9
10 2017-01-27 B 10
> u2=read.fst("c:\\t\\u1.fst")
> u2
DT variable value
1 17202 A 1
2 17201 B 2
3 17200 A 3
4 17199 B 4
5 17198 A 5
6 17197 B 6
7 17196 A 7
8 17195 B 8
9 17194 A 9
10 17193 B 10
>
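Until the Date class round-trips, a workaround is to restore it after reading: the integers that come back are day counts since the Unix epoch (1970-01-01), as the values above show. A minimal sketch using the numbers from the example:

```r
# The integers returned after the round trip are days since 1970-01-01,
# so the Date class can be restored manually. Values taken from the
# example output above.
dt_raw   <- c(17202L, 17201L, 17200L)
dt_fixed <- as.Date(dt_raw, origin = "1970-01-01")
dt_fixed   # "2017-02-05" "2017-02-04" "2017-02-03"
```

Applied to the example, `u2$DT <- as.Date(u2$DT, origin = "1970-01-01")` recovers the original column.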
I am having problems saving and reading very large files with the fst R package.
I used both the CRAN version and the development version.
The problem occurs intermittently. Every few tries at saving will get me a successful save. Most of the reads so far have failed.
However, if I read a subset of the file, the reads are mostly successful.
The data is a large data.table.
I don't know how to provide more info.
The following is the error from read.fst():
system.time({a=read.fst('/dev/shm/AllHorizonDT_00.fst',as.data.table=T)})
*** caught segfault ***
address 0x7f531143e038, cause 'memory not mapped'
*** caught segfault ***
address 0x7f71c25db020, cause 'memory not mapped'
Traceback:
1: .Call("fst_fstRead", PACKAGE = "fst", fileName, columnSelection, startRow, endRow)
2: fstRead(fileName, columns, from, to)
3: read.fst("/dev/shm/AllHorizonDT_00.fst", as.data.table = T)
4: system.time({ a = read.fst("/dev/shm/AllHorizonDT_00.fst", as.data.table = T)})
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: *** caught segfault ***
Selection: address 0x7f531143e038, cause 'memory not mapped'
This is the error from write.fst():
Saving All data for all horizons ...
*** caught segfault ***
address 0x7f201085103a, cause 'memory not mapped'
Traceback:
1: .Call("fst_fstStore", PACKAGE = "fst", fileName, table, compression)
2: fstStore(normalizePath(path, mustWork = FALSE), x, as.integer(compress))
3: write.fst(AllHorizonDT, path = filename, compress = fst_compress_level)
4: save_horizon_data(AllHorizonDT, formatsub = "Formatted/", maturedsub = "matured/", agesub = "Ages/", horizsub = "AllHorizon", savefst = T)
5: eval(expr, envir, enclos)
6: eval(ei, envir)
7: withVisible(eval(ei, envir))
8: source("Product_DataPrep.R")
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Neither is very informative. It basically looks like some kind of memory problem.
Let me know what I can do to help debugging this.
Thanks.
After initializing a fst object, the underlying fst file can be accessed in the same manner as a data.frame.
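The interface described above can be sketched as follows, assuming a package version that provides the fst() constructor (the file name is illustrative; only the selected rows and columns are read from disk):

```r
# Sketch: data.frame-like random access through a fst object.
# Guarded so it is a no-op when the fst package is not installed.
if (requireNamespace("fst", quietly = TRUE)) {
  path <- tempfile(fileext = ".fst")
  fst::write_fst(iris, path)
  ft <- fst::fst(path)        # creates the object; no data is read yet
  part <- ft[1:10, "Species"] # reads only rows 1-10 of one column
}
```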
This is standard with utils::write.csv and readr::write_csv.
Great, fast package here. Thank you!
Excellent package. I read that it supports data.tables. Would it be possible to also add support for reading fst files as tibbles?
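In the meantime, since read.fst() returns a plain data.frame by default, a tibble is one conversion away. A hedged sketch, assuming the fst and tibble packages are installed (file name illustrative):

```r
# Sketch: read a .fst file and wrap the result as a tibble.
# Guarded so it is a no-op when either package is missing.
if (requireNamespace("fst", quietly = TRUE) &&
    requireNamespace("tibble", quietly = TRUE)) {
  path <- tempfile(fileext = ".fst")
  fst::write_fst(mtcars, path)
  tbl <- tibble::as_tibble(fst::read_fst(path))
}
```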