fstpackage / fst
Lightning Fast Serialization of Data Frames for R
Home Page: http://www.fstpackage.org/fst/
License: GNU Affero General Public License v3.0
Not sure where else to leave this, but wanted to provide some benchmark feedback.
My data isn't huge, but it does need to be loaded into a Shiny app, so every millisecond counts. The data is also periodically transferred from a remote server to shinyapps.io, where the app is hosted, so file size is also a concern.
pryr::object_size(fbHistoricResultsSQL)
> 143 MB
Writing the object to the files:
microbenchmark(times=10,
write.csv(fbHistoricResultsSQL, "fbHistoricResultsSQL.csv"),
data.table::fwrite(fbHistoricResultsSQL, "fbHistoricResultsSQL_fwrite.csv"),
saveRDS(fbHistoricResultsSQL, "fbHistoricResultsSQL.Rds"),
feather::write_feather(fbHistoricResultsSQL, "fbHistoricResultsSQL.feather"),
fst::write.fst(fbHistoricResultsSQL, "fbHistoricResultsSQL.fst"),
fst::write.fst(fbHistoricResultsSQL, "fbHistoricResultsSQL_compressed.fst", compress = 100)
)
Rds is left at default compression settings.
Speed to write isn't a huge concern for me, as it's done on a remote server controlled by a cron job. fst is fastest, while fst with 100 compression is still quicker than Rds, but slower than feather and fwrite.
Unit: milliseconds
expr min lq mean median uq max neval
write.csv(fbHistoricResultsSQL, "fbHistoricResultsSQL.csv") 43327.6840 44443.1966 51946.9457 51962.6121 58955.6631 62506.981 10
data.table::fwrite(fbHistoricResultsSQL, "fbHistoricResultsSQL_fwrite.csv") 499.8059 517.6073 643.5355 525.9915 672.7274 1077.391 10
saveRDS(fbHistoricResultsSQL, "fbHistoricResultsSQL.Rds") 11399.7927 13771.2299 17093.8948 18078.2103 20357.4133 20481.531 10
feather::write_feather(fbHistoricResultsSQL, "fbHistoricResultsSQL.feather") 354.8188 416.0689 632.8047 505.7227 878.3790 1132.388 10
fst::write.fst(fbHistoricResultsSQL, "fbHistoricResultsSQL.fst") 340.8930 362.4958 591.8688 418.2773 1050.4093 1059.357 10
fst::write.fst(fbHistoricResultsSQL, "fbHistoricResultsSQL_compressed.fst", compress = 100) 1655.4058 2262.0219 3968.6550 4650.9530 4881.5290 6272.078 10
Rds does create the smallest file sizes, with 100 compressed fst a little way behind.
> file.size("fbHistoricResultsSQL.csv")
[1] 114849990
> file.size("fbHistoricResultsSQL_fwrite.csv")
[1] 90092546
> file.size("fbHistoricResultsSQL.Rds")
[1] 26563307
> file.size("fbHistoricResultsSQL.feather")
[1] 148798120
> file.size("fbHistoricResultsSQL.fst")
[1] 148459382
> file.size("fbHistoricResultsSQL_compressed.fst")
[1] 34119811
For ease of viewing, the Rds file is about 25.3 MB and the 100 compressed fst is 32.5 MB.
Now, reading the data back in from the saved files:
microbenchmark(times=10,
csv <- read.csv("fbHistoricResultsSQL.csv", sep=",", stringsAsFactors = FALSE),
fread_csv <- fread("fbHistoricResultsSQL_fwrite.csv"),
Rds_in <- readRDS("fbHistoricResultsSQL.Rds"),
feather_in <- feather::read_feather("fbHistoricResultsSQL.feather"),
fst_in <- fst::read.fst("fbHistoricResultsSQL.fst"),
fst_compressed_in <- fst::read.fst("fbHistoricResultsSQL_compressed.fst")
)
fst wins again, but crucially, 100 compressed fst is significantly faster than Rds.
Unit: milliseconds
expr min lq mean median uq max neval
csv <- read.csv("fbHistoricResultsSQL.csv", sep = ",", stringsAsFactors = FALSE) 17274.9592 18683.0098 20032.4658 19753.4444 21068.422 23651.957 10
fread_csv <- fread("fbHistoricResultsSQL_fwrite.csv") 2784.1564 3508.4621 3729.2320 3723.2384 4078.078 4672.571 10
Rds_in <- readRDS("fbHistoricResultsSQL.Rds") 3381.8752 5061.3580 5383.2858 5359.9506 5845.889 6525.150 10
feather_in <- feather::read_feather("fbHistoricResultsSQL.feather") 356.4536 437.6393 821.1549 741.2754 1121.273 1700.306 10
fst_in <- fst::read.fst("fbHistoricResultsSQL.fst") 355.2809 363.5862 712.1544 594.1401 1088.736 1438.211 10
fst_compressed_in <- fst::read.fst("fbHistoricResultsSQL_compressed.fst") 667.0044 853.3418 1357.8370 1489.1858 1627.346 1957.504 10
For my use case, which combines a desire for small binary file size with the fastest read time, the 100 compressed fst looks to be just about the right fit.
Of course, any advances in compression that bring the 100 compressed fst below the Rds file size, without a dramatic read-speed impact, would be fantastic.
The built-in dictionary could speed up compression of character columns significantly. Alternatively, we could use the ZSTD compressor with a pre-trained dictionary.
Amazing package, excellent for building a cache mechanism. I have a matrix of time series with dates in the rows (700) and entities in the columns (2 million). When I coerce this matrix to a data.table and write/read with fst, I lose many columns:
write.fst: Classes ‘data.table’ and 'data.frame': 559 obs. of 2191021 variables:
read.fst: Classes ‘data.table’ and 'data.frame': 559 obs. of 28333 variables:
Just saw your new package. Looks very nice! Thanks for the praise on fwrite.
I'm finding the headline chart hard to grasp, though. For example, what does "0" for fread mean compared to a speed of 3,000 for fst? Is this a test on 1e7 rows, and what are the actual timings? Here's the chart and below it my suggestion:
I'd much prefer to see a size chosen and then the amount of time reported, like I did in this article:
Where does fst fit into that table, for example?
From Jean-Luc by email: during a write.fst operation there is no error message when the storage device runs out of space and data is only partially written. A subsequent read.fst does not detect the erroneous file, which leads to undefined behavior.
A list of currently planned milestones for fst with some key features:
format-complete: the fst format allows for:
a) row binding of data frames
b) column binding of data frames
c) persisting (custom) column attributes
d) persisting and indexing table keys
e) a range of compression algorithms
f) storing hashes for each data block (#49)
stand-alone C++ core library:
a) the core code for fst is available as a separate C++ library
interface: a fst streaming object can be used like a data.frame:
a) (simple) on-the-fly sub-setting (requires far less memory)
b) selection of columns
c) append columns
d) append rows
e) rbind several fst files, or rbind fst files with in-memory data sets
multi-threading:
a) multi-threaded compression/decompression and multi-threaded IO using RcppParallel or tinythread++
b) benchmark suite tracking performance for each column type. Should be run after each commit to monitor performance after future changes and further enhancements.
added functionality:
a) lapply-like functionality creating a fst file from a list of inputs (csv's, custom methods, etc.)
b) directly convert csv to fst without memory overhead
interoperability:
a) import data from Apache Parquet files
b) the types used in the fst C++ core library are close to Apache Arrow
c) Python interface?
advanced operations:
a) on the fly sequential and parallel grouping using custom methods
b) binary search on table key columns (extremely fast sub-setting of a key range)
c) adding columns using a merge operation (with a fst file acting as the right-join data set)
d) a fst file can be sorted into a new fst file using a merge-sort algorithm
e) multiple fst files represent a single data set
f) operations can be performed on the set of fst files in parallel
g) a set of fst files can be sorted in parallel into a new set of fst files. This avoids the slow end-phase of sorting algorithms like merge sort.
h) user-defined map-reduce operations that can be used on the fst file(s) in parallel. Simple example: a custom mean method using 1) sum and count over each chunk, 2) take the results from 1) to calculate the mean.
i) fill a data set range with specific rows from a fst file, overwriting data in-memory (#29).
performance enhancements:
a) encryption
b) SIMD upgrades to the bit-shifters and pre-serialization filters used in fst
c) a plug-in system (C++) for custom compressors to allow users to come up with faster or better compressors
d) test compressing character columns with Brotli (Brotli packs a pre-built dictionary)
e) high compression mode for slow IO (network) speeds (#23).
This list is subject to a lot of change depending on features and issues requested/reported by users of the fst package :-)
By setting the first parameter to a vector of file names.
By specifying a condition on one or more columns of the stored table, data can be read using far less memory than a full read combined with a selection of rows. Related to issue #15 and issue #16: data can be read using a stream object and selection can be done on chunks of data, rather than the complete data set. Restrictions:
median(ColA) / sum(ColA)
(e.g. data.table's rbindlist). This will have an effect on performance.
This is a great package -- it halved my data loading time, though with some effort. I frequently group data into lists (e.g. a time series "dataset" with data in a data.table, inventory in a small data.frame and xts dates/representations) of the form mydata = list("x" = data.table(..), "y" = data.table, "z" = chr) etc.
I was able to write a wrapper around these to parse component datatables to separate .fst files, but it would be great if you generalized the read and write to more general data structures. Eventually, I think this can really be a replacement for save and load.
All attributes are stored in the fst file format. Complex types can be stored using R's native serialization mechanism, compressed with the fast LZ4 or ZSTD for large attribute objects. It might be a good idea to store the col.names attribute as a normal column for speed. Allowing attributes would facilitate adding user-defined metadata to a fst file.
Something I would find incredibly useful is being able to run select-like queries when reading from fst. Given that data.tables have keys, I was thinking that either this data.table feature could be leveraged, or by re-using some of its code we might get this feature.
Just to be clear: say we have a data.table with date, id, col1, col2, col3 saved as a fst file.
I'd like to be able to do something like read.fst(path=myPath, columns=myColumns, select="date==2017-01-01 & id %like% 'fst*'")
I realize that this almost makes fst a database, and I don't know if it is doable, but that's my 2 cents.
You might ask what this brings over loading the whole file and sub-selecting; I was thinking that for people like me, working remotely over networks, it could make sense.
Regards and thanks
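A workaround sketch until such a feature exists: narrow the read with the existing columns argument, then apply the row condition in memory. This assumes the data.table package; "myPath.fst" and the column names are illustrative, and the select string syntax from the request is replaced by an ordinary data.table filter.

```r
# Sketch, not a built-in feature: fst narrows a read by column (and
# row range); the select-style condition is applied after the
# column-reduced read.
library(data.table)

dt <- data.table(date = rep(c("2017-01-01", "2017-01-02"), 5),
                 id   = rep(c("fst_a", "other"), 5),
                 col1 = 1:10)
fst::write.fst(dt, "myPath.fst")

res <- setDT(fst::read.fst("myPath.fst", columns = c("date", "id", "col1")))
res <- res[date == "2017-01-01" & id %like% "fst"]
```

Over a network this still transfers the selected columns in full, but avoids reading columns that the query never touches.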
Is it possible to append to an fst without having to load it (completely)?
To perform a binary search on the key columns of a fst file, a key index is required when using compression. If a key index is not present, we have to decompress many complete blocks (16 kB each) to use a single row from each block in our binary search, which has a high cost. Instead, if we write the first row of each compression block to a separate index in the fst file format, we can perform a binary search on the key index and only decompress a single block to get to the actual row we are searching for. This will increase the fst file size by less than 0.05 percent for doubles (1/2000), so perfectly acceptable. A binary search on key values will probably be implemented in the version after the next.
Methods fst.rbind and fst.cbind will be added to the package in the next version.
Processing character columns is by far the slowest of all data types. For character columns (that are not completely random) we can solve this problem by first converting the vector into a factor. Factors can be efficiently serialized, provided the number of levels is significantly smaller than the number of rows. Random access will suffer because we have to load all levels even for a small subset of data. This can be partly solved by reading with a streaming object that caches the levels after a first read. Subsequent reads will then be faster.
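A minimal sketch of that idea, assuming low-cardinality text: convert the character column to a factor before writing, so only the few level strings plus integer codes need to be serialized (the column name and file name here are illustrative).

```r
# Sketch: serialize a repetitive character column as a factor.
# The three level strings are stored once instead of a million times.
df <- data.frame(course = sample(c("Exeter", "Ascot", "York"), 1e6, replace = TRUE),
                 stringsAsFactors = FALSE)
df$course <- factor(df$course)   # 3 levels instead of 1e6 strings
fst::write.fst(df, "courses.fst")
df2 <- fst::read.fst("courses.fst")
```

This only pays off when the number of levels is much smaller than the number of rows, exactly as described above.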
write.fst(df, fname)  # imagine df$date is a Date column
df <- read.fst(fname) # df$date comes back as "int"
This would reduce the memory footprint of writing to a fst file. It is also possible to use a parLapply approach, where data is generated in parallel but serialized to a fst file sequentially. In the latter scenario, we have to test the speed of data transfer from the nodes to the master, because it might be too slow for practical purposes (for example on an MPI cluster on Windows).
Hi Marcus,
I've just learned about your package, and its performance on the benchmarks looks absolutely impressive!
However could you please clarify some details about the test environment:
Thanks, and keep up the good work!
No matter what I choose, the columns option doesn't seem to work.
myDT <- read.fst("myfile.fst", columns = "firstcol",as.data.table = TRUE)
It says
Error in fstRead(fileName, columns, from, to) :
Selected column not found.
But I know it exists.
Using a vector of names doesn't work either.
But if I just use it without the columns option, it works well:
myDT <- read.fst("myfile.fst" ,as.data.table = TRUE)
That would involve creating a fst file-connection object (similar to base R's file method). With that object, data can be streamed row-by-row until the file is depleted (or the connection is closed). The binary file format allows for streaming from compressed fst files as well. In addition, a connection object could also be used to stream to a fst file. The fst format needs to accommodate multiple chunks for that option, and the connection object needs a (custom-sized) buffer for maximum performance.
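Until a real connection object exists, chunked streaming can be approximated with read.fst()'s from/to arguments. In this sketch the file name, chunk size, and total row count are all assumed; in practice the row count would come from prior knowledge or a metadata call.

```r
# Chunked-read sketch using the existing from/to arguments.
path <- "big.fst"
fst::write.fst(data.frame(x = 1:1e5), path)

chunk <- 1e4L
total <- 1e5L  # assumed known row count
for (start in seq(1L, total, by = chunk)) {
  block <- fst::read.fst(path, from = start, to = min(start + chunk - 1L, total))
  # process `block` here, e.g. accumulate running sums
}
```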
Hello.
read.fst is able to read columns selected by name.
It would be great (and I think easy) to also be able to select columns by their numeric position and, more importantly, with a vector of TRUEs and FALSEs.
It's common to have such vectors when performing other operations, such as greps, or when reading from Excel files with annotated information for every column.
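In the meantime, a workaround sketch is to translate positions or a logical vector into names first. A cheap one-row read is used here just to obtain the column names; myfile.fst and the columns are illustrative.

```r
# Workaround: read.fst() accepts only names, so map positions or a
# logical vector to names before the real read.
path <- "myfile.fst"
fst::write.fst(data.frame(a = 1:3, b = 4:6, c = 7:9), path)

cols <- names(fst::read.fst(path, from = 1, to = 1))  # cheap 1-row read
by_position <- fst::read.fst(path, columns = cols[c(1, 3)])
keep <- grepl("^[ac]$", cols)                         # e.g. a grep-derived logical
by_logical  <- fst::read.fst(path, columns = cols[keep])
```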
By following the steps taken in this dockerfile on a fresh Ubuntu install.
I have a column with UTF-8 (european accented) strings in a data frame. After I do a write.fst() and retrieve the df back via read.fst() the strings become mangled. Are you supporting multiple string encodings here?
I can't save a fst file to my home directory when using ~/ instead of the full path.
library(fst)
# Save file to home directory - works
write.fst(mtcars, "/home/USER/dummy.fst")
# Save file to home directory - produces error
write.fst(mtcars, "~/dummy.fst")
Error:
Error in fstStore(path, x, as.integer(compress)) :
There was an error creating the file. Please check for a correct filename.
On the other hand, both read.fst("~/dummy.fst") and read.fst("/home/USER/dummy.fst") work fine.
Session info:
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] fst_0.7.2 memuse_3.0-2
loaded via a namespace (and not attached):
[1] clisymbols_1.1.0 tools_3.3.2 Rcpp_0.12.10 data.table_1.10.4
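Until tilde expansion is handled inside fstStore(), a workaround is to expand the path in R before passing it on:

```r
# Workaround sketch: expand "~" before handing the path to write.fst().
library(fst)
write.fst(mtcars, path.expand("~/dummy.fst"))
```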
The conversion needs very little memory, as we can use the rbind functionality of fst to append chunks from the csv file. The resulting fst file would have random row and column access and could be used to perform calculations on data sets that are too big to fit into memory.
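A rough sketch of that conversion loop, assuming a hypothetical append = TRUE argument on write.fst() standing in for the planned rbind capability (it does not exist yet); re-applying the header to chunks after the first is omitted for brevity.

```r
# Sketch only: `append = TRUE` is the *planned* rbind capability and
# is hypothetical here. Each chunk is read with fread() and appended,
# so the full table never has to fit in memory.
library(data.table)

chunk <- 1e6L
skip  <- 0L
repeat {
  part <- fread("big.csv", skip = skip, nrows = chunk)
  if (nrow(part) == 0L) break
  fst::write.fst(part, "big.fst", append = skip > 0L)  # hypothetical argument
  skip <- skip + nrow(part)
}
```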
For example: https://stackoverflow.com/questions/42379995/bit64-integers-with-fst
library(bit64)
library(data.table)
library(fst)
library(magrittr)
# Prepare example csvs
DT64_orig <- data.table(x = (c(2345612345679, 1234567890, 8714567890)))
fwrite(DT64_orig, "DT64_orig.csv")
# Read and move to fst
DT64 <- fread("DT64_orig.csv")
write.fst(DT64, "DT64_fst.fst")
DT_fst2 <-
read.fst("DT64_fst.fst") %>%
setDT
# bit64 integers not preserved:
identical(DT_fst2, DT64)
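A heavily hedged workaround sketch: if, and only if, the 8-byte payload of the integer64 column survives the round trip unchanged, re-attaching the class restores the bit64 interpretation. Verify with identical() against the original before relying on this; if fst converted the values numerically instead, it will produce garbage.

```r
# Hedged workaround: re-attach the "integer64" class after reading.
# Only valid if the raw 8-byte payload was preserved by the round
# trip -- verify against the original data first.
library(bit64)
library(data.table)
DT_fst2 <- setDT(fst::read.fst("DT64_fst.fst"))
oldClass(DT_fst2$x) <- "integer64"
```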
With this feature you can populate, say, rows 1001:2000 of a 1e6-row data.table with a 1000-row read from read.fst. All this is done in memory. This feature is very useful for combining data from multiple (fst) sources into a single result table without the overhead of copies. For example, when performing the merge sort algorithm on a set of data files, you need to:
This can be performed efficiently in R by using data.table's fast sorting and populating the result table in memory. With such an algorithm operating on a collection of fst files, we basically have a method of sorting arbitrarily large fst files without running out of memory (and it can be done with multiple threads!).
Is it possible to support tilde expansion in read.fst and write.fst? I find that write.fst doesn't support it at all, but read.fst does sporadically. I haven't been able to reproduce an instance of it not working.
library(fst)
stopifnot(dir.exists("~/sandbox"))
z <- as.data.frame(x = 1)
write.fst(z, "~/sandbox/z.fst")
Error in fstStore(path, x, as.integer(compress)) :
There was an error creating the file. Please check for a correct filename.
write.fst(z, file.path(path.expand("~"), "sandbox", "z.fst")) # Works
zz <- read.fst("~/sandbox/z.fst") # Works
When trying to save a data.frame with 0 rows, the fst R package throws an error with the message:
Error in fstStore(normalizePath(path, mustWork = FALSE), x, as.integer(compress)) :
The dataset contains no data.
And the resulting .fst file is malformed; read.fst() reports that the file format is not recognized.
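Until empty tables are handled (or rejected cleanly before the file is created), a small guard avoids leaving a malformed file behind; safe_write_fst is just an illustrative wrapper name.

```r
# Defensive wrapper sketch: refuse to write zero-row frames so no
# malformed .fst file is created on disk.
safe_write_fst <- function(x, path, ...) {
  if (nrow(x) == 0L) stop("refusing to write a zero-row data frame to ", path)
  fst::write.fst(x, path, ...)
}
```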
3.7M observations of 257 variables. The data wrote to fst format without error, but when trying to read the file back I got "Error in fstRead(fileName, columns, from, to) :
embedded nul in string:"
base load/save and feather both handle reads and writes of this data without issue.
Here is a link to the Stata version of the data set available via the web: https://pl-garlock.s3.amazonaws.com/05-Databases/GST-8002/Garlock%20Analytical%20Database/exposure%20data%20-%20all%20disease-REDACTED.dta
Hi, nice package!
It would be a great competitor to the feather package if it were compatible with Python pandas dataframes.
Any plan to make it available in python?
Cheers,
Benoit
PS: my own benchmarks
> r_bench <- microbenchmark(
+ read_f = {dt1 <- read_feather(path = filename)},
+ read_dt = {dt1 <- fread(file = gsub(".feather", ".csv", filename), showProgress = FALSE)},
+ read_fst = {dt2 <- read.fst(path = gsub(".feather", ".fst", filename))},
+ read_fstc = {dt2 <- read.fst(path = gsub(".feather", ".fstc", filename))},
+ read_rds = {dt2 <- readRDS(file = gsub(".feather", ".rds", filename))},
+ read_rdsc = {dt2 <- readRDS(file = gsub(".feather", ".rdsc", filename))},
+ times = 3)
>
> print(r_bench)
Unit: milliseconds
expr min lq mean median uq max neval
read_f 73.49535 74.38310 74.80852 75.27085 75.46511 75.65938 3
read_dt 409.07989 410.28315 411.33413 411.48641 412.46125 413.43609 3
read_fst 67.21488 69.68649 74.13367 72.15810 77.59306 83.02803 3
read_fstc 113.58359 113.87905 114.01423 114.17451 114.22955 114.28458 3
read_rds 363.55270 366.95543 370.44090 370.35816 373.88500 377.41183 3
read_rdsc 571.20738 571.27464 575.87312 571.34189 578.20598 585.07008 3
> w_bench <- microbenchmark(
+ write_f = {write_feather(x = dt, path = filename)},
+ write_dt = {fwrite(dt, file = gsub(".feather", ".csv", filename))},
+ write_fst = {write.fst(x = dt, path = gsub(".feather", ".fst", filename))},
+ write_fstc = {write.fst(x = dt, path = gsub(".feather", ".fstc", filename),compress = 100)},
+ write_rds = {saveRDS(object = dt, file = gsub(".feather", ".rds", filename),compress = FALSE)},
+ write_rdsc = {saveRDS(object = dt, file = gsub(".feather", ".rdsc", filename),compress = TRUE)},
+ times = 3)
>
> print(w_bench)
Unit: milliseconds
expr min lq mean median uq max neval
write_f 77.57399 81.01968 84.72863 84.46536 88.30596 92.14655 3
write_dt 65.89461 69.54576 538.90557 73.19692 775.41105 1477.62517 3
write_fst 73.60318 75.90385 626.80981 78.20452 903.41312 1728.62172 3
write_fstc 202.33712 211.38273 220.21007 220.42834 229.14654 237.86473 3
write_rds 329.07046 3128.41469 4061.86755 5927.75891 5928.26610 5928.77328 3
write_rdsc 2436.99475 2443.04194 2447.12685 2449.08913 2452.19291 2455.29668 3
And provide fast compression with random access to the matrix. Check if there is a use-case for such a feature.
From Jean-Luc by email: when trying to read a file that was not saved with fst, I got a message that is not self-explanatory. Would it be possible to have something more explicit (like 'Wrong file type')? I suppose this suggests putting a signature in the header of the file. It is possibly too late now to change the structure.
Error in fstRead(fileName, columns, from, to) : std::bad_array_new_length
Hi! Thank you for great package!
I have a problem: when I try to install it on Ubuntu 14.04 on Microsoft R Open 3.3.1, from GitHub or CRAN, it fails (recipe for target 'FastStore.o' failed).
It appears that if an object is a data.table, saving to and restoring from fst loses the object class, reverting to data.frame.
fst version 0.7.2
library("pryr")
library("fst")
library("data.table")
pryr::object_size(fbHistoricResultsSQL)
437 MB
str(fbHistoricResultsSQL)
Classes ‘data.table’ and 'data.frame': 962306 obs. of 64 variables:
$ DATE : Date, format: "2009-01-01" "2009-01-01" "2009-01-01" "2009-01-01" ...
$ TIME : chr "1605" "1605" "1605" "1605" ...
$ FBRr : int 1 1 1 1 1 1 1 1 1 1 ...
$ FBR : int 9 9 9 9 9 9 9 9 9 9 ...
$ FBR Vodds : num 15.2 15.2 15.2 15.2 15.2 ...
$ FBR V% : num 13.99 3.74 5.59 -0.43 -0.79 ...
$ RSF : num 0.85 0.85 0.85 0.85 0.85 0.85 0.85 0.85 0.85 0.85 ...
$ POW : num 1 7.4 3 10.2 10 8.6 10.2 9.8 9.4 8 ...
$ POWr : int 12 9 11 2 4 7 2 5 6 8 ...
$ POW Vodds : num 25.5 16.1 22.1 13.1 13.3 ...
$ POW V% : num 7.91 3.49 3.53 -0.34 -0.76 0.77 0.41 0.93 -0.09 1.86 ...
$ RSP : num 0.49 0.8 0.57 0.99 0.98 0.88 0.99 0.96 0.93 0.84 ...
$ PACEr : int NA NA NA NA NA NA NA NA NA NA ...
$ PACE : num 0 0 0 0 0 0 0 0 0 0 ...
$ PACE% : num NA NA NA NA NA NA NA NA NA NA ...
$ PACELTO : num 0 0 0 0 0 0 0 0 0 0 ...
$ PACELTO% : num NA NA NA NA NA NA NA NA NA NA ...
$ PACEIMP : num NA NA NA NA NA NA NA NA NA NA ...
$ PACEIMPr : int NA NA NA NA NA NA NA NA NA NA ...
$ DSLR : num NA NA NA NA NA NA NA NA NA NA ...
$ COMr : int 1 1 1 1 1 1 1 1 1 1 ...
$ COM : int 0 0 0 0 0 0 0 0 0 0 ...
$ SKFr : int 7 7 7 1 1 7 1 7 1 7 ...
$ TRr : int 1 1 1 1 1 1 1 1 1 1 ...
$ JRr : int 1 1 1 1 1 1 1 1 1 1 ...
$ SRr : int 1 1 1 1 1 1 1 1 1 1 ...
$ bfwinr : int 12 10 11 3 1 7 6 7 4 9 ...
$ DRr : int 1 1 1 1 1 1 1 1 1 1 ...
$ POS : chr "5" "7" "12" "4" ...
$ SP : num 50 33 50 6.5 2 20 12 20 10 25 ...
$ BFOddsWin : num 227.41 71.97 100 8.6 3.15 ...
$ BFOddsPlace: num 32 14 18.5 2.7 1.51 5.4 4.6 5.56 3.35 8.86 ...
$ IPMIN : num 10 16.5 50 5 3.05 13 6 1.01 7.6 25 ...
$ IPMAX : num 570 120 1000 44 1000 1000 22 32 90 110 ...
$ AMWAP : num 1 21.26 1 9.64 3.93 ...
$ RTA : chr "NHFT" "NHFT" "NHFT" "NHFT" ...
$ rta2 : chr "Maiden" "Maiden" "Maiden" "Maiden" ...
$ TYPE : chr "4YO to 6YO" "4YO to 6YO" "4YO to 6YO" "4YO to 6YO" ...
$ TRAINER : chr "Karen George" "George Baker" "S C Burrough" "N J Hawke" ...
$ JOCKEY : chr "E Dehdashti" "A Tinkler" "R Greene" "Christian Williams" ...
$ COURSE : chr "Exeter" "Exeter" "Exeter" "Exeter" ...
$ DISTANCE : num 17 17 17 17 17 17 17 17 17 17 ...
$ GOING : chr "Good" "Good" "Good" "Good" ...
$ DRAW : chr "" "" "" "" ...
$ HCAP : chr "No" "No" "No" "No" ...
$ STAT : chr "" "D1" "" "" ...
$ HEADGEAR : chr "" "" "" "" ...
$ CLASS : chr "5" "5" "5" "5" ...
$ RUNNERS : num 13 13 13 13 13 13 13 13 13 13 ...
$ SIRE : chr "Karinga Bay" "Midnight Legend" "King O' The Mana" "Trempolino" ...
$ DAMSIRE : chr "Infantry" "Timeless Times" "Shahrastani" "Cadoudal" ...
$ CLMove : num NA NA NA NA NA NA NA NA NA NA ...
$ DIMove : num NA NA NA NA NA NA NA NA NA NA ...
$ WEMove : num NA NA NA NA NA NA NA NA NA NA ...
$ WEIGHT : chr "140" "147" "140" "147" ...
$ SEX : chr "m" "g" "m" "g" ...
$ SKF : int 1 1 1 3 3 1 3 1 3 1 ...
$ prob : num NA NA NA NA NA NA NA NA NA NA ...
$ ability : int NA NA NA NA NA NA NA NA NA NA ...
$ cons : int 0 0 0 0 0 0 0 0 0 0 ...
$ HORSE : chr "Prickles" "Shut The Bar" "Jackies Dream" "Maggio" ...
$ FBRVPERC : num 1399 374 559 -43 -79 ...
$ PACEPERC : num NA NA NA NA NA NA NA NA NA NA ...
$ Year : chr "2009" "2009" "2009" "2009" ...
- attr(*, ".internal.selfref")=<externalptr>
saveRDS(fbHistoricResultsSQL, "fbHistoricResultsSQL.Rds")
fst::write.fst(fbHistoricResultsSQL, "fbHistoricResultsSQL_compressed.fst", compress = 100)
Rds_in <- readRDS("fbHistoricResultsSQL.Rds")
fst_compressed_in <- fst::read.fst("fbHistoricResultsSQL_compressed.fst")
str(fst_compressed_in)
'data.frame': 962306 obs. of 64 variables:
$ DATE : num 14245 14245 14245 14245 14245 ...
 (remaining columns identical to the str() output above)
str(Rds_in)
Classes ‘data.table’ and 'data.frame': 962306 obs. of 64 variables:
$ DATE : Date, format: "2009-01-01" "2009-01-01" "2009-01-01" "2009-01-01" ...
 (remaining columns identical to the str() output above)
Data integrity is pretty important in the organization I work for.
Accountants ask for hash values of datasets when they perform audits. It must be certain that the right dataset is used and that it has not changed.
Is it possible to calculate a hash value and save it as metadata (in the same file) when you write the data to disk? When you read the file, it would be nice if the hash value were calculated on the fly and compared with the hash value stored as metadata. That way, you are always certain that you have the same data.
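Until hashes live inside the fst format itself, the check can be approximated externally with the digest package. write_audited/read_audited are illustrative names, and the comparison assumes the round trip preserves the object (including attributes) exactly; otherwise the check fails even on unchanged data.

```r
# External integrity-check sketch using the digest package.
library(digest)

write_audited <- function(x, path) {
  fst::write.fst(x, path)
  writeLines(digest(x, algo = "sha256"), paste0(path, ".sha256"))
}

read_audited <- function(path) {
  x <- fst::read.fst(path)
  stored <- readLines(paste0(path, ".sha256"))
  # Fails if the data (or any attribute) changed since it was written.
  stopifnot(identical(stored, digest(x, algo = "sha256")))
  x
}
```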
And add a section on benchmarking on https://fstpackage.github.io
When a sorted data set is stored as a fst binary file, sorting metadata is stored alongside the data. Using this metadata, a binary search can be performed on the key columns before actually reading the data. For example, only 32 random seeks are needed in the binary file to search 4 billion rows for the begin and end values of a selected range. The performance penalty will be very small (seeking with modern SSDs is very fast).
When saving to a network drive without compression I get about 50 Mbps, and with compression at 100 I get 8 Mbps. Saving to a local drive, however, is much faster, so the CPU doesn't seem to be the bottleneck.
I don't know if this has been requested already, but it would be helpful to provide functions that give back some basic info about the data frame in a saved fst file without reading the whole file. For example:
Thanks.
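For reference, later releases of the package expose exactly this kind of inspection via fst::metadata_fst() (fst.metadata() in older versions), which reads only the file header. A hedged sketch, guarded so it is a no-op when fst is not installed:

```r
# Sketch: inspect a .fst file (row count, column names and types) without
# loading its contents. Assumes a reasonably recent fst package; the file
# name is illustrative.
if (requireNamespace("fst", quietly = TRUE)) {
  path <- tempfile(fileext = ".fst")
  fst::write_fst(iris, path)
  meta <- fst::metadata_fst(path)   # header-only read
  print(meta)
}
```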
fst is not preserving Date columns; I hope that is an easy fix.
> u1=data.frame(DT=Sys.Date()-1:10,variable=rep(c("A","B"),5), value=1:10)
> write.fst(u1,"c:\\t\\u1.fst")
> u1
DT variable value
1 2017-02-05 A 1
2 2017-02-04 B 2
3 2017-02-03 A 3
4 2017-02-02 B 4
5 2017-02-01 A 5
6 2017-01-31 B 6
7 2017-01-30 A 7
8 2017-01-29 B 8
9 2017-01-28 A 9
10 2017-01-27 B 10
> u2=read.fst("c:\\t\\u1.fst")
> u2
DT variable value
1 17202 A 1
2 17201 B 2
3 17200 A 3
4 17199 B 4
5 17198 A 5
6 17197 B 6
7 17196 A 7
8 17195 B 8
9 17194 A 9
10 17193 B 10
>
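Until the Date class round-trips, a workaround is to restore it after reading: the integers that come back are day counts since the Unix epoch (1970-01-01), as the values above show. A minimal sketch using the numbers from the example:

```r
# The integers returned after the round trip are days since 1970-01-01,
# so the Date class can be restored manually. Values taken from the
# example output above.
dt_raw   <- c(17202L, 17201L, 17200L)
dt_fixed <- as.Date(dt_raw, origin = "1970-01-01")
dt_fixed   # "2017-02-05" "2017-02-04" "2017-02-03"
```

Applied to the example, `u2$DT <- as.Date(u2$DT, origin = "1970-01-01")` recovers the original column.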
I am having problems saving and reading very large files with the fst R package.
I used both the CRAN version and the development version.
The problem occurs intermittently. Every few tries at saving will get me a successful save. Most of the reads so far have failed.
However, if I read a subset of the file, the reads are mostly successful.
The data is a large data.table.
I don't know how to provide more info.
The following is the error from read.fst():
system.time({a=read.fst('/dev/shm/AllHorizonDT_00.fst',as.data.table=T)})
*** caught segfault ***
address 0x7f531143e038, cause 'memory not mapped'
*** caught segfault ***
address 0x7f71c25db020, cause 'memory not mapped'
Traceback:
1: .Call("fst_fstRead", PACKAGE = "fst", fileName, columnSelection, startRow, endRow)
2: fstRead(fileName, columns, from, to)
3: read.fst("/dev/shm/AllHorizonDT_00.fst", as.data.table = T)
4: system.time({ a = read.fst("/dev/shm/AllHorizonDT_00.fst", as.data.table = T)})
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: *** caught segfault ***
Selection: address 0x7f531143e038, cause 'memory not mapped'
This is the error from write.fst():
Saving All data for all horizons ...
*** caught segfault ***
address 0x7f201085103a, cause 'memory not mapped'
Traceback:
1: .Call("fst_fstStore", PACKAGE = "fst", fileName, table, compression)
2: fstStore(normalizePath(path, mustWork = FALSE), x, as.integer(compress))
3: write.fst(AllHorizonDT, path = filename, compress = fst_compress_level)
4: save_horizon_data(AllHorizonDT, formatsub = "Formatted/", maturedsub = "matured/", agesub = "Ages/", horizsub = "AllHorizon", savefst = T)
5: eval(expr, envir, enclos)
6: eval(ei, envir)
7: withVisible(eval(ei, envir))
8: source("Product_DataPrep.R")
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Neither is very informative. It basically looks like some kind of memory problem.
Let me know what I can do to help debugging this.
Thanks.
After initializing a fst object, the underlying fst file can be accessed in the same manner as a data.frame.
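The interface described above can be sketched as follows, assuming a package version that provides the fst() constructor (the file name is illustrative; only the selected rows and columns are read from disk):

```r
# Sketch: data.frame-like random access through a fst object.
# Guarded so it is a no-op when the fst package is not installed.
if (requireNamespace("fst", quietly = TRUE)) {
  path <- tempfile(fileext = ".fst")
  fst::write_fst(iris, path)
  ft <- fst::fst(path)        # creates the object; no data is read yet
  part <- ft[1:10, "Species"] # reads only rows 1-10 of one column
}
```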
This is standard with utils::write.csv and readr::write_csv.
Great, fast package here. Thank you!
Excellent package. I read that it supports data.tables. Would it be possible to also add support for reading fst files as tibbles?
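In the meantime, since read.fst() returns a plain data.frame by default, a tibble is one conversion away. A hedged sketch, assuming the fst and tibble packages are installed (file name illustrative):

```r
# Sketch: read a .fst file and wrap the result as a tibble.
# Guarded so it is a no-op when either package is missing.
if (requireNamespace("fst", quietly = TRUE) &&
    requireNamespace("tibble", quietly = TRUE)) {
  path <- tempfile(fileext = ".fst")
  fst::write_fst(mtcars, path)
  tbl <- tibble::as_tibble(fst::read_fst(path))
}
```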