Comments (24)
I did some further benchmarking on the packages data.table, fst and feather and on the base RDS methods. As expected, there are large differences in serialization performance across the various column types, as can be seen in the figure below. In that figure, you can see the tested column types (horizontal) versus the mode (read/write) for all 4 serializers. The performance of the multi-threaded fwrite really stands out for character columns, very impressive! The second figure shows the same data with the individual measurements and a different axis (to get some idea of the performance stability). By the way, these benchmarks were not computed on the Xeon / RevoDrive (the benchmarks on the fstpackage.org site), but on my very modest laptop (4/8 cores, i7-4710HQ @ 2.5GHz) with a Samsung EVO 850 SSD.
Additionally:
- The read and write performance of logical columns is very high for the fst package because of an effective bit-packing algorithm. The actual file size for logical columns is a factor 16 smaller than that of the rds file (1 logical is packed into 2 bits instead of the 32 bits that R uses). This explains the 'larger than drive speed' performance for logicals.
- Serialization of character columns is a difficult task for all serializers except for fwrite. Getting strings in and out of R's global string pool takes a lot of CPU power, and the multi-threaded fwrite really has a large advantage here (the task is CPU bound).
- The benchmark has 2470 observations (about 1 hour of computational time). To be sure all disk and RAM caching is excluded, it would be better to generate a unique data set for each observation. That would make the benchmark take more time, however.
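The bit-packing arithmetic behind the 'larger than drive speed' effect can be sketched in a few lines of base R. The 32-bit and 2-bit figures come from the description above; the drive speed is a made-up illustration, not a measured value:

```r
# R stores one logical in 32 bits; the packed format uses 2 bits per value
bits_per_logical_r      <- 32
bits_per_logical_packed <- 2

compression_factor <- bits_per_logical_r / bits_per_logical_packed
compression_factor        # 16, matching the reported file-size factor

# with a hypothetical 500 MB/s drive, the *effective* read speed for
# logical columns (measured against the in-memory size) becomes:
drive_speed_mbs <- 500
drive_speed_mbs * compression_factor   # 8000 MB/s, 'larger than drive speed'
```

Because the reported speed is computed against the in-memory size, any on-disk compression multiplies the apparent throughput by the same factor.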
- In general, the total measured speed for serializing a data set to disk can be estimated by:
# estimating total speed from column speeds
speed_tot <- 1 / ((1 / speed_col1) + (1 / speed_col2) + ...)
so basically taking the inverse of the sum of the inverted speeds for each column type used (the effect of average string length is not included here). These results show that when comparing performance between various solutions, the chosen data set is critically important, and it would be very nice to have a set of type-specific benchmark data sets which can be used as a baseline!
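As a worked example of that estimate (the per-column speeds below are hypothetical, not measured values):

```r
# combined serialization speed from per-column speeds (hypothetical, in MB/s)
speed_col1 <- 2000   # e.g. a fast integer column
speed_col2 <- 250    # e.g. a slow character column

speed_tot <- 1 / ((1 / speed_col1) + (1 / speed_col2))
round(speed_tot, 1)  # 222.2 MB/s: the slowest column type dominates the total
```

This also shows why a character-heavy data set can drag the total far below what the fast column types (or the drive) are capable of.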
I've posted the script for this uncompressed benchmark here (it's a bit raw, apologies for that :-))
from fst.
@MarcusKlik and @mattdowle, I really like both of your packages. Hopefully you guys will work closely together going forward.
I am really looking forward to having very good serialization solutions in R for very big data sets. One of the features I mentioned to @MarcusKlik is the ability to work on a very large data.table off- and in-memory on demand, depending on the available resources. That would be really helpful.
Thanks.
Hi @mattdowle, thanks for the tip on measuring the real drive speed, that's a great tool by the way. But the RevoDrive is in a server at my company which is stuck with Windows, so it will take some effort to get that measurement. The first thing that comes to mind on the benchmark results is that I am using the in-memory size (from R) as the base size for the speed calculations. That size is equal to the uncompressed rds file. But the fst files are usually smaller than that because some compression is done even with compress = 0 (for example bit-packing for logicals). So the reported speed is the in-memory size divided by the time for a read or write. That can be higher than the maximum drive speed if 'base compression' already brings the file size down. I see, for example, that the uncompressed fst file is only 66 percent of the uncompressed rds file (that would more or less explain the difference).
I will take a closer look at the exact performance measurements and will get back on that, thanks!
Hi @mattdowle, @st-pasha and @arunsrinivasan,
the next release of fst is due in a few days and I've prepared a blog to post at the same time. In the blog I explore the performance of write_fst/read_fst and compare it against fread/fwrite, saveRDS/readRDS and write_feather/read_feather. I'm also looking at the multi-threaded enhancements (for data.table and fst), compression (fst) and the effects of file size (all). If you guys would like to glance over the results to see if you recognize the results for data.table, please let me know so I can send you the blog-preview link!
Speaking of benchmarks.... It would be interesting to see MonetDBLite included:
https://www.monetdb.org/blog/monetdblite-r
https://github.com/hannesmuehleisen/MonetDBLite
Hi @st-pasha, thanks for the praise, I posted the benchmark code here, your evaluation is much appreciated! The code includes the figures published on the fstpackage.org website. A few remarks on the benchmark:
- I use the microbenchmark package to measure performance but use only a single iteration per benchmark. I found that using more iterations resulted in unrealistically high performance measurements due to caching by the SSD disk used (which is very effective).
- The published figures refer to the benchmark script run on a Xeon E5 CPU @ 2.5GHz. It has a lot of cores, but only a single core is used by the fst package currently (that will change, however, when multi-threading is implemented, see also #48).
- I will update the results for fread and fwrite in the near future, now that your colleague @mattdowle and @arunsrinivasan have implemented multi-threading in the data.table package :-)
- New and more extensive benchmarks are planned in the near future. Those benchmarks will include separate performance measurements for each specific type of column. Performance is greatly dependent on the column type (up to an order of magnitude). For example, character vectors in R are implemented in a very computationally expensive manner and show poor performance (although there are some ideas to circumvent that).
- The data set is randomly generated and the code is included in the gist.
- The SSD drive used was an OCZ RevoDrive 350.
- In general I found that Xeon processors show very good performance on the blocked format of fst (probably due to effective branch prediction).
If you have anything to share on your evaluations later on, please do!
@phillc73, great tip! I noticed MonetDB being mentioned in a data.table issue on serialization as well. I will make sure it is included in the new benchmarks, thanks!
Hi Mark, thanks for all the info. I did a talk yesterday and while preparing for it I discovered the command hdparm -t /dev/sda to measure the true sustained read speed of the device. Looking up your OCZ RevoDrive 350, the manufacturer's stated max read appears to be 1800 MB/s (1.8 GB/s), which fits with it being MLC (somewhere between SSD and NVMe), iiuc. Does hdparm -t /dev/sda agree? However, the speed of read.fst on the first row of the benchmark table is stated as 3271.9 MB/s. That's much higher than the device is capable of (1800 MB/s). Could it be that that timing was reading from RAM cache, not the device? Or maybe we have our B's and b's mixed up! I can't see use of sudo sh -c 'echo 3 >/proc/sys/vm/drop_caches' or similar in the benchmark code. I can see that it uses one iteration so the intent is there, but if caches haven't been dropped the file is likely in RAM cache from a previous run. Thanks, Matt
Wow - this is awesome!
Hello again. Do you use Windows on your laptop like the server? The reason I ask is that there has been a problem reported on Windows with fread in dev where either the parallel threads aren't kicking in, or they are but badly and the performance is worse than single-threaded. Did it look as though fread was using all the laptop's cores efficiently when it ran?
Hi, the laptop has Windows 10, and the threads are kicking in during the fwrite, but I can only see a single core working with fread, so indeed they are not kicking in for fread it seems!
Relief! Thanks for the info. I've borrowed a Windows 8.1 machine and managed to reproduce something similar. Investigating.
Aside: did you compile data.table dev 1.10.5 yourself with Rtools (mingw) or download the AppVeyor .zip? There was another Windows issue (unrelated, I think) that needed the latest rtools.exe 3.4 (which AppVeyor uses), just to double check.
Found and fixed the problem. I confirmed it's fixed for me on Windows 8.1. Please try again.
data.table 1.10.5 IN DEVELOPMENT built 2017-04-15 11:11:18 UTC; appveyor
install.packages("https://ci.appveyor.com/api/buildjobs/1txx94ruv769jc42/artifacts/data.table_1.10.5.zip", repos=NULL)
Hi @mattdowle, multithreaded fread almost works fine now on Windows :-). I still only saw a single thread at work in the benchmark with fread. I tried the appveyor precompiled build and I also built data.table from your latest commit. Then I reinstalled RTools 3.4 and tried again. Also, manually setting nThread = 8 had no effect on CPU load.
Then I checked multiple-column csv files and suddenly the threads kicked in. Working on the 1e6 dataset from ?fread:
fwrite(DT[, .(a)], "singlecol.csv")
fwrite(DT[, .(a, b)], "dualcol.csv")
> microbenchmark( fread("singlecol.csv"), times = 1)
Unit: milliseconds
expr min lq mean median uq max neval
fread("singlecol.csv") 52.72078 52.72078 52.72078 52.72078 52.72078 52.72078 1
> microbenchmark( fread("dualcol.csv"), times = 1)
Unit: milliseconds
expr min lq mean median uq max neval
fread("dualcol.csv") 18.77348 18.77348 18.77348 18.77348 18.77348 18.77348 1
So, the single-column file, while exactly half the size, took 3 times longer to load. Apparently, the threads only kick in with more than one column (and my benchmark was using single-column files :-))
I'm guessing the data.table benchmark speeds will go up by a factor of 6 for all investigated column types on a 4/8 core machine. When the single-column boundary case is fixed, I'll make sure the benchmark is updated and repeated on a machine with more cores!
Excellent - thanks for working this out!! Yes there's no reason it shouldn't go MT on a single column, so hopefully a simple logic bug to fix somewhere. It also doesn't do the progress meter when ST - same area to fix. Will do ...
Thanks, impressive work on the multi-threading! And interesting to see that OpenMP works very well now within the R tool chain. I was thinking TinyThreads++ for a more fine-grained solution for fst, but I think OpenMP might do the job just fine for fst as well.
Hi @wei-wu-nyc, thanks. First, in the interface milestone (#48) I have planned a 'simple' data.table interface. From the advanced operations milestone on, the plan is to implement a data.table interface to fst files to be able to effectively group, sort, merge, append (columns and rows) and select (all operations on-disk, requiring very little memory). These operations can also be performed on a group of fst files without any difference to the interface (fst has a blocked format anyway). For the planned parallel 'merge sort' algorithm, the idea is to sort the individual chunks using data.table's very fast sorting algorithm. So you see, the fst package will depend heavily on data.table (not so much the other way around, I'm sure :-)). Oh, and I would like to be able to convert a csv file directly to a fst file without memory overhead, so the code behind the new multi-threaded fread will be a key component for that.
It would be interesting to add a few sample use-cases for working with large data sets down the road. Common tasks such as:
- I have 50 csv files of 10 GB each, how can I calculate some method for each group in the data?
- How do I sort such a large collection of files?
- I have a 100 GB (fst) file, how can I calculate some statistics on that?
- I only need to select a single year from my data, but I do not have enough memory to read the csv files, what to do?
- I want to do calculations from R on data at the lowest level from my company's large database, but the whole data set doesn't fit into memory, how can I stream to a fst file and perform my calculations from there on?
Some of the Kaggle competitions have (open) data which would be very suitable for these use-cases, and they would represent real-life problems, so it would be interesting to explore the use of fst in solving some of these 'large-data' problems. When fst has a more mature interface, I could set up the wiki to collect some of these tasks (or let users add them).
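As a rough sketch of the 'select only what you need' use-case: fst's read interface already allows column and row-range selection without loading the whole file. The file name and data below are made up for illustration, and this assumes the fst and data.table packages are installed:

```r
library(fst)
library(data.table)

# write a sample table once (made-up data, just for illustration)
DT <- data.table(year = rep(2010:2017, each = 1000), value = runif(8000))
write_fst(DT, "sample.fst")

# read a single column and a row range; only part of the file is touched
part <- read_fst("sample.fst", columns = "value", from = 1001, to = 2000,
                 as.data.table = TRUE)
```

Selecting a specific year this way still requires knowing its row range; an on-disk filter is exactly what the planned interface above would add.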
Ok ... single column input now goes multi-threaded and a few other problems fixed too.
Please try again with latest dev. Fingers crossed!
Hi @mattdowle, just a quick benchmark to confirm, all looks fine now! I will test with more observations and larger files soon. Thanks for the quick fix!
(vertical axis is in MB/s, taking the object.size() of the column as base size)
Only using 4 cores here, but the performance is already very impressive. And I expect much more of a boost in performance for data.table when we use an order of magnitude more cores!
And with the single/dual column test:
nrOfRows = 2e8
DT <- data.table(a = 1:nrOfRows, b = 1:nrOfRows)
fwrite(DT[, .(a)], "singlecol.csv")
fwrite(DT[, .(a, b)], "dualcol.csv")
we get:
> microbenchmark( fread("singlecol.csv"), times = 1)
Read 200000000 rows x 1 columns from 1.945 GB file in 00:02.681 wall clock time (can be slowed down by any other open apps even if seemingly idle)
Unit: seconds
expr min lq mean median uq max neval
fread("singlecol.csv") 2.847271 2.847271 2.847271 2.847271 2.847271 2.847271 1
> microbenchmark( fread("dualcol.csv"), times = 1)
Read 200000000 rows x 2 columns from 3.705 GB file in 00:06.439 wall clock time (can be slowed down by any other open apps even if seemingly idle)
Unit: seconds
expr min lq mean median uq max neval
fread("dualcol.csv") 7.027425 7.027425 7.027425 7.027425 7.027425 7.027425 1
for a 200 million row data set with integers.
Great, impressive results!
@MarcusKlik Great -- yes please. Do you have a blog-preview link?