Comments (3)
Hi @statquant, thanks for the feature request! Your request is related to issues #16 and #30. As you say, for sorted tables we can implement a binary search to retrieve a range of rows for a specified key range. A binary search is very fast: with only about 30 seek operations on the fst file, you can locate a key among a billion records (2^30 is roughly 1.07 billion). For selections which are not related to a stored key, we could use the selection mechanism from data.table, but on chunks of data instead of the whole table. The problem, however, is that you can't use aggregate statements for selection in that case, for example:
library(data.table)
dt <- data.table(X = 1:10, Y = 10:1)
dt[X < mean(Y)]
   X  Y
1: 1 10
2: 2  9
3: 3  8
4: 4  7
5: 5  6
This works for a complete table, but it won't work when the data is chunked into multiple subsets (in that case the mean is calculated per chunk rather than over the whole table). So that is a problem. Possible solutions might be:
- A selection requires the specification of a grouping variable. So the selection is done per group. If the groups are small enough, there will be no problems for large data sets.
- No aggregate selections are allowed, only simple operators. The advantage of this solution is that we can program these simple operators in C++, increasing performance.
- A more elaborate framework where we allow custom methods as operators on the data. These methods should have a map-reduce-like character. For the example above:
# Two chunks
dt1 <- data.table(X = sample(1:20, 10), Y = sample(1:20, 10))
dt2 <- data.table(X = sample(1:20, 10), Y = sample(1:20, 10))
# Calculate sums and counts
r1 <- dt1[, .(Sum = sum(Y), Count = .N)]
r2 <- dt2[, .(Sum = sum(Y), Count = .N)]
# Combine results and calculate mean
rTot <- rbindlist(list(r1, r2))
rTot[, sum(Sum) / sum(Count)]
[1] 10.35
So we calculated a mean by using sum and count per chunk. The fst package could provide methods like fst.sum, fst.mean, etc. to perform these operations.
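To make the chunking problem and the map-reduce idea concrete, here is a minimal sketch of the selection X < mean(Y) done in two passes over chunks. It only uses data.table; the chunk split and the variable names (chunks, partials, global_mean) are assumptions for illustration and not part of fst.

```r
library(data.table)

# Full table, split unevenly into two chunks (rows 1-3 and rows 4-10)
dt <- data.table(X = 1:10, Y = 10:1)
chunks <- list(dt[1:3], dt[4:10])

# Pass 1 (map): partial sum and count of Y per chunk
partials <- rbindlist(lapply(chunks, function(ch) ch[, .(Sum = sum(Y), Count = .N)]))

# Reduce: combine the partial results into the mean over all rows
global_mean <- partials[, sum(Sum) / sum(Count)]

# Pass 2: the threshold is now a constant, so each chunk can be filtered on its own
chunked <- rbindlist(lapply(chunks, function(ch) ch[X < global_mean]))

# For comparison: the naive version, where every chunk uses its own mean(Y)
naive <- rbindlist(lapply(chunks, function(ch) ch[X < mean(Y)]))

print(chunked)      # same rows as dt[X < mean(Y)] on the complete table
print(nrow(naive))  # fewer rows, because the per-chunk means differ
```

The price of this approach is that the data is read twice: once to compute the aggregate and once to apply the selection.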
For your use-case I think that option 2 is probably enough?
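As a rough sketch of option 2: a selection that uses only simple row-wise operators gives the same result whether it is applied to the whole table or chunk by chunk. The helper filter_chunked and its chunk_size argument are hypothetical names for illustration; here the chunks are taken from an in-memory data.table rather than read from an fst file.

```r
library(data.table)

dt <- data.table(X = 1:10, Y = 10:1)

# Hypothetical helper (not part of fst): apply a row-wise filter chunk by chunk.
# `keep` is a function that takes a chunk and returns one logical value per row.
filter_chunked <- function(dt, keep, chunk_size) {
  starts <- seq(1L, nrow(dt), by = chunk_size)
  parts <- lapply(starts, function(from) {
    to <- min(from + chunk_size - 1L, nrow(dt))
    chunk <- dt[from:to]
    rows <- which(keep(chunk))  # row numbers within this chunk that pass the filter
    chunk[rows]
  })
  rbindlist(parts)
}

# Simple operators only: no aggregate over the whole column is involved
res <- filter_chunked(dt, function(ch) ch$X > 3 & ch$Y > 2, chunk_size = 4L)
print(res)
```

Because each row is judged on its own values, the chunk boundaries have no effect on which rows are selected.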
from fst.
@MarcusKlik thanks for the prompt reply; indeed, option 2 is enough for me.
Honestly, I think it would be enough for most people: when you want to aggregate in some sense, I'd guess you would still want the whole data to check what you've done, to change what you've done, etc.
Nice, I will make sure that your feature is on the list for one of the next versions of fst.
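For reference, the key-range lookup on a sorted table discussed at the top of this thread could be sketched in plain R as follows. This is only an illustration on an in-memory vector: lower_bound is a hypothetical helper, and in fst each probe would correspond to a seek in the file rather than a vector access.

```r
# Hypothetical helper: first index i in the sorted vector `x` with x[i] >= value,
# or length(x) + 1 when no such element exists. Also counts the probes made.
lower_bound <- function(x, value) {
  lo <- 1L
  hi <- length(x) + 1L
  probes <- 0L
  while (lo < hi) {
    mid <- lo + (hi - lo) %/% 2L
    probes <- probes + 1L
    if (x[mid] < value) lo <- mid + 1L else hi <- mid
  }
  list(index = lo, probes = probes)
}

# Sorted integer key column with 1000 values: 10, 20, ..., 10000
key <- seq(10L, 10000L, by = 10L)

# Row range for 2500 <= key <= 5000
from <- lower_bound(key, 2500L)  # first row with key >= 2500
to <- lower_bound(key, 5001L)    # first row with key > 5000 (integer key)
rows <- seq(from$index, to$index - 1L)

print(range(key[rows]))
print(c(from$probes, to$probes))  # each lookup needs about log2(length(key)) probes
print(ceiling(log2(1e9)))         # probes needed for a billion records
```

Only the two boundary lookups touch the key column; the rows in between can then be read as one contiguous block.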