Comments (6)
Hi @wesm, thanks for this. Yes, we are excited about Arrow (even though we only support a subset at the moment) because it provides interoperability with lots of other things and makes sense as a way to represent columnar data. I don't see any reason why it should not be performant on GPU, as the MapD native format is quite similar (except we store nulls in-line when possible to save space and bandwidth). Would it make sense to set up a call with the project members so we can discuss ways to collaborate?
from cudf.
That sounds good to me. Adding @julienledem @xhochy since they will be interested, and maybe others from the Apache Arrow team.
I am interested in:
- Ingest data (zero-copy, preferably) from Arrow record batches
- Ingest data to MapD from Arrow
- Export data as Arrow record batches
- UDF protocol for batch-based UDFs
- Benchmarks and analysis of pros/cons of different columnar-type memory layouts on the GPU (you say you store nulls inline -- does that mean sentinel values? Otherwise I am not sure how you could be more efficient than 1 bit per value for data that has nulls).
As background, I did some GPU development for accelerating Bayesian inference problems years ago and did a fair amount of CUDA C and PyCUDA work, so I've had a long-standing interest in architecting data structures and memory access patterns for the GPU.
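To make the null-representation question above concrete, here is a minimal sketch comparing an Arrow-style validity bitmap (1 bit per value, least-significant bit first, 1 = valid) with inline sentinel values. The sentinel choice (`INT32_SENTINEL`) and the helper names are assumptions for illustration only, not MapD's actual format, and the sketch ignores the buffer padding a real Arrow implementation would add.

```python
import struct

# Hypothetical sentinel: reserve the smallest int32 to mean "null".
INT32_SENTINEL = -2**31


def pack_validity_bitmap(values):
    """Pack null flags into a bitmap: bit i (LSB-first) is 1 when values[i] is valid."""
    bitmap = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def encode_with_bitmap(values):
    """Return (data buffer, validity bitmap); null slots hold an arbitrary 0."""
    data = struct.pack(f"<{len(values)}i", *(0 if v is None else v for v in values))
    return data, pack_validity_bitmap(values)


def encode_with_sentinel(values):
    """Return a single data buffer where nulls are stored inline as the sentinel."""
    return struct.pack(
        f"<{len(values)}i", *(INT32_SENTINEL if v is None else v for v in values)
    )


def decode_sentinel(buffer):
    """Recover the original values, mapping the sentinel back to None."""
    count = len(buffer) // 4
    return [None if v == INT32_SENTINEL else v for v in struct.unpack(f"<{count}i", buffer)]


values = [7, None, 3, None, 42, 5, None, 1, 9]
data, bitmap = encode_with_bitmap(values)
inline = encode_with_sentinel(values)

bitmap_bytes = len(data) + len(bitmap)   # data buffer plus 1 bit per value
sentinel_bytes = len(inline)             # data buffer only
print(bitmap_bytes, sentinel_bytes)
```

The sentinel layout saves the bitmap bytes and a second memory stream, at the cost of giving up one representable value per type; the bitmap layout keeps the full value range.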
Bingo on all fronts; these are all things I mentioned in the talk I gave last week at GTC. We also have some basic work to do to support the rest of the data types (the prototype handled only simple, uncompressed numerics to keep it simple).
Does the GPU benefit from columnar compression techniques like CPU-based columnar databases do?
@wesm, we already have some in the core engine, like dictionary compression. We are also planning to tokenize any string column that contains only digits to save memory. These techniques don't require the data to be columnar, if you just mean something like RLE or HCC. All of these approaches aim to keep GPU decoding fast.
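As a rough illustration of the dictionary compression mentioned above, the sketch below encodes a string column as a small dictionary plus integer codes; decoding is then a simple gather by index, which is the kind of operation that stays cheap on a GPU. The function names are hypothetical and this is not taken from the MapD engine.

```python
def dictionary_encode(column):
    """Replace each value with an integer code into a dictionary of distinct values."""
    dictionary = []   # distinct values, in order of first appearance
    index_of = {}     # value -> position in `dictionary`
    codes = []
    for value in column:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Decode by gathering dictionary entries at each code."""
    return [dictionary[code] for code in codes]


column = ["SFO", "JFK", "SFO", "LAX", "JFK", "SFO"]
dictionary, codes = dictionary_encode(column)
print(dictionary)  # ['SFO', 'JFK', 'LAX']
print(codes)       # [0, 1, 0, 2, 1, 0]
```

With few distinct values, the codes can be stored in a narrow integer type, so the column shrinks while filters and group-bys can operate on the codes directly.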