Comments (2)
Is there a link to a patch (or, even better, a reference to a pull request) that we can have a look at?
from petastorm.
I don't remember the details, since it was a long time ago. See if either of these references helps: apache/parquet-java#470
https://issues.apache.org/jira/browse/PARQUET-409
In my experience, it's typically not a good idea to have Parquet stores with small row groups. Small row groups violate a number of assumptions about how a Parquet store is structured and make you "fight" the Parquet library implementation a lot. In some scenarios this shows up as poor performance and a large memory footprint.
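To check whether a store has ended up with many small row groups, it can help to count them. The sketch below is only an illustration: the helper names `expected_row_groups` and `actual_row_groups`, the column name `x`, and the sizes used are all hypothetical, and it assumes pyarrow as the Parquet library (the second helper returns `None` when pyarrow is not installed).

```python
import math
import os
import tempfile


def expected_row_groups(num_rows: int, rows_per_group: int) -> int:
    """Row groups produced when a writer caps each group at rows_per_group rows."""
    if rows_per_group <= 0:
        raise ValueError("rows_per_group must be positive")
    return math.ceil(num_rows / rows_per_group)


def actual_row_groups(num_rows: int, rows_per_group: int):
    """Write a throwaway file with pyarrow and read the row-group count
    back from the file footer. Returns None if pyarrow is unavailable."""
    try:
        import pyarrow as pa
        import pyarrow.parquet as pq
    except ImportError:
        return None

    # Hypothetical single-column table, used only to have something to write.
    table = pa.table({"x": list(range(num_rows))})
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.parquet")
        # row_group_size is the maximum number of rows per row group.
        pq.write_table(table, path, row_group_size=rows_per_group)
        return pq.read_metadata(path).num_row_groups


if __name__ == "__main__":
    # Illustrative sizes only: the same 10,000 rows with a small and a large cap.
    for cap in (100, 10_000):
        print(cap, expected_row_groups(10_000, cap), actual_row_groups(10_000, cap))
```

A store whose files report a very large number of row groups relative to their row count is a likely candidate for the performance and memory problems described above.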