Comments (7)
Hello, can you please try the latest version, v0.10.0? The CSV reader has been almost completely rewritten.
from duckdb.
Hello, I just tried it with v0.10.0. It is much better. The loading time is down to ~5 min for query 2. (There is a breaking change in the argument name: delimiter vs. delim.)
However, compared with MS Excel, DuckDB's loading is an order of magnitude slower. Excel loads the same file in no time.
Update:
I tested the release version of DuckDB. As expected, the load time improved, but it is still too long: 2 min. Also, memory consumption goes up to 18 GB (the CSV is 30 MB). It seems that something is not right.
from duckdb.
Thanks for giving it a try with DuckDB v0.10 - it's great that it no longer times out.
I don't think we can do much about this issue at the moment:
- Having this many columns is a rare corner case and sounds somewhat like a data modelling issue. If having 298 rows (like in your example) is normal for your use case, then I would consider transposing the file so it has 298 columns and ~16k rows. This is doable in Excel using Pivot or via some Bash/Python script. Admittedly, pre-processing data so that it can be loaded into DuckDB is unusual, but this is an unusual data set.
- To improve the speed of DuckDB's CSV reader on similar files, we need a reproducible example. As an alternative, you may try reading the file directly using read_text and splitting the rows using string split. This should be fast. How you then model your data depends on the queries you'd like to evaluate on it.
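For the transposing suggestion, a minimal Python sketch might look like the following. It uses only the standard library and assumes the whole file fits in memory (reasonable for a CSV of a few tens of MB) and that every row has the same number of fields. The function name transpose_csv and the file names in the usage note are hypothetical.

```python
import csv
from pathlib import Path


def transpose_csv(src: Path, dst: Path, delimiter: str = ",") -> None:
    """Write a transposed copy of src to dst (rows become columns)."""
    with src.open(newline="") as f:
        rows = list(csv.reader(f, delimiter=delimiter))

    # zip() stops at the shortest row, so a ragged file would be
    # truncated silently. Reject it explicitly instead.
    widths = {len(r) for r in rows}
    if len(widths) > 1:
        raise ValueError(f"ragged CSV, row widths: {sorted(widths)}")

    # zip(*rows) groups the i-th field of every input row, which gives
    # one output row per input column.
    with dst.open("w", newline="") as f:
        csv.writer(f, delimiter=delimiter).writerows(zip(*rows))
```

Calling transpose_csv(Path("wide.csv"), Path("tall.csv")) would turn a file with a few hundred rows and thousands of columns into one with thousands of rows and a few hundred columns. Note that the header row becomes the first column of the output.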
from duckdb.
Thanks, I will give it a try.
I can't do the transposing as the data come from a data scientist and the format is essential.
from duckdb.
One thing you can try is to turn off the CSV sniffer by passing auto_detect = false. However, this requires you to manually specify the schema with a (long) CREATE TABLE statement.
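The long CREATE TABLE statement does not have to be typed by hand. As a rough sketch, assuming the file has a header row and that every column can share one type, a short Python script could generate the statement from the header. The function name create_table_sql, the default table name wide, and the default type DOUBLE are illustrative choices, not something the CSV reader prescribes.

```python
import csv


def create_table_sql(csv_path: str, table: str = "wide", sql_type: str = "DOUBLE") -> str:
    """Build a CREATE TABLE statement with one column per CSV header field."""
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f))  # only the first record is read

    def quote(name: str) -> str:
        # Standard SQL identifier quoting: wrap in double quotes and
        # double any embedded double quote.
        return '"' + name.replace('"', '""') + '"'

    columns = ",\n  ".join(f"{quote(name)} {sql_type}" for name in header)
    return f"CREATE TABLE {quote(table)} (\n  {columns}\n);"
```

The returned string can be executed once to create the table, after which the file can be loaded into it with auto-detection turned off.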
from duckdb.
@eknowles thanks for reporting this regression. Is there a chance you can share your data? (Sharing privately is also fine.)
from duckdb.
I wanted to avoid explicitly creating the table schema (due to the 17 000 columns). Instead, I used the ALL_VARCHAR=TRUE flag (even though all values are numbers). I assumed this would stop autodetection.
Sorry, I can't share the data as it is the product of research. However, I believe any CSV with the following shape would work for this purpose:
columns: ~17000
rows: ~300
values: all cells contain a decimal number
I hope this will help.
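A synthetic file with roughly this shape could be generated with a short Python script such as the sketch below. The header names (c0, c1, ...), the value range, and the six-decimal formatting are assumptions. The original file may have no header and a different number format, so this may not reproduce the exact timings.

```python
import csv
import random


def write_wide_csv(path: str, n_cols: int = 17_000, n_rows: int = 300, seed: int = 0) -> None:
    """Write a CSV with a header row plus n_rows rows of random decimals."""
    rng = random.Random(seed)  # fixed seed so the file is reproducible
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        # Hypothetical column names; the real file's header is unknown.
        writer.writerow([f"c{i}" for i in range(n_cols)])
        for _ in range(n_rows):
            writer.writerow([f"{rng.uniform(0, 1000):.6f}" for _ in range(n_cols)])
```

Calling write_wide_csv("wide.csv") writes a file with 17,000 columns and 300 data rows where every cell is a decimal number.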
from duckdb.