
Comments (7)

szarnyasg commented on June 25, 2024

Hello, can you please try the latest version, v0.10.0? The CSV reader has been almost completely rewritten.


lkordos commented on June 25, 2024

Hello, I just tried it with v0.10.0 and it is much better. The loading time is down to ~5 min for query 2. (Note there is a breaking change in the argument name: delimiter vs. delim.)
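For reference, a minimal sketch of the renamed argument, assuming the SQL read_csv function and a placeholder file name (neither the file nor the exact call is shown in this thread):

    import duckdb

    # 'wide.csv' is a placeholder; in recent DuckDB versions the CSV option is spelled `delim`.
    rel = duckdb.sql("""
        SELECT * FROM read_csv('wide.csv', delim = ',', header = true)
    """)
    print(rel.limit(5))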

However, compared to MS Excel, DuckDB's loading is orders of magnitude slower. Excel loads the same file in no time.

Update:
I tested the release version of DuckDB. As expected, the load time improved, but it is still too long: 2 min. Also, memory consumption climbs to 18 GB (the CSV itself is 30 MB). It seems that something is not right.


szarnyasg commented on June 25, 2024

Thanks for giving it a try with DuckDB v0.10 - it's great that it no longer times out.

I don't think we can do much about this issue at the moment:

  1. Having this many columns is a rare corner case and sounds like something of a data modelling issue.
    If having 298 rows (like in your example) is normal for your use case, I would consider transposing the file so it has 298 columns and ~16k rows. This is doable in Excel using a pivot or via a small Bash/Python script (see the sketch after this list). Admittedly, pre-processing data just to make it loadable into DuckDB is unusual, but this is an unusual data set.

  2. To improve the speed of DuckDB's CSV reader on similar files, we need a reproducible example.
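A minimal sketch of the transposition idea from point 1, assuming a comma-separated file and hypothetical in.csv/out.csv file names (the real file is not part of this thread):

    import csv

    # Read the ~300-row x ~17k-column file into memory and write it back
    # transposed (~17k rows x ~300 columns). Hypothetical file names.
    with open("in.csv", newline="") as f:
        rows = list(csv.reader(f))

    with open("out.csv", "w", newline="") as f:
        csv.writer(f).writerows(zip(*rows))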

As an alternative, you may try reading the file directly using read_text and splitting the rows with string_split. This should be fast. How you then model your data is up to the queries you'd like to evaluate on it.
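A rough sketch of that idea, assuming a comma-separated file named wide.csv with Unix line endings (both assumptions; the actual file details are not in the thread):

    import duckdb

    # Split the raw file content into lines, then each line into a list of fields.
    # Assumes '\n' line endings and a ',' delimiter.
    rel = duckdb.sql("""
        WITH lines AS (
            SELECT unnest(string_split(content, chr(10))) AS line
            FROM read_text('wide.csv')
        )
        SELECT string_split(line, ',') AS fields
        FROM lines
        WHERE line <> ''
    """)
    print(rel.limit(3))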


lkordos commented on June 25, 2024

Thanks, I will give it a try.
I can't transpose the data, as it comes from a data scientist and the format is essential.


szarnyasg commented on June 25, 2024

One thing you can try is to turn off the CSV sniffer by passing auto_detect = false. However, this requires you to manually specify the schema with a (long) CREATE TABLE statement.
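Since typing ~17k column definitions by hand is impractical, the long schema can be generated programmatically. A sketch of that route, passing the explicit column list via read_csv's columns option instead of a literal CREATE TABLE column list, with hypothetical names c0..c16999 and an assumed DOUBLE type for every column:

    import duckdb

    N_COLS = 17_000  # approximate column count from this thread

    # Hypothetical column names and types; substitute the real header names/types.
    col_spec = ", ".join(f"'c{i}': 'DOUBLE'" for i in range(N_COLS))

    duckdb.sql(f"""
        CREATE TABLE wide AS
        SELECT * FROM read_csv('wide.csv',
                               auto_detect = false,
                               header = true,
                               delim = ',',
                               columns = {{{col_spec}}})
    """)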


szarnyasg commented on June 25, 2024

@eknowles thanks for reporting this regression. Is there a chance you can share your data? (Sharing privately is also fine.)


lkordos commented on June 25, 2024

I wanted to avoid explicitly creating the table schema (due to the 17,000 columns). Instead, I used the ALL_VARCHAR=TRUE flag (even though all values are numbers) - I assumed this would stop auto-detection.
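For context, presumably something along these lines (a reconstruction with a placeholder file name, since the exact command is not shown in the thread):

    import duckdb

    # Placeholder file name; all_varchar = true reads every column as VARCHAR,
    # skipping type detection but not the rest of the sniffing.
    duckdb.sql("""
        CREATE TABLE wide AS
        SELECT * FROM read_csv('wide.csv', all_varchar = true)
    """)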

Sorry, I can't share the data, as it is the product of research. However, I believe any CSV of this shape would work for this purpose:
columns: ~17000
rows: ~300
values: all cells contain a decimal number

I hope this will help.
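To make that concrete, a small generator for a synthetic file of the described shape (hypothetical file name, arbitrary decimal values):

    import csv
    import random

    N_COLS, N_ROWS = 17_000, 300  # shape described above

    # Write a header row plus ~300 rows of decimal numbers across ~17k columns.
    with open("repro.csv", "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(f"c{i}" for i in range(N_COLS))
        for _ in range(N_ROWS):
            w.writerow(round(random.random(), 6) for _ in range(N_COLS))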

