
Comments (2)

rudolfix commented on July 22, 2024

@hello-world-bfree thanks for the bug report. I think it is (more or less) clear what is going on. If your data source adds columns on the fly and the in-memory buffer for extracted data does not hold many rows (5000 by default), then we will indeed write several parquet files with different schemas. You could increase the buffer size to 1,000,000 rows and see if that still happens: https://dlthub.com/docs/reference/performance#controlling-in-memory-buffers
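A minimal sketch of one way to try the larger buffer, assuming dlt picks up the `buffer_max_items` setting of the `data_writer` section from an environment variable. The variable name below follows that convention but is an assumption here, so check it against the linked performance docs for your dlt version.

```python
import os

# Assumption: dlt maps the [data_writer] buffer_max_items setting to this
# environment variable (section and key joined by a double underscore).
# Verify the name against the performance docs for your dlt version.
os.environ["DATA_WRITER__BUFFER_MAX_ITEMS"] = "1000000"

# Set this before the pipeline runs in the same process, so that the
# extract step sees the larger buffer.
```

With a buffer this large, a small source should end up in a single parquet file, which makes it easy to confirm whether differing file schemas are the cause.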

@steinitzu let's try to fix it. Since all parquet files are loaded from local storage, we can read the column names from each file and generate a COPY command per file.

with maybe_context(lock):
    with sql_client.begin_transaction():
        sql_client.execute_sql(
            f"COPY {qualified_table_name} FROM '{file_path}' ( FORMAT"
            f" {source_format} {options});"
        )

Right now we assume that DuckDB will handle this itself, and apparently that is not the case.

A COPY command per file will work because dlt makes sure that all changes are added and the schema is already migrated, so all possible columns are present.
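A rough sketch of what a per-file COPY statement could look like, not the actual dlt change. The helper names `quote_identifier` and `build_copy_statement` are hypothetical, and the column list is passed in by the caller. One possible way to obtain it would be `pyarrow.parquet.read_schema(file_path).names`. The sketch relies on DuckDB accepting a column list after the table name in `COPY ... FROM`.

```python
from typing import Sequence


def quote_identifier(name: str) -> str:
    """Double-quote an SQL identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_copy_statement(
    qualified_table_name: str,
    file_path: str,
    columns: Sequence[str],
    source_format: str = "PARQUET",
    options: str = "",
) -> str:
    """Build a COPY statement that names only the columns found in one file.

    Hypothetical helper: `columns` must be the file's columns in file order.
    """
    column_list = ", ".join(quote_identifier(c) for c in columns)
    escaped_path = file_path.replace("'", "''")
    return (
        f"COPY {qualified_table_name} ({column_list}) FROM '{escaped_path}'"
        f" ( FORMAT {source_format} {options});"
    )


# Example: a file that only contains two of the table's columns.
statement = build_copy_statement("events", "events.1.parquet", ["id", "name"])
```

Columns that exist in the table but not in the file are simply left out of the list, so they would be filled with their defaults (NULL unless declared otherwise).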


steinitzu commented on July 22, 2024

@hello-world-bfree I'm pretty sure this was fixed in dlt 0.4.8. Can you try updating? The latest version is 0.4.12.
I was only able to replicate the bug on 0.4.7.
Parquet files are now normalized: missing columns are added and columns are reordered as needed.
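To illustrate the idea only (this is not dlt's implementation), here is a small sketch that aligns one file's columns with the table's column order. The function name `normalize_columns` and the dict-of-lists representation are assumptions made for the example.

```python
from typing import Any, Dict, List, Sequence


def normalize_columns(
    file_columns: Dict[str, List[Any]],
    table_columns: Sequence[str],
) -> Dict[str, List[Any]]:
    """Return the file's data in table column order, filling gaps with None."""
    unknown = set(file_columns) - set(table_columns)
    if unknown:
        # The schema is expected to be migrated already, so every column
        # in the file should exist in the table.
        raise ValueError(f"columns not in table schema: {sorted(unknown)}")

    # All columns in one file have the same length; use the first one.
    row_count = len(next(iter(file_columns.values()), []))
    return {
        name: file_columns.get(name, [None] * row_count)
        for name in table_columns
    }


# A file written before column "a" appeared in the source.
normalized = normalize_columns({"b": [1, 2]}, ["a", "b"])
```

After this step every file has the same columns in the same order, so a plain COPY without a column list can load all of them.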
