Comments (12)

filabrazilska commented on September 24, 2024

Hi,
the difference between read_csv and scan_csv is that the former uses the passed-in schema for the schema attribute in CsvReader, whereas the latter uses it for the dtypes attribute.
When that is changed, the two behave the same (see my commit above). That said, I don't know if anyone depends on the original behaviour, so I'm not sure the maintainers are willing to accept this change.
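
To make the distinction concrete, here is a toy model in plain Python. It is not Polars code: the function names `resolve_as_schema` and `resolve_as_dtypes` and the dtype strings are hypothetical, and it assumes that a schema replaces the column set wholesale while dtypes only override the types of columns already inferred from the file.

```python
# Toy model of the two behaviours described above; not Polars internals.

def resolve_as_schema(inferred, user_schema):
    """The user-supplied mapping defines the full column set."""
    return dict(user_schema)


def resolve_as_dtypes(inferred, user_schema):
    """The user-supplied mapping only overrides types of inferred columns."""
    return {name: user_schema.get(name, dtype) for name, dtype in inferred.items()}


# Columns inferred from a file whose header is "A,B,C" (assumed for illustration).
inferred = {"A": "i64", "B": "i64", "C": "i64"}
# Mapping passed by the user, naming five columns.
user_schema = dict.fromkeys("ABCDE", "str")

as_schema = resolve_as_schema(inferred, user_schema)
as_dtypes = resolve_as_dtypes(inferred, user_schema)

print(list(as_schema))  # ['A', 'B', 'C', 'D', 'E']
print(list(as_dtypes))  # ['A', 'B', 'C']
```

Under this model the extra columns survive only when the mapping is treated as the schema, which would be consistent with read_csv and scan_csv returning different widths.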

from polars.

cmdlineluser commented on September 24, 2024

If I understand correctly, this appears to be a minimal repro?

Data:

wget https://nemweb.com.au/Reports/Current/Daily_Reports/PUBLIC_DAILY_202401270000_20240128040505.zip
unzip PUBLIC_DAILY_202401270000_20240128040505.zip

.read_csv works as expected.

import polars as pl

pl.read_csv(
    "PUBLIC_DAILY_202401270000_20240128040505.CSV",
    skip_rows=1,
    schema=dict.fromkeys([f"column_{n}" for n in range(1, 131)], pl.String),
    has_header=False
)
# shape: (424_519, 130) # <- OK: 130 Columns
# ...

The same arguments with .scan_csv raise an exception:

# ComputeError: found more fields than defined in 'Schema'
# Consider setting 'truncate_ragged_lines=True'.

With truncate_ragged_lines=True the file is read but we no longer get 130 columns.

pl.scan_csv(
    "PUBLIC_DAILY_202401270000_20240128040505.CSV",
    skip_rows=1,
    schema=dict.fromkeys([f"column_{n}" for n in range(1, 131)], pl.String),
    has_header=False,
    truncate_ragged_lines=True
).collect()

# shape: (424_519, 25) # <- ERROR!!! 25 Columns
# ...

(With read_csv(..., truncate_ragged_lines=True) we still get 130 Columns.)
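
One way to check how ragged a file is before settling on a schema is to count the fields on each line. This is a minimal sketch using only the standard-library `csv` module. The helper name `field_count_histogram` is hypothetical, and the sample data is made up for illustration rather than taken from the file above.

```python
import csv
import io
from collections import Counter


def field_count_histogram(lines, skip_rows=0):
    """Return a Counter mapping number-of-fields -> number-of-lines."""
    reader = csv.reader(lines)
    counts = Counter()
    for index, row in enumerate(reader):
        if index < skip_rows or not row:
            continue  # skip leading rows and blank lines
        counts[len(row)] += 1
    return counts


# Made-up sample: one skipped row, two 4-field rows, one 6-field row.
sample = io.StringIO("C,header,row\nI,a,b,c\nD,1,2,3\nD,1,2,3,4,5\n")
histogram = field_count_histogram(sample, skip_rows=1)
print(histogram)       # Counter({4: 2, 6: 1})
print(max(histogram))  # 6, the widest line
```

For a real file, pass `open(path, newline="")` instead of the `StringIO` object. The widest line gives the number of columns a full schema would need to cover.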

ritchie46 commented on September 24, 2024

Have you tried setting the full schema? Setting only columns will set the names of the schema, but Polars will still determine the schema itself, which in this case will likely be done based on the header.

We accept a schema argument that will help you completely overwrite the schema.

djouallah commented on September 24, 2024

It does not seem to be working. It still ignores the schema and still generates a header (column_1, column_2, etc.) based on the number of columns in row 1.

image

cmdlineluser commented on September 24, 2024

I suppose the data is not actually needed.

Simpler repro:

import polars as pl
import tempfile

with tempfile.NamedTemporaryFile() as f:
    f.write(b"""
A,B,C
1,2,3
4,5,6,7,8
9,10,11
""".strip())
    f.seek(0)
    
    df = pl.read_csv(f.name, schema=dict.fromkeys("ABCDE", pl.String), truncate_ragged_lines=True)
    # shape: (3, 5)
    # ┌─────┬─────┬─────┬──────┬──────┐
    # │ A   ┆ B   ┆ C   ┆ D    ┆ E    │
    # │ --- ┆ --- ┆ --- ┆ ---  ┆ ---  │
    # │ str ┆ str ┆ str ┆ str  ┆ str  │
    # ╞═════╪═════╪═════╪══════╪══════╡
    # │ 1   ┆ 2   ┆ 3   ┆ null ┆ null │
    # │ 4   ┆ 5   ┆ 6   ┆ 7    ┆ 8    │
    # │ 9   ┆ 10  ┆ 11  ┆ null ┆ null │
    # └─────┴─────┴─────┴──────┴──────┘
    
    lf = pl.scan_csv(f.name, schema=dict.fromkeys("ABCDE", pl.String), truncate_ragged_lines=True).collect()
    # shape: (3, 3)
    # ┌─────┬─────┬─────┐
    # │ A   ┆ B   ┆ C   │
    # │ --- ┆ --- ┆ --- │
    # │ str ┆ str ┆ str │
    # ╞═════╪═════╪═════╡
    # │ 1   ┆ 2   ┆ 3   │
    # │ 4   ┆ 5   ┆ 6   │
    # │ 9   ┆ 10  ┆ 11  │
    # └─────┴─────┴─────┘
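
For reference, the read_csv result above corresponds to padding short rows with nulls up to the schema width. Here is a standard-library sketch of that expectation. The helper `pad_to_schema` is hypothetical, and this is not how Polars implements it.

```python
import csv
import io


def pad_to_schema(text, names, has_header=True):
    """Parse CSV text, padding short rows with None and truncating long ones."""
    rows = list(csv.reader(io.StringIO(text)))
    if has_header:
        rows = rows[1:]  # the supplied names replace the header row
    width = len(names)
    return [dict(zip(names, (row + [None] * width)[:width])) for row in rows]


data = "A,B,C\n1,2,3\n4,5,6,7,8\n9,10,11"
records = pad_to_schema(data, list("ABCDE"))
for record in records:
    print(record)
```

Each record has all five keys, with `None` in D and E for the two short rows, which mirrors the (3, 5) frame shown for read_csv.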

filabrazilska commented on September 24, 2024

The documentation for scan_csv seems to suggest that the schema param should indeed be used for schema rather than dtypes: https://docs.pola.rs/py-polars/html/reference/api/polars.scan_csv.html

djouallah commented on September 24, 2024

Any updates on this?

cmdlineluser commented on September 24, 2024

It looks like @filabrazilska did file a PR to address this: #15305

But it hasn't been reviewed yet.

(PRs can be linked to issues with keywords: https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword)

ritchie46 commented on September 24, 2024

Thanks for the report and thank you for the minimal repro @cmdlineluser. Taking a look.

djouallah commented on September 24, 2024

@ritchie46 there is a regression with the latest update

PanicException                            Traceback (most recent call last)
<timed exec> in <module>

<ipython-input-11-3bc5859e5af2> in polars_clean_csv(x)
     33   z = transform.with_columns(pl.col("SETTLEMENTDATE").str.to_datetime())
     34   columns = list(set(transform.columns) - {'SETTLEMENTDATE','DUID','UNIT'})
---> 35   final=z.with_columns(pl.col(columns).cast(pl.Float64),YEAR=pl.col("SETTLEMENTDATE").dt.iso_year()).collect()
     36   return final.to_arrow()

/usr/local/lib/python3.10/dist-packages/polars/lazyframe/frame.py in collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, no_optimization, streaming, background, _eager, **_kwargs)
   1815         callback = _kwargs.get("post_opt_callback")
   1816
-> 1817         return wrap_df(ldf.collect(callback))
   1818
   1819     @overload

PanicException: called Option::unwrap() on a None value

cmdlineluser commented on September 24, 2024

Thanks @djouallah - it seems that was a different issue.

If you can make minimal test cases, it makes it easier for the devs to fix.

As an example, I made a minimal repro for you in #16437

It has just been fixed and will be part of 0.20.29 which should be released soon.

evbo commented on September 24, 2024

I am seeing similar behavior but for nd_json. Issue linked here: #18244
