Comments (12)
Hi,

the difference between read_csv and scan_csv is that the former uses the passed-in schema for the schema attribute in CsvReader, whereas the latter uses it for the dtypes attribute. When changed, the two behave the same (see my commit above). That said, I don't know if anyone depends on the original behaviour, so I'm not sure the maintainers are willing to accept this change.
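In Python terms, the invariant that change restores can be pictured as a property check (a minimal sketch; the helper name is made up, and it assumes keyword arguments that both readers share):

import polars as pl

def assert_eager_lazy_agree(path: str, **kwargs) -> None:
    # Hypothetical helper: with the fix applied, eager and lazy CSV reads
    # given the same schema= should produce identical frames.
    eager = pl.read_csv(path, **kwargs)
    lazy = pl.scan_csv(path, **kwargs).collect()
    assert eager.equals(lazy), (eager.shape, lazy.shape)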
If I understand correctly, this appears to be a minimal repro:
Data:
wget https://nemweb.com.au/Reports/Current/Daily_Reports/PUBLIC_DAILY_202401270000_20240128040505.zip
unzip PUBLIC_DAILY_202401270000_20240128040505.zip
.read_csv works as expected:
import polars as pl

pl.read_csv(
    "PUBLIC_DAILY_202401270000_20240128040505.CSV",
    skip_rows=1,
    schema=dict.fromkeys([f"column_{n}" for n in range(1, 131)], pl.String),
    has_header=False,
)
# shape: (424_519, 130)  # <- OK: 130 columns
# ...
The same arguments with .scan_csv raise an exception:
# ComputeError: found more fields than defined in 'Schema'
# Consider setting 'truncate_ragged_lines=True'.
With truncate_ragged_lines=True the file is read, but we no longer get 130 columns.
pl.scan_csv(
    "PUBLIC_DAILY_202401270000_20240128040505.CSV",
    skip_rows=1,
    schema=dict.fromkeys([f"column_{n}" for n in range(1, 131)], pl.String),
    has_header=False,
    truncate_ragged_lines=True,
).collect()
# shape: (424_519, 25)  # <- ERROR: only 25 columns
# ...
(With read_csv(..., truncate_ragged_lines=True) we still get 130 columns.)
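Presumably the 25 comes from schema inference on the leading rows of the file. A quick way to check is to see what width scan_csv infers on its own (a hedged sketch against the same file; if it prints 25, that would explain the truncated result):

# Without a schema, scan_csv infers the width from the start of the file.
inferred = pl.scan_csv(
    "PUBLIC_DAILY_202401270000_20240128040505.CSV",
    skip_rows=1,
    has_header=False,
).schema
print(len(inferred))  # assumption: 25, matching the truncated frame above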
Have you tried setting the full schema? Setting only columns will set the names of the schema, but Polars will still determine the schema itself, which in this case will likely be done based on the header. We accept a schema argument that lets you completely override the schema.
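For comparison, a minimal sketch of the names-only vs. full-override distinction (assuming "setting only columns" refers to read_csv's new_columns parameter):

import io

import polars as pl

raw = b"1,2\n3,4"

# new_columns only renames; dtypes are still inferred from the data.
pl.read_csv(io.BytesIO(raw), has_header=False, new_columns=["x", "y"]).dtypes
# [Int64, Int64]

# schema= overrides names *and* dtypes.
pl.read_csv(io.BytesIO(raw), has_header=False, schema={"x": pl.String, "y": pl.String}).dtypes
# [String, String]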
That does not seem to be working; it still ignores the schema and still provides a header (column_1, column_2, etc.) based on the number of columns in row 1.
I suppose the data is not actually needed.
Simpler repro:
import polars as pl
import tempfile

with tempfile.NamedTemporaryFile() as f:
    f.write(b"""
A,B,C
1,2,3
4,5,6,7,8
9,10,11
""".strip())
    f.seek(0)

    df = pl.read_csv(f.name, schema=dict.fromkeys("ABCDE", pl.String), truncate_ragged_lines=True)
# shape: (3, 5)
# ┌─────┬─────┬─────┬──────┬──────┐
# │ A ┆ B ┆ C ┆ D ┆ E │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ str ┆ str ┆ str ┆ str ┆ str │
# ╞═════╪═════╪═════╪══════╪══════╡
# │ 1 ┆ 2 ┆ 3 ┆ null ┆ null │
# │ 4 ┆ 5 ┆ 6 ┆ 7 ┆ 8 │
# │ 9 ┆ 10 ┆ 11 ┆ null ┆ null │
# └─────┴─────┴─────┴──────┴──────┘
    lf = pl.scan_csv(f.name, schema=dict.fromkeys("ABCDE", pl.String), truncate_ragged_lines=True).collect()
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ A ┆ B ┆ C │
# │ --- ┆ --- ┆ --- │
# │ str ┆ str ┆ str │
# ╞═════╪═════╪═════╡
# │ 1 ┆ 2 ┆ 3 │
# │ 4 ┆ 5 ┆ 6 │
# │ 9 ┆ 10 ┆ 11 │
# └─────┴─────┴─────┘
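Until scan_csv honors schema= the same way, one hedged workaround (assuming the data fits in memory) is to read eagerly and only then go lazy:

    # Workaround sketch: the eager read applies the full 5-column schema,
    # and .lazy() converts the result for further lazy operations.
    lf = pl.read_csv(
        f.name,
        schema=dict.fromkeys("ABCDE", pl.String),
        truncate_ragged_lines=True,
    ).lazy()
    print(lf.collect().shape)  # (3, 5)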
The documentation for scan_csv does seem to suggest that the schema param should indeed be used for the schema rather than the dtypes: https://docs.pola.rs/py-polars/html/reference/api/polars.scan_csv.html
Any updates on this?
It looks like @filabrazilska did file a PR to address this: #15305. But it hasn't been reviewed yet.

(PRs can be linked to issues with keywords: https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword)
Thanks for the report and thank you for the minimal repro @cmdlineluser. Taking a look.
@ritchie46 there is a regression with the latest update:

PanicException                            Traceback (most recent call last)
<timed exec> in <module>

<ipython-input-11-3bc5859e5af2> in polars_clean_csv(x)
     33 z = transform.with_columns(pl.col("SETTLEMENTDATE").str.to_datetime())
     34 columns = list(set(transform.columns) - {'SETTLEMENTDATE','DUID','UNIT'})
---> 35 final = z.with_columns(pl.col(columns).cast(pl.Float64), YEAR=pl.col("SETTLEMENTDATE").dt.iso_year()).collect()
     36 return final.to_arrow()

/usr/local/lib/python3.10/dist-packages/polars/lazyframe/frame.py in collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, no_optimization, streaming, background, _eager, **_kwargs)
   1815 callback = _kwargs.get("post_opt_callback")
   1816
-> 1817 return wrap_df(ldf.collect(callback))
   1818
   1819 @overload

PanicException: called Option::unwrap() on a None value
Thanks @djouallah - it seems that was a different issue.
If you can make minimal test cases, it makes it easier for the devs to fix.
As an example, I made a minimal repro for you in #16437.
It has just been fixed and will be part of 0.20.29, which should be released soon.
I am seeing similar behavior but for ndjson. Issue linked here: #18244