Comments (12)
Hi,

the difference between read_csv and scan_csv is that the former uses the passed-in schema for the schema attribute in CsvReader, whereas the latter uses it for the dtypes attribute. When changed, the two behave the same (see my commit above). That said, I don't know if anyone depends on the original behaviour, so I'm not sure the maintainers are willing to accept this change.
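In Python terms, the invariant that change restores can be pictured as a property check (a minimal sketch; the helper name is made up, and it assumes keyword arguments that both readers share):

import polars as pl

def assert_eager_lazy_agree(path: str, **kwargs) -> None:
    # Hypothetical helper: with the fix applied, eager and lazy CSV reads
    # given the same schema= should produce identical frames.
    eager = pl.read_csv(path, **kwargs)
    lazy = pl.scan_csv(path, **kwargs).collect()
    assert eager.equals(lazy), (eager.shape, lazy.shape)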
If I understand correctly, this appears to be a minimal repro:
Data:
wget https://nemweb.com.au/Reports/Current/Daily_Reports/PUBLIC_DAILY_202401270000_20240128040505.zip
unzip PUBLIC_DAILY_202401270000_20240128040505.zip
.read_csv works as expected:
import polars as pl

pl.read_csv(
    "PUBLIC_DAILY_202401270000_20240128040505.CSV",
    skip_rows=1,
    schema=dict.fromkeys([f"column_{n}" for n in range(1, 131)], pl.String),
    has_header=False,
)
# shape: (424_519, 130)  # <- OK: 130 columns
# ...
The same arguments with .scan_csv raise an exception:
# ComputeError: found more fields than defined in 'Schema'
# Consider setting 'truncate_ragged_lines=True'.
With truncate_ragged_lines=True the file is read, but we no longer get 130 columns.
pl.scan_csv(
    "PUBLIC_DAILY_202401270000_20240128040505.CSV",
    skip_rows=1,
    schema=dict.fromkeys([f"column_{n}" for n in range(1, 131)], pl.String),
    has_header=False,
    truncate_ragged_lines=True,
).collect()
# shape: (424_519, 25)  # <- ERROR: only 25 columns
# ...
(With read_csv(..., truncate_ragged_lines=True) we still get 130 columns.)
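Presumably the 25 comes from schema inference on the leading rows of the file. A quick way to check is to see what width scan_csv infers on its own (a hedged sketch against the same file; if it prints 25, that would explain the truncated result):

# Without a schema, scan_csv infers the width from the start of the file.
inferred = pl.scan_csv(
    "PUBLIC_DAILY_202401270000_20240128040505.CSV",
    skip_rows=1,
    has_header=False,
).schema
print(len(inferred))  # assumption: 25, matching the truncated frame above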
Have you tried setting the full schema? Setting only columns will set the names of the schema, but Polars will still determine the schema itself, which in this case will likely be done based on the header. We accept a schema argument that lets you completely override the schema.
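For comparison, a minimal sketch of the names-only vs. full-override distinction (assuming "setting only columns" refers to read_csv's new_columns parameter):

import io

import polars as pl

raw = b"1,2\n3,4"

# new_columns only renames; dtypes are still inferred from the data.
pl.read_csv(io.BytesIO(raw), has_header=False, new_columns=["x", "y"]).dtypes
# [Int64, Int64]

# schema= overrides names *and* dtypes.
pl.read_csv(io.BytesIO(raw), has_header=False, schema={"x": pl.String, "y": pl.String}).dtypes
# [String, String]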
That does not seem to be working; it still ignores the schema and still provides a header (column_1, column_2, etc.) based on the number of columns in row 1.
I suppose the data is not actually needed.
Simpler repro:
import polars as pl
import tempfile

with tempfile.NamedTemporaryFile() as f:
    f.write(b"""
A,B,C
1,2,3
4,5,6,7,8
9,10,11
""".strip())
    f.seek(0)

    df = pl.read_csv(f.name, schema=dict.fromkeys("ABCDE", pl.String), truncate_ragged_lines=True)
# shape: (3, 5)
# ┌─────┬─────┬─────┬──────┬──────┐
# │ A ┆ B ┆ C ┆ D ┆ E │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ str ┆ str ┆ str ┆ str ┆ str │
# ╞═════╪═════╪═════╪══════╪══════╡
# │ 1 ┆ 2 ┆ 3 ┆ null ┆ null │
# │ 4 ┆ 5 ┆ 6 ┆ 7 ┆ 8 │
# │ 9 ┆ 10 ┆ 11 ┆ null ┆ null │
# └─────┴─────┴─────┴──────┴──────┘
    lf = pl.scan_csv(f.name, schema=dict.fromkeys("ABCDE", pl.String), truncate_ragged_lines=True).collect()
# shape: (3, 3)
# ┌─────┬─────┬─────┐
# │ A ┆ B ┆ C │
# │ --- ┆ --- ┆ --- │
# │ str ┆ str ┆ str │
# ╞═════╪═════╪═════╡
# │ 1 ┆ 2 ┆ 3 │
# │ 4 ┆ 5 ┆ 6 │
# │ 9 ┆ 10 ┆ 11 │
# └─────┴─────┴─────┘
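Until scan_csv honors schema= the same way, one hedged workaround (assuming the data fits in memory) is to read eagerly and only then go lazy:

    # Workaround sketch: the eager read applies the full 5-column schema,
    # and .lazy() converts the result for further lazy operations.
    lf = pl.read_csv(
        f.name,
        schema=dict.fromkeys("ABCDE", pl.String),
        truncate_ragged_lines=True,
    ).lazy()
    print(lf.collect().shape)  # (3, 5)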
The documentation for scan_csv does seem to suggest that the schema param should indeed be used for the schema rather than the dtypes: https://docs.pola.rs/py-polars/html/reference/api/polars.scan_csv.html
Any updates on this?
It looks like @filabrazilska did file a PR to address this: #15305. But it hasn't been reviewed yet.

(PRs can be linked to issues with keywords: https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword)
Thanks for the report and thank you for the minimal repro @cmdlineluser. Taking a look.
@ritchie46 there is a regression with the latest update:

PanicException                            Traceback (most recent call last)
<timed exec> in <module>

<ipython-input-11-3bc5859e5af2> in polars_clean_csv(x)
     33 z = transform.with_columns(pl.col("SETTLEMENTDATE").str.to_datetime())
     34 columns = list(set(transform.columns) - {'SETTLEMENTDATE','DUID','UNIT'})
---> 35 final = z.with_columns(pl.col(columns).cast(pl.Float64), YEAR=pl.col("SETTLEMENTDATE").dt.iso_year()).collect()
     36 return final.to_arrow()

/usr/local/lib/python3.10/dist-packages/polars/lazyframe/frame.py in collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, no_optimization, streaming, background, _eager, **_kwargs)
   1815 callback = _kwargs.get("post_opt_callback")
   1816
-> 1817 return wrap_df(ldf.collect(callback))
   1818
   1819 @overload

PanicException: called Option::unwrap() on a None value
Thanks @djouallah - it seems that was a different issue.
If you can make minimal test cases, it makes it easier for the devs to fix.
As an example, I made a minimal repro for you in #16437.
It has just been fixed and will be part of 0.20.29, which should be released soon.
I am seeing similar behavior but for ndjson. Issue linked here: #18244