Comments (7)
@nameexhaustion Sorry, I just read ritchie's comment and closed based on that. We do need to ensure we only generate valid UTF-8.
from polars.
@pj-ml Perhaps, but that is not a P-high
from polars.
I believe that's not a properly encoded CSV file. Quotes must be escaped with quotes. Polars follows https://www.ietf.org/rfc/rfc4180.txt. Anything that doesn't follow that spec, we don't read. The error message should be better, though.
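To illustrate the quoting rule mentioned above, here is a minimal sketch using Python's standard csv module rather than polars. Its default dialect doubles embedded quotes and wraps the field in quotes. The field value is made up for the example.

```python
import csv
import io

# Write a header and one value that contains double quotes.
buffer = io.StringIO()
writer = csv.writer(buffer)  # default "excel" dialect, doublequote=True
writer.writerow(["col1"])
writer.writerow(['say "hi"'])  # hypothetical value with embedded quotes

encoded = buffer.getvalue()
print(repr(encoded))

# Reading the text back recovers the original value.
rows = list(csv.reader(io.StringIO(encoded)))
print(rows)
```

The embedded quotes are written as a pair of quotes, and the whole field is enclosed in quotes, which is the form the spec describes.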
from polars.
f.write('col1\n""• ') with a space appended to the end parses fine, though.
from polars.
I read https://www.ietf.org/rfc/rfc4180.txt and I see that if double quotes are used inside the entry, the entry must be encapsulated by double quotes too.
I think we can close this out then. I will just have to use use_pyarrow=True, since I receive files that do not follow the standard used by polars.
from polars.
@orlp There could actually be an issue here: when using read_csv, polars will create a dataframe containing invalid UTF-8 instead of failing immediately. See the example here:
import polars as pl
df = pl.read_csv('col1\n""•'.encode()) # this loads a dataframe without complaining
print(df.select(pl.all().cast(pl.Binary)))
# ┌─────────────┐
# │ col1 │
# │ --- │
# │ binary │
# ╞═════════════╡
# │ b"\xe2\x80" │
# └─────────────┘
# Cast to Binary and back to String to run UTF-8 validation, which fails:
print(df.select(pl.all().cast(pl.Binary).cast(pl.String)))
# polars.exceptions.ComputeError: invalid utf8
print(df) # printing will cause a panic due to invalid utf-8
# thread '<unnamed>' panicked at crates/polars-core/src/fmt.rs:393:21:
# byte index 34 is out of bounds of `�`
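For context on why those two bytes are invalid: they match the first two of the three bytes that encode • in UTF-8, so it looks as if the last byte was cut off, although the thread does not confirm the cause. A small sketch in plain Python, where `is_valid_utf8` is a hypothetical helper and not a polars function:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


bullet = "•".encode("utf-8")  # three bytes: e2 80 a2
truncated = bullet[:2]        # the two bytes shown in the Binary column above

print(bullet, is_valid_utf8(bullet))
print(truncated, is_valid_utf8(truncated))
```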
from polars.
And how about cases where it generates valid UTF-8, but the input file does not follow the standard? Shouldn't there at least be a warning in such a case?
See example (no errors and no warnings):
import polars as pl
df = pl.read_csv('col1\n""• '.encode())
print(df.select(pl.all().cast(pl.Binary).cast(pl.String)))
# shape: (1, 1)
# ┌──────┐
# │ col1 │
# │ --- │
# │ str │
# ╞══════╡
# │ • │
# └──────┘
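Until something like a warning exists, one possible pre-check is to run the text through Python's standard csv reader with strict=True, which raises on this kind of quoting. This is only a rough sketch: strict mode is an approximation of the spec, not a full RFC 4180 validator, and `strict_csv_error` is a hypothetical helper name.

```python
import csv
import io
from typing import Optional


def strict_csv_error(text: str) -> Optional[str]:
    """Return the parser's error message, or None if strict parsing succeeds."""
    try:
        for _ in csv.reader(io.StringIO(text), strict=True):
            pass
    except csv.Error as exc:
        return str(exc)
    return None


bad = 'col1\n""• '   # same input as in the example above
good = 'col1\n"• "'  # the value fully enclosed in quotes

print(strict_csv_error(bad))   # an error message
print(strict_csv_error(good))  # None

# Without strict=True the stdlib reader also accepts the input silently.
lenient_rows = list(csv.reader(io.StringIO(bad)))
print(lenient_rows)
```

If the check returns a message, the file could be rejected or routed to a more lenient reader such as use_pyarrow=True.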
from polars.