Comments (6)
OK, I've found the final reason. This file is not a UTF-8 file; it is "MS932", "CP932", or "Windows-31J", often called "Shift-JIS". You can see that standard UTF-8 decoding in Python fails:
>>> b = b'\xe3\x82\xa8\xe3\x83\xad\xe4\xba\x8b\xe5\xb8\xab\xe3\x81\x9f\xe3\x81\xa1\xe3\x82\x88\xe3\x82\x8a \xe4\xba\xba\xe9\xa1\x9e\xe5\xad\xa6\xe5\x85\xa5\xe9\x96'
>>> b.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 37-38: unexpected end of data
But if we specify Shift-JIS, it works:
>>> b.decode("Shift-JIS")
'繧ィ繝ュ莠句クォ縺溘■繧医j 莠コ鬘槫ュヲ蜈・髢'
I'm not sure whether it's intended to be Unicode and simply has some bad encoding, or whether it was mis-encoded, but the TL;DR is that polars doesn't support this encoding if it is indeed Shift-JIS, sorry!
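As a cross-check on the two possibilities raised above, here is a minimal sketch (not part of the original comment; the variable names are illustrative) that splits the bytes at the position reported by the `UnicodeDecodeError`. Everything before that position decodes cleanly as UTF-8, and only the last two bytes are left over, which is consistent with a three-byte UTF-8 character cut off after two bytes rather than with a different encoding.

```python
# Bytes copied verbatim from the comment above.
b = (
    b'\xe3\x82\xa8\xe3\x83\xad\xe4\xba\x8b\xe5\xb8\xab\xe3\x81\x9f'
    b'\xe3\x81\xa1\xe3\x82\x88\xe3\x82\x8a \xe4\xba\xba\xe9\xa1\x9e'
    b'\xe5\xad\xa6\xe5\x85\xa5\xe9\x96'
)

try:
    b.decode("utf-8")
except UnicodeDecodeError as exc:
    # exc.start is the index of the first byte that could not be decoded.
    valid_prefix = b[:exc.start].decode("utf-8")
    dangling = b[exc.start:]
else:
    # Not expected for these bytes; kept so the names are always defined.
    valid_prefix, dangling = b.decode("utf-8"), b""

print(valid_prefix)  # エロ事師たちより 人類学入
print(dangling)      # b'\xe9\x96'
```

The readable Japanese prefix suggests the Shift-JIS result shown above is mojibake, and that the record itself is UTF-8 with an incomplete final character.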
from polars.
The relevant data can be found here. I am using the title.akas.tsv.gz data.
from polars.
As an FYI, this has nothing to do with unidecode. Performing just a title_akas_df.get_column("title").to_list() can reproduce the issue. Could you please create that as a minimal example instead?
from polars.
The issue occurs because of line 467,864. If we load into df then we can see:
>>> import polars as pl
>>> df = pl.read_csv(
...     "title.akas.tsv",
...     separator="\t",
...     dtypes={"title": str},
...     ignore_errors=True,
... )
>>> df["title"][467862]
'人類學入門'
>>> df["title"][467863]
pyo3_runtime.PanicException: Python API call failed
The issue is that the string cannot be rendered, and has nothing to do with map_elements or lists.
If we cast to binary, we get:
>>> df["title"].cast(pl.Binary)[467863]
b'\xe3\x82\xa8\xe3\x83\xad\xe4\xba\x8b\xe5\xb8\xab\xe3\x81\x9f\xe3\x81\xa1\xe3\x82\x88\xe3\x82\x8a \xe4\xba\xba\xe9\xa1\x9e\xe5\xad\xa6\xe5\x85\xa5\xe9\x96'
Perhaps someone with Unicode experience can decipher this. If we try to grab just that one row with a selection, polars panics and Python is killed:
>>> s = df.select(pl.col("title").filter(pl.int_range(pl.len()) == 467863))
thread '<unnamed>' panicked at library/core/src/panicking.rs:155:5:
unsafe precondition(s) violated: hint::unreachable_unchecked must never be reached
thread caused non-unwinding panic. aborting.
Aborted
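Since reading the affected value through polars can abort the interpreter, one way to locate such records without polars is to scan the raw file in binary mode. The following is a rough sketch under that assumption; `find_invalid_utf8_lines` is a hypothetical helper, not a polars API, and the line numbers it reports are 1-based and include the header line.

```python
from typing import Iterator, Tuple


def find_invalid_utf8_lines(path: str) -> Iterator[Tuple[int, bytes]]:
    """Yield (line_number, raw_bytes) for every line that is not valid UTF-8."""
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError:
                yield lineno, raw
```

For the uncompressed file discussed in this thread, usage would look like `for lineno, raw in find_invalid_utf8_lines("title.akas.tsv"): print(lineno, raw)`.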
from polars.
FYI the space in the byte array seen above is \x20. There is some badly formed Unicode in here.
from polars.
Also, it's not your fault: the IMDb page specifically cites UTF-8 as the encoding, but they are just wrong. I'd maybe contact them if you really need this dataset, or you can create a workaround by re-exporting the file with that one record deleted.
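The re-export workaround could be sketched roughly as follows, assuming the file has already been decompressed and that dropping whole records is acceptable. `drop_invalid_utf8_lines` is a hypothetical helper written for illustration, using only the standard library; it copies the file line by line and skips any line that fails UTF-8 decoding.

```python
def drop_invalid_utf8_lines(src: str, dst: str) -> int:
    """Copy src to dst, skipping lines that are not valid UTF-8.

    Returns the number of lines skipped.
    """
    skipped = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for raw in fin:
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError:
                # Drop the malformed record instead of passing it on.
                skipped += 1
                continue
            fout.write(raw)
    return skipped
```

The cleaned copy can then be passed to `pl.read_csv` in place of the original file. Working on raw bytes keeps the valid lines byte-for-byte identical to the source.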
from polars.