Comments (17)
Seems I'm onto something. In Windows there is a setting (msconfig > Boot > Advanced > number of processors), which I changed to 6 (i.e. 3 cores, 6 logical processors).
...And... It worked fine:
>>> import polars as pl
>>> pl.read_csv('dummy.csv').shape
(2016, 2)
So polars can't handle a high core count?
Edit: after reverting the core count from 6 back to unlimited (i.e. 32), it started to fail again.
Seems bigger is not always better 🤣
I tried polars 0.20.6 on two Ubuntu VMs:
- 12 cores => works fine
- 22 cores => does not work; I get a similar error:
>>> pl.scan_csv('dummy.csv').collect()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/meta2002/.local/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1940, in collect
return wrap_df(ldf.collect())
polars.exceptions.ComputeError: could not parse `Behold Exhibit A:` as dtype `i64` at column 'index' (column number 1)
The current offset in the file is 47165 bytes.
You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `Behold Exhibit A:` to the `null_values` list.
Original error: ```remaining bytes non-empty```
After some debugging:
The file is split into multiple chunks that are processed in parallel; the number of chunks depends on the number of cores.
The code that determines the chunks tries to start each chunk at a valid start of a record, but this is flawed when a quoted field spans multiple lines: it produces incorrect chunks that start in the middle of some text. When we start parsing such a chunk we stumble upon non-numerical text, and polars raises the `could not parse ... as dtype i64` error.
The more cores you have, the more likely it is to get wrong chunks, and you can easily construct examples that crash even with only a few cores.
If you do
>>> pl.read_csv('dummy.csv', dtypes={'index': pl.String, 'string': pl.String}, ignore_errors=True)
you will get:
polars.exceptions.ComputeError: found more fields than defined in 'Schema'
which confirms chunks are incorrectly determined.
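To make the failure mode concrete, here is a toy pure-Python sketch (not the actual polars Rust code; the file contents and the helper function are made up for illustration) of how a chunker that scans for "a line with the expected number of fields" can land inside a quoted multi-line cell:

```python
# Toy illustration of the chunking failure mode, not the actual polars code.
csv_bytes = (
    b'index,string\n'
    b'0,"a perfectly normal row"\n'
    b'1,"a quoted cell that spans several lines,\n'
    b'Behold Exhibit A, which still has one comma\n'
    b'and even ends with a quote"\n'
    b'2,"another normal row"\n'
)

def naive_chunk_start(data: bytes, offset: int, n_fields: int) -> int:
    """Return the first position >= `offset` that *looks* like a record start:
    the beginning of a line whose comma count matches the schema."""
    pos = data.find(b"\n", offset) + 1            # jump to the next line start
    while pos < len(data):
        end = data.find(b"\n", pos)
        end = len(data) if end == -1 else end
        if data[pos:end].count(b",") == n_fields - 1:  # "right" number of fields...
            return pos                             # ...but we may be inside quotes
        pos = end + 1
    return len(data)

# Ask for a chunk starting roughly a third of the way into the file:
start = naive_chunk_start(csv_bytes, len(csv_bytes) // 3, n_fields=2)
print(csv_bytes[start:start + 20])   # b'Behold Exhibit A, wh' -> inside the quoted cell
# Parsing a chunk that starts here fails immediately, because `Behold Exhibit A`
# cannot be parsed as an i64 `index` value, which is the kind of error seen above.
```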
Hmm, cannot reproduce on macOS or Windows using 0.20.5 🤔
Works fine for me. Not sure what the problem is, or if this is a troll because of the dataset 🤣
Does it work if you delete the specific entry?
That is very bizarre! I can consistently reproduce it (Windows + versions above).
And no, no trolling ;-) (loved the joke).
I tried now:
pl.from_dataframe(pd.read_csv('dummy.csv')).write_csv('dummy2.csv')
pl.read_csv('dummy2.csv')
Same error... (but the file is smaller, so I assume a difference in \r\n vs \n)
I removed index 164 (which contained the place where it started to error), and I could read the file without an issue.
Next I removed everything but 164: read in fine too.
Then I removed everything after 164: read in fine.
Restored the whole file: error.
That is what is weird about isolating this particular error: it proves difficult. Plus, if you cannot replicate it, that's doubly weird.
I'm running polars from a notebook (installed via pip), but the way I know the library, that shouldn't matter?
OK, so I tested now (on exactly the same file):
- Ubuntu (WSL) -- fails (different environment)
- Windows CLI -- fails (same environment)
- Windows notebook -- fails (same environment)
Always on the same line.
On Ubuntu (native Linux) it works:
mamba create -n polars python=3.10 polars
mamba activate polars
python
>>> import polars as pl
>>> pl.read_csv("dummy.csv")
shape: (2_016, 2)
But the same procedure (clean polars environment + executing the commands) on my Windows system leads to exactly the same failure as before. So I can consistently make it fail, even in a clean environment.
And I have now tried the same procedure on my Windows WSL system: also a failure...
I also tried copying the file onto another drive, just in case it might be an issue with faulty sectors on my HDD. Next thing: a reboot. Nope, same issue after a reboot, so memory corruption is unlikely.
Any ideas what I can try next?
For info, I cannot reproduce on Mac M1 (polars 0.20.5) either:
>>> pl.read_csv('dummy.csv').min()
shape: (1, 2)
┌───────┬───────────────────────────────────┐
│ index ┆ string                            │
│ ---   ┆ ---                               │
│ i64   ┆ str                               │
╞═══════╪═══════════════════════════════════╡
│ 0     ┆ " Inability to cohabit with othe… │
└───────┴───────────────────────────────────┘
That is so weird... What more can I try to investigate this issue? Because I can replicate it consistently... So it's likely something environment-related?
Does setting the number of threads influence this? E.g. POLARS_MAX_THREADS=2 up to 16?
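For reference, something along these lines pins the pool size; as far as I know the variable has to be set before polars is first imported, otherwise it has no effect:

```python
import os

# POLARS_MAX_THREADS is read when polars initialises its thread pool,
# so it must be set before the first `import polars`.
os.environ["POLARS_MAX_THREADS"] = "2"

import polars as pl

print(pl.threadpool_size())            # should report 2 (newer versions: pl.thread_pool_size())
print(pl.read_csv("dummy.csv").shape)  # does the error still occur with 2 threads?
```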
Hi @ritchie46, it still failed with the same error, but on a different line:
polars.exceptions.ComputeError: could not parse `I'm starting to think the real solution is a tax on idiots. Undoubtedly within the Liberal party and the voters that elected these clowns there would be enough collected to pay off the entire national debt."` as dtype `i64` at column 'index' (column number 1)
When unsetting this envvar, I get the original error message:
polars.exceptions.ComputeError: could not parse `Fools!"` as dtype `i64` at column 'index' (column number 1)
What else could I try? I'm starting to suspect something processor/thread-related, because it works for most people and I recently upgraded to a 16-core (32 logical processors) machine.
Seems like the error is related to having a quoted cell with newlines in it.
Hi @Wainberg, I would agree, except that the very first record (record 0) already has a newline in it, and so do records 2, 6 and 7... Why would it only start to fail at record 164 then? And why does it work with 12 cores but fail with 22 cores (as @taki-mekhalfa determined)?
I'm leaning more towards some kind of CPU optimization gone awry?
Consider for example this CSV that only has 4 records:
"col1","col2"
1,a
2,b
3,2
4,"
5,1
6,1
7,1
8,1
"
>>> import pandas as pd
>>> pd.read_csv("test.csv")
col1 col2
0 1 a
1 2 b
2 3 2
3 4 \n5,1\n6,1\n7,1\n8,1\n
A record can contain an arbitrarily long string that spans an arbitrary number of lines and looks like a run of valid CSV records, which means any heuristic that relies on local inspection can be wrong in some edge cases.
Thus if records can span multiple lines, I am fairly certain this is the only correct general approach:
- Do a first pass, single-core scan of the CSV. The output is an ordered mapping (could be an array) of record number to starting line and length. E.g. assuming "record 0 starts at line 1 and is only 1 line, record 1 is line 2 and is only 1 line, record 2 is line 3 and is 2 lines, record 3 is line 5 and is 1 line".
- The chunking code then relies on that mapping (rough sketch below).
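Roughly what I have in mind, as a pure-Python sketch (the real reader is in Rust; this just leans on the standard csv module, which already handles quoted newlines):

```python
import csv

def build_record_index(path):
    """First pass: map record number -> (starting line, number of physical lines).
    reader.line_num counts physical lines consumed so far, so the difference
    before and after a row tells us how many lines that record spans."""
    index = []
    with open(path, newline="") as f:
        reader = csv.reader(f)
        prev = 0
        for rec_no, _row in enumerate(reader):
            index.append((rec_no, prev + 1, reader.line_num - prev))
            prev = reader.line_num
    return index

# For the 4-record test.csv example above (header counted as record 0):
# [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 4, 1), (4, 5, 6)]
```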
And I guess a key point here is that the first pass scan can likely be done much faster than full CSV parsing. Mostly you just need to figure out if the number of unescaped quotation marks on each line is even or odd. If it's odd you assume a line continuation.
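A sketch of that per-line check, assuming quotes are escaped by doubling ("") as in standard CSV rather than with backslashes:

```python
def starts_new_record(lines):
    """Yield True for each physical line that begins a new record and False for
    a continuation, by tracking whether a quoted field is still open. Doubled
    quotes ("") contribute an even count and cancel out, so only an odd number
    of quote characters on a line flips the state."""
    in_quotes = False
    for line in lines:
        yield not in_quotes
        if line.count('"') % 2 == 1:
            in_quotes = not in_quotes

# For the 10-line test.csv example above this prints:
# [True, True, True, True, True, False, False, False, False, False]
with open("test.csv", newline="") as f:
    print(list(starts_new_record(f)))
```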
> And I guess a key point here is that the first pass scan can likely be done much faster than full CSV parsing. Mostly you just need to figure out if the number of unescaped quotation marks on each line is even or odd. If it's odd you assume a line continuation.
It seems like chunking with quotations might be tricky, but scanning the entire file should be the worst case.
Yes, this is tricky. We search for the right number of fields and try to ensure we are not inside an embedded field. We currently require 3 corresponding lines before we accept a position. I will see if tuning this helps.
OK, looking at this file: it is impossible to find a good position in the middle. This should be read single-threaded.
> Do a first pass, single-core scan of the CSV. The output is an ordered mapping (could be an array) of record number to starting line and length. E.g. assuming "record 0 starts at line 1 and is only 1 line, record 1 is line 2 and is only 1 line, record 2 is line 3 and is 2 lines, record 3 is line 5 and is 1 line".
This would be very expensive, as you would need to read the whole file from disk. I think we can do some inspection during schema inference and for those edge cases fall back to single-core scanning.
I made a PR that sniffs during schema inference, and if we find too many newlines in escaped fields we fall back to single-threaded reading.
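Not the actual PR, but a rough Python rendering of the idea, with a made-up function name and threshold: while sniffing a sample for schema inference, count how many records have a newline inside a quoted field, and above some threshold give up on chunked parallel reading.

```python
import csv
import io

def needs_single_threaded_read(sample: str, max_ratio: float = 0.05) -> bool:
    """During schema inference over a text sample, count records that contain a
    newline inside a quoted field; if their share is too high, fall back to a
    single-threaded read instead of splitting the file into chunks."""
    embedded = total = 0
    for row in csv.reader(io.StringIO(sample)):
        total += 1
        embedded += any("\n" in field for field in row)
    return total > 0 and embedded / total > max_ratio
```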