Comments (17)
Seems I'm onto something. In Windows there is a setting (msconfig > Boot > Advanced > number of processors), which I changed to 6 (i.e. 3 cores, 6 logical processors).
...And... It worked fine:
>>> import polars as pl
>>> pl.read_csv('dummy.csv').shape
(2016, 2)
So polars can't handle a high core count?
Edit: after reverting the core count from 6 back to unlimited (i.e. 32), it started to fail again.
Seems bigger is not always better 🤣
I tried polars 0.20.6 on two Ubuntu VMs:
- 12 cores => works fine
- 22 cores => does not work; I get a similar error:
>>> pl.scan_csv('dummy.csv').collect()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/meta2002/.local/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1940, in collect
return wrap_df(ldf.collect())
polars.exceptions.ComputeError: could not parse `Behold Exhibit A:` as dtype `i64` at column 'index' (column number 1)
The current offset in the file is 47165 bytes.
You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `Behold Exhibit A:` to the `null_values` list.
Original error: ```remaining bytes non-empty```
After some debugging:
The file is split into multiple chunks that are processed in parallel; the number of chunks depends on the number of cores.
The code that determines the chunks tries to start each chunk at a valid start of a record, but this is flawed when a quoted field spans multiple lines: it produces incorrect chunks that start in the middle of some text. When we start parsing such a chunk we stumble upon non-numerical text, and polars raises the `could not parse ... as dtype i64` error.
The more cores you have, the more likely it is to get wrong chunks, and you can easily construct examples that crash even with only a few cores.
If you do
>>> pl.read_csv('dummy.csv', dtypes={'index': pl.String, 'string': pl.String}, ignore_errors=True)
you will get:
polars.exceptions.ComputeError: found more fields than defined in 'Schema'
which confirms chunks are incorrectly determined.
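To make the failure mode concrete, here is a toy pure-Python sketch (not the actual polars Rust code; the file contents and the helper function are made up for illustration) of how a chunker that scans for "a line with the expected number of fields" can land inside a quoted multi-line cell:

```python
# Toy illustration of the chunking failure mode, not the actual polars code.
csv_bytes = (
    b'index,string\n'
    b'0,"a perfectly normal row"\n'
    b'1,"a quoted cell that spans several lines,\n'
    b'Behold Exhibit A, which still has one comma\n'
    b'and even ends with a quote"\n'
    b'2,"another normal row"\n'
)

def naive_chunk_start(data: bytes, offset: int, n_fields: int) -> int:
    """Return the first position >= `offset` that *looks* like a record start:
    the beginning of a line whose comma count matches the schema."""
    pos = data.find(b"\n", offset) + 1            # jump to the next line start
    while pos < len(data):
        end = data.find(b"\n", pos)
        end = len(data) if end == -1 else end
        if data[pos:end].count(b",") == n_fields - 1:  # "right" number of fields...
            return pos                             # ...but we may be inside quotes
        pos = end + 1
    return len(data)

# Ask for a chunk starting roughly a third of the way into the file:
start = naive_chunk_start(csv_bytes, len(csv_bytes) // 3, n_fields=2)
print(csv_bytes[start:start + 20])   # b'Behold Exhibit A, wh' -> inside the quoted cell
# Parsing a chunk that starts here fails immediately, because `Behold Exhibit A`
# cannot be parsed as an i64 `index` value, which is the kind of error seen above.
```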
Hmm, cannot reproduce on macOS or Windows using 0.20.5 🤔
Works fine for me. Not sure what the problem is, or if this is a troll because of the dataset 🤣
Does it work if you delete the specific entry?
That is very bizarre! I can consistently reproduce it (Windows + versions above).
And no, no trolling ;-) (loved the joke).
I tried now:
pl.from_dataframe(pd.read_csv('dummy.csv')).write_csv('dummy2.csv')
pl.read_csv('dummy2.csv')
Same error... (but the file is smaller, so I assume a difference in \r\n vs \n)
I removed index 164 (which contained the place where it started to error), and I could read the file without an issue.
Next I removed everything but 164: read in fine too.
Then I removed everything after 164: read in fine.
Restored the whole file: error.
That is what is weird about isolating this particular error: it proves difficult. Plus, if you cannot replicate it, that's doubly weird.
I'm running polars from a notebook (installed via pip), but the way I know the library, that shouldn't matter?
OK, so I tested now (on exactly the same file):
- Ubuntu (WSL) -- fails (different environment)
- Windows CLI -- fails (same environment)
- Windows notebook -- fails (same environment)
Always on the same line.
On Ubuntu (native Linux) it works:
mamba create -n polars python=3.10 polars
mamba activate polars
python
>>> import polars as pl
>>> pl.read_csv("dummy.csv")
shape: (2_016, 2)
But the same procedure (clean polars environment + executing the commands) on my Windows system leads to exactly the same failure as before. So I can consistently make it fail, even in a clean environment.
And I have now tried the same procedure on my Windows WSL system: also a failure...
I also tried copying the file onto another drive, just in case it might be an issue with faulty sectors on my HDD. Next thing: a reboot. Nope, same issue after a reboot, so memory corruption is unlikely.
Any ideas what I can try next?
For info, I cannot reproduce on Mac M1 (polars 0.20.5) either:
>>> pl.read_csv('dummy.csv').min()
shape: (1, 2)
┌───────┬───────────────────────────────────┐
│ index ┆ string                            │
│ ---   ┆ ---                               │
│ i64   ┆ str                               │
╞═══════╪═══════════════════════════════════╡
│ 0     ┆ " Inability to cohabit with othe… │
└───────┴───────────────────────────────────┘
That is so weird... What more can I try to investigate this issue? Because I can replicate it consistently... So it's likely something environment-related?
Does setting the number of threads influence this? E.g. POLARS_MAX_THREADS=2 up to 16?
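For reference, something along these lines pins the pool size; as far as I know the variable has to be set before polars is first imported, otherwise it has no effect:

```python
import os

# POLARS_MAX_THREADS is read when polars initialises its thread pool,
# so it must be set before the first `import polars`.
os.environ["POLARS_MAX_THREADS"] = "2"

import polars as pl

print(pl.threadpool_size())            # should report 2 (newer versions: pl.thread_pool_size())
print(pl.read_csv("dummy.csv").shape)  # does the error still occur with 2 threads?
```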
Hi @ritchie46, it still failed with the same error, but on a different line:
polars.exceptions.ComputeError: could not parse `I'm starting to think the real solution is a tax on idiots. Undoubtedly within the Liberal party and the voters that elected these clowns there would be enough collected to pay off the entire national debt."` as dtype `i64` at column 'index' (column number 1)
When unsetting this envvar, I get the original error message:
polars.exceptions.ComputeError: could not parse `Fools!"` as dtype `i64` at column 'index' (column number 1)
What else could I try? I'm starting to suspect something processor/thread-related, because it works for most people and I recently upgraded to a 16-core (32 logical processors) machine.
Seems like the error is related to having a quoted cell with newlines in it.
Hi @Wainberg, I would agree, except that the very first record (record 0) already has a newline in it, and so do records 2, 6 and 7... Why would it only start to fail at record 164 then? And why does it work with 12 cores but fail with 22 cores (as @taki-mekhalfa determined)?
I'm leaning more towards some kind of CPU optimization gone awry?
Consider for example this CSV that only has 4 records:
"col1","col2"
1,a
2,b
3,2
4,"
5,1
6,1
7,1
8,1
"
>>> import pandas as pd
>>> pd.read_csv("test.csv")
col1 col2
0 1 a
1 2 b
2 3 2
3 4 \n5,1\n6,1\n7,1\n8,1\n
A record can contain an arbitrarily long string that spans an arbitrary number of lines and looks like a run of valid CSV records, which means any heuristic that relies on local inspection can be wrong in some edge cases.
Thus if records can span multiple lines, I am fairly certain this is the only correct general approach:
- Do a first pass, single-core scan of the CSV. The output is an ordered mapping (could be an array) of record number to starting line and length. E.g. assuming "record 0 starts at line 1 and is only 1 line, record 1 is line 2 and is only 1 line, record 2 is line 3 and is 2 lines, record 3 is line 5 and is 1 line".
- The chunking code then relies on that mapping (rough sketch below).
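Roughly what I have in mind, as a pure-Python sketch (the real reader is in Rust; this just leans on the standard csv module, which already handles quoted newlines):

```python
import csv

def build_record_index(path):
    """First pass: map record number -> (starting line, number of physical lines).
    reader.line_num counts physical lines consumed so far, so the difference
    before and after a row tells us how many lines that record spans."""
    index = []
    with open(path, newline="") as f:
        reader = csv.reader(f)
        prev = 0
        for rec_no, _row in enumerate(reader):
            index.append((rec_no, prev + 1, reader.line_num - prev))
            prev = reader.line_num
    return index

# For the 4-record test.csv example above (header counted as record 0):
# [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 4, 1), (4, 5, 6)]
```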
And I guess a key point here is that the first pass scan can likely be done much faster than full CSV parsing. Mostly you just need to figure out if the number of unescaped quotation marks on each line is even or odd. If it's odd you assume a line continuation.
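A sketch of that per-line check, assuming quotes are escaped by doubling ("") as in standard CSV rather than with backslashes:

```python
def starts_new_record(lines):
    """Yield True for each physical line that begins a new record and False for
    a continuation, by tracking whether a quoted field is still open. Doubled
    quotes ("") contribute an even count and cancel out, so only an odd number
    of quote characters on a line flips the state."""
    in_quotes = False
    for line in lines:
        yield not in_quotes
        if line.count('"') % 2 == 1:
            in_quotes = not in_quotes

# For the 10-line test.csv example above this prints:
# [True, True, True, True, True, False, False, False, False, False]
with open("test.csv", newline="") as f:
    print(list(starts_new_record(f)))
```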
> And I guess a key point here is that the first pass scan can likely be done much faster than full CSV parsing. Mostly you just need to figure out if the number of unescaped quotation marks on each line is even or odd. If it's odd you assume a line continuation.
It seems like chunking with quotations might be tricky, but scanning the entire file should be the worst case.
Yes, this is tricky. We search for the right number of fields and try to ensure we are not inside an embedded field. We currently require 3 corresponding lines before we accept a position. I will see if tuning this helps.
OK, looking at this file: it is impossible to find a good position in the middle. This should be read single-threaded.
> Do a first pass, single-core scan of the CSV. The output is an ordered mapping (could be an array) of record number to starting line and length. E.g. assuming "record 0 starts at line 1 and is only 1 line, record 1 is line 2 and is only 1 line, record 2 is line 3 and is 2 lines, record 3 is line 5 and is 1 line".
This would be very expensive, as you would need to read the whole file from disk. I think we can do some inspection during schema inference and for those edge cases fall back to single-core scanning.
I made a PR that sniffs during schema inference, and if we find too many newlines in escaped fields we fall back to single-threaded reading.
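Not the actual PR, but a rough Python rendering of the idea, with a made-up function name and threshold: while sniffing a sample for schema inference, count how many records have a newline inside a quoted field, and above some threshold give up on chunked parallel reading.

```python
import csv
import io

def needs_single_threaded_read(sample: str, max_ratio: float = 0.05) -> bool:
    """During schema inference over a text sample, count records that contain a
    newline inside a quoted field; if their share is too high, fall back to a
    single-threaded read instead of splitting the file into chunks."""
    embedded = total = 0
    for row in csv.reader(io.StringIO(sample)):
        total += 1
        embedded += any("\n" in field for field in row)
    return total > 0 and embedded / total > max_ratio
```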