
Comments (17)

svaningelgem avatar svaningelgem commented on September 24, 2024 2

Seems I'm onto something. In Windows there is a setting (msconfig > boot > advanced > #cores), which I changed to 6 (i.e. 3 cores, 6 logical processors).

...And... It worked fine:

>>> import polars as pl
>>> pl.read_csv('dummy.csv').shape
(2016, 2)

So polars can't handle a high core count?

edit: After reverting the core count from 6 back to unlimited (i.e. 32), it started to fail again.
[screenshot of the failing error output]

Seems bigger is not always better 🀣

from polars.

taki-mekhalfa avatar taki-mekhalfa commented on September 24, 2024 1

I tried polars 0.20.6 on two ubuntu VMs:

  • 12 cores => Works fine
  • 22 cores => Does not work, I have a similar error
>>> pl.scan_csv('dummy.csv').collect()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/meta2002/.local/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1940, in collect
    return wrap_df(ldf.collect())
polars.exceptions.ComputeError: could not parse `Behold Exhibit A:` as dtype `i64` at column 'index' (column number 1)

The current offset in the file is 47165 bytes.

You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `dtypes` argument
- setting `ignore_errors` to `True`,
- adding `Behold Exhibit A:` to the `null_values` list.

Original error: ```remaining bytes non-empty```


taki-mekhalfa avatar taki-mekhalfa commented on September 24, 2024 1

After some debugging:

The file is split into multiple chunks that are processed in parallel; the number of chunks depends on the number of cores.

The code that determines the chunks tries to start each chunk at a valid record boundary, but it is flawed when a quoted field spans multiple lines: it produces incorrect chunks that start in the middle of quoted text. That's why, when we start parsing such a chunk, we stumble upon non-numerical text and polars raises the could not parse ... as dtype i64 error.

The more cores you have, the more probable it is to get wrong chunks, and you can easily construct examples that crash even with few cores.
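The ambiguity described above can be reproduced with the standard library alone. This is a toy sketch (the file contents are made up for illustration, not taken from dummy.csv): a quoted field whose contents themselves look like valid two-column records, so a chunker that starts at an arbitrary newline has no local way to know it is inside a quoted field.

```python
import csv
import io

# A quoted field whose *contents* look like valid two-column records.
data = '''index,string
1,"hello
2,world
3,goodbye"
4,"plain"
'''

# A full single-threaded parse sees 3 records: the header and 2 rows,
# because the embedded newlines are inside a quoted field.
rows = list(csv.reader(io.StringIO(data)))

# A naive chunker that starts a chunk at any newline might begin at
# '2,world', which *also* parses as plausible CSV records -- the local
# bytes give no hint that we are in the middle of a quoted field.
chunk = data[data.index('2,world'):]
chunk_rows = list(csv.reader(io.StringIO(chunk)))
```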

if you do

>>> pl.read_csv('dummy.csv', dtypes={'index': pl.String, 'string': pl.String}, ignore_errors=True)

You will have:

polars.exceptions.ComputeError: found more fields than defined in 'Schema'

which confirms the chunks are incorrectly determined.


Julian-J-S avatar Julian-J-S commented on September 24, 2024

hmm cannot reproduce on macos or windows using 0.20.5 πŸ€”

Works fine for me. Not sure what the problem is, or if this is a troll because of the dataset 🀣

Does it work if you delete the specific entry?


svaningelgem avatar svaningelgem commented on September 24, 2024

That is very bizarre! I can consistently reproduce it (Windows + the versions above).

And no, no trolling ;-) (loved the joke).

I tried now:

pl.from_dataframe(pd.read_csv('dummy.csv')).write_csv('dummy2.csv')
pl.read_csv('dummy2.csv')

Same error... (but the file is smaller, so I assume a difference in \r\n vs \n)

I removed index 164 (the record where it started to error), and I could read the file without an issue.
So the next thing I did was remove everything but 164: it read in fine too.
Then I removed everything after 164: read in fine.
Restoring the whole file: error again.

That is the weird thing about isolating this particular error: it proves difficult. Plus, if you cannot replicate it, that's doubly weird.

I'm running polars from a notebook (installed via pip), but as far as I know the library, that shouldn't matter?


svaningelgem avatar svaningelgem commented on September 24, 2024

OK, so I tested now (on exactly the same file):

  • Ubuntu (WSL) => fails (different environment)
  • Windows CLI => fails (same environment)
  • Windows notebook => fails (same environment)

Always on the same line.

On Ubuntu (native Linux) it works:

mamba create -n polars python=3.10 polars
mamba activate polars
python
>>> import polars as pl
>>> pl.read_csv("dummy.csv")
shape: (2_016, 2)

But the same procedure (clean polars environment + executing the commands) on my Windows system leads to exactly the same failure as before. So I can make it fail consistently, even in a clean environment.
And I have now tried the same procedure on my Windows WSL system: also a failure...

I also tried copying the file onto another drive, in case it was an issue with faulty sectors on my HDD. Next thing: a reboot. Nope, same issue after rebooting, so a memory corruption issue is unlikely.

Any ideas what I can try next?


taki-mekhalfa avatar taki-mekhalfa commented on September 24, 2024

For info, I cannot reproduce on Mac M1 (polars 0.20.5) either:

>>> pl.read_csv('dummy.csv').min()
shape: (1, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ index ┆ string                            β”‚
β”‚ ---   ┆ ---                               β”‚
β”‚ i64   ┆ str                               β”‚
β•žβ•β•β•β•β•β•β•β•ͺ═══════════════════════════════════║
β”‚ 0     ┆ " Inability to cohabit with othe… β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜


svaningelgem avatar svaningelgem commented on September 24, 2024

That is so weird... What more can I try to investigate this issue? Because I can replicate it consistently... so it's likely something environment-related?


ritchie46 avatar ritchie46 commented on September 24, 2024

Does setting the number of threads influence this? POLARS_MAX_THREADS=2 to 16?
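For reference, the thread count can be capped per invocation by setting the env var before Python starts (assuming the dummy.csv from this thread, an installed polars, and a Unix-like shell):

```shell
# Limit polars to 2 worker threads for this run only.
# POLARS_MAX_THREADS must be set before `import polars`.
POLARS_MAX_THREADS=2 python -c "import polars as pl; print(pl.read_csv('dummy.csv').shape)"
```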


svaningelgem avatar svaningelgem commented on September 24, 2024

Hi @ritchie46 , it still failed with the same error, but on a different line:

polars.exceptions.ComputeError: could not parse `I'm starting to think the real solution is a tax on idiots. Undoubtedly within the Liberal party and the voters that elected these clowns there would be enough collected to pay off the entire national debt."` as dtype `i64` at column 'index' (column number 1)

When unsetting this env var, I get the original error message:

polars.exceptions.ComputeError: could not parse `Fools!"` as dtype `i64` at column 'index' (column number 1)

What else could I try? I'm starting to suspect something processor/thread related, because it works for most people and I recently upgraded to a 16-core (32 logical processors) machine.


Wainberg avatar Wainberg commented on September 24, 2024

Seems like the error is related to having a quoted cell with newlines in it:

[screenshot showing a quoted cell that contains newlines]


svaningelgem avatar svaningelgem commented on September 24, 2024

Hi @Wainberg , I would agree, were it not that the very first line (record 0) already has a newline in it, as do records 2, 6 and 7... Why would it only start to fail at record 164 then? And why does it not fail with 12 cores, but fail with 22 cores (as @taki-mekhalfa determined)?

I'm more inclined to suspect some kind of CPU optimization gone awry?


itamarst avatar itamarst commented on September 24, 2024

Consider for example this CSV that only has 4 records:

"col1","col2"
1,a
2,b
3,2
4,"
5,1
6,1
7,1
8,1
"
>>> import pandas as pd
>>> pd.read_csv("test.csv")
   col1                    col2
0     1                       a
1     2                       b
2     3                       2
3     4  \n5,1\n6,1\n7,1\n8,1\n

A record can contain an arbitrarily long string that spans an arbitrary number of lines and looks like valid CSV records. Which means any heuristic that relies on local inspection can be wrong in some edge cases.

Thus, if records can span multiple lines, I am fairly certain this is the only correct general approach:

  1. Do a first-pass, single-core scan of the CSV. The output is an ordered mapping (could be an array) of record number to starting line and length. E.g. assuming "record 0 starts at line 1 and is only 1 line, record 1 is line 2 and is only 1 line, record 2 is line 3 and is 2 lines, record 3 is line 5 and is 1 line".
  2. The chunking code relies on that mapping.
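The first-pass scan above can be sketched in a few lines. This is a toy sketch, not polars code: it assumes `"`-quoted fields and a trailing newline, and it tracks the quote state character by character so a newline only ends a record when we are outside quotes.

```python
def index_records(text: str, quote: str = '"') -> list[tuple[int, int]]:
    """First pass: map each record (in order) to (start_line, n_lines).

    Assumes the text ends with a newline, for brevity.
    """
    index = []
    start_line = line = 1
    in_quotes = False
    for ch in text:
        if ch == quote:
            # Doubled quotes ("") toggle twice, leaving the state
            # unchanged, so escaped quotes are handled for free.
            in_quotes = not in_quotes
        elif ch == '\n':
            line += 1
            if not in_quotes:
                # Newline outside quotes: the record ends here.
                index.append((start_line, line - start_line))
                start_line = line
    return index
```

On the 10-line example file above (header plus records 1-4, where record 4 spans lines 5-10) this returns [(1, 1), (2, 1), (3, 1), (4, 1), (5, 6)], which a chunker could then split on without ever landing inside a quoted field.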


itamarst avatar itamarst commented on September 24, 2024

And I guess a key point here is that the first-pass scan can likely be done much faster than full CSV parsing. Mostly you just need to figure out whether the number of unescaped quotation marks on each line is even or odd; if it's odd, you assume a line continuation.
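The parity idea can be sketched line by line. Another toy sketch (not polars code): it counts `"` characters per physical line, so doubled escape quotes ("") cancel out, and a record only closes on a line where we end up outside quotes again.

```python
def logical_records(lines: list[str]) -> list[str]:
    # Group physical lines into logical CSV records using quote parity:
    # an odd quote count on a line flips the "inside a quoted field"
    # state; a record ends only on a line where we are outside quotes.
    records, current, in_quotes = [], [], False
    for line in lines:
        current.append(line)
        if line.count('"') % 2 == 1:
            in_quotes = not in_quotes
        if not in_quotes:
            records.append('\n'.join(current))
            current = []
    return records
```

Applied to the 10-line example file above, this yields 5 logical records: the header, rows 1-3, and one record spanning the six physical lines from `4,"` through the closing `"`.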


taki-mekhalfa avatar taki-mekhalfa commented on September 24, 2024

And I guess a key point here is that the first pass scan can likely be done much faster than full CSV parsing. Mostly you just need to figure out if the number of unescaped quotation marks on each line is even or odd. If it's odd you assume a line continuation.

It seems like chunking with quotations might be tricky, but scanning the entire file should be the worst case.


ritchie46 avatar ritchie46 commented on September 24, 2024

Yes, this is tricky. We search for the right number of fields and try to ensure we are not inside an embedded quoted field. We currently require 3 consecutive matching lines before we accept a split point. I will see if tuning this helps.


ritchie46 avatar ritchie46 commented on September 24, 2024

OK, looking at this file: it is impossible to find a good split position in the middle. This should be read single-threaded.

Do a first-pass, single-core scan of the CSV. The output is an ordered mapping (could be an array) of record number to starting line and length. E.g. assuming "record 0 starts at line 1 and is only 1 line, record 1 is line 2 and is only 1 line, record 2 is line 3 and is 2 lines, record 3 is line 5 and is 1 line".

This would be very expensive, as you would need to read the whole file from disk. I think we can do some inspection during schema inference and, for those edge cases, fall back to single-core scanning.

I made a PR that sniffs during schema inference, and if we find too many newlines in escaped fields we fall back to single-threaded reading.

