Comments (4)
Thanks for the issue.
"I've seen some mention"
Could you link where this was mentioned? Generally, read_csv will infer the data types of the columns of a CSV: https://docs.rapids.ai/api/cudf/stable/user_guide/api_docs/api/cudf.read_csv/#cudf-read-csv
from cudf.
It seems this was a misunderstanding or miscommunication on the part of the person who mentioned it to me; I did not actually see it in the docs. It was cleared up a bit later.
That said, I was mostly interested in what libcudf would do if it received a file like this:
ID, Mixed_Data, Description
1, "John Doe", "Name"
2, 42, "Age"
3, "2024-01-11", "Date"
4, true, "Boolean Value"
5, 55000.75, "Salary"
The Mixed_Data column has different data types. Obviously, I would hope users aren't using these kinds of CSV files. I'm curious to know what libcudf's priorities would be here, because there may exist potential optimizations in the CUDA kernels for the CSV reader by transposing the rows to a columnar format, but this would only be doable if libcudf is not burdened with handling poorly formatted CSV files like this.
Hopefully that makes sense; let me know if it doesn't.
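To make "transposing the rows to a columnar format" concrete, here is a minimal CPU-side sketch using only the Python standard library. It is not libcudf code and says nothing about how the CUDA kernels are actually organized; the helper name rows_to_columns is hypothetical and exists only for illustration.

```python
import csv
import io

def rows_to_columns(text):
    """Parse CSV text and return {column_name: [raw string values]}.

    Hypothetical helper for illustration only; not part of cudf or libcudf.
    """
    # skipinitialspace=True drops the blank after each comma, so quoted
    # fields such as ' "John Doe"' are recognized as quoted.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    header = [name.strip() for name in next(reader)]
    columns = {name: [] for name in header}
    for row in reader:
        # Append each field of the row to the list for its column.
        for name, value in zip(header, row):
            columns[name].append(value)
    return columns

data = '''ID, Mixed_Data, Description
1, "John Doe", "Name"
2, 42, "Age"
3, "2024-01-11", "Date"
4, true, "Boolean Value"
5, 55000.75, "Salary"
'''

columns = rows_to_columns(data)
for name, values in columns.items():
    print(name, values)
```

Once the values are grouped per column, each column can be examined on its own. That is the point where a mixed column such as Mixed_Data becomes a problem, because its values do not share a single type.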
from cudf.
A column holding mixed-type values isn't supported in cuDF the way it is in pandas, but in an I/O situation like CSV reading, the data is coerced to a common type, which here is string (object in cudf):
In [5]: data = """ID, Mixed_Data, Description
...: 1, "John Doe", "Name"
...: 2, 42, "Age"
...: 3, "2024-01-11", "Date"
...: 4, true, "Boolean Value"
...: 5, 55000.75, "Salary"
...: """
In [6]: import io
In [7]: cudf.read_csv(io.StringIO(data))
Out[7]:
ID Mixed_Data Description
0 1 "John Doe" "Name"
1 2 42 "Age"
2 3 "2024-01-11" "Date"
3 4 true "Boolean Value"
4 5 55000.75 "Salary"
In [8]: cudf.read_csv(io.StringIO(data)).dtypes
Out[8]:
ID int64
Mixed_Data object
Description object
dtype: object
In [13]: cudf.read_csv(io.StringIO(data)).loc[1, " Mixed_Data"]
Out[13]: ' 42'
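The coercion to a common type shown above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the rules libcudf actually implements: the helper names infer_kind and common_kind and the four type labels are hypothetical, and the only promotion modeled is int to float, with everything else falling back to string.

```python
import csv
import io

def infer_kind(value):
    """Classify one raw CSV field as 'bool', 'int', 'float', or 'str'."""
    if value in ("true", "false"):
        return "bool"
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "str"

def common_kind(values):
    """Pick one kind for a whole column (assumed, simplified rules)."""
    kinds = {infer_kind(v) for v in values}
    if len(kinds) == 1:
        return kinds.pop()
    if kinds == {"int", "float"}:
        return "float"  # integers fit into a float column
    return "str"        # any other mix falls back to string

data = '''ID, Mixed_Data, Description
1, "John Doe", "Name"
2, 42, "Age"
3, "2024-01-11", "Date"
4, true, "Boolean Value"
5, 55000.75, "Salary"
'''

reader = csv.reader(io.StringIO(data), skipinitialspace=True)
header = [name.strip() for name in next(reader)]
rows = list(reader)
# Collect the values of each column by position, then infer one kind per column.
result = {name: common_kind([row[i] for row in rows])
          for i, name in enumerate(header)}
print(result)
```

Under these assumed rules, ID resolves to an integer kind while Mixed_Data and Description fall back to string, which lines up with the int64 / object / object dtypes in the session above. If a specific type is wanted instead of inference, cudf.read_csv also accepts a dtype argument, as described in the documentation linked earlier in the thread.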
from cudf.
Thank you, that makes sense. I suspected as much but wanted to confirm just in case. Will close this question.
from cudf.