Comments (2)
The jaccard API uses `hash_character_ngrams` internally, which produces a list column of integer values. The total number of integers in that list column is the number of ngrams for this strings column. Here the number of integers exceeds the max `size_type`, so the function is unable to build the output list column.
So you would need to limit the strings column size so that the total number of generated ngrams (individual substrings) does not exceed the max `size_type` (`int32`) value.
Meanwhile, I can work on modifying jaccard to avoid this limit, since it is an internal detail of that API.
from cudf.
Even with large-strings support the amount of memory needed to process this example will be significant.
The original `df` size is 11 rows of 214,748,364 bytes each = ~2.4GB for the total input strings size.
Using a `width=5` means each row generates 214,748,360 individual substrings (214,748,364 - 5 + 1) at 5 bytes each = ~1.1GB per row (11 rows ~ 12GB). The internal code uses hashing, which reduces the 5 bytes to 4 bytes = ~859MB per row (11 rows ~ 9.5GB).
Since the jaccard call in this example is comparing the `df` with itself, the temporary memory doubles to ~19GB.
Internally the intermediate substrings/hashes are sorted to help with counting the unique values. The sorted output requires a 2nd temporary copy (of the 9.5GB) which gets us to (19+9.5) = 28.5GB peak memory.
So overall, `jaccard_index` would need about 6x the input memory available for processing (counting both operands, ~4.8GB in total).