Comments (3)
Hi @dangraysf - Thanks for filing an issue!

The `edit_distance_matrix` method in cudf currently operates on the entire string column at once. When you convert to a `dask.dataframe.DataFrame` object (using `dask_cudf`), you are partitioning your data into a collection of distinct `cudf.DataFrame` objects. Using `ddf.map_partitions(func, ...)` tells dask/dask_cudf that `func` can be independently mapped across the partitions in an "embarrassingly parallel" fashion. Unfortunately, the logic needed to perform an `edit_distance_matrix` calculation across the global collection is not embarrassingly parallel at all. In fact, it requires that each partition be compared to every other partition.

I don't personally have much expertise in string processing, but my sense is that the all-to-all nature of `edit_distance_matrix` will make it tricky to scale out. If you only want to compare every row to a smaller number of target strings (using `edit_distance(targets=...)`), then the problem becomes much easier.
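To make the distinction concrete, here is a minimal CPU-only sketch in plain Python. It does not use cudf or dask; `levenshtein` and the toy `map_partitions` below are illustrative stand-ins written for this example, and the strings are made up. It shows that comparing against a single target needs no communication between partitions, while an all-to-all comparison has to visit every pair of partitions.

```python
from itertools import product


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def map_partitions(func, partitions):
    """Toy stand-in for ddf.map_partitions: apply func to each partition independently."""
    return [func(part) for part in partitions]


# Two partitions of a small, made-up string column.
partitions = [["ACDE", "ACDF"], ["GHIK", "ACDE"]]
target = "ACDE"

# One target: each partition is processed without seeing the others.
per_partition = map_partitions(
    lambda part: [levenshtein(s, target) for s in part], partitions
)
distances_to_target = [d for part in per_partition for d in part]
print(distances_to_target)

# All-to-all: every partition has to be paired with every partition.
partition_pairs = list(product(range(len(partitions)), repeat=2))
print(len(partition_pairs))
```

With `p` partitions the single-target case is `p` independent tasks, whereas the all-to-all case involves `p * p` partition pairs, which is why it does not fit the `map_partitions` model directly.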
Great insight! My workaround is the following:
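    import random
    import cudf
    import cupy as cp
    import dask_cudf
    import pandas as pd
    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster
    from tqdm import tqdm

    AMINO_ACIDS = 'ACDEFGHIKLMNPQRSTVWY'
    LENGTH = 12
    MATRIX_SIZE = 250_000

    def generate_sequence(length):
        # Assumed helper (not shown in the original post): a random peptide string
        return ''.join(random.choices(AMINO_ACIDS, k=length))

    # Initialize Dask client
    cluster = LocalCUDACluster()
    client = Client(cluster)
    print(client.dashboard_link)
    # Generate sequences
    sequences = [generate_sequence(LENGTH) for _ in range(MATRIX_SIZE)]
    df = cudf.DataFrame({'sequence': sequences})
    ddf = dask_cudf.from_cudf(df, npartitions=4)

    outputs = []
    for idx, s in tqdm(enumerate(sequences)):
        output = ddf['sequence'].map_partitions(lambda x: x.str.edit_distance(s), meta=(idx, 'int')).compute()
        outputs.append(pd.DataFrame({idx: output.to_dict()}))
    # Flush CUDA memory
    mempool = cp.get_default_memory_pool()
    mempool.free_all_blocks()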
Exactly as you suggest: `edit_distance` down the rows works well, including across partitions.
This is still much more performant than a vanilla edit-distance matrix using the standard C implementations (rapidfuzz), so RAPIDS cuDF + nvedit remains very good for this simple biologist without resorting to CPU-based HPC. Thank you for this package!!
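The workaround above builds the full matrix one column at a time, with one pass over all partitions per target string. The following CPU-only sketch mirrors that assembly pattern in plain Python so it can be checked without a GPU. The `levenshtein` helper and the three sample strings are assumptions made for illustration, not part of cudf.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (illustrative helper, not the cudf kernel)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


sequences = ["ACDE", "ACDF", "GHIK"]
partitions = [sequences[:2], sequences[2:]]  # stand-in for npartitions=2

# One column per target string: each pass touches every partition independently,
# then the per-partition pieces are concatenated in partition order.
matrix = {}
for idx, target in enumerate(sequences):
    column = []
    for part in partitions:
        column.extend(levenshtein(s, target) for s in part)
    matrix[idx] = column

n = len(sequences)
# Sanity checks: zero diagonal and symmetry, as expected of an edit-distance matrix.
is_symmetric = all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(n))
print(matrix, is_symmetric)
```

This pattern makes `N` passes for `N` sequences and produces `N * N` entries in total. With `MATRIX_SIZE = 250_000` that is 6.25 × 10^10 entries, so (assuming 4-byte integers) the complete matrix would occupy roughly 250 GB and is unlikely to fit in memory as a single object.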
Sounds good! Thanks again for the discussion @dangraysf
I'll close this issue for now, but feel free to follow up if you have other questions/concerns.