[QST] Any magic fixes for str.edit_distance_matrix with dask_cudf across partitions? about cudf HOT 3 CLOSED

dangraysf commented on May 30, 2024

[QST] Any magic fixes for str.edit_distance_matrix with dask_cudf across partitions?

from cudf.

Comments (3)

rjzamora commented on May 30, 2024

Hi @dangraysf - Thanks for filing an issue!

The edit_distance_matrix method in cudf currently operates on the entire string column at once. When you convert to a dask.dataframe.DataFrame object (using dask_cudf), you are partitioning your data into a collection of distinct cudf.DataFrame objects. Using ddf.map_partitions(func, ...) tells dask/dask_cudf that func can be independently mapped across the partitions in an "embarrassingly parallel" fashion. Unfortunately, the logic needed to perform an edit_distance_matrix calculation across the global collection is not embarrassingly parallel at all: it requires that every partition be compared with every other partition, so mapping the method over partitions only yields the distances between strings that share a partition.
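To make the all-to-all point concrete, here is a minimal CPU-only sketch in plain Python. It does not use cudf or dask at all: the levenshtein and distance_matrix helpers, the four toy strings, and the two-way split are all illustrative assumptions. It shows that per-partition matrices only cover the diagonal blocks of the full matrix.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def distance_matrix(rows, cols):
    """All-pairs edit distances between two lists of strings."""
    return [[levenshtein(r, c) for c in cols] for r in rows]


# Toy data standing in for a string column split into two partitions.
strings = ["ACDE", "ACDF", "WYWY", "WYWA"]
partitions = [strings[:2], strings[2:]]

# Global result: every string against every other string (4 x 4).
full = distance_matrix(strings, strings)

# What a per-partition map computes: one small matrix per partition (two 2 x 2).
per_partition = [distance_matrix(p, p) for p in partitions]

covered = sum(len(p) ** 2 for p in partitions)
total = len(strings) ** 2
print(f"per-partition mapping covers {covered} of {total} pairs")
```

The off-diagonal blocks (strings from partition 0 against strings from partition 1) never get computed, which is why each partition has to see every other partition.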

I don't personally have much expertise in string processing, but my sense is that the all-to-all nature of edit_distance_matrix will make it tricky to scale out. If you only want to compare every row to a smaller number of target strings (using edit_distance(targets=...)), then the problem becomes much easier.
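By contrast, comparing every row against a fixed target needs no communication between partitions. The following plain-Python sketch (again without cudf or dask; levenshtein, distances_to_target, and the toy data are illustrative assumptions) shows that mapping over partitions and concatenating gives the same answer as running over the whole column.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def distances_to_target(partition, target):
    """Edit distance from each string in one partition to a single target."""
    return [levenshtein(s, target) for s in partition]


strings = ["ACDE", "ACDF", "WYWY", "WYWA"]
partitions = [strings[:2], strings[2:]]
target = "ACDE"

# Run once over the whole column ...
global_result = distances_to_target(strings, target)

# ... and once per partition, concatenating the pieces in partition order.
mapped = [d for p in partitions for d in distances_to_target(p, target)]

print(mapped == global_result)
```

Because each partition only needs its own rows plus the target, this is the kind of function that map_partitions is meant for.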


dangraysf commented on May 30, 2024

Great insight! My workaround is the following:

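import random

import cudf
import cupy as cp
import dask_cudf
import pandas as pd
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from tqdm import tqdm

AMINO_ACIDS = 'ACDEFGHIKLMNPQRSTVWY'
LENGTH = 12
MATRIX_SIZE = 250_000

def generate_sequence(length):
    # Assumed helper (its definition is not shown in this comment): a random amino-acid string
    return ''.join(random.choices(AMINO_ACIDS, k=length))

# Initialize Dask client
cluster = LocalCUDACluster()
client = Client(cluster)
print(client.dashboard_link)

# Generate sequences
sequences = [generate_sequence(LENGTH) for _ in range(MATRIX_SIZE)]

df = cudf.DataFrame({'sequence': sequences})
ddf = dask_cudf.from_cudf(df, npartitions=4)

outputs = []

# One pass per target string: each pass yields one column of the distance matrix
for idx, s in tqdm(enumerate(sequences), total=len(sequences)):
    output = ddf['sequence'].map_partitions(lambda x: x.str.edit_distance(s), meta=(idx, 'int')).compute()
    outputs.append(pd.DataFrame({idx: output.to_dict()}))

    # Flush CUDA memory
    mempool = cp.get_default_memory_pool()
    mempool.free_all_blocks()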

Exactly as you suggest -- edit_distance down the rows works fine, including across partitions.

This is still much more performant than a vanilla edit_distance matrix using the standard C implementations (rapidfuzz), and so RAPIDS cuDF + nvedit remains very good for this simple biologist without resorting to CPU-based HPC -- thank you for this package!!
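For a sense of scale, here is a back-of-the-envelope sketch of the sizes involved at MATRIX_SIZE = 250_000. The 4-byte entry size is an assumption (a 32-bit integer per distance), not something stated in this thread.

```python
MATRIX_SIZE = 250_000
BYTES_PER_ENTRY = 4  # assumption: each distance is stored as a 32-bit integer

# Full all-to-all matrix: one entry per ordered pair of sequences.
entries = MATRIX_SIZE ** 2
full_matrix_gb = entries * BYTES_PER_ENTRY / 1e9

# One column, i.e. the result of a single edit_distance pass against one target.
column_mb = MATRIX_SIZE * BYTES_PER_ENTRY / 1e6

# Edit distance is symmetric, so only the pairs above the diagonal are distinct.
symmetric_entries = MATRIX_SIZE * (MATRIX_SIZE - 1) // 2

print(f"full matrix: {entries:,} entries, about {full_matrix_gb:.0f} GB")
print(f"one column:  about {column_mb:.0f} MB")
print(f"distinct unordered pairs: {symmetric_entries:,}")
```

Under that assumption each per-target column is small, while the complete matrix would not fit in the memory of a single GPU, which is consistent with computing it one column at a time.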


rjzamora commented on May 30, 2024

Sounds good! Thanks again for the discussion @dangraysf

I'll close this issue for now, but feel free to follow up if you have other questions/concerns.
