Comments (3)

rjzamora commented on May 30, 2024

Hi @dangraysf - Thanks for filing an issue!

The edit_distance_matrix method in cudf currently operates on the entire string column at once. When you convert to a dask.dataframe.DataFrame object (using dask_cudf), you are partitioning your data into a collection of distinct cudf.DataFrame objects. Using ddf.map_partitions(func, ...) tells dask/dask_cudf that func can be independently mapped across the partitions in an "embarrassingly parallel" fashion. Unfortunately, the logic needed to perform an edit_distance_matrix calculation across the global collection is not embarrassingly parallel at all; in fact, it requires that each partition be compared to every other partition.
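
As a rough illustration (not part of the original comment), the arithmetic below counts how many (row, column) pairs of the global matrix a purely per-partition matrix calculation would cover. The row count and partition count are illustrative assumptions, and covered_pairs is a hypothetical helper.

```python
def covered_pairs(partition_sizes):
    """Count ordered (row, col) pairs that fall inside a single partition."""
    return sum(n * n for n in partition_sizes)


# Illustrative sizes (assumed): 250,000 rows split evenly into 4 partitions
n_rows, n_partitions = 250_000, 4
sizes = [n_rows // n_partitions] * n_partitions

total = n_rows * n_rows        # every pair in the global matrix
within = covered_pairs(sizes)  # pairs reachable without cross-partition work
fraction = within / total

print(f"{within:,} of {total:,} pairs ({fraction:.0%})")
```

With equal-sized partitions, only 1/P of the pairs lie inside a single partition; the remaining off-diagonal blocks all require comparing one partition against another.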

I don't personally have much expertise in string processing, but my sense is that the all-to-all nature of edit_distance_matrix will make it tricky to scale out. If you only want to compare every row to a smaller number of target strings (using edit_distance(target=...)), then the problem becomes much easier.
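
To see why the single-target case partitions cleanly, here is a minimal CPU-only sketch in plain Python. It is an illustration under stated assumptions, not cudf's implementation: levenshtein and distances_to_target are hypothetical helpers, and the strings are made up. Each partition needs only its own rows plus the target, so the per-partition results can simply be concatenated.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def distances_to_target(partition, target):
    """Per-partition work: needs only this partition's rows and the target."""
    return [levenshtein(s, target) for s in partition]


# Made-up example data, split into three "partitions"
strings = ["ACDE", "ACDF", "WYWY", "ACD", "KLMN", "KLMM"]
partitions = [strings[0:2], strings[2:4], strings[4:6]]
target = "ACDE"

# Computing on the whole column at once ...
global_result = distances_to_target(strings, target)

# ... matches mapping over partitions independently and concatenating
partitioned_result = [d for part in partitions
                      for d in distances_to_target(part, target)]

print(partitioned_result == global_result)
```

No partition ever has to look at another partition's rows, which is the property map_partitions relies on.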

dangraysf commented on May 30, 2024

Great insight! My workaround is the following:

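import random

import cudf
import cupy as cp
import dask_cudf
import pandas as pd
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from tqdm import tqdm

AMINO_ACIDS = 'ACDEFGHIKLMNPQRSTVWY'
LENGTH = 12
MATRIX_SIZE = 250_000

# generate_sequence was not shown in the original post; assumed to build a random peptide
def generate_sequence(length):
    return ''.join(random.choices(AMINO_ACIDS, k=length))

# Initialize Dask client
cluster = LocalCUDACluster()
client = Client(cluster)
print(client.dashboard_link)

# Generate sequences
sequences = [generate_sequence(LENGTH) for _ in range(MATRIX_SIZE)]

df = cudf.DataFrame({'sequence': sequences})
ddf = dask_cudf.from_cudf(df, npartitions=4)

outputs = []

# One pass per target string: each partition is compared against s independently
for idx, s in tqdm(enumerate(sequences), total=len(sequences)):
    output = ddf['sequence'].map_partitions(lambda x: x.str.edit_distance(s), meta=(idx, 'int')).compute()
    outputs.append(pd.DataFrame({idx: output.to_dict()}))

    # Flush CUDA memory
    mempool = cp.get_default_memory_pool()
    mempool.free_all_blocks()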

Exactly as you suggest: edit_distance against a single target works well down the rows and across partitions.

This is still much more performant than a vanilla edit-distance matrix using the standard C implementations (rapidfuzz), so RAPIDS cuDF + nvedit remains very good for this simple biologist without resorting to CPU-based HPC. Thank you for this package!!

rjzamora commented on May 30, 2024

Sounds good! Thanks again for the discussion @dangraysf

I'll close this issue for now, but feel free to follow up if you have other questions or concerns.
