
Comments (1)

arunjose696 commented on July 26, 2024

Hi @wanghaisheng ,

In the issue I could not find how long Modin takes, and it was not quite clear to me how many rows your original data had. Could you clarify what 400w means and how long Modin took?

I tried creating a synthetic dataset with the below script and ran your benchmark.

import pandas as pd
import random
import string
import numpy as np

# Function to generate a random URL
def generate_random_url():
    letters = string.ascii_lowercase
    domain = ''.join(random.choice(letters) for _ in range(random.randint(5, 10)))
    extension = random.choice(['com', 'net', 'org', 'biz', 'info', 'co'])
    return f"http://www.{domain}.{extension}"

# Function to generate random data for additional columns
def generate_random_data(size):
    return np.random.rand(size)

# Number of URLs to generate
num_urls = 4000000

# Generate random URLs
urls = [generate_random_url() for _ in range(num_urls)]

# Create a DataFrame with 'URL' column
df = pd.DataFrame(urls, columns=['URL'])

# Adding 10 more random columns
for i in range(10):
    col_name = f'Random_{i+1}'
    df[col_name] = generate_random_data(num_urls)

# Adding some duplicates
num_duplicates = 3000
duplicate_indices = random.sample(range(num_urls), num_duplicates)

for index in duplicate_indices:
    df.at[index, 'URL'] = df.at[index // 2, 'URL']

# Shuffle the DataFrame to randomize the order
df = df.sample(frac=1).reset_index(drop=True)

# Print the DataFrame info
print(df.info())

# Print the first few rows of the DataFrame
print(df.head())

df.to_csv('waybackmachines-www.amazon.com.csv')

At my end I could observe that Modin on Ray is faster when the number of rows (set by num_urls in my script) is 4,000,000, while Dask performs better when the row count is smaller (say 400,000).

Perf comparison on an Intel(R) Xeon(R) Platinum 8276L CPU @ 2.20GHz (112 CPUs):

| Number of rows | Modin on Ray | Dask     |
|----------------|--------------|----------|
| 4,000,000      | 18.183 s     | 29.693 s |
| 400,000        | 7.898 s      | 5.461 s  |
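To make the comparison easy to reproduce at small scale, here is a minimal sketch of the timing harness I used (an assumption on my part: the operation being benchmarked is `drop_duplicates` on the `URL` column, as in the original issue). Swapping the pandas import for `import modin.pandas as pd` runs the same code under Modin:

```python
import time
import pandas as pd  # swap for `import modin.pandas as pd` to time Modin

# Small synthetic frame: i % 900 leaves exactly 900 unique URLs out of 1000,
# i.e. roughly 10% duplicates, mimicking the duplicated-URL column above.
urls = [f"http://www.site{i % 900}.com" for i in range(1000)]
df = pd.DataFrame({"URL": urls})

start = time.perf_counter()
deduped = df.drop_duplicates(subset=["URL"])
elapsed = time.perf_counter() - start

print(f"{len(deduped)} unique rows in {elapsed:.3f}s")
```

At realistic sizes (millions of rows), replacing the import is the only change needed to collect the Modin numbers in the table.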

Since Modin is intended for large dataframes, I would say it is expected that Modin can perform worse than alternatives when the data size is too small.
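For reference, Modin's execution engine can be selected with the `MODIN_ENGINE` environment variable (documented values include `ray` and `dask`), so the same script can be timed under both backends without code changes. `benchmark.py` here is a hypothetical name for whatever script runs the workload:

```shell
# Run the same benchmark under both engines
# (assumes the modin[ray] and modin[dask] extras are installed)
MODIN_ENGINE=ray  python benchmark.py
MODIN_ENGINE=dask python benchmark.py
```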

