Comments (1)
Hi @wanghaisheng,
I could not find in the issue how long Modin takes, and it was not clear to me how many rows your original data has. Could you clarify what 400w means and how long Modin took?
I created a synthetic dataset with the script below and ran your benchmark on it.
import random
import string

import numpy as np
import pandas as pd


# Function to generate a random URL
def generate_random_url():
    letters = string.ascii_lowercase
    domain = ''.join(random.choice(letters) for _ in range(random.randint(5, 10)))
    extension = random.choice(['com', 'net', 'org', 'biz', 'info', 'co'])
    return f"http://www.{domain}.{extension}"


# Function to generate random data for additional columns
def generate_random_data(size):
    return np.random.rand(size)


# Number of URLs to generate
num_urls = 4000000

# Generate random URLs
urls = [generate_random_url() for _ in range(num_urls)]

# Create a DataFrame with 'URL' column
df = pd.DataFrame(urls, columns=['URL'])

# Adding 10 more random columns
for i in range(10):
    col_name = f'Random_{i+1}'
    df[col_name] = generate_random_data(num_urls)

# Adding some duplicates by copying the URL from row index // 2
num_duplicates = 3000
duplicate_indices = random.sample(range(num_urls), num_duplicates)
for index in duplicate_indices:
    df.at[index, 'URL'] = df.at[index // 2, 'URL']

# Shuffle the DataFrame to randomize the order
df = df.sample(frac=1).reset_index(drop=True)

# Print the DataFrame info (df.info() prints directly and returns None)
df.info()

# Print the first few rows of the DataFrame
print(df.head())

df.to_csv('waybackmachines-www.amazon.com.csv')
On my end, Modin on Ray is faster when the number of rows (defined by num_urls in my script) is 4000000, while Dask performs better when the number of rows is smaller (say 400000).
Performance comparison on an Intel(R) Xeon(R) Platinum 8276L CPU @ 2.20GHz (112 CPUs):
| Number of rows | Modin on Ray | Dask |
|---|---|---|
| 4000000 | 18.183s | 29.693s |
| 400000 | 7.898s | 5.461s |
Since Modin is intended to work on large dataframes, I would say it can perform poorly when the data size is too small.
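To check where the crossover happens on your own machine, something like the following sketch could be used to time the same operation at several row counts. It uses plain pandas so that it runs anywhere. The benchmarked operation is not shown in this comment, so `drop_duplicates` on the `URL` column is only an assumed stand-in, and `make_frame` and `time_drop_duplicates` are hypothetical helper names. The row counts are kept small here and would need to be raised toward 400000 and 4000000 for a meaningful comparison.

```python
import time

import numpy as np
import pandas as pd


def make_frame(num_rows, num_duplicates, seed=0):
    """Build a synthetic frame with a 'URL' column that contains repeated URLs."""
    rng = np.random.default_rng(seed)
    urls = np.array([f"http://www.site{i}.com" for i in range(num_rows)], dtype=object)
    # Overwrite some rows with the URL of row index // 2, as in the script above.
    targets = rng.choice(np.arange(1, num_rows), size=num_duplicates, replace=False)
    urls[targets] = urls[targets // 2]
    return pd.DataFrame({"URL": urls, "Random_1": rng.random(num_rows)})


def time_drop_duplicates(df, repeats=3):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        # Assumed stand-in for the benchmarked operation.
        df.drop_duplicates(subset="URL")
        best = min(best, time.perf_counter() - start)
    return best


# Small sizes for illustration only; increase them for a real comparison.
for num_rows in (1_000, 10_000):
    frame = make_frame(num_rows, num_duplicates=num_rows // 100)
    print(f"{num_rows} rows: {time_drop_duplicates(frame):.4f}s")
```

Taking the best of several runs reduces noise from one-off startup costs, which matter more at small sizes. Comparing engines would additionally require running the same operation through `modin.pandas` or Dask, which this sketch does not do.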