
benchmarks's Introduction

PyCaret Benchmarks

The purpose of this repository is to assist with benchmarking the pycaret module. Currently, the benchmarks cover only time series forecasting and feature extraction. The benchmarking is done using the M3 and M4 competition datasets, but can be extended to other datasets as well.

Result Summary

M3 Benchmark Summary

M4 Benchmark Summary

Time Series Benchmarking (Windows)

The benchmarking is done in 4 steps as described below.

Step 1: Create forecasts

You can edit the batch file to include all the models and categories you want to benchmark, then run it using the following command. This will execute experiment.py in a loop for all combinations.

scripts\mseries\run_experiments.bat

This will create 2 files per model-category combination - one with the predictions and the other with the run statistics.
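Conceptually, the batch file loops over every model-category pair and calls experiment.py once per pair. A minimal Python sketch of that loop is shown below; the model/category values and the experiment.py flags are assumptions for illustration, not the actual CLI.

import itertools
import subprocess

# Hypothetical sketch of the loop performed by run_experiments.bat
models = ["ets", "naive"]               # hypothetical subset of models
categories = ["Monthly", "Quarterly"]   # hypothetical subset of categories

for model, category in itertools.product(models, categories):
    # Flags below are illustrative assumptions, not the real experiment.py interface
    subprocess.run(
        ["python", "scripts/mseries/experiment.py", f"--model={model}", f"--group={category}"],
        check=True,
    )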

Steps 2, 3, & 4

Steps 2, 3, and 4 have also been compiled into a single script, which can be run as follows (you can edit it to select only a subset of datasets or plots).

scripts\mseries\run_postprocessing.bat

Or if you want, you can run them individually as follows.

Step 2: Evaluate results (metrics, time, etc.)

Once you have run Step 1 for all the combinations of interest, you can run the evaluation script to compile the benchmark results (metrics, time, etc.).

python scripts/mseries/evaluation.py --dataset=M3

This will produce a file called data\m3\current_evaluation_full.csv with the summary of the benchmark.
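Once the file exists, a quick way to inspect it is with pandas. The sketch below assumes the file contains a "model" column and an "smape" metric column; adjust the names to whatever columns the evaluation script actually writes.

import pandas as pd

# Quick look at the Step 2 output (column names are assumptions)
evaluation = pd.read_csv("data/m3/current_evaluation_full.csv")
print(evaluation.head())
print(evaluation.groupby("model")["smape"].mean())  # average metric per model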

Step 3: Update running metrics (i.e. combine previous results with current results)

Next, you can combine these results with the benchmark results already collected in the past.

python scripts/mseries/update.py --dataset=M3
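Conceptually, this step appends the current evaluation to a running history and de-duplicates on the key columns. The sketch below illustrates that idea; the history file name and key columns are assumptions, not the exact behavior of update.py.

import pandas as pd

# Hypothetical key columns and history file name
KEYS = ["dataset", "group", "model", "pycaret_version"]

running = pd.read_csv("data/m3/running_evaluation_full.csv")   # hypothetical history file
current = pd.read_csv("data/m3/current_evaluation_full.csv")

# Append current results and keep the latest entry per key combination
combined = pd.concat([running, current], ignore_index=True)
combined = combined.drop_duplicates(subset=KEYS, keep="last")
combined.to_csv("data/m3/running_evaluation_full.csv", index=False)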

Step 4: Plot results

Finally, you can plot the results using plotly.

python scripts/mseries/plot.py --dataset=M3 --group=Monthly
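The sketch below shows the general kind of comparison plot this step produces, using plotly.express; the input file and column names are assumptions rather than the exact output of plot.py.

import pandas as pd
import plotly.express as px

# Sketch: bar chart of a metric by model for one group (column names assumed)
results = pd.read_csv("data/m3/current_evaluation_full.csv")
monthly = results[results["group"] == "Monthly"]
fig = px.bar(monthly, x="model", y="smape", title="M3 Monthly: SMAPE by model")
fig.show()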

Time Series Feature Extraction (Windows)

This library can also be used to extract features from time series data. The following command will extract features from the M3 dataset and save them to a csv file. This can be useful for evaluating the characteristics of the data before modeling and for deciding on the appropriate settings to use (especially at scale).

scripts\mseries\run_extract_properties.bat
  • This will save the results into a folder such as data\m4\results\properties (see the sketch after this list for one way to inspect them)
  • The captured properties include:
    • Total length of series (train + test)
    • Total length of training data
    • Total length of test data
    • Whether the data is strictly positive or not
    • Whether the data is white noise or not
    • The recommended values of 'd' and 'D' to use for models like ARIMA
    • Whether the data has seasonality or not
    • The candidate seasonal periods to be tested
    • The significant seasonal periods present
    • All the seasonal periods to use for modeling (when models accept multiple seasonal periods)
    • The primary seasonal period to use for modeling (when models only accept a single seasonality)
    • The significant seasonal periods present when harmonics are removed.
    • All the seasonal periods to use for modeling with harmonics removed (when models accept multiple seasonal periods)
    • The primary seasonal period to use for modeling when harmonics are removed (when models only accept a single seasonality)
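The sketch below shows one way the extracted property files might be loaded and summarized with pandas; the file layout and column names are assumptions.

import glob

import pandas as pd

# Load all extracted property files and summarize a few of them
# (file layout and column names below are assumptions)
files = glob.glob("data/m4/results/properties/*.csv")
properties = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

print(properties["len_total"].describe())               # hypothetical column: total series length
print(properties["is_white_noise"].mean())              # hypothetical column: share of white-noise series
print(properties["primary_sp_to_use"].value_counts())   # hypothetical column: primary seasonal period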

benchmarks's Issues

Limit max SP for benchmarking to increase speed of training

Most groups will have an SP <= 52 (constrained by weekly data). For example, the majority of:

  • hourly datasets will have SP = 24 or 48
  • weekly datasets will have SP = 7 or 52 (see here)
  • monthly datasets will have SP = 12
  • quarterly datasets will have SP = 4
  • annual datasets will have SP = 1
  • other datasets will have SP = 1

Limiting the SP to a maximum of 52 would therefore be an appropriate way to speed up training in certain cases.
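A minimal sketch of the proposed cap, assuming the candidate seasonal periods are available as a plain list:

# Hypothetical sketch: cap candidate seasonal periods before model training
MAX_SP = 52  # constrained by weekly seasonality, per the rationale above

candidate_sps = [24, 168, 8766]  # e.g. candidates detected for an hourly series
capped_sps = [sp for sp in candidate_sps if sp <= MAX_SP] or [1]
print(capped_sps)  # [24]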

Allow passing a list of arguments to the plotting function

Currently, we can only pass a single argument to the plotting function, as follows:

python scripts/mseries/plot.py --dataset=M3 --group=Monthly --model=ets

If I want to pass multiple models, it is currently not possible. Need to provide a way to do this.

# TODO: See how to accept list of values for each parameter from command line

e.g.

python scripts/mseries/plot.py --dataset=M3 --group=Monthly --model=["ets","naive"] --os=["win32"]
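One possible approach, assuming plot.py parses its flags with argparse, is to use nargs so each flag accepts one or more values. This is a sketch, not the current implementation.

import argparse

# Sketch: let --model and --os take one or more space-separated values
parser = argparse.ArgumentParser()
parser.add_argument("--dataset", default="M3")
parser.add_argument("--group", default="Monthly")
parser.add_argument("--model", nargs="+", default=["ets"])
parser.add_argument("--os", nargs="+", default=["win32"])
args = parser.parse_args()
print(args.dataset, args.group, args.model, args.os)

With this, an invocation like python scripts/mseries/plot.py --dataset M3 --group Monthly --model ets naive --os win32 would be possible.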

Add progress bar to native ray runs

Per @Yard1

Change

elif execution_engine == "ray":
    all_results = []
    function_remote = ray.remote(function_single_group)
    for single_group in grouped_data.groups.keys():
        result_single_group = function_remote.remote(
            data=grouped_data.get_group(single_group), **function_kwargs
        )
        all_results.append(result_single_group)
    all_results = ray.get(all_results)
    # Combine all results into 1 dataframe
    all_results = pd.concat(all_results)

to

elif execution_engine == "ray":
    results_in_progress = []
    all_results = []
    function_remote = ray.remote(function_single_group)
    for single_group in grouped_data.groups.keys():
        result_single_group = function_remote.remote(
            data=grouped_data.get_group(single_group), **function_kwargs
        )
        results_in_progress.append(result_single_group)
    progress_bar = tqdm.tqdm(total=len(results_in_progress))  # requires `import tqdm`
    while results_in_progress:
        results_finished, results_in_progress = ray.wait(results_in_progress, timeout=1)
        progress_bar.update(len(results_finished))
        all_results.extend(results_finished)
    all_results = ray.get(all_results)
    # Combine all results into 1 dataframe
    all_results = pd.concat(all_results)

Add plot for % of models covered by primary model

Add a plot for the % of models covered by the primary model (bar chart type of plot). This will be useful in cases where the primary model is not able to build a model for some reason and we need to filter out or evaluate those cases in more detail.

Build Streamlit app for looking and interacting with results

Add a Streamlit app that can be used to compare results as well as filter down to a subset of the benchmarks for viewing; a minimal sketch follows the use cases below.

Use Cases

  • Comparison between Keys (for filtering)
    • os
    • python_version
    • library
    • library_version
    • dataset
    • group
    • model
    • model_engine
    • model_engine_version
    • execution_engine
    • execution_engine_version
    • execution_mode
    • execution_mode_version
    • num_cpus
    • backup_model
  • Comparison over time
  • If no env changes, but pycaret settings change (in setup or create_model), do results become better or worse?
  • Comparison of Ray and Spark
  • Comparison of Native vs. Fugue
  • Compare the performance of Pycaret across different python versions
  • How does pycaret/ray/spark scale with the number of CPUs
  • Running performance tests (nightly/weekly) - if nothing changes, performance should not change
  • Metrics when a new version of pycaret is released (compared to the previous version)
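A minimal sketch of such an app, assuming a combined results CSV with key columns like "dataset" and "model" (the path and column names are assumptions):

import pandas as pd
import streamlit as st

# Load a combined results file (hypothetical path and columns)
results = pd.read_csv("data/m3/current_evaluation_full.csv")

# Sidebar filters on two of the keys listed above
dataset = st.sidebar.selectbox("Dataset", sorted(results["dataset"].unique()))
models = st.sidebar.multiselect("Models", sorted(results["model"].unique()))

filtered = results[results["dataset"] == dataset]
if models:
    filtered = filtered[filtered["model"].isin(models)]

st.dataframe(filtered)

This would be launched with streamlit run app.py (hypothetical file name) and extended with one widget per key listed above.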

Related to #10

Add benchmarking of feature extraction

Do a pre-run with feature extraction (such as seasonal period, seasonality type, and other properties). This will help with root cause analysis (RCA) and also help to eventually make pycaret better.

e.g. if the extracted SP is large, we need to be able to handle this, since a large SP will slow down model training (e.g. in Auto-ARIMA).

Writing and reading of both forecasts and times should be moved

Writing and reading of both forecasts and times should be moved to their own folder to match with the excluded folder in git.

Code to update

test_results.to_csv(f"data/forecasts-{prefix}.csv", index=False)
time_df = pd.DataFrame(
    {
        "time": [time_taken],
        "dataset": [dataset],
        "group": [ts_category],
        "model": [model],
        "engine": [engine],
        "execution_mode": [execution_mode],
        "run_date": [RUN_DATE],
        "pycaret_version": [PYCARET_VERSION],
    }
)
time_df.to_csv(f"data/time-{prefix}.csv", index=False)

forecast = pd.read_csv(f"data/forecasts-{suffix}.csv")

times = pd.read_csv(f"data/time-{suffix}.csv")

Move to location

data/m3/results/*
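A sketch of the proposed change, reusing the variables (dataset, prefix, suffix, test_results, time_df) from the script excerpt above and building the per-dataset results folder with pathlib:

from pathlib import Path

# Write forecasts and times under data/<dataset>/results/ instead of data/
results_dir = Path("data") / dataset.lower() / "results"
results_dir.mkdir(parents=True, exist_ok=True)

test_results.to_csv(results_dir / f"forecasts-{prefix}.csv", index=False)
time_df.to_csv(results_dir / f"time-{prefix}.csv", index=False)

# Reads updated to match
forecast = pd.read_csv(results_dir / f"forecasts-{suffix}.csv")
times = pd.read_csv(results_dir / f"time-{suffix}.csv")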

[BUG] Discrepancy in "Naive" results between 3.0.0rc8 and the next local version

The local version after 3.0.0rc8 that is benchmarked has a hash 38f*a71. In this change, ETS fails to run due to sktime/sktime#4173. Hence the model falls back to naive in all cases. I manually verified that the predictions for ETS with the 38f*a71 version are identical to the forecasts from 3.0.0rc8 naive. Yet the results show a slight difference in the metric (see image below - pink 38f*a71 vs. blue 3.0.0rc8).

Need to investigate the root cause of this issue. Maybe the merging of indices in the evaluation step is not correct.

[Image: metric comparison, pink = 38f*a71 vs. blue = 3.0.0rc8]

Split evaluation CSV

Split the evaluation CSV into

(1) one containing the keys + metrics
(2) one containing the keys + run times
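A rough sketch of the split (the key, metric, and output file names below are assumptions):

import pandas as pd

# Hypothetical key and metric column names
KEYS = ["os", "python_version", "library_version", "dataset", "group", "model"]
METRICS = ["mae", "rmse", "smape"]

evaluation = pd.read_csv("data/m3/current_evaluation_full.csv")
evaluation[KEYS + METRICS].to_csv("data/m3/current_evaluation_metrics.csv", index=False)
evaluation[KEYS + ["time"]].to_csv("data/m3/current_evaluation_times.csv", index=False)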

Questions to answer:

  • Should keys include run date (makes the comparison to last run difficult)?
  • Should keys contain commit hash (makes the comparison to last run difficult)?

Add OS version to results

Record which OS was used for the run (Windows/Linux/Mac). This will help benchmark whether any OS is better in terms of accuracy (if there are any differences).
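The standard library already exposes this, so a sketch of capturing it alongside the other run metadata is straightforward:

import platform

# Capture OS name and version so they can be stored with each benchmark run
os_name = platform.system()      # e.g. "Windows", "Linux", "Darwin"
os_version = platform.release()  # e.g. "10", "5.15.0-1037-azure"
print(os_name, os_version)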

Append new results to previous results

There has to be code in evaluation.py to read the previous results and compare them to the current run, flagging which results became better and which became worse.

Also, there has to be a running history of the results. For example, if we run for pycaret 3.0.0 today and for 3.1.0 tomorrow, we should have both results in the file. This will help plot the progress of models over time and see if things are becoming better or worse.
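A sketch of the comparison step (the file, key, and metric column names are assumptions; lower SMAPE is treated as better):

import pandas as pd

# Compare the current run against the previous run and flag improvements
KEYS = ["dataset", "group", "model"]

previous = pd.read_csv("data/m3/previous_evaluation_full.csv")  # hypothetical file
current = pd.read_csv("data/m3/current_evaluation_full.csv")

merged = current.merge(previous, on=KEYS, suffixes=("_current", "_previous"))
merged["improved"] = merged["smape_current"] < merged["smape_previous"]
print(merged[KEYS + ["smape_previous", "smape_current", "improved"]])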
