lightautoml_spark's Introduction

SLAMA: LightAutoML on Spark

SLAMA is a version of the LightAutoML library modified to run in distributed mode on the Apache Spark framework.

It requires:

  1. Python 3.9
  2. PySpark 3.2+ (installed as a dependency)
  3. SynapseML library (downloaded automatically by Spark)

Currently, only the tabular preset is supported. See the demo of the Spark-based tabular AutoML preset in examples/spark/tabular-preset-automl.py. For further information, see the docs in the root of the project, which contain a dedicated SLAMA section.

License

This project is licensed under the Apache License, Version 2.0. See LICENSE file for more details.

Installation

First, install git and poetry.

# Load LAMA source code
git clone https://github.com/fonhorst/LightAutoML_Spark.git

cd LightAutoML_Spark/

# Choose only one of the two options below

# 1. Global installation: Don't create virtual environment
poetry config virtualenvs.create false --local

# 2. Recommended: Create virtual environment inside your project directory
poetry config virtualenvs.in-project true

# For more information read poetry docs

# Install LAMA
poetry lock
poetry install

lightautoml_spark's People

Contributors

alexmryzhkov, bizzyvinci, btbpanda, crustaceanj, cybsloth, dalone, desimakov, dev-rinchin, fonhorst, greenplace, m4kedon5k1i, netang, resivalex, se-teryoshkin, vabun

lightautoml_spark's Issues

OpenML Benchmark for AutoML

Find big datasets suitable for use with the Tabular Preset for correctness and workload testing. For the Tabular Preset, see also the example.

Desired data volumes:
"100 million by 1,000 features, 16 cores and 100 GB per node, 200 cores and 1 TB per cluster.
Inference should take no more than 1 hour.
The business has 4 hours for everything, and there must be a chance to restart."

Training time: 4 hours
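
To put the desired volume in perspective, here is a rough sizing sketch. It assumes the "100 million" refers to rows, that the data is stored densely, and that every value takes 8 bytes (float64). None of these assumptions is stated in the issue, so treat the result as an order-of-magnitude estimate only.

```python
# Back-of-the-envelope sizing for the desired data volume.
# Assumptions (not from the issue text): 100 million means rows,
# storage is dense, and each value takes 8 bytes (float64).
ROWS = 100_000_000
FEATURES = 1_000
BYTES_PER_VALUE = 8

# Cluster figures quoted in the issue (taking 1 TB = 1000 GB).
NODE_MEMORY_GB = 100
CLUSTER_MEMORY_GB = 1_000
NODE_CORES = 16
CLUSTER_CORES = 200


def dense_size_gb(rows: int, features: int, bytes_per_value: int) -> float:
    """Size of a dense rows x features matrix in gigabytes (10**9 bytes)."""
    return rows * features * bytes_per_value / 1e9


size_gb = dense_size_gb(ROWS, FEATURES, BYTES_PER_VALUE)
cluster_share = size_gb / CLUSTER_MEMORY_GB

# Node counts implied by the quoted figures; they need not agree exactly,
# since the figures in the issue are approximate.
nodes_by_memory = CLUSTER_MEMORY_GB / NODE_MEMORY_GB
nodes_by_cores = CLUSTER_CORES / NODE_CORES

print(f"dense dataset size: {size_gb:.0f} GB")
print(f"share of cluster memory: {cluster_share:.0%}")
print(f"nodes implied by memory: {nodes_by_memory}, by cores: {nodes_by_cores}")
```

Under these assumptions a single dense copy of the data occupies most of the cluster memory, which suggests that intermediate copies made by transformers would need to be kept to a minimum or spilled to disk.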

Spark-based distributed transformers implementation.

The following transformers should be implemented using Spark SQL functions:

  • #25
  • #23
  • #24
  • #26
  • #27
  • #28
  • TunableTransformer (no need to rewrite, but some refactoring may be necessary)
  • #29
  • #30
  • #31
  • #32
  • #33
  • SequentialTransformer (only applies other transformers, no need to rewrite)
  • UnionTransformer (multiproc impl ?)
  • ColumnwiseUnion (only applies other transformers, no need to rewrite)
  • ColumnsSelector (works only with metadata, no need to rewrite, but requires support from SparkDataset - columns subselecting with the slice syntax)
  • BestOfTransformers (applies other transformers and a criterion function to the resulting dataset, appropriate modifications should be done in pipelines)
  • ConvertDataset (all modifications are required only for SparkDataset)
  • ChangeRoles (all modifications are required only for SparkDataset)
  • #34
  • #35
  • #36
  • #37
  • #38
  • #39
  • #40
  • #41
  • #42
  • #43
  • #44
  • #45
  • #47
  • #46
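
Several items in the list above are marked as needing no rewrite because they only apply other transformers. The sketch below illustrates why, using plain Python rather than Spark. The `Dataset` alias, the `Transformer` protocol, and the classes `FillMissingWithMean`, `Scale`, and `SequentialSketch` are hypothetical stand-ins and not the project's API; only the delegation pattern is the point.

```python
from typing import Dict, List, Optional, Protocol

# Hypothetical stand-in for a dataset: column name -> list of values.
Dataset = Dict[str, List[Optional[float]]]


class Transformer(Protocol):
    def fit(self, data: Dataset) -> "Transformer": ...
    def transform(self, data: Dataset) -> Dataset: ...


class FillMissingWithMean:
    """Leaf transformer: learns a column mean in fit, fills None in transform."""

    def __init__(self, column: str) -> None:
        self.column = column
        self.mean_ = 0.0

    def fit(self, data: Dataset) -> "FillMissingWithMean":
        present = [v for v in data[self.column] if v is not None]
        self.mean_ = sum(present) / len(present)
        return self

    def transform(self, data: Dataset) -> Dataset:
        out = dict(data)
        out[self.column] = [self.mean_ if v is None else v for v in data[self.column]]
        return out


class Scale:
    """Leaf transformer with no fitted state: multiplies a column by a factor."""

    def __init__(self, column: str, factor: float) -> None:
        self.column = column
        self.factor = factor

    def fit(self, data: Dataset) -> "Scale":
        return self

    def transform(self, data: Dataset) -> Dataset:
        out = dict(data)
        out[self.column] = [v * self.factor for v in data[self.column]]
        return out


class SequentialSketch:
    """Composite: never touches the data itself, only delegates to its steps."""

    def __init__(self, steps: List[Transformer]) -> None:
        self.steps = steps

    def fit_transform(self, data: Dataset) -> Dataset:
        for step in self.steps:
            data = step.fit(data).transform(data)
        return data

    def transform(self, data: Dataset) -> Dataset:
        for step in self.steps:
            data = step.transform(data)
        return data


pipeline = SequentialSketch([FillMissingWithMean("x"), Scale("x", 10.0)])
train_out = pipeline.fit_transform({"x": [1.0, None, 3.0]})
new_out = pipeline.transform({"x": [None, 4.0]})
print(train_out, new_out)
```

Because the composite only calls `fit` and `transform` on its children, swapping the leaf transformers for Spark-based ones would leave the composite unchanged, provided the dataset type they exchange is consistent.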

Implement base entities for LAMA

  • add SparkDataset, SparkTransformer, SparkBasedMLAlgo
  • Example of Spark Transformer with fit/predict phases
  • Example of test for the implemented transformer
  • Create feature branches
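
The ColumnsSelector item in the transformer list asks the dataset to support column subselection with slice syntax. The following is a minimal plain-Python sketch of what such indexing could look like. `ColumnSliceableDataset` is a hypothetical stand-in backed by a dictionary of lists; it is not the project's SparkDataset, and the `[rows, columns]` indexing form is an assumption.

```python
from typing import Dict, List, Tuple, Union


class ColumnSliceableDataset:
    """Hypothetical dataset stand-in supporting ds[rows, columns] indexing."""

    def __init__(self, columns: Dict[str, list]) -> None:
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self._columns = dict(columns)

    @property
    def features(self) -> List[str]:
        return list(self._columns)

    def __getitem__(
        self, key: Tuple[slice, Union[str, List[str]]]
    ) -> "ColumnSliceableDataset":
        rows, cols = key
        if not isinstance(rows, slice):
            raise TypeError("rows must be given as a slice, e.g. ds[:, ['a']]")
        if isinstance(cols, str):
            cols = [cols]
        # Selecting columns only needs the column names; values are copied
        # here because the stand-in is in-memory.
        return ColumnSliceableDataset({c: self._columns[c][rows] for c in cols})

    def to_dict(self) -> Dict[str, list]:
        return {name: list(values) for name, values in self._columns.items()}


ds = ColumnSliceableDataset({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
selected = ds[:, ["a", "c"]]
head_b = ds[0:2, "b"]
print(selected.features, head_b.to_dict())
```

In a Spark-backed dataset the column part of such an index could map to a column projection, while arbitrary row slices are generally more expensive on distributed data, so a real implementation might restrict or reject them.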
