
symkit's Introduction

Symbolic Regression 101

About simple ML models

Let's consider Linear Regression to illustrate a few shortcomings of traditional (in this case linear) ML methods. While the number of terms in the equation is derived from the dimension of our input data, the structure of this equation is fixed and known in advance.

In Python we would end up with a decision function like:

    X = ...  # data to predict on, e.g. a list of feature values
    W = [1.0000000000000002, 1.9999999999999991]
    b = 3.0000000000000018

    predictions = b + sum(X[i] * w_i for i, w_i in enumerate(W))

In this scenario we learned the coefficients, but the overall structure never changes:

  1. we multiply every input feature by its corresponding coefficient
  2. we aggregate resulting values using a sum function
  3. we add the bias

Of course we have access to a variety of more complex, non-linear models, but they are usually harder to explain.

Where does Symbolic Regression stand, then?

Symbolic Regression aims at learning not only the (potential) coefficients but, first and foremost, the structure of the model. The resulting relaxed model consists of a polynomial, built by composition from:

  • our input variables, e.g. Age and Height
  • a set of operators at hand, e.g. addition, subtraction, multiplication and division

Symbolic Regression runs Genetic Programming in order to produce the N best polynomials for your data.
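To make this concrete, here is a minimal, self-contained sketch of the idea (hypothetical names, not Symkit's actual API): expressions are random compositions of input variables and operators, and a crude evolutionary loop keeps the best-scoring ones. A real Genetic Programming run would mate and mutate the survivors rather than refill with fresh random individuals.

```python
import operator
import random

# Expressions are nested tuples, e.g. ("add", "x", ("mul", "y", "y")),
# composed from input variables and a set of operators at hand.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}
VARS = ["x", "y"]

def random_expr(depth=3):
    """Compose a random expression from variables and operators."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(VARS)
    op = random.choice(sorted(OPS))
    return (op, random_expr(depth - 1), random_expr(depth - 1))

def evaluate(expr, env):
    if isinstance(expr, str):          # a leaf is an input variable
        return env[expr]
    op, left, right = expr
    return OPS[op](evaluate(left, env), evaluate(right, env))

def mse(expr, rows):
    """Mean squared error of an expression against the target column."""
    return sum((evaluate(expr, r) - r["target"]) ** 2 for r in rows) / len(rows)

# toy dataset whose hidden structure is target = x + y * y
rows = [{"x": x, "y": y, "target": x + y * y}
        for x in range(5) for y in range(5)]

random.seed(0)
population = [random_expr() for _ in range(300)]
for _ in range(10):                    # a handful of rounds
    population.sort(key=lambda e: mse(e, rows))
    # keep the N best, refill the rest (real GP would breed the survivors)
    population = population[:30] + [random_expr() for _ in range(270)]

best = min(population, key=lambda e: mse(e, rows))
```

Note that the learned object is the expression tree itself, not a fixed-shape coefficient vector.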

What pros can you expect compared to other ML methods?

  • built-in explainability along with non-linearity: if you understand all the operators you provide the algorithm with, you understand all potential polynomials derived from these operators
  • built-in feature selection & model-size reduction: when passing existing traits to a new offspring, we take the polynomial complexity into consideration (equivalent to the model size in a Minimum Description Length setting). We then build a skyline based on both performance and complexity, and only consider models on this skyline. This process efficiently avoids "bloating" (complex individuals will not mate in the next round), controls overfitting (the simpler the polynomial, the more likely it is to generalize to new data) and embeds explainability in the learning process
  • easier transfer learning: it's fairly easy to extract the list of learned polynomials from a Symbolic Regression model and instantiate a new one with this list, to solve a new but similar problem
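The performance/complexity skyline can be computed in a few lines. This is an illustrative sketch (the function name is not Symkit's API), assuming each model is scored as a (loss, complexity) pair with lower being better on both axes:

```python
def pareto_front(models):
    """Keep models that no other model beats on both loss and complexity."""
    return [m for m in models
            if not any(o != m and o[0] <= m[0] and o[1] <= m[1] for o in models)]

# (loss, complexity) pairs for five candidate polynomials
models = [(0.10, 12), (0.10, 5), (0.30, 3), (0.05, 20), (0.40, 2)]
front = pareto_front(models)
# (0.10, 12) falls off the skyline: (0.10, 5) is just as accurate and simpler
```

Only individuals on this front mate in the next round, which is what keeps bloated expressions from propagating.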

How is Symkit different from other Symbolic Regression packages like gplearn?

At each round in the genetic programming process, we need to evaluate the entire population of polynomials on the training data. If you consider a usual population size of 1000 and 500 rounds until what we assume to be convergence, this can quickly become very expensive to run on an off-the-shelf laptop, even for a medium-sized dataset.
Taking a step back, we can do better. At each round, individuals are likely to share sub-expressions (they may share a common parent, and potentially grandparents). In sympy this is known as Common Subexpression Elimination, but it's essentially a graph optimization technique. By extracting the sub-expressions common to all our expressions, we can precompute their results and inline them back into our original expressions! This saves an enormous amount of CPU and RAM!
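To illustrate the gain (a hand-rolled sketch, not sympy's `cse` nor Symkit's actual code): if expression trees are hashable, one cache can serve the whole population, so a subtree inherited from a common parent is evaluated only once per round.

```python
import operator

OPS = {"add": operator.add, "mul": operator.mul}

def evaluate_cached(expr, env, cache):
    """Evaluate a nested-tuple expression, memoizing every subtree.

    The cache is only valid for a fixed env (i.e. one evaluation round).
    """
    if isinstance(expr, str):          # a leaf is an input variable
        return env[expr]
    if expr in cache:                  # subtree already computed by a sibling
        return cache[expr]
    op, left, right = expr
    value = OPS[op](evaluate_cached(left, env, cache),
                    evaluate_cached(right, env, cache))
    cache[expr] = value
    return value

shared = ("mul", "x", "y")             # subtree inherited from a common parent
population = [("add", shared, "x"), ("add", shared, "y")]
cache = {}
results = [evaluate_cached(e, {"x": 3, "y": 4}, cache) for e in population]
# ("mul", "x", "y") was computed once, then reused for the second individual
```

With populations of thousands sharing deep subtrees, this kind of reuse is where most of the CPU and RAM savings come from.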

Why use polars in the context of Symkit?

The polars engine not only submits computation to multiple threads, but also applies graph optimizations before running your expressions. The only extra step we need is to convert a sympy expression into a polars expression (see core.py).

Symkit

TODO

  • add unary operators like sin and cos [X]
  • try an online version with RiverML [ ]
  • Pareto front optimisation [X]
    • multi-objective optimisation (eg. return vs risk) ? [ ]
  • SymbolicRegression for symbolic regression on timeseries [ ]
  • scikit-learn tree structure into a sympy expression (that can be translated into a polars expression later on) [ ]

symkit's People

Contributors

remiadon


symkit's Issues

exploit duality in Symbolic Regression

One idea stolen from Linear Optimisation is duality.
Applying this to Symbolic Regression, one can imagine a setup where, at generation G:

  • the best individuals are picked for further offspring (the traditional definition of genetic algorithms)
  • the worst individuals are picked, and their most common subexpressions are extracted. While those badly performing expressions are used to form a next generation (maximizing the loss), the most common subexpressions are blacklisted from the mating process of the best-performing ones.
