
Sequential model-based optimization with a `scipy.optimize` interface

Home Page: https://scikit-optimize.github.io

License: BSD 3-Clause "New" or "Revised" License


scikit-optimize's Introduction


Scikit-Optimize

Scikit-Optimize, or skopt, is a simple and efficient library to minimize (very) expensive and noisy black-box functions. It implements several methods for sequential model-based optimization. skopt aims to be accessible and easy to use in many contexts.

The library is built on top of NumPy, SciPy and Scikit-Learn.

We do not perform gradient-based optimization. For gradient-based optimization algorithms, look at scipy.optimize.

Approximated objective function after 50 iterations of gp_minimize. Plot made using skopt.plots.plot_objective.

Install

scikit-optimize requires

  • Python (>= 3.6)
  • NumPy (>= 1.13.3)
  • SciPy (>= 0.19.1)
  • joblib (>= 0.11)
  • scikit-learn (>= 0.20)
  • matplotlib (>= 2.0.0)

You can install the latest release with:

pip install scikit-optimize

This installs a minimal version of scikit-optimize. To install scikit-optimize with plotting functionality, you can instead do:

pip install 'scikit-optimize[plots]'

This will install matplotlib along with scikit-optimize.

In addition, there is a conda-forge package of scikit-optimize:

conda install -c conda-forge scikit-optimize

Using conda-forge is probably the easiest way to install scikit-optimize on Windows.

Getting started

Find the minimum of the noisy function f(x) over the range -2 < x < 2 with skopt:

import numpy as np
from skopt import gp_minimize

def f(x):
    return (np.sin(5 * x[0]) * (1 - np.tanh(x[0] ** 2)) +
            np.random.randn() * 0.1)

res = gp_minimize(f, [(-2.0, 2.0)])

For more control over the optimization loop you can use the skopt.Optimizer class:

from skopt import Optimizer

opt = Optimizer([(-2.0, 2.0)])

for i in range(20):
    suggested = opt.ask()
    y = f(suggested)
    opt.tell(suggested, y)
    print('iteration:', i, suggested, y)

Read our introduction to Bayesian optimization and the other examples.

Development

The library is still experimental and under heavy development. Check out the next milestone for the plans for the next release, or look at some easy issues to get started contributing.

The development version can be installed through:

git clone https://github.com/scikit-optimize/scikit-optimize.git
cd scikit-optimize
pip install -e .

Run all tests by executing pytest in the top level directory.

To only run the subset of tests with short run time, you can use pytest -m 'fast_test' (pytest -m 'slow_test' is also possible). To exclude all slow running tests try pytest -m 'not slow_test'.

This is implemented using pytest attributes. If a test runs longer than 1 second, it is marked as slow; otherwise it is marked as fast.
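A sketch of how such markers could be declared, with the `fast_test`/`slow_test` marker names taken from the commands above:

```python
import pytest

# Tests with short run time are marked fast, long-running ones slow;
# `pytest -m 'fast_test'` then selects only the fast subset.
@pytest.mark.fast_test
def test_addition():
    assert 1 + 1 == 2

@pytest.mark.slow_test
def test_large_sum():
    assert sum(range(1000)) == 499500
```

Registering the marker names (e.g. in setup.cfg or conftest.py) avoids pytest's unknown-marker warnings.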

All contributors are welcome!

Making a Release

The release procedure is almost completely automated. When a new release is tagged, Travis builds all required packages and pushes them to PyPI. To make a release, create a new issue and work through the following checklist:

  • update the version tag in __init__.py
  • update the version tag mentioned in the README
  • check if the dependencies in setup.py are valid or need unpinning
  • check that the doc/whats_new/v0.X.rst is up to date
  • did the last build of master succeed?
  • create a new release
  • ping conda-forge

Before making a release we usually create a release candidate. If the next release is v0.X then the release candidate should be tagged v0.Xrc1 in __init__.py. Mark a release candidate as a "pre-release" on GitHub when you tag it.

Commercial support

Feel free to get in touch if you need commercial support or would like to sponsor development. Resources go towards paying for additional work by seasoned engineers and researchers.

Made possible by

The scikit-optimize project was made possible with the support of

Wild Tree Tech

NYU Center for Data Science

NSF

Northrop Grumman

If your employer allows you to work on scikit-optimize during the day and would like recognition, feel free to add them to the "Made possible by" list.

scikit-optimize's People

Contributors

bacoknight, betatim, carlosdanielcsantos, chschroeder, cmmalone, cschell, glouppe, guillaumesimo, holgern, hvass-labs, iaroslav-ai, jkleint, jusjusjus, kartikayyer, kejiashi, kernc, liebscher, lucasplagwitz, mechcoder, mirca, mp4096, nel215, nfcampos, scott-graham-bose, shakesb33r, stefanocereda, thomasjpfan, tupui, xmatthias, yngtodd


scikit-optimize's Issues

ExtraTrees returns NaN for std

yield (check_minimize, minimizer, bench1, 0., [(-2.0, 2.0)], 0.05, 75) with et_minimize produces

scikit-optimize/skopt/acquisition.py:165: RuntimeWarning: invalid value encountered in greater
  mask = std > 0

and std is:

(Pdb) print(std)
[  0.00000000e+00   3.44874701e-01   4.35236492e-01              nan
   5.35666028e-01   3.76289149e-01   0.00000000e+00   3.44874701e-01
   3.03596891e-01   2.84929167e-01              nan              nan
   1.11601649e-01   3.44874701e-01              nan              nan
   2.98023224e-08   6.69582536e-01   1.68631973e-01              nan]

Doc: generate examples gallery

Would be nice to generate examples upon deployment to build a nice gallery. This would require some changes to ci_scripts/deploy.sh and to the templates, but nothing impossible.

API for non continuous inputs

At the moment, input values are assumed to live within a bounded continuous range. We should think about an API on how to specify integer and symbolic values as well, and what would be the consequences for the algorithms we implemented so far.
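For illustration only, a mixed space might be declared with plain Python types and its dimension kinds inferred from the representation (the declaration style and the helper below are purely hypothetical, not a proposed API):

```python
# A hypothetical mixed search space: one tuple per dimension.
space = [
    (-2.0, 2.0),         # continuous: bounded real interval
    (1, 10),             # integer: inclusive integer range
    ("1", "2", "3"),     # symbolic/categorical: explicit set of values
]

def kind(dim):
    """Crude inference of a dimension's type from its Python representation."""
    if all(isinstance(v, str) for v in dim):
        return "categorical"
    if len(dim) == 2 and all(isinstance(v, int) for v in dim):
        return "integer"
    return "real"
```

One consequence for the algorithms: integer and categorical dimensions need an explicit transform (rounding, one-hot encoding, ...) before they can be fed to a GP or similar surrogate.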

Support for categorical parameters seems to be broken

def bench1(x):
    return np.asscalar(np.asarray(x))

def bench2(x):
    return np.asscalar(np.asarray(x, dtype=np.int))

>>> bench1([1])
1
>>> bench2(["1"])
1

from skopt import forest_minimize, gp_minimize
# Works
forest_minimize(bench1, ((1.0, 4.0),))
# Fails
forest_minimize(bench2, (("1", "2", "3", "4"),))

# Works
gp_minimize(bench1, ((1.0, 4.0),), maxiter=5)
# Fails
gp_minimize(bench2, (("1", "2", "3", "4"),))

Move GBRT in a `learning` submodule

I propose creating a learning submodule, for basically everything which is a modification of a ML algorithm. The wrapper around Gradient Boosting should be moved there.

Expected input/output shape

When exploring #37 and #38, I noticed that we are not very consistent with respect to the input/output shape. We should enforce one and only one way to do things.

I would suggest the following conventions:

  • func: 1d array-like as input, scalar as output (as in scipy.minimize)
  • acquisition functions: 2d array-like as input, 1d array as output.

Everything else raises an error.
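The proposed conventions could be enforced by small validators along these lines (the helper names are hypothetical, chosen only for this sketch):

```python
import numpy as np

def check_func_io(func, x):
    """Objective: 1-D array-like in, scalar out (as in scipy.minimize)."""
    x = np.asarray(x)
    assert x.ndim == 1, "objective expects a 1-D point"
    y = func(x)
    assert np.ndim(y) == 0, "objective must return a scalar"
    return float(y)

def check_acq_io(acq, X):
    """Acquisition function: 2-D array-like in, 1-D array out."""
    X = np.asarray(X)
    assert X.ndim == 2, "acquisition expects a 2-D batch of points"
    values = np.asarray(acq(X))
    assert values.shape == (X.shape[0],), "acquisition must return a 1-D array"
    return values
```

Anything with the wrong number of dimensions fails the assertion instead of being silently reshaped.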

Definition of terms related to uncertainty

Some thoughts on "uncertainty". This issue was inspired by @MechCoder's comment in #9. The first part of this issue tries to correctly define various terms that often get used interchangeably and are easy to confuse (I confidently predict that I will make at least one error in this post). Once we have defined the terms, we can decide which of them we need in order to evaluate various acquisition functions.

Standard deviation (\sigma): this is the square root of the variance. Can be calculated for any sample no matter what distribution the samples come from.

Standard error (of the mean): \sigma / \sqrt{N}, a measure of the uncertainty associated with the estimated value of the mean.

Confidence interval (CI): The N% confidence interval will contain the measured value N% of the time. Alice wants to estimate the value of a parameter t, so she constructs an estimator t̂ as well as a CI. The 68% CI (around t̂) will contain the true value t in 68% of experiments (that is, we clone Alice and repeat what she did many times).

N% quantile: The N% quantile is the point x such that the integral of the p.d.f. from negative infinity to x equals N%.

If t̂ is distributed according to a normal distribution, then the 68% CI is [t̂ - sigma, t̂ + sigma].

For a normal distribution, mu - sigma is the 16% quantile.

For our purposes we have a surrogate model (a GP or what have you) for the true, expensive function f. At a given point x our best estimate of the true value of f is the mean mu(x) of our surrogate model.


Now my understanding runs out -> need help.

What is the band we get from a GP and then feed into EI and friends? Is it the "standard error on the mean" or "68% confidence interval" or "68% credible interval" or something else?
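The definitions above can be checked numerically on a Gaussian sample (a sketch using NumPy only; the quantile value 0.1587 is the normal CDF at -1):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=10_000)

mu = samples.mean()
sigma = samples.std(ddof=1)            # standard deviation
sem = sigma / np.sqrt(len(samples))    # standard error of the mean

# For a normal distribution, mu - sigma sits at the ~16% quantile,
# so [mu - sigma, mu + sigma] covers ~68% of the probability mass.
q16 = np.quantile(samples, 0.1587)
```

On this sample, q16 and mu - sigma agree up to sampling noise, while the standard error is a hundred times smaller than sigma (sqrt of 10,000), illustrating that the two quantities answer different questions.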

Tests are slow! (episode II)

The current build takes more than 15 minutes, which is very long given that we don't have that much code yet... We should really try to trim some of the tests.

Add scikit-learn compatible BayesSearchCV

Hey.
Do you intend to provide a GridSearchCV drop-in replacement or only the optimizer?
The thing is that it might take a while to get that into scikit-learn, and it would be nice if people had access to it.

Cheers,
Andy

Implement RF based model selection

The computed variance for each RandomForest is given in http://arxiv.org/pdf/1211.0906v2.pdf in section 4.3.2 (This will involve wrapping sklearn's DecisionTrees to return the standard deviation of each leaf)

The ExpectedImprovement makes the same assumption about the predictions being Gaussian, except there is a minor modification given in Section 5.2 of https://www.cs.ubc.ca/~murphyk/Papers/gecco09.pdf

There is a change from sklearn's RF implementation in computing the split point described in 4.3.2 in http://arxiv.org/pdf/1211.0906v2.pdf but we can try without that modification.
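As a stopgap before implementing the per-leaf variance from the paper, the spread of the individual trees' predictions gives a crude uncertainty estimate. This is a sketch of that simpler idea, not the method from section 4.3.2:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.1, size=200)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

X_test = np.linspace(-2, 2, 5).reshape(-1, 1)
# Spread of the individual trees' predictions as a crude std estimate;
# the per-leaf variance from the paper would be more faithful.
all_preds = np.stack([tree.predict(X_test) for tree in rf.estimators_])
mean, std = all_preds.mean(axis=0), all_preds.std(axis=0)
```

The mean/std pair can then be fed to the same acquisition functions that consume a GP's posterior.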

Extend test suite

Before implementing any more things, we should really extend the test suite with more thorough tests. At the moment, I can't even minimize a 1D parabola with the default parameters of gp_minimize...

(and I don't even understand why it fails... so many things to adjust :/)

We might want to look at other packages for good defaults.

Slow tests

The current three tests take 17 minutes to run on Travis, while the entire sklearn test suite runs in 10 minutes.

Add random search

For API checks and baseline purposes, I think it would be nice to have dummy random search method.
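Such a baseline could be as small as the sketch below (the function name and signature are illustrative, not an existing API):

```python
import numpy as np

def dummy_minimize(func, bounds, n_calls=100, random_state=None):
    """Baseline: sample points uniformly at random and keep the best."""
    rng = np.random.default_rng(random_state)
    lows, highs = zip(*bounds)
    X = rng.uniform(lows, highs, size=(n_calls, len(bounds)))
    ys = [func(x) for x in X]
    best = int(np.argmin(ys))
    return X[best], ys[best]

x_best, y_best = dummy_minimize(lambda x: (x[0] - 1.0) ** 2, [(-2.0, 2.0)],
                                n_calls=200, random_state=0)
```

Because it shares the objective/bounds calling convention of the model-based minimizers, it doubles as an API smoke test and a sanity baseline they should beat.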

Incompatibility with Python 2.7.x?

I noticed that when run with 2.7.11, there is a syntax error:

in space.py
def __init__(self, *categories, prior=None):
SyntaxError: invalid syntax

The regular argument cannot come after the * argument; simply reversing these parameters causes other issues in space.py.
This is in accordance with an accepted Python enhancement proposal (keyword-only arguments, which are Python 3-only syntax).

Do the devs plan on making skopt compatible with 2.7.x?

Matern kernel returning a zeroed out covariance matrix

I have been playing around with the code for some time and it doesn't seem to work, at least for the test example (or only seems to by chance):

from math import cos, pi
from skopt import gp_minimize

a = 1
b = 5.1 / (4 * pi**2)
c = 5.0 / pi
r = 6
s = 10
t = 1 / (8*pi)

def branin(x):
    x1 = x[0]
    x2 = x[1]
    return a * (x2 - b * x1**2 + c * x1 - r)**2 + s * (1 - t) * cos(x1) + s

bounds = [[-5, 10], [0, 15]]
res = gp_minimize(
    branin, bounds, search='sampling', maxiter=2, random_state=0,
    acq='UCB')

More specifically this line https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/gaussian_process/gpr.py#L282 returns a matrix of zeros.

This is because the optimized scale parameter of the Matern kernel is 1e-5, which sets the covariance between all the samples to be zero.

Should we try a different approach, other than scaling the parameters down to 0 and 1?

@glouppe @betatim What are your thoughts on this?

Ask-and-tell interface?

Hi, I just discovered this project. I wonder whether it is really the goal to provide only a scipy-like interface or whether you think it would be possible to provide an ask-and-tell interface, too. That would be much more convenient for use cases in which the optimization process is controlled actually by the objective function.

Local-search based technique to optimize the acquisition function of trees and friends

We cannot optimize the acquisition function of trees using conventional gradient / second-order methods. SMAC does it in the way described on page 13 of http://www.cs.ubc.ca/~hutter/papers/10-TR-SMAC.pdf

Some terminology.

  1. If we have p parameters and a parameter configuration, a one-exchange neighbourhood is defined as a parameter configuration that is different in exactly one parameter.
  2. For a parameter (say X) that is continuous, this neighbour is sampled from a Gaussian centered at X with std 0.2, keeping all other parameters constant.
  3. For a parameter (say Y) that is categorical, this neighbour takes any other category of Y, keeping all other parameters constant.

Seems like they do a multi-start local search with 10 points. For each local search:

  1. Initialize a random point p
  2. Check the acquisition values at "4X + Y" neighbours.
  3. If none of the neighbours has a lower acquisition value than p, then terminate.
    Else, reassign p to the neighbour with the minimum acquisition value.

Then return the minimum of all the 10 local searches.
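The procedure can be sketched for purely continuous parameters (categorical neighbours and SMAC's exact neighbour counts omitted; all names below are illustrative):

```python
import numpy as np

def local_search(acq, x0, bounds, rng, n_neighbours=4, max_steps=100):
    """Greedy descent over Gaussian one-exchange neighbours (std 0.2)."""
    x = np.array(x0, dtype=float)
    fx = acq(x)
    for _ in range(max_steps):
        neighbours = []
        for dim in range(len(x)):            # perturb one parameter at a time
            for _ in range(n_neighbours):
                n = x.copy()
                n[dim] = np.clip(rng.normal(x[dim], 0.2), *bounds[dim])
                neighbours.append(n)
        values = [acq(n) for n in neighbours]
        best = int(np.argmin(values))
        if values[best] >= fx:               # no neighbour improves: terminate
            break
        x, fx = neighbours[best], values[best]
    return x, fx

def multistart_local_search(acq, bounds, n_starts=10, seed=0):
    """Run the local search from several random starts, return the best."""
    rng = np.random.default_rng(seed)
    lows, highs = zip(*bounds)
    starts = rng.uniform(lows, highs, size=(n_starts, len(bounds)))
    results = [local_search(acq, s, bounds, rng) for s in starts]
    return min(results, key=lambda r: r[1])

x_best, value = multistart_local_search(lambda x: x[0] ** 2, [(-2.0, 2.0)])
```

The multi-start is what makes this usable on the piecewise-constant acquisition surfaces trees produce, since any single greedy run can get stuck on a flat plateau.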

Explicit is better than implicit

This may sound incredibly formal, bureaucratic and heavy; try to read to the end before panicking.

I think one of the first things we should do is make sure we are all on the same page on how the project will work. I suggest the following:

  • all changes by PR
  • a PR solves one problem (don't mix problems together in one PR) with the minimal set of changes
  • describe why you are proposing the changes you are proposing
  • try to not rush changes (definition of rush depends on how big your changes are)
  • someone else has to merge your PR
  • new code needs to come with a test
  • no merging if travis is red

I don't see this as rules to be enforced by 🚓 but as guidelines.

I think it is important to briefly write down these kinds of "obvious" things if you want to start a project that is long term (not just a hackathon hack) with people you haven't worked with much. Basically: explicit is better than implicit 😀

Refactor minimize functions to make use of sampling API

Now that #75 has been merged, we should refactor all *_minimize functions in order to make use of the new API.

We may need to make a few internal changes, since sample_points returns values in the original space, while we will need to feed the transformed values to the optimizer instead.

I would expect something along the following lines:

  1. Make _check_grid a public util returning the corresponding list of Distribution objects.
  2. sample_grid(grid, n_samples)
  3. warp(grid, samples): from original to warped space
  4. unwarp(grid, samples): from warped to original space
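Under the proposal above, warp/unwarp could look as follows for linearly bounded dimensions (a sketch; log-scaled and categorical dimensions omitted, and only the function names are taken from the list above):

```python
import numpy as np

def warp(grid, samples):
    """Map samples from the original space to [0, 1] per dimension."""
    lows = np.array([lo for lo, hi in grid], dtype=float)
    highs = np.array([hi for lo, hi in grid], dtype=float)
    return (np.asarray(samples, dtype=float) - lows) / (highs - lows)

def unwarp(grid, samples):
    """Inverse of warp: map [0, 1] values back to the original space."""
    lows = np.array([lo for lo, hi in grid], dtype=float)
    highs = np.array([hi for lo, hi in grid], dtype=float)
    return lows + np.asarray(samples, dtype=float) * (highs - lows)

grid = [(-2.0, 2.0), (0.0, 10.0)]
warped = warp(grid, [[0.0, 5.0]])
```

Keeping warp and unwarp as exact inverses is what lets the optimizer work entirely in the unit cube while users keep seeing values in their original units.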

0.1 release

I would like to get the 0.1 release out before school starts again (i.e. September). This is just a parent issue to track the blockers.

  • Consistent and backward-compatible API. Addressed by #75
  • SMAC #57
  • (Local search technique that performs better than random sampling on piecewise constant predict functions (#74), postponed till we have a conclusion in #109)
  • Examples (#65)
  • Support for Python 2.7 (#87)
  • Consistent return types #86
  • Name collision #76 (punting for now)
  • Need a logo #107 (code speaks louder than images, no logo required)
  • release mechanics #133
  • better defaults #166
  • merge #145
  • merge #169
  • maybe merge #162 (nice to have but don't hold the 🚄 for it)
  • stop this list from getting ever longer 📋

Is there anything else?

Run examples as part of the CI

Avoid broken examples like what happened in #29 by running them as part of Travis. Not sure if there is anything more useful we can do than check that they exit with code 0.

GBRT based minimization

GBRT now returns the quantiles. We can get a naive approximation to the std by subtracting the 50th quantile from the 68th quantile and feeding it to the acquisition functions.

Diagnostic plots 📈📊📉

We should add some convenience functions that make plots similar to what is in the examples for "generic" problems, to help people debug why things aren't converging, why they converge to the value they do, etc.

Maybe use something in the style of https://github.com/dfm/corner.py to show N>2 spaces, where the samples are, what the acquisition function looks like, ...

Project name

If we plan on getting serious with this, we should think of a better project name.

One that I like would be scikit-optimize, abbreviated as skopt.

CC: @MechCoder @betatim
