dwave-examples / qboost Goto Github PK

Solve a binary classification problem with Qboost

License: Apache License 2.0

Python 100.00%

qboost's Introduction

QBoost

In machine learning, boosting methods are used to combine a set of simple, "weak" predictors in such a way as to produce a more powerful, "strong" predictor. The QBoost method selects weak classifiers by minimizing the squared difference between observed and predicted labels. The resulting optimization problem can be formulated as a binary quadratic model and solved on the D-Wave system. This approach has been explored previously in the literature, for example Refs. [1-4].

This example illustrates the application of a QBoost formulation on two illustrative data sets.

Usage

The demonstration includes two sample data sets, which can be run with the following commands:

python demo.py blobs

Or:

python demo.py digits

Blobs data set

This is a simple synthetically generated data set for illustrative purposes, based on the scikit-learn function sklearn.datasets.make_blobs. Data for classification are randomly generated with two classes and an arbitrary number of samples, total features, and informative features. The features are randomly generated from Gaussian distributions. For each informative feature, the mean value of the feature differs between the two target classes, while the non-informative features are sampled from the same probability distribution in each class.

Each run of the demo will randomly generate the data set and split it into training and test sets. The results of the demo will print the indices of the informative features, the features that are selected by QBoost, and the accuracy of the resulting ensemble classifier on the test set.

The values of the number of samples, number of features, and number of informative features can be controlled using command line arguments, as described in the help:

python demo.py blobs -h

Digits data set

The digits data is based on the well-known handwritten digits data set. Each instance is a digitized image of a handwritten digit. The images consist of 8x8 grayscale pixel values, which are represented as 64 features. The goal is to construct a classifier that can predict the digit that is represented by an image.

This demonstration constructs a binary classification problem using images for any two of the available ten digits, as specified by the --digit1 and --digit2 command line options (the defaults are 0 and 1). As with the blobs data set, the available data are split into training and test sets. The demonstration will print the number of features selected by QBoost, along with the prediction accuracy on the test set.

The --plot-digits option is also provided to display one instance of the images for each of the two selected digits. The displayed images are chosen randomly from the available data.

The following command displays the available options for use with the digits data set:

python demo.py digits -h

Code Overview

Boosting methods construct a strong classifier by intelligently combining a set of weak classifiers. Given a set of N weak classifiers, QBoost solves an optimization problem to select weak classifiers. The objective of this optimization problem is to minimize the strong classifier's squared loss between actual and predicted targets. A regularization term is also included to penalize complex models that include more weak classifiers. The resulting optimization problem can be written as [1]:

where s is an index over training instances, i is an index over features, h_i(x_s) denote the weak classifiers, y_s denote the observed targets, w_i are the weights to be determined, and lambda is the regularization parameter, which multiplies the L0 norm of the weight vector. In this demonstration, the weights are treated as binary variables that take a value of either 0 or 1. The weak classifiers are constructed from a series of single-feature decision tree rules, also known as decision stumps. Following Refs. [1,4], the output of each weak classifier is scaled to -1/N and +1/N.

By default, the demonstration is run with a fixed value of the regularization parameter, lambda. If the --cross-validation option is used, then a simple parameter sweep is performed, and the value of lambda is selected on the basis of providing the highest accuracy on a validation set. This requires repeatedly solving the optimization problem, and the demonstration may take several minutes to run when this option is used.

Disclaimer

This demo and its code are intended for demonstration purposes only and are not designed for performance. There are a variety of additional considerations, not addressed in the example code, which should be considered as part of a more detailed implementation. These include:

This example uses simple decision stumps as weak classifiers, but in principle the method can be applied to any collection of weak classifiers.
Iteration to select weak classifiers in batches [2].
More rigorous determination of the regularization parameter.
Further evaluation of weak classifier output scaling. Most prior work in the literature scales the weak classifier outputs by the number of weak classifiers. Ref. [4] suggests that scaling by the square root instead is more natural.
This example employs a bit depth of 1, so that the weight associated with each classifier is limited to 0 or 1. It is possible to increase the bit depth, for example as discussed Ref. [1]. Nevertheless, it has been reported that low bit depths can improve performance, particularly on smaller training sets.

References

[1] Neven, H., Denchev, V. S., Rose, G., and Macready, W. G. Training a Binary Classifier with the Quantum Adiabatic Algorithm, 2008, arXiv:0811.0416v1

[2] Neven, H., Denchev, V. S., Rose, G., and Macready, W. G. QBoost: Large Scale Classifier Training with Adiabatic Quantum Optimization, Journal of Machine Learning Research: Workshop and Conference Proceedings, 2012. URL: http://proceedings.mlr.press/v25/neven12/neven12.pdf.

[3] Mott, A., Job, J., Vlimant, J.-R., Lidar, D., and Spiropulu, M. Solving a Higgs optimization problem with quantum annealing for machine learning. Nature, Vol. 550, 2017, doi:10.1038/nature24047.

[4] Boyda, E., Basu, S., Ganguly, S., Michaelis, A., Mukhopadhyay, S., and Nemani, R. R. Deploying a quantum annealing processor to detect tree cover in aerial imagery of California. PLoS One, 2017, doi:10.1371/journal.pone.0172505.

License

Released under the Apache License 2.0. See LICENSE file.

qboost's People

Contributors

Stargazers

Watchers

qboost's Issues

prediction_proba method for classifiers

Application

It is not obvious how to calculate the ROC_AUC and PR_AUC curves from classifiers WeakClassifiers, QBoostClassifier, and QboostPlus.

Proposed Solution

We could implement a prediction_proba method for each classifier analogous to the prediction_proba method from scikit learn.

D-Wave results are being used as weight coefficients and are all are 1's

What is happening?

TL;DR - This codebase demonstrates no value in representing D-Wave Quantum for AI application advancements.

Running the examples and exploring the output weights. What is really happening?

It appears that nothing logically substantial is happening from D-Wave.

After running the examples, the D-Wave output weights are 1's [1 1 1...]. That means nothing is really happening here. There's no value in this. Repeated executions result in varied outcomes ( due to RNG initializations ). However the D-wave final weights remain the same. This provides no additive value to the classic model for training AI.

The readme.md file states that:

This code demonstrates the use of the D-Wave system to solve a binary classification problem using the Qboost algorithm.

However there's no clear alignment with the results of the program output.

python demo.py --wisc
python demo.py --mnist

Possible Considerations

There does not appear to be a mistake here. However the value of D-wave is unclear. We can consider the following:

This demo.py example isn't worthwhile and can't demonstrate D-wave/Quantum's value in AI/ML advancements. Would another example work?
The code isn't demonstrating anything useful. The training models are 100% trained classically and the data exchanged with D-Wave returns 1's [1 1 1...]. This represents no value in the model.
Is it possible that the intent here is to demonstrate that the D-Wave QPU is able to verify the classically trained model. If that is the case, then what value does this example bring to ML/AL advancements?

Output Results Screenshot

CI fails for py38

Looks like scipy is not available.

Preprocessing should be removed

There are several issues with the current preprocessing that is done in demo.py. Currently there are calls to both the sklearn StandardScaler and the Normalizer, although as currently written only Normalizer is used (it overwrites the preceeding calls to StandardScaler). However, Normalizer is not appropriate in this context as it operates by row, re-scaling each individual sample separately, independently of all of the others. Generally speaking, the StandardScaler would be more appropriate in this context (it scales by column (feature)), but in demo.py all of the code is ultimately using decision tree classifiers, so re-scaling the features would have no effect on the results. (Incidentally, the current usage of StandardScaler in the code, even though it is overwritten by the normalizer, is not correct: the test data should be transformed via scaler.transform, not scaler.fit_transform, which is re-computing the transformation based on the test data).

The main takeaway is that both the normalizer and standard scaler preprocessing should be removed from demo.py, leaving a comment that preprocessing is not necessary because all of the weak classifiers are based on decision trees.

Update to work with latest scikit-learn (0.22.1)

When run with the latest version (scikit-learn==0.22.1), the demo fails with:

$ python demo.py --mnist

======================================
Train#: 3333, Test: 1667
Num weak classifiers: 35
Tree depth: 3
Traceback (most recent call last):
  File "demo.py", line 196, in <module>
    clfs = train_model(X_train, y_train, X_test, y_test, 1.0)
  File "demo.py", line 80, in train_model
    X_train = centerer.fit_transform(X_train)
  File "/usr/local/lib/python3.7/site-packages/sklearn/base.py", line 571, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/usr/local/lib/python3.7/site-packages/sklearn/preprocessing/_data.py", line 2033, in fit
    .format(K.shape[0], K.shape[1]))
ValueError: Kernel matrix must be a square matrix. Input is a 3333x784 matrix.

Current QBoost implementation is just a diminished AdaBoost

As applied to the given demonstration problems, the QBoost algorithm does essentially nothing. AdaBoost is actually run first to pre-select a set of weak classifiers, which are then provided to the QBoost algorithm. With the current settings and demonstration problems, the results are simply that QBoost includes all of the classifiers that AdaBoost provides to it (confirm by seeing the list of "1" weights in the output for QBoost). In other words, for these problems, QBoost could be replaced by the following anti-algorithm:

Run AdaBoost
Change all of the weights that AdaBoost came up with to 1

Note that the screen output is misleading in terms of the method comparisons. For example, runs with the "mnist" data set often show that "QBoost" is performing better than Adaboost. This is misleading because what is labeled Adaboost is actually sklearn Adaboost running with weak decision tree classifiers at depth 1, whereas QBoost is pulling from a custom Adaboost implementation that uses weak decision tree classifiers with depth 3 (confusingly, this custom Adaboost implementation is labeled "DecisionTree" in the output; see Issue #13).

The key questions are what weak classifiers should be considered in the QBoost algorithm, and what actually is the definition of the QBoost algorithm? The README refers to the earlier 2008 paper (https://arxiv.org/pdf/0811.0416.pdf). Aside from not actually using the QBoost terminology, this paper presents the algorithm as drawing from all possible "degree 1 and 2 decision stumps" (basically decision rules using either a single variable or a product of two variables). As described, this produces a large number of variables if doing a one-shot global optimization: 930 variables for the 30-feature case and many more for the 784-feature case. Compare these numbers to the 35 variables being currently used because QBoost is being fed weak classifiers pre-selected by AdaBoost.

A different version of the method is described in the more recent 2012 paper (http://proceedings.mlr.press/v25/neven12/neven12.pdf), which introduces the name QBoost. This paper reduces the problem size by using "inner" and "outer" loops that pre-select the weak classifiers as detailed by Algorithms 1 and 2. Note that neither paper discusses using AdaBoost to pre-select the weak classifiers, and if one is going to do that, it is unclear what the motivation for QBoost is (the 2008 paper contrasts QBoost as a "global optimization" vs the "greedy" AdaBoost method, but this is nullified if we simply use AdaBoost first).

In conclusion, my suggestions are the following:

The code should be reworked to use an implementation of QBoost that does not simply re-use the classifiers selected by AdaBoost. Probably the implementation should draw from the 2012 paper, otherwise it is unclear how to select a set of weak classifiers and achieve a reasonable problem size.
Any comparisons against other boosting methods such as AdaBoost should use the same pool of weak classifiers (as done in the papers mentioned above). Otherwise, it is not an apples-to-apples comparison.

Both the 2008 and 2012 papers show improved performance relative to AdaBoost when using the same pool of weak classifiers. It would be nice to illustrate that through this demo as well.

Error from pip's dependency resolver

When running this example with Ocean 4 in Leap IDE, see the following message:

Installing collected packages: tabulate, scikit-learn, matplotlib
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
imbalanced-learn 0.8.1 requires scikit-learn>=0.24, but you have scikit-learn 0.23.1 which is incompatible.
Successfully installed matplotlib-3.3.4 scikit-learn-0.23.1 tabulate-0.8.9

Inaccurate labeling of WeakClassifiers/DecisionTree methods

In qboost.py, the WeakClassifiers class is misleading because its fit and predict methods actually implement the AdaBoost algorithm. In demo.py, this method is then compared with AdaBoost from sklearn. In the demo.py output, "Adaboost" refers to the sklearn implementation and "Decision tree" refers to the Adaboost implementation from the WeakClassifiers class. As far as I can tell, the only real difference between the two is that the demo runs the WeakClassifiers AdaBoost method with a tree depth of 3, whereas the call to sklearn's AdaBoost model uses the default, which is a tree depth of 1.

Need to review the best way to address this. Some points to consider:

The screen output from demo.py should be updated to use a more accurate description than "Decision tree"
Perhaps remove one of the two AdaBoost implementations from demo.py
Or, if both are kept, consider using the same tree depth for consistency
Use a better class name than WeakClassifiers
Update comments/docstrings to describe what is actually being done by the different methods