
q2-sample-classifier's Introduction

q2-sample-classifier

This is a QIIME 2 plugin. For details on QIIME 2, see https://qiime2.org.

q2-sample-classifier's People

Contributors

adamovanja, cduvallet, chriskeefe, colinvwood, david-rod, ebolyen, gregcaporaso, hagenjp, jairideout, jiung-wen, justin212k, leo-simpson, lizgehret, misialq, nbokulich, oddant1, q2d2, thermokarst, trellixvulnteam, turanoo, valentynbez


q2-sample-classifier's Issues

increase test coverage

This plugin currently has around 90% unit test coverage - based on the coverage report, it looks like some important functionality still needs tests.

all functions that don't need to be public should be private

Some of these, I think, are: select_estimator (which might be better named _construct_estimator), svm_set, extract_important_features (is there a more descriptive name for this?), and warn_feature_selection. You should prefix these with an _ to indicate that they should be considered private.
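A minimal sketch of the suggested renames (the signatures here are placeholders, not the plugin's actual ones):

```python
# Leading underscores mark these as private; select_estimator also gets
# the more descriptive name suggested above.
def _construct_estimator(estimator_name):       # was: select_estimator
    ...

def _svm_set(kernel):                           # was: svm_set
    ...

def _extract_important_features(estimator):     # was: extract_important_features
    ...

def _warn_feature_selection():                  # was: warn_feature_selection
    ...
```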

Drop README tutorial(s)

As with the other plugins in QIIME 2, we provide official tutorials as part of https://github.com/qiime2/docs, and unofficial tutorials on the forum. We should clear out this README of the existing tutorial content and ensure that things get moved to the appropriate location (docs or forum).

support custom color schemes

and explicitly set a default color scheme

otherwise, some users wind up with really bad default color schemes:

[screenshot: example of a bad default color scheme]

Replace category with regression formula in `regress_samples`

Improvement Description
It is very useful to simultaneously analyze multiple metadata categories and their interactions.

Proposed Behavior
Fortunately, this is only a two-line change, but it would give users much more flexibility when building comprehensive models.
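A hedged sketch of what formula support could look like (the issue doesn't name a library; patsy is one option, and the column names here are made up):

```python
# Build a design matrix from multiple metadata columns and their
# interaction using an R-style formula.
import pandas as pd
from patsy import dmatrix

metadata = pd.DataFrame({
    'month': [1, 2, 3, 4],
    'delivery': ['Vaginal', 'Cesarean', 'Vaginal', 'Cesarean']})

# 'month * delivery' expands to month + delivery + month:delivery
X = dmatrix('month * delivery', metadata, return_type='dataframe')
```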

References
(see here)

`detect_outliers` suggestions

Inputs support two different modes

Could detect_outliers be split into a couple of different methods, tailored to the different modes? All of the work could happen in a single function under the hood, but if mode-specific options need to be added in the future, a single method could get confusing for users. Even if you don't want to make this change, you should catch the case where the user provides only one of subset_category or subset_value and raise an error (this will trip users up otherwise).
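A minimal sketch of that validation (the parameter names come from the current signature; the helper name is made up):

```python
def _validate_subset_params(subset_category, subset_value):
    # Providing exactly one of the two parameters is almost certainly a
    # user error, so fail loudly instead of silently ignoring it.
    if (subset_category is None) != (subset_value is None):
        raise ValueError(
            'subset_category and subset_value must both be provided, or '
            'both be omitted.')
```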

possibly filtered to remove samples

It might save users some trouble if this filtering could be part of the method, so users don't have to do it themselves. You could import filter_samples from q2-feature-table to support that.
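A hedged sketch of what that filtering could look like at the pandas level (the helper name is made up, and the plugin's actual data structures may differ, e.g., it may operate on biom tables):

```python
import pandas as pd

def _match_samples(table: pd.DataFrame, metadata: pd.DataFrame):
    """Drop samples missing from either the feature table or metadata."""
    shared_ids = table.index.intersection(metadata.index)
    return table.loc[shared_ids], metadata.loc[shared_ids]
```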

This series may be added to a metadata map

It could actually just be used as metadata directly now. See the (really) new metadata tutorial. Also, we're trying to drop the "mapping/map" nomenclature, since it's so generic. Could you refer to it as "sample metadata" or a "sample metadata file"?

Your big docstring should ultimately become a tutorial, since users won't find it as a docstring. I also think you should shorten the description for this in plugin_setup.py in favor of a tutorial: a long description probably won't render nicely in the different interfaces, and it's really intended to be a description rather than usage documentation. You could move this info exclusively to the README for the moment.

inliers have value 1, outliers have value -1

Looks like this is outdated?

add options to pickle/output estimators

This will require QIIME 2 Pipelines to allow output of a visualization and an artifact (the pickled estimator) in a single action.

However, in the meantime we could prototype this with a fit_model method.
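A minimal sketch of the persistence step that such a method would wrap (fit_model itself is hypothetical; only scikit-learn and joblib calls are used here):

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=50, random_state=0)
estimator = RandomForestClassifier(random_state=0).fit(X, y)

# Serialize the trained estimator to disk, then load it back later to
# predict on additional unknown samples.
joblib.dump(estimator, 'estimator.pkl')
estimator = joblib.load('estimator.pkl')
```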

add r-squared to regression outputs

These currently report r. It would be good to also report r**2, both to avoid users mistaking the reported r value for an r-squared value and because r-squared is important output in its own right.
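For example (a sketch; scipy's linregress already returns r, so reporting r**2 is a one-line addition):

```python
from scipy.stats import linregress

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

slope, intercept, r, p_value, stderr = linregress(y_true, y_pred)
r_squared = r ** 2  # report alongside r so the two can't be confused
```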

add versioneer

This will be necessary for the plugin to transition to the core distribution. See setup.py, __init__.py, and versioneer.py in one of the other plugins for how this works.
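The relevant setup.py hooks look roughly like this (standard versioneer usage, which the other plugins follow):

```python
import versioneer
from setuptools import setup, find_packages

setup(
    name='q2-sample-classifier',
    # versioneer derives the version string from git tags at build time
    version=versioneer.get_version(),
    cmdclass=versioneer.get_cmdclass(),
    packages=find_packages(),
)
```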

expand unit tests

The code is a bit under-tested at the moment. Before release, the tests should be expanded to cover some toy data sets with obvious answers, as well as boundary cases and invalid inputs. You could set a seed (as you note in your comments) to test for specific results. Let me know if you need input on testing, in which case we could pick a time to pair-program the tests.
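A minimal sketch of the kind of seeded test I have in mind (toy data with an obvious answer; the names here are illustrative, not the plugin's API):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def test_obvious_separation():
    rng = np.random.RandomState(42)  # fixed seed -> reproducible results
    # Two clearly separated groups of samples.
    X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(10, 1, (20, 5))])
    y = np.array([0] * 20 + [1] * 20)
    estimator = RandomForestClassifier(random_state=42).fit(X, y)
    assert estimator.score(X, y) == 1.0  # toy data is perfectly separable
```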

modifications to `description` and other help text

Predict {0} sample metadata classes using -> Predict {0} sample metadata using - "classes" doesn't make sense for continuous data

Outputs ... and optionally a trained estimator to use on additional unknown samples - is that true? I don't see an option for that, and I don't think it's currently possible to do with QIIME 2.

Replace ? with . for consistency:

  • Automatically optimize input feature selection using recursive feature elimination?
  • Automatically tune hyperparameters using random grid search?

Ouput consists of predicted coordinates -> Output consists of predicted coordinates

'Plugin for machine learning prediction of sample data.' -> 'Plugin for machine learning prediction of sample metadata.'

`predict_coordinates` suggestions

Would it be better to do this as a multi-output regression? Currently each dimension is predicted separately.

I think it would make sense to predict both axes together, rather than independently, since the two axes could covary rather than varying independently. Have you been able to find any literature on this problem? It's not something I've ever tried to do before, but it seems like there are probably some known best practices.
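A hedged sketch of joint prediction with scikit-learn (RandomForestRegressor handles multi-output targets natively; the data here is random and just illustrates the shapes):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(50, 10)   # feature table: 50 samples x 10 features
Y = rng.rand(50, 2)    # targets: latitude and longitude columns

# One model fit on both axes at once, instead of two independent models.
estimator = RandomForestRegressor(random_state=0).fit(X, Y)
coords = estimator.predict(X)  # shape (50, 2): both axes per sample
```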

or precise location within 2-D physical space, such as the built environment

Not something you need to deal with now, but would 3-D physical space ultimately make more sense to support, with the second and third dimensions being optional?

The latitude and longitude parameter names might make more sense as something like axis1_category and axis2_category (still with the same default values) if you're thinking you'd like to use this for built environments, etc., where lat/long probably aren't the right axes. Or, just focus the docs for the moment on predicting lat/long. In either case, I think _category should be part of the parameter names, for consistency with your other methods and so it's clearer to users what these values are supposed to be.

visualization comments

  1. I think these category labels would be a little clearer if they were centered in these plots:

[screenshot: clustermap with off-center category labels]

  2. It looks like there are some \n characters in the "parameters" section that are not being interpreted correctly. It might be better to drop this parameters section entirely, since all of that information is already in the provenance.

[screenshot: parameters section with literal \n characters]

  3. What do you think of putting the tsv link after the Feature Importance header, rather than after the table? It's easy to miss when it's way down at the bottom of the page.

  4. The leading zero in these regress-* visualizations is a little confusing - could you drop it? (I think it's the index of the dataframe that those results are in.)

[screenshot: regression results table with a leading-zero index]

Also, would it be possible to make those column headers clearer (e.g., expand MSE, and use P-value instead of P-val)?

  5. Could you change Accuracy score to Overall accuracy, and put it with the table of per-category accuracies? Especially if you end up removing the parameters section (see comment 2), this info will be hanging out on its own and look a little awkward. In the regress visualizers, I think you could also then drop the MSE (is it the same as the accuracy score?).

Reduce code replication

A bunch of functions are almost identical, differing only in the scikit-learn pipeline that they provide access to. We could perhaps replace all of them with a factory function, similar to what was done in q2-feature-classifier.
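A minimal sketch of the factory pattern (the names are illustrative; the real version would also wire up the visualizer and plugin registration):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

def _classify_factory(estimator_cls):
    """Return a classify function bound to one scikit-learn estimator."""
    def classify(X, y, **estimator_params):
        return estimator_cls(**estimator_params).fit(X, y)
    return classify

# One line per estimator replaces each near-identical function.
classify_random_forest = _classify_factory(RandomForestClassifier)
classify_svc = _classify_factory(SVC)
```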

remove `install_requires` from `setup.py`, or add missing dependencies

install_requires seems to be missing some dependencies (e.g., qiime2, but I'm not sure what others, if any). I think you can just remove install_requires altogether, though: this is intended to be a QIIME 2 plugin only, not a stand-alone package, so its dependencies will be installed through conda. That said, it might make sense to wait until you're ready to release before removing it.

could the amount of test data be reduced?

There are a lot of qza and qzv files, which is going to make the repository large. Could you reduce the number of them? If they're being used in tutorials but not in unit tests, we could ultimately host them elsewhere (e.g., on S3).

README should be ported to a new qiime2 tutorial with release of this plugin

It would be cool if you could use the moving pictures tutorial data to illustrate classify-samples, as that would make it really easy to teach this in workshops. The Atacama tutorial data could be useful for regress-samples (the AverageSoilRelativeHumidity and pH metadata columns are the relevant ones), but you might have a better data set for that in which case we could just prep some new tutorial data.

`n_jobs` should always default to 1

I noticed here and here that the default is 4, but that would be bad on, for example, a dual-core system. To be safe, make sure that the default value for n_jobs is 1 throughout the code base.

Possible stochastic failure related to bad linkage matrix

Bug Description
It appears that it is possible for the visualizer to fail when it finds a "perfectly bad" split of the test and training data.

Expected Behavior
@nbokulich says that this is very unlikely to happen, but we should look into reproducing it and raising a more informative error. If a user runs into this, they should be able to just re-run their command.

Screenshots

Test failure/traceback
_____________________ EstimatorsTests.test_maturity_index ______________________

self = <q2_sample_classifier.tests.test_classifier.EstimatorsTests testMethod=test_maturity_index>

    def test_maturity_index(self):
        maturity_index(self.temp_dir.name, self.table_ecam_fp, self.md_ecam_fp,
                       category='month', group_by='delivery', n_jobs=1,
>                      control='Vaginal', test_size=0.4)

test-env/lib/python3.5/site-packages/q2_sample_classifier/tests/test_classifier.py:250: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
test-env/lib/python3.5/site-packages/q2_sample_classifier/classify.py:145: in maturity_index
    accuracy, output_dir, maz_stats=maz_stats)
test-env/lib/python3.5/site-packages/q2_sample_classifier/utilities.py:412: in _visualize_maturity_index
    g = _clustermap_from_dataframe(top, metadata, group_by, category)
test-env/lib/python3.5/site-packages/q2_sample_classifier/visuals.py:69: in _clustermap_from_dataframe
    row_cluster=False)
test-env/lib/python3.5/site-packages/seaborn/matrix.py:1301: in clustermap
    **kwargs)
test-env/lib/python3.5/site-packages/seaborn/matrix.py:1131: in plot
    row_linkage=row_linkage, col_linkage=col_linkage)
test-env/lib/python3.5/site-packages/seaborn/matrix.py:1032: in plot_dendrograms
    axis=1, ax=self.ax_col_dendrogram, linkage=col_linkage)
test-env/lib/python3.5/site-packages/seaborn/matrix.py:746: in dendrogram
    label=label, rotate=rotate)
test-env/lib/python3.5/site-packages/seaborn/matrix.py:567: in __init__
    self.dendrogram = self.calculate_dendrogram()
test-env/lib/python3.5/site-packages/seaborn/matrix.py:644: in calculate_dendrogram
    color_threshold=-np.inf)
test-env/lib/python3.5/site-packages/scipy/cluster/hierarchy.py:2296: in dendrogram
    is_valid_linkage(Z, throw=True, name='Z')
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

Z = array([], shape=(0, 4), dtype=float64), warning = False, throw = True
name = 'Z'

    def is_valid_linkage(Z, warning=False, throw=False, name=None):
        """
        Checks the validity of a linkage matrix.
    
        A linkage matrix is valid if it is a two dimensional array (type double)
        with :math:`n` rows and 4 columns.  The first two columns must contain
        indices between 0 and :math:`2n-1`. For a given row ``i``, the following
        two expressions have to hold:
    
        .. math::
    
            0 \\leq \\mathtt{Z[i,0]} \\leq i+n-1
            0 \\leq Z[i,1] \\leq i+n-1
    
        I.e. a cluster cannot join another cluster unless the cluster being joined
        has been generated.
    
        Parameters
        ----------
        Z : array_like
            Linkage matrix.
        warning : bool, optional
            When True, issues a Python warning if the linkage
            matrix passed is invalid.
        throw : bool, optional
            When True, throws a Python exception if the linkage
            matrix passed is invalid.
        name : str, optional
            This string refers to the variable name of the invalid
            linkage matrix.
    
        Returns
        -------
        b : bool
            True if the inconsistency matrix is valid.
    
        """
        Z = np.asarray(Z, order='c')
        valid = True
        name_str = "%r " % name if name else ''
        try:
            if type(Z) != np.ndarray:
                raise TypeError('Passed linkage argument %sis not a valid array.' %
                                name_str)
            if Z.dtype != np.double:
                raise TypeError('Linkage matrix %smust contain doubles.' % name_str)
            if len(Z.shape) != 2:
                raise ValueError('Linkage matrix %smust have shape=2 (i.e. be '
                                 'two-dimensional).' % name_str)
            if Z.shape[1] != 4:
                raise ValueError('Linkage matrix %smust have 4 columns.' % name_str)
            if Z.shape[0] == 0:
>               raise ValueError('Linkage must be computed on at least two '
                                 'observations.')
E                                ValueError: Linkage must be computed on at least two observations.

test-env/lib/python3.5/site-packages/scipy/cluster/hierarchy.py:1459: ValueError
==================== 1 failed, 12 passed in 150.50 seconds =====================

Comments
This is probably going to be difficult to fix.
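One possible first step is a guard that replaces the opaque scipy error with an actionable one (a sketch; the helper name is made up):

```python
import pandas as pd

def _validate_clustermap_input(df: pd.DataFrame):
    # seaborn's clustermap fails with an empty linkage matrix when fewer
    # than two observations survive the train/test split.
    if df.shape[0] < 2:
        raise ValueError(
            'Fewer than two observations remain after splitting the data; '
            're-running the command (the split is randomized) should '
            'resolve this.')
```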

References
It appears
