
q2-sample-classifier's Introduction

q2-sample-classifier

This is a QIIME 2 plugin. For details on QIIME 2, see https://qiime2.org.

q2-sample-classifier's People

Contributors

adamovanja, cduvallet, chriskeefe, colinvwood, david-rod, ebolyen, gregcaporaso, hagenjp, jairideout, jiung-wen, justin212k, leo-simpson, lizgehret, misialq, nbokulich, oddant1, q2d2, thermokarst, trellixvulnteam, turanoo, valentynbez


q2-sample-classifier's Issues

increase test coverage

This plugin currently has around 90% unit test coverage - based on the coverage report, it looks like some important functionality still needs tests.

all functions that don't need to be public should be private

Some of these, I think, are: select_estimator (which might be better named _construct_estimator), svm_set, extract_important_features (is there a more descriptive name for this?), and warn_feature_selection. You should prefix these with an _ to indicate that they should be considered private.
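A minimal sketch of the suggested renames (the signatures here are placeholders, not the plugin's actual ones):

```python
# Leading underscores mark these as private; select_estimator also gets
# the more descriptive name suggested above.
def _construct_estimator(estimator_name):       # was: select_estimator
    ...

def _svm_set(kernel):                           # was: svm_set
    ...

def _extract_important_features(estimator):     # was: extract_important_features
    ...

def _warn_feature_selection():                  # was: warn_feature_selection
    ...
```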

Drop README tutorial(s)

As with the other plugins in QIIME 2, we provide official tutorials as part of https://github.com/qiime2/docs, and unofficial tutorials on the forum. We should clear out this README of the existing tutorial content and ensure that things get moved to the appropriate location (docs or forum).

support custom color schemes

and explicitly set a default color scheme

otherwise, some users wind up with really bad default color schemes:

[screenshot: example of a bad default color scheme]

Replace category with regression formula in `regress_samples`

Improvement Description
It is very useful to simultaneously analyze multiple metadata categories and their interactions.

Proposed Behavior
Fortunately, this is only a two-line change, but it would give users much more flexibility when building comprehensive models.
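A hedged sketch of what formula support could look like (the issue doesn't name a library; patsy is one option, and the column names here are made up):

```python
# Build a design matrix from multiple metadata columns and their
# interaction using an R-style formula.
import pandas as pd
from patsy import dmatrix

metadata = pd.DataFrame({
    'month': [1, 2, 3, 4],
    'delivery': ['Vaginal', 'Cesarean', 'Vaginal', 'Cesarean']})

# 'month * delivery' expands to month + delivery + month:delivery
X = dmatrix('month * delivery', metadata, return_type='dataframe')
```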

References
(see here)

`detect_outliers` suggestions

Inputs support two different modes

Could detect_outliers be split into a couple of different methods, tailored to the different modes? All of the work could happen in a single function under the hood, but if mode-specific options need to be added in the future, a single method could get confusing for users. Even if you don't want to make this change, you should catch the case where the user provides only one of subset_category or subset_value and raise an error (this will trip users up otherwise).
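A minimal sketch of that validation (the parameter names come from the current signature; the helper name is made up):

```python
def _validate_subset_params(subset_category, subset_value):
    # Providing exactly one of the two parameters is almost certainly a
    # user error, so fail loudly instead of silently ignoring it.
    if (subset_category is None) != (subset_value is None):
        raise ValueError(
            'subset_category and subset_value must both be provided, or '
            'both be omitted.')
```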

possibly filtered to remove samples

It might save users some trouble if this filtering could be part of the method, so users don't have to do it themselves. You could import filter_samples from q2-feature-table to support that.
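A hedged sketch of what that filtering could look like at the pandas level (the helper name is made up, and the plugin's actual data structures may differ, e.g., it may operate on biom tables):

```python
import pandas as pd

def _match_samples(table: pd.DataFrame, metadata: pd.DataFrame):
    """Drop samples missing from either the feature table or metadata."""
    shared_ids = table.index.intersection(metadata.index)
    return table.loc[shared_ids], metadata.loc[shared_ids]
```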

This series may be added to a metadata map

It could actually just be used as metadata directly now. See the (really) new metadata tutorial. Also, we're trying to drop the "mapping/map" nomenclature, since it's so generic. Could you refer to it as "sample metadata" or a "sample metadata file"?

Your big docstring should ultimately become a tutorial, since users won't find it as a docstring. I also think you should shorten the description for this in plugin_setup.py in favor of a tutorial: a long description probably won't render nicely in the different interfaces, and it's really intended to be a description rather than usage documentation. You could move this info exclusively to the README for the moment.

inliers have value 1, outliers have value -1

Looks like this is outdated?

add options to pickle/output estimators

This will require QIIME 2 Pipelines to allow output of a visualization and an artifact (the pickled estimator) in a single action.

However, in the meantime we could prototype this with a fit_model method.
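A minimal sketch of the persistence step that such a method would wrap (fit_model itself is hypothetical; only scikit-learn and joblib calls are used here):

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=50, random_state=0)
estimator = RandomForestClassifier(random_state=0).fit(X, y)

# Serialize the trained estimator to disk, then load it back later to
# predict on additional unknown samples.
joblib.dump(estimator, 'estimator.pkl')
estimator = joblib.load('estimator.pkl')
```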

add r-squared to regression outputs

These currently report r. It would be good to also report r**2, both to avoid users mistaking the reported r value for an r-squared value and because r-squared is important output in its own right.
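For example (a sketch; scipy's linregress already returns r, so reporting r**2 is a one-line addition):

```python
from scipy.stats import linregress

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

slope, intercept, r, p_value, stderr = linregress(y_true, y_pred)
r_squared = r ** 2  # report alongside r so the two can't be confused
```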

add versioneer

This will be necessary for the plugin to transition to the core distribution. See setup.py, __init__.py, and versioneer.py in one of the other plugins for how this works.
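The relevant setup.py hooks look roughly like this (standard versioneer usage, which the other plugins follow):

```python
import versioneer
from setuptools import setup, find_packages

setup(
    name='q2-sample-classifier',
    # versioneer derives the version string from git tags at build time
    version=versioneer.get_version(),
    cmdclass=versioneer.get_cmdclass(),
    packages=find_packages(),
)
```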

expand unit tests

The code is a bit under-tested at the moment. Before release, the tests should be expanded to cover some toy data sets with obvious answers, as well as boundary cases and invalid inputs. You could set a seed (as you note in your comments) to test for specific results. Let me know if you need input on testing, in which case we could pick a time to pair-program the tests.
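A minimal sketch of the kind of seeded test I have in mind (toy data with an obvious answer; the names here are illustrative, not the plugin's API):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def test_obvious_separation():
    rng = np.random.RandomState(42)  # fixed seed -> reproducible results
    # Two clearly separated groups of samples.
    X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(10, 1, (20, 5))])
    y = np.array([0] * 20 + [1] * 20)
    estimator = RandomForestClassifier(random_state=42).fit(X, y)
    assert estimator.score(X, y) == 1.0  # toy data is perfectly separable
```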

modifications to `description` and other help text

Predict {0} sample metadata classes using -> Predict {0} sample metadata using - "classes" doesn't make sense for continuous data

Outputs ... and optionally a trained estimator to use on additional unknown samples - is that true? I don't see an option for that, and I don't think it's currently possible to do with QIIME 2.

Replace ? with . for consistency:

  • Automatically optimize input feature selection using recursive feature elimination?
  • Automatically tune hyperparameters using random grid search?

Ouput consists of predicted coordinates -> Output consists of predicted coordinates

'Plugin for machine learning prediction of sample data.' -> 'Plugin for machine learning prediction of sample metadata.'

`predict_coordinates` suggestions

Would it be better to do this as a multi-output regression? Currently each dimension is predicted separately.

I think it would make sense to predict both axes together, rather than independently, since the two axes could covary rather than varying independently. Have you been able to find any literature on this problem? It's not something I've ever tried to do before, but it seems like there are probably some known best practices.
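A hedged sketch of joint prediction with scikit-learn (RandomForestRegressor handles multi-output targets natively; the data here is random and just illustrates the shapes):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(50, 10)   # feature table: 50 samples x 10 features
Y = rng.rand(50, 2)    # targets: latitude and longitude columns

# One model fit on both axes at once, instead of two independent models.
estimator = RandomForestRegressor(random_state=0).fit(X, Y)
coords = estimator.predict(X)  # shape (50, 2): both axes per sample
```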

or precise location within 2-D physical space, such as the built environment

Not something you need to deal with now, but would 3-D physical space ultimately make more sense to support, with the second and third dimensions being optional?

The latitude and longitude parameter names might make more sense as something like axis1_category and axis2_category (still with the same default values) if you're thinking you'd like to use this for built environments, etc., where lat/long probably aren't the right axes. Or, just focus the docs for the moment on predicting lat/long. In either case, I think _category should be part of the parameter names, for consistency with your other methods and so it's clearer to users what these values are supposed to be.

visualization comments

  1. I think these category labels would be a little clearer if they were centered in these plots:

[screenshot: clustermap with off-center category labels]

  2. It looks like there are some \n characters in the "parameters" section that are not being interpreted correctly. It might be better to drop this parameters section entirely, since all of that information is already in the provenance.

[screenshot: parameters section with literal \n characters]

  3. What do you think of putting the tsv link after the Feature Importance header, rather than after the table? It's easy to miss when it's way down at the bottom of the page.

  4. The leading zero in these regress-* visualizations is a little confusing - could you drop it? (I think it's the index of the dataframe that those results are in.)

[screenshot: regression results table with a leading-zero index]

Also, would it be possible to make those column headers clearer (e.g., expand MSE, and use P-value instead of P-val)?

  5. Could you change Accuracy score to Overall accuracy, and put it with the table of per-category accuracies? Especially if you end up removing the parameters section (see comment 2), this info will be hanging out on its own and look a little awkward. In the regress visualizers, I think you could also then drop the MSE (is it the same as the accuracy score?).

Reduce code replication

A bunch of functions are almost identical, differing only in the scikit-learn pipeline that they provide access to. We could perhaps replace all of them with a factory function, similar to what was done in q2-feature-classifier.
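A minimal sketch of the factory pattern (the names are illustrative; the real version would also wire up the visualizer and plugin registration):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

def _classify_factory(estimator_cls):
    """Return a classify function bound to one scikit-learn estimator."""
    def classify(X, y, **estimator_params):
        return estimator_cls(**estimator_params).fit(X, y)
    return classify

# One line per estimator replaces each near-identical function.
classify_random_forest = _classify_factory(RandomForestClassifier)
classify_svc = _classify_factory(SVC)
```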

remove `install_requires` from `setup.py`, or add missing dependencies

install_requires seems to be missing some dependencies (e.g., qiime2, but I'm not sure what others, if any). I think you can just remove install_requires altogether, though: this is intended to be a QIIME 2 plugin only, not a stand-alone package, so its dependencies will be installed through conda. That said, it might make sense to wait until you're ready to release before removing it.

could the amount of test data be reduced?

There are a lot of qza and qzv files, which is going to make the repository large. Could you reduce the number of them? If they're being used in tutorials but not in unit tests, we could ultimately host them elsewhere (e.g., on S3).

README should be ported to a new qiime2 tutorial with release of this plugin

It would be cool if you could use the moving pictures tutorial data to illustrate classify-samples, as that would make it really easy to teach this in workshops. The Atacama tutorial data could be useful for regress-samples (the AverageSoilRelativeHumidity and pH metadata columns are the relevant ones), but you might have a better data set for that in which case we could just prep some new tutorial data.

`n_jobs` should always default to 1

I noticed here and here that the default is 4, but that would be bad on, for example, a dual-core system. To be safe, make sure that the default value for n_jobs is 1 throughout the code base.

Possible stochastic failure related to bad linkage matrix

Bug Description
It appears that it is possible for the visualizer to fail when it finds a "perfectly bad" split of the test and training data.

Expected Behavior
@nbokulich says that this is very unlikely to happen, but we should look into reproducing it and raising a more informative error. If a user runs into this, they should be able to just re-run their command.

Screenshots

Test failure/traceback
_____________________ EstimatorsTests.test_maturity_index ______________________

self = <q2_sample_classifier.tests.test_classifier.EstimatorsTests testMethod=test_maturity_index>

    def test_maturity_index(self):
        maturity_index(self.temp_dir.name, self.table_ecam_fp, self.md_ecam_fp,
                       category='month', group_by='delivery', n_jobs=1,
>                      control='Vaginal', test_size=0.4)

test-env/lib/python3.5/site-packages/q2_sample_classifier/tests/test_classifier.py:250: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
test-env/lib/python3.5/site-packages/q2_sample_classifier/classify.py:145: in maturity_index
    accuracy, output_dir, maz_stats=maz_stats)
test-env/lib/python3.5/site-packages/q2_sample_classifier/utilities.py:412: in _visualize_maturity_index
    g = _clustermap_from_dataframe(top, metadata, group_by, category)
test-env/lib/python3.5/site-packages/q2_sample_classifier/visuals.py:69: in _clustermap_from_dataframe
    row_cluster=False)
test-env/lib/python3.5/site-packages/seaborn/matrix.py:1301: in clustermap
    **kwargs)
test-env/lib/python3.5/site-packages/seaborn/matrix.py:1131: in plot
    row_linkage=row_linkage, col_linkage=col_linkage)
test-env/lib/python3.5/site-packages/seaborn/matrix.py:1032: in plot_dendrograms
    axis=1, ax=self.ax_col_dendrogram, linkage=col_linkage)
test-env/lib/python3.5/site-packages/seaborn/matrix.py:746: in dendrogram
    label=label, rotate=rotate)
test-env/lib/python3.5/site-packages/seaborn/matrix.py:567: in __init__
    self.dendrogram = self.calculate_dendrogram()
test-env/lib/python3.5/site-packages/seaborn/matrix.py:644: in calculate_dendrogram
    color_threshold=-np.inf)
test-env/lib/python3.5/site-packages/scipy/cluster/hierarchy.py:2296: in dendrogram
    is_valid_linkage(Z, throw=True, name='Z')
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

Z = array([], shape=(0, 4), dtype=float64), warning = False, throw = True
name = 'Z'

    def is_valid_linkage(Z, warning=False, throw=False, name=None):
        """
        Checks the validity of a linkage matrix.
    
        A linkage matrix is valid if it is a two dimensional array (type double)
        with :math:`n` rows and 4 columns.  The first two columns must contain
        indices between 0 and :math:`2n-1`. For a given row ``i``, the following
        two expressions have to hold:
    
        .. math::
    
            0 \\leq \\mathtt{Z[i,0]} \\leq i+n-1
            0 \\leq Z[i,1] \\leq i+n-1
    
        I.e. a cluster cannot join another cluster unless the cluster being joined
        has been generated.
    
        Parameters
        ----------
        Z : array_like
            Linkage matrix.
        warning : bool, optional
            When True, issues a Python warning if the linkage
            matrix passed is invalid.
        throw : bool, optional
            When True, throws a Python exception if the linkage
            matrix passed is invalid.
        name : str, optional
            This string refers to the variable name of the invalid
            linkage matrix.
    
        Returns
        -------
        b : bool
            True if the inconsistency matrix is valid.
    
        """
        Z = np.asarray(Z, order='c')
        valid = True
        name_str = "%r " % name if name else ''
        try:
            if type(Z) != np.ndarray:
                raise TypeError('Passed linkage argument %sis not a valid array.' %
                                name_str)
            if Z.dtype != np.double:
                raise TypeError('Linkage matrix %smust contain doubles.' % name_str)
            if len(Z.shape) != 2:
                raise ValueError('Linkage matrix %smust have shape=2 (i.e. be '
                                 'two-dimensional).' % name_str)
            if Z.shape[1] != 4:
                raise ValueError('Linkage matrix %smust have 4 columns.' % name_str)
            if Z.shape[0] == 0:
>               raise ValueError('Linkage must be computed on at least two '
                                 'observations.')
E                                ValueError: Linkage must be computed on at least two observations.

test-env/lib/python3.5/site-packages/scipy/cluster/hierarchy.py:1459: ValueError
==================== 1 failed, 12 passed in 150.50 seconds =====================

Comments
This is probably going to be difficult to fix.
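One possible first step is a guard that replaces the opaque scipy error with an actionable one (a sketch; the helper name is made up):

```python
import pandas as pd

def _validate_clustermap_input(df: pd.DataFrame):
    # seaborn's clustermap fails with an empty linkage matrix when fewer
    # than two observations survive the train/test split.
    if df.shape[0] < 2:
        raise ValueError(
            'Fewer than two observations remain after splitting the data; '
            're-running the command (the split is randomized) should '
            'resolve this.')
```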

References
It appears
