
afq-insight's Introduction

AFQ-Insight: Python-based statistical learning for tractometry


See https://yeatmanlab.github.io/AFQ-Browser/ for documentation of the AFQ data format.

This is a work in progress.

Contributing

We love contributions! AFQ-Insight is open source, built on open source, and we'd love to have you hang out in our community.

We have developed some guidelines for contributing to AFQ-Insight.

Citing AFQ-Insight

If you use AFQ-Insight in a scientific publication, please cite us:

Richie-Halford, A., Yeatman, J., Simon, N., and Rokem, A. Multidimensional analysis and detection of informative features in diffusion MRI measurements of human white matter. DOI:10.1101/2019.12.19.882928

@article {Richie-Halford2019.12.19.882928,
	author = {{R}ichie-{H}alford, {A}dam and {Y}eatman, {J}ason and {S}imon, {N}oah and {R}okem, {A}riel},
	title = {{M}ultidimensional analysis and detection of informative features in diffusion {MRI} measurements of human white matter},
	elocation-id = {2019.12.19.882928},
	year = {2019},
	doi = {10.1101/2019.12.19.882928},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2019/12/20/2019.12.19.882928},
	eprint = {https://www.biorxiv.org/content/early/2019/12/20/2019.12.19.882928.full.pdf},
	journal = {bioRxiv}
}

Acknowledgements

AFQ-Insight development is supported through a grant from the Gordon and Betty Moore Foundation and from the Alfred P. Sloan Foundation to the University of Washington eScience Institute, as well as NIMH BRAIN Initiative grant 1RF1MH121868-01 to Ariel Rokem (University of Washington).

afq-insight's People

Contributors

36000, arokem, mnarayan, pyup-bot, richford


afq-insight's Issues

Add ability to transform datasets

From @arokem and @RLQiao's work, it seems that it would be useful to pass a transformer model to an AFQDataset and transform the dataset in place. I'm thinking of something like

def fit(self, model, **fit_kwargs):
    return model.fit(X=self.X, y=self.y, **fit_kwargs)

So that one could for example do

dataset = AFQDataset.from_files(...)
imputer = SimpleImputer()
imputer = dataset.fit(imputer)

to get back a fitted imputer.

Or

dataset = AFQDataset.from_files(...)
imputer = SimpleImputer()
dataset_imputed = dataset.fit_transform(imputer)

to get back a transformed dataset.
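
A minimal, runnable sketch of this proposal, using a hypothetical AFQDatasetSketch stand-in (the real AFQDataset holds more state, but the fit/fit_transform pattern is the same):

```python
import numpy as np
from sklearn.impute import SimpleImputer

class AFQDatasetSketch:
    """Hypothetical stand-in for AFQDataset, to illustrate the proposed API."""
    def __init__(self, X, y):
        self.X = np.asarray(X)
        self.y = np.asarray(y)

    def fit(self, model, **fit_kwargs):
        # Fit the supplied sklearn model on this dataset and return it.
        return model.fit(X=self.X, y=self.y, **fit_kwargs)

    def fit_transform(self, model, **fit_kwargs):
        # Fit the transformer and return a new dataset with transformed X.
        X_new = model.fit_transform(self.X, self.y, **fit_kwargs)
        return AFQDatasetSketch(X_new, self.y)

dataset = AFQDatasetSketch([[1.0, np.nan], [3.0, 4.0]], [0, 1])
imputed = dataset.fit_transform(SimpleImputer())  # NaN filled by column mean
```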

What do you think @arokem and @RLQiao?

Add a `GroupMeanTransformer`

We often want the mean value within each feature group. It would be great to have a `GroupMeanTransformer` that computes this easily.
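
One way this could look, as a sketch: a stateless sklearn transformer that takes `groups` as a list of index arrays (the convention used elsewhere in AFQ-Insight) and emits one column per group:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class GroupMeanTransformer(BaseEstimator, TransformerMixin):
    """Collapse each feature group to its mean (sketch). ``groups`` is
    assumed to be a list of index arrays, one array per feature group."""
    def __init__(self, groups=None):
        self.groups = groups

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        X = np.asarray(X)
        # One output column per group: the mean over that group's features.
        return np.column_stack([X[:, idx].mean(axis=1)
                                for idx in self.groups])

groups = [np.array([0, 1]), np.array([2, 3])]
X = np.array([[1.0, 3.0, 10.0, 20.0],
              [2.0, 4.0, 30.0, 50.0]])
X_means = GroupMeanTransformer(groups=groups).fit_transform(X)
```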

Add capabilities to split and impute AFQDataset

Dataset splitting and imputation are common operations with the potential to introduce train/test leakage if done incorrectly. They are also easier to do when the entire dataset is in memory (i.e., before it is batched into TensorFlow or PyTorch datasets).

Proposed solution

  • Make AFQDataset initialization more general so that the user can provide X, y, groups, feature_names, group_names, subjects, sessions, classes explicitly on init.
  • create a new static method from_files() that takes all of the filenames that are in the current AFQDataset init method and returns an AFQDataset object with all of the data read in.
  • create a new split() method that will apply an sklearn splitter class and wrap its split method (which usually returns tuples of indices) by returning a tuple of AFQDataset objects representing the train and test splits. The user is free to recursively split the test set if they want both test and validate splits.
  • create a train_test_split() that wraps next(split(ShuffleSplit())) in much the same way that train_test_split works in sklearn.
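
The split()/train_test_split() proposal above could be sketched like this, using a hypothetical DatasetSketch stand-in for AFQDataset:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

class DatasetSketch:
    """Hypothetical stand-in for AFQDataset, to illustrate the API."""
    def __init__(self, X, y):
        self.X, self.y = np.asarray(X), np.asarray(y)

    def split(self, splitter):
        # Wrap the sklearn splitter: yield (train, test) dataset pairs
        # instead of the splitter's raw index tuples.
        for train_idx, test_idx in splitter.split(self.X, self.y):
            yield (DatasetSketch(self.X[train_idx], self.y[train_idx]),
                   DatasetSketch(self.X[test_idx], self.y[test_idx]))

    def train_test_split(self, test_size=0.25, random_state=None):
        # Mirror sklearn.model_selection.train_test_split for one split.
        splitter = ShuffleSplit(n_splits=1, test_size=test_size,
                                random_state=random_state)
        return next(self.split(splitter))

data = DatasetSketch(np.arange(20).reshape(10, 2), np.arange(10))
train, test = data.train_test_split(test_size=0.2, random_state=0)
```

Recursively splitting `test` again would then yield separate test and validation sets, as described above.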

Topics for discussion

We often need to impute missing bundles and have to take care to fit imputation on only the training set to avoid data leakage. It would be nice to provide the user with an option to do this during splitting. One option is to add imputer and imputer_kwargs parameters to the aforementioned split() method and impute within the wrapper for the splitter's split() method. I like this option but am open to other implementations.

Tagging @arokem to continue a discussion that we had about this in person. Note my change of heart here in making AFQDataset more general and reading from input files using a static method.

Implement Combat transformer

From @mnarayan:

I did a tutorial on dealing with site effects a few years ago at OHBM; I can send my slides around. A few papers have shown that effectively standardizing DTI features to make sure they look similar across all sites (using ComBat) was quite useful: https://github.com/neuroquant/neurocombat. It might be useful to use this as part of model training (i.e., learning the ComBat standardization model on the training set and applying it to the test set before calling predict).

We should implement ComBat as an sklearn feature transformer and allow users the option of inserting it into the pipeline.
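
To make the train/test discipline concrete, here is a sketch of a location/scale harmonizer. This is a simplification, not full ComBat (which additionally shrinks the per-site parameters with empirical Bayes), and the explicit `sites` argument means it is not yet pipeline-compatible:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class SiteHarmonizer(BaseEstimator, TransformerMixin):
    """Location/scale site harmonization (a simplification of ComBat)."""

    def fit(self, X, y=None, sites=None):
        X = np.asarray(X, dtype=float)
        sites = np.asarray(sites)
        self.sites_ = np.unique(sites)
        # Per-site and pooled statistics, learned on the training set only.
        self.site_mean_ = {s: X[sites == s].mean(axis=0) for s in self.sites_}
        self.site_std_ = {s: X[sites == s].std(axis=0) + 1e-12
                          for s in self.sites_}
        self.grand_mean_ = X.mean(axis=0)
        self.grand_std_ = X.std(axis=0)
        return self

    def transform(self, X, sites=None):
        X = np.asarray(X, dtype=float).copy()
        sites = np.asarray(sites)
        for s in self.sites_:
            mask = sites == s
            # Standardize within site, then rescale to the pooled distribution.
            X[mask] = ((X[mask] - self.site_mean_[s]) / self.site_std_[s]
                       * self.grand_std_ + self.grand_mean_)
        return X

rng = np.random.default_rng(0)
sites = np.array([0] * 50 + [1] * 50)
X = rng.normal(size=(100, 3))
X[sites == 1] += 2.0  # simulated site effect
Xt = SiteHarmonizer().fit(X, sites=sites).transform(X, sites=sites)
```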

Add a RegressOutTransformer

It is common to regress out certain phenotypic covariates from both the target phenotype and the tract profile features. So common that we should probably have a RegressOutCovariate or some such transformer to do this.
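
A minimal sketch of what such a transformer might look like: fit a multi-output linear model of the features on the confounds, then return the residuals. The confounds are passed at construction here purely for simplicity:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression

class RegressOutTransformer(BaseEstimator, TransformerMixin):
    """Remove the linear effect of confound covariates from each feature.

    Sketch only: fit and transform must see rows in the same order
    as ``confounds``.
    """
    def __init__(self, confounds=None):
        self.confounds = confounds

    def fit(self, X, y=None):
        # One multi-output linear fit: features ~ confounds (with intercept).
        self.model_ = LinearRegression().fit(np.asarray(self.confounds), X)
        return self

    def transform(self, X):
        # Residuals: the part of each feature not explained by the confounds.
        return X - self.model_.predict(np.asarray(self.confounds))

rng = np.random.default_rng(42)
age = rng.uniform(8, 18, size=(100, 1))
X = 2.0 * age + rng.normal(scale=0.1, size=(100, 1))  # feature driven by age
X_resid = RegressOutTransformer(confounds=age).fit_transform(X)
```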

Add utilities for matching between participants

To create quasi-experiments based on certain match characteristics in large datasets.

@36000 has written some code to get this started.

Maybe this should go into a new module called match? It looks like the convention so far is for modules to be named as imperative verbs ("cross_validate", "transform", "plot").

Add some more sklearn API to the CNN model?

We can't currently use cross_val_score because:

TypeError: Cannot clone object '<afqinsight.cnn.CNN object at 0x7fca351d1be0>' (type <class 'afqinsight.cnn.CNN'>): it does not seem to be a scikit-learn estimator as it does not implement a 'get_params' method.

I wonder how many more methods we'd need to implement for this (and for other things, such as sklearn pipelines) to work.
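
For context, clone() only requires get_params/set_params, and sklearn's BaseEstimator supplies both automatically when __init__ stores its keyword arguments verbatim. A sketch with a hypothetical CNNSketch wrapper (not the actual afqinsight.cnn.CNN):

```python
from sklearn.base import BaseEstimator, clone

class CNNSketch(BaseEstimator):
    """Hypothetical wrapper: inheriting from BaseEstimator provides
    get_params/set_params, which is all clone() (and hence
    cross_val_score) requires of the constructor interface."""
    def __init__(self, n_filters=8, depth=2):
        # sklearn convention: __init__ only stores hyperparameters verbatim.
        self.n_filters = n_filters
        self.depth = depth

    def fit(self, X, y):
        # ...build and train the underlying network here...
        self.is_fitted_ = True
        return self

est = CNNSketch(n_filters=16)
est2 = clone(est)  # succeeds because get_params/set_params exist
```

cross_val_score additionally needs fit and score (or a scoring argument), so a scorer-compatible score method is likely the other missing piece.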

Add sparse group regression problem generator

datasets.py has the make_sparse_group_classification function, which is a sparse group extension of sklearn.datasets.make_classification. We should also make a sparse group regression problem generator.
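
The core of such a generator could look like the following sketch (function name and parameters are illustrative, not the eventual API): only the first few groups carry nonzero coefficients, so group-sparse estimators have signal to recover.

```python
import numpy as np

def make_sparse_group_regression_sketch(n_samples=100, n_groups=10,
                                        n_features_per_group=5,
                                        n_informative_groups=3,
                                        noise=0.1, random_state=None):
    """Hypothetical generator: only the first ``n_informative_groups``
    groups carry nonzero regression coefficients."""
    rng = np.random.default_rng(random_state)
    n_features = n_groups * n_features_per_group
    X = rng.normal(size=(n_samples, n_features))
    coef = np.zeros(n_features)
    n_informative = n_informative_groups * n_features_per_group
    coef[:n_informative] = rng.normal(size=n_informative)
    y = X @ coef + noise * rng.normal(size=n_samples)
    groups = [np.arange(g * n_features_per_group,
                        (g + 1) * n_features_per_group)
              for g in range(n_groups)]
    return X, y, coef, groups

X, y, coef, groups = make_sparse_group_regression_sketch(random_state=0)
```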

Migrate to groupyr

I decided (unilaterally...sorry...but I really think it's the right choice) that we should pull all of the SGL estimators out into their own project, import that library into AFQ-Insight and leave only AFQ related transformers and pipelines here.

Remaining steps

  • Migrate all estimators to groupyr
  • Add groupyr as a dependency
  • Remove dependency cruft leftover from previous implementation
  • Delete all of the old estimator functions and use groupyr estimators instead

Reading y with missing values

I might be doing something wrong:

output = load_afq_data(unsupervised=False, 
                       dwi_metrics=["dki_fa", "dki_md", "dki_mk", "dki_awf"], 
                       fn_nodes=DATA + 'hbn_tract_profiles.csv', 
                       fn_subjects=DATA + 'hbn_participants.csv',
                       index_col="subject_id",
                       target_cols=["age"])

Where the "age" column in "hbn_participants.csv" contains some missing values. But I wasn't expecting:

y = output.y 
np.all(np.isnan(y))
True

Match subject identifiers with and without the "sub-" prefix

Problem

When using load_afq_data or AFQDataset, if one of subjects.csv or nodes.csv has a "sub-" prefix for the subject identifiers and the other does not, then the matching phenotypic information for each subject won't be found.

Proposed solution

Add a drop_sub_prefix=False parameter to load_afq_data and AFQDataset. When False it should reproduce the current behavior. When True it should drop the "sub-" prefix from both the nodes and subject dataframes before merging.
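
The prefix-dropping step itself is small; a sketch with a hypothetical helper (pandas handles the rest at merge time):

```python
import pandas as pd

def drop_sub_prefix(df):
    """Strip a leading 'sub-' from the subject identifiers in the index
    (hypothetical helper illustrating the proposed behavior)."""
    df = df.copy()
    df.index = df.index.str.replace(r"^sub-", "", regex=True)
    return df

nodes = pd.DataFrame({"dki_fa": [0.4, 0.5]}, index=["sub-01", "sub-02"])
pheno = pd.DataFrame({"age": [10, 12]}, index=["01", "02"])
# Without stripping the prefix, this join would find no matching rows.
merged = drop_sub_prefix(nodes).join(pheno)
```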

Fix sklearn import bug

Problems running in Google Colab due to a sklearn incompatibility. The same error was reproduced when installing and running afqinsight on the cluster.

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-5-e34982e51a91> in <module>()
----> 1 import afqinsight as afqi
      2 import itertools
      3 import matplotlib.pyplot as plt
      4 import numpy as np
      5 import os.path as op

1 frames
/content/afq-insight/afqinsight/datasets.py in <module>()
      9 from collections import namedtuple
     10 from shutil import copyfile
---> 11 from sklearn.datasets._samples_generator import _generate_hypercube
     12 from sklearn.preprocessing import StandardScaler
     13 from sklearn.utils import check_random_state

ModuleNotFoundError: No module named 'sklearn.datasets._samples_generator'


The scikit-learn version installed when using Colaboratory is 0.21.1, which is consistent with what is specified in the requirements. @richford, I think you have a different scikit-learn version installed in your dev environment.

The problematic line is https://github.com/richford/AFQ-Insight/blob/master/afqinsight/datasets.py#L11, which needs to read `from sklearn.datasets.samples_generator import _generate_hypercube` to work without error.

Inconsistencies in use of pandas dataframes and numpy arrays

This function returns x as an array and y as a dataframe:
https://github.com/richford/AFQ-Insight/blob/9ea44e144827dbd83e9b2c3ac3b00ca54457e4d2/afqinsight/datasets.py#L27
However the classification functions, such as this one, expect x and y to be numpy arrays:
https://github.com/richford/AFQ-Insight/blob/9ea44e144827dbd83e9b2c3ac3b00ca54457e4d2/afqinsight/insight.py#L94

My suggestion is to return and expect dataframes, so that when you print x or y you have column labels, but use the numpy arrays internally.
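One way to support that suggestion without duplicating code paths: a small conversion helper at the internal boundary, so the public API accepts labeled DataFrames while internals stay numpy. A sketch (`as_array` is a hypothetical name):

```python
import numpy as np
import pandas as pd

def as_array(data):
    """Return a numpy array whether given a DataFrame or an array-like,
    so labeled frames work at the API surface while internals stay numpy."""
    if isinstance(data, pd.DataFrame):
        return data.to_numpy()
    return np.asarray(data)

df = pd.DataFrame({"fa": [0.4, 0.5], "md": [0.7, 0.8]})
arr = as_array(df)  # internals can rely on plain ndarrays from here on
```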

The output from load_afq_data is getting pretty busy

load_afq_data already returns a tuple with up to 8 elements. Depending on how we resolve #78, this might climb to 9 elements. This is too much, and I often forget the order of the results. How should we simplify the returned result?

  • Return a dict with keys X, subjects, groups, etc.
  • Return a namedtuple. See os.stat for an example implementation from the standard library.
  • Leave load_afq_data as is but also provide an AFQDataLoader class with attributes for each return value.

I'm partial to the namedtuple since it can be unpacked like a regular tuple so it would be backwards compatible. But we could also access the fields by name, much like a class instance's attributes. Kind of the best of both.
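
A quick illustration of that dual access pattern (the field set here is illustrative, not the actual load_afq_data return signature):

```python
from collections import namedtuple

# A named result type: tuple unpacking stays backwards compatible while
# fields are also accessible by name, like class attributes.
AFQData = namedtuple(
    "AFQData",
    ["X", "y", "groups", "feature_names", "group_names", "subjects"],
)

result = AFQData(X=[[1.0]], y=[0], groups=None, feature_names=["fa"],
                 group_names=None, subjects=["sub-01"])
X, y, *rest = result           # positional unpacking still works...
names = result.feature_names   # ...and so does access by field name
```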

@arokem WDYT?

Variable group sizes in make_sparse_group_(regression|classification)

Currently, make_sparse_group_classification and make_sparse_group_regression limit the problems generated to uniform group sizes. They should allow parameters like n_informative_features_per_group to be array-like where the length of the array is equal to the number of groups.

Tensorflow datasets => CNN

I've been working with the tf_models module and trying to feed in a tensorflow dataset into these models. I am running into some issues that are replicated in this self-contained example:

https://gist.github.com/arokem/db5d1b212268ab65effaa173f603cdfe

I must be doing something wrong, but I've tried a lot of different things already:

  1. changing the input_shape to all kinds of other things.
  2. Initializing the dataset using from_tensors and from_generator instead of from_tensor_slices.
  3. Turning batching off.

And probably some other things that I don't remember right now. What am I missing?

Add documentation on M1 installation issues

`tables` has trouble installing on M1 Macs. The workaround is to do the following before installing AFQ-Insight:

pip install cython
brew install hdf5
brew install c-blosc
export HDF5_DIR=/opt/homebrew/opt/hdf5 
export BLOSC_DIR=/opt/homebrew/opt/c-blosc
pip install tables

We should add this to the documentation.

MOAR TESTS!!!

Write more unit tests. Check the Codacy badge to see which files need more coverage.
