
afq-insight's Introduction

AFQ-Insight: Python-based statistical learning for tractometry


See https://yeatmanlab.github.io/AFQ-Browser/ for documentation of the AFQ data format.

This is a work in progress.

Contributing

We love contributions! AFQ-Insight is open source, built on open source, and we'd love to have you hang out in our community.

We have developed some guidelines for contributing to AFQ-Insight.

Citing AFQ-Insight

If you use AFQ-Insight in a scientific publication, please cite us:

Richie-Halford, A., Yeatman, J., Simon, N., and Rokem, A. Multidimensional analysis and detection of informative features in diffusion MRI measurements of human white matter. DOI:10.1101/2019.12.19.882928

@article {Richie-Halford2019.12.19.882928,
	author = {{R}ichie-{H}alford, {A}dam and {Y}eatman, {J}ason and {S}imon, {N}oah and {R}okem, {A}riel},
	title = {{M}ultidimensional analysis and detection of informative features in diffusion {MRI} measurements of human white matter},
	elocation-id = {2019.12.19.882928},
	year = {2019},
	doi = {10.1101/2019.12.19.882928},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2019/12/20/2019.12.19.882928},
	eprint = {https://www.biorxiv.org/content/early/2019/12/20/2019.12.19.882928.full.pdf},
	journal = {bioRxiv}
}

Acknowledgements

AFQ-Insight development is supported through a grant from the Gordon and Betty Moore Foundation and from the Alfred P. Sloan Foundation to the University of Washington eScience Institute, as well as NIMH BRAIN Initiative grant 1RF1MH121868-01 to Ariel Rokem (University of Washington).

afq-insight's People

Contributors

36000, arokem, mnarayan, pyup-bot, richford


afq-insight's Issues

Add ability to transform datasets

From @arokem and @RLQiao's work, it seems that it would be useful to pass a transformer model to an AFQDataset and transform the dataset in place. I'm thinking of something like

def fit(self, model, **fit_kwargs):
    return model.fit(X=self.X, y=self.y, **fit_kwargs)

So that one could for example do

dataset = AFQDataset.from_files(...)
imputer = SimpleImputer()
imputer = dataset.fit(imputer)

to get back a fitted imputer.

Or

dataset = AFQDataset.from_files(...)
imputer = SimpleImputer()
dataset_imputed = dataset.fit_transform(imputer)

to get back a transformed dataset.
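
A minimal, runnable sketch of this proposal, using a hypothetical AFQDatasetSketch stand-in (the real AFQDataset holds more state, but the fit/fit_transform pattern is the same):

```python
import numpy as np
from sklearn.impute import SimpleImputer

class AFQDatasetSketch:
    """Hypothetical stand-in for AFQDataset, to illustrate the proposed API."""
    def __init__(self, X, y):
        self.X = np.asarray(X)
        self.y = np.asarray(y)

    def fit(self, model, **fit_kwargs):
        # Fit the supplied sklearn model on this dataset and return it.
        return model.fit(X=self.X, y=self.y, **fit_kwargs)

    def fit_transform(self, model, **fit_kwargs):
        # Fit the transformer and return a new dataset with transformed X.
        X_new = model.fit_transform(self.X, self.y, **fit_kwargs)
        return AFQDatasetSketch(X_new, self.y)

dataset = AFQDatasetSketch([[1.0, np.nan], [3.0, 4.0]], [0, 1])
imputed = dataset.fit_transform(SimpleImputer())  # NaN filled by column mean
```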

What do you think @arokem and @RLQiao?

Add a `GroupMeanTransformer`

We often want the mean value within each feature group. It would be great to have a `GroupMeanTransformer` that computes this easily.
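
One way this could look, as a sketch: a stateless sklearn transformer that takes `groups` as a list of index arrays (the convention used elsewhere in AFQ-Insight) and emits one column per group:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class GroupMeanTransformer(BaseEstimator, TransformerMixin):
    """Collapse each feature group to its mean (sketch). ``groups`` is
    assumed to be a list of index arrays, one array per feature group."""
    def __init__(self, groups=None):
        self.groups = groups

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        X = np.asarray(X)
        # One output column per group: the mean over that group's features.
        return np.column_stack([X[:, idx].mean(axis=1)
                                for idx in self.groups])

groups = [np.array([0, 1]), np.array([2, 3])]
X = np.array([[1.0, 3.0, 10.0, 20.0],
              [2.0, 4.0, 30.0, 50.0]])
X_means = GroupMeanTransformer(groups=groups).fit_transform(X)
```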

Add capabilities to split and impute AFQDataset

Dataset splitting and imputation are common operations with the potential to introduce train/test leakage if done incorrectly. They are also easier to do when the entire dataset is in memory (i.e., before it is batched into TensorFlow or PyTorch datasets).

Proposed solution

  • Make AFQDataset initialization more general so that the user can provide X, y, groups, feature_names, group_names, subjects, sessions, classes explicitly on init.
  • create a new static method from_files() that takes all of the filenames that are in the current AFQDataset init method and returns an AFQDataset object with all of the data read in.
  • create a new split() method that will apply an sklearn splitter class and wrap its split method (which usually returns tuples of indices) by returning a tuple of AFQDataset objects representing the train and test splits. The user is free to recursively split the test set if they want both test and validate splits.
  • create a train_test_split() that wraps next(split(ShuffleSplit())) in much the same way that train_test_split works in sklearn.
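
The split()/train_test_split() proposal above could be sketched like this, using a hypothetical DatasetSketch stand-in for AFQDataset:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

class DatasetSketch:
    """Hypothetical stand-in for AFQDataset, to illustrate the API."""
    def __init__(self, X, y):
        self.X, self.y = np.asarray(X), np.asarray(y)

    def split(self, splitter):
        # Wrap the sklearn splitter: yield (train, test) dataset pairs
        # instead of the splitter's raw index tuples.
        for train_idx, test_idx in splitter.split(self.X, self.y):
            yield (DatasetSketch(self.X[train_idx], self.y[train_idx]),
                   DatasetSketch(self.X[test_idx], self.y[test_idx]))

    def train_test_split(self, test_size=0.25, random_state=None):
        # Mirror sklearn.model_selection.train_test_split for one split.
        splitter = ShuffleSplit(n_splits=1, test_size=test_size,
                                random_state=random_state)
        return next(self.split(splitter))

data = DatasetSketch(np.arange(20).reshape(10, 2), np.arange(10))
train, test = data.train_test_split(test_size=0.2, random_state=0)
```

Recursively splitting `test` again would then yield separate test and validation sets, as described above.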

Topics for discussion

We often need to impute missing bundles and have to take care to fit imputation on only the training set to avoid data leakage. It would be nice to provide the user with an option to do this during splitting. One option is to add imputer and imputer_kwargs parameters to the aforementioned split() method and impute within the wrapper for the splitter's split() method. I like this option but am open to other implementations.

Tagging @arokem to continue a discussion that we had about this in person. Note my change of heart here in making AFQDataset more general and reading from input files using a static method.

Implement Combat transformer

From @mnarayan:

I did a tutorial on dealing with site effects a few years ago at OHBM; I can send my slides around. A few papers have shown that effectively standardizing DTI features to make sure they look similar across all sites (using ComBat) was quite useful: https://github.com/neuroquant/neurocombat. It might be useful to use this as part of model training (i.e., learning the ComBat standardization model on the training set and applying it to the test set before calling predict).

We should implement ComBat as an sklearn feature transformer and allow users the option of inserting it into the pipeline.
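
To make the train/test discipline concrete, here is a sketch of a location/scale harmonizer. This is a simplification, not full ComBat (which additionally shrinks the per-site parameters with empirical Bayes), and the explicit `sites` argument means it is not yet pipeline-compatible:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class SiteHarmonizer(BaseEstimator, TransformerMixin):
    """Location/scale site harmonization (a simplification of ComBat)."""

    def fit(self, X, y=None, sites=None):
        X = np.asarray(X, dtype=float)
        sites = np.asarray(sites)
        self.sites_ = np.unique(sites)
        # Per-site and pooled statistics, learned on the training set only.
        self.site_mean_ = {s: X[sites == s].mean(axis=0) for s in self.sites_}
        self.site_std_ = {s: X[sites == s].std(axis=0) + 1e-12
                          for s in self.sites_}
        self.grand_mean_ = X.mean(axis=0)
        self.grand_std_ = X.std(axis=0)
        return self

    def transform(self, X, sites=None):
        X = np.asarray(X, dtype=float).copy()
        sites = np.asarray(sites)
        for s in self.sites_:
            mask = sites == s
            # Standardize within site, then rescale to the pooled distribution.
            X[mask] = ((X[mask] - self.site_mean_[s]) / self.site_std_[s]
                       * self.grand_std_ + self.grand_mean_)
        return X

rng = np.random.default_rng(0)
sites = np.array([0] * 50 + [1] * 50)
X = rng.normal(size=(100, 3))
X[sites == 1] += 2.0  # simulated site effect
Xt = SiteHarmonizer().fit(X, sites=sites).transform(X, sites=sites)
```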

Add a RegressOutTransformer

It is common to regress out certain phenotypic covariates from both the target phenotype and the tract profile features. So common that we should probably have a RegressOutCovariate or some such transformer to do this.
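
A minimal sketch of what such a transformer might look like: fit a multi-output linear model of the features on the confounds, then return the residuals. The confounds are passed at construction here purely for simplicity:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression

class RegressOutTransformer(BaseEstimator, TransformerMixin):
    """Remove the linear effect of confound covariates from each feature.

    Sketch only: fit and transform must see rows in the same order
    as ``confounds``.
    """
    def __init__(self, confounds=None):
        self.confounds = confounds

    def fit(self, X, y=None):
        # One multi-output linear fit: features ~ confounds (with intercept).
        self.model_ = LinearRegression().fit(np.asarray(self.confounds), X)
        return self

    def transform(self, X):
        # Residuals: the part of each feature not explained by the confounds.
        return X - self.model_.predict(np.asarray(self.confounds))

rng = np.random.default_rng(42)
age = rng.uniform(8, 18, size=(100, 1))
X = 2.0 * age + rng.normal(scale=0.1, size=(100, 1))  # feature driven by age
X_resid = RegressOutTransformer(confounds=age).fit_transform(X)
```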

Add utilities for matching between participants

To create quasi-experiments based on certain match characteristics in large datasets.

@36000 has written some code to get this started.

Maybe this should go into a new module called match? It looks like the convention so far is for modules to be named as imperative verbs ("cross_validate", "transform", "plot").

Add some more sklearn API to the CNN model?

We can't currently use cross_val_score because:

TypeError: Cannot clone object '<afqinsight.cnn.CNN object at 0x7fca351d1be0>' (type <class 'afqinsight.cnn.CNN'>): it does not seem to be a scikit-learn estimator as it does not implement a 'get_params' method.

I wonder how many more methods we'd need to implement for this (and for other things, such as sklearn pipelines) to work.
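
For context, clone() only requires get_params/set_params, and sklearn's BaseEstimator supplies both automatically when __init__ stores its keyword arguments verbatim. A sketch with a hypothetical CNNSketch wrapper (not the actual afqinsight.cnn.CNN):

```python
from sklearn.base import BaseEstimator, clone

class CNNSketch(BaseEstimator):
    """Hypothetical wrapper: inheriting from BaseEstimator provides
    get_params/set_params, which is all clone() (and hence
    cross_val_score) requires of the constructor interface."""
    def __init__(self, n_filters=8, depth=2):
        # sklearn convention: __init__ only stores hyperparameters verbatim.
        self.n_filters = n_filters
        self.depth = depth

    def fit(self, X, y):
        # ...build and train the underlying network here...
        self.is_fitted_ = True
        return self

est = CNNSketch(n_filters=16)
est2 = clone(est)  # succeeds because get_params/set_params exist
```

cross_val_score additionally needs fit and score (or a scoring argument), so a scorer-compatible score method is likely the other missing piece.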

Add sparse group regression problem generator

datasets.py has the make_sparse_group_classification function, which is a sparse group extension of sklearn.datasets.make_classification. We should also make a sparse group regression problem generator.
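
The core of such a generator could look like the following sketch (function name and parameters are illustrative, not the eventual API): only the first few groups carry nonzero coefficients, so group-sparse estimators have signal to recover.

```python
import numpy as np

def make_sparse_group_regression_sketch(n_samples=100, n_groups=10,
                                        n_features_per_group=5,
                                        n_informative_groups=3,
                                        noise=0.1, random_state=None):
    """Hypothetical generator: only the first ``n_informative_groups``
    groups carry nonzero regression coefficients."""
    rng = np.random.default_rng(random_state)
    n_features = n_groups * n_features_per_group
    X = rng.normal(size=(n_samples, n_features))
    coef = np.zeros(n_features)
    n_informative = n_informative_groups * n_features_per_group
    coef[:n_informative] = rng.normal(size=n_informative)
    y = X @ coef + noise * rng.normal(size=n_samples)
    groups = [np.arange(g * n_features_per_group,
                        (g + 1) * n_features_per_group)
              for g in range(n_groups)]
    return X, y, coef, groups

X, y, coef, groups = make_sparse_group_regression_sketch(random_state=0)
```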

Migrate to groupyr

I decided (unilaterally...sorry...but I really think it's the right choice) that we should pull all of the SGL estimators out into their own project, import that library into AFQ-Insight and leave only AFQ related transformers and pipelines here.

Remaining steps

  • Migrate all estimators to groupyr
  • Add groupyr as a dependency
  • Remove dependency cruft leftover from previous implementation
  • Delete all of the old estimator functions and use groupyr estimators instead

Reading y with missing values

I might be doing something wrong:

output = load_afq_data(unsupervised=False, 
                       dwi_metrics=["dki_fa", "dki_md", "dki_mk", "dki_awf"], 
                       fn_nodes=DATA + 'hbn_tract_profiles.csv', 
                       fn_subjects=DATA + 'hbn_participants.csv',
                       index_col="subject_id",
                       target_cols=["age"])

Where the "age" column in "hbn_participants.csv" contains some missing values. But I wasn't expecting:

y = output.y 
np.all(np.isnan(y))
True

Match subject identifiers with and without the "sub-" prefix

Problem

When using load_afq_data or AFQDataset, if one of subjects.csv or nodes.csv has a "sub-" prefix for the subject identifiers and the other does not, then the matching phenotypic information for each subject won't be found.

Proposed solution

Add a drop_sub_prefix=False parameter to load_afq_data and AFQDataset. When False it should reproduce the current behavior. When True it should drop the "sub-" prefix from both the nodes and subject dataframes before merging.
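
The prefix-dropping step itself is small; a sketch with a hypothetical helper (pandas handles the rest at merge time):

```python
import pandas as pd

def drop_sub_prefix(df):
    """Strip a leading 'sub-' from the subject identifiers in the index
    (hypothetical helper illustrating the proposed behavior)."""
    df = df.copy()
    df.index = df.index.str.replace(r"^sub-", "", regex=True)
    return df

nodes = pd.DataFrame({"dki_fa": [0.4, 0.5]}, index=["sub-01", "sub-02"])
pheno = pd.DataFrame({"age": [10, 12]}, index=["01", "02"])
# Without stripping the prefix, this join would find no matching rows.
merged = drop_sub_prefix(nodes).join(pheno)
```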

Fix sklearn import bug

Problems running in Google Colab due to a sklearn incompatibility. The same error was reproduced when installing and running afqinsight on the cluster.

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-5-e34982e51a91> in <module>()
----> 1 import afqinsight as afqi
      2 import itertools
      3 import matplotlib.pyplot as plt
      4 import numpy as np
      5 import os.path as op

1 frames
/content/afq-insight/afqinsight/datasets.py in <module>()
      9 from collections import namedtuple
     10 from shutil import copyfile
---> 11 from sklearn.datasets._samples_generator import _generate_hypercube
     12 from sklearn.preprocessing import StandardScaler
     13 from sklearn.utils import check_random_state

ModuleNotFoundError: No module named 'sklearn.datasets._samples_generator'


The scikit-learn version installed when using Colaboratory is 0.21.1, which is consistent with what is specified in the requirements. @richford, I think you have a different scikit-learn version installed in your dev environment.

The problematic line is https://github.com/richford/AFQ-Insight/blob/master/afqinsight/datasets.py#L11, which needs to read `from sklearn.datasets.samples_generator import _generate_hypercube` to work without error.

Inconsistencies in use of pandas dataframes and numpy arrays

This function returns x as an array and y as a dataframe:
https://github.com/richford/AFQ-Insight/blob/9ea44e144827dbd83e9b2c3ac3b00ca54457e4d2/afqinsight/datasets.py#L27
However the classification functions, such as this one, expect x and y to be numpy arrays:
https://github.com/richford/AFQ-Insight/blob/9ea44e144827dbd83e9b2c3ac3b00ca54457e4d2/afqinsight/insight.py#L94

My suggestion is to return and expect dataframes, so that when you print x or y you have column labels, but use the numpy arrays internally.
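One way to support that suggestion without duplicating code paths: a small conversion helper at the internal boundary, so the public API accepts labeled DataFrames while internals stay numpy. A sketch (`as_array` is a hypothetical name):

```python
import numpy as np
import pandas as pd

def as_array(data):
    """Return a numpy array whether given a DataFrame or an array-like,
    so labeled frames work at the API surface while internals stay numpy."""
    if isinstance(data, pd.DataFrame):
        return data.to_numpy()
    return np.asarray(data)

df = pd.DataFrame({"fa": [0.4, 0.5], "md": [0.7, 0.8]})
arr = as_array(df)  # internals can rely on plain ndarrays from here on
```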

The output from load_afq_data is getting pretty busy

load_afq_data already returns a tuple with up to 8 elements. Depending on how we resolve #78, this might climb to 9 elements. This is too much, and I often forget the order of the results. How should we simplify the returned result?

  • Return a dict with keys X, subjects, groups, etc.
  • Return a namedtuple. See os.stat for an example implementation from the standard library.
  • Leave load_afq_data as is but also provide an AFQDataLoader class with attributes for each return value.

I'm partial to the namedtuple since it can be unpacked like a regular tuple so it would be backwards compatible. But we could also access the fields by name, much like a class instance's attributes. Kind of the best of both.
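
A quick illustration of that dual access pattern (the field set here is illustrative, not the actual load_afq_data return signature):

```python
from collections import namedtuple

# A named result type: tuple unpacking stays backwards compatible while
# fields are also accessible by name, like class attributes.
AFQData = namedtuple(
    "AFQData",
    ["X", "y", "groups", "feature_names", "group_names", "subjects"],
)

result = AFQData(X=[[1.0]], y=[0], groups=None, feature_names=["fa"],
                 group_names=None, subjects=["sub-01"])
X, y, *rest = result           # positional unpacking still works...
names = result.feature_names   # ...and so does access by field name
```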

@arokem WDYT?

Variable group sizes in make_sparse_group_(regression|classification)

Currently, make_sparse_group_classification and make_sparse_group_regression limit the problems generated to uniform group sizes. They should allow parameters like n_informative_features_per_group to be array-like where the length of the array is equal to the number of groups.

Tensorflow datasets => CNN

I've been working with the tf_models module and trying to feed in a tensorflow dataset into these models. I am running into some issues that are replicated in this self-contained example:

https://gist.github.com/arokem/db5d1b212268ab65effaa173f603cdfe

I must be doing something wrong, but I've tried a lot of different things already:

  1. changing the input_shape to all kinds of other things.
  2. Initializing the dataset using from_tensors and from_generator instead of from_tensor_slices.
  3. Turning batching off.

And probably some other things that I don't remember right now. What am I missing?

Add documentation on M1 installation issues

`tables` has trouble installing on M1 Macs. The workaround is to do the following before installing AFQ-Insight:

pip install cython
brew install hdf5
brew install c-blosc
export HDF5_DIR=/opt/homebrew/opt/hdf5 
export BLOSC_DIR=/opt/homebrew/opt/c-blosc
pip install tables

We should add this to the documentation.

MOAR TESTS!!!

Write more unit tests. Check the Codacy badge to see which files need more coverage.
