
pypyls's Introduction

pyls

This package provides a Python interface for partial least squares (PLS) analysis, a multivariate statistical technique used to relate two sets of variables.


Table of Contents

If you know where you're going, feel free to jump ahead:

  • Installation and setup
  • Purpose
  • Usage

Installation and setup

This package requires Python >= 3.5. Assuming you have the correct version of Python installed, you can install this package by opening a terminal and running the following:

git clone https://github.com/rmarkello/pyls.git
cd pyls
python setup.py install

There are plans (hopes?) to get this set up on PyPI for an easier installation process, but that is a long-term goal!

Purpose

Overview

Partial least squares (PLS) is a statistical technique that aims to find shared information between two sets of variables. If you're unfamiliar with PLS and are interested in a thorough (albeit quite technical) treatment of it, Abdi et al. (2013) is a good resource. There are multiple "flavors" of PLS that are tailored to different use cases; this package implements two functions that fall within the category typically referred to as PLS-C (PLS correlation) or PLS-SVD (PLS singular value decomposition) and one function that falls within the category typically referred to as PLS-R (PLS regression).

Background

The functionality of the current package largely mirrors that originally introduced by McIntosh et al. (1996) in their Matlab toolbox. However, while the Matlab toolbox has a significant number of tools dedicated to integrating neuroimaging-specific paradigms (e.g., loading M/EEG and fMRI data), the current Python package aims to implement and expand on only the core statistical functions of that toolbox.

While the core algorithms of PLS implemented in this package are present (to a degree) in scikit-learn, this package provides a different API and includes some additional functionality. Namely, pyls:

  1. Has integrated significance and reliability testing via built-in permutation testing and bootstrap resampling,
  2. Implements mean-centered PLS for multivariate group/condition comparisons,
  3. Uses the SIMPLS algorithm instead of NIPALS for PLS regression.

Usage

pyls implements two subtypes of PLS-C: a more traditional form that we call "behavioral PLS" (pyls.behavioral_pls) and a somewhat newer form that we call "mean-centered PLS" (pyls.meancentered_pls). It also implements one type of PLS-R, which uses the SIMPLS algorithm (pyls.pls_regression); this is, in principle, very similar to "behavioral PLS."

PLS correlation methods

Behavioral PLS

As the more "traditional" form of PLS-C, pyls.behavioral_pls looks to find relationships between two sets of variables. To run a behavioral PLS we would do the following:

>>> import numpy as np

# let's create two data arrays with 80 observations
>>> X = np.random.rand(80, 10000)  # a 10000-feature (e.g., neural) data array
>>> Y = np.random.rand(80, 10)     # a 10-feature (e.g., behavioral) data array

# we're going to pretend that this data is from 2 groups of 20 subjects each,
# and that each subject participated in 2 task conditions
>>> groups = [20, 20]  # a list with the number of subjects in each group
>>> n_cond = 2         # the number of tasks or conditions

# run the analysis and look at the results structure
>>> from pyls import behavioral_pls
>>> bpls = behavioral_pls(X, Y, groups=groups, n_cond=n_cond)
>>> bpls
PLSResults(x_weights, y_weights, x_scores, y_scores, y_loadings, singvals, varexp, permres, 
bootres, splitres, cvres, inputs)

Mean-centered PLS

In contrast to behavioral PLS, pyls.meancentered_pls doesn't look to find relationships between two sets of variables, but rather tries to find relationships between groupings in a single set of variables. As such, we will only provide it with one of our created data arrays (X) and it will attempt to examine how the features of that array differ between groups and/or conditions. To run a mean-centered PLS we would do the following:

>>> from pyls import meancentered_pls
>>> mpls = meancentered_pls(X, groups=groups, n_cond=n_cond)
>>> mpls
PLSResults(x_weights, y_weights, x_scores, y_scores, singvals, varexp, permres, bootres, splitres,
inputs)

PLS regression methods

Regression with SIMPLS

Whereas pyls.behavioral_pls aims to maximize the symmetric relationship between X and Y, pyls.pls_regression performs a directed decomposition. That is, it aims to find components in X that explain the most variance in Y (but not necessarily vice versa). To run a PLS regression analysis we would do the following:

>>> from pyls import pls_regression
>>> plsr = pls_regression(X, Y, n_components=5)
>>> plsr
PLSResults(x_weights, x_scores, y_scores, y_loadings, varexp, permres, bootres, inputs)

Currently pyls.pls_regression() does not support groups or conditions.

PLS Results

The docstrings of the results objects (bpls, plsr, and mpls in the above examples) have some information describing what each output represents, so while we work on improving our documentation you can rely on those for some insight! Try typing help(bpls), help(plsr), or help(mpls) to get more information on what the different values represent.

If you are at all familiar with the Matlab PLS toolbox you might notice that the results structures have a dramatically different naming convention; despite this, all the same information should be present!
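As a quick illustration of accessing individual results (the attribute names below are taken from the result reprs shown above; this is a sketch, with outputs omitted):

>>> bpls.varexp      # variance explained by each latent variable (see help(bpls))
>>> bpls.singvals    # singular values from the decomposition
>>> bpls['varexp']   # results objects are dict-like, so key-style access works too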

pypyls's People

Contributors

emdupre, eric2302, justinehansen, rmarkello


pypyls's Issues

Add cross-validation

Work by Rahim et al. (2017) demonstrated that, in many cases, BehavioralPLS may overfit to the data on which it's trained. While the authors recommend using reduced rank regression in place of PLS, for the sake of the current code base another option is to integrate their use of cross-validation.

sklearn already provides many if not all of the features needed to perform cross-validation; however, those features are geared towards use on sklearn-style estimator classes. While there are potential plans to move the current package towards a more sklearn-style approach, I think it would be reasonable to build CV into the functionality of this repository at the current juncture regardless.

There are a number of possible ways to do this, but I'm imagining something where there's a parameter (e.g., cv_split) that can be provided when initializing BehavioralPLS or MeanCenteredPLS that determines the train / test validation split. For example, cv_split = 0.75 would partition 75% of the data for training and 25% for testing. The requested decomposition would be performed n_split times (using the parameter currently reserved for split-half reliability testing) as described by Rahim et al., and the metrics laid out in that paper (ΔCV) could be returned with the default results.
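As a rough sketch of what that partitioning might look like (cv_split and the surrounding names are hypothetical; none of this is implemented yet):

import numpy as np

def cv_partition(n_samples, cv_split=0.75, seed=None):
    """Return train/test row indices for a single validation split."""
    rs = np.random.RandomState(seed)
    perm = rs.permutation(n_samples)
    n_train = int(np.floor(cv_split * n_samples))
    return perm[:n_train], perm[n_train:]

# repeated n_split times; fit on `train` rows, evaluate on `test` rows
train, test = cv_partition(80, cv_split=0.75, seed=1234)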

Error when n_perm or n_boot is zero

Trying to run behavioral_pls() or meancentered_pls() with n_perm or n_boot set to 0 currently yields a completely unhelpful error. These parameters technically should be optional (if you don't want to run the permutation or bootstrap tests), so it would be great to ensure that the functions can run correctly without them being set.

I envision that the appropriate fix to this is to wrap all calls to bootstrapping / permutation tests in if clauses, e.g.:

if n_boot > 0:
    # run bootstraps here
if n_perm > 0:
    # run permutations here

These changes will primarily need to take place in the run_pls() method of pyls.base.BasePLS, pyls.types.MeanCenteredPLS and pyls.types.BehavioralPLS where the relevant functions are being invoked.

Refactor BasePLS to be `scikit-learn`-style

Scikit-learn is perhaps one of the most widely known and used machine-learning toolboxes in the world. It is incredibly robust and, critically, very easy to use. In most cases, you generate an instance of the algorithm you want to use—say, clf = sklearn.svm.SVC()—and then fit your data to it: clf.fit(X, Y).

The core decompositions contained in the current toolbox are, for the most part, implemented to some degree in scikit-learn; however, the permutation testing / bootstrap resampling / split-half reliability assessments are unique to this repository and are, in my estimation, functionality that should be retained.

If we could somehow integrate the current framework into scikit-learn-style classes, that would give us the best of both worlds. The scikit-learn developer guidelines have some pretty extensive information on what this might look like, and they even provide a project template that might be useful for implementation.

Riffing on the examples in the README, I'm imagining something like:

bpls = BehavioralPLS(groups=[5, 5], n_cond=2, n_perm=100, n_boot=50, n_split=50, seed=1234)
bpls.fit(X, Y)
bpls.results_
PLSResults(u, s, v, usc, vsc, boot_result, perm_result, perm_splithalf, inputs, s_varexp)

This is just a possibility! It's not completely necessary, but it might be nice and would potentially make integrating cross-validation as in #21 a bit easier.

Order of arrays when using conditions/groups

Hi all,

I had a question about the order of data in arrays. In the example given for behavioral_pls, the arrays are 80xN, with the 80 rows being made up of 2 groups of 20 subjects each, each of these having 2 conditions. My question is: what order should the observations be in within this array?

Best,
Mike

Implement x_weights_ci and y_weights_ci

This issue is related to #58. It would be nice to also obtain the confidence intervals for the weights instead of the loadings. For my project I am more interested in analyzing the weights than the loadings because they take multicollinearity into account.

PLS results class with properties (and doc-strings)?

Should invoking MeanCenteredPLS() or BehavioralPLS() return a PLSResults() class (or similar), with the resulting arrays as properties (to include doc-strings)? Should there be one general PLSResults() class for all analysis types, where the variables/attributes that aren't generated by a given analysis just return None?

Bootstrap ratios for behavioral saliences

Hi,
Thanks for creating this package, it has really come in handy. But shouldn't the bootstrap ratios also be calculated for the right singular vectors (latent behavioral components), to see which behaviors are consistently contributing (i.e., stable)?

Add duecredit

A lot of the code in this repository builds on a significant amount of previous work and publications. Currently, references to those works are located in the doc-strings of relevant functions/classes (under the References heading). However, we could instead integrate duecredit, a tool that helps users know to whom and how to assign credit when using code.

Thankfully, the duecredit page has a brief walk-through for integration. It would be great to integrate it into pyls and convert all the references currently located in doc-strings into duecredit-style citations.

As an example, pyls.types.BehavioralPLS currently has the following reference in its doc-string:

class BehavioralPLS(BasePLS):
    """
    Runs "behavioral" PLS

    Parameters
    ----------
    ...
    
    References
    ----------
    .. [1] McIntosh, A. R., Bookstein, F. L., Haxby, J. V., & Grady, C. L.
       (1996). Spatial pattern analysis of functional brain images using
       partial least squares. Neuroimage, 3(3), 143-157.
    """

This could be converted to a duecredit-style citation:

@due.dcite(BibTeX("""
    @article{mcintosh1996spatial,
    author={McIntosh, AR and Bookstein, FL and Haxby, James V and Grady, CL},
    title={Spatial pattern analysis of functional brain images using partial least squares},
    year={1996},
    journal={NeuroImage},
    volume={3},
    number={3},
    pages={143--157}}
    """),
    description='First application of PLS to functional neuroimaging data.')
class BehavioralPLS(BasePLS):
    """
    Runs "behavioral" PLS

    Parameters
    ----------
    ...
    """

Then, users would be able to follow the instructions for running their code with duecredit enabled to get a functioning BibTeX file!

It would be best to follow the steps in the duecredit guide for copying over the stub.py to the primary directory of pyls and then importing (from .due import due, Doi, BibTeX) as required. The rest would be filling out the BibTeX-style decorators, like in the example above.

What exactly is the cv argument doing?

Hi,

I am a little bit confused about the cv argument in pyls.behavioral_pls. What exactly is the cross-validation used for? For computing the p-values? And if so, how does the cv argument interplay with n_split? Finally, is my question somehow related to #24?

Greetings,

Johannes

Fix tests

I broke them all with recent restructuring and never bothered fixing them.

Carry through `grouping` for BehavioralPLS

Make sure that a grouping variable can be effectively carried through to all functions for BehavioralPLS. Currently this is only implemented (half-heartedly) in pyls.compute.svd() -- the other functions accept grouping but don't generally do anything with it.

Rename repository

Unfortunately, the current name of the repository, pyls, was reserved on PyPI nearly a decade ago. If this package is to ever be uploaded to PyPI for easy distribution (i.e., pip install packagename), it will need to be renamed to something that hasn't been used before.

It would be great if the name could somehow play on multivariate covariance models, decompositions, or dimensionality reduction—though writing those out sounds a bit unwelcoming... All suggestions are welcome!

tl;dr: Help! We need a new name!

Split-half resampling

Fully implement split-half resampling and ensure that all outputs are generated correctly. Once done, make default n_split=500.

Add user documentation

We need documentation! Our README is a bit spartan, so some more in-depth user documentation would be very helpful for orienting people to the repository and, hopefully, preempting any questions they might have about installation, use, etc. The documentation should ideally describe:

  1. The purpose of pyls,
  2. How to download and install pyls,
  3. Some basic usage information, and
  4. A reference API

For (1), something in line with the project roadmap would be sufficient; for (2), basic instructions on downloading and installation (e.g., python setup.py install) would be perfect; and for (3), a few in-depth examples would be great, demonstrating the various potential use cases of the code.

I think the best choice for setting all of this up would be Sphinx! Sphinx has a quickstart guide that, while a bit obtuse at times, is sufficient to at least get some bones. Once the bones are there, my tactic has generally been to find documentation that I like and borrow as appropriate (licensing permitting!). One of my other repositories that could be copied, to some degree, is snfpy.

It's worth noting that Sphinx uses reStructuredText for formatting. This is quite a bit different from the Markdown that GitHub normally relies on, so it will be good to keep a reference handy.
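For reference, a bare-bones Sphinx skeleton can be generated with something like the following (assuming Sphinx isn't already installed):

pip install sphinx
sphinx-quickstart docs
cd docs
make html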

Unable to access pyls functions - module have no attribute

I have installed the pyls package according to the website; however, Spyder is unable to access anything needed to run the behavioral PLS.
I have included a snippet of my code (with the portion where I set my variables removed) and the error I get.
Currently the pyls package is nested in the site-packages folder, as I am running it through Anaconda.
Please help... I'm new to Python.

I'm running it on an M3 MacBook Pro with macOS Sonoma 14.4.1.

My code: (I import the packages I need - no problem - but run into issues doing the actual test)

import pyls
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

#perform pls analysis
results = pyls.behavioral_pls(X, Y, groups=groups, n_cond=n_cond, n_boot=n_boot, n_perm=n_perm)

The error message I get: (I blocked out user details with ...)

runfile('/Users/ ... /.spyder-py3/temp.py', wdir='/Users/ ... /.spyder-py3')
Traceback (most recent call last):

File ~/opt/anaconda3/envs/science/lib/python3.12/site-packages/spyder_kernels/py3compat.py:356 in compat_exec
exec(code, globals, locals)

File ~/.spyder-py3/temp.py:36
results = pyls.behavioral_pls(X, Y, groups=groups, n_cond=n_cond, n_boot=n_boot, n_perm=n_perm)

AttributeError: module 'pyls' has no attribute 'behavioral_pls'
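A likely culprit here is that a different package named pyls was imported (the name pyls on PyPI belongs to an unrelated package; see the "Rename repository" issue above). A quick diagnostic sketch:

>>> import pyls
>>> pyls.__file__                    # which pyls was actually imported?
>>> hasattr(pyls, 'behavioral_pls')  # False means the wrong package is shadowing this one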

Implement x_loadings_ci in pyls.structures.PLSBootResults

I would like to use pyls.behavioral_pls for my current project. When inspecting the attributes of pyls.structures.PLSBootResults I noticed that it only offers y_loadings_ci but not a corresponding x_loadings_ci. In fact, I discovered that Justine Hansen already created a workaround for this issue by running PLS one more time with X and Y flipped to get the 'right' confidence intervals. But maybe it would make sense to implement x_loadings_ci in the first place?

Add parallel computing for permutations/bootstraps

One of the key features of BehavioralPLS and MeanCenteredPLS is the permutation testing and bootstrap resampling code currently built into base.BasePLS. Both permutation testing and bootstrap resampling require, traditionally, somewhere on the order of 1000-5000+ computations, each, in order to be considered reliable. Currently these computations are done serially, one after the other; while this is relatively quick for small data, it can rapidly become very time-consuming when the data is "big" (i.e., when there are on the order of >250,000 features for a dataset).

It should be possible to perform these computations in parallel, rather than serially. There exist a few Python libraries that might make this possible, but the built-in multiprocessing library would be a fantastic place to start. base.BasePLS already accepts an n_proc argument that allows users to specify the number of parallel processes they wish to run, and a multiprocessing.Pool could be used to handle running the computations in parallel.

Once implemented, it would be great to perform a variety of tests on this to determine whether the speed boosts from parallel computation are worthwhile for low-dimensional data. That is, it is possible that the time required to generate the parallel processes may be more than the time it takes to simply run the computations serially. I believe the speed up of parallel processing will only be beneficial for large data, so it may be good to build in some checks, based on the results of testing, to override user specifications for parallelism if it won't be beneficial.
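As a rough sketch of the idea (single_bootstrap below is a hypothetical stand-in for the real resampling computation in base.BasePLS):

import multiprocessing as mp
import numpy as np

def single_bootstrap(seed):
    """Run one bootstrap resample (placeholder for the real computation)."""
    rs = np.random.RandomState(seed)
    sample = rs.choice(100, size=100, replace=True)  # resample subjects with replacement
    return sample.mean()  # placeholder statistic

if __name__ == '__main__':
    n_boot, n_proc = 1000, 4
    with mp.Pool(processes=n_proc) as pool:
        boot_results = pool.map(single_bootstrap, range(n_boot))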

Change split-half resampling mechanism

The current split-half resampling implementation, which is identical to the implementation in the Matlab PLS toolbox, could potentially use some tweaking. For a primer/background/deep dive into the math and rationale behind the current code, check the original paper by Kovacevic et al. (2013).

The intended goal of split-half resampling is to provide a metric of reliability. That is, it aims to offer an assessment of how much the observed effects (i.e., latent variables) are supported by the data, regardless of the samples (i.e., subjects) that are driving those decompositions. In a way, this aim could perhaps be better achieved by cross-validation, as described in #21.

If cross-validation is implemented, then we could eliminate split-half resampling altogether. However, another option would be to only eliminate performing split-half resampling during the permutation testing, and instead assess reliability of split-half resampling for the original (non-permuted) data.

Doing split-half resampling on the original data would result in a distribution of correlations for each left/right singular vector (U and V). Rather than returning a non-parametric p-value, as is done with the current split-half resampling + permutation paradigm, we could generate some basic metrics for interpreting the distribution (e.g., confidence intervals, central tendency, skewness). These metrics could be reported with the standard PLSResults. Notably, the proposed regime would be significantly less computationally expensive (see below for a step-by-step comparison).

The proposal (with math!)

Assume n_split=100 and n_perm=1000, and that X and Y are input data matrices of shape (N x M1) and (N x M2).

Current split-half resampling paradigm

  1. Generate the cross-covariance matrix, D = Y.T @ X, and perform SVD on it: D = U @ S @ V.T;
  2. Randomly split the input data in half (row-wise, i.e., across subjects), generate a cross-covariance matrix from each half (D1 = Y1.T @ X1 and D2 = Y2.T @ X2), and project each onto the original left/right singular vectors: U1 = D1 @ V, U2 = D2 @ V, V1 = D1.T @ U, V2 = D2.T @ U;
  3. Compute the Pearson correlation of the projected singular vectors: U_corr = corr(U1, U2) and V_corr = corr(V1, V2), where U_corr and V_corr are vectors of correlations for each singular vector separately;
  4. Repeat steps 2-3 n_split times and take the average of U_corr and V_corr across all splits: U_corr_mean = mean(U_corr) and V_corr_mean = mean(V_corr);
  5. Permute Y randomly and repeat steps 1-4 n_perm times;
  6. Assess how many times U_corr_mean and V_corr_mean from the permuted decompositions (step 5) are higher than the original values, and divide by the number of permutations (1000) to generate a p-value to report.

Proposed split-half resampling paradigm

  1. Generate the cross-covariance matrix, D = Y.T @ X, and perform SVD on it: D = U @ S @ V.T;
  2. Randomly split the input data in half (row-wise, i.e., across subjects), generate a cross-covariance matrix from each half (D1 = Y1.T @ X1 and D2 = Y2.T @ X2), and project each onto the original left/right singular vectors: U1 = D1 @ V, U2 = D2 @ V, V1 = D1.T @ U, V2 = D2.T @ U;
  3. Compute the Pearson correlation of the projected singular vectors: U_corr = corr(U1, U2) and V_corr = corr(V1, V2), where U_corr and V_corr are vectors of correlations for each singular vector separately;
  4. Repeat steps 2-3 n_split times to generate a distribution of correlations for each singular vector
  5. Compute various metrics on the distributions (e.g., 95%ile values, central tendency, skewness) and report them (see the sketch below).
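A rough NumPy sketch of the proposed paradigm, assuming X and Y are already z-scored and ignoring the sign-flipping of singular vectors that a real implementation would need to handle:

import numpy as np

def splithalf_correlations(X, Y, U, V, n_split=100, seed=1234):
    """Distributions of split-half correlations for each singular vector."""
    rs = np.random.RandomState(seed)
    n = len(X)
    u_corrs, v_corrs = [], []
    for _ in range(n_split):
        idx = rs.permutation(n)
        h1, h2 = idx[:n // 2], idx[n // 2:]
        D1, D2 = Y[h1].T @ X[h1], Y[h2].T @ X[h2]   # per-half cross-covariance
        U1, U2 = D1 @ V, D2 @ V                     # project onto original vectors
        V1, V2 = D1.T @ U, D2.T @ U
        u_corrs.append([np.corrcoef(U1[:, i], U2[:, i])[0, 1] for i in range(U.shape[1])])
        v_corrs.append([np.corrcoef(V1[:, i], V2[:, i])[0, 1] for i in range(V.shape[1])])
    return np.array(u_corrs), np.array(v_corrs)

Step 5 would then amount to summarizing the returned distributions, e.g., np.percentile(u_corrs, [2.5, 97.5], axis=0).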

Integrate `sklearn.utils.Bunch`

Currently, the results structures generated by BehavioralPLS and MeanCenteredPLS are built on top of the pyls.utils.DefDict class. DefDict functions as a dict-like object that allows you to access keys of the dictionary as attributes (i.e., you can use dict.attr instead of dict['attr']), which seems minor but makes for a nice user experience.

DefDict also permits sub-classes to define a "default" set of keys that should be stored in the dictionary when it is created. This "default" key set limits the keyword arguments that can be provided at creation. As an example, if we pretend the default key set of DefDict is ['key1', 'key2', 'key3'] and we try to create an object with pyls.utils.DefDict(key1=10, key5=20), key5 would not be stored because it is not in the default list.

However, the class also has some unwanted functionality. Keeping with the above "default" key list:

>>> test = pyls.utils.DefDict(key1=10, key2=20)
>>> test.keys()
dict_keys(['key1', 'key2'])
>>> test.keys = 100  # assign 100 to `keys`, shadowing the built-in dict method
>>> test.keys()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'int' object is not callable

That is, the class permits overwriting built-in Python dict functionality, which is not good. Thankfully, the other functionality (of limiting the provided keyword arguments) works as expected:

>>> test = pyls.utils.DefDict(key1=20, key2=20, keys=100)
>>> test.keys()
dict_keys(['key1', 'key2'])

Since keys is not in the default key list, it is ignored. Still, the former issue is not ideal, and I've grown increasingly unconvinced that the default key list is necessary for our intended purposes of using it as a results structure.

We could potentially use sklearn.utils.Bunch to replace DefDict. This sklearn utility also functions as a dict-like object, allowing access to the keys of the dictionary as attributes. And indeed, Bunch doesn't experience the primary failing of DefDict:

>>> test = sklearn.utils.Bunch(key1=10, key2=20)
>>> test.keys()
dict_keys(['key1', 'key2'])
>>> test.keys = 100
>>> test.keys()  # this works!!
dict_keys(['key1', 'key2', 'keys'])

It would be nice to either (1) entirely replace DefDict with Bunch, (2) take the functionality from Bunch that prevents such overrides and integrate it into DefDict, or (3) explicitly disallow setting new items in DefDict by superseding the __setitem__ method (somehow). I'm flexible on this front, and would be fine with help accomplishing whatever is easiest!

Add ability to accept conditions in MeanCenteredPLS

MeanCenteredPLS() currently requires a grouping variable, but should optionally accept a within-group grouping variable (i.e., group conditions). Would require refactoring all the _gen_X() methods for the class.

Get warning that random seed is not set

I am calling pyls.behavioral_pls with this command:

pls_result = pyls.behavioral_pls(X=brain_df_discovery,
                                 Y=behavior_df_discovery,
                                 n_perm=1000,
                                 n_boot=1000,
                                 seed=123,
                                 random_state=123,
                                 test_split=10,
                                 test_size=0.1)

Although I set both random_state and seed to a fixed value, I get the following warning:

C:\Users\Johannes.Wiesner\Miniconda3\envs\csp_wiesner_johannes\lib\site-packages\sklearn\utils\extmath.py:368: FutureWarning:

If 'random_state' is not supplied, the current default is to use 0 as a fixed seed. This will change to  None in version 1.2 leading to non-deterministic results that better reflect nature of the randomized_svd solver. If you want to silence this warning, set 'random_state' to an integer seed or to None explicitly depending if you want your code to be deterministic or not.

Mean centering causing problems.

Hi @rmarkello

I think there's a bug in the mean centering.
If data is input as a pandas DataFrame:
--> 381 X -= X.mean(axis=0, keepdims=True)
    382 Y_agg -= Y_agg.mean(axis=0, keepdims=True)
ValueError: the 'keepdims' parameter is not supported in the pandas implementation of mean()

If data is a numpy array:
ValueError: array must not contain infs or NaNs
and afterwards the input data['X'] is filled with NaNs.
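In the meantime, converting to plain NumPy arrays before calling seems to avoid both problems (a sketch; X_df and Y_df stand in for the DataFrames):

>>> import numpy as np
>>> X = np.array(X_df, dtype=float)  # np.array copies, so the original DataFrame is untouched
>>> Y = np.array(Y_df, dtype=float)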

Dot product of scores is not equal to singular values

I am running the following:

from pyls.examples import load_dataset
from pyls.compute import xcorr
from pyls.compute import svd
from scipy.stats import zscore
import numpy as np

# Way 1
data = load_dataset('linnerud')
R = xcorr(data.X, data.Y, norm=False, covariance=False) # internal z-scoring and division by N-1 is done
U, S, V = svd(R) # SVD using sklearn internally 

# Way 2
myX = zscore(data.X,ddof=1)
myY = zscore(data.Y,ddof=1)
myR=np.dot(myY.T,myX) / (len(myX)-1)
myU, myS, myVh = np.linalg.svd(myR) # SVD from numpy
myS = np.diag(myS)

#Point 1
np.allclose(myR,R) # passes, same Cross-correlation matrices
np.allclose(U,myU) # does not pass, different orthogonal bases

# Point 2
# the singular value is equal to the covariance of the corresponding score vectors
myXV = np.dot(myX, myVh.T)
myYU = np.dot(myY, myU)
np.allclose(np.dot(myXV[:,0], myYU[:,0]) / (len(myXV)-1) , myS[0][0]) # True, correct

data.X = zscore(data.X, 0, ddof=1)
data.Y = zscore(data.Y, 0, ddof=1)
XV = np.dot(data.X, V)
YU = np.dot(data.Y, U)
np.allclose(np.dot(XV[:,0], YU[:,0]) / (len(XV)-1) , S[0][0]) # False !

Reference: doi:10.1016/j.neuroimage.2010.07.034, equation (9)

Point 1: Why does np.allclose(U, myU) not pass?
Point 2: Why does np.allclose(np.dot(XV[:,0], YU[:,0]) / (len(XV)-1), S[0][0]) return False?
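One thing worth checking for Point 1: singular vectors are only defined up to a sign flip, and different SVD implementations can return sign-flipped (or differently ordered) columns, so a sign-invariant comparison may be more informative:

>>> np.allclose(np.abs(U), np.abs(myU))  # compare bases up to column sign flips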

EDIT

Additionally, in this tutorial (https://github.com/rmarkello/pyls/blob/master/docs/user_guide/behavioral.rst):

XU = np.dot(data.X, U) should be XV = np.dot(data.X, V), and YV = np.dot(data.Y, V) should be YU = np.dot(data.Y, U).

Restructure PLS classes inheritance

All PLS classes in types.py should be restructured to inherit from a BasePLS class that contains standard functions common to all PLS analyses (i.e., permutation, bootstrapping).

Add comparisons to Matlab results

The current repository can import Matlab PLS result files and use the data from those files as inputs in a new analysis. I believe this will be helpful for people who are transitioning from Matlab to Python and don't want to lose any of their results from the Matlab PLS toolbox. However, it would also be great to ensure that the results from this repository and the results from PLS in Matlab were comparable, to a degree.

To aid in this, I ran a number of different PLS analyses in Matlab on random data and bundled all the results files. It would be great if a set of tests could be developed that:

  1. Download the results files,
  2. Load every .mat file and use the data as an input to an analysis with BehavioralPLS or MeanCenteredPLS, and
  3. Compare the results from the Matlab and Python implementations to ensure they are more-or-less equivalent.

This may involve adding CircleCI testing, since re-running all the analyses and comparing the results might take a long time (too long for TravisCI), but I'd be keen to see how far TravisCI can go before adding that in.

Given that there is some randomness in permutation testing / bootstrap resampling, equating Matlab and Python results may prove difficult. As I see it, options include:

  1. Rather than re-generating permutation and bootstrap samples, re-use the perm_result.permsamp and boot_result.bootsamp arrays stored in the Matlab results files during the Python analysis. Unfortunately, this would require some re-coding of the BasePLS class...
  2. Don't directly compare the permutation/bootstrap results between Matlab and Python and simply ensure the results of the original SVD are equivalent.
  3. Retain a very liberal threshold for comparing permutation/bootstrap results between Matlab and Python.

I'm certainly open to discussion on how best to ensure that users who might be migrating from the Matlab toolbox can expect similar results when running PLS analyses in Python!

PLSRegression alters input matrices

First, thank you so much for this repo! Trying to compare results between Python and Matlab has been an insane headache until I found this. :)

I noticed that every time I run pls_regression(), my X and Y input matrices both get altered. To prevent this, I am currently creating a sacrificial copy of each original matrix for use with pls_regression, and all's well. But I wanted to report that this surprised me. I would expect that my source data not be altered.

I assume this could be fixed by creating local copies of X and Y_agg prior to the line X -= np.nanmean(X, axis=0, keepdims=True), and using the local copies rather than references to the originals throughout that function, but I'm not familiar enough with the rest of the codebase to say that with confidence.
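For reference, the sacrificial-copy workaround looks something like this (reusing the n_components call from the README):

X_safe, Y_safe = X.copy(), Y.copy()  # protect the originals from in-place centering
plsr = pls_regression(X_safe, Y_safe, n_components=5)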

Best wishes!

Roadmap (start here!)

Project mission + summary

Welcome to our project roadmap!

The overarching goal of this project is to provide an easy-to-use interface for conducting multivariate cross-covariance analyses. Right now, that amounts to Partial Least Squares-SVD; indeed, some of the core functionality of the current repository can be considered a translation of the Matlab PLS toolbox by the same authors. However, the current package only aims to implement the core statistical functionality of these analyses in easy-to-use interfaces. In doing so, we hope to leverage all the benefits of the significance and reliability testing made popular in the Matlab toolbox and combine it with the accessibility and efficacy of tools like scikit-learn.

These goals will likely evolve over time, though, so check back here for updates!

How to get involved

If you're itching to get started, you should start by reading our contributing guidelines and code of conduct!

Once you're done with that, head on over to the roadmap project. That project lays out all the 'development' plans into shorter- and longer-term goals. Most of them will (hopefully) have helpful labels identifying whether they're good first issues, general enhancements, or new features. If you're interested in tackling something, head on over to the relevant issue and let us know.
