abidlabs / contrastive

Contrastive PCA

License: MIT License

Python 1.06% Jupyter Notebook 98.94%

contrastive's Introduction

This is the GitHub repository for Abubakar Abid's blog and research journal, which you can find at https://abidlabs.github.io/

contrastive's Issues

TypeError: 'numpy.float64' object cannot be interpreted as an integer

Hi guys,

I encountered an issue when running:

contrastive/experiments/Single-Cell RNA-seq (Figure 3).ipynb

The error started from the code :

dataset = SingleCell(itemgetter(*active_file_idx)(fnames), [fnames[6]])

My best guess is that this is an issue with the new SciPy release?

================ERROR message================

TypeError                                 Traceback (most recent call last)
<ipython-input-34-9d882ec2e7eb> in <module>()
      2 
      3 active_file_idx = [1,2]
----> 4 dataset = SingleCell(itemgetter(*active_file_idx)(fnames), [fnames[6]])

<ipython-input-22-c169daba19a1> in __init__(self, active_files, background_file, N_GENES, to_standardize, verbose)
      5 
      6     def __init__(self, active_files, background_file, N_GENES = 500, to_standardize=True, verbose=True):
----> 7         self.active = vstack([self.file_to_features(fname) for fname in active_files])
      8         self.bg = vstack([self.file_to_features(fname) for fname in background_file])
      9         self.reduce_features(N_GENES)

<ipython-input-22-c169daba19a1> in <listcomp>(.0)
      5 
      6     def __init__(self, active_files, background_file, N_GENES = 500, to_standardize=True, verbose=True):
----> 7         self.active = vstack([self.file_to_features(fname) for fname in active_files])
      8         self.bg = vstack([self.file_to_features(fname) for fname in background_file])
      9         self.reduce_features(N_GENES)

<ipython-input-22-c169daba19a1> in file_to_features(self, fname)
     28         col = data[:,0]-1 #1-indexed
     29         values = data[:,2]
---> 30         c = csc_matrix((values, (row, col)), shape=(row.max()+1, col.max()+1))
     31         return c
     32 

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scipy/sparse/compressed.py in __init__(self, arg1, shape, dtype, copy)
     49                     # (data, ij) format
     50                     from .coo import coo_matrix
---> 51                     other = self.__class__(coo_matrix(arg1, shape=shape))
     52                     self._set_self(other)
     53                 elif len(arg1) == 3:

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scipy/sparse/coo.py in __init__(self, arg1, shape, dtype, copy)
    152                     # Use 2 steps to ensure shape has length 2.
    153                     M, N = shape
--> 154                     self._shape = check_shape((M, N))
    155 
    156                 idx_dtype = get_index_dtype(maxval=max(self.shape))

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scipy/sparse/sputils.py in check_shape(args, current_shape)
    279             new_shape = tuple(operator.index(arg) for arg in shape_iter)
    280     else:
--> 281         new_shape = tuple(operator.index(arg) for arg in args)
    282 
    283     if current_shape is None:

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scipy/sparse/sputils.py in <genexpr>(.0)
    279             new_shape = tuple(operator.index(arg) for arg in shape_iter)
    280     else:
--> 281         new_shape = tuple(operator.index(arg) for arg in args)
    282 
    283     if current_shape is None:

TypeError: 'numpy.float64' object cannot be interpreted as an integer
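
The traceback shows check_shape() passing each shape entry through operator.index, which rejects floats. One plausible cause is therefore that row and col are floating-point arrays, so row.max()+1 is a numpy.float64. A minimal sketch of a workaround, assuming the indices are whole numbers that happen to be stored as floats (the sample arrays below are made up for illustration and are not taken from the notebook):

```python
import operator
import numpy as np

# Hypothetical stand-ins for the index columns parsed from the data file.
row = np.array([0.0, 2.0, 1.0])
col = np.array([1.0, 0.0, 3.0])

# A float shape entry reproduces the failure seen in check_shape().
try:
    operator.index(row.max() + 1)
    float_shape_ok = True
except TypeError:
    float_shape_ok = False

# Casting the indices to integers first yields an integer shape tuple.
row_i = row.astype(int)
col_i = col.astype(int)
shape = (int(row_i.max()) + 1, int(col_i.max()) + 1)
```

Passing (values, (row_i, col_i)) together with shape=shape to csc_matrix should then avoid the TypeError.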

Reverse cPCA

First of all, this is not an issue.
Thanks for developing this algorithm.

The question is, is it possible to reverse cPCA in order to obtain the "corrected" matrix with the "subtracted" background?

In R I would do something like (even if this is not completely correct)
newmat <- t(t(cpca$x %*% t(cpca$rotation))) # if center and scale are set to FALSE

More generally, what should be done after cPCA? I would like to be able to explore the corrected matrix.
Many thanks
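
For reference, the R expression above multiplies the projected data by the transpose of the loading matrix. A rough NumPy equivalent is sketched here, assuming the data were neither centered nor scaled and that V holds orthonormal directions. V is generated at random as a stand-in, since it is not confirmed here that the package exposes the directions it finds:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))  # data matrix, rows are samples

# Stand-in for k=2 contrastive directions: any matrix with orthonormal columns.
V, _ = np.linalg.qr(rng.normal(size=(6, 2)))

Z = X @ V        # projected data, shape (50, 2)
X_hat = Z @ V.T  # back-projection into the original feature space
```

X_hat keeps only the part of X that lies in the span of V, so it is a rank-k approximation of the data rather than a matrix with the background literally subtracted.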

How to get the top cPCA components?

Hi, I am trying to get the top cPCA components. Is there a way to get them directly without tinkering with the code?

mdl = CPCA(n_components=2)
projected_data = mdl.fit_transform(a, b, plot=True)

That is, what are the top cPCA components that explain the differences between a and b?

Thanks!
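
Until the package exposes them, the directions can be computed by hand. The sketch below follows the usual definition of cPCA (top eigenvectors of the foreground covariance minus alpha times the background covariance). The function name contrastive_directions and the toy data are assumptions for illustration, and the package's internals may differ, for example in preprocessing or in how alpha is chosen:

```python
import numpy as np

def contrastive_directions(fg, bg, alpha, k):
    """Top-k eigenvectors of cov(fg) - alpha * cov(bg); rows are samples."""
    sigma = np.cov(fg, rowvar=False) - alpha * np.cov(bg, rowvar=False)
    w, v = np.linalg.eigh(sigma)     # eigenvalues in ascending order
    order = np.argsort(w)[::-1][:k]  # indices of the k largest eigenvalues
    return v[:, order], w[order]

rng = np.random.default_rng(0)
# Feature 0 has high variance in both sets; feature 1 varies more in the foreground.
bg = rng.normal(size=(1000, 5)) * np.array([10.0, 1.0, 1.0, 1.0, 1.0])
fg = rng.normal(size=(1000, 5)) * np.array([10.0, 4.0, 1.0, 1.0, 1.0])

components, eigvals = contrastive_directions(fg, bg, alpha=2.0, k=2)
top_feature = np.abs(components[:, 0]).argmax()  # feature with the largest loading
```

Each column of components is one direction in feature space, so the largest absolute loadings indicate which features drive the contrast between the two datasets.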

trying to understand how this is unsupervised learning/theory behind

Dear all,
Thank you for developing this method! It's very useful indeed.
I would like to understand more about it. How is it considered unsupervised learning when the user defines the background and target datasets, e.g. healthy vs. disease? And how does its purpose differ from that of, e.g., PLS-DA, where class labels/categorical covariates are also specified?

Mice protein experiment does not work

When I try to run the experiment, I get the following error message:
classes = np.genfromtxt('/Users/Dina/Desktop/Python/Data_Cortex_Nuclear.csv',delimiter=',', skip_header=1,usecols=range(78,81),dtype=None, encoding = 'bytes')
__main__:1: VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.

Also, the resulting cPCAs don't show any values at all, so I guess the import does not really work?

Here's the code I used:

import numpy as np

data = np.genfromtxt('/Users/Dina/Desktop/Python/Data_Cortex_Nuclear.csv',delimiter=',',skip_header=1,usecols=range(1,78),filling_values=0)

classes = np.genfromtxt('/Users/Dina/Desktop/Python/Data_Cortex_Nuclear.csv',delimiter=',', skip_header=1,usecols=range(78,81),dtype=None)
# __main__:1: VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.

target_idx_A = np.where((classes[:,-1]==b'S/C') & (classes[:,-2]==b'Saline') & (classes[:,-3]==b'Control'))[0]

target_idx_B = np.where((classes[:,-1]==b'S/C') & (classes[:,-2]==b'Saline') & (classes[:,-3]==b'Ts65Dn'))[0]

labels = len(target_idx_A)*[0] + len(target_idx_B)*[1]

labels = len(target_idx_A) + len(target_idx_B)

target_idx = np.concatenate((target_idx_A,target_idx_B))

target = data[target_idx]

background_idx = np.where((classes[:,-1]==b'C/S') & (classes[:,-2]==b'Saline') & (classes[:,-3]==b'Control'))

background = data[background_idx]

from contrastive import CPCA

mdl = CPCA()

projected_data = mdl.fit_transform(target, background, plot=True, active_labels=labels)
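
One thing that stands out in the code above is that the second assignment to labels replaces the list with a single integer, so active_labels no longer has one entry per target row. Separately, depending on the NumPy version and the encoding argument, the class columns may be read as str rather than bytes, in which case comparisons against b'...' would match nothing. A small sketch of the selection logic on a made-up miniature table (the classes and data arrays here are hypothetical, not the real file):

```python
import numpy as np

# Hypothetical miniature of the three class columns, read as str.
classes = np.array([
    ["Control", "Saline", "S/C"],
    ["Ts65Dn", "Saline", "S/C"],
    ["Control", "Saline", "C/S"],
    ["Control", "Saline", "S/C"],
])
data = np.arange(8, dtype=float).reshape(4, 2)

def select(genotype, behavior):
    """Row indices of saline-treated mice with the given genotype and behavior."""
    mask = ((classes[:, -1] == behavior)
            & (classes[:, -2] == "Saline")
            & (classes[:, -3] == genotype))
    return np.where(mask)[0]

target_idx_A = select("Control", "S/C")
target_idx_B = select("Ts65Dn", "S/C")
target_idx = np.concatenate((target_idx_A, target_idx_B))
target = data[target_idx]

# Keep labels as a list with one entry per target row.
labels = len(target_idx_A) * [0] + len(target_idx_B) * [1]
```

Checking that len(labels) equals target.shape[0] and that target is non-empty before calling fit_transform helps separate import problems from plotting problems.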

What is the legend argument used for?

It seems that the legend argument is unused in transform() or fit_transform()

ValueError: n_components=1000 must be between 0 and min(n_samples, n_features)=100 with svd_solver='full'

For a dataset with n_features=1001 and n_samples=999 we will get:

ValueError: n_components=1000 must be between 0 and min(n_samples, n_features)=999 with svd_solver='full'

The default value of preprocess_with_pca_dim causes problems when calling fit_transform() because:

- it cannot be changed (other than by editing fit manually)
- it is hardcoded to 1000, but the SVD will not work when n < p, so maybe it could use min(1000, n_samples)
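
The suggested cap could look roughly like this. The helper name safe_pca_dim is hypothetical and not part of the package. It also clamps by n_features, since the error message bounds n_components by min(n_samples, n_features):

```python
def safe_pca_dim(n_samples, n_features, requested=1000):
    """Largest PCA dimension that does not exceed the data's smaller axis."""
    return min(requested, n_samples, n_features)

# The case from this report: 999 samples, 1001 features.
dim = safe_pca_dim(n_samples=999, n_features=1001)
```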

Unclear how to format input matrices

I'm not sure what shape the input matrices should take, and I'm getting the following issue:

My background matrix has 6000 genes, and 39 conditions (6000X39)
My foreground matrix has 6000 genes, and 261 conditions (6000X261)

Running mdl.fit_transform gives me this error: ValueError: operands could not be broadcast together with shapes (261,261) (39,39)

Also, should I normalize my matrices first?
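
The shapes in the error message, (261,261) and (39,39), suggest that covariances are being computed with columns treated as features, so the matrices would presumably need to be samples x features with the same number of columns in both. A NumPy-only sketch of that reasoning, with the gene count scaled down from 6000 to 60 to keep it light (this is an inference from the error message, not confirmed package documentation):

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 60  # scaled down from 6000 for illustration
background = rng.normal(size=(n_genes, 39))   # genes x conditions
foreground = rng.normal(size=(n_genes, 261))  # genes x conditions

# Columns treated as features: the two covariances have different shapes.
bad_shapes = (np.cov(foreground, rowvar=False).shape,
              np.cov(background, rowvar=False).shape)

# Transposed so that rows are conditions and columns are genes.
fg, bg = foreground.T, background.T
good_shapes = (np.cov(fg, rowvar=False).shape,
               np.cov(bg, rowvar=False).shape)
```

With the transposed inputs both covariances are genes x genes, so they can be subtracted from one another.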

Return alphas as option to CPCA() class

I'm running cPCA on my dataset in a lot of different configurations and plotting the cPCs in my own script. It would be great if I could get the alphas returned so that I can add them to my plot titles. It looks like you do have a return_alphas option in your code, but it's not a top-level exposed parameter, and it would be great to have this as an option.

Thanks for putting together this great package!

Projected data is a matrix of complex numbers

Hi,

Thanks for sharing your code. First of all, this is not an issue, but a question.
I applied CPCA to my data and I get an array of complex numbers as projected data. Do you have a guess why that's the case?
Thanks,
Tahereh

Slides from talk?

Great talk at ICML comp bio! Wondering if you can post the slides? Wanted to see the statements regarding what you prove for any alpha > 0

No plotting of background data

Using the default implementation with something as simple as the iris dataset, only the foreground data is plotted; the background data does not appear. Is this intentional?

Problem installing

Hi,
I ran into this bug when I executed pip3 install contrastive:

python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [18 lines of output]
      The 'sklearn' PyPI package is deprecated, use 'scikit-learn'
      rather than 'sklearn' for pip commands.
      
      Here is how to fix this error in the main use cases:
      - use 'pip install scikit-learn' rather than 'pip install sklearn'
      - replace 'sklearn' by 'scikit-learn' in your pip requirements files
        (requirements.txt, setup.py, setup.cfg, Pipfile, etc ...)
      - if the 'sklearn' package is used by one of your dependencies,
        it would be great if you take some time to track which package uses
        'sklearn' instead of 'scikit-learn' and report it to their issue tracker
      - as a last resort, set the environment variable
        SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True to avoid this error
      
      More information is available at
      https://github.com/scikit-learn/sklearn-pypi-package
      
      If the previous advice does not cover your use case, feel free to report it at
      https://github.com/scikit-learn/sklearn-pypi-package/issues/new
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

Piping with .fit().transform() not possible

It is great to have an sklearn-like interface! However, when trying to work around #13 I noticed that .fit() does not return self. This is a minor thing, but returning self would make the interface so much nicer.
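
For illustration, the chaining convention in question looks like this on a toy transformer. MeanCenterer is a hypothetical class written for this sketch and is not part of the contrastive package:

```python
import numpy as np

class MeanCenterer:
    """Toy transformer whose fit() returns self so calls can be chained."""

    def fit(self, X):
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self  # this is what enables .fit(...).transform(...)

    def transform(self, X):
        return np.asarray(X, dtype=float) - self.mean_

X = np.array([[1.0, 2.0], [3.0, 6.0]])
centered = MeanCenterer().fit(X).transform(X)
```

Without the return self line, fit() returns None and the chained .transform() call fails with an AttributeError.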

Why do you adjust the sign here?

In the code below, you appear to be adjusting the sign of the first two dimensions of the data projected into cPCA space. Why only the first two? Should this depend on the number of components specified by the user?

contrastive/contrastive/__init__.py

Line 298 in 03d1384

reduced_dataset[:,0] = reduced_dataset[:,0]*np.sign(reduced_dataset[0,0])
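
Eigenvector signs are arbitrary, and the quoted line fixes them by making the first sample's coordinate non-negative. A sketch of how the same convention might be applied to every column, whatever the number of components (fix_signs is a hypothetical helper, not the package's code):

```python
import numpy as np

def fix_signs(reduced):
    """Flip each column so that the first row's entry is non-negative."""
    reduced = np.array(reduced, dtype=float)  # work on a copy
    signs = np.sign(reduced[0, :])
    signs[signs == 0] = 1.0  # avoid zeroing a column whose first entry is 0
    return reduced * signs

reduced_dataset = np.array([[-1.0, 2.0, 0.0],
                            [3.0, -4.0, -5.0]])
flipped = fix_signs(reduced_dataset)
```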

Ability to specify the number of PCs to return

I'd like to be able to access more than just the first 2 PCs from the projected_data object. Is there a way to specify this?

Example from quick test does not work

When running the quick test:

import numpy as np
from contrastive import CPCA

N = 400; D = 30; gap=3
# In B, all the data pts are from the same distribution, which has different variances in three subspaces.
B = np.zeros((N, D))
B[:,0:10] = np.random.normal(0,10,(N,10))
B[:,10:20] = np.random.normal(0,3,(N,10))
B[:,20:30] = np.random.normal(0,1,(N,10))


# In A there are four clusters.
A = np.zeros((N, D))
A[:,0:10] = np.random.normal(0,10,(N,10))
# group 1
A[0:100, 10:20] = np.random.normal(0,1,(100,10))
A[0:100, 20:30] = np.random.normal(0,1,(100,10))
# group 2
A[100:200, 10:20] = np.random.normal(0,1,(100,10))
A[100:200, 20:30] = np.random.normal(gap,1,(100,10))
# group 3
A[200:300, 10:20] = np.random.normal(2*gap,1,(100,10))
A[200:300, 20:30] = np.random.normal(0,1,(100,10))
# group 4
A[300:400, 10:20] = np.random.normal(2*gap,1,(100,10))
A[300:400, 20:30] = np.random.normal(gap,1,(100,10))
A_labels = [0]*100+[1]*100+[2]*100+[3]*100

cpca = CPCA(standardize=False)
cpca.fit_transform(A, B, plot=True, active_labels=A_labels)

the following error is thrown by the last command:


To use the plotting feature, you must download the 'matplotlib' package
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
<ipython-input-26-14aff36cdcc3> in <module>()
----> 1 cpca.fit_transform(A, B, plot=True, active_labels=A_labels)

/usr/local/lib/python3.5/dist-packages/contrastive/__init__.py in fit_transform(self, foreground, background, plot, gui, alpha_selection, n_alphas, max_log_alpha, n_alphas_to_return, active_labels, colors, legend, alpha_value, return_alphas)
     56     def fit_transform(self, foreground, background, plot=False, gui=False, alpha_selection='auto', n_alphas=40,  max_log_alpha=3, n_alphas_to_return=4, active_labels = None, colors=None, legend=None, alpha_value=None, return_alphas=False):
     57         self.fit(foreground, background)
---> 58         return self.transform(dataset=foreground, alpha_selection=alpha_selection,  n_alphas=n_alphas, max_log_alpha=max_log_alpha, n_alphas_to_return=n_alphas_to_return, plot=plot, gui=gui, active_labels=active_labels, colors=colors, legend=legend, alpha_value=alpha_value, return_alphas=return_alphas)
     59 
     60         """

/usr/local/lib/python3.5/dist-packages/contrastive/__init__.py in transform(self, dataset, alpha_selection, n_alphas, max_log_alpha, n_alphas_to_return, plot, gui, active_labels, colors, legend, alpha_value, return_alphas)
    214             if (alpha_selection=='auto'):
    215                 transformed_data, best_alphas = self.automated_cpca(dataset, n_alphas_to_return, n_alphas, max_log_alpha)
--> 216                 plt.figure(figsize=[14,3])
    217                 for j, fg in enumerate(transformed_data):
    218                     plt.subplot(1,4,j+1)

UnboundLocalError: local variable 'plt' referenced before assignment

It looks like the error is about matplotlib. Matplotlib is installed and works just fine except in this quick test.
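
One pattern that produces exactly this pair of messages is an import inside a try block whose except branch only prints a warning: if the import fails for any reason, plt is never bound and the next use raises UnboundLocalError. The sketch below simulates a failing import with a module name that does not exist. It is a guess at the mechanism, not the package's actual code:

```python
def plot_unsafe():
    try:
        # Stand-in for "import matplotlib.pyplot as plt" failing for any reason.
        import module_that_does_not_exist_xyz as plt
    except ImportError:
        print("To use the plotting feature, you must download the 'matplotlib' package")
    return plt.figure()  # plt was never bound, so this raises UnboundLocalError

def plot_safe():
    try:
        import module_that_does_not_exist_xyz as plt
    except ImportError as exc:
        # Fail immediately and keep the original cause attached.
        raise ImportError("plotting requires a working matplotlib import") from exc
    return plt.figure()

try:
    plot_unsafe()
    unsafe_error = None
except UnboundLocalError as exc:
    unsafe_error = exc
```

Since matplotlib is reported as installed, running import matplotlib.pyplot as plt in the same environment and reading the original exception (a backend problem, for example) may show why the import fails here.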

'NameError: name 'cpca_alpha' is not defined' while using Kernel cPCA

When I am trying to use Kernel cPCA, it is throwing the following error “NameError: name 'cpca_alpha' is not defined”.

The code snippet is similar to the one used for cPCA and looks like this:

import numpy as np
from contrastive import Kernel_CPCA

N = 400; D = 30; gap=3

# In B, all the data pts are from the same distribution, which has different variances in three subspaces.
B = np.zeros((N, D))
B[:,0:10] = np.random.normal(0,10,(N,10))
B[:,10:20] = np.random.normal(0,3,(N,10))
B[:,20:30] = np.random.normal(0,1,(N,10))

# In A there are four clusters.
A = np.zeros((N, D))
A[:,0:10] = np.random.normal(0,10,(N,10))
# group 1
A[0:100, 10:20] = np.random.normal(0,1,(100,10))
A[0:100, 20:30] = np.random.normal(0,1,(100,10))
# group 2
A[100:200, 10:20] = np.random.normal(0,1,(100,10))
A[100:200, 20:30] = np.random.normal(gap,1,(100,10))
# group 3
A[200:300, 10:20] = np.random.normal(2*gap,1,(100,10))
A[200:300, 20:30] = np.random.normal(0,1,(100,10))
# group 4
A[300:400, 10:20] = np.random.normal(2*gap,1,(100,10))
A[300:400, 20:30] = np.random.normal(gap,1,(100,10))
A_labels = [0]*100+[1]*100+[2]*100+[3]*100

cpca = Kernel_CPCA(standardize=False)
cpca.fit_transform(A, B, plot=False, active_labels=A_labels)

LA.eig(sigma) gives complex eigenvalues for symmetrical sigma

In line 293 of contrastive/__init__.py:
w, v = LA.eig(sigma)

It seems that np.linalg.eig sometimes gives complex eigenvalues due to truncation error even though my sigmas are symmetrical matrices. Would it be better to replace the line with the following?

w, v = np.linalg.eigh(sigma)
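
A quick NumPy check of the proposed replacement, on a made-up symmetric matrix built as a difference of two rank-deficient covariances (the data and the factor 2.0 are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
fg = rng.normal(size=(20, 50))  # fewer samples than features
bg = rng.normal(size=(20, 50))
sigma = np.cov(fg, rowvar=False) - 2.0 * np.cov(bg, rowvar=False)

# eigh assumes a symmetric (Hermitian) input and returns real results.
w, v = np.linalg.eigh(sigma)
```

Note that eigh returns eigenvalues in ascending order, so any code that picks the top components by position would need to sort or reverse the result.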