This is the GitHub repository for Abubakar Abid's blog and research journal, which you can find at https://abidlabs.github.io/
abidlabs / contrastive
License: MIT License
Contrastive PCA
Hi guys,
I encountered an issue when running:
contrastive/experiments/Single-Cell RNA-seq (Figure 3).ipynb
The error starts at this line:
dataset = SingleCell(itemgetter(*active_file_idx)(fnames), [fnames[6]])
My best guess is that this is an issue with the new SciPy release?
================ERROR message================
TypeError Traceback (most recent call last)
<ipython-input-34-9d882ec2e7eb> in <module>()
2
3 active_file_idx = [1,2]
----> 4 dataset = SingleCell(itemgetter(*active_file_idx)(fnames), [fnames[6]])
<ipython-input-22-c169daba19a1> in __init__(self, active_files, background_file, N_GENES, to_standardize, verbose)
5
6 def __init__(self, active_files, background_file, N_GENES = 500, to_standardize=True, verbose=True):
----> 7 self.active = vstack([self.file_to_features(fname) for fname in active_files])
8 self.bg = vstack([self.file_to_features(fname) for fname in background_file])
9 self.reduce_features(N_GENES)
<ipython-input-22-c169daba19a1> in <listcomp>(.0)
5
6 def __init__(self, active_files, background_file, N_GENES = 500, to_standardize=True, verbose=True):
----> 7 self.active = vstack([self.file_to_features(fname) for fname in active_files])
8 self.bg = vstack([self.file_to_features(fname) for fname in background_file])
9 self.reduce_features(N_GENES)
<ipython-input-22-c169daba19a1> in file_to_features(self, fname)
28 col = data[:,0]-1 #1-indexed
29 values = data[:,2]
---> 30 c = csc_matrix((values, (row, col)), shape=(row.max()+1, col.max()+1))
31 return c
32
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scipy/sparse/compressed.py in __init__(self, arg1, shape, dtype, copy)
49 # (data, ij) format
50 from .coo import coo_matrix
---> 51 other = self.__class__(coo_matrix(arg1, shape=shape))
52 self._set_self(other)
53 elif len(arg1) == 3:
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scipy/sparse/coo.py in __init__(self, arg1, shape, dtype, copy)
152 # Use 2 steps to ensure shape has length 2.
153 M, N = shape
--> 154 self._shape = check_shape((M, N))
155
156 idx_dtype = get_index_dtype(maxval=max(self.shape))
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scipy/sparse/sputils.py in check_shape(args, current_shape)
279 new_shape = tuple(operator.index(arg) for arg in shape_iter)
280 else:
--> 281 new_shape = tuple(operator.index(arg) for arg in args)
282
283 if current_shape is None:
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scipy/sparse/sputils.py in <genexpr>(.0)
279 new_shape = tuple(operator.index(arg) for arg in shape_iter)
280 else:
--> 281 new_shape = tuple(operator.index(arg) for arg in args)
282
283 if current_shape is None:
TypeError: 'numpy.float64' object cannot be interpreted as an integer
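A possible workaround, not verified against the notebook itself: the traceback suggests that row and col are float arrays, so row.max()+1 is a numpy.float64, which SciPy's shape check rejects. The sketch below casts the indices to integers before building the matrix. The toy data array and the row line are assumptions, since the line that defines row is not shown in the traceback.

```python
import numpy as np
from scipy.sparse import csc_matrix

# Toy stand-in for the parsed file: (col index, row index, value), 1-indexed.
data = np.array([[1.0, 1.0, 5.0],
                 [2.0, 3.0, 7.0]])

# Cast the indices to int so that the derived shape consists of integers.
row = data[:, 1].astype(int) - 1   # assumption: row indices live in column 1
col = data[:, 0].astype(int) - 1   # 1-indexed in the file
values = data[:, 2]

c = csc_matrix((values, (row, col)),
               shape=(int(row.max()) + 1, int(col.max()) + 1))
```

Wrapping the maxima in int() is redundant once the arrays are integer typed, but it keeps the shape valid even if the cast above is later removed.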
First of all, this is not an issue.
Thanks for developing this algorithm.
The question is, is it possible to reverse cPCA in order to obtain the "corrected" matrix with the "subtracted" background?
In R I would do something like (even if this is not completely correct)
newmat <- t(t(cpca$x %*% t(cpca$rotation)))
# if center and scale are set to FALSE
More in general, what to do after cPCA? I would like to be able to explore the corrected matrix.
Many thanks
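For what it's worth, the R line above corresponds to a plain back-projection. Here is a minimal numpy sketch of that idea, assuming you have the projected data and a matrix of orthonormal directions. The function name back_project and the stand-in directions V are hypothetical, and this does not use the contrastive package's own attributes. Note that the result is a rank-k approximation that keeps only the chosen directions, which is not necessarily the same thing as a "background-subtracted" matrix.

```python
import numpy as np

def back_project(projected, components, mean=None):
    """Map k-dimensional projections back to the original feature space.

    projected:  (n_samples, k)
    components: (n_features, k), columns assumed orthonormal
    mean:       optional (n_features,) vector that was subtracted beforehand
    """
    reconstructed = projected @ components.T
    if mean is not None:
        reconstructed = reconstructed + mean
    return reconstructed

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
mean = X.mean(axis=0)

# Stand-in orthonormal directions; in practice these would be the cPCA directions.
V, _ = np.linalg.qr(rng.normal(size=(6, 2)))

Z = (X - mean) @ V                 # projection onto the two directions
X_hat = back_project(Z, V, mean)   # rank-2 approximation in feature space
```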
Hi, I am trying to get the top cPCA components, is there a way to get that directly without tinkering with the code?
mdl = CPCA(n_components=2)
projected_data = mdl.fit_transform(a, b, plot=True)
i.e. What are the top cPCA components that explain a, b's differences?
Thanks!
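In case it helps while waiting for an answer: for a fixed alpha, the cPCA directions are the top eigenvectors of the difference between the target covariance and alpha times the background covariance. The sketch below computes them directly with numpy. The function name top_cpca_directions is hypothetical, and this is not the package's internal code (it skips standardization and automatic alpha selection).

```python
import numpy as np

def top_cpca_directions(foreground, background, alpha=1.0, n_components=2):
    """Return (eigenvalues, directions) of C_fg - alpha * C_bg, largest first."""
    # np.cov centers the data itself; rowvar=False treats columns as features.
    sigma = np.cov(foreground, rowvar=False) - alpha * np.cov(background, rowvar=False)
    w, v = np.linalg.eigh(sigma)                # symmetric input, ascending order
    order = np.argsort(w)[::-1][:n_components]  # indices of the largest eigenvalues
    return w[order], v[:, order]

rng = np.random.default_rng(0)
a = rng.normal(size=(200, 5))
a[:, 3] *= 4.0                       # extra variance only in the target data
b = rng.normal(size=(200, 5))

eigvals, directions = top_cpca_directions(a, b, alpha=1.0, n_components=2)
```

Each column of directions is one component, and the entries with the largest absolute values indicate which features drive the difference between a and b.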
Dear all,
Thank you for developing this method! It's very useful indeed.
I would like to understand more about it. How is this considered unsupervised learning when the user defines the background and target datasets, e.g. healthy vs. disease? And how does its purpose differ from that of, e.g., PLS-DA, where class labels/categorical covariates are also specified?
When I try to run the experiment, I get the following error message:
classes = np.genfromtxt('/Users/Dina/Desktop/Python/Data_Cortex_Nuclear.csv',delimiter=',', skip_header=1,usecols=range(78,81),dtype=None, encoding = 'bytes')
__main__:1: VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.
Also, the resulting cPCAs don't show any values at all, so I guess the import does not really work?
Here's the code I used:
import numpy as np
data = np.genfromtxt('/Users/Dina/Desktop/Python/Data_Cortex_Nuclear.csv',delimiter=',',skip_header=1,usecols=range(1,78),filling_values=0)
classes = np.genfromtxt('/Users/Dina/Desktop/Python/Data_Cortex_Nuclear.csv',delimiter=',', skip_header=1,usecols=range(78,81),dtype=None)
__main__:1: VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.
target_idx_A = np.where((classes[:,-1]==b'S/C') & (classes[:,-2]==b'Saline') & (classes[:,-3]==b'Control'))[0]
target_idx_B = np.where((classes[:,-1]==b'S/C') & (classes[:,-2]==b'Saline') & (classes[:,-3]==b'Ts65Dn'))[0]
labels = len(target_idx_A)*[0] + len(target_idx_B)*[1]
labels = len(target_idx_A) + len(target_idx_B)
target_idx = np.concatenate((target_idx_A,target_idx_B))
target = data[target_idx]
background_idx = np.where((classes[:,-1]==b'C/S') & (classes[:,-2]==b'Saline') & (classes[:,-3]==b'Control'))
background = data[background_idx]
from contrastive import CPCA
mdl = CPCA()
projected_data = mdl.fit_transform(target, background, plot=True, active_labels=labels)
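Two things stand out in the snippet above, though neither is confirmed as the cause. First, the second labels assignment replaces the list of labels with a single integer. Second, the warning is about how the string columns are decoded. A minimal sketch of one way to avoid both, assuming the class columns are read as text with an explicit encoding and compared against ordinary strings rather than byte strings; the in-memory CSV and its header names are hypothetical stand-ins for the real file:

```python
import io
import numpy as np

# Tiny in-memory stand-in for the last three columns of the CSV (hypothetical rows).
csv_text = (
    "Genotype,Treatment,Behavior\n"
    "Control,Saline,S/C\n"
    "Ts65Dn,Saline,S/C\n"
    "Control,Saline,C/S\n"
)

# Read the class columns as text, with the encoding stated explicitly.
classes = np.genfromtxt(io.StringIO(csv_text), delimiter=',',
                        skip_header=1, dtype=str, encoding='utf-8')

# Compare against str ('S/C'), not bytes (b'S/C'), since the array now holds text.
target_idx_A = np.where((classes[:, -1] == 'S/C') & (classes[:, -2] == 'Saline')
                        & (classes[:, -3] == 'Control'))[0]
target_idx_B = np.where((classes[:, -1] == 'S/C') & (classes[:, -2] == 'Saline')
                        & (classes[:, -3] == 'Ts65Dn'))[0]

# One label per selected row; keep this as a list and do not overwrite it.
labels = len(target_idx_A) * [0] + len(target_idx_B) * [1]
```

If the comparisons are left as byte strings while the array holds text, the index arrays come back empty, which would be consistent with plots that show no values.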
It seems that the legend argument is unused in transform() or fit_transform().
For a dataset with n_features=1001 and n_samples=999 we will get:
ValueError: n_components=1000 must be between 0 and min(n_samples, n_features)=999 with svd_solver='full'
The default value of preprocess_with_pca_dim (1000) causes problems when calling fit_transform(), as fit() does not cap it at min(1000, n_samples).
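A minimal sketch of the kind of cap suggested above, assuming the requested dimension should never exceed either axis of the data. The helper name safe_pca_dim is hypothetical and is not part of the package.

```python
def safe_pca_dim(n_samples, n_features, requested=1000):
    """Largest usable PCA dimension that does not exceed the request."""
    return min(requested, n_samples, n_features)

# The case from the report: 999 samples, 1001 features, default request of 1000.
dim = safe_pca_dim(n_samples=999, n_features=1001)
print(dim)  # 999
```

Until something like this is applied inside fit(), passing a smaller preprocess_with_pca_dim by hand may serve as a workaround.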
I'm not sure what shape the input matrices should take; however, I'm getting the following issue:
My background matrix has 6000 genes and 39 conditions (6000 x 39).
My foreground matrix has 6000 genes and 261 conditions (6000 x 261).
Running mdl.fit_transform gives me this error: ValueError: operands could not be broadcast together with shapes (261,261) (39,39)
Also, should I normalize my matrices first?
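The shapes in the error message, (261,261) and (39,39), suggest that the conditions are being treated as features, i.e. that rows are expected to be samples and columns features. Under that assumption, transposing both matrices so that the genes become the shared feature axis should make the two covariance matrices the same size. A small numpy sketch with a reduced gene count (60 instead of 6000, purely to keep it light):

```python
import numpy as np

n_genes = 60  # stand-in for the 6000 genes in the report
rng = np.random.default_rng(0)
background_genes_by_cond = rng.normal(size=(n_genes, 39))
foreground_genes_by_cond = rng.normal(size=(n_genes, 261))

# Passed as-is, the columns (conditions) act as features, so the sizes differ.
bad_fg_cov = np.cov(foreground_genes_by_cond, rowvar=False)   # (261, 261)
bad_bg_cov = np.cov(background_genes_by_cond, rowvar=False)   # (39, 39)

# Transposed: rows are conditions (samples), columns are genes (features).
foreground = foreground_genes_by_cond.T   # (261, n_genes)
background = background_genes_by_cond.T   # (39, n_genes)

fg_cov = np.cov(foreground, rowvar=False)
bg_cov = np.cov(background, rowvar=False)
contrast = fg_cov - 1.0 * bg_cov          # now well defined
```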
I'm running cPCA on my dataset in a lot of different configurations and plotting the cPCs in my own script. It would be great if I could get the alphas returned so that I can add them to my plot titles. It looks like you do have a return_alphas option in your code, but it's not a top-level exposed parameter, and it would be great to have this as an option.
Thanks for putting together this great package!
Hi,
Thanks for sharing your code. First of all, this is not an issue, but a question.
I applied CPCA to my data and I get an array of complex numbers as projected data. Do you have a guess why that's the case?
Thanks,
Tahereh
Great talk at ICML comp bio! Wondering if you can post the slides? Wanted to see the statements regarding what you prove for any alpha > 0
Using the default implementation with something as simple as the iris dataset, I don't seem to be getting the background data being plotted, but just the foreground? Is this intentional?
Hi,
I run into this bug when I execute pip3 install contrastive:
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [18 lines of output]
The 'sklearn' PyPI package is deprecated, use 'scikit-learn'
rather than 'sklearn' for pip commands.
Here is how to fix this error in the main use cases:
- use 'pip install scikit-learn' rather than 'pip install sklearn'
- replace 'sklearn' by 'scikit-learn' in your pip requirements files
(requirements.txt, setup.py, setup.cfg, Pipfile, etc ...)
- if the 'sklearn' package is used by one of your dependencies,
it would be great if you take some time to track which package uses
'sklearn' instead of 'scikit-learn' and report it to their issue tracker
- as a last resort, set the environment variable
SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True to avoid this error
More information is available at
https://github.com/scikit-learn/sklearn-pypi-package
If the previous advice does not cover your use case, feel free to report it at
https://github.com/scikit-learn/sklearn-pypi-package/issues/new
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
It is great to have an sklearn-like interface! However, when trying to work around #13 I noted that .fit() does not return self. This is a minor thing, but it would make it so much nicer.
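To illustrate the convention being asked for, here is a toy estimator whose fit() returns self so that calls can be chained. MeanCenterer is a hypothetical class written only for this example and has nothing to do with the CPCA implementation.

```python
import numpy as np

class MeanCenterer:
    """Hypothetical toy estimator, used only to illustrate the convention."""

    def fit(self, X):
        self.mean_ = np.asarray(X).mean(axis=0)
        return self  # returning self is what enables chaining

    def transform(self, X):
        return np.asarray(X) - self.mean_

X = np.array([[1.0, 2.0], [3.0, 6.0]])
model = MeanCenterer()
centered = model.fit(X).transform(X)  # works because fit() returns the estimator
```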
In the code below, you appear to be adjusting the sign of the first two dimensions of the data projected into cPCA space. Why only the first two? Should this depend on the number of components specified by the user?
contrastive/contrastive/__init__.py
Line 298 in 03d1384
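Since the referenced line is not reproduced here, the following is only a sketch of what a sign convention applied to every component (rather than just the first two) could look like. The rule used, flipping each direction so that its first entry is non-negative, is an assumption and not necessarily what the package does; fix_signs is a hypothetical helper.

```python
import numpy as np

def fix_signs(components):
    """Flip each column so that its first entry is non-negative.

    components: (n_features, k), one direction per column.
    """
    signs = np.where(components[0, :] < 0, -1.0, 1.0)  # one sign per column
    return components * signs                          # broadcasts across rows

V = np.array([[-1.0, 2.0, -3.0],
              [4.0, -5.0, 6.0]])
V_fixed = fix_signs(V)
```

Eigenvectors are only defined up to sign, so a rule like this changes nothing mathematically; it just keeps plots consistent between runs.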
I'd like to be able to access more than just the first 2 PCs from the projected_data object. Is there a way to specify this?
When running the quick test:
import numpy as np
from contrastive import CPCA
N = 400; D = 30; gap=3
# In B, all the data pts are from the same distribution, which has different variances in three subspaces.
B = np.zeros((N, D))
B[:,0:10] = np.random.normal(0,10,(N,10))
B[:,10:20] = np.random.normal(0,3,(N,10))
B[:,20:30] = np.random.normal(0,1,(N,10))
# In A there are four clusters.
A = np.zeros((N, D))
A[:,0:10] = np.random.normal(0,10,(N,10))
# group 1
A[0:100, 10:20] = np.random.normal(0,1,(100,10))
A[0:100, 20:30] = np.random.normal(0,1,(100,10))
# group 2
A[100:200, 10:20] = np.random.normal(0,1,(100,10))
A[100:200, 20:30] = np.random.normal(gap,1,(100,10))
# group 3
A[200:300, 10:20] = np.random.normal(2*gap,1,(100,10))
A[200:300, 20:30] = np.random.normal(0,1,(100,10))
# group 4
A[300:400, 10:20] = np.random.normal(2*gap,1,(100,10))
A[300:400, 20:30] = np.random.normal(gap,1,(100,10))
A_labels = [0]*100+[1]*100+[2]*100+[3]*100
cpca = CPCA(standardize=False)
cpca.fit_transform(A, B, plot=True, active_labels=A_labels)
the following error is thrown by the last command:
To use the plotting feature, you must download the 'matplotlib' package
---------------------------------------------------------------------------
UnboundLocalError Traceback (most recent call last)
<ipython-input-26-14aff36cdcc3> in <module>()
----> 1 cpca.fit_transform(A, B, plot=True, active_labels=A_labels)
/usr/local/lib/python3.5/dist-packages/contrastive/__init__.py in fit_transform(self, foreground, background, plot, gui, alpha_selection, n_alphas, max_log_alpha, n_alphas_to_return, active_labels, colors, legend, alpha_value, return_alphas)
56 def fit_transform(self, foreground, background, plot=False, gui=False, alpha_selection='auto', n_alphas=40, max_log_alpha=3, n_alphas_to_return=4, active_labels = None, colors=None, legend=None, alpha_value=None, return_alphas=False):
57 self.fit(foreground, background)
---> 58 return self.transform(dataset=foreground, alpha_selection=alpha_selection, n_alphas=n_alphas, max_log_alpha=max_log_alpha, n_alphas_to_return=n_alphas_to_return, plot=plot, gui=gui, active_labels=active_labels, colors=colors, legend=legend, alpha_value=alpha_value, return_alphas=return_alphas)
59
60 """
/usr/local/lib/python3.5/dist-packages/contrastive/__init__.py in transform(self, dataset, alpha_selection, n_alphas, max_log_alpha, n_alphas_to_return, plot, gui, active_labels, colors, legend, alpha_value, return_alphas)
214 if (alpha_selection=='auto'):
215 transformed_data, best_alphas = self.automated_cpca(dataset, n_alphas_to_return, n_alphas, max_log_alpha)
--> 216 plt.figure(figsize=[14,3])
217 for j, fg in enumerate(transformed_data):
218 plt.subplot(1,4,j+1)
UnboundLocalError: local variable 'plt' referenced before assignment
It looks like the error is about matplotlib. Matplotlib is installed and works just fine except in this quick test.
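The printed message followed by the UnboundLocalError suggests that an import of matplotlib inside transform() failed and was caught, which would leave plt unassigned. That reading is an inference from the traceback, not something confirmed in the source. One way to see the underlying error in the same environment is to attempt the import directly and print the full traceback; diagnose_import is a hypothetical helper:

```python
import importlib
import traceback

def diagnose_import(module_name):
    """Return None if the import succeeds, else the full traceback as text."""
    try:
        importlib.import_module(module_name)
        return None
    except Exception:
        return traceback.format_exc()

report = diagnose_import("matplotlib.pyplot")
print(report or "matplotlib.pyplot imported without errors")
```

Run this in the same kernel as the quick test: if a traceback is printed, it should name the actual failing dependency (for example a missing backend) rather than the generic message.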
When I try to use Kernel cPCA, it throws the following error: “NameError: name 'cpca_alpha' is not defined”.
The code snippet is similar to the one used for cPCA and looks like this:
import numpy as np
from contrastive import Kernel_CPCA
N = 400; D = 30; gap=3
B = np.zeros((N, D))
B[:,0:10] = np.random.normal(0,10,(N,10))
B[:,10:20] = np.random.normal(0,3,(N,10))
B[:,20:30] = np.random.normal(0,1,(N,10))
A = np.zeros((N, D))
A[:,0:10] = np.random.normal(0,10,(N,10))
A[0:100, 10:20] = np.random.normal(0,1,(100,10))
A[0:100, 20:30] = np.random.normal(0,1,(100,10))
A[100:200, 10:20] = np.random.normal(0,1,(100,10))
A[100:200, 20:30] = np.random.normal(gap,1,(100,10))
A[200:300, 10:20] = np.random.normal(2*gap,1,(100,10))
A[200:300, 20:30] = np.random.normal(0,1,(100,10))
A[300:400, 10:20] = np.random.normal(2*gap,1,(100,10))
A[300:400, 20:30] = np.random.normal(gap,1,(100,10))
A_labels = [0]*100+[1]*100+[2]*100+[3]*100
cpca = Kernel_CPCA(standardize=False)
cpca.fit_transform(A, B, plot=False, active_labels=A_labels)
In line 293 of contrastive/__init__.py:
w, v = LA.eig(sigma)
It seems that np.linalg.eig sometimes gives complex eigenvalues due to truncation error even though my sigmas are symmetrical matrices. Would it be better to replace the line with the following?
w, v = np.linalg.eigh(sigma)
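A small check of the suggestion, using a random symmetric matrix as a stand-in for sigma. np.linalg.eigh assumes a symmetric (Hermitian) input and returns real eigenvalues in ascending order, so any code that expects the largest eigenvalues first needs an explicit sort:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 6))
sigma = (M + M.T) / 2            # symmetric, possibly indefinite, like C_fg - alpha * C_bg

w, v = np.linalg.eigh(sigma)     # real eigenvalues, ascending order

# Reorder so that the largest eigenvalues (and their eigenvectors) come first.
order = np.argsort(w)[::-1]
w, v = w[order], v[:, order]
```

Because eigh never returns complex output for real symmetric input, it may also explain and avoid the complex projections reported in another issue here.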