baselabujamous / clust Goto Github PK

View Code? Open in Web Editor NEW

160.0 160.0 35.0 6.61 MB

Automatic and optimised consensus clustering of one or more heterogeneous datasets

License: Other

Python 100.00%

clust's People

Contributors

Stargazers

Watchers

clust's Issues

Unexpected clustering behavior

Hello! I'm working with time series gene expression data from a metatranscriptome (~10 species). I have biological duplicate gene expression data from 5 days from each of 15 sites. In this case, I'm trying to use clust to see what similarities there are between all samples, so I treated all sites as biological replicates (30 samples from day 1, 30 samples from day 2, ... , 30 samples from day 5).

If i cluster ~38k genes from about 10 species, I get this first set of clustering.
Clusters_profiles.pdf

I was suspicious of cluster 0, so I subsetted by gene expression data to only genes in the first cluster and re-ran clust. This produced this result.
Clusters_profiles.pdf

Why are these profiles grouped together in the first place? Is it possible to make clust more stringent so that these profiles are not grouped together?

Normalization: I chose to normalize by data using edgeR. I ran the following:

library(edgeR)

counts <- read.csv("outputs/counts/all_counts.csv", row.names = 1)
y <- DGEList(counts = counts)
keep <- rowSums(cpm(y)>1) >= 2
y <- y[keep, , keep.lib.sizes=FALSE]
head(y$counts)
y$samples
dim(y$counts)
y <- calcNormFactors(y)

norm_counts <- cpm(counts, normalized.lib.sizes = FALSE)
head(norm_counts)
write.csv(norm_counts, "sandbox/clust/edgeR_cpm.csv", quote = F)

I then ran clust like this:

clust -o all_out_edger -n 101 3 4 -r edger-reps.txt edgeR_cpm.csv

I get similar results when I do not use cpm data, i.e.:

clust -o all_out_ -r reps.txt all_counts.csv

I am using clust Version v1.8.12

gene expression values in vertical axis

Would be good if the expression values can be made visible in the vertical axis (at least 0 on the axis and the minimum and maximum values)?

Thanks

Clust 1.10.7 incompatibility issue with Pandas 0.25.0

Hi,
I have just installed clust using anaconda and I am trying to run it but I keep getting the same error, even when I try to run your examples. I also tried to run with your example data on the clust's Beta website, which also gives an unespecified error. Below is the error message on my terminal:

/===========================================================================\
|                                   Clust                                   |
|    (Optimised consensus clustering of multiple heterogenous datasets)     |
|           Python package version 1.10.7 (2019) Basel Abu-Jamous           |
+---------------------------------------------------------------------------+
| Analysis started at: Thursday 08 August 2019 (23:44:55)                   |
| 1. Reading dataset(s)                                                     |
Traceback (most recent call last):
  File "/home/adriayumi/miniconda3/bin/clust", line 12, in <module>
    sys.exit(main())
  File "/home/adriayumi/miniconda3/lib/python3.7/site-packages/clust/__main__.py", line 103, in main
    args.cs, args.np, args.optimisation, args.q3s, args.basemethods, args.deterministic)
  File "/home/adriayumi/miniconda3/lib/python3.7/site-packages/clust/clustpipeline.py", line 86, in clustpipeline
    returnSkipped=True)
  File "/home/adriayumi/miniconda3/lib/python3.7/site-packages/clust/scripts/io.py", line 46, in readDatasetsFromDirectory
    datafilesread = readDataFromFiles(datafileswithpath, delimiter, float, skiprows, skipcolumns, returnSkipped)
  File "/home/adriayumi/miniconda3/lib/python3.7/site-packages/clust/scripts/io.py", line 205, in readDataFromFiles
    usecols=range(skipcolumns, ncols), na_filter=data_na_filter, comments=comm)
  File "/home/adriayumi/miniconda3/lib/python3.7/site-packages/clust/scripts/io.py", line 240, in pdreadcsv_regexdelim
    delimiter='\t', dtype=dtype, header=-1, skiprows=skiprows, usecols=usecols, na_filter=na_filter, comment=comments).values
  File "/home/adriayumi/miniconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 685, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/adriayumi/miniconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 457, in _read
    parser = TextFileReader(fp_or_buf, **kwds)
  File "/home/adriayumi/miniconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 895, in __init__
    self._make_engine(self.engine)
  File "/home/adriayumi/miniconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1135, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/home/adriayumi/miniconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1906, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 534, in pandas._libs.parsers.TextReader.__cinit__
OverflowError: can't convert negative value to npy_uint64`

I noticed that it could be a problem with pandas version 0.25.0. So I run clust under an environment with an older pandas version (0.24.0), and it ran normally. I suggest to add this to the requirements.

Help issue

Hi Basel

I keep getting this error any ideas how to fix it.

chris@chris-ubuntu:~$ clust /media/chris/E27C21847C215497/Clust/clust-1.8.10/X1.txt

Hope that makes sense
Cheers

Warnings during data pre-processing

Hi Basel,

I'm trying to use Clust in a count matrix with 28582 rows and 84 columns (excluding row names and column names), and I'm getting some warnings during the pre-processing step. The results seem normal.

This is the first time I'm getting these warnings. They didn't show up in any of my previous analysis.

I'm using NumPy 1.15.4.

count_matrix.tsv.zip

(clust) apcamargo@elementaryos:~/Documents/Clust$ python clust.py ../Data/count_matrix.tsv 

/===========================================================================\
|                                   Clust                                   |
|    (Optimised consensus clustering of multiple heterogenous datasets)     |
|           Python package version 1.8.9 (2018) Basel Abu-Jamous            |
+---------------------------------------------------------------------------+
| Analysis started at: Sunday 25 November 2018 (16:33:08)                   |
| 1. Reading dataset(s)                                                     |
| 2. Data pre-processing                                                    |
/home/apcamargo/Documents/Clust/clust/scripts/preprocess_data.py:458: RuntimeWarning: overflow encountered in power
  Xnew[l][ogi] = np.log2(np.sum(np.power(2.0, Xloc[l][np.in1d(OGsDatasets[l], og)]), axis=0))
|  - Automatic normalisation mode (default in v1.7.0+).                     |
|    Clust automatically normalises your dataset(s).                        |
|    To switch it off, use the `-n 0` option (not recommended).             |
|    Check https://github.com/BaselAbujamous/clust for details.             |
/home/apcamargo/anaconda3/envs/clust/lib/python2.7/site-packages/numpy/core/_methods.py:117: RuntimeWarning: invalid value encountered in subtract
  x = asanyarray(arr - arrmean)
/home/apcamargo/anaconda3/envs/clust/lib/python2.7/site-packages/numpy/core/function_base.py:133: RuntimeWarning: invalid value encountered in multiply
  y *= step
/home/apcamargo/Documents/Clust/clust/scripts/preprocess_data.py:85: RuntimeWarning: invalid value encountered in less
  return np.sum(X < v) * 1.0 / ds.numel(X)
/home/apcamargo/Documents/Clust/clust/scripts/numeric.py:102: RuntimeWarning: invalid value encountered in subtract
  return np.subtract(Xloc.transpose(), V).transpose()
|  - Flat expression profiles filtered out (default in v1.7.0+).            |
|    To switch it off, use the --no-fil-flat option (not recommended).      |
|    Check https://github.com/BaselAbujamous/clust for details.             |
| 3. Seed clusters production (the Bi-CoPaM method)                         |
| 10%                                                                       |
| 20%                                                                       |
| 30%                                                                       |
| 40%                                                                       |
| 50%                                                                       |
| 60%                                                                       |
| 70%                                                                       |
| 80%                                                                       |
| 90%                                                                       |
| 100%                                                                      |
| 4. Cluster evaluation and selection (the M-N scatter plots technique)     |
| 10%                                                                       |
| 20%                                                                       |
| 30%                                                                       |
| 40%                                                                       |
| 50%                                                                       |
| 60%                                                                       |
| 70%                                                                       |
| 80%                                                                       |
| 90%                                                                       |
| 100%                                                                      |
| 5. Cluster optimisation and completion                                    |
| 6. Saving results in                                                      |
| /home/apcamargo/Documents/Clust/Results_25_Nov_18        |
+---------------------------------------------------------------------------+
| Analysis finished at: Sunday 25 November 2018 (16:57:40)                  |
| Total time consumed: 0 hours, 24 minutes, and 32 seconds                  |
|                                                                           |
\===========================================================================/

/===========================================================================\
|                              RESULTS SUMMARY                              |
+---------------------------------------------------------------------------+
| Clust received 1 dataset with 28582 unique genes. After filtering, 28127  |
| genes made it to the clustering step. Clust generated 1 clusters of       |
| genes, which in total include 44 genes. The smallest cluster includes 44  |
| genes, the largest cluster includes 44 genes, and the average cluster     |
| size is 44.0 genes.                                                       |
+---------------------------------------------------------------------------+
|                                 Citation                                  |
|                                 ~~~~~~~~                                  |
| When publishing work that uses Clust, please include this citation:       |
| Basel Abu-Jamous and Steven Kelly (2018) Clust: automatic extraction of   |
| optimal co-expressed gene clusters from gene expression data. Genome      |
| Biology 19:172; doi: https://doi.org/10.1186/s13059-018-1536-8.           |
+---------------------------------------------------------------------------+
| For enquiries contact:                                                    |
| Basel Abu-Jamous                                                          |
| Department of Plant Sciences, University of Oxford                        |
| [email protected]                                           |
| [email protected]                                                  |
\===========================================================================/

ImportError: No module named numpy

Hi,

I cloned your git repository and cd to "clust/ExampleData/1_RawData", then run "clust Data/ -r Replicates.txt -n Normalisation.txt -m MapIDs.txt". It gives the following errors:

Traceback (most recent call last):
File "/rhome/jzhan067/.local/bin/clust", line 7, in
from clust.main import main
File "/rhome/jzhan067/.local/lib/python2.7/site-packages/clust/init.py", line 1, in
from .clustpipeline import runclust
File "/rhome/jzhan067/.local/lib/python2.7/site-packages/clust/clustpipeline.py", line 1, in
import matplotlib
File "/opt/linux/centos/7.x/x86_64/pkgs/miniconda2/4.4.10/lib/python2.7/site-packages/matplotlib/init.py", line 122, in
from matplotlib.cbook import is_string_like, mplDeprecation, dedent, get_label
File "/opt/linux/centos/7.x/x86_64/pkgs/miniconda2/4.4.10/lib/python2.7/site-packages/matplotlib/cbook.py", line 33, in
import numpy as np
ImportError: No module named numpy

How to resolve it?

Regards.

ImportError: No module named sklearn.metrics.pairwise

Dear author，
I had download the tool but when I run "python clust.py",and it tells me "ImportError: No module named sklearn.metrics.pairwise".Actually,I have intstall sklearn,metrics,pairwise by "pip innstall".And I didn't find a pythonpackage named sklearn.metrics.pairwise.
So,how can I to install clust?
Thanks!

plot

multiple datasets run into issues

Hi Basel,
I try to run clusters for the same time points but with either females dataset or male dataset. For each dataset, it ran successfully alone, but it kept throwing me errors when I use one replicate file including both files.

My code: clust ~/input/TPM/ -n 101 3 4 -r ~/input/TPM/Ag_all_X0.txt -o ~/output/tpm/Ag_all/

The error messages:

/===========================================================================
| Clust |
| (Optimised consensus clustering of multiple heterogenous datasets) |
| Python package version 1.8.10 (2018) Basel Abu-Jamous |
+---------------------------------------------------------------------------+
| Analysis started at: Friday 04 January 2019 (21:11:02) |
| 1. Reading dataset(s) |
Traceback (most recent call last):
File "/anaconda/anaconda/envs/python2/bin/clust", line 11, in
sys.exit(main())
File "/anaconda/anaconda/envs/python2/lib/python2.7/site-packages/clust/main.py", line 98, in main
args.cs, args.np, args.optimisation, args.q3s, args.basemethods, args.deterministic)
File "/anaconda/anaconda/envs/python2/lib/python2.7/site-packages/clust/clustpipeline.py", line 84, in clustpipeline
returnSkipped=True)
File "/anaconda/anaconda/envs/python2/lib/python2.7/site-packages/clust/scripts/io.py", line 46, in readDatasetsFromDirectory
datafilesread = readDataFromFiles(datafileswithpath, delimiter, float, skiprows, skipcolumns, returnSkipped)
File "/anaconda/anaconda/envs/python2/lib/python2.7/site-packages/clust/scripts/io.py", line 193, in readDataFromFiles
usecols=range(skipcolumns, ncols), na_filter=True, comments=comm)
File "/anaconda/anaconda/envs/python2/lib/python2.7/site-packages/clust/scripts/io.py", line 228, in pdreadcsv_regexdelim
delimiter='\t', dtype=dtype, header=-1, skiprows=skiprows, usecols=usecols, na_filter=na_filter, comment=comments).values
File "/anaconda/anaconda/envs/python2/lib/python2.7/site-packages/pandas/io/parsers.py", line 678, in parser_f
return _read(filepath_or_buffer, kwds)
File "/anaconda/anaconda/envs/python2/lib/python2.7/site-packages/pandas/io/parsers.py", line 440, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/anaconda/anaconda/envs/python2/lib/python2.7/site-packages/pandas/io/parsers.py", line 787, in init
self._make_engine(self.engine)
File "/anaconda/anaconda/envs/python2/lib/python2.7/site-packages/pandas/io/parsers.py", line 1014, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/anaconda/anaconda/envs/python2/lib/python2.7/site-packages/pandas/io/parsers.py", line 1708, in init
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 542, in pandas._libs.parsers.TextReader.cinit
pandas.errors.EmptyDataError: No columns to parse from file

Thanks for pointing out solutions.

Best,
Wen-Juan

ValueError: invalid literal for float()

Hello,

I am trying to run the program using
$ clust path_to_directory

its giving these errors:

Traceback (most recent call last):
File "/usr/local/bin/clust", line 11, in
sys.exit(main())
File "/usr/local/lib/python2.7/dist-packages/clust/main.py", line 88, in main
args.np, args.optimisation, args.q3s, args.deterministic)
File "/usr/local/lib/python2.7/dist-packages/clust/clustpipeline.py", line 73, in clustpipeline
returnSkipped=True)
File "/usr/local/lib/python2.7/dist-packages/clust/scripts/io.py", line 28, in readDatasetsFromDirectory
datafilesread = readDataFromFiles(datafileswithpath, delimiter, float, skiprows, skipcolumns, returnSkipped)
File "/usr/local/lib/python2.7/dist-packages/clust/scripts/io.py", line 154, in readDataFromFiles
usecols=range(skipcolumns, ncols), ndmin=2, comments=comm)
File "/usr/lib/python2.7/dist-packages/numpy/lib/npyio.py", line 848, in loadtxt
items = [conv(val) for (conv, val) in zip(converters, vals)]
ValueError: invalid literal for float(): 1,7.591893546,7.255723111,6.52040516,6.614578794,7.818629699,7.503905326,6.817377628,6.302704563,6.106799149,7.484905097,7.426335129,6.274657542,6.087165531,6.139690166,7.630860516,7.692840101,7.22647

Kindly help!

Order of genes in the output vs the input

The output pre-processed data and the output partition matrix (B) from the runclust method are aligned with each other and ordered alphabetically but are NOT aligned with the input dataset.

Run command

In the example, there is a code of "clust Data/", shall it be executed in Python 2.7?

ImportError: No module named 'clustpipeline'

Dear Basel,

Thanks a lot for providing Clust, I am really excited to use your tool, which looks amazing. I am having some problems to run it. I have all the dependencies satisfied and the installation was smooth. However, when try to run clust, I get the following error:

raceback (most recent call last):
File "/home/gu/miniconda3/envs/work/bin/clust", line 7, in
from clust.main import main
File "/home/gu/miniconda3/envs/work/lib/python3.5/site-packages/clust/main.py", line 4, in
import clustpipeline
ImportError: No module named 'clustpipeline'

Do you have any idea on how to fix this?
I appreciate any help!
Thanks
Gustavo

Subclustering: inconsistent co-abundance pattern

Hello,

I am testing a sub clustering approach as described below:

Execute a Clust on data set X(1 tabular file).
Get genes from an interesting cluster Y.
Create a new dataset Z from dataset X that contains only genes from cluster Y.
Execute Clust on dataset Z with increased tightness (-t ~50) to try to get genes with a more refined behavior from cluster Y.

The idea is trying to get genes that, for example, are increasing at every step of a time course.

Curiously, I just got into the situation in which the newly generated sub-cluster contains a series of genes which behavior is basically contrary to the one reflected by the original cluster. I am attaching images of the 'original' cluster and the 'sub-cluster'.

Original cluster (163 genes). The colored lines represent the genes that appear in the new sub-cluster below.

Sub-cluster (18 genes).

The behavior is kind of antagonistic depending on which cluster you look, but the genes and the original data set are the same.

Am I missing something here in terms of interpretation? Do you have any idea what could be happening here?

Maybe this is a more theoretical question regarding statistics and the way the data is being normalized and how the clusters are being created. I would be grateful for any input regarding these doubts.

This is an exploratory study for selecting candidates for a targeted approach later on, so I wouldn't rely on candidates that show a different behavior according to the data analysis approach.

Many thanks in advance for any support!

Best,
Miguel

Error during data pre-processing: bool' object is not iterable

Hi,

I'm using clust for the first time. Using tpm values to build my clusters. I'm using Python 2.7.14 and I'm running it in HPC. I get this error:

| Analysis started at: Tuesday 15 January 2019 (09:52:58) |
| 1. Reading dataset(s) |
| 2. Data pre-processing |
Traceback (most recent call last):
File "/nbi/Research-Groups/JIC/Diane-Saunders/Anaconda/Installation/bin/clust", line 11, in
sys.exit(main())
File "/Anaconda/Installation/lib/python2.7/site-packages/clust/main.py", line 98, in main
args.cs, args.np, args.optimisation, args.q3s, args.basemethods, args.deterministic)
File "/Anaconda/Installation/lib/python2.7/site-packages/clust/clustpipeline.py", line 102, in clustpipeline
filteringtype=filteringtype, filterflat=filflat, params=None, datafiles=datafiles)
File "/Anaconda/Installation/lib/python2.7/site-packages/clust/scripts/preprocess_data.py", line 630, in preprocess
Xproc[l] = fixnans(Xproc[l])
File "/Anaconda/Installation/lib/python2.7/site-packages/clust/scripts/preprocess_data.py", line 70, in fixnans
sumnans = sum(isnan(Xinloc[i]))
TypeError: 'bool' object is not iterable

Any ideas why this might be?

Thanks

pre-filtering for clust

Hi Basel

Sorry another really quick question for clust.

So for pre-processing before inputting to clust. For WGCNA (what I was using previously) I was using a quite harsh filtering of cpm > 1 in 90% of my samples. I did this due to queries online, in papers etc.

I was wondering, for clust, does a pre filtering so harsh need to be done? What I am currently leaning towards is a cpm > 1 in each condition due to clust providing incredibly tighter clusters with fewer number of genes (honestly the difference with the cpm >1 in 90% samples between WGCNA and clust is quite staggering).

Again, quite a simple question which I understand is more on myself to work out but I was wondering what your thoughts would be due to you obviously having a more in depth knowledge of the filtering clust undertakes, as well as clust having a more advanced algorithm which may be able to tolerate problems with low variance.

Also, I apologise if this is the wrong site for this rather simple questions. I am more than happy to ask them at another source if that is what you prefer. I know github is more for the programming and bug aspect.

I look forward to your response.

Ben

plot issue

Hi
My clusters_profiles.pdf , all clusters are black.
I need some help.
thanks a lot

TypeError: Cannot cast array from dtype('O') to dtype('float64') according to the rule 'safe'

First of all thanks for this great and useful tool!
Unfortunately, I encountered the following error:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/maxnestlcl/anaconda3/bin/clust", line 12, in
sys.exit(main())
File "/home/maxnestlcl/anaconda3/lib/python3.7/site-packages/clust/main.py", line 103, in main
args.cs, args.np, args.optimisation, args.q3s, args.basemethods, args.deterministic)
File "/home/maxnestlcl/anaconda3/lib/python3.7/site-packages/clust/clustpipeline.py", line 86, in clustpipeline
returnSkipped=True)
File "/home/maxnestlcl/anaconda3/lib/python3.7/site-packages/clust/scripts/io.py", line 46, in readDatasetsFromDirectory
datafilesread = readDataFromFiles(datafileswithpath, delimiter, float, skiprows, skipcolumns, returnSkipped)
File "/home/maxnestlcl/anaconda3/lib/python3.7/site-packages/clust/scripts/io.py", line 205, in readDataFromFiles
usecols=range(skipcolumns, ncols), na_filter=data_na_filter, comments=comm)
File "/home/maxnestlcl/anaconda3/lib/python3.7/site-packages/clust/scripts/io.py", line 240, in pdreadcsv_regexdelim
delimiter='\t', dtype=dtype, header=None, skiprows=skiprows, usecols=usecols, na_filter=na_filter, comment=comments).values
File "/home/maxnestlcl/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 702, in parser_f
return _read(filepath_or_buffer, kwds)
File "/home/maxnestlcl/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 435, in _read
data = parser.read(nrows)
File "/home/maxnestlcl/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1139, in read
ret = self._engine.read(nrows)
File "/home/maxnestlcl/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1995, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 899, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 914, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 991, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 1123, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas/_libs/parsers.pyx", line 1197, in pandas._libs.parsers.TextReader._convert_tokens
ValueError: could not convert string to float: 'TPA_4h'

Thanks!

Memory Error in Bi-CoPaM clustering method

Hej Basel,

thanks for publishing clust! I gave it a try and for the examples and for a gene set of size 30000 it worked just fine ot of the box. Great!

However, when I tried with a gene set with approx. 70000 genes (with 11 replicates in total and for 5 time points), clust threw an error in step 3, the Bi-CoPaM method. I attached the error log below. Running the command with more than 1 cpu creates a longer, but similar error.

Do you have an idea how to get rid of it?

Thanks and best regards!
Philipp

`clust Data/ -n Normalisation.txt -r Replicates.txt -o results/

/===========================================================================
| Clust |
| (Optimised consensus clustering of multiple heterogenous datasets) |
| Python package version 1.1.4 (2017) Basel Abu-Jamous |
+---------------------------------------------------------------------------+
| Analysis started at: Thursday 11 May 2017 (14:40:58) |
| 1. Reading datasets |
| 2. Data pre-processing |
| 3. Seed clusters production (the Bi-CoPaM method) |
Traceback (most recent call last):
File "/software/Clust/clust.py", line 6, in
main(args)
File "/software/Clust/clust/main.py", line 100, in main
args.q3s)
File "/software/Clust/clust/clustpipeline.py", line 97, in clustpipeline
ncores=ncores)
File "/software/Clust/clust/scripts/uncles.py", line 380, in uncles
(Xloc[l], Ks[ki], Ds[ki], methodsDetailedloc[l], GDMloc[:, l], Ng) for ki in range(NKs))
File "/software/python/Python2.7/lib/python2.7/site-packages/joblib/parallel.py", line 779, in call
while self.dispatch_one_batch(iterator):
File "/software/python/Python2.7/lib/python2.7/site-packages/joblib/parallel.py", line 625, in dispatch_one_batch
self._dispatch(tasks)
File "/software/python/Python2.7/lib/python2.7/site-packages/joblib/parallel.py", line 588, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/software/python/Python2.7/lib/python2.7/site-packages/joblib/_parallel_backends.py", line 111, in apply_async
result = ImmediateResult(func)
File "/software/python/Python2.7/lib/python2.7/site-packages/joblib/_parallel_backends.py", line 332, in init
self.results = batch()
File "/software/python/Python2.7/lib/python2.7/site-packages/joblib/parallel.py", line 131, in call
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/software/Clust/clust/scripts/uncles.py", line 313, in clustDataset
tmpU = cl.clusterdataset(X, K, D, methods) # Obtain U's
File "/software/Clust/clust/scripts/clustering.py", line 24, in clusterdataset
U[ms] = chc(X, K, methodsloc[ms][1:])
File "/software/Clust/clust/scripts/clustering.py", line 73, in chc
Z = sphc.linkage(X, method=linkage_method, metric=distance)
File "/software/python/Python2.7/lib/python2.7/site-packages/scipy/cluster/hierarchy.py", line 669, in linkage
int(_cpy_euclid_methods[method]))
File "scipy/cluster/_hierarchy.pyx", line 740, in scipy.cluster._hierarchy.linkage (scipy/cluster/_hierarchy.c:9172)
File "scipy/cluster/stringsource", line 1281, in View.MemoryView.memoryview_copy_contents (scipy/cluster/_hierarchy.c:23661)
File "scipy/cluster/stringsource", line 1237, in View.MemoryView._err_extents (scipy/cluster/_hierarchy.c:23211)
ValueError: got differing extents in dimension 0 (got 336196312 and 2483679960)
`

Unrecognised replicate name (sMEPI_rep1) in line 1 in replicates

It exits after showing an error with "Unrecognised replicate name (sMEPI_rep1) in line 1 in replicates" when given the replicate file. Here are my commands and the replicate file.

clust sub_data.txt -r replicates -o Clust_out -np 18
###########################################################
sub_data.txt sMEPI sMEPI_rep1,sMEPI_rep2,sMEPI_rep3
sub_data.txt sMCOR sMCOR_rep1,sMCOR_rep2,sMCOR_rep3
sub_data.txt sMEND sMEND_rep1,sMEND_rep2,sMEND_rep3
sub_data.txt sMSTE sMSTE_rep1,sMSTE_rep2,sMSTE_rep3
sub_data.txt s0.5EPI s0.5EPI_rep1,s0.5EPI_rep2,s0.5EPI_rep3
sub_data.txt s0.5COR s0.5COR_rep1,s0.5COR_rep2,s0.5COR_rep3
sub_data.txt s0.5END s0.5END_rep1,s0.5END_rep2,s0.5END_rep3
sub_data.txt s0.5STE s0.5STE_rep1,s0.5STE_rep2,s0.5STE_rep3
sub_data.txt s6EPI s6EPI_rep1,s6EPI_rep2,s6EPI_rep3
sub_data.txt s6COR s6COR_rep1,s6COR_rep2,s6COR_rep3
sub_data.txt s6END s6END_rep1,s6END_rep2,s6END_rep3
sub_data.txt s6STE s6STE_rep1,s6STE_rep2,s6STE_rep3
sub_data.txt s24EPI s24EPI_rep1,s24EPI_rep2,s24EPI_rep3
sub_data.txt s24COR s24COR_rep1,s24COR_rep2,s24COR_rep3
sub_data.txt s24END s24END_rep1,s24END_rep2,s24END_rep3
sub_data.txt s24STE s24STE_rep1,s24STE_rep2,s24STE_rep3

How to deal with missing data values

First of all thanks for this great tool!
My data files have missing values for quite a number of genes (set to N/A). When running these data files I get an error message but clust is still finishing the run. Removing all rows containing N/A values works but I don't want to loose all the data.

Error message:
c:[ ].py:19: RuntimeWarning: invalid value encountered in greater I = np.bitwise_and(~isnan(X), X>0)
c:[ ].py:465: RuntimeWarning: invalid value encountered in power Xnew[l][ogi] = np.log2(np.sum(np.power(2.0, Xloc[l][np.in1d(OGsDatasets[l], og)]), axis=0))

Data:
GeneID | Treatment 1 | Treatment 2 | Treatment 3 | Treatment 4 | Treatment 5 | Treatment 6
1 | 4.273093893 | 0 | 1.946402008 | 1.374515554 | 2.655817399 | 5.267132206
2 | 5.956198005 | N/A | N/A | N/A | N/A | 5.266617765
3 | N/A | 0 | N/A | 0 | N/A | 5.264203631
4 | 0 | 0 | N/A | 0 | N/A | 5.261192058
6 | 3.96170082 | 1.7741793 | 0 | 1.612520247 | 3.915867084 | 5.259103225
7 | 5.118588008 | 0 | 3.888582101 | 0 | 0 | 5.257160244
8 | 4.393112039 | 0 | N/A | N/A | N/A | 5.252373101
…

How to deal with this issue (N/A and leaving them blank gives the same error message)?
How is clust dealing with this data?
Does clust automatically remove these rows?

Thanks for your help!

"Error: could not save clusters plots in a PDF file." with more than 11 clusters

Hi Basel,

I just installed your fantastic package (ver 1.10.8) and it worked pretty great with a setting in which I receive 10 or 11 clusters. However when I increase the tightness on the same datasets I receive 19 clusters and I get the "Error: could not save clusters plots in a PDF file." message. The environment, computing cluster and even the command is the same (apart from -t 2). The matplotlib and all the other modules are freshly installed yesterday so they are in the most upgraded versions.
I saw some other people used to have the same issue but it is already under the "closed Issues".

Best,
Michal

more_itertools versioning???

Has anyone seen this before? Looks like versioning issues any thoughts or quick fixes anyone has found?

Thanks

# clust
Traceback (most recent call last):
  File "/usr/local/falcon/LOCAL/bin/clust", line 7, in <module>
    from clust.__main__ import main
  File "/usr/local/falcon/LOCAL/lib/python2.7/site-packages/clust/__init__.py", line 1, in <module>
    from .clustpipeline import runclust
  File "/usr/local/falcon/LOCAL/lib/python2.7/site-packages/clust/clustpipeline.py", line 3, in <module>
    import clust.scripts.io as io
  File "/usr/local/falcon/LOCAL/lib/python2.7/site-packages/clust/scripts/io.py", line 5, in <module>
    import clust.scripts.numeric as nu
  File "/usr/local/falcon/LOCAL/lib/python2.7/site-packages/clust/scripts/numeric.py", line 4, in <module>
    import sklearn.metrics.pairwise as skdists
  File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/__init__.py", line 7, in <module>
    from .ranking import auc
  File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/ranking.py", line 36, in <module>
    from ..preprocessing import label_binarize
  File "/usr/local/lib/python2.7/dist-packages/sklearn/preprocessing/__init__.py", line 6, in <module>
    from ._function_transformer import FunctionTransformer
  File "/usr/local/lib/python2.7/dist-packages/sklearn/preprocessing/_function_transformer.py", line 5, in <module>
    from ..utils.testing import assert_allclose_dense_sparse
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/testing.py", line 751, in <module>
    import pytest
  File "/usr/local/lib/python2.7/dist-packages/pytest.py", line 34, in <module>
    from _pytest.python_api import approx
  File "/usr/local/lib/python2.7/dist-packages/_pytest/python_api.py", line 11, in <module>
    from more_itertools.more import always_iterable
  File "/usr/local/lib/python2.7/dist-packages/more_itertools/__init__.py", line 1, in <module>
    from more_itertools.more import *  # noqa
  File "/usr/local/lib/python2.7/dist-packages/more_itertools/more.py", line 329
    def _collate(*iterables, key=lambda a: a, reverse=False):
                               ^
SyntaxError: invalid syntax

Smaller and more clusters

Hi Basel,

I have the following questions:

I want to get smaller and more clusters on my single dataset. My command is "clust cor.os.txt -t 100 -o clust_result/ -n 0 --no-fil-flat -cs 10
-K 40 42 44 46 50 52 54 56 58 60". I have pre-processed my data, so no pre-processing at all is needed when run clust. Is there an up limitation on "-t"? Can I set it as large as I want?
By using the above command, only half of the total genes were included in the final clusters. How to increase the number of genes in the final clusters?

Regards,
Jianhai

cluster plots in PDF issue

Hi Basel,

For recent Clust plots I have done, the process was running successfully and all related results files are produced but the clusters plots in PDF is missing. I am wondering which particular required program/software for Clust I am missing? Thanks.

Here is a screenshot for the error message:

/===========================================================================
| RESULTS SUMMARY |
+---------------------------------------------------------------------------+
| Clust received 2 datasets with 44635 unique genes. After filtering, |
| 43671 genes made it to the clustering step. Clust generated 12 clusters |
| of genes, which in total include 9213 genes. The smallest cluster |
| includes 42 genes, the largest cluster includes 2163 genes, and the |
| average cluster size is 768 genes. |
+---------------------------------------------------------------------------+

Cheers,
Wen-Juan

Datasets with a single sample raise an exception

If a dataset has a single sample only, it would not make sense to apply clustering to it. However, if clust is anyway run over such a dataset, this would be the resulting error:

Multithread Usage Bug

Hello!
Thanks for publishing this tool, I look forward to comparing it with the other clustering methods I have tried. I ran into two difficulties while using it.

First of all, when using more than 1 processor (ex. -np 8), I run into a piping error.

The pipeline hangs up at 80% completion of Seed clusters production for 5+ minutes (I killed it afterward).

However, running clust on the exact same data with -np 1 fixes this issue. I am testing it on an Ubuntu 18.04 server, let me know if you would like me to do any further diagnostics.

Additionally, I don't know if it is worth reporting as an actual bug, but when all of the data is contained within one file...

clust_input1	HH6 HH6_pGFP_2, HH6_pGFP_3
clust_input1	HH8	HH8_pGFP_3,HH8_pGFP_1
clust_input1	HH10	HH10_pGFP_A1, HH10_pGFP_3, HH10_pGFP_2
clust_input1	HH12	HH12_pGFP_A1, HH12_pGFP_2, HH12_pGFP_1
clust_input1	HH14	HH14_pGFP_A1, HH14_pGFP_3, HH14_pGFP_2
clust_input1	HH16	HH16_pGFP_1, HH16_pGFP_A12, HH16_pGFP_A1

specifying -d as anything but 1 results in an error. Would it be possible to make this detect at the replicate level instead of dataset? or is this parameter more meant for multi-dataset analysis.

Thanks,
Austin

Log2 fold change nromalization -- help

Hi Basel,

Congratulations for developing this great tool, very helpful. I just wanted to confirm if we need to give any normalization codes while using log2 fold change expression values as input. I believe no normalization is required. In the manual, normalization is recommended for log2 RNA-seq TPM and FPKM but not for log2 fold change expression values. The log2fold change values were calculated using T0 as controls.

Specifically, the data is in the format:
gene_id T0 T1 T2 T3 T4
MSTRG.21649.1 0 -17.99461767 -17.99461767 -17.99461767 -17.99461767
MSTRG.18239.1 0 -20.38068299 -20.38068299 -20.38068299 20.38068299
MSTRG.6149.1 0 -18.56707533 -18.56707533 19.56707533 16.56707533
MSTRG.6144.1 0 -17.17941598 -17.17941598 -17.17941598 -17.17941598
MSTRG.21338.1 0 -16.79450764 -12.51669354 -16.79450764 -12.45173893
MSTRG.19827.1 0 -16.30894521 -16.30894521 -16.30894521 -16.30894521
MSTRG.13234.1 0 -16.3043002 -8.283472721 -16.3043002 -7.745002345

Please advise. Many thanks.

Problems when clustering log-transformed two-colour micro-array data

Hi Basel,

First of all, thanks for building this tool. I have already used this to process several RNASeq, which went effortless.

However, right now I would like to re-process public micro-array data available on GEO (such as https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE17237). I used the SOFT files formatted files to map the probes to the current gene annotation and reformat it in an expression matrix. ( available here:
GSE17237.txt)

For this example, the expression values represent accordign to the SOFT file:
Data were analyzed using the limma package and the R statistical data analysis program (R 2.7.1). Due to some spread in M-values data was scale normalized between arrays at each timepoint. Values in matrix table are given as log2 ratios (test/reference)

When I run clust using the normalisation option -n 6, or -n 0, I get the following error:
/==================================================================
| Clust |
| (Optimised consensus clustering of multiple heterogenous datasets) |
| Python package version 1.8.12 (2018) Basel Abu-Jamous |
+---------------------------------------------------------------------------+
| Analysis started at: Thursday 07 February 2019 (17:38:45) |
| 1. Reading dataset(s) |
| 2. Data pre-processing |
Traceback (most recent call last):
File "/software/shared/apps/x86_64/clust/1.8.12/bin/clust", line 10, in
sys.exit(main())
File "/shared/clssoft/apps/x86_64/clust/1.8.12/lib/python2.7/site-packages/clust/main.py", line 98, in main
args.cs, args.np, args.optimisation, args.q3s, args.basemethods, args.deterministic)
File "/shared/clssoft/apps/x86_64/clust/1.8.12/lib/python2.7/site-packages/clust/clustpipeline.py", line 97, in clustpipeline
OGsIncludedIfAtLeastInDatasets=OGsIncludedIfAtLeastInDatasets)
File "/shared/clssoft/apps/x86_64/clust/1.8.12/lib/python2.7/site-packages/clust/scripts/preprocess_data.py", line 465, in calculateGDMandUpdateDatasets
Xnew[l][ogi] = np.log2(np.sum(np.power(2.0, Xloc[l][np.in1d(OGsDatasets[l], og)]), axis=0))
AttributeError: 'float' object has no attribute 'log2'

Could it be, there is an problem when the input data is already log-transformed?

Thanks in advance,
Emmelien

Error using replicates files

I am trying to run clust with the replicate file replicates_clust.txt, and am receiving the error below. As far as I can see the format agrees with the example file and running without the replicate file is progressing to the next steps (still running). Is there something I missed?
/===========================================================================
| Clust |
| (Optimised consensus clustering of multiple heterogenous datasets) |
| Python package version 1.12.0 (2019) Basel Abu-Jamous |
+---------------------------------------------------------------------------+
| Analysis started at: Sunday 08 November 2020 (18:10:59) |
| 1. Reading dataset(s) |
Traceback (most recent call last):
File "/scratch/al862/envs/myroot/bin/clust", line 10, in
sys.exit(main())
File "/home/al862/.local/lib/python3.6/site-packages/clust/main.py", line 103, in main
args.cs, args.np, args.optimisation, args.q3s, args.basemethods, args.deterministic)
File "/home/al862/.local/lib/python3.6/site-packages/clust/clustpipeline.py", line 92, in clustpipeline
(replicatesIDs, conditions) = io.readReplicates(replicatesfile, datapath, datafiles, replicates)
File "/home/al862/.local/lib/python3.6/site-packages/clust/scripts/io.py", line 125, in readReplicates
conditions[c] = line[1:]
TypeError: 'filter' object is not subscriptable

does not work with missing genes

So first this is really cool software. However I have noticed a problem when running with multiple species.
Basically if in the mapping folder there is any gene which does not have a corresponding gene in the other species clust just stops. I even tried just randomly deleting genes in the provided data and it still does not work. Is there a bug somewhere making this happen. When I change my mapping file and delete any genes which are not found in all species everything runs fine.

Thanks
Chris

MemoryError

Hi Basel,

Thank you for making Clust. I am trying to use it for analyzing five species and a memory error came up. There seems to be a problem running it with the default number of datasets from which a gene needs to be included. The error log is attached. I would appreciate your help in solving this. Please note that everything went fine when I ran clust by specifying the number of datasets (clust Data/ -n 0 -m MapIDs.tsv -d 5).
Error.txt

Best regards,
Ray

ValueError: Usecols do not match columns

Hi Basel,

I keep getting this error when running the script using python clust.py (I was not able to execute the program using the first two methods described). Any idea if this is because of a file formatting issue?

Thanks,

Jon

Warning during cluster optimization and completion when using multiple datasets

Hi Basel,

When I'm using Clust with multiple datasets I get a warning message during the "Cluster optimization and completion" step. The output seems to be alright, except that I get no PDF with the cluster plots.

(clust) apcamargo@elementaryos:~/clust$ python clust.py ExampleData/1_RawData/Data/

/===========================================================================\
|                                   Clust                                   |
|    (Optimised consensus clustering of multiple heterogenous datasets)     |
|           Python package version 1.8.10 (2018) Basel Abu-Jamous           |
+---------------------------------------------------------------------------+
| Analysis started at: Monday 26 November 2018 (17:27:17)                   |
| 1. Reading dataset(s)                                                     |
| 2. Data pre-processing                                                    |
|  - Automatic normalisation mode (default in v1.7.0+).                     |
|    Clust automatically normalises your dataset(s).                        |
|    To switch it off, use the `-n 0` option (not recommended).             |
|    Check https://github.com/BaselAbujamous/clust for details.             |
|  - Flat expression profiles filtered out (default in v1.7.0+).            |
|    To switch it off, use the --no-fil-flat option (not recommended).      |
|    Check https://github.com/BaselAbujamous/clust for details.             |
| 3. Seed clusters production (the Bi-CoPaM method)                         |
| 10%                                                                       |
| 20%                                                                       |
| 30%                                                                       |
| 40%                                                                       |
| 50%                                                                       |
| 60%                                                                       |
| 70%                                                                       |
| 80%                                                                       |
| 90%                                                                       |
| 100%                                                                      |
| 4. Cluster evaluation and selection (the M-N scatter plots technique)     |
| 10%                                                                       |
| 20%                                                                       |
| 30%                                                                       |
| 40%                                                                       |
| 50%                                                                       |
| 60%                                                                       |
| 70%                                                                       |
| 80%                                                                       |
| 90%                                                                       |
| 100%                                                                      |
| 5. Cluster optimisation and completion                                    |
/home/apcamargo/anaconda3/envs/clust/lib/python2.7/site-packages/numpy/core/fromnumeric.py:2920: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
| 6. Saving results in                                                      |
| /home/apcamargo/clust/Results_26_Nov_18                                   |
| Error: could not save clusters plots in a PDF file.                       |
| Resuming producing the other results files ...                            |
+---------------------------------------------------------------------------+
| Analysis finished at: Monday 26 November 2018 (17:28:09)                  |
| Total time consumed: 0 hours, 0 minutes, and 52 seconds                   |
|                                                                           |
\===========================================================================/

/===========================================================================\
|                              RESULTS SUMMARY                              |
+---------------------------------------------------------------------------+
| Clust received 3 datasets with 9332 unique genes. After filtering, 9330   |
| genes made it to the clustering step. Clust generated 14 clusters of      |
| genes, which in total include 4686 genes. The smallest cluster includes   |
| 27 genes, the largest cluster includes 1231 genes, and the average        |
| cluster size is 334.714285714 genes.                                      |
+---------------------------------------------------------------------------+
|                                 Citation                                  |
|                                 ~~~~~~~~                                  |
| When publishing work that uses Clust, please include this citation:       |
| Basel Abu-Jamous and Steven Kelly (2018) Clust: automatic extraction of   |
| optimal co-expressed gene clusters from gene expression data. Genome      |
| Biology 19:172; doi: https://doi.org/10.1186/s13059-018-1536-8.           |
+---------------------------------------------------------------------------+
| For enquiries contact:                                                    |
| Basel Abu-Jamous                                                          |
| Department of Plant Sciences, University of Oxford                        |
| [email protected]                                           |
| [email protected]                                                  |
\===========================================================================/

Too many missing genes

Hi Basel,

I ran with 10686 genes and all my interesting genes are missing.
Is there anyway to run it without filtering? Also do you have any idea to add optimal K number ?
For example from mclust.
Thanks.
Won

clust Data/ -r Replicates.txt -n Normalisation.txt -cs 5

| Clust received 2 datasets with 10686 unique genes. After filtering, |
| 10686 genes made it to the clustering step. Clust generated 9 clusters |
| of genes, which in total include 829 genes. The smallest cluster |
| includes 16 genes, the largest cluster includes 270 genes, and the |
| average cluster size is 92 genes.

Details about how expression values in replicates are collapsed to one (per condition) value

Hi Basel,
Thanks for creating "clust" which I am currently using for one of my projects.
I am looking for the information about how expression values in replicates are collapsed to one (per condition) value in clust. Could you please point me to the details regarding this?
Thanking you,
Sam

NameError: name 'Ds' is not defined

Hi,

Im getting this error while running Clust through miniconda on my Macbook pro. See code attached. Extremely new to Python, so fully expect I did something wrong getting this job submitted and/or Clust installed. Thanks for any help!

Daniel

Clust_error.txt

PDF file is missing

Hi,

I used the last updated unix version of your program, it doesn't create a PDF file (while creating 10 clusters). I checked also the online version and got the same problem if I try to get tightness of more than 1. Did I miss something while running? The PDF file is important to me, so if there is another quick way that I can create it I will appreciate it,
Thanks a lot
Hiba

Proteomics

Do you have an idea if this method would work with proteomics data?

TypeError: can only concatenate list (not "NoneType") to list

Hello, I want to try using clust to cluster my data but I came across an error. I cannot find the reason why it is not working. Maybe you could help me.

I have four files (A, B, C and D) and each has 3 time points. The file B has two replicates.
The data are log2 FC transformed.
I followed the example and created tsv files and put them into ./Data dir.
You can find the input files and code below.

replicate.txt
A.txt
B.txt
C.txt
D.txt

clust Data/ -r replicate.txt -n 0 --no-fil-flat
 /===========================================================================\
> |                                   Clust                                   |
> |    (Optimised consensus clustering of multiple heterogenous datasets)     |
> |           Python package version 1.8.9 (2018) Basel Abu-Jamous            |
> +---------------------------------------------------------------------------+
> | Analysis started at: Tuesday 27 November 2018 (20:33:51)                  |
> | 1. Reading dataset(s)                                                     |
> | 2. Data pre-processing                                                    |
> Traceback (most recent call last):
>   File "/home/s1469622/miniconda3/bin/clust", line 11, in <module>
>     load_entry_point('clust==1.8.9', 'console_scripts', 'clust')()
>   File "/home/s1469622/miniconda3/lib/python2.7/site-packages/clust/__main__.py", line 98, in main
>     args.cs, args.np, args.optimisation, args.q3s, args.basemethods, args.deterministic)
>   File "/home/s1469622/miniconda3/lib/python2.7/site-packages/clust/clustpipeline.py", line 109, in clustpipeline
>     Xprocessed = op.processed_X(X_summarised_normalised, conditions, GDM, OGs, MapNew, MapSpecies)  # pandas DataFrames
>   File "/home/s1469622/miniconda3/lib/python2.7/site-packages/clust/scripts/output.py", line 380, in processed_X
>     resHeader[l] = np.array([['Genes'] + resHeader[l]])
> TypeError: can only concatenate list (not "NoneType") to list

Data Output

Hi there

First of all thank you for this program, it is providing me with really biological useful results to use in GO and pathway analysis.

One query I have is that clust provides a visualtion (in the form of a pdf) of the coexpressed genes. I want to tidy this up by downloading the data used to plot these graphs in the pdf. Is there a way to do this currently or is there a way to get this data out?

I apologise if this is a basic request or I have missed something already shared, but I appreciate the time foe any response.

Ben Young

TypeError: Cannot cast array from dtype('O') to dtype('float64') according to the rule 'safe'

Traceback (most recent call last):
File "pandas/_libs/parsers.pyx", line 1191, in pandas._libs.parsers.TextReader._convert_tokens
TypeError: Cannot cast array from dtype('O') to dtype('float64') according to the rule 'safe'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/newlustre/home/longyong/anaconda3/bin/clust", line 12, in
sys.exit(main())
File "/newlustre/home/longyong/.local/lib/python3.6/site-packages/clust/main.py", line 103, in main
args.cs, args.np, args.optimisation, args.q3s, args.basemethods, args.deterministic)
File "/newlustre/home/longyong/.local/lib/python3.6/site-packages/clust/clustpipeline.py", line 86, in clustpipeline
returnSkipped=True)
File "/newlustre/home/longyong/.local/lib/python3.6/site-packages/clust/scripts/io.py", line 46, in readDatasetsFromDirectory
datafilesread = readDataFromFiles(datafileswithpath, delimiter, float, skiprows, skipcolumns, returnSkipped)
File "/newlustre/home/longyong/.local/lib/python3.6/site-packages/clust/scripts/io.py", line 205, in readDataFromFiles
usecols=range(skipcolumns, ncols), na_filter=data_na_filter, comments=comm)
File "/newlustre/home/longyong/.local/lib/python3.6/site-packages/clust/scripts/io.py", line 240, in pdreadcsv_regexdelim
delimiter='\t', dtype=dtype, header=-1, skiprows=skiprows, usecols=usecols, na_filter=na_filter, comment=comments).values
File "/newlustre/home/longyong/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 702, in parser_f
return _read(filepath_or_buffer, kwds)
File "/newlustre/home/longyong/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 435, in _read
data = parser.read(nrows)
File "/newlustre/home/longyong/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1139, in read
ret = self._engine.read(nrows)
File "/newlustre/home/longyong/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1995, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 899, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 914, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 991, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 1123, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas/_libs/parsers.pyx", line 1197, in pandas._libs.parsers.TextReader._convert_tokens
ValueError: could not convert string to float: '/newlustre/home/longyong/longyong/zebrafish/lethal/clust'

Can't find file /newlustre/home/longyong/.vnc/mu01:19.pid
You'll have to kill the Xvnc process manually

Default Y-axis label in plot

I noticed that the default y-axis on the produced plot in Cluster_profiles is the same as the filename of the input data. If this is intended, I don't think it is a good default, and should be probably changed to 'z-score' or a similar metric. Would be glad to fix this, or to allow for more flexibility.

Example:

error Processed_Data file (normalized data)

As I was interested in obtaining the normalized data (of my input RNA-Seq count data), I used the .txt file that is present in the Processed_Data folder that is generated after running CLUST. However, (at least for some - I have not checked many) genes, the normalized values do not make sense (the 101 31 4 normalization was used). I give an example here with the input and processed values for gene AT1G52000 in my 18 samples.

My input data
AT1G52000 | 2894.15 | 2083.09 | 2252.74 | 2264.05 | 2326.51 | 2177.02 | 3417.66 | 3123.52 | 3042.18 | 6034.85 | 6116.73 | 6675.45 | 4809.43 | 3728.87 | 3930.98 | 8258.78 | 7762.17 | 7665.90

Processed.txt (normalized data from clust - Processed data file)
AT1G52000 | 0.68 | 0.54 | 0.39 | 0.21 | 0.39 | 0.21 | 1.34 | 1.51 | 0.96 | -1.93 | -0.27 | -1.15 | -1.15 | -1.00 | -1.86 | 0.08 | 0.96 | 0.08

Custom normalisation methods

Hi, I didn't find the normalisation method I want to use. Is there a way I can apply my own normalisation code?

ValueError: could not convert string to float

I'm getting the following error when trying to run clust:

clust /home/nsa/HS_intra/clust_data -d 6 -m Orthogroups.tsv -r Replicates.txt -n Normalisation.txt

/===========================================================================
| Clust |
| (Optimised consensus clustering of multiple heterogenous datasets) |
| Python package version 1.12.0 (2019) Basel Abu-Jamous |
+---------------------------------------------------------------------------+
| Analysis started at: Tuesday 27 October 2020 (14:02:55) |
| 1. Reading dataset(s) |
Traceback (most recent call last):
File "pandas/_libs/parsers.pyx", line 1141, in pandas._libs.parsers.TextReader._convert_tokens
TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/nsa/miniconda3/envs/clust/bin/clust", line 10, in
sys.exit(main())
File "/home/nsa/miniconda3/envs/clust/lib/python3.8/site-packages/clust/main.py", line 101, in main
clustpipeline.clustpipeline(args.datapath, args.m, args.r, args.n, args.o, args.K, args.t,
File "/home/nsa/miniconda3/envs/clust/lib/python3.8/site-packages/clust/clustpipeline.py", line 86, in clustpipeline
(X, replicates, Genes, datafiles) = io.readDatasetsFromDirectory(datapath, delimiter='\t| |, |; |,|;', skiprows=1, skipcolumns=1,
File "/home/nsa/miniconda3/envs/clust/lib/python3.8/site-packages/clust/scripts/io.py", line 46, in readDatasetsFromDirectory
datafilesread = readDataFromFiles(datafileswithpath, delimiter, float, skiprows, skipcolumns, returnSkipped)
File "/home/nsa/miniconda3/envs/clust/lib/python3.8/site-packages/clust/scripts/io.py", line 204, in readDataFromFiles
X[l] = pdreadcsv_regexdelim(datafiles[l], delimiter=delimiter, dtype=dtype, skiprows=skiprows,
File "/home/nsa/miniconda3/envs/clust/lib/python3.8/site-packages/clust/scripts/io.py", line 239, in pdreadcsv_regexdelim
result = pd.read_csv(StringIO('\n'.join(re.sub(delimiter, '\t', str(x)) for x in f)),
File "/home/nsa/miniconda3/envs/clust/lib/python3.8/site-packages/pandas/io/parsers.py", line 686, in read_csv
return _read(filepath_or_buffer, kwds)
File "/home/nsa/miniconda3/envs/clust/lib/python3.8/site-packages/pandas/io/parsers.py", line 458, in _read
data = parser.read(nrows)
File "/home/nsa/miniconda3/envs/clust/lib/python3.8/site-packages/pandas/io/parsers.py", line 1196, in read
ret = self._engine.read(nrows)
File "/home/nsa/miniconda3/envs/clust/lib/python3.8/site-packages/pandas/io/parsers.py", line 2155, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 862, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 941, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 1073, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas/_libs/parsers.pyx", line 1147, in pandas._libs.parsers.TextReader._convert_tokens
ValueError: could not convert string to float: 'TRINITY_DN401_c0_g1_i10_685_0.701307.p4'

Does that mean that I can only use number as gene IDs? Because in Figure 8 of program documentation it shows a file using other characters. Are there a specific standard for gene name besides that it "should not include spaces, commas, or semicolons" ?

Thank you!

Explain how automatic normalization works

Hi Basel,

I've been using Clust a lot lately and found that sometimes the automatic normalization applies different normalization methods for somewhat similar datasets.
eg.: Dataset A has more zeros than dataset B. Dataset A is normalized by 101 31 4 and dataset B is normalized by 101 3 4.

I understand the reasoning for these choices, but I think that it would be better if the automatic normalization was described in detail in the documentation.

Too many genes not included in any cluster

Hi,
I am using clust to analyze my two datasets from same species, but I found that only 6,000 genes were clustered and the rest of the genes were not included (40,000+ genes in total), I tried adjusting the -t option. When t=0.1, only 10,000+ genes were clustered. The following is part of my Summary.tsv file contents when t=1.

Starting data and time Saturday 20 April 2019 (13:29:50)
Ending date and time Saturday 20 April 2019 (13:42:59)
Time consumed 0 hours, 13 minutes, and 8 seconds
Number of datasets 2
Total number of input genes 46458
Genes included in the analysis 36447
Genes filtered out from the analysis 10011
Number of clusters 20
Total number of genes in clusters 6049
Genes not included in any cluster 30398

Do you have any suggestion to increase gene number in clusters, many thanks.

Output representative expression profiles of the clusters

Hi Basel,

In many cases, it's very useful to use a prototypical expression profile of the clusters in downstream analysis (by measuring it's correlation to an external variable, for instance). In WGCNA, the eigengene of the modules are usually used for this purpose.

It would be useful if Clust could output some kind of representation of the expression profile of each cluster. It could be the eigengene, median expression for each sample, trimmed mean etc.

What do you think?

Data pre-processing, AttributeError: 'float' object has no attribute 'replace'

Hello @BaselAbujamous, thank you for providing this capacity for looking at gene expression data from multiple species! This is perfect for my project, and I'm very excited about clust.

I ran into a problem with the following command using Orthofinder output:

clust species-expression -d 17 -r species_replicates -m Orthogroups.csv

The error is below. Can you please tell me if there is a problem with my formatting?

Here are the files used for this run:

curl -L https://osf.io/6f4yn/download -o Orthogroups.csv
curl -L https://osf.io/cbfst/download -o species_replicates
curl -L https://osf.io/sx546/download -o species-expression.tar.gz
tar -xvzf species-expression.tar.gz

Output error:

/===========================================================================\
|                                   Clust                                   |
|    (Optimised consensus clustering of multiple heterogenous datasets)     |
|           Python package version 1.8.10 (2018) Basel Abu-Jamous           |
+---------------------------------------------------------------------------+
| Analysis started at: Friday 25 January 2019 (00:16:46)                    |
| 1. Reading dataset(s)                                                     |
| 2. Data pre-processing                                                    |
Traceback (most recent call last):
  File "/opt/miniconda3/envs/run_clust/bin/clust", line 11, in <module>
    sys.exit(main())
  File "/opt/miniconda3/envs/run_clust/lib/python2.7/site-packages/clust/__main__.py", line 98, in main
    args.cs, args.np, args.optimisation, args.q3s, args.basemethods, args.deterministic)
  File "/opt/miniconda3/envs/run_clust/lib/python2.7/site-packages/clust/clustpipeline.py", line 97, in clustpipeline
    OGsIncludedIfAtLeastInDatasets=OGsIncludedIfAtLeastInDatasets)
  File "/opt/miniconda3/envs/run_clust/lib/python2.7/site-packages/clust/scripts/preprocess_data.py", line 436, in calculateGDMandUpdateDatasets
    OGsFirstColMap, delimGenesInMap)
  File "/opt/miniconda3/envs/run_clust/lib/python2.7/site-packages/clust/scripts/preprocess_data.py", line 397, in mapGenesToCommonIDs
    Maploc[i, j] = re.split(delimGenesInMap, Maploc[i, j].replace('.', 'thisisadot').replace('-', 'thisisadash').replace('/', 'thisisaslash'))
AttributeError: 'float' object has no attribute 'replace'

On an Ubuntu 18.04 instance, Conda py2.7 environment

# packages in environment at /opt/miniconda3/envs/run_clust:
#
# Name                    Version                   Build  Channel
_r-mutex                  1.0.0               anacondar_1    r
backports.functools-lru-cache 1.5                       <pip>
binutils_impl_linux-64    2.31.1               h6176602_1    conda-forge
binutils_linux-64         2.31.1               h6176602_3    conda-forge
blas                      1.0                         mkl  
blast                     2.5.0                hc0b0e79_3    bioconda
boost                     1.69.0          py27h8619c78_1000    conda-forge
boost-cpp                 1.69.0            h11c811c_1000    conda-forge
bwidget                   1.9.11                        1  
bzip2                     1.0.6             h14c3975_1002    conda-forge
ca-certificates           2018.11.29           ha4d7672_0    conda-forge
cairo                     1.14.12              h8948797_3  
certifi                   2018.11.29            py27_1000    conda-forge
clust                     1.8.10                    <pip>
curl                      7.63.0            h646f8bb_1000    conda-forge
cycler                    0.10.0                    <pip>
diamond                   0.9.21                        1    bioconda
dlcpar                    1.0              py27h24bf2e0_1    bioconda
fastme                    2.1.5                         0    bioconda
fasttree                  2.1.10               h470a237_2    bioconda
fontconfig                2.13.0               h9420a91_0  
freetype                  2.9.1             h94bbf69_1005    conda-forge
fribidi                   1.0.5             h14c3975_1000    conda-forge
gawk                      4.2.1             h14c3975_1000    conda-forge
gcc_impl_linux-64         7.3.0                habb00fd_1    conda-forge
gcc_linux-64              7.3.0                h553295d_3    conda-forge
gettext                   0.19.8.1          h9745a5d_1001    conda-forge
gfortran_impl_linux-64    7.3.0                hdf63c60_1  
gfortran_linux-64         7.3.0                h553295d_3  
glib                      2.56.2            had28632_1001    conda-forge
graphite2                 1.3.13            hf484d3e_1000    conda-forge
gsl                       2.4                  h14c3975_4  
gxx_impl_linux-64         7.3.0                hdf63c60_1    conda-forge
gxx_linux-64              7.3.0                h553295d_3    conda-forge
harfbuzz                  1.9.0             he243708_1001    conda-forge
icu                       58.2              hf484d3e_1000    conda-forge
intel-openmp              2019.1                      144  
iqtree                    1.6.9                he941832_0    bioconda
joblib                    0.13.1                    <pip>
jpeg                      9c                h14c3975_1001    conda-forge
kiwisolver                1.0.1                     <pip>
krb5                      1.16.3            hc83ff2d_1000    conda-forge
libcurl                   7.63.0            h01ee5af_1000    conda-forge
libedit                   3.1.20170329      hf8c457e_1001    conda-forge
libffi                    3.2.1             hf484d3e_1005    conda-forge
libgcc                    7.2.0                h69d50b8_2    conda-forge
libgcc-ng                 7.3.0                hdf63c60_0    conda-forge
libgfortran-ng            7.3.0                hdf63c60_0  
libiconv                  1.15              h14c3975_1004    conda-forge
libpng                    1.6.36            h84994c4_1000    conda-forge
libssh2                   1.8.0             h1ad7b7a_1003    conda-forge
libstdcxx-ng              7.3.0                hdf63c60_0    conda-forge
libtiff                   4.0.10            h648cc4a_1001    conda-forge
libuuid                   1.0.3                         1    conda-forge
libxcb                    1.13              h14c3975_1002    conda-forge
libxml2                   2.9.8             h143f9aa_1005    conda-forge
llvm-meta                 7.0.0                         0    conda-forge
mafft                     7.407                         0    bioconda
make                      4.2.1             h14c3975_2004    conda-forge
matplotlib                2.2.3                     <pip>
mcl                       14.137          pl526h470a237_4    bioconda
mkl                       2019.1                      144  
mkl_fft                   1.0.10           py27h470a237_1    conda-forge
mkl_random                1.0.2                    py27_0    conda-forge
mmseqs2                   7.4e23d              h21aa3a5_1    bioconda
muscle                    3.8.1551             h2d50403_3    bioconda
ncurses                   6.1               hf484d3e_1002    conda-forge
numpy                     1.15.4           py27h7e9f1db_0  
numpy                     1.16.0                    <pip>
numpy-base                1.15.4           py27hde5b4d6_0  
openmp                    7.0.0                h2d50403_0    conda-forge
openssl                   1.0.2p            h14c3975_1002    conda-forge
orthofinder               2.2.7                         0    bioconda
pandas                    0.23.4                    <pip>
pango                     1.42.4               h049681c_0  
pcre                      8.42                 h439df22_0  
perl                      5.26.2            h14c3975_1000    conda-forge
pip                       18.1                  py27_1000    conda-forge
pixman                    0.34.0            h14c3975_1003    conda-forge
portalocker               1.3.0                     <pip>
pthread-stubs             0.4               h14c3975_1001    conda-forge
pyparsing                 2.3.1                     <pip>
python                    2.7.15            h938d71a_1006    conda-forge
python-dateutil           2.7.5                     <pip>
pytz                      2018.9                    <pip>
r-base                    3.5.1                h1e0a451_2    r
raxml                     8.2.12               h470a237_0    bioconda
readline                  7.0               hf8c457e_1001    conda-forge
scikit-learn              0.20.2                    <pip>
scipy                     1.2.0                     <pip>
scipy                     1.1.0            py27h7c811a0_2  
setuptools                40.6.3                   py27_0    conda-forge
six                       1.12.0                    <pip>
sklearn                   0.0                       <pip>
sompy                     0.1.1                     <pip>
sqlite                    3.26.0            h67949de_1000    conda-forge
subprocess32              3.5.3                     <pip>
tk                        8.6.9             h84994c4_1000    conda-forge
tktable                   2.10                 h14c3975_0  
wheel                     0.32.3                   py27_0    conda-forge
xorg-libxau               1.0.8             h14c3975_1006    conda-forge
xorg-libxdmcp             1.1.2             h14c3975_1007    conda-forge
xz                        5.2.4             h14c3975_1001    conda-forge
zlib                      1.2.11            h14c3975_1004    conda-forge

baselabujamous / clust Goto Github PK

clust's People

Contributors

Stargazers

Watchers

Forkers

clust's Issues

clust Data/ -r Replicates.txt -n Normalisation.txt -cs 5

Recommend Projects

Recommend Topics

Recommend Org