ksahlin / isonclust Goto Github PK

View Code? Open in Web Editor NEW

46.0 5.0 8.0 489.51 MB

De novo clustering of long transcript reads into genes

License: GNU General Public License v3.0

Python 100.00%

clustering transcriptome oxford-nanopore pacbio-iso-seq bioinformatics

isonclust's People

Contributors

Stargazers

Watchers

Forkers

pythseq ashwini-girish jkomyno bhavnah wenmm healthvivo nextgenusfs d-jch

isonclust's Issues

get_sorted_fastq_for_cluster.py, line 124 // p_no_error_in_kmers = 1.0 - exp_errors_in_kmers/ float((len(seq) - k +1))

Hi,
I've run isONclust with the test: " isONclust --fastq test/sample_alz_2k.fastq --outfolder output ", it was ok.
After that
I've tried to run isONclust on a sugarcane fastq file, read of insert of 301 020 reads.
isONclust --fastq reads_of_insert.fastq --outfolder output1

and I had the error: " p_no_error_in_kmers = 1.0 - exp_errors_in_kmers/ float((len(seq) - k +1))
ZeroDivisionError: float division by zero".

Do you know what could be the problem ?
Thank you,
Virgg

------------------------ Full LOG ----------------------------
isONclust --fastq /30days/uqvperlo/smrtanalysis/output/merge_data/reads_of_insert.fastq --outfolder output1
started sorting seqs
0 reads processed.
10000 reads processed.
20000 reads processed.
30000 reads processed.
40000 reads processed.
50000 reads processed.
60000 reads processed.
70000 reads processed.
80000 reads processed.
90000 reads processed.
Traceback (most recent call last):
File "/home/uqvperlo/.conda/envs/isonclust/bin/isONclust", line 178, in
main(args)
File "/home/uqvperlo/.conda/envs/isonclust/bin/isONclust", line 67, in main
sorted_reads_fastq_file = get_sorted_fastq_for_cluster.main(args)
File "/home/uqvperlo/.conda/envs/isonclust/lib/python3.7/site-packages/modules/get_sorted_fastq_for_cluster.py", line 124, in main
p_no_error_in_kmers = 1.0 - exp_errors_in_kmers/ float((len(seq) - k +1))
ZeroDivisionError: float division by zero

Dependencies for running --medaka argument

Running the --medaka argument in the isONclust environment will result in error.

Do we need to install medaka in the same environment with isONclust? Will it break any of the other dependecies?

Thank you

UPDATE: I went ahead and installed medaka in the same environment with isONclust.
Running the --medaka argument returns the following error (which is the same error when medaka was not manually installed )
Traceback (most recent call last): File "/home/axd/miniconda3/envs/isonclust/bin/isONclust", line 263, in <module> main(args) File "/home/axd/miniconda3/envs/isonclust/bin/isONclust", line 139, in main print("Saving references in:", f.name) UnboundLocalError: local variable 'f' referenced before assignment

TypeError with simulated fastq data

Hi, I have a TypeError when I execute the following command:

isONclust --t 1 --consensus --k 13 --w 20 --batch_type total_nt \
  --aligned_threshold 0.4 --min_prob_no_hits 0.1 \
  --fastq /data/simulated_aligned_reads.fastq \
  --outfolder /data/isONclust/git-issue

The error log is:

Traceback (most recent call last):
  File "isONclust", line 263, in <module>
started sorting seqs
0 reads processed.
10000 reads processed.
    main(args)
  File "/isONclust/isONclust", line 47, in main
    sorted_reads_fastq_file = get_sorted_fastq_for_cluster.main(args)
  File "/isONclust/modules/get_sorted_fastq_for_cluster.py", line 233, in main
    read_array, error_rates = fastq_single_core(args)
  File "/isONclust/modules/get_sorted_fastq_for_cluster.py", line 146 in fastq_single_core
    exp_errors_in_kmers = expected_number_of_erroneous_kmers(qual, k)
  File "/isONclust/modules/get_sorted_fastq_for_cluster.py", line 25, in expected_number_of_erroneous_kmers
    prob_error = [D[char_] for char_ in quality_string]
TypeError: 'NoneType' object is not iterable

I'm running Python 3.9 on Ubuntu 20.04, but I don't believe this has anything to do with the issue.
The data is simulated with NanoSim.

I've added the simulated fastq file here, so you can test it.

Incorporation of isONclust to a currently used pipeline (Iso-Seq, PacBio)

Hello everyone,
I just finished processing my Iso-Seq data and the pipeline I used, in brief, consisted of the isoseq3 steps, CD-HIT-EST followed by Cogent (using the CD-HIT outcome for this step) and Cupcake to collapse isoforms (as described in the tutorial). Now, my goal is to include isONclust to the pipeline and I have few questions regarding the steps I should take:

Most importantly, my data originate from four flow cells. Should I merge the four FLNC sets and perform isONclust on this data or do the clustering independently for each flow cell? I did not prepare the libraries myself but from my understanding, some size selection was employed in the process, meaning I would expect at least a portion of transcripts being uique to a certain flow cell.
In my first try, afterisoseq3 clusterstep, I performed polishing with all the initial subreads (isoseq3 polish). Is it still OK to keep this step once isONclust was run?
By using isONclust, can I omit the CD-HIT step now? I am not a bioinformatician by training and still new in the field, so I may be wrong. My understanding, however, was that CD-HIT takes the longest representative sequence in each cluster, without building any kind of consensus, considering quality scores etc, and that is why I want to implement isONclust now, hoping I can then proceed with the Cogent/Cupcake steps directly. Does it make sense to proceed like that?

I am not sure whether it is relevant, but I will be doing transcript collapsing without a reference genome.

Thank you very much for taking the time to address my concerns and questions.

Best,
Eva

Is it possible to use isOnclust on multiple samples?

Hi!
I am working with samples from nanopore. I successfully used isOnclust on a single sample, but I would need to compare various samples. Is it possible to use isOnclust on multiple samples?

isONclust: error: too few arguments

Hello Kristoffer,

I have been having (probably a trivial issue) with running isonclust and after two days of trying, I eventually would like to call for help.

I tried running isonclust on an Iso-Seq data using either flnc.fastq as an input or a combination of CCS and FLNC .bam files. In either case, the error was the same: "too few arguments". I thus downloaded the sample data set and ran isONclust --isoseq --fastq sample_alz_2k.fastq --outfolder test but the error was the same.

Subsequently, I tried specifying --k and --w instead of using --isoseq but the error was the same. Eventually, I tried with "my" data again and I specified every single parameter (using the default values). The command was thus: isONclust --fastq Merged.FLNC.fastq --t 12 --isoseq --min_shared 5 --mapped_threshold 0.7 --aligned_threshold 0.4 --min_fraction 0.8 --min_prob_no_hits 0.1 --outfolder isONclust-191011 Nonetheless, there are still "too few arguments".

I assume it is some really small thing missing somewhere and if this is the case, I am sorry about it but I am stuck and out of ideas, what might be going on. Could you please help me? Many thanks.

Best,
Eva

Working with huge datasets

Hello!
Thank you for the tool. I run it with some samples, and the results are promising.
Currently, I am testing this tool for one pipeline that should handle many ONT barcoded samples.
I have already found that one needs to concatenate samples in one file to cluster reads from different samples. The issue is that it may easily be a couple of hundred samples, so I am looking for ways to reduce memory requirements.

So, I have the following questions:

Is it possible to add support for gzipped files?
What if instead of clustering all pooled reads from many samples, I cluster sample-by-sample, using --consensus to get representative sequence (or use my own scripts to do that), pool representative sequences along samples, and cluster them again to create clusters that are shared among all samples?
I got following header for consensus sequences: ">consensus_cl_id:448_total_supporting_reads:20057". Hovewer, for this test I got only cluster IDs up to 104, and 448 definitely not a cluster id. Am I misreading it?

I am especially interested in question 2; please share your opinion on that approach.

Best,

pipeline not an option

In the manual you write:

isONclust pipeline --ont --fastq [reads.fastq] --outfolder [/path/to/output]

but pipeline is not an option or command, is it?
isONclust: error: invalid choice: 'pipeline' (choose from 'write_fastq')

Version mismatch with pypi repos

When I use the preferred way to install isONclust via pip install, I am getting version 0.0.6, however the latest release on GitHub is 0.0.4 (as of 2020-03-04). Why is that, what are the differences? Are there releases not available on this GitHub?

Usage of --consensus to generate a non redundant fasta file with unique transcript sequences

Hello,
I would like to generate a single fasta file with all unique transcripts using the --consensus parameter from reads generated by ONT cDNA sequencing. I have tried to do so on 2 different fastq files with varying number of reads. On the smaller file it exits out generating a fasta file without any sequences. In another instance it gave the following error;
Total number of reads iterated through:255873
Passed mapping criteria:12291
Passed alignment criteria in this process:5110
Total calls to alignment mudule in this process:5435
Time elapesd clustering last iteration single core: 267.3920338153839
Time elapsed clustering: 969.0639133453369
Nr clusters larger than 1: 22938
Nr clusters (all): 238472

STARTING TO CREATE CLUSTER CONSENSUS

Temporary workdirektory for polishing: /tmp/tmptcwada2z
creating center of 62799 sequences.
Traceback (most recent call last):
File "/home/superuser/ENTER/bin/isONclust", line 263, in
main(args)
File "/home/superuser/ENTER/bin/isONclust", line 127, in main
center = consensus.run_spoa(reads_path.name, os.path.join(work_dir,"spoa_tmp.fa"), "spoa")
File "/home/superuser/ENTER/lib/python3.8/site-packages/modules/consensus.py", line 89, in run_spoa
subprocess.check_call([ spoa_path, reads, "-l", "0", "-r", "0", "-g", "-2"], stdout=output_file, stderr=null)
File "/home/superuser/ENTER/lib/python3.8/subprocess.py", line 359, in check_call
retcode = call(*popenargs, **kwargs)
File "/home/superuser/ENTER/lib/python3.8/subprocess.py", line 340, in call
with Popen(*popenargs, **kwargs) as p:
File "/home/superuser/ENTER/lib/python3.8/subprocess.py", line 854, in init
self._execute_child(args, executable, preexec_fn, close_fds,
File "/home/superuser/ENTER/lib/python3.8/subprocess.py", line 1702, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'spoa'

Could you please let me know, if spoa is installed as part of the isONclust package? I ask because of the lack of any error message associated with the smaller fastq file, where it generated a fasta file without inputting any sequences.
Thanks!

Stranded vs unstranded sequencing

Dear Kristoffer,

I'm very impressed with isONclust's performance - thank you for designing this much-needed tool!

Out of curiosity, I'm evaluating its utility for clustering reads from ONT genomic amplicon data instead of transcriptome data. We've sequenced a 9kb amplicon from a viral genome and suspect that we might have mixed infection with potentially large structural variations among strains. I realise isONclust is not designed for this type of data, but I was curious to see how it performs anyway.

My amplicon dataset is clustered into two clusters with c. 50% abundance each. After some initial excitement I noticed that the clusters correspond to sequencing direction. So, it looks like isONclust expects stranded data and won't cluster reverse reads with forward reads. I guess this behaviour makes sense to identify antisense transcripts, but it's not appropriate for unstranded data.

I realise that PacBio IsoSeq and ONT direct RNA data are stranded, but ONT cDNA libraries are typically unstranded (?). Could isONclust be modified to accept unstranded data? Or can unstranded data be reoriented before running isONclust?

Thanks and best wishes,
Marius

Is it possible to use isONclust to help determine whether two reads came from the same isoform?

Hi Kristoffer,

I understand that each cluster generated by isONclust represents all reads that came from the same gene.

However, I was looking for a de novo tool/method that could determine whether two overlapped reads (that have a set of matching minimizers) came from the same isoform for ONT reads. So I was wondering if there is any way in which I may use isONclust to help determine this? Or is it possible to use isONclust in an alternative or modified way to cluster reads that came from the same isoform?

Since ONT reads have high error rates, those unmatched gaps/regions (with no matching minimizers) between two matched/overlapped reads may be caused by either sequencing errors or alternative splicing, so it seems challenging to determine whether two reads came from the same isoform for ONT reads. I would appreciate your advice.

Thank you very much!

error: too few arguments

I am getting this error when I try the simple test file

$ ./isONclust --fastq sample_alz_2k.fastq --outfolder ./out/
usage: isONclust [-h] [--version] [--fastq FASTQ] [--flnc FLNC] [--ccs CCS]
                 [--t NR_CORES] [--d PRINT_OUTPUT] [--q QUALITY_THRESHOLD]
                 [--ont] [--isoseq] [--use_old_sorted_file] [--k K] [--w W]
                 [--min_shared MIN_SHARED]
                 [--mapped_threshold MAPPED_THRESHOLD]
                 [--aligned_threshold ALIGNED_THRESHOLD]
                 [--batch_type BATCH_TYPE] [--min_fraction MIN_FRACTION]
                 [--min_prob_no_hits MIN_PROB_NO_HITS] [--outfolder OUTFOLDER]
                 {write_fastq} ...
isONclust: error: too few arguments

Threads error

When I run isONclust on ONT data with the default --t parameter, it finishes without any issues, but when I increase the number of threads from 8 to any other number I get the following error :

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/vineeth/anaconda3/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/vineeth/anaconda3/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/vineeth/anaconda3/lib/python3.6/site-packages/modules/cluster.py", line 732, in reads_to_clusters_helper
    return reads_to_clusters(*args, **kwargs)
  File "/home/vineeth/anaconda3/lib/python3.6/site-packages/modules/cluster.py", line 678, in reads_to_clusters
    print("Percent passed alignment criteria out of number of calls to this module:{0}".format( round(100*aln_passed_criteria/float(aln_called), 2) ))
ZeroDivisionError: float division by zero
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/vineeth/anaconda3/bin/isONclust", line 178, in <module>
    main(args)
  File "/home/vineeth/anaconda3/bin/isONclust", line 86, in main
    clusters, cluster_seq_origin = cluster.cluster_seqs(read_array, p_emp_probs,  args)
  File "/home/vineeth/anaconda3/lib/python3.6/site-packages/modules/cluster.py", line 784, in cluster_seqs
    cluster_results =res.get(999999999) # Without the timeout this blocking call ignores all signals.
  File "/home/vineeth/anaconda3/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
ZeroDivisionError: float division by zero

Explicitly setting --t to 8 works as well

Add script to create separate fasta for each cluster

The script should take the tsv file generated by isONclust and output reads into fasta or fastq files with reads belonging to each cluster. The script should take a parameter -n [int] that tells the script the smallest clusters that should be output, and a --fastq or --fasta flag for the output format.

isonclust command for ONT data

Hi,

I'm testing out isONclust with my ONT data and want to verify I'm using the correct command. In the readme, it says to use IsoCon pipeline, but the biorxiv paper says you used isONclust.py.

I got an error when I entered IsoCon pipeline (missing --fl and -outfolder vs. --outfolder), though this command seems to work:

isONclust --ont --fastq <my fastq file> --outfolder <output folder>

In addition, how can I increase the stringency for matching reads? I tried changing the --mapped_threshold parameter to 0.9, but the output was the same, so I wasn't sure whether changing from default (0.7) to 0.9 did anything. Basically, I'd like to see how read clustering performs with a range of percent similarities, say 80-90%.

Thanks!

struct.error

There is something wrong about the "struct.error: 'i' format requires -2147483648 <= number <= 2147483647" and the recent error please see error.txt
error.txt

this is my code:
isONclust --ont --fastq ../../00.all_length/reads_fitered.fq --outfolder out --t 16 1>log.txt 2>error
log.txt

in filedir "out" is sorted.fastq and logfile.txt
Could you please help me to solve the error?

RE: IsONclust on ONY; spot died with <Signals.SIGILL: 4>.

Hi
I am running isONclust in a Conda environment and the spoa and medaka modules are loaded before the run.

Here is my commands:

ml Miniconda3/4.10.3
source activate isonclust
ml spoa/4.0.0-GCC-8.3.0
ml minimap2/2.22-GCC-8.3.0
ml medaka/1.5.0

isONclust --ont --use_old_sorted_file --fastq ./isONclust-out2/sorted.fastq  --outfolder isONclust-out2 --consensus  --t 24

the run failed in the consesus step when running spoa with the following kill signal.

Traceback (most recent call last):
  File "/home/malabady/.conda/envs/isonclust/bin/isONclust", line 263, in <module>
    main(args)
  File "/home/malabady/.conda/envs/isonclust/bin/isONclust", line 127, in main
    center = consensus.run_spoa(reads_path.name, os.path.join(work_dir,"spoa_tmp.fa"), "spoa")
  File "/home/malabady/.conda/envs/isonclust/lib/python3.10/site-packages/modules/consensus.py", line 89, in run_spoa
    subprocess.check_call([ spoa_path, reads, "-l", "0", "-r", "0", "-g", "-2"], stdout=output_file, stderr=null)
  File "/home/malabady/.conda/envs/isonclust/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['spoa', '/tmp/tmp_12sh3ko/reads_tmp.fq', '-l', '0', '-r', '0', '-g', '-2']' died with <Signals.SIGILL: 4>.

I can see /tmp/tmp_12sh3ko/reads_tmp.fq and can run spoa on the reads_tmp.fq directly.

Any suggestions what's wrong?

Thanks
Magdy

ValueError: min() arg is an empty sequence

I am getting this error within the multiprocessing.pool.RemoteTraceback library. I am running the following command

isONclust \
--ont \
--fastq results/.temp/merged.barcode.clusters.fastq \
--q 7 \
--aligned_threshold 0.85 \
--min_fraction 0.95 \
--mapped_threshold 0.7 \
--min_shared 55 \
--outfolder results/isONclust/merged_barcodes/origins

And I am receiving this error

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/Users/joshl/miniconda3/envs/mapt_pipeline/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/Users/joshl/miniconda3/envs/mapt_pipeline/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/Users/joshl/miniconda3/envs/mapt_pipeline/lib/python3.7/site-packages/modules/parallelize.py", line 17, in reads_to_clusters_helper
    return cluster.reads_to_clusters(*args, **kwargs)
  File "/Users/joshl/miniconda3/envs/mapt_pipeline/lib/python3.7/site-packages/modules/cluster.py", line 224, in reads_to_clusters
    lowest_batch_index = max(1, min(prev_b_indices))
ValueError: min() arg is an empty sequence
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/joshl/miniconda3/envs/mapt_pipeline/bin/isONclust", line 263, in <module>
    main(args)
  File "/Users/joshl/miniconda3/envs/mapt_pipeline/bin/isONclust", line 72, in main
    clusters, representatives = parallelize.parallel_clustering(read_array, p_emp_probs, args)
  File "/Users/joshl/miniconda3/envs/mapt_pipeline/lib/python3.7/site-packages/modules/parallelize.py", line 156, in parallel_clustering
    cluster_results =res.get(999999999) # Without the timeout this blocking call ignores all signals.
  File "/Users/joshl/miniconda3/envs/mapt_pipeline/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
ValueError: min() arg is an empty sequence

Is it possible to add a “--shortname" option for isONclust to output the read name with “the read id only”?

Hi Kristoffer,

I was wondering if it would be possible to add a --shortname option for isONclust to output the read name with “the read id only” instead of the entire @ line (currently the entire @ line is concatenated together with “_” in the output file final_clusters.tsv)? This would save some runtime in post-processing the output file. Since the @ line also contains “_” in some part of the original text, the added “_” cannot be used as a delimiter to convert the long string of the read name back to its original form. So if it is possible to add this --shortname option, it would be really helpful.

Thank you very much!

homopolymer compression?

Thanks @ksahlin for the great tool. I've been trying to use for a variety of clustering tasks using some amplicon data. I still have some uncertainty on how it is performing as I don't necessarily know exactly I should be expecting in my amplicon pool, but I do know that I see some "lumping" of closely related sequences that I would like to see come out in separate clusters.

So we are running all of our amplicons with the newer R10 flow cells and use high accuracy basecalling model -- the goal is to reduce errors in homopolymers. I notice this is different than at least your test case (using fast basecalling model). So as I looked at the code I see that the first step in clustering is actually to compress homopolymers: https://github.com/ksahlin/isONclust/blob/master/modules/cluster.py#L209. So I'm wondering what would happen if homopolymers were not compressed? Is compressing homopolymers required here just for kmer selection or is it being used because of higher error rates in these sequences?

Related point is that the error profiles are quite different between the ONT guppy basecall models, so I'm also then wondering about the role of the empirical distances that seem to be hard coded into isONclust -- would the basecall model (ie single read accuracy) alter these distances? I guess in other words should I expect differing performance using the FAST vs HAC basecall models? And if I try to remove the homopolymer compression step as a test, would that cause a problem?

Memory Problems with ONT Data

I was using isONclust in parallel as a previous step to define a transcriptome using ONT data with isONform. I looked at the memory profiling of isONclust and after a few minutes, when almost reaching the memory limit (125 Gb), the memory consumption of isONclust dropped at 40-50 Gb.
IsONclust "seems to be working" as the command line did not appear and no error was thrown but only 1 thread out of the total that were running was actually alive.
I realized that there were 2 reads that were very long (<100 Kb), while the other reads were 10 Kb long at most. I removed those outlayers and now it seems to work.

I was thinking that maybe some error might be thrown in order not to lead to confusion.

Thanks for your time!

About the read order in each cluster in the final_clusters.tsv file

Hello!

I have a question regarding the read order in each cluster in the final_clusters.tsv file and would appreciate your advice.

My understanding is that isONclust first sorts the reads such that longer reads with higher quality scores appear earlier, and then processes the reads one-by-one in the sorted order. So in the final_clusters.tsv file, are the reads in each cluster also reported in the sorted order (i.e. longer reads with higher quality scores appear earlier)?

Thank you very much!

ksahlin / isonclust Goto Github PK

isonclust's People

Contributors

Stargazers

Watchers

Forkers

isonclust's Issues

Recommend Projects

Recommend Topics

Recommend Org