genomerik / tama Goto Github PK

View Code? Open in Web Editor NEW

125.0 125.0 24.0 800 KB

Transcriptome Annotation by Modular Algorithms (for long read RNA sequencing data)

License: GNU General Public License v3.0

Python 100.00%

tama's People

Contributors

Stargazers

Watchers

tama's Issues

TAMA without a reference genome

Hello everyone,

I am new in the field of bioinformatics and currently working with my fist Iso-Seq data. I have obtained what might be an OK reference transcriptome but I am searching for ways to improve it. The TAMA tool got my attention and I would like to give it a try, but the problem is we do not have a reference transcriptome for our species of interest. When using Cogent in such a situation as described here, a fake genome is created first prior to isoform collapsing. My question is, is there is a similar "workaround" with TAMA as well. Or could I use the Cogent-based fake genome and then proceed with TAMA collapse?

Many thanks for sharing your opinions/ideas with me.

Best,
Eva

Tama merge.

Hi, I have a question: I am merging transcriptomes previously checked in TAMA-Collapse and Sqanti3, from different tissues, treatments and plant varieties to obtain a bid bed file with unique IDs. However, after merging the transcriptomes, the prefix_merge.txt does not contain all my initial transcripts. Is there a way or parameter to obtain the same Isoform number? Thanks in advance.

Best,

Cesar

Over-clustering of transcripts in gene-dense genomes

Hi Richard,

Re our discussion at PAG about what should constitute a "gene" for the purpose of transcript clustering, I wanted to note some issues that might arise from TAMAs default rule of "cluster everything with any overlap".

Fungal genomes frequently contain genes with overlapping UTRs. This is a known issue for transcript assembly (cufflinks) and RNA-seq guided gene annotation. See CodingQuarry paper for details.

One potential solution might be to first predict the CDS region of a transcript then cluster transcripts with shared exons within the CDS, ignoring UTRs.

Alternatively, for well-annotated genomes where the user is just wanting to incorporate novel isoforms of existing genes, you might just assign transcripts to the reference gene that they have the greatest overlap with.

Transcripts fully contained within an intron of another gene (no CDS overlap) are assigned to the same cluster as the gene they are nested in. This would call things like homing endonucleases from mobile-intron elements as isoforms of whatever gene they are inserted in. The same issue would also come up where other TEs insert into an existing intron.

I think using a CDS overlap rule for clustering would also work here.

ImportError: No module named pysam

Dear TAMA developer:

Thanks for developing TAMA.

I got an error when I run tama_collapse.py, saying ImportError: No module named pysam. I fixed it by install pysam package using conda. Could you please take a look to see this a general issue?

Thanks
Changyu

MemoryError problem

Hi Rik,
I have my first Isoseq dataset from a Sequel2 run, this is a set of 2 samples prepped with the TeloPrime protocol (5'-cap). I have 2 main questions:
1)
I went through your instructions at https://github.com/GenomeRIK/tama/wiki/TAMA-GO:-Sequence-Cleanup up to the collapsing stage. I mapped the data using minimap2. Aligned reads are 3470490 for sample1 and 4068057 for sample2.
For 1 sample at the Tama_collapse stage I've got an error saying:
"var_coverage_dict[scaffold][this_coord][trans_id] = 1
MemoryError"

This job was run on a cluster, allocating 240gb of memory.
Tama_collapse on sample1 failed with the above mentioned message and on sample2 is running since a week (I allocated 480gb for the second job).
Attached you will find the output log of the sample that failed.
Do you have any suggestion on how to solve/improve this?

2)Is there a way with your set of tools to annotate the trascripts to a reference gff/gtf file?
Thanks in advance for your time.
I'm looking forward to your reply.
Cheers,
Roberta
Tama_collapse.o5180769.zip

Can I use FLAIR corrected alignment with tama collapse?

Hi Rik,
I am using the tama for a while and I can say it is a great software. This message is just a question and not an issue report.
For curiosity, I tried to collapse alignment corrected by flair with tama. I wanted to try the flair out, because it is able to correct misaligned splice sites. The flair produce an alignment bed12 file what I converted into bam file. Than I used this bam file to collapse it with tama.
The first few lines from the bam file looks like this:

67:553|1f202b78-ee34-44f4-ba8c-73d7edabbef5;16 16 scaffold_1 1424 255 388M55N68M55N34M * 0 0 * *
66:1618|eab3ca4a-40c9-45d4-bda4-019f0b265ff4;16 16 scaffold_1 1427 255 385M55N68M55N232M50N152M56N311M50N186M57N112M52N30M62N75M * 0 0 * *
62:1705|d776b5ad-11a1-4143-ba20-e4d0f549b7ca;16 16 scaffold_1 1441 255 371M55N68M55N232M50N152M56N311M50N186M57N112M52N30M62N224M * 0 0 * *
79:305|974bfdb6-1267-43d7-9ad2-06f156c16a10;16 16 scaffold_1 9731 255 195M * 0 0 * *
62:2408|75a2bdf7-ad07-4d10-8481-78df0ba3bca8;0 0 scaffold_1 11127 255 793M100N89M56N43M76N12M89N523M48N56M52N139M57N308M82N195M61N161M * 0 0 **

It is similar to the deSALT aligner output except it do not contain the read sequence.
Do you thing can I use the tama for this type of data? When I tried I got this error message: Genome seq is not the same length as query seq.
Thanks for your suggestions,
Cheers,
Botond

SAM parse off by one?

Hello,

I am wondering if your sam parse section has an off-by one error present. Hopefully I am missing something or some downstream correction. Here's your code:

def trans_coordinates(start_pos,cigar):
    [cig_dig_list,cig_char_list] = cigar_list(cigar)
    end_pos = int(start_pos)
    exon_start_list = []
    exon_end_list = []
    
    exon_start_list.append(int(start_pos))
    
    for i in xrange(len(cig_dig_list)):
        cig_flag = cig_char_list[i]
        
        if cig_flag == "H":
            continue
        elif cig_flag == "S":
            continue
        elif cig_flag == "M":
            end_pos = end_pos + int(cig_dig_list[i])
            continue
        elif cig_flag == "I":
            continue
        elif cig_flag == "D":
            end_pos = end_pos + int(cig_dig_list[i])
            continue
        elif cig_flag == "N":
            #add exon end position to list (must be before updating end pos info)
            exon_end_list.append(end_pos)
            end_pos = end_pos + int(cig_dig_list[i])
            #add exon start
            exon_start_list.append(end_pos)
            continue
    #add last exon end position
    exon_end_list.append(end_pos)
    
    return end_pos,exon_start_list,exon_end_list

It looks like you are treating the sam file as being zero-based instead of one-based as in the specification.

Imagine that your transcript alignment begins with an exon of length 1...

end_pos = int(start_pos) # you initially define end_pos to be the same as start
# drop a bit into it...
elif cig_flag == "M":
  end_pos = end_pos + int(cig_dig_list[i])
  continue

a cigar match of length 1 should not change the end position from the start in 1-based coordinates but this code does.

I don't think this would harm most boundary comparisons that are simply numeric because it would be systematic but it should affect reported features unless you are subtracting one from your end coordinates for say bed files to position it properly. If you did any coding checks by grabbing sequences by these coords it would likely severely affect them unless there is a correction applied there too.

I noticed this due to a recent switch to deSALT since minimap2 was dropping small exons from its alignments. I was investigating this message from tama_collapse:

Error with exon end earlier than exon start
226925864	226925864

# observed when tama was processing two transcripts with these cigar strings that I've tokenized:
1S 25M 15754N 113N 93M 1864N 140M 425N 126M 85N 383M 1S
1S 25M 15754N 113N 93M 1864N 140M 425N 126M 85N 383M 1S

I get this message frequently and thus lose data. There don't seem to be any single base exons here so my first guess which led to the coordinate system issue was wrong. My actual issue looks to be that there are two introns with no intervening exons being reported in the cigar, the 1st intron is immediately followed by another which seems like an aligner bug.

Nonetheless, if someone is permitting one-based exons, the following shouldn't be an error when end equals start unless tama internally operates on python/bed style coords which it doesn't seem to.
Tama exits at this point.

if check_end <= check_start: # exon end must always be greater than exon start
            print("Error with exon end earlier than exon start")

I do see that there is a danger of division by zero if one assumes that exon length = check_end - check_start in a one-based coordinate system.

Am I missing something here?

Regards

TAMA merge - Minimum overlap ratio

Is there any option with tama merge to prevent the merging of transcripts into a single locus (similar to bedtools intersect -r option). In the example below, I have a single incorrect transcript derived from an incorrect sequence on the wrong strand that is merging with a close gene model by just a few bases in a single exon. Is there any setting/option so that these are not merged into the same gene model based on an overlap threshold?

one isoform divided into different transcript groups

When I used TAMA collapse to collapse my isoforms, I found that one isoform was divided into different transcript groups. Example as follow, CAT-6_84937/f1p4/390 was divided into 4 transcript groups. Is this a bug or an explicable events ?

3 33509968 33510356 G8998.1;CAT-6_84937/f1p4/390 40 + 33509968 33510356 255,0,0 1 388 0
3 33507166 33510366 G8998.1;CAT-6_406/f25p4/1106 40 + 33507166 33510366 255,0,0 5 352,142,121,63,429 0,1390,1612,1837,2771
3 33509968 33510356 G8998.2;CAT-6_84937/f1p4/390 40 + 33509968 33510356 255,0,0 1 388 0
3 33507168 33510364 G8998.2;CAT-6_43299/f1p4/1973 40 + 33507168 33510364 255,0,0 4 350,142,121,1361 0,1388,1610,1835
3 33509968 33510356 G8998.3;CAT-6_84937/f1p4/390 40 + 33509968 33510356 255,0,0 1 388 0
3 33507174 33510357 G8998.3;CAT-6_255748/f1p4/1040 40 + 33507174 33510357 255,0,0 6 235,55,142,121,63,420 0,289,1382,1604,1829,2763
3 33507178 33508883 G8998.4;CAT-6_36466/f1p4/587 40 + 33507178 33508883 255,0,0 3 340,142,105 0,1378,1600
3 33507183 33510315 G8998.5;CAT-6_72713/f1p4/1055 40 + 33507183 33510315 255,0,0 5 335,142,121,63,378 0,1373,1595,1820,2754
3 33509968 33510356 G8998.6;CAT-6_84937/f1p4/390 40 + 33509968 33510356 255,0,0 1 388 0
3 33507386 33510353 G8998.6;CAT-6_4437/f3p4/873 40 + 33507386 33510353 255,0,0 5 132,142,121,63,416 0,1170,1392,1617,2551

how to calculate coverage & identity value

Hi,

I am new to Iso-seq and RNA-seq, now I have done PacBio Iso-Seq and alignment, how can I get -c coverage and -i identity?

Thanks a lot.

TAMA GO - contradicting results based on Aligners

Hello Richard,

I was trying out the TAMA Go degradation tool and found quite drastic differences depending on the aligner used:

Aligned with Minimap
minimap2 -t 30 -ax splice -uf --secondary=no -C5 -O6,24 -B4 $REFERENCE/mm10.fa $1.flnc.fasta > $1.sam 2> $1.minimap_map.log
samtools sort -O SAM $1.sam > $1.sorted.sam

B21_degradation.txt, aligned with Minimap2
Degradation Signature = 0.947700974421
K17_degradation.txt, aligned with Minimap2
Degradation Signature = 0.942204249062
K23_degradation.txt, aligned with Minimap2
Degradation Signature = 0.938241261273

Aligned with GMAP
gmap -D $REFERENCE -d mm10 -f samse -n 0 -t 16 --cross-species --max-intronlength-ends 200000 -z sense_force
$1.flnc.fasta > $1.sam
2> $1.gmap_map.log
samtools sort -O SAM $1.sam > $1.sorted.sam

B21_degradation.txt, aligned with GMAP
Degradation Signature = 0.45011328759
K17_degradation.txt, aligned with GMAP
Degradation Signature = 0.38691335822
K23_degradation.txt, aligned with GMAP
Degradation Signature = 0.364736473647

Of course I understand that this is to do with aligner issue but was wondering whether you see this with your samples, and given the conflicting results, which aligner do you trust more?

Thank you,

Best wishes,
Szi Kay

key error 'UnnamedSample_HQ_transcript' read support

Good afternoon!
I have some 5’-cap selected Iso-Seq data. I aligned with gmap (and still gonna alignment with desalt) and after used tama collapse. However, occurs a key error 'UnnamedSample_HQ_transcript'. You can see bellow the command, the beginning of my collapsed bed file, my hq_transcript file I used for aligning and my cluster file. Just as an extra information, my file contains 3 samples that after this I will demultiplex. Do you know why this error is occurring?

-bash-4.1$ python tama_read_support_collapse_cluster.py highquality_gmapn1_collapse_capped_trans_read.bed unpolished.cluster_report.csv output_file

USAGE: python tama_collapse_trans_support_cluster.py prefix_trans_read.bed cluster_file output_file
opening collapse
opening cluster
Interpretting as Iso-Seq1 cluster
Going through cluster
Traceback (most recent call last):
File "tama_read_support_collapse_cluster.py", line 148, in
cluster_read_list = list(cluster_read_dict[cluster_id].keys())
KeyError: 'UnnamedSample_HQ_transcript'

-bash-4.1$ head highquality_gmapn1_collapse_capped_trans_read.bed
1 3136605 3136668 G1.1;UnnamedSample_HQ_transcript/213864 40 - 3136605 3136668 255,0,0 1 63 0
1 3199735 3202443 G2.1;UnnamedSample_HQ_transcript/132038 40 - 3199735 3202443 255,0,0 1 2708 0
1 3201364 3205200 G2.2;UnnamedSample_HQ_transcript/72631 40 - 3201364 3205200 255,0,0 1 3836 0
1 3417436 3418825 G3.1;UnnamedSample_HQ_transcript/204620 40 - 3417436 3418825 255,0,0 1 1389 0
1 4744318 4748209 G4.1;UnnamedSample_HQ_transcript/70487 40 + 4744318 4748209 255,0,0 1 3891 0
1 4758097 4758256 G5.1;UnnamedSample_HQ_transcript/213693 40 + 4758097 4758256 255,0,0 1 159 0
1 4761262 4785718 G6.1;UnnamedSample_HQ_transcript/54434 40 - 4761262 4785718 255,0,0 6 3480,194,124,166,155,146 0,13060,16262,21305,22688,24310
1 4769792 4785731 G6.2;UnnamedSample_HQ_transcript/23568 40 - 4769792 4785731 255,0,0 5 4724,124,166,155,159 0,7732,12775,14158,15780
1 4773210 4785709 G6.3;UnnamedSample_HQ_transcript/184765 40 - 4773210 4785709 255,0,0 5 1306,124,166,155,137 0,4314,9357,10740,12362
1 4773210 4785709 G6.4;UnnamedSample_HQ_transcript/55166 40 - 4773210 4785709 255,0,0 5 3591,124,166,155,137 0,4314,9357,10740,12362

-bash-4.1$ head hq_transcripts.fasta

UnnamedSample_HQ_transcript/0
GCGCCCGAGCTTAGTTCTGGTGGGCGGACGCCGCGGCTTTAAGCCGGGGCGGGACTGAGC
GAGGAGCAGCAGCGGCGTCGGCGGCGAGAGTTCTTGGGGACTCGGGCCCGGATGCTGGGG
GCTAGTCGGCAGGTCCCGGCTGGATAAACTCATGGACTTGGCAGCTCAGGCTTGAGAAAA
ACACAAACTCAGAAGAAAGATGGCGTTCGTTGCAACCCAGGGGGCCACGGTGGTCGACCA
AACCACTCTGATGAAGAAATACCTTCAGTTTGTGGCAGCTCTCACGGATGTGAATACACC
TGATGAAACGAAGTTGAAAATGATGCAGGAAGTCAGCGAGAACTTTGAGAATGTCACGTC
ATCTCCTCAGTATTCTACATTCCTGGAACACATCATCCCTCGATTTCTTACCTTTCTCCA
AGATGGAGAAGTCCAGTTTCTTCAGGAGAAGCCAGCCCAGCAACTACGGAAGCTGGTACT
TGAGATACTCCACAGGATTCCAACCAATGAGCATCTGCGTCCTCATACGAAAAACGTTCT

-bash-4.1$ head unpolished.cluster_report.csv
cluster_id,read_id,read_type
transcript/0,m54345U_200310_133837/123864620/ccs,FL
transcript/0,m54345U_200310_133837/13501881/ccs,FL
transcript/0,m54345U_200310_133837/4654108/ccs,FL
transcript/0,m54345U_200310_133837/1706014/ccs,FL
transcript/0,m54345U_200310_133837/159711271/ccs,FL
transcript/1,m54345U_200310_133837/18219976/ccs,FL
transcript/1,m54345U_200310_133837/24512758/ccs,FL
transcript/1,m54345U_200310_133837/134808826/ccs,FL
transcript/2,m54345U_200310_133837/20250751/ccs,FL

how to cite this work?

@GenomeRIK you do a hard work for this , how to cite this work?

how does TAMA do variant calling?

Hi again - I'm afraid I have another query.

I was wondering if you can provide more details of how TAMA collapse does its variant calling? In particular, when I view the reads in the igv I often see variants that are not represented the *variants.txt file. For example, at one locus there are 12 reads contributing to a single TAMA model, ranging from 180 to 960 bp, most of which have no clipping. A couple have 1-2 bp soft clipping and one read has 40 bp soft clipping. Otherwise they match the genome perfectly except a couple 2 bp of insertions. These are at 5 locations, 4 of which are consigned to single reads, but one of which is in 4 / 12 reads. None of these 5 insertions are reported in the variants file, so what is the logic behind the variant calling?

This is important to me because I'm working with a non-model alga species and I'm interested in having as accurate as possible predicted protein sequences. Our PacBio genome is about 40x coverage and we have a lot of isoseq data which we're using to build the gene models. Both sets of data still have PacBio errors. To make things more complicated, the genome is diploid, with moderate heterozygosity, and there are some parts of the genome where the two haplotypes cannot be resolved by Falcon-unzip / Purge Haplotigs. I've been trying to use the *variants.txt file to help me decide whether to use sequence data from the genome or the isoseq data for building the a final CDS at different locations. For example, if R = 'count' / 'cov_count', then high R might indicate that the variant is due to a PacBio error in the genome; low R a PacBio error in the isoseq read; and R= 0.5 may be due to the genomic locus still being a mixture of two alleles resulting in isoseq from both alleles mapping at one locus. I'm then using this information to build the final CDS.

Any help you can give would be much appreciated!

Many thanks,
Alastair

log argument not work

Dear TAMA developer,

I found the -log parameter is not working when I run tama_collapse.py. When I run the script with out -log, I did not see a log file when the run is finished. When I add the -log log_on, I kept get an error saying Please use log_on or log_off, and the run just stoped. Could you please take a look at this?

Thanks,
Changyu

BAM file is not supported for tama_collapse.py

Dear TAMA developer,

I got an error saying Please try again with complete arguments Traceback (most recent call last): File "/data/group/lewseylab/home/cyi/tama/tama/tama_collapse.py", line 207, in <module> sam_file = opts.s[0] TypeError: 'NoneType' object has no attribute '__getitem__' when I run tama_collapse.py using this code python tama_collapse.py -b my_sorted.bam -f my_fasta -p iso_set2 -x no_cap . However, it works if I convert the bam file to sam file. Could you please take a look at this? Please let me know if you need more information.

Thanks,
Changyu

tama_flnc_polya_cleanup.py fails for "A"-only sequences

Hi
The script tama_flnc_polya_cleanup.py fails in my hand for some IsoSeq derived sequences.
Trying to find the source I had sequences which contained only PolyA.
E.g.

>transcript/113571 full_length_coverage=9;length=189
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAA

If such a sequence is part of the FASTA file given to the script it fails with the message:

Traceback (most recent call last):
  File "/tmp/tama/tama_go/sequence_cleanup/tama_flnc_polya_cleanup.py", line 153, in <module>
    this_block_obj = a_block_dict[this_block_index]
KeyError: 0

If I do modify the sequence and add e.g. a single C at the position 1 everything is fine.
This is obviously pretty rare but made the script fail for some of my IsoSeq data.
A more biology related issue is maybe the fact that if we add instead e.g. another nucleotide at any different spot in the sequence it works and takes everything 5' as sequence and everything 3' as tail.
I guess I would argue that in this case it is much more likely that we have a wrong base-call instead.

What kind of fasta should be used for sam file?

Hi @GenomeRIK,
I want to try TAMA to find APA(Alternative PolyAdenylation ). You know, FLNC reads were removed polyA tails. BUT I'm confused which fasta should be used, CCS.fasta, FLNC.fasta, or High-quality fasta, for producing sam file.
I've already tried some software, like TAPIS, IDP-APA. But the results seem not to be good. Hope you can reply to me soon, I'd greatly appreciated that.

Evidence for CDS selection

TAMA currently uses identity to Uniprot proteins to select ORFs/CDSs. For genes without a homologue in Uniprot, it would great to be able to incorporate other lines of evidence.

Use custom reference protein sets (i.e. from a related species) in place of Uniprot data.
Select ORFs that contain a PFAM domain hit (as is done in transdecoder)
Select ORFs with best Kozak sequence match for that genome. The Kozak consensus can be obtained from the high-confidence homologues of Uniprot matches. ( CodingQuarry trains a GHMM for this.)

Error: invalid literal for int() with base 10

Hi Richard,

I am getting an error when running TAMA Collapse on public data (unpolished FLNC):

Traceback (most recent call last): File "/home/bin/tama/tama_collapse.py", line 4447, in <module> map_seq_length = mapped_seq_length(cigar) File "/home/bin/tama/tama_collapse.py", line 360, in mapped_seq_length map_seq_length = map_seq_length + int(cig_dig_list[i]) ValueError: invalid literal for int() with base 10: '11=1'

Any advice? I used minimap2 with recommended flags and it always worked for me before.
Thanks!
Ksenia

tama_cds_regions_bed_add issue

Hi,

I've reached the final step of the tama GO pipeline and encountered an error:

Traceback (most recent call last):
  File "/tama/tama_go/orf_nmd_predictions/tama_cds_regions_bed_add.py", line 140, in <module>
    block_start_list = line_split[11].split(",")
IndexError: list index out of range

The output from the blastp parser in the previous step looks as it should as far as I can tell (I've attached the file: parser_all.txt), and the bed/fasta files I'm using haven't been changed since I've started the pipeline.

This was my command:
python2 /tama/tama_go/orf_nmd_predictions/tama_cds_regions_bed_add.py -p parsed_blastp_output -a l_tama_bed_file.bed -f l.fasta -o l_final_tama.bed

Any help would be appreciated!

Thanks,
Emma

TAMA and cDNA Cupcake

Good afternoon,
For my PAC-BIO sequencing I performed the 5' cap selection step, therefore I am currently interested in using the TAMA Collapse.
But I also want to use the SQANTI2 for classification and filtering of the isoforms. https://github.com/Magdoll/SQANTI2
In the SQANTI2, cDNA cupcake is used for collapsing https://github.com/Magdoll/cDNA_Cupcake/wiki#install
https://github.com/Magdoll/cDNA_Cupcake/wiki/Cupcake:-supporting-scripts-for-Iso-Seq-after-clustering-step
However for me is not clear if it supports 5' cap selected libraries.
Is it more advised to run the SQANTI2 as it is or to use TAMA Collapse and then use the result to run SQANTI2?
Thank you
Beatriz

stops/freezes at high coverage regions

Hi Richard,

I am sorry to bother you again. I have an another problem. When I am using a larger data set (~5M reads) tama just stops/freezes when it reaches highly covered regions. High coverage means 30000 reads aligned on to a gene coding site. I split the alignment (sam) with you "tama_mapped_sam_splitter.py" into 10 pieces but this do not resolved the problem. When I tried to collapse the piece which contains the highly covered gene region the tama just freezes when arrived to that region.
What do you suggest? How can I resolve this problem? Should I cut the alignment (sam) file into smaller pieces?

Thanks for your help!
Cheers,
Botond

tama_collapse error when writing variant file (IndexError: list index out of range)

Hi Rik,

First of all, thanks for providing this wonderful tool. It is exactly what I am looking for, now I only have to get it running ;)

I am doing tama_collapse.py on a GMAP mapping file. The tool runs fine and creates nice outputs. However, it stops during creating the *_variants.txt file, and returns the following error:

Writing variant file
Traceback (most recent call last):
File "/home/martin/tama/tama_collapse.py", line 6216, in
ref_allele = fasta_dict[scaffold][var_pos]
IndexError: list index out of range

(Python 2.7.15; Biopython 1.76; Ubuntu 18.04)

I would highly appreciate any ideas.

Best,
Martin

How it is possible to split gene models created from adjacent, slightly overlapping transcripts?

Hi Richard,
You mentioned in one of your answer that it is possible to split the gene models created from adjacent, slightly overlapping transcripts. Can you explain it in a more detail? I am really curious about it!
Thanks a lot!
Cheers,
Botond

tama merge -s error

Hi Richard,

I ran into an error when running tama merge using a refseq annotation file as source for gene/transcript ID with the -s option.

Traceback (most recent call last):
File "/project/devil_lp18/yuanyuan/tama/tama_merge.py", line 3595, in
total_gene_count = process_trans_group(group_trans_list,total_gene_count)
File "/project/devil_lp18/yuanyuan/tama/tama_merge.py", line 3408, in process_trans_group
this_source_gene_id = source_trans_gene_dict[this_source_name][this_source_trans_id]##################################################
KeyError: 'XM'

It probably has something to do with refseq's transcript ID format, which contains an underscore between the accession prefix and number?

Thanks,
Yuanyuan

Meaning of variant type 'S'

Many thanks for this wonderful program - it's saved our project!

I just have a query about the *variants.txt output of TAMA merge. I can't work out what variant type 'S' is. It generally seems to happen towards the end of reads, so something to do with soft-clipping??

For example, if I look at this line from the variants file

000000F_arrow   666056  S       C       7       158     P:86414:216:840:60,A:16022:42:1132:60,P:47592:827:1234:60,A:17843:3:1152:46,P:67241:39:1057:60,B:79276:31:507:60,P:75873:21:936:60

And look at this position on the genome browser, then the position 666056 is the penultimate base of read P:86414:216:840:60. The read sequence from 666055-666057 is CAC and the genome sequence is also CAC. So what is the variant that's being reported?

Any insights would be much appreciated!
Alastair

KeyError: 73 mapped_flag = sam_flag_dict[sam_flag]

Hi,

I was trying to use tama collapse on my isoseq data and it keeps giving errors. The error I get is:

tc_version_date_2020_07_18
Default collapse exon ends flag will be used: common_ends
Default coverage: 99
Default identity: 85
Default identity calculation method: ident_cov
Default 5 prime threshold: 10
Default exon/splice junction threshold: 10
Default 3 prime threshold: 10
Default duplicate merge flag: merge_dup
Default splice junction priority: no_priority
Default splice junction error threshold: 10
Default splice junction local density error threshold: 1000
Default simple error symbol for matches is the underscore "_" .
Using SAM format for reading in.
Default log output on
Default run mode original
Default 5 read threshold
time taken since last check: 0.0:0.0:0.0
time taken since beginning: 0.0:0.0:0.0
going through fasta
time taken since last check: 0.0002777777777777778:0.0:1.0
time taken since beginning: 0.0002777777777777778:0.0:1.0
going through sam file
Traceback (most recent call last):
File "tama/tama_collapse.py", line 5130, in
mapped_flag = sam_flag_dict[sam_flag]
KeyError: 73

when I run python tama_collapse.py -s hq.reads.sam -f cogent.fake_genome.fasta -p prefix -x capped.
I have made some changes to the tama_collapse file because there where a few errors because I am using python3 but I cant find the reason for this error. I created the sam files by minimap2 -ax sr. Do you know what the issue might be?

Thanks,

issue with format converter tama_format_gtf_to_bed12_ncbi.py

Hi Richard,

I see that you have fixed the issue already - thanks for the quick response to my email! I am just leaving this comment here in case other users using the older version of the script run into the same issue.

The issue was that columns 2 and 3 of the output bed file were showing the start and end coordinates of the first exon of a transcript, instead of the transcript start and end; and for transcripts on the reverse strand, exon relative starts in column 12 were negative.

Thanks for making the tools :)

Yuanyuan

Forking and converting to Python3

Dear @GenomeRIK ,
I have had a look to your code recently as I am looking for solutions to improve the alignment of NanoPore reads to the genome prior feeding them into our RNASeq transcript annotation and model selection pipeline, Mikado and its associated gene annotation pipeline. Our team (@swarbred, @cschuh and @gemygk) is interested in testing TAMA for this purpose.

After looking at the code, though, I would like to perform some changes (port to Python3, use PySam to read directly BAM files, and potentially others after I start working on it). Would we have your permission to do so and introduce the forked program, of course properly cited both in the code and in a possible future publication, in our workflows?

Hidden characters in shebang

I found a little bug in the python scripts tama_collapse.py and tama_merge.py,
When executing them directly I had errors such as

./tama_collapse.py
/usr/bin/env: ‘python\r’: No such file or directory

Which I did not understand as the header looks on the first glance correct:

#!/usr/bin/env python

and the execution via python2 tama_collapse.py worked correctly.
Upon inspection with od -c I found though for both - merge and collapse - the following hidden characters:

0000000   #   !   /   u   s   r   /   b   i   n   /   e   n   v       p
0000020   y   t   h   o   n  \r  \n  \r  \n
0000031

replacing this with sed 's|\r||' tama_collapse.py seems to fix that behaviour

KeyError with ONT data

I have been testing out TAMA on some 5’ capped ONT data that I have obtained.

I get the following error after a while:

Traceback (most recent call last):
File "/mnt/shared/scratch/mc42302/201903_RTD2/software/tama/tama_collapse.py", line 4572, in
group_trans_list_dict[group_count].append(read_id)
KeyError: 2340

This was with the latest version, which I downloaded today.

Command:

python /mnt/shared/scratch/mc42302/201903_RTD2/software/tama/tama_collapse.py
-s ont.fasta.sorted.filtered.sam
-f /mnt/shared/projects/barley/201906_MorexGenomeV2/genome/Barley_Morex_V2_pseudomolecules.fasta
-p ont_new_collapsed -d merge_dup -x capped -m 0 -a 0 -z 0 -sj sj_priority -lde 30 -sjt 30

See below for last part of error stream. Would you be able to help? Let me know if you need more information.

cb77f91c-3e86-40c4-9a96-552c40dc1935 16 chr1H 493924813 60 1236M * 0 0 GCAATCCTCAGACACCAGCAGTCCATCCCATCCTAGCTCCGGCGGCGATGCCGGCCGGCCAGCAGCCGCAGGCGCGGGAACCCGACGGCGGCGCCGACTCCAACCACCGCCAGCCCTCGCCGCCCCACACGCCAGCGCTGCCATCGGAGGTGGTGCCGGCCTACCCGCCGCCGGAGTCGGAGGACGACGAGTCGTGGGTGTGGACGCAGATCAAGGCGGAGGCCCGGCGCGACGCGGACGCGGAGCCGGCGCTGGCCTCCTTCCTCTACGCCACGGTGCTGTCCCACCCCTCCCTGCCCCGCTCCCTCTCCTTCCACCTCGCCAACAAGCTCTGCTCCTCCACCCTCCTCTCCACGCTCCTCTACGACCTCTTCCTCGCCTCCCTCACCGCGCACCCCTCCCTCCGCGCCGCCGTCGTCGCCGACCTCCTCGCCGCCCGCGCCCGCGACCCCGCCTGCGTCGGATTCTCCCACTGCCTCCTCAACTACAAGGGCTTCCTCGCCATCCAGGCGCACCGCGTCGCGCACGTCCTCTGGGCGCAGAACCGCCGCCCCCTCGCGCTCGCCCTCCAGTCCCGCGTCGCCGACGTCTTCGCCGTCGACATCCACCCCGCCGCCGTCGTCGGCAAGGCCATCCTCCTCGACCACGCCACCGGCGTCGTCATAGGCGAGACCGCCGTCGTCGGGGACAACGTCTCCATCCTCCACCACGTCACCCTCGGCGGGACCGGCAAGGCCGTCGGCGACCGCCACCCCAAGATTGGGGACGGGGTGCTCATAGGCGCCGGCGCCACAATCCTCGGCAACGTCATGATTGGAGCCGGGGCCAAGATTGGGGCTGGCTCCGTGGTGCTCATAGATGTGCCGGCGCGGAGCACGGCGGTGGGGAACCCTGCCAGGCTCCTCGGAGGGAGGAAGGGCGAGGCCGACAAGGACGAGGACATGCCCGGAGAGTCCATGGATCACACCTCCTTCATACGCCAGTGGTCCGACTACACCATCTGATTCTGAGGAGCCATTGTGCAAGGTCTATTACTCATCCTCTGTATCAGTAGTCGTGTTGTACTACCAAAATACACAGTGATCTTGTTTTGGTATATTGCTCACTTGTGGTTGTGGATGAACATCAACTGTAGTCTAGTGTGTATGACCAATTCCAATTGTTTCTTCAGCTGAGCGATCTTTGCTCGGATACTGATAGCAGATGATTGATGAATGAATAATCTTGTAATCCATG * NM:i:0 ms:i:1236 AS:i:1236 nn:i:0 ts:A:+ tp:A:Pcm:i:409 s1:i:1236 s2:i:260 dv:f:0.0002
100.0
100.0
Traceback (most recent call last):
File "/mnt/shared/scratch/mc42302/201903_RTD2/software/tama/tama_collapse.py", line 4589, in
group_trans_list_dict[group_count].append(read_id)
KeyError: 2340

issues with tama_remove_fragment_models.py

Hello,
I am trying several ways to filter my transcripts, and I have the following issues:

issue 1
I am using tama_remove_fragment_models.py with the output file of tama_cds_regions_bed_add.py, but the bed file is not returning the full label "G1;G1.20;Q3ZCT1;full_length;90_match;prot_ok;F2" it is only returning something like "G1;G1.20". Having the full label is useful when looking at the files in Genome Browser. Is the script only keeping "full_length;full_match"?
issue 2
With the same data mentioned above, I ran tama_remove_polya_models_levels.py, then tama_remove_fragment_models.py, and then the ORF_NMD scripts. However, while trying to run the last step with tama_cds_regions_bed_add.py. I get the error
"Mismatching lengths!
G1;G1.83 10763 8511"

Please, advise!

Thanks,

Ariadna

Can TAMA Merge preserve read counts from different samples?

Hi Richard,

Thank you for creating these really useful tools for long read RNA-seq analysis!

I'm particularly interested in comparing technical replicability between different Isoseq runs. I have both technical replicates from the same sample and biological replicates from different samples. I'd like to use TAMA to merge the different transcript models from each sample and then compare the abundances (read counts) between samples. Is this currently possible with TAMA Merge?

I've been exploring the chaining functions created by Liz Tseng as well and I can see that it's quite a difficult problem to match isoforms across datasets and it doesn't always seem to work as expected.

error using bam file with tama_collapse

Hi Richard,

There seems to be an issue with tama_collapse when bam file is provided as input file with flag -b. Please see below for the full output lines. Would you mind fixing this in the next update?

Thanks,
Yuanyuan

tc_version_date_2020_07_18
Sam file is missing
Default capped flag will be used: Capped
Default collapse exon ends flag will be used: common_ends
Default coverage: 99
Default identity: 85
Default identity calculation method: ident_cov
Default 5 prime threshold: 10
Default exon/splice junction threshold: 10
Default 3 prime threshold: 10
Default duplicate merge flag: merge_dup
Default splice junction priority: no_priority
Default splice junction error threshold: 10
Default splice junction local density error threshold: 1000
Default simple error symbol for matches is the underscore "_" .
Using BAM format for reading in.
Default log output on
Default run mode original
Default 5 read threshold
Please try again with complete arguments
Traceback (most recent call last):
File "/project/devil_lp18/yuanyuan//tama/tama_collapse.py", line 207, in
sam_file = opts.s[0]
TypeError: 'NoneType' object has no attribute 'getitem'

KeyError: 20 - mapped_flag - sam_flag_dict[sam_flag]

Good afternoon,

After aligning my hq_quality.fasta files with desalt I planned to use tama collapse
When I sorted using samtools sort I got this message, which I don't know if it is a problem or not (waiting for reply about it on desalt github).
[W::sam_parse1] mapped query cannot have zero coordinate; treated as unmapped

When I run the sorted-desalt sample on tama collapse I got this error
sam count 175000
sam count 180000
Traceback (most recent call last):
File "tama_collapse.py", line 4406, in
mapped_flag = sam_flag_dict[sam_flag]
KeyError: 20

I don't get neither of this problems when I use gmap for example
Is there anyway to solve this problem?

Genome seq error

Hi Rik,
I have tried out recently tama with 5’-capped ONT reads. The reads were mapped to the reference genome with desalt mapper. When I run the tama_collapse.py script the script started to process the reads but after a short time It gave me this message: Genome seq is not the same length as query seq. Can you help me to figure out what could be the problem?
Thanks for any help!

Used command: python tama/tama_collapse.py -s alignment.bam -b BAM -f ref_genome.fasta -p strain_name -x capped -icm ident_map

tama_merge filelist error

I am trying to assemble a transcriptome for my non-model plant species. I would like to perform a RNA-seq study afterwards. I multiplexed 2 samples in a Sequel II 8M, and I was able to conclude the isoseq3 + cogent pipelines. These 2 samples are from different plant individuals, but both from leaf tissue, so I think it makes sense to merge them first as a single transcriptome before I conduct the RNA-seq study.

I saw your presentation at one of the UC Davis workshops where you present about TAMA, and I think tama_merge is the tool I am looking for to merge the two assembled transcriptomes.

I am getting the following error when using tama_merge.py:

(anaCogent) [Linux@helens-vm1 TAMA_Merge]$ python ../tama/tama_merge.py -f filelist.txt -p merged_transcriptome -d merge_dup
Default collapse exon ends flag will be used: common_ends
Default 5 prime threshold: 20
Default exon end/splice junction threshold: 10
Default 3 prime threshold: 20
Default source ID merge flag: no_source_id
Default CDS merge flag: no_cds
opening file list
Incorrect seq types given in filelist file. Please use capped or no_cap

And this is how my filelist looks like:

file_name cap_flag merge_priority(start,junctions,end) source_name
hq_transcripts.fasta.no5merge.collapsed.filtered_G10.bed capped 1,1,1 caplib
hq_transcripts.fasta.no5merge.collapsed.filtered_P1.bed no_cap 2,1,1 nocaplib

My questions:

Do you see any typo in the filelist? It looks like the exampled in the wiki page.
The input files were converted from the gff file after collapsing cogent files and filtering out transcripts with 5' degradation. Am I doing this right?
After merging, how do I combine the two fasta files from the two different transcriptome assemblies so that I can map my RNA-seq reads?

Thank you for the help!
Caio

Help for explaination to PolyA results from TAMA collapse

Hi @GenomeRIK,
I first mapped high-quality FLNC reads to the genome and got sam file for next running TAMA collapse.
I'm a little confused with PolyA results. if you collapsed some similar isoforms as one isoform, like G3959.1(see below screenshot), Why there are some different polyA sequence for G3959.1? Isn't one isoform one polyA sequence?
By the way, I compared the polyA results got from TAPIS and TAMA, the number of polyA is totally different, TAMA was far less than TAPIS. Could you explain the principle of this program for identifying polyA? or which software do you recommend? I'd really appreciated that.
ling

color of transcripts in merge bed file?

What do the transcript colors represent for the TAMA collapse and merge output bed files? I can't seem to find it explained in the wiki. Thanks!

Key error

Hi ,

I'm trying to run the collapse script of pacbio data mapped with map but I keep getting this error and haven't figured out why.

Traceback (most recent call last):
File "tama_collapse.py", line 2426, in
[h_count,s_count,i_count,d_count,mis_count,nomatch_dict] = calc_error_rate(start_pos,cigar,seq_list,scaff_name,read_id)
File "tama_collapse.py", line 409, in calc_error_rate
genome_slice = fasta_dict[scaffold][genome_start:genome_end]
KeyError: 'chr1'

Tama Merge colour on UCSC Genome Browser

Hello Richard,

I really like the feature with the colouring of the transcripts by source from TAMA Merge. However, this only seems to only appear on IGV, and not when I upload the bed file on UCSC Genome Browser.

Is there a way that the colours can show on the Genome Browser instead?

Thanks!
Best wishes,
Szi Kay

How to get "primary genes"

Hi,

First, thank you for the tool, it is a pleasure to use with both ONT and PB data.

I have generated ~29,000 genes with ~130k transcripts in total from multiple datasets using the TAMA pipeline, including ORF/NMD prediction. Now I have the bed (and gtf) files for the annotations and for most purposes (RNA-seq mapping, promoter analysis) it is perfect.

However, I would like to have a "primary transcript" list, both mRNA and CDS/AA for e.g. gene family tree building. I searched every corner of the internet, and although it sounds easy, there is no definitive answer or tool. The closest I got was using gffread with options -M -K, but it still did not merge all isoforms. I just wonder if you have any solutions for this, it may be helpful for other people, too to have readily available in tama-go (e.g. longest isoform, most abundant isoform or based on the ORF prediction).

Cheers,
Gergo

Can 'tama_flnc_polya_cleanup.py' input/output FASTQ or BAM format?

I am analysing Iso-Seq libraries produced with Lexogen's Teloprime v2. In the process, I've taken full advantage of Richard Kuo's suggestion in Twitter to employ "tama_flnc_polya_cleanup.py" instead of "isoseq3 refine --require-polya --min-polya-length 12 ...". In fact, I recover 82-99% more FLNC reads with the TAMA script than with the isoseq3 --require-polya approach. Big thanks for that!

Unfortunately, my polyA-trimmed transcripts are only available in FASTA format (not even FASTQ), and that limits my possibilities for downstream application (e.g., isoseq3 cluster, isONclust).

I was wondering if the "tama_flnc_polya_cleanup.py" could input/output a different format (i.e., FASTQ, BAM) so that the quality information is preserved. Alternatively, a convertor could be added since the BQ info is in the BAM from the "isoseq3 refine" step.

Kind regards,

Fernando

tama_merge.py collapse_transcript bug for uneven exon counts

Hi,

I've encountered a bug in the collapse_transcript in the section that starts with the following code:
for i in xrange(max_exon_num): #go from 3 prime end

A downstream KeyError will occur when accessing the results returned from the lists created by the following two lines because best_e_start and best_e_end are not guaranteed to get set to a real start or end, they may stay as their originally initialized -1 or 0:
collapse_start_list.append(best_e_start)
collapse_end_list.append(best_e_end)
The issue arises when using varied weights for different gene model sets.

Imagine that you are merging two transcripts in the forward strand (though the real life example I've explored has been in the minus strand):

Transcript	StartPriority	EndPriority	JxnPriority	Exon Chain
A	1	2	1	(20, 30), (40, 50)
B	2	1	1	(20,30), (40-50), (60-70)

What gets reported is something like:
Starts: 20, 40, 60
Ends: 30, 50, 0
Those are then subsequently sorted and when the calling function tries to look up a value for -1 or 0, it doesn't exist in the dictionary and the program dies.

I'll work around it for now by just setting all the priorities to be equal so I can take advantage of the voting by junction counts.

Hope this is enough to figure out the issue.

tama_collapse.py --> retuning error: no groups found! [input sorted.bam from minimap2 using HQ polished reads from isoseq3]

Hi am running TAMA collapse using as input a sorted bam file (output of minimap2, using as input HQ polished reads from isoseq3) and I get the following error:

Command line

tama_collapse.py -s infile.bam -b BAM -f refGenome.fa -p myPrefix -x capped

Output

tc_version_date_2020_07_18
Default collapse exon ends flag will be used: common_ends
Default coverage: 99
Default identity: 85
Default identity calculation method: ident_cov
Default 5 prime threshold: 10
Default exon/splice junction threshold: 10
Default 3 prime threshold: 10
Default duplicate merge flag: merge_dup
Default splice junction priority: no_priority
Default splice junction error threshold: 10
Default splice junction local density error threshold: 1000
Default simple error symbol for matches is the underscore "_" .
Using BAM format for reading in.
Default log output on
Default run mode original
Default 5 read threshold
time taken since last check: 0:0:0
time taken since beginning: 0:0:0
going through fasta
time taken since last check: 0:0:44
time taken since beginning: 0:0:44
going through sam file
1
time taken since last check: 0:0:10
time taken since beginning: 0:0:55
going through groups: 0
Error, no groups found!

I am also running the same command but using another input file (a sorted bam file that is the output of minimap2, using as input FLNC reads from isoseq3) and is running without a problem.
Any advice?

Thanks!

Odd output from Tama Merge

Hello Richard,

I'm trying to merge 12 of my Iso-Seq samples using TAMA Merge (admittedly, from Cupcake collapse - I have generated the bed12 file from Cupcake's gtf and feed that into TAMA Merge).

The script runs (very quickly, within a minute!), and finishes with "TAMA Merge has completed successfully!" but the files for the gene_report.txt and trans_report.txt look very odd and appears to be missing columns.

python tama_merge.py -f filelist.txt -a 50 -z 50 -m 20 -p Individual &> Tama_merge.log
(last updated tama_merge.py script: 2020/07/21)

The output bed file seems to be correctly labelled though:

Attached is my filelist and log file.

Any guidance on this would be greatly appreciated,

Thank you,
Szi Kay

Individual_Tama_filelist.txt
Tama_merge.log

blastp output format expected

Hi,

for the tama_orf_blastp_parser.py which blastp output format is expected?

Thank you,

how to predict NMD

Dear author:
Can you explain how to predict the NMDs ? Is there have a criterion to predict ,besides,how to explain the results

Improve annotation

Hi Richard, may you help me with this issue?

I have my bed file annotated using TAMA GO: ORF and NMD predictions, Uniprot_100 DB and diamond BLASTP. However, different isoforms from the same gene_id have different Unirprot IDs:

gene_id	isoform	gene	trans_length_flag	blastp_match_flag
G5	G5.1	A0A445DLR5	5prime_degrade	bad_match
G5	G5.2	A0A4P1R3Y6	full_length	bad_match
G5	G5.3	A0A151T2D9	full_length	bad_match
G5	G5.4	A0A2G3AYH1	full_length	bad_match
G5	G5.5	A0A2G3AYH1	full_length	bad_match
G5	G5.6	UPI0010A4D799	full_length	90_match
G5	G5.7	UPI0010A4D799	full_length	90_match
G5	G5.8	A0A445DLR5	full_length	bad_match
G5	G5.9	A0A445DLR5	full_length	bad_match
G6	G6.1	A0A314Z1D9	5prime_degrade	bad_match
G6	G6.2	A0A314Z1D9	5prime_degrade	bad_match
G6	G6.3	A0A224X4A1	5prime_degrade	50_match
G6	G6.4	A0A1R3HQH4	5prime_degrade	90_match

I want to use something like ifelse to unify gene annotation based on the highest blastp_match_flag. i.e. annotate all isoforms G6 as A0A1R3HQH4. Is that possible?

Best,

Cesar

genomerik / tama Goto Github PK

tama's People

Contributors

Stargazers

Watchers

Forkers

tama's Issues

Command line

Output

Recommend Projects

Recommend Topics

Recommend Org