Giter VIP home page Giter VIP logo

Comments (6)

dewyman avatar dewyman commented on August 18, 2024

Why does transcript c1417/f31p12/2759 get a score of 5 in MatchAnnot, but is not among the transcripts with internal exon matches in TALON? [FIXED]

TALON results (which don't look right to me):

  • Total transcripts processed: 1881
  • Number of transcripts with internal exon matches (MatchAnnot score 5 equivalent): 496
  1. Make a SAM file containing only this transcript
grep "c1417/f31p12/2759" test_files/sorted.GM12878_chr1_canonicalOnly.sam > test_files/c1417-f31p12-2759.sam
  1. Run TALON
python talon.py --f test_files/c1417-f31p12-2759.sam -g test_files/gencode.v24.annotation.chr1.gtf

[FIXED] I noticed that the problem is another GTF reading bug. For transcripts on the minus strand, the exons are listed (and numbered) in their 5' to 3' orientation rather than being listed with respect to the forward strand. This is creating problems in my exon string process.

from talon.

dewyman avatar dewyman commented on August 18, 2024

Does TALON recover all score 5 events from Tofu-STAR-MatchAnnot? [ANSWERED] Almost

  1. Collect all transcript IDs that scored a MatchAnnot 5 in one file:
MatchAnnot_file=test_files/Tofu_STAR_MatchAnnot/score_5_transcripts.txt
grep "result" test_files/Tofu_STAR_MatchAnnot/match_annot_results.txt | grep "sc: 5" | awk '{print $2}' | awk -F'|' '{print $NF}' | sort -u > $MatchAnnot_file
  1. Collect all transcripts with internal exon matches (MatchAnnot score 5 equivalent) from TALON:
TALON_file=test_files/permissive_3_5_transcripts.txt
python talon.py --f test_files/sorted.GM12878_chr1_canonicalOnly.sam -g test_files/gencode.v24.annotation.chr1.gtf | sort > $TALON_file
  1. Get number of lines in MatchAnnot file that are not in TALON file:
MatchAnnot_file=test_files/Tofu_STAR_MatchAnnot/score_5_transcripts.txt
TALON_file=test_files/permissive_3_5_transcripts.txt
comm -23 $MatchAnnot_file $TALON_file | wc -l

The result was 13 lines (as of commit dewyman@568667cc4b1eaf59fc7b7eebd302756d5fafb1e), which means that TALON did not quite cover all of the cases that MatchAnnot did. Here are the IDs:
c11484/f2p30/3345
c13754/f2p1/1380
c14744/f1p2/3017
c14773/f1p0/3486
c18234/f1p0/3056
c19214/f1p1/3120
c22984/f1p5/3378
c23761/f1p0/2920
c26076/f2p16/2864
c26372/f1p0/3232
c29587/f4p2/3479
c31433/f1p3/3125
c33225/f1p0/3407

from talon.

dewyman avatar dewyman commented on August 18, 2024

Comparing TALON to direct MatchAnnot (no Tofu-STAR): Do they give the same results for score 5 transcripts?

  1. Run TALON and save internal match transcripts to file (same as previous)
TALON_file=test_files/permissive_3_5_transcripts.txt
python talon.py --f test_files/sorted.GM12878_chr1_canonicalOnly.sam -g test_files/gencode.v24.annotation.chr1.gtf | sort > $TALON_file
  1. Run MatchAnnot directly on the same data that TALON ran on
MatchAnnot_out=test_files/MatchAnnot_direct/match_annot_results.txt
MatchAnnot_file=test_files/MatchAnnot_direct/score_5_transcripts.txt
/bio/dwyman/pacbio_f2016/code/MatchAnnot-master/matchAnnot.py --format alt \
--gtf /bio/dwyman/pacbio_f2016/data/GENCODEv24/gencode.v24.annotation.gtf \
 test_files/sorted.GM12878_chr1_canonicalOnly.sam > \
$MatchAnnot_out
grep "result" $MatchAnnot_out | grep "sc: 5" | awk '{print $2}' | awk -F'|' '{print $NF}' | sort -u > $MatchAnnot_file
  1. Get number of lines in MatchAnnot file that are not in TALON file
comm -23 $MatchAnnot_file $TALON_file | wc -l

The result is 15 lines that are in MatchAnnot but not TALON, and 4 lines that are in TALON but not MatchAnnot. I should look into these cases further.
c11484/f2p30/3345
c13754/f2p1/1380
c14744/f1p2/3017
c14773/f1p0/3486
c18234/f1p0/3056
c19214/f1p1/3120
c2229/f33p31/3156
c22984/f1p5/3378
c23761/f1p0/2920
c26076/f2p16/2864
c26372/f1p0/3232
c29587/f4p2/3479
c31433/f1p3/3125
c33225/f1p0/3407
c34432/f1p6/3188

Let's start by investigating the very first one, c11484/f2p30/3345. The MatchAnnot entry for it is:

result:   c11484/f2p30/3345    ALDH4A1    ALDH4A1-001    ex: 15  sc: 5  5-3:    -2    10

When I print debugging messages for this transcript in TALON, I find out that it was assigned to a different gene, RP13-279N23.2. This strikes me as odd and probably wrong. When I dig deeper, I find that c11484/f2p30/3345 overlapped the following genes, with the amount listed in parantheses.
gene_matches:

  • MIR1290 (78)
  • RP13-279N23.2 (31338)
  • ALDH4A1 (31338)
  • MIR4695 (74)
    We can see that there was a tie between ALDH4A1 and RP13-279N23.2. Here are the gene annotations from GENCODE v24:
chr1	HAVANA	gene	18871430	18902781	.	-	.gene_id "ENSG00000159423.16"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "ALDH4A1"; level 1; tag "ncRNA_host"; havana_gene "OTTHUMG00000002443.6";

chr1	HAVANA	gene	18849273	18921121	.	-	.gene_id "ENSG00000255275.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP13-279N23.2"; level 2; tag "ncRNA_host"; havana_gene "OTTHUMG00000167131.3";

This suggests to me that in the rare event of a transcript having the same amount of overlap with 2 genes, it is not sufficient to simply return one of them as I am doing. I checked on the first three cases, on the list and they all come from the same tied gene situation. Clearly this is a problem. I changed the gene fetching function to return more than one gene in case of ties, and repeated the comparison. This time, I got the following 6 matches in MatchAnnot but not in TALON (a subset of the last list):
c14773/f1p0/3486: MIR4425
c18234/f1p0/3056: NAP1L4P1
c22984/f1p5/3378: MIR34A
c23761/f1p0/2920: RP11-61J19.5
c26372/f1p0/3232: ZNF670
c33225/f1p0/3407: PHTF1
To investigate, I start again with the first transcript, c14773/f1p0/3486.

MatchAnnot entry:

result:   c14773/f1p0/3486    MIR4425    MIR4425-201    ex:  1  sc: 5 rev  5-3:     5 -3402

Notice that this is a microRNA gene. How long is the transcript? Did we perhaps filter it out with our 200 bp requirement? Checking the length, this does NOT appear to be the case.

grep "c14773/f1p0/3486" test_files/sorted.GM12878_chr1_canonicalOnly.sam | awk '{print length($10)}'
3486

Looking at this area in the UCSC genome browser, I noticed something really interesting that points to a problem with MatchAnnot. MIR4425 is the only gene in the neighborhood here, and it intersects with c14773/f1p0/3486. Since there is only one "exon", MatchAnnot classifies this match as a score 5 because "only the 3' and 5' ends are different", which is a bad move. The correct move would be to create a novel gene.

I found a second problem with MatchAnnot when I examined example c23761/f1p0/2920. MatchAnnot assign the transcript to RP11-61J19.5, but the gene and the transcript are on opposite strands. So I'm comfortable with the fact that TALON does not find a gene match for c23761/f1p0/2920.

from talon.

dewyman avatar dewyman commented on August 18, 2024

Update: Comparing TALON and MatchAnnot after exon-based redesign of TALON (as of commit dewyman@87b67d87606d9a70852b2533e8e5aa2fde7fa931)

TALON results:

Total transcripts processed: 1881
Number of transcripts with internal exon matches (MatchAnnot score 5 equivalent): 964

Transcripts that matched in MatchAnnot but not in TALON:

There are only three discrepancies.

c14773/f1p0/3486

MatchAnnot assigns this single-exon minus strand transcript to MIR4425-001, and TALON finds no transcript match. The reason for this is that MIR4425 is on the plus strand. This discrepancy is acceptable.

c23761/f1p0/2920

MatchAnnot assigns this single-exon plus strand transcript to RP11-61J19.5-001, while TALON reports no match. Again, the reason is a strand discrepancy, so this is ok.

c33225/f1p0/3407

This is a strand discrepancy that was documented in issue #19. This is ok.

Transcripts that matched in TALON but not in MatchAnnot:

There were zero cases like this.

from talon.

dewyman avatar dewyman commented on August 18, 2024

TALON and MatchAnnot find "score 5" matches for virtually the same set of transcripts. But if we look at the transcript assignments themselves, are they the same? (as of commit dewyman@87b67d87606d9a70852b2533e8e5aa2fde7fa931) [ANSWERED] Yes, except for multi-matching TALON cases (tiebreaker needs to be implemented)

Note: I have seen cases so far where TALON returns more than one transcript match since I haven't implemented a tiebreaker yet. I won't worry too much about this for now.

  1. Run TALON and save internal match transcripts to file (same as previous)
TALON_file=test_files/permissive_3_5_transcripts.txt
python talon.py --f test_files/sorted.GM12878_chr1_canonicalOnly.sam -g test_files/gencode.v24.annotation.chr1.gtf | sort -n -k1,1 > $TALON_file
  1. Run MatchAnnot directly on the same data that TALON ran on
MatchAnnot_out=test_files/MatchAnnot_direct/match_annot_results.txt
MA_transcript_file=test_files/MatchAnnot_direct/score_5_transcripts_and_assignment.txt
/bio/dwyman/pacbio_f2016/code/MatchAnnot-master/matchAnnot.py --format alt \
--gtf /bio/dwyman/pacbio_f2016/data/GENCODEv24/gencode.v24.annotation.gtf \
 test_files/sorted.GM12878_chr1_canonicalOnly.sam > \
$MatchAnnot_out
grep "result" $MatchAnnot_out | grep "sc: 5" | awk '{print $2,$4}' | sort -n -k1,1 > $MA_transcript_file
  1. Get number of lines in MatchAnnot file that are not in TALON file
comm -23 $MA_transcript_file $TALON_file | wc -l

Assignments in MatchAnnot that were not in TALON:

c10364/f4p17/3199 TOR1AIP1-015: TALON assigned TOR1AIP1-015,TOR1AIP1-004
c12687/f3p2/3317 RBM15-004: TALON assigned RBM15-001,RBM15-201,RBM15-004
c13055/f1p27/3190 LRRC41-201: TALON assigned LRRC41-201,LRRC41-001,LRRC41-006
c14773/f1p0/3486 MIR4425-201 This was expected
c15401/f82p47/2854 LRRC41-006: TALON assigned LRRC41-201,LRRC41-001,LRRC41-006
c15529/f16p41/2957 LRRC41-201: multiple assignment again
c17240/f4p6/2673 TCEB3-003: multiple assignment
c17998/f1p0/2984 HIST2H2AA4-001: multiple assignment
c17998/f1p0/2984 HIST2H3C-001: multiple assignment
c2355/f16p17/3301 TOR1AIP1-015: multiple assignment
c23761/f1p0/2920 RP11-61J19.5-001 This was expected
c24723/f2p5/4046 ATPAF1-001: multiple assignment
c26869/f1p8/3622 RSBN1-004: multiple assignment
c30828/f2p12/2747 LRRC41-006: multiple assignment
c33225/f1p0/3407 PHTF1-004 This was expected
c36587/f1p0/3228 JMJD4-006: multiple assignment
c8235/f3p8/2930 RSBN1-004: multiple assignment

All of these cases (except for the three enumerated previously) were situations where TALON matched the query to more than one transcript, including the MatchAnnot answer within this set. So that is fine for now. Once I write a tiebreaker for TALON, these discrepancies should disappear.

Assignments in TALON that were not in MatchAnnot:

When I check this, I find only multi-match cases from the previous section. That's a pass

from talon.

dewyman avatar dewyman commented on August 18, 2024

Comparing TALON and MatchAnnot again after tiebreaker implementation for full matches (as of commit dewyman@728805b3c5e04650030b6eeae608e3c441dfea8)

  1. Run TALON and save internal match transcripts to file (same as previous)
TALON_file=test_files/permissive_3_5_transcripts.txt
python talon.py --f test_files/sorted.GM12878_chr1_canonicalOnly.sam -g test_files/gencode.v24.annotation.chr1.gtf --d test --o test
grep -v "dataset" test | grep "known" | awk '{print $2,$10}' | sort -n -k1,1  > $TALON_file
  1. Run MatchAnnot directly on the same data that TALON ran on
MatchAnnot_out=test_files/MatchAnnot_direct/match_annot_results.txt
MA_transcript_file=test_files/MatchAnnot_direct/score_5_transcripts_and_assignment.txt
/bio/dwyman/pacbio_f2016/code/MatchAnnot-master/matchAnnot.py --format alt \
--gtf /bio/dwyman/pacbio_f2016/data/GENCODEv24/gencode.v24.annotation.gtf \
 test_files/sorted.GM12878_chr1_canonicalOnly.sam > \
$MatchAnnot_out
grep "result" $MatchAnnot_out | grep "sc: 5" | awk '{print $2,$4}' | sort -n -k1,1 > $MA_transcript_file
  1. Get number of lines in MatchAnnot file that are not in TALON file
comm -23 $MA_transcript_file $TALON_file | wc -l

Lines that are in MatchAnnot file but not TALON

c14773/f1p0/3486 MIR4425-201
c17998/f1p0/2984 HIST2H2AA4-001
c17998/f1p0/2984 HIST2H3C-001
c23761/f1p0/2920 RP11-61J19.5-001
c33225/f1p0/3407 PHTF1-004

c17998/f1p0/2984 comes up twice here. TALON assigns it to HIST2H3A-001. HIST2H3A-001 and HIST2H2AA4-001 both seem like reasonable choices, and the former minimizes the total 3' and 5' difference, so I'm not going to worry too much about this.

Lines that are in TALON but not MatchAnnot

Only one:
c17998/f1p0/2984 HIST2H3A-001

from talon.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.