TALON seems to be stuck, or at least I have no idea what it is doing. [talon, CLOSED]

callumparr commented on July 18, 2024

TALON seems to be stuck, or at least I have no idea what it is doing.

from talon.

Comments (11)

callumparr commented on July 18, 2024

Ah OK, so it was doing something, but when it started to update the database it hit many errors.

[ 2023-06-11 16:51:28 ] All jobs complete. Starting database update.
[ 2023-06-11 17:20:18 ] Validating database........
Database counter for 'genes' does not match the number of entries in the table. Discarding changes to database and exiting...
table_count: 153594
counter_value: 172923
Database counter for 'transcripts' does not match the number of entries in the table. Discarding changes to database and exiting...
table_count: 1644012
counter_value: 2021938
Database counter for 'location' does not match the number of entries in the table. Discarding changes to database and exiting...
table_count: 2116074
counter_value: 2397222
Database counter for 'edge' does not match the number of entries in the table. Discarding changes to database and exiting...
table_count: 3020322
counter_value: 3559795
Database counter for 'observed' does not match the number of entries in the table. Discarding changes to database and exiting...
table_count: 164632280
counter_value: 173941741
Traceback (most recent call last):
  File "/home/callum/miniconda3/bin/talon", line 33, in <module>
  File "/home/callum/miniconda3/lib/python3.6/site-packages/talon/talon.py", line 2464, in main
    end_support = parse_custom_SAM_tags(sam_record)
  File "/home/callum/miniconda3/lib/python3.6/site-packages/talon/talon.py", line 1781, in update_database
    # get overlap and compare
  File "/home/callum/miniconda3/lib/python3.6/site-packages/talon/talon.py", line 2095, in check_database_integrity
    except Exception as e:
RuntimeError: Discrepancy found in database. Discarding changes to database and exiting...
/analysisdata/fantom6/Interactome/ONT-CAGE_TALON_dorado/scripts/talon.sh: line 22: rep1: command not found
/tmp/F6_interactome_neurogenesis_QC.log:         74.5% -- replaced with /tmp/F6_interactome_neurogenesis_QC.log.gz
gzip: /tmp/*talon_read_annot.tsv: No such file or directory
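
The validation messages above compare a stored counter with the number of rows in each table. A minimal sketch of that kind of check, useful for inspecting a database by hand, is shown below. It assumes the counters live in a table named `counters` with `category` and `count` columns; that schema and the helper name `find_counter_mismatches` are assumptions for illustration, not TALON's own code.

```python
import sqlite3


def find_counter_mismatches(conn, tables, counter_table="counters"):
    """Return {table: (row_count, counter_value)} for tables whose counter disagrees.

    Assumes a counter table with 'category' and 'count' columns (hypothetical schema).
    """
    mismatches = {}
    for table in tables:
        # Table names cannot be bound as SQL parameters, so pass trusted names only.
        row_count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        row = conn.execute(
            f'SELECT "count" FROM "{counter_table}" WHERE category = ?', (table,)
        ).fetchone()
        counter_value = row[0] if row else None
        if counter_value != row_count:
            mismatches[table] = (row_count, counter_value)
    return mismatches
```

Calling it with `sqlite3.connect("talon.db")` and the table names from the log (`genes`, `transcripts`, `location`, `edge`, `observed`) would list any table whose counter and row count differ, provided the schema assumption holds.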

fairliereese commented on July 18, 2024

Hey, my suggestion when dealing with this much data is to run TALON sequentially. I have had luck with running it on 100s of millions of reads if I run ~40 million reads at a time.
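
One way to plan such sequential runs is to group the datasets into batches that stay under a read budget, then run TALON once per batch against the same database. The sketch below only does the grouping; the function name `plan_batches`, the dataset names, and the read counts are hypothetical, and the 40 million default simply mirrors the figure mentioned above.

```python
def plan_batches(read_counts, max_reads_per_batch=40_000_000):
    """Greedily group datasets into consecutive batches under a read budget.

    read_counts maps dataset name -> number of reads; insertion order is kept.
    A single dataset larger than the budget ends up in a batch of its own.
    """
    batches, current, current_total = [], [], 0
    for name, n_reads in read_counts.items():
        # Start a new batch when adding this dataset would exceed the budget.
        if current and current_total + n_reads > max_reads_per_batch:
            batches.append(current)
            current, current_total = [], 0
        current.append(name)
        current_total += n_reads
    if current:
        batches.append(current)
    return batches
```

Each returned batch would correspond to one TALON run, with the runs executed one after another rather than in parallel.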

callumparr commented on July 18, 2024

Hi @fairliereese thanks for the reply!

I am trying to get all samples (context) in at once, so I now load the data per chr to reduce how much TALON has to handle at a time. That basically means running TALON 25 times to cover the major chr contigs. I hope this doesn't break some logic in how TALON works.

callumparr commented on July 18, 2024

@fairliereese

To speed up the database generation I took two tacks, both of which involved splitting all sample alignments by chromosome and running them either a) sequentially into the same database, one chr contig at a time, or b) in parallel, creating a database for each chr and adding a prefix to the TALON IDs. The latter is obviously faster for generating all the annotations, but it means a lot of downstream work handling the different talon.db files. Given that each one uses the same hg38 build and GENCODE v39 annotation at initialization, is it possible to merge these into one database? There would be overlap for the initial GENCODE annotations, since a database was initialized for each chr.

fairliereese commented on July 18, 2024

Actually splitting by chromosome will not really help with speeding up because TALON already tries to do this in order to parallelize. It splits the input BAM files into non-overlapping genomic segments which often just end up splitting by chromosome. So by splitting data up this way you won't really be getting a speed benefit.

Currently there is no way within TALON to merge transcripts from separate databases. There are, however, other tools that we have developed that accomplish this. See my library Cerberus (https://github.com/mortazavilab/cerberus), which harmonizes transcriptome annotations to use a unified set of coordinates. As a note of caution, transcriptome merging typically involves introducing flexibility at the 5' and 3' ends, as we can't really rely on exact matching across transcripts as we can for things like splice sites. If you're interested in using Cerberus I can try to work with you to do that. I've used it successfully on output from multiple TALON databases and have a lot of code lying around that might help you.
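
To illustrate the idea of non-overlapping genomic segments mentioned above, here is a minimal sketch that merges overlapping read intervals per chromosome. This is not TALON's implementation; the function name and the `(chrom, start, end)` input format are assumptions for illustration. With dense coverage the merged segments grow until there are few per chromosome, which is consistent with the splits often landing on chromosome boundaries.

```python
from collections import defaultdict


def non_overlapping_segments(reads):
    """Merge overlapping (chrom, start, end) intervals into segments per chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in reads:
        by_chrom[chrom].append((start, end))

    segments = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= merged[-1][1]:
                # Overlaps the current segment: extend it.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                # Gap found: start a new independent segment.
                merged.append([start, end])
        segments[chrom] = [tuple(seg) for seg in merged]
    return segments
```

Each resulting segment can be processed independently, since no read spans two segments.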

callumparr commented on July 18, 2024

> Actually splitting by chromosome will not really help with speeding up because TALON already tries to do this in order to parallelize. It splits the input BAM files into non-overlapping genomic segments which often just end up splitting by chromosome. So by splitting data up this way you won't really be getting a speed benefit.
>
> Currently there is no way within TALON to merge transcripts from separate databases. There are, however, other tools that we have developed that accomplish this. See my library Cerberus, which harmonizes transcriptome annotations to use a unified set of coordinates. As a note of caution transcriptome merging typically involves introducing flexibility at the 5' and 3' ends, as we can't really rely on exact matching across transcripts as we can for things like splice sites. If you're interested in using Cerberus I can try to work with you to do that. I've used it successfully on output from multiple TALON databases and have a lot of code lying around that might help you.

Yes, I am realizing that separating by chr and running sequentially doesn't make sense, since this is exactly where TALON's parallelization comes from.

Running and outputting a .db for each chr is very quick and at the moment we are thinking to create filter whitelists and GTFs from them and then just merging the chr annotation files into one. As each annotation we will merge is from separate chromosomes this shouldn't cause any headaches. Or am I missing something?

fairliereese commented on July 18, 2024

I can't think of any downsides to this off the top of my head.

fairliereese commented on July 18, 2024

Actually, now that I'm thinking about it, the only thing you'll need to be careful about is not merging the abundance of transcripts from separate chromosomes together, even if they have the same transcript ID.

callumparr commented on July 18, 2024

> Actually, now that I'm thinking about it the only thing you'll need to be careful for is not merging abundance of transcripts from the separate chromosomes together even if they have the same transcript ID.

I extracted an abundance file for each chr.db. Can I not simply rbind the resulting .tsv files to get what is effectively a chr-sorted abundance file? Counts for each isoform should only appear once, as each isoform is located on one chr only.

Sorry, I probably misunderstood your point.

Every time I build a new database from the same GENCODE annotation, TALON will assign the same index to these known annotations, right?

fairliereese commented on July 18, 2024

Yes, but you will run the risk of having duplicated transcript IDs. For instance, novel transcript number 1 from chromosome 1 will not be the same as novel transcript number 1 from chromosome 2. This is perhaps an obvious point and there would be easy ways to make your novel transcript IDs unique but I wanted to make sure to point it out nonetheless.
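
A minimal sketch of the row-binding discussed above, with a check for the duplicated-ID risk: it concatenates per-chromosome abundance files that share a header and reports any transcript ID that appears more than once. The function name is hypothetical, and the default ID column `annot_transcript_id` is an assumption about the abundance file layout; pass whichever column holds the transcript IDs in your files.

```python
import csv
from collections import Counter


def concat_abundance(paths, out_path, id_column="annot_transcript_id"):
    """Row-bind tab-separated files with identical headers.

    Returns a sorted list of IDs seen more than once across all inputs.
    """
    seen = Counter()
    header = None
    writer = None
    with open(out_path, "w", newline="") as out:
        for path in paths:
            with open(path, newline="") as handle:
                reader = csv.DictReader(handle, delimiter="\t")
                if header is None:
                    # The first file defines the header of the combined table.
                    header = reader.fieldnames
                    writer = csv.DictWriter(
                        out, fieldnames=header, delimiter="\t", lineterminator="\n"
                    )
                    writer.writeheader()
                elif reader.fieldnames != header:
                    raise ValueError(f"{path} has a different header")
                for row in reader:
                    seen[row[id_column]] += 1
                    writer.writerow(row)
    return sorted(tid for tid, n in seen.items() if n > 1)
```

An empty return value suggests the per-chromosome IDs are already unique, for example because a distinct prefix was used for novel transcripts in each database; a non-empty one lists the IDs that would need renaming before counts are combined.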

callumparr commented on July 18, 2024

Ah, I see. Yes, I added a prefix for novel annotations when initializing each database.
