Comments (11)

callumparr commented on July 18, 2024

Ah OK, so it was doing something, but when it started to update the database it hit many errors.

[ 2023-06-11 16:51:28 ] All jobs complete. Starting database update.
[ 2023-06-11 17:20:18 ] Validating database........
Database counter for 'genes' does not match the number of entries in the table. Discarding changes to database and exiting...
table_count: 153594
counter_value: 172923
Database counter for 'transcripts' does not match the number of entries in the table. Discarding changes to database and exiting...
table_count: 1644012
counter_value: 2021938
Database counter for 'location' does not match the number of entries in the table. Discarding changes to database and exiting...
table_count: 2116074
counter_value: 2397222
Database counter for 'edge' does not match the number of entries in the table. Discarding changes to database and exiting...
table_count: 3020322
counter_value: 3559795
Database counter for 'observed' does not match the number of entries in the table. Discarding changes to database and exiting...
table_count: 164632280
counter_value: 173941741
Traceback (most recent call last):
  File "/home/callum/miniconda3/bin/talon", line 33, in <module>
  File "/home/callum/miniconda3/lib/python3.6/site-packages/talon/talon.py", line 2464, in main
    end_support = parse_custom_SAM_tags(sam_record)
  File "/home/callum/miniconda3/lib/python3.6/site-packages/talon/talon.py", line 1781, in update_database
    # get overlap and compare
  File "/home/callum/miniconda3/lib/python3.6/site-packages/talon/talon.py", line 2095, in check_database_integrity
    except Exception as e:
RuntimeError: Discrepancy found in database. Discarding changes to database and exiting...
/analysisdata/fantom6/Interactome/ONT-CAGE_TALON_dorado/scripts/talon.sh: line 22: rep1: command not found
/tmp/F6_interactome_neurogenesis_QC.log:         74.5% -- replaced with /tmp/F6_interactome_neurogenesis_QC.log.gz
gzip: /tmp/*talon_read_annot.tsv: No such file or directory

from talon.

fairliereese commented on July 18, 2024

Hey, my suggestion when dealing with this much data is to run TALON sequentially. I have had luck with running it on 100s of millions of reads if I run ~40 million reads at a time.
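
One way to plan such sequential runs is to group datasets into batches that stay under a read budget. The sketch below is only an illustration: the function name `plan_batches`, the dataset names, and the per-dataset read counts are hypothetical, and the ~40 million figure is taken from the comment above rather than from any TALON limit. It does not call TALON itself; each resulting batch would be passed to a separate TALON run against the same database.

```python
from typing import Dict, List


def plan_batches(read_counts: Dict[str, int],
                 max_reads: int = 40_000_000) -> List[List[str]]:
    """Greedily group datasets so each batch totals at most max_reads.

    A single dataset larger than max_reads still gets a batch of its own.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_total = 0
    # Visit the largest datasets first so big ones are not stranded at the end.
    for name, n_reads in sorted(read_counts.items(),
                                key=lambda kv: kv[1], reverse=True):
        # Close the current batch if adding this dataset would exceed the budget.
        if current and current_total + n_reads > max_reads:
            batches.append(current)
            current, current_total = [], 0
        current.append(name)
        current_total += n_reads
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # Hypothetical read counts per dataset.
    counts = {"rep1": 30_000_000, "rep2": 20_000_000,
              "rep3": 10_000_000, "rep4": 5_000_000}
    for i, batch in enumerate(plan_batches(counts), start=1):
        print(f"run {i}: {batch}")
```

The greedy grouping is not optimal bin packing, but it is usually good enough for deciding which samples go into which run.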

callumparr commented on July 18, 2024

Hi @fairliereese thanks for the reply!

I am trying to get all samples (context) into the database at once, so I now load the data per chr to reduce how much TALON has to handle at a time. That means running TALON 25 times to cover the major chr contigs. I hope this doesn't break some logic of how TALON works.

callumparr commented on July 18, 2024

@fairliereese

To speed up the database generation I tried two tacks. Both involved splitting all sample alignments by chromosome and running them either a) sequentially into the same database, one chr contig at a time, or b) in parallel, creating a database for each chr and adding a prefix for novel annotations. The latter is obviously faster for generating all the annotations, but it means a lot of downstream work handling the different talon.db files. Given that each was initialized with the same hg38 build and GENCODE v39 annotation, is it possible to merge these into one database? The only overlap would be the initial GENCODE annotations from initializing a database for each chr.

fairliereese commented on July 18, 2024

Actually splitting by chromosome will not really help with speeding up because TALON already tries to do this in order to parallelize. It splits the input BAM files into non-overlapping genomic segments which often just end up splitting by chromosome. So by splitting data up this way you won't really be getting a speed benefit.
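
To illustrate why this kind of segmentation often collapses to roughly one chunk per chromosome, here is a minimal sketch that merges overlapping read intervals into non-overlapping segments. It is not TALON's actual implementation; the function name `split_into_segments` and the tuple layout are assumptions made for the example. With deep coverage, reads overlap each other across long stretches, so the merged segments grow very large.

```python
from typing import List, Tuple

# (chromosome, start, end); coordinates are treated as closed intervals here.
Interval = Tuple[str, int, int]


def split_into_segments(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping intervals into non-overlapping segments per chromosome."""
    segments: List[Interval] = []
    # Sorting orders by chromosome name first, then by start coordinate.
    for chrom, start, end in sorted(intervals):
        if segments and segments[-1][0] == chrom and start <= segments[-1][2]:
            # Overlaps the previous segment on the same chromosome: extend it.
            prev_chrom, prev_start, prev_end = segments[-1]
            segments[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            # No overlap: this read starts a new independent segment.
            segments.append((chrom, start, end))
    return segments


if __name__ == "__main__":
    reads = [("chr1", 100, 200), ("chr1", 150, 300),
             ("chr1", 500, 600), ("chr2", 50, 80)]
    print(split_into_segments(reads))
```

Each resulting segment can be processed independently, which is where the parallelism comes from; pre-splitting the input by chromosome does not create any additional independent units.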

Currently there is no way within TALON to merge transcripts from separate databases. There are, however, other tools that we have developed that accomplish this. See my library Cerberus, which harmonizes transcriptome annotations to use a unified set of coordinates. As a note of caution, transcriptome merging typically involves introducing flexibility at the 5' and 3' ends, as we can't really rely on exact matching across transcripts the way we can for things like splice sites. If you're interested in using Cerberus I can try to work with you to do that. I've used it successfully on output from multiple TALON databases and have a lot of code lying around that might help you.

callumparr commented on July 18, 2024

Yes, separating by chr and running sequentially doesn't make sense, I am realizing, as the parallelization comes from exactly this.

Running and outputting a .db for each chr is very quick, and at the moment we are thinking of creating filter whitelists and GTFs from them and then just merging the per-chr annotation files into one. As each annotation we merge comes from a separate chromosome, this shouldn't cause any headaches. Or am I missing something?

fairliereese commented on July 18, 2024

Actually, now that I'm thinking about it, the only thing you'll need to be careful about is not merging the abundances of transcripts from separate chromosomes together, even if they have the same transcript ID.

callumparr commented on July 18, 2024

I extracted an abundance file for each chr.db. Can I not simply rbind the resulting .tsv files, so that it is like a chr-sorted abundance file? Counts for each isoform should only appear once, as each isoform is located on one chr only.

Sorry, I probably misunderstood your point.

Every time I create a new database from the same GENCODE annotation, TALON will assign the same index to these known annotations, right?

fairliereese commented on July 18, 2024

Yes, but you will run the risk of having duplicated transcript IDs. For instance, novel transcript number 1 from chromosome 1 will not be the same as novel transcript number 1 from chromosome 2. This is perhaps an obvious point and there would be easy ways to make your novel transcript IDs unique but I wanted to make sure to point it out nonetheless.
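
Before concatenating per-chromosome abundance files, a quick check for IDs that occur in more than one file can catch this. The sketch below is a hedged example: the function name `find_shared_ids` is hypothetical, and the default column name `annot_transcript_id` is an assumption about the abundance file header, so adjust it to whatever the files actually contain. The sample IDs in the demo are made up for illustration.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, List


def find_shared_ids(sources: Dict[str, Iterable[str]],
                    id_column: str = "annot_transcript_id") -> Dict[str, List[str]]:
    """Return transcript IDs that occur in more than one source.

    `sources` maps a label (e.g. a chromosome name) to an iterable of TSV
    lines, such as an open file handle. The first line must be the header.
    """
    seen = defaultdict(set)
    for label, lines in sources.items():
        for row in csv.DictReader(lines, delimiter="\t"):
            seen[row[id_column]].add(label)
    # Keep only IDs reported by two or more sources.
    return {tid: sorted(labels) for tid, labels in seen.items() if len(labels) > 1}


if __name__ == "__main__":
    # Made-up rows: the same novel ID was assigned independently on two chromosomes.
    chr1 = ["annot_transcript_id\tsample1", "ENST0001\t5", "TALONT000000001\t3"]
    chr2 = ["annot_transcript_id\tsample1", "ENST0002\t7", "TALONT000000001\t2"]
    print(find_shared_ids({"chr1": chr1, "chr2": chr2}))
```

An empty result suggests the files can be concatenated row-wise without summing or collapsing unrelated transcripts; any reported ID needs a unique prefix or renaming first.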

callumparr commented on July 18, 2024

Ah, I see. Yes, I added a prefix for novel annotations when initializing the database.
