Comments (11)
Ah OK, so it was doing something, but when it started to update the database it hit many errors.
[ 2023-06-11 16:51:28 ] All jobs complete. Starting database update.
[ 2023-06-11 17:20:18 ] Validating database........
Database counter for 'genes' does not match the number of entries in the table. Discarding changes to database and exiting...
table_count: 153594
counter_value: 172923
Database counter for 'transcripts' does not match the number of entries in the table. Discarding changes to database and exiting...
table_count: 1644012
counter_value: 2021938
Database counter for 'location' does not match the number of entries in the table. Discarding changes to database and exiting...
table_count: 2116074
counter_value: 2397222
Database counter for 'edge' does not match the number of entries in the table. Discarding changes to database and exiting...
table_count: 3020322
counter_value: 3559795
Database counter for 'observed' does not match the number of entries in the table. Discarding changes to database and exiting...
table_count: 164632280
counter_value: 173941741
Traceback (most recent call last):
File "/home/callum/miniconda3/bin/talon", line 33, in <module>
File "/home/callum/miniconda3/lib/python3.6/site-packages/talon/talon.py", line 2464, in main
end_support = parse_custom_SAM_tags(sam_record)
File "/home/callum/miniconda3/lib/python3.6/site-packages/talon/talon.py", line 1781, in update_database
# get overlap and compare
File "/home/callum/miniconda3/lib/python3.6/site-packages/talon/talon.py", line 2095, in check_database_integrity
except Exception as e:
RuntimeError: Discrepancy found in database. Discarding changes to database and exiting...
/analysisdata/fantom6/Interactome/ONT-CAGE_TALON_dorado/scripts/talon.sh: line 22: rep1: command not found
/tmp/F6_interactome_neurogenesis_QC.log: 74.5% -- replaced with /tmp/F6_interactome_neurogenesis_QC.log.gz
gzip: /tmp/*talon_read_annot.tsv: No such file or directory
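For anyone hitting the same message: the log above shows a stored counter for each table being compared with the number of rows actually in that table. Below is a minimal sketch of doing the same comparison by hand on a SQLite database. The `counters` table and its `category` and `count` columns are assumptions about the schema, not something confirmed in this thread, and `find_counter_mismatches` is a hypothetical helper, not part of TALON.

```python
import sqlite3


def find_counter_mismatches(conn, tables):
    """Return {table: (table_count, counter_value)} for every mismatch.

    Assumes (hypothetically) a `counters` table with `category` and
    `count` columns that stores one running counter per table.
    """
    mismatches = {}
    for table in tables:
        # Table names cannot be bound as SQL parameters, so only pass
        # trusted, hard-coded names here.
        table_count = conn.execute(
            f'SELECT COUNT(*) FROM "{table}"'
        ).fetchone()[0]
        row = conn.execute(
            'SELECT "count" FROM counters WHERE category = ?', (table,)
        ).fetchone()
        counter_value = row[0] if row else None
        if counter_value != table_count:
            mismatches[table] = (table_count, counter_value)
    return mismatches
```

Calling it with `sqlite3.connect("talon.db")` and the table names from the log (`genes`, `transcripts`, `location`, `edge`, `observed`) would list which tables disagree with their counters, without modifying the database.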
from talon.
Hey, my suggestion when dealing with this much data is to run TALON sequentially. I have had luck with running it on 100s of millions of reads if I run ~40 million reads at a time.
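One way to plan such sequential runs is to group the datasets into batches that stay under a read budget and then run TALON once per batch against the same database. The sketch below only does the grouping. The function name, the input format, and the 40 million default are illustrative assumptions, and it does not call TALON itself.

```python
def batch_by_read_count(datasets, max_reads=40_000_000):
    """Greedily group (name, n_reads) pairs into consecutive batches.

    A new batch is started whenever adding the next dataset would push
    the running total above `max_reads`. A single dataset larger than
    the budget still gets a batch of its own.
    """
    batches, current, current_reads = [], [], 0
    for name, n_reads in datasets:
        if current and current_reads + n_reads > max_reads:
            batches.append(current)
            current, current_reads = [], 0
        current.append(name)
        current_reads += n_reads
    if current:
        batches.append(current)
    return batches
```

Each returned batch would then correspond to one TALON run, with the runs executed one after another rather than in parallel.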
from talon.
Hi @fairliereese thanks for the reply!
I am trying to get all samples (context) in at once, so I now load the data per chr instead, to reduce the amount of data TALON has to handle. That means running TALON 25 times to cover the major chr contigs. I hope this doesn't break some logic of how TALON works.
from talon.
To speed up the database generation I took two tacks, both of which involved splitting all sample alignments by chromosome and running them either a) sequentially into the same database, one chr config at a time, or b) in parallel, creating a database for each chr and adding a prefix to TALON. The latter is obviously faster at generating all the annotations, but it means a lot of downstream work handling the different talon.db files. Given that each was initialized with the same hg38 build and GENCODE v39 annotation, is it possible to merge these into one database? The only overlap would be the initial GENCODE annotations from initializing a database for each chr.
from talon.
Actually, splitting by chromosome will not really help with speed, because TALON already tries to do this in order to parallelize. It splits the input BAM files into non-overlapping genomic segments, which often just ends up splitting by chromosome. So by splitting data up this way you won't really be getting a speed benefit.
Currently there is no way within TALON to merge transcripts from separate databases. There are, however, other tools that we have developed that accomplish this. See my library Cerberus, which harmonizes transcriptome annotations to use a unified set of coordinates. As a note of caution, transcriptome merging typically involves introducing flexibility at the 5' and 3' ends, as we can't really rely on exact matching across transcripts as we can for things like splice sites. If you're interested in using Cerberus, I can try to work with you to do that. I've used it successfully on output from multiple TALON databases and have a lot of code lying around that might help you.
from talon.
Yes, separating by chr and running sequentially doesn't make sense, I am realizing, as this is exactly where the parallelization comes from.
Running and outputting a .db for each chr is very quick, and at the moment we are thinking of creating filter whitelists and GTFs from them and then just merging the chr annotation files into one. As each annotation we merge is from a separate chromosome, this shouldn't cause any headaches. Or am I missing something?
from talon.
Actually, now that I'm thinking about it, the only thing you'll need to be careful about is not merging the abundance of transcripts from separate chromosomes together, even if they have the same transcript ID.
from talon.
I extracted an abundance file for each chr.db. Can I not simply rbind the resulting .tsv files to get what is effectively a chr-sorted abundance file? Counts for each isoform should only appear once, as each isoform is located on one chr only.
Sorry, I probably misunderstood your point.
Every time I create a new database from the same GENCODE annotation, TALON will assign the same index to these known annotations, right?
from talon.
Yes, but you will run the risk of having duplicated transcript IDs. For instance, novel transcript number 1 from chromosome 1 will not be the same as novel transcript number 1 from chromosome 2. This is perhaps an obvious point and there would be easy ways to make your novel transcript IDs unique but I wanted to make sure to point it out nonetheless.
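As a rough illustration of one of those easy ways, the sketch below concatenates per-chromosome abundance rows, prefixes every non-known transcript ID with its chromosome label, and fails loudly if any ID still appears twice. The column names `annot_transcript_id` and `transcript_novelty`, the `"Known"` label, and the `merge_abundance` helper are assumptions made for the example. Check them against your own abundance files before relying on it.

```python
from collections import Counter


def merge_abundance(tables, id_col="annot_transcript_id",
                    novelty_col="transcript_novelty"):
    """Concatenate per-chromosome rows with unique novel transcript IDs.

    `tables` maps a chromosome label to a list of row dicts. Column
    names and the "Known" novelty label are assumptions for this sketch.
    """
    merged = []
    for chrom, rows in tables.items():
        for row in rows:
            row = dict(row)  # copy so the caller's rows are not modified
            if row[novelty_col] != "Known":
                # Novel IDs are only unique within one database, so make
                # them unique across databases with a chromosome prefix.
                row[id_col] = f"{chrom}_{row[id_col]}"
            merged.append(row)
    counts = Counter(r[id_col] for r in merged)
    duplicates = [tid for tid, n in counts.items() if n > 1]
    if duplicates:
        raise ValueError(
            f"duplicated transcript IDs after merge: {duplicates[:5]}"
        )
    return merged
```

The rows could be read from each .tsv with `csv.DictReader(handle, delimiter="\t")`. If the novel IDs were already given a per-database prefix at initialization, the prefixing step is redundant and only the duplicate check is useful.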
from talon.
Ah, I see. Yes, I added a prefix for novel annotations when initializing the database.
from talon.