mitonuc's Issues
Add protein sequence GI list creation to taxID_to_GIs.pl
Count how many of the mitochondrial GIs are in the big GI list I'm subtracting them from
I'm curious how accurate the main GI list from the taxdump is. To test how well it agrees with Entrez, while subtracting the mitochondrial GIs from the master GI list for each family:
- Count each mitochondrial GI as we loop through ($mtGIcounter)
- If a match is successful, increment the match counter ($matchcounter)
- After looping through all the lines in the mt GI file, do $matchcounter / $mtGIcounter and report the results.
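The counting steps above can be sketched as follows. This is a minimal Python illustration, not the Perl code in the repository: the function name `mito_match_rate` and the sample GI values are made up for the example, and it assumes both lists hold one GI per line.

```python
def mito_match_rate(master_gis, mito_gi_lines):
    """Count how many mitochondrial GIs also appear in the master GI list.

    Hypothetical helper: both arguments are iterables of strings,
    one GI per entry (as when reading a GI list file line by line).
    """
    master = {gi.strip() for gi in master_gis if gi.strip()}
    mt_gi_counter = 0   # plays the role of $mtGIcounter
    match_counter = 0   # plays the role of $matchcounter
    for line in mito_gi_lines:
        gi = line.strip()
        if not gi:
            continue  # skip blank lines
        mt_gi_counter += 1
        if gi in master:
            match_counter += 1
    # Guard against an empty mitochondrial GI file.
    fraction = match_counter / mt_gi_counter if mt_gi_counter else 0.0
    return match_counter, mt_gi_counter, fraction


# Made-up GI values, for illustration only.
master = ["101", "102", "103", "104"]
mito = ["102\n", "104\n", "999\n", "\n"]
matches, total, fraction = mito_match_rate(master, mito)
print(f"{matches}/{total} mitochondrial GIs found in master list ({fraction:.1%})")
```

A low fraction would suggest the taxdump-derived list and Entrez disagree for that family.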
updateblastdb.sh should update taxonomy database in $BLASTDB
subset_mito_db.pl currently works on the higher taxon ID level
We don't want to cluster at the high-level taxon ID (e.g. 8948). Instead, we want to:
- Create a master GI list for 8948 for subtraction via remove_mito_gis_from_gi_lists.pl
- Create mito gi lists for each family (?)
Must catch genome sequences in the refseq_genomic database
Database operations must run against nt and (at least) refseq_genomic. For example, of the 22 full mitochondrial genome records for taxon group 8948, only 12 are found in the nt database. The other 10 are RefSeqs, so (I'm assuming) they are in the refseq_genomic BLAST database.
Clustering is broken: clustering by species (from long records) instead of by gene
The clustering is not working as anticipated. Instead of clustering sequences from different species into gene groups, it is clustering by species. This is because, for the mitochondrial groupings I have tested, the FASTA databases being clustered include full mitochondrial genomes. cd-hit-est therefore clusters the database against the longest sequence, which is the full mt genome, and picks up every sequence that aligns to it above xx% sequence identity. The result is a cluster containing all the mitochondrial genes for one species, rather than the per-gene clusters I want.
updateblastdb.sh should create aliases for necessary database combinations (nucleotides/proteins/etc...)
Feed subset_mito_db.pl lower level (family) taxon IDs iteratively, instead of high-level identifier (like 8948)
Rather than using taxID_to_GIs.pl to create files of GI subsets for all families under a higher-level taxon like 8948, just use taxID_to_GIs.pl to get the taxon identifiers of all families, then iteratively feed these identifiers to subset_mito_db.pl to create mitochondrial and nuclear sub-databases for every family of the higher taxon ID.
This makes part of the functionality of taxID_to_GIs.pl redundant. We no longer need it to create GI lists at all, since we are getting the GI lists from Entrez. Using this method we are ignoring the data from gi_taxid_nucl.dmp.gz.
My concern with this new method is that we are making far more calls to Entrez over the network (3 for each family: one for nuclear, one for mitochondrion (not full genome) and one for mitochondrion (full genome)), which might not be as robust. I think the proposed method is more straightforward and simple, however. Worth a shot.
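To make the cost concrete, here is a rough Python sketch of the per-family query plan. It only builds search strings and makes no network calls. Everything in it is an assumption for illustration: the function names, the family taxon IDs (placeholders, not real families under 8948), and the exact Entrez search terms, which may differ from what subset_mito_db.pl actually sends.

```python
def family_queries(family_taxid):
    """Build the three search terms for one family taxon ID.

    Hypothetical helper; the Entrez field tags used here are an
    assumption about how the three categories could be expressed.
    """
    base = f"txid{family_taxid}[Organism:exp]"
    return {
        "nuclear": f"{base} NOT mitochondrion[filter]",
        "mito_partial": f'{base} AND mitochondrion[filter] NOT "complete genome"[Title]',
        "mito_full_genome": f'{base} AND mitochondrion[filter] AND "complete genome"[Title]',
    }


def plan_queries(family_taxids):
    """Return one (taxid, label, term) tuple per Entrez call that would be made."""
    return [
        (taxid, label, term)
        for taxid in family_taxids
        for label, term in family_queries(taxid).items()
    ]


# Placeholder family IDs, not real taxon identifiers from the taxdump.
families = [1001, 1002, 1003]
plan = plan_queries(families)
print(f"{len(families)} families -> {len(plan)} Entrez calls")
for taxid, label, term in plan[:3]:
    print(taxid, label, term)
```

The number of calls grows linearly with the number of families, which is where the robustness concern comes from.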
I will be interested to see which method is faster (I'm guessing the first).
Add command line flag for taxID_to_GIs.pl to select the high-level taxID to search for
Doesn't yet deal with integrating mitochondrial genomes into mt results
The mitochondrial genomes won't have independent GI numbers once I break them up into their constituent parts. Can I just do this step after all the BLAST db processing is done?