mitonuc's People

Contributors

atcg

mitonuc's Issues

Count how many of the mitochondrial GIs are in the big GI list I'm subtracting them from

I'm curious how accurate the main GI list from the taxdump is. To test how well it agrees with Entrez, do the following while subtracting the mitochondrial GIs from each family's master GI list:

  1. Count each mitochondrial GI as we loop through ($mtGIcounter)
  2. If a match is successful, increment the match counter ($matchcounter)
  3. After looping through all the lines in the mt GI file, compute $matchcounter / $mtGIcounter and report the result.
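
The three steps above can be sketched as follows. This is a minimal illustration in Python rather than the Perl the scripts are written in; the function name and the in-memory lists are assumptions, and the real code would read the GIs from the mt GI file and the family's master GI list.

```python
def mt_gi_match_rate(mt_gis, master_gis):
    """Return (match_counter, mt_gi_counter, fraction matched)."""
    master = set(master_gis)      # set gives fast membership tests
    mt_gi_counter = 0             # step 1: every mitochondrial GI seen
    match_counter = 0             # step 2: GIs also present in the master list
    for gi in mt_gis:
        mt_gi_counter += 1
        if gi in master:
            match_counter += 1
    # step 3: report the ratio, guarding against an empty mt GI file
    rate = match_counter / mt_gi_counter if mt_gi_counter else 0.0
    return match_counter, mt_gi_counter, rate


if __name__ == "__main__":
    # Made-up GI values, for illustration only
    matched, total, rate = mt_gi_match_rate(["1", "2", "3", "4"], ["2", "4", "9"])
    print(f"{matched}/{total} mitochondrial GIs found in master list ({rate:.0%})")
```

A ratio well below 1 would suggest the taxdump-derived list and Entrez disagree for that family.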

Must catch genome sequences in the refseq_genomic database

Must do database operations against nt and (at least) refseq_genomic. For example, of the 22 full mitochondrial genome records for taxon group 8948, only 12 are found in the nt database. The other 10 are RefSeqs, so they are (I'm assuming) in the refseq_genomic BLAST database.
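
One way to picture the requirement is to try each database in turn and record where every record was found. The sketch below is Python and purely illustrative: the function name is hypothetical, the databases are stand-in sets rather than real BLAST databases, and the record IDs are synthetic (only the 12/10 split comes from the example above).

```python
def locate_records(gis, databases):
    """Assign each GI to the first database that contains it.

    `databases` maps a database name to the set of GIs it holds
    (a stand-in for a real lookup against a BLAST database).
    Returns (found, missing): found maps name -> list of GIs.
    """
    found = {name: [] for name in databases}
    missing = []
    for gi in gis:
        for name, contents in databases.items():
            if gi in contents:
                found[name].append(gi)
                break
        else:
            missing.append(gi)    # not present in any database
    return found, missing


if __name__ == "__main__":
    # Synthetic IDs: 22 records, 12 placed in "nt", 10 in "refseq_genomic"
    records = [f"rec{i}" for i in range(22)]
    dbs = {"nt": set(records[:12]), "refseq_genomic": set(records[12:])}
    found, missing = locate_records(records, dbs)
    print({name: len(hits) for name, hits in found.items()}, len(missing))
```

Searching nt alone would leave the 10 RefSeq records in `missing`, which is the failure this issue describes.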

Clustering is broken--clustering to species (from long records) instead of by gene

The clustering is not working as anticipated. Instead of grouping sequences from different species into gene clusters, it is clustering by species. This is because the FASTA databases I am clustering, for the mitochondrial groupings I have tested, include full mitochondrial genomes. cd-hit-est clusters the database against the longest sequence, which is the full mt genome, so it picks up every sequence that aligns to that genome above xx% sequence identity. The result is that all the mitochondrial genes for a species land in one cluster, which is not what I want.
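
The effect can be reproduced with a toy version of greedy, longest-first clustering. This Python sketch is not cd-hit-est: exact substring containment stands in for "aligns above the identity threshold", and the sequences and names are invented for illustration.

```python
def greedy_cluster(seqs):
    """Toy greedy incremental clustering, longest sequence first.

    A sequence joins the first existing cluster whose representative
    contains it as an exact substring; otherwise it starts a new cluster.
    Returns a dict: representative name -> list of member names.
    """
    clusters = {}
    ordered = sorted(seqs, key=lambda name: len(seqs[name]), reverse=True)
    for name in ordered:
        for rep in clusters:
            if seqs[name] in seqs[rep]:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]   # becomes a new representative
    return clusters


if __name__ == "__main__":
    # Invented sequences: each "genome" is the two genes concatenated
    a_cox1, a_cytb = "ACGTACGTAA", "TTGGCCAATT"
    b_cox1, b_cytb = "ACGAACGTCA", "TTGACCTATT"
    seqs = {
        "spA_genome": a_cox1 + a_cytb, "spB_genome": b_cox1 + b_cytb,
        "spA_cox1": a_cox1, "spA_cytb": a_cytb,
        "spB_cox1": b_cox1, "spB_cytb": b_cytb,
    }
    for rep, members in greedy_cluster(seqs).items():
        print(rep, members)
```

Because the full genomes are the longest records, they become the representatives, and each species' genes attach to their own genome rather than to the same gene in the other species.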

Feed subset_mito_db.pl lower-level (family) taxon IDs iteratively, instead of a high-level identifier (like 8948)

Instead of using taxID_to_GIs.pl to create files of GI subsets for all families under a higher-level taxon like 8948, just use taxID_to_GIs.pl to get the taxon identifiers of all the families. Then iteratively feed these identifiers to subset_mito_db.pl to create mitochondrion and nucleus sub-databases for every family of the higher taxon ID.

This makes part of the functionality of taxID_to_GIs.pl redundant. We no longer need it to create GI lists at all, since we are getting the GI lists from Entrez. With this method we ignore the data from gi_taxid_nucl.dmp.gz.

My concern with this new method is that we are making way more calls to Entrez over the interwebs (3 for each family: one for nuclear, one for mitochondrion (not full genome), and one for mitochondrion (full genome)), which might not be as robust. I think the proposed method is more straightforward and simple, however. Worth a shot.
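
To get a feel for the call volume, the per-family plan can be sketched like this. The Python below makes no network requests. The function name, the family taxon IDs, and the query templates are all assumptions for illustration; the issue does not state the actual Entrez queries used by subset_mito_db.pl.

```python
# Hypothetical query templates, one per call described above.
# These are NOT taken from subset_mito_db.pl; check them against the
# Entrez documentation before relying on them.
QUERY_TEMPLATES = {
    "nuclear": "txid{taxid}[Organism:exp] NOT mitochondrion[filter]",
    "mito_partial": 'txid{taxid}[Organism:exp] AND mitochondrion[filter] '
                    'NOT "complete genome"[Title]',
    "mito_full": 'txid{taxid}[Organism:exp] AND mitochondrion[filter] '
                 'AND "complete genome"[Title]',
}


def plan_entrez_calls(family_taxids):
    """Return one (taxid, label, query) tuple per planned Entrez call."""
    calls = []
    for taxid in family_taxids:
        for label, template in QUERY_TEMPLATES.items():
            calls.append((taxid, label, template.format(taxid=taxid)))
    return calls


if __name__ == "__main__":
    families = [1001, 1002]        # placeholder family taxon IDs
    calls = plan_entrez_calls(families)
    print(f"{len(calls)} Entrez calls for {len(families)} families")
    for taxid, label, query in calls:
        print(taxid, label, query)
```

The number of calls grows as three times the number of families, so a taxon with many families means many round trips, each of which can fail independently.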

I will be interested to see which method is faster (I'm guessing the first).
