datasnakes / orthoevolution

An easy-to-use and comprehensive Python package that aids in the analysis and visualization of orthologous genes.

Home Page: https://orthoevolution.readthedocs.io/en/master/

Python 38.81% Shell 0.33% HTML 10.19% TeX 1.46% PLSQL 13.28% SQLPL 0.19% PLpgSQL 3.93% CSS 0.09% JavaScript 14.59% TSQL 16.85% R 0.25%
python bioinformatics orthologs biology phylogenetics ncbi ftp blast qsub orthology-inference

orthoevolution's People

Contributors

dependabot[bot], grabear, grabearummc, sdhutchins

orthoevolution's Issues

Remove `Archive` folder

We need to migrate the Archive folder to its own GitHub repository and make it private. I think there are some insecure things in there (not sure if I removed them). @grabear

Reference this.

GI deprecation

NCBI is phasing out GI numbers, so we need to drop our reliance on them as well.

In order to begin this process I have commented out code and created todos:

# TODO: GI deprecation.

The following commits were made on individual files for deprecating the GI functionality:

Alignment Class

Decide on different alignment strategies that we would like to be able to employ.

  • Filtered Sequence Alignment
    • Amino Acid Alignment with multiple GUIDANCE2 iterations.
    • Nucleic Acid Alignment without filtered sequences using PAL2NAL.
      • Catch PAL2NAL errors and remove sequences that aren't working
        • Employ the Filtered Column Alignment strategy for the sequences that aren't working by aligning multiple query sequences (Homo_sapiens, Macaca_mulatta, Mus_musculus) with the "bad" sequences.
  • Solitary Alignment
    • CLUSTAL Omega
    • PAL2NAL
  • Filtered Column Alignment
    • For sequences that are more cluttered, deploy a strategy for safely removing "bad" columns from the alignments by using GUIDANCE2
    • Use PAL2NAL on what's left.

Integrate Flask with Shiny.

From @grabear on April 10, 2017 15:55

Flask will be useful in database management, login functionality, role functionality (admin vs user), etc. Flask will be used to manage our shiny applications/URLs, which is another dive into the unknown. However, this should be a relatively seamless issue.

Copied from original issue: grabear/Orthologs-Project#20

Agree on a line length for our package.

PEP8 convention calls for a line length of ~80 characters:

Limit all lines to a maximum of 79 characters.

For flowing long blocks of text with fewer structural restrictions (docstrings or comments), the line length should be limited to 72 characters.

However, I like something a bit longer, around 120 characters. I don't mind using 120 for both code and docstrings, or I can use 120 for code and 72 for docstrings.

Let me know if you prefer something different @sdhutchins.

Clean docstrings

  • Use pydocstyle as referenced by @datasnakes/snakes to update all docstrings.

Update Alignment configuration.

While the other modules are configuring successfully now, the data management class's configuration for the alignment class needs to change. Since each alignment takes different parameters, each type of alignment will have its own nested key-value pairs in the Alignment_config variable that exists in the YAML files. GUIDANCE2, for example, will look like this:

Alignment_config:
  aln_program: 'GUIDANCE2'
  Guidance_config:
    seqFile: "~\\Datasnakes-Scripts\\Datasnakes\\Manager\\config\\test.faa"
    msaProgram: "CLUSTALW"
    seqType: "nuc"

Further down the road we can do this for a consensus alignment:

Alignment_config:
  aln_program: 'GUIDANCE2'
  Consensus_config:
    Guidance_config:
      seqFile: "~\\Datasnakes-Scripts\\Datasnakes\\Manager\\config\\test.faa"
      msaProgram: "CLUSTALW"
      seqType: "nuc"
    IQ-Tree_config:
      var1:  "hope"
      var2: "dreams"

Obviously there are other options, but for now we've got to KISS.
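
As a rough illustration of how the nested keys could be consumed, here is a minimal sketch. It assumes the YAML has already been loaded into a plain dict; the `SECTION_FOR_PROGRAM` mapping, the `get_alignment_params` helper, and the shortened `seqFile` value are hypothetical and not part of the package.

```python
# Hypothetical sketch: this dict mirrors the Alignment_config YAML above.
alignment_config = {
    "aln_program": "GUIDANCE2",
    "Guidance_config": {
        "seqFile": "test.faa",  # placeholder path
        "msaProgram": "CLUSTALW",
        "seqType": "nuc",
    },
}

# Assumed mapping from a program name to the nested section holding its parameters.
SECTION_FOR_PROGRAM = {"GUIDANCE2": "Guidance_config"}


def get_alignment_params(config):
    """Return (program, params) for the program selected in the config."""
    program = config["aln_program"]
    section = SECTION_FOR_PROGRAM.get(program)
    if section is None:
        raise KeyError(f"No config section registered for {program!r}")
    if section not in config:
        raise KeyError(f"Missing {section!r} in alignment config")
    # Copy so callers can modify the parameters without touching the config.
    return program, dict(config[section])


program, params = get_alignment_params(alignment_config)
print(program, params["msaProgram"])
```

Supporting another aligner would then only require a new entry in the mapping and a matching nested section in the YAML file.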

Use SQL for our project and data.

From @grabear on April 10, 2017 17:5

Create an SQL schema that will hold our data.
Decide which approach to take:

  1. Alter the Biopython BioSQL schema so we can add our data.

  2. Use other schema for our data.

  3. Start from scratch.


Data to add:

  • Accession file (Lister class)
    • Missing accession data
      • Based on Organism
      • Based on Gene
    • Duplicate accession data
      • Based on Organism
      • Based on Gene
  • GenBank files
    • Database reference or foreign key
  • FASTA files
    • Multi FASTA
    • Alignment files
  • Phylogenetic Tree files
    • Newick format, etc.
  • PAML data

The flask based login page will store user data, which will be accessible from a separate database.
This will also hold other form data that's created by each user, when they register or alter their account information.
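
To make the "start from scratch" option more concrete, here is a minimal sketch of part of the "data to add" list as a schema, using Python's built-in sqlite3 module. The table names, column names, and accession values are all hypothetical; it is not the BioSQL schema.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE accession (
    id        INTEGER PRIMARY KEY,
    gene      TEXT NOT NULL,
    organism  TEXT NOT NULL,
    accession TEXT,              -- NULL marks missing accession data
    UNIQUE (gene, organism)
);
CREATE TABLE genbank_file (
    id           INTEGER PRIMARY KEY,
    accession_id INTEGER NOT NULL REFERENCES accession(id),
    path         TEXT NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO accession (gene, organism, accession) VALUES (?, ?, ?)",
    [
        ("HTR1A", "Homo_sapiens", "ACC_0001.1"),  # placeholder accession
        ("HTR1A", "Macaca_mulatta", None),        # missing accession
    ],
)

# Missing accession data, grouped by organism.
missing = conn.execute(
    "SELECT organism, COUNT(*) FROM accession "
    "WHERE accession IS NULL GROUP BY organism"
).fetchall()
print(missing)
```

The same query grouped by `gene` instead of `organism` would cover the "based on gene" case.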

Copied from original issue: grabear/Orthologs-Project#24

Clean Scripts

  • Comment out any unused imports
  • Comment any tricky or unclear code
  • Improve style
    • Conform to PEP 8
    • Ensure style is conserved across all scripts

Database Management for the pipeline.

  • Database Management will first be developed in Manager/database_management.py
  • Another module for creating template BioSQL databases will be developed in Manager/BioSQL/biosql.py
  • It will help keep the following databases updated:
    • ETE3's NCBI-taxonomy database
    • Local NCBI databases

Integrate LogIt into package

  • Use LogIt as a general logging tool.
    • Use in each module and make a master log.
    • Give each module a specific color.
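
The master-log idea can be sketched with the standard library's logging module. This is not the LogIt API; the color assignments, the `ColorByModule` formatter, and the `get_module_logger` helper are assumptions made for illustration.

```python
import io
import logging

# ANSI color codes; the module-to-color assignment is an assumption.
MODULE_COLORS = {"Blast": "\033[34m", "Align": "\033[32m"}
RESET = "\033[0m"


class ColorByModule(logging.Formatter):
    """Wrap each record in the color assigned to its logger name."""

    def format(self, record):
        color = MODULE_COLORS.get(record.name, "")
        return f"{color}{super().format(record)}{RESET if color else ''}"


# One shared handler plays the role of the master log (a file in practice).
master_stream = io.StringIO()
master_handler = logging.StreamHandler(master_stream)
master_handler.setFormatter(ColorByModule("%(name)s - %(levelname)s - %(message)s"))


def get_module_logger(name):
    """Return a logger for one module that writes to the master log."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if master_handler not in logger.handlers:
        logger.addHandler(master_handler)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


get_module_logger("Blast").info("BLAST started")
get_module_logger("Align").info("alignment started")
print(master_stream.getvalue())
```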

Documentation

As it stands, we are in a fairly stable place. #36, #33, #66, and others have all given our project a well-defined shape and functionality.

For me in particular, I need to add documentation to the Classes and modules that I've created.

  • Mini Tutorial
  • Additions to main package README file
  • Basic workflow vs Full workflow
  • Cookies (custom cookiecutter templates)
  • Manager module and Management class
  • DataManagement class
  • MultipleSequenceAlignment class
  • GenBank class
  • CompGenBLASTn
  • I'm definitely forgetting something.

Combine data_management and the Luigi class

What are you trying to accomplish?

The Pipeline module is great. Whenever #38 is finished, the functionality in Manager/data_management.py needs to be used in, or moved into, the Pipeline module.

What is the issue?

The workflow is awkward as it stands.
Will help with #32.

Works Cited

Go through the code and add works-cited entries where necessary in an md or rst document.
Comment code with works-cited index numbers.
Research whether there is a standard way to cite sources in programming projects.

    Format:
{
file: 'file',
path: 'path',
line: 000,
citation: 'biopython',
etc: 'etc'
}

Rename and refactor package

Before the official release, we need to decide on renaming the package and refactoring it.

  • Rename repository from Datasnakes-Scripts to OrthoEvolution
  • Refactor Datasnakes to OrthoEvol

Feel free to change the name of this.

Update readthedocs, wiki, and READMEs

We need to update the readthedocs and wiki with our current documentation (mainly the tutorial).

I'm considering changing the style of the readthedocs page.

As it stands, we are in a fairly stable place. #36, #33, #66, and others have all given our project a well-defined shape and functionality.

We need to add documentation to the Classes and modules that we've created.

  • Mini Tutorial (Readme in OrthoEvol directory)
  • Main README file
  • Tools
  • Cookies (custom cookiecutter templates)
  • Manager
  • Orthologs

Make a list of startup processes. (host:port if applicable)

From @grabear on April 13, 2017 23:58

  • Cockpit (162.243.56.106:9090)
  • Shiny Server (162.243.56.106:3838)
  • R-Studio Server (162.243.56.106:8787)
  • Flask (162.243.56.106:5252)

  • Wasabi (162.243.56.106:8000)
  • FTP
  • RichFilemanager

Copied from original issue: grabear/Orthologs-Project#25

Add better functionality to the Tier system used by the Accession file.

For the master accession files, we can control the workflow by giving the end user access to
priority labels that we define. For instance...

Our default set of tier labels:

  • Priority Levels:
    • Higher priority genes are put through the pipeline first.

    • Numerical values from 1 upward, with 0 reserved for genes that should not run

      • 0 = Do not execute
      • 1 = High Priority
      • 100000000 = Low Priority
    • Alternatively allow the user to create a YAML file

      • This allows for strings in the .csv files
      • For instance:
       Opioid: 1
       Alcohol: 2
       Cocaine: 3
       Opioid/Cocaine: 4
    • String levels

      • "Skip" could actually skip over genes that don't need to be analyzed right away
      • "Isoform-1-Insert-Gene-Here" could be used for identifying isoforms
        • Isoform-0-HTR1A (the desired gene to analyze)
        • Isoform-1-HTR1A
       Isoform-0-HTR1A: 1
       Isoform-1-HTR1A:  "Skip"

This could also be very useful when we get to parsing updated HGNC files directly.
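
A minimal sketch of how tier labels might be turned into a run order, assuming the labels have already been read from the accession file and the user's YAML file. The `order_genes` helper and the gene names other than HTR1A are hypothetical.

```python
# Hypothetical tier mapping, as it might be loaded from a user's YAML file.
TIER_PRIORITY = {"Opioid": 1, "Alcohol": 2, "Cocaine": 3, "Opioid/Cocaine": 4}
SKIP_VALUES = {0, "Skip"}


def order_genes(gene_tiers, tier_priority):
    """Return gene names sorted by priority, dropping skipped genes.

    gene_tiers maps a gene name to a tier label (string) or a number.
    """
    runnable = []
    for gene, tier in gene_tiers.items():
        priority = tier_priority.get(tier, tier)  # numbers pass through
        if priority in SKIP_VALUES:
            continue
        if not isinstance(priority, int):
            raise ValueError(f"Unknown tier {tier!r} for gene {gene!r}")
        runnable.append((priority, gene))
    # Lower numbers mean higher priority, so they come first.
    return [gene for _, gene in sorted(runnable)]


genes = {
    "GENE_A": "Cocaine",
    "HTR1A": "Opioid",
    "GENE_B": "Alcohol",
    "GENE_C": 0,
    "GENE_D": "Skip",
}
print(order_genes(genes, TIER_PRIORITY))
```

Raising on unknown labels is one possible design choice; silently skipping them would hide typos in the accession file.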

Index File Consistency

Minimization

Right now we have a ton of different files we are using as an index for our project. I'd like to propose that we narrow it down to a minimum. Currently we have the following:

  1. MAFV3.1.csv
  2. MAFV3.2.csv
  3. MAFV3.3.csv
  4. Homo_sapiens_Accession.csv
  5. Macaca_mulatta_Accession.csv
  6. Organisms & Taxonomy IDs.csv
  7. Organisms.csv
  8. taxids.csv
  9. ShinyGenes.csv
  10. ShinyOrganisms.csv
  11. shinydir_mana.txt
  12. commonnames.csv
  • The first 3 are test files for the Analysis class so they will eventually go.
  • We can get rid of 4, 5, and 6 because they are redundant.
  • We can keep 8 for now, but I'd like to work the taxon IDs into our Master Accession file (it will be easier to keep them ordered this way) for offline use, and get them from ete3 for online use.
  • For 9 and 10 we were using these because of weird compatibility with R and Python, but I know there is a much better way to deal with this via csv file/dataframe. So we can get rid of these two files, but only after we fix the shiny app code that deals with these files.
  • 11 is a configuration file for the shiny app so we can do something different with this...
  • And lastly 12. This file is useful but we can do it programmatically as well, so this also seems redundant.
  • I'd also like to keep the GI files here.
    Note: This is just from my perspective so if you need any one of these files just say so. I just wanted to narrow down on what we have and standardize the naming conventions.

Standardization

In order to standardize things we need to come up with a naming convention. I propose that we use extensions to identify file types:

  • For accession files we denote them with a .ma. The user (for now just us) can name them ANYTHING!! So it could be Crazy-MAST-ACC.ma.csv
  • For taxids it could be .tax.
  • Our database files (BLAST and RefSeq) will be different. We can access them via directory name.
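
One way such a naming convention could be detected in code is sketched below with pathlib. It assumes the marker sits directly before the final extension (as in Crazy-MAST-ACC.ma.csv); the `KIND_BY_MARKER` mapping and the `index_file_kind` helper are hypothetical.

```python
from pathlib import PurePath

# Assumed mapping from the marker extension to an index-file kind.
KIND_BY_MARKER = {".ma": "accession", ".tax": "taxids"}


def index_file_kind(filename):
    """Return the kind encoded in the second-to-last extension, or None."""
    suffixes = PurePath(filename).suffixes  # all extensions, in order
    if len(suffixes) < 2:
        return None
    return KIND_BY_MARKER.get(suffixes[-2])


print(index_file_kind("Crazy-MAST-ACC.ma.csv"))
print(index_file_kind("taxids.tax.csv"))
print(index_file_kind("Organisms.csv"))
```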

Using YAML tags and Python types with PyYAML.

(Note: This is mostly a forum for discussing different ways of using these YAML tags with various Python modules. They make it convenient to store path-like objects and to work with the pkg_resources module.)

PyYAML Documentation

| YAML tag | Python type |
| --- | --- |
| Standard YAML tags | |
| !!null | None |
| !!bool | bool |
| !!int | int or long (int in Python 3) |
| !!float | float |
| !!binary | str (bytes in Python 3) |
| !!timestamp | datetime.datetime |
| !!omap, !!pairs | list of pairs |
| !!set | set |
| !!str | str or unicode (str in Python 3) |
| !!seq | list |
| !!map | dict |
| Python-specific tags | |
| !!python/none | None |
| !!python/bool | bool |
| !!python/bytes | (bytes in Python 3) |
| !!python/str | str (str in Python 3) |
| !!python/unicode | unicode (str in Python 3) |
| !!python/int | int |
| !!python/long | long (int in Python 3) |
| !!python/float | float |
| !!python/complex | complex |
| !!python/list | list |
| !!python/tuple | tuple |
| !!python/dict | dict |
| Complex Python tags | |
| !!python/name:module.name | module.name |
| !!python/module:package.module | package.module |
| !!python/object:module.cls | module.cls instance |
| !!python/object/new:module.cls | module.cls instance |
| !!python/object/apply:module.f | value of f(...) |

Use Case Examples

Importing Data files

Python implementation

from pkg_resources import resource_filename
from Datasnakes.Manager.config import references
path = resource_filename(references.__name__, "Local_Reliability_Measures_from_Sets_of_Co-optimal_Multiple_Sequence_Alignments.pdf")

YAML implementation

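# The tag must be unquoted (a quoted value is loaded as a plain string) and needs an unsafe loader such as yaml.unsafe_load.
path: !!python/object/apply:pkg_resources.resource_filename [Datasnakes.Manager.config.references, "Local_Reliability_Measures_from_Sets_of_Co-optimal_Multiple_Sequence_Alignments.pdf"]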

Importing path-like objects

Python implementation

from pathlib import Path
NCBI_refseq_release = Path('NCBI/refseq/release') 

YAML implementation

# Unquoted so the tag is applied; load with yaml.unsafe_load, since safe_load rejects python/* tags.
NCBI_refseq_release: !!python/object/apply:pathlib.Path ['NCBI/refseq/release']

Integrate full pipeline

Things to remember for each step

  • Background scripts
    • Weekly or Monthly Updates for BLAST database (1 degree of archiving)
    • Weekly or Monthly Updates for GenBank database (1 degree of archiving)
    • Email updates for milestones, and errors
    • Logging
    • Master Python and BASH script
      • Configuration script
  • Create tutorials with Markdown

Tying it all together

  • 1: Integrate BLAST and GenBank Gathering
  • 2: Integrate Multi-FASTA creation
  • 3: Integrate Alignment
    • Integrate Guidance2
    • Integrate Pal2Nal
  • 4: Integrate Tree generation
    • Integrate IQTree
  • 5: Integrate PAML analysis
  • 6: Integrate R w/ python (send files from each server - DataMana)
  • 7: Integrate data w/ rshiny app
    • ggtree integration

Copied from original issue: grabear/Orthologs-Project#22

Finish Tools Module

The tools module needs to be finished.

  • ftp
    • test written?
    • readme written?
    • class/function written?
  • logit
    • test written?
    • readme written?
    • class/function written?
    • integrate with package :person_with_blond_hair:
  • mygene
    • test written?
    • readme written?
    • class/function written?
  • pandoc
    • test written?
    • readme written?
    • class/function written?
  • parallel
    • test written?
    • readme written?
    • class/function written?
  • pybasher
    • test written?
    • readme written?
    • class/function written?
  • pbs
    • test written?
    • readme written?
    • class/function written?
  • send2server
    • test written?
    • readme written?
    • class/function written?
  • slackify (will be removed)
    • test written?
    • readme written?
    • class/function written?
    • integrate with package :person_with_blond_hair:
  • other-utils (archived)
    • test written?
    • readme written?
    • class/function written?
  • mpi
    • test written?
    • readme written?
    • class/function written?

Find a pipeline workflow package and integrate it

Top Tasks

  • Create a simple test that integrates finished modules.
  • Morph existing pipeline with argparse or janis to use on command line.
  • Enhance existing pipeline for SGE usage & multiple processes/cluster without SGE

General Workflow

  • 1: Integrate BLAST and GenBank Gathering
  • 2: Integrate Multi-FASTA creation
  • 3: Integrate Alignment
    • Integrate Guidance2
    • Integrate Pal2Nal
    • Integrate clustalomega
  • 4: Integrate Tree generation
    • Integrate IQTree
  • 5: Integrate PAML analysis
  • 6: Integrate R w/ python (send files from each server - DataMana)
  • 7: Integrate data w/ rshiny app
    • ggtree integration
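
For the argparse task, a command-line entry point could look roughly like the sketch below. The program name, the step names, and the `--start`/`--stop` options are hypothetical labels for the workflow stages listed above, not an existing interface.

```python
import argparse

# Step names are hypothetical labels for the workflow stages listed above.
STEPS = ["blast", "genbank", "fasta", "align", "tree", "paml"]


def build_parser():
    """Build a parser that selects a contiguous range of pipeline steps."""
    parser = argparse.ArgumentParser(prog="orthoevol-pipeline")
    parser.add_argument("accession_file", help="master accession CSV")
    parser.add_argument("--start", choices=STEPS, default=STEPS[0],
                        help="first step to run")
    parser.add_argument("--stop", choices=STEPS, default=STEPS[-1],
                        help="last step to run")
    return parser


def selected_steps(start, stop):
    """Return the steps from start to stop, inclusive."""
    first, last = STEPS.index(start), STEPS.index(stop)
    if first > last:
        raise ValueError("--start must not come after --stop")
    return STEPS[first:last + 1]


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["genes.csv", "--start", "align", "--stop", "tree"])
print(selected_steps(args.start, args.stop))
```

Restricting `--start` and `--stop` to known step names lets argparse reject typos before any long-running step begins.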

Project Structure

From @grabear on April 10, 2017 15:58

Project structure is of immediate priority and interest. However, the structure is currently in flux. We will have to continue to modify it based on our needs until we reach something stable.

Copied from original issue: grabear/Orthologs-Project#21

Deciding on DocString style

I think PyCharm uses the reST (reStructuredText) docstring style. @grabear

def function(name, state):
    """This function does something.

    :param name: The name to use.
    :type name: str.
    :param state: Current state to be in.
    :type state: bool.
    :returns:  int -- the return code.
    :raises: AttributeError, KeyError

    """

Are you okay with this format, since you're already using something similar? I'm going to work on adding it to all the docstrings I've written. It won't be hard.

Also, there's an example here. Since we're already using sphinx for documentation, I think it'd be great to use it.

Speeding up BLAST (Phasing out GI numbers)

So in order to keep our code-base up to date, we need to drop the GI list functionality.
link

I think it would be best for long term development if we do this. It would simply require changing the following:

  • -outfmt '%g %T' to -outfmt '%a %T' (blastdbcmd)
  • gilist parameter to seqidlist parameter (NCBIblastnCommandline)

So instead of creating a GI list, we would create an Accession.Version list.

To my understanding this should be a relatively seamless transition.
Alternatively we can also create a masked database, but I know you have more experience with that than I do @sdhutchins
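
The two changes can be sketched as argument lists for the BLAST+ command-line tools. This is only an illustration: the helper names, file names, and sample output lines are made up, and nothing is executed here (the lists could be handed to subprocess.run on a machine with BLAST+ installed).

```python
def blastdbcmd_args(db):
    """Arguments that list every entry as 'accession.version taxid'."""
    return ["blastdbcmd", "-db", db, "-entry", "all", "-outfmt", "%a %T"]


def accessions_for_taxid(lines, taxid):
    """Pick the accession.version IDs whose taxid matches."""
    wanted = []
    for line in lines:
        accession, line_taxid = line.split()
        if line_taxid == str(taxid):
            wanted.append(accession)
    return wanted


def blastn_args(query, db, seqidlist, out):
    """blastn arguments using -seqidlist where -gilist was used before."""
    return ["blastn", "-query", query, "-db", db,
            "-seqidlist", seqidlist, "-out", out]


# Made-up blastdbcmd output lines, for illustration only.
sample = ["XM_000001.1 9544", "NM_000002.2 9606", "XM_000003.1 9544"]
print(accessions_for_taxid(sample, 9544))
print(blastn_args("query.fasta", "refseq_rna", "9544.txt", "result.xml"))
```

The filtered accessions would be written one per line to the file passed as `-seqidlist`.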

Module Testing

  • Cookies
  • Manager
    • DatabaseManagement
    • ProjectManagement
  • Orthologs
    • Align
    • Blast
    • Genbank
    • CompGenetics
    • Phylogenetics
      • PAML
      • PhyML
  • Tools
    • ftp
    • multiprocess
    • qsub
    • logit
    • mygene
  • Add coverage
  • Use pytest

Figure out Perl to Python tags.

Manager/BioSQL is causing our code to be tagged with Perl instead of Python.

One solution is to convert the Perl scripts to Python scripts. GitHub's language detection (Linguist) can also be overridden with a .gitattributes file.

Requirements

So far this is the list of requirements I have. I used pipreqs to get this list.

ete3==3.0.0b35
pandas==0.19.2
pexpect==4.2.1
slacker==0.9.42
biopython==1.68
tablib==0.11.4
mygene==3.0.0
cookiecutter==1.5.1
flask==0.12

pip install pipreqs
Then, pipreqs [options] <path>

Please add any other packages that are not part of the standard Python distribution.

Convert pathlib calls.

Currently, paths are handled using the pathlib.Path class.

  • Convert pathlib.Path calls to pathlib.PurePath calls and use pathlib.Path more like the os library.

    • Instead of having to do Path(home) / Path('target') / Path('file'), we can do PurePath(home, 'target', 'file'). Investigate the pathlib library to see if you can use an iterable with PurePath.
  • Exclude this for the main path variables that represent the directories for the different Cookies in the Mana, RepoMana, UserMana, ProjectMana, WebMana, and soon-to-be DatabaseMana classes.
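
On the iterable question: the PurePath constructor takes path segments as separate positional arguments, so an iterable can be unpacked with `*`. The short sketch below uses PurePosixPath so the result is the same on every platform; the `home` value is a placeholder.

```python
from pathlib import PurePosixPath

home = "/home/user"  # placeholder directory

# Chained form from the issue text.
chained = PurePosixPath(home) / PurePosixPath("target") / PurePosixPath("file")

# Single-call form: the constructor accepts several segments.
joined = PurePosixPath(home, "target", "file")

# An iterable has to be unpacked with *; passing the list itself raises TypeError.
parts = [home, "target", "file"]
from_iterable = PurePosixPath(*parts)

print(chained == joined == from_iterable)
print(joined)
```

The same multi-segment call works for pathlib.Path, so switching to PurePath is only needed where no filesystem access is wanted.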

Test or Rewrite Send2Server module

The goal of the send2server module is to create a class that sends data (zipped files or individual files) from one server to another server.

Please take a look at our send2server module and test it, redo it completely, or add to it.

How to contribute

  1. Fork our repo
  2. Check out the dev-master branch
  3. Create your own branch from the dev-master branch
  4. Add and test your updates.
  5. Submit a PR that adds the updates to the dev-master branch

Re-address the current ways that the various Orthologs modules handle the data/databases.

Problem:

When using CompGenObjects, the initial runtime for a full file can take up to 5 minutes.

Solution:

In order to speed this up, it would be good to add the entire csv file to a database as is. After CompGenObjects has created its pre/post blast dictionaries, it would be good to store these there too. The duplicates dictionary takes the longest amount of time.
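
A minimal sketch of the caching idea with the standard csv and sqlite3 modules follows. The CSV contents are made up, and storing one row per gene/organism cell is an assumption made to keep the duplicate query simple; it is not how CompGenObjects currently works.

```python
import csv
import io
import sqlite3

# Made-up accession data; the real master accession file has more columns.
CSV_TEXT = """Gene,Homo_sapiens,Macaca_mulatta
GENE_A,ACC_1.1,ACC_2.1
GENE_B,ACC_3.1,ACC_2.1
"""

conn = sqlite3.connect(":memory:")  # use a file path to persist between runs
conn.execute("CREATE TABLE accession (gene TEXT, organism TEXT, accession TEXT)")

# Store the CSV in long form: one row per (gene, organism) cell.
for row in csv.DictReader(io.StringIO(CSV_TEXT)):
    gene = row.pop("Gene")
    conn.executemany(
        "INSERT INTO accession VALUES (?, ?, ?)",
        [(gene, organism, acc) for organism, acc in row.items()],
    )

# Accessions used by more than one gene (a stand-in for the duplicates dictionary).
duplicates = conn.execute(
    "SELECT accession, GROUP_CONCAT(gene) FROM accession "
    "GROUP BY accession HAVING COUNT(*) > 1"
).fetchall()
print(duplicates)
```

With a database file instead of `:memory:`, the duplicate lookup would only need to be recomputed when the accession file changes.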

Change ProjMana.__init__()

ProjMana.__init__() should change so that anyone can use the base-level project structure without also having to use RepoMana, WebMana, etc. This will make it easier to manage projects individually, without having to create the entire framework aimed at Flask.

Create a Pull Request Template

It could be a good idea to create a pull request template.

Details on creating one are in this article.

Maybe it should look like below. Not sure what's common.

Fixes # .


Changes proposed in this pull request:
- 
- 
- 

Attention: @datasnakes/snakes
