Remove `Archive` folder

We need to migrate the `Archive` folder to its own private GitHub repository. I think there are still some insecure things in there (I'm not sure I removed all of them). @grabear

Reference this.

GI deprecation

NCBI is phasing out GI numbers, so we need to drop them from our code as well.

In order to begin this process I have commented out code and created todos:

# TODO: GI deprecation.

The following commits were made on individual files for deprecating the GI functionality:

DatabaseManagement: 9713e87
get_gi_list.py: f9c5956
CompGenBlastn
- 92db4ac
- 8a20c32
Blast README: 6a5ac79
Blast\utils.py: 2f94c21

Alignment Class

Decide on different alignment strategies that we would like to be able to employ.

Flask will be useful in database management, login functionality, role functionality (admin vs user), etc. Flask will be used to manage our shiny applications/URLs, which is another dive into the unknown. However, this should be a relatively seamless issue.

Copied from original issue: grabear/Orthologs-Project#20

Agree on a line length for our package.

PEP8 convention calls for a line length of ~80 characters:

Limit all lines to a maximum of 79 characters.

For flowing long blocks of text with fewer structural restrictions (docstrings or comments), the line length should be limited to 72 characters.

However, I prefer something a bit longer, around 120 characters. I don't mind using 120 for both code and docstrings, but I can also do 120 for code and 72 for docstrings.

Let me know if you prefer something different @sdhutchins.

Clean docstrings

Use pydocstyle as referenced by @datasnakes/snakes to update all docstrings.

python 3.4 compatibility

Issue with importing Manager.

Update Alignment configuration.

While the other modules are configuring successfully now, the data management class's configuration for the alignment class needs to change. Since each alignment takes different parameters, each type of alignment will have its own nested key-value pairs in the Alignment_config variable that exists in the YAML files. The Guidance configuration, for example, will look like this:

Alignment_config:
  aln_program: 'GUIDANCE2'
  Guidance_config:
    seqFile: "~\\Datasnakes-Scripts\\Datasnakes\\Manager\\config\\test.faa"
    msaProgram: "CLUSTALW"
    seqType: "nuc"

Further down the road we can do this for a consensus alignment:

Alignment_config:
  aln_program: 'GUIDANCE2'
  Consensus_config:
    Guidance_config:
      seqFile: "~\\Datasnakes-Scripts\\Datasnakes\\Manager\\config\\test.faa"
      msaProgram: "CLUSTALW"
      seqType: "nuc"
    IQ-Tree_config:
      var1:  "hope"
      var2: "dreams"

Obviously there are other options, but for now we've got to KISS.
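A minimal sketch of how the simpler (Guidance) layout above could be consumed once the YAML has been loaded into a dictionary. The names `PROGRAM_KEYS` and `get_program_config` are hypothetical and not part of the package; the dictionary simply mirrors the first example.

```python
# Hypothetical mapping from the aln_program value to its nested config key.
PROGRAM_KEYS = {"GUIDANCE2": "Guidance_config"}


def get_program_config(alignment_config):
    """Return (program, parameters) for the configured alignment program."""
    program = alignment_config["aln_program"]
    key = PROGRAM_KEYS.get(program)
    if key is None:
        raise KeyError("No nested config key is registered for %r" % program)
    if key not in alignment_config:
        raise KeyError("%s is missing from Alignment_config" % key)
    return program, alignment_config[key]


# Mirrors the Guidance example as a YAML loader would return it.
alignment_config = {
    "aln_program": "GUIDANCE2",
    "Guidance_config": {
        "seqFile": "~\\Datasnakes-Scripts\\Datasnakes\\Manager\\config\\test.faa",
        "msaProgram": "CLUSTALW",
        "seqType": "nuc",
    },
}

program, params = get_program_config(alignment_config)
print(program, sorted(params))
```

Keeping the program-to-key mapping in one place means a new alignment program only needs a new entry and a new nested block in the YAML file.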

Use SQL for our project and data.

From @grabear on April 10, 2017 17:5

Create an SQL schema that will hold our data.
See which we should do:

Alter the Biopython BioSQL schema so we can add our data.
Use other schema for our data.
Start from scratch.

Data to add:

Accession file (Lister class)
- Missing accession data
  - Based on Organism
  - Based on Gene
- Duplicate accession data
  - Based on Organism
  - Based on Gene
GenBank files
- Database reference or foreign key
FASTA files
- Multi FASTA
- Alignment files
Phylogenetic Tree files
- Newick format, etc.
PAML data

The flask based login page will store user data, which will be accessible from a separate database.
This will also hold other form data that's created by each user, when they register or alter their account information.

Copied from original issue: grabear/Orthologs-Project#24

Migrating Old Repository Issues and Projects

I'm just now getting to this... I'm going to have to port over some of the projects and issues from the old repository. I want to delete it once those issues are back in our field of view. @sdhutchins, will this be a problem?

Clean Scripts

Comment out any unused imports
Comment any tricky or unclear code
Improve style
- Conform to PEP8
- Ensure style is conserved across all scripts

Make a list of files that are using access tokens or some sort of keys.

From @grabear on April 14, 2017 0:2

GitHub with our Applications page
- R
- Shows commits, messages, and committer
SSH?
Slackify in Tools module

Copied from original issue: grabear/Orthologs-Project#26

new_basic_project Cookie for standalone projects.

...data
...databases
...index
...raw_data

Database Management for the pipeline.

Database Management will first be developed in Manager/database_management.py
Another module for creating template BioSQL databases will be developed in Manager/BioSQL/biosql.py
It will help keep the following databases updated:
- ETE3's NCBI-taxonomy database
- Local NCBI databases
  - blast database /blast/db NEW (May 2019)
  - GenBank flat files from NCBI's Refseq release (BioSQL) /refseq/release/<collection_subset>
  - ~~[ ] gi lists OR Should we convert this to accession.version via this or this~~
    - ~~vertebrate_mammalian~~

Integrate LogIt into package

Use LogIt as a general logging tool.
- Use in each module and make a master log.
- Give each module a specific color.
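The two bullets above could look roughly like the sketch below. It uses only the standard `logging` module and is not LogIt's actual API; the module names, the colors, and the `master.log` file name are assumptions for illustration.

```python
import logging
import os
import tempfile

# Hypothetical ANSI color per module.
MODULE_COLORS = {"Orthologs": "\033[36m", "Manager": "\033[35m"}
RESET = "\033[0m"


class ColorFormatter(logging.Formatter):
    """Wrap each console record in the color assigned to its logger name."""

    def format(self, record):
        message = super().format(record)
        color = MODULE_COLORS.get(record.name, "")
        return color + message + RESET if color else message


def get_logger(name, master_log):
    """Return a logger that writes to the console in color and to the master log."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        console = logging.StreamHandler()
        console.setFormatter(ColorFormatter("%(name)s - %(levelname)s - %(message)s"))
        master = logging.FileHandler(master_log)
        master.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
        logger.addHandler(console)
        logger.addHandler(master)
    return logger


master_log = os.path.join(tempfile.mkdtemp(), "master.log")
get_logger("Orthologs", master_log).info("BLAST started")
get_logger("Manager", master_log).info("Project created")
```

Every module gets its own named logger, while all of them append to the same file, so the master log stays free of color codes and can be searched by module name.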

Documentation

As it stands, we are in a fairly stable place. #36, #33, #66, and others have all given our project a well-defined shape and functionality.

For me in particular, I need to add documentation to the Classes and modules that I've created.

Combine data_management and the Luigi class

What are you trying to accomplish?

The Pipeline module is great. Whenever #38 is finished, the functionality in Manager/data_management.py needs to be used in, or moved into, the Pipeline module.

What is the issue?

The workflow is awkward as it stands.
Will help with #32.

Works Cited

Go through the code and record works cited where necessary in an md or rst document.
Comment code with Works Cited index numbers.
Do research to see if there is a standard way to do works cited for programming projects.

    Format:
{
file: 'file',
path: 'path',
line: 000,
citation: 'biopython',
etc: 'etc'
}

Rename and refactor package

Before the official release, we need to decide on renaming the package and refactoring it.

Rename repository from Datasnakes-Scripts to OrthoEvolution
Refactor Datasnakes to OrthoEvol

Feel free to change the name of this.

Update readthedocs, wiki, and READMEs

We need to update the readthedocs and wiki with our current documentation (mainly the tutorial).

I'm considering changing the style of the readthedocs page.

As it stands, we are in a fairly stable place. #36, #33, #66, and others have all given our project a well-defined shape and functionality.

We need to add documentation to the Classes and modules that we've created.

Make a list of startup processes. (host:port if applicable)

From @grabear on April 13, 2017 23:58

Cockpit (162.243.56.106:9090)
Shiny Server (162.243.56.106:3838)
R-Studio Server (162.243.56.106:8787)
Flask (162.243.56.106:5252)

- Wasabi (162.243.56.106:8000)
- FTP
- RichFilemanager

Copied from original issue: grabear/Orthologs-Project#25

Add better functionality to the Tier system used by the Accession file.

For the master accession files, we can control the workflow by giving the end user access to
priority labels that we define. For instance...

Our default set of tier labels:

Priority Levels:
- Higher priority genes are put through the pipeline first.
- Numerical ranges from 1 to infinity
  - 0 = Do not execute
  - 1 = High Priority
  - 100000000 = Low Priority
- Alternatively allow the user to create a YAML file
  - This allows for strings in the .csv files
  - For instance:
```
 Opioid: 1
 Alcohol: 2
 Cocaine: 3
 Opioid/Cocaine: 4
```
- String levels
  - "Skip" could actually skip over genes that don't need to be analyzed right away
  - "Isoform-1-Insert-Gene-Here" could be used for identifying isoforms
    - Isoform-0-HTR1A (the desired gene to analyze)
    - Isoform-1-HTR1A
```
 Isoform-0-HTR1A: 1
 Isoform-1-HTR1A:  "Skip"
```

This could also be very useful when we get to parsing updated HGNC files directly.
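A rough sketch of how the tier rules above might be resolved in code. `TIER_LABELS`, `resolve_priority`, and `order_genes` are hypothetical names, the mapping stands in for a user-supplied YAML file, and every gene symbol other than HTR1A is a placeholder.

```python
# Hypothetical label-to-priority mapping, as it might be loaded from the user's YAML file.
TIER_LABELS = {
    "Opioid": 1,
    "Alcohol": 2,
    "Cocaine": 3,
    "Opioid/Cocaine": 4,
    "Isoform-1-HTR1A": "Skip",
}


def resolve_priority(tier, labels=TIER_LABELS):
    """Return an integer priority, or None when the gene should not be run."""
    value = labels.get(tier, tier)  # fall back to the raw value for numeric tiers
    if isinstance(value, str):
        if value.strip().lower() == "skip":
            return None
        value = int(value)  # numeric tiers read from a .csv file arrive as strings
    return value if value > 0 else None  # 0 = do not execute


def order_genes(gene_tiers):
    """Sort genes so that lower numbers (higher priority) come first."""
    resolved = ((gene, resolve_priority(tier)) for gene, tier in gene_tiers.items())
    runnable = [(priority, gene) for gene, priority in resolved if priority is not None]
    return [gene for priority, gene in sorted(runnable)]


# Placeholder genes and tiers for illustration.
genes = {
    "HTR1A": "Opioid",
    "GENE_B": "Cocaine",
    "GENE_C": "Alcohol",
    "HTR1A-alt": "Isoform-1-HTR1A",
    "GENE_D": "0",
}
print(order_genes(genes))
```

In this sketch an unknown, non-numeric label raises a `ValueError`, which may or may not be the desired behavior for user-edited accession files.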

Index File Consistency

Minimization

Right now we have a ton of different files we are using as an index for our project. I'd like to propose that we narrow it down to a minimum. Currently we have the following:

1. MAFV3.1.csv
2. MAFV3.2.csv
3. MAFV3.3.csv
4. Homo_sapiens_Accession.csv
5. Macaca_mulatta_Accession.csv
6. Organisms & Taxonomy IDs.csv
7. Organisms.csv
8. taxids.csv
9. ShinyGenes.csv
10. ShinyOrganisms.csv
11. shinydir_mana.txt
12. commonnames.csv

The first 3 are test files for the Analysis class so they will eventually go.
We can get rid of 4, 5, and 6 because they are redundant.
We can keep 8 for now, but I'd like to work the taxon IDs into our Master Accession file (it will be easier to keep them ordered this way) for offline use, and get them from ete3 for online use.
For 9 and 10 we were using these because of weird compatibility with R and Python, but I know there is a much better way to deal with this via csv file/dataframe. So we can get rid of these two files, but only after we fix the shiny app code that deals with these files.
11 is a configuration file for the shiny app so we can do something different with this...
And lastly 12. This file is useful but we can do it programmatically as well, so this also seems redundant.
I'd also like to keep the GI files here.
Note: This is just from my perspective so if you need any one of these files just say so. I just wanted to narrow down on what we have and standardize the naming conventions.

Standardization

In order to standardize things we need to come up with a naming convention. I propose that we use extensions to identify file types:

For accession files we denote them with a .ma. The user (for now just us) can name them ANYTHING!! So it could be Crazy-MAST-ACC.ma.csv
For taxids it could be .tax.
Our database files (BLAST and RefSeq) will be different. We can access them via directory name.
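A small sketch of how the proposed marker extensions could be detected with `pathlib`. The `KINDS` mapping and the `classify` function are hypothetical names used only for illustration.

```python
from pathlib import PurePath

# Hypothetical mapping from the proposed marker extensions to a file kind.
KINDS = {".ma": "master accession", ".tax": "taxonomy ids"}


def classify(filename):
    """Return the kind of index file based on any marker extension in its name."""
    for suffix in PurePath(filename).suffixes:  # e.g. ['.ma', '.csv']
        if suffix in KINDS:
            return KINDS[suffix]
    return "unknown"


print(classify("Crazy-MAST-ACC.ma.csv"))
print(classify("taxids.csv"))
```

Because the check looks at every suffix, the user-chosen part of the name never matters, which matches the idea that accession files can be named anything.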

Abstract the functions that use cookiecutter. (Manager to Cookie)

Remove the various cookiecutter functions from the Manager/management.py file, and put them in the Cookies module via cookies.py.

Using YAML tags and Python types with PyYAML.

(Note: This is mostly just a forum for discussing different ways of using these YAML tags to work with various Python modules. They make it convenient to store path-like objects and to use the pkg_resources module.)

PyYaml Documentation

| YAML tag | Python type |
| --- | --- |
| Standard YAML tags | |
| !!null | None |
| !!bool | bool |
| !!int | int or long (int in Python 3) |
| !!float | float |
| !!binary | str (bytes in Python 3) |
| !!timestamp | datetime.datetime |
| !!omap, !!pairs | list of pairs |
| !!set | set |
| !!str | str or unicode (str in Python 3) |
| !!seq | list |
| !!map | dict |
| Python-specific tags | |
| !!python/none | None |
| !!python/bool | bool |
| !!python/bytes | (bytes in Python 3) |
| !!python/str | str (str in Python 3) |
| !!python/unicode | unicode (str in Python 3) |
| !!python/int | int |
| !!python/long | long (int in Python 3) |
| !!python/float | float |
| !!python/complex | complex |
| !!python/list | list |
| !!python/tuple | tuple |
| !!python/dict | dict |
| Complex Python tags | |
| !!python/name:module.name | module.name |
| !!python/module:package.module | package.module |
| !!python/object:module.cls | module.cls instance |
| !!python/object/new:module.cls | module.cls instance |
| !!python/object/apply:module.f | value of f(...) |

Use Case Examples

Importing Data files

Python implementation

from pkg_resources import resource_filename
from Datasnakes.Manager.config import references
path = resource_filename(references.__name__, "Local_Reliability_Measures_from_Sets_of_Co-optimal_Multiple_Sequence_Alignments.pdf")

YAML implementation

path: !!python/object/apply:pkg_resources.resource_filename [Datasnakes.Manager.config.references, "Local_Reliability_Measures_from_Sets_of_Co-optimal_Multiple_Sequence_Alignments.pdf"]

Importing path-like objects

Python implementation

from pathlib import Path
NCBI_refseq_release = Path('NCBI/refseq/release')

YAML implementation

NCBI_refseq_release: !!python/object/apply:pathlib.Path ['NCBI/refseq/release']
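The sketch below shows how the path-like example might be loaded. It assumes PyYAML 5.1 or later, where `yaml.UnsafeLoader` is available. Python-specific tags are rejected by `yaml.safe_load`, so they need the unsafe loader, which should only ever be given trusted files. It also shows that a tag wrapped in quotes is read as an ordinary string.

```python
from pathlib import Path

import yaml  # PyYAML, assumed to be version 5.1 or later

document = "NCBI_refseq_release: !!python/object/apply:pathlib.Path ['NCBI/refseq/release']\n"

# The unsafe loader can call arbitrary Python callables, so use it on trusted files only.
config = yaml.load(document, Loader=yaml.UnsafeLoader)
print(type(config["NCBI_refseq_release"]))

# Quoting the whole value turns it into a plain string; the tag is not applied.
quoted_document = "NCBI_refseq_release: \"!!python/object/apply:pathlib.Path ['NCBI/refseq/release']\"\n"
quoted = yaml.safe_load(quoted_document)
print(type(quoted["NCBI_refseq_release"]))
```

If running arbitrary callables from configuration files is a concern, an alternative is to keep plain strings in the YAML and convert them with `Path(...)` after loading.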

Integrate full pipeline

Things to remember for each step

Background scripts
- Weekly or Monthly Updates for BLAST database (1 degree of archiving)
- Weekly or Monthly Updates for GenBank database (1 degree of archiving)
- Email updates for milestones, and errors
- Logging
- Master Python and BASH script
  - Configuration script
Create tutorials with Markdown

Tying it all together

Copied from original issue: grabear/Orthologs-Project#22

Finish Tools Module

The tools module needs to be finished.

Re-Create the Digital Ocean Server.

Clean the Digital Ocean Server and Restart. Create R-Markdown documentation along the way.

Convert StreamIEO class to LogIT class method.

Via eb891a8#commitcomment-25360680
We should create a StreamHandler() logzero logger to replace the StreamIEO.

Create an Icon or Image for the package.

What are you trying to accomplish?

Branding our package.

Maybe a phylogenetic tree with branches that are DNA strands.

For Example:

Find a pipeline workflow package and integrate it

Top Tasks

Create a simple test that integrates finished modules.
Morph existing pipeline with argparse or janis to use on command line.
Enhance existing pipeline for SGE usage & multiple processes/cluster without SGE

General Workflow

ETE3PAML - Clean and polish

@sdhutchins needs to simplify this and integrate IQtree

Project Structure

From @grabear on April 10, 2017 15:58

Project structure is of immediate priority and interest. However, the structure is currently in flux. We will have to continue to modify it based on our needs until we reach something stable.

Copied from original issue: grabear/Orthologs-Project#21

Command Line Interface

Develop a CLI. This will have to be developed in another release.

Deciding on DocString style

I think PyCharm is using reST (reStructuredText) docstrings. @grabear

def function(name, state):
    """This function does something.

    :param name: The name to use.
    :type name: str.
    :param state: Current state to be in.
    :type state: bool.
    :returns:  int -- the return code.
    :raises: AttributeError, KeyError

    """

Are you okay with this format, since you're already using something similar? I'm going to work on adding it to all the docstrings I've written. It won't be hard.

Also, there's an example here. Since we're already using sphinx for documentation, I think it'd be great to use it.

Speeding up BLAST (Phasing out GI numbers)

So in order to keep our code-base up to date, we need to drop the GI list functionality.
link

I think it would be best for long term development if we do this. It would simply require changing the following:

-outfmt '%g %T' to -outfmt '%a %T' (blastdbcmd)
gilist parameter to seqidlist parameter (NCBIblastnCommandline)

So instead of creating GI lists, we would create Accession.Version lists.

To my understanding this should be a relatively seamless transition.
Alternatively we can also create a masked database, but I know you have more experience with that than I do @sdhutchins
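A sketch of what the switch could look like when the commands are assembled in Python. It only builds argument lists and parses text, so nothing is executed; the helper names are hypothetical and the sample lines are made-up output, not real query results.

```python
from collections import defaultdict


def blastdbcmd_args(db):
    """Arguments for dumping 'accession.version taxid' pairs instead of 'gi taxid'."""
    return ["blastdbcmd", "-db", db, "-entry", "all", "-outfmt", "%a %T"]


def group_by_taxid(lines):
    """Group accession.version identifiers by taxonomy id."""
    groups = defaultdict(list)
    for line in lines:
        if line.strip():
            accession, taxid = line.split()
            groups[taxid].append(accession)
    return dict(groups)


def blastn_args(query, db, seqidlist, out):
    """Arguments for a blastn search restricted with -seqidlist rather than -gilist."""
    return ["blastn", "-query", query, "-db", db, "-seqidlist", seqidlist, "-out", out]


# Made-up blastdbcmd output lines, for illustration only.
sample = ["NM_000001.1 9606", "XM_000002.2 9544", "NM_000003.3 9606"]
print(group_by_taxid(sample))
```

Each group could then be written to its own text file, one identifier per line, and that file passed as the `seqidlist` argument.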

AQUA development and integration.

From @grabear on April 10, 2017 16:56

Create an AQUA script for quality analysis of alignments.

Copied from original issue: grabear/Orthologs-Project#23

Module Testing

Weird importing issue.

03c6514#commitcomment-24706630

When using DataManagement, there was an Error when importing CompGenBLASTn.

Implement better logging and notification.

https://github.com/liiight/notifiers

Figure out Perl to Python tags.

Manager/BioSQL is causing our code to be tagged with Perl instead of Python.

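One solution is to change the Perl scripts to Python scripts. Another is to add a Linguist override (for example `linguist-vendored`) for those files in a .gitattributes file so that they are left out of the language statistics.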

Requirements

So far this is the list of requirements I have. I used pipreqs to get this list.

ete3==3.0.0b35
pandas==0.19.2
pexpect==4.2.1
slacker==0.9.42
biopython==1.68
tablib==0.11.4
mygene==3.0.0
cookiecutter==1.5.1
flask==0.12

pip install pipreqs
Then, pipreqs [options] <path>

Please add any other packages that are not part of the standard Python distribution.

Convert pathlib calls.

Currently paths are handled using the pathlib.Path class.

Convert pathlib.Path calls to pathlib.PurePath calls and use pathlib.Path more like the os library.
- Instead of having to do Path(home) / Path('target') / Path('file'), we can do PurePath(home, 'target', 'file'). Investigate the pathlib library to see if you can use an iterable with PurePath.
Exclude this for the main path variables that represent the directories for the different Cookies in the Mana, RepoMana, UserMana, ProjectMana, WebMana, and soon-to-be DatabaseMana classes.
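On the iterable question: `PurePath` accepts any number of positional segments, so an iterable can be unpacked with `*`. A short sketch, with `home` and `parts` as placeholder values:

```python
from pathlib import Path, PurePath

home = "home"
parts = ["target", "file"]

joined_with_operator = Path(home) / Path("target") / Path("file")
joined_with_args = PurePath(home, "target", "file")
joined_from_iterable = PurePath(home, *parts)  # unpack any iterable of segments

print(joined_with_args)
print(joined_from_iterable == joined_with_args)
print(PurePath(joined_with_operator) == joined_with_args)
```

Note that `PurePath` objects only manipulate path strings; anything that touches the filesystem, such as `exists()` or `mkdir()`, still needs a concrete `Path`.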

Create repository for external apps - Linux Admin Experience

We need to create a repository for the external apps we're using with this package for users to easily install them.
See ETE's repository

Apps to add:

Abstract the configuration code in the Orthologs submodules

CompGenObjects
GenBank
Alignment
Soon to be Phylogenetics/PAML

All of these classes use the same configuration as seen in ff1cb04.
It would help with readability to abstract these configurations into a function that exists elsewhere (utils.py?).

Test or Rewrite Send2Server module

The goal of the send2server module is to create a class that sends data (zipped files or individual files) from one server to another server.

Please take a look at our send2server module and test it, redo it completely, or add to it.

How to contribute

Fork our repo
Check out the dev-master branch
Create your own branch from the dev-master branch
Add and test your updates.
Submit a PR that adds the updates to the dev-master branch

Re-address the current ways that the various Orthologs modules handle the data/databases.

Problem:

When using CompGenObjects, the initial runtime for a full file can take up to 5 minutes.

Solution:

In order to speed this up, it would be good to add the entire csv file to a database as is. After CompGenObjects has created its pre/post blast dictionaries, it would be good to store these there too. The duplicates dictionary takes the longest amount of time.
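A minimal sketch of the "add the entire csv file to a database as is" idea, using the standard `sqlite3` module. The table name, the column names, and the sample row are assumptions and do not reflect the real accession file layout.

```python
import csv
import io
import sqlite3


def load_csv(connection, table, handle):
    """Copy a CSV file into a table as-is, with one TEXT column per header field."""
    reader = csv.reader(handle)
    header = next(reader)
    columns = ", ".join('"%s" TEXT' % name.replace('"', '""') for name in header)
    placeholders = ", ".join("?" for _ in header)
    connection.execute('CREATE TABLE IF NOT EXISTS "%s" (%s)' % (table, columns))
    connection.executemany('INSERT INTO "%s" VALUES (%s)' % (table, placeholders), reader)
    connection.commit()


# Made-up rows standing in for a master accession file.
sample = io.StringIO("Tier,Gene,Homo_sapiens,Macaca_mulatta\n1,HTR1A,ACC_0001.1,ACC_0002.1\n")
connection = sqlite3.connect(":memory:")
load_csv(connection, "accessions", sample)
rows = connection.execute('SELECT "Gene", "Homo_sapiens" FROM accessions').fetchall()
print(rows)
```

The pre/post BLAST dictionaries and the duplicates dictionary could be cached in further tables in the same file, so that later runs read them back instead of rebuilding them.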

Change ProjMana.init()

Change it so that anyone can use the base-level project structure without also having to use RepoMana, WebMana, etc. This will make it easier to manage projects individually, without having to create the entire framework aimed at Flask.

Create a Pull Request Template

It could be a good idea to create a pull request template.

Details on creating one are in this article.

Maybe it should look like below. Not sure what's common.

Fixes # .


Changes proposed in this pull request:
- 
- 
- 

Attention: @datasnakes/snakes
