
proteofav's Introduction

ProteoFAV

Protein Features, Annotations and Variants


ProteoFAV is a Python module that addresses the challenge of cross-mapping protein structures and protein sequences, allowing protein structures to be annotated with sequence features and annotations. It implements methods for working with protein structures (via mmCIF, PDB, PDB Validation, DSSP and SIFTS files), sequence features (via UniProt GFF annotations) and genetic variants (via the UniProt/EBI Proteins, Ensembl REST and TCGA Pan-Cancer APIs). Cross-mapping of structure and sequence is performed with the aid of SIFTS.

ProteoFAV relies heavily on the Pandas library to quickly load data into DataFrames for fast data exploration and analysis. Structure and sequence data are parsed or fetched into Pandas DataFrames, which are then merged (collapsed) into a single DataFrame.

Protein structures (sequences and atom 3D coordinates) and their annotations (from structural analysis, e.g. interacting interfaces, secondary structure and solvent accessibility), as well as protein sequences and their annotations (e.g. genetic variants and other functional information obtained from SIFTS and UniProt), are handled by the classes/methods so that each modular component table can be integrated into a single 'merged table'.

[Overview figure: proteofav.png]

The methods implemented in proteofav/mergers.py allow the different components to be merged into a single Pandas DataFrame.
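
As an illustration of the idea, two component tables sharing a residue-level key can be collapsed with a plain Pandas merge. The column names below are assumptions for this sketch; proteofav/mergers.py implements the real merging logic.

import pandas as pd

# Hypothetical structure and DSSP component tables keyed by residue number
structure = pd.DataFrame({"PDB_dbResNum": ["1", "2", "3"],
                          "label_comp_id": ["MET", "GLU", "SER"]})
dssp = pd.DataFrame({"PDB_dbResNum": ["1", "2", "3"],
                     "SS": ["-", "H", "H"]})

# Collapse the components onto a single merged table
merged = structure.merge(dssp, on="PDB_dbResNum", how="left")
print(merged)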

Getting Started

Dependencies

ProteoFAV was developed to support Python 3.5+ and Pandas 0.20+. See requirements.txt for the specific requirements.

Installation

To install the stable release, run this command in your terminal:

$ pip install proteofav

If you don't have pip installed, this Python installation guide can help you through the process.

Installing from source in a virtual environment

Getting ProteoFAV:

$ wget https://github.com/bartongroup/ProteoFAV/archive/master.zip -O ProteoFAV.zip
$ unzip ProteoFAV.zip

# alternatively, cloning the git repository
$ git clone https://github.com/bartongroup/ProteoFAV.git

Installing with Virtualenv:

$ virtualenv --python `which python` env
$ source env/bin/activate
$ cd path/to/ProteoFAV
$ pip install -r requirements.txt
$ python setup.py install

Installing With Conda:

$ conda-env create -n proteofav -f path/to/ProteoFAV/requirements.txt
$ source activate proteofav
$ cd path/to/ProteoFAV
$ pip install .

Testing the installation

Test dependencies should be resolved with:

$ cd path/to/ProteoFAV
$ python setup.py develop --user

Run the tests with:

$ python setup.py test
# or
$ cd tests
$ python -m unittest discover

ProteoFAV Configuration

ProteoFAV uses a configuration file, config.ini, where the user can specify directory paths as well as URLs for commonly used data sources.

After installing, run:

$ proteofav-setup
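
This writes a config.ini that you can then edit by hand. As a minimal sketch, the generated file can also be inspected programmatically; the file location below is an assumption, so check where proteofav-setup actually wrote it:

import configparser

# Load and dump the generated configuration (the path is an assumption)
config = configparser.ConfigParser()
config.read("config.ini")
for section in config.sections():
    for key, value in config[section].items():
        print(section, key, value)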

Example Usage

Example usage is currently provided as a Jupyter Notebook, which can be viewed with GitHub's file viewer or with the Jupyter nbviewer.

You can download the Jupyter notebook from GitHub and test it against your ProteoFAV installation.

Contributing and Bug tracking

Feel free to fork, clone, share and distribute. If you find any bugs or issues, please log them in the issue tracker.

Before you submit a pull request, read the Contributing Guide.

Credits

See the Credits

Changelog

See the Changelog

Licensing

The MIT License (MIT). See license for details.

proteofav's People

Contributors: biomadeira, stuartmac, tbrittoborges


proteofav's Issues

Improve column type checking

We currently have a confirm_column_types utility method. It would probably be best to make it use a named tuple or something similar. Having a single flat dictionary with all the key-type pairs means that a key with the same name could require a different type in a different subgroup (e.g. "id" for mmCIF vs. "id" for Variants)...
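
A minimal sketch of one way to scope the expected dtypes per table, so identically named keys can carry different types. The table names, dtypes and helper below are illustrative assumptions, not the current implementation:

from collections import namedtuple

# Hypothetical per-table dtype registry: "id" can be an int for mmCIF
# atom records but a string for variant identifiers
ColumnTypes = namedtuple("ColumnTypes", ["mmcif", "variants"])
COLUMN_TYPES = ColumnTypes(
    mmcif={"id": int, "label_seq_id": int},
    variants={"id": str, "begin": int},
)

def confirm_column_types(table, dtypes):
    """Coerce each known column of the DataFrame to its expected dtype."""
    for column, dtype in dtypes.items():
        if column in table.columns:
            table[column] = table[column].astype(dtype)
    return table

# e.g. confirm_column_types(mmcif_table, COLUMN_TYPES.mmcif)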

`select_uniprot_variants` cannot handle mismatches with the canonical transcript for a given UniProt

from variants.to_table import select_uniprot_variants
select_uniprot_variants('P11586')
2016-02-08 09:36:41,670 - INFO - Starting new HTTP connection (1): www.uniprot.org 
2016-02-08 09:36:41,994 - DEBUG - "GET /uniprot/P11586 HTTP/1.1" 200 None 
2016-02-08 09:36:42,336 - INFO - Starting new HTTP connection (1): www.uniprot.org 
2016-02-08 09:36:42,386 - DEBUG - "GET /uniprot/?query=accession%3AP11586&contact=&columns=organism%2Csequence&format=tab HTTP/1.1" 200 None 
2016-02-08 09:36:42,390 - INFO - Starting new HTTP connection (1): rest.ensembl.org 
2016-02-08 09:36:42,934 - DEBUG - "GET /xrefs/symbol/homo_sapiens/P11586 HTTP/1.1" 200 222 
2016-02-08 09:36:42,937 - INFO - Starting new HTTP connection (1): rest.ensembl.org 
2016-02-08 09:36:43,474 - DEBUG - "GET /sequence/id/ENSP00000450560?type=protein HTTP/1.1" 200 261 
2016-02-08 09:36:43,475 - WARNING - Sequences don't match! skipping... ENSP00000450560 
2016-02-08 09:36:43,477 - INFO - Starting new HTTP connection (1): rest.ensembl.org 
2016-02-08 09:36:44,083 - DEBUG - "GET /sequence/id/ENSP00000216605?type=protein HTTP/1.1" 200 935 
2016-02-08 09:36:44,084 - WARNING - Sequences don't match! skipping... ENSP00000216605 
Traceback (most recent call last):
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2883, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-13-974825407562>", line 1, in <module>
    select_uniprot_variants('P11586')
  File "/Users/smacgowan/PycharmProjects/ProteoFAV/variants/to_table.py", line 444, in select_uniprot_variants
    table = pd.concat(tables)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/tools/merge.py", line 812, in concat
    copy=copy)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/tools/merge.py", line 845, in __init__
    raise ValueError('No objects to concatenate')
ValueError: No objects to concatenate

This happened because there was a mismatch between the Ensembl and UniProt sequences.

We can work around it by attempting a 'permissive' Ensembl-UniProt comparison that just checks that the sequences are of the same length and logs the number of mismatches.
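
A minimal sketch of that permissive check; the function name and the permissive flag below are assumptions for illustration, not existing ProteoFAV API:

import logging

log = logging.getLogger(__name__)

def sequences_agree(ensembl_seq, uniprot_seq, permissive=True):
    """Return True if the two sequences are close enough to cross-map."""
    if ensembl_seq == uniprot_seq:
        return True
    if permissive and len(ensembl_seq) == len(uniprot_seq):
        # Same length: accept, but log how many residues differ
        mismatches = sum(a != b for a, b in zip(ensembl_seq, uniprot_seq))
        log.warning("Sequences differ at %d positions; proceeding anyway",
                    mismatches)
        return True
    return False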

Logging config

Logging configuration should all go in the config file. Avoid hard-coded logging levels, etc.
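
A minimal sketch of one way to do this with the standard library, assuming config.ini carries the [loggers]/[handlers]/[formatters] sections that logging.config.fileConfig understands:

import logging.config

# Pull all logging setup (levels, handlers, formats) from the config file
logging.config.fileConfig("config.ini", disable_existing_loggers=False)
log = logging.getLogger("proteofav")
log.info("logging is now configured entirely from config.ini")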

Downloader should accept a file object

The new Downloader could accept a _file parameter to use a file object instead of a file path. Alternatively, we could use the following pattern:

import tempfile
# Assuming Annotation is importable like this (the exact module path may differ)
from proteofav.annotation import Annotation

uniprot_id = 'P17612'
with tempfile.NamedTemporaryFile(delete=False) as temp:
    # Download to the temporary file's path, then parse it back
    Annotation.download(identifier=uniprot_id, filename=temp.name)
    P17612_annotation = Annotation.read(filename=temp.name)

This pattern is also not working right now; I will investigate.

Design decision in _fetch_uniprot_variants

Hi @biomadeira @stuartmac

We currently have two low-level functions for parsing UniProt variants. The first one is _fetch_uniprot_variants, which is quick and dirty. The second one is something I had written up earlier but which wasn't in the code base due to its complexity. It first uses the UniProt guidelines to parse the text, then uses regexes to parse the GFF annotation, extracting the variant IDs, reference/mutated residues, disease name, is_cancer and is_germline. The biggest difference between the functions is that the first has multiple rows per residue and the second has one row per residue. I added a parameter, group_residues, to map_gff_features_to_sequence to add this behaviour to the function. I think we can keep both functions and use them as different engines in the select_variant function.

Now the issue: some proteins have multi-residue variants (e.g. in http://www.uniprot.org/uniprot/P04637.gff, look for VAR_047158, which spans two residues). In general we just look at SNPs (single nucleotide polymorphisms). So we need to decide: is this a missense variant at residues 29 and 30, at just 29, or at just 30? Or do we ignore those?
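
One option, if we keep multi-residue variants, is to expand them into one row per residue. A minimal sketch, with assumed column names:

import pandas as pd

# A hypothetical variant spanning residues 29-30, as in VAR_047158
variants = pd.DataFrame([{"id": "VAR_047158", "begin": 29, "end": 30}])

rows = []
for var in variants.itertuples():
    for residue in range(var.begin, var.end + 1):
        rows.append({"id": var.id, "residue": residue})
expanded = pd.DataFrame(rows)
print(expanded)  # one row for residue 29 and one for residue 30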

Update merge tables to work with biounits

We need to update merge_tables to work with biological units, as updated here: #25

The asym_ids (label and auth) are updated, and two new columns are added to these atom lines with the original auth/label asym_ids. For example, chains A and B become A + AA and B + BA.
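
A minimal sketch of carrying the original chain identifier alongside the biounit one; the column names and the explicit mapping below are assumptions based on the example above:

import pandas as pd

# Biounit atom records: "AA" and "BA" are generated copies of A and B
atoms = pd.DataFrame({"label_asym_id": ["A", "AA", "B", "BA"]})

# Hypothetical biounit-to-original mapping, following the example above
mapping = {"AA": "A", "BA": "B"}
atoms["orig_label_asym_id"] = atoms["label_asym_id"].map(
    lambda chain: mapping.get(chain, chain))
print(atoms)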

PDBe REST API now (v1.8) provides a uniprot_canonical flag

/mappings/all_isoforms/
+ is_canonical field added (true if the UniProt sequence is the canonical one, false otherwise).
+ The call now accepts UniProt accessions as input, including isoforms (e.g. P83949-5).

The latest version of the PDBe REST API flags whether a UniProt isoform is canonical or not. This is relevant to ProteoFAV.
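
A minimal sketch of querying that endpoint and reading the flag; the URL follows the PDBe API conventions, but the exact nesting of the JSON response below is an assumption:

import requests

url = "https://www.ebi.ac.uk/pdbe/api/mappings/all_isoforms/P83949-5"
response = requests.get(url)
response.raise_for_status()

# Walk the (assumed) accession -> UniProt -> mappings nesting
for accession, payload in response.json().items():
    for uniprot_acc, entry in payload.get("UniProt", {}).items():
        for mapping in entry.get("mappings", []):
            print(uniprot_acc, mapping.get("is_canonical"))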

Developing a more modular approach

Proposed design:

ProteoFAV's main features:
1 - Reading/parsing formatted files to pandas DataFrames (e.g. mmCIF, PDB, SIFTS XML, DSSP files)
2 - Downloading data files on the fly (e.g. mmCIF, PDB, SIFTS XML, DSSP files)
3 - Fetching sequence annotations (features) (e.g. variants from Ensembl and UniProt)
4 - Merging all the previous data onto a main DataFrame

With this in mind, I think it would be great to have a structure like this:

proteofav.mmCIF.read() 		
proteofav.mmCIF.write() 
proteofav.mmCIF.download()
proteofav.mmCIF.select()
proteofav.PDB.read()
proteofav.PDB.write()
proteofav.PDB.download()
proteofav.PDB.select()
proteofav.DSSP.read()
proteofav.DSSP.download()
proteofav.DSSP.select()
proteofav.SIFTS.read()
proteofav.SIFTS.download()
proteofav.SIFTS.select()
proteofav.Validation.read()
proteofav.Validation.download()
proteofav.Validation.select()
proteofav.Annotations.read()
proteofav.Annotations.download()
proteofav.Annotations.select()
proteofav.Variants.fetch()
proteofav.Variants.select()
proteofav.Tables.merge()
proteofav.Tables.generate()

Classes generally have the following basic methods (a usage sketch follows the list):

  • read - read/parse from file
  • write - write output to a file
  • download - downloads data to a file (mmCIF, etc.)
  • fetch - downloads data to the handle, but can be cached (JSON, etc.)
  • merge - merge any set of DataFrames, so each DataFrame should be aware of what type of data it contains
  • generate - automated table generation from input (e.g. a PDB ID/chain ID or a UniProt ID)
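
A usage sketch of how the proposed interface could fit together; all of these calls are hypothetical, mirroring the design above rather than a released API:

import proteofav

# Hypothetical end-to-end flow under the proposed design
cif_path = proteofav.mmCIF.download('2pah')    # download the data file
atoms = proteofav.mmCIF.read(cif_path)         # parse into a DataFrame
atoms = proteofav.mmCIF.select(atoms)          # keep the rows of interest
sifts = proteofav.SIFTS.read(proteofav.SIFTS.download('2pah'))
merged = proteofav.Tables.merge(atoms, sifts)  # collapse onto one table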
