
proteofav's Introduction

ProteoFAV

Protein Features, Annotations and Variants


ProteoFAV is a Python module that addresses the challenge of cross-mapping protein structures and protein sequences, allowing protein structures to be annotated with sequence features and annotations. It implements methods for working with protein structures (via mmCIF, PDB, PDB Validation, DSSP and SIFTS files), sequence features (via UniProt GFF annotations) and genetic variants (via the UniProt/EBI Proteins, Ensembl REST and TCGA Pan-Cancer APIs). Cross-mapping of structure and sequence is performed with the aid of SIFTS.

ProteoFAV relies heavily on the Pandas library to quickly load data into DataFrames for fast data exploration and analysis. Structure and sequence data are parsed or fetched into Pandas DataFrames, which are then merged (collapsed) into a single DataFrame.

Protein structures (sequences and atom 3D coordinates) and their annotations (from structural analysis, e.g. interacting interfaces, secondary structure and solvent accessibility), as well as protein sequences and their annotations (e.g. genetic variants and other functional information obtained from SIFTS and UniProt), are handled by the classes/methods so that each modular component table can be integrated into a single 'merged table'.

[Overview figure: proteofav.png]

The methods implemented in proteofav/mergers.py allow the different components to be merged into a single Pandas DataFrame.
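
As an illustration of the idea, two component tables sharing a residue-level key can be collapsed with a plain Pandas merge. The column names below are assumptions for this sketch; proteofav/mergers.py implements the real merging logic.

import pandas as pd

# Hypothetical structure and DSSP component tables keyed by residue number
structure = pd.DataFrame({"PDB_dbResNum": ["1", "2", "3"],
                          "label_comp_id": ["MET", "GLU", "SER"]})
dssp = pd.DataFrame({"PDB_dbResNum": ["1", "2", "3"],
                     "SS": ["-", "H", "H"]})

# Collapse the components onto a single merged table
merged = structure.merge(dssp, on="PDB_dbResNum", how="left")
print(merged)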

Getting Started

Dependencies

ProteoFAV was developed to support Python 3.5+ and Pandas 0.20+. See requirements.txt for the specific requirements.

Installation

To install the stable release, run this command in your terminal:

$ pip install proteofav

If you don't have pip installed, this Python installation guide can help you through the process.

Installing from source in a virtual environment

Getting ProteoFAV:

$ wget https://github.com/bartongroup/ProteoFAV/archive/master.zip -O ProteoFAV.zip
$ unzip ProteoFAV.zip

# alternatively, cloning the git repository
$ git clone https://github.com/bartongroup/ProteoFAV.git

Installing with Virtualenv:

$ virtualenv --python `which python` env
$ source env/bin/activate
$ cd path/to/ProteoFAV
$ pip install -r requirements.txt
$ python setup.py install

Installing With Conda:

$ conda-env create -n proteofav -f path/to/ProteoFAV/requirements.txt
$ source activate proteofav
$ cd path/to/ProteoFAV
$ pip install .

Testing the installation

Test dependencies should be resolved with:

$ cd path/to/ProteoFAV
$ python setup.py develop --user

Run the tests with:

$ python setup.py test
# or
$ cd tests
$ python -m unittest discover

ProteoFAV Configuration

ProteoFAV uses a configuration file, config.ini, where the user can specify directory paths as well as URLs for commonly used data sources.

After installing, run:

$ proteofav-setup
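
This writes a config.ini that you can then edit by hand. As a minimal sketch, the generated file can also be inspected programmatically; the file location below is an assumption, so check where proteofav-setup actually wrote it:

import configparser

# Load and dump the generated configuration (the path is an assumption)
config = configparser.ConfigParser()
config.read("config.ini")
for section in config.sections():
    for key, value in config[section].items():
        print(section, key, value)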

Example Usage

Example usage is currently provided as a Jupyter Notebook, which can be viewed with GitHub's file viewer or with the Jupyter nbviewer.

You can download the Jupyter notebook from GitHub and test it against your ProteoFAV installation.

Contributing and Bug tracking

Feel free to fork, clone, share and distribute. If you find any bugs or issues, please log them in the issue tracker.

Before you submit a pull request, read the Contributing Guide.

Credits

See the Credits

Changelog

See the Changelog

Licensing

The MIT License (MIT). See license for details.

proteofav's People

Contributors: biomadeira, stuartmac, tbrittoborges


proteofav's Issues

Improve column type checking

We currently have a confirm_column_types utility method. It would probably be best to make it use a named tuple or something similar. Having a single flat dictionary with all the key-type pairs means that a key with the same name could require a different type in a different subgroup (e.g. "id" for mmCIF vs. "id" for Variants)...
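
A minimal sketch of one way to scope the expected dtypes per table, so identically named keys can carry different types. The table names, dtypes and helper below are illustrative assumptions, not the current implementation:

from collections import namedtuple

# Hypothetical per-table dtype registry: "id" can be an int for mmCIF
# atom records but a string for variant identifiers
ColumnTypes = namedtuple("ColumnTypes", ["mmcif", "variants"])
COLUMN_TYPES = ColumnTypes(
    mmcif={"id": int, "label_seq_id": int},
    variants={"id": str, "begin": int},
)

def confirm_column_types(table, dtypes):
    """Coerce each known column of the DataFrame to its expected dtype."""
    for column, dtype in dtypes.items():
        if column in table.columns:
            table[column] = table[column].astype(dtype)
    return table

# e.g. confirm_column_types(mmcif_table, COLUMN_TYPES.mmcif)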

`select_uniprot_variants` cannot handle mismatches with the canonical transcript for a given UniProt

from variants.to_table import select_uniprot_variants
select_uniprot_variants('P11586')
2016-02-08 09:36:41,670 - INFO - Starting new HTTP connection (1): www.uniprot.org 
2016-02-08 09:36:41,994 - DEBUG - "GET /uniprot/P11586 HTTP/1.1" 200 None 
2016-02-08 09:36:42,336 - INFO - Starting new HTTP connection (1): www.uniprot.org 
2016-02-08 09:36:42,386 - DEBUG - "GET /uniprot/?query=accession%3AP11586&contact=&columns=organism%2Csequence&format=tab HTTP/1.1" 200 None 
2016-02-08 09:36:42,390 - INFO - Starting new HTTP connection (1): rest.ensembl.org 
2016-02-08 09:36:42,934 - DEBUG - "GET /xrefs/symbol/homo_sapiens/P11586 HTTP/1.1" 200 222 
2016-02-08 09:36:42,937 - INFO - Starting new HTTP connection (1): rest.ensembl.org 
2016-02-08 09:36:43,474 - DEBUG - "GET /sequence/id/ENSP00000450560?type=protein HTTP/1.1" 200 261 
2016-02-08 09:36:43,475 - WARNING - Sequences don't match! skipping... ENSP00000450560 
2016-02-08 09:36:43,477 - INFO - Starting new HTTP connection (1): rest.ensembl.org 
2016-02-08 09:36:44,083 - DEBUG - "GET /sequence/id/ENSP00000216605?type=protein HTTP/1.1" 200 935 
2016-02-08 09:36:44,084 - WARNING - Sequences don't match! skipping... ENSP00000216605 
Traceback (most recent call last):
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2883, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-13-974825407562>", line 1, in <module>
    select_uniprot_variants('P11586')
  File "/Users/smacgowan/PycharmProjects/ProteoFAV/variants/to_table.py", line 444, in select_uniprot_variants
    table = pd.concat(tables)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/tools/merge.py", line 812, in concat
    copy=copy)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/tools/merge.py", line 845, in __init__
    raise ValueError('No objects to concatenate')
ValueError: No objects to concatenate

This happened because there was a mismatch between the Ensembl and UniProt sequences.

We can work around it by attempting a 'permissive' Ensembl-UniProt comparison that just checks that the sequences are of the same length and logs the number of mismatches.
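
A minimal sketch of that permissive check; the function name and the permissive flag below are assumptions for illustration, not existing ProteoFAV API:

import logging

log = logging.getLogger(__name__)

def sequences_agree(ensembl_seq, uniprot_seq, permissive=True):
    """Return True if the two sequences are close enough to cross-map."""
    if ensembl_seq == uniprot_seq:
        return True
    if permissive and len(ensembl_seq) == len(uniprot_seq):
        # Same length: accept, but log how many residues differ
        mismatches = sum(a != b for a, b in zip(ensembl_seq, uniprot_seq))
        log.warning("Sequences differ at %d positions; proceeding anyway",
                    mismatches)
        return True
    return False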

Logging config

Logging configuration should all go in the config file. Avoid hard-coded logging levels, etc.
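
A minimal sketch of one way to do this with the standard library, assuming config.ini carries the [loggers]/[handlers]/[formatters] sections that logging.config.fileConfig understands:

import logging.config

# Pull all logging setup (levels, handlers, formats) from the config file
logging.config.fileConfig("config.ini", disable_existing_loggers=False)
log = logging.getLogger("proteofav")
log.info("logging is now configured entirely from config.ini")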

Downloader should accept a file object

The new Downloader could accept a _file parameter to use a file object instead of a file path. Alternatively, we could use the following pattern:

import tempfile
# Assuming Annotation is importable like this (the exact module path may differ)
from proteofav.annotation import Annotation

uniprot_id = 'P17612'
with tempfile.NamedTemporaryFile(delete=False) as temp:
    # Download to the temporary file's path, then parse it back
    Annotation.download(identifier=uniprot_id, filename=temp.name)
    P17612_annotation = Annotation.read(filename=temp.name)

This pattern is also not working right now; I will investigate.

Design decision in _fetch_uniprot_variants

Hi @biomadeira @stuartmac

We currently have two low-level functions for parsing UniProt variants. The first one is _fetch_uniprot_variants, which is quick and dirty. The second one is something I had written up earlier but which wasn't in the code base due to its complexity. It first uses the UniProt guidelines to parse the text, then uses regexes to parse the GFF annotation, extracting the variant IDs, reference/mutated residues, disease name, is_cancer and is_germline. The biggest difference between the functions is that the first has multiple rows per residue and the second has one row per residue. I added a parameter, group_residues, to map_gff_features_to_sequence to add this behaviour to the function. I think we can keep both functions and use them as different engines in the select_variant function.

Now the issue: some proteins have multi-residue variants (e.g. in http://www.uniprot.org/uniprot/P04637.gff, look for VAR_047158, which spans two residues). In general we just look at SNPs (single nucleotide polymorphisms). So we need to decide: is this a missense variant at residues 29 and 30, at just 29, or at just 30? Or do we ignore those?
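
One option, if we keep multi-residue variants, is to expand them into one row per residue. A minimal sketch, with assumed column names:

import pandas as pd

# A hypothetical variant spanning residues 29-30, as in VAR_047158
variants = pd.DataFrame([{"id": "VAR_047158", "begin": 29, "end": 30}])

rows = []
for var in variants.itertuples():
    for residue in range(var.begin, var.end + 1):
        rows.append({"id": var.id, "residue": residue})
expanded = pd.DataFrame(rows)
print(expanded)  # one row for residue 29 and one for residue 30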

Update merge tables to work with biounits

We need to update merge_tables to work with biological units, as updated here: #25

The asym_ids (label and auth) are updated, and two new columns are added to these atom lines with the original auth/label asym_ids. For example, chains A and B become A + AA and B + BA.
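
A minimal sketch of carrying the original chain identifier alongside the biounit one; the column names and the explicit mapping below are assumptions based on the example above:

import pandas as pd

# Biounit atom records: "AA" and "BA" are generated copies of A and B
atoms = pd.DataFrame({"label_asym_id": ["A", "AA", "B", "BA"]})

# Hypothetical biounit-to-original mapping, following the example above
mapping = {"AA": "A", "BA": "B"}
atoms["orig_label_asym_id"] = atoms["label_asym_id"].map(
    lambda chain: mapping.get(chain, chain))
print(atoms)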

PDBe REST API now (v1.8) provides a uniprot_canonical flag

/mappings/all_isoforms/
+ is_canonical field added (true if the UniProt sequence is the canonical one, false otherwise).
+ The call now accepts UniProt accessions as input, including isoforms (e.g. P83949-5).

The latest version of the PDBe REST API flags whether a UniProt isoform is canonical or not. This is relevant to ProteoFAV.
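
A minimal sketch of querying that endpoint and reading the flag; the URL follows the PDBe API conventions, but the exact nesting of the JSON response below is an assumption:

import requests

url = "https://www.ebi.ac.uk/pdbe/api/mappings/all_isoforms/P83949-5"
response = requests.get(url)
response.raise_for_status()

# Walk the (assumed) accession -> UniProt -> mappings nesting
for accession, payload in response.json().items():
    for uniprot_acc, entry in payload.get("UniProt", {}).items():
        for mapping in entry.get("mappings", []):
            print(uniprot_acc, mapping.get("is_canonical"))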

Developing a more modular approach

Proposed design:

ProteoFAV's main features:
1 - Reading/parsing formatted files to pandas DataFrames (e.g. mmCIF, PDB, SIFTS XML, DSSP files)
2 - Downloading data files on the fly (e.g. mmCIF, PDB, SIFTS XML, DSSP files)
3 - Fetching sequence annotations (features) (e.g. variants from Ensembl and UniProt)
4 - Merging all the previous data onto a main DataFrame

With this in mind, I think it would be great to have a structure like this:

proteofav.mmCIF.read() 		
proteofav.mmCIF.write() 
proteofav.mmCIF.download()
proteofav.mmCIF.select()
proteofav.PDB.read()
proteofav.PDB.write()
proteofav.PDB.download()
proteofav.PDB.select()
proteofav.DSSP.read()
proteofav.DSSP.download()
proteofav.DSSP.select()
proteofav.SIFTS.read()
proteofav.SIFTS.download()
proteofav.SIFTS.select()
proteofav.Validation.read()
proteofav.Validation.download()
proteofav.Validation.select()
proteofav.Annotations.read()
proteofav.Annotations.download()
proteofav.Annotations.select()
proteofav.Variants.fetch()
proteofav.Variants.select()
proteofav.Tables.merge()
proteofav.Tables.generate()

Classes generally have the following basic methods (a usage sketch follows the list):

  • read - read/parse from file
  • write - write output to a file
  • download - downloads data to a file (mmCIF, etc.)
  • fetch - downloads data to the handle, but can be cached (JSON, etc.)
  • merge - merge any set of DataFrames, so each DataFrame should be aware of what type of data it contains
  • generate - automated table generation from input (e.g. a PDB ID/chain ID or a UniProt ID)
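
A usage sketch of how the proposed interface could fit together; all of these calls are hypothetical, mirroring the design above rather than a released API:

import proteofav

# Hypothetical end-to-end flow under the proposed design
cif_path = proteofav.mmCIF.download('2pah')    # download the data file
atoms = proteofav.mmCIF.read(cif_path)         # parse into a DataFrame
atoms = proteofav.mmCIF.select(atoms)          # keep the rows of interest
sifts = proteofav.SIFTS.read(proteofav.SIFTS.download('2pah'))
merged = proteofav.Tables.merge(atoms, sifts)  # collapse onto one table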
