elelab / mavisp

the MAVISp database for structural variants annotation

License: GNU General Public License v3.0

Python 100.00%

mavisp's Introduction

Cancer Systems Biology, Technical University of Denmark, 2800, Lyngby, Denmark
Cancer Structural Biology, Danish Cancer Institute, 2100, Copenhagen, Denmark
Repository associated with the publication:

MAVISp: A Modular Structure-Based Framework for Genomic Variant Interpretation Matteo Arnaudi, Ludovica Beltrame, Kristine Degn, Mattia Utichi, Simone Scrima, Pablo Sanchez Izquierdo, Karolina Krzesinska, Francesca Maselli, Terezia Dorcakova, Jordan Safer, Alberte Heering Estad, Katrine Meldgard, Philipp Becker, Julie Bruun Brockhoff, Amalie Drud Nielsen, Valentina Sora, Alberto Pettenella, Jeremy Vinhas, Peter Wad Sackett, Claudia Cava, Anna Rohlin, Mef Nilbert, Sumaiya Iqbal, Matteo Lambrughi, Matteo Tiberti, Elena Papaleo. bioRxiv https://doi.org/10.1101/2022.10.22.513328

MAVISp web app

Introduction

This is the web app of the MAVISp database for structural variants annotation.

If you use MAVISp, please cite our preprint:

MAVISp: Multi-layered Assessment of VarIants by Structure for proteins
Matteo Arnaudi, Ludovica Beltrame, Kristine Degn, Mattia Utichi, Pablo Sánchez-Izquierdo, Simone Scrima, Francesca Maselli, Karolina Krzesińska, Terézia Dorčaková, Jordan Safer, Katrine Meldgård, Julie Bruun Brockhoff, Amalie Drud Nielsen, Alberto Pettenella, Jérémy Vinhas, Peter Wad Sackett, Claudia Cava, Sumaiya Iqbal, Matteo Lambrughi, Matteo Tiberti, Elena Papaleo. bioRxiv, doi: https://doi.org/10.1101/2022.10.22.513328

Please see the CHANGELOG.md file on this repository for current and expected data releases as well as updates on the software.

Please see the CURATORS.md file in this repository for an up to date list of our curators.

This web app uses by default a database based on a set of CSV files, available on our OSF repository. These were generated by using the MAVISp Python package from raw data files. See below for details about the Python package.

Requirements

Running the MAVISp web app requires a working Python 3.9+ installation with the following Python packages:

streamlit 1.28.2
streamlit-aggrid 0.3.4.post3
pandas 2.1.3
matplotlib 3.7.4

In principle, it is compatible with all operating systems that support Python.

It was last tested on Linux (Ubuntu 18.04) and on macOS (13.5.2), with Python 3.9.6 and the following package versions:

streamlit 1.28.2
streamlit-aggrid 0.3.4.post3
pandas 2.1.3
matplotlib 3.7.4
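
To confirm that an environment matches the versions listed above, a small check along the following lines can help. This is a minimal sketch, not part of MAVISp: the helper names `installed_version` and `version_mismatches` are made up for illustration, and it relies only on the standard-library `importlib.metadata` module.

```python
from importlib import metadata

# Versions listed in the requirements above.
EXPECTED = {
    "streamlit": "1.28.2",
    "streamlit-aggrid": "0.3.4.post3",
    "pandas": "2.1.3",
    "matplotlib": "3.7.4",
}


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_mismatches(expected, lookup=installed_version):
    """Map each package whose version differs to an (expected, found) pair."""
    mismatches = {}
    for package, wanted in expected.items():
        found = lookup(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


# Report anything that differs from the tested versions.
for package, (wanted, found) in version_mismatches(EXPECTED).items():
    print(f"{package}: expected {wanted}, found {found or 'not installed'}")
```

No output means that the active environment has exactly the tested versions installed.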

In order to download the full MAVISp dataset from OSF, you will also need the wget program (see below). We last tested the download with wget 1.21.3.

Installing requirements

These instructions apply to both Linux and macOS, using the terminal. You will need a recent (>=3.9) Python distribution, or Anaconda, installed on your system.

Installing requirements using a virtualenv Python environment

if you have access to the virtualenv Python environment module, you can create a virtual environment:

virtualenv -p python3.9 MAVISp_env

then, activate it:

source MAVISp_env/bin/activate

you can install the requirements in the environment using pip:

pip install pandas==2.1.3 matplotlib==3.7.4 streamlit==1.28.2 streamlit-aggrid==0.3.4.post3

Installing requirements using a conda Python environment

if you have access to Anaconda or Miniconda (executable conda), you can use it to create a virtual environment:

conda create -n MAVISp_env python

then you can activate it:

conda activate MAVISp_env

you need to install the remaining requirements, using pip:

pip install pandas==2.1.3 matplotlib==3.7.4 streamlit==1.28.2 streamlit-aggrid==0.3.4.post3

Installation time is typically up to a few minutes.

Running the app

In the following instructions

hostname denotes the hostname of the server (i.e. the one you usually ssh to)
user denotes your username on the server

Running the app locally - full dataset

This requires downloading the full MAVISp dataset.

In order to run our web server locally with its full content, you will need to download the full MAVISp dataset from OSF, as follows, as well as download our web app from GitHub. If you'd rather test the web app on a small subset, please follow the instructions in the "Running the app locally - test dataset" section instead.

If you haven't already, activate your Python environment (see previous steps)
create a local copy of the MAVISp repository in your system:

git clone https://github.com/ELELAB/MAVISp

Download the database files from our OSF repository. This requires the wget program or similar. If it's not available, you can manually download a zip file containing all the database files from this link.

cd MAVISp
rm -rf ./database
mkdir database 
cd database
wget -O database.zip 'https://files.de-1.osf.io/v1/resources/ufpzm/providers/osfstorage/65579865874c2e15e54e7d34/?zip=' && unzip database.zip 
rm database.zip
cd ..

At the end of the process, you should have a database folder inside the MAVISp folder including all the contents of the database folder on OSF.
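
If wget and unzip are not available, the same download and extraction could be scripted in Python with the standard library. The following is a sketch under that assumption, not part of the MAVISp distribution. The function names are hypothetical and the URL is copied from the wget command above. Unlike the shell commands, it does not delete an existing database folder first.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL taken from the wget command above.
DATABASE_URL = (
    "https://files.de-1.osf.io/v1/resources/ufpzm/providers/osfstorage/"
    "65579865874c2e15e54e7d34/?zip="
)


def download_database(url, zip_path):
    """Download the zipped database to zip_path."""
    urllib.request.urlretrieve(url, zip_path)


def unpack_database(zip_path, database_dir):
    """Extract the archive into database_dir and return the extracted names."""
    database_dir = Path(database_dir)
    database_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(database_dir)
        return archive.namelist()


# Example usage from inside the MAVISp folder (not run automatically):
# download_database(DATABASE_URL, "database.zip")
# unpack_database("database.zip", "database")
```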

With your Python environment still active and from inside the MAVISp repository directory, run:

streamlit run Welcome.py

a browser window displaying the MAVISp web app should open.

Running the app locally - test dataset

These instructions let you run our web app on a minimal test dataset that is included in the distribution.

If you haven't already, activate your Python environment (see previous steps)
create a local copy of the MAVISp repository in your system:

git clone https://github.com/ELELAB/MAVISp

Create a copy of the test dataset in the MAVISp directory

cd MAVISp
rm -rf ./database
cp -r test_data/mavisp_web_server database

With your Python environment still active and from inside the MAVISp repository directory, run:

streamlit run Welcome.py

a browser window displaying the MAVISp web app should open.

Running the app remotely - if you have access to a host

The steps are the same as in the previous section, but a browser window will not open. Instead, you will have to connect from your local browser to the address printed out by streamlit in the terminal.

Running the app remotely - with X forwarding

This option is useful for those who want to run MAVISp remotely but don't have direct access to the MAVISp web service (e.g. because the outbound port that streamlit uses is blocked). It is, however, slow and clunky, so it is not well suited to in-depth data exploration or analysis. It also requires that you can connect to your server with X forwarding, and that a web browser is installed on the server. For a faster alternative that doesn't use X forwarding, see the instructions below.

connect to your server via ssh, with X forwarding:

ssh -XY user@hostname

Follow the previous instructions to install requirements, download required files and run the app locally. A browser window should pop up.

Running the app remotely - without X forwarding

You can use an ssh tunnel to connect to your streamlit instance. This is slightly more complicated but leads to an overall much better user experience.

ssh without X forwarding to the server:

ssh user@hostname

Follow the previous instructions to install the requirements and download the required files for MAVISp
run the app in headless mode:

cd MAVISp
streamlit run --logger.level=info --server.headless=true Welcome.py

Please note down the port that Streamlit is using (in this example, 8501)

on your workstation open a new terminal and open an SSH tunnel:

ssh -N -L 8080:hostname:8501 user@hostname

Note that you need to replace the port number to the right of hostname: (8501 in this example) with the one that Streamlit is actually serving on

on your workstation open a new browser window and visit the website localhost:8080. MAVISp should load.
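
Because the port printed by Streamlit can differ from run to run, it may help to see how the tunnel arguments are put together. The helper below is a hypothetical illustration that only builds the ssh argument list from the instructions above; it does not open a connection by itself.

```python
def tunnel_command(user, hostname, remote_port=8501, local_port=8080):
    """Build the ssh arguments that forward local_port to Streamlit's port."""
    # Format is local_port:host:remote_port, as in the command above.
    forward = f"{local_port}:{hostname}:{remote_port}"
    return ["ssh", "-N", "-L", forward, f"{user}@{hostname}"]


# Streamlit reported port 8502 instead of 8501 in this hypothetical run.
print(" ".join(tunnel_command("user", "hostname", remote_port=8502)))
```

With the tunnel open, the app would be reachable at localhost:8080 on the workstation, as described above.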

MAVISp Python package

the MAVISp Python package is not necessary to run the web app - it is however necessary to generate the database file starting from raw input files. It includes a user-executable script (mavisp) with this very purpose, which performs sanity checks on the input data, prints a report, and performs the conversion. Please see the help text of the script itself for further details (mavisp -h) or instructions below.

Requirements

The MAVISp Python package is designed to run on any operating system that supports Python. It has been tested on Ubuntu Linux 18.04 and macOS (13.5.2)

In order to install the package and all its requirements automatically, you will need to have a working Python 3.9+ installation available. We recommend installing the package in its own virtual environment - please see previous instructions on how to create a virtual environment.

The MAVISp Python package requires the following packages, and has been tested with the following versions:

pandas 2.1.3
tabulate 0.9.0
matplotlib 3.7.4
numpy 1.26.2
PyYAML 6.0.1
streamlit 1.28.2
streamlit-aggrid 0.3.4.post3
requests 2.31.0
termcolor 2.3.0

Installation

Once your virtual environment is ready and active, you need to

create a local copy of the repository:

git clone https://github.com/ELELAB/MAVISp

install the Python package

cd MAVISp
pip install .

the mavisp executable will be available.

Installation time is typically up to a few minutes.

Notice that installing the Python package always installs all the requirements for the web app, meaning it is ready to run.

Running the `mavisp` script on the example dataset

A typical command line of the mavisp script looks like:

mavisp -d input_data -o output_database

where input_data is a folder containing the raw data in a specific format and output_database is where the CSV database files will be written. The script

performs all the parsing on the input file as well as basic sanity checks
prints a summary of all the available datasets and their status
prints a more detailed report of the status of each MAVISp module in each dataset
writes the output database. The output directory must not already exist, unless option -f is set, which forces writing or overwriting the database.

To test the script on the MAVISp dataset included in the repository, from inside the MAVISp folder you have cloned from GitHub (see previous instructions), just run:

mavisp -d test_data/mavisp_python_package -o test_output

the test_output folder will now contain the output of the command. A reference output for this command can be found in the test_data/mavisp_web_server/ folder. The execution should complete in a few seconds.
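
One way to compare test_output with the reference folder is a file-by-file comparison. The sketch below uses only the standard library and hypothetical function names. It assumes that matching files should be byte-identical, which has not been verified for this dataset, so any reported differences should be inspected before being treated as errors.

```python
import filecmp
import os


def relative_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            found.add(os.path.relpath(os.path.join(folder, name), root))
    return found


def compare_trees(output_dir, reference_dir):
    """Return (missing, extra, different) file lists for output vs reference."""
    out_files = relative_files(output_dir)
    ref_files = relative_files(reference_dir)
    different = [
        path
        for path in sorted(out_files & ref_files)
        # shallow=False forces a comparison of the file contents
        if not filecmp.cmp(
            os.path.join(output_dir, path),
            os.path.join(reference_dir, path),
            shallow=False,
        )
    ]
    missing = sorted(ref_files - out_files)
    extra = sorted(out_files - ref_files)
    return missing, extra, different


# Example usage from inside the MAVISp folder (not run automatically):
# print(compare_trees("test_output", "test_data/mavisp_web_server"))
```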

Additional instructions

The MAVISp modules can have 4 possible states:

OK, colored in green, for when no warnings or errors were detected
WARNING, colored in yellow, for when processing the module generated some messages that can be useful for the user, but the script was still able to read and process the data for the module correctly
ERROR, colored in red, for when processing the module resulted in a critical error i.e. a problem that couldn't be overcome
NOT_AVAILABLE, colored in the default terminal color, when the module was not present in the dataset

Additionally, if the mutation list or metadata file are not available, the whole entry is flagged as being in a CRITICAL state - if this is the case, the corresponding modules will not be processed and the detected error will be displayed in the report.

If any module generates at least one error or warning for a given dataset, the status of the corresponding dataset in the summary table is set accordingly.

If any module generates an error, the script will not write the corresponding database file and will exit.
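
The rules above can be illustrated with a short sketch. This is not the code used by the mavisp script: the severity ordering, the handling of datasets with no modules, and the function names are assumptions made for this example, and the CRITICAL state is left out.

```python
# Severity order assumed for this sketch; NOT_AVAILABLE modules are ignored.
SEVERITY = {"OK": 0, "WARNING": 1, "ERROR": 2}


def dataset_status(module_states):
    """Return the most severe state among the modules present in a dataset."""
    present = [state for state in module_states if state != "NOT_AVAILABLE"]
    if not present:
        # Assumption: a dataset with no modules at all is reported as such.
        return "NOT_AVAILABLE"
    return max(present, key=SEVERITY.__getitem__)


def can_write_database(module_states):
    """The database file is only written when no module reported an error."""
    return "ERROR" not in module_states


print(dataset_status(["OK", "WARNING", "NOT_AVAILABLE"]))
print(can_write_database(["OK", "ERROR"]))
```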

It is possible to process only some of the available proteins in the dataset, or exclude some, by using options -p or -e respectively, with a comma-separated list, e.g.

mavisp -d input_data -o output_database -p MAP1LC3B,BCL2

it is also possible to perform a check of the raw data without writing the database files by using option -n.
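
When the script is driven from another program, the options described in this section can be assembled as an argument list. The helper below is a hypothetical convenience wrapper, not part of the MAVISp package. It only uses the options mentioned in this document (-d, -o, -p, -e, -n, -f).

```python
def mavisp_command(input_dir, output_dir, include=None, exclude=None,
                   dry_run=False, force=False):
    """Build the argument list for the mavisp script."""
    command = ["mavisp", "-d", input_dir, "-o", output_dir]
    if include:
        command += ["-p", ",".join(include)]  # process only these proteins
    if exclude:
        command += ["-e", ",".join(exclude)]  # skip these proteins
    if dry_run:
        command.append("-n")  # check the raw data without writing files
    if force:
        command.append("-f")  # force writing or overwriting the database
    return command


print(" ".join(mavisp_command("input_data", "output_database",
                              include=["MAP1LC3B", "BCL2"])))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from an environment in which the mavisp executable is installed.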

mavisp's Issues

running streamlit on the server - firefox window doesn't show up

I followed the instructions in the readme and worked here:
/data/user/ele/mavisp/MAVISp

it seems to start but then it stops like this

You can now view your Streamlit app in your browser.

Local URL: http://localhost:8502
Network URL: http://172.30.0.24:8502

connect /tmp/.X11-unix/X0: No such file or directory
connect /tmp/.X11-unix/X0: No such file or directory
connect /tmp/.X11-unix/X0: No such file or directory
connect /tmp/.X11-unix/X0: No such file or directory
connect /tmp/.X11-unix/X0: No such file or directory
connect /tmp/.X11-unix/X0: No such file or directory
connect /tmp/.X11-unix/X0: No such file or directory
connect /tmp/.X11-unix/X0: No such file or directory
connect /tmp/.X11-unix/X0: No such file or directory
connect /tmp/.X11-unix/X0: No such file or directory
connect /tmp/.X11-unix/X0: No such file or directory
connect /tmp/.X11-unix/X0: No such file or directory
connect /tmp/.X11-unix/X0: No such file or directory
connect /tmp/.X11-unix/X0: No such file or directory
connect /tmp/.X11-unix/X0: No such file or directory
connect /tmp/.X11-unix/X0: No such file or directory
connect /tmp/.X11-unix/X0: No such file or directory
connect /tmp/.X11-unix/X0: No such file or directory
connect /tmp/.X11-unix/X0: No such file or directory
connect /tmp/.X11-unix/X0: No such file or directory
connect /tmp/.X11-unix/X0: No such file or directory

added rosetta for local interactions for MAP1LC3B - not showing in the database

I added the flexddg results to mavisp_data but the column for it and the one with the classification do not show up in the database

Column with mutations is missing in the mutation table

BCL2 PTM data contains mistakes

we need to update the repository accordingly

replace filtered_out with 'neutral'

with allosigma2 and the current protocol, if a mutation has been assigned UP or DOWN and is not found after filtering, it is because it should be mapped as 'neutral'

Documentation needs to be updated for preliminary release

info in first table

in the first table for each protein the information on the source of the structure, PDB id/AF model, and the coverage of the mutations are no longer shown

clinvar.py gives problems

in many cases including MSH2 that has many clinvar variants it is not reporting them anymore

add DeMask column

to add the outputs from DeMask and the classification according to them for comparison

Add annotations from Cancermuts metatable

We should add to the dataset table annotations from cancermuts, namely

REVEL Score
gnomAD (eventually, popmax)
source(s) of the mutations

Need to replace "basic mode" with "simple mode"

AlloSigma should handle cases in which only the input txt file is available

right now it complains that it needs 2 or 3 files to be present. We should support the 1-case file

Mistakes and missing data in database

we need to fix

the format of T69I and T69P in BCL2 PTM module
update the mutation list of KRAS, BLM, MLH1 and TRAP1
remove AlloSigma data for MLH1 since we need to handle better this case

local interactions parsing ignores chain information

Instead, it should consider chain A as the protein of interest and chain B as the interactor.
If we don't filter by chain, we might run into cases in which we take the DDG values from mutations of the interactor.

MAVISp should ensure that mutations are shown only once

one row per mutation - if multiple identical mutations are annotated in the input file(s), we should still display one

Add Clinvar interpretation

Directly from clinvar file
If possible, express them as interpretation string with a link to relevant Clinvar entry. If not, add an extra column with Clinvar ID
use not_available file to double check we are using all of them

issue importing TP53_2OCJbd for local interaction

when I try to import the energies.csv from mutatex for the homodimer of TP53 (chain b and d) it doesn't seem to import any energy value - we should check together

Add column with PMIDs for publication reference

Stability classification sometimes assigns Stabilising class incorrectly

there's a bug in the classification code:

        if row[foldx_header] and row[rosetta_header] < (- stab_co):

should be

        if row[foldx_header] < (- stab_co) and row[rosetta_header] < (- stab_co):
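
To make the difference concrete, here is a small self-contained illustration. The column names and the cutoff value are assumptions chosen for the example, not values taken from the MAVISp code.

```python
stab_co = 3.0  # cutoff in kcal/mol, assumed here for illustration
foldx_header, rosetta_header = "foldx", "rosetta"  # hypothetical column names

# FoldX predicts destabilising (+5.0), Rosetta predicts stabilising (-4.0).
row = {foldx_header: 5.0, rosetta_header: -4.0}

# Buggy test: row[foldx_header] is only checked for truthiness (non-zero).
buggy = bool(row[foldx_header] and row[rosetta_header] < (-stab_co))

# Fixed test: both values are compared against the cutoff.
fixed = row[foldx_header] < (-stab_co) and row[rosetta_header] < (-stab_co)

print(buggy, fixed)  # True False
```

With the original condition, any non-zero FoldX value passes, so a discordant pair like this one would be labelled as stabilising.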

Add support for multiple ranges in stability structure names

For now we support names like:

AF2_1-100

we should also support cases in which multiple stretches are present:

AF2_1-100_2-200
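
A possible way to parse both forms is sketched below. The function name is hypothetical, and the sketch assumes the source prefix (AF2 here) never contains an underscore itself.

```python
import re


def parse_structure_name(name):
    """Split 'AF2_1-100_2-200' into ('AF2', [(1, 100), (2, 200)])."""
    source, *chunks = name.split("_")
    ranges = []
    for chunk in chunks:
        match = re.fullmatch(r"(\d+)-(\d+)", chunk)
        if match is None:
            raise ValueError(f"unexpected range {chunk!r} in {name!r}")
        ranges.append((int(match.group(1)), int(match.group(2))))
    if not ranges:
        raise ValueError(f"no residue range found in {name!r}")
    return source, ranges


print(parse_structure_name("AF2_1-100"))
print(parse_structure_name("AF2_1-100_2-200"))
```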

index column to order according to seq number not residue type

for the mutation list, the index should refer to the sequence number not the residue type
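
As an illustration of the requested ordering, the sketch below sorts mutations by residue number instead of alphabetically. It assumes mutations are written as one-letter wild type, position, one-letter mutant (e.g. T69I). The example list is made up.

```python
import re


def sequence_number(mutation):
    """Extract the residue number from a mutation such as 'T69I'."""
    match = re.fullmatch(r"[A-Z](\d+)[A-Z]", mutation)
    if match is None:
        raise ValueError(f"unrecognised mutation format: {mutation!r}")
    return int(match.group(1))


mutations = ["T69I", "A100V", "C5S", "T69P"]
# Sort by position first, then by the full string to keep ties deterministic.
ordered = sorted(mutations, key=lambda m: (sequence_number(m), m))
print(ordered)  # ['C5S', 'T69I', 'T69P', 'A100V']
```

A plain alphabetical sort would instead place A100V first, which is the behaviour this issue describes.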

to check for when in stability the source is xray structure

in the case of PTK2 we have the first xray structure (+modeller remodelling) - we should check if the current tree is correct (or how it should be) since the data doesn't seem to display

what to report in source mutation source

for manual annotations, the part 'Manual annotations from mutations_' should be removed from the database table and only the name of the csv file kept, e.g. clinvar, marinara etc

The PMID column in the dataset summary table should be PMID/DOI

as we want to support both identifiers

Column order should have Classification columns right after the corresponding dataset

e.g.

Stability
Its classification
Local interaction
its classification
etc

We need a container image to make the app more portable

MAVISp crashes when only one source of local interaction values is provided

also, the processing shouldn't happen when zero local interactions/stability columns are found

add PTM annotations - examples ready for usage

I have finalized how we can annotate the ptm, it should be possible to import them now

move column with classification of modules

always to have the column with the module classification next (on the right) after the single results of the classification

We need to split the data ingest and data display parts of the software

We will have

an executable that parses the database in its current form and outputs csv file that can be read by the streamlit app. It also does some basic check
In this context it would be good to refactor this part of the code with a class-based implementation
the streamlit app that just reads said csvs to display them

allosigma with analyses of multiple domains

we need to find a solution for the allosigma outputs when the analyses come from different domains, because we cannot simply concatenate the files (i.e. the columns are different)

curator information

we need a column in the first table with the names of curators - to decide in which format should be provided

local_interactions will require an additional page with details

we will have results of binding free energy on multiple interactors so we need its own separate link (table which is per mutation) with the different predicted changes and classification
For the main page I would just create a file to include info for four columns that are summative, for example
#mutation local_interaction effects

where local_interaction can be: uncertain, stabilizing, destabilizing, neutral
we still have to discuss how to calculate the score, but in a first phase it could perhaps be the number of observables with that classification over the total,
i.e. uncertain (0.3), destabilizing (0.5), stabilizing (0.1), neutral (0.1)

add long-range and allosigma2

DOIs don't always show up correctly in the final table

In particular when a PMID is associated to more than one specific mutation, only the first mutation of the set is annotated with that PMID.

This same problem probably also affects the PTM module (even though identical rows are much less likely in that data, so it might not have shown up so far)

In the stability modules, we should support multiple structures

We should support multiple non-overlapping structures in terms of sequence coverage for the same protein; that's because we may have cases in which we use different PDB entries separately to cover different regions of the same protein, e.g. structural domains connected by linkers that were resolved separately

cosmic in cancermuts annotations(?)

I noticed that for some proteins in MAVISdb there is no annotation for COSMIC mutations, which sounds suspicious to me, so perhaps we have to check all the logs of the cancermuts run?

mutatex to change code to use csv directly

if possible to readme the csv from ddg2excell and match the mutations (as we do with rosetta) since it can save us time and work
only in raw_data

column of allosigma UP/DOWN mutations not imported correctly

the problem with BCL2 is that it is not importing the data correctly from /data/raw_data/computational_data/mavisp_data/BCL2/basic_mode/long_range/allosigma2/allosigma_mut.txt - in streamlit I only see a couple of UP mutations and the rest assigned as N.A. but this is not the case in the file

no caching in the Streamlit app

Currently we don't do any caching, which means the app is slow when many proteins are present. We should at least implement basic caching of the data we load

add code for ptm module

according to the decision tree for regulation, stability and function to implement the code in mavisp

Add first step: parsing module for the MAVISp file structure

Add classification column for stability

We should add a column in which we classify mutations depending on the DDG values we get from stability predictors (in kcal/mol below):

Condition	Annotation
BOTH > 3	Destabilizing
BOTH < -3	Stabilising
]-2;2[	Neutral
DISCORDANT OR [-3;-2] OR [2;3]	Uncertain
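
The table can be read as the following decision function. This is a sketch of the proposal in this issue, not the code that was eventually implemented. The function name and arguments are hypothetical, and the class labels are spelled as in the table.

```python
def classify_stability(foldx_ddg, rosetta_ddg):
    """Classify a mutation from two DDG values (kcal/mol) using the table."""
    values = (foldx_ddg, rosetta_ddg)
    if all(v > 3 for v in values):
        return "Destabilizing"
    if all(v < -3 for v in values):
        return "Stabilising"
    if all(-2 < v < 2 for v in values):
        return "Neutral"
    # Discordant predictions, or values in [-3;-2] or [2;3]
    return "Uncertain"


print(classify_stability(4.2, 3.5))
print(classify_stability(0.5, -1.0))
print(classify_stability(4.0, -4.0))
```

Boundary values such as exactly 3 or exactly 2 fall into Uncertain, since the first three conditions use strict inequalities.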

rosettaddg and data from different AF structures

for mutatex-foldx5 we solved the stability module from multiple AF models doing a cat of the summary.txt files from ddg2summary. For rosettaddgprediction I tried to do the same with the csv table after the aggregation step but Streamlit gives an error and we need to see what the best procedure for this is

├── cancermuts
├── clinvar
├── local_interactions
│   └── foldx5
│       └── SQSTM1_2ZJDab
├── mutation_list
├── pmid_list
└── stability
    └── AF2_1-120
        └── alphafold
            └── model_AF2DB
                └── foldx5

the ingest routines should comply with this
