wells-wood-research / de-stress

DE-STRESS is a model evaluation pipeline that aims to make protein design more reliable and accessible.

Home Page: http://destressprotein.design

HTML 0.07% Elm 80.82% Dockerfile 0.13% Python 17.84% JavaScript 1.04% Shell 0.10%

protein protein-structure protein-design structural-biology bioinformatics pipeline webapp

de-stress's Introduction

de-stress

DEsigned STRucture Evaluation ServiceS

DE-STRESS is a web application that provides a suite of tools for evaluating protein designs. Our aim is to help make protein design more reliable, by providing tools to help you select the most promising designs to take into the lab.

The application is available for non-commercial use through the following URL:

https://pragmaticproteindesign.bio.ed.ac.uk/de-stress/

Citing DE-STRESS

If you use DE-STRESS, please cite the following article:

Stam MJ and Wood CW (2021) DE-STRESS: A user-friendly web application for the evaluation of protein designs, Protein Engineering, Design and Selection, 34, gzab029.

Contacting Us

If you find a bug or would like to request a feature, we'd really appreciate it if you report it as an issue. If you're stuck and need help or have any general feedback, please create a post on the discussion page.

For more information about our research group, check out our group website.

Local Deployment

Make sure you have all the relevant dependencies in de-stress/dependencies_for_de-stress/. Currently, these are:

Aggrescan3D
DFIRE 2 pair
DSSP
EvoEF2 (source)
Rosetta (source)

Create a .env file in the top-level de-stress folder. You can copy de-stress/.env-testing and update it.

Download big_structure.dump and place it in de-stress/database.

Next, from within de-stress/, build all the containers:

# use production-compose.yml if you're deploying in a production environment
docker-compose -f development-compose.yml build

Compile the dependencies in the container:

docker run \
    -it \
    --rm \
    -v /absolute/path/to/de-stress/dependencies_for_de-stress/:/dependencies_for_de-stress \
    de-stress_big-structure:latest \
    sh build_dependencies.sh

This compiles the software, but because a volume is used, the output is stored on the host machine. Do not move or delete this folder while the application is being served, or the application will break.

Launch the application:

# Change rq-worker to however many processes you want to use for analysis
docker-compose -f development-compose.yml --env-file .env up -d --scale rq-worker=4

Navigate to de-stress/database and run import_db_dump.sh.

Headless DE-STRESS

The DE-STRESS webserver has a few limitations, which are there to keep the webserver stable:

Only proteins with 500 residues or fewer can be uploaded.
Only 30 files can be uploaded at a time.
There is a max run time of 20 seconds for all the DE-STRESS metrics.

The headless version of DE-STRESS can be run locally, and the user can change the settings to run a larger set of PDB files. The code has been written to allow multiprocessing so that large numbers of files can be run in a reasonable amount of time. The .env-headless file can be used to update the MAX_RUN_TIME, HEADLESS_DESTRESS_WORKERS and HEADLESS_DESTRESS_BATCH_SIZE variables, which set the number of seconds the DE-STRESS metrics are allowed to run, how many worker processes should be used, and how many PDB files are in a batch, respectively.
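
As a rough illustration of how these three variables could drive a batched, multi-process run, here is a minimal Python sketch. It is not the DE-STRESS implementation: the function names (load_headless_settings, make_batches, run_in_batches) and the fallback values are assumptions made for this example, and only the variable names come from .env-headless.

```python
import os
from multiprocessing import Pool
from typing import Callable, Dict, List, Mapping, Sequence


def load_headless_settings(env: Mapping[str, str] = os.environ) -> Dict[str, int]:
    """Read the three headless settings from the environment.

    The fallback values are illustrative assumptions, not project defaults.
    """
    return {
        "max_run_time": int(env.get("MAX_RUN_TIME", "20")),
        "workers": int(env.get("HEADLESS_DESTRESS_WORKERS", "1")),
        "batch_size": int(env.get("HEADLESS_DESTRESS_BATCH_SIZE", "10")),
    }


def make_batches(paths: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split the input paths into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(paths[i:i + batch_size]) for i in range(0, len(paths), batch_size)]


def run_in_batches(
    paths: Sequence[str],
    analyse: Callable[[str], object],
    settings: Dict[str, int],
) -> List[object]:
    """Analyse every path, one batch at a time, across a pool of workers.

    `analyse` is a hypothetical per-file function; it must be defined at
    module level so that it can be sent to the worker processes.
    """
    results: List[object] = []
    with Pool(processes=settings["workers"]) as pool:
        for batch in make_batches(paths, settings["batch_size"]):
            results.extend(pool.map(analyse, batch))
    return results
```

Working through one batch at a time keeps the number of in-flight files bounded, which is one plausible reason to expose a batch size separately from the worker count.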

First, the docker image needs to be built. Use the separate docker compose file, headless-compose.yml, instead of development-compose.yml.

docker compose -f headless-compose.yml build

After this, make sure the dependencies have been built. The path /absolute/path/to/de-stress/dependencies_for_de-stress/ needs to be replaced with the user's local path to the DE-STRESS dependencies.

docker run -it --rm -v /absolute/path/to/de-stress/dependencies_for_de-stress/:/dependencies_for_de-stress de-stress-big-structure:latest sh build_dependencies.sh

Finally, run headless DE-STRESS with the following command, replacing each /absolute/path/to/ with the local file path to these folders.

docker run -it --rm --env-file .env-headless -v /absolute/path/to/de-stress/dependencies_for_de-stress/:/dependencies_for_de-stress -v /absolute/path/to/input_path/:/input_path de-stress-big-structure:latest poetry run headless_destress /input_path

de-stress's Issues

Data export for a reference set

Export all the metrics for the reference set so that analysis can be performed using external software.

Fix the size of the metric cards value section

Implement optional structure file checks

Check to see that the structure file that was submitted contains valid input for the various programmes. @MichaelJamesStam has started this with the find_disallowed_monomers function in big-structure/src/destress_big_structure/analysis.py.
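
To make the idea concrete, here is a minimal Python sketch of one such check: scanning a PDB file for residue names outside the 20 canonical amino acids. This is not the find_disallowed_monomers function mentioned above. The helper name find_noncanonical_residue_names is hypothetical, and which residue names a given downstream programme actually rejects is an assumption that would need checking per tool. The only fixed fact used is that the PDB format stores the residue name in columns 18 to 20 of ATOM and HETATM records.

```python
from typing import Set

# The 20 canonical amino acids, as three-letter PDB residue names.
CANONICAL_RESIDUES = frozenset({
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
})


def find_noncanonical_residue_names(pdb_text: str) -> Set[str]:
    """Return every residue name in ATOM/HETATM records that is not canonical."""
    found: Set[str] = set()
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            # Columns 18-20 (1-indexed) hold the residue name.
            residue_name = line[17:20].strip()
            if residue_name and residue_name not in CANONICAL_RESIDUES:
                found.add(residue_name)
    return found
```

An empty result would mean the file only contains canonical residues. Anything else (waters, ligands, modified residues) could be reported back to the user before the analysis is started.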

Re-think 'protein only' test in state model.

Maybe this is the wrong approach, and we should return what types of molecules are in the PDB rather than excluding everything except protein.

Data export

The plan for the open beta is not to have server-side storage of designs. A limitation of this approach is that you can't share design metrics very easily. A solution to this is the ability to export design data to a text file. There are a couple of varieties of export that I'm interested in:

A text dump of the designs that can be loaded back into another instance of the application.
Nicely formatted data on the designs/reference sets that enables people to create plots in external software.

Add new metric fields to specifications

We haven't added the new metrics as optional requirements in specifications. This will need to be done retrospectively for EvoEF2, but should be incorporated into the PRs for the other methods (#28, #29, #30).

Headless DE-STRESS fix: README commands

In summary:

docker compose instead of docker-compose
docker container created by docker compose is named "de-stress-big-structure" not "de-stress_big-structure"
wrong .env file given in the last docker run command

Leading to fixed commands:

docker compose -f headless-compose.yml build
docker run -it --rm -v /absolute/path/to/de-stress/dependencies_for_de-stress/:/dependencies_for_de-stress de-stress-big-structure:latest sh build_dependencies.sh
docker run -it --rm --env-file .env-headless -v /absolute/path/to/de-stress/dependencies_for_de-stress/:/dependencies_for_de-stress -v /absolute/path/to/input_path/:/input_path de-stress-big-structure:latest poetry run headless_destress /input_path

Embed specification as a requirement in another specification

I think that this would be useful when creating more complex specifications: you could have some simple specifications that you combine into more complex ones.

Fix stacked bar charts

I converted the composition plot to stacked bar charts to save space without realising that the results would be aggregated.

EvoEF2 hangs for some PDB files

EvoEF2 hangs for two PDB files, 2ht0.pdb and 4dyq.pdb. For 2ht0.pdb, some nan values are returned when running EvoEF2 from the command line. This could be what causes DE-STRESS to hang on "Job is running on server". Running EvoEF2 on 4dyq.pdb from the command line, on the other hand, causes a segmentation fault, and DE-STRESS hangs on "Job submitted to server".
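
One general way to stop an external program from hanging a job is to run it with a hard timeout and to treat a crash as a reportable failure. The Python sketch below shows that pattern with the standard subprocess module. It is an assumption about how this could be handled, not the code DE-STRESS uses: run_with_timeout and its status strings are hypothetical, and the example command is the Python interpreter standing in for EvoEF2.

```python
import subprocess
import sys
from typing import List, Optional, Tuple


def run_with_timeout(command: List[str], max_run_time: float) -> Tuple[str, Optional[str]]:
    """Run an external command and classify the outcome.

    Returns ("ok", stdout), ("timeout", None), ("crashed", stderr) or
    ("failed", stderr).
    """
    try:
        completed = subprocess.run(
            command, capture_output=True, text=True, timeout=max_run_time
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising.
        return "timeout", None
    if completed.returncode < 0:
        # On POSIX, a negative return code means the process was killed by a
        # signal, which is how a segmentation fault shows up.
        return "crashed", completed.stderr
    if completed.returncode != 0:
        return "failed", completed.stderr
    return "ok", completed.stdout


# Example with a stand-in command (the Python interpreter), not EvoEF2 itself.
status, output = run_with_timeout([sys.executable, "-c", "print('done')"], max_run_time=20)
print(status)
```

Returning a status instead of raising lets the caller record a clear message for the user in each case, rather than leaving the job stuck in a "running" state.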

Fix plotting issues

We currently have issues with the Vega Lite plots not rendering correctly. I think that there are a couple of routes to fixing this:

1. Change the VL javascript code to be a custom element that manages its own state and updates automatically based on changes in Elm.
2. Use another library for plotting, preferably one that is Elm native.

I think that option 2 is probably more future-proof and will enable tighter integration between the plots and the application, more control over the formatting of the plots and clearer code, mainly due to not having to use Vega's DSL to create plots. The disadvantage is that we'd be duplicating work we've already done, but we already have a fair amount of overhead fixing these plots after every update.

ValueError keeps getting printed to console when running the backend

A ValueError keeps getting printed to the console every time the web socket is checked. I am not sure what is causing this but it could be related to the recent changes in the pull request Fix: Changes the way that WSs are polled to fix timeouts. #15.

Add overview page for a group of designs

Generate plots comparing a whole batch of designs and the reference set in a separate page.

Remove webpack and move to CDN for serving Vega and NGL

In a previous update I added webpack to bundle our javascript resources, but it really slows down the page load. I'm going to move back to getting the JS dependencies from a CDN.

Change the colour of the designs that have met the specifications to green

Create private git repo for EvoEF2, Rosetta, DFIRE2 and PASTA 2.0 binaries

These can then be incorporated into Github actions so that the automatic tests can work.

Incorporating DFIRE2 into DE-STRESS

A function in analysis.py needs to be added to run the DFIRE2 energy function on pdb files.
An object needs to be added to elm_types.py and big_structure_models.py.
schema.py and create_entry.py need to be updated.
Finally, Metrics.elm and Uuid_String.elm need to be updated to display the energy values.

Fix new specification screen

It is currently full width.

Additional metrics from available data

Resolution for big structure
SS proportion
Radius of gyration or some shape information
BUDE-FF

Beta deployment

As we're getting closer to release, we need to try to deploy the application on production hardware. To do this, we need to:

Remove the usage of Debug from the Elm code so that we can compile with --optimize
Set-up Nginx reverse proxy
Setup static file server for the front end
Create deployment script

Add residue level aggrescan3d score plot

Currently there is only summary level data that is shown for aggrescan3d but residue level data is captured as well. We could make a plot to show residue number vs aggrescan3d score to show these results to the user.

Domain name

We need to purchase a shorter domain name before publication, maybe de-stress.app. The default domain is pragmaticproteindesign.bio.ed.ac.uk/de-stress and the new domain will just forward to this location.

Slim down reference set data and download in batches

We're downloading more data than we need for the reference sets; we probably don't need any of the log information or even the detailed score breakdown. If we remove these fields, we should be able to have much larger reference sets. Another thing we should do is download data to create a reference set in batches, so that the http request does not time out.

Server-side logging

We should make the logging on the server a bit more deliberate. Switching from print statements over to using the logging module is a good idea. This article gives a nice overview: https://towardsdatascience.com/stop-using-print-and-start-using-logging-a3f50bc8ab0
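
A minimal sketch of what the switch could look like, using only the standard logging module. The logger name destress_example and the analyse_design function are hypothetical and exist only to show where print calls would be replaced.

```python
import logging

# Hypothetical logger name; in a real module this would usually be
# logging.getLogger(__name__).
logger = logging.getLogger("destress_example")


def configure_logging(level: int = logging.INFO) -> None:
    """Set up root logging once, at application start-up."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )


def analyse_design(design_name: str, n_residues: int) -> bool:
    """Stand-in for an analysis step, with log calls instead of print."""
    logger.info("Starting analysis of %s (%d residues)", design_name, n_residues)
    if n_residues <= 0:
        logger.error("Design %s has no residues, skipping", design_name)
        return False
    logger.debug("Finished analysis of %s", design_name)
    return True
```

Compared with print, this gives each message a level and a source, so the server can run at INFO in production and DEBUG while investigating a problem without any code changes.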

Add overview information for reference sets

Currently, if you click a reference set, there isn't much information about it contained in the page. I'd like to add:

Overview plots
Information about the number of designs
A way to view the names of files in the reference set
A way to export the stats of the reference set

Add proper logging for the analysis.py module

DE-STRESS UI and headless

Allow user to define reference set from uploaded models

Using the selection tools, it is possible to make a new reference set from the structures uploaded by the user, enabling them to use unpublished PDB files.

Incorporating the Rosetta energy function into DE-STRESS

A function in analysis.py needs to be added to run the Rosetta energy function on pdb files.
An object needs to be added to elm_types.py and big_structure_models.py.
schema.py and create_entry.py need to be updated.
Finally, Metrics.elm and Uuid_String.elm need to be updated to display the energy values.

Fix design renaming

I broke this a while back in the design detail pages, I need to fix it before release.

Adding Glossary of DE-STRESS metrics

We received comments from the reviewers of the DE-STRESS submission to PEDS that we need to have a clear glossary of the metrics used in the application.

Fix default biological unit.

Currently, biological unit 1 is labelled as preferred if none is defined, and it should be unit 0.

Some pdb files cause the error "Server error while creating metrics: The job failed to run: None"

I was uploading some pdb files of antibody structures to DE-STRESS and noticed that a fair few gave the error "Server error while creating metrics: The job failed to run: None". I checked the log, and it looks like a key error with DSSP that may be happening in ampal/assembly.py. The PDB file 5XRQ.pdb gives the key error 'AA', while the pdb file 5UBY.pdb gives the key error 'AH'.

Collapsible Section

We're using collapsible sections in a couple of places now (/designs and EvoEF2 results on /designs/uuid-string), so it makes sense to make these have consistent formatting and replace the temporary styling. The generic code should be moved to Style.elm.

Add specification requirement relative to a reference set

One of the most obvious requirements that someone would want to set is to say that a metric must be within a given range relative to a reference set. A couple of thoughts on this:

Should be able to set in standard deviations or as a percentage
Should convert to a raw value so there is no dependency on the reference set after the specification is created
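
The two points above could be sketched in Python as follows: the range is computed once from the reference set's values and then stored as plain numbers. The function names are hypothetical, and the choices of the sample standard deviation and of a percentage taken relative to the mean are assumptions, since the issue does not specify either.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def range_from_std_devs(
    reference_values: Sequence[float], n_std_devs: float
) -> Tuple[float, float]:
    """Raw (low, high) range within n sample standard deviations of the mean."""
    centre = mean(reference_values)
    spread = stdev(reference_values)  # needs at least two values
    return centre - n_std_devs * spread, centre + n_std_devs * spread


def range_from_percentage(
    reference_values: Sequence[float], percentage: float
) -> Tuple[float, float]:
    """Raw (low, high) range within the given percentage of the mean."""
    centre = mean(reference_values)
    delta = abs(centre) * percentage / 100
    return centre - delta, centre + delta


def meets_requirement(value: float, allowed_range: Tuple[float, float]) -> bool:
    """Check a design's metric against the stored raw range."""
    low, high = allowed_range
    return low <= value <= high
```

Because only the resulting pair of numbers is kept in the specification, the reference set can later be changed or removed without affecting specifications that were derived from it.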

Add metrics references to the citation section

Would be good to thank the authors of the tools too

Providing descriptions for basic metrics

Some of the basic metrics like hydrophobic fitness and the secondary structure assignment don't have any explanations or pop ups at the moment. We need to provide a bit of info to explain to the user what these fields mean.

Fix composition deviation in specification

I don't think that this is working at the moment. I'm not sure if the reference set is being used to calculate it.

Tests are broken

The automated testing is currently broken, although the Python tests are passing on my system when running in the big-structure Docker container. The elm tests are also failing. After this fix, no PR that fails the tests should be merged.

Limit compute available for each session

Currently, a user could dump 1,000 structure files with 1,000,000 residues each into de-stress and the server would fall over. Although I trust our future users, it's probably not ideal behaviour. There are a few ways that we can tackle this:

Limit the number of files that can be running on the server for an individual session
Limit the number of residues allowed in a structure file
Add a time limit to individual jobs running on the server

These are not mutually exclusive and I think we should probably use all of these.
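
As a sketch of how the first two checks could be combined before any job is queued, here is a small Python example. SessionLimits and check_upload are hypothetical names. The default values mirror the limits the README quotes for the webserver (30 files, 500 residues, 20 seconds), and the time limit is only carried in the settings here, since it would have to be enforced where the job actually runs.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SessionLimits:
    max_files: int = 30
    max_residues: int = 500
    max_run_time_seconds: int = 20  # enforced by the job runner, not here


def check_upload(
    residue_counts: Dict[str, int], limits: SessionLimits = SessionLimits()
) -> List[str]:
    """Return a list of human-readable problems; an empty list means accept.

    residue_counts maps each uploaded file name to its number of residues.
    """
    problems: List[str] = []
    if len(residue_counts) > limits.max_files:
        problems.append(
            f"{len(residue_counts)} files submitted, the limit is {limits.max_files}."
        )
    for file_name, n_residues in residue_counts.items():
        if n_residues > limits.max_residues:
            problems.append(
                f"{file_name} has {n_residues} residues, "
                f"the limit is {limits.max_residues}."
            )
    return problems
```

Collecting all problems instead of stopping at the first one means the user can be told everything that needs fixing in a single message.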

Incorporating Aggrescan 3D into DE-STRESS

A function in analysis.py needs to be added to run Aggrescan3D on pdb files.
An object needs to be added to elm_types.py and big_structure_models.py.
schema.py and create_entry.py need to be updated.
Finally, Metrics.elm and Uuid_String.elm need to be updated to display the output.

Add central error reporting system in Shared.elm

All errors that should be displayed to the user should be routed through Shared.elm and displayed at the top level.

Sequence does not show non-canonical amino acids

The sequence view on Designs/Dynamic does not show non-canonical amino acids. This must be a server side problem as the front end does not determine the sequences, they are returned with the design metrics.

Build full size big structure

Once the metrics for the beta have been finalised, we need to build the full version of big structure on lilis.

Hide No metrics available if all designs have metrics

BUDEFF does not return any information on failure

The BUDEFF record does not return any information on failure. It would be nice to return something to the user!

No error/warning message displayed when loading .cif files or other .txt files.

These files don't load but there is no message to the user to tell them what has happened.

Creating some popups that can provide extra details or 'help' to the user.

These pop ups will be useful to provide extra details about the energy values from EvoEF2 and Rosetta. At the moment they have more human friendly names, but we should allow the user to see the original names from the energy function output.

Think deeply about `_weights.txt`

EvoEF2 produces a file called _weights.txt as it is running, what is this for? Do we need to return it to the user? Is it used as a cache?

Add option to download structures from PDB

I imagine that lots of people will test it with, or include, structures from the PDB. As all the metrics have been precalculated and cached for these structures, it makes sense to have a separate input method for these so that we don't recompute them every time. I don't think this is super high priority, so I'm not putting it into the v0.1.0 milestone.

Headless DE-STRESS fix: .env-headless file

.env-headless file needs HEADLESS_DESTRESS_BATCH_SIZE defined. E.g.:

POSTGRES_PASSWORD=testpassword

GUNICORN_WORKERS=2
APP_PORT=8181

EVOEF2_BINARY_PATH=/dependencies_for_de-stress/EvoEF2/EvoEF2
DFIRE2_FOLDER_PATH=/dependencies_for_de-stress/DFIRE2-pair/
ROSETTA_BINARY_PATH=/dependencies_for_de-stress/rosetta_src_2020.08.61146_bundle/main/source/bin/score_jd2.linuxgccrelease
AGGRESCAN3D_SCRIPT_PATH=/dependencies_for_de-stress/Aggrescan3D/aggrescan3D_cli_run.py

RQ_DASHBOARD_REDIS_URL=redis://redis:6379
RQ_DASHBOARD_PORT=8182

MAX_RUN_TIME=200
HEADLESS_DESTRESS_WORKERS=3
HEADLESS_DESTRESS_BATCH_SIZE=10