
COVID-19 PubSeq: Public Sequence uploader

Join the chat at https://gitter.im/arvados/pubseq

This repository provides a sequence uploader for the COVID-19 Virtual Biohackathon's Public Sequence Resource project. There are two versions: one that runs on the command line and another that provides a web interface. You can use it to upload the genomes of SARS-CoV-2 samples to make them publicly and freely available to other researchers. For more information see the paper.


To get started, first install the uploader, and use the bh20-seq-uploader command to upload your data.

Installation

There are several ways to install the uploader. The most portable is with a virtualenv.

Installation with virtualenv

  1. Prepare your system. You need to make sure you have Python, and the ability to install modules such as pycurl and pyopenssl. On Ubuntu 18.04, you can run:
sudo apt update
sudo apt install -y virtualenv git libcurl4-openssl-dev build-essential python3-dev libssl-dev
  2. Create and enter your virtualenv. Go to some memorable directory and make and enter a virtualenv:
virtualenv --python python3 venv
. venv/bin/activate

Note that you will need to repeat the . venv/bin/activate step from this directory to enter your virtualenv whenever you want to use the installed tool.

  3. Install the tool. Once in your virtualenv, install this project:

Install from PyPI:

pip3 install bh20-seq-uploader

Install from git:

pip3 install git+https://github.com/arvados/bh20-seq-resource.git@master
  4. Test the tool. Try running:
bh20-seq-uploader --help

It should print some instructions about how to use the uploader.

Make sure you are in your virtualenv whenever you run the tool! If you ever can't run the tool, and your prompt doesn't say (venv), try going to the directory where you put the virtualenv and running . venv/bin/activate. It only works for the current terminal window; you will need to run it again if you open a new terminal.

Installation with pip3 --user

If you don't want to have to enter a virtualenv every time you use the uploader, you can use the --user feature of pip3 to install the tool for your user.

  1. Prepare your system. Just as for the virtualenv method, you need to install some dependencies. On Ubuntu 18.04, you can run:
sudo apt update
sudo apt install -y virtualenv git libcurl4-openssl-dev build-essential python3-dev libssl-dev
  2. Install the tool. You can run:
pip3 install --user git+https://github.com/arvados/bh20-seq-resource.git@master
  3. Make sure the tool is on your PATH. The pip3 command will install the uploader in .local/bin inside your home directory. Your shell may not know to look for commands there by default. To fix this for the terminal you currently have open, run:
export PATH=$PATH:$HOME/.local/bin

To make this change permanent, assuming your shell is Bash, run:

echo 'export PATH=$PATH:$HOME/.local/bin' >>~/.bashrc
  4. Test the tool. Try running:
bh20-seq-uploader --help

It should print some instructions about how to use the uploader.

Installation from Source for Development

If you plan to contribute to the project, you may want to install an editable copy from source. With this method, changes to the source code are automatically reflected in the installed copy of the tool.

  1. Prepare your system. On Ubuntu 18.04, you can run:
sudo apt update
sudo apt install -y virtualenv git libcurl4-openssl-dev build-essential python3-dev libssl-dev
  2. Clone and enter the repository. You can run:
git clone https://github.com/arvados/bh20-seq-resource.git
cd bh20-seq-resource
  3. Create and enter a virtualenv. Go to some memorable directory and make and enter a virtualenv:
virtualenv --python python3 venv
. venv/bin/activate

Note that you will need to repeat the . venv/bin/activate step from this directory to enter your virtualenv whenever you want to use the installed tool.

  4. Install the checked-out repository in editable mode. Once in your virtualenv, install with this special pip command:
pip3 install -e .
  5. Test the tool. Try running:
bh20-seq-uploader --help

It should print some instructions about how to use the uploader.

Installation with GNU Guix

For running/developing the uploader with GNU Guix, see INSTALL.md.

Usage

Run the uploader with a FASTA or FASTQ file and an accompanying metadata file in JSON or YAML:

bh20-seq-uploader example/metadata.yaml example/sequence.fasta

If the sample_id of your upload matches a sample already in PubSeq, it will be considered a new version and supersede the existing entry.
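
For orientation, here is a rough sketch of what the metadata YAML could contain. The field names and structure below are illustrative only, not the authoritative schema; use example/metadata.yaml from this repository as the template for real uploads.

id: placeholder-submission-id
sample:
  sample_id: placeholder-sample-id        # identifier for this sample (illustrative)
  collection_date: "2020-04-01"
host:
  host_species: http://purl.obolibrary.org/obo/NCBITaxon_9606      # Homo sapiens
virus:
  virus_species: http://purl.obolibrary.org/obo/NCBITaxon_2697049  # SARS-CoV-2
submitter:
  authors: [Placeholder Author]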

Workflow for Generating a Pangenome

All these uploaded sequences are being fed into a workflow to generate a pangenome for the virus. You can replicate this workflow yourself.

An example is to get your SARS-CoV-2 sequences from GenBank into seqs.fa, and then run a series of commands:

minimap2 -cx asm20 -X seqs.fa seqs.fa >seqs.paf
seqwish -s seqs.fa -p seqs.paf -g seqs.gfa
odgi build -g seqs.gfa -s -o seqs.odgi
odgi viz -i seqs.odgi -o seqs.png -x 4000 -y 500 -R -P 5

This pipeline has been converted to the Common Workflow Language (CWL); the sources can be found in this repository.
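
If you want to run the CWL version yourself, a sketch of the invocation with cwltool might look like the following; the workflow path and job file name are assumptions, so check the workflows/ directory of this repository for the actual layout.

pip3 install cwltool
cwltool workflows/pangenome-generate/pangenome-generate.cwl pangenome-job.yml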

For more information on building pangenome models, see this wiki page.

Web Interface

This project comes with a simple web server that lets you use the sequence uploader from a browser. It will work as long as you install the package with the web extra.

To run it locally:

virtualenv --python python3 venv
. venv/bin/activate
pip install -e ".[web]"
env FLASK_APP=bh20simplewebuploader/main.py flask run

Then visit http://127.0.0.1:5000/.

Production

For production deployment, you can use gunicorn:

pip3 install gunicorn
gunicorn bh20simplewebuploader.main:app

This runs on http://127.0.0.1:8000/ by default, but can be adjusted with various gunicorn options.
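
For example, to listen on all interfaces with several worker processes (the values here are illustrative; tune them for your host):

gunicorn --bind 0.0.0.0:8000 --workers 4 bh20simplewebuploader.main:app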

bh20-seq-resource's People

Contributors

adamnovak, andreaguarracino, bonfacekilz, daniwelter, dcgenomics, gitter-badger, heuermh, inutano, lltommy, mr-c, pjotrp, proccaserra, stain, tetron, uniqueg, urbanslug


bh20-seq-resource's Issues

Better sample id management

From discussion 27 August 2020:

  • sample_id should be the same in the metadata file and the FASTA header of the upload; this should be validated
  • If an uploaded, valid sequence has the same sample_id as an existing validated sequence, copy the new sequence/metadata to the existing collection. Enable versioning on Arvados.
  • Cleaning up:
    • Clean up existing sequences: merge based on sequence_label and take the latest (most recent created_at).
    • Revalidate previously validated samples that have invalid dates or specimen fields.

Originally on the list (I don't think we're doing this right now):

For namespacing identifiers, sample_id should be a URI. Add command line option to uploader to give URI prefix. Give instructions to put your institution's web page if you don't know what else to use. Validate that sample_id is a valid URI.
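
A minimal sketch of what such a check could look like; the helper name is hypothetical and this is not validation the uploader currently performs:

from urllib.parse import urlparse

def looks_like_uri(sample_id):
    # Hypothetical helper: accept only sample_ids that carry a scheme plus
    # either a network location (http/https) or an opaque path (e.g. urn:).
    parsed = urlparse(sample_id)
    return bool(parsed.scheme) and bool(parsed.netloc or parsed.path)

# looks_like_uri("http://example.org/sample/MT385461.1")  -> True
# looks_like_uri("MT385461.1")                            -> False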

submitterShape error

I'm seeing this error when running an import:

[2020-07-07 20:43:19] WARNING 'MT385461.1 uploaded by unknown@50f4c4f28070 from 3.89.224.155' (lugli-4zz18-nb1luabe2d62v9k) has validation errors:
  Testing <http://arvados.org/keep:67ad83a2d78ebaa3142fda891604051d+126/metadata.yaml> against shape https://raw.githubusercontent.com/arvados/bh20-seq-resource/master/bh20sequploader/bh20seq-shex.rdf#submissionShape
    Testing _:b1 against shape https://raw.githubusercontent.com/arvados/bh20-seq-resource/master/bh20sequploader/bh20seq-shex.rdf#submitterShape
    _:b1 context:
      <http://arvados.org/keep:67ad83a2d78ebaa3142fda891604051d+126/metadata.yaml> MainSchema:submitter _:b1 .
         _:b1 sio:SIO_000116 "Data Science" .
         _:b1 sio:SIO_000172 "Chan-Zuckerberg Biohub, 499 Illinois St, San Francisco, CA 94158, USA" .

         No matching triples found for predicate obo:NCIT_C42781

Automated uploads

We need to add a permaid (see #103). I think submitter and submission are missing too. I'll check.

[Build] Unable to install the package from Github

Description:

I'm getting the following after trying to install the package from GH:

Collecting git+https://github.com/arvados/bh20-seq-resource.git
  Cloning https://github.com/arvados/bh20-seq-resource.git to /tmp/pip-req-build-sx_tz19n
  Running command git clone -q https://github.com/arvados/bh20-seq-resource.git /tmp/pip-req-build-sx_tz19n
Collecting arvados-python-client
  Using cached arvados-python-client-2.0.2.tar.gz (182 kB)
Collecting schema-salad
  Using cached schema_salad-5.0.20200416112825-py3-none-any.whl (457 kB)
Collecting python-magic
  Using cached python_magic-0.4.15-py2.py3-none-any.whl (5.5 kB)
Collecting pyshex
  Using cached PyShEx-0.7.14-py3-none-any.whl (50 kB)
Collecting ciso8601>=2.0.0
  Using cached ciso8601-2.1.3.tar.gz (15 kB)
Collecting future
  Using cached future-0.18.2.tar.gz (829 kB)
Collecting google-api-python-client<1.7,>=1.6.2
  Using cached google_api_python_client-1.6.7-py2.py3-none-any.whl (56 kB)
Collecting httplib2>=0.9.2
  Using cached httplib2-0.17.3-py3-none-any.whl (95 kB)
Processing /home/bonface/.cache/pip/wheels/40/ae/bd/3e7d7af6588020c7e993f6f114fb708d966276dbc2f224d3f9/pycurl-7.43.0.5-cp38-cp38-linux_x86_64.whl
Collecting ruamel.yaml<=0.15.77,>=0.15.54
  Using cached ruamel.yaml-0.15.77.tar.gz (312 kB)
    ERROR: Command errored out with exit status 1:
     command: /home/bonface/projects/bh20-seq-resource/venv/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-h7g8lchz/ruamel.yaml/setup.py'"'"'; __file__='"'"'/tmp/pip-install-h7g8lchz/ruamel.yaml/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-install-h7g8lchz/ruamel.yaml/pip-egg-info
         cwd: /tmp/pip-install-h7g8lchz/ruamel.yaml/
    Complete output (11 lines):
    Traceback (most recent call last):
      File "", line 1, in 
      File "/tmp/pip-install-h7g8lchz/ruamel.yaml/setup.py", line 211, in 
        pkg_data = _package_data(__file__.replace('setup.py', '__init__.py'))
      File "/tmp/pip-install-h7g8lchz/ruamel.yaml/setup.py", line 184, in _package_data
        data = literal_eval("".join(lines))
      File "/tmp/pip-install-h7g8lchz/ruamel.yaml/setup.py", line 156, in literal_eval
        return _convert(node_or_string)
      File "/tmp/pip-install-h7g8lchz/ruamel.yaml/setup.py", line 95, in _convert
        if isinstance(node, Str):
    NameError: name 'Str' is not defined
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

To Reproduce:

Steps to reproduce:

virtualenv --python python3 venv
. venv/bin/activate
pip install git+https://github.com/arvados/bh20-seq-resource.git

Expected Behaviour:

Should be able to install the package without any problems

Environment setup:

  • OS: Arch Linux
  • Python Version: python3.8.2

Docker image needs 'latest' tag

Currently, pangenome-generate.cwl fails when trying to pull jerven/spodgi; looking at Docker Hub, the image does not have a latest tag. Explicitly pulling jerven/spodgi:0.0.5 fixes the problem.

A better solution would be to add a latest tag to the image on Docker Hub.
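
A sketch of how the tag could be added (requires push access to the jerven/spodgi repository on Docker Hub):

docker pull jerven/spodgi:0.0.5
docker tag jerven/spodgi:0.0.5 jerven/spodgi:latest
docker push jerven/spodgi:latest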

Allow for metadata updates

It is possible the same sequence gets multiple metadata entries (e.g. for clinical outcome). So we need to be able to add metadata several times. Furthermore it is important to be able to update existing metadata - I think simply by adding versions.

Add metadata on workflows

We should capture the metadata on workflows somehow, as per @LLTommy's suggestion:

https://docs.dockstore.org/en/develop/advanced-topics/best-practices/best-practices.html

Make sure each CWL workflow of interest is listed in https://github.com/arvados/bh20-seq-resource/blob/master/.dockstore.yml (the pangenome-generate workflow is listed, as of 2020-10-06) -> https://dockstore.org/my-workflows/github.com/arvados/bh20-seq-resource/Pangenome%20Generator

Note, the pangenome-generator has also been published to https://workflowhub.eu/workflows/63

Compute HASH on inputs

When submitting a sequence and metadata we can compute a hash value over the submission to make sure it is not already in the database. People will accidentally resubmit stuff and there is no reason to trigger the pipeline. Or, @tetron, is this automatic in Arvados?
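
Note that Arvados collections are already content-addressed (each collection has a portable data hash), which may cover part of this. For an application-level check before upload, a minimal sketch could hash the raw submission files; the function name is hypothetical:

import hashlib

def submission_hash(sequence_path, metadata_path):
    # Hypothetical helper: fingerprint a submission by hashing the raw
    # sequence and metadata bytes together.
    h = hashlib.sha256()
    for path in (sequence_path, metadata_path):
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
    return h.hexdigest()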

Updated virtuoso instance

We have a script which can update Virtuoso. It checks Arvados for updates to the metadata.ttl. I still need to:

  • Run it as a cron job (see the sketch after this list)
  • Clear the old graph before updates (requires permissions)
  • Add the update time stamp to the store
  • Perhaps bring in graph versioning
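
A sketch of the cron step, assuming the update script is installed somewhere like /usr/local/bin (the path and schedule are illustrative):

# crontab entry: refresh the Virtuoso store hourly
0 * * * * /usr/local/bin/update-virtuoso.sh >> /var/log/virtuoso-update.log 2>&1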

Add support for bulk uploads

To support bulk uploads we don't want to trigger the workflows at every step. I think we should have a switch that prevents the workflows from running on individual submissions. Just add the sequence and metadata to Keep.
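
One possible shape for such a switch, as a sketch only (the flag name is hypothetical and not implemented):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--no-start-workflow", action="store_true",
                    help="add the sequence and metadata to Keep without triggering the analysis workflows")
args = parser.parse_args(["--no-start-workflow"])
# Downstream, the upload code would skip the workflow-trigger step when
# args.no_start_workflow is True.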

Ontology for assemblies

At least:

  • pangenome from only de-novo assemblies (should be of higher quality)
  • pangenome from de-novo assemblies and read mapping experiments (reference biased)

ModuleNotFoundError: No module named 'qc_metadata'

I've installed bh20-seq-resource with a combination of pip and Guix. I ran 'guix environment --ad-hoc python curl python-pycurl' to get an environment with python3 and python-pycurl and then ran 'pip3 install --user git+https://github.com/arvados/bh20-seq-resource.git@master'. ~/.local/bin/bh20-seq-uploader --help gives me the output:

Traceback (most recent call last):
  File "/home/efraimf/.local/bin/bh20-seq-uploader", line 11, in <module>
    load_entry_point('bh20-seq-uploader==1.0.20200410122633', 'console_scripts', 'bh20-seq-uploader')()
  File "/gnu/store/mpzj907020y44yvi9gyzrmgs409xwkw9-profile/lib/python3.7/site-packages/pkg_resources/__init__.py", line 489, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/gnu/store/mpzj907020y44yvi9gyzrmgs409xwkw9-profile/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2793, in load_entry_point
    return ep.load()
  File "/gnu/store/mpzj907020y44yvi9gyzrmgs409xwkw9-profile/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2411, in load
    return self.resolve()
  File "/gnu/store/mpzj907020y44yvi9gyzrmgs409xwkw9-profile/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2417, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/home/efraimf/.local/lib/python3.7/site-packages/bh20sequploader/main.py", line 9, in <module>
    import qc_metadata
ModuleNotFoundError: No module named 'qc_metadata'
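
The usual fix for this kind of error is to make the import package-relative in bh20sequploader/main.py so it resolves when the tool is installed via pip rather than run from the source tree. A sketch (this may or may not match the exact patch applied upstream):

# bh20sequploader/main.py -- sketch of the import fix
from bh20sequploader import qc_metadata   # or equivalently: from . import qc_metadata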

Filter out a list of sequences before building the GFA

We need a script added to pangenome-generate-cwl to remove the following sequences from the public sequence resource prior to creating a GFA output:

00ef4c4427c0881a0030f7f400ce1ed0+123/sequence.fasta 1a191370cb868f80c824d93f9169599a+126/sequence.fasta 9e6fe32c3f7d281332ba958b5f62d109+123/sequence.fasta bafb25a84fa5167d5a049fa43d607a44+126/sequence.fasta 9fe51f2847f3e8e3060c9ddebf3a41e5+123/sequence.fasta d637278d9b95bbd1a5ef0bcd17a95c21+123/sequence.fasta 53fa57b401f3695feb0facf498f60871+123/sequence.fasta 392451211d0b7500ebaaa4e3182838be+123/sequence.fasta bc7dcac01570c2fb81f16f76b98add9d+126/sequence.fasta 898c212f7a9d4984c382d782bad53fd4+123/sequence.fasta f8001cec2144c59cbd851706b898ddfe+123/sequence.fasta 71063763aabd91e0b33d6861294bdff6+123/sequence.fasta 57dca4995c2186b11b67ab1cff0b005b+126/sequence.fasta f95a298c57718bf290d9facdda59eb66+123/sequence.fasta 71da768110cd21ff99f5664bc335a4ec+126/sequence.fasta 06f5726c45483d0e8fdea3004f2c4adf+123/sequence.fasta f9cea932bff8e83a2cb490c3bd694742+123/sequence.fasta 5914683bbe1ff047a163b3e57110f11b+126/sequence.fasta 27bb9a654a5f46e08888f55021d37b17+126/sequence.fasta a9be2d60f66fd03a75418b40306ededc+126/sequence.fasta aa1d1c497dabed0589c8ea6423179441+123/sequence.fasta c6f8550cf6940591fea7de5f2159d88b+123/sequence.fasta ab9c2241bda0599d20877ece1e1bc04e+126/sequence.fasta 5caa10de623c2384a31160c72a8f4f9c+126/sequence.fasta 0f24420528d58bff3468084aca3d7328+123/sequence.fasta 4887cadadce95997fed59d129e47b47b+126/sequence.fasta e8e00929537a550b0989be12147d6241+126/sequence.fasta 7ebbc05a6949a6ce0637fa692af183ad+126/sequence.fasta 6566c86da5313159640092f16ac8a0cb+123/sequence.fasta d04a38579335168796dd8d25f362ff8f+123/sequence.fasta 810d1e1012cbc4f63226159bd8b1fa08+123/sequence.fasta 4d40985616d6975a41a117c41fd38145+123/sequence.fasta d2062c46515c5fffed7d27b95a9e32c9+126/sequence.fasta
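
A sketch of the kind of pre-filter meant here, assuming the inputs can be expressed as a text file of Keep paths (one per line) and the paths above are kept in a blocklist file; the names and file layout are assumptions:

import sys

def filter_inputs(input_list_path, blocklist_path):
    # Drop any input path that appears in the blocklist before the
    # GFA-building step sees it.
    with open(blocklist_path) as f:
        blocked = {line.strip() for line in f if line.strip()}
    with open(input_list_path) as f:
        return [line.strip() for line in f
                if line.strip() and line.strip() not in blocked]

if __name__ == "__main__":
    for path in filter_inputs(sys.argv[1], sys.argv[2]):
        print(path)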

Add fastq support

Need to check the recompute of FASTQ from long reads and short reads. I think it is a good idea to focus on ONT (Oxford Nanopore) initially.

Add remark field to metadata

A free-text remark field is probably a good idea. It can lead to additions to the schema. Also, I think we can allow for additional RDF if people want that. At the least, propose such a field.
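
As a sketch of what the proposal could look like in a submission's metadata YAML (the field name and placement are only a suggestion):

sample:
  sample_id: placeholder-sample-id
  # free-text remark; recurring remarks could later be promoted to structured schema fields
  remark: "Sequenced twice; the second run replaces the first."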
