
COVID-19 PubSeq: Public Sequence uploader

Join the chat at https://gitter.im/arvados/pubseq

This repository provides a sequence uploader for the COVID-19 Virtual Biohackathon's Public Sequence Resource project. There are two versions: one that runs on the command line and another that provides a web interface. You can use it to upload the genomes of SARS-CoV-2 samples to make them publicly and freely available to other researchers. For more information see the paper.


To get started, first install the uploader, and use the bh20-seq-uploader command to upload your data.

Installation

There are several ways to install the uploader. The most portable is with a virtualenv.

Installation with virtualenv

  1. Prepare your system. You need to make sure you have Python, and the ability to install modules such as pycurl and pyopenssl. On Ubuntu 18.04, you can run:
sudo apt update
sudo apt install -y virtualenv git libcurl4-openssl-dev build-essential python3-dev libssl-dev
  2. Create and enter your virtualenv. Go to some memorable directory and make and enter a virtualenv:
virtualenv --python python3 venv
. venv/bin/activate

Note that you will need to repeat the . venv/bin/activate step from this directory to enter your virtualenv whenever you want to use the installed tool.

  3. Install the tool. Once in your virtualenv, install this project:

Install from PyPI:

pip3 install bh20-seq-uploader

Install from git:

pip3 install git+https://github.com/arvados/bh20-seq-resource.git@master
  4. Test the tool. Try running:
bh20-seq-uploader --help

It should print some instructions about how to use the uploader.

Make sure you are in your virtualenv whenever you run the tool! If you ever can't run the tool, and your prompt doesn't say (venv), try going to the directory where you put the virtualenv and running . venv/bin/activate. It only works for the current terminal window; you will need to run it again if you open a new terminal.

Installation with pip3 --user

If you don't want to have to enter a virtualenv every time you use the uploader, you can use the --user feature of pip3 to install the tool for your user.

  1. Prepare your system. Just as for the virtualenv method, you need to install some dependencies. On Ubuntu 18.04, you can run:
sudo apt update
sudo apt install -y virtualenv git libcurl4-openssl-dev build-essential python3-dev libssl-dev
  2. Install the tool. You can run:
pip3 install --user git+https://github.com/arvados/bh20-seq-resource.git@master
  3. Make sure the tool is on your PATH. The pip3 command will install the uploader in .local/bin inside your home directory. Your shell may not know to look for commands there by default. To fix this for the terminal you currently have open, run:
export PATH=$PATH:$HOME/.local/bin

To make this change permanent, assuming your shell is Bash, run:

echo 'export PATH=$PATH:$HOME/.local/bin' >>~/.bashrc
  4. Test the tool. Try running:
bh20-seq-uploader --help

It should print some instructions about how to use the uploader.

Installation from Source for Development

If you plan to contribute to the project, you may want to install an editable copy from source. With this method, changes to the source code are automatically reflected in the installed copy of the tool.

  1. Prepare your system. On Ubuntu 18.04, you can run:
sudo apt update
sudo apt install -y virtualenv git libcurl4-openssl-dev build-essential python3-dev libssl-dev
  2. Clone and enter the repository. You can run:
git clone https://github.com/arvados/bh20-seq-resource.git
cd bh20-seq-resource
  3. Create and enter a virtualenv. Go to some memorable directory and make and enter a virtualenv:
virtualenv --python python3 venv
. venv/bin/activate

Note that you will need to repeat the . venv/bin/activate step from this directory to enter your virtualenv whenever you want to use the installed tool.

  4. Install the checked-out repository in editable mode. Once in your virtualenv, install with this special pip command:
pip3 install -e .
  5. Test the tool. Try running:
bh20-seq-uploader --help

It should print some instructions about how to use the uploader.

Installation with GNU Guix

For running/developing the uploader with GNU Guix, see INSTALL.md.

Usage

Run the uploader with a FASTA or FASTQ file and an accompanying metadata file in JSON or YAML:

bh20-seq-uploader example/metadata.yaml example/sequence.fasta

If the sample_id of your upload matches a sample already in PubSeq, it will be considered a new version and supersede the existing entry.
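
For orientation, here is a rough sketch of what the metadata YAML could contain. The field names and structure below are illustrative only, not the authoritative schema; use example/metadata.yaml from this repository as the template for real uploads.

id: placeholder-submission-id
sample:
  sample_id: placeholder-sample-id        # identifier for this sample (illustrative)
  collection_date: "2020-04-01"
host:
  host_species: http://purl.obolibrary.org/obo/NCBITaxon_9606      # Homo sapiens
virus:
  virus_species: http://purl.obolibrary.org/obo/NCBITaxon_2697049  # SARS-CoV-2
submitter:
  authors: [Placeholder Author]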

Workflow for Generating a Pangenome

All these uploaded sequences are being fed into a workflow to generate a pangenome for the virus. You can replicate this workflow yourself.

An example is to get your SARS-CoV-2 sequences from GenBank into seqs.fa, and then run a series of commands:

minimap2 -cx asm20 -X seqs.fa seqs.fa >seqs.paf
seqwish -s seqs.fa -p seqs.paf -g seqs.gfa
odgi build -g seqs.gfa -s -o seqs.odgi
odgi viz -i seqs.odgi -o seqs.png -x 4000 -y 500 -R -P 5

This pipeline has been converted to the Common Workflow Language (CWL); the sources can be found in this repository.
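
If you want to run the CWL version yourself, a sketch of the invocation with cwltool might look like the following; the workflow path and job file name are assumptions, so check the workflows/ directory of this repository for the actual layout.

pip3 install cwltool
cwltool workflows/pangenome-generate/pangenome-generate.cwl pangenome-job.yml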

For more information on building pangenome models, see this wiki page.

Web Interface

This project comes with a simple web server that lets you use the sequence uploader from a browser. It will work as long as you install the package with the web extra.

To run it locally:

virtualenv --python python3 venv
. venv/bin/activate
pip install -e ".[web]"
env FLASK_APP=bh20simplewebuploader/main.py flask run

Then visit http://127.0.0.1:5000/.

Production

For production deployment, you can use gunicorn:

pip3 install gunicorn
gunicorn bh20simplewebuploader.main:app

This runs on http://127.0.0.1:8000/ by default, but can be adjusted with various gunicorn options.
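
For example, to listen on all interfaces with several worker processes (the values here are illustrative; tune them for your host):

gunicorn --bind 0.0.0.0:8000 --workers 4 bh20simplewebuploader.main:app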

bh20-seq-resource's People

Contributors

adamnovak, andreaguarracino, bonfacekilz, daniwelter, dcgenomics, gitter-badger, heuermh, inutano, lltommy, mr-c, pjotrp, proccaserra, stain, tetron, uniqueg, urbanslug


bh20-seq-resource's Issues

Better sample id management

From discussion 27 August 2020:

  • sample_id should be the same in the metadata file and the FASTA header of the upload; this should be validated
  • If an uploaded, valid sequence has the same sample_id as an existing validated sequence, copy the new sequence/metadata to the existing collection. Enable versioning on Arvados.
  • Cleaning up:
    • Clean up existing sequences: merge based on sequence_label and take the latest (most recent created_at).
    • Revalidate previously validated samples that have invalid dates or specimen fields.

Originally on the list (I don't think we're doing this right now):

For namespacing identifiers, sample_id should be a URI. Add command line option to uploader to give URI prefix. Give instructions to put your institution's web page if you don't know what else to use. Validate that sample_id is a valid URI.
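
A minimal sketch of what such a check could look like; the helper name is hypothetical and this is not validation the uploader currently performs:

from urllib.parse import urlparse

def looks_like_uri(sample_id):
    # Hypothetical helper: accept only sample_ids that carry a scheme plus
    # either a network location (http/https) or an opaque path (e.g. urn:).
    parsed = urlparse(sample_id)
    return bool(parsed.scheme) and bool(parsed.netloc or parsed.path)

# looks_like_uri("http://example.org/sample/MT385461.1")  -> True
# looks_like_uri("MT385461.1")                            -> False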

submitterShape error

I'm seeing this error when running an import:

[2020-07-07 20:43:19] WARNING 'MT385461.1 uploaded by unknown@50f4c4f28070 from 3.89.224.155' (lugli-4zz18-nb1luabe2d62v9k) has validation errors:
  Testing <http://arvados.org/keep:67ad83a2d78ebaa3142fda891604051d+126/metadata.yaml> against shape https://raw.githubusercontent.com/arvados/bh20-seq-resource/master/bh20sequploader/bh20seq-shex.rdf#submissionShape
    Testing _:b1 against shape https://raw.githubusercontent.com/arvados/bh20-seq-resource/master/bh20sequploader/bh20seq-shex.rdf#submitterShape
    _:b1 context:
      <http://arvados.org/keep:67ad83a2d78ebaa3142fda891604051d+126/metadata.yaml> MainSchema:submitter _:b1 .
         _:b1 sio:SIO_000116 "Data Science" .
         _:b1 sio:SIO_000172 "Chan-Zuckerberg Biohub, 499 Illinois St, San Francisco, CA 94158, USA" .

         No matching triples found for predicate obo:NCIT_C42781

Automated uploads

We need to add a permaid (see #103). I think submitter and submission are missing too. I'll check.

[Build] Unable to install the package from Github

Description:

I'm getting the following after trying to install the package from GH:

Collecting git+https://github.com/arvados/bh20-seq-resource.git
  Cloning https://github.com/arvados/bh20-seq-resource.git to /tmp/pip-req-build-sx_tz19n
  Running command git clone -q https://github.com/arvados/bh20-seq-resource.git /tmp/pip-req-build-sx_tz19n
Collecting arvados-python-client
  Using cached arvados-python-client-2.0.2.tar.gz (182 kB)
Collecting schema-salad
  Using cached schema_salad-5.0.20200416112825-py3-none-any.whl (457 kB)
Collecting python-magic
  Using cached python_magic-0.4.15-py2.py3-none-any.whl (5.5 kB)
Collecting pyshex
  Using cached PyShEx-0.7.14-py3-none-any.whl (50 kB)
Collecting ciso8601>=2.0.0
  Using cached ciso8601-2.1.3.tar.gz (15 kB)
Collecting future
  Using cached future-0.18.2.tar.gz (829 kB)
Collecting google-api-python-client<1.7,>=1.6.2
  Using cached google_api_python_client-1.6.7-py2.py3-none-any.whl (56 kB)
Collecting httplib2>=0.9.2
  Using cached httplib2-0.17.3-py3-none-any.whl (95 kB)
Processing /home/bonface/.cache/pip/wheels/40/ae/bd/3e7d7af6588020c7e993f6f114fb708d966276dbc2f224d3f9/pycurl-7.43.0.5-cp38-cp38-linux_x86_64.whl
Collecting ruamel.yaml<=0.15.77,>=0.15.54
  Using cached ruamel.yaml-0.15.77.tar.gz (312 kB)
    ERROR: Command errored out with exit status 1:
     command: /home/bonface/projects/bh20-seq-resource/venv/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-h7g8lchz/ruamel.yaml/setup.py'"'"'; __file__='"'"'/tmp/pip-install-h7g8lchz/ruamel.yaml/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-install-h7g8lchz/ruamel.yaml/pip-egg-info
         cwd: /tmp/pip-install-h7g8lchz/ruamel.yaml/
    Complete output (11 lines):
    Traceback (most recent call last):
      File "", line 1, in 
      File "/tmp/pip-install-h7g8lchz/ruamel.yaml/setup.py", line 211, in 
        pkg_data = _package_data(__file__.replace('setup.py', '__init__.py'))
      File "/tmp/pip-install-h7g8lchz/ruamel.yaml/setup.py", line 184, in _package_data
        data = literal_eval("".join(lines))
      File "/tmp/pip-install-h7g8lchz/ruamel.yaml/setup.py", line 156, in literal_eval
        return _convert(node_or_string)
      File "/tmp/pip-install-h7g8lchz/ruamel.yaml/setup.py", line 95, in _convert
        if isinstance(node, Str):
    NameError: name 'Str' is not defined
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

To Reproduce:

Steps to reproduce:

virtualenv --python python3 venv
. venv/bin/activate
pip install git+https://github.com/arvados/bh20-seq-resource.git

Expected Behaviour:

Should be able to install the package without any problems

Environment setup:

  • OS: Arch Linux
  • Python Version: python3.8.2

Docker image needs 'latest' tag

Currently, pangenome-generate.cwl fails when trying to pull jerven/spodgi; looking at Docker Hub, the image does not have a latest tag. Explicitly pulling jerven/spodgi:0.0.5 fixes the problem.

A better solution would be to add a latest tag to the image on Docker Hub.
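
A sketch of how the tag could be added (requires push access to the jerven/spodgi repository on Docker Hub):

docker pull jerven/spodgi:0.0.5
docker tag jerven/spodgi:0.0.5 jerven/spodgi:latest
docker push jerven/spodgi:latest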

Allow for metadata updates

It is possible the same sequence gets multiple metadata entries (e.g. for clinical outcome). So we need to be able to add metadata several times. Furthermore it is important to be able to update existing metadata - I think simply by adding versions.

Add metadata on workflows

We should capture the metadata on workflows somehow, as per @LLTommy's suggestion:

https://docs.dockstore.org/en/develop/advanced-topics/best-practices/best-practices.html

Make sure each CWL workflow of interest is listed in https://github.com/arvados/bh20-seq-resource/blob/master/.dockstore.yml (the pangenome-generate workflow is listed, as of 2020-10-06) -> https://dockstore.org/my-workflows/github.com/arvados/bh20-seq-resource/Pangenome%20Generator

Note, the pangenome-generator has also been published to https://workflowhub.eu/workflows/63

Compute HASH on inputs

When submitting a sequence and metadata we can compute a hash value over the submission to make sure it is not already in the database. People will accidentally resubmit stuff and there is no reason to trigger the pipeline. Or, @tetron, is this automatic in Arvados?
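
Note that Arvados collections are already content-addressed (each collection has a portable data hash), which may cover part of this. For an application-level check before upload, a minimal sketch could hash the raw submission files; the function name is hypothetical:

import hashlib

def submission_hash(sequence_path, metadata_path):
    # Hypothetical helper: fingerprint a submission by hashing the raw
    # sequence and metadata bytes together.
    h = hashlib.sha256()
    for path in (sequence_path, metadata_path):
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
    return h.hexdigest()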

Updated virtuoso instance

We have a script which can update Virtuoso. It checks Arvados for updates to the metadata.ttl. I still need to:

  • Run it as a cron job (see the sketch after this list)
  • Clear the old graph before updates (requires permissions)
  • Add the update time stamp to the store
  • Perhaps bring in graph versioning
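
A sketch of the cron step, assuming the update script is installed somewhere like /usr/local/bin (the path and schedule are illustrative):

# crontab entry: refresh the Virtuoso store hourly
0 * * * * /usr/local/bin/update-virtuoso.sh >> /var/log/virtuoso-update.log 2>&1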

Add support for bulk uploads

To support bulk uploads we don't want to trigger the workflows at every step. I think we should have a switch that prevents the workflows from running on individual submissions. Just add the sequence and metadata to Keep.
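
One possible shape for such a switch, as a sketch only (the flag name is hypothetical and not implemented):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--no-start-workflow", action="store_true",
                    help="add the sequence and metadata to Keep without triggering the analysis workflows")
args = parser.parse_args(["--no-start-workflow"])
# Downstream, the upload code would skip the workflow-trigger step when
# args.no_start_workflow is True.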

Ontology for assemblies

At least:

  • pangenome from only de-novo assemblies (should be of higher quality)
  • pangenome from de-novo assemblies and read mapping experiments (reference biased)

ModuleNotFoundError: No module named 'qc_metadata'

I've installed bh20-seq-resource with a combination of pip and Guix. I ran 'guix environment --ad-hoc python curl python-pycurl' to get an environment with python3 and python-pycurl and then ran 'pip3 install --user git+https://github.com/arvados/bh20-seq-resource.git@master'. ~/.local/bin/bh20-seq-uploader --help gives me the output:

Traceback (most recent call last):
  File "/home/efraimf/.local/bin/bh20-seq-uploader", line 11, in <module>
    load_entry_point('bh20-seq-uploader==1.0.20200410122633', 'console_scripts', 'bh20-seq-uploader')()
  File "/gnu/store/mpzj907020y44yvi9gyzrmgs409xwkw9-profile/lib/python3.7/site-packages/pkg_resources/__init__.py", line 489, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/gnu/store/mpzj907020y44yvi9gyzrmgs409xwkw9-profile/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2793, in load_entry_point
    return ep.load()
  File "/gnu/store/mpzj907020y44yvi9gyzrmgs409xwkw9-profile/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2411, in load
    return self.resolve()
  File "/gnu/store/mpzj907020y44yvi9gyzrmgs409xwkw9-profile/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2417, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/home/efraimf/.local/lib/python3.7/site-packages/bh20sequploader/main.py", line 9, in <module>
    import qc_metadata
ModuleNotFoundError: No module named 'qc_metadata'
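
The usual fix for this kind of error is to make the import package-relative in bh20sequploader/main.py so it resolves when the tool is installed via pip rather than run from the source tree. A sketch (this may or may not match the exact patch applied upstream):

# bh20sequploader/main.py -- sketch of the import fix
from bh20sequploader import qc_metadata   # or equivalently: from . import qc_metadata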

Filter out a list of sequences before building the GFA

We need a script added to pangenome-generate-cwl to remove the following sequences from the public sequence resource prior to creating a GFA output:

00ef4c4427c0881a0030f7f400ce1ed0+123/sequence.fasta 1a191370cb868f80c824d93f9169599a+126/sequence.fasta 9e6fe32c3f7d281332ba958b5f62d109+123/sequence.fasta bafb25a84fa5167d5a049fa43d607a44+126/sequence.fasta 9fe51f2847f3e8e3060c9ddebf3a41e5+123/sequence.fasta d637278d9b95bbd1a5ef0bcd17a95c21+123/sequence.fasta 53fa57b401f3695feb0facf498f60871+123/sequence.fasta 392451211d0b7500ebaaa4e3182838be+123/sequence.fasta bc7dcac01570c2fb81f16f76b98add9d+126/sequence.fasta 898c212f7a9d4984c382d782bad53fd4+123/sequence.fasta f8001cec2144c59cbd851706b898ddfe+123/sequence.fasta 71063763aabd91e0b33d6861294bdff6+123/sequence.fasta 57dca4995c2186b11b67ab1cff0b005b+126/sequence.fasta f95a298c57718bf290d9facdda59eb66+123/sequence.fasta 71da768110cd21ff99f5664bc335a4ec+126/sequence.fasta 06f5726c45483d0e8fdea3004f2c4adf+123/sequence.fasta f9cea932bff8e83a2cb490c3bd694742+123/sequence.fasta 5914683bbe1ff047a163b3e57110f11b+126/sequence.fasta 27bb9a654a5f46e08888f55021d37b17+126/sequence.fasta a9be2d60f66fd03a75418b40306ededc+126/sequence.fasta aa1d1c497dabed0589c8ea6423179441+123/sequence.fasta c6f8550cf6940591fea7de5f2159d88b+123/sequence.fasta ab9c2241bda0599d20877ece1e1bc04e+126/sequence.fasta 5caa10de623c2384a31160c72a8f4f9c+126/sequence.fasta 0f24420528d58bff3468084aca3d7328+123/sequence.fasta 4887cadadce95997fed59d129e47b47b+126/sequence.fasta e8e00929537a550b0989be12147d6241+126/sequence.fasta 7ebbc05a6949a6ce0637fa692af183ad+126/sequence.fasta 6566c86da5313159640092f16ac8a0cb+123/sequence.fasta d04a38579335168796dd8d25f362ff8f+123/sequence.fasta 810d1e1012cbc4f63226159bd8b1fa08+123/sequence.fasta 4d40985616d6975a41a117c41fd38145+123/sequence.fasta d2062c46515c5fffed7d27b95a9e32c9+126/sequence.fasta
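
A sketch of the kind of pre-filter meant here, assuming the inputs can be expressed as a text file of Keep paths (one per line) and the paths above are kept in a blocklist file; the names and file layout are assumptions:

import sys

def filter_inputs(input_list_path, blocklist_path):
    # Drop any input path that appears in the blocklist before the
    # GFA-building step sees it.
    with open(blocklist_path) as f:
        blocked = {line.strip() for line in f if line.strip()}
    with open(input_list_path) as f:
        return [line.strip() for line in f
                if line.strip() and line.strip() not in blocked]

if __name__ == "__main__":
    for path in filter_inputs(sys.argv[1], sys.argv[2]):
        print(path)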

Add fastq support

Need to check the recompute of FASTQ from long reads and short reads. I think it is a good idea to focus on ONT (Oxford Nanopore) initially.

Add remark field to metadata

A free-text remark field is probably a good idea. It can lead to additions to the schema. Also, I think we can allow for additional RDF if people want that. At the least, propose such a field.
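
As a sketch of what the proposal could look like in a submission's metadata YAML (the field name and placement are only a suggestion):

sample:
  sample_id: placeholder-sample-id
  # free-text remark; recurring remarks could later be promoted to structured schema fields
  remark: "Sequenced twice; the second run replaces the first."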
