cdcgov / seqsender

Automated Pipeline to Generate FTP Files and Manage Submission of Sequence Data to Public Repositories

Home Page: https://cdcgov.github.io/seqsender/

License: Apache License 2.0

Python 96.71% Dockerfile 3.29%

genbank gisaid ncbi-biosamples ncbi-genbank ncbi-sra ncbi-submission bioinformatics-pipeline biosample gisaid-format gisaid-upload

seqsender's Introduction

Public Database Submission Pipeline

Beta Version: 1.1.0. This pipeline is currently in Beta testing, and issues could appear during submission. Please use it at your own risk. Feedback and suggestions are welcome!

General Disclaimer: This repository was created for use by CDC programs to collaborate on public health related projects in support of the CDC mission. GitHub is not hosted by the CDC, but is a third party website used by CDC and its partners to share information and collaborate on software. CDC use of GitHub does not imply an endorsement of any one particular service, product, or enterprise.

Documentation

Overview

seqsender is a Python program that automates generating the necessary submission files and batch uploading them to NCBI archives (such as BioSample, SRA, and GenBank) and GISAID databases (e.g. EpiFlu and EpiCoV). Presently, the pipeline can upload Influenza A Virus (FLU) and SARS-CoV-2 (COV) data. However, the pipeline is designed so that support for other organisms can be added in future updates or on request.

Contacts

Role	Contact
Creator	Dakota Howard, Reina Chau
Maintainer	Dakota Howard
Back-Up	Reina Chau, Brian Lee

Prerequisites

NCBI Submissions

seqsender uses a UI-less Data Submission Protocol to bulk upload submission files (e.g., submission.xml, submission.zip, etc.) to NCBI archives. The submission files are uploaded to the NCBI server via FTP on the command line. Before attempting a submission with seqsender, the submitter will need to:

Have an NCBI account. To sign up, visit the NCBI website.
Create a center account for your institution/lab (required for CDC users and highly recommended for others); see the NCBI Center Account Instructions. Center accounts allow you to perform UI-less submissions as your institution/lab.
Create a submission group in the NCBI Submission Portal (required for CDC users and recommended for others). The group should include all individuals who need access to UI-less submissions through the web interface with your center account. Each member of the group must also have an individual NCBI account, which can be created on the NCBI website.
Refer to the NCBI GenBank FTP Submissions page for the requirements for GenBank submissions made via FTP only. This page applies only to COVID and Influenza. For further questions, contact [email protected] to discuss submission requirements.
Coordinate an NCBI namespace name (spuid_namespace) that will be used with Submitter Provided Unique Identifiers (spuid) in the submission. The combination of spuid_namespace and spuid is used to report back assigned accessions and to cross-link objects within a submission. The values of spuid_namespace are up to the submitter, but they must be unique and coordinated before making a submission. For more information about these two fields, see the BioSample / SRA / GenBank metadata requirements.
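
To make the spuid_namespace and spuid pairing concrete, here is a minimal sketch that builds one identifier element with Python's standard library. The element and attribute names follow the pattern seen in NCBI submission XML examples, but treat that layout as an assumption to check against the current NCBI schema. The function name and the values are hypothetical, and this is not taken from seqsender's code.

```python
import xml.etree.ElementTree as ET


def build_identifier(spuid_namespace: str, spuid: str) -> ET.Element:
    """Build an <Identifier> element holding one SPUID (assumed layout)."""
    if not spuid_namespace or not spuid:
        raise ValueError("spuid_namespace and spuid must both be non-empty")
    identifier = ET.Element("Identifier")
    # The namespace is an attribute; the submitter-chosen ID is the text.
    node = ET.SubElement(identifier, "SPUID", {"spuid_namespace": spuid_namespace})
    node.text = spuid
    return identifier


# Hypothetical namespace and sample ID, for illustration only.
element = build_identifier("EXAMPLE_LAB", "sample_001")
print(ET.tostring(element, encoding="unicode"))
```

Because accessions are reported back against this pair, reusing the same spuid under the same namespace for two different objects would make the report ambiguous, which is why the namespace needs to be agreed on in advance.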

GISAID Submissions

seqsender uses GISAID’s Command Line Interface (CLI) tools to bulk upload metadata and sequence data to GISAID databases. Presently, the pipeline only supports uploads to the EpiFlu (Influenza A Virus) and EpiCoV (SARS-CoV-2) databases. Before uploading, the submitter needs to:

Have a GISAID account. To sign up, visit the GISAID Platform.
Request a client-ID for the EpiFlu or EpiCoV database in order to use its CLI tool. The CLI uses the client-ID, along with the username and password, to authenticate with the database before making a submission. To obtain a client-ID, email [email protected]. Important note: submitters who would like to upload a “test” submission first, to familiarize themselves with the process before making a real submission, should additionally request a test client-ID.
Download the EpiFlu or EpiCoV CLI from the GISAID platform and store it in the destination of your choice before performing a batch upload.

Here is a quick look at where to store the downloaded GISAID CLI package.

Requirement Files

Before performing a batch submission with seqsender, submitters must make sure the required files (such as config.yaml, metadata.csv, sequence.fasta, raw reads, etc.) are prepared and stored in a submission directory of their choice.
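
As a quick pre-flight check, a short script can confirm that the expected files are present before a submission is attempted. This is a minimal sketch, not part of seqsender: the file names are the examples mentioned above, and the function name and directory path are hypothetical.

```python
from pathlib import Path

# File names taken from the examples in the text; a real submission
# may use different names or additional files such as raw reads.
EXAMPLE_REQUIRED = ("config.yaml", "metadata.csv", "sequence.fasta")


def missing_files(submission_dir, required=EXAMPLE_REQUIRED):
    """Return the required file names that are absent from submission_dir."""
    base = Path(submission_dir)
    return [name for name in required if not (base / name).is_file()]


# Hypothetical directory; an empty list means every file was found.
absent = missing_files("./my_submission")
if absent:
    print("Missing:", ", ".join(absent))
else:
    print("All example files are present.")
```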

To prep for FLU submissions, select one of the databases below to get started:

BioSample
SRA
Genbank
GISAID

To prep for COV submissions, select one of the databases below to get started:

BioSample
SRA
Genbank
GISAID

Quick Start

Code Attributions

Dakota Howard and Reina Chau wrote the majority of the code base, with input and testing from colleagues.

Public Domain Standard Notice

This repository constitutes a work of the United States Government and is not subject to domestic copyright protection under 17 USC § 105. This repository is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication. All contributions to this repository will be released under the CC0 dedication. By submitting a pull request you are agreeing to comply with this waiver of copyright interest.

License Standard Notice

The repository utilizes code licensed under the terms of the Apache Software License and therefore is licensed under ASL v2 or later.

This source code in this repository is free: you can redistribute it and/or modify it under the terms of the Apache Software License version 2, or (at your option) any later version.

This source code in this repository is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY, without even the implied warranty of MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. See the Apache Software License for more details.

You should have received a copy of the Apache Software License along with this program. If not, see http://www.apache.org/licenses/LICENSE-2.0.html

The source code forked from other open source projects will inherit its license.

Privacy Standard Notice

This repository contains only non-sensitive, publicly available data and information. All material and community participation is covered by the Disclaimer and Code of Conduct. For more information about CDC’s privacy policy, please visit http://www.cdc.gov/other/privacy.html.

Contributing Standard Notice

Anyone is encouraged to contribute to the repository by forking and submitting a pull request. (If you are new to GitHub, you might start with a basic tutorial.) By contributing to this project, you grant a world-wide, royalty-free, perpetual, irrevocable, non-exclusive, transferable license to all users under the terms of the Apache Software License v2 or later.

All comments, messages, pull requests, and other submissions received through CDC including this GitHub page may be subject to applicable federal law, including but not limited to the Federal Records Act, and may be archived. Learn more at http://www.cdc.gov/other/privacy.html.

Records Management Standard Notice

This repository is not a source of government records, but is a copy to increase collaboration and collaborative potential. All government records will be published through the CDC web site.

Additional Standard Notices

Please refer to CDC’s Template Repository for more information about contributing to this repository, public domain notices and disclaimers, and code of conduct.

seqsender's People

yil479 osnofianresearch phl-2 nbx0 mamtagiri leebrian ammaraziz concentricbyginkgo erikwolfsohn dthoward96

seqsender's Issues

Step by step docs with screenshots

Complete step-by-step document on the website with screenshots of how to prepare files and run SS. The users will mainly be non-CLI, so the instructions need to be overly verbose. Write them for FLU use specifically for now.

Document for setup (define the lab’s configs)
Documents for operating
i. Putting metadata into template excel
ii. SS commands and verifying successful submission.

Use Flu Submission Org

As part of the enhancement to support flu submissions, please use a different group to identify submissions made by flu programs (as opposed to sc2 groups using CDC_OAMD).

Create Automated Test/Validation Scripts

Creating automated tests for script updates will drastically reduce testing time. Testing is tedious when it depends on databases processing files to determine whether the changes work correctly.

Environment testing:

Automatic docker deployment to GHCR
The master branch now automatically builds and deploys the latest Docker image to the GitHub Container Registry (GHCR).
Automatic docker testing on pull request
Automatic python/mamba versioning testing

Mypy Testing:

All Functions mypy testing
Automated github-action mypy testing

Pydantic Testing:

All functions pydantic testing
Automated github-action pydantic testing

Add Influenza submission

cc: @leebrian @rchau88 @kristinelacek

automatic biosample package validation

Biosample packages can be incorporated into seqsender using the biosample attribute xml. It lists off the requirements for every biosample package and can be automated to regularly collect the most up to date xml to also store locally on github. This will allow users who want to use seqsender to instantly use their desired package without having to adjust the main_config file to support their organisms.

GitHub action to weekly scrape the biosample attribute xml to keep the repo up to date with latest attribute.
Seqsender function to pull down latest biosample attribute from web.
Seqsender biosample update to incorporate biosample package xml in addition to required fields in main config.

can't check status

I'm having another issue, which I suspect may be the result of my limited experience with docker...

When I try and check the status of a submission, I get this

docker exec -it seqsender bash seqsender-kickoff check_submission_status --submission_dir ./ --submission_name sub_1 --organism FLU

Error: Submission name: sub_1 for FLU production-data is not found in the submission log file.

Error: Either a submission has not been made or an entry has been moved.

but the submission log file does exist right where I'm pointing, and contains that submission name

cat submission_log.csv
Submission_Name,Organism,Database,Submission_Position,Submission_Type,Submission_Date,Submission_Status,Submission_Directory,Config_File,Table2asn,GFF_File,Update_Date
sub_1,FLU,BIOSAMPLE,1,Production,2024-04-02,pending;submitted,/data,/data/sub_1/config.yaml,False,,2024-04-02
sub_1,FLU,SRA,1,Production,2024-04-02,pending;submitted,/data,/data/sub_1/config.yaml,False,,2024-04-02
sub_1,FLU,GENBANK,1,Production,2024-04-02,---;---,/data,/data/sub_1/config.yaml,False,,2024-04-02
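
One way to inspect such a log by hand is to filter it on the fields the error message mentions. The sketch below uses only the standard library and the column names from the log above. The matching rule (name, organism, and submission type) and the function name are assumptions made for illustration and are not taken from seqsender's code.

```python
import csv
import io

LOG = """Submission_Name,Organism,Database,Submission_Position,Submission_Type,Submission_Date,Submission_Status,Submission_Directory,Config_File,Table2asn,GFF_File,Update_Date
sub_1,FLU,BIOSAMPLE,1,Production,2024-04-02,pending;submitted,/data,/data/sub_1/config.yaml,False,,2024-04-02
sub_1,FLU,SRA,1,Production,2024-04-02,pending;submitted,/data,/data/sub_1/config.yaml,False,,2024-04-02
sub_1,FLU,GENBANK,1,Production,2024-04-02,---;---,/data,/data/sub_1/config.yaml,False,,2024-04-02
"""


def find_entries(log_text, name, organism, submission_type):
    """Return log rows matching name, organism, and submission type."""
    reader = csv.DictReader(io.StringIO(log_text))
    return [
        row
        for row in reader
        if row["Submission_Name"] == name
        and row["Organism"].upper() == organism.upper()
        and row["Submission_Type"].lower() == submission_type.lower()
    ]


matches = find_entries(LOG, "sub_1", "FLU", "Production")
for row in matches:
    print(row["Database"], row["Submission_Status"], row["Submission_Directory"])
```

Printing the Submission_Directory column is deliberate: every row in the log records /data, while the command above passed ./ as the submission directory, so that is one difference worth ruling out.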

xml submissions to NCBI do not require 'org_id'

Is your feature request related to a problem? Please describe.
xml submissions to NCBI do not require the 'org_id' field/attribute defined in *config.yaml. The SRA team confirmed that this number is for internal use and submitters only need to include their center/group name in the xml.

Describe the solution you'd like
Remove org_id from the config and associated parsing. The recommendation from SRA team is to simplify the organization block as:

Describe alternatives you've considered
Using the dummy value 12345 from the template appears to cause no adverse events.

Additional context
Add any other context or screenshots about the feature request here.

Update readme with doc style info

To help users and potential collaborators, please update the readme to follow the flu doc style and explicitly call out how you would like folks to submit issues, test issues, and commit (e.g., dev branches vs main vs releases).

After your changes, a user or collaborator will be able to understand how you work on and make changes to seqsender, and how to watch for in-progress work, completed work, and new releases.

Pandera metadata validation

User metadata can be validated using pandera validation. This will allow for metadata field requirements based on a schema file. This will allow seqsender to automatically detect issues with user metadata. Pandera is a better alternative than hardcoding metadata field validation into seqsender because a schema can be created for each virus with multiple valid options for each field. This can then be easily expanded to include restrictions for other viruses or to roll back restrictions.
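
The schema-driven idea can be sketched without any dependencies. The block below is plain Python standing in for pandera, not pandera's API: it only illustrates keeping the allowed values per organism in data rather than in hardcoded checks. The field names, allowed values, and function name are all hypothetical.

```python
# Hypothetical schema: organism -> field name -> set of allowed values.
SCHEMAS = {
    "FLU": {
        "organism": {"Influenza A virus"},
        "host": {"Homo sapiens", "Sus scrofa"},
    },
    "COV": {
        "organism": {"Severe acute respiratory syndrome coronavirus 2"},
        "host": {"Homo sapiens"},
    },
}


def validate_rows(rows, organism):
    """Return (row_index, field, problem) tuples for every schema violation."""
    schema = SCHEMAS[organism]
    problems = []
    for index, row in enumerate(rows):
        for field, allowed in schema.items():
            if field not in row:
                problems.append((index, field, "missing field"))
            elif row[field] not in allowed:
                problems.append((index, field, f"unexpected value {row[field]!r}"))
    return problems


rows = [
    {"organism": "Influenza A virus", "host": "Homo sapiens"},
    {"organism": "Influenza A virus", "host": "cat"},
]
for problem in validate_rows(rows, "FLU"):
    print(problem)
```

Supporting another virus, or relaxing a restriction, then means editing an entry in the schema rather than changing the validation code.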

Pandera metadata schema files:

Issue templates to update

Additional templates

New contributor
Suggest new virus to support

Update existing templates

Bug Report
Feature Request
Maintenance

Info to add to existing templates

Virus information
Instrument information
Database information

Default to GISAID sub first and attach EPI_SEQUENCE_ID to GenBank

Set up the default behavior of submitting to all repos to be GISAID -> NCBI, in order to first capture the EPI_SEQUENCE_ID assigned by GISAID and then add it to the GenBank Structured Comment field, like:

https://www.ncbi.nlm.nih.gov/nuccore/OP845736.1/

COMMENT     ##FluData-START##
            EPI_ISOLATE_ID   :: EPI_ISL_9631596
            NAME             :: A/Wisconsin/01/2022
            TYPE             :: H3
            Segment_name     :: HA
            HOST_GENDER      :: F
            PASSAGE          :: Original
            LOCATION         :: United States / Wisconsin
            COLLECT_DATE     :: 11-Jan-2022
            SPECIMEN_ID      :: 22VR005083 ORIGINAL
            SENDER_LAB       :: Wisconsin State Laboratory of Hygiene
            SEQLAB_SAMPLE_ID :: 3030725183
            EPI_SEQUENCE_ID  :: EPI1981213
            ##FluData-END##
        
FEATURES             Location/Qualifiers
     source          1..1737
                     /organism="Influenza A virus"
                     /mol_type="viral cRNA"
                     /strain="A/Wisconsin/01/2022"
                     /serotype="H3N2"
                     /host="Homo sapiens"
                     /db_xref="taxon:11320"
                     /segment="4"
                     /country="USA: Wisconsin"
                     /collection_date="11-Jan-2022"
                     /note="passage details:Original"
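
For reference, the comment layout shown in the record above can be reproduced with a few lines of Python. This sketch only mimics the flat-file display (aligned keys, ` :: ` separator, START and END markers). It is not how seqsender or table2asn builds structured comments, and the function name is hypothetical.

```python
def format_structured_comment(prefix, fields):
    """Render key/value pairs in the flat-file COMMENT layout shown above."""
    if not fields:
        raise ValueError("at least one field is required")
    width = max(len(key) for key in fields)  # align the '::' separators
    indent = " " * 12  # continuation lines sit under the COMMENT text
    lines = [f"COMMENT     ##{prefix}-START##"]
    lines += [f"{indent}{key.ljust(width)} :: {value}" for key, value in fields.items()]
    lines.append(f"{indent}##{prefix}-END##")
    return "\n".join(lines)


# Values copied from the example record above.
print(format_structured_comment(
    "FluData",
    {"EPI_ISOLATE_ID": "EPI_ISL_9631596", "EPI_SEQUENCE_ID": "EPI1981213"},
))
```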

Functions to add to next version

check-submissions: Allow an option to update a single submission instead of updating all submissions in the log.
other organism: Allow any organism to be used with the flag "other".
"Other" is currently added. It doesn't allow GISAID submissions, since it cannot easily be determined which epiCLI to use. The "other" option is a default generic submission template that allows any organism to be submitted to NCBI.
gisaid: Make GISAID submission a toggle option that can be used with any organism. This will allow automated upload to NCBI but manual submission to GISAID when a CLI option doesn't exist.
In order to support turning off GISAID submissions for other organisms, support has been added for all epiCLIs. This allows any epiCLI to be connected to seqsender and used without issue. New epiCLIs can be easily added by adding their information to the internal metadata config file.
- EpiArbo (Arbovirus)
- EpiPox (Monkeypox)
Table2asn submission validation.
Table2asn submissions are made via email, which prevents seqsender from validating that a submission is correct before submitting it. By parsing the Table2asn validation file, seqsender can now detect issues, prevent the submission, and notify the user of what to correct.
User config file validation.
Config files store user information and determine how seqsender processes submissions. Previous checks only validated that the file loaded correctly as YAML. Config files are now checked against schema files, which can determine whether a user filled out the file incorrectly. If so, seqsender reports an error message directing the user to the incorrect field and telling them what to change it to.

Adding these two features to this update as they are needed to resolve issues with incorporating Enteric BioSample attributes

Can someone document how to obtain NCBI username and password

Is your feature request related to a problem? Please describe.
The seqsender submission configuration has fields for NCBI username and password. But NCBI accounts are created and logged into via third-party systems (Google etc.). How do we obtain an NCBI username/password pair for submission to e.g. BioSample and SRA?

Describe the solution you'd like
Please provide documentation, or a pointer to documentation, detailing how to obtain an NCBI username/password pair that would enable us to make submissions to BioSample, SRA, and GenBank.

Describe alternatives you've considered
NA

Additional context
We would need the credentials to work for submissions to Biosample and SRA.

Errors in script for production submission

Hello, I am trying to do our first production submission and this is the output I am getting.

Traceback (most recent call last):
  File "seqsender.py", line 616, in <module>
    main()
  File "seqsender.py", line 591, in main
    submission_preparation.process_submission(args.unique_name, args.fasta, args.metadata, os.path.join(os.path.dirname(os.path.abspath(__file__)), "config_files", args.config))
  File "/root/miniconda3/seqsender/submission_preparation.py", line 493, in process_submission
    main_df = merge(fasta_file, metadata_file)
  File "/root/miniconda3/seqsender/submission_preparation.py", line 228, in merge
    main_df = fasta.merge(metadata, left_on = "fasta_name_orig", right_on = config_dict["general"]["fasta_sample_name_col"], how = "left")
  File "/root/miniconda3/envs/seqsender/lib/python3.6/site-packages/pandas/core/frame.py", line 7963, in merge
    validate=validate,
  File "/root/miniconda3/envs/seqsender/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 87, in merge
    validate=validate,
  File "/root/miniconda3/envs/seqsender/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 652, in __init__
    ) = self._get_merge_keys()
  File "/root/miniconda3/envs/seqsender/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 1005, in _get_merge_keys
    right_keys.append(right._get_label_or_level_values(rk))
  File "/root/miniconda3/envs/seqsender/lib/python3.6/site-packages/pandas/core/generic.py", line 1563, in _get_label_or_level_values
    raise KeyError(key)
KeyError: 'specimen_collector_sample_id'
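
The KeyError is raised while pandas resolves the right_on key of the merge, which indicates that the metadata table has no column named specimen_collector_sample_id, the value configured under general / fasta_sample_name_col. A minimal way to confirm this before running a submission is to compare the metadata header against the configured column name. The sketch below uses only the standard library; the function name, the sample header, and the tab delimiter are assumptions.

```python
import csv
import io


def missing_columns(metadata_text, required, delimiter="\t"):
    """Return the required column names that are absent from the header row."""
    header = next(csv.reader(io.StringIO(metadata_text), delimiter=delimiter), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# Hypothetical metadata header that lacks the configured column.
metadata = "sample_name\tcollection_date\nA/Example/01/2022\t2022-01-11\n"
print(missing_columns(metadata, ["specimen_collector_sample_id"]))
```

If the column is reported missing, either rename the metadata column or point fasta_sample_name_col at the column that actually holds the sample names.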

How to correctly add Source Modifiers for GenBank submission?

Hi,

Thanks for creating and maintaining this very useful program.

I am trying to include patient metadata for the NCBI submission part of the process. Similar to gisaid which allows gender, passage etc, NCBI allows source modifiers such as sex.

I am not 100% sure how to include such information for the NCBI part of the metadata in the config file.

genbank_src_metadata:
  column_names:
    isolate: genbank_name
    host: host
    country: location
    isolation-source: isolation_source

Let's say I wanted to include the NCBI source modifier Sex (assuming I have a column in my metadata called gender). Would I add the following:

genbank_src_metadata:
  column_names:
  ....
  Sex: gender

Is that correct?

A related question, for the structured data section eg:

COMMENT     ##Assembly-Data-START##
            Assembly Method       :: CLC Genomics
            Sequencing Technology :: PacBio Sequel II
            ##Assembly-Data-END##

How can I add more information than this?

My main aim is to match all the metadata that is required in gisaid to ncbi submission.

Thanks,

Ammar

Clarify GISAID CLI usage

Clarify in the documentation that users need to go to gisaid.org and download the CLI-API, where to install it so that SeqSender can use it, and how to request their token.
Refactor seqsender's internal code to import the GISAID CLI Python package and use its commands rather than the raw API. This will ensure that changes made by GISAID are inherited by SeqSender more smoothly.

cc: @leebrian @rchau88 @kristinelacek

User defined date specificity

Is your feature request related to a problem? Please describe.
Hard-coded date formatting at YYYY-MM-DD creates challenges for generalizing to other microbial pathogens, the majority of which must be submitted to BioSample with only YYYY or YYYY-MM to ensure privacy.

Describe the solution you'd like
Consider letting users define their own date specificity, perhaps in the *_config.yaml. That would preserve the current default requirements for SC2 and Flu. A more advanced option would be to allow setting a minimum (or maximum) specificity rather than a fixed requirement for flexibility during submission (e.g. [1] YYYY or YYYY-MM, but not YYYY-MM-DD vs [2] YYYY-MM or YYYY-MM-DD, but not YYYY).
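
The minimum/maximum idea can be sketched as follows. This is an illustration only, not seqsender code: the function names and the `minimum` and `maximum` parameters are hypothetical stand-ins for whatever keys the config would expose.

```python
import re

# Specificity levels, ordered from least to most specific.
PATTERNS = {
    "YYYY": re.compile(r"\d{4}"),
    "YYYY-MM": re.compile(r"\d{4}-(0[1-9]|1[0-2])"),
    "YYYY-MM-DD": re.compile(r"\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])"),
}
LEVELS = list(PATTERNS)


def date_specificity(value):
    """Return the level a date string matches, or None if it matches none.

    Only the shape is checked; calendar validity (e.g. 2024-02-31) is not.
    """
    for level, pattern in PATTERNS.items():
        if pattern.fullmatch(value):
            return level
    return None


def is_allowed(value, minimum="YYYY", maximum="YYYY-MM-DD"):
    """Check that a date's specificity lies between minimum and maximum."""
    level = date_specificity(value)
    if level is None:
        return False
    return LEVELS.index(minimum) <= LEVELS.index(level) <= LEVELS.index(maximum)


# Case [1]: YYYY or YYYY-MM allowed, full dates rejected.
print(is_allowed("2024-04", maximum="YYYY-MM"))
print(is_allowed("2024-04-02", maximum="YYYY-MM"))
```

Setting both bounds to YYYY-MM-DD reproduces a fixed full-date requirement, so existing defaults could be expressed in the same scheme.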

Describe alternatives you've considered
Maybe this also gets covered in your solution to #43 but BioSample itself does not impose strict requirements for date specificity and it's generally up to submitters to determine what is appropriate.

Additional context
Add any other context or screenshots about the feature request here.

NCBI & GISAID account creation docs

It would be very helpful to have step-by-step instructions with screenshots in the documentation for creating an account, highlighting which fields will be needed later in seqsender.

FTP error: [Errno 2] No such file or directory: '/test_input/test_fastq_R1.fastq'

I am getting this error when I try to run the test submit command:
python seqsender.py submit --unique_name test_submission --config test_config.yaml --metadata /root/miniconda3/seqsender/test_input/test_metadata.tsv --fasta /root/miniconda3/seqsender/test_input/test_fasta.fasta --test

The output first says:
Processing test_submission.
Processing Files.
Creating GISAID files.
Creating Genbank files.
Creating BioSample files.
Creating SRA files.
test_submission complete.

Submission report exists pulling down.
Submitting to SRA/BioSample.

Followed by the FTP error described above. Does anything appear to be missing? Thank you

There is no submission directory at /home/user1/FLU_reporting

This may be an extremely simple question, I'm probably overlooking something to do with docker, but I'm trying a test submission via Docker, the command (and response) being

docker exec -it seqsender bash seqsender-kickoff submit
--organism FLU
-bsn
--submission_dir /home/user1/FLU_reporting/
--submission_name test_sub
--config_file config.yaml
--metadata_file metadata.csv
--fasta_file RI-M04353-2024013.fasta
--test

There is no submission directory at /home/user1/FLU_reporting

the path is correct, my submission folder "test_sub" is indeed at /home/user1/FLU_reporting/

I'm at a loss