lab-grid / swabseq-analysis

Turning kkovary/swabseq_aws into a containerized Flask API

License: MIT License

Dockerfile 7.17% Python 45.94% R 44.06% Shell 2.21% PowerShell 0.61%

swabseq-analysis's Issues

add fake data to the repo for demo, training, testing purposes

Use distribution based model to adjust RPP30 threshold in water control wells

At the moment, the water control wells use a fixed RPP30 threshold (>10 counts), but this will lead to a high number of control failures. Instead, we will use a threshold that is based on the distribution of RPP30 reads in the run.

The implicit assumption is that RPP30 reads come from a mixture of distributions (RPP30 present and RPP30 absent). For samples, we check whether the reads could plausibly come from the left tail of the RPP30-present distribution; for negative controls, we check whether the reads could plausibly come from the right tail of the RPP30-absent distribution.
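As a rough illustration of the idea above, here is a minimal sketch that fits a two-component Gaussian mixture to log-transformed RPP30 counts with plain EM and then applies tail checks. The function names, the log10(count + 1) transform, the Gaussian form, and the `z_max=3.0` cutoff are all assumptions made for illustration, not the model used in the pipeline.

```python
import math

def to_log(count):
    """Log-transform a read count; the +1 keeps zero counts finite."""
    return math.log10(count + 1)

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def fit_rpp30_mixture(counts, n_iter=100, min_sd=0.05):
    """Fit a 2-component Gaussian mixture to log counts (non-empty input).

    Returns (means, sds); index 0 is treated as "RPP30 absent" and
    index 1 as "RPP30 present" (assumed: the lower-mean component is absent).
    """
    xs = [to_log(c) for c in counts]
    n = len(xs)
    grand_mean = sum(xs) / n
    spread = max(math.sqrt(sum((x - grand_mean) ** 2 for x in xs) / n), min_sd)
    # Start the two components at the extremes of the data.
    means, sds, weights = [min(xs), max(xs)], [spread, spread], [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each well.
        resp = []
        for x in xs:
            p0 = weights[0] * normal_pdf(x, means[0], sds[0])
            p1 = weights[1] * normal_pdf(x, means[1], sds[1])
            total = p0 + p1
            resp.append((p0 / total, p1 / total) if total > 0 else (0.5, 0.5))
        # M-step: re-estimate weight, mean and sd of each component.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            if nk == 0:
                continue
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sds[k] = max(math.sqrt(var), min_sd)
    return means, sds

def water_control_ok(count, means, sds, z_max=3.0):
    """Pass if the count sits within the right tail of the absent component."""
    return (to_log(count) - means[0]) / sds[0] <= z_max

def sample_has_rpp30(count, means, sds, z_max=3.0):
    """Pass if the count sits within the left tail of the present component."""
    return (means[1] - to_log(count)) / sds[1] <= z_max
```

The fit would be done once per run on all RPP30 counts, after which each water control and each sample well is checked against the fitted components instead of the fixed >10 cutoff.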

version suggestion

swabseq-analysis/Dockerfile

Line 4 in 27cc1e4

ARG SERVER_VERSION=local+container

From Jamie: "I recommend not setting this here and force that the value gets passed in directly and force an error. This will ensure that the health check will have the right version."
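A complementary fail-fast check could also live on the application side. The sketch below assumes the version reaches the running process as an environment variable named `SERVER_VERSION` (the name is borrowed from the `ARG` above; whether the Dockerfile exports it that way is an assumption), and `require_server_version` is a hypothetical helper.

```python
import os

def require_server_version(env=None):
    """Return the configured version, or raise if none was passed in.

    `env` defaults to os.environ; a plain dict can be supplied for testing.
    """
    env = os.environ if env is None else env
    version = env.get("SERVER_VERSION", "").strip()
    if not version:
        # No silent fallback such as "local+container": fail loudly instead.
        raise RuntimeError("SERVER_VERSION is not set; pass it in explicitly")
    return version
```

Calling this once at startup would make a missing version stop the service rather than let the health check report a placeholder.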

Add flag to optionally use basespace auth to get sequence data or pull from local directory (user has to specify directory)

Add bcl2fastq installer to repo

https://github.com/lab-grid/swabseq-analysis/blob/main/Dockerfile#L16

This URL might change at some point and break the Docker build, so we should just keep this file in the repo.
(Suggestion from Jamie.)

Add classification for control success / control fail

The following changes should be made to the Sample Categorization plot in the qc_report:

- Add control success / control fail logic for positions A1 and B1 so that it's easy to see if there was an issue.
- Update the color scheme so that COVID positive and COVID negative classifications are more striking, along with control wells.

Switch swabseq-analysis to use `script-runner`

run bcl2fastq without compression

We could shave off ~30 seconds or so by adding the argument --no-bgzf-compression to bcl2fastq to convert bcl files to fastq files instead of fastq.gz files.

Normally fastq.gz is preferred since fastq files are so large, but since we're deleting the run files after analysis this is not an issue, and decompression takes a while.

I haven't tested this out yet but I'm curious if it improves speed.
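One way to test this would be to time both variants on the same run folder. The sketch below is a hypothetical harness: `build_bcl2fastq_command` and `time_command` are illustrative names, and only the `--runfolder-dir`, `--output-dir` and `--no-bgzf-compression` flags are taken from bcl2fastq itself. It is worth checking against the bcl2fastq documentation whether the flag disables compression entirely or only switches BGZF to plain gzip, since that changes what the timing comparison means.

```python
import subprocess
import time

def build_bcl2fastq_command(run_dir, out_dir, bgzf=True):
    """Build the bcl2fastq argument list, optionally turning BGZF off."""
    cmd = ["bcl2fastq", "--runfolder-dir", str(run_dir), "--output-dir", str(out_dir)]
    if not bgzf:
        cmd.append("--no-bgzf-compression")
    return cmd

def time_command(cmd, runner=subprocess.run):
    """Run a command and return the elapsed wall-clock time in seconds.

    `runner` is injectable so the timing logic can be exercised without
    bcl2fastq installed.
    """
    start = time.perf_counter()
    runner(cmd, check=True)
    return time.perf_counter() - start
```

Comparing `time_command(build_bcl2fastq_command(run, out))` against the `bgzf=False` variant on a representative run would show whether the saving is really in the ~30 second range.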

New arguments for pipeline

I've added two new arguments to the pipeline that I think could be useful:

--season

Here we can specify winter, spring, summer, or fall
This allows the pipeline to pull in the correct forward and reverse barcode information so that the plots in the PDF file will be correct

--debug

There are some outputs and plots that take extra time to generate and may not always be necessary.
- We can discuss if this is how we want to move forward or if we just want all analyses to be done all of the time.
If --debug TRUE, the pipeline will carry out these extra steps if there is a potential issue with the run.
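For a wrapper that validates these two arguments before handing them to the pipeline, a minimal sketch might look like the following. It assumes a Python entry point using `argparse`; the parser layout, the accepted spellings of TRUE/FALSE, and the default of `--debug FALSE` are illustrative assumptions rather than the pipeline's actual interface.

```python
import argparse

SEASONS = ("winter", "spring", "summer", "fall")

def parse_bool(text):
    """Accept R-style TRUE/FALSE (case-insensitive) for --debug."""
    value = text.strip().lower()
    if value in ("true", "t"):
        return True
    if value in ("false", "f"):
        return False
    raise argparse.ArgumentTypeError(f"expected TRUE or FALSE, got {text!r}")

def build_parser():
    parser = argparse.ArgumentParser(description="swabseq pipeline arguments (sketch)")
    # Restricting choices rejects a misspelled season before any barcodes load.
    parser.add_argument("--season", choices=SEASONS,
                        help="selects the forward and reverse barcode set")
    # Assumed default: skip the slower diagnostic outputs unless asked.
    parser.add_argument("--debug", type=parse_bool, default=False,
                        help="TRUE to generate the extra outputs and plots")
    return parser
```

For example, `build_parser().parse_args(["--season", "winter", "--debug", "TRUE"])` yields `season="winter"` and `debug=True`, while an unknown season exits with a usage error.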

Documentation around the usage of swabseq-analysis

Only return necessary files

The QA/QC PDF is necessary, and LIMS_results.csv contains the per-DNA-barcode results. The run_info.csv data is already in the PDF, and the other files are not necessary to QA/QC the run.
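A filter along these lines could sit between the pipeline output and the API response. This is only a sketch: `select_result_files` is a hypothetical helper, and since the PDF's file name is not given here, it assumes that any `.pdf` in the output is the QA/QC report.

```python
from pathlib import PurePath

# LIMS_results.csv is named in the issue; the PDF is matched by extension
# because its exact file name is an unknown here (assumption).
KEEP_NAMES = {"LIMS_results.csv"}
KEEP_SUFFIXES = {".pdf"}

def select_result_files(file_names):
    """Return only the files worth sending back, sorted by name."""
    kept = []
    for name in file_names:
        path = PurePath(name)
        if path.name in KEEP_NAMES or path.suffix.lower() in KEEP_SUFFIXES:
            kept.append(name)
    return sorted(kept)
```

Everything else, including run_info.csv, is simply left out of the returned list.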

Batch report for each run

Test if wrapping python script into R script using reticulate

At the moment there are two major scripts in the pipeline, countAmpliconsAWS.R and dict_align.py.

Currently, countAmpliconsAWS.R runs dict_align.py towards the beginning of the pipeline in order to align and count the amplicons. When dict_align.py finishes, it saves its output as results.csv, which countAmpliconsAWS.R then loads into memory for downstream analysis. This write/read step takes extra time, and it would be better to keep the results in memory for downstream analysis instead of writing them to disk and reading them back in.

The reticulate library for R provides an R interface to python that may allow us to bypass this write/read step (https://rstudio.github.io/reticulate/). I haven't used this library yet, but I'm interested in trying it out to see if it improves speed.
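On the Python side, this would mean exposing the counting step as a function that returns its result instead of writing results.csv. The sketch below is hypothetical: `count_amplicons` and its exact-match dictionary lookup are stand-ins, not the actual logic in dict_align.py. With reticulate, a function defined in a script can be made callable from R via `source_python()`, and a dict of equal-length lists like the one returned here should arrive as a named list that R can turn into a data frame.

```python
from collections import Counter

def count_amplicons(reads, amplicon_by_sequence):
    """Count reads per amplicon and return the table in memory.

    `amplicon_by_sequence` maps an expected sequence to an amplicon name;
    reads with no exact match are ignored (a simplifying assumption).
    """
    counts = Counter()
    for read in reads:
        name = amplicon_by_sequence.get(read)
        if name is not None:
            counts[name] += 1
    names = sorted(counts)
    # Column-oriented layout: one list per column, all the same length.
    return {"amplicon": names, "count": [counts[n] for n in names]}
```

Whether this is actually faster than the CSV round trip would still need to be measured, since converting large objects between Python and R has its own cost.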

Generify swabseq-analysis to be able to execute arbitrary scripts.

See: https://github.com/lab-grid/script-runner
