
umbra's Issues

Signal cleanup

Signal handling should be reorganized so that USR1 and USR2 control verbosity; see the sketch after the list below.

  • INT, TERM: clean shutdown
  • HUP: reload
  • USR1: increase verbosity
  • USR2: decrease verbosity
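
A minimal sketch of that mapping, assuming a module-level logger and a processor object with shutdown() and reload() methods (those method names are hypothetical):

    import logging
    import signal

    LOGGER = logging.getLogger("umbra")
    LEVELS = [logging.ERROR, logging.WARNING, logging.INFO, logging.DEBUG]

    def _shift_verbosity(delta, _state=[1]):
        """Move the logger one step toward DEBUG (+1) or ERROR (-1)."""
        _state[0] = max(0, min(len(LEVELS) - 1, _state[0] + delta))
        LOGGER.setLevel(LEVELS[_state[0]])

    def install_signal_handlers(processor):
        """Wire up the proposed signal-to-action mapping."""
        signal.signal(signal.SIGINT, lambda s, f: processor.shutdown())   # clean shutdown
        signal.signal(signal.SIGTERM, lambda s, f: processor.shutdown())  # clean shutdown
        signal.signal(signal.SIGHUP, lambda s, f: processor.reload())     # reload
        signal.signal(signal.SIGUSR1, lambda s, f: _shift_verbosity(+1))  # more verbose
        signal.signal(signal.SIGUSR2, lambda s, f: _shift_verbosity(-1))  # less verbose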

Email addresses formatted incorrectly in email task

ProjectData.send_email incorrectly puts the angle brackets around the display name rather than the mailbox part of each email address. (This passes testing because TestProjectDataEmail contains the same mistake.) This causes smtplib to misinterpret the name as an address at localhost, or to reject the message entirely if no cc_addrs is supplied.
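
For reference, the standard library's email.utils.formataddr already wraps only the mailbox part in angle brackets; a sketch with made-up names:

    from email.utils import formataddr

    # Produces 'Jane Doe <jane@example.org>' rather than '<Jane Doe> jane@example.org',
    # which is what smtplib and mail servers expect.
    to_header = formataddr(("Jane Doe", "jane@example.org"))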

Config cleanup

A few configuration-related changes that should happen (a rough sketch of the loading order follows the list):

  • the /etc/ config should be automatic, rather than a default for the command-line argument.
  • the action config should be loaded after the /etc/ one, to take a higher priority
  • installation should run strictly off of the existing config paths, to keep the /etc/ handling simple
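
A rough sketch of the intended loading order, assuming dict-based configs and a hypothetical load_config helper (the /etc/ path shown is a placeholder):

    from pathlib import Path

    ETC_CONFIG = Path("/etc/umbra.yml")  # placeholder system-wide config path

    def layered_config(action_config_path, load_config):
        """Load the /etc/ config automatically, then overlay the action config."""
        config = {}
        if ETC_CONFIG.exists():
            config.update(load_config(ETC_CONFIG))
        if action_config_path:
            # Loaded last so its values take priority over the /etc/ ones.
            config.update(load_config(action_config_path))
        return config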

Non-unicode characters crash load_csv

illumina.util.load_csv assumes UTF-8, but if the file happens to contain, say, an ISO/IEC 8859-1 0xCA (Ê) byte for some reason, it crashes with:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 5534: invalid continuation byte

What's the "right" behavior here? Intentionally throw an exception for this? Allow these to be automatically stripped out with a warning?

Box auth setup should show access error for web server log

If scrape_log_for_code encounters a problem while streaming the web server log for authentication details, the error is never shown; the code just hangs waiting for the expected text. One way to handle this is to use subprocess.Popen(..., stderr=subprocess.PIPE) and watch for errors on stderr. Possibly there's a better way to catch the error from the process directly.
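
A sketch of the Popen-based idea (the command and error handling here are only illustrative):

    import subprocess

    def stream_log(cmd):
        """Stream lines from a log-tailing command, surfacing stderr instead of hanging."""
        proc = subprocess.Popen(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
        for line in proc.stdout:
            yield line
        proc.wait()
        if proc.returncode != 0:
            # e.g. permission denied on the log file; previously this was silent
            raise RuntimeError(proc.stderr.read().strip())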

BoxUploader should check file size first

If a file larger than the allowed maximum is given to BoxUploader.upload, it will still try to upload it but will eventually fail when the file size limit is reached. The file size should be checked beforehand so we don't bother attempting an upload that is bound to fail, and so we can give a more informative error.

The maximum is available from the API via the max_upload_size attribute of the current user information.
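
A sketch of the pre-check, assuming a boxsdk Client; the exact way max_upload_size is fetched may differ between SDK versions:

    import os

    def check_upload_size(client, path):
        """Refuse early if a file exceeds the account's Box upload limit."""
        user = client.user().get(fields=["max_upload_size"])
        size = os.path.getsize(path)
        if size > user.max_upload_size:
            raise ValueError(
                "%s is %d bytes; the Box limit is %d bytes"
                % (path, size, user.max_upload_size))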

Inferring expected files from sample names is just awful

Since there isn't a straightforward mapping of sample names to file names, different Illumina programs do that name conversion differently, and duplicate sample names are allowed, we need to switch to using sample numbers and existing files on disk rather than trying to figure out what the files should be named. This can be done internally in the illumina package to start with, with minimal visible changes, before switching the project metadata to be sample-number-based. This change is a first step toward addressing #50.

Custom tasks should be supported via configuration file

The framework for install-specific custom processing tasks is already in place via the introspection in the task package. To finish this off, an entry in the configuration file will need to specify an additional filesystem path containing custom tasks.
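
A sketch of loading task modules from a configured directory (the config key and directory layout are hypothetical):

    import importlib.util
    from pathlib import Path

    def load_custom_tasks(task_dir):
        """Import every .py file in task_dir so its task classes become visible
        to the existing introspection in the task package."""
        modules = []
        for path in sorted(Path(task_dir).glob("*.py")):
            spec = importlib.util.spec_from_file_location(path.stem, path)
            module = importlib.util.module_from_spec(spec)
            spec.loader.exec_module(module)
            modules.append(module)
        return modules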

Duplicate sample name case should be handled

The sequencers do allow a sample sheet to specify the same name for multiple samples, and currently this results in all but the first sample of a given name being "lost" during processing. How should this be handled? Based on Illumina's approach it should probably be supported, weird as it is. Maybe we should use sample number as the basis for tracking samples instead.

Manual processing should allow timeout

If a project with manual processing specified is never actually finished manually, it'll take up one worker indefinitely. This potentially leads to all possible workers clogged with orphaned tasks in the long run. One simple safeguard would be a timeout (potentially quite long, like a week) before giving up, failing processing, and moving on.
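
A sketch of that safeguard, with hypothetical attribute and method names on the project object:

    import time

    MANUAL_TIMEOUT = 7 * 24 * 3600  # one week, presumably configurable

    def check_manual_timeout(proj):
        """Fail a manually-processed project that has been waiting too long."""
        waited = time.time() - proj.manual_start_time  # hypothetical attribute
        if waited > MANUAL_TIMEOUT:
            proj.fail("manual processing timed out after %d seconds" % waited)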

Run failure should be available via Run class and the report

If a run fails it never shows up as completed in IlluminaProcessor or the report or anywhere. This information should be available through a simple property of the Run class, and reflected in a clear way in the reporting.

Illumina doesn't summarize this status clearly anywhere in the run directory as far as I can tell, though. The best we may be able to do is parse the error status out of the files in the Logs directory.

Processing tasks should run multithreaded

Currently the ProjectData class can accept and use a threads attribute (for assembly) but this is not supported by IlluminaProcessor. nthreads_per_project should be a main configuration option.

Checkpoint.txt file in Alignment directories is not parsed correctly

Related to #91, illumina.util.load_checkpoint only works as expected in the most common case: a completed FASTQGeneration. Looking at Checkpoint.txt for interrupted/failed cases, I see that in general there's both an integer and a keyword (which just happens to be an empty string in the common case). load_checkpoint should be updated to parse out both the integer and the keyword in all cases.

Migrate from Travis CI for automated testing

Travis CI is no longer a viable option for automated testing under their new pricing model. From their announcement:

We will be offering an allotment of OSS minutes that will be reviewed and allocated on a case by case basis. Should you want to apply for these credits please open a request with Travis CI support stating that you’d like to be considered for the OSS allotment.

Exceptions during alignment intake should trigger admin email

Currently if an exception is raised during task processing, IlluminaProcessor._worker will catch it and send an email to the address defined in to_addrs_on_error in the config. The same should happen if an exception reaches _proc_new_alignment, possibly also including what's currently caught in ProjectData.from_alignment for any problems when loading metadata. (Possibly just remove the except clause there and handle within IlluminaProcessor.)
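
A sketch of the idea, with a hypothetical mailer helper standing in for whatever _worker currently uses:

    import traceback

    def proc_new_alignment_safe(processor, aln):
        """Wrap alignment intake so any exception is emailed to the admins."""
        try:
            processor._proc_new_alignment(aln)
        except Exception:
            processor.mailer.mail(  # hypothetical mailer API
                to_addrs=processor.conf["to_addrs_on_error"],
                subject="umbra: error during alignment intake",
                msg_body=traceback.format_exc())
            raise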

Avoid spurious warnings for newly-arriving run data

If run data is written to the monitored path directly or copied on a rolling basis and an IlluminaProcessor.refresh() occurs while a write is incomplete, a log message of "skipped unrecognized run" or "Alignment not recognized" is triggered. This is harmless since the same location is re-checked on the next cycle, but it should still be avoided.

Requiring a minimum file-change timestamp age before trying to load run data might be a reasonable approach; see the sketch below. These timestamps are filesystem-level, so they are unaffected by how the files were created or copied (e.g., rsync -a, cp -p, etc.).
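
A sketch of the minimum-age check (the threshold and helper name are made up):

    import time
    from pathlib import Path

    MIN_AGE = 300  # seconds; would come from configuration

    def old_enough(run_dir, min_age=MIN_AGE):
        """True if nothing under run_dir has changed within the last min_age seconds."""
        newest = max(
            (p.stat().st_ctime for p in Path(run_dir).rglob("*") if p.is_file()),
            default=0)
        return (time.time() - newest) >= min_age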

Invalid runs not actually skipped in IlluminaProcessor.refresh

In processor.IlluminaProcessor._run_setup ValueError is caught and logged for invalid runs, but the invalid run object is still returned. None should be returned instead so that the processor won't try to refresh or query anything from the run directory.

Missing FASTQ case in project setup crashes processor

In ProjectData.from_alignment, sample_paths may be left as None if a FileNotFoundError is raised by alignment.sample_paths(), but when that None is assigned to the ProjectData instance's sample_paths, the setter tries to access attributes on it and crashes. This should be treated as a special case of samples missing relative to what the experiment metadata predicted, raising ProjectError instead of the AttributeError that currently crashes everything.
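
A sketch of the intended handling inside from_alignment (assuming the existing ProjectError class; the import path is a guess):

    from umbra.project import ProjectError  # assumed import path

    def sample_paths_or_fail(alignment, proj_name):
        """Translate missing FASTQ files into a ProjectError instead of a later crash."""
        try:
            return alignment.sample_paths()
        except FileNotFoundError as exc:
            raise ProjectError(
                "FASTQ files missing for project %s: %s" % (proj_name, exc)) from exc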

load_sample_sheet fails if unicode byte order mark is included

If load_csv is given a text file with a Unicode byte order mark (BOM) prefix, those magic bytes are left in place in the returned data, messing up parsing later. Using an explicit encoding like open(..., encoding="utf-8-sig") seems to strip the BOM when present and behaves like plain UTF-8 when it's absent.
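
The fix is probably just the encoding argument; a self-contained sketch:

    import csv

    def load_csv_bom_safe(path):
        """Read a CSV file, transparently stripping a UTF-8 BOM if one is present."""
        # "utf-8-sig" removes a leading BOM and behaves like plain UTF-8 otherwise.
        with open(path, encoding="utf-8-sig", newline="") as f_in:
            return list(csv.reader(f_in))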

Zip task does not actually perform compression

The zip archives created by ProjectData.zip call zipfile.ZipFile without specifying a compression method, and the default is ZIP_STORED, which gives no compression. This should be changed to ZIP_DEFLATED, the standard zip deflate compression.
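
The change itself should be a one-liner in the ZipFile call; a sketch with placeholder paths:

    import zipfile

    # ZIP_STORED (the default) archives without compression; ZIP_DEFLATED compresses.
    with zipfile.ZipFile("project.zip", "w", compression=zipfile.ZIP_DEFLATED) as zipper:
        zipper.write("output/contigs.fasta")  # placeholder path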

min_age setting should be applied to Alignment dirs as well as Run

The min_age setting (that ignores run directories until their ctime timestamp becomes old enough) should be applied to the alignment sub-directories inside run directories as well. Otherwise we can still get spurious warnings about missing data on the first pass over incompletely-written directories.

Failed alignments should be handled

Currently an alignment directory with incomplete output from a processing failure on the sequencer triggers a repeating "Alignment not recognized" error for the run, but only because the wrong ValueError is inadvertently caught. The Checkpoint.txt file in these cases contains both an integer and a keyword, and the keyword gets included when umbra.illumina.util.load_checkpoint tries to cast the contents to int.

In these cases Checkpoint.txt looks like:

0
Demultiplexing

instead of:

3

(So the keyword is just an empty string in the usual case, and presumably there are other integer/keyword combinations for intermediate states, but I haven't seen them.)

Instead, load_checkpoint should be updated to get both the integer and the keyword, and any error entries in CompletedJobInfo.xml should be noted.
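
A sketch of a more general load_checkpoint that returns both pieces (the return type is just a suggestion):

    def load_checkpoint(path):
        """Parse Checkpoint.txt into (step, keyword); keyword is "" when absent."""
        with open(path) as f_in:
            lines = [line.strip() for line in f_in]
        step = int(lines[0])
        keyword = lines[1] if len(lines) > 1 else ""
        return step, keyword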

The tests are atrociously inefficient and disorganized

Running python -m unittest currently takes several minutes to complete locally and upwards of 30 minutes on CircleCI, which is kind of absurd. The tests should be disentangled from one another, superfluous subclasses cleaned up, and mock objects used wherever appropriate.

ProjectData work_dir can collide for simultaneous runs under same project

The short name for a project instance associated with a single run is built from the project name, run completion date, and user contact names. I figured that would be enough because we won't have multiple runs finishing on the same day for the same project and the same people, right? Well, we do. In this case one of the instances never processes because the other is already present.

Maybe instead add the run's flow cell ID to the project work_dir attribute (see _init_work_dir_name).

Contig assembly task mis-numbers contigs

In TaskAssemble.prep_contigs_for_geneious the contig numbering is mangled by an incorrect regular expression. re.match("^NODE_([0-9])+_.*", rec.id) should be re.match("^NODE_([0-9]+)_.*", rec.id) with the capture group including all of the digits. As it currently stands the same one-digit labels are recycled over and over.
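
A quick demonstration of the difference (the contig ID is made up):

    import re

    rec_id = "NODE_12_length_3141_cov_15.9"
    # Buggy: the group captures only the last repeated digit.
    re.match(r"^NODE_([0-9])+_.*", rec_id).group(1)   # -> "2"
    # Fixed: the group captures the whole number.
    re.match(r"^NODE_([0-9]+)_.*", rec_id).group(1)   # -> "12"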

Handle sample name mismatches between metadata.csv and SampleSheet.csv

Currently it's assumed that all samples defined in metadata.csv for a given project will be present for the matched Run/Alignment, but that might not be the case, either accidentally (typo or incorrect experiment name) or intentionally (shared experiment metadata between multiple runs). This should be checked and logged appropriately: some missing names should trigger a warning, and a completely mismatched set should set the project status to failed.
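
A sketch of the check, with hypothetical argument names and status value:

    import logging

    LOGGER = logging.getLogger(__name__)

    def check_sample_names(metadata_names, sheet_names, proj):
        """Warn on partially-missing samples; fail the project on a total mismatch."""
        missing = set(metadata_names) - set(sheet_names)
        if not missing:
            return
        if missing == set(metadata_names):
            proj.status = "failed"  # hypothetical status value
            LOGGER.error("No metadata.csv samples found in SampleSheet.csv for %s", proj)
        else:
            LOGGER.warning(
                "Samples in metadata.csv but not SampleSheet.csv for %s: %s",
                proj, sorted(missing))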

File contents dumped to log for each Box upload

The Box SDK logs messages at level INFO via boxsdk.network.default_network that contain the binary contents of uploaded files in their entirety. Since we pass along everything logged at level INFO and higher, this dumps huge amounts of data into syslog with each uploaded file.

A simple workaround would be to set the level of that specific logger to something higher, like WARNING.
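
That workaround is a one-liner wherever logging is configured:

    import logging

    # Drop the per-request INFO messages (which include file contents) from the Box SDK.
    logging.getLogger("boxsdk.network.default_network").setLevel(logging.WARNING)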

Minimum versions for dependencies need to be specified

I just saw some uploads to Box fail due to a missing method and realized the installed boxsdk was too old to have it. Adding boxsdk>=2.7.1 (the version I'm using in testing) to setup.py would have prevented it. The same should be done for the other dependencies.
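
In setup.py this would look something like the excerpt below (the boxsdk pin is from this issue; the other entry is a placeholder):

    # setup.py (excerpt)
    from setuptools import setup

    setup(
        name="umbra",
        install_requires=[
            "boxsdk>=2.7.1",      # oldest version known to have the needed method
            # "biopython>=1.74",  # placeholder; pin the real dependencies similarly
        ],
    )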

Trigger email on job failure

Exceptions raised during ProjectData.process are caught and logged, but should also be emailed to a configurable list of addresses with plenty of output.
