
umbra's Issues

Signal cleanup

Signal handling should be reorganized so that USR1 and USR2 control verbosity; see the sketch after the list below.

  • INT, TERM: clean shutdown
  • HUP: reload
  • USR1: increase verbosity
  • USR2: decrease verbosity
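
A minimal sketch of that mapping, assuming a module-level logger and a processor object with shutdown() and reload() methods (those method names are hypothetical):

    import logging
    import signal

    LOGGER = logging.getLogger("umbra")
    LEVELS = [logging.ERROR, logging.WARNING, logging.INFO, logging.DEBUG]

    def _shift_verbosity(delta, _state=[1]):
        """Move the logger one step toward DEBUG (+1) or ERROR (-1)."""
        _state[0] = max(0, min(len(LEVELS) - 1, _state[0] + delta))
        LOGGER.setLevel(LEVELS[_state[0]])

    def install_signal_handlers(processor):
        """Wire up the proposed signal-to-action mapping."""
        signal.signal(signal.SIGINT, lambda s, f: processor.shutdown())   # clean shutdown
        signal.signal(signal.SIGTERM, lambda s, f: processor.shutdown())  # clean shutdown
        signal.signal(signal.SIGHUP, lambda s, f: processor.reload())     # reload
        signal.signal(signal.SIGUSR1, lambda s, f: _shift_verbosity(+1))  # more verbose
        signal.signal(signal.SIGUSR2, lambda s, f: _shift_verbosity(-1))  # less verbose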

Email addresses formatted incorrectly in email task

ProjectData.send_email incorrectly puts the angle brackets around the display name rather than the mailbox part of each email address. (This passes testing because TestProjectDataEmail contains the same mistake.) This causes smtplib to misinterpret the name as an address at localhost, or to reject the message entirely if no cc_addrs is supplied.
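
For reference, the standard library's email.utils.formataddr already wraps only the mailbox part in angle brackets; a sketch with made-up names:

    from email.utils import formataddr

    # Produces 'Jane Doe <jane@example.org>' rather than '<Jane Doe> jane@example.org',
    # which is what smtplib and mail servers expect.
    to_header = formataddr(("Jane Doe", "jane@example.org"))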

Config cleanup

A few configuration-related changes that should happen (a rough sketch of the loading order follows the list):

  • the /etc/ config should be automatic, rather than a default for the command-line argument.
  • the action config should be loaded after the /etc/ one, to take a higher priority
  • installation should run strictly off of the existing config paths, to keep the /etc/ handling simple
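
A rough sketch of the intended loading order, assuming dict-based configs and a hypothetical load_config helper (the /etc/ path shown is a placeholder):

    from pathlib import Path

    ETC_CONFIG = Path("/etc/umbra.yml")  # placeholder system-wide config path

    def layered_config(action_config_path, load_config):
        """Load the /etc/ config automatically, then overlay the action config."""
        config = {}
        if ETC_CONFIG.exists():
            config.update(load_config(ETC_CONFIG))
        if action_config_path:
            # Loaded last so its values take priority over the /etc/ ones.
            config.update(load_config(action_config_path))
        return config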

Non-unicode characters crash load_csv

illumina.util.load_csv assumes UTF-8, but if the file happens to contain, say, an ISO/IEC 8859-1 0xCA (Ê) byte for some reason, it crashes with:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 5534: invalid continuation byte

What's the "right" behavior here? Intentionally throw an exception for this? Allow these to be automatically stripped out with a warning?

Box auth setup should show access error for web server log

If scrape_log_for_code encounters a problem while streaming the web server log for authentication details, the error is never shown; the code just hangs waiting for the expected text. One way to handle this is to use subprocess.Popen(..., stderr=subprocess.PIPE) and watch for errors on stderr. Possibly there's a better way to catch the error from the process directly.
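
A sketch of the Popen-based idea (the command and error handling here are only illustrative):

    import subprocess

    def stream_log(cmd):
        """Stream lines from a log-tailing command, surfacing stderr instead of hanging."""
        proc = subprocess.Popen(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
        for line in proc.stdout:
            yield line
        proc.wait()
        if proc.returncode != 0:
            # e.g. permission denied on the log file; previously this was silent
            raise RuntimeError(proc.stderr.read().strip())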

BoxUploader should check file size first

If a file larger than the allowed maximum is given to BoxUploader.upload, it will still try to upload it but will eventually fail when the file size limit is reached. The file size should be checked beforehand so we don't bother attempting an upload that is bound to fail, and so we can give a more informative error.

The maximum is available from the API via the max_upload_size attribute of the current user information.
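
A sketch of the pre-check, assuming a boxsdk Client; the exact way max_upload_size is fetched may differ between SDK versions:

    import os

    def check_upload_size(client, path):
        """Refuse early if a file exceeds the account's Box upload limit."""
        user = client.user().get(fields=["max_upload_size"])
        size = os.path.getsize(path)
        if size > user.max_upload_size:
            raise ValueError(
                "%s is %d bytes; the Box limit is %d bytes"
                % (path, size, user.max_upload_size))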

Inferring expected files from sample names is just awful

Since there isn't a straightforward mapping of sample names to file names, different Illumina programs do that name conversion differently, and duplicate sample names are allowed, we need to switch to using sample numbers and existing files on disk rather than trying to figure out what the files should be named. This can be done internally in the illumina package to start with, with minimal visible changes, before switching the project metadata to be sample-number-based. This change is a first step toward addressing #50.

Custom tasks should be supported via configuration file

The framework for install-specific custom processing tasks is already in place via the introspection in the task package. To finish this off, an entry in the configuration file will need to specify an additional filesystem path containing custom tasks.
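
A sketch of loading task modules from a configured directory (the config key and directory layout are hypothetical):

    import importlib.util
    from pathlib import Path

    def load_custom_tasks(task_dir):
        """Import every .py file in task_dir so its task classes become visible
        to the existing introspection in the task package."""
        modules = []
        for path in sorted(Path(task_dir).glob("*.py")):
            spec = importlib.util.spec_from_file_location(path.stem, path)
            module = importlib.util.module_from_spec(spec)
            spec.loader.exec_module(module)
            modules.append(module)
        return modules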

Duplicate sample name case should be handled

The sequencers do allow a sample sheet to specify the same name for multiple samples, and currently this results in all but the first sample of a given name being "lost" during processing. How should this be handled? Based on Illumina's approach it should probably be supported, weird as it is. Maybe we should use sample number as the basis for tracking samples instead.

Manual processing should allow timeout

If a project with manual processing specified is never actually finished manually, it'll take up one worker indefinitely. This potentially leads to all possible workers clogged with orphaned tasks in the long run. One simple safeguard would be a timeout (potentially quite long, like a week) before giving up, failing processing, and moving on.
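
A sketch of that safeguard, with hypothetical attribute and method names on the project object:

    import time

    MANUAL_TIMEOUT = 7 * 24 * 3600  # one week, presumably configurable

    def check_manual_timeout(proj):
        """Fail a manually-processed project that has been waiting too long."""
        waited = time.time() - proj.manual_start_time  # hypothetical attribute
        if waited > MANUAL_TIMEOUT:
            proj.fail("manual processing timed out after %d seconds" % waited)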

Run failure should be available via Run class and the report

If a run fails it never shows up as completed in IlluminaProcessor or the report or anywhere. This information should be available through a simple property of the Run class, and reflected in a clear way in the reporting.

Illumina doesn't summarize this status clearly anywhere in the run directory as far as I can tell, though. The best we may be able to do is parse the error status out of the files in the Logs directory.

Processing tasks should run multithreaded

Currently the ProjectData class can accept and use a threads attribute (for assembly) but this is not supported by IlluminaProcessor. nthreads_per_project should be a main configuration option.

Checkpoint.txt file in Alignment directories is not parsed correctly

Related to #91, illumina.util.load_checkpoint only works as expected in the most common case: a completed FASTQGeneration. Looking at Checkpoint.txt for interrupted/failed cases, I see that in general there's both an integer and a keyword (which just happens to be an empty string in the common case). load_checkpoint should be updated to parse out both the integer and the keyword in all cases.

Migrate from Travis CI for automated testing

Travis CI is no longer a viable option for automated testing under their new pricing model. From their announcement:

We will be offering an allotment of OSS minutes that will be reviewed and allocated on a case by case basis. Should you want to apply for these credits please open a request with Travis CI support stating that you’d like to be considered for the OSS allotment.

Exceptions during alignment intake should trigger admin email

Currently if an exception is raised during task processing, IlluminaProcessor._worker will catch it and send an email to the address defined in to_addrs_on_error in the config. The same should happen if an exception reaches _proc_new_alignment, possibly also including what's currently caught in ProjectData.from_alignment for any problems when loading metadata. (Possibly just remove the except clause there and handle within IlluminaProcessor.)
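
A sketch of the idea, with a hypothetical mailer helper standing in for whatever _worker currently uses:

    import traceback

    def proc_new_alignment_safe(processor, aln):
        """Wrap alignment intake so any exception is emailed to the admins."""
        try:
            processor._proc_new_alignment(aln)
        except Exception:
            processor.mailer.mail(  # hypothetical mailer API
                to_addrs=processor.conf["to_addrs_on_error"],
                subject="umbra: error during alignment intake",
                msg_body=traceback.format_exc())
            raise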

Avoid spurious warnings for newly-arriving run data

If run data is written to the monitored path directly or copied on a rolling basis and an IlluminaProcessor.refresh() occurs while a write is incomplete, a log message of "skipped unrecognized run" or "Alignment not recognized" is triggered. This is harmless since the same location is re-checked on the next cycle, but it should still be avoided.

Requiring a minimum file-change timestamp age before trying to load run data might be a reasonable approach; see the sketch below. These timestamps are filesystem-level, so they are unaffected by how the files were created or copied (e.g., rsync -a, cp -p, etc.).
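
A sketch of the minimum-age check (the threshold and helper name are made up):

    import time
    from pathlib import Path

    MIN_AGE = 300  # seconds; would come from configuration

    def old_enough(run_dir, min_age=MIN_AGE):
        """True if nothing under run_dir has changed within the last min_age seconds."""
        newest = max(
            (p.stat().st_ctime for p in Path(run_dir).rglob("*") if p.is_file()),
            default=0)
        return (time.time() - newest) >= min_age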

Invalid runs not actually skipped in IlluminaProcessor.refresh

In processor.IlluminaProcessor._run_setup ValueError is caught and logged for invalid runs, but the invalid run object is still returned. None should be returned instead so that the processor won't try to refresh or query anything from the run directory.

Missing FASTQ case in project setup crashes processor

In ProjectData.from_alignment, sample_paths may be left as None if a FileNotFoundError is raised by alignment.sample_paths(), but when that None is assigned to the ProjectData instance's sample_paths, the setter tries to access attributes on it and crashes. This should be treated as a special case of samples missing relative to what the experiment metadata predicted, raising ProjectError instead of the AttributeError that currently crashes everything.
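
A sketch of the intended handling inside from_alignment (assuming the existing ProjectError class; the import path is a guess):

    from umbra.project import ProjectError  # assumed import path

    def sample_paths_or_fail(alignment, proj_name):
        """Translate missing FASTQ files into a ProjectError instead of a later crash."""
        try:
            return alignment.sample_paths()
        except FileNotFoundError as exc:
            raise ProjectError(
                "FASTQ files missing for project %s: %s" % (proj_name, exc)) from exc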

load_sample_sheet fails if unicode byte order mark is included

If load_csv is given a text file with a Unicode byte order mark (BOM) prefix, those magic bytes are left in place in the returned data, messing up parsing later. Using an explicit encoding like open(..., encoding="utf-8-sig") seems to strip the BOM when present and behaves like plain UTF-8 when it's absent.
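
The fix is probably just the encoding argument; a self-contained sketch:

    import csv

    def load_csv_bom_safe(path):
        """Read a CSV file, transparently stripping a UTF-8 BOM if one is present."""
        # "utf-8-sig" removes a leading BOM and behaves like plain UTF-8 otherwise.
        with open(path, encoding="utf-8-sig", newline="") as f_in:
            return list(csv.reader(f_in))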

Zip task does not actually perform compression

The zip archives created by ProjectData.zip call zipfile.ZipFile without specifying a compression method, and the default is ZIP_STORED, which gives no compression. This should be changed to ZIP_DEFLATED, the standard zip deflate compression.
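
The change itself should be a one-liner in the ZipFile call; a sketch with placeholder paths:

    import zipfile

    # ZIP_STORED (the default) archives without compression; ZIP_DEFLATED compresses.
    with zipfile.ZipFile("project.zip", "w", compression=zipfile.ZIP_DEFLATED) as zipper:
        zipper.write("output/contigs.fasta")  # placeholder path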

min_age setting should be applied to Alignment dirs as well as Run

The min_age setting (that ignores run directories until their ctime timestamp becomes old enough) should be applied to the alignment sub-directories inside run directories as well. Otherwise we can still get spurious warnings about missing data on the first pass over incompletely-written directories.

Failed alignments should be handled

Currently an alignment directory with incomplete output from a processing failure on the sequencer triggers a repeating "Alignment not recognized" error for the run, but only because the wrong ValueError is inadvertently caught. The Checkpoint.txt file in these cases contains both an integer and a keyword, and the keyword gets included when umbra.illumina.util.load_checkpoint tries to cast the contents to int.

In these cases Checkpoint.txt looks like:

0
Demultiplexing

instead of:

3

(So the keyword is just an empty string in the usual case, and presumably there are other integer/keyword combinations for intermediate states, but I haven't seen them.)

Instead, load_checkpoint should be updated to get both the integer and the keyword, and any error entries in CompletedJobInfo.xml should be noted.
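
A sketch of a more general load_checkpoint that returns both pieces (the return type is just a suggestion):

    def load_checkpoint(path):
        """Parse Checkpoint.txt into (step, keyword); keyword is "" when absent."""
        with open(path) as f_in:
            lines = [line.strip() for line in f_in]
        step = int(lines[0])
        keyword = lines[1] if len(lines) > 1 else ""
        return step, keyword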

The tests are atrociously inefficient and disorganized

Running python -m unittest currently takes several minutes to complete locally and upwards of 30 minutes on CircleCI, which is kind of absurd. The tests should be disentangled from one another, superfluous subclasses cleaned up, and mock objects used wherever appropriate.

ProjectData work_dir can collide for simultaneous runs under same project

The short name for a project instance associated with a single run is built from the project name, run completion date, and user contact names. I figured that would be enough because we won't have multiple runs finishing on the same day for the same project and the same people, right? Well, we do. In this case one of the instances never processes because the other is already present.

Maybe instead add the run's flow cell ID to the project work_dir attribute (see _init_work_dir_name).

Contig assembly task mis-numbers contigs

In TaskAssemble.prep_contigs_for_geneious the contig numbering is mangled by an incorrect regular expression. re.match("^NODE_([0-9])+_.*", rec.id) should be re.match("^NODE_([0-9]+)_.*", rec.id) with the capture group including all of the digits. As it currently stands the same one-digit labels are recycled over and over.
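
A quick demonstration of the difference (the contig ID is made up):

    import re

    rec_id = "NODE_12_length_3141_cov_15.9"
    # Buggy: the group captures only the last repeated digit.
    re.match(r"^NODE_([0-9])+_.*", rec_id).group(1)   # -> "2"
    # Fixed: the group captures the whole number.
    re.match(r"^NODE_([0-9]+)_.*", rec_id).group(1)   # -> "12"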

Handle sample name mismatches between metadata.csv and SampleSheet.csv

Currently it's assumed that all samples defined in metadata.csv for a given project will be present for the matched Run/Alignment, but that might not be the case, either accidentally (typo or incorrect experiment name) or intentionally (shared experiment metadata between multiple runs). This should be checked and logged appropriately: some missing names should trigger a warning, and a completely mismatched set should set the project status to failed.
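
A sketch of the check, with hypothetical argument names and status value:

    import logging

    LOGGER = logging.getLogger(__name__)

    def check_sample_names(metadata_names, sheet_names, proj):
        """Warn on partially-missing samples; fail the project on a total mismatch."""
        missing = set(metadata_names) - set(sheet_names)
        if not missing:
            return
        if missing == set(metadata_names):
            proj.status = "failed"  # hypothetical status value
            LOGGER.error("No metadata.csv samples found in SampleSheet.csv for %s", proj)
        else:
            LOGGER.warning(
                "Samples in metadata.csv but not SampleSheet.csv for %s: %s",
                proj, sorted(missing))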

File contents dumped to log for each Box upload

The Box SDK logs messages at level INFO via boxsdk.network.default_network that contain the binary contents of uploaded files in their entirety. Since we pass along everything logged at level INFO and higher, this dumps huge amounts of data into syslog with each uploaded file.

A simple workaround would be to set the level of that specific logger to something higher, like WARNING.
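
That workaround is a one-liner wherever logging is configured:

    import logging

    # Drop the per-request INFO messages (which include file contents) from the Box SDK.
    logging.getLogger("boxsdk.network.default_network").setLevel(logging.WARNING)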

Minimum versions for dependencies need to be specified

I just saw some uploads to Box fail due to a missing method and realized the installed boxsdk was too old to have it. Adding boxsdk>=2.7.1 (the version I'm using in testing) to setup.py would have prevented it. The same should be done for the other dependencies.
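
In setup.py this would look something like the excerpt below (the boxsdk pin is from this issue; the other entry is a placeholder):

    # setup.py (excerpt)
    from setuptools import setup

    setup(
        name="umbra",
        install_requires=[
            "boxsdk>=2.7.1",      # oldest version known to have the needed method
            # "biopython>=1.74",  # placeholder; pin the real dependencies similarly
        ],
    )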

Trigger email on job failure

Exceptions raised during ProjectData.process are caught and logged, but should also be emailed to a configurable list of addresses with plenty of output.
