probic / mgems
mGEMS: Genomic epidemiology with mixed samples
License: MIT License
Currently, alignment files from Themisto have to be converted to kallisto format, which is cumbersome. mGEMS should read them in directly.
If pseudoalignments from paired reads are used as input and the two files, for whatever reason, contain a differing number of alignments (reads), mGEMS will exit with the error "Error: stoul exiting".
It might be a good idea to verify the number of reads and alignments before running, or otherwise give a more informative error message when this happens.
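Until such a check exists in mGEMS itself, the verification could be done before running with a small script. The following is a minimal sketch, assuming that each alignment file holds one line per read and is either plain text or gzip-compressed; the function names `count_lines` and `check_paired_alignments` are hypothetical and not part of mGEMS.

```python
import gzip
from pathlib import Path


def count_lines(path):
    """Count lines in a plain-text or gzip-compressed file."""
    # Assumption: a ".gz" suffix means gzip; anything else is plain text.
    opener = gzip.open if Path(path).suffix == ".gz" else open
    with opener(path, "rt") as handle:
        return sum(1 for _ in handle)


def check_paired_alignments(path_1, path_2):
    """Raise ValueError if the two alignment files differ in line count."""
    n_1, n_2 = count_lines(path_1), count_lines(path_2)
    if n_1 != n_2:
        raise ValueError(
            f"Paired alignments differ in size: {path_1} has {n_1} lines, "
            f"{path_2} has {n_2} lines"
        )
    return n_1
```

Calling `check_paired_alignments` on the two alignment files before invoking mGEMS reports which file is short instead of failing later with the stoul error.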
msweep-assembly relies on multiple commands that have to be run on the command line. To avoid confusion and issues with different shells, these should be performed by the executable whenever possible.
The program will crash with cryptic error messages if anything fails. The messages should be made more useful for public release.
It is currently possible for some input reads to remain unassigned. mGEMS should have an option (off by default) to write these reads to a separate bin.
If any of the input files to mGEMS are compressed with a compression algorithm that is unsupported by the bxzstr installation, running mGEMS will fail with the error
Error: stold exiting
A more informative error message containing at least the responsible filepath and possibly also the supported compression types should be added to help in locating the source of the error.
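As a stopgap, the compression type of each input file can be checked before running by looking at the file's leading magic bytes. This is a minimal sketch, not how mGEMS or bxzstr detect formats internally; the names `detect_compression` and `check_input` are hypothetical, and the default `supported` tuple is only a placeholder that should be adjusted to match the local bxzstr build.

```python
# Leading magic bytes of common compression formats.
MAGIC_NUMBERS = {
    b"\x1f\x8b": "gzip",
    b"BZh": "bzip2",
    b"\xfd7zXZ\x00": "xz",
    b"\x28\xb5\x2f\xfd": "zstd",
}


def detect_compression(path):
    """Return the compression format name, or "none" if no magic matches."""
    with open(path, "rb") as handle:
        head = handle.read(6)
    for magic, name in MAGIC_NUMBERS.items():
        if head.startswith(magic):
            return name
    return "none"


def check_input(path, supported=("none", "gzip")):
    """Raise ValueError naming the file if its compression is unsupported."""
    kind = detect_compression(path)
    if kind not in supported:
        raise ValueError(
            f"{path}: compression '{kind}' is not supported "
            f"(supported: {', '.join(sorted(supported))})"
        )
    return kind
```

The error raised by `check_input` contains the responsible filepath and the supported types, which is the information the stold error lacks.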
At the moment I think mGEMS generates fastq files for all groups from the mSWEEP output. Would it be possible to add a filter so that only groups with a prevalence above, say, 1% are considered?
I'm currently doing this manually using the --groups option, but it would be great if it could be made simpler.
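The manual step can be scripted in the meantime. Here is a rough sketch, assuming the mSWEEP abundances file has comment lines starting with "#" followed by tab-separated lines of group name and relative abundance; the function name `groups_above_threshold` is hypothetical.

```python
def groups_above_threshold(lines, threshold=0.01):
    """Return names of groups whose abundance exceeds the threshold.

    Assumes each data line is "<group>\t<abundance>" and that lines
    starting with "#" are comments.
    """
    groups = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, abundance = line.split("\t")[:2]
        if float(abundance) > threshold:
            groups.append(name)
    return groups
```

The returned names can then be joined and passed to the --groups option. Whether --groups expects a comma-separated list is also an assumption here and should be checked against the mGEMS help text.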
All commands should be merged under a single 'mGEMS' executable. Proposed behaviour:
The current version (v0.2.0) of mGEMS does not support using kallisto as the pseudoaligner, because it is not possible to extract the read assignments to equivalence classes from the standard kallisto output.
kallisto pseudo provides a --pseudobam flag to write a .sam file which contains the relevant information for mGEMS. However, this file is massive and impractical for many applications, so adding support for this cumbersome format is relatively low-priority.
mGEMS lacks the ability to output a table of the raw read assignments to reference groups. While it is possible to manually extract the table from the current output (either the lists of assigned reads for each group, or the extracted fastq files), this approach is cumbersome.
For some applications it would be useful to have the read assignments in a table form, and since writing such a table is an easy task to do, this should be included in the next version of the software as an option.
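The manual extraction described above amounts to pivoting the per-group read lists into a read-by-group table. This is a minimal sketch that assumes the lists have already been loaded into a dictionary mapping each group name to its assigned read identifiers; `assignment_table` is a hypothetical helper and does not reflect any mGEMS output format.

```python
def assignment_table(assignments):
    """Build a read-by-group 0/1 table from per-group read lists.

    assignments: dict mapping group name -> iterable of read identifiers.
    Returns a list of rows; the first row is the header.
    """
    groups = sorted(assignments)
    members = {group: set(assignments[group]) for group in groups}
    reads = sorted(set().union(*members.values()))
    rows = [["read"] + groups]
    for read in reads:
        # 1 if the read is assigned to the group, 0 otherwise.
        rows.append([read] + [int(read in members[group]) for group in groups])
    return rows
```

Because a read can be assigned to more than one group, a row may contain several 1s.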
mGEMS will crash with the error message "Error: basic_string" if the program is called with an input abundances file (-a option) that contains bootstrap iterations from mSWEEP version 1.4.0 or earlier (an abundances file created using the --iters option when running mSWEEP). The crash is caused by an extra empty line at the end of the abundances file, which does not exist in abundances files that have been produced without the mSWEEP bootstrapping option.
Since the issue is in mSWEEP, a workaround for successfully running mGEMS when bootstrapped abundances are used is to delete the empty line from the end of the input files.
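The workaround can be automated with a few lines of Python. This is a sketch that rewrites the file in place and assumes a plain-text (uncompressed) abundances file; `strip_trailing_empty_lines` is a hypothetical helper name.

```python
def strip_trailing_empty_lines(path):
    """Remove empty lines from the end of a text file, in place.

    Returns the number of lines that remain.
    """
    with open(path, "r") as handle:
        lines = handle.readlines()
    # Drop trailing lines that contain only whitespace.
    while lines and lines[-1].strip() == "":
        lines.pop()
    with open(path, "w") as handle:
        handle.writelines(lines)
    return len(lines)
```

Since the file is overwritten, it may be worth keeping a copy of the original abundances file before running the function.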
However, mGEMS should produce more informative error messages if the input files contain nonsense or are in the wrong format. Closing this issue requires adding better error messages for such cases.
Several points should be addressed in the documentation in order to make mGEMS easier to adopt.
(Bio)conda has become something of a standard way to easily install bioinformatics tools and pipelines. mGEMS should be installable via conda to make the tool available to less tech-savvy users.
mGEMS should have an option to write out only the reads that are assigned to a single lineage.
Currently, running mGEMS extract always names the files with the suffix "_1.fastq.gz", "_2.fastq.gz", "_3.fastq.gz", etc., depending on the number and order of the input files. It would be useful to add an option to change the name (or print to cout) to enable usage with calls like the following:
mGEMS extract --bins input.bin -r reads_1.fastq.gz -o outdir &
mGEMS extract --bins input.bin -r reads_2.fastq.gz -o outdir &
wait
This may be faster than extracting both reads with a single command, as compressing the reads sometimes takes more time than actually writing them. The current implementation does not allow the above calls to work, because both will attempt to write to "input_1.fastq.gz".