probic / mgems



mGEMS: Genomic epidemiology with mixed samples

License: MIT License

C++ 79.57% CMake 20.43%

plate-sweep genomic-epidemiology metagenomics taxonomic-binning high-throughput-sequencing c-plus-plus


mgems's Issues

Support reading Themisto files in directly

Currently, alignment files from Themisto have to be converted to kallisto format which is cumbersome. They should be read in directly.

`Error: stoul exiting` if input pseudoalignments from paired reads have unequal numbers of reads

If pseudoalignments from paired reads are used as input and the two files, for whatever reason, contain differing numbers of alignments (reads), mGEMS will exit with the error `Error: stoul exiting`.

It might be a good idea to verify the number of reads and alignments before running, or otherwise give a more informative error message when this happens.
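
A minimal sketch of such a check, assuming one pseudoalignment per non-empty line. The function names `count_alignments` and `check_paired_counts` are hypothetical and not part of mGEMS:

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>

// Count the non-empty lines in a stream; one line is assumed to
// correspond to one pseudoaligned read (an assumption of this sketch).
std::size_t count_alignments(std::istream &in) {
    std::size_t n = 0;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty()) ++n;
    }
    return n;
}

// Throw a descriptive error when the two mates disagree on the read count.
void check_paired_counts(std::istream &first, std::istream &second) {
    const std::size_t n1 = count_alignments(first);
    const std::size_t n2 = count_alignments(second);
    if (n1 != n2) {
        throw std::runtime_error(
            "paired pseudoalignments contain different numbers of reads (" +
            std::to_string(n1) + " vs " + std::to_string(n2) + ")");
    }
}
```

Running a check like this before parsing would require reading the inputs twice (or buffering them), so it may be preferable to count while parsing and report the mismatch at the end.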

Incorporate the steps performed with shell commands

msweep-assembly makes use of multiple commands that have to be performed in the command line. To avoid confusion and issues with different shells, these should be performed by the executable whenever possible.

Create better error messages

The program will crash with cryptic error messages if anything fails. The messages should be made more useful for public release.

Add option to write unassigned reads

It is currently possible for some input reads to remain unassigned. mGEMS should have an option (off by default) to write these reads to a separate bin.

`Error: stold exiting` if input file(s) are compressed with unsupported compression type

If any of the input files to mGEMS are compressed with a compression algorithm that is unsupported by the bxzstr installation, running mGEMS will fail with the error `Error: stold exiting`.

A more informative error message containing at least the responsible filepath and possibly also the supported compression types should be added to help in locating the source of the error.
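
One possible shape for this, sketched under the assumption that the failing call is a plain `std::stold` on a field read from the input. The helper `parse_value` and the wording of the message are hypothetical:

```cpp
#include <stdexcept>
#include <string>

// Parse a floating-point field. On failure, report which file and which
// field were responsible instead of the bare message from std::stold.
long double parse_value(const std::string &field, const std::string &path) {
    try {
        return std::stold(field);
    } catch (const std::exception &) {
        throw std::runtime_error(
            "could not parse '" + field + "' as a number in file '" + path +
            "'; is the file compressed in an unsupported format?");
    }
}
```

The list of supported compression types depends on how bxzstr was built, so it is left out of the message in this sketch.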

Filter mSWEEP input

At the moment I think mGEMS generates fastq files for all groups from the mSWEEP output. Would it be possible to add a filter so that only those groups with a prevalence above, say, 1% would be considered?

I'm currently doing this manually using the --groups option but it would be great if it could be made simpler.
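
For illustration, the filter could look roughly like the sketch below. It assumes the abundances file consists of `#`-prefixed comment lines followed by whitespace-separated "group abundance" rows; the function name `groups_above` is hypothetical:

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Return the names of groups whose relative abundance exceeds `threshold`.
// Assumed layout: '#'-prefixed comment lines, then "<group> <abundance>" rows.
std::vector<std::string> groups_above(std::istream &abundances, double threshold) {
    std::vector<std::string> kept;
    std::string line;
    while (std::getline(abundances, line)) {
        if (line.empty() || line[0] == '#') continue;
        std::istringstream fields(line);
        std::string group;
        double abundance = 0.0;
        if (fields >> group >> abundance && abundance > threshold) {
            kept.push_back(group);
        }
    }
    return kept;
}
```

With a threshold of 0.01, the returned names are the groups that would otherwise have to be passed by hand through `--groups`.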

Create the 'mGEMS' executable

All commands should be merged under a single 'mGEMS' executable. Proposed behaviour:

./mGEMS — run everything.
./mGEMS read — process the alignment files.
./mGEMS assign — assign the reads to the references.
./mGEMS filter — create the samples.
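
The dispatch for the proposed behaviour could be sketched as follows. The `Mode` enum and `parse_mode` are hypothetical names, and treating a leading option (an argument starting with `-`) as "run everything" is an assumption of this sketch:

```cpp
#include <string>

enum class Mode { All, Read, Assign, Filter, Unknown };

// Map the first command-line argument to a mode. No argument, or an
// argument that looks like an option, is taken to mean "run everything".
Mode parse_mode(int argc, const char *const argv[]) {
    if (argc < 2 || argv[1][0] == '-') return Mode::All;
    const std::string command = argv[1];
    if (command == "read") return Mode::Read;
    if (command == "assign") return Mode::Assign;
    if (command == "filter") return Mode::Filter;
    return Mode::Unknown;
}
```

Returning `Mode::Unknown` rather than throwing lets the caller print the usage text for a mistyped subcommand.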

Add kallisto support

The current version (v0.2.0) of mGEMS does not support using kallisto as the pseudoaligner because it is not possible to extract the read assignments to equivalence classes from the standard kallisto output.

kallisto pseudo provides a --pseudobam flag to write a .sam file which contains the relevant information for mGEMS. However, this file is massive and impractical for many applications, so adding support for this cumbersome format is relatively low-priority.

Add option to write the raw read assignments to reference groups table

mGEMS lacks the ability to output a table of the raw assignments of reads to reference groups. While it is possible to extract the table manually from the current output (either the lists of assigned reads for each group, or the extracted fastq files), this approach is cumbersome.

For some applications it would be useful to have the read assignments in table form, and since writing such a table is easy, this should be included as an option in the next version of the software.
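
A rough sketch of the writer, assuming the assignments are held in memory as one boolean row per read. The function name, the `read_id` column, and the tab-separated 0/1 layout are all hypothetical choices:

```cpp
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Write one row per read and one 0/1 column per reference group.
// `assigned[i][j]` is true when read i was assigned to group j; `.at()`
// throws if a row has fewer entries than there are groups.
void write_assignment_table(const std::vector<std::string> &groups,
                            const std::vector<std::vector<bool>> &assigned,
                            std::ostream &out) {
    out << "read_id";
    for (const std::string &group : groups) out << '\t' << group;
    out << '\n';
    for (std::size_t i = 0; i < assigned.size(); ++i) {
        out << i;
        for (std::size_t j = 0; j < groups.size(); ++j) {
            out << '\t' << (assigned[i].at(j) ? 1 : 0);
        }
        out << '\n';
    }
}
```

Taking a `std::ostream` keeps the sketch independent of whether the table goes to a file, a compressed stream, or standard output.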

`Error: basic_string` when using mSWEEP abundances with bootstrap iterations in them as input

mGEMS will crash with the error message Error: basic_string if the program is called with an input abundances file (-a option) that contains bootstrap iterations from mSWEEP version 1.4.0 or earlier (abundances file created using the --iters option when running mSWEEP). The crash is caused by an extra empty line at the end of the abundances file, which does not exist in abundances files that have been produced without the mSWEEP bootstrapping option.

Since the issue is in mSWEEP, a workaround for successfully running mGEMS when bootstrapped abundances are used is to delete the empty line from the end of the input files.

However, mGEMS should produce more informative error messages if the input files contain nonsense or are in the wrong format. Closing this issue requires adding better error messages for such cases.
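
As a sketch of what more tolerant and more informative parsing could look like: the reader below skips blank lines (including a trailing one) and names the line number of any malformed row. It assumes a simple "group value" row layout and only reads the first value per row, which may not match files with bootstrap iterations; `read_abundances` is a hypothetical name:

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Read "<group> <abundance>" rows, skipping blank lines and '#' comments.
// A malformed row raises an error naming the offending line number.
std::vector<std::pair<std::string, double>> read_abundances(std::istream &in) {
    std::vector<std::pair<std::string, double>> rows;
    std::string line;
    std::size_t line_number = 0;
    while (std::getline(in, line)) {
        ++line_number;
        if (line.find_first_not_of(" \t\r") == std::string::npos) continue;
        if (line[0] == '#') continue;
        std::istringstream fields(line);
        std::string group;
        double value = 0.0;
        if (!(fields >> group >> value)) {
            throw std::runtime_error("malformed abundances row on line " +
                                     std::to_string(line_number) + ": '" +
                                     line + "'");
        }
        rows.emplace_back(group, value);
    }
    return rows;
}
```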

Improve usage documentation

Several points should be addressed in the documentation in order to make mGEMS easier to adopt.

Add more information about how to use (and install) themisto & mSWEEP since they are essential parts of the pipeline.
Create a new tutorial on how to prepare a reference database and use pubMLST or PopPUNK to assign the reference sequences to lineages.

Create a conda recipe for easy installation

(Bio)conda has become something of a standard way to install bioinformatics tools and pipelines. mGEMS should be installable via conda to make the tool available to less tech-savvy users.

Add option to exclude multi-group reads

mGEMS should have an option to write out only the reads that are assigned to a single lineage.
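
The selection itself is simple once the assignments are available as one boolean row per read, which is an assumption of this sketch; `single_group_reads` is a hypothetical name:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Return the indices of reads assigned to exactly one group.
// `assigned[i][j]` is true when read i was assigned to group j.
std::vector<std::size_t> single_group_reads(
    const std::vector<std::vector<bool>> &assigned) {
    std::vector<std::size_t> kept;
    for (std::size_t i = 0; i < assigned.size(); ++i) {
        if (std::count(assigned[i].begin(), assigned[i].end(), true) == 1) {
            kept.push_back(i);
        }
    }
    return kept;
}
```

Reads with no assignment at all are excluded as well, since their count is zero rather than one.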

Improve `mGEMS extract`

Currently, running `mGEMS extract` always names the output files with the suffixes "_1.fastq.gz", "_2.fastq.gz", "_3.fastq.gz", etc., depending on the number and order of the input files. It would be useful to add an option to change the name (or print to cout) to enable usage with calls like the following:

mGEMS extract --bins input.bin -r reads_1.fastq.gz -o outdir &
mGEMS extract --bins input.bin -r reads_2.fastq.gz -o outdir &
wait

This may be faster than extracting both reads with a single command, as compressing the reads sometimes takes more time than actually writing them. The current implementation does not allow the above calls to work, because both will attempt to write to "input_1.fastq.gz".
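
One way to avoid the collision, shown only as a sketch, is to derive the output name from the reads file rather than from its position on the command line. The naming scheme and the helpers `base_name`, `stem`, and `output_path` are hypothetical, and paths are assumed to use `/` as the separator:

```cpp
#include <cstddef>
#include <string>

// Strip any leading directories from a path.
std::string base_name(const std::string &path) {
    const std::size_t slash = path.find_last_of('/');
    return slash == std::string::npos ? path : path.substr(slash + 1);
}

// Strip everything from the first '.' onwards ("input.bin" -> "input").
std::string stem(const std::string &path) {
    const std::string name = base_name(path);
    return name.substr(0, name.find('.'));
}

// Build "<outdir>/<bin stem>_<reads file name>", so that two concurrent
// calls on different read files do not target the same output path.
std::string output_path(const std::string &outdir, const std::string &bin,
                        const std::string &reads) {
    return outdir + "/" + stem(bin) + "_" + base_name(reads);
}
```

Under this scheme the two calls above would write to "outdir/input_reads_1.fastq.gz" and "outdir/input_reads_2.fastq.gz" respectively.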