benoitmorel / pargenes
A massively parallel tool for model selection and tree inference on thousands of genes
License: GNU General Public License v3.0
The criterion is currently hardcoded as AICc, but some users might want to select models using another criterion.
Rename multi-raxml to pargenes, unless someone finds a better name before
Because right now it's annoying for debugging
... instead of hardcoding it
Add an option to run multiple tree searches per MSA (with distinct starting trees), and pick the best-scoring ML tree.
Default value is 1.
Example:
python multi-raxml.py <blabla> --raxml-starting-trees 20
When running the script install_debug.sh on haswell
In the 1024-core muscle run, all the bootstrap files are correctly generated, but after the concatenation step a lot of them are missing.
It might be because of this weird line (line 60) in mr_bootstraps.py:
with concurrent.futures.ThreadPoolExecutor(max_workers = int(cores)) as e:
Too many threads are started, but this can run only on one single node...
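A minimal sketch of one possible fix (the `run_bootstraps` helper name is hypothetical): cap the worker count at the number of cores actually available on the local node rather than at the scheduler-wide allocation.

```python
import concurrent.futures
import os

def run_bootstraps(fn, tasks, cores):
    # "cores" may be the scheduler-wide allocation (e.g. 1024), but the
    # thread pool only runs on a single node: cap it at the local core count.
    local_cores = os.cpu_count() or 1
    workers = min(int(cores), local_cores)
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as e:
        return list(e.map(fn, tasks))
```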
And this makes travis unhappy
This might be tricky, but would help a lot, for analysis like 1kite.
To decide the number of cores to allocate to a raxml run, we currently use the number of cores for minimal response time reported by raxml (possibly divided by 2).
When the MSAs get large (for instance in the 1kite analysis), this value is irrelevant.
Solution: implement several different policies (minimum response time, maximum throughput, something in between, etc.). Set the most common one (i.e. for gene tree analyses) as the default, and add a parameter for advanced users.
Experiment with these different policies!
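A possible shape for the policy parameter; the core-count formulas below are made-up placeholders for illustration, not the real raxml-based estimates.

```python
# Hypothetical pluggable core-allocation policies.
def min_response_time(sites, max_cores):
    # Many cores per job: each individual job finishes as fast as possible.
    return min(max_cores, max(1, sites // 200))

def max_throughput(sites, max_cores):
    # Few cores per job: more jobs run concurrently.
    return min(max_cores, max(1, sites // 1000))

POLICIES = {
    "min-response-time": min_response_time,
    "max-throughput": max_throughput,
}

def cores_for_job(policy, sites, max_cores=16):
    return POLICIES[policy](sites, max_cores)
```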
The debug mode should be run after a job crash, when the user does not know which job crashed.
In this mode, all the jobs that were running when the crash occurred are run again.
But this time, each MPI rank should run a job without MPI through a system call, to detect crashes without crashing itself.
Since this is a very inefficient use of the available resources, we might allow ParGenes to allocate the idle cores to start other pending jobs until all the potentially buggy jobs are processed.
The current way of specifying the list of input MSAs is to provide a directory containing all the MSAs.
But the user might not have all the MSAs in the same directory, and should not have to copy them (or to do any other trick) to use ParGenes.
One idea (to discuss) is to provide an alternative to the -a <input_directory> argument: a file listing all the paths to the MSAs to process. For each path, the user would also specify a unique id, to be able to locate the per-MSA results.
The file format would be, for instance:
id1 path1
id2 path2
id3 path3
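A sketch of a parser for such a file (the `parse_msa_list` name is hypothetical), enforcing unique ids:

```python
def parse_msa_list(path):
    # Each non-empty line: "<unique_id> <path_to_msa>"
    msas = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            msa_id, msa_path = line.split(None, 1)
            if msa_id in msas:
                raise ValueError("duplicate MSA id: " + msa_id)
            msas[msa_id] = msa_path
    return msas
```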
Allow the user to add specific options for each MSA.
This should be done through a file like that:
msa_filename1.fasta --template raxml
msa_filename2.fasta --mp
msa_filename3.fasta --ml
These options will be concatenated to the general modeltest options.
If an MSA file is not listed but is present in the MSA directory, it will still be processed, without any additional modeltest option.
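A sketch of how such a file could be parsed and merged with the general modeltest options (function names are hypothetical):

```python
def parse_per_msa_options(path):
    # Each line: "<msa_filename> <extra modeltest options...>"
    options = {}
    with open(path) as f:
        for line in f:
            tokens = line.split()
            if tokens:
                options[tokens[0]] = tokens[1:]
    return options

def modeltest_options(msa, general_options, per_msa):
    # MSAs absent from the file only get the general options.
    return general_options + per_msa.get(msa, [])
```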
To parse the arguments and the MSAs and recommend some number of cores
Do not use a hardcoded number of positional arguments. Use --argname arg
instead.
Rationale:
Modeltest needs to know whether we are dealing with aa or nt data. The user can already pass modeltest arguments to specify aa or nt, but most users won't think about it and will get a weird error message during the modeltest runs.
Instead, we should force the user to specify aa or nt from multi-raxml, and multi-raxml will handle the rest.
Todo:
add a -d --datatype option and feed modeltest with the parameter
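A minimal argparse sketch of the option; the exact flag modeltest expects for the datatype is an assumption to be checked against its documentation.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-d", "--datatype", choices=["nt", "aa"], required=True,
                    help="data type of the input MSAs, forwarded to modeltest")
args = parser.parse_args(["--datatype", "aa"])

# The value would then be appended to every modeltest command line,
# e.g. something like ["modeltest-ng", "-d", args.datatype, ...].
```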
Happened when running pargenes on the split 1kite dataset, after parsing modeltest outputs.
Some ideas of what's wrong here: https://stackoverflow.com/questions/30325351/ioerror-errno-5-input-output-error-while-using-smbus-for-analog-reading-thr
Failing run on phobos at /home/morelbt/github/1kite_experiments/1kite_lg4m_lg4x_pargenes_run
Because the OpenMP solution:
- spawns more threads than needed (one OpenMP thread + one system-call thread)
Try either with:
Allow the user to add specific options for each MSA.
This should be done through a file like that:
msa_filename1.fasta --model GTR
msa_filename2.fasta --model JC
msa_filename3.fasta --model GTR --site-repeats ON
These options will be concatenated to the general raxml options.
If an MSA file is not listed but is present in the MSA directory, it will still be processed, without any additional raxml option (we might change this behaviour later on).
The crash occurred when analyzing the muscle dataset.
Files in haswell: /hits/basement/sco/morel/github/phd_experiments/results/multi-raxml/muscle/bootstraps_100_modeltest_-s_10_-p_10/haswell_256/run_0
/hits/basement/sco/morel/github/multi-raxml/raxml-ng/src/Tree.cpp:226: NameIdMap Tree::tip_ids() const: Assertion `!result.empty()' failed.
When there is a crash, the python script continues executing until the end. So the checkpoint is set to the final value, although most of the jobs were not processed.
When an MSA is invalid and will be skipped, a warning is produced. When 1000 MSAs are invalid, 1000 warnings are produced, and the top level logs are flooded.
Instead, throw one unique warning with the number of invalid MSAs, and list them in a separate file.
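One way to implement this (hypothetical helper; `log` here is just a list standing in for the real logger):

```python
def report_invalid_msas(invalid_msas, log, listing_path):
    # Emit a single summary warning and write the full list to a side file.
    if not invalid_msas:
        return
    log.append("[Warning] Skipped %d invalid MSAs (full list in %s)"
               % (len(invalid_msas), listing_path))
    with open(listing_path, "w") as f:
        f.write("\n".join(invalid_msas) + "\n")
```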
Some "small" operations are executed sequentially by the script. When the number of cores increases, this starts becoming a bottleneck.
Operations to parallelize:
ParGenes currently executes all the modeltest runs in parallel, then synchronizes, and then executes all the raxml runs. This might be bad for load balance.
Todo: do everything within the same parallel run, with a dependency system.
In particular, do not parse all the raxml logs after a checkpoint. Instead, store a summary of the MSA characteristics and reload it after a checkpoint.
Check that we do not waste time in other parts of the pipeline.
This is important since some people have shown interest in the checkpoint feature.
Because Modeltest-ng does not parallelize over sites, we chose to assign a fixed number of cores (16) per modeltest job. On unpartitioned gene families, this is the best we can do.
But when dealing with partitioned datasets (as I do with 1kite), we might want to use more cores.
Todo: add an option --cores-per-modeltest-job
to customize the number of cores per modeltest job.
Add some logs, especially at the startup.
I just had a crash at the beginning of a 1kite run but I have no idea where, because the scheduler is not verbose enough...
Also add some information about the time of the day.
Because some clusters still live in the past
Add a file with the number of times each model is picked:
Example:
JTT+G4+F 20
JTT-DCMUT 10
would mean that two different models were selected for the 30 input MSAs.
It might be very interesting for some post-analysis.
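Producing such a file is a simple counting pass over the per-MSA results (a sketch; names are hypothetical):

```python
from collections import Counter

def write_model_counts(selected_models, path):
    # selected_models: one best-fit model string per input MSA.
    counts = Counter(selected_models)
    with open(path, "w") as f:
        for model, count in counts.most_common():
            f.write("%s %d\n" % (model, count))
```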
For users running ParGenes on their laptop
Improve it a bit.
Document it!
We currently duplicate fasta/phy files for each raxml run.
We should also try to use the RBA files (but we need an additional raxml step after modeltest to treat models correctly).
It does not produce the bootstrap trees...
RBA files already store the model. The raxml parameter --model is then ignored.
The benefit of loading RBA files is very small for ParGenes anyway.
Several ideas:
Lots of simple misuses of multi-raxml lead to unreadable python exceptions.
Todo:
Most of the work is already done, but the syntax does not allow it yet.
Also test it!
By default, ParGenes assigns twice as many cores to the 3% of jobs with the highest number of taxa. This allows us to get rid of most of the "tails" (the few cores that finish much later than the others) and improves the load balance.
But in some cases, especially when the number of jobs is low, this heuristic might slow down the whole execution. Advanced users should have the possibility of disabling it (and it will be useful for comparing runs with and without the heuristic!).
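A sketch of the heuristic with an opt-out switch; the 3% threshold and the 2x factor come from the description above, everything else (names, signature) is hypothetical.

```python
def cores_per_job(taxa_per_job, base_cores, boost_fraction=0.03, boost=True):
    # taxa_per_job: {job_name: number_of_taxa}
    # With boost enabled, the top boost_fraction of jobs by taxon count
    # get twice as many cores; with boost disabled, allocation is uniform.
    ranked = sorted(taxa_per_job, key=taxa_per_job.get, reverse=True)
    n_boosted = int(len(ranked) * boost_fraction) if boost else 0
    return {name: base_cores * (2 if i < n_boosted else 1)
            for i, name in enumerate(ranked)}
```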
This parameter would complement the --fasta-dir
option, to treat only a subset of the files in this directory.
Motivation: the user might want to filter out some files that are not MSA files from the analysis.
When the user wants to infer the model with modeltest, they currently need to specify an input model, because raxml needs it for the initial MSA parsing (although it is only used to decide between aa and nt).
Todo: in this case, add a default hardcoded model to satisfy raxml parsing step.
Example:
python multi-raxml.py <blabla> --modeltest-options "--ml" --raxml-options "--search"
We should decide later on if this is a good practice, or if we should use files instead.
When monitoring a ParGenes run, it would be super useful to know which jobs are currently being run.
One way would be to create a file per job and to delete it when it finishes.
This might also be useful to implement #14
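A sketch of the marker-file idea (hypothetical helper): create a file when a job starts and delete it when it finishes, so the files present in the directory are exactly the running jobs, and files left behind after a crash identify the jobs to re-run.

```python
import os
from contextlib import contextmanager

@contextmanager
def running_marker(running_dir, job_name):
    # The marker file exists exactly while the job is running.
    os.makedirs(running_dir, exist_ok=True)
    marker = os.path.join(running_dir, job_name)
    open(marker, "w").close()
    try:
        yield marker
    finally:
        if os.path.exists(marker):
            os.remove(marker)
```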