
pargenes's People

Contributors

benoitmorel

pargenes's Issues

Rename the project

Rename multi-raxml to pargenes, unless someone finds a better name first.

Improve logs

  • Do as in raxml-ng: the main logs should be printed to cout AND stored in a file.
  • Add some reporting files (warnings, list of filtered MSAs...)
  • Print more messages in the main log output.
  • Check that when something goes wrong, it's printed in the main log output.
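
For the Python side of the pipeline, logging to both the console and a file could look like the sketch below. The function name `make_logger`, the logger name and the message format are illustrative assumptions, not existing ParGenes code.

```python
import logging
import sys


def make_logger(log_path, name="pargenes"):
    """Return a logger that writes every message to stdout and to log_path."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False      # do not print a second time through the root logger
    logger.handlers.clear()       # avoid duplicated lines if called twice
    # The timestamp also gives the time of day of each event.
    formatter = logging.Formatter("[%(asctime)s] %(levelname)s: %(message)s")
    for handler in (logging.StreamHandler(sys.stdout), logging.FileHandler(log_path)):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```

Calling `make_logger("pargenes_logs.txt").warning("3 MSAs were filtered out")` would then print the message and append it to the file.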

Missing concatenated bootstraps files when running with a lot of cores

With the 1024-core muscle run, all the bootstrap files are generated correctly, but after the concatenation step, many are missing.

It might be because of this weird line (line 60) in mr_bootstraps.py:
with concurrent.futures.ThreadPoolExecutor(max_workers = int(cores)) as e:
Too many threads are started, but this can only run on a single node...
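
One possible direction, as a sketch only: cap the number of threads at the number of CPUs of the local node, and call `result()` on every future so that an exception in a worker is raised instead of being silently dropped. The helpers `concatenate_files` and `concatenate_all` are hypothetical names, and whether oversubscription is the actual cause of the missing files is not confirmed.

```python
import concurrent.futures
import os


def concatenate_files(input_paths, output_path):
    """Concatenate the content of input_paths into output_path."""
    with open(output_path, "w") as writer:
        for path in input_paths:
            with open(path) as reader:
                writer.write(reader.read())
    return output_path


def concatenate_all(jobs, cores):
    """Run every (input_paths, output_path) job with at most one thread per local CPU."""
    local_cpus = os.cpu_count() or 1
    workers = max(1, min(int(cores), local_cpus))
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(concatenate_files, ins, out) for ins, out in jobs]
        # result() re-raises any exception from the worker, so failures are not silent.
        return [future.result() for future in futures]
```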

Rethink the per-job cores number assignment

To decide the number of cores to allocate to a raxml run, we currently use the minimum response time value reported by raxml (maybe divided by 2).
When the MSAs get large (for instance with the 1kite analysis), this value is irrelevant.

Solution: implement several different policies (minimum response time, maximum throughput, something in between, etc.). Set the most common one (i.e. for gene tree analyses) by default, but add a parameter for advanced users.

Experiment with these different policies!

Implement a debug mode

The debug mode should be run after a job crash if the user does not know which job crashed.
In this mode, all the jobs that were running when the crash occurred are run again.
But this time, each MPI rank should run a job without MPI through a system call, to detect crashes without crashing itself.

Since this is a very inefficient use of the available resources, we might allow ParGenes to allocate the idle cores to start the other pending jobs until all the potentially buggy jobs are processed.
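
The "run through a system call and survive the crash" idea can be illustrated in Python with `subprocess`. This is only a sketch of the principle: the real scheduler is not written in Python, and `run_isolated` and `find_crashing_jobs` are hypothetical names.

```python
import subprocess


def run_isolated(command, timeout=None):
    """Run command in a child process and report how it ended, without raising."""
    try:
        completed = subprocess.run(command, capture_output=True, text=True,
                                   timeout=timeout)
    except subprocess.TimeoutExpired:
        return {"command": command, "ok": False, "reason": "timeout"}
    if completed.returncode < 0:
        # On POSIX, a negative return code means the child was killed by a signal.
        reason = "killed by signal %d" % -completed.returncode
    elif completed.returncode > 0:
        reason = "exit code %d" % completed.returncode
    else:
        reason = "success"
    return {"command": command, "ok": completed.returncode == 0, "reason": reason}


def find_crashing_jobs(commands):
    """Run the suspicious commands one by one and return the ones that failed."""
    return [report for report in map(run_isolated, commands) if not report["ok"]]
```

Because the job runs in a separate process, an assertion failure or a segmentation fault in the job only shows up as a non-zero or negative return code in the caller.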

Add a more generic way of specifying the list of input MSAs

The current way of specifying the list of input MSAs is to provide a directory containing all the MSAs.
But the user might not have all the MSAs in the same directory, and should not have to copy them (or to do any other trick) to use ParGenes.

One idea (to discuss) is to provide an alternative to the -a <input_directory> argument: a file listing the paths of all the MSAs to process. For each path, the user would also specify a unique id, so that the per-MSA results can be retrieved afterwards.

The file format would be, for instance:

id1 path1
id2 path2
id3 path3
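
A parser for such a file could be as small as the sketch below. The function name `read_msa_list` is hypothetical, and skipping blank lines and `#` comment lines is an extra assumption that the proposal does not mention.

```python
def read_msa_list(list_path):
    """Parse a file of '<id> <path>' lines into a dict {id: path}."""
    msas = {}
    with open(list_path) as reader:
        for line_number, line in enumerate(reader, start=1):
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # assumption: blank lines and comments are ignored
            fields = line.split(maxsplit=1)
            if len(fields) != 2:
                raise ValueError("%s:%d: expected '<id> <path>'"
                                 % (list_path, line_number))
            msa_id, msa_path = fields
            if msa_id in msas:
                raise ValueError("%s:%d: duplicated id '%s'"
                                 % (list_path, line_number, msa_id))
            msas[msa_id] = msa_path
    return msas
```

Rejecting duplicated ids early keeps the per-MSA result directories unambiguous.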

Implement per-MSA modeltest option file

Allow the user to add specific options for each MSA.
This should be done through a file like this:

msa_filename1.fasta --template raxml
msa_filename2.fasta --mp
msa_filename3.fasta --ml

These options will be concatenated to the general modeltest options.

If an MSA file is not specified but is present in the MSA directory, it will still be processed, without any additional modeltest option.

Travis enhancements

  • In the unit tests:
    • check that the results files are produced (bestTree, model, etc.)
    • check that the correct models are used when -m is enabled
    • add more use cases? (customized parameters)
    • test checkpoint (with a script that kills the program and runs it again with --redo)
  • compile with/without MPI and OpenMP

Add an option to differentiate aa and nt datasets

Rationale:
Modeltest needs to know whether we are dealing with aa or nt. It's already possible for the user to pass modeltest arguments that tell it whether the data is aa or nt, but most users won't think about it and will get some weird error message during the modeltest runs.
Instead, we need to force the user to specify aa or nt from multi-raxml, and multi-raxml will handle the rest.
Todo:
add a -d --datatype option and feed modeltest with the parameter
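
A minimal sketch of the option with `argparse` follows. The helper names are hypothetical, and the assumption that modeltest takes the data type as `-d nt` or `-d aa` should be checked against the modeltest version in use.

```python
import argparse


def build_parser():
    """Build a parser that forces the user to choose between nt and aa."""
    parser = argparse.ArgumentParser(prog="multi-raxml")
    parser.add_argument("-d", "--datatype", choices=["nt", "aa"], required=True,
                        help="alignment data type: nt (nucleotides) or aa (amino acids)")
    return parser


def modeltest_arguments(datatype, user_arguments=()):
    """Prepend the data type to the modeltest arguments given by the user.

    Assumption: modeltest accepts the data type through '-d <nt|aa>'.
    """
    return ["-d", datatype] + list(user_arguments)
```

With `choices` and `required=True`, a missing or misspelled data type is rejected at startup with a readable message instead of failing later inside the modeltest runs.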

Scheduler: add an implementation based on a thread pool

Because the OpenMP solution:

  • spawns more threads than needed (one OpenMP thread + one system call thread)
  • does not fit the MPI-like interface
  • introduces an OpenMP dependency

Try either with:

  • system(command + "&")
  • fork
  • pthread
  • std future / async

Implement per-MSA raxml option file

Allow the user to add specific options for each MSA.
This should be done through a file like this:

msa_filename1.fasta --model GTR
msa_filename2.fasta --model JC
msa_filename3.fasta --model GTR --site-repeats ON

These options will be concatenated to the general raxml options.

If an MSA file is not specified but is present in the MSA directory, it will still be processed, without any additional raxml option (we might change this behaviour later on).
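
Parsing such an option file and merging it with the general options could be sketched as follows. Both function names are hypothetical; `shlex` is used so that quoted option values survive, and `#` comments are tolerated as an extra assumption.

```python
import shlex


def read_per_msa_options(options_path):
    """Parse '<msa_filename> <options...>' lines into {msa_filename: [options]}."""
    per_msa = {}
    with open(options_path) as reader:
        for line in reader:
            fields = shlex.split(line, comments=True)
            if fields:  # skip blank and comment-only lines
                per_msa[fields[0]] = fields[1:]
    return per_msa


def raxml_options_for(msa_filename, general_options, per_msa):
    """Concatenate the general options with the MSA-specific ones (none if unlisted)."""
    return list(general_options) + per_msa.get(msa_filename, [])
```

An MSA that is absent from the file simply gets the general options, which matches the behaviour described above.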

Crash when building support value trees

The crash occurred when analyzing the muscle dataset.

Files in haswell: /hits/basement/sco/morel/github/phd_experiments/results/multi-raxml/muscle/bootstraps_100_modeltest_-s_10_-p_10/haswell_256/run_0

/hits/basement/sco/morel/github/multi-raxml/raxml-ng/src/Tree.cpp:226: NameIdMap Tree::tip_ids() const: Assertion `!result.empty()' failed.

Fix checkpoint when ParGenes crashes

When there is a crash, the python script continues executing until the end. So the checkpoint is set to the final value, although most of the jobs were not processed.

Do not flood the main logs with warnings

When an MSA is invalid and will be skipped, a warning is produced. When 1000 MSAs are invalid, 1000 warnings are produced, and the top level logs are flooded.

Instead, throw a single warning with the number of invalid MSAs and list them in a separate file.
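
As a sketch of this behaviour (the function name, the message wording and the `log` callback are assumptions):

```python
def report_invalid_msas(invalid_msas, report_path, log=print):
    """Write the invalid MSAs to report_path and emit one summary warning."""
    if not invalid_msas:
        return
    with open(report_path, "w") as writer:
        for msa in invalid_msas:
            writer.write(msa + "\n")
    log("[Warning] %d invalid MSAs will be skipped (listed in %s)"
        % (len(invalid_msas), report_path))
```

The main log then gets exactly one line, however many MSAs are invalid.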

Parallelize within the ParGenes python script

Some "small" operations are executed by the script sequentially. When the number of cores increases, it starts being a bottleneck.

Operations to parallelize:

  • Parsing of raxml logs to get the number of sites and taxa
  • Parsing of model-test logs to get the best models
  • Parsing of the raxml logs to get the best starting tree
  • Bootstrap trees concatenation

Do not synchronize between modeltest and raxml runs

ParGenes currently executes all the modeltest runs in parallel, then synchronizes, and then executes all the raxml runs. This might be bad for load balance.

Todo: do everything within the same parallel run, with some dependency system.

Improve checkpoint system

In particular, do not parse all the raxml logs after a checkpoint. Instead, store a summary of the MSA characteristics and reload it after a checkpoint.
Check that we do not waste time in other parts of the pipeline.
This is important since some people showed interest in the checkpoint feature.
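
Storing and reloading such a summary could look like this sketch. The JSON layout (`{msa_name: {"taxa": ..., "sites": ...}}`) and the function names are assumptions.

```python
import json
import os


def save_msa_summary(summary, summary_path):
    """Store {msa_name: {"taxa": int, "sites": int}} as JSON."""
    temporary_path = summary_path + ".tmp"
    with open(temporary_path, "w") as writer:
        json.dump(summary, writer, indent=2, sort_keys=True)
    # Write to a temporary file first, then rename it, so that a crash
    # during the write does not leave a truncated summary behind.
    os.replace(temporary_path, summary_path)


def load_msa_summary(summary_path):
    """Return the stored summary, or None if no summary was saved yet."""
    if not os.path.isfile(summary_path):
        return None
    with open(summary_path) as reader:
        return json.load(reader)
```

After a restart, the pipeline would only parse the raxml logs when `load_msa_summary` returns `None`.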

Add an option to set the number of cores per modeltest job

Because Modeltest-ng does not parallelize over the sites, we chose to assign a fixed number of cores (16) per modeltest job. On unpartitioned gene families, this is the best we can do.

But when dealing with unpartitioned datasets (as I do with 1kite), we might want to use more cores.

Todo: add an option --cores-per-modeltest-job to customize the number of cores per modeltest job.

Add more logs in the scheduler

Add some logs, especially at the startup.
I just had a crash at the beginning of a 1kite run but I have no idea where, because the scheduler is not verbose enough...
Also add some information about the time of the day.

Modeltest: add a summary file

Add a file with the number of times each model is picked:
Example:

JTT+G4+F 20
JTT-DCMUT 10

would mean that two different models were selected for the 30 input MSAs.
It might be very interesting for some post-analysis.
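
Given a mapping from each MSA to its best model, the summary is a few lines with `collections.Counter`. The function name and the input dictionary are assumptions about how the results would be collected.

```python
from collections import Counter


def write_model_summary(best_models, summary_path):
    """Write '<model> <count>' lines, most frequent model first.

    best_models: dict {msa_name: best_model_name}
    """
    counts = Counter(best_models.values())
    with open(summary_path, "w") as writer:
        for model, count in counts.most_common():
            writer.write("%s %d\n" % (model, count))
    return counts
```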

Do not duplicate phy files more than needed

We currently duplicate fasta/phy files for each raxml run.
We should also try to use the rba files (but we need some additional raxml step after modeltest to treat models correctly).

Add many more checks in multi-raxml

Lots of simple misuses of multi-raxml lead to unreadable python exceptions.
Todo:

  • Check that the output directory does not exist yet
  • Check that all the files specified in the command line exist
  • Each time we try to open a nonexistent file, add an explicit message to the user to explain what's wrong
  • When executing a lot of commands, add a summary of the failures
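
The first two checks could be sketched as below. The function names and the message wording are illustrative only.

```python
import os
import sys


def check_inputs(output_dir, input_files):
    """Return a list of human-readable error messages (empty if all is fine)."""
    errors = []
    if os.path.exists(output_dir):
        errors.append("Output directory '%s' already exists." % output_dir)
    for path in input_files:
        if not os.path.isfile(path):
            errors.append("Input file '%s' does not exist." % path)
    return errors


def exit_on_errors(errors):
    """Print every error and exit with status 1 instead of showing a traceback."""
    if errors:
        for error in errors:
            print("[Error] " + error, file=sys.stderr)
        sys.exit(1)
```

Collecting all the errors before exiting lets the user fix the whole command line in one pass.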

Improve wiki

  • Update the requirements
  • Document the examples
  • Describe all the parameters better (do not just copy paste python help message!)

Add an option to disable the 3% heuristic

By default, ParGenes assigns twice as many cores to the 3% of jobs that have the highest number of taxa. This allows us to get rid of most of the "tails" (the few cores that finish much later than the others) and improves the load balance.

But in some cases, especially when the number of jobs is low, this heuristic might slow down the whole execution. Advanced users should be able to disable it. (And it will be useful to compare runs with and without the heuristic!)
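
The heuristic with an on/off switch can be sketched as follows. This is not the actual ParGenes code: the function name, the data layout and the choice to round the 3% up to at least one job are assumptions.

```python
def assign_cores(taxa_per_job, base_cores, use_heuristic=True, percent=3):
    """Return {job_name: cores}, doubling the cores of the largest jobs.

    taxa_per_job: dict {job_name: number_of_taxa}
    base_cores:   dict {job_name: cores before the heuristic}
    """
    cores = dict(base_cores)
    if not use_heuristic or not taxa_per_job:
        return cores
    # Integer ceiling of percent% of the jobs (assumption: round up).
    boosted = (len(taxa_per_job) * percent + 99) // 100
    largest = sorted(taxa_per_job, key=taxa_per_job.get, reverse=True)[:boosted]
    for name in largest:
        cores[name] *= 2
    return cores
```

With rounding up, a run with only 10 jobs still doubles the cores of one job, which is the kind of low-job-count case where disabling the heuristic may be preferable.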

Add a parameter to list the filenames to process

This parameter would complement the --fasta-dir option, to process only a subset of the files in this directory.

Motivation: the user might want to filter out some files that are not MSA files from the analysis.

Do not require an input model when -m enabled

When the user wants to infer the model with modeltest, they currently need to specify an input model because raxml needs it for the initial MSA parsing (although it's only used to decide between aa and nt).
Todo: in this case, add a default hardcoded model to satisfy the raxml parsing step.

Add a dynamic list of currently running jobs

When monitoring a ParGenes run, it would be super useful to know which jobs are currently being run.

One way would be to create a file per job and to delete it when it finishes.
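
The file-per-job idea can be sketched with a context manager. The names `running_job_marker` and `list_running_jobs` and the content of the marker file are assumptions.

```python
import contextlib
import os


@contextlib.contextmanager
def running_job_marker(running_dir, job_name):
    """Create <running_dir>/<job_name> while the job runs, and delete it afterwards."""
    os.makedirs(running_dir, exist_ok=True)
    marker = os.path.join(running_dir, job_name)
    with open(marker, "w") as writer:
        writer.write("pid %d\n" % os.getpid())
    try:
        yield marker
    finally:
        os.remove(marker)


def list_running_jobs(running_dir):
    """Return the names of the jobs that currently have a marker file."""
    return sorted(os.listdir(running_dir)) if os.path.isdir(running_dir) else []
```

If the whole process is killed, the `finally` clause does not run and the markers stay on disk, so the leftover files name the jobs that were running at that moment.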

This might also be useful to implement #14
