benoitmorel / pargenes
A massively parallel tool for model selection and tree inference on thousands of genes
License: GNU General Public License v3.0
The criterion is currently hardcoded as AICc, but some users might want to select models using another criterion.
Rename multi-raxml to pargenes, unless someone finds a better name before
Because right now it's annoying for debugging
... instead of hardcoding it
Add an option to run multiple tree searches per MSA (with distinct starting trees), and pick the best-scoring ML tree.
Default value is 1.
Example:
python multi-raxml.py <blabla> --raxml-starting-trees 20
When running the script install_debug.sh on haswell
In the 1024-core muscle run, all the bootstrap files are correctly generated, but after the concatenation step a lot of them are missing.
It might be because of this weird line (line 60) in mr_bootstraps.py:
with concurrent.futures.ThreadPoolExecutor(max_workers = int(cores)) as e:
Too many threads are started, but this can run only on one single node...
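A minimal sketch of one possible fix (the `run_bootstraps` helper name is hypothetical): cap the worker count at the number of cores actually available on the local node rather than at the scheduler-wide allocation.

```python
import concurrent.futures
import os

def run_bootstraps(fn, tasks, cores):
    # "cores" may be the scheduler-wide allocation (e.g. 1024), but the
    # thread pool only runs on a single node: cap it at the local core count.
    local_cores = os.cpu_count() or 1
    workers = min(int(cores), local_cores)
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as e:
        return list(e.map(fn, tasks))
```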
And this makes travis unhappy
This might be tricky, but would help a lot, for analysis like 1kite.
To decide the number of cores to allocate to a raxml run, we currently use the number of cores for minimal response time reported by raxml (possibly divided by 2).
When the MSAs get large (for instance in the 1kite analysis), this value is irrelevant.
Solution: implement several different policies (minimum response time, maximum throughput, something in between, etc.). Set the most common one (i.e. for gene tree analyses) as the default, and add a parameter for advanced users.
Experiment with these different policies!
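A possible shape for the policy parameter; the core-count formulas below are made-up placeholders for illustration, not the real raxml-based estimates.

```python
# Hypothetical pluggable core-allocation policies.
def min_response_time(sites, max_cores):
    # Many cores per job: each individual job finishes as fast as possible.
    return min(max_cores, max(1, sites // 200))

def max_throughput(sites, max_cores):
    # Few cores per job: more jobs run concurrently.
    return min(max_cores, max(1, sites // 1000))

POLICIES = {
    "min-response-time": min_response_time,
    "max-throughput": max_throughput,
}

def cores_for_job(policy, sites, max_cores=16):
    return POLICIES[policy](sites, max_cores)
```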
The debug mode should be run after a job crash, when the user does not know which job crashed.
In this mode, all the jobs that were running when the crash occurred are run again.
But this time, each MPI rank should run a job without MPI through a system call, to detect crashes without crashing itself.
Since this is a very inefficient use of the available resources, we might allow ParGenes to allocate the idle cores to start other pending jobs until all the potentially buggy jobs are processed.
The current way of specifying the list of input MSAs is to provide a directory containing all the MSAs.
But the user might not have all the MSAs in the same directory, and should not have to copy them (or to do any other trick) to use ParGenes.
One idea (to discuss) is to provide an alternative to the -a <input_directory> argument: a file listing all the paths to the MSAs to process. For each path, the user would also specify a unique id, to be able to locate the per-MSA results.
The file format would be, for instance:
id1 path1
id2 path2
id3 path3
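A sketch of a parser for such a file (the `parse_msa_list` name is hypothetical), enforcing unique ids:

```python
def parse_msa_list(path):
    # Each non-empty line: "<unique_id> <path_to_msa>"
    msas = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            msa_id, msa_path = line.split(None, 1)
            if msa_id in msas:
                raise ValueError("duplicate MSA id: " + msa_id)
            msas[msa_id] = msa_path
    return msas
```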
Allow the user to add specific options for each MSA.
This should be done through a file like that:
msa_filename1.fasta --template raxml
msa_filename2.fasta --mp
msa_filename3.fasta --ml
These options will be concatenated to the general modeltest options.
If an MSA file is not listed but is present in the MSA directory, it will still be processed, without any additional modeltest option.
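A sketch of how such a file could be parsed and merged with the general modeltest options (function names are hypothetical):

```python
def parse_per_msa_options(path):
    # Each line: "<msa_filename> <extra modeltest options...>"
    options = {}
    with open(path) as f:
        for line in f:
            tokens = line.split()
            if tokens:
                options[tokens[0]] = tokens[1:]
    return options

def modeltest_options(msa, general_options, per_msa):
    # MSAs absent from the file only get the general options.
    return general_options + per_msa.get(msa, [])
```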
To parse the arguments and the MSAs and recommend some number of cores
Do not use a hardcoded number of positional arguments. Use --argname arg
instead.
Rationale:
Modeltest needs to know whether we are dealing with aa or nt data. The user can already pass modeltest arguments to specify aa or nt, but most users won't think about it and will get a weird error message during the modeltest runs.
Instead, we should force the user to specify aa or nt from multi-raxml, and multi-raxml will handle the rest.
Todo:
add a -d --datatype option and feed modeltest with the parameter
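A minimal argparse sketch of the option; the exact flag modeltest expects for the datatype is an assumption to be checked against its documentation.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-d", "--datatype", choices=["nt", "aa"], required=True,
                    help="data type of the input MSAs, forwarded to modeltest")
args = parser.parse_args(["--datatype", "aa"])

# The value would then be appended to every modeltest command line,
# e.g. something like ["modeltest-ng", "-d", args.datatype, ...].
```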
Happened when running pargenes on the split 1kite dataset, after parsing modeltest outputs.
Some ideas of what's wrong here: https://stackoverflow.com/questions/30325351/ioerror-errno-5-input-output-error-while-using-smbus-for-analog-reading-thr
Failing run on phobos at /home/morelbt/github/1kite_experiments/1kite_lg4m_lg4x_pargenes_run
Because the OpenMP solution:
- spawns more threads than needed (one OpenMP thread + one system-call thread)
Try either with:
Allow the user to add specific options for each MSA.
This should be done through a file like that:
msa_filename1.fasta --model GTR
msa_filename2.fasta --model JC
msa_filename3.fasta --model GTR --site-repeats ON
These options will be concatenated to the general raxml options.
If an MSA file is not listed but is present in the MSA directory, it will still be processed, without any additional raxml option (we might change this behaviour later on).
The crash occurred when analyzing the muscle dataset.
Files in haswell: /hits/basement/sco/morel/github/phd_experiments/results/multi-raxml/muscle/bootstraps_100_modeltest_-s_10_-p_10/haswell_256/run_0
/hits/basement/sco/morel/github/multi-raxml/raxml-ng/src/Tree.cpp:226: NameIdMap Tree::tip_ids() const: Assertion `!result.empty()' failed.
When there is a crash, the python script continues executing until the end. So the checkpoint is set to the final value, although most of the jobs were not processed.
When an MSA is invalid and will be skipped, a warning is produced. When 1000 MSAs are invalid, 1000 warnings are produced, and the top level logs are flooded.
Instead, throw one unique warning with the number of invalid MSAs, and list them in a separate file.
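One way to implement this (hypothetical helper; `log` here is just a list standing in for the real logger):

```python
def report_invalid_msas(invalid_msas, log, listing_path):
    # Emit a single summary warning and write the full list to a side file.
    if not invalid_msas:
        return
    log.append("[Warning] Skipped %d invalid MSAs (full list in %s)"
               % (len(invalid_msas), listing_path))
    with open(listing_path, "w") as f:
        f.write("\n".join(invalid_msas) + "\n")
```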
Some "small" operations are executed sequentially by the script. When the number of cores increases, this starts becoming a bottleneck.
Operations to parallelize:
ParGenes currently executes all the modeltest runs in parallel, then synchronizes, and then executes all the raxml runs. This might be bad for load balance.
Todo: do everything within the same parallel run, with a dependency system.
In particular, do not parse all the raxml logs after a checkpoint. Instead, store a summary of the MSA characteristics and reload it after a checkpoint.
Check that we do not waste time in other parts of the pipeline.
This is important since some people have shown interest in the checkpoint feature.
Because Modeltest-ng does not parallelize over sites, we chose to assign a fixed number of cores (16) per modeltest job. On unpartitioned gene families, this is the best we can do.
But when dealing with partitioned datasets (as I do with 1kite), we might want to use more cores.
Todo: add an option --cores-per-modeltest-job
to customize the number of cores per modeltest job.
Add some logs, especially at the startup.
I just had a crash at the beginning of a 1kite run but I have no idea where, because the scheduler is not verbose enough...
Also add some information about the time of the day.
Because some clusters still live in the past
Add a file with the number of times each model is picked:
Example:
JTT+G4+F 20
JTT-DCMUT 10
would mean that two different models were selected for the 30 input MSAs.
It might be very interesting for some post-analysis.
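Producing such a file is a simple counting pass over the per-MSA results (a sketch; names are hypothetical):

```python
from collections import Counter

def write_model_counts(selected_models, path):
    # selected_models: one best-fit model string per input MSA.
    counts = Counter(selected_models)
    with open(path, "w") as f:
        for model, count in counts.most_common():
            f.write("%s %d\n" % (model, count))
```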
For users running ParGenes on their laptop
Improve it a bit.
Document it!
We currently duplicate fasta/phy files for each raxml run.
We should also try to use the RBA files (but we need an additional raxml step after modeltest to treat models correctly).
It does not produce the bootstrap trees...
RBA files already store the model. The raxml parameter --model is then ignored.
The benefit of loading RBA files is very small for ParGenes anyway.
Several ideas:
Lots of simple misuses of multi-raxml lead to unreadable python exceptions.
Todo:
Most of the work is already done, but the syntax does not allow it yet.
Also test it!
By default, ParGenes assigns twice as many cores to the 3% of jobs with the highest number of taxa. This allows us to get rid of most of the "tails" (the few cores that finish much later than the others) and improves the load balance.
But in some cases, especially when the number of jobs is low, this heuristic might slow down the whole execution. Advanced users should have the possibility of disabling it (and it will be useful for comparing runs with and without the heuristic!).
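A sketch of the heuristic with an opt-out switch; the 3% threshold and the 2x factor come from the description above, everything else (names, signature) is hypothetical.

```python
def cores_per_job(taxa_per_job, base_cores, boost_fraction=0.03, boost=True):
    # taxa_per_job: {job_name: number_of_taxa}
    # With boost enabled, the top boost_fraction of jobs by taxon count
    # get twice as many cores; with boost disabled, allocation is uniform.
    ranked = sorted(taxa_per_job, key=taxa_per_job.get, reverse=True)
    n_boosted = int(len(ranked) * boost_fraction) if boost else 0
    return {name: base_cores * (2 if i < n_boosted else 1)
            for i, name in enumerate(ranked)}
```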
This parameter would complement the --fasta-dir
option, to treat only a subset of the files in this directory.
Motivation: the user might want to filter out some files that are not MSA files from the analysis.
When the user wants to infer the model with modeltest, they currently need to specify an input model, because raxml needs it for the initial MSA parsing (although it is only used to decide between aa and nt).
Todo: in this case, add a default hardcoded model to satisfy raxml parsing step.
Example:
python multi-raxml.py <blabla> --modeltest-options "--ml" --raxml-options "--search"
We should decide later on if this is a good practice, or if we should use files instead.
When monitoring a ParGenes run, it would be super useful to know which jobs are currently being run.
One way would be to create a file per job and to delete it when it finishes.
This might also be useful to implement #14
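A sketch of the marker-file idea (hypothetical helper): create a file when a job starts and delete it when it finishes, so the files present in the directory are exactly the running jobs, and files left behind after a crash identify the jobs to re-run.

```python
import os
from contextlib import contextmanager

@contextmanager
def running_marker(running_dir, job_name):
    # The marker file exists exactly while the job is running.
    os.makedirs(running_dir, exist_ok=True)
    marker = os.path.join(running_dir, job_name)
    open(marker, "w").close()
    try:
        yield marker
    finally:
        if os.path.exists(marker):
            os.remove(marker)
```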