iqbal-lab-org / make_prg
Code to create a PRG from a Multiple Sequence Alignment file
License: Other
@leoisl @mbhall88 could you paste here an example of the ML-path format that pandora uses, to describe an ML path with respect to the linearised PRG?
Then I can implement it in gramtools.
Or, we could move to expressing this ML path on a GFA; this means pandora and gramtools would need to use the GFA produced by make_prg.
Hello there!
I am reading this code and trying to understand a little bit of it! I have a question about the kmer ID attribution in this code: I'd propose checking for the presence of the kmer, rather than the full seq, in the if.
In this test, these are the values of self.kmer_dict and seq_kmer_counts before:
15/03/2019 05:13:47 self.kmer_dict = {'TTT': 62, 'TTA': 59, 'TAT': 60, 'ATT': 61, 'TTC': 34, 'TCT': 35, 'CTT': 36, 'TTG': 55, 'TGT': 56, 'GTT': 57}
15/03/2019 05:13:47 These vectors have length 63
15/03/2019 05:13:47 seq_kmer_counts = [[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 2. 2. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 2. 2. 2. 6.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 3. 3. 3. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 1. 1. 1. 6.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 3. 3. 3. 0. 2. 2. 2. 6.]]
and after a small fix:
15/03/2019 05:18:27 self.kmer_dict = {'TTT': 0, 'TTA': 1, 'TAT': 2, 'ATT': 3, 'TTC': 4, 'TCT': 5, 'CTT': 6, 'TTG': 7, 'TGT': 8, 'GTT': 9}
15/03/2019 05:18:27 These vectors have length 10
15/03/2019 05:18:27 seq_kmer_counts = [[6. 2. 2. 2. 2. 2. 2. 1. 1. 1.]
[6. 1. 1. 1. 3. 3. 3. 1. 1. 1.]
[6. 2. 2. 2. 0. 0. 0. 3. 3. 3.]]
There is no real difference regarding self.kmer_dict - each kmer still has a unique ID - but seq_kmer_counts, which is fed into the KMeans algorithm, may contain a lot of meaningless 0s (whenever a kmer is repeated in the considered subalignment, we get an additional all-zero column).
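A minimal sketch of the proposed check (illustrative names, not the actual make_prg code): assign a new column ID only when the kmer itself is unseen, so repeated kmers reuse their existing column instead of creating zero-filled ones.

```python
# Illustrative sketch, not make_prg's actual code: build the kmer->ID
# dict by checking for the kmer itself, so a repeated kmer reuses its
# existing column instead of adding a meaningless all-zero one.
def count_kmers(seqs, k):
    kmer_dict = {}  # kmer -> column index in the count matrix
    rows = []
    for seq in seqs:
        counts = {}
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer not in kmer_dict:  # the proposed check
                kmer_dict[kmer] = len(kmer_dict)
            counts[kmer] = counts.get(kmer, 0) + 1
        rows.append(counts)
    matrix = [[row.get(km, 0) for km in kmer_dict] for row in rows]
    return kmer_dict, matrix
```

With this check the vectors have exactly one column per distinct kmer, matching the "after" log output above.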
I am wondering if this is really a (minor) bug or if I am just misunderstanding...
Thanks!
See https://llvllatrix.wordpress.com/2016/02/19/python-vs-copy-on-write . TLDR: every Python object has a reference count. If we share read-only Python objects hoping this will be efficient thanks to the copy-on-write optimisation, it may not be: even reading a Python object increments its reference count, i.e. we modify the reference count, thus we modify the page, and thus effectively copy it in the child process. However, I do remember testing data sharing and the copy-on-write optimisation in Python, and I don't think this showed up in my tests... But I am not sure I read the whole large data object in the child process. Something to benchmark later.
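The refcount point can be made concrete with sys.getrefcount: merely binding another name to an object bumps its count, which is a write to the page holding the object header.

```python
# Even "read-only" use of an object writes to it: binding a new
# reference increments the refcount stored in the object's header.
import sys

data = [0] * 1000
before = sys.getrefcount(data)
alias = data  # a read-only use, yet the refcount (and its page) changes
after = sys.getrefcount(data)
assert after == before + 1
```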
When I try to execute the (v0.4.0) precompiled binary on codon I get the following error:
$ /hps/nobackup/iqbal/mbhall/drprg/src/ext/make_prg from_msa -h
[1880013] Error loading Python lib '/hps/scratch/lsf_tmpdir/hl-codon-44-04/_MEIXP3fLm/libpython3.10.so.1.0': dlopen: /lib64/libm.so.6: version `GLIBC_2.29' not found (required by /hps/scratch/lsf_tmpdir/hl-codon-44-04/_MEIXP3fLm/libpython3.10.so.1.0)
Hey @mbhall88, I finally remembered the reason I used global variables for the CLI options and for the update_shared_data variable, which holds a DenovoVariantsDB object representing all denovo variants for a run - a super heavy object. It was all about performance. When we do multiprocessing in Python, one of the steps is pickling the input args in the parent process and unpickling them later in the child process to run the specified function. The issue is that pickling/unpickling large objects is slow.
However, I think removing the CLI options from global scope is the right thing to do. We are already pickling/unpickling a variable when running the multiprocessing part. This options variable is very light and is probably fast to pickle/unpickle, although we have to do it thousands of times.
On the other hand, I don't think the update_shared_data variable should be pickled/unpickled every time we run an update. This data structure is huge and would definitely kill performance: we would spend most of our time pickling/unpickling this object. The trick was to make this variable global, so it exists in the parent process and is carried over to the child processes when the fork happens during multiprocessing. Although it may not look like we gain efficiency, since we would still seem to copy this heavy data structure to each child process, Linux (and, I hope, macOS) implements the copy-on-write optimisation: a page is only copied from the parent to the child process if the child writes to that page. In our case, the update_shared_data variable is read-only in the child, i.e. the child changes nothing, so pages are never copied from the parent to the child process, allowing effective memory sharing between parent and children. I actually hit this issue ~1.5y ago: https://github.com/iqbal-lab/pandora1_paper/issues/252#issuecomment-813302128 , which is when I chose to move to the global-variable approach. My mistake was not documenting this.
So, in summary: the idea of multiprocessing in Python is to gain performance, but if the input/output args are huge, we have to pickle/unpickle them, and we lose all the performance we wanted to gain. If the child processes never change the input args, then we can efficiently share the args in memory using global variables. Global variables are bad, though, so we should use a singleton approach. To be sure, I think we should measure performance on the 20-way using the current code, which pickles/unpickles the heavy data structure, and then try the singleton approach...
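The fork-based sharing described above can be sketched as follows; all names here are illustrative stand-ins (the real code uses a DenovoVariantsDB), and the heavy object is set as a module-level global before the Pool is created, so forked workers inherit it without pickling.

```python
# A minimal sketch (not the actual make_prg code) of sharing a heavy
# read-only object with forked workers via a module-level global.
import multiprocessing

_shared_db = None  # stands in for the heavy DenovoVariantsDB object

def _worker(locus):
    # Read-only access: _shared_db is never pickled, and on Linux the
    # fork + copy-on-write combination avoids copying its pages.
    return locus in _shared_db

def run(db, loci):
    global _shared_db
    _shared_db = db  # set before forking so children inherit it
    with multiprocessing.get_context("fork").Pool(2) as pool:
        return pool.map(_worker, loci)
```

The key constraint is that workers only read the global; any write would trigger the page copies the trick is meant to avoid.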
make_prg can produce sites with i) direct deletions (e.g. REF AT, ALT "") and ii) direct insertions (e.g. REF "", ALT AT). I refer to REF as the first allele in the site; that's how we embed a 'reference' in gramtools.
Though @mbhall88 has rightly pointed out that a site made by this tool does not have to translate to one in pandora/gramtools, I argue that if we fix this problem here, there's no need to deal with it there. This is especially relevant for gramtools, as by default each site produced here is a variant site in the output of genotype. It is also important since the VCF spec (https://github.com/samtools/hts-specs/blob/master/VCFv4.3.pdf section 1.6.1) states that neither REF nor ALT should be empty.
I will have a look at how to fix this
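The usual VCF-spec remedy is to left-anchor such events on the reference base immediately before them, so neither allele is empty. A sketch with a hypothetical helper (assuming 1-based positions and an event that does not start at position 1):

```python
# Hypothetical sketch of the VCF-style fix: pad an empty REF or ALT
# with the reference base immediately preceding the event.
def left_anchor(pos, ref, alt, ref_seq):
    """pos is 1-based; ref_seq is the full reference sequence."""
    if ref and alt:  # nothing to fix
        return pos, ref, alt
    anchor = ref_seq[pos - 2]  # base before the event; assumes pos > 1
    return pos - 1, anchor + ref, anchor + alt
```

For example, a direct deletion of AT at position 3 of GCATTT becomes REF CAT, ALT C at position 2.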
I am incrementally improving our test coverage and have a question relating to some code from @bricoletc
If the input_char is lowercase but a valid base, then we will raise an AssertionError here. Is this the intended behaviour, or should we treat all input as uppercase?
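If lowercase is meant to be accepted, normalising before validation is one option; a sketch, where the allowed alphabet is my assumption rather than the project's actual rule:

```python
# Sketch only: normalise case before validating. The exact allowed
# alphabet here (bases, N, gap) is an assumption, not make_prg's rule.
ALLOWED = {"A", "C", "G", "T", "N", "-"}

def normalise_char(input_char):
    upper = input_char.upper()  # treat a/c/g/t like A/C/G/T
    assert upper in ALLOWED, f"Invalid alignment character: {input_char!r}"
    return upper
```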
I couldn't find a solution for this for now. As part of the new release, I tried upgrading the Python version to 3.10.8 and the deps to the latest ones we could find wheels for, so that installation is quick. pytest unfortunately gets stuck running the make_prg integration tests. On my machine, it always gets stuck at test___match_nonmatch_shortmatch. If I remove it, it gets stuck on the next one.
This is not the best idea either, as Python 3.7 reaches end-of-life (EOL) on 27 Jun 2023, and we should be compatible with (I guess) the oldest Python version still supported. I will revert to the latest Python 3.7 version and try to update the deps.
I have recently tested drprg out after upgrading make_prg from https://github.com/leoisl/make_prg/releases/tag/v0.3.0 to the latest v0.4.0
The results were almost the same across ~8500 samples. However, there were 3 that went from being TPs to FNs.
It has to do with a region in the gene gid where the positions have weird overlaps that differ between make_prg versions. This is all a bit easier if I put in some examples
Here is the region from https://github.com/leoisl/make_prg/releases/tag/v0.3.0 (this leads to a correct ALT call in POS 513)
gid 505 f3d1ef78 G T . PASS VC=SNP;GRAPHTYPE=NESTED;PDP=1,0 GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 0:12,0:7,0:12,0:7,0:25,0:15,0:0.5,1:-12.421,-127.498:115.077
gid 505 2de45418 GTCACGGGC TTGGGCGGCAGCGACGCTGC . PASS VC=COMPLEX;GRAPHTYPE=NESTED;PDP=0.722222,0.277778;VARID=gid_A138V,gid_R137W,gid_R137P,gid_A138T;PREDICT=F,F,F,F GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF .:8,3:5,2:0,0:0,0:25,23:15,13:0.666667,0.833333:-39.9668,-86.3427:46.3759
gid 513 aac746d5 C T . PASS VC=SNP;GRAPHTYPE=NESTED;PDP=0,1;VARID=gid_A138T,gid_A138V;PREDICT=.,R GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 1:0,23:0,13:0,23:0,13:0,23:0,13:1,0:-205.786,-7.87333:197.913
However, this is the same region with the latest v0.4.0 make_prg
gid 505 8b2d50ba G T . PASS VC=SNP;GRAPHTYPE=NESTED;PDP=1,0 GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 0:12,0:7,0:12,0:7,0:25,0:15,0:0.5,1:-12.421,-127.498:115.077
gid 505 ec5fd1a2 GTCACGGGC TTGGGCGGCAGCGACGCTGC . PASS VC=COMPLEX;GRAPHTYPE=NESTED;PDP=0.722222,0.277778;VARID=gid_A138V,gid_A138T,gid_R137W,gid_R137P;PREDICT=.,.,.,. GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 0:8,3:5,2:0,0:0,0:25,23:15,13:0.666667,0.833333:-39.9668,-86.3427:46.3759
gid 511 bad07f5f GGC GGT . frs VC=PH_SNPs;GRAPHTYPE=NESTED;PDP=0.342105,0.657895;VARID=gid_R137W,gid_A138V,gid_R137P,gid_A138T;PREDICT=.,.,.,. GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF 0:8,16:5,9:0,23:0,13:25,48:15,28:0.666667,0.333333:-132.07,-69.6442:62.4261
You can see the POS 511 variant is filtered due to FRS - there is coverage on the ref and alt, whereas the same variant (POS 513) in the previous version has no coverage on the ref.
The second variant in those examples overlaps both the first and third variants.
This 513C>T variant is discovered de novo in both cases - i.e., the reference PRG is effectively just the first two variants.
In both cases I am using pandora v0.10.0-alpha.0
The make_prg command is:
For the v0.3.0 version (you can find all the drprg/pandora/make_prg files at /hps/nobackup/iqbal/mbhall/drprg/tmp/predict-ERR2510499):
make_prg from_msa -t 4 -L 5 -N 5 -v -o updated -i /hps/nobackup/iqbal/mbhall/drprg/tmp/predict-ERR2510499/update_msas
and for the v0.4.0 version (all files are at /hps/nobackup/iqbal/mbhall/drprg/paper/results/amr_predictions/drprg/illumina/PRJEB25972/SAMEA1101807/ERR2510499/):
make_prg from_msa -t 2 -L 5 -N 5 -v --force -o updated -i /hps/nobackup/iqbal/mbhall/drprg/paper/results/amr_predictions/drprg/illumina/PRJEB25972/SAMEA1101807/ERR2510499/update_msas
Let me know if you need any other info @leoisl
Hello folks
Thanks for what looks like a very promising method. I'm attempting to create a PanRG from genes aligned by orthofinder across a ~30-sample pangenome (i.e. 30 annotated genome assemblies), so I can use pandora to genotype variation in several hundred short-read libraries. I'm following the toy example in @leoisl's fork, as linked from another issue here, but the version of make_prg in conda doesn't support those CLI options. Additionally, it looks like the latest version in this repo (0.2.0) was released after version 0.3.0 from @leoisl's fork from July, and both forks seem to have ongoing development.
Which fork and version of make_prg should I use? Is there a guide or docs similar to @leoisl's toy example for the latest code in this repo? And does this repo support building the PRG for a series of MSAs at once, à la --input msas/ from @leoisl's code?
Best,
Kevin
If I'm building a graph of n E. coli genomes, is there an option in pandora to get the VCF of these genomes?
I will force scikit-learn = "0.24.2" in pyproject.toml for now.
Hey @leoisl!
I am running make_prg v0.4.0 on gene alignments for ~1400 E. coli and have consistently run into the bug below when make_prg processes the attached alignment.
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/hps/software/spack/opt/spack/linux-centos8-sandybridge/gcc-9.3.0/python-3.9.5-jtayjftvmku5dcg53v74ilyhipv6kvxi/lib/python3.9/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/hps/software/spack/opt/spack/linux-centos8-sandybridge/gcc-9.3.0/python-3.9.5-jtayjftvmku5dcg53v74ilyhipv6kvxi/lib/python3.9/multiprocessing/pool.py", line 51, in starmapstar
return list(itertools.starmap(args[0], args[1]))
File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/subcommands/from_msa.py", line 114, in process_MSA
builder = prg_builder.PrgBuilder(
File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/prg_builder.py", line 42, in __init__
self.root: RecursiveTreeNode = NodeFactory.build(alignment, self, None)
File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 445, in build
return MultiIntervalNode(
File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 189, in __init__
super().__init__(
File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 53, in __init__
self._children: List["RecursiveTreeNode"] = self._get_children(
File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 130, in _get_children
return [
File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 131, in <listcomp>
NodeFactory.build(alignment, self.prg_builder, self)
File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 463, in build
return MultiClusterNode(
File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 217, in __init__
super().__init__(
File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 53, in __init__
self._children: List["RecursiveTreeNode"] = self._get_children(
File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 130, in _get_children
return [
File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 131, in <listcomp>
NodeFactory.build(alignment, self.prg_builder, self)
File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 463, in build
return MultiClusterNode(
File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 217, in __init__
super().__init__(
File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 53, in __init__
self._children: List["RecursiveTreeNode"] = self._get_children(
File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 130, in _get_children
return [
File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 131, in <listcomp>
NodeFactory.build(alignment, self.prg_builder, self)
File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 463, in build
return MultiClusterNode(
File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 217, in __init__
super().__init__(
File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 53, in __init__
self._children: List["RecursiveTreeNode"] = self._get_children(
File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 130, in _get_children
return [
File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 131, in <listcomp>
NodeFactory.build(alignment, self.prg_builder, self)
File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 463, in build
return MultiClusterNode(
File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 217, in __init__
super().__init__(
File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 53, in __init__
self._children: List["RecursiveTreeNode"] = self._get_children(
File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 130, in _get_children
return [
File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 131, in <listcomp>
NodeFactory.build(alignment, self.prg_builder, self)
File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 453, in build
clustering_result = kmeans_cluster_seqs(alignment, min_match_length)
File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/from_msa/cluster_sequences.py", line 291, in kmeans_cluster_seqs
assert len(alignment) == sum(
AssertionError: Each input sequence should be in a cluster
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/hps/nobackup/iqbal/dander/amira_panRG/venv/bin/make_prg", line 8, in <module>
sys.exit(main())
File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/__main__.py", line 94, in main
args.func(args)
File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/subcommands/from_msa.py", line 183, in run
pool.starmap(process_MSA, args, chunksize=1)
File "/hps/software/spack/opt/spack/linux-centos8-sandybridge/gcc-9.3.0/python-3.9.5-jtayjftvmku5dcg53v74ilyhipv6kvxi/lib/python3.9/multiprocessing/pool.py", line 372, in starmap
return self._map_async(func, iterable, starmapstar, chunksize).get()
File "/hps/software/spack/opt/spack/linux-centos8-sandybridge/gcc-9.3.0/python-3.9.5-jtayjftvmku5dcg53v74ilyhipv6kvxi/lib/python3.9/multiprocessing/pool.py", line 771, in get
raise self._value
AssertionError: Each input sequence should be in a cluster
Traceback (most recent call last):
File "/hps/nobackup/iqbal/dander/amira_panRG/panaroo_qc_panRGs/make_panRG_from_panaroo_qced.py", line 99, in <module>
subprocess.run("make_prg from_msa -t 64 -i " + alignment_path + " -o horesh.card.panRG.qc.cov.mode.0.card.included", shell=True, check=True)
File "/hps/software/spack/opt/spack/linux-centos8-sandybridge/gcc-9.3.0/python-3.9.5-jtayjftvmku5dcg53v74ilyhipv6kvxi/lib/python3.9/subprocess.py", line 528, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'make_prg from_msa -t 64 -i mmseq2_out_cov_mode_0_card_supplemented_aligned -o horesh.card.panRG.qc.cov.mode.0.card.included' returned non-zero exit status 1.
See https://mafft.cbrc.jp/alignment/software/linuxportable.html
Like this, make_prg dependencies are all Python deps, and there is no need to install mafft anymore.
I produced nested PRGs (--min_match_length=7, --max_nesting=5) from multiple sequence alignments of Plasmodium genes and found that make_prg can produce ambiguous PRGs.
Here's an example I drew out by hand to understand. _o and _c mean site opening and closing nodes, with a numbered ID before. 7T means 7 consecutive Ts.
The path CA9T can be produced going through either sites 33/34/35 or 33/36. This can be horrible for genotyping: if the dataset has that sequence, then you have to either say you can't genotype because of ambiguity, or randomly choose one of the two possibilities. The latter could cause comparison issues across datasets. For now, in gramtools I've decided to bail out in these cases and null-genotype, with a FILTER flag of AMBIG to signal this occurred.
I tried to run the make_prg singularity container (latest) in a snakemake pipeline, but got an issue. snakemake runs the container as follows (we necessarily need bash):
Command ' singularity exec --home <suppressed> bash -c 'set -euo pipefail; python /home/leandro/git/myforks/pandora_analysis_pipeline/.snakemake/scripts/tmp_b7uof0h.run_make_prg.py'' returned non-zero exit status 255.
Error is:
FATAL: "bash": executable file not found in $PATH
Upon further inspection of the container, bash indeed not only cannot be found in $PATH, it is not installed at all, which seems weird to me, as this line: https://github.com/rmcolq/make_prg/blob/c00fc137456aa474f2f75dd23e1f10c08876dfe9/Singularity#L12 seems to install it.
Hey there,
Sorry if I missed this, but what are the column descriptions for the summary.tsv file?
Thanks,
Gavin
Hi!
I have been facing a problem while running make_prg version 0.3.0 with fasta files that contain "N" (which stands for any nucleic acid). Is there an option I can use to solve this issue? Can make_prg only work with the bases A, C, T, G in fasta files?
Thank you in advance
Does the make_prg tool take as input an alignment file of genomes, or alignment files of genes? I want to build a graph of E. coli.
The current .prg output file is not in a "useable" file format. I find myself having to add a fasta header after running make_prg (which can be a pain in a pipeline).
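Until the output format changes, a tiny wrapper can do the header step in a pipeline; the function and record name below are illustrative:

```python
# Illustrative workaround: wrap a raw .prg string as a one-record FASTA
# so downstream tools that expect a header can read it.
def prg_to_fasta(prg_path, out_path, name="prg"):
    with open(prg_path) as fin, open(out_path, "w") as fout:
        fout.write(f">{name}\n{fin.read().strip()}\n")
```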
Either we need to sort the alleles (I thought we did, so I need to check), or we need the tests to be agnostic to allele order so that they do not fail.
Not for this next 0.4.0 release; likely for 0.5.0. It adds a new parameter to make_prg update:
-m UPDATE_METHOD, --update-method UPDATE_METHOD
How to update PRGs. Valid options are (default is careful):
exact: update the full PRG's MSA with the new alleles and rebuild the whole PRG. This method guarantees that the
resulting PRG is equivalent to redoing the MSA and rebuilding the PRG from scratch, but it is the slowest method;
quick: update just the leaves of the PRG with the new alleles. Alleles that cross leaves are broken into several
contained sub-alleles. Although this is the quickest method, it is imprecise when adding long deletions;
careful: if there is at least one new allele that crosses leaves, then the PRG is updated following the exact method.
Otherwise, it is updated following the quick method.
Important when we want to study longer variants.
This is linked to #15 as it leads to production of ambiguous prgs (multiple paths give rise to same sequence)
Clustering code can conclude that there is no meaningful clustering of a set of sequences, e.g. it puts every sequence in one cluster.
However, the code that calls clustering can re-run prg-building for sets of sequences that only differ in alignment (i.e. gap positioning), not sequence (here).
This leads to spurious 'nested variants' and the following pathological example case:
msp6_ambig.pdf
Of the 4 paths between the nodes labeled 55 and 56, two are identical. This causes gramtools to mis-genotype down the line.
I have a fix that I'm implementing and will PR in
This is an interesting issue (but it broke me for a day).
Consider the following dummy alignment:
>1
TTTTTTTGGGGGGGAAAAAAATTTTTTT-------AAAAAAATTTTTTTAAAAAAA-------
>2
-------TTTTTTTAAAAAAATTTTTTTGGGGGGGAAAAAAATTTTTTT-------AAAAAAA
>3
TTTTTTTAAAAAAATTTTTTTAAAAAAATTTTTTT-------GGGGGGG-------AAAAAAA
If you take the Hamming distance between any pair, it's high, but if you count all the 7-mers for each, the kmer counts are identical.
(They are: {'TTTTTTT': 3.0, 'TTTTTTG': 1.0, 'TTTTTGG': 1.0, 'TTTTGGG': 1.0, 'TTTGGGG': 1.0, 'TTGGGGG': 1.0, 'TGGGGGG': 1.0, 'GGGGGGG': 1.0, 'GGGGGGA': 1.0, 'GGGGGAA': 1.0, 'GGGGAAA': 1.0, 'GGGAAAA': 1.0, 'GGAAAAA': 1.0, 'GAAAAAA': 1.0, 'AAAAAAA': 3.0, 'AAAAAAT': 2.0, 'AAAAATT': 2.0, 'AAAATTT': 2.0, 'AAATTTT': 2.0, 'AATTTTT': 2.0, 'ATTTTTT': 2.0, 'TTTTTTA': 2.0, 'TTTTTAA': 2.0, 'TTTTAAA': 2.0, 'TTTAAAA': 2.0, 'TTAAAAA': 2.0, 'TAAAAAA': 2.0}
)
The alignment wouldn't actually look like this; it's the best simple example I could come up with. In practice, I saw this happen in MSAs of 5,800 sequences of 23kb each around the EBA175 gene in P. falciparum. I.e., looking at some repeat-rich (Plasmodium ATs) subalignments, the algorithm says they are not 'one-reference-like', so it asks kmeans to cluster with an increasing number of clusters; but kmeans, looking at the kmers, finds a smaller number of distinct data points due to identical kmer counts, and cannot produce that many clusters.
This leads to empty clusters, due to another code bug which let empty clusters be produced in the above case, and then PRG construction fails.
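The effect is easy to reproduce outside make_prg: strip the gaps from the three dummy sequences above and count 7-mers (a sketch using a plain Counter, not the project's counting code):

```python
# Reproduce the degenerate case: three distinct alignments whose
# ungapped sequences have identical 7-mer count vectors, so kmeans
# effectively sees a single data point.
from collections import Counter

def kmer_counts(aligned_seq, k=7):
    seq = aligned_seq.replace("-", "")
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

seqs = [
    "TTTTTTTGGGGGGGAAAAAAATTTTTTT-------AAAAAAATTTTTTTAAAAAAA-------",
    "-------TTTTTTTAAAAAAATTTTTTTGGGGGGGAAAAAAATTTTTTT-------AAAAAAA",
    "TTTTTTTAAAAAAATTTTTTTAAAAAAATTTTTTT-------GGGGGGG-------AAAAAAA",
]
counts = [kmer_counts(s) for s in seqs]
assert counts[0] == counts[1] == counts[2]  # identical despite high Hamming distance
```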
I'm delving into make_prg some more because I'm looking at interoperability with gramtools, and I have found the following behaviour from the function get_interval_seqs:
Given the set {"GAAAT", "AAAAT", "GTAAT", "ATAAT"}, the expansion is complete; but given RRAAT, it expands to {"GGAAT", "AAAAT"}.
The former is complete, the latter is not: it is missing GAAAT and AGAAT.
(R is ambiguous IUPAC for either G or A)
(Not showing a failing test case because the one I have made is in refactored code)
Is this deliberate?
I am assuming not, and I have already made a fix which I'll PR in if I get confirmation.
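For reference, the full expansion this comment expects can be sketched with itertools.product; the IUPAC table below covers only the common two-base codes, and this is not the actual get_interval_seqs implementation.

```python
# Full IUPAC expansion sketch: take the Cartesian product of the base
# sets at every position, so no combination is dropped.
from itertools import product

IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "GA", "Y": "CT", "S": "GC", "W": "AT", "K": "GT", "M": "AC"}

def expand_iupac(seq):
    return {"".join(bases) for bases in product(*(IUPAC[c] for c in seq))}
```

For RRAAT this yields all four combinations, including the GAAAT and AGAAT reported missing above.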
When running the 4-way pipeline, updating the E coli PRG with illumina data, I got this error:
Traceback (most recent call last):
File "/usr/local/bin/make_prg", line 10, in <module>
sys.exit(main())
File "/usr/local/lib/python3.9/site-packages/make_prg/__main__.py", line 94, in main
args.func(args)
File "/usr/local/lib/python3.9/site-packages/make_prg/subcommands/update.py", line 182, in run
denovo_variants_db = DenovoVariantsDB(
File "/usr/local/lib/python3.9/site-packages/make_prg/update/denovo_variants.py", line 548, in __init__
locus_name_to_denovo_loci = self._get_locus_name_to_denovo_loci()
File "/usr/local/lib/python3.9/site-packages/make_prg/update/denovo_variants.py", line 532, in _get_locus_name_to_denovo_loci
return self._get_locus_name_to_denovo_loci_core(filehandler)
File "/usr/local/lib/python3.9/site-packages/make_prg/update/denovo_variants.py", line 522, in _get_locus_name_to_denovo_loci_core
variants = self._read_variants(
File "/usr/local/lib/python3.9/site-packages/make_prg/update/denovo_variants.py", line 495, in _read_variants
denovo_variant = cls._read_DenovoVariant(
File "/usr/local/lib/python3.9/site-packages/make_prg/update/denovo_variants.py", line 477, in _read_DenovoVariant
denovo_variant = DenovoVariant(
File "/usr/local/lib/python3.9/site-packages/make_prg/update/denovo_variants.py", line 41, in __init__
DenovoVariant._param_checking(
File "/usr/local/lib/python3.9/site-packages/make_prg/update/denovo_variants.py", line 59, in _param_checking
DenovoVariant._check_sequence_is_composed_of_ACGT_only(alt)
File "/usr/local/lib/python3.9/site-packages/make_prg/update/denovo_variants.py", line 85, in _check_sequence_is_composed_of_ACGT_only
raise DenovoError(f"Found a non-ACGT seq ({seq}) in a denovo variant")
make_prg.update.denovo_variants.DenovoError: Found a non-ACGT seq (N) in a denovo variant
There are 36,416 new variants found, and only a single one has an N in it. I'd very much prefer to simply issue a warning here:
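The warn-and-skip behaviour could look roughly like this; the function and variant representation are illustrative, not make_prg's actual API:

```python
# Sketch of warn-and-skip: drop denovo variants containing non-ACGT
# characters with a warning instead of raising DenovoError.
import logging

def filter_acgt_variants(variants):
    """Keep (ref, alt) pairs that are ACGT-only; warn about the rest."""
    kept = []
    for ref, alt in variants:
        if set(ref + alt) <= set("ACGT"):
            kept.append((ref, alt))
        else:
            logging.warning(
                "Skipping denovo variant with non-ACGT seq: %s>%s", ref, alt
            )
    return kept
```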
Related to iqbal-lab-org/pandora#310 (comment)
Once we solve this issue, I think we are ready to make the new release.
Here's my directory
/nfs/research/zi/zi/analysis/2023/20230528_lassafever_2019_study/LASVsequencing2019
and here's the command:
URI="docker://quay.io/iqballab/make_prg"
singularity exec "$URI" make_prg from_msa -L 15 -n LASV -O pg --log outputs/L15/log.txt -f clustal clustalo-I20230528-213752-0692-38655046-p2m.clustal_num
output
WARNING: Cache disabled - cache location /hps/scratch/singularity is not writable.
INFO: Downloading library image to tmp cache: /tmp/sbuild-tmp-cache-454068105
INFO: Converting OCI blobs to SIF format
INFO: Starting build...
2023/05/28 22:42:40 info unpack layer: sha256:a0d0a0d46f8b52473982a3c466318f479767577551a53ffc9074c9fa7035982e
2023/05/28 22:42:41 info unpack layer: sha256:70ea1bcdfee87cea0d3be42b0218fb732db46342e43b5a987332d51f67d372cc
2023/05/28 22:42:48 info unpack layer: sha256:f04e8d230c4332c19a0210f818150e1fcb2320d7abc4a20893ed0eeeb2b94734
2023/05/28 22:42:48 info unpack layer: sha256:29253eb9da6161a7e8500dc26100f98d90600c2dd1cdd6742c420ab3e364610b
2023/05/28 22:42:48 info unpack layer: sha256:a319e2b90519afce6a9b453c97eedfdfa581e8753fc9c43e38c8e5b46c249346
2023/05/28 22:42:49 info unpack layer: sha256:d28652de1e2af1c965f40a025b973bf86a480a13cf69bc3d46c740051effa13f
2023/05/28 22:42:49 info unpack layer: sha256:ad75af7e4a07bbdf41b8b262fcf2c660e585f80595ea89c3e2c0ca082bad494b
2023/05/28 22:42:49 info unpack layer: sha256:70797a5213a6145868be3b8b69eaa76d5ab8b045277d50dd423afdd381004866
INFO: Creating SIF file...
Segmentation fault
cat outputs/L15/log.txt
INFO 28/05/2023 10:30:28 Input parameters max_nesting: 5, min_match_length: 15
INFO 28/05/2023 10:30:28 Read from MSA file clustalo-I20230528-213752-0692-38655046-p2m.clustal_num
INFO 28/05/2023 10:43:46 Input parameters max_nesting: 5, min_match_length: 15
INFO 28/05/2023 10:43:46 Read from MSA file clustalo-I20230528-213752-0692-38655046-p2m.clustal_num
INFO 28/05/2023 10:43:47 match intervals: []; non_match intervals: [[0, 3202]]