
make_prg's People

Contributors

bricoletc, dependabot[bot], iqbal-lab, leoisl, martinghunt, mbhall88


make_prg's Issues

GFA format and ML path format

@leoisl @mbhall88 could you paste here an example of the ML-path format that pandora uses, to describe a ML path with respect to the linearised prg?

Then I can implement it in gramtools.

Or, we could move to expressing this ML path on a GFA; this means pandora and gramtools would need to use the GFA produced by make_prg.

kmer ID attribution

Hello there!

I am reading this code and trying to understand a bit of it! I have a question about the kmer ID attribution, this code. I'd propose checking for the presence of the kmer, rather than the full sequence, in the if statement.

In this test, these are the values of self.kmer_dict and seq_kmer_counts before:

15/03/2019 05:13:47 self.kmer_dict = {'TTT': 62, 'TTA': 59, 'TAT': 60, 'ATT': 61, 'TTC': 34, 'TCT': 35, 'CTT': 36, 'TTG': 55, 'TGT': 56, 'GTT': 57}
15/03/2019 05:13:47 These vectors have length 63
15/03/2019 05:13:47 seq_kmer_counts = [[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 2. 2. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 2. 2. 2. 6.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 3. 3. 3. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 1. 1. 1. 6.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 3. 3. 3. 0. 2. 2. 2. 6.]]

and after a small fix:

15/03/2019 05:18:27 self.kmer_dict = {'TTT': 0, 'TTA': 1, 'TAT': 2, 'ATT': 3, 'TTC': 4, 'TCT': 5, 'CTT': 6, 'TTG': 7, 'TGT': 8, 'GTT': 9}
15/03/2019 05:18:27 These vectors have length 10
15/03/2019 05:18:27 seq_kmer_counts = [[6. 2. 2. 2. 2. 2. 2. 1. 1. 1.]
 [6. 1. 1. 1. 3. 3. 3. 1. 1. 1.]
 [6. 2. 2. 2. 0. 0. 0. 3. 3. 3.]]

There is no real difference regarding self.kmer_dict (each kmer still has a unique ID), but seq_kmer_counts, which is fed into the KMeans algorithm, might contain a lot of meaningless 0s (whenever a kmer is repeated in the considered subalignment, we get an additional column of 0s).
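A minimal sketch of the fix I have in mind (illustrative names, not the actual make_prg code): only assign a new ID when the kmer itself is unseen, instead of keying the check on the full sequence:

```python
def build_kmer_ids(seqs, k):
    """Map every distinct kmer across all sequences to a dense, 0-based ID."""
    kmer_dict = {}
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            kmer = seq[i : i + k]
            if kmer not in kmer_dict:  # check the kmer, not the full sequence
                kmer_dict[kmer] = len(kmer_dict)
    return kmer_dict

print(build_kmer_ids(["TTTA", "TTTC"], 3))
# {'TTT': 0, 'TTA': 1, 'TTC': 2}
```

This keeps the IDs dense, so the count vectors have one column per distinct kmer and no all-zero columns.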

I am wondering if this is really a (minor) bug or if I am just misunderstanding...

Thanks!

Copy-on-write might not be efficient in python

See https://llvllatrix.wordpress.com/2016/02/19/python-vs-copy-on-write . TL;DR: every Python object has a reference count. If we share read-only Python objects hoping this will be efficient due to the copy-on-write optimisation, it might not be: even just reading a Python object increments its reference count, i.e. we modify the reference count, thus we modify the page, and thus effectively copy it in the child process. However, I do remember testing data sharing and copy-on-write in Python, and I don't think this showed up in my tests... but I am not sure I read the whole large data object in the child process. Something to benchmark later.
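The reference-counting effect is easy to demonstrate (a minimal CPython illustration; the count lives in the object header, so bumping it dirties the memory page that fork would otherwise share):

```python
import sys

data = [0] * 1_000_000  # imagine this was inherited read-only after fork()

# sys.getrefcount's own argument adds one temporary reference to the reported count
before = sys.getrefcount(data)
alias = data  # merely binding a new name writes to the object's header
after = sys.getrefcount(data)
print(before, after)  # the second count is one higher
```

Every such increment touches the page holding the object header, which is exactly what defeats copy-on-write sharing for the header page.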

Precompiled binary issue

When I try to execute the (v0.4.0) precompiled binary on codon I get the following error

$ /hps/nobackup/iqbal/mbhall/drprg/src/ext/make_prg from_msa -h
[1880013] Error loading Python lib '/hps/scratch/lsf_tmpdir/hl-codon-44-04/_MEIXP3fLm/libpython3.10.so.1.0': dlopen: /lib64/libm.so.6: version `GLIBC_2.29' not found (required by /hps/scratch/lsf_tmpdir/hl-codon-44-04/_MEIXP3fLm/libpython3.10.so.1.0)

Global variables, multiprocessing and pickling

Hey @mbhall88 , I finally remembered the reason I used global variables for the CLI options and for the update_shared_data variable, which holds a DenovoVariantsDB object representing all denovo variants for a run, a super heavy object. The reason was all about performance. When we do multiprocessing in Python, one of the steps is pickling the input args in the parent process and unpickling them later in the child process to run the specified function. The issue is that pickling/unpickling large objects is slow.

However, I think removing the CLI options from global is the right thing to do. We are already pickling/unpickling a variable when running the multiprocessing part. This options variable is very light, and is probably fast to pickle/unpickle, although we have to do it thousands of times.

On the other hand, I don't think the update_shared_data variable should be pickled/unpickled every time we run an update. This data structure is huge and will definitely kill performance; we would spend most of our time pickling/unpickling this object. The trick here was to set this variable as global, so it would exist in the parent process and be copied to the child processes when the fork happened during multiprocessing. Although it might not seem like we gain efficiency this way, because we would still seemingly copy this heavy data structure to the child process, Linux (and I hope macOS) implements the copy-on-write optimisation: a page is only copied from the parent to the child process when the child writes to it. In our case, the update_shared_data variable is read-only from the child, i.e. the child does not change anything, so the pages are never actually copied, allowing for effective memory sharing between the parent and the child. I actually had this issue ~1.5y ago: https://github.com/iqbal-lab/pandora1_paper/issues/252#issuecomment-813302128 , which is when I chose to move to the global variable approach. My mistake was not documenting this.

So, in summary: the point of multiprocessing in Python is to gain performance, but if the input/output args are huge, pickling/unpickling them loses all the performance we wanted to gain. If the child processes never change the input args, we can instead share the args efficiently in memory using global variables. Global variables are bad though, so we should use a singleton approach. To be sure, I think we should measure the performance on the 20-way using the current code, which will pickle/unpickle the heavy data structure, and then try the singleton approach...
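A minimal sketch of the sharing pattern described above (the names are illustrative, not make_prg's actual code): the heavy object is a module-level global set in the parent, and the fork start method lets the children read it without any pickling:

```python
import multiprocessing as mp

shared_db = None  # stands in for the heavy DenovoVariantsDB; set before forking


def worker(locus):
    # The child only reads shared_db, so (modulo refcount writes) the pages
    # holding it stay shared with the parent via copy-on-write.
    return locus in shared_db


def run(loci):
    global shared_db
    shared_db = {f"locus_{i}": i for i in range(10_000)}
    ctx = mp.get_context("fork")  # fork, so children inherit the parent's globals
    with ctx.Pool(2) as pool:
        return pool.map(worker, loci)


if __name__ == "__main__":
    print(run(["locus_0", "locus_9999", "missing"]))
```

Only the tiny `locus` strings and boolean results cross the process boundary; the dict itself is never pickled. This relies on the fork start method, so it does not carry over to Windows (or to spawn-based pools).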

Stop producing empty alleles

make_prg can produce sites with i) direct deletions (e.g. REF AT, ALT "") and ii) direct insertions (e.g. REF "", ALT AT). I refer to the first allele in the site as REF; that's how we embed a 'reference' in gramtools.

Though @mbhall88 has rightly pointed out that a site made by this tool does not have to translate to one in pandora/gramtools, I argue that if we fix this problem here, there's no need to deal with it there. This is especially relevant for gramtools, as by default each site produced here is a variant site in the output of genotype. It is also important since the VCF spec (https://github.com/samtools/hts-specs/blob/master/VCFv4.3.pdf section 1.6.1) states that neither REF nor ALT may be empty.

I will have a look at how to fix this

Recent version of pytest gets stuck on make_prg integration tests

I couldn't find a solution for this yet. As part of the new release, I've tried upgrading the Python version to 3.10.8 and the deps to the latest ones we could find wheels for, so that installation is quick. Unfortunately, pytest gets stuck running the make_prg integration tests. On my machine, it always gets stuck at test___match_nonmatch_shortmatch. If I remove that test, it gets stuck on the next one.

This is not the best idea either, as Python 3.7 reaches end-of-life (EOL) on 27 Jun 2023, and we should be compatible with (I guess) the oldest Python version still supported. I will revert to the latest Python 3.7 version and try to update the deps.

Strange difference in overlapping positions between make_prg versions

I have recently tested drprg out after upgrading make_prg from https://github.com/leoisl/make_prg/releases/tag/v0.3.0 to the latest v0.4.0

The results were almost the same across ~8500 samples. However, there were 3 that went from being TPs to FNs.

It has to do with a region in the gene gid where the positions have weird overlaps that differ between make_prg versions. This is all a bit easier to explain with some examples.

Here is the region from https://github.com/leoisl/make_prg/releases/tag/v0.3.0 (this leads to a correct ALT call in POS 513)

gid     505     f3d1ef78        G       T       .       PASS    VC=SNP;GRAPHTYPE=NESTED;PDP=1,0 GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF    0:12,0:7,0:12,0:7,0:25,0:15,0:0.5,1:-12.421,-127.498:115.077
gid     505     2de45418        GTCACGGGC       TTGGGCGGCAGCGACGCTGC    .       PASS    VC=COMPLEX;GRAPHTYPE=NESTED;PDP=0.722222,0.277778;VARID=gid_A138V,gid_R137W,gid_R137P,gid_A138T;PREDICT=F,F,F,F       GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      .:8,3:5,2:0,0:0,0:25,23:15,13:0.666667,0.833333:-39.9668,-86.3427:46.3759
gid     513     aac746d5        C       T       .       PASS    VC=SNP;GRAPHTYPE=NESTED;PDP=0,1;VARID=gid_A138T,gid_A138V;PREDICT=.,R   GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF    1:0,23:0,13:0,23:0,13:0,23:0,13:1,0:-205.786,-7.87333:197.913

However, this is the same region with the latest v0.4.0 make_prg

gid     505     8b2d50ba        G       T       .       PASS    VC=SNP;GRAPHTYPE=NESTED;PDP=1,0 GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF    0:12,0:7,0:12,0:7,0:25,0:15,0:0.5,1:-12.421,-127.498:115.077
gid     505     ec5fd1a2        GTCACGGGC       TTGGGCGGCAGCGACGCTGC    .       PASS    VC=COMPLEX;GRAPHTYPE=NESTED;PDP=0.722222,0.277778;VARID=gid_A138V,gid_A138T,gid_R137W,gid_R137P;PREDICT=.,.,.,.       GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF      0:8,3:5,2:0,0:0,0:25,23:15,13:0.666667,0.833333:-39.9668,-86.3427:46.3759
gid     511     bad07f5f        GGC     GGT     .       frs     VC=PH_SNPs;GRAPHTYPE=NESTED;PDP=0.342105,0.657895;VARID=gid_R137W,gid_A138V,gid_R137P,gid_A138T;PREDICT=.,.,.,. GT:MEAN_FWD_COVG:MEAN_REV_COVG:MED_FWD_COVG:MED_REV_COVG:SUM_FWD_COVG:SUM_REV_COVG:GAPS:LIKELIHOOD:GT_CONF    0:8,16:5,9:0,23:0,13:25,48:15,28:0.666667,0.333333:-132.07,-69.6442:62.4261

You can see the POS 511 variant is filtered due to FRS: there is coverage on both the ref and the alt, whereas the same variant (POS 513) in the previous version has no coverage on the ref.

The second variant in those examples overlaps both the first and third variants.

This 513C>T variant is discovered by denovo in both cases - i.e. the reference PRG is effectively just the first two variants

In both cases I am using pandora v0.10.0-alpha.0

The make_prg command is:

For the v0.3.0 version (you can find all the drprg/pandora/make_prg files at /hps/nobackup/iqbal/mbhall/drprg/tmp/predict-ERR2510499)

make_prg from_msa -t 4 -L 5 -N 5 -v -o updated -i /hps/nobackup/iqbal/mbhall/drprg/tmp/predict-ERR2510499/update_msas

and for the v0.4.0 version (all files are at /hps/nobackup/iqbal/mbhall/drprg/paper/results/amr_predictions/drprg/illumina/PRJEB25972/SAMEA1101807/ERR2510499/)

make_prg from_msa -t 2 -L 5 -N 5 -v --force -o updated -i /hps/nobackup/iqbal/mbhall/drprg/paper/results/amr_predictions/drprg/illumina/PRJEB25972/SAMEA1101807/ERR2510499/update_msas

Let me know if you need any other info @leoisl

Version/fork confusion

Hello folks

Thanks for what looks like a very promising method. I'm attempting to create a PanRG from aligned genes from orthofinder across a ~30 sample pangenome (i.e. 30 annotated genome assemblies), so I can use pandora to genotype variation in several hundred short read libraries. I'm following the toy example in @leoisl's fork as linked from another issue here, but the version of make_prg in conda doesn't support those CLI options. Additionally, it looks like the latest version in this repo (0.2.0) was released after version 0.3.0 from @leoisl's fork from July, and both forks seem to have ongoing development.

Which fork and version of make_prg should I use? Is there a guide or docs similar to @leoisl toy example for the latest code in this repo? And does this repo support building the PRG for a series of MSA at once, a la --input msas/ from @leoisl's code?

Best,
Kevin

Vcf output

If I'm building a graph of n genomes of E.coli, is there an option in pandora to get the vcf of these genomes?

AssertionError: Each sequence should be in a cluster

Hey @leoisl!
I am running make_prg v0.4.0 on gene alignments for ~1400 E.coli and have consistently run into the bug below when make_prg is processing the attached alignment.

multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/hps/software/spack/opt/spack/linux-centos8-sandybridge/gcc-9.3.0/python-3.9.5-jtayjftvmku5dcg53v74ilyhipv6kvxi/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/hps/software/spack/opt/spack/linux-centos8-sandybridge/gcc-9.3.0/python-3.9.5-jtayjftvmku5dcg53v74ilyhipv6kvxi/lib/python3.9/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/subcommands/from_msa.py", line 114, in process_MSA
    builder = prg_builder.PrgBuilder(
  File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/prg_builder.py", line 42, in __init__
    self.root: RecursiveTreeNode = NodeFactory.build(alignment, self, None)
  File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 445, in build
    return MultiIntervalNode(
  File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 189, in __init__
    super().__init__(
  File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 53, in __init__
    self._children: List["RecursiveTreeNode"] = self._get_children(
  File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 130, in _get_children
    return [
  File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 131, in <listcomp>
    NodeFactory.build(alignment, self.prg_builder, self)
  File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 463, in build
    return MultiClusterNode(
  File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 217, in __init__
    super().__init__(
  File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 53, in __init__
    self._children: List["RecursiveTreeNode"] = self._get_children(
  File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 130, in _get_children
    return [
  File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 131, in <listcomp>
    NodeFactory.build(alignment, self.prg_builder, self)
  File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 463, in build
    return MultiClusterNode(
  File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 217, in __init__
    super().__init__(
  File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 53, in __init__
    self._children: List["RecursiveTreeNode"] = self._get_children(
  File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 130, in _get_children
    return [
  File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 131, in <listcomp>
    NodeFactory.build(alignment, self.prg_builder, self)
  File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 463, in build
    return MultiClusterNode(
  File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 217, in __init__
    super().__init__(
  File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 53, in __init__
    self._children: List["RecursiveTreeNode"] = self._get_children(
  File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 130, in _get_children
    return [
  File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 131, in <listcomp>
    NodeFactory.build(alignment, self.prg_builder, self)
  File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 463, in build
    return MultiClusterNode(
  File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 217, in __init__
    super().__init__(
  File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 53, in __init__
    self._children: List["RecursiveTreeNode"] = self._get_children(
  File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 130, in _get_children
    return [
  File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 131, in <listcomp>
    NodeFactory.build(alignment, self.prg_builder, self)
  File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/recursion_tree.py", line 453, in build
    clustering_result = kmeans_cluster_seqs(alignment, min_match_length)
  File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/from_msa/cluster_sequences.py", line 291, in kmeans_cluster_seqs
    assert len(alignment) == sum(
AssertionError: Each input sequence should be in a cluster
"""


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/hps/nobackup/iqbal/dander/amira_panRG/venv/bin/make_prg", line 8, in <module>
    sys.exit(main())
  File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/__main__.py", line 94, in main
    args.func(args)
  File "/hps/nobackup/iqbal/dander/amira_panRG/venv/lib/python3.9/site-packages/make_prg/subcommands/from_msa.py", line 183, in run
    pool.starmap(process_MSA, args, chunksize=1)
  File "/hps/software/spack/opt/spack/linux-centos8-sandybridge/gcc-9.3.0/python-3.9.5-jtayjftvmku5dcg53v74ilyhipv6kvxi/lib/python3.9/multiprocessing/pool.py", line 372, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/hps/software/spack/opt/spack/linux-centos8-sandybridge/gcc-9.3.0/python-3.9.5-jtayjftvmku5dcg53v74ilyhipv6kvxi/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
AssertionError: Each input sequence should be in a cluster
Traceback (most recent call last):
  File "/hps/nobackup/iqbal/dander/amira_panRG/panaroo_qc_panRGs/make_panRG_from_panaroo_qced.py", line 99, in <module>
    subprocess.run("make_prg from_msa -t 64 -i " + alignment_path + " -o horesh.card.panRG.qc.cov.mode.0.card.included", shell=True, check=True)
  File "/hps/software/spack/opt/spack/linux-centos8-sandybridge/gcc-9.3.0/python-3.9.5-jtayjftvmku5dcg53v74ilyhipv6kvxi/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'make_prg from_msa -t 64 -i mmseq2_out_cov_mode_0_card_supplemented_aligned -o horesh.card.panRG.qc.cov.mode.0.card.included' returned non-zero exit status 1.

evgS.aln.fas.gz

Ambiguous path production

I produced nested PRGs (--min_match_length=7, --max_nesting=5) from multiple sequence alignments of plasmodium genes and found that make_prg can produce ambiguous PRGs.

Here's an example I drew out by hand to understand

[hand-drawn PRG diagram]

_o and _c mean site opening and closing nodes, with a numbered ID before. 7T means 7 consecutive Ts.

The path CA9T can be produced going through either sites 33/34/35 or 33/36. This can be horrible for genotyping: if the dataset has that sequence, then you have to either say you can't genotype because of the ambiguity, or randomly choose one of the two possibilities. The latter could cause comparison issues across datasets. For now in gramtools I've decided to bail out in these cases and null genotype, with a FILTER flag set to AMBIG to signal this occurred.

Singularity container can't be run in snakemake pipelines

I tried to run the make_prg singularity container (latest) in a snakemake pipeline, but got an issue. Snakemake runs the container as follows (we necessarily need bash):

Command ' singularity exec --home <suppressed> bash -c 'set -euo pipefail;  python /home/leandro/git/myforks/pandora_analysis_pipeline/.snakemake/scripts/tmp_b7uof0h.run_make_prg.py'' returned non-zero exit status 255.

Error is:

FATAL:   "bash": executable file not found in $PATH

Upon further inspection of the container, indeed bash is not only missing from $PATH, it is not installed at all, which seems weird to me, as this line seems to install it: https://github.com/rmcolq/make_prg/blob/c00fc137456aa474f2f75dd23e1f10c08876dfe9/Singularity#L12

Header of summary.tsv file

Hey there,

Sorry if I missed this, but what are the column descriptions for the summary.tsv file?

Thanks,

Gavin

Input to make_prg

Does the make_prg tool take as input an alignment file of genomes or alignment files of genes?
I want to build a graph of E.coli

Make output fasta formatted

The current .prg output file is not in a "usable" file format. I find myself having to add a fasta header after running make_prg (which can be a pain in a pipeline).
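As a stop-gap, wrapping the .prg output in a FASTA record is simple (a hypothetical helper, not something make_prg ships; it assumes the .prg file contains just the linearised PRG string):

```python
from pathlib import Path


def prg_to_fasta(prg_path, fasta_path=None, width=80):
    """Wrap a linearised PRG string in a FASTA record, headed by the file stem."""
    prg_path = Path(prg_path)
    seq = prg_path.read_text().strip()
    # Wrap the sequence at `width` characters, as FASTA convention suggests
    body = "\n".join(seq[i : i + width] for i in range(0, len(seq), width))
    record = f">{prg_path.stem}\n{body}\n"
    if fasta_path is not None:
        Path(fasta_path).write_text(record)
    return record
```

Having make_prg emit this directly would remove the need for such pipeline glue.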

Add --update-method to make_prg update

Not for this next 0.4.0 release, likely for 0.5.0. Adds a new parameter to make_prg update:

  -m UPDATE_METHOD, --update-method UPDATE_METHOD
                   How to update PRGs. Valid options are (default is careful):
                         exact: updates the full PRG's MSA with the new alleles and rebuilds the whole PRG. This method guarantees that the
                                resulting PRG is equivalent to redoing the MSA and rebuilding the PRG from scratch, but it is the slowest;
                         quick: updates just the leaves of the PRG with the new alleles. Alleles that cross leaves are broken into several
                                contained sub-alleles. Although this is the quickest method, it is imprecise when adding long deletions;
                         careful: if at least one new allele crosses leaves, the PRG is updated following the exact method.
                                  Otherwise, it is updated following the quick method.

Important when we want to study longer variants.

Indels can cause spurious site production

This is linked to #15 as it leads to the production of ambiguous PRGs (multiple paths give rise to the same sequence).

The clustering code can conclude that there is no meaningful clustering of a set of sequences, e.g. it puts every sequence in a single cluster.

However, the code that calls clustering can re-run PRG building for sets of sequences that differ only in alignment (i.e. gap positioning), not in sequence (here).

This leads to spurious 'nested variants' and the following pathological example case:
msp6_ambig.pdf

Of the 4 paths between the nodes labeled 55 and 56, two are identical. This causes gramtools to mis-genotype down the line.

I have a fix that I'm implementing and will PR in

Repeats can cause underclustering

This is an interesting issue (but it broke me for a day).

Consider the following dummy alignment:

>1
TTTTTTTGGGGGGGAAAAAAATTTTTTT-------AAAAAAATTTTTTTAAAAAAA-------
>2
-------TTTTTTTAAAAAAATTTTTTTGGGGGGGAAAAAAATTTTTTT-------AAAAAAA
>3
TTTTTTTAAAAAAATTTTTTTAAAAAAATTTTTTT-------GGGGGGG-------AAAAAAA

If you take the Hamming distance between any pair, it's high, but if you count all the 7-mers for each, the kmer counts are identical.
(They are: {'TTTTTTT': 3.0, 'TTTTTTG': 1.0, 'TTTTTGG': 1.0, 'TTTTGGG': 1.0, 'TTTGGGG': 1.0, 'TTGGGGG': 1.0, 'TGGGGGG': 1.0, 'GGGGGGG': 1.0, 'GGGGGGA': 1.0, 'GGGGGAA': 1.0, 'GGGGAAA': 1.0, 'GGGAAAA': 1.0, 'GGAAAAA': 1.0, 'GAAAAAA': 1.0, 'AAAAAAA': 3.0, 'AAAAAAT': 2.0, 'AAAAATT': 2.0, 'AAAATTT': 2.0, 'AAATTTT': 2.0, 'AATTTTT': 2.0, 'ATTTTTT': 2.0, 'TTTTTTA': 2.0, 'TTTTTAA': 2.0, 'TTTTAAA': 2.0, 'TTTAAAA': 2.0, 'TTAAAAA': 2.0, 'TAAAAAA': 2.0})
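The claim is easy to verify (a quick sketch counting kmers in the ungapped sequences):

```python
from collections import Counter


def kmer_counts(seq, k=7):
    """Count all overlapping kmers of length k in seq."""
    return Counter(seq[i : i + k] for i in range(len(seq) - k + 1))


alignment = [
    "TTTTTTTGGGGGGGAAAAAAATTTTTTT-------AAAAAAATTTTTTTAAAAAAA-------",
    "-------TTTTTTTAAAAAAATTTTTTTGGGGGGGAAAAAAATTTTTTT-------AAAAAAA",
    "TTTTTTTAAAAAAATTTTTTTAAAAAAATTTTTTT-------GGGGGGG-------AAAAAAA",
]
# Strip the gaps, then compare the 7-mer count vectors
counts = [kmer_counts(s.replace("-", "")) for s in alignment]
print(counts[0] == counts[1] == counts[2])  # True: kmeans sees 3 identical points
```

Since the three count vectors are identical, kmeans has a single distinct data point to work with and cannot separate the sequences, despite their high pairwise Hamming distance.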

A real alignment wouldn't actually look like this; it's the simplest example I could come up with. In practice, I saw this happen in MSAs of 5,800 sequences of 23kb each around the EBA175 gene in P. falciparum. I.e., looking at some repeat-rich (Plasmodium ATs) subalignments, the algorithm decides they are not 'one-reference-like' and asks kmeans to cluster with an increasing number of clusters; but kmeans, looking at the kmers, finds a smaller number of distinct data points (due to identical kmer counts) and cannot even produce that many clusters.

This leads to empty clusters (due to another bug, which let empty clusters be produced in the above case), and then PRG construction fails.

Incomplete ambiguous dna expansion

I'm delving into make_prg some more while looking at interoperability with gramtools, and have found the following behaviour in the function get_interval_seqs:

  • If you give it the sequence 'RWAAT', it expands to {"GAAAT", "AAAAT", "GTAAT", "ATAAT"}
  • If you give it the sequence 'RRAAT', it expands to {"GGAAT", "AAAAT"}

The former is complete; the latter is not: it is missing GAAAT and AGAAT.

(R is ambiguous IUPAC for either G or A)
(Not showing a failing test case because the one I have made is in refactored code)

Is this deliberate?

I am assuming not, and have already made a fix which I'll PR in if I get confirmation.
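For reference, a complete expansion is just a cartesian product over the IUPAC codes (a sketch, not the actual get_interval_seqs):

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_ambiguous(seq):
    """Return every unambiguous sequence an IUPAC-coded sequence can denote."""
    return {"".join(bases) for bases in product(*(IUPAC[b] for b in seq))}


print(sorted(expand_ambiguous("RRAAT")))
# ['AAAAT', 'AGAAT', 'GAAAT', 'GGAAT']
```

Note that two Rs in one sequence expand independently (4 combinations), which is where the observed {"GGAAT", "AAAAT"} output falls short.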

Ignore `N`s that might have slipped through when updating PRGs

When running the 4-way pipeline, updating the E coli PRG with illumina data, I got this error:

Traceback (most recent call last):
  File "/usr/local/bin/make_prg", line 10, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/make_prg/__main__.py", line 94, in main
    args.func(args)
  File "/usr/local/lib/python3.9/site-packages/make_prg/subcommands/update.py", line 182, in run
    denovo_variants_db = DenovoVariantsDB(
  File "/usr/local/lib/python3.9/site-packages/make_prg/update/denovo_variants.py", line 548, in __init__
    locus_name_to_denovo_loci = self._get_locus_name_to_denovo_loci()
  File "/usr/local/lib/python3.9/site-packages/make_prg/update/denovo_variants.py", line 532, in _get_locus_name_to_denovo_loci
    return self._get_locus_name_to_denovo_loci_core(filehandler)
  File "/usr/local/lib/python3.9/site-packages/make_prg/update/denovo_variants.py", line 522, in _get_locus_name_to_denovo_loci_core
    variants = self._read_variants(
  File "/usr/local/lib/python3.9/site-packages/make_prg/update/denovo_variants.py", line 495, in _read_variants
    denovo_variant = cls._read_DenovoVariant(
  File "/usr/local/lib/python3.9/site-packages/make_prg/update/denovo_variants.py", line 477, in _read_DenovoVariant
    denovo_variant = DenovoVariant(
  File "/usr/local/lib/python3.9/site-packages/make_prg/update/denovo_variants.py", line 41, in __init__
    DenovoVariant._param_checking(
  File "/usr/local/lib/python3.9/site-packages/make_prg/update/denovo_variants.py", line 59, in _param_checking
    DenovoVariant._check_sequence_is_composed_of_ACGT_only(alt)
  File "/usr/local/lib/python3.9/site-packages/make_prg/update/denovo_variants.py", line 85, in _check_sequence_is_composed_of_ACGT_only
    raise DenovoError(f"Found a non-ACGT seq ({seq}) in a denovo variant")
make_prg.update.denovo_variants.DenovoError: Found a non-ACGT seq (N) in a denovo variant

There were 36,416 new variants found, and only a single one has an N in it. I'd much prefer to simply issue a warning here:

raise DenovoError(f"Found a non-ACGT seq ({seq}) in a denovo variant")
rather than erroring out and not being able to update.
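Something along these lines (a sketch of the proposed behaviour; the function and logger names are illustrative, not make_prg's actual API):

```python
import logging

logger = logging.getLogger("make_prg")


def sequence_is_ACGT_only(seq):
    """Warn (instead of raising DenovoError) on non-ACGT sequences, so the
    offending denovo variant can be skipped and the update can proceed."""
    if any(base not in "ACGT" for base in seq):
        logger.warning(f"Found a non-ACGT seq ({seq}) in a denovo variant, skipping it")
        return False
    return True
```

The caller would then drop the one bad variant and still apply the remaining ~36k.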

segfaulting when building from_msa

Here's my directory
/nfs/research/zi/zi/analysis/2023/20230528_lassafever_2019_study/LASVsequencing2019
and here's the command:

URI="docker://quay.io/iqballab/make_prg"

singularity exec "$URI" make_prg from_msa  -L 15 -n LASV -O pg --log outputs/L15/log.txt -f clustal clustalo-I20230528-213752-0692-38655046-p2m.clustal_num

output

WARNING: Cache disabled - cache location /hps/scratch/singularity is not writable.
INFO: Downloading library image to tmp cache: /tmp/sbuild-tmp-cache-454068105
INFO: Converting OCI blobs to SIF format
INFO: Starting build...
2023/05/28 22:42:40 info unpack layer: sha256:a0d0a0d46f8b52473982a3c466318f479767577551a53ffc9074c9fa7035982e
2023/05/28 22:42:41 info unpack layer: sha256:70ea1bcdfee87cea0d3be42b0218fb732db46342e43b5a987332d51f67d372cc
2023/05/28 22:42:48 info unpack layer: sha256:f04e8d230c4332c19a0210f818150e1fcb2320d7abc4a20893ed0eeeb2b94734
2023/05/28 22:42:48 info unpack layer: sha256:29253eb9da6161a7e8500dc26100f98d90600c2dd1cdd6742c420ab3e364610b
2023/05/28 22:42:48 info unpack layer: sha256:a319e2b90519afce6a9b453c97eedfdfa581e8753fc9c43e38c8e5b46c249346
2023/05/28 22:42:49 info unpack layer: sha256:d28652de1e2af1c965f40a025b973bf86a480a13cf69bc3d46c740051effa13f
2023/05/28 22:42:49 info unpack layer: sha256:ad75af7e4a07bbdf41b8b262fcf2c660e585f80595ea89c3e2c0ca082bad494b
2023/05/28 22:42:49 info unpack layer: sha256:70797a5213a6145868be3b8b69eaa76d5ab8b045277d50dd423afdd381004866
INFO: Creating SIF file...
Segmentation fault

cat outputs/L15/log.txt
INFO 28/05/2023 10:30:28 Input parameters max_nesting: 5, min_match_length: 15
INFO 28/05/2023 10:30:28 Read from MSA file clustalo-I20230528-213752-0692-38655046-p2m.clustal_num
INFO 28/05/2023 10:43:46 Input parameters max_nesting: 5, min_match_length: 15
INFO 28/05/2023 10:43:46 Read from MSA file clustalo-I20230528-213752-0692-38655046-p2m.clustal_num
INFO 28/05/2023 10:43:47 match intervals: []; non_match intervals: [[0, 3202]]
