ivan-krukov / aligning-genealogies Goto Github PK

View Code? Open in Web Editor NEW

2.0 2.0 0.0 17.59 MB

The genealogy-coalescent alignment project

Python 3.32% Jupyter Notebook 96.68% Shell 0.01%

aligning-genealogies's People

Contributors

Stargazers

Watchers

aligning-genealogies's Issues

Add harmonization step to Aligner objects

Aligner objects as currently implemented only map nodes in the tree sequence to nodes in the pedigree. This can be done either iteratively or in one step. One thing that's missing is "harmonization" or "sanity checking" for greedy algorithms. The goal of the harmonization step is to make sure that the mappings make sense and are consistent with other information that we have, e.g.:

If an internal node n in the tree sequence is matched to a founder node in the pedigree, then it's likely that all or most of the predecessors of n are out-of-pedigree nodes.
For an aligned pair (n_ts, n_ped), make sure that their successors and predecessors preserve their time-ordering. For example, no predecessor of n_ts should map to a successor of n_ped and vice versa.

After these sanity checks are implemented and used to correct the mapping (if possible), then one final thing that can be added is the pedigree node to tree-sequence edge mapping:

Given the assignment of tree sequence nodes to pedigree nodes, we need to find the most likely mapping of remaining pedigree nodes to tree sequence edges. This isn't straightforward, but as a first step we can do it using some simple heuristics.

Project structure

We want a coherent structure for a project, so that examples, figures, data, and workhorse code are neatly partitioned. Something similar to the R package structure would be ideal.

Visualization

Currently, we are using the code from networkx to plot the pedigrees. It calls out to graphviz, an external tool. I think it would be nice to have 2 things:

SVG export
Text visualization

Write tree sequence to genealogy alignment object

This would be used for evaluation.

`Genealogical.predecessors` no longer has `include_self`

Running the Experiments notebook, I am gitting the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-4d9aa12d1c2f> in <module>
      1 algn = DescMatchingAligner(gg, ct)
----> 2 algn.align()
      3 algn.evaluate()

~/work/genealogies/genealogy_aligner/alignment.py in align(self)
     60     def align(self):
     61 
---> 62         ped_ntp = self.ped.get_probands_under(climb_up_step=self.climb_up_step)
     63         ts_ntp = self.ts.get_probands_under(climb_up_step=self.climb_up_step)
     64 

~/work/genealogies/genealogy_aligner/Genealogical.py in get_probands_under(self, nodes, climb_up_step)
    122             ntp[n] = set()
    123 
--> 124             base_set = self.predecessors(n, climb_up_step, include_self=True)
    125             n_set = []
    126             for ns in base_set:

TypeError: predecessors() got an unexpected keyword argument 'include_self'

There is the include_founders parameter. Is that equivalent for the discrete matching algorithm?

Drawing for HaplotypeGraph

The draw method of the HaplotypeGraph class is currently broken.
We need a new method that properly represents the haplotypes and the individuals to whom those haplotypes belong. Issues with current representation:

Currently, nodes are laid out to minimize crossings, which leads to separating haplotypes from the same individual. Therefore, we need to figure out a layout that doesn't do that.
The method should accommodate various types of HaplotypeGraphs (e.g. autosomal, mitochondrial, etc.).
Ideally, this method should be moved to the Drawing script.

Add focused view to .draw() methods

Incorporate the ability to draw a limited subset of nodes (and their surroundings) from pedigrees/tree sequences. This should help for visualizing segments of large pedigrees without cluttering the screen.

Working idea:
Given a list of nodes node_list, draw only nodes the are up to k hops away from nodes in node_list.

Simulations should assign sex to individuals

Pedigree.simulate_from_founders() should assign sex to each individual. This is necessary for Pedigree.to_table() to work correctly, as it relies on the sex parameter for each node to put the ID into the mother or father column.

Implement Ancestral concordance metric

Look at whether we selected the correct ancestors for the sample
e.g. proportion of the times that the correct founder is identified for a sampled individual.

Do not inherit directly from nx.DiGraph

At the current stage, the Genealogical class inherits from networkx.DiGraph. When we use nx methods on the genealogy classes, it will invoke the constructor of __class__ without extra arguments. This means that at present the instance attributes of the genealogies are updated manually.

Ideally, we want to have the digraph as a member of the Genealogical class. But this will mean that some code would need to be refactored to call self.graph.

Add kinship aligner into the `Aligner` interface

The current state of hacking on the kinship aligner is in the scripts/agent_example.py. This should be integrated with the Aligner class for a straightforward interface.

Link with msprime simulations, so that we can directly simulate+ evaluate

Update sample_path to account for ploidy

The .sample_path() method in the Pedigree class currently samples a path for 1 haplotype from the probands. To have more realistic simulations, ideally we would like to incorporate paths for 2 haplotypes as well.

This will involve tracing random paths through the father's and the mother's side simultaneously.

Unify logic for dealing with ploidy

The codebase currently has 2 strategies for dealing with ploidy in Pedigrees:

[1] Specify ploidy at the stage of path-sampling, which involves:
[1.1] Assigning number of specified "ploids" to each individual.
[1.2] Sampling a path through those ploids.
[1.3] Adding a flag to the traversal object so we know its ploidy.

Advantages:

In principle, it can accommodate any ploidy level (1, 2, 4, ...).
It has data structures (e.g. Haplotype) to assign values and attributes to individual ploids (e.g. IBD segment).
Seamlessly integrates with the functionality from Pedigree class.

Disadvantages:

More messy to deal with in the Aligner object. We will need to check that tree sequence to pedigree node assignment doesn't happen more than ploidy level.

/// Alternative to this formulation ///

As an alternative, we can add an attribute to the Pedigree class self.diploid_graph, which we can initialize in path sampling. The only difficulty with this alternative is how to access the Genealogical helper methods, such as .predecessors(), .successors() and .iter_nodes().

[2] Create a DiploidGraph class that inherits from Pedigree
[2.1] Takes an initialized Pedigree and converts it into a DiploidGraph.

Advantages:

Simpler and more elegant way to carry out the alignment for diploids.

Disadvantages:

Accommodates only haploids/diploids.
Because it inherits from Pedigree class, some of the methods may break for the diploid or not make sense in this context.

/// Alternative to this formulation ///

Do not inherit from Pedigree. Just set the Pedigree to be an attribute of the DiploidGraph object. This would mean duplicating the .sample_path() method, but it should be fine.

Any thoughts?

`Pedigree.sample_path` does not work

Trying to sample a haploid path:

ped = Pedigree.from_balsac_table('data/test/simple.tsv')
tr = ped.sample_path(ploidy=1)

Leads to an error in get_parents:

UnboundLocalError                                      Traceback (most recent call last)
<ipython-input-374-5e6e17e6d5d4> in <module>
----> 1 ped.sample_path()

~/work/genealogies/genealogy_aligner/Pedigree.py in sample_path(self, probands, ploidy)
    534             for hap_obj, parent in zip(
    535                     hap_struct[n],
--> 536                     rnd.choice(self.get_parents(n), 2, replace=False)):
    537 
    538                 # If the parent is a node in the pedigree:

~/work/genealogies/genealogy_aligner/Pedigree.py in get_parents(self, n)
    169                 father, mother = rnd.choice(pred, 2, replace=False)
    170 
--> 171         return father, mother
    172 
    173 

UnboundLocalError: local variable 'father' referenced before assignment

Apropos: what are some "standard" paths through our code? We should document and test them

Add detailed documentation to methods in main classes

Add detailed documentation (ideally with examples) to methods in:

Genealogical
Pedigree
Traversal
Aligner

Test everything

I have been in the process of writing some tests for the Pedigree class. I have to say, that the process has been quite illuminating.

Todo:
[ ] add the test harness to setup.py
[ ] add test coverage through a pytest plugin

Simulate with example BALSAC data

We want to know how much relatedness information a genealogy contains. We have a sample BALSAC genealogy with 140 probands. One way to answer this is to count the number of coalescent events within a genealogy. This can either be done exhaustively or through simulation.

We can simulate a gene genealogy on a pedigree using Dom's fork of msprime. There is a simple example in: msprime_example.py and there is a more complete example in Dom's repository: pedigree_simulate.py.

The output of simulation is an tskit tree sequence. For every tree in the sequence, we want to know the number of coalescent events. The more coalescent events the better. We might also be interested in the distribution of coalescent events per generation.

Aligner should handle node to edge matchings

-Figure out data structure for node to edge matchings, and implement in Aligner

Incorporate inferred tree sequences to the software

The ultimate goal of this project is to align inferred tree sequences to the pedigree graph. Therefore, we need to create tools to infer tree sequences from genetic samples.

I talked to @kobrica and she will work on providing this functionality. This task is composed of 3 steps.

Given genetic samples (e.g. genotype matrix), infer the corresponding tree sequences using tsinfer (tutorial).
Convert the inferred tree sequences to Traversal objects.
Given msprime simulation results, obtain the corresponding inferred tree sequences. This will involve 2 steps: First, obtain the genetic samples from the msprime simulation, and then run tsinfer on those samples, as in the tutorial above.

Diploid visualization

Visualize diploid pedigree graph

Specific updates to visualization methods

I have a few specific suggestions for the visualization methods that we currently have:

Pedigree.draw(): A good first step would be to leverage sex information (when available) and convert females {'sex': 2} to circles instead of squares.
Genealogical.draw(): It would be good to allow users to pass a dictionary of colors for specific node IDs. This could help highlighting things in the pedigrees and tree sequences and would definitely be useful for iterative algorithms.
Add .draw() method to Aligner: This will be tough to get right, especially for dense pedigrees, but it would be great to have a drawing of the pedigree and the tree sequence next to each other with dashed arrows between them showing inferred or true correspondences.

Implement greedy climbing

We want to have an agent-based climbing approach for aligning pedigrees and trees.

This should include matching of pedigree nodes to coalescent edges.

Matching IDs in genealogy and the tree sequence

msprime uses the minimal integers to label the nodes. We use the IDs provided in the genealogy to establish node identity. To start the alignment, we need to match our IDs with msprime IDs

Convert traversal to tree sequences

We should be able to convert the simulated traversals into msprime tree sequences. This will allow us to do two things:

Complete the simulations once we run out of Pedigree nodes
Verify that our alignments will work with ts format down the road.

Time-aware Pedigree climbling

If probands are found at different generations (e.g. some early individuals did not leave offspring), we need to be careful when climbing up the Pedigree, so we don't double count the nodes.

For example, if we have the following Pedigree:

Probands are 7 and 8. For example, 7 climbs path 7-3-1, and 8: 8-6-3-2. This will lead to a undesired fork in the resulting tree. Instead, we want the path climbing from 7 to "wait" until 8 catches up. This can be done if we only process nodes at a given backward depth.

This is already done correctly in Pedigree.sample_haploid_path, but we want to make sure it's implemented everywhere. Perhaps another iterator, like Genealogical.iter_edges?

Incorporate path completion to greedy Aligners

We need to figure out a number of heuristics to reconstruct the full traversal trajectory of a particular haplotype for the greedy algorithms. The greedy algorithms that we have implemented so far leave a lot of nodes unmatched. These nodes are often intermediate and are subsumed into the edges connecting the tree sequence nodes.

I already implemented a simple heuristic that works in a limited number of cases (e.g. skipped parents). The code is in the .complete_paths() method of MatchingAligner class.

Heuristic (1): If a pedigree node is sandwiched between 2 matched nodes, and there's an edge between those 2 nodes in the tree sequence, then assign the pedigree node to that edge.

I also added the ped_node_to_ts_edge mappings to our evaluation metrics.

Feel free to contribute ideas or implementations.

Re-write `get_probands_under` method

After discussion with @LukeAndersonTrocme, I found that the get_probands_under method of the Genealogical class is inefficient for large genealogies, as it repeats lots of redundant computations. There's a possibility to re-write it so that it runs in O(n) time.

Basic idea: Start from the probands, and progressively pass the information up the pedigree.
Base case: n is a proband --> Set their probands_under value to set(n)
General case: n is not a proband --> Set their probands_under value to set.union(*[child.probands_under for child in self.successors(n)])

Plus: Keep track of the number of paths between a given proband and the ancestor.

ivan-krukov / aligning-genealogies Goto Github PK

aligning-genealogies's People

Contributors

Stargazers

Watchers

aligning-genealogies's Issues

Recommend Projects

Recommend Topics

Recommend Org