ivan-krukov / aligning-genealogies Goto Github PK
View Code? Open in Web Editor NEWThe genealogy-coalescent alignment project
The genealogy-coalescent alignment project
Aligner objects as currently implemented only map nodes in the tree sequence to nodes in the pedigree. This can be done either iteratively or in one step. One thing that's missing is "harmonization" or "sanity checking" for greedy algorithms. The goal of the harmonization step is to make sure that the mappings make sense and are consistent with other information that we have, e.g.:
n
in the tree sequence is matched to a founder node in the pedigree, then it's likely that all or most of the predecessors of n
are out-of-pedigree nodes.n_ts
, n_ped
), make sure that their successors and predecessors preserve their time-ordering. For example, no predecessor of n_ts
should map to a successor of n_ped
and vice versa.After these sanity checks are implemented and used to correct the mapping (if possible), then one final thing that can be added is the pedigree node to tree-sequence edge mapping:
We want a coherent structure for a project, so that examples, figures, data, and workhorse code are neatly partitioned. Something similar to the R package structure would be ideal.
Currently, we are using the code from networkx to plot the pedigrees. It calls out to graphviz
, an external tool. I think it would be nice to have 2 things:
This would be used for evaluation.
Running the Experiments notebook, I am gitting the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-3-4d9aa12d1c2f> in <module>
1 algn = DescMatchingAligner(gg, ct)
----> 2 algn.align()
3 algn.evaluate()
~/work/genealogies/genealogy_aligner/alignment.py in align(self)
60 def align(self):
61
---> 62 ped_ntp = self.ped.get_probands_under(climb_up_step=self.climb_up_step)
63 ts_ntp = self.ts.get_probands_under(climb_up_step=self.climb_up_step)
64
~/work/genealogies/genealogy_aligner/Genealogical.py in get_probands_under(self, nodes, climb_up_step)
122 ntp[n] = set()
123
--> 124 base_set = self.predecessors(n, climb_up_step, include_self=True)
125 n_set = []
126 for ns in base_set:
TypeError: predecessors() got an unexpected keyword argument 'include_self'
There is the include_founders
parameter. Is that equivalent for the discrete matching algorithm?
The draw
method of the HaplotypeGraph
class is currently broken.
We need a new method that properly represents the haplotypes and the individuals to whom those haplotypes belong. Issues with current representation:
HaplotypeGraphs
(e.g. autosomal, mitochondrial, etc.).Drawing
script.Incorporate the ability to draw a limited subset of nodes (and their surroundings) from pedigrees/tree sequences. This should help for visualizing segments of large pedigrees without cluttering the screen.
Working idea:
Given a list of nodes node_list
, draw only nodes the are up to k
hops away from nodes in node_list
.
Pedigree.simulate_from_founders()
should assign sex to each individual. This is necessary for Pedigree.to_table()
to work correctly, as it relies on the sex
parameter for each node to put the ID into the mother
or father
column.
Look at whether we selected the correct ancestors for the sample
e.g. proportion of the times that the correct founder is identified for a sampled individual.
At the current stage, the Genealogical
class inherits from networkx.DiGraph
. When we use nx
methods on the genealogy classes, it will invoke the constructor of __class__
without extra arguments. This means that at present the instance attributes of the genealogies are updated manually.
Ideally, we want to have the digraph
as a member of the Genealogical
class. But this will mean that some code would need to be refactored to call self.graph
.
The current state of hacking on the kinship aligner is in the scripts/agent_example.py
. This should be integrated with the Aligner
class for a straightforward interface.
The .sample_path()
method in the Pedigree
class currently samples a path for 1 haplotype from the probands. To have more realistic simulations, ideally we would like to incorporate paths for 2 haplotypes as well.
This will involve tracing random paths through the father's and the mother's side simultaneously.
The codebase currently has 2 strategies for dealing with ploidy in Pedigrees:
[1] Specify ploidy at the stage of path-sampling, which involves:
[1.1] Assigning number of specified "ploids" to each individual.
[1.2] Sampling a path through those ploids.
[1.3] Adding a flag to the traversal object so we know its ploidy.
Advantages:
Haplotype
) to assign values and attributes to individual ploids (e.g. IBD segment).Pedigree
class.Disadvantages:
Aligner
object. We will need to check that tree sequence to pedigree node assignment doesn't happen more than ploidy level./// Alternative to this formulation ///
As an alternative, we can add an attribute to the Pedigree
class self.diploid_graph
, which we can initialize in path sampling. The only difficulty with this alternative is how to access the Genealogical
helper methods, such as .predecessors()
, .successors()
and .iter_nodes()
.
[2] Create a DiploidGraph
class that inherits from Pedigree
[2.1] Takes an initialized Pedigree
and converts it into a DiploidGraph
.
Advantages:
Disadvantages:
Pedigree
class, some of the methods may break for the diploid or not make sense in this context./// Alternative to this formulation ///
Do not inherit from Pedigree
. Just set the Pedigree
to be an attribute of the DiploidGraph
object. This would mean duplicating the .sample_path()
method, but it should be fine.
Any thoughts?
Trying to sample a haploid path:
ped = Pedigree.from_balsac_table('data/test/simple.tsv')
tr = ped.sample_path(ploidy=1)
Leads to an error in get_parents
:
UnboundLocalError Traceback (most recent call last)
<ipython-input-374-5e6e17e6d5d4> in <module>
----> 1 ped.sample_path()
~/work/genealogies/genealogy_aligner/Pedigree.py in sample_path(self, probands, ploidy)
534 for hap_obj, parent in zip(
535 hap_struct[n],
--> 536 rnd.choice(self.get_parents(n), 2, replace=False)):
537
538 # If the parent is a node in the pedigree:
~/work/genealogies/genealogy_aligner/Pedigree.py in get_parents(self, n)
169 father, mother = rnd.choice(pred, 2, replace=False)
170
--> 171 return father, mother
172
173
UnboundLocalError: local variable 'father' referenced before assignment
Apropos: what are some "standard" paths through our code? We should document and test them
Add detailed documentation (ideally with examples) to methods in:
Genealogical
Pedigree
Traversal
Aligner
I have been in the process of writing some tests for the Pedigree
class. I have to say, that the process has been quite illuminating.
Todo:
[ ] add the test harness to setup.py
[ ] add test coverage through a pytest
plugin
We want to know how much relatedness information a genealogy contains. We have a sample BALSAC genealogy with 140 probands. One way to answer this is to count the number of coalescent events within a genealogy. This can either be done exhaustively or through simulation.
We can simulate a gene genealogy on a pedigree using Dom's fork of msprime. There is a simple example in: msprime_example.py and there is a more complete example in Dom's repository: pedigree_simulate.py.
The output of simulation is an tskit
tree sequence. For every tree in the sequence, we want to know the number of coalescent events. The more coalescent events the better. We might also be interested in the distribution of coalescent events per generation.
-Figure out data structure for node to edge matchings, and implement in Aligner
The ultimate goal of this project is to align inferred tree sequences to the pedigree graph. Therefore, we need to create tools to infer tree sequences from genetic samples.
I talked to @kobrica and she will work on providing this functionality. This task is composed of 3 steps.
tsinfer
(tutorial).Traversal
objects.msprime
simulation results, obtain the corresponding inferred tree sequences. This will involve 2 steps: First, obtain the genetic samples from the msprime
simulation, and then run tsinfer
on those samples, as in the tutorial above.Visualize diploid pedigree graph
I have a few specific suggestions for the visualization methods that we currently have:
Pedigree.draw()
: A good first step would be to leverage sex information (when available) and convert females {'sex': 2}
to circles instead of squares.Genealogical.draw()
: It would be good to allow users to pass a dictionary of colors for specific node IDs. This could help highlighting things in the pedigrees and tree sequences and would definitely be useful for iterative algorithms..draw()
method to Aligner
: This will be tough to get right, especially for dense pedigrees, but it would be great to have a drawing of the pedigree and the tree sequence next to each other with dashed arrows between them showing inferred or true correspondences.We want to have an agent-based climbing approach for aligning pedigrees and trees.
This should include matching of pedigree nodes to coalescent edges.
msprime
uses the minimal integers to label the nodes. We use the IDs provided in the genealogy to establish node identity. To start the alignment, we need to match our IDs with msprime
IDs
We should be able to convert the simulated traversals into msprime tree sequences. This will allow us to do two things:
If probands are found at different generations (e.g. some early individuals did not leave offspring), we need to be careful when climbing up the Pedigree, so we don't double count the nodes.
For example, if we have the following Pedigree:
Probands are 7
and 8
. For example, 7
climbs path 7-3-1
, and 8
: 8-6-3-2
. This will lead to a undesired fork in the resulting tree. Instead, we want the path climbing from 7
to "wait" until 8
catches up. This can be done if we only process nodes at a given backward depth.
This is already done correctly in Pedigree.sample_haploid_path
, but we want to make sure it's implemented everywhere. Perhaps another iterator, like Genealogical.iter_edges
?
We need to figure out a number of heuristics to reconstruct the full traversal trajectory of a particular haplotype for the greedy algorithms. The greedy algorithms that we have implemented so far leave a lot of nodes unmatched. These nodes are often intermediate and are subsumed into the edges connecting the tree sequence nodes.
I already implemented a simple heuristic that works in a limited number of cases (e.g. skipped parents). The code is in the .complete_paths()
method of MatchingAligner
class.
Heuristic (1): If a pedigree node is sandwiched between 2 matched nodes, and there's an edge between those 2 nodes in the tree sequence, then assign the pedigree node to that edge.
I also added the ped_node_to_ts_edge mappings to our evaluation metrics.
Feel free to contribute ideas or implementations.
After discussion with @LukeAndersonTrocme, I found that the get_probands_under
method of the Genealogical class is inefficient for large genealogies, as it repeats lots of redundant computations. There's a possibility to re-write it so that it runs in O(n)
time.
Basic idea: Start from the probands, and progressively pass the information up the pedigree.
Base case: n
is a proband --> Set their probands_under
value to set(n)
General case: n
is not a proband --> Set their probands_under
value to set.union(*[child.probands_under for child in self.successors(n)])
Plus: Keep track of the number of paths between a given proband and the ancestor.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.