Giter VIP home page Giter VIP logo

Comments (13)

TeresaPegan avatar TeresaPegan commented on September 14, 2024

Just a note to say that this would be extremely useful!
If anyone else is looking for information on this before an official document comes out, I think I was able to figure out how to do this by looking at the text files attached to this discussion on slim-discuss: https://groups.google.com/forum/#!topic/slim-discuss/W8wEYj9UNnw

from pyslim.

petrelharp avatar petrelharp commented on September 14, 2024

Thanks! If you want to draft something for the docs I would happily help get it in there (crediting you of course), and add you to the author list.

from pyslim.

TeresaPegan avatar TeresaPegan commented on September 14, 2024

First maybe I should check whether I have this right! I think I mostly understand what’s going on, but as I think about this more, there is one thing I am getting confused about: why does msprime seem to need the final number in “positions” to be the length of the genome, not the final genome index?

By imitating the text files in the above-linked slim-discuss post, I’ve found that this works:

  • For an example 20-base genome where recombination rate switches every 5 bases, SLiM can use a text file that says this, where the first row is positions (ends) and the second is rates:
    4, 9, 14, 19
    1.0e-8, 1.0e-9, 1.0e-8, 1.0e-9

  • And msprime can use a text file that says:
    0, 4, 9, 14, 20
    1.0e-8, 1.0e-9, 1.0e-8, 1.0e-9, 0.0

The reason why msprime “positions” must start with 0 makes sense: In SLiM, a position or "end" indicates the end of a recombination block such that its associated "rate" applies to everything to the left of that end. In msprime, the manual says:

Given an index j in these lists, the rate of recombination per base per generation is rates[j] over the interval positions[j] to positions[j + 1]. Consequently, the first position must be zero, and by convention the last rate value is also required to be zero (although it is not used).

Which means that positions for msprime are both starts and ends. As a consequence, msprime needs a vector of positions that is 1 longer than what you give SLiM, and msprime also needs 1 fewer rates than it has positions, but you just add the 0.0 on at the end of the rates vector “by convention.”

However, I don’t understand why msprime needs the positions vector to end on 20 instead of 19. When I give msprime the same final genome position as SLiM, I get the error

ValueError: The simulated sequence length must be the same as from_ts.sequence_length

To avoid the error, I have to make my final genome position be 1 greater than whatever I gave SLiM - i.e., it has to be the length of the sequence, which is the final index + 1 in SLiM's 0-based world. I noticed this was also done by the slim-discuss poster linked above.

Why is that?

I'd be happy to send whatever code I end up using once I understand this, if that's helpful for an eventual document.

Thanks!
-Teresa

from pyslim.

petrelharp avatar petrelharp commented on September 14, 2024

why does msprime seem to need the final number in “positions” to be the length of the genome, not the final genome index?

Good question. This is because intervals in tskit (which msprime uses) are "closed on the left and open on the right", which means that the genomic interval from 0.0 to 100.0 includes 0.0 but does not include 100.0. If SLiM has a final position of 99, then it could have mutations occurring at position 99; this would not be legal if we set the tskit sequence length to 99, since the position 99 would be outside of the interval from 0 to 99. So, in SLiM when we record tree sequences, we use the last position plus one - ie, the length of the genome - as the rightmost coordinate, and hence the sequence_length of the resulting tree sequence.

This ought to be explained somewhere, but I'm not managing to find where it's explained at the moment.

from pyslim.

TeresaPegan avatar TeresaPegan commented on September 14, 2024

Great, thanks for the explanation. I think that it's actually pretty simple then. Here is something you could use as a draft for a document (some of this is just copied and pasted from the above posts). I don't feel strongly about whether I'm credited in it, I just appreciate that you helped clear it up for me, thanks!

If you want to take a SLiM recombination map and use it in msprime, there are 3 steps:

  1. add a 0 at the beginning of the positions
  2. add a 0 at the end of the rates
  3. take the final value in "positions" and add 1 to it

Or do the inverse if you have a map for msprime and you want to use it in SLiM. In both cases it should be easy to use the same recombination map text file for both programs with just a little bit of code manipulation in one or the other of them.

The reason why msprime “positions” must start with 0 (step 1) is that in SLiM, a position or "end" indicates the end of a recombination block such that its associated "rate" applies to everything to the left of that end. In msprime, the manual says:

Given an index j in these lists, the rate of recombination per base per generation is rates[j] over the interval positions[j] to positions[j + 1]. Consequently, the first position must be zero, and by convention the last rate value is also required to be zero (although it is not used).

This means that positions for msprime are both starts and ends. As a consequence, msprime needs a vector of positions that is 1 longer than what you give SLiM, and msprime also needs 1 fewer rates than it has positions, but you just add the 0 on at the end of the rates vector “by convention" (step 2).

The reason for step 3 is that intervals in tskit (which msprime uses) are "closed on the left and open on the right", which means that the genomic interval from 0.0 to 100.0 includes 0.0 but does not include 100.0. If SLiM has a final position of 99, then it could have mutations occurring at position 99; this would not be legal if we set the tskit sequence length to 99, since the position 99 would be outside of the interval from 0 to 99. So, in SLiM when we record tree sequences, we use the last position plus one - ie, the length of the genome - as the rightmost coordinate.

Code to take a recombination map used in SLiM and use it for recapitation in msprime:

import msprime, pyslim
import numpy as np
ts = pyslim.load("mytree.trees")
with open('recombinationmap.txt','r') as file:
    positions = [int(element) for element in file.readline().rstrip().split(',')]
    rates = [float(element) for element in file.readline().rstrip().split(',')]

# to make this recombination map for SLiM usable with msprime:
# step 1   
positions.insert(0, 0) 
# step 2
rates.append(0.0)
# step 3
positions[-1] = positions[-1]+1

recmap = msprime.RecombinationMap(positions,rates)
rts = ts.recapitate(recombination_map=recmap, Ne=200)

Note: This code works to the extent that it did not cause any errors or downstream problems for me, but I am otherwise not sure how to validate that the recapitation worked as expected.

from pyslim.

petrelharp avatar petrelharp commented on September 14, 2024

I've put this into a new section (under Tutorial/recapitation) - see what you think? Suggestions welcome.

from pyslim.

TeresaPegan avatar TeresaPegan commented on September 14, 2024

Thanks, I think it looks good!
I have been using this code and it has mostly been working well, but today it's started sometimes giving me this error:

_tskit.LibraryError: Bad edge interval where right <= left

This happens inconsistently, sometimes occurring and sometimes not with repeated runs of the same script (I have been re-running things repeatedly trying to figure it out).

I think it has to do with the fact that I am using variable recombination rates to simulate independent chromosomes: I have some intervals of length=1 where recombination rate = 0.5 to simulate the ends of chromosomes. I found this tskit/msprime issue about the same error from someone in a similar situation: tskit-dev/msprime#278
And this is relevant to this note in the msprime manual:

Warning
The chromosome is divided into num_loci regions of equal recombination distance, between which recombinations occur. This means that if recombination is constant and the genome is length n, then num_loci=n will produce breakpoints only at integer values. If recombination rate is not constant, breakpoints will still only occur at n distinct positions, but these will probably not coincide with the positions provided.

The suggested solution from the msprime issue is to do this:

num_loci = 10**8 + 1
mid = num_loci // 2

rm = msprime.RecombinationMap(
        [0.0, mid, mid + 1, num_loci],
        [1e-08, 0.5, 1e-08, 0.0],
        num_loci=num_loci)

Where the user has a 1e8 length genome and wants to break it into 2 chromosomes. I am doing something similar except I have more chromosomes, and I'd rather keep using my text file to read positions and rates if possible (partly because it's nice to be able to use the same file with SLiM and know I'm getting the same positions and rates in recapitation as in my simulation). I've pasted my positions and rates below in case that's helpful. I tried adding the num_loci parameter, using the final value in "positions" as num_loci, as was done in the solution from the msprime issue. But then I get this error when I try to recapitate:

_msprime.InputError: The specified recombination map is cannot translate the coordinatesfor the specified tree sequence. It is either too coarse (num_loci is too small) or contains zero recombination rates. Please either increase the number of loci or recombination rate

I'm having trouble understanding the warning from the msprime manual and what exactly what is causing these errors - do you have any insights?
I could also ask on the msprime github but I've posted this here partly because it is probably worth adding a note about this to the new Tutorial/recapitation section for anyone hoping to use a variable recombination map to simulate multiple chromosomes.

Thanks again!

positions and rates:

49999999,50000000,99999999,100000000,149999999,150000000,199999999,200000000,249999999,250000000,299999999,300000000,349999999,350000000,399999999,400000000,449999999,450000000,499999999,500000000,549999999,550000000,599999999,600000000,649999999,650000000,699999999,700000000,749999999,750000000,799999999,800000000,849999999,850000000,899999999,900000000,949999999,950000000,999999999
0.000000028,0.5,0.000000028,0.5,0.000000028,0.5,0.000000028,0.5,0.000000028,0.5,0.000000028,0.5,0.000000028,0.5,0.000000028,0.5,0.000000028,0.5,0.000000028,0.5,0.000000028,0.5,0.000000028,0.5,0.000000028,0.5,0.000000028,0.5,0.000000028,0.5,0.000000028,0.5,0.000000028,0.5,0.000000028,0.5,0.000000028,0.5,0.000000028

from pyslim.

petrelharp avatar petrelharp commented on September 14, 2024

Hello! Ok, let's see. All this will be better with the next release of msprime, which will have a discrete recombination map (so that recombination only occurs at integers, as in SLiM). (Also, that terrible error message has been fixed.) But, we can make this work in the meantime - but, I can't reproduce your problem? With msprime 0.7.5, this works for me:

positions = [0, 49999999,50000000,99999999,100000000,149999999,150000000,199999999,200000000,249999999,250000000,299999999,300000000,349999999,350000000,399999999,400000000,449999999,450000000,499999999,500000000,549999999,550000000,599999999,600000000,649999999,650000000,699999999,700000000,749999999,750000000,799999999,800000000,849999999,850000000,899999999,900000000,949999999,950000000,999999999]
rates = [0.000000028,0.5,0.000000028,0.5,0.000000028,0.5,0.000000028,0.5,0.000000028,0.5,0.000000028,0.5,0.000000028,0.5,0.000000028,0.5,0.000000028,0.5,0.000000028,0.5,0.000000028,0.5,0.000000028,0.5,0.000000028,0.5,0.000000028,0.5,0.000000028,0.5,0.000000028,0.5,0.000000028,0.5,0.000000028,0.5,0.000000028,0.5,0.000000028, 0.0]


rm = msprime.RecombinationMap(
        positions = positions,
        rates = rates,
        num_loci = positions[-1] + 1)

... could you post the code that produces that InputError about the recombination map?

I don't know if this will eliminate the "Bad edge interval" error, though - hopefully? If it does not fix the problem, we'll think more carefully about how to set the map.

from pyslim.

TeresaPegan avatar TeresaPegan commented on September 14, 2024

Hi,
Yes, here's code that's giving me that Bad Interval error. I am using msprime 0.7.4 (I seem not to be able to install 0.7.5 with conda).
Note: the error does not occur every time. I would guess that it happens about 50% of the time I use this code.

import msprime, pyslim
import numpy as np

ts = pyslim.load("10July/10July_7k_simple_stage4.trees")
 	
with open('recmap1e9.txt','r') as file:
    positions = [int(element) for element in file.readline().rstrip().split(',')]
    rates = [float(element) for element in file.readline().rstrip().split(',')]

positions.insert(0, 0) 
rates.append(0)
positions[-1] = positions[-1]+1

rm = msprime.RecombinationMap(positions,rates)

rts = ts.recapitate(recombination_map=rm, Ne=200)

Output:

There are 11569 individuals alive from the final generation.
Traceback (most recent call last):
  File "10July/10July_7k_simple.py", line 30, in <module>
    rts = ts.recapitate(recombination_map=hapmap, Ne=200)
  File "//anaconda3/envs/py3/lib/python3.7/site-packages/pyslim/slim_tree_sequence.py", line 375, in recapitate
    **kwargs)
  File "//anaconda3/envs/py3/lib/python3.7/site-packages/msprime/simulations.py", line 481, in simulate
    sim, mutation_generator, 1, provenance_dict, end_time))
  File "//anaconda3/envs/py3/lib/python3.7/site-packages/msprime/simulations.py", line 160, in _replicate_generator
    tree_sequence = sim.get_tree_sequence(mutation_generator, provenance_record)
  File "//anaconda3/envs/py3/lib/python3.7/site-packages/msprime/simulations.py", line 718, in get_tree_sequence
    return tables.tree_sequence()
  File "//anaconda3/envs/py3/lib/python3.7/site-packages/tskit/tables.py", line 1536, in tree_sequence
    return tskit.TreeSequence.load_tables(self)
  File "//anaconda3/envs/py3/lib/python3.7/site-packages/tskit/trees.py", line 2177, in load_tables
    ts.load_tables(tables.ll_tables)
_tskit.LibraryError: Bad edge interval where right <= left

When I change the code to include a num_loci argument, I get error about not being able to translate the coordinates.

import msprime, pyslim
import numpy as np

ts = pyslim.load("10July/10July_7k_simple_stage4.trees")
 	
	

with open('recmap1e9.txt','r') as file:
    positions = [int(element) for element in file.readline().rstrip().split(',')]
    rates = [float(element) for element in file.readline().rstrip().split(',')]



positions.insert(0, 0) 
rates.append(0)
positions[-1] = positions[-1]+1

rm = msprime.RecombinationMap(positions,rates, num_loci = positions[-1])
rts = ts.recapitate(recombination_map=rm, Ne=200)

Output:

There are 11569 individuals alive from the final generation.
Traceback (most recent call last):
  File "10July/10July_7k_simple.py", line 33, in <module>
    rts = ts.recapitate(recombination_map=rm, Ne=200)
  File "//anaconda3/envs/py3/lib/python3.7/site-packages/pyslim/slim_tree_sequence.py", line 375, in recapitate
    **kwargs)
  File "//anaconda3/envs/py3/lib/python3.7/site-packages/msprime/simulations.py", line 488, in simulate
    sim, mutation_generator, 1, provenance_dict, end_time))
  File "//anaconda3/envs/py3/lib/python3.7/site-packages/msprime/simulations.py", line 164, in _replicate_generator
    sim.run(end_time)
  File "//anaconda3/envs/py3/lib/python3.7/site-packages/msprime/simulations.py", line 702, in run
    self.ll_sim = self.create_ll_instance()
  File "//anaconda3/envs/py3/lib/python3.7/site-packages/msprime/simulations.py", line 694, in create_ll_instance
    node_mapping_block_size=self.node_mapping_block_size)
_msprime.InputError: Input error in initialise: The specified recombination map is cannot translate the coordinatesfor the specified tree sequence. It is either too coarse (num_loci is too small) or contains zero recombination rates. Please either increase the number of loci or recombination rate

I used positions[-1] as num_loci, not positions[-1]+1 as in your example above, because my code already added one to the last element in positions. However, I also tried with num_loci= positions[-1]+1 even after that step, and got the same error.

Thanks!
-Teresa

from pyslim.

petrelharp avatar petrelharp commented on September 14, 2024

Hm, ok - to look into this, I'll need the rates, I think. Could you post the file recmap1e9.txt - either attach it to a comment here, or post a link to it?

To eliminate the randomness in when the error happens, you can set the seed - so, do

rts = ts.recapitate(recombination_map=rm, Ne=200, random_seed=42)

... and then change the seed until you get the error.

from pyslim.

TeresaPegan avatar TeresaPegan commented on September 14, 2024

Here is recmap1e9.txt - it is just a text file of the positions and rates I pasted in a few posts back.
Thanks for the suggestion about setting the seed, that will be helpful!
-Teresa
recmap1e9.txt

from pyslim.

petrelharp avatar petrelharp commented on September 14, 2024

Hm, well, this is difficult. The underlying problem is that in mpsrime <=0.7.4, recombination locations were chosen in "recombination space", which is where the discretization into num_loci pieces happens. If two of these discrete locations in recombination space map back onto the same location in the genome (ie in physical space), then you get this "left == right" error. Here's some code to verify that's what happens for your recombination map:

import msprime

with open('recmap1e9.txt','r') as file:
    positions = [int(element) for element in file.readline().rstrip().split(',')]
    rates = [float(element) for element in file.readline().rstrip().split(',')]

positions.insert(0, 0) 
rates.append(0)
positions[-1] = positions[-1]+1

num_loci = positions[-1]
rm = msprime.RecombinationMap(positions, rates, num_loci=num_loci)

for k in range(rm.get_num_loci() - 1):
    a = rm.genetic_to_physical(k) 
    b = rm.genetic_to_physical(k + 1) 
    if a == b:
        print("problem! at ", k, ", ", a, " = ", b)
        break

... which after a while tells me that

problem! at  544000002 ,  549999999.0000002  =  549999999.0000002

Unfortunately, I'm not seeing a great way around this. Notice that the problem interval is one of the "inter-chromosomal" intervals: there's just too much recombination packed into a small interval. So, increasing the number of loci actually makes this worse - changing the number of loci to 20 * positions[-1] makes the error happen earlier. A solution is therefore to decrease num_loci, counterintuitively. This seems bad since it implies there will be loci you can't get recombination events between... but, that's already true, setting num_loci = positions[-1]. So, it's an approximation, but not a bad approximation, really. (a better approximation than the approximation to the real recombination process, certainly!)

Another solution is to use the development version of msprime, which will use a discrete recombination map. If this causes issues, it shouldn't cause invisible ones - you'll get errors if there are problems.

The only other solution I can think of is to cut the tree sequence from SLiM into pieces, one piece per chromosome (using .trim()), and then deal with each one separately. I don't think we have tools to put the resulting tree sequences back together again, which might be a bit annoying.

Apologies for the difficulties: floating-point numbers can be very tricky.

from pyslim.

TeresaPegan avatar TeresaPegan commented on September 14, 2024

Ok, thanks so much for the explanation and suggestions!

from pyslim.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.