API Documentation¶
Demographic models¶
- class momi.DemographicModel(N_e, gen_time=1, muts_per_gen=None)¶
Object for representing and inferring a demographic history.
- Parameters:
N_e (float) – the population size, unless manually changed by
DemographicModel.set_size()
gen_time (float) – units of time per generation. For example, if you wish to specify time in years, and a generation is 29 years, set this to 29. Default value is 1.
muts_per_gen (float,None) – mutation rate per base per generation. If unknown, set to None (the default). Can be changed with
DemographicModel.set_mut_rate()
- add_growth_param(name, g0=None, lower=-0.001, upper=0.001, rgen=None)¶
Add growth rate parameter to the demographic model.
- add_leaf(pop_name, t=0, N=None, g=None)¶
Add a sampled leaf population to the model.
The arguments t, N, g can be floats or parameter names (strings).
If N or g are specified, then the size or growth rate are also set at the sampling time t. Otherwise they remain at their previous (default) values.
Note that this does not affect the population size and growth below time t, which may be an issue if lineages are moved in from other populations below t. If you need to set the size or growth below t, use
DemographicModel.set_size()
.
- add_parameter(name, start_value=None, rgen=None, scale_transform=None, unscale_transform=None, scaled_lower=None, scaled_upper=None)¶
Low-level method to add a new parameter. Most users should instead call
DemographicModel.add_size_param()
,DemographicModel.add_pulse_param()
,DemographicModel.add_time_param()
, orDemographicModel.add_growth_param()
.In order for internal gradients to be correct,
scale_transform
andunscale_transform
should be constructed using autograd.- Parameters:
name (str) – Name of the parameter.
start_value (float) – Starting value. If None, use
rgen
to sample a random starting value.rgen (function) – Function to get a random starting value.
scale_transform (function) – Function for internally transforming and rescaling the parameter during optimization.
unscale_transform (function) – Inverse function of
scale_transform
scaled_lower (float) – Lower bound after scaling by
scale_transform
scaled_upper (float) – Upper bound after scaling by
scale_transform
- add_pulse_param(name, p0=None, lower=0.0, upper=1.0, rgen=None)¶
Add a pulse parameter to the demographic model.
- add_size_param(name, N0=None, lower=1, upper=10000000000.0, rgen=None)¶
Add a size parameter to the demographic model.
- add_time_param(name, t0=None, lower=0.0, upper=None, lower_constraints=[], upper_constraints=[], rgen=None)¶
Add a time parameter to the demographic model.
- Parameters:
name (str) – Parameter name
t0 (float) – Starting value. If None, use
rgen
to randomly samplelower (float) – Lower bound
upper (float) – Upper bound
lower_constraints (list) – List of parameter names that are lower bounds
upper_constraints (list) – List of parameter names that are upper bounds
rgen – Function to sample a random starting value. If None, a truncated exponential with rate
1 / (N_e * gen_time)
constrained to satisfy the bounds and constraints.
- fit_within_pop_diversity()¶
Estimates mutation rate using within-population nucleotide diversity.
The within-population nucleotide diversity is the average number of hets per individual in the population, assuming it is at Hardy Weinberg equilibrium.
This returns an estimated mutation rate for each (ascertained) population. Note these are non-independent estimates of the same value. It also returns standard deviations from the jacknife.
If
DemographicModel.muts_per_gen
is set, will also return Z-scores of the residuals.- Return type:
pandas.DataFrame
- get_params(scaled=False)¶
Return an ordered dictionary with the current parameter values.
If
scaled=True
, returns the parameters scaled in the internal representation used by momi during optimization (see also:DemographicModel.add_parameter()
)
- kl_div()¶
The KL-divergence at the current parameter values
- log_likelihood()¶
The log likelihood at the current parameter values
- move_lineages(pop_from, pop_to, t, p=1, N=None, g=None)¶
Move each lineage in pop_from to pop_to at time t with probability p.
The arguments t, p, N, g can be floats or parameter names (strings).
If N or g are specified, then the size or growth rate of pop_to is also set at this time, otherwise these parameters remain at their previous values.
- optimize(method='tnc', jac=True, hess=False, hessp=False, printfreq=1, **kwargs)¶
Search for the maximum likelihood value of the parameters.
This is a wrapper around
scipy.optimize.minimize()
, and arguments for that function can be passed in via**kwargs
. Note the following arguments are constructed bymomi
and not be passed in by**kwargs
:fun
,x0
,jac
,hess
,hessp
,bounds
.- Parameters:
method (str) – Optimization method. Default is “tnc”. For large models “L-BFGS-B” is recommended. See
scipy.optimize.minimize()
.jac (bool) – Whether or not to provide the gradient (computed via
autograd
) to the optimizer. If False, optimizers requiring gradients will typically approximate it via finite differences.hess (bool) – Whether or not to provide the hessian (computed via
autograd
) to the optimizer.hessp (bool) – Whether or not to provide the hessian-vector-product (via
autograd
) to the optimizerprintfreq (int) – Log current progress via
logging.info()
every printfreq iterations
- Return type:
- set_data(sfs, length=None, mem_chunk_size=1000, non_ascertained_pops=None, use_pairwise_diffs=True)¶
Set dataset for the model.
- Parameters:
sfs (Sfs) – Observed SFS
length (float) – Length of data in bases. Overrides
sfs.length
if set. Required ifDemoModel.muts_per_gen
is set andsfs.length
is not.mem_chunk_size – Controls memory usage by computing likelihood in chunks of SNPs. If
-1
then no chunking is done.non_ascertained_pops – Don’t ascertain SNPs within these populations. That is, ignore all SNPs that are not polymorphic on the other populations. The SFS is adjusted to represent probabilities conditional on this ascertainment scheme.
use_pairwise_diffs – Only has an effect if
DemoModel.muts_per_gen
is set. IfFalse
, assumes the total number of mutations is Poisson. If True, models the within population nucleotide diversity (i.e. the average number of heterozygotes per population) as independent Poissons. If there is missing data this is required to beTrue
.
- set_mut_rate(muts_per_gen)¶
Set the mutation rate.
- Parameters:
muts_per_gen (float,None) – Mutation rate per base per generation. If unknown, set to None.
- set_params(new_params=None, randomize=False, scaled=False)¶
Set the current parameter values
- Parameters:
new_params (dict/list) – dict mapping parameter names to new values, or list of new values whose length is the current number of parameters
randomize (bool) – if True, parameters not in
new_params
get randomly sampled new valuesscaled (bool) – if True, values in
new_params
have been pre-scaled according to the internal representation used by momi during optimization (see also:DemographicModel.add_parameter()
)
- set_size(pop_name, t, N=None, g=0)¶
Set population size and/or growth rate at time t.
The arguments t, N, g can be floats or parameter names (strings).
If N is not specified then only the growth rate is changed.
If N is specified and g is not then the growth rate is reset to 0. Currently it is not possible to change the size without also setting the growth rate.
- simulate_data(length, recoms_per_gen, num_replicates, muts_per_gen=None, sampled_n_dict=None, **kwargs)¶
Simulate data, using msprime as backend
- Parameters:
length (int) – Length of each locus in bases
recoms_per_gen (float) – Recombination rate per generation per base
muts_per_gen (float) – Mutation rate per generation per base
num_replicates (int) – Number of loci to simulate
sampled_n_dict (dict) – Number of haploids per population. If None, use sample sizes from the current dataset as set by
DemographicModel.set_data()
- Returns:
Dataset of SNP allele counts
- Return type:
- simulate_vcf(out_prefix, recoms_per_gen, length, muts_per_gen=None, chrom_name='1', ploidy=1, random_seed=None, sampled_n_dict=None, **kwargs)¶
Simulate a chromosome using msprime and write it to VCF
- Parameters:
outfile (str,file) – Output VCF file. If a string ending in “.gz”, gzip it.
muts_per_gen (float) – Mutation rate per generation per base
recoms_per_gen (float) – Recombination rate per generation per base
length (int) – Length of chromosome in bases
chrom_name (str) – Name of chromosome
ploidy (int) – Ploidy
random_seed (int) – Random seed
sampled_n_dict (dict) – Number of haploids per population. If None, use sample sizes from the current dataset as set by
DemographicModel.set_data()
- stochastic_optimize(num_iters, n_minibatches=None, snps_per_minibatch=None, rgen=None, printfreq=1, start_from_checkpoint=None, save_to_checkpoint=None, svrg_epoch=-1, **kwargs)¶
Use stochastic optimization (ADAM+SVRG) to search for MLE
Exactly one of of
n_minibatches
andsnps_per_minibatch
should be set, as one determines the other.- Parameters:
num_iters (int) – Number of steps
n_minibatches (int) – Number of minibatches
snps_per_minibatch (int) – Number of SNPs per minibatch
rgen (numpy.RandomState) – Random generator
printfreq (int) – How often to log progress
start_from_checkpoint (str) – Name of checkpoint file to start from
save_to_checkpoint (str) – Name of checkpoint file to save to
svrg_epoch (int) – How often to compute full likelihood for SVRG. -1=never.
- Return type:
Plotting¶
- class momi.DemographyPlot(model, pop_x_positions, ax=None, figsize=None, linthreshy=None, minor_yticks=None, major_yticks=None, color_map='cool', pulse_color_bounds=(0, 1), draw=True)¶
Object for plotting a demography.
After creating this object, call
DemographyPlot.draw()
to draw the demography.- Parameters:
model (DemographicModel) – model to plot
pop_x_positions (list,dict) – list ordering populations along the x-axis, or dict mapping population names to x-axis positions
ax (matplotlib.axes.Axes) – Canvas to draw the figure on. Defaults to
matplotlib.pyplot.gca()
figsize (tuple) – If non-None, calls
matplotlib.pyplot.figure()
to create a new figure with this width and height. Ignored ifax
is non-None.linthreshy (float) – Scale y-axis linearly below this value and logarithmically above it. (Default: use linear scaling only)
minor_yticks (list) – Minor ticks on y axis
major_yticks (list) – Major ticks on y axis
color_map (str,matplotlib.colors.Colormap) – Colormap mapping pulse probabilities to colors
pulse_color_bounds (tuple) – pair of
(lower, upper)
bounds for mapping pulse probabilities to colors
- add_bootstrap(params, alpha, rad=-0.1, rand_rad=True)¶
Add an inferred bootstrap demography to the plot.
- draw(alpha=1.0, tree_color='C0', draw_frame=True, rad=0, pulse_label=True)¶
Draw the demography.
This is method draws the demography by calling
DemographyPlot.draw_tree()
,DemographyPlot.draw_leafs()
,DemographyPlot.draw_pulse()
, andDemographyPlot.draw_frame()
. Call those methods directly for finer control.- Parameters:
- draw_N_legend(N_legend_values=None, title='N', **kwargs)¶
Draw legend of population sizes.
**kwargs
are passed ontomatplotlib.axes.Axes.legend()
.- Parameters:
N_legend_values – Values of
N
to plot.title – Title of legend
- Return type:
- draw_frame(pops=None, rename_pops=None, rotation=-30)¶
Draw tickmarks, legend, and colorbar.
- draw_leafs(leafs=None, facecolors='none', edgecolors='black', s=100, zorder=2, **kwargs)¶
Draw symbols for the leaf populations.
- Parameters:
leafs (list) – Leaf populations to draw symbols for. If None, add all leafs.
Extra parameters
facecolors, edgecolors, s, zorder, **kwargs
are passed tomatplotlib.axes.Axes.scatter()
.
- draw_pulse(pop1, pop2, t, p, rad=-0.1, alpha=1.0, pulse_label=True, adj_label_x=0, adj_label_y=0)¶
Draw a pulse.
Use
DemographyPlot.iter_pulses()
to iterate over the pulses, which can then be plotted with this method.- Parameters:
pop1 (str) – Population the arrow is pointing into
pop2 (str) – Population the arrow is pointing away from
t (float) – Time of the pulse
p (float) – Strength of the pulse
rad (float) – Arc of the pulse arrow in radians.
alpha (float) – Transparency
pulse_label (bool) – Add a label for pulse strength?
adj_label_x (float) – Adjust pulse label x position
adj_label_y (float) – Adjust pulse label y position
- draw_tree(tree_color='C0', alpha=1.0)¶
Draw the demographic tree (without pulse migrations)
- iter_pulses()¶
Iterate over the pulse arrows, to pass to
DemographyPlot.draw_pulse()
- Returns:
iterator over tuples
(pop1, pop2, t, p)
Data¶
- momi.snp_allele_counts(chrom_ids, positions, populations, ancestral_counts, derived_counts, length=None, use_folded_sfs=False)¶
Create a
SnpAlleleCounts
object.- Parameters:
chrom_ids (iterator) – the CHROM at each SNP
positions (iterator) – the POS at each SNP
populations (list) – the population names
ancestral_counts (iterator) – iterator over tuples of the ancestral counts at each SNP. tuple length should be the same as the number of populations
derived_counts (iterator) – iterator over tuples of the derived counts at each SNP. tuple length should be the same as the number of populations
use_folded_sfs (bool) – whether the folded SFS should be used when computing likelihoods. Set this to True if there is uncertainty about the ancestral allele.
- Return type:
- class momi.SnpAlleleCounts¶
The allele counts for a list of SNPs.
To create a
SnpAlleleCounts
object, useSnpAlleleCounts.read_vcf()
,SnpAlleleCounts.load()
, orsnp_allele_counts()
. Do NOT call the class constructor directly, it is for internal use only.- classmethod concatenate(to_concatenate)¶
Combine a list of
SnpAlleleCounts
into a single object.- Parameters:
to_concatenate (iterator) – Iterator over
SnpAlleleCounts
- Return type:
- down_sample(sampled_n_dict)¶
Randomly subsample the alleles at each site.
- Parameters:
sampled_n_dict (dict) – dictionary mapping population names to the subsample size (number of alleles to subsample from each population).
- Return type:
- dump(f)¶
Write data in JSON format.
- Parameters:
f (str,file) – filename or file object. If a filename, the resulting file is gzipped.
- extract_sfs(n_blocks)¶
Extracts SFS from data.
- classmethod load(f)¶
Load
SnpAlleleCounts
created fromSnpAlleleCounts.dump()
orpython -m momi.read_vcf ...
- Parameters:
f (str,file) – file object or file name to read in
- Return type:
- classmethod read_vcf(vcf_file, ind2pop, bed_file=None, ancestral_alleles=True, info_aa_field='AA')¶
Read in a VCF file and return the allele counts at biallelic SNPs.
- Parameters:
vcf_file (str) – VCF file to read in. The VCF file should be indexed, e.g. with tabix or bcftools. Alternatively, if equal to “-”, reads the VCF from stdin.
ind2pop (dict) – Maps individual samples to populations.
bed_file (str,None) – BED accessibility regions file. Only regions in the BED file are read from the VCF. The BED file is also used to determine the size of the data in bases, so the same BED file should NOT be used across multiple VCFs (otherwise regions will be double counted towards the data length). If no BED is provided, all SNPs in the VCF are read, and the length of the data is set to be unknown.
ancestral_alleles (bool,str) – If True, use the AA INFO field to determine ancestral allele, skipping SNPs missing this field. If False, ignore ancestral allele information, and set the
SnpAlleleCounts.use_folded_sfs
property so that the folded SFS is used by default when computing likelihoods. Finally, ifancestral_alleles
is a string that is the name of a population inind2pop
, then treat that population as an outgroup, using its consensus allele as the ancestral allele; SNPs without consensus are skipped.info_aa_field (str) – The INFO field to read Ancestral Allele from. Default is “AA”. Only has effect if
ancestral_alleles=True
.
- Return type:
- momi.site_freq_spectrum(sampled_pops, freqs_by_locus, length=None)¶
Create
Sfs
from a list of dicts of counts.freqs_by_locus[l][config]
gives the frequency ofconfig
on locusl
.A
config
is a(n_pops,2)
-array, whereconfig[p][a]
is the count of allelea
in populationp
.a=0
represents the ancestral allele whilea=1
represents the derived allele.
- class momi.Sfs¶
Class for representing observed site frequnecy spectrum data.
To create a
Sfs
object, useSnpAlleleCounts.extract_sfs()
,Sfs.load()
, orsite_freq_spectrum()
. Do NOT call the class constructor directly, it is for internal use only.- property config_array¶
Array of unique configs.
This is a 3-d array with shape
(n_configs, n_pops, 2)
.config_array[i, p, a]
gives the count of allelea
in populationp
in configurationi
. Allelea=0
is ancestral,a=1
is derived.Sfs.sampled_pops
gives the ordering of the populations.- Returns:
array of configs
- Return type:
- dump(f)¶
Write Sfs to file
- Parameters:
f (str,file) – Filename or object. If name ends with “.gz” gzip it.
- property freqs_matrix¶
Sparse matrix representing the frequencies at each locus.
The
(i,j)
-th entry gives the frequency of thei
-th config inSfs.config_array
at locusj
- Return type:
- classmethod load(f)¶
Load
Sfs
from file created bySfs.dump()
orpython -m momi.extract_sfs
- n_snps(vector=False)¶
Number of SNPs, either per-locus or overall.
- Parameters:
vector (bool) – Whether to return a single total count, or an array of counts per-locus
- p_missing¶
Estimate of probability that a random allele from each population is missing.
Missingness is estimated as follows: from each SNP remove a random allele; if the resulting config is monomorphic, then ignore. If polymorphic, then count whether the removed allele is missing or not.
This avoids bias from fact that we don’t observe some polymorphic configs that appear monomorphic after removing the missing alleles.
- Returns:
1-d array of missingness per population
- Return type:
- resample()¶
Create a new SFS by resampling blocks with replacement.
Note the resampled SFS is assumed to have the same length in base pairs as the original SFS, which may be a poor assumption if the blocks are not of equal length.
- Returns:
Resampled SFS
- Return type:
- property sampled_n¶
Number of samples per population
sampled_n[i]
is the number of haploids in thei
-th population ofSfs.sampled_pops
- Returns:
1-d array of integers
- Return type:
- property sampled_pops¶
Names of sampled populations
- Returns:
1-d array of strings
- Return type:
- subset_populations(populations, non_ascertained_pops=None)¶
Returns lower-dimensional SFS restricted to a subset of populations.
- Parameters:
- Returns:
Lower-dimensional SFS restricted to a subset of populations.
- Return type:
- momi.sfs_from_dadi(infile, outfile=None)¶
Generate a momi formatted
Sfs
object from the dadi frequency spectrum format as documented in section 3.1 of the dadi manual. One slight modification is that if you include population names after the “folding” flag on the info line, then this function will honor these pop names.
Statistics¶
- class momi.SfsModelFitStats(demo_model, sampled_n_dict=None)¶
Class to compare expected vs. observed statistics of the SFS.
All methods return
JackknifeGoodnessFitStat
unless otherwise stated.Currently, all goodness-of-fit statistics are based on the multinomial SFS (i.e., the SFS normalized to be a probability distribution summing to 1). Thus the mutation rate has no effect on these statistics.
See Patterson et al 2012, for definitions of f2, f3, f4 (abba-baba), and D statistics.
Note this class does NOT get updated when the underlying
demo_model
changes; a newSfsModelFitStats
needs to be created to reflect any changes in the demography.- Parameters:
demo_model (momi.DemographicModel) – Demography to compute expected statistics under.
sampled_n_dict (dict) – The number of samples to use per population. SNPs with fewer than this number of samples are ignored. The default is to use the full sample size of the data, i.e. to remove all SNPs with any missing data. For datasets with large amounts of missing data, this could potentially lead to most or all SNPs being removed, so it is important to specify a smaller sample size in such cases.
- all_pairs_ibs(fig=True)¶
Fit the IBS fraction for all pairs of populations, and optionally plot it.
- Parameters:
fig (bool) – whether to plot it
- Return type:
pandas.DataFrame
- f2(A, B)¶
Computes f2 statistic (A-B)*(A-B)
- f3(A, B, O)¶
Computes f3 statistic (O-A)*(O-B)
- f4(A, B, C, D=None)¶
Returns the ABBA-BABA (f4) statistic for testing admixture.
- f_st(A, B)¶
Returns (pi_between - pi_within) / pi_between, where pi_between, pi_within represent the average number of pairwise diffs between 2 individuals sampled from different or the same population, respectively.
- log_abba_baba(A, B, C, D=None)¶
Returns log(BABA/ABBA) = log(BABA)-log(ABBA)
- pattersons_d(A, B, C, D=None)¶
Returns Patterson’s D, defined as (BABA-ABBA)/(BABA+ABBA).
- tensor_prod(derived_weights_dict)¶
Compute rank-1 tensor products of the SFS, which can be used to express a wide range of SFS-based statistics.
More specifically, this computes the sum
\[\sum_{i,j,\ldots} SFS_{i,j,\ldots} w^{(1)}_i w^{(2)}_j \cdots\]where \(w^{(1)}_i\) is the weight corresponding to SFS entries with
i
derived alleles in population 1, etc. Note the SFS is normalized to sum to 1 here (it is a probability).- Parameters:
derived_weights_dict (dict) – Maps leaf populations to vectors (
numpy.ndarray
). If a population hasn
samples then the corresponding vectorw
should have lengthn+1
, withw[i]
being the weight for SFS entries withi
copies of the derived allele in the population.- Return type:
- class momi.JackknifeGoodnessFitStat(expected, observed, jackknifed_array)¶
Object returned by methods of
SfsModelFitStats
.Basic arithmetic operations are supported, allowing to build up complex statistics out of simpler ones.
The raw expected, observed, and jackknifed_array values can be accessed as attributes of this class.
- Parameters:
expected (float) – the expected value of the statistic
observed (float) – the observed value of the statistic
jackknifed_array (numpy.ndarray) – array of the jackknifed values of the statistic.
- property sd¶
Standard deviation of the statistic, estimated via jackknife
- property z_score¶
Z-score of the statistic, defined as (observed-expected)/sd