API Documentation¶
Demographic models¶
-
class
momi.DemographicModel(N_e, gen_time=1, muts_per_gen=None)¶ Object for representing and inferring a demographic history.
Parameters: - N_e (float) – the population size, unless manually changed by
DemographicModel.set_size() - gen_time (float) – units of time per generation. For example, if you wish to specify time in years, and a generation is 29 years, set this to 29. Default value is 1.
- muts_per_gen (float,None) – mutation rate per base per generation. If unknown, set to None (the default). Can be changed with
DemographicModel.set_mut_rate()
-
add_growth_param(name, g0=None, lower=-0.001, upper=0.001, rgen=None)¶ Add growth rate parameter to the demographic model.
Parameters:
-
add_leaf(pop_name, t=0, N=None, g=None)¶ Add a sampled leaf population to the model.
The arguments t, N, g can be floats or parameter names (strings).
If N or g are specified, then the size or growth rate are also set at the sampling time t. Otherwise they remain at their previous (default) values.
Note that this does not affect the population size and growth below time t, which may be an issue if lineages are moved in from other populations below t. If you need to set the size or growth below t, use
DemographicModel.set_size().Parameters:
-
add_parameter(name, start_value=None, rgen=None, scale_transform=None, unscale_transform=None, scaled_lower=None, scaled_upper=None)¶ Low-level method to add a new parameter. Most users should instead call
DemographicModel.add_size_param(),DemographicModel.add_pulse_param(),DemographicModel.add_time_param(), orDemographicModel.add_growth_param().In order for internal gradients to be correct,
scale_transformandunscale_transformshould be constructed using autograd.Parameters: - name (str) – Name of the parameter.
- start_value (float) – Starting value. If None, use
rgento sample a random starting value. - rgen (function) – Function to get a random starting value.
- scale_transform (function) – Function for internally transforming and rescaling the parameter during optimization.
- unscale_transform (function) – Inverse function of
scale_transform - scaled_lower (float) – Lower bound after scaling by
scale_transform - scaled_upper (float) – Upper bound after scaling by
scale_transform
-
add_pulse_param(name, p0=None, lower=0.0, upper=1.0, rgen=None)¶ Add a pulse parameter to the demographic model.
Parameters:
-
add_size_param(name, N0=None, lower=1, upper=10000000000.0, rgen=None)¶ Add a size parameter to the demographic model.
Parameters:
-
add_time_param(name, t0=None, lower=0.0, upper=None, lower_constraints=[], upper_constraints=[], rgen=None)¶ Add a time parameter to the demographic model.
Parameters: - name (str) – Parameter name
- t0 (float) – Starting value. If None, use
rgento randomly sample - lower (float) – Lower bound
- upper (float) – Upper bound
- lower_constraints (list) – List of parameter names that are lower bounds
- upper_constraints (list) – List of parameter names that are upper bounds
- rgen – Function to sample a random starting value. If None, a truncated exponential with rate
1 / (N_e * gen_time)constrained to satisfy the bounds and constraints.
-
fit_within_pop_diversity()¶ Estimates mutation rate using within-population nucleotide diversity.
The within-population nucleotide diversity is the average number of hets per individual in the population, assuming it is at Hardy Weinberg equilibrium.
This returns an estimated mutation rate for each (ascertained) population. Note these are non-independent estimates of the same value. It also returns standard deviations from the jacknife.
If
DemographicModel.muts_per_genis set, will also return Z-scores of the residuals.Return type: pandas.DataFrame
-
get_params(scaled=False)¶ Return an ordered dictionary with the current parameter values.
If
scaled=True, returns the parameters scaled in the internal representation used by momi during optimization (see also:DemographicModel.add_parameter())
-
kl_div()¶ The KL-divergence at the current parameter values
-
log_likelihood()¶ The log likelihood at the current parameter values
-
move_lineages(pop_from, pop_to, t, p=1, N=None, g=None)¶ Move each lineage in pop_from to pop_to at time t with probability p.
The arguments t, p, N, g can be floats or parameter names (strings).
If N or g are specified, then the size or growth rate of pop_to is also set at this time, otherwise these parameters remain at their previous values.
Parameters: - pop_from (str) – Population lineages are moved from (backwards in time)
- pop_to (str) – Population lineages are moved to (backwards in time)
- t (float,str) – Time of the event
- p (float,str) – Probability that lineage in pop_from moves to pop_to
- N (float,str) – Population size of pop_to
- g (float,str) – Growth rate of pop_to
-
optimize(method='tnc', jac=True, hess=False, hessp=False, printfreq=1, **kwargs)¶ Search for the maximum likelihood value of the parameters.
This is a wrapper around
scipy.optimize.minimize(), and arguments for that function can be passed in via**kwargs. Note the following arguments are constructed bymomiand not be passed in by**kwargs:fun,x0,jac,hess,hessp,bounds.Parameters: - method (str) – Optimization method. Default is “tnc”. For large models “L-BFGS-B” is recommended. See
scipy.optimize.minimize(). - jac (bool) – Whether or not to provide the gradient (computed via
autograd) to the optimizer. If False, optimizers requiring gradients will typically approximate it via finite differences. - hess (bool) – Whether or not to provide the hessian (computed via
autograd) to the optimizer. - hessp (bool) – Whether or not to provide the hessian-vector-product (via
autograd) to the optimizer - printfreq (int) – Log current progress via
logging.info()every printfreq iterations
Return type: - method (str) – Optimization method. Default is “tnc”. For large models “L-BFGS-B” is recommended. See
-
set_data(sfs, length=None, mem_chunk_size=1000, non_ascertained_pops=None, use_pairwise_diffs=True)¶ Set dataset for the model.
Parameters: - sfs (Sfs) – Observed SFS
- length (float) – Length of data in bases. Overrides
sfs.lengthif set. Required ifDemoModel.muts_per_genis set andsfs.lengthis not. - mem_chunk_size – Controls memory usage by computing likelihood in chunks of SNPs. If
-1then no chunking is done. - non_ascertained_pops – Don’t ascertain SNPs within these populations. That is, ignore all SNPs that are not polymorphic on the other populations. The SFS is adjusted to represent probabilities conditional on this ascertainment scheme.
- use_pairwise_diffs – Only has an effect if
DemoModel.muts_per_genis set. IfFalse, assumes the total number of mutations is Poisson. If True, models the within population nucleotide diversity (i.e. the average number of heterozygotes per population) as independent Poissons. If there is missing data this is required to beTrue.
-
set_mut_rate(muts_per_gen)¶ Set the mutation rate.
Parameters: muts_per_gen (float,None) – Mutation rate per base per generation. If unknown, set to None.
-
set_params(new_params=None, randomize=False, scaled=False)¶ Set the current parameter values
Parameters: - new_params (dict/list) – dict mapping parameter names to new values, or list of new values whose length is the current number of parameters
- randomize (bool) – if True, parameters not in
new_paramsget randomly sampled new values - scaled (bool) – if True, values in
new_paramshave been pre-scaled according to the internal representation used by momi during optimization (see also:DemographicModel.add_parameter())
-
set_size(pop_name, t, N=None, g=0)¶ Set population size and/or growth rate at time t.
The arguments t, N, g can be floats or parameter names (strings).
If N is not specified then only the growth rate is changed.
If N is specified and g is not then the growth rate is reset to 0. Currently it is not possible to change the size without also setting the growth rate.
Parameters:
-
simulate_data(length, recoms_per_gen, num_replicates, muts_per_gen=None, sampled_n_dict=None, **kwargs)¶ Simulate data, using msprime as backend
Parameters: - length (int) – Length of each locus in bases
- recoms_per_gen (float) – Recombination rate per generation per base
- muts_per_gen (float) – Mutation rate per generation per base
- num_replicates (int) – Number of loci to simulate
- sampled_n_dict (dict) – Number of haploids per population. If None, use sample sizes from the current dataset as set by
DemographicModel.set_data()
Returns: Dataset of SNP allele counts
Return type:
-
simulate_vcf(out_prefix, recoms_per_gen, length, muts_per_gen=None, chrom_name='1', ploidy=1, random_seed=None, sampled_n_dict=None, **kwargs)¶ Simulate a chromosome using msprime and write it to VCF
Parameters: - outfile (str,file) – Output VCF file. If a string ending in “.gz”, gzip it.
- muts_per_gen (float) – Mutation rate per generation per base
- recoms_per_gen (float) – Recombination rate per generation per base
- length (int) – Length of chromosome in bases
- chrom_name (str) – Name of chromosome
- ploidy (int) – Ploidy
- random_seed (int) – Random seed
- sampled_n_dict (dict) – Number of haploids per population. If None, use sample sizes from the current dataset as set by
DemographicModel.set_data()
-
stochastic_optimize(num_iters, n_minibatches=None, snps_per_minibatch=None, rgen=None, printfreq=1, start_from_checkpoint=None, save_to_checkpoint=None, svrg_epoch=-1, **kwargs)¶ Use stochastic optimization (ADAM+SVRG) to search for MLE
Exactly one of of
n_minibatchesandsnps_per_minibatchshould be set, as one determines the other.Parameters: - num_iters (int) – Number of steps
- n_minibatches (int) – Number of minibatches
- snps_per_minibatch (int) – Number of SNPs per minibatch
- rgen (numpy.RandomState) – Random generator
- printfreq (int) – How often to log progress
- start_from_checkpoint (str) – Name of checkpoint file to start from
- save_to_checkpoint (str) – Name of checkpoint file to save to
- svrg_epoch (int) – How often to compute full likelihood for SVRG. -1=never.
Return type:
- N_e (float) – the population size, unless manually changed by
Plotting¶
-
class
momi.DemographyPlot(model, pop_x_positions, ax=None, figsize=None, linthreshy=None, minor_yticks=None, major_yticks=None, color_map='cool', pulse_color_bounds=(0, 1), draw=True)¶ Object for plotting a demography.
After creating this object, call
DemographyPlot.draw()to draw the demography.Parameters: - model (DemographicModel) – model to plot
- pop_x_positions (list,dict) – list ordering populations along the x-axis, or dict mapping population names to x-axis positions
- ax (matplotlib.axes.Axes) – Canvas to draw the figure on. Defaults to
matplotlib.pyplot.gca() - figsize (tuple) – If non-None, calls
matplotlib.pyplot.figure()to create a new figure with this width and height. Ignored ifaxis non-None. - linthreshy (float) – Scale y-axis linearly below this value and logarithmically above it. (Default: use linear scaling only)
- minor_yticks (list) – Minor ticks on y axis
- major_yticks (list) – Major ticks on y axis
- color_map (str,matplotlib.colors.Colormap) – Colormap mapping pulse probabilities to colors
- pulse_color_bounds (tuple) – pair of
(lower, upper)bounds for mapping pulse probabilities to colors
-
add_bootstrap(params, alpha, rad=-0.1, rand_rad=True)¶ Add an inferred bootstrap demography to the plot.
Parameters:
-
draw(alpha=1.0, tree_color='C0', draw_frame=True, rad=0, pulse_label=True)¶ Draw the demography.
This is method draws the demography by calling
DemographyPlot.draw_tree(),DemographyPlot.draw_leafs(),DemographyPlot.draw_pulse(), andDemographyPlot.draw_frame(). Call those methods directly for finer control.Parameters:
-
draw_N_legend(N_legend_values=None, title='N', **kwargs)¶ Draw legend of population sizes.
**kwargsare passed ontomatplotlib.axes.Axes.legend().Parameters: - N_legend_values – Values of
Nto plot. - title – Title of legend
Return type: - N_legend_values – Values of
-
draw_frame(pops=None, rename_pops=None, rotation=-30)¶ Draw tickmarks, legend, and colorbar.
Parameters:
-
draw_leafs(leafs=None, facecolors='none', edgecolors='black', s=100, zorder=2, **kwargs)¶ Draw symbols for the leaf populations.
Parameters: leafs (list) – Leaf populations to draw symbols for. If None, add all leafs. Extra parameters
facecolors, edgecolors, s, zorder, **kwargsare passed tomatplotlib.axes.Axes.scatter().
-
draw_pulse(pop1, pop2, t, p, rad=-0.1, alpha=1.0, pulse_label=True, adj_label_x=0, adj_label_y=0)¶ Draw a pulse.
Use
DemographyPlot.iter_pulses()to iterate over the pulses, which can then be plotted with this method.Parameters: - pop1 (str) – Population the arrow is pointing into
- pop2 (str) – Population the arrow is pointing away from
- t (float) – Time of the pulse
- p (float) – Strength of the pulse
- rad (float) – Arc of the pulse arrow in radians.
- alpha (float) – Transparency
- pulse_label (bool) – Add a label for pulse strength?
- adj_label_x (float) – Adjust pulse label x position
- adj_label_y (float) – Adjust pulse label y position
-
draw_tree(tree_color='C0', alpha=1.0)¶ Draw the demographic tree (without pulse migrations)
Parameters:
-
iter_pulses()¶ Iterate over the pulse arrows, to pass to
DemographyPlot.draw_pulse()Returns: iterator over tuples (pop1, pop2, t, p)
Data¶
-
momi.snp_allele_counts(chrom_ids, positions, populations, ancestral_counts, derived_counts, length=None, use_folded_sfs=False)¶ Create a
SnpAlleleCountsobject.Parameters: - chrom_ids (iterator) – the CHROM at each SNP
- positions (iterator) – the POS at each SNP
- populations (list) – the population names
- ancestral_counts (iterator) – iterator over tuples of the ancestral counts at each SNP. tuple length should be the same as the number of populations
- derived_counts (iterator) – iterator over tuples of the derived counts at each SNP. tuple length should be the same as the number of populations
- use_folded_sfs (bool) – whether the folded SFS should be used when computing likelihoods. Set this to True if there is uncertainty about the ancestral allele.
Return type:
-
class
momi.SnpAlleleCounts¶ The allele counts for a list of SNPs.
To create a
SnpAlleleCountsobject, useSnpAlleleCounts.read_vcf(),SnpAlleleCounts.load(), orsnp_allele_counts(). Do NOT call the class constructor directly, it is for internal use only.-
classmethod
concatenate(to_concatenate)¶ Combine a list of
SnpAlleleCountsinto a single object.Parameters: to_concatenate (iterator) – Iterator over SnpAlleleCountsReturn type: SnpAlleleCounts
-
down_sample(sampled_n_dict)¶ Randomly subsample the alleles at each site.
Parameters: sampled_n_dict (dict) – dictionary mapping population names to the subsample size (number of alleles to subsample from each population). Return type: SnpAlleleCounts
-
dump(f)¶ Write data in JSON format.
Parameters: f (str,file) – filename or file object. If a filename, the resulting file is gzipped.
-
extract_sfs(n_blocks)¶ Extracts SFS from data.
Parameters: n_blocks (int) – Number of blocks to split SFS into, for jackknifing and bootstrapping Return type: Sfs
-
classmethod
load(f)¶ Load
SnpAlleleCountscreated fromSnpAlleleCounts.dump()orpython -m momi.read_vcf ...Parameters: f (str,file) – file object or file name to read in Return type: SnpAlleleCounts
-
classmethod
read_vcf(vcf_file, ind2pop, bed_file=None, ancestral_alleles=True, info_aa_field='AA')¶ Read in a VCF file and return the allele counts at biallelic SNPs.
Parameters: - vcf_file (str) – VCF file to read in. The VCF file should be indexed, e.g. with tabix or bcftools. Alternatively, if equal to “-”, reads the VCF from stdin.
- ind2pop (dict) – Maps individual samples to populations.
- bed_file (str,None) – BED accessibility regions file. Only regions in the BED file are read from the VCF. The BED file is also used to determine the size of the data in bases, so the same BED file should NOT be used across multiple VCFs (otherwise regions will be double counted towards the data length). If no BED is provided, all SNPs in the VCF are read, and the length of the data is set to be unknown.
- ancestral_alleles (bool,str) – If True, use the AA INFO field to determine ancestral allele, skipping SNPs missing this field. If False, ignore ancestral allele information, and set the
SnpAlleleCounts.use_folded_sfsproperty so that the folded SFS is used by default when computing likelihoods. Finally, ifancestral_allelesis a string that is the name of a population inind2pop, then treat that population as an outgroup, using its consensus allele as the ancestral allele; SNPs without consensus are skipped. - info_aa_field (str) – The INFO field to read Ancestral Allele from. Default is “AA”. Only has effect if
ancestral_alleles=True.
Return type:
-
classmethod
-
momi.site_freq_spectrum(sampled_pops, freqs_by_locus, length=None)¶ Create
Sfsfrom a list of dicts of counts.freqs_by_locus[l][config]gives the frequency ofconfigon locusl.A
configis a(n_pops,2)-array, whereconfig[p][a]is the count of alleleain populationp.a=0represents the ancestral allele whilea=1represents the derived allele.Parameters: Return type:
-
class
momi.Sfs¶ Class for representing observed site frequnecy spectrum data.
To create a
Sfsobject, useSnpAlleleCounts.extract_sfs(),Sfs.load(), orsite_freq_spectrum(). Do NOT call the class constructor directly, it is for internal use only.-
config_array¶ Array of unique configs.
This is a 3-d array with shape
(n_configs, n_pops, 2).config_array[i, p, a]gives the count of alleleain populationpin configurationi. Allelea=0is ancestral,a=1is derived.Sfs.sampled_popsgives the ordering of the populations.Returns: array of configs Return type: numpy.ndarray
-
dump(f)¶ Write Sfs to file
Parameters: f (str,file) – Filename or object. If name ends with “.gz” gzip it.
-
freqs_matrix¶ Sparse matrix representing the frequencies at each locus.
The
(i,j)-th entry gives the frequency of thei-th config inSfs.config_arrayat locusjReturn type: scipy.sparse.csr_matrix
-
classmethod
load(f)¶ Load
Sfsfrom file created bySfs.dump()orpython -m momi.extract_sfsParameters: f (str,file) – file object or file name to read in Return type: Sfs
-
n_snps(vector=False)¶ Number of SNPs, either per-locus or overall.
Parameters: vector (bool) – Whether to return a single total count, or an array of counts per-locus
-
p_missing¶ Estimate of probability that a random allele from each population is missing.
Missingness is estimated as follows: from each SNP remove a random allele; if the resulting config is monomorphic, then ignore. If polymorphic, then count whether the removed allele is missing or not.
This avoids bias from fact that we don’t observe some polymorphic configs that appear monomorphic after removing the missing alleles.
Returns: 1-d array of missingness per population Return type: numpy.ndarray
-
resample()¶ Create a new SFS by resampling blocks with replacement.
Note the resampled SFS is assumed to have the same length in base pairs as the original SFS, which may be a poor assumption if the blocks are not of equal length.
Returns: Resampled SFS Return type: Sfs
-
sampled_n¶ Number of samples per population
sampled_n[i]is the number of haploids in thei-th population ofSfs.sampled_popsReturns: 1-d array of integers Return type: numpy.ndarray
-
sampled_pops¶ Names of sampled populations
Returns: 1-d array of strings Return type: numpy.ndarray
-
subset_populations(populations, non_ascertained_pops=None)¶ Returns lower-dimensional SFS restricted to a subset of populations.
Parameters: Returns: Lower-dimensional SFS restricted to a subset of populations.
Return type:
-
-
momi.sfs_from_dadi(infile, outfile=None)¶ Generate a momi formatted
Sfsobject from the dadi frequency spectrum format as documented in section 3.1 of the dadi manual. One slight modification is that if you include population names after the “folding” flag on the info line, then this function will honor these pop names.Parameters: Return type:
Statistics¶
-
class
momi.SfsModelFitStats(demo_model, sampled_n_dict=None)¶ Class to compare expected vs. observed statistics of the SFS.
All methods return
JackknifeGoodnessFitStatunless otherwise stated.Currently, all goodness-of-fit statistics are based on the multinomial SFS (i.e., the SFS normalized to be a probability distribution summing to 1). Thus the mutation rate has no effect on these statistics.
See Patterson et al 2012, for definitions of f2, f3, f4 (abba-baba), and D statistics.
Note this class does NOT get updated when the underlying
demo_modelchanges; a newSfsModelFitStatsneeds to be created to reflect any changes in the demography.Parameters: - demo_model (momi.DemographicModel) – Demography to compute expected statistics under.
- sampled_n_dict (dict) – The number of samples to use per population. SNPs with fewer than this number of samples are ignored. The default is to use the full sample size of the data, i.e. to remove all SNPs with any missing data. For datasets with large amounts of missing data, this could potentially lead to most or all SNPs being removed, so it is important to specify a smaller sample size in such cases.
-
all_pairs_ibs(fig=True)¶ Fit the IBS fraction for all pairs of populations, and optionally plot it.
Parameters: fig (bool) – whether to plot it Return type: pandas.DataFrame
-
f2(A, B)¶ Computes f2 statistic (A-B)*(A-B)
Parameters:
-
f3(A, B, O)¶ Computes f3 statistic (O-A)*(O-B)
Parameters:
-
f4(A, B, C, D=None)¶ Returns the ABBA-BABA (f4) statistic for testing admixture.
Parameters:
-
f_st(A, B)¶ Returns (pi_between - pi_within) / pi_between, where pi_between, pi_within represent the average number of pairwise diffs between 2 individuals sampled from different or the same population, respectively.
-
log_abba_baba(A, B, C, D=None)¶ Returns log(BABA/ABBA) = log(BABA)-log(ABBA)
-
pattersons_d(A, B, C, D=None)¶ Returns Patterson’s D, defined as (BABA-ABBA)/(BABA+ABBA).
Parameters:
-
tensor_prod(derived_weights_dict)¶ Compute rank-1 tensor products of the SFS, which can be used to express a wide range of SFS-based statistics.
More specifically, this computes the sum
\[\sum_{i,j,\ldots} SFS_{i,j,\ldots} w^{(1)}_i w^{(2)}_j \cdots\]where \(w^{(1)}_i\) is the weight corresponding to SFS entries with
iderived alleles in population 1, etc. Note the SFS is normalized to sum to 1 here (it is a probability).Parameters: derived_weights_dict (dict) – Maps leaf populations to vectors ( numpy.ndarray). If a population hasnsamples then the corresponding vectorwshould have lengthn+1, withw[i]being the weight for SFS entries withicopies of the derived allele in the population.Return type: JackknifeGoodnessFitStat
-
class
momi.JackknifeGoodnessFitStat(expected, observed, jackknifed_array)¶ Object returned by methods of
SfsModelFitStats.Basic arithmetic operations are supported, allowing to build up complex statistics out of simpler ones.
The raw expected, observed, and jackknifed_array values can be accessed as attributes of this class.
Parameters: - expected (float) – the expected value of the statistic
- observed (float) – the observed value of the statistic
- jackknifed_array (numpy.ndarray) – array of the jackknifed values of the statistic.
-
sd¶ Standard deviation of the statistic, estimated via jackknife
-
z_score¶ Z-score of the statistic, defined as (observed-expected)/sd