Statistics

To illustrate the statistical approach used to estimate confidence limits on
experimentally determined values for linkage distances, it is useful to first consider the
special case where two linked loci show complete concordance or no recombination
(symbolized as R = 0) in their allelic segregations among a set of *N* samples derived
either from recombinant inbred (RI) strains or from the offspring of a backcross. Let us
define the true recombination fraction, *Theta*, as the experimental fraction of samples
expected to be discordant (or recombinant) when *N* approaches infinity. Then the
probability of recombination in any one sample is simply
*Theta*
and the probability of non-recombination,
or concordance, is simply (1 -
*Theta*).
As long as multiple events are
completely independent of each other, one can calculate the probability that all of them
will occur by multiplying together the individual probabilities associated with each
event. Thus, if the probability of concordance in one sample is (1 - *Theta*), then the
probability of concordance in *N* samples is (1 - *Theta*)^{N}.

In most experimental situations, the known and unknown variables are reversed in
that one begins by determining the number of discordant (or recombinant) samples *i*
that occur within a total set of *N* as a means to estimate the unknown true recombination fraction
*Theta*.
When no discordant samples are observed, the probability term just derived
can be used with the substitution of the random variable
*small-theta*
in place of
*Theta*,
to provide a
continuous probability density function indicative of the relative likelihoods for
different values of
*Theta*
between 0.0 (complete linkage) and 0.5 (no linkage).

$$P\{\Theta = \theta\} = f(\theta) = (1 - \theta)^{N}$$
(Equation D1)

This equation reads "the probability that the true recombination fraction
*Theta*
is equal
to a particular value
*small-theta*
is the function of
*small-theta*
given as the last term in the equation". For
both RI data and backcross data,
*Theta*
can be related directly to linkage distance in
centimorgans, *d*. In the case of backcross data, and for values of
*Theta*
less than 0.25 (see
Section 7.2.2.3), recombination fractions are converted into centimorgan estimates
through simple multiplication:

$$d = 100\,\Theta$$
(Equation D2)

In the case of RI data, this conversion is combined with the Haldane-Waddington equation
(Equation 9.8) to yield:

$$d = \frac{100\,\Theta}{4 - 6\,\Theta}$$
(Equation D3)
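The two conversions can be written as small C helpers. This is a sketch under stated assumptions: the function names are hypothetical, and the formulas are taken from the `convert` routine of the program listed later in this appendix (100 × *Theta* for backcross data, 100 × *Theta* / (4 - 6 × *Theta*) for RI data).

```c
#include <assert.h>

/* Backcross data: centimorgans by simple multiplication.
   The text recommends this only for theta below 0.25. */
double cm_from_backcross(double theta)
{
    assert(theta >= 0.0 && theta <= 0.5);
    return 100.0 * theta;
}

/* RI strain data: the same multiplication combined with the
   Haldane-Waddington correction. */
double cm_from_ri(double theta)
{
    assert(theta >= 0.0 && theta <= 0.5);
    return 100.0 * theta / (4.0 - 6.0 * theta);
}
```

A discordance fraction of 0.1 thus corresponds to 10 cM in a backcross but only about 2.9 cM among RI strains, because RI strains accumulate recombination events over many generations of inbreeding.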

An example of the probability density function associated with the experimental
observation of complete concordance among 50 backcross samples is shown in
Figure D1.
Each value of *N* will define a different function, but in all cases the curve has the
same general shape, with only the steepness of the fall-off increasing as *N* increases. In all cases, the
"maximum likelihood estimate" for the true recombination fraction, *Theta-hat*, defined as the
value of *Theta* associated with the highest probability, will be zero. However, since this
maximum likelihood value is located at one end of the probability curve, it does not
provide a useful estimate for the likely linkage distance. A better estimate would be the
value of *Theta* that marks the midpoint of the distribution, with the true
recombination fraction equally likely to lie below it or above it; this is the definition
of the median recombination fraction estimate. In mathematical terms, the median estimate is
defined at the line which equally divides the area of the complete probability density
given by Equation D1 (see Figure D1).

Confidence limits are also defined by circumscribed portions of the entire
probability density; the portion that lies outside a confidence interval is called
*alpha*.
For
example, in the case of a 95% confidence interval,
*alpha*
= (1 - 0.95) = 0.05. It is standard
practice to assign equal portions of
*alpha*
to the two "tails" of the probability density
located before and after the central confidence interval. Thus, the lower confidence
limit is defined as the value of
*small-theta*
bordering the initial
*alpha*/2
fraction of the area under the
entire probability curve. The upper confidence limit is defined as the value of
*small-theta*
that
borders the ultimate
*alpha*/2
fraction of the area under the entire probability curve; this is
equivalent to saying that a "(1 -
*alpha*/2)"
fraction of area lies ahead of the upper confidence
limit.

In mathematical terms, the area beneath the entire probability density curve is equal
to the definite integral of
Equation D1 over the range of legitimate values for
*small-theta*
between 0.0 and 0.5. To determine the fraction of the probability density that lies in the region between
*Theta* = 0 and any arbitrary *Theta* = *x*, it is necessary to integrate over the
probability density function (Equation D1) between these two values, and divide the
result by the total area covered by the probability density. This provides the probability
that the true recombination fraction is less than or equal to *x*.

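$$P\{\Theta \le x\} = \frac{\int_{0}^{x} (1 - \theta)^{N}\, d\theta}{\int_{0}^{0.5} (1 - \theta)^{N}\, d\theta}$$
(Equation D4)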

By standard methods of calculus,
Equation D4 can be reduced analytically to the
form:

$$P\{\Theta \le x\} = \frac{1 - (1 - x)^{N+1}}{1 - 0.5^{\,N+1}}$$
(Equation D5)

And this equation can be reformulated to yield *x* as a function of *P*{*Theta* <= *x*}.

$$x = 1 - \left[\,1 - P\{\Theta \le x\}\left(1 - 0.5^{\,N+1}\right)\right]^{1/(N+1)}$$
(Equation D6)

By solving
Equation D6 for different values of:

*P*{
*Theta*
<= *x*},

one can obtain critical values of *x* that define the median estimate of the recombination fraction from:

*P*{
*Theta*
<= *x*} = 0.5,

lower confidence limits from:

*P*{
*Theta*
<= *x*} =
*alpha*/2,

and upper confidence limits from:

*P*{
*Theta*
<= *x*} = (1 -
*alpha*
/2).

Once a solution for *x* has been obtained, it can be converted into a linkage distance value with either
Equation D2 for backcross data or
Equation D3 for RI strain data. Solutions to
Equation D6 over a
range of *N* RI strains and backcross animals are shown in
Figure 9.8,
Figure 9.16, and
Figure 9.17.

The statistical approach described above can be generalized to any case of *i*
discordant (or recombinant) samples observed among a total of *N* RI strains or
backcross animals that have been typed for two loci. As in the special case above, one
can arrive at a probability for the occurrence of multiple events by multiplying together
the individual probabilities for each event. In the general case, there will be *i* events of
discordance, each with an individual probability equal to the true recombination fraction
*Theta*,
and (*N* - *i*) events of concordance, each with an individual probability of (1 -
*Theta*).
These terms are multiplied together along with a "binomial coefficient" that counts
the permutations in which the two types of events can appear to produce the
"binomial formula":

$$P\{i \mid N, \Theta\} = \frac{N!}{i!\,(N - i)!}\;\Theta^{i}\,(1 - \Theta)^{N - i}$$
(Equation D7)

When the true recombination fraction is known, the binomial formula can be used
to provide the probability that *i* events of discordance will be observed in any set of *N*
samples. But once again, the situation encountered by geneticists is usually the reverse
one in which *i* and *N* are discrete values determined by the experiment and the true
recombination fraction
*Theta*
is unknown. In this case, one can substitute the random variable
*small-theta*
in place of
*Theta*
in
Equation D7 to generate a probability density function that
provides relative likelihoods for different values of
*Theta*
between 0.0 (complete linkage)
and 0.5 (no linkage). In this use of the binomial formula, the factorial fraction (known
as the binomial coefficient) remains constant for all values of
*small-theta*
and can be eliminated
since the purpose of the function is to provide relative probabilities only:

$$P\{\Theta = \theta\} = f(\theta) = \theta^{i}\,(1 - \theta)^{N - i}$$
(Equation D8)

An example of the probability density function associated with the experimental
observation of one discordant RI strain among a total of 26 samples is shown in
Figure D2.
As one can easily see, the distribution is highly skewed toward higher
recombination fractions. Each discrete pair of values *i* and *N* will define a different
function. When both *i* and *N* are large, the density function will approximate a normal
distribution. However, with the results typically obtained in contemporary mouse
linkage studies, the density function is likely to be significantly skewed as shown in
Figure D2
and as such, it is usually not possible to take advantage of the simplified
statistical tools developed specially for use with the normal distribution.

A median estimate of linkage distance as well as lower and upper confidence limits
can be obtained in the same manner as in the special case of no recombination
described above. This can be accomplished by substituting
Equation D8 in place of the two occurrences of Equation D1 within Equation D4:

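$$P\{\Theta \le x\} = \frac{\int_{0}^{x} \theta^{i}\,(1 - \theta)^{N - i}\, d\theta}{\int_{0}^{0.5} \theta^{i}\,(1 - \theta)^{N - i}\, d\theta}$$
(Equation D9)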

The general form of the integral in this equation cannot be solved analytically but a
short computer program can be used to estimate solutions and provide critical values of
*x* for defined probability values. The computer program has been written to generate
minimum and maximum values in terms of centimorgan distances for discrete
experimentally determined values of *i* and *N* from either backcross or RI data. The
program was used to generate the values shown in
Table D1,
Table D2,
Table D3,
Table D4,
Table D5, and
Table D6
for 68% and 95% confidence intervals, but it is possible to generate confidence limits
for any other integer percentile confidence interval as well. The program will also calculate
maximum likelihood and median estimates of linkage distance
^{109}.
It is listed below as a self-contained unit that should be ready for compiling with any standard C compiler on
any computer. DOS and Macintosh versions of the executable program can be downloaded over the internet from the following anonymous FTP site:
bioweb.princeton.edu.
Interested investigators should look in the folder entitled pub/mouse.

    /*** A C program for the calculation of linkage distance estimates and confidence intervals ***/
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    double Pin(double r, int i, int N);
    double convert(double r);

    static int crosstype;   /* 1 = backcross, 2 = RI strains */

    int main(void)
    {
        int i, istart = 1, ifin = 50, iinc = 1, N = 100, P;
        double dmin, dmax, dmedian, smean, r, Nrmlize, Sum, min, max;

        while (1) {
            printf("Enter the type of cross: 1 for backcross, 2 for RI analysis, or 3 to quit: ");
            if (scanf("%d", &crosstype) != 1 || (crosstype != 1 && crosstype != 2))
                exit(0);
            printf("Enter the confidence level as an integer number (e.g. 95 for 95%%): ");
            if (scanf("%d", &P) != 1)
                exit(1);
            min = (1 - ((double)P / 100.0)) / 2;   /* area in the lower tail */
            max = 1 - min;                         /* area below the upper limit */
            printf("Enter with comma delimiters->i-start,i-end,i-increment,and N,then return\n:>");
            if (scanf("%d,%d,%d,%d", &istart, &ifin, &iinc, &N) != 4 || iinc < 1)
                exit(1);
            printf("  i, dist / medn, min. / max. (values in cM assuming complete interference)\n");
            for (i = istart; i <= ifin; i += iinc) {
                /* total area under the density (Equation D8) from 0 to 0.5 */
                for (r = .0001, Nrmlize = 0; r < .5; r += .0001)
                    Nrmlize += Pin(r, i, N);
                /* accumulate normalized area until each critical fraction is reached */
                for (r = .0001, Sum = 0; Sum < min && r < .5; r += .0001)
                    Sum += Pin(r, i, N) / Nrmlize;
                dmin = convert(r);
                for (; Sum < .5 && r < .5; r += .0001)
                    Sum += Pin(r, i, N) / Nrmlize;
                dmedian = convert(r);
                for (; Sum < max && r < .5; r += .0001)
                    Sum += Pin(r, i, N) / Nrmlize;
                dmax = convert(r);
                smean = convert((double)i / N);    /* maximum likelihood estimate */
                printf("%3d, %4.1f / %4.1f, %4.1f / %4.1f\n", i, smean, dmedian, dmin, dmax);
            }
        }
    }

    /* Convert a recombination fraction to centimorgans (Equations D2 and D3). */
    double convert(double r)
    {
        if (crosstype == 2)
            return r * 100 / (4 - 6 * r);
        return 100 * r;
    }

    /* Relative likelihood of recombination fraction r given i recombinants among N samples. */
    double Pin(double r, int i, int N)
    {
        return pow(r, i) * pow(1 - r, N - i);
    }
    /************************ END OF PROGRAM ***********************/

How does one determine whether two populations of animals defined by different
inbred strains are showing a significant difference in the expression of a trait? The
answer is with a test statistic known as the "*t*-test" or "Student's *t*-test". To apply this
test, one needs to use only three values derived from an analysis of the
expression of the trait in sets of animals from each inbred strain. First is the number of
animals examined in each inbred set (*N _{1}* and

(Equation D10)

where

(Equation D11)

With values for the variance of each sample set and the size of each set, one can calculate a combined parameter referred to as the "pooled variance":

(Equation D12)

Finally, one can use the value obtained for the pooled variance together with the sample sizes and sample means to obtain a "

(Equation D13)

One final combined parameter is required to convert the

(Equation D14)

With values for