To illustrate the statistical approach used to estimate confidence limits on experimentally determined values for linkage distances, it is useful to first consider the special case in which two linked loci show complete concordance, or no recombination (symbolized as R = 0), in their allelic segregations among a set of N samples derived either from recombinant inbred (RI) strains or from the offspring of a backcross. Let us define the true recombination fraction Theta as the fraction of samples expected to be discordant (or recombinant) as N approaches infinity. Then the probability of recombination in any one sample is simply Theta, and the probability of non-recombination, or concordance, is simply (1 - Theta). As long as multiple events are completely independent of each other, one can calculate the probability that all of them will occur by multiplying together the individual probabilities associated with each event. Thus, if the probability of concordance in one sample is (1 - Theta), then the probability of concordance in all N samples is (1 - Theta)^N.
In most experimental situations, the known and unknown variables are reversed in that one begins by determining the number of discordant (or recombinant) samples i that occur within a total set of N as a means to estimate the unknown true recombination fraction Theta. When no discordant samples are observed, the probability term just derived can be used, with the substitution of the random variable small-theta in place of Theta, to provide a continuous probability density function indicative of the relative likelihoods for different values of Theta between 0.0 (complete linkage) and 0.5 (no linkage).
(Equation D1)
$$P\{\Theta = \theta\} = f(\theta) = (1 - \theta)^{N}$$
This equation reads "the probability that the true recombination fraction Theta is equal to a particular value small-theta is the function of small-theta given as the last term in the equation". For both RI data and backcross data, Theta can be related directly to linkage distance in centimorgans, d. In the case of backcross data, and for values of Theta less than 0.25 (see Section 7.2.2.3), recombination fractions are converted into centimorgan estimates through simple multiplication:
(Equation D2)
$$d = 100\,\Theta$$
In the case of RI data, this conversion is combined with the Haldane-Waddington equation (Equation 9.8) to yield:
(Equation D3)
$$d = \frac{100\,\Theta}{4 - 6\,\Theta}$$
An example of the probability density function associated with the experimental observation of complete concordance among 50 backcross samples is shown in Figure D1. Each value of N defines a different function, but in all cases the curve has the same general shape, with only the steepness of the fall-off increasing as N increases. In all cases, the "maximum likelihood estimate" for the true recombination fraction, Theta-hat, defined as the value of Theta associated with the highest probability, will be zero. However, since this maximum likelihood value is located at one end of the probability curve, it does not provide a useful estimate of the likely linkage distance. A better estimate is the value that the true recombination fraction is equally likely to lie below or above; this is the definition of the median recombination fraction estimate. In mathematical terms, the median estimate lies at the line that divides the area under the complete probability density given by Equation D1 into two equal halves (see Figure D1).
Confidence limits are also defined by circumscribed portions of the entire probability density; the portion that lies outside a confidence interval is called alpha. For example, in the case of a 95% confidence interval, alpha = (1 - 0.95) = 0.05. It is standard practice to assign equal portions of alpha to the two "tails" of the probability density located before and after the central confidence interval. Thus, the lower confidence limit is defined as the value of small-theta bordering the initial alpha/2 fraction of the area under the entire probability curve. The upper confidence limit is defined as the value of small-theta that borders the ultimate alpha/2 fraction of the area under the entire probability curve; this is equivalent to saying that a "(1 - alpha/2)" fraction of area lies ahead of the upper confidence limit.
In mathematical terms, the area beneath the entire probability density curve is equal to the definite integral of Equation D1 over the range of legitimate values for small-theta between 0.0 and 0.5. To determine the fraction of the probability density that lies in the region between Theta = 0 and any arbitrary Theta = x, it is necessary to integrate over the probability density function (Equation D1) between these two values, and divide the result by the total area covered by the probability density. This provides the probability that the true recombination fraction is less than or equal to x.
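(Equation D4)
$$P\{\Theta \le x\} = \frac{\int_{0}^{x} (1 - \theta)^{N}\, d\theta}{\int_{0}^{0.5} (1 - \theta)^{N}\, d\theta}$$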
By standard methods of calculus, Equation D4 can be reduced analytically to the form:
(Equation D5)
$$P\{\Theta \le x\} = \frac{1 - (1 - x)^{N+1}}{1 - 0.5^{\,N+1}}$$
And this equation can be reformulated to yield x as a function of P{Theta <= x}.
(Equation D6)
$$x = 1 - \left[\,1 - P\{\Theta \le x\}\left(1 - 0.5^{\,N+1}\right)\right]^{1/(N+1)}$$
By solving Equation D6 for different values of P{Theta <= x}, one can obtain critical values of x that define the median estimate of the recombination fraction from P{Theta <= x} = 0.5, lower confidence limits from P{Theta <= x} = alpha/2, and upper confidence limits from P{Theta <= x} = (1 - alpha/2).
Once a solution for x has been obtained, it can be converted into a linkage distance value with either Equation D2 for backcross data or Equation D3 for RI strain data. Solutions to Equation D6 over a range of N RI strains and backcross animals are shown in Figure 9.8, Figure 9.16, and Figure 9.17.
The statistical approach described above can be generalized to any case of i discordant (or recombinant) samples observed among a total of N RI strains or backcross animals that have been typed for two loci. As in the special case above, one can arrive at a probability for the occurrence of multiple events by multiplying together the individual probabilities for each event. In the general case, there will be i events of discordance, each with an individual probability equal to the true recombination fraction Theta, and (N - i) events of concordance, each with an individual probability of (1 - Theta). These terms are multiplied together, along with a "binomial coefficient" that counts the number of different orders in which the two types of events can appear, to produce the "binomial formula":
(Equation D7)
$$P\{i \mid N, \Theta\} = \frac{N!}{i!\,(N - i)!}\;\Theta^{\,i}\,(1 - \Theta)^{N - i}$$
When the true recombination fraction is known, the binomial formula can be used to provide the probability that i events of discordance will be observed in any set of N samples. But once again, the situation encountered by geneticists is usually the reverse one in which i and N are discrete values determined by the experiment and the true recombination fraction Theta is unknown. In this case, one can substitute the random variable small-theta in place of Theta in Equation D7 to generate a probability density function that provides relative likelihoods for different values of Theta between 0.0 (complete linkage) and 0.5 (no linkage). In this use of the binomial formula, the factorial fraction (known as the binomial coefficient) remains constant for all values of small-theta and can be eliminated since the purpose of the function is to provide relative probabilities only:
(Equation D8)
$$P\{\Theta = \theta\} = f(\theta) = \theta^{\,i}\,(1 - \theta)^{N - i}$$
An example of the probability density function associated with the experimental observation of one discordant RI strain among a total of 26 samples is shown in Figure D2. As one can easily see, the distribution is highly skewed toward higher recombination fractions. Each discrete pair of values i and N will define a different function. When both i and N are large, the density function will approximate a normal distribution. However, with the results typically obtained in contemporary mouse linkage studies, the density function is likely to be significantly skewed as shown in Figure D2 and as such, it is usually not possible to take advantage of the simplified statistical tools developed specially for use with the normal distribution.
A median estimate of linkage distance, as well as lower and upper confidence limits, can be obtained in the same manner as in the special case of no recombination described above. This is accomplished by substituting Equation D8 in place of the two occurrences of Equation D1 within Equation D4:
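(Equation D9)
$$P\{\Theta \le x\} = \frac{\int_{0}^{x} \theta^{\,i}\,(1 - \theta)^{N - i}\, d\theta}{\int_{0}^{0.5} \theta^{\,i}\,(1 - \theta)^{N - i}\, d\theta}$$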
Equation D9 cannot, in general, be solved analytically for x, but a short computer program can be used to estimate solutions numerically and provide critical values of x for defined probability values. The computer program has been written to generate minimum and maximum values in terms of centimorgan distances for discrete experimentally determined values of i and N from either backcross or RI data. The program was used to generate the values shown in Table D1, Table D2, Table D3, Table D4, Table D5, and Table D6 for 68% and 95% confidence intervals, but it is possible to generate confidence limits for any other integer percentile confidence interval as well. The program will also calculate maximum likelihood and median estimates of linkage distance (note 109). It is listed below as a self-contained unit that should be ready for compiling with any standard C compiler on any computer. DOS and Macintosh versions of the executable program can be downloaded over the internet from the following anonymous FTP site: bioweb.princeton.edu. Interested investigators should look in the folder entitled pub/mouse.

    /*** A C program for the calculation of linkage distance estimates and confidence intervals ***/
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    double Pin(double r, int i, int N);
    double convert(double r);

    static int crosstype;                  /* 1 = backcross, 2 = RI strains */

    int main(void)
    {
        int i, istart = 1, ifin = 50, iinc = 1, N = 100, P;
        double dmin, dmax, dmedn, dist, r, Nrmlize, Sum, min, max;

        while (1) {
            printf("Enter the type of cross: 1 for backcross, 2 for RI analysis, or 3 to quit: ");
            if (scanf("%d", &crosstype) != 1 || (crosstype != 1 && crosstype != 2))
                exit(0);
            printf("Enter the confidence level as an integer number (e.g. 95 for 95%%): ");
            if (scanf("%d", &P) != 1)
                exit(0);
            min = (1 - ((double)P / 100.0)) / 2;   /* alpha/2 */
            max = 1 - min;                         /* 1 - alpha/2 */
            printf("Enter with comma delimiters->i-start,i-end,i-increment,and N,then return\n:>");
            if (scanf("%d,%d,%d,%d", &istart, &ifin, &iinc, &N) != 4 || iinc < 1)
                exit(0);
            printf("  i, dist / medn, min. / max. (values in cM assuming complete interference)\n");
            for (i = istart; i <= ifin; i += iinc) {
                /* total area under the density between 0 and 0.5 */
                for (r = .0001, Nrmlize = 0; r < .5; r += .0001)
                    Nrmlize += Pin(r, i, N);
                /* accumulate normalized area up to the lower limit ... */
                for (r = .0001, Sum = 0; Sum < min && r < .5; r += .0001)
                    Sum += Pin(r, i, N) / Nrmlize;
                dmin = convert(r);
                /* ... then up to the median ... */
                for (; Sum < .5 && r < .5; r += .0001)
                    Sum += Pin(r, i, N) / Nrmlize;
                dmedn = convert(r);
                /* ... and finally up to the upper limit */
                for (; Sum < max && r < .5; r += .0001)
                    Sum += Pin(r, i, N) / Nrmlize;
                dmax = convert(r);
                dist = convert((double)i / N);     /* maximum likelihood estimate */
                printf("%3d, %4.1f / %4.1f, %4.1f / %4.1f\n", i, dist, dmedn, dmin, dmax);
            }
        }
    }

    /* Convert a recombination fraction into a distance in centimorgans */
    double convert(double r)
    {
        if (crosstype == 2)
            return r * 100 / (4 - 6 * r);          /* RI data (Equation D3) */
        return 100 * r;                            /* backcross data (Equation D2) */
    }

    /* Relative likelihood of recombination fraction r (Equation D8) */
    double Pin(double r, int i, int N)
    {
        return pow(r, i) * pow(1 - r, N - i);
    }
    /************************ END OF PROGRAM ***********************/

How does one determine whether two populations of animals defined by different inbred strains show a significant difference in the expression of a trait? The answer is with a test statistic known as the "t-test" or "Student's t-test". To apply this test, one needs only three values for each of the two inbred strains, derived from an analysis of the expression of the trait in a set of animals from each strain. First is the number of animals examined in each inbred set (N1 and N2). Second is the mean level of expression for each set (m1 and m2), calculated as:
(Equation D10)
$$m = \frac{1}{N}\sum_{i=1}^{N} x_i$$
where xi refers to the expression value obtained for the ith sample in the set. Third is the variance of each set of animals (s1^2 and s2^2), calculated as:
(Equation D11)
$$s^2 = \frac{1}{N - 1}\sum_{i=1}^{N} (x_i - m)^2$$
With values for the variance of each sample set and the size of each set, one can calculate a combined parameter referred to as the "pooled variance":
(Equation D12)
$$s_p^2 = \frac{(N_1 - 1)\,s_1^2 + (N_2 - 1)\,s_2^2}{N_1 + N_2 - 2}$$
Finally, one can use the value obtained for the pooled variance together with the sample sizes and sample means to obtain a "t value":
(Equation D13)
$$t = \frac{m_1 - m_2}{\sqrt{s_p^2\left(\dfrac{1}{N_1} + \dfrac{1}{N_2}\right)}}$$
One final combined parameter is required to convert the t value into a level of significance: the number of degrees of freedom, df.
(Equation D14)
$$df = N_1 + N_2 - 2$$
With values for t and df, one can obtain a P value from a table of critical values for 
the t distribution found in 
Table D7.