In statistical analysis, the *variance* among members of a data set shows how far apart the data points are from a trend line, also known as a *regression line*. The higher the variance, the more spread out the data points are. The study of the analysis of variance shows which parts of the variance can be explained by characteristics of the data, and which can be attributed to random factors. **The portion of the variance that cannot be explained is called the residual variance**.

## Using Excel Spreadsheets to Calculate Residual Variance

The formula to calculate residual variance involves numerous complex calculations. For small data sets, the process of calculating the residual variance by hand can be tedious. For large data sets, the task can be exhausting. By using an Excel spreadsheet, you only need to enter the data points and select the correct formula. The program handles the complex calculations and delivers a result quickly.

## Data Points

Open a new Excel spreadsheet and enter the data points into two columns. Regression lines require that each data point have two elements. Statisticians typically label these elements "X" and "Y." For example, Generic Insurance Co. wants to find the residual variance of the height and weight of its employees. The X variable represents the height and the Y variable represents the weight. Enter the heights into Column A and the weights into Column B.

## Finding the Mean

The *mean* represents the average for each element in the data set. In this example, Generic Insurance wants to find the average, standard deviation and covariance of 10 employees' heights and weights. The average of the heights listed in Column A can be found by entering the function "=AVERAGE(A1:A10)" into cell F1. The average of the weights listed in Column B can be found by entering the function "=AVERAGE(B1:B10)" into cell F3.

## Finding the Standard Deviation and Covariance

The *standard deviation* measures how far apart the data points are spread from the mean. The *covariance* measures how much the two elements of the data point change together. The standard deviation of the heights is found by entering the function "=STDEV(A1:A10)" into cell F2. The standard deviation of the weights is found by entering the function "=STDEV(B1:B10)" into cell F4. The covariance between the heights and weights is found by entering the function "=COVAR(A1:A10;B1:B10)" into cell F5.

## Finding the Regression Line

The *regression line* represents a linear function that follows the trend of the data points. The formula for the regression line looks like this: Y = aX + b.

The user can find the values for "a" and "b" by using the calculations for the means, standard deviations and covariance. The value for "b" represents the point where the regression line intercepts the Y-axis. The value can be found by taking the covariance and dividing it by the square of the standard deviation of the X-values. The Excel formula goes into cell F6 and looks like this: =F5/F2^2.

The value for "a" represents the slope of the regression line. The Excel formula goes into cell F7 and looks like this: =F3-F6*F1.

To see the formula for the regression line, enter this string concatenation into cell F8:

=CONCATENATE("Y = ";ROUND(F6;2);"X";IF(SIGN(F7)=1;" + ";" - ");ABS(ROUND(F7;2)))

## Calculate Y Values

The next step involves calculating the Y-values on the regression line for the given X-values in the data set. The formula to find the Y values goes into column C and looks like this:

=$F$6*A(i)+$F$7

Where A(i) is the value for Column A in Row (i). The formulas look like this in the spreadsheet:

=$F$6*A1+$F$7

=$F$6*A2+$F$7

=$F$6*A3+$F$7, and so on

The entries in Column D show the differences between the expected and actual values for Y. The formulas look like this:

=B(i)-C(i),

Where B(i) and C(i) are the values in Row(i) in Columns B and C, respectively.

## Finding the Residual Variance

The **formula for residual variance** goes into Cell F9 and looks like this:

=SUMSQ(D1:D10)/(COUNT(D1:D10)-2)

Where SUMSQ(D1:D10) is the sum of the squares of the differences between the actual and expected Y values, and (COUNT(D1:D10)-2) is the number of data points, minus 2 for degrees of freedom in the data.