# How to Find Marginal Distribution

When you collect data for a statistical problem involving multiple variables, you have to consider every possible combination of those variables. But when you analyze that data, you might be asked to look at the distribution of data for just one of the variables and ignore the others completely. That's what it means to find the "marginal distribution" of a multi-variable trial, experiment or data set.

## Tip

Finding the marginal distribution simply means finding the full distribution of one variable in a multi-variable sample set.

When we talk about "distribution" in statistics, what we really mean is this: We have a set of data, and we're going to put out a bunch of metaphorical buckets that correspond to the possible data values within that data set. Then we're going to distribute each value of the data set into the bucket with the corresponding value.

For example, imagine that you polled your class of 18 students for their heights and found that one was under 5 feet tall, four were between 5 feet tall and 5 feet 4 inches, five were between 5 feet 5 inches and 5 feet 9 inches, six were between 5 feet 10 inches and 6 feet tall, and the remaining two students were more than 6 feet tall. Then each of your "buckets" would have the corresponding values in them:

- Under 5 feet: 1
- Between 5 feet and 5 feet 4 inches: 4
- Between 5 feet 5 inches and 5 feet 9 inches: 5
- Between 5 feet 10 inches and 6 feet: 6
- Over 6 feet: 2

You could use actual buckets, filling them with small items like beans to represent each data point, but that gets messy fast. So instead, the data is usually organized in a chart to show all the different values of the variable you're considering.
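The bucketing idea can also be sketched in a few lines of Python. This is only an illustration: the individual heights in `heights_in` are invented (the example gives only the bucket totals), and `height_bucket` is a hypothetical helper name, chosen so that the counts come out to the same 1, 4, 5, 6 and 2.

```python
from collections import Counter

# Hypothetical heights in whole inches for the 18 students.
# These individual values are invented; only the bucket totals come from the example.
heights_in = [58, 60, 62, 63, 64, 65, 66, 67, 68, 69,
              70, 70, 71, 71, 72, 72, 74, 75]


def height_bucket(inches):
    """Return the label of the bucket that a height in whole inches falls into."""
    if inches < 60:
        return "Under 5 feet"
    if inches <= 64:
        return "5 feet to 5 feet 4 inches"
    if inches <= 69:
        return "5 feet 5 inches to 5 feet 9 inches"
    if inches <= 72:
        return "5 feet 10 inches to 6 feet"
    return "Over 6 feet"


# Drop every data point into its bucket and count what lands in each one.
distribution = Counter(height_bucket(h) for h in heights_in)

for label, count in distribution.items():
    print(f"{label}: {count}")
```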

What if you also noted the students' hair color? Well, you're going to need a lot more buckets, because now you have to account for every possible combination of hair color and height. As with a single-variable data set, all that information about two variables is best expressed in a chart; because there are two variables at work, it's called a "two-way table," with one variable shown in the rows and the other in the columns. The name "marginal distribution" is quite apt, because the way those two-way tables are filled out, you'll find the information you're looking for in the margins: the row totals in the far right margin or the column totals along the very bottom.

Now here's the great news: If you're asked to find marginal distribution for one of the two variables you're dealing with, you're being asked to completely disregard the other variable. Pretend it doesn't exist. Then all you have to do is determine how the data points are distributed for the remaining variable. In other words, how many data points are in each of the "buckets" for that remaining variable?

And that's exactly what you already did when you looked at the data distribution for students' height in your class. Here's another look at your data which, as you'll notice, completely disregards the existence of students' hair colors:

- Under 5 feet: 1
- Between 5 feet and 5 feet 4 inches: 4
- Between 5 feet 5 inches and 5 feet 9 inches: 5
- Between 5 feet 10 inches and 6 feet: 6
- Over 6 feet: 2
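To see where those numbers sit in a two-way table, here is a minimal Python sketch. The split of each height category across hair colors is made up for illustration (the example never gives it), and the three hair colors are an assumption; only the row totals of 1, 4, 5, 6 and 2 come from the class data.

```python
# Two-way table: rows are height categories, columns are hair colors.
# The hair-color split in each row is invented for illustration;
# only the row totals (1, 4, 5, 6, 2) come from the class example.
hair_colors = ["black", "brown", "blond"]
two_way = {
    "Under 5 feet":                       [0, 1, 0],
    "5 feet to 5 feet 4 inches":          [1, 2, 1],
    "5 feet 5 inches to 5 feet 9 inches": [2, 2, 1],
    "5 feet 10 inches to 6 feet":         [2, 3, 1],
    "Over 6 feet":                        [1, 1, 0],
}

# Marginal distribution of height: add across each row (the right margin).
height_marginal = {height: sum(row) for height, row in two_way.items()}

# Marginal distribution of hair color: add down each column (the bottom margin).
hair_marginal = {
    color: sum(row[i] for row in two_way.values())
    for i, color in enumerate(hair_colors)
}

print(height_marginal)
print(hair_marginal)
```

Summing across a row is the code version of pretending the hair-color variable doesn't exist. Summing down a column does the same thing to height.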

There are two important things to keep in mind about this data. First, your marginal distribution can be expressed as counts or as percentages. Second, the values have to account for the whole data set: percentages must add up to 100% (or to 1, if you're writing them as decimals), and counts must add up to the total number of trials or data entries in your set. You can check that here. Remember, there were 18 students, and 1 + 4 + 5 + 6 + 2 = 18.

To express your marginal values as percentages, divide the count for each category by the total number of data points. So for the "under 5 feet" category, 1 ÷ 18 ≈ 0.056, or 5.6%. In data sets where you can treat those proportions as probabilities, the marginal value expressed as a percentage can also be called the marginal probability.
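The same division can be applied to every category at once. The short Python sketch below uses the height counts from the class example; the variable names are only illustrative.

```python
import math

# Marginal counts for height, taken from the class example.
counts = {
    "Under 5 feet": 1,
    "5 feet to 5 feet 4 inches": 4,
    "5 feet 5 inches to 5 feet 9 inches": 5,
    "5 feet 10 inches to 6 feet": 6,
    "Over 6 feet": 2,
}

total = sum(counts.values())  # total number of students

# Divide each count by the total to get a proportion between 0 and 1.
proportions = {label: count / total for label, count in counts.items()}

for label, p in proportions.items():
    print(f"{label}: {p:.3f} ({p:.1%})")

# The proportions should add up to 1, allowing for floating-point rounding.
assert math.isclose(sum(proportions.values()), 1.0)
```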

Do you have the concept of marginal distribution now? Good. Since you've come this far, let's go ahead and discuss two related, but slightly different, types of distribution that come up when you're analyzing two different variables.

The first is conditional distribution. This means that you hold one of the variables (in this case, hair color) at a fixed value, and then look at all the possible data values of the other variable. So if you fix the "hair color" variable at brown hair, you'd then be looking at the distribution of the other variable (height) among only the students who have brown hair.

The second is joint distribution. This describes how your data are distributed across the combinations of values of both variables at once. For example, counting how many of your fellow students are both black-haired and over 6 feet tall gives you one entry of the joint distribution.
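Both ideas can be sketched against a two-way table in Python. As a loud caveat, the hair-color counts below are invented purely for illustration; the class example only supplies the height totals, so none of the per-color figures should be read as real data.

```python
# Two-way table: rows are height categories, columns are hair colors.
# The hair-color split in each row is invented for illustration.
hair_colors = ["black", "brown", "blond"]
two_way = {
    "Under 5 feet":                       [0, 1, 0],
    "5 feet to 5 feet 4 inches":          [1, 2, 1],
    "5 feet 5 inches to 5 feet 9 inches": [2, 2, 1],
    "5 feet 10 inches to 6 feet":         [2, 3, 1],
    "Over 6 feet":                        [1, 1, 0],
}

# Conditional distribution of height given brown hair:
# keep only the "brown" column, then divide by that column's total.
brown = hair_colors.index("brown")
height_given_brown = {height: row[brown] for height, row in two_way.items()}
brown_total = sum(height_given_brown.values())
height_given_brown_share = {
    height: count / brown_total for height, count in height_given_brown.items()
}

# Joint distribution: every cell of the table divided by the grand total.
grand_total = sum(sum(row) for row in two_way.values())
joint = {
    (height, color): row[i] / grand_total
    for height, row in two_way.items()
    for i, color in enumerate(hair_colors)
}

print(height_given_brown_share)
print(joint[("Over 6 feet", "black")])
```

Notice the difference in the denominators: the conditional distribution divides by the number of brown-haired students only, while the joint distribution divides by everyone in the class.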