What is the relationship between two variables?

There are several different kinds of relationships between variables. Before drawing a conclusion, you should first understand how one variable changes with the other. This means you need to establish how the variables are related - is the relationship linear or quadratic or inverse or logarithmic or something else?

Relationships in Physical and Social Sciences

Suppose you measure the volume of a gas in a cylinder and also measure its pressure. Now you start compressing the gas by pushing a piston, all while keeping the gas at room temperature. The volume of the gas decreases while the pressure increases. You plot the different values on graph paper.

If you take enough measurements, you can see the shape of a hyperbola defined by xy = constant. This is because gases follow Boyle's law, which says that when temperature is constant, PV = constant. Here, by taking data, you are relating the pressure of the gas to its volume. Many other relationships, by contrast, are linear in nature.
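As a small illustration of the Boyle's law relationship PV = constant, the Python sketch below uses made-up numbers. The constant `K` and the list of volumes are assumptions chosen for illustration, not measurements. It shows that the product of pressure and volume stays fixed while the pressure falls as the volume grows.

```python
# Hypothetical illustration of Boyle's law: P * V = k at constant temperature.
# The constant and the volumes are invented values, not measured data.
K = 2400.0  # assumed value of P * V, in arbitrary units


def pressure(volume, k=K):
    """Pressure of the gas at a given volume when P * V = k."""
    return k / volume


volumes = [10.0, 20.0, 30.0, 40.0, 60.0]
pressures = [pressure(v) for v in volumes]

for v, p in zip(volumes, pressures):
    # The product P * V is the same for every pair of values.
    print(f"V = {v:5.1f}  P = {p:6.1f}  P*V = {p * v:.1f}")
```

Halving the volume doubles the pressure, which is why the plotted points bend along a curve rather than falling on a straight line.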

Relationships between variables need to be studied and analyzed before drawing conclusions based on them. In natural science and engineering, this is usually more straightforward, as you can keep all parameters except one constant and study how this one parameter affects the result under study.

However, in social sciences, things get much more complicated because parameters may or may not be directly related. There could be a number of indirect consequences and deducing cause and effect can be challenging.

Only when the change in one variable actually causes the change in another is there a causal relationship. Otherwise, it is simply a correlation. Correlation doesn't imply causation. Confusing the two is a common fallacy, and examples are easy to find.

A famous example to prove the point: increased ice-cream sales show a strong correlation with deaths by drowning. It would obviously be wrong to conclude that consuming ice cream causes drowning. The explanation is that more ice cream gets sold in the summer, when more people go to the beach and other bodies of water, and therefore more people drown.
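The ice-cream example can be mimicked with a small simulation. The sketch below is purely illustrative: every number in it (the temperature range, the coefficients, the noise levels) is an invented assumption, and `pearson_r` is a helper written for this example. Temperature drives both quantities, and neither appears in the other's formula, yet the two end up strongly correlated. Restricting attention to days with nearly the same temperature makes most of that correlation disappear.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Invented model: temperature is the hidden common cause.
temps = [rng.uniform(5, 35) for _ in range(5000)]           # degrees C
ice_cream = [20 + 3 * t + rng.gauss(0, 10) for t in temps]  # daily sales
drownings = [0.5 + 0.1 * t + rng.gauss(0, 0.5) for t in temps]

# Correlation over all days: strong, although neither causes the other.
r_all = pearson_r(ice_cream, drownings)

# Correlation on days with almost the same temperature (19 to 21 degrees).
mild = [i for i, t in enumerate(temps) if 19 <= t < 21]
r_mild = pearson_r([ice_cream[i] for i in mild],
                   [drownings[i] for i in mild])

print(f"all days:  r = {r_all:.2f}")
print(f"mild days: r = {r_mild:.2f}")
```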

Positive and Negative Correlation

Correlation between variables can be positive or negative. Positive correlation means that an increase in one quantity is associated with an increase in the other, whereas in negative correlation, an increase in one variable is associated with a decrease in the other.

It is important to understand the relationship between variables to draw the right conclusions. Even the best scientists can get this wrong and there are several instances of how studies get correlation and causation mixed up.

Relations Between Variables

Scientists are forever trying to find relations between quantities:

Does the number of minutes of exercise per week influence blood pressure?
Does the amount of time required for a ball to roll down a ramp depend on the slope of the ramp?
Does the amount of fertilizer I use on plants affect their size?

In each case above, the scientists run experiments and collect pairs of numbers for the quantities that they are trying to relate to each other:

· (weekly exercise minutes, systolic blood pressure) , e.g. (35, 136), (0, 155), (200, 121), …

· (angle of ramp (deg.), time of ball (sec.)), e.g. (5, 17), (20, 6), (45, 3), …

· (cc fertilizer/sq. meter garden, plant height m.), e.g. (.5, .33), (.77, .02), (.01,.54), …

They might plot these number pairs on a graph and examine the graph for a trend. For example, in repeating the ball-rolling experiment pictured below, the student records the pairs of numbers representing the angle of the ramp and the corresponding “rolling times”. She then plots these numbers on a graph as follows:

What can you conclude about a relationship between the angle of the ramp and the time the ball rolls? Can you explain this?

Another student does a very careful experiment to relate the growth of plants with the amount of fertilizer used and comes up with the following graph:

Is there a relationship between plant height and the amount of fertilizer used? Can you explain this?

These examples of seeking a relationship between variables can be quantified by using methods of statistical analysis called Correlation and Regression.

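Correlation analysis seeks to identify (by a single number) the degree to which there is a (linear) relation between the numbers in sets of data pairs. The correlation coefficient of a set of data pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ with x- and y-means $\bar{x}$ and $\bar{y}$ respectively is

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$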

You don’t need to worry about computing this number by hand; it’s easy to use a computer to calculate it. The interpretation of this number is more important: it always lies between -1 and 1. The closer r is to 1, the more positively correlated the sets of numbers are, in the sense that an increase in x corresponds to a proportional increase in y, and a decrease in x corresponds to a proportional decrease in y. On the other hand, if r is close to -1, then increases in x correspond to decreases in y and decreases in x correspond to increases in y, so we say that x and y are negatively correlated. Finally, if r is close to zero, there is little if any linear relationship between the variables, and we say they are uncorrelated.
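To make the calculation concrete, here is a minimal Python sketch of the standard (Pearson) correlation coefficient. The function name `correlation_coefficient` is an assumption made for this example. The data are only the three sample (angle, time) pairs quoted earlier in the text, not the student's full data set, so the result need not match the value reported for the complete experiment.

```python
from math import sqrt


def correlation_coefficient(xs, ys):
    """Pearson correlation coefficient r of the pairs (xs[i], ys[i])."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: how x and y vary together around their means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: how much x and y each vary on their own.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# The three example (angle in degrees, time in seconds) pairs from the text.
angles = [5, 20, 45]
times = [17, 6, 3]

r = correlation_coefficient(angles, times)
print(f"r = {r:.2f}")  # negative: steeper ramps go with shorter times
```

Data that fall exactly on a rising line give r = 1, and data exactly on a falling line give r = -1.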

Consider the earlier graphs from the “ball-rolling” and “fertilizer” experiments:

1. In the graph of time of rolling vs. angle of ramp, as the angle increases, does the rolling time generally increase, decrease, or change in an unrelated fashion?
Explain your answer from the graph.
The correlation coefficient for this data turns out to be -0.84. Does this agree with your answers above? Explain.

2. In the graph of plant height vs. fertilizer concentration, as the amount of fertilizer per square meter increases, does plant height generally increase, decrease, or change in an unrelated fashion?
Explain your answer from the graph.
The correlation coefficient for this data turns out to be 0.37. Does this agree with your answers above? Explain.

3. Consider the graph that represents the weight (lb.) vs. height (in.) for players on last year’s Cincinnati Bengals football team.

As the height of players increases, does the weight generally increase, decrease, or change in an unrelated fashion?
Explain your answer from the graph.
Guess which number below best represents the correlation coefficient for this data. Explain your guess.

i. -0.75 ii. 0.03 iii. 0.73 iv. 0.99

4. Consider the graph generated from Dr. Denice Robertson’s research on lobsters and their production of eggs. She has measured the number of eggs produced by a lobster and the lobster’s length (mm.). Her data is graphed below.

As the length of the lobster increases, does the number of eggs produced generally increase, decrease, or change in an unrelated fashion?
Explain your answer from the graph.
Guess which number below best represents the correlation coefficient for this data. Explain your guess.

i. -0.89 ii. -0.13 iii. 0.25 iv. 0.91

Regression

Regression analysis is used to determine if a relationship exists between two variables. To do this, a line is created that best fits a set of data pairs. We will use linear regression, which seeks a line with equation

$$y = mx + b$$

that “best fits” the data. The term “best fits” has a precise mathematical meaning that we can think of as “minimizing the distances to the line for each data point” (regression software typically minimizes the sum of the squared vertical distances). In addition to an equation for the line, the regression analysis calculates a p-value and an $R^2$ value (see below for an explanation of each).

1) Generation of the regression line and equation for the line:

For example, if a computer program for doing regression is applied to the data from the “Ball rolling experiment” the best fitting line is shown on the graph below:

It will turn out that any other line will give a larger overall distance to the points than this line does.
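The idea of a "best fitting" line can be sketched in a few lines of Python. This example assumes that "best fit" means ordinary least squares, that is, the line minimizing the sum of squared vertical distances, which is what regression software typically computes. The function names and the data are invented for illustration and are not the student's measurements.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept b for the line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x  # the fitted line passes through the means
    return m, b


def sum_squared_error(xs, ys, m, b):
    """Total squared vertical distance from the points to y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


# Invented (angle in degrees, rolling time in seconds) data.
angles = [10, 20, 30, 40, 50, 60]
times = [5.4, 4.3, 3.9, 2.6, 2.1, 1.0]

m, b = fit_line(angles, times)
best = sum_squared_error(angles, times, m, b)
print(f"y = {m:.3f}x + {b:.3f}, squared error = {best:.3f}")

# Nudging the slope or the intercept always makes the fit worse.
print(sum_squared_error(angles, times, m + 0.01, b) > best)
print(sum_squared_error(angles, times, m, b + 0.1) > best)
```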

You can frequently estimate the equation of the regression line (y = mx + b) by estimating its slope (m) (i.e. $m = \frac{y_2 - y_1}{x_2 - x_1}$, the rise over the run between two points $(x_1, y_1)$ and $(x_2, y_2)$ on the line) and its y-intercept (b) (i.e. the value of y where the line crosses the y-axis when x = 0). In a regression graph the x-axis is the independent variable and the y-axis is the dependent variable.

From the graph above, we could estimate that the line has y-intercept close to 6 because if you continue to draw the line out, it crosses the y-axis near 6.

To determine the slope you must first choose two points on the line. These are not existing data points, but points of your choice. The easiest and usually more accurate method is to use the grid lines as your guide in choosing your values on the x-axis and then estimate your y-values. So, for the above graph choose these two points: (20, 4.5) and (60, 1.25). It is best to spread out the points you choose, one from either end of your line. It is especially important to remember to choose two points ON THE LINE because you are trying to estimate an equation for the line itself, NOT your data points. Using the points we chose, plug the numbers into the equation for the slope:

$$m = \frac{1.25 - 4.5}{60 - 20} = \frac{-3.25}{40} \approx -0.081$$

Recognize that this is just a guess based on “eyeballing the graph”. Now plug your values for the slope and the y-intercept into the equation y = mx + b and you will get:

y = -0.081x + 6.

2) Generation of the $R^2$ value

When you do regression analysis using a computer program, you’ll sometimes see some indication of the coefficient of determination or “goodness of fit”,

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},$$

where $y_i$ is the measured value, $\hat{y}_i$ is the value of the regression line evaluated at $x_i$, $\bar{y}$ is the mean of the measured values, and n is the sample size. $R^2$ is simply an indication of how well your data points fit the regression line. It is used to determine if you can use your equation of the line to make any further predictions about the relationship between your variables. $R^2$ values fall between 0 and 1. If the $R^2$ value is closer to 1, it means more of your data points fall on or very near the regression line.
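As a rough illustration of the goodness-of-fit idea, the following Python sketch computes the coefficient of determination in its standard form: one minus the ratio of the squared distances from the line to the squared distances from the mean. The function name `r_squared` and the data points are invented for this example. The line is the eyeballed estimate y = -0.081x + 6 from the text.

```python
def r_squared(ys, predicted):
    """Coefficient of determination for measured ys and line values."""
    mean_y = sum(ys) / len(ys)
    # Squared distances from each measurement to the line.
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    # Squared distances from each measurement to the mean of y.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Invented (angle, time) data, used only for illustration.
angles = [10, 20, 30, 40, 50, 60]
times = [5.4, 4.3, 3.9, 2.6, 2.1, 1.0]

# Values of the eyeballed line y = -0.081x + 6 at each angle.
line_values = [-0.081 * x + 6 for x in angles]

r2 = r_squared(times, line_values)
print(f"R^2 = {r2:.3f}")  # close to 1: the points lie near the line
```

A line that passes through every point gives exactly 1, while a "line" that just predicts the mean of y everywhere gives 0.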

3) Generation of a p-value

Using the computer will allow you to calculate a p-value for your relationship. The p-value allows you to decide whether to reject your null hypothesis (here, that there is no relationship between the variables). If your p-value is greater than 0.05, there is NO significant relationship and you would accept (more precisely, fail to reject) your null hypothesis. If your p-value is less than 0.05, there IS a significant relationship and you would reject your null hypothesis.
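To give a feel for what a p-value measures, here is a sketch that uses an exact permutation test. This is a stand-in technique chosen because it needs only the Python standard library: regression software usually derives the p-value from a t-test on the slope instead, so the numbers will generally differ. The function names and the data are invented for illustration. The test asks: if the y-values were paired with the x-values at random, how often would the correlation be at least as strong as the one observed?

```python
from itertools import permutations
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def exact_permutation_p_value(xs, ys):
    """Fraction of all re-pairings of ys with xs whose |r| is at least
    as large as the observed |r| (two-sided exact permutation test)."""
    observed = abs(pearson_r(xs, ys))
    as_extreme = 0
    total = 0
    for shuffled in permutations(ys):
        total += 1
        # Small tolerance guards against floating-point round-off.
        if abs(pearson_r(xs, shuffled)) >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / total


# Invented (angle, time) data; 6 points means 720 re-pairings to check.
angles = [10, 20, 30, 40, 50, 60]
times = [5.4, 4.3, 3.9, 2.6, 2.1, 1.0]

p_value = exact_permutation_p_value(angles, times)
print(f"p = {p_value:.4f}")
print("significant" if p_value < 0.05 else "not significant")
```

Checking every re-pairing is only practical for very small samples, since the number of permutations grows factorially with the sample size.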

Which of the other examples displayed this causal relationship?

In the fertilizer/plant size experiment, the scientist controlled the amount of fertilizer spread on the field, so this plays the role of an independent variable and the plant height is treated as the dependent variable. We saw earlier that this data was very weakly correlated. Nevertheless, we can compute the regression line to be y = 1.804631x + 1.535264. The difference from the previous example is that the $R^2$ value will be much smaller in this case than in the previous example.

In the football player example, height doesn’t cause weight nor does weight cause height. Actually, both are related to a general quality that we could refer to as a person’s “size”. Thus, although the variables are well correlated, it doesn’t make much sense to apply regression analysis to this data.

Dr. Robertson’s study that relates egg production to size in lobsters does allow for regression analysis. Once you realize that the female lobster distributes her eggs on her tail, it is logical that the larger lobster has more room to carry eggs and thus will produce more. The regression equation for this data is

y = 3758.525x - 106704

and the graph (with regression line) is drawn below:

Homework on Correlation and Regression:

Turn in answers to problems 1—4 from earlier and also for the following:

Using the Lobster Fecundity graph above:

Estimate the slope and y-intercept of the line above by the “eye-ball” method and create your equation for the linear regression line.
Use your equation to estimate the number of eggs produced by a lobster that is 45 mm long. Use the “best fit” regression equation above to do this same estimate. Is the difference in these two estimates meaningful?

You collect the data on minutes of weekly exercise vs. systolic blood pressure for a group of college students and plot the data:

Draw what you think is the “best fit” line on the graph above. It’s best to use a clear ruler to draw the line that you think gives the minimum total distance from all the points.
Estimate the slope and y-intercept for your line and create your regression equation.
Use your regression equation to estimate the blood pressure for a student who does 180 min. exercise per week.