Which of the following should be used to present the relationship between two continuous variables?

Perhaps you would like to test whether there is a statistically significant linear relationship between two continuous variables, weight and height (and by extension, infer whether the association is significant in the population). You can use a bivariate Pearson Correlation to test whether there is a statistically significant linear relationship between height and weight, and to determine the strength and direction of the association.

Before the Test

In the sample data, we will use two variables: “Height” and “Weight.” The variable “Height” is a continuous measure of height in inches and exhibits a range of values from 55.00 to 84.41 (Analyze > Descriptive Statistics > Descriptives). The variable “Weight” is a continuous measure of weight in pounds and exhibits a range of values from 101.71 to 350.07.

Before we look at the Pearson correlations, we should look at the scatterplots of our variables to get an idea of what to expect. In particular, we need to determine if it's reasonable to assume that our variables have linear relationships. Click Graphs > Legacy Dialogs > Scatter/Dot. In the Scatter/Dot window, click Simple Scatter, then click Define. Move variable Height to the X Axis box, and move variable Weight to the Y Axis box. When finished, click OK.

To add a linear fit like the one depicted, double-click on the plot in the Output Viewer to open the Chart Editor. Click Elements > Fit Line at Total. In the Properties window, make sure the Fit Method is set to Linear, then click Apply. (Notice that adding the linear regression trend line will also add the R-squared value in the margin of the plot. If we take the square root of this number, it should match the absolute value of the Pearson correlation we obtain.)
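The link between R-squared and the Pearson correlation can also be checked outside SPSS. The following is a minimal R sketch, assuming simulated values (the `height` and `weight` vectors here are hypothetical and are not the sample data). For a regression with a single predictor, the square root of R-squared should equal the magnitude of r.

```r
# Simulated (hypothetical) data standing in for the Height and Weight variables
set.seed(42)
height <- rnorm(100, mean = 68, sd = 5)                  # inches
weight <- 3 * height + rnorm(100, mean = -40, sd = 25)   # pounds

# Pearson correlation between the two variables
r <- cor(height, weight, method = "pearson")

# R-squared from the simple linear regression of weight on height
fit <- lm(weight ~ height)
r_squared <- summary(fit)$r.squared

# The square root of R-squared recovers the magnitude (not the sign) of r
sqrt(r_squared)
abs(r)
```

Because the square root is never negative, the sign of the correlation has to be read from the direction of the trend line.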

From the scatterplot, we can see that as height increases, weight also tends to increase. There does appear to be some linear relationship.

Running the Test

To run the bivariate Pearson Correlation, click Analyze > Correlate > Bivariate. Select the variables Height and Weight and move them to the Variables box. In the Correlation Coefficients area, select Pearson. In the Test of Significance area, select your desired significance test, two-tailed or one-tailed. We will select a two-tailed significance test in this example. Check the box next to Flag significant correlations.

Click OK to run the bivariate Pearson Correlation. Output for the analysis will display in the Output Viewer.

Syntax

    CORRELATIONS
      /VARIABLES=Weight Height
      /PRINT=TWOTAIL NOSIG
      /MISSING=PAIRWISE.

Output

Tables

The results will display the correlations in a table, labeled Correlations.

A Correlation of Height with itself (r=1), and the number of nonmissing observations for height (n=408).

B Correlation of height and weight (r=0.513), based on n=354 observations with pairwise nonmissing values.

C Correlation of height and weight (r=0.513), based on n=354 observations with pairwise nonmissing values.

D Correlation of weight with itself (r=1), and the number of nonmissing observations for weight (n=376).

The important cells we want to look at are either B or C. (Cells B and C are identical, because they include information about the same pair of variables.) Cells B and C contain the correlation coefficient for the correlation between height and weight, its p-value, and the number of complete pairwise observations that the calculation was based on.

The correlations in the main diagonal (cells A and D) are all equal to 1. This is because a variable is always perfectly correlated with itself. Notice, however, that the sample sizes are different in cell A (n=408) versus cell D (n=376). This is because of missing data -- there are more missing observations for variable Weight than there are for variable Height.
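The difference between the per-variable and pairwise sample sizes can be illustrated with a small R sketch. The six values below are made up for illustration and are not the sample data. The `use = "pairwise.complete.obs"` argument asks `cor()` to use only cases with nonmissing values on both variables, which is the same idea as the /MISSING=PAIRWISE option in the syntax above.

```r
# Made-up values with missing data (NA) in different rows
height <- c(60, 62, NA, 65, 70, 72)
weight <- c(110, NA, 130, 150, 180, NA)

# Number of nonmissing observations for each variable on its own
n_height <- sum(!is.na(height))
n_weight <- sum(!is.na(weight))

# Number of cases with nonmissing values on BOTH variables
n_pair <- sum(complete.cases(height, weight))

# Correlation computed only from the complete pairs
r_pair <- cor(height, weight, use = "pairwise.complete.obs")

c(n_height = n_height, n_weight = n_weight, n_pair = n_pair)
```

The pairwise count can never exceed the smaller of the two per-variable counts, which is why the n for the height and weight correlation is lower than either diagonal n.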

If you have opted to flag significant correlations, SPSS will mark a 0.05 significance level with one asterisk (*) and a 0.01 significance level with two asterisks (**). In cell B (repeated in cell C), we can see that the Pearson correlation coefficient for height and weight is .513, which is significant (p < .001 for a two-tailed test), based on 354 complete observations (i.e., cases with nonmissing values for both height and weight).

Decision and Conclusions

Based on the results, we can state the following:

  • Weight and height have a statistically significant linear relationship (r=.513, p < .001).
  • The direction of the relationship is positive (i.e., height and weight are positively correlated), meaning that these variables tend to increase together (i.e., greater height is associated with greater weight).
  • The magnitude, or strength, of the association is approximately moderate (.3 < | r | < .5).

Correlation is a parametric test that examines the strength of a linear relationship between two continuous variables. It does not assume a causal relationship, which means that the variables are not classified as explanatory or response. Correlation has the following sampling variability assumption that has to be met:

1. The two random variables are normally distributed.

Before we can interpret our statistical output, we need to make sure the assumption above is met.

###Background Theory

The Pearson Product-Moment Correlation attempts to draw a line of best fit through the data of two continuous variables. The Pearson correlation coefficient (r) is a number between -1 and 1 that indicates the extent to which two variables are linearly related. The closer the value of r is to 1 or -1, the stronger the relationship. Basically, r tells you how closely the data points cluster around the line of best fit. The test evaluates the null hypothesis that the slope of the best-fit line is 0 (equivalently, that the true correlation is 0). If the slope is significantly different from 0, then the model is explaining more variance than would be expected from error (random chance) and the model is significant.

Here are some statistical terms you should understand to interpret your correlation.

  • The correlation coefficient (r) is a measure of the strength and direction of a relationship between two continuous variables. It can vary between -1 and +1, and the closer the value is to -1 or +1, the stronger the relationship.
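  • The p-value is the area under the curve of the test distribution beyond your observed test statistic: the probability of getting a result at least this extreme if there were really no relationship. Alpha is the cutoff that the p-value is compared against. It is the chance that you are wrong if you accept that there is a significant effect when there really isn't one (also known as the Type I error rate), and scientists have chosen it, by convention, to be 5%.
  • The df (degrees of freedom) is equal to the number of observations minus 2 (N - 2).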

If the test statistic generated by your test is larger (in absolute value) than the critical value, the area under the curve beyond it is smaller than 5%. Therefore, the p-value is < 0.05, and there is a significant relationship between the two variables. You have explained significantly more variance than random chance. And if you use this cutoff, you will say that there is a significant effect when there really isn’t one no more than 5% of the time.

Figure 1. Scatterplots showing different relationships and their correlation coefficients. Relationships could be strong positive (a), strong negative (d), weak positive (b), weak negative (e), or there could be no relationship (c and f). Image: https://statistics.laerd.com/statistical-guides/pearson-correlation-coefficient-statistical-guide.php

A correlation test will provide you output that looks like the example below.

        Pearson's product-moment correlation
    data:  SLA and LT
    t = -3.6333, df = 23, p-value = 0.001392
    alternative hypothesis: true correlation is not equal to 0
    95 percent confidence interval:
     -0.8065511 -0.2741533
    sample estimates:
           cor
    -0.6038692

This output provides the df, the p-value, and the Pearson correlation coefficient (r) as the last line: cor. In this example, the two variables (SLA and LT) are significantly negatively associated with each other because r = -0.60 and the p-value = 0.001.

The df (degrees of freedom) is the number of samples minus 2. In the above example, df = 23, so there were 25 samples in total.
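As a check on how the pieces of this output fit together, the t statistic can be recovered from r and the df using the standard relationship for a Pearson correlation test. Plugging in the values from the example output above (cor = -0.6038692, df = n - 2 = 23):

```latex
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}
  = \frac{-0.6038692\,\sqrt{23}}{\sqrt{1-(-0.6038692)^{2}}}
  \approx \frac{-2.8961}{0.7971}
  \approx -3.633
```

This agrees with the t = -3.6333 reported in the output, up to rounding.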

###Information about the Example Dataset

For this test, we will be using a data set of size and biomass of striped plateau lizards (Sceloporus virgatus) from sites in Cochise County in southeastern Arizona. Biomass is often the variable of interest, but it is difficult to measure in many species, so other measures are used to estimate it. In this data set, snout to vent length (mm) was measured, as this is the standard measurement of body size in reptiles. Biomass was measured using a scale as the body mass (g). We are testing whether there is a positive correlation between snout to vent length (SVL) and biomass.

Figure 2. Striped plateau lizard (Sceloporus virgatus) in southeastern Arizona.

In order to keep your RStudio script file organized, you will want to include some information at the top of the file. You can use the hashtag (#) to add comments: anything after a # on a line is ignored by R.

For each test, you should include the following lines at the top of each RStudio file:

    #question:
    #response variable (or variable 1):
    #explanatory variable (or variable 2):
    #test name:

Below is what it would look like for the lizard example:

    #question: is there a positive relationship between snout to vent length and biomass in striped plateau lizards?
    #variable 1: snout to vent length (continuous)
    #variable 2: biomass (continuous)
    #test name: correlation

In order to run a correlation, you need to have each variable in a separate column. To bring data into RStudio, you can use the line of code below that will allow you to select any file directly from your computer.

    #code to load a datafile into R from your computer
    DATA <- read.csv(file.choose())

An alternative way to open your file is to set your working directory to a folder on your computer and then load a named file that you saved in that folder (to learn more about what your working directory is, go to the main webpage and click on Introduction to RStudio). In the DATA code line, TO.BE.EDITED is the name of your file.

    #code to set the working directory folder on your computer
    setwd('C:/Users/YourUserName/Documents/Rfiles')
    #code to load a datafile into R from your working directory
    DATA <- read.csv(file="TO.BE.EDITED.csv", header=TRUE, sep=",")

For this example, download the lizard_biomass.xlsx file from the home page, save the “lizard_biomass” worksheet in your working directory as a .csv file and then use the code below to upload the lizard_biomass.csv file into RStudio.

    #code to load the datafile "lizard_biomass.csv" into R from your working directory
    DATA <- data.frame(read.table(file='lizard_biomass.csv', sep=',', header=TRUE, fill=TRUE))

To make sure it loaded correctly, you can copy and paste the following code to see the names of all of the columns in your file.

    #code to see the names of the column titles
    names(DATA)
    ## [1] "Case"   "Site"   "Date"   "TC"     "Sex"    "SVL.mm" "bmass"

If the file was successfully loaded into RStudio, you should have the names of the columns above. If you got an error, go back to the main stats webpage and click on Troubleshooting.

You can also use the following lines of code to make sure your data was read in correctly. The first line of code results in the top 6 rows of your data being displayed instead of just the column names, and the second line of code gives you the dimensions of your data file.

    #code to see the top 6 rows of your data file
    head(DATA)
    ##   Case   Site   Date   TC Sex SVL.mm bmass
    ## 1    1 burned 22-May 1550   M   52.0   3.3
    ## 2    2 burned 22-May 3330   M   59.0   4.9
    ## 3    3 burned 22-May 1432   M   60.5   5.9
    ## 4    4 burned 22-May 1551   M   50.0   3.2
    ## 5    5 burned 22-May 1552   F   61.0   6.9
    ## 6    6 burned 23-May 1553   F   58.0   5.4
    #code to see the dimensions (number of rows and columns) of your data file
    dim(DATA)
    ## [1] 191   7

Once you are sure the data are loaded into R, you can proceed.

Before we run the test, it is a good idea to explore what the data look like. This gives you a chance to see if there are any outliers and to determine if your prediction based on your hypothesis seems supported by the data (presumably you would have some information about snout to vent length and biomass before running this analysis on what you predict to find). We will examine how snout to vent length (SVL.mm) varies with biomass (bmass). For this test, both variables are continuous.

Use the following code to create a scatterplot of your data. This is the most suitable graph form for a correlation.

    #code to create a scatterplot with a line of best fit
    attach(DATA)
    #reg<-lm(bmass~SVL.mm, data=DATA)
    plot(DATA$SVL.mm, DATA$bmass, xlab="snout to vent length (mm)", ylab="biomass (g)", pch=19)
    abline(lm(bmass~SVL.mm, data=DATA))

Looking at these data, it seems that there is a strong positive relationship between the two variables and there don’t appear to be any outliers. If there are any outliers, explore why by looking at the data and determining if there was an error in recording or data entry. Note that you have to be certain of such errors before you alter data; just a hunch is not enough. Any change you do make has to be justified either by an error that you know you made or an outlier test (see this website for information about outlier tests: http://r-statistics.co/Outlier-Treatment-With-R.html).

We will now test the assumption of the model.

We have to test the model assumptions before running the correlation test. You never look at the results of the test until AFTER you check that the assumptions of the model were met. So we will first examine if our data meet the assumption that:

both variables follow a normal distribution

To test the assumption of the model, you will use density plots to see if each variable follows a normal distribution. Use the code below to test whether SVL.mm and biomass are normally distributed.

    #code to attach dataset in R
    attach(DATA)
    ## The following objects are masked from DATA (pos = 3):
    ##
    ##     bmass, Case, Date, Sex, Site, SVL.mm, TC
    #code to create a density plot of snout to vent length (SVL.mm)
    plot(density(SVL.mm))

    #code to create a density plot of biomass (bmass)
    plot(density(bmass))

Examine the density plot of each variable. Each should follow more or less a bell-shaped curve. It can be hard to determine whether or not your density plot is “normal” as it takes practice. If there are any weird bumps or the curve is clearly skewed to one side or bimodal, chances are your variables are not normal. These plots look pretty normal, save for a small tail on the left of the top plot, so we have met the assumption of the model and will run the correlation test. See the ANOVA webpage for examples of non-normal and normal curves.

If your data satisfy the assumptions of the model, go directly to Running the Correlation below.

If your data do not satisfy the assumptions of the model, go to My Data Are Not Normal - What Do I Do? below and transform your variables. If the plots of the transformed variables look more normal, you can proceed to running the correlation test. Make sure you use your transformed variables when running the test.

Now you can run the correlation. Below is the code you will use to run the model. Remember that to actually run the test, you need to change the variable one and variable two to your own variable names. Use the code for names(DATA) above to make sure you write in the variables exactly as they are in the data file. If you write in biomass instead of bmass, for example, the test will not work because R won’t be able to find biomass in your datafile.

    #code to run the Pearson product-moment correlation
    #(replace variable_one and variable_two with your own variable names)
    cor.test(variable_one, variable_two, method = "pearson")
    #code to run the Pearson product-moment correlation for the lizard data
    cor.test(SVL.mm, bmass, method = "pearson")
    ##
    ##  Pearson's product-moment correlation
    ##
    ## data:  SVL.mm and bmass
    ## t = 26.207, df = 189, p-value < 2.2e-16
    ## alternative hypothesis: true correlation is not equal to 0
    ## 95 percent confidence interval:
    ##  0.8505065 0.9127686
    ## sample estimates:
    ##       cor
    ## 0.8855517

According to the correlation test output, there is a strong and significant positive correlation between snout to vent length and biomass because the p-value < 0.05 and the r is 0.89.
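The values you need for your results section can also be pulled out of the result object rather than read off the screen. The sketch below is a minimal example using simulated data (the `svl` and `mass` vectors are hypothetical stand-ins, not the lizard data). It relies on the `estimate`, `parameter`, `statistic`, `p.value`, and `conf.int` components of the object that `cor.test()` returns.

```r
# Simulated (hypothetical) stand-ins for snout to vent length and biomass
set.seed(1)
svl  <- rnorm(50, mean = 55, sd = 5)
mass <- 0.3 * svl - 11 + rnorm(50, sd = 0.8)

# Store the test result instead of only printing it
result <- cor.test(svl, mass, method = "pearson")

r_value  <- unname(result$estimate)    # the correlation coefficient (cor)
deg_free <- unname(result$parameter)   # degrees of freedom (n - 2)
p_value  <- result$p.value             # the p-value
conf_int <- result$conf.int            # 95 percent confidence interval for r

# The t statistic can be rebuilt from r and the df
t_check <- r_value * sqrt(deg_free) / sqrt(1 - r_value^2)

# Rounded values ready to paste into a results sentence
round(c(r = r_value, df = deg_free), 2)
```

Storing the result in an object such as `result` means you do not have to rerun the test each time you need one of the numbers.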

You should copy and paste the results into a Word doc or Excel file so you record it somewhere. In particular, you want to go through the Background Theory above to really understand all aspects of this output. For your results section, you should note the df, the r value (cor), and the p-value.

The results section of your paper should begin with a narrative of your results statements. These statements should be quantitative in nature and include 1) the statistical significance and 2) the biological significance.

  1. Statistical Significance: Your first sentence should list the results of the statistical test (in this case whether there was a significant relationship between biomass and snout to vent length). You include statistical data in parentheses at the end of the sentence only. You should never write “The p-value was…” or “The r was, which means…”. You don’t write about your statistics, you just write the biological results and include your statistics in parentheses. This satisfies whether your results were statistically significant.

  2. Biological Significance: For the quantitative statements, you can use the strength and direction of the relationship between the two variables (basically your r value and whether it was positive or negative).

The first results statement that includes the statistical results in parentheses should also reference the figure you are referring to. Remember to always:

  • refer to figures in the order in which they appear, and
  • include your results narrative before the figure.

A scatterplot with a trend line is typically used to present the results of a correlation. A scatterplot is used when both variables are continuous. You already generated a scatterplot with a trendline, but below is the code to do it again. Note that if you transformed the data to meet the assumptions of the model, you still present the untransformed data in the figure.

To create a presentable figure for your paper, you need to install and load the package ggplot2. Use the code below to install (first line of code) and then load (second line of code) the package.

    #code to install package "ggplot2"
    install.packages("ggplot2")
    #code to load package "ggplot2" into R
    library(ggplot2)

Now use the code below to create a scatterplot with a trendline and appropriate axis labels.

    #code to create a scatterplot with a trendline and axis labels
    ggplot(DATA, aes(x = SVL.mm, y = bmass)) +
      geom_point(shape=1) +
      ylab("Biomass (g)") +
      xlab("Snout to vent length (mm)") +
      theme_classic() +
      geom_smooth(method=lm, se= FALSE)
    ## `geom_smooth()` using formula 'y ~ x'

Alternatively, you can use Excel to create a scatterplot. Video on how to make a scatterplot in Excel: https://drive.google.com/file/d/1BNL4hG8X6TTPCuXskyK0U8UyOHDM2Bn4/view?usp=sharing

A caption must be included below each figure.

The caption for a correlation must include:

1. A short descriptive title following the figure number
2. A description of what you plotted
3. Your sample size (e.g., # transects/site)
4. The p-value and the r value.

Lizard biomass and snout to vent length were significantly positively associated with each other (correlation, r= 0.89, df = 189, p < 0.001, Figure 3).

Figure 3. Biomass (g) and snout to vent length (mm) of striped plateau lizards (Sceloporus virgatus) (n = 191) from southeastern Arizona (r = 0.89, p < 0.001).

When you can’t meet the assumptions of the model, you can either:

1. Transform the data
2. Run a non-parametric test

Transforming your data means modifying your variables so that the assumption of normality is met. This often means taking the square root or the log of each variable and running the test again. The reason transformations work is that the order of the replicates stays the same (sample 1 is still lower than sample 2, for example) but the absolute distance between them is reduced, especially among the largest values. If your density plot is bimodal (two large humps in your curve), often a transformation will make those humps smaller. You still present your data using the untransformed numbers, not the transformed numbers.

You can transform your data for the example above by taking the square root (first block of code below) OR the log (second block of code below) of each variable. Pick one transformation or the other.

    #code to square-root transform your variables
    sqrtSVL = sqrt(DATA$SVL.mm)
    sqrtBmass = sqrt(DATA$bmass)
    #code to log transform your variables
    logSVL = log(DATA$SVL.mm)
    logBmass = log(DATA$bmass)

Once you transform your variables, you can replot your data and run the density plots using your new variables (e.g., sqrtSVL and sqrtBmass) in place of your original variables (SVL.mm and bmass) for every line of code.
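The workflow of transforming, re-checking the density plots, and re-running the test might look like the sketch below. It is a minimal example on simulated right-skewed data (the vectors `x` and `y` are hypothetical and are not the lizard data), and it uses the log transformation. The same steps apply to a square-root transformation.

```r
# Simulated (hypothetical) right-skewed variables
set.seed(1)
x <- rlnorm(200, meanlog = 3, sdlog = 0.6)
y <- x * rlnorm(200, meanlog = 0, sdlog = 0.3)

# Log transform both variables
log_x <- log(x)
log_y <- log(y)

# Compare the density plots before and after the transformation
old_par <- par(mfrow = c(1, 2))
plot(density(x), main = "Untransformed x")
plot(density(log_x), main = "Log-transformed x")
par(old_par)

# The transformation keeps every observation in the same rank order
identical(rank(x), rank(log_x))

# Re-run the correlation using the transformed variables
result_log <- cor.test(log_x, log_y, method = "pearson")
result_log
```

If the transformed density plots still show clear skew or several humps, the non-parametric option is the next thing to try.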

If you can’t meet the assumptions even after transforming your data, you can use a non-parametric test. A non-parametric test does not assume that the data follow a particular underlying distribution (such as the normal distribution) and, therefore, does not need to meet the assumption of normality as the parametric test does.

The non-parametric correlation is called a Spearman rank correlation. Use the code below to run a Spearman rank correlation test:

    #code to run the Spearman rank correlation for the lizard data
    cor.test(SVL.mm, bmass, method = "spearman")
    ## Warning in cor.test.default(SVL.mm, bmass, method = "spearman"): Cannot compute
    ## exact p-value with ties
    ##
    ##  Spearman's rank correlation rho
    ##
    ## data:  SVL.mm and bmass
    ## S = 143914, p-value < 2.2e-16
    ## alternative hypothesis: true rho is not equal to 0
    ## sample estimates:
    ##       rho
    ## 0.8760727

Follow the directions in the Reporting Your Results section above, except that the test is now a Spearman rank correlation test instead of just a correlation test and there are no degrees of freedom to be reported. For example: "Lizard biomass and snout to vent length were significantly positively associated with each other (Spearman rank correlation, rho = 0.88, p < 0.001, Figure 3)."
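One way to see what the Spearman test measures is that rho is the Pearson correlation computed on the ranks of the data rather than on the raw values. A minimal sketch with simulated values (the vectors `x` and `y` are hypothetical, not the lizard data):

```r
# Simulated (hypothetical) data with a curved but steadily increasing trend
set.seed(7)
x <- rnorm(30)
y <- x^3 + rnorm(30, sd = 0.5)

# Spearman rank correlation computed directly
rho <- cor(x, y, method = "spearman")

# Pearson correlation computed on the ranks of each variable
r_on_ranks <- cor(rank(x), rank(y), method = "pearson")

# The two values agree
all.equal(rho, r_on_ranks)
```

Because only the rank order matters, the Spearman correlation is less sensitive to skewed variables and extreme values than the Pearson correlation.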

##Quick Correlation

    #bring data into RStudio
    DATA <- data.frame(read.table(file='lizard_biomass.csv', sep=',', header=TRUE, fill=TRUE))
    #check that data was loaded properly
    names(DATA)
    dim(DATA)
    #code to create a scatterplot with a line of best fit (look for outliers)
    attach(DATA)
    plot(SVL.mm, bmass, xlab="snout to vent length (mm)", ylab="biomass (g)", pch=19)
    abline(lm(bmass~SVL.mm), col="red")
    #check model assumptions
    #code to create a density plot of snout to vent length (SVL.mm)
    plot(density(SVL.mm))
    #code to create a density plot of biomass (bmass)
    plot(density(bmass))
    #code to run the Pearson product-moment correlation for the lizard data
    cor.test(SVL.mm, bmass, method = "pearson")
    #code to create a scatterplot with a trendline and axis labels
    install.packages("ggplot2") #only needed the first time
    library(ggplot2)
    ggplot(DATA, aes(x = SVL.mm, y = bmass)) +
      geom_point(shape=1) +
      ylab("Biomass (g)") +
      xlab("Snout to vent length (mm)") +
      theme_classic() +
      geom_smooth(method=lm, se= FALSE)

*Written by Carrie L. Woods, August 2019. Modified from http://stats.pugetsound.edu/ecology/