A scatter diagram is used to visually describe the covariation between two variables

A scatter diagram shows the relationship between two variables and gives a visual impression of how strongly they are correlated.

Why You Would Use Scatter Analysis and Scatter Plots

A scatter analysis is used when you need to compare two data sets against each other to see if there is a relationship. Scatter plots are a way of visualizing the relationship; by plotting the data points you get a scattering of points on a graph. The analysis comes in when trying to discern what kind of pattern, if any, is present and what that pattern means.

It is this kind of analysis we are talking about when we are trying to get at the root cause of an issue.

Scatter diagrams are used to explore a suspected “cause-and-effect” relationship between two kinds of data, and to provide more useful information about a production process.

Specific instances of when to utilize scatter diagrams:

  • Pairs of numerical figures are present
  • The dependent variable may have multiple values for each value of the independent variable
  • You need to determine whether there is a relationship between two variables

What Kind of Data Should You Use on Scatter Analysis?

Scatter analysis generally makes use of continuous data.

Discrete data is best suited to pass/fail measurements. Continuous data can take any value within a range, so it captures finer detail, and it is generally what scatter analysis uses.

You could use discrete data on one axis of a scatter plot and continuous data on the other axis. For the discrete data, you’d have to put it into some kind of quantified band – like say 1-10 on a customer satisfaction score.

I suppose you also *could* put discrete data that comes out like pass/fail as one of two bands, but it would really depend on the data if you got any useful information out of it.
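
As a minimal sketch of the banding idea, the R code below groups a discrete 1-10 score into three bands and compares a continuous measurement across them. The data are simulated and the names satisfaction, wait_minutes and band are hypothetical, chosen only for illustration.

```r
# Simulated data (hypothetical, for illustration only)
set.seed(7)
satisfaction <- sample(1:10, 50, replace = TRUE)             # discrete 1-10 score
wait_minutes <- 30 - 2 * satisfaction + rnorm(50, sd = 3)    # continuous measurement

# Put the discrete score into quantified bands: 1-3, 4-7, 8-10
band <- cut(satisfaction,
            breaks = c(0, 3, 7, 10),
            labels = c("low", "medium", "high"))

# Mean of the continuous variable within each band
print(tapply(wait_minutes, band, mean))
```

Plotting wait_minutes against satisfaction (or against band) then gives a scatter plot with discrete data on one axis.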

Best bet is continuous data.

If you are looking for a way to do graphical analysis on discrete data, you might try attribute charts.

Scatter Plots and Correlation

Scatter plots only show correlation; they do not prove causation. The example often used is shark attacks and ice cream sales. There may be a correlation between the two, but ice cream does not cause shark attacks; the heat of the day drives both. On hot days more people are in the water, which leads to more shark attacks, and more people buy ice cream.
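
The effect of a hidden third variable can be sketched with simulated data. In the R code below, all numbers are invented for illustration: temperature drives both ice cream sales and a shark-attack index, and the two end up correlated even though neither influences the other. Removing the linear effect of temperature from both (a partial correlation computed from regression residuals) should leave a correlation close to zero.

```r
# Simulated data (hypothetical numbers, for illustration only)
set.seed(1)
temperature <- runif(200, min = 15, max = 35)              # daily high, degrees C
ice_cream   <- 20 + 3 * temperature + rnorm(200, sd = 8)   # sales depend on heat
shark       <- 0.2 * temperature + rnorm(200, sd = 1)      # attack index depends on heat

# Raw correlation between ice cream sales and the shark-attack index
r_raw <- cor(ice_cream, shark)

# Correlation after removing the linear effect of temperature from both
r_partial <- cor(resid(lm(ice_cream ~ temperature)),
                 resid(lm(shark ~ temperature)))

print(c(raw = r_raw, partial = r_partial))
```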

How to Make a Scatter Diagram:

  1. Collect paired sets of data where a relationship is suspected.
  2. Draw a graph in the shape of an “L,” and mark the scale in even multiples (e.g., 10, 20).
    • Place the independent variable on the horizontal (X) axis.
    • Place the dependent variable on the vertical (Y) axis.
    • Place a dot or a symbol where the x-axis value intersects the y-axis value.
    • If two dots fall together, place them side by side, so they are touching, and both are visible.
  3. Review the pattern of points to determine if a relationship is present:
    • Stop if the data forms a line or a curve, as the variables are considered correlated.
    • Use regression or correlation analysis, if necessary. If regression or correlation analysis is not needed, complete steps four through seven below.
  4. Divide points on the graph into four equal sections. If X points are present on the graph:
    • Count X/2 points from top to bottom and draw a horizontal line.
    • Count X/2 points from left to right and draw a vertical line.
    • If the number of points is odd, draw a line through the middle point.
  5. Count the points in each quadrant.

NOTE: Do not count points on a line.

  6. Add the diagonally opposite quadrants, then find the smaller sum and the total of points in all quadrants:

A = points in upper left + points in lower right

B = points in upper right + points in lower left

Q = the smaller of A and B

N = A + B

  7. Look up the limit for N on the trend test table:
    • If Q is less than the limit, the two variables are related.
    • If Q is greater than or equal to the limit, the pattern may have originated from random chance.
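
Steps four through six can be sketched in R as follows. This is only an illustration under stated assumptions: the data are hypothetical, the helper name quadrant_counts is invented here, and the medians of x and y are used as the dividing lines, which corresponds to counting half of the points from each side. The limit for N still has to be looked up in a trend test table, which is not reproduced here.

```r
# Hypothetical paired measurements (assumed data, not from the article)
x <- c(2.1, 3.4, 1.8, 4.0, 2.9, 3.7, 1.2, 4.5, 2.5, 3.1)
y <- c(5.0, 6.8, 4.1, 8.2, 6.1, 7.4, 3.9, 9.0, 7.9, 4.4)

quadrant_counts <- function(x, y) {
  mx <- median(x)                  # vertical dividing line
  my <- median(y)                  # horizontal dividing line
  keep <- x != mx & y != my        # points on a line are not counted
  x <- x[keep]
  y <- y[keep]
  upper <- y > my
  right <- x > mx
  A <- sum(upper & !right) + sum(!upper & right)   # upper left + lower right
  B <- sum(upper & right) + sum(!upper & !right)   # upper right + lower left
  list(A = A, B = B, Q = min(A, B), N = A + B)
}

res <- quadrant_counts(x, y)
print(unlist(res))
```

For these ten points the function gives A = 2, B = 8, Q = 2 and N = 10; Q would then be compared with the table limit for N = 10.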

kassambara | 17/11/2017 | R Graphics Essentials

Scatter plots are used to display the relationship between two continuous variables x and y. In this article, we’ll start by showing how to create beautiful scatter plots in R.

We’ll use helper functions in the ggpubr R package to display automatically the correlation coefficient and the significance level on the plot.

We’ll also describe how to color points by groups and to add concentration ellipses around each group. Additionally, we’ll show how to create bubble charts, as well as how to add marginal plots (histogram, density or box plot) to a scatter plot.

We continue by showing some alternatives to the standard scatter plot, including rectangular binning, hexagonal binning and 2d density estimation. These plot types are useful in a situation where you have a large data set containing thousands of records.

R code for zooming in on a scatter plot is also provided. Finally, you’ll learn how to add fitted regression trend lines and equations to a scatter graph.


  1. Install the cowplot package, which is used to arrange multiple plots. It will be used here to create a scatter plot with marginal density plots. Install the latest development version as follows:
devtools::install_github("wilkelab/cowplot")
  2. Install ggpmisc for adding the equation of a fitted regression line on a scatter plot:
install.packages("ggpmisc")
  3. Load required packages and set ggplot themes:
  • Load ggplot2 and ggpubr R packages
  • Set the default theme to theme_minimal() [in ggplot2]
library(ggplot2)
library(ggpubr)
theme_set(
  theme_minimal() +
    theme(legend.position = "top")
)

Dataset: mtcars. The variable cyl is used as grouping variable.

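# Load data
data("mtcars")
df <- mtcars
# Convert cyl to a factor so it can be used as a grouping variable
df$cyl <- as.factor(df$cyl)
# First rows of the columns used below (rounded)
##                  wt  mpg cyl qsec
## Mazda RX4      2.62 21.0   6 16.5
## Mazda RX4 Wag  2.88 21.0   6 17.0
## Datsun 710     2.32 22.8   4 18.6
## Hornet 4 Drive 3.21 21.4   6 19.4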

Key functions:

  • geom_point(): Create scatter plots. Key arguments: color, size and shape to change point color, size and shape.
  • geom_smooth(): Add smoothed conditional means / regression line. Key arguments:
    • color, size and linetype: Change the line color, size and type.
    • fill: Change the fill color of the confidence region.
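b <- ggplot(df, aes(x = wt, y = mpg))
# Scatter plot with a regression line
b + geom_point() + geom_smooth(method = "lm")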

To remove the confidence region around the regression line, specify the argument se = FALSE in the function geom_smooth().
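
A minimal sketch of the difference, using the built-in mtcars data directly. The object names p_band and p_line are introduced here for illustration only, and formula = y ~ x is spelled out to make the fitted model explicit.

```r
library(ggplot2)

# Regression line with the default confidence band
p_band <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Same plot, regression line only (no confidence region)
p_line <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
```

Printing p_band and p_line side by side shows the shaded region disappearing while the fitted line stays the same.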

Change the point shape, by specifying the argument shape, for example:

b + geom_point(shape = 18)

To see the different point shapes commonly used in R, type this:

ggpubr::show_point_shapes()

Easily create a scatter plot using ggscatter() [in ggpubr]. Use stat_cor() [ggpubr] to add the correlation coefficient and the significance level.

# Add regression line and confidence interval
# Add correlation coefficient: stat_cor()
ggscatter(df, x = "wt", y = "mpg",
          add = "reg.line", conf.int = TRUE,
          add.params = list(fill = "lightgray"),
          ggtheme = theme_minimal()) +
  stat_cor(method = "pearson", label.x = 3, label.y = 30)
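
The correlation coefficient and significance level can also be computed directly in base R, which is a handy way to check the label on the plot. The sketch below uses cor.test() from the stats package on the same two mtcars variables; the variable names ct, r and p are introduced here for illustration.

```r
# Pearson correlation between weight and fuel consumption in mtcars
ct <- cor.test(mtcars$wt, mtcars$mpg, method = "pearson")

r <- unname(ct$estimate)   # correlation coefficient
p <- ct$p.value            # significance level

cat(sprintf("R = %.2f, p = %.2g\n", r, p))
```

The coefficient is about -0.87, a strong negative relationship: heavier cars travel fewer miles per gallon.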

  • Change point colors and shapes by groups.
  • Add marginal rug: geom_rug().
# Change color and shape by groups (cyl)
b + geom_point(aes(color = cyl, shape = cyl)) +
  geom_smooth(aes(color = cyl, fill = cyl), method = "lm") +
  geom_rug(aes(color = cyl)) +
  scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07")) +
  scale_fill_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))

# Remove confidence region (se = FALSE)
# Extend the regression lines: fullrange = TRUE
b + geom_point(aes(color = cyl, shape = cyl)) +
  geom_rug(aes(color = cyl)) +
  geom_smooth(aes(color = cyl), method = lm,
              se = FALSE, fullrange = TRUE) +
  scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07")) +
  ggpubr::stat_cor(aes(color = cyl), label.x = 3)

  • Split the plot into multiple panels. Use the function facet_wrap():
b + geom_point(aes(color = cyl, shape = cyl)) +
  geom_smooth(aes(color = cyl, fill = cyl),
              method = "lm", fullrange = TRUE) +
  facet_wrap(~cyl) +
  scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07")) +
  scale_fill_manual(values = c("#00AFBB", "#E7B800", "#FC4E07")) +
  theme_bw()

  • Add concentration ellipse around groups. R function stat_ellipse(). Key arguments:
    • type: The type of ellipse. The default “t” assumes a multivariate t-distribution, and “norm” assumes a multivariate normal distribution. “euclid” draws a circle with the radius equal to level, representing the euclidean distance from the center.
    • level: The confidence level at which to draw an ellipse (default is 0.95), or, if type=“euclid”, the radius of the circle to be drawn.
b + geom_point(aes(color = cyl, shape = cyl)) +
  stat_ellipse(aes(color = cyl), type = "t") +
  scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))
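
To illustrate the type and level arguments together, here is a small self-contained sketch that overlays a 90% multivariate normal ellipse and the default 95% t-distribution ellipse for each group. The object name p_ell and the choice of a dashed line for the second ellipse are assumptions made for this example.

```r
library(ggplot2)

df <- mtcars
df$cyl <- as.factor(df$cyl)   # grouping variable

p_ell <- ggplot(df, aes(x = wt, y = mpg, color = cyl)) +
  geom_point() +
  # 90% ellipse assuming a multivariate normal distribution
  stat_ellipse(type = "norm", level = 0.90) +
  # default type and level (multivariate t, 95%), drawn dashed
  stat_ellipse(type = "t", level = 0.95, linetype = "dashed")
```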

Instead of drawing the concentration ellipse, you can: i) plot a convex hull of a set of points; ii) add the mean points and the confidence ellipse of each group. Key R functions: stat_chull(), stat_conf_ellipse() and stat_mean() [in ggpubr]:

# Convex hull of groups
b + geom_point(aes(color = cyl, shape = cyl)) +
  stat_chull(aes(color = cyl, fill = cyl),
             alpha = 0.1, geom = "polygon") +
  scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07")) +
  scale_fill_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))

# Add mean points and confidence ellipses
b + geom_point(aes(color = cyl, shape = cyl)) +
  stat_conf_ellipse(aes(color = cyl, fill = cyl),
                    alpha = 0.1, geom = "polygon") +
  stat_mean(aes(color = cyl, shape = cyl), size = 2) +
  scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07")) +
  scale_fill_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))

# Add group mean points and stars
ggscatter(df, x = "wt", y = "mpg",
          color = "cyl", palette = "npg",
          shape = "cyl", ellipse = TRUE,
          mean.point = TRUE, star.plot = TRUE,
          ggtheme = theme_minimal())

# Change the ellipse type to 'convex'
ggscatter(df, x = "wt", y = "mpg",
          color = "cyl", palette = "npg",
          shape = "cyl", ellipse = TRUE,
          ellipse.type = "convex",
          ggtheme = theme_minimal())

Key functions:

  • geom_text() and geom_label(): ggplot2 standard functions to add text to a plot.
  • geom_text_repel() and geom_label_repel() [in ggrepel package]. Repulsive textual annotations. Avoid text overlapping.

First install ggrepel (install.packages("ggrepel")), then type this:

library(ggrepel)
# Add text to the plot
# Labels: the car names stored as row names of df
.labs <- rownames(df)

# Draw a rectangle underneath the text, making it easier to read.
b + geom_point(aes(color = cyl)) +
  geom_label_repel(aes(label = .labs, color = cyl), size = 3) +
  scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))

In a bubble chart, point size is controlled by a continuous variable, here qsec. In the R code below, the argument alpha controls color transparency; it should be between 0 and 1.

b + geom_point(aes(color = cyl, size = qsec), alpha = 0.5) +
  scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07")) +
  scale_size(range = c(0.5, 12))  # Adjust the range of point sizes

  • Color points according to the values of the continuous variable: “mpg”.
  • Change the default blue gradient color using the function scale_color_gradientn() [in ggplot2], by specifying two or more colors.
b + geom_point(aes(color = mpg), size = 3) +
  scale_color_gradientn(colors = c("#00AFBB", "#E7B800", "#FC4E07"))

The function ggMarginal() [in ggExtra package] (Attali 2017), can be used to easily add a marginal histogram, density or box plot to a scatter plot.

First, install the ggExtra package as follows: install.packages("ggExtra"); then type the following R code:

# Create a scatter plot p <- type="density">

One limitation of ggExtra is that it can’t cope with multiple groups in the scatter plot and the marginal plots.

A solution is provided in the function ggscatterhist() [ggpubr]:

library(ggpubr)
# Grouped scatter plot with marginal density plots
ggscatterhist(
  iris, x = "Sepal.Length", y = "Sepal.Width",
  color = "Species", size = 3, alpha = 0.6,
  palette = c("#00AFBB", "#E7B800", "#FC4E07"),
  margin.params = list(fill = "Species", color = "black", size = 0.2)
)

# Use box plot as marginal plots
ggscatterhist(
  iris, x = "Sepal.Length", y = "Sepal.Width",
  color = "Species", size = 3, alpha = 0.6,
  palette = c("#00AFBB", "#E7B800", "#FC4E07"),
  margin.plot = "boxplot",
  ggtheme = theme_bw()
)

In this section, we’ll present some alternatives to the standard scatter plots. These include:

  • Rectangular binning. Rectangular heatmap of 2d bin counts
  • Hexagonal binning: Hexagonal heatmap of 2d bin counts.
  • 2d density estimation

Rectangular binning is a very useful alternative to the standard scatter plot in a situation where you have a large data set containing thousands of records.

Rectangular binning helps to handle overplotting. Rather than plotting each point, which would appear highly dense, it divides the plane into rectangles, counts the number of cases in each rectangle, and then plots a heatmap of 2d bin counts. In this plot, many small rectangles are drawn with a color intensity corresponding to the number of cases in that bin.

Key function: geom_bin2d(): Creates a heatmap of 2d bin counts. Key arguments: bins, numeric vector giving number of bins in both vertical and horizontal directions. Set to 30 by default.
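
What rectangular binning computes can be sketched by hand in base R. The example below uses simulated data and cut() with 30 equal-width intervals per axis; the exact bin edges chosen by geom_bin2d() may differ, so treat this as an illustration of the idea rather than a reproduction of the ggplot2 internals.

```r
# Simulated data with heavy overplotting (for illustration only)
set.seed(42)
x <- rnorm(5000)
y <- x + rnorm(5000)

bins <- 30                        # same number as the geom_bin2d() default
x_bin <- cut(x, breaks = bins)    # 30 equal-width intervals along x
y_bin <- cut(y, breaks = bins)    # 30 equal-width intervals along y

# Count the cases falling in each rectangle
counts <- table(x_bin, y_bin)

print(dim(counts))   # one row per x interval, one column per y interval
print(sum(counts))   # every point lands in exactly one rectangle
```

Passing the count matrix to a heatmap function such as image() gives a rough base R equivalent of the binned plot.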

Hexagonal binning is similar to rectangular binning, but divides the plane into regular hexagons. Hexagon bins avoid the visual artefacts sometimes generated by the very regular alignment of geom_bin2d().

Key function: geom_hex()

Contours of a 2d density estimate: perform a 2D kernel density estimation and display the results as contours overlaid on the scatter plot. This can also be useful for dealing with overplotting.

Key function: geom_density_2d()

  • Create a scatter plot with rectangular and hexagonal binning:
# Rectangular binning
ggplot(diamonds, aes(carat, price)) +
  geom_bin2d(bins = 20, color = "white") +
  scale_fill_gradient(low = "#00AFBB", high = "#FC4E07") +
  theme_minimal()

# Hexagonal binning
ggplot(diamonds, aes(carat, price)) +
  geom_hex(bins = 20, color = "white") +
  scale_fill_gradient(low = "#00AFBB", high = "#FC4E07") +
  theme_minimal()

  • Create a scatter plot with 2d density estimation:
# Add 2d density estimation sp <->
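
A minimal sketch of density contours overlaid on a scatter plot is given below. It assumes the built-in faithful data set purely for illustration, and the object names sp_demo and sp_contours are hypothetical. Drawing the plot relies on the MASS package, which ggplot2 uses for the 2D kernel density estimate.

```r
library(ggplot2)

# Base scatter plot with light points so the contours stand out
sp_demo <- ggplot(faithful, aes(x = eruptions, y = waiting)) +
  geom_point(color = "lightgray")

# Overlay contours of the 2d density estimate
sp_contours <- sp_demo + geom_density_2d()
```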

  • Key function: facet_zoom() [in ggforce] (Pedersen 2016).
  • Demo data set: iris. The R code below zooms in on the points where Species == "versicolor".
library(ggforce)
ggplot(iris, aes(Petal.Length, Petal.Width, colour = Species)) +
  geom_point() +
  ggpubr::color_palette("jco") +
  facet_zoom(x = Species == "versicolor") +
  theme_bw()

To zoom in on the points where Petal.Length < 2.5, type this:

ggplot(iris, aes(Petal.Length, Petal.Width, colour = Species)) +
  geom_point() +
  ggpubr::color_palette("jco") +
  facet_zoom(x = Petal.Length < 2.5) +
  theme_bw()
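
If ggforce is not available, a plain ggplot2 alternative is to restrict the visible range with coord_cartesian(). This is a different technique from facet_zoom(): it shows only the zoomed view, without the overview panel. The limits below and the object name p_zoom are assumptions chosen for this sketch.

```r
library(ggplot2)

# Zoom on the region where Petal.Length is below 2.5
# coord_cartesian() changes the view only; no data points are dropped
p_zoom <- ggplot(iris, aes(Petal.Length, Petal.Width, colour = Species)) +
  geom_point() +
  coord_cartesian(xlim = c(0.9, 2.5)) +
  theme_bw()
```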

In this section, we’ll describe how to add trend lines to a scatter plot, along with labels (equation, R2, BIC, AIC) for a fitted linear model.

  1. Load packages and create a basic scatter plot facetted by groups:
# Load packages and set theme
library(ggpubr)
library(ggpmisc)
theme_set(
  theme_bw() +
    theme(legend.position = "top")
)
# Scatter plot
p <->
  2. Add the regression line, correlation coefficient and equation of the fitted line. Key functions:
    • stat_smooth() [ggplot2]
    • stat_cor() [ggpubr]
    • stat_poly_eq()[ggpmisc]
formula <->

set.seed(4321) x <->
  • Fit polynomial regression line and add labels:
# Polynomial regression. Show equation and adjusted R2
formula <->

Note that you can also display the AIC and BIC values using ..AIC.label.. and ..BIC.label.. in the above equation.

Other arguments (label.x, label.y) are available in the function stat_poly_eq() to adjust label positions.

For more examples, type this R code: browseVignettes("ggpmisc").
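
The quantities shown in such labels can also be computed directly with lm() in base R, which is useful for checking what appears on the plot. The sketch below fits a linear and a second-degree polynomial model to the mtcars data; the choice of data set, the polynomial degree and the object names are assumptions made for illustration.

```r
# Linear model: fuel consumption as a function of weight
fit <- lm(mpg ~ wt, data = mtcars)
b0 <- coef(fit)[["(Intercept)"]]
b1 <- coef(fit)[["wt"]]

# Equation of the fitted line, adjusted R2, AIC and BIC
eq  <- sprintf("y = %.2f %+.2f x", b0, b1)
adj <- summary(fit)$adj.r.squared
print(eq)
print(c(adj.R2 = adj, AIC = AIC(fit), BIC = BIC(fit)))

# Polynomial regression (degree 2 chosen for illustration)
fit2 <- lm(mpg ~ poly(wt, 2, raw = TRUE), data = mtcars)
adj2 <- summary(fit2)$adj.r.squared
print(c(adj.R2.poly = adj2))
```

For this data the fitted line has an intercept of about 37.29 and a slope of about -5.34, and the polynomial fit has a higher adjusted R2 than the straight line.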

  1. Create a basic scatter plot:
b <- ggplot(df, aes(x = wt, y = mpg))

Possible layers include:

  • geom_point() for scatter plot
  • geom_smooth() for adding smoothed line such as regression line
  • geom_rug() for adding a marginal rug
  • geom_text() for adding textual annotations

  2. Continuous bivariate distribution:
c <- ggplot(diamonds, aes(carat, price))

Possible layers include:

  • geom_bin2d(): Rectangular binning.
  • geom_hex(): Hexagonal binning.
  • geom_density_2d(): Contours from a 2d density estimate
