R Worksheet 4: Plots I: Making plots

This R worksheet does include assessed questions at the bottom. The deadline for submitting your solutions to these questions is Monday 6 November at 1400. If you have difficulty with this worksheet, you can get help at the office hours drop-on sessions.

In this worksheet, we will look at how to make the three types of data visualisations we studied in Lecture 2:

boxplots
histograms
scatterplots

In this worksheet, we will just consider the basics of how to produce the plot. In the next worksheet, R Worksheet 5, we will see ways to improve and alter the plots, by adding axis titles, changing the colours, and so on.

We will use the same Met Office historical temperature data as we used on R Worksheet 3 for our example data.

temperature <- read.csv("https://mpaldridge.github.io/math1710/data/met-office.csv")

Exercise 4.1. Read the temperature data into R by copying the command above. Just to check you remember from R Worksheet 3: What was the median August temperature?

Boxplots

The first type of plot we will look at is the boxplot. This is very simple. We simply feed the vector of data we wish to plot into the function boxplot(). So, for example, a boxplot of the January temperatures can be produced with

boxplot(temperature$jan)

You should find that a picture like the one above appears in the “Plots” tab of RStudio – this is usually in the bottom-right quadrant of the screen. If the “Plots” tab is too small on your screen, the picture may fail to appear – your should drag the borders of the “Plots” tab to make it bigger then try again.

Above the picture, you should see the “Export” button. Clicking on this will give you the opportunity to either save your graph our copy it to the clipboard. Saving the plot is useful if you think you might want it again in the future – although if you are saving all your commands in an R script, as we discussed on R Worksheet 1, you can very quickly recreate any plots you need again without having to save each one. Copying the plot to the clipboard can be helpful if you want to include your plot in another document, like a Word document or a Powerpoint presentation.

Exercise 4.2. Draw a boxplot of the August temperature. Save your graph as an image.

Boxplots are mostly helpful to compare data, so we often want to draw multiple boxplots next to each other. In this case, we simple plug all the data vectors into boxplot(), separated by commas.

So to compare the temperatures in July, August, and September, we use

boxplot(temperature$jul, temperature$aug, temperature$sep)

You might like to think about what can we learn about the temperature data in those months from this boxplot.

Exercise 4.2. Draw a boxplot comparing the April temperature for the first 68 years in the data set (rows 1 to 68) and the second 68 years in the data (rows 69 to 136). (You may need to look back at R Worksheet 3 to remind yourself how to extract limited rows from a column of a data frame.) Comment briefly on what you learn from the comparison.

In the next worksheet, we will see how to properly label the box plots, label the y-axis, put a title on your boxplot, and other things.

Histograms

The basic method of how to draw a histogram is again pretty simple – just pass the data vector to the function hist().

hist(temperature$dec)

An important extra optional argument is the freq = ... argument. You’ll notice that the histogram above plots “Frequency” on the y-axis. Setting freq = TRUE ensures that that the frequency is plotted – although since this is the default behaviour (with one exception mentioned below), it’s not necessary to do so, and it can be more convenient to leave it out. Setting freq = FALSE instead plots the frequency density, as we discussed in Section 2.1 in Lecture 2. Compare the first picture above with this one (and look carefully at the y-axis):

hist(temperature$dec, freq = FALSE)

Another important extra optional arguments is breaks = .... This allows you to specify the bins you create. Your choices here are:

Don’t use the breaks argument at all: this lets R choose what it thinks is a sensible number of bins.
Set breaks equal to a number, like breaks = 5: this tells R how many bins you want. R won’t necessarily choose exactly the number of bins you request – rather it will pick a number close to that which it thinks splits up the x-axis nicely. (You may find this behaviour helpful or annoying, according to taste.)
Set breaks equal to a vector, like breaks = c(-5, 0, 2, 4, 6, 10, 20): these are the exact breakpoints between bins that R uses. Note that if the vectors are not equally spaced, then – very sensibly – plotting the density (freq = FALSE) becomes the default behaviour.

Here’s an example of specifying the bins we want to use:

hist(temperature$mar, breaks = c(0, 2, 4, 5, 6, 8, 10))

Exercise 4.4. Plot four histograms of the May temperature with density on the y-axis using (approximately) 3, 6, 10, and 20 bins. Which number of bins do you think is the most appropriate, and why?

In the next worksheet we will cover how to label our histogram to make the data display more clearly.

Scatterplots

The command for a scatterplot is simply plot(). You should provide two arguments: the first argument will be plotted on the x-axis and the second on the y-axis.

So to compare July and August temperatures each year, we can use

plot(temperature$jul, temperature$aug)

There appears to be a positive relationship here, although not an extremely strong one. We can confirm that by calculating the correlation, as on R Worksheet 3.

cor(temperature$jul, temperature$aug)

[1] 0.5397355

Exercise 4.5. Plot a scatterplot for January and August temperature. Do you see a relationship here? Copy your scatterplot to the clipboard and paste it into some other program.

In R Worksheet 5, we will look at some of the many extra optional arguments that plot() can take. But we will mention one here.

By default, R plots a point (or, rather, a small circle) for each pair \((x_i, y_i)\) of datapoints. But sometimes you might prefer to join up points with a line – this can be particularly appropriate if the x-axis represents time and there is only one y-data-point per time-point. Here, we want the argument type = .... There are many options here, but the three most important are:

type = "p" plots points, and is the default behaviour.
type = "l" plots lines.
type = "b" plots both points and lines.

We can see an example of this below. Because year is a time variable, and there is one October temperature per year, it makes sense to join up our data points with lines; and because we have so many years, the scatterplot is clearer without points.

plot(temperature$year, temperature$oct, type = "l")

You might want to think about what we can learn about the data from this scatterplot.

Exercise 4.6. Draw a scatterplot for the first 30 years of data, with year as the x-axis and February temperature as the y-axis, with points plotted that are also joined up with lines.

Assessed questions

The following five assessed questions should be submitted to the Microsoft Form that is linked to below. I recommend you do this in Week 5, but the official deadline is Monday 6 November at 1400.

You should submit your work by typing your answers into this Microsoft Form. You may need to be logged into your University account.

This work will be marked automatically by computer, so make sure your answers are accurate – the computer does not “know what you meant”; only what you actually enter into the form.

So that (most) students get different data to work with, you will work from a data file whose name is based on your student ID number. Your student ID number is (usually) a 9-digit number starting 201. Your dataset is the CSV file at

https://mpaldridge.github.io/math1710/data/R4-xy.csv

where xy is replaced by the last two digits of your student ID number. So if your student ID is 201623429, then you should use the file at https://mpaldridge.github.io/math1710/data/R4-29.csv; if your student ID number is 201491200 then you should use the file https://mpaldridge.github.io/math1710/data/R4-00.csv and so on. Take care to check you get this correct: if you get your student ID number wrong and/or use the wrong data file, the computer is likely to award you 0 marks.

First, read the data set at https://mpaldridge.github.io/math1710/data/R4-xy.csv into R. Remember that xy should be replaced by the last 2 digits of your student ID number. This data set has nine columns, called alpha, beta, gamma, delta, epsilon, zeta, eta, theta, and iota.

Assessed Question 1. Consider the alpha data and beta data. Which dataset has the larger mean?

Assessed Question 2. Draw a boxplot comparing the gamma data and delta data. Which data set is generally the more spread out?

Assessed Question 3. Draw a boxplot comparing the epsilon data and zeta data. One of the datasets has a very significant outlier. Which one?

Assessed Question 4. Draw a histogram of the eta data, and experiment with different numbers of breaks. How many “peaks” do you think it is accurate to say the histogram has? (Remember that too many breaks may cause too many “random” peaks, while too few breaks may smooth out actual peaks.)

Assessed Question 5. Draw a scatterplot with the theta data on the x-axis and the iota data on the y-axis. You should find that there is a non-linear relationship between the two data sets. As theta increases, does iota go up then down (like \(\cap\), or a sad face), or down then up (like \(\cup\) or a happy face)?

Remember to submit your answers into this Microsoft Form before 1400 on Monday 6 November.

R Worksheet 4: Plots I: Making plots

MATH1710 Probability and Statistcs I

University of Leeds, 2023–24

Boxplots

Histograms

Scatterplots

Assessed questions