This R worksheet does include assessed questions at the bottom. The deadline for submitting your solutions to these questions is Monday 6 November at 1400. If you have difficulty with this worksheet, you can get help at the office hours drop-in sessions.
In this worksheet, we will look at how to make the three types of data visualisation we studied in Lecture 2: boxplots, histograms, and scatterplots.
In this worksheet, we will just consider the basics of how to produce each plot. In the next worksheet, R Worksheet 5, we will see ways to improve and alter the plots, by adding axis titles, changing the colours, and so on.
We will use the same Met Office historical temperature data as we used on R Worksheet 3 for our example data.
temperature <- read.csv("https://mpaldridge.github.io/math1710/data/met-office.csv")
Exercise 4.1. Read the temperature data into R by copying the command above. Just to check you remember from R Worksheet 3: What was the median August temperature?
The first type of plot we will look at is the boxplot. This is very simple. We simply feed the vector of data we wish to plot into the function boxplot(). So, for example, a boxplot of the January temperatures can be produced with
boxplot(temperature$jan)
You should find that a picture like the one above appears in the “Plots” tab of RStudio – this is usually in the bottom-right quadrant of the screen. If the “Plots” tab is too small on your screen, the picture may fail to appear – you should drag the borders of the “Plots” tab to make it bigger, then try again.
Above the picture, you should see the “Export” button. Clicking on this will give you the opportunity to either save your graph or copy it to the clipboard. Saving the plot is useful if you think you might want it again in the future – although if you are saving all your commands in an R script, as we discussed on R Worksheet 1, you can very quickly recreate any plots you need again without having to save each one. Copying the plot to the clipboard can be helpful if you want to include your plot in another document, like a Word document or a PowerPoint presentation.
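If you would rather save a plot from your R script than use the “Export” button, one possible approach is to open a graphics device, draw the plot, and then close the device. The sketch below is only an illustration under stated assumptions: jan_example is made-up stand-in data (not the Met Office temperatures), the file name is hypothetical, and pdf() and dev.off() are functions that have not been covered on these worksheets.

```r
# Made-up stand-in data, NOT the real Met Office temperatures
set.seed(1)
jan_example <- rnorm(50, mean = 4, sd = 1.5)

# Hypothetical output file, placed in R's temporary folder
out_file <- file.path(tempdir(), "jan-boxplot.pdf")

pdf(out_file)          # open a PDF graphics device pointing at the file
boxplot(jan_example)   # the plot is drawn into the file, not the "Plots" tab
dev.off()              # close the device so that the file is written

file.exists(out_file)  # check that the file was created
```

The function png() can be used in the same open, draw, close pattern if you want an image file rather than a PDF.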
Exercise 4.2. Draw a boxplot of the August temperature. Save your graph as an image.
Boxplots are mostly helpful to compare data, so we often want to draw multiple boxplots next to each other. In this case, we simply plug all the data vectors into boxplot(), separated by commas.
So to compare the temperatures in July, August, and September, we use
boxplot(temperature$jul, temperature$aug, temperature$sep)
You might like to think about what we can learn about the temperature data in those months from this boxplot.
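As well as drawing the picture, boxplot() invisibly returns the numbers it used, which can help when reading a comparison. The following is a minimal sketch, assuming two made-up stand-in vectors (jul_example and aug_example are hypothetical, not the real data); the name bp is also just an illustrative choice.

```r
# Made-up stand-in data for two months, NOT the real temperatures
set.seed(2)
jul_example <- rnorm(60, mean = 16, sd = 1)
aug_example <- rnorm(60, mean = 16, sd = 2)

# Draw the two boxplots side by side and keep the returned summary
bp <- boxplot(jul_example, aug_example)

# Each column of bp$stats describes one boxplot, from bottom to top:
# lower whisker, lower hinge, median, upper hinge, upper whisker
bp$stats

# Height of each box (upper hinge minus lower hinge),
# a rough measure of how spread out each data vector is
box_heights <- bp$stats[4, ] - bp$stats[2, ]
box_heights
```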
Exercise 4.3. Draw a boxplot comparing the April temperature for the first 68 years in the data set (rows 1 to 68) and the second 68 years in the data set (rows 69 to 136). (You may need to look back at R Worksheet 3 to remind yourself how to extract limited rows from a column of a data frame.) Comment briefly on what you learn from the comparison.
In the next worksheet, we will see how to label each boxplot, label the y-axis, add a title, and more.
The basic method of how to draw a histogram is again pretty simple – just pass the data vector to the function hist().
hist(temperature$dec)
An important extra optional argument is the freq = ... argument. You’ll notice that the histogram above plots “Frequency” on the y-axis. Setting freq = TRUE ensures that the frequency is plotted – although since this is the default behaviour (with one exception mentioned below), it’s not necessary to do so, and it can be more convenient to leave it out. Setting freq = FALSE instead plots the frequency density, as we discussed in Section 2.1 in Lecture 2. Compare the first picture above with this one (and look carefully at the y-axis):
hist(temperature$dec, freq = FALSE)
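To see what the density scale means in R, it can help to look at the numbers behind the picture: with freq = FALSE, R scales the bar heights so that the total area of the bars is 1. The sketch below checks this, assuming made-up stand-in data (dec_example is hypothetical, not the real December temperatures).

```r
# Made-up stand-in data, NOT the real December temperatures
set.seed(3)
dec_example <- rnorm(100, mean = 4, sd = 1.5)

# hist() draws the histogram and also returns the numbers it used
h <- hist(dec_example, freq = FALSE)

# Bar area = height (density) x bin width; the areas add up to 1
bin_widths <- diff(h$breaks)
total_area <- sum(h$density * bin_widths)
total_area

# The density is the count divided by (number of data points x bin width)
density_by_hand <- h$counts / (length(dec_example) * bin_widths)
all.equal(density_by_hand, h$density)
```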
Another important extra optional argument is breaks = ... . This allows you to specify the bins you create. Your choices here are:
Don’t use the breaks argument at all: this lets R choose what it thinks is a sensible number of bins.
Set breaks equal to a number, like breaks = 5: this tells R how many bins you want. R won’t necessarily choose exactly the number of bins you request – rather, it will pick a number close to that which it thinks splits up the x-axis nicely. (You may find this behaviour helpful or annoying, according to taste.)
Set breaks equal to a vector, like breaks = c(-5, 0, 2, 4, 6, 10, 20): these are the exact breakpoints between bins that R uses. Note that if the breakpoints are not equally spaced, then – very sensibly – plotting the density (freq = FALSE) becomes the default behaviour.
Here’s an example of specifying the bins we want to use:
hist(temperature$mar, breaks = c(0, 2, 4, 5, 6, 8, 10))
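The difference between giving breaks a number and giving it a vector can be checked by inspecting what hist() returns. This is a rough sketch under stated assumptions: mar_example is made-up stand-in data chosen to lie between 0 and 10, and plot = FALSE is an extra argument (not used elsewhere on this worksheet) that asks hist() to return the bin information without drawing anything.

```r
# Made-up stand-in data between 0 and 10, NOT the real March temperatures
set.seed(4)
mar_example <- runif(80, min = 0, max = 10)

# A single number is only a suggestion for the number of bins
h_suggest <- hist(mar_example, breaks = 5, plot = FALSE)
length(h_suggest$counts)   # the number of bins R actually used
h_suggest$breaks           # the breakpoints R chose

# A vector gives the exact breakpoints between bins
h_exact <- hist(mar_example, breaks = c(0, 2, 4, 5, 6, 8, 10), plot = FALSE)
h_exact$breaks
h_exact$equidist           # FALSE, because the bins have unequal widths
```

Note that the breakpoints in a vector must cover all of the data; R reports an error if some data points fall outside them.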
Exercise 4.4. Plot four histograms of the May temperature with density on the y-axis using (approximately) 3, 6, 10, and 20 bins. Which number of bins do you think is the most appropriate, and why?
In the next worksheet we will cover how to label our histogram to make the data display more clearly.
The command for a scatterplot is simply plot(). You should provide two arguments: the first argument will be plotted on the x-axis and the second on the y-axis.
So to compare July and August temperatures each year, we can use
plot(temperature$jul, temperature$aug)
There appears to be a positive relationship here, although not an extremely strong one. We can confirm that by calculating the correlation, as on R Worksheet 3.
cor(temperature$jul, temperature$aug)
[1] 0.5397355
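As a reminder of what cor() calculates, the sketch below compares it with the correlation worked out by hand as the covariance divided by the product of the two standard deviations. It assumes made-up stand-in data (jul_example and aug_example are hypothetical, so the value will not match the one above).

```r
# Made-up stand-in data with a positive relationship, NOT the real data
set.seed(5)
jul_example <- rnorm(40, mean = 16, sd = 1.2)
aug_example <- 0.5 * jul_example + rnorm(40, mean = 8, sd = 1)

# Scatterplot: first argument on the x-axis, second on the y-axis
plot(jul_example, aug_example)

# Correlation from the built-in function...
r_builtin <- cor(jul_example, aug_example)

# ...and by hand: covariance / (standard deviation x standard deviation)
r_by_hand <- cov(jul_example, aug_example) /
  (sd(jul_example) * sd(aug_example))

c(r_builtin, r_by_hand)
```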
Exercise 4.5. Plot a scatterplot for January and August temperature. Do you see a relationship here? Copy your scatterplot to the clipboard and paste it into some other program.
In R Worksheet 5, we will look at some of the many extra optional arguments that plot() can take. But we will mention one here.
By default, R plots a point (or, rather, a small circle) for each pair \((x_i, y_i)\) of datapoints. But sometimes you might prefer to join up points with a line – this can be particularly appropriate if the x-axis represents time and there is only one y-data-point per time-point. Here, we want the argument type = ... . There are many options here, but the three most important are:
type = "p" plots points, and is the default behaviour.
type = "l" plots lines.
type = "b" plots both points and lines.
We can see an example of this below. Because year is a time variable, and there is one October temperature per year, it makes sense to join up our data points with lines; and because we have so many years, the scatterplot is clearer without points.
plot(temperature$year, temperature$oct, type = "l")
You might want to think about what we can learn about the data from this scatterplot.
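To compare the three type options directly, one possibility is to draw them next to each other. The sketch below assumes a short made-up time series (year_example and oct_example are hypothetical values, not the real data) and uses par(mfrow = ...), which has not been covered on these worksheets, to place three plots in one row.

```r
# Made-up stand-in time series, NOT the real October temperatures
year_example <- 1990:2001
oct_example <- c(10.2, 9.8, 11.1, 10.5, 9.9, 10.8,
                 11.4, 10.1, 10.9, 11.2, 10.6, 11.0)

# Split the plotting area into 1 row and 3 columns,
# remembering the old settings so they can be restored
old_par <- par(mfrow = c(1, 3))

plot(year_example, oct_example, type = "p")  # points only (the default)
plot(year_example, oct_example, type = "l")  # lines only
plot(year_example, oct_example, type = "b")  # both points and lines

# Go back to one plot per picture
par(old_par)
```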
Exercise 4.6. Draw a scatterplot for the first 30 years of data, with year as the x-axis and February temperature as the y-axis, with points plotted that are also joined up with lines.
The following five assessed questions should be submitted to the Microsoft Form that is linked to below. I recommend you do this in Week 5, but the official deadline is Monday 6 November at 1400.
You should submit your work by typing your answers into this Microsoft Form. You may need to be logged into your University account.
This work will be marked automatically by computer, so make sure your answers are accurate – the computer does not “know what you meant”; only what you actually enter into the form.
So that (most) students get different data to work with, you will work from a data file whose name is based on your student ID number. Your student ID number is (usually) a 9-digit number starting 201. Your dataset is the CSV file at https://mpaldridge.github.io/math1710/data/R4-xy.csv where xy is replaced by the last two digits of your student ID number. So if your student ID is 201623429, then you should use the file at https://mpaldridge.github.io/math1710/data/R4-29.csv; if your student ID number is 201491200 then you should use the file https://mpaldridge.github.io/math1710/data/R4-00.csv and so on. Take care to check you get this correct: if you get your student ID number wrong and/or use the wrong data file, the computer is likely to award you 0 marks.
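If you want to reduce the risk of mistyping the file name, one possible approach is to build the address in R from your student ID. This is only a sketch: the variable names student_id, xy and data_url are illustrative choices, and the ID used is the first example ID from the paragraph above, so you would need to replace it with your own.

```r
# Example ID from the text above; replace with your own student ID.
# It is stored as text (in quotes) so that last digits like "00" are kept.
student_id <- "201623429"

# Extract the last two characters of the ID
n <- nchar(student_id)
xy <- substr(student_id, n - 1, n)

# Glue the pieces of the address together
data_url <- paste0("https://mpaldridge.github.io/math1710/data/R4-", xy, ".csv")
data_url

# The file could then be read in the same way as the temperature data:
# my_data <- read.csv(data_url)
```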
First, read the data set at https://mpaldridge.github.io/math1710/data/R4-xy.csv into R. Remember that xy should be replaced by the last two digits of your student ID number. This data set has nine columns, called alpha, beta, gamma, delta, epsilon, zeta, eta, theta, and iota.
Assessed Question 1. Consider the alpha data and beta data. Which dataset has the larger mean?
Assessed Question 2. Draw a boxplot comparing the gamma data and delta data. Which data set is generally the more spread out?
Assessed Question 3. Draw a boxplot comparing the epsilon data and zeta data. One of the datasets has a very significant outlier. Which one?
Assessed Question 4. Draw a histogram of the eta data, and experiment with different numbers of breaks. How many “peaks” do you think it is accurate to say the histogram has? (Remember that too many breaks may cause too many “random” peaks, while too few breaks may smooth out actual peaks.)
Assessed Question 5. Draw a scatterplot with the theta data on the x-axis and the iota data on the y-axis. You should find that there is a non-linear relationship between the two data sets. As theta increases, does iota go up then down (like \(\cap\), or a sad face), or down then up (like \(\cup\), or a happy face)?
Remember to submit your answers into this Microsoft Form before 1400 on Monday 6 November.