This R worksheet does not include assessed questions.


CSV files

If we are dealing with a very small amount of data, then we can simply type it into one or more vectors in R, using the c() function, as on R Worksheet 2. But if we have a large amount of data, we will want to “read” it into R directly from a file – that is, we will want R to take the data from an external file and turn it automatically into an R object.

When large amounts of data are transferred, the standard format is a comma-separated values (CSV) file, which typically has the file suffix .csv. A CSV file looks like this:

year,jan,feb,mar,apr,may,jun,jul,aug,sep,oct,nov,dec
1884,4.9,3.9,4.9,6.0,9.7,12.6,14.7,15.5,13.0,8.0,4.6,2.9
1885,1.8,4.1,3.2,6.3,7.5,12.4,14.5,12.2,10.7,6.1,4.6,2.7
1886,0.8,0.5,2.6,5.9,8.4,11.6,14.2,14.3,12.0,9.8,5.4,0.9
1887,1.3,3.1,3.0,5.3,8.1,14.1,15.9,13.8,10.4,6.1,3.7,1.8
1888,2.5,1.1,1.7,5.1,9.2,11.3,12.0,12.6,10.8,7.1,6.3,3.9

Here, the first line of the file is a row of “headers”, telling us what the rest of the data represents. This is data from the UK Met Office about the average UK temperature in each month of different years. Each line of the file after the first corresponds to one data record (here, one year), and each piece of data is separated from the next by a single comma (,) with no spaces. Represented in a more common table view, this data would look like this:

year  jan  feb  mar  apr  may   jun   jul   aug   sep  oct  nov  dec
1884  4.9  3.9  4.9  6.0  9.7  12.6  14.7  15.5  13.0  8.0  4.6  2.9
1885  1.8  4.1  3.2  6.3  7.5  12.4  14.5  12.2  10.7  6.1  4.6  2.7
1886  0.8  0.5  2.6  5.9  8.4  11.6  14.2  14.3  12.0  9.8  5.4  0.9
1887  1.3  3.1  3.0  5.3  8.1  14.1  15.9  13.8  10.4  6.1  3.7  1.8
1888  2.5  1.1  1.7  5.1  9.2  11.3  12.0  12.6  10.8  7.1  6.3  3.9
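
To see how the raw text and the table view correspond, here is a minimal sketch that hands the five sample records to read.csv() as a string, using its text argument instead of a file. The object names csv_text and sample_temps are made up for this illustration and do not appear elsewhere on the worksheet.

```r
# The header line plus the five sample records, stored as one character string.
csv_text <- "year,jan,feb,mar,apr,may,jun,jul,aug,sep,oct,nov,dec
1884,4.9,3.9,4.9,6.0,9.7,12.6,14.7,15.5,13.0,8.0,4.6,2.9
1885,1.8,4.1,3.2,6.3,7.5,12.4,14.5,12.2,10.7,6.1,4.6,2.7
1886,0.8,0.5,2.6,5.9,8.4,11.6,14.2,14.3,12.0,9.8,5.4,0.9
1887,1.3,3.1,3.0,5.3,8.1,14.1,15.9,13.8,10.4,6.1,3.7,1.8
1888,2.5,1.1,1.7,5.1,9.2,11.3,12.0,12.6,10.8,7.1,6.3,3.9"

# read.csv() treats the first line as headers and every later line as one record.
sample_temps <- read.csv(text = csv_text)

names(sample_temps)  # the headers become the column names
nrow(sample_temps)   # 5: one row per year, the header line is not counted
sample_temps         # prints the data in table view
```

Each comma in the text marks the boundary between one column and the next, which is why the file needs no spaces or other alignment.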

There are many advantages to using CSV files to store and send data:

We can read data into R from a CSV file saved “locally” – that is, saved on the hard drive of the computer you are working on now – or from a CSV file on the internet. In these R Worksheets, we will only read in data from the internet: for this module, you never need to save any CSV data files to your computer. The advantages of reading data in from the internet include:

A disadvantage is that the data owner might remove their data from the internet.

Reading CSV files into R

We read data from CSV files into R using the read.csv() function. The temperature data we mentioned is available at the web address https://mpaldridge.github.io/math1710/data/met-office.csv. To read this data into R, we use the R command:

temperature <- read.csv("https://mpaldridge.github.io/math1710/data/met-office.csv")

(Note the quotation marks " " around the web address in the read.csv() command.) By using temperature <-, we have read the data into an R object we have called temperature. R will not print anything at this stage: we only see output once we interact with the object temperature in some way.
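
For comparison, read.csv() accepts a local file path in exactly the same way as a web address. The sketch below is only an illustration and is not needed for this module: it writes a tiny three-column extract of the sample data to a temporary file and reads it back. The names csv_path and local_temps are hypothetical.

```r
# Create a temporary file and write a small CSV into it, one line per element.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("year,jan,feb",
             "1884,4.9,3.9",
             "1885,1.8,4.1"),
           csv_path)

# Reading a local file: pass the path, in quotation marks or as a string object.
local_temps <- read.csv(csv_path)
local_temps

# Tidy up the temporary file.
unlink(csv_path)
```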

Remember: To read data from a CSV file into R you must use the read.csv() command. Just typing a web address straight into R will not do anything (except return an error). The most common query I get from MATH1710 students is “I typed a web address into R and it returned an error. Has the data been deleted?” My response is almost always “No, the data has not been deleted. You forgot to use the read.csv() command.”
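
The error can be reproduced safely with the sketch below. It uses parse() to ask R to interpret the bare address as if it had been typed at the prompt, and tryCatch() to capture the resulting error message instead of stopping. The object names address and result are made up for this illustration.

```r
address <- "https://mpaldridge.github.io/math1710/data/met-office.csv"

# Interpret the address as R code, as if typed at the prompt without quotes.
# A web address is not valid R syntax, so this fails; we keep the error message.
result <- tryCatch(parse(text = address),
                   error = function(e) conditionMessage(e))
result  # a syntax error message complaining about an unexpected '/'

# In quotation marks, the address is just a character string, not data.
is.character(address)  # TRUE
```

Neither case touches the data file itself: only read.csv() fetches the file and turns it into a data frame.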

Once we’ve read some data into R, we will want to check it has worked, and find out some basic details about the data. There are various functions that will do this:
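
As a hedged illustration, here are some standard base R functions that are commonly used for this kind of check. They are not necessarily the exact functions this worksheet has in mind. To keep the example runnable without an internet connection, it uses a small made-up extract called sample_temps rather than the full temperature data.

```r
# A three-row, four-column extract of the sample data (illustration only).
sample_temps <- read.csv(text = "year,jan,feb,mar
1884,4.9,3.9,4.9
1885,1.8,4.1,3.2
1886,0.8,0.5,2.6")

head(sample_temps, 2)  # the first 2 rows (head() shows 6 rows by default)
names(sample_temps)    # the column names
nrow(sample_temps)     # number of rows: 3
ncol(sample_temps)     # number of columns: 4
dim(sample_temps)      # rows and columns together: 3 4
str(sample_temps)      # compact overview of each column and its type
summary(sample_temps)  # summary statistics for each column
```

The same calls work on any data frame, so they can be applied directly to temperature once it has been read in.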

Exercise 3.1. Copy the command above to read in the Met Office temperature data to an object called temperature. How many rows does the data have?

Let us note for later that the names of our temperature data are these:

names(temperature)
 [1] "year" "jan"  "feb"  "mar"  "apr"  "may"  "jun"  "jul"  "aug"  "sep"  "oct"  "nov"  "dec" 

Data frames

When R reads in data from a CSV file, it saves it as a type of object known as a data frame. Let us look at the top of temperature:

head(temperature)
  year jan feb mar apr  may  jun  jul  aug  sep oct nov dec
1 1884 4.9 3.9 4.9 6.0  9.7 12.6 14.7 15.5 13.0 8.0 4.6 2.9
2 1885 1.8 4.1 3.2 6.3  7.5 12.4 14.5 12.2 10.7 6.1 4.6 2.7
3 1886 0.8 0.5 2.6 5.9  8.4 11.6 14.2 14.3 12.0 9.8 5.4 0.9
4 1887 1.3 3.1 3.0 5.3  8.1 14.1 15.9 13.8 10.4 6.1 3.7 1.8
5 1888 2.5 1.1 1.7 5.1  9.2 11.3 12.0 12.6 10.8 7.1 6.3 3.9
6 1889 2.7 1.9 3.4 5.7 11.3 14.0 13.5 13.5 11.3 7.5 5.5 2.4

A data frame is a two-dimensional table. The columns have names: the names "year", "jan", "feb", etc, that we saw above. The rows are simply numbered.
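
A short sketch can confirm this structure. It again uses a small made-up extract, sample_temps, so that it runs on its own; the same checks apply to temperature.

```r
sample_temps <- read.csv(text = "year,jan,feb
1884,4.9,3.9
1885,1.8,4.1
1886,0.8,0.5")

is.data.frame(sample_temps)  # TRUE: read.csv() always returns a data frame
names(sample_temps)          # the column names: "year" "jan" "feb"
rownames(sample_temps)       # the rows are simply numbered: "1" "2" "3"

# Each column is a vector with its own type.
class(sample_temps$year)     # "integer": whole numbers only
class(sample_temps$jan)      # "numeric": contains decimals
```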

We can extract a single data point using the square brackets [ ], similar to how we extracted single entries of vectors on R Worksheet 2. So, the data point in row 21 and column “oct” is

temperature[21, "oct"]
[1] 8.7

The notation here in the square brackets is: first the number of the row, then a comma, then the name of the column in quotation marks. (Don’t forget the quotation marks!)

You’ll also remember from R Worksheet 2 that we can pull out a segment from a column with, for example,

temperature[21:26, "oct"]
[1]  8.7  6.1  9.6  8.7 11.1  9.1
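
The same indexing can be tried on the five sample records shown at the start of this worksheet. In this sketch, sample_temps is a made-up name for a cut-down version of that data with only three columns.

```r
sample_temps <- read.csv(text = "year,mar,oct
1884,4.9,8.0
1885,3.2,6.1
1886,2.6,9.8
1887,3.0,6.1
1888,1.7,7.1")

# Row number first, then a comma, then the column name in quotation marks.
sample_temps[2, "mar"]    # 3.2, the March value for 1885

# A range of rows from one column comes back as a vector.
sample_temps[2:4, "mar"]  # 3.2 2.6 3.0
```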

If we want a whole row, we can just omit the column identifier – but remember that we still need the comma!

temperature[21, ]
   year jan feb mar apr may  jun  jul  aug  sep oct nov dec
21 1904 3.1   2 3.1 7.6 9.4 12.2 15.3 13.8 11.7 8.7 4.6 2.5

Similarly, if we want a whole column we can miss out the row identifier – but still keep the comma! – with

temperature[, "oct"]

(I won’t print the whole output here.) However, we want a whole column sufficiently often that there’s a shorter and more convenient notation where we just use a $ sign, and don’t need to worry about the square brackets, the comma, or the quotation marks.

temperature$oct
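
As a quick check that the two notations agree, the sketch below compares them on a small made-up extract called sample_temps. It also shows that a whole row comes back as a one-row data frame, while a whole column comes back as a plain vector.

```r
sample_temps <- read.csv(text = "year,mar,oct
1884,4.9,8.0
1885,3.2,6.1
1886,2.6,9.8
1887,3.0,6.1
1888,1.7,7.1")

# A whole row: leave the column blank, but keep the comma.
third_row <- sample_temps[3, ]
is.data.frame(third_row)  # TRUE: a row keeps its column names

# A whole column, written in two equivalent ways.
sample_temps[, "oct"]
sample_temps$oct
identical(sample_temps[, "oct"], sample_temps$oct)  # TRUE
```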

Exercise 3.2. We continue with the temperature data.
(a) What year does the 40th row correspond to?
(b) What was the temperature in August of that year?
(c) Output the whole list of January temperatures.
(d) Output the December temperature for rows 50 to 60.

Functions with data

Once we extract the rows or columns we need, we can then apply any of the functions we learned about on R Worksheet 2, like mean(), median(), var(), IQR(), cor(), min(), max() and so on.

So, for example, the mean February temperature is given by

mean(temperature$feb)
[1] 3.081618

the interquartile range of April temperatures in the first 100 rows is given by

IQR(temperature[1:100, "apr"])
[1] 1.625

the correlation between June and July temperatures is

cor(temperature$jun, temperature$jul)
[1] 0.2827105

and so on. Remember we can use round() and signif() with these to round the answers.
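
Here is a minimal sketch of the same pattern on the five sample records from the start of this worksheet, cut down to four columns. The name sample_temps is made up for this illustration, and the results apply to these five years only, not to the full temperature data.

```r
sample_temps <- read.csv(text = "year,jan,feb,dec
1884,4.9,3.9,2.9
1885,1.8,4.1,2.7
1886,0.8,0.5,0.9
1887,1.3,3.1,1.8
1888,2.5,1.1,3.9")

# Extract a column with $, then apply a summary function to it.
mean(sample_temps$feb)    # 2.54
median(sample_temps$jan)  # 1.8
max(sample_temps$dec)     # 3.9

# Restrict to some rows first, using square brackets.
var(sample_temps[1:3, "feb"])

# Wrap a calculation in round() or signif() to tidy the answer.
signif(mean(sample_temps$feb), 2)                   # 2.5
round(cor(sample_temps$jan, sample_temps$feb), 2)
```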

Exercise 3.3. We continue with the temperature data.
(a) What was the median temperature in September?
(b) What is the sample variance of the first fifty years of February data?
(c) What is the lowest December temperature?
(d) What is the correlation between October and November temperatures, restricted only to the first 100 years of data?