Lecture 1 Summary statistics

1.1 What is EDA?

Statistics is the study of data. Exploratory data analysis (or EDA, for short) is the part of statistics concerned with taking a “first look” at some data. Later, toward the end of this module, we will see more detailed and complex ways of building models for data, and in MATH1712 Probability and Statistics II (for those who take it) you will see many other statistical techniques – in particular, ways of testing formal hypotheses for data. But here we’re just interested in first impressions and brief summaries.

In this section, we will concentrate on two aspects of EDA:

  • Summary statistics: That is, calculating numbers that briefly summarise the data. A summary statistic might tell us what “central” or “typical” values of the data are, how spread out the data is, or about the relationship between two different variables.
  • Data visualisation: Drawing a picture based on the data is an another way to show the shape (centrality and spread) of data, or the relationship between different variables.

Even before calculating summary statistics or drawing a plot, however, there are other questions it is important to ask about the data:

  • What is the data? What variables have been measured? How were they measured? How many datapoints are there? What is the possible range of responses?
  • How was the data collected? Was data collected on the whole population or just a smaller sample? If a sample: How was that sample chosen? Is that sample representative of the population?
  • Are there any outliers? “Outliers” are datapoints that seem to be very different from the other datapoints – for example, are much larger or much smaller than the others. Each outlier should be investigated to seek the reason for it. Perhaps it is a genuine-but-unusual datapoint (which is useful for understanding the extremes of the data), or perhaps there is an extraordinary explanation (a measurement or recording error, for example) meaning the data is not relevant. Once the reason for an outlier is understood, it then might be appropriate to exclude it from analysis (for example, the incorrectly recorded measurement). It’s usually bad practice to exclude an outlier merely for being an outlier before understanding what caused it.
  • Ethical questions: Was the data collected ethically and, where necessary, with the informed consent of the subjects? Has it been stored properly? Are their privacy issues with the collection and storage of the data? What ethical issues should be considered before publishing (or not publishing) results of the analysis? Should the data be kept confidential, or should it be openly shared with other researchers for the betterment of science?

1.2 What is R?

R is a programming language that is particularly good at working with probability and statistics. A convenient way to use the language R is through the program RStudio. An important part of this module is learning to use R, by completing weekly worksheets – you can read more in the R section of these notes.

R can easily and quickly perform all the calculations and draw all the plots in this section of notes on exploratory data analysis. In this text, we’ll show the relevant R code. Code will appear like this:

data <- c(4, 7, 6, 7, 4, 5, 5)
mean(data)
[1] 5.428571

Here, the code in the first shaded box is the R commands that are typed into RStudio, which you can type in next to the > arrow in the RStudio “console”. The numerical answers that R returns are shown here in the second unshaded box. The [1] can be ignored (this is just R’s way of saying that this is the first part of the answer – but the answer here only has one part anyway). Plots produced by R are displayed in these notes as pictures.

Most importantly for now, you are not expected to understand the R code in this section yet. The code is included so that, in the future, as you work through the R worksheets week by week, you can look back at the code in the section, and it will start to make sense. By the time you have finished R Worksheet 5 in Week 6, you should be able understand most of the R code in this section.

1.3 Statistics of centrality

Suppose we have collected some data on a certain variable. We will assume here that we have \(n\) datapoints, each of which is a single real number. We can write this data as a vector \[ \mathbf x = (x_1, x_2, \dots, x_n) . \]

A statistic is a calculation from the data \(\mathbf x\), which is (usually) also a real number. In this section we will look at two types of “summary statistics”, which are statistics that we feel will give us useful information about the data.

We’ll look here at two types of summary statistic:

  • Statistics of centrality, which tell us where the “middle” of the data is.
  • Statistics of spread, which tell us how far the data typically spreads out from that middle.

Some measures of centrality are the following.

Definition 1.1 Consider some real-valued data \(\mathbf x = (x_1, x_2, \dots, x_n)\).

  • The mode is the most common value of \(x_i\). (If there are multiple joint-most common values, they are all modes.)
  • Suppose the data is ordered as \(x_1 \leq x_2 \leq \cdots \leq x_n\). Then the median is the central value in the ordered list. If \(n\) is odd, this is \(x_{(n+1)/2}\); if \(n\) is even, we normally take halfway between the two central points, \(\frac12(x_{n/2}+x_{n/2 + 1})\).
  • The mean \(\bar x\) is \[ \bar x = \frac{1}{n}(x_1 + x_2 + \cdots + x_n) = \frac1n \sum_{i=1}^n x_i . \]

In that last expression, we’ve made use of Sigma notation to write down the sum. (If Sigma notation is new to you, I recommend this PDF from MathCentre, or Section 2.4 of Clarke and Cooke, A Basic Course in Statistics.)

Example 1.1 Some packets of Skittles (a small fruit-flavoured sweet) were opened, and the number of Skittles in each packet counted. There were 13 packets, and the number of sweets (sorted from smallest to largest) were: \[ 59, \ 59, \ 59, \ 59, \ 60, \ 60, \ 60, \ 61, \ 62, \ 62, \ 62, \ 63, \ 63 .\]

The mode is 59, because there were 4 packets containing 59 sweets; more than any other number.

Since there are \(n = 13\) packets, the middle packet is number \(i = 7\), so the median is \(x_7 = 60\).

The mean is \[ \bar x = \frac{1}{13} (59 + 59 + \cdots + 63) = \frac{789}{13} = 60.7 .\]

The median is one example of a “quantile” of the data. Suppose our data is increasing order again. For \(0 \leq \alpha \leq 1\), the \(\alpha\)-quantile \(q(\alpha)\) of the data is the datapoint \(\alpha\) of the way along the list. Generally, \(q(\alpha)\) is equal to \(x_{1+\alpha(n-1)}\) when \(1+\alpha(n-1)\) is an integer. (If \(1+\alpha(n-1)\) isn’t an integer, there are various conventions of how to choose that we won’t go into here. R has nine different settings for choosing quantiles! – we will always just use R’s default choice.)

  • The median is the \(\frac12\)-quantile \(q(\frac12)\), which is \(q(\frac12) = x_7 = 60\) for this data.
  • The minimum is the 0-quantile \(q(0)\), which is \(q(0) = x_1 = 59\) for this data.
  • The maximum is the 1-quantile \(q(1)\), which is \(q(1) = x_{13} = 63\) for this data
  • The lower quartile (that’s “quartile”, as in “quarter” – not “quantile”) is the \(\frac14\)-quantile \(q(\frac14)\), which is \(q(\frac14) = x_4 = 59\) for this data.
  • The upper quartile is the \(\frac34\)-quantile \(q(\frac34)\), which is \(q(\frac34) = x_{10} = 62\) for this data.

The following R code reads in some data which has the daily average temperature in Leeds in 2020, divided into months. We can find, for example, the mean October temperature or the lower quartile of the July temperature.

temperature <- read.csv("https://mpaldridge.github.io/math1710/data/temperature.csv")
jul <- temperature[temperature$month == "jul", ]
oct <- temperature[temperature$month == "oct", ]

mean(oct$temp)
[1] 11.93548
quantile(jul$temp, probs = 1 / 4)
25% 
 15 

1.4 Statistics of spread

Some measures of spread are:

Definition 1.2 The number of distinct observations is precisely that: the number of different datapoints we have after removing any repeats.

The interquartile range is the difference between the upper and lower quartiles \(\text{IQR} = q(\frac34) - q(\frac14)\).

The sample variance is \[ s^2_x = \frac{1}{n-1} \left((x_1 - \bar x)^2 + \cdots + (x_n - \bar x)^2 \right) = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar x)^2 , \] where \(\bar x\) is the sample mean from before. The standard deviation \(s_x = \sqrt{s^2_x}\) is the square-root of the sample variance.

Example 1.2 We continue with the Skittles data.

The number of distinct observation is 5. (These are 59, 60, 61, 62, and 63.)

The interquartile range is \(x_{10} - x_4 = 62 - 59 = 3\).

You will calculate the sample variance on Problem Sheet 1.

The formula we’ve given for sample variance is sometimes called the “definitional formula”, as it’s the formula used to define the sample variance. We can rearrange that formula as follows: \[\begin{align*} s^2_x &= \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar x)^2 \\ &= \frac{1}{n-1} \sum_{i=1}^n (x_i^2 - 2x_i\bar x + \bar x^2) \\ &= \frac{1}{n-1}\left(\sum_{i=1}^nx_i^2 - \sum_{i=1}^n 2x_i\bar x + \sum_{i=1}^n\bar x^2 \right) \\ &= \frac{1}{n-1} \left(\sum_{i=1}^n x_i^2 - 2\bar x \sum_{i=1}^n x_i + \bar x^2 \sum_{i=1}^n 1 \right) \\ &= \frac{1}{n-1} \left(\sum_{i=1}^n x_i^2 - 2n\bar x^2 + n\bar x^2 \right) \\ &= \frac{1}{n-1} \left(\sum_{i=1}^n x_i^2 - n\bar x^2 \right) . \end{align*}\] Here, the first line is the definitional formula; the second line is from expanding out the bracket; the third line is taking the sum term-by-term; the fourth line takes any constants (things not involving \(i\)) outside the sums; the fifth line uses \(\sum_{i=1}^n x_i = n\bar x\), from the definition of the mean, and \(\sum_{i=1}^n 1 = 1 + 1 + \cdots 1 = n\); and the sixth line simplifies the final two terms.

This has left us with \[ s^2_x = \frac{1}{n-1} \left(\sum_{i=1}^n x_i^2 - n\bar x^2 \right) . \] This is sometimes called the “computational formula”; this is because it usually takes fewer presses of calculator buttons to compute the sample variance with this formula rather than the definitional formula. (But make sure you keep enough decimal points in \(\bar x^2\).)

Going back to our weather data in R, we can find the sample variance of the October weather or the interquartile range of the July weather.

var(oct$temp)
[1] 2.862366
IQR(jul$temp)
[1] 3

Summary

  • Exploratory data analysis is about taking a first look at data.
  • Summary statistics are numbers calculated from data that give us useful information about the data.
  • Summary statistics that measure the centre of the data include the mode, median, and mean.
  • Summary statistics that measure the spread of the data include the number of distinct outcomes, the interquartile range, and the sample variance.

Recommended reading:

On Problem Sheet 1, you should now be able to complete Questions A1, B1, B2, B4, C1.