R programming for beginners (GV900)

Lesson 12: Confidence intervals

Sunday, January 14, 2024

Video of Lesson 12

1 Setup

In this lesson, we will learn about confidence intervals.

First, load the packages we will use in this lesson.

Code

library(tidyverse)
library(openintro)
library(carData)

2 Standard Error

We have learned that the standard error of the sample mean is calculated as follows:

\[\begin{aligned} se &= {sd \over \sqrt n} \\ \\ &= {\sqrt{{\sum_{i=1}^n(y_i-\bar{y})^2 \over n-1}} \over \sqrt n} \end{aligned}\]

With the standard error, we can calculate the confidence interval.
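As a minimal sketch of the formula above, the block below computes the standard error twice: once by writing out the sum of squares, and once with the built-in `sd()`. The vector `y` is made-up illustrative data, not part of the lesson's data set.

```r
# Hypothetical example data (illustrative only, not from the lesson)
y <- c(23, 35, 41, 52, 60, 68)
n <- length(y)

# Sample standard deviation written out from the formula
sd_manual <- sqrt(sum((y - mean(y))^2) / (n - 1))

# Standard error: standard deviation divided by the square root of n
se_manual <- sd_manual / sqrt(n)

# The same quantity using the built-in sd()
se_builtin <- sd(y) / sqrt(n)

# Check that the two versions agree
all.equal(se_manual, se_builtin)
```

Both versions should agree, because `sd()` uses the same n - 1 denominator as the formula.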

3 Confidence Interval

Confidence Interval

A confidence interval is a range of values that is likely to contain the population parameter with a certain degree of confidence.

The 95% confidence interval is calculated as follows:

\[\begin{aligned} CI &= \bar{y} \pm 1.96 \times se \end{aligned}\]

Code

# why 1.96?
qnorm(0.975)

[1] 1.959964

Code

normTail(M = c(-1.96, 1.96), col = "purple")
abline(v = 0, col = "red")

The 99% confidence interval is calculated as follows:

\[\begin{aligned} CI &= \bar{y} \pm 2.58 \times se \end{aligned}\]

Code

# why 2.58?
qnorm(0.995)

[1] 2.575829

Code

normTail(M = c(-2.58, 2.58), col = "purple")
abline(v = 0, col = "red")
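The two multipliers above follow the same pattern: for a confidence level, the remaining probability is split equally between the two tails. A small sketch of that rule is shown below; the helper name `z_for_level` is made up for this illustration.

```r
# Hypothetical helper: critical value for a two-sided interval
# at a given confidence level (e.g. 0.95)
z_for_level <- function(level) {
  alpha <- 1 - level       # probability left outside the interval
  qnorm(1 - alpha / 2)     # half of alpha goes in each tail
}

z_for_level(0.95)  # same as qnorm(0.975)
z_for_level(0.99)  # same as qnorm(0.995)
z_for_level(0.90)  # same as qnorm(0.95)
```

The first two calls reproduce the values 1.96 and 2.58 used in the formulas above.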

4 Example

We use the BEPS data to calculate a confidence interval for the mean age of the population.
Here we assume that the whole data set is the population.

Code

# population mean and sd of age
mean(BEPS$age)

[1] 54.1823

Code

sd(BEPS$age)

[1] 15.71121

Code

# Take a random sample of size 100 (no seed is set, so the results vary between runs)
sample <- BEPS |> 
  sample_n(size = 100) |> 
  summarise(mean = mean(age), 
            sd = sd(age),
            se = sd(age)/sqrt(100))

The 95% confidence interval is calculated as follows:

Code

# 95% confidence interval
lower95 <- sample$mean - 1.96 * sample$se
upper95 <- sample$mean + 1.96 * sample$se

cat("95% confidence interval is: [",lower95, ",", upper95,"]")

95% confidence interval is: [ 49.05399 , 55.48601 ]

In this example, we say that we are 95% confident that the interval [49.05, 55.49] covers the population mean age.
This is true in this case, because the population mean age is 54.18.

Code

normTail(m = sample$mean, 
         s = sample$se,
         df = 99,
         M = c(lower95 , upper95), col = "purple")

The 99% confidence interval is calculated as follows:

Code

# 99% confidence interval
lower99 <- sample$mean - 2.58 * sample$se
upper99 <- sample$mean + 2.58 * sample$se

cat("99% confidence interval is: [",lower99, ",", upper99,"]")

99% confidence interval is: [ 48.03668 , 56.50332 ]

Code

normTail(m = sample$mean, 
         s = sample$se,
         df = 99,
         M = c(lower99 , upper99), col = "purple")

In this example, we say that we are 99% confident that the interval [48.04, 56.50] covers the population mean age.
This is, of course, also true in this case, because the 99% confidence interval is wider than the 95% confidence interval, which already covered the population mean.
However, so far we have assumed that the population mean age is 54.18, which is really just the mean of the whole BEPS sample.
In reality, we do not know the population mean age. So let's treat the whole data set as a sample and use it to estimate the population mean age.

Code

Sample <- BEPS |> 
  summarise(mean = mean(age), 
            sd = sd(age),
            se = sd(age)/sqrt(n()))

The 95% confidence interval is calculated as follows:

Code

# 95% confidence interval
Lower95 <- Sample$mean - 1.96 * Sample$se
Upper95 <- Sample$mean + 1.96 * Sample$se

cat("95% confidence interval is: [",Lower95, ",", Upper95,"]")

95% confidence interval is: [ 53.39374 , 54.97085 ]

In this example, we say that we are 95% confident that the interval [53.39, 54.97] covers the population mean age.
In this case, we don't know the population mean age, so we cannot say whether the interval actually covers it or not. But we are 95% confident that it does, in the sense that about 95% of intervals built this way from repeated samples would cover the population mean. Still, there is a 5% chance that an interval does not.
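To illustrate what "95% confident" means in practice, here is a small simulation sketch. It assumes a simulated population (drawn with `rnorm()`, not the BEPS data) whose mean we know, takes many samples of size 100, and counts how often the 95% interval covers the true mean. The seed, the population parameters, and the function name `covers_mu` are arbitrary choices for this illustration.

```r
set.seed(900)  # arbitrary seed so the run is reproducible

# Hypothetical population (simulated, not the BEPS data)
population <- rnorm(100000, mean = 54, sd = 16)
mu <- mean(population)

# Draw one sample of size n and check whether its 95% CI covers mu
covers_mu <- function(n = 100) {
  s <- sample(population, size = n)
  se <- sd(s) / sqrt(n)
  lower <- mean(s) - 1.96 * se
  upper <- mean(s) + 1.96 * se
  lower <= mu && mu <= upper
}

# Repeat 2000 times and compute the share of intervals that cover mu
coverage <- mean(replicate(2000, covers_mu()))
coverage
```

The share should come out close to 0.95, although it will not be exactly 0.95 in any single run.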
Normally, we use the 95% confidence interval, but we can also use a 99% or 90% confidence interval.
Which level to use depends on the situation. If we want to be more confident but less precise, we use the 99% confidence interval, which is wider. If we want to be more precise but less confident, we use the 90% confidence interval, which is narrower.
In other words, choosing a confidence level is a trade-off between confidence and precision.
It's analogous to an archer aiming at a target. If the goal is for the archer to hit anywhere on the target, their confidence may be high, around 99%. However, the arrow could land precisely in the bull's eye or just at the edge of the target. On the other hand, if the objective is to hit the bull's eye specifically, the archer might feel less confident, perhaps only 90%.
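The trade-off can be seen by computing intervals at several confidence levels for the same sample. The sketch below uses made-up values for the sample mean and standard error (`sample_mean` and `sample_se` are illustrative, not results from the lesson).

```r
# Hypothetical sample summary (illustrative numbers only)
sample_mean <- 52
sample_se <- 1.6

# Confidence levels to compare and their z critical values
conf_levels <- c(0.90, 0.95, 0.99)
z <- qnorm(1 - (1 - conf_levels) / 2)

# Interval bounds and widths for each level
intervals <- data.frame(
  level = conf_levels,
  lower = sample_mean - z * sample_se,
  upper = sample_mean + z * sample_se
)
intervals$width <- intervals$upper - intervals$lower
intervals
```

The width grows as the confidence level rises: more confidence is bought with a less precise interval.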

5 Recap

A confidence interval is a range of values that is likely to contain the population parameter with a certain degree of confidence.
The 95% confidence interval is calculated as follows:

\[\begin{aligned} CI &= \bar{y} \pm 1.96 \times se \end{aligned}\]

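In this lesson, we used the z-score (1.96 or 2.58) to calculate the confidence interval. Strictly speaking, this assumes that we know the population standard deviation, or that the sample is large enough for the sample standard deviation to stand in for it.
In the t-test lesson, we will learn how to calculate the confidence interval when we don't know the population standard deviation.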
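As a brief preview, and only as a sketch, the block below compares the z critical value with the corresponding critical value from the t distribution for a few sample sizes. The sample sizes are arbitrary choices for illustration.

```r
# Arbitrary sample sizes chosen for illustration
n <- c(10, 30, 100, 1000)

# 95% critical values: normal (z) versus t with n - 1 degrees of freedom
critical <- data.frame(
  n = n,
  z = qnorm(0.975),
  t = qt(0.975, df = n - 1)
)
critical
```

The t value is always somewhat larger than the z value, and the gap shrinks as the sample size grows, which is why the z-score works reasonably well for large samples.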

Thank you!