Can.Do.So – R programming for beginners (GV900)

Video of Lesson 8

1 Setup

In previous lessons, we learned the basics of working with data in R. From this lesson on, we will learn how to use R to analyse data.

First, load the packages we will use in this lesson.

Code

library(tidyverse)
library(openintro)
library(gapminder)

2 Why data analysis?

Data analysis is the process of collecting, cleaning, and analysing data to discover useful information, draw conclusions, and support decision-making.
Data let us understand many things more easily and more accurately.
Let’s start with an example: how do we describe the economic well-being of a country?
There are many ways to describe it, such as what people eat, how they live, how they dress, and how they travel.
But a number is often the simplest and most accurate description.
A common choice is GDP per capita, which is an average: a country's total GDP divided by its population.

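More generally, the mean of a set of values is their sum divided by the number of values:

mean = (x1 + x2 + ... + xn) / n

In R, the mean() function calculates it for us.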

Code

gapminder |> # start from the gapminder data
  filter(year == 2007) |> # keep only the rows for 2007
  summarise(avg_gdppc = mean(gdpPercap), # mean GDP per capita across countries
            .by = continent) |> # one row per continent (needs dplyr >= 1.1.0)
  arrange(desc(avg_gdppc)) # sort from highest to lowest mean

# A tibble: 5 × 2
  continent avg_gdppc
  <fct>         <dbl>
1 Oceania      29810.
2 Europe       25054.
3 Asia         12473.
4 Americas     11003.
5 Africa        3089.
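
As a minimal sketch of what mean() does, the example below computes an average by hand and compares it with mean(). The three countries and their figures are made up for illustration; they are not taken from gapminder. It also shows that averaging country-level GDP per capita, as the table above does for each continent, treats every country equally, which is not the same as a population-weighted average.

```r
# Hypothetical data for three countries (illustrative values only)
gdp_per_capita <- c(A = 1000, B = 5000, C = 30000)
population     <- c(A = 50e6, B = 10e6, C = 1e6)

# The mean is the sum of the values divided by how many values there are
manual_mean  <- sum(gdp_per_capita) / length(gdp_per_capita)
builtin_mean <- mean(gdp_per_capita)

# A population-weighted mean gives larger countries more influence
weighted_avg <- weighted.mean(gdp_per_capita, w = population)

manual_mean
builtin_mean
weighted_avg
```

Here the weighted average is much lower than the simple mean, because the most populous hypothetical country is also the poorest.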

3 Why distribution?

GDP per capita (a mean, or average) is a useful index of the economic well-being of a country.
However, it is not enough. We may also want to know how many people earn about the average, how many earn less or more than the average, and how far the lowest and highest earners are from the average.
This is what the distribution of the data can tell us.
Again, R has functions that describe the distribution of the data.

Code

# summary statistics
gapminder_2007 <- subset(gapminder, year == 2007)
summary(gapminder_2007$gdpPercap)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  277.6  1624.8  6124.4 11680.1 18008.8 49357.2
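
In the summary above, the mean (11680.1) is almost twice the median (6124.4). That is a typical sign of a right-skewed distribution: a few very high values pull the mean up. The sketch below reproduces the pattern with simulated data. The log-normal draw and its parameters are assumptions chosen only to produce a skewed sample; they are not a model of the gapminder data.

```r
set.seed(1)

# Simulated right-skewed values (assumed parameters, for illustration only)
skewed_values <- rlnorm(1000, meanlog = 8, sdlog = 1)

# Same six-number summary as used for gdpPercap
summary(skewed_values)

# In a right-skewed sample the mean sits above the median
skew_check <- mean(skewed_values) > median(skewed_values)
skew_check

# The quartiles can also be requested directly
quantile(skewed_values, probs = c(0.25, 0.5, 0.75))
```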

Code

# histogram
gapminder |> 
  filter(year == 2007) |>
  ggplot(aes(x = gdpPercap)) +
  geom_histogram(bins = 30)

Code

# boxplot
gapminder |> 
  filter(year == 2007) |>
  ggplot(aes(x = continent, y = gdpPercap)) +
  geom_boxplot()

Why does the data distribution matter?

An example to show why the data distribution matters

Suppose you sell a certain kind of adult shoe.
How many pairs should you buy for your stock?
Suppose there are 10 different sizes.
Should you buy the same number of each size? If not, which sizes should you buy more of, and which fewer?
You would be in trouble if you bought the same number of each size: some sizes would sell out quickly, while others would hardly sell at all.
So you need to know the distribution of shoe sizes among your potential customers.
Then you can stock your inventory according to that distribution.
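
The shoe example can be sketched in a few lines of R. Everything in it is hypothetical: the ten sizes (36 to 45), the simulated customers, and the total of 1000 pairs are assumptions made for illustration, not real sales data.

```r
set.seed(42)

# Ten hypothetical shoe sizes
sizes <- 36:45

# Simulated sizes of 5000 potential customers (assumed mean and spread),
# rounded to whole sizes and limited to the sizes we sell
customer_sizes <- round(rnorm(5000, mean = 40.5, sd = 1.8))
customer_sizes <- pmin(pmax(customer_sizes, min(sizes)), max(sizes))

# Share of customers per size
size_share <- table(factor(customer_sizes, levels = sizes)) / length(customer_sizes)

# Split a stock of 1000 pairs in proportion to those shares
total_stock <- 1000
stock_plan  <- setNames(as.integer(round(total_stock * size_share)), sizes)
stock_plan
```

The plan puts most pairs in the middle sizes and only a few in the smallest and largest, instead of 100 pairs in every size. Because of rounding, the total may differ slightly from 1000.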

4 Why normal distribution?

Why does the normal distribution matter?
In the previous example, you will find that, in most cases, the further a size is from the middle, whether larger or smaller, the fewer pairs are sold.
This is because the shoe sizes of adults are approximately normally distributed.
The normal distribution is the most important distribution in statistics.
One way to remember the name is that the "normal" (typical) level of the data is also the most frequent level.
In the previous example, we would stock more shoes of the most frequent size.
In a normal distribution, the most frequent value is also the average value.
The name also fits because many quantities in nature follow it, at least approximately.
For example, the heights of people, the exam scores of students, even the lengths of dogs' tails are roughly normal. (Salaries, by contrast, are usually right-skewed rather than normal.)
Even when the data themselves are not normally distributed, we can often still use the normal distribution because of the central limit theorem (discussed in Part II).
That is why the normal distribution is so important and useful.
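
The central limit theorem mentioned above can be previewed with a small simulation. This is only a sketch under assumed settings (an exponential "population", 2000 samples of 50 values each); Part II treats the theorem properly. It uses one simple check: in a normal distribution, about 68% of values lie within one standard deviation of the mean.

```r
set.seed(123)

# A clearly non-normal, right-skewed "population" (assumed for illustration)
population <- rexp(100000, rate = 1)

# Draw 2000 samples of 50 values each and keep the mean of every sample
sample_means <- replicate(2000, mean(sample(population, size = 50)))

# Share of values lying within one standard deviation of their own mean
share_within_1sd <- function(x) mean(abs(x - mean(x)) <= sd(x))

share_within_1sd(population)    # raw data: noticeably different from 0.68
share_within_1sd(sample_means)  # sample means: close to 0.68
```

The raw values are skewed, but the means of the samples behave much more like a normal distribution.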

5 Features of normal distribution

The normal distribution has a bell-shaped curve.

Code

normTail(m = 170, # mean
         s = 5, # standard deviation
         curveColor  = "skyblue",
         axes = 3)
# add a vertical line to show the mean
abline(v = 170, col = "red")

The mean, median and mode of a normal distribution are equal.
They are located at the centre of the curve.
The curve is symmetric around the mean.
The total area under the curve is 1.

Code

normTail(m = 170, # mean
         s = 5, # standard deviation
         M = c(120, 210), # shade the middle area between 120 and 210
         col = "skyblue",
         axes = 3)
# add a vertical line to show the mean
abline(v = 170, col = "red")
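
The properties listed above can also be checked numerically with base R functions. This is a minimal sketch for the same curve (mean 170, standard deviation 5). The finite limits 120 and 220 are an assumption standing in for "the whole curve"; the area outside them is negligible.

```r
mu    <- 170
sigma <- 5

# Median: the value with half of the area to its left
median_value <- qnorm(0.5, mean = mu, sd = sigma)

# Mode: the value where the density curve is highest
mode_value <- optimize(dnorm, interval = c(150, 200),
                       mean = mu, sd = sigma, maximum = TRUE)$maximum

# Symmetry: the curve has the same height 7 units below and above the mean
left_height  <- dnorm(mu - 7, mean = mu, sd = sigma)
right_height <- dnorm(mu + 7, mean = mu, sd = sigma)

# Total area under the curve (numerical integration over 120 to 220)
total_area <- integrate(dnorm, lower = 120, upper = 220,
                        mean = mu, sd = sigma)$value

c(median = median_value, mode = mode_value, area = total_area)
all.equal(left_height, right_height)
```

Up to small numerical error, the median and mode both equal the mean of 170, and the area equals 1.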

The curve is defined by two parameters: mean and standard deviation.

The mean determines the location of the center of the curve.

Code

# plot three normal density curves with different means
plot(0, 0, type = "n", xlim = c(-10, 10), ylim = c(0, 0.3), xlab = "", ylab = "")
curve(dnorm(x, mean = 0, sd = 1.5), from = -10, to = 10, col = "red", add = TRUE)
curve(dnorm(x, mean = 2, sd = 1.5), from = -10, to = 10, col = "blue", add = TRUE)
curve(dnorm(x, mean = -2, sd = 1.5), from = -10, to = 10, col = "green", add = TRUE)
legend("topright", legend = c("mean = 0", "mean = 2", "mean = -2"), col = c("red", "blue", "green"), lty = 1)
abline(v = 0, col = "red")
abline(v = -2, col = "green")
abline(v = 2, col = "blue")

The standard deviation determines the width of the curve.

Code

# plot three normal density curves with different standard deviations
plot(0, 0, type = "n", xlim = c(-10, 10), ylim = c(0, 0.85), xlab = "", ylab = "")
curve(dnorm(x, mean = 0, sd = 1), from = -10, to = 10, col = "red", add = TRUE)
curve(dnorm(x, mean = 0, sd = 2), from = -10, to = 10, col = "blue", add = TRUE)
curve(dnorm(x, mean = 0, sd = 0.5), from = -10, to = 10, col = "green", add = TRUE)
legend("topright", legend = c("sd = 1", "sd = 2", "sd = 0.5"), col = c("red", "blue", "green"), lty = 1)
abline(v = 0, col = "red")
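
Because the total area under each curve is always 1, a narrower curve must also be taller. A short sketch of this, using the same three standard deviations as in the plot: the height at the mean is 1 / (sd * sqrt(2 * pi)), which gives roughly 0.8, 0.4 and 0.2. That is why the plot above needs a y axis reaching 0.85.

```r
# The three standard deviations used in the plot
sds <- c(0.5, 1, 2)

# Height of each curve at its centre (the mean, here 0)
peak_height <- dnorm(0, mean = 0, sd = sds)
names(peak_height) <- paste0("sd = ", sds)
round(peak_height, 3)

# The same heights from the formula 1 / (sd * sqrt(2 * pi))
formula_height <- 1 / (sds * sqrt(2 * pi))
all.equal(unname(peak_height), formula_height)
```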

The x axis shows the values of the data; the y axis shows the probability density, which reflects how frequent values near each x are.
We can see that the closer a value is to the mean, the more frequent it is; the further a value is from the mean, the less frequent it is.

The PDF (probability density function) of the normal distribution is the curve itself.

Code

normTail(m = 170, # mean
         s = 5, # standard deviation
         curveColor  = "purple",
         axes = 3)
# add a vertical line to show the mean
abline(v = 170, col = "red")
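
For reference, the normal PDF with mean mu and standard deviation sigma is f(x) = 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2)). The sketch below writes that formula as an R function and compares it with the built-in dnorm(). The helper name normal_pdf and the chosen x values are illustrative, not part of any package.

```r
# The normal density written out by hand (illustrative helper)
normal_pdf <- function(x, mu, sigma) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

# A few x values around the mean of 170
x_values <- c(160, 165, 170, 175, 180)

manual  <- normal_pdf(x_values, mu = 170, sigma = 5)
builtin <- dnorm(x_values, mean = 170, sd = 5)

# Both give the same heights of the curve
data.frame(x = x_values, manual = manual, builtin = builtin)
all.equal(manual, builtin)
```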

The CDF (cumulative distribution function) of the normal distribution gives the area under the curve to the left of a given value.

Code

normTail(m = 170, # mean
         s = 5, # standard deviation
         M = c(120, 210), # shade the middle area between 120 and 210
         col = "purple",
         axes = 3)
# add a vertical line to show the mean
abline(v = 170, col = "red")
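
In R, the normal CDF is pnorm(). As a minimal sketch of the link between the CDF and the area under the curve, the code below compares pnorm() with a numerical integration of the density dnorm(). The lower limit of 120 is an assumption standing in for minus infinity; the area below it is negligible for this curve.

```r
# CDF at 175: the area under the curve to the left of 175
cdf_at_175 <- pnorm(175, mean = 170, sd = 5)

# The same area obtained by integrating the density up to 175
area_left_of_175 <- integrate(dnorm, lower = 120, upper = 175,
                              mean = 170, sd = 5)$value

c(cdf = cdf_at_175, area = area_left_of_175)

# At the mean, exactly half of the area lies to the left
cdf_at_mean <- pnorm(170, mean = 170, sd = 5)
cdf_at_mean

# The area between two values is a difference of two CDF values
area_165_to_175 <- pnorm(175, mean = 170, sd = 5) - pnorm(165, mean = 170, sd = 5)
area_165_to_175
```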

To Be Continued

In this lesson, we have learned what data analysis is, why the data distribution matters, why the normal distribution matters, and the features of the normal distribution.
In the next lesson, we will learn how to use R to calculate probabilities from the normal distribution, and more…

Thank you!