Lesson 9: Normal distribution ~ Part 2
Friday, January 12, 2024
In the last lesson, we learned what the normal distribution is. In this lesson, we will learn how to use the normal distribution to solve problems.
First, load the packages we will use in this lesson.
Unlike with discrete data, the y-axis of a continuous distribution does not represent probability.
Only the area under the curve describes the probability of continuous data.
We can use the CDF (cumulative distribution function) to calculate that probability.
\(F(x)\) represents the probability that the data is less than x, i.e., the area under the curve to the left of x.
For example, we can calculate the probability that the data is less than 165: \(F(165) = P(x<165) = ?\). In R, the pnorm() function computes this:
- The first argument is the value of the data.
- The second argument is the mean of the data.
- The third argument is the standard deviation of the data.
- This example shows that the probability that the data is less than 165 is 0.1587, i.e. 15.87%.
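A minimal sketch of such a call. The mean of 170 appears later in this lesson; the standard deviation of 5 is an assumption, chosen because it reproduces the 0.1587 quoted above.

```r
# Assumed parameters: mean 170 (from the lesson), sd 5 (inferred from the quoted result)
mu <- 170
sigma <- 5

# P(x < 165): area under the normal curve to the left of 165
p_less <- pnorm(165, mean = mu, sd = sigma)
print(round(p_less, 4))
# [1] 0.1587
```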
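The probability that the data is greater than 165 is the complement, \(P(x>165) = 1 - F(165)\). Computing it in two different ways gives:
[1] 0.8413447
[1] 0.8413447
No surprise, the two results are the same.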
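The two matching outputs are consistent with computing the upper-tail probability \(P(x>165)\) in two ways. The sketch below is one plausible pair of calls, not necessarily the original code; it assumes mean 170 and standard deviation 5 (the standard deviation is not stated in the text).

```r
# Assumed parameters: mean 170, sd 5
mu <- 170
sigma <- 5

# Way 1: subtract the lower-tail area from the total area of 1
p_complement <- 1 - pnorm(165, mean = mu, sd = sigma)

# Way 2: ask pnorm() for the upper tail directly
p_upper <- pnorm(165, mean = mu, sd = sigma, lower.tail = FALSE)

print(p_complement)
print(p_upper)
```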
We can also calculate the probability that the data is between 165 and 175: \(P(165<x<175) = ?\).
There are several ways to calculate it.
We can calculate the probability that the data is less than 175 first, then subtract the probability that the data is less than 165 from it: \(P(165<x<175) = P(x<175) - P(x<165) = F(175) - F(165)\).
That is, we use the area of the following purple part:
[1] 0.6826895
This example shows that the probability that the data is between 165 and 175 is 0.6827, i.e. 68.27%.
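The subtraction \(F(175) - F(165)\) can be sketched as follows, assuming mean 170 and standard deviation 5 (the standard deviation is inferred from the results, not stated in the text).

```r
# Assumed parameters: mean 170, sd 5
mu <- 170
sigma <- 5

# P(165 < x < 175) = F(175) - F(165)
p_between <- pnorm(175, mean = mu, sd = sigma) - pnorm(165, mean = mu, sd = sigma)
print(round(p_between, 4))
# [1] 0.6827
```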
Remember that the total area under the curve is 1, so the probability that the data is between 165 and 175 is 1 minus the probability that the data is less than 165 and minus the probability that it is greater than 175: \(P(165<x<175) = 1 - P(x<165) - P(x>175) = 1 - F(165) - (1 - F(175))\).
So we can subtract the following two parts from 1 to calculate it.
[1] 0.6826895
No surprise, the result is the same as above.
Remember that the normal distribution is symmetric around the mean, so the probability that the data is less than 165 is the same as the probability that the data is greater than 175, because 165 and 175 are the same distance from the mean of 170: \(P(165<x<175) = 1 - 2 \times P(x<165)\).
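The complement approach and the symmetry shortcut can be checked side by side. This is a sketch under the same assumption of mean 170 and standard deviation 5; the variable names are illustrative.

```r
# Assumed parameters: mean 170, sd 5
mu <- 170
sigma <- 5

lower_tail <- pnorm(165, mean = mu, sd = sigma)                      # P(x < 165)
upper_tail <- pnorm(175, mean = mu, sd = sigma, lower.tail = FALSE)  # P(x > 175)

# Complement: remove both tails from the total area of 1
p_complement <- 1 - lower_tail - upper_tail

# Symmetry: both tails are equal, so remove the lower tail twice
p_symmetry <- 1 - 2 * lower_tail

print(p_complement)
print(p_symmetry)
```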
These results agree with the 68-95-99.7 rule for the normal distribution:
- About 68% of the data is within 1 standard deviation of the mean.
- About 95% of the data is within 2 standard deviations of the mean.
- About 99.7% of the data is within 3 standard deviations of the mean.
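These percentages can be verified on the standard normal scale, where distances are already measured in standard deviations. A short sketch:

```r
# Area within k standard deviations of the mean, for k = 1, 2, 3
k <- 1:3
within_k <- pnorm(k) - pnorm(-k)
print(round(within_k, 4))
# [1] 0.6827 0.9545 0.9973
```

The rule rounds these areas to 68%, 95%, and 99.7%, so it gives approximations rather than exact probabilities.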
We can use the qnorm() function to find the critical value.
[1] 1741.926
- The first argument is the probability of the data.
- The second argument is the mean of the data.
- The third argument is the standard deviation of the data.
- This example shows that if the probability to the left of the data is 0.79, then the critical value of the data is 1742.
- lower.tail = FALSE means that the given probability is the area to the right of the value we want to find.
- This example shows that if the probability to the right of the data is 0.37, then the value of the data is 1600.
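The two lookups described in the list can be sketched as below. The mean of 1500 and standard deviation of 300 are assumptions: they are not stated at this point in the text, but they reproduce the quoted values of about 1742 and 1600.

```r
# Assumed parameters (inferred from the quoted results): mean 1500, sd 300
mu <- 1500
sigma <- 300

# Value with probability 0.79 to its left
q_left <- qnorm(0.79, mean = mu, sd = sigma)

# Value with probability 0.37 to its right
q_right <- qnorm(0.37, mean = mu, sd = sigma, lower.tail = FALSE)

print(q_left)
print(q_right)
```

qnorm() is the inverse of pnorm(): feeding q_left back into pnorm() with the same mean and standard deviation returns 0.79.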
We can also calculate several critical values at once if we know the probability to the left of each one.
For example, we can calculate the values that have probability 0.2 and 0.75 to their left.
[1] 1247.514 1702.347
- This example shows that if the probability to the left of the data is 0.2, then the left critical value of the data is 1248; if the probability to the left of the data is 0.75, then the right critical value of the data is 1702.
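Since qnorm() is vectorized, both critical values can come from one call. A sketch, again assuming mean 1500 and standard deviation 300 (values that reproduce the output above but are not stated in the text):

```r
# Assumed parameters (inferred from the output): mean 1500, sd 300
mu <- 1500
sigma <- 300

# Critical values with probability 0.2 and 0.75 to their left
q_both <- qnorm(c(0.2, 0.75), mean = mu, sd = sigma)
print(q_both)

# Probability between the two critical values: 0.75 - 0.2
p_middle <- diff(pnorm(q_both, mean = mu, sd = sigma))
print(p_middle)
```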
We can use the rnorm() function to generate normally distributed data.
[1] 0.95130004 0.12061804 -0.42174915 0.09952709 0.79155243 -1.26001000
The first argument is the number of data points we want to generate.
The second argument is the mean of the data.
The third argument is the standard deviation of the data.
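A sketch of a call with these three arguments. The sample size of 6, mean 0, and standard deviation 1 are assumptions based on the shape of the output shown; the seed is arbitrary and only there for reproducibility, so the numbers will not match those above.

```r
# Arbitrary seed so that repeated runs give the same draws
set.seed(42)

# Six draws from a normal distribution (assumed: mean 0, sd 1)
x <- rnorm(6, mean = 0, sd = 1)
print(x)
```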
We can use the hist() function to plot the data.
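A minimal plotting sketch. The sample size, parameters, number of breaks, and labels are all illustrative choices, not taken from the lesson.

```r
# Arbitrary seed so the figure is reproducible
set.seed(42)

# Simulate 1000 values (assumed: mean 170, sd 5) and draw a histogram
heights <- rnorm(1000, mean = 170, sd = 5)
h <- hist(heights, breaks = 30,
          main = "Simulated normal data", xlab = "Value")
```

With a sample this large, the histogram should show the familiar bell shape centered near the chosen mean.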
Before computers were available, people had to calculate the CDF by hand.
However, the PDF and CDF of the normal distribution are quite complicated. Calculating them by hand is hard, time-consuming, and error-prone.
To simplify the calculation, we can standardize the normal distribution.
The standard normal distribution has the PDF:
\[ f(x) = {1 \over \sqrt{2\pi}} e^{-{x^2 \over 2}} \]
and the CDF:
\[ \Phi(x) = \int_{-\infty}^x f(t) dt = \int_{-\infty}^x {1 \over \sqrt{2\pi}} e^{-{t^2 \over 2}} dt \]
This is the general normal PDF
\[ f(x) = {1 \over \sqrt{2\pi\sigma^2}} e^{-{(x-\mu)^2 \over 2\sigma^2}} \]
with
\[ \mu = 0 \] \[ \sigma = 1 \]
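Standardizing means converting a value to \(z = (x - \mu)/\sigma\), after which the standard normal CDF gives the same probability as the original distribution. A sketch using mean 170 and an assumed standard deviation of 5:

```r
# Assumed parameters: mean 170, sd 5
mu <- 170
sigma <- 5
x <- 165

# Standardize: distance from the mean in units of standard deviations
z <- (x - mu) / sigma

# The same probability, computed on the original and on the standard scale
p_direct <- pnorm(x, mean = mu, sd = sigma)
p_standard <- pnorm(z)

print(z)
print(c(p_direct, p_standard))
```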
We can use the standard normal distribution to compare two values that come from distributions with different means and standard deviations.
For example, suppose we have two exam scores: a is 2000 on an exam with mean 1500 and standard deviation 300, and b is 89 on an exam with mean 75 and standard deviation 5. We want to know which score is better.
We can standardize the two scores first with \(z = (x - \mu)/\sigma\), then compare their z-scores.
We can see that the z-score of 2000 is 1.67, and the z-score of 89 is 2.8. So, relative to its own exam, the score of 89 is better than the score of 2000.
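The comparison can be sketched with a small helper; z_score() is a hypothetical function written for this illustration, not part of base R.

```r
# Hypothetical helper: number of standard deviations x lies from the mean
z_score <- function(x, mu, sigma) (x - mu) / sigma

z_a <- z_score(2000, mu = 1500, sigma = 300)  # score a
z_b <- z_score(89, mu = 75, sigma = 5)        # score b

print(round(c(a = z_a, b = z_b), 2))

# The larger z-score is the better result relative to its own exam
print(z_b > z_a)
```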
We can also use the pnorm() function to calculate the probability of scoring below each of the two scores.
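A sketch of that calculation, using the means and standard deviations given for the two exams:

```r
# P(score < 2000) on exam a and P(score < 89) on exam b
p_a <- pnorm(2000, mean = 1500, sd = 300)
p_b <- pnorm(89, mean = 75, sd = 5)

print(round(c(a = p_a, b = p_b), 4))
```

The score with the larger probability sits at the higher percentile, which leads to the same ranking as comparing z-scores.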
Suppose the heights of students in a school follow a normal distribution with a mean of 65 inches and a standard deviation of 4 inches.
Determine the probability that a randomly selected student has a height higher than 69 inches (\(P(height > 69) = ?\)) without resorting to the Z-score table. Consider using the 68-95-99.7 rule, which provides approximate probabilities based on standard deviations in a normal distribution.
What is the probability that a randomly selected student is shorter than 71 inches (\(P(height<71)=?\))?
Find the probability that a student is between 60 inches and 70 inches tall (\(P(60<height<70)=?\)).
Calculate the Z-score for a student who is 68 inches tall (\(Z=?\)).
Determine the height that corresponds to the 80th percentile (\(P(height<?)=0.8\)).
Imagine a country that uses centimeters to measure height. In a school within this country, student heights follow a normal distribution with a mean of 170 cm and a standard deviation of 6 cm. Given that a student from this school measures 178 cm, how does this height compare in percentile to a student from the previous example who measured 72 inches? Which student occupies the higher percentile rank in their school?
Thank you!