Confidence Interval
- It is reported in 2 parts: Confidence level (e.g. 95%) and Interval (e.g. 0.2 ± 0.01)
- It is used to infer info on the population parameter using a sample of the population.
- A 95% CI means that if we took 100 samples and constructed a confidence interval from each one, about 95 of those intervals would contain the population parameter. In this sense, we are 95% confident that the population parameter lies within the CI.
- To construct a CI for population proportion (for discrete variables): $\hat{p} \pm z^* \sqrt{\hat{p}(1 - \hat{p}) / n}$
- $\hat{p}$: sample proportion
- $z^*$: z-value from standard normal distribution (depends on confidence level)
- $n$: sample size
- To construct a CI for population mean: $\bar{x} \pm t^* \cdot s / \sqrt{n}$
- $\bar{x}$: sample mean
- $t^*$: t-value from t-distribution (depends on $n$ and confidence level)
- $s$: sample standard deviation
- $n$: sample size
- Properties
- When a smaller sample is taken, the CI will be wider
- At a lower confidence level, the CI will be narrower
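The two formulas and the properties above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names (`proportion_ci`, `mean_ci`, `width`, `coverage`) and all input numbers are illustrative assumptions, not part of these notes. Because the standard library has no t-distribution, `mean_ci` assumes the t-value has already been looked up elsewhere.

```python
import math
import random
from statistics import NormalDist


def proportion_ci(p_hat, n, confidence=0.95):
    """Return (lower, upper) for p_hat ± z* · sqrt(p_hat(1 - p_hat)/n)."""
    # Two-sided z-value, e.g. about 1.96 for a 95% confidence level.
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def mean_ci(x_bar, s, n, t_star):
    """Return (lower, upper) for x_bar ± t* · s/sqrt(n).

    t_star is assumed to come from a t-table or software
    (it depends on n and the confidence level).
    """
    margin = t_star * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin


def width(interval):
    """Length of an interval given as (lower, upper)."""
    return interval[1] - interval[0]


def coverage(true_p=0.3, n=200, trials=2000, confidence=0.95, seed=0):
    """Fraction of simulated samples whose CI contains the true proportion."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        successes = sum(rng.random() < true_p for _ in range(n))
        low, high = proportion_ci(successes / n, n, confidence)
        hits += low <= true_p <= high  # True counts as 1
    return hits / trials


# Smaller sample -> wider interval.
print(width(proportion_ci(0.2, 100)), width(proportion_ci(0.2, 1000)))
# Lower confidence level -> narrower interval.
print(width(proportion_ci(0.2, 1000, 0.90)), width(proportion_ci(0.2, 1000, 0.95)))
# Repeated sampling: the fraction of intervals covering the true value
# should come out near the confidence level.
print(coverage())
```

The `coverage` simulation mirrors the "out of 100 samples" interpretation: it draws many samples from a population with a known proportion and counts how often the interval contains it.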
Hypothesis Testing
Used to decide if the data from a random sample is sufficient to support a hypothesis about the population
- State (mutually exclusive):
- Null Hypothesis (the default claim, assumed true unless the sample gives strong evidence against it)
- Alternate Hypothesis
- Collect sample data
- State the level of significance (typically 0.05, 0.01 or 0.1). Lower value: harder to reject null hypothesis
- Calculate p-value (probability of getting a result at least as extreme as the one observed, assuming the null hypothesis is true)
- Compare with level of significance
- p value < significance level: null hypothesis rejected, accept alternate hypothesis
- p value > significance level: not enough information to reject null hypothesis
Example
A coin manufacturer claims they produced a biased coin with P(H) = 0.3. When testing the coin 8 times, there were 7 heads (H) and 1 tail (T).
- Hypotheses
- Null Hypothesis: Coin is as claimed, i.e. P(H) = 0.3
- Alternate Hypothesis: P(H) > 0.3 (one-tailed test; a two-tailed test would use P(H) ≠ 0.3)
- Let the level of significance be 0.1
- P-value = P(observation | null is true) + P(equally extreme outcomes | null is true) + P(outcomes that are even more extreme | null is true). Note: "at least as extreme" means at least as favourable to the alternate hypothesis
- P(HHHHHHHT | P(H) = 0.3) = (0.3)^7 x 0.7
- P(other combinations of 7H 1T | P(H) = 0.3) = 7(0.3)^7 x 0.7
- P(HHHHHHHH | P(H) = 0.3) = (0.3)^8
- P-value = 8(0.3)^7 x 0.7 + (0.3)^8 ≈ 0.0013
- Since p-value (≈ 0.0013) < level of significance (0.1), we can reject the null hypothesis and conclude the alternate hypothesis
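The coin example can be reproduced with a few lines of Python. This is a sketch under the example's own assumptions (8 tosses, 7 heads, null P(H) = 0.3, one-tailed test, significance level 0.1); the function name `binomial_tail_p_value` is hypothetical.

```python
from math import comb


def binomial_tail_p_value(heads, tosses, p_null):
    """One-tailed p-value: P(X >= heads) for X ~ Binomial(tosses, p_null)."""
    return sum(
        comb(tosses, k) * p_null**k * (1 - p_null) ** (tosses - k)
        for k in range(heads, tosses + 1)
    )


alpha = 0.1  # level of significance from the example
p_value = binomial_tail_p_value(heads=7, tosses=8, p_null=0.3)

print(f"p-value = {p_value:.5f}")  # p-value = 0.00129
print("reject null" if p_value < alpha else "cannot reject null")
```

The sum runs over 7 heads (8 arrangements, each with probability 0.3^7 x 0.7) and 8 heads (0.3^8), which are exactly the outcomes at least as favourable to P(H) > 0.3 as the observation.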
Chi-sq test
Hypothesis testing for whether two variables are associated
Example
- Null Hypothesis: Smoking is not associated with heart disease: rate(HD|S) = rate(HD|NS) = rate(HD)
- Alternate Hypothesis: Smoking is associated with heart disease: rate(HD|S) ≠ rate(HD|NS)
Assume Null Hypothesis is true, calculate rate(HD).
Construct table for observation:
| | Heart Disease | No Heart Disease | Row Total |
|---|---|---|---|
| Smoker | 38 | 14962 | 15000 |
| Non Smoker | 44 | 84956 | 85000 |
| Col Total | 82 | 99918 | 100000 |

rate(HD) = 82/100000

Then, draw the table for the expected outcome (null hypothesis) and compare. Each expected count is the row total multiplied by the overall rate, e.g. 15000 x 82/100000 = 12.3 for smokers with heart disease:

| | Heart Disease | No Heart Disease | Row Total |
|---|---|---|---|
| Smoker | 12.3 | 14987.7 | 15000 |
| Non Smoker | 69.7 | 84930.3 | 85000 |
| Col Total | 82 | 99918 | 100000 |

The p-value is low if there is a big difference between expectation and observation.
- If p-value < level of significance, we can reject the null hypothesis and conclude the alternate hypothesis
- If p-value > level of significance, we cannot reject the null hypothesis, and therefore cannot conclude the alternate hypothesis (but we also cannot conclude that the null hypothesis is true)

Note: Don't need to know how to calculate p value, just use software
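As one way of "using software", the sketch below rebuilds the expected table from the observed counts and computes the chi-squared statistic with the Python standard library. The function names are hypothetical, and the p-value shortcut `erfc(sqrt(x / 2))` is an assumption that holds only for a 2 x 2 table (1 degree of freedom); for larger tables a statistics library would be needed.

```python
import math

# Observed counts: rows = Smoker, Non Smoker; columns = HD, No HD.
observed = [[38, 14962], [44, 84956]]


def expected_counts(table):
    """Expected counts under the null: row total x column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def chi_square_statistic(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for row_o, row_e in zip(table, expected)
        for o, e in zip(row_o, row_e)
    )


def p_value_df1(chi2):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


chi2 = chi_square_statistic(observed)
p_value = p_value_df1(chi2)
print(expected_counts(observed))
print(f"chi-squared = {chi2:.2f}, p-value = {p_value:.3g}")
```

The large gap between observed and expected counts (38 vs 12.3 smokers with heart disease) gives a large statistic and therefore a very small p-value, so the null hypothesis of no association would be rejected at any of the usual significance levels. Library routines may report a slightly different statistic if they apply a continuity correction for 2 x 2 tables.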