layout(matrix(c(2:6,1,1,7:8,1,1,9:13), 4, 4))
n = 56 # Sample size
df = n - 1 # Degrees of freedom
mu = 100
sigma = 15
IQ = seq(mu-45, mu+45, 1)
par(mar=c(4,2,2,0))
plot(IQ, dnorm(IQ, mean = mu, sd = sigma), type='l', col="red", main = "Population Distribution")
n.samples = 12
for(i in 1:n.samples) {
par(mar=c(2,2,2,0))
hist(rnorm(n, mu, sigma), main="Sample Distribution", cex.axis=.5, col="beige", cex.main = .75)
}
T-distribution & Multiple regression
T-distribution
Gosset
In probability and statistics, Student’s t-distribution (or simply the t-distribution) is any member of a family of continuous probability distributions that arises when estimating the mean of a normally distributed population in situations where the sample size is small and population standard deviation is unknown.
In the English-language literature it takes its name from William Sealy Gosset’s 1908 paper in Biometrika under the pseudonym “Student”. Gosset worked at the Guinness Brewery in Dublin, Ireland, and was interested in the problems of small samples, for example the chemical properties of barley where sample sizes might be as low as 3.
Source: Wikipedia
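To see Gosset's problem in action, here is a minimal sketch (not part of the original slides; n.small, x.small and t.small are illustrative names) that draws many samples of size 3 from the normal population defined above and computes the t-statistic for each. The proportion of |t| values beyond 1.96 is clearly larger than the 5% a standard normal reference would suggest.
# Sketch: t-statistics for Gosset-sized samples (n = 3) from the population above
n.small = 3
t.small = replicate(10000, {
  x.small = rnorm(n.small, mu, sigma)
  (mean(x.small) - mu) / (sd(x.small) / sqrt(n.small))
})
mean(abs(t.small) > 1.96)   # well above .05, so the normal reference is too optimistic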
Population distribution
A sample
Let’s take one sample from our normal population and calculate the t-value.
x = rnorm(n, mu, sigma); x
[1] 105.26728 107.64428 89.20503 122.52820 94.38646 93.14793 97.46753
[8] 102.32044 95.22961 85.16793 119.09532 109.16959 121.38975 78.67579
[15] 102.83598 116.32336 111.50476 80.53403 102.47326 70.75958 110.01741
[22] 116.99447 99.04687 99.23672 106.50329 114.56017 101.26652 99.49530
[29] 126.30077 119.15345 107.73616 95.45575 98.62924 90.74828 106.73112
[36] 122.59579 85.17341 116.93565 112.32223 88.31945 97.21278 99.20346
[43] 103.08937 106.80783 97.78413 109.91006 119.49368 103.43489 101.54717
[50] 109.94048 120.74371 73.70947 112.29974 82.29033 122.76944 85.64376
hist(x, main = "Sample distribution", col = "beige", breaks = 15)
text(80, 10, round(mean(x),2))
More samples
Let’s take more samples.
n.samples = 1000
mean.x.values = vector()
se.x.values = vector()

for(i in 1:n.samples) {
  x = rnorm(n, mu, sigma)
  mean.x.values[i] = mean(x)
  se.x.values[i] = (sd(x) / sqrt(n))
}
Mean and SE for all samples
head(cbind(mean.x.values, se.x.values))
mean.x.values se.x.values
[1,] 101.44109 2.064637
[2,] 96.85671 1.926104
[3,] 100.40150 1.520093
[4,] 101.90844 1.954261
[5,] 97.94197 2.098088
[6,] 101.77852 1.946505
Sampling distribution
Of the mean
hist(mean.x.values,
col = "beige",
main = "Sampling distribution",
xlab = "all sample means")
T-statistic
\[T_{n-1} = \frac{\bar{x}-\mu}{SE_x} = \frac{\bar{x}-\mu}{s_x / \sqrt{n}}\]
So the t-statistic represents the deviation of the sample mean \(\bar{x}\) from the population mean \(\mu\), expressed in standard errors, where the sample size enters through the degrees of freedom \(df = n - 1\).
t-value
\[T_{n-1} = \frac{\bar{x}-\mu}{SE_x} = \frac{\bar{x}-\mu}{s_x / \sqrt{n}}\]
t = (mean(x) - mu) / (sd(x) / sqrt(n))
t
[1] -0.2754441
Calculate t-values
\[T_{n-1} = \frac{\bar{x}-\mu}{SE_x} = \frac{\bar{x}-\mu}{s_x / \sqrt{n}}\]
t.values = (mean.x.values - mu) / se.x.values
tail(cbind(mean.x.values, mu, se.x.values, t.values))
mean.x.values mu se.x.values t.values
[995,] 96.75512 100 2.124674 -1.5272357
[996,] 100.74668 100 2.281768 0.3272384
[997,] 97.14992 100 2.107166 -1.3525681
[998,] 101.75692 100 2.146641 0.8184510
[999,] 96.56406 100 1.997002 -1.7205502
[1000,] 99.41413 100 2.126993 -0.2754441
Sampled t-values
What is the distribution of all these t-values?
hist(t.values,
freq = F,
main = "Sampled T-values",
xlab = "T-values",
col = "beige",
ylim = c(0, .4))
T = seq(-4, 4, .01)
lines(T, dt(T, df), col = "red")
legend("topright", lty = 1, col="red", legend = "T-distribution")
T-distribution
So if the population is normally distributed (assumption of normality), the t-distribution represents the deviation of sample means from the population mean (\(\mu\)), given a certain sample size (\(df = n - 1\)).
The t-distribution therefore differs for different sample sizes and converges to a standard normal distribution when the sample size is large enough.
The t-distribution is defined by:
\[\textstyle\frac{\Gamma \left(\frac{\nu+1}{2} \right)} {\sqrt{\nu\pi}\,\Gamma \left(\frac{\nu}{2} \right)} \left(1+\frac{x^2}{\nu} \right)^{-\frac{\nu+1}{2}}\!\]
where \(\nu\) is the number of degrees of freedom and \(\Gamma\) is the gamma function.
Source: Wikipedia
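As a quick sketch (not in the original slides), the density formula above can be typed in directly with R's gamma() function and checked against the built-in dt(); dt.manual and x.grid are illustrative names.
# Sketch: the t density written out with the gamma function, compared to dt()
dt.manual = function(x, nu) {
  gamma((nu + 1) / 2) / (sqrt(nu * pi) * gamma(nu / 2)) * (1 + x^2 / nu)^(-(nu + 1) / 2)
}
x.grid = seq(-4, 4, .01)
all.equal(dt.manual(x.grid, df), dt(x.grid, df))   # TRUE: both give the same density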
One or two sided
Two sided
- \(H_A: \bar{x} \neq \mu\)
One sided
- \(H_A: \bar{x} > \mu\)
- \(H_A: \bar{x} < \mu\)
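A small sketch (using the df defined earlier and \(\alpha = .05\); the alpha variable is only for illustration) of how the choice of alternative hypothesis determines the critical t-value(s):
alpha = .05
qt(c(alpha/2, 1 - alpha/2), df)   # two sided: alpha split over both tails
qt(1 - alpha, df)                 # one sided, H_A: mean > mu (upper tail)
qt(alpha, df)                     # one sided, H_A: mean < mu (lower tail)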
Effect-size
The effect-size is the standardised difference between the mean and the expected \(\mu\). In the t-test, the effect-size is expressed as \(r\).
\[r = \sqrt{\frac{t^2}{t^2 + \text{df}}}\]
r = sqrt(t^2/(t^2 + df))
r
[1] 0.2603778
Effect-sizes
We can also calculate effect-sizes for all our calculated t-values. Under the assumption of \(H_0\) the effect-size distribution looks like this.
r = sqrt(t.values^2/(t.values^2 + df))
tail(cbind(mean.x.values, mu, se.x.values, t.values, r))
mean.x.values mu se.x.values t.values r
[995,] 96.75512 100 2.124674 -1.5272357 0.20169997
[996,] 100.74668 100 2.281768 0.3272384 0.04408192
[997,] 97.14992 100 2.107166 -1.3525681 0.17942066
[998,] 101.75692 100 2.146641 0.8184510 0.10969394
[999,] 96.56406 100 1.997002 -1.7205502 0.22599668
[1000,] 99.41413 100 2.126993 -0.2754441 0.03711529
Effect-size distribution
Cohen (1988)
- Small: \(0 \leq r \leq .1\)
- Medium: \(.1 < r \leq .3\)
- Large: \(.3 < r \leq .5\)
Power
- Strive for 80%
- Based on known effect size
- Calculate number of subjects needed
- Use G*Power to calculate
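G*Power is a stand-alone program; as a sketch of the same calculation in R, power.t.test() from base R gives the required sample size (here assuming, for illustration only, a one-sample two-sided t-test with an effect of .5 standard deviations, \(\alpha = .05\) and 80% power):
# Sketch: sample size needed for 80% power (effect of .5 SD, alpha = .05)
power.t.test(delta = .5, sd = 1, sig.level = .05, power = .80,
             type = "one.sample", alternative = "two.sided")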
Alpha Power
T = seq(-3, 6, .01)
N = 45
E = 2
# Set plot
plot(0,0,
type = "n",
ylab = "Density",
xlab = "T",
ylim = c(0,.5),
xlim = c(-3,6),
main = "T-Distributions under H0 and HA")
critical_t = qt(.05, N-1, lower.tail = FALSE)
# Alpha
range_x = seq(critical_t, 6, .01)
polygon(c(range_x, rev(range_x)),
c(range_x*0,rev(dt(range_x,N-1,ncp=0))),
col = "grey",
density = 10,
angle = 90,
lwd = 2)
# Power
range_x = seq(critical_t, 6, .01)
polygon(c(range_x, rev(range_x)),
c(range_x*0,rev(dt(range_x,N-1,ncp=E))),
col = "grey",
density = 10,
angle = 45,
lwd = 2)
lines(T,dt(T,N-1,ncp=0),col="red", lwd=2) # H0 line
lines(T,dt(T,N-1,ncp=E),col="blue",lwd=2) # HA line
# Critical value
lines(rep(critical_t,2),c(0,dt(critical_t,N-1,ncp=E)),lwd=2,col="black")
text(critical_t,dt(critical_t,N-1,ncp=E),"critical T-value",pos=2, srt = 90)
# H0 and HA
text(0,dt(0,N-1,ncp=0),expression(H[0]),pos=3,col="red", cex=2)
text(E,dt(E,N-1,ncp=E),expression(H[A]),pos=3,col="blue",cex=2)
# Mu H0 line
lines(c(0,0),c(0,dt(0,N-1)), col="red", lwd=2,lty=2)
text(0,dt(0,N-1,ncp=0)/2,expression(mu),pos=4,cex=1.2)
# Mu HA line
lines(c(E,E),c(0,dt(E,N-1,ncp=E)),col="blue",lwd=2,lty=2)
text(E,dt(0,N-1,ncp=0)/2,expression(paste(mu)),pos=4,cex=1.2)
# t-value
lines( c(critical_t+.01,6),c(0,0),col="green",lwd=4)
# Legend
legend("topright", c(expression(alpha),'POWER'),density=c(10,10),angle=c(90,45))
Multiple regression
Multiple regression
\[\LARGE{\text{outcome} = \text{model} + \text{error}}\]
In statistics, linear regression is a linear approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables denoted X.
\[\LARGE{Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dotso + \beta_n X_{ni} + \epsilon_i}\]
In linear regression, the relationships are modeled using linear predictor functions whose unknown model parameters \(\beta\)’s are estimated from the data.
Source: Wikipedia
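To make "estimated from the data" concrete, here is a small sketch with simulated data (x1, x2, y and X are illustrative names, not part of the original slides): the least-squares estimates follow from the normal equations \((X'X)^{-1}X'y\) and match what lm() returns.
# Sketch: least-squares estimation of the beta's, by hand and with lm()
x1 = rnorm(50)
x2 = rnorm(50)
y  = 1 + 2 * x1 - 3 * x2 + rnorm(50)   # simulated outcome
X  = cbind(1, x1, x2)                  # design matrix with intercept column
beta.hat = solve(t(X) %*% X) %*% t(X) %*% y
cbind(beta.hat, coef(lm(y ~ x1 + x2))) # both columns give the same estimates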
Outcome vs Model
error = c(2, 1, .5, .1)
n = 100
layout(matrix(1:4,1,4))
for(e in error) {
  x = rnorm(n)
  y = x + rnorm(n, 0, e)

  r   = round(cor(x, y), 2)
  r.2 = round(r^2, 2)

  plot(x, y, las = 1, ylab = "outcome", xlab = "model", main = paste("r =", r, " r2 = ", r.2), ylim = c(-2, 2), xlim = c(-2, 2))
  fit <- lm(y ~ x)
  abline(fit, col = "red")
}
Assumptions
A selection from Field (8.3.2.1. Assumptions of the linear model):
For simple regression
- Sensitivity
- Homoscedasticity
Plus, for multiple regression:
- Multicollinearity
- Linearity
Sensitivity
Outliers
- Extreme residuals
- Cook’s distance (< 1)
- Mahalanobis distance (< 11 at N = 30)
- Leverage (the average leverage value is defined as (k + 1)/n)
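A sketch of how these diagnostics can be obtained in R for any fitted lm object (illustrated with the built-in mtcars data, since the example data is only loaded further on; the cut-offs are the rules of thumb listed above):
# Sketch: outlier and influence diagnostics for a fitted regression model
fit.example = lm(mpg ~ wt + hp, data = mtcars)
head(rstandard(fit.example))           # standardized residuals, extreme residuals stand out
head(cooks.distance(fit.example))      # Cook's distance, flag values > 1
head(hatvalues(fit.example))           # leverage, compare to (k + 1)/n
(2 + 1) / nrow(mtcars)                 # average leverage with k = 2 predictors
predictors = mtcars[, c("wt", "hp")]   # Mahalanobis distance to the predictor centroid
head(mahalanobis(predictors, colMeans(predictors), cov(predictors)))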
Homoscedasticity
- Variance of residual should be equal across all expected values
- Look at a scatterplot of the standardized predicted values \(\times\) the standardized residuals. A roughly round (evenly spread) shape is needed.
Code
set.seed(27364)
fit <- lm(x[2:n] ~ y[2:n])
ZPRED  = scale(fit$fitted.values)  # standardized predicted values
ZRESID = scale(fit$residuals)      # standardized residuals
plot(ZPRED, ZRESID)
abline(h = 0, v = 0, lwd=2)
#install.packages("plotrix")
if(!"plotrix" %in% installed.packages()) { install.packages("plotrix") };
library("plotrix")
draw.circle(0,0,1.7,col=rgb(1,0,0,.5))
Multicollinearity
To adhere to the multicollinearity assumption, the linear relation between the predictor variables must not be too strong.
This can be assessed through:
- Correlations
- Matrix scatterplot
- Tolerance > 0.2
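Tolerance can be computed by hand: regress each predictor on the other predictor(s) and take \(1 - R^2\). A sketch (again with the built-in mtcars data as stand-in, since the example data follows later):
# Sketch: tolerance = 1 - R^2 of regressing a predictor on the remaining predictors
tolerance.wt = 1 - summary(lm(wt ~ hp, data = mtcars))$r.squared
tolerance.hp = 1 - summary(lm(hp ~ wt, data = mtcars))$r.squared
c(wt = tolerance.wt, hp = tolerance.hp)   # values below 0.2 indicate a problem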
Linearity
For the linearity assumption to hold, the predictors must have a linear relation to the outcome variable.
This can be checked through:
- Correlations
- Matrix scatterplot with predictors and outcome variable
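Both checks take one line each in R; a sketch, again with mtcars as stand-in data:
# Sketch: linearity checks with a correlation matrix and a matrix scatterplot
cor(mtcars[, c("mpg", "wt", "hp")])     # correlations of outcome and predictors
pairs(mtcars[, c("mpg", "wt", "hp")])   # matrix scatterplot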
Example
Predict study outcome based on IQ and motivation.
Read data
# data <- read.csv('IQ.csv', header=T)
data <- read.csv('http://shklinkenberg.github.io/statistics-lectures/topics/regression_multiple_predictors/IQ.csv', header=TRUE)
head(data)
Studieprestatie Motivatie IQ
1 2.710330 3.276778 129.9922
2 2.922617 2.598901 128.4936
3 1.997056 3.207279 130.2709
4 2.322539 2.104968 125.7457
5 2.162133 3.264948 128.6770
6 2.278899 2.217771 127.5349
IQ            = data$IQ
study.outcome = data$Studieprestatie
motivation    = data$Motivatie
Regression model in R
Predict study outcome based on IQ and motivation.
fit <- lm(study.outcome ~ IQ + motivation)
What is the model
fit$coefficients
(Intercept) IQ motivation
-30.2822189 0.2690984 -0.6314253
b.0 = round(fit$coefficients[1], 2) ## Intercept
b.1 = round(fit$coefficients[2], 2) ## Beta coefficient for IQ
b.2 = round(fit$coefficients[3], 2) ## Beta coefficient for motivation
The beta coefficients are:
- \(b_0\) (intercept) = -30.28
- \(b_1\) = 0.27
- \(b_2\) = -0.63.
Model summaries
summary(fit)
Call:
lm(formula = study.outcome ~ IQ + motivation)
Residuals:
Min 1Q Median 3Q Max
-0.75127 -0.17437 0.00805 0.17230 0.44435
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -30.28222 6.64432 -4.558 5.76e-05 ***
IQ 0.26910 0.05656 4.758 3.14e-05 ***
motivation -0.63143 0.23143 -2.728 0.00978 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2714 on 36 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.739
F-statistic: 54.8 on 2 and 36 DF, p-value: 1.192e-11
Visual
(3D visualisation of the regression model, made with the rgl package.)
What are the expected values based on this model?
\[\widehat{\text{study outcome}} = b_0 + b_1 \text{IQ} + b_2 \text{motivation}\]
exp.stu.prest = b.0 + b.1 * IQ + b.2 * motivation

model = exp.stu.prest
\[\text{model} = \widehat{\text{study outcome}}\]
Apply regression model
\[\widehat{\text{study outcome}} = b_0 + b_1 \text{IQ} + b_2 \text{motivation}\] \[\widehat{\text{model}} = b_0 + b_1 \text{IQ} + b_2 \text{motivation}\]
cbind(model, b.0, b.1, IQ, b.2, motivation)[1:5,]
model b.0 b.1 IQ b.2 motivation
[1,] 2.753512 -30.28 0.27 129.9922 -0.63 3.276778
[2,] 2.775969 -30.28 0.27 128.4936 -0.63 2.598901
[3,] 2.872561 -30.28 0.27 130.2709 -0.63 3.207279
[4,] 2.345205 -30.28 0.27 125.7457 -0.63 2.104968
[5,] 2.405860 -30.28 0.27 128.6770 -0.63 3.264948
\[\widehat{\text{model}} = -30.28 + 0.27 \times \text{IQ} + -0.63 \times \text{motivation}\]
How far are we off?
error = study.outcome - model
cbind(model, study.outcome, error)[1:5,]
model study.outcome error
[1,] 2.753512 2.710330 -0.04318159
[2,] 2.775969 2.922617 0.14664823
[3,] 2.872561 1.997056 -0.87550534
[4,] 2.345205 2.322539 -0.02266610
[5,] 2.405860 2.162133 -0.24372667
Outcome = Model + Error
Is that true?
study.outcome == model + error
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
- Yes!
Visual
plot(study.outcome, xlab = 'persons', ylab = 'study.outcome')
n = length(study.outcome)
gemiddelde.study.outcome = mean(study.outcome)
## Add the mean
lines(c(1, n), rep(gemiddelde.study.outcome, 2))
## What is the total variance?
segments(1:n, study.outcome, 1:n, gemiddelde.study.outcome, col='blue')
## What are our expected scores based on this regression model?
points(model, col='orange')
## How far are we off, what is the error?
segments(1:n, study.outcome, 1:n, model, col='purple', lwd=3)
Explained variance
The explained variance is the deviation of the model's estimated outcome values from the overall mean.
To get a percentage of explained variance, it must be compared to the total variance. In terms of squares:
\[\frac{{SS}_{model}}{{SS}_{total}}\]
We also call this \(r^2\) or \(R^2\).
r = cor(study.outcome, model)
r^2
[1] 0.7527463
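The same proportion follows directly from the sums of squares; a quick check (not in the original slides) using the unrounded fitted values in fit, since the model vector above was built with rounded coefficients:
SS.model = sum((fit$fitted.values - mean(study.outcome))^2)  # model sum of squares
SS.total = sum((study.outcome - mean(study.outcome))^2)      # total sum of squares
SS.model / SS.total                                          # equals R-squared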
Compare models
fit1 <- lm(study.outcome ~ motivation, data)
fit2 <- lm(study.outcome ~ motivation + IQ, data)
# summary(fit1)
# summary(fit2)
anova(fit1, fit2)
Analysis of Variance Table
Model 1: study.outcome ~ motivation
Model 2: study.outcome ~ motivation + IQ
Res.Df RSS Df Sum of Sq F Pr(>F)
1 37 4.3191
2 36 2.6518 1 1.6673 22.635 3.144e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1