layout(matrix(c(2:6,1,1,7:8,1,1,9:13), 4, 4))
= 56 # Sample size
n = n - 1 # Degrees of freedom
df
= 120
mu = 15
sigma
= seq(mu-45, mu+45, 1)
IQ
par(mar=c(4,2,0,0))
plot(IQ, dnorm(IQ, mean = mu, sd = sigma), type='l', col="red")
= 12
n.samples
for(i in 1:n.samples) {
par(mar=c(2,2,0,0))
hist(rnorm(n, mu, sigma), main="", cex.axis=.5, col="red")
}
F-distribution & Non-parametric tests
F-distribution
Ronald Fisher
The F-distribution, also known as Snedecor’s F distribution or the Fisher–Snedecor distribution (after Ronald Fisher and George W. Snedecor) is, in probability theory and statistics, a continuous probability distribution. The F-distribution arises frequently as the null distribution of a test statistic, most notably in the analysis of variance; see F-test.
Sir Ronald Aylmer Fisher FRS (17 February 1890 – 29 July 1962), known as R.A. Fisher, was an English statistician, evolutionary biologist, mathematician, geneticist, and eugenicist. Fisher is known as one of the three principal founders of population genetics, creating a mathematical and statistical basis for biology and uniting natural selection with Mendelian genetics.
Analysing variance
Decomposing variance example of height for males and females.
Population distribution
F-statistic
\[F = \frac{{MS}_{model}}{{MS}_{error}} = \frac{{SIGNAL}}{{NOISE}}\]
So the \(F\)-statistic represents a signal to noise ratio by deviding the model variance component by the error variance component.
A samples
Let’s take two sample from our normal populatiion and calculate the F-value.
.1 = rnorm(n, mu, sigma)
x.2 = rnorm(n, mu, sigma)
x
<- data.frame(group = rep(c("s1", "s2"), each=n), score = c(x.1,x.2))
data
= summary(aov(lm(score ~ group, data)))[[1]]$F[1]
F F
[1] 1.692487
More samples
let’s take more samples and calculate the F-value every time.
= 1000
n.samples
= vector()
f.values
for(i in 1:n.samples) {
.1 = rnorm(n, mu, sigma); x.1
x.2 = rnorm(n, mu, sigma); x.2
x
<- data.frame(group = rep(c("s1", "s2"), each=n), score = c(x.1,x.2))
data
= summary(aov(lm(score ~ group, data)))[[1]]$F[1]
f.values[i]
}
= 2
k = 2*n
N
= k - 1
df.model = N - k
df.error
hist(f.values, freq = FALSE, main="F-values", breaks=100)
= seq(0, 6, .01)
F lines(F, df(F,df.model, df.error), col = "red")
F-distribution
So if the population is normaly distributed (assumption of normality) the f-distribution represents the signal to noise ration given a certain number of samples (\({df}_{model} = k - 1\)) and sample size (\({df}_{error} = N - k\)).
The F-distibution therefore is different for different sample sizes and number of groups.
F-distribution
= c(5, 15, 30)
multiple.n = c(2, 4, 6)
multiple.k = multiple.k - 1
multiple.df.model = multiple.n - multiple.k
multiple.df.error = rainbow(length(multiple.df.model) * length(multiple.df.error))
col = seq(0, 10, .01)
F
plot(F, df(F, multiple.df.model[1], multiple.df.error[1]), type = "l",
xlim = c(0,10), ylim = c(0,.85),
xlab = "F", ylab="density",
col = col[1], main="F-distributions" )
= expand.grid(multiple.df.model, multiple.df.error)
dfs
for(i in 2:dim(dfs)[1]) {
lines(F, df(F, dfs[i,1], dfs[i,2]), col=col[i])
<- qf(.95, dfs[i,1], dfs[i,2])
critical.f
<- seq(critical.f, 1000, .01)
f.alpha
polygon(c(f.alpha, rev(f.alpha)), c(df(f.alpha, dfs[i,1], dfs[i,2]), f.alpha*0 ), col= rgb(1,.66,0, .5), border = col[i])
lines(c(critical.f+.1, 5), c(.02, .2), col=col[i])
}
text(5,.2, expression(paste(alpha, "= 5%")), pos =3)
legend("topright", legend = paste("df model =",dfs[,1], "df error =", dfs[,2]), lty=1, col = col, cex=.75)
F-distribution
Nonparametric tests
Parametric vs Nonparametric
Attribute | Parametric | Nonparametric |
---|---|---|
distribution | normaly distributed | any distribution |
sampling | random sample | random sample |
sensitivity to outliers | yes | no |
works with | large data sets | small and large data sets |
speed | fast | slow |
Ranking
= c(1, 4, 6, 7, 8, 9)
x = c(1, 4, 6, 7, 8, 39)
y
layout(matrix(1:2, 1, 2))
boxplot(x, horizontal=T, col='red')
boxplot(y, horizontal=T, col='red')
kable(rbind(rx = rank(x), ry = rank(y)))
rx | 1 | 2 | 3 | 4 | 5 | 6 |
ry | 1 | 2 | 3 | 4 | 5 | 6 |
Ties
# Scores
= c(11, 42, 62, 73, 84, 84, 42, 73, 90); x x
[1] 11 42 62 73 84 84 42 73 90
# sort
sort(x)
[1] 11 42 42 62 73 73 84 84 90
# assign ranks
1:length(x)
[1] 1 2 3 4 5 6 7 8 9
# solve for ties
rank(sort(x))
[1] 1.0 2.5 2.5 4.0 5.5 5.5 7.5 7.5 9.0
\[\frac{2 + 3}{2} = 2.5, \frac{5 + 6}{2} = 5.5, \frac{7 + 8}{2} = 7.5\]
Procedure
- Assumption: independent random samples.
- Hypothesis:
\(H_0\) : equal population distributions (implies equal mean ranking)
\(H_A\) : unequal mean ranking (two sided)
\(H_A\) : higher mean ranking for one group. - Test statistic is difference between mean or sum of ranking.
- Standardise test statistic
- Calculate P-value one or two sided.
- Conclude to reject \(H_0\) if \(p < \alpha\).
Kruskal–Wallis test
Independent >2 samples
Kruskal–Wallis test
Created by William Henry Kruskal (L) and Wilson Allen Wallis (R), the Kruskal-Wallis test is a nonparametric alternative to the independent one-way ANOVA.
The Kruskal-Wallis test essentially subtracts the expected mean ranking from the calculated oberved mean ranking, which is \(\chi^2\) distributed.
Simulate data
= 30
n = rep(c("ecstasy","alcohol","control"), each=n/3)
factor
.1 = ifelse(factor == "alcohol", 1, 0)
dummy.2 = ifelse(factor == "ecstasy", 1, 0)
dummy.0 = 23
b.1 = 0
b.2 = 0
b= rnorm(n, 0, 1.7)
error
# Model
= b.0 + b.1*dummy.1 + b.2*dummy.2 + error
depres = round(depres)
depres
<- data.frame(factor, depres) data
Assign ranks
# Assign ranks
$ranks = rank(data$depres) data
The data
Calculate H
\[H = \frac{12}{N(N+1)} \sum_{i=1}^k \frac{R_i^2}{n_i} - 3(N+1)\]
- \(N\) total sample size
- \(n_i\) sample size per group
- \(k\) number of groups
- \(R_i\) rank sums per group
Calculate H
# Now we need the sum of the ranks per group.
= aggregate(ranks ~ factor, data = data, sum)$ranks
R.i R.i
[1] 207.0 121.5 136.5
# De total sample size N is:
= nrow(data)
N
# And the sample size per group is n_i:
= aggregate(depres ~ factor, data=data, length)$depres
n.i n.i
[1] 10 10 10
Calculate H
\[H = \frac{12}{N(N+1)} \sum_{i=1}^k \frac{R_i^2}{n_i} - 3(N+1)\]
= ( 12/(N*(N+1)) ) * sum(R.i^2/n.i) - 3*(N+1)
H H
[1] 5.37871
And the degrees of freedom
= 3
k = k - 1 df
Test for significance
visualize.chisq(H, df, section="upper")