An acquaintance asks the following question: he is applying for a university course which requires a certain minimum score on a test for admittance, and wonders about his chances and a possible trend of increasing minimum scores over time.
(He hasn't received his test results yet.)
The university doesn't provide a distribution of admittee scores, but it does provide the minimum scores for 2005-2013, unless all applicants were admitted because they all scored above an unknown cutoff - in which case it provides no minimum score.
This leads to the dataset:
2005,NA
2006,410
2007,NA
2008,NA
2009,398
2010,407
2011,417
2012,NA
2013,NA
A quick eyeball tells us that we can't conclude much: only 4 actual datapoints, with 5 hidden from us.
We can't hope to conclude anything about time trends, other than there doesn't seem to be much of one: the last score, 417, is not much higher than 410, and the last two scores are low enough to be hidden.
We might be able to estimate a mean, though.
We can't simply average the 4 scores and conclude the mean minimum is 410 because of those NAs: a number of scores have been 'censored' because they were too low, and while we don't know what they were, we do know they were <398 (the smallest score) and so a bunch of <398s will pull down the uncensored mean of 410.
On approach is to treat it as a Tobit model (https://en.wikipedia.org/wiki/Tobit_model) and estimate using something like the censReg library ( http://cran.r-project.org/package=censReg / http://cran.r-project.org/web/packages/censReg/vignettes/censReg.pdf ).
But if we try a quick call to `censReg`, we are confounded: a Tobit model expects you to provide the cutoff below which the observations were censored, but that is something we don't know.
All we know is that it must be below 398, we weren't told it was exactly 395, 394, etc.
Fortunately, this is a solved problem. For example: "The Tobit model with a non-zero threshold" (http://econweb.ucsd.edu/~rcarson/papers/TobitEJ07.pdf), Carson & Sun 2007 tells us:
> In this paper, we consider estimating the unknown censoring threshold by the minimum of the uncensored $y_i$'s. We show that the estimator $γ'$ of $γ$ is superconsistent and asymptotically exponentially distributed. Carson (1988, 1989) also suggests estimating the unknown censoring threshold by the minimum of the uncensored $y_i$'s. In a recent paper, Zuehlke (2003) rediscovers these unpublished results and demonstrates via simulations that the asymptotic distribution of the maximum likelihood estimator does not seem to be affected by the estimation of the censoring threshold.
That seems to be almost too simple and easy, but it makes sense and reminds me a little of the German tank problem: the minimum might not be that accurate a guess (it's unlikely you just happened to draw a sample right on the censoring threshold) and it definitely can't be wrong in the sense of being too low. (A Bayesian method might be able to do better with a prior like a exponential.)
With that settled, the analysis is straightforward: load the data, figure out the minimum score, set the NAs to 0, regress, and extract the model estimates for each year:
scores <- data.frame(Year=2005:2013,
MinimumScore=c(NA,410,NA,NA,398,407,417,NA,NA));
censorThreshold <- min(scores$MinimumScore, na.rm=T)
scores[is.na(scores)] <- 0
library(censReg)
# 'censorThreshold-1' because censReg seems to treat threshold as < and not <=
summary(censReg(MinimumScore ~ Year, left=censorThreshold-1, data=scores))
# Warning message:
# In censReg(MinimumScore ~ Year, left = censorThreshold - 1, data = scores) :
# at least one value of the endogenous variable is smaller than the left limit
#
# Call:
# censReg(formula = MinimumScore ~ Year, left = censorThreshold -
# 1, data = scores)
#
# Observations:
# Total Left-censored Uncensored Right-censored
# 9 5 4 0
#
# Coefficients:
# Estimate Std. error t value Pr(> t)
# (Intercept) -139.9711 Inf 0 1
# Year 0.2666 Inf 0 1
# logSigma 2.6020 Inf 0 1
#
# Newton-Raphson maximisation, 37 iterations
# Return code 1: gradient close to zero
# Log-likelihood: -19.35 on 3 Df
-139.9711 + (0.2666 * scores$Year)
# [1] 394.6 394.8 395.1 395.4 395.6 395.9 396.2 396.4 396.7
With so little data the results aren't very reliable, but there is one observation we can make.
The fact that half the dataset is censored tells us that the uncensored mean may be a huge overestimate (since we're only looking at the 'top half' of the underlying data), and indeed it is.
The original mean of the uncensored scores was 410; however, the estimate including the censored data is much lower, 397 (13 less)!
This demonstrates the danger of ignoring systematic biases in your data.
So, trying to calculate a mean or time effect is not helpful.
What might be better is to instead exploit the censoring directly: if the censoring happened because everyone got in, then if you showed up in a censored year, you have 100% chance of getting in; while in a non-censored year you have an unknown but <100% chance of getting in; so the probability of a censored year sets a lower bound on one's chances, and this is easy to calculate as a simple binomial problem - 5 out of 9 years were censored years, so:
binom.test(c(5,4))
#
# Exact binomial test
#
# data: c(5, 4)
# number of successes = 5, number of trials = 9, p-value = 1
# alternative hypothesis: true probability of success is not equal to 0.5
# 95 percent confidence interval:
# 0.212 0.863
# sample estimates:
# probability of success
# 0.5556
So we can tell him that he may have a >55% chance of getting in.
#statistics #bias #R
(He hasn't received his test results yet.)
The university doesn't provide a distribution of admittee scores, but it does provide the minimum scores for 2005-2013, unless all applicants were admitted because they all scored above an unknown cutoff - in which case it provides no minimum score.
This leads to the dataset:
2005,NA
2006,410
2007,NA
2008,NA
2009,398
2010,407
2011,417
2012,NA
2013,NA
A quick eyeball tells us that we can't conclude much: only 4 actual datapoints, with 5 hidden from us.
We can't hope to conclude anything about time trends, other than there doesn't seem to be much of one: the last score, 417, is not much higher than 410, and the last two scores are low enough to be hidden.
We might be able to estimate a mean, though.
We can't simply average the 4 scores and conclude the mean minimum is 410 because of those NAs: a number of scores have been 'censored' because they were too low, and while we don't know what they were, we do know they were <398 (the smallest score) and so a bunch of <398s will pull down the uncensored mean of 410.
On approach is to treat it as a Tobit model (https://en.wikipedia.org/wiki/Tobit_model) and estimate using something like the censReg library ( http://cran.r-project.org/package=censReg / http://cran.r-project.org/web/packages/censReg/vignettes/censReg.pdf ).
But if we try a quick call to `censReg`, we are confounded: a Tobit model expects you to provide the cutoff below which the observations were censored, but that is something we don't know.
All we know is that it must be below 398, we weren't told it was exactly 395, 394, etc.
Fortunately, this is a solved problem. For example: "The Tobit model with a non-zero threshold" (http://econweb.ucsd.edu/~rcarson/papers/TobitEJ07.pdf), Carson & Sun 2007 tells us:
> In this paper, we consider estimating the unknown censoring threshold by the minimum of the uncensored $y_i$'s. We show that the estimator $γ'$ of $γ$ is superconsistent and asymptotically exponentially distributed. Carson (1988, 1989) also suggests estimating the unknown censoring threshold by the minimum of the uncensored $y_i$'s. In a recent paper, Zuehlke (2003) rediscovers these unpublished results and demonstrates via simulations that the asymptotic distribution of the maximum likelihood estimator does not seem to be affected by the estimation of the censoring threshold.
That seems to be almost too simple and easy, but it makes sense and reminds me a little of the German tank problem: the minimum might not be that accurate a guess (it's unlikely you just happened to draw a sample right on the censoring threshold) and it definitely can't be wrong in the sense of being too low. (A Bayesian method might be able to do better with a prior like a exponential.)
With that settled, the analysis is straightforward: load the data, figure out the minimum score, set the NAs to 0, regress, and extract the model estimates for each year:
scores <- data.frame(Year=2005:2013,
MinimumScore=c(NA,410,NA,NA,398,407,417,NA,NA));
censorThreshold <- min(scores$MinimumScore, na.rm=T)
scores[is.na(scores)] <- 0
library(censReg)
# 'censorThreshold-1' because censReg seems to treat threshold as < and not <=
summary(censReg(MinimumScore ~ Year, left=censorThreshold-1, data=scores))
# Warning message:
# In censReg(MinimumScore ~ Year, left = censorThreshold - 1, data = scores) :
# at least one value of the endogenous variable is smaller than the left limit
#
# Call:
# censReg(formula = MinimumScore ~ Year, left = censorThreshold -
# 1, data = scores)
#
# Observations:
# Total Left-censored Uncensored Right-censored
# 9 5 4 0
#
# Coefficients:
# Estimate Std. error t value Pr(> t)
# (Intercept) -139.9711 Inf 0 1
# Year 0.2666 Inf 0 1
# logSigma 2.6020 Inf 0 1
#
# Newton-Raphson maximisation, 37 iterations
# Return code 1: gradient close to zero
# Log-likelihood: -19.35 on 3 Df
-139.9711 + (0.2666 * scores$Year)
# [1] 394.6 394.8 395.1 395.4 395.6 395.9 396.2 396.4 396.7
With so little data the results aren't very reliable, but there is one observation we can make.
The fact that half the dataset is censored tells us that the uncensored mean may be a huge overestimate (since we're only looking at the 'top half' of the underlying data), and indeed it is.
The original mean of the uncensored scores was 410; however, the estimate including the censored data is much lower, 397 (13 less)!
This demonstrates the danger of ignoring systematic biases in your data.
So, trying to calculate a mean or time effect is not helpful.
What might be better is to instead exploit the censoring directly: if the censoring happened because everyone got in, then if you showed up in a censored year, you have 100% chance of getting in; while in a non-censored year you have an unknown but <100% chance of getting in; so the probability of a censored year sets a lower bound on one's chances, and this is easy to calculate as a simple binomial problem - 5 out of 9 years were censored years, so:
binom.test(c(5,4))
#
# Exact binomial test
#
# data: c(5, 4)
# number of successes = 5, number of trials = 9, p-value = 1
# alternative hypothesis: true probability of success is not equal to 0.5
# 95 percent confidence interval:
# 0.212 0.863
# sample estimates:
# probability of success
# 0.5556
So we can tell him that he may have a >55% chance of getting in.
#statistics #bias #R
Not sure I agree with the initial approach - I think it would be better to model the number of applicants each year vs the available places, I'm not convinced that the average threshold is relevant.
Be that as it may, given that there's a 56% chance of getting in regardless of score, shouldn't you add the 44% chance times the probability of beating the threshold score? My guess would be, since there is no threshold in more than half of the years, that they university isn't too discerning, and if your friend is an average student, chances are good that he'll get in, also in a year when there's a threshold.Jul 18, 2014
> Be that as it may, given that there's a 56% chance of getting in regardless of score, shouldn't you add the 44% chance times the probability of beating the threshold score?
I have no way of estimating what that is since it's not included in the info, so I just leave it as a lower bound of >56%.Jul 18, 2014