
An acquaintance asks the following question: he is applying for a university course which requires a certain minimum score on a test for admittance, and wonders about his chances and a possible trend of increasing minimum scores over time.
(He hasn't received his test results yet.)
The university doesn't publish a distribution of admittees' scores, but it does publish the minimum admitted score for each year 2005-2013 - except in years when every applicant scored above the (unknown) cutoff and so all were admitted, in which case no minimum score is reported.
This leads to the dataset:

    2005,NA
    2006,410
    2007,NA
    2008,NA
    2009,398
    2010,407
    2011,417
    2012,NA
    2013,NA

A quick eyeball tells us that we can't conclude much: only 4 actual datapoints, with 5 hidden from us.
We can't hope to conclude anything about time trends, other than there doesn't seem to be much of one: the last score, 417, is not much higher than 410, and the last two scores are low enough to be hidden.
We might be able to estimate a mean, though.

We can't simply average the 4 visible scores and conclude the mean minimum is 408, because of those NAs: some scores have been 'censored' because they were too low, and while we don't know what they were, we do know they were <398 (the smallest observed score), so a bunch of <398s will pull the true mean down below the uncensored mean of 408.
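To see why censoring biases a naive average, here is a small illustrative simulation (the mean, SD, and cutoff are invented for illustration, not fitted to the real data): we draw latent scores, hide everything below a cutoff, and compare means.

```r
# Toy simulation of left-censoring bias (invented parameters):
# latent values below the cutoff are hidden, so averaging only the
# visible values overestimates the true mean.
set.seed(2013)
latent  <- rnorm(10000, mean=400, sd=15)
cutoff  <- 400
visible <- latent[latent >= cutoff]
mean(latent)   # true mean: ~400
mean(visible)  # naive mean of the visible values: ~412, biased upward
```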

One approach is to treat it as a Tobit model (https://en.wikipedia.org/wiki/Tobit_model) and estimate it using something like the censReg library ( http://cran.r-project.org/package=censReg / http://cran.r-project.org/web/packages/censReg/vignettes/censReg.pdf ).

But if we try a quick call to `censReg`, we are confounded: a Tobit model expects you to provide the cutoff below which the observations were censored, but that is something we don't know.
All we know is that it must be below 398; we weren't told whether it was exactly 395, 394, etc.
Fortunately, this is a solved problem; for example, Carson & Sun 2007, "The Tobit model with a non-zero threshold" (http://econweb.ucsd.edu/~rcarson/papers/TobitEJ07.pdf), tells us:

> In this paper, we consider estimating the unknown censoring threshold by the minimum of the uncensored $y_i$'s. We show that the estimator $γ'$ of $γ$ is superconsistent and asymptotically exponentially distributed. Carson (1988, 1989) also suggests estimating the unknown censoring threshold by the minimum of the uncensored $y_i$'s. In a recent paper, Zuehlke (2003) rediscovers these unpublished results and demonstrates via simulations that the asymptotic distribution of the maximum likelihood estimator does not seem to be affected by the estimation of the censoring threshold.

That seems almost too simple and easy, but it makes sense, and it reminds me a little of the German tank problem: the minimum might not be that accurate a guess (it's unlikely you just happened to draw a sample right on the censoring threshold), but it definitely can't be wrong in the sense of being too low. (A Bayesian method might be able to do better, with a prior like an exponential.)
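A quick simulation sketch of that minimum-estimator idea (parameters again invented for illustration): the smallest uncensored observation can only err on the high side, and it closes in on the true threshold as the sample grows.

```r
# Estimating an unknown censoring threshold by the minimum visible value
# (invented parameters): the estimate is always >= the true cutoff, and
# approaches it quickly as n increases.
set.seed(1)
cutoff <- 395
estimateCutoff <- function(n) {
    y <- rnorm(n, mean=400, sd=15)
    min(y[y >= cutoff])   # smallest value that survived censoring
}
sapply(c(10, 100, 10000), estimateCutoff)
# the estimates shrink toward 395, and can never fall below it
```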

With that settled, the analysis is straightforward: load the data, figure out the minimum score, set the NAs to 0, regress, and extract the model estimates for each year:


    scores <- data.frame(Year=2005:2013,
                         MinimumScore=c(NA,410,NA,NA,398,407,417,NA,NA))
    censorThreshold <- min(scores$MinimumScore, na.rm=TRUE)
    scores[is.na(scores)] <- 0
    
    library(censReg)
    # 'censorThreshold-1' because censReg seems to treat threshold as < and not <=
    summary(censReg(MinimumScore ~ Year, left=censorThreshold-1, data=scores))
    # Warning message:
    # In censReg(MinimumScore ~ Year, left = censorThreshold - 1, data = scores) :
    #   at least one value of the endogenous variable is smaller than the left limit
    #
    # Call:
    # censReg(formula = MinimumScore ~ Year, left = censorThreshold -
    #     1, data = scores)
    #
    # Observations:
    #          Total  Left-censored     Uncensored Right-censored
    #              9              5              4              0
    #
    # Coefficients:
    #              Estimate Std. error t value Pr(> t)
    # (Intercept) -139.9711        Inf       0       1
    # Year           0.2666        Inf       0       1
    # logSigma       2.6020        Inf       0       1
    #
    # Newton-Raphson maximisation, 37 iterations
    # Return code 1: gradient close to zero
    # Log-likelihood: -19.35 on 3 Df
    -139.9711 + (0.2666 * scores$Year)
    # [1] 394.6 394.8 395.1 395.4 395.6 395.9 396.2 396.4 396.7


With so little data the results aren't very reliable, but there is one observation we can make.

The fact that half the dataset is censored tells us that the uncensored mean may be a huge overestimate (since we're only looking at the 'top half' of the underlying data), and indeed it is.
The mean of the four uncensored scores is 408; the model's estimates of the underlying yearly minima, including the censored years, are much lower, running from ~395 in 2005 to ~397 in 2013 - over 10 points less!
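The size of that bias can be made concrete by recomputing both means - the naive mean of the visible scores, and the latent mean implied by the intercept and Year coefficients copied from the `censReg` output above:

```r
# Naive mean of the 4 visible minimum scores vs. the censored-regression
# estimate of the mean latent minimum over 2005-2013:
uncensored <- c(410, 398, 407, 417)
mean(uncensored)                        # 408
mean(-139.9711 + 0.2666 * (2005:2013))  # ~395.6
```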

This demonstrates the danger of ignoring systematic biases in your data.

So, trying to estimate a mean or time trend is not that helpful.
What might be better is to exploit the censoring directly: if the censoring happened because everyone got in, then anyone who applied in a censored year had a 100% chance of getting in, while in an uncensored year the chance is unknown but <100%. The probability of a censored year therefore sets a lower bound on one's chances, and it is easy to estimate as a simple binomial problem - 5 of the 9 years were censored, so:

    binom.test(c(5,4))
    #
    #   Exact binomial test
    #
    # data:  c(5, 4)
    # number of successes = 5, number of trials = 9, p-value = 1
    # alternative hypothesis: true probability of success is not equal to 0.5
    # 95 percent confidence interval:
    #  0.212 0.863
    # sample estimates:
    # probability of success
    #                 0.5556

So we can tell him that his chance of getting in is at least the probability of a censored year, estimated at ~56% (5/9) - though with only 9 years of data, the 95% CI of 21-86% is very wide.
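Since what we actually want is a lower bound, `binom.test` can also be called one-sidedly (`alternative="greater"`) to get an exact 95% lower confidence bound on the probability of a censored year, which in turn bounds his admission chances from below:

```r
# One-sided exact binomial test: 5 censored years out of 9.
# conf.int[1] is the 95% lower confidence bound on the probability
# of a censored year; the upper bound is 1 by construction.
binom.test(5, 9, alternative="greater")$conf.int
# lower bound ~0.25
```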

#statistics #bias #R