4

I am trying to understand how the predict function works for conditional logit. I tried to calculate it manually but it does not work. Any help would be greatly appreciated.

library(survival)

set.seed(1)
data <- data.frame(Used = rep(c(1,0,0,0),1250),
                   Open = round(runif(5000,0,50),0),
                   Strata = rep(1:1250,each=4))

mod <- clogit(Used ~ Open + strata(Strata),data=data)

newdata <- data.frame(Open = seq(0,10,1),Strata=1)

risk <- predict(mod,newdata=newdata,type = "risk")
risk


lp <- predict(mod,newdata=newdata,type = "lp")
lp


##manual calculation
coef<-data.frame(coef = summary(mod)$coefficients[,1],
                 se= summary(mod)$coefficients[,3])
coef$se <-summary(mod)$coefficients[,3]
coef$UpCI <- coef[,1] + (coef[,2]*2) ### this could be *1.96 but using 2 for simplicity
coef$LowCI <-coef[,1] - (coef[,2]*2) ### this could be *1.96 but using 2 for simplicity

fitted<-data.frame(Open= seq(0,10,1))

fitted$Marginal <- exp(coef[1,1]*fitted$Open) /
  (1+exp(coef[1,1]*fitted$Open ))

fitted$UpCI <- exp(coef[1,3]*fitted$Open )/
  (1+exp(coef[1,3]*fitted$Open))

fitted$LowCI <- exp(coef[1,4]*fitted$Open )/
  (1+exp(coef[1,4]*fitted$Open ))
1
  • 2
    I don't have the full answer (and don't have time right now to work it out), but the issue is that the predict method (1) inherits from predict.coxph, so is providing predictions that are relevant/meaningful for survival analysis rather than for conditional logistic prediction: (2) every stratum is treated as a separate unit, so predictions are always (AFAICT) stratum-specific. There is a related question (with an unaccepted answer) here ... Commented Jun 1 at 17:08

1 Answer 1

6

This turns out to be a rather deep rabbit hole. There is an answer here, but I'm not quite sure it's correct.

The basic statistical issue is described below. You will want either risk/(1+risk) or risk/[average risk per stratum]; I think these might be equivalent if there are exactly two observations per stratum ...

It may be useful to know that the risk prediction is exp(lp), and risk/(risk+1) is plogis(lp), so if you have standard errors you should be able to get confidence intervals via

pred <- predict(model, newdata, type = "lp", se.fit = TRUE)
with(pred, plogis(fit ± 1.96*se.fit))  ## abbreviated, ofc ± doesn't really work ...

(For more information on how the predict() function works, see e.g. https://stats.stackexchange.com/questions/281781/predict-functions-after-clogit-in-r-using-survival-package)

Getting confidence intervals on the risk/average_risk form seems harder.

If you have more questions about this, you might need to follow up on Cross Validated (but please include a link to this Q&A so as not to waste time re-treading old ground) ...


Searching (after cloning) the r-help directory of Michael Chirico's R mailing lists archive: specifically, you can search online here within any of the files described below.

I ran egrep -l "(predict.*clogit|clogit.*predict)" * on the command line ("search all files for text containing 'predict [... stuff ...] clogit' or 'clogit [... stuff ...] predict'")

2006-February.txt

Arne Jol:

I wonder if there is a prediction function for a clogit model which can be used in the same way as the predict function for the multinom model.

Thomas Lumley:

I don't think this is going to be possible. The point of conditional logistic regression is that the probabilities depend on stratum parameters that cannot be estimated accurately. The conditional likelihood removes these parameters, but the resulting model does not contain enough information to estimate probabilities.

2010-April.txt

Discussion between Thomas Lumley and Chuck Berry suggesting

lp <- predict(model, type = "lp", newdata = ...)
exp(lp)/ave( exp(lp), stratvar, FUN=sum )

2014-June.txt

Peter Dalgaard:

I'm rusty on this, but I think what you want is something like

m <- model.matrix(~ x1 + x2 - 1, data=dat.test)
pp <- exp(m %*% coef(fit))
pps <- ave(pp, dat.test$set, FUN=sum)
pp/pps

[this is the same as the suggestion from April 2010]

i.e. the conditional probability that an observation is a case given covariates and that there is on[e] case in each set (in the data given, you have sets of three with one being a case, so all predicted probabilities are close to 0.33). For more general matched sets, I'm not really sure what one wants. Real experts are welcome to chime in.

Terry Therneau:

As Peter D said, the clogit function simply sets up a special data set and then calls coxph, and is based on an identity that the likelihood for the conditional logistic is identical to the likelihood of a Cox model for a specially structured data set. I vacillate on whether this identity is just a mathematical accident or the mark of some deep truth. However it did allow us to use existing code rather than write new routine from scratch.

In this special data set all of the follow-up times are set to 1 (any constant value would do). For normal Cox regression a predicted "expected number of events" depends on the length of follow-up for the subject, so the predict routine expects to find a follow-up time variable in your newdata. In my code, "rep(1, 150L)" is being mistaken for the name of the time variable, and failure of the confused routine ensues in due course.

For the interim: in a conditional logistic the "expected number of events" is just exp(eta)/(1 + exp(eta)) where eta is the linear predictor. There is no follow-up time to worry about and predict.coxph would have done way too much work, even if it hadn't gotton [sic] confused. Use

risk <- predict(fit, newdata=dat.test, type='risk')
risk / (1+risk)
Sign up to request clarification or add additional context in comments.

4 Comments

I really appreciate the detailed explanation, It is much clearer now. I also found this link that helped me understand how coxph.predict function is calculating the predictions. stats.stackexchange.com/questions/281781/…
I understand how the "risk" makes the predictions using the offset, based on the above mentioned link. How does R calculate the offset if there are 2 variables. Sample code attached here-set.seed(1) sim_data <- data.frame(Used = rep(c(1,0,0,0),1250), Open = round(runif(5000,0,50),0), Activity = rep(sample(rbinom(n=10000, size=1, prob=0.50),1250, replace=T), each=4), Strata = rep(1:1250,each=4)) mod_sim <- clogit(Used ~ Open + I(Activity*Open) +strata(Strata), data = sim_data)
It’s good to put “risk” in quotes because there are so many ways to quantify risk predictions. You really need to specify if you want odds ratios (the natural effect measure for logit models) or if you want predicted proportions of possible outcomes for a particular set of covariates.
The linked answer says "R uses the stratum mean of the covariate", which would presumably be "the stratum means of [all of] the covariates" in the case of multiple covariates ...

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.