What is causing the error?
You are receiving this error because of the formulas you are passing to glm(). While you are likely expecting paste(Waves[i], collapse = " + ") to construct a string pasting together each of the values in the vector with the name returned from Waves[i], your code is pasting the name of the vector itself.
Illustration: what your code does when i = 2 in the loop
f <- as.formula(
paste(surv_wave[2],
paste(Waves[2], collapse = " + "),
sep = " ~ "))
glm(f, family = binomial(link = "logit"),
data = dta)
#> Error in model.frame.default(formula = f, data = dta, drop.unused.levels = TRUE): variable
#> lengths differ (found for 'Wave3')
# You're passing the following formula to glm():
f
#> surv2 ~ Wave3
How to fix the error?
You can fix this by passing Waves[i] to get() first, which will pass the object itself to paste(), rather than the name of the object.: paste(Waves[i], collapse = " + ").
Illustration: passing Waves[i] to get()
f <- as.formula(
paste(surv_wave[2],
paste(get(Waves[2]), collapse = " + "),
sep = " ~ "))
# Here's what you are now passing:
f
#> surv2 ~ age + sex + Q1 + Q2 + E1 + E2 + Sint
glm(f, family = binomial(link = "logit"),
data = dta)
#>
#> Call: glm(formula = f, family = binomial(link = "logit"), data = dta)
#>
#> Coefficients:
#> (Intercept) age sex Q1 Q2 E1
#> 146.414 -1.965 -3.931 NA 15.722 11.792
#> E2 Sint
#> NA -9.826
#>
#> Degrees of Freedom: 5 Total (i.e. Null); 0 Residual
#> Null Deviance: 5.407
#> Residual Deviance: 2.572e-10 AIC: 12
Bonus: an automated, likely-faster approach
Even fixing this issue, your code likely won't produce the desired outcome due to other errors. Furthermore, it requires a lot of manual entry of values to produce that won't be possible with many waves! We can entirely avoid the issue above and the need to create anything except for the vectors of model predictors you created by restructuring your data with tidyr. We can also make (arguably) more readable code and avoid some of the drawbacks of loops by using map functions from the purrr package.
Your (example) data, created concisely
dta <- data.frame(
ID = c("A", "B", "C", "D", "E", "F"),
age = c(54, 61, 65, 55, 60, 60),
sex = c(0, 0, 1, 1, 1, 0),
Q1 = c(0, 0, 0, 0, 0, 0),
Q2 = c(0, 1, 0, 0, 0, 1),
Q3 = c(0, 1, 1, 0, 0, 1),
Q4 = c(0, 1, 1, 1, 0, 1),
Q5 = c(0, 1, 1, 1, 0, 1),
E1 = c(2, 1, 0, 0, 0, 0),
E2 = c(0, 1, 2, 0, 1, 0),
E3 = c(0, 0, 1, 0, 1, 1),
E4 = c(1, 0, 0, 0, 0, 0),
E5 = c(0, 0, 0, 0, 2, 2),
Sint = c(4, 3, 4, 1, 0, 2),
surv1 = c(1, 1, 1, 1, 1, 1),
surv2 = c(1, 1, 0, 1, 1, 1),
surv3 = c(1, 1, 0, 1, 1, 1),
surv4 = c(1, 1, 0, 1, 1, 0),
surv5 = c(1, 1, 0, 1, 0, 0),
surv6 = c(1, 1, 0, 1, 0, 0)
)
# As @langtang noted, this is much easier with your formulas in a list:
predictors = list(
c("age", "sex", "Q1", "E1", "Sint"),
c("age", "sex", "Q1", "Q2", "E1", "E2", "Sint"),
c("age", "sex", "Q1", "Q2", "Q3", "E1", "E2", "E3", "Sint"),
c("age", "sex", "Q1", "Q2", "Q3", "Q4", "E1", "E2", "E3", "E4", "Sint"),
c("age", "sex", "Q1", "Q2", "Q3", "Q4", "Q5", "E1", "E2", "E3", "E4", "E5", "Sint")
)
Below is highly-commented code to accomplish this. May be faster due to vectorization of some functions.
Alternative implementation
# Three packages will make this much easier:
library(tidyr)
library(dplyr)
library(purrr)
# Reshape your data to a long structure where you have one row for each
# combination of year and ID observed in the data. Since you have balanced
# panel data, this means you have length(unique(ID))*length(unique(# of waves))
# rows now. The following code requires balanced panel data but allows for
# arbitrary numbers of waves, provided that each wave has its own vector of
# predictors in the predictors list above.
df <- dta %>%
pivot_longer(cols = starts_with("surv"),
names_to = "wave",
values_to = "survived_wave")
# Convert survey_wave into number vector by removing "surv" prefix
# then sort by ID and wave so earlier waves are in earlier rows, within each ID
df <- df %>%
mutate(wave = gsub("surv", "", wave)) %>%
arrange(ID, wave)
# Add indicator for whether survived previous wave
# - automatically creates NA values for first wave responses
df <- df %>%
group_by(ID) %>%
mutate(survived_prev_wave = lag(survived_wave, n = 1)) %>%
ungroup()
# Now that we have all the information we need (whether wave 2 survived wave 1)
# from the first wave of responses, we can drop them.
df_nowave1 <- df %>% filter(wave != 1)
# Create a list of data.frames, where each data.frame contains the rows for
# that corresponding wave
df_list <- df_nowave1 %>% split(.$wave)
# Define a function to return data with predicted probs. For each wave:
# - Subset to survivors of previous wave only
# - create a formula
# - fit a model
# - predict probs and add as "probs" column
analyze <- function(df, predictors) {
# Subset to survivors in previous period
survivors <- df %>% filter(survived_prev_wave == 1)
# Create formula
form <- reformulate(termlabels = predictors,
response = "survived_wave")
# Fit model
mod <- glm(formula = form,
data = survivors,
binomial(link = "logit"))
# Predict probabilities and add to data (note: predictions are for survivors!
# in the previous period!)
survivors %>% mutate(probs = predict(mod, type = "response"))
}
# We can now fit separate models for the survivors in each wave, using the
# wave-specific vectors of predictors and combine back into a single data.frame.
result <- map2(.x = df_list,
.y = predictors,
~ analyze(df = .x, predictors = .y)) %>%
bind_rows()
# OP wanted data in wide structure, so convert back to wide with probabilities
# for each wave in their own columns (probs_wave1, probs_wave2, etc.)
result_wide <- result %>% pivot_wider(values_from = "probs",
names_from = "wave",
names_prefix = "probs_wave")