0

I have the following data:

ID <- c("A", "B", "C", "D", "E", "F")
age <- c(54, 61, 65, 55, 60, 60)
sex <- c(0, 0, 1, 1, 1, 0)
Q1 <- c(0, 0, 0, 0, 0, 0)
Q2 <- c(0, 1, 0, 0, 0, 1) 
Q3 <- c(0, 1, 1, 0, 0, 1) 
Q4 <- c(0, 1, 1, 1, 0, 1) 
Q5 <- c(0, 1, 1, 1, 0, 1)
E1 <- c(2, 1, 0, 0, 0, 0)
E2 <- c(0, 1, 2, 0, 1, 0)
E3 <- c(0, 0, 1, 0, 1, 1)
E4 <- c(1, 0, 0, 0, 0, 0)
E5 <- c(0, 0, 0, 0, 2, 2)
Sint <- c(4, 3, 4, 1, 0, 2)
surv1 <- c(1, 1, 1, 1, 1, 1)
surv2 <- c(1, 1, 0, 1, 1, 1)
surv3 <- c(1, 1, 0, 1, 1, 1)
surv4 <- c(1, 1, 0, 1, 1, 0)
surv5 <- c(1, 1, 0, 1, 0, 0)
surv6 <- c(1, 1, 0, 1, 0, 0)


dta <- data.frame(ID, age, sex, Q1, Q2, Q3, Q4, Q5, E1, E2, E3, E4, E5, Sint,
                  surv1, surv2, surv3, surv4, surv5, surv6)

I created the following arrays:

surv_wave <- c("surv1", "surv2", "surv3", "surv4", "surv5", "surv6")
var_num <- c("age", "sex")
Wave2 <- c("age", "sex", "Q1", "E1", "Sint")
Wave3 <- c("age", "sex", "Q1", "Q2", "E1", "E2", "Sint")
Wave4 <- c("age", "sex", "Q1", "Q2", "Q3", "E1", "E2", "E3", "Sint")
Wave5 <- c("age", "sex", "Q1", "Q2", "Q3", "Q4", "E1", "E2", "E3", "E4", "Sint")
Wave6 <- c("age", "sex", "Q1", "Q2", "Q3", "Q4", "Q5", "E1", "E2", "E3", "E4", "E5", "Sint")
Waves <- c("Wave2", "Wave3", "Wave4", "Wave5", "Wave6")

And I want to iterate over the arrays to predict probabilities of survival given the variables in the arrays:

# Probability variables that will be predicted
dta$wsd2 <- NA
dta$wsd3 <- NA
dta$wsd4 <- NA
dta$wsd5 <- NA
dta$wsd6 <- NA

# vector of variables that will be predicted
wsurv_den <- c("wsd2", "wsd3", "wsd4", "wsd5", "wsd6")

# iterate all waves
for(i in 2:6) {
  
  # subset people who survived in the previous wave
  Subset <- subset(dta, dta[[surv_wave[i-1]]] == 1)
  
  # logistic regression
  f <- as.formula(
    paste(surv_wave[i], 
          paste(Waves[i], collapse = " + "),
          sep = " ~ "))
  Den_surv_s <- glm(f,  family = binomial(link = "logit"),  
                    data = Subset)
  
  # predict probabilities of survival based on logistic regression
  Den_surv_p_s <- predict(Den_surv_s, type = "response")
  
  # Add predicted values to original dataset
  dta[dta[[surv_wave[i-1]]] == 1,][[wsurv_den[i-1]]]<-Den_surv_p_s
  
}

I keep getting an error message: Error in model.frame.default(formula = f, data = Subset, drop.unused.levels = TRUE) : variable lengths differ (found for 'Wave3')

I looked at possible solutions, but I don't have NA values and the only "Wave3" variable I have in the environment is the array. What am I doing wrong?

3 Answers 3

2

What is causing the error?

You are receiving this error because of the formulas you are passing to glm(). While you are likely expecting paste(Waves[i], collapse = " + ") to construct a string pasting together each of the values in the vector with the name returned from Waves[i], your code is pasting the name of the vector itself.

Illustration: what your code does when i = 2 in the loop

f <- as.formula(
  paste(surv_wave[2], 
        paste(Waves[2], collapse = " + "),
        sep = " ~ "))

glm(f,  family = binomial(link = "logit"),  
                  data = dta)
#> Error in model.frame.default(formula = f, data = dta, drop.unused.levels = TRUE): variable 
#> lengths differ (found for 'Wave3')

# You're passing the following formula to glm():
f
#> surv2 ~ Wave3

How to fix the error?

You can fix this by passing Waves[i] to get() first, which will pass the object itself to paste(), rather than the name of the object.: paste(Waves[i], collapse = " + ").

Illustration: passing Waves[i] to get()

f <- as.formula(
  paste(surv_wave[2], 
        paste(get(Waves[2]), collapse = " + "),
        sep = " ~ "))

# Here's what you are now passing:
f
#> surv2 ~ age + sex + Q1 + Q2 + E1 + E2 + Sint


glm(f,  family = binomial(link = "logit"),  
                  data = dta)
#> 
#> Call:  glm(formula = f, family = binomial(link = "logit"), data = dta)
#> 
#> Coefficients:
#> (Intercept)          age          sex           Q1           Q2           E1  
#>     146.414       -1.965       -3.931           NA       15.722       11.792  
#>          E2         Sint  
#>          NA       -9.826  
#> 
#> Degrees of Freedom: 5 Total (i.e. Null);  0 Residual
#> Null Deviance:       5.407 
#> Residual Deviance: 2.572e-10     AIC: 12

Bonus: an automated, likely-faster approach

Even fixing this issue, your code likely won't produce the desired outcome due to other errors. Furthermore, it requires a lot of manual entry of values to produce that won't be possible with many waves! We can entirely avoid the issue above and the need to create anything except for the vectors of model predictors you created by restructuring your data with tidyr. We can also make (arguably) more readable code and avoid some of the drawbacks of loops by using map functions from the purrr package.

Your (example) data, created concisely

dta <- data.frame(
  ID = c("A", "B", "C", "D", "E", "F"),
  age = c(54, 61, 65, 55, 60, 60),
  sex = c(0, 0, 1, 1, 1, 0),
  Q1 = c(0, 0, 0, 0, 0, 0),
  Q2 = c(0, 1, 0, 0, 0, 1),
  Q3 = c(0, 1, 1, 0, 0, 1),
  Q4 = c(0, 1, 1, 1, 0, 1),
  Q5 = c(0, 1, 1, 1, 0, 1),
  E1 = c(2, 1, 0, 0, 0, 0),
  E2 = c(0, 1, 2, 0, 1, 0),
  E3 = c(0, 0, 1, 0, 1, 1),
  E4 = c(1, 0, 0, 0, 0, 0),
  E5 = c(0, 0, 0, 0, 2, 2),
  Sint = c(4, 3, 4, 1, 0, 2),
  surv1 = c(1, 1, 1, 1, 1, 1),
  surv2 = c(1, 1, 0, 1, 1, 1),
  surv3 = c(1, 1, 0, 1, 1, 1),
  surv4 = c(1, 1, 0, 1, 1, 0),
  surv5 = c(1, 1, 0, 1, 0, 0),
  surv6 = c(1, 1, 0, 1, 0, 0)
)

# As @langtang noted, this is much easier with your formulas in a list:
predictors = list(
  c("age", "sex", "Q1", "E1", "Sint"),
  c("age", "sex", "Q1", "Q2", "E1", "E2", "Sint"),
  c("age", "sex", "Q1", "Q2", "Q3", "E1", "E2", "E3", "Sint"),
  c("age", "sex", "Q1", "Q2", "Q3", "Q4", "E1", "E2", "E3", "E4", "Sint"),
  c("age", "sex", "Q1", "Q2", "Q3", "Q4", "Q5", "E1", "E2", "E3", "E4", "E5", "Sint")
)

Below is highly-commented code to accomplish this. May be faster due to vectorization of some functions.

Alternative implementation



# Three packages will make this much easier:
library(tidyr)
library(dplyr)
library(purrr)

# Reshape your data to a long structure where you have one row for each
# combination of year and ID observed in the data. Since you have balanced
# panel data, this means you have length(unique(ID))*length(unique(# of waves))
# rows now. The following code  requires balanced panel data but allows for
# arbitrary numbers of waves, provided that each wave has its own vector of
# predictors in the predictors list above.
df <- dta %>%
  pivot_longer(cols = starts_with("surv"),
               names_to = "wave",
               values_to = "survived_wave")

# Convert survey_wave into number vector by removing "surv" prefix
# then sort by ID and wave so earlier waves are in earlier rows, within each ID
df <- df %>%
  mutate(wave = gsub("surv", "", wave)) %>%
  arrange(ID, wave)

# Add indicator for whether survived previous wave
# - automatically creates NA values for first wave responses
df <- df %>%
  group_by(ID) %>%
  mutate(survived_prev_wave = lag(survived_wave, n = 1)) %>%
  ungroup()

# Now that we have all the information we need (whether wave 2 survived wave 1)
# from the first wave of responses, we can drop them.
df_nowave1 <- df %>% filter(wave != 1)

# Create a list of data.frames, where each data.frame contains the rows for
# that corresponding wave
df_list <- df_nowave1 %>% split(.$wave)

# Define a function to return data with predicted probs. For each wave:
# - Subset to survivors of previous wave only
# - create a formula
# - fit a model
# - predict probs and add as "probs" column
analyze <- function(df, predictors) {
  # Subset to survivors in previous period
  survivors <- df %>% filter(survived_prev_wave == 1)
  
  # Create formula
  form <- reformulate(termlabels = predictors, 
                      response = "survived_wave")
  
  # Fit model
  mod <- glm(formula = form,
             data = survivors,
             binomial(link = "logit"))
  
  # Predict probabilities and add to data (note: predictions are for survivors!
  # in the previous period!)
  survivors %>% mutate(probs = predict(mod, type = "response"))
}

# We can now fit separate models for the survivors in each wave, using the
# wave-specific vectors of predictors and combine back into a single data.frame.
result <- map2(.x = df_list,
               .y = predictors,
               ~ analyze(df = .x, predictors = .y)) %>%
  bind_rows()

# OP wanted data in wide structure, so convert back to wide with probabilities
# for each wave in their own columns (probs_wave1, probs_wave2, etc.)
result_wide <- result %>% pivot_wider(values_from = "probs",
                             names_from = "wave",
                             names_prefix = "probs_wave")
Sign up to request clarification or add additional context in comments.

Comments

2

First of all, please use reformulate instead of constipated, nested paste instructions to assemble a formula. reformulate does it in one simple step.

In the code below, Waves is the list in @socialscientist's answer and at end.

for(i in seq_along(surv_wave)[-1]) {
  surv_prev <- dta[[ surv_wave[i - 1L] ]]
  i_surv <- which(surv_prev == 1L)
  srv <- surv_wave[i]
  wv <- Waves[[i - 1L]]
  Subset <- dta[i_surv, ]
  f <- reformulate(wv, srv)
  fit <- glm(f, data = Subset, family = binomial(link = "logit"))
  ypred <- predict(fit, type = "response")
  dta[i_surv, wsurv_den[i - 1L]] <- ypred
}

Data

Waves list.

Waves = list(
  Wave2 =c("age", "sex", "Q1", "E1", "Sint"),
  Wave3 =c("age", "sex", "Q1", "Q2", "E1", "E2", "Sint"),
  Wave4 =c("age", "sex", "Q1", "Q2", "Q3", "E1", "E2", "E3", "Sint"),
  Wave5 =c("age", "sex", "Q1", "Q2", "Q3", "Q4", "E1", "E2", "E3", "E4", "Sint"),
  Wave6 =c("age", "sex", "Q1", "Q2", "Q3", "Q4", "Q5", "E1", "E2", "E3", "E4", "E5", "Sint")
)

Comments

0

I think a couple of changes will help you.

  1. First, create a named list Waves, where each element is a vector your variables for that wave, like this:
Waves = list(
  Wave2 =c("age", "sex", "Q1", "E1", "Sint"),
  Wave3 =c("age", "sex", "Q1", "Q2", "E1", "E2", "Sint"),
  Wave4 =c("age", "sex", "Q1", "Q2", "Q3", "E1", "E2", "E3", "Sint"),
  Wave5 =c("age", "sex", "Q1", "Q2", "Q3", "Q4", "E1", "E2", "E3", "E4", "Sint"),
  Wave6 =c("age", "sex", "Q1", "Q2", "Q3", "Q4", "Q5", "E1", "E2", "E3", "E4", "E5", "Sint")
)
  1. Then, update the creation of the formula to reference this list
  f <- as.formula(
    paste(surv_wave[i], "~", paste(Waves[[paste0("Wave",i)]], collapse=" + "))
  )

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.