3

I have data from a survey, where several questions are in the format

"Do you think that [xxxxxxx]"

The possible answers to the questions are in the format

"I am certain that [xxxxxxx]" "I think it is possible that [xxxxxx]" "I don't know if [xxxxxx]"

and so on.

I would now like to recode these factors so that "I am certain" = 1, "I think it is possible" = 2 and so on. I have been playing with dplyr::recode but it does not seem to work with regular expressions.

For example:

set.seed(12345)

possible_answers <- c(
    "I am certain that", "I think it is possible that",
    "I don't know if is possible that", "I think it is not possible that",
    "I am certain that it is not possible that", "It is impossible for me to know if"
)

num_answers <- 10
survey <- data.frame(
    Q1 = paste(
        sample(possible_answers, num_answers, replace = TRUE),
        "topic 1"
    ),
    Q2 = paste(
        sample(possible_answers, num_answers, replace = TRUE),
        "topic 2"
    ),
    Q3 = paste(
        sample(possible_answers, num_answers, replace = TRUE),
        "topic 3"
    ),
    Q4 = paste(
        sample(possible_answers, num_answers, replace = TRUE),
        "topic 4"
    ),
    Q5 = paste(
        sample(possible_answers, num_answers, replace = TRUE),
        "topic 5"
    )
)

I can do something like

survey %>% 
    mutate_at("Q1", recode,
                "I am certain that topic 1" = 1,
                "I think it is possible that topic 1" = 2,
                "I don't know if is possible that topic 1" = 3,
                "I think it is not possible that topic 1" = 4,
                "I am certain that it is not possible that topic 1" = 5,
                "It is impossible for me to know if topic 1" = 6)

but doing it for all questions would be cumbersome.

I would like to do

survey %>% 
    mutate_at(vars(starts_with("Q")), recode,
                "I am certain that (.*)" = 1,
                "I think it is possible that (.*)" = 2,
                "I don't know if is possible that (.*)" = 3,
                "I think it is not possible that (.*)" = 4,
                "I am certain that it is not possible that (.*)" = 5,
                "It is impossible for me to know if (.*)" = 6)

But this changes everything to NA, because it does not see the strings as regular expressions.
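
A minimal sketch of the behaviour described above, on a hypothetical two-element vector rather than the full survey. It assumes `recode()` compares each value to the supplied names as literal strings, so a name containing `(.*)` never equals any real answer and the unmatched values fall back to `NA` because the numeric replacements are not compatible with the character input:

```r
library(dplyr)

# Hypothetical toy input, not the survey data above
x <- c("I am certain that topic 1", "I think it is possible that topic 1")

# Names are treated as literal strings, so nothing matches;
# the warning about unreplaced values is silenced here for brevity
literal <- suppressWarnings(
  recode(x,
         "I am certain that (.*)" = 1,
         "I think it is possible that (.*)" = 2)
)

# Names that equal the values exactly are recoded as expected
exact <- recode(x,
                "I am certain that topic 1" = 1,
                "I think it is possible that topic 1" = 2)

all(is.na(literal))  # every value was left unmatched
exact                # numeric codes for the two answers
```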

6
  • Why not use base::startsWith()? Commented Jun 23, 2023 at 12:17
  • @SamR I am not sure how that would solve the problem, could you provide an example? Commented Jun 23, 2023 at 12:29
  • Exactly the same way as the answer by Dave Armstrong, just with a fixed string rather than regex, which should be slightly faster. The important caveat about order still applies. Commented Jun 23, 2023 at 12:36
  • The problem with the fixed string is that I have to create a different case for each question and with >20 questions it becomes too messy. Commented Jun 23, 2023 at 12:51
  • It's only the start of the string that is fixed. See the docs linked above: startsWith() is equivalent to but much faster than substring(x, 1, nchar(prefix)) == prefix Commented Jun 23, 2023 at 12:52

2 Answers

3

Without the data I can't test, but you should be able to use mutate(across(...)) with case_when() to do this. Note that since "I am certain that" will also match "I am certain that it is not possible", you need to do the latter first so that the search for "I am certain" only catches the positive cases.

survey %>% 
  mutate(across(starts_with("Q"), 
                ~case_when(
                  grepl("I am certain that it is not possible that", .x) ~ 5,
                  grepl("I am certain that", .x) ~ 1, 
                  grepl("I think it is possible that", .x) ~ 2, 
                  grepl("I don't know if is possible that", .x) ~ 3, 
                  grepl("I think it is not possible that", .x) ~ 4,
                  grepl("It is impossible for me to know if", .x) ~ 6)))
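
The `startsWith()` variant suggested in the comments could look roughly like the sketch below. The `toy` data frame is a hypothetical stand-in for the survey, and the same ordering caveat applies: the longer "I am certain that it is not possible that" prefix has to be tested before "I am certain that".

```r
library(dplyr)

# Hypothetical stand-in for the survey data
toy <- data.frame(
  Q1 = c("I am certain that topic 1",
         "I am certain that it is not possible that topic 1"),
  Q2 = c("I think it is possible that topic 2",
         "It is impossible for me to know if topic 2"),
  stringsAsFactors = FALSE
)

recoded <- toy %>%
  mutate(across(starts_with("Q"),
                ~ case_when(
                  # longer prefix first, otherwise it would be coded as 1
                  startsWith(.x, "I am certain that it is not possible that") ~ 5,
                  startsWith(.x, "I am certain that") ~ 1,
                  startsWith(.x, "I think it is possible that") ~ 2,
                  startsWith(.x, "I don't know if is possible that") ~ 3,
                  startsWith(.x, "I think it is not possible that") ~ 4,
                  startsWith(.x, "It is impossible for me to know if") ~ 6)))

recoded
```

Because only the prefix is compared, the same call covers every question regardless of which topic follows.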

1

Another option is to first cut the "topic X" off the end of each string and then recode all variables in one go with recode():

library(dplyr)
library(stringr)


recode_vec <- setNames(as.character(1:6), possible_answers)

survey |> 
  mutate(across(starts_with("Q"),
                \(x) {
                  str_replace_all(x,
                                  "(.*)\\stopic\\s\\d+$",
                                  "\\1") |> 
                  recode(!!! recode_vec)
                }
                )
         )
#>    Q1 Q2 Q3 Q4 Q5
#> 1   6  6  4  3  1
#> 2   3  6  3  4  1
#> 3   2  2  1  1  5
#> 4   4  1  6  1  2
#> 5   2  6  5  4  2
#> 6   5  6  4  3  3
#> 7   3  1  2  6  3
#> 8   2  4  6  1  5
#> 9   6  4  2  5  3
#> 10  3  2  4  3  1
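
Since `recode_vec` holds character values, the columns above come back as character. If integer codes are wanted, one possible base R alternative is to strip the suffix and look up each answer's position in `possible_answers` with `match()`. This is only a sketch: `to_code()` and the `toy` data frame are hypothetical names introduced for illustration, and the pattern assumes every answer ends in "topic" followed by a number.

```r
possible_answers <- c(
  "I am certain that", "I think it is possible that",
  "I don't know if is possible that", "I think it is not possible that",
  "I am certain that it is not possible that", "It is impossible for me to know if"
)

# Hypothetical helper: drop the trailing " topic <number>" and
# return the position of the remaining text in possible_answers
to_code <- function(x) {
  prefix <- sub("[[:space:]]+topic[[:space:]]+[0-9]+$", "", x)
  match(prefix, possible_answers)
}

# Hypothetical stand-in for the survey, including a two-digit topic
toy <- data.frame(
  Q1 = c("I am certain that topic 1",
         "I don't know if is possible that topic 1"),
  Q12 = c("I think it is possible that topic 12",
          "It is impossible for me to know if topic 12"),
  stringsAsFactors = FALSE
)

# Apply the helper to every column while keeping the data frame shape
toy[] <- lapply(toy, to_code)
toy
```

Because the whole prefix is matched exactly, the order of the answers does not matter here, and an answer that is not in `possible_answers` becomes `NA`.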

Data from OP

set.seed(12345)

possible_answers <- c(
  "I am certain that", "I think it is possible that",
  "I don't know if is possible that", "I think it is not possible that",
  "I am certain that it is not possible that", "It is impossible for me to know if"
)

num_answers <- 10
survey <- data.frame(
  Q1 = paste(
    sample(possible_answers, num_answers, replace = TRUE),
    "topic 1"
  ),
  Q2 = paste(
    sample(possible_answers, num_answers, replace = TRUE),
    "topic 2"
  ),
  Q3 = paste(
    sample(possible_answers, num_answers, replace = TRUE),
    "topic 3"
  ),
  Q4 = paste(
    sample(possible_answers, num_answers, replace = TRUE),
    "topic 4"
  ),
  Q5 = paste(
    sample(possible_answers, num_answers, replace = TRUE),
    "topic 5"
  )
)

Created on 2023-06-23 by the reprex package (v2.0.1)
