Recoding multiple factors using regexp

Question

I have data from a survey, where several questions are in the format

"Do you think that [xxxxxxx]"

The possible answers to the questions are in the format

"I am certain that [xxxxxxx]" "I think it is possible that [xxxxxx]" "I don't know if [xxxxxx]"

and so on.

I would now like to recode these factors so that "I am certain" = 1, "I think it is possible" = 2 and so on. I have been playing with dplyr::recode but it does not seem to work with regular expressions.

For example:

set.seed(12345)

possible_answers <- c(
    "I am certain that", "I think it is possible that",
    "I don't know if is possible that", "I think it is not possible that",
    "I am certain that it is not possible that", "It is impossible for me to know if"
)

num_answers <- 10
survey <- data.frame(
    Q1 = paste(
        sample(possible_answers, num_answers, replace = TRUE),
        "topic 1"
    ),
    Q2 = paste(
        sample(possible_answers, num_answers, replace = TRUE),
        "topic 2"
    ),
    Q3 = paste(
        sample(possible_answers, num_answers, replace = TRUE),
        "topic 3"
    ),
    Q4 = paste(
        sample(possible_answers, num_answers, replace = TRUE),
        "topic 4"
    ),
    Q5 = paste(
        sample(possible_answers, num_answers, replace = TRUE),
        "topic 5"
    )
)

I can do something like

survey %>% 
    mutate_at("Q1", recode,
                "I am certain that topic 1" = 1,
                "I think it is possible that topic 1" = 2,
                "I don't know if is possible that topic 1" = 3,
                "I think it is not possible that topic 1" = 4,
                "I am certain that it is not possible that topic 1" = 5,
                "It is impossible for me to know if topic 1" = 6)

but doing it for all questions would be cumbersome.

I would like to do

survey %>% 
    mutate_at(vars(starts_with("Q")), recode,
                "I am certain that (.*)" = 1,
                "I think it is possible that (.*)" = 2,
                "I don't know if is possible that (.*)" = 3,
                "I think it is not possible that (.*)" = 4,
                "I am certain that it is not possible that (.*)" = 5,
                "It is impossible for me to know if (.*)" = 6)

But this changes everything to NA, because it does not see the strings as regular expressions.
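
A minimal sketch of why this happens, assuming the dplyr package is installed: `recode()` compares each value with the supplied names by exact equality, so a name such as "I am certain that (.*)" only matches a value that is literally that string. The vector `x` below is made up for illustration.

```r
library(dplyr)

# Two illustrative values: a real answer, and the literal pattern text.
x <- c("I am certain that topic 1", "I am certain that (.*)")

# recode() matches names by exact string equality, not as regular expressions.
# Unmatched values become NA (dplyr also emits a warning, silenced here).
res <- suppressWarnings(recode(x, "I am certain that (.*)" = 1))

print(res)
```

Only the second element, which is identical to the name, is recoded; the real answer is left unmatched and turns into NA.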

@SamR I am not sure how that would solve the problem, could you provide an example?
– nico, Commented Jun 23, 2023 at 12:29
Exactly the same way as the answer by Dave Armstrong, just with a fixed string rather than regex, which should be slightly faster. The important caveat about order still applies.
– SamR, Commented Jun 23, 2023 at 12:36
The problem with the fixed string is that I have to create a different case for each question and with >20 questions it becomes too messy.
– nico, Commented Jun 23, 2023 at 12:51
It's only the start of the string that is fixed. See the docs linked above: startsWith() is equivalent to but much faster than substring(x, 1, nchar(prefix)) == prefix
– SamR, Commented Jun 23, 2023 at 12:52
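
The `startsWith()` idea from the comments might look roughly like the sketch below. It assumes the same `case_when()` structure as the regex-based answer; the helper name `recode_answer` and the small `demo` data frame are hypothetical and only serve the illustration. Because only the prefix is compared, one set of conditions covers every question regardless of its topic.

```r
library(dplyr)

# Hypothetical helper: map an answer to a code based on its fixed prefix.
# The longer "certain ... not possible" prefix is tested before the shorter
# "I am certain that", which would otherwise match it too.
recode_answer <- function(x) {
  case_when(
    startsWith(x, "I am certain that it is not possible that") ~ 5,
    startsWith(x, "I am certain that") ~ 1,
    startsWith(x, "I think it is possible that") ~ 2,
    startsWith(x, "I don't know if is possible that") ~ 3,
    startsWith(x, "I think it is not possible that") ~ 4,
    startsWith(x, "It is impossible for me to know if") ~ 6
  )
}

# Tiny made-up data set with two questions on different topics.
demo <- data.frame(
  Q1 = c("I am certain that topic 1",
         "I think it is not possible that topic 1"),
  Q2 = c("I am certain that it is not possible that topic 2",
         "It is impossible for me to know if topic 2"),
  stringsAsFactors = FALSE
)

recoded <- demo %>%
  mutate(across(starts_with("Q"), recode_answer))

print(recoded)
```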

DaveArmstrong · Accepted Answer · 2023-06-23 12:32:12Z (last edited by nico)

3

Without the data I can't test, but you should be able to use mutate(across(...)) with case_when() to do this. Note that since "I am certain that" will also match "I am certain that it is not possible", you need to do the latter first so that the search for "I am certain" only catches the positive cases.

library(dplyr)

survey %>% 
  mutate(across(starts_with("Q"), 
                ~case_when(
                  # Most specific pattern first: it also contains "I am certain that"
                  grepl("I am certain that it is not possible that", .x) ~ 5,
                  grepl("I am certain that", .x) ~ 1, 
                  grepl("I think it is possible that", .x) ~ 2, 
                  grepl("I don't know if is possible that", .x) ~ 3, 
                  grepl("I think it is not possible that", .x) ~ 4,
                  grepl("It is impossible for me to know if", .x) ~ 6)))
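
The ordering caveat can be checked on a two-element vector. This is only an illustrative sketch, assuming dplyr is installed; the vector `answers` is made up. `case_when()` uses the first condition that is TRUE for each element, so the more general pattern must not come first.

```r
library(dplyr)

answers <- c(
  "I am certain that topic 1",
  "I am certain that it is not possible that topic 1"
)

# General pattern first: both strings contain "I am certain that",
# so the second condition is never reached.
wrong_order <- case_when(
  grepl("I am certain that", answers) ~ 1,
  grepl("I am certain that it is not possible that", answers) ~ 5
)

# Specific pattern first: the negative answer is caught before the
# general pattern gets a chance to match it.
right_order <- case_when(
  grepl("I am certain that it is not possible that", answers) ~ 5,
  grepl("I am certain that", answers) ~ 1
)

print(wrong_order)
print(right_order)
```

With the general pattern first, both answers receive code 1; with the specific pattern first, the second answer receives code 5.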

edited Jun 23, 2023 at 12:32

nico


answered Jun 23, 2023 at 12:21

DaveArmstrong


TimTeaFan · Accepted Answer · 2023-06-23 13:13:29Z

Another option is to first cut the "topic X" off the end of each string and then recode all variables in one go with recode():

library(dplyr)
library(stringr)


recode_vec <- setNames(as.character(1:6), possible_answers)

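survey |> 
  mutate(across(starts_with("Q"),
                \(x) {
                  # Strip the trailing " topic <number>"; \\d+ also handles
                  # question numbers with more than one digit
                  str_replace_all(x,
                                  "(.*)\\stopic\\s\\d+$",
                                  "\\1") |> 
                  # Look up the remaining answer text in the named vector
                  recode(!!! recode_vec)
                }
                )
         )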
#>    Q1 Q2 Q3 Q4 Q5
#> 1   6  6  4  3  1
#> 2   3  6  3  4  1
#> 3   2  2  1  1  5
#> 4   4  1  6  1  2
#> 5   2  6  5  4  2
#> 6   5  6  4  3  3
#> 7   3  1  2  6  3
#> 8   2  4  6  1  5
#> 9   6  4  2  5  3
#> 10  3  2  4  3  1
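
The same strip-then-look-up idea can be sketched in base R. This is a swapped-in alternative, not the code from the answer above: `sub()` removes the topic suffix and `match()` returns the position of the remaining text in `possible_answers`, which gives integer codes directly (the `recode()` version above returns character values because `recode_vec` holds strings). The `answers` vector is made up for illustration.

```r
possible_answers <- c(
  "I am certain that", "I think it is possible that",
  "I don't know if is possible that", "I think it is not possible that",
  "I am certain that it is not possible that", "It is impossible for me to know if"
)

# Illustrative answers, including a two-digit topic number.
answers <- c(
  "I think it is possible that topic 3",
  "I am certain that it is not possible that topic 12"
)

# Remove the trailing " topic <number>" from each answer.
stripped <- sub(" topic [0-9]+$", "", answers)

# The position in possible_answers serves as the integer code.
codes <- match(stripped, possible_answers)

print(codes)
```

Because the whole remaining string is matched exactly, the order of the patterns does not matter in this variant.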

Data from OP

set.seed(12345)

possible_answers <- c(
  "I am certain that", "I think it is possible that",
  "I don't know if is possible that", "I think it is not possible that",
  "I am certain that it is not possible that", "It is impossible for me to know if"
)

num_answers <- 10
survey <- data.frame(
  Q1 = paste(
    sample(possible_answers, num_answers, replace = TRUE),
    "topic 1"
  ),
  Q2 = paste(
    sample(possible_answers, num_answers, replace = TRUE),
    "topic 2"
  ),
  Q3 = paste(
    sample(possible_answers, num_answers, replace = TRUE),
    "topic 3"
  ),
  Q4 = paste(
    sample(possible_answers, num_answers, replace = TRUE),
    "topic 4"
  ),
  Q5 = paste(
    sample(possible_answers, num_answers, replace = TRUE),
    "topic 5"
  )
)

Created on 2023-06-23 by the reprex package (v2.0.1)
