3

I would like to check that a certain set of values is present in every group of my dataset, and output a dataset listing every value I am checking for and whether it exists in each group. How do I do this? For example, using R's iris dataset, say I want to check whether each species contains the petal lengths 1, 3, and 4. I have tried the dplyr summarise() call below, but it collapses the result to a single TRUE/FALSE per group; I would like to know whether each individual value is present.

# load packages and example data
library(dplyr)
data(iris)

# preview data
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa


# what I want
     Species Petal.Length value_is_present
     setosa            1                Y
 versicolor            1                N
  virginica            1                N
     setosa            3                N
 versicolor            3                Y
  virginica            3                Y
     setosa            4                N
 versicolor            4                N
  virginica            4                N

# what I tried:
expected_values <- c(1, 3, 4)

# Check if all expected values exist within each group
result <- iris %>%
  group_by(Species) %>%
  summarise(
    all_values_present = all(expected_values %in% Petal.Length)
  ) %>%
  ungroup()

> print(result)
# A tibble: 3 × 2
  Species    all_values_present
  <fct>      <lgl>             
1 setosa     FALSE             
2 versicolor FALSE             
3 virginica  FALSE  





Edit: I made some typos in my desired output, but it seems everyone got the picture. Thanks!

2
  • I don't understand your desired output. Why is setosa and length 1 repeated 3 times, once with Y and twice with N? Commented Oct 23 at 20:34
  • @MrFlick oops, it was just example output I made up, setosa should have length 1,3, and 4, not 1 repeated three times. Commented Oct 24 at 18:02

5 Answers

4

Base R idea:

expand.grid(Species=levels(iris$Species), Petal.Length=c(1, 3, 4)) |>
  sort_by(~Species) |> # cosmetics; sort_by() needs R >= 4.4
  ( \(.) transform(., present = do.call('paste0', .) %in% do.call(
    'paste0', subset(iris, select=c(Species, Petal.Length)))) )()

-output

     Species Petal.Length present
1     setosa            1    TRUE
4     setosa            3   FALSE
7     setosa            4   FALSE
2 versicolor            1   FALSE
5 versicolor            3    TRUE
8 versicolor            4    TRUE
3  virginica            1   FALSE
6  virginica            3   FALSE
9  virginica            4   FALSE

NOTE. We are almost always better off keeping the result as a Boolean variable (TRUE/FALSE), or as 1/0, which can be obtained with +(...), rather than as "Y"/"N" strings. It simplifies further analysis a lot. A quick demonstration:

# <...> |>
  aggregate(present~Species, data=_, all)

-output

     Species present
1     setosa   FALSE
2 versicolor   FALSE
3  virginica   FALSE
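
As a small sketch of the +(...) remark above (the names grid, observed, present01 and found_per_species are made up for illustration and are not from the answer), the logical column can be turned into 1/0 and then summed per species to see how many of the expected values were found:

```r
# Build the Species x expected-value grid (illustrative names, base R only)
grid <- expand.grid(Species = levels(iris$Species), Petal.Length = c(1, 3, 4))

# A pair is present if the same "Species value" key occurs in iris
observed <- paste(iris$Species, iris$Petal.Length)
grid$present <- paste(grid$Species, grid$Petal.Length) %in% observed

# Unary plus coerces TRUE/FALSE to 1/0
grid$present01 <- +grid$present

# With 1/0, counting the expected values found per species is a plain sum
found_per_species <- aggregate(present01 ~ Species, data = grid, FUN = sum)
found_per_species
```

The same aggregate() call with FUN = mean would give the share of expected values found in each group.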

3

I think you can get what you want by counting how often each expected value occurs per group, and then converting those counts to Y/N values. How about:

iris %>% 
  filter(Petal.Length %in% expected_values) %>% 
  mutate(Petal.Length=factor(Petal.Length)) %>% 
  count(Species, Petal.Length, .drop=FALSE) %>% 
  mutate(value_is_present = if_else(n>0, "Y", "N"), n=NULL)

which returns

     Species Petal.Length value_is_present
1     setosa            1                Y
2     setosa            3                N
3     setosa            4                N
4 versicolor            1                N
5 versicolor            3                Y
6 versicolor            4                Y
7  virginica            1                N
8  virginica            3                N
9  virginica            4                N

1 Comment

I find this pattern easiest to write on the fly and find dplyr::count() very useful. If you specify factor levels like this mutate(Petal.Length = factor(Petal.Length, levels = expected_values)), I believe this will protect against the case where a value in expected_values does not appear in the dataset at all.
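
A minimal sketch of the suggestion in this comment, assuming an extra expected value of 2 is added purely for illustration (2 does not occur as a petal length in iris). Without explicit levels the value 2 would silently vanish from the output; with levels = expected_values it should show up as "N" for every species:

```r
library(dplyr, warn.conflicts = FALSE)

# 2 is an assumed extra value that never occurs in iris$Petal.Length
expected_values <- c(1, 2, 3, 4)

result <- iris %>%
  filter(Petal.Length %in% expected_values) %>%
  # Fix the levels up front so values absent from the data are still counted
  mutate(Petal.Length = factor(Petal.Length, levels = expected_values)) %>%
  count(Species, Petal.Length, .drop = FALSE) %>%
  mutate(value_is_present = if_else(n > 0, "Y", "N"), n = NULL)

result
```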
2

Looping over the expected_values per group using purrr::map, then spreading the resulting list into columns with tidyr::unnest_wider:

library(dplyr)
library(tidyr)

expected_values <- c(1, 3, 4)

iris %>% 
  reframe(value = purrr::map(expected_values, ~ 
            list(Petal.Length = .x, 
                 value_is_present = c("N","Y")[any(.x == Petal.Length) + 1])),
          .by = Species) %>% 
  unnest_wider(value)

output

# A tibble: 9 × 3
  Species    Petal.Length value_is_present
  <fct>             <dbl> <chr>        
1 setosa                1 Y            
2 setosa                3 N            
3 setosa                4 N            
4 versicolor            1 N            
5 versicolor            3 Y            
6 versicolor            4 Y            
7 virginica             1 N            
8 virginica             3 N            
9 virginica             4 N 
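
For comparison, the same per-group loop can be sketched in base R without purrr. This is an illustrative alternative, not part of the answer; the names presence and long are made up:

```r
expected_values <- c(1, 3, 4)

# Split the petal lengths by species, then test every expected value
# against each group; the result is a logical matrix (values x species)
presence <- sapply(split(iris$Petal.Length, iris$Species),
                   function(x) expected_values %in% x)

# Unroll the matrix column by column into the long Y/N layout
long <- data.frame(
  Species = rep(colnames(presence), each = nrow(presence)),
  Petal.Length = rep(expected_values, times = ncol(presence)),
  value_is_present = ifelse(as.vector(presence), "Y", "N")
)

long
```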

2

Another variation. Get the distinct Species/Petal.Length values and mark present, add rows for the missing expected_values, and remove the non-expected_values.

library(dplyr)
library(tidyr)

iris |>
  distinct(Species, Petal.Length, present = TRUE) |>
  complete(Species, Petal.Length = expected_values, fill = list(present = FALSE)) |>
  filter(Petal.Length %in% expected_values)

  Species    Petal.Length present
  <fct>             <dbl> <lgl>  
1 setosa                1 TRUE   
2 setosa                3 FALSE  
3 setosa                4 FALSE  
4 versicolor            1 FALSE  
5 versicolor            3 TRUE   
6 versicolor            4 TRUE   
7 virginica             1 FALSE  
8 virginica             3 FALSE  
9 virginica             4 FALSE 

Or we could set up a table of the Species and expected values and set present FALSE, then update that table with a version of iris where present is TRUE:

iris |>
  reframe(Petal.Length = expected_values, present = FALSE, .by = Species) |>
  rows_update(iris |> distinct(Species, Petal.Length) |> mutate(present = TRUE), 
              by = c("Species", "Petal.Length"), unmatched = "ignore")
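
No output is shown for the second variant, so here is a hedged sketch that runs both variants side by side and checks that they agree. The names via_complete, via_update and same are made up for illustration:

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

expected_values <- c(1, 3, 4)

# Variant 1: distinct pairs, completed with the expected values
via_complete <- iris |>
  distinct(Species, Petal.Length) |>
  mutate(present = TRUE) |>
  complete(Species, Petal.Length = expected_values, fill = list(present = FALSE)) |>
  filter(Petal.Length %in% expected_values)

# Variant 2: start from an all-FALSE table and update the pairs that exist
via_update <- iris |>
  reframe(Petal.Length = expected_values, present = FALSE, .by = Species) |>
  rows_update(iris |> distinct(Species, Petal.Length) |> mutate(present = TRUE),
              by = c("Species", "Petal.Length"), unmatched = "ignore")

# Both should flag the same Species/Petal.Length pairs
same <- identical(via_complete$present, via_update$present)
same
```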

1

Yet another alternative: build a data.frame with your expected values and join your actual values to it.

In the code below, starting from the expected values:

  • we join to the actual data. If a group-value pair doesn't exist in the data to check, "value_is_present" will be NA for that row.
  • if a group-value pair exists several times, the same row ends up duplicated, so we call distinct() to drop the duplicates.
  • we replace the NA generated in the first step with "N".

library(dplyr, warn.conflicts = FALSE)

data_to_check <- iris |> 
  select(Species, Petal.Length) |> 
  mutate(value_is_present = "Y")

expected_values <- expand.grid(unique(iris$Species), c(1, 3, 4)) |> 
  arrange(Var1, Var2)
names(expected_values) <- c("Species", "Petal.Length")

expected_values |> 
  left_join(data_to_check, join_by(Species, Petal.Length)) |> 
  distinct() |> 
  mutate(value_is_present = if_else(is.na(value_is_present), "N", value_is_present))
#>      Species Petal.Length value_is_present
#> 1     setosa            1                Y
#> 2     setosa            3                N
#> 3     setosa            4                N
#> 4 versicolor            1                N
#> 5 versicolor            3                Y
#> 6 versicolor            4                Y
#> 7  virginica            1                N
#> 8  virginica            3                N
#> 9  virginica            4                N

1 Comment

You can name your arguments as in expand.grid(Species=.., Petal.Length=c(1,3,4)), no need to rename them later.
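
A sketch of what this comment suggests, assuming unique(iris$Species) for the elided first argument (as in the answer's own code). Deduplicating with distinct() before the join and filling with coalesce() are illustrative choices, not the answer's exact steps:

```r
library(dplyr, warn.conflicts = FALSE)

# Naming the arguments gives the grid its final column names directly
expected_values <- expand.grid(
  Species = unique(iris$Species),
  Petal.Length = c(1, 3, 4)
) |>
  arrange(Species, Petal.Length)

result <- expected_values |>
  left_join(
    # One row per observed pair, flagged "Y"
    iris |> distinct(Species, Petal.Length) |> mutate(value_is_present = "Y"),
    join_by(Species, Petal.Length)
  ) |>
  # Pairs missing from iris come back as NA; mark them "N"
  mutate(value_is_present = coalesce(value_is_present, "N"))

result
```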
