R functions on workbook skipping blank cells

Question

I'm trying to write a function that edits workbooks imported via loadWorkbook and have a problem where blank cells get ignored rather than treated as NAs. The following code demonstrates this issue.

df <- data.frame(x = c(1, 2, 3, 4, NA, 6), y = c(7, 8, 9, 10, NA, 12))
xlsx::write.xlsx(df, file = 'testdf.xlsx', showNA = FALSE, row.names = FALSE)
testwb <- xlsx::loadWorkbook(file = 'testdf.xlsx')

sheetnames <- c('Sheet1')
sheets <- lapply(sheetnames, function(name) xlsx::getSheets(testwb)[[name]])
rows <- xlsx::getRows(sheets[[1]], rowIndex = 1:7)
cells <- xlsx::getCells(rows, colIndex = 1:2)
values <- sapply(unlist(cells), xlsx::getCellValue)
values
length(values)

When I run it as is I get what I expected, with NAs included:

 1.1  1.2  2.1  2.2  3.1  3.2  4.1  4.2  5.1  5.2  6.1  6.2  7.1  7.2
 "x"  "y"  "1"  "7"  "2"  "8"  "3"  "9"  "4" "10"   NA   NA  "6" "12"

However, if I open the Excel file, manually delete the first row of data, and then run the code again, I get this:

 1.1  1.2  3.1  3.2  4.1  4.2  5.1  5.2  7.1  7.2
 "x"  "y"  "2"  "8"  "3"  "9"  "4" "10"  "6" "12"

This causes problems for me because when doing comparisons between two sheets they end up with different numbers of values if one has more blank cells than the other. How do I force it to include NA values?

I tried manually rebuilding a copy of the workbook so I could use the keepNA argument in writeData like this:

testwb2 <- openxlsx::createWorkbook()
addWorksheet(testwb2, 'Sheet1')
writeData(testwb2, 'Sheet1', read_xlsx('testdf.xlsx', sheet = 1), keepNA = TRUE)

sheetnames <- c('Sheet1')
sheets <- lapply(sheetnames, function(name) getSheets(testwb2)[[name]])
rows <- getRows(sheets[[1]], rowIndex=1:7)
cells <- getCells(rows, colIndex=1:2)
values <- sapply(unlist(cells), getCellValue)
values
length(values)

but then I get the error

Error in envRefInferField(x, what, getClass(class(x)), selfEnv) : ‘getNumberOfSheets’ is not a valid field or method name for reference class “Workbook”

which I also don't understand. Is there a simple way to handle this?

Could you provide some example data/make this question reproducible? Thanks and good luck!
– jpsmith, Commented Sep 22, 2023 at 14:49
I reorganized my code at the beginning to make it more easily reproducible. However, I can't include code for manually deleting the first row of data because the issue only occurs when I do that in Excel; if, for example, I create a new dataframe df2 <- data.frame(x = c(NA, 2, 3, 4, NA, 6), y = c(NA, 8, 9, 10, NA, 12)) to mimic the deletion then the code works as I want it to. So I'm wondering if Excel changes NAs somehow when you delete cells there, and if there's some way to force R to recognize the blank cells afterward. Is there anything else I should provide?
– js4032, Commented Sep 22, 2023 at 15:21

Jan Marvin · Accepted Answer · 2023-09-23 13:24:04Z

The question is a little confusing, but the problem has probably been answered a few times already. Here's what I think is happening: when you delete the first row of data, your spreadsheet software does some housecleaning and drops the now-empty row entirely, so it is no longer part of the XML structure in the xlsx file you wrote. The xlsx package then returns only the cells that are actually present in the workbook.
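If the cell-level xlsx functions have to stay, one possible workaround is to pad the result after reading. The following is a minimal sketch under stated assumptions, not something the xlsx package provides: `pad_cells()` is a hypothetical helper, it relies on the row.column names seen in the question's output (such as "1.1"), and it assumes `values` is a named atomic vector like the one `sapply()` returns there. The sample vector is typed in by hand to mirror the question's second output.

```r
# Hypothetical helper, not part of the xlsx package: rebuild the full
# row.column grid and look each key up by name, so absent cells become NA.
pad_cells <- function(values, rows, cols) {
  # Keys in row-major order: "1.1", "1.2", "2.1", ...
  keys <- as.vector(t(outer(rows, cols, paste, sep = ".")))
  padded <- values[keys]   # names that are missing from `values` yield NA
  names(padded) <- keys
  padded
}

# Values as shown in the question after deleting a row in Excel
values <- c("1.1" = "x", "1.2" = "y", "3.1" = "2", "3.2" = "8",
            "4.1" = "3", "4.2" = "9", "5.1" = "4", "5.2" = "10",
            "7.1" = "6", "7.2" = "12")

padded <- pad_cells(values, rows = 1:7, cols = 1:2)
length(padded)
#> [1] 14
```

Because every sheet is padded to the same grid, two sheets read this way yield vectors of equal length regardless of how many blank rows each file actually stores.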

I have tried to fix similar problems with openxlsx, so I assume that xlsx behaves similarly.

Below I have created a file that reproduces the problem you described. It shows that openxlsx also behaves naughtily here, while other packages behave nicely.

df <- data.frame(
  x = c(1, 2, 3, 4, NA, 6),
  y = c(7, 8, 9, 10, NA, 12)
)

tmp <- openxlsx2::temp_xlsx()
openxlsx2::wb_workbook()$
  add_worksheet("Sheet1")$
  add_data(x = df[2:4,], dims = "A1:B4")$
  add_data(x = df[6,], dims = "A6:B6", col_names = FALSE)$
  save(tmp)

testwb <- xlsx::loadWorkbook(file = tmp)

sheetnames <- c('Sheet1')
sheets <- lapply(sheetnames, function(name) xlsx::getSheets(testwb)[[name]])
rows   <- xlsx::getRows(sheets[[1]], rowIndex=1:7)
cells  <- xlsx::getCells(rows, colIndex=1:2)
values <- sapply(unlist(cells), xlsx::getCellValue)
values
#>  1.1  1.2  2.1  2.2  3.1  3.2  4.1  4.2  6.1  6.2 
#>  "x"  "y"  "2"  "8"  "3"  "9"  "4" "10"  "6" "12"
length(values)
#> [1] 10

# naughty
openxlsx::read.xlsx(tmp)
#>   x  y
#> 1 2  8
#> 2 3  9
#> 3 4 10
#> 4 6 12

# nice
readxl::read_xlsx(tmp)
#> # A tibble: 5 × 2
#>       x     y
#>   <dbl> <dbl>
#> 1     2     8
#> 2     3     9
#> 3     4    10
#> 4    NA    NA
#> 5     6    12

# nice
openxlsx2::read_xlsx(tmp)
#>    x  y
#> 2  2  8
#> 3  3  9
#> 4  4 10
#> 5 NA NA
#> 6  6 12
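Once both sheets are read with a reader that keeps blank rows, as the readxl and openxlsx2 calls above do, the results have matching shapes and can be compared cell by cell. The sketch below illustrates one way to do that in base R while treating two blank cells as equal. `cells_equal()` is a hypothetical helper, and the two data frames are hand-written stand-ins for the reader output, not values read from a real file.

```r
# Hypothetical helper (not from any package): NA-aware, cell-by-cell equality
# for two data frames of identical dimensions.
cells_equal <- function(a, b) {
  a <- as.matrix(a)
  b <- as.matrix(b)
  stopifnot(identical(dim(a), dim(b)))
  both_na  <- is.na(a) & is.na(b)      # blank in both sheets counts as equal
  both_set <- !is.na(a) & !is.na(b)    # a value is present in both sheets
  both_na | (both_set & a == b)
}

# Stand-ins for two sheets read with blank rows preserved
sheet_a <- data.frame(x = c(2, 3, 4, NA, 6), y = c(8, 9, 10, NA, 12))
sheet_b <- data.frame(x = c(2, 3, 5, NA, 6), y = c(8, 9, 10, NA, 12))

eq <- cells_equal(sheet_a, sheet_b)
which(!eq, arr.ind = TRUE)   # positions of the cells that differ
```

Here only the third row of column `x` differs; the shared blank row is reported as equal rather than as NA.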
