
I am processing data using Random Forest, and I am trying to create random artificial gaps in my dataset so that I can test how accurate the random forest predictions are.

TIMESTAMP <- c(2001:2020)
ch4_flux <- c(67.36, 66.39, 65.39, 64.41, 63.52, 62.76, 62.16,61.76, 61.54,61.53,61.7,62.05,62.52, 63.09, 63.71, 64.33, 64.92, 65.46, 65.93, 66.32)
ch4_flux_gaps <- c(67.36, 66.39, 65.39, 64.41, 63.52, 62.76, 62.16,61.76, 61.54,61.53,61.7,62.05,62.52, 63.09, 63.71, 64.33, 64.92, 65.46, 65.93, 66.32)
distance <- c(1000,1000,1000,125.35,1000,1000,1000,5.50,1000,1000,1000,1000, 1000,1000,179.65,1000,1000,1000,1000,1000)
CowNum <- c(0, 0, 0, 30, 0, 0, 0, 81, 0, 0, 0, 0, 0, 0, 127, 0, 0, 0, 0, 0)
dd <- data.frame(TIMESTAMP, ch4_flux, ch4_flux_gaps, distance,CowNum)

In the above example data, ch4_flux and ch4_flux_gaps are identical columns because I will be making gaps in only the ch4_flux_gaps column and then comparing them. I'm hoping to add gaps to 5-10% of the rows. I have seen information about how to add an entire row that is a gap, but not how to target only one column and have the gaps be random.

I am hoping that the ch4_flux_gaps column will look something like this afterwards:

ch4_flux_gaps <- c(67.36, 66.39, 65.39, NA, 63.52, NA, 62.16,61.76, 61.54,61.53,61.7,62.05,NA, 63.09, 63.71, 64.33, 64.92, 65.46, NA, NA)
    dd$ch4_flux_gaps <- ifelse(runif(nrow(dd)) < 0.05, NA, dd$ch4_flux) or similar? Adjust the 0.05 to give the proportion of missing values you want. data.frames are rectangular arrays, so you can't "add" rows to just one column. Here it's probably easier to "remove" values from one column (replacing them with NAs), because otherwise you have to define how to duplicate values in the other columns... This is almost certainly a duplicate. Commented Mar 17 at 15:41

1 Answer

The package {messy}, by Nicola Rennie, features a make_missing() function that allows you to randomly add missing values to a column, specifying a percentage of the rows to modify:

dd2 <- dd |> 
  messy::make_missing(cols = "ch4_flux_gaps", messiness = 0.3)

# > dd2$ch4_flux_gaps
# [1] 67.36 66.39 65.39 NA 63.52 62.76 NA 61.76 61.54 NA NA NA NA 63.09 63.71 NA 64.92 NA 65.93 66.32
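
If adding a package is not an option, a base R sketch along the following lines may also work. It assumes a target gap fraction of 10% (the question asks for 5-10%) and uses sample() to pick an exact number of rows, so the number of NAs is fixed rather than varying from run to run. The names gap_fraction, n_gaps and gap_rows are illustrative, and the seed is arbitrary.

```r
# Rebuild the relevant columns of the question's example data.
ch4_flux <- c(67.36, 66.39, 65.39, 64.41, 63.52, 62.76, 62.16, 61.76, 61.54, 61.53,
              61.7, 62.05, 62.52, 63.09, 63.71, 64.33, 64.92, 65.46, 65.93, 66.32)
dd <- data.frame(TIMESTAMP = 2001:2020,
                 ch4_flux = ch4_flux,
                 ch4_flux_gaps = ch4_flux)

set.seed(42)  # arbitrary seed, only so the gaps are reproducible

gap_fraction <- 0.10                                 # assumed target share of gaps
n_gaps <- max(1, round(gap_fraction * nrow(dd)))     # number of rows to blank out
gap_rows <- sample(nrow(dd), size = n_gaps)          # row indices drawn without replacement

# Only ch4_flux_gaps is modified; ch4_flux keeps the true values for comparison.
dd$ch4_flux_gaps[gap_rows] <- NA

sum(is.na(dd$ch4_flux_gaps))  # equals n_gaps
```

Because ch4_flux is left untouched, the rows in gap_rows can later be used to compare the random forest predictions against the original values.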