Here is some test code, and I don't understand why it is not working as expected. It is a ggplot2 question, not an R one.

library(ggplot2)

K = 10

x <- 1:100/100
y <- sapply (x, FUN= function(x) 1+x)
xy <- data.frame(x,y)

set.seed(1234)
xy$yrand <- xy$y + runif(100,min = -0.35, max = 0.5)

folds <- cut(seq(1, nrow(xy)), breaks = K, labels = FALSE)

p1 <- ggplot(xy, aes(x = xy$x, y = xy$yrand))+geom_point() +ggtitle ("Simple 
x vs y plot with added random noise") + xlab("X") + ylab("Y")

for(i in 1:K){
  # Segment your data by fold using the which() function
  testIndexes <- which(folds==i,arr.ind=TRUE)
  testData <- xy[testIndexes, ]
  trainData <- xy[-testIndexes, ]

  lmTemp <- lm(yrand ~ x, data = trainData)

  p1 <- p1 + geom_line(data = trainData, aes(x = trainData$x, y = predict(lmTemp, newdata = trainData)))

 }

p1

What I would like to see is a plot with 10 lines (the regression lines), but I only see one. Can someone help me out? Is it the ggplot2 syntax that is wrong?

Thanks, Umberto

EDITED:

I accepted the answer I got, since it is a nice way of doing it. I just wanted to add a simpler approach that first prepares a data set for the plot I wanted to create. I think this method is slightly easier to understand if you don't have much R experience.

# Start with an empty data frame; each fold appends its fitted line to it
fitLines <- data.frame()

for (i in 1:K) {
  # Segment your data by fold using the which() function
  testIndexes <- which(folds == i, arr.ind = TRUE)
  testData <- xy[testIndexes, ]
  trainData <- xy[-testIndexes, ]

  lmTemp <- lm(yrand ~ x, data = trainData)

  # Let's build a data set for the lines
  fitLines <- rbind(fitLines,
                    data.frame(set = paste("set", i),
                               x = trainData$x,
                               y = predict(lmTemp, newdata = trainData)))
}

p1 + geom_line(data = fitLines, aes(x = x, y = y, col = set))

This gives the scatter plot with one regression line per training set, each in its own colour.

4 Comments
  • folds is not defined, so testIndexes is probably empty. Conclusion: in the loop you always use the same data set. Commented Apr 11, 2017 at 11:53
  • I just corrected it. Now folds is defined. It was a paste error, sorry. Now it should work. Commented Apr 11, 2017 at 11:54
  • You should also use stat_smooth, so you can remove the lm line. Commented Apr 11, 2017 at 11:56
  • Plus, the last line seems weird; try replacing trainData with testData. Commented Apr 11, 2017 at 11:58
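
To illustrate the stat_smooth suggestion above, here is a minimal sketch that lets ggplot2 fit one line per fold instead of calling lm() by hand. It assumes the same xy and folds as in the question; trainAll and heldOut are illustrative names that do not appear in the original code.

```r
library(ggplot2)

K <- 10
x <- 1:100 / 100
xy <- data.frame(x = x, y = 1 + x)
set.seed(1234)
xy$yrand <- xy$y + runif(100, min = -0.35, max = 0.5)
folds <- cut(seq(1, nrow(xy)), breaks = K, labels = FALSE)

# Stack the K training sets, tagging each row with the fold that was held out
trainAll <- do.call(rbind, lapply(1:K, function(i) {
  cbind(xy[folds != i, ], heldOut = factor(i, levels = 1:K))
}))

# stat_smooth() fits a separate linear model for every level of heldOut
p <- ggplot(xy, aes(x = x, y = yrand)) +
  geom_point() +
  stat_smooth(data = trainAll, aes(colour = heldOut),
              method = "lm", formula = y ~ x, se = FALSE)
p
```

Because the grouping comes from the colour aesthetic, no loop over layers is needed; the trade-off is that the fitted models themselves are not kept for later use.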

1 Answer

You could use the crossv_kfold() function from the modelr package and put your complete code into a pipe workflow:

library(modelr)
library(tidyverse)

x <- 1:100 / 100
y <- sapply(x, FUN = function(x) 1 + x)
xy <- data.frame(x, y)
set.seed(1234)
xy$yrand <- xy$y + runif(100, min = -0.35, max = 0.5)

xy %>%
  crossv_kfold(k = 10) %>%  # k = 10 matches K in the question; the default is k = 5
  mutate(
    models = map(train, ~ lm(yrand ~ x, data = .x)),
    # add_predictions() adds a column "pred" next to x and yrand of each test row,
    # so predictions stay aligned with their own x values after unnesting
    predictions = map2(models, test, ~ add_predictions(as.data.frame(.y), .x))
  ) %>%
  select(.id, predictions) %>%
  unnest(predictions) %>%
  ggplot(aes(x = x, y = pred)) +
  stat_smooth(aes(colour = .id), method = "lm", se = FALSE) +
  geom_point(aes(y = yrand))

Putting the colour aesthetic inside the ggplot() call also maps the points to the groups:

xy %>%
  crossv_kfold(k = 10) %>%  # k = 10 matches K in the question; the default is k = 5
  mutate(
    models = map(train, ~ lm(yrand ~ x, data = .x)),
    # add_predictions() adds a column "pred" next to x and yrand of each test row
    predictions = map2(models, test, ~ add_predictions(as.data.frame(.y), .x))
  ) %>%
  select(.id, predictions) %>%
  unnest(predictions) %>%
  ggplot(aes(x = x, y = pred, colour = .id)) +
  stat_smooth(method = "lm", se = FALSE) +
  geom_point(aes(y = yrand))
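
To see what crossv_kfold() hands to the rest of the pipe, it can help to inspect its result on its own. This is a small sketch assuming the same xy as in the question; cv is an illustrative name.

```r
library(modelr)

set.seed(1234)
xy <- data.frame(x = 1:100 / 100)
xy$yrand <- 1 + xy$x + runif(100, min = -0.35, max = 0.5)

# One row per fold; train and test are list columns of resample objects
cv <- crossv_kfold(xy, k = 10)
names(cv)

# A resample object only stores row indices; as.data.frame() materialises the rows
firstTest <- as.data.frame(cv$test[[1]])
firstTrain <- as.data.frame(cv$train[[1]])
nrow(firstTest)
nrow(firstTrain)
```

Each row of xy lands in exactly one test set, and the rows are assigned to folds at random, which is why the x values have to travel together with the predictions.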

4 Comments

That is very nice. I will check that immediately. But could you explain why my version is not working? Thanks.
I guess you are always replacing the old layer of geom_line, because you don't map your line geom to a grouping factor. Try adding a plot(p1) at the end of the body of the for loop. Furthermore, you don't use testData anywhere in your loop.
Hi, thanks. I will try that. Regarding testData, I know. What I put here is just a part of the code I am using; in my real application I also use testData. I tried adding a plot(p1) in my code, and I get all the different plots. You are probably right that I am replacing the layer without adding new ones. I probably need to organise the data in such a way that I can use groups, somewhat as you have done. It is just difficult, since I want to use cross-validation and always use different parts of the data set...
crossv_kfold() does what you want, I guess. It splits your data into k test-training partitions.
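
A likely explanation for the single line, offered here as an assumption rather than something confirmed in the thread: aes() does not evaluate its arguments when the layer is added. ggplot2 evaluates them when the plot is drawn, and by then trainData and lmTemp hold the values from the last iteration, so all K layers draw the same line on top of each other. Below is a minimal sketch of a loop that avoids this by computing the fitted values immediately and storing them in each layer's own data frame (lineData and yfit are illustrative names).

```r
library(ggplot2)

K <- 10
x <- 1:100 / 100
xy <- data.frame(x = x, y = 1 + x)
set.seed(1234)
xy$yrand <- xy$y + runif(100, min = -0.35, max = 0.5)
folds <- cut(seq(1, nrow(xy)), breaks = K, labels = FALSE)

p1 <- ggplot(xy, aes(x = x, y = yrand)) + geom_point()

for (i in 1:K) {
  trainData <- xy[folds != i, ]
  lmTemp <- lm(yrand ~ x, data = trainData)

  # predict() runs here, inside the loop, so each layer keeps its own fitted values
  lineData <- data.frame(x = trainData$x,
                         yfit = predict(lmTemp, newdata = trainData))

  # aes() refers only to column names of the data passed to this layer
  p1 <- p1 + geom_line(data = lineData, aes(x = x, y = yfit))
}

p1
```

The data argument is stored with the layer at the moment it is added, while the expressions inside aes() are looked up later, so keeping aes() to plain column names makes each layer independent of the loop variables.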
