Multiple Regression lines in ggplot2

Question

Here is some test code, and I don't understand why it is not working as expected. It is a ggplot2 question, not an R one.

library(ggplot2)

K = 10

x <- 1:100/100
y <- sapply (x, FUN= function(x) 1+x)
xy <- data.frame(x,y)

set.seed(1234)
xy$yrand <- xy$y + runif(100,min = -0.35, max = 0.5)

folds <- cut(seq(1, nrow(xy)), breaks = K, labels = FALSE)

p1 <- ggplot(xy, aes(x = xy$x, y = xy$yrand))+geom_point() +ggtitle ("Simple 
x vs y plot with added random noise") + xlab("X") + ylab("Y")

for(i in 1:K){
  # Segment your data by fold using the which() function
  testIndexes <- which(folds==i,arr.ind=TRUE)
  testData <- xy[testIndexes, ]
  trainData <- xy[-testIndexes, ]

  lmTemp <- lm(yrand ~ x, data = trainData)

  p1 <- p1 + geom_line(data = trainData, aes(x = trainData$x, y = predict(lmTemp, newdata = trainData)))

 }

p1

What I would like to see is a plot with 10 lines (the regression lines), but I only see one. Can someone help me out? Is it the ggplot2 syntax that is wrong?

Thanks, Umberto

EDITED:

I accepted the answer I got since it is a nice way of doing it. I just wanted to add a simpler approach: first prepare a data set for the lines, then plot it. I think this method is slightly easier to understand if you don't have much R experience.

fitLines <- NULL  # start empty so that rbind() has something to append to

for(i in 1:K){
  # Segment your data by fold using the which() function
  testIndexes <- which(folds==i,arr.ind=TRUE)
  testData <- xy[testIndexes, ]
  trainData <- xy[-testIndexes, ]

  lmTemp <- lm(yrand ~ x, data = trainData)

  # Let's build a data set for the lines: one block of rows per fold
  fitLines <- rbind(fitLines, data.frame(set = paste("set", i), x = trainData$x, y = predict(lmTemp, newdata = trainData)))
}

# One line per set, coloured by set
p1 + geom_line(data = fitLines, aes(x = x, y = y, col = set))

And this is what you get: one regression line per set, each in its own colour.
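The same data set for the lines can also be built without growing a data frame inside a loop. The following is only a sketch of that variant, assuming the same K, xy and folds as in the question; the use of lapply() with do.call(rbind, ...) is a choice made here, not something from the original post.

```r
K <- 10
x <- 1:100 / 100
xy <- data.frame(x = x, y = 1 + x)
set.seed(1234)
xy$yrand <- xy$y + runif(100, min = -0.35, max = 0.5)
folds <- cut(seq_len(nrow(xy)), breaks = K, labels = FALSE)

# Build one small data frame per fold, then stack them all at once
fitLines <- do.call(rbind, lapply(seq_len(K), function(i) {
  trainData <- xy[folds != i, ]               # everything except fold i
  lmTemp <- lm(yrand ~ x, data = trainData)   # fit on the training part
  data.frame(set = paste("set", i),
             x = trainData$x,
             y = predict(lmTemp, newdata = trainData))
}))
```

The resulting fitLines has the columns set, x and y, so the geom_line() call shown above can be used with it unchanged.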

folds is not defined, so testIndexes is probably empty. Conclusion: in the loop you always use the same data set.
– Mamoun Benghezal, Commented Apr 11, 2017 at 11:53
I just corrected it. Now folds is defined. It was a paste error... Sorry. Now it should work.
– Umberto, Commented Apr 11, 2017 at 11:54
You should also use stat_smooth, so you can remove the lm line.
– Mamoun Benghezal, Commented Apr 11, 2017 at 11:56
Plus, the last line seems weird; try replacing trainData with testData.
– Mamoun Benghezal, Commented Apr 11, 2017 at 11:58

Daniel · Accepted Answer · 2017-04-11 12:29:40Z

3

You could use the crossv_kfold() function from the modelr package, and put your complete code into a "pipe workflow":

library(modelr)
library(tidyverse)

x <- 1:100/100
y <- sapply (x, FUN= function(x) 1+x)
xy <- data.frame(x,y)
set.seed(1234)
xy$yrand <- xy$y + runif(100,min = -0.35, max = 0.5)

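xy %>% 
  crossv_kfold(k = 10) %>%   # 10 folds as in the question (the default is k = 5)
  mutate(
    models = map(train, ~ lm(yrand ~ x, data = .x)),
    # keep x and yrand next to each prediction: the folds are assigned at random,
    # so binding the predictions back onto xy by position would misalign the rows
    fits = map2(models, test, ~ data.frame(as.data.frame(.y), predictions = predict(.x, newdata = .y)))
  ) %>% 
  select(.id, fits) %>% 
  unnest(fits) %>% 
  ggplot(aes(x = x, y = predictions)) +
  stat_smooth(aes(colour = .id), method = "lm", se = FALSE) +
  geom_point(aes(y = yrand))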

Putting the colour aesthetic inside the ggplot() call also maps the points to the groups:

xy %>% 
  crossv_kfold(k = 10) %>% 
  mutate(
    models = map(train, ~ lm(yrand ~ x, data = .x)),
    fits = map2(models, test, ~ data.frame(as.data.frame(.y), predictions = predict(.x, newdata = .y)))
  ) %>% 
  select(.id, fits) %>% 
  unnest(fits) %>% 
  ggplot(aes(x = x, y = predictions, colour = .id)) +
  stat_smooth(method = "lm", se = FALSE) +
  geom_point(aes(y = yrand))
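To see what crossv_kfold() hands to the rest of the pipe, it can help to inspect its result on its own. This is a small sketch, assuming the modelr package is installed; the object name cv is made up for illustration.

```r
library(modelr)

xy <- data.frame(x = 1:100 / 100)
xy$y <- 1 + xy$x

set.seed(1234)
cv <- crossv_kfold(xy, k = 10)

names(cv)                            # train, test and the fold id column
nrow(cv)                             # one row per fold
nrow(as.data.frame(cv$train[[1]]))   # rows used for fitting in the first fold
nrow(as.data.frame(cv$test[[1]]))    # rows held out in the first fold
```

The train and test columns hold resample objects, which only point at rows of the original data; as.data.frame() turns one of them into an ordinary data frame.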


4 Comments

Umberto Over a year ago

That is very nice. I will check that immediately. But could you explain why my version is not working? Thanks.

Daniel Over a year ago

I guess you are always replacing the old geom_line layer, because you don't map your line geom to a grouping factor. Try adding a plot(p1) at the end of the body of your for loop. Furthermore, you don't use testData anywhere in your loop.

Umberto Over a year ago

Hi, thanks. I will try that. Regarding testData, I know: what I put here is just a part of the code I am using, and in my real application I also use testData. I tried adding a plot(p1) in my code, and I get all the different plots. You are probably right when you say that I am replacing the layer without adding new ones. I probably need to organise the data in such a way that I can use groups, somewhat like you have done. It is just difficult since I want to use cross-validation and always use different parts of the data set...

Daniel Over a year ago

crossv_kfold() does what you want, I guess. It splits your data into k test-training partitions.
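A further possible explanation for the single line, offered here as a sketch and not taken from the thread: aes() stores its expressions unevaluated, and ggplot2 evaluates them only when the plot is drawn. In the question's loop, trainData$x and predict(lmTemp, ...) are therefore evaluated after the loop has finished, so all ten layers would use the last trainData and lmTemp and draw the same line on top of each other. Storing the fitted values as a column and referring to columns by name avoids this, because each layer keeps its own copy of its data. The column name pred is made up for illustration.

```r
library(ggplot2)

K <- 10
x <- 1:100 / 100
xy <- data.frame(x = x, y = 1 + x)
set.seed(1234)
xy$yrand <- xy$y + runif(100, min = -0.35, max = 0.5)
folds <- cut(seq_len(nrow(xy)), breaks = K, labels = FALSE)

p1 <- ggplot(xy, aes(x = x, y = yrand)) + geom_point()

for (i in 1:K) {
  trainData <- xy[folds != i, ]
  lmTemp <- lm(yrand ~ x, data = trainData)
  # Compute the fitted values now and keep them with this layer's data
  trainData$pred <- predict(lmTemp, newdata = trainData)
  # aes() refers only to column names, which are looked up in the layer's own data
  p1 <- p1 + geom_line(data = trainData, aes(x = x, y = pred))
}

p1
```

With this change the loop from the question keeps its structure and each fold contributes its own line; adding a colour mapping still needs a grouping column, as in the other approaches.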
