Separate rows by a delimiter across two columns in R

Question

I have the following df:

df_1=data.frame(col_1=c("a;b;c","c;d","e","f","g","h;j"),col_2=c("1;2;3","4","5;6","7","8;9","10;11;12"))

I want to split col_1 into separate rows, carrying along the corresponding values of col_2 where they exist.

If col_1 and col_2 have the same number of elements, they should be split together, pairing each element of col_1 with the matching element of col_2 (row 1).
If the counts differ and one column has only one element, that single element should be repeated for each row produced from the other column (row 2).
If the counts differ and both columns have more than one element, the row should be left as-is.

Here is the expected output:

df_2=data.frame(col_1=c("a","b","c","c","d","e","e","f","g","g","h;j"),col_2=c("1","2","3","4","4","5","6","7","8","9","10;11;12"))

akrun · Accepted Answer · 2020-09-14 20:05:19Z

2

We can use cSplit from splitstackshape, with na.locf0 from zoo to fill the gaps:

library(splitstackshape)
library(zoo)

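# count elements per cell (not characters, which would miscount values like "10;11")
cnt1 <- lengths(strsplit(as.character(df_1$col_1), ";", fixed = TRUE))
cnt2 <- lengths(strsplit(as.character(df_1$col_2), ";", fixed = TRUE))
# rows with unequal counts where both sides have more than one element stay as-is
i1 <- cnt1 != cnt2 & cnt1 > 1 & cnt2 > 1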
rbind(cSplit(df_1[!i1,], c('col_1', 'col_2'), sep=";", "long")[
          !is.na(col_1)|!is.na(col_2), lapply(.SD, na.locf0)], df_1[i1,])
#     col_1    col_2
# 1:     a        1
# 2:     b        2
# 3:     c        3
# 4:     c        4
# 5:     d        4
# 6:     e        5
# 7:     e        6
# 8:     f        7
# 9:     g        8
#10:     g        9
#11:   h;j 10;11;12

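Or using base R with all the constraints:

# count elements per cell (not characters, which would miscount values like "10;11")
cnt1 <- lengths(strsplit(as.character(df_1$col_1), ";", fixed = TRUE))
cnt2 <- lengths(strsplit(as.character(df_1$col_2), ";", fixed = TRUE))
# rows with unequal counts where both sides have more than one element stay as-is
i1 <- cnt1 != cnt2 & cnt1 > 1 & cnt2 > 1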
   
lst1 <- lapply(df_1[!i1, ], function(x) strsplit(x, ";"))
out <- rbind(do.call(rbind, Map(function(x, y) {
       l1 <- length(x)
       l2 <- length(y)
       mx <- max(l1, l2)
       x <- if (l1 != l2 && l1 == 1) rep(x, mx) else x
       y <- if (l1 != l2 && l2 == 1) rep(y, mx) else y
       data.frame(col_1 = x, col_2 = y) } ,
       lst1[[1]], lst1[[2]])), df_1[i1,])
   
row.names(out) <- NULL
out
#   col_1    col_2
#1      a        1
#2      b        2
#3      c        3
#4      c        4
#5      d        4
#6      e        5
#7      e        6
#8      f        7
#9      g        8
#10     g        9
#11   h;j 10;11;12
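
One detail worth checking is that the keep-as-is flag should count elements rather than characters; the two differ as soon as a value has more than one character. A minimal sketch of the difference, using three made-up rows (the `a;b` / `10;11` row is a hypothetical input, not part of the question's data):

```r
# Hypothetical rows chosen to show where the two ways of counting disagree
col_1 <- c("a;b", "h;j", "g")
col_2 <- c("10;11", "10;11;12", "8;9")

# Characters left after dropping the delimiter
chars_1 <- nchar(gsub(";", "", col_1))
chars_2 <- nchar(gsub(";", "", col_2))

# Number of delimited elements
elems_1 <- lengths(strsplit(col_1, ";", fixed = TRUE))
elems_2 <- lengths(strsplit(col_2, ";", fixed = TRUE))

# "Leave as-is" flag computed both ways
flag_chars <- chars_1 != chars_2 & chars_1 > 1 & chars_2 > 1
flag_elems <- elems_1 != elems_2 & elems_1 > 1 & elems_2 > 1

data.frame(col_1, col_2, flag_chars, flag_elems)
```

With character counts, the `a;b` / `10;11` row would be flagged and left unsplit even though both sides have two elements; with element counts only `h;j` / `10;11;12` is flagged.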

edited Sep 14, 2020 at 20:05

answered Sep 14, 2020 at 18:57

akrun


3 Comments

Mel Over a year ago

Thank you! Except for the last row, which I want to keep as-is.

akrun Over a year ago

@Mel That was confusing. For h;j and 10;11;12, I thought h would be repeated once and j twice.

Mel Over a year ago

The thing is, I wouldn't know how to split this one, so it would be better to keep it as-is.

ThomasIsCoding · Accepted Answer · 2020-09-14 20:28:36Z

1

Here is another base R option, defining a custom function f:

f <- function(v) {
  X <- unlist(strsplit(v[[1]],";"))
  Y <- unlist(strsplit(v[[2]],";"))
  if (length(X) == length(Y) || min(length(X),length(Y))==1) {
    res <- data.frame(col_1 = X, col_2 = Y)
  } else {
    res <- data.frame(col_1 = v[[1]], col_2 = v[[2]])
  }
  res
}

df_2 <- do.call(rbind,apply(df_1,1,f))

and we will get

   col_1    col_2
1      a        1
2      b        2
3      c        3
4      c        4
5      d        4
6      e        5
7      e        6
8      f        7
9      g        8
10     g        9
11   h;j 10;11;12
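
For larger data, the same rules can also be written without a per-row function. The following is only a sketch under stated assumptions, not part of either answer: it assumes every cell is a non-empty string, and the names `ok`, `n_out` and `expand` are introduced here for illustration.

```r
df_1 <- data.frame(
  col_1 = c("a;b;c", "c;d", "e", "f", "g", "h;j"),
  col_2 = c("1;2;3", "4", "5;6", "7", "8;9", "10;11;12"),
  stringsAsFactors = FALSE
)

# Split both columns and count the elements in each cell
s1 <- strsplit(df_1$col_1, ";", fixed = TRUE)
s2 <- strsplit(df_1$col_2, ";", fixed = TRUE)
n1 <- lengths(s1)
n2 <- lengths(s2)

# A row can be expanded if the counts match or one side has a single element
ok <- n1 == n2 | n1 == 1 | n2 == 1

# Rows that cannot be expanded keep their original, unsplit strings
s1[!ok] <- as.list(df_1$col_1[!ok])
s2[!ok] <- as.list(df_1$col_2[!ok])

# Number of output rows produced by each input row
n_out <- ifelse(ok, pmax(n1, n2), 1L)

# Recycle each cell's pieces to the required length, then flatten
expand <- function(pieces, n) unlist(Map(rep_len, pieces, n))

res <- data.frame(
  col_1 = expand(s1, n_out),
  col_2 = expand(s2, n_out),
  stringsAsFactors = FALSE
)
res
```

On the question's df_1 this should give the same eleven rows as above, with `h;j` and `10;11;12` left unsplit in the last row.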

answered Sep 14, 2020 at 20:28

ThomasIsCoding
