Split 3 data frames evenly using the unique ID

Question

I have three tables (upload, pap1, pap2). Each table has 50 columns and 150 thousand rows. I want to split the three data frames into multiple matched data frames (each subset having at most 1000 rows) using the unique primary key. For example, subset_upload1 must contain the same IDs as subset_pap1 and subset_pap2, and so on.

 employee_id<-c(1,2,3)
 employee <- c('John','Peter ','Jolie')
 salary <- c(21000, 23400, 26800)
 startdate <- as.Date(c('2010-11-1','2008-3-25','2007-3-14'))

 upload<- data.frame(employee_id,employee, salary, startdate)

 employee_id<-c(1,2,3)
 line_1<-c('address1','address2','address3')
 line_2<-c('address1','address2','address3')
 postcode<-c('postcode1','postcode2','postcode')

 pap1<-data.frame(employee_id,line_1,line_2,postcode)


 age<-c(57,43,23)
 Height<-c(150,170,190)
 gender<-c('M','M','F')
 enddate<-as.Date(c('2020-11-1','2020-3-25','2020-3-14'))

 pap2<-data.frame(employee_id,age,Height,gender,enddate)

The outcome I am hoping for is:

 subupload1<-data.frame(employee_id=1,employee="John",salary=21000,startdate=as.Date('2010-11-1'))
 subpap1_1<-data.frame(employee_id=1,line_1='address1',line_2='address1',postcode='postcode1')
 subpap2_1<-data.frame(employee_id=1,age=57,Height=150,gender='M',enddate=as.Date('2020-11-1'))

It would help if you had an example of your data frame: dput(head(mydataframe)). When you say you have 3 tables, do you mean data.frames, or are these tables that have to be converted to the data frame type first? It is a bit unclear to me, but this could be only me.
– Dimitrios Zacharatos, Commented Feb 29, 2020 at 20:47
It seems like a task suited for the split function. If you can't provide an example, have a look at: rdocumentation.org/packages/base/versions/3.6.2/topics/split
– novica, Commented Feb 29, 2020 at 20:53
I added an example @DimitriosZacharatos. Sorry, I am new to R.
– ryan, Commented Feb 29, 2020 at 21:50
I am trying to understand the problem in order to help and get some points. No need to be sorry.
– Dimitrios Zacharatos, Commented Feb 29, 2020 at 22:15

Ronak Shah · Accepted Answer · 2020-03-02 13:02:54Z

1

We can use a while loop to randomly sample n unique IDs and subset them from each of the 3 data frames to create new data frames.

n <- 1 #Number of unique primary key in one dataframe
remaining_ids <- unique(upload$employee_id)
counter <- 1

while(length(remaining_ids) > n) {
   ids <- sample(remaining_ids, n)
   assign(paste0("subupload_", counter), subset(upload, employee_id %in% ids))
   assign(paste0("subpap1_", counter), subset(pap1, employee_id %in% ids))
   assign(paste0("subpap2_", counter), subset(pap2, employee_id %in% ids))
   counter <- counter + 1
   remaining_ids <- setdiff(remaining_ids, ids)
}

assign(paste0("subupload_", counter),subset(upload, employee_id %in% remaining_ids))
assign(paste0("subpap1_", counter), subset(pap1, employee_id %in% remaining_ids))
assign(paste0("subpap2_", counter), subset(pap2, employee_id %in% remaining_ids))

However, try to use lists to better handle/manage data instead of polluting your global environment with lots of objects.
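The list-based approach suggested above could look roughly like the following. This is only a sketch under assumptions: the toy tables are made up, split_by_chunk is a hypothetical helper, and ids are assigned to chunks in their existing order rather than sampled at random as in the loop.

```r
# Toy data mirroring the question's three tables (all share employee_id)
upload <- data.frame(employee_id = 1:5, salary = c(21000, 23400, 26800, 30000, 19500))
pap1   <- data.frame(employee_id = 1:5, postcode = paste0("postcode", 1:5))
pap2   <- data.frame(employee_id = 1:5, age = c(57, 43, 23, 35, 61))

n <- 2  # maximum number of unique ids per chunk (1000 in the question)

# Give every unique id a chunk number: 1, 1, 2, 2, 3
ids <- unique(upload$employee_id)
chunk_of_id <- ceiling(seq_along(ids) / n)

# Hypothetical helper: split one data frame by the chunk its ids belong to
split_by_chunk <- function(df) {
  split(df, chunk_of_id[match(df$employee_id, ids)])
}

# One named list per table; element "k" of each list holds the same ids
chunks <- lapply(list(upload = upload, pap1 = pap1, pap2 = pap2), split_by_chunk)
```

With this layout, chunks$upload[["1"]], chunks$pap1[["1"]] and chunks$pap2[["1"]] hold the same employees, and nothing is created in the global environment apart from the chunks list.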

If we want to write all of these as CSV files, we can use write.csv instead of assign:

while(length(remaining_ids) > n) {
  ids <- sample(remaining_ids, n)
  write.csv(subset(upload, employee_id %in% ids), paste0("subupload_", counter, ".csv"))
  write.csv(subset(pap1, employee_id %in% ids), paste0("subpap1_", counter, ".csv"))
  write.csv(subset(pap2, employee_id %in% ids), paste0("subpap2_", counter, ".csv"))
  counter <- counter + 1
  remaining_ids <- setdiff(remaining_ids, ids)
}
write.csv(subset(upload, employee_id %in% remaining_ids), paste0("subupload_", counter, ".csv"))
write.csv(subset(pap1, employee_id %in% remaining_ids), paste0("subpap1_", counter, ".csv"))
write.csv(subset(pap2, employee_id %in% remaining_ids), paste0("subpap2_", counter, ".csv"))
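One detail worth knowing: write.csv writes the row names as an extra first column by default, so passing row.names = FALSE is often wanted. The sketch below loops over a named list of tables instead of repeating one line per table. The toy data, the tables list and the use of tempdir() as output folder are assumptions for illustration only.

```r
# Toy data standing in for the real tables
upload <- data.frame(employee_id = 1:5, salary = c(21000, 23400, 26800, 30000, 19500))
pap1   <- data.frame(employee_id = 1:5, postcode = paste0("postcode", 1:5))

# The list names become the file name prefixes
tables <- list(subupload = upload, subpap1 = pap1)

n <- 2                # maximum ids per file (1000 in the question)
out_dir <- tempdir()  # assumed output folder; replace with a real path

ids <- unique(upload$employee_id)
chunk_of_id <- ceiling(seq_along(ids) / n)

for (k in unique(chunk_of_id)) {
  chunk_ids <- ids[chunk_of_id == k]
  for (nm in names(tables)) {
    part <- tables[[nm]][tables[[nm]]$employee_id %in% chunk_ids, ]
    # row.names = FALSE avoids an extra unnamed column of row numbers
    write.csv(part, file.path(out_dir, paste0(nm, "_", k, ".csv")), row.names = FALSE)
  }
}
```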

edited Mar 2, 2020 at 13:02

answered Mar 1, 2020 at 10:02

Ronak Shah


4 Comments

ryan Over a year ago

Thank you @Ronak. If I run your code, I will end up with 150000 sub-tables for each data frame (150000 unique employees). I want each sub-table to have at most 1000 employees.

Ronak Shah Over a year ago

@ryan Change the first line from n <- 1 to n <- 1000.

ryan Over a year ago

It worked perfectly, thank you so much for your help @Ronak. Last question: is there a way to save them all as CSV files?

Ronak Shah Over a year ago

@ryan you could replace assign with write.csv, see updated answer.

Dimitrios Zacharatos · Accepted Answer · 2020-02-29 23:04:00Z

1

upload<-data.frame(employee_id=c(1,2,3),
                   employee=c('John','Peter','Jolie'),
                   salary=c(21000,23400,26800),
                   startdate=as.Date(c('2010-11-1','2008-3-25','2007-3-14')))

pap1<-data.frame(employee_id=c(1,2,3),
                 line_1=c('address1','address2','address3'),
                 line_2=c('address1','address2','address3'),
                 postcode=c('postcode1','postcode2','postcode'))

pap2<-data.frame(employee_id=c(1,2,3),
                 age=c(57,43,23),
                 Height=c(150,170,190),
                 gender=c('M','M','F'),
                 enddate=as.Date(c('2020-11-1','2020-3-25','2020-3-14')))

subupload1<-data.frame(employee_id=1,employee = "John",salary=21000,startdate=as.Date('2010-11-1'))
subpap1<-data.frame(employee_id=1,line_1='address1',line_2='address1',postcode='postcode1')
subpap2<-data.frame(employee_id=1,age=57,Height=150,gender='M',enddate=as.Date('2020-11-1'))

upload[upload$employee_id%in%1,]
upload[upload$employee_id%in%1:2,]
upload[upload$employee_id%in%1:3,]



# Sort every table by employee_id so that row i refers to the same employee in each
upload<-upload[order(upload$employee_id),]
pap1<-pap1[order(pap1$employee_id),]
pap2<-pap2[order(pap2$employee_id),]


upload<-data.frame(employee_id=1:150000,
                   employee=sample(c('John','Peter','Jolie'),150000,replace=TRUE),
                   salary=sample(c(21000,23400,26800),150000,replace=TRUE),
                   startdate=sample(as.Date(c('2010-11-1','2008-3-25','2007-3-14')),150000,replace=TRUE))

# Group label for every row: 1000 ones, 1000 twos, ..., 1000 times 150
split_setting<-rep(seq_len(150000/1000),each=1000)

result<-split(upload,split_setting)

result$`1`
nrow(result$`1`)
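The code above only splits upload. A minimal sketch of how the same split_setting might be reused for the other tables follows. It assumes that every table contains exactly the same employee_id values, one row per id, so that sorting by id lines the rows up. The data here is simulated and smaller (3000 rows), and result_upload and result_pap1 are illustrative names.

```r
set.seed(1)
n_rows <- 3000  # 150000 in the question
chunk  <- 1000  # maximum rows per sub-table

upload <- data.frame(employee_id = 1:n_rows,
                     salary = sample(c(21000, 23400, 26800), n_rows, replace = TRUE))
# pap1 deliberately arrives in a different row order
pap1 <- data.frame(employee_id = sample(n_rows), line_1 = "address")

# Sort both tables by id so that row i is the same employee in each
upload <- upload[order(upload$employee_id), ]
pap1   <- pap1[order(pap1$employee_id), ]

# One group label per row; length.out also covers a last, shorter chunk
split_setting <- rep(seq_len(ceiling(n_rows / chunk)), each = chunk, length.out = n_rows)

result_upload <- split(upload, split_setting)
result_pap1   <- split(pap1, split_setting)
```

If the tables can differ in which ids they contain, or hold several rows per id, this positional approach no longer lines up and the labels would have to be looked up by id instead.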

edited Feb 29, 2020 at 23:04

answered Feb 29, 2020 at 22:26

Dimitrios Zacharatos


7 Comments

ryan Over a year ago

Thank you very much for doing this, but as I mentioned in my original question, I have three tables, each with 150,000 rows, and every row is a unique employee. I need each table to be split into multiple tables with a maximum of 1000 employees in each, following the same logic as in my example: subupload1 will have 1000 employees who are the same as those in subpap1-1 and subpap2-1, subupload2 will have the next 1000 employees, who are the same as those in subpap1-2 and subpap2-2, and so on.

Dimitrios Zacharatos Over a year ago

no worries, do you actually need a way to do a lot of splits in an automatic way?

Dimitrios Zacharatos Over a year ago

I am missing something

ryan Over a year ago

Exactly, because I have to run it every day.

ryan Over a year ago

Each table will be split into 150 tables with 1000 rows in each, and each sub-table from upload must have the same employee IDs as the corresponding sub-tables from the other two.
