I have three tables (upload, pap1, pap2). Each table has 50 columns and 150,000 rows. I want to split the three tables into multiple matched data frames (each subset with at most 1000 rows) using the unique primary key. For example, subset_upload1 must contain the same IDs as subset_pap1 and subset_pap2, and so on.

 employee_id<-c(1,2,3)
 employee <- c('John','Peter ','Jolie')
 salary <- c(21000, 23400, 26800)
 startdate <- as.Date(c('2010-11-1','2008-3-25','2007-3-14'))

 upload<- data.frame(employee_id,employee, salary, startdate)

 employee_id<-c(1,2,3)
 line_1<-c('address1','address2','address3')
 line_2<-c('address1','address2','address3')
 postcode<-c('postcode1','postcode2','postcode')

 pap1<-data.frame(employee_id,line_1,line_2,postcode)


 age<-c(57,43,23)
 Height<-c(150,170,190)
 gender<-c('M','M','F')
 enddate<-as.Date(c('2020-11-1','2020-3-25','2020-3-14'))

 pap2<-data.frame(employee_id,age,Height,gender,enddate)

The outcome I am hoping for is:

    subupload1<-data.frame(employee_id=1,employee="John",salary=21000,startdate=as.Date('2010-11-1'))
    subpap1_1<-data.frame(employee_id=1,line_1='address1',line_2='address1',postcode='postcode1')
    subpap2_1<-data.frame(employee_id=1,age=57,Height=150,gender='M',enddate=as.Date('2020-11-1'))
  • It would help if you shared an example of your data frame, e.g. the output of dput(head(mydataframe)). When you say you have 3 tables, do you mean data.frames, or are these tables that have to be converted to data frames first? It is a bit unclear to me, but that could be just me. Commented Feb 29, 2020 at 20:47
  • It seems like a task suited for the split function. If you can't provide an example, have a look at: rdocumentation.org/packages/base/versions/3.6.2/topics/split Commented Feb 29, 2020 at 20:53
  • I added an example @DimitriosZacharatos. Sorry, I am new to R. Commented Feb 29, 2020 at 21:50
  • I am trying to understand the problem in order to help and get some points. No need to be sorry Commented Feb 29, 2020 at 22:15
  • it looks to me that this is a subsetting problem Commented Feb 29, 2020 at 22:25

2 Answers

We can use a while loop to randomly sample n unique IDs at a time and subset those IDs from each of the 3 data frames, so that every group of new data frames shares the same IDs.

n <- 1  # number of unique primary keys per data frame
remaining_ids <- unique(upload$employee_id)
counter <- 1

# Draw n IDs at a time until at most n are left.
while (length(remaining_ids) > n) {
  ids <- sample(remaining_ids, n)
  # Create subupload_<counter>, subpap1_<counter>, subpap2_<counter> with the same IDs.
  assign(paste0("subupload_", counter), subset(upload, employee_id %in% ids))
  assign(paste0("subpap1_", counter), subset(pap1, employee_id %in% ids))
  assign(paste0("subpap2_", counter), subset(pap2, employee_id %in% ids))
  counter <- counter + 1
  remaining_ids <- setdiff(remaining_ids, ids)
}

# The last group holds whatever IDs remain (at most n of them).
assign(paste0("subupload_", counter), subset(upload, employee_id %in% remaining_ids))
assign(paste0("subpap1_", counter), subset(pap1, employee_id %in% remaining_ids))
assign(paste0("subpap2_", counter), subset(pap2, employee_id %in% remaining_ids))

However, try to use lists to better handle/manage data instead of polluting your global environment with lots of objects.
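As a rough illustration of the list-based alternative, here is a minimal sketch. It is not part of the original approach: the toy tables, the chunk size of 2, and the names `id_chunks`, `tables` and `chunks` are assumptions made for the example. The IDs are shuffled once and cut into groups, and each table is then subset by the same groups.

```r
# Toy versions of the three tables, sharing the key column employee_id.
upload <- data.frame(employee_id = 1:5, salary = c(21000, 23400, 26800, 30000, 19500))
pap1   <- data.frame(employee_id = 1:5, postcode = paste0("postcode", 1:5))
pap2   <- data.frame(employee_id = 1:5, age = c(57, 43, 23, 35, 61))

n <- 2  # maximum number of IDs per chunk (this would be 1000 for the real data)

# Shuffle the IDs once, then cut the shuffled vector into groups of at most n.
ids <- unique(upload$employee_id)
shuffled <- ids[sample.int(length(ids))]
id_chunks <- split(shuffled, ceiling(seq_along(shuffled) / n))

# One list per table; element k of every list holds the same IDs.
tables <- list(upload = upload, pap1 = pap1, pap2 = pap2)
chunks <- lapply(tables, function(tbl) {
  lapply(id_chunks, function(chunk_ids) {
    tbl[tbl$employee_id %in% chunk_ids, , drop = FALSE]
  })
})

chunks$upload[[1]]  # first chunk of upload
chunks$pap1[[1]]    # rows of pap1 with the same IDs
```

With this layout, `chunks$upload[[k]]`, `chunks$pap1[[k]]` and `chunks$pap2[[k]]` always refer to the same employees, and nothing is created in the global environment apart from the one list.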


To write all of these as CSV files, we can use write.csv instead of assign:

# Start from the full set of IDs again (the loop above consumes remaining_ids).
n <- 1000
remaining_ids <- unique(upload$employee_id)
counter <- 1

while (length(remaining_ids) > n) {
  ids <- sample(remaining_ids, n)
  # One file per table, all containing the same IDs.
  write.csv(subset(upload, employee_id %in% ids), paste0("subupload_", counter, ".csv"))
  write.csv(subset(pap1, employee_id %in% ids), paste0("subpap1_", counter, ".csv"))
  write.csv(subset(pap2, employee_id %in% ids), paste0("subpap2_", counter, ".csv"))
  counter <- counter + 1
  remaining_ids <- setdiff(remaining_ids, ids)
}

# Write the final group of remaining IDs.
write.csv(subset(upload, employee_id %in% remaining_ids), paste0("subupload_", counter, ".csv"))
write.csv(subset(pap1, employee_id %in% remaining_ids), paste0("subpap1_", counter, ".csv"))
write.csv(subset(pap2, employee_id %in% remaining_ids), paste0("subpap2_", counter, ".csv"))
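Since the split has to be repeated regularly, the same idea could be wrapped in a function. The sketch below is only one possible shape: `write_chunks`, its arguments, and the output directory are hypothetical names chosen for the example, the IDs are taken in their existing order rather than sampled, and `row.names = FALSE` is added so the files do not get an extra row-number column.

```r
# Hypothetical helper: writes one CSV per table per chunk into out_dir.
# `tables` is a named list of data frames that share the column `key`.
write_chunks <- function(tables, key, n, out_dir) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  ids <- unique(tables[[1]][[key]])
  id_chunks <- split(ids, ceiling(seq_along(ids) / n))
  for (k in seq_along(id_chunks)) {
    for (name in names(tables)) {
      tbl <- tables[[name]]
      rows <- tbl[tbl[[key]] %in% id_chunks[[k]], , drop = FALSE]
      # File names follow the pattern used above, e.g. subupload_1.csv.
      write.csv(rows, file.path(out_dir, paste0("sub", name, "_", k, ".csv")),
                row.names = FALSE)
    }
  }
  invisible(length(id_chunks))  # number of chunks written per table
}

# Toy data; the real tables would have 150,000 rows and n = 1000.
upload <- data.frame(employee_id = 1:3, salary = c(21000, 23400, 26800))
pap1   <- data.frame(employee_id = 1:3, postcode = paste0("postcode", 1:3))
pap2   <- data.frame(employee_id = 1:3, age = c(57, 43, 23))

out_dir <- file.path(tempdir(), "chunks")
n_chunks <- write_chunks(list(upload = upload, pap1 = pap1, pap2 = pap2),
                         key = "employee_id", n = 2, out_dir = out_dir)
list.files(out_dir)
```

The helper takes its IDs from the first table in the list, so it assumes every table contains the same set of keys.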

4 Comments

Thank you @Ronak. If I run your code, I end up with 150000 sub-tables for each data frame (150000 unique employees). I want each sub-table to have at most 1000 employees.
@ryan Change the first line from n <- 1 to n <- 1000.
It worked perfectly, thank you so much for your help @Ronak. One last question: is there a way to save them all as CSV files?
@ryan you could replace assign with write.csv, see updated answer.
upload<-data.frame(employee_id=c(1,2,3),
                   employee=c('John','Peter','Jolie'),
                   salary=c(21000,23400,26800),
                   startdate=as.Date(c('2010-11-1','2008-3-25','2007-3-14')))

pap1<-data.frame(employee_id=c(1,2,3),
                 line_1=c('address1','address2','address3'),
                 line_2=c('address1','address2','address3'),
                 postcode=c('postcode1','postcode2','postcode'))

pap2<-data.frame(employee_id=c(1,2,3),
                 age=c(57,43,23),
                 Height=c(150,170,190),
                 gender=c('M','M','F'),
                 enddate=as.Date(c('2020-11-1','2020-3-25','2020-3-14')))

subupload1<-data.frame(employee_id=1,employee = "John",salary=21000,startdate=as.Date('2010-11-1'))
subpap1<-data.frame(employee_id=1,line_1='address1',line_2='address1',postcode='postcode1')
subpap2<-data.frame(employee_id=1,age=57,Height=150,gender='M',enddate=as.Date('2020-11-1'))

upload[upload$employee_id%in%1,]
upload[upload$employee_id%in%1:2,]
upload[upload$employee_id%in%1:3,]



upload<-upload[order(upload$employee_id),]
pap1<-pap1[order(pap1$employee_id),]
pap2<-pap2[order(pap2$employee_id),]


upload<-data.frame(employee_id=1:150000,
                   employee=sample(c('John','Peter','Jolie'),150000,replace=TRUE),
                   salary=sample(c(21000,23400,26800),150000,replace=TRUE),
                   startdate=sample(as.Date(c('2010-11-1','2008-3-25','2007-3-14')),150000,replace=TRUE))

# Group label for every row: 1000 ones, then 1000 twos, ... up to 150.
split_setting<-rep(seq_len(150000/1000),each=1000)

result<-split(upload,split_setting)

result$`1`
nrow(result$`1`)
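The split above covers only upload. One way to keep the other two tables aligned is to derive the group label from the ID itself rather than from the row position, so the same mapping can be reused for every table. The sketch below illustrates this with toy data; `chunk_size`, `chunk_of` and the `result_*` names are assumptions made for the example, and it assumes every ID in pap1 and pap2 also appears in upload (rows with unknown IDs would be dropped by split).

```r
# Small stand-ins for the three tables; rows are deliberately in different orders.
upload <- data.frame(employee_id = c(3, 1, 2, 5, 4),
                     salary = c(26800, 21000, 23400, 19500, 30000))
pap1   <- data.frame(employee_id = c(5, 4, 3, 2, 1),
                     postcode = paste0("postcode", c(5, 4, 3, 2, 1)))
pap2   <- data.frame(employee_id = c(2, 3, 1, 4, 5),
                     age = c(43, 23, 57, 35, 61))

chunk_size <- 2  # this would be 1000 for the real data

# Map every ID to a chunk number based on its rank among the sorted IDs.
ids <- sort(unique(upload$employee_id))
chunk_of <- function(x) ceiling(match(x, ids) / chunk_size)

# The same mapping is applied to each table, so chunk k matches across tables.
result_upload <- split(upload, chunk_of(upload$employee_id))
result_pap1   <- split(pap1, chunk_of(pap1$employee_id))
result_pap2   <- split(pap2, chunk_of(pap2$employee_id))

# Chunk "1" holds the same IDs in all three results.
sort(result_upload$`1`$employee_id)
sort(result_pap1$`1`$employee_id)
sort(result_pap2$`1`$employee_id)
```

Because the label depends only on the ID, the tables do not need to be sorted first and do not need to have their rows in the same order.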

7 Comments

Thank you very much for doing this, but as I mentioned in my original question, I have three tables, each with 150,000 rows, one per unique employee. I need each table split into multiple tables with at most 1000 employees each, following the same logic as in my example: subupload1 will have 1000 employees who are the same as in the matching subpap1 and subpap2 tables, subupload2 will have the next 1000 employees, who are again the same as in its matching subpap1 and subpap2 tables, and so on.
no worries, do you actually need a way to do a lot of splits in an automatic way?
I am missing something
Exactly, because I have to run it every day.
Each table will be split into 150 tables with 1000 rows in each, and each sub-table from upload must have the same employee IDs as the other two sub-tables.