0

I have a dataframe one of the cols is id and some of the values have been messed up during the recording of the data.

here's an example of the type of data

dput(df)
structure(list(Id = c("'110171786'", "'1103fbfd5'", "'0700edf6dc'", 
"'1103fad09'", "'01103fc9bb'", "''", "''", "0000fba2b'", "'01103fb169'", 
"'01103fd723'", "'01103f9c34'", "''", "''", "''", "'01103fc088'", 
"'01103fa6d8'", "'01103fb374'", "'01103fce8c'", "'01103f955d'", 
"'011016e633'", "'01103fa0da'", "''", "''", "''", "'01103fa4bd'", 
"'01103fb5c4'", "'01103fd0d7'", "'01103f9e2e'", "'01103fc657'", 
"'01103fd4d1'", "'011016e78e'", "'01103fbda2'", "'01103fbae7'", 
"'011016ee23'", "'01103fc847'", "'01103fbfbb'", "''", "'01103fb8bb'", 
"'01103fc853'", "''", "'01103fbcd5'", "'011016e690'", "'01103fb253'", 
"'01103fcb19'", "'01103fb446'", "'01103fa4fa'", "'011016cfbd'", 
"'01103fd250'", "'01103fac7d'", "'011016a86e'"), Weight = c(11.5, 
11.3, 11.3, 10.6, 10.6, 8.9, 18.7, 10.9, 11.3, 18.9, 18.9, 8.6, 
8.8, 8.4, 11, 10.4, 10.4, 10.8, 11.2, 11, 10.3, 9.5, 8.1, 9.3, 
10.2, 10.5, 11.2, 21.9, 18, 17.8, 11.3, 11.5, 10.8, 10.5, 12.8, 
10.9, 8.9, 10.3, 10.8, 8.9, 10.9, 9.9, 19, 11.6, 11.3, 11.7, 
10.9, 12.1, 11.3, 10.6)), class = "data.frame", row.names = c(NA, 
-50L))
> 

What I would like to do is search through the id column and replace the following mistakes

  • some of the values have a zero missing off the front, all of these would start with a 1 now instead which makes finding them easily. So basically anything that has a character length of 9 and starts with a 1 needs a 0 as the first character.
  • some of the values are less than 10 characters long, these need to be removed.
  • some have more than one leading 0 and these need to be removed.
2
  • Why are your Id's double quoted eg "'110171786'" instead of "110171786"? Just curious Commented Dec 16, 2021 at 15:36
  • I think that just initially to stop excel from treating them like numbers and removing the zeros (which didn't work) and some of the IDs have an "E" in the middle and excel turns that into scientific notation. Relic of an old database system Commented Dec 17, 2021 at 7:53

1 Answer 1

2
df$Id <- gsub("^('?)(1.{8}')$", "\\10\\2", df$Id)
df[ !grepl("^'?(00|'$)", df$Id),]
#              Id Weight
# 1  '0110171786'   11.5
# 2  '01103fbfd5'   11.3
# 3  '0700edf6dc'   11.3
# 4  '01103fad09'   10.6
# 5  '01103fc9bb'   10.6
# 9  '01103fb169'   11.3
# 10 '01103fd723'   18.9
# 11 '01103f9c34'   18.9
# 15 '01103fc088'   11.0
# 16 '01103fa6d8'   10.4
# 17 '01103fb374'   10.4
# 18 '01103fce8c'   10.8
# 19 '01103f955d'   11.2
# 20 '011016e633'   11.0
# 21 '01103fa0da'   10.3
# 25 '01103fa4bd'   10.2
# 26 '01103fb5c4'   10.5
# 27 '01103fd0d7'   11.2
# 28 '01103f9e2e'   21.9
# 29 '01103fc657'   18.0
# 30 '01103fd4d1'   17.8
# 31 '011016e78e'   11.3
# 32 '01103fbda2'   11.5
# 33 '01103fbae7'   10.8
# 34 '011016ee23'   10.5
# 35 '01103fc847'   12.8
# 36 '01103fbfbb'   10.9
# 38 '01103fb8bb'   10.3
# 39 '01103fc853'   10.8
# 41 '01103fbcd5'   10.9
# 42 '011016e690'    9.9
# 43 '01103fb253'   19.0
# 44 '01103fcb19'   11.6
# 45 '01103fb446'   11.3
# 46 '01103fa4fa'   11.7
# 47 '011016cfbd'   10.9
# 48 '01103fd250'   12.1
# 49 '01103fac7d'   11.3
# 50 '011016a86e'   10.6
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.