
I am looking for a function (let's call it scramblematch) that can do the following.

query='one five six'
target1='one six two six three four five '
target2=' two five six'

scramblematch(query, target1) returns TRUE and

scramblematch(query, target2) returns FALSE

The stringdist package might be what I need, but I don't know how to use it.

Update1

Use case for the function I am looking for: I have a dataset with data entered gradually over the years. The values of one text field (textfield) are not standardized, so people entered them differently. Now I want to clean up this data using a standardized set of values for textfield. All values that describe the same thing in different wording are to be replaced by the standardized value. For example (I am making this up):

In my standardized set of values (let's call it lookupfactors), I have lookupfactors=c('liver disease', 'and more'). In textfield I have the following rows:

liver cancer disease
some other thing
male, liver fibrosis disease
yet another thing
failure of liver, disease

In the final result, I want rows 1, 3, and 5 (because they contain 'liver' and 'disease') to be replaced by liver disease. Here I assume that the people who entered the data did not know the precise term, but knew the keywords. Therefore the words in each value of lookupfactors are a substring/subset of those in the matching textfield value.

  • Do you want to check whether each word in query appears in the target string? If so, you can try Reduce("&",lapply(strsplit(query," ")[[1]],grepl,c(target1,target2))). Commented Dec 26, 2015 at 21:27
  • Very good demonstration of Reduce. Using Reduce like so is what I need: Reduce("&",lapply(strsplit(query," ")[[1]],grepl,target1)) and Reduce("&",lapply(strsplit(query," ")[[1]],grepl,target2)). Thank you. But I wonder if there is any other faster method. Commented Dec 26, 2015 at 21:33
  • You don't need separate calls, if you need speed. Just create a vector of target strings and use my line. It should be fast enough, I guess. Commented Dec 26, 2015 at 21:37
  • You found this method to be slow? Commented Dec 26, 2015 at 21:41
  • I haven't done benchmarking. Reduce is not slow. It is just my intuition when looking at the usage of lapply and grepl. The set query with %in% in the answer by docendo below may be faster. Commented Dec 26, 2015 at 21:45

2 Answers


You can try (the fixed=TRUE improvement is from @David's comment):

scramblematch<-function(query,target) {
   Reduce("&",lapply(strsplit(query," ")[[1]],grepl,target,fixed=TRUE))
}
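
One caveat worth noting: grepl with fixed=TRUE looks for substrings, so a query word also matches inside a longer word. A minimal sketch of the difference, using the made-up strings 'one six' and 'one sixteen' (they are not from the question):

```r
# Same definition as above, repeated so the snippet runs on its own
scramblematch <- function(query, target) {
  Reduce("&", lapply(strsplit(query, " ")[[1]], grepl, target, fixed = TRUE))
}

# Substring matching: "six" is found inside "sixteen"
substring_hit <- scramblematch("one six", "one sixteen")

# Whole-word matching with %in% treats "sixteen" as a different word
word_hit <- all(c("one", "six") %in% strsplit("one sixteen", " ")[[1]])

print(c(substring = substring_hit, whole_word = word_hit))
```

For keyword-style data this may be acceptable or even desirable, but it is worth deciding which behaviour you want.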

Some benchmark:

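query <- 'one five six'
target1 <- 'one six two six three four five '
target2 <- ' two five six'
# 20,000 target strings for the comparison
target <- rep(c(target1, target2), 10000)

# Vectorized grepl/Reduce version: one call handles all targets
system.time(scramblematch(query, target))
#  user  system elapsed
# 0.008   0.000   0.008

# Non-vectorized strsplit/%in% version from the other answer
scramblematchDD <- function(query, target, sep = " ") {
  all(unlist(strsplit(query, sep)) %in% unlist(strsplit(target, sep)))
}
# It has to be applied to one target at a time
system.time(vapply(target, scramblematchDD, FUN.VALUE = logical(1), query = query))
#  user  system elapsed
# 0.657   0.000   0.658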

The vapply is needed for @docendodiscimus's solution because that function is not vectorized over target: it handles a single target string per call.
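
Applied to the use case from the question's update, the vectorized function could be used roughly as follows. This is only a sketch: the loop and the cleaned variable are my own illustration, and it assumes that a row matching several lookup values should simply take the last one that matches.

```r
# Vectorized matcher: TRUE where every word of `query` occurs in `target`
scramblematch <- function(query, target) {
  words <- strsplit(query, " ", fixed = TRUE)[[1]]
  Reduce("&", lapply(words, grepl, target, fixed = TRUE))
}

lookupfactors <- c("liver disease", "and more")
textfield <- c("liver cancer disease",
               "some other thing",
               "male, liver fibrosis disease",
               "yet another thing",
               "failure of liver, disease")

# Start from the raw values and overwrite every row that matches a lookup value
cleaned <- textfield
for (lf in lookupfactors) {
  cleaned[scramblematch(lf, textfield)] <- lf
}
print(cleaned)
```

Because grepl matches substrings, the comma in "failure of liver, disease" does not get in the way here.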


4 Comments

Maybe add fixed = TRUE for some additional speed gain.
My answer is not vectorized because they didn't ask for such implementation in the question. Of course it would not make sense to strsplit the query each time if speed is a concern.
@DavidArenburg Tx, that speeds up things by a factor of 3.
Wow big difference. I was wrong about the speed of this approach.
3

One option to implement it is with %in% and strsplit:

scramblematch <- function(query, target, sep = " ") {
  all(unlist(strsplit(query, sep)) %in% unlist(strsplit(target, sep)))
}
scramblematch(query, target1)
#[1] TRUE
scramblematch(query, target2)
#[1] FALSE

A vectorized approach using stringi could be

library(stringi)
scramblematch <- function(query, target, sep = " ") {
  q <- stri_split_fixed(query, sep)[[1L]]
  sapply(stri_split_fixed(target, sep), function(x) {
    all(q %in% x)
  })
}

scramblematch(query, c(target1, target2))
#[1]  TRUE FALSE
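
If the targets contain punctuation (as in 'failure of liver, disease' from the question), splitting on a single space leaves tokens such as "liver," that %in% will not match. A possible base R variant, offered as a sketch only: the name scramblematch_words and the choice to split on any run of non-alphanumeric characters are my assumptions, not part of the answer above.

```r
# Hypothetical whole-word matcher that tolerates punctuation and extra spaces
scramblematch_words <- function(query, target) {
  # Split on runs of non-alphanumeric characters and drop empty tokens
  q <- strsplit(query, "[^[:alnum:]]+")[[1]]
  q <- q[nzchar(q)]
  vapply(strsplit(target, "[^[:alnum:]]+"),
         function(x) all(q %in% x),
         logical(1))
}

res <- scramblematch_words("liver disease",
                           c("failure of liver, disease", "liverwort disease"))
print(res)
```

Unlike the grepl-based version, this one does not treat "liver" as a match for "liverwort".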

2 Comments

Maybe also (similarly) library(stringi) ; all(stri_detect_fixed(target1, stri_split_fixed(query, " ")[[1]]))
@DavidArenburg, yep, I thought so too.
