
I have a large pandas DataFrame like the one below.

import pandas as pd
import numpy as np

df = pd.DataFrame(
    [
        ("1", "Dixon Street", "Auckland"),
        ("2", "Deep Creek Road", "Wellington"),
        ("3", "Lyon St", "Melbourne"),
        ("4", "Hongmian Street", "Quinxin"),
        ("5", "Kadawatha Road", "Ganemulla"),
    ],
    columns=("ad_no", "street", "city"),
)

And I have a second large pandas DataFrame, shown below.

dfa = pd.DataFrame(
    [
        ("1 Dixon Street", "Auckland"),
        ("2 Deep Creek Road", "Wellington"),
        ("3 Lyon St", "Melbourne"),
        ("4 Hongmian Street", "Quinxin"),
        ("5 Federal Street", "Porac City"),
    ],
    columns=("address", "city"),
)

I want to check whether each street string in df appears in dfa, using the str.contains function. I am particularly interested in the ones that do not match (e.g., Kadawatha Road). Can someone please let me know how to do that? Thanks.

I tried the following code, but it doesn't produce any results.

for a in df['street']:
    dfa[dfa['address'].str.contains(a, case=False)]  # builds a filtered frame but never stores or prints it
  • Why use contains, is it a partial string match? str.contains is slow BTW. Commented Nov 14, 2024 at 1:44
  • Yes, partial string matching. Commented Nov 14, 2024 at 2:37

3 Answers


As @LMC mentioned, you can use a string contains method, though this might be slow.

I might add a helper column:

df['is_matched'] = df['street'].apply(lambda x: dfa['address'].str.contains(x, regex=False).any())  # regex=False so streets are matched as literal substrings

And then use a filter:

not_matched_df = df[~df['is_matched']].drop(columns=['is_matched'])
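For the sample frames only Kadawatha Road has no counterpart in dfa, so the filtered result should come out as:

print(not_matched_df)

  ad_no          street       city
4     5  Kadawatha Road  Ganemulla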

There are some other options/libraries. For example you could try a fuzzy match to do something similar:

%pip install thefuzz
from thefuzz import process
threshold = 80  # Set a similarity threshold
df['match'] = df['street'].apply(lambda x: process.extractOne(x, dfa['address'], score_cutoff=threshold))
not_matched_df = df[df['match'].isnull()].drop(columns=['match'])
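When nothing scores at least score_cutoff, extractOne returns None, so on the sample data Kadawatha Road should be the only row left in not_matched_df. The threshold of 80 is just a starting point to tune against your real addresses.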


Another solution would be to concatenate the values and use a regular expression.

dfa['address'].str.contains(df['street'].str.cat(sep='|'), regex=True)  # boolean mask over dfa: True where an address contains any of the streets

But this is not very performant for large data sets.
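If you need the unmatched rows of df rather than a boolean mask over dfa, one way to build on this (a sketch; re.escape is added as a guard in case street names contain regex metacharacters) is to extract which street matched each address and invert the membership test:

import re

# one capture group with all streets, escaped so they are treated literally
pattern = '(' + '|'.join(map(re.escape, df['street'])) + ')'
matched = dfa['address'].str.extract(pattern)[0]
not_matched_df = df[~df['street'].isin(matched)]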

3 Comments

Thank you. Yes, my data frames are very large.
You can use timeit (docs.python.org/3/library/timeit.html) to measure execution time. If you're using a Jupyter notebook, just add %%timeit in the first line. Based on your sample data, the most performant way would be to first split the address into street and number and then compare the street values with .isin() (a sketch of this follows after these comments). But I guess your real data is much more complex than your sample data, and splitting is not that easy.
Yes, too complex. Thank you.
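For what it's worth, on the sample data the split-then-.isin() idea from the comment above could look like this (a minimal sketch that assumes every address starts with a house number followed by a single space, which, as noted, may not hold for real data):

# split "1 Dixon Street" into ("1", "Dixon Street") on the first space
parts = dfa['address'].str.split(' ', n=1, expand=True)
not_matched_df = df[~df['street'].isin(parts[1])]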

You could also merge the columns to get the full address including the city, and drop the duplicates. Something like:

df['full_add'] = df['ad_no'] + " " + df['street'] + " " + df['city']
dfa['full_add'] = dfa['address'] + " " + dfa['city']
pd.concat([df, dfa]).drop_duplicates('full_add', keep=False)

With keep=False, any full_add value that appears in both frames is dropped entirely, so only rows without a counterpart survive. That will produce:

  ad_no          street        city                     full_add           address
4     5  Kadawatha Road   Ganemulla   5 Kadawatha Road Ganemulla               NaN
4   NaN             NaN  Porac City  5 Federal Street Porac City  5 Federal Street

1 Comment

Thanks @AlexVI, but this is not the expected output.
