
I have a large pandas DataFrame like the one below.

import pandas as pd
import numpy as np

df = pd.DataFrame(
    [
        ("1", "Dixon Street", "Auckland"),
        ("2", "Deep Creek Road", "Wellington"),
        ("3", "Lyon St", "Melbourne"),
        ("4", "Hongmian Street", "Quinxin"),
        ("5", "Kadawatha Road", "Ganemulla"),
    ],
    columns=("ad_no", "street", "city"),
)

And I have a second large pandas DataFrame, shown below.

dfa = pd.DataFrame(
    [
        ("1 Dixon Street", "Auckland"),
        ("2 Deep Creek Road", "Wellington"),
        ("3 Lyon St", "Melbourne"),
        ("4 Hongmian Street", "Quinxin"),
        ("5 Federal Street", "Porac City"),
    ],
    columns=("address", "city"),
)

I want to check whether each street string in df appears in dfa, using the str.contains function. I am particularly interested in the ones that do not match (e.g., Kadawatha Road). Can someone please let me know how to do that? Thanks.

I tried the following code, but it doesn't produce any results.

for a in df['street']:
    dfa[dfa['address'].str.contains(a, case=False)]  # builds a filtered frame but never stores or prints it
  • Why use contains, is it a partial string match? str.contains is slow BTW. Commented Nov 14, 2024 at 1:44
  • Yes, partial string matching. Commented Nov 14, 2024 at 2:37

3 Answers


As @LMC mentioned, you can use a string contains method, though this might be slow.

I might add a helper column:

df['is_matched'] = df['street'].apply(lambda x: dfa['address'].str.contains(x, regex=False).any())  # regex=False so streets are matched as literal substrings

And then use a filter:

not_matched_df = df[~df['is_matched']].drop(columns=['is_matched'])
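For the sample frames only Kadawatha Road has no counterpart in dfa, so the filtered result should come out as:

print(not_matched_df)

  ad_no          street       city
4     5  Kadawatha Road  Ganemulla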

There are some other options/libraries. For example you could try a fuzzy match to do something similar:

%pip install thefuzz
from thefuzz import process
threshold = 80  # Set a similarity threshold
df['match'] = df['street'].apply(lambda x: process.extractOne(x, dfa['address'], score_cutoff=threshold))
not_matched_df = df[df['match'].isnull()].drop(columns=['match'])
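When nothing scores at least score_cutoff, extractOne returns None, so on the sample data Kadawatha Road should be the only row left in not_matched_df. The threshold of 80 is just a starting point to tune against your real addresses.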


Another solution would be to concatenate the values and use a regular expression.

dfa['address'].str.contains(df['street'].str.cat(sep='|'), regex=True)  # boolean mask over dfa: True where an address contains any of the streets

But this is not very performant for large data sets.
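If you need the unmatched rows of df rather than a boolean mask over dfa, one way to build on this (a sketch; re.escape is added as a guard in case street names contain regex metacharacters) is to extract which street matched each address and invert the membership test:

import re

# one capture group with all streets, escaped so they are treated literally
pattern = '(' + '|'.join(map(re.escape, df['street'])) + ')'
matched = dfa['address'].str.extract(pattern)[0]
not_matched_df = df[~df['street'].isin(matched)]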

3 Comments

Thank you. Yes, my data frames are very large.
You can use timeit (docs.python.org/3/library/timeit.html) to measure execution time. If you're using a Jupyter notebook, just add %%timeit in the first line. Based on your sample data, the most performant way would be to first split the address into street and number and then compare the street values with .isin() (a sketch of this follows after these comments). But I guess your real data is much more complex than your sample data, and splitting is not that easy.
Yes, too complex. Thank you.
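For what it's worth, on the sample data the split-then-.isin() idea from the comment above could look like this (a minimal sketch that assumes every address starts with a house number followed by a single space, which, as noted, may not hold for real data):

# split "1 Dixon Street" into ("1", "Dixon Street") on the first space
parts = dfa['address'].str.split(' ', n=1, expand=True)
not_matched_df = df[~df['street'].isin(parts[1])]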

You could also merge the columns to get the full address including the city, and drop the duplicates. Something like:

df['full_add'] = df['ad_no'] + " " + df['street'] + " " + df['city']
dfa['full_add'] = dfa['address'] + " " + dfa['city']
pd.concat([df, dfa]).drop_duplicates('full_add', keep=False)

With keep=False, any full_add value that appears in both frames is dropped entirely, so only rows without a counterpart survive. That will produce:

  ad_no          street        city                     full_add           address
4     5  Kadawatha Road   Ganemulla   5 Kadawatha Road Ganemulla               NaN
4   NaN             NaN  Porac City  5 Federal Street Porac City  5 Federal Street

1 Comment

Thanks @AlexVI, but this is not the expected output.
