Remove outliers from pandas dataframe python

Question

I have code that creates a DataFrame using pandas:

import pandas as pd
import numpy as np

x = (g[0].time[:111673])
y = (g[0].data.f[:111673])
df = pd.DataFrame({'Time': x, 'Data': y})
#df

This prints out:

          Data          Time
0        -0.704239      7.304021
1        -0.704239      7.352021
2        -0.704239      7.400021
3        -0.704239      7.448021
4        -0.825279      7.496021

This is fine, but I know the data contains outliers that I want removed, so I created the DataFrame below to flag them:

newdf = df.copy()
Data = newdf.groupby('Data')
newdf[np.abs(newdf.Data-newdf.Data.mean())<=(3*newdf.Data.std())]
newdf['Outlier'] = Data.transform( lambda x: abs(x-x.mean()) > 1.96*x.std() )
#newdf

This prints out:

             Data          Time  Outlier
0        -0.704239      7.304021    False
1        -0.704239      7.352021    False
2        -0.704239      7.400021    False
3        -0.704239      7.448021    False
4        -0.825279      7.496021    False

You can't see it in this sample of my data, but there are maybe 300 outliers. I want to remove them without modifying the original DataFrame and then plot both together for comparison. My question is this: instead of printing False/True, how can I drop the rows where the outlier flag is True, so that I can eventually plot the two in the same graph?

Code I have already tried:

newdf[np.abs(newdf.Data-newdf.Data.mean())<=(1.96*newdf.Data.std())]

newdf = df.copy()
def replace_outliers_with_nan(df, stdvs):
    newdf=pd.DataFrame()
    for i, col in enumerate(df.sites.unique()):
        df = pd.DataFrame(df[df.sites==col])
        idx = [np.abs(df-df.mean())<=(stdvs*df.std())] 
        df[idx==False]=np.nan  
        newdf[col] = df
    return newdf

Neither of these works: both return the same number of data points as my original DataFrame, but if the outliers had been removed, there would be fewer points than in the original.

jezrael · Accepted Answer · 2017-08-02 13:22:20Z

It seems you need boolean indexing with ~ to invert the condition, because you want to keep only the rows that are not outliers (and drop the outliers):

df1 = df[~df.groupby('Data').transform( lambda x: abs(x-x.mean()) > 1.96*x.std()).values]
print (df1)
       Data      Time
0 -0.704239  7.304021
1 -0.704239  7.352021
2 -0.704239  7.400021
3 -0.704239  7.448021
4 -0.825279  7.496021
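
For comparison, here is a minimal sketch of the same ~ inversion applied with a single global threshold on the Data column instead of per group. The sample values and the 1.96 cutoff are illustrative assumptions, not the asker's data. It also shows that boolean indexing returns a new DataFrame, so the result has to be assigned to a name, while the original df stays untouched.

```python
import numpy as np
import pandas as pd

# Illustrative data (an assumption): 20 ordinary readings plus one obvious spike.
values = np.append(np.linspace(-0.8, -0.6, 20), 25.0)
df = pd.DataFrame({
    'Time': 7.304021 + 0.048 * np.arange(len(values)),
    'Data': values,
})

# True where a row lies more than 1.96 standard deviations from the column mean.
is_outlier = (df['Data'] - df['Data'].mean()).abs() > 1.96 * df['Data'].std()

# ~ inverts the mask, so only the non-outlier rows are kept.
# The result is a new DataFrame; df itself is not modified.
df_clean = df[~is_outlier]

print(len(df), len(df_clean))
```

Because df and df_clean are separate objects, both remain available afterwards, for example to plot them on the same axes.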

edited Aug 2, 2017 at 13:22

answered Aug 2, 2017 at 13:08

5 Comments

eliza.b Over a year ago

When I tried your answer I get a value error: raise ValueError(msg.format(dtype=dt)) ValueError: Boolean array expected for the condition, not float64

jezrael Over a year ago

What does print (df.groupby('Data').transform( lambda x: abs(x-x.mean()) > 1.96*x.std())) return? Isn't it a True/False Series?

jezrael Over a year ago

I found the problem: you need .values to convert the Series to a NumPy array.

eliza.b Over a year ago

I'm a little confused by your wording. Yes, print (df.groupby('Data').transform( lambda x: abs(x-x.mean()) > 1.96*x.std())) returns a True or False series for my 'Time' column and nothing for the 'Data' column. But I already had that with

Data = newdf.groupby('Data')
newdf[np.abs(newdf.Data-newdf.Data.mean())<=(3*newdf.Data.std())]
newdf['Outlier'] = Data.transform( lambda x: abs(x-x.mean()) > 1.96*x.std() )

I'm looking to remove the outliers from the 'Data' column.

eliza.b Over a year ago

Yes, thank you! sorry I didn't see your edit before my last comment.
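
One possible way to avoid the .values step discussed in these comments is to select the column of interest before calling transform. That gives a one-dimensional boolean Series aligned to the original index, which can be inverted with ~ and used directly. This is a hedged sketch only: the `sites` grouping column is borrowed from the asker's second attempt, and the sample values and the helper name `flag_outliers` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data: two sites, each with ten ordinary readings and one spike.
df = pd.DataFrame({
    'sites': ['a'] * 11 + ['b'] * 11,
    'Data': np.concatenate([
        np.linspace(-0.8, -0.6, 10), [30.0],
        np.linspace(4.9, 5.1, 10), [-40.0],
    ]),
})

def flag_outliers(x, stdvs=1.96):
    """Return True for values more than `stdvs` standard deviations from the group mean."""
    return (x - x.mean()).abs() > stdvs * x.std()

# Selecting ['Data'] first means transform returns a Series, not a DataFrame.
# astype(bool) is a defensive cast in case the result dtype is not already boolean.
is_outlier = df.groupby('sites')['Data'].transform(flag_outliers).astype(bool)

# Keep only the rows that are not flagged; df is left unchanged.
df_clean = df[~is_outlier]

print(len(df), len(df_clean))
```

Here the outlier test is applied to Data within each site, so a value is judged against its own group's mean and standard deviation rather than the whole column's.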
