I have code that creates a dataframe using pandas:

import pandas as pd
import numpy as np

x = (g[0].time[:111673])
y = (g[0].data.f[:111673])
df = pd.DataFrame({'Time': x, 'Data': y})
#df

This prints out:

          Data          Time
0        -0.704239      7.304021
1        -0.704239      7.352021
2        -0.704239      7.400021
3        -0.704239      7.448021
4        -0.825279      7.496021

Which is great, but I know there are outliers in this data that I want removed, so I created the dataframe below to point them out:

newdf = df.copy()
Data = newdf.groupby('Data')
newdf[np.abs(newdf.Data-newdf.Data.mean())<=(3*newdf.Data.std())]
newdf['Outlier'] = Data.transform( lambda x: abs(x-x.mean()) > 1.96*x.std() )
#newdf

This prints out:

             Data          Time  Outlier
0        -0.704239      7.304021    False
1        -0.704239      7.352021    False
2        -0.704239      7.400021    False
3        -0.704239      7.448021    False
4        -0.825279      7.496021    False

In this sample of my data you can't see it, but there are maybe 300 outliers, and I want to remove them without modifying the original dataframe and then plot the two together as a comparison. My question is this: instead of printing out False/True, how can I just eliminate the rows where Outlier is True, so that I can eventually plot both versions in the same graph?

Code I have already tried:

newdf[np.abs(newdf.Data-newdf.Data.mean())<=(1.96*newdf.Data.std())]

newdf = df.copy()
def replace_outliers_with_nan(df, stdvs):
    newdf=pd.DataFrame()
    for i, col in enumerate(df.sites.unique()):
        df = pd.DataFrame(df[df.sites==col])
        idx = [np.abs(df-df.mean())<=(stdvs*df.std())] 
        df[idx==False]=np.nan  
        newdf[col] = df
    return newdf

Neither of these works: both return the same number of data points as my original dataframe, but I know that if the outliers had been removed there would be fewer points than in the original.

1 Answer

It seems you need boolean indexing with ~ to invert the condition, because you want to keep only the rows that are not outliers (and drop the outliers):

df1 = df[~df.groupby('Data').transform( lambda x: abs(x-x.mean()) > 1.96*x.std()).values]
print (df1)
       Data      Time
0 -0.704239  7.304021
1 -0.704239  7.352021
2 -0.704239  7.400021
3 -0.704239  7.448021
4 -0.825279  7.496021
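
For comparison, here is a minimal sketch of the same boolean-indexing idea without the groupby, assuming the outlier test is meant to use the mean and standard deviation of the whole Data column. The sample data, the injected spikes and the names keep, filtered and masked are made up for illustration and are not from the question. It also shows why the first attempt in the question seemed to do nothing: boolean indexing returns a new DataFrame, so the result has to be assigned to a name.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's data: 200 ordinary points plus 3 injected spikes.
rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=200)
data[[10, 50, 120]] = [25.0, -30.0, 40.0]
df = pd.DataFrame({'Time': 7.304021 + 0.048 * np.arange(200), 'Data': data})

# True where a point lies within 1.96 standard deviations of the column mean.
keep = (df['Data'] - df['Data'].mean()).abs() <= 1.96 * df['Data'].std()

# Boolean indexing returns a new DataFrame; df itself is left unchanged,
# so the result must be assigned to be of any use.
filtered = df[keep]

# Alternative: keep every row but replace outlying values with NaN,
# so the frame keeps its original length.
masked = df.assign(Data=df['Data'].where(keep))

print(len(df), len(filtered), int(masked['Data'].isna().sum()))
```

Because df is never modified, df and filtered can later be drawn on the same axes to compare the data before and after removal.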

5 Comments

When I tried your answer I got a ValueError: raise ValueError(msg.format(dtype=dt)) ValueError: Boolean array expected for the condition, not float64
What does print (df.groupby('Data').transform( lambda x: abs(x-x.mean()) > 1.96*x.std())) return? Isn't it a Series of True and False?
I found the problem: you need .values to convert the Series to a numpy array.
I'm a little confused by your wording. Yes, print (df.groupby('Data').transform( lambda x: abs(x-x.mean()) > 1.96*x.std())) returns a True or False series for my 'Time' column and nothing for the 'Data' column. But I already had that with Data = newdf.groupby('Data') newdf[np.abs(newdf.Data-newdf.Data.mean())<=(3*newdf.Data.std())] newdf['Outlier'] = Data.transform( lambda x: abs(x-x.mean()) > 1.96*x.std() ). I'm looking to remove the outliers from the 'Data' column.
Yes, thank you! Sorry, I didn't see your edit before my last comment.
