I have code that creates a dataframe using pandas:
import pandas as pd
import numpy as np
x = (g[0].time[:111673])
y = (g[0].data.f[:111673])
df = pd.DataFrame({'Time': x, 'Data': y})
#df
This prints out:
Data Time
0 -0.704239 7.304021
1 -0.704239 7.352021
2 -0.704239 7.400021
3 -0.704239 7.448021
4 -0.825279 7.496021
This is great, but I know there are outliers in this data that I want removed, so I created the dataframe below to flag them:
newdf = df.copy()
Data = newdf.groupby('Data')
newdf[np.abs(newdf.Data-newdf.Data.mean())<=(3*newdf.Data.std())]
newdf['Outlier'] = Data.transform( lambda x: abs(x-x.mean()) > 1.96*x.std() )
#newdf
This prints out:
Data Time Outlier
0 -0.704239 7.304021 False
1 -0.704239 7.352021 False
2 -0.704239 7.400021 False
3 -0.704239 7.448021 False
4 -0.825279 7.496021 False
You can't see it in this sample of my data, but there are maybe 300 outliers. I want to remove them without modifying the original dataframe, and then plot the original and the cleaned data together as a comparison. My question is this: instead of printing out False/True, how can I just eliminate the rows where Outlier is True, so that I can eventually plot both in the same graph?
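To make the goal concrete, here is a minimal sketch of the kind of result I am after. It assumes a synthetic stand-in for my data (a sine wave with three injected spikes, not the real `g[0]` values) and a 3-standard-deviation cutoff on the whole `Data` column; the name `cleaned` is just illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (assumption: not the actual g[0] values)
values = np.sin(np.linspace(0, 20, 1000))
values[[10, 500, 900]] = [15.0, -12.0, 20.0]  # inject three obvious outliers
df = pd.DataFrame({'Time': np.arange(1000) * 0.048, 'Data': values})

newdf = df.copy()

# Flag points more than 3 standard deviations from the column mean
deviation = (newdf['Data'] - newdf['Data'].mean()).abs()
newdf['Outlier'] = deviation > 3 * newdf['Data'].std()

# Keep only the rows where the flag is False; df itself is left untouched
cleaned = newdf[~newdf['Outlier']].drop(columns='Outlier')

print(len(df), len(cleaned))
```

Here `~` inverts the boolean column, so indexing with it keeps the non-outlier rows, and `cleaned` ends up shorter than `df`, which is what I would expect from my real data too.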
Code I have already tried:
newdf[np.abs(newdf.Data-newdf.Data.mean())<=(1.96*newdf.Data.std())]
newdf = df.copy()
def replace_outliers_with_nan(df, stdvs):
    newdf=pd.DataFrame()
    for i, col in enumerate(df.sites.unique()):
        df = pd.DataFrame(df[df.sites==col])
        idx = [np.abs(df-df.mean())<=(stdvs*df.std())]
        df[idx==False]=np.nan
        newdf[col] = df
    return newdf
Neither of these works: both return the same number of data points as my original dataframe, but I know that if the outliers had been removed, there would be fewer points than in the original.
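One possible explanation for the first attempt, assuming the filtered result was never assigned to anything: boolean indexing returns a new dataframe and leaves `newdf` unchanged. The sketch below uses a hypothetical 20-row sample (19 zeros and one spike), not my real data, to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical small sample, not the real data: 19 zeros and one spike
df = pd.DataFrame({'Time': np.arange(20) * 0.048,
                   'Data': [0.0] * 19 + [50.0]})
newdf = df.copy()

# True for rows within 1.96 standard deviations of the mean
mask = np.abs(newdf.Data - newdf.Data.mean()) <= 1.96 * newdf.Data.std()

newdf[mask]              # result is discarded, so newdf still has every row
n_before = len(newdf)

filtered = newdf[mask]   # assigning the result keeps the filtered rows
n_after = len(filtered)

print(n_before, n_after)
```

If that is what happened, then checking the length of `newdf` afterwards would always report the original row count, even though the expression itself selects fewer rows.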