
I have a DataFrame:

dfs = pd.read_csv(StringIO("""
      datetime        ID  C_1 C_2  C_3   C_4 C_5 C_6
"18/06/2023 3:51:50"  136 101 2024  89    4   3   13
"18/06/2023 3:51:52"  136 101 2028  61    4   3   18
"18/06/2023 3:51:53"  24  101 2029  65    0   0   0
"18/06/2023 3:51:53"  24  102 2022  89    0   0   0
"18/06/2023 3:51:54"  136 102 2045  66    2   3   4
"18/06/2023 3:51:55"  0   101 2022  89    0   0   0
"18/06/2023 3:51:56"  136 101 2222  77    0   0   0
"18/06/2023 3:51:56"  24  102 2022  89    0   0   0
"18/06/2023 3:51:57"  136 101 2024  90    0   0   0
"18/06/2023 3:51:57"  24  101 2026  87    0   1   8
"18/06/2023 3:51:58"  0   102 2045  44    43  42  41
"18/06/2023 3:51:59"  24  102 2043  33    0   1   8
"18/06/2023 3:52:01"  24  101 2022  89    1   4   76
"18/06/2023 3:52:03"  24  102 2046  31    0   1   6
"18/06/2023 3:52:18"  136 101 3333  99    0   1   87
"18/06/2023 3:52:54"  136 102 2045  66    2   3   4
"""), sep="\s+")

Is there a way to get the first and the last row (one pair for ID=136 and one pair for ID=24) for every distinct C_1?

The code below works as expected, but I'm looking for a simpler and faster solution:

filter_1 = dfs['ID'].isin([136])  # ID is parsed as an integer column, so compare with ints
filter_2 = dfs['ID'].isin([24])
test_df1 = dfs.loc[filter_1, :]
test_df2 = dfs.loc[filter_2, :]
g1 = test_df1.groupby('C_1')
g2 = test_df2.groupby('C_1')
final_df1 = pd.concat([g1.head(1), g1.tail(1)]).drop_duplicates().sort_values('C_1').reset_index(drop=True)
final_df2 = pd.concat([g2.head(1), g2.tail(1)]).drop_duplicates().sort_values('C_1').reset_index(drop=True)
#merge final_df1 & final_df2
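For the commented merge step, one possible sketch (rebuilding the two partial results from the sample data, then concatenating and re-sorting by C_1 and timestamp to match the output below):

```python
import pandas as pd
from io import StringIO

# Same sample data as above; ID is parsed as an integer column
dfs = pd.read_csv(StringIO('''
      datetime        ID  C_1 C_2  C_3   C_4 C_5 C_6
"18/06/2023 3:51:50"  136 101 2024  89    4   3   13
"18/06/2023 3:51:52"  136 101 2028  61    4   3   18
"18/06/2023 3:51:53"  24  101 2029  65    0   0   0
"18/06/2023 3:51:53"  24  102 2022  89    0   0   0
"18/06/2023 3:51:54"  136 102 2045  66    2   3   4
"18/06/2023 3:51:55"  0   101 2022  89    0   0   0
"18/06/2023 3:51:56"  136 101 2222  77    0   0   0
"18/06/2023 3:51:56"  24  102 2022  89    0   0   0
"18/06/2023 3:51:57"  136 101 2024  90    0   0   0
"18/06/2023 3:51:57"  24  101 2026  87    0   1   8
"18/06/2023 3:51:58"  0   102 2045  44    43  42  41
"18/06/2023 3:51:59"  24  102 2043  33    0   1   8
"18/06/2023 3:52:01"  24  101 2022  89    1   4   76
"18/06/2023 3:52:03"  24  102 2046  31    0   1   6
"18/06/2023 3:52:18"  136 101 3333  99    0   1   87
"18/06/2023 3:52:54"  136 102 2045  66    2   3   4
'''), sep=r"\s+")

# First and last row per C_1, separately for each ID
g1 = dfs[dfs['ID'].eq(136)].groupby('C_1')
g2 = dfs[dfs['ID'].eq(24)].groupby('C_1')
final_df1 = pd.concat([g1.head(1), g1.tail(1)]).drop_duplicates()
final_df2 = pd.concat([g2.head(1), g2.tail(1)]).drop_duplicates()

# Combine both partial results and order by C_1, then timestamp
final_df = (pd.concat([final_df1, final_df2])
            .sort_values(['C_1', 'datetime'])
            .reset_index(drop=True))
print(final_df)
```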

Output -

      datetime        ID  C_1 C_2  C_3   C_4 C_5 C_6
"18/06/2023 3:51:50"  136 101 2024  89    4   3   13
"18/06/2023 3:51:53"  24  101 2029  65    0   0   0
"18/06/2023 3:52:01"  24  101 2022  89    1   4   76
"18/06/2023 3:52:18"  136 101 3333  99    0   1   87
"18/06/2023 3:51:53"  24  102 2022  89    0   0   0
"18/06/2023 3:51:54"  136 102 2045  66    2   3   4
"18/06/2023 3:52:03"  24  102 2046  31    0   1   6
"18/06/2023 3:52:54"  136 102 2045  66    2   3   4

1 Answer

You could use cumcount and boolean indexing:

N = 1 # number of rows to keep per ID/C_1
g = dfs.groupby(['ID', 'C_1'])
out = dfs[g.cumcount().lt(N) | g.cumcount(ascending=False).lt(N)]

If you also want to filter on the ID:

N = 1
g = dfs.groupby(['ID', 'C_1'])
m = dfs['ID'].isin([24, 136])  # ID is an integer column here
out = dfs[m & (g.cumcount().lt(N) | g.cumcount(ascending=False).lt(N))]

Output:

              datetime   ID  C_1   C_2  C_3  C_4  C_5  C_6
0   18/06/2023 3:51:50  136  101  2024   89    4    3   13
2   18/06/2023 3:51:53   24  101  2029   65    0    0    0
3   18/06/2023 3:51:53   24  102  2022   89    0    0    0
4   18/06/2023 3:51:54  136  102  2045   66    2    3    4
12  18/06/2023 3:52:01   24  101  2022   89    1    4   76
13  18/06/2023 3:52:03   24  102  2046   31    0    1    6
14  18/06/2023 3:52:18  136  101  3333   99    0    1   87
15  18/06/2023 3:52:54  136  102  2045   66    2    3    4
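Not part of the original answer, but worth noting as a sketch: in recent pandas (2.0+), GroupBy.nth accepts a list of positions and acts as a filter that keeps the original index, so the same first/last selection can be written in one call:

```python
import pandas as pd
from io import StringIO

dfs = pd.read_csv(StringIO('''
      datetime        ID  C_1 C_2  C_3   C_4 C_5 C_6
"18/06/2023 3:51:50"  136 101 2024  89    4   3   13
"18/06/2023 3:51:52"  136 101 2028  61    4   3   18
"18/06/2023 3:51:53"  24  101 2029  65    0   0   0
"18/06/2023 3:51:53"  24  102 2022  89    0   0   0
"18/06/2023 3:51:54"  136 102 2045  66    2   3   4
"18/06/2023 3:51:55"  0   101 2022  89    0   0   0
"18/06/2023 3:51:56"  136 101 2222  77    0   0   0
"18/06/2023 3:51:56"  24  102 2022  89    0   0   0
"18/06/2023 3:51:57"  136 101 2024  90    0   0   0
"18/06/2023 3:51:57"  24  101 2026  87    0   1   8
"18/06/2023 3:51:58"  0   102 2045  44    43  42  41
"18/06/2023 3:51:59"  24  102 2043  33    0   1   8
"18/06/2023 3:52:01"  24  101 2022  89    1   4   76
"18/06/2023 3:52:03"  24  102 2046  31    0   1   6
"18/06/2023 3:52:18"  136 101 3333  99    0   1   87
"18/06/2023 3:52:54"  136 102 2045  66    2   3   4
'''), sep=r"\s+")

m = dfs['ID'].isin([24, 136])
# nth([0, -1]) keeps the first and the last row of each (ID, C_1) group
out = dfs[m].groupby(['ID', 'C_1']).nth([0, -1])
print(out)
```

With older pandas the return shape of nth differs (group keys end up in the index), so the cumcount approach above is the more version-robust choice.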

Intermediates (g is cumcount(), g_desc is cumcount(ascending=False), isin is the ID filter, and selection is the final mask):

              datetime   ID  C_1   C_2  C_3  C_4  C_5  C_6  g  g_desc   isin  selection
0   18/06/2023 3:51:50  136  101  2024   89    4    3   13  0       4   True       True
1   18/06/2023 3:51:52  136  101  2028   61    4    3   18  1       3   True      False
2   18/06/2023 3:51:53   24  101  2029   65    0    0    0  0       2   True       True
3   18/06/2023 3:51:53   24  102  2022   89    0    0    0  0       3   True       True
4   18/06/2023 3:51:54  136  102  2045   66    2    3    4  0       1   True       True
5   18/06/2023 3:51:55    0  101  2022   89    0    0    0  0       0  False      False
6   18/06/2023 3:51:56  136  101  2222   77    0    0    0  2       2   True      False
7   18/06/2023 3:51:56   24  102  2022   89    0    0    0  1       2   True      False
8   18/06/2023 3:51:57  136  101  2024   90    0    0    0  3       1   True      False
9   18/06/2023 3:51:57   24  101  2026   87    0    1    8  1       1   True      False
10  18/06/2023 3:51:58    0  102  2045   44   43   42   41  0       0  False      False
11  18/06/2023 3:51:59   24  102  2043   33    0    1    8  2       1   True      False
12  18/06/2023 3:52:01   24  101  2022   89    1    4   76  2       0   True       True
13  18/06/2023 3:52:03   24  102  2046   31    0    1    6  3       0   True       True
14  18/06/2023 3:52:18  136  101  3333   99    0    1   87  4       0   True       True
15  18/06/2023 3:52:54  136  102  2045   66    2    3    4  1       0   True       True
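The intermediate columns above can be reproduced with something along these lines (debug is just an illustrative name, not from the answer):

```python
import pandas as pd
from io import StringIO

dfs = pd.read_csv(StringIO('''
      datetime        ID  C_1 C_2  C_3   C_4 C_5 C_6
"18/06/2023 3:51:50"  136 101 2024  89    4   3   13
"18/06/2023 3:51:52"  136 101 2028  61    4   3   18
"18/06/2023 3:51:53"  24  101 2029  65    0   0   0
"18/06/2023 3:51:53"  24  102 2022  89    0   0   0
"18/06/2023 3:51:54"  136 102 2045  66    2   3   4
"18/06/2023 3:51:55"  0   101 2022  89    0   0   0
"18/06/2023 3:51:56"  136 101 2222  77    0   0   0
"18/06/2023 3:51:56"  24  102 2022  89    0   0   0
"18/06/2023 3:51:57"  136 101 2024  90    0   0   0
"18/06/2023 3:51:57"  24  101 2026  87    0   1   8
"18/06/2023 3:51:58"  0   102 2045  44    43  42  41
"18/06/2023 3:51:59"  24  102 2043  33    0   1   8
"18/06/2023 3:52:01"  24  101 2022  89    1   4   76
"18/06/2023 3:52:03"  24  102 2046  31    0   1   6
"18/06/2023 3:52:18"  136 101 3333  99    0   1   87
"18/06/2023 3:52:54"  136 102 2045  66    2   3   4
'''), sep=r"\s+")

N = 1
g = dfs.groupby(['ID', 'C_1'])
debug = dfs.assign(
    g=g.cumcount(),                        # position within the group, counting forward
    g_desc=g.cumcount(ascending=False),    # position counting backward (0 = last row)
    isin=dfs['ID'].isin([24, 136]),        # ID filter
)
debug['selection'] = debug['isin'] & (debug['g'].lt(N) | debug['g_desc'].lt(N))
print(debug)
```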

1 Comment

Thanks for the solution. It's working as expected.
