Multiple random selection from MultiIndex

Question

Consider the following DataFrame:

import pandas as pd
arrays = [['A','A','B','B','C','C'],[1,1,3,3,5,5,],[2,2,4,4,6,6],[0.1,0.2,0.3,0.4,0.5,0.6]]
index = pd.MultiIndex.from_arrays(arrays,names=('Sample','P1','P2','T'))
data = np.random.rand(10,6)
df = pd.DataFrame(columns=index,data=data)

I want to select: for sample A, the column with T=0.2, and for sample C, the column with T=0.5.

I can easily select each of the single columns, e.g.:

df.loc[:,('A',slice(None),slice(None),0.2)]  # or
df.loc(axis=1)[('C',slice(None),slice(None),0.5)]

But how can I combine them? I tried supplying a list of tuples:

df.loc[:,[('A',slice(None),slice(None),0.2),('C',slice(None),slice(None),0.5)]]

But that generates an error.

How can I select my columns without resorting to pd.concat?

unrelated: having floats as column names might not be a good idea. — Quang Hoang
– Quang Hoang, Commented Sep 14, 2024 at 2:05
FWIW, query is very handy: df.T.query('(Sample == "A" & T==0.2) or (Sample == "C" & T==0.5)').T — Quang Hoang
– Quang Hoang, Commented Sep 14, 2024 at 2:08
Thank you Quang for your answer. query is not a command I have used so far, but this answer is intuitive and clean, except for the need to transpose. Floats as index values are a fact of life. — germ
– germ, Commented Sep 14, 2024 at 16:51

Panda Kim · Accepted Answer · 2024-09-14 01:15:05Z

2

use boolean indexing

out = df.loc[:, df.columns.droplevel([1, 2]).isin([('A', 0.2), ('C', 0.5)])]

out:

Sample         A         C
P1             1         5
P2             2         6
T            0.2       0.5
0       0.836079  0.368242
1       0.870087  0.520477
2       0.582020  0.105908
3       0.736918  0.324141
4       0.386489  0.613063
5       0.969809  0.358152
6       0.325047  0.958949
7       0.995300  0.474698
8       0.674752  0.949571
9       0.622846  0.878193

answered Sep 14, 2024 at 1:15

Panda Kim

13.7k2 gold badges8 silver badges15 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

germ Over a year ago

I like this answer because it's concise and clear. It also reflects the spirit of the questions, a list-like selection. I am going to make this the preferred answer.

sammywemmy · Accepted Answer · 2024-09-14 04:13:34Z

@QuangHoang has suggested an easy way with query in the comments. Another option, with a third party library- pyjanitor - allows for multiple selections on a MultiIndex with a dictionary, using the select method. I am a contributor to the library:

# pip install pyjanitor
import pandas as pd
import janitor

df.select(columns=[{'Sample':'A', 'T':0.2}, {'Sample':'C', 'T':0.5}])

Sample         A         C
P1             1         5
P2             2         6
T            0.2       0.5
0       0.240285  0.966722
1       0.537845  0.605961
2       0.581185  0.680213
3       0.504225  0.158793
4       0.673914  0.020176
5       0.528920  0.616641
6       0.189066  0.612055
7       0.766254  0.161062
8       0.077120  0.593749
9       0.922863  0.217441

As shown in the example code above, the key of the dictionary is the level in the MultiIndex, while the value is the actual label you wish to select.

Another option is to extract the indices with a boolean array, before selecting with loc. The advantage of this is clarity - you know exactly what you are selecting, and you have more control:

cond1=(df.columns.get_level_values('Sample')=='A') & (df.columns.get_level_values('T')==0.2)
cond2=(df.columns.get_level_values('Sample')=='C') & (df.columns.get_level_values('T')==0.5)
indexer=df.columns[cond1|cond2]
indexer
MultiIndex([('A', 1, 2, 0.2),
            ('C', 5, 6, 0.5)],
           names=['Sample', 'P1', 'P2', 'T'])
df.loc[:, indexer]
Sample         A         C
P1             1         5
P2             2         6
T            0.2       0.5
0       0.240285  0.966722
1       0.537845  0.605961
2       0.581185  0.680213
3       0.504225  0.158793
4       0.673914  0.020176
5       0.528920  0.616641
6       0.189066  0.612055
7       0.766254  0.161062
8       0.077120  0.593749
9       0.922863  0.217441

Another option is a combination pandas' xs method and concat function:

pd.concat([df.xs(key=0.2, level='T', axis=1, drop_level=False), df.xs(key=0.5,level='T',axis=1, drop_level=False)],axis=1)
Out[9]:
Sample         A         C
P1             1         5
P2             2         6
T            0.2       0.5
0       0.427995  0.257755
1       0.520026  0.370271
2       0.280932  0.402916
3       0.901517  0.337662
4       0.179267  0.399471
...          ...       ...
199995  0.620268  0.215945
199996  0.420360  0.544633
199997  0.793744  0.153501
199998  0.253700  0.588854
199999  0.201630  0.177926

[200000 rows x 2 columns]

a look at performance of the various solutions:

# let's increase size of the dataframe
In [54]: df = pd.concat([df]*20_000, ignore_index=True)

In [55]: df.shape
Out[55]: (200000, 6)
# manual form
In [56]: %%timeit
    ...: cond1=(df.columns.get_level_values('Sample')=='A') & (df.columns.get_level_values('T')==0.2)
    ...: cond2=(df.columns.get_level_values('Sample')=='C') & (df.columns.get_level_values('T')==0.5)
    ...: indexer=df.columns[cond1|cond2]
    ...: df.loc[:, indexer]
    ...:
    ...:
680 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# third party library - pyjanitor
In [57]: %timeit df.select(columns=[{'Sample':'A', 'T':0.2}, {'Sample':'C', 'T':0.5}])
603 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

# panda kim's solution
In [58]: %timeit df.loc[:, df.columns.droplevel([1, 2]).isin([('A', 0.2), ('C', 0.5)])]
753 µs ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# xs + concat
%timeit pd.concat([df.xs(key=0.2, level='T', axis=1, drop_level=False), df.xs(key=0.5,level='T',axis=1, drop_level=False)],axis=1)
1.03 ms ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# query seems expensive - not sure why
In [59]: %timeit df.T.query('(Sample == "A" & T==0.2) or (Sample == "C" & T==0.5)').T
109 ms ± 634 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

For a generic way using a list of indices: df.select(columns=[dict(zip(['Sample', 'T'], x)) for x in [('A', 0.2), ('C', 0.5)]])
Thank you for your exhaustive answer. The boolean query is the way to go for a more systematic kind of query, for example selecting all columns with a range of T values. I am not a fan of installing yet another extra package.

Collectives™ on Stack Overflow

Multiple random selection from MultiIndex

2 Answers 2

1 Comment

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related