1

Consider the following DataFrame:

import pandas as pd
arrays = [['A','A','B','B','C','C'],[1,1,3,3,5,5,],[2,2,4,4,6,6],[0.1,0.2,0.3,0.4,0.5,0.6]]
index = pd.MultiIndex.from_arrays(arrays,names=('Sample','P1','P2','T'))
data = np.random.rand(10,6)
df = pd.DataFrame(columns=index,data=data)

I want to select: for sample A, the column with T=0.2, and for sample C, the column with T=0.5.

I can easily select each of the single columns, e.g.:

df.loc[:,('A',slice(None),slice(None),0.2)]  # or
df.loc(axis=1)[('C',slice(None),slice(None),0.5)] 

But how can I combine them? I tried supplying a list of tuples:

df.loc[:,[('A',slice(None),slice(None),0.2),('C',slice(None),slice(None),0.5)]]

But that generates an error.

How can I select my columns without resorting to pd.concat?

3
  • 2
    unrelated: having floats as column names might not be a good idea. Commented Sep 14, 2024 at 2:05
  • 2
    FWIW, query is very handy: df.T.query('(Sample == "A" & T==0.2) or (Sample == "C" & T==0.5)').T Commented Sep 14, 2024 at 2:08
  • Thank you Quang for your answer. query is not a command I have used so far, but this answer is intuitive and clean, except for the need to transpose. Floats as index values are a fact of life. Commented Sep 14, 2024 at 16:51

2 Answers 2

2

use boolean indexing

out = df.loc[:, df.columns.droplevel([1, 2]).isin([('A', 0.2), ('C', 0.5)])]

out:

Sample         A         C
P1             1         5
P2             2         6
T            0.2       0.5
0       0.836079  0.368242
1       0.870087  0.520477
2       0.582020  0.105908
3       0.736918  0.324141
4       0.386489  0.613063
5       0.969809  0.358152
6       0.325047  0.958949
7       0.995300  0.474698
8       0.674752  0.949571
9       0.622846  0.878193
Sign up to request clarification or add additional context in comments.

1 Comment

I like this answer because it's concise and clear. It also reflects the spirit of the questions, a list-like selection. I am going to make this the preferred answer.
2

@QuangHoang has suggested an easy way with query in the comments. Another option, with a third party library- pyjanitor - allows for multiple selections on a MultiIndex with a dictionary, using the select method. I am a contributor to the library:

# pip install pyjanitor
import pandas as pd
import janitor

df.select(columns=[{'Sample':'A', 'T':0.2}, {'Sample':'C', 'T':0.5}])

Sample         A         C
P1             1         5
P2             2         6
T            0.2       0.5
0       0.240285  0.966722
1       0.537845  0.605961
2       0.581185  0.680213
3       0.504225  0.158793
4       0.673914  0.020176
5       0.528920  0.616641
6       0.189066  0.612055
7       0.766254  0.161062
8       0.077120  0.593749
9       0.922863  0.217441

As shown in the example code above, the key of the dictionary is the level in the MultiIndex, while the value is the actual label you wish to select.

Another option is to extract the indices with a boolean array, before selecting with loc. The advantage of this is clarity - you know exactly what you are selecting, and you have more control:

cond1=(df.columns.get_level_values('Sample')=='A') & (df.columns.get_level_values('T')==0.2)
cond2=(df.columns.get_level_values('Sample')=='C') & (df.columns.get_level_values('T')==0.5)
indexer=df.columns[cond1|cond2]
indexer
MultiIndex([('A', 1, 2, 0.2),
            ('C', 5, 6, 0.5)],
           names=['Sample', 'P1', 'P2', 'T'])
df.loc[:, indexer]
Sample         A         C
P1             1         5
P2             2         6
T            0.2       0.5
0       0.240285  0.966722
1       0.537845  0.605961
2       0.581185  0.680213
3       0.504225  0.158793
4       0.673914  0.020176
5       0.528920  0.616641
6       0.189066  0.612055
7       0.766254  0.161062
8       0.077120  0.593749
9       0.922863  0.217441

Another option is a combination pandas' xs method and concat function:

pd.concat([df.xs(key=0.2, level='T', axis=1, drop_level=False), df.xs(key=0.5,level='T',axis=1, drop_level=False)],axis=1)
Out[9]:
Sample         A         C
P1             1         5
P2             2         6
T            0.2       0.5
0       0.427995  0.257755
1       0.520026  0.370271
2       0.280932  0.402916
3       0.901517  0.337662
4       0.179267  0.399471
...          ...       ...
199995  0.620268  0.215945
199996  0.420360  0.544633
199997  0.793744  0.153501
199998  0.253700  0.588854
199999  0.201630  0.177926

[200000 rows x 2 columns]

a look at performance of the various solutions:

# let's increase size of the dataframe
In [54]: df = pd.concat([df]*20_000, ignore_index=True)

In [55]: df.shape
Out[55]: (200000, 6)
# manual form
In [56]: %%timeit
    ...: cond1=(df.columns.get_level_values('Sample')=='A') & (df.columns.get_level_values('T')==0.2)
    ...: cond2=(df.columns.get_level_values('Sample')=='C') & (df.columns.get_level_values('T')==0.5)
    ...: indexer=df.columns[cond1|cond2]
    ...: df.loc[:, indexer]
    ...:
    ...:
680 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# third party library - pyjanitor
In [57]: %timeit df.select(columns=[{'Sample':'A', 'T':0.2}, {'Sample':'C', 'T':0.5}])
603 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

# panda kim's solution
In [58]: %timeit df.loc[:, df.columns.droplevel([1, 2]).isin([('A', 0.2), ('C', 0.5)])]
753 µs ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# xs + concat
%timeit pd.concat([df.xs(key=0.2, level='T', axis=1, drop_level=False), df.xs(key=0.5,level='T',axis=1, drop_level=False)],axis=1)
1.03 ms ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# query seems expensive - not sure why
In [59]: %timeit df.T.query('(Sample == "A" & T==0.2) or (Sample == "C" & T==0.5)').T
109 ms ± 634 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

2 Comments

For a generic way using a list of indices: df.select(columns=[dict(zip(['Sample', 'T'], x)) for x in [('A', 0.2), ('C', 0.5)]])
Thank you for your exhaustive answer. The boolean query is the way to go for a more systematic kind of query, for example selecting all columns with a range of T values. I am not a fan of installing yet another extra package.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.