Suppose we have a DataFrame:
data = {'person_id': ['person_a', 'person_a', 'person_b','person_b', 'person_c','person_c'],
'categorical_data': ['new', 'new', 'ok', 'bad', 'new', 'bad']}
df = pd.DataFrame(data)
person_id categorical_data
0 person_a new
1 person_a new
2 person_b ok
3 person_b bad
4 person_c new
5 person_c bad
I want to expand the categorical data into multiple columns with counts of each category.
We can group by the person id to get counts:
count_categories = df.groupby('person_id')['categorical_data'].value_counts().reset_index(name='count')
person_id categorical_data count
0 person_a new 2
1 person_b bad 1
2 person_b ok 1
3 person_c bad 1
4 person_c new 1
Then I tried this to create the new columns:
pivoted = count_categories.set_index(['person_id','categorical_data']).unstack('categorical_data')
count
categorical_data bad new ok
person_id
person_a NaN 2.0 NaN
person_b 1.0 NaN 1.0
person_c 1.0 1.0 NaN
This is the form I want, but I'm confused by the MultiIndexing
How can I get rid of the index, or is there a better way to do this? trying reset index yields:
pivoted.reset_index()
person_id count
categorical_data bad new ok
0 person_a NaN 2.0 NaN
1 person_b 1.0 NaN 1.0
2 person_c 1.0 1.0 NaN