How to use Groupby and Aggregate with pandas in python ?

August 11, 2021    /    Viewed: 881    /    Comments: 0    /    Edit


Testing Groupby and Aggregate operations for exploratory data analysis with pandas. Please feel free to download and edit it as you like.

Note: you can also download my jupyter notebook that I created to test Groupby and Aggregate with pandas in python

Introduction

Import pandas

import pandas as pd

GroupBy

How to create a dataframe with pandas

Lets first create a simple dataframe

data = {'Age':[21,26,82,15,28],
                'weight':[120,148,139,156,129],
             'Gender':['male','male','female','male','female'],
             'Country':['France','USA','USA','Germany','USA']}

df = pd.DataFrame(data=data)

gives

     Age  weight  Gender  Country
0   21     120    male   France
1   26     148    male      USA
2   82     139  female      USA
3   15     156    male  Germany
4   28     129  female      USA
How to use group by

To group by "Gender" for example, a solution is to use pandas.DataFrame.groupby

df.groupby(by="Gender").mean()

returns

                            Age      weight
Gender                       
female  55.000000  134.000000
male    20.666667  141.333333
How To use group by with 2 columns

To group by Gender and Country:

df.groupby(["Gender",'Country']).mean()

returns

                Age  weight
Gender Country             
female USA       55     134
male   France    21     120
       Germany   15     156
       USA       26     148

Note that inverting column order as no impact here:

df.groupby(["Country",'Gender']).mean()

print(df.groupby(["Country",'Gender']).mean())

returns

                Age  weight
Country Gender             
France  male     21     120
Germany male     15     156
USA     female   55     134
        male     26     148
How to only apply group for a specific column

How to ONLY group by Gender and Country for a specific column "Age" for example (selecting columns to apply group by should be faster if the dataframe has a lot of columns)

df.groupby(["Gender",'Country']).Age.mean()

returns

Gender  Country
female  USA        55
male    France     21
        Germany    15
        USA        26
Name: Age, dtype: int64

Same as doing

df.groupby(["Gender",'Country'])['Age'].mean()

also returns

Gender  Country
female  USA        55
male    France     21
        Germany    15
        USA        26
Name: Age, dtype: int64
How to sort values for a given column after applying group by :
df.groupby(["Country",'Gender']).mean().sort_values(by='Age')

print(df.groupby(["Country",'Gender']).mean().sort_values(by='Age'))

returns

                Age  weight
Country Gender             
Germany male     15     156
France  male     21     120
USA     male     26     148
        female   55     134
How to use describe() with group by
df.groupby(by="Gender").describe()

returns

                Age                                                                                        weight  
             count       mean        std   min   25%   50%   75%   max  count    mean        std    min    25%    50%    75%    max  
Gender                                                                    
female   2.0  55.000000  38.183766  28.0  41.5  55.0  68.5  82.0    2.0   134.000000   7.071068  129.0  131.5  134.0  136.5  139.0  
male     3.0  20.666667   5.507571  15.0  18.0  21.0  23.5  26.0    3.0   141.333333  18.903263  120.0  134.0  148.0  152.0  156.0

How to combine group by and aggregate

df.groupby(by="Gender").agg('mean')

print(df.groupby(by="Gender").agg('mean'))

              Age      weight
Gender                       
female  55.000000  134.000000
male    20.666667  141.333333

Advantage: possible to define multiple types of aggreation (mean, count, etc)

df.groupby(by="Gender").agg(['mean','count','sum','min','max'])

print(df.groupby(by="Gender").agg(['mean','count','sum','min','max']))

              Age                         weight                     
             mean count  sum min max        mean count  sum  min  max
Gender                                                               
female  55.000000     2  110  28  82  134.000000     2  268  129  139
male    20.666667     3   62  15  26  141.333333     3  424  120  156

Testing on the titanic database

Now lets test what we learned previously on the well-known titanic dataset:

df = pd.read_csv('https://raw.githubusercontent.com/benjamin-hg-marchant/teaching/main/datasets/titanic.csv')

df.head()

returns

     PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3

                                                                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0

     Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S
Dataset quick E.D.A

Get column names:

df.columns

gives

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
             'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
            dtype='object')

Plot Age distribution

df['Age'].plot(kind="hist")

How to use Groupby and Aggregate with pandas in python ?

df['Age'].mean()

returns

29.69911764705882

Get value_counts()

df['Sex'].value_counts(dropna=False)

returns

male      577
female    314
Name: Sex, dtype: int64

Get age mean for male

df['Age'][ df['Sex'] == "male" ].mean()

df['Age'][ df['Sex'] == "male" ].hist()

returns

30.72664459161148

and

How to use Groupby and Aggregate with pandas in python ?

Get age mean for female

df['Age'][ df['Sex'] == "female" ].mean()

df['Age'][ df['Sex'] == "female" ].hist()

returns

27.915708812260537

and

How to use Groupby and Aggregate with pandas in python ?

Group by on 'Survived' and 'Sex' columns and then get 'Age' and 'Fare' mean:
df.groupby(['Survived','Sex'])[['Age','Fare']].mean()

returns

                                                 Age       Fare
    Survived Sex                         
    0        female  25.046875  23.024385
                     male    31.618056  21.960993
    1        female  28.847716  51.938573
                     male    27.276022  40.821484
Group by on 'Survived' and 'Sex' columns and then get 'Age' mean:
df.groupby(['Survived','Sex']).Age.mean()

returns

Survived  Sex   
0         female    25.046875
          male      31.618056
1         female    28.847716
          male      27.276022
Name: Age, dtype: float64
Group by on 'Pclass' columns and then get 'Survived' mean (faster approach):
df[['Pclass','Survived']].groupby(['Pclass']).mean()

returns

        Survived
Pclass          
1       0.629630
2       0.472826
3       0.242363
Group by on 'Pclass' columns and then get 'Survived' mean (slower that previously approach):
df.groupby(['Pclass'])[['Survived']].mean()

returns

        Survived
Pclass          
1       0.629630
2       0.472826
3       0.242363
Group by on 'Survived' and 'Sex' and then apply describe() to age
df.groupby(['Survived','Sex']).Age.describe()

returns

                 count       mean        std   min    25%   50%    75%   max
Survived Sex                                                                
0        female   64.0  25.046875  13.618591  2.00  16.75  24.5  33.25  57.0
         male    360.0  31.618056  14.056019  1.00  21.75  29.0  39.25  74.0
1        female  197.0  28.847716  14.175073  0.75  19.00  28.0  38.00  63.0
         male     93.0  27.276022  16.504803  0.42  18.00  28.0  36.00  80.0
Group by on 'Survived' and 'Sex' and then aggregate (mean, max, min) age and fate
df.groupby(['Survived','Sex'])[['Age','Fare']].agg(['mean','max','min'])

returns

                       Age                   Fare                 
                      mean   max   min       mean       max    min
Survived Sex                                                      
0        female  25.046875  57.0  2.00  23.024385  151.5500  6.750
         male    31.618056  74.0  1.00  21.960993  263.0000  0.000
1        female  28.847716  63.0  0.75  51.938573  512.3292  7.225
         male    27.276022  80.0  0.42  40.821484  512.3292  0.000
Group by on Survived and get age mean
df.groupby(['Survived']).Age.mean()

returns

Survived
0    30.626179
1    28.343690
Group by on Survived and get fare mean
df.groupby(['Survived']).Fare.mean()

returns

Survived
0    22.117887
1    48.395408
Name: Fare, dtype: float64

References