How to apply a one-hot encoding to a categorical data column of a pandas dataframe using get_dummies()?

Active February 08, 2022


Examples of how to apply a one-hot encoding to a categorical data column of a pandas dataframe in Python:

Create a dataframe with pandas

So, let's first create a fake dataset stored in a pandas dataframe:

import pandas as pd
import random
import numpy as np

categorical_data_1 = ['Paris', 'London', 'New-York', 'Berlin', 'Moscow']
categorical_data_2 = ['M', 'M', 'F', 'F', 'F']

# 25 rows of numerical data spread over 4 columns
data = np.arange(100)
data = data.reshape((25, 4))

# Draw a random city, gender and binary label for each row
location_list = [random.choice(categorical_data_1) for i in range(data.shape[0])]
meta_list = [random.choice(categorical_data_2) for i in range(data.shape[0])]
label_list = [random.choice([0, 1]) for i in range(data.shape[0])]

df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])

df['Location'] = location_list
df['Gender'] = meta_list
df['Label'] = label_list

print(df)

gives for example:

     A   B   C   D  Location Gender  Label
0    0   1   2   3    Moscow      M      1
1    4   5   6   7    Berlin      M      1
2    8   9  10  11    London      F      0
3   12  13  14  15  New-York      F      1
4   16  17  18  19    London      M      0
5   20  21  22  23     Paris      F      0
6   24  25  26  27  New-York      F      1
7   28  29  30  31  New-York      F      0
8   32  33  34  35     Paris      M      1
9   36  37  38  39    London      M      0
10  40  41  42  43    Moscow      F      1
11  44  45  46  47    Moscow      F      1
12  48  49  50  51    Moscow      M      1
13  52  53  54  55  New-York      M      1
14  56  57  58  59    Berlin      M      1
15  60  61  62  63  New-York      M      1
16  64  65  66  67    Moscow      F      0
17  68  69  70  71    Berlin      F      0
18  72  73  74  75    Berlin      M      0
19  76  77  78  79    London      F      0
20  80  81  82  83  New-York      M      1
21  84  85  86  87    Moscow      M      0
22  88  89  90  91    Moscow      M      1
23  92  93  94  95  New-York      F      0
24  96  97  98  99  New-York      M      0
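Note that the code above draws the cities, genders and labels with random.choice without setting a seed, so the dataframe will differ between runs. If you want a reproducible fake dataset, you can seed the random module first; a minimal sketch:

```python
import random
import numpy as np
import pandas as pd

random.seed(0)  # make random.choice reproducible between runs

cities = ['Paris', 'London', 'New-York', 'Berlin', 'Moscow']
data = np.arange(100).reshape((25, 4))

df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])
df['Location'] = [random.choice(cities) for _ in range(len(df))]

# With the same seed, the same sequence of cities is drawn every time
print(df['Location'].head())
```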

Split the dataframe

Let's split the dataframe into two dataframes:

# Randomly sample 80% of the rows for training (random_state makes it reproducible)
df_train = df.sample(frac=0.8, random_state=42)

# The remaining 20% of the rows form the test set
df_test = df.drop(df_train.index)

print(df_train)
print(df_test)

gives

     A   B   C   D  Location Gender  Label
8   32  33  34  35     Paris      M      1
16  64  65  66  67    Moscow      F      0
0    0   1   2   3    Moscow      M      1
23  92  93  94  95  New-York      F      0
11  44  45  46  47    Moscow      F      1
9   36  37  38  39    London      M      0
13  52  53  54  55  New-York      M      1
1    4   5   6   7    Berlin      M      1
22  88  89  90  91    Moscow      M      1
5   20  21  22  23     Paris      F      0
2    8   9  10  11    London      F      0
12  48  49  50  51    Moscow      M      1
15  60  61  62  63  New-York      M      1
3   12  13  14  15  New-York      F      1
4   16  17  18  19    London      M      0
20  80  81  82  83  New-York      M      1
17  68  69  70  71    Berlin      F      0
21  84  85  86  87    Moscow      M      0
18  72  73  74  75    Berlin      M      0
24  96  97  98  99  New-York      M      0

and

     A   B   C   D  Location Gender  Label
6   24  25  26  27  New-York      F      1
7   28  29  30  31  New-York      F      0
10  40  41  42  43    Moscow      F      1
14  56  57  58  59    Berlin      M      1
19  76  77  78  79    London      F      0

respectively.
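A quick way to sanity-check this kind of split is to verify that the train and test indices are disjoint and together cover the whole dataframe; a minimal sketch:

```python
import numpy as np
import pandas as pd

# Rebuild a small dataframe similar to the one above (categorical columns omitted)
df = pd.DataFrame(np.arange(100).reshape((25, 4)), columns=['A', 'B', 'C', 'D'])

# Same split as above: 80% train, reproducible with random_state
df_train = df.sample(frac=0.8, random_state=42)
df_test = df.drop(df_train.index)

# Train and test indices are disjoint and together cover all rows
assert df_train.index.intersection(df_test.index).empty
assert len(df_train) + len(df_test) == len(df)

print(len(df_train), len(df_test))  # 20 5
```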

One-Hot Encoding using pandas get_dummies()

To encode categorical data, one solution is to use pandas.get_dummies():

df_train_e = pd.get_dummies(df_train, columns=['Location','Gender'], prefix=['Location','Gender'])

print(df_train_e)

gives

     A   B   C   D  Label  Location_Berlin  Location_London  Location_Moscow  \
8   32  33  34  35      1                0                0                0   
16  64  65  66  67      0                0                0                1   
0    0   1   2   3      1                0                0                1   
23  92  93  94  95      0                0                0                0   
11  44  45  46  47      1                0                0                1   
9   36  37  38  39      0                0                1                0   
13  52  53  54  55      1                0                0                0   
1    4   5   6   7      1                1                0                0   
22  88  89  90  91      1                0                0                1   
5   20  21  22  23      0                0                0                0   
2    8   9  10  11      0                0                1                0   
12  48  49  50  51      1                0                0                1   
15  60  61  62  63      1                0                0                0   
3   12  13  14  15      1                0                0                0   
4   16  17  18  19      0                0                1                0   
20  80  81  82  83      1                0                0                0   
17  68  69  70  71      0                1                0                0   
21  84  85  86  87      0                0                0                1   
18  72  73  74  75      0                1                0                0   
24  96  97  98  99      0                0                0                0

    Location_New-York  Location_Paris  Gender_F  Gender_M  
8                   0               1         0         1  
16                  0               0         1         0  
0                   0               0         0         1  
23                  1               0         1         0  
11                  0               0         1         0  
9                   0               0         0         1  
13                  1               0         0         1  
1                   0               0         0         1  
22                  0               0         0         1  
5                   0               1         1         0  
2                   0               0         1         0  
12                  0               0         0         1  
15                  1               0         0         1  
3                   1               0         1         0  
4                   0               0         0         1  
20                  1               0         0         1  
17                  0               0         1         0  
21                  0               0         0         1  
18                  0               0         0         1  
24                  1               0         0         1
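get_dummies() also accepts a few useful options. For instance, drop_first=True drops one level per encoded column (avoiding redundant, perfectly collinear dummy columns, which can matter for linear models), and dtype controls the output type (recent pandas versions return boolean dummies by default). A small sketch on a toy dataframe:

```python
import pandas as pd

df = pd.DataFrame({'Location': ['Paris', 'London', 'Paris'],
                   'Gender': ['M', 'F', 'F']})

# drop_first=True removes one level per column; dtype=int gives 0/1 values
df_e = pd.get_dummies(df, columns=['Location', 'Gender'],
                      drop_first=True, dtype=int)

print(df_e.columns.tolist())            # ['Location_Paris', 'Gender_M']
print(df_e['Location_Paris'].tolist())  # [1, 0, 1]
```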

Apply the encoding to another dataframe

Now let's see how to apply the encoding to another dataframe (called df_test here):

df_test_e = pd.get_dummies(df_test, columns=['Location','Gender'], prefix=['Location','Gender'])

# Align the columns with the training set: any dummy column missing from
# the test set is added and filled with 0
df_test_e = df_test_e.reindex(columns=df_train_e.columns, fill_value=0)

print(df_test_e)

gives then

     A   B   C   D  Label  Location_Berlin  Location_London  Location_Moscow  \
6   24  25  26  27      1                0                0                0   
7   28  29  30  31      0                0                0                0   
10  40  41  42  43      1                0                0                1   
14  56  57  58  59      1                1                0                0   
19  76  77  78  79      0                0                1                0

    Location_New-York  Location_Paris  Gender_F  Gender_M  
6                   1               0         1         0  
7                   1               0         1         0  
10                  0               0         1         0  
14                  0               0         0         1  
19                  0               0         1         0
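An alternative to the reindex() trick is to cast the column to a pandas categorical dtype with the full, fixed list of categories before encoding; get_dummies() then emits one column per category even when some categories are absent from the data. A minimal sketch:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Fixed category list shared by train and test
locations = CategoricalDtype(['Berlin', 'London', 'Moscow', 'New-York', 'Paris'])

df_test = pd.DataFrame({'Location': ['Berlin']})

# With a categorical dtype, get_dummies emits one column per category,
# even for categories absent from df_test
df_test['Location'] = df_test['Location'].astype(locations)
df_test_e = pd.get_dummies(df_test)

print(df_test_e.columns.tolist())
# ['Location_Berlin', 'Location_London', 'Location_Moscow',
#  'Location_New-York', 'Location_Paris']
```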
