Thoram Mastero February 2016

Random forest with character columns in scikit-learn/Python

I have a character column and several numeric columns. I want to convert the character column into categories and then apply a random forest classifier. I know that OneHotEncoder exists, but I could not find an example of how to use it. How can I convert the characters, e.g. a gender column containing 'f' and 'm', into integers such as 0 and 1?

Answers


Robin Spiess February 2016

Use LabelEncoder, which takes an array of strings and transforms it into an array of integers. The classes are sorted before they are numbered, so 'f' becomes 0 and 'm' becomes 1.

Example:

from sklearn.preprocessing import LabelEncoder
import pandas as pd

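# Small example frame with one numeric and one character column
data = pd.DataFrame()

data['age'] = [17,33,47]
data['gender'] = ['m','f','m']

enc = LabelEncoder()

print(data)
# Learn the distinct labels ('f', 'm'), then replace them with integer codes
enc.fit(data['gender'])
data['gender'] = enc.transform(data['gender'])
print(data)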

Output:

   age gender
0    17      m
1    33      f
2    47      m
   age  gender
0    17       1
1    33       0
2    47       1
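
To connect this back to the question, here is a minimal sketch of feeding the encoded column into a random forest. The target labels and the extra row are made up purely for illustration, and n_estimators and random_state are arbitrary choices, not recommendations.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

data = pd.DataFrame({
    'age': [17, 33, 47, 25],
    'gender': ['m', 'f', 'm', 'f'],
})
# Hypothetical target labels, invented for this example only
target = [0, 1, 1, 0]

# Encode the character column as integers ('f' -> 0, 'm' -> 1)
enc = LabelEncoder()
data['gender'] = enc.fit_transform(data['gender'])

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(data[['age', 'gender']], target)

# New rows must be encoded with the same fitted encoder before predicting
new = pd.DataFrame({'age': [30], 'gender': enc.transform(['f'])})
predictions = clf.predict(new)
print(predictions)
```

Reusing the fitted encoder for new data keeps the integer codes consistent between training and prediction.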


Charlie Haley February 2016

Alternatively, you can use pandas's get_dummies function, which one-hot encodes a string column directly, so no separate label-encoding step is needed.

In:

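import pandas as pd
# A Series gives plain column names (a, b, c); a DataFrame would prefix them
s = pd.Series(list('abca'))
# dtype=int keeps 0/1 output; recent pandas versions default to True/False
s = pd.get_dummies(s, dtype=int)
print(s)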

Out:

    a   b   c
0   1   0   0
1   0   1   0
2   0   0   1
3   1   0   0
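
Since the question mentions OneHotEncoder, here is a rough sketch of the scikit-learn equivalent. It assumes scikit-learn 0.20 or later, where OneHotEncoder accepts string columns directly; the frame is the same toy data as in the first answer.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame({'age': [17, 33, 47], 'gender': ['m', 'f', 'm']})

# handle_unknown='ignore' encodes categories unseen during fit as all zeros
enc = OneHotEncoder(handle_unknown='ignore')

# OneHotEncoder expects 2-D input, hence the double brackets
encoded = enc.fit_transform(data[['gender']]).toarray()

# One output column per category, in sorted order
categories = list(enc.categories_[0])
print(categories)
print(encoded)
```

With only two categories, the integer codes from LabelEncoder are usually enough. One-hot encoding matters more when a column has three or more unordered categories, because integer codes would suggest an order that is not there.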
