Ophilia February 2016

Text classification with Scikit-learn

I am doing text classification with two labels using scikit-learn. I am loading my text files with load_files:

text_data = load_files(path,categories=categories)

from the following structure:

├── Label0
│   ├── 0001.txt
│   └── 0002.txt
└── Label1
    ├── 0001.txt
    └── 0002.txt

My problem is that when I try to look at the shape of text_data.data, I get:

print (type(text_data.data))
<type 'list'>

print(text_data.data.shape)
AttributeError: 'list' object has no attribute 'shape'

import numpy as np

X = np.array(text_data.data)
print(X.shape)

This returns a 1D array. I expected either a 2D numpy array or a dictionary, with one part holding the text and the other holding the class (Label0 or Label1). Have I missed something?


David Maust February 2016

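The problem is that load_files does not return a numpy array. text_data.data is just a list of raw text strings, one per file, and the labels are stored separately in text_data.target. To get a 2D matrix of shape (number of documents, number of features), you need to vectorize the text with CountVectorizer or TfidfVectorizer.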


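from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

cv = CountVectorizer()
X = cv.fit_transform(text_data.data)  # sparse 2D matrix: one row per document
y = text_data.target                  # one integer label per document
print(cv.vocabulary_)  # Show words in vocabulary with column index

clf = LogisticRegression() # or other classifier
clf.fit(X, y)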
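
To see the shapes at each step, here is a small self-contained sketch. It builds a throwaway folder with the same Label0/Label1 layout as in the question; the file contents are made up for illustration, and TfidfVectorizer is used here only as the alternative mentioned above (CountVectorizer would work the same way).

```python
import os
import tempfile

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up example texts; only the folder names come from the question.
docs = {
    "Label0": ["the cat sat on the mat", "a cat and a dog"],
    "Label1": ["stocks fell sharply today", "the market rallied today"],
}

with tempfile.TemporaryDirectory() as root:
    # Recreate the Label0/Label1 directory layout on disk.
    for label, texts in docs.items():
        os.makedirs(os.path.join(root, label))
        for i, text in enumerate(texts, start=1):
            path = os.path.join(root, label, "%04d.txt" % i)
            with open(path, "w", encoding="utf-8") as f:
                f.write(text)

    # With encoding set, the loaded documents are decoded to str.
    text_data = load_files(root, encoding="utf-8")

print(type(text_data.data), len(text_data.data))  # plain list, one entry per file
print(text_data.target_names)                     # class names taken from folder names
print(text_data.target.shape)                     # 1D array, one label per document

vec = TfidfVectorizer()
X = vec.fit_transform(text_data.data)
print(X.shape)  # (number of documents, vocabulary size)
```

The text and the labels stay in two separate objects (X and text_data.target) that line up row by row, which is the form scikit-learn classifiers expect in fit(X, y).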

Post Status

Asked in February 2016
Viewed 3,802 times
Voted 12
Answered 1 times

