Naresh MG February 2016

python CountVectorizer() vocabulary_ get method returns None

I have this piece of code, written by following the documentation at http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()

my_bunch = load_files("c:\\temp\\billing_test\\")

my_data = my_bunch['data']
print(my_bunch.keys())
print('target_names', my_bunch['target_names'])
print('length of data', len(my_bunch['data']))

X_train_counts = count_vect.fit_transform(my_data)
print(X_train_counts.shape)

print(count_vect.vocabulary_.get(u'algorithm'))

The output is as follows:

dict_keys(['target', 'filenames', 'target_names', 'data', 'DESCR'])
target_names ['false', 'true']
length of data 920
(920, 8773)
None

I wonder why "None" is printed at the bottom, after (920, 8773).

I have around 460 text documents in each of the folders "true" and "false".

Thanks,

Answers


Farseer February 2016

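Because the word 'algoritham' never appeared in your documents. vocabulary_ is a plain dictionary that maps each learned term to its column index, and get() returns None for any key that is not in it.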

Perhaps you should try 'algorithm'.
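
To illustrate, here is a minimal sketch of the lookup behaviour. The two-sentence corpus is a hypothetical stand-in for the 920 files in the question, so the indices and shape will differ from the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (an assumption, not the asker's data).
docs = [
    "the algorithm sorts the billing records",
    "billing records are stored as text",
]

count_vect = CountVectorizer()
X = count_vect.fit_transform(docs)

# vocabulary_ is a dict: term -> column index in the count matrix.
present = count_vect.vocabulary_.get("algorithm")   # term seen during fit
missing = count_vect.vocabulary_.get("algoritham")  # misspelled, never seen

print(X.shape)
print(present)   # an integer column index
print(missing)   # None, because the key is not in the dict

# Membership test and a listing of what was actually learned.
print("algoritham" in count_vect.vocabulary_)
print(sorted(count_vect.vocabulary_))
```

Printing sorted(count_vect.vocabulary_) on the real data is a quick way to check how a term was actually tokenized before looking it up. Note that CountVectorizer lowercases text by default, so lookups should use lowercase terms.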

Post Status

Asked in February 2016
Viewed 2,530 times
Voted 8
Answered 1 times
