tec_abhi February 2016

Tokenize in Python using pandas

I am trying to tokenize a dataframe with one column, using the following code:

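# Imports assumed from the names used below (pandas, NLTK, stop_words)
import pandas as pd
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from stop_words import get_stop_words

def main(args):
    df = pd.DataFrame(pd.read_csv(args[1]), index=None)
    doc_set = pd.DataFrame(df.Country)
    tokenizer = RegexpTokenizer(r'\w+')
    en_stop = get_stop_words('en')
    p_stemmer = PorterStemmer()
    texts = []
    print(doc_set)
    for i in doc_set:
        raw = i.lower()
        # split the lowercased text into word tokens
        tokens = tokenizer.tokenize(raw)
        # drop English stop words
        stopped_tokens = [i for i in tokens if not i in en_stop]
        # reduce each remaining token to its stem
        stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
        texts.append(stemmed_tokens)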

This code prints only the header of the dataframe, which I created from a CSV file. Please help me find what is wrong with my approach.

Answers


Joel Kreager March 2016

When Python starts spitting out things that make no sense to me, I have gotten into the habit of downloading the latest source, compiling it to /usr/local, and reinstalling everything with pip. Strangely, this usually fixes things.
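
Separately from any reinstall, the symptom in the question may have a simpler cause: iterating directly over a pandas DataFrame yields its column labels, not its cell values, so a loop over a one-column frame sees only the header string. The sketch below shows that behavior on a small made-up frame and then iterates over the column itself. The sample data and the `tokenize_column` helper are hypothetical, and `re.findall` with `\w+` stands in for the NLTK tokenizer so the example needs only pandas; stop-word removal and stemming are left out.

```python
import re

import pandas as pd

# Hypothetical stand-in for the CSV in the question: one "Country" column.
df = pd.DataFrame({"Country": ["United States", "New Zealand", "South Africa"]})
doc_set = pd.DataFrame(df.Country)

# Iterating over a DataFrame yields its column labels, not its cell values.
labels = [item for item in doc_set]


def tokenize_column(series):
    """Lowercase each entry of a column and split it into word tokens."""
    return [re.findall(r"\w+", str(value).lower()) for value in series]


# Iterating over the column (a Series) yields the cell values instead.
texts = tokenize_column(df["Country"])

print(labels)
print(texts)
```

If this is the cause, changing the loop in the question to run over `df.Country` (or `df["Country"]`) rather than over `doc_set` should feed each row's text to the tokenizer.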

Post Status

Asked in February 2016
Viewed 1,262 times
Voted 5
Answered 1 time
