MichM February 2016

Spark: filter out elements in a list of lists

I'm new to PySpark. I have a bunch of text documents and I would like to get all the individual words from them.

I can see that this step turns each document into a list of words:

words = documents.map(lambda v: re.split('[\W_\']+',v.lower()))
display(words.toDF())

But I also want to remove stopwords (a list of words defined somewhere before this code). The issue is that the RDD "words" is not simply a list of words. It is a list of lists of words: words.first() returns a list of words, not a single word. So how do I remove any word that belongs to stopwords from "words"?

I tried words2 = words.map(lambda x:x if x not in stoplist) and got errors "org.apache.spark.SparkException: Job aborted due to stage failure:"
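A note on why this attempt cannot work, sketched in plain Python rather than Spark. As written, the conditional expression has no `else` branch, which Python rejects as a syntax error. Even with one, `map` produces exactly one output per input, and each input here is a whole list, so the membership test never looks at individual words. The sample data below is made up for illustration.

```python
# Plain-Python stand-in for the RDD; the sample data is hypothetical.
words = [["the", "quick", "fox"], ["a", "lazy", "dog"]]
stoplist = ["the", "a"]

# Each element is a whole list, so this compares a list against the
# stop words and is True for every record here.
flags = [x not in stoplist for x in words]

# A map-style transformation returns one output per input; nothing is dropped.
mapped = list(map(lambda x: x, words))

# The filtering has to happen inside each inner list instead.
cleaned = [[w for w in ws if w not in stoplist] for ws in words]
```

The last line is the plain-Python shape of what the per-document filter needs to do.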

Answers


zero323 February 2016

For example, to get a single flat RDD of words:

import re

# Raw string, so the backslash in \W is not treated as a string escape.
pattern = re.compile(r"[\W_']+")

(documents
    .flatMap(lambda v: pattern.split(v.lower()))
    .filter(lambda w: w and w not in stoplist))

or if you want to keep records without flattening:

(documents
    .map(lambda v: pattern.split(v.lower()))
    .map(lambda ws: [w for w in ws if w and w not in stoplist]))
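Both variants can be checked locally without a cluster by moving the lambdas into named functions. This is a minimal sketch, not part of the original answer: the names `tokenize`, `keep`, and `clean_tokens`, the stop words, and the sample sentence are all assumptions for illustration. The `if w` test mirrors the `w and` in the answer: splitting can leave empty strings at the start or end of a document.

```python
import re

# Same pattern as in the answer, written as a raw string.
pattern = re.compile(r"[\W_']+")

# Hypothetical stop words; a set gives fast membership tests.
stoplist = {"the", "a", "and"}

def tokenize(doc):
    """Lowercase a document and split it into non-empty tokens."""
    return [w for w in pattern.split(doc.lower()) if w]

def keep(word):
    """Return True for tokens that are not stop words."""
    return word not in stoplist

def clean_tokens(doc):
    """Tokenize one document and drop its stop words."""
    return [w for w in tokenize(doc) if keep(w)]

sample = "The quick fox, and a lazy_dog!"
tokens = tokenize(sample)
flat = clean_tokens(sample)
```

With an RDD, documents.flatMap(tokenize).filter(keep) would give the flat list of words, and documents.map(clean_tokens) would keep one list per document.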

Post Status

Asked in February 2016
Viewed 3,609 times
Voted 7
Answered 1 times
