I'm new to PySpark.
I have a bunch of text documents and I would like to get all the individual words from them.
I can see this step turns each document into a list of words alright:
import re
words = documents.map(lambda v: re.split(r"[\W_']+", v.lower()))
But I also want to remove stopwords (a list of words defined earlier in the code) from the result. The issue is that the RDD "words" is not a flat collection of words. Each element is itself a list of words, one list per document.
For example, words.first() would return a list of words, not just one word. So how do I remove any word that belongs to stopwords from "words"?
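To make the shape concrete, here is a plain-Python sketch (no Spark involved) of what I have and what I'm after. The sample documents and the stopword set are made up for illustration; only the regex comes from my real code.

```python
import re

# Hypothetical sample data standing in for the contents of the RDD.
documents = ["The cat sat on the mat.", "A dog's life"]
stopwords = {"the", "a", "on", "s"}

# Same split as in my map step: one list of words per document.
words = [re.split(r"[\W_']+", v.lower()) for v in documents]

# What I want: drop stopwords inside each per-document list.
# The split can also leave empty strings (e.g. after a trailing "."),
# so those are dropped here too.
filtered = [[w for w in ws if w and w not in stopwords] for ws in words]

# Alternatively, a single flat list of words across all documents.
flat = [w for ws in filtered for w in ws]
```

My guess is that the per-document version corresponds to a list comprehension inside words.map(...), and the flat version to flatMap followed by filter, but I haven't been able to confirm that.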
I tried words2 = words.map(lambda x:x if x not in stoplist) and got the error "org.apache.spark.SparkException: Job aborted due to stage failure:"