Ravina Singh February 2016

Getting AttributeError on nltk Textual entailment classifier

I'm referring to the section of the NLTK book at http://www.nltk.org/book/ch06.html#recognizing-textual-entailment

def rte_features(rtepair):
    extractor = nltk.RTEFeatureExtractor(rtepair)
    features = {}
    features['word_overlap'] = len(extractor.overlap('word'))
    features['word_hyp_extra'] = len(extractor.hyp_extra('word'))
    features['ne_overlap'] = len(extractor.overlap('ne'))
    features['ne_hyp_extra'] = len(extractor.hyp_extra('ne'))
    return features
rtepair = nltk.corpus.rte.pairs(['rte3_dev.xml'])

extractor = nltk.RTEFeatureExtractor(rtepair)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-39-a7f96e33ba9e> in <module>()
----> 1 extractor = nltk.RTEFeatureExtractor(rtepair)

C:\Users\RAVINA\Anaconda2\lib\site-packages\nltk\classify\rte_classify.pyc in __init__(self, rtepair, stop, lemmatize)
     65 
     66         #Get the set of word types for text and hypothesis
---> 67         self.text_tokens = tokenizer.tokenize(rtepair.text)
     68         self.hyp_tokens = tokenizer.tokenize(rtepair.hyp)
     69         self.text_words = set(self.text_tokens)

AttributeError: 'list' object has no attribute 'text'

It's the exact code from the book. Can anyone help me figure out what is going wrong here? Thanks, Ravina

Answers


helios35 February 2016

Take a look at the types involved. Type this into the Python shell:

import nltk
x = nltk.corpus.rte.pairs(['rte3_dev.xml'])
type(x)

This tells you that x is of type list.

Now, type:

help(nltk.RTEFeatureExtractor)

which tells you:

:param rtepair: a RTEPair from which features should be extracted

Clearly, x does not have the correct type for calling nltk.RTEFeatureExtractor, which expects a single RTEPair rather than a list of them. Look at one element instead:

type(x[33])
<class 'nltk.corpus.reader.rte.RTEPair'>

A single item of the list does have the correct type.
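
To see the same failure without the corpus installed, here is a minimal sketch. FakePair and count_text_words are hypothetical stand-ins, not NLTK names; they only mimic the way the extractor reads the .text attribute of whatever it is given.

```python
from collections import namedtuple

# Hypothetical stand-in for nltk.corpus.reader.rte.RTEPair (illustration only).
FakePair = namedtuple("FakePair", ["text", "hyp"])

def count_text_words(pair):
    # Like the extractor, this reads the .text attribute of a single pair.
    return len(set(pair.text.split()))

pairs = [FakePair("a b c", "a b"), FakePair("d e", "d")]

# Passing the whole list reproduces the error from the question.
try:
    count_text_words(pairs)
except AttributeError as err:
    list_error = str(err)

# Passing one element at a time works.
counts = [count_text_words(p) for p in pairs]
```

With the real corpus, the equivalent is to call rte_features on each element of nltk.corpus.rte.pairs(['rte3_dev.xml']) rather than on the list itself.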


Update: As mentioned in the comments, extractor.text_words contains only empty strings. This seems to be due to changes made in NLTK since the documentation was written. In short, you won't be able to fix this without either downgrading to an older version of NLTK or patching NLTK yourself. Inside the file nltk/classify/rte_classify.py, you will find the following code:

class RTEFeatureExtractor(object):
    …
    import nltk
    from nltk.tokenize import RegexpTokenizer
    tokenizer = RegexpTokenizer('([A-Z]\.)+|\w+|\$[\d\.]+')
    self.text_tokens = tokenizer.tokenize(rtepair.text)
    self.text_words = set(self.text_tokens)

If you run the same RegexpTokenizer with the exact text from the extractor, it will produce only empty strings:

import nltk
from nltk.tokenize import RegexpTokenizer

# Take a single RTEPair and apply the same pattern the extractor uses.
rtepair = nltk.corpus.rte.pairs(['rte3_dev.xml'])[33]
tokenizer = RegexpTokenizer(r'([A-Z]\.)+|\w+|\$[\d\.]+')
tokenizer.tokenize(rtepair.text)

Returns ['', '', …, ''] (i.e., a list of empty strings).
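
The empty strings are consistent with how Python's re.findall treats capturing groups: when a pattern contains a single group, findall returns that group's content instead of the whole match, and an unmatched group yields ''. Assuming RegexpTokenizer hands the pattern to findall unchanged (check this against your installed NLTK version), the sketch below reproduces the effect with the standard re module alone. The sample sentence is made up, and the non-capturing variant is a suggestion, not the pattern NLTK ships.

```python
import re

text = "The U.S. paid $3.50 today"  # made-up sample sentence

# Pattern as it appears in rte_classify.py, with a capturing group.
capturing = r'([A-Z]\.)+|\w+|\$[\d\.]+'
# Same pattern with the group made non-capturing via (?:...).
non_capturing = r'(?:[A-Z]\.)+|\w+|\$[\d\.]+'

# findall returns the group's content, which is empty for most matches.
broken = re.findall(capturing, text)
# Without a capturing group, findall returns the full matches.
fixed = re.findall(non_capturing, text)

print(broken)  # ['', 'S.', '', '', '']
print(fixed)   # ['The', 'U.S.', 'paid', '$3.50', 'today']
```

If that assumption holds, changing the group to (?:...) in a local copy of the pattern would be one way to patch the problem yourself.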
