Developers Planet

GHerman February 2016

R :: tm - Create a table/matrix of term association frequencies and add values to dendrogram

I have a corpus that is basically a vector of short sentences (n > 50), e.g.:

``````corpus <- c("looking for help in R","check whether my milk is sour or not",
"random sentence with dubious meaning")
``````

I am able to print a dendrogram

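``````# assumes d (a distance matrix, e.g. from dist()) and nc (number of clusters) already exist
fit <- hclust(d, method="ward.D")   # "ward" was renamed "ward.D" in R >= 3.1.0
plot(fit, hang=-1)
groups <- cutree(fit, k=nc)   # "k=" defines the number of clusters you are using
rect.hclust(fit, k=nc, border="red") # draw red borders around the nc clusters
``````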

and a correlation matrix

``````cor_1 <- cor(as.matrix(dtms))
corrplot(cor_1, method = "number")
``````

As far as I understand it (please correct me if I am wrong), `findAssocs()`, i.e. correlation, checks whether two terms appear in the same document?

Goal: I don't want the correlation, but the number of documents in which two terms appear together, where the terms are NOT necessarily adjacent to each other (so a BigramTokenizer won't work). For example: term A and term B appear together in 5 different documents across my corpus, regardless of the distance between them.

Ideally I want to create a frequency matrix similar to the correlation matrix above and, if possible, add the frequencies to the dendrogram (akin to where `pvclust()` prints its numbers).

Any ideas on how to achieve this?

Ken Benoit February 2016

I think you are asking how to get a co-occurrence matrix for terms, where the cells are the number of documents in which one term occurs together with another term. We can accomplish this magic with a matrix cross-product, multiplying the transpose of the matrix by the matrix itself, after converting the matrix of document-term frequencies to Boolean values indicating whether a term occurred in a document.

(I've used the quanteda package here instead of tm but a similar approach will work with a `DocumentTermMatrix` object from tm.)

``````# create some demonstration documents
(txts <- c(paste(letters[c(1, 1:3)], collapse = " "),
paste(letters[c(1, 3, 5)], collapse = " "),
paste(letters[c(5, 6, 7)], collapse = " ")))
## [1] "a a b c" "a c e" "e f g"

# convert to a document-term matrix
require(quanteda)
dtm <- dfm(txts, verbose = FALSE)
dtm
## Document-feature matrix of: 3 documents, 6 features.
## 3 x 6 sparse Matrix of class "dfmSparse"
##        features
## docs    a b c e f g
##   text1 2 1 1 0 0 0
##   text2 1 0 1 1 0 0
##   text3 0 0 0 1 1 1

# convert to a matrix of Boolean occurrences rather than counts
(dtm <- tf(dtm, "boolean"))
## Document-feature matrix of: 3 documents, 6 features.
## 3 x 6 sparse Matrix of class "dfmSparse"
##        features
## docs    a b c e f g
##   text1 1 1 1 0 0 0
##   text2 1 0 1 1 0 0
##   text3 0 0 0 1 1 1

# now get the "feature in document" co-occurrence matrix
t(dtm) %*% dtm
## 6 x 6 sparse Matrix of class "dgCMatrix"
##   a b c e f g
## a 2 1 2 1 . .
## b 1 1 1 . . .
## c 2 1 2 1 . .
## e 1 . 1 2 1 1
## f . . . 1 1 1
## g . . . 1 1 1
``````
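
For readers working with a plain matrix, such as the one `as.matrix()` returns for a tm `DocumentTermMatrix`, the same cross-product can be sketched in base R. This is only an illustration: the matrix `m` below is typed in by hand to match the toy documents above, and the names `m`, `m_bool` and `cooc` are made up for this sketch.

```r
# Hypothetical stand-in for as.matrix(<DocumentTermMatrix>): rows are documents,
# columns are terms, cells are raw term counts (same toy data as above).
m <- matrix(c(2, 1, 1, 0, 0, 0,
              1, 0, 1, 1, 0, 0,
              0, 0, 0, 1, 1, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(docs = paste0("text", 1:3),
                            terms = c("a", "b", "c", "e", "f", "g")))

# Reduce counts to presence/absence so a repeated term counts once per document
m_bool <- (m > 0) * 1

# crossprod(x) computes t(x) %*% x: cell [i, j] is the number of documents
# that contain both term i and term j
cooc <- crossprod(m_bool)

cooc["a", "c"]  # number of documents containing both "a" and "c"
```

With the toy data, `cooc["a", "c"]` is 2, matching the quanteda result above. On a large corpus a dense matrix may not fit in memory, so keeping a sparse representation would likely be preferable there.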

Note: This setup counts each term as "co-occurring" with itself, so the diagonal holds the number of documents in which the term appears (e.g. 1 for `b`). If you want to change that, simply replace the diagonal with the diagonal minus one.
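
As a rough sketch of how the diagonal could be adjusted in base R, the snippet below rebuilds the Boolean matrix from the example by hand and shows two options: the "diagonal minus one" adjustment mentioned in the note, and, as an alternative that is an assumption of this sketch rather than part of the answer, setting the diagonal to zero so that only pairs of distinct terms are counted. All variable names are illustrative.

```r
# Boolean document-term matrix from the example (rows: documents, columns: terms)
m_bool <- matrix(c(1, 1, 1, 0, 0, 0,
                   1, 0, 1, 1, 0, 0,
                   0, 0, 0, 1, 1, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("text", 1:3),
                                 c("a", "b", "c", "e", "f", "g")))
cooc <- crossprod(m_bool)

# The diagonal is each term's document frequency; keep a copy before changing it
doc_freq <- diag(cooc)

# Option 1: the adjustment from the note, diagonal minus one
cooc_minus1 <- cooc
diag(cooc_minus1) <- diag(cooc) - 1

# Option 2 (assumption of this sketch): drop self-pairs entirely
cooc_offdiag <- cooc
diag(cooc_offdiag) <- 0
```

Which option is appropriate depends on whether the diagonal is needed later, for example as document frequencies; the off-diagonal counts are identical in both.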