GHerman February 2016
### R :: tm - Create a table/matrix of term association frequencies and add values to dendrogram

I have a corpus that is basically a vector of short sentences (n > 50), e.g.:

```
corpus <- c("looking for help in R","check whether my milk is sour or not",
"random sentence with dubious meaning")
```

I am able to print a dendrogram

```
fit <- hclust(d, method="ward.D") # "ward" was renamed "ward.D" in R >= 3.1.0
plot(fit, hang=-1)
groups <- cutree(fit, k=nc) # "k=" defines the number of clusters you are using
rect.hclust(fit, k=nc, border="red") # draw red borders around the nc clusters
```

and a correlation matrix

```
cor_1 <- cor(as.matrix(dtms))
corrplot(cor_1, method = "number")
```

As far as I have understood it (please correct me if I am wrong), `findAssocs()`, i.e. correlation, checks whether two terms appear in the same document?

**Goal:**
Now I don't want to see the correlation, but the number of documents in which two terms appear together, where the terms are NOT necessarily adjacent to each other (so a BigramTokenizer won't work). For example: term A and term B appear together in 5 different documents across my corpus, regardless of the distance between them.

Ideally I want to create a frequency matrix similar to the one above and add the frequencies to the dendrogram if possible (akin to where `pvclust()` prints its numbers).

Any ideas on how to achieve this?

I think you are asking how to get a co-occurrence matrix for terms, where each cell is the number of documents in which one term occurs together with another term. We can accomplish this in two steps: first convert the matrix of document-term frequencies to Boolean values indicating whether a term occurred in a document, then take the matrix cross-product of the transpose of that matrix with the matrix itself.

(I've used the **quanteda** package here instead of **tm**, but a similar approach will work with a `DocumentTermMatrix` object from **tm**.)

```
# create some demonstration documents
(txts <- c(paste(letters[c(1, 1:3)], collapse = " "),
paste(letters[c(1, 3, 5)], collapse = " "),
paste(letters[c(5, 6, 7)], collapse = " ")))
## [1] "a a b c" "a c e" "e f g"
# convert to a document-term matrix
require(quanteda)
dtm <- dfm(txts, verbose = FALSE)
dtm
## Document-feature matrix of: 3 documents, 6 features.
## 3 x 6 sparse Matrix of class "dfmSparse"
## features
## docs a b c e f g
## text1 2 1 1 0 0 0
## text2 1 0 1 1 0 0
## text3 0 0 0 1 1 1
# convert counts to Boolean presence/absence indicators
# (tf() is from older quanteda releases; newer ones use dfm_weight(dtm, scheme = "boolean"))
(dtm <- tf(dtm, "boolean"))
## Document-feature matrix of: 3 documents, 6 features.
## 3 x 6 sparse Matrix of class "dfmSparse"
## features
## docs a b c e f g
## text1 1 1 1 0 0 0
## text2 1 0 1 1 0 0
## text3 0 0 0 1 1 1
# now get the "feature in document" co-occurrence matrix
t(dtm) %*% dtm
## 6 x 6 sparse Matrix of class "dgCMatrix"
## a b c e f g
## a 2 1 2 1 . .
## b 1 1 1 . . .
## c 2 1 2 1 . .
## e 1 . 1 2 1 1
## f . . . 1 1 1
## g . . . 1 1 1
```

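Note: The diagonal holds the number of documents in which each term appears, so this setup counts a term as "co-occurring" with itself once for every document it occurs in, even if it appears there only once (e.g. `b`). If you want to change that, simply replace the diagonal with the diagonal minus one.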
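
For readers without **quanteda**, the same idea can be sketched in base R. This is only an illustration under stated assumptions: the whitespace tokenization and the names `counts`, `present`, `cooc` and `cooc_off` are made up for this example, and with **tm** you would presumably start from `as.matrix()` of your `DocumentTermMatrix` instead of building `counts` by hand. Zeroing the diagonal at the end is an alternative to the "diagonal minus one" suggestion, for the case where self-pairs should not be counted at all.

```
# toy documents, same as in the demonstration above
txts <- c("a a b c", "a c e", "e f g")

# naive whitespace tokenization (an assumption; real text needs proper cleaning)
tokens <- strsplit(txts, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# document-term count matrix: one row per document, one column per term
counts <- t(vapply(tokens,
                   function(tok) as.integer(table(factor(tok, levels = vocab))),
                   integer(length(vocab))))
dimnames(counts) <- list(paste0("text", seq_along(txts)), vocab)

# Boolean presence/absence, stored as 0/1
present <- (counts > 0) * 1L

# crossprod(x) is t(x) %*% x: cell [i, j] is the number of documents
# that contain both term i and term j
cooc <- crossprod(present)
cooc["a", "c"]  # 2: "a" and "c" both occur in text1 and text2

# optional: drop self co-occurrence entirely by zeroing the diagonal
cooc_off <- cooc
diag(cooc_off) <- 0
```

The off-diagonal cells match the **quanteda** output above; only the diagonal handling differs between `cooc` and `cooc_off`.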

Asked in February 2016

Viewed 2,696 times

Voted 14

Answered 1 times
