I have a corpus that is basically a vector of short sentences (n > 50), e.g.:
corpus <- c("looking for help in R","check whether my milk is sour or not",
"random sentence with dubious meaning")
I am able to print a dendrogram:
fit <- hclust(d, method="ward.D") # "ward" was renamed to "ward.D" in R >= 3.1.0
groups <- cutree(fit, k=nc) # "k=" defines the number of clusters you are using
plot(fit) # rect.hclust() draws on an existing dendrogram plot
rect.hclust(fit, k=nc, border="red") # draw red borders around the nc clusters
and a correlation matrix:
cor_1 <- cor(as.matrix(dtms))
corrplot(cor_1, method = "number")
As far as I understand it (please correct me if I am wrong), findAssocs(), i.e. correlation, checks whether two terms appear in the same document?
What I want to see is not the correlation but the frequency with which two terms appear in the same document, where the terms are NOT necessarily adjacent to each other (so a BigramTokenizer won't work). For example: term A and term B appear together in 5 different documents across my corpus, regardless of the distance between them.
Ideally I want to create a frequency matrix similar to the correlation matrix above and, if possible, add the frequencies to the dendrogram (akin to where pvclust() prints its numbers).
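To make the target concrete, here is a minimal base R sketch of the co-occurrence count I mean. It is only an illustration under assumptions: the four-sentence toy corpus and the names tokens, terms, m and cooc are made up for this example, and the naive tokenizer is not what tm does internally.

```r
# Toy corpus (hypothetical; the 4th sentence is added so that some terms co-occur)
corpus <- c("looking for help in R",
            "check whether my milk is sour or not",
            "random sentence with dubious meaning",
            "help my milk is sour")

# Naive tokenizer (assumption): lower-case, split on non-letters,
# keep each term at most once per document
tokens <- lapply(strsplit(tolower(corpus), "[^a-z]+"), unique)
tokens <- lapply(tokens, function(x) x[nzchar(x)])
terms  <- sort(unique(unlist(tokens)))

# Binary document-term matrix: 1 if the term occurs anywhere in the document
m <- t(sapply(tokens, function(tk) as.integer(terms %in% tk)))
colnames(m) <- terms

# t(m) %*% m: cell [A, B] = number of documents containing both A and B
cooc <- crossprod(m)
diag(cooc) <- 0   # drop each term's co-occurrence with itself

cooc["milk", "sour"]   # 2: both terms appear in documents 2 and 4
```

Because the matrix is binary, word distance within a document plays no role, which is the behaviour described above. What the sketch does not cover is building this from dtms directly or attaching the counts to the dendrogram.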
Any ideas on how to achieve this?