Semantic Analysis

LSA: Latent Semantic Analysis

Often we want to think about documents from the perspective of semantic content. One standard approach to doing this is to perform Latent Semantic Analysis or LSA on the corpus. You can do this using the lsa function:

lsa(crps)

LDA: Latent Dirichlet Allocation

Another way to get a handle on the semantic content of a corpus is to use Latent Dirichlet Allocation:

First we need to produce the DocumentTermMatrix

julia> crps = Corpus([StringDocument("This is the Foo Bar Document"), StringDocument("This document has too Foo words")])
julia> update_lexicon!(crps)
julia> m = DocumentTermMatrix(crps)

Latent Dirchlet Allocation has two hyper parameters -

julia> k = 2            # number of topics
julia> iterations = 1000 # number of gibbs sampling iterations

julia> α = 0.1      # hyper parameter
julia> β  = 0.1       # hyper parameter

julia> ϕ, θ  = lda(m, k, iterations, α, β)
(
  [2 ,  1]  =  0.333333
  [2 ,  2]  =  0.333333
  [1 ,  3]  =  0.222222
  [1 ,  4]  =  0.222222
  [1 ,  5]  =  0.111111
  [1 ,  6]  =  0.111111
  [1 ,  7]  =  0.111111
  [2 ,  8]  =  0.333333
  [1 ,  9]  =  0.111111
  [1 , 10]  =  0.111111, [0.5 1.0; 0.5 0.0])

See ?lda for more help.