LSA: Latent Semantic Analysis

Often we want to think about documents from the perspective of their semantic content. One standard approach is to perform Latent Semantic Analysis (LSA) on the corpus.

TextAnalysis.lsa — Function

lsa(dtm::DocumentTermMatrix)
lsa(crps::Corpus)

Performs Latent Semantic Analysis (LSA) on a corpus.


lsa uses tf_idf weighting for its term statistics, then computes a singular value decomposition of the weighted matrix.
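The same pipeline can be reproduced by hand. The sketch below assumes lsa is equivalent to tf_idf weighting followed by a dense SVD (consistent with the note above and the LinearAlgebra.SVD return type shown below); tf_idf and svd are existing TextAnalysis and LinearAlgebra functions.

julia> using TextAnalysis, LinearAlgebra

julia> crps = Corpus([StringDocument("this is a string document"),
                      TokenDocument("this is a token document")]);

julia> update_lexicon!(crps)

julia> weighted = tf_idf(DocumentTermMatrix(crps));  # tf-idf weighting of the raw counts

julia> F = svd(Matrix(weighted));                    # the factorization lsa returns

Calling lsa directly produces the same factorization: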

julia> using TextAnalysis

julia> crps = Corpus([StringDocument("this is a string document"),
                      TokenDocument("this is a token document")])
A Corpus with 2 documents:
 * 1 StringDocument's
 * 0 FileDocument's
 * 1 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens

julia> lsa(crps)
LinearAlgebra.SVD{Float64, Float64, Matrix{Float64}, Vector{Float64}}
U factor:
2×2 Matrix{Float64}:
 1.0  0.0
 0.0  1.0
singular values:
2-element Vector{Float64}:
 0.13862943611198905
 0.13862943611198905
Vt factor:
2×6 Matrix{Float64}:
 0.0  0.0  0.0  1.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  1.0
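The result is a plain LinearAlgebra.SVD object, so its U, S, and Vt fields can be used directly. As a hedged usage sketch (standard LSA practice, not a dedicated TextAnalysis API), documents can be embedded in the latent space by scaling U by the singular values:

julia> using LinearAlgebra

julia> F = lsa(crps);

julia> doc_vectors = F.U * Diagonal(F.S);  # one row per document in latent space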

lsa can also be performed on a DocumentTermMatrix.

julia> using TextAnalysis

julia> crps = Corpus([StringDocument("this is a string document"),
                      TokenDocument("this is a token document")]);

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)
A 2 X 6 DocumentTermMatrix

julia> lsa(m)
LinearAlgebra.SVD{Float64, Float64, Matrix{Float64}, Vector{Float64}}
U factor:
2×2 Matrix{Float64}:
 1.0  0.0
 0.0  1.0
singular values:
2-element Vector{Float64}:
 0.13862943611198905
 0.13862943611198905
Vt factor:
2×6 Matrix{Float64}:
 0.0  0.0  0.0  1.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  1.0
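lsa returns the full SVD, so truncating to the top k singular values (the "latent" dimensionality) is left to the caller. A minimal sketch, continuing from the example above and assuming a rank-k approximation is what you want:

julia> using LinearAlgebra

julia> F = lsa(m);

julia> k = 1;  # number of latent dimensions to keep

julia> approx = F.U[:, 1:k] * Diagonal(F.S[1:k]) * F.Vt[1:k, :];  # rank-k reconstruction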

LDA: Latent Dirichlet Allocation

Another way to get a handle on the semantic content of a corpus is to use Latent Dirichlet Allocation:

First we need to produce the DocumentTermMatrix:

TextAnalysis.lda — Function
ϕ, θ = lda(dtm::DocumentTermMatrix, ntopics::Int, iterations::Int, α::Float64, β::Float64; kwargs...)

Perform Latent Dirichlet allocation.

Required Positional Arguments

  • dtm The DocumentTermMatrix to model.
  • ntopics The number of topics to infer.
  • iterations The number of Gibbs sampling iterations to run.
  • α Dirichlet dist. hyperparameter for the topic distribution per document. α<1 yields a sparse topic mixture for each document; α>1 yields a more uniform topic mixture for each document (see the sketch at the end of this section).
  • β Dirichlet dist. hyperparameter for the word distribution per topic. β<1 yields a sparse word mixture for each topic; β>1 yields a more uniform word mixture for each topic.

Optional Keyword Arguments

  • showprogress::Bool. Show a progress bar during the Gibbs sampling. Default value: true.

Return Values

  • ϕ: ntopics × nwords sparse matrix of probabilities; each row (a topic's word distribution) sums to 1, i.e. all(sum(ϕ, dims=2) .≈ 1)
  • θ: ntopics × ndocs dense matrix of probabilities; each column (a document's topic mixture) sums to 1, i.e. all(sum(θ, dims=1) .≈ 1)
julia> using TextAnalysis

julia> crps = Corpus([StringDocument("This is the Foo Bar Document"),
                      StringDocument("This document has too Foo words")]);

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)
A 2 X 10 DocumentTermMatrix

julia> k = 2  # number of topics
2

julia> iterations = 1000  # number of gibbs sampling iterations
1000

julia> α = 0.1  # hyper parameter
0.1

julia> β = 0.1  # hyper parameter
0.1

julia> ϕ, θ = lda(m, k, iterations, α, β);

julia> ϕ
2×10 SparseArrays.SparseMatrixCSC{Float64, Int64} with 10 stored entries:
  ⋅    ⋅    ⋅    ⋅   0.5  0.5   ⋅    ⋅    ⋅    ⋅
 0.1  0.1  0.2  0.2   ⋅    ⋅   0.1  0.1  0.1  0.1

julia> θ
2×2 Matrix{Float64}:
 0.0  0.333333
 1.0  0.666667
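A quick sanity check on the output above: each row of ϕ (a topic's distribution over words) sums to 1, and each column of θ (a document's topic mixture) sums to 1. The top_terms helper below is hypothetical (not part of TextAnalysis); it ranks each topic's probabilities against the terms field of the DocumentTermMatrix, and its output varies between Gibbs runs:

julia> all(sum(ϕ, dims=2) .≈ 1)  # one word distribution per topic
true

julia> all(sum(θ, dims=1) .≈ 1)  # one topic mixture per document
true

julia> top_terms(ϕ, terms, n) = [terms[sortperm(collect(ϕ[t, :]), rev=true)[1:n]]
                                 for t in 1:size(ϕ, 1)];  # hypothetical helper

julia> top_terms(ϕ, m.terms, 3)  # three most probable words per topic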

See ?lda for more help.
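As the docstring notes, α controls how peaked each document's topic mixture is. A hedged exploration sketch, reusing m, k, iterations, and β from the example above (results vary between runs because lda uses Gibbs sampling):

julia> _, θ_peaked = lda(m, k, iterations, 0.01, β);  # α ≪ 1: sparse topic mixtures

julia> _, θ_flat = lda(m, k, iterations, 10.0, β);    # α ≫ 1: more uniform mixtures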