LSA: Latent Semantic Analysis
Often we want to think about documents from the perspective of semantic content. One standard approach is to perform Latent Semantic Analysis (LSA) on the corpus.
TextAnalysis.lsa — Function

lsa(dtm::DocumentTermMatrix)
lsa(crps::Corpus)

Performs Latent Semantic Analysis (LSA) on a corpus. lsa uses tf_idf for statistics.
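Under the hood, this amounts to taking the SVD of the tf-idf weighted term matrix. A minimal sketch of the equivalent computation (hedged: the exact internals may differ across versions; tf_idf is the exported weighting function):

using TextAnalysis, LinearAlgebra

crps = Corpus([StringDocument("this is a string document"),
               TokenDocument("this is a token document")])
update_lexicon!(crps)
m = DocumentTermMatrix(crps)

# Weight the raw counts with tf-idf, densify, and factorize with a plain SVD.
F = svd(Matrix(tf_idf(m)))
F.U, F.S, F.Vt    # document factors, singular values, term factors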
julia> using TextAnalysis
julia> crps = Corpus([ StringDocument("this is a string document"), TokenDocument("this is a token document") ])
A Corpus with 2 documents:
 * 1 StringDocument's
 * 0 FileDocument's
 * 1 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> lsa(crps)
LinearAlgebra.SVD{Float64, Float64, Matrix{Float64}, Vector{Float64}}
U factor:
2×2 Matrix{Float64}:
 1.0  0.0
 0.0  1.0
singular values:
2-element Vector{Float64}:
 0.13862943611198905
 0.13862943611198905
Vt factor:
2×6 Matrix{Float64}:
 0.0  0.0  0.0  1.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  1.0
lsa can also be performed on a DocumentTermMatrix.
julia> using TextAnalysis
julia> crps = Corpus([ StringDocument("this is a string document"), TokenDocument("this is a token document") ]);
julia> update_lexicon!(crps)
julia> m = DocumentTermMatrix(crps)
A 2 X 6 DocumentTermMatrix
julia> lsa(m)
LinearAlgebra.SVD{Float64, Float64, Matrix{Float64}, Vector{Float64}}
U factor:
2×2 Matrix{Float64}:
 1.0  0.0
 0.0  1.0
singular values:
2-element Vector{Float64}:
 0.13862943611198905
 0.13862943611198905
Vt factor:
2×6 Matrix{Float64}:
 0.0  0.0  0.0  1.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  1.0
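The factorization can be used directly for similarity queries. A minimal sketch (the variable names and the choice of k below are illustrative, not part of the API) that projects each document into a k-dimensional latent space and compares two documents by cosine similarity:

using TextAnalysis, LinearAlgebra

crps = Corpus([StringDocument("this is a string document"),
               TokenDocument("this is a token document")])
update_lexicon!(crps)
m = DocumentTermMatrix(crps)

F = lsa(m)
k = 2                                         # latent dimensions to keep
doc_vecs = F.U[:, 1:k] * Diagonal(F.S[1:k])   # one row per document

# Cosine similarity between the two documents in the latent space
v1, v2 = doc_vecs[1, :], doc_vecs[2, :]
cos_sim = dot(v1, v2) / (norm(v1) * norm(v2))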
LDA: Latent Dirichlet Allocation
Another way to get a handle on the semantic content of a corpus is to use Latent Dirichlet Allocation (LDA). First we need to produce the DocumentTermMatrix.
TextAnalysis.lda — Function

ϕ, θ = lda(dtm::DocumentTermMatrix, ntopics::Int, iterations::Int, α::Float64, β::Float64; kwargs...)

Perform Latent Dirichlet Allocation.
Required Positional Arguments

α: Dirichlet dist. hyperparameter for the topic distribution per document. α < 1 yields a sparse topic mixture for each document; α > 1 yields a more uniform topic mixture for each document.

β: Dirichlet dist. hyperparameter for the word distribution per topic. β < 1 yields a sparse word mixture for each topic; β > 1 yields a more uniform word mixture for each topic.
Optional Keyword Arguments

showprogress::Bool: Show a progress bar during the Gibbs sampling. Default value: true.
Return Values

ϕ: ntopics × nwords sparse matrix of probabilities; each row (one per topic) is a distribution over words and sums to 1, i.e. all(sum(ϕ, dims=2) .≈ 1).

θ: ntopics × ndocs dense matrix of probabilities; each column (one per document) is a distribution over topics and sums to 1, i.e. all(sum(θ, dims=1) .≈ 1).
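Before the worked example, a minimal sketch tying the pieces above together: it contrasts the two α regimes and checks the probability invariants of the returned factors. The corpus and parameter values here are illustrative only.

using TextAnalysis

crps = Corpus([StringDocument("apples oranges pears bananas"),
               StringDocument("dogs cats birds fish")])
update_lexicon!(crps)
m = DocumentTermMatrix(crps)

# α < 1 pushes each document toward one dominant topic;
# α > 1 flattens the per-document topic mixtures in θ.
ϕ1, θ_sparse = lda(m, 2, 1000, 0.05, 0.1)
ϕ2, θ_flat   = lda(m, 2, 1000, 5.0,  0.1)

# Probability invariants: each topic's row of ϕ and each
# document's column of θ should sum to 1.
all(isapprox.(sum(ϕ1, dims = 2), 1.0))
all(isapprox.(sum(θ_sparse, dims = 1), 1.0))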
julia> using TextAnalysis
julia> crps = Corpus([ StringDocument("This is the Foo Bar Document"), StringDocument("This document has too Foo words") ]);
julia> update_lexicon!(crps)
julia> m = DocumentTermMatrix(crps)
A 2 X 10 DocumentTermMatrix
julia> k = 2 # number of topics
2
julia> iterations = 1000 # number of gibbs sampling iterations
1000
julia> α = 0.1 # hyper parameter
0.1
julia> β = 0.1 # hyper parameter
0.1
julia> ϕ, θ = lda(m, k, iterations, α, β);
julia> ϕ
2×10 SparseArrays.SparseMatrixCSC{Float64, Int64} with 10 stored entries:
 0.125  0.125  0.25  0.25   ⋅     ⋅    0.125   ⋅     ⋅    0.125
  ⋅      ⋅      ⋅     ⋅    0.25  0.25   ⋅     0.25  0.25   ⋅
julia> θ
2×2 Matrix{Float64}:
 0.833333  0.5
 0.166667  0.5
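To put names on the topics, the rows of ϕ can be ranked against the matrix's vocabulary. A hedged sketch (top_terms is a hypothetical helper, and it assumes m.terms holds the vocabulary in column order, which is how DocumentTermMatrix stores its terms):

# Hypothetical helper: print each topic's n highest-probability terms.
function top_terms(ϕ, terms; n = 3)
    for t in 1:size(ϕ, 1)
        order = sortperm(Vector(ϕ[t, :]), rev = true)
        println("topic $t: ", join(terms[order[1:n]], ", "))
    end
end

top_terms(ϕ, m.terms)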
See ?lda for more help.