LSA: Latent Semantic Analysis
Often we want to think about documents in terms of their semantic content. One standard approach is to perform Latent Semantic Analysis (LSA) on the corpus.
TextAnalysis.lsa — Function

```julia
lsa(dtm::DocumentTermMatrix)
lsa(crps::Corpus)
```

Performs Latent Semantic Analysis (LSA) on a corpus.

lsa uses tf_idf for its statistics.
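Since lsa weights the term counts with tf-idf before decomposing, its result should match taking the SVD of the tf-idf matrix directly. A minimal sketch of that equivalence (the explicit svd call and Matrix densification are our own, not part of the lsa API):

```julia
using TextAnalysis, LinearAlgebra

crps = Corpus([
    StringDocument("this is a string document"),
    TokenDocument("this is a token document"),
])
update_lexicon!(crps)
m = DocumentTermMatrix(crps)

# lsa weights the raw counts with tf-idf before decomposing,
# so an explicit tf_idf + svd should yield the same factorization.
weighted = tf_idf(m)       # sparse docs × terms tf-idf matrix
F = svd(Matrix(weighted))  # LinearAlgebra.svd needs a dense matrix
F.S                        # the same singular values lsa(m) reports
```

In practice you simply call lsa on the corpus or matrix directly: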
```julia
julia> using TextAnalysis

julia> crps = Corpus([
           StringDocument("this is a string document"),
           TokenDocument("this is a token document")
       ])
A Corpus with 2 documents:
 * 1 StringDocument's
 * 0 FileDocument's
 * 1 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens

julia> lsa(crps)
LinearAlgebra.SVD{Float64, Float64, Matrix{Float64}, Vector{Float64}}
U factor:
2×2 Matrix{Float64}:
 1.0  0.0
 0.0  1.0
singular values:
2-element Vector{Float64}:
 0.13862943611198905
 0.13862943611198905
Vt factor:
2×6 Matrix{Float64}:
 0.0  0.0  0.0  1.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  1.0
```
`lsa` can also be performed on a `DocumentTermMatrix`.
```julia
julia> using TextAnalysis

julia> crps = Corpus([
           StringDocument("this is a string document"),
           TokenDocument("this is a token document")
       ]);

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)
A 2 X 6 DocumentTermMatrix

julia> lsa(m)
LinearAlgebra.SVD{Float64, Float64, Matrix{Float64}, Vector{Float64}}
U factor:
2×2 Matrix{Float64}:
 1.0  0.0
 0.0  1.0
singular values:
2-element Vector{Float64}:
 0.13862943611198905
 0.13862943611198905
Vt factor:
2×6 Matrix{Float64}:
 0.0  0.0  0.0  1.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  1.0
```
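The full SVD by itself is just a factorization; the "latent semantic" part comes from truncating it. A minimal sketch of projecting documents into a low-dimensional latent space (the choice of `k` and the variable names are illustrative, not part of the API):

```julia
using TextAnalysis, LinearAlgebra

crps = Corpus([
    StringDocument("this is a string document"),
    TokenDocument("this is a token document"),
])
update_lexicon!(crps)
F = lsa(DocumentTermMatrix(crps))

k = 2  # number of latent dimensions to keep (illustrative; at most the rank)
# Row i is document i expressed in the k latent semantic dimensions.
doc_embeddings = F.U[:, 1:k] * Diagonal(F.S[1:k])
# Cosine similarity between rows now measures semantic relatedness.
```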
LDA: Latent Dirichlet Allocation
Another way to get a handle on the semantic content of a corpus is to use Latent Dirichlet Allocation:
First we need to produce a `DocumentTermMatrix`.
TextAnalysis.lda — Function

```julia
ϕ, θ = lda(dtm::DocumentTermMatrix, ntopics::Int, iterations::Int, α::Float64, β::Float64; kwargs...)
```

Perform Latent Dirichlet allocation.

Required Positional Arguments

- `α`: Dirichlet distribution hyperparameter for the topic distribution per document. `α < 1` yields a sparse topic mixture for each document; `α > 1` yields a more uniform topic mixture for each document.
- `β`: Dirichlet distribution hyperparameter for the word distribution per topic. `β < 1` yields a sparse word mixture for each topic; `β > 1` yields a more uniform word mixture for each topic.

Optional Keyword Arguments

- `showprogress::Bool`: Show a progress bar during the Gibbs sampling. Default value: `true`.

Return Values

- `ϕ`: `ntopics × nwords` sparse matrix of probabilities s.t. `sum(ϕ, 1) == 1`
- `θ`: `ntopics × ndocs` dense matrix of probabilities s.t. `sum(θ, 1) == 1`
```julia
julia> using TextAnalysis

julia> crps = Corpus([
           StringDocument("This is the Foo Bar Document"),
           StringDocument("This document has too Foo words")
       ]);

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)
A 2 X 10 DocumentTermMatrix

julia> k = 2             # number of topics
2

julia> iterations = 1000 # number of Gibbs sampling iterations
1000

julia> α = 0.1           # hyperparameter
0.1

julia> β = 0.1           # hyperparameter
0.1

julia> ϕ, θ = lda(m, k, iterations, α, β);

julia> ϕ
2×10 SparseArrays.SparseMatrixCSC{Float64, Int64} with 10 stored entries:
 0.111111  0.111111  0.222222  0.222222  …  0.111111   ⋅         ⋅
  ⋅         ⋅         ⋅         ⋅            ⋅        0.333333  0.333333

julia> θ
2×2 Matrix{Float64}:
 1.0  0.5
 0.0  0.5
```
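To interpret the topics, map each row of `ϕ` back to vocabulary terms. A minimal sketch continuing the session above; it assumes the `DocumentTermMatrix` exposes its vocabulary in column order as `m.terms` (treat that field as an assumption about the current API):

```julia
# Print the three highest-probability words for each topic.
for t in 1:k
    probs = Vector(ϕ[t, :])                 # densify this topic's word distribution
    top = sortperm(probs, rev = true)[1:3]  # column indices of the top words
    println("topic $t: ", join(m.terms[top], ", "))
end
```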
See `?lda` for more help.