LSA: Latent Semantic Analysis
Often we want to analyze documents from the perspective of their semantic content. One standard approach to doing this is to perform Latent Semantic Analysis (LSA) on the corpus.
TextAnalysis.lsa — Function
```julia
lsa(dtm::DocumentTermMatrix)
lsa(crps::Corpus)
```

Perform Latent Semantic Analysis (LSA) on a corpus or document-term matrix.

LSA uses `tf_idf` for computing term statistics.
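In other words, the factorization returned by `lsa` is the SVD of the tf-idf weighted document-term matrix. As a minimal sketch of that relationship, the same decomposition can be computed by hand with the exported `tf_idf` function and the standard library `svd`; the equivalence (up to sign and ordering of singular vectors) is an assumption worth verifying against your TextAnalysis version:

```julia
using TextAnalysis, LinearAlgebra

crps = Corpus([
    StringDocument("this is a string document"),
    TokenDocument("this is a token document")
])
update_lexicon!(crps)
m = DocumentTermMatrix(crps)

weighted = tf_idf(m)       # tf-idf weighted document-term matrix
F = svd(Matrix(weighted))  # densify and decompose; assumed equivalent to lsa(m)
```

Calling `lsa` on a corpus directly in the REPL: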
```julia-repl
julia> using TextAnalysis

julia> crps = Corpus([
           StringDocument("this is a string document"),
           TokenDocument("this is a token document")
       ])
A Corpus with 2 documents:
 * 1 StringDocument's
 * 0 FileDocument's
 * 1 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens

julia> lsa(crps)
LinearAlgebra.SVD{Float64, Float64, Matrix{Float64}, Vector{Float64}}
U factor:
2×2 Matrix{Float64}:
 1.0  0.0
 0.0  1.0
singular values:
2-element Vector{Float64}:
 0.13862943611198905
 0.13862943611198905
Vt factor:
2×6 Matrix{Float64}:
 0.0  0.0  0.0  1.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  1.0
```
LSA can also be performed directly on a `DocumentTermMatrix`:
```julia-repl
julia> using TextAnalysis

julia> crps = Corpus([
           StringDocument("this is a string document"),
           TokenDocument("this is a token document")
       ]);

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)
A 2 X 6 DocumentTermMatrix

julia> lsa(m)
LinearAlgebra.SVD{Float64, Float64, Matrix{Float64}, Vector{Float64}}
U factor:
2×2 Matrix{Float64}:
 1.0  0.0
 0.0  1.0
singular values:
2-element Vector{Float64}:
 0.13862943611198905
 0.13862943611198905
Vt factor:
2×6 Matrix{Float64}:
 0.0  0.0  0.0  1.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  1.0
```
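The return value is a standard `LinearAlgebra.SVD` factorization with fields `U`, `S`, and `Vt`, so the usual truncated-SVD recipe applies. A minimal sketch of projecting documents into a k-dimensional latent space and comparing them there; the choice of `k` and the cosine-similarity step are illustrative, not part of the `lsa` API:

```julia
using TextAnalysis, LinearAlgebra

crps = Corpus([
    StringDocument("this is a string document"),
    TokenDocument("this is a token document")
])
update_lexicon!(crps)

F = lsa(DocumentTermMatrix(crps))

k = 2                                    # illustrative: latent dimensions to keep
docs = F.U[:, 1:k] * Diagonal(F.S[1:k])  # rows = documents in the latent space

# Cosine similarity between the two documents in the latent space.
sim = dot(docs[1, :], docs[2, :]) / (norm(docs[1, :]) * norm(docs[2, :]))
```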
LDA: Latent Dirichlet Allocation
Another way to analyze the semantic content of a corpus is to use Latent Dirichlet Allocation.
First, we need to create a `DocumentTermMatrix`:
TextAnalysis.lda — Function
```julia
ϕ, θ = lda(dtm::DocumentTermMatrix, ntopics::Int, iterations::Int, α::Float64, β::Float64; kwargs...)
```

Perform Latent Dirichlet Allocation (LDA).
Arguments
- `dtm::DocumentTermMatrix`: Document-term matrix containing the corpus
- `ntopics::Int`: Number of topics to extract
- `iterations::Int`: Number of Gibbs sampling iterations
- `α::Float64`: Dirichlet distribution hyperparameter for the topic distribution per document. `α < 1` yields a sparse topic mixture; `α > 1` yields a more uniform topic mixture
- `β::Float64`: Dirichlet distribution hyperparameter for the word distribution per topic. `β < 1` yields a sparse word mixture; `β > 1` yields a more uniform word mixture
Keyword Arguments
- `showprogress::Bool`: Show a progress bar during Gibbs sampling (default: `true`)
Returns
- `ϕ`: `ntopics × nwords` sparse matrix of word probabilities per topic
- `θ`: `ntopics × ndocs` dense matrix of topic probabilities per document
```julia-repl
julia> using TextAnalysis

julia> crps = Corpus([
           StringDocument("This is the Foo Bar Document"),
           StringDocument("This document has too Foo words")
       ]);

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)
A 2 X 10 DocumentTermMatrix

julia> k = 2              # Number of topics
2

julia> iterations = 1000  # Number of Gibbs sampling iterations
1000

julia> α = 0.1            # Hyperparameter for document-topic distribution
0.1

julia> β = 0.1            # Hyperparameter for topic-word distribution
0.1

julia> ϕ, θ = lda(m, k, iterations, α, β);

julia> ϕ  # Topic-word distribution matrix
2×10 SparseArrays.SparseMatrixCSC{Float64, Int64} with 10 stored entries:
 0.25  0.25   ⋅     ⋅     ⋅      ⋅     0.25  0.25   ⋅      ⋅
  ⋅     ⋅    0.25  0.25  0.125  0.125   ⋅     ⋅    0.125  0.125

julia> θ  # Document-topic distribution matrix
2×2 Matrix{Float64}:
 0.666667  0.0
 0.333333  1.0
```
The `lda` function returns two matrices:

- `ϕ` (phi): The topic-word distribution matrix, giving the probability of each word in each topic
- `θ` (theta): The document-topic distribution matrix, giving the probability of each topic in each document
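To make the topics readable, the rows of `ϕ` can be paired with the matrix's vocabulary. A minimal sketch, assuming the `terms` field of `DocumentTermMatrix` lists the vocabulary in column order (true in current TextAnalysis releases, but verify against your version):

```julia
using TextAnalysis

crps = Corpus([
    StringDocument("This is the Foo Bar Document"),
    StringDocument("This document has too Foo words")
])
update_lexicon!(crps)
m = DocumentTermMatrix(crps)

ϕ, θ = lda(m, 2, 1000, 0.1, 0.1)

# Print the three most probable words for each topic.
for topic in 1:size(ϕ, 1)
    probs = collect(ϕ[topic, :])            # densify one row of the sparse matrix
    top = sortperm(probs, rev = true)[1:3]  # indices of the 3 most probable words
    println("Topic $topic: ", join(m.terms[top], ", "))  # m.terms assumed to hold the vocabulary
end
```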
See `?lda` for more help.