LSA: Latent Semantic Analysis

Often we want to analyze documents from the perspective of their semantic content. One standard approach to doing this is to perform Latent Semantic Analysis (LSA) on the corpus.

TextAnalysis.lsa (Function)
lsa(dtm::DocumentTermMatrix)
lsa(crps::Corpus)

Perform Latent Semantic Analysis (LSA) on a corpus or document-term matrix.

LSA uses tf_idf for computing term statistics.

julia> using TextAnalysis

julia> crps = Corpus([StringDocument("this is a string document"),
                      TokenDocument("this is a token document")])
A Corpus with 2 documents:
 * 1 StringDocument's
 * 0 FileDocument's
 * 1 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens

julia> lsa(crps)
LinearAlgebra.SVD{Float64, Float64, Matrix{Float64}, Vector{Float64}}
U factor:
2×2 Matrix{Float64}:
 1.0  0.0
 0.0  1.0
singular values:
2-element Vector{Float64}:
 0.13862943611198905
 0.13862943611198905
Vt factor:
2×6 Matrix{Float64}:
 0.0  0.0  0.0  1.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  1.0
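Under the hood, lsa applies the tf_idf weighting to the document-term matrix and then factorizes it. A minimal sketch of the equivalent manual computation (an illustration of the idea, not the package's exact code path):

using TextAnalysis, LinearAlgebra

crps = Corpus([StringDocument("this is a string document"),
               TokenDocument("this is a token document")])
update_lexicon!(crps)

# Weight the raw counts by tf-idf, then take a dense SVD, mirroring
# the LinearAlgebra.SVD object that lsa returns above.
weighted = tf_idf(DocumentTermMatrix(crps))
F = svd(Matrix(weighted))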

LSA can also be performed directly on a DocumentTermMatrix:

julia> using TextAnalysis

julia> crps = Corpus([StringDocument("this is a string document"),
                      TokenDocument("this is a token document")]);

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)
A 2 X 6 DocumentTermMatrix

julia> lsa(m)
LinearAlgebra.SVD{Float64, Float64, Matrix{Float64}, Vector{Float64}}
U factor:
2×2 Matrix{Float64}:
 1.0  0.0
 0.0  1.0
singular values:
2-element Vector{Float64}:
 0.13862943611198905
 0.13862943611198905
Vt factor:
2×6 Matrix{Float64}:
 0.0  0.0  0.0  1.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  1.0
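The factors of the returned SVD can be used directly. For instance, scaling the left singular vectors by the singular values gives low-dimensional document embeddings; the snippet below is a sketch, and doc_embeddings and k are illustrative names rather than part of the API:

using LinearAlgebra   # for Diagonal

F = lsa(m)
k = 2                               # number of latent dimensions to keep
# Each row is one document in the k-dimensional latent space.
doc_embeddings = F.U[:, 1:k] * Diagonal(F.S[1:k])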

LDA: Latent Dirichlet Allocation

Another way to analyze the semantic content of a corpus is to use Latent Dirichlet Allocation.

As with LSA, we first need to create a DocumentTermMatrix; the worked example below builds one and passes it to lda.

TextAnalysis.lda (Function)
ϕ, θ = lda(dtm::DocumentTermMatrix, ntopics::Int, iterations::Int, α::Float64, β::Float64; kwargs...)

Perform Latent Dirichlet allocation.

Arguments

  • dtm::DocumentTermMatrix: Document-term matrix containing the corpus
  • ntopics::Int: Number of topics to extract
  • iterations::Int: Number of Gibbs sampling iterations
  • α::Float64: Dirichlet distribution hyperparameter for topic distribution per document. α < 1 yields a sparse topic mixture, α > 1 yields a more uniform topic mixture
  • β::Float64: Dirichlet distribution hyperparameter for word distribution per topic. β < 1 yields a sparse word mixture, β > 1 yields a more uniform word mixture

Keyword Arguments

  • showprogress::Bool: Show a progress bar during Gibbs sampling (default: true)

Returns

  • ϕ: ntopics × nwords sparse matrix of word probabilities per topic
  • θ: ntopics × ndocs dense matrix of topic probabilities per document

julia> using TextAnalysis

julia> crps = Corpus([StringDocument("This is the Foo Bar Document"),
                      StringDocument("This document has too Foo words")]);

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)
A 2 X 10 DocumentTermMatrix

julia> k = 2                # Number of topics
2

julia> iterations = 1000    # Number of Gibbs sampling iterations
1000

julia> α = 0.1              # Hyperparameter for document-topic distribution
0.1

julia> β = 0.1              # Hyperparameter for topic-word distribution
0.1

julia> ϕ, θ = lda(m, k, iterations, α, β);

julia> ϕ    # Topic-word distribution matrix
2×10 SparseArrays.SparseMatrixCSC{Float64, Int64} with 10 stored entries:
 0.25  0.25   ⋅     ⋅     ⋅      ⋅     0.25  0.25   ⋅      ⋅
  ⋅     ⋅    0.25  0.25  0.125  0.125   ⋅     ⋅    0.125  0.125

julia> θ    # Document-topic distribution matrix
2×2 Matrix{Float64}:
 0.666667  0.0
 0.333333  1.0
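Because each column of θ holds one document's distribution over topics, the columns sum to 1. A quick sanity check, reusing θ from the session above:

# Each of the ndocs columns of θ is a probability distribution over topics.
column_sums = sum(θ, dims = 1)
all(x -> isapprox(x, 1.0), column_sums)   # true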

The lda function returns two matrices:

  • ϕ (phi): The topic-word distribution matrix showing the probability of each word in each topic
  • θ (theta): The document-topic distribution matrix showing the probability of each topic in each document
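To interpret ϕ, map its columns back to vocabulary words through the terms field of the document-term matrix. In the sketch below, top_words is a hypothetical helper written for illustration, not part of TextAnalysis:

# Hypothetical helper: print the n most probable words per topic.
# m.terms maps the column indices of ϕ back to vocabulary words.
function top_words(ϕ, terms, n = 3)
    for topic in 1:size(ϕ, 1)
        weights = Vector(ϕ[topic, :])            # densify one sparse row
        best = sortperm(weights, rev = true)[1:n]
        println("Topic $topic: ", join(terms[best], ", "))
    end
end

top_words(ϕ, m.terms)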

See ?lda for more help.