LSA: Latent Semantic Analysis

Often we want to think about documents from the perspective of their semantic content. One standard approach is to perform Latent Semantic Analysis (LSA) on the corpus.

TextAnalysis.lsa — Function

lsa(dtm::DocumentTermMatrix)
lsa(crps::Corpus)

Performs Latent Semantic Analysis (LSA) on a corpus.


lsa uses tf_idf weighting for its term statistics, then computes a singular value decomposition of the weighted matrix.
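The same pipeline can be reproduced by hand. The sketch below assumes lsa is equivalent to tf_idf weighting followed by a dense SVD (consistent with the note above and the LinearAlgebra.SVD return type shown below); tf_idf and svd are existing TextAnalysis and LinearAlgebra functions.

julia> using TextAnalysis, LinearAlgebra

julia> crps = Corpus([StringDocument("this is a string document"),
                      TokenDocument("this is a token document")]);

julia> update_lexicon!(crps)

julia> weighted = tf_idf(DocumentTermMatrix(crps));  # tf-idf weighting of the raw counts

julia> F = svd(Matrix(weighted));                    # the factorization lsa returns

Calling lsa directly produces the same factorization: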

julia> using TextAnalysis

julia> crps = Corpus([StringDocument("this is a string document"),
                      TokenDocument("this is a token document")])
A Corpus with 2 documents:
 * 1 StringDocument's
 * 0 FileDocument's
 * 1 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens

julia> lsa(crps)
LinearAlgebra.SVD{Float64, Float64, Matrix{Float64}, Vector{Float64}}
U factor:
2×2 Matrix{Float64}:
 1.0  0.0
 0.0  1.0
singular values:
2-element Vector{Float64}:
 0.13862943611198905
 0.13862943611198905
Vt factor:
2×6 Matrix{Float64}:
 0.0  0.0  0.0  1.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  1.0
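The result is a plain LinearAlgebra.SVD object, so its U, S, and Vt fields can be used directly. As a hedged usage sketch (standard LSA practice, not a dedicated TextAnalysis API), documents can be embedded in the latent space by scaling U by the singular values:

julia> using LinearAlgebra

julia> F = lsa(crps);

julia> doc_vectors = F.U * Diagonal(F.S);  # one row per document in latent space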

lsa can also be performed on a DocumentTermMatrix.

julia> using TextAnalysis

julia> crps = Corpus([StringDocument("this is a string document"),
                      TokenDocument("this is a token document")]);

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)
A 2 X 6 DocumentTermMatrix

julia> lsa(m)
LinearAlgebra.SVD{Float64, Float64, Matrix{Float64}, Vector{Float64}}
U factor:
2×2 Matrix{Float64}:
 1.0  0.0
 0.0  1.0
singular values:
2-element Vector{Float64}:
 0.13862943611198905
 0.13862943611198905
Vt factor:
2×6 Matrix{Float64}:
 0.0  0.0  0.0  1.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  1.0
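lsa returns the full SVD, so truncating to the top k singular values (the "latent" dimensionality) is left to the caller. A minimal sketch, continuing from the example above and assuming a rank-k approximation is what you want:

julia> using LinearAlgebra

julia> F = lsa(m);

julia> k = 1;  # number of latent dimensions to keep

julia> approx = F.U[:, 1:k] * Diagonal(F.S[1:k]) * F.Vt[1:k, :];  # rank-k reconstruction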

LDA: Latent Dirichlet Allocation

Another way to get a handle on the semantic content of a corpus is to use Latent Dirichlet Allocation:

First we need to produce the DocumentTermMatrix:

TextAnalysis.lda — Function
ϕ, θ = lda(dtm::DocumentTermMatrix, ntopics::Int, iterations::Int, α::Float64, β::Float64; kwargs...)

Perform Latent Dirichlet allocation.

Required Positional Arguments

  • dtm The DocumentTermMatrix to model.
  • ntopics The number of topics to infer.
  • iterations The number of Gibbs sampling iterations to run.
  • α Dirichlet dist. hyperparameter for the topic distribution per document. α<1 yields a sparse topic mixture for each document; α>1 yields a more uniform topic mixture for each document (see the sketch at the end of this section).
  • β Dirichlet dist. hyperparameter for the word distribution per topic. β<1 yields a sparse word mixture for each topic; β>1 yields a more uniform word mixture for each topic.

Optional Keyword Arguments

  • showprogress::Bool. Show a progress bar during the Gibbs sampling. Default value: true.

Return Values

  • ϕ: ntopics × nwords sparse matrix of probabilities; each row (a topic's word distribution) sums to 1, i.e. all(sum(ϕ, dims=2) .≈ 1)
  • θ: ntopics × ndocs dense matrix of probabilities; each column (a document's topic mixture) sums to 1, i.e. all(sum(θ, dims=1) .≈ 1)
julia> using TextAnalysis

julia> crps = Corpus([StringDocument("This is the Foo Bar Document"),
                      StringDocument("This document has too Foo words")]);

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)
A 2 X 10 DocumentTermMatrix

julia> k = 2  # number of topics
2

julia> iterations = 1000  # number of gibbs sampling iterations
1000

julia> α = 0.1  # hyper parameter
0.1

julia> β = 0.1  # hyper parameter
0.1

julia> ϕ, θ = lda(m, k, iterations, α, β);

julia> ϕ
2×10 SparseArrays.SparseMatrixCSC{Float64, Int64} with 10 stored entries:
  ⋅    ⋅    ⋅    ⋅   0.5  0.5   ⋅    ⋅    ⋅    ⋅
 0.1  0.1  0.2  0.2   ⋅    ⋅   0.1  0.1  0.1  0.1

julia> θ
2×2 Matrix{Float64}:
 0.0  0.333333
 1.0  0.666667
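A quick sanity check on the output above: each row of ϕ (a topic's distribution over words) sums to 1, and each column of θ (a document's topic mixture) sums to 1. The top_terms helper below is hypothetical (not part of TextAnalysis); it ranks each topic's probabilities against the terms field of the DocumentTermMatrix, and its output varies between Gibbs runs:

julia> all(sum(ϕ, dims=2) .≈ 1)  # one word distribution per topic
true

julia> all(sum(θ, dims=1) .≈ 1)  # one topic mixture per document
true

julia> top_terms(ϕ, terms, n) = [terms[sortperm(collect(ϕ[t, :]), rev=true)[1:n]]
                                 for t in 1:size(ϕ, 1)];  # hypothetical helper

julia> top_terms(ϕ, m.terms, 3)  # three most probable words per topic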

See ?lda for more help.
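As the docstring notes, α controls how peaked each document's topic mixture is. A hedged exploration sketch, reusing m, k, iterations, and β from the example above (results vary between runs because lda uses Gibbs sampling):

julia> _, θ_peaked = lda(m, k, iterations, 0.01, β);  # α ≪ 1: sparse topic mixtures

julia> _, θ_flat = lda(m, k, iterations, 10.0, β);    # α ≫ 1: more uniform mixtures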