Statistical Language Models
TextAnalysis provides the following language models:
- MLE - Base n-gram model using Maximum Likelihood Estimation.
- Lidstone - Base n-gram model with Lidstone smoothing.
- Laplace - Base n-gram language model with Laplace smoothing.
- WittenBellInterpolated - Interpolated version of the Witten-Bell algorithm.
- KneserNeyInterpolated - Interpolated version of Kneser-Ney smoothing.
APIs
To use the API, first instantiate the desired model and then train it with a training set:
MLE(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}
Lidstone(word::Vector{T}, gamma::Float64, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}
Laplace(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}
WittenBellInterpolated(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}
KneserNeyInterpolated(word::Vector{T}, discount::Float64=0.1, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}
(lm::<Languagemodel>)(text, min::Integer, max::Integer)
Arguments:
- word: Array of strings to store the vocabulary.
- unk_cutoff: Tokens with counts greater than or equal to the cutoff value will be considered part of the vocabulary.
- unk_label: Token for unknown labels.
- gamma: Smoothing parameter gamma (Lidstone only).
- discount: Discounting factor for KneserNeyInterpolated.
For more information, see the docstrings of the vocabulary functions.
julia> voc = ["my","name","is","salman","khan","and","he","is","shahrukh","Khan"]
julia> train = ["khan","is","my","good", "friend","and","He","is","my","brother"]
# voc is used to build the vocabulary and train is used to fit the model
julia> model = MLE(voc)
MLE(Vocabulary(Dict("khan"=>1,"name"=>1,"<unk>"=>1,"salman"=>1,"is"=>2,"Khan"=>1,"my"=>1,"he"=>1,"shahrukh"=>1,"and"=>1…), 1, "<unk>", ["my", "name", "is", "salman", "khan", "and", "he", "is", "shahrukh", "Khan", "<unk>"]))
julia> voc
11-element Vector{String}:
"my"
"name"
"is"
"salman"
"khan"
"and"
"he"
"is"
"shahrukh"
"Khan"
"<unk>"
# You can see the "<unk>" token is added to voc
julia> fit = model(train, 2, 2) # considering only bigrams
julia> unmaskedscore = score(model, fit, "is", "<unk>") # plain score P(word | context); out-of-vocabulary words are not replaced with "<unk>"
0.3333333333333333
julia> masked_score = maskedscore(model, fit, "is", "alien")
0.3333333333333333
# As expected, maskedscore is equivalent to unmaskedscore with context replaced with "<unk>"
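The same workflow applies to the smoothed models. The sketch below is only illustrative: it reuses voc and train from above, and the gamma value of 0.5 is an arbitrary choice.
julia> lid = Lidstone(voc, 0.5)            # Lidstone model with smoothing parameter gamma = 0.5
julia> lid_fit = lid(train, 2, 2)          # fit bigram counts, exactly as with the MLE model
julia> score(lid, lid_fit, "is", "<unk>")  # P("is" | "<unk>") under Lidstone smoothing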
Evaluation Methods
score
Used to evaluate the probability of a word given its context (P(word | context)):
TextAnalysis.score — Function
score(m::gammamodel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)
Compute the probability of a word given its context using additive smoothing.
Applies add-gamma smoothing for Lidstone and add-one smoothing for Laplace (the gammamodel types).
score(m::MLE, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)
Compute the probability of a word given its context using MLE (Maximum Likelihood Estimation).
score(m::InterpolatedLanguageModel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)
Compute the probability of a word given its context in an interpolated language model.
Applies Kneser-Ney and Witten-Bell smoothing depending on the sub-type.
Arguments:
- m: Instance of a Langmodel struct.
- temp_lm: Output of calling the Langmodel instance on the training text (e.g. model(train, 2, 2) above).
- word: String of the word.
- context: Context of the given word.
- For Lidstone and Laplace models, additive smoothing is applied.
- For interpolated language models, Kneser-Ney and Witten-Bell smoothing are provided (see the sketch below).
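As a rough sketch of the interpolated case, reusing voc and train from the APIs example (the resulting probability will differ from the MLE values shown earlier):
julia> wb = WittenBellInterpolated(voc)   # interpolated model with the default unk_cutoff
julia> wb_fit = wb(train, 2, 2)           # bigram counts, as before
julia> score(wb, wb_fit, "is", "<unk>")   # P("is" | "<unk>") with Witten-Bell smoothing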
maskedscore
TextAnalysis.maskedscore — Function
Same as score, but out-of-vocabulary words are replaced with the unk_label token ("<unk>") before scoring, as in the masked_score example above. The arguments are the same as for score.
logscore
TextAnalysis.logscore — Function
logscore(
m::TextAnalysis.Langmodel,
temp_lm::DataStructures.DefaultDict,
word,
context
) -> Float64
Evaluate the log score of a word in a given context.
The arguments are the same as for score and maskedscore.
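For example, continuing the REPL session from the APIs section (output omitted, since it is simply the logarithm of the score computed earlier):
julia> logscore(model, fit, "is", "<unk>")  # log-probability of "is" given "<unk>"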
entropy
TextAnalysis.entropy — Function
entropy(
m::TextAnalysis.Langmodel,
lm::DataStructures.DefaultDict,
text_ngram::AbstractVector
) -> Float64
Calculate the cross-entropy of the model for a given evaluation text.
Input text must be a Vector of n-grams of the same length.
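A minimal sketch, again continuing the earlier session: everygram (documented under Preprocessing below) builds the required vector of same-length n-grams. Scoring the training bigrams themselves is only to illustrate the call; held-out text would be used in practice.
julia> train_bigrams = everygram(train, min_len=2, max_len=2)  # all bigrams of the training text
julia> entropy(model, fit, train_bigrams)                      # cross-entropy over these bigrams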
perplexity
TextAnalysis.perplexity — Function
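Perplexity is conventionally the exponential of the cross-entropy. Assuming it takes the same arguments as entropy, a sketch on the bigrams from the previous example:
julia> perplexity(model, fit, train_bigrams)  # perplexity of the model on the same bigrams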
Preprocessing
The following functions are available for preprocessing:
TextAnalysis.everygram — Function
everygram(seq::Vector{T}; min_len::Int=1, max_len::Int=-1) where {T <: AbstractString}
Return all possible n-grams generated from a sequence of items, as a Vector{String}.
Example
julia> seq = ["To","be","or","not"]
julia> a = everygram(seq, min_len=1, max_len=-1)
10-element Vector{Any}:
"or"
"not"
"To"
"be"
"or not"
"be or"
"be or not"
"To be or"
"To be or not"sourceTextAnalysis.padding_ngram — Function
padding_ngram(word::Vector{T}, n=1; pad_left=false, pad_right=false, left_pad_symbol="<s>", right_pad_symbol="</s>") where {T <: AbstractString}
Pad both the left and right sides of a sentence and output n-grams of order n.
This function also pads the original input vector of strings.
Example
julia> example = ["1","2","3","4","5"]
julia> padding_ngram(example,2,pad_left=true,pad_right=true)
6-element Vector{Any}:
"<s> 1"
"1 2"
"2 3"
"3 4"
"4 5"
"5 </s>"sourceVocabulary
A struct to store language model vocabulary.
It checks membership and filters items by comparing their counts to a cutoff value.
It also adds a special "unknown" token which unseen words are mapped to:
julia> using TextAnalysis

julia> words = ["a", "c", "-", "d", "c", "a", "b", "r", "a", "c", "d"]
11-element Vector{String}:
"a"
"c"
"-"
"d"
"c"
"a"
"b"
"r"
"a"
"c"
"d"

julia> vocabulary = Vocabulary(words, 2)
Vocabulary(Dict("<unk>" => 1, "c" => 3, "a" => 3, "d" => 2), 2, "<unk>", ["a", "c", "-", "d", "c", "a", "b", "r", "a", "c", "d", "<unk>"])

julia> # Look up a sequence of words in the vocabulary

julia> word = ["a", "-", "d", "c", "a"]
5-element Vector{String}:
"a"
"-"
"d"
"c"
"a"

julia> lookup(vocabulary, word)
5-element Vector{String}:
"a"
"<unk>"
"d"
"c"
"a"