Statistical Language Model
TextAnalysis provides the following language models:
- MLE - Base ngram model.
- Lidstone - Base ngram model with Lidstone smoothing.
- Laplace - Base ngram model with Laplace smoothing.
- WittenBellInterpolated - Interpolated version of the Witten-Bell algorithm.
- KneserNeyInterpolated - Interpolated version of Kneser-Ney smoothing.
APIs
To use the API, we first instantiate the desired model and then fit it on a training set:
MLE(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}
Lidstone(word::Vector{T}, gamma::Float64, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}
Laplace(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}
WittenBellInterpolated(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}
KneserNeyInterpolated(word::Vector{T}, discount::Float64=0.1, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}
(lm::<Languagemodel>)(text, min::Integer, max::Integer)
Arguments:
- word : Array of strings to store the vocabulary.
- unk_cutoff : Tokens with counts greater than or equal to the cutoff value will be considered part of the vocabulary.
- unk_label : token for unknown labels.
- gamma : smoothing argument for the Lidstone model.
- discount : discounting factor for KneserNeyInterpolated.
For more information, see the docstring of Vocabulary.
julia> voc = ["my","name","is","salman","khan","and","he","is","shahrukh","Khan"]
julia> train = ["khan","is","my","good", "friend","and","He","is","my","brother"]
# voc and train are used to train vocabulary and model respectively
julia> model = MLE(voc)
MLE(Vocabulary(Dict("khan"=>1,"name"=>1,"<unk>"=>1,"salman"=>1,"is"=>2,"Khan"=>1,"my"=>1,"he"=>1,"shahrukh"=>1,"and"=>1…), 1, "<unk>", ["my", "name", "is", "salman", "khan", "and", "he", "is", "shahrukh", "Khan", "<unk>"]))
julia> voc
11-element Array{String,1}:
"my"
"name"
"is"
"salman"
"khan"
"and"
"he"
"is"
"shahrukh"
"Khan"
"<unk>"
# note that the "<unk>" token has been added to voc
julia> fit = model(train, 2, 2) # considering only bigrams
julia> unmaskedscore = score(model, fit, "is", "<unk>") # outputs P(word | context) without replacing the context word with "<unk>"
0.3333333333333333
julia> masked_score = maskedscore(model, fit, "is", "alien")
0.3333333333333333
# as expected, maskedscore equals unmaskedscore with the context word replaced by "<unk>"
When you call MLE(voc) for the first time, it also updates your vocabulary set.
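The other models are instantiated the same way; a minimal sketch using the same voc (the gamma and discount values here are illustrative, not recommendations):
julia> lid = Lidstone(voc, 0.5)              # add-gamma smoothing with gamma = 0.5
julia> lap = Laplace(voc)                    # add-one smoothing
julia> wbi = WittenBellInterpolated(voc)     # Witten-Bell interpolation
julia> kni = KneserNeyInterpolated(voc, 0.1) # Kneser-Ney with discount = 0.1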
Evaluation Methods
score
score is used to evaluate the probability of a word given its context, P(word | context).
TextAnalysis.score — Function
score(m::gammamodel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)
score is used to output the probability of a word given its context, applying additive smoothing (add-gamma for Lidstone, add-one for Laplace) for gammamodel models.
score(m::MLE, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)
score is used to output the probability of a word given its context under MLE.
score(m::InterpolatedLanguageModel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)
score is used to output the probability of a word given its context under an InterpolatedLanguageModel, applying Kneser-Ney or Witten-Bell smoothing depending on the subtype.
Arguments:
- m : Instance of a Langmodel struct.
- temp_lm : output of the function call on the Langmodel instance.
- word : string of the word.
- context : context of the given word.
For Lidstone and Laplace models it applies additive smoothing; for interpolated language models it provides Kneser-Ney or Witten-Bell smoothing.
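To see where the 0.3333… in the example above comes from: the training tokens "good", "friend", "He" (case-sensitive; voc contains lowercase "he"), and "brother" are not in voc, so they are masked to "<unk>". Among the fitted bigrams, the context "<unk>" occurs three times ("<unk> <unk>", "<unk> and", "<unk> is"), and "is" follows it once, so P("is" | "<unk>") = 1/3:
julia> score(model, fit, "is", "<unk>")
0.3333333333333333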
maskedscore
TextAnalysis.maskedscore — Function
maskedscore(
    m::TextAnalysis.Langmodel,
    temp_lm::DataStructures.DefaultDict,
    word,
    context
) -> Float64
Evaluates the score after masking out-of-vocabulary words with the "<unk>" token.
The arguments are the same as for score.
logscore
TextAnalysis.logscore — Function
logscore(
    m::TextAnalysis.Langmodel,
    temp_lm::DataStructures.DefaultDict,
    word,
    context
) -> Float64
Evaluates the log score of the word in the given context.
The arguments are the same as for score and maskedscore.
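Assuming logscore is the base-2 logarithm of the masked score (consistent with the base-2 entropy and perplexity below), the earlier example gives:
julia> logscore(model, fit, "is", "alien") # log2(1/3), assuming base-2 logs
-1.584962500721156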
entropy
TextAnalysis.entropy — Function
entropy(
    m::TextAnalysis.Langmodel,
    lm::DataStructures.DefaultDict,
    text_ngram::AbstractVector
) -> Float64
Calculates the cross-entropy of the model for the given evaluation text.
The input text must be a Vector of ngrams of the same order.
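A minimal sketch reusing model and fit from above; per the note, the evaluation text is a Vector of same-order ngrams, here space-joined bigram strings of the kind everygram and padding_ngram produce:
julia> test_ngrams = ["khan is", "is my"] # bigrams as space-joined strings
julia> entropy(model, fit, test_ngrams)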
perplexity
TextAnalysis.perplexity — Function
perplexity(
    m::TextAnalysis.Langmodel,
    lm::DataStructures.DefaultDict,
    text_ngram::AbstractVector
) -> Float64
Calculates the perplexity of the given text.
This is simply 2^entropy, i.e. 2 raised to the cross-entropy of the text, so the arguments are the same as for entropy.
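Since perplexity is just 2 raised to the cross-entropy, the two functions should agree; a quick consistency check with the test_ngrams from the entropy sketch above:
julia> perplexity(model, fit, test_ngrams) ≈ 2 ^ entropy(model, fit, test_ngrams)
true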
Preprocessing
The following functions are provided for preprocessing:
TextAnalysis.everygram — Function
everygram(seq::Vector{T}; min_len::Int=1, max_len::Int=-1) where {T <: AbstractString}
Returns all possible ngrams generated from a sequence of items, as an Array{String,1}.
Example
julia> seq = ["To","be","or","not"]
julia> a = everygram(seq,min_len=1, max_len=-1)
10-element Array{Any,1}:
"or"
"not"
"To"
"be"
"or not"
"be or"
"be or not"
"To be or"
"To be or not"
TextAnalysis.padding_ngram — Function
padding_ngram(word::Vector{T}, n=1; pad_left=false, pad_right=false, left_pad_symbol="<s>", right_pad_symbol="</s>") where {T <: AbstractString}
padding_ngram is used to pad both the left and right of a sentence and to output ngrams of order n.
It also pads the original input Array of strings in place.
Example
julia> example = ["1","2","3","4","5"]
julia> padding_ngram(example,2,pad_left=true,pad_right=true)
6-element Array{Any,1}:
"<s> 1"
"1 2"
"2 3"
"3 4"
"4 5"
"5 </s>"
Vocabulary
Struct to store the language model's vocabulary.
It checks membership and filters items by comparing their counts to a cutoff value.
It also adds a special "unknown" token to which unseen words are mapped.
julia> using TextAnalysis
julia> words = ["a", "c", "-", "d", "c", "a", "b", "r", "a", "c", "d"]
11-element Vector{String}:
 "a"
 "c"
 "-"
 "d"
 "c"
 "a"
 "b"
 "r"
 "a"
 "c"
 "d"
julia> vocabulary = Vocabulary(words, 2)
Vocabulary(Dict("<unk>" => 1, "c" => 3, "a" => 3, "d" => 2), 2, "<unk>", ["a", "c", "-", "d", "c", "a", "b", "r", "a", "c", "d", "<unk>"])
julia> # lookup a sequence of words in the vocabulary

julia> word = ["a", "-", "d", "c", "a"]
5-element Vector{String}:
 "a"
 "-"
 "d"
 "c"
 "a"
julia> lookup(vocabulary, word)
5-element Vector{String}:
 "a"
 "<unk>"
 "d"
 "c"
 "a"