Statistical Language Models

TextAnalysis provides the following language models:

  • MLE - Base n-gram model using Maximum Likelihood Estimation.
  • Lidstone - Base n-gram model with Lidstone smoothing.
  • Laplace - Base n-gram language model with Laplace smoothing.
  • WittenBellInterpolated - Interpolated version of the Witten-Bell algorithm.
  • KneserNeyInterpolated - Interpolated version of Kneser-Ney smoothing.

APIs

To use the API, first instantiate the desired model and then train it with a training set:

MLE(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Lidstone(word::Vector{T}, gamma::Float64, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Laplace(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

WittenBellInterpolated(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

KneserNeyInterpolated(word::Vector{T}, discount::Float64=0.1, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

(lm::Langmodel)(text, min::Integer, max::Integer)

Arguments:

  • word: Vector of strings from which the vocabulary is built.

  • unk_cutoff: Tokens with counts greater than or equal to the cutoff value will be considered part of the vocabulary.

  • unk_label: Label assigned to tokens that are not part of the vocabulary (unknown words).

  • gamma: Additive smoothing parameter for the Lidstone model.

  • discount: Discounting factor for KneserNeyInterpolated.

For more information, see the docstrings of the vocabulary functions.

julia> voc = ["my","name","is","salman","khan","and","he","is","shahrukh","Khan"]

julia> train = ["khan","is","my","good", "friend","and","He","is","my","brother"]
# voc is used to build the vocabulary; train is used to train the model

julia> model = MLE(voc)
MLE(Vocabulary(Dict("khan"=>1,"name"=>1,"<unk>"=>1,"salman"=>1,"is"=>2,"Khan"=>1,"my"=>1,"he"=>1,"shahrukh"=>1,"and"=>1…), 1, "<unk>", ["my", "name", "is", "salman", "khan", "and", "he", "is", "shahrukh", "Khan", "<unk>"]))

julia> voc
11-element Vector{String}:
 "my"
 "name"
 "is"
 "salman"
 "khan" 
 "and" 
 "he" 
 "is"
 "shahrukh"
 "Khan"
 "<unk>"

# You can see the "<unk>" token is added to voc 
julia> fit = model(train,2,2) # considering only bigrams

julia> unmaskedscore = score(model, fit, "is", "<unk>") # P(word | context); score does not mask out-of-vocabulary words, so "<unk>" is passed explicitly
0.3333333333333333

julia> masked_score = maskedscore(model,fit,"is","alien")
0.3333333333333333
# As expected, maskedscore equals the unmasked score with the out-of-vocabulary context word replaced by "<unk>"

Note

When you call MLE(voc) for the first time, it also updates the vocabulary set you pass in (this is why "<unk>" was added to voc above).
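
The other models are constructed and trained in the same way. Below is a minimal sketch reusing voc and train from above; the gamma and discount values are only illustrative, and the REPL outputs are omitted:

julia> lid = Lidstone(voc, 0.5)          # Lidstone smoothing with gamma = 0.5

julia> lap = Laplace(voc)                # Laplace (add-one) smoothing

julia> wbi = WittenBellInterpolated(voc)

julia> kni = KneserNeyInterpolated(voc, 0.1)   # discount = 0.1

julia> lid_fit = lid(train, 2, 2)        # bigram counts, as in the MLE example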

Evaluation Methods

score

Used to evaluate the probability of a word given its context (P(word | context)):

TextAnalysis.score — Function
score(m::gammamodel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)

Compute the probability of a word given its context using add-one smoothing.

Applies add-one smoothing to Lidstone or Laplace (gammamodel) models.

score(m::MLE, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)

Compute the probability of a word given its context using MLE (Maximum Likelihood Estimation).

score(m::InterpolatedLanguageModel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)

Compute the probability of a word given its context in an interpolated language model.

Applies Kneser-Ney or Witten-Bell smoothing, depending on the subtype.


Arguments:

  1. m: Instance of a Langmodel struct.
  2. temp_lm: The n-gram counts returned by calling the model instance on the training text (fit in the example above).
  3. word: The word whose probability is being evaluated.
  4. context: The context (preceding words) of the given word.
  • For the Lidstone and Laplace models, additive smoothing is applied.
  • For the interpolated language models, Kneser-Ney or Witten-Bell smoothing is applied, depending on the model (see the sketch below).
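
As a rough sketch of the smoothed case, the Lidstone model lid and its bigram counts lid_fit from the constructor sketch above can be scored the same way; the exact value depends on gamma, so no output is shown:

julia> score(lid, lid_fit, "is", "my")   # P("is" | "my"); the bigram "my is" is unseen in train, but additive smoothing keeps the probability nonzero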

maskedscore

TextAnalysis.maskedscore — Function
maskedscore(
    m::TextAnalysis.Langmodel,
    temp_lm::DataStructures.DefaultDict,
    word,
    context
) -> Float64

Evaluate the score with masked out-of-vocabulary words.

The arguments are the same as for score.
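
As the earlier example suggests, masking replaces out-of-vocabulary words with unk_label before scoring, so the following comparison is expected to hold for that session:

julia> maskedscore(model, fit, "is", "alien") == score(model, fit, "is", "<unk>")
true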


logscore

TextAnalysis.logscore — Function
logscore(
    m::TextAnalysis.Langmodel,
    temp_lm::DataStructures.DefaultDict,
    word,
    context
) -> Float64

Evaluate the log score of a word in a given context.

The arguments are the same as for score and maskedscore.
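
A short sketch continuing the same session; the comment assumes base-2 logarithms, matching the 2^entropy convention used by perplexity below:

julia> logscore(model, fit, "is", "alien")   # expected ≈ log2(1/3) ≈ -1.585 under the base-2 assumption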


entropy

TextAnalysis.entropy — Function
entropy(
    m::TextAnalysis.Langmodel,
    lm::DataStructures.DefaultDict,
    text_ngram::AbstractVector
) -> Float64

Calculate the cross-entropy of the model for a given evaluation text.

Input text must be a Vector of n-grams of the same length.
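
For example, reusing model and fit from the MLE session above (a sketch; the n-grams are assumed to be space-joined strings of equal order, matching the everygram output shown below):

julia> test_bigrams = ["khan is", "is my"]   # evaluation text as bigrams of the same length

julia> entropy(model, fit, test_bigrams)     # cross-entropy (bits per bigram) of the model on this text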


perplexity

TextAnalysis.perplexity — Function
perplexity(
    m::TextAnalysis.Langmodel,
    lm::DataStructures.DefaultDict,
    text_ngram::AbstractVector
) -> Float64

Calculate the perplexity of the given text.

This is simply 2 raised to the cross-entropy of the text, so the arguments are the same as for entropy.
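
Continuing the entropy sketch above, the relationship can be checked directly; the expected result is shown:

julia> perplexity(model, fit, test_bigrams) ≈ 2 ^ entropy(model, fit, test_bigrams)
true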


Preprocessing

The following functions are available for preprocessing:

TextAnalysis.everygram — Function
everygram(seq::Vector{T}; min_len::Int=1, max_len::Int=-1) where {T <: AbstractString}

Return all possible n-grams generated from a sequence of items, as a Vector{String}.

Example

julia> seq = ["To","be","or","not"]

julia> a = everygram(seq, min_len=1, max_len=-1)
10-element Vector{Any}:
 "or"
 "not"
 "To"
 "be"
 "To be"
 "or not"
 "be or"
 "be or not"
 "To be or"
 "To be or not"
TextAnalysis.padding_ngram — Function
padding_ngram(word::Vector{T}, n=1; pad_left=false, pad_right=false, left_pad_symbol="<s>", right_pad_symbol="</s>") where {T <: AbstractString}

Pad both left and right sides of a sentence and output n-grams of order n.

This function also pads the original input vector of strings.

Example

julia> example = ["1","2","3","4","5"]

julia> padding_ngram(example,2,pad_left=true,pad_right=true)
 6-element Vector{Any}:
  "<s> 1" 
  "1 2"   
  "2 3"   
  "3 4"   
  "4 5"   
  "5 </s>"

Vocabulary

A struct to store language model vocabulary.

It checks membership and filters items by comparing their counts to a cutoff value.

It also adds a special "unknown" token which unseen words are mapped to:

julia> using TextAnalysis

julia> words = ["a", "c", "-", "d", "c", "a", "b", "r", "a", "c", "d"]
11-element Vector{String}:
 "a"
 "c"
 "-"
 "d"
 "c"
 "a"
 "b"
 "r"
 "a"
 "c"
 "d"

julia> vocabulary = Vocabulary(words, 2)
Vocabulary(Dict("<unk>" => 1, "c" => 3, "a" => 3, "d" => 2), 2, "<unk>", ["a", "c", "-", "d", "c", "a", "b", "r", "a", "c", "d", "<unk>"])

julia> # Look up a sequence of words in the vocabulary
julia> word = ["a", "-", "d", "c", "a"]
5-element Vector{String}:
 "a"
 "-"
 "d"
 "c"
 "a"

julia> lookup(vocabulary, word)
5-element Vector{String}:
 "a"
 "<unk>"
 "d"
 "c"
 "a"