API References

Base.argmaxMethod
argmax(scores::Vector{Score})::Score
  • scores - Vector of Score objects

Return the Score with the maximum f-measure field.

source
Base.merge!Method
merge!(dtm1::DocumentTermMatrix{T}, dtm2::DocumentTermMatrix{T}) where {T}

Merge one DocumentTermMatrix instance into another. Documents are appended to the end and terms are re-sorted. For efficiency, this may result in modifications to dtm2 as well.
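
Example

A minimal sketch, assuming two small corpora whose lexicons have been updated:

crps1 = Corpus([StringDocument("To be or not to be")])
crps2 = Corpus([StringDocument("To become or not to become")])
update_lexicon!(crps1); update_lexicon!(crps2)
m1 = DocumentTermMatrix(crps1)
m2 = DocumentTermMatrix(crps2)
merge!(m1, m2)   # m1 now covers both documents; m2 may be modified as well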

source
TextAnalysis.averageMethod
average(scores::Vector{Score})::Score
  • scores - Vector of Score objects

Return average values of scores as a Score with precision/recall/fmeasure.
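
Example

A small sketch, assuming the Vector{Score} comes from rouge_n (documented further below); argmax from the top of this reference works on the same vector:

refs = ["The quick brown fox jumps over the lazy dog.",
        "A quick brown fox jumped over a lazy dog."]
cand = "The quick brown fox jumped over the dog."
scores = rouge_n(refs, cand, 2)   # one Score per reference
average(scores)                   # averaged precision/recall/fmeasure
argmax(scores)                    # the Score with the highest f-measure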

source
TextAnalysis.bleu_scoreMethod
bleu_score(reference_corpus::Vector{Vector{Token}}, translation_corpus::Vector{Token}; max_order=4, smooth=false)

Compute the BLEU score of translated segments against one or more references.

Return the BLEU score, n-gram precisions, brevity penalty, geometric mean of n-gram precisions, translation_length, and reference_length.

Arguments

  • reference_corpus: List of lists of references for each translation. Each reference should be tokenized into a list of tokens.
  • translation_corpus: List of translations to score. Each translation should be tokenized into a list of tokens.
  • max_order: Maximum n-gram order to use when computing BLEU score.
  • smooth=false: Whether or not to apply Lin et al. 2004 smoothing.

Example:

one_doc_references = [
    ["apple", "is", "apple"],
    ["apple", "is", "a", "fruit"]
]  
one_doc_translation = [
    "apple", "is", "appl"
]
bleu_score([one_doc_references], [one_doc_translation], smooth=true)
source
TextAnalysis.coo_matrixMethod
coo_matrix(::Type{T}, doc::Vector{AbstractString}, vocab::OrderedDict{AbstractString, Int}, window::Int, normalize::Bool, mode::Symbol)

Basic low-level function that calculates the co-occurrence matrix of a document. Return a sparse co-occurrence matrix of size n × n, where n = length(vocab), with elements of type T. The document doc is represented by a vector of its terms (in order). The keywords window and normalize indicate the size of the sliding word window in which co-occurrences are counted and whether or not to normalize the counts by the distance between word positions. The mode keyword can be either :default or :directional and indicates whether the co-occurrence matrix should be directional or not. If mode is :directional, the co-occurrence matrix is an n × n matrix where n = length(vocab) and coom[i,j] is the number of times vocab[i] co-occurs with vocab[j] in the document doc. If mode is :default, coom[i,j] is twice the number of times vocab[i] co-occurs with vocab[j] in doc (once for each direction, from i to j and from j to i).

Example

julia> using TextAnalysis, DataStructures
       doc = StringDocument("This is a text about an apple. There are many texts about apples.")
       docv = TextAnalysis.tokenize(language(doc), text(doc))
       vocab = OrderedDict("This"=>1, "is"=>2, "apple."=>3)
       TextAnalysis.coo_matrix(Float16, docv, vocab, 5, true)

3×3 SparseArrays.SparseMatrixCSC{Float16,Int64} with 4 stored entries:
  [2, 1]  =  2.0
  [1, 2]  =  2.0
  [3, 2]  =  0.3999
  [2, 3]  =  0.3999

julia> using TextAnalysis, DataStructures
       doc = StringDocument("This is a text about an apple. There are many texts about apples.")
       docv = TextAnalysis.tokenize(language(doc), text(doc))
       vocab = OrderedDict("This"=>1, "is"=>2, "apple."=>3)
       TextAnalysis.coo_matrix(Float16, docv, vocab, 5, true, :directional)

3×3 SparseArrays.SparseMatrixCSC{Float16,Int64} with 4 stored entries:
  [2, 1]  =  1.0
  [1, 2]  =  1.0
  [3, 2]  =  0.1999
  [2, 3]  =  0.1999
source
TextAnalysis.coomMethod
coom(entity, eltype=Float [;window=5, normalize=true])

Access the co-occurrence matrix of the CooMatrix associated with the entity. The CooMatrix{T} will first be created in order for the actual matrix to be accessed.

source
TextAnalysis.cos_similarityMethod
function cos_similarity(tfm::AbstractMatrix)

cos_similarity calculates the cosine similarity from a term frequency matrix (typically the tf-idf matrix).

Example

crps = Corpus( StringDocument.([
    "to be or not to be",
    "to sing or not to sing",
    "to talk or to silence"]) )
update_lexicon!(crps)
d = dtm(crps)
tfm = tf_idf(d)
cs = cos_similarity(tfm)
Matrix(cs)
    # 3×3 Matrix{Float64}:
    #  1.0        0.0329318  0.0
    #  0.0329318  1.0        0.0
    #  0.0        0.0        1.0
source
TextAnalysis.counter2Method
counter2(
    data,
    min::Integer,
    max::Integer
) -> DataStructures.DefaultDict{SubString{String}, DataStructures.Accumulator{String, Int64}, DataStructures.Accumulator{SubString{String}, Int64}}

Create a conditional distribution counter, which is used by score functions to calculate conditional frequency distributions.

source
TextAnalysis.dtmMethod
dtm(crps::Corpus)
dtm(d::DocumentTermMatrix)
dtm(d::DocumentTermMatrix, density::Symbol)

Create a sparse matrix from a DocumentTermMatrix object.

Examples

julia> crps = Corpus([StringDocument("To be or not to be"),
                      StringDocument("To become or not to become")])

julia> update_lexicon!(crps)

julia> dtm(DocumentTermMatrix(crps))
2×6 SparseArrays.SparseMatrixCSC{Int64,Int64} with 10 stored entries:
  [1, 1]  =  1
  [2, 1]  =  1
  [1, 2]  =  2
  [2, 3]  =  2
  [1, 4]  =  1
  [2, 4]  =  1
  [1, 5]  =  1
  [2, 5]  =  1
  [1, 6]  =  1
  [2, 6]  =  1

julia> dtm(DocumentTermMatrix(crps), :dense)
2×6 Matrix{Int64}:
 1  2  0  1  1  1
 1  0  2  1  1  1
source
TextAnalysis.dtvMethod
dtv(d::AbstractDocument, lex::Dict{String, Int})

Produce a single row of a DocumentTermMatrix.

Individual documents do not have a lexicon associated with them, so a lexicon must be passed as an additional argument.

Examples

julia> dtv(crps[1], lexicon(crps))
1×6 Matrix{Int64}:
 1  2  0  1  1  1
source
TextAnalysis.entropyMethod
entropy(
    m::TextAnalysis.Langmodel,
    lm::DataStructures.DefaultDict,
    text_ngram::AbstractVector
) -> Float64

Calculate the cross-entropy of the model for a given evaluation text.

Input text must be a Vector of n-grams of the same length.
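
Example

A rough sketch, following the MLE language-model workflow shown later in this reference; the bigram strings below are an assumed input representation:

voc   = ["my", "name", "is", "salman", "khan", "and", "karan"]
train = ["khan", "is", "my", "good", "friend", "and", "He", "is", "my", "brother"]
model = MLE(voc)
fitted = model(train, 2, 2)                       # fit on bigrams
entropy(model, fitted, ["is my", "my brother"])   # cross-entropy over bigram strings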

source
TextAnalysis.everygramMethod
everygram(seq::Vector{T}; min_len::Int=1, max_len::Int=-1) where {T <: AbstractString}

Return all possible n-grams generated from a sequence of items, as a Vector{String}.

Example

julia> seq = ["To","be","or","not"]
julia> a = everygram(seq, min_len=1, max_len=-1)
 10-element Vector{Any}:
  "or"          
  "not"         
  "To"          
  "be"                  
  "or not" 
  "be or"       
  "be or not"   
  "To be or"    
  "To be or not"
source
TextAnalysis.extend!Method
extend!(model::NaiveBayesClassifier, dictElement)

Add the dictElement to the dictionary of the classifier model.

source
TextAnalysis.featuresMethod
features(
    fs::AbstractDict,
    dict::AbstractVector
) -> Vector{Int64}

Compute an array whose entries are the values from the input AbstractDict for each element of dict, in order.
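
Example

A minimal sketch, assuming a term-count dictionary and a fixed token list:

fs = Dict("the" => 2, "cat" => 1)
dict = ["the", "cat", "dog"]
features(fs, dict)   # [2, 1, 0]; tokens absent from fs are assumed to count as 0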

source
TextAnalysis.fit!Method
fit!(model::NaiveBayesClassifier, str, class)
fit!(model::NaiveBayesClassifier, ::Features, class)
fit!(model::NaiveBayesClassifier, ::StringDocument, class)

Fit the weights for the model on the input data.

source
TextAnalysis.fmeasure_lcsFunction
fmeasure_lcs(RLCS, PLCS, β=1.0)

Compute the F-measure based on the weighted longest common subsequence (WLCS).

Arguments

  • RLCS: Recall factor for LCS computation
  • PLCS: Precision factor for LCS computation
  • β: Beta parameter controlling precision vs recall balance (default: 1.0)

Returns

  • Real: F-measure score balancing precision and recall
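
Example

A sketch assuming the standard ROUGE F-measure formula, F = ((1 + β^2) * PLCS * RLCS) / (RLCS + β^2 * PLCS):

fmeasure_lcs(0.5, 0.4, 1.0)   # 2 * 0.4 * 0.5 / (0.5 + 0.4) ≈ 0.444 under that formula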
source
TextAnalysis.frequenciesMethod
frequencies(
    xs::AbstractArray{T, 1}
) -> Dict{_A, Int64} where _A

Create a dictionary that maps elements in input array to their frequencies.

source
TextAnalysis.frequent_termsFunction
frequent_terms(crps, alpha=0.95)

Return the frequent terms from crps, i.e. those occurring in more than a fraction alpha of the documents.

Example

julia> crps = Corpus([StringDocument("This is Document 1"),
                      StringDocument("This is Document 2")])
A Corpus with 2 documents:
 * 2 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> frequent_terms(crps)
3-element Vector{String}:
 "is"
 "This"
 "Document"

See also: remove_frequent_terms!, sparse_terms

source
TextAnalysis.get_ngramsMethod
get_ngrams(segment, max_order)

Extract all n-grams up to a given maximum order from an input segment.

Return a counter containing all n-grams up to max_order in the segment with a count of how many times each n-gram occurred.

Arguments

  • segment: Text segment from which n-grams will be extracted.
  • max_order: Maximum length in tokens of the n-grams returned by this method.
source
TextAnalysis.hash_dtmMethod
hash_dtm(crps::Corpus)
hash_dtm(crps::Corpus, h::TextHashFunction)

Represent a Corpus as a Matrix with N columns, where N is the cardinality of the hash function.
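
Examples

A short sketch, assuming a small corpus and a hash function of cardinality 10:

crps = Corpus([StringDocument("To be or not to be"),
               StringDocument("To become or not to become")])
h = TextHashFunction(10)
hash_dtm(crps, h)   # 2×10 matrix of hashed term counts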

source
TextAnalysis.hash_dtvMethod
hash_dtv(d::AbstractDocument)
hash_dtv(d::AbstractDocument, h::TextHashFunction)

Represent a document as a vector with N entries, where N is the cardinality of the hash function.

Examples

julia> crps = Corpus([StringDocument("To be or not to be"),
                      StringDocument("To become or not to become")])

julia> h = TextHashFunction(10)
TextHashFunction(hash, 10)

julia> hash_dtv(crps[1], h)
1×10 Matrix{Int64}:
 0  2  0  0  1  3  0  0  0  0

julia> hash_dtv(crps[1])
1×100 Matrix{Int64}:
 0  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  0  0
source
TextAnalysis.index_hashMethod
index_hash(str, TextHashFunc)

Map a string to an integer index using the hash trick.

Arguments

  • str: String to be hashed
  • TextHashFunc: TextHashFunction object containing hash configuration

Examples

julia> h = TextHashFunction(10)
TextHashFunction(hash, 10)

julia> index_hash("a", h)
8

julia> index_hash("b", h)
7
source
TextAnalysis.inverse_indexMethod
inverse_index(crps::Corpus)

Return the inverse index of a corpus.

If we are interested in a specific term, we often want to know which documents in a corpus contain that term. The inverse index provides this information and enables a simplistic search algorithm.
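
Example

A minimal sketch; it assumes the corpus index has been populated with update_inverse_index! first:

crps = Corpus([StringDocument("Name Foo"),
               StringDocument("Name Bar")])
update_inverse_index!(crps)
inverse_index(crps)   # maps each term to the positions of the documents containing it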

source
TextAnalysis.language!Method
language!(doc, lang::Language)

Set the language of doc to lang.

Example

julia> d = StringDocument("String Document 1")

julia> language!(d, Languages.Spanish())

julia> d.metadata.language
Languages.Spanish()

See also: language, languages, languages!

source
TextAnalysis.languages!Method
languages!(crps, langs::Vector{Language})
languages!(crps, lang::Language)

Update languages of documents in a Corpus.

If the input is a Vector, the language of the ith document is set to the ith element of the vector. In that case, the number of documents must equal the length of the vector.
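
Example

A brief sketch, assuming a two-document corpus:

crps = Corpus([StringDocument("Hello world"),
               StringDocument("Hola mundo")])
languages!(crps, [Languages.English(), Languages.Spanish()])
languages!(crps, Languages.English())   # or set one language for all documents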

See also: languages, language!, language

source
TextAnalysis.ldaMethod
ϕ, θ = lda(dtm::DocumentTermMatrix, ntopics::Int, iterations::Int, α::Float64, β::Float64; kwargs...)

Perform Latent Dirichlet allocation.

Arguments

  • dtm::DocumentTermMatrix: Document-term matrix containing the corpus
  • ntopics::Int: Number of topics to extract
  • iterations::Int: Number of Gibbs sampling iterations
  • α::Float64: Dirichlet distribution hyperparameter for topic distribution per document. α < 1 yields a sparse topic mixture, α > 1 yields a more uniform topic mixture
  • β::Float64: Dirichlet distribution hyperparameter for word distribution per topic. β < 1 yields a sparse word mixture, β > 1 yields a more uniform word mixture

Keyword Arguments

  • showprogress::Bool: Show a progress bar during Gibbs sampling (default: true)

Returns

  • ϕ: ntopics × nwords sparse matrix of word probabilities per topic
  • θ: ntopics × ndocs dense matrix of topic probabilities per document
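
Example

A minimal sketch with assumed hyperparameter values:

crps = Corpus([StringDocument("apple banana fruit salad"),
               StringDocument("dog cat animal pet")])
update_lexicon!(crps)
m = DocumentTermMatrix(crps)
ϕ, θ = lda(m, 2, 1000, 0.1, 0.1)   # 2 topics, 1000 Gibbs sampling iterations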
source
TextAnalysis.lexiconMethod
lexicon(crps::Corpus)

Return the lexicon of the corpus.

The lexicon of a corpus consists of all terms that occur in any document in the corpus.

Example

julia> crps = Corpus([StringDocument("Name Foo"),
                          StringDocument("Name Bar")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens

julia> lexicon(crps)
Dict{String,Int64} with 0 entries
source
TextAnalysis.logscoreMethod
logscore(
    m::TextAnalysis.Langmodel,
    temp_lm::DataStructures.DefaultDict,
    word,
    context
) -> Float64

Evaluate the log score of a word in a given context.

The arguments are the same as for score and maskedscore.

source
TextAnalysis.lookupMethod
lookup(
    voc::Vocabulary,
    word::AbstractArray{T<:AbstractString, 1}
) -> Vector

Look up a sequence of words in the vocabulary.

Return a vector of strings.
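
Example

A short sketch, reusing the vocabulary from the Vocabulary example later in this reference:

voc = Vocabulary(["a", "c", "-", "d", "c", "a", "b", "r", "a", "c", "d"], 2)
lookup(voc, ["a", "b", "d"])   # ["a", "<unk>", "d"]; "b" falls below the cutoff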

See Vocabulary

source
TextAnalysis.lsaMethod
lsa(dtm::DocumentTermMatrix)
lsa(crps::Corpus)

Perform Latent Semantic Analysis (LSA) on a corpus or document-term matrix.
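
Example

A minimal sketch, assuming a small corpus with an updated lexicon:

crps = Corpus([StringDocument("this is a text about apples"),
               StringDocument("this is a text about oranges")])
update_lexicon!(crps)
lsa(crps)   # decompose the corpus' term matrix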

source
TextAnalysis.maskedscoreMethod
maskedscore(
    m::TextAnalysis.Langmodel,
    temp_lm::DataStructures.DefaultDict,
    word,
    context
) -> Float64

Evaluate the score with masked out-of-vocabulary words.

The arguments are the same as for score.

source
TextAnalysis.ngramizeMethod
ngramize(lang, tokens, n)

Compute the n-grams of tokens of order n.

Example

julia> ngramize(Languages.English(), ["To", "be", "or", "not", "to"], 3)
Dict{AbstractString,Int64} with 3 entries:
  "be or not" => 1
  "or not to" => 1
  "To be or"  => 1
source
TextAnalysis.ngramizenewMethod
ngramizenew(words::Vector{T}, nlist::Integer...) where {T <: AbstractString}

Generate n-grams from a sequence of words.

Example

julia> seq=["To","be","or","not","To","not","To","not"]
julia> ngramizenew(seq, 2)
 7-element Vector{Any}:
  "To be" 
  "be or" 
  "or not"
  "not To"
  "To not"
  "not To"
  "To not"
source
TextAnalysis.ngramsMethod
ngrams(ngd::NGramDocument, n::Integer)
ngrams(d::AbstractDocument, n::Integer)
ngrams(d::NGramDocument)
ngrams(d::AbstractDocument)

Access the document text as n-gram counts.

Example

julia> sd = StringDocument("To be or not to be...")
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: To be or not to be...

julia> ngrams(sd)
 Dict{String,Int64} with 7 entries:
  "or"   => 1
  "not"  => 1
  "to"   => 1
  "To"   => 1
  "be"   => 1
  "be.." => 1
  "."    => 1
source
TextAnalysis.onegramizeMethod
onegramize(lang, tokens)

Create the unigrams dictionary for input tokens.

Example

julia> onegramize(Languages.English(), ["To", "be", "or", "not", "to", "be"])
Dict{String,Int64} with 5 entries:
  "or"  => 1
  "not" => 1
  "to"  => 1
  "To"  => 1
  "be"  => 2
source
TextAnalysis.padding_ngramMethod
padding_ngram(word::Vector{T}, n=1; pad_left=false, pad_right=false, left_pad_symbol="<s>", right_pad_symbol="</s>") where {T <: AbstractString}

Pad both left and right sides of a sentence and output n-grams of order n.

This function also pads the original input vector of strings.

Example

julia> example = ["1","2","3","4","5"]

julia> padding_ngram(example,2,pad_left=true,pad_right=true)
 6-element Vector{Any}:
  "<s> 1" 
  "1 2"   
  "2 3"   
  "3 4"   
  "4 5"   
  "5 </s>"
source
TextAnalysis.pagerankMethod
pagerank(A; n_iter=20, damping=0.15)

Compute PageRank scores for nodes in a graph using the power iteration method.

Arguments

  • A: Adjacency matrix representing the graph
  • n_iter: Number of iterations for convergence (default: 20)
  • damping: Damping factor for PageRank algorithm (default: 0.15)

Returns

  • Matrix{Float64}: PageRank scores for each node
source
TextAnalysis.perplexityMethod
perplexity(
    m::TextAnalysis.Langmodel,
    lm::DataStructures.DefaultDict,
    text_ngram::AbstractVector
) -> Float64

Calculate the perplexity of the given text.

This is simply 2^entropy for the text, so the arguments are the same as entropy.

source
TextAnalysis.predictMethod
predict(::NaiveBayesClassifier, str)
predict(::NaiveBayesClassifier, ::Features)
predict(::NaiveBayesClassifier, ::StringDocument)

Predict probabilities for each class on the input Features or String.

source
TextAnalysis.prepare!Method
prepare!(doc, flags)
prepare!(crps, flags)

Preprocess document or corpus based on the input flags.

List of Flags

  • strip_patterns
  • strip_corrupt_utf8
  • strip_case
  • stem_words
  • tag_part_of_speech
  • strip_whitespace
  • strip_punctuation
  • strip_numbers
  • strip_non_letters
  • strip_indefinite_articles
  • strip_definite_articles
  • strip_articles
  • strip_prepositions
  • strip_pronouns
  • strip_stopwords
  • strip_sparse_terms
  • strip_frequent_terms
  • strip_html_tags

Example

julia> doc = StringDocument("This is a document of mine")
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: This is a document of mine
julia> prepare!(doc, strip_pronouns | strip_articles)
julia> text(doc)
"This is   document of "
source
TextAnalysis.probFunction
prob(
    m::TextAnalysis.Langmodel,
    templ_lm::DataStructures.DefaultDict,
    word
) -> Float64
prob(
    m::TextAnalysis.Langmodel,
    templ_lm::DataStructures.DefaultDict,
    word,
    context
) -> Float64

Get the probability of a word given its context.

In other words, for a given context, calculate the frequency distribution of words.

source
TextAnalysis.prune!Method
prune!(dtm::DocumentTermMatrix{T}, document_positions; compact::Bool=true, retain_terms::Union{Nothing,Vector{T}}=nothing) where {T}

Delete documents specified by document_positions from a document term matrix. Optionally compact the matrix by removing unreferenced terms.
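
Example

A rough sketch, assuming a two-document matrix from which the first document is dropped:

crps = Corpus([StringDocument("To be or not to be"),
               StringDocument("To become or not to become")])
update_lexicon!(crps)
m = DocumentTermMatrix(crps)
prune!(m, [1])   # remove document 1; with compact=true, terms left unreferenced are dropped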

source
TextAnalysis.remove_case!Method
remove_case!(doc)
remove_case!(crps)

Convert the text of doc or crps to lowercase. Does not support FileDocument or crps containing FileDocument.

Example

julia> str = "The quick brown fox jumps over the lazy dog"
julia> sd = StringDocument(str)
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: The quick brown fox jumps over the lazy dog
julia> remove_case!(sd)
julia> sd.text
"the quick brown fox jumps over the lazy dog"

See also: remove_case

source
TextAnalysis.remove_frequent_terms!Function
remove_frequent_terms!(crps, alpha=0.95)

Remove frequent terms from crps, i.e. those occurring in more than a fraction alpha of the documents.

Example

julia> crps = Corpus([StringDocument("This is Document 1"),
                      StringDocument("This is Document 2")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> remove_frequent_terms!(crps)
julia> text(crps[1])
"     1"
julia> text(crps[2])
"     2"

See also: remove_sparse_terms!, frequent_terms

source
TextAnalysis.remove_html_tags!Method
remove_html_tags!(doc::StringDocument)
remove_html_tags!(crps)

Remove HTML tags from a StringDocument or from the documents in crps. Does not work for document types other than StringDocument.

Example

julia> html_doc = StringDocument(
             "
               <html>
                   <head><script language=\"javascript\">x = 20;</script></head>
                   <body>
                       <h1>Hello</h1><a href=\"world\">world</a>
                   </body>
               </html>
             "
            )
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet:  <html> <head><s
julia> remove_html_tags!(html_doc)
julia> strip(text(html_doc))
"Hello world"

See also: remove_html_tags

source
TextAnalysis.remove_patterns!Method
remove_patterns!(doc, rex::Regex)
remove_patterns!(crps, rex::Regex)

Remove patterns matched by rex in document or Corpus. Does not modify FileDocument or Corpus containing FileDocument. See also: remove_patterns
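
Example

A minimal sketch removing digit runs from a document:

sd = StringDocument("Call 555 0123 now")
remove_patterns!(sd, r"[0-9]+")
text(sd)   # the digit runs are removed from the text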

source
TextAnalysis.remove_sparse_terms!Function
remove_sparse_terms!(crps, alpha=0.05)

Remove sparse terms from crps, i.e. those occurring in less than a fraction alpha of the documents.

Example

julia> crps = Corpus([StringDocument("This is Document 1"),
                      StringDocument("This is Document 2")])
A Corpus with 2 documents:
 * 2 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> remove_sparse_terms!(crps, 0.5)
julia> crps[1].text
"This is Document "
julia> crps[2].text
"This is Document "

See also: remove_frequent_terms!, sparse_terms

source
TextAnalysis.remove_whitespace!Method
remove_whitespace!(doc)
remove_whitespace!(crps)

Collapse runs of whitespace into a single space and remove all leading and trailing whitespace in a document or corpus. Is a no-op for FileDocument, TokenDocument, or NGramDocument. See also: remove_whitespace
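
Example

A small sketch, assuming a StringDocument with irregular spacing:

sd = StringDocument("  To be   or not  to be ")
remove_whitespace!(sd)
text(sd)   # "To be or not to be" (runs of spaces collapsed, ends trimmed)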

source
TextAnalysis.remove_words!Method
remove_words!(doc, words::Vector{AbstractString})
remove_words!(crps, words::Vector{AbstractString})

Remove the occurrences of words from doc or crps.

Example

julia> str="The quick brown fox jumps over the lazy dog"
julia> sd=StringDocument(str);
julia> remove_words = ["fox", "over"]
julia> remove_words!(sd, remove_words)
julia> sd.text
"the quick brown   jumps   the lazy dog"
source
TextAnalysis.rouge_l_sentenceFunction
rouge_l_sentence(
    references::Vector{<:AbstractString}, candidate::AbstractString, β=8;
    weighted=false, weight_func=sqrt,
    lang=Languages.English()
)::Vector{Score}

Calculate the ROUGE-L score between references and candidate at sentence level.

Return a vector of Score objects.

See Rouge: A package for automatic evaluation of summaries

Note

The weighted argument enables weighting of values when calculating the longest common subsequence. The original ROUGE-1.5.5.pl implementation uses a power function for this weighting; here weight_func defaults to sqrt, i.e. a power of 0.5.

See also: rouge_n, rouge_l_summary

source
TextAnalysis.rouge_nMethod
rouge_n(
    references::Vector{<:AbstractString}, 
    candidate::AbstractString, 
    n::Int; 
    lang::Language
)::Vector{Score}

Compute n-gram recall between the candidate and the reference summaries.

Arguments

  • references::Vector{T} where T<: AbstractString - List of reference summaries
  • candidate::AbstractString - Input candidate summary to be scored against reference summaries
  • n::Integer - Order of n-grams
  • lang::Language - Language of the text, useful while generating n-grams (default: Languages.English())

Return a vector of Score objects.

See Rouge: A package for automatic evaluation of summaries

See also: rouge_l_sentence, rouge_l_summary

source
TextAnalysis.scoreFunction
score(m::MLE, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)

Compute the probability of a word given its context using MLE (Maximum Likelihood Estimation).

source
TextAnalysis.scoreFunction
score(m::InterpolatedLanguageModel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)

Compute the probability of a word given its context in an interpolated language model.

Applies Kneser-Ney or Witten-Bell smoothing, depending on the sub-type.

source
TextAnalysis.scoreMethod
score(m::gammamodel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)

Compute the probability of a word given its context using additive smoothing.

Applies additive (add-gamma) smoothing as used by the Lidstone and Laplace (gammamodel) models; Laplace corresponds to gamma = 1 (add-one smoothing).

source
TextAnalysis.sentence_tokenizeMethod
sentence_tokenize(lang, s)

Split string into individual sentences.

Arguments

  • lang: Language for sentence boundary detection rules
  • s: String to split into sentences

Returns

  • Vector{SubString{String}}: Array of sentences extracted from the string

Example

julia> sentence_tokenize(Languages.English(), "Here are few words! I am Foo Bar.")
2-element Vector{SubString{String}}:
 "Here are few words!"
 "I am Foo Bar."

See also: tokenize

source
TextAnalysis.sparse_termsFunction
sparse_terms(crps, alpha=0.05)

Return the sparse terms from crps, i.e. those occurring in less than a fraction alpha of the documents.

Example

julia> crps = Corpus([StringDocument("This is Document 1"),
                      StringDocument("This is Document 2")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> sparse_terms(crps, 0.5)
2-element Vector{String}:
 "1"
 "2"

See also: remove_sparse_terms!, frequent_terms

source
TextAnalysis.standardize!Method
standardize!(crps::Corpus, ::Type{T}) where T <: AbstractDocument

Standardize the documents in a Corpus to a common type.

Example

julia> crps = Corpus([StringDocument("Document 1"),
		              TokenDocument("Document 2"),
		              NGramDocument("Document 3")])
A Corpus with 3 documents:
 * 1 StringDocument's
 * 0 FileDocument's
 * 1 TokenDocument's
 * 1 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens


julia> standardize!(crps, NGramDocument)

# After this step, you can check that the corpus only contains NGramDocument's:

julia> crps
A Corpus with 3 documents:
 * 0 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 3 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
source
TextAnalysis.stem!Method
stem!(doc)
stem!(crps)

Apply stemming to the document or documents in crps using an appropriate stemmer.

Does not support FileDocument or Corpus containing FileDocument.

Arguments

  • doc: Document to apply stemming to
  • crps: Corpus containing documents to apply stemming to
source
TextAnalysis.stem!Method
stem!(crps::Corpus)

Apply stemming to an entire corpus. Assumes all documents in the corpus have the same language (determined from the first document).

Arguments

  • crps: Corpus containing documents to apply stemming to
source
TextAnalysis.summarizeMethod
summarize(doc; ns=5)

Generate a summary of the document and return the top ns sentences.

Arguments

  • doc: Document of type StringDocument, FileDocument, or TokenDocument
  • ns: Number of sentences in the summary (default: 5)

Returns

  • Vector{SubString{String}}: Array of the most relevant sentences

Example

julia> s = StringDocument("Assume this Short Document as an example. Assume this as an example summarizer. This has too foo sentences.")

julia> summarize(s, ns=2)
2-element Vector{SubString{String}}:
 "Assume this Short Document as an example."
 "This has too foo sentences."
source
TextAnalysis.tag_scheme!Method
tag_scheme!(tags, current_scheme::String, new_scheme::String)

Convert tags from one tagging scheme to another in-place.

Arguments

  • tags: Vector of tags to convert
  • current_scheme: Name of the current tagging scheme
  • new_scheme: Name of the target tagging scheme

Supported Schemes

  • BIO1 (BIO)
  • BIO2
  • BIOES

Example

julia> tags = ["I-LOC", "O", "I-PER", "B-MISC", "I-MISC", "B-PER", "I-PER", "I-PER"]

julia> tag_scheme!(tags, "BIO1", "BIOES")

julia> tags
8-element Vector{String}:
 "S-LOC"
 "O"
 "S-PER"
 "B-MISC"
 "E-MISC"
 "B-PER"
 "I-PER"
 "E-PER"
source
TextAnalysis.textMethod
text(fd::FileDocument)
text(sd::StringDocument)
text(ngd::NGramDocument)

Access the text of Document as a string.

Example

julia> sd = StringDocument("To be or not to be...")
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: To be or not to be...

julia> text(sd)
"To be or not to be..."
source
TextAnalysis.tf!Method
tf!(dtm::SparseMatrixCSC{Real}, tf::SparseMatrixCSC{AbstractFloat})

Compute term frequency for sparse matrices and store result in tf.

Arguments

  • dtm: Sparse document-term matrix containing term counts
  • tf: Output sparse matrix for term frequency values (modified in-place)

Notes

The tf matrix should have the same nonzero pattern as dtm.

See also: tf, tf_idf, tf_idf!

source
TextAnalysis.tf!Method
tf!(dtm::AbstractMatrix{Real}, tf::AbstractMatrix{AbstractFloat})

Compute term frequency and store result in tf matrix.

Arguments

  • dtm: Document-term matrix containing term counts
  • tf: Output matrix for term frequency values (modified in-place)

Notes

Works correctly when dtm and tf are the same matrix.

See also: tf, tf_idf, tf_idf!

source
TextAnalysis.tfMethod
tf(dtm::DocumentTermMatrix)
tf(dtm::SparseMatrixCSC{Real})
tf(dtm::Matrix{Real})

Compute term frequency for the document-term matrix.

Arguments

  • dtm: Document-term matrix (DocumentTermMatrix, sparse matrix, or dense matrix)

Returns

  • Matrix{Float64} or SparseMatrixCSC{Float64}: Term frequency matrix

Example

julia> crps = Corpus([StringDocument("To be or not to be"),
              StringDocument("To become or not to become")])

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)

julia> tf(m)
2×6 SparseArrays.SparseMatrixCSC{Float64,Int64} with 10 stored entries:
  [1, 1]  =  0.166667
  [2, 1]  =  0.166667
  [1, 2]  =  0.333333
  [2, 3]  =  0.333333
  [1, 4]  =  0.166667
  [2, 4]  =  0.166667
  [1, 5]  =  0.166667
  [2, 5]  =  0.166667
  [1, 6]  =  0.166667
  [2, 6]  =  0.166667

See also: tf!, tf_idf, tf_idf!

source
TextAnalysis.tf_idf!Method
tf_idf!(dtm)

Compute TF-IDF values for document-term matrix in-place.

Arguments

  • dtm: Document-term matrix to transform (modified in-place)
source
TextAnalysis.tf_idf!Method
tf_idf!(dtm::SparseMatrixCSC{Real}, tfidf::SparseMatrixCSC{AbstractFloat})

Overwrite tfidf with the tf-idf (Term Frequency - Inverse Doc Frequency) of the dtm.

The arguments must have the same number of nonzeros.

See also: tf, tf_idf, tf_idf!

source
TextAnalysis.tf_idf!Method
tf_idf!(dtm::AbstractMatrix{Real}, tf_idf::AbstractMatrix{AbstractFloat})

Compute TF-IDF (Term Frequency-Inverse Document Frequency) and store result in tf_idf matrix.

Arguments

  • dtm: Document-term matrix containing term counts
  • tf_idf: Output matrix for TF-IDF values (modified in-place)

Notes

The matrices dtm and tf_idf must have the same dimensions.

See also: tf, tf!, tf_idf

source
TextAnalysis.tf_idfMethod
tf_idf(dtm::DocumentTermMatrix)
tf_idf(dtm::SparseMatrixCSC{Real})
tf_idf(dtm::Matrix{Real})

Compute TF-IDF (Term Frequency-Inverse Document Frequency) values for the document-term matrix.

Arguments

  • dtm: Document-term matrix (DocumentTermMatrix, sparse matrix, or dense matrix)

Returns

  • Matrix{Float64} or SparseMatrixCSC{Float64}: TF-IDF weighted matrix

Notes

TF-IDF addresses issues with raw word counts:

  • Some documents are longer than other documents
  • Some words are more frequent than other words

Example

julia> crps = Corpus([StringDocument("To be or not to be"),
              StringDocument("To become or not to become")])

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)

julia> tf_idf(m)
2×6 SparseArrays.SparseMatrixCSC{Float64,Int64} with 10 stored entries:
  [1, 1]  =  0.0
  [2, 1]  =  0.0
  [1, 2]  =  0.231049
  [2, 3]  =  0.231049
  [1, 4]  =  0.0
  [2, 4]  =  0.0
  [1, 5]  =  0.0
  [2, 5]  =  0.0
  [1, 6]  =  0.0
  [2, 6]  =  0.0

See also: tf, tf!, tf_idf!

source
TextAnalysis.titles!Method
titles!(crps, vec::Vector{String})
titles!(crps, str)

Update titles of the documents in a Corpus.

If the input is a String, set the same title for all documents. If the input is a vector, set the title of the ith document to the corresponding ith element in the vector vec. In the latter case, the number of documents must equal the length of the vector.

See also: titles, title!, title

source
TextAnalysis.tokenizeMethod
tokenize(lang, s)

Split string into words and other tokens such as punctuation.

Arguments

  • lang: Language for tokenization rules
  • s: String to tokenize

Returns

  • Vector{String}: Array of tokens extracted from the string

Example

julia> tokenize(Languages.English(), "Too foo words!")
4-element Vector{String}:
 "Too"
 "foo"
 "words"
 "!"

See also: sentence_tokenize

source
TextAnalysis.tokensMethod
tokens(d::TokenDocument)
tokens(d::(Union{FileDocument, StringDocument}))

Access the document text as a token array.

Example

julia> sd = StringDocument("To be or not to be...")
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: To be or not to be...

julia> tokens(sd)
7-element Vector{String}:
    "To"
    "be"
    "or"
    "not"
    "to"
    "be.."
    "."
source
TextAnalysis.weighted_lcsFunction
weighted_lcs(X, Y, weighted=true, f=sqrt)

Compute the Weighted Longest Common Subsequence of X and Y.

Arguments

  • X: First sequence
  • Y: Second sequence
  • weighted: Whether to use weighted computation (default: true)
  • f: Weighting function (default: sqrt)

Returns

  • Float32: Length of the weighted longest common subsequence
source
TextAnalysis.weighted_lcs_tokensFunction
weighted_lcs_tokens(X, Y, weighted=true, f=sqrt)

Compute the tokens of the Weighted Longest Common Subsequence of X and Y.

Arguments

  • X: First sequence
  • Y: Second sequence
  • weighted: Whether to use weighted computation (default: true)
  • f: Weighting function (default: sqrt)

Returns

  • Vector{String}: Array of tokens in the longest common subsequence
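
Example

A sketch under the assumption that X and Y are token vectors:

X = ["the", "cat", "sat", "on", "the", "mat"]
Y = ["the", "cat", "on", "the", "mat"]
weighted_lcs(X, Y)          # weighted length of the longest common subsequence
weighted_lcs_tokens(X, Y)   # the matched tokens themselves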
source
TextAnalysis.CooMatrixType

Basic Co-occurrence Matrix (COOM) type.

Fields

  • coom::SparseMatrixCSC{T,Int}: The actual COOM; elements represent co-occurrences of two terms within a given window.
  • terms::Vector{String}: A list of terms that represent the lexicon of the document or corpus.
  • column_indices::OrderedDict{String, Int}: A map between the terms and the columns of the co-occurrence matrix.
source
TextAnalysis.CooMatrixMethod
CooMatrix{T}(crps::Corpus [,terms] [;window=5, normalize=true])

Auxiliary constructors of the CooMatrix type. The type T must be a subtype of AbstractFloat.

The constructors require a corpus crps and a terms structure representing the lexicon of the corpus. The latter can be a Vector{String}, an AbstractDict where the keys are the lexicon, or can be omitted, in which case the lexicon field of the corpus is used.
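
Example

A minimal sketch, assuming the corpus lexicon is used as the terms structure:

crps = Corpus([StringDocument("this is a text about an apple")])
update_lexicon!(crps)
C = CooMatrix{Float64}(crps; window=3, normalize=false)
coom(C)   # the underlying sparse co-occurrence matrix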

source
TextAnalysis.CorpusMethod
Corpus(docs::Vector{T}) where {T <: AbstractDocument}

Collections of documents are represented using the Corpus type.

Example

julia> crps = Corpus([StringDocument("Document 1"),
		              StringDocument("Document 2")])
A Corpus with 2 documents:
 * 2 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
source
TextAnalysis.DocumentMetadataType
DocumentMetadata(
    language::Language,
    title::String,
    author::String,
    timestamp::String,
    custom::Any
)

Store basic metadata about a document.

Arguments

  • language: Language of the document (default: Languages.English())
  • title: Title of the document (default: "Untitled Document")
  • author: Author of the document (default: "Unknown Author")
  • timestamp: Timestamp when the document was written (default: "Unknown Time")
  • custom: User-specific data field (default: nothing)
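
Example

A brief sketch constructing metadata by hand, with assumed field values:

md = DocumentMetadata(Languages.English(), "My Title", "Jane Doe", "2024-01-01", nothing)
md.title    # "My Title"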
source
TextAnalysis.DocumentTermMatrixMethod
DocumentTermMatrix(crps::Corpus)
DocumentTermMatrix(crps::Corpus, terms::Vector{String})
DocumentTermMatrix(crps::Corpus, lex::AbstractDict)
DocumentTermMatrix(dtm::SparseMatrixCSC{Int, Int}, terms::Vector{String})

Represent documents as a matrix of word counts.

This representation allows linear algebra operations and statistical techniques to be applied. The lexicon must be updated before use.

Examples

julia> crps = Corpus([StringDocument("To be or not to be"),
                      StringDocument("To become or not to become")])

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)
A 2 X 6 DocumentTermMatrix

julia> m.dtm
2×6 SparseArrays.SparseMatrixCSC{Int64,Int64} with 10 stored entries:
  [1, 1]  =  1
  [2, 1]  =  1
  [1, 2]  =  2
  [2, 3]  =  2
  [1, 4]  =  1
  [2, 4]  =  1
  [1, 5]  =  1
  [2, 5]  =  1
  [1, 6]  =  1
  [2, 6]  =  1
source
TextAnalysis.FileDocumentMethod
FileDocument(pathname::AbstractString)

Represent a document using a plain text file on disk.

Example

julia> pathname = "/usr/share/dict/words"
"/usr/share/dict/words"

julia> fd = FileDocument(pathname)
A FileDocument
 * Language: Languages.English()
 * Title: /usr/share/dict/words
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: A A's AMD AMD's AOL AOL's Aachen Aachen's Aaliyah
source
TextAnalysis.KneserNeyInterpolatedMethod
KneserNeyInterpolated(word::Vector{T}, discount::Float64, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Initialize a type for providing a Kneser-Ney interpolated language model.

The idea to abstract this comes from Chen & Goodman 1995.

source
TextAnalysis.LaplaceType
Laplace(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Initialize a Laplace type for providing Laplace-smoothed scores.

In addition to initialization arguments from the base n-gram model, this uses a smoothing parameter gamma = 1.

source
TextAnalysis.LidstoneMethod
Lidstone(word::Vector{T}, gamma::Float64, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Initialize a Lidstone type for providing Lidstone-smoothed scores.

In addition to initialization arguments from the base n-gram model, this also requires a number by which to increase the counts (gamma).

source
TextAnalysis.MLEMethod
MLE(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Initialize a type for providing MLE n-gram model scores.

Implementation of the base n-gram model using Maximum Likelihood Estimation.
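
Example

A sketch following the package's language-model workflow; the vocabulary and training tokens below are assumed:

voc   = ["my", "name", "is", "salman", "khan", "and", "karan"]
train = ["khan", "is", "my", "good", "friend", "and", "He", "is", "my", "brother"]
model = MLE(voc)
fitted = model(train, 2, 2)          # conditional bigram counts
score(model, fitted, "is", "khan")   # P("is" | "khan") under MLE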

source
TextAnalysis.NGramDocumentMethod
NGramDocument(txt::AbstractString, n::Integer=1)
NGramDocument(txt::AbstractString, dm::DocumentMetadata, n::Integer=1)
NGramDocument(ng::Dict{T, Int}, n::Integer=1) where T <: AbstractString

Represent a document as a bag of n-grams, which are UTF8 n-grams that map to counts.

Example

julia> my_ngrams = Dict{String, Int}("To" => 1, "be" => 2,
                                     "or" => 1, "not" => 1,
                                     "to" => 1, "be..." => 1)
Dict{String,Int64} with 6 entries:
  "or"    => 1
  "be..." => 1
  "not"   => 1
  "to"    => 1
  "To"    => 1
  "be"    => 2

julia> ngd = NGramDocument(my_ngrams)
A NGramDocument{AbstractString}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: ***SAMPLE TEXT NOT AVAILABLE***
source
TextAnalysis.NaiveBayesClassifierMethod
NaiveBayesClassifier([dict, ]classes)

A Naive Bayes Classifier for classifying documents.

Arguments

  • classes: Array of possible classes that the data could belong to
  • dict: (Optional) Array of possible tokens (words). This is automatically updated if a new token is detected during training or prediction

Example

julia> using TextAnalysis: NaiveBayesClassifier, fit!, predict

julia> m = NaiveBayesClassifier([:spam, :non_spam])
NaiveBayesClassifier{Symbol}(String[], [:spam, :non_spam], Matrix{Int64}(undef, 0, 2))

julia> fit!(m, "this is spam", :spam)
NaiveBayesClassifier{Symbol}(["this", "is", "spam"], [:spam, :non_spam], [2 1; 2 1; 2 1])

julia> fit!(m, "this is not spam", :non_spam)
NaiveBayesClassifier{Symbol}(["this", "is", "spam", "not"], [:spam, :non_spam], [2 2; 2 2; 2 2; 1 2])

julia> predict(m, "is this a spam")
Dict{Symbol, Float64} with 2 entries:
  :spam     => 0.59883
  :non_spam => 0.40117
source
TextAnalysis.ScoreMethod
Score(
    precision::AbstractFloat,
    recall::AbstractFloat,
    fmeasure::AbstractFloat
) -> Score

Store the result of an evaluation.

source
TextAnalysis.StringDocumentMethod
StringDocument(txt::AbstractString)

Represent a document using a UTF8 String stored in RAM.

Example

julia> str = "To be or not to be..."
"To be or not to be..."

julia> sd = StringDocument(str)
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: To be or not to be...
source
TextAnalysis.TextHashFunctionMethod
TextHashFunction(cardinality)
TextHashFunction(hash_function, cardinality)

The need to create a lexicon before constructing a document term matrix is often prohibitive. This implementation employs the "Hash Trick" technique, which replaces terms with their hashed values using a hash function that outputs integers from 1 to N.

Arguments

  • cardinality: Maximum index used for hashing (default: 100)
  • hash_function: Function used for hashing process (default: built-in hash function)

Examples

julia> h = TextHashFunction(10)
TextHashFunction(hash, 10)
source
TextAnalysis.TokenDocumentMethod
TokenDocument(txt::AbstractString)
TokenDocument(txt::AbstractString, dm::DocumentMetadata)
TokenDocument(tkns::Vector{T}) where T <: AbstractString

Represent a document as a sequence of UTF8 tokens.

Example

julia> my_tokens = String["To", "be", "or", "not", "to", "be..."]
6-element Vector{String}:
    "To"
    "be"
    "or"
    "not"
    "to"
    "be..."

julia> td = TokenDocument(my_tokens)
A TokenDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: ***SAMPLE TEXT NOT AVAILABLE***
source
TextAnalysis.VocabularyType
Vocabulary(word, unk_cutoff=1, unk_label="<unk>")

Store language model vocabulary.

Satisfies two common language modeling requirements for a vocabulary:

  • When checking membership and calculating its size, filters items by comparing their counts to a cutoff value.
  • Adds a special "unknown" token which unseen words are mapped to.

Example

julia> words = ["a", "c", "-", "d", "c", "a", "b", "r", "a", "c", "d"]
julia> vocabulary = Vocabulary(words, 2) 
  Vocabulary(Dict("<unk>"=>1,"c"=>3,"a"=>3,"d"=>2), 2, "<unk>") 

julia> vocabulary.vocab
  Dict{String,Int64} with 4 entries:
   "<unk>" => 1
   "c"     => 3
   "a"     => 3
   "d"     => 2

Tokens with counts greater than or equal to the cutoff value will
be considered part of the vocabulary.
julia> vocabulary.vocab["c"]
 3

julia> "c" in keys(vocabulary.vocab)
 true

julia> vocabulary.vocab["d"]
 2

julia> "d" in keys(vocabulary.vocab)
 true

Tokens with frequency counts less than the cutoff value will be considered not
part of the vocabulary even though their entries in the count dictionary are
preserved.
julia> "b" in keys(vocabulary.vocab)
 false

julia> "<unk>" in keys(vocabulary.vocab)
 true

We can look up words in a vocabulary using its `lookup` method.
"Unseen" words (with counts less than cutoff) are looked up as the unknown label.
If given one word (a string) as an input, this method will return a string.
julia> lookup(vocabulary, "a")
 "a"

julia> word = ["a", "-", "d", "c", "a"]

julia> lookup(vocabulary, word)
 5-element Vector{Any}:
  "a"    
  "<unk>"
  "d"    
  "c"    
  "a"

If given a sequence, it will return a `Vector{Any}` of the looked up words as shown above.
   
It's possible to update the counts after the vocabulary has been created.
julia> update(vocabulary,["b","c","c"])
 1

julia> vocabulary.vocab["b"]
 1
source
TextAnalysis.VocabularyMethod
Vocabulary(word::Array{T<:AbstractString, 1}) -> Vocabulary
Vocabulary(
    word::Array{T<:AbstractString, 1},
    unk_cutoff
) -> Vocabulary
Vocabulary(
    word::Array{T<:AbstractString, 1},
    unk_cutoff,
    unk_label
) -> Vocabulary
source
TextAnalysis.WittenBellInterpolatedMethod
WittenBellInterpolated(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Initialize a type for providing an interpolated version of Witten-Bell smoothing.

The idea to abstract this comes from Chen & Goodman 1995.

source