API References
Base.argmax — Method
Base.merge! — Method
merge!(dtm1::DocumentTermMatrix{T}, dtm2::DocumentTermMatrix{T}) where {T}
Merge one DocumentTermMatrix instance into another. Documents are appended to the end and terms are re-sorted. For efficiency, this may result in modifications to dtm2 as well.
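Example
A minimal usage sketch (the two corpora here are illustrative, not part of the original docstring):
julia> crps1 = Corpus([StringDocument("one two")]); update_lexicon!(crps1);
julia> crps2 = Corpus([StringDocument("two three")]); update_lexicon!(crps2);
julia> m = merge!(DocumentTermMatrix(crps1), DocumentTermMatrix(crps2))
julia> m.dtm  # 2 documents over the merged, re-sorted term list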
TextAnalysis.DirectoryCorpus — Method
TextAnalysis.author! — Method
TextAnalysis.author — Method
TextAnalysis.authors! — Method
TextAnalysis.authors — Method
TextAnalysis.average — Method
TextAnalysis.bleu_score — Method
bleu_score(reference_corpus::Vector{Vector{Token}}, translation_corpus::Vector{Token}; max_order=4, smooth=false)
Compute the BLEU score of translated segments against one or more references.
Return the BLEU score, n-gram precisions, brevity penalty, geometric mean of n-gram precisions, translation_length, and reference_length.
Arguments
- reference_corpus: List of lists of references for each translation. Each reference should be tokenized into a list of tokens.
- translation_corpus: List of translations to score. Each translation should be tokenized into a list of tokens.
- max_order: Maximum n-gram order to use when computing the BLEU score.
- smooth=false: Whether or not to apply Lin et al. 2004 smoothing.
Example:
one_doc_references = [
["apple", "is", "apple"],
["apple", "is", "a", "fruit"]
]
one_doc_translation = [
"apple", "is", "appl"
]
bleu_score([one_doc_references], [one_doc_translation], smooth=true)
TextAnalysis.columnindices — Method
columnindices(terms::Vector{String})
Create a column index lookup dictionary from a vector of terms.
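Example
A hedged sketch of the expected mapping (assuming each term maps to its position in the vector):
julia> columnindices(["apple", "banana", "cherry"])  # expected: Dict("apple" => 1, "banana" => 2, "cherry" => 3)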
TextAnalysis.coo_matrix — Method
coo_matrix(::Type{T}, doc::Vector{AbstractString}, vocab::OrderedDict{AbstractString, Int}, window::Int, normalize::Bool, mode::Symbol)
Basic low-level function that calculates the co-occurrence matrix of a document. Return a sparse co-occurrence matrix sized n × n, where n = length(vocab), with elements of type T. The document doc is represented by a vector of its terms (in order). The keywords window and normalize indicate the size of the sliding word window in which co-occurrences are counted and whether or not to normalize the counts by the distance between word positions. The mode keyword can be either :default or :directional and indicates whether the co-occurrence matrix should be directional or not. If mode is :directional, coom[i,j] will be the number of times vocab[i] co-occurs with vocab[j] in the document doc. If mode is :default, coom[i,j] will be twice that number (once for each direction, from i to j plus from j to i).
Example
julia> using TextAnalysis, DataStructures
doc = StringDocument("This is a text about an apple. There are many texts about apples.")
docv = TextAnalysis.tokenize(language(doc), text(doc))
vocab = OrderedDict("This"=>1, "is"=>2, "apple."=>3)
TextAnalysis.coo_matrix(Float16, docv, vocab, 5, true)
3×3 SparseArrays.SparseMatrixCSC{Float16,Int64} with 4 stored entries:
[2, 1] = 2.0
[1, 2] = 2.0
[3, 2] = 0.3999
[2, 3] = 0.3999
julia> using TextAnalysis, DataStructures
doc = StringDocument("This is a text about an apple. There are many texts about apples.")
docv = TextAnalysis.tokenize(language(doc), text(doc))
vocab = OrderedDict("This"=>1, "is"=>2, "apple."=>3)
TextAnalysis.coo_matrix(Float16, docv, vocab, 5, true, :directional)
3×3 SparseArrays.SparseMatrixCSC{Float16,Int64} with 4 stored entries:
[2, 1] = 1.0
[1, 2] = 1.0
[3, 2] = 0.1999
[2, 3] = 0.1999
TextAnalysis.coom — Method
coom(entity, eltype=Float [;window=5, normalize=true])
Access the co-occurrence matrix of the CooMatrix associated with the entity. The CooMatrix{T} will first be created in order for the actual matrix to be accessed.
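Example
A minimal sketch (assuming the corpus lexicon has been updated so the CooMatrix can derive its terms):
julia> crps = Corpus([StringDocument("this is a text about apples")])
julia> update_lexicon!(crps)
julia> C = CooMatrix{Float64}(crps, window=2)
julia> coom(C)  # sparse matrix of within-window co-occurrence counts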
TextAnalysis.cos_similarity — Method
cos_similarity(tfm::AbstractMatrix)
cos_similarity calculates the cosine similarity from a term frequency matrix (typically the tf-idf matrix).
Example
crps = Corpus( StringDocument.([
"to be or not to be",
"to sing or not to sing",
"to talk or to silence"]) )
update_lexicon!(crps)
d = dtm(crps)
tfm = tf_idf(d)
cs = cos_similarity(tfm)
Matrix(cs)
# 3×3 Matrix{Float64}:
# 1.0 0.0329318 0.0
# 0.0329318 1.0 0.0
# 0.0 0.0 1.0
TextAnalysis.counter2 — Method
counter2(
data,
min::Integer,
max::Integer
) -> DataStructures.DefaultDict{SubString{String}, DataStructures.Accumulator{String, Int64}, DataStructures.Accumulator{SubString{String}, Int64}}
Create a conditional distribution counter, which is used by score functions to calculate conditional frequency distributions.
TextAnalysis.dtm — Method
dtm(crps::Corpus)
dtm(d::DocumentTermMatrix)
dtm(d::DocumentTermMatrix, density::Symbol)
Create a sparse matrix from a DocumentTermMatrix object.
Examples
julia> crps = Corpus([StringDocument("To be or not to be"),
StringDocument("To become or not to become")])
julia> update_lexicon!(crps)
julia> dtm(DocumentTermMatrix(crps))
2×6 SparseArrays.SparseMatrixCSC{Int64,Int64} with 10 stored entries:
[1, 1] = 1
[2, 1] = 1
[1, 2] = 2
[2, 3] = 2
[1, 4] = 1
[2, 4] = 1
[1, 5] = 1
[2, 5] = 1
[1, 6] = 1
[2, 6] = 1
julia> dtm(DocumentTermMatrix(crps), :dense)
2×6 Matrix{Int64}:
1 2 0 1 1 1
1 0 2 1 1 1
TextAnalysis.dtv — Method
dtv(d::AbstractDocument, lex::Dict{String, Int})
Produce a single row of a DocumentTermMatrix.
Individual documents do not have a lexicon associated with them, so a lexicon must be passed as an additional argument.
Examples
julia> dtv(crps[1], lexicon(crps))
1×6 Matrix{Int64}:
1 2 0 1 1 1
TextAnalysis.entropy — Method
entropy(
m::TextAnalysis.Langmodel,
lm::DataStructures.DefaultDict,
text_ngram::AbstractVector
) -> Float64
Calculate the cross-entropy of the model for a given evaluation text.
Input text must be a Vector of n-grams of the same length.
TextAnalysis.everygram — Method
everygram(seq::Vector{T}; min_len::Int=1, max_len::Int=-1) where {T <: AbstractString}
Return all possible n-grams generated from a sequence of items, as a Vector{String}.
Example
julia> seq = ["To","be","or","not"]
julia> a = everygram(seq, min_len=1, max_len=-1)
10-element Vector{Any}:
"or"
"not"
"To"
"be"
"or not"
"be or"
"be or not"
"To be or"
"To be or not"sourceTextAnalysis.extend! — Method
extend!(model::NaiveBayesClassifier, dictElement)
Add the dictElement to the dictionary of the classifier model.
TextAnalysis.features — Method
features(
fs::AbstractDict,
dict::AbstractVector
) -> Vector{Int64}
Compute an array of the values corresponding to each element of dict, looked up in the input AbstractDict fs.
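Example
A hedged sketch (assuming elements of dict that are missing from fs map to 0):
julia> features(Dict("a" => 2, "b" => 1), ["a", "c"])  # expected: [2, 0]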
TextAnalysis.fit! — Method
fit!(model::NaiveBayesClassifier, str, class)
fit!(model::NaiveBayesClassifier, ::Features, class)
fit!(model::NaiveBayesClassifier, ::StringDocument, class)
Fit the weights for the model on the input data.
TextAnalysis.fmeasure_lcs — Function
fmeasure_lcs(RLCS, PLCS, β=1.0)
Compute the F-measure based on WLCS.
Arguments
- RLCS: Recall factor for LCS computation
- PLCS: Precision factor for LCS computation
- β: Beta parameter controlling precision vs recall balance (default: 1.0)
Returns
Real: F-measure score balancing precision and recall
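Example
A hedged worked example, assuming the standard LCS-based F-measure F_β = ((1 + β²)·R·P) / (R + β²·P):
julia> fmeasure_lcs(0.5, 0.5, 1.0)  # expected: ((1 + 1) * 0.5 * 0.5) / (0.5 + 1 * 0.5) = 0.5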
TextAnalysis.frequencies — Method
frequencies(
xs::AbstractArray{T, 1}
) -> Dict{_A, Int64} where _A
Create a dictionary that maps elements in input array to their frequencies.
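Example
A minimal sketch:
julia> frequencies(["a", "b", "b"])  # expected: Dict("b" => 2, "a" => 1)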
TextAnalysis.frequent_terms — Function
frequent_terms(crps, alpha=0.95)
Return the frequent terms from crps, i.e. terms occurring in more than a fraction alpha of the documents.
Example
julia> crps = Corpus([StringDocument("This is Document 1"),
StringDocument("This is Document 2")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> frequent_terms(crps)
3-element Vector{String}:
"is"
"This"
"Document"See also: remove_frequent_terms!, sparse_terms
TextAnalysis.get_ngrams — Method
get_ngrams(segment, max_order)
Extract all n-grams up to a given maximum order from an input segment.
Return a counter containing all n-grams up to max_order in the segment with a count of how many times each n-gram occurred.
Arguments
- segment: Text segment from which n-grams will be extracted.
- max_order: Maximum length in tokens of the n-grams returned by this method.
TextAnalysis.hash_dtm — Method
hash_dtm(crps::Corpus)
hash_dtm(crps::Corpus, h::TextHashFunction)
Represent a Corpus as a Matrix with N entries.
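Example
A minimal sketch (hash values depend on the hash function, so the counts are not shown):
julia> crps = Corpus([StringDocument("To be or not to be"),
StringDocument("To become or not to become")])
julia> hash_dtm(crps, TextHashFunction(10))  # 2×10 Matrix{Int64} of hashed term counts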
TextAnalysis.hash_dtv — Method
hash_dtv(d::AbstractDocument)
hash_dtv(d::AbstractDocument, h::TextHashFunction)
Represent a document as a vector with N entries.
Examples
julia> crps = Corpus([StringDocument("To be or not to be"),
StringDocument("To become or not to become")])
julia> h = TextHashFunction(10)
TextHashFunction(hash, 10)
julia> hash_dtv(crps[1], h)
1×10 Matrix{Int64}:
0 2 0 0 1 3 0 0 0 0
julia> hash_dtv(crps[1])
1×100 Matrix{Int64}:
0 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0
TextAnalysis.index_hash — Method
index_hash(str, TextHashFunc)
Map a string to an integer index using the hash trick.
Arguments
- str: String to be hashed
- TextHashFunc: TextHashFunction object containing hash configuration
Examples
julia> h = TextHashFunction(10)
TextHashFunction(hash, 10)
julia> index_hash("a", h)
8
julia> index_hash("b", h)
7
TextAnalysis.inverse_index — Method
inverse_index(crps::Corpus)
Return the inverse index of a corpus.
If we are interested in a specific term, we often want to know which documents in a corpus contain that term. The inverse index provides this information and enables a simplistic search algorithm.
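Example
A hedged sketch (assuming the index has been built with update_inverse_index!):
julia> crps = Corpus([StringDocument("Name Foo"), StringDocument("Name Bar")])
julia> update_inverse_index!(crps)
julia> inverse_index(crps)["Name"]  # expected: [1, 2], since both documents contain "Name"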
TextAnalysis.language! — Method
language!(doc, lang::Language)
Set the language of doc to lang.
Example
julia> d = StringDocument("String Document 1")
julia> language!(d, Languages.Spanish())
julia> d.metadata.language
Languages.Spanish()
See also: language, languages, languages!
TextAnalysis.language — Method
TextAnalysis.languages! — Method
languages!(crps, langs::Vector{Language})
languages!(crps, lang::Language)
Update languages of documents in a Corpus.
If the input is a Vector, the language of the ith document is set to the ith element of the vector; the number of documents must equal the length of the vector. If the input is a single Language, it is applied to every document.
See also: languages, language!, language
TextAnalysis.languages — Method
languages(crps)
Return the languages for each document in crps.
See also: languages!, language, language!
TextAnalysis.lda — Method
ϕ, θ = lda(dtm::DocumentTermMatrix, ntopics::Int, iterations::Int, α::Float64, β::Float64; kwargs...)
Perform Latent Dirichlet allocation.
Arguments
- dtm::DocumentTermMatrix: Document-term matrix containing the corpus
- ntopics::Int: Number of topics to extract
- iterations::Int: Number of Gibbs sampling iterations
- α::Float64: Dirichlet distribution hyperparameter for the topic distribution per document. α < 1 yields a sparse topic mixture, α > 1 yields a more uniform topic mixture
- β::Float64: Dirichlet distribution hyperparameter for the word distribution per topic. β < 1 yields a sparse word mixture, β > 1 yields a more uniform word mixture
Keyword Arguments
- showprogress::Bool: Show a progress bar during Gibbs sampling (default: true)
Returns
- ϕ: ntopics × nwords sparse matrix of word probabilities per topic
- θ: ntopics × ndocs dense matrix of topic probabilities per document
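Example
A minimal usage sketch (the corpus and hyperparameter values are illustrative):
julia> crps = Corpus([StringDocument("This is the Foo Bar document"),
StringDocument("This document talks about bar baz")])
julia> update_lexicon!(crps)
julia> ϕ, θ = lda(DocumentTermMatrix(crps), 2, 1000, 0.1, 0.1);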
TextAnalysis.lexical_frequency — Method
lexical_frequency(crps::Corpus, term::AbstractString)
Return how often term occurs across all of the documents in crps.
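Example
A hedged sketch (assuming the result is the term's share of all tokens; the corpus is illustrative):
julia> crps = Corpus([StringDocument("Name Foo"), StringDocument("Name Bar")])
julia> update_lexicon!(crps)
julia> lexical_frequency(crps, "Name")  # expected: 0.5, since "Name" is 2 of the 4 tokens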
TextAnalysis.lexicon — Method
lexicon(crps::Corpus)
Return the lexicon of the corpus.
The lexicon of a corpus consists of all terms that occur in any document in the corpus.
Example
julia> crps = Corpus([StringDocument("Name Foo"),
StringDocument("Name Bar")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> lexicon(crps)
Dict{String,Int64} with 0 entries
TextAnalysis.lexicon_size — Method
TextAnalysis.logscore — Method
logscore(
m::TextAnalysis.Langmodel,
temp_lm::DataStructures.DefaultDict,
word,
context
) -> Float64
Evaluate the log score of a word in a given context.
The arguments are the same as for score and maskedscore.
TextAnalysis.lookup — Method
lookup(
voc::Vocabulary,
word::AbstractArray{T<:AbstractString, 1}
) -> Vector
Look up a sequence of words in the vocabulary.
Return a vector of strings.
See Vocabulary
TextAnalysis.lsa — Method
lsa(dtm::DocumentTermMatrix)
lsa(crps::Corpus)
Perform Latent Semantic Analysis (LSA) on a corpus or document-term matrix.
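Example
A minimal sketch (assuming lsa returns an SVD factorization of the weighted document-term matrix):
julia> crps = Corpus([StringDocument("this is a text"), StringDocument("this is another text")])
julia> update_lexicon!(crps)
julia> lsa(crps)  # SVD factorization; singular vectors span the latent semantic space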
TextAnalysis.maskedscore — Method
TextAnalysis.ngramize — Method
ngramize(lang, tokens, n)
Compute the n-grams of tokens of order n.
Example
julia> ngramize(Languages.English(), ["To", "be", "or", "not", "to"], 3)
Dict{AbstractString,Int64} with 3 entries:
"be or not" => 1
"or not to" => 1
"To be or" => 1sourceTextAnalysis.ngramizenew — Method
ngramizenew(words::Vector{T}, nlist::Integer...) where {T <: AbstractString}
Generate n-grams from a sequence of words.
Example
julia> seq=["To","be","or","not","To","not","To","not"]
julia> ngramizenew(seq, 2)
7-element Vector{Any}:
"To be"
"be or"
"or not"
"not To"
"To not"
"not To"
"To not"sourceTextAnalysis.ngrams — Method
ngrams(ngd::NGramDocument, n::Integer)
ngrams(d::AbstractDocument, n::Integer)
ngrams(d::NGramDocument)
ngrams(d::AbstractDocument)
Access the document text as n-gram counts.
Example
julia> sd = StringDocument("To be or not to be...")
A StringDocument{String}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: To be or not to be...
julia> ngrams(sd)
Dict{String,Int64} with 7 entries:
"or" => 1
"not" => 1
"to" => 1
"To" => 1
"be" => 1
"be.." => 1
"." => 1sourceTextAnalysis.onegramize — Method
onegramize(lang, tokens)
Create the unigrams dictionary for input tokens.
Example
julia> onegramize(Languages.English(), ["To", "be", "or", "not", "to", "be"])
Dict{String,Int64} with 5 entries:
"or" => 1
"not" => 1
"to" => 1
"To" => 1
"be" => 2sourceTextAnalysis.padding_ngram — Method
padding_ngram(word::Vector{T}, n=1; pad_left=false, pad_right=false, left_pad_symbol="<s>", right_pad_symbol="</s>") where {T <: AbstractString}
Pad the left and/or right side of a sentence (controlled by pad_left and pad_right) and output n-grams of order n.
This function also pads the original input vector of strings.
Example
julia> example = ["1","2","3","4","5"]
julia> padding_ngram(example,2,pad_left=true,pad_right=true)
6-element Vector{Any}:
"<s> 1"
"1 2"
"2 3"
"3 4"
"4 5"
"5 </s>"sourceTextAnalysis.pagerank — Method
pagerank(A; n_iter=20, damping=0.15)
Compute PageRank scores for nodes in a graph using the power iteration method.
Arguments
- A: Adjacency matrix representing the graph
- n_iter: Number of iterations for convergence (default: 20)
- damping: Damping factor for PageRank algorithm (default: 0.15)
Returns
Matrix{Float64}: PageRank scores for each node
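Example
A hedged sketch on a hypothetical 3-node graph (the adjacency matrix is illustrative):
julia> A = [0 1 1; 1 0 0; 1 0 0]  # node 1 links to nodes 2 and 3, which link back
julia> pagerank(A)  # PageRank scores; node 1 should score highest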
TextAnalysis.perplexity — Method
TextAnalysis.predict — Method
predict(::NaiveBayesClassifier, str)
predict(::NaiveBayesClassifier, ::Features)
predict(::NaiveBayesClassifier, ::StringDocument)
Predict probabilities for each class on the input Features or String.
TextAnalysis.prepare! — Method
prepare!(doc, flags)
prepare!(crps, flags)
Preprocess document or corpus based on the input flags.
List of Flags
- strip_patterns
- strip_corrupt_utf8
- strip_case
- stem_words
- tag_part_of_speech
- strip_whitespace
- strip_punctuation
- strip_numbers
- strip_non_letters
- strip_indefinite_articles
- strip_definite_articles
- strip_articles
- strip_prepositions
- strip_pronouns
- strip_stopwords
- strip_sparse_terms
- strip_frequent_terms
- strip_html_tags
Example
julia> doc = StringDocument("This is a document of mine")
A StringDocument{String}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: This is a document of mine
julia> prepare!(doc, strip_pronouns | strip_articles)
julia> text(doc)
"This is document of "sourceTextAnalysis.prob — Function
prob(
m::TextAnalysis.Langmodel,
templ_lm::DataStructures.DefaultDict,
word
) -> Float64
prob(
m::TextAnalysis.Langmodel,
templ_lm::DataStructures.DefaultDict,
word,
context
) -> Float64
Get the probability of a word given its context.
In other words, for a given context, calculate the frequency distribution of words.
TextAnalysis.prune! — Method
prune!(dtm::DocumentTermMatrix{T}, document_positions; compact::Bool=true, retain_terms::Union{Nothing,Vector{T}}=nothing) where {T}
Delete documents specified by document_positions from a document term matrix. Optionally compact the matrix by removing unreferenced terms.
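Example
A minimal sketch (the corpus is illustrative):
julia> crps = Corpus([StringDocument("one two"), StringDocument("two three")])
julia> update_lexicon!(crps)
julia> m = DocumentTermMatrix(crps)
julia> prune!(m, [1])  # drop the first document and compact away terms that no longer occur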
TextAnalysis.remove_case! — Method
remove_case!(doc)
remove_case!(crps)
Convert the text of doc or crps to lowercase. Does not support FileDocument or crps containing FileDocument.
Example
julia> str = "The quick brown fox jumps over the lazy dog"
julia> sd = StringDocument(str)
A StringDocument{String}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: The quick brown fox jumps over the lazy dog
julia> remove_case!(sd)
julia> sd.text
"the quick brown fox jumps over the lazy dog"See also: remove_case
TextAnalysis.remove_case — Method
TextAnalysis.remove_corrupt_utf8! — Method
remove_corrupt_utf8!(doc)
remove_corrupt_utf8!(crps)
Remove corrupt UTF8 characters from doc or the documents in crps. Does not support FileDocument or Corpus containing FileDocument.
See also: remove_corrupt_utf8
TextAnalysis.remove_corrupt_utf8 — Method
TextAnalysis.remove_frequent_terms! — Function
remove_frequent_terms!(crps, alpha=0.95)
Remove frequent terms from crps, i.e. terms occurring in more than a fraction alpha of the documents.
Example
julia> crps = Corpus([StringDocument("This is Document 1"),
StringDocument("This is Document 2")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> remove_frequent_terms!(crps)
julia> text(crps[1])
" 1"
julia> text(crps[2])
" 2"See also: remove_sparse_terms!, frequent_terms
TextAnalysis.remove_html_tags! — Method
remove_html_tags!(doc::StringDocument)
remove_html_tags!(crps)
Remove html tags from the StringDocument or documents crps. Does not work for documents other than StringDocument.
Example
julia> html_doc = StringDocument(
"
<html>
<head><script language=\"javascript\">x = 20;</script></head>
<body>
<h1>Hello</h1><a href=\"world\">world</a>
</body>
</html>
"
)
A StringDocument{String}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: <html> <head><s
julia> remove_html_tags!(html_doc)
julia> strip(text(html_doc))
"Hello world"See also: remove_html_tags
TextAnalysis.remove_html_tags — Method
remove_html_tags(str)
Remove html tags from str, including the style and script tags.
See also: remove_html_tags!
TextAnalysis.remove_patterns! — Method
remove_patterns!(doc, rex::Regex)
remove_patterns!(crps, rex::Regex)
Remove patterns matched by rex in document or Corpus. Does not modify FileDocument or Corpus containing FileDocument.
See also: remove_patterns
TextAnalysis.remove_patterns — Method
remove_patterns(str, rex::Regex)
Remove the part of str matched by rex.
See also: remove_patterns!
TextAnalysis.remove_sparse_terms! — Function
remove_sparse_terms!(crps, alpha=0.05)
Remove sparse terms from crps, i.e. terms occurring in less than a fraction alpha of the documents.
Example
julia> crps = Corpus([StringDocument("This is Document 1"),
StringDocument("This is Document 2")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> remove_sparse_terms!(crps, 0.5)
julia> crps[1].text
"This is Document "
julia> crps[2].text
"This is Document "See also: remove_frequent_terms!, sparse_terms
TextAnalysis.remove_whitespace! — Method
remove_whitespace!(doc)
remove_whitespace!(crps)
Collapse runs of whitespace into a single space and remove all leading and trailing whitespace in a document or corpus. Is a no-op for FileDocument, TokenDocument or NGramDocument.
See also: remove_whitespace
TextAnalysis.remove_whitespace — Method
remove_whitespace(str)
Collapse runs of whitespace into a single space and remove all leading and trailing whitespace.
See also: remove_whitespace!
TextAnalysis.remove_words! — Method
remove_words!(doc, words::Vector{AbstractString})
remove_words!(crps, words::Vector{AbstractString})
Remove the occurrences of words from doc or crps.
Example
julia> str="The quick brown fox jumps over the lazy dog"
julia> sd=StringDocument(str);
julia> remove_words = ["fox", "over"]
julia> remove_words!(sd, remove_words)
julia> sd.text
"the quick brown jumps the lazy dog"sourceTextAnalysis.rouge_l_sentence — Function
rouge_l_sentence(
references::Vector{<:AbstractString}, candidate::AbstractString, β=8;
weighted=false, weight_func=sqrt,
lang=Languages.English()
)::Vector{Score}
Calculate the ROUGE-L score between references and candidate at the sentence level.
Return a vector of Score objects.
See Rouge: A package for automatic evaluation of summaries
The weighted argument enables weighting of values when calculating the longest common subsequence. The initial implementation (ROUGE-1.5.5.pl) uses a power function; here weight_func defaults to sqrt, i.e. a power of 0.5.
See also: rouge_n, rouge_l_summary
TextAnalysis.rouge_l_summary — Method
rouge_l_summary(
references::Vector{<:AbstractString}, candidate::AbstractString, β::Int;
lang=Languages.English()
)::Vector{Score}
Calculate the ROUGE-L score between references and candidate at the summary level.
Return a vector of Score objects.
See Rouge: A package for automatic evaluation of summaries
See also: rouge_l_sentence, rouge_n
TextAnalysis.rouge_n — Method
rouge_n(
references::Vector{<:AbstractString},
candidate::AbstractString,
n::Int;
lang::Language
)::Vector{Score}
Compute n-gram recall between candidate and the reference summaries.
Arguments
- references::Vector{T} where T <: AbstractString: List of reference summaries
- candidate::AbstractString: Input candidate summary to be scored against the reference summaries
- n::Integer: Order of n-grams
- lang::Language: Language of the text, useful while generating n-grams (default: Languages.English())
Return a vector of Score objects.
See Rouge: A package for automatic evaluation of summaries
See also: rouge_l_sentence, rouge_l_summary
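Example
A minimal usage sketch (the summaries are illustrative):
julia> reference_summaries = ["Brazil, Russia, India and China are growing nations."]
julia> candidate_summary = "Brazil, Russia, India and China are the next big powers."
julia> rouge_n(reference_summaries, candidate_summary, 2)  # Vector{Score} of bigram overlap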
TextAnalysis.score — Function
score(m::MLE, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)
Compute the probability of a word given its context using MLE (Maximum Likelihood Estimation).
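Example
A minimal usage sketch (the vocabulary and training tokens are illustrative; the fitted counts come from calling the model object on the training tokens):
julia> voc = ["my", "name", "is", "salman", "khan", "and", "he", "is", "shahrukh", "Khan"]
julia> train = ["khan", "is", "my", "good", "friend", "and", "He", "is", "my", "brother"]
julia> model = MLE(voc)
julia> fit = model(train, 2, 2)  # fit bigram counts
julia> score(model, fit, "is", "my")  # P("is" | "my") under the fitted model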
TextAnalysis.score — Function
score(m::InterpolatedLanguageModel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)
Compute the probability of a word given its context in an interpolated language model.
Applies Kneser-Ney and Witten-Bell smoothing depending on the sub-type.
TextAnalysis.score — Method
score(m::gammamodel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)
Compute the probability of a word given its context using add-one smoothing.
Applies add-one smoothing to Lidstone or Laplace (gammamodel) models.
TextAnalysis.sentence_tokenize — Method
sentence_tokenize(lang, s)
Split string into individual sentences.
Arguments
- lang: Language for sentence boundary detection rules
- s: String to split into sentences
Returns
Vector{SubString{String}}: Array of sentences extracted from the string
Example
julia> sentence_tokenize(Languages.English(), "Here are few words! I am Foo Bar.")
2-element Vector{SubString{String}}:
"Here are few words!"
"I am Foo Bar."See also: tokenize
TextAnalysis.sparse_terms — Function
sparse_terms(crps, alpha=0.05)
Return the sparse terms from crps, i.e. terms occurring in less than a fraction alpha of the documents.
Example
julia> crps = Corpus([StringDocument("This is Document 1"),
StringDocument("This is Document 2")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> sparse_terms(crps, 0.5)
2-element Vector{String}:
"1"
"2"See also: remove_sparse_terms!, frequent_terms
TextAnalysis.standardize! — Method
standardize!(crps::Corpus, ::Type{T}) where T <: AbstractDocument
Standardize the documents in a Corpus to a common type.
Example
julia> crps = Corpus([StringDocument("Document 1"),
TokenDocument("Document 2"),
NGramDocument("Document 3")])
A Corpus with 3 documents:
* 1 StringDocument's
* 0 FileDocument's
* 1 TokenDocument's
* 1 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> standardize!(crps, NGramDocument)
# After this step, you can check that the corpus only contains NGramDocument's:
julia> crps
A Corpus with 3 documents:
* 0 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 3 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
TextAnalysis.stem! — Method
stem!(doc)
stem!(crps)
Apply stemming to the document or documents in crps using an appropriate stemmer.
Does not support FileDocument or Corpus containing FileDocument.
Arguments
- doc: Document to apply stemming to
- crps: Corpus containing documents to apply stemming to
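Example
A minimal sketch (the exact stem depends on the Snowball stemmer for the document's language):
julia> sd = StringDocument("jumping")
julia> stem!(sd)
julia> text(sd)  # expected: "jump"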
TextAnalysis.stem! — Method
stem!(crps::Corpus)
Apply stemming to an entire corpus. Assumes all documents in the corpus have the same language (determined from the first document).
Arguments
crps: Corpus containing documents to apply stemming to
TextAnalysis.stemmer_for_document — Method
stemmer_for_document(d)
Return an appropriate stemmer based on the language of the document.
Arguments
d: Document for which to select stemmer
TextAnalysis.summarize — Method
summarize(doc; ns=5)
Generate a summary of the document and return the top ns sentences.
Arguments
- doc: Document of type StringDocument, FileDocument, or TokenDocument
- ns: Number of sentences in the summary (default: 5)
Returns
Vector{SubString{String}}: Array of the most relevant sentences
Example
julia> s = StringDocument("Assume this Short Document as an example. Assume this as an example summarizer. This has too foo sentences.")
julia> summarize(s, ns=2)
2-element Vector{SubString{String}}:
"Assume this Short Document as an example."
"This has too foo sentences."sourceTextAnalysis.tag_scheme! — Method
tag_scheme!(tags, current_scheme::String, new_scheme::String)
Convert tags from one tagging scheme to another in-place.
Arguments
- tags: Vector of tags to convert
- current_scheme: Name of the current tagging scheme
- new_scheme: Name of the target tagging scheme
Supported Schemes
- BIO1 (BIO)
- BIO2
- BIOES
Example
julia> tags = ["I-LOC", "O", "I-PER", "B-MISC", "I-MISC", "B-PER", "I-PER", "I-PER"]
julia> tag_scheme!(tags, "BIO1", "BIOES")
julia> tags
8-element Vector{String}:
"S-LOC"
"O"
"S-PER"
"B-MISC"
"E-MISC"
"B-PER"
"I-PER"
"E-PER"sourceTextAnalysis.text — Method
text(fd::FileDocument)
text(sd::StringDocument)
text(ngd::NGramDocument)
Access the text of Document as a string.
Example
julia> sd = StringDocument("To be or not to be...")
A StringDocument{String}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: To be or not to be...
julia> text(sd)
"To be or not to be..."sourceTextAnalysis.tf! — Method
tf!(dtm::SparseMatrixCSC{Real}, tf::SparseMatrixCSC{AbstractFloat})
Compute term frequency for sparse matrices and store result in tf.
Arguments
- dtm: Sparse document-term matrix containing term counts
- tf: Output sparse matrix for term frequency values (modified in-place)
Notes
The tf matrix should have the same nonzero pattern as dtm.
TextAnalysis.tf! — Method
tf!(dtm::AbstractMatrix{Real}, tf::AbstractMatrix{AbstractFloat})
Compute term frequency and store result in tf matrix.
Arguments
- dtm: Document-term matrix containing term counts
- tf: Output matrix for term frequency values (modified in-place)
Notes
Works correctly when dtm and tf are the same matrix.
TextAnalysis.tf — Method
tf(dtm::DocumentTermMatrix)
tf(dtm::SparseMatrixCSC{Real})
tf(dtm::Matrix{Real})
Compute term frequency for the document-term matrix.
Arguments
dtm: Document-term matrix (DocumentTermMatrix, sparse matrix, or dense matrix)
Returns
Matrix{Float64} or SparseMatrixCSC{Float64}: Term frequency matrix
Example
julia> crps = Corpus([StringDocument("To be or not to be"),
StringDocument("To become or not to become")])
julia> update_lexicon!(crps)
julia> m = DocumentTermMatrix(crps)
julia> tf(m)
2×6 SparseArrays.SparseMatrixCSC{Float64,Int64} with 10 stored entries:
[1, 1] = 0.166667
[2, 1] = 0.166667
[1, 2] = 0.333333
[2, 3] = 0.333333
[1, 4] = 0.166667
[2, 4] = 0.166667
[1, 5] = 0.166667
[2, 5] = 0.166667
[1, 6] = 0.166667
[2, 6] = 0.166667
See also: tf!, tf_idf, tf_idf!
TextAnalysis.tf_idf! — Method
tf_idf!(dtm)
Compute TF-IDF values for document-term matrix in-place.
Arguments
dtm: Document-term matrix to transform (modified in-place)
TextAnalysis.tf_idf! — Method
TextAnalysis.tf_idf! — Method
tf_idf!(dtm::AbstractMatrix{Real}, tf_idf::AbstractMatrix{AbstractFloat})
Compute TF-IDF (Term Frequency-Inverse Document Frequency) and store result in tf_idf matrix.
Arguments
- dtm: Document-term matrix containing term counts
- tf_idf: Output matrix for TF-IDF values (modified in-place)
Notes
The matrices dtm and tf_idf must have the same dimensions.
TextAnalysis.tf_idf — Method
tf_idf(dtm::DocumentTermMatrix)
tf_idf(dtm::SparseMatrixCSC{Real})
tf_idf(dtm::Matrix{Real})
Compute TF-IDF (Term Frequency-Inverse Document Frequency) values for the document-term matrix.
Arguments
dtm: Document-term matrix (DocumentTermMatrix, sparse matrix, or dense matrix)
Returns
Matrix{Float64} or SparseMatrixCSC{Float64}: TF-IDF weighted matrix
Notes
TF-IDF addresses issues with raw word counts:
- Some documents are longer than other documents
- Some words are more frequent than other words
Example
julia> crps = Corpus([StringDocument("To be or not to be"),
StringDocument("To become or not to become")])
julia> update_lexicon!(crps)
julia> m = DocumentTermMatrix(crps)
julia> tf_idf(m)
2×6 SparseArrays.SparseMatrixCSC{Float64,Int64} with 10 stored entries:
[1, 1] = 0.0
[2, 1] = 0.0
[1, 2] = 0.231049
[2, 3] = 0.231049
[1, 4] = 0.0
[2, 4] = 0.0
[1, 5] = 0.0
[2, 5] = 0.0
[1, 6] = 0.0
[2, 6] = 0.0
See also: tf, tf!, tf_idf!
TextAnalysis.timestamp! — Method
timestamp!(doc, timestamp::AbstractString)
Set the timestamp metadata of doc to timestamp.
See also: timestamp, timestamps, timestamps!
TextAnalysis.timestamp — Method
timestamp(doc)
Return the timestamp metadata for doc.
See also: timestamp!, timestamps, timestamps!
TextAnalysis.timestamps! — Method
timestamps!(crps, times::Vector{String})
timestamps!(crps, time::AbstractString)
Set the timestamps of the documents in crps to the timestamps in times, respectively.
See also: timestamps, timestamp!, timestamp
TextAnalysis.timestamps — Method
timestamps(crps)
Return the timestamps for each document in crps.
See also: timestamps!, timestamp, timestamp!
TextAnalysis.title! — Method
TextAnalysis.title — Method
TextAnalysis.titles! — Method
titles!(crps, vec::Vector{String})
titles!(crps, str)
Update titles of the documents in a Corpus.
If the input is a String, set the same title for all documents. If the input is a vector, set the title of the ith document to the corresponding ith element in the vector vec. In the latter case, the number of documents must equal the length of the vector.
See also: titles, title!, title
TextAnalysis.titles — Method
TextAnalysis.tokenize — Method
tokenize(lang, s)
Split string into words and other tokens such as punctuation.
Arguments
- lang: Language for tokenization rules
- s: String to tokenize
Returns
Vector{String}: Array of tokens extracted from the string
Example
julia> tokenize(Languages.English(), "Too foo words!")
4-element Vector{String}:
"Too"
"foo"
"words"
"!"See also: sentence_tokenize
TextAnalysis.tokens — Method
tokens(d::TokenDocument)
tokens(d::Union{FileDocument, StringDocument})
Access the document text as a token array.
Example
julia> sd = StringDocument("To be or not to be...")
A StringDocument{String}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: To be or not to be...
julia> tokens(sd)
7-element Vector{String}:
"To"
"be"
"or"
"not"
"to"
"be.."
"."sourceTextAnalysis.update — Method
TextAnalysis.weighted_lcs — Function
weighted_lcs(X, Y, weighted=true, f=sqrt)
Compute the Weighted Longest Common Subsequence of X and Y.
Arguments
- X: First sequence
- Y: Second sequence
- weighted: Whether to use weighted computation (default: true)
- f: Weighting function (default: sqrt)
Returns
Float32: Length of the weighted longest common subsequence
TextAnalysis.weighted_lcs_tokens — Function
weighted_lcs_tokens(X, Y, weighted=true, f=sqrt)
Compute the tokens of the Weighted Longest Common Subsequence of X and Y.
Arguments
- X: First sequence
- Y: Second sequence
- weighted: Whether to use weighted computation (default: true)
- f: Weighting function (default: sqrt)
Returns
Vector{String}: Array of tokens in the longest common subsequence
TextAnalysis.CooMatrix — Type
Basic Co-occurrence Matrix (COOM) type.
Fields
- coom::SparseMatrixCSC{T,Int}: The actual COOM; elements represent co-occurrences of two terms within a given window.
- terms::Vector{String}: A list of terms that represent the lexicon of the document or corpus.
- column_indices::OrderedDict{String, Int}: A map between the terms and the columns of the co-occurrence matrix.
TextAnalysis.CooMatrix — Method
CooMatrix{T}(crps::Corpus [,terms] [;window=5, normalize=true])
Auxiliary constructors of the CooMatrix type. The type T must be a subtype of AbstractFloat.
The constructors require a corpus crps and a terms structure representing the lexicon of the corpus. The latter can be a Vector{String}, an AbstractDict where the keys are the lexicon, or can be omitted, in which case the lexicon field of the corpus is used.
TextAnalysis.Corpus — Method
Corpus(docs::Vector{T}) where {T <: AbstractDocument}
Collections of documents are represented using the Corpus type.
Example
julia> crps = Corpus([StringDocument("Document 1"),
StringDocument("Document 2")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
TextAnalysis.DocumentMetadata — Type
DocumentMetadata(
language::Language,
title::String,
author::String,
timestamp::String,
custom::Any
)
Store basic metadata about a document.
Arguments
- language: Language of the document (default: Languages.English())
- title: Title of the document (default: "Untitled Document")
- author: Author of the document (default: "Unknown Author")
- timestamp: Timestamp when the document was written (default: "Unknown Time")
- custom: User-specific data field (default: nothing)
TextAnalysis.DocumentTermMatrix — Method
DocumentTermMatrix(crps::Corpus)
DocumentTermMatrix(crps::Corpus, terms::Vector{String})
DocumentTermMatrix(crps::Corpus, lex::AbstractDict)
DocumentTermMatrix(dtm::SparseMatrixCSC{Int, Int}, terms::Vector{String})
Represent documents as a matrix of word counts.
This representation allows linear algebra operations and statistical techniques to be applied. The lexicon must be updated before use.
Examples
julia> crps = Corpus([StringDocument("To be or not to be"),
StringDocument("To become or not to become")])
julia> update_lexicon!(crps)
julia> m = DocumentTermMatrix(crps)
A 2 X 6 DocumentTermMatrix
julia> m.dtm
2×6 SparseArrays.SparseMatrixCSC{Int64,Int64} with 10 stored entries:
[1, 1] = 1
[2, 1] = 1
[1, 2] = 2
[2, 3] = 2
[1, 4] = 1
[2, 4] = 1
[1, 5] = 1
[2, 5] = 1
[1, 6] = 1
[2, 6] = 1
TextAnalysis.FileDocument — Method
FileDocument(pathname::AbstractString)
Represent a document using a plain text file on disk.
Example
julia> pathname = "/usr/share/dict/words"
"/usr/share/dict/words"
julia> fd = FileDocument(pathname)
A FileDocument
* Language: Languages.English()
* Title: /usr/share/dict/words
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: A A's AMD AMD's AOL AOL's Aachen Aachen's Aaliyah
TextAnalysis.KneserNeyInterpolated — Method
KneserNeyInterpolated(word::Vector{T}, discount::Float64, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}
Initialize a type for providing a Kneser-Ney interpolated language model.
The idea to abstract this comes from Chen & Goodman 1995.
TextAnalysis.Laplace — Type
Laplace(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}
Initialize a Laplace type for providing Laplace-smoothed scores.
In addition to initialization arguments from the base n-gram model, this uses a smoothing parameter gamma = 1.
TextAnalysis.Lidstone — Method
Lidstone(word::Vector{T}, gamma::Float64, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}
Initialize a Lidstone type for providing Lidstone-smoothed scores.
In addition to initialization arguments from the base n-gram model, this also requires a number by which to increase the counts (gamma).
TextAnalysis.MLE — Method
MLE(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}
Initialize a type for providing MLE n-gram model scores.
Implementation of the base n-gram model using Maximum Likelihood Estimation.
TextAnalysis.NGramDocument — Method
NGramDocument(txt::AbstractString, n::Integer=1)
NGramDocument(txt::AbstractString, dm::DocumentMetadata, n::Integer=1)
NGramDocument(ng::Dict{T, Int}, n::Integer=1) where T <: AbstractString
Represent a document as a bag of n-grams, which are UTF8 n-grams that map to counts.
Example
julia> my_ngrams = Dict{String, Int}("To" => 1, "be" => 2,
"or" => 1, "not" => 1,
"to" => 1, "be..." => 1)
Dict{String,Int64} with 6 entries:
"or" => 1
"be..." => 1
"not" => 1
"to" => 1
"To" => 1
"be" => 2
julia> ngd = NGramDocument(my_ngrams)
A NGramDocument{AbstractString}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: ***SAMPLE TEXT NOT AVAILABLE***
TextAnalysis.NaiveBayesClassifier — Method
NaiveBayesClassifier([dict, ]classes)
A Naive Bayes Classifier for classifying documents.
Arguments
- classes: Array of possible classes that the data could belong to
- dict: (Optional) Array of possible tokens (words). This is automatically updated if a new token is detected during training or prediction
Example
julia> using TextAnalysis: NaiveBayesClassifier, fit!, predict
julia> m = NaiveBayesClassifier([:spam, :non_spam])
NaiveBayesClassifier{Symbol}(String[], [:spam, :non_spam], Matrix{Int64}(undef, 0, 2))
julia> fit!(m, "this is spam", :spam)
NaiveBayesClassifier{Symbol}(["this", "is", "spam"], [:spam, :non_spam], [2 1; 2 1; 2 1])
julia> fit!(m, "this is not spam", :non_spam)
NaiveBayesClassifier{Symbol}(["this", "is", "spam", "not"], [:spam, :non_spam], [2 2; 2 2; 2 2; 1 2])
julia> predict(m, "is this a spam")
Dict{Symbol, Float64} with 2 entries:
:spam => 0.59883
:non_spam => 0.40117
TextAnalysis.Score — Type
TextAnalysis.Score — Method
Score(
precision::AbstractFloat,
recall::AbstractFloat,
fmeasure::AbstractFloat
) -> Score
Store the result of an evaluation.
TextAnalysis.Score — Method
Score(; precision, recall, fmeasure) -> Score
TextAnalysis.StringDocument — Method
StringDocument(txt::AbstractString)
Represent a document using a UTF8 String stored in RAM.
Example
julia> str = "To be or not to be..."
"To be or not to be..."
julia> sd = StringDocument(str)
A StringDocument{String}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: To be or not to be...
TextAnalysis.TextHashFunction — Method
TextHashFunction(cardinality)
TextHashFunction(hash_function, cardinality)
The need to create a lexicon before constructing a document term matrix is often prohibitive. This implementation employs the "Hash Trick" technique, which replaces terms with their hashed values using a hash function that outputs integers from 1 to N.
Arguments
- cardinality: Maximum index used for hashing (default: 100)
- hash_function: Function used for hashing process (default: built-in hash function)
Examples
julia> h = TextHashFunction(10)
TextHashFunction(hash, 10)
TextAnalysis.TokenDocument — Method
TokenDocument(txt::AbstractString)
TokenDocument(txt::AbstractString, dm::DocumentMetadata)
TokenDocument(tkns::Vector{T}) where T <: AbstractString
Represent a document as a sequence of UTF8 tokens.
Example
julia> my_tokens = String["To", "be", "or", "not", "to", "be..."]
6-element Vector{String}:
"To"
"be"
"or"
"not"
"to"
"be..."
julia> td = TokenDocument(my_tokens)
A TokenDocument{String}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: ***SAMPLE TEXT NOT AVAILABLE***
TextAnalysis.Vocabulary — Type
Vocabulary(word, unk_cutoff=1, unk_label="<unk>")
Store language model vocabulary.
Satisfies two common language modeling requirements for a vocabulary:
- When checking membership and calculating its size, filters items by comparing their counts to a cutoff value.
- Adds a special "unknown" token which unseen words are mapped to.
Example
julia> words = ["a", "c", "-", "d", "c", "a", "b", "r", "a", "c", "d"]
julia> vocabulary = Vocabulary(words, 2)
Vocabulary(Dict("<unk>"=>1,"c"=>3,"a"=>3,"d"=>2), 2, "<unk>")
julia> vocabulary.vocab
Dict{String,Int64} with 4 entries:
"<unk>" => 1
"c" => 3
"a" => 3
"d" => 2
Tokens with counts greater than or equal to the cutoff value will
be considered part of the vocabulary.
julia> vocabulary.vocab["c"]
3
julia> "c" in keys(vocabulary.vocab)
true
julia> vocabulary.vocab["d"]
2
julia> "d" in keys(vocabulary.vocab)
true
Tokens with frequency counts less than the cutoff value will be considered not
part of the vocabulary even though their entries in the count dictionary are
preserved.
julia> "b" in keys(vocabulary.vocab)
false
julia> "<unk>" in keys(vocabulary.vocab)
true
We can look up words in a vocabulary using its `lookup` method.
"Unseen" words (with counts less than cutoff) are looked up as the unknown label.
If given one word (a string) as an input, this method will return a string.
julia> lookup("a")
'a'
julia> word = ["a", "-", "d", "c", "a"]
julia> lookup(vocabulary, word)
5-element Vector{Any}:
"a"
"<unk>"
"d"
"c"
"a"
If given a sequence, it will return a `Vector{Any}` of the looked up words as shown above.
It's possible to update the counts after the vocabulary has been created.
julia> update(vocabulary,["b","c","c"])
1
julia> vocabulary.vocab["b"]
1
TextAnalysis.Vocabulary — Method
Vocabulary(word::Array{T<:AbstractString, 1}) -> Vocabulary
Vocabulary(
word::Array{T<:AbstractString, 1},
unk_cutoff
) -> Vocabulary
Vocabulary(
word::Array{T<:AbstractString, 1},
unk_cutoff,
unk_label
) -> Vocabulary
TextAnalysis.WittenBellInterpolated — Method
WittenBellInterpolated(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}
Initialize a type for providing an interpolated version of Witten-Bell smoothing.
The idea to abstract this comes from Chen & Goodman 1995.