API References

Base.argmax — Method
argmax(scores::Vector{Score})::Score

Returns the Score with the maximum precision field.

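Example (a minimal sketch; Score fields are given positionally as precision, recall, fmeasure):

julia> scores = [Score(0.4, 0.6, 0.48), Score(0.7, 0.5, 0.58)]

julia> argmax(scores)   # returns the second Score, which has the highest precision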
source
Base.merge! — Method
merge!(dtm1::DocumentTermMatrix{T}, dtm2::DocumentTermMatrix{T}) where {T}

Merge one DocumentTermMatrix instance into another. Documents are appended to the end. Terms are re-sorted. For efficiency, this may result in modifications to dtm2 as well.

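Example (a sketch; each corpus needs update_lexicon! before its DocumentTermMatrix is built):

julia> crps1 = Corpus([StringDocument("To be or not to be")])

julia> update_lexicon!(crps1)

julia> crps2 = Corpus([StringDocument("To become or not to become")])

julia> update_lexicon!(crps2)

julia> dtm1 = DocumentTermMatrix(crps1); dtm2 = DocumentTermMatrix(crps2)

julia> merge!(dtm1, dtm2)   # dtm1 now holds both documents over the merged, re-sorted terms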
source
TextAnalysis.bleu_score — Method
bleu_score(reference_corpus::Vector{Vector{Token}}, translation_corpus::Vector{Token}; max_order=4, smooth=false)

Computes the BLEU score of translated segments against one or more references. Returns the BLEU score, n-gram precisions, brevity penalty, geometric mean of n-gram precisions, translation_length and reference_length.

Arguments

  • reference_corpus: list of lists of references for each translation. Each reference should be tokenized into a list of tokens.
  • translation_corpus: list of translations to score. Each translation should be tokenized into a list of tokens.
  • max_order: maximum n-gram order to use when computing BLEU score.
  • smooth=false: whether or not to apply Lin et al. 2004 smoothing.

Example:

one_doc_references = [
    ["apple", "is", "apple"],
    ["apple", "is", "a", "fruit"]
]  
one_doc_translation = [
    "apple", "is", "appl"
]
bleu_score([one_doc_references], [one_doc_translation], smooth=true)
source
TextAnalysis.coo_matrix — Method
coo_matrix(::Type{T}, doc::Vector{AbstractString}, vocab::OrderedDict{AbstractString, Int}, window::Int, normalize::Bool, mode::Symbol)

Basic low-level function that calculates the co-occurrence matrix of a document. Returns a sparse co-occurrence matrix sized n × n where n = length(vocab) with elements of type T. The document doc is represented by a vector of its terms (in order). The keywords window and normalize indicate the size of the sliding word window in which co-occurrences are counted and whether or not to normalize the counts by the distance between word positions. The mode keyword can be either :default or :directional and indicates whether the co-occurrence matrix should be directional or not. If mode is :directional, coom[i,j] is the number of times vocab[i] co-occurs with vocab[j] in the document doc. If mode is :default, coom[i,j] is twice that number (once for each direction, from i to j plus from j to i).

Example

julia> using TextAnalysis, DataStructures
       doc = StringDocument("This is a text about an apple. There are many texts about apples.")
       docv = TextAnalysis.tokenize(language(doc), text(doc))
       vocab = OrderedDict("This"=>1, "is"=>2, "apple."=>3)
       TextAnalysis.coo_matrix(Float16, docv, vocab, 5, true)

3×3 SparseArrays.SparseMatrixCSC{Float16,Int64} with 4 stored entries:
  [2, 1]  =  2.0
  [1, 2]  =  2.0
  [3, 2]  =  0.3999
  [2, 3]  =  0.3999

julia> using TextAnalysis, DataStructures
       doc = StringDocument("This is a text about an apple. There are many texts about apples.")
       docv = TextAnalysis.tokenize(language(doc), text(doc))
       vocab = OrderedDict("This"=>1, "is"=>2, "apple."=>3)
       TextAnalysis.coo_matrix(Float16, docv, vocab, 5, true, :directional)

3×3 SparseArrays.SparseMatrixCSC{Float16,Int64} with 4 stored entries:
  [2, 1]  =  1.0
  [1, 2]  =  1.0
  [3, 2]  =  0.1999
  [2, 3]  =  0.1999
source
TextAnalysis.coom — Method
coom(entity, eltype=DEFAULT_FLOAT_TYPE [;window=5, normalize=true])

Access the co-occurrence matrix of the CooMatrix associated with the entity. The CooMatrix{T} will first have to be created in order for the actual matrix to be accessed.

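Example (a sketch; here the CooMatrix is built from a corpus whose lexicon has been updated):

julia> crps = Corpus([StringDocument("this is a text about an apple and another text about apples")])

julia> update_lexicon!(crps)

julia> C = CooMatrix{Float64}(crps; window=3)

julia> coom(C)   # the underlying sparse co-occurrence matrix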
source
TextAnalysis.cos_similarity — Method
function cos_similarity(tfm::AbstractMatrix)

cos_similarity calculates the cosine similarity from a term frequency matrix (typically the tf-idf matrix).

Example

crps = Corpus( StringDocument.([
    "to be or not to be",
    "to sing or not to sing",
    "to talk or to silence"]) )
update_lexicon!(crps)
d = dtm(crps)
tfm = tf_idf(d)
cs = cos_similarity(tfm)
Matrix(cs)
    # 3×3 Array{Float64,2}:
    #  1.0        0.0329318  0.0
    #  0.0329318  1.0        0.0
    #  0.0        0.0        1.0
source
TextAnalysis.counter2 — Method
counter2(
    data,
    min::Integer,
    max::Integer
) -> DataStructures.DefaultDict{SubString{String}, DataStructures.Accumulator{String, Int64}, DataStructures.Accumulator{SubString{String}, Int64}}

counter2 is used to make the conditional distribution, which the score functions use to calculate the conditional frequency distribution.

source
TextAnalysis.dtm — Method
dtm(crps::Corpus)
dtm(d::DocumentTermMatrix)
dtm(d::DocumentTermMatrix, density::Symbol)

Creates a simple sparse matrix from a DocumentTermMatrix object.

Examples

julia> crps = Corpus([StringDocument("To be or not to be"),
                      StringDocument("To become or not to become")])

julia> update_lexicon!(crps)

julia> dtm(DocumentTermMatrix(crps))
2×6 SparseArrays.SparseMatrixCSC{Int64,Int64} with 10 stored entries:
  [1, 1]  =  1
  [2, 1]  =  1
  [1, 2]  =  2
  [2, 3]  =  2
  [1, 4]  =  1
  [2, 4]  =  1
  [1, 5]  =  1
  [2, 5]  =  1
  [1, 6]  =  1
  [2, 6]  =  1

julia> dtm(DocumentTermMatrix(crps), :dense)
2×6 Array{Int64,2}:
 1  2  0  1  1  1
 1  0  2  1  1  1
source
TextAnalysis.dtv — Method
dtv(d::AbstractDocument, lex::Dict{String, Int})

Produce a single row of a DocumentTermMatrix.

Since individual documents do not have a lexicon associated with them, we have to pass in a lexicon as an additional argument.

Examples

julia> dtv(crps[1], lexicon(crps))
1×6 Array{Int64,2}:
 1  2  0  1  1  1
source
TextAnalysis.entropy — Method
entropy(
    m::TextAnalysis.Langmodel,
    lm::DataStructures.DefaultDict,
    text_ngram::AbstractVector
) -> Float64

Calculate cross-entropy of model for given evaluation text.

The input text must be a Vector of n-grams of the same length.

source
TextAnalysis.everygram — Method
everygram(seq::Vector{T}; min_len::Int=1, max_len::Int=-1) where {T <: AbstractString}

Return all possible n-grams generated from a sequence of items, as an Array{String,1}.

Example

julia> seq = ["To","be","or","not"]
julia> a = everygram(seq, min_len=1, max_len=-1)
10-element Array{Any,1}:
 "or"
 "not"
 "To"
 "be"
 "or not"
 "be or"
 "To be"
 "be or not"
 "To be or"
 "To be or not"
source
TextAnalysis.extend! — Method
extend!(model::NaiveBayesClassifier, dictElement)

Add the dictElement to dictionary of the Classifier model.

source
TextAnalysis.features — Method
features(
    fs::AbstractDict,
    dict::AbstractVector
) -> Vector{Int64}

Compute an Array that maps each element of dict to its value in the input AbstractDict fs (zero when absent).

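Example (a small sketch of the expected behavior):

julia> features(Dict("a" => 2, "b" => 1), ["a", "c", "b"])
3-element Vector{Int64}:
 2
 0
 1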
source
TextAnalysis.fit! — Method
fit!(model::NaiveBayesClassifier, str, class)
fit!(model::NaiveBayesClassifier, ::Features, class)
fit!(model::NaiveBayesClassifier, ::StringDocument, class)

Fit the weights for the model on the input data.

source
TextAnalysis.fmeasure_lcs — Function
fmeasure_lcs(RLCS, PLCS, β)

Compute the F-measure based on the weighted longest common subsequence (WLCS).

Arguments

  • RLCS - Recall Factor
  • PLCS - Precision Factor
  • β - Parameter weighting recall relative to precision in the F-measure
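A worked sketch, assuming the standard ROUGE F-measure formula F = ((1 + β²) · R · P) / (R + β² · P):

julia> fmeasure_lcs(0.6, 0.75, 1.0)   # ((1 + 1) * 0.6 * 0.75) / (0.6 + 1 * 0.75)
0.6666666666666666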
source
TextAnalysis.frequencies — Method
frequencies(
    xs::AbstractArray{T, 1}
) -> Dict{_A, Int64} where _A

Create a dict that maps elements in input array to their frequencies.

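Example (a sketch; entry order may vary):

julia> frequencies(["To","be","or","not","to","be"])
Dict{String,Int64} with 5 entries:
  "or"  => 1
  "not" => 1
  "to"  => 1
  "To"  => 1
  "be"  => 2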
source
TextAnalysis.frequent_terms — Function
frequent_terms(crps, alpha=0.95)

Find the frequent terms from a Corpus, occurring in more than alpha fraction of the documents.

Example

julia> crps = Corpus([StringDocument("This is Document 1"),
                      StringDocument("This is Document 2")])
A Corpus with 2 documents:
 * 2 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> frequent_terms(crps)
3-element Array{String,1}:
 "is"
 "This"
 "Document"

See also: remove_frequent_terms!, sparse_terms

source
TextAnalysis.get_ngrams — Method
get_ngrams(segment, max_order)

Extracts all n-grams up to a given maximum order from an input segment. Returns a counter containing all n-grams up to max_order in segment, with a count of how many times each n-gram occurred.

Arguments

  • segment: text segment from which n-grams will be extracted.
  • max_order: maximum length in tokens of the n-grams returned by this method.
source
TextAnalysis.hash_dtm — Method
hash_dtm(crps::Corpus)
hash_dtm(crps::Corpus, h::TextHashFunction)

Represents a Corpus as a matrix of hashed term counts with N columns, where N is the cardinality of the hash function.

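Example (a sketch, mirroring the hash_dtv example below):

julia> crps = Corpus([StringDocument("To be or not to be"),
                      StringDocument("To become or not to become")])

julia> h = TextHashFunction(10)
TextHashFunction(hash, 10)

julia> hash_dtm(crps, h)   # a 2×10 matrix of hashed term counts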
source
TextAnalysis.hash_dtv — Method
hash_dtv(d::AbstractDocument)
hash_dtv(d::AbstractDocument, h::TextHashFunction)

Represents a document as a vector with N entries, where N is the cardinality of the hash function (default 100).

Examples

julia> crps = Corpus([StringDocument("To be or not to be"),
                      StringDocument("To become or not to become")])

julia> h = TextHashFunction(10)
TextHashFunction(hash, 10)

julia> hash_dtv(crps[1], h)
1×10 Array{Int64,2}:
 0  2  0  0  1  3  0  0  0  0

julia> hash_dtv(crps[1])
1×100 Array{Int64,2}:
 0  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  0  0
source
TextAnalysis.index_hash — Method
index_hash(str, TextHashFunc)

Shows the mapping of a string to an integer hash index.

Parameters: - str = the string to be hashed - TextHashFunc = TextHashFunction type object

julia> h = TextHashFunction(10)
TextHashFunction(hash, 10)

julia> index_hash("a", h)
8

julia> index_hash("b", h)
7
source
TextAnalysis.inverse_index — Method
inverse_index(crps::Corpus)

Shows the inverse index of a corpus.

If we are interested in a specific term, we often want to know which documents in a corpus contain that term. The inverse index tells us this and therefore provides a simplistic sort of search algorithm.

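Example (a sketch; the index is built with update_inverse_index!):

julia> crps = Corpus([StringDocument("Name Foo"),
                      StringDocument("Name Bar")])

julia> update_inverse_index!(crps)

julia> inverse_index(crps)["Name"]
2-element Array{Int64,1}:
 1
 2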
source
TextAnalysis.language! — Method
language!(doc, lang::Language)

Set the language of doc to lang.

Example

julia> d = StringDocument("String Document 1")

julia> language!(d, Languages.Spanish())

julia> d.metadata.language
Languages.Spanish()

See also: language, languages, languages!

source
TextAnalysis.languages! — Method
languages!(crps, langs::Vector{Language})
languages!(crps, lang::Language)

Update languages of documents in a Corpus.

If the input is a Vector, the language of the ith document is set to the ith element of the vector. In this case, the number of documents must equal the length of the vector.

See also: languages, language!, language

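Example (a sketch):

julia> crps = Corpus([StringDocument("Hello world"),
                      StringDocument("Hola mundo")])

julia> languages!(crps, [Languages.English(), Languages.Spanish()])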
source
TextAnalysis.lda — Method
ϕ, θ = lda(dtm::DocumentTermMatrix, ntopics::Int, iterations::Int, α::Float64, β::Float64; kwargs...)

Perform Latent Dirichlet allocation.

Required Positional Arguments

  • α Dirichlet dist. hyperparameter for topic distribution per document. α<1 yields a sparse topic mixture for each document. α>1 yields a more uniform topic mixture for each document.
  • β Dirichlet dist. hyperparameter for word distribution per topic. β<1 yields a sparse word mixture for each topic. β>1 yields a more uniform word mixture for each topic.

Optional Keyword Arguments

  • showprogress::Bool. Show a progress bar during the Gibbs sampling. Default value: true.

Return Values

  • ϕ: ntopics × nwords Sparse matrix of probabilities s.t. sum(ϕ, 1) == 1
  • θ: ntopics × ndocs Dense matrix of probabilities s.t. sum(θ, 1) == 1
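Example (a sketch; topic count, iteration count, and hyperparameters chosen arbitrarily):

julia> crps = Corpus([StringDocument("This is the Foo Bar document"),
                      StringDocument("This document has lots of Foo words")])

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)

julia> ϕ, θ = lda(m, 2, 1000, 0.1, 0.1);   # 2 topics, 1000 Gibbs iterations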
source
TextAnalysis.lexicon — Method
lexicon(crps::Corpus)

Shows the lexicon of the corpus.

The lexicon of a corpus consists of all the terms that occur in any document in the corpus. It is empty until update_lexicon!(crps) is called, as the example below shows.

Example

julia> crps = Corpus([StringDocument("Name Foo"),
                          StringDocument("Name Bar")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens

julia> lexicon(crps)
Dict{String,Int64} with 0 entries
source
TextAnalysis.logscore — Method
logscore(
    m::TextAnalysis.Langmodel,
    temp_lm::DataStructures.DefaultDict,
    word,
    context
) -> Float64

Evaluate the log score of this word in this context.

The arguments are the same as for score and maskedscore.

source
TextAnalysis.lookup — Method
lookup(
    voc::Vocabulary,
    word::AbstractArray{T<:AbstractString, 1}
) -> Vector

Look up a sequence of words in the vocabulary.

Returns an Array of String.

See Vocabulary

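Example (taken from the Vocabulary example later in this reference):

julia> vocabulary = Vocabulary(["a","c","-","d","c","a","b","r","a","c","d"], 2)

julia> lookup(vocabulary, ["a", "-", "d", "c", "a"])
5-element Array{Any,1}:
 "a"
 "<unk>"
 "d"
 "c"
 "a"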
source
TextAnalysis.lsa — Method
lsa(dtm::DocumentTermMatrix)
lsa(crps::Corpus)

Performs Latent Semantic Analysis or LSA on a corpus.

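Example (a sketch; lsa returns the SVD factorization of the tf-idf weighted matrix):

julia> crps = Corpus([StringDocument("this is a text"),
                      StringDocument("another text here")])

julia> update_lexicon!(crps)

julia> lsa(crps)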
source
TextAnalysis.maskedscore — Method
maskedscore(
    m::TextAnalysis.Langmodel,
    temp_lm::DataStructures.DefaultDict,
    word,
    context
) -> Float64

Evaluate the score, masking out-of-vocabulary words with the unknown label.

The arguments are the same as for score.

source
TextAnalysis.ngramize — Method
ngramize(lang, tokens, n)

Compute the ngrams of tokens of the order n.

Example

julia> ngramize(Languages.English(), ["To", "be", "or", "not", "to"], 3)
Dict{AbstractString,Int64} with 3 entries:
  "be or not" => 1
  "or not to" => 1
  "To be or"  => 1
source
TextAnalysis.ngramizenew — Method
ngramizenew(words::Vector{T}, nlist::Integer...) where {T <: AbstractString}

ngramizenew is used to output n-grams of the given order(s) from a sequence of tokens.

Example

julia> seq=["To","be","or","not","To","not","To","not"]
julia> ngramizenew(seq ,2)
 7-element Array{Any,1}:
  "To be" 
  "be or" 
  "or not"
  "not To"
  "To not"
  "not To"
  "To not"
source
TextAnalysis.ngrams — Method
ngrams(ngd::NGramDocument, n::Integer)
ngrams(d::AbstractDocument, n::Integer)
ngrams(d::NGramDocument)
ngrams(d::AbstractDocument)

Access the document text as n-gram counts.

Example

julia> sd = StringDocument("To be or not to be...")
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: To be or not to be...

julia> ngrams(sd)
 Dict{String,Int64} with 7 entries:
  "or"   => 1
  "not"  => 1
  "to"   => 1
  "To"   => 1
  "be"   => 1
  "be.." => 1
  "."    => 1
source
TextAnalysis.onegramize — Method
onegramize(lang, tokens)

Create the unigrams dict for input tokens.

Example

julia> onegramize(Languages.English(), ["To", "be", "or", "not", "to", "be"])
Dict{String,Int64} with 5 entries:
  "or"  => 1
  "not" => 1
  "to"  => 1
  "To"  => 1
  "be"  => 2
source
TextAnalysis.padding_ngram — Method
padding_ngram(word::Vector{T}, n=1; pad_left=false, pad_right=false, left_pad_symbol="<s>", right_pad_symbol="</s>") where {T <: AbstractString}

padding_ngram is used to pad a sentence on the left and/or right and to output n-grams of order n.

It also pads the original input Array of strings.

Example

julia> example = ["1","2","3","4","5"]

julia> padding_ngram(example,2,pad_left=true,pad_right=true)
 6-element Array{Any,1}:
  "<s> 1" 
  "1 2"   
  "2 3"   
  "3 4"   
  "4 5"   
  "5 </s>"
source
TextAnalysis.perplexity — Method
perplexity(
    m::TextAnalysis.Langmodel,
    lm::DataStructures.DefaultDict,
    text_ngram::AbstractVector
) -> Float64

Calculates the perplexity of the given text.

This is simply 2 to the power of the cross-entropy of the text, so the arguments are the same as for entropy.

source
TextAnalysis.predict — Method
predict(::NaiveBayesClassifier, str)
predict(::NaiveBayesClassifier, ::Features)
predict(::NaiveBayesClassifier, ::StringDocument)

Predict probabilities for each class on the input Features or String.

source
TextAnalysis.prepare! — Method
prepare!(doc, flags)
prepare!(crps, flags)

Preprocess document or corpus based on the input flags.

List of Flags

  • strip_patterns
  • strip_corrupt_utf8
  • strip_case
  • stem_words
  • tag_part_of_speech
  • strip_whitespace
  • strip_punctuation
  • strip_numbers
  • strip_non_letters
  • strip_indefinite_articles
  • strip_definite_articles
  • strip_articles
  • strip_prepositions
  • strip_pronouns
  • strip_stopwords
  • strip_sparse_terms
  • strip_frequent_terms
  • strip_html_tags

Example

julia> doc = StringDocument("This is a document of mine")
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: This is a document of mine
julia> prepare!(doc, strip_pronouns | strip_articles)
julia> text(doc)
"This is   document of "
source
TextAnalysis.prob — Function
prob(
    m::TextAnalysis.Langmodel,
    templ_lm::DataStructures.DefaultDict,
    word
) -> Float64
prob(
    m::TextAnalysis.Langmodel,
    templ_lm::DataStructures.DefaultDict,
    word,
    context
) -> Float64

Get the probability of word given a context.

In other words, for the given context, calculate the conditional frequency distribution of word.

source
TextAnalysis.prune! — Method
prune!(dtm::DocumentTermMatrix{T}, document_positions; compact::Bool=true, retain_terms::Union{Nothing,Vector{T}}=nothing) where {T}

Delete documents specified by document_positions from a document term matrix. Optionally compact the matrix by removing unreferenced terms.

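Example (a sketch):

julia> crps = Corpus([StringDocument("one two three"),
                      StringDocument("two three four")])

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)

julia> prune!(m, [1])   # delete document 1 and compact away terms that no longer occur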
source
TextAnalysis.remove_case! — Method
remove_case!(doc)
remove_case!(crps)

Convert the text of doc or crps to lowercase. Does not support FileDocument or crps containing FileDocument.

Example

julia> str = "The quick brown fox jumps over the lazy dog"
julia> sd = StringDocument(str)
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: The quick brown fox jumps over the lazy dog
julia> remove_case!(sd)
julia> sd.text
"the quick brown fox jumps over the lazy dog"

See also: remove_case

source
TextAnalysis.remove_frequent_terms! — Function
remove_frequent_terms!(crps, alpha=0.95)

Remove terms from crps that occur in more than alpha fraction of the documents.

Example

julia> crps = Corpus([StringDocument("This is Document 1"),
                      StringDocument("This is Document 2")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> remove_frequent_terms!(crps)
julia> text(crps[1])
"     1"
julia> text(crps[2])
"     2"

See also: remove_sparse_terms!, frequent_terms

source
TextAnalysis.remove_html_tags! — Method
remove_html_tags!(doc::StringDocument)
remove_html_tags!(crps)

Remove HTML tags from the StringDocument or the documents in crps. Does not work for document types other than StringDocument.

Example

julia> html_doc = StringDocument(
             "
               <html>
                   <head><script language=\"javascript\">x = 20;</script></head>
                   <body>
                       <h1>Hello</h1><a href=\"world\">world</a>
                   </body>
               </html>
             "
            )
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet:  <html> <head><s
julia> remove_html_tags!(html_doc)
julia> strip(text(html_doc))
"Hello world"

See also: remove_html_tags

source
TextAnalysis.remove_patterns! — Method
remove_patterns!(doc, rex::Regex)
remove_patterns!(crps, rex::Regex)

Remove patterns matched by rex in the document or Corpus. Does not modify FileDocument, or a Corpus containing FileDocument. See also: remove_patterns

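Example (a sketch):

julia> sd = StringDocument("Phone: 123-456-7890")

julia> remove_patterns!(sd, r"[0-9]")

julia> text(sd)
"Phone: --"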
source
TextAnalysis.remove_sparse_terms! — Function
remove_sparse_terms!(crps, alpha=0.05)

Remove sparse terms from crps, occurring in less than alpha fraction of the documents.

Example

julia> crps = Corpus([StringDocument("This is Document 1"),
                      StringDocument("This is Document 2")])
A Corpus with 2 documents:
 * 2 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> remove_sparse_terms!(crps, 0.5)
julia> crps[1].text
"This is Document "
julia> crps[2].text
"This is Document "

See also: remove_frequent_terms!, sparse_terms

source
TextAnalysis.remove_whitespace! — Method
remove_whitespace!(doc)
remove_whitespace!(crps)

Squash multiple whitespaces to a single space and remove all leading and trailing whitespaces in the document or crps. Is a no-op for FileDocument, TokenDocument or NGramDocument. See also: remove_whitespace

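Example (a sketch):

julia> sd = StringDocument("  Hello   world  ")

julia> remove_whitespace!(sd)

julia> text(sd)
"Hello world"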
source
TextAnalysis.remove_words! — Method
remove_words!(doc, words::Vector{AbstractString})
remove_words!(crps, words::Vector{AbstractString})

Remove the occurrences of words from doc or crps.

Example

julia> str="The quick brown fox jumps over the lazy dog"
julia> sd=StringDocument(str);
julia> remove_words = ["fox", "over"]
julia> remove_words!(sd, remove_words)
julia> sd.text
"the quick brown   jumps   the lazy dog"
source
TextAnalysis.rouge_l_sentence — Function
rouge_l_sentence(
    references::Vector{<:AbstractString}, candidate::AbstractString, β=8;
    weighted=false, weight_func=sqrt,
    lang=Languages.English()
)::Vector{Score}

Calculate the ROUGE-L score between references and candidate at sentence level.

Returns a vector of Score

See Rouge: A package for automatic evaluation of summaries

Note: the weighted argument enables weighting of values when calculating the longest common subsequence. The initial implementation, ROUGE-1.5.5.pl, contains a power function; the weight_func here defaults to sqrt, i.e. a power of 0.5.

See also: rouge_n, rouge_l_summary

source
TextAnalysis.rouge_n — Method
rouge_n(
    references::Vector{<:AbstractString}, 
    candidate::AbstractString, 
    n::Int; 
    lang::Language
)::Vector{Score}

Compute n-gram recall between the candidate and the reference summaries.

The function takes the following arguments -

  • references::Vector{T} where T<: AbstractString = The list of reference summaries.
  • candidate::AbstractString = Input candidate summary, to be scored against reference summaries.
  • n::Integer = Order of NGrams
  • lang::Language = Language of the text, useful while generating N-grams. Default value is Languages.English().

Returns a vector of Score

See Rouge: A package for automatic evaluation of summaries

See also: rouge_l_sentence, rouge_l_summary

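Example (a sketch):

julia> candidate = "Brown fox jumped over the dog"

julia> references = ["The quick brown fox jumped over the lazy dog"]

julia> rouge_n(references, candidate, 2)   # bigram recall, returned as a Vector{Score}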
source
TextAnalysis.score — Function
score(m::MLE, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)

score is used to output the probability of word given the context in an MLE model.

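Example (a sketch following the language model workflow; fit is obtained by calling the model on training tokens, here with bigrams):

julia> voc = ["my","name","is","salman","khan","and","he","is","shahrukh","Khan"]

julia> train = ["khan","is","my","good","friend","and","He","is","my","brother"]

julia> model = MLE(voc)

julia> fit = model(train, 2, 2)   # bigram counts

julia> score(model, fit, "is", "khan")   # P("is" | "khan")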
source
TextAnalysis.score — Function
score(m::InterpolatedLanguageModel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)

score is used to output the probability of word given the context in an InterpolatedLanguageModel.

Applies Kneser-Ney or Witten-Bell smoothing depending on the subtype.

source
TextAnalysis.score — Method
score(m::gammamodel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)

score is used to output the probability of word given the context.

Applies add-one (Laplace) or additive (Lidstone) smoothing for gammamodel models.

source
TextAnalysis.sentence_tokenize — Method
sentence_tokenize(language, str)

Split str into sentences.

Example

julia> sentence_tokenize(Languages.English(), "Here are few words! I am Foo Bar.")
2-element Array{SubString{String},1}:
 "Here are few words!"
 "I am Foo Bar."

See also: tokenize

source
TextAnalysis.sparse_terms — Function
sparse_terms(crps, alpha=0.05)

Find the sparse terms from a Corpus, occurring in less than alpha fraction of the documents.

Example

julia> crps = Corpus([StringDocument("This is Document 1"),
                      StringDocument("This is Document 2")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> sparse_terms(crps, 0.5)
2-element Array{String,1}:
 "1"
 "2"

See also: remove_sparse_terms!, frequent_terms

source
TextAnalysis.standardize! — Method
standardize!(crps::Corpus, ::Type{T}) where T <: AbstractDocument

Standardize the documents in a Corpus to a common type.

Example

julia> crps = Corpus([StringDocument("Document 1"),
		              TokenDocument("Document 2"),
		              NGramDocument("Document 3")])
A Corpus with 3 documents:
 * 1 StringDocument's
 * 0 FileDocument's
 * 1 TokenDocument's
 * 1 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens


julia> standardize!(crps, NGramDocument)

# After this step, you can check that the corpus only contains NGramDocument's:

julia> crps
A Corpus with 3 documents:
 * 0 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 3 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
source
TextAnalysis.stem! — Method
stem!(doc)
stem!(crps)

Stems the document or documents in crps with a suitable stemmer.

Stemming cannot be done for FileDocument, or for a Corpus made of such documents.

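Example (a sketch; exact tokenization and casing of the result may differ):

julia> sd = StringDocument("They write, it writes")

julia> stem!(sd)

julia> text(sd)   # inflected forms are reduced, e.g. "They write , it write"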
source
TextAnalysis.stem! — Method
stem!(crps::Corpus)

Stem an entire corpus. Assumes all documents in the corpus have the same language (picked from the first).

source
TextAnalysis.summarize — Method
summarize(doc [, ns])

Summarizes the document and returns the requested number of sentences. It takes two arguments:

  • d : A document of type StringDocument, FileDocument or TokenDocument
  • ns : (Optional) The number of sentences in the summary; defaults to 5.

Example

julia> s = StringDocument("Assume this Short Document as an example. Assume this as an example summarizer. This has too foo sentences.")

julia> summarize(s, ns=2)
2-element Array{SubString{String},1}:
 "Assume this Short Document as an example."
 "This has too foo sentences."
source
TextAnalysis.tag_scheme! — Method
tag_scheme!(tags, current_scheme::String, new_scheme::String)

Convert tags from current_scheme to new_scheme.

List of tagging schemes currently supported:

  • BIO1 (BIO)
  • BIO2
  • BIOES

Example

julia> tags = ["I-LOC", "O", "I-PER", "B-MISC", "I-MISC", "B-PER", "I-PER", "I-PER"]

julia> tag_scheme!(tags, "BIO1", "BIOES")

julia> tags
8-element Array{String,1}:
 "S-LOC"
 "O"
 "S-PER"
 "B-MISC"
 "E-MISC"
 "B-PER"
 "I-PER"
 "E-PER"
source
TextAnalysis.text — Method
text(fd::FileDocument)
text(sd::StringDocument)
text(ngd::NGramDocument)

Access the text of a Document as a string.

Example

julia> sd = StringDocument("To be or not to be...")
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: To be or not to be...

julia> text(sd)
"To be or not to be..."
source
TextAnalysis.tf! — Method
tf!(dtm::SparseMatrixCSC{Real}, tf::SparseMatrixCSC{AbstractFloat})

Overwrite tf with the term frequency of the dtm.

tf should have the same nonzeros as dtm.

See also: tf, tf_idf, tf_idf!

source
TextAnalysis.tf! — Method
tf!(dtm::AbstractMatrix{Real}, tf::AbstractMatrix{AbstractFloat})

Overwrite tf with the term frequency of the dtm.

Works correctly even if dtm and tf are the same matrix.

See also: tf, tf_idf, tf_idf!

source
TextAnalysis.tf — Method
tf(dtm::DocumentTermMatrix)
tf(dtm::SparseMatrixCSC{Real})
tf(dtm::Matrix{Real})

Compute the term-frequency of the input.

Example

julia> crps = Corpus([StringDocument("To be or not to be"),
              StringDocument("To become or not to become")])

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)

julia> tf(m)
2×6 SparseArrays.SparseMatrixCSC{Float64,Int64} with 10 stored entries:
  [1, 1]  =  0.166667
  [2, 1]  =  0.166667
  [1, 2]  =  0.333333
  [2, 3]  =  0.333333
  [1, 4]  =  0.166667
  [2, 4]  =  0.166667
  [1, 5]  =  0.166667
  [2, 5]  =  0.166667
  [1, 6]  =  0.166667
  [2, 6]  =  0.166667

See also: tf!, tf_idf, tf_idf!

source
TextAnalysis.tf_idf! — Method
tf_idf!(dtm::SparseMatrixCSC{Real}, tfidf::SparseMatrixCSC{AbstractFloat})

Overwrite tfidf with the tf-idf (Term Frequency - Inverse Doc Frequency) of the dtm.

The arguments must have the same number of nonzeros.

See also: tf, tf_idf, tf_idf!

source
TextAnalysis.tf_idf! — Method
tf_idf!(dtm::AbstractMatrix{Real}, tf_idf::AbstractMatrix{AbstractFloat})

Overwrite tf_idf with the tf-idf (Term Frequency - Inverse Doc Frequency) of the dtm.

dtm and tf_idf must be matrices of the same dimensions.

See also: tf, tf!, tf_idf

source
TextAnalysis.tf_idf — Method
tf_idf(dtm::DocumentTermMatrix)
tf_idf(dtm::SparseMatrixCSC{Real})
tf_idf(dtm::Matrix{Real})

Compute tf-idf value (Term Frequency - Inverse Document Frequency) for the input.

In many cases, raw word counts are not appropriate for use because:

  • Some documents are longer than other documents
  • Some words are more frequent than other words

A simple workaround is to perform TF-IDF on a DocumentTermMatrix.

Example

julia> crps = Corpus([StringDocument("To be or not to be"),
              StringDocument("To become or not to become")])

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)

julia> tf_idf(m)
2×6 SparseArrays.SparseMatrixCSC{Float64,Int64} with 10 stored entries:
  [1, 1]  =  0.0
  [2, 1]  =  0.0
  [1, 2]  =  0.231049
  [2, 3]  =  0.231049
  [1, 4]  =  0.0
  [2, 4]  =  0.0
  [1, 5]  =  0.0
  [2, 5]  =  0.0
  [1, 6]  =  0.0
  [2, 6]  =  0.0

See also: tf, tf!, tf_idf!

source
TextAnalysis.titles! — Method
titles!(crps, vec::Vector{String})
titles!(crps, str)

Update titles of the documents in a Corpus.

If the input is a String, the same title is set for all documents. If the input is a Vector, the title of the ith document is set to the ith element of vec. In the latter case, the number of documents must equal the length of the vector.

See also: titles, title!, title

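Example (a sketch):

julia> crps = Corpus([StringDocument("Document 1"),
                      StringDocument("Document 2")])

julia> titles!(crps, ["First", "Second"])

julia> titles(crps)
2-element Array{String,1}:
 "First"
 "Second"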
source
TextAnalysis.tokenize — Method
tokenize(language, str)

Split str into words and other tokens such as punctuation.

Example

julia> tokenize(Languages.English(), "Too foo words!")
4-element Array{String,1}:
 "Too"
 "foo"
 "words"
 "!"

See also: sentence_tokenize

source
TextAnalysis.tokens — Method
tokens(d::TokenDocument)
tokens(d::(Union{FileDocument, StringDocument}))

Access the document text as a token array.

Example

julia> sd = StringDocument("To be or not to be...")
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: To be or not to be...

julia> tokens(sd)
7-element Array{String,1}:
    "To"
    "be"
    "or"
    "not"
    "to"
    "be.."
    "."
source
TextAnalysis.weighted_lcs — Function
weighted_lcs(X, Y, weight_score::Bool, returns_string::Bool, weigthing_function::Function)

Compute the Weighted Longest Common Subsequence of X and Y.

source
TextAnalysis.CooMatrix — Type

Basic Co-occurrence Matrix (COOM) type.

Fields

  • coom::SparseMatrixCSC{T,Int} the actual COOM; elements represent co-occurrences of two terms within a given window
  • terms::Vector{String} a list of terms that represent the lexicon of the document or corpus
  • column_indices::OrderedDict{String, Int} a map between the terms and the columns of the co-occurrence matrix

source
TextAnalysis.CooMatrix — Method
CooMatrix{T}(crps::Corpus [,terms] [;window=5, normalize=true])

Auxiliary constructor(s) of the CooMatrix type. The type T has to be a subtype of AbstractFloat. The constructor(s) requires a corpus crps and a terms structure representing the lexicon of the corpus. The latter can be a Vector{String}, an AbstractDict where the keys are the lexicon, or can be omitted, in which case the lexicon field of the corpus is used.

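Example (a sketch; here the lexicon is passed explicitly as a Vector{String}):

julia> crps = Corpus([StringDocument("this is a text about an apple")])

julia> C = CooMatrix{Float32}(crps, ["this", "is", "apple"]; window=3)

julia> coom(C)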
source
TextAnalysis.Corpus — Method
Corpus(docs::Vector{T}) where {T <: AbstractDocument}

Collections of documents are represented using the Corpus type.

Example

julia> crps = Corpus([StringDocument("Document 1"),
		              StringDocument("Document 2")])
A Corpus with 2 documents:
 * 2 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
source
TextAnalysis.DocumentMetadata — Type
DocumentMetadata(
    language::Language,
    title::String,
    author::String,
    timestamp::String,
    custom::Any
)

Stores basic metadata about Document.

...

Arguments

  • language: What language is the document in? Defaults to Languages.English(), a Language instance defined by the Languages package.
  • title::String : What is the title of the document? Defaults to "Untitled Document".
  • author::String : Who wrote the document? Defaults to "Unknown Author".
  • timestamp::String : When was the document written? Defaults to "Unknown Time".
  • custom : user specific data field. Defaults to nothing.

...

source
TextAnalysis.DocumentTermMatrix — Method
DocumentTermMatrix(crps::Corpus)
DocumentTermMatrix(crps::Corpus, terms::Vector{String})
DocumentTermMatrix(crps::Corpus, lex::AbstractDict)
DocumentTermMatrix(dtm::SparseMatrixCSC{Int, Int},terms::Vector{String})

Represent documents as a matrix of word counts.

Allows us to apply linear algebra operations and statistical techniques. The lexicon needs to be updated (see update_lexicon!) before use.

Examples

julia> crps = Corpus([StringDocument("To be or not to be"),
                      StringDocument("To become or not to become")])

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)
A 2 X 6 DocumentTermMatrix

julia> m.dtm
2×6 SparseArrays.SparseMatrixCSC{Int64,Int64} with 10 stored entries:
  [1, 1]  =  1
  [2, 1]  =  1
  [1, 2]  =  2
  [2, 3]  =  2
  [1, 4]  =  1
  [2, 4]  =  1
  [1, 5]  =  1
  [2, 5]  =  1
  [1, 6]  =  1
  [2, 6]  =  1
source
TextAnalysis.FileDocument — Method
FileDocument(pathname::AbstractString)

Represents a document using a plain text file on disk.

Example

julia> pathname = "/usr/share/dict/words"
"/usr/share/dict/words"

julia> fd = FileDocument(pathname)
A FileDocument
 * Language: Languages.English()
 * Title: /usr/share/dict/words
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: A A's AMD AMD's AOL AOL's Aachen Aachen's Aaliyah
source
TextAnalysis.KneserNeyInterpolated — Method
KneserNeyInterpolated(word::Vector{T}, discount::Float64, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Initiate Type for providing KneserNey Interpolated language model.

The idea to abstract this comes from Chen & Goodman 1995.

source
TextAnalysis.Laplace — Type
Laplace(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Function to initiate Type(Laplace) for providing Laplace-smoothed scores.

In addition to the initialization arguments from BaseNgramModel, the counts are increased by a fixed value, gamma = 1.

source
TextAnalysis.Lidstone — Method
Lidstone(word::Vector{T}, gamma::Float64, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Function to initiate Type(Lidstone) for providing Lidstone-smoothed scores.

In addition to the initialization arguments from BaseNgramModel, requires a number gamma by which to increase the counts.

source
TextAnalysis.MLE — Method
MLE(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Initiate Type for providing MLE ngram model scores.

Implementation of Base Ngram Model.

source
TextAnalysis.NGramDocument — Method
NGramDocument(txt::AbstractString, n::Integer=1)
NGramDocument(txt::AbstractString, dm::DocumentMetadata, n::Integer=1)
NGramDocument(ng::Dict{T, Int}, n::Integer=1) where T <: AbstractString

Represents a document as a bag of n-grams: UTF8 n-grams mapped to counts.

Example

julia> my_ngrams = Dict{String, Int}("To" => 1, "be" => 2,
                                     "or" => 1, "not" => 1,
                                     "to" => 1, "be..." => 1)
Dict{String,Int64} with 6 entries:
  "or"    => 1
  "be..." => 1
  "not"   => 1
  "to"    => 1
  "To"    => 1
  "be"    => 2

julia> ngd = NGramDocument(my_ngrams)
A NGramDocument{AbstractString}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: ***SAMPLE TEXT NOT AVAILABLE***
source
TextAnalysis.NaiveBayesClassifier — Method
NaiveBayesClassifier([dict, ]classes)

A Naive Bayes Classifier for classifying documents.

It takes two arguments:

  • classes: An array of possible classes that the concerned data could belong to.
  • dict: (Optional) An Array of possible tokens (words). This is automatically updated if a new token is detected during fitting or prediction.

Example

julia> using TextAnalysis: NaiveBayesClassifier, fit!, predict

julia> m = NaiveBayesClassifier([:spam, :non_spam])
NaiveBayesClassifier{Symbol}(String[], [:spam, :non_spam], Matrix{Int64}(undef, 0, 2))

julia> fit!(m, "this is spam", :spam)
NaiveBayesClassifier{Symbol}(["this", "is", "spam"], [:spam, :non_spam], [2 1; 2 1; 2 1])

julia> fit!(m, "this is not spam", :non_spam)
NaiveBayesClassifier{Symbol}(["this", "is", "spam", "not"], [:spam, :non_spam], [2 2; 2 2; 2 2; 1 2])

julia> predict(m, "is this a spam")
Dict{Symbol, Float64} with 2 entries:
  :spam     => 0.59883
  :non_spam => 0.40117
source
TextAnalysis.Score — Method
Score(
    precision::AbstractFloat,
    recall::AbstractFloat,
    fmeasure::AbstractFloat
) -> Score

Stores the result of an evaluation.

source
TextAnalysis.StringDocument — Method
StringDocument(txt::AbstractString)

Represents a document using a UTF8 String stored in RAM.

Example

julia> str = "To be or not to be..."
"To be or not to be..."

julia> sd = StringDocument(str)
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: To be or not to be...
source
TextAnalysis.TextHashFunction — Method
TextHashFunction(cardinality)
TextHashFunction(hash_function, cardinality)

The need to create a lexicon before we can construct a document term matrix is often prohibitive. We can often employ a trick that has come to be called the Hash Trick, in which we replace terms with their hashed value using a hash function that outputs integers from 1 to N.

Parameters: - cardinality = Max index used for hashing (default 100) - hash_function = function used for the hashing process (a default is provided; see the code base)

julia> h = TextHashFunction(10)
TextHashFunction(hash, 10)
source
TextAnalysis.TokenDocument — Method
TokenDocument(txt::AbstractString)
TokenDocument(txt::AbstractString, dm::DocumentMetadata)
TokenDocument(tkns::Vector{T}) where T <: AbstractString

Represents a document as a sequence of UTF8 tokens.

Example

julia> my_tokens = String["To", "be", "or", "not", "to", "be..."]
6-element Array{String,1}:
    "To"
    "be"
    "or"
    "not"
    "to"
    "be..."

julia> td = TokenDocument(my_tokens)
A TokenDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: ***SAMPLE TEXT NOT AVAILABLE***
source
TextAnalysis.Vocabulary — Type
Vocabulary(word,unk_cutoff =1 ,unk_label = "<unk>")

Stores language model vocabulary. Satisfies two common language modeling requirements for a vocabulary:

  • When checking membership and calculating its size, filters items by comparing their counts to a cutoff value.
  • Adds a special "unknown" token which unseen words are mapped to.

Example

julia> words = ["a", "c", "-", "d", "c", "a", "b", "r", "a", "c", "d"]
julia> vocabulary = Vocabulary(words, 2) 
  Vocabulary(Dict("<unk>"=>1,"c"=>3,"a"=>3,"d"=>2), 2, "<unk>") 

julia> vocabulary.vocab
  Dict{String,Int64} with 4 entries:
   "<unk>" => 1
   "c"     => 3
   "a"     => 3
   "d"     => 2

Tokens with counts greater than or equal to the cutoff value will
be considered part of the vocabulary.
julia> vocabulary.vocab["c"]
 3

julia> "c" in keys(vocabulary.vocab)
 true

julia> vocabulary.vocab["d"]
 2

julia> "d" in keys(vocabulary.vocab)
 true

Tokens with frequency counts less than the cutoff value will be considered not
part of the vocabulary even though their entries in the count dictionary are
preserved.
julia> "b" in keys(vocabulary.vocab)
 false

julia> "<unk>" in keys(vocabulary.vocab)
 true

We can look up words in a vocabulary using its `lookup` method.
"Unseen" words (with counts less than cutoff) are looked up as the unknown label.
If given one word (a string) as an input, this method will return a string.
julia> lookup("a")
 'a'

julia> word = ["a", "-", "d", "c", "a"]

julia> lookup(vocabulary ,word)
 5-element Array{Any,1}:
  "a"    
  "<unk>"
  "d"    
  "c"    
  "a"

If given a sequence, it will return an Array{Any,1} of the looked up words as shown above.
   
It's possible to update the counts after the vocabulary has been created.
julia> update(vocabulary,["b","c","c"])
 1

julia> vocabulary.vocab["b"]
 1
source
TextAnalysis.Vocabulary — Method
Vocabulary(word::Array{T<:AbstractString, 1}) -> Vocabulary
Vocabulary(
    word::Array{T<:AbstractString, 1},
    unk_cutoff
) -> Vocabulary
Vocabulary(
    word::Array{T<:AbstractString, 1},
    unk_cutoff,
    unk_label
) -> Vocabulary
source
TextAnalysis.WittenBellInterpolated — Method
WittenBellInterpolated(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Initiate Type for providing Interpolated version of Witten-Bell smoothing.

The idea to abstract this comes from Chen & Goodman 1995.

source