API References
Base.argmax — Method
Base.merge! — Method
merge!(dtm1::DocumentTermMatrix{T}, dtm2::DocumentTermMatrix{T}) where {T}
Merge one DocumentTermMatrix instance into another. Documents are appended to the end and terms are re-sorted. For efficiency, this may result in modifications to dtm2 as well.
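Example
A minimal usage sketch (the two corpora here are illustrative, not part of the original docstring):
julia> crps1 = Corpus([StringDocument("one two")]); update_lexicon!(crps1);
julia> crps2 = Corpus([StringDocument("two three")]); update_lexicon!(crps2);
julia> m = merge!(DocumentTermMatrix(crps1), DocumentTermMatrix(crps2))
julia> m.dtm  # 2 documents over the merged, re-sorted term list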
TextAnalysis.DirectoryCorpus — Method
TextAnalysis.author! — Method
TextAnalysis.author — Method
TextAnalysis.authors! — Method
TextAnalysis.authors — Method
TextAnalysis.average — Method
TextAnalysis.bleu_score — Method
bleu_score(reference_corpus::Vector{Vector{Token}}, translation_corpus::Vector{Token}; max_order=4, smooth=false)
Compute the BLEU score of translated segments against one or more references.
Return the BLEU score, n-gram precisions, brevity penalty, geometric mean of n-gram precisions, translation_length, and reference_length.
Arguments
- reference_corpus: List of lists of references for each translation. Each reference should be tokenized into a list of tokens.
- translation_corpus: List of translations to score. Each translation should be tokenized into a list of tokens.
- max_order: Maximum n-gram order to use when computing the BLEU score.
- smooth=false: Whether or not to apply Lin et al. 2004 smoothing.
Example:
one_doc_references = [
["apple", "is", "apple"],
["apple", "is", "a", "fruit"]
]
one_doc_translation = [
"apple", "is", "appl"
]
bleu_score([one_doc_references], [one_doc_translation], smooth=true)
TextAnalysis.columnindices — Method
columnindices(terms::Vector{String})
Create a column index lookup dictionary from a vector of terms.
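Example
A hedged sketch of the expected mapping (assuming each term maps to its position in the vector):
julia> columnindices(["apple", "banana", "cherry"])  # expected: Dict("apple" => 1, "banana" => 2, "cherry" => 3)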
TextAnalysis.coo_matrix — Method
coo_matrix(::Type{T}, doc::Vector{AbstractString}, vocab::OrderedDict{AbstractString, Int}, window::Int, normalize::Bool, mode::Symbol)
Basic low-level function that calculates the co-occurrence matrix of a document. Return a sparse co-occurrence matrix sized n × n, where n = length(vocab), with elements of type T. The document doc is represented by a vector of its terms (in order). The keywords window and normalize indicate the size of the sliding word window in which co-occurrences are counted and whether or not to normalize the counts by the distance between word positions. The mode keyword can be either :default or :directional and indicates whether the co-occurrence matrix should be directional or not. If mode is :directional, coom[i,j] will be the number of times vocab[i] co-occurs with vocab[j] in the document doc. If mode is :default, coom[i,j] will be twice that number (once for each direction, from i to j plus from j to i).
Example
julia> using TextAnalysis, DataStructures
doc = StringDocument("This is a text about an apple. There are many texts about apples.")
docv = TextAnalysis.tokenize(language(doc), text(doc))
vocab = OrderedDict("This"=>1, "is"=>2, "apple."=>3)
TextAnalysis.coo_matrix(Float16, docv, vocab, 5, true)
3×3 SparseArrays.SparseMatrixCSC{Float16,Int64} with 4 stored entries:
[2, 1] = 2.0
[1, 2] = 2.0
[3, 2] = 0.3999
[2, 3] = 0.3999
julia> using TextAnalysis, DataStructures
doc = StringDocument("This is a text about an apple. There are many texts about apples.")
docv = TextAnalysis.tokenize(language(doc), text(doc))
vocab = OrderedDict("This"=>1, "is"=>2, "apple."=>3)
TextAnalysis.coo_matrix(Float16, docv, vocab, 5, true, :directional)
3×3 SparseArrays.SparseMatrixCSC{Float16,Int64} with 4 stored entries:
[2, 1] = 1.0
[1, 2] = 1.0
[3, 2] = 0.1999
[2, 3] = 0.1999
TextAnalysis.coom — Method
coom(entity, eltype=Float [;window=5, normalize=true])
Access the co-occurrence matrix of the CooMatrix associated with the entity. The CooMatrix{T} will first be created in order for the actual matrix to be accessed.
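Example
A minimal sketch (assuming the corpus lexicon has been updated so the CooMatrix can derive its terms):
julia> crps = Corpus([StringDocument("this is a text about apples")])
julia> update_lexicon!(crps)
julia> C = CooMatrix{Float64}(crps, window=2)
julia> coom(C)  # sparse matrix of within-window co-occurrence counts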
TextAnalysis.cos_similarity — Method
cos_similarity(tfm::AbstractMatrix)
cos_similarity calculates the cosine similarity from a term frequency matrix (typically the tf-idf matrix).
Example
crps = Corpus( StringDocument.([
"to be or not to be",
"to sing or not to sing",
"to talk or to silence"]) )
update_lexicon!(crps)
d = dtm(crps)
tfm = tf_idf(d)
cs = cos_similarity(tfm)
Matrix(cs)
# 3×3 Matrix{Float64}:
# 1.0 0.0329318 0.0
# 0.0329318 1.0 0.0
# 0.0 0.0 1.0
TextAnalysis.counter2 — Method
counter2(
data,
min::Integer,
max::Integer
) -> DataStructures.DefaultDict{SubString{String}, DataStructures.Accumulator{String, Int64}, DataStructures.Accumulator{SubString{String}, Int64}}
Create a conditional distribution counter, which is used by score functions to calculate conditional frequency distributions.
TextAnalysis.dtm — Method
dtm(crps::Corpus)
dtm(d::DocumentTermMatrix)
dtm(d::DocumentTermMatrix, density::Symbol)
Create a sparse matrix from a DocumentTermMatrix object.
Examples
julia> crps = Corpus([StringDocument("To be or not to be"),
StringDocument("To become or not to become")])
julia> update_lexicon!(crps)
julia> dtm(DocumentTermMatrix(crps))
2×6 SparseArrays.SparseMatrixCSC{Int64,Int64} with 10 stored entries:
[1, 1] = 1
[2, 1] = 1
[1, 2] = 2
[2, 3] = 2
[1, 4] = 1
[2, 4] = 1
[1, 5] = 1
[2, 5] = 1
[1, 6] = 1
[2, 6] = 1
julia> dtm(DocumentTermMatrix(crps), :dense)
2×6 Matrix{Int64}:
1 2 0 1 1 1
1 0 2 1 1 1
TextAnalysis.dtv — Method
dtv(d::AbstractDocument, lex::Dict{String, Int})
Produce a single row of a DocumentTermMatrix.
Individual documents do not have a lexicon associated with them, so a lexicon must be passed as an additional argument.
Examples
julia> dtv(crps[1], lexicon(crps))
1×6 Matrix{Int64}:
1 2 0 1 1 1
TextAnalysis.entropy — Method
entropy(
m::TextAnalysis.Langmodel,
lm::DataStructures.DefaultDict,
text_ngram::AbstractVector
) -> Float64
Calculate the cross-entropy of the model for a given evaluation text.
Input text must be a Vector of n-grams of the same length.
TextAnalysis.everygram — Method
everygram(seq::Vector{T}; min_len::Int=1, max_len::Int=-1) where {T <: AbstractString}
Return all possible n-grams generated from a sequence of items, as a Vector{String}.
Example
julia> seq = ["To","be","or","not"]
julia> a = everygram(seq, min_len=1, max_len=-1)
10-element Vector{Any}:
"or"
"not"
"To"
"be"
"or not"
"be or"
"be or not"
"To be or"
"To be or not"sourceTextAnalysis.extend! — Method
extend!(model::NaiveBayesClassifier, dictElement)
Add the dictElement to the dictionary of the classifier model.
TextAnalysis.features — Method
features(
fs::AbstractDict,
dict::AbstractVector
) -> Vector{Int64}
Compute an array of the values corresponding to each element of dict, looked up in the input AbstractDict fs.
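Example
A hedged sketch (assuming elements of dict that are missing from fs map to 0):
julia> features(Dict("a" => 2, "b" => 1), ["a", "c"])  # expected: [2, 0]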
TextAnalysis.fit! — Method
fit!(model::NaiveBayesClassifier, str, class)
fit!(model::NaiveBayesClassifier, ::Features, class)
fit!(model::NaiveBayesClassifier, ::StringDocument, class)
Fit the weights for the model on the input data.
TextAnalysis.fmeasure_lcs — Function
fmeasure_lcs(RLCS, PLCS, β=1.0)
Compute the F-measure based on WLCS.
Arguments
- RLCS: Recall factor for LCS computation
- PLCS: Precision factor for LCS computation
- β: Beta parameter controlling precision vs recall balance (default: 1.0)
Returns
Real: F-measure score balancing precision and recall
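Example
A hedged worked example, assuming the standard LCS-based F-measure F_β = ((1 + β²)·R·P) / (R + β²·P):
julia> fmeasure_lcs(0.5, 0.5, 1.0)  # expected: ((1 + 1) * 0.5 * 0.5) / (0.5 + 1 * 0.5) = 0.5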
TextAnalysis.frequencies — Method
frequencies(
xs::AbstractArray{T, 1}
) -> Dict{_A, Int64} where _A
Create a dictionary that maps elements in input array to their frequencies.
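Example
A minimal sketch:
julia> frequencies(["a", "b", "b"])  # expected: Dict("b" => 2, "a" => 1)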
TextAnalysis.frequent_terms — Function
frequent_terms(crps, alpha=0.95)
Return the frequent terms from crps, i.e. terms occurring in more than a fraction alpha of the documents.
Example
julia> crps = Corpus([StringDocument("This is Document 1"),
StringDocument("This is Document 2")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> frequent_terms(crps)
3-element Vector{String}:
"is"
"This"
"Document"See also: remove_frequent_terms!, sparse_terms
TextAnalysis.get_ngrams — Method
get_ngrams(segment, max_order)
Extract all n-grams up to a given maximum order from an input segment.
Return a counter containing all n-grams up to max_order in the segment with a count of how many times each n-gram occurred.
Arguments
- segment: Text segment from which n-grams will be extracted.
- max_order: Maximum length in tokens of the n-grams returned by this method.
TextAnalysis.hash_dtm — Method
hash_dtm(crps::Corpus)
hash_dtm(crps::Corpus, h::TextHashFunction)
Represent a Corpus as a Matrix with N entries.
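Example
A minimal sketch (hash values depend on the hash function, so the counts are not shown):
julia> crps = Corpus([StringDocument("To be or not to be"),
StringDocument("To become or not to become")])
julia> hash_dtm(crps, TextHashFunction(10))  # 2×10 Matrix{Int64} of hashed term counts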
TextAnalysis.hash_dtv — Method
hash_dtv(d::AbstractDocument)
hash_dtv(d::AbstractDocument, h::TextHashFunction)
Represent a document as a vector with N entries.
Examples
julia> crps = Corpus([StringDocument("To be or not to be"),
StringDocument("To become or not to become")])
julia> h = TextHashFunction(10)
TextHashFunction(hash, 10)
julia> hash_dtv(crps[1], h)
1×10 Matrix{Int64}:
0 2 0 0 1 3 0 0 0 0
julia> hash_dtv(crps[1])
1×100 Matrix{Int64}:
0 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0
TextAnalysis.index_hash — Method
index_hash(str, TextHashFunc)
Map a string to an integer index using the hash trick.
Arguments
- str: String to be hashed
- TextHashFunc: TextHashFunction object containing hash configuration
Examples
julia> h = TextHashFunction(10)
TextHashFunction(hash, 10)
julia> index_hash("a", h)
8
julia> index_hash("b", h)
7
TextAnalysis.inverse_index — Method
inverse_index(crps::Corpus)
Return the inverse index of a corpus.
If we are interested in a specific term, we often want to know which documents in a corpus contain that term. The inverse index provides this information and enables a simplistic search algorithm.
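Example
A hedged sketch (assuming the index has been built with update_inverse_index!):
julia> crps = Corpus([StringDocument("Name Foo"), StringDocument("Name Bar")])
julia> update_inverse_index!(crps)
julia> inverse_index(crps)["Name"]  # expected: [1, 2], since both documents contain "Name"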
TextAnalysis.language! — Method
language!(doc, lang::Language)
Set the language of doc to lang.
Example
julia> d = StringDocument("String Document 1")
julia> language!(d, Languages.Spanish())
julia> d.metadata.language
Languages.Spanish()
See also: language, languages, languages!
TextAnalysis.language — Method
TextAnalysis.languages! — Method
languages!(crps, langs::Vector{Language})
languages!(crps, lang::Language)
Update languages of documents in a Corpus.
If the input is a Vector, the language of the ith document is set to the ith element of the vector; the number of documents must equal the length of the vector. If the input is a single Language, it is applied to every document.
See also: languages, language!, language
TextAnalysis.languages — Method
languages(crps)
Return the languages for each document in crps.
See also: languages!, language, language!
TextAnalysis.lda — Method
ϕ, θ = lda(dtm::DocumentTermMatrix, ntopics::Int, iterations::Int, α::Float64, β::Float64; kwargs...)
Perform Latent Dirichlet allocation.
Arguments
- dtm::DocumentTermMatrix: Document-term matrix containing the corpus
- ntopics::Int: Number of topics to extract
- iterations::Int: Number of Gibbs sampling iterations
- α::Float64: Dirichlet distribution hyperparameter for the topic distribution per document. α < 1 yields a sparse topic mixture, α > 1 yields a more uniform topic mixture
- β::Float64: Dirichlet distribution hyperparameter for the word distribution per topic. β < 1 yields a sparse word mixture, β > 1 yields a more uniform word mixture
Keyword Arguments
- showprogress::Bool: Show a progress bar during Gibbs sampling (default: true)
Returns
- ϕ: ntopics × nwords sparse matrix of word probabilities per topic
- θ: ntopics × ndocs dense matrix of topic probabilities per document
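Example
A minimal usage sketch (the corpus and hyperparameter values are illustrative):
julia> crps = Corpus([StringDocument("This is the Foo Bar document"),
StringDocument("This document talks about bar baz")])
julia> update_lexicon!(crps)
julia> ϕ, θ = lda(DocumentTermMatrix(crps), 2, 1000, 0.1, 0.1);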
TextAnalysis.lexical_frequency — Method
lexical_frequency(crps::Corpus, term::AbstractString)
Return how often term occurs across all of the documents in crps.
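Example
A hedged sketch (assuming the result is the term's share of all tokens; the corpus is illustrative):
julia> crps = Corpus([StringDocument("Name Foo"), StringDocument("Name Bar")])
julia> update_lexicon!(crps)
julia> lexical_frequency(crps, "Name")  # expected: 0.5, since "Name" is 2 of the 4 tokens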
TextAnalysis.lexicon — Method
lexicon(crps::Corpus)
Return the lexicon of the corpus.
The lexicon of a corpus consists of all terms that occur in any document in the corpus.
Example
julia> crps = Corpus([StringDocument("Name Foo"),
StringDocument("Name Bar")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> lexicon(crps)
Dict{String,Int64} with 0 entries
TextAnalysis.lexicon_size — Method
TextAnalysis.logscore — Method
logscore(
m::TextAnalysis.Langmodel,
temp_lm::DataStructures.DefaultDict,
word,
context
) -> Float64
Evaluate the log score of a word in a given context.
The arguments are the same as for score and maskedscore.
TextAnalysis.lookup — Method
lookup(
voc::Vocabulary,
word::AbstractArray{T<:AbstractString, 1}
) -> Vector
Look up a sequence of words in the vocabulary.
Return a vector of strings.
See Vocabulary
TextAnalysis.lsa — Method
lsa(dtm::DocumentTermMatrix)
lsa(crps::Corpus)
Perform Latent Semantic Analysis (LSA) on a corpus or document-term matrix.
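Example
A minimal sketch (assuming lsa returns an SVD factorization of the weighted document-term matrix):
julia> crps = Corpus([StringDocument("this is a text"), StringDocument("this is another text")])
julia> update_lexicon!(crps)
julia> lsa(crps)  # SVD factorization; singular vectors span the latent semantic space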
TextAnalysis.maskedscore — Method
TextAnalysis.ngramize — Method
ngramize(lang, tokens, n)
Compute the n-grams of tokens of order n.
Example
julia> ngramize(Languages.English(), ["To", "be", "or", "not", "to"], 3)
Dict{AbstractString,Int64} with 3 entries:
"be or not" => 1
"or not to" => 1
"To be or" => 1sourceTextAnalysis.ngramizenew — Method
ngramizenew(words::Vector{T}, nlist::Integer...) where {T <: AbstractString}
Generate n-grams from a sequence of words.
Example
julia> seq=["To","be","or","not","To","not","To","not"]
julia> ngramizenew(seq, 2)
7-element Vector{Any}:
"To be"
"be or"
"or not"
"not To"
"To not"
"not To"
"To not"sourceTextAnalysis.ngrams — Method
ngrams(ngd::NGramDocument, n::Integer)
ngrams(d::AbstractDocument, n::Integer)
ngrams(d::NGramDocument)
ngrams(d::AbstractDocument)
Access the document text as n-gram counts.
Example
julia> sd = StringDocument("To be or not to be...")
A StringDocument{String}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: To be or not to be...
julia> ngrams(sd)
Dict{String,Int64} with 7 entries:
"or" => 1
"not" => 1
"to" => 1
"To" => 1
"be" => 1
"be.." => 1
"." => 1sourceTextAnalysis.onegramize — Method
onegramize(lang, tokens)
Create the unigrams dictionary for input tokens.
Example
julia> onegramize(Languages.English(), ["To", "be", "or", "not", "to", "be"])
Dict{String,Int64} with 5 entries:
"or" => 1
"not" => 1
"to" => 1
"To" => 1
"be" => 2sourceTextAnalysis.padding_ngram — Method
padding_ngram(word::Vector{T}, n=1; pad_left=false, pad_right=false, left_pad_symbol="<s>", right_pad_symbol="</s>") where {T <: AbstractString}
Pad the left and/or right side of a sentence (controlled by pad_left and pad_right) and output n-grams of order n.
This function also pads the original input vector of strings.
Example
julia> example = ["1","2","3","4","5"]
julia> padding_ngram(example,2,pad_left=true,pad_right=true)
6-element Vector{Any}:
"<s> 1"
"1 2"
"2 3"
"3 4"
"4 5"
"5 </s>"sourceTextAnalysis.pagerank — Method
pagerank(A; n_iter=20, damping=0.15)
Compute PageRank scores for nodes in a graph using the power iteration method.
Arguments
- A: Adjacency matrix representing the graph
- n_iter: Number of iterations for convergence (default: 20)
- damping: Damping factor for PageRank algorithm (default: 0.15)
Returns
Matrix{Float64}: PageRank scores for each node
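Example
A hedged sketch on a hypothetical 3-node graph (the adjacency matrix is illustrative):
julia> A = [0 1 1; 1 0 0; 1 0 0]  # node 1 links to nodes 2 and 3, which link back
julia> pagerank(A)  # PageRank scores; node 1 should score highest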
TextAnalysis.perplexity — Method
TextAnalysis.predict — Method
predict(::NaiveBayesClassifier, str)
predict(::NaiveBayesClassifier, ::Features)
predict(::NaiveBayesClassifier, ::StringDocument)
Predict probabilities for each class on the input Features or String.
TextAnalysis.prepare! — Method
prepare!(doc, flags)
prepare!(crps, flags)
Preprocess document or corpus based on the input flags.
List of Flags
- strip_patterns
- strip_corrupt_utf8
- strip_case
- stem_words
- tag_part_of_speech
- strip_whitespace
- strip_punctuation
- strip_numbers
- strip_non_letters
- strip_indefinite_articles
- strip_definite_articles
- strip_articles
- strip_prepositions
- strip_pronouns
- strip_stopwords
- strip_sparse_terms
- strip_frequent_terms
- strip_html_tags
Example
julia> doc = StringDocument("This is a document of mine")
A StringDocument{String}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: This is a document of mine
julia> prepare!(doc, strip_pronouns | strip_articles)
julia> text(doc)
"This is document of "sourceTextAnalysis.prob — Function
prob(
m::TextAnalysis.Langmodel,
templ_lm::DataStructures.DefaultDict,
word
) -> Float64
prob(
m::TextAnalysis.Langmodel,
templ_lm::DataStructures.DefaultDict,
word,
context
) -> Float64
Get the probability of a word given its context.
In other words, for a given context, calculate the frequency distribution of words.
TextAnalysis.prune! — Method
prune!(dtm::DocumentTermMatrix{T}, document_positions; compact::Bool=true, retain_terms::Union{Nothing,Vector{T}}=nothing) where {T}
Delete documents specified by document_positions from a document term matrix. Optionally compact the matrix by removing unreferenced terms.
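Example
A minimal sketch (the corpus is illustrative):
julia> crps = Corpus([StringDocument("one two"), StringDocument("two three")])
julia> update_lexicon!(crps)
julia> m = DocumentTermMatrix(crps)
julia> prune!(m, [1])  # drop the first document and compact away terms that no longer occur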
TextAnalysis.remove_case! — Method
remove_case!(doc)
remove_case!(crps)
Convert the text of doc or crps to lowercase. Does not support FileDocument or crps containing FileDocument.
Example
julia> str = "The quick brown fox jumps over the lazy dog"
julia> sd = StringDocument(str)
A StringDocument{String}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: The quick brown fox jumps over the lazy dog
julia> remove_case!(sd)
julia> sd.text
"the quick brown fox jumps over the lazy dog"See also: remove_case
TextAnalysis.remove_case — Method
TextAnalysis.remove_corrupt_utf8! — Method
remove_corrupt_utf8!(doc)
remove_corrupt_utf8!(crps)
Remove corrupt UTF8 characters from doc or the documents in crps. Does not support FileDocument or Corpus containing FileDocument.
See also: remove_corrupt_utf8
TextAnalysis.remove_corrupt_utf8 — Method
TextAnalysis.remove_frequent_terms! — Function
remove_frequent_terms!(crps, alpha=0.95)
Remove frequent terms from crps, i.e. terms occurring in more than a fraction alpha of the documents.
Example
julia> crps = Corpus([StringDocument("This is Document 1"),
StringDocument("This is Document 2")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> remove_frequent_terms!(crps)
julia> text(crps[1])
" 1"
julia> text(crps[2])
" 2"See also: remove_sparse_terms!, frequent_terms
TextAnalysis.remove_html_tags! — Method
remove_html_tags!(doc::StringDocument)
remove_html_tags!(crps)
Remove html tags from the StringDocument or documents crps. Does not work for documents other than StringDocument.
Example
julia> html_doc = StringDocument(
"
<html>
<head><script language=\"javascript\">x = 20;</script></head>
<body>
<h1>Hello</h1><a href=\"world\">world</a>
</body>
</html>
"
)
A StringDocument{String}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: <html> <head><s
julia> remove_html_tags!(html_doc)
julia> strip(text(html_doc))
"Hello world"See also: remove_html_tags
TextAnalysis.remove_html_tags — Method
remove_html_tags(str)
Remove html tags from str, including the style and script tags.
See also: remove_html_tags!
TextAnalysis.remove_patterns! — Method
remove_patterns!(doc, rex::Regex)
remove_patterns!(crps, rex::Regex)
Remove patterns matched by rex in document or Corpus. Does not modify FileDocument or Corpus containing FileDocument.
See also: remove_patterns
TextAnalysis.remove_patterns — Method
remove_patterns(str, rex::Regex)
Remove the part of str matched by rex.
See also: remove_patterns!
TextAnalysis.remove_sparse_terms! — Function
remove_sparse_terms!(crps, alpha=0.05)
Remove sparse terms from crps, i.e. terms occurring in less than a fraction alpha of the documents.
Example
julia> crps = Corpus([StringDocument("This is Document 1"),
StringDocument("This is Document 2")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> remove_sparse_terms!(crps, 0.5)
julia> crps[1].text
"This is Document "
julia> crps[2].text
"This is Document "See also: remove_frequent_terms!, sparse_terms
TextAnalysis.remove_whitespace! — Method
remove_whitespace!(doc)
remove_whitespace!(crps)
Collapse runs of whitespace into a single space and remove all leading and trailing whitespace in a document or corpus. Is a no-op for FileDocument, TokenDocument or NGramDocument.
See also: remove_whitespace
TextAnalysis.remove_whitespace — Method
remove_whitespace(str)
Collapse runs of whitespace into a single space and remove all leading and trailing whitespace.
See also: remove_whitespace!
TextAnalysis.remove_words! — Method
remove_words!(doc, words::Vector{AbstractString})
remove_words!(crps, words::Vector{AbstractString})
Remove the occurrences of words from doc or crps.
Example
julia> str="The quick brown fox jumps over the lazy dog"
julia> sd=StringDocument(str);
julia> remove_words = ["fox", "over"]
julia> remove_words!(sd, remove_words)
julia> sd.text
"the quick brown jumps the lazy dog"sourceTextAnalysis.rouge_l_sentence — Function
rouge_l_sentence(
references::Vector{<:AbstractString}, candidate::AbstractString, β=8;
weighted=false, weight_func=sqrt,
lang=Languages.English()
)::Vector{Score}
Calculate the ROUGE-L score between references and candidate at the sentence level.
Return a vector of Score objects.
See Rouge: A package for automatic evaluation of summaries
The weighted argument enables weighting of values when calculating the longest common subsequence. The initial implementation (ROUGE-1.5.5.pl) uses a power function; here weight_func defaults to sqrt, i.e. a power of 0.5.
See also: rouge_n, rouge_l_summary
TextAnalysis.rouge_l_summary — Method
rouge_l_summary(
references::Vector{<:AbstractString}, candidate::AbstractString, β::Int;
lang=Languages.English()
)::Vector{Score}
Calculate the ROUGE-L score between references and candidate at the summary level.
Return a vector of Score objects.
See Rouge: A package for automatic evaluation of summaries
See also: rouge_l_sentence, rouge_n
TextAnalysis.rouge_n — Method
rouge_n(
references::Vector{<:AbstractString},
candidate::AbstractString,
n::Int;
lang::Language
)::Vector{Score}
Compute n-gram recall between candidate and the reference summaries.
Arguments
- references::Vector{T} where T <: AbstractString: List of reference summaries
- candidate::AbstractString: Input candidate summary to be scored against the reference summaries
- n::Integer: Order of n-grams
- lang::Language: Language of the text, useful while generating n-grams (default: Languages.English())
Return a vector of Score objects.
See Rouge: A package for automatic evaluation of summaries
See also: rouge_l_sentence, rouge_l_summary
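Example
A minimal usage sketch (the summaries are illustrative):
julia> reference_summaries = ["Brazil, Russia, India and China are growing nations."]
julia> candidate_summary = "Brazil, Russia, India and China are the next big powers."
julia> rouge_n(reference_summaries, candidate_summary, 2)  # Vector{Score} of bigram overlap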
TextAnalysis.score — Function
score(m::MLE, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)
Compute the probability of a word given its context using MLE (Maximum Likelihood Estimation).
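Example
A minimal usage sketch (the vocabulary and training tokens are illustrative; the fitted counts come from calling the model object on the training tokens):
julia> voc = ["my", "name", "is", "salman", "khan", "and", "he", "is", "shahrukh", "Khan"]
julia> train = ["khan", "is", "my", "good", "friend", "and", "He", "is", "my", "brother"]
julia> model = MLE(voc)
julia> fit = model(train, 2, 2)  # fit bigram counts
julia> score(model, fit, "is", "my")  # P("is" | "my") under the fitted model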
TextAnalysis.score — Function
score(m::InterpolatedLanguageModel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)
Compute the probability of a word given its context in an interpolated language model.
Applies Kneser-Ney and Witten-Bell smoothing depending on the sub-type.
TextAnalysis.score — Method
score(m::gammamodel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)
Compute the probability of a word given its context using add-one smoothing.
Applies add-one smoothing to Lidstone or Laplace (gammamodel) models.
TextAnalysis.sentence_tokenize — Method
sentence_tokenize(lang, s)
Split string into individual sentences.
Arguments
- lang: Language for sentence boundary detection rules
- s: String to split into sentences
Returns
Vector{SubString{String}}: Array of sentences extracted from the string
Example
julia> sentence_tokenize(Languages.English(), "Here are few words! I am Foo Bar.")
2-element Vector{SubString{String}}:
"Here are few words!"
"I am Foo Bar."See also: tokenize
TextAnalysis.sparse_terms — Function
sparse_terms(crps, alpha=0.05)
Return the sparse terms from crps, i.e. terms occurring in less than a fraction alpha of the documents.
Example
julia> crps = Corpus([StringDocument("This is Document 1"),
StringDocument("This is Document 2")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> sparse_terms(crps, 0.5)
2-element Vector{String}:
"1"
"2"See also: remove_sparse_terms!, frequent_terms
TextAnalysis.standardize! — Method
standardize!(crps::Corpus, ::Type{T}) where T <: AbstractDocument
Standardize the documents in a Corpus to a common type.
Example
julia> crps = Corpus([StringDocument("Document 1"),
TokenDocument("Document 2"),
NGramDocument("Document 3")])
A Corpus with 3 documents:
* 1 StringDocument's
* 0 FileDocument's
* 1 TokenDocument's
* 1 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> standardize!(crps, NGramDocument)
# After this step, you can check that the corpus only contains NGramDocument's:
julia> crps
A Corpus with 3 documents:
* 0 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 3 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
TextAnalysis.stem! — Method
stem!(doc)
stem!(crps)
Apply stemming to the document or documents in crps using an appropriate stemmer.
Does not support FileDocument or Corpus containing FileDocument.
Arguments
- doc: Document to apply stemming to
- crps: Corpus containing documents to apply stemming to
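Example
A minimal sketch (the exact stem depends on the Snowball stemmer for the document's language):
julia> sd = StringDocument("jumping")
julia> stem!(sd)
julia> text(sd)  # expected: "jump"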
TextAnalysis.stem! — Method
stem!(crps::Corpus)
Apply stemming to an entire corpus. Assumes all documents in the corpus have the same language (determined from the first document).
Arguments
crps: Corpus containing documents to apply stemming to
TextAnalysis.stemmer_for_document — Method
stemmer_for_document(d)
Return an appropriate stemmer based on the language of the document.
Arguments
d: Document for which to select stemmer
TextAnalysis.summarize — Method
summarize(doc; ns=5)
Generate a summary of the document and return the top ns sentences.
Arguments
- doc: Document of type StringDocument, FileDocument, or TokenDocument
- ns: Number of sentences in the summary (default: 5)
Returns
Vector{SubString{String}}: Array of the most relevant sentences
Example
julia> s = StringDocument("Assume this Short Document as an example. Assume this as an example summarizer. This has too foo sentences.")
julia> summarize(s, ns=2)
2-element Vector{SubString{String}}:
"Assume this Short Document as an example."
"This has too foo sentences."sourceTextAnalysis.tag_scheme! — Method
tag_scheme!(tags, current_scheme::String, new_scheme::String)
Convert tags from one tagging scheme to another in-place.
Arguments
- tags: Vector of tags to convert
- current_scheme: Name of the current tagging scheme
- new_scheme: Name of the target tagging scheme
Supported Schemes
- BIO1 (BIO)
- BIO2
- BIOES
Example
julia> tags = ["I-LOC", "O", "I-PER", "B-MISC", "I-MISC", "B-PER", "I-PER", "I-PER"]
julia> tag_scheme!(tags, "BIO1", "BIOES")
julia> tags
8-element Vector{String}:
"S-LOC"
"O"
"S-PER"
"B-MISC"
"E-MISC"
"B-PER"
"I-PER"
"E-PER"sourceTextAnalysis.text — Method
text(fd::FileDocument)
text(sd::StringDocument)
text(ngd::NGramDocument)
Access the text of Document as a string.
Example
julia> sd = StringDocument("To be or not to be...")
A StringDocument{String}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: To be or not to be...
julia> text(sd)
"To be or not to be..."sourceTextAnalysis.tf! — Method
tf!(dtm::SparseMatrixCSC{Real}, tf::SparseMatrixCSC{AbstractFloat})
Compute term frequency for sparse matrices and store result in tf.
Arguments
- dtm: Sparse document-term matrix containing term counts
- tf: Output sparse matrix for term frequency values (modified in-place)
Notes
The tf matrix should have the same nonzero pattern as dtm.
TextAnalysis.tf! — Method
tf!(dtm::AbstractMatrix{Real}, tf::AbstractMatrix{AbstractFloat})
Compute term frequency and store result in tf matrix.
Arguments
- dtm: Document-term matrix containing term counts
- tf: Output matrix for term frequency values (modified in-place)
Notes
Works correctly when dtm and tf are the same matrix.
TextAnalysis.tf — Method
tf(dtm::DocumentTermMatrix)
tf(dtm::SparseMatrixCSC{Real})
tf(dtm::Matrix{Real})
Compute term frequency for the document-term matrix.
Arguments
dtm: Document-term matrix (DocumentTermMatrix, sparse matrix, or dense matrix)
Returns
Matrix{Float64} or SparseMatrixCSC{Float64}: Term frequency matrix
Example
julia> crps = Corpus([StringDocument("To be or not to be"),
StringDocument("To become or not to become")])
julia> update_lexicon!(crps)
julia> m = DocumentTermMatrix(crps)
julia> tf(m)
2×6 SparseArrays.SparseMatrixCSC{Float64,Int64} with 10 stored entries:
[1, 1] = 0.166667
[2, 1] = 0.166667
[1, 2] = 0.333333
[2, 3] = 0.333333
[1, 4] = 0.166667
[2, 4] = 0.166667
[1, 5] = 0.166667
[2, 5] = 0.166667
[1, 6] = 0.166667
[2, 6] = 0.166667
See also: tf!, tf_idf, tf_idf!
TextAnalysis.tf_idf! — Method
tf_idf!(dtm)
Compute TF-IDF values for document-term matrix in-place.
Arguments
dtm: Document-term matrix to transform (modified in-place)
TextAnalysis.tf_idf! — Method
TextAnalysis.tf_idf! — Method
tf_idf!(dtm::AbstractMatrix{Real}, tf_idf::AbstractMatrix{AbstractFloat})
Compute TF-IDF (Term Frequency-Inverse Document Frequency) and store result in tf_idf matrix.
Arguments
- dtm: Document-term matrix containing term counts
- tf_idf: Output matrix for TF-IDF values (modified in-place)
Notes
The matrices dtm and tf_idf must have the same dimensions.
TextAnalysis.tf_idf — Method
tf_idf(dtm::DocumentTermMatrix)
tf_idf(dtm::SparseMatrixCSC{Real})
tf_idf(dtm::Matrix{Real})
Compute TF-IDF (Term Frequency-Inverse Document Frequency) values for the document-term matrix.
Arguments
dtm: Document-term matrix (DocumentTermMatrix, sparse matrix, or dense matrix)
Returns
Matrix{Float64} or SparseMatrixCSC{Float64}: TF-IDF weighted matrix
Notes
TF-IDF addresses issues with raw word counts:
- Some documents are longer than other documents
- Some words are more frequent than other words
Example
julia> crps = Corpus([StringDocument("To be or not to be"),
StringDocument("To become or not to become")])
julia> update_lexicon!(crps)
julia> m = DocumentTermMatrix(crps)
julia> tf_idf(m)
2×6 SparseArrays.SparseMatrixCSC{Float64,Int64} with 10 stored entries:
[1, 1] = 0.0
[2, 1] = 0.0
[1, 2] = 0.231049
[2, 3] = 0.231049
[1, 4] = 0.0
[2, 4] = 0.0
[1, 5] = 0.0
[2, 5] = 0.0
[1, 6] = 0.0
[2, 6] = 0.0
See also: tf, tf!, tf_idf!
TextAnalysis.timestamp! — Method
timestamp!(doc, timestamp::AbstractString)
Set the timestamp metadata of doc to timestamp.
See also: timestamp, timestamps, timestamps!
TextAnalysis.timestamp — Method
timestamp(doc)
Return the timestamp metadata for doc.
See also: timestamp!, timestamps, timestamps!
TextAnalysis.timestamps! — Method
timestamps!(crps, times::Vector{String})
timestamps!(crps, time::AbstractString)
Set the timestamps of the documents in crps to the timestamps in times, respectively.
See also: timestamps, timestamp!, timestamp
TextAnalysis.timestamps — Method
timestamps(crps)
Return the timestamps for each document in crps.
See also: timestamps!, timestamp, timestamp!
TextAnalysis.title! — Method
TextAnalysis.title — Method
TextAnalysis.titles! — Method
titles!(crps, vec::Vector{String})
titles!(crps, str)
Update titles of the documents in a Corpus.
If the input is a String, set the same title for all documents. If the input is a vector, set the title of the ith document to the corresponding ith element in the vector vec. In the latter case, the number of documents must equal the length of the vector.
See also: titles, title!, title
TextAnalysis.titles — Method
TextAnalysis.tokenize — Method
tokenize(lang, s)
Split string into words and other tokens such as punctuation.
Arguments
- lang: Language for tokenization rules
- s: String to tokenize
Returns
Vector{String}: Array of tokens extracted from the string
Example
julia> tokenize(Languages.English(), "Too foo words!")
4-element Vector{String}:
"Too"
"foo"
"words"
"!"See also: sentence_tokenize
TextAnalysis.tokens — Method
tokens(d::TokenDocument)
tokens(d::Union{FileDocument, StringDocument})
Access the document text as a token array.
Example
julia> sd = StringDocument("To be or not to be...")
A StringDocument{String}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: To be or not to be...
julia> tokens(sd)
7-element Vector{String}:
"To"
"be"
"or"
"not"
"to"
"be.."
"."sourceTextAnalysis.update — Method
TextAnalysis.weighted_lcs — Function
weighted_lcs(X, Y, weighted=true, f=sqrt)
Compute the Weighted Longest Common Subsequence of X and Y.
Arguments
- X: First sequence
- Y: Second sequence
- weighted: Whether to use weighted computation (default: true)
- f: Weighting function (default: sqrt)
Returns
Float32: Length of the weighted longest common subsequence
TextAnalysis.weighted_lcs_tokens — Function
weighted_lcs_tokens(X, Y, weighted=true, f=sqrt)
Compute the tokens of the Weighted Longest Common Subsequence of X and Y.
Arguments
- X: First sequence
- Y: Second sequence
- weighted: Whether to use weighted computation (default: true)
- f: Weighting function (default: sqrt)
Returns
Vector{String}: Array of tokens in the longest common subsequence
TextAnalysis.CooMatrix — Type
Basic Co-occurrence Matrix (COOM) type.
Fields
- coom::SparseMatrixCSC{T,Int}: The actual COOM; elements represent co-occurrences of two terms within a given window.
- terms::Vector{String}: A list of terms that represent the lexicon of the document or corpus.
- column_indices::OrderedDict{String, Int}: A map between the terms and the columns of the co-occurrence matrix.
TextAnalysis.CooMatrix — Method
CooMatrix{T}(crps::Corpus [,terms] [;window=5, normalize=true])
Auxiliary constructors of the CooMatrix type. The type T must be a subtype of AbstractFloat.
The constructors require a corpus crps and a terms structure representing the lexicon of the corpus. The latter can be a Vector{String}, an AbstractDict where the keys are the lexicon, or can be omitted, in which case the lexicon field of the corpus is used.
TextAnalysis.Corpus — Method
Corpus(docs::Vector{T}) where {T <: AbstractDocument}
Collections of documents are represented using the Corpus type.
Example
julia> crps = Corpus([StringDocument("Document 1"),
StringDocument("Document 2")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
TextAnalysis.DocumentMetadata — Type
DocumentMetadata(
language::Language,
title::String,
author::String,
timestamp::String,
custom::Any
)
Store basic metadata about a document.
Arguments
- language: Language of the document (default: Languages.English())
- title: Title of the document (default: "Untitled Document")
- author: Author of the document (default: "Unknown Author")
- timestamp: Timestamp when the document was written (default: "Unknown Time")
- custom: User-specific data field (default: nothing)
TextAnalysis.DocumentTermMatrix — Method
DocumentTermMatrix(crps::Corpus)
DocumentTermMatrix(crps::Corpus, terms::Vector{String})
DocumentTermMatrix(crps::Corpus, lex::AbstractDict)
DocumentTermMatrix(dtm::SparseMatrixCSC{Int, Int}, terms::Vector{String})
Represent documents as a matrix of word counts.
This representation allows linear algebra operations and statistical techniques to be applied. The lexicon must be updated before use.
Examples
julia> crps = Corpus([StringDocument("To be or not to be"),
StringDocument("To become or not to become")])
julia> update_lexicon!(crps)
julia> m = DocumentTermMatrix(crps)
A 2 X 6 DocumentTermMatrix
julia> m.dtm
2×6 SparseArrays.SparseMatrixCSC{Int64,Int64} with 10 stored entries:
[1, 1] = 1
[2, 1] = 1
[1, 2] = 2
[2, 3] = 2
[1, 4] = 1
[2, 4] = 1
[1, 5] = 1
[2, 5] = 1
[1, 6] = 1
[2, 6] = 1
TextAnalysis.FileDocument — Method
FileDocument(pathname::AbstractString)
Represent a document using a plain text file on disk.
Example
julia> pathname = "/usr/share/dict/words"
"/usr/share/dict/words"
julia> fd = FileDocument(pathname)
A FileDocument
* Language: Languages.English()
* Title: /usr/share/dict/words
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: A A's AMD AMD's AOL AOL's Aachen Aachen's Aaliyah
TextAnalysis.KneserNeyInterpolated — Method
KneserNeyInterpolated(word::Vector{T}, discount::Float64, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}
Initialize a type for providing a Kneser-Ney interpolated language model.
The idea to abstract this comes from Chen & Goodman 1995.
TextAnalysis.Laplace — Type
Laplace(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}
Initialize a Laplace type for providing Laplace-smoothed scores.
In addition to initialization arguments from the base n-gram model, this uses a smoothing parameter gamma = 1.
TextAnalysis.Lidstone — Method
Lidstone(word::Vector{T}, gamma::Float64, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}
Initialize a Lidstone type for providing Lidstone-smoothed scores.
In addition to initialization arguments from the base n-gram model, this also requires a number by which to increase the counts (gamma).
TextAnalysis.MLE — Method
MLE(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}
Initialize a type for providing MLE n-gram model scores.
Implementation of the base n-gram model using Maximum Likelihood Estimation.
TextAnalysis.NGramDocument — Method
NGramDocument(txt::AbstractString, n::Integer=1)
NGramDocument(txt::AbstractString, dm::DocumentMetadata, n::Integer=1)
NGramDocument(ng::Dict{T, Int}, n::Integer=1) where T <: AbstractString
Represent a document as a bag of n-grams, which are UTF8 n-grams that map to counts.
Example
julia> my_ngrams = Dict{String, Int}("To" => 1, "be" => 2,
"or" => 1, "not" => 1,
"to" => 1, "be..." => 1)
Dict{String,Int64} with 6 entries:
"or" => 1
"be..." => 1
"not" => 1
"to" => 1
"To" => 1
"be" => 2
julia> ngd = NGramDocument(my_ngrams)
A NGramDocument{AbstractString}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: ***SAMPLE TEXT NOT AVAILABLE***
TextAnalysis.NaiveBayesClassifier — Method
NaiveBayesClassifier([dict, ]classes)
A Naive Bayes Classifier for classifying documents.
Arguments
- classes: Array of possible classes that the data could belong to
- dict: (Optional) Array of possible tokens (words). This is automatically updated if a new token is detected during training or prediction
Example
julia> using TextAnalysis: NaiveBayesClassifier, fit!, predict
julia> m = NaiveBayesClassifier([:spam, :non_spam])
NaiveBayesClassifier{Symbol}(String[], [:spam, :non_spam], Matrix{Int64}(undef, 0, 2))
julia> fit!(m, "this is spam", :spam)
NaiveBayesClassifier{Symbol}(["this", "is", "spam"], [:spam, :non_spam], [2 1; 2 1; 2 1])
julia> fit!(m, "this is not spam", :non_spam)
NaiveBayesClassifier{Symbol}(["this", "is", "spam", "not"], [:spam, :non_spam], [2 2; 2 2; 2 2; 1 2])
julia> predict(m, "is this a spam")
Dict{Symbol, Float64} with 2 entries:
:spam => 0.59883
:non_spam => 0.40117
TextAnalysis.Score — Type
TextAnalysis.Score — Method
Score(
precision::AbstractFloat,
recall::AbstractFloat,
fmeasure::AbstractFloat
) -> Score
Store the result of an evaluation.
TextAnalysis.Score — Method
Score(; precision, recall, fmeasure) -> Score
TextAnalysis.StringDocument — Method
StringDocument(txt::AbstractString)
Represent a document using a UTF8 String stored in RAM.
Example
julia> str = "To be or not to be..."
"To be or not to be..."
julia> sd = StringDocument(str)
A StringDocument{String}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: To be or not to be...
TextAnalysis.TextHashFunction — Method
TextHashFunction(cardinality)
TextHashFunction(hash_function, cardinality)
The need to create a lexicon before constructing a document term matrix is often prohibitive. This implementation employs the "Hash Trick" technique, which replaces terms with their hashed values using a hash function that outputs integers from 1 to N.
Arguments
- cardinality: Maximum index used for hashing (default: 100)
- hash_function: Function used for hashing process (default: built-in hash function)
Examples
julia> h = TextHashFunction(10)
TextHashFunction(hash, 10)
TextAnalysis.TokenDocument — Method
TokenDocument(txt::AbstractString)
TokenDocument(txt::AbstractString, dm::DocumentMetadata)
TokenDocument(tkns::Vector{T}) where T <: AbstractString
Represent a document as a sequence of UTF8 tokens.
Example
julia> my_tokens = String["To", "be", "or", "not", "to", "be..."]
6-element Vector{String}:
"To"
"be"
"or"
"not"
"to"
"be..."
julia> td = TokenDocument(my_tokens)
A TokenDocument{String}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: ***SAMPLE TEXT NOT AVAILABLE***
TextAnalysis.Vocabulary — Type
Vocabulary(word, unk_cutoff=1, unk_label="<unk>")
Store language model vocabulary.
Satisfies two common language modeling requirements for a vocabulary:
- When checking membership and calculating its size, filters items by comparing their counts to a cutoff value.
- Adds a special "unknown" token which unseen words are mapped to.
Example
julia> words = ["a", "c", "-", "d", "c", "a", "b", "r", "a", "c", "d"]
julia> vocabulary = Vocabulary(words, 2)
Vocabulary(Dict("<unk>"=>1,"c"=>3,"a"=>3,"d"=>2), 2, "<unk>")
julia> vocabulary.vocab
Dict{String,Int64} with 4 entries:
"<unk>" => 1
"c" => 3
"a" => 3
"d" => 2
Tokens with counts greater than or equal to the cutoff value will
be considered part of the vocabulary.
julia> vocabulary.vocab["c"]
3
julia> "c" in keys(vocabulary.vocab)
true
julia> vocabulary.vocab["d"]
2
julia> "d" in keys(vocabulary.vocab)
true
Tokens with frequency counts less than the cutoff value will be considered not
part of the vocabulary even though their entries in the count dictionary are
preserved.
julia> "b" in keys(vocabulary.vocab)
false
julia> "<unk>" in keys(vocabulary.vocab)
true
We can look up words in a vocabulary using its `lookup` method.
"Unseen" words (with counts less than cutoff) are looked up as the unknown label.
If given one word (a string) as an input, this method will return a string.
julia> lookup("a")
'a'
julia> word = ["a", "-", "d", "c", "a"]
julia> lookup(vocabulary, word)
5-element Vector{Any}:
"a"
"<unk>"
"d"
"c"
"a"
If given a sequence, it will return a `Vector{Any}` of the looked up words as shown above.
It's possible to update the counts after the vocabulary has been created.
julia> update(vocabulary,["b","c","c"])
1
julia> vocabulary.vocab["b"]
1
TextAnalysis.Vocabulary — Method
Vocabulary(word::Array{T<:AbstractString, 1}) -> Vocabulary
Vocabulary(
word::Array{T<:AbstractString, 1},
unk_cutoff
) -> Vocabulary
Vocabulary(
word::Array{T<:AbstractString, 1},
unk_cutoff,
unk_label
) -> Vocabulary
TextAnalysis.WittenBellInterpolated — Method
WittenBellInterpolated(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}
Initialize a type for providing an interpolated version of Witten-Bell smoothing.
The idea to abstract this comes from Chen & Goodman 1995.