API References
Base.argmax — Method
argmax(scores::Vector{Score})::Score
- scores - vector of Score
Returns the Score with the maximum precision field.
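For instance, with two illustrative scores:
julia> scores = [Score(0.4, 0.6, 0.5), Score(0.6, 0.8, 0.7)]
julia> argmax(scores)  # returns the second Score, which has the higher precision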
Base.merge! — Method
merge!(dtm1::DocumentTermMatrix{T}, dtm2::DocumentTermMatrix{T}) where {T}
Merge one DocumentTermMatrix instance into another. Documents are appended to the end. Terms are re-sorted. For efficiency, this may result in modifications to dtm2 as well.
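A minimal sketch of merging matrices built from two corpora:
julia> crps1 = Corpus([StringDocument("to be or not")]); update_lexicon!(crps1)
julia> crps2 = Corpus([StringDocument("to become or not")]); update_lexicon!(crps2)
julia> dtm1 = DocumentTermMatrix(crps1); dtm2 = DocumentTermMatrix(crps2)
julia> merge!(dtm1, dtm2)  # dtm1 now covers both documents over the merged, re-sorted term list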
TextAnalysis.DirectoryCorpus — Method
DirectoryCorpus(dirname::AbstractString)
Construct a Corpus from a directory of text files.
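For instance (the directory path is hypothetical):
julia> crps = DirectoryCorpus("/path/to/texts")  # builds one FileDocument per text file found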
TextAnalysis.author! — Method
TextAnalysis.author — Method
TextAnalysis.authors! — Method
authors!(crps, athrs)
authors!(crps, athr)
Set the authors of the documents in crps to the athrs, respectively.
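A short illustration with made-up author names:
julia> crps = Corpus([StringDocument("Doc 1"), StringDocument("Doc 2")])
julia> authors!(crps, ["Alice", "Bob"])  # hypothetical names, one per document
julia> authors(crps)  # ["Alice", "Bob"]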
TextAnalysis.authors — Method
TextAnalysis.average — Method
average(scores::Vector{Score})::Score
- scores - vector of Score
Returns the average of scores as a single Score with mean precision/recall/fmeasure.
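For instance, with the same illustrative scores as above:
julia> scores = [Score(0.4, 0.6, 0.5), Score(0.6, 0.8, 0.7)]
julia> average(scores)  # Score with precision 0.5, recall 0.7, fmeasure 0.6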
TextAnalysis.bleu_score — Method
bleu_score(reference_corpus::Vector{Vector{Token}}, translation_corpus::Vector{Token}; max_order=4, smooth=false)
Computes the BLEU score of translated segments against one or more references. Returns the BLEU score, n-gram precisions, brevity penalty, geometric mean of n-gram precisions, translation length and reference length.
Arguments
- reference_corpus: list of lists of references for each translation. Each reference should be tokenized into a list of tokens.
- translation_corpus: list of translations to score. Each translation should be tokenized into a list of tokens.
- max_order: maximum n-gram order to use when computing the BLEU score.
- smooth=false: whether or not to apply Lin et al. 2004 smoothing.
Example:
one_doc_references = [
["apple", "is", "apple"],
["apple", "is", "a", "fruit"]
]
one_doc_translation = [
"apple", "is", "appl"
]
bleu_score([one_doc_references], [one_doc_translation], smooth=true)
TextAnalysis.columnindices — Method
columnindices(terms::Vector{String})
Creates a column index lookup dictionary from a vector of terms.
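A minimal sketch, assuming terms map to their positions in the input vector:
julia> columnindices(["apple", "banana", "cherry"])  # Dict("apple"=>1, "banana"=>2, "cherry"=>3)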
TextAnalysis.coo_matrix — Method
coo_matrix(::Type{T}, doc::Vector{AbstractString}, vocab::OrderedDict{AbstractString, Int}, window::Int, normalize::Bool, mode::Symbol)
Basic low-level function that calculates the co-occurrence matrix of a document. Returns a sparse co-occurrence matrix sized n × n, where n = length(vocab), with elements of type T. The document doc is represented by a vector of its terms (in order). The keywords window and normalize indicate the size of the sliding word window in which co-occurrences are counted and whether or not to normalize the counts by the distance between word positions. The mode keyword can be either :default or :directional and indicates whether the co-occurrence matrix should be directional or not. If mode is :directional, coom[i,j] is the number of times vocab[i] co-occurs with vocab[j] in the document doc. If mode is :default, coom[i,j] is twice that number, counting once for each direction (from i to j and from j to i).
Example
julia> using TextAnalysis, DataStructures
doc = StringDocument("This is a text about an apple. There are many texts about apples.")
docv = TextAnalysis.tokenize(language(doc), text(doc))
vocab = OrderedDict("This"=>1, "is"=>2, "apple."=>3)
TextAnalysis.coo_matrix(Float16, docv, vocab, 5, true)
3×3 SparseArrays.SparseMatrixCSC{Float16,Int64} with 4 stored entries:
[2, 1] = 2.0
[1, 2] = 2.0
[3, 2] = 0.3999
[2, 3] = 0.3999
julia> using TextAnalysis, DataStructures
doc = StringDocument("This is a text about an apple. There are many texts about apples.")
docv = TextAnalysis.tokenize(language(doc), text(doc))
vocab = OrderedDict("This"=>1, "is"=>2, "apple."=>3)
TextAnalysis.coo_matrix(Float16, docv, vocab, 5, true, :directional)
3×3 SparseArrays.SparseMatrixCSC{Float16,Int64} with 4 stored entries:
[2, 1] = 1.0
[1, 2] = 1.0
[3, 2] = 0.1999
[2, 3] = 0.1999
TextAnalysis.coom — Method
coom(c::CooMatrix)
Access the co-occurrence matrix field coom of a CooMatrix c.
TextAnalysis.coom — Method
coom(entity, eltype=DEFAULT_FLOAT_TYPE [;window=5, normalize=true])
Access the co-occurrence matrix of the CooMatrix associated with the entity. The CooMatrix{T} will first have to be created in order for the actual matrix to be accessed.
TextAnalysis.cos_similarity — Method
cos_similarity(tfm::AbstractMatrix)
cos_similarity calculates the cosine similarity from a term frequency matrix (typically the tf-idf matrix).
Example
crps = Corpus( StringDocument.([
"to be or not to be",
"to sing or not to sing",
"to talk or to silence"]) )
update_lexicon!(crps)
d = dtm(crps)
tfm = tf_idf(d)
cs = cos_similarity(tfm)
Matrix(cs)
# 3×3 Array{Float64,2}:
# 1.0 0.0329318 0.0
# 0.0329318 1.0 0.0
# 0.0        0.0        1.0
TextAnalysis.counter2 — Method
counter2(
    data,
    min::Integer,
    max::Integer
) -> DataStructures.DefaultDict{SubString{String}, DataStructures.Accumulator{String, Int64}, DataStructures.Accumulator{SubString{String}, Int64}}
counter2 builds the conditional frequency distribution used by the score functions.
TextAnalysis.dtm — Method
dtm(crps::Corpus)
dtm(d::DocumentTermMatrix)
dtm(d::DocumentTermMatrix, density::Symbol)
Creates a simple sparse matrix from a DocumentTermMatrix object.
Examples
julia> crps = Corpus([StringDocument("To be or not to be"),
StringDocument("To become or not to become")])
julia> update_lexicon!(crps)
julia> dtm(DocumentTermMatrix(crps))
2×6 SparseArrays.SparseMatrixCSC{Int64,Int64} with 10 stored entries:
[1, 1] = 1
[2, 1] = 1
[1, 2] = 2
[2, 3] = 2
[1, 4] = 1
[2, 4] = 1
[1, 5] = 1
[2, 5] = 1
[1, 6] = 1
[2, 6] = 1
julia> dtm(DocumentTermMatrix(crps), :dense)
2×6 Array{Int64,2}:
1 2 0 1 1 1
1  0  2  1  1  1
TextAnalysis.dtv — Method
dtv(d::AbstractDocument, lex::Dict{String, Int})
Produce a single row of a DocumentTermMatrix.
Individual documents do not have a lexicon associated with them, so we have to pass in a lexicon as an additional argument.
Examples
julia> dtv(crps[1], lexicon(crps))
1×6 Array{Int64,2}:
1  2  0  1  1  1
TextAnalysis.entropy — Method
entropy(
m::TextAnalysis.Langmodel,
lm::DataStructures.DefaultDict,
text_ngram::AbstractVector
) -> Float64
Calculate the cross-entropy of the model for the given evaluation text.
The input text must be a Vector of n-grams, all of the same length.
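A minimal sketch using an MLE bigram model (the vocabulary and training tokens are made up; see MLE and everygram elsewhere in this reference):
julia> voc   = ["my", "name", "is", "salman", "khan", "and", "nothing"]
julia> train = ["khan", "is", "my", "good", "friend", "and", "He", "is", "my", "brother"]
julia> model = MLE(voc)
julia> fit   = model(train, 2, 2)  # fit a bigram language model
julia> entropy(model, fit, everygram(train, min_len=2, max_len=2))  # cross-entropy of the bigrams under the model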
TextAnalysis.everygram — Method
everygram(seq::Vector{T}; min_len::Int=1, max_len::Int=-1) where {T <: AbstractString}
Return all possible ngrams generated from a sequence of items, as an Array{String,1}.
Example
julia> seq = ["To","be","or","not"]
julia> a = everygram(seq,min_len=1, max_len=-1)
10-element Array{Any,1}:
"or"
"not"
"To"
"be"
"or not"
"be or"
"be or not"
"To be or"
"To be or not"TextAnalysis.extend! — Methodextend!(model::NaiveBayesClassifier, dictElement)Add the dictElement to dictionary of the Classifier model.
TextAnalysis.features — Method
features(
    fs::AbstractDict,
    dict::AbstractVector
) -> Vector{Int64}
Compute an Array with one entry per element of dict, holding that element's value in the input AbstractDict.
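A minimal sketch, assuming elements absent from the dict count as zero:
julia> features(Dict("a" => 2, "b" => 1), ["a", "c"])
2-element Array{Int64,1}:
 2
 0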
TextAnalysis.fit! — Methodfit!(model::NaiveBayesClassifier, str, class)
fit!(model::NaiveBayesClassifier, ::Features, class)
fit!(model::NaiveBayesClassifier, ::StringDocument, class)Fit the weights for the model on the input data.
TextAnalysis.fmeasure_lcs — Function
fmeasure_lcs(RLCS, PLCS, β)
Compute the F-measure based on WLCS.
Arguments
- RLCS - Recall Factor
- PLCS - Precision Factor
- β - Weighting parameter
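An illustrative call with made-up recall/precision factors, assuming the standard F-β combination:
julia> fmeasure_lcs(0.5, 0.6, 1.0)  # ≈ 0.545, the harmonic mean of recall 0.5 and precision 0.6 when β = 1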
TextAnalysis.frequencies — Method
frequencies(
xs::AbstractArray{T, 1}
) -> Dict{_A, Int64} where _A
Create a dict that maps elements in input array to their frequencies.
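For instance:
julia> frequencies(["To", "be", "or", "not", "to", "be"])  # Dict("be"=>2, "To"=>1, "or"=>1, "not"=>1, "to"=>1)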
TextAnalysis.frequent_terms — Function
frequent_terms(crps, alpha=0.95)
Find the frequent terms from a Corpus, occurring in more than alpha percent of the documents.
Example
julia> crps = Corpus([StringDocument("This is Document 1"),
StringDocument("This is Document 2")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> frequent_terms(crps)
3-element Array{String,1}:
"is"
"This"
"Document"See also: remove_frequent_terms!, sparse_terms
TextAnalysis.get_ngrams — Methodget_ngrams(segment, max_order)Extracts all n-grams upto a given maximum order from an input segment. Returns the counter containing all n-grams upto max_order in segment with a count of how many times each n-gram occurred.
Arguments
segment: text segment from which n-grams will be extracted.max_order: maximum length in tokens of the n-grams returned by this methods.
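A minimal sketch, assuming the segment is a vector of tokens:
julia> get_ngrams(["the", "cat", "sat"], 2)  # counts of every unigram and bigram in the segment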
TextAnalysis.hash_dtm — Method
hash_dtm(crps::Corpus)
hash_dtm(crps::Corpus, h::TextHashFunction)
Represents a Corpus as a Matrix with N columns per document, where N is the cardinality of the hash function.
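A minimal sketch in which each document is hashed into 10 buckets:
julia> crps = Corpus([StringDocument("To be or not to be"),
                      StringDocument("To become or not to become")])
julia> hash_dtm(crps, TextHashFunction(10))  # 2×10 matrix of hashed term counts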
TextAnalysis.hash_dtv — Method
hash_dtv(d::AbstractDocument)
hash_dtv(d::AbstractDocument, h::TextHashFunction)
Represents a document as a vector with N entries.
Examples
julia> crps = Corpus([StringDocument("To be or not to be"),
StringDocument("To become or not to become")])
julia> h = TextHashFunction(10)
TextHashFunction(hash, 10)
julia> hash_dtv(crps[1], h)
1×10 Array{Int64,2}:
0 2 0 0 1 3 0 0 0 0
julia> hash_dtv(crps[1])
1×100 Array{Int64,2}:
0  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  0  0
TextAnalysis.index_hash — Method
index_hash(str, TextHashFunc)
Shows the mapping of a string to an integer.
Parameters:
- str = the string to be hashed
- TextHashFunc = TextHashFunction type object
julia> h = TextHashFunction(10)
TextHashFunction(hash, 10)
julia> index_hash("a", h)
8
julia> index_hash("b", h)
7
TextAnalysis.inverse_index — Method
inverse_index(crps::Corpus)
Shows the inverse index of a corpus.
If we are interested in a specific term, we often want to know which documents in a corpus contain that term. The inverse index tells us this and therefore provides a simplistic sort of search algorithm.
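A minimal sketch; the inverse index must be built first:
julia> crps = Corpus([StringDocument("Name Foo"), StringDocument("Name Bar")])
julia> update_inverse_index!(crps)
julia> inverse_index(crps)["Name"]  # indices of the documents containing "Name": [1, 2]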
TextAnalysis.language! — Method
language!(doc, lang::Language)
Set the language of doc to lang.
Example
julia> d = StringDocument("String Document 1")
julia> language!(d, Languages.Spanish())
julia> d.metadata.language
Languages.Spanish()
See also: language, languages, languages!
TextAnalysis.language — Method
TextAnalysis.languages! — Method
languages!(crps, langs::Vector{Language})
languages!(crps, lang::Language)
Update the languages of documents in a Corpus.
If the input is a Vector, the language of the ith document is set to the ith element of the vector; the number of documents must equal the length of the vector.
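For instance, setting one language for every document:
julia> crps = Corpus([StringDocument("Doc 1"), StringDocument("Doc 2")])
julia> languages!(crps, Languages.German())  # same language for all documents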
TextAnalysis.languages — Methodlanguages(crps)Return the languages for each document in crps.
See also: languages!, language, language!
TextAnalysis.lda — Method
ϕ, θ = lda(dtm::DocumentTermMatrix, ntopics::Int, iterations::Int, α::Float64, β::Float64; kwargs...)
Perform Latent Dirichlet allocation.
Required Positional Arguments
- α: Dirichlet dist. hyperparameter for the topic distribution per document. α < 1 yields a sparse topic mixture for each document; α > 1 yields a more uniform topic mixture.
- β: Dirichlet dist. hyperparameter for the word distribution per topic. β < 1 yields a sparse word mixture for each topic; β > 1 yields a more uniform word mixture.
Optional Keyword Arguments
- showprogress::Bool: show a progress bar during the Gibbs sampling. Default value: true.
Return Values
- ϕ: ntopics × nwords sparse matrix of probabilities s.t. sum(ϕ, 1) == 1
- θ: ntopics × ndocs dense matrix of probabilities s.t. sum(θ, 1) == 1
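A minimal sketch (the topic count, iteration count and hyperparameters are illustrative):
julia> crps = Corpus([StringDocument("apple banana apple"),
                      StringDocument("car train car")])
julia> update_lexicon!(crps)
julia> m = DocumentTermMatrix(crps)
julia> ϕ, θ = lda(m, 2, 1000, 0.1, 0.1)  # 2 topics, 1000 Gibbs iterations, α = β = 0.1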
TextAnalysis.lexical_frequency — Method
lexical_frequency(crps::Corpus, term::AbstractString)
Tells us how often a term occurs across all of the documents.
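A minimal sketch, assuming the lexicon has been updated first:
julia> crps = Corpus([StringDocument("Name Foo"), StringDocument("Name Bar")])
julia> update_lexicon!(crps)
julia> lexical_frequency(crps, "Name")  # how often "Name" occurs across the documents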
TextAnalysis.lexicon — Method
lexicon(crps::Corpus)
Shows the lexicon of the corpus.
The lexicon of a corpus consists of all the terms that occur in any document in the corpus.
Example
julia> crps = Corpus([StringDocument("Name Foo"),
StringDocument("Name Bar")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> lexicon(crps)
Dict{String,Int64} with 0 entries
TextAnalysis.lexicon_size — Method
lexicon_size(crps::Corpus)
Tells the total number of terms in a lexicon.
TextAnalysis.logscore — Method
logscore(
m::TextAnalysis.Langmodel,
temp_lm::DataStructures.DefaultDict,
word,
context
) -> Float64
Evaluate the log score of this word in this context.
The arguments are the same as for score and maskedscore.
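Continuing the MLE sketch shown under entropy:
julia> logscore(model, fit, "my", "is")  # log probability of "my" given the context "is"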
TextAnalysis.lookup — Method
lookup(
    voc::Vocabulary,
    word::AbstractArray{T<:AbstractString, 1}
) -> Vector
Look up a sequence of words in the vocabulary.
Returns an Array of String.
See Vocabulary
TextAnalysis.lsa — Method
lsa(dtm::DocumentTermMatrix)
lsa(crps::Corpus)
Performs Latent Semantic Analysis (LSA) on a corpus.
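A minimal sketch; the return value is assumed to be an SVD factorization of the tf-idf matrix:
julia> crps = Corpus([StringDocument("this is a text"),
                      StringDocument("this is another text")])
julia> update_lexicon!(crps)
julia> F = lsa(crps)  # F.U, F.S, F.V expose the latent structure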
TextAnalysis.maskedscore — Method
maskedscore(
m::TextAnalysis.Langmodel,
temp_lm::DataStructures.DefaultDict,
word,
context
) -> Float64
Evaluates the score while masking out-of-vocabulary words with the unknown label.
The arguments are the same as for score.
TextAnalysis.ngramize — Method
ngramize(lang, tokens, n)
Compute the ngrams of tokens of the order n.
Example
julia> ngramize(Languages.English(), ["To", "be", "or", "not", "to"], 3)
Dict{AbstractString,Int64} with 3 entries:
"be or not" => 1
"or not to" => 1
"To be or" => 1TextAnalysis.ngramizenew — Methodngramizenew( words::Vector{T}, nlist::Integer...) where { T <: AbstractString}ngramizenew is used to out putting ngrmas in set
Example
julia> seq=["To","be","or","not","To","not","To","not"]
julia> ngramizenew(seq ,2)
7-element Array{Any,1}:
"To be"
"be or"
"or not"
"not To"
"To not"
"not To"
"To not"TextAnalysis.ngrams — Methodngrams(ngd::NGramDocument, n::Integer)
ngrams(d::AbstractDocument, n::Integer)
ngrams(d::NGramDocument)
ngrams(d::AbstractDocument)Access the document text as n-gram counts.
Example
julia> sd = StringDocument("To be or not to be...")
A StringDocument{String}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: To be or not to be...
julia> ngrams(sd)
Dict{String,Int64} with 7 entries:
"or" => 1
"not" => 1
"to" => 1
"To" => 1
"be" => 1
"be.." => 1
"." => 1TextAnalysis.onegramize — Methodonegramize(lang, tokens)Create the unigrams dict for input tokens.
Example
julia> onegramize(Languages.English(), ["To", "be", "or", "not", "to", "be"])
Dict{String,Int64} with 5 entries:
"or" => 1
"not" => 1
"to" => 1
"To" => 1
"be" => 2TextAnalysis.padding_ngram — Methodpadding_ngram(word::Vector{T}, n=1; pad_left=false, pad_right=false, left_pad_symbol="<s>", right_pad_symbol ="</s>") where { T <: AbstractString}padding _ngram is used to pad both left and right of sentence and out putting ngrmas of order n
It also pad the original input Array of string
Example
julia> example = ["1","2","3","4","5"]
julia> padding_ngram(example,2,pad_left=true,pad_right=true)
6-element Array{Any,1}:
"<s> 1"
"1 2"
"2 3"
"3 4"
"4 5"
"5 </s>"TextAnalysis.perplexity — Methodperplexity(
m::TextAnalysis.Langmodel,
lm::DataStructures.DefaultDict,
text_ngram::AbstractVector
) -> Float64
Calculates the perplexity of the given text.
This is simply 2 raised to the cross-entropy of the text, so the arguments are the same as for entropy.
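Continuing the MLE sketch shown under entropy:
julia> perplexity(model, fit, everygram(train, min_len=2, max_len=2))  # equals 2^entropy(...)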
TextAnalysis.predict — Method
predict(::NaiveBayesClassifier, str)
predict(::NaiveBayesClassifier, ::Features)
predict(::NaiveBayesClassifier, ::StringDocument)
Predict probabilities for each class on the input Features or String.
TextAnalysis.prepare! — Method
prepare!(doc, flags)
prepare!(crps, flags)
Preprocess a document or corpus based on the input flags.
List of Flags
- strip_patterns
- strip_corrupt_utf8
- strip_case
- stem_words
- tag_part_of_speech
- strip_whitespace
- strip_punctuation
- strip_numbers
- strip_non_letters
- strip_indefinite_articles
- strip_definite_articles
- strip_articles
- strip_prepositions
- strip_pronouns
- strip_stopwords
- strip_sparse_terms
- strip_frequent_terms
- strip_html_tags
Example
julia> doc = StringDocument("This is a document of mine")
A StringDocument{String}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: This is a document of mine
julia> prepare!(doc, strip_pronouns | strip_articles)
julia> text(doc)
"This is document of "TextAnalysis.prob — Functionprob(
m::TextAnalysis.Langmodel,
templ_lm::DataStructures.DefaultDict,
word
) -> Float64
prob(
m::TextAnalysis.Langmodel,
templ_lm::DataStructures.DefaultDict,
word,
context
) -> Float64
Returns the probability of word given the context.
In other words, for the given context it evaluates the conditional frequency distribution of word.
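Continuing the MLE sketch shown under entropy:
julia> prob(model, fit, "my", "is")  # P("my" | "is") under the fitted bigram model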
TextAnalysis.prune! — Method
prune!(dtm::DocumentTermMatrix{T}, document_positions; compact::Bool=true, retain_terms::Union{Nothing,Vector{T}}=nothing) where {T}
Delete documents specified by document_positions from a document term matrix. Optionally compact the matrix by removing unreferenced terms.
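A minimal sketch deleting the first document:
julia> crps = Corpus([StringDocument("To be or not to be"),
                      StringDocument("To become or not to become")])
julia> update_lexicon!(crps)
julia> m = DocumentTermMatrix(crps)
julia> prune!(m, [1])  # drop document 1 and, with compact=true, any terms that no longer occur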
TextAnalysis.remove_case! — Method
remove_case!(doc)
remove_case!(crps)
Convert the text of doc or crps to lowercase. Does not support FileDocument or a crps containing a FileDocument.
Example
julia> str = "The quick brown fox jumps over the lazy dog"
julia> sd = StringDocument(str)
A StringDocument{String}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: The quick brown fox jumps over the lazy dog
julia> remove_case!(sd)
julia> sd.text
"the quick brown fox jumps over the lazy dog"See also: remove_case
TextAnalysis.remove_case — Methodremove_case(str)Convert str to lowercase. See also: remove_case!
TextAnalysis.remove_corrupt_utf8! — Methodremove_corrupt_utf8!(doc)
remove_corrupt_utf8!(crps)Remove corrupt UTF8 characters for doc or documents in crps. Does not support FileDocument or Corpus containing FileDocument. See also: remove_corrupt_utf8
TextAnalysis.remove_corrupt_utf8 — Methodremove_corrupt_utf8(str)Remove corrupt UTF8 characters in str. See also: remove_corrupt_utf8!
TextAnalysis.remove_frequent_terms! — Functionremove_frequent_terms!(crps, alpha=0.95)Remove terms in crps, occurring more than alpha percent of documents.
Example
julia> crps = Corpus([StringDocument("This is Document 1"),
StringDocument("This is Document 2")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> remove_frequent_terms!(crps)
julia> text(crps[1])
" 1"
julia> text(crps[2])
" 2"See also: remove_sparse_terms!, frequent_terms
TextAnalysis.remove_html_tags! — Methodremove_html_tags!(doc::StringDocument)
remove_html_tags!(crps)Remove html tags from the StringDocument or documents crps. Does not work for documents other than StringDocument.
Example
julia> html_doc = StringDocument(
"
<html>
<head><script language="javascript">x = 20;</script></head>
<body>
<h1>Hello</h1><a href="world">world</a>
</body>
</html>
"
)
A StringDocument{String}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: <html> <head><s
julia> remove_html_tags!(html_doc)
julia> strip(text(html_doc))
"Hello world"See also: remove_html_tags
TextAnalysis.remove_html_tags — Methodremove_html_tags(str)Remove html tags from str, including the style and script tags. See also: remove_html_tags!
TextAnalysis.remove_patterns! — Methodremove_patterns!(doc, rex::Regex)
remove_patterns!(crps, rex::Regex)Remove patterns matched by rex in document or Corpus. Does not modify FileDocument or Corpus containing FileDocument. See also: remove_patterns
TextAnalysis.remove_patterns — Methodremove_patterns(str, rex::Regex)Remove the part of str matched by rex. See also: remove_patterns!
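A small sketch removing digit runs (the pattern is illustrative):
julia> sd = StringDocument("Room 101 and room 102")
julia> remove_patterns!(sd, r"[0-9]+")
julia> text(sd)  # the digit runs are removed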
TextAnalysis.remove_sparse_terms! — Function
remove_sparse_terms!(crps, alpha=0.05)
Remove sparse terms in crps occurring in less than alpha percent of the documents.
Example
julia> crps = Corpus([StringDocument("This is Document 1"),
StringDocument("This is Document 2")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> remove_sparse_terms!(crps, 0.5)
julia> crps[1].text
"This is Document "
julia> crps[2].text
"This is Document "See also: remove_frequent_terms!, sparse_terms
TextAnalysis.remove_whitespace! — Methodremove_whitespace!(doc)
remove_whitespace!(crps)Squash multiple whitespaces to a single space and remove all leading and trailing whitespaces in document or crps. Does no-op for FileDocument, TokenDocument or NGramDocument. See also: remove_whitespace
TextAnalysis.remove_whitespace — Methodremove_whitespace(str)Squash multiple whitespaces to a single one. And remove all leading and trailing whitespaces. See also: remove_whitespace!
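For instance:
julia> remove_whitespace("  To  be   or  not ")
"To be or not"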
TextAnalysis.remove_words! — Method
remove_words!(doc, words::Vector{AbstractString})
remove_words!(crps, words::Vector{AbstractString})
Remove the occurrences of words from doc or crps.
Example
julia> str="The quick brown fox jumps over the lazy dog"
julia> sd=StringDocument(str);
julia> remove_words = ["fox", "over"]
julia> remove_words!(sd, remove_words)
julia> sd.text
"the quick brown jumps the lazy dog"TextAnalysis.rouge_l_sentence — Functionrouge_l_sentence(
references::Vector{<:AbstractString}, candidate::AbstractString, β=8;
weighted=false, weight_func=sqrt,
lang=Languages.English()
)::Vector{Score}
Calculate the ROUGE-L score between references and candidate at the sentence level.
Returns a vector of Score
See Rouge: A package for automatic evaluation of summaries
Note: the weighted argument enables weighting of values when calculating the longest common subsequence. The initial implementation, ROUGE-1.5.5.pl, uses a power function; the weight_func here defaults to sqrt (a power of 0.5).
See also: rouge_n, rouge_l_summary
TextAnalysis.rouge_l_summary — Method
rouge_l_summary(
references::Vector{<:AbstractString}, candidate::AbstractString, β::Int;
lang=Languages.English()
)::Vector{Score}
Calculate the ROUGE-L score between references and candidate at the summary level.
Returns a vector of Score
See Rouge: A package for automatic evaluation of summaries
See also: rouge_l_sentence, rouge_n
TextAnalysis.rouge_n — Method
rouge_n(
references::Vector{<:AbstractString},
candidate::AbstractString,
n::Int;
lang::Language
)::Vector{Score}
Compute n-gram recall between the candidate and the reference summaries.
The function takes the following arguments -
- references::Vector{T} where T <: AbstractString = The list of reference summaries.
- candidate::AbstractString = Input candidate summary, to be scored against the reference summaries.
- n::Integer = Order of n-grams.
- lang::Language = Language of the text, useful while generating n-grams. Default value is Languages.English().
Returns a vector of Score
See Rouge: A package for automatic evaluation of summaries
See also: rouge_l_sentence, rouge_l_summary
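A minimal sketch with made-up summaries:
julia> candidate  = "Brazil, Russia, India and China are growing nations."
julia> references = ["Brazil, Russia, India and China are growing nations. They are an important part of BRIC."]
julia> rouge_n(references, candidate, 2, lang=Languages.English())  # vector with one Score per reference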
TextAnalysis.score — Function
score(m::InterpolatedLanguageModel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)
score is used to output the probability of word given the context in an InterpolatedLanguageModel.
Applies Kneser-Ney or Witten-Bell smoothing depending on the subtype.
TextAnalysis.score — Function
score(m::MLE, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)
score is used to output the probability of word given the context in an MLE model.
TextAnalysis.score — Method
score(m::gammamodel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)
score is used to output the probability of word given the context.
Applies add-one smoothing for Lidstone or Laplace (gammamodel) models.
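Continuing the MLE sketch shown under entropy:
julia> score(model, fit, "my", "is")  # P("my" | "is") under the fitted bigram model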
TextAnalysis.sentence_tokenize — Method
sentence_tokenize(language, str)
Split str into sentences.
Example
julia> sentence_tokenize(Languages.English(), "Here are few words! I am Foo Bar.")
2-element Array{SubString{String},1}:
"Here are few words!"
"I am Foo Bar."See also: tokenize
TextAnalysis.sparse_terms — Functionsparse_terms(crps, alpha=0.05])Find the sparse terms from Corpus, occurring in less than alpha percentage of the documents.
Example
julia> crps = Corpus([StringDocument("This is Document 1"),
StringDocument("This is Document 2")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> sparse_terms(crps, 0.5)
2-element Array{String,1}:
"1"
"2"See also: remove_sparse_terms!, frequent_terms
TextAnalysis.standardize! — Methodstandardize!(crps::Corpus, ::Type{T}) where T <: AbstractDocumentStandardize the documents in a Corpus to a common type.
Example
julia> crps = Corpus([StringDocument("Document 1"),
TokenDocument("Document 2"),
NGramDocument("Document 3")])
A Corpus with 3 documents:
* 1 StringDocument's
* 0 FileDocument's
* 1 TokenDocument's
* 1 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> standardize!(crps, NGramDocument)
# After this step, you can check that the corpus only contains NGramDocument's:
julia> crps
A Corpus with 3 documents:
* 0 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 3 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
TextAnalysis.stem! — Method
stem!(doc)
stem!(crps)
Stems the document or the documents in crps with a suitable stemmer.
Stemming cannot be done for FileDocument or a Corpus made of such documents.
TextAnalysis.stem! — Method
stem!(crps::Corpus)
Stem an entire corpus. Assumes all documents in the corpus have the same language (picked from the first).
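A small sketch:
julia> sd = StringDocument("They are running and jumping")
julia> stem!(sd)
julia> text(sd)  # each word is replaced by its stem, e.g. "running" becomes "run"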
TextAnalysis.stemmer_for_document — Method
stemmer_for_document(doc)
Search for an appropriate stemmer based on the language of the document.
TextAnalysis.summarize — Method
summarize(doc [, ns])
Summarizes the document and returns ns sentences. It takes 2 arguments:
- d: A document of type StringDocument, FileDocument or TokenDocument.
- ns: (Optional) The number of sentences in the summary; defaults to 5.
Example
julia> s = StringDocument("Assume this Short Document as an example. Assume this as an example summarizer. This has too foo sentences.")
julia> summarize(s, ns=2)
2-element Array{SubString{String},1}:
"Assume this Short Document as an example."
"This has too foo sentences."TextAnalysis.tag_scheme! — Methodtag_scheme!(tags, current_scheme::String, new_scheme::String)Convert tags from current_scheme to new_scheme.
List of tagging schemes currently supported-
- BIO1 (BIO)
- BIO2
- BIOES
Example
julia> tags = ["I-LOC", "O", "I-PER", "B-MISC", "I-MISC", "B-PER", "I-PER", "I-PER"]
julia> tag_scheme!(tags, "BIO1", "BIOES")
julia> tags
8-element Array{String,1}:
"S-LOC"
"O"
"S-PER"
"B-MISC"
"E-MISC"
"B-PER"
"I-PER"
"E-PER"TextAnalysis.text — Methodtext(fd::FileDocument)
text(sd::StringDocument)
text(ngd::NGramDocument)Access the text of Document as a string.
Example
julia> sd = StringDocument("To be or not to be...")
A StringDocument{String}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: To be or not to be...
julia> text(sd)
"To be or not to be..."TextAnalysis.tf! — Methodtf!(dtm::SparseMatrixCSC{Real}, tf::SparseMatrixCSC{AbstractFloat})Overwrite tf with the term frequency of the dtm.
tf should have the has same nonzeros as dtm.
TextAnalysis.tf! — Methodtf!(dtm::AbstractMatrix{Real}, tf::AbstractMatrix{AbstractFloat})Overwrite tf with the term frequency of the dtm.
Works correctly if dtm and tf are same matrix.
TextAnalysis.tf — Method
tf(dtm::DocumentTermMatrix)
tf(dtm::SparseMatrixCSC{Real})
tf(dtm::Matrix{Real})
Compute the term frequency of the input.
Example
julia> crps = Corpus([StringDocument("To be or not to be"),
StringDocument("To become or not to become")])
julia> update_lexicon!(crps)
julia> m = DocumentTermMatrix(crps)
julia> tf(m)
2×6 SparseArrays.SparseMatrixCSC{Float64,Int64} with 10 stored entries:
[1, 1] = 0.166667
[2, 1] = 0.166667
[1, 2] = 0.333333
[2, 3] = 0.333333
[1, 4] = 0.166667
[2, 4] = 0.166667
[1, 5] = 0.166667
[2, 5] = 0.166667
[1, 6] = 0.166667
[2, 6] = 0.166667
TextAnalysis.tf_idf! — Method
tf_idf!(dtm)
Compute tf-idf for dtm.
TextAnalysis.tf_idf! — Method
tf_idf!(dtm::SparseMatrixCSC{Real}, tfidf::SparseMatrixCSC{AbstractFloat})
Overwrite tfidf with the tf-idf (Term Frequency - Inverse Document Frequency) of the dtm.
The arguments must have the same number of nonzeros.
TextAnalysis.tf_idf! — Method
tf_idf!(dtm::AbstractMatrix{Real}, tf_idf::AbstractMatrix{AbstractFloat})
Overwrite tf_idf with the tf-idf (Term Frequency - Inverse Document Frequency) of the dtm.
dtm and tf_idf must be matrices of the same dimensions.
TextAnalysis.tf_idf — Method
tf_idf(dtm::DocumentTermMatrix)
tf_idf(dtm::SparseMatrixCSC{Real})
tf_idf(dtm::Matrix{Real})
Compute the tf-idf value (Term Frequency - Inverse Document Frequency) of the input.
In many cases, raw word counts are not appropriate for use because:
- Some documents are longer than other documents
- Some words are more frequent than other words
A simple workaround is to perform TF-IDF on a DocumentTermMatrix.
Example
julia> crps = Corpus([StringDocument("To be or not to be"),
StringDocument("To become or not to become")])
julia> update_lexicon!(crps)
julia> m = DocumentTermMatrix(crps)
julia> tf_idf(m)
2×6 SparseArrays.SparseMatrixCSC{Float64,Int64} with 10 stored entries:
[1, 1] = 0.0
[2, 1] = 0.0
[1, 2] = 0.231049
[2, 3] = 0.231049
[1, 4] = 0.0
[2, 4] = 0.0
[1, 5] = 0.0
[2, 5] = 0.0
[1, 6] = 0.0
[2, 6] = 0.0
TextAnalysis.timestamp! — Method
timestamp!(doc, timestamp::AbstractString)
Set the timestamp metadata of doc to timestamp.
See also: timestamp, timestamps, timestamps!
TextAnalysis.timestamp — Method
TextAnalysis.timestamps! — Method
timestamps!(crps, times::Vector{String})
timestamps!(crps, time::AbstractString)
Set the timestamps of the documents in crps to the timestamps in times, respectively.
See also: timestamps, timestamp!, timestamp
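A short illustration with made-up timestamps:
julia> crps = Corpus([StringDocument("Doc 1"), StringDocument("Doc 2")])
julia> timestamps!(crps, ["2011-01-01", "2012-02-02"])  # hypothetical timestamps, one per document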
TextAnalysis.timestamps — Method
timestamps(crps)
Return the timestamps for each document in crps.
See also: timestamps!, timestamp, timestamp!
TextAnalysis.title! — Method
TextAnalysis.title — Method
TextAnalysis.titles! — Method
titles!(crps, vec::Vector{String})
titles!(crps, str)
Update the titles of the documents in a Corpus.
If the input is a String, the same title is set for all documents. If the input is a vector, the title of the ith document is set to the ith element of vec; in that case, the number of documents must equal the length of the vector.
TextAnalysis.titles — Method
TextAnalysis.tokenize — Method
tokenize(language, str)
Split str into words and other tokens such as punctuation.
Example
julia> tokenize(Languages.English(), "Too foo words!")
4-element Array{String,1}:
"Too"
"foo"
"words"
"!"See also: sentence_tokenize
TextAnalysis.tokens — Methodtokens(d::TokenDocument)
tokens(d::(Union{FileDocument, StringDocument}))Access the document text as a token array.
Example
julia> sd = StringDocument("To be or not to be...")
A StringDocument{String}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: To be or not to be...
julia> tokens(sd)
7-element Array{String,1}:
"To"
"be"
"or"
"not"
"to"
"be.."
"."TextAnalysis.update — Methodupdate(vocab::Vocabulary, words) -> Dict{String, Int64}
See Vocabulary
TextAnalysis.weighted_lcs — Functionweighted_lcs(X, Y, weight_score::Bool, returns_string::Bool, weigthing_function::Function)Compute the Weighted Longest Common Subsequence of X and Y.
TextAnalysis.CooMatrix — Type
Basic Co-occurrence Matrix (COOM) type.
Fields
- coom::SparseMatrixCSC{T,Int} - the actual COOM; elements represent co-occurrences of two terms within a given window
- terms::Vector{String} - a list of terms that represent the lexicon of the document or corpus
- column_indices::OrderedDict{String, Int} - a map between the terms and the columns of the co-occurrence matrix
TextAnalysis.CooMatrix — Method
CooMatrix{T}(crps::Corpus [,terms] [;window=5, normalize=true])
Auxiliary constructor(s) of the CooMatrix type. The type T has to be a subtype of AbstractFloat. The constructor(s) requires a corpus crps and a terms structure representing the lexicon of the corpus. The latter can be a Vector{String}, an AbstractDict where the keys are the lexicon, or can be omitted, in which case the lexicon field of the corpus is used.
TextAnalysis.Corpus — Method
Corpus(docs::Vector{T}) where {T <: AbstractDocument}
Collections of documents are represented using the Corpus type.
Example
julia> crps = Corpus([StringDocument("Document 1"),
StringDocument("Document 2")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
TextAnalysis.DocumentMetadata — Type
DocumentMetadata(
language::Language,
title::String,
author::String,
timestamp::String,
custom::Any
)
Stores basic metadata about a Document.
Arguments
- language: What language is the document in? Defaults to Languages.English(), a Language instance defined by the Languages package.
- title::String: What is the title of the document? Defaults to "Untitled Document".
- author::String: Who wrote the document? Defaults to "Unknown Author".
- timestamp::String: When was the document written? Defaults to "Unknown Time".
- custom: user-specific data field. Defaults to nothing.
TextAnalysis.DocumentTermMatrix — Method
DocumentTermMatrix(crps::Corpus)
DocumentTermMatrix(crps::Corpus, terms::Vector{String})
DocumentTermMatrix(crps::Corpus, lex::AbstractDict)
DocumentTermMatrix(dtm::SparseMatrixCSC{Int, Int}, terms::Vector{String})
Represent documents as a matrix of word counts.
Allows us to apply linear algebra operations and statistical techniques. The lexicon needs to be updated before use (see update_lexicon!).
Examples
julia> crps = Corpus([StringDocument("To be or not to be"),
StringDocument("To become or not to become")])
julia> update_lexicon!(crps)
julia> m = DocumentTermMatrix(crps)
A 2 X 6 DocumentTermMatrix
julia> m.dtm
2×6 SparseArrays.SparseMatrixCSC{Int64,Int64} with 10 stored entries:
[1, 1] = 1
[2, 1] = 1
[1, 2] = 2
[2, 3] = 2
[1, 4] = 1
[2, 4] = 1
[1, 5] = 1
[2, 5] = 1
[1, 6] = 1
[2, 6] = 1
TextAnalysis.FileDocument — Method
FileDocument(pathname::AbstractString)
Represents a document using a plain text file on disk.
Example
julia> pathname = "/usr/share/dict/words"
"/usr/share/dict/words"
julia> fd = FileDocument(pathname)
A FileDocument
* Language: Languages.English()
* Title: /usr/share/dict/words
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: A A's AMD AMD's AOL AOL's Aachen Aachen's Aaliyah
TextAnalysis.KneserNeyInterpolated — Method
KneserNeyInterpolated(word::Vector{T}, discount::Float64, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}
Initialize a type providing a Kneser-Ney interpolated language model.
The idea to abstract this comes from Chen & Goodman 1995.
TextAnalysis.Laplace — Type
Laplace(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}
Initialize a Laplace type for providing Laplace-smoothed scores.
In addition to the initialization arguments from BaseNgramModel, it also requires a number by which to increase the counts, fixed at gamma = 1.
TextAnalysis.Lidstone — Method
Lidstone(word::Vector{T}, gamma::Float64, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}
Initialize a Lidstone type for providing Lidstone-smoothed scores.
In addition to the initialization arguments from BaseNgramModel, it also requires a number by which to increase the counts, gamma.
TextAnalysis.MLE — Method
MLE(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}
Initialize a type providing MLE ngram model scores.
Implementation of the base ngram model.
TextAnalysis.NGramDocument — Method
NGramDocument(txt::AbstractString, n::Integer=1)
NGramDocument(txt::AbstractString, dm::DocumentMetadata, n::Integer=1)
NGramDocument(ng::Dict{T, Int}, n::Integer=1) where T <: AbstractString
Represents a document as a bag of n-grams, which are UTF8 n-grams mapped to counts.
Example
julia> my_ngrams = Dict{String, Int}("To" => 1, "be" => 2,
"or" => 1, "not" => 1,
"to" => 1, "be..." => 1)
Dict{String,Int64} with 6 entries:
"or" => 1
"be..." => 1
"not" => 1
"to" => 1
"To" => 1
"be" => 2
julia> ngd = NGramDocument(my_ngrams)
A NGramDocument{AbstractString}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: ***SAMPLE TEXT NOT AVAILABLE***
TextAnalysis.NaiveBayesClassifier — Method
NaiveBayesClassifier([dict, ]classes)
A Naive Bayes Classifier for classifying documents.
It takes two arguments:
- classes: An array of possible classes that the concerned data could belong to.
- dict: (Optional) An Array of possible tokens (words). This is automatically updated when a new token is detected while fitting.
Example
julia> using TextAnalysis: NaiveBayesClassifier, fit!, predict
julia> m = NaiveBayesClassifier([:spam, :non_spam])
NaiveBayesClassifier{Symbol}(String[], [:spam, :non_spam], Matrix{Int64}(undef, 0, 2))
julia> fit!(m, "this is spam", :spam)
NaiveBayesClassifier{Symbol}(["this", "is", "spam"], [:spam, :non_spam], [2 1; 2 1; 2 1])
julia> fit!(m, "this is not spam", :non_spam)
NaiveBayesClassifier{Symbol}(["this", "is", "spam", "not"], [:spam, :non_spam], [2 2; 2 2; 2 2; 1 2])
julia> predict(m, "is this a spam")
Dict{Symbol, Float64} with 2 entries:
:spam => 0.59883
:non_spam => 0.40117
TextAnalysis.Score — Type
struct Score
    precision::Float32
    recall::Float32
    fmeasure::Float32
end
TextAnalysis.Score — Method
Score(
precision::AbstractFloat,
recall::AbstractFloat,
fmeasure::AbstractFloat
) -> Score
Stores the result of an evaluation.
TextAnalysis.Score — Method
Score(; precision, recall, fmeasure) -> Score
TextAnalysis.StringDocument — Method
StringDocument(txt::AbstractString)
Represents a document using a UTF8 String stored in RAM.
Example
julia> str = "To be or not to be..."
"To be or not to be..."
julia> sd = StringDocument(str)
A StringDocument{String}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: To be or not to be...
TextAnalysis.TextHashFunction — Method
TextHashFunction(cardinality)
TextHashFunction(hash_function, cardinality)
The need to create a lexicon before we can construct a document term matrix is often prohibitive. We can employ a trick that has come to be called the Hash Trick, in which we replace terms with their hashed value using a hash function that outputs integers from 1 to N.
Parameters:
- cardinality = maximum index used for hashing (default 100)
- hash_function = function used for the hashing process (a default is provided; see the code base)
julia> h = TextHashFunction(10)
TextHashFunction(hash, 10)
TextAnalysis.TokenDocument — Method
TokenDocument(txt::AbstractString)
TokenDocument(txt::AbstractString, dm::DocumentMetadata)
TokenDocument(tkns::Vector{T}) where T <: AbstractString
Represents a document as a sequence of UTF8 tokens.
Example
julia> my_tokens = String["To", "be", "or", "not", "to", "be..."]
6-element Array{String,1}:
"To"
"be"
"or"
"not"
"to"
"be..."
julia> td = TokenDocument(my_tokens)
A TokenDocument{String}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: ***SAMPLE TEXT NOT AVAILABLE***
TextAnalysis.Vocabulary — Type
Vocabulary(word, unk_cutoff=1, unk_label="<unk>")
Stores a language model vocabulary. Satisfies two common language modeling requirements for a vocabulary:
- When checking membership and calculating its size, filters items by comparing their counts to a cutoff value.
- Adds a special "unknown" token which unseen words are mapped to.
Example
julia> words = ["a", "c", "-", "d", "c", "a", "b", "r", "a", "c", "d"]
julia> vocabulary = Vocabulary(words, 2)
Vocabulary(Dict("<unk>"=>1,"c"=>3,"a"=>3,"d"=>2), 2, "<unk>")
julia> vocabulary.vocab
Dict{String,Int64} with 4 entries:
"<unk>" => 1
"c" => 3
"a" => 3
"d" => 2
Tokens with counts greater than or equal to the cutoff value will
be considered part of the vocabulary.
julia> vocabulary.vocab["c"]
3
julia> "c" in keys(vocabulary.vocab)
true
julia> vocabulary.vocab["d"]
2
julia> "d" in keys(vocabulary.vocab)
true
Tokens with frequency counts less than the cutoff value will be considered not
part of the vocabulary even though their entries in the count dictionary are
preserved.
julia> "b" in keys(vocabulary.vocab)
false
julia> "<unk>" in keys(vocabulary.vocab)
true
We can look up words in a vocabulary using its `lookup` method.
"Unseen" words (with counts less than cutoff) are looked up as the unknown label.
If given one word (a string) as an input, this method will return a string.
julia> lookup(vocabulary, "a")
"a"
julia> word = ["a", "-", "d", "c", "a"]
julia> lookup(vocabulary, word)
5-element Array{Any,1}:
"a"
"<unk>"
"d"
"c"
"a"
If given a sequence, it will return an Array{Any,1} of the looked up words as shown above.
It's possible to update the counts after the vocabulary has been created.
julia> update(vocabulary,["b","c","c"])
1
julia> vocabulary.vocab["b"]
1
TextAnalysis.Vocabulary — Method
Vocabulary(word::Array{T<:AbstractString, 1}) -> Vocabulary
Vocabulary(
word::Array{T<:AbstractString, 1},
unk_cutoff
) -> Vocabulary
Vocabulary(
word::Array{T<:AbstractString, 1},
unk_cutoff,
unk_label
) -> Vocabulary
TextAnalysis.WittenBellInterpolated — Method
WittenBellInterpolated(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}
Initialize a type providing an interpolated version of Witten-Bell smoothing.
The idea to abstract this comes from Chen & Goodman 1995.