API References
Base.argmax — Method

argmax(scores::Vector{Score})::Score

- scores - vector of Score

Returns the element of scores with the maximum precision field.
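A minimal usage sketch, assuming Score values built with the positional constructor documented at the end of this reference:

julia> scores = [Score(0.2, 0.4, 0.3), Score(0.8, 0.1, 0.2)];

julia> best = argmax(scores);  # the Score with the highest precision, here the second one

julia> best.precision
0.8f0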
Base.merge! — Method

merge!(dtm1::DocumentTermMatrix{T}, dtm2::DocumentTermMatrix{T}) where {T}
Merge one DocumentTermMatrix instance into another. Documents are appended to the end. Terms are re-sorted. For efficiency, this may result in modifications to dtm2 as well.
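A sketch of merging matrices built from two separate corpora, reusing the DocumentTermMatrix and update_lexicon! API shown elsewhere in this reference:

julia> crps1 = Corpus([StringDocument("To be or not to be")]);

julia> crps2 = Corpus([StringDocument("To become or not to become")]);

julia> update_lexicon!(crps1); update_lexicon!(crps2);

julia> m1 = DocumentTermMatrix(crps1); m2 = DocumentTermMatrix(crps2);

julia> merge!(m1, m2);  # m1 now covers both documents, with terms re-sorted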
TextAnalysis.DirectoryCorpus — Method

DirectoryCorpus(dirname::AbstractString)
Construct a Corpus from a directory of text files.
TextAnalysis.author! — Method

TextAnalysis.author — Method

TextAnalysis.authors! — Method

authors!(crps, athrs)
authors!(crps, athr)

Set the authors of the documents in crps to athrs, respectively.
TextAnalysis.authors — Method

TextAnalysis.average — Method

average(scores::Vector{Score})::Score

- scores - vector of Score

Returns the average of scores as a Score with averaged precision, recall and fmeasure fields.
TextAnalysis.bleu_score — Method

bleu_score(reference_corpus::Vector{Vector{Token}}, translation_corpus::Vector{Token}; max_order=4, smooth=false)

Computes the BLEU score of translated segments against one or more references. Returns the BLEU score, the n-gram precisions, the brevity penalty, the geometric mean of n-gram precisions, the translation length and the reference length.

Arguments

- reference_corpus: list of lists of references for each translation. Each reference should be tokenized into a list of tokens.
- translation_corpus: list of translations to score. Each translation should be tokenized into a list of tokens.
- max_order: maximum n-gram order to use when computing the BLEU score.
- smooth=false: whether or not to apply Lin et al. 2004 smoothing.
Example:
one_doc_references = [
["apple", "is", "apple"],
["apple", "is", "a", "fruit"]
]
one_doc_translation = [
"apple", "is", "appl"
]
bleu_score([one_doc_references], [one_doc_translation], smooth=true)
TextAnalysis.columnindices — Method

columnindices(terms::Vector{String})
Creates a column index lookup dictionary from a vector of terms.
TextAnalysis.coo_matrix — Method

coo_matrix(::Type{T}, doc::Vector{AbstractString}, vocab::OrderedDict{AbstractString, Int}, window::Int, normalize::Bool, mode::Symbol)

Basic low-level function that calculates the co-occurrence matrix of a document. Returns a sparse co-occurrence matrix sized n × n, where n = length(vocab), with elements of type T. The document doc is represented by a vector of its terms (in order). The keywords window and normalize indicate the size of the sliding word window in which co-occurrences are counted and whether or not to normalize the counts by the distance between word positions. The mode keyword can be either :default or :directional and indicates whether the co-occurrence matrix should be directional or not. If mode is :directional, the co-occurrence matrix is an n × n matrix where n = length(vocab) and coom[i,j] is the number of times vocab[i] co-occurs with vocab[j] in the document doc. If mode is :default, coom[i,j] is twice that number (once for each direction, from i to j and from j to i).
Example
julia> using TextAnalysis, DataStructures
doc = StringDocument("This is a text about an apple. There are many texts about apples.")
docv = TextAnalysis.tokenize(language(doc), text(doc))
vocab = OrderedDict("This"=>1, "is"=>2, "apple."=>3)
TextAnalysis.coo_matrix(Float16, docv, vocab, 5, true)
3×3 SparseArrays.SparseMatrixCSC{Float16,Int64} with 4 stored entries:
[2, 1] = 2.0
[1, 2] = 2.0
[3, 2] = 0.3999
[2, 3] = 0.3999
julia> using TextAnalysis, DataStructures
doc = StringDocument("This is a text about an apple. There are many texts about apples.")
docv = TextAnalysis.tokenize(language(doc), text(doc))
vocab = OrderedDict("This"=>1, "is"=>2, "apple."=>3)
TextAnalysis.coo_matrix(Float16, docv, vocab, 5, true, :directional)
3×3 SparseArrays.SparseMatrixCSC{Float16,Int64} with 4 stored entries:
[2, 1] = 1.0
[1, 2] = 1.0
[3, 2] = 0.1999
[2, 3] = 0.1999
TextAnalysis.coom — Method

coom(c::CooMatrix)

Access the co-occurrence matrix field coom of a CooMatrix c.
TextAnalysis.coom — Method

coom(entity, eltype=DEFAULT_FLOAT_TYPE [;window=5, normalize=true])

Access the co-occurrence matrix of the CooMatrix associated with the entity. The CooMatrix{T} will first have to be created in order for the actual matrix to be accessed.
TextAnalysis.cos_similarity — Method

cos_similarity(tfm::AbstractMatrix)

cos_similarity calculates the cosine similarity from a term frequency matrix (typically the tf-idf matrix).
Example
crps = Corpus( StringDocument.([
"to be or not to be",
"to sing or not to sing",
"to talk or to silence"]) )
update_lexicon!(crps)
d = dtm(crps)
tfm = tf_idf(d)
cs = cos_similarity(tfm)
Matrix(cs)
# 3×3 Array{Float64,2}:
# 1.0 0.0329318 0.0
# 0.0329318 1.0 0.0
# 0.0 0.0 1.0
TextAnalysis.counter2 — Method

counter2(
    data,
    min::Integer,
    max::Integer
) -> DataStructures.DefaultDict{SubString{String}, DataStructures.Accumulator{String, Int64}, DataStructures.Accumulator{SubString{String}, Int64}}

counter2 builds the conditional frequency distribution that the score functions use to calculate conditional probabilities.
TextAnalysis.dtm — Method

dtm(crps::Corpus)
dtm(d::DocumentTermMatrix)
dtm(d::DocumentTermMatrix, density::Symbol)

Creates a simple sparse matrix from a DocumentTermMatrix object.
Examples
julia> crps = Corpus([StringDocument("To be or not to be"),
StringDocument("To become or not to become")])
julia> update_lexicon!(crps)
julia> dtm(DocumentTermMatrix(crps))
2×6 SparseArrays.SparseMatrixCSC{Int64,Int64} with 10 stored entries:
[1, 1] = 1
[2, 1] = 1
[1, 2] = 2
[2, 3] = 2
[1, 4] = 1
[2, 4] = 1
[1, 5] = 1
[2, 5] = 1
[1, 6] = 1
[2, 6] = 1
julia> dtm(DocumentTermMatrix(crps), :dense)
2×6 Array{Int64,2}:
1 2 0 1 1 1
1 0 2 1 1 1
TextAnalysis.dtv — Method

dtv(d::AbstractDocument, lex::Dict{String, Int})

Produce a single row of a DocumentTermMatrix.

Individual documents do not have a lexicon associated with them, so we have to pass in a lexicon as an additional argument.
Examples
julia> dtv(crps[1], lexicon(crps))
1×6 Array{Int64,2}:
1 2 0 1 1 1
TextAnalysis.entropy — Method

entropy(
    m::TextAnalysis.Langmodel,
    lm::DataStructures.DefaultDict,
    text_ngram::AbstractVector
) -> Float64

Calculate the cross-entropy of the model for the given evaluation text.

The input text must be a Vector of ngrams of the same length.
TextAnalysis.everygram — Method

everygram(seq::Vector{T}; min_len::Int=1, max_len::Int=-1) where {T <: AbstractString}

Return all possible ngrams generated from a sequence of items, as an Array{String,1}.
Example
julia> seq = ["To","be","or","not"]
julia> a = everygram(seq,min_len=1, max_len=-1)
10-element Array{Any,1}:
"or"
"not"
"To"
"be"
"or not"
"be or"
"To be"
"be or not"
"To be or"
"To be or not"
TextAnalysis.extend! — Method

extend!(model::NaiveBayesClassifier, dictElement)

Add dictElement to the dictionary of the Classifier model.
TextAnalysis.features — Method

features(
    fs::AbstractDict,
    dict::AbstractVector
) -> Vector{Int64}

Compute an Array whose i-th entry is the value in the input AbstractDict fs for the i-th element of dict (0 if absent).
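A small illustration with hypothetical counts:

julia> counts = Dict("spam" => 2, "ham" => 1);

julia> features(counts, ["spam", "ham", "eggs"])
3-element Array{Int64,1}:
 2
 1
 0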
TextAnalysis.fit! — Method

fit!(model::NaiveBayesClassifier, str, class)
fit!(model::NaiveBayesClassifier, ::Features, class)
fit!(model::NaiveBayesClassifier, ::StringDocument, class)
Fit the weights for the model on the input data.
TextAnalysis.fmeasure_lcs — Function

fmeasure_lcs(RLCS, PLCS, β)

Compute the F-measure based on WLCS.

Arguments

- RLCS - Recall Factor
- PLCS - Precision Factor
- β - Parameter
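The usual ROUGE-style F-measure combines these factors as in the sketch below; this is my reading of the standard formula, not necessarily a literal mirror of the internals:

julia> R, P, β = 0.5, 0.6, 1.0;

julia> ((1 + β^2) * R * P) / (R + β^2 * P)
0.5454545454545454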
TextAnalysis.frequencies — Method

frequencies(
    xs::AbstractArray{T, 1}
) -> Dict{_A, Int64} where _A

Create a dict that maps elements in the input array to their frequencies.
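A small illustration:

julia> frequencies(["a", "b", "a"])
Dict{String,Int64} with 2 entries:
  "b" => 1
  "a" => 2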
TextAnalysis.frequent_terms — Function

frequent_terms(crps, alpha=0.95)

Find the frequent terms from a Corpus, occurring in more than alpha percent of the documents.
Example
julia> crps = Corpus([StringDocument("This is Document 1"),
StringDocument("This is Document 2")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> frequent_terms(crps)
3-element Array{String,1}:
"is"
"This"
"Document"
See also: remove_frequent_terms!, sparse_terms
TextAnalysis.get_ngrams — Method

get_ngrams(segment, max_order)

Extracts all n-grams up to a given maximum order from an input segment. Returns a counter containing all n-grams up to max_order in segment, with a count of how many times each n-gram occurred.

Arguments

- segment: text segment from which n-grams will be extracted.
- max_order: maximum length in tokens of the n-grams returned by this method.
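A sketch of the call shape, assuming a pre-tokenized segment as in the bleu_score entry above (the concrete counter type returned is an implementation detail):

julia> counts = get_ngrams(["the", "cat", "sat"], 2);  # counts of all unigrams and bigrams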
TextAnalysis.hash_dtm — Method

hash_dtm(crps::Corpus)
hash_dtm(crps::Corpus, h::TextHashFunction)
Represents a Corpus as a Matrix with N entries.
TextAnalysis.hash_dtv — Method

hash_dtv(d::AbstractDocument)
hash_dtv(d::AbstractDocument, h::TextHashFunction)
Represents a document as a vector with N entries.
Examples
julia> crps = Corpus([StringDocument("To be or not to be"),
StringDocument("To become or not to become")])
julia> h = TextHashFunction(10)
TextHashFunction(hash, 10)
julia> hash_dtv(crps[1], h)
1×10 Array{Int64,2}:
0 2 0 0 1 3 0 0 0 0
julia> hash_dtv(crps[1])
1×100 Array{Int64,2}:
0 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0
TextAnalysis.index_hash — Method

index_hash(str, TextHashFunc)

Shows the mapping of a string to an integer.

Parameters:
- str = the string to be hashed
- TextHashFunc = TextHashFunction type object
julia> h = TextHashFunction(10)
TextHashFunction(hash, 10)
julia> index_hash("a", h)
8
julia> index_hash("b", h)
7
TextAnalysis.inverse_index — Method

inverse_index(crps::Corpus)
Shows the inverse index of a corpus.
If we are interested in a specific term, we often want to know which documents in a corpus contain that term. The inverse index tells us this and therefore provides a simplistic sort of search algorithm.
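A sketch, assuming the index has been populated with update_inverse_index! first:

julia> crps = Corpus([StringDocument("Name Foo"),
                      StringDocument("Name Bar")]);

julia> update_inverse_index!(crps);

julia> inverse_index(crps)["Name"]
2-element Array{Int64,1}:
 1
 2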
TextAnalysis.language! — Method

language!(doc, lang::Language)

Set the language of doc to lang.
Example
julia> d = StringDocument("String Document 1")
julia> language!(d, Languages.Spanish())
julia> d.metadata.language
Languages.Spanish()
See also: language, languages, languages!
TextAnalysis.language — Method

TextAnalysis.languages! — Method

languages!(crps, langs::Vector{Language})
languages!(crps, lang::Language)

Update the languages of the documents in a Corpus.

If the input is a Vector, the language of the ith document is set to the ith element of the vector. The number of documents must equal the length of the vector.
TextAnalysis.languages — Method

languages(crps)

Return the languages for each document in crps.

See also: languages!, language, language!
TextAnalysis.lda — Method

ϕ, θ = lda(dtm::DocumentTermMatrix, ntopics::Int, iterations::Int, α::Float64, β::Float64; kwargs...)

Perform Latent Dirichlet Allocation.

Required Positional Arguments

- α: Dirichlet distribution hyperparameter for the topic distribution per document. α < 1 yields a sparse topic mixture for each document; α > 1 yields a more uniform topic mixture for each document.
- β: Dirichlet distribution hyperparameter for the word distribution per topic. β < 1 yields a sparse word mixture for each topic; β > 1 yields a more uniform word mixture for each topic.

Optional Keyword Arguments

- showprogress::Bool: Show a progress bar during the Gibbs sampling. Default value: true.

Return Values

- ϕ: ntopics × nwords sparse matrix of probabilities s.t. sum(ϕ, 1) == 1
- θ: ntopics × ndocs dense matrix of probabilities s.t. sum(θ, 1) == 1
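A minimal usage sketch; the hyperparameter values here are illustrative only:

julia> crps = Corpus([StringDocument("To be or not to be"),
                      StringDocument("To become or not to become")]);

julia> update_lexicon!(crps);

julia> m = DocumentTermMatrix(crps);

julia> ϕ, θ = lda(m, 2, 1000, 0.1, 0.1);  # 2 topics, 1000 Gibbs iterations

julia> size(θ)  # ntopics × ndocs
(2, 2)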
TextAnalysis.lexical_frequency — Method

lexical_frequency(crps::Corpus, term::AbstractString)
Tells us how often a term occurs across all of the documents.
TextAnalysis.lexicon — Method

lexicon(crps::Corpus)

Shows the lexicon of the corpus.

The lexicon of a corpus consists of all the terms that occur in any document in the corpus.
Example
julia> crps = Corpus([StringDocument("Name Foo"),
StringDocument("Name Bar")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> lexicon(crps)
Dict{String,Int64} with 0 entries
TextAnalysis.lexicon_size — Method

lexicon_size(crps::Corpus)
Tells the total number of terms in a lexicon.
TextAnalysis.logscore — Method

logscore(
m::TextAnalysis.Langmodel,
temp_lm::DataStructures.DefaultDict,
word,
context
) -> Float64
Evaluate the log score of this word in this context.
The arguments are the same as for score and maskedscore.
TextAnalysis.lookup — Method

lookup(
    voc::Vocabulary,
    word::AbstractArray{T<:AbstractString, 1}
) -> Vector

Look up a sequence of words in the vocabulary.

Returns an Array of String.

See Vocabulary
TextAnalysis.lsa — Method

lsa(dtm::DocumentTermMatrix)
lsa(crps::Corpus)

Performs Latent Semantic Analysis (LSA) on a corpus.
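A usage sketch; as with the other matrix functions, the lexicon must be updated first. The comment reflects my understanding that the result is an SVD factorization of the tf-idf matrix:

julia> crps = Corpus([StringDocument("this is a text"),
                      StringDocument("this is another text")]);

julia> update_lexicon!(crps);

julia> F = lsa(crps);  # an SVD factorization (F.U, F.S, F.V) of the tf-idf matrix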
TextAnalysis.maskedscore — Method

maskedscore(
    m::TextAnalysis.Langmodel,
    temp_lm::DataStructures.DefaultDict,
    word,
    context
) -> Float64

Evaluate the score with out-of-vocabulary words masked by the unknown label.

The arguments are the same as for score.
TextAnalysis.ngramize — Method

ngramize(lang, tokens, n)

Compute the ngrams of tokens of the order n.
Example
julia> ngramize(Languages.English(), ["To", "be", "or", "not", "to"], 3)
Dict{AbstractString,Int64} with 3 entries:
"be or not" => 1
"or not to" => 1
"To be or" => 1
TextAnalysis.ngramizenew — Method

ngramizenew(words::Vector{T}, nlist::Integer...) where {T <: AbstractString}

ngramizenew outputs the ngrams of words for each order in nlist.
Example
julia> seq=["To","be","or","not","To","not","To","not"]
julia> ngramizenew(seq ,2)
7-element Array{Any,1}:
"To be"
"be or"
"or not"
"not To"
"To not"
"not To"
"To not"
TextAnalysis.ngrams — Method

ngrams(ngd::NGramDocument, n::Integer)
ngrams(d::AbstractDocument, n::Integer)
ngrams(d::NGramDocument)
ngrams(d::AbstractDocument)
Access the document text as n-gram counts.
Example
julia> sd = StringDocument("To be or not to be...")
A StringDocument{String}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: To be or not to be...
julia> ngrams(sd)
Dict{String,Int64} with 7 entries:
"or" => 1
"not" => 1
"to" => 1
"To" => 1
"be" => 1
"be.." => 1
"." => 1
TextAnalysis.onegramize — Method

onegramize(lang, tokens)
Create the unigrams dict for input tokens.
Example
julia> onegramize(Languages.English(), ["To", "be", "or", "not", "to", "be"])
Dict{String,Int64} with 5 entries:
"or" => 1
"not" => 1
"to" => 1
"To" => 1
"be" => 2
TextAnalysis.padding_ngram — Method

padding_ngram(word::Vector{T}, n=1; pad_left=false, pad_right=false, left_pad_symbol="<s>", right_pad_symbol="</s>") where {T <: AbstractString}

padding_ngram is used to pad the left and right of a sentence and output the ngrams of order n.

It also pads the original input Array of strings.
Example
julia> example = ["1","2","3","4","5"]
julia> padding_ngram(example,2,pad_left=true,pad_right=true)
6-element Array{Any,1}:
"<s> 1"
"1 2"
"2 3"
"3 4"
"4 5"
"5 </s>"
TextAnalysis.perplexity — Method

perplexity(
    m::TextAnalysis.Langmodel,
    lm::DataStructures.DefaultDict,
    text_ngram::AbstractVector
) -> Float64

Calculates the perplexity of the given text.

This is simply 2^(cross-entropy) of the text, so the arguments are the same as for entropy.
TextAnalysis.predict — Method

predict(::NaiveBayesClassifier, str)
predict(::NaiveBayesClassifier, ::Features)
predict(::NaiveBayesClassifier, ::StringDocument)
Predict probabilities for each class on the input Features or String.
TextAnalysis.prepare! — Method

prepare!(doc, flags)
prepare!(crps, flags)

Preprocess document or corpus based on the input flags.

List of Flags

- strip_patterns
- strip_corrupt_utf8
- strip_case
- stem_words
- tag_part_of_speech
- strip_whitespace
- strip_punctuation
- strip_numbers
- strip_non_letters
- strip_indefinite_articles
- strip_definite_articles
- strip_articles
- strip_prepositions
- strip_pronouns
- strip_stopwords
- strip_sparse_terms
- strip_frequent_terms
- strip_html_tags
Example
julia> doc = StringDocument("This is a document of mine")
A StringDocument{String}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: This is a document of mine
julia> prepare!(doc, strip_pronouns | strip_articles)
julia> text(doc)
"This is document of "
TextAnalysis.prob — Function

prob(
    m::TextAnalysis.Langmodel,
    templ_lm::DataStructures.DefaultDict,
    word
) -> Float64
prob(
    m::TextAnalysis.Langmodel,
    templ_lm::DataStructures.DefaultDict,
    word,
    context
) -> Float64

Get the probability of word given the context.

In other words, for the given context, calculate the frequency distribution of word.
TextAnalysis.prune! — Method

prune!(dtm::DocumentTermMatrix{T}, document_positions; compact::Bool=true, retain_terms::Union{Nothing,Vector{T}}=nothing) where {T}
Delete documents specified by document_positions
from a document term matrix. Optionally compact the matrix by removing unreferenced terms.
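A sketch, reusing the DocumentTermMatrix example from elsewhere in this reference:

julia> crps = Corpus([StringDocument("To be or not to be"),
                      StringDocument("To become or not to become")]);

julia> update_lexicon!(crps);

julia> m = DocumentTermMatrix(crps);

julia> prune!(m, [1]);  # drop the first document; with compact=true, unreferenced terms are removed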
TextAnalysis.remove_case! — Method

remove_case!(doc)
remove_case!(crps)

Convert the text of doc or crps to lowercase. Does not support FileDocument or a crps containing FileDocument.
Example
julia> str = "The quick brown fox jumps over the lazy dog"
julia> sd = StringDocument(str)
A StringDocument{String}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: The quick brown fox jumps over the lazy dog
julia> remove_case!(sd)
julia> sd.text
"the quick brown fox jumps over the lazy dog"
See also: remove_case

TextAnalysis.remove_case — Method

remove_case(str)

Convert str to lowercase.

See also: remove_case!
TextAnalysis.remove_corrupt_utf8! — Method

remove_corrupt_utf8!(doc)
remove_corrupt_utf8!(crps)

Remove corrupt UTF8 characters from doc or the documents in crps. Does not support FileDocument or a Corpus containing FileDocument.

See also: remove_corrupt_utf8

TextAnalysis.remove_corrupt_utf8 — Method

remove_corrupt_utf8(str)

Remove corrupt UTF8 characters in str.

See also: remove_corrupt_utf8!
TextAnalysis.remove_frequent_terms! — Function

remove_frequent_terms!(crps, alpha=0.95)

Remove terms in crps occurring in more than alpha percent of the documents.
Example
julia> crps = Corpus([StringDocument("This is Document 1"),
StringDocument("This is Document 2")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> remove_frequent_terms!(crps)
julia> text(crps[1])
" 1"
julia> text(crps[2])
" 2"
See also: remove_sparse_terms!, frequent_terms
TextAnalysis.remove_html_tags! — Method

remove_html_tags!(doc::StringDocument)
remove_html_tags!(crps)

Remove HTML tags from the StringDocument or the documents in crps. Does not work for documents other than StringDocument.
Example
julia> html_doc = StringDocument(
"
<html>
<head><script language="javascript">x = 20;</script></head>
<body>
<h1>Hello</h1><a href="world">world</a>
</body>
</html>
"
)
A StringDocument{String}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: <html> <head><s
julia> remove_html_tags!(html_doc)
julia> strip(text(html_doc))
"Hello world"
See also: remove_html_tags

TextAnalysis.remove_html_tags — Method

remove_html_tags(str)

Remove HTML tags from str, including the style and script tags.

See also: remove_html_tags!
TextAnalysis.remove_patterns! — Method

remove_patterns!(doc, rex::Regex)
remove_patterns!(crps, rex::Regex)

Remove patterns matched by rex in the document or Corpus. Does not modify FileDocument or a Corpus containing FileDocument.

See also: remove_patterns

TextAnalysis.remove_patterns — Method

remove_patterns(str, rex::Regex)

Remove the part of str matched by rex.

See also: remove_patterns!
TextAnalysis.remove_sparse_terms! — Function

remove_sparse_terms!(crps, alpha=0.05)

Remove sparse terms in crps occurring in less than alpha percent of the documents.
Example
julia> crps = Corpus([StringDocument("This is Document 1"),
StringDocument("This is Document 2")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> remove_sparse_terms!(crps, 0.5)
julia> crps[1].text
"This is Document "
julia> crps[2].text
"This is Document "
See also: remove_frequent_terms!, sparse_terms
TextAnalysis.remove_whitespace! — Method

remove_whitespace!(doc)
remove_whitespace!(crps)

Squash multiple whitespaces to a single space and remove all leading and trailing whitespace in the document or crps. Is a no-op for FileDocument, TokenDocument or NGramDocument.

See also: remove_whitespace

TextAnalysis.remove_whitespace — Method

remove_whitespace(str)

Squash multiple whitespaces to a single one and remove all leading and trailing whitespace.

See also: remove_whitespace!
TextAnalysis.remove_words! — Method

remove_words!(doc, words::Vector{AbstractString})
remove_words!(crps, words::Vector{AbstractString})

Remove the occurrences of words from doc or crps.
Example
julia> str="The quick brown fox jumps over the lazy dog"
julia> sd=StringDocument(str);
julia> remove_words = ["fox", "over"]
julia> remove_words!(sd, remove_words)
julia> sd.text
"the quick brown jumps the lazy dog"
TextAnalysis.rouge_l_sentence — Function

rouge_l_sentence(
    references::Vector{<:AbstractString}, candidate::AbstractString, β=8;
    weighted=false, weight_func=sqrt,
    lang=Languages.English()
)::Vector{Score}

Calculate the ROUGE-L score between references and candidate at the sentence level.

Returns a vector of Score.

See Rouge: A package for automatic evaluation of summaries

Note: the weighted argument enables weighting of values when calculating the longest common subsequence. The initial implementation, ROUGE-1.5.5.pl, uses a power function; the weight_func here has a power of 0.5 by default (sqrt).

See also: rouge_n, rouge_l_summary
TextAnalysis.rouge_l_summary — Method

rouge_l_summary(
    references::Vector{<:AbstractString}, candidate::AbstractString, β::Int;
    lang=Languages.English()
)::Vector{Score}

Calculate the ROUGE-L score between references and candidate at the summary level.

Returns a vector of Score.

See Rouge: A package for automatic evaluation of summaries

See also: rouge_l_sentence(), rouge_n
TextAnalysis.rouge_n — Method

rouge_n(
    references::Vector{<:AbstractString},
    candidate::AbstractString,
    n::Int;
    lang::Language
)::Vector{Score}

Compute the n-gram recall between candidate and the references summaries.

Arguments

- references::Vector{T} where T<:AbstractString = The list of reference summaries.
- candidate::AbstractString = Input candidate summary, to be scored against the reference summaries.
- n::Integer = Order of the n-grams.
- lang::Language = Language of the text, useful while generating n-grams. Default value is Languages.English().

Returns a vector of Score.

See Rouge: A package for automatic evaluation of summaries

See also: rouge_l_sentence, rouge_l_summary
TextAnalysis.score — Function

score(m::MLE, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)

score is used to output the probability of word given the context in an MLE model.
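A usage sketch based on my understanding of the package's language-model workflow; the step of calling the model object on training tokens to obtain the fitted DefaultDict is an assumption, and the returned value is omitted:

julia> voc = ["my", "name", "is", "salman", "khan", "and", "he"];

julia> train = ["khan", "is", "my", "good", "friend", "and", "he", "is", "my", "brother"];

julia> model = MLE(voc);

julia> fit = model(train, 2, 2);  # assumed fitting call, bigrams only

julia> score(model, fit, "my", "is")  # P("my" | "is") under the fitted model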
TextAnalysis.score — Function

score(m::InterpolatedLanguageModel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)

score is used to output the probability of word given the context in an InterpolatedLanguageModel.

Applies Kneser-Ney or Witten-Bell smoothing depending on the subtype.

TextAnalysis.score — Method

score(m::gammamodel, temp_lm::DefaultDict, word::AbstractString, context::AbstractString)

score is used to output the probability of word given the context.

Applies add-one smoothing to Lidstone or Laplace (gammamodel) models.
TextAnalysis.sentence_tokenize — Method

sentence_tokenize(language, str)

Split str into sentences.
Example
julia> sentence_tokenize(Languages.English(), "Here are few words! I am Foo Bar.")
2-element Array{SubString{String},1}:
"Here are few words!"
"I am Foo Bar."
See also: tokenize
TextAnalysis.sparse_terms — Function

sparse_terms(crps, alpha=0.05)

Find the sparse terms from a Corpus, occurring in less than alpha percent of the documents.
Example
julia> crps = Corpus([StringDocument("This is Document 1"),
StringDocument("This is Document 2")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> sparse_terms(crps, 0.5)
2-element Array{String,1}:
"1"
"2"
See also: remove_sparse_terms!, frequent_terms
TextAnalysis.standardize! — Method

standardize!(crps::Corpus, ::Type{T}) where T <: AbstractDocument
Standardize the documents in a Corpus to a common type.
Example
julia> crps = Corpus([StringDocument("Document 1"),
TokenDocument("Document 2"),
NGramDocument("Document 3")])
A Corpus with 3 documents:
* 1 StringDocument's
* 0 FileDocument's
* 1 TokenDocument's
* 1 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> standardize!(crps, NGramDocument)
# After this step, you can check that the corpus only contains NGramDocument's:
julia> crps
A Corpus with 3 documents:
* 0 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 3 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
TextAnalysis.stem! — Method

stem!(doc)
stem!(crps)

Stems the document or the documents in crps with a suitable stemmer.

Stemming cannot be done for FileDocument or a Corpus made of these types of documents.

TextAnalysis.stem! — Method

stem!(crps::Corpus)

Stem an entire corpus. Assumes all documents in the corpus have the same language (picked from the first).
TextAnalysis.stemmer_for_document — Method

stemmer_for_document(doc)
Search for an appropriate stemmer based on the language of the document.
TextAnalysis.summarize — Method

summarize(doc [, ns])

Summarizes the document and returns ns sentences. It takes 2 arguments:

- d: A document of type StringDocument, FileDocument or TokenDocument
- ns: (Optional) Number of sentences in the summary; defaults to 5.
Example
julia> s = StringDocument("Assume this Short Document as an example. Assume this as an example summarizer. This has too foo sentences.")
julia> summarize(s, ns=2)
2-element Array{SubString{String},1}:
"Assume this Short Document as an example."
"This has too foo sentences."
TextAnalysis.tag_scheme! — Method

tag_scheme!(tags, current_scheme::String, new_scheme::String)

Convert tags from current_scheme to new_scheme.

List of tagging schemes currently supported:

- BIO1 (BIO)
- BIO2
- BIOES
Example
julia> tags = ["I-LOC", "O", "I-PER", "B-MISC", "I-MISC", "B-PER", "I-PER", "I-PER"]
julia> tag_scheme!(tags, "BIO1", "BIOES")
julia> tags
8-element Array{String,1}:
"S-LOC"
"O"
"S-PER"
"B-MISC"
"E-MISC"
"B-PER"
"I-PER"
"E-PER"
TextAnalysis.text — Method

text(fd::FileDocument)
text(sd::StringDocument)
text(ngd::NGramDocument)

Access the text of a Document as a string.
Example
julia> sd = StringDocument("To be or not to be...")
A StringDocument{String}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: To be or not to be...
julia> text(sd)
"To be or not to be..."
TextAnalysis.tf! — Method

tf!(dtm::SparseMatrixCSC{Real}, tf::SparseMatrixCSC{AbstractFloat})

Overwrite tf with the term frequency of the dtm.

tf should have the same nonzeros as dtm.

TextAnalysis.tf! — Method

tf!(dtm::AbstractMatrix{Real}, tf::AbstractMatrix{AbstractFloat})

Overwrite tf with the term frequency of the dtm.

Works correctly if dtm and tf are the same matrix.
TextAnalysis.tf — Method

tf(dtm::DocumentTermMatrix)
tf(dtm::SparseMatrixCSC{Real})
tf(dtm::Matrix{Real})

Compute the term frequency of the input.
Example
julia> crps = Corpus([StringDocument("To be or not to be"),
StringDocument("To become or not to become")])
julia> update_lexicon!(crps)
julia> m = DocumentTermMatrix(crps)
julia> tf(m)
2×6 SparseArrays.SparseMatrixCSC{Float64,Int64} with 10 stored entries:
[1, 1] = 0.166667
[2, 1] = 0.166667
[1, 2] = 0.333333
[2, 3] = 0.333333
[1, 4] = 0.166667
[2, 4] = 0.166667
[1, 5] = 0.166667
[2, 5] = 0.166667
[1, 6] = 0.166667
[2, 6] = 0.166667
TextAnalysis.tf_idf! — Method

tf_idf!(dtm)

Compute tf-idf for dtm in place.

TextAnalysis.tf_idf! — Method

tf_idf!(dtm::SparseMatrixCSC{Real}, tfidf::SparseMatrixCSC{AbstractFloat})

Overwrite tfidf with the tf-idf (Term Frequency - Inverse Document Frequency) of the dtm.

The arguments must have the same number of nonzeros.

TextAnalysis.tf_idf! — Method

tf_idf!(dtm::AbstractMatrix{Real}, tf_idf::AbstractMatrix{AbstractFloat})

Overwrite tf_idf with the tf-idf (Term Frequency - Inverse Document Frequency) of the dtm.

dtm and tf_idf must be matrices of the same dimensions.
TextAnalysis.tf_idf — Method

tf_idf(dtm::DocumentTermMatrix)
tf_idf(dtm::SparseMatrixCSC{Real})
tf_idf(dtm::Matrix{Real})

Compute the tf-idf value (Term Frequency - Inverse Document Frequency) for the input.

In many cases, raw word counts are not appropriate for use because:

- Some documents are longer than other documents
- Some words are more frequent than other words

A simple workaround is to compute TF-IDF on a DocumentTermMatrix.
Example
julia> crps = Corpus([StringDocument("To be or not to be"),
StringDocument("To become or not to become")])
julia> update_lexicon!(crps)
julia> m = DocumentTermMatrix(crps)
julia> tf_idf(m)
2×6 SparseArrays.SparseMatrixCSC{Float64,Int64} with 10 stored entries:
[1, 1] = 0.0
[2, 1] = 0.0
[1, 2] = 0.231049
[2, 3] = 0.231049
[1, 4] = 0.0
[2, 4] = 0.0
[1, 5] = 0.0
[2, 5] = 0.0
[1, 6] = 0.0
[2, 6] = 0.0
TextAnalysis.timestamp! — Method

timestamp!(doc, timestamp::AbstractString)

Set the timestamp metadata of doc to timestamp.

See also: timestamp, timestamps, timestamps!
TextAnalysis.timestamp — Method

TextAnalysis.timestamps! — Method

timestamps!(crps, times::Vector{String})
timestamps!(crps, time::AbstractString)

Set the timestamps of the documents in crps to the timestamps in times, respectively.

See also: timestamps, timestamp!, timestamp
TextAnalysis.timestamps — Method

timestamps(crps)

Return the timestamps for each document in crps.

See also: timestamps!, timestamp, timestamp!
TextAnalysis.title! — Method

TextAnalysis.title — Method

TextAnalysis.titles! — Method

titles!(crps, vec::Vector{String})
titles!(crps, str)

Update the titles of the documents in a Corpus.

If the input is a String, set the same title for all documents. If the input is a vector, set the title of the ith document to the corresponding ith element in the vector vec. In the latter case, the number of documents must equal the length of the vector.
TextAnalysis.titles — Method

TextAnalysis.tokenize — Method

tokenize(language, str)

Split str into words and other tokens such as punctuation.
Example
julia> tokenize(Languages.English(), "Too foo words!")
4-element Array{String,1}:
"Too"
"foo"
"words"
"!"
See also: sentence_tokenize
TextAnalysis.tokens — Method

tokens(d::TokenDocument)
tokens(d::(Union{FileDocument, StringDocument}))
Access the document text as a token array.
Example
julia> sd = StringDocument("To be or not to be...")
A StringDocument{String}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: To be or not to be...
julia> tokens(sd)
7-element Array{String,1}:
"To"
"be"
"or"
"not"
"to"
"be.."
"."
TextAnalysis.update — Method

update(vocab::Vocabulary, words) -> Dict{String, Int64}

Update the vocabulary's counts with words and return the updated counts dictionary.

See Vocabulary
TextAnalysis.weighted_lcs — Function

weighted_lcs(X, Y, weight_score::Bool, returns_string::Bool, weigthing_function::Function)
Compute the Weighted Longest Common Subsequence of X and Y.
TextAnalysis.CooMatrix — Type

Basic Co-occurrence Matrix (COOM) type.

Fields

- coom::SparseMatrixCSC{T,Int} the actual COOM; elements represent co-occurrences of two terms within a given window
- terms::Vector{String} a list of terms that represent the lexicon of the document or corpus
- column_indices::OrderedDict{String, Int} a map between the terms and the columns of the co-occurrence matrix
TextAnalysis.CooMatrix — Method

CooMatrix{T}(crps::Corpus [,terms] [;window=5, normalize=true])

Auxiliary constructor(s) of the CooMatrix type. The type T has to be a subtype of AbstractFloat. The constructor(s) requires a corpus crps and a terms structure representing the lexicon of the corpus. The latter can be a Vector{String}, an AbstractDict where the keys are the lexicon, or can be omitted, in which case the lexicon field of the corpus is used.
TextAnalysis.Corpus — Method

Corpus(docs::Vector{T}) where {T <: AbstractDocument}
Collections of documents are represented using the Corpus type.
Example
julia> crps = Corpus([StringDocument("Document 1"),
StringDocument("Document 2")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
TextAnalysis.DocumentMetadata — Type

DocumentMetadata(
    language::Language,
    title::String,
    author::String,
    timestamp::String,
    custom::Any
)

Stores basic metadata about a Document.

Arguments

- language: What language is the document in? Defaults to Languages.English(), a Language instance defined by the Languages package.
- title::String: What is the title of the document? Defaults to "Untitled Document".
- author::String: Who wrote the document? Defaults to "Unknown Author".
- timestamp::String: When was the document written? Defaults to "Unknown Time".
- custom: user-specific data field. Defaults to nothing.
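A construction sketch using the documented field order; the values are illustrative:

julia> using TextAnalysis, Languages

julia> md = DocumentMetadata(Languages.English(), "My Title", "Jane Doe", "2024-01-01", nothing);

julia> md.title
"My Title"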
TextAnalysis.DocumentTermMatrix — Method

DocumentTermMatrix(crps::Corpus)
DocumentTermMatrix(crps::Corpus, terms::Vector{String})
DocumentTermMatrix(crps::Corpus, lex::AbstractDict)
DocumentTermMatrix(dtm::SparseMatrixCSC{Int, Int}, terms::Vector{String})

Represent documents as a matrix of word counts.

This allows us to apply linear algebra operations and statistical techniques. The lexicon needs to be updated before use.
Examples
julia> crps = Corpus([StringDocument("To be or not to be"),
StringDocument("To become or not to become")])
julia> update_lexicon!(crps)
julia> m = DocumentTermMatrix(crps)
A 2 X 6 DocumentTermMatrix
julia> m.dtm
2×6 SparseArrays.SparseMatrixCSC{Int64,Int64} with 10 stored entries:
[1, 1] = 1
[2, 1] = 1
[1, 2] = 2
[2, 3] = 2
[1, 4] = 1
[2, 4] = 1
[1, 5] = 1
[2, 5] = 1
[1, 6] = 1
[2, 6] = 1
TextAnalysis.FileDocument — Method

FileDocument(pathname::AbstractString)
Represents a document using a plain text file on disk.
Example
julia> pathname = "/usr/share/dict/words"
"/usr/share/dict/words"
julia> fd = FileDocument(pathname)
A FileDocument
* Language: Languages.English()
* Title: /usr/share/dict/words
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: A A's AMD AMD's AOL AOL's Aachen Aachen's Aaliyah
TextAnalysis.KneserNeyInterpolated — Method

KneserNeyInterpolated(word::Vector{T}, discount::Float64, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Initiate a type for providing a Kneser-Ney interpolated language model.

The idea to abstract this comes from Chen & Goodman 1995.
TextAnalysis.Laplace — Type

Laplace(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Function to initiate the Laplace type for providing Laplace-smoothed scores.

In addition to the initialization arguments from BaseNgramModel, it also requires a number by which to increase the counts, gamma = 1.
TextAnalysis.Lidstone — Method

Lidstone(word::Vector{T}, gamma::Float64, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Function to initiate the Lidstone type for providing Lidstone-smoothed scores.

In addition to the initialization arguments from BaseNgramModel, it also requires a number by which to increase the counts, gamma.
TextAnalysis.MLE — Method

MLE(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Initiate a type for providing MLE ngram model scores.

Implementation of the base ngram model.
TextAnalysis.NGramDocument — Method

NGramDocument(txt::AbstractString, n::Integer=1)
NGramDocument(txt::AbstractString, dm::DocumentMetadata, n::Integer=1)
NGramDocument(ng::Dict{T, Int}, n::Integer=1) where T <: AbstractString

Represents a document as a bag of UTF8 n-grams mapped to counts.
Example
julia> my_ngrams = Dict{String, Int}("To" => 1, "be" => 2,
"or" => 1, "not" => 1,
"to" => 1, "be..." => 1)
Dict{String,Int64} with 6 entries:
"or" => 1
"be..." => 1
"not" => 1
"to" => 1
"To" => 1
"be" => 2
julia> ngd = NGramDocument(my_ngrams)
A NGramDocument{AbstractString}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: ***SAMPLE TEXT NOT AVAILABLE***
TextAnalysis.NaiveBayesClassifier — Method

NaiveBayesClassifier([dict, ]classes)

A Naive Bayes Classifier for classifying documents.

It takes two arguments:

- classes: An array of possible classes that the concerned data could belong to.
- dict: (Optional Argument) An Array of possible tokens (words). This is automatically updated if a new token is detected during fitting or prediction.
Example
julia> using TextAnalysis: NaiveBayesClassifier, fit!, predict
julia> m = NaiveBayesClassifier([:spam, :non_spam])
NaiveBayesClassifier{Symbol}(String[], [:spam, :non_spam], Matrix{Int64}(undef, 0, 2))
julia> fit!(m, "this is spam", :spam)
NaiveBayesClassifier{Symbol}(["this", "is", "spam"], [:spam, :non_spam], [2 1; 2 1; 2 1])
julia> fit!(m, "this is not spam", :non_spam)
NaiveBayesClassifier{Symbol}(["this", "is", "spam", "not"], [:spam, :non_spam], [2 2; 2 2; 2 2; 1 2])
julia> predict(m, "is this a spam")
Dict{Symbol, Float64} with 2 entries:
:spam => 0.59883
:non_spam => 0.40117
TextAnalysis.Score — Type

struct Score
    precision::Float32
    recall::Float32
    fmeasure::Float32
end

TextAnalysis.Score — Method

Score(
    precision::AbstractFloat,
    recall::AbstractFloat,
    fmeasure::AbstractFloat
) -> Score

Stores a result of evaluation.

TextAnalysis.Score — Method

Score(; precision, recall, fmeasure) -> Score
TextAnalysis.StringDocument — Method

StringDocument(txt::AbstractString)
Represents a document using a UTF8 String stored in RAM.
Example
julia> str = "To be or not to be..."
"To be or not to be..."
julia> sd = StringDocument(str)
A StringDocument{String}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: To be or not to be...
TextAnalysis.TextHashFunction — Method

TextHashFunction(cardinality)
TextHashFunction(hash_function, cardinality)

The need to create a lexicon before we can construct a document term matrix is often prohibitive. We can often employ a trick that has come to be called the Hash Trick, in which we replace terms with their hashed values using a hash function that outputs integers from 1 to N.

Parameters:
- cardinality = Max index used for hashing (default 100)
- hash_function = function used for the hashing process (default function present, see code-base)
julia> h = TextHashFunction(10)
TextHashFunction(hash, 10)
TextAnalysis.TokenDocument — Method

TokenDocument(txt::AbstractString)
TokenDocument(txt::AbstractString, dm::DocumentMetadata)
TokenDocument(tkns::Vector{T}) where T <: AbstractString
Represents a document as a sequence of UTF8 tokens.
Example
julia> my_tokens = String["To", "be", "or", "not", "to", "be..."]
6-element Array{String,1}:
"To"
"be"
"or"
"not"
"to"
"be..."
julia> td = TokenDocument(my_tokens)
A TokenDocument{String}
* Language: Languages.English()
* Title: Untitled Document
* Author: Unknown Author
* Timestamp: Unknown Time
* Snippet: ***SAMPLE TEXT NOT AVAILABLE***
TextAnalysis.Vocabulary — Type

Vocabulary(word, unk_cutoff=1, unk_label="<unk>")

Stores the language model vocabulary. Satisfies two common language modeling requirements for a vocabulary:

- When checking membership and calculating its size, filters items by comparing their counts to a cutoff value.
- Adds a special "unknown" token which unseen words are mapped to.
Example
julia> words = ["a", "c", "-", "d", "c", "a", "b", "r", "a", "c", "d"]
julia> vocabulary = Vocabulary(words, 2)
Vocabulary(Dict("<unk>"=>1,"c"=>3,"a"=>3,"d"=>2), 2, "<unk>")
julia> vocabulary.vocab
Dict{String,Int64} with 4 entries:
"<unk>" => 1
"c" => 3
"a" => 3
"d" => 2
Tokens with counts greater than or equal to the cutoff value will
be considered part of the vocabulary.
julia> vocabulary.vocab["c"]
3
julia> "c" in keys(vocabulary.vocab)
true
julia> vocabulary.vocab["d"]
2
julia> "d" in keys(vocabulary.vocab)
true
Tokens with frequency counts less than the cutoff value will be considered not
part of the vocabulary even though their entries in the count dictionary are
preserved.
julia> "b" in keys(vocabulary.vocab)
false
julia> "<unk>" in keys(vocabulary.vocab)
true
We can look up words in a vocabulary using its lookup method. "Unseen" words (with counts less than the cutoff) are looked up as the unknown label. If given one word (a string) as input, this method returns a string.
julia> lookup(vocabulary, "a")
"a"
julia> word = ["a", "-", "d", "c", "a"]
julia> lookup(vocabulary ,word)
5-element Array{Any,1}:
"a"
"<unk>"
"d"
"c"
"a"
If given a sequence, it will return an Array{Any,1} of the looked up words as shown above.
It's possible to update the counts after the vocabulary has been created.
julia> update(vocabulary,["b","c","c"])
1
julia> vocabulary.vocab["b"]
1
TextAnalysis.Vocabulary — Method

Vocabulary(word::Array{T<:AbstractString, 1}) -> Vocabulary
Vocabulary(
word::Array{T<:AbstractString, 1},
unk_cutoff
) -> Vocabulary
Vocabulary(
word::Array{T<:AbstractString, 1},
unk_cutoff,
unk_label
) -> Vocabulary
TextAnalysis.WittenBellInterpolated — Method

WittenBellInterpolated(word::Vector{T}, unk_cutoff=1, unk_label="<unk>") where {T <: AbstractString}

Initiate a type for providing the interpolated version of Witten-Bell smoothing.

The idea to abstract this comes from Chen & Goodman 1995.