Creating a Document Term Matrix
Often we want to represent documents as a matrix of word counts so that we can apply linear algebra operations and statistical techniques. Before we do this, we need to update the lexicon:
julia> using TextAnalysis

julia> crps = Corpus([StringDocument("To be or not to be"),
                      StringDocument("To become or not to become")])
A Corpus with 2 documents:
 * 2 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)
A 2 X 6 DocumentTermMatrix
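The rows correspond to documents and the columns to terms. To see which term each column stands for, you can inspect the terms field of the DocumentTermMatrix (an implementation detail of the current version, which stores the lexicon's terms in sorted order):

julia> m.terms
6-element Vector{String}:
 "To"
 "be"
 "become"
 "not"
 "or"
 "to"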
A DocumentTermMatrix object is a special type. If you would like to use a simple sparse matrix, call dtm() on this object:
julia> dtm(m)
2×6 SparseArrays.SparseMatrixCSC{Int64,Int64} with 10 stored entries:
[1, 1] = 1
[2, 1] = 1
[1, 2] = 2
[2, 3] = 2
[1, 4] = 1
[2, 4] = 1
[1, 5] = 1
[2, 5] = 1
[1, 6] = 1
[2, 6] = 1

If you would like to use a dense matrix instead, you can pass :dense as an argument to the dtm function:
julia> dtm(m, :dense)
2×6 Array{Int64,2}:
1 2 0 1 1 1
1 0 2 1 1 1

Creating Individual Rows of a Document Term Matrix

In many cases, we don't need the entire document term matrix at once: we can make do with just a single row. You can get this using the dtv function. Because individual documents do not have a lexicon associated with them, we have to pass in a lexicon as an additional argument:
julia> dtv(crps[1], lexicon(crps))
1×6 Array{Int64,2}:
1 2 0 1 1 1
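dtv works for any document in the corpus; the second document, for instance, reproduces the second row of the dense matrix shown above:

julia> dtv(crps[2], lexicon(crps))
1×6 Array{Int64,2}:
 1 0 2 1 1 1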
The Hash Trick

The need to create a lexicon before we can construct a document term matrix is often prohibitive. We can often employ a trick that has come to be called the "hash trick", in which we replace terms with their hashed values using a hash function that outputs integers from 1 to N. To construct such a hash function, you can use the TextHashFunction(N) constructor:
julia> h = TextHashFunction(10)
TextHashFunction(hash, 10)

You can see how this function maps strings to numbers by calling the index_hash function:
julia> index_hash("a", h)
8
julia> index_hash("b", h)
7

Using a text hash function, we can represent a document as a vector with N entries by calling the hash_dtv function:
julia> hash_dtv(crps[1], h)
1×10 Array{Int64,2}:
0 2 0 0 1 3 0 0 0 0
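With only N = 10 buckets, distinct terms can share an index, which is why one entry above is 3: it combines the counts of colliding terms. One quick way to inspect the collisions (a small sketch using the package's tokens function) is to hash each token individually:

julia> [(t, index_hash(t, h)) for t in tokens(crps[1])]  # token/bucket pairs; distinct tokens sharing a bucket are collisions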
This can be done for a corpus as a whole to construct a DTM without defining a lexicon in advance:

julia> hash_dtm(crps, h)
2×10 Array{Int64,2}:
0 2 0 0 1 3 0 0 0 0
0 2 0 0 1 1 0 0 2 0

Every corpus has a hash function built in, so this function can be called using just one argument:
julia> hash_dtm(crps)
2×100 Array{Int64,2}:
0 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 2 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0

Moreover, if you do not specify a hash function for just one row of the hash DTM, a default hash function will be constructed for you:
julia> hash_dtv(crps[1])
1×100 Array{Int64,2}:
0 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0

TF (Term Frequency)

Often we need to find out what proportion of a document is contributed by each term. This can be done with the term frequency function:
TextAnalysis.tf — Function

tf(dtm::DocumentTermMatrix)
tf(dtm::SparseMatrixCSC{Real})
tf(dtm::Matrix{Real})

Compute the term-frequency of the input.
Example
julia> crps = Corpus([StringDocument("To be or not to be"),
StringDocument("To become or not to become")])
julia> update_lexicon!(crps)
julia> m = DocumentTermMatrix(crps)
julia> tf(m)
2×6 SparseArrays.SparseMatrixCSC{Float64,Int64} with 10 stored entries:
[1, 1] = 0.166667
[2, 1] = 0.166667
[1, 2] = 0.333333
[2, 3] = 0.333333
[1, 4] = 0.166667
[2, 4] = 0.166667
[1, 5] = 0.166667
[2, 5] = 0.166667
[1, 6] = 0.166667
[2, 6] = 0.166667

The parameter dtm can be a DocumentTermMatrix, a SparseMatrixCSC, or a Matrix:
julia> using TextAnalysis

julia> crps = Corpus([StringDocument("To be or not to be"),
                      StringDocument("To become or not to become")])
A Corpus with 2 documents:
 * 2 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)
A 2 X 6 DocumentTermMatrix

julia> tf(m)
2×6 SparseArrays.SparseMatrixCSC{Float64, Int64} with 10 stored entries:
 0.166667  0.333333   ⋅        0.166667  0.166667  0.166667
 0.166667   ⋅        0.333333  0.166667  0.166667  0.166667
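Each stored value is simply the term's count divided by the document's total token count; for instance, the 0.333333 entry is the two occurrences of "be" over the six tokens of the first document:

julia> dtm(m)[1, 2] / 6   # 2 occurrences of "be" among 6 tokens
0.3333333333333333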
TF-IDF (Term Frequency - Inverse Document Frequency)
TextAnalysis.tf_idf — Function

tf_idf(dtm::DocumentTermMatrix)
tf_idf(dtm::SparseMatrixCSC{Real})
tf_idf(dtm::Matrix{Real})

Compute the TF-IDF value (Term Frequency - Inverse Document Frequency) for the input.
In many cases, raw word counts are not appropriate for use because:
- Some documents are longer than other documents
- Some words are more frequent than other words
A simple workaround is to compute TF-IDF on the DocumentTermMatrix.
Example
julia> crps = Corpus([StringDocument("To be or not to be"),
StringDocument("To become or not to become")])
julia> update_lexicon!(crps)
julia> m = DocumentTermMatrix(crps)
julia> tf_idf(m)
2×6 SparseArrays.SparseMatrixCSC{Float64,Int64} with 10 stored entries:
[1, 1] = 0.0
[2, 1] = 0.0
[1, 2] = 0.231049
[2, 3] = 0.231049
[1, 4] = 0.0
[2, 4] = 0.0
[1, 5] = 0.0
[2, 5] = 0.0
[1, 6] = 0.0
[2, 6] = 0.0

You can work around these issues by computing TF-IDF on a DocumentTermMatrix:
julia> using TextAnalysis

julia> crps = Corpus([StringDocument("To be or not to be"),
                      StringDocument("To become or not to become")])
A Corpus with 2 documents:
 * 2 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)
A 2 X 6 DocumentTermMatrix

julia> tf_idf(m)
2×6 SparseArrays.SparseMatrixCSC{Float64, Int64} with 10 stored entries:
 0.0  0.231049   ⋅        0.0  0.0  0.0
 0.0   ⋅        0.231049  0.0  0.0  0.0
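The 0.231049 entries are consistent with the common weighting tf × log(n / df): "be" accounts for two of the six tokens in the first document (tf = 1/3) and appears in one of the two documents (idf = log 2). Assuming that formulation:

julia> round((1/3) * log(2/1), digits=6)   # tf("be", doc 1) × idf("be")
0.231049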
As you can see, TF-IDF has the effect of inserting 0's into the columns of words that occur in all documents. This is a useful way to avoid having to remove those words during preprocessing.
Okapi BM-25
From the document term matrix, the Okapi BM25 document-word statistic can be computed.

bm_25(dtm::AbstractMatrix; κ, β)
bm_25(dtm::DocumentTermMatrix; κ, β)

It can also be used via the following mutating method, which overwrites bm25 with the calculated weights:

bm_25!(dtm, bm25, κ, β)

The input matrices can also be sparse. The parameters κ and β default to 2 and 0.75 respectively.

Here is an example:
julia> using TextAnalysis

julia> crps = Corpus([
           StringDocument("a a a sample text text"),
           StringDocument("another example example text text"),
           StringDocument(""),
           StringDocument("another another text text text text")
       ])
A Corpus with 4 documents:
 * 4 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)
A 4 X 5 DocumentTermMatrix

julia> bm_25(m)
4×5 SparseArrays.SparseMatrixCSC{Float64, Int64} with 8 stored entries:
 1.29959    ⋅         ⋅        1.89031  0.405067
  ⋅        0.882404  1.54025    ⋅       0.405067
  ⋅         ⋅         ⋅         ⋅        ⋅
  ⋅        1.40179    ⋅         ⋅       0.676646
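The κ and β keyword arguments let you tune the weighting; β controls how strongly scores are normalized by document length. This call merely restates the defaults, so it reproduces the matrix above:

julia> bm_25(m, κ=2, β=0.75)   # identical to bm_25(m)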
Co-occurrence matrix (COOM)

The elements of the co-occurrence matrix indicate how many times two words co-occur within a sliding word window of a given size. The COOM can be calculated for objects of type Corpus and AbstractDocument (with the exception of NGramDocument).
CooMatrix(crps; window, normalize)
CooMatrix(doc; window, normalize)

It takes the following keyword arguments:

- window::Integer: the window size, defaults to 5. The actual size of the sliding window is 2 * window + 1, with the keyword argument window specifying how many words to consider to the left and right of the center one
- normalize::Bool: whether to normalize counts by the distance between words, defaults to true
It returns a CooMatrix structure, from which the matrix can be extracted using coom(::CooMatrix); the terms can be extracted as well. Here is an example:
julia> using TextAnalysis

julia> crps = Corpus([StringDocument("this is a string document")])
A Corpus with 1 documents:
 * 1 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens

julia> C = CooMatrix(crps, window=1, normalize=false)
CooMatrix{Float64}(sparse([2, 5, 1, 4, 3, 5, 1, 4], [1, 1, 2, 3, 4, 4, 5, 5], [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0], 5, 5), ["string", "document", "this", "is", "a"], OrderedCollections.OrderedDict("string" => 1, "document" => 2, "this" => 3, "is" => 4, "a" => 5))

julia> coom(C)
5×5 SparseArrays.SparseMatrixCSC{Float64, Int64} with 8 stored entries:
  ⋅   2.0   ⋅    ⋅   2.0
 2.0   ⋅    ⋅    ⋅    ⋅
  ⋅    ⋅    ⋅   2.0   ⋅
  ⋅    ⋅   2.0   ⋅   2.0
 2.0   ⋅    ⋅   2.0   ⋅

julia> C.terms
5-element Vector{String}:
 "string"
 "document"
 "this"
 "is"
 "a"
It can also be called with a specific list of words / terms, in which case co-occurrence counts are calculated only for those terms; otherwise it calculates them for all the terms.
CooMatrix(crps, terms; window, normalize)
CooMatrix(doc, terms; window, normalize)

julia> C = CooMatrix(crps, ["this", "is", "a"], window=1, normalize=false)
CooMatrix{Float64}(
  [2, 1] = 4.0
  [1, 2] = 4.0
  [3, 2] = 4.0
  [2, 3] = 4.0, ["this", "is", "a"], OrderedCollections.OrderedDict("this" => 1, "is" => 2, "a" => 3))
The element type T of the weights can also be specified for CooMatrix; T defaults to Float64.
CooMatrix{T}(crps; window, normalize) where T <: AbstractFloat
CooMatrix{T}(doc; window, normalize) where T <: AbstractFloat
CooMatrix{T}(crps, terms; window, normalize) where T <: AbstractFloat
CooMatrix{T}(doc, terms; window, normalize) where T <: AbstractFloat
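For example, a single-precision co-occurrence matrix (a brief sketch of the typed constructors above) can be built like this:

julia> C32 = CooMatrix{Float32}(crps, window=1, normalize=false);

julia> eltype(coom(C32))   # the weights are now stored as Float32
Float32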
Remarks:

- The sliding window used to count co-occurrences does not take sentence stops into consideration; it does, however, respect document boundaries, i.e. it does not span across documents
- The co-occurrence matrices of the documents in a corpus are summed up when calculating the matrix for an entire corpus
The co-occurrence matrix does not work for an NGramDocument, or for a Corpus containing an NGramDocument:

julia> C = CooMatrix(NGramDocument("A document"), window=1, normalize=false) # fails, the document is an NGramDocument
ERROR: The tokens of an NGramDocument cannot be reconstructed

Summarizer
TextAnalysis offers a simple TextRank-based summarizer for its various document types.

TextAnalysis.summarize — Function

summarize(doc [, ns])

Summarizes the document and returns ns sentences. It takes 2 arguments:

- d: a document of type StringDocument, FileDocument, or TokenDocument
- ns: (optional) the number of sentences in the summary, defaults to 5
Example
julia> s = StringDocument("Assume this Short Document as an example. Assume this as an example summarizer. This has too foo sentences.")
julia> summarize(s, ns=2)
2-element Array{SubString{String},1}:
"Assume this Short Document as an example."
"This has too foo sentences."