Creating a Document Term Matrix

Often we want to represent documents as a matrix of word counts so that we can apply linear algebra operations and statistical techniques. Before we do this, we need to update the lexicon:

julia> using TextAnalysis

julia> crps = Corpus([StringDocument("To be or not to be"),
                      StringDocument("To become or not to become")])
A Corpus with 2 documents:
 * 2 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)
A 2 X 6 DocumentTermMatrix

A DocumentTermMatrix object is a wrapper around the underlying sparse counts. If you would like to work with a plain sparse matrix, call dtm() on this object:

julia> dtm(m)
2×6 SparseArrays.SparseMatrixCSC{Int64,Int64} with 10 stored entries:
  [1, 1]  =  1
  [2, 1]  =  1
  [1, 2]  =  2
  [2, 3]  =  2
  [1, 4]  =  1
  [2, 4]  =  1
  [1, 5]  =  1
  [2, 5]  =  1
  [1, 6]  =  1
  [2, 6]  =  1

If you would like to use a dense matrix instead, you can pass this as an argument to the dtm function:

julia> dtm(m, :dense)
2×6 Array{Int64,2}:
 1  2  0  1  1  1
 1  0  2  1  1  1
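
The six columns correspond to the terms of the lexicon in sorted order: "To", "be", "become", "not", "or" and "to" (tokenization here is case-sensitive, so "To" and "to" are distinct terms). The ordering can be inspected via m.terms.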

Creating Individual Rows of a Document Term Matrix

In many cases, we don't need the entire document term matrix at once: we can make do with just a single row. You can get this using the dtv function. Because individual documents do not have a lexicon associated with them, we have to pass in a lexicon as an additional argument:

julia> dtv(crps[1], lexicon(crps))
1×6 Array{Int64,2}:
 1  2  0  1  1  1
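
Note that this row matches the first row of the dense document term matrix shown above.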

The Hash Trick

The need to create a lexicon before we can construct a document term matrix is often prohibitive. We can often employ a trick that has come to be called the "Hash Trick", in which we replace terms with their hashed value using a hash function that outputs integers from 1 to N. To construct such a hash function, you can use the TextHashFunction(N) constructor:

julia> h = TextHashFunction(10)
TextHashFunction(hash, 10)

You can see how this function maps strings to numbers by calling the index_hash function:

julia> index_hash("a", h)
8

julia> index_hash("b", h)
7
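
Conceptually, such a function just reduces a string hash into the range 1 to N. A minimal sketch of the idea (the package's exact scheme may differ; index_for is a hypothetical helper, not part of the API):

# Hypothetical illustration: map any string into a 1-based bucket index in 1:N.
index_for(s::AbstractString, N::Integer) = Int(mod(hash(s), N)) + 1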

Using a text hash function, we can represent a document as a vector with N entries by calling the hash_dtv function:

julia> hash_dtv(crps[1], h)
1×10 Array{Int64,2}:
 0  2  0  0  1  3  0  0  0  0
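
Note the effect of hash collisions here: the first document contains five distinct terms ("To", "be", "or", "not" and "to"), but only three buckets are nonzero, because some terms hashed to the same bucket and their counts were summed.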

This can be done for a corpus as a whole to construct a DTM without defining a lexicon in advance:

julia> hash_dtm(crps, h)
2×10 Array{Int64,2}:
 0  2  0  0  1  3  0  0  0  0
 0  2  0  0  1  1  0  0  2  0

Every corpus has a hash function built-in, so this function can be called using just one argument:

julia> hash_dtm(crps)
2×100 Array{Int64,2}:
 0  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  2  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0

Moreover, if you request a single row of the hash DTM without specifying a hash function, a default one will be constructed for you:

julia> hash_dtv(crps[1])
1×100 Array{Int64,2}:
 0  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  0  0

TF (Term Frequency)

Often we need to find out what proportion of a document is contributed by each term. This can be done using the term frequency function:

TextAnalysis.tf — Function
tf(dtm::DocumentTermMatrix)
tf(dtm::SparseMatrixCSC{Real})
tf(dtm::Matrix{Real})

Compute the term-frequency of the input.

Example

julia> crps = Corpus([StringDocument("To be or not to be"),
              StringDocument("To become or not to become")])

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)

julia> tf(m)
2×6 SparseArrays.SparseMatrixCSC{Float64,Int64} with 10 stored entries:
  [1, 1]  =  0.166667
  [2, 1]  =  0.166667
  [1, 2]  =  0.333333
  [2, 3]  =  0.333333
  [1, 4]  =  0.166667
  [2, 4]  =  0.166667
  [1, 5]  =  0.166667
  [2, 5]  =  0.166667
  [1, 6]  =  0.166667
  [2, 6]  =  0.166667

See also: tf!, tf_idf, tf_idf!


The dtm parameter can be of type DocumentTermMatrix, SparseMatrixCSC, or Matrix:

julia> using TextAnalysis

julia> crps = Corpus([StringDocument("To be or not to be"),
                      StringDocument("To become or not to become")])
A Corpus with 2 documents:
 * 2 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)
A 2 X 6 DocumentTermMatrix

julia> tf(m)
2×6 SparseArrays.SparseMatrixCSC{Float64, Int64} with 10 stored entries:
 0.166667  0.333333   ⋅        0.166667  0.166667  0.166667
 0.166667   ⋅        0.333333  0.166667  0.166667  0.166667
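
Each entry is the term's count divided by the total number of tokens in the document: "be" occurs twice among the six tokens of the first document, giving 2/6 ≈ 0.333333, while terms occurring once give 1/6 ≈ 0.166667.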

TF-IDF (Term Frequency - Inverse Document Frequency)

TextAnalysis.tf_idf — Function
tf_idf(dtm::DocumentTermMatrix)
tf_idf(dtm::SparseMatrixCSC{Real})
tf_idf(dtm::Matrix{Real})

Compute tf-idf value (Term Frequency - Inverse Document Frequency) for the input.

In many cases, raw word counts are not appropriate for use because:

  • Some documents are longer than other documents
  • Some words are more frequent than other words

A simple workaround is to perform TF-IDF on a DocumentTermMatrix.

Example

julia> crps = Corpus([StringDocument("To be or not to be"),
              StringDocument("To become or not to become")])

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)

julia> tf_idf(m)
2×6 SparseArrays.SparseMatrixCSC{Float64,Int64} with 10 stored entries:
  [1, 1]  =  0.0
  [2, 1]  =  0.0
  [1, 2]  =  0.231049
  [2, 3]  =  0.231049
  [1, 4]  =  0.0
  [2, 4]  =  0.0
  [1, 5]  =  0.0
  [2, 5]  =  0.0
  [1, 6]  =  0.0
  [2, 6]  =  0.0

See also: tf, tf!, tf_idf!


You can work around these issues by performing TF-IDF on a DocumentTermMatrix:

julia> using TextAnalysis

julia> crps = Corpus([StringDocument("To be or not to be"),
                      StringDocument("To become or not to become")])
A Corpus with 2 documents:
 * 2 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)
A 2 X 6 DocumentTermMatrix

julia> tf_idf(m)
2×6 SparseArrays.SparseMatrixCSC{Float64, Int64} with 10 stored entries:
 0.0  0.231049   ⋅        0.0  0.0  0.0
 0.0   ⋅        0.231049  0.0  0.0  0.0

As you can see, TF-IDF has the effect of inserting 0's into the columns of words that occur in all documents. This is a useful way to avoid having to remove those words during preprocessing.
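
The nonzero entries are consistent with weighting each term frequency by log(n / df), where n is the number of documents and df is the number of documents containing the term. For example, "be" has term frequency 1/3 in the first document and occurs in one of the two documents:

julia> (1/3) * log(2/1)  # tf of "be" in doc 1 times its idf
0.23104906018664842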

Okapi BM-25

From the document term matrix, the Okapi BM25 document-word statistic can be computed.

bm_25(dtm::AbstractMatrix; κ, β)
bm_25(dtm::DocumentTermMatrix, κ, β)

A mutating method is also available, which overwrites bm25 with the calculated weights:

bm_25!(dtm, bm25, κ, β)

The input matrices can also be sparse. The parameters κ and β default to 2 and 0.75, respectively.
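
For reference, κ and β play the roles of the k1 and b parameters in the classic Okapi BM25 weighting. The following is a minimal sketch of the textbook formula, not the package's implementation (whose exact idf variant and length normalization may differ); bm25_weight is a hypothetical helper, not part of the API:

# Textbook Okapi BM25 weight of a term in a document (illustrative only).
# freq: term count in the document; doclen: document length in tokens;
# avgdl: average document length; ndocs: corpus size; docfreq: number of
# documents containing the term.
function bm25_weight(freq, doclen, avgdl, ndocs, docfreq; κ = 2, β = 0.75)
    idf = log((ndocs - docfreq + 0.5) / (docfreq + 0.5) + 1)
    return idf * freq * (κ + 1) / (freq + κ * (1 - β + β * doclen / avgdl))
end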

Here is an example usage:

julia> using TextAnalysis

julia> crps = Corpus([StringDocument("a a a sample text text"),
                      StringDocument("another example example text text"),
                      StringDocument(""),
                      StringDocument("another another text text text text")])
A Corpus with 4 documents:
 * 4 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)
A 4 X 5 DocumentTermMatrix

julia> bm_25(m)
4×5 SparseArrays.SparseMatrixCSC{Float64, Int64} with 8 stored entries:
 1.29959    ⋅         ⋅        1.89031   0.405067
  ⋅        0.882404  1.54025    ⋅        0.405067
  ⋅         ⋅         ⋅         ⋅         ⋅
  ⋅        1.40179    ⋅         ⋅        0.676646

Co-occurrence Matrix (COOM)

The elements of the co-occurrence matrix indicate how many times two words co-occur in a (sliding) word window of a given size. The COOM can be calculated for objects of type Corpus and AbstractDocument (with the exception of NGramDocument).

CooMatrix(crps; window, normalize)
CooMatrix(doc; window, normalize)

It takes the following keyword arguments:

  • window::Integer: the size of the sliding window, defaulting to 5. The effective span of the window is 2 * window + 1, with window specifying how many words to consider to the left and to the right of the center one
  • normalize::Bool: whether to normalize counts by the distance between words, defaulting to true

It returns a CooMatrix structure, from which the matrix can be extracted using coom(::CooMatrix) and the terms via the terms field. Here is an example usage:

julia> using TextAnalysis

julia> crps = Corpus([StringDocument("this is a string document")])
A Corpus with 1 documents:
 * 1 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens

julia> C = CooMatrix(crps, window=1, normalize=false)
CooMatrix{Float64}(sparse([2, 5, 1, 4, 3, 5, 1, 4], [1, 1, 2, 3, 4, 4, 5, 5], [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0], 5, 5), ["string", "document", "this", "is", "a"], OrderedCollections.OrderedDict("string" => 1, "document" => 2, "this" => 3, "is" => 4, "a" => 5))

julia> coom(C)
5×5 SparseArrays.SparseMatrixCSC{Float64, Int64} with 8 stored entries:
  ⋅   2.0   ⋅    ⋅   2.0
 2.0   ⋅    ⋅    ⋅    ⋅
  ⋅    ⋅    ⋅   2.0   ⋅
  ⋅    ⋅   2.0   ⋅   2.0
 2.0   ⋅    ⋅   2.0   ⋅

julia> C.terms
5-element Vector{String}:
 "string"
 "document"
 "this"
 "is"
 "a"

It can also be called with a specific list of words/terms, in which case co-occurrences are calculated only for those terms; otherwise, it calculates the co-occurrence elements for all terms:

CooMatrix(crps, terms; window, normalize)
CooMatrix(doc, terms; window, normalize)

julia> C = CooMatrix(crps, ["this", "is", "a"], window=1, normalize=false)
CooMatrix{Float64}(
  [2, 1]  =  4.0
  [1, 2]  =  4.0
  [3, 2]  =  4.0
  [2, 3]  =  4.0, ["this", "is", "a"], OrderedCollections.OrderedDict("this"=>1,"is"=>2,"a"=>3))

The type of the weights, T, can also be specified for CooMatrix; T defaults to Float64.

CooMatrix{T}(crps; window, normalize) where T <: AbstractFloat
CooMatrix{T}(doc; window, normalize) where T <: AbstractFloat
CooMatrix{T}(crps, terms; window, normalize) where T <: AbstractFloat
CooMatrix{T}(doc, terms; window, normalize) where T <: AbstractFloat
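
For example, to get Float32 weights (a minimal usage sketch, reusing the crps corpus from the example above):

julia> C32 = CooMatrix{Float32}(crps, window=1, normalize=false);

julia> eltype(coom(C32))
Float32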

Remarks:

  • The sliding window used to count co-occurrences does not take sentence stops into consideration; however, it does respect document boundaries, i.e. it does not span across documents
  • The co-occurrence matrices of the documents in a corpus are summed up when calculating the matrix for an entire corpus
Note

The co-occurrence matrix does not work for an NGramDocument, or for a Corpus containing an NGramDocument.

julia> C = CooMatrix(NGramDocument("A document"), window=1, normalize=false) # fails; the document is an NGramDocument
ERROR: The tokens of an NGramDocument cannot be reconstructed

Summarizer

TextAnalysis offers a simple TextRank-based summarizer for its various document types.

TextAnalysis.summarize — Function
summarize(doc [, ns])

Summarizes the document and returns ns sentences. It takes 2 arguments:

  • doc : A document of type StringDocument, FileDocument or TokenDocument
  • ns : (Optional) The number of sentences in the summary; defaults to 5.

Example

julia> s = StringDocument("Assume this Short Document as an example. Assume this as an example summarizer. This has too foo sentences.")

julia> summarize(s, ns=2)
2-element Array{SubString{String},1}:
 "Assume this Short Document as an example."
 "This has too foo sentences."