Features · TextAnalysis

Creating a Document Term Matrix

Often we want to represent documents as a matrix of word counts so that we can apply linear algebra operations and statistical techniques. Before we do this, we need to update the lexicon:

julia> using TextAnalysis
julia> crps = Corpus([StringDocument("To be or not to be"),
                      StringDocument("To become or not to become")])
A Corpus with 2 documents:
 * 2 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> update_lexicon!(crps)
julia> m = DocumentTermMatrix(crps)
A 2 X 6 DocumentTermMatrix

A DocumentTermMatrix object is a special type. If you want to use a simple sparse matrix, call dtm() on this object:

julia> dtm(m)
2×6 SparseArrays.SparseMatrixCSC{Int64,Int64} with 10 stored entries:
  [1, 1]  =  1
  [2, 1]  =  1
  [1, 2]  =  2
  [2, 3]  =  2
  [1, 4]  =  1
  [2, 4]  =  1
  [1, 5]  =  1
  [2, 5]  =  1
  [1, 6]  =  1
  [2, 6]  =  1

If you want to use a dense matrix instead, you can pass this as an argument to the dtm function:

julia> dtm(m, :dense)
2×6 Matrix{Int64}:
 1  2  0  1  1  1
 1  0  2  1  1  1
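
To make the structure of the matrix concrete, here is a minimal plain-Julia sketch that rebuilds the same counts by hand. It is not the library's implementation: whitespace tokenization and a lexicon sorted by code point are assumptions chosen because they reproduce the output shown above.

```julia
# Plain-Julia sketch of a document-term matrix (not TextAnalysis's implementation).
docs = ["To be or not to be", "To become or not to become"]

# Assumption: tokens are whitespace-separated and case is preserved.
tokenized = [split(d) for d in docs]

# The lexicon is the sorted set of distinct tokens across all documents.
lexicon = sort(unique(Iterators.flatten(tokenized)))

# Map each term to its column index.
col = Dict(term => j for (j, term) in enumerate(lexicon))

# Rows are documents, columns are terms, entries are raw counts.
counts = zeros(Int, length(docs), length(lexicon))
for (i, tokens) in enumerate(tokenized), t in tokens
    counts[i, col[t]] += 1
end
```

Each row of counts corresponds to one document and each column to one lexicon entry, which is why the lexicon must exist before the matrix can be built.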

Creating Individual Rows of a Document Term Matrix

In many cases, we don't need the entire document term matrix at once: we can make do with just a single row. You can get this using the dtv function. Because individual documents do not have a lexicon associated with them, we have to pass in a lexicon as an additional argument:

julia> dtv(crps[1], lexicon(crps))
1×6 Matrix{Int64}:
 1  2  0  1  1  1

The Hash Trick

The need to create a lexicon before we can construct a document term matrix is often prohibitive. We can employ a trick called the "Hash Trick" in which we replace terms with their hashed values using a hash function that outputs integers from 1 to N. To construct such a hash function, you can use the TextHashFunction(N) constructor:

julia> h = TextHashFunction(10)
TextHashFunction(hash, 10)

You can see how this function maps strings to numbers by calling the index_hash function:

julia> index_hash("a", h)
8

julia> index_hash("b", h)
7

Using a text hash function, we can represent a document as a vector with N entries by calling the hash_dtv function:

julia> hash_dtv(crps[1], h)
1×10 Matrix{Int64}:
 0  2  0  0  1  3  0  0  0  0

This can be done for a corpus as a whole to construct a DTM without defining a lexicon in advance:

julia> hash_dtm(crps, h)
2×10 Matrix{Int64}:
 0  2  0  0  1  3  0  0  0  0
 0  2  0  0  1  1  0  0  2  0
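
The idea behind the hash trick can be sketched in plain Julia as follows. The helper names bucket and hashed_counts are hypothetical, and the sketch assumes Base.hash and whitespace tokenization, so the bucket indices it produces will generally differ from the library output shown above.

```julia
# Hypothetical helper: map a term to a bucket in 1:N using Base.hash.
bucket(term::AbstractString, N::Integer) = Int(hash(term) % UInt(N)) + 1

# Hypothetical helper: count tokens per bucket, giving a 1×N row vector.
function hashed_counts(doc::AbstractString, N::Integer)
    v = zeros(Int, 1, N)
    for t in split(doc)          # assumption: whitespace tokenization
        v[1, bucket(t, N)] += 1  # distinct terms may collide in one bucket
    end
    return v
end

row = hashed_counts("To be or not to be", 10)
```

No lexicon is needed because the column of a term is computed from the term itself. The cost is that distinct terms can share a column, and a larger N makes such collisions less likely.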

Every corpus has a hash function built-in, so this function can be called using just one argument:

julia> hash_dtm(crps)
2×100 Matrix{Int64}:
 0  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  2  0  0  0  0     0  0  0  0  0  0  0  0  0  0  0  0

Moreover, if you do not specify a hash function when computing a single row of the hash DTM, a default hash function will be constructed for you:

julia> hash_dtv(crps[1])
1×100 Matrix{Int64}:
 0  0  0  0  0  0  0  0  0  0  0  0  0  …  0  0  0  0  0  0  0  0  0  0  0  0

Top Features

We can use the function top_terms(x, n) to quickly view the top features of a Document, DocumentTermMatrix or Corpus.

julia> top_terms(m, 5)
5-element Vector{Pair{String, Int64}}:
     "To" => 2
     "be" => 2
 "become" => 2
    "not" => 2
     "or" => 2

TF (Term Frequency)

Often we need to find out what proportion of a document is contributed by each term. This can be done using the term frequency function:

TextAnalysis.tf — Function

tf(dtm::DocumentTermMatrix)
tf(dtm::SparseMatrixCSC{Real})
tf(dtm::Matrix{Real})

Compute term frequency for the document-term matrix.

Arguments

dtm: Document-term matrix (DocumentTermMatrix, sparse matrix, or dense matrix)

Returns

Matrix{Float64} or SparseMatrixCSC{Float64}: Term frequency matrix

Example

julia> crps = Corpus([StringDocument("To be or not to be"),
              StringDocument("To become or not to become")])

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)

julia> tf(m)
2×6 SparseArrays.SparseMatrixCSC{Float64,Int64} with 10 stored entries:
  [1, 1]  =  0.166667
  [2, 1]  =  0.166667
  [1, 2]  =  0.333333
  [2, 3]  =  0.333333
  [1, 4]  =  0.166667
  [2, 4]  =  0.166667
  [1, 5]  =  0.166667
  [2, 5]  =  0.166667
  [1, 6]  =  0.166667
  [2, 6]  =  0.166667
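
The values above are consistent with dividing each count by the total number of tokens in its document. A minimal plain-Julia sketch of that row normalization, using a dense copy of the counts from this example (the variable names are illustrative, not part of the library):

```julia
# Raw counts for the two example documents (rows) over the six lexicon terms.
counts = [1 2 0 1 1 1;
          1 0 2 1 1 1]

# Total number of tokens in each document.
row_totals = sum(counts, dims=2)

# Divide each row by its total; max(…, 1) guards against empty documents.
tf_sketch = counts ./ max.(row_totals, 1)
```

Here tf_sketch[1, 2] is 2/6, matching the 0.333333 entry in the output above.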

TF-IDF (Term Frequency - Inverse Document Frequency)

TextAnalysis.tf_idf — Function

tf_idf(dtm::DocumentTermMatrix)
tf_idf(dtm::SparseMatrixCSC{Real})
tf_idf(dtm::Matrix{Real})

Compute TF-IDF (Term Frequency-Inverse Document Frequency) values for the document-term matrix.

Arguments

dtm: Document-term matrix (DocumentTermMatrix, sparse matrix, or dense matrix)

Returns

Matrix{Float64} or SparseMatrixCSC{Float64}: TF-IDF weighted matrix

Notes

TF-IDF addresses issues with raw word counts:

Some documents are longer than other documents
Some words are more frequent than other words

Example

julia> crps = Corpus([StringDocument("To be or not to be"),
              StringDocument("To become or not to become")])

julia> update_lexicon!(crps)

julia> m = DocumentTermMatrix(crps)

julia> tf_idf(m)
2×6 SparseArrays.SparseMatrixCSC{Float64,Int64} with 10 stored entries:
  [1, 1]  =  0.0
  [2, 1]  =  0.0
  [1, 2]  =  0.231049
  [2, 3]  =  0.231049
  [1, 4]  =  0.0
  [2, 4]  =  0.0
  [1, 5]  =  0.0
  [2, 5]  =  0.0
  [1, 6]  =  0.0
  [2, 6]  =  0.0
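
The output above is consistent with multiplying the term frequency by log(N / df), where N is the number of documents and df is the number of documents containing the term. This formula is inferred from the printed values rather than taken from the library source, so treat the following plain-Julia sketch as an illustration under that assumption:

```julia
# Raw counts for the two example documents (rows) over the six lexicon terms.
counts = [1 2 0 1 1 1;
          1 0 2 1 1 1]

n_docs = size(counts, 1)

# Term frequency: each count divided by its document's token total.
tf_part = counts ./ max.(sum(counts, dims=2), 1)

# Document frequency: in how many documents each term occurs.
doc_freq = sum(counts .> 0, dims=1)

# Assumed idf: natural log of N / df (every term here occurs at least once).
idf = log.(n_docs ./ doc_freq)

tfidf_sketch = tf_part .* idf
```

Terms that occur in both documents have idf = log(2/2) = 0, which is why most entries are zero, while "be" in the first document gets (2/6) * log(2) ≈ 0.231049.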

Okapi BM-25

From the document term matrix, Okapi BM25 document-word statistics can be created.

bm_25(dtm::AbstractMatrix; κ, β)
bm_25(dtm::DocumentTermMatrix; κ, β)

The in-place variant overwrites the matrix bm25 with the calculated weights:

bm_25!(dtm, bm25; κ, β)

The input matrix can also be a sparse matrix. The parameters κ and β default to 2 and 0.75 respectively.

Here is an example:

julia> using TextAnalysis
julia> crps = Corpus([
         StringDocument("a a a sample text text"),
         StringDocument("another example example text text"),
         StringDocument(""),
         StringDocument("another another text text text text")
       ])
A Corpus with 4 documents:
 * 4 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> update_lexicon!(crps)
julia> m = DocumentTermMatrix(crps)
A 4 X 5 DocumentTermMatrix
julia> bm_25(m)
4×5 SparseArrays.SparseMatrixCSC{Float64, Int64} with 8 stored entries:
 1.29959   ⋅         ⋅       1.89031  0.405067
  ⋅       0.882404  1.54025   ⋅       0.405067
  ⋅        ⋅         ⋅        ⋅        ⋅
  ⋅       1.40179    ⋅        ⋅       0.676646
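
For intuition about the roles of κ and β, here is a sketch of the textbook Okapi BM25 weighting in plain Julia. It is not the library's bm_25: the function name bm25_sketch and the smoothed idf variant are assumptions, and the numbers it produces are not expected to match the output above.

```julia
# Textbook Okapi BM25 weighting; an illustration, not TextAnalysis's bm_25.
function bm25_sketch(counts::AbstractMatrix{<:Real}; κ=2, β=0.75)
    n_docs = size(counts, 1)
    doc_len = vec(sum(counts, dims=2))        # tokens per document
    avg_len = sum(doc_len) / n_docs           # average document length
    doc_freq = vec(sum(counts .> 0, dims=1))  # documents containing each term
    # Smoothed idf (one common variant); always positive.
    idf = log.(1 .+ (n_docs .- doc_freq .+ 0.5) ./ (doc_freq .+ 0.5))
    weights = zeros(Float64, size(counts))
    for i in axes(counts, 1), j in axes(counts, 2)
        f = counts[i, j]
        f == 0 && continue
        # β controls how strongly long documents are penalized.
        norm = κ * (1 - β + β * doc_len[i] / avg_len)
        # κ controls how quickly repeated occurrences saturate.
        weights[i, j] = idf[j] * f * (κ + 1) / (f + norm)
    end
    return weights
end

# Counts for the four example documents over: a, another, example, sample, text.
counts = [3 0 0 1 2;
          0 1 2 0 2;
          0 0 0 0 0;
          0 2 0 0 4]
w = bm25_sketch(counts)
```

As in the library output, the empty third document yields an all-zero row, and rare terms such as "sample" receive more weight than the common term "text".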

Co-occurrence Matrix (COOM)

The elements of the co-occurrence matrix indicate how many times two words co-occur in a (sliding) word window of a given size. The COOM can be calculated for objects of type Corpus and AbstractDocument (with the exception of NGramDocument).

CooMatrix(crps; window, normalize)
CooMatrix(doc; window, normalize)

It takes the following keyword arguments:

window::Integer: Number of words to consider to the left and right of the center word; defaults to 5. The actual size of the sliding window is therefore 2 * window + 1.
normalize::Bool: Whether to normalize counts by the distance between words; defaults to true.

It returns the CooMatrix structure from which the matrix can be extracted using coom(::CooMatrix). The terms can also be extracted from this structure. Here is an example usage:

julia> using TextAnalysis
julia> crps = Corpus([StringDocument("this is a string document")])
A Corpus with 1 documents:
 * 1 StringDocument's
 * 0 FileDocument's
 * 0 TokenDocument's
 * 0 NGramDocument's

Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> C = CooMatrix(crps, window=1, normalize=false)
CooMatrix{Float64}(sparse([2, 5, 1, 4, 3, 5, 1, 4], [1, 1, 2, 3, 4, 4, 5, 5], [2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0], 5, 5), ["string", "document", "this", "is", "a"], OrderedCollections.OrderedDict("string" => 1, "document" => 2, "this" => 3, "is" => 4, "a" => 5))
julia> coom(C)
5×5 SparseArrays.SparseMatrixCSC{Float64, Int64} with 8 stored entries:
  ⋅   2.0   ⋅    ⋅   2.0
 2.0   ⋅    ⋅    ⋅    ⋅
  ⋅    ⋅    ⋅   2.0   ⋅
  ⋅    ⋅   2.0   ⋅   2.0
 2.0   ⋅    ⋅   2.0   ⋅
julia> C.terms
5-element Vector{String}:
 "string"
 "document"
 "this"
 "is"
 "a"
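
The sliding-window counting can be sketched in plain Julia as follows. The function coom_sketch is hypothetical and is not the library's implementation. It assumes that each position is visited as a center word and that the matrix is kept symmetric, which reproduces the value 2.0 for adjacent words seen above. It orders terms by first appearance, so rows and columns are arranged differently from the library output.

```julia
# Hypothetical sliding-window co-occurrence counter (not CooMatrix itself).
function coom_sketch(tokens::Vector{<:AbstractString};
                     window::Int=5, normalize::Bool=true)
    terms = unique(tokens)                       # order of first appearance
    index = Dict(t => k for (k, t) in enumerate(terms))
    C = zeros(Float64, length(terms), length(terms))
    n = length(tokens)
    for i in 1:n                                 # i is the center word
        for j in max(1, i - window):min(n, i + window)
            i == j && continue
            r, c = index[tokens[i]], index[tokens[j]]
            # With normalize=true, closer words contribute more.
            w = normalize ? 1 / abs(i - j) : 1.0
            C[r, c] += w
            C[c, r] = C[r, c]                    # keep the matrix symmetric
        end
    end
    return C, terms
end

tokens = String.(split("this is a string document"))
C, terms = coom_sketch(tokens; window=1, normalize=false)
```

With window=1 every pair of adjacent words is seen twice, once from each side, so each of the 8 non-zero entries equals 2.0.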

It can also be called with a specific list of words/terms, in which case the co-occurrences are calculated only for those terms. Otherwise, it calculates the co-occurrence elements for all terms in the document.

CooMatrix(crps, terms; window, normalize)
CooMatrix(doc, terms; window, normalize)

julia> C = CooMatrix(crps, ["this", "is", "a"], window=1, normalize=false)
CooMatrix{Float64}(
  [2, 1]  =  4.0
  [1, 2]  =  4.0
  [3, 2]  =  4.0
  [2, 3]  =  4.0, ["this", "is", "a"], OrderedCollections.OrderedDict("this"=>1,"is"=>2,"a"=>3))

The element type T of the weights can also be specified when constructing a CooMatrix. T defaults to Float64.

CooMatrix{T}(crps; window, normalize) where T <: AbstractFloat
CooMatrix{T}(doc; window, normalize) where T <: AbstractFloat
CooMatrix{T}(crps, terms; window, normalize) where T <: AbstractFloat
CooMatrix{T}(doc, terms; window, normalize) where T <: AbstractFloat

Remarks:

The sliding window used to count co-occurrences does not take sentence boundaries into consideration; however, it respects document boundaries (i.e., it does not span across documents).
The co-occurrence matrices of the documents in a corpus are summed when calculating the matrix for an entire corpus.

Note

The co-occurrence matrix does not work for NGramDocument or a Corpus containing an NGramDocument.

julia> C = CooMatrix(NGramDocument("A document"), window=1, normalize=false) # fails, documents are NGramDocument
ERROR: The tokens of an NGramDocument cannot be reconstructed

Summarizer

TextAnalysis offers a simple TextRank-based summarizer for its various document types.

TextAnalysis.summarize — Function

summarize(doc; ns=5)

Generate a summary of the document and return the top ns sentences.

Arguments

doc: Document of type StringDocument, FileDocument, or TokenDocument
ns: Number of sentences in the summary (default: 5)

Returns

Vector{SubString{String}}: Array of the most relevant sentences

Example

julia> s = StringDocument("Assume this Short Document as an example. Assume this as an example summarizer. This has too foo sentences.")

julia> summarize(s, ns=2)
2-element Vector{SubString{String}}:
 "Assume this Short Document as an example."
 "This has too foo sentences."
