Creating a Document Term Matrix
Often we want to represent documents as a matrix of word counts so that we can apply linear algebra operations and statistical techniques. Before we do this, we need to update the lexicon:
julia> crps = Corpus([StringDocument("To be or not to be"),
StringDocument("To become or not to become")])
julia> update_lexicon!(crps)
julia> m = DocumentTermMatrix(crps)
DocumentTermMatrix(
[1, 1] = 1
[2, 1] = 1
[1, 2] = 2
[2, 3] = 2
[1, 4] = 1
[2, 4] = 1
[1, 5] = 1
[2, 5] = 1
[1, 6] = 1
[2, 6] = 1, ["To", "be", "become", "not", "or", "to"], Dict("or"=>5,"not"=>4,"to"=>6,"To"=>1,"be"=>2,"become"=>3))
A DocumentTermMatrix object is a special type. If you would like to use a simple sparse matrix, call dtm() on this object:
julia> dtm(m)
2×6 SparseArrays.SparseMatrixCSC{Int64,Int64} with 10 stored entries:
[1, 1] = 1
[2, 1] = 1
[1, 2] = 2
[2, 3] = 2
[1, 4] = 1
[2, 4] = 1
[1, 5] = 1
[2, 5] = 1
[1, 6] = 1
[2, 6] = 1
If you would like to use a dense matrix instead, you can pass :dense as an argument to the dtm function:
julia> dtm(m, :dense)
2×6 Array{Int64,2}:
1 2 0 1 1 1
1 0 2 1 1 1
Creating Individual Rows of a Document Term Matrix
In many cases, we don't need the entire document term matrix at once: we can make do with just a single row. You can get this using the dtv function. Because individual documents do not have a lexicon associated with them, we have to pass in a lexicon as an additional argument:
julia> dtv(crps[1], lexicon(crps))
1×6 Array{Int64,2}:
1 2 0 1 1 1
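As a sanity check, the vector returned by dtv should match the corresponding row of the dense document term matrix built from the full corpus; a quick sketch using the corpus defined above:

```julia
using TextAnalysis

crps = Corpus([StringDocument("To be or not to be"),
               StringDocument("To become or not to become")])
update_lexicon!(crps)
m = DocumentTermMatrix(crps)

# The document term vector of the first document equals the first
# row of the full dense document term matrix.
@assert vec(dtv(crps[1], lexicon(crps))) == vec(dtm(m, :dense)[1, :])
```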
The Hash Trick
The need to create a lexicon before we can construct a document term matrix is often prohibitive. We can often employ a trick that has come to be called the "Hash Trick", in which we replace terms with their hashed values using a hash function that outputs integers from 1 to N. To construct such a hash function, you can use the TextHashFunction(N) constructor:
julia> h = TextHashFunction(10)
TextHashFunction(hash, 10)
You can see how this function maps strings to numbers by calling the index_hash function:
julia> index_hash("a", h)
8
julia> index_hash("b", h)
7
Using a text hash function, we can represent a document as a vector with N entries by calling the hash_dtv function:
julia> hash_dtv(crps[1], h)
1×10 Array{Int64,2}:
0 2 0 0 1 3 0 0 0 0
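The mechanism behind this can be sketched in a few lines of plain Julia: map each token into 1:N with a standard string hash taken modulo N, then accumulate counts at the hashed indices. This is a toy illustration of the idea, not TextAnalysis's exact implementation, so the indices it produces will differ from the output above:

```julia
# A toy text hash function: map any string into 1:N using Julia's
# built-in hash. Collisions are expected when N is small, in which
# case the counts of colliding terms are merged.
struct ToyHash
    n::Int
end

toy_index(s::AbstractString, h::ToyHash) = Int(hash(s) % h.n) + 1

# Build a length-N document term vector by hashing each token.
function toy_hash_dtv(tokens, h::ToyHash)
    v = zeros(Int, h.n)
    for t in tokens
        v[toy_index(t, h)] += 1
    end
    return v
end

h = ToyHash(10)
v = toy_hash_dtv(split("To be or not to be"), h)
```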
This can be done for a corpus as a whole to construct a DTM without defining a lexicon in advance:
julia> hash_dtm(crps, h)
2×10 Array{Int64,2}:
0 2 0 0 1 3 0 0 0 0
0 2 0 0 1 1 0 0 2 0
Every corpus has a hash function built-in, so this function can be called using just one argument:
julia> hash_dtm(crps)
2×100 Array{Int64,2}:
0 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Moreover, if you call hash_dtv on a single document without specifying a hash function, a default hash function will be constructed for you:
julia> hash_dtv(crps[1])
1×100 Array{Int64,2}:
0 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0
TF-IDF (Term Frequency - Inverse Document Frequency)
In many cases, raw word counts are not appropriate for use because:
- (A) Some documents are longer than other documents
- (B) Some words are more frequent than other words
You can work around this by performing TF-IDF on a DocumentTermMatrix:
m = DocumentTermMatrix(crps)
tf_idf(m)
TF-IDF has the effect of zeroing out the columns of words that occur in every document. This is a useful way to avoid having to remove those words during preprocessing.
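To see why those columns become zero, the weighting can be worked through by hand on the toy counts matrix. The sketch below uses the common definition (term frequency normalized by document length, times the log of inverse document frequency); TextAnalysis's tf_idf may differ in smoothing details:

```julia
# Dense counts from the example corpus (rows = documents, columns = terms
# ["To", "be", "become", "not", "or", "to"]).
counts = [1 2 0 1 1 1;
          1 0 2 1 1 1]

ndocs = size(counts, 1)

# Term frequency: counts normalized by document length.
tf = counts ./ sum(counts, dims = 2)

# Document frequency: number of documents containing each term.
df = vec(sum(counts .> 0, dims = 1))

# Inverse document frequency.
idf = log.(ndocs ./ df)

tfidf = tf .* idf'

# Terms appearing in every document have idf = log(1) = 0,
# so their columns are zeroed out.
@assert all(tfidf[:, df .== ndocs] .== 0)
```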
Sentiment Analyzer
The sentiment analyzer finds the sentiment score (between 0 and 1) of a word, sentence, or document. It uses a model trained with Flux on the IMDB corpus, with pre-trained weights, to calculate the sentiment.
model = SentimentAnalyzer(doc)
model = SentimentAnalyzer(doc, handle_unknown)
- doc = the input document whose sentiment is to be calculated (of AbstractDocument type)
- handle_unknown = a function for handling unknown words; should return an array (default: (x)->[])
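A minimal usage sketch following the signatures above. Note that the pretrained weights are fetched on first use, so this requires the model files to be available, and the exact score depends on the trained model:

```julia
using TextAnalysis

doc = StringDocument("a very nice thing that everyone likes")

# Score between 0 (negative) and 1 (positive).
model = SentimentAnalyzer(doc)

# Optionally supply a handler for out-of-vocabulary words; here we
# simply drop them, mirroring the documented default.
model = SentimentAnalyzer(doc, x -> [])
```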