Features · TextAnalysis

#Features

Creating a Document Term Matrix

Often we want to represent documents as a matrix of word counts so that we can apply linear algebra operations and statistical techniques. Before we do this, we need to update the lexicon:

update_lexicon!(crps)
m = DocumentTermMatrix(crps)

A DocumentTermMatrix object is a special type. If you would like to use a simple sparse matrix, call dtm() on this object:

dtm(m)

If you would like to use a dense matrix instead, you can pass this as an argument to the dtm function:

dtm(m, :dense)

Creating Individual Rows of a Document Term Matrix

In many cases, we don't need the entire document term matrix at once: we can make do with just a single row. You can get this using the dtv function. Because individual's document do not have a lexicon associated with them, we have to pass in a lexicon as an additional argument:

dtv(crps[1], lexicon(crps))

The Hash Trick

The need to create a lexicon before we can construct a document term matrix is often prohibitive. We can often employ a trick that has come to be called the "Hash Trick" in which we replace terms with their hashed valued using a hash function that outputs integers from 1 to N. To construct such a hash function, you can use the TextHashFunction(N) constructor:

h = TextHashFunction(10)

You can see how this function maps strings to numbers by calling the index_hash function:

index_hash("a", h)
index_hash("b", h)

Using a text hash function, we can represent a document as a vector with N entries by calling the hash_dtv function:

hash_dtv(crps[1], h)

This can be done for a corpus as a whole to construct a DTM without defining a lexicon in advance:

hash_dtm(crps, h)

Every corpus has a hash function built-in, so this function can be called using just one argument:

hash_dtm(crps)

Moreover, if you do not specify a hash function for just one row of the hash DTM, a default hash function will be constructed for you:

hash_dtv(crps[1])

TF-IDF

In many cases, raw word counts are not appropriate for use because:

(A) Some documents are longer than other documents
(B) Some words are more frequent than other words

You can work around this by performing TF-IDF on a DocumentTermMatrix:

m = DocumentTermMatrix(crps)
tf_idf(m)

As you can see, TF-IDF has the effect of inserting 0's into the columns of words that occur in all documents. This is a useful way to avoid having to remove those words during preprocessing.