Corpus

Creating a Corpus

Working with isolated documents gets boring quickly. We typically want to work with a collection of documents. We represent collections of documents using the Corpus type:

using TextAnalysis

crps = Corpus(Any[StringDocument("Document 1"),
                  StringDocument("Document 2")])

Standardizing a Corpus

A Corpus may contain many different types of documents:

crps = Corpus(Any[StringDocument("Document 1"),
                  TokenDocument("Document 2"),
                  NGramDocument("Document 3")])

It is generally more convenient to standardize all of the documents in a corpus using a single type. This can be done using the standardize! function:

standardize!(crps, NGramDocument)

After this step, you can check that the corpus contains only NGramDocument objects:

crps

Processing a Corpus

We can apply the same sort of preprocessing steps that are defined for individual documents to an entire corpus at once:

crps = Corpus(Any[StringDocument("Document 1"),
                  StringDocument("Document 2")])
remove_punctuation!(crps)

These operations are run on each document in the corpus individually.

Corpus Statistics

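Often we wish to think broadly about properties of an entire corpus at once. In particular, we want to work with two constructs: the lexicon, which records the terms that occur in the corpus and how often they occur, and the inverse index, which records which documents contain a given term.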

Because computations involving the lexicon can take a long time, a Corpus's default lexicon is blank:

lexicon(crps)

In order to work with the lexicon, you have to update it and then access it:

update_lexicon!(crps)
lexicon(crps)

But once this work is done, you can easily address lots of interesting questions about a corpus:

lexical_frequency(crps, "Summer")
lexical_frequency(crps, "Document")
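
To make the idea of a lexicon concrete, here is a minimal conceptual sketch in plain Julia. It does not use or reproduce the package's internals: build_lexicon and relative_frequency are hypothetical helper names, tokens are assumed to be whitespace-separated, and "lexical frequency" is assumed to mean a term's count divided by the total number of tokens in the corpus.

```julia
# Conceptual sketch only: plain Julia, no TextAnalysis calls.
# The corpus is modelled as a vector of raw strings.
docs = ["Document 1", "Document 2"]

# Count how often each whitespace-separated token occurs across all documents.
function build_lexicon(docs)
    counts = Dict{String, Int}()
    for doc in docs
        for token in split(doc)
            key = String(token)
            counts[key] = get(counts, key, 0) + 1
        end
    end
    return counts
end

# Relative frequency of a term: its count divided by the total token count.
function relative_frequency(lexicon, term)
    total = sum(values(lexicon))
    return get(lexicon, term, 0) / total
end

lex = build_lexicon(docs)
relative_frequency(lex, "Document")  # 2 of the 4 tokens, so 0.5
relative_frequency(lex, "Summer")    # term never occurs, so 0.0
```

Building the counts once and then answering queries with a dictionary lookup mirrors the update-then-access pattern described above: the expensive pass over the corpus happens only when you ask for it.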

Like the lexicon, the inverse index for a corpus is blank by default:

inverse_index(crps)

Again, you need to update it before you can work with it:

update_inverse_index!(crps)
inverse_index(crps)

But once you've updated the inverse index, you can easily search the entire corpus:

crps["Document"]
crps["1"]
crps["Summer"]
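
The lookup above can be pictured with a small stand-alone sketch in plain Julia. This is an illustration under stated assumptions, not the package's implementation: build_inverse_index and search are hypothetical names, tokens are assumed to be whitespace-separated, and documents are identified by their position in the vector.

```julia
# Conceptual sketch only: plain Julia, no TextAnalysis calls.
docs = ["Document 1", "Document 2"]

# Map each term to the positions of the documents that contain it.
function build_inverse_index(docs)
    index = Dict{String, Vector{Int}}()
    for (i, doc) in enumerate(docs)
        # unique() ensures a document is listed once per term,
        # even if the term appears in it several times.
        for token in unique(split(doc))
            push!(get!(index, String(token), Int[]), i)
        end
    end
    return index
end

# Searching is a dictionary lookup; unknown terms match no documents.
search(index, term) = get(index, term, Int[])

idx = build_inverse_index(docs)
search(idx, "Document")  # [1, 2]
search(idx, "1")         # [1]
search(idx, "Summer")    # Int[] (no matches)
```

Because the index is built ahead of time, each search costs a single lookup instead of a scan over every document.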

Converting a Corpus to a DataFrame

Sometimes we want to apply non-text specific data analysis operations to a corpus. The easiest way to do this is to convert a Corpus object into a DataFrame:

using DataFrames

convert(DataFrame, crps)