Creating a Corpus
Working with isolated documents gets boring quickly. We typically want to work with a collection of documents. We represent collections of documents using the Corpus type:
TextAnalysis.Corpus — Type
Corpus(docs::Vector{T}) where {T <: AbstractDocument}
Collections of documents are represented using the Corpus type.
Example
julia> crps = Corpus([StringDocument("Document 1"),
                      StringDocument("Document 2")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
Standardizing a Corpus
A Corpus may contain many different types of documents. It is generally more convenient to standardize all of the documents in a corpus using a single type. This can be done using the standardize! function:
TextAnalysis.standardize! — Function
standardize!(crps::Corpus, ::Type{T}) where T <: AbstractDocument
Standardize the documents in a Corpus to a common type.
Example
julia> crps = Corpus([StringDocument("Document 1"),
                      TokenDocument("Document 2"),
                      NGramDocument("Document 3")])
A Corpus with 3 documents:
* 1 StringDocument's
* 0 FileDocument's
* 1 TokenDocument's
* 1 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> standardize!(crps, NGramDocument)
# After this step, you can check that the corpus only contains NGramDocument's:
julia> crps
A Corpus with 3 documents:
* 0 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 3 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
Processing a Corpus
We can apply the same sort of preprocessing steps that are defined for individual documents to an entire corpus at once:
julia> using TextAnalysis
julia> crps = Corpus([StringDocument("Document ..!!"), StringDocument("Document ..!!")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> prepare!(crps, strip_punctuation)
julia> text(crps[1])
"Document "
julia> text(crps[2])
"Document "
These operations are run on each document in the corpus individually.
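As a rough picture of what "each document individually" means, the sketch below maps a cleaning function over a vector of plain strings. It uses only base Julia; the helper name `strip_punct` and its regular expression are assumptions made for illustration and are not the rules that strip_punctuation applies inside TextAnalysis.

```julia
# Hypothetical helper (not part of TextAnalysis): drop ASCII punctuation.
strip_punct(s::AbstractString) = replace(s, r"[[:punct:]]" => "")

docs = ["Document ..!!", "Document ..!!"]

# Each string is processed on its own; no document sees the others.
cleaned = map(strip_punct, docs)
```

Here `cleaned` holds one processed string per input document, in the same order.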
Corpus Statistics
Often we wish to think broadly about properties of an entire corpus at once. In particular, we want to work with two constructs:
- Lexicon: The lexicon of a corpus consists of all the terms that occur in any document in the corpus. The lexical frequency of a term tells us how often a term occurs across all of the documents. Often the most interesting words in a document are those words whose frequency within a document is higher than their frequency in the corpus as a whole.
- Inverse Index: If we are interested in a specific term, we often want to know which documents in a corpus contain that term. The inverse index tells us this and therefore provides a simplistic sort of search algorithm.
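To make the two constructs concrete, here is a minimal sketch in base Julia that builds both from a vector of raw strings. The function name `build_lexicon_and_index` and the whitespace tokenization are assumptions for illustration; this is not how TextAnalysis computes them internally.

```julia
# Hypothetical helper: returns (lexicon, inverse index) for raw strings.
#   lexicon:       term => number of occurrences across all documents
#   inverse index: term => positions of the documents containing the term
function build_lexicon_and_index(docs::Vector{String})
    lex = Dict{String,Int}()
    idx = Dict{String,Vector{Int}}()
    for (i, doc) in enumerate(docs)
        for term in split(doc)              # naive whitespace tokenization
            t = String(term)
            lex[t] = get(lex, t, 0) + 1
            positions = get!(idx, t, Int[])
            # Record each document at most once per term.
            if isempty(positions) || last(positions) != i
                push!(positions, i)
            end
        end
    end
    return lex, idx
end

lex, idx = build_lexicon_and_index(["Name Foo", "Name Bar"])
```

With these two inputs, "Name" is counted twice in the lexicon and is listed under both documents in the inverse index.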
Because computations involving the lexicon can take a long time, a Corpus's default lexicon is blank:
julia> crps = Corpus([StringDocument("Name Foo"),
                      StringDocument("Name Bar")])
julia> lexicon(crps)
Dict{String,Int64} with 0 entries
In order to work with the lexicon, you have to update it and then access it:
julia> update_lexicon!(crps)
julia> lexicon(crps)
Dict{String,Int64} with 3 entries:
"Bar" => 1
"Foo" => 1
"Name" => 2
Once this work is done, you can easily answer many interesting questions about a corpus:
julia> lexical_frequency(crps, "Name")
0.5
julia> lexical_frequency(crps, "Foo")
0.25
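The values above are consistent with reading lexical frequency as a term's count divided by the total number of tokens in the corpus (2 of 4 for "Name", 1 of 4 for "Foo"). The sketch below reproduces that arithmetic on a plain Dict; `relative_frequency` is a hypothetical name, and the formula is an assumption inferred from the output shown rather than taken from the package source.

```julia
# Hypothetical helper: share of all corpus tokens accounted for by `term`.
function relative_frequency(lex::Dict{String,Int}, term::AbstractString)
    isempty(lex) && return 0.0          # avoid dividing by zero
    total = sum(values(lex))            # total tokens in the corpus
    return get(lex, term, 0) / total    # unseen terms count as 0
end

lex = Dict("Name" => 2, "Foo" => 1, "Bar" => 1)

freq_name = relative_frequency(lex, "Name")
freq_foo  = relative_frequency(lex, "Foo")
```

A term that never occurs gets a frequency of 0.0 instead of raising a KeyError, which is a design choice of this sketch.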
Like the lexicon, the inverse index for a corpus is blank by default:
julia> inverse_index(crps)
Dict{String,Array{Int64,1}} with 0 entries
Again, you need to update it before you can work with it:
julia> update_inverse_index!(crps)
julia> inverse_index(crps)
Dict{String,Array{Int64,1}} with 3 entries:
"Bar" => [2]
"Foo" => [1]
"Name" => [1, 2]
But once you've updated the inverse index, you can easily search the entire corpus:
julia> crps["Name"]
2-element Array{Int64,1}:
1
2
julia> crps["Foo"]
1-element Array{Int64,1}:
1
julia> crps["Summer"]
0-element Array{Int64,1}
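Single-term lookups extend naturally to queries that require several terms at once: intersect the document lists of each term. The following is a sketch in base Julia over a plain Dict shaped like the inverse index above; `search_all` is a hypothetical helper, not a TextAnalysis function.

```julia
# Hypothetical helper: positions of documents that contain every query term.
function search_all(idx::Dict{String,Vector{Int}}, terms::Vector{String})
    isempty(terms) && return Int[]
    hits = copy(get(idx, terms[1], Int[]))
    for t in terms[2:end]
        # Keep only documents that also contain the next term.
        hits = intersect(hits, get(idx, t, Int[]))
    end
    return hits
end

idx = Dict("Name" => [1, 2], "Foo" => [1], "Bar" => [2])

both = search_all(idx, ["Name", "Foo"])
none = search_all(idx, ["Foo", "Bar"])
```

Terms missing from the index contribute an empty list, so any query that includes them returns no documents.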
Converting a Corpus to a DataFrame
Sometimes we want to apply non-text-specific data analysis operations to a corpus. The easiest way to do this is to convert a Corpus object into a DataFrame:
convert(DataFrame, crps)
Corpus Metadata
You can also retrieve the metadata for every document in a Corpus at once:
- languages(): What language is the document in? Defaults to Languages.English(), a Language instance defined by the Languages package.
- titles(): What is the title of the document? Defaults to "Untitled Document".
- authors(): Who wrote the document? Defaults to "Unknown Author".
- timestamps(): When was the document written? Defaults to "Unknown Time".
julia> crps = Corpus([StringDocument("Name Foo"),
                      StringDocument("Name Bar")])
julia> languages(crps)
2-element Array{Languages.English,1}:
Languages.English()
Languages.English()
julia> titles(crps)
2-element Array{String,1}:
"Untitled Document"
"Untitled Document"
julia> authors(crps)
2-element Array{String,1}:
"Unknown Author"
"Unknown Author"
julia> timestamps(crps)
2-element Array{String,1}:
"Unknown Time"
"Unknown Time"
It is possible to change the metadata fields for each document in a Corpus. These functions use the same metadata value for every document:
julia> languages!(crps, Languages.German())
julia> titles!(crps, "")
julia> authors!(crps, "Me")
julia> timestamps!(crps, "Now")
Additionally, you can specify the metadata fields for each document in a Corpus individually:
julia> languages!(crps, [Languages.German(), Languages.English()])
julia> titles!(crps, ["", "Untitled"])
julia> authors!(crps, ["Ich", "You"])
julia> timestamps!(crps, ["Unbekannt", "2018"])