Creating a Corpus
Working with isolated documents gets boring quickly. We typically want to work with a collection of documents. We represent collections of documents using the Corpus type:
TextAnalysis.Corpus — Type
Corpus(docs::Vector{T}) where {T <: AbstractDocument}
Collections of documents are represented using the Corpus type.
Example
julia> crps = Corpus([StringDocument("Document 1"),
StringDocument("Document 2")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
Standardizing a Corpus
A Corpus may contain many different types of documents. It is generally more convenient to standardize all of the documents in a corpus using a single type. This can be done using the standardize! function:
TextAnalysis.standardize! — Function
standardize!(crps::Corpus, ::Type{T}) where T <: AbstractDocument
Standardize the documents in a Corpus to a common type.
Example
julia> crps = Corpus([StringDocument("Document 1"),
TokenDocument("Document 2"),
NGramDocument("Document 3")])
A Corpus with 3 documents:
* 1 StringDocument's
* 0 FileDocument's
* 1 TokenDocument's
* 1 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> standardize!(crps, NGramDocument)
# After this step, you can check that the corpus only contains NGramDocument's:
julia> crps
A Corpus with 3 documents:
* 0 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 3 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
Processing a Corpus
We can apply the same preprocessing steps that are defined for individual documents to an entire corpus at once:
julia> using TextAnalysis
julia> crps = Corpus([StringDocument("Document ..!!"), StringDocument("Document ..!!")])
A Corpus with 2 documents:
* 2 StringDocument's
* 0 FileDocument's
* 0 TokenDocument's
* 0 NGramDocument's
Corpus's lexicon contains 0 tokens
Corpus's index contains 0 tokens
julia> prepare!(crps, strip_punctuation)
julia> text(crps[1])
"Document "
julia> text(crps[2])
"Document "
These operations are run on each document in the corpus individually.
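As a rough plain-Julia analogy for "each document individually", the sketch below applies one cleaning step to a vector of strings in a loop. It does not use TextAnalysis and is not how prepare! is implemented; the helper name strip_punct and the variable texts are hypothetical, chosen only for illustration.

```julia
# Hypothetical helper (not a TextAnalysis function): drop punctuation from one text.
strip_punct(s::AbstractString) = replace(s, r"[[:punct:]]" => "")

# Stand-in for the texts of the two documents in the corpus above.
texts = ["Document ..!!", "Document ..!!"]

# Apply the same step to every document independently, overwriting each entry.
for i in eachindex(texts)
    texts[i] = strip_punct(texts[i])
end
```

Each entry is processed without reference to the others, which is why the trailing space survives in both results: the step only removes punctuation characters.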
Corpus Statistics
Often we want to analyze properties of an entire corpus at once. In particular, we work with two key constructs:
- Lexicon: The lexicon of a corpus consists of all the terms that occur in any document in the corpus. The lexical frequency of a term tells us how often a term occurs across all documents. Often the most interesting words in a document are those whose frequency within that document is higher than their frequency in the corpus as a whole.
- Inverse Index: If we are interested in a specific term, we often want to know which documents in a corpus contain that term. The inverse index provides this information and enables a basic search algorithm.
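To make the two constructs concrete, here is a minimal plain-Julia sketch that builds both from whitespace-separated strings. It is an illustration under simplifying assumptions, not the TextAnalysis implementation: the names docs, build_lexicon, build_inverse_index, and relative_frequency are hypothetical, and tokenization is a bare split on whitespace.

```julia
# Stand-in corpus: each string plays the role of one document.
docs = ["Name Foo", "Name Bar"]

# Lexicon: term => number of occurrences across all documents.
function build_lexicon(docs)
    lex = Dict{String,Int}()
    for doc in docs, term in split(doc)
        key = String(term)
        lex[key] = get(lex, key, 0) + 1
    end
    return lex
end

# Inverse index: term => positions of the documents that contain it.
function build_inverse_index(docs)
    idx = Dict{String,Vector{Int}}()
    for (i, doc) in enumerate(docs), term in unique(split(doc))
        push!(get!(idx, String(term), Int[]), i)
    end
    return idx
end

# Share of all term occurrences in the corpus that belong to `term`.
relative_frequency(lex, term) = get(lex, term, 0) / sum(values(lex))

lex = build_lexicon(docs)
idx = build_inverse_index(docs)

# A basic search: which documents mention a term? Unknown terms give an empty list.
hits = get(idx, "Name", Int[])
misses = get(idx, "Summer", Int[])
```

Under these assumptions, "Name" accounts for two of the four term occurrences, so its relative frequency is 0.5, and the search for "Name" returns both document positions.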
Because computations involving the lexicon can be time-consuming, a Corpus has an empty lexicon by default:
julia> crps = Corpus([StringDocument("Name Foo"),
StringDocument("Name Bar")])
julia> lexicon(crps)
Dict{String,Int64} with 0 entries
To work with the lexicon, you must update it first and then access it:
julia> update_lexicon!(crps)
julia> lexicon(crps)
Dict{String,Int64} with 3 entries:
"Bar" => 1
"Foo" => 1
"Name" => 2
Once this is done, you can easily address many interesting questions about a corpus:
julia> lexical_frequency(crps, "Name")
0.5
julia> lexical_frequency(crps, "Foo")
0.25
Like the lexicon, the inverse index for a corpus is empty by default:
julia> inverse_index(crps)
Dict{String,Vector{Int64}} with 0 entries
Again, you need to update it before you can work with it:
julia> update_inverse_index!(crps)
julia> inverse_index(crps)
Dict{String,Vector{Int64}} with 3 entries:
"Bar" => [2]
"Foo" => [1]
"Name" => [1, 2]
Once you've updated the inverse index, you can easily search the entire corpus:
julia> crps["Name"]
2-element Vector{Int64}:
1
2
julia> crps["Foo"]
1-element Vector{Int64}:
1
julia> crps["Summer"]
Int64[]
Converting a Corpus to a DataFrame
Sometimes we want to apply non-text-specific data analysis operations to a corpus. The easiest way to do this is to convert a Corpus object into a DataFrame:
julia> using DataFrames
julia> crps = Corpus([StringDocument("Name Foo"), StringDocument("Name Bar")])
julia> df = DataFrame(crps)
2×6 DataFrame
 Row │ Language             Title              Author          Timestamp     Length  Text
     │ String?              String?            String?         String?       Int64?  String?
─────┼────────────────────────────────────────────────────────────────────────────────────────
   1 │ Languages.English()  Untitled Document  Unknown Author  Unknown Time       8  Name Foo
   2 │ Languages.English()  Untitled Document  Unknown Author  Unknown Time       8  Name Bar
This creates a DataFrame with columns for Language, Title, Author, Timestamp, Length, and Text for each document in the corpus.
Alternatively, you can manually construct a DataFrame with custom columns:
using DataFrames
df = DataFrame(
text = [text(doc) for doc in crps.documents],
language = languages(crps),
title = titles(crps),
author = authors(crps),
timestamp = timestamps(crps)
)
Corpus Metadata
You can retrieve the metadata for every document in a Corpus at once:
- languages(): What language is each document in? Defaults to Languages.English(), a Language instance defined by the Languages package.
- titles(): What is the title of each document? Defaults to "Untitled Document".
- authors(): Who wrote each document? Defaults to "Unknown Author".
- timestamps(): When was each document written? Defaults to "Unknown Time".
julia> crps = Corpus([StringDocument("Name Foo"),
StringDocument("Name Bar")])
julia> languages(crps)
2-element Vector{Languages.English}:
Languages.English()
Languages.English()
julia> titles(crps)
2-element Vector{String}:
"Untitled Document"
"Untitled Document"
julia> authors(crps)
2-element Vector{String}:
"Unknown Author"
"Unknown Author"
julia> timestamps(crps)
2-element Vector{String}:
"Unknown Time"
"Unknown Time"
You can change the metadata fields for each document in a Corpus. These functions set the same metadata value for every document:
julia> languages!(crps, Languages.German())
julia> titles!(crps, "")
julia> authors!(crps, "Me")
julia> timestamps!(crps, "Now")
Additionally, you can specify the metadata fields for each document in a Corpus individually:
julia> languages!(crps, [Languages.German(), Languages.English()])
julia> titles!(crps, ["", "Untitled"])
julia> authors!(crps, ["Ich", "You"])
julia> timestamps!(crps, ["Unbekannt", "2018"])