Documents · TextAnalysis

Creating Documents

The basic unit of text analysis is a document. The TextAnalysis package allows you to work with documents stored in a variety of formats:

FileDocument: A document represented using a plain text file on disk
StringDocument: A document represented using a UTF-8 String stored in RAM
TokenDocument: A document represented as a sequence of UTF-8 tokens
NGramDocument: A document represented as a bag of n-grams, stored as a mapping from UTF-8 n-grams to their counts

Note

These formats form a hierarchy: you can always move down the hierarchy, but you generally cannot move back up. A FileDocument can easily become a StringDocument, but an NGramDocument cannot easily become a FileDocument.

Creating any of the four basic types of documents is very easy:

TextAnalysis.StringDocument — Type

StringDocument(txt::AbstractString)

Represent a document using a UTF-8 String stored in RAM.

Example

julia> str = "To be or not to be..."
"To be or not to be..."

julia> sd = StringDocument(str)
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: To be or not to be...


TextAnalysis.FileDocument — Type

FileDocument(pathname::AbstractString)

Represent a document using a plain text file on disk.

Example

julia> pathname = "/usr/share/dict/words"
"/usr/share/dict/words"

julia> fd = FileDocument(pathname)
A FileDocument
 * Language: Languages.English()
 * Title: /usr/share/dict/words
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: A A's AMD AMD's AOL AOL's Aachen Aachen's Aaliyah


TextAnalysis.TokenDocument — Type

TokenDocument(txt::AbstractString)
TokenDocument(txt::AbstractString, dm::DocumentMetadata)
TokenDocument(tkns::Vector{T}) where T <: AbstractString

Represent a document as a sequence of UTF-8 tokens.

Example

julia> my_tokens = String["To", "be", "or", "not", "to", "be..."]
6-element Vector{String}:
    "To"
    "be"
    "or"
    "not"
    "to"
    "be..."

julia> td = TokenDocument(my_tokens)
A TokenDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: ***SAMPLE TEXT NOT AVAILABLE***


TextAnalysis.NGramDocument — Type

NGramDocument(txt::AbstractString, n::Integer=1)
NGramDocument(txt::AbstractString, dm::DocumentMetadata, n::Integer=1)
NGramDocument(ng::Dict{T, Int}, n::Integer=1) where T <: AbstractString

Represent a document as a bag of n-grams, stored as a mapping from UTF-8 n-grams to their counts.

Example

julia> my_ngrams = Dict{String, Int}("To" => 1, "be" => 2,
                                     "or" => 1, "not" => 1,
                                     "to" => 1, "be..." => 1)
Dict{String,Int64} with 6 entries:
  "or"    => 1
  "be..." => 1
  "not"   => 1
  "to"    => 1
  "To"    => 1
  "be"    => 2

julia> ngd = NGramDocument(my_ngrams)
A NGramDocument{AbstractString}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: ***SAMPLE TEXT NOT AVAILABLE***


To create an NGramDocument of bigrams, or of n-grams of any higher order N, pass N as the second argument to NGramDocument:

julia> using TextAnalysis

julia> NGramDocument("To be or not to be ...", 2)
A NGramDocument{AbstractString}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: ***SAMPLE TEXT NOT AVAILABLE***

For every type of document except a FileDocument, you can also construct a new document by simply passing in a string of text:

julia> using TextAnalysis

julia> sd = StringDocument("To be or not to be...")
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: To be or not to be...

julia> td = TokenDocument("To be or not to be...")
A TokenDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: ***SAMPLE TEXT NOT AVAILABLE***

julia> ngd = NGramDocument("To be or not to be...")
A NGramDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: ***SAMPLE TEXT NOT AVAILABLE***

The system will automatically perform tokenization or n-gramization to produce the required data. Unfortunately, FileDocuments cannot be constructed this way because filenames are themselves strings. It would cause confusion if filenames were treated as the text contents of a document.

However, there is one way around this restriction: you can use the generic Document() constructor function, which will infer the type of the inputs and construct the appropriate type of document object:

julia> Document("To be or not to be...")
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: To be or not to be...

julia> Document("/usr/share/dict/words")
A FileDocument
 * Language: Languages.English()
 * Title: /usr/share/dict/words
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: A A's AMD AMD's AOL AOL's Aachen Aachen's Aaliyah

julia> Document(String["To", "be", "or", "not", "to", "be..."])
A TokenDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: ***SAMPLE TEXT NOT AVAILABLE***

julia> Document(Dict{String, Int}("a" => 1, "b" => 3))
A NGramDocument{AbstractString}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: ***SAMPLE TEXT NOT AVAILABLE***

This constructor is very convenient for working in the REPL, but should be avoided in production code because, unlike the other constructors, the return type of the Document function cannot be known at compile time.
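
The type-stability concern can be illustrated without the package. The sketch below uses two hypothetical stand-in types, FileDoc and StringDoc, and a hypothetical make_doc function. It assumes that a generic constructor chooses between the two by checking at run time whether the string names an existing file; none of these names belong to TextAnalysis.

```julia
# Hypothetical stand-ins for two document types (not the TextAnalysis types).
struct FileDoc
    path::String
end

struct StringDoc
    text::String
end

# A specific constructor wrapper: the result type is fixed by the code itself.
make_string_doc(s::String) = StringDoc(s)

# Assumed generic behaviour: the result type depends on a run-time check.
make_doc(s::String) = isfile(s) ? FileDoc(s) : StringDoc(s)

# Ask the compiler what it can infer for each function given a String argument.
specific = only(Base.return_types(make_string_doc, (String,)))
generic = only(Base.return_types(make_doc, (String,)))

println("specific: ", specific, " (concrete: ", isconcretetype(specific), ")")
println("generic:  ", generic, " (concrete: ", isconcretetype(generic), ")")
```

Code that calls the specific wrapper can be compiled for one known type, while code that calls the generic version has to handle more than one possible result type.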

Basic Functions for Working with Documents

Once you've created a document object, you can work with it in many ways. The most obvious operation is to access its text using the text() function:

julia> using TextAnalysis

julia> sd = StringDocument("To be or not to be...");

julia> text(sd)
"To be or not to be..."

Note

This function works without warnings on StringDocuments and FileDocuments. For TokenDocuments it is not possible to know if the text can be reconstructed perfectly, so calling text(TokenDocument("This is text")) will produce a warning message before returning an approximate reconstruction of the text as it existed before tokenization. It is entirely impossible to reconstruct the text of an NGramDocument, so text(NGramDocument("This is text")) raises an error.
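
The loss of information described in this note can be seen with plain Julia strings. The sketch below is only an illustration: whitespace splitting stands in for the package's tokenizer, and unigram_counts is a hypothetical helper, so neither reflects how TextAnalysis itself is implemented.

```julia
# Whitespace splitting stands in for a real tokenizer (an assumption).
original = "To be,\n  or not to be"
toks = split(original)

# Joining with single spaces is the best available guess at the original text.
rebuilt = join(toks, " ")
println(rebuilt == original)  # the line break and the extra spaces are lost

# Hypothetical helper: count how often each token occurs.
unigram_counts(ts) = Dict(t => count(==(t), ts) for t in unique(ts))

# Two different texts can share the same counts, so word order is unrecoverable.
a = unigram_counts(split("dog bites man"))
b = unigram_counts(split("man bites dog"))
println(a == b)
```

Tokens keep their order but lose the exact spacing, which is why only an approximate reconstruction is possible. Counts lose the order as well, which is why no reconstruction is attempted for n-grams.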

Instead of working with the text itself, you can work with the tokens or n-grams of a document using the tokens() and ngrams() functions:

julia> using TextAnalysis

julia> sd = StringDocument("To be or not to be...");

julia> tokens(sd)
6-element Vector{String}:
 "To"
 "be"
 "or"
 "not"
 "to"
 "be"

julia> ngrams(sd)
Dict{String, Int64} with 5 entries:
  "or"  => 1
  "not" => 1
  "to"  => 1
  "To"  => 1
  "be"  => 2

By default the ngrams() function produces unigrams. If you want to produce bigrams or trigrams, you can specify that directly using a numeric argument to the ngrams() function:

julia> using TextAnalysis

julia> sd = StringDocument("To be or not to be...");

julia> ngrams(sd, 2)
Dict{AbstractString, Int64} with 5 entries:
  "To be"  => 1
  "or not" => 1
  "be or"  => 1
  "not to" => 1
  "to be"  => 1

The ngrams() function can also be called with multiple arguments:

julia> using TextAnalysis

julia> sd = StringDocument("To be or not to be...");

julia> ngrams(sd, 2, 3)
Dict{AbstractString, Int64} with 9 entries:
  "To be"     => 1
  "or not"    => 1
  "be or"     => 1
  "be or not" => 1
  "or not to" => 1
  "not to"    => 1
  "to be"     => 1
  "not to be" => 1
  "To be or"  => 1
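
To make the counting behind these results concrete, here is a minimal sketch in plain Julia. The count_ngrams function is a hypothetical helper written for this illustration, not the package's implementation. It assumes the tokens are already available as a vector of strings and that the tokens of an n-gram are joined with a single space, as in the output above.

```julia
# Hypothetical helper: count n-grams of each requested order over a token list.
function count_ngrams(tokens::Vector{String}, orders::Integer...)
    counts = Dict{String, Int}()
    for n in orders
        # Slide a window of n tokens across the sequence.
        for i in 1:(length(tokens) - n + 1)
            gram = join(tokens[i:i+n-1], " ")
            counts[gram] = get(counts, gram, 0) + 1
        end
    end
    return counts
end

toks = ["To", "be", "or", "not", "to", "be"]

unigrams = count_ngrams(toks, 1)     # one order
bigrams = count_ngrams(toks, 2)
both = count_ngrams(toks, 2, 3)      # several orders merged into one dictionary

println(length(unigrams), " unigrams, ", length(bigrams), " bigrams, ",
        length(both), " bigrams and trigrams")
```

Passing several orders simply merges the counts for each order into a single dictionary, which is why the last call yields the five bigrams plus the four trigrams.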

If you have an NGramDocument, you can determine whether it contains unigrams, bigrams, or a higher-order representation using the ngram_complexity() function:

julia> using TextAnalysis

julia> ngd = NGramDocument("To be or not to be ...", 2);

julia> ngram_complexity(ngd)
2

This information is not available for other types of Document objects because it is possible to produce any level of complexity when constructing n-grams from raw text or tokens.

Document Metadata

In addition to methods for manipulating the text representation of a document, every document object also stores basic metadata about itself, including the following information:

language(): What language is the document in? Defaults to Languages.English(), a Language instance defined by the Languages package.
title(): What is the title of the document? Defaults to "Untitled Document".
author(): Who wrote the document? Defaults to "Unknown Author".
timestamp(): When was the document written? Defaults to "Unknown Time".

Try these functions on a StringDocument to see how the defaults work in practice:

julia> using TextAnalysis

julia> sd = StringDocument("This document has too foo words")
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: This document has too foo words

julia> language(sd)
Languages.English()

julia> title(sd)
"Untitled Document"

julia> author(sd)
"Unknown Author"

julia> timestamp(sd)
"Unknown Time"

If you need to reset these fields, you can use the mutating versions of the same functions:

julia> using TextAnalysis, Languages  # requires the Languages package: import Pkg; Pkg.add("Languages")

julia> sd = StringDocument("This document has too foo words")
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: This document has too foo words

julia> language!(sd, Languages.Spanish());

julia> title!(sd, "El Cid")
"El Cid"

julia> author!(sd, "Desconocido")
"Desconocido"

julia> timestamp!(sd, "Desconocido")
"Desconocido"

Preprocessing Documents

Having easy access to the text of a document and its metadata is important, but most text analysis tasks require some preprocessing.

At a minimum, your text source may contain corrupt characters. You can remove these using the remove_corrupt_utf8!() function:

TextAnalysis.remove_corrupt_utf8! — Function

remove_corrupt_utf8!(doc)
remove_corrupt_utf8!(crps)

Remove corrupt UTF-8 characters from doc or from the documents in crps. Does not support a FileDocument or a Corpus containing FileDocuments. See also: remove_corrupt_utf8
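
To show what a corrupt character looks like, the sketch below builds a string containing an invalid byte and cleans it using only Base Julia. Replacing each invalid character with a space is an assumed policy chosen for the illustration; it is not a statement about what remove_corrupt_utf8!() does internally.

```julia
# The byte 0xff can never appear in valid UTF-8, so this string is corrupt.
raw = String(UInt8[0x68, 0x69, 0xff, 0x21])   # 'h', 'i', <invalid byte>, '!'

# Replace every invalid character with a space (an assumed policy).
cleaned = map(c -> isvalid(c) ? c : ' ', raw)

println(isvalid(raw))      # validity of the original string
println(isvalid(cleaned))  # validity after cleaning
println(repr(cleaned))
```

Checking isvalid on the whole string first is a cheap way to decide whether any cleaning is needed at all.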


Alternatively, you may want to edit the text to remove items that are difficult to process automatically. For example, text may contain punctuation that you want to discard. You can remove punctuation using the prepare!() function:

julia> using TextAnalysis

julia> str = StringDocument("here are some punctuations !!!...")
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: here are some punctuations !!!...

julia> prepare!(str, strip_punctuation)

julia> text(str)
"here are some punctuations "

To remove case distinctions, use the remove_case!() function. You may also want to remove specific words from a document, such as a person's name. To do that, use the remove_words!() function:

julia> using TextAnalysis

julia> sd = StringDocument("Lear is mad")
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: Lear is mad

julia> remove_case!(sd)

julia> text(sd)
"lear is mad"

julia> remove_words!(sd, ["lear"])

julia> text(sd)
" is mad"

At other times, you'll want to remove entire classes of words. To make this easier, you can use several classes of basic words defined by the Languages.jl package:

Articles: "a", "an", "the"
Indefinite Articles: "a", "an"
Definite Articles: "the"
Prepositions: "across", "around", "before", ...
Pronouns: "I", "you", "he", "she", ...
Stop Words: "all", "almost", "alone", ...

These classes, along with other kinds of content such as numbers and HTML tags, can be removed by passing specially named flags to prepare!():

prepare!(sd, strip_articles)
prepare!(sd, strip_indefinite_articles)
prepare!(sd, strip_definite_articles)
prepare!(sd, strip_prepositions)
prepare!(sd, strip_pronouns)
prepare!(sd, strip_stopwords)
prepare!(sd, strip_numbers)
prepare!(sd, strip_non_letters)
prepare!(sd, strip_sparse_terms)
prepare!(sd, strip_frequent_terms)
prepare!(sd, strip_html_tags)

These functions use word lists, so they work with many different languages without modification. These operations can also be combined for improved performance:

prepare!(sd, strip_articles | strip_numbers | strip_html_tags)
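
Combining options with | works when each option is a bit flag. The sketch below shows the general technique in plain Julia. The DROP_* constants and the has_flag helper are hypothetical names with made-up values; they are not the package's constants and only illustrate how one combined value can carry several options.

```julia
# Hypothetical flag values: each option occupies its own bit.
const DROP_ARTICLES = UInt32(1) << 0
const DROP_NUMBERS  = UInt32(1) << 1
const DROP_HTML     = UInt32(1) << 2

# Combine options with bitwise OR into a single value.
flags = DROP_ARTICLES | DROP_HTML

# Test whether an option is set with bitwise AND.
has_flag(flags, f) = (flags & f) != 0

println(has_flag(flags, DROP_ARTICLES))
println(has_flag(flags, DROP_NUMBERS))
println(has_flag(flags, DROP_HTML))
```

Because all requested options arrive in one value, a function receiving them can decide up front which steps to run and handle them in a single call.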

In addition to removing words, it is also common to take closely related words, such as "dog" and "dogs", and stem them to produce a smaller set of words for analysis. You can do this using the stem!() function:

julia> using TextAnalysis

julia> sd = StringDocument("They write, it writes")
A StringDocument{String}
 * Language: Languages.English()
 * Title: Untitled Document
 * Author: Unknown Author
 * Timestamp: Unknown Time
 * Snippet: They write, it writes

julia> stem!(sd)

julia> text(sd)
"They write , it write"