Evaluation Metrics

Natural Language Processing tasks require evaluation metrics. TextAnalysis currently provides the following evaluation metrics:

ROUGE-N, ROUGE-L, ROUGE-L-Summary

These metrics evaluate a candidate (system) summary against one or more reference summaries: ROUGE-N uses n-gram overlap, while the ROUGE-L variants use the longest common subsequence.
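
For intuition, ROUGE-N recall is the fraction of reference n-grams that also appear in the candidate. The sketch below is purely illustrative (it is not how the package computes the score, and it omits the count clipping a full ROUGE implementation performs):

candidate = split("the cat sat on the mat")
reference = split("the cat is on the mat")

# All contiguous n-grams of a token vector
ngrams(tokens, n) = [tokens[i:i+n-1] for i in 1:length(tokens)-n+1]

cand_bigrams = ngrams(candidate, 2)
ref_bigrams  = ngrams(reference, 2)

# Recall: reference bigrams that also occur in the candidate, over all reference bigrams
recall = count(g -> g in cand_bigrams, ref_bigrams) / length(ref_bigrams)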

Base.argmax - Function
argmax(scores::Vector{Score})::Score
  • scores - Vector of Score objects

Return the Score with the maximum fmeasure field.

TextAnalysis.average - Function
average(scores::Vector{Score})::Score
  • scores - Vector of Score objects

Return the average of scores as a single Score, averaging the precision, recall, and fmeasure fields.

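Both helpers operate on the Vector{Score} returned by the ROUGE functions. A minimal sketch of typical usage (the input strings here are made up for illustration):

using TextAnalysis

refs = ["the cat sat on the mat", "a cat was sitting on the mat"]
cand = "the cat is on the mat"

scores = rouge_n(refs, cand, 1)        # one Score per reference

best = argmax(scores)                  # Score with the highest fmeasure
avg  = TextAnalysis.average(scores)    # mean precision, recall and fmeasure as a single Score
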
TextAnalysis.rouge_n - Function
rouge_n(
    references::Vector{<:AbstractString}, 
    candidate::AbstractString, 
    n::Int; 
    lang::Language
)::Vector{Score}

Compute n-gram recall between the candidate and the reference summaries.

Arguments

  • references::Vector{T} where T <: AbstractString - List of reference summaries
  • candidate::AbstractString - Input candidate summary to be scored against reference summaries
  • n::Integer - Order of n-grams
  • lang::Language - Language of the text, useful while generating n-grams (default: Languages.English())

Return a vector of Score objects.

See ROUGE: A Package for Automatic Evaluation of Summaries (Lin, 2004)

See also: rouge_l_sentence, rouge_l_summary

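A minimal sketch showing the optional lang keyword; Languages.jl provides the language objects (it is a dependency of TextAnalysis, and loading it explicitly is assumed here):

using TextAnalysis, Languages

refs = ["Julia is fast and dynamic", "Julia is a fast dynamic language"]
cand = "Julia is a dynamic language"

# Unigram and bigram recall against each reference, with the language passed explicitly
scores_1 = rouge_n(refs, cand, 1; lang = Languages.English())
scores_2 = rouge_n(refs, cand, 2; lang = Languages.English())
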
TextAnalysis.rouge_l_sentence - Function
rouge_l_sentence(
    references::Vector{<:AbstractString}, candidate::AbstractString, β=8;
    weighted=false, weight_func=sqrt,
    lang=Languages.English()
)::Vector{Score}

Calculate the ROUGE-L score between the references and the candidate at the sentence level.

Return a vector of Score objects.

See ROUGE: A Package for Automatic Evaluation of Summaries (Lin, 2004)

Note

The weighted argument enables weighting of matches when computing the longest common subsequence. The original implementation, ROUGE-1.5.5.pl, uses a power function for this weighting; here weight_func defaults to sqrt, i.e. a power of 0.5.

See also: rouge_n, rouge_l_summary

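A brief sketch of the weighted variant described in the note above; weight_func is left at its sqrt default, so this only switches the weighting on (the sentences are made up for illustration):

using TextAnalysis

refs = ["the cat sat on the mat"]
cand = "the cat is sitting on the mat"

plain    = rouge_l_sentence(refs, cand)                                  # unweighted LCS, β = 8 by default
weighted = rouge_l_sentence(refs, cand, 8; weighted = true, weight_func = sqrt)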

ROUGE-N Example

using TextAnalysis

candidate_summary = "Brazil, Russia, China and India are growing nations. They are all an important part of BRIC as well as regular part of G20 summits."
reference_summaries = ["Brazil, Russia, India and China are the next big political powers in the global economy. Together referred to as BRIC(S) along with South Korea.", "Brazil, Russia, India and China are together known as the BRIC(S) and have been invited to the G20 summit."]

# Calculate ROUGE-N scores for different N values
rouge_2_scores = rouge_n(reference_summaries, candidate_summary, 2)
rouge_1_scores = rouge_n(reference_summaries, candidate_summary, 1)

# Get the best scores using argmax
results = [rouge_2_scores, rouge_1_scores] .|> argmax
2-element Vector{Score}:
 Score(precision=0.14814815, recall=0.16, fmeasure=0.15384616)
 Score(precision=0.53571427, recall=0.5769231, fmeasure=0.5555556)

ROUGE-L Examples

ROUGE-L scores the candidate against the reference summaries based on their longest common subsequence (LCS):

using TextAnalysis

candidate = "Brazil, Russia, China and India are growing nations."
references = [
    "Brazil, Russia, India and China are the next big political powers.",
    "Brazil, Russia, India and China are BRIC nations."
]

# ROUGE-L for sentence-level evaluation
sentence_scores = rouge_l_sentence(references, candidate)

# ROUGE-L for summary-level evaluation (requires the β parameter)
summary_scores = rouge_l_summary(references, candidate, 8)
2-element Vector{Score}:
 Score(precision=0.54545456, recall=0.42857143, fmeasure=0.42998898)
 Score(precision=0.6363636, recall=0.6363636, fmeasure=0.6363636)

BLEU (bilingual evaluation understudy)

TextAnalysis.bleu_score - Function
bleu_score(reference_corpus::Vector{Vector{Token}}, translation_corpus::Vector{Token}; max_order=4, smooth=false)

Compute the BLEU score of translated segments against one or more references.

Return the BLEU score, n-gram precisions, brevity penalty, geometric mean of n-gram precisions, translation_length, and reference_length.

Arguments

  • reference_corpus: List of lists of references for each translation. Each reference should be tokenized into a list of tokens.
  • translation_corpus: List of translations to score. Each translation should be tokenized into a list of tokens.
  • max_order: Maximum n-gram order to use when computing BLEU score.
  • smooth=false: Whether or not to apply Lin et al. 2004 smoothing.

Example:

using TextAnalysis

# Two references and one translation for a single document
one_doc_references = [
    ["apple", "is", "apple"],
    ["apple", "is", "a", "fruit"]
]
one_doc_translation = [
    "apple", "is", "appl"
]
bleu_score([one_doc_references], [one_doc_translation], smooth=true)

Example adapted from NLTK:

using TextAnalysis

reference1 = [
    "It", "is", "a", "guide", "to", "action", "that",
    "ensures", "that", "the", "military", "will", "forever",
    "heed", "Party", "commands"
]
reference2 = [
    "It", "is", "the", "guiding", "principle", "which",
    "guarantees", "the", "military", "forces", "always",
    "being", "under", "the", "command", "of", "the",
    "Party"
]
reference3 = [
    "It", "is", "the", "practical", "guide", "for", "the",
    "army", "always", "to", "heed", "the", "directions",
    "of", "the", "party"
]

hypothesis1 = [
    "It", "is", "a", "guide", "to", "action", "which",
    "ensures", "that", "the", "military", "always",
    "obeys", "the", "commands", "of", "the", "party"
]

# Calculate BLEU score
score = bleu_score([[reference1, reference2, reference3]], [hypothesis1])
(bleu = 0.5045666840058485, precisions = [0.9444444444444444, 0.5882352941176471, 0.4375, 0.26666666666666666], bp = 1.0, geo_mean = 0.5045666840058485, translation_length = 18, reference_length = 16)
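
The result is a named tuple, so the components listed in the docstring can be read off by field name:

score.bleu          # overall BLEU score
score.precisions    # n-gram precisions for orders 1 through max_order
score.bp            # brevity penalty
score.geo_mean      # geometric mean of the n-gram precisions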