Evaluation Metrics

Natural language processing tasks often require evaluation metrics. TextAnalysis currently provides the following evaluation metrics.

ROUGE-N, ROUGE-L, ROUGE-L-Summary

These metrics evaluate summaries based on the overlap of n-grams between the system (candidate) summary and the reference summaries.

Base.argmax - Function
argmax(scores::Vector{Score})::Score

Returns the maximum of the given Scores by their precision field.

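For illustration, a minimal sketch of picking the best Score from a hand-built vector; it assumes the Score type and its positional constructor Score(precision, recall, fmeasure) are available from TextAnalysis:

using TextAnalysis

# Hand-built scores, for illustration only (assumed positional constructor).
scores = [
    Score(0.40, 0.35, 0.37),
    Score(0.55, 0.50, 0.52),  # highest on every field
]

# Returns the dominating Score regardless of which field is compared.
best = argmax(scores)
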
TextAnalysis.rouge_n - Function
rouge_n(
    references::Vector{<:AbstractString}, 
    candidate::AbstractString, 
    n::Int; 
    lang::Language
)::Vector{Score}

Compute the n-gram recall between a candidate summary and the reference summaries.

The function takes the following arguments:

  • references::Vector{T} where T<: AbstractString = The list of reference summaries.
  • candidate::AbstractString = Input candidate summary, to be scored against the reference summaries.
  • n::Integer = Order of the n-grams.
  • lang::Language = Language of the text, useful while generating n-grams. The default value is Languages.English().

Returns a vector of Score

See ROUGE: A Package for Automatic Evaluation of Summaries

See also: rouge_l_sentence, rouge_l_summary

TextAnalysis.rouge_l_sentence - Function
rouge_l_sentence(
    references::Vector{<:AbstractString}, candidate::AbstractString, β=8;
    weighted=false, weight_func=sqrt,
    lang=Languages.English()
)::Vector{Score}

Calculate the ROUGE-L score between the references and the candidate at the sentence level.

Returns a vector of Score

See ROUGE: A Package for Automatic Evaluation of Summaries

Note: the weighted argument enables weighting of values when calculating the longest common subsequence. The original implementation, ROUGE-1.5.5.pl, applies a power function for this; here weight_func defaults to sqrt, i.e. a power of 0.5.

See also: rouge_n, rouge_l_summary

using TextAnalysis

candidate_summary =  "Brazil, Russia, China and India are growing nations. They are all an important part of BRIC as well as regular part of G20 summits."
reference_summaries = ["Brazil, Russia, India and China are the next big political powers in the global economy. Together referred to as BRIC(S) along with South Korea.", "Brazil, Russia, India and China are together known as the  BRIC(S) and have been invited to the G20 summit."]

results = [
    rouge_n(reference_summaries, candidate_summary, 2),
    rouge_n(reference_summaries, candidate_summary, 1)
] .|> argmax
2-element Vector{Score}:
 Score(precision=0.14814815, recall=0.16, fmeasure=0.15384616)
 Score(precision=0.53571427, recall=0.5769231, fmeasure=0.5555556)
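
The same candidate and reference summaries can be scored with ROUGE-L at the sentence level; a hedged sketch reusing the variables defined above (the resulting numbers are not reproduced here):

# ROUGE-L at the sentence level, with the default β = 8.
rouge_l_sentence(reference_summaries, candidate_summary) |> argmax

# Weighted variant: weight_func (sqrt by default) is applied to the lengths
# of consecutive matches in the longest common subsequence.
rouge_l_sentence(reference_summaries, candidate_summary, 8; weighted=true) |> argmax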

BLEU (bilingual evaluation understudy)

TextAnalysis.bleu_score - Function
bleu_score(reference_corpus::Vector{Vector{Token}}, translation_corpus::Vector{Token}; max_order=4, smooth=false)

Computes the BLEU score of translated segments against one or more references. Returns the BLEU score, n-gram precisions, brevity penalty, geometric mean of n-gram precisions, translation_length and reference_length.

Arguments

  • reference_corpus: list of lists of references for each translation. Each reference should be tokenized into a list of tokens.
  • translation_corpus: list of translations to score. Each translation should be tokenized into a list of tokens.
  • max_order: maximum n-gram order to use when computing BLEU score.
  • smooth=false: whether or not to apply Lin et al. 2004 smoothing.

Example:

# Two tokenized references for one translation.
one_doc_references = [
    ["apple", "is", "apple"],
    ["apple", "is", "a", "fruit"]
]
# The tokenized candidate translation to be scored.
one_doc_translation = [
    "apple", "is", "appl"
]
# Both arguments are wrapped in an outer vector: one entry per translation being scored.
bleu_score([one_doc_references], [one_doc_translation], smooth=true)

NLTK sample

    using TextAnalysis

    reference1 = [
        "It", "is", "a", "guide", "to", "action", "that",
        "ensures", "that", "the", "military", "will", "forever",
        "heed", "Party", "commands"
    ]
    reference2 = [
        "It", "is", "the", "guiding", "principle", "which",
        "guarantees", "the", "military", "forces", "always",
        "being", "under", "the", "command", "of", "the",
        "Party"
    ]
    reference3 = [
        "It", "is", "the", "practical", "guide", "for", "the",
        "army", "always", "to", "heed", "the", "directions",
        "of", "the", "party"
    ]

    hypothesis1 = [
        "It", "is", "a", "guide", "to", "action", "which",
        "ensures", "that", "the", "military", "always",
        "obeys", "the", "commands", "of", "the", "party"
    ]

    score = bleu_score([[reference1, reference2, reference3]], [hypothesis1])
(bleu = 0.5045666840058485, precisions = [0.9444444444444444, 0.5882352941176471, 0.4375, 0.26666666666666666], bp = 1.0, geo_mean = 0.5045666840058485, translation_length = 18, reference_length = 16)
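
As the printed result suggests, bleu_score returns a named tuple, so the individual components can be read by field name; a brief sketch:

    score.bleu                # overall BLEU score
    score.precisions          # per-order n-gram precisions
    score.bp                  # brevity penalty
    score.geo_mean            # geometric mean of the n-gram precisions
    score.translation_length  # total token count of the translations
    score.reference_length    # reference length used for the brevity penalty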