Evaluation Metrics

Natural Language Processing tasks require evaluation metrics. TextAnalysis currently provides the following evaluation metrics:

ROUGE-N, ROUGE-L, ROUGE-L-Summary

These metrics evaluate a candidate (system) summary against one or more reference summaries: ROUGE-N uses n-gram overlap, while the ROUGE-L variants use the longest common subsequence.
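
For intuition, ROUGE-N recall is the fraction of reference n-grams that also appear in the candidate. The sketch below is purely illustrative (it is not how the package computes the score, and it omits the count clipping a full ROUGE implementation performs):

candidate = split("the cat sat on the mat")
reference = split("the cat is on the mat")

# All contiguous n-grams of a token vector
ngrams(tokens, n) = [tokens[i:i+n-1] for i in 1:length(tokens)-n+1]

cand_bigrams = ngrams(candidate, 2)
ref_bigrams  = ngrams(reference, 2)

# Recall: reference bigrams that also occur in the candidate, over all reference bigrams
recall = count(g -> g in cand_bigrams, ref_bigrams) / length(ref_bigrams)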

Base.argmax - Function
argmax(scores::Vector{Score})::Score
  • scores - Vector of Score objects

Return the Score with the maximum fmeasure field.

TextAnalysis.average - Function
average(scores::Vector{Score})::Score
  • scores - Vector of Score objects

Return the average of scores as a single Score, averaging the precision, recall, and fmeasure fields.

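Both helpers operate on the Vector{Score} returned by the ROUGE functions. A minimal sketch of typical usage (the input strings here are made up for illustration):

using TextAnalysis

refs = ["the cat sat on the mat", "a cat was sitting on the mat"]
cand = "the cat is on the mat"

scores = rouge_n(refs, cand, 1)        # one Score per reference

best = argmax(scores)                  # Score with the highest fmeasure
avg  = TextAnalysis.average(scores)    # mean precision, recall and fmeasure as a single Score
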
TextAnalysis.rouge_n - Function
rouge_n(
    references::Vector{<:AbstractString}, 
    candidate::AbstractString, 
    n::Int; 
    lang::Language
)::Vector{Score}

Compute n-gram recall between the candidate and the reference summaries.

Arguments

  • references::Vector{T} where T <: AbstractString - List of reference summaries
  • candidate::AbstractString - Input candidate summary to be scored against reference summaries
  • n::Integer - Order of n-grams
  • lang::Language - Language of the text, useful while generating n-grams (default: Languages.English())

Return a vector of Score objects.

See ROUGE: A Package for Automatic Evaluation of Summaries (Lin, 2004)

See also: rouge_l_sentence, rouge_l_summary

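A minimal sketch showing the optional lang keyword; Languages.jl provides the language objects (it is a dependency of TextAnalysis, and loading it explicitly is assumed here):

using TextAnalysis, Languages

refs = ["Julia is fast and dynamic", "Julia is a fast dynamic language"]
cand = "Julia is a dynamic language"

# Unigram and bigram recall against each reference, with the language passed explicitly
scores_1 = rouge_n(refs, cand, 1; lang = Languages.English())
scores_2 = rouge_n(refs, cand, 2; lang = Languages.English())
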
TextAnalysis.rouge_l_sentence - Function
rouge_l_sentence(
    references::Vector{<:AbstractString}, candidate::AbstractString, β=8;
    weighted=false, weight_func=sqrt,
    lang=Languages.English()
)::Vector{Score}

Calculate the ROUGE-L score between the references and the candidate at the sentence level.

Return a vector of Score objects.

See ROUGE: A Package for Automatic Evaluation of Summaries (Lin, 2004)

Note

The weighted argument enables weighting of matches when computing the longest common subsequence. The original implementation, ROUGE-1.5.5.pl, uses a power function for this weighting; here weight_func defaults to sqrt, i.e. a power of 0.5.

See also: rouge_n, rouge_l_summary

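A brief sketch of the weighted variant described in the note above; weight_func is left at its sqrt default, so this only switches the weighting on (the sentences are made up for illustration):

using TextAnalysis

refs = ["the cat sat on the mat"]
cand = "the cat is sitting on the mat"

plain    = rouge_l_sentence(refs, cand)                                  # unweighted LCS, β = 8 by default
weighted = rouge_l_sentence(refs, cand, 8; weighted = true, weight_func = sqrt)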

ROUGE-N Example

using TextAnalysis

candidate_summary = "Brazil, Russia, China and India are growing nations. They are all an important part of BRIC as well as regular part of G20 summits."
reference_summaries = ["Brazil, Russia, India and China are the next big political powers in the global economy. Together referred to as BRIC(S) along with South Korea.", "Brazil, Russia, India and China are together known as the BRIC(S) and have been invited to the G20 summit."]

# Calculate ROUGE-N scores for different N values
rouge_2_scores = rouge_n(reference_summaries, candidate_summary, 2)
rouge_1_scores = rouge_n(reference_summaries, candidate_summary, 1)

# Get the best scores using argmax
results = [rouge_2_scores, rouge_1_scores] .|> argmax
2-element Vector{Score}:
 Score(precision=0.14814815, recall=0.16, fmeasure=0.15384616)
 Score(precision=0.53571427, recall=0.5769231, fmeasure=0.5555556)

ROUGE-L Examples

ROUGE-L scores the candidate against the reference summaries based on their longest common subsequence (LCS):

using TextAnalysis

candidate = "Brazil, Russia, China and India are growing nations."
references = [
    "Brazil, Russia, India and China are the next big political powers.",
    "Brazil, Russia, India and China are BRIC nations."
]

# ROUGE-L for sentence-level evaluation
sentence_scores = rouge_l_sentence(references, candidate)

# ROUGE-L for summary-level evaluation (requires the β parameter)
summary_scores = rouge_l_summary(references, candidate, 8)
2-element Vector{Score}:
 Score(precision=0.54545456, recall=0.42857143, fmeasure=0.42998898)
 Score(precision=0.6363636, recall=0.6363636, fmeasure=0.6363636)

BLEU (bilingual evaluation understudy)

TextAnalysis.bleu_score - Function
bleu_score(reference_corpus::Vector{Vector{Token}}, translation_corpus::Vector{Token}; max_order=4, smooth=false)

Compute the BLEU score of translated segments against one or more references.

Return the BLEU score, n-gram precisions, brevity penalty, geometric mean of n-gram precisions, translation_length, and reference_length.

Arguments

  • reference_corpus: List of lists of references for each translation. Each reference should be tokenized into a list of tokens.
  • translation_corpus: List of translations to score. Each translation should be tokenized into a list of tokens.
  • max_order: Maximum n-gram order to use when computing BLEU score.
  • smooth=false: Whether or not to apply Lin et al. 2004 smoothing.

Example:

using TextAnalysis

# Two references and one translation for a single document
one_doc_references = [
    ["apple", "is", "apple"],
    ["apple", "is", "a", "fruit"]
]
one_doc_translation = [
    "apple", "is", "appl"
]
bleu_score([one_doc_references], [one_doc_translation], smooth=true)

Example adapted from NLTK:

using TextAnalysis

reference1 = [
    "It", "is", "a", "guide", "to", "action", "that",
    "ensures", "that", "the", "military", "will", "forever",
    "heed", "Party", "commands"
]
reference2 = [
    "It", "is", "the", "guiding", "principle", "which",
    "guarantees", "the", "military", "forces", "always",
    "being", "under", "the", "command", "of", "the",
    "Party"
]
reference3 = [
    "It", "is", "the", "practical", "guide", "for", "the",
    "army", "always", "to", "heed", "the", "directions",
    "of", "the", "party"
]

hypothesis1 = [
    "It", "is", "a", "guide", "to", "action", "which",
    "ensures", "that", "the", "military", "always",
    "obeys", "the", "commands", "of", "the", "party"
]

# Calculate BLEU score
score = bleu_score([[reference1, reference2, reference3]], [hypothesis1])
(bleu = 0.5045666840058485, precisions = [0.9444444444444444, 0.5882352941176471, 0.4375, 0.26666666666666666], bp = 1.0, geo_mean = 0.5045666840058485, translation_length = 18, reference_length = 16)
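
The result is a named tuple, so the components listed in the docstring can be read off by field name:

score.bleu          # overall BLEU score
score.precisions    # n-gram precisions for orders 1 through max_order
score.bp            # brevity penalty
score.geo_mean      # geometric mean of the n-gram precisions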