Evaluation Metrics
Natural Language Processing tasks such as summarization and machine translation need evaluation metrics to score system output against reference texts. TextAnalysis currently provides the following:
ROUGE-N, ROUGE-L, ROUGE-L-Summary
ROUGE-N measures the overlap of n-grams between the candidate (system) summary and the reference summaries, while the ROUGE-L variants measure the longest common subsequence at the sentence and summary level.
Base.argmax — Function
TextAnalysis.average — Function
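Both helpers reduce a vector of Score objects (as returned by the ROUGE functions below) to a single Score. A minimal sketch, assuming argmax picks the best-scoring entry and average returns the element-wise mean; the texts are illustrative only:

using TextAnalysis

refs = ["The cat sat on the mat.", "A cat was sitting on the mat."]
cand = "The cat is sitting on the mat."

scores = rouge_n(refs, cand, 1)        # one Score per reference
best   = argmax(scores)                # best-scoring reference (extends Base.argmax)
avg    = TextAnalysis.average(scores)  # assumed: element-wise mean of the scores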
TextAnalysis.rouge_n — Function
rouge_n(
references::Vector{<:AbstractString},
candidate::AbstractString,
n::Int;
lang::Language
)::Vector{Score}

Compute n-gram recall between the candidate and the reference summaries.
Arguments
references::Vector{T} where T<:AbstractString - List of reference summaries
candidate::AbstractString - Input candidate summary to be scored against the reference summaries
n::Integer - Order of n-grams
lang::Language - Language of the text, useful while generating n-grams (default: Languages.English())
Return a vector of Score objects.
See ROUGE: A Package for Automatic Evaluation of Summaries (Lin, 2004)
See also: rouge_l_sentence, rouge_l_summary
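The lang keyword can be passed explicitly for non-English text. A brief sketch, assuming Languages.German() from Languages.jl; the sentences are illustrative only:

using TextAnalysis, Languages

refs = ["Der Hund schläft im Garten."]
cand = "Ein Hund schläft im Garten."

# pass the language of the text so n-grams are generated appropriately
scores = rouge_n(refs, cand, 1; lang=Languages.German())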
TextAnalysis.rouge_l_sentence — Function
rouge_l_sentence(
references::Vector{<:AbstractString}, candidate::AbstractString, β=8;
weighted=false, weight_func=sqrt,
lang=Languages.English()
)::Vector{Score}

Calculate the ROUGE-L score between references and candidate at sentence level.
Return a vector of Score objects.
See ROUGE: A Package for Automatic Evaluation of Summaries (Lin, 2004)
The weighted argument enables weighting of consecutive matches when computing the longest common subsequence (weighted LCS). The original ROUGE-1.5.5.pl implementation uses a power function for this weighting; here weight_func defaults to sqrt, i.e. a power of 0.5 (see the sketch below).
See also: rouge_n, rouge_l_summary
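A short sketch of weighted, sentence-level ROUGE-L using the keywords documented above; the texts are illustrative only:

using TextAnalysis

refs = ["The cat sat on the mat."]
cand = "The cat was sitting on the mat."

plain    = rouge_l_sentence(refs, cand)                    # β defaults to 8, unweighted LCS
weighted = rouge_l_sentence(refs, cand, 8; weighted=true)  # weighted LCS with the default weight_func (sqrt)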
TextAnalysis.rouge_l_summary — Function
rouge_l_summary(
references::Vector{<:AbstractString}, candidate::AbstractString, β::Int;
lang=Languages.English()
)::Vector{Score}

Calculate the ROUGE-L score between references and candidate at summary level.
Return a vector of Score objects.
See ROUGE: A Package for Automatic Evaluation of Summaries (Lin, 2004)
See also: rouge_l_sentence, rouge_n
ROUGE-N Example
using TextAnalysis
candidate_summary = "Brazil, Russia, China and India are growing nations. They are all an important part of BRIC as well as regular part of G20 summits."
reference_summaries = ["Brazil, Russia, India and China are the next big political powers in the global economy. Together referred to as BRIC(S) along with South Korea.", "Brazil, Russia, India and China are together known as the BRIC(S) and have been invited to the G20 summit."]
# Calculate ROUGE-N scores for different N values
rouge_2_scores = rouge_n(reference_summaries, candidate_summary, 2)
rouge_1_scores = rouge_n(reference_summaries, candidate_summary, 1)
# Get the best scores using argmax
results = [rouge_2_scores, rouge_1_scores] .|> argmax

2-element Vector{Score}:
 Score(precision=0.14814815, recall=0.16, fmeasure=0.15384616)
 Score(precision=0.53571427, recall=0.5769231, fmeasure=0.5555556)

ROUGE-L Examples
ROUGE-L measures the longest common subsequence between the candidate and reference summaries:
using TextAnalysis
candidate = "Brazil, Russia, China and India are growing nations."
references = [
"Brazil, Russia, India and China are the next big political powers.",
"Brazil, Russia, India and China are BRIC nations."
]
# ROUGE-L for sentence-level evaluation
sentence_scores = rouge_l_sentence(references, candidate)
# ROUGE-L for summary-level evaluation (requires β parameter)
summary_scores = rouge_l_summary(references, candidate, 8)

2-element Vector{Score}:
 Score(precision=0.54545456, recall=0.42857143, fmeasure=0.42998898)
 Score(precision=0.6363636, recall=0.6363636, fmeasure=0.6363636)

BLEU (bilingual evaluation understudy)
TextAnalysis.bleu_score — Function
bleu_score(reference_corpus::Vector{Vector{Token}}, translation_corpus::Vector{Token}; max_order=4, smooth=false)

Compute the BLEU score of translated segments against one or more references.
Return the BLEU score, n-gram precisions, brevity penalty, geometric mean of n-gram precisions, translation_length, and reference_length.
Arguments
reference_corpus: List of lists of references for each translation. Each reference should be tokenized into a list of tokens.
translation_corpus: List of translations to score. Each translation should be tokenized into a list of tokens.
max_order: Maximum n-gram order to use when computing the BLEU score.
smooth=false: Whether or not to apply Lin et al. 2004 smoothing.
Example:
one_doc_references = [
["apple", "is", "apple"],
["apple", "is", "a", "fruit"]
]
one_doc_translation = [
"apple", "is", "appl"
]
bleu_score([one_doc_references], [one_doc_translation], smooth=true)

Example adapted from NLTK:
using TextAnalysis
reference1 = [
"It", "is", "a", "guide", "to", "action", "that",
"ensures", "that", "the", "military", "will", "forever",
"heed", "Party", "commands"
]
reference2 = [
"It", "is", "the", "guiding", "principle", "which",
"guarantees", "the", "military", "forces", "always",
"being", "under", "the", "command", "of", "the",
"Party"
]
reference3 = [
"It", "is", "the", "practical", "guide", "for", "the",
"army", "always", "to", "heed", "the", "directions",
"of", "the", "party"
]
hypothesis1 = [
"It", "is", "a", "guide", "to", "action", "which",
"ensures", "that", "the", "military", "always",
"obeys", "the", "commands", "of", "the", "party"
]
# Calculate BLEU score
score = bleu_score([[reference1, reference2, reference3]], [hypothesis1])

(bleu = 0.5045666840058485, precisions = [0.9444444444444444, 0.5882352941176471, 0.4375, 0.26666666666666666], bp = 1.0, geo_mean = 0.5045666840058485, translation_length = 18, reference_length = 16)
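The result prints as a named tuple, so (assuming that form) its individual components can be read by field:

score.bleu         # overall BLEU score
score.precisions   # n-gram precisions for orders 1 through max_order
score.bp           # brevity penalty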