Evaluation Metrics
Natural language processing tasks often need evaluation metrics to measure output quality. TextAnalysis currently provides the following evaluation metrics.
ROUGE-N, ROUGE-L, ROUGE-L-Summary
These metrics score a candidate (system) summary by its overlap with one or more reference summaries: ROUGE-N is based on n-gram overlap, while ROUGE-L and ROUGE-L-Summary are based on the longest common subsequence.
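As a rough illustration of the n-gram overlap idea, here is a standalone sketch (not the library's implementation) that computes a bigram recall by hand on made-up sentences:
# Illustrative sketch only: ROUGE-N recall is the fraction of the reference's
# n-grams that also appear in the candidate.
ngrams(tokens, n) = [join(tokens[i:i+n-1], " ") for i in 1:length(tokens)-n+1]
reference = split("the cat sat on the mat")
candidate = split("the cat lay on the mat")
overlap = count(in(ngrams(candidate, 2)), ngrams(reference, 2))  # shared bigrams: 3
recall  = overlap / length(ngrams(reference, 2))                 # ROUGE-2 recall: 0.6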
Base.argmax — Function
argmax(scores::Vector{Score})::Score
- scores - vector of Score
Returns the Score with the maximum precision field.
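For example, argmax can pick the best-matching reference out of a ROUGE result. A minimal sketch with placeholder summaries and variable names:
using TextAnalysis
refs = ["Brazil, Russia, India and China are the BRIC nations.",
        "The BRIC countries meet at the G20 summit."]
scores = rouge_n(refs, "Brazil, Russia, India and China are growing nations.", 1)
best = argmax(scores)  # the Score with the highest precision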
TextAnalysis.average — Function
average(scores::Vector{Score})::Score
- scores - vector of Score
Returns the average of the scores as a single Score with averaged precision, recall, and fmeasure.
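Similarly, average collapses the per-reference scores into one. A minimal sketch with placeholder summaries (the call is qualified with the module name):
using TextAnalysis
refs = ["Brazil, Russia, India and China are the BRIC nations.",
        "The BRIC countries meet at the G20 summit."]
scores = rouge_n(refs, "Brazil, Russia, India and China are growing nations.", 1)
avg = TextAnalysis.average(scores)  # one Score with averaged precision/recall/fmeasure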
TextAnalysis.rouge_n — Function
rouge_n(
    references::Vector{<:AbstractString},
    candidate::AbstractString,
    n::Int;
    lang::Language
)::Vector{Score}
Compute n-gram recall between the candidate and the reference summaries.
The function takes the following arguments:
- references::Vector{T} where T<:AbstractString = the list of reference summaries.
- candidate::AbstractString = the candidate summary to be scored against the reference summaries.
- n::Integer = order of the n-grams.
- lang::Language = language of the text, used while generating n-grams. The default value is Languages.English().
Returns a vector of Score.
See Rouge: A package for automatic evaluation of summaries
See also: rouge_l_sentence, rouge_l_summary
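A minimal call sketch with the language passed explicitly (Languages.English() is also the default; the inputs are placeholders, and loading Languages directly assumes that package is in the active environment):
using TextAnalysis, Languages
refs = ["The quick brown fox jumps over the lazy dog."]
cand = "A quick brown fox jumped over a lazy dog."
rouge_n(refs, cand, 2; lang=Languages.English())  # returns a Vector{Score}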
TextAnalysis.rouge_l_sentence — Function
rouge_l_sentence(
    references::Vector{<:AbstractString}, candidate::AbstractString, β=8;
    weighted=false, weight_func=sqrt,
    lang=Languages.English()
)::Vector{Score}
Calculate the ROUGE-L score between the references and the candidate at sentence level.
Returns a vector of Score.
See Rouge: A package for automatic evaluation of summaries
Note: the weighted argument enables weighting of values when calculating the longest common subsequence. The original ROUGE-1.5.5.pl implementation uses a power function; weight_func here defaults to sqrt, i.e. a power of 0.5.
See also: rouge_n, rouge_l_summary
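A sentence-level sketch with the weighted LCS enabled (placeholder inputs; the 8 mirrors the default β from the signature):
using TextAnalysis
refs = ["The cat sat on the mat."]
cand = "The cat was sitting on the mat."
rouge_l_sentence(refs, cand, 8; weighted=true)  # Vector{Score}, weighted LCS via sqrt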
TextAnalysis.rouge_l_summary — Function
rouge_l_summary(
    references::Vector{<:AbstractString}, candidate::AbstractString, β::Int;
    lang=Languages.English()
)::Vector{Score}
Calculate the ROUGE-L score between the references and the candidate at summary level.
Returns a vector of Score.
See Rouge: A package for automatic evaluation of summaries
See also: rouge_l_sentence, rouge_n
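And the summary-level variant, where β must be given explicitly (placeholder inputs again):
using TextAnalysis
refs = ["The cat sat on the mat. It was a sunny day."]
cand = "The cat was sitting on the mat on a sunny day."
rouge_l_summary(refs, cand, 8)  # Vector{Score}
The example below compares a candidate against two references with ROUGE-2 and ROUGE-1, keeping the best score per order with argmax: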
using TextAnalysis
candidate_summary = "Brazil, Russia, China and India are growing nations. They are all an important part of BRIC as well as regular part of G20 summits."
reference_summaries = ["Brazil, Russia, India and China are the next big political powers in the global economy. Together referred to as BRIC(S) along with South Korea.", "Brazil, Russia, India and China are together known as the BRIC(S) and have been invited to the G20 summit."]
results = [
rouge_n(reference_summaries, candidate_summary, 2),
rouge_n(reference_summaries, candidate_summary, 1)
] .|> argmax
2-element Vector{Score}:
Score(precision=0.14814815, recall=0.16, fmeasure=0.15384616)
Score(precision=0.53571427, recall=0.5769231, fmeasure=0.5555556)
BLEU (bilingual evaluation understudy)
TextAnalysis.bleu_score — Function
bleu_score(reference_corpus::Vector{Vector{Token}}, translation_corpus::Vector{Token}; max_order=4, smooth=false)
Computes the BLEU score of translated segments against one or more references. Returns the BLEU score, the n-gram precisions, the brevity penalty, the geometric mean of the n-gram precisions, translation_length, and reference_length.
Arguments
- reference_corpus: list of lists of references for each translation. Each reference should be tokenized into a list of tokens.
- translation_corpus: list of translations to score. Each translation should be tokenized into a list of tokens.
- max_order: maximum n-gram order to use when computing the BLEU score.
- smooth=false: whether or not to apply Lin et al. 2004 smoothing.
Example:
one_doc_references = [
["apple", "is", "apple"],
["apple", "is", "a", "fruit"]
]
one_doc_translation = [
"apple", "is", "appl"
]
bleu_score([one_doc_references], [one_doc_translation], smooth=true)
using TextAnalysis
reference1 = [
"It", "is", "a", "guide", "to", "action", "that",
"ensures", "that", "the", "military", "will", "forever",
"heed", "Party", "commands"
]
reference2 = [
"It", "is", "the", "guiding", "principle", "which",
"guarantees", "the", "military", "forces", "always",
"being", "under", "the", "command", "of", "the",
"Party"
]
reference3 = [
"It", "is", "the", "practical", "guide", "for", "the",
"army", "always", "to", "heed", "the", "directions",
"of", "the", "party"
]
hypothesis1 = [
"It", "is", "a", "guide", "to", "action", "which",
"ensures", "that", "the", "military", "always",
"obeys", "the", "commands", "of", "the", "party"
]
score = bleu_score([[reference1, reference2, reference3]], [hypothesis1])
(bleu = 0.5045666840058485, precisions = [0.9444444444444444, 0.5882352941176471, 0.4375, 0.26666666666666666], bp = 1.0, geo_mean = 0.5045666840058485, translation_length = 18, reference_length = 16)
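In this run the translation (18 tokens) is longer than the reported reference_length (16), so the brevity penalty bp stays at 1.0 and the bleu value equals geo_mean, the geometric mean of the four n-gram precisions.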