Evaluation and Metrics - AI Glossarium

BLEU (Bilingual Evaluation Understudy)

Automatic metric for evaluating machine translation quality. It compares the n-grams of the generated text with those of one or more human reference translations and reports a modified (clipped) n-gram precision, combined with a brevity penalty so that very short outputs are not rewarded.
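The clipped n-gram precision described above can be sketched in a few lines. This is a minimal, unsmoothed illustration that assumes pre-tokenized input and uniform weights over 1- to 4-grams; the function names are hypothetical, and established toolkits handle tokenization and smoothing details that are ignored here.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as
    often as it appears in the reference that contains it most often."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

def bleu(candidate, references, max_n=4):
    """Geometric mean of the 1..max_n precisions times a brevity penalty.
    No smoothing: any zero precision yields a score of 0."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    c = len(candidate)
    # Reference length closest to the candidate length (ties go to the shorter).
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_avg)
```

Clipping is what stops a degenerate output such as "the the the the the the the" from scoring a perfect unigram precision against "the cat is on the mat": only two of its seven tokens are credited.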

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

Set of metrics used primarily to evaluate automatic summaries, focusing on how many of the reference summary's n-grams are recovered by the generated summary (recall). Common variants are ROUGE-N (n-gram overlap), ROUGE-L (longest common subsequence), and ROUGE-S (skip-bigram overlap).
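As a rough illustration of the recall orientation, the sketch below computes ROUGE-N recall and a recall-only ROUGE-L for a single reference. It assumes pre-tokenized input and omits the precision and F-measure variants that published implementations also report; the function names are hypothetical.

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also occur in the candidate."""
    ref_counts = Counter(tuple(reference[i:i + n])
                         for i in range(len(reference) - n + 1))
    cand_counts = Counter(tuple(candidate[i:i + n])
                          for i in range(len(candidate) - n + 1))
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand_counts[gram])
                  for gram, count in ref_counts.items())
    return overlap / total

def lcs_length(a, b):
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_recall(candidate, reference):
    """LCS length divided by the reference length."""
    return lcs_length(candidate, reference) / len(reference) if reference else 0.0
```

For the reference "the cat sat on the mat" and the candidate "the cat lay on the mat", five of the six reference unigrams and three of the five reference bigrams are recovered.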

Coherence Score

Indicator of the semantic and logical coherence of a generated text over a long span: whether sentences and paragraphs follow one another in a meaningful way. It is often computed with sentence-embedding models or purpose-trained classifiers.
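One simple way to approximate such a score is to average the similarity of each sentence to the one that follows it. The sketch below uses bag-of-words counts as a stand-in for sentence embeddings, which is an assumption made only to keep the example self-contained; in practice the `embed` argument would wrap a real embedding model, and it would need to return the same sparse word-to-weight mapping used here.

```python
import math
from collections import Counter

def bow_vector(sentence):
    """Stand-in 'embedding': lowercase word counts (not a real sentence encoder)."""
    return Counter(sentence.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dict-like counters."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def coherence_score(sentences, embed=bow_vector):
    """Mean cosine similarity between consecutive sentences.
    By convention here, a text with fewer than two sentences scores 1.0."""
    if len(sentences) < 2:
        return 1.0
    vectors = [embed(s) for s in sentences]
    sims = [cosine(a, b) for a, b in zip(vectors, vectors[1:])]
    return sum(sims) / len(sims)
```

Adjacent-sentence similarity only captures local topical continuity; it does not check logical consistency, which is where trained classifiers come in.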

Human Evaluation

Gold-standard methodology in which human annotators judge the quality of LLM outputs against predefined criteria such as relevance, fluency, or completeness. It provides a ground-truth measure but is expensive and difficult to scale.

Toxicity

Metric evaluating the likelihood that a model generates offensive, hateful, discriminatory, or harmful content. It is typically measured by specialized classifiers trained on corpora of texts annotated for their toxicity.

Hallucination

Phenomenon in which an LLM generates factually incorrect, unfounded, or invented information and presents it as fact. Hallucination evaluation checks the faithfulness of the generated content against a known source of truth.

Bias

Measure of a model's tendency to produce systematically harmful or stereotypical results toward certain demographic groups. Bias evaluation analyzes outputs to detect prejudices related to gender, race, or other sensitive attributes.

Faithfulness Metric

Indicator evaluating to what extent the content generated by an LLM, particularly in question-answering or summarization systems, remains faithful and consistent with the provided context or source documents. Low faithfulness indicates deviations or inventions.

MMLU (Massive Multitask Language Understanding) Benchmark

A comprehensive benchmark designed to measure an LLM's knowledge and comprehension across 57 subjects, from elementary math to US law and history. All questions are multiple-choice, and performance is reported as accuracy.
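Scoring a multiple-choice benchmark of this kind comes down to counting correct answer letters per subject. The sketch below uses three made-up items and a trivial baseline predictor; the item layout, the `predict` signature, and the macro-average aggregate are all assumptions for illustration, not the benchmark's actual data format or official scoring script.

```python
from collections import defaultdict

# Hypothetical items for illustration only (not taken from the benchmark).
ITEMS = [
    {"subject": "math", "question": "2 + 2 = ?",
     "choices": ["3", "4", "5", "6"], "answer": "B"},
    {"subject": "math", "question": "3 * 3 = ?",
     "choices": ["6", "8", "9", "12"], "answer": "C"},
    {"subject": "history", "question": "In which year did the Berlin Wall fall?",
     "choices": ["1989", "1991", "1979", "1961"], "answer": "A"},
]

def accuracy_by_subject(items, predict):
    """predict(question, choices) must return an answer letter such as 'A'."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item["subject"]] += 1
        if predict(item["question"], item["choices"]) == item["answer"]:
            correct[item["subject"]] += 1
    return {subject: correct[subject] / total[subject] for subject in total}

def macro_average(per_subject):
    """Unweighted mean over subjects (one possible way to aggregate)."""
    return sum(per_subject.values()) / len(per_subject) if per_subject else 0.0

def always_a(question, choices):
    """Trivial baseline standing in for a real model call."""
    return "A"
```

Calling `accuracy_by_subject(ITEMS, always_a)` shows how a constant-answer baseline fares per subject, which is a useful sanity check before plugging in a real model.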

Helpfulness Score

A qualitative metric evaluating how useful and relevant an LLM's response is and how well it solves the user's problem. This score is often obtained through human evaluation or reward models.

Few-Shot Evaluation

An evaluation technique in which the model is shown a very small number of examples (typically 1 to 5) of the target task within the prompt to guide its understanding and performance. It assesses how quickly the model adapts to a task from in-context examples alone.
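In practice, the in-context examples are simply concatenated ahead of the test question. The sketch below shows one plausible prompt layout; the `Q:`/`A:` template and the function name are assumptions, since the template varies between benchmarks and evaluation harnesses.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Assemble a k-shot prompt from (question, answer) pairs.
    The final block leaves the answer empty for the model to complete."""
    parts = [instruction] if instruction else []
    for question, answer in examples:
        parts.append(f"Q: {question}\nA: {answer}")
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)

# Usage: a 2-shot prompt for a toy arithmetic task.
prompt = build_few_shot_prompt(
    [("2+2", "4"), ("3+3", "6")],
    "4+4",
    instruction="Answer with a number.",
)
```

Passing an empty `examples` list to the same function yields a zero-shot prompt, which makes it easy to compare the two settings on the same test questions.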

Alignment

A process and metric aimed at ensuring an LLM's behavior is consistent with human intentions, ethical values, and given instructions. Alignment evaluation checks whether the model is helpful, harmless, and honest (the HHH framework).

Diversity Metric

An indicator measuring the variety and breadth of vocabulary or themes in text generated by an LLM; high diversity means fewer repetitions and generic responses. It can be calculated as the type-token ratio (TTR), the number of unique word types divided by the total number of tokens, or from the semantic similarity between sentences.
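The type-token ratio is straightforward to compute. The sketch below assumes whitespace tokenization and lowercasing, which is a simplification; a real evaluation would use a proper tokenizer and compare texts of similar length, since TTR tends to fall as texts get longer.

```python
def type_token_ratio(text):
    """Unique word types divided by total tokens (whitespace tokenization, lowercased)."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# "the" appears twice, so there are 4 types among 5 tokens.
ttr = type_token_ratio("The cat and the dog")
```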

Cross-Reference Score

A metric used to evaluate an LLM's consistency by generating multiple responses to the same question and measuring their semantic or factual similarity. A high score indicates high reliability and a low probability of contradiction.
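The idea can be sketched as the mean pairwise similarity over a set of sampled responses. Word-level Jaccard overlap is used below purely as a placeholder for a semantic or factual similarity function; the function names are hypothetical, and any function returning a value in [0, 1] could be passed in instead.

```python
from itertools import combinations

def jaccard(a, b):
    """Word-set overlap; a crude stand-in for semantic similarity."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

def consistency_score(responses, similarity=jaccard):
    """Mean similarity over all unordered pairs of responses.
    By convention here, fewer than two responses scores 1.0."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)
```

Surface overlap will penalize paraphrases that agree in meaning, so this placeholder underestimates consistency for free-form answers.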

TruthfulQA Scale

A specialized benchmark designed to measure an LLM's tendency to imitate common falsehoods found in its training texts, evaluating its ability to generate factually true answers. It focuses on truthfulness rather than mere helpfulness or coherence.

Robustness Metric

An evaluation of the stability of an LLM's performance under minor variations in the input (e.g., typos, paraphrases, or noise). A high robustness score means that output quality does not degrade significantly under such perturbations.
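A common way to quantify this is to score the model on clean inputs and on perturbed copies, then report the gap. The sketch below assumes exact-match accuracy and a single typo-style perturbation (swapping two adjacent characters); `model` is any callable from input text to output, and all names are hypothetical.

```python
import random

def swap_adjacent_chars(text, rng):
    """Simulate a typo by swapping one random pair of neighbouring characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def accuracy(model, pairs):
    """Exact-match accuracy over (input, expected_output) pairs."""
    return sum(model(x) == y for x, y in pairs) / len(pairs) if pairs else 0.0

def robustness_report(model, examples, perturb=swap_adjacent_chars, seed=0):
    """Compare accuracy on clean inputs with accuracy on perturbed inputs."""
    rng = random.Random(seed)  # fixed seed so the perturbations are reproducible
    perturbed = [(perturb(x, rng), y) for x, y in examples]
    clean_acc = accuracy(model, examples)
    perturbed_acc = accuracy(model, perturbed)
    return {"clean": clean_acc, "perturbed": perturbed_acc,
            "drop": clean_acc - perturbed_acc}
```

A small `drop` corresponds to a high robustness score; paraphrases or injected noise can be tested the same way by passing a different `perturb` function.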

Instruction Following Score

Metric evaluating the ability of an LLM to accurately understand and execute complex, multi-step instructions provided in a prompt. This score is crucial for agent applications and conditional generation tasks.
