AI Glossary
The Complete Dictionary of Artificial Intelligence
BLEU (Bilingual Evaluation Understudy)
Automatic metric for evaluating the quality of machine translations by comparing the n-gram precision of the generated text against one or more human references. It measures the overlap of text segments between the model output and the reference.
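A minimal sentence-level sketch, assuming the NLTK library is available (production setups more commonly use corpus-level tools such as sacreBLEU):

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    # One or more tokenized human references.
    references = [
        "the cat is sitting on the mat".split(),
        "there is a cat on the mat".split(),
    ]
    # Tokenized model output to be scored.
    hypothesis = "the cat sat on the mat".split()

    # Smoothing avoids zero scores when a higher-order n-gram has no match.
    smoothing = SmoothingFunction().method1
    score = sentence_bleu(references, hypothesis, smoothing_function=smoothing)
    print(f"BLEU: {score:.3f}")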
ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
Set of metrics primarily used to evaluate the quality of automatic summaries, focusing on the recall of n-grams compared to reference summaries. ROUGE-N, ROUGE-L, and ROUGE-S are its most common variants.
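A short sketch, assuming Google's rouge-score package is installed, computing ROUGE-1 and ROUGE-L for a generated summary against a reference:

    from rouge_score import rouge_scorer

    reference = "The economy grew by three percent in the last quarter."
    candidate = "Last quarter the economy expanded by three percent."

    # ROUGE-1 counts unigram overlap; ROUGE-L uses the longest common subsequence.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    scores = scorer.score(reference, candidate)
    for name, result in scores.items():
        print(name, f"P={result.precision:.2f}", f"R={result.recall:.2f}", f"F1={result.fmeasure:.2f}")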
Coherence Score
Indicator measuring the semantic and logical coherence of a generated text over a long span, evaluating whether sentences and paragraphs follow one another in a meaningful way. It is often computed with sentence embedding models or specifically trained classifiers.
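One simple proxy, sketched below under the assumption that the sentence-transformers library and a small embedding model are available, is the average cosine similarity between consecutive sentences:

    from sentence_transformers import SentenceTransformer, util

    sentences = [
        "The experiment tested three prompting strategies.",
        "Each strategy was evaluated on the same benchmark.",
        "Bananas are rich in potassium.",  # an off-topic sentence lowers the score
    ]

    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
    embeddings = model.encode(sentences, convert_to_tensor=True)

    # Average cosine similarity between each pair of adjacent sentences.
    pair_sims = [float(util.cos_sim(embeddings[i], embeddings[i + 1]))
                 for i in range(len(embeddings) - 1)]
    print(f"Coherence proxy: {sum(pair_sims) / len(pair_sims):.3f}")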
Human Evaluation
Gold standard methodology where human annotators judge the quality of LLM outputs according to predefined criteria such as relevance, fluency, or completeness. It provides a ground truth measure but is expensive and difficult to scale.
Toxicity
Metric evaluating the likelihood that a model generates offensive, hateful, discriminatory, or harmful content. It is typically measured by specialized classifiers trained on corpora of texts annotated for their toxicity.
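A minimal sketch of scoring one output with an off-the-shelf classifier via the Hugging Face transformers pipeline; the checkpoint name below is an assumption, not a recommendation:

    from transformers import pipeline

    # Any classifier fine-tuned for toxicity detection can be used here;
    # this model name is assumed to be available locally or on the Hub.
    classifier = pipeline("text-classification", model="unitary/toxic-bert")

    output = "You are completely useless and everyone knows it."
    result = classifier(output)[0]
    print(result["label"], f"{result['score']:.3f}")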
Hallucination
Phenomenon where an LLM generates factually incorrect, unfounded, or invented information and presents it as fact. Hallucination evaluation involves checking the faithfulness of the generated content against a known truth source.
Bias
Measure of a model's tendency to produce outputs that are systematically harmful or stereotypical toward certain demographic groups. Bias evaluation analyzes outputs to detect prejudices related to gender, race, or other sensitive attributes.
Faithfulness Metric
Indicator evaluating to what extent the content generated by an LLM, particularly in question-answering or summarization systems, remains faithful and consistent with the provided context or source documents. Low faithfulness indicates deviations or inventions.
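A rough embedding-based proxy, assuming the sentence-transformers library: count a generated sentence as supported if it is sufficiently similar to some sentence of the source document (entailment models are the more rigorous alternative):

    from sentence_transformers import SentenceTransformer, util

    source_sentences = [
        "The report was published in March 2021.",
        "It covers renewable energy investment in Europe.",
    ]
    generated_sentences = [
        "Released in March 2021, the report deals with European renewable energy investment.",
        "It predicts that oil prices will triple next year.",  # unsupported claim
    ]

    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
    src_emb = model.encode(source_sentences, convert_to_tensor=True)
    gen_emb = model.encode(generated_sentences, convert_to_tensor=True)

    # A sentence counts as supported if its best match in the source
    # exceeds a similarity threshold (0.6 is an arbitrary choice).
    sims = util.cos_sim(gen_emb, src_emb)
    supported = [float(row.max()) >= 0.6 for row in sims]
    print(f"Faithfulness proxy: {sum(supported) / len(supported):.2f}")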
MMLU (Massive Multitask Language Understanding) Benchmark
A comprehensive benchmark designed to measure an LLM's knowledge and comprehension abilities across 57 subjects, ranging from elementary mathematics to US law and history. It assesses the model's ability to answer multiple-choice questions.
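A sketch of the basic multiple-choice scoring loop; generate() is a hypothetical stand-in for the model call, and real harnesses typically compare the log-likelihoods of the answer options rather than parsing free text:

    def generate(prompt: str) -> str:
        """Hypothetical LLM call returning a single answer letter."""
        return "B"

    questions = [
        {
            "question": "What is the derivative of x**2?",
            "choices": {"A": "x", "B": "2x", "C": "x**2", "D": "2"},
            "answer": "B",
        },
    ]

    correct = 0
    for item in questions:
        options = "\n".join(f"{k}. {v}" for k, v in item["choices"].items())
        prompt = f"{item['question']}\n{options}\nAnswer with the letter of the correct choice."
        prediction = generate(prompt).strip().upper()[:1]
        correct += prediction == item["answer"]

    print(f"Accuracy: {correct / len(questions):.2%}")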
Helpfulness Score
A qualitative metric evaluating how useful and relevant a response generated by an LLM is, and how effectively it solves the user's problem. This score is often obtained through human evaluation or reward models.
Few-Shot Evaluation
An evaluation technique where the model is presented with a very small number of examples (typically 1 to 5) of the target task within the prompt, to guide its understanding and performance. It assesses the model's rapid adaptation capability.
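A sketch of how such a few-shot prompt is typically assembled; the sentiment task is only illustrative:

    examples = [
        ("I loved this movie!", "positive"),
        ("The plot was dull and predictable.", "negative"),
        ("An absolute masterpiece.", "positive"),
    ]
    query = "The acting was wooden and the pacing slow."

    # Each demonstration shows the input/output format the model should imitate.
    demos = "\n".join(f"Review: {text}\nSentiment: {label}" for text, label in examples)
    prompt = f"{demos}\nReview: {query}\nSentiment:"

    print(prompt)
    # The prompt is then sent to the model and the completion is
    # compared against the gold label, exactly as in zero-shot evaluation.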
Alignment
A process and metric aimed at ensuring an LLM's behavior is consistent with human intentions, ethical values, and given instructions. Alignment evaluation checks whether the model is helpful, harmless, and honest (the HHH framework).
Diversity Metric
An indicator measuring the variety and breadth of vocabulary or themes in a text generated by an LLM, thus avoiding repetitions and generic responses. It can be calculated as the ratio of unique word types to the total number of tokens (type-token ratio, TTR) or via the semantic similarity between sentences.
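The type-token ratio mentioned above takes only a few lines; the whitespace tokenizer is a deliberate simplification:

    def type_token_ratio(text: str) -> float:
        """Ratio of unique words (types) to total words (tokens)."""
        tokens = text.lower().split()  # naive whitespace tokenization
        return len(set(tokens)) / len(tokens) if tokens else 0.0

    print(type_token_ratio("the model said the model said the model said"))  # low diversity
    print(type_token_ratio("the model produced a concise, well structured answer"))  # high diversity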
Cross-Reference Score
A metric used to evaluate an LLM's consistency by generating multiple responses to the same question and measuring their semantic or factual similarity. A high score indicates high reliability and a low probability of contradiction.
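A sketch of such a consistency check, assuming sentence-transformers for embeddings and a hypothetical generate() that samples one answer per call:

    from itertools import combinations
    from sentence_transformers import SentenceTransformer, util

    def generate(question: str) -> str:
        """Hypothetical stand-in for one sampled LLM answer."""
        return "Paris is the capital of France."

    question = "What is the capital of France?"
    answers = [generate(question) for _ in range(5)]

    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
    embeddings = model.encode(answers, convert_to_tensor=True)

    # Average cosine similarity over all unordered pairs of answers.
    pairs = list(combinations(range(len(answers)), 2))
    sims = [float(util.cos_sim(embeddings[i], embeddings[j])) for i, j in pairs]
    print(f"Consistency: {sum(sims) / len(sims):.3f}")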
TruthfulQA Scale
A specialized benchmark designed to measure an LLM's tendency to imitate common falsehoods found in training texts, evaluating its ability to generate factually true answers. It focuses on truthfulness rather than mere helpfulness or coherence.
Robustness Metric
An evaluation of the stability of an LLM's performance in the face of minor variations in the input (e.g., typos, paraphrases, or noise). A high robustness metric means that output quality does not degrade significantly under such perturbations.
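A perturbation-based sketch with a hypothetical generate() call and a toy typo injector; in practice the comparison would use a task metric rather than exact string equality:

    import random

    def generate(prompt: str) -> str:
        """Hypothetical stand-in for an LLM call."""
        return "The capital of France is Paris."

    def add_typos(text: str, rate: float = 0.05, seed: int = 0) -> str:
        """Randomly drop characters to simulate noisy input."""
        rng = random.Random(seed)
        return "".join(c for c in text if rng.random() > rate)

    prompt = "What is the capital of France?"
    clean_answer = generate(prompt)

    # Re-run the model on noisy variants and count unchanged answers.
    variants = [add_typos(prompt, seed=s) for s in range(10)]
    stable = sum(generate(v) == clean_answer for v in variants)
    print(f"Robustness: {stable / len(variants):.2f}")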
Instruction Following Score
Metric evaluating the ability of an LLM to accurately understand and execute complex, multi-step instructions provided in a prompt. This score is crucial for agent applications and conditional generation tasks.
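For instructions that are programmatically verifiable (word limits, required keywords, output format), the score can be approximated with simple rule checks, as in the sketch below; open-ended instructions usually require human or model-based judges:

    response = "Summary: revenues rose 12% while costs stayed flat."

    # Each check encodes one verifiable constraint from the prompt's instructions.
    checks = {
        "starts with 'Summary:'": response.startswith("Summary:"),
        "at most 20 words": len(response.split()) <= 20,
        "mentions revenues": "revenue" in response.lower(),
    }

    for name, passed in checks.items():
        print("PASS" if passed else "FAIL", name)
    print(f"Instruction-following score: {sum(checks.values()) / len(checks):.2f}")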