AI Glossary
The Complete Dictionary of Artificial Intelligence
Containment
Mechanism designed to restrict or guide the output of an LLM to prevent the generation of unwanted, dangerous, or out-of-scope content.
Prompt Guardrails
Set of rules and filters applied upstream to user input to detect and block malicious or inappropriate requests, including those that attempt to bypass the model's security policies.
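As a minimal sketch of an upstream guardrail, the snippet below screens user input against a small deny-list before it would reach the model. The patterns and the function name `check_prompt` are illustrative assumptions; production systems typically rely on trained classifiers rather than regular expressions alone.

```python
import re

# Hypothetical deny-list patterns, chosen only for illustration.
BLOCK_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (your )?system prompt", re.IGNORECASE),
]

def check_prompt(prompt: str) -> bool:
    """Return True if the prompt may be forwarded to the model."""
    return not any(pattern.search(prompt) for pattern in BLOCK_PATTERNS)
```

A prompt for which `check_prompt` returns False would be rejected or rerouted before any generation takes place.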
Output Filtering
Post-generation security mechanism that analyzes the LLM's response to identify and remove prohibited content before it is presented to the user.
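A post-generation filter can be sketched as a redaction pass over the model's response. The example below assumes, purely for illustration, that the prohibited content is anything shaped like a US Social Security number; the pattern and the name `filter_output` are not taken from any particular library.

```python
import re

# Hypothetical rule: redact strings formatted like 123-45-6789.
PROHIBITED = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def filter_output(response: str) -> str:
    """Replace prohibited spans before the response is shown to the user."""
    return PROHIBITED.sub("[REDACTED]", response)
```

Responses without a match pass through unchanged, so the filter can be applied to every output.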
Jailbreaking
Set of adversarial prompting techniques aimed at bypassing an LLM's restrictions and security mechanisms to force it to produce responses it would normally refuse.
Safety Layer
Distinct software component, often a classification model, that intercepts LLM inputs and outputs to evaluate their compliance with security policies.
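The interception pattern can be sketched as a wrapper that checks both directions of traffic. Here `generate` stands in for the LLM call and `is_safe` for the classification model; both are hypothetical callables supplied by the caller, and the refusal text is a placeholder.

```python
from typing import Callable

REFUSAL = "Sorry, I can't help with that."  # placeholder refusal message

def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    is_safe: Callable[[str], bool],
) -> str:
    """Run the model only if the input passes, and return its output only if that passes too."""
    if not is_safe(prompt):
        return REFUSAL
    response = generate(prompt)
    return response if is_safe(response) else REFUSAL
```

Keeping the classifier separate from the model means the policy can be updated without retraining the LLM.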
Decoding Alignment
Strategy of modifying the decoding process (e.g., beam search, sampling) to penalize the generation of tokens or token sequences associated with unsafe content.
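One simple way to realize this idea, shown here for greedy decoding, is to subtract a penalty from the logits of tokens flagged as unsafe before choosing the next token. The function names and the set of unsafe token ids are assumptions made for the sketch; real decoders operate on tensors and often penalize whole sequences.

```python
def penalize_unsafe(logits, unsafe_ids, penalty=float("inf")):
    """Subtract `penalty` from the logit of every token id in `unsafe_ids`."""
    return [score - penalty if i in unsafe_ids else score
            for i, score in enumerate(logits)]

def greedy_next_token(logits, unsafe_ids):
    """Pick the highest-scoring token after unsafe tokens are penalized."""
    adjusted = penalize_unsafe(logits, unsafe_ids)
    return max(range(len(adjusted)), key=adjusted.__getitem__)
```

With the default infinite penalty the flagged tokens are effectively banned; a finite penalty merely makes them less likely.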
Self-Critique
Ability of an LLM to evaluate its own generated response against a set of predefined criteria (coherence, safety, accuracy) and revise it if necessary.
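The evaluate-then-revise cycle can be sketched as a short loop. The callables `generate`, `critique`, and `revise` are hypothetical stand-ins for three prompts to the same model, and `max_rounds` is an assumed cap that keeps the loop from running indefinitely.

```python
def self_critique(prompt, generate, critique, revise, max_rounds=2):
    """Draft a response, then revise it while the critique still reports issues."""
    response = generate(prompt)
    for _ in range(max_rounds):
        issues = critique(prompt, response)  # empty list means the draft passes
        if not issues:
            break
        response = revise(prompt, response, issues)
    return response
```

In this sketch the critique returns a list of issues, so an empty list signals that the response meets the predefined criteria.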
Adversarial Suffix
Optimized token sequence appended to a prompt to manipulate the LLM's behavior and force a specific output, often used in jailbreaking attacks.
Preference Modeling
Process of creating a reward model that learns human preferences from pairwise response comparisons, essential for RLHF.
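A common formulation for learning from pairwise comparisons, assumed here, is the Bradley-Terry style loss: the negative log-sigmoid of the difference between the reward assigned to the preferred response and the reward assigned to the rejected one. The sketch below computes that loss for a single pair of scalar rewards; the function name is illustrative.

```python
import math

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Return -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    d = r_chosen - r_rejected
    # softplus(-d) = max(-d, 0) + log(1 + exp(-|d|)) avoids overflow for large |d|
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))
```

The loss shrinks as the reward model ranks the preferred response further above the rejected one, and grows when the ranking is reversed.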
Refusal Training
Specialized training phase where the LLM learns to identify inappropriate requests and generate polite and informative refusal responses instead of attempting to answer.
Harmlessness Classification
Binary classification task to determine if an LLM output is 'harmless' or 'harmful', often implemented as a safety filter.
Sycophancy Mitigation
Set of techniques aimed at reducing an LLM's tendency to agree with incorrect user premises to please them, an undesirable behavior that compromises truthfulness.
Model Steering
Technique for dynamically adjusting an LLM's behavior during inference, often by modifying logits, to guide generation towards a desired and safe response space.
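As a rough illustration of logit-level steering, the sketch below adds a scaled bias vector to the logits and converts the result to probabilities. The bias values, the `strength` parameter, and the function names are assumptions for the example; in practice the bias would come from a policy or auxiliary model.

```python
import math

def softmax(logits):
    """Convert logits to probabilities, shifting by the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def steer(logits, bias, strength=1.0):
    """Add a scaled per-token bias to the logits at inference time."""
    return [x + strength * b for x, b in zip(logits, bias)]
```

Unlike a hard ban on tokens, this shifts probability mass gradually: a positive bias makes a token more likely, a negative bias less likely, and `strength=0.0` leaves the distribution untouched.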