Alignment and Safety - AI Glossarium

📖

termen

Constitutional AI

Alignment methodology where models follow a predefined set of principles or constitution, allowing them to self-evaluate and correct their responses according to these ethical rules.

📖

termen

Red Teaming

Systematic process of evaluating model vulnerabilities by experts actively seeking to provoke undesirable or dangerous behaviors to identify and correct weaknesses.

📖

termen

Safety Alignment

Set of techniques aimed at ensuring language models avoid generating harmful, dangerous, or inappropriate content while maintaining their overall performance.

📖

termen

Value Alignment

Process aimed at aligning the objectives and behaviors of AI systems with fundamental human values, requiring a nuanced understanding of human preferences and ethics.

📖

termen

Model Jailbreaking

Attack techniques designed to bypass model safety and alignment mechanisms, forcing them to generate normally restricted or prohibited content.

📖

termen

Reward Modeling

Approach where a reward model learns to predict human preferences, serving as a guide for reinforcement learning of main language models.

📖

termen

Constitutional Principles

Set of explicitly defined fundamental rules and principles that guide AI model behavior, ensuring consistency and alignment with desired values.

📖

termen

Preference Learning

Machine learning domain where models learn from comparisons between different options to capture human preferences and align with them.

📖

termen

Harmlessness Training

Specific training process aimed at teaching models to avoid generating potentially harmful, dangerous, or prejudicial content for users.

📖

termen

Truthfulness Alignment

Alignment objective aimed at ensuring models provide factually correct information and avoid hallucinations or unverified claims.

📖

termen

Bias Mitigation

Set of techniques to identify, quantify, and reduce systemic biases in language models, ensuring fair and non-discriminatory representation.

📖

termen

Guardrails

Safety mechanisms implemented in AI systems to monitor and filter inputs/outputs, preventing dangerous or inappropriate interactions in real-time.

📖

termen

Constitutional Supervision

Supervision method where models are guided by an explicit constitution, allowing them to self-criticize and improve their responses according to these guiding principles.

📖

termen

Human Preference Data

Dataset collected from comparative human evaluations between different model responses, serving as a basis for alignment training and optimization.

📖

termen

Safety Fine-tuning

Specific refinement phase after initial pre-training, aimed at finely adjusting model behaviors to comply with safety and ethical constraints.

📖

termen

Alignment Taxonomy

Structured classification of different types and dimensions of alignment in AI, including value alignment, safety, robustness, and model interpretability.

AI-woordenlijst

Constitutional AI

Red Teaming

Safety Alignment

Value Alignment

Model Jailbreaking

Reward Modeling

Constitutional Principles

Preference Learning

Harmlessness Training

Truthfulness Alignment

Bias Mitigation

Guardrails

Constitutional Supervision

Human Preference Data

Safety Fine-tuning

Alignment Taxonomy

Geen resultaten gevonden