Alignment and Safety - AI Glossary

Constitutional AI

Alignment method in which a model is guided by a predefined set of principles (a constitution) and uses them to evaluate and revise its own responses.
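The self-evaluation and correction described in this entry can be sketched as a critique-and-revise loop. This is a minimal illustration, not a reference implementation: the `generate` callable stands in for any text-generation model, and the principles and prompt wording are hypothetical.

```python
from typing import Callable, Sequence

# Hypothetical principles, for illustration only.
CONSTITUTION: Sequence[str] = (
    "Do not provide instructions that could cause physical harm.",
    "Be honest about uncertainty.",
)

def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    principles: Sequence[str] = CONSTITUTION,
) -> str:
    """Draft a response, then critique and revise it once per principle."""
    response = generate(prompt)
    for principle in principles:
        # Ask the model to critique its own draft against one principle.
        critique = generate(
            f"Critique this response against the principle: {principle}\n"
            f"Response: {response}"
        )
        # Ask the model to rewrite the draft in light of that critique.
        response = generate(
            "Revise the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {response}"
        )
    return response
```

With two principles the loop calls `generate` five times: one draft, then one critique and one revision per principle.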

Red Teaming

Systematic process in which experts deliberately try to provoke undesirable or dangerous model behavior in order to find and fix weaknesses.
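Part of this process can be automated with a small harness that replays adversarial prompts and records the responses that get flagged. The sketch below assumes two hypothetical callables: `model`, which maps a prompt to a response, and `is_unsafe`, which stands in for whatever review or classifier a team actually uses.

```python
from typing import Callable, Iterable, List, Tuple

def red_team(
    prompts: Iterable[str],
    model: Callable[[str], str],
    is_unsafe: Callable[[str], bool],
) -> List[Tuple[str, str]]:
    """Return the (prompt, response) pairs whose response was flagged as unsafe."""
    failures = []
    for prompt in prompts:
        response = model(prompt)
        if is_unsafe(response):
            failures.append((prompt, response))
    return failures
```

The returned pairs are the weaknesses to triage; an empty list only means that none of the tested prompts triggered the checker.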

Safety Alignment

Set of techniques aimed at ensuring language models avoid generating harmful, dangerous, or inappropriate content while maintaining their overall performance.

Value Alignment

Process of aligning the objectives and behavior of AI systems with fundamental human values, which requires a nuanced understanding of human preferences and ethics.

Model Jailbreaking

Attack techniques designed to bypass a model's safety and alignment mechanisms, forcing it to generate content that is normally restricted or prohibited.

Reward Modeling

Approach in which a separate reward model learns to predict human preferences and then supplies the reward signal for reinforcement learning of the main language model.
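One common way to train such a reward model, assumed here rather than stated in this glossary, is a pairwise Bradley-Terry loss: the model is penalized whenever it scores the human-preferred response lower than the rejected one. A minimal sketch on scalar rewards:

```python
import math

def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response wins:
    -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss is log(2) when both rewards are equal, approaches zero as the chosen response is scored well above the rejected one, and grows roughly linearly when the ranking is inverted.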

Constitutional Principles

Set of explicitly defined fundamental rules and principles that guide AI model behavior, ensuring consistency and alignment with desired values.

Preference Learning

Area of machine learning in which models learn human preferences from comparisons between alternative options, so that their outputs can be aligned with those preferences.

Harmlessness Training

Training process that teaches models to avoid generating content that is potentially harmful, dangerous, or prejudicial to users.

Truthfulness Alignment

Alignment objective aimed at ensuring models provide factually correct information and avoid hallucinations or unverified claims.

Bias Mitigation

Set of techniques to identify, quantify, and reduce systemic biases in language models, ensuring fair and non-discriminatory representation.

Guardrails

Safety mechanisms built into AI systems that monitor and filter inputs and outputs in real time to prevent dangerous or inappropriate interactions.
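The input and output filtering described here can be sketched as a wrapper around a model call. This is a deliberately simple illustration: the blocklist terms, the refusal message, and the `model` callable are all hypothetical, and real guardrails typically use classifiers rather than substring matching.

```python
from typing import Callable

# Hypothetical placeholder terms; a real system would use a policy classifier.
BLOCKED_TERMS = ("forbidden_topic_a", "forbidden_topic_b")
REFUSAL = "Sorry, I can't help with that."

def violates_policy(text: str) -> bool:
    """Case-insensitive substring check against the blocklist."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)

def guarded(model: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a model so that both its input and its output are filtered."""
    def wrapper(prompt: str) -> str:
        if violates_policy(prompt):      # input check, before calling the model
            return REFUSAL
        response = model(prompt)
        if violates_policy(response):    # output check, before returning
            return REFUSAL
        return response
    return wrapper
```

Checking both sides matters: the input check saves a model call on clearly disallowed prompts, while the output check catches cases where a harmless-looking prompt still produces disallowed content.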

Constitutional Supervision

Supervision method in which models are guided by an explicit constitution, allowing them to critique and improve their own responses according to its principles.

Human Preference Data

Dataset of comparative human evaluations of different model responses, used as the basis for alignment training and optimization.
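A single record in such a dataset might look like the sketch below. The schema is hypothetical: the field names are chosen for illustration, and it assumes that both responses answer the same prompt and that the annotator picks exactly one of them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Comparison:
    """One human judgment between two model responses (hypothetical schema)."""
    prompt: str
    response_a: str
    response_b: str
    preferred: str  # "a" or "b", as chosen by the annotator

    def __post_init__(self) -> None:
        if self.preferred not in ("a", "b"):
            raise ValueError("preferred must be 'a' or 'b'")

    @property
    def chosen(self) -> str:
        """The response the annotator preferred."""
        return self.response_a if self.preferred == "a" else self.response_b

    @property
    def rejected(self) -> str:
        """The response the annotator did not prefer."""
        return self.response_b if self.preferred == "a" else self.response_a
```

Exposing `chosen` and `rejected` directly matches how pairwise training objectives usually consume the data, regardless of which position the preferred response was shown in.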

Safety Fine-tuning

Fine-tuning phase that follows pre-training and adjusts model behavior to comply with safety and ethical constraints.

Alignment Taxonomy

Structured classification of different types and dimensions of alignment in AI, including value alignment, safety, robustness, and model interpretability.
