Alignment and Safety - Bảng thuật ngữ Trí tuệ nhân tạo

📖

thuật ngữ

Constitutional AI

Alignment methodology where models follow a predefined set of principles or constitution, allowing them to self-evaluate and correct their responses according to these ethical rules.

📖

thuật ngữ

Red Teaming

Systematic process of evaluating model vulnerabilities by experts actively seeking to provoke undesirable or dangerous behaviors to identify and correct weaknesses.

📖

thuật ngữ

Safety Alignment

Set of techniques aimed at ensuring language models avoid generating harmful, dangerous, or inappropriate content while maintaining their overall performance.

📖

thuật ngữ

Value Alignment

Process aimed at aligning the objectives and behaviors of AI systems with fundamental human values, requiring a nuanced understanding of human preferences and ethics.

📖

thuật ngữ

Model Jailbreaking

Attack techniques designed to bypass model safety and alignment mechanisms, forcing them to generate normally restricted or prohibited content.

📖

thuật ngữ

Reward Modeling

Approach where a reward model learns to predict human preferences, serving as a guide for reinforcement learning of main language models.

📖

thuật ngữ

Constitutional Principles

Set of explicitly defined fundamental rules and principles that guide AI model behavior, ensuring consistency and alignment with desired values.

📖

thuật ngữ

Preference Learning

Machine learning domain where models learn from comparisons between different options to capture human preferences and align with them.

📖

thuật ngữ

Harmlessness Training

Specific training process aimed at teaching models to avoid generating potentially harmful, dangerous, or prejudicial content for users.

📖

thuật ngữ

Truthfulness Alignment

Alignment objective aimed at ensuring models provide factually correct information and avoid hallucinations or unverified claims.

📖

thuật ngữ

Bias Mitigation

Set of techniques to identify, quantify, and reduce systemic biases in language models, ensuring fair and non-discriminatory representation.

📖

thuật ngữ

Guardrails

Safety mechanisms implemented in AI systems to monitor and filter inputs/outputs, preventing dangerous or inappropriate interactions in real-time.

📖

thuật ngữ

Constitutional Supervision

Supervision method where models are guided by an explicit constitution, allowing them to self-criticize and improve their responses according to these guiding principles.

📖

thuật ngữ

Human Preference Data

Dataset collected from comparative human evaluations between different model responses, serving as a basis for alignment training and optimization.

📖

thuật ngữ

Safety Fine-tuning

Specific refinement phase after initial pre-training, aimed at finely adjusting model behaviors to comply with safety and ethical constraints.

📖

thuật ngữ

Alignment Taxonomy

Structured classification of different types and dimensions of alignment in AI, including value alignment, safety, robustness, and model interpretability.

Thuật ngữ AI

Constitutional AI

Red Teaming

Safety Alignment

Value Alignment

Model Jailbreaking

Reward Modeling

Constitutional Principles

Preference Learning

Harmlessness Training

Truthfulness Alignment

Bias Mitigation

Guardrails

Constitutional Supervision

Human Preference Data

Safety Fine-tuning

Alignment Taxonomy

Không tìm thấy kết quả