Fine-tuning
DPO (Direct Preference Optimization)
An alternative to RLHF that optimizes the model directly on human preference data with a simple classification-style loss over chosen/rejected response pairs, eliminating the intermediate reward model and the RL training loop and thereby simplifying the alignment process.
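A minimal sketch of the DPO loss for a single preference pair, assuming the summed log-probabilities of the chosen and rejected responses under the trainable policy and the frozen reference model have already been computed (the argument names and example values here are illustrative):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) preference pair.

    Each argument is the summed log-probability of a response under
    the trainable policy or the frozen reference model; beta scales
    how strongly the policy may deviate from the reference.
    """
    # Implicit reward of each response: how much more (in log-prob)
    # the policy favors it compared to the reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Negative log-sigmoid of the reward margin (Bradley-Terry model):
    # the loss shrinks as the policy prefers the chosen response more.
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# With a zero margin the loss is log(2); a policy that already favors
# the chosen response yields a smaller loss.
loss = dpo_loss(-10.0, -14.0, -11.0, -13.0, beta=0.1)
```

In practice the log-probabilities come from two forward passes of the language model (policy and reference) over a batch, and the loss is averaged across pairs, but the per-pair computation is exactly this margin-then-sigmoid form.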