Dialogue Evaluation and Metrics - AI Glossary

📖

Term

METEOR

Evaluation metric combining unigram precision and recall, where words may match by exact form, stem, or synonym, with a penalty for fragmented word order. Designed to correlate better with human judgments than BLEU, and also applied to dialogue responses.
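As a rough illustration, the score can be written out using the constants of the original 2005 formulation (later METEOR versions make these weights tunable). The symbols are chosen here for illustration: m is the number of matched unigrams, w_h and w_r are the unigram counts of the system response and the reference, and c is the number of contiguous matched chunks.

```latex
\[
\begin{aligned}
P &= \frac{m}{w_h}, \quad R = \frac{m}{w_r}, \quad F_{\text{mean}} = \frac{10\,P\,R}{R + 9\,P} \\
\text{Penalty} &= 0.5 \left( \frac{c}{m} \right)^{3}, \quad \text{METEOR} = F_{\text{mean}} \, (1 - \text{Penalty})
\end{aligned}
\]
```

Fewer, longer chunks mean the matched words appear in a similar order in both texts, so the penalty shrinks.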

📖

Term

Metric evaluating the logical and thematic coherence of a response relative to the previous conversational context. Measures the system's ability to maintain a consistent narrative thread throughout the dialogue.

📖

Term

Engagement Rate

Indicator quantifying a conversational system's ability to maintain user interest and participation. Typically estimated from conversation duration and the number of turns exchanged.
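A minimal sketch of how these two signals might be summarized over a set of logged conversations. The `Conversation` record and the `engagement_summary` function are hypothetical names introduced for illustration; real systems often define engagement differently.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Conversation:
    """Hypothetical log record for one conversation."""
    duration_seconds: float
    turns: int


def engagement_summary(conversations):
    """Average duration and turn count across conversations."""
    if not conversations:
        raise ValueError("at least one conversation is required")
    return {
        "mean_duration_seconds": mean(c.duration_seconds for c in conversations),
        "mean_turns": mean(c.turns for c in conversations),
    }


logs = [Conversation(120.0, 8), Conversation(60.0, 4)]
print(engagement_summary(logs))
```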

📖

Term

Task Success Rate

Metric measuring the percentage of dialogues where the user's objective was successfully achieved. Essential for evaluating the effectiveness of task-oriented conversational agents.
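The percentage can be sketched as follows, assuming each dialogue has already been labeled as successful or not (how success is judged is outside this example). The function name `task_success_rate` is illustrative.

```python
def task_success_rate(outcomes):
    """Percentage of dialogues in which the user's goal was achieved.

    `outcomes` is a list of booleans, one per dialogue.
    """
    if not outcomes:
        raise ValueError("at least one dialogue outcome is required")
    successes = sum(1 for ok in outcomes if ok)
    return 100.0 * successes / len(outcomes)


print(task_success_rate([True, False, True, True]))  # 75.0
```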

📖

Term

F1 Score Dialogue

Harmonic mean of precision and recall, adapted to dialogue contexts to evaluate response relevance. Particularly useful for response retrieval systems.
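One common way to apply this to dialogue is a token-overlap F1 between a system response and a reference response. The sketch below assumes whitespace tokenization and lowercasing, which is a simplification; `token_f1` is an illustrative name.

```python
from collections import Counter


def token_f1(response, reference):
    """Token-overlap F1 between a system response and a reference."""
    resp_tokens = response.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(resp_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(resp_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1("the train leaves at noon", "the train departs at noon")
print(round(score, 2))  # 0.8
```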

📖

Term

Dialogue Act Classification

Process of automatically identifying the communicative intention behind each utterance in a dialogue. Crucial for evaluating the relevance and contextual appropriateness of system responses.

📖

Term

Response Diversity

Metric measuring the variety and originality of responses generated by a conversational system. Helps detect repetitive responses, which can erode user interest over the long term.
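A widely used instance of such a metric is distinct-n: the ratio of unique n-grams to total n-grams across a set of responses. The sketch below assumes whitespace tokenization; it is one possible diversity measure, not the only one.

```python
def distinct_n(responses, n=2):
    """Ratio of unique n-grams to total n-grams over all responses."""
    total = 0
    unique = set()
    for response in responses:
        tokens = response.lower().split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


responses = ["i do not know", "i do not know", "let me check"]
print(round(distinct_n(responses, n=1), 3))  # 0.636
print(round(distinct_n(responses, n=2), 3))  # 0.625
```

Values close to 1 indicate that few n-grams are repeated, while values close to 0 indicate heavy repetition.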

📖

Term

Error Recovery Rate

Indicator evaluating the system's ability to recover from errors or misunderstandings in the dialogue. Measures the robustness and resilience of the conversational system in the face of unexpected events.

📖

Term

User Satisfaction Score

Subjective metric collected from users to evaluate their overall satisfaction after a conversational interaction. Often gathered with Likert scales or explicit ratings.

📖

Term

Contextual Consistency

Measure of the temporal and factual consistency of information provided throughout a conversation. Helps detect contradictions and supports the reliability of exchanges over time.

📖

Term

Turn-level Evaluation

Evaluation approach analyzing the quality of each individual exchange in a dialogue independently of others. Allows precise identification of system strengths and weaknesses.

📖

Term

Dialogue-level Evaluation

Evaluation method considering the conversation as a whole to judge the overall quality of the interaction. Takes into account narrative consistency and natural dialogue progression.

📖

Term

Automatic Evaluation Metrics

Set of algorithmic indicators allowing objective evaluation of dialogue quality without direct human intervention. Complements subjective evaluation for a more comprehensive analysis.

📖

Term

Human Evaluation Protocols

Standardized methodologies for subjective evaluation of conversational systems by human judges. Include predefined criteria, rating scales, and quality control procedures.

📖

Term

NDCG (Normalized Discounted Cumulative Gain)

Metric evaluating the quality of a ranking of candidate responses by considering each candidate's position and graded relevance. Particularly useful for systems generating multiple response options.
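A minimal sketch of the computation, assuming the linear-gain variant (relevance divided by a logarithmic position discount); some implementations use an exponential gain instead. The input is the list of relevance grades in the order the system ranked the candidates.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances, k=None):
    """DCG of the system ranking divided by DCG of the ideal ranking."""
    ranked = relevances[:k] if k is not None else relevances
    ideal = sorted(relevances, reverse=True)
    ideal = ideal[:k] if k is not None else ideal
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


# The most relevant candidate (grade 3) was ranked second.
print(round(ndcg([1, 3, 0]), 3))  # 0.797
```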
