Text classification

NLP task of automatically assigning a text document to one or more predefined categories based on its content.

Binary classification

Type of classification where the model must choose between two mutually exclusive classes, usually represented as positive/negative or 0/1.

Multi-class classification

Classification problem where each instance is assigned to exactly one of three or more mutually exclusive classes.

Multi-label classification

Variant of classification where a document can be simultaneously associated with multiple non-exclusive labels or categories.

Naive Bayes

Probabilistic classification algorithm based on Bayes' theorem that assumes the features are conditionally independent given the class.
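
As an illustration, here is a minimal sketch of a multinomial Naive Bayes text classifier with add-one (Laplace) smoothing. The function names and the three-document training set are made up for this example; it shows one common variant, not a reference implementation.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs: list of (tokens, label) pairs. Returns the counts needed for prediction."""
    class_counts = Counter()             # number of documents per class
    word_counts = defaultdict(Counter)   # token counts per class
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_naive_bayes(tokens, class_counts, word_counts, vocab):
    """Return the class with the highest log posterior (up to a constant)."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        log_prob = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in tokens:
            # Independence assumption: per-token log likelihoods are simply added.
            smoothed = (word_counts[label][token] + 1) / (total_words + len(vocab))
            log_prob += math.log(smoothed)
        scores[label] = log_prob
    return max(scores, key=scores.get)

# Hypothetical toy training set.
training_docs = [
    (["great", "film"], "pos"),
    (["awful", "film"], "neg"),
    (["great", "acting"], "pos"),
]
model = train_naive_bayes(training_docs)
print(predict_naive_bayes(["great"], *model))  # pos
print(predict_naive_bayes(["awful"], *model))  # neg
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.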

SVM (Support Vector Machine)

Supervised learning algorithm that finds the hyperplane separating the classes with the largest margin in a (possibly high-dimensional) feature space.

Bag-of-Words

Text representation that counts word occurrences without considering their order or grammatical context.
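
A minimal sketch of the idea, assuming lowercase whitespace tokenization (a simplification chosen for this example): two sentences with the same words in a different order get the same representation.

```python
from collections import Counter

def bag_of_words(text):
    """Count lowercase whitespace-separated tokens, ignoring their order."""
    return Counter(text.lower().split())

doc_a = bag_of_words("the cat chased the dog")
doc_b = bag_of_words("the dog chased the cat")

print(doc_a["the"])    # 2
print(doc_a == doc_b)  # True: word order is lost
```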

TF-IDF

Statistical weight measuring the importance of a word in a document relative to a corpus, computed as the product of term frequency and inverse document frequency.
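
Several TF-IDF variants exist. The sketch below assumes one simple formulation, with term frequency as count divided by document length and inverse document frequency as log(N / document frequency); libraries often apply additional smoothing or normalization. The corpus is a made-up example.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """corpus: list of tokenized documents. Returns one {term: weight} dict per document."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus: "movie" occurs in every document.
corpus = [["good", "movie"], ["bad", "movie"], ["long", "movie"]]
weights = tf_idf(corpus)

print(weights[0]["movie"])  # 0.0, a term present in every document carries no weight
print(weights[0]["good"] > weights[0]["movie"])  # True
```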

Word Embeddings

Dense vector representations of words in a continuous space where semantically similar words lie close together.
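
Closeness between embeddings is commonly measured with cosine similarity. The sketch below uses hand-picked three-dimensional toy vectors purely for illustration; real embeddings are learned from data and typically have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors, not taken from any trained model.
toy_embeddings = {
    "king": [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.10, 0.20, 0.90],
}

sim_related = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_unrelated = cosine_similarity(toy_embeddings["king"], toy_embeddings["apple"])
print(sim_related > sim_unrelated)  # True
```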

Transformers

Neural network architecture based on attention mechanisms that captures long-range dependencies in sequences.

Confusion Matrix

Table summarizing classifier performance by counting, for each pair of true and predicted classes, the instances that fall into it.
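
A minimal sketch of how such a table can be built, assuming the convention that rows are true classes and columns are predicted classes (some tools use the transpose). The labels are hypothetical.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Return a matrix where entry [i][j] counts items of true class i predicted as class j."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(true, pred)] for pred in labels] for true in labels]

y_true = ["spam", "spam", "ham", "ham", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "spam"]

matrix = confusion_matrix(y_true, y_pred, ["spam", "ham"])
print(matrix)  # [[1, 1], [1, 2]]
```

The diagonal holds the correct predictions; off-diagonal cells show which classes are confused with which.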

Cross-validation

Evaluation technique that divides the data into subsets (folds) and trains and tests the model several times, each time holding out a different partition.
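
The partitioning can be sketched as k-fold index splitting. This example assumes contiguous, unshuffled folds for readability; in practice the data is usually shuffled (and often stratified by class) first. The function name is illustrative.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop

folds = list(k_fold_indices(6, 3))
for train, test in folds:
    print(train, test)
# [2, 3, 4, 5] [0, 1]
# [0, 1, 4, 5] [2, 3]
# [0, 1, 2, 3] [4, 5]
```

Each sample appears in exactly one test fold, so every data point is used for evaluation once.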

Precision

Metric measuring the proportion of correct positive predictions among all positive predictions made by the model.

Recall

Metric measuring the proportion of actual positive instances in the dataset that the model correctly identifies.

F1 Score

Harmonic mean of precision and recall, providing a single balanced measure of classification performance.
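
Precision, recall, and F1 can all be computed from counts of true positives, false positives, and false negatives. The sketch below assumes binary labels with 1 as the positive class and returns 0.0 when a denominator would be zero (a convention chosen for this example); the data is made up.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.5 0.571
```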

Overfitting

Phenomenon where the model fits the training data too closely and generalizes poorly to new, unseen data.

Tokenization

Process of segmenting text into elementary units (tokens) such as words, subwords, or characters for analysis.
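
A minimal sketch of word-level and character-level tokenization using a regular expression. The pattern is a simplifying assumption for this example; subword tokenizers such as BPE require a learned vocabulary and are not shown.

```python
import re

def word_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text):
    """Split text into individual characters."""
    return list(text)

word_tokens = word_tokenize("Text classification is fun!")
print(word_tokens)           # ['Text', 'classification', 'is', 'fun', '!']
print(char_tokenize("fun"))  # ['f', 'u', 'n']
```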

Stemming

Text normalization technique that reduces words to a stem by heuristically stripping affixes (mostly suffixes); the resulting stem is not necessarily a valid word.
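
To illustrate the idea, here is a deliberately naive suffix stripper. The suffix list and minimum stem length are arbitrary choices for this sketch; it is not the Porter algorithm or any other established stemmer.

```python
# Hypothetical suffix list, checked in order (longer suffixes first).
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(naive_stem("jumping"))     # jump
print(naive_stem("classified"))  # classifi
print(naive_stem("classifies"))  # classifi
```

Related word forms collapse to the same stem, even though "classifi" is not itself a word.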

Lemmatization

Linguistic process that reduces words to their canonical form (lemma) using morphological analysis and a dictionary.

AI Glossary