
AI Glossary

The complete glossary of Artificial Intelligence

162 categories
2,032 subcategories
23,060 terms

Unigram Language Model Tokenization

Subword tokenization method that starts from a large candidate vocabulary and iteratively prunes it, removing the subwords whose loss least reduces the likelihood of the training corpus under a unigram language model, until a target vocabulary size is reached.
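
The likelihood mentioned above depends on how a string is segmented under the current vocabulary. The following is a minimal sketch of that segmentation step only (a Viterbi search for the most probable split), not of the full pruning loop. The probabilities in `toy_probs` and the function name `viterbi_segment` are illustrative assumptions.

```python
import math

def viterbi_segment(text, logp):
    """Return the most probable segmentation of `text` and its log-probability,
    given per-piece log-probabilities `logp` (a unigram model)."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-prob of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    pieces, i = [], n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1], best[n]

# Hypothetical piece probabilities, chosen only for illustration.
toy_probs = {"h": 0.05, "u": 0.05, "g": 0.05, "s": 0.1,
             "hu": 0.1, "ug": 0.15, "gs": 0.2, "hug": 0.3}
toy_logp = {piece: math.log(p) for piece, p in toy_probs.items()}

pieces, score = viterbi_segment("hugs", toy_logp)
print(pieces)  # ['hug', 's']
```

In a full trainer, a score of this kind would be summed over the corpus, and the pieces whose removal lowers the total the least would be dropped first.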

Vocabulary

Fixed, predefined set of all unique tokens that a language model can recognize and process. Its size determines the number of rows in the embedding matrix and trades coverage of the input text against memory and compute cost.

Special Token

Predefined token with a specific structural function, such as [CLS] for classification, [SEP] for separating segments, or [PAD] for padding sequences to a common length, used to structure model inputs.
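
As a rough illustration of how these tokens structure an input, the sketch below assembles a BERT-style sentence pair by hand. The helper name `build_input` and the accompanying mask are assumptions made for this example, not the API of any particular library.

```python
def build_input(tokens_a, tokens_b, max_len):
    """Wrap two token lists with [CLS]/[SEP] and pad to `max_len`.

    Returns the padded token list and a mask with 1 for real tokens
    and 0 for [PAD] positions.
    """
    seq = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(seq) > max_len:
        raise ValueError("sequence is longer than max_len")
    n_pad = max_len - len(seq)
    mask = [1] * len(seq) + [0] * n_pad
    return seq + ["[PAD]"] * n_pad, mask

tokens, mask = build_input(["how", "are", "you"], ["fine"], max_len=8)
print(tokens)
# ['[CLS]', 'how', 'are', 'you', '[SEP]', 'fine', '[SEP]', '[PAD]']
print(mask)
# [1, 1, 1, 1, 1, 1, 1, 0]
```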

Embedding Matrix

Learned weight matrix in which each row is the dense vector representation of one vocabulary token. It acts as a lookup table that maps token identifiers to vectors.
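
The lookup can be pictured with a small sketch in plain Python. The toy vocabulary, the dimension of 4, and the random values standing in for learned weights are all assumptions for illustration.

```python
import random

# Toy vocabulary mapping tokens to row indices (hypothetical).
vocab = {"[PAD]": 0, "the": 1, "cat": 2, "sat": 3}
dim = 4

# One row per token. Random values stand in for weights that a real
# model would learn during training.
rng = random.Random(0)
embedding = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in vocab]

def embed(token_ids):
    """Map a list of token ids to their row vectors."""
    return [embedding[i] for i in token_ids]

ids = [vocab[t] for t in ["the", "cat", "sat"]]
vectors = embed(ids)
print(len(vectors), len(vectors[0]))  # 3 4
```

Selecting row i is equivalent to multiplying a one-hot vector for token i by the matrix, which is why the matrix is sometimes described as a projection layer.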

Subword Tokenization

Tokenization strategy that divides words into smaller units (subwords), so that a finite vocabulary can represent an open-ended set of words, including neologisms and typos.
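
One common way to apply a subword vocabulary is greedy longest-match-first splitting, in the style of WordPiece inference. The sketch below assumes a tiny hand-written vocabulary and the "##" continuation prefix; both are illustrative choices, not part of the definition above.

```python
def greedy_subword(word, vocab, unk="[UNK]"):
    """Split `word` into the longest matching vocabulary pieces, left to right.

    Pieces that do not start the word carry a '##' prefix.
    Returns [unk] if some position cannot be matched.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical vocabulary for illustration.
vocab = {"token", "##ization", "##ize", "##s", "un"}

print(greedy_subword("tokenization", vocab))  # ['token', '##ization']
print(greedy_subword("tokenizes", vocab))     # ['token', '##ize', '##s']
```

Neither "tokenization" nor "tokenizes" is in the vocabulary as a whole word, yet both are represented from known pieces.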

Character-level Tokenization

Fine-grained tokenization approach where each character becomes a token, eliminating the out-of-vocabulary word problem but producing much longer sequences and therefore a higher computational cost.
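
The trade-off is easy to see on a short string. This is a minimal sketch with an arbitrary example sentence.

```python
text = "tokenization is fun"

char_tokens = list(text)     # one token per character, spaces included
word_tokens = text.split()   # one token per whitespace-separated word
char_vocab = sorted(set(char_tokens))

print(len(char_tokens))  # 19
print(len(word_tokens))  # 3
print(len(char_vocab))   # 12
```

The character vocabulary stays tiny, but the sequence is several times longer than its word-level counterpart.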

Word-level Tokenization

Segmentation method where each whole word, delimited by spaces or punctuation, is treated as a single token. It is simple but vulnerable to the out-of-vocabulary (OOV) problem: any word absent from the vocabulary cannot be represented.
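
The OOV problem can be reproduced in a few lines. In this sketch the two-sentence training corpus, the lowercasing step, and the `[UNK]` fallback id are assumptions chosen for illustration.

```python
import re

def word_tokenize(text):
    """Lowercase, then split into words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(corpus):
    """Assign an id to every distinct word seen in the corpus."""
    vocab = {"[UNK]": 0}
    for sentence in corpus:
        for word in word_tokenize(sentence):
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(text, vocab):
    """Map each word to its id, falling back to [UNK] for unseen words."""
    return [vocab.get(word, vocab["[UNK]"]) for word in word_tokenize(text)]

vocab = build_vocab(["the cat sat.", "the dog ran."])
ids = encode("The cat barked.", vocab)
print(ids)  # [1, 2, 0, 4]
```

The unseen word "barked" collapses to id 0, so whatever it meant is lost to the model.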

Tokenization Method

Specific set of rules and algorithms (e.g., BPE, WordPiece) that define how raw text is split into tokens, directly influencing model performance and robustness.

Whitespace Tokenization

Simple tokenization technique that segments text based solely on whitespace characters, often used as a first step before more sophisticated methods.

Regular Expression Tokenization (Regex Tokenization)

Segmentation method that uses regular expression patterns to define complex tokenization rules, allowing for controlled separation of words, punctuation, and other symbols.
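
A short comparison with plain whitespace splitting shows what the extra control buys. The pattern below is one possible rule set (decimal numbers, words with an optional apostrophe suffix, single punctuation marks), written for this example only.

```python
import re

# Alternatives are tried left to right:
#   1. integers or decimals, e.g. 3.14
#   2. words, optionally with an apostrophe part, e.g. don't
#   3. any single character that is neither a word character nor whitespace
PATTERN = re.compile(r"\d+(?:\.\d+)?|\w+(?:'\w+)?|[^\w\s]")

def regex_tokenize(text):
    return PATTERN.findall(text)

text = "Don't panic, it's 3.14!"

print(text.split())
# ["Don't", 'panic,', "it's", '3.14!']
print(regex_tokenize(text))
# ["Don't", 'panic', ',', "it's", '3.14', '!']
```

Whitespace splitting leaves punctuation glued to words, while the regex separates it and still keeps the decimal number and the contractions intact.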

SentencePiece Tokenization

Tokenizer implementation that treats raw text as a stream of Unicode characters, whitespace included, and applies a subword algorithm (such as BPE or unigram) to build a language-independent vocabulary from which the original text can be losslessly decoded.

Character Pair Encoding

BPE variant that operates on characters rather than bytes: starting from single characters, it repeatedly merges the most frequent adjacent pair of symbols to build a subword vocabulary.
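
The merge loop can be sketched as follows. The word frequencies are a toy corpus, the function name `learn_merges` is hypothetical, and ties between equally frequent pairs are resolved in favor of the pair counted first, which is an arbitrary choice of this sketch.

```python
from collections import Counter

def learn_merges(word_freqs, num_merges):
    """Learn `num_merges` character-pair merges from a word-frequency dict.

    Returns the ordered list of merged pairs and the final segmented corpus.
    """
    # Each word starts as a tuple of single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # first-counted pair wins ties
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_corpus = {}
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_corpus[key] = new_corpus.get(key, 0) + freq
        corpus = new_corpus
    return merges, corpus

word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, corpus = learn_merges(word_freqs, num_merges=3)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
```

After three merges the suffix "est" has become a single symbol shared by "newest" and "widest".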

N-gram Tokenization

Approach that segments text into contiguous sequences of n items (characters or words), capturing local context but suffering from combinatorial growth of the vocabulary as n increases.
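
A minimal sketch of n-gram extraction, usable for characters or words alike. The helper name `ngrams` is chosen for this example.

```python
def ngrams(items, n):
    """Return all contiguous length-n windows of `items` as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

char_trigrams = ngrams("token", 3)
word_bigrams = ngrams("the cat sat down".split(), 2)

print(char_trigrams)
# [('t', 'o', 'k'), ('o', 'k', 'e'), ('k', 'e', 'n')]
print(word_bigrams)
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'down')]

# Upper bound on distinct character trigrams over a 26-letter alphabet.
print(26 ** 3)  # 17576
```

The last line hints at the vocabulary explosion: the number of possible n-grams grows as the alphabet size raised to the power n.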
