AI Glossary

The Complete Artificial Intelligence Dictionary

162 Categories · 2,032 Subcategories · 23,060 Terms
📖 Term

Byte Pair Encoding (BPE)

A data compression algorithm adapted for tokenization that iteratively merges the most frequent character pairs to create an optimized subword vocabulary.
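
A minimal sketch of the training loop in Python: count adjacent symbol pairs, merge the most frequent pair everywhere, and repeat. The toy corpus and merge count are illustrative.

```python
from collections import Counter

def apply_merge(word, pair):
    """Rewrite one space-separated symbol sequence with `pair` merged."""
    symbols, out, i = word.split(), [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return " ".join(out)

def bpe_train(corpus, num_merges):
    """Learn BPE merge rules from {symbol-sequence: frequency} counts."""
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in corpus.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        corpus = {apply_merge(w, best): f for w, f in corpus.items()}
        merges.append(best)
    return merges

corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
print(bpe_train(corpus, 3))  # [('e', 's'), ('es', 't'), ('l', 'o')]
```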

📖 Term

WordPiece

A variant of BPE developed by Google that selects merges by maximizing the likelihood of the training corpus rather than raw pair frequency, notably used in BERT models and their variants.
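
A sketch of the scoring difference, assuming the commonly cited approximation freq(ab) / (freq(a) * freq(b)) for the likelihood gain of a merge; the frequencies below are made up.

```python
def wordpiece_merge_score(pair_freq, left_freq, right_freq):
    # BPE merges the most frequent pair; WordPiece instead merges the pair
    # with the highest likelihood gain, commonly approximated as
    # freq(ab) / (freq(a) * freq(b)).
    return pair_freq / (left_freq * right_freq)

# A frequent pair made of very common parts can score lower than a rarer
# pair whose parts almost always occur together.
print(wordpiece_merge_score(120, 300, 400))  # 0.001
print(wordpiece_merge_score(50, 60, 70))     # ~0.0119 -> preferred
```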

📖 Term

Unigram Language Model

A tokenization approach based on a unigram language model that selects the best segmentation by maximizing the product of token probabilities over the sequence.
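
A minimal Viterbi-style sketch: given per-token log-probabilities, dynamic programming finds the segmentation with the highest total score. The toy vocabulary is illustrative.

```python
import math

def unigram_segment(text, logp, max_len=10):
    """Find the segmentation maximizing the sum of token log-probabilities
    (equivalently, the product of token probabilities)."""
    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i] = best score for text[:i]
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = text[j:i]
            if piece in logp and best[j] + logp[piece] > best[i]:
                best[i] = best[j] + logp[piece]
                back[i] = j
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]

logp = {"un": -2.0, "happy": -3.0, "u": -4.0, "n": -4.0, "unhappy": -6.5}
print(unigram_segment("unhappy", logp))  # ['un', 'happy'] (-5.0 beats -6.5)
```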

📖 Term

SentencePiece

A language-independent tokenization library that treats text as a raw Unicode sequence, eliminating the need for language-specific preprocessing.
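
A minimal usage sketch with the sentencepiece Python package, assuming a local corpus.txt; the model prefix and vocabulary size are arbitrary.

```python
import sentencepiece as spm

# Train directly on raw text; no language-specific pre-tokenization needed.
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="m", vocab_size=8000, model_type="unigram"
)

sp = spm.SentencePieceProcessor(model_file="m.model")
print(sp.encode("Hello world", out_type=str))  # e.g. ['▁Hello', '▁world']
```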

📖 Term

Vocabulary Size

A critical parameter determining the total number of unique tokens in a model's vocabulary, directly influencing model size and its ability to handle linguistic diversity.
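
A back-of-the-envelope illustration of the size impact: the embedding table alone costs vocab_size × hidden_dim parameters. The numbers below are hypothetical.

```python
# Embedding-table cost grows linearly with vocabulary size.
vocab_size, hidden_dim = 50_000, 4_096
print(f"{vocab_size * hidden_dim:,} embedding parameters")  # 204,800,000
```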

📖 Term

Special Tokens

Reserved tokens like [CLS], [SEP], [MASK], [PAD] used to delimit sequences, mask elements, or pad batches to a uniform length.
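
A sketch using Hugging Face transformers (assumes the library is installed and the checkpoint can be downloaded): [CLS] and [SEP] delimit each sequence, and [PAD] equalizes lengths within a batch.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tok(["short", "a slightly longer input"], padding=True)
print(tok.convert_ids_to_tokens(enc["input_ids"][0]))
# e.g. ['[CLS]', 'short', '[SEP]', '[PAD]', '[PAD]', '[PAD]']
```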

📖 Term

Tokenizer Training

The process of learning a vocabulary and segmentation rules from a text corpus, optimizing the representation for a specific task or domain.
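
A minimal training sketch with the Hugging Face tokenizers library; the tiny in-memory corpus and vocab_size are illustrative.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

corpus = [
    "the tokenizer learns merges from data",
    "segmentation rules are induced, not hand-written",
]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)  # learn vocab + merges
print(tokenizer.encode("tokenizer rules").tokens)
```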

📖 Term

Subword Regularization

A data augmentation technique that exposes the model to different valid segmentations of the same text during training, improving robustness and generalization.
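
A sketch using sentencepiece's sampling mode, assuming the unigram model trained in the SentencePiece example above; the alpha and nbest_size values follow the library's documented sampling interface.

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="m.model")
for _ in range(3):
    # enable_sampling draws a segmentation from the n-best candidates
    # instead of always returning the single most probable one.
    print(sp.encode("unhappiness", out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))
```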

📖 Term

Vocabulary Truncation

The process of limiting the vocabulary to the N most frequent tokens, replacing rarer tokens with subwords or an [UNK] token to keep computation efficient.
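
A minimal pure-Python sketch: keep the N most frequent tokens and map everything else to [UNK]. The counts are made up.

```python
from collections import Counter

def truncate_vocab(token_counts, n, unk="[UNK]"):
    """Keep the N most frequent tokens; map everything else to `unk`."""
    keep = {tok for tok, _ in Counter(token_counts).most_common(n)}
    return lambda tok: tok if tok in keep else unk

counts = {"the": 900, "cat": 40, "sat": 35, "axolotl": 1}
lookup = truncate_vocab(counts, 3)
print([lookup(t) for t in ["the", "cat", "axolotl"]])
# ['the', 'cat', '[UNK]']
```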

📖 Term

Tokenization Pipeline

The sequential chain of processing steps (normalization, pre-tokenization, model segmentation, and post-processing) that produces the final tokens.
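
A sketch of the four stages wired together with the Hugging Face tokenizers library; this shows configuration only, and the model would still need training before use.

```python
from tokenizers import (Tokenizer, models, normalizers, pre_tokenizers,
                        processors)

tok = Tokenizer(models.BPE(unk_token="[UNK]"))      # 3. model segmentation
tok.normalizer = normalizers.Sequence([             # 1. normalization
    normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents(),
])
tok.pre_tokenizer = pre_tokenizers.Whitespace()     # 2. pre-tokenization
tok.post_processor = processors.TemplateProcessing( # 4. post-processing
    single="[CLS] $A [SEP]",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)
```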

📖 Term

Tokenizer Config

A JSON configuration file containing all the hyperparameters and metadata needed to exactly reproduce the behavior of a specific tokenizer.
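
A hypothetical config written from Python; the field names are illustrative, not a specific library's schema.

```python
import json

# Illustrative config: enough metadata to rebuild the tokenizer exactly.
config = {
    "model_type": "BPE",
    "vocab_size": 32000,
    "unk_token": "[UNK]",
    "special_tokens": ["[CLS]", "[SEP]", "[MASK]", "[PAD]"],
    "lowercase": True,
}
with open("tokenizer_config.json", "w") as f:
    json.dump(config, f, indent=2)
```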

📖 Term

Fast Tokenizers

Optimized tokenizer implementations written in Rust with efficient data structures, offering 10-100x speedups over pure Python implementations.
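
A minimal sketch with Hugging Face transformers, where use_fast=True selects the Rust-backed implementation when one exists for the checkpoint.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
print(tok.is_fast)  # True when the Rust-backed tokenizer is in use

# Batch encoding is where the Rust backend pays off most.
batch = tok(["first sentence", "second sentence"] * 1000)
print(len(batch["input_ids"]))  # 2000
```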

📖 Term

Tokenizer Inference

The phase in which a trained tokenizer is applied to new text, converting raw strings into token sequences ready for processing by the model.
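
A minimal inference sketch with the Hugging Face tokenizers library, assuming a previously saved tokenizer.json.

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")  # hypothetical saved tokenizer
enc = tok.encode("raw text the model has never seen")
print(enc.tokens)          # subword strings
print(enc.ids)             # integer ids the model consumes
print(tok.decode(enc.ids)) # round-trip back to text
```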
