Multi-Head Attention
Head Dimension (d_k)
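Dimension of the query and key vectors in each attention head; in the standard Transformer the value vectors use the same size (d_v = d_k). It is usually calculated by dividing the model dimension by the number of heads, d_k = d_model / h (for example, 512 / 8 = 64). At a fixed model dimension, adding heads therefore shrinks d_k and reduces the representational capacity of each individual head.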
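A minimal sketch of the calculation, assuming the common convention that the model dimension is split evenly across heads. The function names head_dimension and attention_scale are hypothetical, chosen for illustration, and the 1/sqrt(d_k) factor is the scaling used in scaled dot-product attention:

```python
import math


def head_dimension(d_model: int, num_heads: int) -> int:
    """Return d_k = d_model / num_heads, requiring an even split."""
    if d_model <= 0 or num_heads <= 0:
        raise ValueError("d_model and num_heads must be positive")
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    return d_model // num_heads


def attention_scale(d_k: int) -> float:
    """Return the 1/sqrt(d_k) factor applied to query-key dot products."""
    return 1.0 / math.sqrt(d_k)


# Illustrative values: d_model = 512 split over 8 heads.
d_k = head_dimension(512, 8)
print(d_k)                   # 64
print(attention_scale(d_k))  # 0.125
```

Rejecting a model dimension that does not divide evenly is a design choice of this sketch; some implementations instead set the head dimension independently of d_model.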