Loss Function
A function measuring the discrepancy between a model's predictions and the ground-truth values; its gradient provides the signal used to iteratively update weights during training.
"Cross-entropy loss is the standard choice for multi-class classification tasks."
Mean squared error, Cross-entropy, Optimization objective
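As a minimal sketch of the two losses named in this entry, the snippet below computes mean squared error and cross-entropy in plain Python. The function names and sample numbers are illustrative assumptions, not taken from any particular library.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def cross_entropy(probs, target_index):
    """Negative log-probability the model assigns to the correct class."""
    return -math.log(probs[target_index])

# Regression: only the third prediction is off (by 2), so MSE = 4 / 3.
mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])

# Classification: the correct class is index 1 in both cases.
ce_confident = cross_entropy([0.1, 0.8, 0.1], target_index=1)
ce_unsure = cross_entropy([0.4, 0.3, 0.3], target_index=1)
```

Cross-entropy is smaller when the model puts more probability on the correct class, which is why minimizing it pushes predictions toward the labels.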
Backpropagation
An algorithm computing gradients of the loss with respect to every weight in a neural network by applying the chain rule of calculus layer by layer from output back to input.
"Backpropagation enables efficient gradient computation across hundreds of millions of parameters."
Gradient Descent
An optimization algorithm that iteratively adjusts model parameters in the direction of steepest descent of the loss landscape, seeking a local or global minimum.
"Stochastic gradient descent (SGD) with mini-batches dramatically accelerates training on large datasets."
Learning rate, SGD, Convergence
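A minimal sketch of the update rule described in this entry, applied to the one-dimensional loss f(w) = (w - 3)^2, whose minimum is at w = 3. The loss, learning rate, and step count are assumptions chosen for illustration.

```python
def gradient_descent(grad_fn, w0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    w = w0
    for _ in range(steps):
        w -= learning_rate * grad_fn(w)
    return w

# Gradient of f(w) = (w - 3)^2 is 2 * (w - 3).
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

Stochastic gradient descent follows the same loop, except that each step estimates the gradient from a mini-batch rather than the full dataset.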
Optimizer
An algorithm (e.g., Adam, RMSProp, SGD with momentum) that turns gradients into parameter updates, typically using momentum, per-parameter adaptive learning rates, or both to accelerate convergence and escape poor local minima.
"The Adam optimizer is a common default for training deep neural networks."
Adam, Momentum, Adaptive learning rate
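To make momentum and adaptive learning rates concrete, here is a sketch of the Adam update for a single scalar parameter, using the commonly cited default coefficients. It minimizes the same toy loss f(w) = (w - 3)^2. The function name and the toy problem are assumptions for illustration.

```python
import math

def adam_minimize(grad_fn, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g        # momentum: running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)  # per-parameter scaled step
    return w

w_adam = adam_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

Dividing by the root of the squared-gradient average means the first step has a size close to `lr` regardless of how large the raw gradient is.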
Hyperparameters
Configuration values set before training begins (e.g., learning rate, batch size, number of layers) that control the model architecture and the training process and critically affect model performance and convergence speed.
"Grid search, random search, and Bayesian optimization are common methods for systematic hyperparameter tuning."
Learning rate, Batch size, Hyperparameter tuning
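Grid search, the simplest of the tuning methods mentioned in this entry, can be sketched as an exhaustive loop over candidate values. The `validation_loss` function below is a hypothetical stand-in for training a model and evaluating it; its formula and the grid values are invented for illustration.

```python
from itertools import product

def validation_loss(learning_rate, batch_size):
    # Hypothetical stand-in for "train a model, then measure validation loss".
    return (learning_rate - 0.01) ** 2 * 1e4 + abs(batch_size - 64) / 64

grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [32, 64, 128],
}

# Evaluate every combination and keep the one with the lowest loss.
candidates = product(grid["learning_rate"], grid["batch_size"])
best = min(candidates, key=lambda combo: validation_loss(*combo))
```

Random search and Bayesian optimization replace the exhaustive `product` with sampled or model-guided candidates, which usually matters once the grid grows beyond a few dimensions.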
Regularization
Techniques (L1/L2 penalty, dropout, data augmentation) that discourage overfitting by adding constraints or injecting noise during training, improving the model's ability to generalize.
"L2 regularization adds a penalty proportional to the sum of squared weights to the total loss."
Dropout, L1/L2, Data augmentation
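The L2 penalty from the example sentence can be written out directly. This is a minimal sketch; the coefficient name `lam` and the sample weights are assumptions.

```python
def l2_penalty(weights, lam):
    """Penalty proportional to the sum of squared weights."""
    return lam * sum(w * w for w in weights)

def regularized_loss(data_loss, weights, lam=0.01):
    """Total loss = loss on the data + L2 penalty on the weights."""
    return data_loss + l2_penalty(weights, lam)

# Sum of squares is 1 + 4 + 9 = 14, so the penalty is 0.01 * 14 = 0.14.
total = regularized_loss(0.5, [1.0, -2.0, 3.0], lam=0.01)
```

Because the penalty's gradient with respect to each weight is `2 * lam * w`, every update also shrinks the weights slightly toward zero.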
Embedding Vector
A dense, low-dimensional representation of discrete entities (words, items, nodes) where similar items are positioned close together in the learned vector space, capturing semantic relationships.
"Word embeddings capture semantic relationships: king minus man plus woman approximates queen."
Tokenization
The process of breaking raw text into discrete tokens, which may be words, subwords (BPE, WordPiece, SentencePiece), or characters, that a language model can process as numerical input.
"The sentence 'AI is amazing' might tokenize to 'AI', ' is', ' amazing' with a BPE tokenizer."
BPE, WordPiece, Subword
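To show how a subword vocabulary can be built, here is a sketch of a single BPE training step: count adjacent symbol pairs, then merge the most frequent pair everywhere. It is simplified on purpose. The toy corpus is an assumption, and production tokenizers also weight pairs by word frequency and handle bytes and whitespace.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the top one."""
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

# Start from characters; each training step adds one merged symbol.
corpus = [list("hug"), list("pug"), list("hugs"), list("bun")]
pair = most_frequent_pair(corpus)
corpus = merge_pair(corpus, pair)
```

Repeating the two calls grows the vocabulary one merge at a time, so frequent character sequences end up as single tokens while rare words stay split into smaller pieces.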
Convolutional Neural Network
A neural network using learnable filters (kernels) that slide across input data to automatically detect spatial hierarchies of features; it is a foundational architecture for modern computer vision.
"CNNs excel at image classification, object detection, and video analysis tasks."
Convolutional layer, Pooling layer, Computer vision
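The sliding-filter idea can be sketched with a small 2-D convolution, computed as cross-correlation with no padding and stride 1, as is conventional in deep learning. The vertical-edge kernel here is hand-picked for illustration. In a CNN the kernel values would be learned.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image and sum elementwise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# Dark left half, bright right half.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked kernel that responds to a dark-to-bright step from left to right.
vertical_edge = [[-1, 1], [-1, 1]]
feature_map = conv2d(image, vertical_edge)
```

The feature map is nonzero only where the window straddles the boundary between the dark and bright columns, which is the sense in which a filter "detects" a pattern.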
Recurrent Neural Network
A neural network architecture designed for sequential data, in which recurrent connections feed each time step's hidden state back into the network at the next step, allowing information to persist across time steps.
"RNNs process text sequentially, maintaining a hidden state that summarizes the context seen so far."
Sequence model, Hidden state, Time-series data
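A minimal sketch of the hidden-state recurrence, using scalars instead of vectors so the arithmetic stays visible. The weight names `w_x`, `w_h`, and `b` and their values are assumptions for illustration.

```python
import math

def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Run a one-unit RNN: each step mixes the new input with the previous state."""
    h, states = h0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

# Only the first input is nonzero, yet later states still carry a trace of it.
states = rnn_forward([1.0, 0.0, 0.0], w_x=1.0, w_h=0.5, b=0.0)
```

The trace of the first input shrinks at every step in this setup, a small-scale view of why plain RNNs struggle with long-range dependencies.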
Long Short-Term Memory Network (LSTM)
A specialized RNN architecture with gating mechanisms (input, forget, output gates) that allow it to learn long-range dependencies and effectively mitigate the vanishing gradient problem.
"LSTMs revolutionized speech recognition and machine translation before the Transformer era."
Gating mechanism, Long-term dependencies, Sequence modeling
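The three gates can be sketched for a single scalar LSTM cell. The parameter dictionary and its values are hypothetical, chosen so that the forget gate stays nearly open and the input gate nearly closed, which lets the cell state survive many steps.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g      # keep part of the old memory, add part of the new
    h = o * math.tanh(c)        # expose a gated view of the memory
    return h, c

# Hypothetical parameters: forget gate ~1 (bias +10), input gate ~0 (bias -10).
params = {"wf": 0.0, "uf": 0.0, "bf": 10.0,
          "wi": 0.0, "ui": 0.0, "bi": -10.0,
          "wo": 0.0, "uo": 0.0, "bo": 0.0,
          "wg": 0.0, "ug": 0.0, "bg": 0.0}

h, c = 0.0, 1.0
for _ in range(50):
    h, c = lstm_step(0.0, h, c, params)
```

Because the cell state is updated additively and scaled by a gate that can sit close to 1, stored information and its gradient decay far more slowly than in the plain tanh recurrence.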
Transformer Architecture
A neural network architecture introduced in 2017 in the paper "Attention Is All You Need" that relies entirely on self-attention mechanisms, eliminating recurrence and convolution and enabling massive parallelization across GPUs.
"GPT, BERT, Claude, and Gemini are all ultimately built on the Transformer architecture."
Self-Attention Mechanism
A mechanism that, for each position in a sequence, computes a weighted sum of value vectors from all positions using learned Query, Key, and Value projections: the weights come from query-key similarity and are applied to the values, capturing long-range dependencies regardless of distance.
"Self-attention lets the model directly relate 'sat' to both 'cat' and 'mat' regardless of their distance."
Query/Key/Value, Attention scores, Context modeling
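The weighted sum in this entry corresponds to scaled dot-product attention, sketched below in plain Python. To keep it short, the learned Query, Key, and Value projections are omitted and the same toy vectors are passed in for all three. That simplification and the token values are assumptions.

```python
import math

def softmax(scores):
    peak = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """For each query, weight every value by the query-key similarity."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)            # attention scores over all positions
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
        all_weights.append(weights)
    return outputs, all_weights

# Three toy token vectors; a real model would first project them to Q, K, V.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
```

Each row of `weights` sums to 1 and depends only on vector similarity, not on how far apart two positions are, which is what makes distant tokens as easy to relate as adjacent ones.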