Step 1: Convert a string into a sequence of tokens (strings) using the tokenizer, splitting into words for word-based vocabularies or into sub-words for sub-word-based vocabularies (BPE/SentencePiece/WordPiece). This calls self.sp_model.encode(text, out_type=str), where sp_model is the SentencePiece processor object from the sentencepiece library. The call returns ['▁"', 'this', '▁is', '▁a', '▁python', '▁code', ':"']
Step 2: Convert the token strings into token ids, i.e. convert a token string (or a sequence of tokens) into a single integer id (or a sequence of ids) using the vocabulary. In practice this is just a for loop that converts the previously split tokens one by one.
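The two steps above can be sketched without a real SentencePiece model. This is only a toy illustration: `TOY_VOCAB`, `toy_tokenize`, and `toy_convert_tokens_to_ids` are hypothetical names, the vocabulary is made up, and the whitespace split stands in for the real sub-word segmentation.

```python
from typing import Dict, List

# Made-up vocabulary for illustration; a real tokenizer loads this from a model file.
TOY_VOCAB: Dict[str, int] = {
    "<unk>": 0, "▁this": 1, "▁is": 2, "▁a": 3, "▁python": 4, "▁code": 5,
}


def toy_tokenize(text: str) -> List[str]:
    """Step 1: split text into word-level pieces, marking each word start with '▁'."""
    return ["▁" + word for word in text.lower().split()]


def toy_convert_tokens_to_ids(tokens: List[str]) -> List[int]:
    """Step 2: look up each token in the vocabulary, one by one."""
    unk_id = TOY_VOCAB["<unk>"]
    return [TOY_VOCAB.get(token, unk_id) for token in tokens]


tokens = toy_tokenize("this is a python code")
ids = toy_convert_tokens_to_ids(tokens)
print(tokens)  # ['▁this', '▁is', '▁a', '▁python', '▁code']
print(ids)     # [1, 2, 3, 4, 5]
```

Tokens missing from the vocabulary fall back to the `<unk>` id, which is the usual convention for a fixed vocabulary.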
next_token_logits = outputs.logits[:, -1, :]
# pre-process distribution
next_token_scores = logits_processor(input_ids, next_token_logits)
next_token_scores = logits_warper(input_ids, next_token_scores)
...
probs = nn.functional.softmax(next_token_scores, dim=-1)
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
# Operation performed inside logits_processor
class MinLengthLogitsProcessor(LogitsProcessor):
    """
    [`LogitsProcessor`] enforcing a min-length by setting EOS probability to 0.
    """

    def __init__(self, min_length: int, eos_token_id: Union[int, List[int]]):
        if not isinstance(min_length, int) or min_length < 0:
            raise ValueError(f"`min_length` has to be a non-negative integer, but is {min_length}")
        if isinstance(eos_token_id, int):
            eos_token_id = [eos_token_id]
        if not all([isinstance(i, int) for i in eos_token_id]) or any([i < 0 for i in eos_token_id]):
            logger.warning(f"`eos_token_id` has to be a list of positive integers, but is {eos_token_id}")
        self.min_length = min_length
        self.eos_token_id = eos_token_id

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        cur_len = input_ids.shape[-1]
        if cur_len < self.min_length:
            for i in self.eos_token_id:
                scores[:, i] = -float("inf")
        return scores
...
# The three classes invoked by logits_warper
class TemperatureLogitsWarper(LogitsWarper):
    def __init__(self, temperature: float):
        if not isinstance(temperature, float) or not (temperature > 0):
            raise ValueError(f"`temperature` has to be a strictly positive float, but is {temperature}")
        self.temperature = temperature

    def __call__(self, input_ids: torch.Tensor, scores: torch.Tensor) -> torch.FloatTensor:
        scores = scores / self.temperature
        return scores
class TopKLogitsWarper(LogitsWarper):
    def __init__(self, top_k: int, filter_value: float = -float("Inf"), min_tokens_to_keep: int = 1):
        if not isinstance(top_k, int) or top_k <= 0:
            raise ValueError(f"`top_k` has to be a strictly positive integer, but is {top_k}")
        # Use the `top_k` argument here; `self.top_k` is not defined yet at this point
        self.top_k = max(top_k, min_tokens_to_keep)
        self.filter_value = filter_value

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        top_k = min(self.top_k, scores.size(-1))  # Safety check
        # Remove all tokens with a probability less than the last token of the top-k
        indices_to_remove = scores < torch.topk(scores, top_k)[0][..., -1, None]
        scores = scores.masked_fill(indices_to_remove, self.filter_value)
        return scores
class TopPLogitsWarper(LogitsWarper):
    def __init__(self, top_p: float, filter_value: float = -float("Inf"), min_tokens_to_keep: int = 1):
        top_p = float(top_p)
        if top_p < 0 or top_p > 1.0:
            raise ValueError(f"`top_p` has to be a float > 0 and < 1, but is {top_p}")
        self.top_p = top_p
        self.filter_value = filter_value
        self.min_tokens_to_keep = min_tokens_to_keep

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        sorted_logits, sorted_indices = torch.sort(scores, descending=False)
        cumulative_probs = sorted_logits.softmax(dim=-1).cumsum(dim=-1)
        # Remove tokens with cumulative top_p above the threshold (token with 0 are kept)
        sorted_indices_to_remove = cumulative_probs <= (1 - self.top_p)
        if self.min_tokens_to_keep > 1:
            # Keep at least min_tokens_to_keep
            sorted_indices_to_remove[..., -self.min_tokens_to_keep :] = 0
        # scatter sorted tensors to original indexing
        indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)
        scores = scores.masked_fill(indices_to_remove, self.filter_value)
        return scores
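The main steps:
The logits are passed to logits_processor and logits_warper, which pre-process the distribution, for example by adding penalties or modifying the probabilities so that the generated output better matches expectations. The warper applies three classes in turn. The first, TemperatureLogitsWarper, uses the temperature parameter to adjust the randomness of generation: temperature controls the shape of the softmax and therefore the diversity of the generated sequence. As temperature approaches 0, the model tends to output the single most likely result, so generation becomes nearly deterministic: almost all probability mass concentrates on the highest-probability token and the others get a probability close to 0. When temperature is large (greater than 1), the probability gaps between tokens shrink, so even low-probability tokens may be sampled and the output becomes more varied.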
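The effect of temperature can be checked numerically. The sketch below uses plain Python instead of torch so it runs anywhere; `softmax` and `apply_temperature` are illustrative helper names, and the logits are made up.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_temperature(logits: List[float], temperature: float) -> List[float]:
    """Divide every logit by the temperature, as TemperatureLogitsWarper does."""
    if temperature <= 0:
        raise ValueError("temperature must be strictly positive")
    return [x / temperature for x in logits]


logits = [2.0, 1.0, 0.0]  # made-up scores for a 3-token vocabulary
sharp = softmax(apply_temperature(logits, 0.1))   # low temperature
flat = softmax(apply_temperature(logits, 10.0))   # high temperature

for name, probs in (("T=0.1", sharp), ("T=10", flat)):
    print(name, [round(p, 3) for p in probs])
```

At T=0.1 nearly all mass lands on the first token, while at T=10 the three probabilities are close to uniform, which matches the description above.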
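TopKLogitsWarper processes the model's output scores by applying so-called "top-k truncation". Top-k truncation is a common technique in natural language generation: at each generation step it keeps only the K most likely candidates and ignores all others. This narrows the set of candidates to sample from and reduces interference from unlikely outputs.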
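A minimal pure-Python sketch of the same idea follows. `top_k_filter` is a hypothetical helper, not the library class; like the `scores < threshold` comparison in the warper, it keeps ties at the threshold, so it can retain more than K entries when scores are equal.

```python
import math
from typing import List


def top_k_filter(logits: List[float], k: int, filter_value: float = -math.inf) -> List[float]:
    """Keep the k largest logits and replace the rest with filter_value."""
    k = min(max(k, 1), len(logits))                   # clamp k to a valid range
    threshold = sorted(logits, reverse=True)[k - 1]   # k-th largest score
    return [x if x >= threshold else filter_value for x in logits]


filtered = top_k_filter([1.0, 3.0, 2.0, 0.5], k=2)
print(filtered)  # [-inf, 3.0, 2.0, -inf]
```

After a softmax, entries set to `-inf` get probability 0, so they can never be sampled.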
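TopPLogitsWarper implements the strategy known as "top-p (or nucleus) sampling", which limits the range of candidate outputs the model considers at each generation step. Instead of always keeping a fixed number K of the highest-probability outputs, top-p sampling selects candidates based on the cumulative distribution function (CDF) of the probability distribution: we set a threshold P and keep adding the most likely outputs until their cumulative probability is greater than or equal to P. Because the number of kept outputs adapts to the shape of the distribution, this method handles different distributions better and can produce more natural text in some cases.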
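The selection rule just described can be sketched in plain Python. This is an illustration, not the library code: `top_p_filter` is a hypothetical helper that walks the tokens in descending probability order, whereas the class above sorts ascending and masks the low-probability tail.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(
    logits: List[float],
    top_p: float,
    filter_value: float = -math.inf,
    min_tokens_to_keep: int = 1,
) -> List[float]:
    """Keep the smallest set of most likely tokens whose cumulative probability reaches top_p."""
    probs = softmax(logits)
    order = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    kept = set()
    cumulative = 0.0
    for index in order:
        kept.add(index)
        cumulative += probs[index]
        if cumulative >= top_p and len(kept) >= min_tokens_to_keep:
            break
    return [x if i in kept else filter_value for i, x in enumerate(logits)]


# Made-up distribution: token probabilities 0.5, 0.3, 0.15, 0.05
logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
nucleus = top_p_filter(logits, top_p=0.7)
kept_ids = [i for i, x in enumerate(nucleus) if x != -math.inf]
print(kept_ids)  # [0, 1]
```

With P = 0.7 the first token alone (0.5) is not enough, so the second (cumulative 0.8) is added and the remaining two are filtered out.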
input_ids = torch.cat([input_ids, next_tokens[:, None]], dim=-1)
if streamer is not None:
    streamer.put(next_tokens.cpu())
model_kwargs = self._update_model_kwargs_for_generation(
    outputs, model_kwargs, is_encoder_decoder=self.config.is_encoder_decoder
)

# if eos_token was found in one sentence, set sentence to finished
if eos_token_id_tensor is not None:
    unfinished_sequences = unfinished_sequences.mul(
        next_tokens.tile(eos_token_id_tensor.shape[0], 1).ne(eos_token_id_tensor.unsqueeze(1)).prod(dim=0)
    )

# stop when each sentence is finished, or if we exceed the maximum length
if unfinished_sequences.max() == 0 or stopping_criteria(input_ids, scores):
    if not synced_gpus:
        break
    else:
        this_peer_finished = True
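Here:
Conditional checks post-process the generated results, for example appending padding tokens to the end of sequences that have already finished and updating the model inputs and lengths.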
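The bookkeeping described above can be sketched for a small batch in plain Python. `EOS_ID`, `PAD_ID`, and `append_step` are hypothetical names chosen for this example; the per-row flag plays the role of `unfinished_sequences`.

```python
from typing import List

EOS_ID = 2  # assumed end-of-sequence id for this example
PAD_ID = 0  # assumed padding id for this example


def append_step(sequences: List[List[int]], next_tokens: List[int], unfinished: List[int]) -> bool:
    """Append one token per row; rows that already finished receive PAD_ID instead.

    Returns True once every row in the batch has produced EOS.
    """
    for row, token in enumerate(next_tokens):
        if not unfinished[row]:
            token = PAD_ID          # finished rows are only padded
        sequences[row].append(token)
        if token == EOS_ID:
            unfinished[row] = 0     # mark this row as finished
    return max(unfinished) == 0


sequences = [[5], [6]]
unfinished = [1, 1]
done_after_first = append_step(sequences, [7, EOS_ID], unfinished)   # row 1 finishes
done_after_second = append_step(sequences, [EOS_ID, 9], unfinished)  # row 0 finishes, row 1 is padded
print(sequences)  # [[5, 7, 2], [6, 2, 0]]
```

The sampled token 9 for the second row is discarded in the last step because that row had already emitted EOS.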
input_sentences = [
"DeepSpeed is a machine learning framework",
"He is working on",
"He has a",
"He got all",
"Everyone is happy and I can",
"The new movie that got Oscar this year",
"In the far far distance from our galaxy,",
"Peace is the only way",
]
tokenizer.pad_token_id = 0  # pad with token id 0; pad_token itself expects a string, not an int
input_tokens = tokenizer.batch_encode_plus(input_sentences, return_tensors="pt", padding=True)
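bad_words and stop_words
When generating tokens, an LLM needs to handle two special kinds of tokens:
tokens that must not be generated (bad_words)
tokens that stop generation as soon as they appear (stop_words)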
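Both behaviours can be sketched with a greedy decoder over made-up per-step logits. This is a simplification under stated assumptions: `ban_tokens` and `greedy_generate` are hypothetical helpers, and only single-token bad words and stop words are handled, not multi-token phrases.

```python
import math
from typing import List, Set


def ban_tokens(logits: List[float], bad_token_ids: Set[int]) -> List[float]:
    """Set the score of every banned token to -inf so it can never be chosen."""
    return [-math.inf if i in bad_token_ids else x for i, x in enumerate(logits)]


def greedy_generate(
    step_logits: List[List[float]], bad_token_ids: Set[int], stop_token_ids: Set[int]
) -> List[int]:
    """Pick the best allowed token at each step and stop after the first stop token."""
    generated: List[int] = []
    for logits in step_logits:
        masked = ban_tokens(logits, bad_token_ids)
        token = max(range(len(masked)), key=lambda i: masked[i])
        generated.append(token)
        if token in stop_token_ids:
            break
    return generated


# Made-up logits for three steps over a 4-token vocabulary
step_logits = [
    [0.1, 5.0, 2.0, 0.0],  # token 1 scores highest but is banned, so token 2 wins
    [0.0, 0.0, 1.0, 3.0],  # token 3 is a stop token, so generation ends here
    [9.0, 0.0, 0.0, 0.0],  # never reached
]
output = greedy_generate(step_logits, bad_token_ids={1}, stop_token_ids={3})
print(output)  # [2, 3]
```

Masking happens before token selection, while the stop check happens after it, which is why the two lists are handled at different points in the loop.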
kv-cache
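kv-cache is an optimization for LLM inference: it reduces computation at the cost of extra GPU memory to hold the cache. During inference, the Key and Value matrices of the tokens processed so far are cached, so each new step only computes the key and value for the newest token instead of recomputing them for the whole history. This matters most when generating long texts.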
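A toy single-head attention layer shows why the cache is safe to use: the cached path and the full recomputation give the same output for the newest token. Everything below is an assumption made for illustration (the 2x2 weights `W_Q`, `W_K`, `W_V`, the `KVCache` class, and the helper names); a real model keeps one such cache per layer and per head.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]

# Made-up projection weights for a 2-dimensional toy model
W_Q: Matrix = [[1.0, 0.0], [0.0, 1.0]]
W_K: Matrix = [[0.5, 0.1], [0.2, 0.7]]
W_V: Matrix = [[1.0, 2.0], [3.0, 4.0]]


def project(x: Vector, weight: Matrix) -> Vector:
    """Row vector times matrix."""
    return [sum(xi * w for xi, w in zip(x, column)) for column in zip(*weight)]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention of one query over all keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w / total * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]


def last_token_output_full(tokens: List[Vector]) -> Vector:
    """No cache: recompute K and V for every token, then attend from the last one."""
    keys = [project(t, W_K) for t in tokens]
    values = [project(t, W_V) for t in tokens]
    return attend(project(tokens[-1], W_Q), keys, values)


class KVCache:
    """Stores the K and V of past tokens so each step projects only the new token."""

    def __init__(self) -> None:
        self.keys: List[Vector] = []
        self.values: List[Vector] = []

    def step(self, token: Vector) -> Vector:
        self.keys.append(project(token, W_K))
        self.values.append(project(token, W_V))
        return attend(project(token, W_Q), self.keys, self.values)


tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]  # made-up token embeddings
cache = KVCache()
cached_outputs = [cache.step(t) for t in tokens]
full_outputs = [last_token_output_full(tokens[: i + 1]) for i in range(len(tokens))]
```

Without the cache, step i projects i keys and i values again; with it, each step adds exactly one of each, and the stored lists are the extra memory the text refers to.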