TL;DR:
During inference (generation) with a causal language model (like the GPT series or LLaMA), you should use left padding, since most generation functions are implemented for it and that implementation is simpler.
Intro
If you already have some knowledge of Machine Learning, you have probably heard of the padding trick. To be more specific, in NLP/LLMs it usually looks like this:
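For instance, a minimal sketch with the Hugging Face `transformers` tokenizer API (the BERT checkpoint and sentences are just illustrative):

```python
from transformers import AutoTokenizer

# Illustrative only: a BERT-style tokenizer pads on the right by default.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

batch = tokenizer(
    [
        "I have an apple.",
        "A complex system that works is invariably found to have evolved from a simple system that worked.",
    ],
    padding=True,            # pad the shorter sentence up to the longest one
    return_tensors="pt",
)
print(batch["input_ids"])       # the first row ends with 0s (the [PAD] id)
print(batch["attention_mask"])  # 0s mark the padded positions
```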
This kind of right padding is commonly found in tutorials because many of them use BERT as the example. However, now that the decoder-only architecture dominates the field, is it always the case?
Why do we need padding?
Padding is a preprocessing trick that makes all the input sentences the same length, so that they can be processed in a batch as a single tensor.
Imagine you have the following two sentences:
Seq_a: I have an apple.
Seq_b: A complex system that works is invariably found to have evolved from a simple system that worked.
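A quick sketch of tokenizing them separately, assuming a Hugging Face tokenizer (the GPT-2 checkpoint is only an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any sub-word tokenizer makes the point

seq_a = "I have an apple."
seq_b = ("A complex system that works is invariably found to have "
         "evolved from a simple system that worked.")

ids_a = tokenizer(seq_a)["input_ids"]
ids_b = tokenizer(seq_b)["input_ids"]
print(len(ids_a), len(ids_b))  # different lengths, so they cannot be stacked into one tensor
```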
🤔️ Uhhhhhh… ok, now the tokenizer returns sub-word representations of the original sentences as ids. However, they have different lengths. That is natural, since the original sentences are different; we can't expect them to always produce tokenized representations of the same length. But it also makes it impossible to feed them to the model, since the model only accepts inputs as a single batched tensor. So what should we do?
The solution is simple and intuitive: add dummy tokens to the shorter representations so that they all have the same length. For example, we pad `seq_a` to the same length as `seq_b`, on either the left or the right side (see the sketch below):
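A minimal sketch of both options with the Hugging Face tokenizer API; the exact sequence length depends on the tokenizer, so the shape in the comments is illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token (more on this later)

sentences = [
    "I have an apple.",
    "A complex system that works is invariably found to have evolved from a simple system that worked.",
]

tokenizer.padding_side = "right"
right_padded = tokenizer(sentences, padding=True, return_tensors="pt")

tokenizer.padding_side = "left"
left_padded = tokenizer(sentences, padding=True, return_tensors="pt")

print(right_padded["input_ids"].shape)  # e.g. torch.Size([2, 18]) with this tokenizer
print(left_padded["input_ids"].shape)   # same shape, the pads just sit on the other side
```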
It is clear that both padding sides give a tensor of shape `(Batch, Text) = (2, 18)`, so we can now send them to the model. But what's the difference?
How do we sample the next token?
The exact answer is: it depends on the generation function you use.
However, in most implementations I have seen, you have to pad on the left to sample correctly, even if you apply the correct attention mask.
Before we start, let's refresh our memory on how a causal language model predicts the next token: it works in an auto-regressive manner, as implemented in GPT-2.
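In spirit, that sampling loop boils down to something like the following sketch (illustrative PyTorch, not the verbatim GPT-2 source; it assumes a model whose output exposes a `.logits` tensor of shape `(batch, seq_len, vocab_size)`):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample(model, input_ids, max_new_tokens, temperature=1.0):
    """Illustrative auto-regressive sampling loop (GPT-2 style)."""
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits         # (batch, seq_len, vocab_size)
        logits = logits[:, -1, :] / temperature  # keep only the LAST position
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)   # (batch, 1)
        input_ids = torch.cat([input_ids, next_token], dim=1)  # append and repeat
    return input_ids
```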
Pay attention to this line:
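That is, this line from the sketch above (the variable names are, again, illustrative):

```python
logits = logits[:, -1, :] / temperature  # keep only the LAST position
```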
You may have already noticed the problem. Remember, if you pad on the right, you will have:
I have an apple. <PAD> <PAD> …
However, the algorithm always picks the logits of the last token to predict the next token. If we pad on the right-hand side, the model actually uses the logits of `<PAD>` to do next-token sampling, which is completely wrong. Even with an attention mask, which assigns zero attention score to the `<PAD>` positions, as long as the algorithm looks at the last `<PAD>` position to start sampling, the `torch.multinomial` function receives the wrong logits, which leads to an incorrect next-token prediction. You can see a similar issue here.
This is not limited to the prototype GPT-2. In the default `.generate()` call of the Hugging Face LLaMA-2, `beam_search()` (you can check this with `vars(model.generation_config)` and `transformers.generation.utils.GenerationMixin.generate`) is applied, and you can find the identical generation logic as in GPT-2. They actually added a warning message suggesting left padding.
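A minimal sketch of the recommended setup with the `transformers` API (the checkpoint name is a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any decoder-only checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # LLaMA-2 / GPT-2 ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

batch = tokenizer(
    ["I have an apple.", "A complex system that works"],
    padding=True,
    return_tensors="pt",
)
out = model.generate(**batch, max_new_tokens=20, pad_token_id=tokenizer.pad_token_id)
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```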
Also, as I mentioned at the very beginning, some pre-trained models do not come with a `<PAD>` token from pretraining, for example LLaMA-2 and GPT-2, so you have to set one yourself (a common choice is to reuse `<EOS>`) before you can pad a batch at all.
You should also note that although left padding is more common, it is not always the case: in the FAIR LLaMA-2 repo, the original `generate()` function is implemented for right padding, but you can notice that it is a little more complicated. That could be partially why left padding is more popular.
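To see why, here is a minimal sketch of the extra bookkeeping a right-padding-aware sampler needs, assuming `logits` from the model and the `attention_mask` returned by the tokenizer (names are illustrative):

```python
import torch

def last_real_token_logits(logits: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """With right padding, the last *real* token sits at a different index per row,
    so we cannot just take logits[:, -1, :]; we have to gather it per sequence."""
    last_idx = attention_mask.sum(dim=1) - 1                       # (batch,) index of last non-pad token
    batch_idx = torch.arange(logits.size(0), device=logits.device)
    return logits[batch_idx, last_idx, :]                          # (batch, vocab_size)

# With left padding, every row's last real token is at position -1,
# so logits[:, -1, :] is already correct and none of this bookkeeping is needed.
```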