HOW BPE AFFECTS MEMORIZATION IN TRANSFORMERS

Their conclusion that sequence length is what matters makes intuitive sense. The Transformer architecture internally compares all pairs of input tokens many times, and the number of token pairs grows quadratically with the sequence length. If we think of classification as searching for a particular relation in the input, then the fewer candidate relations there are, the easier it is to find the relevant one.
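To make the quadratic growth concrete, here is a minimal sketch that counts the query-key pairs a single self-attention head scores for a given sequence length. The sequence lengths are hypothetical, chosen only to show what happens when a coarser tokenization halves the length.

```python
def attention_pairs(seq_len: int) -> int:
    """Number of (query, key) pairs scored by one self-attention head."""
    return seq_len * seq_len


# Hypothetical lengths of the same input under coarser and coarser tokenization.
for seq_len in (64, 32, 16):
    print(f"seq_len={seq_len:3d}  pairs={attention_pairs(seq_len)}")
```

Halving the sequence length cuts the number of compared pairs by a factor of four, which is the sense in which fewer tokens leave fewer candidate relations to search through.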

## Models and hyperparameters

### BPE settings

On the PAQ dataset, (0.5, 1, 5, 10, 15, 20) × 10^3 BPE merges correspond to vocabulary sizes of 1280, 1784, 5784, 10784, 15784, and 20776.
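The link between the number of BPE merges and the vocabulary size can be illustrated with a toy BPE learner. This is a minimal sketch on a made-up word list, not the tokenizer used for the experiments; `learn_bpe`, `merge_pair`, and `total_tokens` are hypothetical helper names. Each merge adds at most one new symbol to the vocabulary and shortens the tokenized corpus.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by one symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges on a list of words (toy version)."""
    # Each word starts as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    vocab = {ch for symbols in corpus for ch in symbols}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken lexicographically for determinism.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        vocab.add(best[0] + best[1])
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return vocab, merges, corpus


def total_tokens(corpus):
    """Total number of tokens in the corpus, counting word frequencies."""
    return sum(len(symbols) * freq for symbols, freq in corpus.items())


# Made-up toy corpus, for illustration only.
words = ["low", "low", "lower", "newest", "newest", "widest"]
for num_merges in (0, 1, 3):
    vocab, merges, corpus = learn_bpe(words, num_merges)
    print(num_merges, len(vocab), total_tokens(corpus))
```

More merges mean a larger vocabulary and shorter sequences, which is the trade-off the settings above sweep over.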

### Turning the model into a classifier

The model is turned into a classifier using methods such as mapping the EOS token.

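## Experiments on how BPE affects memorization

Memorizing random labels: in panels (a) and (b), the models fit random labels better as the BPE vocabulary size increases.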

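Membership inference: panel (c) shows that the test accuracy of the (M)LM has distinct regions of growth. This indicates that generalization and memorization are not in direct conflict: there is a level of granularity at which both better memorization and better generalization can be achieved.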
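For readers unfamiliar with membership inference, the sketch below shows one common form of the attack: predict that an example was in the training set if the model's loss on it falls below a threshold. This is an illustrative assumption, not necessarily the exact attack used in the experiments summarized here, and the loss values are made up. `threshold_attack_accuracy` is a hypothetical helper name.

```python
def threshold_attack_accuracy(member_losses, nonmember_losses):
    """Best accuracy of the rule 'loss <= t means member' over all thresholds t."""
    scored = [(loss, True) for loss in member_losses]
    scored += [(loss, False) for loss in nonmember_losses]
    total = len(scored)
    # Baseline: predict "non-member" for every example.
    best = len(nonmember_losses) / total
    for threshold in sorted({loss for loss, _ in scored}):
        correct = sum((loss <= threshold) == is_member for loss, is_member in scored)
        best = max(best, correct / total)
    return best


# Made-up per-example losses, for illustration only.
seen = [0.1, 0.2, 0.3]      # examples the model was trained on
unseen = [0.9, 1.0, 1.1]    # held-out examples
print(threshold_attack_accuracy(seen, unseen))
```

An accuracy near 0.5 on balanced data means the attacker cannot tell training examples from held-out ones, while an accuracy near 1.0 signals strong memorization.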

