But does the tokenizer have anything to do with Glitch Tokens? Glitch Tokens see...

GeneralMayhem · on June 9, 2023

It does a bit, because the fact that they're able to persist is sort of an artifact of how naive the tokenizer is (it's basically a frequency-counting operation over n-grams), and of the fact that it runs as a separate step. There's no feedback from the transformer to the tokenizer to say "hey, this token is actually pretty meaningless, maybe try again on that one". That means that strings of characters that are common but carry very little semantic value, like the example of Reddit usernames that mostly post on /r/counting, will be included in the model's vocabulary even though they're not interesting.
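To make the "counting operation" point concrete, here is a minimal sketch of a byte-pair-style trainer that merges whichever adjacent pair is most frequent. It is a toy under stated assumptions, not the tokenizer any particular model uses: the corpus, the username `xq_user42`, and the function name `train_bpe` are all made up for illustration.

```python
from collections import Counter


def train_bpe(corpus, num_merges):
    """Learn merges purely by counting adjacent-pair frequency."""
    # Each whitespace-separated word starts as a tuple of single characters.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Frequency is the only criterion; meaning never enters into it.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    vocab = {s for symbols in words for s in symbols}
    return merges, vocab


# Hypothetical corpus: a little ordinary text plus one username repeated a lot.
corpus = "the cat sat on the mat " * 3 + "xq_user42 " * 50
merges, vocab = train_bpe(corpus, num_merges=8)
print("xq_user42" in vocab, "the" in vocab)
```

With this corpus, every merge is spent gluing the username together before an ordinary word like "the" gets a token of its own, and nothing downstream gets a chance to object.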

When humans see extremely low-information-density data, we can forget it. And the model can too, but only kind of - it can forget (or rather, never learn) what the "word" means, but it can't forget that it's a word.
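A rough sketch of that asymmetry, assuming the token shows up in the tokenizer's vocabulary but never in the text the model actually trains on. The `+ 0.1` update is a stand-in for a real gradient step, and the vocabulary, dimensions, and token names are hypothetical.

```python
import random

random.seed(0)

# The vocabulary is fixed by the tokenizer before training starts.
vocab = ["the", "cat", "xq_user42"]  # last entry is a made-up glitch token
dim = 4
emb = {tok: [random.gauss(0, 0.02) for _ in range(dim)] for tok in vocab}
initial = {tok: vec[:] for tok, vec in emb.items()}

# Assumption: the glitch token never appears in the training stream.
training_tokens = ["the", "cat", "the", "cat"]
for tok in training_tokens:
    # Stand-in for a gradient update: only rows of tokens actually seen move.
    emb[tok] = [x + 0.1 for x in emb[tok]]

# Rows that still sit at their random initialization.
untouched = [t for t in vocab if emb[t] == initial[t]]
print(untouched)
```

The row for the unseen token is still in the table and still gets looked up if that string ever appears in a prompt; it just never moved from its random starting point.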