Cool! I had been thinking about trying this as well, after reading about the idea in one of Cosma Shalizi's notebooks [0]. I'd love to see how something like this performs when "trained" on a web-scale corpus with the same kind of computational resources used to train modern LLMs.
[0] http://bactra.org/notebooks/nn-attention-and-transformers.ht...
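For the curious, here's a minimal sketch of the idea as I understand it from the notebook: attention is Nadaraya-Watson kernel smoothing, so "training" a nonparametric language model amounts to memorizing (context, next-token) pairs and predicting by kernel-weighted lookup. The class, the hash-based toy embedding, and all parameter choices below are my own illustration, not anything from Shalizi's notebook:

    import numpy as np
    from collections import defaultdict

    def embed(context, dim=64):
        # Toy featurization: a fixed random vector per context tuple.
        # hash() is stable within a single Python process, which is all
        # this sketch needs; a real system would learn its embeddings.
        rng = np.random.default_rng(abs(hash(context)) % (2**32))
        return rng.standard_normal(dim)

    class KernelSmootherLM:
        """Nonparametric next-token predictor: 'training' is just
        storing the corpus; prediction is Nadaraya-Watson smoothing,
        i.e. softmax attention over everything ever seen."""

        def __init__(self, order=3, temperature=1.0):
            self.order = order
            self.temperature = temperature
            self.keys = []
            self.next_tokens = []

        def train(self, tokens):
            for i in range(self.order, len(tokens)):
                context = tuple(tokens[i - self.order:i])
                self.keys.append(embed(context))
                self.next_tokens.append(tokens[i])
            self.K = np.stack(self.keys)  # one key vector per stored context

        def predict(self, context):
            q = embed(tuple(context[-self.order:]))
            scores = self.K @ q / self.temperature
            w = np.exp(scores - scores.max())
            w /= w.sum()  # Nadaraya-Watson weights == softmax attention weights
            votes = defaultdict(float)
            for weight, token in zip(w, self.next_tokens):
                votes[token] += weight
            return max(votes, key=votes.get)

    lm = KernelSmootherLM()
    lm.train("the cat sat on the mat and the cat ran".split())
    print(lm.predict("the cat sat".split()))  # -> 'on' (the exact matching context dominates)

With this framing, "training" is just appending to a memory and all the compute moves to inference time, so the web-scale experiment I'm imagining would mostly be a question of fast approximate nearest-neighbor search rather than gradient descent.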