> This is not a hypothetical scenario; I’ve personally encountered a case of someone using an LLM to attempt to contribute code I recognized from a specific Open Source project under one license to another project under a different license.
You say you "recognized code". Does that mean you weren't able to find an exact match?
> an LLM is actually just regurgitating portions of its inputs
You seem to be talking about the inputs to the autoregressive pretraining stage. Correct? Then that's not how LLMs work, unless we define "portions" as blocks of a few letters.
I found exact matches. I also found inexact matches, where C functions had been turned into C++ member functions and the like. “Recognized” does not somehow imply a lack of precision.
The LLM the person used was trained on a very large corpus of Open Source code, and reproduced that code exactly. Just like LLMs have reproduced chapters of books and articles from the New York Times exactly.
Were those functions trivial? With, say, a 1% probability that someone who has not seen them would write them the same way?
> Just like LLMs have reproduced chapters of books and articles from the New York Times exactly.
Have you read the articles? As far as I remember, they fed large chunks of an article to an LLM multiple times and sometimes got a not-so-long exact match. That may only mean that LLMs can infer a style and that humans are predictable.
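To make "not-so-long exact match" something one can actually measure, here is a minimal sketch, not the method used in any of the reported cases. It assumes you already have the original text and the model's continuation as plain strings; the function name `longest_verbatim_run` and the whitespace tokenization are my own choices for illustration.

```python
from difflib import SequenceMatcher


def longest_verbatim_run(reference: str, generated: str) -> str:
    """Return the longest run of consecutive tokens found in both texts.

    Tokens are whitespace-separated words (an assumption for illustration;
    a real comparison might use the model's tokenizer instead).
    """
    ref_tokens = reference.split()
    gen_tokens = generated.split()
    # autojunk=False so frequent tokens are not silently ignored on long inputs.
    matcher = SequenceMatcher(None, ref_tokens, gen_tokens, autojunk=False)
    match = matcher.find_longest_match(0, len(ref_tokens), 0, len(gen_tokens))
    # match.size is 0 when nothing is shared, which yields an empty string.
    return " ".join(ref_tokens[match.a:match.a + match.size])
```

Calling it on the original text and on a model continuation gives the shared span, and the length of that span in tokens is a rough proxy for how much was reproduced verbatim. It says nothing about inexact matches such as renamed or restructured code.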
> […] fed large chunks of an article multiple times to an LLM […]
So they had to prompt? An LLM? I got this argument before and still don’t get what it’s trying to say. These models do not output anything unless prompted, that’s not any kind of gotcha.
On the code-output front, there is a lot of relevant evidence beyond the NYT lawsuit [0].
If I slightly modify GPL code, that doesn’t give me the right to relicense.
No, the functions weren’t trivial, and a lot of the surrounding code and structure bore substantial similarities as well. If you saw the two files next to each other, you’d assume it was the result of a copy-paste-adjust process if you didn’t know an LLM was involved.
I can only speculate that the model that generated the code hasn't undergone selective unlearning for verbatim data (SUV) or something similar. As you understand, "sometimes generates verbatim code" and "just regurgitates [non-trivial] portions of its input" are different statements.
The possibility of SUV clearly shows that a model does more than "just regurgitating."