
François Chollet, author of Keras, lays it out pretty clearly that LLMs are just memorizing [0] (there’s more links if you follow that thread).

I think LLMs are incredible, and spend most of my days working with them closely, but they are not nearly as close to “AGI” as people think primarily due to their inability to really generalize.

At the end of the day LLMs aren’t that different from old-school n-gram Markov chains, except that rather than working on n-grams, they’re working in a (very sophisticated) latent space. Their power is really these incredible latent language spaces we’re still just starting to understand.

In all my years of tech the “AI” space is the most curious hype-bubble, since the things people expect to happen are entirely out of line with what is possible, while at the same time the potential of these models is still, imho, underexplored and largely ignored by the vast majority of people attempting to build things with them.

99% of the people I know working in this space are just calling APIs and trying to do some variant of code generation, while a small minority of people are really trying to figure out what’s going on in these models and what can be done with them successfully.

0. https://twitter.com/fchollet/status/1755250582334709970



We can prove that transformers can do computation beyond memorization, and that they can learn it from examples, unless you think learning an algorithm is memorization.

https://arxiv.org/abs/2301.05217

https://arxiv.org/abs/2310.16028


Surely learning an algorithm is memorization? I still remember the physics equations from when I was at school (a long time ago) because they were drilled into us repeatedly. Stuff like v = u + at, or s = ut + 1/2 at^2, etc.

I don't really recall how they were derived etc but I did memorize them. It definitely doesn't mean I'm good at physics or understand the deeper meaning of these equations. I also had to memorize bubble sort, merge, heap, quicksort etc when I was in university, but I don't think I could invent these sorts from first principles without looking it up.

Memorization doesn't really equal understanding, it's just memorization.
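
For reference, a short sketch of where those two equations come from, assuming constant acceleration a, initial velocity u, and elapsed time t (the usual textbook setup, which is an assumption here rather than something stated above):

```latex
\begin{align}
a &= \frac{dv}{dt} \;\Rightarrow\; v = u + at \\
s &= \int_0^t v \, dt' = \int_0^t (u + at') \, dt' = ut + \tfrac{1}{2} a t^2
\end{align}
```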


> Surely learning an algorithm is memorization?

Not in this context, no. An LLM is never given any algorithm to memorise. It is given only input->output sets, and it "learns" what the algorithm is from those sets.

We know that it does this because it is able to generalize that algorithm to inputs and outputs outside of its set of training examples. So we know that it doesn't only memorise which input connects to which output, and regurgitate that information. It has come to "understand" the formula that connects the input to the output, without ever being given that formula.


It’s fitting a function to data and guessing the value of the function for a new input. It does not know what is under the hood of the function, nor does it see the implications when the function produces a wrong value, etc.
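
A minimal sketch of the "fit a function, then guess at a new input" picture, and not a claim about how an LLM is actually trained. The hidden rule y = 3x + 2, the sample points, and the helper names `fit_line` and `predict` are all made up for illustration. Ordinary least squares sees only input->output pairs and still answers for an input it never saw:

```python
def fit_line(pairs):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Training set: only input->output pairs; the rule y = 3x + 2 is never stated.
train = [(x, 3 * x + 2) for x in range(5)]
slope, intercept = fit_line(train)

def predict(x):
    """Evaluate the fitted line at x."""
    return slope * x + intercept

unseen = 100  # far outside the training inputs 0..4
print(predict(unseen))
```

The fit works here because the model family (a straight line) happens to contain the true rule. With a different hidden rule the same code would still return a confident number, which is roughly the objection being raised.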


It can do that. Verifying an answer is just another algorithm it can learn.

LLMs mostly can't do math but that, like most of their other flaws, is because of the tokenizer.


> Surely learning an algorithm is memorization?

Yes. In a sense, every algorithm can be reduced to memorization, where you precalculate all possible inputs to all possible outputs and store them in a giant lookup table.
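
A rough sketch of that reduction, assuming a deliberately tiny input domain (4-bit operands). The names `add_by_algorithm`, `add_by_memory`, and `LOOKUP` are hypothetical:

```python
def add_by_algorithm(a, b):
    """The 'real' algorithm, used only to fill the table."""
    return a + b

# Precompute every possible input pair for 4-bit operands (16 * 16 entries).
LOOKUP = {(a, b): add_by_algorithm(a, b) for a in range(16) for b in range(16)}

def add_by_memory(a, b):
    """Pure recall: no arithmetic happens here, only a table lookup."""
    return LOOKUP[(a, b)]

print(add_by_memory(7, 9))   # inside the table
print(len(LOOKUP))           # number of memorized entries

try:
    add_by_memory(16, 1)     # outside the table
except KeyError:
    print("no entry for (16, 1)")
```

The table only covers the inputs that were enumerated, and its size grows with the square of the operand range, so the reduction holds in principle for finite domains rather than being a practical substitute.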


Maybe they’re doing something we don’t really have a name for yet? This is why it causes so much controversy.

Not really generalizing, not memorizing, maybe approximating?


There is certainly a lot of discussion around semantics to be had before we can agree on what exactly an LLM is capable of.

We know that LLMs tend not to be good at math. Some people will say that the fact that an LLM cannot sum two numbers demonstrates they cannot generalise. Yet others would say the way LLMs calculate sums is generalisation because it mirrors our own process of addition which works mostly by memorisation and a lot of double checking.

My perspective is that the "thinking" that LLMs do is a lot closer to the kind that humans do which is to say a lot of pattern matching but with no fundamentally precise logic underlying it. If LLMs are flawed in some way then humans are also flawed in the same way.


There is a name for this already: associative memory. That’s how you catch a ball: you condition your memory of catching a ball with proprioceptive and visual input, much like a multimodal transformer. There’s no thinking involved; you wouldn’t be able to catch it if you had to think.


An associative memory can model truth tables of logic circuits, and so the steps to go from associative memory to arbitrary computation pretty much requires adding state and a loop. And we do provide state and a loop. As such, while that does not prove that any given model is capable of reasoning, describing what goes on at any step as associative memory says nothing meaningful about the computational power of the wider system.

It's very possible the way we train current models will turn out to place fundamental limits on the abilities the models will have, but that will not be because the models act as associative memories.
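
To make the "truth table plus state and a loop" step concrete, here is a small sketch. The table is the standard one-bit full adder; the function names and the little-endian bit-list encoding are my own choices for illustration. Each step is a single lookup, yet the loop adds numbers of any length:

```python
# Truth table of a one-bit full adder: (a, b, carry_in) -> (sum, carry_out).
FULL_ADDER = {
    (0, 0, 0): (0, 0),
    (0, 0, 1): (1, 0),
    (0, 1, 0): (1, 0),
    (0, 1, 1): (0, 1),
    (1, 0, 0): (1, 0),
    (1, 0, 1): (0, 1),
    (1, 1, 0): (0, 1),
    (1, 1, 1): (1, 1),
}

def add_binary(x_bits, y_bits):
    """Add two little-endian bit lists using only lookups, one state bit, and a loop."""
    length = max(len(x_bits), len(y_bits))
    x_bits = x_bits + [0] * (length - len(x_bits))
    y_bits = y_bits + [0] * (length - len(y_bits))
    carry = 0  # the only state carried between steps
    out = []
    for a, b in zip(x_bits, y_bits):
        s, carry = FULL_ADDER[(a, b, carry)]
        out.append(s)
    out.append(carry)
    return out

def to_bits(n):
    """Non-negative integer -> little-endian list of bits."""
    return [int(c) for c in reversed(bin(n)[2:])]

def from_bits(bits):
    """Little-endian list of bits -> integer."""
    return sum(bit << i for i, bit in enumerate(bits))

print(from_bits(add_binary(to_bits(1234), to_bits(98765))))
```

The eight-row table never "contains" the sum of 1234 and 98765; the answer only exists because of the state and the loop wrapped around it.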


I think the current models will be a fundamental part of larger, more comprehensive models, and they will handle the “easy” parts, much like we, humans, use our associative memory to be able to do things in realtime, “without thinking”. Yann LeCun wrote a paper on this which more people should read: https://openreview.net/pdf?id=BZ5a1r-kVsf. Moreover, Meta’s research is starting to move in that direction https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-jo...


> Not really generalizing, not memorizing, maybe approximating ?

The term you are searching for is confabulating.


"Vibing"?


No need to come up with any new name, we already have it:

1. autocompletion or

2. next token prediction and

3. (reverse) diffusion


They might do "computation" and seem to be "intelligent", however they do lack logic and rigor, and cannot explain their reasoning. In contrast to old-school AI.

Such "intelligent" monkeys might be able to write reports, lead governments or lead wars, but in engineering you need more skills than that. Which leads us to the lack of a proper definition of AGI. (https://en.wikipedia.org/wiki/Artificial_general_intelligenc...)


I guess what constitutes memorization depends on what you consider "learning an algorithm". Memorizing to humans doesn't really mean learning the exact input/output pairs per se. Like a student might "learn" an algorithm for differentiation (d/dx x^n = n x^(n-1)) and then differentiate 732638 x^2 just fine despite never seeing it before, but then tell you the derivative of yx^2 with respect to y is 2yx, or something. Did they really learn how to differentiate or did they just learn a common vibe around it? When teachers see that sort of regurgitation, they call it memorization, despite the input being unique from what the student had seen in the past.
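
Worked out with the standard power rule, holding x constant in the second line, the two examples look like this:

```latex
\begin{align}
\frac{d}{dx}\left(732638\,x^{2}\right) &= 732638 \cdot 2\,x = 1465276\,x \\
\frac{\partial}{\partial y}\left(y\,x^{2}\right) &= x^{2} \quad \text{(not } 2yx\text{)}
\end{align}
```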


A lot of the words he uses in that tweet aren't well defined. e.g. memorization, dataset (does he mean the literal words/tokens or any token that is close in space after embedding?), pattern, category, program. The tweet is practically meaningless. I'm not criticizing him because his blog post is nuanced and he clearly understands what he's talking about, but that tweet almost certainly means something quite specific to him and he's communicating quite poorly.

As you mention, there is a sophisticated representation of the tokens. It's so sophisticated that one may reasonably stop calling them tokens (or, even data) and start calling them "concepts". Now, if someone (or something) has memorized how all the concepts go together... that's pretty darn intelligent.


I think there’s a lot of people who got into tech in 2020. They are new programmers and technologists and made a life change in 2020.

I think they were a big part of the crypto bubble. Lots of talent, hungry for that sweet startup gold, but without the technical background to really know what’s going on.

I believe these same groups are operating in the same way with AI. Recklessly bashing together APIs and cloud services to create MVPs.

It’s all the worst parts of startup culture concentrated.

Anyway, that’s why I think most of the AI space right now is just people calling APIs and acting like they discovered fire.

</salty rant>


Francois is not a researcher, however. LLMs aren’t just plain “memories”. They are very explicitly _associative_ memories. And it just so happens that this is mostly what our own brains do, too.

Actual cognition is slow and expensive for us, and we try to use it as little as possible, filling in what we can with easy, associative, low energy, near instant stuff.

Therein lies the reason why AI can be considered a boon for us humans. If machines took over the mundane work that just drains our energy and doesn't add much value, we could finally have the time to actually do what they can’t - think deeply about stuff, with their help where we find our faculties lacking. Rocket for the mind, if you will, rather than a bicycle.


I think even "actual cognition" is highly unlikely to need anything much more than associative memories plus some state and a loop. The expense being having to "execute" a large number of steps rather than "just" effectively pulling learned results from "cache".

To be very reductive, an associative memory can hold a truth table. Put minimal state, IO and a loop around that and you have a universal Turing machine. Which is why the "it's just memory" or "it's just Markov chains" is so tedious - it says near nothing about the computational power of the system including the model.

There's plenty of reason, of course, to question to what extent we know the abilities of the models, but when people dismiss it as "memory" or "just statistics" or "just Markov chains" I usually take it as a signal they don't understand how few limitations that imposes.
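
A very reduced sketch of "lookup table plus state, IO and a loop". This is one specific small machine (binary increment), not a universal one, and the rule table, state names, and `run` helper are all made up for illustration. A universal machine is the same shape with a larger table:

```python
BLANK = "_"

# Transition table: (state, symbol read) -> (next state, symbol to write, head move).
RULES = {
    ("carry", "1"): ("carry", "0", -1),
    ("carry", "0"): ("halt", "1", 0),
    ("carry", BLANK): ("halt", "1", 0),
}

def run(tape_str, max_steps=10_000):
    """Run the increment machine on a binary string and return the final tape."""
    tape = dict(enumerate(tape_str))   # sparse tape: position -> symbol
    head = len(tape_str) - 1           # start on the least significant bit
    state = "carry"
    for _ in range(max_steps):
        if state == "halt":
            break
        symbol = tape.get(head, BLANK)
        state, write, move = RULES[(state, symbol)]  # each step is one lookup
        tape[head] = write
        head += move
    return "".join(tape[i] for i in sorted(tape))

print(run("1011"))  # → 1100
print(run("111"))   # → 1000 (the tape grows to the left)
```

All of the "memory" is the three-row table; everything else is the loop, the head position, and the tape.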


I wonder if cognition or intelligence is actually "brain damage". Your bad memory is leading to erroneous paths but some of these will eureka. Genetic evolution and natural selection are essentially that.

If your memory is too bad, then you are either insane or in an advanced state of Alzheimer's. If you have enough stable paths to lead a quasi-normal life, then you become an inventor or an artist, or something atypical.

Hallucination is the feature, not the bug.


You must have a pretty impressive resume if you don’t consider François a researcher! The wikipedia would disagree with you [0], as would anyone that has had any interaction with him on the subject.

0. https://en.wikipedia.org/wiki/Fran%C3%A7ois_Chollet


Try to find a single paper on Transformers or LLMs in general in Francois’ scientific output: https://dblp.org/pid/116/8242.html

Don’t get me wrong, Keras is impressive, and Francois is impressive as well. But for insight on LLMs you should probably listen to people who specialize in them.


If you measure knowledge by the number of publications, then LLMs know nothing.


They made the argument he's not a researcher in this field, not that he doesn't know anything. And an LLM is indeed not a researcher in this field either.


It appears that my previous comment lacked clarity. I assumed the inherent absurdity and illogicality of the statement would be self-evident. Dismissing François's opinion, a seasoned professional with extensive expertise in model development and the creator of ARC, solely because he asserts that LLMs are far from achieving AGI and using publication count as the metric for evaluating him is not a well-reasoned argument. While LLMs and transformers are indeed remarkable achievements and will undoubtedly reveal more properties, they have yet to exhibit any true signs of "intelligence."


Acknowledging that they have not yet shown signs of "true intelligence" is a vastly more moderate claim than the person above ascribed to him.


An AGI might be something that can harness the LLM but also self learn.


More like AI will finally be the bicycle for the mind.


IDK, credit where credit is due: traditional computers got us much further than we’d be able to go on our own. So they’re a fine “bicycle” in my view.


But we can trivially show that the larger models can generalize for some questions even if we verify the answer isn’t in the training set.


About [0]: Phi-2 is sort of a proof of concept that, given a very narrow and high-quality dataset, normal transformer models can get 8x to 10x better results. Of course they will fail if you add messy prompts! What a dumb view.

And the Yi-* models are suspected of being trained on the test set, or at least of being contaminated. All the other models barely move, and if they do, it's probably an artifact of the benchmark being multiple choice. There were papers showing that most models improve if they are allowed to reason toward the answer by giving them more tokens in the answer.

The chat elo-like rankings are much more interesting:

https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...

For completeness, here's the paper linked in the tweet https://arxiv.org/pdf/2402.01781.pdf


What is intelligence or generalization if not "sophisticated, complex latent space" navigation?


I think this is exactly what most people are misunderstanding. LLMs are wonderful tools for navigating the sociolinguistic matrix, which embeds the majority of human knowledge both explicitly and covertly with a huge amount of contextual nuance.

Transformers are capable of decoding and operating on this covert knowledge as easily as the encyclopedic knowledge that we tend to assume is the “main part” when we write something down.

What LLMs demonstrate is that language covertly encodes logic and algorithmic knowledge that is at least as rich as the “factual” encoding that written content seems to be at face value.

LLMs just make this knowledge accessible for computation. What is amazing is that this function alone is capable of producing a simulacrum of agency and intelligence all by itself. This suggests that the human cultural component is a pretty huge part of what we consider to be “human” and without it we’d just be clever apes.

It’s the clever ape part that LLMs don’t have, but it’s possible that the transformer model might be able to “ape” most of that as well if applied to the task of existing as an embodied entity in the world (see transformer use in robotic task completion).

I would not be terribly surprised if comprehensive multimodality, encompassing the entire spectrum of sensory experience as well as physical environment interactivity, gets us really close to something we could consider AGI.

Just as LLMs extract the embedded relationships in written knowledge, the physical and sensory spaces encode an enormous quantity of information that can similarly be generalised to extrapolate a cornucopia of additional concepts (gravity, object permanence, physics in general, relativity, etc). Meaningfully “tokenizing” these spaces will likely be the key to making this work effectively.


I find this human-supremacy point of view simply false.

Humans are just "aping" more (with large spread within its population by the way).

There is nothing special about humans, we're just "aping" slightly more than other animals - computers will arrive and immediately surpass us at our "aping" = what we call "intelligence".

This old, false argument with constant goalpost moving will run out of space sooner or later.


I think you misinterpret my statement, as I too believe that human intelligence is based on the same principles as are being exploited in generative AI. I think there is ample evidence to support that viewpoint, even adversarial images that affect humans in the same way (to a lesser extent) as they do classifiers.

I do not believe there is anything special about human intelligence, mostly just that it embodies more complexity than we currently have access to in our training data, and perhaps the hardware required might be expensive, or maybe not really.

Interactive / bootstrapping learning is still something we will need to figure out.


As far as I know we don't have much real understanding of how human or even animal intelligence works. It may be that it is entirely based on "sophisticated, complex latent space" navigation or it may be that it has nothing whatsoever to do with that (or anything in-between).


Our brains can reuse patterns they learned in one area in another area. LLMs need specific examples, and can combine these in weird, unpredictable, and often inhuman ways, losing context and meaning, the bits humans care about. They often aren’t even a good starting point compared to just thinking deeply for 5 minutes.


Show me a Markov chain that can thrash a 9-dan Go master.

"But that wasn't an LLM."

OK, show me a Markov chain that can write a Python program that can play Go at all.


Put a loop around a Markov chain where you provide a 'tape interface' that takes instructions from, and feeds input back to, the state, and you have a Markov decision process with a hard-wired decision maker acting as the tape. Provide the right Markov chain, and you have a universal Turing machine. So the extension needed from a Markov chain to something that could be programmed to do what you describe - say by running an ML model - is only very slight. And we do provide loops and state when we run inference, just not infinite ones.

I'm agreeing with your overall point, to be clear - my point is that calling something a Markov chain is effectively calling it trivially extendable to something that can in principle compute everything any physical entity confined to the known laws of physics can, and so what it boils down to is whether or not the model is trained in a way that gives it those abilities, and not the put-down of the potential ability of such a system that people usually intend the "just a Markov chain" as.


Semantics, I suppose. Those ANNs were Markov chains, in a sense.


Pop a loop around a Markov chain that provides a "tape interface" and you have something capable of simulating a Turing machine. So when people bring up the Markov chains argument, they're saying next to nothing about the potential computational abilities of the system, even though they usually intend to dismiss it.

I tend to see people bringing that up in a dismissive way (not suggesting you are) as a clear indication they either haven't thought the argument through or do not understand how little it takes for a system to be Turing complete, and so for that argument to be meaningless.


I mean, I’m not sure what you’re trying to prove by asking for a Markov chain model like that. It’s trivially true that you can have a Markov chain output whatever you like (somewhat artificially, but we are talking about memorisation here) if you pick your training data carefully.


So "pick the training data carefully," and show me what I'm asking for, given that it is "trivial."


The data consists of a graph whose vertices correspond to the output sequence you desire and where there is an edge (with probability 1) from string x to string y if and only if x precedes y.

This model will arise from the desired sequence as a single training example (learning the probability of each pair of consecutive tokens), provided none are repeated.

Now run your Markov chain with initial input {first token in your sequence}.
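
A sketch of that construction, assuming a first-order (bigram) chain and greedy decoding. The token sequence and the helper names `train_bigram_chain` and `generate` are hypothetical, and the `max_len` guard is only there because a repeated token could otherwise loop forever:

```python
from collections import Counter, defaultdict

def train_bigram_chain(tokens):
    """Estimate P(next | current) from consecutive token pairs."""
    counts = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        counts[current][nxt] += 1
    return {
        current: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for current, followers in counts.items()
    }

def generate(chain, first_token, max_len=1000):
    """Follow the most probable edge until a token with no successor is reached."""
    out = [first_token]
    while out[-1] in chain and len(out) < max_len:
        followers = chain[out[-1]]
        out.append(max(followers, key=followers.get))
    return out

# A single training example with no repeated tokens (made up for illustration).
desired = ["def", "play_go", "(", "board", ")", ":", "pass"]
chain = train_bigram_chain(desired)

print(generate(chain, desired[0]) == desired)  # → True
```

Every edge has probability 1 because each token has exactly one successor in the training example, so the chain replays the sequence verbatim. That is pure memorisation, which is the point being conceded above.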


I'm curious, what sort of areas do you think need more research when it comes to these models?



