The Bitter Lesson (2019) [pdf] (utexas.edu)
80 points by jdkee on Aug 16, 2023 | 60 comments


This is the original page, I’d link to this:

http://www.incompleteideas.net/IncIdeas/BitterLesson.html

u/dang, swap links if you see this?


Email is the correct way to acquire his attention for things like this.


Yah, entirely unclear why this has been submitted as a browser-saved PDF of the original HTML, when the original HTML is right there!


Hi, I say this so much that it is almost a revolving PSA at this point, but here we go again:

Big reminder that the Bitter Lesson is _not_ saying "just scale your methods and they work". What the Bitter Lesson _is_, however, is "work on methods that scale". There is a _huge_ distinction between the two, in my opinion.

Effectively, if I can add my layer of (re?)interpretation to it, it's saying that specialized, boutique, hand-designed solutions don't play very well in the long term in an arena with Moore's Law and money. But what it's not saying is that just making an algorithm bigger is the solution. This is where I see most people missing its interpretation.

For an algorithm to be effective, it needs 3 things in my opinion:

1. It needs to scale (measurable by some factor)

2. Because we have 1, it needs to have an implementation with extraordinarily rapid iteration time

3. If we have 2, we need an implementation that is extraordinarily lightweight (enough to run on consumer machines)

These three factors together, in my personal estimation, unlock algorithmic research progress in an area.

An ancillary fourth rule that really drives progress IMO is competition, formalized or otherwise, that is 1. open, 2. well-known, 3. incentivized and 4. accessible.

This is my personal opinion of course, and bound to be flawed in some way, but -- from personal experience, at least -- when this field (or research fields in general) has aligned with this kind of research method, the speed of algorithmic progress has absolutely exploded. :) <3


Sutton cites people who were not convinced by the success of Deep Blue in chess:

> They said that "brute force" search may have won this time, but it was not a general strategy, and anyway it was not how people played chess.

But it is now clear that these people were completely right! The Deep Blue approach (apparently mostly brute force search) didn't scale. E.g. Go wasn't solved this way, and the AlphaGo approach (based on reinforcement learning) did in fact scale to a lot of other games with AlphaZero and MuZero.

Sutton in fact says that both search and learning produce the big successes in AI. But now, with the dominance of ML, it seems that mainly learning is responsible for the big successes in AI, not search. Machine learning means that the AI models aren't written by hand, GOFAI style, like Deep Blue, but instead by a learning algorithm, which in turn is written by hand. That seems to be the breakthrough, not brute force search. (Do GPTs use "search" in any conventional way? I don't think so.)


SGD is a method of learning from examples (mapping X -> y). RL is a method of search (maximizing reward signal in a potentially open-ended environment, based on feedback on the actions that a model takes). Go was solved by combining learning and search (e.g. narrowing the search space with learned, educated guesses).

Pretrained dense GPTs don't search in any conventional way. However, when these GPTs are fine-tuned using RL, an aspect of search is reintroduced. One of the most interesting directions labs are pursuing to improve on today's best LLMs is to fine-tune them with search (RL) after learning-based (next-token prediction) pre-training.


Yes, AlphaGo does use search, but classifying RL as search seems to be stretching the term.


We train deep learning models with SGD, which is in fact a form of search -- in a high-dimensional parameter space.

The purpose of SGD is to find parameter values that minimize a training loss.
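
To make the "search in parameter space" framing concrete, here is a minimal NumPy sketch of SGD fitting a toy linear model (the data and hyperparameters are made up purely for illustration):

  import numpy as np

  # Toy data for a linear model y = X @ w (illustrative only)
  rng = np.random.default_rng(0)
  X = rng.normal(size=(128, 10))
  w_true = rng.normal(size=10)
  y = X @ w_true + 0.1 * rng.normal(size=128)

  w = np.zeros(10)   # the starting point of the "search"
  lr = 0.05          # step size

  for step in range(500):
      idx = rng.integers(0, len(X), size=32)        # random mini-batch
      Xb, yb = X[idx], y[idx]
      grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)    # gradient of the mean squared error
      w -= lr * grad                                # move "downhill" in parameter space

  print(np.linalg.norm(w - w_true))   # small: the search found good parameter values

Each step is a local move guided by the gradient, so it is a very directed search, but a search nonetheless.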


I was thinking along similar lines; it does smell like search to me at its core. But do you think there is a meaningful distinction for something more nuanced like AdamW? Does the latter fall into the "brute force search" category too, even though it tends to converge faster (and arguably more optimally than SGD, per Jeremy Howard's observations)?
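
For comparison, a rough sketch of a single AdamW-style update (typical default hyperparameters, not anyone's benchmarked settings). The difference from plain SGD is the per-parameter adaptive step and the decoupled weight decay; whether that makes it a meaningfully different kind of search is exactly the question above:

  import numpy as np

  def adamw_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
      """One AdamW update. m and v are running moment estimates; t is the 1-based step count."""
      w = w * (1 - lr * wd)                  # decoupled weight decay (the "W" in AdamW)
      m = b1 * m + (1 - b1) * grad           # first moment: running mean of gradients
      v = b2 * v + (1 - b2) * grad ** 2      # second moment: running mean of squared gradients
      m_hat = m / (1 - b1 ** t)              # bias correction for the zero-initialized moments
      v_hat = v / (1 - b2 ** t)
      w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter adaptive step
      return w, m, v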


However, the ultimate optimal game of chess is still not known/solved (in the end it's either a win for White, a win for Black, or a draw), and brute force is alive and well: every possible position with seven pieces on the board has been solved by brute force, so the endgame is now just one lookup in a table once only seven pieces remain.

I think they're currently solving it for eight pieces.

So it's learning, but then brute force for the endgame.


If I remember correctly, the algorithm in AlphaGo used a combination of reinforcement learning and searching the space of moves and possible outcomes to determine the next move. So it does indeed use both search and learning.

Regardless, the author's point is that computation is a better way of finding and exploiting patterns/strategies than our own intuitions. The distinction between search and learning is not the important one here.


There was an important step prior to AlphaGo. At the time, the combinatorics were stacked in Go's favor. But someone had the bright idea to do a probabilistic search of the space. The key idea was to play a ton of random games and rate each position by the percentage of winning games it appeared in. This blew away every other Go AI at the time. Sadly, this was about when I stopped having time to follow the space, so I'm not sure how the idea was further incorporated into Go AI. But it was truly a revolutionary idea at the time.
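
A minimal sketch of that idea (pure Monte Carlo move evaluation, the precursor of MCTS); the game API here (`legal_moves`, `play`, `is_terminal`, `winner`) is a hypothetical stand-in for a real Go implementation:

  import random

  def rollout_value(state, player, n_playouts=100):
      """Estimate how good `state` is for `player` by playing random games to the end."""
      wins = 0
      for _ in range(n_playouts):
          s = state
          while not s.is_terminal():                    # assumed game API
              s = s.play(random.choice(s.legal_moves()))
          if s.winner() == player:
              wins += 1
      return wins / n_playouts                          # fraction of random games won from here

  def best_move(state, player):
      """Rate each candidate move by its random-playout win rate and pick the best."""
      return max(state.legal_moves(),
                 key=lambda m: rollout_value(state.play(m), player))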


In hindsight computation wasn't the important thing though. A lot of things require a lot of computation that aren't intelligent or don't scale well, like Deep Blue. The important breakthrough in AI was learning ("machine learning").


Search/reasoning/inference-time compute, however you phrase it, is still essential. You need search on top of learning to work in novel situations.


Humans do very little "search", as we can see in games like Go. Which suggests it isn't essential for AI either.


Just adding an obvious corollary regarding parallelization, to "leverage computation" and "massive computation became available and a means was found to put it to good use":

  Parallelizable methods win (e.g. DL), by using the computation available.

Further: progress will occur on real-world problems that can be solved by those methods, making those applications dominant throughout society. (I.e., everything looks like a nail to someone with a hammer... but if it's a really good hammer, that may be the best approach.)


Yes. Prior to this, though, there was a lot of debate in the AI community about the pros and cons of "weak search" (i.e. generalized methods) and "strong search" (i.e. methods which make use of domain knowledge). The bitter lesson is more or less that in the long run weak search always wins, and the only real judgement call is whether strong search can give you a transient benefit in a particular case before weak search surpasses it, and whether that is worth it.

In my head this always paralleled the "premature optimization" conversation in programming. Most programmers would say that inlining etc. is only justified once you've benchmarked, so you know how the code is actually performing; but the experience of whole-program optimization in things like the HotSpot JVM and LLVM suggests that even benchmarked optimizations can be premature, because only a VM can optimize for the real-world use case.


I think the Go example is facetious. For the longest time, Go simply was not amenable to search. It is only because of breakthroughs in theory that Go became amenable to search. And it is hardly a straightforward search, mind you: two self-learning networks are involved, one to guide the expansion of a Monte Carlo tree search, the other to evaluate the nodes in the tree. You can hardly blame a researcher in the 70s, 80s, 90s, or 00s for not focusing on computing power when each move has a branching factor of up to 19^2 = 361.
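
Roughly how those two networks plug into the tree search (an AlphaGo-Zero-style PUCT selection rule, sketched from the published description; the node attributes here are hypothetical stand-ins, not DeepMind's code):

  import math

  def select_child(node, c_puct=1.5):
      """Pick the child maximizing Q + U: the value estimate plus a policy-guided exploration bonus."""
      total_visits = sum(child.visits for child in node.children)
      def score(child):
          q = child.value_sum / child.visits if child.visits else 0.0   # mean of value-net evaluations
          u = c_puct * child.prior * math.sqrt(total_visits) / (1 + child.visits)
          return q + u
      return max(node.children, key=score)

  # On expansion, each child's `prior` comes from the policy network's probability for that
  # move, and leaf nodes are scored by the value network instead of full random rollouts.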


Go has been described as the drosophila of AI. Once a program knows how to play Go, we'll be able to make more general artificial intelligences. This is exactly what has happened with AlphaGo and ChatGPT. We're not repeating what happened with chess; we are past the singularity. Recent advances in AI are shedding light on the very nature of intelligence. Our memory and our brain are a machine for predicting the future, and that is very similar to what recent models do. AlphaZero rediscovered the history of chess openings in the same order as humans did. We are no longer discovering simple new algorithms enabled by hardware advances, but architecting predictive systems based on guided learning. When we developed the algorithms for playing chess, the possibilities for generalization were limited. With these new architectures, new possibilities appear every day. IMHO, the predictions were true: since AlphaGo, we've entered a new world.


>> Go has been described as the drosophila of AI.

John McCarthy described chess as "the drosophila of AI":

http://jmc.stanford.edu/articles/drosophila/drosophila.pdf

Who was it that described Go as the drosophila of AI?


It was John McCarthy, in the very paper that you cite!

See section 7:

> As a fourth Drosophila I would like to mention the research on Computer Go.


I finished my studies in 1993. Artificial intelligence was my major, and I had done a study on programming the game of Go (a review of the state of research). At that time, the success of brute force in chess had caused much disillusionment in the AI world, which had hoped chess would be its drosophila. The transfer of this hope to the game of Go was present in numerous publications. Even if it doesn't show up very well in a Google search, there are plenty of allusions to it in articles about programming the game of Go. Since the creation of the Go programming language, they have become harder to find.


Oh, oops. I missed that. I linked the pdf thinking it was a different text.

Thanks for the correction.


There's no real knowledge transfer from AlphaGo to ChatGPT. They share some underlying techniques, but they sit on different branches of the tree; one is not built on top of the other.


It's not called "the bitter lesson" for being optimistic.

One of the points, I think, is that there is very little a researcher in the 70s could have done to make progress on these problems, because the computing power just wasn't there yet; and once the computing power was here, almost all of that 70s researcher's work became obsolete.


i don't know. it's mysterious that nobody even came up with the algorithms for decades, which often do still work well on toy examples. MCTS is the kind of thing people definitely could have thought of at any point since 1950; and neural network value functions for board games had already succeeded with TD-GAMMON.

this is even more striking in poker, where counterfactual regret minimization wasn't invented until 2007, despite being a relatively simple algorithm to describe and all the essential intellectual building blocks being known since the 1960s.

von Neumann had invented extensive-form games with the express idea of modeling poker; there were researchers in the 1950s/60s (Hannan, Blackwell) working out the core ideas of regret minimization. One innovation (the sequence-form representation of game strategies) was known only in the Soviet Union in the 1960s and did not propagate to the West. It was independently rediscovered in the early 1990s, and it still took another 15 years for CFR to be developed.
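
To give a sense of how simple the core building block is, here is regret matching (the Hannan/Blackwell-style idea that CFR applies at every information set of the game tree, weighted by counterfactual reach probabilities). This sketch only covers a single decision against a fixed opponent, with made-up numbers:

  import numpy as np

  def regret_matching(regrets):
      """Turn accumulated regrets into a strategy: play actions in proportion to positive regret."""
      positive = np.maximum(regrets, 0)
      total = positive.sum()
      return positive / total if total > 0 else np.ones(len(regrets)) / len(regrets)

  # Rock-paper-scissors against a fixed (and slightly exploitable) opponent.
  payoff = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])   # our utility: rows = our action
  opponent = np.array([0.4, 0.3, 0.3])                      # arbitrary fixed opponent mix
  regrets = np.zeros(3)
  avg_strategy = np.zeros(3)

  for _ in range(10000):
      strategy = regret_matching(regrets)
      avg_strategy += strategy
      action_values = payoff @ opponent           # expected value of each of our actions
      expected = strategy @ action_values         # expected value of our current mix
      regrets += action_values - expected         # regret for not having played each action

  print(avg_strategy / avg_strategy.sum())        # concentrates on "paper", the best response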


Can you please give a link to the soviet paper? Thank you very much :)


The title is "Reduction of a game with complete memory to a matrix game" by I. V. Romanovsky, Doklady Akademii Nauk SSSR 1962. Unfortunately I don't have access to the article.


Thank you!


> The second general point to be learned from the bitter lesson is that the actual contents of minds are tremendously, irredeemably complex; we should stop trying to find simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries.

I agree and think “ML explainability” efforts are doomed to fail as ML becomes increasingly more effective. There is no a priori reason that the human brain should be capable of intuitively grokking sufficiently advanced general learners. We can invent them and improve them, but saying that we will be able to understand what the myriad matrix multiplications are “doing” will be like saying we understand the human brain because we can model the physics of its constituent atoms. The emergent complexity is too high for us to make any sense of it.


There is also no a priori reason why we shouldn't be able to understand how the higher-level behavior emerges. And without such an understanding, trying to improve or control the behavior is like poking around in the dark.


> trying to improve or control the behavior is like poking around in the dark

Which is basically what we've been doing with AI so far, as the paper notes (among other fields, like medicine). Have you heard the joke "grad student descent"?


This (popular) opinion makes no sense. How are you going to improve something you don't understand? Throwing random pieces of software at the wall to see what sticks? The recent progress was possible because people understood the limitations of earlier models intuitively enough to think up and invent a remedy.


Not that I necessarily agree with the above poster, but this:

> How are you going to improve something you dont understand?

Is just nonsense. Evolution understands nothing, yet produced a mind. Closer to us, the early people who produced all the crops that led to the shift to agriculture, and its later improvements, absolutely did not understand how any of it worked.

Evaluation and selection are sufficient to improve things. Understanding is useful, but optional.


These are just absurdly remote analogies that have nothing to do with how ML algorithms have developed thus far, nor with how they will develop in the immediate future.

So how would you "improve", here and now, any algorithm? Create and insert random code and evaluate? Jeepers, people are losing touch with reality.


> This (popular) opinion makes no sense. How are you going to improve something you dont understand?

How does that opinion not make sense? There are numerous things humans have invented for which we have little understanding of how they work: medical drugs, anesthesia, certain quantum phenomena utilized in semiconductors, etc.

I would argue that for current state-of-the-art LLMs, the implementation is likewise ahead of the theory at the moment.


"Understanding" does not need internalizing a process or algorithm at low level. We are not stochastic parrots. It is simply possesing sufficient insight so as to be able to explore nearby designs, formulate and test hypotheses, narrow the search space etc.

There is a strange emerging AI cult that is also in force here in HN that seems to believe these algorithms have evolved themselves or were some random trial and error. Ergo, they can keep evolving and the researchers dont need to understand a thing about how they work.

Serendipity in combining ideas that prove effective plays a role but within a fairly well defined conceptual sandbox. But progress with AI is more or less conditional on people having sufficient understanding to coax algorithmic structures in the desired direction.


For decades ML researchers were walking on eggshells, thinking that they were going to run into overfitting and bias/variance tradeoffs, adding regularisation, worrying about local optima, etc. Then one day a certain company kept making their models bigger without any such concern, and suddenly everyone thinks there are unlimited returns to scale, when in reality you merely bought yourself a one-time victory.


(Reads:) "Did you know that time-traveling was invented, cos time was seen as a threat to life ?"

P-:



Ehh, I'm more on Sutton's side so far.

Brooks' post goes over the classics (Moore's law is ending, curating a dataset requires human intervention, etc.) and posits that making a huge model won't be a competitive strategy for long because it gets too expensive to train and use.

It's a bit early to tell, but so far that hasn't materialized. OpenAI got state-of-the-art results with GPT-4, AFAIK by sticking several very big models together. Open-source experiments with LLaMA show you can still get good results with heavy quantization. Distillation hasn't been explored much by mainstream projects, but I bet there's lots of potential there too.

Right now the winning strategy looks to be "go really big, then figure out how to go small".


Going really big and figuring out how to go small is also working; see Lottery Tickets, ensembles, distillation, and so on.


As someone who holds a speed WR with a very tiny model (<~8.5 MB or so), I would gently argue that Sutton's lesson is about methods that scale -- and scale goes in the opposite direction too!

If you do not have solid scaling, the link by which micro-scale methods predict mega-scale methods is broken. Hence, Sutton's bitter law forms the foundation for a few other lemmas that I think underpin really effective research (chiefly iteration time, and how we reduce it as much as possible and make it as accessible as possible -- which, thankfully, for ML algorithms seems to go hand in hand! <3 :')))) )


I think the convergence to transformer architectures already proves this article from 2019 wrong in some ways. Even though CNNs did and do a good job, they will most likely be superseded by vision transformers. Of course you could argue that stable diffusion models and LLMs also have some domain knowledge inside, but it might be a little less with multimodal networks. So I guess the lesson should be: don't put more domain knowledge inside than what's needed to make a product out of it right now, because everything else won't progress the field.


I think the Bitter Lesson does not consider two important points with regard to neural nets:

1. Not all nets work for all problems; those that work tend to have the right inductive biases. We discovered the architectures partially by trial and error; nevertheless, they work because of encoded prior information.

2. Data and computation are bounded. GPT-4 was basically trained on all text; further advancements probably need more insight, not more data.


> GPT4 was basically trained on all text

Well, there's multi-modal training. There's tons of untapped audiovisual data.


GPT-4 is a multi-modal model. They haven't exposed a way to use the image embeddings, so consumers can't utilize it, but the model accepts image input and it was trained on images and text. Yes, there are other modalities that can be incorporated, and reinforcement learning is still pretty nascent/very much unsolved.


The fact that it's 1) multi-modal and 2) we're approaching text exhaustion does not imply 3) that other modalities are exhausted


Yeah I just like pointing out that GPT-4 is multimodal because most people don't seem to realize this, as they never read the GPT-4 papers.


The essence of the bitter lesson is that the less inductive bias we try to bake in, the better models tend to perform.

Transformers are preferred to ConvNets these days in Computer Vision despite the latter having all sorts of vision based inductive biases.

GPT-4 has not been trained on all text lol


1. Partially, that just moves where we have to look for the implicitly included symmetries. Word embeddings most famously create structure that allows shift operations in the embedding space; "doctor + female = nurse" might be the most infamous example (see the sketch below). Through a process of evolution, only those word embeddings that yield good results get used. Just because we did not put the structure there, but discovered it by trial and error, does not mean the structure is not key to the success.

2. GPT-4 was trained on roughly 13T tokens; all books ever written would amount to about 6.5T tokens by a Fermi estimate.
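
A rough sketch of that kind of "shift operation", assuming a pretrained embedding table `emb` (a dict from word to NumPy vector, loaded from wherever); whether any particular analogy actually comes out depends entirely on the embeddings used:

  import numpy as np

  def analogy(emb, a, b, c, top_k=1):
      """Find words whose vectors are closest (by cosine) to emb[b] - emb[a] + emb[c]."""
      query = emb[b] - emb[a] + emb[c]
      query = query / np.linalg.norm(query)
      scores = {w: float(v @ query / np.linalg.norm(v))
                for w, v in emb.items() if w not in (a, b, c)}
      return sorted(scores, key=scores.get, reverse=True)[:top_k]

  # e.g. analogy(emb, "man", "doctor", "woman") -> hopefully something like ["nurse"];
  # the point is only that this linear structure emerges without anyone designing it in.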


1. Sorry but that's still far less bias than ConvNets for vision. And far less bias than previous methods for language. The less bias the better. It's simple.

2. Assuming the rumor is fact, GPT-4 was trained for multiple epochs, and books are a small percentage of what trains LLMs lol.


What do you mean by more insight? If it is not represented as some data, how could it ever influence any computational model?


By building the model in such a way that assumptions about the data are in the model structure.

ConvNets hardcode translation equivariance, more general convolutions can hardcode more general equivariant structures.

The whole field of geometric deep learning is about constructing nets based on insights about the structure of the data manifold, and many successful nets turn out to do this.
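
The translation-equivariance claim is easy to check directly. A hedged PyTorch sketch (circular padding is used so the equivariance is exact rather than only approximate at the borders):

  import torch
  import torch.nn as nn

  conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, padding_mode="circular", bias=False)

  x = torch.randn(1, 1, 16, 16)
  shift = lambda t: torch.roll(t, shifts=3, dims=-1)   # translate the image 3 pixels to the right

  # Convolving a shifted input gives the shifted output of convolving the original:
  # the symmetry is baked into the architecture rather than learned from data.
  print(torch.allclose(conv(shift(x)), shift(conv(x)), atol=1e-6))   # True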


Nobody said anything about insights.


I don't doubt the taste was bitter at the time, but it was never going to stay bitter forever. In 2018 models had ~100e6 parameters, and in 2023 GPT-4 has (wild guess) ~2e12 parameters [0]. That's a factor of ~20,000 over five years, or roughly 7x per year -- conservatively, more than 500% per year. At its peak, hardware never improved at that rate, and it has since slowed down dramatically.

So it was always going to end, and further advancement was always going to revert to being driven by a neural net of some sort. In fact, it looks to me like we are already at that point.

The interesting question is what neural net will end up driving it.

[0] https://www.qualcomm.com/news/onq/2023/07/generative-ai-tren...


On HN before, at least once.[1]

[1] https://news.ycombinator.com/item?id=36017857


The money quote (basically says why deep learning will keep on dominating): "One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning."


this thread on a "bitter lesson 2.0" relating to robotics from earlier this year is interesting

https://twitter.com/hausman_k/status/1613544836266885120


I only see the first post in the series when I click this link. Do I have to click something else to see the rest of the thread?



