"Eventually though, open source Linux gained popularity – initially because it a...

causal · on July 23, 2024

"Open weights" is a more appropriate term but I'll point out that these weights are also largely inscrutable to the people with the code that trained it. And for licensing reasons, the datasets may not be possible to share.

There is still a lot of modifying you can do with a set of weights, and they make great foundations for new stuff, but yeah we may never see a competitive model that's 100% buildable at home.

Edit: mkolodny points out that the model code is shared (under llama license at least), which is really all you need to run training https://github.com/meta-llama/llama3/blob/main/llama/model.p...

stavros · on July 23, 2024

"Open weights" means you can use the weights for free (as in beer). "Open source" means you get the training dataset and the methodology. ~Nobody does open source LLMs.

larodi · on July 23, 2024

Indeed, since when is the deliverable being a jpeg/exe, which is similar to what the model file is, considered the source? It is more like an open result or a freely available VM image, which works, but has its core FS scrambled or encrypted.

Zuck knows this very well and it does him no honour to speak like this; from his position it amounts to an attempt to change the present semantics of open source. Of course, others do that too - using the notion of open source to describe something very far from open.

What Meta is doing under his command can better be described as releasing the resulting...build, so that it can be freely poked around and even put to work. But the result cannot be effectively reverse engineered.

What's more ridiculous is that it is precisely because the result is not the source in its whole form that these graphical structures can be made available. It is only thanks to the fact that it is not traceable to the source, which makes the whole game not only closed, but like... sealed forever. An unfair retelling of humanity's knowledge tossed around in a very obscure container that nobody can reverse engineer.

How's that even remotely similar to open source?

proteal · on July 23, 2024

Even if everything was released how you described, what good would that really do for an individual without access to heaps of compute? Functionally there seems to be no difference between open weights and open compute because nobody could train a facsimile model. Furthermore, all frontier models are inscrutable due to their construction. It’s wild to me seeing people complain about semantics when meta dropped their model for cheap. Now I’m not saying we should suck the zuck for this act of charity, but you have to imagine that other frontier models are not thrilled that meta has invalidated their compute moats with the release of llama. Whether we like it or not, we’re on this AI rollercoaster and I’m glad that it’s not just oligopolists dictating the direction forward. I’m happy to see meta take this direction, knowing that the alternatives are much worse.

stavros · on July 23, 2024

That's not the discussion. We're talking about what open source is, and it's having the weights and the method to recreate the model.

If someone gives me an executable that I can run for free, and then says "eh why do you want the source, it would take you a long time to compile", that doesn't make it open source, it just makes it gratis.

nightski · on July 24, 2024

Calling weights an executable is disingenuous and not a serious discussion. You can do a lot more with weights than you could with a binary executable.

_flux · on July 24, 2024

You can do a lot more with an executable as well than just execute it. So maybe the analogy is apt, even if not exact.

Actually, you can reverse engineer an executable into something that could be compiled back into an executable with the exact same functionality, which is AFAIK impossible to do with "open weights". Still, we don't call free executables "open source".

the8thbit · on July 25, 2024

It's not really an analogy. LLMs are quite literally executables in the same way that jpegs are executables. They both specify machine readable (but not human readable) domain specific instructions executed by the image viewer/inference harness.

And yes, like other executables, they are not literal black boxes. Rather, they provide machine readable specifications which are not human readable without immense effort.

For an LLM to be open source there would need to be source code. Source code, btw, is not just a procedure that can be handed to a machine to produce code that can be executed by the machine. That means the training data and code is not sufficient (or necessary) for an open source model.

What we need for an open source model is a human readable specification of the model's functionality and data structures which allows the user to modify specific arbitrary functionality/structure, and can be used to produce an executable (the model weights).

We simply need much stronger interpretability for that to be possible.

rizky05 · on July 24, 2024

This is debatable; even an executable is a valuable artifact. You can also do a lot with an executable in expert hands.

frabcus · on July 23, 2024

I'd find knowing what's in the training data hugely valuable - can analyse it to understand and predict capabilities.

nine_k · on July 23, 2024

Linux is open source and is mostly C code. You cannot run C code directly, you have to compile it and produce binaries. But it's the C code, not binary form, where the collaboration happens.

With LLMs, weights are the binary code: it's how you run the model. But to be able to train the model from scratch, or to collaborate on new approaches, you have to operate at the level of architecture, methods, and training data sets. They are the source code.

verdverm · on July 24, 2024

Analogies are always going to fall short. With LLM weights, you can modify them (quant, fine-tuning) to get something different, which is not something you do with compiled binaries. There are ample areas for collaboration even without being able to reproduce from scratch, which takes $X Millions of dollars, also something that a typical binary does not have as a feature.
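To make the "quant" point concrete, here is a minimal sketch of symmetric int8 quantization applied to a small list of weights, in plain Python. The function names `quantize_int8` and `dequantize` are made up for illustration, and real toolchains use more elaborate schemes (per-channel scales, calibration data, and so on); this only shows that weights can be transformed directly, without retraining.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q_weights, scale):
    """Recover approximate floats from the quantized integers."""
    return [q * scale for q in q_weights]


# Illustrative values only, not taken from any real model.
weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
```

The round trip loses some precision (each value is off by at most about half a quantization step), which is the trade-off quantization accepts for smaller storage.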

piperswe · on July 24, 2024

You can absolutely modify compiled binaries to get something different. That's how lots of video game modding and ROM hacks work.

krisoft · on July 24, 2024

And we would absolutely do it more often if compiling would cost as much as training of an LLM costs now.

verdverm · on July 24, 2024

I considered adding "normally" to the binary modifications expecting a response like this. The concepts are still worlds apart

Weights aren't really a binary in the same sense that a compiler produces, they lack instructions and are more just a bunch of floating point values. Nor can you run model weights without separate code to interpret them correctly. In this sense, they are more like a JPEG or 3d model
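A toy sketch of that last point, assuming a single linear layer: the numbers do nothing by themselves, and a separate piece of code (the hypothetical `forward` function here) decides what they mean. The values are invented for illustration.

```python
# The "weights": just numbers, with no instructions of their own.
weights = [[1.0, 2.0], [0.5, -1.0]]
bias = [0.0, 1.0]


def forward(x, weights, bias):
    """Separate code that interprets the numbers as a linear layer: y = Wx + b."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]


y = forward([1.0, 1.0], weights, bias)
```

Swap `forward` for a different function and the same numbers would compute something else entirely, which is the sense in which they resemble a data file more than a program.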

the8thbit · on July 25, 2024

JPEGs and 3D models are also executable binaries. They, like model weights, contain domain specific instructions that execute in a domain specific and turing incomplete environment. The model weights are the instructions, and those instructions are interpreted by the inference harness to produce outputs.

sigmoid10 · on July 23, 2024

>Nobody does open source LLMs.

There are a bunch of independent, fully open source foundation models from companies that share everything (including all data). AMBER and MAP-NEO for example. But we have yet to see one in the 100B+ parameter category.

stavros · on July 23, 2024

Sorry, the tilde before "nobody" is my notation for "basically nobody" or "almost nobody". I thought it was more common.

plausibility · on July 23, 2024

It is more common when it comes to numbers I guess. There are ~5 ancestors in this comment chain, and I would agree that roughly 4-6 is acceptable.

politelemon · on July 24, 2024

It's the literal (figurative) nobody rather than the literal (literal) nobody.

mattnewton · on July 24, 2024

There are plenty of open source LLMs, they just aren’t at the top of the leaderboards yet. Here’s a recent example, I think from Apple: https://huggingface.co/apple/DCLM-7B

Using open data and dclm: https://github.com/mlfoundations/dclm

WithinReason · on July 24, 2024

If weights are not the source, then if they gave you the training data and scripts but not the weights, would that be "open source"?

guappa · on July 24, 2024

Yes, but they won't do that. Possibly because extensive copyright violation in the training data that they're not legally allowed to share.

sharpshadow · on July 24, 2024

If somebody leaked the training data and they denied that it’s real, they would not get sued and the data would be available.

Edit typo.

guappa · on July 24, 2024

It's not available if you can't use it because you don't have as many lawyers as facebook and can't ignore laws so easily.

llm_trw · on July 24, 2024

This is bending the definition to the other extreme.

Linux doesn't ship you the compiler you need to build the binaries either, that doesn't mean it's closed source.

LLMs are fundamentally different to software and using terms from software just muddies the waters.

TeMPOraL · on July 24, 2024

And LLMs don't ship with a Python distribution.

Linux sources :: dataset that goes into training

Linux sources' build confs and scripts :: training code + hyperparameters

GCC :: Python + PyTorch or whatever they use in training

Compiled Linux kernel binary :: model weights

llm_trw · on July 24, 2024

Just because you keep saying it doesn't make it true.

LLMs are not software any more than photographs are.

saurik · on July 24, 2024

Then what is the "source"? If we are to use the term "source" then what does that mean here, as distinct from it merely being free?

llm_trw · on July 24, 2024

It means nothing because LLMs aren't software.

Phelinofist · on July 24, 2024

Do they not run on a computer?

llm_trw · on July 24, 2024

So does a video. Is a video open source if you're given the permissions to edit it? To distribute it? Given the files to generate it? What if the files can only be open in a proprietary program?

Videos aren't software and neither are llms.

saurik · on July 24, 2024

If a video doesn't have source code, then it can't be open source. Likewise, if you feel that an LLM doesn't have source code because of some property of what it is -- as you claim it isn't software and somehow that means that it abstractly removes it from consideration for this concept (an idea I think is ridiculous, FWIW: an LLM is clearly software that runs in a particularly interesting virtual machine defined by the model architecture) -- then, somewhat trivially, it also can't be open source. It is, as the person you are responding to says, at best "open weights".

If a video somehow does have source code which can "generate it", then the question of what it means for the source code to the video to be open even if the only program which can read it and generate the video is closed source is equivalent to asking if a program written in Visual Basic can ever be open source given that the Visual Basic compiler is closed source. Personally, I can see arguments either way on this issue, though most people seem to agree that the program is still open source in such a situation.

However, we need not care too much about the answer to that specific conundrum, as the moral equivalent of both the compiler and the runtime virtual machine are almost always open source. What is then important is much easier: if you don't provide the source code to the project, even if the compiler is open source and even if it runs on an open source machine, clearly the project -- whatever it is that we might try to be discussing, including video files -- cannot be open source. The idea that a video can be open source when what you mean is the video is unencrypted and redistributable but was merely intended to be played in an open source video player is absurd.

dns_snek · on July 24, 2024

> Is a video open source if you're given the permissions to edit it? To distribute it? Given the files to generate it?

If you're given the source material and project files to continue editing where the original editors finished, and you're granted the rights to re-distribute - Yes, that would be open source[1].

Much like we have "open source hardware" where the "source" consists of original schematics, PCB layouts, BOM, etc. [2]

[1] https://en.wikipedia.org/wiki/Open-source_film

[2] https://en.wikipedia.org/wiki/Open-source_hardware

the8thbit · on July 24, 2024

Videos and images are software. They are compiled binaries with very domain specific instructions executed in a very non-turing complete context. They are generally not released as open source, and in many cases the source code (the file used to edit the video or image) is lost. They are not seen, colloquially, as software, but that does not mean that they are not software.

If a video lacks a specification file (the source code) which can be used by a human reader to modify specific features in the video, then it is software that is simply incapable of being open sourced.

the8thbit · on July 24, 2024

"LLMs are fundamentally different to software and using terms from software just muddies the waters."

They're still software, they just don't have source code (yet).

blackeyeblitzar · on July 23, 2024

There is a comment elsewhere claiming there are a few dozen fully open source models: https://news.ycombinator.com/item?id=41048796

_heimdall · on July 23, 2024

Why is the dataset required for it to be open source?

If I self host a project that is open sourced rather than paying for a hosted version, like Sentry.io for example, I don't expect data to come along with the code. Licensing rights are always up for debate in open source, but I wouldn't expect more than the code to be available and reviewable for anything needed to build and run the project.

In the case of an LLM I would expect that to mean the code run to train the model, the code for the model data structure itself, and the control code for querying the model should all be available. I'm not actually sure if Meta does share all that, but training data is separate from open source IMO.

swatcoder · on July 23, 2024

The open source movement, from which the name derives, was about the freedom to make bespoke alterations to the software you choose to run. Provided you have reasonably widespread proficiency in industry standard tools, you can take something that's open source, modify that source, and rebuild/redeploy/reinterpret/re-whatever to make it behave the way that you want or need it to behave.

This is in contrast to a compiled binary or obfuscated source image, where alteration may be possible with extraordinary skill and effort but is not expected and possibly even specifically discouraged.

In this sense, weights are entirely like those compiled binaries or obfuscated sources rather than the source code usually associated with "open source".

To be "open source" we would want LLMs where one might be able to manipulate the original training data or training algorithm to produce a set of weights more suited to one's own desires and needs.

Facebook isn't giving us that yet, and very probably can't. They're just trading on the weird boundary state of the term "open source" -- it still carries prestige and garners good will from its original techno-populist ideals, but is so diluted by twenty years of naive consumers who just take it to mean "I don't have to pay to use this" that the prestige and good will is now misplaced.

llm_trw · on July 24, 2024

>The open source movement, from which the name derives, was about the freedom to make bespoke alterations to the software you choose to run.

The open source movement was a cash grab to make the free software movement more palatable to big corp by moving away from copyleft licenses. The MIT license is perfectly open source and means that you can buy software without ever seeing its code.

Tepix · on July 24, 2024

If you obtain open source licensed software you can pass it on legally (and freely). With some licenses you also have to provide the source code.

solarmist · on July 23, 2024

The sticking point is you can’t build the model. To be able to build the model from scratch you need methodology and a complete description of the data set.

They only give you a blob of data you can run.

_heimdall · on July 23, 2024

Got it, that makes sense. I still wouldn't expect them to have to publicly share the data itself, but if you can't take the code they share and run it against your own data to build a model that wouldn't be open source in my understanding of it.

TeMPOraL · on July 24, 2024

Data is the source code here, though. Training code is effectively a build script. Data that goes into training a model does not function like assets in videogames; you can't swap out the training dataset after release and get substantially the same thing. If anything, you can imagine the weights themselves are the asset - and even if the vendor is granting most users a license to copy and modify it (unlike with videogames), the asset itself isn't open source.

So, the only bit that's actually open-sourced in these models is the inference code. But that's a trivial part that people can procure equivalents of elsewhere or reproduce from published papers. In this sense, even if you think calling the models "open source" is correct, it doesn't really mean much, because the only parts that matter are not open sourced.

derefr · on July 23, 2024

Compare/contrast:

DOOM-the-engine is open source (https://github.com/id-Software/DOOM), even though DOOM-the-asset-and-scenario-data is not. While you need a copy of DOOM-the-asset-and-scenario-data to "use DOOM to run DOOM", you are free to build other games using DOOM-the-engine.

echoangle · on July 23, 2024

I think no one would claim that “Doom” is open source though, if that’s the situation.

camgunz · on July 24, 2024

That's what op is saying, the engine is GPLv2, but the assets are copyrighted. There's Freedoom though and it's pretty good [0].

[0]: https://freedoom.github.io/

saurik · on July 24, 2024

The thing they are pointing at and which is the thing people want is the output of the training engine, not the inputs. This is like someone saying they have an open source kernel, but they only release a compiler and a binary... the kernel code is never released, but the kernel is the only reason anyone even wants the compiler. (For avoidance of anyone being somehow confused: the training code is a compiler which takes training data and outputs model weights.)
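A toy version of the "training code is a compiler" framing, assuming a one-parameter model fitted by least squares. The `train` function and the data points are invented for illustration and have nothing to do with Meta's actual pipeline; the point is only that the same code produces different weights when the data changes.

```python
def train(data):
    """Toy 'build step': fit y = w * x by least squares and return the weight."""
    num = sum(x * y for x, y in data)
    den = sum(x * x for x, y in data)
    return num / den


weights_a = train([(1.0, 2.0), (2.0, 4.0)])  # data that follows y = 2x
weights_b = train([(1.0, 3.0), (2.0, 6.0)])  # same code, different data
```

Releasing `train` alone tells you how weights are built, and releasing `weights_a` alone gives you the result, but reproducing `weights_a` needs both the function and the first dataset.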

_heimdall · on July 24, 2024

The output of the training engine, I.E. the model itself, isn't source code at all though. The best approximation would be considering it obfuscated code, and even then it's a stretch since it is more similar to compressed data.

It sounds like Meta doesn't share source for the training logic. That would be necessary for it to really be open source, you need to be able to recreate and modify the codebase but that has nothing to do with the training data or the trained model.

saurik · on July 24, 2024

I didn't claim the output is source code, any more than the kernel is. Are you sure you don't simply agree with me?

achrono · on July 23, 2024

> not actually sure if Meta does share all that

Meta shares the code for inference but not for training, so even if we say it can be open-source without the training data, Meta's models are not open-source.

I can appreciate Zuck's enthusiasm for open-source but not his willingness to mislead the larger public about how open they actually are.

gowld · on July 23, 2024

https://opensource.org/osd

"The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed."

> In the case of an LLM I would expect that to mean the code run to train the model, the code for the model data structure itself, and the control code for querying the model should all be available

The M in LLM is for "Model".

The code you describe is for an LLM harness, not for an LLM. The code for the LLM is whatever is needed to enable a developer to modify the inputs and then build a modified output LLM (minus standard generally available tools not custom-created for that product).

Training data is one way to provide this. Another way is some sort of semantic model editor for an interpretable model.

_heimdall · on July 23, 2024

I still don't quite follow. If Meta were to provide all code required to train a model (it sounds like they don't), and they provided the code needed to query the model you train to get answers how is that not open source?

> Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed.

This definition actually makes it impossible for any LLM to be considered open source until the interpretability problem is solved. The trained model is functionally obfuscated code, it can't be read or interpreted by a human.

We may be saying the same thing here, I'm not quite sure if you're saying the model must be available or if what is missing is the code to train your own model.

the8thbit · on July 24, 2024

I'm not the person you replied directly to so I can't speak for them, but I did start this thread, and I just wanted to clarify what I meant in my OP, because I see a lot of people misinterpreting what I meant.

I did not mean that LLM training data needs to be released for the model to be open source. It would be a good thing if creators of models did release their training data, and I wouldn't even be opposed to regulation which encourages or even requires that training data be released when models meet certain specifications. I don't even think the bar needs to be high there- We could require or encourage smaller creators to release their training data too and the result would be a net positive when it comes to public understanding of ML models, control over outputs, safety, and probably even capabilities.

Sure, it's possible that training data is being used illegally, but I don't think the solution to that is to just have everyone hide that and treat it as an open secret. We should either change the law, or apply it equally.

But that being said, I don't think it has anything to do with whether the model is "open source". Training data simply isn't source code.

I also don't mean that the license that these models are released under is too restrictive to be open source. Though that is also true, and if these models had source code, that would also prevent them from being open source. (Rather, they would be "source available" models)

What I mean is "The trained model is functionally obfuscated code, it can't be read or interpreted by a human." As you point out, it is definitionally impossible for any contemporary LLM to be considered open source. (Except for maybe some very, very small research models?) There's no source code (yet) so there is no source to open.

I think it is okay to acknowledge when something is technically infeasible, and then proceed to not claim to have done that technically infeasible thing. I don't think the best response to that situation is to, instead, use that as justification for muddying the language to such a degree that it's no longer useful. And I don't think the distinction is trivial or purely semantic. Using the language of open source in this way is dangerous for two reasons.

The first is that it could conceivably make it more challenging for copyleft licenses such as the GPL to protect the works licensed with them. If the "public" no longer treats software with public binaries and without public source code as closed source, then who's to say you can't fork the linux kernel, release the binary, and keep the code behind closed doors? Wouldn't that also be open source?

The second is that I think convincing a significant portion of the open source community that releasing a model's weights is sufficient to open source a model will cause the community to put more focus on distributing and tuning weights, and less time actually figuring out how to construct source code for these models. I suspect that solving interpretability and generating something resembling source code may be necessary to get these models to actually do what we want them to do. As ML models become increasingly integrated into our lives and production processes, and become increasingly sophisticated, the danger created by having models optimized towards something other than what we would actually like them optimized towards increases.

stavros · on July 23, 2024

Data is to models what code is to software.

_heimdall · on July 23, 2024

I don't quite agree there. Based on other comments it sounds like Meta doesn't open source the code used to train the model, that would make it not open source in my book.

The trained model doesn't need to be open source though, and frankly I'm not sure what the value there is specifically with regards to OSS. I'm not aware of a solution to the interpretability problem; even if the model is shared we can't understand what's in it.

Microsoft ships obfuscated code with Windows builds, but that doesn't make it open source.

Xelynega · on July 23, 2024

Wouldn't the "source code" of the model be closer to the source code of a compiler or the runtime library?

IMO a pre-trained model given with the source code used to train/run it is analogous to a company shipping a compiler and a compiled binary without any of the source, which is why I don't think it's "open source" without the training data.

_heimdall · on July 23, 2024

You really should be able to train a model on whatever data you choose to use though.

Training data isn't source code at all, it's content fed into the ingestion side to train a model. As long as source for ingesting and training a model is available, which it sounds like isn't the case for Meta, that would be open source as best I understand it.

Said a little differently, I would need to be able to review all code used to generate a model and all code used to query the model for it to be OSS. I don't need Meta's training data or their actual model at all, I can train my own with code that I can fully audit and modify if I choose to.

croemer · on July 24, 2024

But surely you wouldn't call it open source if sentry just gave you a binary - and the source code wasn't available.

Aeolun · on July 24, 2024

I suspect that even if you allowed people to take the data, nobody but a FAANG like organisation could even store it?

jlokier · on July 24, 2024

My impression is the training data for foundation models isn't that large. It won't fit on your laptop drive, but it will fit comfortably in a few racks of high-density SSDs.

jijji · on July 24, 2024

yeah, according to the article [0] about the release of Llama 3.1 405B, it was trained on 15 trillion tokens using 16000 Nvidia H100s to do it. Even if they did release the training data, I don't think many people would have the number of GPUs required to actually do any real training to create the model....

[0] https://ai.meta.com/blog/meta-llama-3-1/

yencabulator · on July 26, 2024

And a token is the sequence number of a sequence of input in a restricted dictionary. GPT-2 was said to have 50k distinct tokens, so I think it's safe to assume even the latest ones are well below 4M tokens, so max 4 bytes per token. 15 trillion tokens -> 4 bytes/token * 15 T tokens -> training input <= 60 TB, which doesn't sound that large.

It's the computation that is costly.
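The back-of-the-envelope estimate above can be spelled out as a few lines of Python. The 4 bytes per token is the assumed upper bound from the comment (one 32-bit integer id per token), not a figure published by Meta, and the result covers the tokenized input only.

```python
TOKENS = 15 * 10**12   # 15 trillion tokens, the Llama 3.1 figure cited in the thread
BYTES_PER_TOKEN = 4    # assumed upper bound: one 32-bit integer id per token

total_bytes = TOKENS * BYTES_PER_TOKEN
total_tb = total_bytes / 10**12  # decimal terabytes

# Number of distinct ids a 4-byte unsigned integer can represent.
max_vocab = 2 ** (8 * BYTES_PER_TOKEN)

print(f"{total_tb:.0f} TB upper bound, ids available: {max_vocab}")
```

Four bytes allow roughly 4.3 billion distinct ids, far more than a vocabulary in the tens or hundreds of thousands needs, so 60 TB is a generous ceiling on the tokenized size.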

aerzen · on July 23, 2024

LLAMA is an open-weights model. I like this term, let's use that instead of open source.

gowld · on July 23, 2024

Can a human programmer edit the weights according to some semantics?

sebastiennight · on July 23, 2024

It is possible to merge two fine-tunes of models from the same family by... wait for it... averaging or combining their weights[0].

I am still amazed that we can do that.

[0]: https://arxiv.org/abs/2212.09849
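For a feel of what "averaging their weights" means mechanically, here is a minimal sketch in plain Python. It assumes both fine-tunes share identical parameter names and shapes; the `average_weights` helper and the tiny parameter dictionaries are invented for illustration, and the linked paper may use a more elaborate procedure than this plain interpolation.

```python
def average_weights(model_a, model_b, alpha=0.5):
    """Linearly interpolate two models with the same parameter names and shapes."""
    if model_a.keys() != model_b.keys():
        raise ValueError("models must have identical parameter names")
    merged = {}
    for name in model_a:
        a, b = model_a[name], model_b[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [alpha * x + (1 - alpha) * y for x, y in zip(a, b)]
    return merged


# Hypothetical flattened parameters from two fine-tunes of the same base model.
fine_tune_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
fine_tune_b = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}
merged = average_weights(fine_tune_a, fine_tune_b)
```

With `alpha=0.5` this is a straight average; other values of `alpha` weight one fine-tune more heavily than the other.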

the8thbit · on July 25, 2024

This is absolutely wild.

root_axis · on July 23, 2024

Yes. Using fine tuning.

sitkack · on July 24, 2024

Yes, there is the concept of a "frankenmerge" and folks have also bolted on vision and audio models to LLMs.

ab5tract · on July 23, 2024

If you can’t share the dataset, under what twisted reality are you fine to share the derivative models based on those unsharable datasets?

In a better world, there would be no “I ran some algos on it and now it’s mine” defense.

guitarlimeo · on July 24, 2024

Yeah was gonna say exactly the same thing. Weird how the legislation allows releasing LLMs trained on data that is not allowed to be shared otherwise.

floydnoel · on July 24, 2024

Meta might possibly have a license to use (some of) that data, but not a license to distribute it. Legislation has little to do with it, I imagine.

yangcheng · on July 23, 2024

latest llama 3.1 is in a different repo, https://github.com/meta-llama/llama-models/blob/main/models/... , but yes, the code is shared. It's astonishing that in the software 2.0 era, powerful applications like llama have only hundreds of lines of code, with most of the work hidden in training data. Source code alone is no longer as informative as it was in Software 1.0

danielrhodes · on July 24, 2024

For models of this size, the code used to train them is going to be very custom to the architecture/cluster they are built on. It would be almost useless to anybody outside of Meta. The dataset would be a lot more interesting, as it would at the very least show everybody how they got it to behave in certain ways.

twelvechairs · on July 24, 2024

Open training data would be great too.

If you have open data and open source code you can reproduce the weights

blharr · on July 24, 2024

Not easily for these large scale models, but theoretically maybe

ajxlasA · on July 24, 2024

Really? I have to check out the training code again. Last time I looked the training and inference code were just example toys that were barely usable.

Has that changed?

input_sh · on July 23, 2024

Open Source Initiative (kind of a de-facto authority on what's open source and what not) is spending a whole lot of time figuring out what it means for an AI system to be open source. In other words, they're basically trying to come up with a new license because the existing ones can't easily apply.

I believe this is the current draft: https://opensource.org/deepdive/drafts/the-open-source-ai-de...

downWidOutaFite · on July 23, 2024

OSI made themselves the authority because they hated Richard Stallman and his Free Software movement. It's just marketing.

gowld · on July 23, 2024

RMS has no interest in governing Open Source, so your comment bears no particular relevance.

RMS is an advocate for Free Software. Free Software generally implies Open Source, but not the converse.

RMS considers openness of source to be a separate category from the freeness of software. "Free software is a political movement; open source is a development model."

https://www.gnu.org/licenses/license-list.en.html

ab5tract · on July 23, 2024

Are you really pretending that OSI and the open source label itself wasn’t a reactionary movement that vilified free software principles in hopes of gaining corporate traction?

Most of us who were there remember it differently. True open source advocates will find little to refute in what I’ve said.

cheema33 · on July 23, 2024

> True open source advocates will find little to refute in what I’ve said.

No true Scotsman https://en.wikipedia.org/wiki/No_true_Scotsman

OSI helped popularize the open source movement. They not only made it palatable to businesses, but got them excited about it. I think that FSF/Stallman alone would not have been very successful on this front with GPL/AGPL.

ab5tract · on July 23, 2024

Like I said, honest open source advocates won’t take issue to how I framed their position.

Here’s a more important point: how far would the open source people have gotten without GCC and glibc?

Much less far than they will ever admit, in my experience.

miffy900 · on July 24, 2024

> Most of us who were there remember it differently. True open source advocates will find little to refute in what I’ve said.

> Like I said, honest open source advocates won’t take issue to how I framed their position.

Yet you've failed to provide even a single point of evidence to back up your claim.

> "honest open source advocates"

You've literally just made this term up. It's meaningless.

ab5tract · on July 26, 2024

It’s not a term, it’s a phrase. It means “open source advocates who are being honest about their advocacy”, in case you really need such a degree of clarification.

I’ve met honest open source advocates before and, once again, they would be unlikely to refute the fact that “open source” was invented in explicit contrast to “free software” to achieve corporate palatability.

The comment you are responding to was literally responding to a comment which validated this exact sentiment.

As to providing evidence, those of us who were there at the time don’t need any and those of you who weren’t ought to seek some. It’s not my job to link to the nearly infinite number of conversations where this obvious dynamic played out.

halostatue · on July 24, 2024

For some advocates, sure. I was there, too — although at the beginning of my career and not deeply involved in most licensing discussions until the founding of Mozilla (where I argued against the GNU GPL and was generally pleased with the result of the MPL). However, from ~1990, I remember sharing some code where I "more or less" made my code public domain but recommended people consider the GNU GPL as part of the README (I don't have the source code available, so I don't recall).

Your characterization is quite easily refutable, because at the time that OSI was founded, there was already an explosion of possible licenses and RMS and other GNUnatics were making lots of noise about GNU/Linux and trying to be as maximalist as possible while presenting any choice other than the GNU GPL as "against freedom".

This certainly would not have held well with people who were using the MIT Licence or BSD licences (created around the same time as the GNU GPL v1), who believed (and continue to believe) that there were options other than a restrictive viral licence‡. Yes, some of the people involved vilified the "free software principles", but there were also GNU "advocates" who were making RMS look tame with their wording (I recall someone telling me to enjoy "software slavery" because I preferred licences other than the GNU GPL).

The "Free Software" advocates were pretending that the goals of their licence were the only goals that should matter for all authors and consumers of software. That is not and never has been the case, so it is unsurprising that there was a bit of reaction to such extremism.

OSI and the open source label were a move to make things easier for corporations to accept and understand by providing (a) a clear unifying definition, and (b) a set of licences and guidelines for knowing what licenses did what and the risks and obligations they presented to people who used software under those licences.

‡ Don't @ me on this, because both the virality and restrictiveness are features of the GNU GPL. If it weren't for the nonsense in the preamble, it would be a good licence. As it is, it is an effective if rampantly misrepresented licence.

dogleash · on July 24, 2024

Didn't the Open Source Definition start as the DFSG? You telling me Debian hates the Free Software movement? Unless you define "hating Free Software" as "not banning the BSD license", then I'll have to disagree.

halflings · on July 23, 2024

Training code is only useful to people in academia, and the closest thing to "code you can modify" are open weights.

People are framing this as if it was an open-source hierarchy, with "actual" open-source requiring all training code to be shared. This is not obvious to me, as I'm not asking people that share open-source libraries to also share the tools they used to develop them. I'm also not asking them to share all the design documents/architecture discussion behind this software. It's sufficient that I can take the end result and reshape it in any way I desire.

This is coming from an LLM practitioner that finetunes models for a living; and this constant debate about open-source vs open-weights seems like a huge distraction vs the impact open-sourcing something like Llama has... this is truly a Linux-like moment. (at a much smaller scale of course, for now at least)

kemiller · on July 23, 2024

I dunno — if an open source project required, say, a proprietary compiler, that would diminish its open source-ness. But I agree it's not totally comparable, since the weights are not particularly analogous to machine code. We probably need a new term. Open Weights.

0-_-0 · on July 24, 2024

There are many "compilers", you can download The Pile yourself.

Zambyte · on July 23, 2024

> If so, then how can current ML models be open source?

The source of a language model is the text it was trained on. Llama models are not open source (contrary to their claims), they are open weight.

moffkalast · on July 23, 2024

You can find the entire Llama 3.0 pretraining set here: https://huggingface.co/datasets/HuggingFaceFW/fineweb

15T tokens, 45 terabytes. Seems fairly open source to me.

Zambyte · on July 23, 2024

Where has Facebook linked that? I can't find anywhere that they actually published that.

nickpsecurity · on July 24, 2024

Many companies stopped publishing their data sets after people published evidence that they amounted to mass copyright infringement. They dropped the specifics of pretraining data from the model cards.

Aside from licensing content, the fact that content creators don’t like redistribution means a lawful model would probably only use Gutenberg’s collection and permissively licensed code. Anything else, including Wikipedia, usually has licensing requirements they might violate.

moffkalast · on July 23, 2024

Yeah I don't think I've seen it linked officially, but Meta does this sort of semi-official stuff all the time, leaking models ahead of time for PR, they even have a dedicated Reddit account for releasing unofficial info.

Regardless, it fits the compute used and the claim that they trained from public web data, and was suspiciously published by HF staff shortly after L3 released. It's about as official as the Mistral 7B v0.2 base model. I.e. mostly, but not entirely, probably for some weird legal reasons.

verdverm · on July 24, 2024

Says it is ~94TB, with >130k downloads, implying more than 12 exabytes of copying, seems a bit off, wonder how they are calculating downloads
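For what it's worth, the arithmetic behind that figure can be checked in a few lines. This is only a rough sketch: it assumes decimal units and that every counted download copies the full dataset, neither of which the dataset page confirms.

```python
# Sanity check of the "more than 12 exabytes" estimate.
# Assumptions (not confirmed by the source): decimal units
# (1 TB = 10**12 bytes, 1 EB = 10**18 bytes) and one full copy per download.
DATASET_TB = 94
DOWNLOADS = 130_000

total_bytes = DATASET_TB * 10**12 * DOWNLOADS
total_eb = total_bytes / 10**18

print(f"{total_eb:.2f} EB")  # prints "12.22 EB"
```

If the download counter also counts partial fetches or single files, the real volume would be far lower, which may be why the number looks off.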

root_axis · on July 23, 2024

No. The text is an asset used by the source to train the model. The source can process arbitrary text. Text is just text, it was written for communication purposes, software (defined by source code) processes that text in a particular way to train a model.

Zambyte · on July 24, 2024

In programming, "source" and "asset" have specific meanings that conflict with how you used them.

Source is the input to some built artifact. It is the source of that artifact. As in: where the artifact comes from. Textual input is absolutely the source of the ML model. What you are using "source" as is analogous to the source of the compiler in traditional programming.

Asset is an artifact used as input, that is revered verbatim by the output. For example, a logo baked into an application to be rendered in the UI. The compilation of the program doesn't make a new logo, it just moves the asset into the built artifact.

Zambyte · on July 24, 2024

I hadn't had my morning coffee yet when I wrote this and I have no idea what I meant instead of "revered", but you get the idea :D

thayne · on July 23, 2024

I think it would also include the code used to train it

pphysch · on July 23, 2024

That would be more analogous to the build toolchain than the source code, but yes

tshaddox · on July 23, 2024

Surely traditional “open source” also needs some notion of a reproducible build toolchain, otherwise the source code itself is approximately useless.

Imagine if the source code was in a programming language of which the basic syntax and semantics were known to no one but the original developers.

Or more realistically, I think it’s a major problem if an open source project can only be built by an esoteric process that only the original developers have access to.

pphysch · on July 24, 2024

Source code in a vacuum is still valuable as a way to deal with missing/inaccurate documentation and diagnose faults and their causes.

Raw training datasets similarly have some value, as you can analyze them for different characteristics to understand why the trained model is under/over-representing different concepts.

But yes real FOSS should be "open-build" and allow anyone to build a test-passing artifact from raw source material.

GuB-42 · on July 23, 2024

I like the term "open weights". Open source would be the dataset and code that generates these weights.

There is still a lot you can do with weights, like fine tuning, and it is arguably more useful as retraining the entire model would cost millions in compute.

shdjkKA · on July 24, 2024

Of course you are right, I'd put it less carefully: The quoted Linux line is deceptive marketing.

- If we start with the closed training set, that is closed and stolen, so call it Stolen Source.

- What is distributed is a bunch of float arrays. The Llama architecture is published, but not the training or inference code. Without code there is no open source. You can as well call a compiler book open source, because it tells you how to build a compiler.

Pure marketing, but predictably many people follow their corporate overlords and eagerly adopt the co-opted terms.

Reminder again that FB is not releasing this out of altruism, but because they have an existing profitable business model that does not depend on generated chats. They probably do use it internally for tracking and building profiles, but that is the same as using Linux internally, so they release the weights to destroy the competition.

Isn't price dumping an anti trust issue?

mkolodny · on July 23, 2024

Llama’s code is open source: https://github.com/meta-llama/llama3/blob/main/llama/model.p...

Flimm · on July 23, 2024

No, it's not. The Llama 3 Community License Agreement is not an open source license. Open source licenses need to meet the criteria of the only widely accepted definition of "open source", and that's the one formulated by the OSI [0]. This license has multiple restrictions on use and distribution which make it not open source. I know Facebook keeps calling this stuff open source, maybe in order to get all the good will that open source branding gets you, but that doesn't make it true. It's like a company calling their candy vegan while listing one of its ingredients as pork-based gelatin. No matter how many times the company advertises that their product is vegan, it's not, because it doesn't meet the definition of vegan.

[0] - https://opensource.org/osd

8note · on July 23, 2024

Isn't the MIT license the generally accepted "open source" license? It's a community owned term, not OSI owned

yjftsjthsd-h · on July 23, 2024

MIT is a permissive open source license, not the open source license.

henryfjordan · on July 23, 2024

There are more licenses than just MIT that are "open source". GPL, BSD, MIT, Apache, some of the Creative Commons licenses, etc. MIT has become the de facto default, though.

https://opensource.org/license (linking to OSI for the list because it's convenient, not because they get to decide)

NiloCK · on July 24, 2024

These discussions (ie, everything that follows here) would be much easier if the crowd insisting on the OSI definition of open source would capitalize Open Source.

In English, proper nouns are capitalized.

"Open" and "source" are both very normal English words. English speakers have "the right" to use them according to their own perspective and with personal context. It's the difference between referring to a blue tooth, and Bluetooth, or to an apple store or an Apple store.

CamperBob2 · on July 23, 2024

> Open source licenses need to meet the criteria of the only widely accepted definition of "open source", and that's the one formulated by the OSI [0]

Who died and made OSI God?

MaxBarraclough · on July 23, 2024

This isn't helpful. The community defers to the OSI's definition because it captures what they care about.

We've seen people try to deceptively describe non-OSS projects as open source, and no doubt we will continue to see it. Thankfully the community (including Hacker News) is quick to call it out, and to insist on not cheapening the term.

This is one of the topics that just keeps turning up:

* https://news.ycombinator.com/item?id=24483168

* https://news.ycombinator.com/item?id=31203209

* https://news.ycombinator.com/item?id=36591820

CamperBob2 · on July 23, 2024

> This isn't helpful. The community...

Speak for yourself, please. The term is much older than 1998, with one easily-Googled example being https://www.cia.gov/readingroom/docs/DOC_0000639879.pdf , and an explicit case of IT-related usage being https://i.imgur.com/Nw4is6s.png from https://www.google.com/books/edition/InfoWarCon/09X3Ove9uKgC... .

Unless a registered trademark is involved (spoiler: it's not) no one, whether part of a so-called "community" or not, has any authority to gatekeep or dictate the terms under which a generic phrase like "open source" can be used.

Flimm · on July 24, 2024

Neither of those usages relates to IT; they are both about sources of intelligence (espionage). Even if they did, the OSI definition won: nobody is using the definitions from the 1995 CIA document or the 1996 InfoWarCon book in the realm of IT, not even Facebook.

The community has the authority to complain about companies mis-labelling their pork products as vegan, even if nobody has a registered trademark on the term vegan. Would you tell people to shut up about that case because they don't have a registered trademark? Likewise, the community has authority to complain about Meta/Facebook mis-labelling code as open source even when they put restrictions on usage. It's not gate-keeping or dictatorship to complain about being misled or being lied to.

CamperBob2 · on July 24, 2024

> Would you tell people to shut up about that case because they don't have a registered trademark?

I especially like how I'm the one telling people to "shut up" all of a sudden.

As for the rest, see my other reply.

Flimm · on July 24, 2024

You're right, I and those who agree with me were the first to ask people to "shut up", in this case, to ask Meta to stop misusing the term open source. And I was the first to say "shut up", and I know that can be inflammatory and disrespectful, so I shouldn't have used it. I'm sorry. We're here in a discussion forum, I want you to express your opinion even if it is to complain about my complaints. For what it's worth, your counter-arguments have been stronger and better referenced than any other I have read (for the case of accepting a looser definition of the term open source in the realm of IT).

CamperBob2 · on July 24, 2024

All good, and I also apologize if my objection came across as disrespectful.

This whole 'Open Source' thing is a bigger pet peeve than it should be, because I've received criticism for using the term on a page where I literally just posted a .zip file full of source code. The smart thing to do would have been to ignore and forget the criticism, which I will now work harder at doing.

In the case of a pork producer who labels their products as 'vegan', that's different because there is some authority behind the usage of 'vegan'. It's a standard English-language word that according to Merriam-Webster goes back to 1944. So that would amount to an open-and-shut case of false advertising, which I don't think applies here at all.

MaxBarraclough · on July 24, 2024

> In the case of a pork producer who labels their products as 'vegan', that's different because there is some authority behind the usage of 'vegan'.

I don't see the difference. Open source software is a term of art with a specific meaning accepted by its community. When people misuse the term, invariably in such a way as to broaden it to include whatever it is they're pushing, it's right that the community responds harshly.

CamperBob2 · on July 24, 2024

Terms of art do not require licenses. A given term is either an ordinary dictionary word that everyone including the courts will readily recognize ("Vegan"), a trademark ("Microsoft® Office 365™"), or a fragment of language that everyone can feel free to use for their own purposes without asking permission. "Open Source" falls into the latter category.

This kind of argument is literally why trademark law exists. OSI did not elect to go down that path. Maybe they should have, but I respect their decision not to, and perhaps you should, too.

MaxBarraclough · on July 25, 2024

> Terms of art do not require licenses.

Agreed. There is no trademark on aileron or carburetor or context-free grammar. A couple of years ago I made this same point myself. [0]

> A given term is either an ordinary dictionary word that everyone including the courts will readily recognize ("Vegan"), a trademark ("Microsoft® Office 365™"), or a fragment of language that everyone can feel free to use for their own purposes without asking permission. "Open Source" falls into the latter category.

This taxonomy doesn't hold up.

Again, it's a term of art with a clear meaning accepted by its community. We've seen numerous instances of cynical and deceptive misuse of the term, which the community rightly calls out because it's not fair play, it's deliberate deception.

> This kind of argument is literally why trademark law exists

It is not. Trademark law exists to protect brands, not to clarify terminology.

You seem to be contradicting your earlier point that terms of art do not require licenses.

> OSI did not elect to go down that path. Maybe they should have, but I respect their decision not to, and perhaps you should, too.

I haven't expressed any opinion on that topic, and I don't see a need to.

[0] https://news.ycombinator.com/item?id=31203209

CamperBob2 · on July 25, 2024

If the OSI members wanted to "clarify the terminology" in a way that permitted them (and you) to exclude others, trademark law would have absolutely been the correct way to do that. It's too late, however. The ship has sailed.

Come up with a new term and trademark that, and heck, I'll help you out with a legal fund donation when Facebook and friends inevitably try to appropriate it. Apart from that, you've fought the good fight and done what you could. Let it go.

vbarrielle · on July 23, 2024

The OSI was created about 20 years ago and defined and popularized the term open source. Their definition has been widely accepted over that period.

Recently, companies are trying to market things as open source when in reality, they fail to adhere to the definition.

I think we should not let these companies change the meaning of the term, which means it's important to explain every time they try to seem more open than they are.

I'm afraid the battle is being lost though.

Suppafly · on July 23, 2024

>The OSI was created about 20 years ago and defined and popularized the term open source. Their definition has been widely accepted over that period.

It was defined and accepted by the community well before OSI came around though.

gowld · on July 23, 2024

Citation? Wikipedia would appreciate your contribution.

https://en.wikipedia.org/wiki/Open_source

> Linus Torvalds, Larry Wall, Brian Behlendorf, Eric Allman, Guido van Rossum, Michael Tiemann, Paul Vixie, Jamie Zawinski, and Eric Raymond [...] At that meeting, alternatives to the term "free software" were discussed. [...] Raymond argued for "open source." The assembled developers took a vote, and the winner was announced at a press conference the same evening

The original "Open Source Definition" was derived from Debian's Social Contract, which did not use the term "open source".

https://web.archive.org/web/20140328095107/http://www.debian...

CamperBob2 · on July 24, 2024

> Citation? Wikipedia would appreciate your contribution.

It's not hard to find earlier examples where the phrase is used to describe enabling and (yes) leveraging community contributions to accomplish things that otherwise wouldn't be practical; see my other post for a couple of those.

But then people will rightfully object that the term "Open Source", when used in a capacity related to journalistic or intelligence-gathering activities, doesn't have anything to do with software licensing. Even if OSI had trademarked the phrase, which they didn't, that shouldn't constrain its use in another context.

To which I'd counter that this statement is equally true when discussing AI models. We are going to have to completely rewire copyright law from the ground up to deal with this. Flame wars over what "Open Source" means or who has the right to use the phrase are going to look completely inconsequential by the time the dust settles.

Flimm · on July 24, 2024

I'll concede that "open source" may mean other things in other contexts. For example, an open source river may mean something in particular to those who study rivers. This thread was not talking about a new context, it was not even talking about the weights of a machine learning model or the licensing of training data, it was talking about the licensing of the code in a particular GitHub repository, llama3.

AI may make copyright obsolete, or it may make copyright more important than ever, but my prediction is that the IT community will lose something of great value if the term "open source" is diluted to include licenses that restrict usage, restrict distribution, and restrict modification. I can understand why people may want to choose somewhat restrictive licenses, just like I can understand why a product may contain gelatin, but I don't like it when the product is mis-labelled as vegan. There are plenty of other terms that could be used, for example, "open" by itself. I'm honestly curious if you would defend a pork product labelled as vegan, or do you just feel that the analogy doesn't apply?

mesebrec · on July 23, 2024

This is like saying any python program is open source because the python runtime is open source.

Inference code is the runtime; the code that runs the model. Not the model itself.

mkolodny · on July 23, 2024

I disagree. The file I linked to, model.py, contains the Llama 3 model itself.

You can use that model with open data to train it from scratch yourself. Or you can load Meta’s open weights and have a working LLM.

causal · on July 23, 2024

Yeah a lot of people here seem to not understand that PyTorch really does make model definitions that simple, and that has everything you need to resume back-propagation. Not to mention PyTorch itself being open-sourced by Meta.

That said, the Llama license doesn't meet strict definitions of OS, and I bet they have internal tooling for datacenter-scale training that's not represented here.

yjftsjthsd-h · on July 23, 2024

> The file I linked to, model.py, contains the Llama 3 model itself.

That makes it source available ( https://en.wikipedia.org/wiki/Source-available_software ), not open source

macrolime · on July 23, 2024

Source available means you can see the source, but not modify it. This is kinda the opposite, you can modify the model, but you don't see all the details of its creation.

yjftsjthsd-h · on July 23, 2024

> Source available means you can see the source, but not modify it.

No, it doesn't mean that. To quote the page I linked, emphasis mine,

> Source-available software is software released through a source code distribution model that includes arrangements where the source can be viewed, and in some cases modified, but without necessarily meeting the criteria to be called open-source. The licenses associated with the offerings range from allowing code to be viewed for reference to allowing code to be modified and redistributed for both commercial and non-commercial purposes.

> This is kinda the opposite, you can modify the model, but you don't see all the details of its creation.

Per https://github.com/meta-llama/llama3/blob/main/LICENSE there's also a laundry list of ways you're not allowed to use it, including restrictions on commercial use. So not Open Source.

apsec112 · on July 23, 2024

That's not the training code, just the inference code. The training code, running on thousands of high-end H100 servers, is surely much more complex. They also don't open-source the dataset, or the code they used for data scraping/filtering/etc.

the8thbit · on July 23, 2024

"just the inference code"

It's not the "inference code", it's the code that specifies the architecture of the model and loads the model. The "inference code" is mostly the model, and the model is not legible to a human reader.

Maybe someday open source models will be possible, but we will need much better interpretability tools so we can generate the source code from the model. In most software projects you write the source as a specification that is then used by the computer to implement the software, but in this case the process is reversed.

blackeyeblitzar · on July 23, 2024

That is just the inference code. Not training code or evaluation code or whatever pre/post processing they do.

patrickaljord · on July 23, 2024

Is there an LLM with actual open source training code and dataset? Besides BLOOM https://huggingface.co/bigscience/bloom

navinsylvester · on July 23, 2024

Here you go - https://github.com/apple/corenet

osanseviero · on July 23, 2024

Yes, there are a few dozen fully open source models (license, code, data, models).

blackeyeblitzar · on July 23, 2024

What are some of the other ones? I am aware mainly of OLMo (https://blog.allenai.org/olmo-open-language-model-87ccfc95f5...)

bilsbie · on July 23, 2024

Can’t you do fine tuning on those binaries? That’s a modification.

the8thbit · on July 23, 2024

You can fine tune the models, and you can modify binaries. However, there is no human readable "source" to open in either case. The act of "fine tuning" is essentially brute forcing the system to gradually alter the weights such that loss is reduced against a new training set. This limits what you can actually do with the model vs an actual open source system where you can understand how the system is working and modify specific functionality.

Additionally, models can be (and are) fine tuned via APIs, so if that is the threshold required for a system to be "open source", then that would also make the GPT4 family and other such API only models which allow finetuning open source.

whimsicalism · on July 23, 2024

I don't find this argument super convincing.

There's a pretty clear difference between the 'finetuning' offered via API by GPT4 and the ability to do whatever sort of finetuning you want and get the weights at the end that you can do with open weights models.

"Brute forcing" is not the correct language to use for describing fine-tuning. It is not as if you are trying weights randomly and seeing which ones work on your dataset - you are following a gradient.
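To make "following a gradient" concrete, here is a toy sketch. It is nothing like real LLM fine-tuning: the two-parameter linear model, the data, the learning rate, and the starting weights are all made up for illustration. The point is only that each update is computed from the slope of the loss rather than by guessing weights and checking which ones work.

```python
# Toy "fine-tuning": start from existing weights and follow the gradient of
# mean squared error on a new dataset. All values here are hypothetical.

def loss_and_grad(w, b, data):
    """Return mean squared error and its gradient with respect to w and b."""
    n = len(data)
    loss = grad_w = grad_b = 0.0
    for x, y in data:
        err = (w * x + b) - y
        loss += err * err / n
        grad_w += 2 * err * x / n
        grad_b += 2 * err / n
    return loss, grad_w, grad_b

def fine_tune(w, b, data, lr=0.05, steps=200):
    """Plain gradient descent: every step moves against the gradient."""
    for _ in range(steps):
        _, gw, gb = loss_and_grad(w, b, data)
        w -= lr * gw
        b -= lr * gb
    return w, b

# "New training set" generated from y = 3x + 1.
data = [(x, 3 * x + 1) for x in (-2.0, -1.0, 0.0, 1.0, 2.0)]

w0, b0 = 2.0, 0.0  # stand-in for pretrained weights
loss_before, _, _ = loss_and_grad(w0, b0, data)
w1, b1 = fine_tune(w0, b0, data)
loss_after, _, _ = loss_and_grad(w1, b1, data)

print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.2e}")
```

No random search happens anywhere in that loop; the direction of every update comes from the derivative of the loss.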

the8thbit · on July 23, 2024

"There's a pretty clear difference between the 'finetuning' offered via API by GPT4 and the ability to do whatever sort of finetuning you want and get the weights at the end that you can do with open weights models."

Yes, the difference is that one is provided over a remote API, and the provider of the API can restrict how you interact with it, while the other is performed directly by the user. One is a SaaS solution, the other is a compiled solution, and neither are open source.

""Brute forcing" is not the correct language to use for describing fine-tuning. It is not as if you are trying weights randomly and seeing which ones work on your dataset - you are following a gradient."

Whatever you want to call it, this doesn't sound like modifying functionality in source code. When I modify source code, I might make a change, check what that does, change the same functionality again, check the new change, etc... up to maybe a couple dozen times. What I don't do is have a very simple routine make very small modifications to all of the system's functionality, then check the result of that small change across the broad spectrum of functionality, and repeat millions of times.

Kubuxu · on July 23, 2024

The gap between fine-tuning API and weights-available is much more significant than you give it credit for.

You can take the weights and train LoRAs (which is close to fine-tuning), but you can also build custom adapters on top (classification heads). You can mix models from different fine-tunes or perform model surgery (adding additional layers, attention heads, MoE).

You can perform model decomposition and amplify some of its characteristics. You can also train multi-modal adapters for the model. Prompt tuning requires weights as well.

I would even say that having the model is more potent in the hands of individual users than having the dataset.
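As a rough illustration of why adapters like LoRA need the weights themselves, here is a minimal sketch of a low-rank update in plain Python. The 4x4 matrix, the rank, and the values are hypothetical, and real implementations also apply a scaling factor and train the adapter by backpropagation, which is left out here.

```python
# Minimal low-rank adapter sketch: the base weight W stays frozen and a small
# update B @ A is added on top. Shapes and values are hypothetical.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

# Frozen base weight, 4x4 (stand-in for one layer of a released model).
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

# Rank-1 adapter: B is 4x1 and A is 1x4, so only 8 numbers would be trained.
B = [[0.5], [0.0], [0.0], [0.0]]
A = [[0.0, 1.0, 0.0, 0.0]]

# Effective weight used at inference time; W itself is never modified.
W_adapted = add(W, matmul(B, A))

base_params = sum(len(row) for row in W)
adapter_params = sum(len(row) for row in B) + sum(len(row) for row in A)
print(base_params, adapter_params)  # prints "16 8"
```

The sum `W + B @ A` can only be formed by someone who holds `W`, which is exactly what a fine-tuning API does not hand over.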

thayne · on July 23, 2024

That still doesn't make it open source.

There is a massive difference between a compiled binary that you are allowed to do anything you want with, including modifying it, building something else on top or even pulling parts of it out and using in something else, and a SaaS offering where you can't modify the software at all. But that doesn't make the compiled binary open source.

emporas · on July 23, 2024

> When I modify source code, I might make a change, check what that does, change the same functionality again, check the new change, etc... up to maybe a couple dozen times.

You can modify individual neurons if you are so inclined. That's what Anthropic have done with the Claude family of models [1]. You cannot do that using any closed model. So "Open Weights" looks very much like "Open Source".

Techniques for introspection of weights are very primitive, but i do think new techniques will be developed, or even new architectures which will make it much easier.

[1] https://www.anthropic.com/news/mapping-mind-language-model

the8thbit · on July 23, 2024

"You can modify individual neurons if you are so inclined."

You can also modify a binary, but that doesn't mean that binaries are open source.

"That's what Anthropic have done with the Claude family of models [1]. ... Techniques for introspection of weights are very primitive, but i do think new techniques will be developed"

Yeah, I don't think what we have now is robust enough interpretability to be capable of generating something comparable to "source code", but I would like to see us get there at some point. It might sound crazy, but a few years ago the degree of interpretability we have today (thanks in no small part to Anthropic's work) would have sounded crazy.

I think getting to open-sourceable models is probably pretty important for producing models that actually do what we want them to do, and as these models become more powerful and integrated into our lives and production processes the inability to make them do what we actually want them to do may become increasingly dangerous. Muddling the meaning of open source today to market your product, then, can have troubling downstream effects as focus in the open source community may be taken away from interpretability and placed on distributing and tuning public weights.

sebastiennight · on July 23, 2024

> a few years ago the degree of interpretability we have today (thanks in no small part to Anthropic's work) would have sounded crazy

My understanding is that a few years ago, if we knew the degree of interpretability we have today (compared to capability) it would have been devastatingly disappointing.

We are climbing out of the trough of disillusionment maybe, but to say that we have reached mind-blowing heights with interpretability seems a bit of hyperbole, unless I've missed some enormous breakthrough.

the8thbit · on July 24, 2024

"My understanding is that a few years ago, if we knew the degree of interpretability we have today (compared to capability) it would have been devastatingly disappointing."

I think this is a situation where both things are true. Much more progress has been made in capabilities research than interpretability and the interpretability tools we have now (at least, in regards to specific models) would have been seen as impossible or at least infeasible a few years back.

bilsbie · on July 23, 2024

You make a good point but those are also just limitations of the technology (or at least our current understanding of it)

Maybe an analogy would help. A family spent generations breeding the perfect apple tree and they decided to “open source” it. What would open sourcing look like?

sebastiennight · on July 23, 2024

Your hypothetical apple-grower family would simply share a handbook which meticulously documented the initial species of apple used, the breeding protocol, the hybridization method, and any other factors used to breed this perfect apple.

Having the handbook and materials available would make it possible for others to reproduce the resulting apple, or to obtain similar apples with different properties by modifying the protocols.

The handbook is the source code.

On the other hand, what we have here is Monsanto saying: "we've got those Terminator-lineage apples, and we're open-sourcing them by giving you the actual apples as an end product for free. Feel free to breed them into new varieties at will as long as you're not a Big Farm company."

Not open source.

wrs · on July 23, 2024

What would enable someone to reproduce the tree from scratch, and continue developing that line of trees, using tools common to apple tree breeders? I’m not an apple tree breeder, but I suspect that’s the seeds. Maybe the genetic sequence is like source code in some analogical sense, but unless you can use that information to produce an actual seed, it doesn’t qualify in a practical sense. Trees don’t have a “compilation phase” to my knowledge, so any use of “open source” would be a stretch.

the8thbit · on July 23, 2024

"You make a good point but those are also just limitations of the technology (or at least our current understanding of it)"

Yeah, that is my point. Things that don't have source code can't be open source.

"Maybe an analogy would help. A family spent generations breeding the perfect apple tree and they decided to “open source” it. What would open sourcing look like?"

I think we need to be wary of dilemmas without solutions here. For example, let's think about another analogy: I was in a car accident last week. How can I open source my car accident?

I don't think all, or even most things, are actually "open sourcable". ML models could be open sourced, but it would require a lot of work to interpret the models and generate the source code from them.

gowld · on July 23, 2024

Be charitable and intellectually curious. What would "open" look like?

GNU says "The GNU GPL can be used for general data which is not software, as long as one can determine what the definition of “source code” refers to in the particular case. As it turns out, the DSL (see below) also requires that you determine what the “source code” is, using approximately the same definition that the GPL uses."

and offers these categories, for example:

https://www.gnu.org/licenses/license-list.en.html#NonFreeSof...

* Software Licenses

* * GPL-Compatible Free Software Licenses

* * GPL-Incompatible Free Software Licenses

* Licenses For Documentation

* * Free Documentation Licenses

* Licenses for Other Works

* * Licenses for Works of Practical Use besides Software and Documentation

* * Licenses for Fonts

* * Licenses for Works stating a Viewpoint (e.g., Opinion or Testimony)

* * Licenses for Designs for Physical Objects

the8thbit · on July 24, 2024

"Be charitable and intellectually curious. What would "open" look like?"

To really be intellectually curious we need to be open to the idea that there is not (yet) a solution to this problem. Or in the analogy you laid out, that it is simply not possible for the system to be "open source".

Note that most of the licenses listed under the "Licenses for Other Works" section say "It is incompatible with the GNU GPL. Please don't use it for software or documentation, since it is incompatible with the GNU GPL and with the GNU FDL." This is because these are not free software/open source licenses. They are licenses that the FSF endorses because they encourage openness and copyleft in non-software mediums, and play nicely with the GPL when used appropriately (i.e. not for software).

The GPL is appropriate for many works that we wouldn't conventionally view as software, but in those contexts the analogy is usually so close to the literal nature of software that it stops being an analogy. The major difference is public perception. For example, we don't generally view jpegs as software. However, jpegs, at their heart, are executable binaries with very domain specific instructions that are executed in a very much non-Turing complete context. The source code for the jpeg is the XCF or similar (if it exists) which contains a specification (code) for building the binary. The code becomes human readable once loaded into an IDE, such as GIMP, designed to display and interact with the specification. This is code that is most easily interacted with using a visual IDE, but that doesn't change the fact that it is code.

There are some scenarios where you could identify a "source code" but not a "software". For example, a cake can be open sourced by releasing the recipe. In such a context, though, there is literally source code. It's just that the code never produces a binary, and is compiled by a human and kitchen instead of a computer. There is open source hardware, where the source code is a human readable hardware specification which can be easily modified, and the hardware is compiled by a human or machine using that specification.

The scenario where someone has bred a specific plant, however, cannot be open source unless they have also deobfuscated the genome, released the genome publicly, and there is also some feasible way to convert the deobfuscated genome, or a modification of it, into a seed.

jpadkins · on July 23, 2024

> vs an actual open source system where you can understand how the system is working and modify specific functionality.

No one on the planet understands how the model weights work exactly, nor can they modify them specifically (i.e. hand modifying the weights to get the result they want). This is an impossible standard.

The source code is open (sorta, it does have some restrictions). The weights are open. The training data is closed.

the8thbit · on July 24, 2024

> No one on the planet understands how the model weights work exactly

Which is my point. These models aren't open source because there is no source code to open. Maybe one day we will have strong enough interpretability to generate source from these models, and then we could have open source models. But today it's not possible, and changing the meaning of open source such that it is possible probably isn't a great idea.

beloch · on July 23, 2024

It's no secret that implementing AI usually involves far more investment into training and teaching than actual code. You can know how a neural net or other ML model works. You can have all the code before you. It's still a huge job (and investment) to do anything practical with that. If Meta shares the code their AI runs on with you, you're not going to be able to do much with it unless you make the same investment in gathering data and teaching to train that AI. That would probably require data Meta won't share. You'd effectively need your own Facebook.

If everyone open sources their AI code, Meta can snatch the bits that help them without much fear of helping their direct competitors.

the8thbit · on July 24, 2024

I think you're misunderstanding what I'm saying. I don't think it's technically feasible for current models to be open source, because there is no source code to open. Yes, there is a harness that runs the model, but the vast, vast majority of the instructions are contained in the model weights, which are akin to a compiled binary.

If we make large strides in interpretability we may have something resembling source code, but we're certainly not there yet. I don't think the solution to that problem should be to change the definition of open source and pretend the problem has been solved.

bjornsing · on July 24, 2024

The term “source code” can mean many things. In a legal context it’s often just defined as the preferred format for modification. It can be argued that for artificial neural networks that’s the weights (along with code and preferably training data).

kashyapc · on July 24, 2024

I agree; there's a lot of muddiness in the term "open source AI". Earlier this year there was a talk[1] at FOSDEM, titled "Moving a step closer to defining Open Source AI". It is from someone at the Open Source Initiative. The video and slides are available in the link below[1]. From the abstract:

"Finding an agreement on what constitutes Open Source AI is the most important challenge facing the free software (also known as open source) movement. European regulation already started referring to "free and open source AI", large economic actors like Meta are calling their systems "open source" despite the fact that their license contain restrictions on fields-of-use (among other things) and the landscape is evolving so quickly that if we don't keep up, we'll be irrelevant."

[1] https://fosdem.org/2024/schedule/event/fosdem-2024-2805-movi... defining-open-source-ai/

rbits · on July 24, 2024

You release all the technology and the training data. Everything that was used to create the model, including instructions.

I'm not sure if Facebook has done that.

szundi · on July 23, 2024

Open source = reproducible binaries (weights) by you on your computer, IMO.

Strategy of FB is that they are good to be a user only and fine ruining competitor’s business with good enough free alternatives while collecting awards as saviors of whatever.

ric2b · on July 24, 2024

If that were the definition then any software you can install on your computer would be open source. It makes open source lose nearly all meaning.

Just say "open weights", not "open source".

j_maffe · on July 23, 2024

Not sure what you mean by "they are good to be a user only." Whatever their strategy is, this is great for the community.

orthoxerox · on July 23, 2024

Open training dataset + open steps sufficient to train exactly the same model.
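A minimal sketch of what "open steps sufficient to train exactly the same model" could look like in miniature, assuming a toy one-variable linear model: given the same dataset, the same seed, and the same procedure, two independent runs produce bit-identical weights, which can be checked with a hash. The `train` and `fingerprint` helpers and the dataset are hypothetical and invented for illustration; real LLM training also has to pin hardware, library versions, and nondeterministic kernels, which this ignores.

```python
import hashlib
import random

def train(dataset, seed, epochs=20, lr=0.05):
    # Every source of randomness (init and shuffling) is driven by one seed.
    rng = random.Random(seed)
    w = rng.uniform(-1.0, 1.0)
    b = rng.uniform(-1.0, 1.0)
    data = list(dataset)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            err = (w * x + b) - y   # prediction error for this sample
            w -= lr * err * x       # plain SGD update
            b -= lr * err
    return w, b

def fingerprint(weights):
    # Hash the exact float representation so any difference is visible.
    return hashlib.sha256(repr(weights).encode("utf-8")).hexdigest()

# The "open dataset": points on the line y = 2x + 1.
dataset = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

run_a = fingerprint(train(dataset, seed=1234))
run_b = fingerprint(train(dataset, seed=1234))
print(run_a == run_b)  # True: same data + same steps + same seed
```

Publishing the weights alone is like publishing only `run_a`'s output; the definition above asks for `dataset` and `train` as well.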

the8thbit · on July 23, 2024

This isn't what Meta releases with their models, though I would like to see more public training data. However, I still don't think that would qualify as "open source". Something isn't open source just because it's reproducible out of composable parts. If one very critical, system-defining part is a binary (or similar) without publicly available source code, then I don't think it can be said to be "open source". That would be like saying that Windows 11 is open source because Windows Calculator is open source, and it's a component of Windows.

blackeyeblitzar · on July 23, 2024

Here’s one list of what is needed to be actually open source:

https://blog.allenai.org/hello-olmo-a-truly-open-llm-43f7e73...

orthoxerox · on July 23, 2024

That's what I meant by "open steps", I guess I wasn't clear enough.

the8thbit · on July 23, 2024

Is that what you meant? I don't think releasing the sequence of steps required to produce the model satisfies "open source", which is how I interpreted you, because there is still no source code for the model.

Yizahi · on July 23, 2024

They can't release the training dataset if it was illegally scraped from all over the web without permission :) (taps head)

langcss · on July 24, 2024

Coming up with the words and concepts to describe the models is a challenge.

Does the training data require permission from the copyright holder to use? Are the weights really open source or more like compiled assembly?

jsheard · on July 23, 2024

I also think that something like Chromium is a better analogy for corporate open source models than a grassroots project like Linux is. Chromium is technically open source, but Google has absolute control over the direction of its development and realistically it's far too complex to maintain a fork without Google's resources, just like Meta has complete control over what goes into their open models, and even if they did release all the training data and code (which they don't) us mere plebs could never afford to train a fork from scratch anyway.

skybrian · on July 23, 2024

I think you’re right from the perspective of an individual developer. You and I are not about to fork Chromium any time soon. If you presume that forking is impractical then sure, the right to fork isn’t worth much.

But just because a single developer couldn’t do it doesn’t mean it couldn’t be done. It means nobody has organized a large enough effort yet.

For something like a browser, which is critical for security, you need both the organization and the trust. Despite frequent criticism, Mozilla (for example) is still considered pretty trustworthy in a way that an unknown developer can’t be.

Yizahi · on July 23, 2024

If Microsoft can't do it, then we can reasonably conclude that it can't be done for any practical purpose. Discussing infinitesimal possibilities is better left to philosophers.

skybrian · on July 23, 2024

Doesn’t Microsoft maintain its own fork of Chromium?

umbra07 · on July 23, 2024

yes - their browser is chromium-based

rmbyrro · on July 23, 2024

If you think about LLMs as a new kind of programming runtime, the matrices are the source.

stale2002 · on July 23, 2024

Ok call it Open Weights then if the dictionary definitions matter so much to you.

The actual point that matters is that these models are available for most people to use for a lot of stuff, and this is way way better than what competitors like OpenAI offer.

the8thbit · on July 23, 2024

They don't "[allow] developers to modify its code however they want", which is a critical component of "open source", and one that Meta is clearly trying to leverage in branding around its products. I would like them to start calling these "public weight models", because what they're doing now is muddying the waters so much that "open source" now just means providing an enormous binary and an open source harness to run it in, rather than serving access to the same binary via an API.

sailingparrot · on July 23, 2024

Feels a bit like you are splitting hairs for the pleasure of semantic arguments, to be honest. Yes, there is no source in ML, so if we want to be pedantic it shouldn't be called open source. But what really matters in the open source movement is that we are able to take a program built by someone and modify it to do whatever we want with it, without having to ask someone for permission or get scrutinized or have to pay someone.

The same applies here, you can take those models and modify them to do whatever you want (provided you know how to train ML models), without having to ask for permission, get scrutinized or pay someone.

I personally think using the term open source is fine, as it conveys the intent correctly, even if, yes, weights are not sources you can read with your eyes.

wrs · on July 23, 2024

Calling that “open source” renders the word “source” meaningless. By your definition, I can release a binary executable freely and call it “open source” because you can modify it to do whatever you want.

Model weights are like a binary that nobody has the source for. We need another term.

sailingparrot · on July 23, 2024

No it’s not the same as releasing a binary, feels like we can’t get out of the pedantics. I can in theory modify a binary to do whatever I want. In practice it is intractably hard to make any significant modification to a binary, and even if you could, you would then not be legally allowed to e.g. redistribute.

Here, modifying that model is not harder than doing regular ML, and I can redistribute.

Meta doesn’t have access to some magic higher level abstraction for that model that would make working with it easier that they did not release.

The sources in ML are the architecture, the training and inference code, and a paper describing the training procedure. It’s all there.

the8thbit · on July 23, 2024

"In practice it is intractably hard to make any significant modification to a binary, and even if you could, you would then not be legally allowed to e.g. redistribute."

It depends on the binary and the license the binary is released under. If the binary is released to the public domain, for example, you are free to make whatever modifications you wish. And there are plenty of licenses like this, that allow closed source software to be used as the user wishes. That doesn't make it open source.

Likewise, there are plenty of closed source projects whose binaries we can poke and prod with much higher understanding of what our changes are actually doing than we're able to get when we poke and prod LLMs. If you want to make a Pokemon Red/Blue or Minecraft mod you have a lot of tools at your disposal.

A project that only exists as a binary which the copyright holder has relinquished rights to, or has released under some similar permissive closed source license, but which people have poked around in enough to figure out how to modify certain parts of the binary with some degree of predictability, is a more apt analogy. Especially if the original author has lost the source code, as there is no source code to speak of when discussing these models.

I would not call that binary "open source", because the source would, in fact, not be open.

wrs · on July 23, 2024

Can you change the tokenizer? No, because all you have is the weights trained with the current tokenizer. Therefore, by any normal definition, you don’t have the source. You have a giant black box of numbers with no ability to reproduce it.

sailingparrot · on July 23, 2024

> Can you change the tokenizer?

Yes.

You can change it however you like, then look at the paper [1] under section 3.2. to know which hyperparameters were used during training and finetune the model to work with your new tokenizer using e.g. FineWeb [2] dataset.

You'll need to do only a fraction of the training you would have needed to do if you were to start a training from scratch for your tokenizer of choice. The weights released by Meta give you a massive head start and cost saving.

The fact that it's not trivial to do and is out of reach of most consumers is not a matter of openness. That's just how ML is today.

[1]: https://scontent-sjc3-1.xx.fbcdn.net/v/t39.2365-6/452387774_...

[2]: https://huggingface.co/datasets/HuggingFaceFW/fineweb
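To make the "change the tokenizer, then finetune" step a little more concrete, here is a minimal sketch of one possible way to carry an embedding table over to a new vocabulary before finetuning. It is an assumption for illustration, not something taken from the Llama paper: tokens that exist in both vocabularies keep their trained row, and tokens that are new start from the mean of the old rows. The function name `remap_embeddings` and the tiny vocabularies are hypothetical.

```python
def remap_embeddings(old_vocab, old_rows, new_vocab):
    """Build an embedding table for new_vocab from a trained old table."""
    dim = len(old_rows[0])
    # Mean of all old rows: a neutral starting point for unseen tokens.
    mean_row = [sum(row[i] for row in old_rows) / len(old_rows) for i in range(dim)]
    old_index = {tok: i for i, tok in enumerate(old_vocab)}

    new_rows = []
    for tok in new_vocab:
        if tok in old_index:
            # Token survived the tokenizer change: reuse its trained vector.
            new_rows.append(list(old_rows[old_index[tok]]))
        else:
            # New token: initialize from the mean and let finetuning adjust it.
            new_rows.append(list(mean_row))
    return new_rows

old_vocab = ["a", "b", "c"]
old_rows = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
new_vocab = ["b", "ab", "c"]  # "a" dropped, "ab" added

new_rows = remap_embeddings(old_vocab, old_rows, new_vocab)
print(new_rows)
```

Only the embedding rows are re-initialized here; every other released weight is kept as the starting point, which is where the head start over training from scratch would come from.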

wrs · on July 24, 2024

You can change the tokenizer and build another model, if you can come up with your own version of the rest of the source (e.g., the training set, RLHF, etc.). You can’t change the tokenizer for this model, because you don’t have all of its source.

sailingparrot · on July 26, 2024

There is nothing that requires you to train with the same training set, or to re-do RLHF. You can train on fineweb, and llama 3.1 will learn to use your new tokenizer just fine.

There is 0 doubt that you are better off finetuning that model to use your tokenizer than training from scratch. So what Meta gives you for free massively helps you build your model; that's OSS to me.

jononor · on July 23, 2024

You have to write all the code needed to do the modifications you are interested in. That is, there is no source code provided that can be used to make the modifications of interest. One also has to come up with suitable datasets, from scratch. Training setup and data is completely non trivial for a large language model. To replicate Llama would take hundreds of hours of engineering, at least.

sailingparrot · on July 23, 2024

> You have to write all the code needed to do the modifications you are interested in. That is, there is no source code provided that can be used to make the modifications of interest.

Just like open source?

> Training setup and data is completely non trivial for a large language model. To replicate Llama would take hundreds of hours of engineering, at least.

The entire point of having the pre-trained weights released is to *not* have to do this. You just need to finetune, which can be done with very little data, depending on the task, and many open source toolkits that work with those weights exist to make this trivial.

wrs · on July 23, 2024

I think maybe we’re talking past each other because it seems obvious to me and others that the weights are the output of the compilation process, whereas you seem to think they’re the input. Whether you can fine tune the weights is irrelevant to whether you got all the materials needed to make them in the first place (i.e., the source).

I can do all sorts of things by “fine tuning” Excel with formulas, but I certainly don’t have the source for Excel.

slavik81 · on July 24, 2024

> The same applies here, you can take those models and modify them to do whatever you want without having to ask for permission, get scrutinized or pay someone.

The "Additional Commercial Terms" section of the license includes restrictions that would not meet the OSI definition of open source. You must ask for permission if you have too many users.

bornfreddy · on July 23, 2024

"Public weight models" sounds about right, thanks for coming up with a good term! Hope it catches.

stale2002 · on July 23, 2024

My central point is this:

"are available for most people to use for a lot of stuff, and this is way way better than what competitors like OpenAI offer."

I presume you agree with it.

> rather than serving access

It's not the same access though.

I am sure that you are creative enough to think of many questions that you could ask llama3, that would instead get you kicked off of OpenAI.

> They don't "[allow] developers to modify its code however they want"

Actually, the fact that the model weights are available means that you can even ignore any limitations that you think are on it, and you'll probably just get away with it. You are also ignoring the fact that the limitations are minimal to most people.

That's a huge deal!

And it is dishonest to compare a situation where limitations are both minimal and almost unenforceable (except against maybe Google) to a situation where it's physically not possible to get access to the model weights to do what you want with them.

the8thbit · on July 24, 2024

> Actually, the fact that the model weights are available means that you can even ignore any limitations that you think are on it, and you'll probably just get away with it. You are also ignoring the fact that the limitations are minimal to most people.

The limitations here are technical, not legal. (Though I am aware of the legal restrictions as well, and I think it's worth noting that no other project would get by calling itself open source while imposing a restriction which prevents competitors from using the system to build their competing systems.) There isn't any source code to read and modify. Yes, you can fine tune a model just like you can modify a binary, but this isn't source code. Source code is a human readable specification that a computer can transform into executable code. This allows the human to directly modify functionality in the specification. We simply don't have that, and it will not be possible unless we make a lot of strides in interpretability research.

> Its not the same access though.

> I am sure that you are creative enough to think of many questions that you could ask llama3, that would instead get you kicked off of OpenAI.

I'm not saying that systems that are provided as SaaS don't tend to be more restrictive in terms of what they let you do through the API they expose vs what is possible if you run the same system locally. That may not always be true, but sure, as a general rule it is. I mean, it can't be less restrictive. However, that doesn't mean that being able to run code on your own machine makes the code open source. I wouldn't consider Windows open source, for example. Why? Because they haven't released the source code for Windows. Likewise, I wouldn't consider these models open source because their creators haven't released source code for them. Being technically infeasible to do doesn't mean that the definition changes such that it's no longer technically infeasible. It is simply infeasible, and if we want to change that, we need to do work in interpretability, not pretend like the problem is already solved.

stale2002 · on July 24, 2024

So then yes you agree with this:

"are available for most people to use for a lot of stuff, and this is way way better than what competitors like OpenAI offer." And that this is very significant.

gorgoiler · on July 23, 2024

One counterpoint is that major publications (e.g. the New York Times) would have you believe that AI is a mildly lossy compression algorithm capable of reconstructing the original source material.

_flux · on July 24, 2024

I believe it is able to reconstruct parts of the original source material—if the interrogator already knows the original source material to prompt the model appropriately.

actinium226 · on July 23, 2024

It's not?