
"Eventually though, open source Linux gained popularity – initially because it allowed developers to modify its code however they wanted ..."

I find the language around "open source AI" to be confusing. With "open source" there's usually "source" to open, right? As in, there is human-legible code that can be read and modified by the user? If so, then how can current ML models be open source? They're very large matrices that are, for the most part, inscrutable to the user. They seem akin to binaries, which, yes, can be modified by the user, but are extremely obscured to the user, and require enormous effort to understand and effectively modify.

"Open source" code is not just code that isn't executed remotely over an API, and it seems like maybe it's being conflated with that here?



"Open weights" is a more appropriate term but I'll point out that these weights are also largely inscrutable to the people with the code that trained it. And for licensing reasons, the datasets may not be possible to share.

There is still a lot of modifying you can do with a set of weights, and they make great foundations for new stuff, but yeah we may never see a competitive model that's 100% buildable at home.

Edit: mkolodny points out that the model code is shared (under llama license at least), which is really all you need to run training https://github.com/meta-llama/llama3/blob/main/llama/model.p...


"Open weights" means you can use the weights for free (as in beer). "Open source" means you get the training dataset and the methodology. ~Nobody does open source LLMs.


Indeed, since when is a deliverable like a jpeg/exe, which is similar to what the model file is, considered the source? It is more like an open result, or a freely available VM image that works but has its core filesystem scrambled or encrypted.

Zuck knows this very well, and it does him no honour to speak like this; coming from his position, it amounts to an attempt to change the present semantics of open source. Of course, others do that too, using the notion of open source to describe something very far from open.

What Meta is doing under his command can better be described as releasing the resulting... build, so that it can be freely poked around and even put to work. But the result cannot be effectively reverse engineered.

What's more ridiculous is that it is precisely because the result is not the source in its whole form that these graphical structures can be made available. Only thanks to the fact that it is not traceable to the source, which makes the whole game not only closed, but like... sealed forever. An unfair retelling of humanity's knowledge, tossed around in a very obscure container that nobody can reverse engineer.

How's that even remotely similar to open source?


Even if everything was released how you described, what good would that really do for an individual without access to heaps of compute? Functionally there seems to be no difference between open weights and open compute because nobody could train a facsimile model. Furthermore, all frontier models are inscrutable due to their construction. It’s wild to me seeing people complain about semantics when Meta dropped their model for cheap. Now I’m not saying we should suck the zuck for this act of charity, but you have to imagine that other frontier models are not thrilled that Meta has invalidated their compute moats with the release of Llama. Whether we like it or not, we’re on this AI rollercoaster and I’m glad that it’s not just oligopolists dictating the direction forward. I’m happy to see Meta take this direction, knowing that the alternatives are much worse.


That's not the discussion. We're talking about what open source is, and it's having the weights and the method to recreate the model.

If someone gives me an executable that I can run for free, and then says "eh why do you want the source, it would take you a long time to compile", that doesn't make it open source, it just makes it gratis.


Calling weights an executable is disingenuous and not a serious discussion. You can do a lot more with weights than you could with a binary executable.


You can do a lot more with an executable as well than just execute it. So maybe the analogy is apt, even if not exact.

Actually, you can reverse engineer an executable into something that could be compiled back into an executable with the exact same functionality, which is AFAIK impossible to do with "open weights". Still, we don't call free executables "open source".


It's not really an analogy. LLMs are quite literally executables in the same way that jpegs are executables. They both specify machine readable (but not human readable) domain specific instructions executed by the image viewer/inference harness.

And yes, like other executables, they are not literal black boxes. Rather, they provide machine readable specifications which are not human readable without immense effort.

For an LLM to be open source there would need to be source code. Source code, btw, is not just a procedure that can be handed to a machine to produce code that can be executed by the machine. That means the training data and code are not sufficient (or necessary) for an open source model.

What we need for an open source model is a human readable specification of the model's functionality and data structures which allows the user to modify specific arbitrary functionality/structure, and can be used to produce an executable (the model weights).

We simply need much stronger interpretability for that to be possible.


This is debatable; even an executable is a valuable artifact. You can also do a lot with an executable in expert hands.


I'd find knowing what's in the training data hugely valuable - can analyse it to understand and predict capabilities.


Linux is open source and is mostly C code. You cannot run C code directly, you have to compile it and produce binaries. But it's the C code, not binary form, where the collaboration happens.

With LLMs, weights are the binary code: it's how you run the model. But to be able to train the model from scratch, or to collaborate on new approaches, you have to operate at the level of architecture, methods, and training data sets. They are the source code.


Analogies are always going to fall short. With LLM weights, you can modify them (quant, fine-tuning) to get something different, which is not something you do with compiled binaries. There are ample areas for collaboration even without being able to reproduce from scratch, which takes $X million, also something that a typical binary does not have as a feature.


You can absolutely modify compiled binaries to get something different. That's how lots of video game modding and ROM hacks work.


And we would absolutely do it more often if compiling would cost as much as training of an LLM costs now.


I considered adding "normally" to the binary modifications expecting a response like this. The concepts are still worlds apart.

Weights aren't really a binary in the sense of what a compiler produces; they lack instructions and are more just a bunch of floating point values. Nor can you run model weights without separate code to interpret them correctly. In this sense, they are more like a JPEG or 3D model.
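
To make that concrete, here is a minimal sketch of weights as plain floating point numbers that do nothing until separate harness code interprets them. Everything in it is an assumption for illustration: the values are made up, and `run_linear` and the `w`/`b` names are hypothetical, not taken from any real model or library.

```python
# Toy "open weights": nothing but named floating point values.
# These numbers are made up for illustration.
weights = {"w": [0.5, -1.0], "b": 0.25}


def run_linear(weights, inputs):
    """Hypothetical inference harness for a one-neuron linear model.

    The weights carry no instructions of their own; this function decides
    how the numbers are combined with the inputs.
    """
    if len(inputs) != len(weights["w"]):
        raise ValueError("input length must match the number of weights")
    return sum(w * x for w, x in zip(weights["w"], inputs)) + weights["b"]


output = run_linear(weights, [2.0, 1.0])
print(output)
```

Swap in a different `run_linear` and the same numbers mean something else, which is roughly why the file alone is closer to data than to a program.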


JPEGs and 3D models are also executable binaries. They, like model weights, contain domain specific instructions that execute in a domain specific and Turing-incomplete environment. The model weights are the instructions, and those instructions are interpreted by the inference harness to produce outputs.


>Nobody does open source LLMs.

There are a bunch of independent, fully open source foundation models from companies that share everything (including all data). AMBER and MAP-NEO for example. But we have yet to see one in the 100B+ parameter category.


Sorry, the tilde before "nobody" is my notation for "basically nobody" or "almost nobody". I thought it was more common.


It is more common when it comes to numbers, I guess. There are ~5 ancestors in this comment chain; I would agree that roughly 4-6 is acceptable.


It's the literal (figurative) nobody rather than the literal (literal) nobody.


There are plenty of open source LLMs, they just aren’t at the top of the leaderboards yet. Here’s a recent example, I think from Apple: https://huggingface.co/apple/DCLM-7B

Using open data and dclm: https://github.com/mlfoundations/dclm


If weights are not the source, then if they gave you the training data and scripts but not the weights, would that be "open source"?


Yes, but they won't do that. Possibly because of extensive copyright violation in the training data, which they're not legally allowed to share.


If somebody leaked the training data, they could deny that it’s real, ergo not get sued, and the data would be available.

Edit typo.


It's not available if you can't use it because you don't have as many lawyers as facebook and can't ignore laws so easily.


This is bending the definition to the other extreme.

Linux doesn't ship you the compiler you need to build the binaries either, that doesn't mean it's closed source.

LLMs are fundamentally different to software and using terms from software just muddies the waters.


And LLMs don't ship with a Python distribution.

Linux sources :: dataset that goes into training

Linux sources' build confs and scripts :: training code + hyperparameters

GCC :: Python + PyTorch or whatever they use in training

Compiled Linux kernel binary :: model weights


Just because you keep saying it doesn't make it true.

LLMs are not software any more than photographs are.


Then what is the "source"? If we are to use the term "source" then what does that mean here, as distinct from it merely being free?


It means nothing because LLMs aren't software.


Do they not run on a computer?


So does a video. Is a video open source if you're given the permissions to edit it? To distribute it? Given the files to generate it? What if the files can only be opened in a proprietary program?

Videos aren't software and neither are LLMs.


If a video doesn't have source code, then it can't be open source. Likewise, if you feel that an LLM doesn't have source code because of some property of what it is -- as you claim it isn't software and somehow that means that it abstractly removes it from consideration for this concept (an idea I think is ridiculous, FWIW: an LLM is clearly software that runs in a particularly interesting virtual machine defined by the model architecture) -- then, somewhat trivially, it also can't be open source. It is, as the person you are responding to says, at best "open weights".

If a video somehow does have source code which can "generate it", then the question of what it means for the source code to the video to be open even if the only program which can read it and generate the video is closed source is equivalent to asking if a program written in Visual Basic can ever be open source given that the Visual Basic compiler is closed source. Personally, I can see arguments either way on this issue, though most people seem to agree that the program is still open source in such a situation.

However, we need not care too much about the answer to that specific conundrum, as the moral equivalent of both the compiler and the runtime virtual machine are almost always open source. What is then important is much easier: if you don't provide the source code to the project, even if the compiler is open source and even if it runs on an open source machine, clearly the project -- whatever it is that we might try to be discussing, including video files -- cannot be open source. The idea that a video can be open source when what you mean is the video is unencrypted and redistributable but was merely intended to be played in an open source video player is absurd.


> Is a video open source if you're given the permissions to edit it? To distribute it? Given the files to generate it?

If you're given the source material and project files to continue editing where the original editors finished, and you're granted the rights to re-distribute - Yes, that would be open source[1].

Much like we have "open source hardware" where the "source" consists of original schematics, PCB layouts, BOM, etc. [2]

[1] https://en.wikipedia.org/wiki/Open-source_film

[2] https://en.wikipedia.org/wiki/Open-source_hardware


Videos and images are software. They are compiled binaries with very domain specific instructions executed in a very non-Turing-complete context. They are generally not released as open source, and in many cases the source code (the file used to edit the video or image) is lost. They are not seen, colloquially, as software, but that does not mean that they are not software.

If a video lacks a specification file (the source code) which can be used by a human reader to modify specific features in the video, then it is software that is simply incapable of being open sourced.


"LLMs are fundamentally different to software and using terms from software just muddies the waters."

They're still software, they just don't have source code (yet).


There is a comment elsewhere claiming there are a few dozen fully open source models: https://news.ycombinator.com/item?id=41048796


Why is the dataset required for it to be open source?

If I self host a project that is open sourced rather than paying for a hosted version, like Sentry.io for example, I don't expect data to come along with the code. Licensing rights are always up for debate in open source, but I wouldn't expect more than the code to be available and reviewable for anything needed to build and run the project.

In the case of an LLM I would expect that to mean the code run to train the model, the code for the model data structure itself, and the control code for querying the model should all be available. I'm not actually sure if Meta does share all that, but training data is separate from open source IMO.


The open source movement, from which the name derives, was about the freedom to make bespoke alterations to the software you choose to run. Provided you have reasonably widespread proficiency in industry standard tools, you can take something that's open source, modify that source, and rebuild/redeploy/reinterpret/re-whatever to make it behave the way that you want or need it to behave.

This is in contrast to a compiled binary or obfuscated source image, where alteration may be possible with extraordinary skill and effort but is not expected and possibly even specifically discouraged.

In this sense, weights are entirely like those compiled binaries or obfuscated sources rather than the source code usually associated with "open source".

To be "open source" we would want LLMs where one might be able to manipulate the original training data or training algorithm to produce a set of weights more suited to one's own desires and needs.

Facebook isn't giving us that yet, and very probably can't. They're just trading on the weird boundary state of the term "open source" -- it still carries prestige and garners good will from its original techno-populist ideals, but is so diluted by twenty years of naive consumers who just take it to mean "I don't have to pay to use this" that the prestige and good will is now misplaced.


>The open source movement, from which the name derives, was about the freedom to make bespoke alterations to the software you choose to run.

The open source movement was a cash grab to make the free software movement more palatable to big corp by moving away from copyleft licenses. The MIT license is perfectly open source and means that you can buy software without ever seeing its code.


If you obtain open source licensed software you can pass it on legally (and freely). With some licenses you also have to provide the source code.


The sticking point is you can’t build the model. To be able to build the model from scratch you need methodology and a complete description of the data set.

They only give you a blob of data you can run.


Got it, that makes sense. I still wouldn't expect them to have to publicly share the data itself, but if you can't take the code they share and run it against your own data to build a model that wouldn't be open source in my understanding of it.


Data is the source code here, though. Training code is effectively a build script. Data that goes into training a model does not function like assets in videogames; you can't swap out the training dataset after release and get substantially the same thing. If anything, you can imagine the weights themselves are the asset - and even if the vendor is granting most users a license to copy and modify it (unlike with videogames), the asset itself isn't open source.

So, the only bit that's actually open-sourced in these models is the inference code. But that's a trivial part that people can procure equivalents of elsewhere or reproduce from published papers. In this sense, even if you think calling the models "open source" is correct, it doesn't really mean much, because the only parts that matter are not open sourced.


Compare/contrast:

DOOM-the-engine is open source (https://github.com/id-Software/DOOM), even though DOOM-the-asset-and-scenario-data is not. While you need a copy of DOOM-the-asset-and-scenario-data to "use DOOM to run DOOM", you are free to build other games using DOOM-the-engine.


I think no one would claim that “Doom” is open source though, if that’s the situation.


That's what op is saying, the engine is GPLv2, but the assets are copyrighted. There's Freedoom though and it's pretty good [0].

[0]: https://freedoom.github.io/


The thing they are pointing at and which is the thing people want is the output of the training engine, not the inputs. This is like someone saying they have an open source kernel, but they only release a compiler and a binary... the kernel code is never released, but the kernel is the only reason anyone even wants the compiler. (For avoidance of anyone being somehow confused: the training code is a compiler which takes training data and outputs model weights.)


The output of the training engine, i.e. the model itself, isn't source code at all though. The best approximation would be considering it obfuscated code, and even then it's a stretch since it is more similar to compressed data.

It sounds like Meta doesn't share source for the training logic. That would be necessary for it to really be open source: you need to be able to recreate and modify the codebase. But that has nothing to do with the training data or the trained model.


I didn't claim the output is source code, any more than the kernel is. Are you sure you don't simply agree with me?


> not actually sure if Meta does share all that

Meta shares the code for inference but not for training, so even if we say it can be open-source without the training data, Meta's models are not open-source.

I can appreciate Zuck's enthusiasm for open-source but not his willingness to mislead the larger public about how open they actually are.


https://opensource.org/osd

"The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed."

> In the case of an LLM I would expect that to mean the code run to train the model, the code for the model data structure itself, and the control code for querying the model should all be available

The M in LLM is for "Model".

The code you describe is for an LLM harness, not for an LLM. The code for the LLM is whatever is needed to enable a developer to modify the inputs and then build a modified output LLM (minus standard generally available tools not custom-created for that product).

Training data is one way to provide this. Another way is some sort of semantic model editor for an interpretable model.


I still don't quite follow. If Meta were to provide all code required to train a model (it sounds like they don't), and they provided the code needed to query the model you train to get answers, how is that not open source?

> Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed.

This definition actually makes it impossible for any LLM to be considered open source until the interpretability problem is solved. The trained model is functionally obfuscated code, it can't be read or interpreted by a human.

We may be saying the same thing here, I'm not quite sure if you're saying the model must be available or if what is missing is the code to train your own model.


I'm not the person you replied directly to so I can't speak for them, but I did start this thread, and I just wanted to clarify what I meant in my OP, because I see a lot of people misinterpreting what I meant.

I did not mean that LLM training data needs to be released for the model to be open source. It would be a good thing if creators of models did release their training data, and I wouldn't even be opposed to regulation which encourages or even requires that training data be released when models meet certain specifications. I don't even think the bar needs to be high there- We could require or encourage smaller creators to release their training data too and the result would be a net positive when it comes to public understanding of ML models, control over outputs, safety, and probably even capabilities.

Sure, it's possible that training data is being used illegally, but I don't think the solution to that is to just have everyone hide that and treat it as an open secret. We should either change the law, or apply it equally.

But that being said, I don't think it has anything to do with whether the model is "open source". Training data simply isn't source code.

I also don't mean that the license that these models are released under is too restrictive to be open source. Though that is also true, and if these models had source code, that would also prevent them from being open source. (Rather, they would be "source available" models)

What I mean is "The trained model is functionally obfuscated code, it can't be read or interpreted by a human." As you point out, it is definitionally impossible for any contemporary LLM to be considered open source. (Except for maybe some very, very small research models?) There's no source code (yet) so there is no source to open.

I think it is okay to acknowledge when something is technically infeasible, and then proceed to not claim to have done that technically infeasible thing. I don't think the best response to that situation is to, instead, use that as justification for muddying the language to such a degree that it's no longer useful. And I don't think the distinction is trivial or purely semantic. Using the language of open source in this way is dangerous for two reasons.

The first is that it could conceivably make it more challenging for copyleft licenses such as the GPL to protect the works licensed with them. If the "public" no longer treats software with public binaries and without public source code as closed source, then who's to say you can't fork the Linux kernel, release the binary, and keep the code behind closed doors? Wouldn't that also be open source?

The second is that I think convincing a significant portion of the open source community that releasing a model's weights is sufficient to open source a model will cause the community to put more focus on distributing and tuning weights, and less time actually figuring out how to construct source code for these models. I suspect that solving interpretability and generating something resembling source code may be necessary to get these models to actually do what we want them to do. As ML models become increasingly integrated into our lives and production processes, and become increasingly sophisticated, the danger created by having models optimized towards something other than what we would actually like them optimized towards increases.


Data is to models what code is to software.


I don't quite agree there. Based on other comments it sounds like Meta doesn't open source the code used to train the model, that would make it not open source in my book.

The trained model doesn't need to be open source though, and frankly I'm not sure what the value there is specifically with regards to OSS. I'm not aware of a solution to the interpretability problem; even if the model is shared, we can't understand what's in it.

Microsoft ships obfuscated code with Windows builds, but that doesn't make it open source.


Wouldn't the "source code" of the model be closer to the source code of a compiler or the runtime library?

IMO a pre-trained model given with the source code used to train/run it is analogous to a company shipping a compiler and a compiled binary without any of the source, which is why I don't think it's "open source" without the training data.


You really should be able to train a model on whatever data you choose to use though.

Training data isn't source code at all; it's content fed into the ingestion side to train a model. As long as the source for ingesting and training a model is available, which it sounds like isn't the case for Meta, that would be open source as best I understand it.

Said a little differently, I would need to be able to review all code used to generate a model and all code used to query the model for it to be OSS. I don't need Meta's training data or their actual model at all, I can train my own with code that I can fully audit and modify if I choose to.


But surely you wouldn't call it open source if sentry just gave you a binary - and the source code wasn't available.


I suspect that even if you allowed people to take the data, nobody but a FAANG like organisation could even store it?


My impression is the training data for foundation models isn't that large. It won't fit on your laptop drive, but it will fit comfortably in a few racks of high-density SSDs.


Yeah, according to the article [0] about the release of Llama 3.1 405B, it was trained on 15 trillion tokens using 16,000 Nvidia H100s. Even if they did release the training data, I don't think many people would have the number of GPUs required to actually do any real training to create the model.

[0] https://ai.meta.com/blog/meta-llama-3-1/


And a token is just the index of a chunk of input text in a restricted dictionary. GPT-2 was said to have 50k distinct tokens, so I think it's safe to assume even the latest ones are well below 4M tokens, so max 4 bytes per token. 15 trillion tokens -> 4 bytes/token * 15 T tokens -> training input <= 60 TB, which doesn't sound that large.

It's the computation that is costly.
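
The back-of-the-envelope estimate above can be checked with a few lines of Python. This is only a sketch under the same assumptions as the comment: the 50k and 4M vocabulary sizes and the 15 trillion token count come from the thread, the helper names are hypothetical, and the result is the size of the tokenized ids, not of the raw text the tokens came from.

```python
def token_id_bytes(vocab_size):
    """Smallest whole number of bytes that can hold any token id in [0, vocab_size)."""
    bits = max(1, (vocab_size - 1).bit_length())
    return (bits + 7) // 8


def dataset_size_tb(num_tokens, bytes_per_token):
    """Size of a token-id stream in terabytes (1 TB = 10**12 bytes)."""
    return num_tokens * bytes_per_token / 1e12


TOKENS = 15 * 10**12  # 15 trillion tokens, as quoted for Llama 3.1

# A GPT-2-sized vocabulary (~50k entries) fits in 2 bytes per id,
# and even a 4M-entry vocabulary would fit in 3.
small_vocab_bytes = token_id_bytes(50_000)
large_vocab_bytes = token_id_bytes(4_000_000)

# The comment's generous upper bound of 4 bytes per token.
upper_bound_tb = dataset_size_tb(TOKENS, 4)
print(small_vocab_bytes, large_vocab_bytes, upper_bound_tb)
```

So 4 bytes per token is a comfortable upper bound, and the 60 TB figure follows directly from it.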


LLAMA is an open-weights model. I like this term, let's use that instead of open source.


Can a human programmer edit the weights according to some semantics?


It is possible to merge two fine-tunes of models from the same family by... wait for it... averaging or combining their weights[0].

I am still amazed that we can do that.

[0]: https://arxiv.org/abs/2212.09849
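
For anyone wondering what "averaging their weights" looks like mechanically, here is a minimal sketch. It is plain linear interpolation over toy checkpoints, not necessarily the method of the linked paper; the `average_weights` helper, the parameter names, and the values are all hypothetical, and flat Python lists stand in for real tensors.

```python
def average_weights(state_a, state_b, alpha=0.5):
    """Linearly interpolate two checkpoints with identical parameter names and sizes.

    alpha is the share taken from state_a; alpha=0.5 is a plain average.
    """
    if state_a.keys() != state_b.keys():
        raise ValueError("checkpoints must have the same parameter names")
    merged = {}
    for name in state_a:
        a, b = state_a[name], state_b[name]
        if len(a) != len(b):
            raise ValueError(f"size mismatch for parameter {name!r}")
        merged[name] = [alpha * x + (1 - alpha) * y for x, y in zip(a, b)]
    return merged


# Two toy "fine-tunes" of the same base model (made-up numbers).
finetune_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
finetune_b = {"layer.weight": [3.0, 4.0], "layer.bias": [1.0]}

merged = average_weights(finetune_a, finetune_b)
print(merged)
```

The requirement that both checkpoints share names and sizes is why this only works for models from the same family.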


This is absolutely wild.


Yes. Using fine tuning.


Yes, there is the concept of a "frankenmerge" and folks have also bolted on vision and audio models to LLMs.


If you can’t share the dataset, under what twisted reality are you fine to share the derivative models based on those unsharable datasets?

In a better world, there would be no “I ran some algos on it and now it’s mine” defense.


Yeah was gonna say exactly the same thing. Weird how the legislation allows releasing LLMs trained on data that is not allowed to be shared otherwise.


Meta might possibly have a license to use (some of) that data, but not a license to distribute it. Legislation has little to do with it, I imagine.


The latest Llama 3.1 is in a different repo, https://github.com/meta-llama/llama-models/blob/main/models/... , but yes, the code is shared. It's astonishing that in the Software 2.0 era, powerful applications like Llama have only hundreds of lines of code, with most of the work hidden in the training data. Source code alone is no longer as informative as it was in Software 1.0.


For models of this size, the code used to train them is going to be very custom to the architecture/cluster they are built on. It would be almost useless to anybody outside of Meta. The dataset would be a lot more interesting, as it would at the very least show everybody how they got it to behave in certain ways.


Open training data would be great too.

If you have open data and open source code you can reproduce the weights


Not easily for these large scale models, but theoretically maybe


Really? I have to check out the training code again. Last time I looked the training and inference code were just example toys that were barely usable.

Has that changed?


Open Source Initiative (kind of a de-facto authority on what's open source and what not) is spending a whole lot of time figuring out what it means for an AI system to be open source. In other words, they're basically trying to come up with a new license because the existing ones can't easily apply.

I believe this is the current draft: https://opensource.org/deepdive/drafts/the-open-source-ai-de...


OSI made themselves the authority because they hated Richard Stallman and his Free Software movement. It's just marketing.


RMS has no interest in governing Open Source, so your comment bears no particular relevance.

RMS is an advocate for Free Software. Free Software generally implies Open Source, but not the converse.

RMS considers openness of source to be a separate category from the freeness of software. "Free software is a political movement; open source is a development model."

https://www.gnu.org/licenses/license-list.en.html


Are you really pretending that OSI and the open source label itself wasn’t a reactionary movement that vilified free software principles in hopes of gaining corporate traction?

Most of us who were there remember it differently. True open source advocates will find little to refute in what I’ve said.


> True open source advocates will find little to refute in what I’ve said.

No true Scotsman https://en.wikipedia.org/wiki/No_true_Scotsman

OSI helped popularize the open source movement. They not only made it palatable to businesses, but got them excited about it. I think that FSF/Stallman alone would not have been very successful on this front with GPL/AGPL.


Like I said, honest open source advocates won’t take issue with how I framed their position.

Here’s a more important point: how far would the open source people have gotten without GCC and glibc?

Much less far than they will ever admit, in my experience.


> Most of us who were there remember it differently. True open source advocates will find little to refute in what I’ve said.

> Like I said, honest open source advocates won’t take issue to how I framed their position.

Yet you've failed to provide even a single point of evidence to back up your claim.

> "honest open source advocates"

You've literally just made this term up. It's meaningless.


It’s not a term, it’s a phrase. It means “open source advocates who are being honest about their advocacy”, in case you really need such a degree of clarification.

I’ve met honest open source advocates before and, once again, they would be unlikely to refute the fact that “open source” was invented in explicit contrast to “free software” to achieve corporate palatability.

The comment you are responding to was literally responding to a comment which validated this exact sentiment.

As to providing evidence, those of us who were there at the time don’t need any and those of you who weren’t ought to seek some. It’s not my job to link to the nearly infinite number of conversations where this obvious dynamic played out.


For some advocates, sure. I was there, too — although at the beginning of my career and not deeply involved in most licensing discussions until the founding of Mozilla (where I argued against the GNU GPL and was generally pleased with the result of the MPL). However, from ~1990, I remember sharing some code where I "more or less" made my code public domain but recommended people consider the GNU GPL as part of the README (I don't have the source code available, so I don't recall).

Your characterization is quite easily refutable, because at the time that OSI was founded, there was already an explosion of possible licenses and RMS and other GNUnatics were making lots of noise about GNU/Linux and trying to be as maximalist as possible while presenting any choice other than the GNU GPL as "against freedom".

This certainly would not have held well with people who were using the MIT Licence or BSD licences (created around the same time as the GNU GPL v1), who believed (and continue to believe) that there were options other than a restrictive viral licence‡. Yes, some of the people involved vilified the "free software principles", but there were also GNU "advocates" who were making RMS look tame with their wording (I recall someone telling me to enjoy "software slavery" because I preferred licences other than the GNU GPL).

The "Free Software" advocates were pretending that the goals of their licence were the only goals that should matter for all authors and consumers of software. That is not and never has been the case, so it is unsurprising that there was a bit of reaction to such extremism.

OSI and the open source label were a move to make things easier for corporations to accept and understand by providing (a) a clear unifying definition, and (b) a set of licences and guidelines for knowing what licenses did what and the risks and obligations they presented to people who used software under those licences.

‡ Don't @ me on this, because both the virality and restrictiveness are features of the GNU GPL. If it weren't for the nonsense in the preamble, it would be a good licence. As it is, it is an effective if rampantly misrepresented licence.


Didn't the Open Source Definition start as the DFSG? You telling me Debian hates the Free Software movement? Unless you define "hating Free Software" as "not banning the BSD license", then I'll have to disagree.


Training code is only useful to people in academia, and the closest thing to "code you can modify" are open weights.

People are framing this as if it was an open-source hierarchy, with "actual" open-source requiring all training code to be shared. This is not obvious to me, as I'm not asking people that share open-source libraries to also share the tools they used to develop them. I'm also not asking them to share all the design documents/architecture discussion behind this software. It's sufficient that I can take the end result and reshape it in any way I desire.

This is coming from an LLM practitioner that finetunes models for a living; and this constant debate about open-source vs open-weights seems like a huge distraction vs the impact open-sourcing something like Llama has... this is truly a Linux-like moment. (at a much smaller scale of course, for now at least)


I dunno — if an open source project required, say, a proprietary compiler, that would diminish its open source-ness. But I agree it's not totally comparable, since the weights are not particularly analogous to machine code. We probably need a new term. Open Weights.


There are many "compilers", you can download The Pile yourself.


> If so, then how can current ML models be open source?

The source of a language model is the text it was trained on. Llama models are not open source (contrary to their claims), they are open weight.


You can find the entire Llama 3.0 pretraining set here: https://huggingface.co/datasets/HuggingFaceFW/fineweb

15T tokens, 45 terabytes. Seems fairly open source to me.


Where has Facebook linked that? I can't find anywhere that they actually published that.


Many companies stopped publishing their data sets after people published evidence that they amounted to mass copyright infringement. They dropped the specifics of pretraining data from the model cards.

Licensing content aside, the fact that content creators don’t like redistribution means a lawful model would probably use only Gutenberg’s collection and permissively licensed code. Anything else, including Wikipedia, usually has licensing requirements they might violate.


Yeah I don't think I've seen it linked officially, but Meta does this sort of semi-official stuff all the time, leaking models ahead of time for PR, they even have a dedicated Reddit account for releasing unofficial info.

Regardless, it fits the compute used and the claim that they trained from public web data, and was suspiciously published by HF staff shortly after L3 released. It's about as official as the Mistral 7B v0.2 base model. I.e. mostly, but not entirely, probably for some weird legal reasons.


Says it is ~94TB, with >130k downloads, implying more than 12 exabytes of copying, seems a bit off, wonder how they are calculating downloads
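For what it's worth, the multiplication itself checks out. A quick sanity check, assuming decimal units (1 EB = 1,000,000 TB), taking the ~94 TB and 130k figures at face value, and assuming (probably wrongly) that every counted download pulls the full dataset:

```python
# Back-of-the-envelope check of the figures quoted above.
# Assumption: each counted download transfers the whole dataset.
dataset_tb = 94          # ~94 TB, as reported on the dataset page
downloads = 130_000      # >130k downloads

TB_PER_EB = 1_000_000    # decimal units: 1 EB = 10^6 TB

total_eb = dataset_tb * downloads / TB_PER_EB
print(f"{total_eb:.2f} EB")  # prints "12.22 EB"
```

That is an implausible amount of traffic, so my guess is the counter tallies something other than full copies.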


No. The text is an asset used by the source to train the model. The source can process arbitrary text. Text is just text, it was written for communication purposes, software (defined by source code) processes that text in a particular way to train a model.


In programming, "source" and "asset" have specific meanings that conflict with how you used them.

Source is the input to some built artifact. It is the source of that artifact. As in: where the artifact comes from. Textual input is absolutely the source of the ML model. What you are using "source" as is analogous to the source of the compiler in traditional programming.

Asset is an artifact used as input, that is revered verbatim by the output. For example, a logo baked into an application to be rendered in the UI. The compilation of the program doesn't make a new logo, it just moves the asset into the built artifact.
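A toy sketch of the distinction, in case it helps. The `build` function and its inputs are made up for illustration; the "compile" step is just a stand-in for any transformation (compilation, training):

```python
# Toy "build" illustrating source vs asset (all names are hypothetical).
def build(source: str, asset: bytes) -> dict:
    """Transform the source into something new; carry the asset through unchanged."""
    # Stand-in for compilation or training: the output differs from the input.
    compiled = source.upper().encode()
    # The asset is copied into the artifact byte-for-byte.
    return {"code": compiled, "logo": asset}

artifact = build("print hello", b"\x89PNG-logo-bytes")
print(artifact["code"] != b"print hello")          # True: the source was transformed
print(artifact["logo"] == b"\x89PNG-logo-bytes")   # True: the asset survives verbatim
```

On this reading, training text plays the role of `source` (the weights are derived from it), not of `asset`.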


I hadn't had my morning coffee yet when I wrote this and I have no idea what I meant instead of "revered", but you get the idea :D


I think it would also include the code used to train it


That would be more analogous to the build toolchain than the source code, but yes


Surely traditional “open source” also needs some notion of a reproducible build toolchain, otherwise the source code itself is approximately useless.

Imagine if the source code was in a programming language of which the basic syntax and semantics were known to no one but the original developers.

Or more realistically, I think it’s a major problem if an open source project can only be built by an esoteric process that only the original developers have access to.


Source code in a vacuum is still valuable as a way to deal with missing/inaccurate documentation and diagnose faults and their causes.

Raw training datasets similarly have some value, as you can analyze them for different characteristics to understand why the trained model is under- or over-representing different concepts.

But yes real FOSS should be "open-build" and allow anyone to build a test-passing artifact from raw source material.


I like the term "open weights". Open source would be the dataset and code that generates these weights.

There is still a lot you can do with weights, like fine tuning, and it is arguably more useful as retraining the entire model would cost millions in compute.


Of course you are right, I'd put it less carefully: The quoted Linux line is deceptive marketing.

- If we start with the closed training set, that is closed and stolen, so call it Stolen Source.

- What is distributed is a bunch of float arrays. The Llama architecture is published, but not the training or inference code. Without code there is no open source. You might as well call a compiler book open source, because it tells you how to build a compiler.

Pure marketing, but predictably many people follow their corporate overlords and eagerly adopt the co-opted terms.

Reminder again that FB is not releasing this out of altruism, but because they have an existing profitable business model that does not depend on generated chats. They probably do use it internally for tracking and building profiles, but that is the same as using Linux internally, so they release the weights to destroy the competition.

Isn't price dumping an anti trust issue?



No, it's not. The Llama 3 Community License Agreement is not an open source license. Open source licenses need to meet the criteria of the only widely accepted definition of "open source", and that's the one formulated by the OSI [0]. This license has multiple restrictions on use and distribution which make it not open source. I know Facebook keeps calling this stuff open source, maybe in order to get all the good will that open source branding gets you, but that doesn't make it true. It's like a company calling their candy vegan while listing one of its ingredients as pork-based gelatin. No matter how many times the company advertises that their product is vegan, it's not, because it doesn't meet the definition of vegan.

[0] - https://opensource.org/osd


Isn't the MIT license the generally accepted "open source" license? It's a community owned term, not OSI owned


MIT is a permissive open source license, not the open source license.


There are more licenses than just MIT that are "open source". GPL, BSD, MIT, Apache, some of the Creative Commons licenses, etc. MIT has become the de facto default though

https://opensource.org/license (linking to OSI for the list because it's convenient, not because they get to decide)


These discussions (ie, everything that follows here) would be much easier if the crowd insisting on the OSI definition of open source would capitalize Open Source.

In English, proper nouns are capitalized.

"Open" and "source" are both very normal English words. English speakers have "the right" to use them according to their own perspective and with personal context. It's the difference between referring to a blue tooth, and Bluetooth, or to an apple store or an Apple store.


Open source licenses need to meet the criteria of the only widely accepted definition of "open source", and that's the one formulated by the OSI [0]

Who died and made OSI God?


This isn't helpful. The community defers to the OSI's definition because it captures what they care about.

We've seen people try to deceptively describe non-OSS projects as open source, and no doubt we will continue to see it. Thankfully the community (including Hacker News) is quick to call it out, and to insist on not cheapening the term.

This is one of the topics that just keeps turning up:

* https://news.ycombinator.com/item?id=24483168

* https://news.ycombinator.com/item?id=31203209

* https://news.ycombinator.com/item?id=36591820


This isn't helpful. The community...

Speak for yourself, please. The term is much older than 1998, with one easily-Googled example being https://www.cia.gov/readingroom/docs/DOC_0000639879.pdf , and an explicit case of IT-related usage being https://i.imgur.com/Nw4is6s.png from https://www.google.com/books/edition/InfoWarCon/09X3Ove9uKgC... .

Unless a registered trademark is involved (spoiler: it's not) no one, whether part of a so-called "community" or not, has any authority to gatekeep or dictate the terms under which a generic phrase like "open source" can be used.


Neither of those usages relates to IT; they are both about sources of intelligence (espionage). Even if they were, the OSI definition won, nobody is using the definitions from 1995 CIA or the 1996 InfoWarCon book in the realm of IT, not even Facebook.

The community has the authority to complain about companies mis-labelling their pork products as vegan, even if nobody has a registered trademark on the term vegan. Would you tell people to shut up about that case because they don't have a registered trademark? Likewise, the community has authority to complain about Meta/Facebook mis-labelling code as open source even when they put restrictions on usage. It's not gate-keeping or dictatorship to complain about being misled or being lied to.


Would you tell people to shut up about that case because they don't have a registered trademark?

I especially like how I'm the one telling people to "shut up" all of a sudden.

As for the rest, see my other reply.


You're right, I and those who agree with me were the first to ask people to "shut up", in this case, to ask Meta to stop misusing the term open source. And I was the first to say "shut up", and I know that can be inflammatory and disrespectful, so I shouldn't have used it. I'm sorry. We're here in a discussion forum, I want you to express your opinion even if it is to complain about my complaints. For what it's worth, your counter-arguments have been stronger and better referenced than any other I have read (for the case of accepting a looser definition of the term open source in the realm of IT).


All good, and I also apologize if my objection came across as disrespectful.

This whole 'Open Source' thing is a bigger pet peeve than it should be, because I've received criticism for using the term on a page where I literally just posted a .zip file full of source code. The smart thing to do would have been to ignore and forget the criticism, which I will now work harder at doing.

In the case of a pork producer who labels their products as 'vegan', that's different because there is some authority behind the usage of 'vegan'. It's a standard English-language word that according to Merriam-Webster goes back to 1944. So that would amount to an open-and-shut case of false advertising, which I don't think applies here at all.


> In the case of a pork producer who labels their products as 'vegan', that's different because there is some authority behind the usage of 'vegan'.

I don't see the difference. Open source software is a term of art with a specific meaning accepted by its community. When people misuse the term, invariably in such a way as to broaden it to include whatever it is they're pushing, it's right that the community responds harshly.


Terms of art do not require licenses. A given term is either an ordinary dictionary word that everyone including the courts will readily recognize ("Vegan"), a trademark ("Microsoft® Office 365™"), or a fragment of language that everyone can feel free to use for their own purposes without asking permission. "Open Source" falls into the latter category.

This kind of argument is literally why trademark law exists. OSI did not elect to go down that path. Maybe they should have, but I respect their decision not to, and perhaps you should, too.


> Terms of art do not require licenses.

Agreed. There is no trademark on aileron or carburetor or context-free grammar. A couple of years ago I made this same point myself. [0]

> A given term is either an ordinary dictionary word that everyone including the courts will readily recognize ("Vegan"), a trademark ("Microsoft® Office 365™"), or a fragment of language that everyone can feel free to use for their own purposes without asking permission. "Open Source" falls into the latter category.

This taxonomy doesn't hold up.

Again, it's a term of art with a clear meaning accepted by its community. We've seen numerous instances of cynical and deceptive misuse of the term, which the community rightly calls out because it's not fair play, it's deliberate deception.

> This kind of argument is literally why trademark law exists

It is not. Trademark law exists to protect brands, not to clarify terminology.

You seem to be contradicting your earlier point that terms of art do not require licenses.

> OSI did not elect to go down that path. Maybe they should have, but I respect their decision not to, and perhaps you should, too.

I haven't expressed any opinion on that topic, and I don't see a need to.

[0] https://news.ycombinator.com/item?id=31203209


If the OSI members wanted to "clarify the terminology" in a way that permitted them (and you) to exclude others, trademark law would have absolutely been the correct way to do that. It's too late, however. The ship has sailed.

Come up with a new term and trademark that, and heck, I'll help you out with a legal fund donation when Facebook and friends inevitably try to appropriate it. Apart from that, you've fought the good fight and done what you could. Let it go.


The OSI was created about 20 years ago and defined and popularized the term open source. Their definition has been widely accepted over that period.

Recently, companies are trying to market things as open source when in reality, they fail to adhere to the definition.

I think we should not let these companies change the meaning of the term, which means it's important to explain every time they try to seem more open than they are.

I'm afraid the battle is being lost though.


>The OSI was created about 20 years ago and defined and popularized the term open source. Their definition has been widely accepted over that period.

It was defined and accepted by the community well before OSI came around though.


Citation? Wikipedia would appreciate your contribution.

https://en.wikipedia.org/wiki/Open_source

> Linus Torvalds, Larry Wall, Brian Behlendorf, Eric Allman, Guido van Rossum, Michael Tiemann, Paul Vixie, Jamie Zawinski, and Eric Raymond [...] At that meeting, alternatives to the term "free software" were discussed. [...] Raymond argued for "open source". The assembled developers took a vote, and the winner was announced at a press conference the same evening

The original "Open source Definition" was derived from Debian's Social Contract, which did not use the term "open source"

https://web.archive.org/web/20140328095107/http://www.debian...


Citation? Wikipedia would appreciate your contribution.

It's not hard to find earlier examples where the phrase is used to describe enabling and (yes) leveraging community contributions to accomplish things that otherwise wouldn't be practical; see my other post for a couple of those.

But then people will rightfully object that the term "Open Source", when used in a capacity related to journalistic or intelligence-gathering activities, doesn't have anything to do with software licensing. Even if OSI had trademarked the phrase, which they didn't, that shouldn't constrain its use in another context.

To which I'd counter that this statement is equally true when discussing AI models. We are going to have to completely rewire copyright law from the ground up to deal with this. Flame wars over what "Open Source" means or who has the right to use the phrase are going to look completely inconsequential by the time the dust settles.


I'll concede that "open source" may mean other things in other contexts. For example, an open source river may mean something in particular to those who study rivers. This thread was not talking about a new context, it was not even talking about the weights of a machine learning model or the licensing of training data, it was talking about the licensing of the code in a particular GitHub repository, llama3.

AI may make copyright obsolete, or it may make copyright more important than ever, but my prediction is that the IT community will lose something of great value if the term "open source" is diluted to include licenses that restrict usage, restrict distribution, and restrict modification. I can understand why people may want to choose somewhat restrictive licenses, just like I can understand why a product may contain gelatin, but I don't like it when the product is mis-labelled as vegan. There are plenty of other terms that could be used, for example, "open" by itself. I'm honestly curious if you would defend a pork product labelled as vegan, or do you just feel that the analogy doesn't apply?


This is like saying any python program is open source because the python runtime is open source.

Inference code is the runtime; the code that runs the model. Not the model itself.


I disagree. The file I linked to, model.py, contains the Llama 3 model itself.

You can use that model with open data to train it from scratch yourself. Or you can load Meta’s open weights and have a working LLM.


Yeah a lot of people here seem to not understand that PyTorch really does make model definitions that simple, and that has everything you need to resume back-propagation. Not to mention PyTorch itself being open-sourced by Meta.

That said, the Llama license doesn't meet strict definitions of open source, and I bet they have internal tooling for datacenter-scale training that's not represented here.


> The file I linked to, model.py, contains the Llama 3 model itself.

That makes it source available ( https://en.wikipedia.org/wiki/Source-available_software ), not open source


Source available means you can see the source, but not modify it. This is kinda the opposite, you can modify the model, but you don't see all the details of its creation.


> Source available means you can see the source, but not modify it.

No, it doesn't mean that. To quote the page I linked, emphasis mine,

> Source-available software is software released through a source code distribution model that includes arrangements where the source can be viewed, and in some cases modified, but without necessarily meeting the criteria to be called open-source. The licenses associated with the offerings range from allowing code to be viewed for reference to allowing code to be modified and redistributed for both commercial and non-commercial purposes.

> This is kinda the opposite, you can modify the model, but you don't see all the details of its creation.

Per https://github.com/meta-llama/llama3/blob/main/LICENSE there's also a laundry list of ways you're not allowed to use it, including restrictions on commercial use. So not Open Source.


That's not the training code, just the inference code. The training code, running on thousands of high-end H100 servers, is surely much more complex. They also don't open-source the dataset, or the code they used for data scraping/filtering/etc.


"just the inference code"

It's not the "inference code", it's the code that specifies the architecture of the model and loads the model. The "inference code" is mostly the model, and the model is not legible to a human reader.

Maybe someday open source models will be possible, but we will need much better interpretability tools so we can generate the source code from the model. In most software projects you write the source as a specification that is then used by the computer to implement the software, but in this case the process is reversed.


That is just the inference code. Not training code or evaluation code or whatever pre/post processing they do.


Is there an LLM with actual open source training code and dataset? Besides BLOOM https://huggingface.co/bigscience/bloom



Yes, there are a few dozen full open source models (license, code, data, models)


What are some of the other ones? I am aware mainly of OLMo (https://blog.allenai.org/olmo-open-language-model-87ccfc95f5...)


Can’t you do fine tuning on those binaries? That’s a modification.


You can fine tune the models, and you can modify binaries. However, there is no human readable "source" to open in either case. The act of "fine tuning" is essentially brute forcing the system to gradually alter the weights such that loss is reduced against a new training set. This limits what you can actually do with the model vs an actual open source system where you can understand how the system is working and modify specific functionality.

Additionally, models can be (and are) fine tuned via APIs, so if that is the threshold required for a system to be "open source", then that would also make the GPT4 family and other such API only models which allow finetuning open source.


I don't find this argument super convincing.

There's a pretty clear difference between the 'finetuning' offered via API by GPT4 and the ability to do whatever sort of finetuning you want and get the weights at the end that you can do with open weights models.

"Brute forcing" is not the correct language to use for describing fine-tuning. It is not as if you are trying weights randomly and seeing which ones work on your dataset - you are following a gradient.


"There's a pretty clear difference between the 'finetuning' offered via API by GPT4 and the ability to do whatever sort of finetuning you want and get the weights at the end that you can do with open weights models."

Yes, the difference is that one is provided over a remote API, and the provider of the API can restrict how you interact with it, while the other is performed directly by the user. One is a SaaS solution, the other is a compiled solution, and neither is open source.

""Brute forcing" is not the correct language to use for describing fine-tuning. It is not as if you are trying weights randomly and seeing which ones work on your dataset - you are following a gradient."

Whatever you want to call it, this doesn't sound like modifying functionality in source code. When I modify source code, I might make a change, check what that does, change the same functionality again, check the new change, etc... up to maybe a couple dozen times. What I don't do is have a very simple routine make very small modifications to all of the system's functionality, then check the result of that small change across the broad spectrum of functionality, and repeat millions of times.
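For concreteness, here is a minimal sketch of the kind of loop being argued about: a one-parameter toy model nudged along its loss gradient, next to a literal brute-force search that tries random weights. The data, learning rate, and step counts are all made up for illustration; real fine-tuning does the same thing over billions of parameters.

```python
import random

# Toy "pretrained" model y = w * x, fine-tuned on new data where the true slope is 3.0.
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]

def loss(w: float) -> float:
    """Mean squared error of the one-parameter model on the new data."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def grad(w: float) -> float:
    """Analytic derivative of the loss with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)

w = 1.0                      # starting weight (stands in for the released weights)
for _ in range(50):
    w -= 0.01 * grad(w)      # small step downhill along the gradient

# Brute force for comparison: try random weights and keep the best one.
rng = random.Random(0)
best = min((rng.uniform(-10, 10) for _ in range(50)), key=loss)

print(f"gradient descent: w={w:.4f}, loss={loss(w):.6f}")
print(f"random search:    w={best:.4f}, loss={loss(best):.6f}")
```

Either way, nobody reads or edits a specific piece of functionality here: the only lever is the loss on the new dataset.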


The gap between fine-tuning API and weights-available is much more significant than you give it credit for.

You can take the weights and train LoRAs (which is close to fine-tuning), but you can also build custom adapters on top (classification heads). You can mix models from different fine-tunes or perform model surgery (adding additional layers, attention heads, MoE).

You can perform model decomposition and amplify some of its characteristics. You can also train multi-modal adapters for the model. Prompt tuning requires weights as well.

I would even say that having the model is more potent in the hands of individual users than having the dataset.
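As a rough illustration of why the weights themselves matter, here is a pure-Python sketch of a LoRA-style update, where the released matrix stays frozen and only a small low-rank pair is trained. The shapes and values are invented for the example, and real implementations work on tensors inside the attention layers:

```python
# Pure-Python sketch of a LoRA-style update: W_eff = W + B @ A, with W frozen.
# Shapes and values are made up for illustration.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two same-shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Frozen "released" weight matrix (4 x 4 = 16 parameters).
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]

# Rank-1 adapter: B is 4 x 1, A is 1 x 4 (8 trainable parameters instead of 16).
B = [[0.5], [0.0], [0.0], [0.0]]
A = [[0.0, 2.0, 0.0, 0.0]]

W_eff = add(W, matmul(B, A))   # only B and A would be trained; W is never modified
print(W_eff[0])                # prints [1.0, 1.0, 0.0, 0.0]
```

None of this is possible behind a fine-tuning API, since you never get `W` in hand to add anything to it.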


That still doesn't make it open source.

There is a massive difference between a compiled binary that you are allowed to do anything you want with, including modifying it, building something else on top or even pulling parts of it out and using in something else, and a SaaS offering where you can't modify the software at all. But that doesn't make the compiled binary open source.


> When I modify source code, I might make a change, check what that does, change the same functionality again, check the new change, etc... up to maybe a couple dozen times.

You can modify individual neurons if you are so inclined. That's what Anthropic have done with the Claude family of models [1]. You cannot do that using any closed model. So "Open Weights" looks very much like "Open Source".

Techniques for introspection of weights are very primitive, but I do think new techniques will be developed, or even new architectures which will make it much easier.

[1] https://www.anthropic.com/news/mapping-mind-language-model


"You can modify individual neurons if you are so inclined."

You can also modify a binary, but that doesn't mean that binaries are open source.

"That's what Anthropic have done with the Claude family of models [1]. ... Techniques for introspection of weights are very primitive, but i do think new techniques will be developed"

Yeah, I don't think what we have now is robust enough interpretability to be capable of generating something comparable to "source code", but I would like to see us get there at some point. It might sound crazy, but a few years ago the degree of interpretability we have today (thanks in no small part to Anthropic's work) would have sounded crazy.

I think getting to open sourcable models is probably pretty important for producing models that actually do what we want them to do, and as these models become more powerful and integrated into our lives and production processes the inability to make them do what we actually want them to do may become increasingly dangerous. Muddling the meaning of open source today to market your product, then, can have troubling downstream effects as focus in the open source community may be taken away from interpretability and placed on distributing and tuning public weights.


> a few years ago the degree of interpretability we have today (thanks in no small part to Anthropic's work) would have sounded crazy

My understanding is that a few years ago, if we knew the degree of interpretability we have today (compared to capability) it would have been devastatingly disappointing.

We are climbing out of the trough of disillusionment maybe, but to say that we have reached mind-blowing heights with interpretability seems a bit of hyperbole, unless I've missed some enormous breakthrough.


"My understanding is that a few years ago, if we knew the degree of interpretability we have today (compared to capability) it would have been devastatingly disappointing."

I think this is a situation where both things are true. Much more progress has been made in capabilities research than interpretability and the interpretability tools we have now (at least, in regards to specific models) would have been seen as impossible or at least infeasible a few years back.


You make a good point but those are also just limitations of the technology (or at least our current understanding of it)

Maybe an analogy would help. A family spent generations breeding the perfect apple tree and they decided to “open source” it. What would open sourcing look like?


Your hypothetical apple-grower family would simply share a handbook which meticulously documented the initial species of apple used, the breeding protocol, the hybridization method, and any other factors used to breed this perfect apple.

Having the handbook and materials available would make it possible for others to reproduce the resulting apple, or to obtain similar apples with different properties by modifying the protocols.

The handbook is the source code.

On the other hand, what we have here is Monsanto saying: "we've got those Terminator-lineage apples, and we're open-sourcing them by giving you the actual apples as an end product for free. Feel free to breed them into new varieties at will as long as you're not a Big Farm company."

Not open source.


What would enable someone to reproduce the tree from scratch, and continue developing that line of trees, using tools common to apple tree breeders? I’m not an apple tree breeder, but I suspect that’s the seeds. Maybe the genetic sequence is like source code in some analogical sense, but unless you can use that information to produce an actual seed, it doesn’t qualify in a practical sense. Trees don’t have a “compilation phase” to my knowledge, so any use of “open source” would be a stretch.


"You make a good point but those are also just limitations of the technology (or at least our current understanding of it)"

Yeah, that is my point. Things that don't have source code can't be open source.

"Maybe an analogy would help. A family spent generations breeding the perfect apple tree and they decided to “open source” it. What would open sourcing look like?"

I think we need to be wary of dilemmas without solutions here. For example, let's think about another analogy: I was in a car accident last week. How can I open source my car accident?

I don't think all, or even most things, are actually "open sourcable". ML models could be open sourced, but it would require a lot of work to interpret the models and generate the source code from them.


Be charitable and intellectually curious. What would "open" look like?

GNU says "The GNU GPL can be used for general data which is not software, as long as one can determine what the definition of “source code” refers to in the particular case. As it turns out, the DSL (see below) also requires that you determine what the “source code” is, using approximately the same definition that the GPL uses."

and offers these categories, for example:

https://www.gnu.org/licenses/license-list.en.html#NonFreeSof...

* Software Licenses

* * GPL-Compatible Free Software Licenses

* * GPL-Incompatible Free Software Licenses

* Licenses For Documentation

* * Free Documentation Licenses

* Licenses for Other Works

* * Licenses for Works of Practical Use besides Software and Documentation

* * Licenses for Fonts

* * Licenses for Works stating a Viewpoint (e.g., Opinion or Testimony)

* * Licenses for Designs for Physical Objects


"Be charitable and intellectually curious. What would "open" look like?"

To really be intellectually curious we need to be open to the idea that there is not (yet) a solution to this problem. Or in the analogy you laid out, that it is simply not possible for the system to be "open source".

Note that most of the licenses listed under the "Licenses for Other Works" section say "It is incompatible with the GNU GPL. Please don't use it for software or documentation, since it is incompatible with the GNU GPL and with the GNU FDL." This is because these are not free software/open source licenses. They are licenses that the FSF endorses because they encourage openness and copyleft in non-software mediums, and play nicely with the GPL when used appropriately (i.e. not for software).

The GPL is appropriate for many works that we wouldn't conventionally view as software, but in those contexts the analogy is usually so close to the literal nature of software that it stops being an analogy. The major difference is public perception. For example, we don't generally view jpegs as software. However, jpegs, at their heart, are executable binaries with very domain specific instructions that are executed in a very much non-Turing complete context. The source code for the jpeg is the XCF or similar (if it exists) which contains a specification (code) for building the binary. The code becomes human readable once loaded into an IDE, such as GIMP, designed to display and interact with the specification. This is code that is most easily interacted with using a visual IDE, but that doesn't change the fact that it is code.

There are some scenarios where you could identify a "source code" but not a "software". For example, a cake can be open sourced by releasing the recipe. In such a context, though, there is literally source code. It's just that the code never produces a binary, and is compiled by a human and kitchen instead of a computer. There is open source hardware, where the source code is a human readable hardware specification which can be easily modified, and the hardware is compiled by a human or machine using that specification.

The scenario where someone has bred a specific plant, however, can not be open source, unless they have also deobfuscated the genome, released the genome publicly, and there is also some feasible way to convert the deobfuscated genome, or a modification of it, into a seed.


> vs an actual open source system where you can understand how the system is working and modify specific functionality.

No one on the planet understands how the model weights work exactly, nor can they modify them specifically (i.e. hand modifying the weights to get the result they want). This is an impossible standard.

The source code is open (sorta, it does have some restrictions). The weights are open. The training data is closed.


> No one on the planet understands how the model weights work exactly

Which is my point. These models aren't open source because there is no source code to open. Maybe one day we will have strong enough interpretability to generate source from these models, and then we could have open source models. But today it's not possible, and changing the meaning of open source such that it is possible probably isn't a great idea.


It's no secret that implementing AI usually involves far more investment into training and teaching than actual code. You can know how a neural net or other ML model works. You can have all the code before you. It's still a huge job (and investment) to do anything practical with that. If Meta shares the code their AI runs on with you, you're not going to be able to do much with it unless you make the same investment in gathering data and teaching to train that AI. That would probably require data Meta won't share. You'd effectively need your own Facebook.

If everyone open sources their AI code, Meta can snatch the bits that help them without much fear of helping their direct competitors.


I think you're misunderstanding what I'm saying. I don't think it's technically feasible for current models to be open source, because there is no source code to open. Yes, there is a harness that runs the model, but the vast majority of the instructions are contained in the model weights, which are akin to a compiled binary.

If we make large strides in interpretability we may have something resembling source code, but we're certainly not there yet. I don't think the solution to that problem should be to change the definition of open source and pretend the problem has been solved.


The term “source code” can mean many things. In a legal context it’s often just defined as the preferred format for modification. It can be argued that for artificial neural networks that’s the weights (along with code and preferably training data).


I agree; there's a lot of muddiness in the term "open source AI". Earlier this year there was a talk[1] at FOSDEM, titled "Moving a step closer to defining Open Source AI". It is from someone at the Open Source Initiative. The video and slides are available in the link below[1]. From the abstract:

"Finding an agreement on what constitutes Open Source AI is the most important challenge facing the free software (also known as open source) movement. European regulation already started referring to "free and open source AI", large economic actors like Meta are calling their systems "open source" despite the fact that their license contain restrictions on fields-of-use (among other things) and the landscape is evolving so quickly that if we don't keep up, we'll be irrelevant."

[1] https://fosdem.org/2024/schedule/event/fosdem-2024-2805-movi... defining-open-source-ai/


You release all the technology and the training data. Everything that was used to create the model, including instructions.

I'm not sure if facebook has done that


Open source = reproducible binaries (weights) by you on your computer, IMO.

Strategy of FB is that they are good to be a user only and fine ruining competitor’s business with good enough free alternatives while collecting awards as saviors of whatever.


If that were the definition then any software you can install on your computer would be open source. It makes open source lose nearly all meaning.

Just say "open weights", not "open source".


Not sure what you mean by "they are good to be a user only." Whatever their strategy is, this is great for the community.


Open training dataset + open steps sufficient to train exactly the same model.


This isn't what Meta releases with their models, though I would like to see more public training data. However, I still don't think that would qualify as "open source". Something isn't open source just because it's reproducible out of composable parts. If one very critical, system-defining part is a binary (or similar) without publicly available source code, then I don't think it can be said to be "open source". That would be like saying that Windows 11 is open source because Windows Calculator is open source, and it's a component of Windows.


Here’s one list of what is needed to be actually open source:

https://blog.allenai.org/hello-olmo-a-truly-open-llm-43f7e73...


That's what I meant by "open steps", I guess I wasn't clear enough.


Is that what you meant? I don't think releasing the sequence of steps required to produce the model satisfies "open source", which is how I interpreted you, because there is still no source code for the model.


They can't release the training dataset if it was illegally scraped from all over the web without permission :) (taps head)


Coming up with the words and concepts to describe the models is a challenge.

Does the training data require permission from the copyright holder to use? Are the weights really open source or more like compiled assembly?


I also think that something like Chromium is a better analogy for corporate open source models than a grassroots project like Linux is. Chromium is technically open source, but Google has absolute control over the direction of its development, and realistically it's far too complex to maintain a fork without Google's resources. In the same way, Meta has complete control over what goes into their open models, and even if they did release all the training data and code (which they don't), us mere plebs could never afford to train a fork from scratch anyway.


I think you’re right from the perspective of an individual developer. You and I are not about to fork Chromium any time soon. If you presume that forking is impractical then sure, the right to fork isn’t worth much.

But just because a single developer couldn’t do it doesn’t mean it couldn’t be done. It means nobody has organized a large enough effort yet.

For something like a browser, which is critical for security, you need both the organization and the trust. Despite frequent criticism, Mozilla (for example) is still considered pretty trustworthy in a way that an unknown developer can’t be.


If Microsoft can't do it, then we can reasonably conclude that it can't be done for any practical purpose. Discussing infinitesimal possibilities is better left to philosophers.


Doesn’t Microsoft maintain its own fork of Chromium?


yes - their browser is chromium-based


If you think about LLMs as a new kind of programming runtime, the matrices are the source.


Ok call it Open Weights then if the dictionary definitions matter so much to you.

The actual point that matters is that these models are available for most people to use for a lot of stuff, and this is way way better than what competitors like OpenAI offer.


They don't "[allow] developers to modify its code however they want", which is a critical component of "open source", and one that Meta is clearly trying to leverage in branding around its products. I would like them to start calling these "public weight models", because what they're doing now is muddying the waters so much that "open source" now just means providing an enormous binary and an open source harness to run it in, rather than serving access to the same binary via an API.


Feels a bit like you are splitting hairs for the pleasure of semantic arguments, to be honest. Yes, there is no source in ML, so if we want to be pedantic it shouldn't be called open source. But what really matters in the open source movement is that we are able to take a program built by someone and modify it to do whatever we want with it, without having to ask someone for permission or get scrutinized or have to pay someone.

The same applies here, you can take those models and modify them to do whatever you want (provided you know how to train ML models), without having to ask for permission, get scrutinized or pay someone.

I personally think using the term open source is fine, as it conveys the intent correctly, even if, yes, weights are not sources you can read with your eyes.


Calling that “open source” renders the word “source” meaningless. By your definition, I can release a binary executable freely and call it “open source” because you can modify it to do whatever you want.

Model weights are like a binary that nobody has the source for. We need another term.


No it’s not the same as releasing a binary, feels like we can’t get out of the pedantics. I can in theory modify a binary to do whatever I want. In practice it is intractably hard to make any significant modification to a binary, and even if you could, you would then not be legally allowed to e.g. redistribute.

Here, modifying that model is not harder than doing regular ML, and I can redistribute.

Meta doesn’t have access to some magic higher level abstraction for that model that would make working with it easier that they did not release.

The sources in ML are the architecture, the training and inference code, and a paper describing the training procedure. It’s all there.


"In practice it is intractably hard to make any significant modification to a binary, and even if you could, you would then not be legally allowed to e.g. redistribute."

It depends on the binary and the license the binary is released under. If the binary is released to the public domain, for example, you are free to make whatever modifications you wish. And there are plenty of licenses like this, that allow closed source software to be used as the user wishes. That doesn't make it open source.

Likewise, there are plenty of closed source projects whose binaries we can poke and prod with much higher understanding of what our changes are actually doing than we're able to get when we poke and prod LLMs. If you want to make a Pokemon Red/Blue or Minecraft mod you have a lot of tools at your disposal.

A more apt analogy is a project that only exists as a binary which the copyright holder has relinquished rights to, or has released under some similar permissive closed source license, and which people have poked around in enough to figure out how to modify certain parts with some degree of predictability. Especially if the original author has lost the source code, as there is no source code to speak of when discussing these models.

I would not call that binary "open source", because the source would, in fact, not be open.


Can you change the tokenizer? No, because all you have is the weights trained with the current tokenizer. Therefore, by any normal definition, you don’t have the source. You have a giant black box of numbers with no ability to reproduce it.


> Can you change the tokenizer?

Yes.

You can change it however you like, then look at the paper [1] under section 3.2. to know which hyperparameters were used during training and finetune the model to work with your new tokenizer using e.g. FineWeb [2] dataset.

You'll need to do only a fraction of the training you would have needed to do if you were to start a training from scratch for your tokenizer of choice. The weights released by Meta give you a massive head start and cost saving.

The fact that it's not trivial to do and out of reach of most consumers is not a matter of openness. That's just how ML is today.

[1]: https://scontent-sjc3-1.xx.fbcdn.net/v/t39.2365-6/452387774_...

[2]: https://huggingface.co/datasets/HuggingFaceFW/fineweb
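For readers wondering what "change the tokenizer, then finetune" could look like mechanically, here is a minimal sketch of the first step only. It is an illustration under stated assumptions, not the procedure from the paper or anything Meta ships: the function name `remap_embeddings`, the toy vocabularies, and the choice to start unseen tokens at the mean of the old rows are all hypothetical. The idea is that tokens shared by both vocabularies keep their trained embedding rows, so finetuning only has to learn the new ones.

```python
def remap_embeddings(old_vocab, old_emb, new_vocab):
    """Build an embedding table for new_vocab from one trained on old_vocab.

    old_vocab / new_vocab: dict mapping token string -> row index
    (indices are assumed to be contiguous, starting at 0).
    old_emb: list of rows (lists of floats), indexed by old_vocab ids.
    Returns the new table and the number of rows that were reused.
    """
    dim = len(old_emb[0])
    # Mean of all old rows: an assumed neutral starting point for unseen tokens.
    mean_row = [sum(row[d] for row in old_emb) / len(old_emb) for d in range(dim)]
    new_emb = [None] * len(new_vocab)
    reused = 0
    for token, new_id in new_vocab.items():
        if token in old_vocab:
            # Token exists in both vocabularies: keep the trained row.
            new_emb[new_id] = list(old_emb[old_vocab[token]])
            reused += 1
        else:
            # New token: start from the mean and let finetuning adjust it.
            new_emb[new_id] = list(mean_row)
    return new_emb, reused


# Toy data, made up for illustration.
old_vocab = {"open": 0, "source": 1, "weights": 2}
old_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_vocab = {"weights": 0, "open": 1, "token": 2}

new_emb, reused = remap_embeddings(old_vocab, old_emb, new_vocab)
print(reused, new_emb)
```

The rest of the model's weights would be left as they are, and the finetuning run described above would then adapt the model to the new rows.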


You can change the tokenizer and build another model, if you can come up with your own version of the rest of the source (e.g., the training set, RLHF, etc.). You can’t change the tokenizer for this model, because you don’t have all of its source.


There is nothing that requires you to train with the same training set, or to re-do RLHF. You can train on fineweb, and llama 3.1 will learn to use your new tokenizer just fine.

There is 0 doubt that you are better off finetuning that model to use your tokenizer than training from scratch. So what Meta gives you for free massively helps you build your model, and that's OSS to me.


You have to write all the code needed to do the modifications you are interested in. That is, there is no source code provided that can be used to make the modifications of interest. One also has to come up with suitable datasets, from scratch. Training setup and data is completely non trivial for a large language model. To replicate Llama would take hundreds of hours of engineering, at least.


> You have to write all the code needed to do the modifications you are interested in. That is, there is no source code provided that can be used to make the modifications of interest.

Just like open source?

> Training setup and data is completely non trivial for a large language model. To replicate Llama would take hundreds of hours of engineering, at least.

The entire point of having the pre-trained weights released is to *not* have to do this. You just need to finetune, which can be done with very little data, depending on the task, and many open source toolkits that work with those weights exist to make this trivial.


I think maybe we’re talking past each other because it seems obvious to me and others that the weights are the output of the compilation process, whereas you seem to think they’re the input. Whether you can fine tune the weights is irrelevant to whether you got all the materials needed to make them in the first place (i.e., the source).

I can do all sorts of things by “fine tuning” Excel with formulas, but I certainly don’t have the source for Excel.


> The same applies here, you can take those models and modify them to do whatever you want without having to ask for permission, get scrutinized or pay someone.

The "Additional Commercial Terms" section of the license includes restrictions that would not meet the OSI definition of open source. You must ask for permission if you have too many users.


"Public weight models" sounds about right, thanks for coming up with a good term! Hope it catches.


My central point is this:

"are available for most people to use for a lot of stuff, and this is way way better than what competitors like OpenAI offer."

I presume you agree with it.

> rather than serving access

It's not the same access though.

I am sure that you are creative enough to think of many questions that you could ask llama3, that would instead get you kicked off of OpenAI.

> They don't "[allow] developers to modify its code however they want"

Actually, the fact that the model weights are available means that you can even ignore any limitations that you think are on it, and you'll probably just get away with it. You are also ignoring the fact that the limitations are minimal to most people.

That's a huge deal!

And it is dishonest to compare a situation where limitations are both minimal and almost unenforceable (except maybe against Google) to a situation where it's physically not possible to get access to the model weights to do what you want with them.


> Actually, the fact that the model weights are available means that you can even ignore any limitations that you think are on it, and you'll probably just get away with it. You are also ignoring the fact that the limitations are minimal to most people.

The limitations here are technical, not legal. (Though I am aware of the legal restrictions as well, and I think it's worth noting that no other project would get by calling themselves open source while imposing a restriction which prevents competitors from using the system to build their competing systems.) There isn't any source code to read and modify. Yes, you can fine tune a model just like you can modify a binary, but this isn't source code. Source code is a human readable specification that a computer can transform into executable code. This allows the human to directly modify functionality in the specification. We simply don't have that, and it will not be possible unless we make a lot of strides in interpretability research.

> It's not the same access though.

> I am sure that you are creative enough to think of many questions that you could ask llama3, that would instead get you kicked off of OpenAI.

I'm not saying that systems that are provided as SaaS don't tend to be more restrictive in terms of what they let you do through the API they expose vs what is possible if you run the same system locally. That may not always be true, but sure, as a general rule it is. I mean, it can't be less restrictive. However, that doesn't mean that being able to run code on your own machine makes the code open source. I wouldn't consider Windows open source, for example. Why? Because they haven't released the source code for Windows. Likewise, I wouldn't consider these models open source because their creators haven't released source code for them. Being technically infeasible to do doesn't mean that the definition changes such that it's no longer technically infeasible. It is simply infeasible, and if we want to change that, we need to do work in interpretability, not pretend like the problem is already solved.


So then yes you agree with this:

"are available for most people to use for a lot of stuff, and this is way way better than what competitors like OpenAI offer." And that this is very significant.


One counterpoint is that major publications (eg New York Times) would have you believe that AI is a mildly lossy compression algorithm capable of reconstructing the original source material.


I believe it is able to reconstruct parts of the original source material—if the interrogator already knows the original source material to prompt the model appropriately.


It's not?


Unfortunately open source really just means an open API these days. The API is heavily intertwined with closed source.


No, open source means that sources are open, typically for inspection, modification etc. That can be considered the case here too. Likely in order to claim "true open source", they would have to share the dataset? But even this might not be enough for a truly open source model? This dataset is nothing but another artifact. So how did they arrive at this dataset? Now they have to share pipelines and infra...

.. the thing is, we have not dealt with LLMs much, so it's hard to say what can be considered an open source LLM just yet, and we use that as a metaphor for now


Weight is the new code.


I think saying it's the new binary is closer to the truth. You can't reproduce it, but you can use it. In this new version, you can even nudge it a bit to do something a little different.

New stuff, so probably not good to force old words, with known meanings, onto new stuff.


The model is more akin to a python script than a compiled C binary. This is how I see it:

Training Code and dataset are analogous to the developer who wrote the script

Model and weights are end product that is then released

Inference Code is the runtime that could execute the code. That would be e.g. PyTorch, which can import the weights and run inference.


> The model is more akin to a python script than a compiled C binary.

No, I completely disagree. Python is near pseudo-text source. Source exists for the specific purpose of being easily and completely understood, by humans, because it's for and from humans. You can turn a python calculator into a web server, because it can be split and separated at any point, because it can be completely understood at any point, and it's deterministic at every point.

A model cannot be understood by a human. It isn't meant to be. It's meant to be used, very close to as is. You can't fundamentally change the model, or dissect it, you can only nudge it in a direction, with the force of that nudge being proportional to the money you can burn, along with hope that it turns out how you want.

That's why I say it's closer to a binary: more of a black box you can use. You can't easily make a binary do something fundamentally different without changing the source. You can't easily see into that black box, or even know what it will do without trying. You can only nudge it to act a little differently, or use it as part of a workflow. (decompilation tools aside ;))


None of Meta's models are "open source" in the FOSS sense, even the latest Llama 3.1. The license is restrictive. And no one has bothered to release their training data either.

This post is an ad and trying to paint these things as something they aren't.


> no one has bothered to release their training data

If the FOSS community sets this as the benchmark for open source in respect of AI, they're going to lose control of the term. In most jurisdictions it would be illegal for the likes of Meta to release training data.


Regardless of the training data, the license even heavily restricts how you can use the model.

Please read through their "acceptable use" policy before you decide whether this is really in line with open source.


> Please read through their "acceptable use" policy before you decide whether this is really in line with open source

I'm not taking a specific position on this license. I haven't read it closely. My broad point is simply that open source AI, as a term, cannot practically require the training data be made available.


> In most jurisdictions it would be illegal for the likes of Meta to release training data.

How come releasing an LLM trained on that data is not illegal then? I think it should be.


the training data is the source.


I don’t think it’s that simple. The source is “the preferred form of the work for making modifications to it” (to use the GPL’s wording).

For an LLM, that’s not the training data. That’s the model itself. You don’t make changes to an LLM by going back to the training data and making changes to it, then re-running the training. You update the model itself with more training data.

You can’t even use the training code and original training data to reproduce the existing model. A lot of it is non-deterministic, so you’ll get different results each time anyway.
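One small, concrete contributor to that non-determinism (a sketch, not a full account of why training runs diverge): floating-point addition is not associative, so summing the same numbers in a different order can give a slightly different result. Parallel hardware does not always accumulate in the same order, and such tiny differences can compound over many training steps. Data shuffling and random initialization are other sources, which this snippet does not cover.

```python
# The same three numbers, added in two different groupings.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# The results differ in the last bits, even though the math is "the same".
order_matters = left_to_right != right_to_left
difference = abs(left_to_right - right_to_left)

print(left_to_right, right_to_left, order_matters, difference)
```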

Another complication is that the object code for normal software is a clear derivative work of the source code. It’s a direct translation from one form to another. This isn’t the case with LLMs and their training data. The models learn from it, but they aren’t simply an alternative form of it. I don’t think you can describe an LLM as a derivative work of its training data. It learns from it, it isn’t a copy of it. This is mostly the reason why distributing training data is infeasible – the model’s creator may not have the license to do so.

Would it be extremely useful to have the original training data? Definitely. Is distributing it the same as distributing source code for normal software? I don’t think so.

I think new terminology is needed for open AI models. We can’t simply re-use what works for human-editable code because it’s a fundamentally different type of thing with different technical and legal constraints.


No, the preferred way to make modifications is using the training code. One may also input a snapshot of weights to start from, but the training code is definitely what you would modify to make a change.


how do you train it in a different language by changing the training code?


By selecting different dataset. Of course this dataset does need to exist. In practice building and curating datasets also involves a lot of code.


sounds like you need the data to train the model.


Given a well-behaved training setup, you will get an equivalently powerful model from the same dataset, training scripts, and training settings. At least if you are willing to run it several times and pick the best one, a process that is commonly used for large models.
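A toy sketch of that "run it several times and pick the best one" process, under heavy assumptions: the model here is a two-parameter linear fit trained with plain SGD, and the function name `train`, the data, and the hyperparameters are all made up for illustration. Nothing about it reflects how any real LLM is trained. It only shows the shape of the idea: same data and same script, different seeds, slightly different weights, and a selection step at the end.

```python
import random


def train(seed, steps=500, lr=0.05):
    """Fit y = w*x + b with SGD; the seed controls init and sample order."""
    rng = random.Random(seed)
    # Same dataset for every run: points on the line y = 2x + 1.
    data = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    for _ in range(steps):
        x, y = rng.choice(data)
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err
    # Mean squared error over the full dataset.
    loss = sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)
    return w, b, loss


# Same data, same script, same settings; only the seed changes.
runs = [train(seed) for seed in range(5)]
best = min(runs, key=lambda r: r[2])

for w, b, loss in runs:
    print(w, b, loss)
print("best:", best)
```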


> the training data is the source

Sure. But that's not going to be released. The term open source AI cannot be expected to cover it because it's not practical.


Meta can call it something else other than open source.

Synthetic part of the training data could be released.


Of course it could be practical - provide the data. The fact that society is a dystopian nightmare controlled by a few megacorporations that don't want free information does not justify outright changing the meaning of the language.


> provide the data

Who? It's not their data.


why are they using it?


And why does legislation allow them to use the data to train their LLM and release that, but not to release the data?


So because it's really hard to do proper Open Source with these LLMs, means we need to change the meaning of Open Source so it fits with these PR releases?


> because it's really hard to do proper Open Source with these LLMs, means we need to change the meaning of Open Source so it fits with these PR releases?

Open training data is hard to the point of impracticality. It requires excluding private and proprietary data.

Meanwhile, the term "open source" is massively popular. So it will get used. The question is how.

Meta et al would love for the choice to be between, on one hand, open weights only, and, on the other hand, open training data, because the latter is impractical. That dichotomy guarantees that when someone says open source AI they'll mean open weights. (The way open source software, today, generally means source available, not FOSS.)


>Meanwhile, the term "open source" is massively popular. So it will get used. The question is how.

Here's the source of the disagreement. You're justifying the use of the term "open source" by saying it's logical for Meta to want to use it for its popularity and layman (incorrect) understanding.

The other person is saying it doesn't matter how convenient it is or how much Meta wants to use it, that the term "open source" is misleading for a product where the "source" is the training data, and the final product has onerous restrictions on use.

This would be like Adobe giving Photoshop away for free, but for personal use only and not for making ads for Adobe's competitors. Sure, Adobe likes it and most users may be fine with it, but it isn't open source.

>The way open source software, today, generally means source available, not FOSS.

I don't agree with that. When a company says "open source" but it's not free, the tech community is quick to call it "source available" or "open core".


> You're justifying the use of the term "open source" by saying it's logical for Meta to want to use it for its popularity and layman (incorrect) understanding

I'm actually not a fan of Meta's definition. I'm arguing specifically against an unrealistic definition, because for practical purposes that cedes the term to Meta.

> the term "open source" is misleading for a product where the "source" is the training data, and the final product has onerous restrictions on use

Agree. I think the focus should be on the use restrictions.

> When a company says "open source" but it's not free, the tech community is quick to call it "source available" or "open core"

This isn't consistently applied. It's why we have the free vs open vs FOSS fracture.


> Open training data is hard to the point of impracticality. It requires excluding private and proprietary data.

Right, so the onus is on Facebook/Meta to get that right. Then they could call something Open Source. Until then, find another name that doesn't already have a specific meaning.

> (The way open source software, today, generally means source available, not FOSS.)

No, but it's going that way. Open Source, today, still means that the things you need to build a project are publicly available for you to download and run on your own machine, granted you have the means to do so. What you're thinking of is literally called "Source Available", which is very different from "Open Source".

The intent of Open Source is for people to be able to reproduce the work themselves, with modifications if they want to. Is that something you can do today with the various Llama models? No, because one core part of the project's "source code" (what you need to reproduce it from scratch), the training data, is being held back and kept private.


source available is absolutely not the same as open source

you are playing very loosely with terms that have specific, widely accepted definitions (e.g. https://opensource.org/osd )

I don't get why you think it would be useful to call LLMs with published weights "open source"


> terms that have specific, widely accepted definitions

OSI's definition is far from the only one [1]. Switzerland is currently implementing CH Open's definition, the EU another one, et cetera.

> I don't get why you think it would be useful to call LLMs with published weights "open source"

I don't. I'm saying that if the choice is between open weights or open weights + open training data, open weights will win because the useful definition will outcompete the pristine one in a public context.

[1] https://en.wikipedia.org/wiki/Open-source_software#Definitio...


For the EU, I'm guessing you're talking about the EUPL, which is FSF/OSI approved and GPL compatible, generally considered copyleft.

For the CH Open, I'm not finding anything specific, even from Swiss websites, could you help me understand what you're referring to here?

I'm guessing that all these definitions have at least some points in common, which involve (another guess) at least being able to produce the output artifacts/binaries by yourself, something that you cannot do with Llama, just as an example.


> For the CH Open, I'm not finding anything specific, even from Swiss websites, could you help me understand what you're referring to here

Was on the HN front page earlier [1][2]. The definition comes strikingly close to source on request with no use restrictions.

> all these definitions have at least some points in common

Agreed. But they're all different. There isn't an accepted definition of open source even when it comes to software; there is an accepted set of broad principles.

[1] https://news.ycombinator.com/item?id=41047172

[2] https://joinup.ec.europa.eu/collection/open-source-observato...


> Agreed. But they're all different. There isn't an accepted definition of open source even when it comes to software; there is an accepted set of broad principles.

Agreed, but are we splitting hairs here and is it relevant to the claim made earlier?

> (The way open source software, today, generally means source available, not FOSS.)

Do any of these principles or definitions from these orgs agree/disagree with that?

My hypothesis is that they generally would go against that belief and instead argue that open source is different from source available. But I haven't looked specifically to confirm if that's true or not, just a guess.


> are we splitting hairs here and is it relevant to the claim made earlier?

I don't think so. Take the Swiss definition. Source on request, not even available. Yet being branded and accepted as open source.

(To be clear, the Swiss example favours FOSS. But it also permits source on request and bundles them together under the same label.)


diluting open source into a marketing term meaning "you can download something" would be a sad result


> specific, widely accepted definitions

Realistically, nobody outside of Hacker News commenters has ever cared about the OSD. It's just not how the term is used colloquially.


who says open source colloquially? ime anyone who doesn't care about software licenses will just say free (per free beer)

and (strong personal opinion) any software developer should have a firm grip on the terminology and details for legal reasons


> who says open source colloquially?

There is a large span of people between gray beard programmer and lay person, and many in that span have some concept of open-source. It's often used synonymously with visible source, free software, or in this case, open weights.

It seems unfortunate - though expected - that over half of the comments in this thread are debating the OSD for the umpteenth time instead of discussing the actual model release or accompanying news posts. Meanwhile communities like /r/LocalLlama are going hog wild with this release and already seeing what it can do.

> any software developer should have a firm grip on the terminology and details for legal reasons

They'd simply need to review the terms of the license to see if it fits their usage. It doesn't really matter if the license satisfies the OSD or not.


No, we need to adapt an existing term into the new context that it is being deployed in.


We've had a similar debate before, but the last time it was about whether Linux device drivers based on non-public datasheets under NDA were actually open source. This debate occurred again over drivers that interact with binary blobs.

I disagree with the purists - if you can legally change the source or weights - even without having access to the data used by the upstream authors - it's open enough for me. YMMV.


No. It's an asset used in the training process; the source code can process arbitrary training data.


I don’t think even that is true. I conjecture that Facebook couldn’t reproduce the model weights if they started over with the same training data, because I doubt such a huge training run is a reproducible deterministic process. I don’t think anyone has “the” source.


numpy.random.seed(1234)
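
To make the two positions above concrete, here is a minimal sketch using only Python's standard `random` module (an assumption for illustration; it is not the NumPy call quoted above, and it says nothing about how any real model was trained). It shows that a fixed seed does pin down the random draws, but that floating-point addition is not associative, so the order in which partial sums are combined can still change the result. The helper name `sample_weights` is hypothetical.

```python
import random


def sample_weights(seed, n=5):
    """Draw n pseudo-random floats from a freshly seeded generator."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


# Same seed, same stream: the two runs produce identical draws.
run_a = sample_weights(1234)
run_b = sample_weights(1234)
same_seed_matches = run_a == run_b

# Floating-point addition is not associative, so grouping the same
# three numbers differently gives two different sums.
left_grouped = (0.1 + 0.2) + 0.3
right_grouped = 0.1 + (0.2 + 0.3)
order_matters = left_grouped != right_grouped

print(same_seed_matches, order_matters)
```

So seeding is necessary for a repeatable run but may not be sufficient on its own: if the order of reductions varies between runs, as it can when work is split across parallel workers, the results can drift apart even with identical seeds and data.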


AI2 has released training data in their OLMo model: https://blog.allenai.org/hello-olmo-a-truly-open-llm-43f7e73...



