GPT-4 is a Risky Dependency for FOSS Projects

Saturday, April 15, 2023

Alongside the surge of interest in Copilot, ChatGPT, and now GPT-4, there’s been a corresponding surge in open source projects[1] that use them. I totally understand the curiosity about what large language models (LLMs) can do, the creative itch to experiment with them, and the engineering impulse to apply them to our long-standing problems. Yet, I hesitate.

‘Free and Open Source’?

While I realize the phrase “free and open source” arguably elides some important distinctions, I chose the term because I think the issues affect you whether you’re a CTO convinced open source is good for business, a BSD kernel hacker, a card-carrying member of the FSF, a software developer who submits kernel patches for your day job, someone with a bunch of GitHub sponsors, or a hobbyist maintainer who has made exactly $0.00 from your PRs.

First, there are ethical issues throughout the process of creating and tuning these models. One of the earliest stages is gathering training data, and there are definite consent issues in how data is collected. In the case of Copilot, some software developers are actually suing GitHub for violating their copyright. Even if courts decide that it’s already legal or legislatures specifically legalize this usage, mining people’s words or images without their permission is dubious. Because LLMs’ training data is vast (hundreds of billions of words), it’s not clear how you would even go about getting permission or could be confident that if someone denied permission, their information wasn’t included by mistake. With GPT-4, the training set isn’t public, so you can’t definitively determine whether your information is included. (There are apparently tests to see whether a given text was likely incorporated into a model.) Part of what makes the lack of permission concerning is that the model vendors (and, likely, their customers) are profiting from the resulting models. And finally, at the end of the model creation process, OpenAI also directly exploited workers by paying them between $1.32 and $2 an hour to attempt to tune the model to not produce slurs, misinformation, or bigotry.

Secondly, there are ethical concerns with the societal use of these models. One common concern is that propaganda and misinformation can be generated very cheaply. More subtly, users might trust superficially competent output that is wrong in ways that are not obvious. Another abuse stems from the privacy issues resulting from how these models were created: in some cases, a model inadvertently memorizes specific private information that can later be extracted.

When we discuss deliberate misuse, it’s worth noting that the most determined malicious users will misuse these models regardless of what free and open source software developers and advocates do; research, regulation, and industry-wide cooperation are probably needed to avoid or curtail these abuses. But free and open source software may make it easier to use LLMs inappropriately out of laziness or cheapness by making them convenient to use in a variety of contexts. Another concern is worsening these issues by essentially lending LLMs the reputational boost from open source communities. If LLMs are embraced uncritically by free and open source technologists, it will further encourage the rapid deployment of large language models, which seems premature given the technical, practical, and ethical concerns that remain. This problem is not new: the use of any new technology has to be balanced against both known and unknown risks.

These models also have a big externality because of their energy consumption. While these models will probably get more efficient over time, the fact that large tech companies are not open about the resources their models use means it is difficult to know how much energy they use. The authors of “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜” observe that the people most impacted by the contribution of LLMs to climate change may not even be able to reap the benefits because people in those regions of the world do not necessarily speak languages that the models are trained on.

Thirdly, there are practical concerns. OpenAI’s model is not only not free or open source, it’s not open at all. In theory, once people adapt to GPT-4’s specific quirks and become accustomed to having it available, OpenAI could jack up the price or add other onerous conditions of use. Of course, these issues are the same vendor lock-in issues people have always worried about, not much different from using an email provider with proprietary extensions to IMAP or using AWS S3 as your MIT-licensed web app’s sole storage backend. And these lock-in issues may not turn out to be a significant problem—there are many companies, academics, and open source projects intent on replicating GPT-4, and even if they don’t catch up for a while, there’s a good chance they’ll be good enough. But I think we need to be explicit that we’re risking vendor lock-in with these projects.

Fourthly, there are legal concerns specifically with using these models to write free or open source software. Even if your project is compatible with the license of basically any code it would be copying, there may be specific provisions you need to comply with for the specific code the model has helpfully inserted into your project.

These problems don’t trouble me very much in the small: I don’t think someone playing with the OpenAI API on their weekends is causing any harm. The open source development I’ve seen so far is not that much larger than this scale; at its biggest, it’s experimentation that many people have joined in on, but experimentation nonetheless. I’m worried about where we’ll end up two, five, or ten years from now if the energies of free and open source developers continue to be directed mainly toward closed models like GPT-4. If people give OpenAI and other vendors of closed models their money and support, whether implicit or explicit, they and their competitors have no motivation to be more open. Their business and research practices will become more normalized. They will also get more and more training data through people using their service. Over the course of years, they could plausibly end up with a near monopoly on commercial LLMs. “Voting with your wallet” (or your API calls, I guess?) isn’t an effective strategy for change by itself, but that doesn’t mean your choices are automatically harmless.

To be clear, I’m not calling for a halt in open source development to parallel the one called for in the “Pause Giant AI Experiments” open letter[2]. I think we’ve learned a lot from these open source experiments, and these projects can plausibly be adapted to totally open models created using ethical methods. (In fact, there are already efforts to adapt them to open source models, even though ethical questions remain.) Instead, I think as these open source efforts mature beyond being mere experimentation, we should be mindful of where our efforts are going and try to move toward open models trained using ethically sourced data wherever possible. We also need to keep a careful eye on what uses and abuses our tools are enabling.

It’s also important to note that while open large language models are part of the solution, they don’t solve all these problems. When staff at The Register and researchers examined the large open ImageNet dataset in 2019, it had multiple ethical issues, including labeling photos of people with slurs and, based on their scraping techniques, likely using highly personal or even explicit photos without consent, even if the licensing on the photos technically allows for this use. Open LLMs that have been released since 2019 use techniques similar to those of their closed counterparts to gather their training data and have similar consent issues.

Finally, we should resist the urge to exoticize large language models. Rather than getting distracted by hypothetical nightmares about superhuman AI, we need to address the problems, costs, and limitations that are either here already or have a clear path to becoming reality. Users, advocates, and developers of free and open source software frequently make similar choices about technologies that are as new and as closed as many LLMs are, and there’s often fierce disagreement on the best way forward. (That’s partly why there’s a split between “free software” and “open source software” in the first place.) We need to have that conversation about GPT-4 and other closed LLMs.

Further reading

  1. Some that are popular or otherwise caught my eye are copilot.lua, llama.cpp, and llm. While I’m raising concerns about the use of LLMs in open source, I’m not trying to single these projects out. 

  2. So what do I think about halting “giant AI experiments”? While I think AI research should focus on ethical concerns and a deeper understanding of the models that exist rather than chasing higher parameter counts or training set sizes, I’m not sure whether a pause is feasible or the right way to conceptualize it. Regardless, the audience of this blog post is users, advocates, and developers of free and open source software who have influence in their community but not at Google, OpenAI, or other large tech companies.