
Inside the Black Box: The Potential of Open-Source AI


AI is one of the most divisive and hotly contested topics in tech right now. Evangelists tout the admittedly impressive abilities of AI tools like large language models (LLMs) to improve productivity and democratize skills previously held only by highly trained specialists. Pessimists focus on AI's risks to the global workforce and its potential to accelerate already widening income inequality. Privatized AI also risks codifying the biases and behavior of the data on which an LLM was trained. Open-source AI has the potential to mitigate many of these risks.

Open-source AI is having a moment, no doubt driven in part by some of these concerns. Mark Zuckerberg has called open-source AI "the path forward." Others are clearly listening: both IBM and Google have since released open-source AI models.

With some of the largest tech companies on Earth investing serious time, money, and resources into open-source AI, there's clearly something to it. So, what is the potential of open-source AI? To properly understand the benefits of open-source AI models, we first need to understand what it means to be open source.

What Is Open Source?

The term open source originally referred to a software development philosophy, where code is made publicly available so that anyone can contribute, copy, or modify the source code. Since then, open-source development has expanded to include methods of working that are decentralized and deeply collaborative.

Many organizations have codified the open-source philosophy into their core principles. According to Red Hat, for a tool, project, or product to be open-source, it needs:

  • Collaborative participation
  • Shared responsibility
  • Open exchange
  • Meritocracy and inclusivity
  • Community-oriented development
  • Open collaboration
  • Self-organization
  • Respect and reciprocity

The open-source paradigm has enabled some of the most popular software in the world, like Linux and Kubernetes. What potential does open source have for AI?

How Open Source Can Benefit AI

One of the biggest criticisms leveled against AI is what researchers call "the black box problem." According to Professor Samir Rawashdeh of the University of Michigan-Dearborn, the black box problem makes AI tools difficult to troubleshoot because no one is really sure how they work. For AI to be truly safe and trustworthy, developers need better insight into how AI tools like LLMs make their decisions.

Open-source AI doesn't refer only to AI tools published on GitHub; it also covers the data an LLM is trained on. GPT-4 was famously trained on 10 trillion words, but only its developers know which words. However, this need for greater transparency doesn't necessarily entail exposing sensitive data.

An open-source definition has emerged to meet many of these needs. According to Version 1.0 of the Open Source AI Definition, for an AI system to qualify as open source, it only needs to provide "sufficiently detailed information about the data used to train the system so that a skilled person can build a substantially equivalent system," not necessarily expose the data itself. Open-source AI has major potential; it just needs to be implemented properly.

Let’s explore some other potential benefits of open-source AI.

Enhanced Innovation

As we saw with the examples of Linux and Kubernetes, making a tool or project open source can greatly speed up development. Not having to start from a blank slate for each individual project saves an impressive amount of time, money, energy, and resources. It also spreads troubleshooting across a wider community: more minds mean problems get solved more quickly, with less effort from everyone involved.

Improved Access

Open-source AI tools promise to make the potential benefits of AI available to more people. When the code, tools, and data used to train and use AI effectively are accessible to everybody, not just Fortune 500 companies, the barriers to adopting cutting-edge research are greatly reduced.

Consider the InstructLab project, a model-agnostic open-source project for AI designed to allow more people to contribute to training LLMs.

Improved Safety

As noted earlier, AI tools like LLMs are only as good as the data they’re trained on. Earlier incarnations of ChatGPT produced horrifying examples of hate speech despite extensive safeguarding. This shouldn’t come as a great surprise given the “black box problem,” as it’s more than likely that social media data was used to train LLMs.

Open-source AI doesn't inherently solve these problems, but it at least makes clear what data is fueling the model. That transparency makes it easier to identify and minimize the risks and biases introduced by training data, and it allows developers to build auditing and monitoring tools for future AI models and AI-driven applications.
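As a toy illustration of the kind of audit that open training data makes possible, the sketch below counts how often a few terms appear in a corpus, a crude proxy for checking representational skew. The corpus, the term list, and the idea of auditing pronoun frequency are all invented for this example; real bias audits use far richer methods.

```python
import re
from collections import Counter

def term_counts(corpus: str, terms: list[str]) -> dict[str, int]:
    """Count case-insensitive whole-word occurrences of each audit term."""
    tokens = re.findall(r"[a-z']+", corpus.lower())
    counts = Counter(tokens)
    return {term: counts[term] for term in terms}

# Hypothetical slice of a training corpus, invented for illustration.
corpus = (
    "The engineer fixed the server. He rebooted it twice. "
    "The nurse checked the chart. She updated it. He signed off."
)

audit = term_counts(corpus, ["he", "she"])
print(audit)  # {'he': 2, 'she': 1}
```

With the data in the open, anyone can run this kind of check and flag imbalances before a model is trained; with a closed corpus, no one outside the vendor can.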

Encourages Competition

Monopolies are rarely good for innovation. As The University of Chicago Business Law Review notes, a monopoly's main job is not to innovate but to maintain the status quo. Open-source AI levels the playing field, encouraging tools like small language models (SLMs) trained for specific use cases.

For narrow use cases, SLMs can perform as well as, if not better than, LLMs trained on far more data, and they can be trained and deployed much more quickly. Open-source AI also makes both users and developers far less dependent on particular vendors, reducing the likelihood of vendor lock-in.

Reduces Cost

Last but not least, open-source AI promises to make AI much less expensive to implement than proprietary solutions. The average salary for a data scientist in the United States is $126,979 as of March 2025; that figure alone puts in-house AI design, development, and deployment out of reach for many smaller companies, to say nothing of the cost of storing and hosting the necessary training data.

Drawbacks Around Open-Source AI

While open-source AI has a lot going for it, that in no way suggests it doesn’t have any potential drawbacks. Every technology has strengths and weaknesses, and it’s up to the user to decide if the pros outweigh any potential drawbacks. Open-source AI is no exception.

One potential drawback of open-source AI is that sometimes it isn't fully open source. A recent paper, The Model Openness Framework: Promoting Completeness and Openness for Reproducibility, Transparency, and Usability in Artificial Intelligence, warns about "open washing," where certain components are released as open source while others remain proprietary. Other tools carry usage restrictions despite being labeled open source, such as non-commercial-use or non-compete clauses. If you're going to use open-source AI, read the fine print so you pick the tool that fits your particular needs.

Properly assessing open-source AI is still tricky. The Open Source AI Definition from the Open Source Initiative (OSI) is a step in the right direction, but it isn't universally adopted or accepted. This lack of consensus means new standards and definitions could arise, forcing early adopters to redo their early work.

Open-source AI doesn't necessarily mean completely open, either. Some open-source models make all of their training data transparent and available, along with every operation the model performs. Others expose the training data but not how it's used, leaving you to guess how the AI arrives at the answers it does.

Open-source tools and software tend to be a lot more work than out-of-the-box solutions, as well. Even the flashiest and most popular open-source tool tends to require some form of troubleshooting or configuration to work properly. If you’re newer or less experienced with tech or programming, open-source AI might not be the right fit for you.

Lastly, open-source AI can be a security risk. Open-source software is an attractive target for attackers for several reasons: its code is public, its dependency trees are often large, and maintenance frequently falls to volunteers. Using open-source tools and solutions can put your supply chain at risk, which is worth watching closely as supply chain attacks are anticipated to rise in 2025.

Vulnerabilities in open-source packages aren't always immediately apparent. Hidden dependencies and malicious commits can plague open-source AI tools even when the top-level code looks clean. As with usage rights, spend some time on due diligence if you intend to use open-source AI in any serious way.
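One concrete piece of due diligence is verifying that a downloaded model or package artifact matches a checksum the project publishes alongside its releases. Below is a minimal sketch using only Python's standard library; the artifact bytes and the source of the "published" digest are placeholders, and in practice you would copy the digest from the project's release page.

```python
import hashlib

def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()

def verify_artifact(data: bytes, expected_digest: str) -> bool:
    """Compare an artifact's digest against a published checksum."""
    return sha256_of(data) == expected_digest.lower()

# Placeholder bytes standing in for downloaded model weights.
artifact = b"model-weights-v1"
published = sha256_of(artifact)  # in practice, copied from the release page

print(verify_artifact(artifact, published))             # True
print(verify_artifact(b"tampered-weights", published))  # False
```

A checksum only proves the file you got is the file the project published; it says nothing about whether the project itself is trustworthy, which is why license review and dependency auditing still matter.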

Final Thoughts on Open-Source AI

Despite the controversies, AI isn't going anywhere; if anything, it's still accelerating. Open-source AI adoption is accelerating as well, according to a recent report from McKinsey: more than 50% of organizations are already using some form of open-source AI, and 81% of the developers surveyed report that experience with open-source tools is necessary in their field.

Clearly, open-source AI will need to be reckoned with if an organization plans on working with AI in the future. This means learning how to work with open-source AI effectively.

To fully leverage the potential of open-source AI, you should become familiar with the Open Source AI Definition. As the OSI notes in their paper Data Governance in Open Source AI, “organizations that care for open, fair and public-interest AI need to pay particular attention to and establish a shared position on data sharing and data governance.” For AI to truly realize its revolutionary potential, it needs to be widely accessible and transparent about its training data and how it’s used. Open-source AI is an essential step towards this goal.