“OpenAI, Microsoft, and Google say that AI has to be done in this way. What is that way? Collect or scrape humongous amounts of data with or without permission, build a huge AI model with thousands of GPUs running. We are saying there is an alternative way,” said Chaitanya Chokkareddy, an open-source enthusiast and CTO at Ozonetel, who came up with the idea of a Telugu AI story-telling assistant called “Chandamama Kathalu”.
He identified the dominance of giants such as OpenAI as an incentive for developers to build more open-source AI models in India. “When OpenAI launched a model and ChatGPT became successful, we started to question if the world would lose out because all of that is in one place and in a proprietary mode. It’s a closed model,” Chokkareddy said, speaking at a recent discussion that debated the openness of AI models, organised by Delhi-based tech policy organisation Software Freedom Law Centre (SFLC).
The panel discussion held on Monday, November 26, also saw participation from Sunil Abraham, policy director of Meta India, Udbhav Tiwari, director of global public policy at Mozilla, and Smita Gupta, co-lead of the Open Justice for AI initiative. The session was moderated by Arjun Adrian D’Souza, senior legal counsel at SFLC.
Tech companies like OpenAI have kept the inner workings of their AI models tightly under wraps. However, this has spawned efforts to ensure greater transparency in AI development. Surprisingly, Meta has emerged as one of the leading advocates for this push towards openness in AI.
Emphasising the social media giant’s open-source approach to AI, Abraham said, “We have 615 open-source AI projects that have been released under a variety of licences. In some cases, the training data can be made available. In many other cases, the training data is not made available especially for large language models (LLMs).”
In July 2023, Meta released Llama 2, a powerful AI model that was made available for anyone to download, modify, and reuse. However, the company’s seat at the open-source table has been strongly challenged by researchers who argue that the Llama models have not been released under a conventional open-source licence.
Monday’s discussion not only touched upon the licensing of open-source AI models but also explored the risks posed by such AI models, the controversy over how an open-source AI model is defined, and who is responsible for AI hallucinations, among other issues.
The definition of open-source AI models
The contention regarding Meta’s branding of its AI models as “open” shifted the focus to a larger issue: What qualifies as an open-source AI model?
According to the Open Source Initiative (OSI), an open-source AI model is one that grants users the freedom to:
– Use the system for any purpose and without having to ask for permission.
– Study how the system works and inspect its components.
– Modify the system for any purpose, including to change its output.
– Share the system for others to use with or without modifications, for any purpose.
Notably, Meta’s Llama model falls short of OSI’s standards for an open-source AI model, as it does not allow access to training data and places certain restrictions on commercial use by companies with more than 700 million monthly active users (MAUs).
When asked about the consensus on OSI’s definition, Abraham said, “If your regulatory obligations are going to change, then there needs to be a consensus on a definition of open-source AI models.” He also raised a critical question: What happens if an AI model meets 98 per cent of the definition?
A major challenge for developers is figuring out the right licensing conditions under which their open-source AI models can be released. Chokkareddy said that it is one of the reasons why his Telugu speech recognition AI model and dataset have not yet been released.
“For the last six months, SFLC and I have been trying to figure out what is the right licence under which the dataset and AI model can be released so that any other datasets or AI models fine-tuned on top of it will also be in the open domain,” he said.
Meanwhile, Tiwari opined that copyright issues related to training data could disincentivise companies from releasing their AI models as open-source. “The moment they put up a list of datasets upon which their models have been trained, they will be taken to court and they will be sued by authors, publishing houses, and newspapers. We’re already seeing this happen around the world and no one wants to deal with it,” he said.
Turning to the legal system, Gupta spoke about “Aalap”, an open-source AI model she helped build. The model, which has a 32k context window and is meant to serve as a legal and paralegal assistant, was trained on data pertaining to six Indian legal tasks, such as analysing the facts of a case, determining which law could apply, and creating an event timeline.
However, Gupta said that developing Aalap was extremely costly. Her team struggled to build an open-source stack, as there was no benchmark or toolkit to guide them. Maintaining documentation was also a very real challenge, she added.
Highlighting that open-source AI has come under attack in the US and other parts of the world, Tiwari said the criticism stems from framing open and closed AI models as a binary in terms of their capabilities and associated risks.
“I also think that we have to recognise that merely because something is open source doesn’t mean it automatically brings all of the benefits that open source software brings to society,” he said, acknowledging that “benevolent entities whose incentives may align with open source today may not necessarily align with open source tomorrow.”
One of the main risks posed by open-source AI is the lack of content moderation. Research demonstrates that non-consensual sexual imagery and CSAM are very real risks posed by open-source AI models but not by closed ones, as many of the safeguards can simply be removed, Tiwari said.
“If you allow these capabilities to exist openly in the world, then the harm that they can be put to by nefarious actors is much greater than the possible benefit that they could bring,” he argued while drawing attention to regulatory exemptions granted to open-source AI models under the European Union’s landmark AI Act.
Similarly, Gupta said that it was critical for developers to ensure that personally identifiable information (PII) does not permeate through multiple layers of the open-source stack. She also cautioned against “scope creep”, where the PII of citizens seeking free legal aid could be used to reach out to them for marketing or other purposes.
Experts have also warned that making AI models open-source does not eliminate the risk of hallucinations.
Terming AI as a “black box” with no underlying scientific theory that explains why the technology works, Abraham opined that AI-generated hallucinations cannot be reliably attributed to a backdoor or feature – even if the AI model is open-source.
“With traditional free and open-source software, you saw the source code and if you noticed that there was a back door in the source code, then everybody knows that there is a back door in the source code. The outputs from an LLM are co-created with the person providing the prompts. So, it is almost impossible for a developer to hide something downstream from the user,” the Meta executive said.
In contrast, Chokkareddy argued that the problem of hallucination can be addressed by ensuring that the dataset does not have anything unwanted. “If the training data does not have nude photos, there is no way an AI system can hallucinate a nude image. AI can be a dream machine but it cannot dream something it has not seen,” he said.