Taking control: Generative AI trains on private, enterprise data

More companies, seeking to avoid the risk of models trained on public data, are expected to train generative AI on their own data to enhance productivity, optimize costs, and unlock complex insights.

In 2023, generative AI came out of the shadows. Grabbing headlines and driving an explosion of startups, it likely reshaped the strategic roadmaps of some of the world’s largest companies. For the first time, AI systems became conversational, creative, and even seemingly emotional, able to render remarkable imagery and return deep and comprehensive (if not entirely accurate) answers to complex queries. In a matter of months, the capabilities of large language models (LLMs) and visual diffusion models provoked international debate about their potential impact on global economics and geopolitics.1

Although this initial wave of generative AI has been primarily consumer-facing and trained on public data, a deeper groundswell is building behind private models that incorporate more proprietary and domain-specific data. Companies that have been accumulating data for years now have an opportunity to unlock more of its value with generative AI. Doing so effectively could help solve some of the current challenges of public models but will likely require thoughtful investments and decision-making.

Deloitte predicts that, in 2024, enterprise spending on generative AI will grow by 30%, from an estimated US$16 billion in 2023.2 While enthusiasm has been high, enterprises have mostly been experimenting cautiously, trying to figure out the specific value of generative AI for their businesses and the costs of deploying, scaling, and operating it effectively.3

Still, the market is expanding, and more enterprises are allocating budgets to generative AI. In 2024, much of their generative AI spending is expected to go to leading cloud service providers for training models and providing computation for user queries, as well as to data scientists who will help bridge company data to foundational models. However, 2024 could also see growth in more on-premise graphics processing unit (GPU) data centers as larger businesses—and government entities—seek to bring more generative AI capabilities in-house and under their control, mirroring the earlier digital transformation lifecycle from cloud to hybrid to on-premise data centers. The main limitations to growth will likely be access to talent—and for some, to GPUs4—but companies may also wrestle with unclear use cases and issues relating to data quality.

Pros and cons of public models

The year ahead will likely see the early exuberance for generative AI tempered by a more reasoned assessment of its capabilities and costs. Users and use cases are expected to help clarify where its strengths lie and where it may be unfit or simply untrustworthy. Challenges faced by early public models, like factual errors, “hallucinations” (where the model fabricates something that may sound right5), and questions of copyright and fair use, are being confronted by those providers while further incentivizing more private models.6

Because generative models have required such massive volumes of training data, the first wave of public models was mainly trained on the largest data set available: the public internet.7 Thus, these models have also absorbed the many biases, contradictions, inaccuracies, and uncertainties of the internet itself. In some ways, this has enabled them to converse on a remarkable array of topics and to exhibit surprisingly creative, poetic, and even seemingly emotional behaviors. But it has also required work to normalize results, avoid toxic output, and reinforce more accurate and preferred responses.

When pressed for facts, models trained on public data, such as social network posts, may fabricate them.8 And they can do so with authority, causing many users to believe their assertions without properly fact-checking the results. Popular LLMs were not designed to be factually accurate, but rather statistically accurate: They are very good at guessing what a typical human expects to come next in a sentence. This capability, combined with a model’s “temperature”—the amount of randomness allowed in a model’s response9—can introduce hallucinations, leading, in one headline-making example, to a lawyer filing a legal brief built on “case law” that the model had made up.10 However, the same capability also fuels creativity—for example, using visual diffusion models to generate novel character designs for video games.11
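
To make the notion of temperature concrete, here is a minimal sketch in Python (with made-up logits standing in for a real model’s output) of how temperature rescales next-token probabilities before sampling: low temperatures concentrate probability on the likeliest token, while higher temperatures flatten the distribution and admit more surprising, and occasionally wrong, choices.

```python
import math
import random

def sample_next_token(logits: dict[str, float], temperature: float = 1.0) -> str:
    """Sample a next token from raw model scores (logits), scaled by temperature.

    temperature < 1.0 sharpens the distribution (more deterministic);
    temperature > 1.0 flattens it (more random, more 'creative').
    """
    # Divide each logit by the temperature before applying the softmax.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    # Softmax: exponentiate (shifted for numerical stability) and normalize.
    max_score = max(scaled.values())
    exp_scores = {tok: math.exp(s - max_score) for tok, s in scaled.items()}
    total = sum(exp_scores.values())
    probs = {tok: e / total for tok, e in exp_scores.items()}
    # Draw one token according to the resulting probabilities.
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

# Hypothetical logits for the next word after "The capital of France is ..."
logits = {"Paris": 9.0, "Lyon": 5.5, "beautiful": 5.0, "Atlantis": 2.0}
print(sample_next_token(logits, temperature=0.2))  # almost always "Paris"
print(sample_next_token(logits, temperature=1.5))  # occasionally something else
```

At a high enough temperature, even an implausible token gets sampled now and then; that randomness is part of what makes outputs feel creative, and part of what produces hallucinations.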

Publicly trained models have also run afoul of laws regarding copyright and fair use, with lawsuits mounting from those who see their own works reflected in generative outputs.12 This has been especially problematic for diffusion models that generate images based on public training sets that include copyrighted works.13 In response, some providers are enabling websites to cloak their content from being scraped as training data, potentially adding to the challenges of public models seeking training sets.14 And although copyright laws can vary by market, some can leave AI-derived works unprotectable, either for being overly derivative of prior art or for lacking sufficient human authorship to merit copyright.15 However, artists and copyright holders may be challenged to prove derivation from training sets that include billions of diverse inputs.16 Additionally, companies may be concerned about losing control of their own data if they add it to public models. Data leakage can happen when data used in training sets becomes visible to users—accidentally or through adversarial prompt engineering.17 For all these reasons, many businesses have been hesitant to adopt publicly trained generative AI.18

Leading providers of generative AI are also reckoning with these challenges and feeling pressure to evolve their business models.19 They face lawsuits and regulations for all the above reasons, while spending capital to train and tune models that support millions of daily user prompts.20 Given the enormous costs of training models and running inference at scale, hyperscale data center operators may be among the few providers capable of bearing the brunt of the costs and responsibility.

From consumer-facing to private domains

Because the fundamental capabilities of generative AI are compelling and relying on public solutions can introduce unwanted risks, more companies are looking to deploy their own models trained on their own private data.21 Doing so can avoid copyright and usage issues while enabling companies to develop bespoke solutions designed to produce desired behaviors and trustworthy results.

For many media and entertainment companies, generative AI has already disrupted content creation, enabling anyone to generate text, audio, and images. However, the most common tools that have enabled this disruption were trained on the public web, provoking lawsuits by authors and artists who believe their own works were included without consent or remuneration.22 To avoid such usage issues, both Adobe Systems23 and Getty Images24 have launched generative AI solutions trained on their own licensed visual content—the photographs and digital images they have amassed over their years of operation. When these tools generate new images, the results fall explicitly within the licensing and reuse agreements of their content libraries. This can help them avoid copyright challenges while extending pathways for creators to license and monetize their own work in private training sets.

Still, companies will be required to abide by the leading practices and regulations governing the kinds of data being used, such as personally identifiable information or medical records. Companies merging private and public data may similarly be challenged to integrate them effectively while adhering to data privacy and copyright laws. Nevertheless, these conversational learning systems, though perhaps still in their early days, are showing potential to find and amplify value in data.

If data is “the new oil,” as many have said, LLMs and diffusion models may offer a higher-performance engine for refining it. Many companies have accumulated large amounts of data that generative AI can help them operationalize. Generative AI can offer them a better lens on their data, combining a conversational and visual interface with the ability to reckon with vast troves of data at a scale far beyond human capacity. Looking across 2024, more companies may see the influence of generative AI not only on their operations and product lines, but also within their C-suites and boardrooms.

The bottom line

More companies are looking to use generative AI to help drive productivity and optimize costs. Using generative AI capabilities, they may also be able to unlock more value in their data by surfacing complex insights, ferreting out errors and fraud, reducing risks in decision-making, seeking optimizations, predicting opportunities, and even amplifying creative innovation. Some are already developing domain-specific solutions that may show results in the coming year.25 Indeed, more companies are beginning to unlock the competitive advantages of generative AI, so there may be risks in waiting. But there are many considerations around the costs of development and operations, where to deploy different parts of the value chain, and how to set guardrails and ensure accurate and trustworthy results.

Training with private data can avoid some of these pitfalls but may still require efforts to make generative AI outputs trustworthy and accurate. Constraining training sets to be more domain-specific can narrow a model’s range of responses. Reinforcement learning26 and human feedback27 are already helping steer models toward preferred behaviors, and companies that know their data best should be the ones leading the development of reward models and policy optimization.28 Such efforts can help tackle hallucinations and bias, although they have their own limitations.29 Optimizing too narrowly for specific results can degrade a model’s novelty and creativity over time.30 Done well, however, feedback can enable greater domain expertise and superhuman reasoning capabilities within those domains.31
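
As a rough illustration of where that domain knowledge enters the loop, the following sketch (plain Python; the scores and names are illustrative, not any particular framework’s API) shows the pairwise preference objective commonly used to train reward models for RLHF: human reviewers pick the better of two model responses, and the reward model is trained to score the preferred response higher.

```python
import math

def preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry-style loss used in reward-model training:
    -log(sigmoid(score_preferred - score_rejected)).

    The loss shrinks as the reward model ranks the human-preferred
    response above the rejected one by a wider margin."""
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical scores a reward model assigned to two candidate answers.
print(preference_loss(2.0, -1.0))  # small loss: ranking agrees with reviewers
print(preference_loss(-1.0, 2.0))  # large loss: ranking contradicts reviewers
```

In practice, this loss would be minimized over many human-labeled response pairs, and the resulting reward model would then guide policy optimization of the generative model itself.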

Companies planning to develop their own models should consider the costs of doing so. Models may be relatively easy to develop, especially with new, open-source models entering the market. Depending on the use case, a given company should try to understand how large a model may be necessary, how much data may be needed to train it effectively, and how much computation may be required to get it up and running. Companies may have diverse data sets of differing quality that should be conditioned and brought together into a database.32 The data will then need to be organized and labeled; because they are familiar with their own data, companies may be best equipped to guide the accurate labeling of training sets.
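
What that conditioning can look like in practice varies widely, but as a minimal sketch, the following Python pass (the directory layout, file format, and cleaning thresholds are illustrative assumptions) performs the kind of normalization and exact deduplication that typically precedes organizing and labeling a training corpus.

```python
import hashlib
import re
from pathlib import Path

def normalize(text: str) -> str:
    """Collapse whitespace and strip control characters so near-identical
    records hash to the same fingerprint."""
    text = re.sub(r"[\x00-\x08\x0b-\x1f]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def build_corpus(source_dir: str, min_chars: int = 200) -> list[str]:
    """Read raw .txt documents, drop near-empty ones, and exact-dedupe
    by content hash; returns cleaned records for downstream labeling."""
    seen: set[str] = set()
    corpus: list[str] = []
    for path in sorted(Path(source_dir).glob("**/*.txt")):
        doc = normalize(path.read_text(encoding="utf-8", errors="ignore"))
        if len(doc) < min_chars:
            continue  # too short to be a useful training record
        fingerprint = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if fingerprint in seen:
            continue  # exact duplicate already captured
        seen.add(fingerprint)
        corpus.append(doc)
    return corpus
```

Real pipelines typically add near-duplicate detection, PII scrubbing, and format-specific parsers, but the core discipline is the same: consistent normalization before anything is labeled or trained on.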

Generative AI models can have billions of parameters and may require training on very large data sets, which can necessitate a lot of computation.33 Companies may need to work with hyperscale cloud providers—and plan to pay for the cycles—or purchase their own hardware, which can be expensive to buy and operate.34 Training may be the most expensive part, but trained models must then serve queries, and if query workloads are large, inference costs can mount as well. This means that companies should carefully weigh the costs of talent, computation, and time to develop, deploy, and operate a model against the anticipated path to return on investment. Having a clear set of goals in mind, and a roadmap of implementations to get there, can keep projects on track while surfacing gains or losses early.
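
For first-order planning, a widely used rule of thumb estimates training compute at roughly six floating-point operations per parameter per training token. The sketch below (Python; the GPU throughput, utilization, and hourly price are placeholder assumptions to be replaced with real quotes) turns that rule into a back-of-envelope cost estimate.

```python
def training_cost_estimate(
    params: float,             # model parameters, e.g., 7e9 for a 7B model
    tokens: float,             # training tokens, e.g., 1e12
    gpu_flops: float = 3e14,   # assumed peak FLOP/s per GPU (placeholder)
    utilization: float = 0.4,  # assumed fraction of peak actually sustained
    dollars_per_gpu_hour: float = 2.0,  # placeholder cloud price
) -> tuple[float, float]:
    """Return (GPU-hours, dollars) using the ~6 * N * D FLOPs rule of thumb."""
    total_flops = 6.0 * params * tokens
    gpu_seconds = total_flops / (gpu_flops * utilization)
    gpu_hours = gpu_seconds / 3600.0
    return gpu_hours, gpu_hours * dollars_per_gpu_hour

hours, cost = training_cost_estimate(params=7e9, tokens=1e12)
print(f"~{hours:,.0f} GPU-hours, ~${cost:,.0f} at the assumed rate")
```

Under these placeholder numbers, a hypothetical 7-billion-parameter model trained on one trillion tokens would need roughly 100,000 GPU-hours; real budgets should also account for experimentation, failed runs, and ongoing inference.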

These demands on computation and expertise also prompt considerations about deployment and collaborators. It may make sense to work with existing cloud providers. As companies scale, or if they have proprietary or sensitive data, they may choose a hybrid or on-premise data center approach. If they do, they should be thoughtful about redundancy and security, as they would with any other critical service—and perhaps more so: A compromised system could divulge deep intelligence about the company’s data, and an adversarial attack could cause a trusted AI to deliver manipulated behaviors to stakeholders.

In most cases, an ecosystem approach could be beneficial, distributing investment, expertise, and risk. However, each company should consider the approach best suited to the outcomes it is trying to achieve. There are different pathways, but the “right” path should reflect a company’s unique needs around cost, performance, security, data types, and strategic objectives. Generative AI is a fast-moving and highly funded field that is just beginning to reveal its use cases, opportunities, and implications.

AI in the C-suite and the boardroom

Looking to the future, what might it mean if companies have their own intelligent learning systems? What does an AI-native organization look like? How much is it business-aligned versus human-aligned? What are the implications of having a conversational LLM that can see things in your data—or patterns of your competitors—that you can’t? Companies may soon have multiple AI agents embedded in numerous workflows, not just for operations but also for planning and decision-making.

As these systems establish value and trust, they could move further up the hierarchy of decision-making, potentially becoming a conversational voice in the C-suite or the boardroom.35 Such possibilities were often considered science fiction, but in 2024, they’ll seem much closer and should be anticipated.

Ultimately, business leaders will be tasked with experimentation and careful planning to determine what generative AI can do for their bottom line. Will the capabilities of generative AI enable truly differentiated financial performance and competitive advantage? And if so, for how long might those competitive advantages last? Will they become the new table stakes for business performance? Stepping back, what signals might emerge to show whether generative AI is incremental or revolutionary?

By

Chris Arkenberg

United States

Baris Sarer

United States

Gillian Crossan

United States

Rohan Gupta

United States

Endnotes

  1. David Solomon and Eric Schmidt, “The future of generative AI,” Goldman Sachs, September 13, 2023.
  2. Michael Shirer, “IDC forecasts spending on GenAI solutions will reach $143 billion in 2027 with a five-year compound annual growth rate of 73.3%,” IDC, October 16, 2023.
  3. Katyanna Quach, “Despite the hype, generative AI is not a significant chunk of enterprise cloud spend,” Register, September 12, 2023.
  4. Lucas Mearian, “Chip industry strains to meet AI-fueled demands — will smaller LLMs help?,” Computerworld, September 28, 2023.
  5. Janakiram MSV, “How to reduce the hallucinations from large language models,” New Stack, June 9, 2023.
  6. Tiana Garbett and James G. Gatto, “Generative AI and copyright – some recent denials and unanswered questions,” National Law Review 13, no. 319 (2023).
  7. Sharon Goldman, “Generative AI’s secret sauce – data scraping – comes under attack,” VentureBeat, July 6, 2023.
  8. Sascha Heyer, “Generative AI – understand and mitigate hallucinations in LLMs,” Medium, June 13, 2023.
  9. Sascha Heyer, “Generative AI – mastering the language model parameters for better output,” Medium, June 12, 2023.
  10. Benjamin Weiser and Nate Schweber, “The ChatGPT lawyer explains himself,” New York Times, June 8, 2023.
  11. Shannon Liao, “A.I. may help design your favorite video game character,” New York Times, May 22, 2023.
  12. “From ChatGPT to Getty v. Stability AI: A running list of key AI-lawsuits,” Fashion Law, October 19, 2023.
  13. James Vincent, “Getty Images sues AI art generator Stable Diffusion in the US for copyright infringement,” Verge, February 6, 2023.
  14. Danielle Romain, “An update on web publisher controls,” Keyword, Google, September 28, 2023.
  15. Christopher Hutton, “Generative AI set for era-defining clash with copyright law,” Washington Examiner, April 20, 2023.
  16. Blake Brittain, “US judge finds flaws in artist’s lawsuit against AI companies,” Reuters, June 19, 2023.
  17. Jaydeep Borkar, “What can we learn from data leakage and unlearning for law?,” Cornell University, July 19, 2023.
  18. Carl Franzen, “More than 70% of companies are experimenting with generative AI, but few are willing to commit more spending,” VentureBeat, July 25, 2023.
  19. “Why Gen AI adoption among businesses will look radically different in 2024,” Medium, September 13, 2023.
  20. Will Oremus, “AI chatbots lose money every time you use them. That is a problem,” Washington Post, June 5, 2023.
  21. “AI is setting off a great scramble for data,” Economist, August 13, 2023.
  22. Christopher J. Valente, Michael J. Stortz, Amy Wong, Peter E. Soskin, and Michael W. Meredith, “Recent trends in generative artificial intelligence litigation in the United States,” K&L Gates, September 5, 2023.
  23. Ashley Still, “Reimagining our video and audio tools with Adobe Firefly,” Adobe Blog, April 17, 2023.
  24. Getty Images Newsroom, “Getty Images launches commercially safe generative AI offering,” September 25, 2023.
  25. Jamiel Sheikh, “Bloomberg uses its vast data to create new finance AI,” Forbes, April 5, 2023.
  26. Cameron Hashemi-Pour, “What is reinforcement learning?,” TechTarget, August 2023.
  27. Jan Leike, Miljan Martic, and Shane Legg, “Learning through human feedback,” Google DeepMind, June 12, 2017.
  28. Dimitriy Konyrev, “Reinforcement learning with human feedback (RLHF) for LLMs,” SuperAnnotate, April 27, 2023.
  29. Ben Dickson, “The challenges of reinforcement learning from human feedback (RLHF),” TechTalks, September 4, 2023.
  30. Jithin James, “The impact of temperature in LLMs: Balancing determinism and creativity,” Medium, July 12, 2023.
  31. Leike, Martic, and Legg, “Learning through human feedback.”
  32. Tom Davenport and Maryam Alavi, “How to train generative AI using your company’s data,” Harvard Business Review, July 6, 2023.
  33. Sid Sheth, “Generative AI drives an explosion in compute: The looming need for sustainable AI,” SiliconAngle, February 5, 2023.
  34. Guido Appenzeller, Matt Bornstein, and Martin Casado, “Navigating the high cost of AI compute,” Andreessen Horowitz, April 27, 2023.
  35. Stanley McChrystal, “AI has entered the situation room,” Foreign Policy, June 19, 2023.

Acknowledgments

Cover image by: Manya Kuzemchenko