Taking control: Generative AI trains on private, enterprise data

More companies, seeking to avoid the risk of models trained on public data, are expected to train generative AI on their own data to enhance productivity, optimize costs, and unlock complex insights.

In 2023, generative AI came out of the shadows. Grabbing headlines and driving an explosion of startups, it likely reshaped the strategic roadmaps of some of the world’s largest companies. For the first time, AI systems became conversational, creative, and even seemingly emotional, able to render remarkable imagery and return deep and comprehensive (if not entirely accurate) answers to complex queries. In a matter of months, the capabilities of large language models (LLMs) and visual diffusion models provoked international debate about their potential impact on global economics and geopolitics.1

Although this initial wave of generative AI has been primarily consumer-facing and trained on public data, a deeper groundswell is building behind private models that incorporate more proprietary and domain-specific data. Companies that have been accumulating data for years now have an opportunity to unlock more of its value with generative AI. Doing so effectively could help solve some of the current challenges of public models but will likely require thoughtful investments and decision-making.

Deloitte predicts that, in 2024, enterprise spending on generative AI will grow by 30%, from an estimated US$16 billion in 2023.2 While enthusiasm has been high, enterprises have mostly been experimenting cautiously, trying to figure out the specific value of generative AI for their businesses and the costs of deploying, scaling, and operating it effectively.3

Still, the market is expanding, and more enterprises are allocating budgets to generative AI. In 2024, much of their generative AI spending is expected to go to leading cloud service providers for training models and providing computation for user queries, as well as to data scientists who will help bridge company data to foundational models. However, 2024 could also see growth in more on-premise graphics processing unit (GPU) data centers as larger businesses—and government entities—seek to bring more generative AI capabilities in-house and under their control, mirroring the earlier digital transformation lifecycle from cloud to hybrid to on-premise data centers. The main limitations to growth will likely be access to talent—and for some, to GPUs4—but companies may also wrestle with unclear use cases and issues relating to data quality.

Pros and cons of public models

The year ahead will likely see the early exuberance for generative AI tempered by a more reasoned assessment of its capabilities and costs. Users and use cases are expected to help clarify where its strengths lie and where it may be unfit or simply untrustworthy. Challenges faced by early public models, like factual errors, “hallucinations” (where the model fabricates something that may sound right5), and questions of copyright and fair use, are being confronted by those providers while further incentivizing more private models.6

Because generative models have required such massive volumes of training data, the first wave of public models was mainly trained on the largest data set available: the public internet.7 Thus, these models have also absorbed the many biases, contradictions, inaccuracies, and uncertainties of the internet itself. In some ways, this has enabled them to converse on a remarkable array of topics and to exhibit surprisingly creative, poetic, and even seemingly emotional behaviors. But it has also required work to normalize results, avoid toxic output, and reinforce more accurate and preferred responses.

When pressed for facts, models trained on public data, such as social network posts, may fabricate them.8 And they can do so with authority, causing many users to believe their assertions without properly fact-checking the results. Popular LLMs were not designed to be factually accurate, but rather statistically accurate: They are very good at guessing what a typical human expects to come next in a sentence. This capability, combined with a model’s “temperature”—the amount of randomness allowed in a model’s response9—can introduce hallucinations, leading, in one headline-making example, to a lawyer filing a legal brief built on “case law” that the model had made up.10 However, the same capability also fuels creativity—for example, using visual diffusion models to generate novel character designs for video games.11
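
To make the notion of temperature concrete, here is a minimal sketch in Python (with made-up logits standing in for a real model’s output) of how temperature rescales next-token probabilities before sampling: low temperatures concentrate probability on the likeliest token, while higher temperatures flatten the distribution and admit more surprising, and occasionally wrong, choices.

```python
import math
import random

def sample_next_token(logits: dict[str, float], temperature: float = 1.0) -> str:
    """Sample a next token from raw model scores (logits), scaled by temperature.

    temperature < 1.0 sharpens the distribution (more deterministic);
    temperature > 1.0 flattens it (more random, more 'creative').
    """
    # Divide each logit by the temperature before applying the softmax.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    # Softmax: exponentiate (shifted for numerical stability) and normalize.
    max_score = max(scaled.values())
    exp_scores = {tok: math.exp(s - max_score) for tok, s in scaled.items()}
    total = sum(exp_scores.values())
    probs = {tok: e / total for tok, e in exp_scores.items()}
    # Draw one token according to the resulting probabilities.
    return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

# Hypothetical logits for the next word after "The capital of France is ..."
logits = {"Paris": 9.0, "Lyon": 5.5, "beautiful": 5.0, "Atlantis": 2.0}
print(sample_next_token(logits, temperature=0.2))  # almost always "Paris"
print(sample_next_token(logits, temperature=1.5))  # occasionally something else
```

At a high enough temperature, even an implausible token gets sampled now and then; that randomness is part of what makes outputs feel creative, and part of what produces hallucinations.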

Publicly trained models have also run afoul of laws regarding copyright and fair use, with lawsuits mounting from those who see their own works reflected in generative outputs.12 This has been especially problematic for diffusion models that generate images based on public training sets that include copyrighted works.13 In response, some providers are enabling websites to cloak their content from being scraped as training data, potentially adding to the challenges of public models seeking training sets.14 And although copyright laws can vary by market, some can leave AI-derived works unprotectable, either for being overly derivative of prior art or for lacking sufficient human authorship to merit copyright.15 However, artists and copyright holders may be challenged to prove derivation from training sets that include billions of diverse inputs.16 Additionally, companies may be concerned about losing control of their own data if they add it to public models. Data leakage can happen when data used in training sets becomes visible to users—accidentally or through adversarial prompt engineering.17 For all these reasons, many businesses have been hesitant to adopt publicly trained generative AI.18

Leading providers of generative AI are also reckoning with these challenges and feeling pressure to evolve their business models.19 They face lawsuits and regulations for all the above reasons, while spending capital to train and tune models that support millions of daily user prompts.20 Given the enormous costs of training models and running inference at scale, hyperscale data center operators may be among the few providers capable of bearing the brunt of the costs and responsibility.

From consumer-facing to private domains

Because the fundamental capabilities of generative AI are compelling and relying on public solutions can introduce unwanted risks, more companies are looking to deploy their own models trained on their own private data.21 Doing so can avoid copyright and usage issues while enabling companies to develop bespoke solutions designed to produce desired behaviors and trustworthy results.

For many media and entertainment companies, generative AI has already disrupted content creation, enabling anyone to generate text, audio, and images. However, the most common tools that have enabled this disruption were trained on the public web, provoking lawsuits by authors and artists who believe their own works were included without consent or remuneration.22 To avoid such usage issues, both Adobe Systems23 and Getty Images24 have launched generative AI solutions trained on their own licensed visual content—the photographs and digital images they have amassed over their years of operation. When these tools generate new images, the results fall explicitly within the licensing and reuse agreements of their content libraries. This can help them avoid copyright challenges while extending pathways for creators to license and monetize their own work in private training sets.

Still, companies will be required to abide by the leading practices and regulations governing the kinds of data being used, such as personally identifiable information or medical records. Companies merging private and public data may similarly be challenged to integrate them effectively while adhering to data privacy and copyright laws. Nevertheless, these conversational learning systems, though perhaps still in their early days, are showing potential to find and amplify value in data.

If data is “the new oil,” as many have said, LLMs and diffusion models may offer a higher-performance engine for refining it. Many companies have accumulated large amounts of data that generative AI can help them operationalize. Generative AI can offer them a better lens on their data, combining a conversational and visual interface with the ability to reckon with vast troves of data at a scale far beyond human capacity. Looking across 2024, more companies may see the influence of generative AI not only on their operations and product lines, but also within their C-suites and boardrooms.

The bottom line

More companies are looking to use generative AI to help drive productivity and optimize costs. Using generative AI capabilities, they may also be able to unlock more value in their data by surfacing complex insights, ferreting out errors and fraud, reducing risks in decision-making, seeking optimizations, predicting opportunities, and even amplifying creative innovation. Some are already developing domain-specific solutions that may show results in the coming year.25 Indeed, more companies are beginning to unlock the competitive advantages of generative AI, so there may be risks in waiting. But there are many considerations around the costs of development and operations, where to deploy different parts of the value chain, and how to set guardrails and ensure accurate and trustworthy results.

Training with private data can avoid some of these pitfalls but may still require efforts to make generative AI outputs trustworthy and accurate. Constraining training sets to be more domain-specific can narrow a model’s range of responses. Reinforcement learning26 and human feedback27 are already helping steer models toward preferred behaviors, and companies that know their data best should be the ones leading the development of reward models and policy optimization.28 Such efforts can help tackle hallucinations and bias, although they have their own limitations.29 Optimizing too narrowly for specific results can degrade a model’s novelty and creativity over time.30 Done well, however, feedback can enable greater domain expertise and superhuman reasoning capabilities within those domains.31
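
As a rough illustration of where that domain knowledge enters the loop, the following sketch (plain Python; the scores and names are illustrative, not any particular framework’s API) shows the pairwise preference objective commonly used to train reward models for RLHF: human reviewers pick the better of two model responses, and the reward model is trained to score the preferred response higher.

```python
import math

def preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry-style loss used in reward-model training:
    -log(sigmoid(score_preferred - score_rejected)).

    The loss shrinks as the reward model ranks the human-preferred
    response above the rejected one by a wider margin."""
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical scores a reward model assigned to two candidate answers.
print(preference_loss(2.0, -1.0))  # small loss: ranking agrees with reviewers
print(preference_loss(-1.0, 2.0))  # large loss: ranking contradicts reviewers
```

In practice, this loss would be minimized over many human-labeled response pairs, and the resulting reward model would then guide policy optimization of the generative model itself.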

Companies planning to develop their own models should consider the costs of doing so. Models may be relatively easy to develop, especially with new, open-source models entering the market. Depending on the use case, a given company should try to understand how large a model may be necessary, how much data may be needed to train it effectively, and how much computation may be required to get it up and running. Companies may have diverse data sets of differing quality that should be conditioned and brought together into a database.32 The data will then need to be organized and labeled; because they are familiar with their own data, companies may be best equipped to guide the accurate labeling of training sets.
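
What that conditioning can look like in practice varies widely, but as a minimal sketch, the following Python pass (the directory layout, file format, and cleaning thresholds are illustrative assumptions) performs the kind of normalization and exact deduplication that typically precedes organizing and labeling a training corpus.

```python
import hashlib
import re
from pathlib import Path

def normalize(text: str) -> str:
    """Collapse whitespace and strip control characters so near-identical
    records hash to the same fingerprint."""
    text = re.sub(r"[\x00-\x08\x0b-\x1f]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def build_corpus(source_dir: str, min_chars: int = 200) -> list[str]:
    """Read raw .txt documents, drop near-empty ones, and exact-dedupe
    by content hash; returns cleaned records for downstream labeling."""
    seen: set[str] = set()
    corpus: list[str] = []
    for path in sorted(Path(source_dir).glob("**/*.txt")):
        doc = normalize(path.read_text(encoding="utf-8", errors="ignore"))
        if len(doc) < min_chars:
            continue  # too short to be a useful training record
        fingerprint = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if fingerprint in seen:
            continue  # exact duplicate already captured
        seen.add(fingerprint)
        corpus.append(doc)
    return corpus
```

Real pipelines typically add near-duplicate detection, PII scrubbing, and format-specific parsers, but the core discipline is the same: consistent normalization before anything is labeled or trained on.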

Generative AI models can have billions of parameters and may require training on very large data sets, which can necessitate a lot of computation.33 Companies may need to work with hyperscale cloud providers—and plan to pay for the cycles—or purchase their own hardware, which can be expensive to buy and operate.34 Training may be the most expensive part, but trained models must then serve queries, and if query workloads are large, inference costs can mount as well. This means that companies should carefully weigh the costs of talent, computation, and time to develop, deploy, and operate a model against the anticipated path to return on investment. Having a clear set of goals in mind, and a roadmap of implementations to get there, can keep projects on track while surfacing gains or losses early.
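
For first-order planning, a widely used rule of thumb estimates training compute at roughly six floating-point operations per parameter per training token. The sketch below (Python; the GPU throughput, utilization, and hourly price are placeholder assumptions to be replaced with real quotes) turns that rule into a back-of-envelope cost estimate.

```python
def training_cost_estimate(
    params: float,             # model parameters, e.g., 7e9 for a 7B model
    tokens: float,             # training tokens, e.g., 1e12
    gpu_flops: float = 3e14,   # assumed peak FLOP/s per GPU (placeholder)
    utilization: float = 0.4,  # assumed fraction of peak actually sustained
    dollars_per_gpu_hour: float = 2.0,  # placeholder cloud price
) -> tuple[float, float]:
    """Return (GPU-hours, dollars) using the ~6 * N * D FLOPs rule of thumb."""
    total_flops = 6.0 * params * tokens
    gpu_seconds = total_flops / (gpu_flops * utilization)
    gpu_hours = gpu_seconds / 3600.0
    return gpu_hours, gpu_hours * dollars_per_gpu_hour

hours, cost = training_cost_estimate(params=7e9, tokens=1e12)
print(f"~{hours:,.0f} GPU-hours, ~${cost:,.0f} at the assumed rate")
```

Under these placeholder numbers, a hypothetical 7-billion-parameter model trained on one trillion tokens would need roughly 100,000 GPU-hours; real budgets should also account for experimentation, failed runs, and ongoing inference.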

These demands on computation and expertise also prompt considerations about deployment and collaborators. It may make sense to work with existing cloud providers. As companies scale, or if they have proprietary or sensitive data, they may choose a hybrid or on-premise data center approach. If they do, they should be thoughtful about redundancy and security, as they would with any other critical service—and perhaps more so: A compromised system could divulge deep intelligence about the company’s data, and an adversarial attack could cause a trusted AI to deliver manipulated behaviors to stakeholders.

In most cases, an ecosystem approach could be beneficial, distributing investment, expertise, and risk. However, each company should consider the approach best suited to the outcomes it is trying to achieve. There are different pathways, but the “right” path should reflect a company’s unique needs around cost, performance, security, data types, and strategic objectives. Generative AI is a fast-moving and highly funded field that is just beginning to reveal its use cases, opportunities, and implications.

AI in the C-suite and the boardroom

Looking to the future, what might it mean if companies have their own intelligent learning systems? What does an AI-native organization look like? How much is it business-aligned versus human-aligned? What are the implications of having a conversational LLM that can see things in your data—or patterns of your competitors—that you can’t? Companies may soon have multiple AI agents embedded in numerous workflows, not just for operations but also for planning and decision-making.

As these systems establish value and trust, they could move further up the hierarchy of decision-making, potentially becoming a conversational voice in the C-suite or the boardroom.35 Such possibilities were often considered science fiction, but in 2024, they’ll seem much closer and should be anticipated.

Ultimately, business leaders will be tasked with experimentation and careful planning to determine what generative AI can do for their bottom line. Will the capabilities of generative AI enable truly differentiated financial performance and competitive advantage? And if so, for how long might those competitive advantages last? Will they become the new table stakes for business performance? Stepping back, what signals might emerge to show whether generative AI is incremental or revolutionary?

By

Chris Arkenberg

United States

Baris Sarer

United States

Gillian Crossan

United States

Rohan Gupta

United States

Endnotes

  1. David Solomon and Eric Schmidt, “The future of generative AI,” Goldman Sachs, September 13, 2023.
  2. Michael Shirer, “IDC forecasts spending on GenAI solutions will reach $143 billion in 2027 with a five-year compound annual growth rate of 73.3%,” IDC, October 16, 2023.
  3. Katyanna Quach, “Despite the hype, generative AI is not a significant chunk of enterprise cloud spend,” Register, September 12, 2023.
  4. Lucas Mearian, “Chip industry strains to meet AI-fueled demands — will smaller LLMs help?,” Computerworld, September 28, 2023.
  5. Janakiram MSV, “How to reduce the hallucinations from large language models,” New Stack, June 9, 2023.
  6. Tiana Garbett and James G. Gatto, “Generative AI and copyright – some recent denials and unanswered questions,” National Law Review 13, no. 319 (2023).
  7. Sharon Goldman, “Generative AI’s secret sauce – data scraping – comes under attack,” VentureBeat, July 6, 2023.
  8. Sascha Heyer, “Generative AI – understand and mitigate hallucinations in LLMs,” Medium, June 13, 2023.
  9. Sascha Heyer, “Generative AI – mastering the language model parameters for better output,” Medium, June 12, 2023.
  10. Benjamin Weiser and Nate Schweber, “The ChatGPT lawyer explains himself,” New York Times, June 8, 2023.
  11. Shannon Liao, “A.I. may help design your favorite video game character,” New York Times, May 22, 2023.
  12. “From ChatGPT to Getty v. Stability AI: A running list of key AI-lawsuits,” Fashion Law, October 19, 2023.
  13. James Vincent, “Getty Images sues AI art generator Stable Diffusion in the US for copyright infringement,” Verge, February 6, 2023.
  14. Danielle Romain, “An update on web publisher controls,” Keyword, Google, September 28, 2023.
  15. Christopher Hutton, “Generative AI set for era-defining clash with copyright law,” Washington Examiner, April 20, 2023.
  16. Blake Brittain, “US judge finds flaws in artist’s lawsuit against AI companies,” Reuters, June 19, 2023.
  17. Jaydeep Borkar, “What can we learn from data leakage and unlearning for law?,” Cornell University, July 19, 2023.
  18. Carl Franzen, “More than 70% of companies are experimenting with generative AI, but few are willing to commit more spending,” VentureBeat, July 25, 2023.
  19. “Why Gen AI adoption among businesses will look radically different in 2024,” Medium, September 13, 2023.
  20. Will Oremus, “AI chatbots lose money every time you use them. That is a problem,” Washington Post, June 5, 2023.
  21. “AI is setting off a great scramble for data,” Economist, August 13, 2023.
  22. Christopher J. Valente, Michael J. Stortz, Amy Wong, Peter E. Soskin, and Michael W. Meredith, “Recent trends in generative artificial intelligence litigation in the United States,” K&L Gates, September 5, 2023.
  23. Ashley Still, “Reimagining our video and audio tools with Adobe Firefly,” Adobe Blog, April 17, 2023.
  24. Getty Images Newsroom, “Getty Images launches commercially safe generative AI offering,” September 25, 2023.
  25. Jamiel Sheikh, “Bloomberg uses its vast data to create new finance AI,” Forbes, April 5, 2023.
  26. Cameron Hashemi-Pour, “What is reinforcement learning?,” TechTarget, August 2023.
  27. Jan Leike, Miljan Martic, and Shane Legg, “Learning through human feedback,” Google DeepMind, June 12, 2017.
  28. Dimitriy Konyrev, “Reinforcement learning with human feedback (RLHF) for LLMs,” SuperAnnotate, April 27, 2023.
  29. Ben Dickson, “The challenges of reinforcement learning from human feedback (RLHF),” TechTalks, September 4, 2023.
  30. Jithin James, “The impact of temperature in LLMs: Balancing determinism and creativity,” Medium, July 12, 2023.
  31. Leike, Martic, and Legg, “Learning through human feedback.”
  32. Tom Davenport and Maryam Alavi, “How to train generative AI using your company’s data,” Harvard Business Review, July 6, 2023.
  33. Sid Sheth, “Generative AI drives an explosion in compute: The looming need for sustainable AI,” SiliconAngle, February 5, 2023.
  34. Guido Appenzeller, Matt Bornstein, and Martin Casado, “Navigating the high cost of AI compute,” Andreessen Horowitz, April 27, 2023.
  35. Stanley McChrystal, “AI has entered the situation room,” Foreign Policy, June 19, 2023.

Acknowledgments

Cover image by: Manya Kuzemchenko