
Critical role of data quality in enabling AI in R&D

Artificial intelligence (AI) is transforming the pharmaceutical research & development (R&D) landscape, accelerating innovation and shortening the path from discovery to market. From drug discovery to trial design, AI is redefining how breakthroughs happen. Yet AI is only as powerful as the data behind it. Poor data can delay development, approvals, and the delivery of life-saving treatments. This blog post explores why ensuring data is Findable, Accessible, Interoperable, and Reusable (FAIR) is not just a best practice, but an essential foundation for deploying reliable AI models, driving faster innovation cycles, and ultimately realising the full promise of AI in R&D.

AI in R&D: Why does data quality matter?

AI is on the path to becoming integral across the R&D value chain, from target identification to clinical development. Deloitte research suggests large biopharma companies could gain $5-7bn over five years by scaling AI.1  R&D offers the largest value opportunity (30-45 per cent) with AI shortening times for molecule delivery, improving trial efficiency, and enhancing regulatory success rates, helping to ultimately drive cost savings and revenue growth.2

High-quality, FAIR data (Findable, Accessible, Interoperable, Reusable) is critical for reliable AI, enabling transformative R&D benefits such as:3

  • Higher performing models – accurate, consistent, well-labelled data is essential for the deployment of trusted AI models that perform as intended. For example, AI-driven compound screening in the pharmaceutical industry relies on harmonised assay data from multiple labs, allowing prediction of toxicity and efficacy with higher precision.
  • Faster discovery cycles – complete, timely and reliable datasets streamline data preparation, enabling quicker signal identification and focused research on the most promising leads. A Deloitte survey of biopharma R&D executives found that lab-of-the-future investments led to 30 per cent greater cost efficiencies.4
  • Improved reproducibility – contextualised, standardised and accessible data facilitates cross-team and institutional validation of findings.
  • Regulatory confidence – robust data governance that meets all regulatory requirements ensures compliance and traceability, increasing trust in AI-driven outputs.
  • Collaboration at scale – trusted, standardised datasets enable cross-organisational model training and shared innovation. A global pharma’s open data partnerships with academia, leveraging standardised metadata, demonstrate how trusted data scales innovation.

Leveraging these practices enables GenAI to be employed across R&D. Examples of uses, the role that AI can play and the value derived are detailed in Figure 1.

Figure 1. Impact of AI across R&D processes

The Challenge: Why R&D data quality is hard to achieve

Despite its importance, achieving data quality in R&D is far from simple. R&D data spans diverse modalities (omics, imaging, clinical, sensor data) generated by disparate systems and teams, with unique formats and standards.

Common pain points include:

  • Inconsistent data capture and extensive manual effort: manual processes in data capture and compilation lead to inconsistent data entry, errors (e.g., varied annotations, unit discrepancies), and inefficiencies, ultimately compromising data reliability.
  • Inconsistent data definitions: variations in terminology and coding schemes (e.g. CDISC vs. custom schemas) require significant harmonisation for quality and reuse.5 Our Clinical Data Harmonisation: From Silos to Insights blog demonstrated how aligning ontologies improved cross-trial analytics.6 A minimal harmonisation sketch follows this list.
  • Missing metadata: lack of experimental context (e.g., the exact version of the experimental protocol followed) leads to misinterpretation of the data and weakens insight generation.
  • Measurement variability: reconciling results from different instruments or labs is difficult without standardised calibration and measurement processes.
  • Rework and repetition: despite intentions for data reuse, significant work duplication, such as repeatedly building similar databases, persists.
  • Fragmented data landscape: R&D data is often scattered across disparate systems (e.g., local databases, bespoke LIMS, disconnected wet/dry labs), leading to inconsistencies, compromised end-to-end integrity, and limiting comprehensive cross-study analysis.
  • Challenges in enabling innovative trial models: complexities in integrating diverse data sources impede virtual, decentralised, and other innovative trial designs.
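
To make the unit and terminology issues above concrete, here is a minimal Python sketch that harmonises two hypothetical lab exports onto a single schema. All column names, units and vocabularies are invented for illustration; in practice the mappings would come from controlled vocabularies and agreed standards (e.g. CDISC) rather than an ad hoc dictionary.

```python
import pandas as pd

# Hypothetical exports from two labs; column names, units and outcome
# vocabularies are invented for illustration and differ deliberately.
lab_a = pd.DataFrame({
    "compound_id": ["C-001", "C-002"],
    "ic50": [0.25, 1.10],              # micromolar (uM)
    "outcome": ["Active", "Inactive"],
})
lab_b = pd.DataFrame({
    "compound_id": ["C-003", "C-004"],
    "IC50_nM": [480.0, 75.0],          # nanomolar (nM)
    "outcome": ["ACT", "INACT"],
})

# Map the divergent column name, unit and terminology onto one agreed schema.
lab_b = lab_b.rename(columns={"IC50_nM": "ic50"})
lab_b["ic50"] = lab_b["ic50"] / 1000.0              # nM -> uM
vocab = {"Active": "active", "ACT": "active",
         "Inactive": "inactive", "INACT": "inactive"}

harmonised = pd.concat([lab_a, lab_b], ignore_index=True)
harmonised["outcome"] = harmonised["outcome"].map(vocab)
harmonised["ic50_unit"] = "uM"                      # record the unit explicitly

print(harmonised)
```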

Efficient AI requires resolving these data quality challenges, a task AI can itself assist with by detecting anomalies, correcting errors, and standardising data. This is exemplified by a 'lab-in-the-loop' methodology, which uses continuous lab data to train self-improving AI for accelerated drug discovery.7
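
As a simple illustration of AI assisting with data quality itself, the sketch below flags an implausible assay reading using scikit-learn's IsolationForest. The data, column names and contamination setting are assumptions for illustration only; flagged rows would typically be routed to a scientist for review rather than corrected automatically.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical plate-reader signals; one value is a likely transcription error.
readings = pd.DataFrame({
    "well": ["A1", "A2", "A3", "A4", "A5", "A6"],
    "signal": [0.98, 1.02, 1.05, 0.97, 9.40, 1.01],
})

# Unsupervised anomaly detection on the numeric signal column.
model = IsolationForest(contamination=0.1, random_state=0)
readings["flagged"] = model.fit_predict(readings[["signal"]]) == -1

# Surface flagged rows for human review instead of silently dropping them.
print(readings[readings["flagged"]])
```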

Achieving AI-ready datasets requires close business-technology collaboration to define what constitutes 'good data', and to use that definition to establish which data are most critical and how they need to be integrated to achieve the desired outcome.
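
One way to make that definition of 'good data' concrete is to encode the agreed criteria as explicit, machine-checkable rules. The sketch below is a minimal, hypothetical example in plain Python; the fields, thresholds and rule names are assumptions, and dedicated data-validation frameworks are usually preferable at scale.

```python
import pandas as pd

# Hypothetical, jointly agreed definition of 'good data' for a trial dataset,
# expressed as named, checkable rules rather than an informal understanding.
RULES = {
    "subject_id_present": lambda df: df["subject_id"].notna().all(),
    "age_within_protocol": lambda df: df["age"].between(18, 100).all(),
    "visit_date_parseable": lambda df: pd.to_datetime(
        df["visit_date"], errors="coerce").notna().all(),
}

def assess(df: pd.DataFrame) -> dict:
    """Evaluate every rule so quality is measured, not assumed."""
    return {name: bool(rule(df)) for name, rule in RULES.items()}

sample = pd.DataFrame({
    "subject_id": ["S-01", "S-02", None],
    "age": [54, 17, 63],
    "visit_date": ["2024-03-01", "2024-03-08", "not recorded"],
})
print(assess(sample))   # every rule fails on this deliberately flawed sample
```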

Enabling strong AI models through superior data quality

To realise the full potential of AI in R&D, organisations should treat data as a strategic asset instead of a by-product. Achieving this requires coordinated business, data, and technology leadership across the following eight enablers:

I. Strategic vision: define a clear, AI-aligned data quality strategy with specific, measurable standards, integrated into the data and AI lifecycle, and quantify business impact (e.g., reduced cycle times, less submission rework).

II. Prioritise critical data assets: map critical data assets (e.g., patient demographics, trial design, omics data) to key decision points and AI use cases.

III. Robust data governance & standards: establish explicit ownership, validation rules, and stewardship structures within a unified data governance framework, augmented by AI tools.

IV. Automation at source: capture structured data through digital lab notebooks and automated Extract, Transform, Load (ETL) pipelines, supported by AI-driven data cleansing, to drive consistency and integrity.

V. Metadata management: leverage contextual data (glossaries, dictionaries, lineage) and data catalogues for understanding and accessibility, with AI automating their identification, classification, and suggestion to reduce manual efforts.

VI. Scalable, interoperable infrastructure: integrate structured and unstructured data (e.g., clinical notes, scientific literature, imaging data) across platforms using modern data architectures.

VII. Dedicated operating model: define an appropriate operating model that embeds data quality (DQ) accountability across R&D, IT, and data teams through defined roles, metrics, and incentives.

VIII. Continuous improvement: monitor DQ metrics, refine data quality via feedback loops, and communicate its business importance to drive AI awareness and R&D support (a minimal monitoring sketch follows this list).
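
To illustrate the continuous-improvement enabler, the sketch below computes a few common DQ metrics (completeness, uniqueness, validity) that could be tracked per batch and trended on a dashboard. The dataset, columns and the choice of metrics are assumptions for illustration only.

```python
import pandas as pd

def dq_metrics(df: pd.DataFrame, key: str, range_checks: dict) -> dict:
    """Compute simple data quality metrics as fractions between 0 and 1."""
    metrics = {
        "completeness": float(df.notna().all(axis=1).mean()),  # rows with no missing values
        "uniqueness": float(df[key].nunique() / len(df)),      # duplicate key detection
    }
    for column, (low, high) in range_checks.items():
        metrics[f"validity_{column}"] = float(df[column].between(low, high).mean())
    return metrics

# Hypothetical batch of assay results arriving from an ETL pipeline.
batch = pd.DataFrame({
    "sample_id": ["S1", "S2", "S2", "S4"],
    "purity_pct": [99.1, 101.5, 98.7, None],
})

# Trend these numbers per batch; a drop signals a process issue to investigate.
print(dq_metrics(batch, key="sample_id", range_checks={"purity_pct": (0, 100)}))
```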

Operationalising superior data quality requires a systematic management process, as illustrated in Figure 2.

Figure 2. Data quality management process

Data quality as a competitive differentiator

While the promise of AI in R&D is enormous, its realisation is linked to the quality of the data it consumes. The complexities of R&D data, spanning diverse modalities and systems, present significant challenges to achieving the FAIR data principles necessary for robust AI models. By adopting a strategic, systematic approach, encompassing clear vision, robust governance, automation and continuous improvement, organisations can transform their data from a by-product into a strategic asset. Organisations that manage data as a rigorous R&D asset will realise faster decision cycles, improved model reliability, and enhanced regulatory confidence, ultimately delivering life-changing therapies to patients more efficiently.
