
Accelerating drug discovery & development


IP and datasets

Advanced analytical techniques are rapidly accelerating the pace at which companies developing drugs and treatments break new ground and generate new ideas and products. Many biotech companies are leveraging machine learning and artificial intelligence (AI) to achieve their aims.

Accelerating disease understanding and drug development

Improving the efficiency of drug discovery is a crucial priority for the pharmaceutical industry, and one that AI has helped to advance. In 2019, Deloitte published a report quantifying the cost of drug development and the declining return on investment. The report highlighted the ongoing drive to identify new ways of improving the efficiency and cost-effectiveness of the process. According to the research, the use of artificial intelligence to accelerate drug development has achieved the following:

Figure 1. Drug and target binding

Source: Report From the Deloitte Centre for Health Solutions 2019

Leveraging existing datasets

Key to accelerating development is leveraging existing data (GOSTAR, n.d.) from a variety of sources to synthesise new insights. In the biotech industry, drug developers have access to existing biological (Brown, 2018), chemical, clinical and pharmacological (Xie, 2017) datasets, as well as data from the scientific literature (figure 2). These are available from several commercial suppliers, either as datasets or through specifically designed AI platforms, or as open-source (Medium, 2019) material.

Figure 2. Dataset usage roadmap

The trend is towards increasing use of, and indeed reliance on, big data to support and steer new drug discovery. However, smaller biotech firms cannot hope to fund the multi-million-dollar cost of generating such datasets and must therefore look to acquire access to datasets produced by others.

Figure 3: AI in drug discovery papers

At present, there is too little discussion of the pitfalls that may come with using such data. Our research and work as commercial Intellectual Property advisers (Calvert, n.d.) within this industry have highlighted some questions that data users will need to ask when considering the use of big data, such as:

  • who owns the input data?
  • is it free to use or available at a cost?
  • what are the terms on which the data is made available?
  • to what extent is the work of the drug developer influenced or curtailed by the need to work within the contractual terms?
  • are alternative options for sourcing the data available elsewhere and, if not, how does the user ensure continuity of data supply on fair terms?
  • can the input data owner make any claim to the output insight or data?

Considerations for the data user

Some of these questions have been raised by experts in the field. They are questions on which awareness should be raised and on which debate is overdue.

As commercial IP experts, however, we advise three things:

01. Clarity on data ownership. Datasets, access to which may be helpful in developing drugs, might be available from for-profit suppliers of data, not-for-profit suppliers, or academic papers.

For-profit dataset suppliers effectively sell access to their datasets through a fee-paying license. The license may come with limitations, restrictions, and confidentiality obligations that the buyer must be aware of when engaging in onward development using the data.

When using datasets available in academic papers, authors, or potentially their employers, own copyright and potentially database rights in the creative work as well as the presentation of the data. The paper's publisher may have acquired the author's copyright or taken a license (on complete or limited terms) and may also have generated their own copyright in respect of the published work. To use the data presented in an academic paper or access the raw datasets, the user may be advised to acquire a license from the copyright owner, be it the author and/or the publisher. Such licenses may also come with limitations and fees, which will need to be managed by the user.

Finally, data may be available from not-for-profit or open sources. Copyright or database rights will still subsist in creative works; however, in this scenario, any required license terms may be more permissive, or rights may have been waived.

Thereafter, if the user of such datasets is considering involving third parties in their onward drug development, it may also be good practice to perform proactive diligence (Calvert, n.d.) on the development portfolio, documenting ownership, risks and management strategies in preparation for scrutiny by third parties.

02. Clarity on data risk. Having established access to the supply of data, the user must also consider whether the data will be maintained and updated, whether they need (and have) access to updated datasets and, if so, what risks are associated with losing access to the data. Having understood such limitations and risks, it may also be wise to document, manage and mitigate them.

When considering mitigating options, it may be advisable to partner with multiple suppliers of similar datasets. Alternatively, a ‘buyer’ of data may choose to focus on using only open-source material. Recent trends suggest that the movement towards making medically relevant datasets ‘open’ (Centaur, n.d.) for AI is growing. For example, Stanford University has recently established its Center for Artificial Intelligence in Medicine and Imaging (AIMI) (Stanford, n.d.), and it is rapidly growing its freely accessible repository of annotated medical imaging datasets with the help of a partnership with Microsoft’s AI for Health programme.

The above risks, and particularly the management and mitigation strategies, may also be documented so that, if third-party involvement is envisaged, clarity on risks and management strategies gives partners comfort during transactions (Calvert, IP Scouting & Acquisitions, n.d.).

03. Clarity on onward data mapping. Knowing and keeping in mind the restrictions on how the incoming data can be used, and the permissions obtained, is critical. It is highly advisable to capture all IP (Calvert, IP Discovery, n.d.), both registered and unregistered, such as know-how, copyright and trade secrets, in an IP database along with reference to any use limitation factors, such as contractual licensing terms. Such a database needs to be maintained to be useful, but equally, data must be extractable and presentable in a manner that is easily understandable and preferably visual.
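As a rough illustration of what an entry in such an IP database might capture, the Python sketch below records each asset alongside its use limitations and the incoming datasets it was derived from. Every name in it (IPRecord, RightType, inherited_limitations, and the example assets and supplier) is hypothetical and chosen only for illustration; it is a minimal sketch under those assumptions, not a recommended schema or a substitute for legal review.

```python
from dataclasses import dataclass, field
from enum import Enum


class RightType(Enum):
    """Kinds of IP right an asset may attract (illustrative, not exhaustive)."""
    COPYRIGHT = "copyright"
    DATABASE_RIGHT = "database right"
    TRADE_SECRET = "trade secret"
    KNOW_HOW = "know-how"
    PATENT = "patent"


@dataclass
class IPRecord:
    """One hypothetical entry in an IP database."""
    name: str
    right_type: RightType
    registered: bool
    owner: str
    # Use limitation factors, e.g. terms taken from a data license.
    use_limitations: list[str] = field(default_factory=list)
    # Names of incoming datasets or assets this one was derived from.
    derived_from: list[str] = field(default_factory=list)


def inherited_limitations(name: str, register: list[IPRecord]) -> list[str]:
    """Collect limitations on an asset and on everything it derives from."""
    by_name = {record.name: record for record in register}
    seen: set[str] = set()
    found: list[str] = []
    stack = [name]
    while stack:
        current = stack.pop()
        if current in seen or current not in by_name:
            continue  # skip repeats and sources missing from the register
        seen.add(current)
        record = by_name[current]
        found.extend(record.use_limitations)
        stack.extend(record.derived_from)
    return found


# Hypothetical example: an in-house model built on a licensed dataset.
register = [
    IPRecord(
        name="vendor_binding_dataset",
        right_type=RightType.DATABASE_RIGHT,
        registered=False,
        owner="Hypothetical Vendor Ltd",
        use_limitations=["non-commercial research only"],
    ),
    IPRecord(
        name="in_house_model",
        right_type=RightType.TRADE_SECRET,
        registered=False,
        owner="Own company",
        derived_from=["vendor_binding_dataset"],
    ),
]

limitations = inherited_limitations("in_house_model", register)
```

Tracing limitations through the derived_from links is one way of keeping the onward mapping visible: an output that carries no restriction of its own still surfaces the terms attached to the data it was built on.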


As with any highly data-driven industry, there are pitfalls and risks to be aware of, not only when handling third-party data but when generating one’s own.

Owned data that is generated as a result of one’s own activities (AI or otherwise) may benefit from a number of protections such as copyright, database rights, and trade secrets; however, it is imperative that such data is appropriately managed such that those rights, protections, and potential for exploitation are retained.

Third-party data has the benefit of being a fast means of obtaining large datasets, albeit at potentially high cost. It also carries potential risks associated with how one is allowed to exploit the data internally and commercially, as well as questions around ownership.

Whichever means is chosen for obtaining datasets for further research and development, appropriate management and risk mitigation strategies must be implemented to ensure that the company can exploit that data to achieve its commercial objectives.
