We often help our clients with developing the architecture of their data platform. But what is a data platform and which components should be part of it? And why should you have a data platform?
Normally, data is created and stored in different IT systems such as the Enterprise Resource Planning (ERP) system, the Customer Relationship Management (CRM) system or the manufacturing system. For your operational processes this is a sound approach, but it becomes a burden when you want to use the data for other purposes. For example, it is difficult to combine data from different IT systems (such as ERP and CRM systems) or to make the data available for analytics or reporting. A data platform provides an alternative: you can access the data from different sources, combine it, store it, analyze it and report on the combined data.
The purpose of a data platform is to collect, store, transform and analyze data and make that data available to (business) users or other systems. It is often used for business intelligence, (advanced) analytics (such as machine learning) or as a data hub.
The platform consists of several components that can be grouped into common layers, each with its own function. These layers are: Data Sources, Ingestion Layer, Processing Layer, Storage Layer, Analytics Layer, Visualization Layer, Security, and Data Governance (Figure 1).
Figure 1 – Layers of a Data Platform
The purpose of the different layers is briefly described below. Keep an eye out for our follow-up blogs where we will discuss each layer of the data platform in extensive detail.
Data Sources
This layer contains the different sources of the data platform. This can be any information system, like ERP or CRM systems, but it can also be other sources like Excel files, text files, pictures, audio, video or streaming sources like IoT devices.
Ingestion Layer
The ingestion layer is responsible for loading the data from the data sources into the data platform. This layer is about extracting data from the source systems, checking the data quality and storing the data in the landing or staging area of the data platform.
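The three steps described above (extract, check quality, store in staging) can be illustrated with a minimal sketch. This is not a reference implementation: the StagingArea class, the ingest function and the "required fields" quality rule are hypothetical names and assumptions chosen for illustration, and the staging area is simply held in memory.

```python
from dataclasses import dataclass, field


@dataclass
class StagingArea:
    """Hypothetical in-memory stand-in for the landing/staging area."""
    accepted: list = field(default_factory=list)
    rejected: list = field(default_factory=list)


def passes_quality_check(record: dict, required_fields: list) -> bool:
    """Assumed quality rule: every required field is present and non-empty."""
    return all(record.get(name) not in (None, "") for name in required_fields)


def ingest(source_records: list, required_fields: list, staging: StagingArea) -> StagingArea:
    """Load extracted records into staging, keeping failed records separate."""
    for record in source_records:
        if passes_quality_check(record, required_fields):
            staging.accepted.append(record)
        else:
            staging.rejected.append(record)
    return staging


# Illustrative records, as they might be extracted from a CRM system.
crm_extract = [
    {"customer_id": "C001", "name": "Acme"},
    {"customer_id": "", "name": "No Id Ltd"},
]
staging = ingest(crm_extract, ["customer_id", "name"], StagingArea())
print(len(staging.accepted), len(staging.rejected))
```

Keeping rejected records instead of dropping them is a design choice in this sketch: it allows data quality issues to be reported back to the owner of the source system.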
Processing Layer
The processing layer is responsible for transforming the data so that it can be stored in the correct data model. Processing can be done in batches (scheduled at a specific time or day) or in real time, depending on the type of data source and the requirements for data availability.
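The difference between batch and real-time processing can be sketched as follows. The transform function and the target field names (customer_id, customer_name) are assumptions made for this example; the point is only that the same transformation can be applied to a whole batch at once or to each record as it arrives.

```python
def transform(record: dict) -> dict:
    """Map a staged record onto a hypothetical target data model."""
    return {
        "customer_id": record["customer_id"].strip().upper(),
        "customer_name": record["name"].strip(),
    }


def process_batch(staged_records: list) -> list:
    """Batch mode: transform everything staged since the last scheduled run."""
    return [transform(record) for record in staged_records]


def process_stream(events):
    """Real-time mode: transform each event as soon as it arrives."""
    for event in events:
        yield transform(event)


staged = [{"customer_id": " c001 ", "name": " Acme "}]
batch_result = process_batch(staged)
stream_result = list(process_stream(iter(staged)))
print(batch_result)
```

In practice the batch function would be triggered by a scheduler and the streaming function would read from a message queue or event stream; both are left out here to keep the sketch self-contained.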
Storage Layer
The data is stored in the storage layer. This can be a relational database or another storage technology such as cloud storage, Hadoop, a NoSQL database or a graph database.
Analytics Layer
In the analytics layer the data is further processed (analyzed). This can be all kinds of (advanced) analytics algorithms, for example for machine learning. The outcome of the analytics can be sent to the visualization layer or stored in the storage layer.
Visualization Layer
The data is presented to the end user in the visualization layer. This can be in the form of reports, dashboards, self-service BI tooling or APIs so that the data can be used by other systems.
Centralized or not?
An important decision to consider is whether to use a centralized data platform (data fabric) or a decentralized data platform (data mesh).
In a data fabric, all company data is stored, processed and accessed through a central data platform that contains the data from all departments or data domains.
In a data mesh, the data from the different departments or data domains is stored, processed and accessed through multiple local (decentralized) platforms. There is not one (centralized) data platform, but multiple data platforms that each serve the data for a specific department or domain.
Security
One of the important tasks of a data platform is to guarantee that only users who are allowed to use the data have access to it. A common method is user authentication and authorization, but it can also be required that the data is encrypted (at rest and in transit) and that all activities on the data are audited so that it is known who has accessed or modified which data.
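Authorization and auditing can be combined in a single access check, as in the minimal sketch below. The role names, the PERMISSIONS table and the access function are hypothetical; a real platform would typically rely on the access control and audit features of its database or cloud provider. Authentication and encryption are outside the scope of this example.

```python
from datetime import datetime, timezone

# Hypothetical role-based permissions: role -> dataset -> allowed actions.
PERMISSIONS = {
    "analyst": {"sales": {"read"}},
    "data_engineer": {"sales": {"read", "write"}},
}

# Every access attempt is recorded here, whether it was allowed or not.
audit_log = []


def access(user: str, role: str, dataset: str, action: str) -> bool:
    """Check whether the role may perform the action and audit the attempt."""
    allowed = action in PERMISSIONS.get(role, {}).get(dataset, set())
    audit_log.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "role": role,
        "dataset": dataset,
        "action": action,
        "allowed": allowed,
    })
    return allowed


read_ok = access("alice", "analyst", "sales", "read")
write_ok = access("alice", "analyst", "sales", "write")
print(read_ok, write_ok, len(audit_log))
```

Denied attempts are logged as well, so the audit trail shows not only who accessed or modified which data, but also who tried to.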
Data Governance
Data governance is about making the data findable in a data catalog, collecting and storing metadata about the data, managing the master data and/or reference data, and providing insight into where the data in the data platform originates from (i.e., data lineage).
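Data lineage can be made concrete with a small example. The sketch below assumes a hypothetical catalog in which each dataset records its owner and the datasets it is derived from; the dataset names and the trace_sources function are invented for illustration. Following the upstream references recursively answers the question of where a report's data originates.

```python
# Hypothetical data catalog: dataset name -> metadata, including the
# datasets it is directly derived from ("upstream").
CATALOG = {
    "erp.orders": {"owner": "finance", "upstream": []},
    "crm.customers": {"owner": "sales", "upstream": []},
    "dwh.customer_orders": {
        "owner": "data_team",
        "upstream": ["erp.orders", "crm.customers"],
    },
    "report.revenue_per_customer": {
        "owner": "bi_team",
        "upstream": ["dwh.customer_orders"],
    },
}


def trace_sources(dataset: str, catalog: dict) -> set:
    """Return the original source datasets that feed the given dataset."""
    upstream = catalog[dataset]["upstream"]
    if not upstream:
        # A dataset without upstream references is itself a source.
        return {dataset}
    sources = set()
    for parent in upstream:
        sources |= trace_sources(parent, catalog)
    return sources


origins = trace_sources("report.revenue_per_customer", CATALOG)
print(sorted(origins))
```

This sketch assumes the lineage graph contains no cycles; a catalog that allows circular references would need to track visited datasets.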
When an architecture for a data platform is developed, it is often a brownfield development. This means that some components are already in place, or that some components need to be enhanced so that they can become part of the data platform.
One of the many benefits of having a data platform is that all organizational data is accessible from one central place, giving one holistic view of the company. This does not mean that all data should be physically stored in one location (there are different concepts on how to store what kind of data), but it means that from a logical viewpoint all data is accessible in one place.
In the upcoming blogs we will discuss the different layers of the data platform in more detail. The next blog will be about the Data Sources and Ingestion Layer. We invite you to read the next blog to learn more about the different options to ingest data from various data sources.
Deloitte's Data Modernization & Analytics team helps clients with modernizing their data-infrastructure to accelerate analytics delivery, such as self-service BI and AI-powered solutions. This is done by combining best practices and proven solutions with innovative, next-generation technologies, such as cloud-enabled platforms and big data architectures.
Brownfield vs greenfield
A brownfield development of a data platform reuses (parts of) the existing components.
A greenfield development of a data platform builds a completely new platform without using any of the existing components.
Would you like to know more about developing the architecture of data platforms? Please contact Martijn Blom via +31 (0)88 2880720 or Ingrid Lanting on +31 (0)88 288 04 98.