With the growing number of data sources and the need for agility, a decentralized data architecture concept, Data Mesh, can be explored to enforce data quality and governance adherence. Data Mesh achieves this by decentralizing data responsibility to the domain level and making only high-quality, transformed data available as a product.
By Jarvin Mutatiina and Ernst Blaauw
Every year more data is produced globally. This also holds for companies: more details than ever are recorded about customers, partners, transactions, products and the supply chain, resulting in more data. According to IDC, “the global datasphere will grow from 45 zettabytes in 2019 to 175 by 2025”. This data forms the raw material from which organizations draw valuable, actionable insights. But the collection, integration and governance of this data remains one of the main challenges and inhibitors, as established in recent research by Deloitte.
Many organizations are now looking at a relatively new concept called “Data Mesh” to overcome these challenges and inhibitors. They are realizing that flexible access to data, with the critical benefit of decreased time-to-market, can be achieved by focusing on domain-specific data products enabled by common support functions. Data Mesh leverages concepts from newer architectural approaches (e.g. the service mesh) and focuses on data management rather than connectivity and orchestration. So what is Data Mesh and what are its benefits?
The first paradigm for getting to a reliable, integrated and central data repository was the data warehouse. Data warehouses essentially boiled down to copying operational data into a centralized, harmonized, well-defined repository intended to serve as a “single source of truth”. That turned out to be mostly inflexible and not well suited to the era of “Big Data”, in which data grew in volume, variety and velocity. The data lake concept was invented to capture raw data from various sources in a single repository, from which various data layers could be built to suit multiple use cases. The data lake was better suited to supporting a variety of “big data” (e.g. data streaming, NoSQL database technologies, etc.).
However, data lakes also did not always deliver on their promise. As they became increasingly complex with vast amounts of data, the process of creating new data products that adhere to company standards could take too much time. Business teams found ways to circumvent the central IT organization so that their projects could continue. However, this resulted in non-compliant solutions, in other words shadow IT. Non-compliant solutions might provide initial results faster, but they are not sustainable for production environments and therefore inhibit the application of analytical insights at scale.
Data lakes and data warehouses share the property that data processing pipelines are mostly managed by centralized IT teams and that data is stored in a centralized location. As data volumes grow, so does the complexity of the data landscape, inevitably resulting in centralized systems failing to cope with the drastically increased scalability and agility needs of the organization.
This model does not always translate well to a typical organization: different business domains know best what is in their data, yet it is supposed to be managed centrally. Central IT teams are very busy trying to keep up with all the requests from the company, but most of the time backlogs grow rather than shrink. Domain knowledge is not available when it is needed, leading to lower-quality deliveries. Here the concept of Data Mesh might offer a way to address the disadvantages of data warehouses and data lakes without losing the investments made so far.
Why is it so popular now?
Data Mesh is a relatively new concept (it emerged around 2019, originated by Zhamak Dehghani) and it is gaining popularity. It has been of great interest to enterprises seeking fast time-to-market while their data sources and volumes grow. This is achieved by decentralizing data responsibility to the domain level and making only high-quality, transformed data available as a product. Business domain knowledge is preserved while the data is also made available to the rest of the business. Data engineers no longer have to sift through unfamiliar data, often dumped into data lakes from multiple sources. The proposed architecture aims to ease the often strained collaboration between data experts and data owners, which stems from the growing domain-specific business acumen needed to derive value from data.
The Data Mesh concept is a democratized approach to managing data in which different business domains operationalize their own data, backed by a central, self-service data infrastructure. The infrastructure comprises data pipeline engines, storage and computing capabilities that are bundled as illustrated in Figure 1.
Rather than treating enterprise data as one huge repository, Data Mesh considers it a set of repositories of data products. A business domain (e.g. “Finance”) provides its data as a product: ready to use for analysis, discoverable and reliable. This way, the data product owner is an actual business domain representative with deep domain knowledge, as illustrated in the Data Product layer in Figure 2. Thus, no domain-specific knowledge gets lost, as it could in the translation towards a data warehouse or lake, and no bottleneck occurs at the central data engineering team.
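To make the idea of a domain-owned data product more concrete, the minimal Python sketch below shows what a data product descriptor published by a domain team could look like. The field names, the example product and the contact address are illustrative assumptions, not a prescribed standard.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical descriptor a domain team could publish alongside its data product.
# Field names and structure are illustrative assumptions, not an industry standard.
@dataclass
class DataProductDescriptor:
    name: str                 # e.g. "finance.monthly-revenue"
    domain: str               # owning business domain, e.g. "Finance"
    owner: str                # domain representative accountable for the product
    description: str          # human-readable purpose, aids discoverability
    output_ports: List[str]   # how consumers access the data (table, API, file path)
    sla_freshness_hours: int  # agreed maximum age of the data
    version: str = "1.0"

# Example: the Finance domain exposes consolidated revenue figures as a product.
monthly_revenue = DataProductDescriptor(
    name="finance.monthly-revenue",
    domain="Finance",
    owner="finance-data-team@example.com",
    description="Consolidated monthly revenue per business unit, ready for analysis.",
    output_ports=["warehouse://finance/monthly_revenue"],
    sla_freshness_hours=24,
)
```

In this sketch, ownership and the service level (freshness) travel with the product itself rather than being held by a central team.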
Different types of data consumers, such as data scientists and business analysts, have direct access to the relevant data product(s) on the basis of service level agreements.
The data products are also self-explanatory, in the sense that each product is discoverable and described, so it can be used in a “plug and play” fashion without the need for complex data transformation functions such as those known from the data warehouse and data lake concepts. By ensuring all data products have the same format, data governance guidelines are enforced across the domain data products within the mesh. The industry standards for governance are illustrated in the federated data governance layer in Figure 3.
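As an illustration of how a shared product format can carry governance rules, the hedged sketch below adds a simple federated check that every product's metadata declares the same mandatory governance fields. The required fields are example assumptions; a real mesh would agree on its own set.

```python
# Hypothetical federated governance check: because every domain publishes products
# in the same format, a lightweight shared function can verify governance fields.
REQUIRED_GOVERNANCE_FIELDS = {"data_classification", "retention_days", "pii_flag"}

def validate_governance(product_metadata: dict) -> list:
    """Return a list of governance violations for a data product's metadata."""
    missing = REQUIRED_GOVERNANCE_FIELDS - product_metadata.keys()
    return [f"missing governance field: {name}" for name in sorted(missing)]

# Example usage with a product that forgot to declare a retention period.
issues = validate_governance({"data_classification": "internal", "pii_flag": False})
print(issues)  # ['missing governance field: retention_days']
```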
The three layers (the distributed data product layer, federated data governance and the self-service data infrastructure) interact to form the Data Mesh reference card shown in Figure 4:
There are direct benefits for an organization adopting this architectural concept:
Despite the benefits that Data Mesh is expected to bring, its decentralization in particular introduces a couple of challenges. Difficulties managing the many data products and their corresponding metadata may very well lead to a mess of spaghetti data pipelines. Below are some potential improvement points for Data Mesh:
When does it make sense to use Data Mesh?
A Data Mesh strategy could benefit organizations that have a diverse data landscape with many different business domains. It can help organizations that are highly decentralized, or looking to become so, as the mesh structure allows different teams to manage their own data and make only quality data available to the rest of the organization as a product.
Other organizational and operational use cases include faster data delivery, a high number of data sources, rapidly changing business goals and easier migrations during mergers and acquisitions. Decentralized units can be viewed as operational, functional or regional divisions that share a common goal, regardless of organization size.
Following the discussion of the Data Mesh concept, its benefits, impact points and considerations, it should be clear what value adopting this architecture could bring to your organization’s operational and organizational competitive edge.
Do you want to know more about Data Mesh? Please contact Stefan van Duin at +31 (0) 88 288 4754.