In the previous blogs in this series about the Layered Architecture for Data Platforms, we introduced the overall architecture, dove deeper into the Data sources and Ingestion layer, and discussed the Processing layer. In this blog we look at the storage layer. In general, the storage layer stores the data in a data model so that the data can be analyzed or reported on.
After the data is ingested from the data sources and transformed in the processing layer, it is stored in the storage layer. The purpose of the storage layer is to protect the data against disasters, malfunctions or user errors, to make the data available to developers, data scientists and end-users, and to archive data that needs to be kept for a long period of time.
There are many different technologies that can be used to store the data, each with its own advantages and drawbacks depending on the type of storage you need. The most common storage technologies are:
Cloud storage
Relational databases
Hadoop
In-memory databases
MPP (massively parallel processing) databases
NoSQL databases
Most of the products from different vendors in the above categories are much alike, so for most use cases it really does not make much of a difference which product and vendor is chosen. This is not the case for the NoSQL databases. There are currently a few hundred NoSQL databases and they each have different properties that make them useful for certain use cases. In general, these NoSQL databases can be categorized into 4 different groups (see figure 2):
Key-value stores
Document stores
Column-oriented stores
Graph databases
Even within each group, however, there are many differences between the individual databases.
A data platform can use multiple NoSQL databases or multiple storage technologies for different purposes or different types of data. However, before deciding on the storage technology you want to use, you need to consider the purpose of the storage layer. This can be:
For each of these seven possible purposes, there is a specific storage technology that works best (as seen in Table 1).
A data platform often serves multiple purposes, which means it can combine several storage technologies, each serving a different purpose. As you can see in Table 1, cloud storage is good for most purposes except when the data changes frequently; relational databases are good for most workloads, except when they involve unstructured data or when high performance is needed. Hadoop is a good choice when you need to store unstructured data or when you need a low-cost long-term storage solution. In-memory databases are especially good when performance is very important and the amount of (structured) data is small. MPP databases are a good choice when you need to store and process huge amounts of structured data. NoSQL databases are not a good fit for most purposes, but they can be a very good choice for certain specific use cases.
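To make this reasoning a bit more tangible, the sketch below captures the fit described above as a small lookup. It is purely illustrative and not a reproduction of Table 1; the purpose labels and function name are shorthand introduced here.

```python
# Illustrative only: a rough summary of the fit described above,
# not a copy of Table 1.
STORAGE_FIT = {
    "unstructured data": ["cloud storage", "Hadoop"],
    "low-cost long-term storage": ["Hadoop"],
    "high performance on small, structured data": ["in-memory database"],
    "huge amounts of structured data": ["MPP database"],
    "general-purpose workloads": ["cloud storage", "relational database"],
}

def candidate_technologies(purpose: str) -> list[str]:
    """Return the storage technologies that, per the discussion above, fit a purpose."""
    # For purposes not covered here, a specific NoSQL database may still be a good fit.
    return STORAGE_FIT.get(purpose, ["evaluate a specific NoSQL database"])

print(candidate_technologies("unstructured data"))  # ['cloud storage', 'Hadoop']
```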
When considering which storage technologies to use, you can ask yourself the following questions:
Which type(s) of data do I need to store? Is it structured, semi-structured or unstructured data? Look not only at the current situation but also at future requirements.
What are the performance requirements of the storage technology? Does it only need to be fast enough to store the incoming data, or should it also be fast enough to serve the data to the end-users?
What are the scalability requirements? Do you have a very stable, predictable workload or will the workload vary greatly?
Do you want to store the data in the cloud? This question can be related to where your data is generated and where you want to consume it. It can also be a regulatory question whether you are allowed to store the data in the cloud.
Do you want to prevent vendor lock-in? Vendor lock-in can be reduced by using industry standards and open source technologies. Also, some vendors provide storage technologies that work on multiple cloud providers.
The storage layer is one of the places where security plays an important role, which will be covered in more depth in a follow-up blog about the security layer. One of the most important decisions is whether security is enforced at the storage layer or at other layers such as the analytics layer and/or the visualization layer. If security is enforced at the storage layer, you can ensure that only people with the right permissions can access the data, independent of the tool they use. If security is enforced in the analytics layer and/or the visualization layer, you must ensure that those layers cannot be bypassed to access the data in the storage layer directly. There are advantages and drawbacks to each option, which will be discussed in the blog about the security layer. The storage layer can also play another important role in security: many storage technologies support auditing of all activities. This produces a log of which users have accessed which data, and it can raise alerts when certain users perform specific activities.
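As a minimal sketch of what "enforced at the storage layer" can look like in practice, the example below uses PostgreSQL grants and row-level security via psycopg2. The table, role and setting names (sales, analyst, app.current_region) are hypothetical, and other databases offer comparable mechanisms.

```python
# A minimal sketch, assuming a PostgreSQL database and hypothetical
# table/role names; connection details are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=dwh user=admin")
with conn, conn.cursor() as cur:
    # Revoke broad access and grant read-only access to the analyst role.
    cur.execute("REVOKE ALL ON sales FROM PUBLIC;")
    cur.execute("GRANT SELECT ON sales TO analyst;")

    # Row-level security: analysts only see rows for their own region,
    # regardless of which analytics or visualization tool they connect with.
    cur.execute("ALTER TABLE sales ENABLE ROW LEVEL SECURITY;")
    cur.execute("""
        CREATE POLICY sales_region_policy ON sales
            FOR SELECT
            USING (region = current_setting('app.current_region'));
    """)
conn.close()
```

Because the grants and the policy live in the database itself, they apply no matter which tool is used to access the data, which is exactly the advantage of enforcing security at the storage layer.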
In this blog we described the different storage technologies and the purposes for which storage is needed in a data platform. Some storage technologies are a better fit for certain purposes than others, as shown in Table 1. We have also introduced some standard questions that can help you decide which storage technologies to use.
Deloitte can help you choose which storage technologies to use for your data platform, and we can also help you implement them. Our next blog will be about the Analytics Layer. If you want to know more about how the data can be analyzed, please read the next blog in our series about the Layered Architecture.
Deloitte's Data Modernization & Analytics team helps clients with modernizing their data infrastructure to accelerate analytics delivery, such as self-service BI and AI-powered solutions. This is done by combining best practices and proven solutions with innovative, next-generation technologies, such as cloud-enabled platforms and big data architectures.
Best Practices
Design a backup and restore strategy.
Think about data retention. How long should the data be kept? How long is it legally required and allowed to keep the data?
Make a choice between schema-on-read and schema-on-write. Schema-on-write means that all incoming data is modelled before it is stored, while schema-on-read means the data is stored as-is and only modelled when it is used. A data warehouse often uses schema-on-write, while a data lake uses schema-on-read (see the sketch below).
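As a minimal sketch of the schema-on-read versus schema-on-write distinction, assuming a Spark-based platform; the paths and column names are purely illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("schema-on-read-vs-write").getOrCreate()

# Schema-on-read: the raw files are stored as-is in the data lake and a
# schema is only applied when the data is read for use.
raw = spark.read.json("s3://data-lake/raw/events/")   # schema inferred at read time
orders = raw.select("order_id", "amount", "country")

# Schema-on-write: types and structure are enforced before the data is
# written, as is typical for a data warehouse table.
modelled = orders.select(
    F.col("order_id").cast("string"),
    F.col("amount").cast("double"),
    F.col("country").cast("string"),
)
modelled.write.mode("append").parquet("s3://warehouse/orders/")
```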