How To Choose The Right Data Storage Strategy for AI
Is it just about Data Warehouses or Data Lakes?
A lot has been written about Data Warehouses and Data Lakes, and about which approach is better for data scientists to build, train, test and deploy their Artificial Intelligence (AI) and Machine Learning (ML) algorithms. Complicating the choice, the latest data management advancements are blurring the line between the Data Warehouse and the Data Lake, while offering the possibility to scale efficiently to exabytes.
In this post we will quickly go over which use-cases the Data Warehouse and Data Lake have traditionally been designed for, and the new data management technologies data scientists can now choose from. We will then focus on which approach would be better for AI.
Data Warehouses and Data Lakes are both used for storing large amounts of data. Both use data extracted from transactional systems, IoT devices and external data sources. Both can keep large amounts of historical data, so data scientists can perform trend analysis and compare today's numbers with past data.
The Data Warehouse We Know
Data Warehouses were designed for querying and advanced analytics. They use highly structured relational data models, well suited for efficient high-speed queries (reading the data). The purpose of the data in the Data Warehouse is pre-defined to meet specific business goals. Hence, only useful data is collected from the source systems. It is then cleaned and processed with those specific business use-cases in mind before it is stored. But, because Data Warehouses are highly structured, they need to be carefully designed, they take a long time to update (writing the data), they are not easily modifiable, and they don't scale as naturally as Data Lakes.
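To make the read-optimized, schema-first nature of the warehouse concrete, here is a minimal sketch in Python. It uses the built-in sqlite3 module purely as a stand-in for a real warehouse engine; the table and column names are illustrative.

```python
# A minimal sketch of the warehouse pattern: the schema is designed up front
# for a known business question, and queries read pre-cleaned, typed rows.
# sqlite3 is only a stand-in for a real (columnar, distributed) warehouse;
# the workload shape (write carefully, read fast) is what matters here.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE daily_sales (
        sale_date   TEXT NOT NULL,   -- data was cleaned before loading
        region      TEXT NOT NULL,
        revenue_usd REAL NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?, ?)",
    [("2023-01-01", "EMEA", 1200.0), ("2023-01-01", "AMER", 980.0)],
)

# The business question was known before the data was loaded,
# so the read path is simple and fast.
for region, total in conn.execute(
    "SELECT region, SUM(revenue_usd) FROM daily_sales GROUP BY region"
):
    print(region, total)
```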
The Data Lake We Know
Data Lakes were designed to store (very) large amounts of raw data. The purpose of the data in the Data Lake doesn't necessarily need to be pre-defined. Users access the data and explore it how they see fit. In most cases, data flows from the source systems into the Data Lake in its natural state, or nearly untransformed. All data is welcome: data that might someday be used and data that may never be used. Data is then stored in a flat unstructured architecture, which enables rapid updates (writing the data) and allows the Data Lake to massively scale horizontally. But, because of their design, Data Lakes are not efficient for high-speed queries (reading the data).
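By contrast, a Data Lake write is little more than appending raw records to cheap storage. The sketch below uses a local directory as a stand-in for an object store such as S3, ADLS or GCS; paths and field names are illustrative.

```python
# A minimal sketch of the lake pattern: events land in their raw form, with
# no schema enforced at write time. A local directory stands in for an
# object store bucket (e.g. s3://my-bucket/raw).
import json
from datetime import date
from pathlib import Path

lake_root = Path("./data-lake")
partition = lake_root / f"events/ingest_date={date.today().isoformat()}"
partition.mkdir(parents=True, exist_ok=True)

raw_events = [
    {"device_id": "sensor-42", "temp_c": 21.7, "ts": "2023-01-01T10:00:00Z"},
    {"device_id": "sensor-42", "temp_c": 21.9, "ts": "2023-01-01T10:01:00Z"},
]

# Fast, append-only writes: the purpose of the data does not need to be
# known yet, so nothing is cleaned or reshaped before it is stored.
with open(partition / "part-0000.json", "w") as f:
    for event in raw_events:
        f.write(json.dumps(event) + "\n")
```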
The New Meta Warehouse
Some organizations opt to run both a Data Warehouse and a Data Lake in order to take advantage of both models and cover all the use-cases data scientists might come up with. Both can coexist just fine, but this approach quickly becomes doubly expensive to maintain and scale.
A more economical approach involves applying more structure to the Data Lake, in order to help speed up the queries. It entails building a Meta Data Warehouse on top of the Data Lake. The Meta Warehouse indexes the data in the Data Lake and applies structured views. The Data Lake can be built using open-source systems like Apache™ Hadoop®, or cloud object stores like AWS® S3, Microsoft® Azure Data Lake Storage (ADLS) or Google Cloud Storage (GCS). These systems can handle any data type and scale very well. Apache™ Hive®, which provides structured views of the data, can be used for the Meta Warehouse. Other systems, like Snowflake® or Delta Lake from Databricks®, also provide these kinds of hybrid solutions. But these architectures have shortcomings. Organizations find themselves needing to clean the data from their operational systems after it is ingested into the Data Lake. As the Data Lake grows larger and larger, cleaning it up becomes very expensive.
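A rough idea of what such a structured view looks like in practice, sketched with PySpark (assumed installed): the raw files stay where they are, and a named view is layered on top so they can be queried with SQL. The path below points at the toy lake from the previous sketch and is purely illustrative.

```python
# A sketch of the "Meta Warehouse" idea: describe the raw files and expose
# them as a structured, SQL-queryable view without moving the data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("meta-warehouse-sketch").getOrCreate()

# Infer a schema over the raw JSON and register it as a named view.
raw = spark.read.json("./data-lake/events/")
raw.createOrReplaceTempView("sensor_events")

# Analysts now get warehouse-style reads on top of the lake.
spark.sql("""
    SELECT device_id, AVG(temp_c) AS avg_temp_c
    FROM sensor_events
    GROUP BY device_id
""").show()
```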
The Hadoop Distributed File System (HDFS), the founding layer of the Hadoop framework, was designed as a distributed file system that works with larger blocks of data than traditional file systems, in order to achieve faster I/O operations on very big data sets. Hive maintains a "map" that keeps track of where the data is stored (which data is in which file). It also manages the SQL queries to the underlying storage layer, e.g. HDFS. Hive is great for terabyte-size databases, but it runs into limitations when your data grows to petabytes or exabytes. Although, in most set-ups, Hive runs the I/O operations on data subsets (as opposed to the whole Data Lake), it still stores all the metadata, including all the different schemas, in one centralized metastore. This centralized metastore can become a bottleneck and prevent Hive from scaling efficiently.
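Partitioning is the mechanism that lets Hive-style tables read only a subset of the lake. The sketch below uses Spark SQL with its built-in catalog as a stand-in for a Hive metastore; table and column names are illustrative.

```python
# A sketch of partition pruning: because ingest_date is a partition column,
# a query that filters on it only touches the files for that day. The
# metastore's "map" is what tells the engine which files those are.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-sketch").getOrCreate()

spark.sql("""
    CREATE TABLE IF NOT EXISTS sensor_readings (
        device_id   STRING,
        temp_c      DOUBLE,
        ingest_date STRING
    )
    USING parquet
    PARTITIONED BY (ingest_date)
""")

# Only the ingest_date='2023-01-01' partition is scanned, not the whole lake.
spark.sql("""
    SELECT device_id, MAX(temp_c) AS max_temp_c
    FROM sensor_readings
    WHERE ingest_date = '2023-01-01'
    GROUP BY device_id
""").show()
```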
A new generation of storage technologies, designed to address these scaling issues, is emerging. One of them is Apache™ Iceberg®. It is designed to handle high-performance queries on huge datasets of hundreds of petabytes. It handles schema evolution with no side effects, offers version rollback, and prevents things like incomplete reads/writes and corrupted data states as tables grow to billions of files and thousands of partitions. The Iceberg "secret sauce" is the decentralization of the metadata store: all metadata is written into files that are stored alongside the data in HDFS or in an object store like S3, ADLS or GCS.
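For a feel of what this looks like, here is a hedged PySpark sketch of an Iceberg table backed by a local "hadoop" catalog. The runtime version, catalog name and warehouse path are assumptions to be matched against your own Spark and Iceberg setup (see the Iceberg documentation for the exact configuration).

```python
# A sketch of an Iceberg table: the table metadata lives in files next to the
# data, not in a single central metastore. Versions, catalog name and paths
# below are placeholders for your own environment.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-sketch")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "./iceberg-warehouse")
    .getOrCreate()
)

spark.sql("""
    CREATE TABLE IF NOT EXISTS local.db.sensor_readings (
        device_id STRING, temp_c DOUBLE, ingest_date STRING
    ) USING iceberg
""")

# Schema evolution is a metadata-only change: no data files are rewritten.
spark.sql("ALTER TABLE local.db.sensor_readings ADD COLUMN humidity DOUBLE")

# Every committed change is a snapshot, which is what enables rollback.
spark.sql("SELECT * FROM local.db.sensor_readings.snapshots").show()
```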
Start Small
But, we shouldn't get carried away. Organizations looking to eventually store and query petabytes for their AI and ML projects should still start with a small data store. This allows them to reduce cleaning, storage and computation costs, and to improve their projects' time-to-market. Data engineers can build small data stores, one for each data scientist's project. Then, over time, the Data Lake can grow as an aggregate of all the smaller data stores.
Data scientists should also ask whether their models need to be mathematically optimal or simply sound enough for the business. Studies have shown that you can get "good enough" accuracy to make sound business decisions with a smaller data set. Using a larger data set may bump your accuracy up and be more mathematically satisfying, but the extra cost and time to get there can often outweigh the business benefits. Additional raw data can always be added later.
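One simple way to test this on your own problem is a learning curve: train on growing fractions of the data and watch where accuracy flattens out. The sketch below assumes scikit-learn is available and uses a bundled toy dataset as a placeholder.

```python
# A learning-curve sketch: if validation accuracy is already flat at a
# fraction of the data, the remaining rows mostly buy mathematical comfort,
# not better business decisions. Dataset and model are placeholders.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=2000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
)

for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:5d} training rows -> {score:.3f} mean CV accuracy")
```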
But, with this start-small approach, how do you determine which data to import from your data sources for AI/ML? Data scientists shouldn't blindly crunch through the whole data store. They should look at the data first and explore it to better understand it, before deciding which data they'll use for their learning models. Another common practice is to download a data subset into a notebook and run different tests before deciding which data to finally use for the AI/ML algorithms and how to model it, as sketched below.
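In practice that first look can be as simple as profiling a sample with pandas before committing to a feature set. The path and column names below are illustrative; pandas is assumed installed.

```python
# A sketch of the "look before you crunch" step: pull a small sample into a
# notebook, profile it, then decide which columns and rows the model needs.
import pandas as pd

# Read only a slice of one raw file rather than the whole data store.
sample = pd.read_json(
    "./data-lake/events/ingest_date=2023-01-01/part-0000.json", lines=True
).head(10_000)

print(sample.describe(include="all"))   # ranges, cardinality, obvious junk
print(sample.isna().mean())             # share of missing values per column

# A candidate feature set for a first model, chosen from what we saw above.
candidate_features = ["device_id", "temp_c"]
```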
The Orchestration Platform
So, should organizations use a Data Warehouse, a Data Lake or a Meta Warehouse to run their AI/ML models? The answer is yes. That answer only sounds ludicrous because which approach to choose is not the right question to ask.
The more important point is that data scientists should not be bothered by this question.
They should concentrate on building, testing and scoring their models and finding the best insights. They do not care whether the data is stored in a Data Warehouse or a Data Lake. They expect the fast queries of the Data Warehouse along with the fast uploads and scalability of the Data Lake.
The key is to use an AI orchestration platform that can automatically manage the underlying data infrastructure for the data scientists. This platform should be able to configure, secure and automate the whole data infrastructure life-cycle, including its maintenance and scalability.
This AI orchestration platform should separate the storage layer from the compute layer and have a decentralized metadata store with indexes pointing to a collection of smaller data stores. As seen above, separating the storage layer from the compute layer offers data scientists the ability to scale their models to petabytes. It also provides them with the flexibility to use the data store of their choice, or a mix of different data store types, based on their use-cases. For example, they could use specialized data stores like a hierarchical database, a document database, a Time Series Database (TSDB), a graph database management system, or a real-time database to handle real-time queries on data streams.
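What follows is a deliberately tiny, hypothetical sketch of that idea, not a real product API: a metadata registry records where each logical dataset lives, so models ask the registry for data instead of talking to a specific warehouse or lake. All names and URIs are made up.

```python
# A hypothetical metadata registry: indexes pointing at many smaller stores,
# so the storage layer can change without touching the model code.
from dataclasses import dataclass

@dataclass
class DatasetEntry:
    name: str        # logical dataset name the data scientist uses
    store_type: str  # e.g. "object_store", "tsdb", "document_db"
    location: str    # URI of the underlying (smaller) data store

REGISTRY = {
    "sensor_events": DatasetEntry("sensor_events", "object_store",
                                  "s3://acme-lake/events/"),
    "device_metrics": DatasetEntry("device_metrics", "tsdb",
                                   "tsdb://metrics.internal/devices"),
}

def resolve(name: str) -> DatasetEntry:
    """Models ask for a dataset by name; the registry says where it lives."""
    return REGISTRY[name]

print(resolve("sensor_events").location)
```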
The AI orchestration platform should also accommodate data scientists and offer them the flexibility to use one database for experimentation and a different one for production. It should also link the different files and metadata back to the data sources, in order to offer data scientists a global view of the data, and automatically map the data and AI pipelines to the data stores as data grows.
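The experimentation-versus-production split can be expressed in the same registry-like way. The sketch below is hypothetical; the environment variable, dataset names and URIs are made up.

```python
# A hypothetical sketch: the same logical dataset resolves to a cheap local
# store while experimenting and to the production store once deployed.
import os

DATA_SOURCES = {
    "experiment": {"sensor_events": "sqlite:///./scratch/sensor_events.db"},
    "production": {"sensor_events": "s3://acme-lake/events/"},
}

def dataset_uri(name: str) -> str:
    env = os.environ.get("PIPELINE_ENV", "experiment")
    return DATA_SOURCES[env][name]

print(dataset_uri("sensor_events"))
```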
Conclusion
Instead of focusing on whether to use a Data Warehouse or a Data Lake, data scientists should consider using a smart AI orchestration platform that will allow them to focus on building, training and deploying their machine learning algorithms at scale. To manage size and avoid paying for overblown data storage requirements, the AI orchestration platform should let your organization start small, create new projects, add new data store types and smoothly manage your data growth to petabytes, using the latest data management advancements, such as the ones mentioned above. Thanks to the metadata, the AI orchestration platform can also automate many steps and triggers along the whole data and AI pipeline.
The AI orchestration platform of the future will even be able to decide automatically which kind of data store is best to store your data in, based on your use-cases. This type of AI orchestration platform will bring data scientists better flexibility, greater control, and quicker business value while saving on storage, CPU and RAM expenses as their projects scale.