Two thoughts on data lakes
Because we are still in the early stages, today’s opinion of data lakes is anything but universal. At a high level, there are two schools of thought. One group considers the data lake not only important but essential for data-driven companies. This group understands the limitations of traditional data stores – chiefly that they were not built to handle large streams of unstructured data. What’s more, the difference between schema “on write” and schema “on read” is not simply a matter of semantics. On the contrary, the latter lends itself to much faster response times and, by extension, faster analysis.
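To make the schema-on-read idea concrete, here is a minimal Python sketch; the event fields and the `read_with_schema` helper are hypothetical, not any particular product’s API. Records land in the lake exactly as they arrive, and each analysis imposes only the structure it needs at read time.

```python
import json

# Raw events land in the lake as-is: nothing enforces a schema at write time,
# so records with missing or extra fields are stored without complaint.
raw_events = [
    '{"user": "amy", "action": "click", "ts": 1700000000}',
    '{"user": "bob", "action": "view"}',  # missing "ts" -- still accepted
    '{"user": "cal", "action": "click", "ts": 1700000050}',
]

def read_with_schema(lines, fields):
    """Apply a schema 'on read': project each raw record onto the
    fields the current analysis cares about, tolerating gaps."""
    for line in lines:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

# Two analyses can read the same raw data through different schemas.
clicks = list(read_with_schema(raw_events, ["user", "ts"]))
actions = list(read_with_schema(raw_events, ["user", "action"]))
```

A schema-on-write system would have rejected the second record at load time; here the decision about structure is deferred until someone actually asks a question of the data.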
It’s a point of view, and I happen to agree with it. To be fair, we haven’t reached an industry consensus here – far from it. Skeptics of data lakes are not shy about their opinions. Cynics see the data lake as a buzzword hyped by software vendors with a serious stake in the game. Furthermore, some consider the data lake a new name for an old concept of limited utility for their businesses.
Adding to the legitimate confusion surrounding the topic, few people use the term “data lake” consistently. Some call any data preparation, storage, or discovery environment a data lake.
Parallels with Hadoop and relational databases
To appreciate the need for data lakes, it might be best to think of Hadoop – the open-source, distributed file system that organizations are increasingly adopting. Hadoop took off for many reasons, not least of which is that it met a real need that relational database management systems (RDBMSs) could not. To be fair, its open-source nature, fault tolerance, and parallel processing are also high on the list.
RDBMSs were simply not designed to handle gigabytes or petabytes of unstructured data. Try loading thousands of photos, videos, tweets, articles, and emails into your traditional SQL Server or Oracle database, then running reports or writing SQL statements against them. Good luck with that.
For decades, data warehouses have handled even large volumes of structured data exceptionally well: employee lists, sales, transactions, and the like. They feed countless business intelligence and reporting applications. However, it is unreasonable to expect those same data warehouses to effectively process fundamentally different data volumes, velocities, and varieties.
A note on metadata
Data lakes rely on ontologies and metadata to make sense of the data loaded into them. Again, methodologies vary, but generally each data element in a lake is assigned a unique identifier and tagged with extensive metadata (tags). The bottom line: the data lake is here to stay.
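As a rough illustration of that tagging scheme, here is a Python sketch; the in-memory catalog, the tag names, and the `ingest` helper are my own invention for the example, not a standard data-lake API.

```python
import uuid

# A toy metadata catalog: element id -> payload plus descriptive tags.
catalog = {}

def ingest(payload, **tags):
    """Assign a data element a unique identifier and attach
    metadata tags so it can be discovered later without
    inspecting the raw bytes."""
    element_id = str(uuid.uuid4())
    catalog[element_id] = {"payload": payload, "tags": tags}
    return element_id

# Hypothetical elements of very different shapes live side by side.
vid = ingest(b"...raw video bytes...", source="marketing", kind="video")
doc = ingest("quarterly summary text", source="finance", kind="document")

# Discovery queries run against the tags, not the payloads themselves.
videos = [i for i, e in catalog.items() if e["tags"].get("kind") == "video"]
```

The point is that the identifier and tags, not a rigid table schema, are what let heterogeneous elements coexist and still be found.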
The bright future of the data lake
There is little doubt in my mind that the data lake will occupy an increasingly important place in the future of data management. Organizations will continue to integrate “small” data with its big counterpart, and foolish is the soul who believes that a single application – no matter how expensive or robust – can handle everything.
When a business question arises, users will increasingly need answers faster than traditional data warehousing and reporting stalwarts can provide. Used correctly, data lakes let users analyze smaller data sets and quickly answer critical questions.