What is a data lake and why does it matter?

Moving past the early hype

When the early hype around data lakes faded, the data lake stopped being confused with a data platform. Instead, it came to be recognized as a repository for multiple collections of varied data that coexist in one convenient location.

Today, data lakes are formally included in enterprise data and analytics strategies. Organizations recognize that the term data lake refers to just one part of the enterprise ecosystem, which includes:

  • Source systems.
  • Ingestion pipelines.
  • Integration and data processing technologies.
  • Databases.
  • Metadata.
  • Analytics engines.
  • Data access layers.

To serve as a comprehensive business intelligence platform that generates high business value, a data lake requires integration, cleansing, metadata management and governance. Leading organizations now take this holistic approach to data lake management. As a result, they can use analytics to correlate data of different types, from different sources, in different structures. That means more comprehensive insights for the company to draw on when making decisions.

Why are data lakes important?

Because a data lake can rapidly absorb all types of new data – while providing self-service access, exploration and visualization – companies can see and respond to new information sooner. They also gain access to data they could not obtain before.

These new data types and sources are available for data discovery, proofs of concept, visualization and advanced analytics. For example, a data lake is the most common data source for machine learning – a technique often applied to logs, clickstream data from websites, social media content, streaming sensor data and data from other Internet-connected devices.
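
As a minimal sketch (independent of any particular product), the pandas snippet below loads raw clickstream events from a data lake and derives simple per-user features that could feed a machine learning model. The lake path and the field names (user_id, page, timestamp) are illustrative assumptions, not a prescribed layout.

```python
import pandas as pd

# Hypothetical path into the lake's raw zone: clickstream events
# stored as JSON Lines, one event per line.
events = pd.read_json("lake/raw/clickstream/2024-06-01.jsonl", lines=True)
events["ts"] = pd.to_datetime(events["timestamp"])

# Derive simple per-user features an ML model might consume:
# event count, distinct pages visited and session span in seconds.
features = events.groupby("user_id").agg(
    n_events=("ts", "size"),
    n_pages=("page", "nunique"),
    span_seconds=("ts", lambda s: (s.max() - s.min()).total_seconds()),
)
print(features.head())
```

The point is that the raw, semi-structured events never had to be modeled into warehouse tables first; the lake makes them available for feature engineering as-is.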

Many companies have long wanted the ability to conduct discovery-oriented exploration, advanced analytics and reporting. A data lake quickly provides the necessary scale and variety of data to do so. It can also serve as a consolidation point for both big data and traditional data, enabling analytical correlations across all of it, as the sketch below illustrates.
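
A hedged sketch of such a correlation, again with hypothetical paths and column names: clickstream events landed in the lake (big data) are joined to a customer table exported from an operational system (traditional data), so behavior can be compared across business segments.

```python
import pandas as pd

# Hypothetical inputs: web activity landed in the lake alongside a
# traditional customer table exported from an operational database.
clicks = pd.read_json("lake/raw/clickstream/2024-06-01.jsonl", lines=True)
customers = pd.read_csv("exports/crm_customers.csv")

# Count events per customer, then join on a shared key so behavioral
# data can be correlated with traditional customer attributes.
activity = clicks.groupby("customer_id").size().reset_index(name="n_events")
joined = activity.merge(customers, on="customer_id", how="inner")

# Average web activity per CRM segment (assumes a "segment" column).
print(joined.groupby("segment")["n_events"].mean())
```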

Although typically used to store raw data, a data lake can also store some of the intermediate, fully transformed, restructured or aggregated data produced by a data warehouse and its downstream processes. This is often done to reduce the time data scientists must spend on common data preparation tasks.
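
For instance, a prepared daily aggregate can be written back into a curated zone of the lake so that each analyst does not repeat the same preparation against the raw events. The sketch below assumes the same hypothetical clickstream layout as above.

```python
import pandas as pd

# Read raw events and compute a daily per-customer aggregate once.
raw = pd.read_json("lake/raw/clickstream/2024-06-01.jsonl", lines=True)
raw["date"] = pd.to_datetime(raw["timestamp"]).dt.date

daily = (
    raw.groupby(["customer_id", "date"])
       .size()
       .reset_index(name="n_events")
)

# Persist to a curated zone; Parquet keeps the prepared copy compact
# and fast to re-read, so downstream users skip the prep step.
daily.to_parquet("lake/curated/clickstream_daily.parquet", index=False)
```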

The same approach is sometimes used to mask or anonymize personally identifiable information (PII) or other sensitive data that is not needed for analysis. This helps companies comply with data security and privacy policies. Access control is another method companies can use to maintain security.
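
One common masking technique is pseudonymization: replacing an identifier with a salted hash before the data is used for analysis, so records stay joinable without exposing the original value. The sketch below is illustrative only; the column names and salting scheme are assumptions, not a specific product's method.

```python
import hashlib

import pandas as pd

# Hypothetical customer extract containing PII.
df = pd.DataFrame({
    "email": ["a@example.com", "b@example.com"],
    "purchases": [3, 7],
})

def pseudonymize(value: str, salt: str = "replace-with-managed-secret") -> str:
    """Replace a PII value with a salted SHA-256 digest so records
    remain joinable without exposing the original identifier."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

# Store a stable pseudonym for correlation, then drop the raw PII.
# In practice the salt must be managed as a secret, not hard-coded.
df["customer_key"] = df["email"].map(pseudonymize)
df = df.drop(columns=["email"])
print(df)
```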