Data integration: It’s not what it used to be

Data integration involves combining multiple data sources to present unified results. The term data integration used to refer to a specific set of data warehousing processes called “extract, transform, load,” or ETL. ETL generally consisted of three stages:

  • Extracting data from multiple sources and moving it to a staging area.
  • Applying a number of transformations, including data standardization and cleansing (where data values are mapped to corresponding standard formats), followed by reorganizing the data into a format suitable for loading into a target data warehouse.
  • Loading the transformed data into an analytical data warehouse environment.
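
The three stages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source rows, the country-code mapping and the function names are all hypothetical, and a plain list stands in for the data warehouse.

```python
from datetime import date

# Hypothetical source extracts; real ones would come from files or databases.
CRM_ROWS = [{"id": "1", "country": "usa", "signup": "2021-03-05"}]
WEB_ROWS = [{"id": "2", "country": "U.S.", "signup": "2021-04-11"}]

# Hypothetical standardization table mapping raw values to one standard code.
COUNTRY_CODES = {"usa": "US", "u.s.": "US", "united states": "US"}


def extract(*sources):
    """Stage 1: copy rows from every source into a single staging area."""
    staging = []
    for rows in sources:
        staging.extend(dict(row) for row in rows)
    return staging


def transform(staging):
    """Stage 2: standardize values and reshape rows for the target table."""
    transformed = []
    for row in staging:
        raw_country = row["country"].strip().lower()
        transformed.append({
            "customer_id": int(row["id"]),
            "country": COUNTRY_CODES.get(raw_country, "UNKNOWN"),
            "signup_date": date.fromisoformat(row["signup"]),
        })
    return transformed


def load(rows, warehouse):
    """Stage 3: append the transformed rows to the target (a list here)."""
    warehouse.extend(rows)
    return len(rows)


warehouse = []
loaded = load(transform(extract(CRM_ROWS, WEB_ROWS)), warehouse)
```

Both source rows end up with the same standardized country code, which is the point of the cleansing step.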

For many data warehousing professionals, the phrase data integration is synonymous with ETL. Over time, however, the techniques and approaches used to move data from original sources into a data warehouse have been adopted for many other data management scenarios. Today, the concept of data integration is much broader. And frankly, it’s more robust than its limited use for populating data warehouses would suggest.

An evolution

One of the first innovative twists was to rethink the traditional sequence of operations. Instead of extracting, transforming and loading, some environments chose to extract the data, load it into the target environment, and then apply the transformations. This approach, called “ELT” (extract, load, transform), not only removes the need for an intermediate staging platform; it also allows for more consistent transformations, because all the extracted data sets are available for review at the same time within the data warehouse context. In addition, the ELT approach accommodates the inclusion and transformation of data from real-time sources together with conventionally produced data extracts.
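
As a rough sketch of the ELT ordering, the example below loads raw, untransformed rows into the target first and then runs the transformation as SQL inside the target itself. An in-memory SQLite database stands in for the data warehouse, and the table and column names are assumptions made for illustration.

```python
import sqlite3

# An in-memory SQLite database stands in for the target warehouse.
conn = sqlite3.connect(":memory:")

# Load: raw values go into the target exactly as extracted, untransformed.
conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, currency TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("A1", " 10.50", "usd"), ("A2", "7", "USD "), ("A3", "3.25", "eur")],
)

# Transform: cleansing happens inside the target, with all loaded rows visible.
conn.execute(
    """
    CREATE TABLE orders AS
    SELECT order_id,
           CAST(TRIM(amount) AS REAL) AS amount,
           UPPER(TRIM(currency))      AS currency
    FROM raw_orders
    """
)

totals = dict(
    conn.execute("SELECT currency, SUM(amount) FROM orders GROUP BY currency")
)
```

No separate staging platform appears here: the raw table and the cleaned table live side by side in the same environment.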

Yet the volumes of both structured and unstructured data are exploding as the number of real-time data streams grows. In response, the practice of data integration has expanded to incorporate a richer, more dynamic set of capabilities that support both data warehousing and analytics needs as well as a growing number of operational processes. These processes are increasingly data-driven (such as just-in-time manufacturing, real-time insurance claims processing and Internet of Things applications).

Modern data integration

Unlike the traditional approach to ETL, data integration today encompasses holistic approaches to data availability, accessibility and movement, that is, how data is moved from one place to another. A modern data integration practice includes further processes for understanding how source data objects are introduced into the environment, how they move across the organization, how information is used by different consumers, what types of transformations are applied along the way, and how to ensure consistent interpretation across different business functions. In essence, data integration products allow you to customize data system solutions that channel the flow of data from producers to consumers.

Apart from the traditional methods of standardization, cleansing and transformation, today’s data integration often includes many other capabilities, such as those described next.

Dataflow modeling

These techniques and tools are used to document data lineage. This includes how data objects move from their points of origin across all touch points for reads and updates, and the ways in which these data objects are delivered to downstream consumers. Many data integration products provide data flow models that show data lineage and even provide search and impact analysis related to specific data elements and values.
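
To make lineage and impact analysis concrete, here is a minimal sketch that records data flows as a directed graph and walks it in either direction. The data set names and the FLOWS structure are hypothetical and do not reflect any particular product’s model.

```python
from collections import deque

# Hypothetical data flows: each key feeds the data sets listed as its value.
FLOWS = {
    "crm.customers": ["staging.customers"],
    "web.signups": ["staging.customers"],
    "staging.customers": ["warehouse.dim_customer"],
    "warehouse.dim_customer": ["report.churn", "report.revenue"],
}


def downstream(node, flows):
    """Impact analysis: every data set reachable from `node`."""
    seen, queue = set(), deque([node])
    while queue:
        current = queue.popleft()
        for target in flows.get(current, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def upstream(node, flows):
    """Lineage: every data set that `node` is ultimately derived from."""
    reverse = {}
    for source, targets in flows.items():
        for target in targets:
            reverse.setdefault(target, []).append(source)
    return downstream(node, reverse)


impacted = downstream("web.signups", FLOWS)
origins = upstream("report.churn", FLOWS)
```

A change to `web.signups` would touch everything in `impacted`, while `origins` answers where the churn report’s data came from.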

Data quality

External standards and regulations often impose strict restrictions on the use and availability of data, for example the level of personal data protection required by the European Union’s General Data Protection Regulation (GDPR). This has driven the need for continuous vigilance with regard to data quality enforcement. It has also motivated interest in integrating data validation, monitoring and notification of missed expectations directly into information flows. In response to these needs, a number of data integration tools now include data quality controls that can be integrated directly into business applications.
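
One way to picture validation embedded in an information flow is a generator that checks each record against a set of rules, passes clean records downstream and raises a notification for the rest. The rules, field names and the notify callback below are assumptions chosen for illustration.

```python
# Hypothetical expectations; each rule returns True when the record passes.
RULES = {
    "email_present": lambda r: bool(r.get("email")),
    "age_in_range": lambda r: isinstance(r.get("age"), int) and 0 <= r["age"] <= 120,
}


def check_record(record, rules):
    """Return the names of all rules the record fails."""
    return [name for name, rule in rules.items() if not rule(record)]


def validate_stream(records, rules, notify):
    """Yield records that pass every rule; report the others via `notify`."""
    for record in records:
        failures = check_record(record, rules)
        if failures:
            notify(record, failures)
        else:
            yield record


records = [
    {"id": 1, "email": "a@example.org", "age": 30},
    {"id": 2, "email": "", "age": 30},
    {"id": 3, "email": "c@example.org", "age": 150},
]
alerts = []
clean = list(
    validate_stream(records, RULES, lambda r, f: alerts.append((r["id"], f)))
)
```

In a real pipeline the notify callback might write to a monitoring system instead of a list.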

Data virtualization and data federation

A growing interest in data accessibility has led application designers to rethink how they provide access to data, especially as rampant data copying creates numerous replicas of varying consistency and timeliness. An attractive alternative is to leave the data objects in their original locations and use data virtualization techniques to create a semantically representative model layered on top of federated data access services that query the data where it resides. These capabilities reduce data replication while increasing data reuse.
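
The idea can be sketched as a thin view class that assembles a unified record at query time from lookups against each source, without keeping a copy. The two dictionaries below are hypothetical stand-ins for remote systems, and the class and function names are invented for this example.

```python
# Hypothetical source systems, each queried in place rather than copied.
CRM = {1: {"name": "Ada Lovelace", "segment": "enterprise"}}
BILLING = {1: {"balance": 120.0, "currency": "USD"}}


def crm_lookup(customer_id):
    """Federated access service for the CRM system."""
    return CRM.get(customer_id, {})


def billing_lookup(customer_id):
    """Federated access service for the billing system."""
    return BILLING.get(customer_id, {})


class VirtualCustomerView:
    """A unified 'customer' model layered on top of federated lookups."""

    def __init__(self, *lookups):
        self._lookups = lookups

    def get(self, customer_id):
        # Each call goes back to the sources, so no replica can go stale.
        merged = {"customer_id": customer_id}
        for lookup in self._lookups:
            merged.update(lookup(customer_id))
        return merged


view = VirtualCustomerView(crm_lookup, billing_lookup)
first = view.get(1)
BILLING[1]["balance"] = 95.0   # the source changes...
second = view.get(1)           # ...and the next query sees it immediately
```

The trade-off is that every query pays the cost of reaching the sources, which is why virtualization tools typically add caching and query optimization.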

Change data capture

Even in cases where data extracts have been provided, it is possible to reduce the amount of data required to maintain consistency by using change data capture (CDC). CDC is a data integration method that monitors changes to the source data systems and propagates those changes along to any replicated databases.
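
The sketch below illustrates the principle with a deliberately simplified technique: it derives changes by comparing two snapshots of a keyed table and then applies only those changes to a replica. Production CDC tools usually read the source database’s transaction log rather than diffing snapshots, and the function names here are hypothetical.

```python
def capture_changes(before, after):
    """Derive insert/update/delete events by comparing two keyed snapshots."""
    changes = []
    for key in after.keys() - before.keys():
        changes.append(("insert", key, after[key]))
    for key in before.keys() - after.keys():
        changes.append(("delete", key, None))
    for key in before.keys() & after.keys():
        if before[key] != after[key]:
            changes.append(("update", key, after[key]))
    return changes


def apply_changes(replica, changes):
    """Propagate only the captured changes to a replica, in place."""
    for operation, key, row in changes:
        if operation == "delete":
            replica.pop(key, None)
        else:
            replica[key] = row


before = {1: {"name": "Ada"}, 2: {"name": "Bob"}}
after = {1: {"name": "Ada L."}, 3: {"name": "Cy"}}

replica = dict(before)
changes = capture_changes(before, after)
apply_changes(replica, changes)
```

Only three small events cross the wire here, rather than a full re-extract of the table.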

Data protection

Methods of data protection, such as encryption at rest, encryption in motion, and data masking, enforce policies that prevent unnecessary exposure of personally identifiable information (PII). Because these safeguards are applied as data objects move from one point to another, they are increasingly part of the data integration toolkit.
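
As a small illustration of masking applied while data is in transit between systems, the sketch below partially masks an email address and replaces an identifier with a keyed hash. The field names and the key are assumptions, and the hard-coded key is for demonstration only; it is not a recommendation for handling secrets.

```python
import hashlib
import hmac

# Demonstration key only; a real key would come from a secrets manager.
SECRET_KEY = b"demo-key-not-for-production"


def pseudonymize(value, key=SECRET_KEY):
    """Replace a value with a keyed hash so it stays joinable but unreadable."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def mask_email(email):
    """Keep the first character and the domain; hide the rest."""
    local, _, domain = email.partition("@")
    return local[:1] + "***@" + domain


def mask_record(record):
    """Return a copy of the record that is safe to pass downstream."""
    masked = dict(record)
    masked["email"] = mask_email(record["email"])
    masked["national_id"] = pseudonymize(record["national_id"])
    return masked


original = {"email": "alice@example.com", "national_id": "123-45-6789", "plan": "gold"}
safe = mask_record(original)
```

Because the keyed hash is deterministic, the same identifier always maps to the same token, so masked data sets can still be joined.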

Data streaming and integrated business rules

The dramatic rise in analytics is pushing data integration tools to ingest and process streaming data. Streaming data integration differs from conventional data integration in that “chunks” of the data streams are processed in time windows. This certainly limits the ability to apply sets of transformations to the entire data set at once. But integrated business rules can be applied to data objects in real time to achieve some, if not all, of the necessary transformations before the data is used downstream.
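
A minimal sketch of window-based processing follows. It assumes events arrive in timestamp order as (timestamp, value) pairs, groups them into fixed-size (tumbling) windows, and applies a hypothetical business rule to each value before summarizing the window.

```python
def tumbling_windows(events, size):
    """Group in-order (timestamp, value) events into fixed-size time windows."""
    current_start, bucket = None, []
    for timestamp, value in events:
        start = timestamp - timestamp % size
        if current_start is not None and start != current_start:
            # A later window has begun, so the previous one is complete.
            yield current_start, bucket
            bucket = []
        current_start = start
        bucket.append(value)
    if bucket:
        yield current_start, bucket


def apply_rule(value):
    """Hypothetical business rule: treat negative readings as zero."""
    return max(value, 0)


events = [(0, 5), (3, -2), (10, 7), (12, 1), (25, 4)]
summaries = [
    (start, sum(apply_rule(v) for v in values))
    for start, values in tumbling_windows(events, 10)
]
```

Each window is summarized using only the events inside it, which is exactly the limitation described above: a transformation that needs the whole data set cannot be expressed this way.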

Data catalogs and services

As more organizations ingest larger amounts of both structured and unstructured data, there is increasing interest in moving acquired data into a data lake built on an underlying object store with custom metadata. To serve different consumer communities, organizations use data catalogs to inventory the available data sets and to register the data services that have been developed to access these managed data sets.
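
In its simplest form, such an inventory is a registry that maps each data set to its location, descriptive tags and the services that expose it. The sketch below is a toy model; the entry fields, paths and service names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One inventoried data set plus the services registered to access it."""
    name: str
    location: str
    tags: set = field(default_factory=set)
    services: list = field(default_factory=list)


class DataCatalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def add_service(self, dataset_name, service_name):
        self._entries[dataset_name].services.append(service_name)

    def search(self, tag):
        """Return the names of all data sets carrying the given tag."""
        return sorted(e.name for e in self._entries.values() if tag in e.tags)


catalog = DataCatalog()
catalog.register(CatalogEntry("claims_2023", "lake/claims/2023/", {"insurance", "pii"}))
catalog.register(CatalogEntry("sensor_feed", "lake/iot/sensors/", {"iot", "streaming"}))
catalog.add_service("claims_2023", "claims-lookup-api")

pii_datasets = catalog.search("pii")
```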

Today’s data integration: Know your options

When considering options for data integration tools and technologies, keep in mind that today’s hybrid computing environments are much more complex than those of the good old days. Conventional servers are being linked to big data analytics platforms, and data increasingly resides both on-premises and in the cloud. There is also a growing reliance on “as-a-service” offerings to manage a wide range of enterprise data assets.

Want to determine the best ways that modern data integration technologies can support your data processing needs? Make sure you understand the full range of expectations your users have about how they should be able to access and use data.
