What is a data pipeline?
When an organisation identifies a need for data as part of designing a service or internal system, it often runs up against issues in accessing or using that data. The data may be in the wrong place or the wrong format, and is sometimes missing key fields that would increase its value. Data that isn't fit for use is often described as low quality and in need of improvement, although there is no widely accepted metric for data quality. The process of resolving these issues is often called ETL (Extract, Transform and Load), implemented as a sequence of tools or programs, each of which modifies the data emitted by the previous step. This group of processes is a data pipeline.
Many commercial products exist to simplify the process of building data pipelines, orchestrating the resulting steps into a sequence that feeds the output of one step into the input of the next. However, the choice of tools and approach can vary significantly between data engineers, depending on scalability requirements, data volumes, latency needs, or organisational preferences.
At its most basic, a pipeline might be a collection of scripts or functions run one after the other. A more advanced pipeline might be a graph of interconnected nodes, each relying on output from one or more of its direct dependencies before it can run. The pipeline itself may be a one-off task that performs the different steps and then terminates, or it may be a continuous process transforming the data as it becomes available.
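To make the simplest case concrete, here is a minimal sketch of a pipeline as three functions run one after the other. The file names, field names, and clean-up rules are illustrative assumptions rather than a recommended design.

```python
import csv


def extract(path):
    # Read raw records from a CSV file (hypothetical source).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def transform(records):
    # Example clean-up: strip whitespace and drop rows missing an id.
    cleaned = []
    for row in records:
        row = {key: value.strip() for key, value in row.items()}
        if row.get("id"):
            cleaned.append(row)
    return cleaned


def load(records, path):
    # Write the cleaned records to their destination.
    if not records:
        return
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=records[0].keys())
        writer.writeheader()
        writer.writerows(records)


if __name__ == "__main__":
    load(transform(extract("raw_orders.csv")), "clean_orders.csv")
```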
Regardless of complexity, the purpose of a pipeline is to automate the movement and transformation of data, ensuring it is clean, structured, and ready for analysis or use. Well-designed pipelines help organisations reduce manual effort, minimise errors, and improve data reliability.
Tasks performed by pipelines
Each step of the pipeline performs one or more tasks to get the data into the expected end state. Although there is no expectation about the number of steps or how they are broken down, making the steps modular and reusable allows an individual step to be re-run without processing the entire pipeline. The different types of task can be categorised as Moving, Measuring, Enriching, or Managing the data.
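One way to achieve that modularity, sketched below, is to have each step persist its output so that any single step can be re-run from the previous step's saved result. The step names, placeholder logic, and file-based hand-off are assumptions made for illustration.

```python
import json


def save(name, data):
    # Persist each step's output so later steps (or re-runs) can start from it.
    with open(f"{name}.json", "w") as f:
        json.dump(data, f)


def load_output(name):
    with open(f"{name}.json") as f:
        return json.load(f)


def move():
    # Placeholder extract: fetch raw records from the source system.
    records = [{"id": 1, "postcode": "SW1A 1AA"}, {"id": 2, "postcode": ""}]
    save("move", records)
    return records


def measure(records=None):
    # Placeholder check: flag records that have any empty fields.
    records = records if records is not None else load_output("move")
    checked = [dict(r, complete=all(str(v) for v in r.values())) for r in records]
    save("measure", checked)
    return checked


def enrich(records=None):
    # Placeholder enrichment: pass the data through unchanged in this sketch.
    records = records if records is not None else load_output("measure")
    save("enrich", records)
    return records


if __name__ == "__main__":
    enrich(measure(move()))  # run the whole pipeline end to end
    measure()                # or re-run a single step from its predecessor's output
```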
Moving data
As identified earlier, data can often be in the wrong place and requires a mechanism for moving it from its original location. This step, the "extract" in ETL, might be moving a collection of files from one S3 folder to another, or querying a database to extract changes since the last execution of the pipeline. Data movement can present its own challenges, ranging from potential data loss or inaccuracy to security issues and extra financial cost. Whether the processing can instead be performed in the data's original location is worth considering early on.
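As a sketch of the S3 case, the snippet below copies everything under a prefix from one bucket to another. It assumes boto3 is available, and the bucket and prefix names are made up; an incremental database extract would follow the same shape, filtering on a last-modified timestamp instead.

```python
import boto3  # assumption: the data lives in S3 and boto3 is installed

# Illustrative bucket and prefix names; substitute your own.
SOURCE_BUCKET = "landing-zone"
DEST_BUCKET = "processing-zone"
PREFIX = "exports/2024/"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Copy every object under the prefix into the processing bucket.
for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        s3.copy_object(
            Bucket=DEST_BUCKET,
            Key=obj["Key"],
            CopySource={"Bucket": SOURCE_BUCKET, "Key": obj["Key"]},
        )
```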
Measuring and verifying the data
The quality of data varies from system to system, and a useful step in the data pipeline is checking the incoming data for potential issues: missing values, incorrectly formatted dates, duplicate records, or other inconsistencies in the structure of the data. Some of these issues can be resolved during the pipeline process, whilst others will require changes to the source data, but in either case it is often useful to add extra metadata recording a quality metric for the imported data. These quality metrics can be a useful addition to the record's lineage metadata, making it easier for users to see during data discovery what level of data quality they can expect.
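A measuring step might look something like the sketch below. It uses pandas and assumes an order_date column in a hypothetical clean_orders.csv; the checks and column names are illustrative only.

```python
import json

import pandas as pd  # assumption: the data is tabular and small enough for pandas

df = pd.read_csv("clean_orders.csv")  # hypothetical output of the moving step

# Count some basic problems: missing values, unparseable dates, duplicate rows.
missing = int(df.isna().sum().sum())
bad_dates = int(pd.to_datetime(df["order_date"], errors="coerce").isna().sum())
duplicates = int(df.duplicated().sum())

# Record a simple quality metric as metadata alongside the dataset itself.
quality = {
    "rows": len(df),
    "missing_values": missing,
    "unparseable_dates": bad_dates,
    "duplicate_rows": duplicates,
}
with open("clean_orders.quality.json", "w") as f:
    json.dump(quality, f, indent=2)
```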
Enrichment
While moving the data may be enough on its own, extra value can be obtained by enriching the data as it flows through the pipeline. The data might be enriched by deriving new fields, such as pre-calculating the required pallet size from a product's dimensions, or adding demographic data based on a customer's address. Alternatively, the data could be enriched by merging in another data source, for example adding the town name when a postcode is present, or adding links to historical transactions based on the record's account number.
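Both kinds of enrichment can be sketched in a few lines of pandas. The column names, pallet thresholds, and the postcode-to-town lookup table below are all made up for illustration.

```python
import pandas as pd

orders = pd.DataFrame({
    "postcode": ["SW1A 1AA", "M1 1AE"],
    "width_cm": [40, 60],
    "depth_cm": [30, 40],
    "height_cm": [20, 50],
})

# Derive a new field from existing ones: a rough footprint used to pick a pallet size.
orders["footprint_cm2"] = orders["width_cm"] * orders["depth_cm"]
orders["pallet"] = orders["footprint_cm2"].apply(
    lambda area: "quarter" if area <= 2400 else "half" if area <= 4800 else "full"
)

# Merge in a second data source: a (made-up) postcode-to-town lookup table.
towns = pd.DataFrame({
    "postcode": ["SW1A 1AA", "M1 1AE"],
    "town": ["London", "Manchester"],
})
enriched = orders.merge(towns, on="postcode", how="left")
print(enriched)
```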
Managing data
Part of the data pipeline should be creating or modifying various types of metadata about the dataset and publishing it to a data catalogue. This stage is not always present, but it can aid data discovery, security, and regulatory compliance. Even when the organisation does not possess a data catalogue, there is still a need to maintain some minimal metadata recording schema definitions and access controls; both are important in ensuring data is organised and trusted.
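Even without a catalogue product, this can be as simple as writing a small metadata record next to the dataset. The fields below are a minimal sketch of schema, access, and lineage information, not a standard; adapt them to whatever your catalogue or governance process expects.

```python
import json
from datetime import date

# A minimal, illustrative metadata record; the field names are assumptions.
metadata = {
    "dataset": "clean_orders",
    "description": "Orders extracted from the landing zone and cleaned",
    "updated": date.today().isoformat(),
    "schema": [
        {"name": "id", "type": "string", "required": True},
        {"name": "order_date", "type": "date", "required": True},
        {"name": "postcode", "type": "string", "required": False},
    ],
    "access": {"owner": "data-platform-team", "readers": ["analytics"]},
    "lineage": {"source": "s3://landing-zone/exports/2024/"},
}

# Publish alongside the data, or push to a catalogue if one is available.
with open("clean_orders.metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```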
Conclusion
Data pipelines are a key part of modern, data-driven services. They enable the movement, transformation, and management of data in a way that is efficient and automated, without the need for manual intervention, and they help ensure that the data is accessible and fit for purpose. Understanding these pipelines helps bridge the gap between design and engineering to improve data-driven systems.
Building and maintaining data pipelines requires expertise in data engineering, data architecture, and data governance, and at Register Dynamics we are more than happy to help, whether you are improving operations, building a digital service, or meeting regulatory requirements. Use our contact form to discuss how we can help turn your raw data into actionable insights aligned with your business goals.