Data pipelines notes
EtLT
Some examples of the type of transformation that fits into the EtLT subpattern include the following:
- Deduplicate records in a table
- Parse URL parameters into individual components
- Mask or otherwise obfuscate sensitive data
These types of transforms are either fully disconnected from business logic or, as in the case of masking sensitive data, at times required as early in a pipeline as possible for legal or security reasons. In addition, there is value in using the right tool for the right job.
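A minimal Python sketch of these "little t" transforms applied in flight, before the load step (the record fields and the hashing-based masking approach are illustrative assumptions, not a prescribed implementation):

```python
import hashlib
from urllib.parse import urlparse, parse_qs

def mask_email(email):
    # Obfuscate a sensitive value with a one-way hash before it is loaded.
    return hashlib.sha256(email.encode("utf-8")).hexdigest()

def parse_url_params(url):
    # Split a raw URL into components a downstream model can use directly.
    parsed = urlparse(url)
    params = {key: values[0] for key, values in parse_qs(parsed.query).items()}
    return {"host": parsed.netloc, "path": parsed.path, **params}

record = {
    "email": "jane@example.com",
    "landing_url": "https://shop.example.com/products?utm_source=news&utm_medium=email",
}

# "Little t" transforms applied during ingestion, before loading.
cleaned = {
    "email_hash": mask_email(record["email"]),
    **parse_url_params(record["landing_url"]),
}
print(cleaned)
```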
ELT for Data Analysis
ELT has become the most common and, in my opinion, most optimal pattern for pipelines built for data analysis.
With ELT, data engineers can focus on the extract and load steps in a pipeline (data ingestion), while analysts can utilize SQL to transform the data that’s been ingested as needed for reporting and analysis.
Data Ingestion: Extracting Data
Extracting Data from an RDBMS
- Full or incremental extraction using SQL
- Binary Log (binlog) replication
Full or incremental extraction using SQL is far simpler to implement, but also less scalable for large datasets with frequent changes. There are also trade-offs between full and incremental extractions that I discuss in the following section.
Binary Log replication, though more complex to implement, is better suited to cases where the data volume of changes in source tables is high, or there is a need for more frequent data ingestions from the MySQL source.
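A rough sketch of the difference, assuming the source table has an updated_at column and the previous run's high-water mark is stored somewhere (string interpolation is used only for illustration; real code should parameterize queries):

```python
from datetime import datetime
from typing import Optional

def build_extraction_query(table: str, watermark: Optional[datetime]) -> str:
    # Full extraction re-reads the entire table every run; incremental
    # extraction pulls only rows changed since the last successful run.
    if watermark is None:
        return f"SELECT * FROM {table};"
    return (
        f"SELECT * FROM {table} "
        f"WHERE updated_at > '{watermark.isoformat()}';"
    )

print(build_extraction_query("orders", None))                        # full
print(build_extraction_query("orders", datetime(2024, 1, 1, 6, 0)))  # incremental
```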
Extracting Data from a REST API
Get data from the endpoint, flatten the nested data, and treat each record as a row.
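A hedged sketch of that flow; the endpoint, response shape, and output file are assumptions, and real APIs add authentication and paging:

```python
import csv
import requests

# Hypothetical endpoint returning a list of nested JSON objects.
response = requests.get("https://api.example.com/v1/orders", timeout=30)
response.raise_for_status()
orders = response.json()

def flatten(record, parent_key=""):
    # Flatten nested JSON so each API record becomes a single flat row.
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}_{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, new_key))
        else:
            flat[new_key] = value
    return flat

rows = [flatten(order) for order in orders]

# Write the rows to a CSV the load step can pick up.
if rows:
    with open("orders_extract.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
```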
Data Ingestion: Loading Data
Transforming Data
Orchestrating Pipelines
Workflow orchestration platforms (also referred to as workflow management systems, or WMSs)
Airflow is an open source project started by Maxime Beauchemin at Airbnb in 2014. It joined the Apache Software Foundation’s Incubator program in March 2016. Airflow was built to solve a common challenge faced by data engineering teams: how to build, manage, and monitor workflows (data pipelines in particular) that involve multiple tasks with mutual dependencies.
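A minimal Airflow DAG sketch with dependent extract, load, and transform tasks (the script names are placeholders, and exact DAG parameters vary somewhat across Airflow versions):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="orders_elt",
    schedule_interval=timedelta(hours=1),
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract_orders", bash_command="python extract_orders.py")
    load = BashOperator(task_id="load_orders", bash_command="python load_orders.py")
    transform = BashOperator(task_id="transform_orders", bash_command="dbt run")

    # Downstream tasks only run after their upstream dependencies succeed.
    extract >> load >> transform
```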
A book
Best Practices for Maintaining Pipelines
orders_contract.json
```json
{
  "ingestion_jobid": "orders_postgres",
  "source_host": "my_host.com",
  "source_db": "ecommerce",
  "source_table": "orders",
  "ingestion_type": "full",
  "ingestion_frequency_minutes": "60",
  "source_owner": "dev-team@mycompany.com",
  "ingestion_owner": "data-eng@mycompany.com"
}
```
There should be data owners on both the source and target sides, as well as automation that detects potential schema changes and notifies stakeholders when something breaks.
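One possible shape for that automation, assuming the contract is extended with a hypothetical expected_columns field and that the actual alerting (email, Slack) to the owners is handled elsewhere:

```python
import json

def detect_schema_change(contract_path, current_columns):
    # Compare the columns seen in the source today with the columns the
    # ingestion was built for, and report anything that changed.
    with open(contract_path) as f:
        contract = json.load(f)
    expected = set(contract.get("expected_columns", []))  # hypothetical field
    added = sorted(set(current_columns) - expected)
    removed = sorted(expected - set(current_columns))
    alerts = []
    if added:
        alerts.append(f"New columns in {contract['source_table']}: {added}")
    if removed:
        alerts.append(f"Columns dropped from {contract['source_table']}: {removed}")
    return alerts

# In practice these alerts would go to source_owner and ingestion_owner
# from the contract; here they are just printed.
for alert in detect_schema_change("orders_contract.json", {"id", "status", "total"}):
    print(alert)
```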
Standardizing Data Ingestion
- Ingestion jobs must be written to handle a variety of source system types (Postgres, Kafka, and so on). The more source system types you need to ingest from, the larger your codebase and the more there is to maintain.
- Ingestion jobs for the same source system type cannot easily be standardized. For example, even if you only ingest from REST APIs, if those APIs do not have standardized ways of paging, incrementally accessing data, and other features, data engineers may build "one-off" ingestion jobs that don't reuse code or share logic that can be centrally maintained.
Forging a partnership with the engineering organization requires patience and the right touch, but it’s an underrated nontechnical skill for data teams.
A book
Strive for config-driven data ingestions
Are you ingesting from a number of Postgres databases and tables? Don't write a different job for each ingestion, but rather a single job that iterates through config files (or records in a database table!) that define the tables and schemas you want to ingest.
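A sketch of that loop over contracts like orders_contract.json (the second config file and the ingest_table body are placeholders):

```python
import json

def ingest_table(host, db, table, ingestion_type):
    # Placeholder for the shared extract/load logic; one implementation
    # serves every table defined in configuration.
    print(f"Ingesting {db}.{table} from {host} ({ingestion_type})")

for config_path in ["orders_contract.json", "customers_contract.json"]:
    with open(config_path) as f:
        config = json.load(f)
    ingest_table(
        host=config["source_host"],
        db=config["source_db"],
        table=config["source_table"],
        ingestion_type=config["ingestion_type"],
    )
```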
Consider your own abstractions
If you can’t get source system owners to build some standardized abstractions between their systems and your ingestion, consider doing so yourself or partnering with them and taking on the bulk of the development work. For example, if you must ingest data from a Postgres or MySQL database, get permission from the source team to implement streaming CDC with Debezium rather than writing yet another ingestion job.
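If that route is taken, registering a Debezium Postgres connector is roughly a POST to the Kafka Connect REST API. The sketch below is deliberately incomplete (no credentials, and property names differ across Debezium versions), so treat it as an outline rather than a working config:

```python
import requests

connector = {
    "name": "ecommerce-orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "my_host.com",
        "database.dbname": "ecommerce",
        "table.include.list": "public.orders",
        # Credentials and additional required properties omitted here.
    },
}

# Kafka Connect's default REST port is 8083; the host name is an assumption.
requests.post("http://kafka-connect:8083/connectors", json=connector, timeout=30)
```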
dbt
When dbt run is executed, the SQL scripts representing each model run in the proper order, based on how the models reference one another (via dbt's ref() function).
Key Pipeline Metrics
- Data integrity
- DAG successful runs
- DAG total runtime
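The DAG-level metrics can be pulled from Airflow's metadata database (the dag_run table includes dag_id, state, start_date, and end_date). A sketch assuming a Postgres-backed metadata database:

```python
# Share of successful runs per DAG.
DAG_SUCCESS_RATE_SQL = """
SELECT dag_id,
       AVG(CASE WHEN state = 'success' THEN 1.0 ELSE 0.0 END) AS success_rate
FROM dag_run
GROUP BY dag_id;
"""

# Average total runtime per DAG, in seconds, over successful runs.
DAG_RUNTIME_SQL = """
SELECT dag_id,
       AVG(EXTRACT(EPOCH FROM (end_date - start_date))) AS avg_runtime_seconds
FROM dag_run
WHERE state = 'success'
GROUP BY dag_id;
"""
```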