Things Every Data Engineer Should Know

Don’t Let Consumers Solve Engineering Problems

Avoid the temptation of letting data consumers solve data-engineering problems. Many types of data consumers exist, and each individual’s core competencies vary along multiple dimensions: coding skills, statistical knowledge, visualization abilities, and more. In many cases, the more technically capable data consumers will attempt to close infrastructure gaps themselves by applying ad hoc fixes. This can take the form of bolting additional data transformations onto a pipeline that isn’t serving its purpose, or even taking on actual infrastructure design themselves.

Superficially, this may look like a win-win to the data engineer: their own time is saved, and the consumer’s work proceeds unhindered. In practice, however, it usually results in convoluted layers of suboptimal solutions that make the organization’s data infrastructure increasingly hard to manage.
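To make this concrete, here is a minimal, hypothetical sketch: a pipeline occasionally emits duplicate order records, and the deduplication fix can live either in a consumer’s notebook or in the pipeline itself. The file path and column names are illustrative, not from any real system.

    # Hypothetical scenario: a pipeline occasionally emits duplicate order rows.
    import pandas as pd

    # Anti-pattern: a consumer quietly patches the data in their own notebook.
    # The workaround is invisible to the engineering team and gets copy-pasted
    # into every downstream analysis.
    orders = pd.read_parquet("warehouse/orders.parquet")
    orders = orders.drop_duplicates(subset="order_id")  # ad hoc consumer fix

    # Better: the fix lives in the pipeline, owned by data engineering, so
    # every consumer receives corrected data from a single, documented place.
    def build_orders_table(raw: pd.DataFrame) -> pd.DataFrame:
        return (
            raw.drop_duplicates(subset="order_id")
               .sort_values("created_at")
               .reset_index(drop=True)
        )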

Understand Consumers’ Jobs

Put a premium on knowing what data consumers actually do. Data consumers rely on data infrastructure to do their respective jobs. Their level of comfort, productivity, and adoption depends on the fit between that infrastructure and the dynamics of their work. Data engineers are tasked with developing this infrastructure from conception to implementation, and the actual day-to-day needs of the respective consumers are therefore critical context.

This usually means investing both time and effort to get a clear read, whether through shadowing sessions, iterative proofs of concept (POCs), or both low- and high-level ideation discussions. The growing professional familiarity between the teams also builds mutual respect and goodwill, and that in itself is a powerful driver of success.

Data Engineering = Computation + Storage + Messaging + Coding + Architecture + Domain Knowledge + Use Cases

Jesse Anderson
  • Batch and Real-Time Systems
  • Computation Component
  • Storage Component
  • NoSQL Databases
  • Messaging Component

Demystify the Source and Illuminate the Data Pipeline

You’ve been assigned to a new project, new team, or new company. You want to dig in, make an impact, and add business value. It can be tempting to start writing code immediately, but if you resist the inclination to make initial assumptions and instead focus on setting up a solid foundation, it will pay dividends moving forward.

First, discover where and how the data originates. When your data originates with users, it is useful to get their perspective on the data-entry experience. Each time I walk the floor of a manufacturing plant or talk to a machine operator about how they use a system, I gain valuable knowledge. Often I discover ways users are entering data that are inconsistent with the original system design, or the valid reasons why they omit data. If you don’t have access to the people entering the data, study their training documentation and talk to the business analysts associated with that function.
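Once you have that context, or while you are still gathering it, profiling the raw source data can confirm what you are hearing. The following is a minimal sketch, assuming a CSV extract of operator-entered readings; the path and column names are hypothetical.

    # Minimal profiling sketch for a hypothetical operator-entered extract.
    import pandas as pd

    readings = pd.read_csv("source_extracts/machine_readings.csv")

    # How often is each field actually filled in? High null rates often point
    # to fields operators skip, usually for reasons worth asking about.
    print(readings.isna().mean().sort_values(ascending=False))

    # Are fields carrying values the original system design never anticipated?
    print(readings["shift_code"].value_counts(dropna=False).head(20))

    # Are numeric fields being used outside their intended range?
    print(readings["temperature_c"].describe())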

Meghan Kwartler

Develop Communities, Not Just Code

Here are some ways you can help foster a data community or practice:

  • Query usage logs and, when privacy permits, publish them to help users connect (see the sketch after this list).
  • Engage with users to understand what business questions bring them to your data and how they interact with it. This can be done in one-on-one user interviews or larger design sessions.
  • Empower users by providing training and resources for them to engage with data in more advanced ways. Don’t underestimate how transformative what you perceive as basic skills can be. For example, teaching a marketing team basic SQL or training data scientists in Airflow could help those teams automate parts of their workflows.
  • Share as much of your work as possible (e.g., scripts that are part of your ETL pipeline) to help more-advanced users learn and reuse them.
  • Build centralized tools that help users consume data, and open them to community contribution.
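As promised above, here is a minimal sketch of the first idea: querying usage logs to surface who uses which tables. It uses SQLite as a stand-in for your warehouse client, and the query_history table and its columns are hypothetical; most warehouses expose similar audit or query-history views.

    # Sketch: find each table's most active users so you can connect people
    # with shared interests. sqlite3 stands in for a real warehouse client;
    # the query_history table and its column names are assumptions.
    import sqlite3

    conn = sqlite3.connect("usage_logs.db")
    rows = conn.execute(
        """
        SELECT table_name, user_email, COUNT(*) AS query_count
        FROM query_history
        WHERE queried_at >= DATE('now', '-30 days')
        GROUP BY table_name, user_email
        ORDER BY table_name, query_count DESC
        """
    ).fetchall()

    # Publish only with privacy sign-off, e.g., by posting each table's top
    # consumers to an internal channel so people can find one another.
    for table, user, count in rows[:10]:
        print(f"{table}: {user} ran {count} queries in the last 30 days")

Pairing a report like this with short write-ups of what each heavy user is working on can turn a dry usage log into a genuine community artifact.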