What does a Data Engineer do every day? How is it different from Software Engineering?

I was given the opportunity to work in a data engineering team, although I had no training or experience in the field. It was really difficult for me to figure out what the daily tasks of a data engineer were before I actually started in that position.

After a year on the job, I realize that the day-to-day work of data engineers is still difficult to summarize and explain. This is because the outcome of data engineering is less tangible than that of software engineering. Also, every DE has a very different experience based on where they work(ed).

In this article I’m sharing what my work as a Data Engineer looked like most days, and what kind of occasional projects I worked on. I will list the tasks in the order I faced them when I transitioned from a Software Engineer to a Data Engineer role.

It’s also worth noting that I was part of a small team with T-shaped skills, meaning we handled the full process of data engineering, both for day-to-day maintenance operations and for time-bound projects. I know some bigger companies have several distinct teams of DEs, each with a specific scope, performing only a part of the workflow I’ll describe here.

The technologies and general rules I’ll share are specific to where I worked and could vary in other companies. When I know some alternatives, I’ll list them.

A very high-level workflow of data warehousing

One of the main responsibilities of Data Engineers is maintaining the data warehouse. This can be summarized in a few big steps:

  1. Service connectors. The orchestrator contacts a service that hosts all the connectors and passes parameters defining what to run and how. This service is responsible for authenticating at the sources, triggering the extraction of raw data, and writing it to a target (usually a data lake). This part of the workflow could be handled directly by the orchestrator; in our case, we used Azure Data Factory for this step.
  2. Raw data storage. Raw data is written in a data lake, a vast storage with files written in formats like JSON, Parquet…
  3. Staging (preparation of data). A staging phase processes the data to make it clean and ready for further transformations. The data is cast (typed), some filters can already be applied, and sometimes intermediary tables are already created.
  4. Final tables creation. Final tables are created by aggregating and joining information coming from different tables, filtering, augmenting data with static information… Multiple tables can be created from the same staged or raw tables, using different calculations, filters, joins… based on the type of facts we want to provide.
  5. Views creation. Views are created based on the processed table. This is the final stage of the data, where it’s made available for reporting or data science.

If I tried to summarize this in a single sentence, I would say that a Data Engineer collects raw data (not usable as-is) from various sources and processes it to make it usable for the end user.
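The five steps above can be sketched end-to-end as a toy pipeline. The following Python is purely illustrative: it uses in-memory records instead of a real lake and warehouse, and all the names are mine, not a real framework.

```python
# Toy version of the warehousing workflow: extract -> lake -> staging
# -> final table -> view. All data and names are illustrative.

RAW_SOURCE = [
    {"id": "1", "amount": "10.5", "country": "BE"},
    {"id": "2", "amount": "n/a",  "country": "FR"},
]

def extract(source):
    # Steps 1-2: the connector pulls raw data and lands it as-is in the "lake".
    return list(source)

def stage(raw_rows):
    # Step 3: staging casts types and drops rows that cannot be cast.
    staged = []
    for row in raw_rows:
        try:
            staged.append({"id": int(row["id"]),
                           "amount": float(row["amount"]),
                           "country": row["country"]})
        except ValueError:
            continue  # unusable row, filtered out
    return staged

def build_final_table(staged_rows):
    # Step 4: the final table aggregates amounts per country.
    totals = {}
    for row in staged_rows:
        totals[row["country"]] = totals.get(row["country"], 0.0) + row["amount"]
    return totals

def view(final_table):
    # Step 5: the view is the shape exposed to reporting or data science.
    return sorted(final_table.items())

pipeline_output = view(build_final_table(stage(extract(RAW_SOURCE))))
```

In a real stack, each stage is a separate system (connector service, data lake, SQL/dbt models, database views); the point here is only the flow of the data.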

Day-to-day tasks as a Data Engineer

Develop and maintain custom extractors

Extractors are meant to extract data from different data sources across the company (databases, APIs, connectors to ERPs and CRMs like SAP or Salesforce, Cloud services, or other tools like Jira, to name only a few). In some rare cases, the extractor we need doesn’t exist. This is typically the case when we extract data from a very peculiar source, one that was never really meant to serve as a data source.

Those custom extractors are just software, and they can be written in any language that supports the necessary operations (making HTTP calls, for example). In a DevOps team, the development phase covers not only the application itself, but also making sure it deploys and runs.
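As a sketch, the core of such a custom extractor could look like the following. The `fetch` and `write` callables are injected placeholders for the real HTTP client and lake SDK, and the partitioned path layout is made up; none of this reflects a specific production setup.

```python
import json
from datetime import date

def lake_path(source_name: str, run_date: date) -> str:
    # Partitioned target path in the data lake (layout is illustrative).
    return f"raw/{source_name}/{run_date:%Y/%m/%d}/data.json"

def extract(source_name: str, fetch, write, run_date: date) -> str:
    """Pull a raw payload from a source and land it untouched in the lake.

    `fetch` and `write` are injected so the extractor stays testable:
    in production they would wrap an authenticated HTTP call and the
    lake's storage SDK.
    """
    payload = fetch()                       # e.g. an authenticated HTTP GET
    path = lake_path(source_name, run_date)
    write(path, json.dumps(payload))        # land data as-is, no transformation
    return path
```

Keeping the extraction free of transformations matters: raw data lands in the lake untouched, and all cleaning happens later in staging.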

Add columns to an existing table

The most basic type of request is adding some columns to a table that already exists. Depending on the person who wrote the ticket and the kind of source, this can require anywhere from little to a lot of research to retrieve the field names in the source.

We wrote our models with SQL and dbt (a SQL framework with features like macros that let you reuse logic). The models and data transformations can also be written in a programming language. Some popular languages in Data Engineering are Scala, Java/Kotlin and Python.

Other operations on existing tables include renaming columns, adding filters, deleting or replacing some columns, updating the source column name because the source was updated…

Monitor data pipelines

Sometimes the scheduled data extractions failed. Most of the time, it was because the source was unreachable at the time of the extraction, for a reason out of our control, and we reran the pipeline manually.

Other times, it happened the day after a release, and it usually meant that we had released breaking changes. In that case we had to find the changes that created problems and take corrective actions, or roll back the changes to allow time for investigation.

Another case is a data source that is still unreachable, even after a retry. In that case, we have to contact the owner of that source to let them know we’re unable to reach it.
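This retry-then-escalate logic can be sketched in a few lines. The function below is my own illustration of the policy described above, not our actual production setup (we reran pipelines manually through the orchestrator).

```python
import time

def rerun_with_retries(run, max_attempts: int = 3, base_delay: float = 1.0,
                       sleep=time.sleep):
    """Rerun a failed extraction a few times before escalating.

    `run` is a zero-argument callable that raises on failure. Transient
    source outages usually succeed on a rerun; if the source is still
    unreachable after all attempts, we escalate to the source's owner.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return run()
        except ConnectionError:
            if attempt == max_attempts:
                raise RuntimeError("source still unreachable, contact the owner")
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
```

The `sleep` parameter is injected only to make the backoff testable; in practice most orchestrators (Azure Data Factory included) offer retry policies out of the box.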

Add new tables (models)

Adding a new table means:

  • Sometimes, creating a new connector in Azure Data Factory.
  • Adding or updating the source configuration in our orchestrator.
  • Adding the tables, and sometimes joins to existing tables, in the SQL projects.

It also means communicating with the requester to make sure they’ll get what they want, and sometimes creating dummy reports in Power BI yourself to see how the views look and whether you can create the links needed to answer the requester’s questions.

(*) In a data warehouse, a dimension is a descriptive table for an object. It contains fields such as names, descriptions, other descriptive labels, and keys. A fact is a table that relates to an event: it describes a relation between different objects at a given time, and contains fields like dates, aggregations, flags, and keys to join the dimensions that describe this event.

(**) A star schema is the most commonly used form of data modelling in data warehousing. The center of the star schema is a fact, and dimensions are the “branches” of the star. Concretely, it means that a fact is linked to several dimensions, and dimensions are never linked together. (A fact can of course be linked to any number of dimensions.)
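A toy star schema, reduced to Python dictionaries for illustration: one fact joined to two dimensions, with no dimension-to-dimension link. The table and field names are invented.

```python
# Dimensions: descriptive tables, keyed by their surrogate key.
dim_product = {1: {"name": "Widget"}}
dim_customer = {10: {"country": "BE"}}

# Fact: one row per event, holding measures and keys to the dimensions.
fact_sales = [{"product_id": 1, "customer_id": 10, "amount": 25.0}]

# A reporting query joins the fact to each dimension through those keys.
report = [
    {"product": dim_product[f["product_id"]]["name"],
     "country": dim_customer[f["customer_id"]]["country"],
     "amount": f["amount"]}
    for f in fact_sales
]
```

In the warehouse this would of course be SQL joins between tables, but the shape is the same: the fact sits in the middle, and each dimension is reached in one hop.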

Maintain and create data source connectors

Connectors for data extraction sometimes need to be updated, either because there was a change on the source side, or because the connector version is deprecated. We managed our extractors in Azure Data Factory and our secrets in Azure Key Vault.

We regularly added new extractors because we needed to integrate new data sources. This required getting an authentication method and credentials for a service principal (usually provided by the team responsible for the source), choosing the right connector in Data Factory, saving all the needed parameters, and designing the extraction workflow (source, target and intermediate operations).
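For a service principal, authentication is typically an OAuth2 client-credentials flow. The helper below only builds the token request; the endpoint shape follows Azure AD's v2.0 token endpoint, but the tenant, client id and scope values are placeholders, and in practice the secret would be read from Key Vault rather than passed around.

```python
from urllib.parse import urlencode

def client_credentials_request(tenant_id: str, client_id: str,
                               client_secret: str, scope: str):
    """Build the OAuth2 client-credentials token request for a service
    principal. The caller POSTs `body` (form-encoded) to `url` and reads
    `access_token` from the JSON response.
    """
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,  # in practice, fetched from Key Vault
        "scope": scope,
    })
    return url, body
```

When the connector lives entirely in Data Factory, this flow is handled by the linked service configuration instead of custom code; the sketch just shows what happens underneath.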

Review code

As our code (either the Python applications or the project holding our models in SQL and dbt) was versioned and integrated into a CI pipeline, we did code reviews just like software teams do.

Investigate customers’ requests

Our entry point was usually someone contacting us because they needed to do reporting. We had to understand what they were currently doing (generally, a bunch of Excel files), find out whether there was a data source available for the data they needed (a database, API, ERP, CRM…), and define with them the actual need (what data to expose). Based on that investigation, we gave a go or no-go.

Create data visualization reports

The goal of a data engineer is to provide clean, usable data either for Data Science projects (Machine Learning, Large Language Models, AI, predictions…) or for business users, so they can build reports and derive meaningful information from it.

We either trained key users to be the go-to person for reporting inside a department, or made reports ourselves (with Power BI for example).

Other projects as a Data Engineer

Improve testing and acceptance (UAT) environments

As in Software Engineering, we wanted to automate our tests as much as possible, but also have the developments validated by the requester(s). For this, we needed isolated environments.

Migrate queries for logs ingestion

This is a one-time project I worked on. We transitioned from Splunk to Azure Data Explorer to ingest logs continuously from connected machines. I translated queries from Splunk’s SPL to KQL (Kusto), added them to a Delta-Kusto project, created a new connector, and updated the configurations and references in our data pipeline and in our staging and warehouse projects. This project lasted about 6 months (working half-time).

Add linters to the SQL/dbt projects

This meant adding off-the-shelf linters, but also developing custom ones, for example to check that Data Engineers wrote test files for all tables.
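As an illustration, such a custom rule can be a small script run in CI. The layout and naming convention below are made up (real dbt projects usually declare tests in YAML schema files), but the idea is the same: fail the build when a model has no matching test file.

```python
from pathlib import Path

def missing_test_files(project_dir: str) -> list[str]:
    """Custom lint rule: every model `models/<name>.sql` must have a
    matching `tests/<name>.yml`. Returns the names of models whose
    test file is missing (empty list means the check passes).
    """
    root = Path(project_dir)
    models = sorted((root / "models").glob("*.sql"))
    return [m.stem for m in models
            if not (root / "tests" / f"{m.stem}.yml").exists()]
```

In CI, the script would simply exit with a non-zero status when the returned list is non-empty, blocking the merge.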

Propose process improvements in the team

With my experience in Software Engineering, where processes are more standardized and mature than in Data Engineering, I could propose new ways of working to improve code review, testing, team organization, workflows, automation…

Automate tasks relying on data

A lot of data, even business-critical data, still lives in Excel files that are updated manually. On top of that, manual workflows are executed to move the information between people.

We sometimes got larger-scope projects where our mission was to define and implement a new way of storing data, and to automate data processing to secure data quality and save the business a lot of time by simplifying its processes.

This means understanding the current process, searching for the initial sources of data across the company, creating a structure for data that is potentially not yet stored in a safe and consistent way, training the users and making sure they adopt the new solution, and finding tools or developing custom scripts to automate the transfer of information.

Biggest differences between Data Engineering and Software Engineering

These are the changes I experienced when moving from a Software Engineer role to Data Engineering; they might interest you if you’re considering a switch (in either direction).

Once again, this is particular to the company and team you work in. For example, I’ve heard some data engineers say they had less contact with the business than software engineers at their company, although that seems odd to me 😅.

Understanding the business

I find it far more important to understand the business in Data Engineering. As a software developer, I liked understanding what I was working on, but I must admit it was possible to work with a very shallow understanding most of the time.

In Data Engineering, working on data that doesn’t mean anything to you is really inefficient and error-prone; you’d lose a lot of time. Data is very concrete and doesn’t “benefit” from as much abstraction as software does.

This means data work will probably require more contact with the business, and a genuine interest in the core business of the company you’re working at.

Technological maturity

The Data Engineering field is way younger than Software Engineering, and it shows. It’s way more difficult to find documentation and especially real use cases on topics such as testing, CI/CD, DataOps… than it is for Software and DevOps.

Due to the nature of the work, it’s rarely possible to just copy/paste what’s done in SE. For example, it usually doesn’t make sense to base Data Engineering tests on mock data!

For the moment, this means DE requires more custom development and creativity if you want to reach the quality level usually observed in Software. This is really important to keep in mind if you are considering changing position.
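To make the mock-data point concrete: data tests are typically assertions run against the real tables (or a fresh sample of them) rather than against fabricated fixtures. The function below is my own minimal sketch, similar in spirit to dbt's built-in `unique` and `not_null` tests; the name and signature are illustrative.

```python
def check_table(rows, primary_key: str, not_null: list[str]) -> list[str]:
    """Data-quality checks run against real data: uniqueness of the
    primary key and non-null constraints on selected columns.
    Returns a list of failure messages (empty list means all checks pass).
    """
    failures = []
    keys = [r[primary_key] for r in rows]
    if len(keys) != len(set(keys)):
        failures.append(f"duplicate values in {primary_key}")
    for col in not_null:
        if any(r.get(col) is None for r in rows):
            failures.append(f"null values in {col}")
    return failures
```

Because the input is real production data, these checks catch upstream changes (a source suddenly emitting nulls or duplicates) that tests on mock data would never see.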

Range of tasks

Data Engineering seems far more diversified in terms of tasks than Software Engineering. I was a full-stack developer in a DevOps team, but I find myself handling many more kinds of tasks in my new position.

There might be more technologies and languages to handle in Software Engineering, but there is more task diversity in Data Engineering. I would say this is because DE covers the daily maintenance of the data warehouse, together with other projects that have a broader scope.

I must also specify that I always worked on long-term projects in Software Engineering, maintaining one or two products at a time. Basically, the maintenance (fixing bugs) felt the same as the new developments (new features), the only tasks that differed a bit were the ones about infrastructure, CI/CD and DevOps.

In Data Engineering, “new features” (developing a new data model) require an investigation phase, the source data and use case are always unique, and there’s an analysis phase where you imagine the model before implementing it (only at that stage do we start coding).

When connecting new sources, you could be working with APIs, databases, Cloud services, ERPs… In my experience, the number of APIs and similar interfaces we integrate with is far larger in data.

For specific projects, you might have to think of new solutions to store the data, you can work on automating tasks… in addition to the usual data warehousing.

Your colleagues will be different

There are very few, if any, degree programs specific to Data Engineering. Data Engineering might require some analytical skills and knowledge of statistics (at the least, it can help). It’s also possible to do DE with far less technical knowledge, because a lot of tools were developed to enable Data Engineering and Data Science with no-code/low-code solutions.

Therefore, you will probably work with people who are far less technically savvy than you (if you’re coming from a Software Engineering position). It means you could end up being the most technical person on the team, with the responsibility to train others, or having to refrain from using some solutions because your colleagues won’t adopt them.

This is also a very important point to take into consideration before switching jobs: you should investigate the composition of the team and how open they are to technical improvements.
