It was really difficult for me to figure out what the daily tasks of a data engineer are before I started the position. After more than half a year on the job, I realize that the day-to-day work of data engineers is still difficult to summarize, because the outcome of data engineering is less tangible and because every DE has a very different experience.
I’m here to share what my work as a Data Engineer looks like most days, and what kind of occasional projects I’ve also worked on. I will list the tasks in the order I encountered them when I transitioned from a Software Engineer to a Data Engineer role.
It’s also worth noting that I’m part of a small team with T-shaped skills, meaning we handle the full data engineering process, both for day-to-day maintenance operations and for time-bound projects. I know some bigger companies have several DE teams, each with a specific scope, performing only a part of the workflow I’ll describe here.
The technologies and general rules I’ll share here are specific to where I work and could vary in other companies. Where I know of alternatives, I’ll list them.
A very high-level workflow of data warehousing
One of the recurrent responsibilities of Data Engineers is maintaining the data warehouse. This can be summarized in a few big steps:
- An orchestrator runs data pipelines on a schedule (either a managed service or custom software; popular Python frameworks for building an orchestrator are Airflow and Dagster).
- (Optional; this can be part of the orchestrator itself.) The orchestrator contacts a service hosting all the connectors and passes parameters defining what to run and how. That service is responsible for authenticating against the sources, triggering the extraction of raw data, and writing it to a target.
- Raw data is written to a data lake, vast storage where everything lands in a common format (JSON, Parquet…).
- (Optional) A staging phase processes the data to make it clean and ready for further transformations. Data types are cast, some filters may already be applied, and intermediary tables are sometimes created.
- Business- or Data Science-ready tables are created by aggregating and joining information from different tables, filtering, augmenting the data with static information…
- Views (virtual tables defined by saved queries) are created on top of the processed tables. This is the final stage, where the data is made available for reporting or data science.
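The stages above can be sketched in plain Python. This is a toy illustration only: all function and table names are invented, the "lake" is an in-memory dict, and a real setup would use an orchestrator, object storage, and a SQL engine.

```python
# Toy sketch of the warehouse flow: extract -> lake -> staging -> mart.
# Names and the in-memory "lake" are invented for illustration.

def extract(source):
    """Pretend connector: returns raw records from a source."""
    return source["rows"]

def to_lake(lake, name, raw):
    """Raw data lands in the lake unmodified, in a common format."""
    lake[name] = raw

def stage(raw):
    """Staging: cast types and drop obviously bad rows."""
    return [
        {"id": int(r["id"]), "amount": float(r["amount"])}
        for r in raw
        if r.get("id") is not None
    ]

def build_mart(staged):
    """Business-ready table: aggregate amounts per id."""
    totals = {}
    for r in staged:
        totals[r["id"]] = totals.get(r["id"], 0.0) + r["amount"]
    return totals

lake = {}
source = {"rows": [{"id": "1", "amount": "10.5"},
                   {"id": "1", "amount": "2.0"},
                   {"id": None, "amount": "99"}]}
to_lake(lake, "sales_raw", extract(source))
mart = build_mart(stage(lake["sales_raw"]))
```

Each function stands in for one stage of the workflow; the row with a missing id is filtered out in staging, never reaching the business-ready table.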
If I had to summarize this in a single sentence, I would say that a Data Engineer collects raw data that is not usable as-is and processes it to make it usable for end users.
Day-to-day tasks as a Data Engineer
Develop custom extractors
Extractors extract data from different data sources across the company on an automatic schedule (databases, APIs, connectors to ERPs like SAP, Salesforce, cloud services, or other tools like Jira, to name only a few). In some rare cases, the extractor we need doesn’t exist yet. This happens when we extract from a very peculiar source, one that was never meant to be a data source.
Those custom extractors are just software, and they can be written in any language that supports the necessary operations (making HTTP calls, for example). In a DevOps team, the development phase includes not only the application itself, but also making sure it deploys and runs.
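A minimal sketch of what such a custom extractor does, assuming a paginated source. The `fetch_page` callable stands in for the real HTTP client (e.g. one built on requests), and the JSON-lines output format is just one common choice; everything here is illustrative.

```python
import json

def extract_all(fetch_page, page_size=100):
    """Pull every record from a paginated source.

    fetch_page(offset, limit) stands in for a real HTTP call;
    it returns a list of records, empty when the source is exhausted.
    """
    records, offset = [], 0
    while True:
        batch = fetch_page(offset, page_size)
        if not batch:
            break
        records.extend(batch)
        offset += len(batch)
    return records

def write_raw(records):
    """Raw data is written as-is for the lake (here: JSON lines)."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)

# Fake source with 3 records, read with page size 2.
data = [{"id": i} for i in range(3)]
fake_fetch = lambda off, lim: data[off:off + lim]
raw = extract_all(fake_fetch, page_size=2)
```

Injecting `fetch_page` keeps authentication and transport out of the extraction loop, which is also what makes the loop easy to test.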
Add columns to an existing table
The most basic type of ticket we can receive is a request to add some columns to a table that already exists. Depending on who wrote the ticket and the kind of source, this can require anywhere from a little to a lot of research to retrieve the field names in the source.
We write our models with SQL and dbt (a SQL framework that, among other things, provides macros so logic can be reused like functions). Models can also be written as software in a programming language; some popular languages in Data Engineering are Scala, Java/Kotlin, and Python.
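To illustrate what a dbt macro buys you: in dbt this is done with Jinja templating inside the SQL files, but the same idea can be shown with plain Python string formatting. The macro body, table, and column names below are all invented.

```python
# Rough analogue of a dbt macro: a reusable snippet of SQL expanded
# into each model that needs it. dbt uses Jinja; plain Python string
# formatting illustrates the same idea here.

def cents_to_amount(column):
    """'Macro' converting an integer cents column to a decimal amount."""
    return f"cast({column} as numeric) / 100.0 as {column}_amount"

def render_model(table, columns, macro_columns):
    """Build a model's SQL, expanding the macro for selected columns."""
    selects = list(columns) + [cents_to_amount(c) for c in macro_columns]
    return "select\n    " + ",\n    ".join(selects) + f"\nfrom {table}"

sql = render_model("raw_orders", ["order_id", "status"], ["price_cents"])
```

The conversion rule lives in one place, so adding a column to a model (the ticket type described above) means adding one name to a list rather than copy-pasting the expression.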
Monitor data pipelines
Sometimes the scheduled data extractions fail. Most of the time, a source was simply unreachable at that moment, for a reason out of our control, and we just have to rerun the pipeline.
Other times, the failure happens the day after a release, which usually means something in the release contained breaking changes. In that case we have to identify the changes that caused the problem and either take corrective action or roll back to allow time for investigation.
Another case is a data source that stays unreachable even after a retry. Then we have to contact the owner of that source to let them know we’re unable to reach it.
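The rerun-then-escalate pattern described above can be sketched as follows. This is a simplified illustration (real monitoring goes through the orchestrator's retry and alerting features); `pipeline` and `notify_owner` are invented stand-ins.

```python
import time

def run_with_retry(pipeline, notify_owner, max_retries=2, delay_s=0):
    """Rerun a flaky pipeline; escalate to the source owner only if it
    still fails after the retries. pipeline() raises on failure."""
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            return pipeline()
        except ConnectionError as exc:
            last_error = exc
            time.sleep(delay_s)
    notify_owner(f"source still unreachable: {last_error}")
    return None

# Simulate a source that is down for the first call only.
calls = {"n": 0}
def flaky_pipeline():
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("source unreachable")
    return "ok"

alerts = []
result = run_with_retry(flaky_pipeline, alerts.append, delay_s=0)
```

A transient outage resolves on the second attempt with no alert raised, matching the most common case: just rerun the pipeline.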
Add new tables (models)
Adding a new table means:
- Designing the model for the table, in particular figuring out the links between the dimensions and the facts and deciding on the content of those tables, usually with an Entity-Relationship diagram that is presented to the whole team for feedback and approval.
- Sometimes, creating a new connector in Azure Data Factory and a new source configuration in our orchestrator.
- Adding the tables, and sometimes joins to existing tables, in the SQL projects.
It also means communicating with the requester to make sure they’ll get what they originally asked for, sometimes creating dummy reports in Power BI yourself to see how the views look and whether you can create the links needed to answer the requester’s questions.
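A minimal sketch of the dimension-and-fact design this produces, using Python's built-in sqlite3 so it is runnable anywhere. The schema (one product dimension, one sales fact, a reporting view on top) and all names are invented for illustration.

```python
import sqlite3

# Minimal star schema: one dimension, one fact, one reporting view.
# Table and column names are invented for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
    create table dim_product (product_id integer primary key, name text);
    create table fact_sales (product_id integer, amount real);
    insert into dim_product values (1, 'widget'), (2, 'gadget');
    insert into fact_sales values (1, 10.0), (1, 5.0), (2, 7.5);
    -- The view is the stage exposed to reporting tools like Power BI.
    create view v_sales_by_product as
        select p.name, sum(f.amount) as total
        from fact_sales f
        join dim_product p on p.product_id = f.product_id
        group by p.name;
""")
rows = dict(con.execute("select name, total from v_sales_by_product"))
```

The fact table holds measures keyed by ids, the dimension holds the descriptive attributes, and the view joins and aggregates them, which is the shape the ER diagram has to get right before any code is written.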
Maintain and create data source connectors
Connectors for data extraction sometimes need to be updated, either because something changed on the source side or because the connector is deprecated. We manage our extractors in Azure Data Factory and our secrets in Azure Key Vault.
We regularly add new extractors because we need to integrate new data sources. This requires obtaining an authentication method and credentials for a service principal (usually provided by the team in the company responsible for the source), choosing the right connector in Data Factory, saving all needed parameters, and designing the extraction workflow (source, target, and intermediate operations).
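The information a new source setup has to capture could be sketched like this. All field names are assumptions for illustration; in our actual setup this lives in Data Factory linked services and Key Vault, not in Python. The one rule worth encoding is that the configuration stores a reference to a vault secret, never the credential itself.

```python
from dataclasses import dataclass, field

@dataclass
class SourceConfig:
    """Sketch of what a new-connector setup captures: how to
    authenticate, where the secret lives (a vault reference, never
    the secret itself), and the extraction workflow.
    All field names are illustrative."""
    name: str
    auth_method: str               # e.g. 'service_principal'
    secret_ref: str                # pointer into a key vault
    endpoint: str
    target_path: str
    steps: list = field(default_factory=lambda: ["extract", "land_raw"])

    def validate(self):
        """All connection details must be filled in."""
        return all([self.name, self.auth_method,
                    self.secret_ref, self.endpoint, self.target_path])

cfg = SourceConfig(
    name="crm",
    auth_method="service_principal",
    secret_ref="kv://data-kv/crm-sp-client-secret",
    endpoint="https://crm.example.com/api",
    target_path="lake/raw/crm/",
)
```
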
Review code
As our code (whether the Python applications or the project holding our SQL and dbt models) is versioned and integrated into a CI pipeline, we do code reviews just like software teams do.
Investigate customers’ requests
Our entry point is usually someone contacting us because they need reporting. We have to understand what they currently do (generally, a bunch of Excel files), find out whether a data source is available for the data they need (a database, API, or ERP), and define the actual need with them. Based on that investigation, we give a go or no-go.
Create data visualization reports
The goal of a data engineer is to provide clean, usable data, either for Data Science projects (Machine Learning, Large Language Models, AI, predictions…) or for business users, so they can build reports and derive meaningful information from it.
Depending on the kind of user who needs the data, we either train key users to be the go-to person for reporting inside a department or make reports ourselves (with Power BI for example).
Other projects as a Data Engineer
Improve testing and acceptance environments (UAT)
As in Software Engineering, we want to automate our tests as much as possible, while also having developments validated by the requester(s).
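As an example of the kind of automated check that runs before a requester validates anything, here are two basic data tests (dbt ships equivalents as `not_null` and `unique` tests). The checks below are a hand-rolled sketch against sqlite; table and column names are invented.

```python
import sqlite3

def check_no_null_keys(con, table, key):
    """Automated data test: the key column has no NULLs."""
    q = f"select count(*) from {table} where {key} is null"
    return con.execute(q).fetchone()[0] == 0

def check_unique_keys(con, table, key):
    """Automated data test: key values are unique (no fan-out in joins)."""
    total = con.execute(f"select count({key}) from {table}").fetchone()[0]
    distinct = con.execute(
        f"select count(distinct {key}) from {table}").fetchone()[0]
    return total == distinct

con = sqlite3.connect(":memory:")
con.execute("create table dim_customer (customer_id integer, name text)")
con.executemany("insert into dim_customer values (?, ?)",
                [(1, "a"), (2, "b"), (3, "c")])
ok = (check_no_null_keys(con, "dim_customer", "customer_id")
      and check_unique_keys(con, "dim_customer", "customer_id"))
```

Running such checks in CI catches structural problems automatically, leaving human validation (UAT) for the business questions only the requester can answer.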
Migrate queries for logs ingestion
This is a one-time project I’ve worked on. We transitioned from Splunk to Azure Data Explorer to ingest logs continuously from some machines. I translated queries from Splunk QL to KQL (Kusto), added them to a Delta-Kusto project, updated our data pipeline configurations, created a new connector, and updated the configurations and references in our staging and warehouse projects. The project lasted about six months (half-time).
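To give a flavor of the translation work: Splunk's `stats count by host` corresponds to KQL's `summarize count() by host`. The real queries needed manual translation; the toy helper below handles only that single pattern, purely as an illustration of the kind of mapping involved.

```python
import re

# Toy illustration of one pattern from a Splunk-to-Kusto migration:
# SPL "stats count by <field>"  ->  KQL "summarize count() by <field>".
# Real queries are far richer and were translated by hand.

def spl_stats_to_kql(spl_fragment):
    m = re.fullmatch(r"stats count by (\w+)", spl_fragment.strip())
    if not m:
        raise ValueError("pattern not supported by this toy translator")
    return f"summarize count() by {m.group(1)}"

kql = spl_stats_to_kql("stats count by host")
```
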
Add linter to the SQL/dbt projects
This means adding off-the-shelf linters, but also developing custom linters, for example to check whether Data Engineers wrote test files for all tables.
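The "every table has a test file" check mentioned above could look like this. The one-yml-per-model layout is an assumption for illustration; a real project's conventions (and my actual linter) may differ.

```python
from pathlib import Path
import tempfile

def models_missing_tests(models_dir):
    """Custom lint rule: every model <name>.sql must have a sibling
    <name>.yml declaring its tests. Returns offending model names.
    (The one-yml-per-model layout is an assumption for illustration.)"""
    missing = []
    for sql in sorted(Path(models_dir).glob("*.sql")):
        if not sql.with_suffix(".yml").exists():
            missing.append(sql.stem)
    return missing

# Demo on a throwaway directory: one model has tests, one does not.
tmp = Path(tempfile.mkdtemp())
(tmp / "orders.sql").write_text("select 1")
(tmp / "orders.yml").write_text("models: []")
(tmp / "customers.sql").write_text("select 2")
missing = models_missing_tests(tmp)
```

Wired into CI, a rule like this fails the build with the list of untested models instead of relying on reviewers to spot the omission.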
Propose process improvements in the team
With experience in Software Engineering, where processes are more standardized and mature, I can propose some new ways of working to improve code review, testing, team organization…
Automate tasks relying on company data
A lot of data, even business-critical data, still lives in Excel files that are updated manually. On top of that, manual workflows are executed to let information flow between people. We sometimes get larger-scope projects whose mission is to automate data processing, to secure data quality and save the business a lot of time by simplifying processes.
This means searching for the original sources of the data, creating a structure for data that may not yet be stored in a safe, consistent way, training the users and making sure they adopt the new solution, and finding tools or developing custom scripts to automate the transfer of information…
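A tiny sketch of one such automation step: replacing a manual Excel check with a script that validates rows and reports the bad ones instead of silently passing them along. The CSV layout and column names are invented.

```python
import csv, io

def load_and_validate(csv_text, required=("id", "owner")):
    """Replace a manual spreadsheet step: parse the file, reject rows
    with missing critical fields, and report them instead of silently
    forwarding bad data. Column names are invented for illustration."""
    rows, rejected = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if all(row.get(c) for c in required):
            rows.append(row)
        else:
            rejected.append(row)
    return rows, rejected

raw = "id,owner,note\n1,alice,ok\n2,,missing owner\n3,bob,\n"
good, bad = load_and_validate(raw)
```

The rejected rows go back to the data owner for correction, which is exactly the feedback loop a manual copy-paste workflow lacks.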
Biggest differences between Data Engineering and Software Engineering
These are the changes I experienced when moving from a Software Engineer role to Data Engineering, which you might find interesting if you’re considering changing positions (in either direction).
Understanding the business
I find it far more important to understand the business in Data Engineering. As a software developer, I liked understanding what I was working on, but I must admit it was possible to work without that understanding most of the time.
In Data Engineering, working on data that doesn’t mean anything to you is really inefficient and error-prone; you’d lose a lot of time. Data is very concrete and doesn’t “benefit” from as much abstraction as software does.
This means data work will probably require more contact with the business and a genuine interest in the core business of the company you work for.
Technological maturity
The Data Engineering field is way younger than Software Engineering, and it shows. It’s way more difficult to find documentation and especially real use cases on topics such as testing, CI/CD, DataOps… than it is for Software.
It means that, for the moment, DE requires more custom development and creativity if you want to reach the quality grade usually observed in Software. This is really important to keep in mind if you’re considering changing positions.
Range of tasks
Data Engineering seems way more diversified in terms of tasks compared to Software Engineering. I was a full stack developer in a DevOps team, but I find myself handling many more different kinds of tasks in my new position.
There might be more technologies and languages to handle in Software Engineering, but there is more task diversity in Data Engineering. I would say this comes from DE covering both the daily maintenance of the data warehouse and other projects with a broader scope.
I should also mention that I always worked on long-term projects in Software Engineering, where maintenance (fixing bugs) felt basically the same as new development (new features); the only tasks that differed a bit were those around infrastructure, CI/CD, and DevOps.
In Data Engineering, a “new feature” (a new data model) requires an investigation phase, since the source data and use case are always unique, then an analysis phase where you design the model, and only then implementation (coding starts only at that stage).
When connecting new sources, you could be working with APIs, databases, cloud services, ERP connectors… In my experience, the number of APIs and similar systems we integrate with is far larger in data.
For specific projects, you might have to devise new solutions to store the data or work on automation, and the tools will differ depending on your audience.
Your colleagues will be different
There are very few, if any, Data Engineering-specific curricula. Data Engineering might require some analytical skills and knowledge of statistics (at the very least, it helps). It’s also possible to do DE with far less technical knowledge, because many tools were developed to enable Data Engineering and Data Science through no-code/low-code solutions.
Therefore, you will probably work with people who are far less technically savvy than you (if you’re coming from a Software Engineering position). It means you could end up being the most technical person on the team, with the responsibility to train others, or having to refrain from using some solutions because your colleagues won’t adopt them.
This is also a very important point to consider before switching jobs: you should investigate the composition of the team and how open they are to technical improvements.