What Is Data Engineering?

By Lillian Pierson

If engineering is the practice of using science and technology to design and build systems that solve problems, then you can think of data engineering as the engineering domain that’s dedicated to overcoming data-processing bottlenecks and data-handling problems for applications that utilize big data.

Data engineers use skills in computer science and software engineering to design systems that handle and manipulate big data sets, and to solve the problems that arise in doing so. Data engineers have experience working with and designing real-time processing frameworks and Massively Parallel Processing (MPP) platforms, as well as relational database management systems.

They generally code in Java, C++, and Python. They know how to deploy Hadoop and write MapReduce jobs to handle, process, and refine big data into more manageably sized datasets. Simply put, with respect to data science, the purpose of data engineering is to engineer big data solutions by building coherent, modular, and scalable data processing platforms from which data scientists can subsequently derive insights.
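To make the idea of refining data with MapReduce a little more concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce phases, written in plain Python. It is an illustration of the programming model only, not Hadoop's actual API; the function names and the sample records are assumptions made for this example.

```python
from collections import defaultdict


def map_phase(records):
    """Map step: emit a (word, 1) pair for every word in every record."""
    for record in records:
        for word in record.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: collapse each key's list of values into one total."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(records):
    """Run the three phases in order on an in-memory list of records."""
    return reduce_phase(shuffle_phase(map_phase(records)))


# Hypothetical sample input; a real job would read from distributed storage.
sample_records = ["big data big systems", "data systems at scale"]
counts = word_count(sample_records)
print(counts)
```

On a real cluster, a framework such as Hadoop would run the map and reduce steps in parallel across many machines and handle the shuffle over the network; the sketch above only shows how a large input is reduced to a much smaller summary.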

Most engineered systems are built systems — systems that are constructed or manufactured in the physical world. Data engineering is different, though. It involves designing, building, and implementing software solutions to problems in the data world — a world that can seem pretty abstract when compared to the physical reality of the Golden Gate Bridge or the Aswan Dam.

Using data engineering skills, you can do things like the following:

  • Build large-scale Software as a Service (SaaS) applications.

  • Build and customize Hadoop and MapReduce applications.

  • Design and build relational databases and highly scaled distributed architectures for processing big data.

  • Extract, transform, and load (ETL) data from one database into another.

Data engineers need solid skills in computer science, database design, and software engineering to be able to perform this type of work.
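As a rough illustration of the extract, transform, and load (ETL) task in the list above, the following sketch moves rows from one database into another using Python's built-in sqlite3 module. The table names, column names, sample rows, and cleaning rules are all hypothetical, and two in-memory SQLite databases stand in for what would normally be separate production systems.

```python
import sqlite3


def extract(source_conn):
    """Extract: read the raw rows out of the source database."""
    query = "SELECT id, name, amount_cents FROM raw_orders"
    return source_conn.execute(query).fetchall()


def transform(rows):
    """Transform: drop rows with no amount, tidy names, convert cents to dollars."""
    cleaned = []
    for order_id, name, amount_cents in rows:
        if amount_cents is None:
            continue  # skip incomplete records
        cleaned.append((order_id, name.strip().title(), amount_cents / 100))
    return cleaned


def load(target_conn, rows):
    """Load: write the cleaned rows into the target database."""
    target_conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(id INTEGER PRIMARY KEY, customer TEXT, amount_usd REAL)"
    )
    target_conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    target_conn.commit()


# Hypothetical source database with a few messy sample rows.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE raw_orders (id INTEGER, name TEXT, amount_cents INTEGER)")
source.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "  alice smith ", 1250), (2, "bob jones", None), (3, "carol diaz", 999)],
)

# Hypothetical target database that receives the cleaned data.
target = sqlite3.connect(":memory:")
load(target, transform(extract(source)))

loaded = target.execute(
    "SELECT id, customer, amount_usd FROM orders ORDER BY id"
).fetchall()
print(loaded)
```

Keeping the three steps as separate functions is one common design choice: each step can be tested on its own, and the transform logic can change without touching the code that talks to either database.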

Software as a Service (SaaS) is a term that describes cloud-hosted software services that are made available to users via the Internet.