Data Engineering 101
for absolute beginners
Who Is a Data Engineer?
“The data engineering field could be thought of as a superset of business intelligence and data warehousing that brings more elements from software engineering. This discipline also integrates specialization around the operation of so-called “big data” distributed systems, along with concepts around the extended Hadoop ecosystem, stream processing, and computation at scale.”
- Maxime Beauchemin, the original author of Airflow
Think of Artificial Intelligence as the top of a pyramid of needs. Yes, self-actualization (AI) is great, but you first need food, water, and shelter (data literacy, collection, and infrastructure).
https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007
What Does a Data Engineer Do?
One of their most highly sought-after skills is the ability to design, build, and maintain data warehouses. A data warehouse is a place where raw data is transformed and stored in query-able form. In many ways, data warehouses are both the engine and the fuel that enable higher-level analytics, be it business intelligence, online experimentation, or machine learning.
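To make "stored in query-able form" concrete, here is a minimal sketch using sqlite3 from the Python standard library as a stand-in warehouse. The table and column names (page_views, user_id, page, ts) are made up for illustration, not part of any real schema.

```python
# A minimal sketch of the "query-able form" idea. sqlite3 stands in for a
# real warehouse; the schema is hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, page TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)",
    [("u1", "/home", "2021-01-01"), ("u1", "/pricing", "2021-01-01"),
     ("u2", "/home", "2021-01-02")],
)

# Raw events become an analysis-ready aggregate that BI tools,
# experiments, or ML pipelines can query directly.
for row in conn.execute(
    "SELECT page, COUNT(*) AS views FROM page_views "
    "GROUP BY page ORDER BY views DESC"
):
    print(row)  # ('/home', 2) then ('/pricing', 1)
```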
ETL: Extract, Transform, and Load
Extract: This is the step where sensors wait for upstream data sources to land (e.g. an upstream source could be machine- or user-generated logs, a relational database copy, an external dataset, etc.). Once the data is available, we transport it from its source location for further transformation.
Transform: This is the heart of any ETL job, where we apply business logic and perform actions such as filtering, grouping, and aggregation to translate raw data into analysis-ready datasets. This step requires a great deal of business understanding and domain knowledge.
Load: We load the processed data and transport it to its final destination. Often, this dataset can either be consumed directly by end users or treated as yet another upstream dependency for another ETL job, forming the so-called data lineage.
These three conceptual steps are how most data pipelines are designed and structured. They serve as a blueprint for how raw data is transformed into analysis-ready data.
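Here is a minimal sketch of the three steps as plain Python functions. The file paths, field names (user_id, country, amount), and CSV/JSON formats are assumptions for illustration; real pipelines typically run these steps under a scheduler such as Airflow.

```python
# A sketch of Extract, Transform, Load as three plain functions.
# All names and formats here are hypothetical.
import csv
import json

def extract(path):
    """Extract: read raw log records once they have landed."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: apply business logic - filter, group, and aggregate."""
    totals = {}
    for row in rows:
        if not row.get("user_id"):       # filtering out bad records
            continue
        key = row["country"]             # grouping key
        totals[key] = totals.get(key, 0) + int(row["amount"])  # aggregation
    return totals

def load(totals, path):
    """Load: write the analysis-ready dataset to its final destination."""
    with open(path, "w") as f:
        json.dump(totals, f)

# Wiring the steps together (paths are placeholders):
# load(transform(extract("raw_orders.csv")), "orders_by_country.json")
```

The output file can then be consumed directly or serve as the upstream input to the next ETL job, which is exactly the data lineage idea described above.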
OLTP vs OLAP
We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general we can assume that OLTP systems provide source data to data warehouses, whereas OLAP systems help to analyze it.
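The contrast is easiest to see in the workloads themselves. A minimal sketch, again using sqlite3 with a hypothetical orders table: OLTP touches individual rows with many small writes, while OLAP scans and aggregates the whole table.

```python
# OLTP vs OLAP workloads side by side. sqlite3 is a stand-in; the schema
# is made up for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)

# OLTP-style workload: many small, single-row writes from an application.
conn.execute("INSERT INTO orders (customer, amount) VALUES (?, ?)", ("alice", 9.99))
conn.execute("INSERT INTO orders (customer, amount) VALUES (?, ?)", ("bob", 25.00))
conn.execute("UPDATE orders SET amount = 19.99 WHERE id = 1")

# OLAP-style workload: one analytical query scanning the whole table.
print(conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
).fetchall())
# [('alice', 19.99), ('bob', 25.0)]
```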
Data Engineer vs. Data Scientist (comparison chart)
Sources
https://medium.com/@rchang/a-beginners-guide-to-data-engineering-part-i-4227c5c457d7
https://www.datawarehouse4u.info/OLTP-vs-OLAP.html
https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer
This deck was created using canva.com
