Data Engineering 101
for absolute beginners
Who Is a Data Engineer?
“The data engineering field could be thought of as a superset of business intelligence and data warehousing that brings more elements from software engineering. This discipline also integrates specialization around the operation of so-called “big data” distributed systems, along with concepts around the extended Hadoop ecosystem, stream processing, and computation at scale.”
- Maxime Beauchemin, the original author of Airflow
Think of Artificial Intelligence as the top of a pyramid of needs. Yes, self-actualization (AI) is great, but you first need food, water, and shelter (data literacy, collection, and infrastructure).
https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007
What Does a Data Engineer Do?
One of their most highly sought-after skills is the ability to design, build, and maintain data warehouses. A data warehouse is a place where raw data is transformed and stored in query-able form. In many ways, data warehouses are both the engine and the fuel that enable higher-level analytics, be it business intelligence, online experimentation, or machine learning.
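To make "stored in query-able form" concrete, here is a minimal sketch using sqlite3 from the Python standard library as a stand-in warehouse. The table and column names (page_views, user_id, page, ts) are made up for illustration, not part of any real schema.

```python
# A minimal sketch of the "query-able form" idea. sqlite3 stands in for a
# real warehouse; the schema is hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, page TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)",
    [("u1", "/home", "2021-01-01"), ("u1", "/pricing", "2021-01-01"),
     ("u2", "/home", "2021-01-02")],
)

# Raw events become an analysis-ready aggregate that BI tools,
# experiments, or ML pipelines can query directly.
for row in conn.execute(
    "SELECT page, COUNT(*) AS views FROM page_views "
    "GROUP BY page ORDER BY views DESC"
):
    print(row)  # ('/home', 2) then ('/pricing', 1)
```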
ETL: Extract, Transform, and Load
Extract: This is the step where sensors wait for upstream data sources to land (e.g. an upstream source could be machine- or user-generated logs, a relational database copy, an external dataset, etc.). Once the data is available, we transport it from its source location for further transformation.
Transform: This is the heart of any ETL job, where we apply business logic and perform actions such as filtering, grouping, and aggregation to translate raw data into analysis-ready datasets. This step requires a great deal of business understanding and domain knowledge.
Load: We load the processed data and transport it to its final destination. Often, this dataset can either be consumed directly by end users or treated as yet another upstream dependency for another ETL job, forming the so-called data lineage.
These three conceptual steps are how most data pipelines are designed and structured. They serve as a blueprint for how raw data is transformed into analysis-ready data.
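Here is a minimal sketch of the three steps as plain Python functions. The file paths, field names (user_id, country, amount), and CSV/JSON formats are assumptions for illustration; real pipelines typically run these steps under a scheduler such as Airflow.

```python
# A sketch of Extract, Transform, Load as three plain functions.
# All names and formats here are hypothetical.
import csv
import json

def extract(path):
    """Extract: read raw log records once they have landed."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: apply business logic - filter, group, and aggregate."""
    totals = {}
    for row in rows:
        if not row.get("user_id"):       # filtering out bad records
            continue
        key = row["country"]             # grouping key
        totals[key] = totals.get(key, 0) + int(row["amount"])  # aggregation
    return totals

def load(totals, path):
    """Load: write the analysis-ready dataset to its final destination."""
    with open(path, "w") as f:
        json.dump(totals, f)

# Wiring the steps together (paths are placeholders):
# load(transform(extract("raw_orders.csv")), "orders_by_country.json")
```

The output file can then be consumed directly or serve as the upstream input to the next ETL job, which is exactly the data lineage idea described above.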
OLTP vs OLAP
We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general we can assume that OLTP systems provide source data to data warehouses, whereas OLAP systems help to analyze it.
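The contrast is easiest to see in the workloads themselves. A minimal sketch, again using sqlite3 with a hypothetical orders table: OLTP touches individual rows with many small writes, while OLAP scans and aggregates the whole table.

```python
# OLTP vs OLAP workloads side by side. sqlite3 is a stand-in; the schema
# is made up for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)

# OLTP-style workload: many small, single-row writes from an application.
conn.execute("INSERT INTO orders (customer, amount) VALUES (?, ?)", ("alice", 9.99))
conn.execute("INSERT INTO orders (customer, amount) VALUES (?, ?)", ("bob", 25.00))
conn.execute("UPDATE orders SET amount = 19.99 WHERE id = 1")

# OLAP-style workload: one analytical query scanning the whole table.
print(conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
).fetchall())
# [('alice', 19.99), ('bob', 25.0)]
```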
Data Engineer vs. Data Scientist (comparison chart)
Sources
https://medium.com/@rchang/a-beginners-guide-to-data-engineering-part-i-4227c5c457d7
https://www.datawarehouse4u.info/OLTP-vs-OLAP.html
https://www.datacamp.com/community/blog/data-scientist-vs-data-engineer
This deck was created using canva.com
