This document provides an introduction to Apache Spark, including its core components, architecture, and programming model. Some key points:
- Spark uses Resilient Distributed Datasets (RDDs) as its fundamental data structure: immutable, fault-tolerant distributed collections that enable in-memory computation across a cluster.
- RDDs support transformations such as map and filter, and actions such as reduce and collect that return results to the driver. Transformations are lazy and only record lineage; actions trigger the actual computation (see the RDD sketch after this list).
- Spark's execution model involves a driver program that builds a DAG of stages and schedules tasks onto executors running on worker nodes.
- Spark SQL, MLlib, GraphX, and Spark Streaming extend the core Spark API for structured data, machine learning, graph processing, and stream processing (see the DataFrame sketch after the RDD example below).
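The following is a minimal Scala sketch of the RDD programming model described above: the driver creates a SparkContext, transformations build a lazy lineage, and an action triggers execution. The application name, master URL, and input values are illustrative choices, not anything prescribed by the source.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddExample {
  def main(args: Array[String]): Unit = {
    // The driver program creates a SparkContext, which coordinates tasks on the cluster.
    // "local[*]" runs Spark locally on all cores; a real deployment would point at a
    // cluster manager (standalone, YARN, Kubernetes, etc.).
    val conf = new SparkConf().setAppName("rdd-example").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // parallelize distributes a local collection across the cluster as an RDD.
    val numbers = sc.parallelize(1 to 100)

    // filter and map are transformations: they only record lineage and run nothing yet.
    val evensDoubled = numbers.filter(_ % 2 == 0).map(_ * 2)

    // reduce is an action: it triggers computation and returns the result to the driver.
    val total = evensDoubled.reduce(_ + _)
    println(s"Sum of doubled evens: $total")

    sc.stop()
  }
}
```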
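And a brief sketch of the Spark SQL extension mentioned in the last bullet, using the DataFrame API. The column names and sample rows are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object SqlExample {
  def main(args: Array[String]): Unit = {
    // SparkSession is the entry point for Spark SQL and the DataFrame API.
    val spark = SparkSession.builder()
      .appName("sql-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A small in-memory DataFrame with hypothetical data.
    val people = Seq(("Alice", 34), ("Bob", 28), ("Carol", 45)).toDF("name", "age")

    // DataFrame operations are also lazy; show() is the action that runs the query.
    people.filter($"age" > 30).select("name").show()

    spark.stop()
  }
}
```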