A Beginner's Guide to Spark: Insights from an MLE
Introduction to Spark
Spark is a distributed computing system designed for large-scale data processing. Here are key reasons why Spark is suitable for ML pipelines:
High Volume of Data: Many ML problems involve datasets that run into terabytes. Spark distributes the data across multiple nodes, enabling computation that would be impossible on a single machine.
Iterative Nature of ML: Training models often involves multiple passes over the input data. Spark handles this efficiently because it can keep intermediate data in memory across passes while distributing the computation across machines for parallel processing.
Understanding Spark is a key skill in an MLE’s repertoire, and delving into its intricacies has greatly improved my ability to debug and optimise workflows.
Spark Runtime Architecture
A Spark application consists of three components: Driver, Executor, and Cluster Manager.

Here is a typical flow when a job is submitted to Spark:
Spark launches the driver, which invokes the main method of the submitted program.
The driver breaks down the user program into a Directed Acyclic Graph (DAG) of tasks.
The driver communicates with the cluster manager to request resources for launching tasks.
The cluster manager forwards the tasks submitted by the driver to the executors. Tasks are distributed across the cluster, enabling parallel processing.
Each executor runs in its own JVM and processes its tasks. If execution succeeds, it sends the results back to the driver.
The driver concurrently monitors task progress and handles any failures.
As an MLE, I find it useful to visualise and understand the DAG created by Spark.
Concepts in Spark
RDD
RDD (Resilient Distributed Dataset) is Spark’s core abstraction, and is simply an immutable distributed collection of objects. It has the following core characteristics:
Distributed: Split across multiple nodes in a cluster
Immutable: Cannot be changed after creation
Lazily Evaluated: Transformations are not computed immediately
Partitions
An RDD is stored in partitions. Each partition represents a logical slice of the entire dataset and can be processed independently. Data is loaded into the executors at the granularity of a partition, thus allowing parallel processing.

Transformations and Actions
RDDs offer two types of operations. A transformation, such as filter, constructs a new RDD from an existing one. An action, such as count or collect, computes a result based on an RDD and either returns it to the driver or saves it to an external system.

Lazy Evaluation
Spark employs lazy evaluation for RDDs, which means transformations are not executed immediately. Instead, Spark builds a computational graph of operations, and the actual data processing occurs only when an action is triggered. This approach allows Spark to optimise the entire sequence of transformations before actual computation, potentially reducing unnecessary work and improving overall performance.
Caching
An RDD is recomputed by default each time an action is called on it. If you want to reuse an RDD across multiple actions, you can cache it using RDD.persist() or RDD.cache().
Conclusion
Thanks for reading! This article provided a foundational introduction to Spark. Understanding Spark's intricacies has significantly improved my ability to debug and optimise workflows. I hope you found this insightful and valuable. Stay tuned!
References
Learning Spark: Lightning-Fast Data Analytics, 2nd Edition - https://www.oreilly.com/library/view/learning-spark-2nd/9781492050032/