
A Beginner's Guide to Spark: Insights from an MLE

3 min read
  1. Introduction to Spark

Spark is a distributed computing system designed for large-scale data processing. Here are key reasons why Spark is suitable for ML pipelines:

  • High Volume of Data: Many ML problems require processing large datasets, often terabytes in size. Spark efficiently distributes data across multiple nodes, enabling computation that would be impractical on a single machine.

  • Iterative Nature of ML: Training models often involves multiple passes over the same input data. Spark handles this efficiently because it can keep intermediate data in memory across iterations, avoiding repeated reads from disk, while distributing computation across machines for parallel processing.

Understanding Spark is a key skill in an MLE’s repertoire, and delving into its intricacies has greatly improved my ability to debug and optimise workflows.


  2. Spark Runtime Architecture

A Spark application consists of three components: Driver, Executor, and Cluster Manager.

Spark architecture. Image by the author

Here is a typical flow when a job is submitted to Spark:

  1. Spark launches the driver, which invokes the main method of the submitted program.

  2. The driver converts the user program into a Directed Acyclic Graph (DAG) of stages, which are further broken down into tasks.

  3. The driver communicates with the cluster manager to request the resources needed to run tasks.

  4. The cluster manager allocates resources and launches executors on worker nodes. The driver then sends tasks to these executors, distributing work across the cluster and enabling parallel processing.

  5. Each executor runs in its own JVM and processes its assigned tasks. If execution succeeds, it sends the results back to the driver.

  6. The driver concurrently monitors task progress and handles any failures.
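As a concrete illustration of this flow, here is a sketch of submitting a PySpark job to a cluster managed by YARN. The script name and resource numbers are placeholders, not values from a real deployment:

```shell
# Submit my_pipeline.py (a placeholder script) via the cluster manager (YARN here).
# --deploy-mode cluster launches the driver inside the cluster; YARN then starts
# the executors, which receive and run the tasks scheduled by the driver.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 4g \
  my_pipeline.py
```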

As an MLE, I find it useful to visualise and understand the DAG created by Spark.


  3. Concepts in Spark

RDD

RDD (Resilient Distributed Dataset) is Spark’s core abstraction: an immutable, distributed collection of objects. It has the following core characteristics:

  1. Resilient: Can be recomputed from its lineage if a partition is lost

  2. Distributed: Split across multiple nodes in a cluster

  3. Immutable: Cannot be changed after creation

  4. Lazily Evaluated: Transformations are not computed immediately

Partitions

An RDD is stored in partitions. Each partition represents a logical slice of the entire dataset and can be processed independently. Data is loaded into the executors at the granularity of a partition, thus allowing parallel processing.

Transformations and Actions

RDDs offer two types of operations. A transformation, such as filter, constructs a new RDD from an existing one. An action, such as count or collect, computes a result from an RDD and either returns it to the driver or saves it to an external storage system.

Lazy Evaluation

Spark employs lazy evaluation for RDDs, which means transformations are not executed immediately. Instead, Spark builds a computational graph of operations, and the actual data processing occurs only when an action is triggered. This approach allows Spark to optimise the entire sequence of transformations before actual computation, potentially reducing unnecessary work and improving overall performance.

Caching

By default, an RDD is recomputed each time an action is called on it. If you want to reuse an RDD across multiple actions, cache it with RDD.persist() or RDD.cache().


  4. Conclusion

Thanks for reading! This article provided a foundational introduction to Spark. I hope you found it insightful and valuable. Stay tuned!
