
Why Is Apache Spark's RDD Immutable?

Kirill Bobrov
4 min read · Jul 12, 2024


Every now and then, when I find myself on the interviewing side of the table, I like to toss in a question about Apache Spark’s RDD and its immutable nature. It’s a simple question, but the answers can reveal a deep understanding — or lack thereof — of distributed data processing principles.

Apache Spark is a powerful and widely used framework for distributed data processing, beloved for its efficiency and scalability. At the heart of Spark’s magic lies the RDD, an abstraction that’s more than just a mere data collection. In this blog post, we’ll explore why RDDs are immutable and the benefits this immutability provides in the context of Apache Spark.

What is an RDD?

Before we dive into the “why” of RDD immutability, let’s first clarify what an RDD is.

RDD stands for Resilient Distributed Dataset. Contrary to what the name might suggest, it’s not a traditional collection of data like an array or list. Instead, an RDD is an abstraction that Spark provides to represent a large collection of data distributed across a computer cluster. This abstraction allows users to perform various transformations and actions on the data in a distributed manner without dealing with the underlying complexity of data distribution and fault tolerance.
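To make this concrete, here’s a minimal PySpark sketch (the local master and app name are illustrative, assuming a standard PySpark installation). Notice that transformations like map and filter never modify an existing RDD; they always produce a new one:

```python
from pyspark import SparkContext

# Illustrative local setup; in production this would point at a cluster.
sc = SparkContext("local[*]", "rdd-immutability-demo")

# Distribute a small collection across the cluster as an RDD.
numbers = sc.parallelize([1, 2, 3, 4, 5])

# Transformations are lazy and always return a *new* RDD;
# `numbers` itself is never changed.
doubled = numbers.map(lambda x: x * 2)
evens = doubled.filter(lambda x: x % 4 == 0)

# Actions trigger the actual distributed computation.
print(numbers.collect())  # [1, 2, 3, 4, 5] -- original RDD is unchanged
print(doubled.collect())  # [2, 4, 6, 8, 10]
print(evens.collect())    # [4, 8]

sc.stop()
```

Spark handles partitioning the data and scheduling the work across the cluster; the user only ever describes transformations over the abstraction.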
