Spark core concepts explained

Apache Spark architecture is based on two main abstractions: RDD and DAG. Let's dive into what those concepts are.

RDD — the basic Spark abstraction

RDD (Resilient Distributed Dataset) is an immutable, partitioned collection of records that Spark distributes across the cluster and processes in parallel.

>>> rdd = sc.parallelize(range(20))  # create RDD
>>> rdd
ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:195
>>> rdd.collect() # collect the data on the driver and show it
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
>>> rdd.getNumPartitions()  # get the current number of partitions
4
>>> rdd.glom().collect() # collect data on the driver, keeping the partition structure
[[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11, 12, 13, 14], [15, 16, 17, 18, 19]]
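
The number of partitions determines the level of parallelism. A minimal sketch of how to control it, assuming the same PySpark shell with sc available as above (repartition involves a shuffle, coalesce only merges existing partitions):

>>> rdd8 = sc.parallelize(range(20), 8)  # explicitly request 8 partitions
>>> rdd8.getNumPartitions()
8
>>> rdd8.coalesce(2).getNumPartitions()  # shrink the number of partitions without a shuffle
2
>>> rdd8.repartition(16).getNumPartitions()  # increase (or rebalance) with a full shuffle
16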

Transformations

Transformations are lazy operations that build a new RDD from an existing one; Spark only records the lineage and computes nothing until an action is called.

>>> filteredRDD = rdd.filter(lambda x: x > 10)
>>> print(filteredRDD.toDebugString()) # to see the execution graph; only one stage is created
(4) PythonRDD[1] at RDD at PythonRDD.scala:53 []
| ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:195 []
>>> filteredRDD.collect()
[11, 12, 13, 14, 15, 16, 17, 18, 19]
>>> groupedRDD = filteredRDD.groupBy(lambda x: x % 2)  # group data based on mod
>>> print(groupedRDD.toDebugString()) # two separate stages are created, because of the shuffle
(4) PythonRDD[6] at RDD at PythonRDD.scala:53 []
| MapPartitionsRDD[5] at mapPartitions at PythonRDD.scala:133 []
| ShuffledRDD[4] at partitionBy at NativeMethodAccessorImpl.java:0 []
+-(4) PairwiseRDD[3] at groupBy at <ipython-input-5-a92aa13dcb83>:1 []
| PythonRDD[2] at groupBy at <ipython-input-5-a92aa13dcb83>:1 []
| ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:195 []
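
Transformations like filter and map are narrow: each output partition depends on a single input partition, so they stay within one stage. Transformations like groupBy and reduceByKey are wide: they require a shuffle and start a new stage. A small sketch on the same rdd using reduceByKey; the first two lines return instantly, and nothing runs until collect() is called (the ordering of the resulting pairs may vary):

>>> pairs = rdd.map(lambda x: (x % 3, x))  # narrow transformation, no shuffle
>>> sums = pairs.reduceByKey(lambda a, b: a + b)  # wide transformation, shuffle required
>>> sums.collect()  # only now is the whole lineage actually executed
[(0, 63), (1, 70), (2, 57)]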

Actions

Actions trigger the actual computation: they force the lineage to be evaluated and either return a result to the driver or write it out to storage.

>>> filteredRDD.reduce(lambda a, b: a + b)
135
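
A few more commonly used actions, sketched on the same filteredRDD (which holds the values 11 through 19); the output path in the last line is only an example:

>>> filteredRDD.count()  # number of elements
9
>>> filteredRDD.first()  # first element
11
>>> filteredRDD.take(3)  # first three elements, returned to the driver
[11, 12, 13]
>>> filteredRDD.saveAsTextFile("/tmp/filtered")  # write the RDD out, one part-file per partition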

DAG

The DAG (Directed Acyclic Graph) is the execution plan Spark builds from the RDD lineage: every transformation adds a node to the graph, and when an action is called the DAG scheduler splits the graph into stages at shuffle boundaries and schedules the tasks of each stage on the executors.
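
The indentation in the toDebugString() output above already reflects this: the +-(4) line marks the stage boundary introduced by the shuffle. As a rough sketch, the classic word count builds a two-stage DAG in exactly the same way (the input path is only an example and is assumed to exist):

>>> lines = sc.textFile("/tmp/some_text_file.txt")  # example input path
>>> counts = (lines.flatMap(lambda line: line.split())  # narrow: same stage
...                .map(lambda word: (word, 1))  # narrow: same stage
...                .reduceByKey(lambda a, b: a + b))  # wide: shuffle, new stage
>>> print(counts.toDebugString())  # the graph is split into two stages at the shuffle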
