This cheat sheet provides a quick reference to the most commonly used PySpark RDD operations. PySpark RDDs (Resilient Distributed Datasets) are the fundamental data structure in Apache Spark, providing fault-tolerant, distributed data processing capabilities.
1. Creating RDDs
From a local collection:
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)
From a text file:
rdd = sc.textFile("file_path.txt")
From a directory of text files:
rdd = sc.wholeTextFiles("directory_path")
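A minimal end-to-end sketch of the three constructors above. The file and directory paths are placeholders, and the SparkContext setup can be skipped in the pyspark shell, where sc is already provided.

from pyspark import SparkContext

# Create a SparkContext (skip this in the pyspark shell, where sc already exists)
sc = SparkContext("local[*]", "rdd-cheatsheet")

# From a local collection
rdd = sc.parallelize([1, 2, 3, 4, 5])

# From a text file: one element per line (placeholder path)
lines = sc.textFile("file_path.txt")

# From a directory: one (filename, content) pair per file (placeholder path)
files = sc.wholeTextFiles("directory_path")

print(rdd.count())  # 5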
2. Basic Transformations
map(func)
: Apply a function to each element.
rdd.map(lambda x: x * 2)
flatMap(func)
: Apply a function to each element and flatten the result.
rdd.flatMap(lambda x: x.split(" "))
filter(func)
: Filter elements based on a condition.
rdd.filter(lambda x: x > 2)
distinct()
: Return distinct elements.
rdd.distinct()
sample(withReplacement, fraction, seed)
: Sample a fraction of the data.
rdd.sample(False, 0.5, 42)
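A short sketch chaining these transformations, assuming an existing SparkContext sc and invented sample data. Transformations are lazy; collect() triggers evaluation.

words = sc.parallelize(["spark makes rdds", "rdds are resilient", "spark is fast"])

# flatMap: split each line into words, flattening into a single RDD of words
tokens = words.flatMap(lambda line: line.split(" "))

# map: pair each word with its length
lengths = tokens.map(lambda w: (w, len(w)))

# filter: keep only words longer than 4 characters
long_words = lengths.filter(lambda pair: pair[1] > 4)

# distinct: drop duplicate (word, length) pairs
print(long_words.distinct().collect())

# sample: roughly half of the tokens, without replacement, fixed seed
print(tokens.sample(False, 0.5, 42).collect())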
3. Key-Value Pair Transformations
mapValues(func)
: Apply a function to the value of each key-value pair.
rdd.mapValues(lambda x: x * 2)
flatMapValues(func)
: Apply a function to the value of each key-value pair and flatten the result.
rdd.flatMapValues(lambda x: x.split(" "))
reduceByKey(func)
: Aggregate values for each key.
rdd.reduceByKey(lambda x, y: x + y)
groupByKey()
: Group values for each key.
rdd.groupByKey()
sortByKey(ascending=True)
: Sort RDD by key.
rdd.sortByKey()
keys()
: Extract keys from key-value pairs.
rdd.keys()
values()
: Extract values from key-value pairs.
rdd.values()
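A small worked example of the pair-RDD operations above, assuming an existing SparkContext sc and made-up (key, value) data.

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

# reduceByKey: sum the values for each key -> [("a", 4), ("b", 6)]
print(pairs.reduceByKey(lambda x, y: x + y).collect())

# groupByKey: values come back as an iterable per key
print([(k, list(v)) for k, v in pairs.groupByKey().collect()])

# mapValues: transform only the values; keys stay untouched
print(pairs.mapValues(lambda v: v * 10).collect())

# sortByKey, keys, values
print(pairs.sortByKey().collect())
print(pairs.keys().collect(), pairs.values().collect())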
4. Actions
collect()
: Return all elements of the RDD as a list.
rdd.collect()
count()
: Return the number of elements in the RDD.
rdd.count()
first()
: Return the first element of the RDD.
rdd.first()
take(n)
: Return the first n elements of the RDD.
rdd.take(3)
takeSample(withReplacement, num, seed)
: Return a sample of num elements.
rdd.takeSample(False, 5, 42)
reduce(func)
: Aggregate elements using a function.
rdd.reduce(lambda x, y: x + y)
foreach(func)
: Apply a function to each element (no return value).
rdd.foreach(lambda x: print(x))
saveAsTextFile(path)
: Save RDD as a text file.
rdd.saveAsTextFile("output_path")
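The actions above, run against one small RDD. This sketch assumes an existing SparkContext sc; the output path is a placeholder.

rdd = sc.parallelize([5, 3, 1, 4, 2])

print(rdd.collect())                    # [5, 3, 1, 4, 2]
print(rdd.count())                      # 5
print(rdd.first())                      # 5
print(rdd.take(3))                      # [5, 3, 1]
print(rdd.takeSample(False, 3, 42))     # 3 random elements, fixed seed
print(rdd.reduce(lambda x, y: x + y))   # 15

# foreach runs on the executors; its output goes to executor logs, not the driver
rdd.foreach(lambda x: print(x))

# Writes one part file per partition under the given directory (placeholder path)
rdd.saveAsTextFile("output_path")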
5. Set Operations
union(other)
: Return the union of two RDDs.
rdd1.union(rdd2)
intersection(other)
: Return the intersection of two RDDs.
rdd1.intersection(rdd2)
subtract(other)
: Return elements in the first RDD but not in the second.
rdd1.subtract(rdd2)
cartesian(other)
: Return the Cartesian product of two RDDs.
rdd1.cartesian(rdd2)
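A quick sketch of the set operations, assuming an existing SparkContext sc and two small invented RDDs.

rdd1 = sc.parallelize([1, 2, 3, 4])
rdd2 = sc.parallelize([3, 4, 5, 6])

print(rdd1.union(rdd2).collect())         # [1, 2, 3, 4, 3, 4, 5, 6] (keeps duplicates)
print(rdd1.intersection(rdd2).collect())  # 3 and 4, order not guaranteed
print(rdd1.subtract(rdd2).collect())      # 1 and 2, order not guaranteed
print(rdd1.cartesian(rdd2).collect())     # all (x, y) pairs, 16 in total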
6. Advanced Transformations
coalesce(numPartitions)
: Decrease the number of partitions.
rdd.coalesce(2)
repartition(numPartitions)
: Increase or decrease the number of partitions.
rdd.repartition(4)
zip(other)
: Zip two RDDs together.
rdd1.zip(rdd2)
zipWithIndex()
: Zip RDD elements with their index.
rdd.zipWithIndex()
zipWithUniqueId()
: Zip RDD elements with a unique ID.
rdd.zipWithUniqueId()
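A sketch of these partition-level operations, assuming an existing SparkContext sc and invented data created with explicit partition counts.

rdd = sc.parallelize(["a", "b", "c", "d"], 4)
other = sc.parallelize([10, 20, 30, 40], 4)

# coalesce avoids a full shuffle when reducing partitions; repartition always shuffles
fewer = rdd.coalesce(2)
more = rdd.repartition(8)
print(fewer.getNumPartitions(), more.getNumPartitions())  # 2 8

# zip requires both RDDs to have the same partitioning and element counts
print(rdd.zip(other).collect())       # [("a", 10), ("b", 20), ("c", 30), ("d", 40)]

# zipWithIndex assigns consecutive indices; zipWithUniqueId is cheaper but not consecutive
print(rdd.zipWithIndex().collect())   # [("a", 0), ("b", 1), ("c", 2), ("d", 3)]
print(rdd.zipWithUniqueId().collect())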
7. Persistence (Caching)
persist(storageLevel)
: Persist the RDD in memory or on disk.
rdd.persist(StorageLevel.MEMORY_ONLY)
unpersist()
: Remove the RDD from persistence.
rdd.unpersist()
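A minimal caching sketch, assuming an existing SparkContext sc. StorageLevel comes from the pyspark package.

from pyspark import StorageLevel

rdd = sc.parallelize(range(1000)).map(lambda x: x * x)

# Cache the computed RDD in memory so repeated actions don't recompute it
rdd.persist(StorageLevel.MEMORY_ONLY)
print(rdd.count())  # first action materializes and caches the RDD
print(rdd.sum())    # reuses the cached partitions

# Release the cached data when it is no longer needed
rdd.unpersist()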
8. Debugging and Inspection
getNumPartitions()
: Get the number of partitions.
rdd.getNumPartitions()
glom()
: Return an RDD of partitions as lists.
rdd.glom().collect()
id()
: Get the RDD's unique ID.
rdd.id()
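A quick inspection sketch, assuming an existing SparkContext sc; the expected glom output reflects how parallelize splits this particular list across 3 partitions.

rdd = sc.parallelize([1, 2, 3, 4, 5, 6], 3)

print(rdd.getNumPartitions())  # 3

# glom shows how the elements are laid out across partitions
print(rdd.glom().collect())    # [[1, 2], [3, 4], [5, 6]]

# The ID is unique within this SparkContext, handy when reading the Spark UI
print(rdd.id())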
9. Joins
join(other)
: Inner join two RDDs.
rdd1.join(rdd2)
leftOuterJoin(other)
: Left outer join two RDDs.
rdd1.leftOuterJoin(rdd2)
rightOuterJoin(other)
: Right outer join two RDDs.
rdd1.rightOuterJoin(rdd2)
fullOuterJoin(other)
: Full outer join two RDDs.
rdd1.fullOuterJoin(rdd2)
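A join sketch on two invented pair RDDs, assuming an existing SparkContext sc. Joins operate on keys; result ordering may vary.

employees = sc.parallelize([(1, "Alice"), (2, "Bob"), (3, "Carol")])
depts = sc.parallelize([(1, "Engineering"), (2, "Sales"), (4, "Marketing")])

# Inner join: only keys present in both RDDs
print(employees.join(depts).collect())
# e.g. [(1, ("Alice", "Engineering")), (2, ("Bob", "Sales"))]

# Outer joins fill the missing side with None
print(employees.leftOuterJoin(depts).collect())   # keeps key 3 with (..., None)
print(employees.rightOuterJoin(depts).collect())  # keeps key 4 with (None, ...)
print(employees.fullOuterJoin(depts).collect())   # keeps keys 3 and 4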
10. Broadcast and Accumulator Variables
Broadcast Variables:
broadcast_var = sc.broadcast([1, 2, 3])
rdd.map(lambda x: x + broadcast_var.value[0])
Accumulator Variables:
accum = sc.accumulator(0)
rdd.foreach(lambda x: accum.add(1))
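A combined sketch, assuming an existing SparkContext sc and an invented lookup table: broadcast variables are read-only data shipped once to each executor, while accumulators are counters updated on the executors and read back on the driver.

rdd = sc.parallelize([1, 2, 3, 4, 5])

# Broadcast: read-only lookup data shared with every task
lookup = sc.broadcast({1: "one", 2: "two", 3: "three"})
print(rdd.map(lambda x: lookup.value.get(x, "other")).collect())

# Accumulator: count the even elements on the executors, read the total on the driver
evens = sc.accumulator(0)
rdd.foreach(lambda x: evens.add(1) if x % 2 == 0 else None)
print(evens.value)  # 2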
This cheat sheet covers the most essential PySpark RDD operations. For more advanced use cases, refer to the official PySpark documentation.