This cheat sheet provides a quick reference to the most commonly used PySpark RDD operations. PySpark RDDs (Resilient Distributed Datasets) are the fundamental data structure in Apache Spark, providing fault-tolerant, distributed data processing capabilities.
1. Creating RDDs
From a local collection:
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)
From a text file:
rdd = sc.textFile("file_path.txt")
From a directory of text files:
rdd = sc.wholeTextFiles("directory_path")
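A minimal end-to-end sketch of the three constructors above. The file and directory paths are placeholders, and the SparkContext setup can be skipped in the pyspark shell, where sc is already provided.

from pyspark import SparkContext

# Create a SparkContext (skip this in the pyspark shell, where sc already exists)
sc = SparkContext("local[*]", "rdd-cheatsheet")

# From a local collection
rdd = sc.parallelize([1, 2, 3, 4, 5])

# From a text file: one element per line (placeholder path)
lines = sc.textFile("file_path.txt")

# From a directory: one (filename, content) pair per file (placeholder path)
files = sc.wholeTextFiles("directory_path")

print(rdd.count())  # 5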
2. Basic Transformations
map(func)
: Apply a function to each element.
rdd.map(lambda x: x * 2)
flatMap(func)
: Apply a function to each element and flatten the result.
rdd.flatMap(lambda x: x.split(" "))
filter(func)
: Filter elements based on a condition.
rdd.filter(lambda x: x > 2)
distinct()
: Return distinct elements.
rdd.distinct()
sample(withReplacement, fraction, seed)
: Sample a fraction of the data.
rdd.sample(False, 0.5, 42)
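A short sketch chaining these transformations, assuming an existing SparkContext sc and invented sample data. Transformations are lazy; collect() triggers evaluation.

words = sc.parallelize(["spark makes rdds", "rdds are resilient", "spark is fast"])

# flatMap: split each line into words, flattening into a single RDD of words
tokens = words.flatMap(lambda line: line.split(" "))

# map: pair each word with its length
lengths = tokens.map(lambda w: (w, len(w)))

# filter: keep only words longer than 4 characters
long_words = lengths.filter(lambda pair: pair[1] > 4)

# distinct: drop duplicate (word, length) pairs
print(long_words.distinct().collect())

# sample: roughly half of the tokens, without replacement, fixed seed
print(tokens.sample(False, 0.5, 42).collect())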
3. Key-Value Pair Transformations
mapValues(func)
: Apply a function to the value of each key-value pair.
rdd.mapValues(lambda x: x * 2)
flatMapValues(func)
: Apply a function to the value of each key-value pair and flatten the result.
rdd.flatMapValues(lambda x: x.split(" "))
reduceByKey(func)
: Aggregate values for each key.
rdd.reduceByKey(lambda x, y: x + y)
groupByKey()
: Group values for each key.
rdd.groupByKey()
sortByKey(ascending=True)
: Sort RDD by key.
rdd.sortByKey()
keys()
: Extract keys from key-value pairs.
rdd.keys()
values()
: Extract values from key-value pairs.
rdd.values()
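A small worked example of the pair-RDD operations above, assuming an existing SparkContext sc and made-up (key, value) data.

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

# reduceByKey: sum the values for each key -> [("a", 4), ("b", 6)]
print(pairs.reduceByKey(lambda x, y: x + y).collect())

# groupByKey: values come back as an iterable per key
print([(k, list(v)) for k, v in pairs.groupByKey().collect()])

# mapValues: transform only the values; keys stay untouched
print(pairs.mapValues(lambda v: v * 10).collect())

# sortByKey, keys, values
print(pairs.sortByKey().collect())
print(pairs.keys().collect(), pairs.values().collect())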
4. Actions
collect()
: Return all elements of the RDD as a list.
rdd.collect()
count()
: Return the number of elements in the RDD.
rdd.count()
first()
: Return the first element of the RDD.
rdd.first()
take(n)
: Return the first n elements of the RDD.
rdd.take(3)
takeSample(withReplacement, num, seed)
: Return a sample of num elements.
rdd.takeSample(False, 5, 42)
reduce(func)
: Aggregate elements using a function.
rdd.reduce(lambda x, y: x + y)
foreach(func)
: Apply a function to each element (no return value).
rdd.foreach(lambda x: print(x))
saveAsTextFile(path)
: Save RDD as a text file.
rdd.saveAsTextFile("output_path")
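The actions above, run against one small RDD. This sketch assumes an existing SparkContext sc; the output path is a placeholder.

rdd = sc.parallelize([5, 3, 1, 4, 2])

print(rdd.collect())                    # [5, 3, 1, 4, 2]
print(rdd.count())                      # 5
print(rdd.first())                      # 5
print(rdd.take(3))                      # [5, 3, 1]
print(rdd.takeSample(False, 3, 42))     # 3 random elements, fixed seed
print(rdd.reduce(lambda x, y: x + y))   # 15

# foreach runs on the executors; its output goes to executor logs, not the driver
rdd.foreach(lambda x: print(x))

# Writes one part file per partition under the given directory (placeholder path)
rdd.saveAsTextFile("output_path")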
5. Set Operations
union(other)
: Return the union of two RDDs.
rdd1.union(rdd2)
intersection(other)
: Return the intersection of two RDDs.
rdd1.intersection(rdd2)
subtract(other)
: Return elements in the first RDD but not in the second.
rdd1.subtract(rdd2)
cartesian(other)
: Return the Cartesian product of two RDDs.
rdd1.cartesian(rdd2)
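A quick sketch of the set operations, assuming an existing SparkContext sc and two small invented RDDs.

rdd1 = sc.parallelize([1, 2, 3, 4])
rdd2 = sc.parallelize([3, 4, 5, 6])

print(rdd1.union(rdd2).collect())         # [1, 2, 3, 4, 3, 4, 5, 6] (keeps duplicates)
print(rdd1.intersection(rdd2).collect())  # 3 and 4, order not guaranteed
print(rdd1.subtract(rdd2).collect())      # 1 and 2, order not guaranteed
print(rdd1.cartesian(rdd2).collect())     # all (x, y) pairs, 16 in total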
6. Advanced Transformations
coalesce(numPartitions)
: Decrease the number of partitions.
rdd.coalesce(2)
repartition(numPartitions)
: Increase or decrease the number of partitions.
rdd.repartition(4)
zip(other)
: Zip two RDDs together.
rdd1.zip(rdd2)
zipWithIndex()
: Zip RDD elements with their index.
rdd.zipWithIndex()
zipWithUniqueId()
: Zip RDD elements with a unique ID.
rdd.zipWithUniqueId()
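A sketch of these partition-level operations, assuming an existing SparkContext sc and invented data created with explicit partition counts.

rdd = sc.parallelize(["a", "b", "c", "d"], 4)
other = sc.parallelize([10, 20, 30, 40], 4)

# coalesce avoids a full shuffle when reducing partitions; repartition always shuffles
fewer = rdd.coalesce(2)
more = rdd.repartition(8)
print(fewer.getNumPartitions(), more.getNumPartitions())  # 2 8

# zip requires both RDDs to have the same partitioning and element counts
print(rdd.zip(other).collect())       # [("a", 10), ("b", 20), ("c", 30), ("d", 40)]

# zipWithIndex assigns consecutive indices; zipWithUniqueId is cheaper but not consecutive
print(rdd.zipWithIndex().collect())   # [("a", 0), ("b", 1), ("c", 2), ("d", 3)]
print(rdd.zipWithUniqueId().collect())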
7. Persistence (Caching)
persist(storageLevel)
: Persist the RDD in memory or on disk.
rdd.persist(StorageLevel.MEMORY_ONLY)
unpersist()
: Remove the RDD from persistence.
rdd.unpersist()
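A minimal caching sketch, assuming an existing SparkContext sc. StorageLevel comes from the pyspark package.

from pyspark import StorageLevel

rdd = sc.parallelize(range(1000)).map(lambda x: x * x)

# Cache the computed RDD in memory so repeated actions don't recompute it
rdd.persist(StorageLevel.MEMORY_ONLY)
print(rdd.count())  # first action materializes and caches the RDD
print(rdd.sum())    # reuses the cached partitions

# Release the cached data when it is no longer needed
rdd.unpersist()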
8. Debugging and Inspection
getNumPartitions()
: Get the number of partitions.
rdd.getNumPartitions()
glom()
: Return an RDD of partitions as lists.
rdd.glom().collect()
id()
: Get the RDD's unique ID.
rdd.id()
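A quick inspection sketch, assuming an existing SparkContext sc; the expected glom output reflects how parallelize splits this particular list across 3 partitions.

rdd = sc.parallelize([1, 2, 3, 4, 5, 6], 3)

print(rdd.getNumPartitions())  # 3

# glom shows how the elements are laid out across partitions
print(rdd.glom().collect())    # [[1, 2], [3, 4], [5, 6]]

# The ID is unique within this SparkContext, handy when reading the Spark UI
print(rdd.id())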
9. Joins
join(other)
: Inner join two RDDs.
rdd1.join(rdd2)
leftOuterJoin(other)
: Left outer join two RDDs.
rdd1.leftOuterJoin(rdd2)
rightOuterJoin(other)
: Right outer join two RDDs.
rdd1.rightOuterJoin(rdd2)
fullOuterJoin(other)
: Full outer join two RDDs.
rdd1.fullOuterJoin(rdd2)
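A join sketch on two invented pair RDDs, assuming an existing SparkContext sc. Joins operate on keys; result ordering may vary.

employees = sc.parallelize([(1, "Alice"), (2, "Bob"), (3, "Carol")])
depts = sc.parallelize([(1, "Engineering"), (2, "Sales"), (4, "Marketing")])

# Inner join: only keys present in both RDDs
print(employees.join(depts).collect())
# e.g. [(1, ("Alice", "Engineering")), (2, ("Bob", "Sales"))]

# Outer joins fill the missing side with None
print(employees.leftOuterJoin(depts).collect())   # keeps key 3 with (..., None)
print(employees.rightOuterJoin(depts).collect())  # keeps key 4 with (None, ...)
print(employees.fullOuterJoin(depts).collect())   # keeps keys 3 and 4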
10. Broadcast and Accumulator Variables
Broadcast Variables:
broadcast_var = sc.broadcast([1, 2, 3])
rdd.map(lambda x: x + broadcast_var.value[0])
Accumulator Variables:
accum = sc.accumulator(0)
rdd.foreach(lambda x: accum.add(1))
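A combined sketch, assuming an existing SparkContext sc and an invented lookup table: broadcast variables are read-only data shipped once to each executor, while accumulators are counters updated on the executors and read back on the driver.

rdd = sc.parallelize([1, 2, 3, 4, 5])

# Broadcast: read-only lookup data shared with every task
lookup = sc.broadcast({1: "one", 2: "two", 3: "three"})
print(rdd.map(lambda x: lookup.value.get(x, "other")).collect())

# Accumulator: count the even elements on the executors, read the total on the driver
evens = sc.accumulator(0)
rdd.foreach(lambda x: evens.add(1) if x % 2 == 0 else None)
print(evens.value)  # 2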
This cheat sheet covers the most essential PySpark RDD operations. For more advanced use cases, refer to the official PySpark documentation.