PySpark RDD Cheat Sheet

This cheat sheet provides a quick reference to the most commonly used PySpark RDD operations. PySpark RDDs (Resilient Distributed Datasets) are the fundamental data structure in Apache Spark: fault-tolerant, distributed collections of objects that can be processed in parallel. Transformations build new RDDs lazily; actions trigger computation and return results to the driver.
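
All of the snippets below assume an existing SparkContext named sc. A minimal setup sketch (the master URL and app name here are illustrative):

    from pyspark import SparkContext

    # "local[*]" runs Spark locally, using all available cores
    sc = SparkContext(master="local[*]", appName="rdd-cheat-sheet")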


1. Creating RDDs

  • From a local collection:

    data = [1, 2, 3, 4, 5]
    rdd = sc.parallelize(data)
    
  • From a text file:

    rdd = sc.textFile("file_path.txt")
    
  • From a directory of text files (yields one (filename, content) pair per file):

    rdd = sc.wholeTextFiles("directory_path")
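
A quick sketch combining these constructors (the paths are placeholders; assumes the sc from the setup above):

    # parallelize accepts an optional number of partitions
    rdd = sc.parallelize(range(10), 4)
    # each element of pairs is a (filename, content) tuple
    pairs = sc.wholeTextFiles("directory_path")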
    

2. Basic Transformations

  • map(func): Apply a function to each element.

    rdd.map(lambda x: x * 2)
    
  • flatMap(func): Apply a function to each element and flatten the result.

    rdd.flatMap(lambda x: x.split(" "))
    
  • filter(func): Filter elements based on a condition.

    rdd.filter(lambda x: x > 2)
    
  • distinct(): Return distinct elements.

    rdd.distinct()
    
  • sample(withReplacement, fraction, seed): Sample a fraction of the data; the fraction is the expected proportion, not an exact count.

    rdd.sample(False, 0.5, 42)
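
A worked pipeline chaining these transformations (values are illustrative; note that nothing executes until an action such as collect() is called):

    lines = sc.parallelize(["to be or not to be"])
    words = lines.flatMap(lambda line: line.split(" "))  # 6 words
    print(words.distinct().count())                      # 4: to, be, or, not
    nums = sc.parallelize([1, 2, 3]).map(lambda x: x * 2).filter(lambda x: x > 2)
    print(nums.collect())                                # [4, 6]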
    

3. Key-Value Pair Transformations

  • mapValues(func): Apply a function to the value of each key-value pair.

    rdd.mapValues(lambda x: x * 2)
    
  • flatMapValues(func): Apply a function to each value and pair every element of the result with the original key.

    rdd.flatMapValues(lambda x: x.split(" "))
    
  • reduceByKey(func): Aggregate values for each key.

    rdd.reduceByKey(lambda x, y: x + y)
    
  • groupByKey(): Group values for each key. Prefer reduceByKey for aggregations; it combines values map-side and shuffles less data.

    rdd.groupByKey()
    
  • sortByKey(ascending=True): Sort RDD by key.

    rdd.sortByKey()
    
  • keys(): Extract keys from key-value pairs.

    rdd.keys()
    
  • values(): Extract values from key-value pairs.

    rdd.values()
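
The classic word count exercises most of these at once (the input text is illustrative):

    pairs = sc.parallelize(["a b a", "b a"]) \
              .flatMap(lambda line: line.split(" ")) \
              .map(lambda w: (w, 1))
    counts = pairs.reduceByKey(lambda x, y: x + y)         # sums per key
    print(counts.sortByKey().collect())                    # [('a', 3), ('b', 2)]
    print(counts.mapValues(lambda n: n * 10).sortByKey().collect())
    # [('a', 30), ('b', 20)]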
    

4. Actions

  • collect(): Return all elements of the RDD as a list. Use with care: the entire result must fit in driver memory.

    rdd.collect()
    
  • count(): Return the number of elements in the RDD.

    rdd.count()
    
  • first(): Return the first element of the RDD.

    rdd.first()
    
  • take(n): Return the first n elements of the RDD.

    rdd.take(3)
    
  • takeSample(withReplacement, num, seed): Return a sample of num elements.

    rdd.takeSample(False, 5, 42)
    
  • reduce(func): Aggregate elements using a function.

    rdd.reduce(lambda x, y: x + y)
    
  • foreach(func): Apply a function to each element for its side effects (no return value; the function runs on the executors).

    rdd.foreach(lambda x: print(x))
    
  • saveAsTextFile(path): Save the RDD as text files, one part file per partition. The output path must not already exist.

    rdd.saveAsTextFile("output_path")
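
A small sketch of the common actions (the output path is a placeholder):

    rdd = sc.parallelize([3, 1, 2])
    print(rdd.count())                     # 3
    print(rdd.first())                     # 3
    print(rdd.take(2))                     # [3, 1]
    print(rdd.reduce(lambda x, y: x + y))  # 6
    rdd.saveAsTextFile("output_path")      # writes one part file per partition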
    

5. Set Operations

  • union(other): Return the union of two RDDs. Duplicates are kept; follow with distinct() for set semantics.

    rdd1.union(rdd2)
    
  • intersection(other): Return the intersection of two RDDs.

    rdd1.intersection(rdd2)
    
  • subtract(other): Return elements in the first RDD but not in the second.

    rdd1.subtract(rdd2)
    
  • cartesian(other): Return the Cartesian product of two RDDs.

    rdd1.cartesian(rdd2)
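
A sketch with small illustrative inputs (sorted() is used only to make the output deterministic):

    rdd1 = sc.parallelize([1, 2, 3])
    rdd2 = sc.parallelize([3, 4])
    print(sorted(rdd1.union(rdd2).collect()))      # [1, 2, 3, 3, 4]
    print(rdd1.intersection(rdd2).collect())       # [3]
    print(sorted(rdd1.subtract(rdd2).collect()))   # [1, 2]
    print(rdd1.cartesian(rdd2).count())            # 3 * 2 = 6 pairs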
    

6. Advanced Transformations

  • coalesce(numPartitions): Decrease the number of partitions without a full shuffle.

    rdd.coalesce(2)
    
  • repartition(numPartitions): Increase or decrease the number of partitions; always performs a full shuffle.

    rdd.repartition(4)
    
  • zip(other): Zip two RDDs together. Both must have the same number of partitions and the same number of elements per partition.

    rdd1.zip(rdd2)
    
  • zipWithIndex(): Zip RDD elements with their index.

    rdd.zipWithIndex()
    
  • zipWithUniqueId(): Zip RDD elements with a unique (but not necessarily consecutive) ID; unlike zipWithIndex, this does not trigger a separate Spark job.

    rdd.zipWithUniqueId()
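
Parallelizing two equal-length lists with the same partition count satisfies zip()'s alignment requirement, as in this minimal sketch:

    a = sc.parallelize(["x", "y", "z"], 2)
    b = sc.parallelize([1, 2, 3], 2)
    print(a.zip(b).collect())           # [('x', 1), ('y', 2), ('z', 3)]
    print(a.zipWithIndex().collect())   # [('x', 0), ('y', 1), ('z', 2)]
    print(a.repartition(4).coalesce(1).getNumPartitions())  # 1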
    

7. Persistence (Caching)

  • persist(storageLevel): Persist the RDD in memory and/or on disk.

    from pyspark import StorageLevel
    rdd.persist(StorageLevel.MEMORY_ONLY)
    
  • unpersist(): Remove the RDD from persistence.

    rdd.unpersist()
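
Persisting pays off when an RDD is reused across multiple actions; cache() is shorthand for persist(StorageLevel.MEMORY_ONLY). A minimal sketch:

    from pyspark import StorageLevel

    squares = sc.parallelize(range(1000)).map(lambda x: x * x)
    squares.persist(StorageLevel.MEMORY_AND_DISK)
    print(squares.count())   # first action computes and caches the RDD
    print(squares.sum())     # second action reads from the cache
    squares.unpersist()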
    

8. Debugging and Inspection

  • getNumPartitions(): Get the number of partitions.

    rdd.getNumPartitions()
    
  • glom(): Return an RDD of partitions as lists.

    rdd.glom().collect()
    
  • id(): Get the RDD's unique ID.

    rdd.id()
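
glom() is handy for seeing how data is spread across partitions, as in this small sketch:

    rdd = sc.parallelize(range(6), 3)
    print(rdd.getNumPartitions())   # 3
    print(rdd.glom().collect())     # [[0, 1], [2, 3], [4, 5]]
    print(rdd.id())                 # an integer unique within this SparkContext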
    

9. Joins

  • join(other): Inner join two RDDs.

    rdd1.join(rdd2)
    
  • leftOuterJoin(other): Left outer join two RDDs.

    rdd1.leftOuterJoin(rdd2)
    
  • rightOuterJoin(other): Right outer join two RDDs.

    rdd1.rightOuterJoin(rdd2)
    
  • fullOuterJoin(other): Full outer join two RDDs.

    rdd1.fullOuterJoin(rdd2)
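
Joins operate on key-value RDDs and pair up values by key; keys missing from one side appear as None in the outer variants. A small sketch with illustrative data:

    users = sc.parallelize([("a", 1), ("b", 2)])
    scores = sc.parallelize([("a", 3), ("c", 4)])
    print(users.join(scores).collect())                   # [('a', (1, 3))]
    print(sorted(users.leftOuterJoin(scores).collect()))  # [('a', (1, 3)), ('b', (2, None))]
    print(sorted(users.fullOuterJoin(scores).collect()))
    # [('a', (1, 3)), ('b', (2, None)), ('c', (None, 4))]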
    

10. Broadcast and Accumulator Variables

  • Broadcast Variables:

    broadcast_var = sc.broadcast([1, 2, 3])
    rdd.map(lambda x: x + broadcast_var.value[0])
    
  • Accumulator Variables:

    accum = sc.accumulator(0)
    rdd.foreach(lambda x: accum.add(1))
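
Broadcast variables ship a read-only value to every executor once; accumulators are write-only from tasks and should be updated only inside actions, since updates made in transformations can be re-applied if a task is retried. A minimal sketch with illustrative values:

    lookup = sc.broadcast({"a": 1, "b": 2})
    rdd = sc.parallelize(["a", "b", "a"])
    print(rdd.map(lambda k: lookup.value[k]).sum())  # 1 + 2 + 1 = 4

    counter = sc.accumulator(0)
    rdd.foreach(lambda _: counter.add(1))
    print(counter.value)                             # 3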
    

This cheat sheet covers the most essential PySpark RDD operations. For more advanced use cases, refer to the official PySpark documentation.