• Low Level (RDD API)
  • Slower than the high-level API
  • More complex
  • More flexible

  • Create a SparkSession

    import findspark
    findspark.init()                             # locate the local Spark installation
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()   # reuse an existing session or create one
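
    As a quick sanity check, the session can be inspected; SparkSession exposes a version property:

    print(spark.version)    # e.g. '3.5.0', depending on the installed Spark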
    
  • Create a SparkContext

    sc = spark.sparkContext    # entry point for the low-level RDD API
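
    The context also reports settings that affect the examples below, for instance the default partition count:

    sc.defaultParallelism    # default number of partitions, typically the local core count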
    
  • Create rdd0 by parallelizing a NumPy array

    import numpy as np
    data = np.arange(5)            # sample data (assumed); any NumPy array works
    rdd0 = sc.parallelize(data)    # distribute the array across workers as an RDD
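
    A small usage check (getNumPartitions is part of the RDD API):

    rdd0.getNumPartitions()    # how many partitions the array was split into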
    
  • Create rdd1, rdd2, rdd3, rdd4 by chaining transformations (the operations chosen for rdd2-rdd4 here are illustrative; rdd3 and rdd4 are sketched below)

    rdd1 = rdd0.map(lambda x: x ** 2)          # square each element
    rdd2 = rdd1.filter(lambda x: x % 2 == 0)   # illustrative: keep even squares (rdd2 is used further down)
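
    A possible continuation for rdd3 and rdd4 (the notes only name them, so these operations are assumptions):

    rdd3 = rdd2.map(lambda x: (x, 1))            # illustrative: pair each value with a count
    rdd4 = rdd3.reduceByKey(lambda a, b: a + b)  # illustrative: sum the counts per key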
    
  • Execute the RDD (transformations are lazy; an action such as collect() triggers the actual computation and returns the results to the driver)

    rdd1.collect()
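
    With the assumed sample data (np.arange(5)), the result is the list of squares:

    print(rdd1.collect())    # the squares 0, 1, 4, 9, 16 (as NumPy integers)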

  • Cache the RDD (keep it in memory so later actions reuse it instead of recomputing the lineage)

    rdd1.cache()    # lazy: the cache is actually populated by the next action
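
    A minimal sketch of the effect: run two actions, and the second is served from the cache:

    rdd1.count()    # first action computes rdd1 and fills the cache
    rdd1.sum()      # second action reads the cached partitions, no recomputation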
    
  • Retrieve a portion of the data (take(n) is an action, not a transformation: it returns the first n elements, reading only as many partitions as needed)

    rdd2.take(3)
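
    Under the assumptions above (rdd2 holds the even squares of 0-4), this returns all three of them:

    print(rdd2.take(3))    # [0, 4, 16] with the assumed data and illustrative rdd2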