• With RDDs, I tell Spark how to do it
  • With DataFrames, I tell Spark what to do
  • Each DF is backed by RDDs under the hood
  • Transformations are lazy
  • Actions are eager (see the sketch below)
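
  • A minimal sketch of lazy vs. eager evaluation (assumes a local PySpark install; the data is illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("Brooke", 20), ("Denny", 31)], ["name", "age"])

    # transformation: lazy, only builds a logical plan
    adults = df.filter(df.age > 21)

    # action: eager, triggers execution of the plan
    adults.show()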

  • Correct the display style (keeps wide show() output from wrapping in Jupyter)

    from IPython.display import display, HTML
    display(HTML("<style>pre { white-space: pre !important; }</style>"))
    

  • Create a DF from a list of tuples

    df = spark.createDataFrame([('Brooke', 20), ('Denny', 31), ('Jules', 30), ('TD', 35),
                                ('Brooke', 25), ('Jules', 40), ('Denny', 51), ('TD', 15)],
                               ["name", "age"])
    
  • Create a DF from a JSON file

    df = spark.read.json("path/to/file.json")  # path is a placeholder
    
  • csv

    df = spark.read.csv(f"K:\\\\spark_datasets\\\\NullData.csv", header=True, inferSchema=True)
    
  • Create a schema that the DF should follow when reading a source file

    • Using the StructType API

      from pyspark.sql.types import *
      
      myschema = StructType([StructField("State", StringType(), False),
                             StructField("Color", StringType(), False),
                             StructField("Count", IntegerType(), False)])  # Count is numeric

      # header=True so the header row is not read as data
      df = spark.read.csv("K:\\mnm_dataset.csv", schema=myschema, header=True)
      
    • Using DDL

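      A sketch of the DDL string form (the original screenshot is missing; the columns are assumed to match the StructType above):

      # DDL schema: "<col> <TYPE>, ..." as a plain string
      myschema_ddl = "State STRING, Color STRING, Count INT"

      df = spark.read.csv("K:\\mnm_dataset.csv", schema=myschema_ddl, header=True)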

  • Display the schema of the created DF

    df.printSchema()
    root
     |-- name: string (nullable = true)
     |-- age: long (nullable = true)
    
    
  • Display the number of partitions of the underlying RDD

    df2.rdd.getNumPartitions()  # -> 1

    df.rdd.getNumPartitions()   # -> 4
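
  • The partition count can be changed, a minimal sketch (the target counts are arbitrary):

    # repartition() shuffles the data into the requested number of partitions
    df8 = df.repartition(8)
    df8.rdd.getNumPartitions()  # -> 8

    # coalesce() merges existing partitions without a shuffle, so it can only reduce the count
    df1 = df.coalesce(1)
    df1.rdd.getNumPartitions()  # -> 1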