• With RDDs, I tell Spark how to do it
  • With DataFrames, I tell Spark what to do
  • Each DF is backed by RDDs under the hood
  • Transformations are lazy
  • Actions are eager (see the sketch below)
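
  • A minimal sketch of lazy vs. eager evaluation (assumes a local PySpark install; the data is illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("Brooke", 20), ("Denny", 31)], ["name", "age"])

    # transformation: lazy, only builds a logical plan
    adults = df.filter(df.age > 21)

    # action: eager, triggers execution of the plan
    adults.show()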

  • Correct the display style (keeps wide show() output from wrapping in Jupyter)

    from IPython.display import display, HTML
    display(HTML("<style>pre { white-space: pre !important; }</style>"))
    

  • Create a DF from a list of tuples

    df = spark.createDataFrame([('Brooke', 20), ('Denny', 31), ('Jules', 30), ('TD', 35),
                                ('Brooke', 25), ('Jules', 40), ('Denny', 51), ('TD', 15)],
                               ["name", "age"])
    
  • Create a DF from a JSON file

    df = spark.read.json("path/to/file.json")  # path is a placeholder
    
  • csv

    df = spark.read.csv(f"K:\\\\spark_datasets\\\\NullData.csv", header=True, inferSchema=True)
    
  • Create a schema that the DF should follow when reading a source file

    • Using the StructType API

      from pyspark.sql.types import *
      
      myschema = StructType([StructField("State", StringType(), False),
                             StructField("Color", StringType(), False),
                             StructField("Count", IntegerType(), False)])  # Count is numeric

      # header=True so the header row is not read as data
      df = spark.read.csv("K:\\mnm_dataset.csv", schema=myschema, header=True)
      
    • Using DDL

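      A sketch of the DDL string form (the original screenshot is missing; the columns are assumed to match the StructType above):

      # DDL schema: "<col> <TYPE>, ..." as a plain string
      myschema_ddl = "State STRING, Color STRING, Count INT"

      df = spark.read.csv("K:\\mnm_dataset.csv", schema=myschema_ddl, header=True)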

  • Display the schema of the created DF

    df.printSchema()
    root
     |-- name: string (nullable = true)
     |-- age: long (nullable = true)
    
    
  • Display the number of partitions of the underlying RDD

    df2.rdd.getNumPartitions()  # -> 1

    df.rdd.getNumPartitions()   # -> 4
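
  • The partition count can be changed, a minimal sketch (the target counts are arbitrary):

    # repartition() shuffles the data into the requested number of partitions
    df8 = df.repartition(8)
    df8.rdd.getNumPartitions()  # -> 8

    # coalesce() merges existing partitions without a shuffle, so it can only reduce the count
    df1 = df.coalesce(1)
    df1.rdd.getNumPartitions()  # -> 1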