Correct the display style (stop wide output from wrapping in the notebook)
from IPython.display import display, HTML
display(HTML("<style>pre { white-space: pre !important; }</style>"))
Create a DF from a list of tuples
df = spark.createDataFrame(
    [('Brooke', 20), ('Denny', 31), ('Jules', 30), ('TD', 35),
     ('Brooke', 25), ('Jules', 40), ('Denny', 51), ('TD', 15)],
    ["name", "age"])
Create a DF from a JSON file (pass the file path to json())
df = spark.read.json()
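By default spark.read.json expects JSON Lines input, i.e. one JSON object per line. A minimal sketch of preparing such a file and loading it follows; the scratch path, the sample records, and the helper name read_people are assumptions made for illustration, and the helper needs an existing SparkSession passed in.

```python
import json
import os
import tempfile

# Sample records reused from the createDataFrame example (illustrative only).
records = [{"name": "Brooke", "age": 20}, {"name": "Denny", "age": 31}]

# Hypothetical scratch location; any path Spark can read would do.
json_path = os.path.join(tempfile.mkdtemp(), "people.json")
with open(json_path, "w", encoding="utf-8") as fh:
    for rec in records:
        fh.write(json.dumps(rec) + "\n")  # one JSON object per line


def read_people(spark, path=json_path):
    """Load the JSON Lines file using an existing SparkSession."""
    return spark.read.json(path)
```

For a file holding a single pretty-printed JSON document or array, the reader's multiLine=True option is the usual switch.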
Create a DF from a CSV file
df = spark.read.csv("K:\\spark_datasets\\NullData.csv", header=True, inferSchema=True)
Define a schema that the DF must follow when it reads a source file
Using Spark types (StructType)
from pyspark.sql.types import StructType, StructField, StringType
# StructField(name, dataType, nullable)
myschema = StructType([StructField("State", StringType(), False),
                       StructField("Color", StringType(), False),
                       StructField("Count", StringType(), False)])
df = spark.read.csv("K:\\mnm_dataset.csv", schema=myschema)
Using DDL
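The schema argument of spark.read.csv also accepts a DDL-formatted string in place of a StructType. A minimal sketch, assuming the same three string columns as the StructType version; the names MNM_DDL and read_mnm_with_ddl are hypothetical, and nullability is not expressed in this string.

```python
# Column names and types mirror the StructType schema (all strings).
MNM_DDL = "State STRING, Color STRING, Count STRING"


def read_mnm_with_ddl(spark, path):
    """Read a CSV with the DDL schema, given an existing SparkSession."""
    return spark.read.csv(path, schema=MNM_DDL)
```

With a live session this would be called as read_mnm_with_ddl(spark, "K:\\mnm_dataset.csv").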
Display the schema of the created DF
df.printSchema()
root
|-- name: string (nullable = true)
|-- age: long (nullable = true)
Display the number of partitions each DF uses
df2.rdd.getNumPartitions()  # -> 1
df.rdd.getNumPartitions()   # -> 4