The PySpark Guide

Context

When talking about PySpark, we are talking about Big Data.

WHY PySpark?

PySpark is an interface for Apache Spark in Python. It allows you to write Spark applications using Python and provides the PySpark shell to analyze data in a distributed environment.

PySpark is a library for working with large datasets in a distributed computing environment, while pandas is a library for working with smaller, tabular datasets on a single machine.

If the data is small enough that you can process it with pandas, then you likely don’t need PySpark. Spark shines when the data is so large that it doesn’t fit into memory on a single machine, since it can distribute the computation across a cluster. That said, even when the data fits, a computation complex enough to benefit from heavy parallelization can still see an efficiency boost from PySpark.

PySpark - Useful Code Snippets

How to create a Spark DataFrame

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession
spark = SparkSession.builder.appName("CreateDataFrameExample").getOrCreate()

data = [("Jerez", "Yosua", "Dr.Yosu"),
        ("London", "John", "CEO"),
        ("Roberta", "Storm", "Manager")]

schema = ["Region", "Nick", "Job"]
df = spark.createDataFrame(data=data, schema=schema)

df.printSchema()
df.show(truncate=False)

How to use SQL in Spark - SparkSQL

If you are already comfortable with SQL, a good starting point is to combine that knowledge with the power of Spark. PySpark lets you write plain SQL queries, and they are translated into the same optimized execution plan as the DataFrame API.

Let’s look at a few examples:
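For instance, here is a minimal sketch (the sales table, column names, and values are made up for illustration): register a DataFrame as a temporary view, then query it with plain SQL.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# Toy sales data (made-up values)
sales_df = spark.createDataFrame(
    [("EMEA", 120), ("EMEA", 300), ("APAC", 210)],
    ["region", "amount"],
)

# Register the DataFrame as a temporary view so it can be queried with SQL
sales_df.createOrReplaceTempView("sales")

# Plain SQL, executed by the same Spark engine as the DataFrame API
spark.sql("""
    SELECT region, SUM(amount) AS total_amount
    FROM sales
    GROUP BY region
    ORDER BY total_amount DESC
""").show()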

PySpark with Trino

You can use PySpark to query Trino (formerly PrestoSQL) by configuring Spark to connect to Trino as an external JDBC data source:

from pyspark.sql import SparkSession

# Point Spark at the Trino JDBC driver jar
spark = SparkSession.builder \
    .appName("TrinoQueryExample") \
    .config("spark.jars", "/path/to/trino-jdbc-driver.jar") \
    .getOrCreate()

# Read a table from Trino over JDBC
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:trino://trino-server:port") \
    .option("driver", "io.trino.jdbc.TrinoDriver") \
    .option("dbtable", "your_table_name") \
    .option("user", "your_username") \
    .option("password", "your_password") \
    .load()

# Query the data with SQL through a temporary view ...
df.createOrReplaceTempView("trino_table")
result = spark.sql("SELECT * FROM trino_table WHERE column_name = 'some_value'")
result.show()

# ... or with the DataFrame API
filtered_df = df.filter(df["column_name"] == "some_value")
result = filtered_df.select("column_name1", "column_name2")
result.show()

# spark.stop()

Try me with Google Colaboratory

If you have a Google account, you can try these code snippets, together with a few useful UDFs for working more efficiently with Spark, directly in Google Colab using the code I made available on GitHub:

Example image


FAQ

Where to learn more about Data Engineering?

One handy tool to know about: you can browse S3 with a GUI thanks to https://github.com/mickael-kerjean/filestash

PySpark FAQ

  • Why is PySpark called lazy?

PySpark is considered “lazy” because it does not execute any code until it absolutely has to.

This means that when you call a transformation on a PySpark DataFrame or RDD, it does not actually compute the result until you call an action. This allows Spark to optimize the execution plan by looking at all of the transformations that you have specified, and deciding the most efficient way to execute them.

It also allows Spark to delay execution until the result is actually needed, rather than executing every single transformation as soon as it is specified.
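A small sketch to make this concrete (the numbers and column name are arbitrary): each line below is a lazy transformation, and nothing is executed until the final action.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("LazyEvaluationExample").getOrCreate()

df = spark.range(1_000_000)                         # transformation: nothing computed yet
doubled = df.withColumn("double", F.col("id") * 2)  # still lazy
filtered = doubled.filter(F.col("double") > 100)    # still lazy

# Only the action below triggers execution; Spark first builds an optimized plan
print(filtered.count())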

  • Which should I use, Spark or Pandas? What’s the difference?

The choice between using Spark or Pandas depends on the type and size of data you are dealing with.

For small datasets, Pandas is usually the better option as it provides a more intuitive and user-friendly interface. However, for larger datasets, Spark provides much better performance and scalability.

Spark also offers a range of features, such as distributed processing, in-memory computing, streaming, and machine learning algorithms, that are not available with Pandas.

The main difference between the two is that Pandas is designed to work with tabular data, while Spark is designed to work with both structured and unstructured data.
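As a rough sketch (the column names and values are made up), here is the same toy data handled by both libraries, plus how to move between them:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PandasVsSparkExample").getOrCreate()

# Small data: pandas works comfortably on a single machine
pdf = pd.DataFrame({"region": ["EMEA", "APAC", "EMEA"], "amount": [420, 300, 150]})
print(pdf.groupby("region")["amount"].sum())

# The same data as a distributed Spark DataFrame
sdf = spark.createDataFrame(pdf)
sdf.groupBy("region").sum("amount").show()

# Bring a (small!) Spark result back into pandas
result_pdf = sdf.toPandas()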

  • What is data redistribution?

Data redistribution is the process of transferring data from one system or location to another. This can be done between different databases, platforms, or locations.

The purpose of data redistribution is to improve performance, enhance scalability, or reduce cost. It is often used to move data between a production system and a test system, or between different servers or clusters, either to spread the load evenly or to make sure the data is available in multiple locations for failover and disaster recovery.

  • What is a partition?

Partitions in Apache Spark are logical divisions of data stored on a node in the cluster.

They provide a way to split up large datasets into smaller, more manageable chunks that can be processed in parallel.

By default, Spark uses a Hash Partitioner, which uses a hash function to determine which partition a particular row of data should be assigned to.

Spark also supports Range Partitioning, which allows for data to be divided into partitions based on a range of values.
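A quick sketch of inspecting and changing partitioning (the partition count of 8 is an arbitrary choice for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PartitioningExample").getOrCreate()

df = spark.range(1_000_000)

# How many partitions does the DataFrame currently have?
print(df.rdd.getNumPartitions())

# Hash partitioning: rows whose "id" hashes to the same bucket land in the same partition
hash_partitioned = df.repartition(8, "id")
print(hash_partitioned.rdd.getNumPartitions())

# Range partitioning: rows are assigned to partitions by sorted ranges of "id"
range_partitioned = df.repartitionByRange(8, "id")
print(range_partitioned.rdd.getNumPartitions())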

  • What does doing a groupBy before partitioning achieve?

Doing a group by before partitioning means that the data is grouped together before being divided into partitions.

This can be useful when performing certain calculations, as it allows for more efficient processing of the data. For example, if you have a table of data with multiple columns and you want to sum up values in one of the columns, you can group by that column before partitioning the data so that the sum for each group is only calculated once.
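A hedged sketch along those lines (the table, column names, and output path are hypothetical): aggregate first so each group’s sum is computed only once, then redistribute the much smaller result by the grouping key.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("GroupByThenPartitionExample").getOrCreate()

# Hypothetical sales table
sales = spark.createDataFrame(
    [("EMEA", 120), ("EMEA", 300), ("APAC", 210), ("APAC", 90)],
    ["region", "amount"],
)

# Aggregate first, so each group's sum is computed once ...
totals = sales.groupBy("region").agg(F.sum("amount").alias("total_amount"))

# ... then redistribute the (much smaller) result by the grouping key
totals.repartition("region").write.mode("overwrite").parquet("/tmp/sales_totals")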

  • What is Data Skew?

Data skew is an uneven distribution of data in a dataset, which can be caused by a variety of factors including transformations such as joins, groupBy, and orderBy.

Data skew can have a variety of implications, such as increased lock contention, reduced database concurrency, and decreased performance. Data skew can be measured using the skewness coefficient, which is a measure of the asymmetry of the data in a dataset.
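One common way to spot skew in practice (the input path and key column below are placeholders) is to count rows per key and look for a few dominant values:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SkewCheckExample").getOrCreate()

# Placeholder path and key column: adapt to your own data
events = spark.read.parquet("/path/to/events")

# Count rows per key; a handful of keys with huge counts is a sign of skew
(events.groupBy("customer_id")
       .count()
       .orderBy(F.desc("count"))
       .show(10))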