The PySpark processor transforms data based on custom PySpark code. You develop the custom code using the Python API for Spark, or PySpark. The PySpark processor supports Python 3.

  1. PySpark Functions
  2. PySpark Split

PySpark is the Python API for Apache Spark, released by the Spark community to support Python development on Spark. It lets you work with Resilient Distributed Datasets (RDDs) from the Python programming language, which it achieves by using the Py4j library. Its many features make PySpark a strong framework for working with very large datasets.

The processor can receive multiple input streams, but can produce only a single output stream. When the processor receives multiple input streams, it receives one Spark DataFrame from each input stream. The custom PySpark code must produce a single DataFrame.
Tip: In streaming pipelines, you can use a Window processor upstream from this processor to generate larger batch sizes for evaluation.

You can use the PySpark processor in pipelines that provision a Databricks cluster, in standalone pipelines, and in pipelines that run on any existing cluster except for Dataproc. Do not use the processor in Dataproc pipelines or in pipelines that provision non-Databricks clusters.



PySpark Functions

Complete all prerequisite tasks before using the PySpark processor in pipelines that run on existing clusters or in standalone pipelines.

PySpark Split

When using the processor in pipelines that provision a Databricks cluster, complete all required prerequisite tasks first.