
Python to Spark Type Conversions

.. currentmodule:: pyspark.sql.types

When working with PySpark, you will often need to consider the conversions between Python-native objects and their Spark equivalents. For instance, when working with user-defined functions, the function's return value will be cast by Spark to an appropriate Spark SQL type. Or, when creating a DataFrame, you may supply NumPy or pandas objects as the input data. This guide covers the various conversions between Python and Spark SQL types.

Browsing Type Conversions

Though this document provides a comprehensive list of type conversions, you may find it easier to check Spark's conversion behavior interactively. To do so, test small examples of user-defined functions, or use the spark.createDataFrame interface.

All Spark SQL data types live in the pyspark.sql.types package. You can import them with:

from pyspark.sql.types import *
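
For example, a one-row DataFrame is often enough to see how Spark maps a Python value. The snippet below assumes an active SparkSession named spark:

# Inspect how a single Python value is converted by Spark
spark.createDataFrame([(1.5,)], ["value"]).printSchema()
# root
# |-- value: double (nullable = true)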

Configuration

Several configurations affect the behavior of type conversions. They are listed below with their default values:

spark.sql.execution.pythonUDF.arrow.enabled (default: True)
    Enables Arrow optimization for Python UDFs. See the Apache Arrow in PySpark guide for details.

spark.sql.pyspark.inferNestedDictAsStruct.enabled (default: False)
    When enabled, nested dictionaries are inferred as StructType. Otherwise, they are inferred as MapType.

spark.sql.timestampType (default: "")
    If set to TIMESTAMP_NTZ, the default timestamp type is TimestampNTZType. Otherwise, the default timestamp type is TimestampType.

spark.sql.execution.pandas.inferPandasDictAsMap (default: False)
    When enabled, pandas dictionaries are inferred as MapType. Otherwise, they are inferred as StructType.

spark.sql.execution.pyspark.binaryAsBytes (default: True)
    Introduced in Spark 4.1.0. When enabled, BinaryType is mapped consistently to Python bytes; when disabled, Spark matches the PySpark default behavior before 4.1.0.
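
These settings can be read and changed at runtime through spark.conf. As a minimal sketch (assuming an active SparkSession named spark), switching the default timestamp type changes how naive datetime values are inferred:

from datetime import datetime

spark.conf.set("spark.sql.timestampType", "TIMESTAMP_NTZ")
spark.createDataFrame([(datetime(2024, 1, 1),)], ["ts"]).printSchema()
# root
# |-- ts: timestamp_ntz (nullable = true)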

All Conversions

Each entry below lists the Spark SQL data type, its value type in Python, and the API used to access or create the data type.

ByteType: int; ByteType()
    Note: numbers are converted to 1-byte signed integers at runtime. Make sure values are within the range of -128 to 127.

ShortType: int; ShortType()
    Note: numbers are converted to 2-byte signed integers at runtime. Make sure values are within the range of -32768 to 32767.

IntegerType: int; IntegerType()

LongType: int; LongType()
    Note: numbers are converted to 8-byte signed integers at runtime. Make sure values are within the range of -9223372036854775808 to 9223372036854775807. Otherwise, convert the data to decimal.Decimal and use DecimalType.

FloatType: float; FloatType()
    Note: numbers are converted to 4-byte single-precision floating point numbers at runtime.

DoubleType: float; DoubleType()

DecimalType: decimal.Decimal; DecimalType()

StringType: str; StringType()

BinaryType: bytes; BinaryType()

BooleanType: bool; BooleanType()

TimestampType: datetime.datetime; TimestampType()

TimestampNTZType: datetime.datetime; TimestampNTZType()

DateType: datetime.date; DateType()

DayTimeIntervalType: datetime.timedelta; DayTimeIntervalType()

ArrayType: list, tuple, or array; ArrayType(elementType, [containsNull])
    Note: the default value of containsNull is True.

MapType: dict; MapType(keyType, valueType, [valueContainsNull])
    Note: the default value of valueContainsNull is True.

StructType: list or tuple; StructType(fields)
    Note: fields is a list of StructField objects. Two fields with the same name are not allowed.

StructField: the value type in Python of the data type of this field (for example, int for a StructField with the data type IntegerType); StructField(name, dataType, [nullable])
    Note: the default value of nullable is True.
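
Putting the creation APIs together, a nested schema can be built directly from these types. A minimal sketch:

from pyspark.sql.types import (
    StructType,
    StructField,
    StringType,
    ArrayType,
    MapType,
    DoubleType,
)

schema = StructType([
    StructField("name", StringType(), nullable=False),
    # containsNull defaults to True
    StructField("scores", ArrayType(DoubleType())),
    # valueContainsNull defaults to True
    StructField("attributes", MapType(StringType(), StringType())),
])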

Conversions in Practice - UDFs

A common conversion case is returning a Python value from a UDF. In this case, the value returned by the function must match the declared return type.

Note

If the actual return value of your function does not match the declared return type, Spark does not raise an error; it implicitly converts the mismatched value to null.

from pyspark.sql.types import (
    StructType,
    StructField,
    IntegerType,
    StringType,
    FloatType,
)
from pyspark.sql.functions import udf, col

df = spark.createDataFrame(
    [[1]], schema=StructType([StructField("int", IntegerType())])
)

@udf(returnType=StringType())
def to_string(value):
    return str(value)

@udf(returnType=FloatType())
def to_float(value):
    return float(value)

df.withColumn("cast_int", to_float(col("int"))).withColumn(
    "cast_str", to_string(col("int"))
).printSchema()
# root
# |-- int: integer (nullable = true)
# |-- cast_int: float (nullable = true)
# |-- cast_str: string (nullable = true)
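
UDFs can also return composite types; the declared return type then describes the shape of the returned Python object. A minimal sketch building on the df above (the to_pair name is illustrative):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

pair_type = StructType([
    StructField("num", IntegerType()),
    StructField("text", StringType()),
])

@udf(returnType=pair_type)
def to_pair(value):
    # a dict keyed by field name converts to the declared StructType
    return {"num": value, "text": str(value)}

df.withColumn("pair", to_pair(col("int"))).printSchema()
# root
# |-- int: integer (nullable = true)
# |-- pair: struct (nullable = true)
# |    |-- num: integer (nullable = true)
# |    |-- text: string (nullable = true)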

Conversions in Practice - Creating DataFrames

Another common conversion case arises when creating a DataFrame from Python values. Here, you can supply an explicit schema, or let Spark infer the schema from the provided data.

data = [
    ["Wei", "Math", 93.0, 1],
    ["Jerry", "Physics", 85.0, 4],
    ["Katrina", "Geology", 90.0, 2],
]
cols = ["Name", "Subject", "Score", "Period"]

spark.createDataFrame(data, cols).printSchema()
# root
# |-- Name: string (nullable = true)
# |-- Subject: string (nullable = true)
# |-- Score: double (nullable = true)
# |-- Period: long (nullable = true)
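
To pin the types instead of relying on inference, the same data can be paired with an explicit schema. A minimal sketch reusing the data defined above; note that Period becomes integer rather than the inferred long:

from pyspark.sql.types import (
    StructType,
    StructField,
    StringType,
    DoubleType,
    IntegerType,
)

schema = StructType([
    StructField("Name", StringType()),
    StructField("Subject", StringType()),
    StructField("Score", DoubleType()),
    StructField("Period", IntegerType()),
])

spark.createDataFrame(data, schema).printSchema()
# root
# |-- Name: string (nullable = true)
# |-- Subject: string (nullable = true)
# |-- Score: double (nullable = true)
# |-- Period: integer (nullable = true)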

import pandas as pd

df = pd.DataFrame(data, columns=cols)
spark.createDataFrame(df).printSchema()
# root
# |-- Name: string (nullable = true)
# |-- Subject: string (nullable = true)
# |-- Score: double (nullable = true)
# |-- Period: long (nullable = true)

import numpy as np

spark.createDataFrame(np.zeros([3, 2], "int8")).printSchema()
# root
# |-- _1: byte (nullable = true)
# |-- _2: byte (nullable = true)

Conversions in Practice - Nested Data Types

Nested Python data converts to ArrayType, MapType, or StructType, depending on the data passed and the configurations described above.

data = [
    ["Wei", [[1, 2]], {"RecordType": "Scores", "Math": { "H1": 93.0, "H2": 85.0}}],
]
cols = ["Name", "ActiveHalfs", "Record"]

spark.createDataFrame(data, cols).printSchema()
# root
#  |-- Name: string (nullable = true)
#  |-- ActiveHalfs: array (nullable = true)
#  |    |-- element: array (containsNull = true)
#  |    |    |-- element: long (containsNull = true)
#  |-- Record: map (nullable = true)
#  |    |-- key: string
#  |    |-- value: string (valueContainsNull = true)

spark.conf.set("spark.sql.pyspark.inferNestedDictAsStruct.enabled", True)

spark.createDataFrame(data, cols).printSchema()
# root
#  |-- Name: string (nullable = true)
#  |-- ActiveHalfs: array (nullable = true)
#  |    |-- element: array (containsNull = true)
#  |    |    |-- element: long (containsNull = true)
#  |-- Record: struct (nullable = true)
#  |    |-- RecordType: string (nullable = true)
#  |    |-- Math: struct (nullable = true)
#  |    |    |-- H1: double (nullable = true)
#  |    |    |-- H2: double (nullable = true)
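
The pandas path has an analogous switch: with spark.sql.execution.pandas.inferPandasDictAsMap enabled, dictionaries in a pandas DataFrame are inferred as MapType instead of StructType. A minimal sketch (the column names are illustrative):

import pandas as pd

pdf = pd.DataFrame({"Name": ["Wei"], "Scores": [{"H1": 93.0, "H2": 85.0}]})

spark.conf.set("spark.sql.execution.pandas.inferPandasDictAsMap", True)
spark.createDataFrame(pdf).printSchema()
# expected schema along these lines:
# root
#  |-- Name: string (nullable = true)
#  |-- Scores: map (nullable = true)
#  |    |-- key: string
#  |    |-- value: double (valueContainsNull = true)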