DataFusion also offers a SQL API, read the full reference here
.. ipython:: python
import datafusion
from datafusion import DataFrame, SessionContext
# create a context
ctx = datafusion.SessionContext()
# register a CSV
ctx.register_csv("pokemon", "pokemon.csv")
# create a new statement via SQL
df = ctx.sql('SELECT "Attack"+"Defense", "Attack"-"Defense" FROM pokemon')
# collect and convert to pandas DataFrame
df.to_pandas()
In DataFusion-Python 51.0.0 we introduced the ability to pass parameters in a SQL query. These are similar in concept to prepared statements, but allow passing named parameters into a SQL query. Consider this simple example.
.. ipython:: python
def show_attacks(ctx: SessionContext, threshold: int) -> None:
ctx.sql(
'SELECT "Name", "Attack" FROM pokemon WHERE "Attack" > $val', val=threshold
).show(num=5)
show_attacks(ctx, 75)
When passing parameters like the example above we convert the Python objects into their string representation. We also have special case handling for :py:class:`~datafusion.dataframe.DataFrame` objects, since they cannot simply be turned into string representations for an SQL query. In these cases we will register a temporary view in the :py:class:`~datafusion.context.SessionContext` using a generated table name.
The formatting for passing string replacement objects is to precede the
variable name with a single $. This works for all dialects in
the SQL parser except hive and mysql. Since these dialects do not
support named placeholders, we are unable to do this type of replacement.
We recommend either switching to another dialect or using Python
f-string style replacement.
Warning
To support DataFrame parameterized queries, your session must support registration of temporary views. The default :py:class:`~datafusion.catalog.CatalogProvider` and :py:class:`~datafusion.catalog.SchemaProvider` do have this capability. If you have implemented custom providers, it is important that temporary views do not persist across :py:class:`~datafusion.context.SessionContext` or you may get unintended consequences.
The following example shows passing in both a :py:class:`~datafusion.dataframe.DataFrame` object as well as a Python object to be used in parameterized replacement.
.. ipython:: python
def show_column(
ctx: SessionContext, column: str, df: DataFrame, threshold: int
) -> None:
ctx.sql(
'SELECT "Name", $col FROM $df WHERE $col > $val',
col=column,
df=df,
val=threshold,
).show(num=5)
df = ctx.table("pokemon")
show_column(ctx, '"Defense"', df, 75)
The approach implemented for conversion of variables into a SQL query
relies on string conversion. This has the potential for data loss,
specifically for cases like floating point numbers. If you need to pass
variables into a parameterized query and it is important to maintain the
original value without conversion to a string, then you can use the
optional parameter param_values to specify these. This parameter
expects a dictionary mapping from the parameter name to a Python
object. Those objects will be cast into a
PyArrow Scalar Value.
Using param_values will rely on the SQL dialect you have configured
for your session. This can be set using the :ref:`configuration options <configuration>`
of your :py:class:`~datafusion.context.SessionContext`. Similar to how
prepared statements
work, these parameters are limited to places where you would pass in a
scalar value, such as a comparison.
.. ipython:: python
def param_attacks(ctx: SessionContext, threshold: int) -> None:
ctx.sql(
'SELECT "Name", "Attack" FROM pokemon WHERE "Attack" > $val',
param_values={"val": threshold},
).show(num=5)
param_attacks(ctx, 75)