I am implementing data quality checks using the Great Expectations library. Is this library compatible with PySpark, and does it run on multiple cores?
1 Answer
Yes, it is compatible with PySpark. When you use the SparkDFExecutionEngine, the validation computations run as ordinary Spark jobs, so they are distributed across your executors like any other Spark workload. Here is an example.
Datasource creation (in great_expectations.yml):
datasources:
  spark_ds:
    class_name: Datasource
    execution_engine:
      module_name: great_expectations.execution_engine
      class_name: SparkDFExecutionEngine
      force_reuse_spark_context: true
    module_name: great_expectations.datasource
    data_connectors:
      spark_ds_connector:
        class_name: RuntimeDataConnector
        module_name: great_expectations.datasource.data_connector
        batch_identifiers:
          - batch_id
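If you prefer to keep configuration in code rather than in the YAML file, the same datasource can be registered programmatically. A minimal sketch, assuming the v3 (Batch Request) API and an existing data context on disk:

import great_expectations as ge

ge_context = ge.get_context()

# The same configuration as the YAML above, expressed as a dict.
datasource_config = {
    "name": "spark_ds",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "module_name": "great_expectations.execution_engine",
        "class_name": "SparkDFExecutionEngine",
        "force_reuse_spark_context": True,
    },
    "data_connectors": {
        "spark_ds_connector": {
            "class_name": "RuntimeDataConnector",
            "module_name": "great_expectations.datasource.data_connector",
            "batch_identifiers": ["batch_id"],
        }
    },
}
ge_context.add_datasource(**datasource_config)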
Create a runtime batch request:

from great_expectations.core.batch import RuntimeBatchRequest

df = ...  # create your Spark DataFrame here

request = RuntimeBatchRequest(
    datasource_name="spark_ds",
    data_connector_name="spark_ds_connector",
    data_asset_name="any_asset_name",
    runtime_parameters={"batch_data": df},
    batch_identifiers={"batch_id": "batch_id"},
)
Run the checkpoint:

ge_context.run_checkpoint(
    checkpoint_name="checkpoint",
    validations=[{"batch_request": request, "expectation_suite_name": "suite_name"}],
)
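Note that run_checkpoint assumes the expectation suite and the checkpoint already exist. A minimal sketch of those missing pieces, assuming the v3 API; "suite_name", "checkpoint", and the "ID" column are just the placeholder names used in this thread:

import great_expectations as ge

ge_context = ge.get_context()

# Create an empty expectation suite (overwriting any previous one),
# then attach an example expectation to it through a validator.
ge_context.create_expectation_suite("suite_name", overwrite_existing=True)
validator = ge_context.get_validator(
    batch_request=request, expectation_suite_name="suite_name"
)
validator.expect_column_values_to_be_unique("ID")  # example expectation
validator.save_expectation_suite(discard_failed_expectations=False)

# Register the checkpoint that run_checkpoint refers to.
ge_context.add_checkpoint(name="checkpoint", class_name="SimpleCheckpoint")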
2 Comments
code_bug
I have simply created it in this format: gedf = ge.dataset.SparkDFDataset(df1); DQI = gedf.expect_column_values_to_be_unique("ID", result_format="COMPLETE"), to avoid unnecessary overhead. But it is failing with "Python kernel unresponsive", and I am not sure whether this error is because of OOM. – halfwind22
It should be OOM, because per the documentation (and also the code) the results are collected into driver memory, so only one node does that part of the work.
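If the driver is the bottleneck, two mitigations are worth trying: request a smaller result payload (result_format="SUMMARY" returns only a sample of unexpected values, whereas "COMPLETE" ships all of them to the driver), or pre-check uniqueness with a plain Spark aggregation so only a single count ever reaches the driver. A sketch of the latter (ordinary PySpark, not a Great Expectations API; df1 here is a hypothetical stand-in for your real DataFrame):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for df1; replace with your real DataFrame.
df1 = spark.createDataFrame([(1,), (2,), (2,)], ["ID"])

# Count duplicated IDs with a distributed aggregation; only a single
# integer is ever brought back to the driver.
dup_count = (
    df1.groupBy("ID")
       .count()
       .filter(F.col("count") > 1)
       .count()
)
print("ID is unique" if dup_count == 0 else f"{dup_count} duplicated ID values")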