
I am implementing data quality checks using the Great Expectations library. Is this library compatible with PySpark, and does it run on multiple cores?

1 Answer


Yes, it is compatible with PySpark via the SparkDFExecutionEngine. Here is an example.

Datasource creation (in great_expectations.yml):

datasources:
  spark_ds:
    class_name: Datasource
    execution_engine:
      module_name: great_expectations.execution_engine
      class_name: SparkDFExecutionEngine
      force_reuse_spark_context: true
    module_name: great_expectations.datasource
    data_connectors:
      spark_ds_connector:
        class_name: RuntimeDataConnector
        module_name: great_expectations.datasource.data_connector
        batch_identifiers:
          - batch_id
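
If you would rather register the datasource from code instead of editing great_expectations.yml, the same YAML can be validated and added through the Data Context. This is a minimal sketch, assuming the V3 (Batch Request) API; the context-loading call may differ slightly by version:

import yaml
import great_expectations as ge

# Load the existing Data Context (created with `great_expectations init`).
context = ge.get_context()

datasource_yaml = """
name: spark_ds
class_name: Datasource
module_name: great_expectations.datasource
execution_engine:
  module_name: great_expectations.execution_engine
  class_name: SparkDFExecutionEngine
  force_reuse_spark_context: true
data_connectors:
  spark_ds_connector:
    class_name: RuntimeDataConnector
    module_name: great_expectations.datasource.data_connector
    batch_identifiers:
      - batch_id
"""

# Self-check the config, then register it with the context.
context.test_yaml_config(datasource_yaml)
context.add_datasource(**yaml.safe_load(datasource_yaml))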

Create a runtime batch request:

from great_expectations.core.batch import RuntimeBatchRequest

df = ...  # create your Spark DataFrame here

request = RuntimeBatchRequest(
    datasource_name="spark_ds",
    data_connector_name="spark_ds_connector",
    data_asset_name="any_asset_name",
    runtime_parameters={"batch_data": df},  # pass the in-memory DataFrame as the batch
    batch_identifiers={"batch_id": "batch_id"},
)

ge_context.run_checkpoint(
    checkpoint_name="checkpoint",
    validations=[{"batch_request": request, "expectation_suite_name": "suite_name"}],
)
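
Note that run_checkpoint assumes the expectation suite and the checkpoint already exist. Here is a minimal sketch of creating both first, assuming the V3 API; the column name "ID" is only an illustration:

# Create an empty suite, then add expectations interactively via a validator.
ge_context.create_expectation_suite("suite_name", overwrite_existing=True)

validator = ge_context.get_validator(
    batch_request=request,
    expectation_suite_name="suite_name",
)
validator.expect_column_values_to_be_unique("ID")  # illustrative column name
validator.save_expectation_suite(discard_failed_expectations=False)

# Register a simple checkpoint that the run_checkpoint call above can reference.
ge_context.add_checkpoint(
    name="checkpoint",
    config_version=1,
    class_name="SimpleCheckpoint",
)

As for the multi-core question: with SparkDFExecutionEngine the expectations are translated into Spark DataFrame operations, so the computation itself is distributed across the cluster; see the comments below about result collection on the driver.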

2 Comments

I have simply created it in this format, gedf = ge.dataset.SparkDFDataset(df1); DQI = gedf.expect_column_values_to_be_unique("ID", result_format="COMPLETE"), to avoid unnecessary overhead. But it is failing with "Python kernel unresponsive". I am not sure whether this error is because of OOM.
It is likely OOM, because per the documentation (and the code) the results are collected into driver memory, so only one node does that part of the job.
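
If the failure is indeed driver OOM, one mitigation with this legacy ge.dataset API is to avoid result_format="COMPLETE", which ships every unexpected value back to the driver. A sketch, assuming df1 is the Spark DataFrame from the comment above:

import great_expectations as ge

gedf = ge.dataset.SparkDFDataset(df1)
# "SUMMARY" (or "BASIC") returns counts plus only a small sample of
# unexpected values, instead of collecting every unexpected row.
DQI = gedf.expect_column_values_to_be_unique("ID", result_format="SUMMARY")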
