Database and ETL Testing Process: Methods, Issues, Recommendations
Jan. 19, 2010
W. Yaddow [email_address]
For internal use only – Not for external distribution
Agenda
• QA objectives for ETLs & data loading projects
• Samples of QA data defect discoveries
• Data quality tools / techniques used by the QA team
• ETL & data loading verification checks
• Lessons learned in data verification
• Recommendations for continued early involvement by QA
Data defects… a definition
Data defects: deviations from the correctness of data, generally errors occurring prior to the processing of data for analytics or reporting. Errors can result from the data model, low-level design, data mapping, or data loading prior to processing in an application.
Note: data issues on displays or reports are not considered data defects when they result from service calls or computation errors within an application.
QA objectives for ETL & data integration
• Ensure that all records in source systems that should be migrated to the database are extracted -- no more, no less (see the reconciliation sketch below)
• Verify that all components of the ETL / load process complete with no defects
• Verify that all source data is correctly transformed into dimension, fact, and other tables
• Analyze ETL / load exception logs
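A minimal sketch of the extraction completeness check, in Oracle-style SQL; the table names (SRC_CUSTOMER, STG_CUSTOMER) and key column (CUSTOMER_ID) are hypothetical placeholders, not names from the project.

  -- Completeness: row counts should reconcile between source and staging
  SELECT (SELECT COUNT(*) FROM src_customer) AS source_rows,
         (SELECT COUNT(*) FROM stg_customer) AS staging_rows
    FROM dual;

  -- Source records missing from staging (the query should return no rows)
  SELECT s.customer_id
    FROM src_customer s
    LEFT JOIN stg_customer t ON t.customer_id = s.customer_id
   WHERE t.customer_id IS NULL;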
QA role in data verification projects QA develops verification methods to support data integration specific to projects.  QA executes tasks that demonstrate data verification is a critical link between the DS, DSO, application development and analytics teams.  QA continues to demonstrate that early data testing is the most efficient means of identifying and correcting defects.
Sample of data defect discoveries (data as of 10/23/2009)
Application | % Data Defects | % of Data Defects at High or Critical Severity
App1 | 39% | 48%
App2 | 26% | 70%
App3 | 26% | 33%
App4 | 6%  | 59%
App5 | 29% | 68%
Data integration & ETL error injection points: 1) DB design & planning, 2) ETL / data load

Data track phase | Artifacts | QA tasks
Data and analysis requirements | Data design & requirements | Reviews, comments
Data and analysis requirements | Source data planning & profiling | Reviews, comments
Data flow and load design | Data model | Reviews, comments
Data flow and load design | Logical & physical data flow diagrams | Reviews, comments
Data flow and load design | Data movement low-level design (LLD) | Reviews, comments
Data flow and load design | Data mappings & transformations, source to target | Reviews, test planning, test case development
Data flow and load design | ETL design & logic | Reviews, comments
Data flow and load design | SQL and PL/SQL for data loads | Reviews, comments
Data flow and load design | Data cleansing plan | Reviews, comments
Data flow and load design | Data load and ETL developer test plan | Reviews, comments
Data load / ETL execution | Extract, transform, load | Reviews, verification, defect reports
Data load / ETL load inspection | Workflow logs, session logs, error log tables, reject tables | Reviews, verification, defect reports
Small sample; data verification
ETL & data loading verification checks — basic ETL and PL/SQL verifications conducted by QA (see the SQL sketches below):
• Verify mappings, source to target
• Verify that all tables and specified fields were loaded from source to staging
• Verify that keys were properly generated using the sequence generator
• Verify that not-null fields are populated
• Verify no data truncation in each field
• Verify data types and formats are as specified in the design phase
• Verify no duplicate records in target tables
• Verify transformations based on the data low-level design (LLDs)
• Verify that numeric fields are populated with correct precision
• Verify that every ETL session completed with only planned exceptions
• Verify all cleansing, transformation, error, and exception handling
• Verify PL/SQL calculations and data mappings
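Several of these checks reduce to short SQL probes against the target schema. A hedged sketch follows, using a hypothetical target table TGT_RATING with key RATING_ID and a 255-character RATING_DESC column; the names and column widths are illustrative only.

  -- Duplicate records in the target (expect zero rows)
  SELECT rating_id, COUNT(*) AS copies
    FROM tgt_rating
   GROUP BY rating_id
  HAVING COUNT(*) > 1;

  -- Not-null fields actually populated (expect a zero count)
  SELECT COUNT(*) AS null_violations
    FROM tgt_rating
   WHERE rating_code IS NULL OR effective_date IS NULL;

  -- Possible truncation: values filling the declared column width are suspect
  SELECT COUNT(*) AS possible_truncation
    FROM tgt_rating
   WHERE LENGTH(rating_desc) = 255;  -- 255 = declared width of rating_desc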
Data verification training overview
• Data Quality Overview
• Testing: DQ Categories / Checks
• Testing: DQ Case Study
• DQ Test Management (planning, design, execution, tools)
• DQ Benefits & Challenges
QA steps: data integration verification (1)
Data integration planning (data model, LLDs):
• Gain an understanding of the data to be reported by the application and the tables upon which each report is based (orgs, ratings, countries, analysts, etc.)
• Review and understand the data model – gain an understanding of keys and of flows from source to target
• Review and understand data LLDs and mappings: add/update sequences for all sources of each target table
ETL planning and testing (source inputs & ETL design):
• Participate in ETL design reviews
• Gain in-depth knowledge of ETL sessions, their order of execution, constraints, and transformations
• Participate in development ETL test case reviews
• After ETLs are run, use checklists for QA assessments of rejects, session failures, and errors
QA steps: data integration verification (2)
Assess ETL logs (session, workflow, errors):
• Review ETL workflow outputs and source-to-target counts
• Verify source-to-target mapping docs against loaded tables using TOAD and other tools
• After ETL runs or manual data loads, assess the data in every table with a focus on key fields (dirty data, incorrect formats, duplicates, etc.), using TOAD and Excel tools (SQL queries, filtering, etc.); a profiling sketch follows below
GUI and report validations:
• Compare reports with target data; verify that reporting meets user expectations
Analytics test team data validation:
• Test data as it is integrated into the application; provide tools and tests for data validation
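A sketch of the per-table, key-field assessment mentioned above: a one-pass profile of population, distinct values, and duplicates. The table DIM_ANALYST and column ANALYST_ID are hypothetical examples.

  -- Quick key-field profile: population, distinct values, duplicates
  SELECT COUNT(*)                                        AS total_rows,
         COUNT(analyst_id)                               AS populated_rows,
         COUNT(DISTINCT analyst_id)                      AS distinct_values,
         COUNT(analyst_id) - COUNT(DISTINCT analyst_id)  AS duplicated_values
    FROM dim_analyst;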
From source to data warehouse… unit testing
• Know the data transformation rules!
• Run test cases for each transformation rule; include positive & negative situations
• Row counts: DWH (destination) = Source – Rejected (see the reconciliation sketch below)
• Verify that the process correctly uses all required data, including metadata
• Cross-reference DWH dimension and fact tables to source tables
• Verify all business rule computations are correct
• Verify database queries, expected vs. actual results
• Rejects are correctly handled and conform to business rules
• Slowly changing dimensions (e.g., address, marital status) are processed correctly
• Correctness of surrogate keys (e.g., time zones, currencies) in fact tables
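A sketch of the row-count reconciliation above, assuming hypothetical staging, warehouse, and reject tables for a single load (STG_ORDERS, FACT_ORDERS, REJ_ORDERS):

  -- Reconciliation: source rows should equal loaded rows plus rejected rows
  SELECT src.cnt                       AS source_rows,
         dwh.cnt                       AS loaded_rows,
         rej.cnt                       AS rejected_rows,
         src.cnt - (dwh.cnt + rej.cnt) AS unexplained_difference  -- expect 0
    FROM (SELECT COUNT(*) AS cnt FROM stg_orders)  src,
         (SELECT COUNT(*) AS cnt FROM fact_orders) dwh,
         (SELECT COUNT(*) AS cnt FROM rej_orders)  rej;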
Transforming Data, Source to Target
DQ tools / techniques used by QA team
TOAD / SQL Navigator:
• Data profiling for value range & boundary analysis (see the sketch below)
• Null field analysis
• Row counting
• Data type analysis
• Referential integrity analysis (key analysis)
• Distinct value analysis by field
• Duplicate data analysis (fields and rows)
• Cardinality analysis
• PL/SQL stored procedure & package verification
Excel:
• Data filtering for profile analysis
• Data value sampling
• Data type analysis
MS Access:
• Table and data analysis across schemas
QTP:
• Automated testing of templates and application screens
Analytics tools:
• J – statistics, visualization, data manipulation
• Perl – data manipulation, scripting
• R – statistics
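A sketch of the kind of value-range and boundary profiling query run from TOAD / SQL Navigator; the table FACT_EXPOSURE and column AMOUNT are hypothetical, not project names.

  -- Value-range and boundary profile for one numeric field
  SELECT MIN(amount)                                      AS min_value,
         MAX(amount)                                      AS max_value,
         AVG(amount)                                      AS avg_value,
         COUNT(*)                                         AS total_rows,
         SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END)  AS null_rows,
         SUM(CASE WHEN amount < 0 THEN 1 ELSE 0 END)      AS negative_rows
    FROM fact_exposure;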
Data defect findings by QA team — data defect types on six projects:
• Inadequate ETL and stored procedure design documents
• Field values are null when specified as "Not Null"
• Field constraints and SQL not coded correctly for Informatica ETL
• Excessive ETL errors discovered after entry to QA
• Source data does not meet table mapping specifications (e.g., dirty data)
• Source-to-target mappings: 1) often not reviewed, 2) in error, and 3) not consistently maintained through the dev lifecycle
• Data models are not adequately maintained during the development lifecycle
• Target data does not meet mapping specifications
• Duplicate field values when defined to be DISTINCT
• ETL SQL / transformation errors leading to missing rows and invalid field values
• Constraint violations in source
• Target data is incorrectly stored in nonstandard formats
• Table keys are incorrect for important relationship linkages (see the orphan-key sketch below)
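Key-relationship defects of the kind listed above can be surfaced with an orphan-row query; a sketch using hypothetical fact and dimension tables FACT_RATING and DIM_ORG joined on ORG_KEY:

  -- Orphaned foreign keys: fact rows whose dimension parent is missing (expect zero rows)
  SELECT f.org_key, COUNT(*) AS orphan_rows
    FROM fact_rating f
    LEFT JOIN dim_org d ON d.org_key = f.org_key
   WHERE d.org_key IS NULL
   GROUP BY f.org_key;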
Lessons learned
• Formal QA data track verifications should continue early in the ETL design and data load process (independent of application development)
• With access to the ETL dev environment, QA can prepare for formal testing and offer feedback to the dev team
• Offshore teams need adequate and more representative samples of data for data planning and design
• Data models, LLDs, ETL design, and data mapping documents need to be kept in sync until transition
• QA resourcing for projects must accommodate data track verifications
Recommendations for data verifications — detailed recommendations for Development, QA, and Data Services:
• Analyze a) source data quality and b) data field profiles before input to Informatica and other data-build services
• QA should participate in all data model and data mapping reviews
• Complete review of ETL error logs and resolution of errors by the ETL teams before DB turnover to QA
• Early use of QC during ETL and stored procedure testing to target vulnerable process areas
• Substantially improved documentation of PL/SQL stored procedures
• QA needs a dev or separate environment for early data testing; QA should be able to modify data in order to perform negative tests (QA currently does only positive tests because application and database tests run in parallel in the same environment)
• Substantially enhanced verification of target tables after each ETL load before data turnover to QA
• Mandatory maintenance of data models and source-to-target mapping / transformation rules documents from elaboration until transition
• Investments in more Informatica and off-the-shelf data quality analysis tools for pre- and post-ETL analysis
• Investments in automated DB regression test tools and training to support frequent data loads (a minimal sketch follows below)
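One low-cost form of the automated regression testing recommended above is a baseline comparison between two loads. A sketch in Oracle-style SQL, assuming a hypothetical snapshot table BASELINE_TGT_RATING captured after the previous accepted load:

  -- Regression check: baseline rows missing or changed after the new load (expect zero rows)
  SELECT rating_id, rating_code, effective_date FROM baseline_tgt_rating
  MINUS
  SELECT rating_id, rating_code, effective_date FROM tgt_rating;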
Important resource for DB testers
