I have the following HQL, executed by the MR engine, where the source table has almost 800 million records:
select concat(upp_sys_id,'#',min(bhv_tm) over ssn,'#',ssn_seq_all) as ssn_id
,evt_drt
,row_number() over ssn as ssn_seq
,`dw_dat_dt` ,`msg_id` ,`evt_nm` ,`upp_sys_id`
,`bhv_tm`
from test.VT_seq_all
window ssn as (partition by upp_sys_id,ssn_seq_all order by bhv_tm);
Most of the reducers take about 10 minutes to finish, but one reducer has been running for 4 hours. The number of input records for the abnormal task is similar to the others. In general, the number of records sharing the same upp_sys_id and ssn_seq_all is fewer than 1,000, but for a few specific combinations of upp_sys_id and ssn_seq_all it exceeds 100,000.
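The group sizes above can be checked with a simple aggregation over the window partition key (a minimal diagnostic sketch against the same source table):

-- count rows per (upp_sys_id, ssn_seq_all); the skewed keys show up at the top
select upp_sys_id, ssn_seq_all, count(*) as cnt
from test.VT_seq_all
group by upp_sys_id, ssn_seq_all
order by cnt desc
limit 100;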
Is there any way to optimize this HQL?
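One rewrite I have considered (a sketch only, not verified to help): since the window orders by bhv_tm, min(bhv_tm) over ssn is simply the per-partition minimum of bhv_tm, so it could be precomputed with a group by and joined back, leaving only row_number() as a window function. I am not sure this removes the bottleneck, because row_number() still has to sort the skewed partitions:

select concat(t.upp_sys_id,'#',m.min_bhv_tm,'#',t.ssn_seq_all) as ssn_id
,t.evt_drt
,row_number() over (partition by t.upp_sys_id,t.ssn_seq_all order by t.bhv_tm) as ssn_seq
,t.`dw_dat_dt` ,t.`msg_id` ,t.`evt_nm` ,t.`upp_sys_id`
,t.`bhv_tm`
from test.VT_seq_all t
join (
    -- per-partition minimum of bhv_tm, computed once with a plain group by
    select upp_sys_id, ssn_seq_all, min(bhv_tm) as min_bhv_tm
    from test.VT_seq_all
    group by upp_sys_id, ssn_seq_all
) m on t.upp_sys_id = m.upp_sys_id and t.ssn_seq_all = m.ssn_seq_all;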
Here is the execution plan:
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: VT_seq_all
            Statistics: Num rows: 782125303 Data size: 2610606354398 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: upp_sys_id (type: string), ssn_seq_all (type: bigint), bhv_tm (type: string)
              sort order: +++
              Map-reduce partition columns: upp_sys_id (type: string), ssn_seq_all (type: bigint)
              Statistics: Num rows: 782125303 Data size: 2610606354398 Basic stats: COMPLETE Column stats: NONE
              value expressions: evt_drt (type: bigint), dw_dat_dt (type: date), msg_id (type: string), evt_nm (type: string)
      Reduce Operator Tree:
        Select Operator
          expressions: VALUE._col0 (type: bigint), KEY.reducesinkkey1 (type: bigint), VALUE._col1 (type: date), VALUE._col2 (type: string), VALUE._col3 (type: string), KEY.reducesinkkey0 (type: string), KEY.reducesinkkey2 (type: string)
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col9
          Statistics: Num rows: 782125303 Data size: 2610606354398 Basic stats: COMPLETE Column stats: NONE
          PTF Operator
            Function definitions:
                Input definition
                  input alias: ptf_0
                  output shape: _col0: bigint, _col1: bigint, _col2: date, _col3: string, _col4: string, _col5: string, _col9: string
                  type: WINDOWING
                Windowing table definition
                  input alias: ptf_1
                  name: windowingtablefunction
                  order by: _col9 ASC NULLS FIRST
                  partition by: _col5, _col1
                  raw input shape:
                  window functions:
                      window function definition
                        alias: min_window_0
                        arguments: _col9
                        name: min
                        window function: GenericUDAFMinEvaluator
                        window frame: RANGE PRECEDING(MAX)~CURRENT
                      window function definition
                        alias: row_number_window_1
                        name: row_number
                        window function: GenericUDAFRowNumberEvaluator
                        window frame: ROWS PRECEDING(MAX)~FOLLOWING(MAX)
                        isPivotResult: true
            Statistics: Num rows: 782125303 Data size: 2610606354398 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: concat(_col5, '#', min_window_0, '#', _col1) (type: string), _col0 (type: bigint), row_number_window_1 (type: int), _col2 (type: date), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col9 (type: string)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
              Statistics: Num rows: 782125303 Data size: 2610606354398 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 782125303 Data size: 2610606354398 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink