I have the following HQL, executed by the MR engine, where the source table has almost 800 million records:
select concat(upp_sys_id,'#',min(bhv_tm) over ssn,'#',ssn_seq_all) as ssn_id
,evt_drt
,row_number() over ssn as ssn_seq
,`dw_dat_dt` ,`msg_id` ,`evt_nm` ,`upp_sys_id`
,`bhv_tm`
from test.VT_seq_all
window ssn as (partition by upp_sys_id,ssn_seq_all order by bhv_tm);
Most of the reducers take about 10 minutes to finish, but one reducer has been running for 4 hours. The number of input records for the abnormal task is similar to the others. In general, the number of records sharing the same upp_sys_id and ssn_seq_all is fewer than 1,000, but for a few specific combinations of upp_sys_id and ssn_seq_all it exceeds 100,000.
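The group sizes above can be checked with a simple aggregation over the window partition key (a minimal diagnostic sketch against the same source table):

-- count rows per (upp_sys_id, ssn_seq_all); the skewed keys show up at the top
select upp_sys_id, ssn_seq_all, count(*) as cnt
from test.VT_seq_all
group by upp_sys_id, ssn_seq_all
order by cnt desc
limit 100;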
Is there any way to optimize this HQL?
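One rewrite I have considered (a sketch only, not verified to help): since the window orders by bhv_tm, min(bhv_tm) over ssn is simply the per-partition minimum of bhv_tm, so it could be precomputed with a group by and joined back, leaving only row_number() as a window function. I am not sure this removes the bottleneck, because row_number() still has to sort the skewed partitions:

select concat(t.upp_sys_id,'#',m.min_bhv_tm,'#',t.ssn_seq_all) as ssn_id
,t.evt_drt
,row_number() over (partition by t.upp_sys_id,t.ssn_seq_all order by t.bhv_tm) as ssn_seq
,t.`dw_dat_dt` ,t.`msg_id` ,t.`evt_nm` ,t.`upp_sys_id`
,t.`bhv_tm`
from test.VT_seq_all t
join (
    -- per-partition minimum of bhv_tm, computed once with a plain group by
    select upp_sys_id, ssn_seq_all, min(bhv_tm) as min_bhv_tm
    from test.VT_seq_all
    group by upp_sys_id, ssn_seq_all
) m on t.upp_sys_id = m.upp_sys_id and t.ssn_seq_all = m.ssn_seq_all;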
Here is the execution plan:
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: VT_seq_all
            Statistics: Num rows: 782125303 Data size: 2610606354398 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: upp_sys_id (type: string), ssn_seq_all (type: bigint), bhv_tm (type: string)
              sort order: +++
              Map-reduce partition columns: upp_sys_id (type: string), ssn_seq_all (type: bigint)
              Statistics: Num rows: 782125303 Data size: 2610606354398 Basic stats: COMPLETE Column stats: NONE
              value expressions: evt_drt (type: bigint), dw_dat_dt (type: date), msg_id (type: string), evt_nm (type: string)
      Reduce Operator Tree:
        Select Operator
          expressions: VALUE._col0 (type: bigint), KEY.reducesinkkey1 (type: bigint), VALUE._col1 (type: date), VALUE._col2 (type: string), VALUE._col3 (type: string), KEY.reducesinkkey0 (type: string), KEY.reducesinkkey2 (type: string)
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col9
          Statistics: Num rows: 782125303 Data size: 2610606354398 Basic stats: COMPLETE Column stats: NONE
          PTF Operator
            Function definitions:
                Input definition
                  input alias: ptf_0
                  output shape: _col0: bigint, _col1: bigint, _col2: date, _col3: string, _col4: string, _col5: string, _col9: string
                  type: WINDOWING
                Windowing table definition
                  input alias: ptf_1
                  name: windowingtablefunction
                  order by: _col9 ASC NULLS FIRST
                  partition by: _col5, _col1
                  raw input shape:
                  window functions:
                      window function definition
                        alias: min_window_0
                        arguments: _col9
                        name: min
                        window function: GenericUDAFMinEvaluator
                        window frame: RANGE PRECEDING(MAX)~CURRENT
                      window function definition
                        alias: row_number_window_1
                        name: row_number
                        window function: GenericUDAFRowNumberEvaluator
                        window frame: ROWS PRECEDING(MAX)~FOLLOWING(MAX)
                        isPivotResult: true
            Statistics: Num rows: 782125303 Data size: 2610606354398 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: concat(_col5, '#', min_window_0, '#', _col1) (type: string), _col0 (type: bigint), row_number_window_1 (type: int), _col2 (type: date), _col3 (type: string), _col4 (type: string), _col5 (type: string), _col9 (type: string)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
              Statistics: Num rows: 782125303 Data size: 2610606354398 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 782125303 Data size: 2610606354398 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink