Refactor: Name the implicit pipeline stages — detect, parse, accumulate, finalize, render #180

@gregeva

Summary

Today's main flow in ltl already forms an implicit 5-stage pipeline, but the stages are unnamed and the boundaries between them are not enforced. This issue proposes naming them (detect, parse, accumulate, finalize, render), giving each an explicit entry-point subroutine and a clear inter-stage data contract. It is a light-touch refactor: it adds structure without changing behavior.

This is a prerequisite for Issue #23. With named stages in place, Phase 1 of #23 inserts the format registry into the detect stage, and Phase 2 adds per-bucket lifecycle hooks inside finalize — both become refactors between named stages rather than green-field redesigns.

Motivation

The current implicit pipeline is in ## MAIN ## (ltl:7677):

| Order | Sub | Lines | Implicit stage |
|---|---|---|---|
| 1 | read_and_process_logs() | ltl:3590-4407 | detect + parse + accumulate (interleaved) |
| 2 | initialize_empty_time_windows() | ltl:7692 | accumulate |
| 3 | group_similar_messages() | ltl:7702 | finalize (consolidation) |
| 4 | calculate_all_statistics() | ltl:4822-5070, called ltl:7868 | finalize |
| 5 | calculate_heatmap_buckets() | ltl:4435, called ltl:7876 | finalize |
| 6 | calculate_histogram_buckets() | ltl:4552, called ltl:7885 | finalize |
| 7 | normalize_data_for_output() | ltl:5590, called ltl:7893 | render |
| 8 | print_bar_graph() | ltl:6251, called ltl:7913 | render |
| 9 | print_histograms() | called ltl:7914 | render |
| 10 | write_index_file() | ltl:524, called ltl:7942 | render (post-output) |

The interleaving in row 1 (read_and_process_logs()) is what hides the boundaries: detection, parsing, and accumulation all happen per line in a single tight loop.

Stages

detect

Format detection. Today: inlined per line in read_and_process_logs() as 13 cascading regex tests (ltl:3689-3840). After this refactor: a named subroutine that runs at the start of each file (or on the first N lines), determines the format, and caches the choice. In low-confidence cases it falls back to per-line detection (this fallback requires the buffered-read architecture from the companion issue).
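
As a rough illustration of the "detect once, cache, fall back on low confidence" idea, here is a minimal sketch in Python. It is not ltl's code: the two patterns, the `FORMATS` table, `detect_format()`, and the 0.8 threshold are all hypothetical stand-ins for the 13-test cascade.

```python
import re
from typing import Iterable, Optional

# Hypothetical format table; ltl's real cascade has 13 regex tests.
FORMATS = {
    "iso_bracketed": re.compile(r"^\[\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}"),
    "iso_plain": re.compile(r"^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}"),
}


def detect_format(sample_lines: Iterable[str],
                  min_confidence: float = 0.8) -> Optional[str]:
    """Return the format matching most sample lines, or None if unsure.

    A None result signals the caller to fall back to per-line detection.
    """
    lines = [line for line in sample_lines if line.strip()]
    if not lines:
        return None
    best_name, best_hits = None, 0
    for name, pattern in FORMATS.items():
        hits = sum(1 for line in lines if pattern.match(line))
        if hits > best_hits:
            best_name, best_hits = name, hits
    # Only cache the choice when enough of the sample agrees with it.
    if best_hits / len(lines) >= min_confidence:
        return best_name
    return None
```

A caller would run this on the first N lines of each file and store the result per file; a `None` result corresponds to the low-confidence path described above.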

parse

Regex match + field extraction for a single line. Takes a line + cached format; emits a structured record (timestamp, message, duration, bytes, count, fields). Today: also inlined in read_and_process_logs().
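
One possible shape for the parse contract, sketched in Python under stated assumptions. The `Record` fields mirror the list above (with `nbytes` standing in for "bytes" to avoid shadowing a Python builtin); `LINE_PATTERNS`, `DURATION`, and `parse_line()` are hypothetical names, and the single pattern shown is illustrative only.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Record:
    """Structured output of the parse stage (field names follow the issue text)."""
    timestamp: str
    message: str
    duration: Optional[float] = None
    nbytes: Optional[int] = None  # "bytes" in the issue text
    count: int = 1
    fields: dict = field(default_factory=dict)


# Hypothetical per-format line patterns, keyed by the cached format name.
LINE_PATTERNS = {
    "iso_plain": re.compile(
        r"^(?P<timestamp>\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2})\s+(?P<message>.*)$"
    ),
}
# Assumed convention: durations appear as "<number> ms" inside the message.
DURATION = re.compile(r"(\d+(?:\.\d+)?)\s*ms\b")


def parse_line(line: str, fmt: str) -> Optional[Record]:
    """Parse one line using the cached format; None if it does not match."""
    pattern = LINE_PATTERNS.get(fmt)
    match = pattern.match(line.rstrip("\n")) if pattern else None
    if match is None:
        return None
    found = DURATION.search(match["message"])
    return Record(
        timestamp=match["timestamp"],
        message=match["message"],
        duration=float(found.group(1)) if found else None,
    )
```

The point of the sketch is the signature: a line plus a cached format goes in, and a single record (or nothing) comes out, with no bucket state touched.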

accumulate

Push parsed records into bucket structures: %log_analysis, %log_messages, %heatmap_raw, %histogram_values, %log_threadpools, %log_sessions, %udm_values. Today: also inlined.
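
A small Python sketch of what an accumulate entry point might look like. The dictionary keys borrow two of the structure names above, but the layout inside them, the per-minute windowing, and the function names are assumptions made for illustration, not a description of ltl's actual data structures.

```python
from collections import defaultdict


def new_buckets():
    """Create empty bucket structures (layout is hypothetical)."""
    return {
        "log_analysis": defaultdict(lambda: {"count": 0, "total_duration": 0.0}),
        "log_messages": defaultdict(int),
    }


def accumulate(buckets, timestamp, message, duration=None, bucket_chars=16):
    """Add one parsed record to the buckets.

    Assumes timestamps look like "YYYY-MM-DD HH:MM:SS", so the first
    16 characters identify a one-minute window.
    """
    window = timestamp[:bucket_chars]
    slot = buckets["log_analysis"][window]
    slot["count"] += 1
    if duration is not None:
        slot["total_duration"] += duration
    buckets["log_messages"][message] += 1
```

Keeping all mutation of bucket state behind one function like this is what would let a later change add per-bucket hooks without touching detect or parse.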

finalize

Close-of-pipeline computations: calculate_all_statistics(), calculate_heatmap_buckets(), calculate_histogram_buckets(), group_similar_messages(). Today: separate subs called sequentially after read_and_process_logs() returns.

render

normalize_data_for_output(), print_bar_graph(), print_histograms(), then write_index_file().

Scope

  • Add named entry-point subroutines: pipeline_detect(), pipeline_parse(), pipeline_accumulate(), pipeline_finalize(), pipeline_render().
  • Add a top-level dispatcher in ## MAIN ## that calls them in order.
  • Define and document inter-stage data contracts: what each stage receives and emits.
  • Move existing logic into the named stages with zero behavioral change. Do not restructure intra-stage logic.
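
To make the "thin dispatcher" idea concrete, here is a minimal Python sketch. The five function names match the entry points proposed above; everything inside them is a placeholder, and passing a single `state` dictionary between stages is just one assumed way to express the inter-stage contract.

```python
def pipeline_detect(state):
    # Placeholder: a real detect stage would inspect state["lines"].
    state["format"] = "plain"
    return state


def pipeline_parse(state):
    state["records"] = [
        {"message": line.strip()} for line in state["lines"] if line.strip()
    ]
    return state


def pipeline_accumulate(state):
    counts = {}
    for record in state["records"]:
        counts[record["message"]] = counts.get(record["message"], 0) + 1
    state["counts"] = counts
    return state


def pipeline_finalize(state):
    state["total"] = sum(state["counts"].values())
    return state


def pipeline_render(state):
    state["output"] = f"{state['total']} records, {len(state['counts'])} distinct"
    return state


STAGES = (
    pipeline_detect,
    pipeline_parse,
    pipeline_accumulate,
    pipeline_finalize,
    pipeline_render,
)


def run_pipeline(lines):
    """Thin dispatcher: call each stage in order and record the order."""
    state = {"lines": list(lines), "trace": []}
    for stage in STAGES:
        state = stage(state)
        state["trace"].append(stage.__name__)
    return state
```

This sketch runs each stage over the whole input for simplicity. In ltl, detect, parse, and accumulate currently run per line inside one loop, so the real dispatcher would more likely call the first three entry points from within the read loop and the last two afterward.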

Out of scope

Acceptance criteria

  • Five named entry-point subroutines exist with clear contracts.
  • ## MAIN ## is a thin dispatcher that calls the stages in order.
  • All golden-file tests pass byte-identically (regression suite from #56, "Testing: Memory baseline profiling for current processing model").
  • Benchmark suite shows no regression beyond noise (compare against current baseline).
  • Stage contracts are documented (inline comments or a short addendum to docs/staged-processing-pipeline.md).
  • Pre-existing functions (calculate_all_statistics, calculate_heatmap_buckets, etc.) become callees of the named stages.
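
The byte-identical criterion can be checked with a very small helper. The Python sketch below is an assumption about how such a check might be written, not a description of the existing regression suite; `first_difference()` is a hypothetical name.

```python
from typing import Optional


def first_difference(actual: bytes, golden: bytes) -> Optional[int]:
    """Return the offset of the first differing byte, or None if identical.

    If one input is a strict prefix of the other, the offset returned is
    the length of the shorter input.
    """
    if actual == golden:
        return None
    for offset, (a, g) in enumerate(zip(actual, golden)):
        if a != g:
            return offset
    return min(len(actual), len(golden))
```

Reporting the first differing offset, rather than only pass or fail, makes it easier to tell a stray trailing newline from a real change in rendered output.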

Why coarse, not fine-grained

Finer stage boundaries (e.g., per-bucket open/close hooks for sliding window) belong inside #23 Phase 2. Landing them now would expand scope and risk regression. The 5-stage shape is enough to make Phase 1 (#58) and Phase 2 (#59) into refactors-between-named-stages.

Dependency of

Depends on

  • None.

Related
