As illustrated in Section 2.4, the AlignEvaluator is trained on (reprompt, image) pairs. However, in the GRPO training loop described in Section 2.3, the AlignEvaluator computes a scalar reward $r_i$ for each pair $(p_i, I_i)$, where $p_i$ denotes the user prompt, not the reprompt.
Could you please explain how this train/inference mismatch was handled?