Perhaps there are many problems in the design of the reward function.

Hi, thanks for the great project!
While reading your paper and code, I found a few things in the reward function design that confused me, and I would like to ask about them:
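1. The accuracy reward R_acc uses GPT-4o as the judge, but an empty image list is passed in. Does it evaluate only the response text? The prompt nevertheless mentions an image: "I will give you a question related to an image and the following text as inputs".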
2. The paper defines R_tool = R_crop - α * R_area. bbox_score corresponds to R_crop, so why is it additionally multiplied by bbox_reward_weight (0.1)?
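3. In compute_score(), the relative area penalty R_area is computed incorrectly: the penalty takes the form area_penalty_score = area_penalty_weight * max(0, area_ratio - min_area_ratio) rather than the paper's clip(ratio/μ_a - 1, 0, 1). Is setting area_penalty_weight to 0.2 also a problem, given that the paper uses α = 2? The code uses a fixed min_area_ratio threshold and does not compute the dynamic within-group mean μ_a. It also does not check the condition R_acc = 1 and R_crop = 1 (there is an acc_condition parameter, but it is set to False in run_adptvision.sh). And why is R_crop cleared when R_acc = 0?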
4. In compute_score(), what does the step if isinstance(acc, dict): return acc mean?
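5. In compute_score(), the balance reward R_bal contains only a tool-call penalty. When no tool is called, the outcome reward is score = acc_score + format_score, so there is no penalty on answering directly in the scenario where accuracy on the low-resolution image is low.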
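
To make points 2 and 3 concrete, here is a minimal sketch of how I read the two penalty forms. It is not the repository's code: the function names are hypothetical, the group over which μ_a is averaged is my assumption, and the min_area_ratio value used in the example is only a placeholder.

```python
from statistics import mean


def clip(x, lo, hi):
    """Clamp x into the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def paper_area_penalty(area_ratio, group_ratios):
    """R_area as quoted from the paper: clip(ratio / mu_a - 1, 0, 1)."""
    # Assumption: mu_a is the mean area ratio over the rollouts in the group.
    mu_a = mean(group_ratios)
    return clip(area_ratio / mu_a - 1.0, 0.0, 1.0)


def paper_tool_reward(r_acc, r_crop, area_ratio, group_ratios, alpha=2.0):
    """R_tool = R_crop - alpha * R_area, as I understand the paper."""
    r_area = 0.0
    # Assumption: the penalty applies only when R_acc = 1 and R_crop = 1.
    if r_acc == 1 and r_crop == 1:
        r_area = paper_area_penalty(area_ratio, group_ratios)
    return r_crop - alpha * r_area


def code_area_penalty(area_ratio, min_area_ratio, area_penalty_weight=0.2):
    """Penalty form found in compute_score(): fixed threshold, no group mean."""
    return area_penalty_weight * max(0.0, area_ratio - min_area_ratio)


if __name__ == "__main__":
    group = [0.25, 0.5, 0.75]  # illustrative area ratios, not real data
    print(paper_area_penalty(0.75, group))
    print(paper_tool_reward(1, 1, 0.75, group))
    print(code_area_penalty(0.75, min_area_ratio=0.5))  # placeholder threshold
```

For the same crop, the two forms give penalties on different scales, which is why the weight of 0.2 versus α = 2 looks inconsistent to me.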
Thanks!
