-
Notifications
You must be signed in to change notification settings - Fork 136
Description
Describe the bug
search memory likes to return more assistant facts, less user facts
- thus it returns second hand info instead of first hand info, leading to missing facts and wrong answers
- thus timestamps are wrong for temporal questions, leading to wrong answers
- consider doing separate search for user facts (firsthand info) and increase its score, and decrease score for assistant facts (secondhand info)
Steps to reproduce
run longmemeval with s dataset
e.g. multi-session
Question gpt4_59c863d7
Issue:
search memory did not return facts on:
Revell F-15 Eagle
69 Camaro
search memory did not return user content for these, only assistant content:
B-29 bomber
German Tiger
Files on aep4:
cd "/automation/atf/tomw/tmp/lme_compare/s full_ctx vs mmai vs gold_facts q01-10/test4-mmai"
vi "memmachine_search_eval_results_gpt-4.1-mini.json"
========================================
e.g. multi-session
Question c4a1ceb8
Issue:
search memory only returned assistant facts on fresh lime juice, no user facts which contained the answer
answer_llm found the word grapefruit in an assistant fact and incorrectly thinks user said it
Files on aep4:
cd "/automation/atf/tomw/tmp/lme_compare/s full_ctx vs mmai vs gold_facts q11-20/test1-mmai"
vi "memmachine_search_eval_results_gpt-4.1-mini.json"
========================================
e.g. multi-session
Question 28dc39ac
Issue:
search memory did not return Hyper Light Drifter, which took me 5 hours
search memory did not return Celeste, which took me 10 hours
search memory returned assistant fact on The Last of Us Part II, but not any user facts
Files on aep4:
cd "/automation/atf/tomw/tmp/lme_compare/s full_ctx vs mmai vs gold_facts q11-20/test1-mmai"
vi "memmachine_search_eval_results_gpt-4.1-mini.json"
========================================
e.g. multi-session
Question 88432d0a
Issue:
search memory did not return any facts for rustic Italian bread
search memory did not return any facts for batch of cookies
search memory returned assistant facts but no user facts on sourdough starter
Files on aep4:
cd "/automation/atf/tomw/tmp/lme_compare/s full_ctx vs mmai vs gold_facts q11-20/test1-mmai"
vi "memmachine_search_eval_results_gpt-4.1-mini.json"
========================================
e.g. multi-session
Question d23cf73b
Issue:
search memory did not return any facts on Indian cuisine
search memory returned assistant facts about many suggestions
answer_llm tooks assistant suggestions as user facts
Files on aep4:
cd "/automation/atf/tomw/tmp/lme_compare/s full_ctx vs mmai vs gold_facts q21-30/test2-mmai"
vi "memmachine_search_eval_results_gpt-4.1-mini.json"
========================================
e.g. temporal-reasoning
Question gpt4_d6585ce8
Issue:
search memory returned assistant facts for outdoor concert but no user facts which does not help answer the question
search memory returned assistant facts for Queen but no user facts which does not help answer the question
search memory returned assistant facts for Brooklyn but no user facts so the timestamp is wrong
search memory did not return the required fact for jazz night so the timestamp is wrong
Files on aep4:
cd "/automation/atf/tomw/tmp/lme_compare/s full_ctx vs mmai vs gold_facts q31-40/test3-mmai"
vi "memmachine_search_eval_results_gpt-4.1-mini.json"
========================================
e.g. temporal-reasoning
Question gpt4_f420262c
Issue:
search memory return only assistant facts but no user facts for American so timestamp is wrong
search memory return only assistant facts but no user facts for JetBlue so timestamp is wrong
search memory return only assistant facts but no user facts for Delta so timestamp is wrong
search memory did not return facts for United
Files on aep4:
cd "/automation/atf/tomw/tmp/lme_compare/s full_ctx vs mmai vs gold_facts q31-40/test3-mmai"
vi "memmachine_search_eval_results_gpt-4.1-mini.json"
========================================
e.g. knowledge-update
Question 945e3d21
Issue:
search memory did not return any facts on yoga from the user, only returned facts on yoga from assistant
Files on aep4:
cd "/automation/atf/tomw/tmp/lme_compare/s full_ctx vs mmai vs gold_facts q01-10/test4-mmai"
vi "memmachine_search_eval_results_gpt-4.1-mini.json"
========================================
more examples are in the results spreadsheet
Expected behavior
see results spreadsheet, link is directly above.
see tabs "mmai pass" and "mmai fail"
-
when answer is correct, there is 50:50 ratio of user and assistant facts
-
when answer is wrong, there is 44:56 ratio of user and assistant facts
-
more assistant facts results in more wrong answers
-
when answer is correct, there is 43:57 ratio where assistant facts appear towards top of the search memories list
-
when answer is wrong, there is 0:100 ratio where assistant facts appear towards top of the search memories list
-
user facts towards the top of the search memories list results in more correct answers
Environment
build is main 01/14 commit dacf9a6
Additional context
Edwin has a suggestion that may help add more user facts:
one way I found to increase the LongMemEval score is to prepend the question with "User: "
So if a question is "What did I eat for breakfast?", then the query is "User: What did I eat for breakfast?"