At the moment we have a homegrown tool for evaluating a change to search. In https://github.com/cormacparle/media-search-signal-test#use-the-labeled-images-to-compare-search-algorithms a script runs every search for which we have labeled data through the search API, stores the results, then calculates the average F1, precision and recall across all of those searches.
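For reference, here is a minimal Python sketch of the kind of per-search scoring and averaging described above. It is not the code from the linked repo: the function names (`precision_recall_f1`, `average_scores`) and the data shapes are hypothetical, and it assumes that any returned result not labeled as relevant counts as a miss, which may differ from how the existing script treats unlabeled results.

```python
from statistics import mean


def precision_recall_f1(returned, relevant):
    """Score one search.

    `returned` is the list of result ids the search gave back;
    `relevant` is the set of ids labeled as good for that search.
    Assumption: anything returned but not in `relevant` counts as a miss.
    """
    returned_set = set(returned)
    true_positives = len(returned_set & relevant)
    precision = true_positives / len(returned_set) if returned_set else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def average_scores(results_by_query, labels_by_query):
    """Average precision, recall and F1 over every labeled search term."""
    scores = [
        precision_recall_f1(results_by_query.get(query, []), relevant)
        for query, relevant in labels_by_query.items()
    ]
    if not scores:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precisions, recalls, f1s = zip(*scores)
    return {
        "precision": mean(precisions),
        "recall": mean(recalls),
        "f1": mean(f1s),
    }


# Hypothetical stored results and labels for two search terms.
results = {"cat": ["a", "b", "c", "d"], "dog": ["x"]}
labels = {"cat": {"a", "b"}, "dog": {"x", "y"}}
print(average_scores(results, labels))
```

Running the same calculation over the stored results of two search algorithms gives two sets of averages that can be compared side by side.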
In contrast, with relforge (AFAIK) you can run sets of Elasticsearch queries and use labeled data to calculate information retrieval scores (precision, recall, and others).
Let's figure out how to evaluate our search changes with relforge, as it's more sustainable to do things the same way the search team does them. See runSearch.php in CirrusSearch and searchEntities.php in Wikibase.
NOTE
T280131 is in progress to decide whether to continue using relforge, so if it's a while before this task gets done, it's best to check in there first.