High level plan of how to scale MoreLike
Closed, Resolved · Public

Description

As part of WE3.1, we might need to scale the MoreLike endpoint. In this experiment, we want to expose more users to RelatedArticles, which relies on MoreLike. This is one of the most expensive Search endpoints, so depending on the additional load, appropriate care must be taken to scale it so that the additional usage does not overload the Search clusters as a whole.

We don't yet have a good estimation of the additional load that might be generated, so this will be a high level plan, with potentially multiple scenarios, and very high level estimates.

AC

  • various scaling strategies are defined in high level terms
  • a high level estimate of the amount of work necessary is provided for each strategy

Event Timeline

Next step: brainstorming with 1 engineer from the web team

EBernhardson subscribed.

Current scaling method:

  • Results are edge-cached for 1 day. The edge sees a request rate of 90-110M per day.
  • Cirrus has a second, 3-day cache. This sees 30-35M requests per day, giving the edge cache a 65-70% hit rate.
  • Elasticsearch servers see 8.5-11M morelike requests per day, giving the second layer cache a 68-70% hit rate.
  • End-to-end hit rate is ~90%; only 1 in 10 queries lands on the elasticsearch servers.
  • The 8.5-11M requests that do land on the backend servers take approximately 23% [1] of all the time spent on full_text queries. Full text queries (including morelike) are the majority of the load on the servers.
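
To make the funnel above concrete, a tiny worked example using rough midpoints of those ranges (the exact counts vary day to day):

```
edge_requests    = 100_000_000  # ~90-110M/day hitting the edge cache
cirrus_requests  =  32_500_000  # ~30-35M/day passing through to the Cirrus cache
elastic_requests =  10_000_000  # ~8.5-11M/day actually reaching Elasticsearch

edge_hit_rate   = 1 - cirrus_requests / edge_requests     # ~68%
cirrus_hit_rate = 1 - elastic_requests / cirrus_requests  # ~69%
end_to_end      = 1 - elastic_requests / edge_requests    # ~90%

print(f"edge {edge_hit_rate:.0%}, cirrus {cirrus_hit_rate:.0%}, end-to-end {end_to_end:.0%}")
```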

Scaling possibilities:

  • Elasticsearch 7 added an algorithm called WAND which is specialized for returning the top results, but isn't able to tell you how many documents match a given query. We haven't turned this on because various editor workflows require the exact number of results. We could evaluate turning this on in specific contexts, like morelike.
    • I did quick experiments with an article with ~10k morelike results, which showed no change, or was perhaps even slightly worse with WAND. Another test with an article with >1M morelike results showed a slight improvement of perhaps 15%. This is only testing 2 queries, but it suggests there isn't a huge win to be had here. It might allow some additional query volume. (A sketch showing the relevant request flag follows this list.)
  • Most of the cost of the morelike query comes from the number of results it has to visit and score to determine which are the best results; various techniques exist for limiting the result set and thus making the query cheaper. Making the query cheaper will certainly change the results, so evaluations would have to be done to estimate the change in quality of recommendations.
    • Adding specific category/template/weighted tags filters could reduce the number of documents visited. But this has to be targeted; adding many disparate filters may not limit the number of documents visited by a significant amount.
    • Targeting the query to the opening_text field instead of the text field we use today greatly reduces the number of documents matched, in part because of how much shorter the opening text is. In prior evaluations we found this degraded the quality of results beyond acceptable levels.
    • We can reduce the number of words that morelike selects to search for. This is currently set to 25; we could experiment with lower values. I don't believe we've ever experimented with this value. In a very quick test, a value of 5 against a page that typically takes 1500ms to calculate took only 150ms. I suspect this is most applicable to particularly long pages, since we get better statistics about which words are important from longer pages, but more analysis would have to be done. (A query sketch illustrating these knobs follows this list.)
  • Caching could be expanded to longer time periods. The current 3-day cache in Cirrus is about the longest we feel is reasonable while using an in-memory cache, but we could push that cache into a disk-backed store, like an SQL database, and consider much longer time periods. This might require review/approval from DBAs. Prior analysis of how recommendations change over time suggests this could be 7+ days with minimal noticeable change in results. (A rough sketch of such a cache appears below.)
    • Both of the current caching layers might be dropping recommendations from the cache prior to the TTL expiring, but we don't have any insight or metrics into that. A dedicated disk-backed cache could potentially have better hit rates if that is happening.
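
For reference, below is a rough sketch of what some of these knobs look like on a raw Elasticsearch request. The index name, document id and host are made up, and this is not the exact query CirrusSearch builds; it only illustrates where max_query_terms (the 25-word limit above), minimum_should_match (the 30% threshold discussed further down) and track_total_hits (the flag that lets Elasticsearch 7 apply the WAND top-k optimization when an exact hit count isn't needed) would sit:

```python
import json
import requests

# Illustrative only: index name, document id and host are hypothetical, and
# CirrusSearch builds a more elaborate query than this.
query = {
    # Asking Elasticsearch 7 not to count total hits is what allows the
    # WAND top-k optimization to kick in.
    "track_total_hits": False,
    "size": 5,
    "query": {
        "more_like_this": {
            "fields": ["text"],
            "like": [{"_index": "enwiki_content", "_id": "12345"}],
            "max_query_terms": 25,          # the 25-word limit discussed above
            "minimum_should_match": "30%",  # the current threshold, see below
        }
    },
}

resp = requests.get(
    "http://localhost:9200/enwiki_content/_search",
    headers={"Content-Type": "application/json"},
    data=json.dumps(query),
    timeout=30,
)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```

Lowering max_query_terms or raising minimum_should_match are the experiments discussed here; any such change would still need the relevance evaluation noted above.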

Notably, these scaling possibilities all assume that the recommendations are reasonably cacheable and keyed only on content, not on which user is performing the query. It seems unlikely we would be able to support personalizing morelike queries to individual users; the cache fragmentation would be too much, and the base queries seem too expensive.
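
As a very rough illustration of what a content-keyed, disk-backed cache could look like, here is a sketch using SQLite as a stand-in for whatever datastore DBAs would actually approve. The schema, the (wiki, page_id) key and the 7-day TTL are all hypothetical, chosen only to match the assumptions above (content-based keys, longer retention):

```python
import json
import sqlite3
import time

# Hypothetical schema: keyed on (wiki, page_id) only, matching the assumption
# that recommendations depend on content, never on the requesting user.
TTL_SECONDS = 7 * 24 * 3600  # 7 days, per the prior analysis mentioned above

conn = sqlite3.connect("morelike_cache.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS morelike_cache (
           wiki      TEXT    NOT NULL,
           page_id   INTEGER NOT NULL,
           results   TEXT    NOT NULL,  -- JSON blob of recommended pages
           cached_at INTEGER NOT NULL,  -- unix timestamp
           PRIMARY KEY (wiki, page_id)
       )"""
)

def get_cached(wiki, page_id):
    """Return cached recommendations, or None if missing or expired."""
    row = conn.execute(
        "SELECT results, cached_at FROM morelike_cache WHERE wiki = ? AND page_id = ?",
        (wiki, page_id),
    ).fetchone()
    if row and time.time() - row[1] < TTL_SECONDS:
        return json.loads(row[0])
    return None

def put_cached(wiki, page_id, results):
    """Store (or refresh) recommendations for a page."""
    conn.execute(
        "INSERT OR REPLACE INTO morelike_cache VALUES (?, ?, ?, ?)",
        (wiki, page_id, json.dumps(results), int(time.time())),
    )
    conn.commit()
```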

[1] Estimated by comparing query_time_in_millis between the more_like and full_text groups. This is roughly the amount of time spent executing shard queries. From elastic:9200/_stats/search?groups=more_like,full_text
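
For anyone reproducing that estimate, a sketch of pulling the same numbers programmatically (host name as cited in [1]; the nesting of the stats response assumed here may need adjusting against a live cluster):

```python
import requests

stats = requests.get(
    "http://elastic:9200/_stats/search?groups=more_like,full_text",
    timeout=30,
).json()

groups = stats["_all"]["total"]["search"]["groups"]
more_like_ms = groups["more_like"]["query_time_in_millis"]
full_text_ms = groups["full_text"]["query_time_in_millis"]

# morelike queries are included in the full_text group, so this ratio is the
# share of full_text shard query time spent on morelike.
print(f"morelike share of full_text query time: {more_like_ms / full_text_ms:.0%}")
```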

@EBernhardson, your write-up looks good to me!

The only thing I would add is that there are probably additional changes to the query that we could explore that might make it more efficient—like increasing the minimum number of terms to match—but we need to evaluate any possible change for both efficiency and accuracy.

@EBernhardson makes sense to me, thanks!
Quick ref regarding opening_text and prior testing at T127900#2077958

Regarding raising the minimum number of terms to match: I did run a quick test, but it didn't seem to make much difference. We currently set minimum_should_match to 30%. Running the same query with 60% takes approximately the same time, although it does return significantly fewer results, going from 2.5M to 321k. I'd probably have to poke around with profiling to better understand why.
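
For that kind of poking around, the built-in profile API is probably the starting point; a minimal sketch (index, document id and host are illustrative, not the real CirrusSearch query):

```python
import json
import requests

body = {
    "profile": True,  # ask Elasticsearch to break down per-component timings
    "query": {
        "more_like_this": {
            "fields": ["text"],
            "like": [{"_index": "enwiki_content", "_id": "12345"}],
            "minimum_should_match": "60%",
        }
    },
}
resp = requests.get(
    "http://localhost:9200/enwiki_content/_search",
    headers={"Content-Type": "application/json"},
    data=json.dumps(body),
    timeout=30,
)
# The "profile" section of the response shows where shard query time goes.
print(json.dumps(resp.json()["profile"], indent=2)[:2000])
```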

Anyone opposed to my marking of this as Resolved for now?

We may end up needing to apply these techniques to accommodate increased load later, depending on the UX approach. For now, at least if recommendations are principally shown from an empty search state, even allowing for some initial CTAs to drive feature visibility, we believe the serving infrastructure should most likely be able to handle the extra load.

That said, if we end up having a more visible UX affordance driving more interaction (imagine, for example, an icon or the search bar being expanded into a larger "Explore" click target on mobile web, or some such thing), we may need to revisit, but we can probably defer addressing that until we hit such a point. And I'd think we could re-open the ticket in that case?

Naturally, if we bring additional recommender approaches to the UX (e.g., @MGerlach has research on similar-sessions and related methods), it seems likely we'll need to revisit and re-open there as well.

If anyone feels we ought to keep this task open and hold some follow up conversations soon to try to work through different aspects of this, please do advise! I'm mainly looking to mark Resolved for now to get ahead of Phabricator clutter :)