
I have a table PaymentItems with 8 million rows. 100'000 rows have the foreign key PaymentItemGroupId = '662162c6-209c-4594-b081-55b89ce81fda'.

I have created a nonclustered index on the column PaymentItems.Date (ASC) so that I can sort and filter entries by date faster.

When running the following query, it will take about 3 minutes:

SELECT TOP 10 [p].[Id], [p].[Receivers]
FROM [PaymentItems] AS [p]
WHERE [p].[PaymentItemGroupId] = '662162c6-209c-4594-b081-55b89ce81fda'
ORDER BY [p].[Date]

Interestingly, without the TOP 10 it takes about 18 seconds and returns all 100'000 rows. When I order descending instead of ascending (ORDER BY [p].[Date] DESC) it takes about 1 second. When I remove the index, the ascending sort is also fast.

I analyzed the query plan for the slow query, and it looks like SQL Server does not filter the rows by the foreign key first, but instead reads all 8 million rows in date order (nonclustered index scan on the Date index).

In the fast query, it applies the WHERE condition first (clustered key lookup).

Is there anything I can do, other than removing the Date index, to keep SQL Server from building a bad query plan like this?

Here is the actual query plan: https://www.brentozar.com/pastetheplan/?id=xBBArQl9kh

Here is the create table script:

CREATE TABLE [dbo].[PaymentItems](
    [Id] [uniqueidentifier] NOT NULL,
    [PaymentItemGroupId] [uniqueidentifier] NOT NULL,
    [Date] [datetime2](7) NOT NULL,
 CONSTRAINT [PK_PaymentItems] PRIMARY KEY CLUSTERED 
(
    [Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
GO
CREATE NONCLUSTERED INDEX [IX_PaymentItems_Date] ON [dbo].[PaymentItems]
(
    [Date] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
GO

CREATE NONCLUSTERED INDEX [IX_PaymentItems_PaymentItemGroupId] ON [dbo].[PaymentItems]
(
    [PaymentItemGroupId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
GO
  • A foreign key is not an index, so you probably need one for PaymentItemGroupId Commented Mar 7 at 9:36
  • There already is. Not sure if SQL Server or Entity Framework creates this automatically. Commented Mar 7 at 9:52
  • Can you include the query plan? Use pastetheplan for simplicity. I'm guessing since lookup on date is needed, it decides to use the clustered index which becomes slow. So guessing you need better indexes Commented Mar 7 at 9:53
  • @siggemannen Added it to the question. Commented Mar 7 at 10:03
  • Please also add the full CREATE TABLE definition with indexes. Commented Mar 7 at 10:26

3 Answers


For this query to run fast, you need the following index, which it seems you don't have. It prevents the key lookup entirely by completely covering the query.

CREATE INDEX IX_PaymentItems_PaymentItemGroupId_Date ON PaymentItems
    (PaymentItemGroupId, Date, Id)
    INCLUDE (Receivers);
-- The existing IX_PaymentItems_PaymentItemGroupId becomes redundant and can be dropped.

Foreign keys are not indexed by default (even though arguably they should be), and even if they were, you'd still be missing the Date key column.
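If you want to double-check what's actually there before creating the new index, something like the following lists the key columns of every index on the table (this uses STRING_AGG, so it assumes SQL Server 2017 or later):

```sql
-- List each index on PaymentItems with its key columns in order
SELECT i.name, i.type_desc,
       STRING_AGG(c.name, ', ') WITHIN GROUP (ORDER BY ic.key_ordinal) AS key_columns
FROM sys.indexes AS i
JOIN sys.index_columns AS ic
    ON ic.object_id = i.object_id AND ic.index_id = i.index_id
JOIN sys.columns AS c
    ON c.object_id = ic.object_id AND c.column_id = ic.column_id
WHERE i.object_id = OBJECT_ID('dbo.PaymentItems')
  AND ic.is_included_column = 0
GROUP BY i.name, i.type_desc;
```

For the schema in the question, this should show the primary key on Id plus the two single-column nonclustered indexes, neither of which has (PaymentItemGroupId, Date) together.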


6 Comments

I added the index, but only (PaymentItemGroupId, Date), and it solves the issue. Is there a general rule of thumb for when we should use indexes like this? And when not to use indexes like I did? Or was this rather an "execution plan bug"?
You should have the INCLUDE as well. No, this is not a bug, it's entirely expected. The server thinks it's faster to scan the whole table because it thinks it will find 10 matching rows quickly; this is the Row Goal Problem. Giving it a proper index means the index will be faster, and using INCLUDE means it won't need Key Lookups. See also learn.microsoft.com/en-nz/archive/blogs/bartd/… and use-the-index-luke.com/blog/2019-04/…
@Ben5 No, it is not a "bug." The optimizer can't invent an index you didn't create; it has to work within the confines of what you did create.
@Charlieface Great explanation. In my real scenario there are a lot more columns involved (also dynamic combinations) so I guess this is good enough? Or is it the same as without any index then?
Hard to say without a real example, but generally you want some kind of index that at least covers the more selective WHERE predicates. The problem in this case started when it picked a bad index ie IX_PaymentItems_Date because it had to scan many rows and do key lookups on all of them until it got 10 rows that actually matched all the predicates.
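As an aside, if changing the indexes really isn't an option, one possible workaround (assuming SQL Server 2016 SP1 or later, where USE HINT is available) is to disable the row-goal optimization for just this query, so the optimizer costs the plan as if all matching rows were needed rather than just 10:

```sql
SELECT TOP 10 [p].[Id], [p].[Receivers]
FROM [PaymentItems] AS [p]
WHERE [p].[PaymentItemGroupId] = '662162c6-209c-4594-b081-55b89ce81fda'
ORDER BY [p].[Date]
OPTION (USE HINT('DISABLE_OPTIMIZER_ROWGOAL'));
```

This is a blunt instrument: it trades the (sometimes genuinely good) row-goal plan for the filter-then-sort plan in all cases, so measure it against the covering-index approach before committing.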
1

In addition to the accepted answer, I think it's important to dwell a bit on (and upvote) the comment

"it picked a bad index ie IX_PaymentItems_Date because it had to scan many rows and do key lookups on all of them until it got 10 rows that actually matched all the predicates"

You have 8.6M rows in your table, of which 100'000 have a particular PaymentItemGroupId. It would be reasonable to assume that these are distributed evenly across the entire 8.6M rows; in other words, roughly 1 in every 86 rows. This is what the optimizer will say: "10 rows, you say? That's about 860 rows on average... I can do that really fast by scanning in date order and doing lookups."

But if the sought-after PaymentItemGroupId wasn't added to the system until late in the game, say after about 8.4M payment items had already been created, that assumption no longer holds, and the server gets caught off guard, because the date-ordered index scan is guaranteed to find nothing in the first 8.4M rows. The (PaymentItemGroupId, Date) distribution is skewed.

This is what's at play here!

If you select TOP 10 in descending date order, the server will probably only need to look at something like 15 or 20 rows, because so many of the newer paymentItems have the sought after paymentItemGroupId.
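The effect of the skew can be sketched numerically. This is a toy model, not SQL Server's actual costing, but it shows why the optimizer's uniform-distribution assumption is off by four orders of magnitude here:

```python
# Toy model: how many rows a date-ordered scan reads before hitting 10 matches,
# assuming matches are spread evenly over the rows after first_match_at.
def rows_scanned_for_top10(total, matches, first_match_at=0):
    density = matches / (total - first_match_at)  # matches per remaining row
    return first_match_at + round(10 / density)

total, matches = 8_600_000, 100_000

uniform = rows_scanned_for_top10(total, matches)             # optimizer's assumption
skewed  = rows_scanned_for_top10(total, matches, 8_400_000)  # actual distribution

print(uniform)  # 860 rows scanned: looks very cheap, so the scan plan wins
print(skewed)   # 8_400_020 rows scanned: nearly the whole table
```

Under the uniform assumption the scan looks like it touches ~860 rows; with the skew it has to wade through 8.4M non-matching rows first, which is essentially a full scan plus 8.4M wasted key lookups' worth of work.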

Why not use the predicate first? There is, after all, an index on PaymentItemGroupId.
But then SQL Server would have to make 100'000 lookups into the clustered index to find the date, sort those 100'000 intermediate results, and only then take TOP 10. Even if the guesstimate of 1/86 were off by a factor of 1000 (and the plan actually estimates ~1/79), it would still look cheaper on paper to scan by date and do lookups for PaymentItemGroupId.

By combining PaymentItemGroupId and Date in one index, you save all the lookups, because the index correlates the two columns. For TOP 10 you might not need to INCLUDE Receivers, but what if it were TOP 100 or TOP 1000 instead? It's just good practice to cover.



I had a slightly different issue, but possibly with the same root cause. To resolve the index problem, the only thing that worked in my case was adding a new column, something like [DateSmall] computed as CAST([Date] AS date), thus removing the time portion. Of course this creates new issues, such as UTC vs. local dates, but it did solve the performance problem.
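A sketch of that approach (the column and index names here are assumptions, adapted to the table from the question, since this answer's original schema isn't shown): add a computed column that strips the time, then index it. CAST to date is deterministic, so the column can be indexed; PERSISTED just materializes it on disk.

```sql
-- Assumed names; adapt to your own schema.
ALTER TABLE dbo.PaymentItems
    ADD [DateSmall] AS CAST([Date] AS date) PERSISTED;

CREATE NONCLUSTERED INDEX IX_PaymentItems_DateSmall
    ON dbo.PaymentItems ([DateSmall]);
```

Queries that filter or sort on the date-only value can then seek on IX_PaymentItems_DateSmall instead of scanning the full datetime2 index.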

PS: I did try tackling it as an ascending key problem, which helped somewhat, but only if I updated statistics with a full scan, which wasn't really viable on a table with over 3 million inserts per day.

