In this blog I will explain another limitation of the missing index feature – it does not specify an order for columns to be used in an index.
SQL Server can only benefit from a row store index if the key columns are in the correct order. The order of the columns of an index determines whether SQL Server chooses an index scan or seek. Scanning an index simply means that the storage engine of SQL Server will read all index pages and filter the desired records afterwards (residual predicate). The costs of the query execution is proportional to the number of pages SQL Server must read. An index seek on the other hand will navigate through the B-Tree of the index. SQL Server begins reading the appropriate index root page. From there it is directed directly to the page of the intermediate level and finally stops reading on the leaf page. If we use a range predicate in our SQL query or do not have a unique index, SQL Server will read forward or backward the leaf pages with the support of the doubly linked index chain. In this way, SQL Server reads only the pages that match the searchable argument, and therefore drastically reduces IO reads.
That is why SQL queries (SELECT, UPDATE and DELETE) normally will benefit from a supportive index.
With this understanding, consider the index consists of two columns, col1 and col2. If we are looking for a col2 in the index, SQL Server cannot use the index to seek for col2 because col1 is the leading column and therefore the index is sorted by col1 and not col2. SQL Server will scan the entire index or the table instead.
Another example where the order of the index columns determines if SQL Server can use an index is when there is a need to sort the data either by an explicit sort operator or because SQL Server decided to use an operator like Stream Aggregate or a Merge Join in the query plan. These operators requires sorted data. If the index columns are not in the correct order, it is not possible to use the index to sort, aggregate or join. Therefore, we need to consider the order of index columns when creating an index.
Let us go into detail now by running a query on a heap without any non clustered index to get the missing index hint from SQL Server.
First, we create a table based on AdventureWorks2017.Sales.SalesOrderDetail. They way we a creating the table ensures we do not create any index at the same time.
USE AdventureWorks2017; GO -- Create a demo table DROP TABLE IF EXISTS dbo.SalesOrderDetail; GO -- Create a heap table based on Sales.SalesOrderDetail SELECT * INTO dbo.SalesOrderDetail FROM Sales.SalesOrderDetail; GO
Now, let us run a simple query. The final result should be sorted by SalesOrderDetailID and SalesOrderID.
-- Sort operator -- Memory grant : 23016 -- Costs : 7,39451 -- Table 'SalesOrderDetail'. Scan count 5, logical reads 1497 -- The Query Processor estimates that implementing the following index could -- improve the query cost by 9.94838%. SELECT * FROM dbo.SalesOrderDetail WHERE SalesOrderID > 43670 AND SalesOrderDetailID > 112 ORDER BY SalesOrderDetailID, SalesOrderID GO
In the execution plan we can see that SQL server recommends to create an index because with the use of the index the overall costs can be reduced by 9.94838%. Currently we scan the entire table which costs 7.39451 and therefore SQL Server even decides to run the query with all possible cores (I did not changed the value 5 for the Cost Threshold for Parallelism). Additionally, an explicit sort operator is used which needs memory grant of approx. 23MB. This query will not scale and it will definitely benefit from a supportive index to avoid the sorting operator, memory grant, parallelism finally and reduce IO reads.
So let us create the index SQL Server recommends.
-- Index recommendation from missing index dmv -- Create a non clustered index including all columns CREATE NONCLUSTERED INDEX [wrong_order] ON dbo.SalesOrderDetail ( SalesOrderID , SalesOrderDetailID ) INCLUDE ( CarrierTrackingNumber , OrderQty , ProductID , SpecialOfferID , UnitPrice , UnitPriceDiscount , LineTotal , rowguid , ModifiedDate ); GO
We can discuss if it is good idea to include all columns in the index leaf pages or simply create a clustered index but let us concentrate on the order of the index columns. SQL Server wants us to build the index pages in the order SalesOrderID and SalesOrderDetailID.
After creating the index and running the query again the execution plan looks like this:
-- There is no improvement by using the suggested index from missing index dmv -- Index colums are in the wrong order -- Sort operator -- Memory grant : 24960 -- Costs : 7,39309 -- Table 'SalesOrderDetail'. Scan count 5, logical reads 1497 SELECT * FROM dbo.SalesOrderDetail WHERE SalesOrderID > 43670 AND SalesOrderDetailID > 112 ORDER BY SalesOrderDetailID, SalesOrderID GO
SQL Server seems to be fine with the index because we cannot see any index hint anymore. But the plan used is actually the same plan as before because the storage engine does not use the new index. Let us force SQL Server to use the index which was suggested:
-- Force SQL Server to use the suggested index -- Sort operator -- Memory grant : 24336 (24960) -- Costs : 7,42752 (7,39309) -- Table 'SalesOrderDetail'. Scan count 4, logical reads 1661 (1497) SELECT * FROM dbo.SalesOrderDetail WITH (INDEX = [wrong_order]) WHERE SalesOrderID > 43670 AND SalesOrderDetailID > 112 ORDER BY SalesOrderDetailID, SalesOrderID OPTION (QUERYTRACEON 9130); GO
It is obvious that the index seek is not as effective as the table scan because the logical IOs are higher which is the reason that SQL Server prefers to read the pages which belongs to the Index Allocation Map page of the heap. Also note that the pages from the index are filtered after reading – so there was no way the push down the predicate to the storage engine.
I spend some time in order to understand why SQL Server suggests to use the order of SalesOrderID, SalesOrderDetailID in the index. The decision is not based on the statistics of the column. The density of SalesOrderID is 0,2771633 while the density of SalesOrderDetailID is 0,998008. A density of 1 is the maximum and indicates a column with unique values. The higher the density the more selective the index will be.
-- Ensure you run the script before creating the index SELECT sta.object_id, sta.name, c.name FROM sys.stats sta INNER JOIN sys.stats_columns col ON col.object_id = sta.object_id AND col.stats_id = sta.stats_id INNER JOIN sys.columns c ON c.object_id = col.object_id AND c.column_id = col.column_id WHERE sta.object_id = OBJECT_ID('SalesOrderDetail', 'U'); GO -- SalesOrderID: 0,2771633 DBCC SHOW_STATISTICS('SalesOrderDetail', '_WA_Sys_00000001_4DB4832C') GO -- SalesOrderDetailID: 0,998008 DBCC SHOW_STATISTICS('SalesOrderDetail', '_WA_Sys_00000002_4DB4832C') GO
From that perspective it makes sense to recommend an index based on SalesOrderDetailID, SalesOrderID but SQL Server does not agree. Also changing the order of the columns used in the predicate will no change the column order of the suggested index. Maybe we will see a hint in the Memo of the Query Optimizer, but I think it is not worth spending more time on this – the recommendation is clearly wrong.
The correct index to create is the following:
-- Drop the recommended index DROP INDEX IF EXISTS [wrong_order] ON SalesOrderDetail; GO CREATE CLUSTERED INDEX idx ON dbo.SalesOrderDetail ( SalesOrderDetailID ASC , SalesOrderID ASC ); GO
Now the query plan looks much better:
-- No sort operator -- No memory grant -- Costs : 1,2529 -- Table 'SalesOrderDetail'. Scan count 5, logical reads 1512 SELECT * FROM dbo.SalesOrderDetail WHERE SalesOrderID > 43670 AND SalesOrderDetailID > 112 ORDER BY SalesOrderDetailID, SalesOrderID; GO
The costs are down to 1.2529, the sort operator and therefore the memory grant is gone and the plan runs on only one core.
Again, do not trust the missing index feature blindly.
- The limitations of the missing index feature – It suggests creating indexes on imprecise and not persisted columns
- The limitations of the missing index feature – It reports only include columns for some queries
- The limitations of the missing index feature – It does not consider trivial plans
- The limitations of the missing index feature – It cannot gather statistics for more than 500 missing index groups