In the last post you saw that the missing index feature does not consider trivial plans.
In this blog I will explain another limitation of the missing index feature – it does not specify an order for columns to be used in an index.
SQL Server can only benefit from a row store index if the key columns are in the correct order. The order of the columns of an index determines whether SQL Server chooses an index scan or seek. Scanning an index simply means that the storage engine of SQL Server will read all index pages and filter the desired records afterwards (residual predicate). The costs of the query execution is proportional to the number of pages SQL Server must read. An index seek on the other hand will navigate through the B-Tree of the index. SQL Server begins reading the appropriate index root page. From there it is directed directly to the page of the intermediate level and finally stops reading on the leaf page. If we use a range predicate in our SQL query or do not have a unique index, SQL Server will read forward or backward the leaf pages with the support of the doubly linked index chain. In this way, SQL Server reads only the pages that match the searchable argument, and therefore drastically reduces IO reads.
That is why SQL queries (SELECT, UPDATE and DELETE) normally will benefit from a supportive index.
With this understanding, consider the index consists of two columns, col1 and col2. If we are looking for a col2 in the index, SQL Server cannot use the index to seek for col2 because col1 is the leading column and therefore the index is sorted by col1 and not col2. SQL Server will scan the entire index or the table instead.
Another example where the order of the index columns determines if SQL Server can use an index is when there is a need to sort the data either by an explicit sort operator or because SQL Server decided to use an operator like Stream Aggregate or a Merge Join in the query plan. These operators requires sorted data. If the index columns are not in the correct order, it is not possible to use the index to sort, aggregate or join. Therefore, we need to consider the order of index columns when creating an index.Let us go into detail now by running a query on a heap without any non clustered index to get the missing index hint from SQL Server.

First, we create a table based on AdventureWorks2017.Sales.SalesOrderDetail. They way we a creating the table ensures we do not create any index at the same time.

USE AdventureWorks2017;
GO

-- Create a demo table
DROP TABLE IF EXISTS dbo.SalesOrderDetail;
GO

-- Create a heap table based on Sales.SalesOrderDetail
SELECT * INTO dbo.SalesOrderDetail FROM Sales.SalesOrderDetail;
GO

Now, let us run a simple query. The final result should be sorted by SalesOrderDetailID and SalesOrderID.

-- Sort operator
-- Memory grant : 23016
-- Costs : 7,39451
-- Table 'SalesOrderDetail'. Scan count 5, logical reads 1497
-- The Query Processor estimates that implementing the following index could
-- improve the query cost by 9.94838%.
SELECT *
FROM dbo.SalesOrderDetail
WHERE
	SalesOrderID > 43670
AND SalesOrderDetailID > 112
ORDER BY
	SalesOrderDetailID, SalesOrderID
GO

In the execution plan we can see that SQL server recommends to create an index because with the use of the index the overall costs can be reduced by 9.94838%. Currently we scan the entire table which costs 7.39451 and therefore SQL Server even decides to run the query with all possible cores (I did not changed the value 5 for the Cost Threshold for Parallelism). Additionally, an explicit sort operator is used which needs memory grant of approx. 23MB. This query will not scale and it will definitely benefit from a supportive index to avoid the sorting operator, memory grant, parallelism finally and reduce IO reads.

So let us create the index SQL Server recommends.

-- Index recommendation from missing index dmv
-- Create a non clustered index including all columns
CREATE NONCLUSTERED INDEX [wrong_order]
	ON dbo.SalesOrderDetail (
								SalesOrderID
							  , SalesOrderDetailID
							)
	INCLUDE (
				CarrierTrackingNumber
			  , OrderQty
			  , ProductID
			  , SpecialOfferID
			  , UnitPrice
			  , UnitPriceDiscount
			  , LineTotal
			  , rowguid
			  , ModifiedDate
			);
GO

We can discuss if it is good idea to include all columns in the index leaf pages or simply create a clustered index but let us concentrate on the order of the index columns. SQL Server wants us to build the index pages in the order SalesOrderID and SalesOrderDetailID.

After creating the index and running the query again the execution plan looks like this:

-- There is no improvement by using the suggested index from missing index dmv
-- Index colums are in the wrong order
-- Sort operator
-- Memory grant : 24960
-- Costs : 7,39309 
-- Table 'SalesOrderDetail'. Scan count 5, logical reads 1497
SELECT *
FROM
	dbo.SalesOrderDetail
WHERE
	SalesOrderID > 43670
	AND SalesOrderDetailID > 112
ORDER BY
	SalesOrderDetailID, SalesOrderID
GO

SQL Server seems to be fine with the index because we cannot see any index hint anymore. But the plan used is actually the same plan as before because the storage engine does not use the new index. Let us force SQL Server to use the index which was suggested:

-- Force SQL Server to use the suggested index
-- Sort operator
-- Memory grant : 24336 (24960)
-- Costs : 7,42752 (7,39309)
-- Table 'SalesOrderDetail'. Scan count 4, logical reads 1661 (1497)
SELECT *
FROM
	dbo.SalesOrderDetail WITH (INDEX = [wrong_order])
WHERE
	SalesOrderID > 43670
	AND SalesOrderDetailID > 112
ORDER BY
	SalesOrderDetailID, SalesOrderID
OPTION (QUERYTRACEON 9130);
GO

It is obvious that the index seek is not as effective as the table scan because the logical IOs are higher which is the reason that SQL Server prefers to read the pages which belongs to the Index Allocation Map page of the heap. Also note that the pages from the index are filtered after reading – so there was no way the push down the predicate to the storage engine.

I spend some time in order to understand why SQL Server suggests to use the order of SalesOrderID, SalesOrderDetailID in the index. The decision is not based on the statistics of the column. The density of SalesOrderID is 0,2771633 while the density of SalesOrderDetailID is 0,998008. A density of 1 is the maximum and indicates a column with unique values. The higher the density the more selective the index will be.

-- Ensure you run the script before creating the index
SELECT
	sta.object_id, sta.name, c.name
FROM
	sys.stats sta
INNER JOIN
	sys.stats_columns col
ON
	col.object_id = sta.object_id
	AND col.stats_id = sta.stats_id
INNER JOIN
	sys.columns c
ON
	c.object_id = col.object_id
	AND c.column_id = col.column_id
WHERE
	sta.object_id = OBJECT_ID('SalesOrderDetail', 'U');
GO

-- SalesOrderID: 0,2771633
DBCC SHOW_STATISTICS('SalesOrderDetail', '_WA_Sys_00000001_4DB4832C')
GO

-- SalesOrderDetailID: 0,998008
DBCC SHOW_STATISTICS('SalesOrderDetail', '_WA_Sys_00000002_4DB4832C') 
GO

From that perspective it makes sense to recommend an index based on SalesOrderDetailID, SalesOrderID but SQL Server does not agree. Also changing the order of the columns used in the predicate will no change the column order of the suggested index. Maybe we will see a hint in the Memo of the Query Optimizer, but I think it is not worth spending more time on this – the recommendation is clearly wrong.

The correct index to create is the following:

-- Drop the recommended index
DROP INDEX IF EXISTS [wrong_order]
	ON SalesOrderDetail;
GO

CREATE CLUSTERED INDEX idx
ON dbo.SalesOrderDetail (
							SalesOrderDetailID ASC
						  , SalesOrderID ASC
						);
GO

Now the query plan looks much better:

-- No sort operator
-- No memory grant
-- Costs : 1,2529
-- Table 'SalesOrderDetail'. Scan count 5, logical reads 1512
SELECT *
FROM
	dbo.SalesOrderDetail
WHERE
	SalesOrderID > 43670
	AND SalesOrderDetailID > 112
ORDER BY
	SalesOrderDetailID, SalesOrderID;
GO

The costs are down to 1.2529, the sort operator and therefore the memory grant is gone and the plan runs on only one core.
Again, do not trust the missing index feature blindly.