Unreasonable Defaults: Primary Key as Clustering Key
As you might have noticed—at least if you have read SQL Performance Explained—I don’t think clustered indexes are as useful as most people believe. That is mainly because it is just too darn difficult to choose a good clustering key. As a matter of fact, choosing a good—the “right”—clustering key is almost impossible if there are more than one or two indexes on the table. The result is that most people just stick to the default—which is the primary key. Unfortunately, this is almost always the worst possible choice.
In this article I explain the beast named clustered index and all it’s downsides. Although this article uses SQL Server as demo database, the article is equally relevant for MySQL/MariaDB with InnoDB and the Oracle database when using index-organized tables.
Recap: What is a clustered index
The idea of clustered indexes is to store a complete table in a B-tree structure. If a table has a clustered index, it basically means the index is the table. A clustered index has a strict row order like any other B-tree index: it sorts the rows according to the index definition. In case of clustered indexes we call the columns that define the index order the clustering key. The alternative way to store a table is as a heap with no particular row order. Clustered indexes have the advantage that they support very fast range scans. That is, they can fetch rows with the same (or similar) clustering key value quickly, because those rows are physically stored next to each other (clustered)—often in the same data page. When it comes to clustered indexes it is very important to understand that there is no separate place where the table is stored. The clustered index is the primary table store—thus you can have only one per table. That’s the definition of clustered indexes—it’s a contrast to heap tables.
There is, however, another contrast to clustered indexes: non-clustered indexes. Just because of the naming, this is the more natural counterpart to clustered indexes to many people. From this perspective, the main difference is that querying a clustered index is always done as index-only scan. A non-clustered index, on the other hand, has just a sub-set of the table columns so it causes extra IOs for getting the missing columns from the primary table store if needed. Every row in a non-clustered index has a reference to the very same row in the primary table store for this purpose. In other words, using a non-clustered index generally involves resolving an extra level of indirection. Generally, I said. In fact it is pretty easy to avoid this overhead by including all needed columns in the non-clustered index. In that case the database can find all the required data in the index and just doesn’t resolve the extra level of indirection. Even non-clustered indexes can be used for index-only scans—making them as fast as clustered indexes. Isn’t that what matters most?
The index-only scan is the important concept—not the clustered index.
Later in the article, we’ll see that there is a duality among these aspects of clustered indexes: being a table store (in contrast to heap tables) or being an index that happens to have all table columns. Unfortunately, this “table-index duality” can be as confusing as the wave-particle duality of light. Hence, I’ll explicitly state which aspect appears at the time whenever necessary.
The costs of an extra level of indirection
When it comes to performance, an extra level of indirection is not exactly desirable because dereferencing takes time. The crucial point here is that the costs of dereferencing is greatly affected by the way the table is physically stored—either as heap table or as clustered index.
The following figures explain his phenomenon. They visualize the process to execute a query to fetch all
SALES rows from 2012-05-23. The first figure uses a non-clustered index on
SALE_DATE together with a heap table (= a table that doesn’t have a clustered index):
Note that there is a single
Index Seek (Non-Clustered) operation on the non-clustered index that causes two
RID Lookups into the heap table (one for each matched row). This operation reflects dereferencing the extra indirection to load the remaining columns from the primary table store. For heap tables, a non-clustered index uses the physical address (the so-called
RID) to refer to the very same row in the heap table. In the worst case the extra level of indirection causes one additional read access per row (neglecting forwarding).
Now let’s look at the same scenario with a clustered index. More precisely, when using a non-clustered index in presence of a clustered-index on
SALE_ID—that is, the primary key as clustering key.
Note that the definition of the index on the left hand side has not changed: it’s still a non-clustered index on
SALE_DATE. Nevertheless, the pure presence of the clustered index affects they way the non-clustered index refers to the primary table storage—which is the clustered index! Unlike heap tables, clustered indexes are “living creatures” that move rows around as needed to maintain their properties (i.e.: the row order and tree balance). Consequently the non-clustered index can’t use the physical address as reference anymore because it could change at any time. Instead, it uses the clustering key
SALE_ID as reference. Loading the missing columns from the primary table store (=clustered index) now involves a full B-tree traversal for each row. That are several extra IOs per row as opposed to a single extra IO in case of heap tables.
I also refer to this effect as the “clustered index penalty”. This penalty affects all non-clustered indexes on tables that have a clustered index.
For index-organized tables, the Oracle database also stores the physical address of that row (=guess
ROWID) along with the clustering key in the secondary indexes (=non-clustered indexes). If the row is found at this address, the database doesn’t need to perform the B-tree traversal. If not, however, it has performed one more IO for nothing.
How bad is it?
Now that we have seen that clustered indexes cause a considerable overhead for non-clustered indexes, you’ll probably like to know how bad it is? The theoretic answer has been given above—one IO for the
RID Lookup compared to several IOs for the
Key Lookup (Clustered). However, as I learned the hard way when giving performance training, people tend to ignore, deny, or just don’t believe that fact. Hence, I’ll show you a demo.
Obviously, I’ll use two similar tables that just vary by the table storage they use (heap vs. clustered). The following listing shows the pattern to create these tables. The part in square brackets makes the difference to either use a heap table or clustered index as table store.
CREATE TABLE sales[nc] ( sale_id NUMERIC NOT NULL, employee_id NUMERIC NOT NULL, eur_value NUMERIC(16,2) NOT NULL, SALE_DATE DATE NOT NULL CONSTRAINT salesnc_pk PRIMARY KEY [nonclustered] (sale_id), ); CREATE INDEX sales[nc]2 ON sales[nc] (sale_date);
I’ve filled these tables with 10 million rows for this demo.
The next step is to craft a query that allows us to measure the performance impact.
SELECT TOP 1 * FROM salesnc WHERE sale_date > '2012-05-23' ORDER BY sale_date
The whole idea behind this query is to use the non-clustered index (hence filtering and ordering on
SALE_DATE) to fetch a variable number of rows (hence
TOP N) from the primary table store (hence
select * to make sure it’s not executed as index-only scan).
Now lets look what happens if we
SET STATISTICS IO ON and run the query against the heap table:
Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0,…
The interesting figure is the logical reads count of four. There were no physical reads because I have executed the statement twice so it is executed from the cache. Knowing that we have fetched a single row from a heap table, we can already conclude that the tree of the non-clustered index must have three levels. Together with one more logical read to access the heap table we get the total of four logical reads we see above.
To verify this hypothesis, we can change the
TOP clause to fetch two rows:
SELECT TOP 2 * FROM salesnc WHERE sale_date > '2012-05-23' ORDER BY sale_date
Scan count 1, logical reads 5, physical reads 0, read-ahead reads 0,…
Keep in mind that the lookup in the non-clustered index is essentially unaffected by this change—it still needs three logical reads if the second row happens to reside in the same index page—which is very likely. Hence we see just one more logical read because of the second access to the heap table. This corresponds to the first figure above.
Now let’s do the same test with the table that has (=is) a clustered index:
SELECT TOP 1 * FROM sales WHERE sale_date > '2012-05-23' ORDER BY sale_date
Scan count 1, logical reads 8, physical reads 0, read-ahead reads 0,…
Fetching a single row involves eight logical reads—twice as many as before. If we assume that the non-clustered index has the same tree depth, it means that the
KEY Lookup (Clustered) operation triggers five logical reads per row. Let’s check that by fetching one more row:
SELECT TOP 2 * FROM sales WHERE sale_date > '2012-05-23' ORDER BY sale_date
Scan count 1, logical reads 13, physical reads 0, read-ahead reads 0,…
As feared, fetching a second row triggers five more logical reads.
I’ve continued these test in 1/2/5-steps up to 100 rows to get more meaningful data:
|Rows Fetched||Logical Reads (Heap)||Logical Reads (Clustered)|
I’ve also fitted linear equations in the chart to see how the slope differs. The heap table matches the theoretic figures pretty closely (3 + 1 per row) but we see “only” four logical reads per row with the clustered index—not the five we would have expected from just fetching one and two rows.
Who cares about logical reads anyway
Logical reads are a fine benchmark because it yields reproducible results—independent of the current state of caches or other work done on the system. However, the above chart is still a very theoretic figure. Four times as many logical reads does not automatically mean four times as slow. In all reality you’ll have a large part of the clustered index in the cache—at least the first few B-tree levels. That reduces the impact definitively. To see how it affects the impact, I conducted another test: measuring the execution time when running the queries from the cache. Avoiding disk access is another way to get more or less reproducible figures that can be compared to each other.
Again, I’ve used 1/2/5-steps but started at 10.000 rows—selecting fewer rows was too fast for the timer’s resolution. Selecting more than 200.000 rows took extraordinarily long so that I believe the data didn’t fit into the cache anymore. Hence I stopped collecting data there.
|Rows Fetched||Time (Heap)||Time (Clustered)|
From this test it seems that the “clustered index penalty” on the non-clustered index is more like three times as high as compared to using a heap table.
Notwithstanding these results, is it perfectly possible that the real world caching leads to a “clustered index penalty” outside the here observed range in a production system.
What was the upside of clustered indexes again?
The upside of clustered indexes is that they can deliver subsequent rows quickly when accessed directly (not via a non-clustered index). In other words, they are fast if you use the clustering key to fetch several rows. Remember that the primary key is the clustering key per default. In that case, it means fetching several rows via primary key—with a single query. Hmm. Well, you can’t do that with an equals filter. But how often do you use non-equals filters like
< on the primary key? Sometimes, maybe, but not that often that it makes sense to optimize the physical table layout for these queries and punish all other indexes with the “clustered index penalty.” That is really insane (IMHO).
Luckily, SQL Server is quite flexible when it comes to clustered indexes. As opposed to MySQL/InnoDB, SQL Server can use any column(s) as clustering key—even non-unique ones. You can choose the clustering key so it catches the most important range scan. Remember that equals predicates (
=) can also cause range scans (on non-unique columns). But beware: if you are using long columns and/or many columns as clustering key, they bloat all non-clustered indexes because each row in every non-clustered indexes contains the clustering key as reference to the primary table store. Further, SQL Server makes non-unique clustering keys automatically unique by adding an extra column, which also bloats all non-clustered indexes. Although this bloat will probably not affect the tree depth—thanks to the logarithmic growth of the B-tree—it might still hurt the cache-hit rate. That’s also why the “clustered index penalty” could be outside the range indicated by the tests above—in either way!
Even if you are able to identify the clustering key that brings the most benefit for range scans, the overhead it introduces on the non-clustered indexes can void the gains again. As I said in the intro above, it is just too darn difficult to estimate the overall impact of these effects because the clustering key affects all indexes on the table in a pretty complex way. Therefore, I’m very much in favour of using heap tables if possible and index-only scans when necessary. From this perspective, clustered indexes are just an additional space optimization in case the “clustered index penalty“ isn’t an issue—most importantly if you need only one index that is used for range scans.
Some databases don’t support heap tables at all—most importantly MySQL/MariaDB with InnoDB and the Azure database.
Further there are configurations that can only use the primary key as clustering key—most importantly MySQL/MariaDB with InnoDB and Oracle index-organized tables.
Note that MySQL/MariaDB with InnoDB is on both lists. They don’t offer any alternative for what I referred to as “insane.” MyISAM being no alternative.
Concluding: How many clustered indexes can a SQL Server table have?
To conclude this article, I’d like to demonstrate why it is a bad idea to consider the clustered index as the “silver bullet” that deserves all your thoughts about performance. This demo is simple. Just take a second and think about the following question:
How many clustered indexes can a SQL Server table have?
I love to ask this question in my SQL Server performance trainings. It truly reveals the bad understanding of clustered indexes.
The first answer I get is usually is “one!” Then I ask why I can’t have a second one. Typical response: —silence—. After this article, you already know that a clustered index is not only an index, but also the primary table store. You can’t have two primary table stores. Logical, isn’t it? Next question: Do I need to have a clustered index on every table? Typical response: —silence— or “no” in a doubtful tone. I guess this is when the participants realize that I’m asking trick questions. However, you now know that SQL Server doesn’t need to have a clustered index on every table. In absence of a clustered index, SQL Server uses a heap as table store.
As per definition, the answer to the question is: “at most one.”
And now stop thinking about the clustered index as “silver bullet” and start putting the index-only scan into the focus. For that, I’ll rephrase the question slightly:
How many indexes on a SQL Server table can be queried as fast as a clustered index?
The only change to the original question is that it doesn’t focus on the clustered index as such anymore. Instead it puts your focus on the positive effect you expect from a clustering index. The point in using a clustered index is not just to have a clustered index, it is about improving performance. So, let’s focus on that.
What makes the clustered index fast is that every (direct) access is an index-only scan. So the question essentially boils down to: “how many indexes can support an index-only scan?” And the simple answer is: as many as you like. All you need to do is to add all the required columns to a non-clustered index and *BAM* using it is as fast as though it was a clustered index. That’s what SQL Server’s
INCLUDE keyword of the
CREATE INDEX statement is for!
Focusing on index-only scans instead of clustered indexes has several advantages:
You are not limited to one index. Any index can be as fast as a clustered index.
INCLUDEcolumns to a non-clustered index doesn’t affect anything else than this particular index. There is no penalty that hurts all other indexes!
You don’t need to add all table columns to a non-clustered index to enable an index-only scan. Just add the columns that are relevant for the query you’d like to tune. That keeps the index small and can thus become even faster than a clustered index.
And the best part is: there is no mutual exclusion of index-only scans and clustered indexes. Index-only scans work irrespective of the table storage. You can extend non-clustered indexes for index-only scans even if there is a clustered index. That’s also an easy way to avoid paying the “clustered index penalty” on non-clustered indexes.
Because of the “clustered index penalty” the concept of the index-only scan is even more important when having a clustered index. Really, if there is something like a “silver bullet”, it is the index-only scan—not the clustered index.