Blog


Thank You MySQL, We’ll Miss You!

Dear MySQL,

Thank you for introducing me to SQL. It must have been 1998 when we first met I and fell in love with the simplicity of SQL immediately. Before that I’ve been using C structs all the time; I had to do my joins programmatically and also create and maintain my indexes manually. It was even hard to combine several search conditions via and and or. But then there was the shiny new world of SQL you were showing me…

Everything was easily. Just write a where clause, no matter how complex, you found the right rows. Joins were equally easy to write and you took all the effort to combine the data from several tables as I needed them. I also remember how easy it became to manage the schema. Instead of writing a program to copy my data from one C struct to another, I just say alter table now—in the meanwhile it even works online in many cases! I didn’t take long until I used SQL for stuff I wouldn’t have thought a database could do for me. So I was quickly embracing group by and co.

But I haven’t spent a lot of time with you lately. It’s not because I was too busy. I’m still practicing what you have shown me! And I’ve moved on. Now I’m using common table expressions to organize complex queries and I use window functions to calculate running totals or just do a ranking. I’m also using joins more efficiently because I know about hash and sort/merge joins. A while ago I was wondering why you didn’t tell me about these things. But then I realized that you don’t know them.

I know it was not always nice what I said about you recently. But now that Oracle announced your retirement to focus on Oracle NoSQL, I realized how right this comment on reddit was. Sure you are neither the most advanced nor the most widely deployed open source SQL database, but you introduced millions of people to SQL. There should be no tears when some of them move away because they want to see what’s next. Isn’t that the greatest compliment a teacher can get? You can be proud of what you have accomplished.

Thank you!

Markus Winand
Markus Winand
April 1, 2014

ps.: Even when fooling around, I can’t resist to inform my fellow readers. Besides the claim that Oracle retires MySQL, everything is true. Initially I thought Oracle NoSQL is an Aprils fool’s joke but it isn’t.

Results of the SQL Performance Quiz: 60% fail

In 2011, I’ve launched the “The 3-Minute Test: What do you know about SQL performance.” It consists of five questions that follow a simple pattern: each question shows a query/index pair and asks if it demonstrates proper indexing or not. Till today, this test has become one of the most popular features on Use The Index, Luke and has been completed more than 28,000 times.

Note

Just in case you got curious, please be aware that this article is a spoiler. You might want to do the quiz yourself before continuing.

Although the quiz was created for educational purposes, I was wondering if I could get some interesting figures out of these 28,000 results. And I think I could. However, there are several things to keep in mind when looking at these figures. First, the quiz uses the surprise factor to catch attention. That means, three questions show cases that look fine, but aren’t. One question does it the other way around and shows an example that might look dangerous, but isn’t. There is only one question where the correct answer is in line with the first impression. Another effect that might affect the significance of the results is that there was no representative selection of participants. Everybody can take the quiz. You can even do it multiple times and will probably get a better result the second time. Just keep in mind that the quiz was never intended to be used for scientific research upon the indexing knowledge in the field. Nevertheless, I think that the size of the dataset is still good enough to get an impression.

Below I’ll show two different statistics for each question. First, the average rate at which this question was correctly answered. Second, how this figure varies for users of MySQL, Oracle, PostgreSQL and SQL Server databases. In other word, it says if e.g. MySQL users are more knowledgeable about indexing as PostgreSQL users. Spoiler: It’s the other way around. The only reason I’m in the lucky position to have this data is that the test sometimes uses vendor specific syntax. For example, what is LIMIT in MySQL and PostgreSQL is TOP in SQL Server. Therefore, the participants have to select a database at the beginning so that the questions are shown in the native syntax of that product.

Question 1: Functions in the WHERE Clause

Is the following SQL good or bad practice from a performance perspective?

Searching for all rows of the year 2012:

CREATE INDEX tbl_idx ON tbl (date_column);

SELECT text, date_column
  FROM tbl
 WHERE TO_CHAR(date_column, 'YYYY') = '2012';

This is an example where the code uses functions specific to the Oracle and PostgreSQL databases. For MySQL, the question uses YEAR(date_column) and for SQL Server datepart(yyyy, date_column). Of course, I could have used EXTRACT(YEAR date_column), but I thought it is better to use the most common syntax.

The participants have these options:

  • Good practice — There is no major improvement possible.

  • Bad practice — There is a major improvement possible.

The correct answer is “bad practice” because the index on date_column cannot be used when searching on something derived from date_column. If you don’t believe that, please have a look at the proof scripts or the explanations shown at the end of the test. They also contain links to the Use The Index, Luke pages explaining it in more detail.

However, if you didn’t know how functions effectively “disable” indexes, you are not alone. Only about two-thirds gave the correct answer. And as for every multiple choice test, there is a certain probably to pick the correct answer by chance. In this cases it’s a 50/50 chance — by no means negligible. I’ve marked this “guessing score” in the figure to emphasize that.

This is one of the most common problems I see in my everyday work. The same problem can also hit you with VARCHAR fields when using UPPER, TRIM or the like. Please keep in mind: whenever you are applying functions on columns in the where clause, an index on the column itself is no longer useful for this query.

Although this result is quite disappointing — I mean it’s just 17% better than guessing — it is no surprise to me. What is a surprise for me is how the result differs amongst the users of different databases.

As a matter of fact, MySQL users just score 55% — almost as low as the “guessing score”. PostgreSQL users, on the other hand, get a score of 83%.

An effect that might explain this result is that MySQL doesn’t support function-based indexes while Oracle and PostgreSQL do. Function-based indexes allow you to index expressions like TO_CHAR(date_column, 'YYYY'). Although it is not the recommended solution for this case, the pure existence of this feature might make users of the Oracle and PostgreSQL database more aware of this problem. SQL Server offers a similar feature: although it cannot index expressions directly, you can create a so-called computed column on expressions, which in turn can be indexed.

Although support for function-based indexes might explain why MySQL users underperformed, it is still no excuse. The shown query/index pair is bad — no matter whether the database supports function-based indexes or not. And the major improvement is also possible without function-based indexes:

SELECT text, date_column
  FROM tbl
 WHERE date_column >= TO_DATE('2012-01-01', 'YYYY-MM-DD')
   AND date_column <  TO_DATE('2013-01-01', 'YYYY-MM-DD');

The index doesn’t need to be changed. This solution is very flexible because it supports queries for different ranges too — e.g. by week or month. This is the recommended solution.

As a curious guy, I’d love to know how many of the people who correctly answered this question were thinking of the sub-optimal solution to use a function-based index. I’d rate this solution half-correct at best.

Question 2: Indexed Top-N Queries

Is the following SQL good or bad practice from a performance perspective?

To find the most recent row:

CREATE INDEX tbl_idx ON tbl (a, date_column);

SELECT id, a, date_column
  FROM tbl
 WHERE a = ? 
 ORDER BY date_column DESC
 LIMIT 1;

Note that the question mark is a placeholder, because I always encourage developers to use bind parameters.

As participant, you have these two options again:

  • Good practice — There is no major improvement possible.

  • Bad practice — There is a major improvement possible.

This is the question that is supposed to look dangerous, but isn’t. Generally, it seems like people believe order by must always sort the data. This index, however, eliminates the need to sort the data entirely so that the query is basically as fast as a unique index lookup. Please find a detailed explanation of this trick here.

The result is very close to the “guessing score” which I interpret as “people don’t have a clue about it.”

This result is particularly sad because I’ve seen people building caching tables, regularly refilled by a cron jobs, to avoid queries like this. Interestingly, the cron job tends to cause performance problems because it is running in rather short intervals to make sure the cache table has fresh data. However, the right index is often the better option in the first place.

Here I have to mention that the Oracle database needs the most special syntax for this trick. Up till version 12c released in 2013, the Oracle database did not offer a handy shortcut such as LIMIT or TOP. Instead, you have to use the ROWNUM pseudo-column like this:

SELECT *
  FROM (
        SELECT id, date_column
          FROM tbl
         WHERE a = :a
         ORDER BY date_column DESC
       )
 WHERE rownum <= 1;

The extra complexity of this query might have pushed Oracle users more heavily towards the wrong answer — actually below guessing score!

Another argument I’m getting in response to this question is that including the ID column in the index would allow an index-only scan. Although this is correct, I don’t consider not doing so a “bad practice” because the query touches only one row anyway. An index-only scan could just avoid a single table access. Obviously there are cases were you need that improvement, but in the general case I’d consider it a premature optimization. But that’s just my opinion. However, following this argument might give us an idea why PostgreSQL users got the best score (again): PostgreSQL did not have index-only scans until version 9.2, which was released in September 2012. As a result, PostgreSQL users could not fall into this trap of thinking an index-only scan can bring major improvements in this case. Undoubtedly, the term “major” is troublesome in this context.

Question 3: Index Column Order

Is the following SQL good or bad practice from a performance perspective?

Two queries, searching by a common column:

CREATE INDEX tbl_idx ON tbl (a, b);

SELECT id, a, b
  FROM tbl
 WHERE a = ?
   AND b = ?;


SELECT id, a, b
  FROM tbl
 WHERE b = ?;

The same options:

  • Good practice — There is no major improvement possible.

  • Bad practice — There is a major improvement possible.

The answer is “bad practice,” because the second query cannot use the index properly. Changing the index column order to (b, a) would, however, allow both queries to use this index in the most efficient way. You can find a full explanation here. Adding a second index on (b) would be a poor solution due to the overhead it adds for no reason. Unfortunately, I don’t know how many would have done that.

The result is disappointing, but in line with my expectations — just 12.5% above guessing score.

This is also a problem I see almost every day. People just don’t understand how multi-column indexes work.

All the per-database results are pretty close together. Maybe because there is no syntactic difference or well-known database features that could have a major influence the answer. Less known features like Oracle’s SKIP SCAN could have a minor impact, of course. Generally, the index-only scan could have a influence too, but it pushes the participants to the “right” answer this time.

After all, this result might just say that users of some databases know more about indexing than others. Interestingly, PostgreSQL users get the best score for the third time.

Question 4: LIKE Searches

Is the following SQL troublesome or bulletproof from a performance perspective?

Searching within a string:

CREATE INDEX tbl_idx ON tbl (text);

SELECT id, text
  FROM tbl
 WHERE text LIKE '%TERM%';

I’ve phrased the options differently this time:

  • Bulletproof: It will always run fast.

  • Troublesome: There is high risk for performance problems.

And the correct answer is “troublesome” because the LIKE pattern uses a leading wild card. Otherwise, if it would use the pattern 'TERM%', it could use the index very efficiently. Have a look at this visual explanation for details.

The results of this question are promising. Here I feel safe to say most people know that LIKE is not for full-text search.

The segregated results are also within a very narrow corridor:

It is, however, strange that PostgreSQL users under performed at this question. A closer look at how the question is presented to PostgreSQL users might give an explanation:

CREATE INDEX tbl_idx ON tbl (text varchar_pattern_ops);

SELECT id, text
  FROM tbl
 WHERE text LIKE '%TERM%';

Note the addition to the index (varchar_pattern_ops). In PostgreSQL, this special operator class is required to make the index usable for postfix wild card searches (e.g. 'TERM%'). I added this because I aimed to find out if people know about the problem of leading wild cards in LIKE expressions. Without the operator class, there are two reasons why it doesn’t work: (1) the leading wild card; (2) the missing operator class. I though that would be too obvious. Retrospectivley, I believe some participants interpreted this operator class as “magic that makes it work” and thus took the wrong answer.

Question 5a: Index-Only Scan

Question five is a little bit tricky, because PostgreSQL did not support index-only scans when the quiz was created. For that reason, there are two variants of question five: one about index-only scans for users of MySQL, Oracle and SQL Server databases. And another question about index column order for PostgresSQL users. Both results are presented here, but the segregated data is limited for obvious reasons. We start with the question about index-only scans:

How will the change affect query performance?

Current situation, selecting about hundred rows out of a million:

CREATE INDEX tab_idx ON tbl (a, date_column);

SELECT date_column, count(*)
  FROM tbl
 WHERE a = 123
 GROUP BY date_column;

Changed query, selecting about ten rows out of a million:

SELECT date_column, count(*)
  FROM tbl
 WHERE a = 123
   AND b = 42
 GROUP BY date_column;

Note the added where clause in the second query.

This question is also special because it offers four options:

  • Query performance will roughly stay the same (+/- 10%)

  • Depends on the data.

  • The query will be much slower (impact >10%)

  • The query will be much faster (impact >10%)

When I created the quiz, I was well aware that 50/50 questions have a tendency to render the score meaningless. This question is a trade-off between keeping the questions easy to gasp and answer (questions 1–4) and giving a more accurate result.

To make it short, the correct answer is “the query will be much slower.” The reason is that the original query could use an index-only scan — that is, the query could be answered only using data from the index without fetching any data from the actual table. The second query, however, needs to check column B too, which is not in the index. Consequently, the database must take some extra effort to fetch the candidate rows from the table to evaluate the expression on B That is, it must fetch at least 100 rows from the table — the number of rows returned by the first query. Due to the group by there are probably more rows to fetch. A quite considerable extra effort that will make the query much slower. A more exhaustive explanation is here.

With that number of options, the overall score drops significantly to about 39% or 14% above guessing score.

Still I think saying that about 39% of participants knew the right answers is wrong. They gave the right answer, but there was still a probability of 25% that they gave the right answer without knowing it.

The segregation by database is quite boring. Conspicuously boring.

However, with four options, it is also interesting to see how people actually answered.

That caught me by surprise. Both options “roughly the same” and “depends on the data” got about 25% — the guessing probably. Does this mean half of the participants were guessing? As it is the last question some participants might have picked a random option just to get through. Quite possible. However, the correct option “much slower” got 38.8% at the cost of the “much faster” option, which got just 10.9%.

My intention with this question was to trap people into the “much faster” option because fetching less data should be faster — except when breaking an index-only scan. The only hypothesis I have for this result is that people might have got the idea that the obvious answer isn’t the correct one. That, however, would mean the 39% score doesn’t prove anything about the knowledge of this phenomenon in the field.

Another effect that I expected to have an impact is that “it is always depending on the data.” Of course there are edge cases where the performance impact might roughly stay the same — e.g., when all inspected rows are in the same table block. However, this is rather unlikely — just because there would be no point in adding the date_column for an index-only scan in the first place.

Question 5b: Index Column Order and Range Operators

This question is only shown to PostgreSQL users.

Is the following SQL good or bad practice from a performance perspective?

Searching for entries in state X, not older than 5 years.

CREATE INDEX tbl_idx ON tbl (date_column, state);

SELECT id, date_column, state
  FROM tbl
 WHERE date_column >= CURRENT_DATE - INTERVAL '5' YEAR
   AND state = 'X';

(365 rows)

The data distribution is as follows:

SELECT count(*)
  FROM tbl
 WHERE date_column >= CURRENT_DATE - INTERVAL '5' YEAR;

 count 
-------
  1826

SELECT count(*)
  FROM tbl
 WHERE state = 'X';

 count 
-------
 10000

There is an index with two columns and a query that filters on both of them. One filter uses an equals operator, the other a greater than or equal operator. When using each filter individually the query returns many more rows as when combining both filters.

The options are:

  • Good practice — There is no major improvement possible.

  • Bad practice — There is a major improvement possible.

And the correct answer is “bad practice” because the column order in the index is the wrong way around. The general rule is that index columns can be used efficiently from the left hand side as long as they are used with equals operators. Further, one column can be used efficiently with a range operator. However, the first range operator effectively cuts off the index so that further columns cannot be used efficiently anymore. With efficiently I mean as an index access predicate. Please find a visualization here.

With the original index shown above, the query has to fetch 1826 entries from the index (those matching the date_column filter) and check each of them for the value of the state column. If we turn the index column order around, the database can use both filters efficiently (= as access predicate) and directly limit the index access to those 365 rows of interest.

And this is how people answered:

Wait a moment, that’s below the guessing score! It’s not just people don’t know, they believe the wrong thing. However, I must admit that the term “major” is very problematic again. When I run this example, the speed-up I get is just 70%. Not even twice as fast.

Overall Score: How Many Passed the Test?

Looking at each question individually is interesting, but doesn’t tell us how many participants managed to answer all five questions correctly, for example. The following chart has that information.

Finally, I’d like to boil this chart down to a single figure: how many participants “passed” the test?

Considering that the test has only five questions, out of which four are 50/50 questions, I think it is fair to say three correct answers isn’t enough to pass the test. Requiring five correct answers would quite obviously be asking for too much. Requiring four correct answers to “pass” the test is therefore the only sensible choice I see. Using this definition, only 38.2% passed the test. The chance to pass the test by guessing is still 12.5%.

If you like this article and want to learn about proper indexing, SQL Performance Explained is for you.

Unreasonable Defaults: Primary Key as Clustering Key

As you might have noticed—at least if you have read SQL Performance Explained—I don’t think clustered indexes are as useful as most people believe. That is mainly because it is just too darn difficult to choose a good clustering key. As a matter of fact, choosing a good—the “right”—clustering key is almost impossible if there are more than one or two indexes on the table. The result is that most people just stick to the default—which is the primary key. Unfortunately, this is almost always the worst possible choice.

In this article I explain the beast named clustered index and all it’s downsides. Although this article uses SQL Server as demo database, the article is equally relevant for MySQL/MariaDB with InnoDB and the Oracle database when using index-organized tables.

Recap: What is a clustered index

The idea of clustered indexes is to store a complete table in a B-tree structure. If a table has a clustered index, it basically means the index is the table. A clustered index has a strict row order like any other B-tree index: it sorts the rows according to the index definition. In case of clustered indexes we call the columns that define the index order the clustering key. The alternative way to store a table is as a heap with no particular row order. Clustered indexes have the advantage that they support very fast range scans. That is, they can fetch rows with the same (or similar) clustering key value quickly, because those rows are physically stored next to each other (clustered)—often in the same data page. When it comes to clustered indexes it is very important to understand that there is no separate place where the table is stored. The clustered index is the primary table store—thus you can have only one per table. That’s the definition of clustered indexes—it’s a contrast to heap tables.

There is, however, another contrast to clustered indexes: non-clustered indexes. Just because of the naming, this is the more natural counterpart to clustered indexes to many people. From this perspective, the main difference is that querying a clustered index is always done as index-only scan. A non-clustered index, on the other hand, has just a sub-set of the table columns so it causes extra IOs for getting the missing columns from the primary table store if needed. Every row in a non-clustered index has a reference to the very same row in the primary table store for this purpose. In other words, using a non-clustered index generally involves resolving an extra level of indirection. Generally, I said. In fact it is pretty easy to avoid this overhead by including all needed columns in the non-clustered index. In that case the database can find all the required data in the index and just doesn’t resolve the extra level of indirection. Even non-clustered indexes can be used for index-only scans—making them as fast as clustered indexes. Isn’t that what matters most?

Note

The index-only scan is the important concept—not the clustered index.

Later in the article, we’ll see that there is a duality among these aspects of clustered indexes: being a table store (in contrast to heap tables) or being an index that happens to have all table columns. Unfortunately, this “table-index duality” can be as confusing as the wave-particle duality of light. Hence, I’ll explicitly state which aspect appears at the time whenever necessary.

The costs of an extra level of indirection

When it comes to performance, an extra level of indirection is not exactly desirable because dereferencing takes time. The crucial point here is that the costs of dereferencing is greatly affected by the way the table is physically stored—either as heap table or as clustered index.

The following figures explain his phenomenon. They visualize the process to execute a query to fetch all SALES rows from 2012-05-23. The first figure uses a non-clustered index on SALE_DATE together with a heap table (= a table that doesn’t have a clustered index):

Note that there is a single Index Seek (Non-Clustered) operation on the non-clustered index that causes two RID Lookups into the heap table (one for each matched row). This operation reflects dereferencing the extra indirection to load the remaining columns from the primary table store. For heap tables, a non-clustered index uses the physical address (the so-called RID) to refer to the very same row in the heap table. In the worst case the extra level of indirection causes one additional read access per row (neglecting forwarding).

Now let’s look at the same scenario with a clustered index. More precisely, when using a non-clustered index in presence of a clustered-index on SALE_ID—that is, the primary key as clustering key.

Note that the definition of the index on the left hand side has not changed: it’s still a non-clustered index on SALE_DATE. Nevertheless, the pure presence of the clustered index affects they way the non-clustered index refers to the primary table storage—which is the clustered index! Unlike heap tables, clustered indexes are “living creatures” that move rows around as needed to maintain their properties (i.e.: the row order and tree balance). Consequently the non-clustered index can’t use the physical address as reference anymore because it could change at any time. Instead, it uses the clustering key SALE_ID as reference. Loading the missing columns from the primary table store (=clustered index) now involves a full B-tree traversal for each row. That are several extra IOs per row as opposed to a single extra IO in case of heap tables.

I also refer to this effect as the “clustered index penalty”. This penalty affects all non-clustered indexes on tables that have a clustered index.

Note

For index-organized tables, the Oracle database also stores the physical address of that row (=guess ROWID) along with the clustering key in the secondary indexes (=non-clustered indexes). If the row is found at this address, the database doesn’t need to perform the B-tree traversal. If not, however, it has performed one more IO for nothing.

How bad is it?

Now that we have seen that clustered indexes cause a considerable overhead for non-clustered indexes, you’ll probably like to know how bad it is? The theoretic answer has been given above—one IO for the RID Lookup compared to several IOs for the Key Lookup (Clustered). However, as I learned the hard way when giving performance training, people tend to ignore, deny, or just don’t believe that fact. Hence, I’ll show you a demo.

Obviously, I’ll use two similar tables that just vary by the table storage they use (heap vs. clustered). The following listing shows the pattern to create these tables. The part in square brackets makes the difference to either use a heap table or clustered index as table store.

CREATE TABLE sales[nc] (
    sale_id      NUMERIC       NOT NULL,
    employee_id  NUMERIC       NOT NULL,
    eur_value    NUMERIC(16,2) NOT NULL,
    SALE_DATE    DATE          NOT NULL
    CONSTRAINT salesnc_pk
       PRIMARY KEY [nonclustered]  (sale_id),
);

CREATE INDEX sales[nc]2 ON sales[nc] (sale_date);

I’ve filled these tables with 10 million rows for this demo.

The next step is to craft a query that allows us to measure the performance impact.

SELECT TOP 1 * 
  FROM salesnc
 WHERE sale_date > '2012-05-23'
 ORDER BY sale_date

The whole idea behind this query is to use the non-clustered index (hence filtering and ordering on SALE_DATE) to fetch a variable number of rows (hence TOP N) from the primary table store (hence select * to make sure it’s not executed as index-only scan).

Now lets look what happens if we SET STATISTICS IO ON and run the query against the heap table:

Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0,…

The interesting figure is the logical reads count of four. There were no physical reads because I have executed the statement twice so it is executed from the cache. Knowing that we have fetched a single row from a heap table, we can already conclude that the tree of the non-clustered index must have three levels. Together with one more logical read to access the heap table we get the total of four logical reads we see above.

To verify this hypothesis, we can change the TOP clause to fetch two rows:

SELECT TOP 2 * 
  FROM salesnc
 WHERE sale_date > '2012-05-23'
 ORDER BY sale_date
Scan count 1, logical reads 5, physical reads 0, read-ahead reads 0,…

Keep in mind that the lookup in the non-clustered index is essentially unaffected by this change—it still needs three logical reads if the second row happens to reside in the same index page—which is very likely. Hence we see just one more logical read because of the second access to the heap table. This corresponds to the first figure above.

Now let’s do the same test with the table that has (=is) a clustered index:

SELECT TOP 1 * 
  FROM sales
 WHERE sale_date > '2012-05-23'
 ORDER BY sale_date
Scan count 1, logical reads 8, physical reads 0, read-ahead reads 0,…

Fetching a single row involves eight logical reads—twice as many as before. If we assume that the non-clustered index has the same tree depth, it means that the KEY Lookup (Clustered) operation triggers five logical reads per row. Let’s check that by fetching one more row:

SELECT TOP 2 * 
  FROM sales
 WHERE sale_date > '2012-05-23'
 ORDER BY sale_date
Scan count 1, logical reads 13, physical reads 0, read-ahead reads 0,…

As feared, fetching a second row triggers five more logical reads.

I’ve continued these test in 1/2/5-steps up to 100 rows to get more meaningful data:

Rows FetchedLogical Reads (Heap)Logical Reads (Clustered)
148
2513
5827
101348
202391
5053215
100104418

I’ve also fitted linear equations in the chart to see how the slope differs. The heap table matches the theoretic figures pretty closely (3 + 1 per row) but we see “only” four logical reads per row with the clustered index—not the five we would have expected from just fetching one and two rows.

Who cares about logical reads anyway

Logical reads are a fine benchmark because it yields reproducible results—independent of the current state of caches or other work done on the system. However, the above chart is still a very theoretic figure. Four times as many logical reads does not automatically mean four times as slow. In all reality you’ll have a large part of the clustered index in the cache—at least the first few B-tree levels. That reduces the impact definitively. To see how it affects the impact, I conducted another test: measuring the execution time when running the queries from the cache. Avoiding disk access is another way to get more or less reproducible figures that can be compared to each other.

Again, I’ve used 1/2/5-steps but started at 10.000 rows—selecting fewer rows was too fast for the timer’s resolution. Selecting more than 200.000 rows took extraordinarily long so that I believe the data didn’t fit into the cache anymore. Hence I stopped collecting data there.

Rows FetchedTime (Heap)Time (Clustered)
10.0003178
20.00047130
50.000109297
100.000203624
200.0003901232

From this test it seems that the “clustered index penalty” on the non-clustered index is more like three times as high as compared to using a heap table.

Notwithstanding these results, is it perfectly possible that the real world caching leads to a “clustered index penalty” outside the here observed range in a production system.

What was the upside of clustered indexes again?

The upside of clustered indexes is that they can deliver subsequent rows quickly when accessed directly (not via a non-clustered index). In other words, they are fast if you use the clustering key to fetch several rows. Remember that the primary key is the clustering key per default. In that case, it means fetching several rows via primary key—with a single query. Hmm. Well, you can’t do that with an equals filter. But how often do you use non-equals filters like > or < on the primary key? Sometimes, maybe, but not that often that it makes sense to optimize the physical table layout for these queries and punish all other indexes with the “clustered index penalty.” That is really insane (IMHO).

Luckily, SQL Server is quite flexible when it comes to clustered indexes. As opposed to MySQL/InnoDB, SQL Server can use any column(s) as clustering key—even non-unique ones. You can choose the clustering key so it catches the most important range scan. Remember that equals predicates (=) can also cause range scans (on non-unique columns). But beware: if you are using long columns and/or many columns as clustering key, they bloat all non-clustered indexes because each row in every non-clustered indexes contains the clustering key as reference to the primary table store. Further, SQL Server makes non-unique clustering keys automatically unique by adding an extra column, which also bloats all non-clustered indexes. Although this bloat will probably not affect the tree depth—thanks to the logarithmic growth of the B-tree—it might still hurt the cache-hit rate. That’s also why the “clustered index penalty” could be outside the range indicated by the tests above—in either way!

Even if you are able to identify the clustering key that brings the most benefit for range scans, the overhead it introduces on the non-clustered indexes can void the gains again. As I said in the intro above, it is just too darn difficult to estimate the overall impact of these effects because the clustering key affects all indexes on the table in a pretty complex way. Therefore, I’m very much in favour of using heap tables if possible and index-only scans when necessary. From this perspective, clustered indexes are just an additional space optimization in case the “clustered index penalty“ isn’t an issue—most importantly if you need only one index that is used for range scans.

Note

Some databases don’t support heap tables at all—most importantly MySQL/MariaDB with InnoDB and the Azure database.

Further there are configurations that can only use the primary key as clustering key—most importantly MySQL/MariaDB with InnoDB and Oracle index-organized tables.

Note that MySQL/MariaDB with InnoDB is on both lists. They don’t offer any alternative for what I referred to as “insane.” MyISAM being no alternative.

Concluding: How many clustered indexes can a SQL Server table have?

To conclude this article, I’d like to demonstrate why it is a bad idea to consider the clustered index as the “silver bullet” that deserves all your thoughts about performance. This demo is simple. Just take a second and think about the following question:

How many clustered indexes can a SQL Server table have?

I love to ask this question in my SQL Server performance trainings. It truly reveals the bad understanding of clustered indexes.

The first answer I get is usually is “one!” Then I ask why I can’t have a second one. Typical response: —silence—. After this article, you already know that a clustered index is not only an index, but also the primary table store. You can’t have two primary table stores. Logical, isn’t it? Next question: Do I need to have a clustered index on every table? Typical response: —silence— or “no” in a doubtful tone. I guess this is when the participants realize that I’m asking trick questions. However, you now know that SQL Server doesn’t need to have a clustered index on every table. In absence of a clustered index, SQL Server uses a heap as table store.

As per definition, the answer to the question is: “at most one.”

And now stop thinking about the clustered index as “silver bullet” and start putting the index-only scan into the focus. For that, I’ll rephrase the question slightly:

How many indexes on a SQL Server table can be queried as fast as a clustered index?

The only change to the original question is that it doesn’t focus on the clustered index as such anymore. Instead it puts your focus on the positive effect you expect from a clustering index. The point in using a clustered index is not just to have a clustered index, it is about improving performance. So, let’s focus on that.

What makes the clustered index fast is that every (direct) access is an index-only scan. So the question essentially boils down to: “how many indexes can support an index-only scan?” And the simple answer is: as many as you like. All you need to do is to add all the required columns to a non-clustered index and *BAM* using it is as fast as though it was a clustered index. That’s what SQL Server’s INCLUDE keyword of the CREATE INDEX statement is for!

Focusing on index-only scans instead of clustered indexes has several advantages:

  • You are not limited to one index. Any index can be as fast as a clustered index.

  • Adding INCLUDE columns to a non-clustered index doesn’t affect anything else than this particular index. There is no penalty that hurts all other indexes!

  • You don’t need to add all table columns to a non-clustered index to enable an index-only scan. Just add the columns that are relevant for the query you’d like to tune. That keeps the index small and can thus become even faster than a clustered index.

  • And the best part is: there is no mutual exclusion of index-only scans and clustered indexes. Index-only scans work irrespective of the table storage. You can extend non-clustered indexes for index-only scans even if there is a clustered index. That’s also an easy way to avoid paying the “clustered index penalty” on non-clustered indexes.

Because of the “clustered index penalty” the concept of the index-only scan is even more important when having a clustered index. Really, if there is something like a “silver bullet”, it is the index-only scan—not the clustered index.

If you like this article, you’ll love SQL Performance Explained.

MongoDB is to NoSQL like MySQL to SQL — in the most harmful way

Yesterday evening I tweeted: “MongoDB seems to be as bad for NoSQL as MySQL is for SQL.” Unfortunately, I tweeted without context. But I guess I couldn’t have given all the required context in a single tweet anyway, so I’m dedicating this post to it. I hope this answers some of the questions I’ve got in response to the tweet.

First of all, I think everybody should know that I’m not a NoSQL fanboy, yet I’m open to the polyglot persistence idea. This distinction doesn’t seem to make sense if you read NoSQL as "not only SQL" (as you are supposed to do). However, I believe there are NoSQL systems out there that greatly benefit from the idea that SQL is bad and not using SQL is good. On other words, they offer “not using SQL” as their main advantage. MongoDB seems to be one of them. Just my perception.

But if I don’t like NoSQL, then I should like MySQL? Not exactly. In my eyes, MySQL has done great harm to SQL because many of the problems people associate with SQL are in fact just MySQL problems. One of the more important examples is that MySQL is rather poor at joining because is only supports nested loops joins. Most other SQL database implement the hash join and sort/merge join algorithms too—both deliver better performance for non-tiny data sets. Considering the wide adoption of MySQL (“The most popular open source database”) and the observation that many people move away from SQL because “joins are slow,” it isn’t far-fetched to say that an implementation limitation of MySQL pushes people towards NoSQL.

Now let’s look at MongoDB. I think the direct competition between MongoDB and MySQL became most obvious in the epic video “MongoDB is Web Scale.” In the meanwhile, MongoDB even claims to be “the leading NoSQL database” — does that sound like “the most popular open source database”? Nevertheless, MongoDB has disappointed many people because it couldn’t live up to it’s promise of “web scale” (example: global write lock up to release 2.2).

The next piece in the puzzle that eventually caused me to tweet was a funny tweet by Gwen (Chen) Shapira (she’s an Oracle DB consultant):

#mongoDB : the big data platform that is challenging to scale over 100GB. http://blog.mongodb.org/post/62899600960/scaling-advice-from-mongohq

Note that the link was broken for a while (the post originally appeared on Sep 30, then disappeared, but is online since Oct 2 again at a different URL). The article is about handling MongoDB if it grows above 100GB. It gives me the impression that scaling MongoDB to that size is a serious issue. Even though there is no exact definition of “web scale” I guess most people would assume that it should be easy to scale MongoDB to 100GB. 100GB is not big data nowadays. 100GBs can be easily managed with most SQL DBs (joining in MySQL could be a problem). It was really funny to see this post on the MongoDB blog. Chen’s tweet nailed it.

At this point, I was once more thinking about the “misspent half-decade” mentioned by Jack Clark in his article “Google goes back to the future with SQL F1 database.” But as mentioned before, I like the idea of polyglot persistence. I’m not saying NoSQL is bullshit—not just because a single implementation fails to deliver. That would be like saying SQL is bullshit because MySQL is bad at joining. On the contrary, it reminded how Alex Popescu lost his temper in his post “The premature return to SQL” last Friday. His response to the “misspent half-decade” was:

Just take a second a think what we got during this misspent half-decade: Redis, Cassandra, Riak, a multi-parallel fully programmatic way to process data, Cascading, Pig, Cypher, ReQL and many more tools, languages, and APIs for processing data.

Well, I don’t know all of these but I do realize that some of them are interesting tools to have in the tool box. Further, I’m following Alex Popescu long enough to know that he is rather reflective on NoSQL—the title of his post being an exception. That’s why I came back to his post to see if he mentioned “the leading NoSQL database“ in his list. He didn’t. I don’t think it’s a coincidence.

At this point it was inevitable to see MongoDB as a popular, yet poor representative of it’s species—just like MySQL is.

Myth: Select * is bad

This is one of the most persistent myths I’ve seen in the field. It’s there for decades. If a myth is alive that long there must be some truth behind it. So, what could be bad about select *? Let’s have a closer look.

We all know that selecting “*” is just a short-hand for selecting all columns. Believe it or not, this makes a big difference to many people. So, lets first rephrase the question using this “finding”:

Why is it bad to select all columns?

In fact, there are a few very good reasons it is bad to select all columns if you don’t need them. And they all boil down to performance. What is surprising, however, is that the performance impact can be huge.

Up to 100x slower when preventing an Index-Only Scan

Broadly speaking, the less columns you ask for, the less data must be loaded from disk when processing your query. However, this relationship is non-linear.

Quite often, selecting from a table involves two steps: (1) use an index to find the address where the selected rows are stored; (2) load the selected rows from the table. Now imagine that you are just selecting columns that are present in the index. Why should the database still perform the second step? In fact, most databases don’t. They can process your query just with the information stored in the index—hence index-only scan.

But why should an index-only scan be 100 times faster? Simple: an ideal index stores the selected rows next to each other. It’s not uncommon that each index page holds about 100 rows—a ballpark figure; it depends on the size of the indexed columns. Nonetheless, it means that one IO operation might fetch 100 rows. The table data, on the other hand, is not organized like that (exceptions). Here it is quite common that a page just contains one of the selected rows—along with many other rows that are of no interest for the particular query. So, the reason an Index-Only Scan can be 100 times faster is that an index access can easily deliver 100 rows per IO while the table access typically just fetches a few rows per IO.

If you select a single column that’s not in the index, the database cannot do an index-only scan. If you select all columns, … , well I guess you know the answer.

Further, some databases store large objects in a separate place (e.g., LOBs in Oracle). Accessing those causes an extra IO too.

Up to 5x slower when bloating server memory footprint

Although databases avoid to store the result in the server’s main memory—instead the deliver each row after loading and forget about it again—it is sometimes inevitable. Sorting, for example, needs to keep all rows—and all selected columns—in memory to do the job. Once again, the more columns you select, the more memory the database needs. In the worst case, the database might even need to do an external sort on disk.

However, most database are extremely well tuned for this kind of workload. Although I’ve seen a sorting speed-up of factor two quite often—just by removing a few unused columns—I cannot remember having got more than factor five. However, it’s not just sorting, hash joins are rather sensitive to memory bloat too. Don’t know what that is? Please read this article.

These are just the two top issues from database perspective. Remember that the client needs to process the data too—which might put a considerable load on garbage collection.

Now that we have established a common understanding of why selecting everything is bad for performance, you may ask why it is listed as a myth? It’s because many people think the star is the bad thing. Further they believe they are not committing this crime because their ORM lists all columns by name anyway. In fact, the crime is to select all columns without thinking about it—and most ORMs readily commit this crime on behalf of their users.

The reason select * actually is bad—hence the reason the myth is very resistant—is because the star is just used as an allegory for “selecting everything without thinking about it”. This is the bad thing. But if you need a more catch phrase to remember the truth behind this myth, take this:

It’s not about the star, stupid!

Update 2013-11-03 - Is the star itself also bad?

Besides the performance issues mentioned above that are not caused by the star (asterisk) itself, the star itself might still cause other trouble. E.g. with software that expects the columns in a specific order when you add or drop a column. However, from my observation I’d say these issues are rather well understood in the field and usually easily identify (software stops working) fixed.

The focus of the article is on very subtle issues which are hardly understood, hard to find, and often even hard to fix (e.g. when using ORM tools). The main goal of this article is to stop people thinking about the star itself. Once people start to name the wanted columns explicitly to gain the performance benefit explained above, the issues caused by the star itself are also gone. Hence, I’ve felt no reason to add a discussion about these issues here—that’s just a distraction from the arguments that I wanted to explain with the article.

Try it online!

Today marks the third anniversary of Use The Index, Luke! And I have to fulfill a promise I gave one year ago: You can now test many of the example from this site online at SQLFiddle.com.

Here is a trivial example how it looks like. Just click on the SQL Fiddle logo on the right top corner of the execution plan.

SELECT first_name, last_name
  FROM employees
 WHERE employee_id   = 123
   AND subsidiary_id = 30
MySQL
Try online at SQL Fiddle+----+-----------+-------+---------+---------+------+-------+
| id | table     | type  | key     | key_len | rows | Extra |
+----+-----------+-------+---------+---------+------+-------+
|  1 | employees | const | PRIMARY | 10      |    1 |       |
+----+-----------+-------+---------+---------+------+-------+

As before, MySQL is able to use access type const because the query cannot match more than one row. Note that the key lengths (key_len) has become bigger because it now uses two columns of the index. See ??? for more details.

Oracle
Try online at SQL Fiddle---------------------------------------------------------------
|Id |Operation                   | Name         | Rows | Cost |
---------------------------------------------------------------
| 0 |SELECT STATEMENT            |              |    1 |    2 |
| 1 | TABLE ACCESS BY INDEX ROWID| EMPLOYEES    |    1 |    2 |
|*2 |  INDEX UNIQUE SCAN         | EMPLOYEES_PK |    1 |    1 |
---------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("EMPLOYEE_ID"=123 AND "SUBSIDIARY_ID"=30)
PostgreSQL
Try online at SQL Fiddle                QUERY PLAN
-------------------------------------------
 Index Scan using employees_pk on employees 
   (cost=0.00..8.27 rows=1 width=14)
   Index Cond: ((employee_id   = 123::numeric)
            AND (subsidiary_id = 30::numeric))
SQL Server
Try online at SQL Fiddle
|--Nested Loops(Inner Join)
   |--Index Seek(OBJECT:employees_pk,
   |               SEEK:employee_id=@ AND subsidiary_id=@2
   |            ORDERED FORWARD)
   |--RID Lookup(OBJECT:employees,
                   SEEK:Bmk1000=Bmk1000
                 LOOKUP ORDERED FORWARD)

As I said—a trivial example borrowed from chapter 2.

I have to admit that not all examples are available at SQL Fiddle yet. At the moment I’m finishing the examples of chapter 2. However, if you have read SQL Performance Explained you know that chapter 2 makes up half of the book. In other words, half of the book is already available at SQL Fiddle.

I hope this online experience makes Use The Index, Luke an even more awesome learning resource. A large part of this additional awesomeness is owed to Jake Feasel who built SQL Fiddle. Please note that you can flattr SQL Fiddle and donate via PayPal (on the right top of the page).

If you have not yet read the book, please have a look at the table of contents now and remember that you are just one click away from actually running the examples shown in the book. Learning about SQL performance has never been that easy ;)

About Optimizer Hints

Quite often I’m asked what I think about query hints. The answer is more lengthy and probably also more two-fold than most people expect it to be. However, to answer this question once and forever, I though I should write it down.

The most important fact about query hints is that not all query hints are born equally. I distinguish two major types:

Restricting Hints

Most query hints are restricting hints: they limit the optimizers’ freedom to choose an execution plan. “Hint” is an incredibly bad name for these things as they force the optimizer to do what it has been told—probably the reason MySQL uses the FORCE keyword for those.

I do not like restricting hints, yet I use them sometimes to test different execution plans. It usually goes like this: when I believe a different execution plan could (should?) give better performance, I just hint it to see if it really gives better performance. Quite often it becomes slower and sometimes I even realize that the execution plan I though of does not work at all—at least not with the database I’m working at that moment.

Typical examples for restricting query hints are hints that force the database to use or not use a particular index (e.g., INDEX and NO_INDEX in the Oracle database, USE INDEX and IGNORE INDEX in MySQL, or INDEX, FORCESEEK and the like in SQL Server).

So, what’s wrong with them? Well, the two main problems are that they (1) restrict the optimizer and that they (2) often need volatile object names as parameters (e.g., index names). Example: if you use a hint to use index ABC for a query, the hint becomes ineffective when somebody changes the name of the index to ABCD. Further, if you restrict the optimizer you can no longer expect it to adjust the execution plan if you add another index that servers the query better. Of course there are ways around these problems. The Oracle database, for example, offers "index description" hints to avoid both issues: instead of specifying the index name, it accepts a description of the ideal index (column list) and it selects the index that matches this definition best.

Nevertheless, I strongly recommend against using restricting query hints in production. Instead you should find out why the optimizer does “the wrong thing”™ and fix the root cause. Restricting hints fix the symptom, not the cause. That being said, I know that there is sometimes no other reasonable choice.

Supporting Hints

The second major type of query hints are supporting hints: they support the optimizer by providing information it doesn’t have otherwise. Supporting hints are rare—I’m only aware of a few good examples and the most useful one has already become obsolete: it’s FAST number_rows (SQL Server) and FIRST_ROWS(n) (Oracle). They tell the optimizer that the application plans to fetch only that many rows of the result. Consequently, the optimizer can prefer using indexes and nested loop joins that would be inefficient when fetching the full result (see Chapter 7, Partial Results for more details). Although being kind-of obsolete, I’m still using these hints as the defining example for supporting hints because they provide information the optimizer cannot have otherwise. This particular example is important enough that it was worth defining new keywords in the ISO SQL:2008: FETCH FIRST ... ROWS ONLY and OFFSET. That’s why this hint is a very good, yet obsolete example for supporting query hints.

Another example for supporting hints is the (undocumented) CARDINALITY hint of the Oracle database. It basically overwrites the row count estimate of sub-queries. This hint was often used if the combined selectivity of two predicates was way off the product of the selectivity of each individual predicate (see Combined Selectivity Example). But this hint is also outdated since Oracle database 11g introduced extended statistics to cope with issues like that. SQL Server’s filtered statistics serve the same purpose. If your database cannot reflect data correlation in it’s statistics, you’ll need to fall back to restricting hints.

Combined Selectivity Example

Let’s say we have two Y/N columns and each has a 50:50 distribution. When you select using both columns most optimizers estimate that the query matches 25% of the table (by multiplying two times 50%). That means that the optimizer assumes there is no correlation between those two columns.

Column 1Column 2count(*)
YY25
YN25
NY25
NN25

If there is a correlation, however, so that most rows that have Y in one column also have Y in the other column, then the estimate is way off.

Column 1Column 2count(*)
YY49
YN1
NY1
NN49

If you query one of the rare Y/N combinations, the optimizer might refrain from using an index due to the high cardinality estimate. Nevertheless, it would be better to use the index because this particular combination is very selective.

It think supporting hints are not that bad: they are just a way to cope with known limitations of the optimizer. That’s probably why they tend to become obsolete when the optimizers evolve.

And Then There Was PostgreSQL

You might have noticed that I did not mention PostgreSQL. It’s probably because PostgreSQL doesn’t have query hints although it has (which are actually session parameters). Confused? No problem, there is a short Wiki for that.

However, to see some discussion about introducing a similar hint as CARDINALITY described above or implementing "cross column statistics" read the first few messages in this thread from February 2011 (after the first page, the discussion moves to another direction). And the result? PostgreSQL still doesn’t have a good way to cope with the original problem of column correlation.

Pagination done the PostgreSQL way

Here is another slide deck for my "Pagination done the PostgreSQL way" talk at P2D2.cz. The topic is also relevant for other DBs—PostgreSQL makes it just very easy to demonstrate it. Look here how to do it with other DBs.

Please also have a look at this blog post by Gary Millsap about “The Ramp”. Do you see how using OFFSET implements this anti-pattern?

Indexes: The neglected performance all-rounder

I think I've actually never shared the slides of my talk given in Paris at Dalibo's PostgreSQL Session about Performance. So, here they are.

Afraid of SSD?

When clients tell me about their plans to invest in SSD storage for their database, they often look at me like a doctor telling a patient about his deadly disease. I didn’t discuss this with my clients until recently, when one client just asked me straight away: “As SQL tuning guy, are you afraid of SSD because it kills your job?” Here is what I told that client.

Generally, people seem to believe that SSD are just much faster than HDD. As a matter of fact, this is only partially true because—as I also mentioned in Chapter 3—performance has two dimensions: response time and throughput. Although SSDs tend to deliver more performance on both axes, it has to be considered that the throughput delivered by SSD is “only” about five times as high as that of HDDs. That’s because HDDs are not bad at sequential read/write operations anyway.

Sure enough a five times faster storage makes many performance problems go away…for a while…until you have five times more data. For a decently growing startup it might just take a few months until you have the same problem again. However, this is not the crucial point here. The crucial point is that SSDs essentially fix the one performance issue where HDDs are really bad at: the response time. Due to the lack of moving parts, the response time of SSDs is about fifty times faster as that of HDDs. Well, that really helps solving problems for a while.

However, there is a catch—maybe even a Catch-22: If you want to get the factor 50 speed-up of SSDs, you’d better avoid reading large chunks of sequential data, because that’s where you can only gain a factor five improvement. To put that into database context: if you are doing many full table scans, you won’t get the full potential of SSD. On the other hand, index lookups have a tendency to cause many random IO operations and can thus benefit from the fast response time of SSDs. The fun part is that properly indexed databases get better benefits from SSD than poorly indexed ones. But guess who is most desperately betting on SSD to solve their performance problems? People having proper indexes or those who don’t have them?

The story goes on: which database operation do you think causes most random IO operations? Of course it’s our old friend the join—it is the sole purpose of joins to gather many little data fragments from different places and combine them into the result we want. Joins can also greatly benefit from SSDs. SSDs actually voids one of arguments often brought up by NoSQL folks against relational databases: with SSD it doesn’t matter that much if you fetch data from one place or from many places.

To conclude what I said to my client: No, as an indexing-focused SQL performance guy, I’m absolutely not afraid of SSD.

About the Author

As an author, trainer, and coach Markus Winand specializes in helping developers cope with SQL performance issues. He also published the book SQL Performance Explained and tweets his best performance tips via @SQLPerfTips.http://winand.at/

Recent Questions at Ask.Use-The-Index-Luke.com

0
votes
1
answer
111
views
0
votes
0
answers
358
views

Fanout in R-Tree

Mar 27 at 08:07 jamie 1
tree indexing
0
votes
1
answer
139
views

Think About It

Mar 26 at 12:54 Markus Winand ♦♦ 511
reflection