Use The Index, Luke Blog - Latest News About SQL Performance


Seven Surprising Findings About DB2

I’ve just completed IBM DB2 for Linux, Unix and Windows (LUW) coverage here on Use The Index, Luke as preparation for an upcoming training I’m giving. This blog post describes the major differences I’ve found compared to the other databases I’m covering (Oracle, SQL Server, PostgreSQL and MySQL).

Free & Easy

Well, let’s face it: it’s IBM software. It has a pretty long history. You would probably not expect that it is easy to install and configure, but in fact: it is. At least DB2 LUW Express-C 10.5 (LUW is for Linux, Unix and Windows, Express-C is the free community edition). That might be another surprise: there is a free community edition. It’s not open source, but it’s free as in free beer.

No Easy Explain

The first problem I stumbled upon is that DB2 has no easy way to display an execution plan. No kidding. Here is what IBM says about it:

  • Explain a statement by prefixing it with explain plan for

    This stores the execution plan in a set of tables in the database (you’ll need to create these tables first). This is pretty much like in Oracle.

  • Display a stored explain plan using db2exfmt

    This is a command line tool, not something you can call from an SQL prompt. To run this tool you’ll need shell access to a DB2 installation (e.g. on the server). That means that you cannot use this tool over a regular database connection.
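
Putting the two steps together, the procedure looks roughly like this (a sketch assuming DB2 LUW defaults; the SELECT is a stand-in):

-- one-time setup: create the explain tables
CALL SYSPROC.SYSINSTALLOBJECTS('EXPLAIN', 'C',
     CAST(NULL AS VARCHAR(128)),   -- table space (default)
     CAST(NULL AS VARCHAR(128)));  -- schema (default)

-- explain a statement
EXPLAIN PLAN FOR
SELECT * FROM employees WHERE employee_id = ?;

-- then, from a shell on the server (not over a database connection):
-- $ db2exfmt -d <dbname> -1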

There is another command line tool (db2expln) that combines the two steps from above. Apart from the fact that this procedure is not exactly convenient, the output you get is ASCII art:

Access Plan:
-----------
    Total Cost:         60528.3
    Query Degree:       1


              Rows
             RETURN
             (   1)
              Cost
               I/O
               |
             49534.9
             ^HSJOIN
             (   2)
             60528.3
              68095
         /-----+------\
     49534.9           10000
     TBSCAN           TBSCAN
     (   3)           (   4)
     59833.6          687.72
      67325             770
       |                |
   1.00933e+06         10000
 TABLE: DB2INST1  TABLE: DB2INST1
      SALES          EMPLOYEES
       Q2               Q1

Please note that this is just an excerpt—the full output of db2exfmt has 400 lines. Quite a lot of information that you’ll hardly ever need. Even the information that you need all the time (the operations) is presented in a pretty unreadable way (IMHO). I’m particularly thankful that all the numbers you see above are not labeled—that’s really the icing on the cake that renders this “tool” totally useless for the occasional user.

However, according to the IBM documentation there is another way to display an execution plan: “Write your own queries against the explain tables.” And that’s exactly what I did: I wrote a view called last_explained that does exactly what its name suggests: it shows the execution plan of the last statement that was explained (in a non-useless format):

Explain Plan
------------------------------------------------------------
ID | Operation          |                       Rows |  Cost
 1 | RETURN             |                            | 60528
 2 |  HSJOIN            |             49535 of 10000 | 60528
 3 |   TBSCAN SALES     | 49535 of 1009326 (  4.91%) | 59833
 4 |   TBSCAN EMPLOYEES |   10000 of 10000 (100.00%) |   687

Predicate Information
 2 - JOIN (Q2.SUBSIDIARY_ID = DECIMAL(Q1.SUBSIDIARY_ID, 10, 0))
     JOIN (Q2.EMPLOYEE_ID = DECIMAL(Q1.EMPLOYEE_ID, 10, 0))
 3 - SARG ((CURRENT DATE - 6 MONTHS) < Q2.SALE_DATE)

Explain plan by Markus Winand - NO WARRANTY
http://use-the-index-luke.com/s/last_explained

I’m pretty sure many DB2 users will say that this presentation of the execution plan is confusing. And that’s OK. If you are used to the way IBM presents execution plans, just stick to what you are used to. However, I’m working with all kinds of databases and they all have a way to display the execution plan similar to the one shown above—for me this format is much more useful. Further, I’ve made a useful selection of data to display: the row count estimates and the predicate information.

You can get the source of the last_explained view from here or from the URL shown in the output above. I’m serious about the no warranty part. Yet I’d like to know about problems you have with the view.
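
With the view in place, displaying a plan needs nothing more than a regular SQL connection. A minimal sketch (the SELECT is a stand-in):

EXPLAIN PLAN FOR
SELECT * FROM employees WHERE subsidiary_id = ?;

SELECT * FROM last_explained;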

Emulating Partial Indexes is Possible

Partial indexes are indexes not containing all table rows. They are useful in three cases:

  1. To preserve space when the index is only useful for a very small fraction of the rows. Example: queue tables.

  2. To establish a specific row order in presence of constant non-equality predicates. Example: WHERE x IN (1, 5, 9) ORDER BY y. An index like the following can be used to avoid a sort operation:

    CREATE INDEX … ON … (y)
     WHERE x IN (1, 5, 9)
  3. To implement unique constraints on a subset of rows (e.g. only those WHERE active = 'Y'). A sketch follows right after this list.
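
For comparison, here is what the third case looks like in a database with native partial index support, e.g. PostgreSQL (users, username and active are made-up names):

CREATE UNIQUE INDEX users_active_uq
          ON users (username)
       WHERE active = 'Y'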

However, DB2 doesn’t support a where clause for indexes as shown above. But DB2 has many Oracle-compatibility features, one of which is EXCLUDE NULL KEYS: “Specifies that an index entry is not created when all parts of the index key contain the null value.” This is the hard-wired behaviour of the Oracle database, and it is commonly exploited there to emulate partial indexes.

Generally speaking, emulating partial indexes works by mapping all parts of the key (all indexed columns) to NULL for rows that should not end up in the index. As an example, let’s emulate this partial index in the Oracle database (DB2 is next):

CREATE INDEX messages_todo
          ON messages (receiver)
       WHERE processed = 'N'

The solution presented in SQL Performance Explained uses a function to map the processed rows to NULL, otherwise the receiver value is passed through:

CREATE OR REPLACE
FUNCTION pi_processed(processed CHAR, receiver NUMBER)
RETURN NUMBER
DETERMINISTIC
AS BEGIN
   IF processed IN ('N') THEN
      RETURN receiver;
   ELSE
      RETURN NULL;
   END IF;
END;
/

It’s a deterministic function and can thus be used in an Oracle function-based index. This won’t work with DB2, because DB2 doesn’t allow user-defined functions in index definitions. However, let’s first complete the Oracle example.

CREATE INDEX messages_todo
          ON messages (pi_processed(processed, receiver));

This index has only rows WHERE processed IN ('N')—otherwise the function returns NULL which is not put in the index (there is no other column that could be non-NULL). Voilà: a partial index in the Oracle database.

To use this index, just use the pi_processed function in the where clause:

SELECT message
  FROM messages
 WHERE pi_processed(processed, receiver) = ?

This is functionally equivalent to:

SELECT message
  FROM messages
 WHERE processed = 'N'
   AND receiver  = ?

So far, so ugly. If you go for this approach, you’d better need the partial index desperately.

To make this approach work in DB2 we need two components: (1) the EXCLUDE NULL KEYS clause (no-brainer); (2) a way to map processed rows to NULL without using a user-defined function so it can be used in a DB2 index.

Although the second one might seem hard, it is actually very simple: DB2 can do expression-based indexing, just not on user-defined functions. The mapping we need can be accomplished with regular SQL expressions:

CASE WHEN processed = 'N' THEN receiver
                          ELSE NULL
END

This implements the very same mapping as the pi_processed function above. Remember that CASE expressions are first-class citizens in SQL—they can be used in DB2 index definitions (on LUW only since 10.5):

CREATE INDEX messages_not_processed_pi
    ON messages (CASE WHEN processed = 'N' THEN receiver
                                           ELSE NULL
                 END)
EXCLUDE NULL KEYS;

This index uses the CASE expression to map the rows that should not be indexed to NULL and the EXCLUDE NULL KEYS clause to keep those rows out of the index. Voilà: a partial index in DB2 LUW 10.5.

To use the index, just use the CASE expression in the where clause and check the execution plan:

SELECT *
  FROM messages
 WHERE (CASE WHEN processed = 'N' THEN receiver
                                  ELSE NULL
         END) = ?;

Explain Plan
-------------------------------------------------------
ID | Operation        |                    Rows |  Cost
 1 | RETURN           |                         | 49686
 2 |  TBSCAN MESSAGES | 900 of 999999 (   .09%) | 49686

Predicate Information
 2 - SARG (Q1.PROCESSED = 'N')
     SARG (Q1.RECEIVER = ?)

Oh, that’s a big disappointment: the optimizer didn’t take the index. It does a full table scan instead. What’s wrong?

If you have a very close look at the execution plan above, which I created with my last_explained view, you might see something suspicious.

Look at the predicate information. What happened to the CASE expression that we used in the query? The DB2 optimizer was smart enough to rewrite the expression as WHERE processed = 'N' AND receiver = ?. Isn’t that great? Absolutely!…except that this smartness has just ruined my attempt to use the partial index. That’s what I meant when I said that CASE expressions are first-class citizens in SQL: the database has a pretty good understanding of what they do and can transform them.

We need a way to apply our magic NULL-mapping but we can’t use functions (can’t be indexed) nor can we use CASE expressions, because they are optimized away. Dead-end? Au contraire: it’s pretty easy to confuse an optimizer. All you need to do is to obfuscate the CASE expression so that the optimizer doesn’t transform it anymore. Adding zero to a numeric column is always my first attempt in such cases:

CASE WHEN processed = 'N' THEN receiver + 0
                          ELSE NULL
END

The CASE expression is essentially the same, I’ve just added zero to the RECEIVER column, which is numeric. If I use this expression in the index and the query, I get this execution plan:

ID | Operation                            |            Rows |  Cost
 1 | RETURN                               |                 | 13071
 2 |  FETCH MESSAGES                      |  40000 of 40000 | 13071
 3 |   RIDSCN                             |  40000 of 40000 |  1665
 4 |    SORT (UNIQUE)                     |  40000 of 40000 |  1665
 5 |     IXSCAN MESSAGES_NOT_PROCESSED_PI | 40000 of 999999 |  1646

Predicate Information
 2 - SARG ( CASE WHEN (Q1.PROCESSED = 'N') THEN (Q1.RECEIVER + 0)
                                           ELSE NULL END = ?)
 5 - START ( CASE WHEN (Q1.PROCESSED = 'N') THEN (Q1.RECEIVER + 0)
                                            ELSE NULL END = ?)
      STOP ( CASE WHEN (Q1.PROCESSED = 'N') THEN (Q1.RECEIVER + 0)
                                            ELSE NULL END = ?)

The partial index is used as intended. The CASE expression appears unchanged in the predicate information section.

I haven’t checked any other ways to emulate partial indexes in DB2 (e.g., using partitions like in more recent Oracle versions).

As always: just because you can do something doesn’t mean you should. This approach is so ugly—even uglier than the Oracle workaround—that you must desperately need a partial index to justify this maintenance nightmare. Further, it will stop working whenever the optimizer becomes smart enough to optimize +0 away. But then you just need to put an even uglier obfuscation in there.

INCLUDE Clause Only for Unique Indexes

With the INCLUDE clause you can add extra columns to an index for the sole purpose of allowing an index-only scan when these columns are selected. I knew the INCLUDE clause before because SQL Server offers it too, but there are some differences:

  • In SQL Server, INCLUDE columns are only added to the leaf nodes of the index—not to the root and branch nodes. This limits the impact on the B-tree’s depth when adding many or long columns to an index. It also makes it possible to bypass some limitations (number of columns, total index row length, allowed data types). That doesn’t seem to be the case in DB2.

  • In DB2 the INCLUDE clause is only valid for unique indexes. It allows you to enforce the uniqueness of the key columns only—the INCLUDE columns are just not considered when checking for uniqueness. This is the same in SQL Server, except that SQL Server supports INCLUDE columns on non-unique indexes too (to leverage the above-mentioned benefits). A sketch follows below.
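
For reference, a DB2 sketch of such an index (orders and its columns are made-up names; uniqueness is enforced on order_id only):

CREATE UNIQUE INDEX orders_uq
          ON orders (order_id)
     INCLUDE (customer_id, order_date)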

Almost No NULLS FIRST/LAST Support

The NULLS FIRST and NULLS LAST modifiers to the order by clause allow you to specify whether NULL values are considered larger or smaller than non-NULL values during sorting. Strictly speaking, you must always specify the desired order when sorting nullable columns because the SQL standard doesn’t specify a default. As you can see in the following chart, the default ordering of NULLs indeed differs across databases:

Figure A.1. Database/Feature Matrix


In this chart, you can also see that DB2 doesn’t support NULLS FIRST or NULLS LAST—neither in the order by clause nor in the index definition. However, note that this is a simplified statement. In fact, DB2 accepts NULLS FIRST and NULLS LAST when it is in line with the default NULLS order. In other words, ORDER BY col ASC NULLS FIRST is valid, but it doesn’t change the result—NULLS FIRST is the default anyway. The same is true for ORDER BY col DESC NULLS LAST—accepted, but it doesn’t change anything. The other two combinations are not valid at all and yield a syntax error.
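
In code, that behaviour looks like this (t and col are stand-in names; mirroring the description above):

SELECT * FROM t ORDER BY col ASC  NULLS FIRST  -- accepted, default anyway
SELECT * FROM t ORDER BY col DESC NULLS LAST   -- accepted, default anyway
SELECT * FROM t ORDER BY col ASC  NULLS LAST   -- syntax error
SELECT * FROM t ORDER BY col DESC NULLS FIRST  -- syntax error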

SQL:2008 FETCH FIRST but not OFFSET

DB2 has supported the fetch first … rows only clause for a while now—kind of impressive considering it was “just” added with the SQL:2008 standard. However, DB2 doesn’t support the offset clause, which was introduced with the very same release of the SQL standard. Although it might look like an arbitrary omission, it is in fact a very wise move that I deeply respect. offset is the root of so much evil. In the next section, I’ll explain how to live without offset.

Side note: If you have code using offset that you cannot change, you can still activate the MySQL compatibility vector, which makes limit and offset available in DB2. Funnily enough, combining fetch first with offset is then still not possible (even though that would be standard-compliant).

Decent Row-Value Predicates Support

SQL row values are multiple scalar values grouped together by parentheses to form a single logical value. IN-lists are a common use case:

WHERE (col_a, col_b) IN (SELECT col_a, col_b FROM…)

This is supported by pretty much every database. However, there is a second, hardly known use case that has pretty poor support in today’s SQL databases: keyset pagination, also known as offset-less pagination. Keyset pagination uses a where clause that basically says “I’ve seen everything up till here, just give me the next rows”. In the simplest case it looks like this:

SELECT …
  FROM …
 WHERE time_stamp < ?
 ORDER BY time_stamp DESC
 FETCH FIRST 10 ROWS ONLY

Imagine you’ve already fetched a bunch of rows and need to get the next few ones. For that you’d use the time_stamp value of the last entry you’ve got as the bind value (?). The query then just returns the rows from there on. But what if there are two rows with the very same time_stamp value? Then you need a tiebreaker: a second column—preferably a unique column—in the order by and where clauses that unambiguously marks the place up to which you have read the result. This is where row-value predicates come in:

SELECT …
  FROM …
 WHERE (time_stamp, id) < (?, ?)
 ORDER BY time_stamp DESC, id DESC
 FETCH FIRST 10 ROWS ONLY

The order by clause is extended to make sure there is a well-defined order if there are equal time_stamp values. The where clause just selects what’s after the row specified by the time_stamp and id pair. It couldn’t be any simpler to express this selection criterion. Unfortunately, neither the Oracle database nor SQLite nor SQL Server understands this syntax—even though it has been in the SQL standard since 1992! However, it is possible to apply the same logic without row-value predicates—but that’s rather inconvenient and easy to get wrong.
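
For the record, the same logic spelled out without row-value predicates looks roughly like this; note that the last-seen time_stamp has to be bound twice:

SELECT …
  FROM …
 WHERE (time_stamp < ?
    OR (time_stamp = ? AND id < ?))
 ORDER BY time_stamp DESC, id DESC
 FETCH FIRST 10 ROWS ONLY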

Even if a database understands the row-value predicate, it doesn’t necessarily understand these predicates well enough to make proper use of indexes that support the order by clause. This is where MySQL fails—although it applies the logic correctly and delivers the right result, it does not use an index for that and is thus rather slow. In the end, DB2 LUW (since 10.1) and PostgreSQL (since 8.4) are the only two databases that support row-value predicates the way they should be supported.

The fact that DB2 LUW has everything you need for convenient keyset pagination is also the reason why there is absolutely no reason to complain about the missing offset functionality. In fact, I think offset should not have been added to the SQL standard, and I’m happy to see a vendor that resisted the urge to add it just because it became part of the standard. Sometimes the standard is wrong—just sometimes, not very often ;) I can’t change the standard—all I can do is teach how to do it right and start campaigns like #NoOffset.

Figure A.2. Database/Feature Matrix


If you like my way of explaining things, you’ll love my book “SQL Performance Explained”.

Meta-Post: New Mascot, New Language, New Database

It has been quiet here at the Use The Index, Luke blog lately. But that’s not because I’ve run out of topics to write about — in fact, my blog backlog seems to be ever growing — the recent silence is just because there are some more demanding projects happening at the moment.

First of all, Use the Index, Luke got a new mascot—not exactly breaking news. However, I’m currently preparing give-aways and merchandise products featuring the new mascot. Stay tuned.

Next, Use The Index, Luke is being translated into Japanese! The first two chapters have just been published. Insiders will remember that chapters 1 and 2 make up half of the book. The translation is done by Hayato Matsuura, and Takuto Matsuu gives it a second look. As far as I can tell, both are doing a great job and I’d like to be the first to thank them! Please help spread the word about this in the Japanese community.

Finally, I’m adding DB2 as a first-class citizen to Use The Index, Luke because a client wanted to get my SQL performance training based on DB2 LUW Express-C (which is free, by the way). Like the Japanese translation, this work is not yet finished. However, the appendix on execution plans is already there. Again, please help spread the word about this in the DB2 community.

That’s it. I just wanted to give you a short update.

We need tool support for keyset pagination

Did you know pagination with offset is very troublesome but easy to avoid?

offset instructs the database to skip the first N results of a query. However, the database must still fetch these rows from disk and put them in order before it can send the following ones.
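
For example, fetching “page four” with offset looks like this (SQL standard syntax; the exact spelling varies by database):

SELECT …
  FROM …
 ORDER BY time_stamp DESC
OFFSET 30 ROWS
 FETCH NEXT 10 ROWS ONLY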

This is not an implementation problem, it’s the way offset is designed:

…the rows are first sorted according to the <order by clause> and then limited by dropping the number of rows specified in the <result offset clause> from the beginning…

In other words, big offsets impose a lot of work on the database—no matter whether SQL or NoSQL.

But the trouble with offset doesn’t stop here: think about what happens if a new row is inserted between fetching two pages.

When using offset to skip the previously fetched entries, you’ll get duplicates if new rows were inserted between fetching two pages. Other anomalies are possible too; this is just the most common one.

This is not even a database problem, it is the way frameworks implement pagination: they just say which page number to fetch or how many rows to skip. With this information alone, no database can do any better.

Life Without OFFSET

Now imagine a world without these problems. As it turns out, living without offset is quite simple: just use a where clause that selects only data you haven’t seen yet.

For that, we exploit the fact that we work on an ordered set—you do have an order by clause, don’t you? Once there is a definite sort order, we can use a simple filter to select only what follows the entry we have seen last:

SELECT ...
  FROM ...
 WHERE ...
   AND id < ?last_seen_id
 ORDER BY id DESC
 FETCH FIRST 10 ROWS ONLY

This is the basic recipe. It gets more interesting when sorting on multiple columns, but the idea is the same. This recipe is also applicable to many NoSQL systems.

This approach—called seek method or keyset pagination—solves the problem of drifting results as illustrated above and is even faster than offset. If you’d like to know what happens inside the database when using offset or keyset pagination, have a look at these slides (benchmarks, benchmarks!):

On slide 43 you can also see that keyset pagination has some limitations: most notably that you cannot directly navigate to arbitrary pages. However, this is not a problem when using infinite scrolling. Showing page numbers to click on is a poor navigation interface anyway—IMHO.

If you want to read more about how to properly implement keyset pagination in SQL, please read this article. Even if you are not involved with SQL, it’s worth reading that article before starting to implement anything.

But the Frameworks…

The main reason to prefer offset over keyset pagination is the lack of tool support. Most tools offer pagination based on offset, but don’t offer any convenient way to use keyset pagination.

Please note that keyset pagination affects the whole technology stack up to the JavaScript running in the browser doing AJAX for infinite scrolling: instead of passing a simple page number to the server, you must pass a full keyset (often multiple columns) down to the server.

The hall of fame of frameworks that do support keyset pagination is rather short:

This is where I need your help. If you are maintaining a framework that is somehow involved with pagination, I ask you, I urge you, I beg you, to build in native support for keyset pagination too. If you have any questions about the details, I’m happy to help (forum, contact form, Twitter)!

Even if you are just using software that should support keyset pagination such as a content management system or webshop, let the maintainers know about it. You might just file a feature request (link to this page) or, if possible, supply a patch. Again, I’m happy to help out getting the details right.

Take WordPress as an example.

Spread the Word

The problem with keyset pagination is not a technical one. The problem is just that it is hardly known in the field and has no tool support. If you like the idea of offset-less pagination, please help spreading the word. Tweet it, share it, mail it, you can even re-blog this post (CC-BY-NC-ND). Translations are also welcome, just contact me beforehand—I’ll include a link to the translation on this page too!

Oh, and if you are blogging, you could also add a banner on your blog to make your readers aware of it. I’ve prepared a NoOffset banner gallery with some common banner formats. Just pick what suits you best.

Finding All the Red M&Ms: A Story of Indexes and Full-Table Scans

In this guest post, Chris Saxon explains a very important topic using an analogy with chocolates: when does a database use an index, and when is it better not to use one? Although Chris’s explanation has the Oracle database in mind, the principles apply to other databases too.

A common question that comes up when people start tuning queries is “why doesn’t this query use the index I expect?”. There are a few myths surrounding when database optimizers will use an index. A common one I’ve heard is that an index will be used when accessing 5% or less of the rows in a table. This isn’t the case, however; the basic decision on whether or not to use an index comes down to its cost.

How do databases determine the cost of an index?

Before getting into the details, let’s talk about chocolate! Imagine you have 100 packets of M&M’s. You also have a document listing the colour of each M&M and the bag it’s stored in. This is ordered by colour, so we have all the blue sweets first, then the brown, the green, and so on.

You’ve been asked to find all the red M&M’s. There’s a couple of basic ways you could approach this task:

Method 1

Get your document listing the colour and location of each M&M. Go to the top of the “red” section. Look up the location of the first red M&M, pick up the bag it states, and get the sweet. Go back to your document and repeat the process for the next red chocolate. Keep going back-and-forth between your document and the bags until you’ve reached the end of the red section.

Method 2

Pick up a number of bags at a time (e.g. 10), empty their contents out, pick out the red chocolates and return the others (back to their original bag).

Which approach is quicker?

Intuitively the second approach appears to be faster. You only select each bag once and then do some filtering of the items inside. With the first approach, by contrast, you do a lot of back-and-forth between the document and the bags. This means you have to look into each bag multiple times.

We can be a bit more rigorous than this though. Let’s calculate how many operations we need to do to get all the red chocolates in each case.

When going between the document and the bags (method 1), each time you look up the location of a new sweet and fetch that bag, that’s a new operation. You have 100 bags with around 55 sweets in each. This means you’re doing roughly 920 (100 bags x 55 sweets / 6 colours) operations (plus some work to find the red section in your document). So the “cost” of using the document is around 920.

With the second approach you collect 10 bags in one step. This means you do ( 100 bags / 10 bags per operation = ) 10 operations (plus some filtering of the chocolates in them), giving a “cost” of 10.

Comparing these costs (920 vs. 10), method 2 is the clear winner.

Let’s imagine another scenario. Mars have started doing a promotion where around 1 in 100 bags contain a silver M&M. If you get the silver sweet, you win a prize. You want to find the silver chocolate!

In this case, using method 1, you go to the document to find the location of the single sweet. Then you go to that bag and retrieve the sweet. One operation (well, two, including going to the document to find the location of the silver chocolate), so we have a cost of two.

With method 2, you still need to pick up every single bag (and do some filtering) just to find one sweet - the cost is fixed at 10. Clearly method 1 is far superior in this case.

What have M&M’s got to do with databases?

When Oracle stores a record to the database, it is placed in a block. Just like there are many M&Ms in a bag, (normally) there are many rows in a block. When accessing a particular row, Oracle fetches the whole block and retrieves the requested row from within it. This is analogous to us picking up a bag of M&Ms and then picking a single chocolate out.

When doing an index-range scan, Oracle will search the index (the document) to find the first value matching your where clause. It then goes back-and-forth between the index and the table blocks, fetching the records from the location pointed to by the index. This is similar to method 1, where you continually switch between the document and the M&M bags.

As the number of rows accessed by an index increases, the database has to do more work. Therefore the cost of using an index increases in line with the number of records it is expected to fetch.

When doing a full table scan (FTS), Oracle will fetch several blocks at once in a multi-block read. The data fetched is then filtered so that only rows matching your where clause are returned (in this case the red M&Ms) – the rest are discarded. Just like in method 2.

The expected number of rows returned has little impact on the work a FTS does. Its basic cost is fixed by the size of the table and how many blocks you fetch at once.

When fetching a “high” percentage of the rows from a table, it becomes far more efficient to get several blocks at once and do some filtering than it is to visit a single block multiple times.

When does an index scan become more efficient than a FTS?

In our M&M example above, the “full-table scan” method fetches all 100 bags in 10 operations, whereas the “index” approach requires a separate operation for each sweet. So an index is more efficient when it points to 10 M&Ms or less.

Mars puts around 55 M&M’s in each bag, so as a percentage of the “table” that’s just under ( 10 M&M’s / (100 bags * 55 sweets) * 100 = ) 0.2%!

What if Mars releases some “giant” M&Ms with only 10 sweets in a bag? In this case there are fewer sweets in total, so the denominator in the equation above decreases. Our FTS approach is still fixed at a “cost” of 10 for the 100 bags. This means the point at which an index is better is when accessing approximately ( 10/1000*100 = ) 1% of the “rows”. A higher percentage, but still small in real terms.

If they released “mini” M&Ms with 200 in a bag, the denominator would increase. This means that the index is more efficient when accessing a very small percentage of the table!

So as you increase the space required to store a row, an index becomes more effective than a FTS. The number of rows accessed by the index remains fixed. The number of blocks required to store the data increases however, making the FTS more expensive and leading to it having a higher cost.

There’s a big assumption made in the above reasoning however. It’s that there’s no correlation between the order M&M’s are listed in the document and which bag they are in. So, for example, the first red M&M (in the document) may be in bag 1, the second in bag 56, the third in bag 20, etc.

Let’s make a different assumption – that the order of red chocolates in the document corresponds to the order they appear in the bags. So the first 9 red sweets are in bag 1, the next 9 in bag 2 etc. While you still have to visit all 100 bags, you can keep the last bag accessed in your hand, only switching bags every 9 or so sweets. This reduces the number of operations you do, making the index approach more efficient.

We can take this further still. What if Mars changes the bagging process so that only one colour appears in each bag?

Now, instead of having to visit every single bag to get all the red sweets, you only have to visit around ( 100 bags / 6 colours ) 16 bags. If the sweets are also placed in the bags in the same order they are listed in the document (so M&M’s 1-55 are all blue and in bag 1, bag 2 has the blue M&M’s 56-110, and so on until bag 100, which has the yellow M&M’s 5446-5500) you get the benefits of not switching bags compounded with the effect of having fewer bags to fetch.

This principle – how closely the order of records in a table matches the order they’re listed in a corresponding index – is referred to as the clustering factor. This figure is lower when the rows appear in the same physical order in the table as they do in the index (all sweets in a bag are the same colour) and higher when there’s little or no correlation.

The closer the clustering factor is to the number of blocks in a table the more likely it is that the index will be used (it is assigned a lower cost). The closer it is to the number of rows in a table, the more likely it is a FTS will be chosen (the index access is given a higher cost).

Bringing it all together

To sum up, we can see that the cost-based optimizer decides whether to use an index or a FTS by the steps below (a rough formula sketch follows the list):

  • Taking the number of blocks used to store the table and dividing this by the number of blocks read in a multi-block read to give the FTS cost.

  • For each index on the table available to the query:

    • Finding the percentage of the rows in the table it expects a query to return (the selectivity)

    • This is then used to determine the percentage of the index expected to be accessed

    • The selectivity is also multiplied by the clustering factor to estimate the number of table blocks it expects to access to fetch these rows via an index

    • Adding these numbers together to give the expected cost (of the index)

  • The cost of the FTS is then compared to the cost of each index inspected, and the access method with the lowest cost is used.
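
In rough formula form, and greatly simplified (sel is the selectivity, MBRC the number of blocks fetched per multi-block read; see the references below for the real thing):

cost(FTS)   ≈ table_blocks / MBRC
cost(index) ≈ blevel + sel * leaf_blocks + sel * clustering_factor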

This is just an overview of how the (Oracle) cost-based optimizer works. If you want to see the formulas the optimizer uses have a read of Wolfgang Breitling’s “Fallacies of the Cost Based Optimizer” paper or Jonathan Lewis’ Cost-Based Oracle Fundamentals book. The blogs of Jonathan Lewis, Richard Foote and of course Markus’ articles on this site also contain many posts going into this subject in more detail.

What I learned about SQLite…at a PostgreSQL conference

So, I’ve been to PgCon 2014 in Ottawa to give a short version of my SQL performance training (hint: special offer expires soon). However, I think I ended up learning more about SQLite than about PostgreSQL there. Here is how that happened and what I actually learned.

Richard Hipp, creator of SQLite, was the keynote speaker at this year’s PgCon. In his keynote (slides, video) he put the focus on three topics: how PostgreSQL influenced SQLite development (“SQLite was originally written from PostgreSQL 6.5 documentation” and the “What Would PostgreSQL Do?” (WWPD) way of finding out what the SQL standard tries to tell us). The second main topic was that SQLite should be seen as an application file format—an alternative to inventing your own file format or using ZIPped XMLs. The statement “SQLite is not a replacement for PostgreSQL. SQLite is a replacement for fopen()” nails that (slide 21). Finally, Richard put a lot of emphasis on the fact that SQLite takes care of your data (crash safe, ACID)—unlike many of the so-called NoSQL systems, which Richard refers to as “Postmodern Databases: absence of objective truth; queries return opinions rather than facts”. In his keynote, Richard also showed that SQLite is pretty relaxed when it comes to data types. As a matter of fact, SQLite accepts strings like “Hello” for INT fields. Note that it still stores “Hello”—no data is lost. I think he mentioned that it is possible to enforce the types via CHECK constraints.

Here I have to mention that I had an e-mail conversation with Richard about covering SQLite on Use The Index, Luke last year. When our ways crossed in the PgCon hallway I just wanted to let him know that this has not been forgotten (despite the minimal progress on my side). Guess what—he still remembered our mail exchange and immediately suggested going through those topics once more in PgCon’s hacker lounge later. Said. Done. Here comes what I learned about SQLite.

First of all, SQLite uses the clustered index concept as known from SQL Server or MySQL/InnoDB. However, before version 3.8.2 (released 2013-12-06) you always had to use an INTEGER column as the clustering key. If the primary key happened to be a non-integer, SQLite implicitly created an integer primary key and used that as the clustering key. Querying by the real primary key (e.g. TEXT) required a secondary index lookup. This limitation was recently lifted by introducing WITHOUT ROWID tables.
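
A sketch of the new syntax (table and columns are made up):

CREATE TABLE config (
   name  TEXT PRIMARY KEY,  -- a TEXT key can now be the clustering key
   value TEXT
) WITHOUT ROWID;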

When it comes to execution plans, SQLite has two different variants: the regular explain prefix returns the byte code that is actually executed. To get an execution plan similar to what we are used to, use explain query plan. SQLite also has some hints (unlike PostgreSQL) to affect the query plan: indexed by as well as unlikely and likelihood.
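
For example (t, x and idx are stand-in names):

EXPLAIN SELECT * FROM t WHERE x = ?;             -- byte code
EXPLAIN QUERY PLAN SELECT * FROM t WHERE x = ?;  -- readable plan
SELECT * FROM t INDEXED BY idx WHERE x = ?;      -- hint: use index idx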

We also discussed whether or not SQLite can cope with search conditions like (col1, col2) < (val1, val2)—it can’t. Here Richard questioned the motivation to have that, so I gave him an elevator-pitch version of my “Pagination Done The PostgreSQL Way” talk. I think he got “excited” about this concept of avoiding offset altogether, and I’m curious to see if he can make it work in SQLite faster than I’m adding SQLite content to Use The Index, Luke :)

SQLite supports limit and offset (*sigh*) but the optimizer currently does not consider limit for optimization. Offering the SQL:2008 fetch first … rows only syntax is not planned and might actually turn out to be hard because the parser is close to running out of 8-bit token codes (if I understood that right).

Joins are basically always executed as nested loop joins, but SQLite might create an “automatic transient index” for the duration of the query. We also figured out that there seems to be an oddity in the way CROSS JOIN works in SQLite: it turns the optimizer’s table reordering off. Non-cross joins, on the other hand, don’t enforce the presence of ON or USING, so you can still build a Cartesian product while using the optimizer’s smartness to reorder the tables.

Last but not least I’d like to mention that SQLite supports partial indexes the way it should be done: just a WHERE clause at index creation. Partial indexes are available since version 3.8.0 (released 2013-08-26).
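
That means the messages example from the DB2 post above becomes a one-liner in SQLite (a sketch):

CREATE INDEX messages_todo
          ON messages (receiver)
       WHERE processed = 'N';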

On the personal side, I’d describe Richard as very pragmatic and approachable. He joined Twitter a few months ago and all he does there is help people who ask questions about SQLite. Really, look at his timeline. That says a lot.

I’m really happy that I had the opportunity to meet Richard at a PostgreSQL conference—as happy as I was to meet Joe “Smartie” Celko years ago…at another PostgreSQL conference. Their excellent keynote speaker selection is just one more reason to recommend PostgreSQL conferences in general. Here are the upcoming ones, just in case you got curious.

ps.: There was also an unconference where I discussed the topic of Bitmap Index Only Scan (slides & minutes).

What’s left of NoSQL?

This is my own and very loose translation of an article I wrote for the Austrian newspaper derStandard.at in October 2013. As the article was very well received and the SQL vs. NoSQL discussion is currently hot again, I thought it might be a good time for a translation.

Back in 2013, The Register reported that Google sets its bets on SQL again. At first sight this might look like a surprising move, because it was, of all things, Google’s publications about MapReduce and BigTable that gave the NoSQL movement a big boost in the first place. At second sight it turns out that there is a trend to use SQL or similar languages to access data from NoSQL systems—and that’s not even a new trend. However, it raises a question: what remains of NoSQL if we add SQL again? To answer this question, I need to start with a brief summary of the history of databases.

Top Dog SQL

It was undisputed for decades: store enterprise data in a relational database and use the programming language SQL to process the stored data.

The theoretical foundation for this approach, the relational model, was laid down as early as 1970. The first commercial representative of this new database type became available in 1979: the Oracle Database, in release 2. Today’s appearance of relational databases was mainly shaped in the 1980s. The most important milestones were the formalization of the ACID criteria (Atomicity, Consistency, Isolation, Durability), which avoid accidental data corruption, and the standardization of the query language SQL—first by ANSI, one year later by ISO. From then on, SQL and ACID compliance were desirable goals for database vendors.

At first glance there were no major improvements during the following decades. The result: SQL and the relational model are sometimes called old or even outdated technology. At a closer look, SQL databases evolved into mature products during that time. This process took that long because SQL is a declarative language and was way ahead of the technical possibilities of the 1980s. “Declarative language” means that SQL developers don’t need to specify the steps that lead to the desired result as with imperative programming; they just describe the desired result itself. The database finds the most efficient way to produce this result automatically—which is by no means trivial, so that early implementations only mastered it for simple queries. Over these decades, however, the hardware got more powerful and the software more advanced, so that modern databases deliver good results for complex queries too (see here if it doesn’t work for you).

It is especially important to note that the capabilities of SQL are not limited to storing and fetching data. In fact, SQL was designed to make refining and transforming data easy. Without any doubt, that was an important factor that made SQL and the relational model the first choice when it comes to databases.

And Then: NoSQL

Despite the omnipresence of SQL, a new trend emerged during the past few years: NoSQL. This term alone struck the nerve of many developers and caused a rapid spread that ultimately turned into a religious war. The opponents were SQL as symbol for outdated, slow, and expensive technology on the one side against an inhomogeneous collection of new, highly-scalable, free software that is united by nothing more than the umbrella brand “NoSQL.”

One possible explanation for the lost appreciation of SQL among developers is the increasing popularity of object-relational mapping (ORM) tools, which generally tend to reduce SQL databases to pure storage media (“persistence layer”). The possibility to refine data using SQL is not encouraged but considerably hindered by these tools. The result of their excessive use is step-by-step processing by the application. Under these circumstances SQL does indeed not deliver any additional value, and it becomes understandable why so many developers sympathize with the term NoSQL.

But the problem is that the term NoSQL was not aimed against SQL in the first place. To make that clear, the term was later defined to mean “not only SQL.” Thus, NoSQL is about complementary alternatives. To be precise, it is not even about alternatives to SQL but about alternatives to the relational model and ACID. In the meantime the CAP theorem revealed that the ACID criteria will inevitably reduce the availability of distributed databases. That means that traditional, ACID compliant, databases cannot benefit from the virtually unlimited resources available in cloud environments. This is what many NoSQL systems provide a solution for: instead of sticking to the very rigid ACID criteria to keep data 100% consistent all the time, they accept temporary inconsistencies to increase the availability in a distributed environment. Simply put: when in doubt, they prefer to deliver wrong (old) data rather than no data. A more correct but less catchy term would therefore be NoACID.

Deploying such systems only makes sense for applications that don’t need strict data consistency. These are quite often applications in the social media field that can still fulfil their purpose with old data and even accept the loss of some updates in case of service interruption. These are also applications that could possibly need the unlimited scalability offered by cloud infrastructure. If a central database is sufficient, however, the CAP theorem does not imply a compelling reason to abandon the safety of ACID. However, using NoSQL systems can still make sense in domains where SQL doesn’t provide sufficient power to process the data. Interestingly, this problem is also very dominant in the social media field: although it is easy to express a social graph in a relational model it is rather cumbersome to analyse the edges using SQL.

Nevertheless: Back to SQL

Notwithstanding the above, SQL is still a very good tool to answer countless questions. This is also true for questions that could not be foreseen at the time of application design—a problem many NoSQL deployments face after the first few years. That’s probably also the cause for the huge demand for powerful and generic query languages like SQL.

Another aspect that strengthens the trend back to SQL goes to the heart of NoSQL: without ACID it is very difficult to write reliable software. Google said that very clearly in their paper about the F1 database: without ACID “we find developers spend a significant fraction of their time building extremely complex and error-prone mechanisms to cope with eventual consistency and handle data that may be out of date.” Apparently you only learn to appreciate ACID once you’ve lost it.

What’s left of NoSQL?

If SQL and ACID become the flavor of the time again, you may wonder what’s left of the NoSQL movement. Did it really waste half a decade, as Jack Clark speculated at The Register? One thing we can say for sure is that SQL and ACID conformity are still desirable goals for database vendors. Further, we know that there is a trade-off between ACID and scalability in cloud environments.

Of course, the race for the Holy Grail of the ideal balance between ACID and scalability has already begun. The NoSQL systems entered the field from the corner of unlimited scalability and started adding tools to control data consistency, such as causal consistency. Obviously there are newcomers that use the term NewSQL to jump on the bandwagon by developing completely new SQL databases that bring ACID and SQL from the beginning but use NoSQL ideas internally to improve scalability.

And what about the established database vendors? They want to make us believe they are doing NoSQL too and release products like the “Oracle NoSQL Database” or “Windows Azure Table Storage Service.” The intention to create these products for the sole purpose of riding the hype is so striking that one must wonder why they treat neither NoSQL nor NewSQL as a serious threat to their business. When looking at the Oracle Database in its latest release 12c, we can even see the opposite trend. Although the version suffix “c” is supposed to express its cloud capability, it doesn’t change the fact that the killer feature of this release serves a completely different need: the easy and safe operation of multiple databases on a single server. That’s the exact opposite of what many NoSQL systems aim for: running a giant database on a myriad of cheap commodity servers. Virtualization is a way bigger trend than scale-out.

Is it even remotely possible that the established database vendors underwent such a fundamental misjudgment? Or is it more likely that NoSQL only serves a small market niche? How many companies really need to cope with data at the scale of Google, Facebook or Twitter? Incidentally, these are three companies that grew up on the open source database MySQL. One might believe the success of NoSQL is also based on the fact that it solves a problem that everybody would love to have. In reality this problem is only relevant to a very small but prominent community, which managed to get a lot of attention. After all, and also judging by the fact that the big database vendors don’t show serious engagement, NoSQL is nothing more than a storm in a teacup.

If you like my way to explain things, you’ll love SQL Performance Explained.

Thank You MySQL, We’ll Miss You!

Dear MySQL,

Thank you for introducing me to SQL. It must have been 1998 when we first met, and I fell in love with the simplicity of SQL immediately. Before that I had been using C structs all the time; I had to do my joins programmatically and also create and maintain my indexes manually. It was even hard to combine several search conditions via and and or. But then there was the shiny new world of SQL you were showing me…

Everything was easy. Just write a where clause, no matter how complex, and you found the right rows. Joins were equally easy to write, and you took all the effort of combining the data from several tables as I needed it. I also remember how easy it became to manage the schema. Instead of writing a program to copy my data from one C struct to another, I just say alter table now—in the meanwhile it even works online in many cases! It didn’t take long until I used SQL for stuff I wouldn’t have thought a database could do for me. So I was quickly embracing group by and co.

But I haven’t spent a lot of time with you lately. It’s not because I was too busy. I’m still practicing what you have shown me! And I’ve moved on. Now I’m using common table expressions to organize complex queries and I use window functions to calculate running totals or just do a ranking. I’m also using joins more efficiently because I know about hash and sort/merge joins. A while ago I was wondering why you didn’t tell me about these things. But then I realized that you don’t know them.

I know it was not always nice what I said about you recently. But now that Oracle announced your retirement to focus on Oracle NoSQL, I realized how right this comment on reddit was. Sure, you are neither the most advanced nor the most widely deployed open source SQL database, but you introduced millions of people to SQL. There should be no tears when some of them move away because they want to see what’s next. Isn’t that the greatest compliment a teacher can get? You can be proud of what you have accomplished.

Thank you!

Markus Winand
April 1, 2014

ps.: Even when fooling around, I can’t resist informing my fellow readers. Besides the claim that Oracle retires MySQL, everything is true. Initially I thought Oracle NoSQL was an April Fools’ joke, but it isn’t.

Results of the SQL Performance Quiz: 60% fail

In 2011, I launched “The 3-Minute Test: What do you know about SQL performance?” It consists of five questions that follow a simple pattern: each question shows a query/index pair and asks if it demonstrates proper indexing or not. To date, this test has become one of the most popular features on Use The Index, Luke and has been completed more than 28,000 times.

Note

Just in case you got curious, please be aware that this article is a spoiler. You might want to do the quiz yourself before continuing.

Although the quiz was created for educational purposes, I was wondering if I could get some interesting figures out of these 28,000 results. And I think I could. However, there are several things to keep in mind when looking at these figures. First, the quiz uses the surprise factor to catch attention. That means three questions show cases that look fine, but aren’t. One question does it the other way around and shows an example that might look dangerous, but isn’t. There is only one question where the correct answer is in line with the first impression. Another thing that might limit the significance of the results is that there was no representative selection of participants. Everybody can take the quiz. You can even do it multiple times and will probably get a better result the second time. Just keep in mind that the quiz was never intended to be used for scientific research on the indexing knowledge in the field. Nevertheless, I think that the size of the dataset is still good enough to get an impression.

Below I’ll show two different statistics for each question. First, the average rate at which this question was correctly answered. Second, how this figure varies among users of MySQL, Oracle, PostgreSQL and SQL Server. In other words, it shows whether e.g. MySQL users are more knowledgeable about indexing than PostgreSQL users. Spoiler: it’s the other way around. The only reason I’m in the lucky position to have this data is that the test sometimes uses vendor-specific syntax. For example, what is LIMIT in MySQL and PostgreSQL is TOP in SQL Server. Therefore, the participants have to select a database at the beginning so that the questions are shown in the native syntax of that product.

Question 1: Functions in the WHERE Clause

Is the following SQL good or bad practice from a performance perspective?

Searching for all rows of the year 2012:

CREATE INDEX tbl_idx ON tbl (date_column);

SELECT text, date_column
  FROM tbl
 WHERE TO_CHAR(date_column, 'YYYY') = '2012';

This is an example where the code uses functions specific to the Oracle and PostgreSQL databases. For MySQL, the question uses YEAR(date_column), and for SQL Server datepart(yyyy, date_column). Of course, I could have used EXTRACT(YEAR FROM date_column), but I thought it better to use the most common syntax.

The participants have these options:

  • Good practice — There is no major improvement possible.

  • Bad practice — There is a major improvement possible.

The correct answer is “bad practice” because the index on date_column cannot be used when searching on something derived from date_column. If you don’t believe that, please have a look at the proof scripts or the explanations shown at the end of the test. They also contain links to the Use The Index, Luke pages explaining it in more detail.

However, if you didn’t know how functions effectively “disable” indexes, you are not alone. Only about two-thirds gave the correct answer. And as with every multiple choice test, there is a certain probability of picking the correct answer by chance. In this case it’s a 50/50 chance — by no means negligible. I’ve marked this “guessing score” in the figure to emphasize that.

This is one of the most common problems I see in my everyday work. The same problem can also hit you with VARCHAR fields when using UPPER, TRIM or the like. Please keep in mind: whenever you are applying functions on columns in the where clause, an index on the column itself is no longer useful for this query.

Although this result is quite disappointing — I mean it’s just 17% better than guessing — it is no surprise to me. What is a surprise for me is how the result differs amongst the users of different databases.

As a matter of fact, MySQL users just score 55% — almost as low as the “guessing score”. PostgreSQL users, on the other hand, get a score of 83%.

An effect that might explain this result is that MySQL doesn’t support function-based indexes while Oracle and PostgreSQL do. Function-based indexes allow you to index expressions like TO_CHAR(date_column, 'YYYY'). Although it is not the recommended solution for this case, the pure existence of this feature might make users of the Oracle and PostgreSQL database more aware of this problem. SQL Server offers a similar feature: although it cannot index expressions directly, you can create a so-called computed column on expressions, which in turn can be indexed.
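For illustration, a PostgreSQL expression index for this query might look like the following sketch. Note the assumptions: date_column is a plain date (TO_CHAR is not immutable in PostgreSQL, so the sketch uses EXTRACT instead), and the query must repeat the exact same expression. As noted above, this is not the recommended fix.

CREATE INDEX tbl_year_idx
    ON tbl ((EXTRACT(YEAR FROM date_column)));

SELECT text, date_column
  FROM tbl
 WHERE EXTRACT(YEAR FROM date_column) = 2012;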

Although support for function-based indexes might explain why MySQL users underperformed, it is still no excuse. The shown query/index pair is bad — no matter whether the database supports function-based indexes or not. And the major improvement is also possible without function-based indexes:

SELECT text, date_column
  FROM tbl
 WHERE date_column >= TO_DATE('2012-01-01', 'YYYY-MM-DD')
   AND date_column <  TO_DATE('2013-01-01', 'YYYY-MM-DD');

The index doesn’t need to be changed. This solution is very flexible because it supports queries for different ranges too — e.g. by week or month. This is the recommended solution.
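For example, narrowing the search to May 2012 only changes the boundaries; the index stays the same:

SELECT text, date_column
  FROM tbl
 WHERE date_column >= TO_DATE('2012-05-01', 'YYYY-MM-DD')
   AND date_column <  TO_DATE('2012-06-01', 'YYYY-MM-DD');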

As a curious guy, I’d love to know how many of the people who correctly answered this question were thinking of the sub-optimal solution to use a function-based index. I’d rate this solution half-correct at best.

Question 2: Indexed Top-N Queries

Is the following SQL good or bad practice from a performance perspective?

To find the most recent row:

CREATE INDEX tbl_idx ON tbl (a, date_column);

SELECT id, a, date_column
  FROM tbl
 WHERE a = ? 
 ORDER BY date_column DESC
 LIMIT 1;

Note that the question mark is a placeholder, because I always encourage developers to use bind parameters.

As participant, you have these two options again:

  • Good practice — There is no major improvement possible.

  • Bad practice — There is a major improvement possible.

This is the question that is supposed to look dangerous, but isn’t. Generally, it seems like people believe order by must always sort the data. This index, however, eliminates the need to sort the data entirely so that the query is basically as fast as a unique index lookup. Please find a detailed explanation of this trick here.

The result is very close to the “guessing score” which I interpret as “people don’t have a clue about it.”

This result is particularly sad because I’ve seen people build caching tables, regularly refilled by a cron job, to avoid queries like this. Interestingly, the cron job itself tends to cause performance problems because it runs at rather short intervals to make sure the cache table has fresh data. However, the right index is often the better option in the first place.

Here I have to mention that the Oracle database requires the most unusual syntax for this trick. Up till version 12c, released in 2013, the Oracle database did not offer a handy shortcut such as LIMIT or TOP. Instead, you have to use the ROWNUM pseudo-column like this:

SELECT *
  FROM (
        SELECT id, date_column
          FROM tbl
         WHERE a = :a
         ORDER BY date_column DESC
       )
 WHERE rownum <= 1;

The extra complexity of this query might have pushed Oracle users more heavily towards the wrong answer — they actually scored below the guessing score!
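For completeness: since release 12c, Oracle also accepts the SQL:2008 syntax, which reads much like the LIMIT variant shown earlier:

SELECT id, date_column
  FROM tbl
 WHERE a = :a
 ORDER BY date_column DESC
 FETCH FIRST 1 ROW ONLY;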

Another argument I’m getting in response to this question is that including the ID column in the index would allow an index-only scan. Although this is correct, I don’t consider not doing so a “bad practice” because the query touches only one row anyway. An index-only scan could just avoid a single table access. Obviously there are cases where you need that improvement, but in the general case I’d consider it a premature optimization. But that’s just my opinion. However, following this argument might give us an idea why PostgreSQL users got the best score (again): PostgreSQL did not have index-only scans until version 9.2, which was released in September 2012. As a result, PostgreSQL users could not fall into the trap of thinking an index-only scan can bring major improvements in this case. Undoubtedly, the term “major” is troublesome in this context.

Question 3: Index Column Order

Is the following SQL good or bad practice from a performance perspective?

Two queries, searching by a common column:

CREATE INDEX tbl_idx ON tbl (a, b);

SELECT id, a, b
  FROM tbl
 WHERE a = ?
   AND b = ?;


SELECT id, a, b
  FROM tbl
 WHERE b = ?;

The same options:

  • Good practice — There is no major improvement possible.

  • Bad practice — There is a major improvement possible.

The answer is “bad practice,” because the second query cannot use the index properly. Changing the index column order to (b, a) would, however, allow both queries to use this index in the most efficient way. You can find a full explanation here. Adding a second index on (b) would be a poor solution due to the overhead it adds for no reason. Unfortunately, I don’t know how many would have done that.
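In other words, the fix is not a second index but the same index with its columns reversed:

CREATE INDEX tbl_idx ON tbl (b, a);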

The result is disappointing, but in line with my expectations — just 12.5% above guessing score.

This is also a problem I see almost every day. People just don’t understand how multi-column indexes work.

All the per-database results are pretty close together. Maybe that’s because there is no syntactic difference or well-known database feature that could have a major influence on the answer. Lesser-known features like Oracle’s SKIP SCAN could have a minor impact, of course. Generally, the index-only scan could have an influence too, but this time it pushes the participants towards the “right” answer.

After all, this result might just say that users of some databases know more about indexing than others. Interestingly, PostgreSQL users get the best score for the third time.

Question 4: LIKE Searches

Is the following SQL troublesome or bulletproof from a performance perspective?

Searching within a string:

CREATE INDEX tbl_idx ON tbl (text);

SELECT id, text
  FROM tbl
 WHERE text LIKE '%TERM%';

I’ve phrased the options differently this time:

  • Bulletproof: It will always run fast.

  • Troublesome: There is high risk for performance problems.

And the correct answer is “troublesome” because the LIKE pattern uses a leading wild card. Otherwise, if it used the pattern 'TERM%', it could use the index very efficiently. Have a look at this visual explanation for details.

The results of this question are promising. Here I feel safe to say most people know that LIKE is not for full-text search.

The segregated results are also within a very narrow corridor:

It is, however, strange that PostgreSQL users underperformed on this question. A closer look at how the question is presented to PostgreSQL users might give an explanation:

CREATE INDEX tbl_idx ON tbl (text varchar_pattern_ops);

SELECT id, text
  FROM tbl
 WHERE text LIKE '%TERM%';

Note the addition to the index (varchar_pattern_ops). In PostgreSQL, this special operator class is required to make the index usable for postfix wild card searches (e.g. 'TERM%'). I added this because I aimed to find out if people know about the problem of leading wild cards in LIKE expressions. Without the operator class, there would be two reasons why it doesn’t work: (1) the leading wild card; (2) the missing operator class. I thought that would be too obvious. Retrospectively, I believe some participants interpreted this operator class as “magic that makes it work” and thus picked the wrong answer.

Question 5a: Index-Only Scan

Question five is a little bit tricky, because PostgreSQL did not support index-only scans when the quiz was created. For that reason, there are two variants of question five: one about index-only scans for users of MySQL, Oracle and SQL Server, and another about index column order for PostgreSQL users. Both results are presented here, but the segregated data is limited for obvious reasons. We start with the question about index-only scans:

How will the change affect query performance?

Current situation, selecting about a hundred rows out of a million:

CREATE INDEX tab_idx ON tbl (a, date_column);

SELECT date_column, count(*)
  FROM tbl
 WHERE a = 123
 GROUP BY date_column;

Changed query, selecting about ten rows out of a million:

SELECT date_column, count(*)
  FROM tbl
 WHERE a = 123
   AND b = 42
 GROUP BY date_column;

Note the added where clause in the second query.

This question is also special because it offers four options:

  • Query performance will roughly stay the same (+/- 10%)

  • Depends on the data.

  • The query will be much slower (impact >10%)

  • The query will be much faster (impact >10%)

When I created the quiz, I was well aware that 50/50 questions have a tendency to render the score meaningless. This question is a trade-off between keeping the questions easy to grasp and answer (questions 1–4) and giving a more accurate result.

To make it short, the correct answer is “the query will be much slower.” The reason is that the original query could use an index-only scan — that is, the query could be answered using data from the index alone, without fetching any data from the actual table. The second query, however, needs to check column B too, which is not in the index. Consequently, the database must take some extra effort to fetch the candidate rows from the table to evaluate the expression on B. That is, it must fetch at least 100 rows from the table — the number of rows returned by the first query. Due to the group by there are probably more rows to fetch. That is quite a considerable extra effort that makes the query much slower. A more exhaustive explanation is here.
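If the second query is frequent enough to justify it, one possible fix — a sketch, not part of the quiz — is to extend the index so the index-only scan survives the added filter on b:

-- All filter and group-by columns are now in the index, and rows
-- arrive ordered by date_column within each (a, b) combination,
-- so the group by needs no extra sort either.
CREATE INDEX tab_idx ON tbl (a, b, date_column);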

With that number of options, the overall score drops significantly to about 39% or 14% above guessing score.

Still, I think it would be wrong to say that about 39% of participants knew the right answer. They gave the right answer, but there was a 25% probability of picking it by chance without knowing it.

The segregation by database is quite boring. Conspicuously boring.

However, with four options, it is also interesting to see how people actually answered.

That caught me by surprise. Both options “roughly the same” and “depends on the data” got about 25% — the guessing probability. Does this mean half of the participants were guessing? As it is the last question, some participants might have picked a random option just to get through. Quite possible. However, the correct option “much slower” got 38.8% at the cost of the “much faster” option, which got just 10.9%.

My intention with this question was to trap people into the “much faster” option because fetching less data should be faster — except when breaking an index-only scan. The only hypothesis I have for this result is that people might have got the idea that the obvious answer isn’t the correct one. That, however, would mean the 39% score doesn’t prove anything about the knowledge of this phenomenon in the field.

Another effect that I expected to have an impact is the attitude that “it always depends on the data.” Of course there are edge cases where the performance might indeed stay roughly the same — e.g., when all inspected rows are in the same table block. However, this is rather unlikely — simply because there would be no point in adding the date_column for an index-only scan in the first place.

Question 5b: Index Column Order and Range Operators

This question is only shown to PostgreSQL users.

Is the following SQL good or bad practice from a performance perspective?

Searching for entries in state X, not older than 5 years.

CREATE INDEX tbl_idx ON tbl (date_column, state);

SELECT id, date_column, state
  FROM tbl
 WHERE date_column >= CURRENT_DATE - INTERVAL '5' YEAR
   AND state = 'X';

(365 rows)

The data distribution is as follows:

SELECT count(*)
  FROM tbl
 WHERE date_column >= CURRENT_DATE - INTERVAL '5' YEAR;

 count 
-------
  1826

SELECT count(*)
  FROM tbl
 WHERE state = 'X';

 count 
-------
 10000

There is an index with two columns and a query that filters on both of them. One filter uses an equals operator, the other a greater-than-or-equal operator. Each filter used individually returns many more rows than both filters combined.

The options are:

  • Good practice — There is no major improvement possible.

  • Bad practice — There is a major improvement possible.

And the correct answer is “bad practice” because the column order in the index is the wrong way around. The general rule is that index columns can be used efficiently from the left-hand side as long as they are used with equals operators. Further, one more column can be used efficiently with a range operator. However, the first range operator effectively cuts off the index, so that further columns cannot be used efficiently anymore. By “efficiently” I mean as an index access predicate. Please find a visualization here.

With the original index shown above, the query has to fetch 1826 entries from the index (those matching the date_column filter) and check each of them for the value of the state column. If we turn the index column order around, the database can use both filters efficiently (= as access predicates) and directly limit the index access to the 365 rows of interest.
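Concretely, turning the column order around means:

CREATE INDEX tbl_idx ON tbl (state, date_column);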

And this is how people answered:

Wait a moment — that’s below the guessing score! It’s not just that people don’t know; they believe the wrong thing. However, I must admit that the term “major” is very problematic again. When I run this example, the speed-up I get is just 70%. Not even twice as fast.

Overall Score: How Many Passed the Test?

Looking at each question individually is interesting, but doesn’t tell us how many participants managed to answer all five questions correctly, for example. The following chart has that information.

Finally, I’d like to boil this chart down to a single figure: how many participants “passed” the test?

Considering that the test has only five questions, out of which four are 50/50 questions, I think it is fair to say three correct answers isn’t enough to pass the test. Requiring five correct answers would quite obviously be asking for too much. Requiring four correct answers to “pass” the test is therefore the only sensible choice I see. Using this definition, only 38.2% passed the test. The chance to pass the test by guessing is still 12.5%.

If you like this article and want to learn about proper indexing, SQL Performance Explained is for you.

Unreasonable Defaults: Primary Key as Clustering Key

As you might have noticed—at least if you have read SQL Performance Explained—I don’t think clustered indexes are as useful as most people believe. That is mainly because it is just too darn difficult to choose a good clustering key. As a matter of fact, choosing a good—the “right”—clustering key is almost impossible if there are more than one or two indexes on the table. The result is that most people just stick to the default—which is the primary key. Unfortunately, this is almost always the worst possible choice.

In this article I explain the beast named clustered index and all its downsides. Although this article uses SQL Server as the demo database, it is equally relevant for MySQL/MariaDB with InnoDB and for the Oracle database when using index-organized tables.

Recap: What is a clustered index

The idea of clustered indexes is to store a complete table in a B-tree structure. If a table has a clustered index, it basically means the index is the table. A clustered index has a strict row order like any other B-tree index: it sorts the rows according to the index definition. In case of clustered indexes we call the columns that define the index order the clustering key. The alternative way to store a table is as a heap with no particular row order. Clustered indexes have the advantage that they support very fast range scans. That is, they can fetch rows with the same (or similar) clustering key value quickly, because those rows are physically stored next to each other (clustered)—often in the same data page. When it comes to clustered indexes it is very important to understand that there is no separate place where the table is stored. The clustered index is the primary table store—thus you can have only one per table. That’s the definition of clustered indexes—it’s a contrast to heap tables.

There is, however, another contrast to clustered indexes: non-clustered indexes. Just because of the naming, this is the more natural counterpart to clustered indexes for many people. From this perspective, the main difference is that querying a clustered index is always done as an index-only scan. A non-clustered index, on the other hand, has just a subset of the table columns, so it causes extra IOs to get the missing columns from the primary table store if needed. Every row in a non-clustered index has a reference to the very same row in the primary table store for this purpose. In other words, using a non-clustered index generally involves resolving an extra level of indirection. Generally, I said. In fact it is pretty easy to avoid this overhead by including all needed columns in the non-clustered index. In that case the database can find all the required data in the index and doesn’t need to resolve the extra level of indirection. Even non-clustered indexes can be used for index-only scans—making them as fast as clustered indexes. Isn’t that what matters most?

Note

The index-only scan is the important concept—not the clustered index.

Later in the article, we’ll see that there is a duality among these aspects of clustered indexes: being a table store (in contrast to heap tables) or being an index that happens to have all table columns. Unfortunately, this “table-index duality” can be as confusing as the wave-particle duality of light. Hence, I’ll explicitly state which aspect appears at the time whenever necessary.

The costs of an extra level of indirection

When it comes to performance, an extra level of indirection is not exactly desirable because dereferencing takes time. The crucial point here is that the cost of dereferencing is greatly affected by the way the table is physically stored—either as a heap table or as a clustered index.

The following figures explain this phenomenon. They visualize the process of executing a query that fetches all SALES rows from 2012-05-23. The first figure uses a non-clustered index on SALE_DATE together with a heap table (= a table that doesn’t have a clustered index):

Note that there is a single Index Seek (Non-Clustered) operation on the non-clustered index that causes two RID Lookups into the heap table (one for each matched row). This operation reflects dereferencing the extra indirection to load the remaining columns from the primary table store. For heap tables, a non-clustered index uses the physical address (the so-called RID) to refer to the very same row in the heap table. In the worst case the extra level of indirection causes one additional read access per row (neglecting forwarding).

Now let’s look at the same scenario with a clustered index. More precisely, when using a non-clustered index in the presence of a clustered index on SALE_ID—that is, the primary key as clustering key.

Note that the definition of the index on the left-hand side has not changed: it’s still a non-clustered index on SALE_DATE. Nevertheless, the mere presence of the clustered index affects the way the non-clustered index refers to the primary table storage—which is the clustered index! Unlike heap tables, clustered indexes are “living creatures” that move rows around as needed to maintain their properties (i.e., the row order and tree balance). Consequently, the non-clustered index can’t use the physical address as a reference anymore because it could change at any time. Instead, it uses the clustering key SALE_ID as the reference. Loading the missing columns from the primary table store (= clustered index) now involves a full B-tree traversal for each row. That means several extra IOs per row, as opposed to a single extra IO in the case of heap tables.

I also refer to this effect as the “clustered index penalty”. This penalty affects all non-clustered indexes on tables that have a clustered index.

Note

For index-organized tables, the Oracle database also stores a physical address of the row (the so-called “guess”) along with the clustering key in the secondary indexes (= non-clustered indexes). If the row is found at this address, the database doesn’t need to perform the B-tree traversal. If not, however, it has performed one more IO for nothing.

How bad is it?

Now that we have seen that clustered indexes cause a considerable overhead for non-clustered indexes, you’ll probably like to know how bad it is. The theoretic answer has been given above—one IO for the RID Lookup compared to several IOs for the Key Lookup (Clustered). However, as I learned the hard way when giving performance trainings, people tend to ignore, deny, or just not believe that fact. Hence, I’ll show you a demo.

Obviously, I’ll use two similar tables that vary only in the table storage they use (heap vs. clustered). The following listing shows the pattern used to create these tables. The part in square brackets makes the difference between using a heap table or a clustered index as the table store.

CREATE TABLE sales[nc] (
    sale_id      NUMERIC       NOT NULL,
    employee_id  NUMERIC       NOT NULL,
    eur_value    NUMERIC(16,2) NOT NULL,
    sale_date    DATE          NOT NULL,
    CONSTRAINT sales[nc]_pk
       PRIMARY KEY [nonclustered] (sale_id)
);

CREATE INDEX sales[nc]2 ON sales[nc] (sale_date);

I’ve filled these tables with 10 million rows for this demo.

The next step is to craft a query that allows us to measure the performance impact.

SELECT TOP 1 * 
  FROM salesnc
 WHERE sale_date > '2012-05-23'
 ORDER BY sale_date

The whole idea behind this query is to use the non-clustered index (hence filtering and ordering on SALE_DATE) to fetch a variable number of rows (hence TOP N) from the primary table store (hence select * to make sure it’s not executed as an index-only scan).

Now let’s look at what happens if we SET STATISTICS IO ON and run the query against the heap table:

Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0,…

The interesting figure is the logical reads count of four. There were no physical reads because I executed the statement twice, so it ran from the cache. Knowing that we fetched a single row from a heap table, we can already conclude that the tree of the non-clustered index must have three levels. Together with one more logical read to access the heap table, we get the total of four logical reads seen above.

To verify this hypothesis, we can change the TOP clause to fetch two rows:

SELECT TOP 2 * 
  FROM salesnc
 WHERE sale_date > '2012-05-23'
 ORDER BY sale_date
Scan count 1, logical reads 5, physical reads 0, read-ahead reads 0,…

Keep in mind that the lookup in the non-clustered index is essentially unaffected by this change—it still needs three logical reads if the second row happens to reside in the same index page—which is very likely. Hence we see just one more logical read because of the second access to the heap table. This corresponds to the first figure above.

Now let’s do the same test with the table that has (=is) a clustered index:

SELECT TOP 1 * 
  FROM sales
 WHERE sale_date > '2012-05-23'
 ORDER BY sale_date
Scan count 1, logical reads 8, physical reads 0, read-ahead reads 0,…

Fetching a single row involves eight logical reads—twice as many as before. If we assume that the non-clustered index has the same tree depth, it means that the Key Lookup (Clustered) operation triggers five logical reads per row. Let’s check that by fetching one more row:

SELECT TOP 2 * 
  FROM sales
 WHERE sale_date > '2012-05-23'
 ORDER BY sale_date
Scan count 1, logical reads 13, physical reads 0, read-ahead reads 0,…

As feared, fetching a second row triggers five more logical reads.

I’ve continued these tests in 1/2/5 steps up to 100 rows to get more meaningful data:

Rows Fetched | Logical Reads (Heap) | Logical Reads (Clustered)
-------------|----------------------|---------------------------
           1 |                    4 |                          8
           2 |                    5 |                         13
           5 |                    8 |                         27
          10 |                   13 |                         48
          20 |                   23 |                         91
          50 |                   53 |                        215
         100 |                  104 |                        418

I’ve also fitted linear equations to the chart to see how the slopes differ. The heap table matches the theoretic figures pretty closely (3 + 1 per row), but we see “only” four logical reads per row with the clustered index—not the five we would have expected from fetching just one and two rows.

Who cares about logical reads anyway

Logical reads are a fine benchmark because they yield reproducible results—independent of the current state of the caches or other work done on the system. However, the above chart is still a very theoretic figure. Four times as many logical reads does not automatically mean four times as slow. In reality you’ll have a large part of the clustered index in the cache—at least the first few B-tree levels—which definitely reduces the impact. To see by how much, I conducted another test: measuring the execution time when running the queries from the cache. Avoiding disk access is another way to get more or less reproducible figures that can be compared to each other.

Again, I used 1/2/5 steps but started at 10,000 rows—selecting fewer rows was too fast for the timer’s resolution. Selecting more than 200,000 rows took extraordinarily long, so I believe the data didn’t fit into the cache anymore. Hence I stopped collecting data there.

Rows Fetched | Time (Heap) | Time (Clustered)
-------------|-------------|------------------
      10,000 |          31 |                78
      20,000 |          47 |               130
      50,000 |         109 |               297
     100,000 |         203 |               624
     200,000 |         390 |             1,232

From this test it seems that the “clustered index penalty” on the non-clustered index is more like three times as high as compared to using a heap table.

Notwithstanding these results, it is perfectly possible that real-world caching in a production system leads to a “clustered index penalty” outside the range observed here.

What was the upside of clustered indexes again?

The upside of clustered indexes is that they can deliver subsequent rows quickly when accessed directly (not via a non-clustered index). In other words, they are fast if you use the clustering key to fetch several rows. Remember that the primary key is the clustering key by default. In that case, it means fetching several rows via the primary key—with a single query. Hmm. Well, you can’t do that with an equals filter. But how often do you use non-equals filters like > or < on the primary key? Sometimes, maybe, but not so often that it makes sense to optimize the physical table layout for these queries and punish all other indexes with the “clustered index penalty.” That is really insane (IMHO).

Luckily, SQL Server is quite flexible when it comes to clustered indexes. As opposed to MySQL/InnoDB, SQL Server can use any column(s) as clustering key—even non-unique ones. You can choose the clustering key so it catches the most important range scan. Remember that equals predicates (=) can also cause range scans (on non-unique columns). But beware: if you are using long columns and/or many columns as the clustering key, they bloat all non-clustered indexes, because each row in every non-clustered index contains the clustering key as the reference to the primary table store. Further, SQL Server makes non-unique clustering keys automatically unique by adding an extra column, which also bloats all non-clustered indexes. Although this bloat will probably not affect the tree depth—thanks to the logarithmic growth of the B-tree—it might still hurt the cache-hit rate. That’s also why the “clustered index penalty” could be outside the range indicated by the tests above—in either direction!
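As a sketch of that flexibility, with hypothetical names rather than a recommendation for any particular workload: a table can keep its primary key non-clustered while clustering on the column that serves the most important range scan.

CREATE TABLE sales (
    sale_id     NUMERIC       NOT NULL,
    employee_id NUMERIC       NOT NULL,
    eur_value   NUMERIC(16,2) NOT NULL,
    sale_date   DATE          NOT NULL,
    CONSTRAINT sales_pk
       PRIMARY KEY NONCLUSTERED (sale_id)
);

-- Cluster on the range-scan column instead of the primary key.
-- Non-unique is fine: SQL Server adds a hidden uniquifier internally.
CREATE CLUSTERED INDEX sales_cluster
    ON sales (sale_date);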

Even if you are able to identify the clustering key that brings the most benefit for range scans, the overhead it introduces on the non-clustered indexes can void the gains again. As I said in the intro above, it is just too darn difficult to estimate the overall impact of these effects because the clustering key affects all indexes on the table in a pretty complex way. Therefore, I’m very much in favour of using heap tables if possible and index-only scans when necessary. From this perspective, clustered indexes are just an additional space optimization in case the “clustered index penalty” isn’t an issue—most importantly if you need only one index that is used for range scans.

Note

Some databases don’t support heap tables at all—most importantly MySQL/MariaDB with InnoDB and the Azure database.

Further there are configurations that can only use the primary key as clustering key—most importantly MySQL/MariaDB with InnoDB and Oracle index-organized tables.

Note that MySQL/MariaDB with InnoDB is on both lists. They don’t offer any alternative to what I referred to as “insane” above—MyISAM being no alternative.

Concluding: How many clustered indexes can a SQL Server table have?

To conclude this article, I’d like to demonstrate why it is a bad idea to consider the clustered index as the “silver bullet” that deserves all your thoughts about performance. This demo is simple. Just take a second and think about the following question:

How many clustered indexes can a SQL Server table have?

I love to ask this question in my SQL Server performance trainings. It truly reveals the bad understanding of clustered indexes.

The first answer I usually get is “one!” Then I ask why I can’t have a second one. Typical response: —silence—. After this article, you already know that a clustered index is not only an index but also the primary table store. You can’t have two primary table stores. Logical, isn’t it? Next question: do I need to have a clustered index on every table? Typical response: —silence—, or “no” in a doubtful tone. I guess this is when the participants realize that I’m asking trick questions. However, you now know that SQL Server doesn’t need a clustered index on every table. In the absence of a clustered index, SQL Server uses a heap as the table store.

By definition, the answer to the question is “at most one.”

And now stop thinking about the clustered index as “silver bullet” and start putting the index-only scan into the focus. For that, I’ll rephrase the question slightly:

How many indexes on a SQL Server table can be queried as fast as a clustered index?

The only change to the original question is that it doesn’t focus on the clustered index as such anymore. Instead it puts your focus on the positive effect you expect from a clustering index. The point in using a clustered index is not just to have a clustered index, it is about improving performance. So, let’s focus on that.

What makes the clustered index fast is that every (direct) access is an index-only scan. So the question essentially boils down to: how many indexes can support an index-only scan? And the simple answer is: as many as you like. All you need to do is add all the required columns to a non-clustered index and *BAM*, using it is as fast as though it were a clustered index. That’s what SQL Server’s INCLUDE keyword of the CREATE INDEX statement is for!
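A minimal sketch, reusing the demo’s column names: INCLUDE adds the extra columns to the index leaf nodes only, so they are available for index-only scans without widening the searchable part of the key.

CREATE NONCLUSTERED INDEX sales_date_incl
    ON sales (sale_date)
       INCLUDE (employee_id, eur_value);

-- This query can now run as an index-only scan:
SELECT sale_date, employee_id, eur_value
  FROM sales
 WHERE sale_date = '2012-05-23';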

Focusing on index-only scans instead of clustered indexes has several advantages:

  • You are not limited to one index. Any index can be as fast as a clustered index.

  • Adding INCLUDE columns to a non-clustered index doesn’t affect anything other than this particular index. There is no penalty that hurts all other indexes!

  • You don’t need to add all table columns to a non-clustered index to enable an index-only scan. Just add the columns that are relevant for the query you’d like to tune. That keeps the index small and can thus become even faster than a clustered index.

  • And the best part is: there is no mutual exclusion of index-only scans and clustered indexes. Index-only scans work irrespective of the table storage. You can extend non-clustered indexes for index-only scans even if there is a clustered index. That’s also an easy way to avoid paying the “clustered index penalty” on non-clustered indexes.

Because of the “clustered index penalty” the concept of the index-only scan is even more important when having a clustered index. Really, if there is something like a “silver bullet”, it is the index-only scan—not the clustered index.

If you like my way to explain things, you’ll love SQL Performance Explained.

MongoDB is to NoSQL like MySQL to SQL — in the most harmful way

Yesterday evening I tweeted: “MongoDB seems to be as bad for NoSQL as MySQL is for SQL.” Unfortunately, I tweeted without context. But I guess I couldn’t have given all the required context in a single tweet anyway, so I’m dedicating this post to it. I hope this answers some of the questions I’ve got in response to the tweet.

First of all, I think everybody should know that I’m not a NoSQL fanboy, yet I’m open to the polyglot persistence idea. This distinction doesn’t seem to make sense if you read NoSQL as “not only SQL” (as you are supposed to do). However, I believe there are NoSQL systems out there that greatly benefit from the idea that SQL is bad and not using SQL is good. In other words, they offer “not using SQL” as their main advantage. MongoDB seems to be one of them. Just my perception.

But if I don’t like NoSQL, then I should like MySQL? Not exactly. In my eyes, MySQL has done great harm to SQL because many of the problems people associate with SQL are in fact just MySQL problems. One of the more important examples is that MySQL is rather poor at joining because it only supports nested loop joins. Most other SQL databases implement the hash join and sort/merge join algorithms too—both deliver better performance for non-tiny data sets. Considering the wide adoption of MySQL (“the most popular open source database”) and the observation that many people move away from SQL because “joins are slow,” it isn’t far-fetched to say that an implementation limitation of MySQL pushes people towards NoSQL.

Now let’s look at MongoDB. I think the direct competition between MongoDB and MySQL became most obvious in the epic video “MongoDB is Web Scale.” In the meanwhile, MongoDB even claims to be “the leading NoSQL database”—does that sound like “the most popular open source database”? Nevertheless, MongoDB has disappointed many people because it couldn’t live up to its promise of “web scale” (example: a global write lock up to release 2.2).

The next piece in the puzzle that eventually caused me to tweet was a funny tweet by Gwen (Chen) Shapira (she’s an Oracle DB consultant):

#mongoDB : the big data platform that is challenging to scale over 100GB. http://blog.mongodb.org/post/62899600960/scaling-advice-from-mongohq

Note that the link was broken for a while (the post originally appeared on Sep 30, then disappeared, but has been back online at a different URL since Oct 2). The article is about handling MongoDB if it grows above 100GB. It gives me the impression that scaling MongoDB to that size is a serious issue. Even though there is no exact definition of “web scale,” I guess most people would assume that it should be easy to scale MongoDB to 100GB. 100GB is not big data nowadays. 100GB can easily be managed with most SQL databases (joining in MySQL could be a problem). It was really funny to see this post on the MongoDB blog. Chen’s tweet nailed it.

At this point, I was once more thinking about the “misspent half-decade” mentioned by Jack Clark in his article “Google goes back to the future with SQL F1 database.” But as mentioned before, I like the idea of polyglot persistence. I’m not saying NoSQL is bullshit—certainly not just because a single implementation fails to deliver. That would be like saying SQL is bullshit because MySQL is bad at joining. On the contrary, it reminded me how Alex Popescu lost his temper in his post “The premature return to SQL” last Friday. His response to the “misspent half-decade” was:

Just take a second and think what we got during this misspent half-decade: Redis, Cassandra, Riak, a multi-parallel fully programmatic way to process data, Cascading, Pig, Cypher, ReQL and many more tools, languages, and APIs for processing data.

Well, I don’t know all of these, but I do realize that some of them are interesting tools to have in the toolbox. Further, I’ve been following Alex Popescu long enough to know that he is rather reflective about NoSQL—the title of his post being an exception. That’s why I came back to his post to see if he mentioned “the leading NoSQL database” in his list. He didn’t. I don’t think it’s a coincidence.

At this point it was inevitable to see MongoDB as a popular, yet poor representative of its species—just like MySQL is.

If you like my way to explain things, you’ll love SQL Performance Explained.

About the Author

Photo of Markus Winand
Markus Winand tunes developers for high SQL performance. He also published the book SQL Performance Explained and offers in-house training as well as remote coaching at http://winand.at/
