Concatenated Indexes


Even though the database creates the index for the primary key automatically, there is still room for manual refinements if the key consists of multiple columns. In that case the database creates an index on all primary key columns—a so-called concatenated index (also known as multi-column, composite or combined index). Note that the column order of a concatenated index has great impact on its usability so it must be chosen carefully.

For the sake of demonstration, let’s assume there is a company merger. The employees of the other company are added to our EMPLOYEES table so it becomes ten times as large. There is only one problem: the EMPLOYEE_ID is not unique across both companies. We need to extend the primary key by an extra identifier—e.g., a subsidiary ID. Thus the new primary key has two columns: the EMPLOYEE_ID as before and the SUBSIDIARY_ID to reestablish uniqueness.

The index for the new primary key is therefore defined in the following way:

CREATE UNIQUE INDEX employee_pk 
    ON employees (employee_id, subsidiary_id);

A query for a particular employee has to take the full primary key into account—that is, the SUBSIDIARY_ID column also has to be used:

SELECT first_name, last_name
  FROM employees
 WHERE employee_id   = 123
   AND subsidiary_id = 30
DB2
Explain Plan
--------------------------------------------------------
ID | Operation             |                 Rows | Cost
 1 | RETURN                |                      |   13
 2 |  FETCH EMPLOYEES      |     1 of 1 (100.00%) |   13
 3 |   IXSCAN EMPLOYEES_PK | 1 of 10000 (   .01%) |    6

Predicate Information
 3 - START (Q1.EMPLOYEE_ID = +00123.)
     START (Q1.SUBSIDIARY_ID = +00030.)
      STOP (Q1.EMPLOYEE_ID = +00123.)
      STOP (Q1.SUBSIDIARY_ID = +00030.)
MySQL
Try online at SQL Fiddle+----+-----------+-------+---------+---------+------+-------+
| id | table     | type  | key     | key_len | rows | Extra |
+----+-----------+-------+---------+---------+------+-------+
|  1 | employees | const | PRIMARY | 10      |    1 |       |
+----+-----------+-------+---------+---------+------+-------+

As before, MySQL is able to use access type const because the query cannot match more than one row. Note that the key lengths (key_len) has become bigger because it now uses two columns of the index. See Section 4.3 for more details.

Oracle
Try online at SQL Fiddle---------------------------------------------------------------
|Id |Operation                   | Name         | Rows | Cost |
---------------------------------------------------------------
| 0 |SELECT STATEMENT            |              |    1 |    2 |
| 1 | TABLE ACCESS BY INDEX ROWID| EMPLOYEES    |    1 |    2 |
|*2 |  INDEX UNIQUE SCAN         | EMPLOYEES_PK |    1 |    1 |
---------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("EMPLOYEE_ID"=123 AND "SUBSIDIARY_ID"=30)
PostgreSQL
Try online at SQL Fiddle                QUERY PLAN
-------------------------------------------
 Index Scan using employees_pk on employees 
   (cost=0.00..8.27 rows=1 width=14)
   Index Cond: ((employee_id   = 123::numeric)
            AND (subsidiary_id = 30::numeric))
SQL Server
Try online at SQL Fiddle
|--Nested Loops(Inner Join)
   |--Index Seek(OBJECT:employees_pk,
   |               SEEK:employee_id=@ AND subsidiary_id=@2
   |            ORDERED FORWARD)
   |--RID Lookup(OBJECT:employees,
                   SEEK:Bmk1000=Bmk1000
                 LOOKUP ORDERED FORWARD)

Whenever a query uses the complete primary key, the database can use an INDEX UNIQUE SCAN—no matter how many columns the index has. But what happens when using only one of the key columns, for example, when searching all employees of a subsidiary?

SELECT first_name, last_name
  FROM employees
 WHERE subsidiary_id = 20
DB2
Explain Plan
-------------------------------------------------------
ID | Operation         |                    Rows | Cost
 1 | RETURN            |                         |  689
 2 |  TBSCAN EMPLOYEES | 1195 of 10000 ( 11.95%) |  689

Predicate Information
 2 - SARG (Q1.SUBSIDIARY_ID = +00002.)

The query was executed with SUBSIDIARY_ID = 2 to get the expected result. That has no impact on the subject that matters here.

The operation TBSCAN indicates a full table scan.

MySQL
Try online at SQL Fiddle+----+-----------+------+------+-------+-------------+
| id | table     | type | key  | rows  | Extra       |
+----+-----------+------+------+-------+-------------+
|  1 | employees | ALL  | NULL | 11000 | Using where |
+----+-----------+------+------+-------+-------------+

The MySQL access type ALL indicates an full table scan: all rows of the table are fetched and compared against the where clause.

Oracle
Try online at SQL Fiddle----------------------------------------------------
| Id | Operation         | Name      | Rows | Cost |
----------------------------------------------------
|  0 | SELECT STATEMENT  |           |  106 |  478 |
|* 1 |  TABLE ACCESS FULL| EMPLOYEES |  106 |  478 |
----------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("SUBSIDIARY_ID"=20)
PostgreSQL
Try online at SQL Fiddle                QUERY PLAN
-------------------------------------------
 Seq Scan on employees 
   (cost=0.00..1398.50 rows=1070 width=14)
   Filter: (subsidiary_id = 2::numeric)

The PostgreSQL operation Seq Scan performs a full table scan.

SQL Server
Try online at SQL Fiddle
|--Table Scan(OBJECT:employees,
               WHERE:subsidiary_id=30)

The execution plan reveals that the database does not use the index. Instead it performs a TABLE ACCESS FULL. As a result the database reads the entire table and evaluates every row against the where clause. The execution time grows with the table size: if the table grows tenfold, the TABLE ACCESS FULL takes ten times as long. The danger of this operation is that it is often fast enough in a small development environment, but it causes serious performance problems in production.

Full Table Scan

The operation TABLE ACCESS FULL, also known as full table scan, can be the most efficient operation in some cases anyway, in particular when retrieving a large part of the table.

This is partly due to the overhead for the index lookup itself, which does not happen for a TABLE ACCESS FULL operation. This is mostly because an index lookup reads one block after the other as the database does not know which block to read next until the current block has been processed. A FULL TABLE SCAN must get the entire table anyway so that the database can read larger chunks at a time (multi block read). Although the database reads more data, it might need to execute fewer read operations.

The database does not use the index because it cannot use single columns from a concatenated index arbitrarily. A closer look at the index structure makes this clear.

About our book “SQL Performance Explained”
Probably the best book on SQL performance I've read
Guillaume Lelarge on Amazon.co.uk (5 stars)

A concatenated index is just a B-tree index like any other that keeps the indexed data in a sorted list. The database considers each column according to its position in the index definition to sort the index entries. The first column is the primary sort criterion and the second column determines the order only if two entries have the same value in the first column and so on.

Important

A concatenated index is one index across multiple columns.

The ordering of a two-column index is therefore like the ordering of a telephone directory: it is first sorted by surname, then by first name. That means that a two-column index does not support searching on the second column alone; that would be like searching a telephone directory by first name.

Figure 2.1. Concatenated Index


The index excerpt in Figure 2.1 shows that the entries for subsidiary 20 are not stored next to each other. It is also apparent that there are no entries with SUBSIDIARY_ID = 20 in the tree, although they exist in the leaf nodes. The tree is therefore useless for this query.

Tweet this tip

Tip

Visualizing an index helps in understanding what queries the index supports. You can query the database to retrieve the entries in index order (SQL:2008 syntax, see syntax of top-n queries for proprietary solutions using LIMIT, TOP or ROWNUM):

SELECT <INDEX COLUMN LIST> 
  FROM <TABLE>  
 ORDER BY <INDEX COLUMN LIST>
 FETCH FIRST 100 ROWS ONLY;

If you put the index definition and table name into the query, you will get a sample from the index. Ask yourself if the requested rows are clustered in a central place. If not, the index tree cannot help find that place.

We could, of course, add another index on SUBSIDIARY_ID to improve query speed. There is however a better solution—at least if we assume that searching on EMPLOYEE_ID alone does not make sense.

We can take advantage of the fact that the first index column is always usable for searching. Again, it is like a telephone directory: you don’t need to know the first name to search by last name. The trick is to reverse the index column order so that the SUBSIDIARY_ID is in the first position:

CREATE UNIQUE INDEX EMPLOYEES_PK 
    ON EMPLOYEES (SUBSIDIARY_ID, EMPLOYEE_ID);

Both columns together are still unique so queries with the full primary key can still use an INDEX UNIQUE SCAN but the sequence of index entries is entirely different. The SUBSIDIARY_ID has become the primary sort criterion. That means that all entries for a subsidiary are in the index consecutively so the database can use the B-tree to find their location.

Important

The most important consideration when defining a concatenated index is how to choose the column order so it can be used as often as possible.

The execution plan confirms that the database uses the “reversed” index. The SUBSIDIARY_ID alone is not unique anymore so the database must follow the leaf nodes in order to find all matching entries: it is therefore using the INDEX RANGE SCAN operation.

DB2
Explain Plan
-------------------------------------------------------------
ID | Operation               |                    Rows | Cost
 1 | RETURN                  |                         |  128
 2 |  FETCH EMPLOYEES        |  1195 of 1195 (100.00%) |  128
 3 |   RIDSCN                |  1195 of 1195 (100.00%) |   43
 4 |    SORT (UNQIUE)        |  1195 of 1195 (100.00%) |   43
 5 |     IXSCAN EMPLOYEES_PK | 1195 of 10000 ( 11.95%) |   43

Predicate Information
 2 - SARG (Q1.SUBSIDIARY_ID = +00002.)
 5 - START (Q1.SUBSIDIARY_ID = +00002.)
      STOP (Q1.SUBSIDIARY_ID = +00002.)

This execution plan looks more complex than the execution plan that was using the index before. They key operations are still there, however: the IXSCAN representing the index range scan and the FETCH for the table access. In between these operations there is an unexpected SORT and RIDSCN operation: the SORT operation sorts the entries fetched from the index according to the rows physical storage location in the heap table. The RIDSCAN then prefetches all the affected database pages (collapsing multiple adjacent blocks into a single IO operation).

MySQL
Try online at SQL Fiddle+----+-----------+------+---------+---------+------+-------+
| id | table     | type | key     | key_len | rows | Extra |
+----+-----------+------+---------+---------+------+-------+
|  1 | employees | ref  | PRIMARY | 5       |  123 |       |
+----+-----------+------+---------+---------+------+-------+

The MySQL access type ref is the equivalent of INDEX RANGE SCAN in the Oracle database.

Oracle
Try online at SQL Fiddle--------------------------------------------------------------
|Id |Operation                   | Name        | Rows | Cost |
--------------------------------------------------------------
| 0 |SELECT STATEMENT            |             |  106 |   75 |
| 1 | TABLE ACCESS BY INDEX ROWID| EMPLOYEES   |  106 |   75 |
|*2 |  INDEX RANGE SCAN          | EMPLOYEE_PK |  106 |    2 |
--------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("SUBSIDIARY_ID"=20)
PostgreSQL
Try online at SQL Fiddle                 QUERY PLAN
----------------------------------------------
 Bitmap Heap Scan on employees
 (cost=24.63..1529.17 rows=1080 width=13)
   Recheck Cond: (subsidiary_id = 2::numeric)
   -> Bitmap Index Scan on employees_pk
      (cost=0.00..24.36 rows=1080 width=0)
      Index Cond: (subsidiary_id = 2::numeric)

The PostgreSQL database uses two operations in this case: a Bitmap Index Scan followed by a Bitmap Heap Scan. They roughly correspond to Oracle’s INDEX RANGE SCAN and TABLE ACCESS BY INDEX ROWID with one important difference: it first fetches all results from the index (Bitmap Index Scan), then sorts the rows according to the physical storage location of the rows in the heap table and than fetches all rows from the table (Bitmap Heap Scan). This method reduces the number of random access IOs on the table.

SQL Server
Try online at SQL Fiddle
|--Nested Loops(Inner Join)
   |--Index Seek(OBJECT:employees_pk,
   |               SEEK:subsidiary_id=20
   |            ORDERED FORWARD)
   |--RID Lookup(OBJECT:employees,
                   SEEK:Bmk1000=Bmk1000
                 LOOKUP ORDERED FORWARD)

In general, a database can use a concatenated index when searching with the leading (leftmost) columns. An index with three columns can be used when searching for the first column, when searching with the first two columns together, and when searching using all columns.

Even though the two-index solution delivers very good select performance as well, the single-index solution is preferable. It not only saves storage space, but also the maintenance overhead for the second index. The fewer indexes a table has, the better the insert, delete and update performance.

There is something for everyone:
training, tuning and literature on SQL performance

To define an optimal index you must understand more than just how indexes work—you must also know how the application queries the data. This means you have to know the column combinations that appear in the where clause.

Defining an optimal index is therefore very difficult for external consultants because they don’t have an overview of the application’s access paths. Consultants can usually consider one query only. They do not exploit the extra benefit the index could bring for other queries. Database administrators are in a similar position as they might know the database schema but do not have deep insight into the access paths.

The only place where the technical database knowledge meets the functional knowledge of the business domain is the development department. Developers have a feeling for the data and know the access path. They can properly index to get the best benefit for the overall application without much effort.

If you like my way of explaining things, you’ll love my book.

About the Author

Photo of Markus Winand
Markus Winand tunes developers for high SQL performance. He also published the book SQL Performance Explained and offers in-house training as well as remote coaching at http://winand.at/

?Recent questions at
Ask.Use-The-Index-Luke.com

0
votes
1
answer
164
views

PostgreSQL Scripts: Performance Testing and Scalability problem and question

Nov 12 at 14:53 Markus Winand ♦♦ 936
testing postgresql scalability
0
votes
1
answer
561
views

PostgreSQL Bitmap Heap Scan on index is very slow but Index Only Scan is fast

Oct 31 at 11:31 Markus Winand ♦♦ 936
index postgresql postgres sql
3
votes
2
answers
588
views

pagination with nulls

Oct 29 at 22:39 Rocky 46
pagination