Concatenated Keys


Even though the database indexes the primary key automatically, there is still room for manual refinements if the key consists of multiple columns. That is because the database creates an index on all columns of the primary key—a so-called concatenated index (also multi-column, composite or combined index). The column order of a concatenated index has, however, great impact on its usability. So, choose it wisely.

For the sake of demonstration, let’s assume a company merger so that the EMPLOYEES table grows by a factor of ten. It is, however, not possible to combine the two data sets straight away, because the EMPLOYEE_ID is not unique across both companies. One way to resolve such conflicts is to extend the primary key with an extra identifier—e.g., a subsidiary ID. It is quite easy from database perspective, so we take it for this example. That means, the new primary key has two columns: the EMPLOYEE_ID as before and the SUBSIDIARY_ID to reestablish uniqueness.

The index for the new primary key is therefore defined as following:

CREATE UNIQUE INDEX employee_pk 
    ON employees (employee_id, subsidiary_id);

A query for a particular employee has to take the full primary key into account. That is, also the SUBSIDIARY_ID column:

SELECT first_name, last_name
  FROM employees
 WHERE employee_id   = 123
   AND subsidiary_id = 30;
---------------------------------------------------------------
|Id |Operation                   | Name         | Rows | Cost |
---------------------------------------------------------------
| 0 |SELECT STATEMENT            |              |    1 |    2 |
| 1 | TABLE ACCESS BY INDEX ROWID| EMPLOYEES    |    1 |    2 |
|*2 |  INDEX UNIQUE SCAN         | EMPLOYEES_PK |    1 |    1 |
---------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("EMPLOYEE_ID"=123 AND "SUBSIDIARY_ID"=30)

Accessing the table with the complete primary key allows the database to use an INDEX RANGE SCAN—no matter how many columns the index has. But what happens when using one column only? When, for example, listing all employees of a subsidiary:

SELECT first_name, last_name
  FROM employees
 WHERE subsidiary_id = 20;
----------------------------------------------------
| Id | Operation         | Name      | Rows | Cost |
----------------------------------------------------
|  0 | SELECT STATEMENT  |           |  106 |  478 |
|* 1 |  TABLE ACCESS FULL| EMPLOYEES |  106 |  478 |
----------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("SUBSIDIARY_ID"=20)

The execution reveals that the database does not use the index. It performs a FULL TABLE SCAN instead. That means, it reads the entire table and evaluates every row against the where clause. The speed declines linear on a growing table size as a result. If the table grows tenfold, the FULL TABLE SCAN takes ten times as long. That makes the FULL TABLE SCAN particularly dangerous. Even though it might be fast enough during development, it often becomes a performance problem right after go-live.

Full Table Scan

There are still cases when the FULL TABLE SCAN is the most efficient operation. That is, in particular, when retrieving a large part of the table. There is no exact limit, but it can already apply when retrieving a small percentage of the table.

That is partly because of the overhead for the index lookup itself, which does not occur for the FULL TABLE ACCESS. But mostly because an index lookup reads one block after the other, because the database does not know which block to read next, until the current block was processed. A FULL TABLE SCAN must get the entire table anyway, so that the database reads larger chunks at once (multi block read). Although it reads more data, it does not necessarily need more reads.

All of that should not hide the fact that a FULL TABLE SCAN is often caused by a missing or incorrect index.

The database does not use the index because it cannot use single columns from a concatenated index arbitrarily. A closer look at the index structure makes it clear.

Coaching by the Author
Faster, easier and more memorable than reading.

A concatenated index is a B-Tree index like any other: it maintains the indexed data as sorted list. It considers all columns according to their position in the index to sort the entries. That means, the first column is the primary sort criterion. The second column determines the order only if two entries have the same value in the first column. And so forth.

Important

A concatenated index is one index across multiple columns.

The order of a two-column index is therefore like the order of a telephone directory: it is first sorted by surname, then by first name. That means, that a two-column index does not support searching on the second column only. That is like searching in a telephone directory to find all entries with a first name.

Figure 2.1. Concatenated Index


The index excerpt in Figure 2.1 shows that the entries for SUBSIDIARY_ID = 20 are not centralized. That renders the B-Tree useless. It is even visible that there are no entries with SUBSIDIARY_ID = 20 in the tree, although they exist in the leaf nodes.

Tweet this tip

Tip

Visualizing an index helps to understand what queries the index supports. Query the database to see the index order (Oracle syntax):

SELECT * FROM (
  SELECT <INDEX COLUMN LIST> 
    FROM <TABLE>  
   ORDER BY <INDEX COLUMN LIST>
) 
 WHERE ROWNUM < 100;

If you insert the index definition and table name, you will get a sample from the index. Ask yourself if the requested rows are clustered at a central place. Otherwise, the index tree cannot help to find that place.

A second index on SUBSIDIARY_ID improves the query speed, of course. There is, however, a better solution—at least when you know that searching on EMPLOYEE_ID alone does not make sense.

We can take advantage of the fact, that the first index column is always usable for searching. It is, again, like a telephone directory: you don’t need to know the first name to search by last name. So, the trick is to reverse the index column order, so that the SUBSIDIARY_ID is at the first place:

CREATE UNIQUE INDEX EMPLOYEES_PK 
    ON EMPLOYEES (SUBSIDIARY_ID, EMPLOYEE_ID);

Both columns together are still unique, so that queries for the full primary key continue to use an INDEX UNIQUE SCAN. The order of index entries is, however, completely different. The SUBSIDIARY_ID has become the primary sort criterion. That means, that all entries of a subsidiary occur consecutively in the index. The B-Tree helps to find that place.

Important

The most important factor when defining a concatenated index is which statements it can support.

The execution plan confirms: the index is used. But the SUBSIDIARY_ID alone is not unique anymore, so that the database must follow the leaf nodes to find all entries. So there is an INDEX RANGE SCAN:

--------------------------------------------------------------
|Id |Operation                   | Name        | Rows | Cost |
--------------------------------------------------------------
| 0 |SELECT STATEMENT            |             |  106 |   75 |
| 1 | TABLE ACCESS BY INDEX ROWID| EMPLOYEES   |  106 |   75 |
|*2 |  INDEX RANGE SCAN          | EMPLOYEE_PK |  106 |    2 |
--------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("SUBSIDIARY_ID"=20)

A concatenated index is generally usable when searching with the leading columns. That is, an index with three columns can be used when searching for the first column, when searching with the first two columns together, and when searching with all columns, of course.

Even though the two-index solution delivers very good select performance as well, you should prefer the one-index solution. It does not only save the space, but also the maintenance overhead for the second index. The fewer indexes a table has, the better the insert, delete and update performance is.

Quick answers instead of long searches!
Instant Coaching is the online consulting service by the author of this page.

To define the best index you must not only know how an index works, but also know how the application queries the data. Specifically, this means you have to know the column combinations that appear in the where clause.

Defining the best index is therefore very hard for external performance consultants because they don’t have an overview about the application’s access paths. They can usually consider one query only and will not gain extra benefits for other queries. Database administrators are in a similar position. They know the data better, but still don’t have deep insight into the access paths.

The only place where the technical database knowledge meets the functional knowledge of the business domain is the development department. Developers have a feeling for the data and know the access path. They can index properly without much effort to get the best benefit for the overall application.

Recent Questions at Ask.Use-The-Index-Luke.com

0
votes
1
answer
229
views

query regd the CBO decision

Apr 17 at 10:27 Hulda(suspended)
index-choice optimizer
0
votes
3
answers
2.0k
views

Examples for Function Based Indexes?

Mar 25 at 15:52 Castorp 1
function-based
0
votes
1
answer
610
views

Updating multiple rows using a subquery in SQL

Jan 08 at 09:52 Jan 26
subquery update sql