2011-11-15Concatenated Keys
Even though the database indexes the primary key automatically, there is still room for manual refinements if the key consists of multiple columns. That is because the database creates an index on all columns of the primary key—a so-called concatenated index (also multi-column, composite or combined index). The column order of a concatenated index has, however, great impact on its usability. So, choose it wisely.
For the sake of demonstration, let’s assume a company merger so that the EMPLOYEES table grows by a factor of ten. It is, however, not possible to combine the two data sets straight away, because the EMPLOYEE_ID is not unique across both companies. One way to resolve such conflicts is to extend the primary key with an extra identifier—e.g., a subsidiary ID. It is quite easy from database perspective, so we take it for this example. That means, the new primary key has two columns: the EMPLOYEE_ID as before and the SUBSIDIARY_ID to reestablish uniqueness.
The index for the new primary key is therefore defined as following:
CREATE UNIQUE INDEX employee_pk
ON employees (employee_id, subsidiary_id);
A query for a particular employee has to take the full primary key into account. That is, also the SUBSIDIARY_ID column:
SELECT first_name, last_name
FROM employees
WHERE employee_id = 123
AND subsidiary_id = 30;
--------------------------------------------------------------- |Id |Operation | Name | Rows | Cost | --------------------------------------------------------------- | 0 |SELECT STATEMENT | | 1 | 2 | | 1 | TABLE ACCESS BY INDEX ROWID| EMPLOYEES | 1 | 2 | |*2 | INDEX UNIQUE SCAN | EMPLOYEES_PK | 1 | 1 | --------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("EMPLOYEE_ID"=123 AND "SUBSIDIARY_ID"=30)
Accessing the table with the complete primary key allows the database to use an INDEX RANGE SCAN—no matter how many columns the index has. But what happens when using one column only? When, for example, listing all employees of a subsidiary:
SELECT first_name, last_name FROM employees WHERE subsidiary_id = 20;
---------------------------------------------------- | Id | Operation | Name | Rows | Cost | ---------------------------------------------------- | 0 | SELECT STATEMENT | | 106 | 478 | |* 1 | TABLE ACCESS FULL| EMPLOYEES | 106 | 478 | ---------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - filter("SUBSIDIARY_ID"=20)
The execution reveals that the database does not use the index. It performs a FULL TABLE SCAN instead. That means, it reads the entire table and evaluates every row against the where clause. The speed declines linear on a growing table size as a result. If the table grows tenfold, the FULL TABLE SCAN takes ten times as long. That makes the FULL TABLE SCAN particularly dangerous. Even though it might be fast enough during development, it often becomes a performance problem right after go-live.
The database does not use the index because it cannot use single columns from a concatenated index arbitrarily. A closer look at the index structure makes it clear.
A concatenated index is a B-Tree index like any other: it maintains the indexed data as sorted list. It considers all columns according to their position in the index to sort the entries. That means, the first column is the primary sort criterion. The second column determines the order only if two entries have the same value in the first column. And so forth.
Important
A concatenated index is one index across multiple columns.
The order of a two-column index is therefore like the order of a telephone directory: it is first sorted by surname, then by first name. That means, that a two-column index does not support searching on the second column only. That is like searching in a telephone directory to find all entries with a first name.
Figure 2.1. Concatenated Index
The index excerpt in Figure 2.1 shows that the entries for SUBSIDIARY_ID = 20 are not centralized. That renders the B-Tree useless. It is even visible that there are no entries with SUBSIDIARY_ID = 20 in the tree, although they exist in the leaf nodes.
Tip
Visualizing an index helps to understand what queries the index supports. Query the database to see the index order (Oracle syntax):
SELECT * FROM ( SELECT <INDEX COLUMN LIST> FROM <TABLE> ORDER BY <INDEX COLUMN LIST> ) WHERE ROWNUM < 100;
If you insert the index definition and table name, you will get a sample from the index. Ask yourself if the requested rows are clustered at a central place. Otherwise, the index tree cannot help to find that place.
A second index on SUBSIDIARY_ID improves the query speed, of course. There is, however, a better solution—at least when you know that searching on EMPLOYEE_ID alone does not make sense.
We can take advantage of the fact, that the first index column is always usable for searching. It is, again, like a telephone directory: you don’t need to know the first name to search by last name. So, the trick is to reverse the index column order, so that the SUBSIDIARY_ID is at the first place:
CREATE UNIQUE INDEX EMPLOYEES_PK
ON EMPLOYEES (SUBSIDIARY_ID, EMPLOYEE_ID);
Both columns together are still unique, so that queries for the full primary key continue to use an INDEX UNIQUE SCAN. The order of index entries is, however, completely different. The SUBSIDIARY_ID has become the primary sort criterion. That means, that all entries of a subsidiary occur consecutively in the index. The B-Tree helps to find that place.
Important
The most important factor when defining a concatenated index is which statements it can support.
The execution plan confirms: the index is used. But the SUBSIDIARY_ID alone is not unique anymore, so that the database must follow the leaf nodes to find all entries. So there is an INDEX RANGE SCAN:
-------------------------------------------------------------- |Id |Operation | Name | Rows | Cost | -------------------------------------------------------------- | 0 |SELECT STATEMENT | | 106 | 75 | | 1 | TABLE ACCESS BY INDEX ROWID| EMPLOYEES | 106 | 75 | |*2 | INDEX RANGE SCAN | EMPLOYEE_PK | 106 | 2 | -------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("SUBSIDIARY_ID"=20)
A concatenated index is generally usable when searching with the leading columns. That is, an index with three columns can be used when searching for the first column, when searching with the first two columns together, and when searching with all columns, of course.
Even though the two-index solution delivers very good select performance as well, you should prefer the one-index solution. It does not only save the space, but also the maintenance overhead for the second index. The fewer indexes a table has, the better the insert, delete and update performance is.
To define the best index you must not only know how an index works, but also know how the application queries the data. Specifically, this means you have to know the column combinations that appear in the where clause.
Defining the best index is therefore very hard for external performance consultants because they don’t have an overview about the application’s access paths. They can usually consider one query only and will not gain extra benefits for other queries. Database administrators are in a similar position. They know the data better, but still don’t have deep insight into the access paths.
The only place where the technical database knowledge meets the functional knowledge of the business domain is the development department. Developers have a feeling for the data and know the access path. They can index properly without much effort to get the best benefit for the overall application.

share and subscribe
RSS FeedFlattr this! Follow me on TwitterShare at Google+Like on Facebook