2012-02-15Slow Indexes, Part II
The previous section explained how to gain additional benefits from an existing index by changing its column order. But the example considered two statements only. Changing an index, however, may affect all queries on the indexed table. This section explains the way databases pick an index and demonstrates the possible side effects when changing existing indexes.
The adopted EMPLOYEE_PK index improves performance of all queries that search by subsidiary only. It is, however, usable for all queries that search by SUBSIDIARY_ID—no matter if there are any other search criteria. That means, the index becomes usable for queries that used to use another index with another part of the where clause. In that case, if there are multiple access path available, it is the optimizers job to find the best.
Changing an index might have unpleasant side effects as well. In our example, it is the internal telephone directory application that has become very slow. The first analysis reveals that the following query takes up to a minute:
SELECT first_name, last_name, subsidiary_id, phone_number FROM employees WHERE last_name = 'WINAND' AND subsidiary_id = 30;
The execution plan is:
Example 2.1. Execution plan with revised primary key index
--------------------------------------------------------------- |Id |Operation | Name | Rows | Cost | --------------------------------------------------------------- | 0 |SELECT STATEMENT | | 1 | 30 | |*1 | TABLE ACCESS BY INDEX ROWID| EMPLOYEES | 1 | 30 | |*2 | INDEX RANGE SCAN | EMPLOYEES_PK | 40 | 2 | --------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - filter("LAST_NAME"='WINAND') 2 - access("SUBSIDIARY_ID"=30)
The query uses an index and has a rather low cost value. On the first sight, everything looks fine. But it suspicious that it uses the index we just changed—so, we must assume that our change causes the performance problem. Especially when bearing the old index definition in mind, which started with the EMPLOYEE_ID column that is not part of the where clause at all. The query could not use that index before.
For further analysis, it would be nice to have the previous execution plan for comparison. To get that, we could just deploy the old index definition again. However, most databases offer a simpler way to prevent index usage without dropping the index. The following example uses an Oracle optimizer hint, for that purpose.
SELECT /*+ NO_INDEX(EMPLOYEES EMPLOYEE_PK) */
first_name, last_name, subsidiary_id, phone_number
FROM employees
WHERE last_name = 'WINAND'
AND subsidiary_id = 30;
The original execution plan uses a FULL TABLE SCAN and has a higher cost value than the INDEX RANGE SCAN:
---------------------------------------------------- | Id | Operation | Name | Rows | Cost | ---------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 477 | |* 1 | TABLE ACCESS FULL| EMPLOYEES | 1 | 477 | ---------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - filter("LAST_NAME"='WINAND' AND "SUBSIDIARY_ID"=30)
Even though the FULL TABLE SCAN must read and process the entire table, it is—in this particular case—faster than the INDEX RANGE SCAN. That is particularly unusual, because the query matches one row only. But using an index to find a single row should be much faster than a FULL TABLE SCAN. But in that case, it is not. The index seems to be slow.
In such cases, it is best to go through each stop of the troublesome execution. The execution starts with an INDEX RANGE SCAN on the EMPLOYEES_PK index. The index does not cover the LAST_NAME column, so the INDEX RANGE SCAN can consider the SUBSIDIARY_ID filter only. The Oracle execution plan shows that in the “Predicate Information” area—entry “2” for operation Id 2.
Tip
Appendix A, “Execution Plans” explains where to find the “Predicate Information” for other databases.
The INDEX RANGE SCAN with operation Id 2 applies the SUBSIDIARY_ID=30 filter only. That means, it traverses the Index-Tree to find the first entry for SUBSIDIARY_ID 30. Afterwards, it follows the leaf node chain to find all entries for that subsidiary. The result of the INDEX ONLY SCAN is a list of ROWIDs that fulfill the SUBSIDIARY_ID condition. That are, depending on the subsidiary size, just a few or many hundreds.
The next step is the TABLE ACCESS BY INDEX ROWID. It uses the ROWIDs from the previous step to fetch the rows—all columns—from the table. Once the LAST_NAME column is available, the database can evaluate the outstanding part of the where clause. That means the database has to fetch all rows for SUBSIDIARY_ID=30 before it can apply the LAST_NAME filter.
The statements response time depends on the number of employees in the particular subsidiary. For a small subsidiary—e.g., only a few dozen members—the INDEX RANGE SCAN gives better performance. But a FULL TABLE SCAN can be faster, if there are many employees in the subsidiary, because it can read large parts from the table in one shot (see Full Table Scan).
The query is slow because the INDEX RANGE SCAN returns many rows—all employees from the original company—and the TABLE ACCESS BY INDEX ROWID fetches them individually, just to apply the LAST_NAME filter which discards most rows anyway. It is the perfect composition of the two ingredients that make an index slow: scanning a wide range and fetching too many rows individually.
Choosing the best execution plan depends on the table data as well. Hence, the optimizer needs to access some statistics about the table data. In that case, that are statistics about the distribution of SUBSIDIARY_ID values. They allow the optimizer to estimate the number of rows returned from the index access, which is used for the cost calculation.
If there are no statistics available—e.g., because they were deleted for this demonstration—the optimizer defaults. The Oracle default statistics suggest a small index with medium selectivity. They lead to the estimate that the INDEX RANGE SCAN will return 40 rows. The figure shown in the Rows column of the execution plan. Obviously a gross misjudgment, as there are 1000 employees assigned to this subsidiary.
Providing accurate statistics allows the optimizer to do a better estimation. For the following execution plan, the optimizer estimated 1000 rows for the INDEX RANGE SCAN. Consequently, it calculated a higher cost value for the subsequent table access.
--------------------------------------------------------------- |Id |Operation | Name | Rows | Cost | --------------------------------------------------------------- | 0 |SELECT STATEMENT | | 1 | 680 | |*1 | TABLE ACCESS BY INDEX ROWID| EMPLOYEES | 1 | 680 | |*2 | INDEX RANGE SCAN | EMPLOYEES_PK | 1000 | 4 | --------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - filter("LAST_NAME"='WINAND') 2 - access("SUBSIDIARY_ID"=30)
The cost value of 680 is even higher than the cost value for the execution plan using the FULL TABLE SCAN (477). The optimizer will therefore automatically prefer the FULL TABLE SCAN.
That example about bad index performance and a “fast” FULL TABLE SCANs should not hide the fact that a proper index is the best solution to most performance problems. Searching on last name is best support by an index on LAST_NAME, of course:
CREATE INDEX emp_name ON employees (last_name);
Using the new index, the optimizer calculates a cost value of 3:
Example 2.2. Execution Plan with Dedicated Index
-------------------------------------------------------------- | Id | Operation | Name | Rows | Cost | -------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 3 | |* 1 | TABLE ACCESS BY INDEX ROWID| EMPLOYEES | 1 | 3 | |* 2 | INDEX RANGE SCAN | EMP_NAME | 1 | 1 | -------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - filter("SUBSIDIARY_ID"=30) 2 - access("LAST_NAME"='WINAND')
The index access delivers one row only—according to the optimizers estimate. Hence, the following table access fetches a single row only, which is definitely faster than a FULL TABLE SCAN. A proper index is therefore still faster than a FULL TABLE SCAN.
The two execution plan from Example 2.1 and Example 2.2 are almost identical. They perform the same operations and have rather low cost values. Nevertheless, the second plan performs much better. The efficiency of an INDEX RANGE SCAN may vary in a wide range—especially when followed by a TABLE ACCESS BY INDEX ROWID. Using an index does not automatically mean best execution.

share and subscribe
RSS FeedFlattr this! Follow me on TwitterShare at Google+Like on Facebook