Slow Indexes, Part II


The previous section explained how to gain additional benefits from an existing index by changing its column order. But the example considered two statements only. Changing an index, however, may affect all queries on the indexed table. This section explains the way databases pick an index and demonstrates the possible side effects when changing existing indexes.

The adopted EMPLOYEE_PK index improves performance of all queries that search by subsidiary only. It is, however, usable for all queries that search by SUBSIDIARY_ID—no matter if there are any other search criteria. That means, the index becomes usable for queries that used to use another index with another part of the where clause. In that case, if there are multiple access path available, it is the optimizers job to find the best.

The Query Optimizer

The query optimizer, or query planner, is the database component that transforms an SQL statement into an execution plan. This process is also called compiling or parsing. There are two different optimizer types.

Cost Based Optimizers (CBO) generate many execution plan variations and calculates a cost value for each one. The operations in use and the estimated row numbers are the basis for the cost value calculation. The cost value finally serves as benchmark pick the “best” execution plan.

Rule Based Optimizers (RBO) generate the execution plan by a hard-coded rule set and are therefore less flexible. Rule based Optimizers are hardly used today.

Changing an index might have unpleasant side effects as well. In our example, it is the internal telephone directory application that has become very slow. The first analysis reveals that the following query takes up to a minute:

SELECT first_name, last_name, subsidiary_id, phone_number
  FROM employees
 WHERE last_name  = 'WINAND'
   AND subsidiary_id = 30;

The execution plan is:

Example 2.1. Execution plan with revised primary key index

---------------------------------------------------------------
|Id |Operation                   | Name         | Rows | Cost |
---------------------------------------------------------------
| 0 |SELECT STATEMENT            |              |    1 |   30 |
|*1 | TABLE ACCESS BY INDEX ROWID| EMPLOYEES    |    1 |   30 |
|*2 |  INDEX RANGE SCAN          | EMPLOYEES_PK |   40 |    2 |
---------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
  1 - filter("LAST_NAME"='WINAND')
  2 - access("SUBSIDIARY_ID"=30)

The query uses an index and has a rather low cost value. On the first sight, everything looks fine. But it suspicious that it uses the index we just changed—so, we must assume that our change causes the performance problem. Especially when bearing the old index definition in mind, which started with the EMPLOYEE_ID column that is not part of the where clause at all. The query could not use that index before.

For further analysis, it would be nice to have the previous execution plan for comparison. To get that, we could just deploy the old index definition again. However, most databases offer a simpler way to prevent index usage without dropping the index. The following example uses an Oracle optimizer hint, for that purpose.

SELECT /*+ NO_INDEX(EMPLOYEES EMPLOYEE_PK) */ 
       first_name, last_name, subsidiary_id, phone_number
  FROM employees
 WHERE last_name  = 'WINAND'
   AND subsidiary_id = 30;

The original execution plan uses a FULL TABLE SCAN and has a higher cost value than the INDEX RANGE SCAN:

----------------------------------------------------
| Id | Operation         | Name      | Rows | Cost |
----------------------------------------------------
|  0 | SELECT STATEMENT  |           |    1 |  477 |
|* 1 |  TABLE ACCESS FULL| EMPLOYEES |    1 |  477 |
----------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("LAST_NAME"='WINAND' AND "SUBSIDIARY_ID"=30)

Even though the FULL TABLE SCAN must read and process the entire table, it is—in this particular case—faster than the INDEX RANGE SCAN. That is particularly unusual, because the query matches one row only. But using an index to find a single row should be much faster than a FULL TABLE SCAN. But in that case, it is not. The index seems to be slow.

In such cases, it is best to go through each stop of the troublesome execution. The execution starts with an INDEX RANGE SCAN on the EMPLOYEES_PK index. The index does not cover the LAST_NAME column, so the INDEX RANGE SCAN can consider the SUBSIDIARY_ID filter only. The Oracle execution plan shows that in the “Predicate Information” area—entry “2” for operation Id 2.

Tip

Appendix A, “Execution Plans” explains where to find the “Predicate Information” for other databases.

The INDEX RANGE SCAN with operation Id 2 applies the SUBSIDIARY_ID=30 filter only. That means, it traverses the Index-Tree to find the first entry for SUBSIDIARY_ID 30. Afterwards, it follows the leaf node chain to find all entries for that subsidiary. The result of the INDEX ONLY SCAN is a list of ROWIDs that fulfill the SUBSIDIARY_ID condition. That are, depending on the subsidiary size, just a few or many hundreds.

The next step is the TABLE ACCESS BY INDEX ROWID. It uses the ROWIDs from the previous step to fetch the rows—all columns—from the table. Once the LAST_NAME column is available, the database can evaluate the outstanding part of the where clause. That means the database has to fetch all rows for SUBSIDIARY_ID=30 before it can apply the LAST_NAME filter.

The statements response time depends on the number of employees in the particular subsidiary. For a small subsidiary—e.g., only a few dozen members—the INDEX RANGE SCAN gives better performance. But a FULL TABLE SCAN can be faster, if there are many employees in the subsidiary, because it can read large parts from the table in one shot (see Full Table Scan).

Quick answers instead of long searches!
Instant Coaching is the online consulting service by the author of this page.

The query is slow because the INDEX RANGE SCAN returns many rows—all employees from the original company—and the TABLE ACCESS BY INDEX ROWID fetches them individually, just to apply the LAST_NAME filter which discards most rows anyway. It is the perfect composition of the two ingredients that make an index slow: scanning a wide range and fetching too many rows individually.

Choosing the best execution plan depends on the table data as well. Hence, the optimizer needs to access some statistics about the table data. In that case, that are statistics about the distribution of SUBSIDIARY_ID values. They allow the optimizer to estimate the number of rows returned from the index access, which is used for the cost calculation.

Statistics

The optimizer uses a variety of statistics on table, index, and column level. Most statistics are collected on column level: the number of distinct values, the smallest and biggest value (data range), the number of NULL occurrences and the column histogram (data distribution). As of Oracle 11g it is also possible to collect extended statistics for column concatenations and expressions. There are only a few statistics on table level: the size (in rows and blocks) and the average row length.

The most important index statistics are the tree depth, the number of leaf nodes, the number of distinct keys and the clustering factor (see Section 11).

The optimizer uses these values to estimate the selectivity of the predicates in the where clause.

If there are no statistics available—e.g., because they were deleted for this demonstration—the optimizer defaults. The Oracle default statistics suggest a small index with medium selectivity. They lead to the estimate that the INDEX RANGE SCAN will return 40 rows. The figure shown in the Rows column of the execution plan. Obviously a gross misjudgment, as there are 1000 employees assigned to this subsidiary.

Providing accurate statistics allows the optimizer to do a better estimation. For the following execution plan, the optimizer estimated 1000 rows for the INDEX RANGE SCAN. Consequently, it calculated a higher cost value for the subsequent table access.

---------------------------------------------------------------
|Id |Operation                   | Name         | Rows | Cost |
---------------------------------------------------------------
| 0 |SELECT STATEMENT            |              |    1 |  680 |
|*1 | TABLE ACCESS BY INDEX ROWID| EMPLOYEES    |    1 |  680 |
|*2 |  INDEX RANGE SCAN          | EMPLOYEES_PK | 1000 |    4 |
---------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
  1 - filter("LAST_NAME"='WINAND')
  2 - access("SUBSIDIARY_ID"=30)

The cost value of 680 is even higher than the cost value for the execution plan using the FULL TABLE SCAN (477). The optimizer will therefore automatically prefer the FULL TABLE SCAN.

Coaching by the Author
Faster, easier and more memorable than reading.

That example about bad index performance and a “fast” FULL TABLE SCANs should not hide the fact that a proper index is the best solution to most performance problems. Searching on last name is best support by an index on LAST_NAME, of course:

CREATE INDEX emp_name ON employees (last_name);

Using the new index, the optimizer calculates a cost value of 3:

Example 2.2. Execution Plan with Dedicated Index

--------------------------------------------------------------
| Id | Operation                   | Name      | Rows | Cost |
--------------------------------------------------------------
|  0 | SELECT STATEMENT            |           |    1 |    3 |
|* 1 |  TABLE ACCESS BY INDEX ROWID| EMPLOYEES |    1 |    3 |
|* 2 |   INDEX RANGE SCAN          | EMP_NAME  |    1 |    1 |
--------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("SUBSIDIARY_ID"=30)
   2 - access("LAST_NAME"='WINAND')

The index access delivers one row only—according to the optimizers estimate. Hence, the following table access fetches a single row only, which is definitely faster than a FULL TABLE SCAN. A proper index is therefore still faster than a FULL TABLE SCAN.

The two execution plan from Example 2.1 and Example 2.2 are almost identical. They perform the same operations and have rather low cost values. Nevertheless, the second plan performs much better. The efficiency of an INDEX RANGE SCAN may vary in a wide range—especially when followed by a TABLE ACCESS BY INDEX ROWID. Using an index does not automatically mean best execution.

Recent Questions at Ask.Use-The-Index-Luke.com

0
votes
1
answer
229
views

query regd the CBO decision

Apr 17 at 10:27 Hulda(suspended)
index-choice optimizer
0
votes
3
answers
2.0k
views

Examples for Function Based Indexes?

Mar 25 at 15:52 Castorp 1
function-based
0
votes
1
answer
610
views

Updating multiple rows using a subquery in SQL

Jan 08 at 09:52 Jan 26
subquery update sql