Index Filter Predicates Used Intentionally


Very often, index filter predicates indicate improper index usage caused by an incorrect column order in a concatenated index. Nevertheless, index filter predicates can be used for a good reason as well: not to improve range scan performance, but to group consecutively accessed data together.

Where clause predicates that cannot serve as access predicates are good candidates for this technique:

SELECT first_name, last_name, subsidiary_id, phone_number
  FROM employees
 WHERE subsidiary_id = ?
   AND UPPER(last_name) LIKE '%INA%';

Remember that LIKE expressions with leading wildcards cannot use the index tree. Indexing the last name therefore does not narrow the scanned index range, no matter whether you index LAST_NAME or UPPER(last_name). This condition is consequently not a good candidate for indexing.
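This behavior is easy to observe in any database that exposes its execution plans. The following sketch uses SQLite via Python's sqlite3 module (SQLite supports expression indexes since version 3.9); the table and index names are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE employees (
                   subsidiary_id INTEGER,
                   last_name     TEXT,
                   first_name    TEXT,
                   phone_number  TEXT)""")
# Hypothetical expression index on the uppercase last name
con.execute("CREATE INDEX emp_up_name ON employees (UPPER(last_name))")

# Leading wildcard: the index tree cannot narrow the range -> full scan
scan_detail = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM employees "
    "WHERE UPPER(last_name) LIKE '%INA%'").fetchone()[3]

# Equality on the indexed expression: the index tree is usable -> search
search_detail = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM employees "
    "WHERE UPPER(last_name) = 'WINAND'").fetchone()[3]

print(scan_detail)    # e.g. "SCAN employees"
print(search_detail)  # e.g. "SEARCH employees USING INDEX emp_up_name ..."
```

The exact plan wording varies between SQLite versions, but the pattern is the same: the leading-wildcard LIKE forces a scan, while a condition that can traverse the index tree yields a search.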


However, the condition on SUBSIDIARY_ID is well suited for indexing. We do not even need to add a new index, because SUBSIDIARY_ID is already the leading column in the index for the primary key.

--------------------------------------------------------------
|Id | Operation                   | Name       | Rows | Cost |
--------------------------------------------------------------
| 0 | SELECT STATEMENT            |            |   17 |  230 |
|*1 |  TABLE ACCESS BY INDEX ROWID| EMPLOYEES  |   17 |  230 |
|*2 |   INDEX RANGE SCAN          | EMPLOYEE_PK|  333 |    2 |
--------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter(UPPER("LAST_NAME") LIKE '%INA%')
   2 - access("SUBSIDIARY_ID"=TO_NUMBER(:A))

In the above execution plan, the cost value rises a hundredfold from the INDEX RANGE SCAN to the subsequent TABLE ACCESS BY INDEX ROWID operation. In other words: the table access causes the most work. This is actually a common pattern and is not a problem by itself. Nevertheless, it is the most significant contributor to the overall execution time of this query.

The table access is not necessarily a bottleneck if the accessed rows are stored in a single table block because the database can fetch all rows with a single read operation. If the same rows are spread across many different blocks, in contrast, the table access can become a serious performance problem because the database has to fetch many blocks in order to retrieve all the rows. That means the performance depends on the physical distribution of the accessed rows—in other words: it depends on the clustering of rows.
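The effect of row clustering on the table access can be sketched in a few lines. The block size of 100 rows and the row-id scheme below are assumptions for illustration only:

```python
# Sketch: the cost of the table access depends on how many distinct
# blocks the matching rows live in, not on the row count alone.
ROWS_PER_BLOCK = 100  # assumed block capacity


def blocks_to_fetch(row_ids):
    """Number of distinct table blocks needed to fetch the given rows."""
    return len({rid // ROWS_PER_BLOCK for rid in row_ids})


clustered = range(0, 17)               # 17 rows stored next to each other
scattered = range(0, 17 * 1000, 1000)  # 17 rows spread across the table

print(blocks_to_fetch(clustered))  # 1  -> a single read suffices
print(blocks_to_fetch(scattered))  # 17 -> one read per row
```

The same 17 rows cost either one block read or seventeen, depending purely on their physical distribution.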

Note

The correlation between index order and table order is a performance measure: the so-called index clustering factor.

It is in fact possible to improve query performance by re-ordering the rows in the table so they correspond to the index order. This method is, however, rarely applicable because you can only store the table rows in one sequence. That means you can optimize the table for one index only. Even if you can choose a single index for which you would like to optimize the table, it is still a difficult task because most databases only offer rudimentary tools for it. So-called row sequencing is, after all, a rather impractical approach.

This is exactly where the second power of indexing—clustering data—comes in. You can add many columns to an index so that they are automatically stored in a well defined order. That makes an index a powerful yet simple tool for clustering data.

To apply this concept to the above query, we must extend the index to cover all columns from the where clause—even if they do not narrow the scanned index range:

CREATE INDEX empsubupnam ON employees
       (subsidiary_id, UPPER(last_name));

The column SUBSIDIARY_ID is the first index column so it can be used as an access predicate. The expression UPPER(last_name) covers the LIKE filter as index filter predicate. Indexing the uppercase representation saves a few CPU cycles during execution, but a straight index on LAST_NAME would work as well. You’ll find more about this in the next section.

--------------------------------------------------------------
|Id | Operation                   | Name       | Rows | Cost |
--------------------------------------------------------------
| 0 | SELECT STATEMENT            |            |   17 |   20 |
| 1 |  TABLE ACCESS BY INDEX ROWID| EMPLOYEES  |   17 |   20 |
|*2 |   INDEX RANGE SCAN          | EMPSUBUPNAM|   17 |    3 |
--------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("SUBSIDIARY_ID"=TO_NUMBER(:A))
       filter(UPPER("LAST_NAME") LIKE '%INA%')

The new execution plan shows the very same operations as before. The cost value dropped considerably nonetheless. In the predicate information we can see that the LIKE filter is already applied during the INDEX RANGE SCAN. Rows that do not fulfill the LIKE filter are immediately discarded. The table access does not have any filter predicates anymore. That means it does not load rows that do not fulfill the where clause.

The difference between the two execution plans is clearly visible in the “Rows” column. According to the optimizer’s estimate, the query ultimately matches 17 records. The index scan in the first execution plan delivers 333 rows nevertheless. The database must then load these 333 rows from the table to apply the LIKE filter which reduces the result to 17 rows. In the second execution plan, the index access does not deliver those rows in the first place so the database needs to execute the TABLE ACCESS BY INDEX ROWID operation only 17 times.
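The row-count arithmetic of the two plans can be sketched directly. The data below is made up to mirror the optimizer's estimates (333 index matches, 17 final rows); the subsidiary id and names are placeholders:

```python
# 333 rows for the subsidiary, 17 of which match the LIKE filter
rows = [(30, f"NAME{i}") for i in range(316)] + \
       [(30, f"MARINA{i}") for i in range(17)]


def like_ina(name):
    """Stands in for UPPER(last_name) LIKE '%INA%'."""
    return "INA" in name


# Plan 1: the index delivers all 333 rows; the table access filters them.
table_accesses_plan1 = sum(1 for sub, name in rows if sub == 30)

# Plan 2: the index filter predicate discards rows before the table access.
table_accesses_plan2 = sum(1 for sub, name in rows
                           if sub == 30 and like_ina(name))

print(table_accesses_plan1)  # 333 table accesses
print(table_accesses_plan2)  # 17 table accesses
```

Both variants return the same 17 rows; they differ only in how many times the expensive table access runs.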

You should also note that the cost value of the INDEX RANGE SCAN operation grew from two to three because the additional column makes the index bigger. In view of the performance gain, it is an acceptable compromise.

Warning

Don’t introduce a new index for the sole purpose of filter predicates. Extend an existing index instead and keep the maintenance effort low. With some databases you can even add columns to the index for the primary key that are not part of the primary key.

Figure 5.1 demonstrates the difference between the two execution plans:

Figure 5.1. Intentional Index Filter-Predicates


This trivial example seems to confirm the common wisdom to index every column from the where clause. This “wisdom”, however, ignores the relevance of the column order, which determines which conditions can be used as access predicates and thus has a huge impact on performance. The decision about the column order should therefore never be left to chance.

The index size grows with the number of columns as well—especially when adding text columns. Of course, performance does not improve with a bigger index, even though logarithmic scalability limits the impact considerably. You should by no means add all columns mentioned in the where clause to an index; instead, use index filter predicates deliberately to reduce the data volume during an earlier execution step.

About the Author

As an author, trainer, and coach, Markus Winand specializes in helping developers cope with SQL performance issues. He also published the book SQL Performance Explained and tweets his best performance tips via @SQLPerfTips. http://winand.at/
