Greater, Less and BETWEEN


The biggest performance risk of an INDEX RANGE SCAN is the leaf node traversal. Keeping the scanned index range small is the golden rule of indexing. to check that, you can ask yourself where the range scan starts and where it ends.

The question is easy to answer if the SQL statement shows the start and stop conditions explicitly:

SELECT first_name, last_name, date_of_birth
  FROM employees
 WHERE date_of_birth >= TO_DATE(?, 'YYYY-MM-DD')
   AND date_of_birth <= TO_DATE(?, 'YYYY-MM-DD')

An index on DATE_OF_BIRTH will be scanned for the specified date range only. The scan starts at the first date and ends at the second. The index range scanned for this query cannot be narrowed any further.

The answer is less obvious, if a second columns gets involved:

SELECT first_name, last_name, date_of_birth
  FROM employees
 WHERE date_of_birth >= TO_DATE(?, 'YYYY-MM-DD')
   AND date_of_birth <= TO_DATE(?, 'YYYY-MM-DD')
   AND subsidiary_id  = ?

The optimal index has to cover both columns, of course. But the question is: in which order?

The following figures show the effect of the column order on the scanned index range when searching all employees of subsidiary 27, who were born between 1st and 9th January 1971.

Figure 2.2 visualizes a part of the index on DATE_OF_BIRTH and SUBSIDIARY_ID—in that order. Which range has to be scanned? Or, to put it another way: Where will the tree traversal end?

Figure 2.2. Range Scan in DATE_OF_BIRTH, SUBSIDIARY_ID index


The index is ordered by birth dates first. Only if two employees were born on the same day, the SUBSIDIARY_ID is used to sort these records. The query, however, filters for a date range. SUBSIDIARY_ID is not ordered in that range, hence, it is useless during tree traversal. The scanned index range is delimited by the conditions on DATE_OF_BIRTH only. It starts at the first entry matching the date range and ends at the last one. That are five leaf nodes in the figure.

It's a book!
You are just reading a book. Here is the table of content

The situation changes radically when reversing the column order so that the index starts with the SUBSIDIARY_ID. Figure 2.3 shows the search using this index.

Figure 2.3. Range Scan in SUBSIDIARY_ID, DATE_OF_BIRTH Index


The difference is that the equals operator limits the first index column to a single value (SUBSIDIARY_ID 27). The next index column is therefore sorted—at least within that SUBSIDIARY_ID value. So, there no need to visit the first leaf node because the branch node already indicates that there is no employee for subsidiary 27 born after 25th June 1969 stored in the first leaf node.

The three traversal ends at the second leaf node. Thereafter, the database follows the leaf node chain until finding a record that does not fulfill the search criterion anymore. That is, just the next leaf node entry. The range scan ends after reading a single leaf node.

Tweet this tip

Tip

Rule of thumb: Index for equality first—then for ranges.

Using this example, we can also falsify the myth that says the most selective column should be the first column in an index. The selectivity of each column alone is the same in both cases; there are 13 records for the date range and 13 records for the subsidiary. The selectivity does not give any advice in this case. Still, there is a definite answer how to order the columns.

The real performance difference depends on the data and the search terms. When searching for a tiny date range, the difference might be negligible. The wider the date range gets, the bigger the performance difference becomes.

To optimize performance, it is very important to know which index range is scanned. Most databases will even show it in the execution plan. You just have to know what to look for. The following execution plan, for example, indicates that the DATE_OF_BIRTH is the first in the index.

--------------------------------------------------------------
|Id | Operation                    | Name      | Rows | Cost |
--------------------------------------------------------------
| 0 | SELECT STATEMENT             |           |    1 |    4 |
|*1 |  FILTER                      |           |      |      |
| 2 |   TABLE ACCESS BY INDEX ROWID| EMPLOYEES |    1 |    4 |
|*3 |    INDEX RANGE SCAN          | EMP_TEST  |    2 |    2 |
--------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(:END_DT >= :START_DT)
3 - access(DATE_OF_BIRTH >= :START_DT 
       AND DATE_OF_BIRTH <= :END_DT)
    filter(SUBSIDIARY_ID  = :SUBS_ID)

The „Predicate Information“ for the INDEX RANGE SCAN gives the required hint. It describes the conditions of the where clause either as access or as filter predicate. This tells us how the database uses each condition.

Important

The access predicates express the start and stop conditions for an index search. They define the scanned index range.

Index filter predicates are applied during the leaf node traversal only. They do not narrow the scanned index range.

The appendix explains how to recognize access predicates in MySQL, SQL Server and PostgreSQL.

Note

The execution plan was simplified for clarity. The appendix explains the details about the „Predicate Information“ in full length.

The execution plan lists the conditions on the DATE_OF_BIRTH column as access predicates. That means, they limit the scanned index range. The DATE_OF_BIRTH is therefore the first column in the EMP_TEST index. The SUBSIDIARY_ID column is used as filter only.

The database can use all conditions from the where clause as access predicate if we turn the index definition around:

---------------------------------------------------------------
| Id | Operation                    | Name      | Rows | Cost |
---------------------------------------------------------------
|  0 | SELECT STATEMENT             |           |    1 |    3 |
|* 1 |  FILTER                      |           |      |      |
|  2 |   TABLE ACCESS BY INDEX ROWID| EMPLOYEES |    1 |    3 |
|* 3 |    INDEX RANGE SCAN          | EMP_TEST2 |    1 |    2 |
---------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(:END_DT >= :START_DT)
3 - access(SUBSIDIARY_ID  = :SUBS_ID
       AND DATE_OF_BIRTH >= :START_DT
       AND DATE_OF_BIRTH <= :END_T)

Finally, there is the between keyword. It allows you to specify the upper and lower bounds in one condition:

DATE_OF_BIRTH BETWEEN '01-JAN-71'
                  AND '10-JAN-71'

Note that between is always including the specified values. Just like using less-or-equal (<=) and greater-or-equal (>=) operators:

    DATE_OF_BIRTH >= '01-JAN-71' 
AND DATE_OF_BIRTH <= '10-JAN-71'

Recent Questions at Ask.Use-The-Index-Luke.com

0
votes
1
answer
229
views

query regd the CBO decision

Apr 17 at 10:27 Hulda(suspended)
index-choice optimizer
0
votes
3
answers
2.0k
views

Examples for Function Based Indexes?

Mar 25 at 15:52 Castorp 1
function-based
0
votes
1
answer
610
views

Updating multiple rows using a subquery in SQL

Jan 08 at 09:52 Jan 26
subquery update sql