When an index contains all the columns involved in a query, we usually say that an index covers a query. Whenever, this happens, the DB2 optimizer usually chooses to access only that index to satisfy a query, which is called index-only access or index covering. "Usually" does not mean "always," however. For example, let us consider this table structure:
CREATE TABLE CONTACT( ZIPCODE INT NOT NULL, PHONE_NUMBER CHAR(10) NOT NULL, SOME_OTHER_STUFF VARCHAR(100)); CREATE INDEX CONTACT_ZIP_PHONE ON CONTACT(ZIPCODE, PHONE_NUMBER); CREATE INDEX CONTACT_PHONE ON CONTACT(PHONE_NUMBER);
Let us consider this query:
SELECT ZIPCODE, PHONE_NUMBER FROM CONTACT WHERE PHONE_NUMBER LIKE '312987654%' AND ZIPCODE = '60606'
Obviously the index CONTACT_ZIP_PHONE does cover the query. However, the DB2 optimizer has chosen not to use it. Instead, the DB2 optimizer has chosen to access the table itself via the other index, CONTACT_PHONE. This is a little bit surprising, isn't it? In fact, this decision makes perfect sense. The DB2 optimizer has simply been very aggressive in finding the very best access plan. Let us figure out why the DB2 optimizer's decision is really so good. On one hand, more that 15,000 rows match the condition ZIPCODE = '60606'. On the other hand, no more that 10 rows would match the other one, PHONE_NUMBER LIKE '312987654%'. That means that the condition on PHONE_NUMBER is definitely more selective. There is a well-known rule of thumb that says "put the most selective column first in the index definition." Having said that, let us take a detailed look at the real execution costs and see if the theory (and the optimizer) were right:
- 2 index pages were read, the root index page and one leaf level page
- 10 data pages were read, since 10 matching rows were scattered across the table
Now let us perform some "what if" analyses. Let us drop the CONTACT_PHONE index and re-run the same query. Now the DB2 optimizer has chosen to scan the part of the index CONTACT_ZIP_PHONE, starting at or after the value '60' and scanning to the value '61'. The real execution cost now is significantly higher:
- more than 100 index leaf pages were read
As we have seen, the optimizer was right not to choose the covering index.
Now let us return to the rule of thumb that we just mentioned: "put the most selective column first in the index definition." As we have just discussed, in most cases it makes perfect sense. However, there are some exceptions. Let us consider an example. To begin, we shall create an index with the most selective column last in the definition:
CREATE INDEX CONTACT_PHONE_ZIP ON CONTACT(PHONE_NUMBER, ZIPCODE);
Note that now there are two indexes with the same column list: both (PHONE_NUMBER, ZIPCODE) and (ZIPCODE, PHONE_NUMBER). Here is the query that benefits from the index that we have just created:
SELECT ZIPCODE, PHONE_NUMBER FROM CONTACT WHERE ZIPCODE = '60606'
Given a choice of these two indexes, the DB2 optimizer will most probably choose the index on (ZIPCODE, PHONE_NUMBER). During execution, only a portion of that index will be scanned to satisfy the query. Again, having dropped this index, the DB2 database engine will satisfy the query by scanning the whole index on (PHONE_NUMBER, ZIPCODE). That will definitely be slower. So if the query above is run frequently enough, it might make make sense to have an index on (ZIPCODE, PHONE_NUMBER).
As we have seen, the rule of thumb "put the most selective column first in the index definition" is just a recommendation. Yes, it is a very good recommendation, it makes perfect sense in most situations. However, there are situations when putting the most selective column last speeds up a query. Make your own choices based on a careful analysis of your own situation.
About the Author
Alexander Kuznetsov has 15 years of experience in software design, development and database administration. Currently he improves performance of applications running in a multi-terabyte database environment. Alexander is an IBM Certified Advanced Technical Expert (DB2 Cluster) and an IBM Certified Solutions Expert (Database Administration and Application Development).
Iudith M. writes: As the problem is described, the ZIPCODE column is THE LESS SELECTIVE, and not the most selective, because it returns a higher percentage of the total number of rows.
The author is right when he states that "Putting the most selective column first" in the index is just a rule of thumb, but not due to selectivity, but also due to the fact that the overall performance could be influenced by the physical location of the different rows in the table's blocks.
Using an index on more selective column may have a lower performance than an index on a less selective column; for example, if the "few" rows returned by the first index are "scattered" over many data blocks, while the "many" rows return by the second index could all reside in few database blocks.
It's true that each case should be tested separately, in the specific environment.
For More Information
- Feedback: E-mail the editor with your thoughts about this tip.
- More tips: Hundreds of free DB2 tips and scripts.
- Tip contest: Have a DB2 tip to offer your fellow DBAs and developers? The best tips submitted will receive a cool prize -- submit your tip today!
- Ask the Experts: Our SQL, database design, Oracle, SQL Server, DB2, metadata, and data warehousing gurus are waiting to answer your toughest questions.
- Forums: Ask your technical DB2 questions--or help out your peers by answering them--in our active forums.
- Best Web Links: DB2 tips, tutorials, and scripts from around the Web.