Use The Index, Luke Blog - Latest News About SQL Performance


About Optimizer Hints

Quite often I’m asked what I think about query hints. The answer is more lengthy and probably also more two-fold than most people expect it to be. However, to answer this question once and forever, I though I should write it down.

The most important fact about query hints is that not all query hints are born equally. I distinguish two major types:

Restricting Hints

Most query hints are restricting hints: they limit the optimizers’ freedom to choose an execution plan. “Hint” is an incredibly bad name for these things as they force the optimizer to do what it has been told—probably the reason MySQL uses the FORCE keyword for those.

I do not like restricting hints, yet I use them sometimes to test different execution plans. It usually goes like this: when I believe a different execution plan could (should?) give better performance, I just hint it to see if it really gives better performance. Quite often it becomes slower and sometimes I even realize that the execution plan I though of does not work at all—at least not with the database I’m working at that moment.

Typical examples for restricting query hints are hints that force the database to use or not use a particular index (e.g., INDEX and NO_INDEX in the Oracle database, USE INDEX and IGNORE INDEX in MySQL, or INDEX, FORCESEEK and the like in SQL Server).

So, what’s wrong with them? Well, the two main problems are that they (1) restrict the optimizer and that they (2) often need volatile object names as parameters (e.g., index names). Example: if you use a hint to use index ABC for a query, the hint becomes ineffective when somebody changes the name of the index to ABCD. Further, if you restrict the optimizer you can no longer expect it to adjust the execution plan if you add another index that servers the query better. Of course there are ways around these problems. The Oracle database, for example, offers "index description" hints to avoid both issues: instead of specifying the index name, it accepts a description of the ideal index (column list) and it selects the index that matches this definition best.

Nevertheless, I strongly recommend against using restricting query hints in production. Instead you should find out why the optimizer does “the wrong thing”™ and fix the root cause. Restricting hints fix the symptom, not the cause. That being said, I know that there is sometimes no other reasonable choice.

Supporting Hints

The second major type of query hints are supporting hints: they support the optimizer by providing information it doesn’t have otherwise. Supporting hints are rare—I’m only aware of a few good examples and the most useful one has already become obsolete: it’s FAST number_rows (SQL Server) and FIRST_ROWS(n) (Oracle). They tell the optimizer that the application plans to fetch only that many rows of the result. Consequently, the optimizer can prefer using indexes and nested loop joins that would be inefficient when fetching the full result (see Chapter 7, Partial Results for more details). Although being kind-of obsolete, I’m still using these hints as the defining example for supporting hints because they provide information the optimizer cannot have otherwise. This particular example is important enough that it was worth defining new keywords in the ISO SQL:2008: FETCH FIRST ... ROWS ONLY and OFFSET. That’s why this hint is a very good, yet obsolete example for supporting query hints.

Another example for supporting hints is the (undocumented) CARDINALITY hint of the Oracle database. It basically overwrites the row count estimate of sub-queries. This hint was often used if the combined selectivity of two predicates was way off the product of the selectivity of each individual predicate (see Combined Selectivity Example). But this hint is also outdated since Oracle database 11g introduced extended statistics to cope with issues like that. SQL Server’s filtered statistics serve the same purpose. If your database cannot reflect data correlation in it’s statistics, you’ll need to fall back to restricting hints.

The Oracle hint OPT_ESTIMATE is somehow the successor of the CARDINALITY hint for cases when the estimations are still off. Pythian wrote a nice article about OPT_ESTIMATE.

Combined Selectivity Example

Let’s say we have two Y/N columns and each has a 50:50 distribution. When you select using both columns most optimizers estimate that the query matches 25% of the table (by multiplying two times 50%). That means that the optimizer assumes there is no correlation between those two columns.

Column 1Column 2count(*)
YY25
YN25
NY25
NN25

If there is a correlation, however, so that most rows that have Y in one column also have Y in the other column, then the estimate is way off.

Column 1Column 2count(*)
YY49
YN1
NY1
NN49

If you query one of the rare Y/N combinations, the optimizer might refrain from using an index due to the high cardinality estimate. Nevertheless, it would be better to use the index because this particular combination is very selective.

It think supporting hints are not that bad: they are just a way to cope with known limitations of the optimizer. That’s probably why they tend to become obsolete when the optimizers evolve.

And Then There Was PostgreSQL

You might have noticed that I did not mention PostgreSQL. It’s probably because PostgreSQL doesn’t have query hints although it has (which are actually session parameters). Confused? No problem, there is a short Wiki for that.

However, to see some discussion about introducing a similar hint as CARDINALITY described above or implementing "cross column statistics" read the first few messages in this thread from February 2011 (after the first page, the discussion moves to another direction). And the result? PostgreSQL still doesn’t have a good way to cope with the original problem of column correlation.

If you like my way to explain things, you’ll love SQL Performance Explained.

Pagination Done the Right Way

Here is another slide deck for my "Pagination Done the Right Way" talk that I've given at many occassions.

Please also have a look at this blog post by Gary Millsap about “The Ramp”. Do you see how using OFFSET implements this anti-pattern?

Indexes: The neglected performance all-rounder

I think I've actually never shared the slides of my talk given in Paris at Dalibo's PostgreSQL Session about Performance. So, here they are.

Afraid of SSD?

When clients tell me about their plans to invest in SSD storage for their database, they often look at me like a doctor telling a patient about his deadly disease. I didn’t discuss this with my clients until recently, when one client just asked me straight away: “As SQL tuning guy, are you afraid of SSD because it kills your job?” Here is what I told that client.

Generally, people seem to believe that SSD are just much faster than HDD. As a matter of fact, this is only partially true because—as I also mentioned in Chapter 3—performance has two dimensions: response time and throughput. Although SSDs tend to deliver more performance on both axes, it has to be considered that the throughput delivered by SSD is “only” about five times as high as that of HDDs. That’s because HDDs are not bad at sequential read/write operations anyway.

Sure enough a five times faster storage makes many performance problems go away…for a while…until you have five times more data. For a decently growing startup it might just take a few months until you have the same problem again. However, this is not the crucial point here. The crucial point is that SSDs essentially fix the one performance issue where HDDs are really bad at: the response time. Due to the lack of moving parts, the response time of SSDs is about fifty times faster as that of HDDs. Well, that really helps solving problems for a while.

However, there is a catch—maybe even a Catch-22: If you want to get the factor 50 speed-up of SSDs, you’d better avoid reading large chunks of sequential data, because that’s where you can only gain a factor five improvement. To put that into database context: if you are doing many full table scans, you won’t get the full potential of SSD. On the other hand, index lookups have a tendency to cause many random IO operations and can thus benefit from the fast response time of SSDs. The fun part is that properly indexed databases get better benefits from SSD than poorly indexed ones. But guess who is most desperately betting on SSD to solve their performance problems? People having proper indexes or those who don’t have them?

The story goes on: which database operation do you think causes most random IO operations? Of course it’s our old friend the join—it is the sole purpose of joins to gather many little data fragments from different places and combine them into the result we want. Joins can also greatly benefit from SSDs. SSDs actually voids one of arguments often brought up by NoSQL folks against relational databases: with SSD it doesn’t matter that much if you fetch data from one place or from many places.

To conclude what I said to my client: No, as an indexing-focused SQL performance guy, I’m absolutely not afraid of SSD.

If you like my way to explain things, you’ll love SQL Performance Explained.

Training and Conference Dates

A few weeks ago I invited you to take part in a survey about your interest in SQL performance training for developers. In the meanwhile there is a schedule for German and English trainings available. In particular, I’d like to point out four online courses I’m giving during summer time. Hope to see you there.

Another opportunity for a short get-together are conferences I’ll attend and/or speak at. The next one is the Prague PostgreSQL Developers’ Day (p2d2) next Thursday. You can buy SQL Performance Explained there (CZK 700; in the breaks and after the conference). The next conference that I can already confirm is this years PostgreSQL Conference Europe in Dublin end of October. You might have noticed that I attended a lot of PostgreSQL conferences recently (Brussels in February, Paris in March). I do plan to attend other conferences too and I’ve just filed some proposals for talks at other conferences. I’ll let you know if they are accepted.

One more thing: in case you are involved in the organization of developer and/or database centric event—no matter how small—you might want to have a look at the small sponsoring I can offer.

The two top performance problems caused by ORM tools

ORMs are not entirely useless…” I just tweeted in response to a not exactly constructive message that we should fire developers who want to use ORMs. But than I continued in a not exactly constructive tone myself when I wrote "…the problem is that the ORM authors don’t know anything about database performance”.

Well, I don’t think they “don’t know anything” but I wonder why they don’t provide decent solutions (including docs) for the two most striking performance problems caused by ORMs? Here they are:

The infamous N+1 selects problem

This problem is actually well-known and I believe there are working solutions in all recent ORMs. The problem that remains is about documentation and culture: although there are solutions, many developers are not aware of them or still live in the “joins are slow—let’s avoid them” universe. The N+1 selects problem seems to be on the decline but I still think ORMs’ documentation should put more emphasis on joins.

For me, it looks very similar to the SQL injection topic: each and every database access layer provides bind parameters but the documentation and books just show examples using literal values. The result: SQL injection was the most dangerous weakness in the CWE/SANS Top 25 list. Not because the tools don’t provide proper ways to close that hole, but because the examples don’t use them consistently.

The hardly-known Index-Only Scan

Although the Index-Only Scan is one of the most powerful SQL tuning techniques, it seems to be hardly known by developers (related SO question I was involved in recently). However, I’ll try to make the story short.

Whenever a database uses an index to find the requested data quickly, it can also use the information in the index to deliver the queried data itself—if the queried data is available in the index. Example:

CREATE TABLE demo
     ( last_name  VARCHAR(255)
     , first_name VARCHAR(255)
     -- more columns and constraints
     );

CREATE INDEX ios_demo
    ON demo (last_name, first_name);

SELECT last_name, first_name
  FROM demo
 WHERE last_name = ?;

If the database uses the IOS_DEMO index to find the rows in question, it can directly use the first name that is stored along with the last name in the index and deliver the queries’ result right away without accessing the actual table. That saves a lot of IO—especially when you are selecting more than a few rows. This technique is even more useful (important) for databases that use clustered indexes like SQL Server or MySQL+InnoDB because they have an extra level of indirection between the index and the table data.

Did you see the crucial prerequisite to make an Index-Only Scan happen? Or asking the other way around: what’s a good way to make sure you’ll never get an Index-Only Scan? Yes, selecting all columns is the most easy yet effective Index-Only-Scan-Preventer. And now guess who is selecting all the columns all the time on behalf of you? Your good old friend the ORM-tool.

This is where the tool support is really getting sparse. Selecting partial objects is hardly supported by ORMs. If it is supported then often in a very inconvenient way that doesn’t give runtime control over the columns you’d like to select. And for gods sake, let’s forget about the documentation for a moment. To say it straight: using Index-Only Scans is a pain in the a** with most ORMs.

Besides Index-Only Scans, not selecting everything can also improve sorting, grouping and join performance because the database can save memory that way.

What we would actually need to get decent database performance is a way to declare which information we are going to need in the upcoming unit of work. Malicious tongues might now say “that’s SQL!” And it’s true (in my humble opinion). However, I do acknowledge that ORMs reduce boilerplate code. Luckily they often offer an easier language than SQL: their whatever-QL (like HQL). Although the syntactic difference between SQL and whatever-QL is often small, the semantic difference is huge because they don’t work on tables and columns but on objects and classes. That avoids a lot of typing and feels more natural to developers from the object world. Of course, the whatever-QL needs to support everything we need—also partial objects like in this Doctrine example.

After all, I think ORM documentation should be more like this: first introduce whatever-QL as simplified SQL dialect that is the default way to query the database. Those methods that are currently mentioned first (e.g., .find or .byId) should be explained as a shortcut if you really need only one row from a single table.

If you like my way to explain things, you’ll love SQL Performance Explained.

I need your help!

Hi!

Would you please spare me a few minutes and help me a little bit?

As you might know—or maybe not—I’m making my living as an independent trainer and consultant. Up till now I’ve only delivered on-site training at the clients’ site, but I though it makes sense to offer open trainings as well so that singe participants can also join. For that I’d need to know how many people would like to join such a training, where they are physically located, and which database they are using. So, I’ve set up a short survey:

http://winand.at/services/sql-performance-training/survey

It’s just one form and won’t take much time. Naturally, you are under no obligation if you fill out the form, and as a way of saying thank you for your time, I’ll be drawing three participants to receive a DON’T PANIC towel—the perfect preparation for the upcoming Towel Day on 25th May. So take a minute to complete the survey now because the deadline is 26th of April!

Thanks,

-markus

Party time

Good news everybody. I’m in a good mood today ;) I’ll explain. First of all, the French edition of SQL Performance Explained is almost done and will ship at the end of the month (pre-order now ;). It has a different name (SQL : Au cœur des performances) but is otherwise the same. So, that triggered some kind of "project done" feeling.

Further, I’ve given a few good trainings recently and looking forward for some more in the next weeks. It’s just awesome when the audience comes up with great questions, but it’s even more awesome if they come up with great conclusions that proof their correct understanding. That triggers a "missing accomplished" feeling.

So, as the pre-sale of the French edition is running on a 10% discount, I though why not giving a general 10% discount to celebrate the French edition? And here it is: ’aucoeur’. It’s only valid for order placed directly at http://sql-performance-explained.com/ (not valid on Amazon) and expires on March 28 (the release date of the French edition).

Further, I’ll repeat the ’retweet to win’ campaign every week during March. It goes like this: I’ll tweet something like ’Retweet this for a chance to win’. After a while, I’ll select one of the retweeters as winner. You can’t win twice (sorry @d_gustafsson). So, watch out, the first giveaway tweet comes soon.

Better Performance with PostgreSQL

Dalibo, the company where Guillaume Lelarge works (he does the French translation of SQL Performance Explained) invited me to give a talk at their (mostly French) for-free conference “Better Performance with PostgreSQL” on 28th March in Paris. Seats are limited: if you’d like to go there, register now!

And if you don’t speak French, there are other good news for you. SQL Performance Explained is currently on sale at Amazon.co.uk for £25.00 (you save £1.99, -7%). Please note that this is the free-delivery threshold for these countries: Belgium, Denmark, Luxembourg, Netherlands, Andorra, Finland, Gibraltar, Greece, Iceland, Ireland, Italy, Liechtenstein, Norway, Portugal, San Marino, Spain, Sweden, Vatican city and Poland. They will also ship for free to UK of course.

FOSDEM Impressions

Last weekend I’ve visited the Free and Open source Software Developers’ European Meeting (FOSDEM) in Brussels, Belgium. It is a huge and free conference covering countless open source topics such as Open/Libre Office, Microkernels, MySQL, NoSQL, PostgreSQL, OWASP, and much more.

On the day before FOSDEM, Friday, I’ve been at the PgDay giving my talk “Pagination Done the PostgreSQL Way” [slides]. I think it was well received but I’m still curious to get the feedback from the feedback forms. So, if you have been there, please fill in the feedback form!

Here you find a short list of notable talks. I’ve actually attended more, I’m just mentioning the most impressive ones.

The GNU/Hurd architecture, nifty features, and latest news

Hurd is still being released in two years. This has been funny 15 years ago, but is quite sad today. However, they aim to get a fully supported Debian distro based on Hurd by that time. Good luck. It seems that current virtualization developments (e.g. Linux containers) are a very good fit for the Hurd architecture, effectively not needing any changes to support containers natively.

Lightning talk: Spoiling and Counter-spoiling

An attempt to make a coding contest that honours maintainable code. The problem: how to measure it? The solution: make three contests: (1) one for writing maintainable code; (2) one for incorporating difficult bugs into that code; (3) and a last one to fix those bugs again. Add some statistics to that and you know who’s code had the best maintainability by measuring how long it took to find and fix the bugs. Funny idea, but not yet executed.

Practical Security for developers, using OWASP ZAP

Promising tool for security checks (brute force). Video available at the homepage.

Lightning talk: The C2 programming language

Another successor of C. It’s a minimalistic approach to just change what really hurts in C. That are a few syntax issues and the preprocessor (#include). At a very early stage. See c2lang.org.

How to Share a Trademark

About legal issues (e.g., Trademark vs. Registered Trademark). Some example cases from the open source scene.

Lightning talk: BIND 10: DNS by Cooperating Processes

Have I ever mentioned that I used to be the DNS admin of an Austrian ISP? It’s more than 10 years ago, but I still hate BIND ;) Privately I’m only using djbdns since then. One of the fundamental design principles of djbdns is that there is a separation of concerns (e.g., serving a zone authoritatively is different from acting as caching resolver). Guess what’s gonna be introduced in BIND 10 ;)

PostGIS 2.0 and beyond

Really interesting talk where PostGIS is coming from (1.5 releases) and how 2.0 differs from it. One of the nice things about PostGIS presentations is that they always have nice visualizations (by definition). There was also a PostGIS 3D talk at PgDay.

When and How to Take Advantage of New Optimizer Features in MySQL 5.6

I’ve been to another MySQL talks as well and I must say they don’t feel like FOSS talks. It’s just Oracle presenting. However, the topics presented are no surprise (e.g. Index Condition Pushdown, a.k.a. Index-Filterpredicates). The scary thing about this talk was that the examples almost always needed hits to work. Sure you could say that these are just made up examples, but I think it just a proof of the optimizers weakness if it needs all the time hints to do the “right thing”.

A real-time architecture using Hadoop & Storm

Presenting the “classical” batch/online mixture to handle BigData: Running Hadoop/MapReduce in Batches and processing the new data online (in memory). To get the full picture, you must query both of them.

About the Author

Photo of Markus Winand
Markus Winand tunes developers for high SQL performance. He also published the book SQL Performance Explained and offers in-house training as well as remote coaching at http://winand.at/

?Recent questions at
Ask.Use-The-Index-Luke.com

2
votes
1
answer
1.4k
views
0
votes
2
answers
848
views

different execution plans after failing over from primary to standby server

Sep 17 at 11:46 Markus Winand ♦♦ 741
oracle index update
1
vote
1
answer
287
views

Generate test data for a given case

Sep 14 at 18:11 Markus Winand ♦♦ 741
testcase postgres