select * has a bad reputation. Most people think it is because changes in the table definition cause hard-to-detect bugs. Listing all columns in the desired order solves that problem, right? Yes, but that doesn’t solve the second problem of select *.
The other problem is that asking for more columns causes more work for the database. The surprise is that the performance impact can be huge.
- Up to 100x slower when preventing an Index-Only Scan
Broadly speaking, the fewer columns you ask for, the less data must be loaded from disk when processing your query. However, this relationship is non-linear.
Quite often, selecting from a table involves two steps: (1) use an index to find the address where the selected rows are stored; (2) load the selected rows from the table. Now imagine that you are just selecting columns that are present in the index. Why should the database still perform the second step? In fact, most databases don’t. They can process your query just with the information stored in the index—hence index-only scan.
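To make this concrete, here is a minimal sketch in PostgreSQL syntax. The table, columns, and index names are made up for illustration, and whether the planner actually reports an index-only scan also depends on statistics and, in PostgreSQL, the visibility map:

```sql
-- Hypothetical example table with a two-column index.
CREATE TABLE sales (
  id        bigint PRIMARY KEY,
  sale_date date NOT NULL,
  amount    numeric,
  note      text
);
CREATE INDEX sales_date_amount ON sales (sale_date, amount);

-- All selected columns are in the index: step (2) can be skipped.
-- The plan typically shows "Index Only Scan using sales_date_amount".
EXPLAIN SELECT sale_date, amount
          FROM sales
         WHERE sale_date = DATE '2024-01-01';

-- "note" is not in the index: the table rows must be fetched as well,
-- so the plan falls back to a regular index scan plus table access.
EXPLAIN SELECT sale_date, amount, note
          FROM sales
         WHERE sale_date = DATE '2024-01-01';
```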
But why should an index-only scan be 100 times faster? Simple: an ideal index stores the selected rows next to each other. It’s not uncommon that each index page holds about 100 rows (a ballpark figure; it depends on the size of the indexed columns). Nonetheless, it means that one IO operation might fetch 100 rows. The table data, on the other hand, is usually not organized like that (clustered indexes and index-organized tables are the exceptions). There it is quite common that a page contains just one of the selected rows, along with many other rows that are of no interest to the particular query. So, the reason an index-only scan can be 100 times faster is that an index access can easily deliver 100 rows per IO, while a table access typically fetches just a few relevant rows per IO.
If you select a single column that’s not in the index, the database cannot do an index-only scan. If you select all columns … well, I guess you know the answer.
Further, some databases store large objects in a separate place (e.g., LOBs in Oracle). Accessing those causes an extra IO too.
- Up to 5x slower when bloating server memory footprint
Databases generally avoid storing the result in the server’s main memory; instead, they deliver each row right after loading it and forget about it again. Sometimes, however, buffering is inevitable. Sorting, for example, needs to keep all rows, and all selected columns, in memory to do the job. Once again, the more columns you select, the more memory the database needs. In the worst case, the database might even need to do an external sort on disk.
However, most databases are extremely well tuned for this kind of workload. Although I’ve quite often seen sorting speed up by a factor of two just by removing a few unused columns, I cannot remember ever having gotten more than a factor of five. But it’s not just sorting; hash joins are rather sensitive to memory bloat too. Don’t know what that is? Please read this article.
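In PostgreSQL you can watch the memory effect directly. This sketch reuses the hypothetical sales table from above; the reported sort method depends on your work_mem setting and data volume:

```sql
-- Wide rows: the sort buffers every selected column and spills to
-- disk sooner ("Sort Method: external merge  Disk: ...").
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM sales ORDER BY amount;

-- Narrow rows: only the needed columns pass through the sort, which
-- often keeps it in memory ("Sort Method: quicksort  Memory: ...").
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, amount FROM sales ORDER BY amount;
```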
These are just the top two issues from the database’s perspective. Remember that the network suffers too, and that the client needs to process the data as well, which might put considerable pressure on garbage collection.
Many people think the star itself is the bad thing. Consequently, they believe they don’t have a problem if their ORM lists all columns by name anyway. In fact, the crime is selecting all columns without thinking about it, and most ORMs readily commit this crime on behalf of their users.
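For illustration, this is roughly what that looks like on the wire, again using the hypothetical sales table. The first statement is the shape many ORMs generate by default; no star in sight, yet still “everything without thinking about it”:

```sql
-- Typical ORM default: every mapped column, spelled out by name.
SELECT id, sale_date, amount, note
  FROM sales
 WHERE id = 42;

-- A deliberate projection for a use case that only needs the amount.
SELECT amount
  FROM sales
 WHERE id = 42;
```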
When we say “select * is bad”, we use the star as a symbol for “selecting everything without thinking about it”. That is the bad thing. If you need a catchier phrase to remember it, take this:
It’s not about the star, stupid!
If you like my way of explaining things, you’ll love my book SQL Performance Explained.