Many people consider tuning database queries to be some mysterious "dark art" out of a Harry Potter novel; with the wrong incantation, your data turns from a valuable resource into a pile of mush.
In reality, tuning queries for a relational database system is simple engineering and follows easy-to-understand rules or heuristics. The query optimizer translates the query you send to a MySQL instance, and then it determines the best way to get the requested data using those heuristics combined with what it knows about your data. Reread the last part of that: "what it knows about your data." The less the query optimizer has to guess about where your data is located, the better it can create a plan to deliver your data.
To give the optimizer better insight about the data, you can use indexes and histograms. Used properly, they can greatly increase the speed of a database query. If you follow the recipe, you will get something you will like. But if you add your own ingredients to that recipe, you may not get what you want.
Most modern relational databases use a cost-based optimizer to determine how to retrieve your data out of the database. That cost is based on reducing very expensive disk reads as much as possible. The query optimizer code inside the database server keeps statistics on getting that data as it is encountered, and it builds a historical model of what it took to get the data.
But historical data can be out of date. It's like going to the store to buy your favorite snack and being shocked at a sudden price increase or that the store closed. Your server's optimization process may make a bad assumption based on old information, and that will produce a poor query plan.
A query's complexity can work against optimization. The optimizer wants to deliver the lowest-cost query of the available options. Joining five different tables means that there are five-factorial or 120 possible combinations about which to join to what. Heuristics are built into the code to try to shortcut evaluating all the possible options. MySQL wants to generate a new query plan every time it sees a query, while other databases such as Oracle can have a query plan locked down. This is why giving detailed information on your data to the optimizer is vital. For consistent performance, it really helps to have up-to-date information for the query optimizer to use when making query plans.
Also, rules are built into the optimizer with assumptions that probably do not match the reality of your data. The query optimizer will assume all the data in a column is evenly distributed among all the rows unless it has other information. And it will default to the smaller of two possible indexes if it sees no alternative. While the cost-based model for an optimizer can make a lot of good decisions, you can smack into cases where you will not get an optimal query plan.
A query plan?
A query plan is what the optimizer will generate for the server to execute from the query. The way to see the query plan is to prepend the word
EXPLAIN to your query. For example, the following query asks for the name of a city from the city table and the name of the corresponding country table, and the two tables are linked by the country's unique code. This case is interested only in the top five cities alphabetically from the United Kingdom:
SELECT city.name as 'City', country.name as 'Country' FROM city JOIN country ON (city.countrycode = country.code) WHERE country.code = 'GBR' LIMIT 5;
EXPLAIN in front of this query will give the query plan generated by the optimizer. Skipping over all but the end of the output, it is easy to see the optimized query:
select `world`.`city`.`Name` AS `City`, 'United Kingdom' AS `Country` from `world`.`city` join `world`.`country` where (`world`.`city`.`CountryCode` = 'GBR') limit 5;
The big changes are that
country.name as 'Country' was changed to
'United Kingdom' AS 'Country' and the
WHERE clause went from looking in the country table to the city table. The optimizer determined that these two changes will provide a faster result than the original query.
You will hear indexes and keys used interchangeably in the MySQL-verse. However, indexes are made up of keys, and keys are a way to identify a record, hopefully uniquely. If a column is designed as a key, the optimizer can search a list of those keys to find the desired record without having to read the entire table. Without an index, the server has to start at the first row of the first column and read through every row of data. If the column was created as a unique index, then the server can go to that one row of data and ignore the rest. The more unique the value of the index (also known as its cardinality), the better. Remember, we are looking for faster ways of getting to the data.
The MySQL default InnoDB storage engine wants your table to have a primary key and will store your data in a B+ tree by that key. A recently added MySQL feature is invisible columns—columns that do not return data unless the column is explicitly named in the query. For example,
SELECT * FROM foo; doesn't provide any columns that are designated as hidden. This feature provides a way to add a primary key to older tables without recoding all the queries to include that new column.
To make this even more complicated, there are many types of indexes, such as functional, spatial, and composite. There are even cases where you can create an index that will provide all the requested information for a query so that there is no need to access the data table.
Describing the various indexes is beyond the scope of this article, so just think of an index as a shortcut to the record or records you desire. You can create an index on one or more columns or part of those columns. My physician's system can look up my records by the first three letters of my last name and birthdate. Using multiple columns requires using the most unique field first, then the second most unique, and so forth. An index on year-month-day works for year-month-day, year-month, and year searches, but it doesn't work for day, month-day, or year-day searches. It helps to design your indexes around how you want to use your data.
A histogram is a distribution of your data. If you were alphabetizing people by their last name, you could use a "logical bucket" for the folks with last names starting with the letters A to F, then another for G to J, and so forth. The optimizer assumes that the data is evenly distributed within the column, but this is rarely the case in practical use.
MySQL provides two types of histograms: equal height, where all the data is divided equally among the buckets, and singleton, where a single value is in a bucket. You can have up to 1,024 buckets. The amount of buckets to choose for your data column depends on many factors, including how many distinct values you have, how skewed your data is, and how high your accuracy really needs to be. After a certain amount of buckets, there are diminishing returns.
This command will create a histogram of 10 buckets on column c1 of table t:
ANALYZE TABLE t UPDATE HISTOGRAM ON c1 WITH 10 BUCKETS;
Imagine you sell small, medium, and large socks, and each size has its own bin for storage. To find the size you need, you go to the bin for that size. MySQL has had histograms since MySQL 8.0 was released three years ago, yet they are not as well-known as indexes. Unlike indexes, there is no overhead for inserting, updating, or deleting a record. To update an index, an
ANALYZE TABLE command must be updated. This is a good approach when the data does not churn very much and frequent changes to the data will reduce the efficiency.
Indexes or histograms?
Use indexes for unique items where you need to access the data directly. There is overhead for updates, deletes, and inserts, but you get speedy access if your data is properly architected. Use histograms for data that does not get updated frequently, such as quarterly results for the last dozen years.
This article grew out of a recent presentation at the Open Source 101 conference. And that presentation grew out of a workshop at a PHP UK Conference. Query tuning is a complex subject, and each time I present on indexes and histograms, I find ways to refine my presentation. But each presentation also shows that many folks in the software world are not well-versed on indexes and tend to use them incorrectly. Histograms have not been around long enough (I hope) to have been misused similarly.