I thought that I knew a thing or two about sql indexes. It’s easy right, you just run a couple sql commands to create your index, and you’re off to the races. Turns out that it isn’t quite that easy to get the index you need. You need to know a few things. You need to know how sql indexes work (especially for the database you’re using, we’ll be discussing sqlite3) You need to know how you’re going to query your data.
I thought that I knew a thing or two about sql indexes. It’s easy right, you just run a couple sql commands to create your index, and you’re off to the races. Turns out that it isn’t quite that easy to get the index you need. You need to know a few things.
If you’re brand new to sql and have never heard of indexes, it would make sense that we take a bit of a sidebar and discuss why they are important. Let’s lay a bit of groundwork to get us started on this discussion.
For the rianflow cycle counting app we’ve been discussing in the past few posts, we have a database with a few tables. The two tables that are important for this discussion are projects and datum. It is beyond the scope of this post to go into the details of each table and how they’re being used, but we’ll go over them slightly to get the example up to speed.
As you can imagine a project stores has an id, name, and a few other non-pertinent items.
CREATE TABLE IF NOT EXISTS project(
id VARCHAR(20) PRIMARY KEY,
name text,
);
The datum table represents one reading from the dataset, it has an id, the project id it belongs to, a status, an ordinal (to know where it came in the set, I know I like things like UUIDS and XIDs for primary keys), and the value (again as well as some other non-pertinent items). The statuses we’re using are to show what data points are to be used in the final cycle counting, available values are:
CREATE TABLE IF NOT EXISTS datum(
id VARCHAR(20) PRIMARY KEY,
project_id VARCHAR(20),
status VARCHAR(20),
ordinal INTEGER,
value REAL
);
Great, we’ve now defined some basic tables that will be storing our data.
There are two main queries that we make in the process of rainflow cycle counting and displaying the raw data that was imported.
Getting started with our queries, they are going to be quite simple in theory. They will be a bit more complicated because we’re using a slider to load different chunks of the data into a chart.
SELECT id, status, ordinal, value FROM datum wHERE project_id = ? LIMIT 30 OFFSET ? ORDER BY ordinal ASC;
Boom! You’ve just loaded in all of your data for a given project. If your number of projects and data per project is small, this will run without a problem. But in my sample database, I had about 150 Million data points. And it took 10 seconds to return a list of data, even the 30 that we set the limit on.
This can’t be right, why would it takes 10 seconds to load 30 data values from a sqlite database!?!
Well it has to do with how the data is stored, looked up, and retrieved out of the database. In this case, the primary key of the table is the only thing that is helping the database lookup data. So we end up with a full table scan, which is as bad as it sounds, each row must be inspected by the database engine to see if it meets the proper criteria. Not the best situation to be in.
Not only that, but a separate operation actually has to take place, another temporary b-tree (this is a balanced tree, not a binary tree) index is created on the ordinal to determine sort order. Not good!
We can actually see this by using the magic query prefix, EXPLAIN QUERY PLAN.
EXPLAIN QUERY PLAN SELECT id, status, ordinal, value FROM datum wHERE project_id = ? LIMIT 30 OFFSET ? ORDER BY ordinal ASC;
Once we run this on our database, we’ll see how sqlite plans the query:
QUERY PLAN
|–SCAN datum
`–USE TEMP B-TREE FOR ORDER BY
There has got to be a better way. Luckily for us there is a better way.
Now that we know how we’re going to make our query, we have a part of what we need to improve our performance. We know that we’re going to query based upon the project_id that is part of the datum table. So let us create an index on the database table.
CREATE INDEX idx_datum_projectid ON datum(project_id);
What does this do? We are creating an index, we’re giving it a name of idx_datum_projectid. We’re also telling it that we want it on the datum table using the column project_id.
But what is the index actually doing? Well it is doing what it sounds like actually. When a new row is inserted, the database engine is creating a pointer with the project_id to the internal row identifier. Now you can query the datum table, when using the project_id, and it knows all of the rows that have the desired project_id.
Let’s run our query again. Wait a minute, that took like 10 seconds again? It looks like we’re still having some issues with our query. Let us check it out with our explain again.
QUERY PLAN |–SEARCH datum USING INDEX idx_datum_projectid (project_id=?) `–USE TEMP B-TREE FOR ORDER BY
There is some good and some bad to this.
We’re querying our datum, with our project_id, and we need to order it by the ordinal. If we look at the output, we’re creating a temp b-tree for the order by statement. If we could add the ordinal to our index, that might just help.
CREATE INDEX idx_datum_projectid ON datum(project_id, ordinal);
Run the query again, on my machine it was near instantaneous. That is what we want! Just to verify let us check on the EXPLAIN QUERY PLAN.
QUERY PLAN
`–SEARCH datum USING INDEX idx_datum_projectid (project_id=?)
We see that the datum table is being searched by the index. And that we’re not creating a temp b-tree to sort, but why is that? Well I don’t want to dig too far into the documentation. But if you do you’ll find that the each subsequent remaining column actually represents the search order for the query. In this case, we are querying on the project_id, and not the ordinal, so the ordinal then becomes the sort order for our query.
There are some pros and cons to using indexes, knowing and understanding them will help you better know what should be done.
Like all things in software engineering, decisions are informed by way of trade offs. As you look at the pro and cons from above, you’ll see something interesting, you trade faster queries for slower inserts (and larger sizes, which disk costs make that less important, but it should be known).
Because of this, it isn’t recommended to just jump in and start adding indexes to everything. Something that took me awhile to become proficient in is understanding this question, “What is your query pattern?”
This just means:
Great, you’ve made it this far. We discussed adding an index to a query and looking into how sqlite uses the index to improve query performance. But, if you’ve been paying attention you’ll notice there has been another query used by the rainflow cycle counting app that we haven’t actually covered yet. That is querying our datum table, by the project id and by its status.
As you look above on step 5 you’ll notice it has all in bold. That is because I ran into some other pitfalls using this index, and getting the query performance that I needed with the indexes we discussed in this post.
Stay tuned, and we’ll go over the next index that I had to start working with to get the desired performance.