Difficulty: Beginner
Estimated Time: 15 minutes

Different Index Types in PostgreSQL

Indexes are a key aspect of query performance, and PostgreSQL has a very flexible implementation that supports several different indexing schemes. This tutorial will introduce you to the major index types in PostgreSQL, helping you understand each one a bit better and recognize the use cases where it applies. We assume you have completed the introduction to indexes tutorial or are familiar with the basic concepts behind indexes. The tutorial also uses EXPLAIN ANALYZE to look at how indexes change query plans, so we will assume you understand that command or have taken the Crunchy Data class on it.

The information we are about to cover on run-time binding, operator classes (op_classes), and operators is not crucial to understanding the different index types. For now, you can skim it and move on to the exercises; after you have done the exercises and seen the examples, it may make more sense. If you want to go deeper on indexes in the future, though, you will really need to understand these concepts.

One important fact about PostgreSQL is that all index classes are run-time bindable. A bound index class (such as B-tree or GiST) provides an operator class for a data type. That operator class supplies the set of operators (such as =, <, or <->) that the index can use to compare values of the data type.

An example is the B-tree index class, which has an operator class (op_class) for the CHAR data type. The op_class defines at least the = operator, so when your SQL contains a query like

select * from my_table where char_column = 'hello';

the query planner can use the B-tree index you created on char_column.
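If you are curious, you can see the operator classes each index method ships with by querying the system catalogs. This is just an illustrative sketch using the standard pg_am and pg_opclass catalogs; it is not something you need for the exercises.

-- List the B-tree operator classes PostgreSQL ships with
select am.amname as index_method,
    opc.opcname as operator_class,
    opc.opcdefault as is_default
from pg_opclass opc
join pg_am am on am.oid = opc.opcmethod
where am.amname = 'btree'
order by opc.opcname;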

Note: indexes may also be used to improve the efficiency of ORDER BY or JOIN clauses. Whether the index gets used for a WHERE, JOIN, or ORDER BY clause depends on heuristics in the query planner. Factors such as the amount of data returned, disk access speed, and available memory all influence whether the query planner chooses to use an index.
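For example, because a B-tree keeps its entries in sorted order, the planner can often satisfy an ORDER BY on an indexed column without a separate sort step. A quick way to check, using the se_details table you will meet later in this lesson (and assuming its state column is indexed), is:

-- With an index on state, the plan may show an index scan instead of a Sort node
explain select state from se_details order by state;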

The presence of operators is not much of a concern for B-tree indexes because, by default, PostgreSQL comes "out of the box" with B-tree operator classes for all the built-in data types. The availability of operators becomes more important when we talk about the other index types.

One more important implication from the discussion above is that, depending on the index type you choose, an operator (such as >, =, or @@) may or may not be usable with your index by the query planner. That is, you can create the index, but the query planner cannot use it with the operator in your WHERE clause if the op_class did not bind that operator to the data type. In the following tutorials we will show you which operators can be used with which data type and index combinations.

And with that introduction, let's get into the different indexes. We have loaded a database with almost all of the data from the Crunchy Demo Data repository, and we have logged you in to the psql prompt as:

user: groot

password: password

database: workshop
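If you were connecting on your own machine rather than through the pre-built terminal, the equivalent psql invocation would look something like this (the host and port are assumptions for a typical local setup; psql will prompt for the password):

psql -h localhost -p 5432 -U groot -d workshop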

Final Notes

And with that, we are done digging a little deeper into the most useful indexes in PostgreSQL. If you want to dive deeper on each index, here are the links to each index type we covered today:

Remember, there is always a tradeoff with using an index. Be sure to look at disk size, improvements in query time, and insert/update/delete timings when working with indexes.

Enjoy learning about PostgreSQL? Sign up for our newsletter and get the latest tips from us each month.

PostgreSQL Index Types

Step 1 of 4

B-Tree Index

B-tree in PostgreSQL stands for "balanced tree," and it is the indexing scheme used by default when you create an index. B-trees work well on data types that have a natural sort order along one dimension. They have several nice properties that make them very efficient, in both size and speed, for searching through data.

The index creates a data structure made up of nodes, branches, and leaves (the final pages on disk to access) that divides up the data for quick retrieval. The depth of the tree grows slowly (logarithmically) with the amount of data, so the upper levels of the tree form a small structure that can often fit in memory even for the largest data sets.

Purpose

As mentioned in the introduction, since PostgreSQL ships with B-tree operator classes for all the default data types, the B-tree index will be the type you use most often. In general, start with a B-tree index, and if that doesn't work, then look to the other index types. Certain advanced data types, like full-text, JSON, or PostGIS data, will not work with B-tree in the way you expect.

Operators

From the PostgreSQL documentation we get the default B-tree operators:

<

<=

=

>=

>

Constructs equivalent to combinations of these operators, such as BETWEEN and IN, can also be implemented with a B-tree index search. Also, an IS NULL or IS NOT NULL condition on an index column can be used with a B-tree index.
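As an illustration, all of the following predicates on the state column could be served by a single B-tree index on that column (these are sketches against the se_details table used later in this lesson; the planner will only consider the index once it exists):

-- Range, list, and null checks can all use the same B-tree index
explain select event_id from se_details where state between 'ALABAMA' and 'ARKANSAS';
explain select event_id from se_details where state in ('VERMONT', 'WYOMING');
explain select event_id from se_details where state is null;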

The optimizer can also use a B-tree index for queries involving the pattern matching operators LIKE and ~ if the pattern is a constant and is anchored to the beginning of the string, for example, col LIKE 'foo%' or col ~ '^foo', but not col LIKE '%bar'. It is also possible to use B-tree indexes for ILIKE and ~*, but only if the pattern starts with non-alphabetic characters, i.e., characters that are not affected by upper/lower case conversion.

<> (or !=) is handled as the negation of the = operator. If you want to take advantage of the index with LIKE queries, it is important to understand the first section of the documentation on operator classes and families. In particular, with text columns you need to make sure you use the operator class that matches your string column's type: text, char, or varchar.
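For example, if your database uses a locale other than C, a default B-tree index cannot be used for LIKE at all. In that case you would build an additional index with the pattern-matching operator class for your column's type. The sketch below assumes state is a text column (use varchar_pattern_ops for varchar); the index name is hypothetical:

-- Extra index that supports anchored LIKE searches in a non-C locale
create index se_details_state_like_idx on se_details (state text_pattern_ops);
explain select event_id from se_details where state like 'NEW%';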

Size and Speed

Let's go ahead and make some B-tree indexes on different columns and see their disk size and the difference they make in query times.

We'll start by getting some information on the storm events table and its "state" column, which holds the name of a United States state.

First let's look at some of the data.

NOTE: if you click in the black box below, the code will execute in the terminal on the right.

select state from se_details limit 5;

We can see that the column stores textual data: the state name in all upper-case characters.
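It is also worth checking the column's cardinality, since an index helps most when a predicate narrows the table down to a small fraction of its rows. A quick check (just a sketch, not required for the lesson):

-- How many distinct states are there, and how many rows overall?
select count(distinct state) as distinct_states, count(*) as total_rows
from se_details;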

Now let's look at some statistics of the column versus the overall table:

select
    pg_size_pretty(sum(pg_column_size(state))) as total_size,
    pg_size_pretty(avg(pg_column_size(state))) as average_size,
    sum(pg_column_size(state)) * 100.0 / pg_relation_size('se_details') as percentage,
    pg_size_pretty(pg_relation_size('se_details')) as table_size
from se_details;

We can see that the data in the column is relatively small in size and is not a large percentage of the overall table size.

Before the Index

Next, let's look at how long it takes to query all the storms that happened in New York state (the state where I grew up). This column is a good candidate for an index because people may often want to query based on their home state or sort the output by state.

explain analyze select event_id, state from se_details where state = 'NEW YORK';

We can see that the query planner did a sequential scan (Seq Scan), and the "Execution Time" tells us this query took 34.41 milliseconds (times may be slightly different depending on the load on your machine).

Finally, let's see how long an insert takes. Single inserts are so fast that if we "benchmark" one, there is too much variation to see the difference clearly. Therefore, we are going to insert about 10,000 rows using simulated data and UUIDs. We are generating our UUIDs with the pgcrypto extension.

Since EXPLAIN ANALYZE actually executes the statement, we are going to wrap it in a transaction that we roll back. This way we don't change the data in the table.


begin;
explain analyze insert into se_details (event_id, state) values (generate_series(100000,110000), gen_random_uuid());
rollback;

And we can see that carrying out the inserts took about 45 milliseconds (again, timing may vary on your machine).

After the Index

Let's go ahead and make an index on the state column. The syntax for this is relatively straightforward:

create index se_details_state_idx on se_details(state);

Since we didn't specify a USING method in this call, the index takes the default, which is B-tree. Given how small our table is, the indexing operation should be extremely quick.
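If you prefer to be explicit, the same index could have been created with the method spelled out (shown with a hypothetical alternate name so it doesn't collide with the index we just made):

-- Equivalent to the default; the btree method is named explicitly
create index se_details_state_btree_idx on se_details using btree (state);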

Now let's look at the size of the index:

\di+ se_details_state_idx

You can see that the index's total size is a little over 3x the size of the original column data. As the table grows, though, the index should grow at a slower rate than the data in the column itself.
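If you want to compare the two numbers side by side, a sketch like this works (pg_relation_size accepts an index name just as it does a table name):

-- Compare the index's on-disk size with the raw column data it covers
select pg_size_pretty(pg_relation_size('se_details_state_idx')) as index_size,
    pg_size_pretty(sum(pg_column_size(state))) as column_size
from se_details;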

Now let's look at the benefit we get from this added use of space.

First let's do our same simple query again:

explain analyze select event_id, state from se_details where state = 'NEW YORK';

We can see that the query planner used the index because the first node in the query plan (the one farthest down in the output) is a Bitmap Index Scan on our index. Over 3 runs of this query I got a maximum run time of 13.5 milliseconds and a minimum of 1.72 milliseconds, giving an approximate 2x to 20x speedup in query time!

You can see that the query planner won't use the index if we instead use the < operator:

explain analyze select event_id, state from se_details where state < 'NEW YORK';

It doesn't use the index because over 50% of the data in the table is returned by this query (much more than the roughly 5-10% threshold at which an index scan stops being faster than a sequential scan).
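You can sanity-check that selectivity yourself with an aggregate FILTER clause (just an illustrative query):

-- What fraction of rows does the predicate actually match?
select count(*) filter (where state < 'NEW YORK') * 100.0 / count(*) as pct_matched
from se_details;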

But if we ask for only the states that come alphabetically after Wisconsin (which is just Wyoming), the query planner now uses the index.

explain analyze select event_id, state from se_details where state > 'WISCONSIN';

This change in query plan occurs because we are now returning far fewer rows, so scanning the index and then randomly accessing the disk pages that hold the data is faster than a sequential scan.

Now let's wrap up by looking at how the B-tree index affects insert speed. Let's redo our inserts from before. Again, because single inserts are so quick, we need to do 10K of them to detect the difference.


begin;
explain analyze insert into se_details (event_id, state) values (generate_series(100000,110000), gen_random_uuid());
rollback;

When we run it this time, we see run times between 95 and 262 milliseconds, so anywhere from a 2x to over 5x increase in insert time. One thing to note is that this only becomes a significant difference with a large number of inserts.

Be aware that we only have two indexes on this table: the primary key (event_id) and the one on the state column. If you have more indexes that receive data on an insert, each of them will add to the insert timing.
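You can list every index the table carries, each of which must be updated on every insert, with a quick catalog query (a sketch using the standard pg_indexes view):

-- Every index on se_details, with its on-disk size
select indexname,
    pg_size_pretty(pg_relation_size((schemaname || '.' || indexname)::regclass)) as index_size
from pg_indexes
where tablename = 'se_details';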

With that, we are done with the lesson, but you should play around some more if you want. If you want to drop the index to run some tests, the command would be:

drop index se_details_state_idx;

You can also create B-tree indexes on other columns and see what happens. In the exercises we are going to move on to the GIN index.