Difficulty: beginner
Estimated Time: 10 minutes

Using PostgreSQL Full Text Search

This class is to quickly get you hands on with using some of the features in PostgreSQL Full Text Search. The database has already been started and historical storm event data from the U.S. has already been loaded.

Here are the details on the database we are connecting to:

  1. Username: groot
  2. Password: password (same password for the postgres user as well)
  3. A database named: workshop

We will not be going in depth on Full Text search in PostgreSQL. We are just going to cover enough to whet your appetite. The idea is you should read and play more with Full Text Search after completing this exercise.

Final Notes

The container used in this class is available in Dockerhub. As long as you have Docker on your machine you can use the same version of PostgreSQL as the workshop. All the data from the workshop was intentionally chosen from public domain or permissive licenses so that you can use it for commercial and non-commercial purposes. Feel free to download it and play some more at your own pace on your own machine.

The only way to truly learn this technology is to play/experiment/use it for a while. You can keep using this environment and the documentation to play with full text search, especially things like highlighting results and changing document weights.

And with that, we are finished.

Full Text Search in PostgreSQL

Full-text Search

Full Text Search in PostgreSQL

While most of you probably know of the SQL search you can do on text with LIKE or ILIKE, did you know that PostgreSQL has a full text analysis program built right in. The search capabilities are similar to those available in Lucene and its derivative such as ElasticSearch and SOLR. In this exercise we are going to show you how to setup and utilize Full Text Search (FTS).

Basic ideas in FTS

There is a detailed discussion in the documentation about the concepts in FTS. For now let's just focus on the steps (along with simplify the actual work).

  1. The first step is to take the field(s) which have the content (a document) and analyze the text into words and phrases along with positions in the original text.
    • One piece of this analysis is called a tokenizer, which identifies words. numbers, urls... in the original text
    • The other piece converts these tokens into "lexemes". A lexeme is a normalized words and you need a dictionary to process for valid lexemes
  2. Then you store these store lexemes and their positions either in a column or an index
  3. Now you use a search function that understands lexemes and positions in the original document to carry out the search. This search function must use the same dictionary that was used to create the lexemes.

And with that very basic introduction let's get to it. We are going to do a FTS on the event narratives in the Storm Events details table.

Look at our data

The storm events data is public domain data from the U.S. weather service. We will be looking at the storm details information which contains narrative information about the storms.

et's go ahead and log in to our PostgreSQL database:

psql -U groot -h localhost workshop

Remember that the password is the word 'password'.

Now if you do:

\d se_details

PostgreSQL will show you a full description of the storm events details table. To see all the \ commands in PostgreSQL just do \? (though don't do it right now).

You will see text columns. any of which could be used by themself or combined together as a single document. Today we are just going to use event_narrative

event_narrative   | text

Build the index

Building a FTS index is actually quite simple:

CREATE INDEX se_details_fulltext_idx ON se_details USING GIN (to_tsvector('english', event_narrative));

This function will take a little while to run as it tokenizes and lexemes all the content in the column. The syntax is basically the same as creating any GIN index. The only difference is that we use the to_tsvector function, passing in the dictionary to use, 'english', and the field to analyze.

If you want to see the other default dictionaries PostgreSQL includes by default just query for it:

\dF

Quick note before we use that nice shiny index, there are actual two way to store the results of the text analyzer, either in an index or a in a separate column. While there is a full discussion of the tradeoffs in the documentation it boils down to:

  1. With an index, when data is updated or inserted the index will automatically analyze it. On the downside, you, the application developer, need to know the dictionary that was used and still use the analysis function in your query.
  2. With a column, your SQL syntax is cleaner and your performance with large indices will be better. The downside is you need to write a trigger to update the processed column anytime there is a change to the original document columns.

Today, for simplicity we chose to just use the index approach. If you do end up using a FTS we recommend you do some reading on the solution that works best for you.

Using the index

If we want to do a full text search we can now do something like this:

select begin_date_time, event_narrative from se_details where to_tsvector('english', event_narrative) @@ to_tsquery('villag');

You will notice in the result set we are getting village and villages. To_tsquery is a basic search parser. This query also allows us to use the :* operator and get the search to do full stemming after the end of the word

select begin_date_time, event_narrative from se_details where to_tsvector('english', event_narrative) @@ to_tsquery('english', 'villa:*');

We can also now look for phrases such as words that appear close together in the document. Let's look for some big hail:

select begin_date_time, event_narrative from se_details where to_tsvector('english', event_narrative) @@ to_tsquery('grapefruit <1> hail');

The <1> operator in this case tells the search to look for the words grapefruit and hail next to each other in the document. As expected this return no results. But if we now change the distance between the words to allow for an intervening word we start to get what we expect:

select begin_date_time, event_narrative from se_details where to_tsvector('english', event_narrative) @@ to_tsquery('grapefruit <2> hail');

The order of the words using the operator is order sensitive. Swapping grapefruit and hail we again get no results:

select begin_date_time, event_narrative from se_details where to_tsvector('english', event_narrative) @@ to_tsquery('hail <2> grapefruit');

You can also use | (OR) and ! (NOT) operators inside the to_tsquery(). Once you start writing more complicated search phrasings you should start to use parentheses to group search together.

The following search will find all documents with grapefruit OR the prefix golf with the word hail two words later.

select begin_date_time, event_narrative from se_details where to_tsvector('english', event_narrative) @@ to_tsquery('(grapefruit | golf:* ) <2> hail');

As you can see, we can do powerful and fast full text searching with FTS in PostgreSQL. Unfortunately, the documentation on this feature is actually quite sparse and difficult to interpret. If you want to learn this syntax you are going to have to dig in with debugging and trying different query techniques to get your desired results. One other helpful document is a presentation by one of the lead developers. Hopefully in the future the PostgreSQL community will update and improve this documentation.