Why you might not need Elasticsearch
This blog post is another in my series "Yeah, Postgres can do that", (check out my other posts on enum types, and history tracking). Postgres is a humble workhorse with powerful features waiting to be unlocked.
If you’re a web developer and you want to add a search bar to your application, your first instinct might be to reach for a standalone service such as Elasticsearch or Solr. These tools are impressive, but they require considerable time to set up and maintain. Your platform engineers will need to manage yet another service, and you will need to write and test the logic to synchronize the search engine with your database. None of this is trivial.
Postgres has the tools to host a search engine inside your database which automatically stays in sync with your underlying data. In this article I’m going to walk through the basics of how Postgres implements its full text search. Then I’ll cover how to integrate it into your Rails app.
For more tips on using Postgres and other tools, join The Gnar community newsletter (and let us know what else you want us to write about by emailing hi@thegnar.co)
Writing Postgres Search Queries
Before we implement full text search in our Rails app, let’s see how Postgres does it under the hood. Let’s imagine that we have a website with blog posts, where the body field contains the contents of the article, and we’re looking for all blog posts related to 'cooking'.
Basic Search Queries
SELECT id, author_id, title, body
FROM blog_posts
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'cooking');
This sort of syntax is likely unfamiliar to you. The tsvector
data type is how Postgres breaks a string of text into tokens that are used for searching. The tsquery
is how it searches against those tokens. The @@
operator returns a boolean. Either the query matches or it doesn’t.
I encourage you to connect to a psql
session (or equivalently rails dbconsole
) and play around with these values yourself.
Try entering this in your psql
terminal:
SELECT to_tsvector('english', 'The complete guide to cooking at home');
This function to_tsvector
takes a string and converts it into a built-in Postgres data type called tsvector
. This is what you’ll get back:
'complet':2 'cook':5 'guid':3 'home':7
The words are broken into tokens, and each token contains a list of numbers. Those are the positions where the original words existed. “complete” is stored as the token “complet” and it’s the second word, hence the :2
. “Cooking” is stored as the token “cook” and is the fifth word, hence the :5
. Common words such as “the”, “to”, and “at” are what are called “stop words”, because they don’t generally help a user find what they’re looking for, so they’re ignored.
We can now search for the term "cooking" within our new ts_vector
type. The following will return true:
SELECT
to_tsvector('english', 'The complete guide to cooking at home')
@@
to_tsquery('english', 'cooks');
Searching for “puppies” will return false:
to_tsvector('english', 'The complete guide to cooking at home')
@@
to_tsquery('english', 'puppies');
This solution alone is more robust than any sort of regular expression-based text matching.
But it only gets us part of the way there. A binary search result of TRUE
/FALSE
isn’t all that useful. What we really want to be able to do is to rank the search results.
Ranking Search Results
To rank search results, Postgres provides the function ts_rank
. It accepts a tsvector
and tsquery
and returns a non-negative floating point number.
SELECT ts_rank(
to_tsvector('english', 'The complete cook''s guide to cooking at home'),
to_tsquery('english', 'cooks')
);
So if we want to order our search results from most relevant to least relevant, we’ll simply order by the ts_rank
from highest to lowest. Let’s go back to our blog posts example:
SELECT id, author_id, title, body
FROM blog_posts
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'cooking')
ORDER BY ts_rank(to_tsvector('english', body), to_tsquery(‘english’, ‘cooking’)) DESC;
This all works, but not quickly. All the data is tokenized and scanned at the time the query is executed, which can have performance implications. Our next enhancement is to use database indexes.
Indexing Search Results
We can add an index to just about any field in Postgres, and we can add an index to any FUNCTION
in Postgres as well.
CREATE INDEX idx_blog_posts_search
ON blog_posts USING GIN
((to_tsvector('english', body));
FYI: The name of the index is arbitrary. Just make sure that it doesn’t collide with anything else you’ve defined in your Postgres database.
GIN Indexes:
GIN index stands for “Generalized Inverted Index”. If you’re wondering how it’s implemented, think about the index at the end of your textbook from school. There’s an alphabetical listing of all the terms. When you find the term you’re looking for, it lists all the pages where that term is mentioned. Because it’s alphabetized, you can find the term relatively quickly, even if the number of possible terms grows large.
The GIN index is implemented using a btree, which is the same data structure used for most regular indexes. The difference is that a standard btree is one to one. Each record has one node in the btree index. Each node in the btree index maps back to a single record. With a GIN index, if the record breaks into 7 tokens, the record will show up in 7 different places on the GIN index, and each node in the GIN index can point back to multiple records.
Indexes on a Function:
The output of any Postgres function can be indexed, and that’s what we do here:
to_tsvector('english', body)
The function above accepts a single field on a table and outputs a vector result.
Multicolumn Searching
Sometimes we may want to search for content in multiple fields. Let’s say we also want to be able to search by the title of the blog post. We’d simply concatenate the title and body fields together. This is done by the ||
operator. Let’s take a look:
SELECT 'hello' || 'world';
This will display 'helloworld'
in your psql
console. Along with using string literals, we can also concatenate two fields together. Let’s go back to our blog_posts
table and search by title in addition to body:
SELECT id, author_id, title, body
FROM blog_posts
WHERE to_tsvector('english', title || ' ' || body) @@ to_tsquery('english', 'cooking')
ORDER BY ts_rank(to_tsvector('english', title || ' ' || body), to_tsquery(‘english’, ‘cooking’)) DESC;
Concatenating with COALESCE
Using the statements above, if any one item being concatenated is NULL
, the entire concatenation will return NULL
. This is a disaster for full text search because it means that a single NULL
value means that the record won’t be found in the search bar. This is where the COALESCE
function comes into play. We have to make sure that we handle every field that could possibly be NULL
:
SELECT id, author_id, title, body
FROM blog_posts
WHERE to_tsvector('english', COALESCE(title, '') || ' ' || COALESCE(body, '')) @@ to_tsquery('english', 'cooking')
ORDER BY ts_rank(to_tsvector('english', COALESCE(title, '') || ' ' || COALESCE(body, '')), to_tsquery(‘english’, ‘cooking’)) DESC;
Now we are covered if any field happens to be NULL
. In this example we’d probably want every blog post to have a title and a body, but imagine that we’re searching for authors based on first, middle and last name, and not all authors have specified their middle names.
If you’re thinking this SQL query is looking unwieldy, don’t worry. We’re about to fix that. The first step is to store to_tsquery(‘english’, ‘cooking’))
in its own variable:
SELECT id, author_id, title, body
FROM blog_posts, to_tsquery('english', 'cooking') as q
WHERE to_tsvector('english', COALESCE(title, '') || ' ' || COALESCE(body, '')) @@ q
ORDER BY ts_rank(to_tsvector('english', COALESCE(title, '') || ' ' || COALESCE(body, '')), q) DESC;
So we’ve gotten rid of one source of duplication. Now let’s see how we can DRY (Don’t Repeat Yourself) up the to_tsvector
function.
Using Postgres Computed Columns
In the previous example we DRY’d up the tsquery
variable. But, we have to concatenate the searchable fields together twice, the first is in the WHERE
clause, the second is in the ORDER BY
clause.
SELECT id, author_id, title, body
FROM blog_posts, to_tsquery('english', 'cooking') as q
WHERE to_tsvector('english', COALESCE(title, '') || ' ' || COALESCE(body, '')) @@ q
ORDER BY ts_rank(to_tsvector('english', COALESCE(title, '') || ' ' || COALESCE(body, '')), q) DESC;
This doesn’t even mention the fact that we need to create a GIN index for this concatenated string:
DROP INDEX IF EXISTS idx_blog_posts_search ON blog_posts;
CREATE INDEX idx_blog_posts_search
ON blog_posts USING GIN
((to_tsvector('english', COALESCE(title, '') || ' ' || COALESCE(body, '')));
So we now have this very complicated, easy-to-make-a-mistake pattern in three separate places. We can fix this with a Postgres generated column.
Generated columns are the Postgres equivalent of formulas in a spreadsheet. If I look at a cell on a spreadsheet, it looks the same to me whether it’s a hard-coded value or a formula. The same thing is true with generated columns. Let’s generate a column to make our blog posts searchable. In a database migration (or your psql
terminal if you’re just experimenting), we'll need to execute the following code:
ALTER TABLE blog_posts ADD COLUMN search_index tsvector
GENERATED ALWAYS AS
(to_tsvector('english', (coalesce(title::text, '') || ' ' || coalesce(body::text, ''))))
STORED;
DROP INDEX IF EXISTS idx_blog_posts_search_index
CREATE INDEX idx_blog_posts_search_index ON blog_posts USING GIN(search_index);
The generated column is called search_index
. (The name is arbitrary). It’s on the table blog_posts
, and it’s of type tsvector
. Then we create a GIN index on that generated column. Now when we search for blog posts, all we have to do is reference that generated column.
SELECT id, author_id, title, body
FROM blog_posts, to_tsquery('english', 'cooking') AS q
WHERE search_index @@ q
ORDER BY ts_rank(search_index, q) DESC;
So now we’ve played with the basics of Postgres full text search. There are other even more advanced topics about search which I’ll mention at the end of this post, but the most important thing to leave you with is how to integrate this into Rails.
Rails Integration:
I don’t know of a perfect gem that will do this, so my preferred method is to use custom scopes on an Active Record model and to use a plain old database migration to manually update the database.
Let’s type into our rails console:
rails generate migration AddSearchIndexToBlogPosts
Now we can open the latest migration file and paste in the following code:
class AddSearchIndexToBlogPosts < ActiveRecord::Migration[6.1]
def change
reversible do |migrate|
migrate.up do
execute <<~SQL
ALTER TABLE blog_posts ADD COLUMN search_index tsvector
GENERATED ALWAYS AS (
to_tsvector('english', (
coalesce(title::text, '')
|| ' ' ||
coalesce(body::text, ''))
)
) STORED;
CREATE INDEX idx_blog_posts_search_index
ON blog_posts USING GIN(search_index);
SQL
end
migrate.down do
execute <<~SQL
DROP INDEX idx_blog_posts_search_index;
ALTER TABLE blog_posts DROP COLUMN search_index;
SQL
end
end
end
end
Now in our Rails model, we can reference this generated, indexed column using a scope. I’ll break down all the components in a second.
class BlogPost < ApplicationRecord
scope :search, -> (query_string) {
to_query = sanitize_sql([
"websearch_to_tsquery('simple', ?)",
query_string
])
self
.where(Arel.sql("search_index @@ #{to_query}"))
.order(Arel.sql("ts_rank(search_index, #{to_query}) DESC"))
}
end
Postgres provides an excellent function websearch_to_tsquery
which allows users to enter queries in roughly the same style that they would in a Google search bar. We still have to make sure that the user’s input is first sanitized for Active Record, which is where sanitize_sql
comes into play. This immunizes us against SQL injections.
Now that we’ve manually sanitized the user input, we can concatenate it worry-free. However, to keep Active Record from fussing at us, Arel.sql
is a way to let Active Record know that we’re manually sanitizing everything and that it’s safe to use.