Elasticsearch (Usage)

To use Elasticsearch in VMS we use the following gems. These are the official Elasticsearch gems, maintained by the developers of Elasticsearch:

  • elasticsearch-model
  • elasticsearch-rails

Code for both can be found in the elastic/elasticsearch-rails repo. Both gems rely on code in elastic/elasticsearch-ruby.

Adding Models to the index

In order to be able to create and maintain an index for a model in Elasticsearch, the following needs to be added to the model:

  1. Two concerns to include
  2. Mapping of Model to Elasticsearch (settings do {})
  3. Representation of the Model to be indexed, in JSON (#as_indexed_json)

Concerns to include

Simply add these two lines to your model:

include Elasticsearch::Model
include Elasticsearch::Model::Callbacks

Elasticsearch::Model

This adds, among other things, the settings do {} and the as_indexed_json methods to the Model, which are used to configure the model for Elasticsearch.

Elasticsearch::Model::Callbacks

This will add callbacks to your Model so that upon save or update, the Model will be reindexed by Elasticsearch so the changes in the database are reflected in the index of Elasticsearch. For example, if you change the title of a Production, without these callbacks the new title will not be updated in Elasticsearch and thus you will not be able to find the Production when looking for keywords from the updated title.

Mapping the Model

In order to index a document, Elasticsearch needs to know which fields exist in the document and what type each field has. You define this in the settings do {} block. Without explicit mappings, Elasticsearch indexes every field in a document and infers its type automatically (dynamic mapping). We don't use this; instead we declare the fields to index explicitly in the model, by passing the dynamic: false option when creating the mapping in mapping do {}.

If you want to use any field from a document to search, sort or filter in Elasticsearch, you'll need to add a mapping here. Otherwise, Elasticsearch does not know about the field and can't use it in a query. When you just need to have the field present in the document, but don't want to perform any operations on it, you don't need to create a mapping here. In that case, you just add it to #as_indexed_json (see below).

Check all the field types you can use in the Elasticsearch documentation.

Most field types speak for themselves. The only ones I want to highlight are keyword and text (more comprehensive info can be found in the docs linked before):

Type text

This type means the field is regarded as free text and Elasticsearch will try to analyze the text according to the analyzer you choose. This means Elasticsearch will perform operations like stemming, removal of stop words, etc.

Example: say you have a document with one field, title, which has the contents "Look at the jumping dogs". When indexing this as text, with the english analyzer, Elasticsearch might store this document internally as the terms "look", "jump" and "dog", so when you search for any of these terms, the document will be matched. Elasticsearch lowercased the words, removed "at" and "the" as these are stop words, and used the stems of the words "jumping" and "dogs".
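To make the idea concrete, here is a toy approximation of an analyzer in plain Ruby. This is not how the english analyzer is implemented: the stop word list and the suffix stripping are heavily simplified assumptions, only meant to show why a query for "dogs" can match a title containing "dog".

```ruby
# Toy stand-in for an analyzer: lowercase, drop stop words, strip suffixes.
# STOP_WORDS and the suffix rule are simplified assumptions, not the real
# rules of Elasticsearch's english analyzer.
STOP_WORDS = %w[a an at of the].freeze

def toy_stem(word)
  word.sub(/(ing|s)\z/, "")
end

def toy_analyze(text)
  text.downcase
      .scan(/[a-z]+/)
      .reject { |word| STOP_WORDS.include?(word) }
      .map { |word| toy_stem(word) }
end

# A query matches when it shares at least one analyzed term with the text.
def toy_match?(query, text)
  (toy_analyze(query) & toy_analyze(text)).any?
end

toy_analyze("Look at the jumping dogs") # => ["look", "jump", "dog"]
toy_match?("dogs", "Look at the jumping dogs") # => true
```

The important point is that the same analysis is applied to both the indexed text and the query, so both sides end up comparing stems rather than literal words.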

Type keyword

This type means the field is regarded as a literal keyword and used by Elasticsearch as-is. This means no operations like stemming and stop word removal. Documents will only match when the query is exactly the same as the keyword. Usually this type of field is useful for discrete labels, which is why we've used it in the MyChannels context for a field like publication_status.

Note that you can index a field as more than one type (Elasticsearch calls this multi-fields). For example, you could have a title field for which you create both a text and a keyword index.
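As a sketch, this is roughly what a mapping with explicit fields looks like once it reaches Elasticsearch, written out as a plain Ruby hash. In the model it would be expressed through the settings and mapping blocks; the sub-field name raw, the choice of the english analyzer and the selection of fields are assumptions for illustration.

```ruby
require "json"

# Sketch of a mapping body with explicit fields only (dynamic: false).
# The sub-field name "raw" and the analyzer are assumptions.
PRODUCTION_MAPPING = {
  dynamic: false,
  properties: {
    # Analyzed for full-text search, plus a keyword sub-field for exact matches.
    title: {
      type: "text",
      analyzer: "english",
      fields: { raw: { type: "keyword" } }
    },
    # Discrete label, matched as-is.
    publication_status: { type: "keyword" },
    organisation_id: { type: "integer" }
  }
}.freeze

puts JSON.pretty_generate(PRODUCTION_MAPPING)
```

With a mapping like this, full-text queries would target title, while exact matches would target title.raw.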

JSON representation of the model

This is how the document is stored in Elasticsearch. When you perform a query and Elasticsearch returns a document, this is what it will look like. Fields that are in this representation are not necessarily indexed. That only happens when you've created a mapping for them too (see above). If you create a mapping for a new field, you will need to add it here too, but not the other way around.

The values you define here are what the indexer uses to index the document according to the mappings. For example, in our mapping for a Production we've defined an integer field organisation_id. Here we fill that field with the actual organisation_id, retrieved from the Channel of the Show that the Production belongs to.

If you look carefully, you'll see that in Production we use an array of organisation_ids to fill the accessible_organisation_ids mapping. While Elasticsearch does not have an Array datatype, you can have an integer (or string) type and fill this with an array of integers (or strings). This way, Elasticsearch will match a document when the value for organisation_id in a query matches one of the entries in the array of accessible_organisation_ids. This only works when all values in the array are of the same type; you can't mix integers and strings.

Example: a document 'Production' with the field accessible_organisation_ids with a value of [1, 2, 3] will match queries which look for a Production with accessible_organisation_ids = 1, or accessible_organisation_ids = 2, etc.
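A minimal sketch of such a representation, using plain Ruby Structs as stand-ins for the ActiveRecord models so it can run on its own. The Stub classes and the id and title attributes are assumptions; only organisation_id and accessible_organisation_ids come from the description above.

```ruby
# Plain-Ruby stand-ins for the Channel, Show and Production models.
ChannelStub = Struct.new(:organisation_id)
ShowStub = Struct.new(:channel)

ProductionStub = Struct.new(:id, :title, :show, :accessible_organisation_ids) do
  # Plays the role of #as_indexed_json: the hash that is stored as the document.
  def as_indexed_json(_options = {})
    {
      id: id,
      title: title,
      # Looked up through the Show and its Channel.
      organisation_id: show.channel.organisation_id,
      # An array of integers stored in a field mapped as integer.
      accessible_organisation_ids: accessible_organisation_ids
    }
  end
end

show = ShowStub.new(ChannelStub.new(7))
production = ProductionStub.new(1, "A dog", show, [1, 2, 3])
production.as_indexed_json
# => {:id=>1, :title=>"A dog", :organisation_id=>7, :accessible_organisation_ids=>[1, 2, 3]}
# (newer Ruby versions print the same hash as {id: 1, title: "A dog", ...})
```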

Querying Elasticsearch

To execute queries against Elasticsearch, we've made a wrapper, Elasticsearch::QueryBuilder, that offers building blocks to build a valid query (in JSON format) that can be sent to Elasticsearch. On top of that wrapper one could build model-specific query builders to encapsulate all business logic needed to build a query for a specific model, like we did in Elasticsearch::ProductionQueryBuilder for Productions.

You can read more about how queries work in the docs on the Query DSL, or more specifically the docs about Bool Query (the query type we're currently using).

In short: how queries in Elasticsearch work

Elasticsearch consists of documents that you can retrieve by querying the Elasticsearch index. Given a query, Elasticsearch calculates a score for each document based on how well it matches the query. You can influence this behaviour by adding specific index fields to your document and modifying the query you send to Elasticsearch.

When I talk about a query here I don't mean a simple string like "cute dogs", I mean the object that's been built using the Elasticsearch Query DSL.

If you really want to know exactly how all of this works, please read the Elasticsearch documentation carefully or read some introductory texts on information retrieval. Here I will just give a brief overview of the basics.

Elasticsearch calculates a score based on how well a document matches a query. For example, given a query title: "dog", this score can be calculated based on how often the word appears in a document's title, the length of the title (a short title with the word "dog" in it is more relevant than a long title), etc. Each document thus gets a score for this query and the results are ranked by score, highest score first.

In addition to this, we can also use filters. Filters are applied without calculating a score: a document either matches a filter or it doesn't. When it doesn't, it's never included in the search results, even if it matches with one or more search terms.

In our Elasticsearch implementation, we use a few features to build a query with QueryBuilder:

  • add_multi_match_query
  • add_term_query
  • add_filter
  • add_sort_option

We're using Bool queries with a should clause, meaning a document should match at least one of the options we pass it. Combined with one or more filters, however, the should clause also returns documents that match the filters but none of the search terms. To prevent this, we also use minimum_should_match to require at least one matching search term for each query.
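As a sketch of how these building blocks could fit together, here is a hypothetical, minimal builder in plain Ruby. It is not the actual Elasticsearch::QueryBuilder from VMS: the class name, the method signatures and the internals are assumptions. Only the four method names and the use of should, filters and minimum_should_match come from this page.

```ruby
require "json"

# Hypothetical minimal builder; signatures and internals are assumptions.
class SketchQueryBuilder
  def initialize
    @should = []
    @filter = []
    @sort = []
  end

  def add_multi_match_query(term, fields)
    @should << { multi_match: { query: term, fields: fields } }
    self
  end

  def add_term_query(field, value)
    @should << { term: { field => value } }
    self
  end

  def add_filter(field, value)
    @filter << { term: { field => value } }
    self
  end

  def add_sort_option(field, order = "asc")
    @sort << { field => { order: order } }
    self
  end

  # Request body for a search request.
  def to_h
    bool = { filter: @filter }
    unless @should.empty?
      bool[:should] = @should
      # Next to a filter, should clauses are optional unless this is set.
      bool[:minimum_should_match] = 1
    end
    body = { query: { bool: bool } }
    body[:sort] = @sort unless @sort.empty?
    body
  end
end

body = SketchQueryBuilder.new
                         .add_multi_match_query("dog", ["title^2", "description"])
                         .add_filter(:publication_status, "published")
                         .to_h

puts JSON.pretty_generate(body)
```

Returning self from each method lets the calls be chained, and keeping the query as a plain hash until the end makes it easy to inspect in a console or a spec.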

add_multi_match_query

Elasticsearch documentation for Multi Match Query

A multi-match query is a query that looks for a search term (e.g. "dog") in multiple fields of a document. Each document gets a score assigned, based on how well it matches the query according to Elasticsearch.

You can influence the score by assigning different weights to each field, using a caret and a number in the name of the field. For example, ["title^2", "description"] means that Elasticsearch assigns twice as much weight to a match in the document's title as to a match in the document's description. Given these three documents:

{id: 1, title: "A dog", description: "A tale about a dog"}
{id: 2, title: "Tail of the dog", description: "Wag it like a polaroid picture"}
{id: 3, title: "Funny animals", description: "Dogs and cats are funny"}

Elasticsearch will assign the highest score to the document with id 1, the second highest score to document with id 2 and the lowest score to the document with id 3.

We're using this type of query to match documents with the search term in their title and description.
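For reference, the clause itself is small. The sketch below builds it as a plain Ruby hash; the helper name is made up for this example and the default field list simply mirrors the weights used above.

```ruby
require "json"

# Builds a multi_match clause; "title^2" boosts matches in the title.
# The helper name and default fields are assumptions for illustration.
def multi_match_clause(term, fields = ["title^2", "description"])
  { multi_match: { query: term, fields: fields } }
end

puts JSON.generate(multi_match_clause("dog"))
# prints {"multi_match":{"query":"dog","fields":["title^2","description"]}}
```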

add_term_query

Elasticsearch documentation for Term Query

A term query matches a document with an exact match on the specified field. This works for all fields but text fields, as these are analyzed and cannot do an exact match. If you want to do an exact match on a text field, you should add a second keyword index for this field.

The difference between a term query and a filter is that a matching term query influences the score of the document, whereas a filter does not. Also, because we're using the should clause, a matching term query is not required for a document to be considered relevant for the query as a whole.

We're using this type of query to let users find documents based on their ID.
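A short sketch of what such clauses look like as Ruby hashes. The helper names are made up for this example, and title.raw is a hypothetical name for a keyword sub-field of title; it only works if the mapping actually defines such a sub-field.

```ruby
# Term clause on an exact-value field, such as an integer id.
def id_term_clause(id)
  { term: { id: id } }
end

# Exact match on a title goes through a keyword sub-field, because the
# analyzed text field cannot be matched exactly. "title.raw" is hypothetical.
def exact_title_clause(title)
  { term: { "title.raw" => title } }
end

id_term_clause(42)          # => {:term=>{:id=>42}}
exact_title_clause("A dog") # => {:term=>{"title.raw"=>"A dog"}}
# (newer Ruby versions print these as {term: {id: 42}} etc.)
```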

add_filter

Elasticsearch documentation for Filter Context

Filters can be used to exclude documents from the search results, regardless of whether they match any of the search terms. Filters have no influence on the relevancy score of a document. If you use multiple filters in a query, only documents that match all filters are returned.

We use this for the filters in VMS (filter by publication status, filter by organisation, etc.).
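The "match all filters" behaviour can be illustrated with a toy model in plain Ruby. This only mimics the semantics on in-memory hashes and does not talk to Elasticsearch; the sample documents and the "draft" status value are made up.

```ruby
# Toy model of filter context: a document is kept only if it matches every
# filter. Array-valued fields match when any entry equals the filter value.
def passes_filters?(doc, filters)
  filters.all? { |field, value| Array(doc[field]).include?(value) }
end

FILTER_SAMPLE_DOCS = [
  { id: 1, publication_status: "published", accessible_organisation_ids: [1, 2] },
  { id: 2, publication_status: "draft",     accessible_organisation_ids: [1] },
  { id: 3, publication_status: "published", accessible_organisation_ids: [3] }
].freeze

filters = { publication_status: "published", accessible_organisation_ids: 1 }
FILTER_SAMPLE_DOCS.select { |doc| passes_filters?(doc, filters) }.map { |doc| doc[:id] }
# => [1]
```

Document 2 is dropped because of its status and document 3 because organisation 1 is not in its list, no matter how well either would have matched a search term.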

add_sort_option

Elasticsearch documentation for Sort

Elasticsearch returns documents sorted by relevancy automatically. However, if you want, you can sort the results on any other field in the document. Realise, however, that you will lose the relevancy ordering when you do this. If you combine an explicit sort with a search term, don't be surprised if the most relevant documents are not shown first.
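A sketch of a sort clause as it could appear in the request body. The helper and the field name are assumptions; adding "_score" as a second sort key is one way to keep relevancy as a tie-breaker between documents with equal values.

```ruby
# Sort clause for a search request body: first by the given field, then by
# relevancy score. The helper name is an assumption for illustration.
def sort_clause(field, order = "asc")
  [{ field => { order: order } }, "_score"]
end

sort_clause(:publication_status, "desc")
# => [{:publication_status=>{:order=>"desc"}}, "_score"]
# (newer Ruby versions print this as [{publication_status: {order: "desc"}}, "_score"])
```

Keep in mind that sorting works on keyword and numeric fields; an analyzed text field cannot be sorted on by default.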

Example: Rendering an index view through Elasticsearch

The index views for all objects in VMS (Productions, Shows, Channels, etc) have a lot of common functionality: you can search, sort, and filter the results shown in the table.

For all index views but the Production view, we're still relying on the database for these features (using the pg_search gem). For performance reasons, we're using Elasticsearch to search Productions in VMS. Therefore, we've moved all sorting and filtering functionality to Elasticsearch too, as it's complicated to first retrieve documents based on a search query from Elasticsearch and then do all filter, sorting and pagination operations in ActiveRecord.

Theoretically, when no search query is given we could retrieve everything directly from the database instead of Elasticsearch, but then we would have to duplicate a lot of functionality. This way, it works in a straightforward manner: Elasticsearch computes which Productions match your parameters and ActiveRecord does a simple select query with the list of ids it needs to fetch for the page to be rendered.
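The hand-over between the two can be sketched in plain Ruby. The response hash follows the shape of an Elasticsearch search response, while the records array stands in for whatever ActiveRecord returns for the given ids; both helper names are made up for this example.

```ruby
# Extracts the document ids from a search response, in result order.
def ids_from(response)
  response.dig("hits", "hits").map { |hit| hit["_id"].to_i }
end

# A select by id gives no ordering guarantee, so the records are put back
# into the order Elasticsearch returned.
def in_search_order(records, ids)
  by_id = records.map { |record| [record[:id], record] }.to_h
  ids.map { |id| by_id[id] }.compact
end

response = { "hits" => { "hits" => [{ "_id" => "3" }, { "_id" => "1" }] } }
records = [{ id: 1, title: "A dog" }, { id: 3, title: "Funny animals" }]

ids = ids_from(response) # => [3, 1]
in_search_order(records, ids).map { |record| record[:title] }
# => ["Funny animals", "A dog"]
```

The compact call drops ids that are present in the index but no longer in the database, which can happen when the index is slightly out of date.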

Using Elasticsearch in RSpec

In order to use Elasticsearch in tests, a few things should be taken into account.

  1. We're using a special index per model in the test environment
  2. Tests that rely on Elasticsearch need to be tagged with es: true
  3. Always do a refresh_index! after creating all your needed objects in a test

Special index per model

In theory you could load a special instance of Elasticsearch for a test environment. In practice, we've failed to make this work. So we've chosen to use an index with a -test suffix when using Elasticsearch in feature (Capybara) tests. Before and after each test this index is deleted and recreated, somewhat similar to how DatabaseCleaner works in tests.

If you add a new model to Elasticsearch and want to use this in a test, make sure to add the following to the model:

index_name "productions-#{Rails.env}" if Rails.env.test?

Where productions is the name of the actual model.

Tag your tests

In order for the index to be fresh for each test, you need to create and destroy the index for each test. This is done automatically when you tag a test with es: true (or simply :es), like so:

# with es: true
it "searches for productions based on title", es: true do
  # your test here
end

# with :es
it "searches for productions based on title", :es do
  # your test here
end

What these tags do is defined in spec/rails_helper.rb. Also, when adding a new model to the test suite, make sure to add the name of the model to the arrays in the config.around(:each, es: true) do {} block.

Refresh index in your test

When you create an object in your tests using FactoryBot, the Elasticsearch callbacks defined in the model should kick in and index the document in Elasticsearch. However, a newly indexed document only becomes searchable after the index has been refreshed, which Elasticsearch does periodically rather than immediately, so a document you just created may not show up in search results yet. In order to always have an up-to-date index which correctly represents your objects in the database (and thus what you expect to see rendered in the HTML pages in your feature spec), make sure to call refresh_index! after your calls to FactoryBot, but before accessing the page through the Capybara DSL, like so:

create(:production, title: "My Production", show: show)
create(:production, title: "Other Production", show: show)

Production.__elasticsearch__.refresh_index!

login_as admin
visit productions_path
# continue your test
Last Updated: 3/28/2019, 1:25:17 PM