Difficulty: Intermediate
Estimated Time: 15 minutes

In Scylla, unlike Apache Cassandra, both Global and Local Secondary Indexes are implemented using Materialized Views under the hood.

Global Secondary Indexes (also called “Secondary Indexes”) are another mechanism in Scylla which allows efficient searches on non-partition keys by creating an index. Rather than creating an index on the entire partition key, this index is created on specific columns. Each Secondary Index indexes one specific column. In cases where you are using a composite (compound) partition key, a secondary index can index the column. Secondary indexes are transparent to the application. Queries have access to all the columns in the table, and indexes can be added or removed on the fly without changing the application.

Updates therefore can be more efficient with Secondary Indexes than with Materialized Views because only changes to the primary key and indexed column cause an update in the Secondary Index view.

What’s more, the size of an index is proportional to the size of the indexed data. As data in Scylla is distributed across multiple nodes, it’s impractical to store the whole index on a single node, as it limits the size of the index to the capacity of a single node, not the capacity of the entire cluster.

Hence the name Global Secondary Indexes. With global indexing, a Materialized View is created for each index. This Materialized View has the indexed column as a partition key, and it also stores the base table primary key. This means that it’s possible to query by the indexed column. Under the hood, Scylla will query the MV, get the base table primary key, and then fetch the requested column.

Global Secondary indexes provide a further advantage: it’s possible to use the indexed column’s value to find the corresponding index table row in the cluster, so reads are scalable. Note, however, that with this approach, writes are slower than with local indexing (described below) because of the overhead required to keep the indexed view up to date.

Local Secondary Indexes are an enhancement to Global Secondary Indexes, which allow Scylla to optimize workloads where the partition key of the base table and the index are the same key. Like their global counterparts, Scylla’s local indexes are based on Materialized Views. The subtle difference lies in the primary key; local indexes share the base partition key, ensuring that their data will be colocated with base rows. When using a Token Aware Driver, the same node is likely the coordinator, and the query does not require any inter-node communication.

Scylla’s superior performance often makes it acceptable for the user to use advanced but slower features like Materialized Views. This helps to improve the application’s data consistency and speed up its development.

This lab is part of Scylla University. You can learn more about MV in this lesson.

To summarize, Indexing is a useful tool that provides more types of queries on your tables. In principle, columns we wish to be queryable should be declared when the table is created, as part of a table’s primary key. Secondary Indexing is a neat way of making other columns queryable, but it comes with a cost of additional storage space and processing power to maintain the secondary index data coherent with the primary index information.

If you’d like to further investigate what happens when using the above queries with secondary indexes, try turning TRACING on and executing the queries again.

Check out other courses on Scylla University to improve your skills.

You can learn more about these topics in Scylla Documentation: Materialized Views, Local Secondary Indexes, and Global Secondary Indexes. Two additional and useful references are this blog post and this one.

ScyllaDB: Materialized Views and Indexes Lab 2

Step 1 of 3

Base Table

In this step, you'll set up the environment by creating a one-node Scylla cluster. You’ll then connect to the cluster with the CQL Shell, create a keyspace, a table, insert some data into the newly created table, and read the data. Next you'll consider what can be done if you want to query the table by a field that isn't part of the primary key.

Setup and Creating the Base Table - on Terminal #1

Start by creating a Docker container with Scylla. This tutorial was created with version 4.3:

docker run --name scylla-si -d scylladb/scylla:4.3.0 --overprovisioned 1 --smp 1

Wait a minute or so and check the node status:

docker exec -it scylla-si nodetool status

You’ll see that eventually, all the nodes have UN for status. U means up, and N means normal. If you get a message "nodetool: Unable to connect to Scylla API server: java.net.ConnectException: Connection refused (Connection refused)", it means you have to wait a bit more for the node to be up and responding.

Open 3 terminals (#1 for the base table, #2 for global index, #3 for local index

Switch to terminal #1, run the CQL Shell, and create a Keyspace:

docker exec -ti scylla-si /bin/bash

cqlsh

CREATE KEYSPACE restaurant_chain WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };

Keep in mind that SimpleStrategy should not be used in production. Learn more about this in the Replication Factor lesson.

Next, create a Table and insert some data:

USE restaurant_chain;

CREATE TABLE restaurant_chain.menus ( city text, name text, dish_type text, price float, PRIMARY KEY (city, name));

INSERT INTO menus (city, name, price, dish_type) VALUES ('Reykjavik', 'hakarl', 16, 'cold Icelandic starter'); INSERT INTO menus (city, name, price, dish_type) VALUES ('Reykjavik', 'svid', 21, 'hot Icelandic main dish'); INSERT INTO menus (city, name, price, dish_type) VALUES ('Warsaw', 'sour rye soup', 7, 'Polish soup'); INSERT INTO menus (city, name, price, dish_type) VALUES ('Warsaw', 'sorrel soup', 5, 'Polish soup'); INSERT INTO menus (city, name, price, dish_type) VALUES ('Cracow', 'beef tripe soup', 6, 'Polish soup'); INSERT INTO menus (city, name, price, dish_type) VALUES ('Warsaw', 'pork jelly', 8, 'cold Polish starter'); INSERT INTO menus (city, name, price, dish_type) VALUES ('Ho Chi Minh', 'bun mam', 8, 'Vietnamese soup'); INSERT INTO menus (city, name, price, dish_type) VALUES ('Da Lat', 'banh mi', 5, 'Vietnamese breakfast'); INSERT INTO menus (city, name, price, dish_type) VALUES ('Ho Chi Minh', 'goi cuon', 6, 'Vietnamese hot starter');

Now let’s run some queries:

SELECT * from restaurant_chain.menus;

But wait! This is a full table scan. This could have been a problem if we didn’t have very little data. We should always restrict our queries using partition keys.

SELECT * FROM menus where city = 'Warsaw';

If we had a lot of data this would perform very well.

But what if we wanted to query by other fields? Let’s try.

SELECT * from menus where dish_type = 'Polish soup';

Oops!

How about other fields?

SELECT * from menus where city = 'Warsaw' and dish_type = 'Polish soup';

We get the same error.

If we add “ALLOW FILTERING” to the above queries they would work. But just like our first query, because we are querying regular columns it would be a full table scan – VERY INEFFICIENT! Indexes to the rescue!

Let’s take a look at our current schema.

DESC KEYSPACE restaurant_chain;