The Inconspicuous Codis/Redis cluster CrossSlot bug

Background — Spot the Bug!

A few months ago I was given a seemingly simple ticket. A downstream team wanted a simple API with the following logic

Request: Given a list of userIDs

Response: Find the number of conversations by these users with at least 1 unread message

We currently store the data of the unread messages in a redis sorted set with the following schema

╔══════════════════╦═════════════════════════╦════════════════╗
║ Key ║ Value ║ Type ║
╠══════════════════╬═════════════════════════╬════════════════╣
║ user_id{USER_ID} ║ {convo_id1, convo_id2…} ║ Sorted Set (Z) ║
╚══════════════════╩═════════════════════════╩════════════════╝

To answer the query, we can write the following LUA script to retrieve the necessary Data, by inputting the proper key slice into the script

GET_UNREAD_CONV_COUNT_LUA = `
local t = {}
for i, key in ipairs(KEYS) do
table.insert(t, redis.call('ZCARD', key))
end
return t
`

There is currently a bug in the methodology described above. Can you find it?

Hint: Under some circumstances, it will give a wrong answer but under other circumstance, it will work perfectly fine. Review some of the assumptions I had when developing.

create your own GOpher avatar at https://gopherize.me/

In the meantime, let’s discuss about some Distributed Redis Systems:

Codis

Codis is an open-sourced distributed redis system that may be used for storing data that changes frequently and needs to be retrieved quickly & frequently. It is very similar to Redis Clusters but I will use Codis to demonstrate the basics of these distributed redis systems as Codis provides a very pretty front-end to visualise what happens underneath.

Architecture of Codis System. We will only be discussing the basics — Codis Groups & Proxies

1. Single Redis Instance

Starting with the basic building block, I’m sure you are all familiar with a single redis instance. If not, here’s all you need to know:

  • Redis implements a in-memory key-value storage. Every value stored on Redis has a corresponding key to access the data
  • It can perform CRUD (create, read, update, delete) operations on the stored data, either directly or via a LUA Script.

However, having a single Redis to handle all requests have limitations on:

  1. Storage Size
  2. Queries able to handle per second (QPS)

A single instance is able to handle a signifiant amount of load but if we are talking about > 100,000 QPS, we need a way to scale this out.

there is only so much a single server can take ~

2. Improving QPS: Codis-Group

Typically, a normal application will have significantly more read operations than writes. To improve the QPS of these, we can perform Master-Slave Replication by create one or more Slaves that helps the master handle Read-only operations.

give your Redis a BFF, perform master-slave replication

To maintain consistency with the Master, the master Redis will continuously push changes to the data to the Slave to be updated. This mechanism however, does not guarantee strong consistency between the Master and Slave. This was intentionally done to maintain the lowest possible latency for every operation, which is the original design consideration of Redis.

A network of Redis Servers (one Master, many Slaves) make up a single Codis Group.

3. Improving QPS & Storage capacity: Multiple Codis Groups

We are also able to scale this out by having multiple Codis Groups, with each group storing and operating on its own unique set of keys and values.

Teamwork makes the dream work!

Coordination — Codis Proxy

Having multiple Codis groups requires some sort of coordinator or orchestrator to allocate which group manages which keys. That is the job of the Codis Proxy. This is done through a process called Slotting.

Slotting

The Slotting Algorithm maps any key to one of 1024 slots (or buckets) the following way:

slot = CRC16(key) % 1024

Let’s say a proxy calculates the slot of a particular key to be slot # 5

The Codis proxy then finds out which Codis group is managing keys for slot 5 (A Codis group can manage multiple slots)

After that, it forwards the query to the Codis group where it will perform the necessary operations on the group.

So, what caused the bug?

Let’s say you have key1 which is stored in Codis group 1 and key19 stored in group 2.

When you execute the following redis script:

GET_UNREAD_CONV_COUNT_LUA = `
local t = {}
for i, key in ipairs(KEYS) do
table.insert(t, redis.call('ZCARD', key))
end
return t
`

The script will be diverted to the group which holds the first key. I.e. if keys = {key1, key19}, script will be passed to Codis group 1, vice versa.

Codis group 1 executes the script. It manages to find key no problem. But when it tries to get the value of key19, it returns nil as key19 is stored in Codis group 2 and not group 1. Therefore it returns the wrong value of NIL.

127.0.0.1:19000> eval "local t = {}; for i, v in ipairs(KEYS) do table.insert(t, redis.call('zcard', v)) end return t" 2 key1 key191) (integer) 2
2) (integer) 0

You may verify this by switching the order of the keys to see the difference!

127.0.0.1:19000> eval "local t = {}; for i, v in ipairs(KEYS) do table.insert(t, redis.call('zcard', v)) end return t" 2 key19 key11) (integer) 1
2) (integer) 0

Results from Redis Cluster

When we try to run the script on a Redis Cluster, it will return this error:

ERR CROSSSLOT Keys in request don't hash to the same slot

At least it returns an error rather than Codis which provides a wrong result!

How to Fix the Issue?

1. Use Hash Tags

For both Codis and Redis Cluster, if there is a set of {} in the key (i.e. user_id{1234}), the proxy will calculate the slot based on the hash of the string within the bracket.

e.g. slot of user_id{1234}, conversations{1234} & 1234 will always map to the same slot (see https://redis.io/topics/cluster-spec)

Using Hash Tags you can ensure the keys are always mapped to the same slot and hence, same machine

127.0.0.1:19000> eval "local t = {}; for i, v in ipairs(KEYS) do table.insert(t, redis.call('zcard', v)) end return t" 2 key19{some_str} key1{some_str}1) (integer) 1
2) (integer) 2

2. Batch the queries yourself

This method should be used if you are already using hash tags and need to bulk access keys across slots. This is less than ideal since it delves into the distributed mechanism of Codis which should be kept separate from application logic.

What you do is first find the mapping of slots to group (i.e. slot 1 → group 1, slot 2 → group 1, slot 3 → group 2 …). In codis there is a command slotsmapping, for Redis Cluster — Cluster Slots.

127.0.0.1:19000> slotsmapping1) 1) 0
2) 127.0.0.1:6379
3)
4) (nil)
2) 1) 1
2) 127.0.0.1:6379
3)
4) (nil)
...

Next, for every key, find the slot which it maps to. You can either do this by calculating the hash yourself, or asking Codis proxy for it via slotshashkey , Cluster Keyslot for cache cloud.

127.0.0.1:19000> slotshashkey key11) (integer) 80

From these two pieces of data, batch keys based on the group/cluster it is destined to and send them off with the script for each batch.

Afterwards, you will still need to reassemble it in the correct order which came in the request, which is a total headache if it wasn’t already!

Conclusion

Did any of you, having no prior knowledge of this bug, manage to figure it out yourself?

We are so used to thinking linearly and in a single dimension, either because we are biologically engineered that way or we’ve been trained that way, that when it comes to handling multiple data we fumble easily. Perhaps that is why solving distributed/concurrency bugs are the most difficult kinds to solve!

Core Server Software Engineer @ Shopee