The Inconspicuous Codis/Redis Cluster CrossSlot Bug
Background — Spot the Bug!
A few months ago I was given a seemingly simple ticket. A downstream team wanted an API with the following logic:
Request: a list of userIDs
Response: the number of conversations by these users with at least one unread message
We currently store the unread-message data in a Redis sorted set with the following schema:
╔══════════════════╦═════════════════════════╦════════════════╗
║ Key ║ Value ║ Type ║
╠══════════════════╬═════════════════════════╬════════════════╣
║ user_id{USER_ID} ║ {convo_id1, convo_id2…} ║ Sorted Set (Z) ║
╚══════════════════╩═════════════════════════╩════════════════╝
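For illustration, here is how such a set might be populated and counted via redis-cli; the scores (treated here as message timestamps) and the concrete IDs are made up for this example:
127.0.0.1:19000> ZADD user_id{1234} 1650000000 convo_1
(integer) 1
127.0.0.1:19000> ZADD user_id{1234} 1650000300 convo_2
(integer) 1
127.0.0.1:19000> ZCARD user_id{1234}
(integer) 2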
To answer the query, we can write the following Lua script and pass in the relevant slice of keys:
GET_UNREAD_CONV_COUNT_LUA = `
local t = {}
-- For every key passed in, record the size of its sorted set,
-- i.e. the number of conversations with unread messages for that user.
for i, key in ipairs(KEYS) do
    table.insert(t, redis.call('ZCARD', key))
end
return t
`
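For context, here is a minimal sketch of how this script might be invoked from application code, assuming the go-redis v9 client. The client library, proxy address and example keys are assumptions for illustration, not the original service code:
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

// For each key in KEYS, the script returns the size of that key's sorted set.
const GET_UNREAD_CONV_COUNT_LUA = `
local t = {}
for i, key in ipairs(KEYS) do
    table.insert(t, redis.call('ZCARD', key))
end
return t
`

func main() {
	ctx := context.Background()
	// Assumed address of the Codis proxy (or any Redis endpoint).
	rdb := redis.NewClient(&redis.Options{Addr: "127.0.0.1:19000"})

	// The KEYS slice is built from the userIDs in the request.
	keys := []string{"user_id{1234}", "user_id{5678}"}

	// EVAL sends the script together with the key slice.
	res, err := rdb.Eval(ctx, GET_UNREAD_CONV_COUNT_LUA, keys).Result()
	if err != nil {
		panic(err)
	}
	fmt.Println(res) // e.g. [2 0]
}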
There is currently a bug in the methodology described above. Can you find it?
Hint: Under some circumstances it will give a wrong answer, but under others it will work perfectly fine. Review some of the assumptions I made while developing it.
In the meantime, let’s discuss some distributed Redis systems:
Codis
Codis is an open-source distributed Redis system that can be used for data that changes frequently and needs to be retrieved quickly and often. It is very similar to Redis Cluster, but I will use Codis to demonstrate the basics of these distributed Redis systems because Codis provides a very pretty front-end for visualising what happens underneath.
1. Single Redis Instance
Starting with the basic building block, I’m sure you are all familiar with a single Redis instance. If not, here’s all you need to know:
- Redis implements an in-memory key-value store. Every value stored on Redis has a corresponding key used to access the data.
- It can perform CRUD (create, read, update, delete) operations on the stored data, either directly or via a Lua script (see the quick example below).
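A quick illustration via redis-cli (the key and value are arbitrary):
127.0.0.1:6379> SET greeting "hello"
OK
127.0.0.1:6379> GET greeting
"hello"
127.0.0.1:6379> DEL greeting
(integer) 1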
However, having a single Redis instance handle all requests has limits on:
- Storage Size
- Queries it can handle per second (QPS)
A single instance can handle a significant amount of load, but if we are talking about > 100,000 QPS, we need a way to scale out.
2. Improving QPS: Codis-Group
Typically, a normal application has significantly more read operations than writes. To improve the QPS of these reads, we can perform master-slave replication by creating one or more slaves that help the master handle read-only operations.
To keep them consistent with the master, the master Redis continuously pushes data changes to the slaves. This mechanism, however, does not guarantee strong consistency between master and slave. That trade-off is intentional, to keep the latency of every operation as low as possible, which is Redis's original design consideration.
A network of Redis servers (one master, many slaves) makes up a single Codis Group.
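As a point of reference, with plain Redis you would attach a replica to a master yourself using the REPLICAOF command (the addresses below are made up); within a Codis Group this wiring is managed for you. On the replica:
127.0.0.1:6380> REPLICAOF 127.0.0.1:6379
OK
127.0.0.1:6380> INFO replication
# Replication
role:slave
master_host:127.0.0.1
master_port:6379
...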
3. Improving QPS & Storage capacity: Multiple Codis Groups
We are also able to scale this out by having multiple Codis Groups, with each group storing and operating on its own unique set of keys and values.
Coordination — Codis Proxy
Having multiple Codis Groups requires some sort of coordinator or orchestrator to decide which group manages which keys. That is the job of the Codis Proxy, and it is done through a process called slotting.
Slotting
The slotting algorithm maps any key to one of 1024 slots (or buckets) in the following way:
slot = CRC16(key) % 1024
Let’s say the proxy calculates the slot of a particular key to be slot #5.
The Codis Proxy then finds out which Codis group manages slot 5 (a single Codis group can manage multiple slots).
After that, it forwards the query to that Codis group, which performs the necessary operations.
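Here is a sketch of that calculation in Go. I am assuming the CRC16-CCITT (XMODEM) checksum that the Redis Cluster spec uses; Redis Cluster applies the same idea with 16,384 slots instead of 1,024:
package main

import "fmt"

// crc16 computes the CRC16-CCITT (XMODEM) checksum (polynomial 0x1021).
func crc16(data []byte) uint16 {
	var crc uint16
	for _, b := range data {
		crc ^= uint16(b) << 8
		for i := 0; i < 8; i++ {
			if crc&0x8000 != 0 {
				crc = (crc << 1) ^ 0x1021
			} else {
				crc <<= 1
			}
		}
	}
	return crc
}

// slot maps a key to one of Codis's 1024 slots (Redis Cluster uses 16384).
func slot(key string) int {
	return int(crc16([]byte(key))) % 1024
}

func main() {
	// Different keys generally land in different slots, and hence possibly
	// on different Codis groups.
	fmt.Println(slot("key1"), slot("key19"))
}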
So, what caused the bug?
Let’s say you have key1 which is stored in Codis group 1 and key19 stored in group 2.
When you execute the following Redis script:
GET_UNREAD_CONV_COUNT_LUA = `
local t = {}
for i, key in ipairs(KEYS) do
    table.insert(t, redis.call('ZCARD', key))
end
return t
`
The script is routed to the group that holds the first key, i.e. if KEYS = {key1, key19}, the script is passed to Codis group 1, and vice versa.
Codis group 1 executes the script. It finds key1 with no problem, but when it tries to read key19, that key does not exist on group 1 (it lives on group 2), so the script happily returns a wrong count of 0 for it.
127.0.0.1:19000> eval "local t = {}; for i, v in ipairs(KEYS) do table.insert(t, redis.call('zcard', v)) end return t" 2 key1 key19
1) (integer) 2
2) (integer) 0
You may verify this by switching the order of the keys to see the difference!
127.0.0.1:19000> eval "local t = {}; for i, v in ipairs(KEYS) do table.insert(t, redis.call('zcard', v)) end return t" 2 key19 key1
1) (integer) 1
2) (integer) 0
Results from Redis Cluster
When we try to run the script on a Redis Cluster, it will return this error:
ERR CROSSSLOT Keys in request don't hash to the same slot
At least Redis Cluster returns an error, unlike Codis, which silently returns a wrong result!
How to Fix the Issue?
1. Use Hash Tags
For both Codis and Redis Cluster, if the key contains a pair of braces {} (e.g. user_id{1234}), the proxy will calculate the slot based only on the hash of the string within the braces.
e.g. user_id{1234}, conversations{1234} and 1234 will always map to the same slot (see https://redis.io/topics/cluster-spec).
Using hash tags, you can ensure that a set of keys is always mapped to the same slot and hence the same machine:
127.0.0.1:19000> eval "local t = {}; for i, v in ipairs(KEYS) do table.insert(t, redis.call('zcard', v)) end return t" 2 key19{some_str} key1{some_str}
1) (integer) 1
2) (integer) 2
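As a rough sketch of the hash-tag rule (per the cluster spec, only the substring between the first '{' and the next '}' is hashed, when it is non-empty), here is what the tag extraction might look like in Go, using the keys from the example above:
package main

import (
	"fmt"
	"strings"
)

// hashTag returns the part of the key that gets hashed: the substring between
// the first '{' and the next '}', if that substring is non-empty. Otherwise
// the whole key is hashed.
func hashTag(key string) string {
	start := strings.Index(key, "{")
	if start == -1 {
		return key
	}
	end := strings.Index(key[start+1:], "}")
	if end <= 0 { // no closing brace, or empty "{}": hash the whole key
		return key
	}
	return key[start+1 : start+1+end]
}

func main() {
	// All three keys hash the same string "1234", so they map to the same slot.
	for _, k := range []string{"user_id{1234}", "conversations{1234}", "1234"} {
		fmt.Printf("%-22s -> hashes %q\n", k, hashTag(k))
	}
}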
2. Batch the queries yourself
This method should be used if you are already using hash tags and still need bulk access to keys across slots. It is less than ideal, since it drags the distributed mechanics of Codis into application logic, which should ideally be kept separate.
First, find the mapping of slots to groups (i.e. slot 1 → group 1, slot 2 → group 1, slot 3 → group 2 …). In Codis there is a slotsmapping command; for Redis Cluster, use CLUSTER SLOTS.
127.0.0.1:19000> slotsmapping
1) 1) 0
   2) 127.0.0.1:6379
   3)
   4) (nil)
2) 1) 1
   2) 127.0.0.1:6379
   3)
   4) (nil)
...
Next, for every key, find the slot it maps to. You can either calculate the hash yourself, or ask the Codis proxy for it via slotshashkey (CLUSTER KEYSLOT on Redis Cluster).
127.0.0.1:19000> slotshashkey key1
1) (integer) 80
From these two pieces of data, batch the keys based on the group/cluster each one is destined for, and send the script off once per batch.
Afterwards, you will still need to reassemble the results in the order the keys arrived in the request, which is a total headache if it wasn’t one already!
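Putting it together, here is a rough Go sketch of the batching-and-reassembly flow under the assumptions above. The slot numbers, group assignments and the evalBatch placeholder are made up; in a real implementation you would build the mapping from slotsmapping / CLUSTER SLOTS and send the Lua script to each group's endpoint:
package main

import "fmt"

// keyAt remembers a key together with its position in the original request.
type keyAt struct {
	key   string
	index int
}

// slotOf stands in for "compute the slot yourself" or "ask the proxy via
// slotshashkey / CLUSTER KEYSLOT". The value for key1 (80) matches the
// slotshashkey output above; the value for key19 is made up.
func slotOf(key string) int {
	stub := map[string]int{"key1": 80, "key19": 871}
	return stub[key]
}

// slotToGroup stands in for the slotsmapping / CLUSTER SLOTS result.
// The group assignments here are made up for the sketch.
var slotToGroup = map[int]int{80: 1, 871: 2}

func main() {
	// Keys in the order they arrived in the request.
	keys := []string{"key1", "key19"}

	// 1. Batch keys by the group that owns their slot, remembering each key's
	//    original position so the results can be reassembled afterwards.
	batches := map[int][]keyAt{}
	for i, k := range keys {
		g := slotToGroup[slotOf(k)]
		batches[g] = append(batches[g], keyAt{key: k, index: i})
	}

	// 2. Send the Lua script once per batch. evalBatch is a placeholder: in a
	//    real implementation it would EVAL GET_UNREAD_CONV_COUNT_LUA against
	//    the endpoint serving that group, passing only this batch's keys.
	evalBatch := func(group int, batch []keyAt) []int64 {
		fmt.Printf("sending %d key(s) to group %d\n", len(batch), group)
		return make([]int64, len(batch)) // placeholder zero counts
	}

	// 3. Reassemble the per-batch replies into the original request order.
	results := make([]int64, len(keys))
	for group, batch := range batches {
		replies := evalBatch(group, batch)
		for i, k := range batch {
			results[k.index] = replies[i]
		}
	}
	fmt.Println(results) // counts aligned with the original keys slice
}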
Conclusion
Did any of you, having no prior knowledge of this bug, manage to figure it out yourself?
We are so used to thinking linearly, in a single dimension, either because we are biologically wired that way or because we have been trained that way, that we fumble easily when data lives in more than one place. Perhaps that is why distributed and concurrency bugs are among the most difficult kinds to solve!