How we re-architected Announcements
Sendbird’s Announcements API, a premium feature, lets you send announcements to massive groups of users in group channels and track the announcements’ open rate. This post introduces Announcements v2. It describes how we re-architected the feature from v1 to v2, adopted AWS ECS and container technology to scale out batch jobs, and cut costs with an improved scheduling strategy. It also shares our plan to address ordering issues with user-based queues.
The first available version of Announcements (v1) had hard limits on performance and scalability. It ran as a single cron process and was affected by other processes running on the same server instance. As a result, it could neither scale nor adapt to workloads of different sizes.
To overcome this limitation, Announcements v1.5 used AWS ECS to scale announcement workers according to the number of scheduled announcements. We launched one ECS task every minute, and each task was responsible for sending a single announcement. Because v1.5 was our first version built on ECS, it left room for improvement. It relied on scheduled actions triggered by AWS CloudWatch (CW) events, which launched containers at a fixed interval, so an AWS incident on either ECS or CW could affect it. We also spent too much time provisioning tasks: a single ECS task consumed only one announcement, so tasks had to be launched as often as announcements were scheduled. Finally, we had to handle stale announcements manually.
To address these challenges, we released another upgrade, Announcements v1.6, and split the work into two types of tasks. The first is the worker task, which sends messages to target channels and users according to scheduled announcements. The second is the scheduler task, which detects scheduled announcements and launches as many worker tasks as needed. Unlike worker tasks, which perform heavy computation and I/O, a scheduler task needs very little computing power. Rather than launching a heavy ECS task every minute, we save cost by running one lightweight task all day and launching worker tasks only when there are announcements to process. While v1.5 takes n minutes to launch n ECS tasks, v1.6 can launch as many as needed at once.
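The scheduler's decision can be pictured with a small sketch. This is an illustration only, not Sendbird's implementation: the names `workers_needed` and `scheduler_tick`, the `per_worker` and `max_workers` parameters, and the injected `launch_worker` callback (standing in for whatever starts an ECS task) are all assumptions.

```python
import math
from typing import Callable


def workers_needed(due: int, running: int,
                   per_worker: int = 1, max_workers: int = 50) -> int:
    """Return how many additional worker tasks to launch.

    due:         announcements waiting to be sent (hypothetical input)
    running:     worker tasks already running
    per_worker:  announcements one worker is expected to absorb (assumed)
    max_workers: upper bound on concurrent workers (assumed)
    """
    if due <= 0:
        return 0  # nothing scheduled: only the lightweight scheduler keeps running
    target = min(max_workers, math.ceil(due / per_worker))
    return max(0, target - running)


def scheduler_tick(due: int, running: int,
                   launch_worker: Callable[[], None], **limits: int) -> int:
    """One pass of the scheduler: launch all missing workers at once."""
    count = workers_needed(due, running, **limits)
    for _ in range(count):
        launch_worker()  # hypothetical hook, e.g. a call that starts an ECS task
    return count
```

Because all missing workers are requested in a single pass, the launch time no longer grows with the number of announcements, unlike the one-task-per-minute approach of v1.5.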
We also added a job queue between the scheduler and the worker tasks, so that a worker task can process more than one announcement as long as the queue still holds jobs. Launching an ECS Fargate task takes a minute, which covers provisioning, pulling images, and running the container. This saves both time and cost: a v1.6 worker task keeps handling announcements until the job queue is empty, while v1.5 handles only one announcement job per ECS task.
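A minimal sketch of such a worker loop follows, using Python's in-process `queue.Queue` as a stand-in for the real job queue. The function name `run_worker` and the `send` callback are hypothetical; the actual queue technology is not described here.

```python
import queue
from typing import Any, Callable


def run_worker(jobs: "queue.Queue[Any]", send: Callable[[Any], None]) -> int:
    """Drain the job queue, then exit; returns how many announcements were handled."""
    handled = 0
    while True:
        try:
            announcement = jobs.get_nowait()
        except queue.Empty:
            # Queue is empty: let the task stop so it no longer incurs cost.
            return handled
        send(announcement)  # hypothetical: deliver messages for this announcement
        jobs.task_done()
        handled += 1
```

The one-minute startup cost is paid once per worker rather than once per announcement, which is where the saving comes from.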
To keep database load under control, Announcements v1.6 introduced a rate limit that lowers the message sending rate of the Announcements service when the database instance is busy (e.g., high CPU usage). We also reduce the sending rate on retries after database queries are terminated. In addition, we automated the handling of stale announcements left behind by unexpected behavior. As a result, both the frequency of announcement-related issues and the work needed to handle them dropped significantly.
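One way to picture the two throttles is sketched below. The linear scale-down, the 80% CPU threshold, the halving factor, and the function names are illustrative assumptions, not the values or policy used in production.

```python
def throttled_rate(max_rate: float, db_cpu: float,
                   cpu_limit: float = 0.8, min_rate: float = 1.0) -> float:
    """Sending rate (messages/sec) given database CPU usage in the range 0.0-1.0.

    Below cpu_limit the full rate is allowed; above it the rate falls
    linearly toward min_rate at 100% CPU. All thresholds are assumed.
    """
    if db_cpu < cpu_limit:
        return max_rate
    overload = min(1.0, (db_cpu - cpu_limit) / (1.0 - cpu_limit))
    return max(min_rate, max_rate - overload * (max_rate - min_rate))


def damped_rate(rate: float, retries: int,
                factor: float = 0.5, min_rate: float = 1.0) -> float:
    """Reduce the rate on each retry after a terminated database query."""
    return max(min_rate, rate * factor ** retries)
```

Keeping a small non-zero floor (`min_rate`) lets the service continue making progress while the database recovers, instead of stopping entirely.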
Announcements v2 offers a series of improvements, but we can still make the feature better. A few customers want their announcements to be sent in order: if two announcements target the same channel and users, messages from the one scheduled earlier should be sent first. Our current job queue does not take this ordering into account, so a task that picks up an announcement is blocked when an earlier-scheduled announcement targets the same channel. We plan to improve this with a user-based queue that accepts only consumable announcements, that is, ones not blocked by an earlier announcement. Stay tuned … there’s more to come!
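The ordering idea described above could be sketched as follows. Since the user-based queue is still a plan, this is purely speculative: the class name, its methods, and the choice to key the queue by a single target string are assumptions made for illustration.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Optional, Set, Tuple


class OrderedAnnouncementQueue:
    """Hands out only announcements whose target has no earlier one in flight."""

    def __init__(self) -> None:
        self._pending: Dict[str, Deque[str]] = defaultdict(deque)
        self._busy: Set[str] = set()

    def put(self, target: str, announcement_id: str) -> None:
        # Assumes announcements are enqueued in their scheduled order.
        self._pending[target].append(announcement_id)

    def take(self) -> Optional[Tuple[str, str]]:
        """Return a consumable (target, announcement) pair, or None if all are blocked."""
        for target, items in self._pending.items():
            if items and target not in self._busy:
                self._busy.add(target)
                return target, items.popleft()
        return None

    def done(self, target: str) -> None:
        """Mark the in-flight announcement for target as finished."""
        self._busy.discard(target)
```

With this shape, a worker never picks up an announcement it would have to wait on; it either gets work it can start immediately or gets nothing.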