Skip to main content

10. Design A Rate Limiter

info

设计单机限流器与分布式限流器

Reference: ByteByteGo

In a network system, a rate limiter is used to control the rate of traffic sent by a client or a service. In the HTTP world, a rate limiter limits the number of client requests allowed to be sent over a specified period. If the API request count exceeds the threshold defined by the rate limiter, all the excess calls are blocked. Here are a few examples:

  • A user can write no more than 2 posts per second.
  • You can create a maximum of 10 accounts per day from the same IP address.
  • You can claim rewards no more than 5 times per week from the same device.

In this chapter, you are asked to design a rate limiter. Before starting the design, we first look at the benefits of using an API rate limiter:

  • Prevent resource starvation caused by Denial of Service (DoS) attack [1]. Almost all APIs published by large tech companies enforce some form of rate limiting. For example, Twitter limits the number of tweets to 300 per 3 hours [2]. Google docs APIs have the following default limit: 300 per user per 60 seconds for read requests [3]. A rate limiter prevents DoS attacks, either intentional or unintentional, by blocking the excess calls. 速率限制器通过阻止过多的调用来防止有意或无意的 DoS 攻击。
  • Reduce cost. Limiting excess requests means fewer servers and allocating more resources to high priority APIs. Rate limiting is extremely important for companies that use paid third party APIs. For example, you are charged on a per-call basis for the following external APIs: check credit, make a payment, retrieve health records, etc. Limiting the number of calls is essential to reduce costs. 降低成本,限制过多的请求意味着更少的服务器并将更多的资源分配给高优先级的 API。
  • Prevent servers from being overloaded. To reduce server load, a rate limiter is used to filter out excess requests caused by bots or users’ misbehavior. 防止服务器过载。

Step 1 - Understand the problem and establish design scope

Rate limiting can be implemented using different algorithms, each with its pros and cons. The interactions between an interviewer and a candidate help to clarify the type of rate limiters we are trying to build.

Requirements

Here is a summary of the requirements for the system:

  • Accurately limit excessive requests. 准确地限制过多的请求。
  • Low latency. The rate limiter should not slow down HTTP response time. 低延迟,不会拖慢HTTP响应时间。
  • Use as little memory as possible. 尽可能少的内存。
  • Distributed rate limiting. The rate limiter can be shared across multiple servers or processes. 多个服务器或进程之间共享速率限制器。
  • Exception handling. Show clear exceptions to users when their requests are throttled. 异常处理。
  • High fault tolerance. If there are any problems with the rate limiter (for example, a cache server goes offline), it does not affect the entire system. 高容错性,确保速率限制器在出现问题时不会影响整个系统。

Step 2 - Propose high-level design

Where to put the rate limiter?

Instead of putting a rate limiter at the API servers, we create a rate limiter middleware, which throttles requests to your APIs as shown in below Figure. 使用中间件。

  • Assume our API allows 2 requests per second, and a client sends 3 requests to the server within a second.
  • The first two requests are routed to API servers.
  • However, the rate limiter middleware throttles the third request and returns a HTTP status code 429. The HTTP 429 response status code indicates a user has sent too many requests.

While designing a rate limiter, an important question to ask ourselves is: where should the rater limiter be implemented, on the server-side or in a gateway? There is no absolute answer. It depends on your company’s current technology stack, engineering resources, priorities, goals, etc. Here are a few general guidelines:

  • Identify the rate limiting algorithm that fits your business needs. When you implement everything on the server-side, you have full control of the algorithm. However, your choice might be limited if you use a third-party gateway.
  • If you have already used microservice architecture and included an API gateway in the design to perform authentication, IP whitelisting, etc., you may add a rate limiter to the API gateway.

Algorithms for rate limiting

Rate limiting can be implemented using different algorithms, and each of them has distinct pros and cons. Even though this chapter does not focus on algorithms, understanding them at high-level helps to choose the right algorithm or combination of algorithms to fit our use cases. Here is a list of popular algorithms:

  • Token bucket 令牌桶
  • Leaking bucket 漏桶
  • Fixed window counter 固定窗口计数器
  • Sliding window log
  • Sliding window counter 滑动窗口计数器

Token bucket algorithm 令牌桶算法

The token bucket algorithm is widely used for rate limiting. It is simple, well understood and commonly used by internet companies. Both Amazon [5] and Stripe [6] use this algorithm to throttle their API requests.

The token bucket algorithm work as follows:

  • A token bucket is a container that has pre-defined capacity. Tokens are put in the bucket at preset rates periodically. Once the bucket is full, no more tokens are added. 令牌桶是一个具有预定义容量的容器。令牌定期以预设速率放入桶中。一旦桶满了,就不再添加令牌。
  • Each request consumes one token. When a request arrives, we check if there are enough tokens in the bucket. Figure 5 explains how it works. 每个请求消耗一个令牌。当请求到达时,我们检查桶中是否有足够的令牌。
  • If there are enough tokens, we take one token out for each request, and the request goes through.
  • If there are not enough tokens, the request is dropped. 如果没有足够的令牌,请求将被丢弃。

Figure 6 illustrates how token consumption, refill, and rate limiting logic work. In this example, the token bucket size is 4, and the refill rate is 4 per 1 minute.

The token bucket algorithm takes two parameters:

  • Bucket size: the maximum number of tokens allowed in the bucket. 桶的最大令牌数量。
  • Refill rate: number of tokens put into the bucket every second. 每秒放进桶中的令牌数量。

How many buckets do we need? This varies, and it depends on the rate-limiting rules. Here are a few examples.

  • It is usually necessary to have different buckets for different API endpoints. For instance, if a user is allowed to make 1 post per second, add 150 friends per day, and like 5 posts per second, 3 buckets are required for each user. 通常需要为不同的 API 端点设置不同的桶。

  • If we need to throttle requests based on IP addresses, each IP address requires a bucket. 如果我们需要根据 IP 地址限制请求,每个 IP 地址都需要一个桶。

  • If the system allows a maximum of 10,000 requests per second, it makes sense to have a global bucket shared by all requests.

  • Pros

    • The algorithm is easy to implement.
    • Memory efficient.
    • Token bucket allows a burst of traffic for short periods. A request can go through as long as there are tokens left.
  • Cons

    • Two parameters in the algorithm are bucket size and token refill rate. However, it might be challenging to tune them properly.
  • Challenges

    • 如何更小的降低对流量的折损?
    • Adjective Load Control 自适应限流器: 对下游负载情况进行分析根据反馈指标调节令牌桶平均放入令牌的速率,来做到使用限流减少误限率。

Leaking bucket algorithm 漏桶算法

The leaking bucket algorithm is similar to the token bucket except that requests are processed at a fixed rate. It is usually implemented with a first-in-first-out (FIFO) queue. The algorithm works as follows:

  • When a request arrives, the system checks if the queue is full. If it is not full, the request is added to the queue.
  • Otherwise, the request is dropped. 请求到达时,如果队列已满,则丢弃请求。
  • Requests are pulled from the queue and processed at regular intervals. 请求从队列中拉出并定期处理。

Leaking bucket algorithm takes the following two parameters:

  • Bucket size: it is equal to the queue size. The queue holds the requests to be processed at a fixed rate. 队列大小。队列以固定速率保存要处理的请求。
  • Outflow rate: it defines how many requests can be processed at a fixed rate, usually in seconds. 输出请求的固定速率。

Shopify, an ecommerce company, uses leaky buckets for rate-limiting[7].

  • Pros
    • Memory efficient given the limited queue size.
    • Requests are processed at a fixed rate therefore it is suitable for use cases that a stable outflow rate is needed.
  • Cons
    • A burst of traffic fills up the queue with old requests, and if they are not processed in time, recent requests will be rate limited.
    • There are two parameters in the algorithm. It might not be easy to tune them properly.
  • Challenges
    • 漏桶算法中,出水的速率是恒定的,这就很难应对突发流量在某一毫秒,QPS 可能极高超过剩余容量,大量请求被拒绝或自旋,但下一秒却又恢复。这就会导致 1s 内拒绝掉了更多的请求。
    • 我们有了一个新的技术约束: 应对突发流量,尽可能的减少发生限流时被拒绝请求的比例。

Fixed window counter algorithm 固定窗口计数器

Fixed window counter algorithm works as follows:

  • The algorithm divides the timeline into fix-sized time windows and assign a counter for each window. 该算法将时间线划分为固定大小的时间窗口,并为每个窗口分配一个计数器。
  • Each request increments the counter by one. 每个请求都会将计数器递增 1。
  • Once the counter reaches the pre-defined threshold, new requests are dropped until a new time window starts. 一旦计数器达到预定义的阈值,新请求将被丢弃,直到新的时间窗口开始。

Let us use a concrete example to see how it works. In Figure 8, the time unit is 1 second and the system allows a maximum of 3 requests per second. In each second window, if more than 3 requests are received, extra requests are dropped as shown in Figure 8.

A major problem with this algorithm is that a burst of traffic at the edges of time windows could cause more requests than allowed quota to go through. 该算法的一个主要问题是时间窗口边缘的突发流量可能导致超过允许的配额通过的请求。

In Figure 9, the system allows a maximum of 5 requests per minute, and the available quota resets at the human-friendly round minute. As seen, there are five requests between 2:00:00 and 2:01:00 and five more requests between 2:01:00 and 2:02:00. For the one-minute window between 2:00:30 and 2:01:30, 10 requests go through. That is twice as many as allowed requests.

  • Pros
    • Memory efficient.
    • Easy to understand.
    • Resetting available quota at the end of a unit time window fits certain use cases. 在单位时间窗口结束时重置可用配额适合某些用例。
  • Cons
    • Spike in traffic at the edges of a window could cause more requests than the allowed quota to go through. 窗口边缘的流量激增可能导致通过的请求超过允许的配额。
    • 比如,在 1s 中请求可能是不均匀的,这就导致在 1s 中切换的边界丢失限流状态,如果在此时流量徒增就会导致服务雪崩。

Sliding window log algorithm 滑动窗口日志

As discussed previously, the fixed window counter algorithm has a major issue: it allows more requests to go through at the edges of a window. The sliding window log algorithm fixes the issue. It works as follows:

  • The algorithm keeps track of request timestamps. Timestamp data is usually kept in cache, such as sorted sets of Redis[8]. 该算法跟踪请求时间戳。时间戳数据通常保存在缓存中,例如 Redis 的排序集。
  • When a new request comes in, remove all the outdated timestamps. Outdated timestamps are defined as those older than the start of the current time window. 当有新请求进来时,删除所有过时的时间戳。过时的时间戳定义为早于当前时间窗口开始的时间戳。
  • Add timestamp of the new request to the log. 将新请求的时间戳添加到日志中。
  • If the log size is the same or lower than the allowed count, a request is accepted. Otherwise, it is rejected. 如果日志大小等于或小于允许的计数,则接受请求。否则,它被拒绝。

In this example, the rate limiter allows 2 requests per minute. Usually, Linux timestamps are stored in the log. However, human-readable representation of time is used in our example for better readability.

  • The log is empty when a new request arrives at 1:00:01. Thus, the request is allowed.

  • A new request arrives at 1:00:30, the timestamp 1:00:30 is inserted into the log. After the insertion, the log size is 2, not larger than the allowed count. Thus, the request is allowed. 插入后日志大小为2,不大于允许的计数。因此,请求被允许。

  • A new request arrives at 1:00:50, and the timestamp is inserted into the log. After the insertion, the log size is 3, larger than the allowed size 2. Therefore, this request is rejected even though the timestamp remains in the log. 插入后,日志大小为 3,大于允许的大小 2。因此,即使时间戳保留在日志中,该请求也会被拒绝。

  • A new request arrives at 1:01:40. Requests in the range [1:00:40,1:01:40) are within the latest time frame, but requests sent before 1:00:40 are outdated. Two outdated timestamps, 1:00:01 and 1:00:30, are removed from the log. After the remove operation, the log size becomes 2; therefore, the request is accepted. 新请求到达后,重新计算最近1分钟范围内的请求数量。 在此示例中,日志中的时间戳 1:00:01 和 1:00:30 将被删除。 删除操作后,日志大小变为 2,因此接受请求。

  • Pros

    • Rate limiting implemented by this algorithm is very accurate. In any rolling window, requests will not exceed the rate limit. 该算法实现的速率限制非常准确。在任何滚动窗口中,请求都不会超过速率限制。
  • Cons

    • The algorithm consumes a lot of memory because even if a request is rejected, its timestamp might still be stored in memory. 该算法消耗大量内存,因为即使请求被拒绝,其时间戳仍可能存储在内存中。

Sliding window counter algorithm 滑动窗口计数器

The sliding window counter algorithm is a hybrid approach that combines the fixed window counter and sliding window log. The algorithm can be implemented by two different approaches. 滑动窗口计数器算法是一种结合了固定窗口计数器和滑动窗口日志的混合方法。该算法可以通过两种不同的方法来实现。

Assume the rate limiter allows a maximum of 7 requests per minute, and there are 5 requests in the previous minute and 3 in the current minute. 假设速率限制器每分钟最多允许 7 个请求,前一分钟有 5 个请求,当前分钟有 3 个请求。

For a new request that arrives at a 30% position in the current minute, the number of requests in the rolling window is calculated using the following formula:

  • Requests in current window + requests in the previous window * overlap percentage of the rolling window and previous window. 当前窗口中的请求数 + 上一个窗口中的请求数 * 滚动窗口和上一个窗口的重叠百分比
  • Using this formula, we get 3 + 5 * 0.7% = 6.5 request. Depending on the use case, the number can either be rounded up or down. In our example, it is rounded down to 6. 根据用例,数字可以向上或向下舍入。在我们的示例中,它向下舍入为 6。

Since the rate limiter allows a maximum of 7 requests per minute, the current request can go through. However, the limit will be reached after receiving one more request. 由于速率限制器每分钟最多允许 7 个请求,因此当前请求可以通过。但是,将在收到更多请求后达到限制。

Due to the space limitation, we will not discuss the other implementation here. Interested readers should refer to the reference material [9]. This algorithm is not perfect. It has pros and cons.

  • Pros
    • It smooths out spikes in traffic because the rate is based on the average rate of the previous window. 它消除了流量峰值,因为速率基于前一个窗口的平均速率。
    • Memory efficient.
  • Cons
    • It only works for not-so-strict look back window. It is an approximation of the actual rate because it assumes requests in the previous window are evenly distributed. However, this problem may not be as bad as it seems. According to experiments done by Cloudflare [10], only 0.003% of requests are wrongly allowed or rate limited among 400 million requests. 它仅适用于不太严格的回顾窗口。它是实际速率的近似值,因为它假设前一个窗口中的请求是均匀分布的。

Step 3 - High-level architecture

The basic idea of rate limiting algorithms is simple. At the high-level, we need a counter to keep track of how many requests are sent from the same user, IP address, etc. If the counter is larger than the limit, the request is disallowed. 我们需要一个计数器来跟踪从同一用户、IP 地址等发送的请求数量。如果计数器大于限制,则该请求将被禁止。

Where shall we store counters? Using the database is not a good idea due to slowness of disk access. In-memory cache is chosen because it is fast and supports time-based expiration strategy. For instance, Redis [11] is a popular option to implement rate limiting. It is an in-memory store that offers two commands: INCR and EXPIRE.

  • INCR: It increases the stored counter by 1.
  • EXPIRE: It sets a timeout for the counter. If the timeout expires, the counter is automatically deleted.

  • The client sends a request to rate limiting middleware.
  • Rate limiting middleware fetches the counter from the corresponding bucket in Redis and checks if the limit is reached or not. 速率限制中间件从 Redis 中相应的存储桶中获取计数器,并检查是否达到限制。
  • If the limit is reached, the request is rejected.
  • If the limit is not reached, the request is sent to API servers. Meanwhile, the system increments the counter and saves it back to Redis. 如果未达到限制,请求将发送到 API 服务器。同时,系统增加计数器并将其保存回Redis。

Step 4 - Design deep dive

The high-level design in Figure 12 does not answer the following questions:

  • How are rate limiting rules created? Where are the rules stored?
  • How to handle requests that are rate limited?

In this section, we will first answer the questions regarding rate limiting rules and then go over the strategies to handle rate-limited requests. Finally, we will discuss rate limiting in distributed environment, a detailed design, performance optimization and monitoring.

Rate limiting rules

Lyft open-sourced their rate-limiting component [12]. We will peek inside of the component and look at some examples of rate limiting rules:

domain: messaging
descriptors:
- key: message_type
value: marketing
rate_limit:
unit: day
requests_per_unit: 5

In the above example, the system is configured to allow a maximum of 5 marketing messages per day. 在上面的示例中,系统配置为每天最多允许 5 条营销消息。

Here is another example:

domain: auth
descriptors:
- key: auth_type
value: login
rate_limit:
unit: minute
requests_per_unit: 5

This rule shows that clients are not allowed to login more than 5 times in 1 minute. Rules are generally written in configuration files and saved on disk. 该规则表明客户端在 1 分钟内登录次数不得超过 5 次。规则通常写入配置文件并保存在磁盘上。

Exceeding the rate limit 超过速率限制

In case a request is rate limited, APIs return a HTTP response code 429 (too many requests) to the client. Depending on the use cases, we may enqueue the rate-limited requests to be processed later. For example, if some orders are rate limited due to system overload, we may keep those orders to be processed later. 例如,如果某些订单由于系统过载而受到速率限制,我们可能会保留这些订单以供稍后处理。

Rate limiter headers

The rate limiter returns the following HTTP headers to clients:

X-Ratelimit-Remaining: The remaining number of allowed requests within the window.

X-Ratelimit-Limit: It indicates how many calls the client can make per time window.

X-Ratelimit-Retry-After: The number of seconds to wait until you can make a request again without being throttled.

When a user has sent too many requests, a 429 too many requests error and X-Ratelimit-Retry-After header are returned to the client.

Detailed design

Figure 13 presents a detailed design of the system.

  • Rules are stored on the disk. Workers frequently pull rules from the disk and store them in the cache.
  • When a client sends a request to the server, the request is sent to the rate limiter middleware first.
  • Rate limiter middleware loads rules from the cache. It fetches counters and last request timestamp from Redis cache. Based on the response, the rate limiter decides:
    • if the request is not rate limited, it is forwarded to API servers.
    • if the request is rate limited, the rate limiter returns 429 too many requests error to the client. In the meantime, the request is either dropped or forwarded to the queue. 如果请求受到速率限制,则速率限制器会向客户端返回 429 请求过多错误。与此同时,请求要么被丢弃,要么被转发到队列。

Rate limiter in a distributed environment 分布式中的限流器

Building a rate limiter that works in a single server environment is not difficult. However, scaling the system to support multiple servers and concurrent threads is a different story. There are two challenges:

  • Race condition
  • Synchronization issue

Race condition 竞态条件

As discussed earlier, rate limiter works as follows at the high-level:

  • Read the counter value from Redis.
  • Check if (counter + 1) exceeds the threshold.
  • If not, increment the counter value by 1 in Redis.

Race conditions can happen in a highly concurrent environment as shown in Figure 14.

Assume the counter value in Redis is 3. If two requests concurrently read the counter value before either of them writes the value back, each will increment the counter by one and write it back without checking the other thread. Both requests (threads) believe they have the correct counter value 4. However, the correct counter value should be 5.

Locks are the most obvious solution for solving race condition. However, locks will significantly slow down the system. Two strategies are commonly used to solve the problem: Lua script [13] and sorted sets data structure in Redis [8]. For readers interested in these strategies, refer to the corresponding reference materials [8] [13]. 锁是解决竞争条件最明显的解决方案。然而,锁会显着减慢系统速度。通常使用两种策略来解决该问题:Lua脚本[13]和Redis中的排序集数据结构[8]。

Synchronization issue

Synchronization is another important factor to consider in a distributed environment. To support millions of users, one rate limiter server might not be enough to handle the traffic. When multiple rate limiter servers are used, synchronization is required. For example, on the left side of Figure 15, client 1 sends requests to rate limiter 1, and client 2 sends requests to rate limiter 2. As the web tier is stateless, clients can send requests to a different rate limiter as shown on the right side of Figure 15. If no synchronization happens, rate limiter 1 does not contain any data about client 2. Thus, the rate limiter cannot work properly. 由于 Web 层是无状态的,客户端可以将请求发送到不同的速率限制器。如果没有发生同步,则速率限制器 1 不包含有关客户端 2 的任何数据。因此,速率限制器无法工作适当地。

One possible solution is to use sticky sessions that allow a client to send traffic to the same rate limiter. This solution is not advisable because it is neither scalable nor flexible. A better approach is to use centralized data stores like Redis. The design is shown in Figure 16.

Performance optimization 性能优化

Performance optimization is a common topic in system design interviews. We will cover two areas to improve.

First, multi-data center setup is crucial for a rate limiter because latency is high for users located far away from the data center. Most cloud service providers build many edge server locations around the world. For example, as of 5/20 2020, Cloudflare has 194 geographically distributed edge servers [14]. Traffic is automatically routed to the closest edge server to reduce latency. 首先,多数据中心设置对于速率限制器至关重要,因为远离数据中心的用户的延迟很高。流量会自动路由到最近的边缘服务器以减少延迟。

Monitoring

After the rate limiter is put in place, it is important to gather analytics data to check whether the rate limiter is effective. Primarily, we want to make sure:

  • The rate limiting algorithm is effective.
  • The rate limiting rules are effective.

For example, if rate limiting rules are too strict, many valid requests are dropped. In this case, we want to relax the rules a little bit. In another example, we notice our rate limiter becomes ineffective when there is a sudden increase in traffic like flash sales. In this scenario, we may replace the algorithm to support burst traffic. Token bucket is a good fit here. 例如,如果速率限制规则太严格,则许多有效请求会被丢弃。在这种情况下,我们想稍微放宽规则。在另一个例子中,我们注意到当流量突然增加(例如限时抢购)时,我们的速率限制器会变得无效。在这种场景下,我们可能会更换算法来支持突发流量。令牌桶在这里很合适。

Step 5 - Wrap up

In this chapter, we discussed different algorithms of rate limiting and their pros/cons. Algorithms discussed include:

  • Token bucket 令牌桶
  • Leaking bucket 漏桶
  • Fixed window 固定窗口
  • Sliding window log 滑动窗口日志
  • Sliding window counter 滑动窗口计数器

Then, we discussed the system architecture, rate limiter in a distributed environment, performance optimization and monitoring. Similar to any system design interview questions, there are additional talking points you can mention if time allows:

  • Hard vs soft rate limiting.
    • Hard: The number of requests cannot exceed the threshold.
    • Soft: Requests can exceed the threshold for a short period.
  • Rate limiting at different levels. In this chapter, we only talked about rate limiting at the application level (HTTP: layer 7). It is possible to apply rate limiting at other layers. For example, you can apply rate limiting by IP addresses using Iptables [15] (IP: layer 3). Note: The Open Systems Interconnection model (OSI model) has 7 layers [16]: Layer 1: Physical layer, Layer 2: Data link layer, Layer 3: Network layer, Layer 4: Transport layer, Layer 5: Session layer, Layer 6: Presentation layer, Layer 7: Application layer.
  • Avoid being rate limited. Design your client with best practices:
  • Use client cache to avoid making frequent API calls.
  • Understand the limit and do not send too many requests in a short time frame.
  • Include code to catch exceptions or errors so your client can gracefully recover from exceptions.
  • Add sufficient back off time to retry logic.

Reference materials