What is Rate Limiting?

Rate limiting is an essential mechanism for ensuring both the stability and security of online applications. It controls, on the client side, the flow of data sent to a server and, on the server side, the incoming traffic, in order to prevent overload and abuse.

In this article, we will cover the following points:

  • The importance of rate limiting
  • Different approaches and available algorithms (Fixed Window, Leaky Bucket, etc.) with Python examples
  • An alternative self-adaptive approach

Why is Rate Limiting / Throttling Important?

Rate limiting (or throttling) is useful, at the very least, whenever a service is expected to be used intensively, whether continuously or in short bursts. It is important to note that any publicly accessible service must be assumed to be used intensively: since there is no control over the clients, it is impossible to guarantee that they will use the service in a “reasonable” manner. Whether excessive usage is intentional or completely unintentional, the outcome remains the same.

That being said, even in the case of a service used within a fully controlled architecture, it can still be challenging to ensure that client usage remains manageable over time. On the client side, it is also crucial to properly handle error recovery, partial service interruptions, and other scenarios that require request retries—situations that can result in an exceptional spike in service usage. In other words, it is always recommended to implement some form of traffic limitation, even a basic one, when developing services and/or their clients.

Regardless of the type of service, a simultaneous massive influx of requests can:

  • Significantly slow down the system
  • Cause service interruptions
  • Make it easier for bots to exploit the service

Rate limiting helps mitigate these risks by enforcing restrictions on the number of requests a user can send within a given timeframe (client-side) or the number of requests that can be processed simultaneously (server-side). It also ensures fair and balanced access to an API or application, preventing resource monopolization and maintaining overall service stability.

Distributed Denial-of-Service (DDoS)

Rate limiting helps mitigate the impact of a Distributed Denial-of-Service (DDoS) attack, which involves overwhelming a server with massive amounts of requests from multiple different IPs. While fully countering such an attack is particularly challenging, rate limiting can help resist it long enough to implement a more targeted and effective defense strategy.

By combining rate limiting with other protection mechanisms such as challenge-response tests (CAPTCHA), IP filtering, or specialized services like Cloudflare, it becomes possible to reduce the impact of these attacks. Some systems also use behavioral approaches that analyze traffic in real time to detect suspicious patterns (e.g., a sudden spike in requests, unusual origins, etc.).

In modern infrastructures, DDoS protection services leverage machine learning algorithms to dynamically adjust filtering rules, enhancing resilience against these types of threats.

Approaches to Rate Limiting / Throttling

These techniques can be implemented at two levels:

Server-Side

The main goal of implementing limits on the server side is to ensure service stability by preventing an excessive number of requests—or, more broadly, incoming traffic—from overwhelming available resources. Every server has a limited amount of memory, threads, and processing power. Without a limitation mechanism, a massive influx of data could quickly degrade performance or even cause a complete outage.

Rate limiting can be applied at different levels of infrastructure:

  • At the operating system level, where restrictions on the number of simultaneous connections per IP address can be enforced.
  • Within middleware and proxies, such as NGINX or API gateways, which can filter requests before they reach the application.
  • Directly within the application, where requests can be tracked per user or API key for more granular control.

One of the most effective strategies is to reject excessive requests as early as possible in the processing chain, minimizing their impact on critical server resources.

Client-Side

Rate limiting or API throttling is not just for servers. On the client side, it is a proactive strategy aimed at:

  • Avoiding being blocked by a service
  • Optimizing network resource usage (reducing bandwidth consumption)

API clients also benefit from rate limiting. Many platforms impose quotas on API calls (e.g., 1,000 requests per hour). Implementing a client-side limitation mechanism helps avoid exceeding these quotas, ensuring smooth and uninterrupted service usage.

Additionally, reducing outbound data requests on the client side helps lower unnecessary server load, improving service responsiveness and overall user experience. This can be seen as a “responsible” approach when using a service shared with other users.

On the other hand, in a less “responsible” context, rate limiting is the best strategy for web scraping—where an automated script collects data from a website. If requests are sent too quickly, the target server may detect abnormal activity and block access. By regulating the request frequency, a client can stay below detection thresholds while still gathering data efficiently.

In such cases, the ideal approach is often to simulate the behavior of a typical user (a sketch follows this list). Depending on the service, this might involve:

  • Sending limited requests over time
  • Using time-based request windows
  • Introducing random delays and long interruptions
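
For example, a scraper might space its requests with randomized, human-like pauses. Here is a minimal sketch; the thresholds are arbitrary assumptions, not recommendations for any specific service:

```python
import random
import time

def polite_delay() -> None:
    """Sleep for a randomized, human-like interval between requests."""
    time.sleep(random.uniform(1.0, 5.0))      # base jitter between requests
    if random.random() < 0.05:                # ~5% of the time, take a long break
        time.sleep(random.uniform(30.0, 120.0))
```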

In these scenarios, rate limiting strategies can be adapted and fine-tuned based on the specific requirements of the target service. The next section will introduce the different rate limiting techniques and how they can be implemented effectively.

Rate Limiting Strategies

There are several approaches to implementing request rate limitation. Below is a brief overview of each method:

Fixed Window Counter

This method allows a fixed number of requests per defined time interval (e.g., 100 requests per minute).

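Below is a minimal, single-process sketch (the class and method names are illustrative, not from a specific library):

```python
import time

class FixedWindowLimiter:
    """Allow at most `capacity` requests per fixed window of `window_seconds`."""

    def __init__(self, capacity: int, window_seconds: float):
        self.capacity = capacity
        self.window_seconds = window_seconds
        self.window_start = time.monotonic()
        self.count = 0

    def allow(self) -> bool:
        now = time.monotonic()
        # Start a new window (and reset the counter) once the current one expires.
        if now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.count = 0
        if self.count < self.capacity:
            self.count += 1
            return True
        return False

limiter = FixedWindowLimiter(capacity=100, window_seconds=60)  # 100 requests/minute
```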

Advantages:

  • Easy to implement and understand.
  • Provides a simple counter that is easy to interpret for each time window.

Disadvantages:

  • Allows sending the maximum capacity of the current window all at once.
  • Can also result in twice the maximum window capacity being sent within a short timeframe (at the end of one window + the beginning of the next).

Sliding Window Log

This variant logs the timestamp of each request and enforces the limit over a rolling time window, ensuring a more even distribution of requests over time.

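A minimal single-process sketch using a timestamp log (illustrative names):

```python
import time
from collections import deque

class SlidingWindowLogLimiter:
    """Allow at most `capacity` requests within any rolling `window_seconds`."""

    def __init__(self, capacity: int, window_seconds: float):
        self.capacity = capacity
        self.window_seconds = window_seconds
        self.log = deque()  # timestamps of previously accepted requests

    def allow(self) -> bool:
        now = time.monotonic()
        # Evict timestamps that have slid out of the rolling window.
        while self.log and now - self.log[0] >= self.window_seconds:
            self.log.popleft()
        if len(self.log) < self.capacity:
            self.log.append(now)
            return True
        return False
```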

Advantages:

  • Easy to implement and understand.
  • Precise, always enforces the desired capacity for a given time window.

Disadvantages:

  • Allows sending the maximum capacity of the current window all at once.
  • Requires storing all timestamps in memory, making it less suitable for high request volumes.

Sliding Window Counter

This approach combines the two previous strategies to eliminate the need for storing all timestamps in memory.

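A sketch of the usual approximation, which weights the previous window's count by how much of it still overlaps the rolling window (illustrative, single-process):

```python
import time

class SlidingWindowCounterLimiter:
    """Approximate a rolling window using only two counters."""

    def __init__(self, capacity: int, window_seconds: float):
        self.capacity = capacity
        self.window_seconds = window_seconds
        self.current_window = int(time.monotonic() // window_seconds)
        self.current_count = 0
        self.previous_count = 0

    def allow(self) -> bool:
        now = time.monotonic()
        window = int(now // self.window_seconds)
        if window != self.current_window:
            # Shift windows; if more than one full window passed, nothing carries over.
            self.previous_count = self.current_count if window == self.current_window + 1 else 0
            self.current_count = 0
            self.current_window = window
        # Fraction of the previous window still covered by the rolling window.
        elapsed = (now % self.window_seconds) / self.window_seconds
        estimated = self.previous_count * (1.0 - elapsed) + self.current_count
        if estimated < self.capacity:
            self.current_count += 1
            return True
        return False
```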

Advantages:

  • Constant memory usage, with accuracy close to the Sliding Window Log (it assumes requests in the previous window were evenly distributed).

Disadvantages:

  • Allows sending the maximum capacity of the current window all at once.

Token Bucket

Rather than enforcing strict limitations, this approach assigns users a pool of tokens, with each request consuming one token. These tokens are replenished over time.

This method is well-suited for systems that require some flexibility (e.g., APIs that allow request bursts before imposing restrictions).

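A minimal sketch with lazy refilling (illustrative names, single-process):

```python
import time

class TokenBucketLimiter:
    """Hold up to `capacity` tokens, refilled at `rate` tokens per second."""

    def __init__(self, capacity: float, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Replenish tokens in proportion to elapsed time, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```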

Advantages:

  • Easy to implement and understand.
  • Helps regulate request flow over short time intervals.

Disadvantages:

  • Does not guarantee a constant transfer rate.

Leaky Bucket

This approach is similar in concept to the Token Bucket method but operates differently. It is fundamentally asynchronous, as requests are placed in a queue and processed at a constant rate. If the queue is full, any additional requests are discarded.

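A minimal asynchronous sketch using a bounded queue drained at a constant rate (illustrative names, assumes a running asyncio event loop):

```python
import asyncio

class LeakyBucketLimiter:
    """Queue requests and process them at a constant rate; reject when full."""

    def __init__(self, queue_size: int, rate: float):
        self.queue: asyncio.Queue = asyncio.Queue(maxsize=queue_size)
        self.interval = 1.0 / rate  # seconds between two processed requests

    def submit(self, request) -> bool:
        try:
            self.queue.put_nowait(request)
            return True
        except asyncio.QueueFull:
            return False  # bucket overflow: the request is discarded

    async def drain(self, handler) -> None:
        # Leak at a (near) constant rate: one request per interval.
        while True:
            request = await self.queue.get()
            await handler(request)
            await asyncio.sleep(self.interval)
```

The `drain` coroutine would typically be started once, for example with `asyncio.create_task(limiter.drain(handler))`.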

Advantages:

  • Processes requests at a constant rate, making it highly predictable.

Disadvantages:

  • More complex to implement.
  • Does not handle request bursts well, as excess requests are quickly rejected.

Variants

Combinations

It is entirely possible to combine some of these techniques. In such cases, it may be necessary to implement a mechanism to cancel token acquisition from a rate limiter.

For example, a system could attempt to acquire a token from multiple rate limiters using different approaches before fully validating a request. Some algorithms are better suited for this than others.
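
As an illustration, a wrapper can try every limiter and roll back on failure. The `cancel()` method here is a hypothetical compensating operation (for a token bucket, it would simply return the token):

```python
class CombinedLimiter:
    """Accept a request only if every underlying limiter accepts it."""

    def __init__(self, *limiters):
        # Each limiter is assumed to expose allow() and a compensating cancel().
        self.limiters = limiters

    def allow(self) -> bool:
        acquired = []
        for limiter in self.limiters:
            if limiter.allow():
                acquired.append(limiter)
            else:
                # Roll back the tokens already taken from the other limiters.
                for taken in acquired:
                    taken.cancel()
                return False
        return True
```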

Limited burst support can also be added to the Leaky Bucket approach by redirecting rejected requests to another method capable of handling them.

Variable Token Allocation

Another potentially simple method—especially for capacity-based approaches—is to adjust the number of tokens granted over time. Instead of acquiring a fixed single token per interval, the system could grant n tokens dynamically (where n could be a floating-point value).

With this technique, it becomes possible to move from a linear request rate to a more complex pattern, such as a sinusoidal or semi-random flow, allowing greater flexibility in traffic shaping.
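
For instance, the fixed refill rate of the Token Bucket sketch above could be replaced by a time-dependent function. The sinusoidal shape and constants below are arbitrary assumptions:

```python
import math
import time

def tokens_per_second(base_rate: float, amplitude: float, period_seconds: float) -> float:
    """Return a refill rate that oscillates sinusoidally around `base_rate`."""
    phase = 2.0 * math.pi * (time.time() % period_seconds) / period_seconds
    return max(0.0, base_rate + amplitude * math.sin(phase))
```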

The Self-Adaptive Concurrent Approach: The Netflix Case

For enforcing external API request limits or restricting incoming requests, the previously mentioned approaches are often sufficient. However, in some cases, it can be beneficial to go beyond these standard methods.

Imagine being able to automatically adjust the number of concurrent requests while maximizing resource utilization. This is exactly what Netflix’s engineering team aimed to achieve with their Java library, available here: Netflix Concurrency Limits.

The concept is relatively simple and is based entirely on request response time:

  • If response time increases, it indicates overload, so the request limit is reduced.
  • If performance remains stable, the limit is gradually increased.

This library actually offers multiple algorithms, allowing you to configure the method and speed of convergence to best suit your needs.
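
As a rough illustration of the underlying idea (not the library's actual code), an AIMD-style loop that adjusts the limit from observed latency might look like this:

```python
class AdaptiveConcurrencyLimit:
    """Adjust a concurrency limit from observed response times (AIMD-style sketch)."""

    def __init__(self, initial_limit: int = 10, max_limit: int = 1000):
        self.limit = initial_limit
        self.max_limit = max_limit
        self.best_latency = None  # lowest latency observed so far

    def on_response(self, latency_seconds: float) -> None:
        if self.best_latency is None or latency_seconds < self.best_latency:
            self.best_latency = latency_seconds
        if latency_seconds > 2.0 * self.best_latency:
            # Latency degraded: assume overload and back off multiplicatively.
            self.limit = max(1, self.limit // 2)
        else:
            # Performance is stable: probe upward additively.
            self.limit = min(self.max_limit, self.limit + 1)
```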

Advantages of This Method:

  • No need for manual testing to find the optimal request limit.
  • Requires no centralization or coordination.
  • No prior knowledge of system topology or architecture is needed.
  • Automatically adapts to topology or infrastructure changes.
  • Simple to implement.

We highly recommend using this approach—both client-side and server-side—if it aligns with your use case.

Conclusion

Rate limiting is a crucial component for ensuring a system’s availability and reliability. Various strategies exist, depending on specific needs. To go further, solutions like NGINX rate limiting, Redis, or specialized middleware can be explored. Implementing an effective rate limiter is essential to protect infrastructures and ensure an optimal user experience.

For internal needs, especially in the era of microservices architectures, consider using alternative solutions or developing a custom solution tailored to your specific requirements.

While the approaches described in this article are relatively simple, they can be more than sufficient for small-scale applications or internal projects. However, they may be too limited for external-facing use cases.