Distributed rate limiting - queue

I have multiple servers/workers going through a task queue doing API requests. (Django with Memcached and Celery for the queue) The API requests are limited to 10 requests a second. How can I rate limit it so that the total number of requests (all servers) don't pass the limit?
I've looked through some of the related rate limit questions I'm guessing they are focused on a more linear, non concurrent scenario. What sort of approach should I take?

Have you looked in Rate Limiter from Guava project? They introduced this class in one of the latest releases and it seems to partially satisfy your needs.
Surely it won't calculate rate limit across multiple nodes in distributed environment but what you coud do is to have rate limit configured dynamically based on number of nodes which are are running (ie for 5 nodes you'd have rate limit of 2 API requests a second)

I have been working on an opensource project to solve this exact problem called Limitd. Although I don't have clients for other technologies than node yet, the protocol and the idea are simple.
Your feedback is very welcomed.

I solved that problem unfortunately not for your technology: bandwidth-throttle/token-bucket
If you want to implement it, here's the idea of the implementation:
It's a token bucket algorithm which converts the containing tokens into a timestamp since when it last was completely empty. Every consumption updates this timestamp (locked) so that each process shares the same state.

Related

Are time-related OpenTelemetry metrics an anti-pattern?

When setting up metrics and telemetry for my API, is it an anti-pattern to track something like "request-latency" as a metric (possibly in addition to) tracking it as a span?
For example, say my API makes a request to another API in order to generate a response. If I want to track latency information such as:
My API's response latency
The latency for the request from my API to the upstream API
DB request latency
Etc.
That seems like a good candidate for using a span but I think it would also be helpful to have it as a metric.
Is it a bad practice to duplicate the OTEL data capture (as both a metric and a span)?
I can likely extract that information and avoid duplication, but it might be simpler to log it as a metric as well.
Thanks in advance for your help.
I would say traces and also metrics have own use cases. Traces have usually low retention period (AWS X-Ray: 30 days) + you can generate metrics based on traces for short time period (AWS X-Ray: 24 hours). If you will need longer time period then those queries will be expensive (and slow). So I would say metrics stored in time series DB will be perfect use case for longer time period stats.
BTW: there is also experimental Span Metrics Processor, which you can use to generate Prometheus metrics from the spans directly with OTEL collector - no additional app instrumentation/code.

determine ideal number of workers and EC2 sizing for master

I have a requirement to use locust to simulate 20,000 (and higher) users in a 10 minute test window.
the locustfile is a tasksquence of 9 API calls. I am trying to determine the ideal number of workers, and how many workers should be attached to an EC2 on AWS. My testing shows with 20 workers, on two EC2 instance, the CPU load is minimal. the master however suffers big time. a 4 CPU 16 GB RAM system as the master ends up thrashing to the point that the workers start printing messages like this:
[2020-06-12 19:10:37,312] ip-172-31-10-171.us-east-2.compute.internal/INFO/locust.util.exception_handler: Retry failed after 3 times.
[2020-06-12 19:10:37,312] ip-172-31-10-171.us-east-2.compute.internal/ERROR/locust.runners: RPCError found when sending heartbeat: ZMQ sent failure
[2020-06-12 19:10:37,312] ip-172-31-10-171.us-east-2.compute.internal/INFO/locust.runners: Reset connection to master
the master seems memory exhausted as each locust master process has grown to 12GB virtual RAM. ok - so the EC2 has a problem. But if I need to test 20,000 users, is there a machine big enough on the planet to handle this? or do i need to take a different approach and if so, what is the recommended direction?
In my specific case, one of the steps is to download a file from CloudFront which is randomly selected in one of the tasks. This means the more open connections to cloudFront trying to download a file, the more congested the available network becomes.
Because the app client is actually a native app on a mobile and there are a lot of factors affecting the download speed for each mobile, I decided to to switch from a GET request to a HEAD request. this allows me to test the response time from CloudFront, where the distribution is protected by a Lambda#Edge function which authenticates the user using data from earlier in the test.
Doing this dramatically improved the load test results and doesn't artificially skew the other testing happening as with bandwidth or system resource exhaustion, every other test will be negatively impacted.
Using this approach I successfully executed a 10,000 user test in a ten minute run-time. I used 4 EC2 T2.xlarge instances with 4 workers per T2. The 9 tasks in test plan resulted in almost 750,000 URL calls.
The answer for the question in the title is: "It depends"
Your post is a little confusing. You say you have 10 master processes? Why?
This problem is most likely not related to the master at all, as it does not care about the size of the downloads (which seems to be the only difference between your test case and most other locust tests)
There are some general tips that might help:
Switch to FastHttpUser (https://docs.locust.io/en/stable/increase-performance.html)
Monitor your network usage (if your load gens are already maxing out their bandwidth or CPU then your test is very unrealistic anyway, and adding more users just adds to the noice. In general, start low and work your way up)
Increase the number of loadgens
In general, the number of users is not an issue for locust, but number of requests per second or bandwidth might be.

Server Throughput definition ambiguity

Is throughput the max number of requests a server instance can handle or is it the number of requests that the server instance is currently handling?
Edit: By "currently handling" I mean, the number of requests the server is receiving for a given time interval in recent time. For eg: The server is currently handling 400 reqs every min.
For eg:, I might have a server instance with a lot of hardware which can have high throughput, but I might be only receiving small amount of traffic. What does throughput measure in such a situation. Also, what about the inverse case, i.e if my instance can only handle x requests per min. but is receiving y>>>x requests per min.
If throughput is the max no. of requests a server can handle, how is it measured? Do we do a load/stress test, where we keep increasing the requests per min on the server until it cannot handle them anymore?
No, Throughput is an aggregation that depends on execution time, you can send 1000 requests in the same second and your server won't handle, but when you'll send 1000 requests in an hour and your server will handle it normally.
Throughput is calculated as requests/unit of time. The time is calculated from the start of the first sample to the end of the last sample. This includes any intervals between samples, as it is supposed to represent the load on the server.
The formula is: Throughput = (number of requests) / (total time).
You want to find the number of concurrent users that your server can handle by increasing JMeter threads until server reach his maximum
Throughput is the number of Samplers which JMeter executes within the duration of your test. If you want to see the actual amount of requests which are being sent - consider using i.e. Server Hits Per Second listener (can be installed using JMeter Plugins Manager)
If you see that your server resources consumption doesn't increase as you increase the load in JMeter the reasons are in:
Your application middleware configuration is not suitable for high load (i.e. there is a limit of concurrent sessions and requests are queuing up instead of being processed), check out Web Performance Optimization: Top 3 Server and Client-Side Performance Tips for overall ideas what could be looked at
Your application code doesn't utilize underlying OS resources efficiently, consider using profiler tool to see what's going on under the hood.
JMeter may fail to send requests fast enough, make sure to follow JMeter Best Practices and if JMeter's machine is overloaded - consider going for Distributed Testing

How configure mongodb for support 1100 threads?

How can i configure mongodb's pool connection for support 1100 threads per seconds?
I tried some configurations like bellow without sucess.
connectionsPerHost = 200
threadsAllowedToBlockForConnectionMultiplier = 5
Can someone help me?
Thanks.
It won't.
That number of threads may be prejudicial, there's a lot of techniques to calculate some ideal number, and none of them get even close to 1100. If you're looking to attend a large number of users you should work with server redundancy. You won't get speed because 99.9% (really) of your threads will be locked waiting for a resource become available.
I've worked with java in fast processing, using distributed systems and threads, we used 0mq(tcp alternative) to acelerate communication and get more use of the threads, but we found that moderate number of threads was the ideal (if I remember correctly, 12).
Instead of letting hundreds of threads do the job, try to keep a limited number of workers threads, you won't have more resources anyway. The ideal for this kind of application would be have many servers attending your users.

Is it possible to use a cassandra table as a basic queue

Is it possible to use a table in cassandra as a queue, I don't think the strategy I use in mysql works, ie given this table:
create table message_queue(id integer, message varchar(4000), retries int, sending boolean);
We have a transaction that marks the row as "sending", tries to send, and then either deletes the row, or increments the retries count. The transaction ensures that only one server will be attempting to process an item from the message_queue at any one time.
There is an article on datastax that describes the pitfalls and how to get around it, however Im not sure what the impact of having lots of tombstones lying around is, how long do they stay around for?
Don't do this. Cassandra is a terrible choice as a queue backend unless you are very, very careful. You can read more of the reasons in Jonathan Ellis blog post "Cassandra anti-patterns: Queues and queue-like datasets" (which might be the post you're alluding to). MySQL is also not a great choice for backing a queue, us a real queue product like RabbitMQ, it's great and very easy to use.
The problem with using Cassandra as the storage for a queue is this: every time you delete a message you write a tombstone for that message. Every time you query for the next message Cassandra will have to trawl through those tombstones and deleted messages and try to determine the few that have not been deleted. With any kind of throughput the number of read values versus the number of actual live messages will be hundreds of thousands to one.
Tuning GC grace and other parameters will not help, because that only applies to how long tombstones will hang around after a compaction, and even if you dedicated the CPUs to only run compactions you would still have dead to live rations of tens of thousands or more. And even with a GC grace of zero tombstones will hang around after compactions in some cases.
There are ways to mitigate these effects, and they are outlined in Jonathan's post, but here's a summary (and I don't write this to encourage you to use Cassandra as a queue backend, but because it explains a bit more about Cassandra works, and should help you understand why it's a bad fit for the problem):
To avoid the tombstone problem you cannot keep using the same queue, because it will fill upp with tombstones quicker than compactions can get rid of them and your performance will run straight into a brick wall. If you add a column to the primary key that is deterministic and depends on time you can avoid some of the performance problems, since fewer tombstones have time to build up and Cassandra will be able to completely remove old rows and all their tombstones.
Using a single row per queue also creates a hotspot. A single node will have to handle that queue, and the rest of the nodes will be idle. You might have lots of queues, but chances are that one of them will see much more traffic than the others and that means you get a hotspot. Shard the queues over multiple nodes by adding a second column to the primary key. It can be a hash of the message (for example crc32(message) % 60 would create 60 shards, don't use a too small number). When you want to find the next message you read from all of the shards and pick one of the results, ignoring the others. Ideally you find a way to combine this with something that depends on time, so that you fix that problem too while you're at it.
If you sort your messages after time of arrival (for example with TIMEUUID clustering key) and can somehow keep track of the newest messages that has been delivered, you can do a query to find all messages after that message. That would mean less thrawling through tombstones for Cassandra, but it is no panacea.
Then there's the issue of acknowledgements. I'm not sure if they matter to you, but it looks like you have some kind of locking mechanism in your schema (I'm thinking of the retries and sending columns). This will not work. Until Cassandra 2.0 and it's compare-and-swap features there is no way to make that work correctly. To implement a lock you need to read the value of the column, check if it's not locked, then write that it should now be locked. Even with consistency level ALL another application node can do the same operations at the same time, and both end up thinking that they locked the message. With CAS in Cassandra 2.0 it will be possible to do atomically, but at the cost of performance.
There are a couple of more answers here on StackOverflow about Cassandra and queues, read them (start with this: Table with heavy writes and some reads in Cassandra. Primary key searches taking 30 seconds.
The grace period can be defined. Per default it is 10 days:
gc_grace_seconds¶
(Default: 864000 [10 days]) Specifies the time to wait before garbage
collecting tombstones (deletion markers). The default value allows a
great deal of time for consistency to be achieved prior to deletion.
In many deployments this interval can be reduced, and in a single-node
cluster it can be safely set to zero. When using CLI, use gc_grace
instead of gc_grace_seconds.
Taken from the
documentation
On a different note, I do not think that implementing a queue pattern in Cassandra is very useful. To prevent your worker to process one entry twice, you need to enforce "ALL" read consistency, which defeats the purpose of distributed database systems.
I highly recommend looking at specialized systems like messaging systems which support the queue pattern natively. Take a look at RabbitMQ for instance. You will be up and running in no time.
Theo's answer about not using Cassandra for queues is spot on.
Just wanted to add that we have been using Redis sorted sets for our queues and it has been working pretty well. Some of our queues have tens of millions of elements and are accessed hundreds of times per second.