Limiting the number of times an endpoint of a Kubernetes pod can be accessed?

I have a machine learning model inside a Docker image. I pushed the image to Google Container Registry and then deployed it inside a Kubernetes pod. A FastAPI application runs on port 8000, and its endpoint is public
(call it mymodel:8000).
The structure of the FastAPI app is:
@app.get("/homepage")
async def get_homepage()
@app.get("/model")
async def get_modelpage()
@app.post("/model")
async def get_results(query: str = Form(...))
Users can submit queries and get results from the machine learning model running inside the container. I want to limit the number of queries that can be made by all users combined: if the query limit is 100, all users together can make at most 100 queries in total.
I thought of a way to do this:
Keep a counter in a database of how many times the GET and POST methods have been called. As soon as the total number of POST calls crosses the limit, stop accepting any more queries.
Is there an alternative way of doing this using Kubernetes limits? For example, could I define something like limit_api_calls so that the total number of times mymodel:8000 is accessed is at most limit_api_calls?
I looked at the documentation and could only find settings for CPU, memory and rate limits.

There are several approaches that could satisfy your needs.
Custom implementation: as you mentioned, keep the number of API calls received in a persistence layer and deny requests once the limit has been reached.
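A minimal sketch of that custom counter, using an in-process class as a stand-in for the persistence layer (in a real multi-replica deployment the count would have to live somewhere shared, such as Redis or Postgres, so all pods see the same total); the class and method names are illustrative:

```python
import threading

class QuotaCounter:
    """In-process stand-in for the persistence layer that tracks
    the global number of POST calls against a fixed limit."""

    def __init__(self, limit: int):
        self.limit = limit
        self._count = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        # atomically consume one unit of quota; False once exhausted
        with self._lock:
            if self._count >= self.limit:
                return False
            self._count += 1
            return True
```

In a FastAPI app this could be wired in as a dependency on the POST route that raises an HTTP 429 (or 403) whenever try_acquire() returns False.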
Use a service mesh: Istio (for instance) will let you limit the number of requests received and act as a circuit breaker.
Use an external API manager: Apigee will also let you limit and even charge your users; however, if it is only for internal use (not pay-per-use) I definitely wouldn't recommend it.
The tricky part is what you want to happen after the limit has been reached. If it is just a pod, you may exit the application to finish and clear it up.
Otherwise, if you have a Deployment with its ReplicaSet and several resources associated with it (like ConfigMaps), you probably want some kind of asynchronous alert or polling check to clean up everything related to your Deployment. You may want to take a deep look at orchestrators like Airflow (Cloud Composer) and use tools such as Helm to keep deployments manageable.

Related

FastAPI scale-up of a multi-tenant application

I am trying to understand how to scale up FastAPI in our app. Our application is currently developed like the snippet below, so we don't use async calls. The application is multi-tenant and we expect large requests (~10 MB each).
from fastapi import FastAPI
app = FastAPI()

@app.get("/")
def root():
    # blocking psycopg2 SELECT queries taking 2-3 minutes, or an ML model call
    return {"message": "Hello World"}
While one API call is being processed, another user has to wait before starting their own requests, which is what we don't want. I can increase from 1 worker to 4-6 workers (gunicorn), so then 4-6 users can use the app independently. Does that mean we can handle 4-6x as much load, or less?
We were thinking of changing to async and using an async Postgres driver (asyncio-based) to get more throughput. I assume the database will then become the bottleneck soon? We also did some performance testing, and according to our tests this approach would cut response times in half.
How can we scale up our application further if we want to handle 1000 simultaneous users at peak times? What should we take into consideration?
First of all: does this processing need to be synchronous? I mean, is the user waiting for the response of this processing that takes 2-3 minutes? It is not recommended to have APIs that take that long to respond.
If your user doesn't need to wait until it finishes, you have a few options:
You can use Celery and make this processing async using background tasks. Celery is commonly used for this kind of thing, where you have huge queries or heavy processing that takes a while and can be done asynchronously.
You can also use the background tasks feature from FastAPI, which allows you to run things in the background.
If you do it this way you will be able to scale your application easily. Note that Celery currently doesn't support async, so you would not be able to use async there unless you implement a few tweaks yourself.
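Both options implement the same hand-off pattern: return a job id immediately and do the heavy work out of band. A self-contained sketch of that pattern using a plain thread pool (names illustrative; in practice FastAPI's BackgroundTasks or a Celery task would take the place of the executor):

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)
results = {}  # stand-in for a real result store (database, cache, ...)

def long_job(job_id: str, payload: str) -> None:
    # stand-in for the 2-3 minute query / ML model call
    results[job_id] = payload.upper()

def submit(payload: str) -> str:
    # the endpoint returns a job id right away; the user polls (or is
    # notified) later instead of holding the connection open for minutes
    job_id = str(uuid.uuid4())
    executor.submit(long_job, job_id, payload)
    return job_id
```

The client then fetches the result from a second endpoint keyed by the job id.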
About scaling the number of workers - FastAPI recommends that you use your container structure to manage the number of replicas running, so instead of using gunicorn you could simply scale the number of replicas of your service. If you are not using containers, you can use a gunicorn setup that automatically spins up new workers based on the number of requests you are receiving.
If none of my answers above make sense for you, I'd suggest:
Use the async driver for Postgres so that while it is running and processing your query, FastAPI will be able to receive requests from other users. Note that if your query result is huge, you might need a lot of memory to do what you are describing.
Create some sort of autoscaling based on response time / requests per second so you can scale your application as you receive more requests.
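To illustrate why an async driver helps, a small self-contained sketch in which a sleep stands in for an awaited query (with an async driver such as asyncpg, the `await asyncio.sleep(...)` line would be an `await conn.fetch(...)`). With a blocking driver the ten calls would run back to back (~1 s); overlapped, they complete in roughly the time of one:

```python
import asyncio
import time

async def fake_query(n: int) -> int:
    # stands in for `await conn.fetch(...)` with an async Postgres driver
    await asyncio.sleep(0.1)
    return n

async def run_batch():
    start = time.perf_counter()
    # the ten "queries" overlap instead of running sequentially
    results = await asyncio.gather(*(fake_query(i) for i in range(10)))
    return results, time.perf_counter() - start
```

The same overlap is what lets a single async worker keep accepting new requests while a slow query is in flight.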

Akka Source, is there a way to throttle based on a global rate limit coming from an api call?

There is the throttle function on Source https://doc.akka.io/docs/akka/current/stream/operators/Source-or-Flow/throttle.html but this only works in a local context (one server). If I wanted to share a rate limit (for 3rd-party API calls) with other servers (say I have 2 servers instead of 1 for redundancy), I'd like the rate limit to be spread efficiently across the 2 servers (if one server dies from running out of memory, the other server should pick up the freed-up rate limit until the dead server restarts).
Is this possible somehow through Akka's Source, assuming I have something like Redis returning whether an action is allowed or disallowed, plus the time until an action will be allowed?
Off the top of my head, you can dispense with Redis and use Akka Cluster to deal with failure detection: set up an actor that subscribes to the cluster events (member joined, member left/downed) and updates the local throttle.
Local dynamic throttling can be implemented via a custom graph stage (materializing as a handle through which to change the throttle), or you can do it via an actor (in which case an ask stage is nice). In the latter case, you can go further and have the throttling actors coordinate among themselves to reallocate unused request capacity between nodes.
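The "handle through which to change the throttle" idea is language-agnostic. A rough Python sketch of the semantics (not Akka code): a token bucket whose rate can be re-set at runtime, e.g. when a cluster-membership event reallocates each node's share of the global limit:

```python
import threading
import time

class DynamicThrottle:
    """Token bucket whose rate can be changed at runtime, e.g. when a
    cluster-membership event reallocates the global limit across nodes."""

    def __init__(self, rate_per_sec: float):
        self.rate = rate_per_sec
        self.tokens = rate_per_sec   # start with a full bucket
        self.last = time.monotonic()
        self._lock = threading.Lock()

    def set_rate(self, rate_per_sec: float) -> None:
        # the "handle" other components use to re-tune the throttle
        with self._lock:
            self.rate = rate_per_sec

    def try_acquire(self) -> bool:
        with self._lock:
            now = time.monotonic()
            # refill proportionally to elapsed time, capped at the rate
            self.tokens = min(self.rate,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False
```

When a node goes down, the subscriber actor would call set_rate() on the survivors with their new, larger share.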

Invoking CloudRun endpoint from within itself

Assume there is a Flask web server that has two routes, deployed as a Cloud Run service on GKE.
@app.route('/cpu_intensive', methods=['POST'], endpoint='cpu_intensive')
def cpu_intensive():
    # TODO: some CPU-intensive actions
    ...

@app.route('/batch_request', methods=['POST'], endpoint='batch_request')
def batch_request():
    # TODO: invoke cpu_intensive for each item in the batch
    ...
A "batch_request" is a batch of many identically structured requests; each one is highly CPU-intensive and handled by the function "cpu_intensive". No reasonable machine can handle a large batch, so it needs to be parallelized across multiple replicas.
The deployment is configured that every instance can handle only 1 request at a time, so when multiple requests arrive CloudRun will replicate the instance.
I would like to have a service with these two endpoints, one to accept "batch_requests" and only break them down to smaller requests and another endpoint to actually handle a single "cpu_intensive" request. What is the best way for "batch_request" break down the batch to smaller requests and invoke "cpu_intensive" so that CloudRun will scale the number of instances?
making an HTTP request to localhost - doesn't work, since the load balancer is not aware of these calls.
keep the deployment URL in a conf file and make a network call to it?
Other suggestions?
With more detail, it's now clearer!!
You have 2 responsibilities
One to split -> many requests can be handled in parallel, not compute-intensive.
One to process -> each request must be processed on a dedicated instance because of the compute-intensive work.
If your split performs internal calls (to localhost, for example) you will stay on the same instance and parallelize nothing (you just multi-thread the same request on the same instance).
So, for this, you need 2 services:
one to split, which can accept several concurrent requests
the second to process; this time you need to set the concurrency parameter to 1 to be sure to accept only one request at a time.
To improve your design, and if the batch processing can be asynchronous (I mean, the split process doesn't need to know when the batch process is over), you can add Pub/Sub or Cloud Tasks in the middle to decouple the 2 parts.
And if the processing requires more than 4 CPUs / 4 GB of memory, or takes more than 1 hour, use Cloud Run on GKE and not managed Cloud Run.
Last word: if you don't use Pub/Sub, the best way is to put the batch-process URL in an environment variable of your split service so it knows where to send requests.
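A rough sketch of the split side under those assumptions (the CPU_INTENSIVE_URL env var name and the injectable post hook are illustrative, not part of any API; the point is that every sub-request goes out through the public URL, so Cloud Run sees separate requests and adds replicas):

```python
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# assumed env var holding the public URL of the cpu_intensive service
CPU_INTENSIVE_URL = os.environ.get("CPU_INTENSIVE_URL",
                                   "http://example.invalid/cpu_intensive")

def call_worker(item: bytes, post=None):
    # `post` is injectable for testing; by default do a real HTTP POST
    if post is None:
        def post(url, data):
            with urllib.request.urlopen(url, data=data) as resp:
                return resp.read()
    return post(CPU_INTENSIVE_URL, item)

def batch_request(items, post=None):
    # fan the batch out in parallel; each sub-request goes through the
    # load balancer, which is what triggers Cloud Run autoscaling
    with ThreadPoolExecutor(max_workers=max(1, len(items))) as pool:
        return list(pool.map(lambda it: call_worker(it, post), items))
```

With Pub/Sub in the middle, batch_request would instead publish one message per item and return immediately.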
I believe for this use case it's much better to use GKE rather than Cloud Run. You can create two Kubernetes Deployments, one for the batch_request app and one for the cpu_intensive app. The second one will be used as a worker for the batch_request app and will scale on demand when there are more requests to the batch_request app. I believe this is called a master-worker architecture, in which you separate your app's front end from intensive work or batch jobs.

How to perform multiple HTTP DELETE operation on same Resource with different IDs in JMeter?

I have a question regarding **writing a test for the HTTP DELETE method in JMeter using the Concurrency Thread Group**. I want to measure **how many DELETEs** it can perform in a certain amount of time for a certain number of users (i.e. threads) sending concurrent HTTP DELETE requests.
Concurrency Thread Group parameters are:
Target Concurrency: 50 (Threads)
RampUp Time: 10 secs
RampUp Steps Count: 5
Hold Target Rate Time (sec): 5 secs
Threads Iterations Limit: infinite
The thing is that HTTP DELETE is an idempotent operation, i.e. invoking it repeatedly on the same resource (i.e. record in the database) doesn't make much sense. How can I achieve deletion of multiple EXISTING records in the database by passing the entity's ID in the URL? E.g.:
http://localhost:8080/api/authors/{id}
...where ID is being incremented for each User (i.e. Thread)?
My question is how I can automate the deletion of multiple EXISTING rows in the database (Postgres 11.8). Should I write some sort of script, or is there an easier way to achieve that?
But again, I guess it will probably perform the same operation multiple times on the same resource ID (e.g. HTTP DELETE will be invoked more than once on http://localhost:8080/api/authors/5).
Any help/advice is greatly appreciated.
P.S. I'm doing this to performance-test my Spring Boot, Vert.x and Dropwizard RESTful web service apps.
UPDATE1:
Sorry, I didn't fully specify the reason for writing these test use cases for my web service apps, which communicate with a Postgres DB. The MAIN reason I'm doing this testing is to compare the PERFORMANCE of blocking and NON-blocking web server implementations for the mentioned frameworks (Spring Boot, Dropwizard and Vert.x). The web servers are:
Blocking implementations:
1.1. Apache Tomcat (Spring Boot)
1.2. Jetty (Dropwizard)
Non-blocking: Vert.x (uses its own implementation based on Netty)
If I use JMeter's JDBC Request in my test plan, won't that actually slow down test execution?
The easiest way is using either the Counter config element or the __counter() function in order to generate an incrementing number on each API hit:
More information: How to Use a Counter in a JMeter Test
Also, the list of IDs can be obtained from the Postgres database via a JDBC Request sampler and iterated using the ForEach Controller.
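If you do go the script route, one simple option is to seed the table before each test run so every ID the counter produces points at an existing row. A hypothetical sketch (the authors table and its id/name columns are assumptions about your schema):

```python
def seed_authors_sql(n: int, start_id: int = 1) -> str:
    """Generate INSERT statements so that each ID the JMeter Counter
    will produce refers to an EXISTING row to DELETE; run the output
    through psql before starting the test."""
    return "\n".join(
        f"INSERT INTO authors (id, name) VALUES ({i}, 'author_{i}');"
        for i in range(start_id, start_id + n)
    )
```

Sizing n to Target Concurrency x iterations guarantees no thread DELETEs an already-deleted ID.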

Unique access to Kubernetes resources

For some integration tests we would like to have a way of ensuring that only one test at a time has access to certain resources (e.g. 3 DeploymentConfigurations).
For that to work we have the following workflow:
Before test is started - wait until all DCs are "undeployed".
When test is started - set DC replicas to 1.
When test is stopped - set DC replicas to 0.
This works to some degree, but obviously has the problem that once a test terminates unexpectedly, the DCs might still be in flight.
Now one way to "solve" this would be to introduce a CR with a controller that handles the lifetime of the lock (CR).
Is there any more elegant and straightforward way of allowing unique access to Kubernetes resources?
EDIT:
Sadly we are stuck with Kubernetes 1.9 for now.
Look at the 'kubectl wait' API to set different conditions between the test-flow steps and, depending on the result, proceed to the next test step.
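The core of the CR-plus-controller idea is a lease: a lock with a TTL that expires when a crashed test stops renewing it, so the DCs never stay locked forever. A minimal sketch of those semantics (the now parameter is injectable for testing; in a real controller the state would live in a CR or, on clusters newer than 1.9, a coordination.k8s.io Lease object):

```python
import time

class Lease:
    """A lock with a TTL: if the holding test crashes and stops
    renewing, the lease expires and the next test may take over."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.holder = None
        self.expires = 0.0

    def acquire(self, holder: str, now: float = None) -> bool:
        # calling acquire() again as the current holder renews the lease
        now = time.monotonic() if now is None else now
        if self.holder is None or now >= self.expires or self.holder == holder:
            self.holder = holder
            self.expires = now + self.ttl
            return True
        return False
```

Each test would acquire before scaling the DCs up, renew on a timer while running, and release (or simply stop renewing) when done.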