HAProxy balancing algorithm - webserver

I have 4 servers for my search engine. I am using HAProxy for load balancing. Which balancing algorithm is optimal in this case, where request times will be short?

I think it depends on the servers' strength: if one server is more powerful than another, it should have a higher weight than the other. I don't think it depends on the work you're doing.
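To illustrate what weighting does, here is a minimal Python sketch of a weighted rotation. It is not HAProxy's actual scheduler (which interleaves servers more smoothly); it only shows how requests split in proportion to the weights. The server names and weights are hypothetical, assuming srv1 is about twice as powerful as the others. In HAProxy itself, the equivalent knob is the weight parameter on each server line.

```python
from collections import Counter
from itertools import cycle


def weighted_cycle(weights):
    """Return an endless iterator that yields each server name
    as many times per rotation as its integer weight."""
    expanded = [name for name, w in weights.items() for _ in range(w)]
    return cycle(expanded)


# Hypothetical pool: srv1 is assumed to be twice as powerful as the rest.
weights = {"srv1": 2, "srv2": 1, "srv3": 1, "srv4": 1}

picker = weighted_cycle(weights)
counts = Counter(next(picker) for _ in range(1000))
print(counts)
```

With these weights, srv1 receives two out of every five requests and each of the other servers receives one.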

Related

System Design - how to Pick CPU, Memory for an application

I am practicing System Design concepts and I am not clear what configuration (cpu, memory, disk storage) to pick for an application instance? Also, how many instances are needed (assuming you are running your application on Kubernetes cluster)
For back-of-the-envelope calculations, I have seen examples of calculating TPS for read and write calls, bandwidth needs, database storage needs, etc., but I have not seen how to determine CPU and memory needs or how many instances are enough. Is there a procedure for solving this problem?
My hunch says that we pick small to medium sized server instance (if we use cloud provider like AWS) and run stress tests for calculated TPS and see CPU and memory usage and see if we need to increase or decrease server configuration based on results?
I would greatly appreciate any inputs you may have.
I am not clear what configuration (cpu, memory, disk storage) to pick for an application instance? Also, how many instances are needed (assuming you are running your application on Kubernetes cluster)
This is mostly a question about economics. If resources were very cheap, you could use a lot of them, but unfortunately they have an economic cost.
Scale out horizontal or scale up vertical
The first fundamental question to ask is whether you should scale up your app vertically (e.g. to bigger instances) or scale it out horizontally (to more instances).
The most important thing here is that scaling out horizontally is much easier. But whether you can scale out horizontally or have to scale up vertically depends on your app. If your app is a stateless webserver, it is typically very easy to scale out, but if you have a stateful cache or database, scaling up vertically might be your only short-term option. Try to design so that you can scale out horizontally, since that is much easier.
Accurate size - use observability
To find your accurate size, use observability and investigate your bottlenecks and adjust relatively to that.
E.g. if you allocate too little memory, your app will be terminated, and if you allocate too little CPU, your response times will be slow. Just start somewhere and adjust.
In addition to Jonas's answer:
You have two approaches (which are not mutually exclusive):
Estimate your needs based on expected load, etc.
Adjust your needs based on what you observe in production.
Regarding the first approach:
Have you done any analysis into what your expected load is? E.g. how many users (unique sessions), how many requests on average per hour (page views, API calls, etc), potential peaks in activity leading to increased load, etc.
Have you done any benchmarking?
Have you looked at your system and what it does, and worked out if it has any specific resource (CPU, memory, disk, etc) needs?
Estimating resources ahead of time requires some knowledge (or informed guesses) regarding what the load will be, as per the 3 points above. Having an idea of what the daily or hourly request average is isn't a bad place to start.
Also make sure you are aware of any potential spikes that might catch you out (e.g. end of month for financial systems/services). Whether or not these are significant enough to be worth worrying about is another matter. A friend of mine was working on a ticketing system once, and they had massive traffic spikes for major events that did warrant serious scaling out and back, but your average system probably won't need to be that extreme.
CPU is probably only worth "worrying" about if you have anything that does any above average processing - this should be obvious through benchmarking or if you/your team has good knowledge of your code.
Disk usage can be calculated - e.g.
If on average a user generates 1MB of data in a session (not including system logs), and you get 100 sessions a day, then that's 100MB a day, 500MB a working week, roughly 2GB a month (about 20 working days), etc.
If a user profile has on average 200KB of data and 300KB of storage space (images), then each profile needs about 500KB, which you can multiply by the number of users.
You can also do this for records, especially for records that you know are "large" (e.g. >25MB) or where there will be lots of them (e.g. millions).
You can also start to forecast growth over time if you allow a growth rate (e.g. number of users and their sessions, and the amount of data generated). A simple way to do that is to have a spreadsheet with some simple formulas that take various inputs like number of users, average requests per user, disk space per user, etc. You can then do what-if modelling by playing with the inputs.
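If you prefer code to a spreadsheet, the same what-if modelling can be sketched in a few lines of Python. This is only an illustration: the function name, the 20-working-days month, and the growth rate are my own assumptions, and the inputs are the example figures from above (100 sessions a day at 1MB each).

```python
def storage_forecast_mb(sessions_per_day, mb_per_session, days, daily_growth=0.0):
    """Total MB generated over `days` days.

    `daily_growth` is the assumed fractional growth in sessions per day
    (0.01 means 1% more sessions each day than the day before).
    """
    total = 0.0
    sessions = float(sessions_per_day)
    for _ in range(days):
        total += sessions * mb_per_session
        sessions *= 1 + daily_growth
    return total


week = storage_forecast_mb(100, 1, days=5)    # one working week
month = storage_forecast_mb(100, 1, days=20)  # assuming ~20 working days
growing = storage_forecast_mb(100, 1, days=20, daily_growth=0.01)

print(f"week: {week:.0f} MB, month: {month:.0f} MB, month with growth: {growing:.0f} MB")
```

Changing the inputs (sessions, size per session, growth rate) plays the same role as editing the cells in the spreadsheet.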
In terms of the second approach: as Jonas says, observe and adjust. Make sure you know how to do that, and that your solution provides the data you need. This might mean using metrics provided by your cloud provider (if applicable) or instrumentation/reporting you have custom-built into your solution.
Scaling-Up is probably more relevant in scenarios where you have a central point/resource that cannot be scaled-out, like a central database.

Set cpu requests in K8s for fluctuating load

I have a service deployed in Kubernetes and I am trying to optimize the requested cpu resources.
For now, I have deployed 10 instances and set spec.containers[].resources.limits.cpu to 0.1, based on the "average" use. However, it became obvious that this average is rather useless in practice because under constant load, CPU usage rises significantly (to 0.3-0.4 as far as I can tell).
What happens consequently, when multiple instances are deployed on the same node, is that this node is heavily overloaded; pods are no longer responsive, are killed and restarted etc.
What is the best practice to find a good value? My current best guess is to increase the requested cpu to 0.3 or 0.4; I'm looking at Grafana visualizations and see that the pods on the heavily loaded node(s) converge there under continuous load.
However, how can I know if they would use more load if they could before becoming unresponsive as the node is overloaded?
I'm actually trying to understand how to approach this in general. I would expect an "ideal" service (presuming it is CPU-focused) to use close to 0.0 when there is no load, and close to 1.0 when requests are constantly coming in. With that assumption, should I set the cpu.requests to 1.0, taking a perspective where actual constant usage is assumed?
I have read some Kubernetes best practice guides, but none of them seem to address how to set the actual value for cpu requests in practice in more depth than "find an average".
Basically come up with a number that is your lower acceptable bound for how much the process runs. Setting a request of 100m means that you are okay with a lower limit of your process running 0.1 seconds for every 1 second of wall time (roughly). Normally that should be some kind of average utilization, usually something like a P99 or P95 value over several days or weeks. Personally I usually look at a chart of P99, P80, and P50 (median) over 30 days and use that to decide on a value.
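As a rough illustration of reading percentiles off usage data, here is a small Python sketch using the nearest-rank method. The sample values are made up; in practice you would export a much longer series (days or weeks) from your metrics system, which may also compute percentiles slightly differently.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical CPU usage samples for one pod, in cores.
usage = [0.05, 0.08, 0.10, 0.12, 0.30, 0.35, 0.38, 0.40, 0.09, 0.11]

summary = {p: percentile(usage, p) for p in (50, 80, 99)}
for p, value in summary.items():
    print(f"P{p}: {value:.2f} cores")
```

A large gap between P50 and P99, as in this made-up data, is a hint that a plain average would understate what the pod needs under load.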
Limits are a different beast: they set your CPU timeslice quota. This subsystem in Linux has some persistent bugs, so unless you've specifically vetted your kernel as correct, I don't recommend using it for anything but the most hostile of programs.
In a nutshell: Main goal is to understand how much traffic a pod can handle and how much resource it consumes to do so.
CPU limits are hard to understand and can be harmful, so you might want to avoid them; see the static policy documentation and the relevant GitHub issue.
To dimension your CPU requests, you will first want to understand how much a pod can consume during high load. In order to do this you can:
disable all kind of autoscaling (HPA, vertical pod autoscaler, ...)
set the number of replicas to one
lift the CPU limits
request the highest amount of CPU you can on a node (3.2 usually on 4cpu nodes)
send as much traffic as you can on the application (you can achieve simple Load Tests scenarios with locust for example)
You will eventually end up with a ratio clients-or-requests-per-sec/cpu-consumed. You can suppose the relation is linear (this might not be true if your workload complexity is O(n^2) with n the number of clients connected, but this is not the nominal case).
You can then choose the pod resource requests based on the ratio you measured. For example, if you consume 1.2 CPU for 1000 requests per second, you know that you can give each pod 1 CPU and it will handle roughly 800 requests per second (1000 / 1.2 ≈ 833).
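The arithmetic can be sketched as follows, assuming the linear relation mentioned above. The function names and the 4000 requests-per-second target are hypothetical; the 1000 requests per second at 1.2 CPU is the example measurement.

```python
import math


def rps_per_cpu(measured_rps, measured_cpu):
    """Requests per second that one full CPU handled in the load test."""
    return measured_rps / measured_cpu


def replicas_needed(target_rps, cpu_request, ratio):
    """Pods required to serve target_rps if each pod gets cpu_request CPUs,
    assuming throughput scales linearly with CPU."""
    per_pod_rps = cpu_request * ratio
    return math.ceil(target_rps / per_pod_rps)


ratio = rps_per_cpu(1000, 1.2)  # about 833 requests per second per CPU
pods = replicas_needed(4000, cpu_request=1.0, ratio=ratio)

print(f"{ratio:.0f} rps per CPU, {pods} pods for 4000 rps")
```

Rounding the per-pod figure down (to 800 here) leaves a little headroom for the cases where the relation is not perfectly linear.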
Once you know how much a pod can consume under its maximal load, you can start setting up CPU-based autoscaling. 70% is a good first target that can be refined if you encounter issues like latency or pods not autoscaling fast enough. This keeps your nodes from running out of CPU if the load increases.
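To get a feel for what a 70% target does, here is a sketch of the core scaling rule documented for the Horizontal Pod Autoscaler: desired replicas = ceil(current replicas * current metric / target metric). It deliberately leaves out the tolerance band, stabilization windows and min/max replica bounds that the real controller applies, and the numbers are made up. Note that CPU utilization here is measured relative to the pod's CPU request, which is why the request has to be dimensioned first.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization):
    """Core HPA rule, without tolerance, stabilization or replica bounds.
    Utilizations are percentages of the pods' CPU request."""
    return math.ceil(current_replicas * current_utilization / target_utilization)


# Hypothetical situation: 4 pods averaging 90% of their CPU request.
scale_up = desired_replicas(4, current_utilization=90, target_utilization=70)

# Same deployment when load drops to 35% average utilization.
scale_down = desired_replicas(4, current_utilization=35, target_utilization=70)

print(scale_up, scale_down)
```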
There are a few gotchas. For example, a single-threaded application cannot consume more than one CPU. If you give it 1.5 CPUs, it will saturate at 1 CPU, but you won't be able to see that from the metrics, as they will suggest it still has 0.5 CPU of headroom.

Latency penalty for using istio or other sidecar proxy

I am playing with the idea of using Istio for some of its features, but I find it hard to find any reasonable estimates of the latency it adds to every call. 1ms for every service call seems like a lot, especially once there are 10 services involved in a chain, each having its request and response pass through Istio.
Has anybody measured latency penalty for having sidecar proxy introduced?
This post does a benchmark of Istio and Linkerd to compare both service meshes in different aspects like CPU, memory or latency.
Istio provides performance numbers, including latency, here: https://istio.io/docs/concepts/performance-and-scalability/#latency-for-istio-hahahugoshortcode-s2-hbhb
From the page...
The default configuration of Istio 1.1 adds 8ms to the 90th percentile latency of the data plane over the baseline.

How Finagle aperture algorithm chooses "non overlapping" subsets?

I have been reading about Finagle and trying to understand the code to figure out how Aperture's subset choice works.
I have seen that ApertureLeastLoaded has a "useDeterministicOrdering" and an "EndpointFactory" which I guess should be the key points to make the decision of which clients to take in the subset.
While reading the "deterministic subsetting" section of Google SRE's book, I understood that the best way to pick a subset of servers from the client point of view, is to know the total number of clients, and a unique sequential identifier of the current client, that can be used as seed of the subset generator.
In Finagle I can't understand how this process is done (I'm not super familiar with Scala), and the documentation, both on the website and in the code, explains how the aperture paradigm works but is not very clear about how the initial subset is chosen.
I hope somebody can enlighten me.
One of the unique properties of Aperture is that its window is sized dynamically based on a client's offered load. That is, clients have a built-in controller which can expand or shrink their window at runtime. This property is important as it allows clients to operate more efficiently and better adapt to a changing environment, but it does make it more complex to achieve a uniform load distribution across servers.
To contrast, the subsetting algorithm, as proposed by the Google SRE book, suggests that operators choose a static subset size which allows a uniform load distribution to be calculated analytically but introduces another static configuration that needs to be revisited as a system evolves.
Deterministic Aperture is, to the best of our knowledge, a novel algorithm for achieving a uniform load distribution while maintaining the dynamic properties of the window sizing mentioned above. From a high level, clients construct a topology of their peer cluster (which gives them a sense of ordering and proximity) and then derive a unique per-client permutation of the servers from the topology such that each server is uniformly represented across the permutations.
We are still in the early stages of testing this in production at Twitter, but early results look very promising. After we gather more empirical results, we hope to publish some more detailed content on how the algorithm works and its properties.
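For contrast, the static deterministic subsetting scheme from the Google SRE book, which the question refers to, can be sketched in Python roughly as below. To be clear, this is not Finagle's Deterministic Aperture and not taken from Finagle's code; it is only the fixed-subset-size approach that Aperture is being compared against. The server names and sizes are made up.

```python
import random


def deterministic_subset(backends, client_id, subset_size):
    """Pick a fixed-size subset of backends for one client.

    Clients are grouped into rounds of `subset_count` clients. All clients
    in a round shuffle the backend list with the same seed and then take
    different, non-overlapping slices of it.
    """
    subset_count = len(backends) // subset_size
    round_no = client_id // subset_count

    shuffled = list(backends)
    random.Random(round_no).shuffle(shuffled)  # same seed within a round

    subset_id = client_id % subset_count
    start = subset_id * subset_size
    return shuffled[start:start + subset_size]


backends = [f"server-{i}" for i in range(12)]
subsets = [deterministic_subset(backends, cid, 3) for cid in range(4)]
for cid, subset in enumerate(subsets):
    print(cid, subset)
```

In this example the four clients of the first round split the 12 servers into four disjoint groups of three, which is what makes the load distribution easy to reason about for a fixed subset size.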

Netlogo High performance Computing

Are there any high performance computing facilities available for running NetLogo BehaviorSpace, like R servers?
Thanks.
You can use headless mode to run batches of experiments on a cluster/cloud computing platform. This involves simply running an executable, so it should be compatible with most setups. If you don't have access to a cluster through an institution, I know people use AWS and Google Compute. You probably want an instance with many cores, since that allows a single instance of BehaviorSpace to automatically distribute the runs involved in an experiment across multiple processes. Higher processing power of course helps too. You shouldn't need much memory. The n1-highcpu-16 or n1-standard-16 instance types in Google Compute look pretty ideal to me.
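As a sketch of what launching such a batch might look like, the Python snippet below builds a command line for the netlogo-headless.sh launcher that ships with NetLogo, using its --model, --experiment, --table and --threads options. The model file, experiment name, output path and launcher location are all hypothetical, and the snippet only prints the command rather than running it.

```python
import os
import shlex


def behaviorspace_command(model, experiment, table_out, threads=None,
                          launcher="./netlogo-headless.sh"):
    """Build the argument list for a headless BehaviorSpace run.

    `threads` defaults to the number of cores on the machine.
    """
    threads = threads or os.cpu_count() or 1
    return [
        launcher,
        "--model", model,
        "--experiment", experiment,
        "--table", table_out,
        "--threads", str(threads),
    ]


# Hypothetical model and experiment names, sized for a 16-core instance.
cmd = behaviorspace_command("model.nlogo", "experiment1", "results.csv", threads=16)
print(shlex.join(cmd))
```

To actually run it, you could pass the list to subprocess.run on a machine where NetLogo is installed.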