Kubernetes Service: IPVS load balancing algorithm

As discovered here, there is a new kind of kube-proxy mode, IPVS, which offers many load-balancing algorithms.
The only problem is that I didn't find where those algorithms are specified.
My understanding:
rr: round-robin
-> call the backend pods one after another in a loop
lc: least connection
-> pick the pod with the lowest number of connections and send the message to it. Which kind of connections? Only the ones from this Service?
dh: destination hashing
-> something based on the URL?
sh: source hashing
-> something based on the URL?
sed: shortest expected delay
-> either the backend with the lowest ping, or some logic based on the time a backend took to respond in the past
nq: never queue
-> same as least connection, but refusing messages at some point?
If anyone has the documentation link (not provided on the official page, which also still says IPVS is beta whereas it has been stable since 1.11) or the real algorithm behind each of them, please help.
I tried: a Google search with these terms, plus a lookup in the official documentation.

They are defined in the code
https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/apis/config/types.go#L193
rr round robin : distributes jobs equally amongst the available real servers
lc least connection : assigns more jobs to real servers with fewer active jobs
sh source hashing : assigns jobs to servers through looking up a statically assigned hash table by their source IP addresses
dh destination hashing : assigns jobs to servers through looking up a statically assigned hash table by their destination IP addresses
sed shortest expected delay : assigns an incoming job to the server with the shortest expected delay. The expected delay that the job will experience is (Ci + 1) / Ui if sent to the ith server, in which Ci is the number of jobs on the ith server and Ui is the fixed service rate (weight) of the ith server.
nq never queue : assigns an incoming job to an idle server if there is, instead of waiting for a fast one; if all the servers are busy, it adopts the ShortestExpectedDelay policy to assign the job.
All of those come from the official IPVS documentation: http://www.linuxvirtualserver.org/docs/scheduling.html
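For intuition, here is a rough Python sketch of how a scheduler could pick a backend under lc, sed and nq. It is a simplified model of the rules quoted above, not the actual IPVS kernel code, and the pod names, weights and connection counts are made up:

# Simplified model of IPVS backend selection (illustration only).
# Each entry is a backend pod with a fixed weight (Ui) and a count of
# active connections (Ci).
servers = [
    {"name": "pod-a", "weight": 1, "active": 3},
    {"name": "pod-b", "weight": 2, "active": 5},
    {"name": "pod-c", "weight": 1, "active": 4},
]

def least_connection(servers):
    # lc: pick the server with the fewest active connections.
    return min(servers, key=lambda s: s["active"])

def shortest_expected_delay(servers):
    # sed: pick the server minimising (Ci + 1) / Ui.
    return min(servers, key=lambda s: (s["active"] + 1) / s["weight"])

def never_queue(servers):
    # nq: use an idle server immediately if one exists, otherwise fall back to sed.
    idle = [s for s in servers if s["active"] == 0]
    return idle[0] if idle else shortest_expected_delay(servers)

print(least_connection(servers)["name"])         # pod-a (3 active connections)
print(shortest_expected_delay(servers)["name"])  # pod-b ((5 + 1) / 2 = 3 is the lowest delay)
print(never_queue(servers)["name"])              # pod-b (no idle server, so sed applies)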
Regards

Related

How to configure channels and AMQ for spring-batch-integration where all steps are run as slaves on another cluster member

Followup to Configuration of MessageChannelPartitionHandler for assortment of remote steps
Even though the first question was answered (I think well), I think I'm confused enough that I'm not able to ask the right questions. Please direct me if you can.
Here is a sketch of the architecture I am attempting to build. Today, we have a job that runs a step across the cluster that works. We want to extend the architecture to run n (unbounded and different) jobs with n (unbounded and different) remote steps across the cluster.
I am not confusing job executions and job instances with jobs. We already run multiple job instances across the cluster. We need to be able to run other processes that are scalable in the same way as the one we have already defined.
The source data all comes from databases which are known to the steps. The partitioner defines the range of data for the "where" clause in the source database and puts that in the stepContext. All of the actual work happens in the stepContext. The jobContext simply serves to spawn steps, wait for completion, and provide the control API.
There will be 0 to n jobs running concurrently, with 0 to n steps from however many jobs running on the slave VMs concurrently.
Does each master job (or step?) require its own request and reply channel, and by extension its own OutboundChannelAdapter? Or are the request and reply channels shared?
Does each master job (or step?) require its own aggregator? By implication this means each job (or step) will have its own partition handler (which may be supported by the existing codebase)
The StepLocator on the slave appears to require a single shared replyChannel across all steps, but it appears to me that the MessageChannelPartitionHandler requires a separate reply channel per step.
What I think is unclear (but I can't tell since it's unclear) is how the single reply channel is picked up by the aggregatedReplyChannel and then returned to the correct step. Of course I could be so lost I'm asking the wrong questions.
Thank you in advance
Does each master job (or step?) require its own request and reply channel, and by extension its own OutboundChannelAdapter? Or are the request and reply channels shared?
No, there is no need for that. StepExecutionRequests are identified with a correlation Id that makes it possible to distinguish them.
Does each master job (or step?) require its own aggregator? By implication this means each job (or step) will have its own partition handler (which may be supported by the existing codebase)
That should not be the case, as requests are uniquely identified with a correlation ID (similar to the previous point).
The StepLocator on the slave appears to require a single shared replyChannel across all steps, but it appears to me that the MessageChannelPartitionHandler requires a separate reply channel per step.
The MessageChannelPartitionHandler should be step- or job-scoped, as mentioned in the Javadoc (see the recommendation in the last note). As a side note, there was an issue with messages crossing in a previous version, due to the reply channel being instance based, but it was fixed here.
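To illustrate the correlation-id mechanism mentioned in the answers above, here is a small language-agnostic sketch (Python, with made-up names - this is not the Spring Integration / Spring Batch API): all steps can share one reply channel because every request and reply carries a correlation id, and the aggregator only hands a completed group of replies back to the step that owns that id.

import queue
from collections import defaultdict

# One shared reply channel for every job/step on the master (a plain queue here).
shared_reply_channel = queue.Queue()

def worker(request):
    # A slave processes one partition and replies on the shared channel,
    # echoing the correlation id it was given.
    shared_reply_channel.put({
        "correlation_id": request["correlation_id"],
        "partition": request["partition"],
        "status": "COMPLETED",
    })

def run_partitioned_step(correlation_id, partitions):
    # Master side: fan out one request per partition, all tagged with this
    # step's correlation id.
    for p in partitions:
        worker({"correlation_id": correlation_id, "partition": p})

def aggregate(expected):
    # Group replies from the shared channel by correlation id; a group is
    # complete once all partitions for that id have reported back.
    groups = defaultdict(list)
    completed = {}
    while len(completed) < len(expected):
        reply = shared_reply_channel.get()
        cid = reply["correlation_id"]
        groups[cid].append(reply)
        if len(groups[cid]) == expected[cid]:
            completed[cid] = groups[cid]
    return completed

# Two different jobs/steps share one reply channel without mixing results.
run_partitioned_step("job1.step1", partitions=[0, 1, 2])
run_partitioned_step("job2.stepA", partitions=[0, 1])
results = aggregate({"job1.step1": 3, "job2.stepA": 2})
print({cid: len(replies) for cid, replies in results.items()})
# {'job1.step1': 3, 'job2.stepA': 2}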

Having multiple sockets for same Context() and same port in ZMQ

My current system takes an input stream from cameras, each camera in a separate instance, applies Computer Vision models to each camera (Object Detection, Object Tracking and Personnel Recognition), and then passes the results to a sink/master process that performs the rest of the functionality over those results. I'm using ZMQ for inter-process communication.
What I have implemented now is that each worker connects to a different port and the sink subscribes to these ports independently. This solution is not scalable, as we might have 3 or 4 cameras per worker, and I felt that it won't be efficient to keep opening ports like that.
Multiple Ports Implementation
That's when I tried to implement a Multi-Pub/Single-Sub module, where all workers connect to one port and the sink subscribes to that port only.
Single Port Implementation
The problem I faced is that I can no longer distinguish between the different cameras, since I'm receiving different footage on the same port, which causes a problem when streaming it later. That's why I'm thinking about the possibility of having multiple sockets for each context, with each socket subscribing to a different IP. Is that possible?
Note: I've seen this answer, but it uses different ports for different sockets, which does not really serve my case.
Q : " ... I no longer can distinguish between different cameras ... "
A : Yet, there are ZeroMQ tools to do so - check details about :
.setsockopt( zmq.METADATA, "X-key:value" )
.setsockopt( zmq.ROUTING_ID, Id )
As you see, PUB/SUB-archetype is the worst one to be used here ( you pay all the costs of TOPIC-filter based subscription-management, yet receive nothing for doing that ).
Using better matching archetypes is the way to go.
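As one illustration of a better-matching arrangement (a sketch assuming pyzmq; the socket types, port number and camera ids below are illustrative, not taken from the question's code): every worker connects to the single port that the sink binds and prefixes each message with its own camera id, so the sink can demultiplex the streams without needing one port per camera.

import zmq

ctx = zmq.Context()

# Sink / master: binds exactly one port for all camera workers.
sink = ctx.socket(zmq.PULL)
sink.bind("tcp://*:5555")

def make_worker(camera_id):
    # Each camera worker connects to that single port and tags its frames.
    s = ctx.socket(zmq.PUSH)
    s.connect("tcp://localhost:5555")
    def send(frame_bytes):
        # Multipart message: [camera id, payload] keeps the streams separable.
        s.send_multipart([camera_id.encode(), frame_bytes])
    return send

send_cam0 = make_worker("camera-0")
send_cam1 = make_worker("camera-1")
send_cam0(b"frame-from-cam0")
send_cam1(b"frame-from-cam1")

for _ in range(2):
    camera_id, payload = sink.recv_multipart()
    print(camera_id.decode(), len(payload))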
Given no performance details were posted, the capacity may soon get over-saturated, so you may use more specific steps to flatten the workload and protect the smooth-flow of the service :
.setsockopt( zmq.TOS, aTransportPath_TOS )
.setsockopt( zmq.MAXMSGSIZE, aBLOB_limit_to_save_RAM )
Given a streaming could block on many "old"-frames not having got through the e2e-pipeline in due time, it might make sense to also set this :
.setsockopt( zmq.CONFLATE, 1 )
As you can see, there are many smart details in the configuration space of ZeroMQ. Plus, once the scaling grows larger and larger, your design shall also fine-tune the Context()-engine performance when instantiating :
.Context( aNumOfContextIOthreads2use )
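Pulled together, the options mentioned above might look roughly like this in pyzmq (the numeric values are placeholders, not recommendations):

import zmq

# More I/O threads for the Context()-engine as the number of streams grows.
ctx = zmq.Context(io_threads=4)

sock = ctx.socket(zmq.PULL)
sock.setsockopt(zmq.TOS, 40)                  # transport-path Type-of-Service mark
sock.setsockopt(zmq.MAXMSGSIZE, 10_000_000)   # reject oversized BLOBs to save RAM
sock.setsockopt(zmq.CONFLATE, 1)              # keep only the newest message per peer
                                              # (single-part messages only)
sock.bind("tcp://*:5555")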

How to effectively establish point to point channel using ZeroMQ?

I have trouble with establishing an asynchronous point-to-point channel using ZeroMQ.
My approach to building a point-to-point channel was to generate as many ZMQ_PAIR sockets as needed, up to the number of peers in the network. Because a ZMQ_PAIR socket ensures an exclusive connection between two peers, it needs the same number of sockets as peers. My first attempt is realized as the following diagram, which represents the pairing connections between two peers.
But the problem with the above approach is the fact that each pairing socket needs a distinct bind address. For example, if four peers are in the network, then each peer needs at least three (TCP) addresses to bind for the rest of the peers, which is very unrealistic and inefficient.
(I assume that each peer has exactly one unique address among the others, e.g. tcp://*:5555.)
It seems that there is no way other than using different patterns, which involve some set of message brokers, such as XREQ/XREP.
(I intentionally avoid a broker-based approach, because my application will heavily exchange messages between peers, which will often result in a performance bottleneck at the broker processes.)
But I wonder whether there is anybody who uses ZMQ_PAIR sockets to efficiently build a point-to-point channel? Or is there a way to avoid having distinct host IP addresses for multiple ZMQ_PAIR sockets to bind?
Q: How to effectively establish ... well,
Given the above narrative, the story of "How to effectively ..." ( where a metric of what and how actually measures the desired effectivity may get some further clarification later ), turns into another question - "Can we re-factor the ZeroMQ Signalling / Messaging infrastructure, so as to work without using as many IP-addresses:port#-s as would the tcp://-transport-class based topology actually need?"
Upon an explicitly expressed limit of having not more than just one IP:PORT# per host/node ( being thus the architecture's / design's very, if not the most, expensive resource ), one will have to overcome a lot of troubles on such a way forward.
It is fair to note that any such attempt will come at an extra cost to be paid. There will not be any magic wand to "bypass" such a principal limit expressed above. So get ready to indeed pay the costs.
It reminds me of one Project in TELCO, where a distributed-system was operated in a similar manner with a similar original motivation. Each node had an ssh/sshd service set up, where local-port forwarding made it possible to expose just one publicly accessible IP:PORT# access-point, and all the rest was implemented "inside" a mesh of all the topological links going through ssh-tunnels - not just because of the encryption service, but right due to the comfort of being able to maintain all the local-port-forwarding towards specific remote-ports, as a means of how to set up and operate such exclusive peer-to-peer links between all the service-nodes, yet having just a single public-access IP:PORT# per node.
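To make that ssh-tunnelling idea concrete, here is a hedged sketch assuming pyzmq and a plain OpenSSH client; the host name, user and port numbers are illustrative only:

import subprocess
import zmq

# On the remote node (whose only publicly reachable port is sshd), a peer binds
# a PAIR socket on a local-only port, e.g.:
#   pair = zmq.Context().socket(zmq.PAIR); pair.bind("tcp://127.0.0.1:5555")

# On the local node: open an ssh tunnel that maps a local port onto the remote
# peer's local-only ZeroMQ port, then let ZeroMQ connect through the tunnel.
tunnel = subprocess.Popen(
    ["ssh", "-N", "-L", "6001:127.0.0.1:5555", "user@remote-node"]
)

ctx = zmq.Context()
pair = ctx.socket(zmq.PAIR)
pair.connect("tcp://127.0.0.1:6001")    # traffic flows through the ssh tunnel
pair.send(b"hello through the tunnel")  # blocks until the tunnel and peer are up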
If no other approach seems feasible ( PUB/SUB being evicted either for the traffic actually flowing to each terminal node in cases of older ZeroMQ/API versions, where Topic-filtering gets processed but on the SUB-side, which both the security and network Departments will not like to support, or for the concentrated workloads and immense resource needs on the PUB-side in cases of newer ZeroMQ/API versions, where the Topic-filter is being processed on the sender's side. Addressing, dynamic network peer (re-)discovery, maintenance, resources planning, fault resilience, ..., yes, no easy shortcut seems to be anywhere near to just grab and (re-)use ), the above mentioned "stone-age" ssh/sshd-port-forwarding with ZeroMQ, running against such local-ports only, may save you.
Anyway - Good Luck on the hunt!

Why Paxos is designed in two phases

Why does Paxos require two phases (prepare/promise + accept/accepted) instead of a single one? That is, using only the prepare/promise portion, if the proposer has heard back from a majority of acceptors, that value is chosen.
What is the problem: does it break safety or liveness?
It breaks safety not to follow the full protocol.
Typical implementations of multi-paxos have a steady-state mode where a stable leader streams Accept messages containing fresh values. Only when a problem occurs (the leader crashes, stalls, or is partitioned by a network issue) does a new leader need to issue prepare messages to ensure safety. A full description of this is in the write-up of how TRex, an open-source Paxos library, implements Paxos.
Consider the following crash scenario which TRex would handle properly:
Nodes A, B, C with A leading
Client application sends V1 to leader A
Node A is in steady state and so sends accept(n, V1) to nodes B and C. The network starts to fail though so only B sees the message and it replies with accepted(n)
Node A sees the response and has a majority {A,B} so it knows the value is fixed due to the safety proof of the protocol.
Node A attempts to broadcast the outcome to everyone just as its server dies. Only the client application which issued V1 gets the message. Imagine that V1 is a customer order, and upon learning that the order is fixed the client application debits the customer's credit card.
Node C times out on the dead leader and attempts to lead. It never saw the value V1. It cannot arbitrarily choose any new value without rolling back the order V1, but the customer has already been charged.
So Node C first issues a prepare(n+1) and node B responds with promise(n+1, V1).
Node C then issues accept(n+1, V1) and as long as the remaining messages get through nodes B and C will learn the value V1 was chosen.
Intuitively we can say that Node C has chosen to collaborate with the dead node A by choosing A's value. So intuitively we can see why there must be two rounds. The first round is needed to discover whether there is any pending work to finish. The second round is used to fix the correct value to give consistency across all processes within the system.
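A minimal Python sketch of the acceptor state that makes this scenario work (an illustration only, not TRex's actual code): the promise carries back any value accepted under a lower ballot number, which is exactly what forces the new leader C to re-propose V1 instead of a value of its own.

class Acceptor:
    def __init__(self):
        self.promised_n = 0        # highest ballot number promised so far
        self.accepted_n = None     # ballot number of the last accepted value
        self.accepted_value = None

    def prepare(self, n):
        # Phase 1: promise not to accept anything below n, and report the
        # value (if any) that was accepted under an earlier ballot.
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted_n, self.accepted_value)
        return ("nack", self.promised_n, None)

    def accept(self, n, value):
        # Phase 2: accept only if no higher ballot has been promised since.
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return ("accepted", n)
        return ("nack", self.promised_n)

# Replaying the crash scenario from node B's point of view:
b = Acceptor()
b.accept(1, "V1")                     # leader A's accept(n, V1) reaches only B
_, prev_n, prev_value = b.prepare(2)  # new leader C issues prepare(n+1)
proposal = prev_value if prev_value is not None else "C's own value"
print(b.accept(2, proposal))          # C must re-propose V1 -> ('accepted', 2)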
It's not entirely accurate, but you can think of the two phases as 1) copying the data, and then 2) committing the data. If the data is just copied to the other servers, those servers would have no idea if enough other servers have the data for it to be considered safe to serve. Thus there is a second phase to let the servers know that they can commit the data.
Paxos is a little more complex than that, which allows it to continue during failures of either phase. Part of the Paxos proof is that it is the minimal protocol for doing this completely. That is, other protocols do more work, either because they add more capabilities, or because they were poorly designed.

Check dns-blacklist of an IP by using nmap

I'm talking about this script for nmap
http://nmap.org/nsedoc/scripts/dns-blacklist.html
User Summary
Checks target IP addresses against multiple DNS anti-spam and open proxy blacklists and returns a list of services for which an IP has been flagged. Checks may be limited by service category (eg: SPAM, PROXY) or to a specific service name.
Is it possible to use the Timing and Performance options of nmap, like making it run in parallel and setting a timeout? Any example please?
Thank you
The dns-blacklist script uses the dnsbl library to perform queries. That library uses Lua coroutines to issue many requests in parallel. The number of coroutines (including scripts) that can be running at any given time is set by the CONCURRENCY_LIMIT variable in nse_main.lua, and is not settable by the user. A more complete description of NSE parallelism can be found in the online documentation.
For timeouts, the script itself does not accept a timeout script-argument. Fortunately, though, the dnsbl library offloads the DNS query execution to the dns library, which has a function called get_default_timeout:
get_default_timeout = function()
  -- timeouts in milliseconds, indexed by Nmap timing level (-T0 through -T5)
  local timeout = {[0] = 10000, 7000, 5000, 4000, 4000, 4000}
  -- fall back to 4000 ms if the timing level is not in the table
  return timeout[nmap.timing_level()] or 4000
end
This shows that the dns library will set the timeout for DNS queries to 4000 ms (4 seconds) for -T3 (the default) through -T5, but will be more cautious at lower timing levels.
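So, for example, running the script with a higher timing template - something along the lines of nmap -T4 --script dns-blacklist <target> - would use the 4-second per-query timeout, while -T0 would allow up to 10 seconds per query; the timing template is the only timeout knob exposed here.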