Connection timeout and socket timeout advice - sockets

I'm currently doing some research into connection and socket timeout settings but I'm fairly new to this stuff.
As a stab in the dark, we are thinking of using 40 seconds as both the connection and socket timeout when we make JSON-over-HTTP calls to another server.
// Apache Commons HttpClient 3.x: maximum time to establish the TCP connection, in milliseconds
httpConnectionManagerParams.setConnectionTimeout(40000);
// Maximum time to wait for data once connected (the socket read timeout, SO_TIMEOUT), in milliseconds
httpConnectionManagerParams.setSoTimeout(40000);
But really I don't know how to work out what the ideal settings or best practices are. I would appreciate it if someone could give me some pointers on what to consider, or tips on how to come up with a good guesstimate for these settings.
The sort of advice I'm looking for is something like: "40 seconds is far too long because it might cause another issue", or "the higher you set this value, the more chance you have of causing another issue", or "40 seconds is not too high at all", or "to work out an ideal figure, multiply Y by T".
thanks
EDIT
Adding a Firebug trace of the server call.

There's no reason whatsoever why they should be equal. Considering each condition separately, you want to set it high enough that a timeout will indicate a genuine problem rather than just a temporary overload, and low enough that you maintain responsiveness of the application.
As a rule, 40 seconds is far too long for a connect timeout. I would view anything in double figures with suspicion. A server should be able to accept tens or hundreds of connections per second.
Read timeouts are a completely different matter. The 'correct' value, if there is such a thing, depends entirely on the average service time for the request and on its variance. As a starting point, you might set it to double the expected service time, or to the average service time plus two or three standard deviations, depending on your service level requirements and how variable your server's performance is. There is no hard and fast rule about this. Many contractual service level agreements (SLAs) specify a 'normal' response time of two seconds, which may inform your deliberations, but it's your decision.
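For example, here is a small sketch (in Java, with made-up sample latencies and a made-up 2-second connect timeout) of turning observed response times into a read timeout using the mean-plus-a-few-standard-deviations heuristic:

import java.util.Arrays;

public class TimeoutEstimate {
    public static void main(String[] args) {
        // Hypothetical observed response times in milliseconds for the JSON-over-HTTP call.
        long[] observedMillis = {180, 220, 250, 310, 900, 240, 270};
        double mean = Arrays.stream(observedMillis).average().orElse(0);
        double variance = Arrays.stream(observedMillis)
                .mapToDouble(t -> (t - mean) * (t - mean))
                .average().orElse(0);
        double stdDev = Math.sqrt(variance);

        long readTimeout = Math.round(mean + 3 * stdDev); // socket/read timeout, ms
        long connectTimeout = 2000;                       // a few seconds is usually plenty

        System.out.printf("connect timeout: %d ms, read timeout: %d ms%n",
                connectTimeout, readTimeout);
    }
}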

Related

How to distribute tasks between servers where each task must be done by only one server?

Goal: There are X backend servers and Y tasks. Each task must be done by only one server; the same task being run by two different servers should never happen.
There are tasks which include continuous work for an indefinite amount of time, such as polling for data. The same server can keep doing such a task as long as the server stays alive.
Problem: How to reassign a task if the server executing it dies? If the server dies, it can't mark the task as open. What are efficient ways to accomplish this?
Well, the way you define your problem makes it hard to reason about. What you are actually looking for is called a "distributed lock".
Let's start with a simpler problem: assume you have only two concurrent servers, S1 and S2, and a single task T. The safety property you stated remains as is: at no point in time may both S1 and S2 process task T. How could that be achieved? The following strategies come to mind:
Implement an algorithm that deterministically maps each task to a responsible server. For example, it could be as simple as: if task.name.contains('foo') then server1.process(task) else server2.process(task). That works, and might indeed fit some real-world requirements out there, yet such an approach is a dead end: a) you have to know upfront, statically, how many servers you will have, and b) - most dangerously - you cannot tolerate either server being down: if, say, S1 is taken offline, there is nothing you can do with T right now except wait for S1 to come back online. These drawbacks can be softened and optimized around, yet there is no way to get rid of them; escaping these deficiencies requires a more dynamic approach.
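For illustration, here is a minimal sketch of that static-mapping strategy (in Java, with a hypothetical list of server ids), using a hash of the task name modulo the number of servers instead of the contains('foo') rule; the drawbacks above apply just the same:

import java.util.List;

class StaticAssignment {
    // Given a fixed, statically known list of server ids, deterministically pick
    // the owner of a task by hashing its name. If that server is down, the task
    // has no one else to run it and simply waits.
    static String ownerOf(String taskName, List<String> serverIds) {
        int index = (taskName.hashCode() & Integer.MAX_VALUE) % serverIds.size();
        return serverIds.get(index);
    }
}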
Implement an algorithm that allows S1 and S2 to agree on who is responsible for T. Basically, you want both S1 and S2 to come to a consensus about the value of a (conceptual, not necessarily materialized) property T.is_processed_by, which would be either "S1" or "S2". Then your requirement translates to "at any point in time, is_processed_by is seen by both servers in the same way". Hence "consensus": an agreement (between the servers) about the is_processed_by value. Having that eliminates all the "too static" issues of the previous strategy: you are no longer bound to 2 servers, you could have n servers for any n > 1 (provided that your distributed consensus works for the chosen n). However, this is still not prepared for accidents like an unexpected power outage. It could be that S1 won the competition, is_processed_by became equal to "S1", S2 agreed with that and... S1 went down and did nothing useful...
...so you're missing the last bit: the "liveness" property. In simple words, you'd like your system to keep making progress whenever possible. To achieve that property - among many other things I am not mentioning - you have to make sure that a server's spontaneous death is detected and that, once it happens, no task T stays stuck indefinitely. How do you achieve that? That's another story, but a typical practical solution is to copy the good old TCP way of doing essentially the same thing: the keepalive approach.
OK, let's sum up what we have so far:
Take any implementation of "distributed locking", which is equivalent to "distributed consensus". It could be ZooKeeper used correctly, PostgreSQL running a serializable transaction, or anything similar.
For each unprocessed or stuck task T in your system, have all the free servers S race for that lock. Only one of them is guaranteed to win, and all the rest will surely lose.
Frequently enough, push TCP-keepalive-style notifications for each task being processed or, at least, for each live server. Missing, say, three notifications in a row should be treated as the server's death, and all of its tasks should be re-marked as "stuck" and (eventually) reprocessed in the previous step.
And that's it.
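To make that concrete, here is one possible shape for steps 1 and 2, assuming PostgreSQL as the lock provider and the PostgreSQL JDBC driver on the classpath; the connection URL, credentials, and task id are made up. pg_try_advisory_lock takes a session-level lock that is released when the holder's session ends (e.g. its connection drops), which also gives you a crude version of the liveness behavior from step 3:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class TaskLockExample {
    public static void main(String[] args) throws Exception {
        long taskId = 42L; // hypothetical numeric id of task T
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/tasks", "app", "secret")) {
            try (PreparedStatement ps = conn.prepareStatement("SELECT pg_try_advisory_lock(?)")) {
                ps.setLong(1, taskId);
                try (ResultSet rs = ps.executeQuery()) {
                    rs.next();
                    if (rs.getBoolean(1)) {
                        // We won the race: process task T here.
                        // Keep the connection open while working; if this server dies,
                        // the session ends and the lock frees up for another server.
                    } else {
                        // Another server holds the lock: skip this task for now.
                    }
                }
            }
        }
    }
}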
P.S. Safety and liveness properties are something you definitely want to be aware of when it comes to distributed computing.
Try RabbitMQ worker queues:
https://www.rabbitmq.com/tutorials/tutorial-two-python.html
It has an acknowledgement feature, so if a task fails or a server crashes it will automatically redeliver your task. Based on your specific use case you can also set up retries, etc.
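For reference, here is roughly what the linked tutorial's worker looks like with the Java client (amqp-client 5.x) and manual acknowledgements; the queue name task_queue comes from the tutorial, while the host and message handling are placeholders:

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;
import java.nio.charset.StandardCharsets;

public class Worker {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        channel.queueDeclare("task_queue", true, false, false, null);
        channel.basicQos(1); // hand this worker one unacknowledged task at a time

        DeliverCallback deliverCallback = (consumerTag, delivery) -> {
            String task = new String(delivery.getBody(), StandardCharsets.UTF_8);
            System.out.println("Processing: " + task);
            // ... do the actual work here ...
            // Acknowledge only after the work is done; if the worker crashes before
            // this point, RabbitMQ redelivers the task to another worker.
            channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
        };
        channel.basicConsume("task_queue", false /* autoAck off */, deliverCallback, consumerTag -> { });
    }
}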
"Problem: How to reassign a task if the server executing it dies? If the server dies, it can't mark the task as open. What are efficient ways to accomplish this?"
You are running into a well-known problem in distributed systems: how does a system make decisions when it is partitioned? Let me elaborate on this.
The simple statement "server dies" requires quite a deep dive into what it actually means. Did the server lose power? Is the network between your control plane and the server down (while the task keeps running)? Or maybe the task completed successfully, but the failure happened just before the task server was about to report it? If you want to be 100% correct in deciding the current state of the system, that is the same as saying the system has to be 100% consistent.
This is where the CAP theorem (https://en.wikipedia.org/wiki/CAP_theorem) comes into play. Since your system may be partitioned at any time (a worker server may get disconnected or die, which looks the same from the outside) and you want to be 100% correct/consistent, the system cannot be 100% available.
To reiterate the previous paragraph: if the system suspects a task server is down, the system as a whole has to come to a stop until it can determine what happened to that particular task server.
The trade-off between consistency and availability is at the core of distributed systems. Since you want to be 100% correct, you won't have 100% availability.
While availability won't be 100%, you can still improve the system to make it as available as possible. Several approaches can help with that.
The simplest one is to alert a human when the system suspects a server is down. The human gets a notification (24/7), wakes up, logs in, and checks manually what is going on. Whether this approach works for your case depends on how much availability you need, but it is completely legitimate and widely used in the industry (those engineers carrying pagers).
A more complicated approach is to let the system fail over to another task server automatically, if that is possible. A few options are available here, depending on the type of task.
The first type of task is re-runnable, but it has to exist as a single instance. In this case, the system uses the "STONITH" (shoot the other node in the head) technique to make sure the previous node is dead for good. For example, in a cloud the system would actually kill the whole container of the task server and then start a new container as a failover.
The second type of task is not re-runnable. For example, a task transferring money from account A to account B is not (automatically) re-runnable: the system does not know whether the task failed before or after the money was moved. Hence, the failover needs extra steps to work out the outcome, which may also be impossible if the network is not working correctly. In these cases the system usually halts until it can make a 100% correct decision.
None of these options will give you 100% availability, but they get as close as possible given the nature of distributed systems.

Ajax polling vs SSE (performance on server side)

I'm curious whether there is some kind of standard threshold for when it is better to use Ajax polling instead of SSE, from a server-side viewpoint.
1 request every second: I'm pretty sure SSE is better.
1 request per minute: I'm pretty sure Ajax is better.
But what about 1 request every 5 seconds? How can we calculate the frequency at which the balance tips between Ajax and SSE?
No way is 1 request per minute always better for Ajax, so that assumption is flawed from the start. Any kind of frequent polling is nearly always a costly choice. It seems, from our previous conversation in the comments of another question, that you start with the belief that an open TCP socket (whether an SSE connection or a webSocket connection) is somehow costly to server performance. An idle TCP connection takes zero CPU (maybe every once in a long while a keep-alive is sent, but other than that an idle socket does not use CPU). It does use a bit of server memory to hold the socket descriptor, but a highly tuned server can have 1,000,000 sockets open at once. So your CPU usage is going to depend more on how many connections are being established, and what they ask the server to do each time they are established, than on how many open (and mostly idle) connections there are.
Remember, every HTTP request has to create a TCP socket (which takes round trips between client and server), then send the HTTP request, then get the HTTP response, then close the socket. That's a lot of round trips of data to do every minute. If the connection is HTTPS, establishing it is even more work and even more round trips because of the crypto layer and certificate verification. So doing all that every minute for hundreds of thousands of clients seems like a massive waste of resources and bandwidth when you could create one SSE connection and have the client just listen for data streamed from the server over that connection.
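To make the SSE side concrete, here is a bare-bones sketch using the JDK's built-in com.sun.net.httpserver (a toy, not a production server); the port, path, and event payloads are made up. On the browser side, new EventSource('/events') would subscribe to this stream:

import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;

public class SseSketch {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/events", exchange -> {
            exchange.getResponseHeaders().add("Content-Type", "text/event-stream");
            exchange.getResponseHeaders().add("Cache-Control", "no-cache");
            exchange.sendResponseHeaders(200, 0); // length 0 = streamed/chunked body
            OutputStream body = exchange.getResponseBody();
            try {
                // In a real app you would write an event only when new data arrives;
                // here we push a few on a timer just to keep the sketch short.
                for (int i = 0; i < 10; i++) {
                    body.write(("data: update " + i + "\n\n").getBytes(StandardCharsets.UTF_8));
                    body.flush();
                    Thread.sleep(5000);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                exchange.close();
            }
        });
        server.setExecutor(Executors.newCachedThreadPool()); // one thread per open stream in this toy example
        server.start();
    }
}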
As I said in our earlier comment exchange on a different question, these types of questions are not really answerable in the abstract. You have to have specific requirements of both client and server and a specific understanding of the data being delivered and how urgent it is on the client and therefore a specific polling interval and a specific scale in order to begin to do some calculations or test harnesses to evaluate which might be the more desirable way to do things. There are simply too many variables to come up with a purely hypothetical answer. You have to define a scenario and then analyze different implementations for that specific scenario.
Number of requests per second is only one of many possible variables. For example, if most of the time you poll there's actually nothing new, that gives even more of an advantage to the SSE case, because it would have nothing to do at all (zero load on the server other than a little bit of memory for an open socket most of the time), whereas polling creates continual load even when there is nothing to do.
The #1 advantage of server push (whether implemented with SSE or webSocket) is that the server only has to do anything with the client when there is actually pertinent data to send to that specific client. All the rest of the time, the socket just sits there idle (perhaps occasionally, on a long interval, sending a keep-alive).
The #1 disadvantage of polling is that there may be lots of times when the client polls the server and the server has to expend resources to deal with the polling request only to inform that client that it has nothing new.
How can we calculate where is the limit frequency for Ajax or SSE?
It's a pretty complicated process. Lots of variables in a specific scenario need to be defined. It's not as simple as just requests/sec. Then, you have to decide what you're attempting to measure or evaluate and at what scale? "Server performance" is the only thing you mention, but that has to be completely defined and different factors such as CPU usage and memory usage have to be weighted into whatever you're measuring or calculating. Then, you may even need to run some test harnesses if the calculations don't yield an obvious answer or if the decision is so critical that you want to verify your calculations with real metrics.
It sounds like you're looking for an answer like "at greater than x requests/min, you should use polling instead of SSE" and I don't think there is an answer that simple. It depends upon far more things than requests/min or requests/sec.
"Polling" incurs overhead on all parties. If you can avoid it, don't poll.
If SSE is an option, it might be a good choice. "It depends".
Q: What (if any) kind of "event(s)" will your app need to handle?

Mongo connection count creeping up by one per 10 seconds with mgo driver

We monitor our mongoDB connection count using this:
http://godoc.org/labix.org/v2/mgo#GetStats
However, we have been facing a strange connection leak issue where the connectionCount creeps up consistently by one more open connection per 10 seconds (that's regardless of whether there are any requests). I can spin up a server on localhost, leave it there, do nothing, and the connectionCount will still creep up. It eventually reaches a few thousand, which kills the app/db, and then we have to restart the app.
This might not be enough information for you to debug. Does anyone have any ideas, or connection leaks you have dealt with in the past? How did you debug them? What are some of the ways I can debug this?
We have tried a few things: we scanned our code base for any code that could open a connection and put counters/debugging statements there, and so far we have found no leak. It is almost like there is a leak in a library somewhere.
This is a bug in a branch that we have been working on, which has had a few hundred commits. We have done a diff between it and master and couldn't find why there is a connection leak in this branch.
As an example, here is the dataset that I am referencing:
Clusters: 1
MasterConns: 9936 <-- creeps up 1 per second
SlaveConns: -7359 <-- why is this negative?
SentOps: 42091780
ReceivedOps: 38684525
ReceivedDocs: 39466143
SocketsAlive: 78 <-- what is the difference between the socket count and the master conns count?
SocketsInUse: 1231
SocketRefs: 1231
MasterConns is the number that creeps up by one per 10 seconds. I am not entirely sure what the other numbers mean.
MasterConns cannot tell you whether there's a leak or not, because it does not decrease. The field indicates the number of connections made since the last statistics reset, not the number of sockets that are currently in use. The latter is indicated by the SocketsAlive field.
To give you some additional relief on the subject, every single test in the mgo suite is wrapped in logic that ensures the statistics show sane values after the test finishes, so that potential leaks don't go unnoticed. That's the main reason such a statistics collection system was introduced.
Then, the reason you see this number increasing every 10 seconds or so is the internal activity that happens periodically to learn the status of the cluster. That said, this behavior was recently changed so that it doesn't establish new connections and instead picks existing sockets from the pool, so I believe you're not using the latest release.
Having SlaveConns negative looks like a bug. There's a small edge case about statistics collection for connections made, because we cannot tell whether a given server is a master or a slave before we've talked to it, so there might be an uncovered path. If you still see that behavior after you upgrade, please report the issue and I'll be happy to look at it.
SocketsInUse is the number of sockets that are still being referenced by one or more sessions, whether they are alive (the connection is established) or not. SocketsAlive is, again, the real number of live TCP connections. The delta between the two indicates that a number of sessions were not closed. This may be okay, if they are still being held in memory by the application and will eventually be closed, or it may be a leak if a session.Close operation was missed by the application.

How to handle bursts with Actors?

Suppose I have an actor which handles X requests per second. That is OK on average, but sometimes there are bursts and clients send Y > X requests per second. Suppose also that all requests have timeouts, so they cannot wait in the queue forever.
Assuming we program in Scala and Akka what are the best practices/design patterns to make the actor handle those bursts? Are there any code examples, which handle bursts?
As long as your machine can handle the increased load (i.e. has enough CPUs), I would suggest pooling the actor using a Router. It sounds from your example like a dynamically resizing router might be the best fit, but even a standard Round Robin or Smallest Mailbox might be enough. Below is the link to the routers section of the Akka documentation. I hope this helps.
http://doc.akka.io/docs/akka/2.1.2/scala/routing.html
You could also consider distributing the actor across multiple nodes, but that might be overkill for your scenario. If you have interest in that approach, let me know and I can post more context on doing that.
Now, as for what to do when you have pooled the actors but the system is still getting backlogged, that's really up to you, but here are a few options. If you can handle the occasional increase in latency due to bursting, then do nothing: the actors' mailboxes will just get a little backed up, and they will clear as soon as the burst eases off. If not, then the question is how to handle incoming messages while the actors are backlogged. If you want to fail fast in that situation and not accept the message, you might want to look into using a bounded mailbox (http://doc.akka.io/docs/akka/2.1.2/scala/dispatchers.html). When the mailbox reaches its size limit and can no longer queue messages, the caller will get a failure sending the message (I think). Not awesome, but it will at least lead to the system stabilizing faster.
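For what it's worth, here is a rough sketch of the resizable pool idea using the Akka classic Java API (newer than the 2.1.2 docs linked above, where the router class was RoundRobinRouter); the Worker actor, pool bounds, and message type are all placeholders:

import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;
import akka.routing.DefaultResizer;
import akka.routing.RoundRobinPool;

public class BurstExample {
    // Placeholder worker; the real one would hold your request-handling logic.
    static class Worker extends AbstractActor {
        @Override
        public Receive createReceive() {
            return receiveBuilder()
                    .match(String.class, msg -> {
                        // handle one request
                    })
                    .build();
        }
    }

    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("bursts");

        // A resizing round-robin pool: starts at 5 workers and grows up to 20
        // while mailboxes back up during a burst, then shrinks again afterwards.
        DefaultResizer resizer = new DefaultResizer(5, 20);
        ActorRef workers = system.actorOf(
                new RoundRobinPool(5).withResizer(resizer).props(Props.create(Worker.class)),
                "workers");

        workers.tell("some request", ActorRef.noSender());
    }
}

A bounded mailbox, if you go that route, is configured via the mailbox/dispatcher settings in application.conf rather than in code.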
I assume you are doing ask (?) (i.e. request/response), so when you do that, you get a Future. That Future will time out (with an implicitly defined timeout value) if it does not receive a response in time, so during a burst, Futures attached to the calls into the backlogged actors will just start timing out; they will not be stuck there forever.

What does server throughput mean

If the throughput increases, how will the response and request times change?
What if I have the data as requests/min?
JMeter's definition of throughput can be seen here: https://jmeter.apache.org/usermanual/glossary.html
Basically, it's a measure of how many requests JMeter was able to send to your test site/application in one second or, in other words, the number of requests your test site/application was able to receive from JMeter in one second. An increase in throughput means your site/application was able to handle more requests per second, while a decrease means a reduction in the number of requests it handled per second.
The relationship between throughput and response/request time totally depends, as ysth stated. I typically use this number to gauge the load on the server, but I run the test several times (at least 30) and take the average.
There's not necessarily a relationship. Can you tell us anything more about why you want to know this, what you plan to do with the information, etc.? It may help get you an answer better suited to your needs.
After development is complete, we as developers are responsible for testing the performance of the application.
As part of performance testing, we have to check:
1) Response time of the application
2) Bottlenecks in the application
3) Throughput of the application
Throughput of application: in general, the request capacity of the application in a given amount of time.
As per the Apache JMeter documentation:
Throughput is calculated as requests/unit of time. The time is calculated from the start of the first sample to the end of the last sample. This includes any intervals between samples, as it is supposed to represent the load on the server.
The formula is: Throughput = (number of requests) / (total time).
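As a trivial worked example with made-up numbers: if JMeter recorded 1,200 requests and the elapsed time from the start of the first sample to the end of the last was 60 seconds, the throughput is 1200 / 60 = 20 requests/second:

public class ThroughputExample {
    public static void main(String[] args) {
        int numberOfRequests = 1200;     // hypothetical sample count
        double totalTimeSeconds = 60.0;  // first sample start to last sample end
        double throughput = numberOfRequests / totalTimeSeconds;
        System.out.println(throughput + " requests/second"); // prints 20.0 requests/second
    }
}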