Kamailio needs to block 200 OK from CANCELed branches, how?

I have a Kamailio 4.0.4 proxy (K) running registrar and tm. There are multiple clients for some AORs, and they all auto-accept certain INVITEs, which causes a race condition: 200 OKs from multiple branches get sent back toward the caller.
Scenario:
- A sends an INVITE to B
- K finds 2 contacts in usrloc for B, let's call them B1 & B2
- The INVITE is branched and sent to B1 and B2
- Note: B1 has a link latency of 100ms and B2 a latency of 150ms
- Both B1 and B2 auto-accept with 200 OK the instant they get it
- 200ms after branching the INVITE, K gets the 200 OK from B1 and relays it to A
- K also CANCELs the INVITE to B2
- A is actually a local AS which ACKs the 200 OK back to B1 instantly
- The problem is that B2 already sent its 200 OK 50ms ago and won't receive the CANCEL for another 150ms
- So the 200 OK from B2 arrives at K, but the call is already set up between A and B1
What happens is that the 200 OK is relayed to A which at this point gets utterly confused because it's not a very good AS to be honest.
Now to the actual question, how do I stop the extra 200 OK from going to A?
I can see a few options for how it should work:
- Drop the 200 OK, just throw it away. B2 should not resend it because the CANCEL will hit it soon
- ACK + BYE the 200 OK from inside Kamailio, but this will result in a media session being brought up and torn down immediately by B2
I can't even find an RFC covering this race condition..

IIRC, according to the RFC, 200 OK responses always have to be forwarded; it is the caller's decision to pick one and send ACK+BYE for the others.
The easy solution with Kamailio is to drop any further 200 OK once you got the first one. Note that the callee might not stop retransmitting it even when a CANCEL arrives; it will keep waiting for the ACK and eventually send a BYE itself.
The TM module will always forward the 200 OK, as per the RFC. If you want to drop it in kamailio.cfg, a possible solution:
- use an onreply_route { ... } block to intercept the 200 OK for the INVITE
- use htable to record when the first 200 OK is received (the key can be the Call-ID)
- use cfgutils to get locks that protect against races when accessing/updating the htable
- processing logic: if it is a 200 OK for an INVITE, take the lock for the htable and check if there is a key for that Call-ID. If yes, unlock and drop. If not, add an item with the Call-ID as key, unlock and let the response go on
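A minimal, untested sketch of that logic in kamailio.cfg (assuming the htable and cfgutils modules are loaded, and that the route is armed with t_on_reply() before relaying the INVITE; the table name "won", lock name and route name are illustrative):

```cfg
loadmodule "htable.so"
loadmodule "cfgutils.so"

# hash table remembering Call-IDs that already had a winning 200 OK
modparam("htable", "htable", "won=>size=8;autoexpire=120;")

onreply_route[CATCH_200] {
    # $rm is the method from the CSeq of the reply
    if (status == "200" && $rm == "INVITE") {
        lock("won_lock");
        if ($sht(won=>$ci) != $null) {
            # a 200 OK for this Call-ID was already relayed - discard this one
            unlock("won_lock");
            drop;
        }
        $sht(won=>$ci) = 1;   # first 200 OK wins
        unlock("won_lock");
    }
}
```

In the request route you would call t_on_reply("CATCH_200"); before t_relay() so the reply route fires for every branch. The autoexpire keeps the table from growing without bound.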

Related

ZeroMQ: Which socket types for arbitrary communication between exactly 2 peers?

I'm using 0MQ to let multiple processes talk to each other (IPC sockets, but should also work via TCP across different nodes). My code is similar to a client/server pattern, but REQ/REP sockets are not enough. Here is a sample conversation. See below for further details.
Process A              | Process B
-----------------------+-------------------------------------------------
open socket            | not started
start process B        | -
-                      | open socket, connect to A
-                      | send hello (successful start, socket information)
request work           | -
-                      | do work
-                      | send response (work result 1)
-                      | send response (work result 2)
-                      | send unsolicited message
-                      | send response (work finished)
request termination    | -
Actually, A is (even though it is doing all the requests) closer to being the server component, since it is constantly running. Based on external triggers, A starts a sort of plugin process B.
Every request needs to be answered by a finished response. Before that, N (between 0 and an arbitrary upper bound) responses can be sent from B.
A new request can be sent from A even when the current request is still ongoing (no finished message received). If relevant, the code could be updated to buffer the requests.
B sends an initial message which is not preceded by a request from A.
B can send other messages (logging) anywhere in between, also not preceded by a request.
Optional: A single socket in A should handle multiple plugin processes B, C, D...
A DEALER/ROUTER combination would probably match all requirements, but might be a bit too much. Process B will only ever connect to a single peer. And without the optional requirement above, the same would be true for process A as well. So I'm a bit hesitant to use DEALER and ROUTER sockets which are both able to handle multiple peers.

Always use neg[.z.w] to ensure that all messages are asynchronous?

Consider the following definition on server:
f:{show "Received ",string x; neg[.z.w] (`mycallback; x+1)}
on client side:
q)mycallback:{show "Returned ",string x;}
q)neg[h] (`f; 42)
q)"Returned 43"
In Q for Mortals, the tip says:
When performing asynchronous messaging, always use neg[.z.w] to ensure
that all messages are asynchronous. Otherwise you will get a deadlock
as each process waits for the other.
Therefore I changed the definition on the server to:
f:{show "Received ",string x; .z.w (`mycallback; x+1)}
everything goes fine, and I haven't seen any deadlocks.
Can anyone give me an example to show why I should always use neg[.z.w]?
If I understand your question correctly, I think you're asking how sync and async messages work. The issue with the example you have provided is that x+1 is a very simple query that can be evaluated almost instantaneously. For a more illustrative example, consider changing this to a sleep (or a more strenuous calculation, e.g. a large database query).
On your server side define:
f:{show "Received ",string x;system "sleep 10"; neg[.z.w] (`mycallback; x+1)}
Then on your client side you can send the synchronous query:
h(`f; 42)
multiple times. Doing this you will see there is no longer a q prompt on the client side as it must wait for a response. These requests can be queued and thus block both the client and server for a significant amount of time.
Alternatively, if you were to call:
(neg h)(`f; 42)
on the client side, you will see the q prompt remain, as the client is not waiting for a response. This is an asynchronous call.
Now, in your server side function you are looking at using either .z.w or neg .z.w. This follows the exact same principle, but from the server's perspective. If the response to a query is large enough, the messaging can take a significant amount of time. Consequently, by using neg, this response can be sent asynchronously so the server is not blocked during this process.
NOTE: If you are working on a windows machine you will need to swap out sleep for timeout or perhaps a while loop if you are following my examples.
Update: I suppose one way to cause such a deadlock would be to have two dependent processes attempting to synchronously call each other. For example:
q)\p 10002
q)h:hopen 10003
q)g:{h (`f1;`)}
q)h (`f;`)
on one side and
q)\p 10003
q)h:hopen 10002
q)f:{h (`g;`)}
q)f1:{show "test"}
on the other. This would result in both processes being stuck and thus test never being shown.
Joe's answer covers pretty much everything, but for your specific example, a deadlock happens if the client calls
h (`f; 42)
The client is waiting for a response from the server before processing the next request, but the server is also waiting for a response from the client before it completes the client's request.

SIP Timer inside core

I have the below question on the behavior of the SIP core in SIP communication.
Suppose party A calls party B. B receives the INVITE and generates a 200 OK. After generating the 200 OK, B's INVITE server transaction reaches the terminated state (in the SIP state machine) and is destroyed, so there is no transaction state machine left in B.
Now if the 200 OK does not reach A, B is supposed to retransmit the 200 OK, as it has not received the ACK. RFC 3261 says it is the SIP core's responsibility to retransmit the 200 OK.
So what is the trigger in B's SIP core to send this retransmission? Does it maintain any timers? Or is it implementation dependent?
Regards,
Sudhansu
The retransmission of the 2xx from B is explained in Section 13.3.1.4, "The INVITE is Accepted".
The exact text is this one:
The 2xx response is passed to the transport with an
interval that starts at T1 seconds and doubles for each
retransmission until it reaches T2 seconds (T1 and T2 are defined in
Section 17). Response retransmissions cease when an ACK request for
the response is received.
The end of retransmission is explained after:
If the server retransmits the 2xx response for 64*T1 seconds without
receiving an ACK, the dialog is confirmed, but the session SHOULD be
terminated. This is accomplished with a BYE, as described in Section
15.
This means that the application layer (ie: not the transaction layer)
needs to manage the timers.
The timers T1 and T2 are defined in Table 4: A Table of Timer Values.
T1    500ms default    Section 17.1.1.1    RTT estimate
T2    4s               Section 17.1.2.2    The maximum retransmit interval for non-INVITE requests and INVITE responses
T4    5s               Section 17.1.2.2    Maximum duration a message will remain in the network
It is allowed to modify the T1, T2 and T4 values. However, in practice, for the usual Internet, they shouldn't be changed.
For example, if all ACK are lost, the retransmission would be made within those intervals:
500ms
1s
2s
4s
4s
4s
4s
...
until the total reaches 64*T1 = 32 seconds
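The schedule above can be sketched in a few lines (a toy calculation of the rule from Section 13.3.1.4: the interval starts at T1, doubles each time, is capped at T2, and retransmissions stop at 64*T1):

```python
# Time offsets (in seconds) at which the 2xx is retransmitted if no ACK arrives
T1, T2 = 0.5, 4.0
t, interval, schedule = 0.0, T1, []
while t + interval < 64 * T1:          # give up once 64*T1 = 32s is reached
    t += interval
    schedule.append(t)                 # absolute offset of this retransmission
    interval = min(2 * interval, T2)   # doubling, capped at T2
print(schedule)  # [0.5, 1.5, 3.5, 7.5, 11.5, 15.5, 19.5, 23.5, 27.5, 31.5]
```

The gaps between consecutive offsets are exactly the 500ms, 1s, 2s, 4s, 4s, ... intervals listed above.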

Infinity confirmation loop

I came across an interesting theoretical problem:
Let's assume we have Program A and Program B connected via some IPC like a TCP socket or a named pipe. Program A sends some data to Program B and, depending on the success of the data delivery, both A and B do some operations. However, B should do its operation only if it is sure that A has got the delivery confirmation. So we end up with 3 messages:
A -> B [data transfer]
B -> A [delivery confirmation]
A -> B [confirmation of getting the delivery confirmation]
It may look weird, but the goal is to not do any operation on either A or B until both sides know that the data has been transferred.
And here is the problem: the second message confirms the success of the first, and the third confirms the second, but in fact there is no guarantee that messages 2 and 3 won't fail, and in that case we fall into an infinite loop of confirmations. Is there some CS theory which solves this problem?
If I read your question right, this is called the Two Generals' Problem. The gist of the issue is that the last entity that sends either a message or an acknowledgement knows nothing about the status of what it just sent, and so on.
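A toy Monte Carlo sketch of the 3-message scheme from the question, over a channel that drops each message with probability p (the commit rule is my assumption for illustration: A commits once it has B's confirmation, B commits once it has A's confirmation of the confirmation):

```python
import random

def disagreement_rate(p, trials=100_000, seed=7):
    """Fraction of runs where exactly one side commits."""
    rng = random.Random(seed)
    disagree = 0
    for _ in range(trials):
        m1 = rng.random() >= p         # A -> B: data transfer
        m2 = m1 and rng.random() >= p  # B -> A: delivery confirmation
        m3 = m2 and rng.random() >= p  # A -> B: confirmation of the confirmation
        a_commits, b_commits = m2, m3
        disagree += a_commits != b_commits
    return disagree / trials

print(disagreement_rate(0.0))  # lossless channel: never disagree
print(disagreement_rate(0.1))  # lossy channel: some runs end with A committed, B not
```

Whenever message 3 is the one that gets lost, A has committed but B has not, and adding a fourth confirmation just moves the same exposure one message later; no finite number of acknowledgements closes the gap.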

XEP-0124 / BOSH: Omit ACK in response

I'm reading the XEP-0124 / BOSH specification and do not understand the following sentence in chapter 9.1 Request Acknowledgements:
The only exception is that, after its
session creation response, the
connection manager SHOULD NOT include
an 'ack' attribute in any response if
the value would be the 'rid' of the
request being responded to.
In my words: I should not send an 'ack' if the response is for the last and only request (in the connection manager's queue).
But: there is a client with its own state machine. Maybe the client has already sent a second request -- while the first one is not yet answered -- and expects to get two answers. In this case the client expects an 'ack' with the RID of the "older" request, and the connection manager has to set 'ack'.
Conclusion: the connection manager MUST set 'ack' as long as multiple requests are allowed.
I'm not sure, but is this paragraph meant only for the use case where no further request is sent by the client, the session creation phase has finished successfully, and the connection manager has to send "ping" messages to the client due to "wait" timeouts?
So, as I read it:
If the highest in-sequence RID that you have received is 11 (you might have received 14 after that, but it is out of sequence since 12 & 13 are missing), and you are responding on:
- The same request: then you should not (it is recommended that you do not, but if you have a good reason to, then you may) send an 'ack' attribute.
- An earlier held request (say RID 10): then you should set 'ack' to 11, since that is the highest in-sequence RID that you have received so far.
It's okay if the client has sent multiple requests and the server doesn't yet know about them. This is because there is a chance that when the client sent 11, the server had no held connections and it responded on the same connection. In that case, there are 2 requests sent out (11 & 12), but the response for each one acks that same request, since the server always has something to send back immediately.
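My reading of the rule boils down to a small decision function (a sketch; the function names and the "last acked" bookkeeping are mine, not from XEP-0124):

```python
def highest_in_sequence(received, last_acked):
    """Highest RID with no gaps, starting from the last acknowledged one."""
    rid, got = last_acked, set(received)
    while rid + 1 in got:
        rid += 1
    return rid

def ack_attribute(received, last_acked, responding_to):
    """Return the 'ack' value to send, or None when it should be omitted
    (i.e. when it would equal the RID of the request being answered)."""
    hi = highest_in_sequence(received, last_acked)
    return None if hi == responding_to else hi

# Received 11 and 14; 12 & 13 missing, so the highest in-sequence RID is 11.
print(ack_attribute([11, 14], 10, 11))  # None: same request, omit 'ack'
print(ack_attribute([11, 14], 10, 10))  # 11: earlier held request, ack 11
```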