How to manage the error queue in openssl (SSL_get_error and ERR_get_error) - sockets

In OpenSSl, The man pages for The majority of SSL_* calls indicate an error by returning a value <= 0 and suggest calling SSL_get_error() to get the extended error.
But within the man pages for these calls as well as for other OpenSSL library calls, there are vague references to using the "error queue" in OpenSSL - Such is the case in the man page for SSL_get_error:
The current thread's error queue must be empty before the TLS/SSL I/O
operation is attempted, or SSL_get_error() will not work reliably.
And in that very same man page, the description for SSL_ERROR_SSL says this:
SSL_ERROR_SSL
A failure in the SSL library occurred, usually a protocol error.
The OpenSSL error queue contains more information on the error.
This kind of implies that there is something in the error queue worth reading. And failure to read it makes a subsequent call to SSL_get_error unreliable. Presumably, the call to make is ERR_get_error.
I plan to use non-blocking sockets in my code. As such, it's important that I reliably discover when the error condition is SSL_ERROR_WANT_READ or SSL_ERROR_WANT_WRITE so I can put the socket in the correct polling mode.
So my questions are this:
Does SSL_get_error() call ERR_get_error() implicitly for me? Or do I need to use both?
Should I be calling ERR_clear_error prior to every OpenSSL library call?
Is it possible that more than one error could be in the queue after an OpenSSL library call completes? Hence, are there circumstances where the first error in the queue is more relevant than the last error?

SSL_get_error does not call ERR_get_error. So if you just call SSL_get_error, the error stays in the queue.
You should be calling ERR_clear_error prior to ANY SSL-call(SSL_read, SSL_write etc) that is followed by SSL_get_error, otherwise you may be reading an old error that occurred previously in the current thread.

Related

Any ideas why we're getting Intermittent gRPC Unavailable/Unknown RpcExceptions (C++/C#)

We are using gRPC (version 1.37.1) for our inter-process communication between our C# process and C++ process. Both processes act as a server and client with the other and run on the same machine over localhost using the HTTP/2 transport. All of the calls are use blocking synchronous unary calls and not bi-directional streaming. Some average(ish) stats:
From C++->C#: 0-2 calls per second, 0-40 calls per minute
From C#->C++: 0-5 calls per second, 0-200 calls per minute
Intermittently, we were getting one of 3 issues
C# client call to C++ server comes back with an RpcException, usually “HTTP2/Parse Error”, “Endpoint Read Failed”, or “Transport Closed”
C++ client call to C# server comes back with Unavailable or Unknown
C++ client WaitForConnected call to check the channel fails after 500ms
The top most one is the most frequent and where we have the most information about. Usually, what we’ll see is the Client receives the RPC call and runs into an unknown frame type. Then the subchannel goes into shutdown and everything usually re-connects fine. We also generally see an embedded error like the following (note that we replaced all FILE instances to FUNCTION in our gRPC source):
win_read","file_line":307,"os_error":"The system detected an invalid pointer address in attempting to use a pointer argument in a call.\r\n","syscall":"WSARecv","wsa_error":10014}]},{"created":"#1622120588.494000000","description":"frame of size 262404 overflows local window of 65535","file":"grpc_core::chttp2::TransportFlowControl::ValidateRecvData","file_line":213}]}
What we’ve seen with the unknown frame type, is that it parses the HEADERS, WINDOW_UPDATE, DATA, WINDOW_UPDATE and then gets a TCP: on_read without a corresponding READ and then tries to parse again. It’s this parse where it looks like the parser is at the wrong offset in the buffer, because it gets the unknown frame type, incoming frame size and incoming stream_id all map to the middle of the RPC call that it just parsed.
The above was what we were encountering prior to a change to create a new channel for each rpc call. While we realize it is not great from a performance standpoint, we have seen increased stability since making the change. However, we still do occasionally get rpc exceptions. Now, the most common is “Unknown”/”Stream Removed” rather than the ones listed above.
Any ideas on what might be going wrong is appreciated. We've turned on all gRPC tracing and have even added to it, as well as captured the issue in wireshark but so far aren't getting a great indication of what's causing the transport to close. Are there any good tools to monitor the socket/port for failure?

SO_ERROR value after successful socket operation

I'm curious about the behavior of the SO_ERROR socket option used with getsockopt() after a successful socket operation
The Open Group specification:
SO_ERROR
Reports information about error status and clears it. This option shall store an int value.
Usually I see SO_ERROR used after a socket operation returns -1, but what happens if the previous socket operation succeeded (thus not returning -1). Does the getsockopt() call fail? Does it return 0 as the int value?
It's ok for non-blocking connect, see connect(2)
The socket is nonblocking and the connection cannot be completed immediately. It is possible to select(2) or poll(2) for completion by selecting the socket for writing. After select(2) indicates writability, use getsockopt(2) to read the SO_ERROR option at level SOL_SOCKET to determine whether connect() completed successfully (SO_ERROR is zero) or unsuccessfully (SO_ERROR is one of the usual error codes listed here, explaining the reason for the failure).
other is undefined.
It's undefined. You should only call this option when you already know that there has been an error. Not as a means to discover whether there was one.
I've learned more about SO_ERROR in Unix Networking Programmig Vol 1, it's become clear to me. SO_ERROR is used to report asynchronous errors that are the result of events within the network stack and not synchronous errors that are the result of a library call (send/recv/connect). Synchronous results are reported via errno.
Calling getsockopt() with SO_ERROR after a library call returns -1 is incorrect from the POSIX implementation.
Learning of the non-blocking connect result via select is an example of discovering when the asynchronous result is ready (which can then be retrieved via SO_ERROR)

How to request a replay of an already received fix message

I have an application that could potentitally throw an error on receiving a ExecutionReport (35=8) message.
This error is thrown at the application level and not at the fix engine level.
The fix engine records the message as seen and therefore will not send a ResendRequest (35=2). However the application has not processed it and I would like to manually trigger a re-processing of the missed ExecutionReport.
Forcing a ResendRequest (35=2) does not work as it requires modifying the expected next sequence number.
I was wonderin if FIX supports replaying of messages but without requiring a sequence number reset?
When processing an execution report, you should not throw any exceptions and expect FIX library to handle it. You either process the report or you have a system failure (i.e. call abort()). Therefore, if your code that handles execution report throws an exception and you know how to handle it, then catch it in that very same function, eliminate the cause of the problem and try processing again. For example (pseudo-code):
// This function is called by FIX library. No exceptions must be thrown because
// FIX library has no idea what to do with them.
void on_exec_report(const fix::msg &msg)
{
for (;;) {
try {
// Handle the execution report however you want.
handle_exec_report(msg);
} catch(const try_again_exception &) {
// Oh, some resource was temporarily unavailable? Try again!
continue;
} catch(const std::exception &) {
// This should never happen, but it did. Call 911.
abort();
}
}
}
Of course, it is possible to make FIX library do a re-transmission request and pass you that message again if exception was thrown. However, it does not make any sense at all because what is the point of asking the sender (over the network, using TCP/IP) to re-send a message that you already have (up your stack :)) and just need to process. Even if it did, what's the guarantee it won't happen again? Re-transmission in this case is not only doesn't sound right logically, the other side (i.e. exchange) may call you up and ask to stop doing this crap because you put too much load on their server with unnecessary re-transmit (because IRL TCP/IP does not lose messages and FIX sequence sync process happens only when connecting, unless of course some non-reliable transport is used, which is theoretically possible but doesn't happen in practice).
When aborting, however, it is FIX library`s responsibility not to increment RX sequence unless it knows for sure that user has processed the message. So that next time application starts, it actually performs synchronization and receives missing messages. If QuickFIX is not doing it, then you need to either fix this, take care of this manually (i.e. go screw with the file where it stores RX/TX sequence numbers), or use some other library that handles this correctly.
This is the wrong thing to do.
A ResendRequest tells the other side that there was some transmission error. In your case, there wasn't, so you shouldn't do that. You're misusing the protocol to cover your app's mistakes. This is wrong. (Also, as Vlad Lazarenko points out in his answer, if they did resend it, what's to say you won't have the error again?)
If an error occurs in your message handler, then that's your problem, and you need to find it and fix it, or alternately you need to catch your own exception and handle it accordingly.
(Based on past questions, I bet you are storing ExecutionReports to a DB store or something, and you want to use Resends to compensate for DB storage exceptions. Bad idea. You need to come up with your own redundancy solution.)

How to "throw" a %Status to %ETN?

Many of the Caché API methods return a %Status object which indicates if this is an error. The thing is, when it's an unknown error I don't know how to handle (like a network failure) what I really want to do is "throw" the error so my code stops what it's doing and the error gets caught by some higher level error handler (and/or the built-in %ETN error log).
I could use ztrap like:
s status = someObject.someMethod()
ztrap:$$$ISERR(status)
But that doesn't report much detail (unlike, say, .NET where I can throw an exception all the way to to the top of the stack) and I'm wondering if there are any better ways to do this.
Take a look at the Class Reference for %Exception.StatusException. You can create an exception from your status and throw it to whatever error trap is active at the time (so the flow of control would be the same as your ZTRAP example), like this
set sc = someobj.MethodReturningStatus()
if $$$ISERR(sc) {
set exception = ##class(%Exception.StatusException).CreateFromStatus(sc)
throw exception
}
However, in order to recover the exception information inside the error trap code that catches this exception, the error trap must have been established with try/catch. The older error handlers, $ztrap and $etrap, do not provide you with the exception object and you will only see that you have a <NOCATCH> error as the $ZERROR value. Even in that case, the flow of control will work as you want it to, but without try/catch, you would be no better off than you are with ZTRAP
These are two different error mechanisms and can't be combined in this way. ztrap and %ETN are for Cache level errors (the angle bracket errors like <UNDEFINED>). %Status objects are for application level errors (including errors that occurred through the use of the Cache Class Library) and you can choose how you want to handle them yourself. It's not really meaningful to handle a bad %Status through the Cache error mechanism because no Cache error has occurred.
Generally what most people do is something akin to:
d:$$$ISERR(status) $$$SomeMacroRelevantToMyAppThatWillHandleThisStatus(status)
It is possible to create your own domain with your own whole host of %Status codes with attendant %msg values for your application. Your app might have tried to connect to an FTP server and had a bad password, but that doesn't throw a <DISCONNECT> and there is no reason to investigate the stack, just an application level error that needs to be handled, possibly by asking the user to enter a new password.
It might seem odd that there are these two parallel error mechanisms, but they are describing two different types of errors. Think of one of them being "platform" level errors, and the other as "application level errors"
Edit: One thing I forgot, try DecomposeStatus^%apiOBJ(status) or ##class(%Status).LogicalToOdbc(status) to convert the status object to a human readable string. Also, if you're doing command line debugging or just want to print the readable form to the principal device, you can use $system.OBJ.DisplayError(status).

GetQueuedCompletionStatus returns ERROR_NETNAME_DELETED on remote socket closure

I am writing a small server-client-stuff using an I/O-Completion Port.
I get the server and client connected successfully via AcceptEx over my completion port.
After the client has connected the client socket is associated with the completion port and an overlapped call to WSARecv on that socket is invoked.
Everything works just fine, until I close the client test program.
GetQueuedCompletionStatus() returns FALSE and GetLastError returns
ERROR_NETNAME_DELETED
, which makes sense to me (after I read the description on the MSDN).
But my problem is, that I thought the call to GetQueuedCompletionStatus would return me a packet indicating the failure due to closure of the socket, because WSARecv would return the apropriate return value.
Since this is not the case I don´t know which clients´ socket caused the error and cant act the way i need to (freeing structures , cleanup for this particular connection, etc)...
Any suggestion on how to solve this, Or hints?
Thanks:)
EDIT: http://codepad.org/WeYINasO <- the code responsible... the "error" occures at the first functions beginning of the while-loop (the call to GetCompletionStatus() which is only a wrapper for GetQueuedCompletionStatus() working fine in other cases) [Did post it there, because it looks shitty & messy in here]
Here are the scenarios that you need to watch for when calling GetQueuedCompletionStatus:
GetQueuedCompletionStatus returns TRUE: A successful completion packet has been received, all the out parameters have been populated.
GetQueuedCompletionStatus returns FALSE, lpOverlapped == NULL: No packet was dequeued. The other out parameters contain indeterminate values.
GetQueuedCompletionStatus returns FALSE, lpOverlapped != NULL: The function has dequeued a failed completion packet. The error code is available via GetLastError.
To answer your question, when GetQueuedCompletionStatus returns FALSE and lpOverlapped != NULL, there was a failed I/O completion. It's the value of lpOverlapped that you need to be concerned about.
I know this is an old question, but I found this page while fruitlessly googling for details about ERROR_NETNAME_DELETED. It is an error which I get while doing an overlapped Readfile().
After some debugging it turned out that the problem was caused by a program which was writing to a socket but forgetting to call closesocket() before using ExitProcess() (due to garbage collection issues). Calling CloseHandle() did not prevent the error, nor did adding WSACleanup() before ExitProcess(). However, adding a short sleep before the client exited did prevent the error. Maybe avoiding ExitProcess() would have prevented the problem also.
So I suspect your problem is caused by the program exiting without closing down the socket properly.
I don't think this would be an issue on Unix where sockets are just ordinary file descriptors.