Checking health of Kafka Stream Threads

When some of the stream threads die (for example because of an exception), I don't want the application to keep running in a degraded state; I want to restart the process.
In order to do this, I need to detect that state.
I know I can use kafkaStream.state(), but it reports the state of the whole KafkaStreams instance, meaning that if only one StreamThread has died, kafkaStream.state() will not reflect it.
What is the best way for me to know in the code that all StreamThreads are alive and working?

Update: added a timeout to KafkaStreams#close(), since calling close() without one from inside the handler can deadlock, as Matthias stated in the comment.
If you want to detect whether any StreamThread has died, you can register a KafkaStreams#setUncaughtExceptionHandler() in which you stop the streams instance and exit the app:
kafkaStreams.setUncaughtExceptionHandler((t, e) -> {
    logger.error("Exiting", e);
    // Close with a timeout: an unbounded close() from inside the handler can deadlock.
    kafkaStreams.close(10, TimeUnit.SECONDS); // close(Duration) in newer Kafka Streams versions
    System.exit(1); // exit with an error code so the container can restart this app
});
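Alternatively, if you want to check thread health proactively rather than only reacting to an uncaught exception, newer Kafka Streams versions expose per-thread metadata. A minimal sketch, assuming a version that provides KafkaStreams#localThreadsMetadata() (renamed metadataForLocalThreads() in later releases); the polling scheduler and the "DEAD" check are illustrative additions, not part of the original answer:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.kafka.streams.KafkaStreams;

public class StreamThreadHealthCheck {

    // True only if no stream thread reports the "DEAD" state.
    static boolean allThreadsAlive(KafkaStreams kafkaStreams) {
        return kafkaStreams.localThreadsMetadata().stream()
                .noneMatch(metadata -> "DEAD".equals(metadata.threadState()));
    }

    // Poll periodically and exit with an error code (so the container restarts
    // the app), mirroring the uncaught-exception handler above.
    static void scheduleHealthCheck(KafkaStreams kafkaStreams) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            if (!allThreadsAlive(kafkaStreams)) {
                System.exit(1);
            }
        }, 10, 10, TimeUnit.SECONDS);
    }
}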

Related

Is it possible to avoid nested RetryLoop.callWithRetry calls so I get a consistent timeout?

I've configured a reasonable timeout using BoundedExponentialBackoffRetry, and generally it works as I'd expect if ZK is down when I make a call like "create.forPath". But if ZK is unavailable when I call acquire on an InterProcessReadWriteLock, it takes far longer before it finally times out.
I call acquire, which is wrapped in "RetryLoop.callWithRetry", and it goes on to call findProtectedNodeInForeground, which is also wrapped in "RetryLoop.callWithRetry". If I've configured the BoundedExponentialBackoffRetry to retry 20 times, the inner retry tries 20 times for every one of the 20 outer retry loops, so it retries 400 times.
We really need a consistent timeout after which we fail. Have I done anything wrong, or is there any way around this? If not, I guess I'll call the troublesome methods in a new thread that I can kill after my own timeout.
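A minimal sketch of that thread-plus-timeout fallback, using plain java.util.concurrent (the helper name and the timeout handling are mine, not part of Curator):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.apache.curator.framework.recipes.locks.InterProcessReadWriteLock;

public class AcquireWithTimeout {

    // Run the acquire on a worker thread and give up after our own timeout,
    // interrupting the worker so the nested retry loop stops.
    static boolean tryAcquire(InterProcessReadWriteLock lock, long timeoutMs) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> pending = executor.submit(() -> {
            lock.readLock().acquire();   // may retry 20 * 20 times internally
            return null;
        });
        try {
            pending.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            pending.cancel(true);        // interrupt the thread stuck in retries
            return false;
        } finally {
            executor.shutdownNow();
        }
    }
}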
Here is the sample code to recreate it. I put breakpoints at the lines following the comments, bring ZK down, and then let it continue and take the stack trace while it's retrying.
public class GoCurator {
    public static void main(String[] args) throws Exception {
        CuratorFramework cf = CuratorFrameworkFactory.newClient(
                "localhost:2181",
                new BoundedExponentialBackoffRetry(200, 10000, 20)
        );
        cf.start();
        String root = "/myRoot";
        if (cf.checkExists().forPath(root) == null) {
            // Stacktrace A showing what happens if ZK is down for this call
            cf.create().forPath(root);
        }
        InterProcessReadWriteLock lock = new InterProcessReadWriteLock(cf, "/grant/myLock");
        // See stacktrace B showing the nested re-try if ZK is down for this call
        lock.readLock().acquire();
        lock.readLock().release();
        System.out.println("done");
    }
}
Stacktrace A (if ZK is down when I'm calling create().forPath). This shows the single retry loop, so it exits after the correct number of attempts:
java.lang.Thread.State: WAITING
at java.lang.Object.wait(Object.java:-1)
at java.lang.Object.wait(Object.java:502)
at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1499)
at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1487)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2617)
at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:242)
at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:231)
at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64)
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100)
at org.apache.curator.framework.imps.GetChildrenBuilderImpl.pathInForeground(GetChildrenBuilderImpl.java:228)
at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:219)
at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:41)
at com.gebatech.curator.GoCurator.main(GoCurator.java:25)
Stacktrace B (if ZK is down when I call InterProcessReadWriteLock#readLock#acquire). This shows the nested retry loop, so it doesn't exit until 20*20 attempts have been made.
java.lang.Thread.State: WAITING
at sun.misc.Unsafe.park(Unsafe.java:-1)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
at org.apache.curator.CuratorZookeeperClient.internalBlockUntilConnectedOrTimedOut(CuratorZookeeperClient.java:434)
at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:56)
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100)
at org.apache.curator.framework.imps.CreateBuilderImpl.findProtectedNodeInForeground(CreateBuilderImpl.java:1239)
at org.apache.curator.framework.imps.CreateBuilderImpl.access$1700(CreateBuilderImpl.java:51)
at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1167)
at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1156)
at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64)
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100)
at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1153)
at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:607)
at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:597)
at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:575)
at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:51)
at org.apache.curator.framework.recipes.locks.StandardLockInternalsDriver.createsTheLock(StandardLockInternalsDriver.java:54)
at org.apache.curator.framework.recipes.locks.LockInternals.attemptLock(LockInternals.java:225)
at org.apache.curator.framework.recipes.locks.InterProcessMutex.internalLock(InterProcessMutex.java:237)
at org.apache.curator.framework.recipes.locks.InterProcessMutex.acquire(InterProcessMutex.java:89)
at com.gebatech.curator.GoCurator.main(GoCurator.java:29)
This turns out to be a real, longstanding problem with how Curator uses retries. I have a fix and PR ready here: https://github.com/apache/curator/pull/346 - I'd appreciate more eyes on it.

UndeliverableException thrown within a RxAndroidBle stream

I have a misbehaving BLE device (a temp sensor) that keeps throwing a status 8 (GATT_INSUF_AUTHORIZATION or GATT_CONN_TIMEOUT) exception every time I try to connect to it. I'm not concerned about this exception, as the device is faulty.
However, I keep getting notified by RxJava2 (using RxAndroidBle 1.9.1) that I've not handled the error correctly; see the error below.
This is my code:
rxBleClient
    .getBleDevice(macAddress)
    .establishConnection(false)
    .flatMapSingle { it.readRssi() }
    .subscribe({ "test1:Success" }, { "test1:error" })
and the error:
I/RxBle#GattCallback: MAC='E9:CF:8A:D0:01:19' onConnectionStateChange(), status=8, value=0
D/RxBle#ClientOperationQueue: FINISHED ConnectOperation(147547253) in 10257 ms
D/RxBle#ConnectionOperationQueue: Connection operations queue to be terminated (MAC='E9:CF:8A:D0:01:19')
com.polidea.rxandroidble2.exceptions.BleDisconnectedException: Disconnected from MAC='E9:CF:8A:D0:01:19' with status 8 (GATT_INSUF_AUTHORIZATION or GATT_CONN_TIMEOUT)
at com.polidea.rxandroidble2.internal.connection.RxBleGattCallback$2.onConnectionStateChange(RxBleGattCallback.java:77)
at android.bluetooth.BluetoothGatt$1$4.run(BluetoothGatt.java:249)
at android.bluetooth.BluetoothGatt.runOrQueueCallback(BluetoothGatt.java:725)
at android.bluetooth.BluetoothGatt.-wrap0(Unknown Source:0)
at android.bluetooth.BluetoothGatt$1.onClientConnectionState(BluetoothGatt.java:244)
at android.bluetooth.IBluetoothGattCallback$Stub.onTransact(IBluetoothGattCallback.java:70)
at android.os.Binder.execTransact(Binder.java:697)
D/BleDeviceManagerNew$observeRssiTest: test1:error
E/plication$setupApp: Terminal Exception From RXJAVA was Not handled correctly
io.reactivex.exceptions.UndeliverableException: The exception could not be delivered to the consumer because it has already canceled/disposed the flow or the exception has nowhere to go to begin with. Further reading: https://github.com/ReactiveX/RxJava/wiki/What's-different-in-2.0#error-handling | com.polidea.rxandroidble2.exceptions.BleDisconnectedException: Disconnected from MAC='E9:CF:8A:D0:01:19' with status 8 (GATT_INSUF_AUTHORIZATION or GATT_CONN_TIMEOUT)
at io.reactivex.plugins.RxJavaPlugins.onError(RxJavaPlugins.java:367)
at io.reactivex.internal.operators.observable.ObservableUnsubscribeOn$UnsubscribeObserver.onError(ObservableUnsubscribeOn.java:67)
at io.reactivex.internal.operators.observable.ObservableSubscribeOn$SubscribeOnObserver.onError(ObservableSubscribeOn.java:63)
I'm not sure what else I should do. I've implemented a 'catch-all' solution, but I don't like this approach:
RxJavaPlugins.setErrorHandler { e -> Timber.e(e, "Terminal Exception From RXJAVA was Not handled correctly") }
but I don't see that as a good solution, as I expected to be able to handle the exception on the stream. Any suggestions as to where I went wrong?
Your code is fine. The library has a flaw that does not allow you to achieve your desired behaviour. More on the topic is on the library's wiki page.
While it is possible to design an API that would not throw UndeliverableException, it would need a separate error Observable or Completable for the BluetoothAdapter turning off and a separate one for the RxBleConnection disconnect. The user would be responsible for mixing those into their chain appropriately.
The current API does not allow this.
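If a global handler is unavoidable, it can at least be narrower than the catch-all above, swallowing only undeliverable BLE errors and letting everything else crash. A sketch in Java (it assumes BleException is the common base class of the library's disconnect exceptions; verify against the RxAndroidBle wiki mentioned above):

import com.polidea.rxandroidble2.exceptions.BleException;
import io.reactivex.exceptions.UndeliverableException;
import io.reactivex.plugins.RxJavaPlugins;

public class RxErrorHandlerSetup {

    // Install once, e.g. during application startup.
    static void installRxErrorHandler() {
        RxJavaPlugins.setErrorHandler(throwable -> {
            if (throwable instanceof UndeliverableException
                    && throwable.getCause() instanceof BleException) {
                // Expected: the device disconnected after the stream was already disposed.
                return;
            }
            // Anything else is a genuine bug; rethrow so it is not silently swallowed.
            throw new RuntimeException("Unexpected RxJava error", throwable);
        });
    }
}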

MS-MPI MPI_Barrier: sometimes hangs indefinitely, sometimes doesn't

I'm using the MPI.NET library, and I've recently moved my application to a bigger cluster (more COMPUTE-NODES). I've started seeing various collective functions hang indefinitely, but only sometimes. About half the time a job will complete, the rest of the time it'll hang. I've seen it happen with Scatter, Broadcast, and Barrier.
I've put a MPI.Communicator.world.Barrier() call (MPI.NET) at the start of the application, and created trace logs (using the MPIEXEC.exe /trace switch).
C# code snippet:
static void Main(string[] args)
{
    var hostName = System.Environment.MachineName;
    Logger.Trace($"Program.Main entered on {hostName}");
    string[] mpiArgs = null;
    MPI.Environment myEnvironment = null;
    try
    {
        Logger.Trace($"Trying to instantiate MPI.Environment on {hostName}. Is currently initialized? {MPI.Environment.Initialized}");
        myEnvironment = new MPI.Environment(ref mpiArgs);
        Logger.Trace($"Is currently initialized? {MPI.Environment.Initialized}. {hostName} is waiting at Barrier...");
        Communicator.world.Barrier(); // CODE HANGS HERE!
        Logger.Trace($"{hostName} is past Barrier");
    }
    catch (Exception envEx)
    {
        Logger.Error(envEx, "Could not instantiate MPI.Environment object");
    }
    // rest of implementation here...
}
I can see the msmpi.dll's MPI_Barrier function being called in the log, and I can see messages being sent and received thereafter for a passing and a failing example. For the passing example, messages are sent/received and then the MPI_Barrier function Leave is logged.
For the failing example it looks like one (or more) of the sent messages is lost - it is never received by the target. Am I correct in thinking that messages lost within the MPI_Barrier call will mean that the processes never synchronize, and therefore all get stuck at the Communicator.world.Barrier() call?
What could be causing this to happen intermittently? Could poor network performance between the COMPUTE-NODES be a cause?
I'm running MS HPC Pack 2008 R2, so the version of MS-MPI is pretty old, v2.0.
EDIT - Additional information
If I keep a task running within a single node, this issue does not happen. For example, if I run a task using 8 cores on one node it's fine, but if I use 9 cores across two nodes I'll see this issue ~50% of the time.
Also, we have two clusters in use and this only happens on one of them. They are both virtualized environments, but appear to be set up identically.

Moving from file-based tracing session to real time session

I need to log trace events during boot so I configure an AutoLogger with all the required providers. But when my service/process starts I want to switch to real-time mode so that the file doesn't explode.
I'm using TraceEvent and I can't figure out how to do this move correctly and atomically.
The first thing I tried:
const int timeToWait = 5000;
using (var tes = new TraceEventSession("TEMPSESSIONNAME", @"c:\temp\TEMPSESSIONNAME.etl") { StopOnDispose = false })
{
    tes.EnableProvider(ProviderExtensions.ProviderName<MicrosoftWindowsKernelProcess>());
    Thread.Sleep(timeToWait);
}
using (var tes = new TraceEventSession("TEMPSESSIONNAME", TraceEventSessionOptions.Attach))
{
    Thread.Sleep(timeToWait);
    tes.SetFileName(null);
    Thread.Sleep(timeToWait);
    Console.WriteLine("Done");
}
Here I wanted to make sure that I could transfer the session to real-time mode. But instead, the file I got contained events from a 15 s period instead of just 10 s.
The same happens if I use new TraceEventSession("TEMPSESSIONNAME", @"c:\temp\TEMPSESSIONNAME.etl", TraceEventSessionOptions.Create) instead.
It seems that the following will cause the file to stop being written to:
using (var tes = new TraceEventSession("TEMPSESSIONNAME"))
{
tes.EnableProvider(ProviderExtensions.ProviderName<MicrosoftWindowsKernelProcess>());
Thread.Sleep(timeToWait);
}
But here I must re-enable all the providers, and according to the documentation, "if the session already existed it is closed and reopened (thus orphans are cleaned up on next use)". I don't understand the last part about orphans. Obviously some events might occur in the time between closing, opening and subscribing to the events. Does this mean I will lose these events, or will I get them later?
I also found the following in the documentation of the library:
In real time mode, events are buffered and there is at least a second or so delay (typically 3 sec) between the firing of the event and the reception by the session (to allow events to be delivered in efficient clumps of many events)
Does this make the above code all right (well, unless the improbable happens and for some reason my thread is delayed for more than a second between creating the real-time session and starting to process the events)?
I could close the session and create a new different one but then I think I'd miss some events. Or I could open a new session and then close the file-based one but then I might get duplicate events.
I couldn't find online any examples of moving from a file-based trace to a real-time trace.
I managed to contact the author of TraceEvent and this is the answer I got:
Re the behaviour of the 'auto-closing and restarting' feature: those are really questions about the OS (TraceEvent simply calls the underlying OS API). Just FYI, the deal about orphans is that it is EASY for your process to exit but leave a session going. This MAY be what you want, but often it is not, and so to make the common case 'just work' if you do Create (which is the default), it will close a session if it already existed (since you asked for a new one).
Experimentation of course is the touchstone of 'truth', but frankly, expecting unusual combinations to just work is generally NOT true.
My recommendation is to keep it simple. You need to open a new session and close the original one. Yes, you will end up with duplicates, but you CAN filter them out (after all, they have IDENTICAL timestamps).
The other possibility is to use SetFileName in its intended way (from one file to another). This certainly solves your problem of file size growth, and is often a good way to deal with other scenarios (after all, you can start up your processing and start deleting files even as new files are being generated).

Using `chan pending output` instead of writable fileevent

Yo, I've written a server with a simple protocol: the client sends a line, the server sends a line back in response, repeat. To prevent a client from filling Tcl's output buffer by sending lots of lines but not accepting data back, can I just check chan pending output instead of using the writable fileevent?
proc respond {stream msg} {
    if {[chan pending output $stream] <= 1024} {
        puts $stream $msg
    } else {
        #close $stream
    }
}
For output, chan pending output will correctly describe the number of bytes waiting in the output queue. Normally, that value will be bounded by the -buffersize value that you chan configure (or fconfigure) it to have.
That value will only be exceeded when the channel is non-blocking; with a blocking channel, when the value would go over it, instead there's a blocking write to the underlying device (socket, pipe, file, serial line, whatever) so by the time you could see that it went over, it's back under the limit again.
But if you're using non-blocking channels, you really should use chan event (or fileevent). Luckily, for the actual writes Tcl will do this for you automatically; the single most useful thing you could want from a writable event is already there. In practice, the most common actual use of a writable event is detecting when an async socket connection becomes ready for service.
So what you are doing will work, but you'll have to think carefully about what to do if the output buffer is “getting full”; the idea that a message can need to be delayed is a place where a simple abstraction tends to become leaky. With 8.6's coroutines, you could (probably) do a transparent suspend or something like that, but getting that sort of thing right can take a little thought. (For example, a GUI client might need to show a busy indicator and put things into a state where the user can't enter more requests.)