I have an application that consists of many modules deployed in several JBoss instances. Those modules broadcast their version numbers via JMS to each other.
It works in the following way. Each module periodically broadcasts its versions like this:
camelTemplate.asyncSendBody(moduleVersionsOutputChannel, new ArrayList<>(moduleVersions));
Here's the definition of moduleVersionsOutputChannel:
<endpoint id="moduleVersionsOutputChannel" uri="direct:module.versions.output.channel"/>
Then there's a route that consumes messages from moduleVersionsOutputChannel (referred as source below) and posts the messages into topics:
from(source).multicast()
.parallelProcessing().timeout(moduleVersionsMonitoringConfig.getMulticastTimeout())
.to(moduleVersionsMonitoringConfig.getChannels())
.id(moduleVersionsService.getCurrentModuleId() + "-outbound");
At the beginning everything works fine, but after some time (hours or days) the threads that do multicast are blocked. I observe blocks at 2 different points.
The first block happens in the thread where asyncSendBody is called:
"moduleVersionsBroadcastScheduler-1#57542" prio=5 tid=0x423 nid=NA waiting
java.lang.Thread.State: WAITING
at sun.misc.Unsafe.park(Unsafe.java:-1)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:236)
at org.apache.camel.processor.MulticastProcessor.doProcessParallel(MulticastProcessor.java:328)
at org.apache.camel.processor.MulticastProcessor.process(MulticastProcessor.java:214)
at org.apache.camel.management.InstrumentationProcessor.process(InstrumentationProcessor.java:72)
at org.apache.camel.processor.interceptor.TraceInterceptor.process(TraceInterceptor.java:163)
at org.apache.camel.processor.CamelInternalProcessor.process(CamelInternalProcessor.java:191)
at org.apache.camel.processor.CamelInternalProcessor.process(CamelInternalProcessor.java:191)
at org.apache.camel.component.direct.DirectProducer.process(DirectProducer.java:51)
at org.apache.camel.processor.CamelInternalProcessor.process(CamelInternalProcessor.java:191)
at org.apache.camel.processor.UnitOfWorkProducer.process(UnitOfWorkProducer.java:73)
at org.apache.camel.impl.ProducerCache$2.doInProducer(ProducerCache.java:378)
at org.apache.camel.impl.ProducerCache$2.doInProducer(ProducerCache.java:346)
at org.apache.camel.impl.ProducerCache.doInProducer(ProducerCache.java:242)
at org.apache.camel.impl.ProducerCache.sendExchange(ProducerCache.java:346)
at org.apache.camel.impl.ProducerCache.send(ProducerCache.java:184)
at org.apache.camel.impl.DefaultProducerTemplate.send(DefaultProducerTemplate.java:124)
at org.apache.camel.impl.DefaultProducerTemplate.sendBody(DefaultProducerTemplate.java:137)
at org.apache.camel.impl.DefaultProducerTemplate$14.call(DefaultProducerTemplate.java:621)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor$CallerRunsPolicy.rejectedExecution(ThreadPoolExecutor.java:2025)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:132)
at org.apache.camel.impl.DefaultProducerTemplate.asyncSendBody(DefaultProducerTemplate.java:626)
at ...
Another time the execution blocks at the later point when it actually tries to post a message to JMS topics:
"Camel (moduleVersionsContext) thread #1044 - Multicast#52031" daemon prio=5 tid=0xa16 nid=NA waiting
java.lang.Thread.State: WAITING
at sun.misc.Unsafe.park(Unsafe.java:-1)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
at java.util.concurrent.Semaphore.acquire(Semaphore.java:472)
at org.hornetq.core.client.impl.ClientProducerCreditsImpl.acquireCredits(ClientProducerCreditsImpl.java:90)
at org.hornetq.core.client.impl.ClientProducerImpl.sendRegularMessage(ClientProducerImpl.java:307)
at org.hornetq.core.client.impl.ClientProducerImpl.doSend(ClientProducerImpl.java:288)
at org.hornetq.core.client.impl.ClientProducerImpl.send(ClientProducerImpl.java:140)
at org.hornetq.jms.client.HornetQMessageProducer.doSend(HornetQMessageProducer.java:438)
at org.hornetq.jms.client.HornetQMessageProducer.send(HornetQMessageProducer.java:205)
at org.springframework.jms.core.JmsTemplate.doSend(JmsTemplate.java:589)
at org.apache.camel.component.jms.JmsConfiguration$CamelJmsTemplate.doSend(JmsConfiguration.java:336)
at org.apache.camel.component.jms.JmsConfiguration$CamelJmsTemplate.doSendToDestination(JmsConfiguration.java:275)
at org.apache.camel.component.jms.JmsConfiguration$CamelJmsTemplate.access$100(JmsConfiguration.java:217)
at org.apache.camel.component.jms.JmsConfiguration$CamelJmsTemplate$1.doInJms(JmsConfiguration.java:231)
at org.springframework.jms.core.JmsTemplate.execute(JmsTemplate.java:466)
at org.apache.camel.component.jms.JmsConfiguration$CamelJmsTemplate.send(JmsConfiguration.java:228)
at org.apache.camel.component.jms.JmsProducer.doSend(JmsProducer.java:431)
at org.apache.camel.component.jms.JmsProducer.processInOnly(JmsProducer.java:385)
at org.apache.camel.component.jms.JmsProducer.process(JmsProducer.java:153)
at org.apache.camel.processor.SendProcessor.process(SendProcessor.java:110)
at org.apache.camel.management.InstrumentationProcessor.process(InstrumentationProcessor.java:72)
at org.apache.camel.processor.interceptor.TraceInterceptor.process(TraceInterceptor.java:163)
at org.apache.camel.processor.CamelInternalProcessor.process(CamelInternalProcessor.java:191)
at org.apache.camel.processor.RedeliveryErrorHandler.process(RedeliveryErrorHandler.java:398)
at org.apache.camel.processor.CamelInternalProcessor.process(CamelInternalProcessor.java:191)
at org.apache.camel.util.AsyncProcessorHelper.process(AsyncProcessorHelper.java:105)
at org.apache.camel.processor.MulticastProcessor.doProcessParallel(MulticastProcessor.java:712)
at org.apache.camel.processor.MulticastProcessor.access$200(MulticastProcessor.java:83)
at org.apache.camel.processor.MulticastProcessor$1.call(MulticastProcessor.java:293)
at org.apache.camel.processor.MulticastProcessor$1.call(MulticastProcessor.java:278)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Can you tell me whether I have a bug in my Camel configuration or it's a known Camel issue and there's some workaround?
My environment:
Apache Camel 2.12.3
JBoss 6.1 EAP
OpenJDK 1.7.0_55
Ubuntu Linux 12.04 LTS
I had similar problem. Try to configure the cxf producer endpoint to synchronous. This resolved my issue
http://camel.465427.n5.nabble.com/Intermittent-STUCK-threads-in-Weblogic-Server-due-to-camel-application-td5750624.html
Related
Is it possible to recover automatically from an exception thrown during query execution?
Context: I'm developing a Spark application that reads data from a Kafka topic, processes the data, and outputs to S3. However, after running for a couple of days in production, the spark application faces some network hiccups from S3 that causes an exception to be thrown and stops the application. It's also worth mentioning that this application runs on Kubernetes using GCP's Spark k8s Operator.
From what I've seen so far, these exceptions are minor and a simple restart of the application solves the issue. Can we handle those exceptions and restart the structured streaming query automatically?
Here's an example of a thrown exception:
Exception in thread "main" org.apache.spark.sql.streaming.StreamingQueryException: Job aborted.
=== Streaming Query ===
Identifier: ...
Current Committed Offsets: ...
Current Available Offsets: ...
Current State: ACTIVE
Thread State: RUNNABLE
Logical Plan: ...
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:297)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193)
Caused by: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at io.blahblahView$$anonfun$11$$anonfun$apply$2.apply(View.scala:90)
at io.blahblahView $$anonfun$11$$anonfun$apply$2.apply(View.scala:82)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at io.blahblahView$$anonfun$11.apply(View.scala:82)
at io.blahblahView$$anonfun$11.apply(View.scala:79)
at org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.addBatch(ForeachBatchSink.scala:35)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:537)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:535)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:534)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:160)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:281)
... 1 more
Caused by: java.io.FileNotFoundException: No such file or directory: s3a://.../view/v1/_temporary/0
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:993)
at org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:734)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1517)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1557)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.getAllCommittedTaskPaths(FileOutputCommitter.java:291)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJobInternal(FileOutputCommitter.java:361)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:334)
at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.commitJob(HadoopMapReduceCommitProtocol.scala:166)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:187)
... 47 more
What's the simplest way of taking care of such issues automatically?
After spending too many hours trying to find an elegant fix to this issue, and not finding anything, here's what I came up with.
Some might say it's a hack, but it's simple, it works and solves a complex problem. I tested it in production and it solves the issue of recovering automatically from failure due to an occasional minor exception.
I call it a The Query Watchdog. Here's the simplest version where the watchdog will retry running the query indefinitely:
val writer = df.writeStream...
while (true) {
val query = writer.start()
try {
query.awaitTermination()
}
catch {
case e: StreamingQueryException => println("Streaming Query Exception caught!: " + e);
}
}
Some people might want to replace the while(true) with some kind of counter to limit the number of retries. Someone could also supplement this code and send notifications through slack or email whenever a retry happened. Others could simply collect the number of retries in Prometheus.
Hope it helps,
Cheers
Since you are using the Spark Operator, why not using its restart functionality? If the controller notices that the application has stopped then it will re-submit it automatically.
This will work assuming that the application fails, i.e. the driver pod stops. There are some cases where a driver exception is thrown but the driver pod keeps running without doing anything. In that case the Spark Operator will think that the application is still running ok.
No, there is not in a reliable way to do this. BTW, No is also an answer.
Logic for checking exceptions are generally via try / catch running on the driver.
As unexpected situations at Executor level are already standardly handled by the Spark Framework itself for Structured Streaming, and if the error is non-recoverable, then the App / Job simply crashes after signalling of error(s) back to the driver unless you code try / catch within the various foreachXXX constructs.
That said, it is not clear for the foreachXXX constructs that the micro batch will be recoverable in such an approach afaics, some part of the microbatch is highly likely lost. Hard to test though.
Given that Spark has things standardly catered for that you cannot hook into, why would it be possible to insert a loop or try/catch in the source of the program? Likewise broadcast variables area an issue - although some have techniques around this so they say. But it is not in the spirit of the framework.
So, good question as I wonder(ed) about this (in the past).
Depending on you Spark runtime and environment, an alternative recommended for example in Databricks documentation is to simply let the streaming queries fail so that the retries can be handled at Spark job level.
One of the benefits of this is that it decouples the retry policy and related email notifications from your application.
Running Ignite inside kubernetes environment, but after a while when the load is increased getting below Blocked Thread in Thread dump,
qtp102185114-154-acceptor-2#6275fad4-ServerConnector#6a3aa688{HTTP/1.1}{0.0.0.0:8080} - priority:5 - threadId:0x00007f78a6e1c000 - nativeId:0x84 - state:BLOCKED
stackTrace:
java.lang.Thread.State: BLOCKED (on object monitor)
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:234)
- waiting to lock <0x00000000a049f430> (a java.lang.Object)
at org.eclipse.jetty.server.ServerConnector.accept(ServerConnector.java:377)
at org.eclipse.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.java:500)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
at java.lang.Thread.run(Thread.java:748)
Locked ownable synchronizers:
- None
Can anyone explain what's going wrong here? Given Blocked thread is on every node of a cluster.
It looks like Jetty (which is used for REST API internally in Ignite) doesn't have enough threads in the pool to accept all incoming connection when loading increases. I would suggest checking CPU utilization firstly and it is not 100% then you are able to increase threads count in Jetty XML configuration
we use a JBoss cluster (EAP 6.4.10) with 6 instances and heavily use the bundled Infinispan 5.2.11 for different in-memory-grid use cases. Most of them are distributed caches, however (replicated, actually). We also have distributed transactions and JMS bound to transactional borders in the mix. Backend is SQL Server 2016 Enterprise, and even here transactions could span multiple SQL server instances and databases (DTC).
Synchronization of Infinispan is done with UDP multicast using JGroups 3.2.13
From time to time, and especially under or after heavy load, we face the problem that JBoss worker threads pile up at specific Infinispan internal locks that seemingly never get released. Thus, the http connection pool starvates, open transactions on the databases don't get committed or rolled back; JMS messages get lost; we face a heap of locks on the database that affect all other systems connecting to the same databases (which are the other JBoss instances in the cluster).
Currently we watch the http thread pools and as soon as threads start to pile up for a certain amount of time, the instance is removed from the load balancer and shut down.
Sometimes this is unnecessary, i.e. the cluster heals itself and the culprit instance starts to behave normally again without manual intervention.
However, usually the only way is manually restarting the instance and keeping fingers crossed that the correct infinispan cluster forms again.
Stack traces from hanging instances always show several interesting features:
One group of threads (http workers) pile up on a condition deep inside JGroup's flow control implementation; the application's last call in those cases usually is some operation on a distributed cache (e.g. remove()):
http-threads - 178 awaiting notification on [ 0x0000000374059958 ]
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2163)
at org.jgroups.util.CreditMap.decrement(CreditMap.java:157)
at org.jgroups.protocols.MFC.handleDownMessage(MFC.java:104)
at org.jgroups.protocols.FlowControl.down(FlowControl.java:341)
at org.jgroups.protocols.FRAG2.down(FRAG2.java:148)
at org.jgroups.protocols.RSVP.down(RSVP.java:143)
at org.jgroups.stack.ProtocolStack.down(ProtocolStack.java:1030)
at org.jgroups.JChannel.down(JChannel.java:722)
at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.down(MessageDispatcher.java:618)
at org.jgroups.blocks.RequestCorrelator.sendRequest(RequestCorrelator.java:174)
at org.jgroups.blocks.GroupRequest.sendRequest(GroupRequest.java:360)
at org.jgroups.blocks.GroupRequest.sendRequest(GroupRequest.java:103)
at org.jgroups.blocks.Request.execute(Request.java:83)
at org.jgroups.blocks.MessageDispatcher.cast(MessageDispatcher.java:337)
at org.jgroups.blocks.MessageDispatcher.castMessage(MessageDispatcher.java:249)
at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.processCalls(CommandAwareRpcDispatcher.java:333)
at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.invokeRemoteCommands(CommandAwareRpcDispatcher.java:146)
at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.broadcastRemoteCommands(CommandAwareRpcDispatcher.java:197)
at org.infinispan.remoting.transport.jgroups.JGroupsTransport.invokeRemotely(JGroupsTransport.java:498)
at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:173)
at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:194)
at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:251)
at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:238)
at org.infinispan.remoting.rpc.RpcManagerImpl.invokeRemotely(RpcManagerImpl.java:233)
at org.infinispan.remoting.rpc.RpcManagerImpl.broadcastRpcCommand(RpcManagerImpl.java:212)
at org.infinispan.remoting.rpc.RpcManagerImpl.broadcastRpcCommand(RpcManagerImpl.java:204)
at org.infinispan.interceptors.ReplicationInterceptor.handleCrudMethod(ReplicationInterceptor.java:307)
at org.infinispan.interceptors.ReplicationInterceptor.visitRemoveCommand(ReplicationInterceptor.java:269)
at org.infinispan.commands.write.RemoveCommand.acceptVisitor(RemoveCommand.java:73)
at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:120)
at org.infinispan.interceptors.EntryWrappingInterceptor.invokeNextAndApplyChanges(EntryWrappingInterceptor.java:302)
at org.infinispan.interceptors.EntryWrappingInterceptor.visitRemoveCommand(EntryWrappingInterceptor.java:207)
at org.infinispan.commands.write.RemoveCommand.acceptVisitor(RemoveCommand.java:73)
at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:120)
at org.infinispan.interceptors.locking.NonTransactionalLockingInterceptor.visitRemoveCommand(NonTransactionalLockingInterceptor.java:124)
at org.infinispan.commands.write.RemoveCommand.acceptVisitor(RemoveCommand.java:73)
at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:120)
at org.infinispan.interceptors.base.CommandInterceptor.handleDefault(CommandInterceptor.java:134)
at org.infinispan.commands.AbstractVisitor.visitRemoveCommand(AbstractVisitor.java:67)
at org.infinispan.commands.write.RemoveCommand.acceptVisitor(RemoveCommand.java:73)
at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:120)
at org.infinispan.interceptors.base.CommandInterceptor.handleDefault(CommandInterceptor.java:134)
at org.infinispan.commands.AbstractVisitor.visitRemoveCommand(AbstractVisitor.java:67)
at org.infinispan.commands.write.RemoveCommand.acceptVisitor(RemoveCommand.java:73)
at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:120)
at org.infinispan.statetransfer.StateTransferInterceptor.handleTopologyAffectedCommand(StateTransferInterceptor.java:284)
at org.infinispan.statetransfer.StateTransferInterceptor.handleNonTxWriteCommand(StateTransferInterceptor.java:222)
at org.infinispan.statetransfer.StateTransferInterceptor.visitRemoveCommand(StateTransferInterceptor.java:171)
at org.infinispan.commands.write.RemoveCommand.acceptVisitor(RemoveCommand.java:73)
at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:120)
at org.infinispan.interceptors.CacheMgmtInterceptor.visitRemoveCommand(CacheMgmtInterceptor.java:137)
at org.infinispan.commands.write.RemoveCommand.acceptVisitor(RemoveCommand.java:73)
at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:120)
at org.infinispan.interceptors.InvocationContextInterceptor.handleAll(InvocationContextInterceptor.java:128)
at org.infinispan.interceptors.InvocationContextInterceptor.handleDefault(InvocationContextInterceptor.java:92)
at org.infinispan.commands.AbstractVisitor.visitRemoveCommand(AbstractVisitor.java:67)
at org.infinispan.commands.write.RemoveCommand.acceptVisitor(RemoveCommand.java:73)
at org.infinispan.interceptors.InterceptorChain.invoke(InterceptorChain.java:343)
at org.infinispan.CacheImpl.executeCommandAndCommitIfNeeded(CacheImpl.java:1186)
at org.infinispan.CacheImpl.removeInternal(CacheImpl.java:314)
at org.infinispan.CacheImpl.remove(CacheImpl.java:308)
at org.infinispan.CacheImpl.remove(CacheImpl.java:302)
at org.infinispan.AbstractDelegatingCache.remove(AbstractDelegatingCache.java:313)
at com.company.project.information.AuthenticationService.processChallengeImpl(AuthenticationService.java:155)
at com.company.project.information.AuthenticationService.processChallengeForLogin(AuthenticationService.java:132)
at com.company.project.information.AuthenticationService.respondChallenge(AuthenticationService.java:418)
at com.company.project.security.auth.ServerLoginModule.verify(ServerLoginModule.java:280)
at com.company.project.security.auth.ServerLoginModule._login(ServerLoginModule.java:166)
at com.company.project.security.auth.ServerLoginModule.login(ServerLoginModule.java:87)
at sun.reflect.GeneratedMethodAccessor487.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755)
at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195)
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682)
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
at javax.security.auth.login.LoginContext.login(LoginContext.java:587)
at org.jboss.security.authentication.JBossCachedAuthenticationManager.defaultLogin(JBossCachedAuthenticationManager.java:399)
at org.jboss.security.authentication.JBossCachedAuthenticationManager.proceedWithJaasLogin(JBossCachedAuthenticationManager.java:338)
at org.jboss.security.authentication.JBossCachedAuthenticationManager.authenticate(JBossCachedAuthenticationManager.java:326)
at org.jboss.security.authentication.JBossCachedAuthenticationManager.isValid(JBossCachedAuthenticationManager.java:142)
at org.jboss.as.security.service.SimpleSecurityManager.authenticate(SimpleSecurityManager.java:418)
at org.jboss.as.security.service.SimpleSecurityManager.authenticate(SimpleSecurityManager.java:377)
at org.jboss.as.ejb3.security.SecurityContextInterceptor$1.run(SecurityContextInterceptor.java:54)
at org.jboss.as.ejb3.security.SecurityContextInterceptor$1.run(SecurityContextInterceptor.java:48)
at org.jboss.as.ejb3.security.SecurityContextInterceptor.processInvocation(SecurityContextInterceptor.java:86)
at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
at org.jboss.as.ejb3.deployment.processors.StartupAwaitInterceptor.processInvocation(StartupAwaitInterceptor.java:22)
at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
at org.jboss.as.ejb3.component.interceptors.ShutDownInterceptorFactory$1.processInvocation(ShutDownInterceptorFactory.java:64)
at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
at org.jboss.as.ejb3.component.interceptors.LoggingInterceptor.processInvocation(LoggingInterceptor.java:59)
at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
at org.jboss.as.ee.component.NamespaceContextInterceptor.processInvocation(NamespaceContextInterceptor.java:50)
at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
at org.jboss.as.ejb3.component.interceptors.AdditionalSetupInterceptor.processInvocation(AdditionalSetupInterceptor.java:55)
at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
at org.jboss.as.ee.component.TCCLInterceptor.processInvocation(TCCLInterceptor.java:45)
at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:288)
at org.jboss.invocation.ChainedInterceptor.processInvocation(ChainedInterceptor.java:61)
at org.jboss.as.ee.component.ViewService$View.invoke(ViewService.java:189)
at org.jboss.as.ejb3.remote.LocalEjbReceiver.processInvocation(LocalEjbReceiver.java:271)
at org.jboss.ejb.client.EJBClientInvocationContext.sendRequest(EJBClientInvocationContext.java:184)
at org.jboss.ejb.client.EJBObjectInterceptor.handleInvocation(EJBObjectInterceptor.java:58)
at org.jboss.ejb.client.EJBClientInvocationContext.sendRequest(EJBClientInvocationContext.java:186)
at org.jboss.ejb.client.EJBHomeInterceptor.handleInvocation(EJBHomeInterceptor.java:83)
at org.jboss.ejb.client.EJBClientInvocationContext.sendRequest(EJBClientInvocationContext.java:186)
at org.jboss.ejb.client.TransactionInterceptor.handleInvocation(TransactionInterceptor.java:42)
at org.jboss.ejb.client.EJBClientInvocationContext.sendRequest(EJBClientInvocationContext.java:186)
at org.jboss.ejb.client.ReceiverInterceptor.handleInvocation(ReceiverInterceptor.java:125)
at org.jboss.ejb.client.EJBClientInvocationContext.sendRequest(EJBClientInvocationContext.java:186)
at com.company.project.module.service.CallIdPropagator$1.handleInvocation(CallIdPropagator.java:60)
at org.jboss.ejb.client.EJBClientInvocationContext.sendRequest(EJBClientInvocationContext.java:186)
at org.jboss.ejb.client.EJBInvocationHandler.sendRequestWithPossibleRetries(EJBInvocationHandler.java:255)
at org.jboss.ejb.client.EJBInvocationHandler.doInvoke(EJBInvocationHandler.java:200)
at org.jboss.ejb.client.EJBInvocationHandler.doInvoke(EJBInvocationHandler.java:183)
at org.jboss.ejb.client.EJBInvocationHandler.invoke(EJBInvocationHandler.java:146)
at com.sun.proxy.$Proxy653.hasOneOfThisNetworkFeatures(
Unknown Source)
at sun.reflect.GeneratedMethodAccessor773.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.cgm.life.clientservices.Invoker.doPost(Invoker.java:129)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:754)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:847)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:295)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:214)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:231)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:149)
at org.jboss.as.jpa.interceptor.WebNonTxEmCloserValve.invoke(WebNonTxEmCloserValve.java:50)
at org.jboss.as.jpa.interceptor.WebNonTxEmCloserValve.invoke(WebNonTxEmCloserValve.java:50)
at org.jboss.as.web.security.SecurityContextAssociationValve.invoke(SecurityContextAssociationValve.java:169)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:150)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:97)
at org.jboss.web.rewrite.RewriteValve.invoke(RewriteValve.java:466)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:102)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:344)
at org.apache.coyote.ajp.AjpAprProcessor.process(AjpAprProcessor.java:475)
at org.apache.coyote.ajp.AjpAprProtocol$AjpConnectionHandler.process(AjpAprProtocol.java:454)
at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.run(AprEndpoint.java:2562)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
at org.jboss.threads.JBossThread.run(JBossThread.java:122)
(full stack trace just to give you the whole picture). The condition in this example (0x0000000374059958) blocks hundrets of threads with differing stack traces; all of them share the call to some distributed cache, though.
Another group of threads (seemingly UDP protocol handlers, of which we have a whole bunch) also are awaiting notification on an object. The context seems to be something with state transfer in the Infinispan cluster:
Incoming-8,shared=udp awaiting notification on [ 0x00000003810b5188 ]
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:502)
at org.infinispan.statetransfer.StateTransferLockImpl.waitForTransactionData(StateTransferLockImpl.java:100)
at org.infinispan.statetransfer.StateTransferInterceptor.updateTopologyIdAndWaitForTransactionData(StateTransferInterceptor.java:311)
at org.infinispan.statetransfer.StateTransferInterceptor.handleTopologyAffectedCommand(StateTransferInterceptor.java:281)
at org.infinispan.statetransfer.StateTransferInterceptor.handleNonTxWriteCommand(StateTransferInterceptor.java:222)
at org.infinispan.statetransfer.StateTransferInterceptor.visitPutKeyValueCommand(StateTransferInterceptor.java:156)
at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:82)
at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:120)
at org.infinispan.interceptors.InvocationContextInterceptor.handleAll(InvocationContextInterceptor.java:128)
at org.infinispan.interceptors.InvocationContextInterceptor.handleDefault(InvocationContextInterceptor.java:92)
at org.infinispan.commands.AbstractVisitor.visitPutKeyValueCommand(AbstractVisitor.java:62)
at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:82)
at org.infinispan.interceptors.InterceptorChain.invoke(InterceptorChain.java:343)
at org.infinispan.commands.remote.BaseRpcInvokingCommand.processVisitableCommand(BaseRpcInvokingCommand.java:61)
at org.infinispan.commands.remote.SingleRpcCommand.perform(SingleRpcCommand.java:70)
at org.infinispan.remoting.InboundInvocationHandlerImpl.handleInternal(InboundInvocationHandlerImpl.java:100)
at org.infinispan.remoting.InboundInvocationHandlerImpl.handleWithWaitForBlocks(InboundInvocationHandlerImpl.java:121)
at org.infinispan.remoting.InboundInvocationHandlerImpl.handle(InboundInvocationHandlerImpl.java:85)
at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.executeCommandFromLocalCluster(CommandAwareRpcDispatcher.java:247)
at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.handle(CommandAwareRpcDispatcher.java:220)
at org.jgroups.blocks.RequestCorrelator.handleRequest(RequestCorrelator.java:484)
at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:391)
at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:249)
at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:600)
at org.jgroups.blocks.mux.MuxUpHandler.up(MuxUpHandler.java:130)
at org.jgroups.JChannel.up(JChannel.java:707)
at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1025)
at org.jgroups.protocols.RSVP.up(RSVP.java:188)
at org.jgroups.protocols.FRAG2.up(FRAG2.java:182)
at org.jgroups.protocols.FlowControl.up(FlowControl.java:400)
at org.jgroups.protocols.FlowControl.up(FlowControl.java:418)
at org.jgroups.protocols.pbcast.GMS.up(GMS.java:897)
at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:247)
at org.jgroups.protocols.UNICAST2.up(UNICAST2.java:453)
at org.jgroups.protocols.pbcast.NAKACK.handleMessage(NAKACK.java:793)
at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:609)
at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:147)
at org.jgroups.protocols.FD.up(FD.java:253)
at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:288)
at org.jgroups.protocols.MERGE3.up(MERGE3.java:290)
at org.jgroups.protocols.Discovery.up(Discovery.java:359)
at org.jgroups.protocols.TP$ProtocolAdapter.up(TP.java:2616)
at org.jgroups.protocols.TP.passMessageUp(TP.java:1269)
at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1831)
at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1804)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Only 4 threads are waiting on this notification.
However, there are even more threads (like, hundrets) waiting on different objects, but at the same position in the code:
OOB-4950,shared=udp awaiting notification on [ 0x000000038bccd2b0 ]
OOB-4951,shared=udp awaiting notification on [ 0x00000003bacf4930 ]
OOB-4952,shared=udp awaiting notification on [ 0x000000038bccd2b0 ]
OOB-4953,shared=udp awaiting notification on [ 0x000000038bccd2b0 ]
OOB-4954,shared=udp awaiting notification on [ 0x00000003bacf4930 ]
OOB-4956,shared=udp awaiting notification on [ 0x000000038bccd2b0 ]
OOB-4958,shared=udp awaiting notification on [ 0x000000038bccd2b0 ]
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:502)
at org.infinispan.statetransfer.StateTransferLockImpl.waitForTransactionData(StateTransferLockImpl.java:100)
at org.infinispan.statetransfer.StateTransferInterceptor.updateTopologyIdAndWaitForTransactionData(StateTransferInterceptor.java:311)
at org.infinispan.statetransfer.StateTransferInterceptor.handleTopologyAffectedCommand(StateTransferInterceptor.java:281)
at org.infinispan.statetransfer.StateTransferInterceptor.handleNonTxWriteCommand(StateTransferInterceptor.java:222)
at org.infinispan.statetransfer.StateTransferInterceptor.visitRemoveCommand(StateTransferInterceptor.java:171)
at org.infinispan.commands.write.RemoveCommand.acceptVisitor(RemoveCommand.java:73)
at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:120)
at org.infinispan.interceptors.CacheMgmtInterceptor.visitRemoveCommand(CacheMgmtInterceptor.java:137)
at org.infinispan.commands.write.RemoveCommand.acceptVisitor(RemoveCommand.java:73)
at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:120)
at org.infinispan.interceptors.InvocationContextInterceptor.handleAll(InvocationContextInterceptor.java:128)
at org.infinispan.interceptors.InvocationContextInterceptor.handleDefault(InvocationContextInterceptor.java:92)
at org.infinispan.commands.AbstractVisitor.visitRemoveCommand(AbstractVisitor.java:67)
at org.infinispan.commands.write.RemoveCommand.acceptVisitor(RemoveCommand.java:73)
at org.infinispan.interceptors.InterceptorChain.invoke(InterceptorChain.java:343)
at org.infinispan.commands.remote.BaseRpcInvokingCommand.processVisitableCommand(BaseRpcInvokingCommand.java:61)
at org.infinispan.commands.remote.SingleRpcCommand.perform(SingleRpcCommand.java:70)
at org.infinispan.remoting.InboundInvocationHandlerImpl.handleInternal(InboundInvocationHandlerImpl.java:100)
at org.infinispan.remoting.InboundInvocationHandlerImpl.handleWithWaitForBlocks(InboundInvocationHandlerImpl.java:121)
at org.infinispan.remoting.InboundInvocationHandlerImpl.handle(InboundInvocationHandlerImpl.java:85)
at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.executeCommandFromLocalCluster(CommandAwareRpcDispatcher.java:247)
at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.handle(CommandAwareRpcDispatcher.java:220)
at org.jgroups.blocks.RequestCorrelator.handleRequest(RequestCorrelator.java:484)
at org.jgroups.blocks.RequestCorrelator.receiveMessage(RequestCorrelator.java:391)
at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:249)
at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:600)
at org.jgroups.blocks.mux.MuxUpHandler.up(MuxUpHandler.java:130)
at org.jgroups.JChannel.up(JChannel.java:707)
at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:1025)
at org.jgroups.protocols.RSVP.up(RSVP.java:188)
at org.jgroups.protocols.FRAG2.up(FRAG2.java:182)
at org.jgroups.protocols.FlowControl.up(FlowControl.java:400)
at org.jgroups.protocols.FlowControl.up(FlowControl.java:418)
at org.jgroups.protocols.pbcast.GMS.up(GMS.java:897)
at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:247)
at org.jgroups.protocols.UNICAST2.up(UNICAST2.java:453)
at org.jgroups.protocols.pbcast.NAKACK.handleMessage(NAKACK.java:751)
at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:609)
at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:147)
at org.jgroups.protocols.FD.up(FD.java:253)
at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:288)
at org.jgroups.protocols.MERGE3.up(MERGE3.java:290)
at org.jgroups.protocols.Discovery.up(Discovery.java:359)
at org.jgroups.protocols.TP$ProtocolAdapter.up(TP.java:2616)
at org.jgroups.protocols.TP.passMessageUp(TP.java:1269)
at org.jgroups.protocols.TP$IncomingPacket.handleMyMessage(TP.java:1831)
at org.jgroups.protocols.TP$IncomingPacket.run(TP.java:1804)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
There're also threads waiting for a lock that is held by a thread waiting for the condition in (1):
Transaction Reaper Worker 9 waiting to acquire [ 0x0000000507058c88 ]
at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.afterCompletion(TwoPhaseCoordinator.java:356
at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.afterCompletion(TwoPhaseCoordinator.java:334
at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.cancel(TwoPhaseCoordinator.java:120)
at com.arjuna.ats.arjuna.AtomicAction.cancel(AtomicAction.java:215)
at com.arjuna.ats.arjuna.coordinator.TransactionReaper.doCancellations(TransactionReaper.java:370)
at com.arjuna.ats.internal.arjuna.coordinator.ReaperWorkerThread.run(ReaperWorkerThread.java:78)
One of the http threads holds 0x0000000507058c88 that in turn waits for 0x0000000374059958
That's about it with respect to what I can read from the system. Something is seriously stuck inside, but I have no idea how to analyze or troubleshoot, let alone solve the problems. (We tried updating to EAP 6.4.20, but the same behavior could be observed here, so we rolled this update back due to other, newer, shinier problems that occured after the update).
I'll gladly provide more information (like specific config params) that can help to diagnose the problem.
Thanks for anyone helping out!
Best,
Markus
Turns out it makes a lot of sense to monitor your GC log for interesting events... x_X
The situation was that we suffered Old-Gen fragmentation with CMS, and the Java Virtual Machine periodically decided to pause the application server for up to 45 seconds to do defragmentation.
We switched to G1 and these pauses along with Infinispan-desynchronizations vanished as well. This setup is now running quite stable for the last couple of months.
In a long run I am seeing massive number of threads getting piled up by MongoDB Java driver (v3.0.3). All these threads are server monitoring threads, all parked waiting:
cluster-ClusterId{value='562233d1b26c940820028340', description='null'}-192.168.0.2:27017
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source)
com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.waitForSignalOrTimeout(DefaultServerMonitor.java:237)
com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.waitForNext(DefaultServerMonitor.java:218)
com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.run(DefaultServerMonitor.java:167)
java.lang.Thread.run(Unknown Source)
Right now there are about 250 of them. I don't think that many threads are needed to monitor a connection to a single database host. I am obviously doing something wrong.., but as far as I can tell we didn't do any setting changes when moved from driver v2 to v3. Could be a bug in driver? Any ideas?
This issue has been fixed in 3.2.2.
https://jira.mongodb.org/browse/JAVA-2074
I often find spark fails with large jobs with a rather unhelpful meaningless exception. The worker logs look normal, no errors, but they get state "KILLED". This is extremely common for large shuffles, so operations like .distinct.
The question is, how do I diagnose what's going wrong, and ideally, how do I fix it?
Given that a lot of these operations are monoidal I've been working around the problem by splitting the data into, say 10, chunks, running the app on each chunk, then running the app on all of the resulting outputs. In other words - meta-map-reduce.
14/06/04 12:56:09 ERROR client.AppClient$ClientActor: Master removed our application: FAILED; stopping client
14/06/04 12:56:09 WARN cluster.SparkDeploySchedulerBackend: Disconnected from Spark cluster! Waiting for reconnection...
14/06/04 12:56:09 WARN scheduler.TaskSetManager: Loss was due to java.io.IOException
java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:779)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:840)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:159)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:143)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:164)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:149)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:27)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:176)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:257)
at scala.collection.AbstractIterator.toList(Iterator.scala:1157)
at $line5.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:13)
at $line5.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:13)
at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:450)
at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:450)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
at org.apache.spark.scheduler.Task.run(Task.scala:53)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
As of September 1st 2014, this is an "open improvement" in Spark. Please see https://issues.apache.org/jira/browse/SPARK-3052. As syrza pointed out in the given link, the shutdown hooks are likely done in incorrect order when an executor failed which results in this message. I understand you will have to little more investigation to figure out the main cause of problem (i.e. why your executor failed). If it is a large shuffle, it might be an out-of-memory error which cause executor failure which then caused the Hadoop Filesystem to be closed in their shutdown hook. So, the RecordReaders in running tasks of that executor throw "java.io.IOException: Filesystem closed" exception. I guess it will be fixed in subsequent release and then you will get more helpful error message :)
Something calls DFSClient.close() or DFSClient.abort(), closing the client. The next file operation then results in the above exception.
I would try to figure out what calls close()/abort(). You could use a breakpoint in your debugger, or modify the Hadoop source code to throw an exception in these methods, so you would get a stack trace.
The exception about “file system closed” can be solved if the spark job is running on a cluster. You can set properties like spark.executor.cores , spark.driver.cores and spark.akka.threads to the maximum values w.r.t your resource availability. I had the same problem when my dataset was pretty large with JSON data about 20 million records. I fixed it with the above properties and it ran like a charm. In my case, I set those properties to 25,25 and 20 respectively. Hope it helps!!
Reference Link:
http://spark.apache.org/docs/latest/configuration.html