Artemis shutting down with AMQ222010: Critical IO Error, shutting down the server [ask for clarification] - activemq-artemis

Does the fix mentioned here address an error that was introduced in 2.16, or does it fix a long-existing problem?
For a long time we have had periodic ActiveMQ Artemis stoppages with a similar thread dump + shutdown combination, starting with:
09:24:10,086 WARN [org.apache.activemq.artemis.core.server] AMQ222010: Critical IO Error, shutting down the server. file=AIOSequentialFile:/home/jms-artemis/jms-broker/./data/journal/activemq-data-267599.amq, message=Timeout on close: java.io.IOException: Timeout on close
at org.apache.activemq.artemis.core.io.aio.AIOSequentialFile.close(AIOSequentialFile.java:126) [artemis-journal-2.4.0.jar:2.4.0]
at org.apache.activemq.artemis.core.io.aio.AIOSequentialFile.close(AIOSequentialFile.java:103) [artemis-journal-2.4.0.jar:2.4.0]
at org.apache.activemq.artemis.core.journal.impl.JournalFilesRepository.closeFile(JournalFilesRepository.java:481) [artemis-journal-2.4.0.jar:2.4.0]
at org.apache.activemq.artemis.core.journal.impl.JournalImpl.moveNextFile(JournalImpl.java:3021) [artemis-journal-2.4.0.jar:2.4.0]
at org.apache.activemq.artemis.core.journal.impl.JournalImpl.switchFileIfNecessary(JournalImpl.java:2960) [artemis-journal-2.4.0.jar:2.4.0]
at org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendRecord(JournalImpl.java:2679) [artemis-journal-2.4.0.jar:2.4.0]
at org.apache.activemq.artemis.core.journal.impl.JournalImpl.access$200(JournalImpl.java:90) [artemis-journal-2.4.0.jar:2.4.0]
at org.apache.activemq.artemis.core.journal.impl.JournalImpl$6.run(JournalImpl.java:1063) [artemis-journal-2.4.0.jar:2.4.0]
at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42) [artemis-commons-2.4.0.jar:2.4.0]
at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31) [artemis-commons-2.4.0.jar:2.4.0]
at org.apache.activemq.artemis.utils.actors.ProcessorBase$ExecutorTask.run(ProcessorBase.java:53) [artemis-commons-2.4.0.jar:2.4.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_211]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_211]
at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_211]
09:24:10,101 WARN [org.apache.activemq.artemis.journal] waiting pending callbacks on activemq-data-267599.amq from 10 seconds!
<...>
And for a long time the only similar problem/resolution we could find was this post, which states that the solution is:
Check disk storage health on a hardware level in order to avoid such issues in the future.
That was not our case, of course, but we could not figure out what the core of the problem was, so we just adjusted to the periodic downtime. A few days ago, however, we noticed a very similar question where the advice was to move to 2.17.0. The referenced issue, though, states that it affects version 2.16.0.
We have no failback configuration, and we are using version 2.4.0.
The diff is really hard to follow for an outsider, and the distance between 2.4.0 and 2.16.0 is considerable, so the question is: does 2.17.0 fix a similar but new problem introduced in 2.16, or has the problem been lurking for a long time, so that moving from 2.4 to 2.17 should finally resolve it?

The "Affects Version" on ARTEMIS-3084 is 2.16.0, but that's just the most recent version that exhibited the problem. I'm confident the issue was present in previous versions. If you're experiencing this issue I recommend you move to 2.17.0.

Related

Bind exception while running FlumeEventCount

I am trying out the Spark Streaming examples with Apache Flume.
For installation, I have just unzipped spark-1.1.0-bin-hadoop1.tar.gz and apache-flume-1.4.0-bin.tar.gz to run the Spark Streaming examples. Is this the correct way, or is there another way? If so, please let me know.
With the above steps I have tried executing the examples, but they throw a bind exception. Can someone help me with this issue?
My Flume agent is running properly and is able to push data to the relevant port.
The error I am facing is shown below:
14/11/07 23:19:23 ERROR scheduler.ReceiverTracker: Deregistered receiver for stream 0: Error starting receiver 0 - org.jboss.netty.channel.ChannelException:
Failed to bind to: /172.29.17.178:65001
at org.apache.avro.ipc.NettyServer.<init>(NettyServer.java:106)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
Caused by: java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:344)
at sun.nio.ch.Net.bind(Net.java:336)
Even after killing the process that was using that specific port, I am still facing the same issue.
Can someone help me resolve it?
That port is already in use.
From your log:
"Caused by: java.net.BindException: Address already in use"
Use another one.
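If you want to confirm that before changing the receiver port, a quick probe like the one below (my own illustration, not from the answer) fails with the same BindException while anything else is still bound to that address:
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class PortCheck {
    public static void main(String[] args) throws IOException {
        // Try to bind the same address the Flume receiver wants; a BindException here
        // means some other process still owns 172.29.17.178:65001.
        try (ServerSocket probe = new ServerSocket()) {
            probe.bind(new InetSocketAddress("172.29.17.178", 65001));
            System.out.println("Port 65001 is free on this interface");
        }
    }
}
Either stop whatever holds the port or pick one that the probe reports as free.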

IllegalStateException: Committed ; Committed before 500 org.eclipse.jetty.io.EofException

I am getting the error below when I check the Jetty logs.
From posts on other sites, it seems that the response from the Solaris server was too slow for the client, and Jetty is trying to send a response back to someone who is no longer listening for one.
Please let me know if anybody has any idea regarding this. Thanks in advance.
2014-08-01 10:10:08.377:WARN:oejs.Response:qtp136272369-10006: Committed before 500 org.eclipse.jetty.io.EofException
2014-08-01 10:10:08.378:WARN:oejs.ServletHandler:qtp136272369-10006: /estore-web/dec/ret/admin/session/getSessionObject/16019
java.lang.IllegalStateException: Committed
at org.eclipse.jetty.server.Response.resetBuffer(Response.java:1004)
at org.eclipse.jetty.server.Response.sendError(Response.java:353)
at org.jboss.resteasy.plugins.server.servlet.HttpServletResponseWrapper.sendError(HttpServletResponseWrapper.java:71)
at org.jboss.resteasy.core.SynchronousDispatcher.handleFailure(SynchronousDispatcher.java:262)
at org.jboss.resteasy.core.SynchronousDispatcher.handleWriterException(SynchronousDispatcher.java:379)
at org.jboss.resteasy.core.SynchronousDispatcher.handleException(SynchronousDispatcher.java:218)
at org.jboss.resteasy.core.SynchronousDispatcher.handleWriteResponseException(SynchronousDispatcher.java:203)
at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:504)
at org.jboss.resteasy.core.SynchronousDispatcher.invoke(SynchronousDispatcher.java:119)
at org.jboss.resteasy.plugins.server.servlet.ServletContainerDispatcher.service(ServletContainerDispatcher.java:208)
at org.jboss.resteasy.plugins.server.servlet.HttpServletDispatcher.service(HttpServletDispatcher.java:55)
at org.jboss.resteasy.plugins.server.servlet.HttpServletDispatcher.service(HttpServletDispatcher.java:50)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:848)
at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:696)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1515)
at com.tme.user.management.security.filter.BearerTokenAuthenticatingFilter.doFilter(BearerTokenAuthenticatingFilter.java:151)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1495)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:519)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1097)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:448)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1031)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:200)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
at org.eclipse.jetty.server.Server.handle(Server.java:446)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:271)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:246)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532)
at java.lang.Thread.run(Thread.java:745)
Theoretically, it means that an existing connection was forcibly closed by the remote host. Some possible reasons:
Bad network connection
Host shutdown
Time-outs
Bottle-neck with data overload
Intranet: Changed ip without changing intranet DNS
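Whatever the reason, once the client has gone away Jetty raises an EofException while writing, and the secondary "IllegalStateException: Committed" comes from then trying to send a 500 on a response whose headers have already been flushed. As a rough, hypothetical sketch (my own, not from the original posts), the usual defensive pattern is to check isCommitted() before attempting sendError():
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

public class SafeErrorFilter implements Filter {
    @Override public void init(FilterConfig config) {}
    @Override public void destroy() {}

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        try {
            chain.doFilter(req, res);
        } catch (IOException e) { // e.g. Jetty's EofException: the client went away mid-response
            HttpServletResponse response = (HttpServletResponse) res;
            if (!response.isCommitted()) {
                // Only safe while nothing has been flushed; otherwise sendError()
                // throws the very "IllegalStateException: Committed" seen above.
                response.sendError(HttpServletResponse.SC_INTERNAL_SERVER_ERROR);
            }
        }
    }
}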
I am having the same exception. What were you trying to do when it was thrown?

Apache Camel blocking on multicast

I have an application that consists of many modules deployed in several JBoss instances. Those modules broadcast their version numbers via JMS to each other.
It works in the following way. Each module periodically broadcasts its versions like this:
camelTemplate.asyncSendBody(moduleVersionsOutputChannel, new ArrayList<>(moduleVersions));
Here's the definition of moduleVersionsOutputChannel:
<endpoint id="moduleVersionsOutputChannel" uri="direct:module.versions.output.channel"/>
Then there's a route that consumes messages from moduleVersionsOutputChannel (referred to as source below) and posts the messages to topics:
from(source).multicast()
.parallelProcessing().timeout(moduleVersionsMonitoringConfig.getMulticastTimeout())
.to(moduleVersionsMonitoringConfig.getChannels())
.id(moduleVersionsService.getCurrentModuleId() + "-outbound");
At the beginning everything works fine, but after some time (hours or days) the threads that perform the multicast become blocked. I observe blocking at two different points.
The first block happens in the thread where asyncSendBody is called:
"moduleVersionsBroadcastScheduler-1#57542" prio=5 tid=0x423 nid=NA waiting
java.lang.Thread.State: WAITING
at sun.misc.Unsafe.park(Unsafe.java:-1)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:236)
at org.apache.camel.processor.MulticastProcessor.doProcessParallel(MulticastProcessor.java:328)
at org.apache.camel.processor.MulticastProcessor.process(MulticastProcessor.java:214)
at org.apache.camel.management.InstrumentationProcessor.process(InstrumentationProcessor.java:72)
at org.apache.camel.processor.interceptor.TraceInterceptor.process(TraceInterceptor.java:163)
at org.apache.camel.processor.CamelInternalProcessor.process(CamelInternalProcessor.java:191)
at org.apache.camel.processor.CamelInternalProcessor.process(CamelInternalProcessor.java:191)
at org.apache.camel.component.direct.DirectProducer.process(DirectProducer.java:51)
at org.apache.camel.processor.CamelInternalProcessor.process(CamelInternalProcessor.java:191)
at org.apache.camel.processor.UnitOfWorkProducer.process(UnitOfWorkProducer.java:73)
at org.apache.camel.impl.ProducerCache$2.doInProducer(ProducerCache.java:378)
at org.apache.camel.impl.ProducerCache$2.doInProducer(ProducerCache.java:346)
at org.apache.camel.impl.ProducerCache.doInProducer(ProducerCache.java:242)
at org.apache.camel.impl.ProducerCache.sendExchange(ProducerCache.java:346)
at org.apache.camel.impl.ProducerCache.send(ProducerCache.java:184)
at org.apache.camel.impl.DefaultProducerTemplate.send(DefaultProducerTemplate.java:124)
at org.apache.camel.impl.DefaultProducerTemplate.sendBody(DefaultProducerTemplate.java:137)
at org.apache.camel.impl.DefaultProducerTemplate$14.call(DefaultProducerTemplate.java:621)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor$CallerRunsPolicy.rejectedExecution(ThreadPoolExecutor.java:2025)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:132)
at org.apache.camel.impl.DefaultProducerTemplate.asyncSendBody(DefaultProducerTemplate.java:626)
at ...
At other times the execution blocks at a later point, when it actually tries to post a message to the JMS topics:
"Camel (moduleVersionsContext) thread #1044 - Multicast#52031" daemon prio=5 tid=0xa16 nid=NA waiting
java.lang.Thread.State: WAITING
at sun.misc.Unsafe.park(Unsafe.java:-1)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
at java.util.concurrent.Semaphore.acquire(Semaphore.java:472)
at org.hornetq.core.client.impl.ClientProducerCreditsImpl.acquireCredits(ClientProducerCreditsImpl.java:90)
at org.hornetq.core.client.impl.ClientProducerImpl.sendRegularMessage(ClientProducerImpl.java:307)
at org.hornetq.core.client.impl.ClientProducerImpl.doSend(ClientProducerImpl.java:288)
at org.hornetq.core.client.impl.ClientProducerImpl.send(ClientProducerImpl.java:140)
at org.hornetq.jms.client.HornetQMessageProducer.doSend(HornetQMessageProducer.java:438)
at org.hornetq.jms.client.HornetQMessageProducer.send(HornetQMessageProducer.java:205)
at org.springframework.jms.core.JmsTemplate.doSend(JmsTemplate.java:589)
at org.apache.camel.component.jms.JmsConfiguration$CamelJmsTemplate.doSend(JmsConfiguration.java:336)
at org.apache.camel.component.jms.JmsConfiguration$CamelJmsTemplate.doSendToDestination(JmsConfiguration.java:275)
at org.apache.camel.component.jms.JmsConfiguration$CamelJmsTemplate.access$100(JmsConfiguration.java:217)
at org.apache.camel.component.jms.JmsConfiguration$CamelJmsTemplate$1.doInJms(JmsConfiguration.java:231)
at org.springframework.jms.core.JmsTemplate.execute(JmsTemplate.java:466)
at org.apache.camel.component.jms.JmsConfiguration$CamelJmsTemplate.send(JmsConfiguration.java:228)
at org.apache.camel.component.jms.JmsProducer.doSend(JmsProducer.java:431)
at org.apache.camel.component.jms.JmsProducer.processInOnly(JmsProducer.java:385)
at org.apache.camel.component.jms.JmsProducer.process(JmsProducer.java:153)
at org.apache.camel.processor.SendProcessor.process(SendProcessor.java:110)
at org.apache.camel.management.InstrumentationProcessor.process(InstrumentationProcessor.java:72)
at org.apache.camel.processor.interceptor.TraceInterceptor.process(TraceInterceptor.java:163)
at org.apache.camel.processor.CamelInternalProcessor.process(CamelInternalProcessor.java:191)
at org.apache.camel.processor.RedeliveryErrorHandler.process(RedeliveryErrorHandler.java:398)
at org.apache.camel.processor.CamelInternalProcessor.process(CamelInternalProcessor.java:191)
at org.apache.camel.util.AsyncProcessorHelper.process(AsyncProcessorHelper.java:105)
at org.apache.camel.processor.MulticastProcessor.doProcessParallel(MulticastProcessor.java:712)
at org.apache.camel.processor.MulticastProcessor.access$200(MulticastProcessor.java:83)
at org.apache.camel.processor.MulticastProcessor$1.call(MulticastProcessor.java:293)
at org.apache.camel.processor.MulticastProcessor$1.call(MulticastProcessor.java:278)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Can you tell me whether I have a bug in my Camel configuration, or whether it's a known Camel issue with some workaround?
My environment:
Apache Camel 2.12.3
JBoss 6.1 EAP
OpenJDK 1.7.0_55
Ubuntu Linux 12.04 LTS
I had a similar problem. Try configuring the CXF producer endpoint to be synchronous; this resolved my issue.
http://camel.465427.n5.nabble.com/Intermittent-STUCK-threads-in-Weblogic-Server-due-to-camel-application-td5750624.html
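For the JMS endpoints in the question, the equivalent would be the generic synchronous endpoint option. A minimal sketch (my own, with hypothetical topic names and an illustrative timeout; the real route uses the configured channels):
import org.apache.camel.builder.RouteBuilder;

public class ModuleVersionsRoute extends RouteBuilder {
    @Override
    public void configure() {
        // Ask Camel to process these producer endpoints synchronously (the generic
        // "synchronous" endpoint option) rather than via the asynchronous routing engine.
        from("direct:module.versions.output.channel")
            .multicast().parallelProcessing().timeout(5000)
            .to("jms:topic:module.versions.a?synchronous=true",
                "jms:topic:module.versions.b?synchronous=true")
            .end();
    }
}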

Spark fails on big shuffle jobs with java.io.IOException: Filesystem closed

I often find that Spark fails on large jobs with a rather unhelpful, meaningless exception. The worker logs look normal and show no errors, but they end up in state "KILLED". This is extremely common for large shuffles, e.g. operations like .distinct.
The question is: how do I diagnose what's going wrong, and ideally, how do I fix it?
Given that a lot of these operations are monoidal, I've been working around the problem by splitting the data into, say, 10 chunks, running the app on each chunk, and then running the app on all of the resulting outputs. In other words - meta-map-reduce.
14/06/04 12:56:09 ERROR client.AppClient$ClientActor: Master removed our application: FAILED; stopping client
14/06/04 12:56:09 WARN cluster.SparkDeploySchedulerBackend: Disconnected from Spark cluster! Waiting for reconnection...
14/06/04 12:56:09 WARN scheduler.TaskSetManager: Loss was due to java.io.IOException
java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:779)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:840)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:159)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:143)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:164)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:149)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:27)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:176)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:257)
at scala.collection.AbstractIterator.toList(Iterator.scala:1157)
at $line5.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:13)
at $line5.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:13)
at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:450)
at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:450)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
at org.apache.spark.scheduler.Task.run(Task.scala:53)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
As of September 1st 2014, this is an "open improvement" in Spark; please see https://issues.apache.org/jira/browse/SPARK-3052. As syrza pointed out in the linked issue, the shutdown hooks are likely run in the wrong order when an executor fails, which results in this message. I understand you will have to do a little more investigation to figure out the main cause of the problem (i.e. why your executor failed). If it is a large shuffle, it might be an out-of-memory error that causes the executor failure, which in turn causes the Hadoop FileSystem to be closed in its shutdown hook. The RecordReaders in the running tasks of that executor then throw the "java.io.IOException: Filesystem closed" exception. I guess it will be fixed in a subsequent release and then you will get a more helpful error message :)
Something calls DFSClient.close() or DFSClient.abort(), closing the client. The next file operation then results in the above exception.
I would try to figure out what calls close()/abort(). You could use a breakpoint in your debugger, or modify the Hadoop source code to throw an exception in these methods, so you would get a stack trace.
The "file system closed" exception can be solved if the Spark job is running on a cluster. You can set properties like spark.executor.cores, spark.driver.cores, and spark.akka.threads to the maximum values with respect to your resource availability. I had the same problem when my dataset was pretty large, with about 20 million records of JSON data. I fixed it with the above properties and it ran like a charm. In my case, I set those properties to 25, 25, and 20 respectively. Hope it helps!
Reference Link:
http://spark.apache.org/docs/latest/configuration.html
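A minimal sketch of setting those properties programmatically (my own illustration; the 25/25/20 values are the ones mentioned above and should be tuned to your own resources):
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ShuffleJobConf {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("big-shuffle-job")        // hypothetical application name
                .set("spark.executor.cores", "25")
                .set("spark.driver.cores", "25")
                .set("spark.akka.threads", "20");     // spark.akka.threads only exists in older (1.x) releases
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... run the shuffle-heavy job with sc ...
        sc.stop();
    }
}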

Netty SslHandler headache

I've wasted a couple of days now trying to track down an intermittent bug in the newly added Akka transport encryption.
NOTE: I have experimented with setting setIssueHandshake(true) on either or both of the server and the clients, but it doesn't help at all.
Our cipher spec tests a couple of different ciphers from different suites, to make sure that the settings we support actually work. However, the tests can sometimes pass 10 times and then start failing every other test; it's truly SecureRandomly failing ;-)
Please do note that the test fails even on SHA1PRNG, so it is clearly unrelated to the additional ciphers we provide.
The code that creates the SslHandler: https://github.com/akka/akka/blob/wip-ssl-unbroken-%E2%88%9A/akka-remote/src/main/scala/akka/remote/netty/NettySSLSupport.scala
The code that constructs the pipeline: https://github.com/akka/akka/blob/wip-ssl-unbroken-%E2%88%9A/akka-remote/src/main/scala/akka/remote/netty/NettyRemoteSupport.scala#L66
The tests: https://github.com/akka/akka/blob/wip-ssl-unbroken-%E2%88%9A/akka-remote/src/test/scala/akka/remote/Ticket1978CommunicationSpec.scala
The fall-back config (for what the test above doesn't override): https://github.com/akka/akka/blob/wip-ssl-unbroken-%E2%88%9A/akka-remote/src/main/resources/reference.conf
The keystore & truststore to use for the test: https://github.com/akka/akka/tree/wip-ssl-unbroken-%E2%88%9A/akka-remote/src/test/resources
The root exception that fails the test is:
**java.security.InvalidKeyException: No installed provider supports this key: (null)**
at javax.crypto.Cipher.a(DashoA13*..)
at javax.crypto.Cipher.init(DashoA13*..)
at javax.crypto.Cipher.init(DashoA13*..)
at com.sun.net.ssl.internal.ssl.CipherBox.<init>(CipherBox.java:88)
at com.sun.net.ssl.internal.ssl.CipherBox.newCipherBox(CipherBox.java:119)
at com.sun.net.ssl.internal.ssl.CipherSuite$BulkCipher.newCipher(CipherSuite.java:369)
at com.sun.net.ssl.internal.ssl.Handshaker.newReadCipher(Handshaker.java:410)
at com.sun.net.ssl.internal.ssl.SSLEngineImpl.changeReadCiphers(SSLEngineImpl.java:550)
at com.sun.net.ssl.internal.ssl.SSLEngineImpl.readRecord(SSLEngineImpl.java:1051)
at com.sun.net.ssl.internal.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:845)
at com.sun.net.ssl.internal.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:721)
at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:607)
at org.jboss.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:969)
at org.jboss.netty.handler.ssl.SslHandler.decode(SslHandler.java:670)
at org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:333)
at org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:214)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:91)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.processSelectedKeys(AbstractNioWorker.java:373)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:247)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:35)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:680)
And the "full" one is:
[ERROR] [06/20/2012 10:38:33.670] [remote-sys-4] [ActorSystem(remote-sys)] RemoteServerError#akka://remote-sys#localhost:59104] Error[
javax.net.ssl.SSLException: Algorithm missing:
at com.sun.net.ssl.internal.ssl.SSLEngineImpl.changeReadCiphers(SSLEngineImpl.java:554)
at com.sun.net.ssl.internal.ssl.SSLEngineImpl.readRecord(SSLEngineImpl.java:1051)
at com.sun.net.ssl.internal.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:845)
at com.sun.net.ssl.internal.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:721)
at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:607)
at org.jboss.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:969)
at org.jboss.netty.handler.ssl.SslHandler.decode(SslHandler.java:670)
at org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:333)
at org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:214)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:91)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.processSelectedKeys(AbstractNioWorker.java:373)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:247)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:35)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:680)
Caused by: java.security.NoSuchAlgorithmException: Could not create cipher AES/128
at com.sun.net.ssl.internal.ssl.CipherBox.<init>(CipherBox.java:99)
at com.sun.net.ssl.internal.ssl.CipherBox.newCipherBox(CipherBox.java:119)
at com.sun.net.ssl.internal.ssl.CipherSuite$BulkCipher.newCipher(CipherSuite.java:369)
at com.sun.net.ssl.internal.ssl.Handshaker.newReadCipher(Handshaker.java:410)
at com.sun.net.ssl.internal.ssl.SSLEngineImpl.changeReadCiphers(SSLEngineImpl.java:550)
... 17 more
Caused by: java.security.InvalidKeyException: No installed provider supports this key: (null)
at javax.crypto.Cipher.a(DashoA13*..)
at javax.crypto.Cipher.init(DashoA13*..)
at javax.crypto.Cipher.init(DashoA13*..)
at com.sun.net.ssl.internal.ssl.CipherBox.<init>(CipherBox.java:88)
... 21 more
]
This is not a bug in Netty; there was an unfortunate write race between the application-level handshake and the SSL handshake. It is worth noting that setIssueHandshake(true) does not seem to handle the handshake transparently, as you need to manually hold back writes until the handshake has completed.
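A rough sketch of the "hold back writes" approach (a hypothetical helper of my own, assuming Netty 3.x where SslHandler.handshake() returns a ChannelFuture): only start the application-level handshake once the SSL handshake has finished.
import org.jboss.netty.channel.Channel;
import org.jboss.netty.channel.ChannelFuture;
import org.jboss.netty.channel.ChannelFutureListener;
import org.jboss.netty.handler.ssl.SslHandler;

public final class SslHandshakeGate {
    /** Writes firstMessage only once the SSL handshake on the channel has completed. */
    public static void writeAfterHandshake(final Channel channel, final Object firstMessage) {
        SslHandler sslHandler = channel.getPipeline().get(SslHandler.class);
        sslHandler.handshake().addListener(new ChannelFutureListener() {
            public void operationComplete(ChannelFuture future) {
                if (future.isSuccess()) {
                    channel.write(firstMessage);   // ciphers are in place, safe to write
                } else {
                    channel.close();               // handshake failed: do not write at all
                }
            }
        });
    }
}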
Whilst I've not seen exactly this exception, it's certainly the case that a javax.crypto.Cipher is not thread safe; I have an application where I finally tracked down a bug that was solved by synchronizing on the cipher (in Scala):
cipher synchronized { cipher doFinal encryptedBytes }
Apologies if this is not the solution, but you posted a lot of code! (It's probably not exactly the same, as the stack trace indicates that the issue is even getting a Cipher instance - but could this also need synchronizing?)
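The same idea in Java, as a hypothetical helper of my own (not from the original answer): share one Cipher but serialize access to it, since javax.crypto.Cipher keeps internal state between init() and doFinal().
import javax.crypto.Cipher;

public final class SynchronizedCipher {
    private final Cipher cipher;

    public SynchronizedCipher(Cipher cipher) {
        this.cipher = cipher;
    }

    public byte[] doFinal(byte[] input) throws Exception {
        // Equivalent of the Scala snippet above: cipher.synchronized { cipher.doFinal(input) }
        synchronized (cipher) {
            return cipher.doFinal(input);
        }
    }
}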