Spark with neo4j connector, neo4j times out? - scala

EDIT: I have tried several things to get this to run to completion, and nothing seems to help:
Updated the neo4j-community.vmoptions:
-server
-Xms4096m
-Xmx4096m
-XX:NewSize=1024m
I also updated my Mac to handle more threads.
C02RH2U9G8WM:~ meuser$ ulimit -u
709
C02RH2U9G8WM:~ meuser$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 256
pipe size (512 bytes, -p) 1
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 709
virtual memory (kbytes, -v) unlimited
Finally, I ran a test to make sure limited hard-drive space wasn't causing the issues. Everything seemed to check out, and during execution I would check my thread usage in another window:
jstack -l 'pid' | grep tid | wc -l
and
ps -elfT | wc -l
Nothing seemed out of hand, so I am really confused about why I am getting the errors below. The Spark/Scala code that connects to Neo4j runs fine with a small set of results but blows up when I let it rip on everything.
Note: in my code I do not call sc.stop() or anything similar, even though I think resources need to be cleared somehow while running a decently sized loop from within Spark/Scala.
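To illustrate what I mean by clearing resources, here is a sketch of explicit session handling with the plain Java driver from Scala. The loop collection, the query, and the credentials are placeholders, and this is not how the Spark connector wires things up internally (the driver trace further down shows the connector opening its own sessions); closing each session promptly is what lets the server release the corresponding Bolt worker thread.
import org.neo4j.driver.v1.{AuthTokens, GraphDatabase}
import org.neo4j.driver.v1.Values.parameters

val driver = GraphDatabase.driver("bolt://localhost:7687",
  AuthTokens.basic("neo4j", "password"))    // placeholder credentials
try {
  keys.foreach { key =>                     // `keys` stands in for whatever drives the loop
    val session = driver.session()
    try {
      session.run("MATCH (n {id: $id}) RETURN count(n)", parameters("id", key))
    } finally {
      session.close()                       // releases the server-side Bolt worker
    }
  }
} finally {
  driver.close()                            // one driver for the whole run, closed at the end
}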
I did check the logs; it turns out it might be memory-related:
2017-10-13 01:17:45.073+0000 ERROR [o.n.b.t.SocketTransportHandler] Fatal error occurred when handling a client connection: unable to create new native thread unable to create new native thread
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:714)
at org.neo4j.kernel.impl.util.Neo4jJobScheduler.schedule(Neo4jJobScheduler.java:94)
at org.neo4j.bolt.v1.runtime.concurrent.ThreadedWorkerFactory.newWorker(ThreadedWorkerFactory.java:68)
at org.neo4j.bolt.v1.runtime.MonitoredWorkerFactory.newWorker(MonitoredWorkerFactory.java:54)
at org.neo4j.bolt.BoltKernelExtension.lambda$newVersions$1(BoltKernelExtension.java:234)
at org.neo4j.bolt.transport.ProtocolChooser.handleVersionHandshakeChunk(ProtocolChooser.java:95)
at org.neo4j.bolt.transport.SocketTransportHandler.chooseProtocolVersion(SocketTransportHandler.java:109)
at org.neo4j.bolt.transport.SocketTransportHandler.channelRead(SocketTransportHandler.java:58)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341)
at io.netty.handler.codec.ByteToMessageDecoder.handlerRemoved(ByteToMessageDecoder.java:219)
at io.netty.channel.DefaultChannelPipeline.callHandlerRemoved0(DefaultChannelPipeline.java:631)
at io.netty.channel.DefaultChannelPipeline.remove(DefaultChannelPipeline.java:468)
at io.netty.channel.DefaultChannelPipeline.remove(DefaultChannelPipeline.java:428)
at org.neo4j.bolt.transport.TransportSelectionHandler.switchToSocket(TransportSelectionHandler.java:126)
at org.neo4j.bolt.transport.TransportSelectionHandler.decode(TransportSelectionHandler.java:81)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:411)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:248)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:129)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:642)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:565)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:479)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:441)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at java.lang.Thread.run(Thread.java:745)
I am running these tests on a single MacBook Pro (not clustered or anything yet): an i7 with 16 GB of RAM.
I am running Spark/Scala code that uses the Neo4j connector. When I run it on small data sets it works fine, but when I let it rip on processing all the data in the Neo4j database, it times out after running for a good while (several hours), usually leaving me to restart the Neo4j Community Edition server. Is there a setting somewhere to adjust this timeout, either in Spark/Scala or in the Neo4j configs?
org.neo4j.driver.v1.exceptions.ClientException: Failed to establish connection with server. Make sure that you have a server with bolt enabled on localhost:7687
at org.neo4j.driver.internal.connector.socket.SocketClient.negotiateProtocol(SocketClient.java:197)
at org.neo4j.driver.internal.connector.socket.SocketClient.start(SocketClient.java:76)
at org.neo4j.driver.internal.connector.socket.SocketConnection.<init>(SocketConnection.java:63)
at org.neo4j.driver.internal.connector.socket.SocketConnector.connect(SocketConnector.java:52)
at org.neo4j.driver.internal.pool.InternalConnectionPool.acquire(InternalConnectionPool.java:113)
at org.neo4j.driver.internal.InternalDriver.session(InternalDriver.java:53)
at org.neo4j.spark.Executor$.execute(Neo4j.scala:360)
at org.neo4j.spark.Neo4jRDD.compute(Neo4j.scala:408)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
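The only server-side timeout setting I have found so far is the global transaction timeout in neo4j.conf (an assumption on my part; whether it is available depends on the Neo4j 3.x version, so check the settings reference):
# neo4j.conf -- illustrative value, not a recommendation
dbms.transaction.timeout=30m
As far as I can tell this caps how long a single transaction may run before the server terminates it, which is separate from the thread-exhaustion error above.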

Related

Vertx program memory keeps growing

I used the top command to check the Linux server and found that the memory of the deployed Vert.x program keeps increasing. I printed the memory usage with Native Memory Tracking and found that the Internal section keeps growing. I didn't manually allocate any off-heap memory in the code.
Native Memory Tracking:
Total: reserved=7595MB, committed=6379MB
Java Heap (reserved=4096MB, committed=4096MB)
(mmap: reserved=4096MB, committed=4096MB)
Class (reserved=1101MB, committed=86MB)
(classes #12776)
(malloc=7MB #18858)
(mmap: reserved=1094MB, committed=79MB)
Thread (reserved=122MB, committed=122MB)
(thread #122)
(stack: reserved=121MB, committed=121MB)
Code (reserved=253MB, committed=52MB)
(malloc=9MB #12566)
(mmap: reserved=244MB, committed=43MB)
GC (reserved=155MB, committed=155MB)
(malloc=6MB #302)
(mmap: reserved=150MB, committed=150MB)
Internal (reserved=1847MB, committed=1847MB)
(malloc=1847MB #35973)
Symbol (reserved=17MB, committed=17MB)
(malloc=14MB #137575)
(arena=3MB #1)
Native Memory Tracking (reserved=4MB, committed=4MB)
(tracking overhead=3MB)
Vert.x version: 3.9.8
Cluster: Hazelcast
Startup script: su web -s /bin/bash -c "/usr/bin/nohup /usr/bin/java -XX:NativeMemoryTracking=detail -javaagent:../target/showbiz-server-game-1.0-SNAPSHOT-fat.jar -javaagent:../../quasar-core-0.8.0.jar=b -Dvertx.hazelcast.config=/data/appdata/webdata/Project/config/cluster.xml -jar -Xms4G -Xmx4G -XX:-OmitStackTraceInFastThrow ../target/server-1.0-SNAPSHOT-fat.jar start-Dvertx-id=server -conf application-conf.json -Dlog4j.configurationFile=log4j2_logstash.xml -cluster >nohup.out 2>&1 &"
If your producers are much faster than your consumers and back pressure isn't handled properly, it's possible for memory to keep increasing.
This also varies depending on how the code is written.
This similar reported issue could be of help; consider exploring WriteStream too.
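As a rough sketch of that pattern against the Vert.x 3.x ReadStream/WriteStream API (written in Scala here to match the rest of this page; the relay helper is made up for illustration): pause the source when the destination's write queue fills up, and resume it from the drain handler.
import io.vertx.core.buffer.Buffer
import io.vertx.core.streams.{ReadStream, WriteStream}

// Manual back pressure: stop reading while the consumer is saturated.
def relay(src: ReadStream[Buffer], dst: WriteStream[Buffer]): Unit = {
  src.handler { buf =>
    dst.write(buf)
    if (dst.writeQueueFull()) {             // consumer is falling behind
      src.pause()                           // stop producing
      dst.drainHandler(_ => src.resume())   // resume once the queue drains
    }
  }
  src.endHandler(_ => dst.end())
}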

Unable to connect to the NetBeans Distribution because of Zero sized file

I recently reinstalled the NetBeans IDE on my Windows 10 PC in order to restore some unrelated configurations. When I tried checking for new plugins so that I could download the Sakila sample database, I got the error below.
I've tested the connection with both No Proxy and Use Proxy Settings, and both connection tests seem to end successfully.
I have allowed NetBeans through my firewall, but this has changed nothing either.
I haven't touched my proxy configuration, so it's on the default (autodetect). Switching autodetect off doesn't change anything either, no matter what proxy config I have in NetBeans.
Here's part of my log file that might be helpful:
Compiler: HotSpot 64-Bit Tiered Compilers
Heap memory usage: initial 32,0MB maximum 910,5MB
Non heap memory usage: initial 2,4MB maximum -1b
Garbage collector: PS Scavenge (Collections=12 Total time spent=0s)
Garbage collector: PS MarkSweep (Collections=3 Total time spent=0s)
Classes: loaded=6377 total loaded=6377 unloaded 0
INFO [org.netbeans.core.ui.warmup.DiagnosticTask]: Total memory 17.130.041.344
INFO [org.netbeans.modules.autoupdate.updateprovider.DownloadListener]: Connection content length was 0 bytes (read 0bytes), expected file size can`t be that size - likely server with file at http://updates.netbeans.org/netbeans/updates/8.0.2/uc/final/distribution/catalog.xml.gz?unique=NB_CND_EXTIDE_GFMOD_GROOVY_JAVA_JC_MOB_PHP_WEBCOMMON_WEBEE0d55337f9-fc66-4755-adec-e290169de9d5_bf88d09e-bf9f-458e-b1c9-1ea89147b12b is temporary down
INFO [org.netbeans.modules.autoupdate.ui.Utilities]: Zero sized file reported at http://updates.netbeans.org/netbeans/updates/8.0.2/uc/final/distribution/catalog.xml.gz?unique=NB_CND_EXTIDE_GFMOD_GROOVY_JAVA_JC_MOB_PHP_WEBCOMMON_WEBEE0d55337f9-fc66-4755-adec-e290169de9d5_bf88d09e-bf9f-458e-b1c9-1ea89147b12b
java.io.IOException: Zero sized file reported at http://updates.netbeans.org/netbeans/updates/8.0.2/uc/final/distribution/catalog.xml.gz?unique=NB_CND_EXTIDE_GFMOD_GROOVY_JAVA_JC_MOB_PHP_WEBCOMMON_WEBEE0d55337f9-fc66-4755-adec-e290169de9d5_bf88d09e-bf9f-458e-b1c9-1ea89147b12b
at org.netbeans.modules.autoupdate.updateprovider.DownloadListener.doCopy(DownloadListener.java:155)
at org.netbeans.modules.autoupdate.updateprovider.DownloadListener.streamOpened(DownloadListener.java:78)
at org.netbeans.modules.autoupdate.updateprovider.NetworkAccess$Task$1.run(NetworkAccess.java:111)
Caused: java.io.IOException: Zero sized file reported at http://updates.netbeans.org/netbeans/updates/8.0.2/uc/final/distribution/catalog.xml.gz?unique=NB_CND_EXTIDE_GFMOD_GROOVY_JAVA_JC_MOB_PHP_WEBCOMMON_WEBEE0d55337f9-fc66-4755-adec-e290169de9d5_bf88d09e-bf9f-458e-b1c9-1ea89147b12b
at org.netbeans.modules.autoupdate.updateprovider.DownloadListener.notifyException(DownloadListener.java:103)
at org.netbeans.modules.autoupdate.updateprovider.AutoupdateCatalogCache.copy(AutoupdateCatalogCache.java:246)
at org.netbeans.modules.autoupdate.updateprovider.AutoupdateCatalogCache.writeCatalogToCache(AutoupdateCatalogCache.java:99)
at org.netbeans.modules.autoupdate.updateprovider.AutoupdateCatalogProvider.refresh(AutoupdateCatalogProvider.java:154)
at org.netbeans.modules.autoupdate.services.UpdateUnitProviderImpl.refresh(UpdateUnitProviderImpl.java:180)
at org.netbeans.api.autoupdate.UpdateUnitProvider.refresh(UpdateUnitProvider.java:196)
[catch] at org.netbeans.modules.autoupdate.ui.Utilities.tryRefreshProviders(Utilities.java:433)
at org.netbeans.modules.autoupdate.ui.Utilities.doRefreshProviders(Utilities.java:411)
at org.netbeans.modules.autoupdate.ui.Utilities.presentRefreshProviders(Utilities.java:405)
at org.netbeans.modules.autoupdate.ui.UnitTab$14.run(UnitTab.java:806)
at org.openide.util.RequestProcessor$Task.run(RequestProcessor.java:1423)
at org.openide.util.RequestProcessor$Processor.run(RequestProcessor.java:2033)
It might be that the update server is down right now; I haven't been able to test this either. But it also might be something wrong with my configuration. I'm going crazy!
Something that worked for me was changing "http:" to "https:" in the update URLs.
I.e., change "http://updates.netbeans.org/netbeans/updates/8.0.2/uc/final/distribution/catalog.xml.gz"
to "https://updates.netbeans.org/netbeans/updates/8.0.2/uc/final/distribution/catalog.xml.gz"
No idea why that makes it work on my end. I'm running Linux Mint 19.1.

Kafka server failed to start - java.io.IOException: Map failed

I am not able to start Kafka Server because of the error below.
java.io.IOException: Map failed
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:940)
at kafka.log.AbstractIndex.<init>(AbstractIndex.scala:61)
at kafka.log.TimeIndex.<init>(TimeIndex.scala:55)
at kafka.log.LogSegment.<init>(LogSegment.scala:73)
at kafka.log.Log.loadSegments(Log.scala:267)
at kafka.log.Log.<init>(Log.scala:116)
at kafka.log.LogManager$$anonfun$createLog$1.apply(LogManager.scala:365)
at kafka.log.LogManager$$anonfun$createLog$1.apply(LogManager.scala:361)
at scala.Option.getOrElse(Option.scala:121)
at kafka.log.LogManager.createLog(LogManager.scala:361)
at kafka.cluster.Partition$$anonfun$getOrCreateReplica$1.apply(Partition.scala:109)
at kafka.cluster.Partition$$anonfun$getOrCreateReplica$1.apply(Partition.scala:106)
at kafka.utils.Pool.getAndMaybePut(Pool.scala:70)
at kafka.cluster.Partition.getOrCreateReplica(Partition.scala:105)
at kafka.cluster.Partition$$anonfun$4$$anonfun$apply$3.apply(Partition.scala:166)
at kafka.cluster.Partition$$anonfun$4$$anonfun$apply$3.apply(Partition.scala:166)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
at kafka.cluster.Partition$$anonfun$4.apply(Partition.scala:166)
at kafka.cluster.Partition$$anonfun$4.apply(Partition.scala:160)
at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:213)
at kafka.utils.CoreUtils$.inWriteLock(CoreUtils.scala:221)
at kafka.cluster.Partition.makeLeader(Partition.scala:160)
at kafka.server.ReplicaManager$$anonfun$makeLeaders$4.apply(ReplicaManager.scala:754)
at kafka.server.ReplicaManager$$anonfun$makeLeaders$4.apply(ReplicaManager.scala:753)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
at kafka.server.ReplicaManager.makeLeaders(ReplicaManager.scala:753)
at kafka.server.ReplicaManager.becomeLeaderOrFollower(ReplicaManager.scala:698)
at kafka.server.KafkaApis.handleLeaderAndIsrRequest(KafkaApis.scala:148)
at kafka.server.KafkaApis.handle(KafkaApis.scala:84)
at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:62)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Map failed
at sun.nio.ch.FileChannelImpl.map0(Native Method)
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:937)
... 34 more
I tried the options below, but to no avail. Please help.
Upgraded the OS from 32-bit to 64-bit.
Increased the Java heap size to 1 GB.
Uninstalled and reinstalled Apache Kafka.
If the options above don't resolve the problem, you could try increasing vm.max_map_count. The default value is 65536 (check it with sysctl vm.max_map_count).
With cat /proc/[kafka-pid]/maps | wc -l you can see how many maps are in use.
Increase the setting with:
sysctl -w vm.max_map_count=262144
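To make that survive a reboot (standard sysctl mechanics, nothing Kafka-specific), the setting can also be added to /etc/sysctl.conf:
echo "vm.max_map_count=262144" >> /etc/sysctl.conf
sysctl -p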
Upgrading the JVM to 64-bit resolved the issue.
This helped me: changing KAFKA_HEAP_OPTS to "-Xmx256M -Xms256M" (originally 512M) in the kafka-server-start script. Thanks!
This helped me: in the kafka-server-start script, change
export KAFKA_HEAP_OPTS="-Xmx512M -Xms512M"
(originally 1G).
I had the same issue on Windows. Kafka takes a fair amount of memory while running, so we need to increase the heap to avoid throttling the application's performance. This can be done graphically through the Java Control Panel.
Under Runtime Parameters, you can change the amount of memory allocated to the JVM:
-Xmx512m assigns 512 MB,
-Xmx1024m assigns 1 GB,
-Xmx2048m assigns 2 GB,
-Xmx3072m assigns 3 GB, and so on.

Play framework scala - too many open files - how to in production

In the following, I am using Play framework 2.4.0 with Scala 2.11.7.
I am stress-testing a simple Play app with Gatling, injecting 5000 users over 60 seconds, and within a few seconds the Play server returns the following:
"Failed to accept a connection." and "java.io.IOException: Too many open files in system".
Here is the associated stacktrace:
22:52:48.943 [application-akka.actor.default-dispatcher-12] INFO play.api.Play$ - Application started (Dev)
22:53:08.939 [New I/O server boss #17] WARN o.j.n.c.s.nio.AbstractNioSelector - Failed to accept a connection.
java.io.IOException: Too many open files in system
at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method) ~[na:1.8.0_45]
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422) ~[na:1.8.0_45]
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250) ~[na:1.8.0_45]
at org.jboss.netty.channel.socket.nio.NioServerBoss.process(NioServerBoss.java:100) [netty-3.10.3.Final.jar:na]
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337) [netty-3.10.3.Final.jar:na]
at org.jboss.netty.channel.socket.nio.NioServerBoss.run(NioServerBoss.java:42) [netty-3.10.3.Final.jar:na]
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) [netty-3.10.3.Final.jar:na]
at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) [netty-3.10.3.Final.jar:na]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_45]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_45]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]
I suppose this is due to the system's ulimit (could someone confirm that?), and if so, my question is the following:
How is this kind of error managed in a production environment? Is it by setting a high value with ulimit -n <high_value>?
The most foolproof way to check is:
cat /proc/PID/limits
You will see:
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
...
Max open files            1024                 1024                 files
...
To see the limits for your current shell, you can always run:
ulimit -a
and get:
...
open files (-n) 1024
...
Changing it is best done system-wide via /etc/security/limits.conf (an example entry is shown below).
But you can use ulimit -n to change it for only the current shell.
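A typical pair of entries looks like this (the user name and the value are illustrative):
# /etc/security/limits.conf
myappuser  soft  nofile  65536
myappuser  hard  nofile  65536
The new limit is normally picked up at the next login session for that user, so the server process has to be restarted from a fresh session.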
In terms of how to deal with this situation, there's simply no alternative to having enough file descriptors. Set it high, and if you still hit the limit, either there is a leak or you work for Facebook. In production, I believe the general recommendation is 16k.

Hadoop streaming job failure: Task process exit with nonzero status of 137

I've been banging my head on this one for a few days, and hope that somebody out there will have some insight.
I've written a streaming map-reduce job in Perl that is prone to having one or two reduce tasks take an extremely long time to execute. This is due to a natural asymmetry in the data: some of the reduce keys have over a million rows, whereas most have only a few dozen.
I've had problems with long tasks before, and I've been incrementing counters throughout to ensure that map reduce doesn't time them out. But now they are failing with an error message I hadn't seen before:
java.io.IOException: Task process exit with nonzero status of 137.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
This is not the standard timeout error message, but the error code 137 = 128+9 suggests that my reducer script received a kill -9 from Hadoop. The tasktracker log contains the following:
2011-09-05 19:18:31,269 WARN org.mortbay.log: Committed before 410 getMapOutput(attempt_201109051336_0003_m_000029_1,7) failed :
org.mortbay.jetty.EofException
at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:787)
at org.mortbay.jetty.AbstractGenerator$Output.blockForOutput(AbstractGenerator.java:548)
at org.mortbay.jetty.AbstractGenerator$Output.flush(AbstractGenerator.java:569)
at org.mortbay.jetty.HttpConnection$Output.flush(HttpConnection.java:946)
at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:646)
at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:577)
at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:2940)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:324)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
Caused by: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcher.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:72)
at sun.nio.ch.IOUtil.write(IOUtil.java:43)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
at org.mortbay.io.nio.ChannelEndPoint.flush(ChannelEndPoint.java:169)
at org.mortbay.io.nio.SelectChannelEndPoint.flush(SelectChannelEndPoint.java:221)
at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:721)
... 24 more
2011-09-05 19:18:31,289 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.92.8.202:50060, dest: 10.92.8.201:46436, bytes: 7340032, op: MAPRED_SHUFFLE, cliID: attempt_201109051336_0003_m_000029_1
2011-09-05 19:18:31,292 ERROR org.mortbay.log: /mapOutput
java.lang.IllegalStateException: Committed
at org.mortbay.jetty.Response.resetBuffer(Response.java:994)
at org.mortbay.jetty.Response.sendError(Response.java:240)
at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:2963)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:324)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
I've been looking around the forums, and it sounds like these warnings are commonly found in jobs that run without error and can usually be ignored. The error makes it look like the reducer lost contact with the map output, but I can't figure out why. Does anyone have any ideas?
Here's a potentially relevant fact: These long tasks were making my job take days where it should take minutes. Since I can live without the output from one or two keys, I tried to implement a ten minute timeout in my reducer as follows:
eval {
local $SIG{ALRM} = sub {
print STDERR "Processing of new merchant names in $prev_merchant_zip timed out...\n";
print STDERR "reporter:counter:tags,failed_zips,1\n";
die "timeout";
};
alarm 600;
#Code that could take a long time to execute
alarm 0;
};
This timeout code works like a charm when I test the script locally, but the strange mapreduce errors started after I introduced it. However, the failures seem to occur well after the first timeout, so I'm not sure if it's related.
Thanks in advance for any help!
Two possibilities come to mind:
RAM usage: if a task uses too much RAM, the OS can kill it (after horrible swapping, etc.).
Are you using any non-reentrant libraries? Maybe the timer is being triggered at an inopportune point in a library call.
Exit code 137 is a typical sign of the infamous OOM killer. You can easily check for it using the dmesg command, looking for messages like this:
[2094250.428153] CPU: 23 PID: 28108 Comm: node Tainted: G C O 3.16.0-4-amd64 #1 Debian 3.16.7-ckt20-1+deb8u2
[2094250.428155] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.2 03/04/2015
[2094250.428156] ffff880773439400 ffffffff8150dacf ffff881328ea32f0 ffffffff8150b6e7
[2094250.428159] ffff881328ea3808 0000000100000000 ffff88202cb30080 ffff881328ea32f0
[2094250.428162] ffff88107fdf2f00 ffff88202cb30080 ffff88202cb30080 ffff881328ea32f0
[2094250.428164] Call Trace:
[2094250.428174] [<ffffffff8150dacf>] ? dump_stack+0x41/0x51
[2094250.428177] [<ffffffff8150b6e7>] ? dump_header+0x76/0x1e8
[2094250.428183] [<ffffffff8114044d>] ? find_lock_task_mm+0x3d/0x90
[2094250.428186] [<ffffffff8114088d>] ? oom_kill_process+0x21d/0x370
[2094250.428188] [<ffffffff8114044d>] ? find_lock_task_mm+0x3d/0x90
[2094250.428193] [<ffffffff811a053a>] ? mem_cgroup_oom_synchronize+0x52a/0x590
[2094250.428195] [<ffffffff8119fac0>] ? mem_cgroup_try_charge_mm+0xa0/0xa0
[2094250.428199] [<ffffffff81141040>] ? pagefault_out_of_memory+0x10/0x80
[2094250.428203] [<ffffffff81057505>] ? __do_page_fault+0x3c5/0x4f0
[2094250.428208] [<ffffffff8109d017>] ? put_prev_entity+0x57/0x350
[2094250.428211] [<ffffffff8109be86>] ? set_next_entity+0x56/0x70
[2094250.428214] [<ffffffff810a2c61>] ? pick_next_task_fair+0x6e1/0x820
[2094250.428219] [<ffffffff810115dc>] ? __switch_to+0x15c/0x570
[2094250.428222] [<ffffffff81515ce8>] ? page_fault+0x28/0x30
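A quick way to filter for those messages (the grep patterns are just illustrative):
dmesg | grep -iE "out of memory|killed process"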
You can easily check whether memory overcommit is enabled:
$ cat /proc/sys/vm/overcommit_memory
0
Basically, the OOM killer tries to kill the process that eats the largest share of memory, and you probably don't want to disable it:
The OOM killer can be completely disabled with the following command. This is not recommended for production environments, because if an out-of-memory condition does present itself, there could be unexpected behavior depending on the available system resources and configuration. This unexpected behavior could be anything from a kernel panic to a hang, depending on the resources available to the kernel at the time of the OOM condition.
sysctl vm.overcommit_memory=2
echo "vm.overcommit_memory=2" >> /etc/sysctl.conf
The same situation can happen if you use e.g. cgroups to limit memory: when a process exceeds the given limit, it gets killed without warning.
I got this error too. I lost a day on it and eventually found it was an infinite loop somewhere in the code.