Hadoop streaming job failure: Task process exit with nonzero status of 137 - streaming

I've been banging my head on this one for a few days, and hope that somebody out there will have some insight.
I've written a streaming map reduce job in perl that is prone to having one or two reduce tasks take an extremely long time to execute. This is due to a natural asymmetry in the data: some of the reduce keys have over a million rows, where most have only a few dozen.
I've had problems with long tasks before, and I've been incrementing counters throughout to ensure that map reduce doesn't time them out. But now they are failing with an error message I hadn't seen before:
java.io.IOException: Task process exit with nonzero status of 137.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
This is not the standard timeout error message, but the error code 137 = 128+9 suggests that my reducer script received a kill -9 from Hadoop. The tasktracker log contains the following:
2011-09-05 19:18:31,269 WARN org.mortbay.log: Committed before 410 getMapOutput(attempt_201109051336_0003_m_000029_1,7) failed :
org.mortbay.jetty.EofException
at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:787)
at org.mortbay.jetty.AbstractGenerator$Output.blockForOutput(AbstractGenerator.java:548)
at org.mortbay.jetty.AbstractGenerator$Output.flush(AbstractGenerator.java:569)
at org.mortbay.jetty.HttpConnection$Output.flush(HttpConnection.java:946)
at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:646)
at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:577)
at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:2940)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:324)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
Caused by: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcher.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:72)
at sun.nio.ch.IOUtil.write(IOUtil.java:43)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
at org.mortbay.io.nio.ChannelEndPoint.flush(ChannelEndPoint.java:169)
at org.mortbay.io.nio.SelectChannelEndPoint.flush(SelectChannelEndPoint.java:221)
at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:721)
... 24 more
2011-09-05 19:18:31,289 INFO org.apache.hadoop.mapred.TaskTracker.clienttrace: src: 10.92.8.202:50060, dest: 10.92.8.201:46436, bytes: 7340032, op: MAPRED_SHUFFLE, cliID: attempt_201109051336_0003_m_000029_1
2011-09-05 19:18:31,292 ERROR org.mortbay.log: /mapOutput
java.lang.IllegalStateException: Committed
at org.mortbay.jetty.Response.resetBuffer(Response.java:994)
at org.mortbay.jetty.Response.sendError(Response.java:240)
at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:2963)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:324)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
I've been looking around the forums, and it sounds like the Warnings are commonly found in jobs that run without error, and can usually be ignored. The error makes it look like the reducer lost contact with the map output, but I can't figure out why. Does anyone have any ideas?
Here's a potentially relevant fact: These long tasks were making my job take days where it should take minutes. Since I can live without the output from one or two keys, I tried to implement a ten minute timeout in my reducer as follows:
eval {
local $SIG{ALRM} = sub {
print STDERR "Processing of new merchant names in $prev_merchant_zip timed out...\n";
print STDERR "reporter:counter:tags,failed_zips,1\n";
die "timeout";
};
alarm 600;
#Code that could take a long time to execute
alarm 0;
};
This timeout code works like a charm when I test the script locally, but the strange mapreduce errors started after I introduced it. However, the failures seem to occur well after the first timeout, so I'm not sure if it's related.
Thanks in advance for any help!

Two possibilities come to mind:
RAM usage, if a tasks uses too much RAM the OS can kill it (after horrible swapping, etc).
Are you using any non-reentrant libraries? Maybe the timer is being triggered at an inopportune point in a library call.

Exit code 137 is a typical sign of the infamous OOM killer. You can easily check it using dmesg command for messages like this:
[2094250.428153] CPU: 23 PID: 28108 Comm: node Tainted: G C O 3.16.0-4-amd64 #1 Debian 3.16.7-ckt20-1+deb8u2
[2094250.428155] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.2 03/04/2015
[2094250.428156] ffff880773439400 ffffffff8150dacf ffff881328ea32f0 ffffffff8150b6e7
[2094250.428159] ffff881328ea3808 0000000100000000 ffff88202cb30080 ffff881328ea32f0
[2094250.428162] ffff88107fdf2f00 ffff88202cb30080 ffff88202cb30080 ffff881328ea32f0
[2094250.428164] Call Trace:
[2094250.428174] [<ffffffff8150dacf>] ? dump_stack+0x41/0x51
[2094250.428177] [<ffffffff8150b6e7>] ? dump_header+0x76/0x1e8
[2094250.428183] [<ffffffff8114044d>] ? find_lock_task_mm+0x3d/0x90
[2094250.428186] [<ffffffff8114088d>] ? oom_kill_process+0x21d/0x370
[2094250.428188] [<ffffffff8114044d>] ? find_lock_task_mm+0x3d/0x90
[2094250.428193] [<ffffffff811a053a>] ? mem_cgroup_oom_synchronize+0x52a/0x590
[2094250.428195] [<ffffffff8119fac0>] ? mem_cgroup_try_charge_mm+0xa0/0xa0
[2094250.428199] [<ffffffff81141040>] ? pagefault_out_of_memory+0x10/0x80
[2094250.428203] [<ffffffff81057505>] ? __do_page_fault+0x3c5/0x4f0
[2094250.428208] [<ffffffff8109d017>] ? put_prev_entity+0x57/0x350
[2094250.428211] [<ffffffff8109be86>] ? set_next_entity+0x56/0x70
[2094250.428214] [<ffffffff810a2c61>] ? pick_next_task_fair+0x6e1/0x820
[2094250.428219] [<ffffffff810115dc>] ? __switch_to+0x15c/0x570
[2094250.428222] [<ffffffff81515ce8>] ? page_fault+0x28/0x30
You can easily if OOM is enabled:
$ cat /proc/sys/vm/overcommit_memory
0
Basically OOM killer tries to kill process that eats largest part of memory. And you probably don't want to disable it:
The OOM killer can be completely disabled with the following command.
This is not recommended for production environments, because if an
out-of-memory condition does present itself, there could be unexpected
behavior depending on the available system resources and
configuration. This unexpected behavior could be anything from a
kernel panic to a hang depending on the resources available to the
kernel at the time of the OOM condition.
sysctl vm.overcommit_memory=2
echo "vm.overcommit_memory=2" >> /etc/sysctl.conf
Same situation can happen if you use e.g. cgroups for limiting memory. When process exceeds given limit it gets killed without warning.

I got this error. Kill a day and found it was an infinite loop somewhere in the code.

Related

Celery - handle WorkerLostError exception with Task.retry()

I'm using celery 4.4.7
Some of my tasks are using too much memory and are getting killed with SIGTERM 9. I would like to retry them later since I'm running with concurrency on the machine and they might run OK again.
However, as far as I understand you can't catch WorkerLostError exception thrown within a task i.e. this won't won't work as I expect:
from billiard.exceptions import WorkerLostError
#celery_app.task(acks_late=True, max_retries=2, autoretry_for=(WorkerLostError,))
def some_task():
#task code
I also don't won't to use task_reject_on_worker_lost as it makes the tasks requeued and max_retries is not applied.
What would be the best approach to handle my use case?
Thanks in advance for your time :)
Gal

How to count cache-misses in mmap-ed memory (using eBPF)?

I would like to get timeseries
t0, misses
...
tN, misses
where tN is a timestamp (second-resolution) and misses is a number of times the kernel made disk-IO for my PID to load missing page of the mmap()-ed memory region when process did access to that memory. Ok, maybe connection between disk-IO and memory-access is harder to track, lets assume my program can not do any disk-io with another (than assessing missing mmapped memory) reason. I THINK, I need to track something called node-load-misses in perf world.
Any ideas how eBPF can be used to collect such data? What probes should I use?
Tried to use perf record for similar purpose: I dislike how much data perf records. As I recall the try was like (also I dont remember how I parsed that output.data file):
perf record -p $PID -a -F 10 -e node-loads -e node-load-misses -o output.data
I thought eBPF could give some facility to implement such thing in less overhead way.
Loading of mmaped pages which are not present in memory is not hardware event like perf's cache-misses or node-loads or node-load-misses. When your program assess not present memory address, GPFault/pagefault exception is generated by hardware and it is handled in software by Linux kernel codes. For first access to anonymous memory physical page will be allocated and mapped for this virtual address; for access of mmaped file disk I/O will be initiated. There are two kinds of page faults in linux: minor and major, and disk I/O is major page fault.
You should try to use trace-cmd or ftrace or perf trace. Support of fault tracing was planned for perf tool in 2012, and patches were proposed in https://lwn.net/Articles/602658/
There is a tracepoint for page faults from userspace code, and this command prints some events with memory address of page fault:
echo 2^123456%2 | perf trace -e 'exceptions:page_fault_user' bc
With recent perf tool (https://mirrors.edge.kernel.org/pub/linux/kernel/tools/perf/) there is perf trace record which can record both mmap syscalls and page_fault_user into perf.data and perf script will print all events and they can be counted by some awk or python script.
Some useful links on perf and tracing: http://www.brendangregg.com/perf.html http://www.brendangregg.com/ebpf.html https://github.com/iovisor/bpftrace/blob/master/INSTALL.md
And some bcc tools may be used to trace disk I/O, like https://github.com/iovisor/bcc/blob/master/examples/tracing/disksnoop.py or https://github.com/brendangregg/perf-tools/blob/master/examples/iosnoop_example.txt
And for simple time-series stat you can use perf stat -I 1000 command with correct software events
perf stat -e cpu-clock,page-faults,minor-faults,major-faults -I 1000 ./program
...
# time counts unit events
1.000112251 413.59 msec cpu-clock # 0.414 CPUs utilized
1.000112251 5,361 page-faults # 0.013 M/sec
1.000112251 5,301 minor-faults # 0.013 M/sec
1.000112251 60 major-faults # 0.145 K/sec
2.000490561 16.32 msec cpu-clock # 0.016 CPUs utilized
2.000490561 1 page-faults # 0.005 K/sec
2.000490561 1 minor-faults # 0.005 K/sec
2.000490561 0 major-faults # 0.000 K/sec

How to fill data upto a size in multiple disk?

I am creating 4 mountpoint disk in Windows OS. I need to copy files up to a threshold value (say 50 GB).
I tried with vdbench. It works fine, but it throws an exception at last.
compratio=4
dedupratio=1
dedupunit=256k
* Host Definition section
hd=default,user=Administator,shell=vdbench,jvms=1
hd=localhost,system=localhost
********************************************************************************
* Storage Definition section
fsd=fsd1,anchor=C:\UnMapTest-Volume1\disk1\,depth=1,width=1,files=1,size=5g
fsd=fsd2,anchor=C:\UnMapTest-Volume2\disk2\,depth=1,width=1,files=1,size=5g
fwd=fwd1,fsd=fsd*,operation=write,xfersize=1m,fileio=sequential,fileselect=random,threads=10
rd=rd1,fwd=fwd1,fwdrate=max,format=yes,elapsed=1h,interval=1
Below is the exception from vdbench. Due to this my calling script would fail.
05:29:14.287 Message from slave localhost-0:
05:29:14.289 file=C:\UnMapTest-Volume1\disk1\\vdb.1_1.dir\vdb_f0001.file,busy=true
05:29:14.290 Thread: FwgThread write C:\UnMapTest-Volume1\disk1\ rd=rd1 For loops: None
05:29:14.291
05:29:14.292 last_ok_request: Thu Dec 28 05:28:57 PST 2017
05:29:14.292 Duration: 16.92 seconds
05:29:14.293 consecutive_blocks: 10001
05:29:14.294 last_block: FILE_BUSY File busy
05:29:14.294 operation: write
05:29:14.295
05:29:14.296 Do you maybe have more threads running than that you have
05:29:14.296 files and therefore some threads ultimately give up after 10000 tries?
05:29:14.300 *
05:29:14.301 ******************************************************
05:29:14.302 * Slave localhost-0 aborting: Too many thread blocks *
05:29:14.302 ******************************************************
05:29:14.303 *
05:29:21.235
05:29:21.235 Slave localhost-0 prematurely terminated.
05:29:21.235
05:29:21.235 Slave aborted. Abort message received:
05:29:21.235 Too many thread blocks
05:29:21.235
05:29:21.235 Look at file localhost-0.stdout.html for more information.
05:29:21.735
05:29:21.735 Slave localhost-0 prematurely terminated.
05:29:21.735
java.lang.RuntimeException: Slave localhost-0 prematurely terminated.
at Vdb.common.failure(common.java:335)
at Vdb.SlaveStarter.startSlave(SlaveStarter.java:198)
at Vdb.SlaveStarter.run(SlaveStarter.java:47)
I am using PowerShell in a Windows machine. Even if some other tools like Diskspd is having way to fill data up to some threshold then please provide me.
I found the answer by myself.
I have done this using Diskspd.exe as below
The following command fill 50 GB data in the mentioned disk folder
.\diskspd.exe -c50G -b4K -t2 C:\UnMapTest-Volume1\disk1\testfile1.dat
It is very simple than Vdbench for my requirement.
Caution : But it is not having real data so array side disk size is
not shown up with the size

JVM crashes frequently

JVM crashes surprizingly and frequently on our prod environment and results in Jboss (EAP6.3) going down. We have java7 U72 installed
Crash logs has same output where current thread is:
Current thread (0x00000000d1d99000): JavaThread "Lucene Merge Thread #0" daemon [_thread_in_Java, id=1144, stack(0x00000000f6a00000,0x00000000f6b00000)]
and all the log is full of :
JavaThread "elasticsearch[Node BD852E44][search][T#68]" daemon [_thread_blocked, id=14396, stack(0x00000000f7b30000,0x00000000f7c30000)]
elasticsearch is some were related to indexing and it uses Lucene in hood as far as I understand but we have number or application deployed how to check on this can someone please help. complete crash logs are at : http://pastebin.com/845LU9iK
Looks like it didn't manage to record stack traces for the affected thread.
If that's the same for all crashes then it doesn't seem to match known lucene or jboss bugs.
# guarantee(result == EXCEPTION_CONTINUE_EXECUTION) failed: Unexpected result from topLevelExceptionFilter
AIUI this indicates an error in native exception handling, so it's one error masking another, probably making this crash log fairly useless.
So I can only provide really generic advice:
you're using an older JVM version, update to the latest java 7, java 8 or possibly even a java 9 dev build and see if it goes away. Even if they still crash they might provide different/more useful error reports
to diagnose potential compiler bugs you can try running with the following flags
-XX:-TieredCompilation 1 should disable the C1 compiler
-XX:+TieredCompilation -XX:TieredStopAtLevel=1 should disable the C2 compiler
-Xint disables all JIT, very slow
ask on the hotspot-dev mailing list for further guidance
1: Tiered compilation is a new java 7 feature, it basically combines the interpreter, C1 and C2 JIT compilers (which formerly were used separately in the client and server VMs) into different optimizing stages.
Each of them can have optimization bugs. Turning off individual stages helps isolating them as potential cause.
Edit: The new crash report is more useful since it at least has java frames, the interesting part is the following:
J 1559 sun.misc.Unsafe.getByte(J)B (0 bytes) # 0x000000000178e99b [0x000000000178e960+0x3b]
j java.nio.DirectByteBuffer.get()B+11
j org.apache.lucene.store.ByteBufferIndexInput.readByte()B+4
J 9447 C2 org.apache.lucene.store.DataInput.readVInt()I (114 bytes) # 0x000000000348cc00 [0x000000000348cbc0+0x40]
DataInput.readVInt seems to be an ongoing source of grief, see this SO answer for possible solutions

High tomcat thread count caused by threads waiting on MultiThreadedHttpConnectionManager ConnectionPool

We've recently started seeing spikes in the thread counts on our tomcat servers (peaking at over 1000 when normally at around 100). We performed a thread dump on one of the tomcat servers whilst it's thread count was high and found that a large number of the threads were waiting on MultiThreadedHttpConnectionManager$ConnectionPool, stack trace as follows:
"TP-Processor21700" daemon prio=10 tid=0x4a0b3400 nid=0x2091 in Object.wait() [0x399f3000..0x399f4004]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x58ee5030> (a org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:518)
- locked <0x58ee5030> (a org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416)
at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
...
There are 3 points in our code where httpClient.executeMethod() is called (to obtain info via an http request to another tomcat server). In each case the GetMethod object passed to it has had its socket timeout value set (i.e. via getMethod.getParams().setSoTimeout();) before hand, and the MultiThreadedConnectionManager is configured in spring to have a connectionTimeout value of 10 seconds. One thing I have noticed is that only 2 of the 3 httpClient.executeMethod() invocations are followed by a call to getMethod.releaseConnection(), so I'm wondering if this may be the cause of the problem (i.e. connections not being explicitly released). However what's strange is that
the problem has only started occurring in the last few days and the source code has not been modified for over a year, plus the fact that there has been no recent surge in requests coming through to the tomcat servers. One change that did occur a couple of days before the problem started to occur was that we upgraded the JVM used by the tomcat server from Java 5 (1.5 update 14) to Java 6 (1.6 update 25). We have tried temporarily reverting the JVM version to Java 5 to see if the problem stopped occurring but it did not. Another point to note is that in most cases the tomcat server eventually recovers and
the thread count drops back to normal - we've only had one instance where a tomcat process appears to have crashed because of the thread count increase.
We are running Tomcat 5.5 with commons-httpclient-3.1.jar running against a Java 1.6 update 25 on a Red Hat linux environment.
Please let me know if you can suggest any ideas as to what may be the cause of this issue.
Thanks.
The problem was indeed caused by the fact that only 2 of the 3 httpClient.executeMethod(getMethod) invocations were followed by a call to getMethod.releaseConnection(). Ensuring all 3 httpClient.executeMethod(getMethod) invocations were inside a try/catch block followed by a finally block containing a call to getMethod.releaseConnection() prevented the high thread counts from occurring. Although this code had been in our live system for over a year, it appears that the reason the high thread count issue only recently started occurring was because various search engine crawlers had started hitting the site with lots of URL requests that caused the code where the connection was being used but not subsequently released to execute. Problem solved.