Azure Data Factory Jobs Failing in Hadoop/MapReduce?

Some of my ADF jobs are failing randomly; the output written to the /PackageJobs/~job/Status/stderr file is shown below.
Note that this doesn't always happen: it occurs randomly on some of the jobs, while others complete normally.
What could be causing this problem?
The stderr data is as follows:
log4j:ERROR Could not instantiate class [com.microsoft.log4jappender.FilterLogAppender].
java.lang.ClassNotFoundException: com.microsoft.log4jappender.FilterLogAppender
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:190)
at org.apache.log4j.helpers.Loader.loadClass(Loader.java:198)
at org.apache.log4j.helpers.OptionConverter.instantiateByClassName(OptionConverter.java:327)
at org.apache.log4j.helpers.OptionConverter.instantiateByKey(OptionConverter.java:124)
at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:785)
at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:768)
at org.apache.log4j.PropertyConfigurator.parseCatsAndRenderers(PropertyConfigurator.java:672)
at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:516)
at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:580)
at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
at org.apache.log4j.LogManager.<clinit>(LogManager.java:127)
at org.apache.log4j.Logger.getLogger(Logger.java:104)
at org.apache.commons.logging.impl.Log4JLogger.getLogger(Log4JLogger.java:262)
at org.apache.commons.logging.impl.Log4JLogger.<init>(Log4JLogger.java:108)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.commons.logging.impl.LogFactoryImpl.createLogFromClass(LogFactoryImpl.java:1025)
at org.apache.commons.logging.impl.LogFactoryImpl.discoverLogImplementation(LogFactoryImpl.java:844)
at org.apache.commons.logging.impl.LogFactoryImpl.newInstance(LogFactoryImpl.java:541)
at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:292)
at org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:269)
at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:657)
at org.apache.hadoop.util.ShutdownHookManager.<clinit>(ShutdownHookManager.java:44)
at org.apache.hadoop.util.RunJar.run(RunJar.java:200)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
log4j:ERROR Could not instantiate appender named "RMSUMFilterLog".
16/03/04 10:56:02 INFO impl.TimelineClientImpl: Timeline service address: http://headnodehost:8188/ws/v1/timeline/
16/03/04 10:56:02 INFO client.RMProxy: Connecting to ResourceManager at headnodehost/100.74.24.3:9010
16/03/04 10:56:02 INFO client.AHSProxy: Connecting to Application History server at headnodehost/100.74.24.3:10200
16/03/04 10:56:03 INFO impl.TimelineClientImpl: Timeline service address: http://headnodehost:8188/ws/v1/timeline/
16/03/04 10:56:03 INFO client.RMProxy: Connecting to ResourceManager at headnodehost/100.74.24.3:9010
16/03/04 10:56:03 INFO client.AHSProxy: Connecting to Application History server at headnodehost/100.74.24.3:10200
16/03/04 10:56:06 INFO mapred.FileInputFormat: Total input paths to process : 1
16/03/04 10:56:06 INFO mapreduce.JobSubmitter: number of splits:1
16/03/04 10:56:06 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
16/03/04 10:56:06 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
16/03/04 10:56:07 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1457068773628_0022
16/03/04 10:56:07 INFO mapreduce.JobSubmitter: Kind: mapreduce.job, Service: job_1457068773628_0019, Ident: (org.apache.hadoop.mapreduce.security.token.JobTokenIdentifier@655019bc)
16/03/04 10:56:08 INFO impl.YarnClientImpl: Submitted application application_1457068773628_0022
16/03/04 10:56:08 INFO mapreduce.Job: The url to track the job: http://headnodehost:9014/proxy/application_1457068773628_0022/
16/03/04 10:56:08 INFO mapreduce.Job: Running job: job_1457068773628_0022
16/03/04 10:56:18 INFO mapreduce.Job: Job job_1457068773628_0022 running in uber mode : false
16/03/04 10:56:18 INFO mapreduce.Job: map 0% reduce 0%
16/03/04 10:56:31 INFO mapreduce.Job: map 100% reduce 0%
16/03/04 23:48:59 INFO mapreduce.Job: Task Id : attempt_1457068773628_0022_m_000000_0, Status : FAILED
AttemptID:attempt_1457068773628_0022_m_000000_0 Timed out after 600 secs
16/03/04 23:49:00 INFO mapreduce.Job: map 0% reduce 0%
16/03/04 23:49:16 INFO mapreduce.Job: map 100% reduce 0%
16/03/05 00:01:00 INFO mapreduce.Job: Task Id : attempt_1457068773628_0022_m_000000_1, Status : FAILED
AttemptID:attempt_1457068773628_0022_m_000000_1 Timed out after 600 secs
16/03/05 00:01:01 INFO mapreduce.Job: map 0% reduce 0%
16/03/05 00:01:21 INFO mapreduce.Job: map 100% reduce 0%
16/03/05 00:13:00 INFO mapreduce.Job: Task Id : attempt_1457068773628_0022_m_000000_2, Status : FAILED
AttemptID:attempt_1457068773628_0022_m_000000_2 Timed out after 600 secs
16/03/05 00:13:01 INFO mapreduce.Job: map 0% reduce 0%
16/03/05 00:13:18 INFO mapreduce.Job: map 100% reduce 0%
16/03/05 00:25:03 INFO mapreduce.Job: Job job_1457068773628_0022 failed with state FAILED due to: Task failed task_1457068773628_0022_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0
16/03/05 00:25:03 INFO mapreduce.Job: Counters: 9
Job Counters
Failed map tasks=4
Launched map tasks=4
Other local map tasks=3
Rack-local map tasks=1
Total time spent by all maps in occupied slots (ms)=48514665
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=48514665
Total vcore-seconds taken by all map tasks=48514665
Total megabyte-seconds taken by all map tasks=74518525440
16/03/05 00:25:03 ERROR streaming.StreamJob: Job not successful!
Streaming Command Failed!

This looks like the known timeout issue with Hadoop/HDI: if an activity doesn't write anything to the console for 10 minutes, it gets killed. Can you please modify your code to write a ping to the console every 9 minutes and see if it works?
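For reference, here is a minimal sketch of such a ping (the object name and message text are illustrative, not part of any ADF or Hadoop API). A daemon thread emits a line every 9 minutes, so the task is never silent for the full 600-second window reported in the log. Since the log shows a streaming job, note that Hadoop Streaming also treats stderr lines of the form reporter:status:<message> as status updates, which reset the task timeout:

import java.util.concurrent.TimeUnit

// Hypothetical heartbeat helper: call start() at the beginning of the mapper
// so the task keeps producing output while the real work runs. stderr is used
// because a streaming mapper's stdout carries the actual key/value output.
object Heartbeat {
  def start(): Thread = {
    val t = new Thread(new Runnable {
      override def run(): Unit = {
        while (true) {
          // "reporter:status:" on stderr doubles as a Hadoop Streaming status
          // update; any console output also counts as a ping.
          System.err.println("reporter:status:still alive at " + System.currentTimeMillis())
          TimeUnit.MINUTES.sleep(9) // 9 minutes, safely under the 600 s timeout
        }
      }
    })
    t.setDaemon(true) // do not keep the JVM alive once the real work finishes
    t.start()
    t
  }
}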

Related

Container exited with a non-zero exit code 134

I'm trying to run some deployed MATLAB code on Hadoop 2.7.3, with the MCR from MATLAB R2016a (9.0.1), on Ubuntu 16.04.
When trying to execute the deployed code of a MATLAB example with the command:
sudo ./run_airlinesmall.sh /usr/local/MATLAB/MATLAB_Runtime/v901 -D mw.mcrroot=/usr/local/MATLAB/MATLAB_Runtime/v901 "/pruebas/datasets/airline.csv" /pruebas/resultados/myresults
I receive the following error output:
Using MATLAB_HADOOP_INSTALL: /usr/hadoop/hadoop-2.7.3/bin/hadoop
Using HADOOP_PREFIX: /usr/hadoop/hadoop-2.7.3/bin/hadoop
Using HADOOP_HOME: /usr/hadoop/hadoop-2.7.3/bin/hadoop
Find out Hadoop version: /usr/hadoop/hadoop-2.7.3/bin/hadoop version
Use Hadoop V2 JAR files
------------------------------------------
Launch Hadoop job airlinesmall.ctf
hola
/usr/local/MATLAB/MATLAB_Runtime/v901
Executing: /usr/hadoop/hadoop-2.7.3/bin/hadoop jar "/usr/local/MATLAB/MATLAB_Runtime/v901/toolbox/mlhadoop/jar/a2.2.0/mwmapreduce.jar" com.mathworks.hadoop.MWMapReduceDriver "-D" "mw.mcrroot=/usr/local/MATLAB/MATLAB_Runtime/v901" "airlinesmall.ctf" "/pruebas/datasets/airline.csv" "/pruebas/resultados/myresults"
java.library.path: /usr/hadoop/hadoop-2.7.3/lib/native
HDFSCTFPath=hdfs://localhost:9000/user/root/airlinesmall/airlinesmall.ctf
Uploading CTF into distributed cache completed.
mapred.child.env: MCR_CACHE_ROOT=/tmp,LD_LIBRARY_PATH=/usr/local/MATLAB/MATLAB_Runtime/v901/runtime/glnxa64:/usr/local/MATLAB/MATLAB_Runtime/v901/bin/glnxa64
mapred.child.java.opts: -Xmx200m -Djava.library.path=/usr/local/MATLAB/MATLAB_Runtime/v901/runtime/glnxa64:/usr/local/MATLAB/MATLAB_Runtime/v901/bin/glnxa64
New java.library.path: /usr/hadoop/hadoop-2.7.3/lib/native:/usr/local/MATLAB/MATLAB_Runtime/v901/runtime/glnxa64:/usr/local/MATLAB/MATLAB_Runtime/v901/bin/glnxa64
Using MATLAB mapper.
Set input format class to: ChunkFileRecordReader.
Using MATLAB reducer.
Set outputformat class to: class org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
Set map output key class to: class com.mathworks.hadoop.MxArrayWritable2
Set map output value class to: class com.mathworks.hadoop.MxArrayWritable2
Set reduce output key class to: class com.mathworks.hadoop.MxArrayWritable2
Set reduce output value class to: class com.mathworks.hadoop.MxArrayWritable2
*************** run ******************
16/10/05 10:27:59 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/10/05 10:28:00 INFO input.FileInputFormat: Total input paths to process : 1
16/10/05 10:28:00 INFO mapreduce.JobSubmitter: number of splits:1
16/10/05 10:28:00 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1475655722794_0002
16/10/05 10:28:00 INFO impl.YarnClientImpl: Submitted application application_1475655722794_0002
16/10/05 10:28:00 INFO mapreduce.Job: The url to track the job: http://pcitbu:8088/proxy/application_1475655722794_0002/
16/10/05 10:28:00 INFO mapreduce.Job: Running job: job_1475655722794_0002
16/10/05 10:28:04 INFO mapreduce.Job: Job job_1475655722794_0002 running in uber mode : false
16/10/05 10:28:04 INFO mapreduce.Job: map 0% reduce 0%
16/10/05 10:28:07 INFO mapreduce.Job: Task Id : attempt_1475655722794_0002_m_000000_0, Status : FAILED
Exception from container-launch.
Container id: container_1475655722794_0002_01_000002
Exit code: 134
Exception message: /bin/bash: line 1: 6811 Aborted (core dumped) /usr/java/jdk1.8.0_101/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx200m -Djava.library.path=/usr/local/MATLAB/MATLAB_Runtime/v901/runtime/glnxa64:/usr/local/MATLAB/MATLAB_Runtime/v901/bin/glnxa64 -Djava.io.tmpdir=/tmp/hadoop-pcitbu/nm-local-dir/usercache/root/appcache/application_1475655722794_0002/container_1475655722794_0002_01_000002/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/usr/hadoop/hadoop-2.7.3/logs/userlogs/application_1475655722794_0002/container_1475655722794_0002_01_000002 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 127.0.1.1 35025 attempt_1475655722794_0002_m_000000_0 2 > /usr/hadoop/hadoop-2.7.3/logs/userlogs/application_1475655722794_0002/container_1475655722794_0002_01_000002/stdout 2> /usr/hadoop/hadoop-2.7.3/logs/userlogs/application_1475655722794_0002/container_1475655722794_0002_01_000002/stderr
...
Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 6944 Aborted (core dumped) /usr/java/jdk1.8.0_101/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx200m -Djava.library.path=/usr/local/MATLAB/MATLAB_Runtime/v901/runtime/glnxa64:/usr/local/MATLAB/MATLAB_Runtime/v901/bin/glnxa64 -Djava.io.tmpdir=/tmp/hadoop-pcitbu/nm-local-dir/usercache/root/appcache/application_1475655722794_0002/container_1475655722794_0002_01_000004/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/usr/hadoop/hadoop-2.7.3/logs/userlogs/application_1475655722794_0002/container_1475655722794_0002_01_000004 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 127.0.1.1 35025 attempt_1475655722794_0002_m_000000_2 4 > /usr/hadoop/hadoop-2.7.3/logs/userlogs/application_1475655722794_0002/container_1475655722794_0002_01_000004/stdout 2> /usr/hadoop/hadoop-2.7.3/logs/userlogs/application_1475655722794_0002/container_1475655722794_0002_01_000004/stderr
at org.apache.hadoop.util.Shell.runCommand(Shell.java:582)
at org.apache.hadoop.util.Shell.run(Shell.java:479)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 134
16/10/05 10:28:21 INFO mapreduce.Job: map 100% reduce 100%
16/10/05 10:28:21 INFO mapreduce.Job: Job job_1475655722794_0002 failed with state FAILED due to: Task failed task_1475655722794_0002_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0
16/10/05 10:28:21 INFO mapreduce.Job: Counters: 16
Job Counters
Failed map tasks=4
Killed reduce tasks=1
Launched map tasks=4
Other local map tasks=3
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=8006
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=8006
Total time spent by all reduce tasks (ms)=0
Total vcore-milliseconds taken by all map tasks=8006
Total vcore-milliseconds taken by all reduce tasks=0
Total megabyte-milliseconds taken by all map tasks=8198144
Total megabyte-milliseconds taken by all reduce tasks=0
Map-Reduce Framework
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
I've been trying for a while, but I haven't been able to solve it.
Thanks in advance.
EDIT:
LOGS:
/usr/hadoop/hadoop-2.7.3/logs/userlogs/application_1475737682094_0001/container_1475737682094_0001_01_000001/stderr
Oct 06, 2016 9:10:36 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
INFO: Registering org.apache.hadoop.mapreduce.v2.app.webapp.JAXBContextResolver as a provider class
Oct 06, 2016 9:10:36 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
INFO: Registering org.apache.hadoop.yarn.webapp.GenericExceptionHandler as a provider class
Oct 06, 2016 9:10:36 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory register
INFO: Registering org.apache.hadoop.mapreduce.v2.app.webapp.AMWebServices as a root resource class
Oct 06, 2016 9:10:36 AM com.sun.jersey.server.impl.application.WebApplicationImpl _initiate
INFO: Initiating Jersey application, version 'Jersey: 1.9 09/02/2011 11:17 AM'
Oct 06, 2016 9:10:36 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
INFO: Binding org.apache.hadoop.mapreduce.v2.app.webapp.JAXBContextResolver to GuiceManagedComponentProvider with the scope "Singleton"
Oct 06, 2016 9:10:36 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
INFO: Binding org.apache.hadoop.yarn.webapp.GenericExceptionHandler to GuiceManagedComponentProvider with the scope "Singleton"
Oct 06, 2016 9:10:36 AM com.sun.jersey.guice.spi.container.GuiceComponentProviderFactory getComponentProvider
INFO: Binding org.apache.hadoop.mapreduce.v2.app.webapp.AMWebServices to GuiceManagedComponentProvider with the scope "PerRequest"
log4j:WARN No appenders could be found for logger (org.apache.hadoop.ipc.Server).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
/usr/hadoop/hadoop-2.7.3/logs/userlogs/application_1475737682094_0001/container_1475737682094_0001_01_000002/stderr (and same for containers 2, 3 and 5)
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<fl::i18n::MessageCatalog::MessageCatalogNotInitialized>'
what(): Message Catalog has not been initialized by a successful call to MessageCatalog::MessageCatalogInit
nodemanager log:
2016-10-06 10:40:16,391 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 9908 Aborted (core dumped) /usr/java/jdk1.8.0_101/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx200m -Djava.library.path=/usr/local/MATLAB/MATLAB_Runtime/v901/runtime/glnxa64:/usr/local/MATLAB/MATLAB_Runtime/v901/bin/glnxa64 -Djava.io.tmpdir=/tmp/hadoop-pcitbu/nm-local-dir/usercache/root/appcache/application_1475742936553_0001/container_1475742936553_0001_01_000003/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/usr/hadoop/hadoop-2.7.3/logs/userlogs/application_1475742936553_0001/container_1475742936553_0001_01_000003 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 127.0.1.1 33516 attempt_1475742936553_0001_m_000000_1 3 > /usr/hadoop/hadoop-2.7.3/logs/userlogs/application_1475742936553_0001/container_1475742936553_0001_01_000003/stdout 2> /usr/hadoop/hadoop-2.7.3/logs/userlogs/application_1475742936553_0001/container_1475742936553_0001_01_000003/stderr
2016-10-06 10:40:16,392 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:
2016-10-06 10:40:16,392 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.util.Shell.runCommand(Shell.java:582)
2016-10-06 10:40:16,392 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.util.Shell.run(Shell.java:479)
2016-10-06 10:40:16,392 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
2016-10-06 10:40:16,392 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
2016-10-06 10:40:16,392 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
2016-10-06 10:40:16,392 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
2016-10-06 10:40:16,392 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at java.util.concurrent.FutureTask.run(FutureTask.java:266)
2016-10-06 10:40:16,392 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
2016-10-06 10:40:16,392 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
2016-10-06 10:40:16,392 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at java.lang.Thread.run(Thread.java:745)
2016-10-06 10:40:16,392 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container exited with a non-zero exit code 134
2016-10-06 10:40:16,392 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1475742936553_0001_01_000003 transitioned from RUNNING to EXITED_WITH_FAILURE
2016-10-06 10:40:16,392 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1475742936553_0001_01_000003
2016-10-06 10:40:16,404 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /tmp/hadoop-pcitbu/nm-local-dir/usercache/root/appcache/application_1475742936553_0001/container_1475742936553_0001_01_000003
2016-10-06 10:40:16,404 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE APPID=application_1475742936553_0001 CONTAINERID=container_1475742936553_0001_01_000003
2016-10-06 10:40:16,404 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1475742936553_0001_01_000003 transitioned from EXITED_WITH_FAILURE to DONE
2016-10-06 10:40:16,404 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: Removing container_1475742936553_0001_01_000003 from application application_1475742936553_0001
2016-10-06 10:40:16,404 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1475742936553_0001
2016-10-06 10:40:16,479 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1475742936553_0001_000001 (auth:SIMPLE)
2016-10-06 10:40:16,486 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Stopping container with container Id: container_1475742936553_0001_01_000003
2016-10-06 10:40:16,487 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root IP=127.0.0.1 OPERATION=Stop Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1475742936553_0001 CONTAINERID=container_1475742936553_0001_01_000003
2016-10-06 10:40:17,126 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1475742936553_0001_01_000003
2016-10-06 10:40:17,127 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Stopping resource-monitoring for container_1475742936553_0001_01_000003
2016-10-06 10:40:17,169 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 9727 for container-id container_1475742936553_0001_01_000001: 397.7 MB of 2 GB physical memory used; 2.8 GB of 4.2 GB virtual memory used
2016-10-06 10:40:18,410 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed completed containers from NM context: [container_1475742936553_0001_01_000003]
2016-10-06 10:40:18,486 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1475742936553_0001_000001 (auth:SIMPLE)
2016-10-06 10:40:18,490 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Start request for container_1475742936553_0001_01_000004 by user root
2016-10-06 10:40:18,490 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root IP=127.0.0.1 OPERATION=Start Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1475742936553_0001 CONTAINERID=container_1475742936553_0001_01_000004
2016-10-06 10:40:18,490 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: Adding container_1475742936553_0001_01_000004 to application application_1475742936553_0001
2016-10-06 10:40:18,491 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1475742936553_0001_01_000004 transitioned from NEW to LOCALIZING
2016-10-06 10:40:18,491 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_INIT for appId application_1475742936553_0001
2016-10-06 10:40:18,491 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event APPLICATION_INIT for appId application_1475742936553_0001
2016-10-06 10:40:18,491 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got APPLICATION_INIT for service mapreduce_shuffle
2016-10-06 10:40:18,491 INFO org.apache.hadoop.mapred.ShuffleHandler: Added token for job_1475742936553_0001
2016-10-06 10:40:18,491 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1475742936553_0001_01_000004 transitioned from LOCALIZING to LOCALIZED
2016-10-06 10:40:18,504 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1475742936553_0001_01_000004 transitioned from LOCALIZED to RUNNING
2016-10-06 10:40:18,505 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor:
...
org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exit code: 134
2016-10-06 10:40:25,458 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception message: /bin/bash: line 1: 10040 Aborted (core dumped) /usr/java/jdk1.8.0_101/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx200m -Djava.library.path=/usr/local/MATLAB/MATLAB_Runtime/v901/runtime/glnxa64:/usr/local/MATLAB/MATLAB_Runtime/v901/bin/glnxa64 -Djava.io.tmpdir=/tmp/hadoop-pcitbu/nm-local-dir/usercache/root/appcache/application_1475742936553_0001/container_1475742936553_0001_01_000005/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/usr/hadoop/hadoop-2.7.3/logs/userlogs/application_1475742936553_0001/container_1475742936553_0001_01_000005 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 127.0.1.1 33516 attempt_1475742936553_0001_m_000000_3 5 > /usr/hadoop/hadoop-2.7.3/logs/userlogs/application_1475742936553_0001/container_1475742936553_0001_01_000005/stdout 2> /usr/hadoop/hadoop-2.7.3/logs/userlogs/application_1475742936553_0001/container_1475742936553_0001_01_000005/stderr
2016-10-06 10:40:25,458 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:
2016-10-06 10:40:25,458 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 10040 Aborted (core dumped) /usr/java/jdk1.8.0_101/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx200m -Djava.library.path=/usr/local/MATLAB/MATLAB_Runtime/v901/runtime/glnxa64:/usr/local/MATLAB/MATLAB_Runtime/v901/bin/glnxa64 -Djava.io.tmpdir=/tmp/hadoop-pcitbu/nm-local-dir/usercache/root/appcache/application_1475742936553_0001/container_1475742936553_0001_01_000005/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/usr/hadoop/hadoop-2.7.3/logs/userlogs/application_1475742936553_0001/container_1475742936553_0001_01_000005 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 127.0.1.1 33516 attempt_1475742936553_0001_m_000000_3 5 > /usr/hadoop/hadoop-2.7.3/logs/userlogs/application_1475742936553_0001/container_1475742936553_0001_01_000005/stdout 2> /usr/hadoop/hadoop-2.7.3/logs/userlogs/application_1475742936553_0001/container_1475742936553_0001_01_000005/stderr
2016-10-06 10:40:25,458 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:
2016-10-06 10:40:25,458 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.util.Shell.runCommand(Shell.java:582)
2016-10-06 10:40:25,458 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.util.Shell.run(Shell.java:479)
2016-10-06 10:40:25,458 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
2016-10-06 10:40:25,458 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
2016-10-06 10:40:25,458 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
2016-10-06 10:40:25,458 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
2016-10-06 10:40:25,458 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at java.util.concurrent.FutureTask.run(FutureTask.java:266)
2016-10-06 10:40:25,458 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
2016-10-06 10:40:25,458 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
2016-10-06 10:40:25,458 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: at java.lang.Thread.run(Thread.java:745)
2016-10-06 10:40:25,459 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container exited with a non-zero exit code 134
2016-10-06 10:40:25,459 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1475742936553_0001_01_000005 transitioned from RUNNING to EXITED_WITH_FAILURE
2016-10-06 10:40:25,459 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1475742936553_0001_01_000005
2016-10-06 10:40:25,468 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeManager: RECEIVED SIGNAL 15: SIGTERM
2016-10-06 10:40:25,497 INFO org.mortbay.log: Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:8042
2016-10-06 10:40:25,497 WARN org.apache.hadoop.http.HttpServer2: HttpServer Acceptor: isRunning is false. Rechecking.
2016-10-06 10:40:25,497 WARN org.apache.hadoop.http.HttpServer2: HttpServer Acceptor: isRunning is false
2016-10-06 10:40:25,498 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1475742936553_0001_01_000001 is : 143
2016-10-06 10:40:25,505 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /tmp/hadoop-pcitbu/nm-local-dir/usercache/root/appcache/application_1475742936553_0001/container_1475742936553_0001_01_000005
2016-10-06 10:40:25,505 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1475742936553_0001_01_000001 transitioned from RUNNING to EXITED_WITH_FAILURE
2016-10-06 10:40:25,505 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE APPID=application_1475742936553_0001 CONTAINERID=container_1475742936553_0001_01_000005
2016-10-06 10:40:25,505 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1475742936553_0001_01_000005 transitioned from EXITED_WITH_FAILURE to DONE
2016-10-06 10:40:25,505 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1475742936553_0001_01_000001
2016-10-06 10:40:25,518 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /tmp/hadoop-pcitbu/nm-local-dir/usercache/root/appcache/application_1475742936553_0001/container_1475742936553_0001_01_000001
2016-10-06 10:40:25,518 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: Removing container_1475742936553_0001_01_000005 from application application_1475742936553_0001
2016-10-06 10:40:25,518 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1475742936553_0001
2016-10-06 10:40:25,518 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE APPID=application_1475742936553_0001 CONTAINERID=container_1475742936553_0001_01_000001
2016-10-06 10:40:25,518 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1475742936553_0001_01_000001 transitioned from EXITED_WITH_FAILURE to DONE
2016-10-06 10:40:25,518 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: Removing container_1475742936553_0001_01_000001 from application application_1475742936553_0001
2016-10-06 10:40:25,518 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1475742936553_0001
2016-10-06 10:40:25,597 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Applications still running : [application_1475742936553_0001]
2016-10-06 10:40:25,598 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Waiting for Applications to be Finished
2016-10-06 10:40:25,598 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: Application application_1475742936553_0001 transitioned from RUNNING to APPLICATION_RESOURCES_CLEANINGUP
2016-10-06 10:40:25,598 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /tmp/hadoop-pcitbu/nm-local-dir/usercache/root/appcache/application_1475742936553_0001
2016-10-06 10:40:25,598 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event APPLICATION_STOP for appId application_1475742936553_0001
2016-10-06 10:40:25,599 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: Application application_1475742936553_0001 transitioned from APPLICATION_RESOURCES_CLEANINGUP to FINISHED
2016-10-06 10:40:25,599 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.NonAggregatingLogHandler: Scheduling Log Deletion for application: application_1475742936553_0001, with delay of 10800 seconds
complete log: http://pastebin.com/j76TjcHF
Sometimes it prints that message; other times the system just reboots.
Let me know if you need more info.

Subclassing SparkException doesn't allow for message passing between Master and Workers

I need to build an application where a master node distributes a large dataset to a number of worker nodes for parallel processing. Since I'm running this application on a single machine and a single JVM, I've called setMaster("local[4]") on my SparkConf object. I'm using Spark 1.5.2 and Scala 2.10.5 through IntelliJ.
If a certain condition occurs in the portions of the dataset handled by the executors, I need the master node to be notified and perform some action. In addition to that, I need the other executors to die. To that end, I looked around the Scala Spark API and realized that SparkException allows me to do the first portion of what I'm looking for, by propagating the exception (which is Serializable, by the way) to the driver. I have verified this experimentally, as follows:
import org.apache.spark.{SparkConf, SparkContext, SparkException}

object SparkExceptions {
  def main(args: Array[String]) = {
    val conf = new SparkConf().setAppName("Spark Exceptions").setMaster("local[4]")
    val sc = new SparkContext(conf)
    val l = Range(1, 5000)
    val parl = sc.parallelize(l, 8)
    val mappedRDD = parl.map(func)
    try {
      val res = mappedRDD.collect()
      println(res)
    } catch {
      case s: SparkException => println("A worker threw an exception.")
      case t: Throwable => throw t
    }
  }

  def func(i: Int) = {
    if (i == 1 || i == 4000)
      throw new SparkException("Bad number detected.")
    else
      Math.pow(i, 2)
  }
}
If you look closely at the example above, you will note that since the original Range contains both 1 and 4000, two failures are guaranteed in the worker nodes. Indeed, I see two executors failing in stderr, while my stdout is populated with:
A worker threw an exception.
Process finished with exit code 0
Unfortunately, the SparkException thrown does not kill the other executors, since, as mentioned before, I can see both executors failing in stderr, while two other executors complete their tasks successfully. So my first question is: is there any way I can immediately kill the other executors once this exception is caught by the driver program?
My second question is a little bit more subtle: I'd like some information to be exchanged from the executors back to the driver about what piece of data caused the error. Sure, I could write to and read from a file, particularly since I'm on the same filesystem, but I'd like a faster and more elegant solution. So I thought I'd subclass SparkException in order to add a field that describes what piece of data caused the error:
import org.apache.spark.SparkException

class WorkerViolation(msg: String, data: Any) extends SparkException(msg) {
  override def toString = "A worker violation occurred: " + msg
  def getData = data
  def this(dat: Any) = this("Error at worker.", dat)
}
The goal is to be able to use the getData accessor to retrieve some information. To that end, I tried modifying the program above, as follows:
...
  catch {
    case w: WorkerViolation => println("A worker threw an exception, with data: " + w.getData)
    case t: Throwable => throw t
  }
}

def func(i: Int) = {
  if (i == 1 || i == 4000)
    throw new WorkerViolation("Bad number detected.", i)
  else
    Math.pow(i, 2)
}
Note that this time I'm both throwing and catching WorkerViolations. Unfortunately, this particular exception seems to be killing the driver node as well. The full trace is of course gigantic, but I've copied it here for completeness:
15/12/07 18:31:17 WARN util.Utils: Your hostname, debian resolves to a loopback address: 127.0.1.1; using 192.168.2.222 instead (on interface eth0)
15/12/07 18:31:17 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/12/07 18:31:17 INFO spark.SecurityManager: Changing view acls to: jason
15/12/07 18:31:17 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(jason)
15/12/07 18:31:17 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/12/07 18:31:17 INFO Remoting: Starting remoting
15/12/07 18:31:17 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@192.168.2.222:33572]
15/12/07 18:31:17 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@192.168.2.222:33572]
15/12/07 18:31:17 INFO spark.SparkEnv: Registering MapOutputTracker
15/12/07 18:31:17 INFO spark.SparkEnv: Registering BlockManagerMaster
15/12/07 18:31:17 INFO storage.DiskBlockManager: Created local directory at /tmp/spark-local-20151207183117-4300
15/12/07 18:31:17 INFO storage.MemoryStore: MemoryStore started with capacity 2.1 GB.
15/12/07 18:31:17 INFO network.ConnectionManager: Bound socket to port 34704 with id = ConnectionManagerId(192.168.2.222,34704)
15/12/07 18:31:17 INFO storage.BlockManagerMaster: Trying to register BlockManager
15/12/07 18:31:17 INFO storage.BlockManagerInfo: Registering block manager 192.168.2.222:34704 with 2.1 GB RAM
15/12/07 18:31:17 INFO storage.BlockManagerMaster: Registered BlockManager
15/12/07 18:31:17 INFO spark.HttpServer: Starting HTTP Server
15/12/07 18:31:17 INFO server.Server: jetty-8.1.14.v20131031
15/12/07 18:31:17 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:42426
15/12/07 18:31:17 INFO broadcast.HttpBroadcast: Broadcast server started at http://192.168.2.222:42426
15/12/07 18:31:17 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-0ae72587-14c5-4bfe-a151-2bcafc889ee8
15/12/07 18:31:17 INFO spark.HttpServer: Starting HTTP Server
15/12/07 18:31:17 INFO server.Server: jetty-8.1.14.v20131031
15/12/07 18:31:17 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:55556
15/12/07 18:31:17 INFO server.Server: jetty-8.1.14.v20131031
15/12/07 18:31:17 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
15/12/07 18:31:17 INFO ui.SparkUI: Started SparkUI at http://192.168.2.222:4040
15/12/07 18:31:18 INFO spark.SparkContext: Starting job: collect at SparkExceptions.scala:16
15/12/07 18:31:18 INFO scheduler.DAGScheduler: Got job 0 (collect at SparkExceptions.scala:16) with 8 output partitions (allowLocal=false)
15/12/07 18:31:18 INFO scheduler.DAGScheduler: Final stage: Stage 0(collect at SparkExceptions.scala:16)
15/12/07 18:31:18 INFO scheduler.DAGScheduler: Parents of final stage: List()
15/12/07 18:31:18 INFO scheduler.DAGScheduler: Missing parents: List()
15/12/07 18:31:18 INFO scheduler.DAGScheduler: Submitting Stage 0 (MappedRDD[1] at map at SparkExceptions.scala:14), which has no missing parents
15/12/07 18:31:18 INFO scheduler.DAGScheduler: Submitting 8 missing tasks from Stage 0 (MappedRDD[1] at map at SparkExceptions.scala:14)
15/12/07 18:31:18 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 8 tasks
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Starting task 0.0:0 as TID 0 on executor localhost: localhost (PROCESS_LOCAL)
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Serialized task 0.0:0 as 1350 bytes in 4 ms
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Starting task 0.0:1 as TID 1 on executor localhost: localhost (PROCESS_LOCAL)
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Serialized task 0.0:1 as 1350 bytes in 0 ms
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Starting task 0.0:2 as TID 2 on executor localhost: localhost (PROCESS_LOCAL)
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Serialized task 0.0:2 as 1350 bytes in 0 ms
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Starting task 0.0:3 as TID 3 on executor localhost: localhost (PROCESS_LOCAL)
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Serialized task 0.0:3 as 1350 bytes in 1 ms
15/12/07 18:31:18 INFO executor.Executor: Running task ID 3
15/12/07 18:31:18 INFO executor.Executor: Running task ID 1
15/12/07 18:31:18 INFO executor.Executor: Running task ID 0
15/12/07 18:31:18 INFO executor.Executor: Running task ID 2
15/12/07 18:31:18 ERROR executor.Executor: Exception in task ID 0
A worker violation occurred: Bad number detected.
at SparkExceptions$.func(SparkExceptions.scala:26)
at SparkExceptions$$anonfun$1.apply$mcDI$sp(SparkExceptions.scala:14)
at SparkExceptions$$anonfun$1.apply(SparkExceptions.scala:14)
at SparkExceptions$$anonfun$1.apply(SparkExceptions.scala:14)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/12/07 18:31:18 INFO executor.Executor: Serialized size of result for 2 is 5565
15/12/07 18:31:18 INFO executor.Executor: Serialized size of result for 1 is 5565
15/12/07 18:31:18 INFO executor.Executor: Sending result for 2 directly to driver
15/12/07 18:31:18 INFO executor.Executor: Sending result for 1 directly to driver
15/12/07 18:31:18 INFO executor.Executor: Serialized size of result for 3 is 5565
15/12/07 18:31:18 INFO executor.Executor: Finished task ID 2
15/12/07 18:31:18 INFO executor.Executor: Finished task ID 1
15/12/07 18:31:18 INFO executor.Executor: Sending result for 3 directly to driver
15/12/07 18:31:18 INFO executor.Executor: Finished task ID 3
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Starting task 0.0:4 as TID 4 on executor localhost: localhost (PROCESS_LOCAL)
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Serialized task 0.0:4 as 1350 bytes in 0 ms
15/12/07 18:31:18 INFO executor.Executor: Running task ID 4
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Starting task 0.0:5 as TID 5 on executor localhost: localhost (PROCESS_LOCAL)
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Serialized task 0.0:5 as 1350 bytes in 1 ms
15/12/07 18:31:18 INFO executor.Executor: Running task ID 5
15/12/07 18:31:18 WARN scheduler.TaskSetManager: Lost TID 0 (task 0.0:0)
15/12/07 18:31:18 INFO executor.Executor: Serialized size of result for 4 is 5565
15/12/07 18:31:18 INFO executor.Executor: Sending result for 4 directly to driver
15/12/07 18:31:18 INFO executor.Executor: Finished task ID 4
15/12/07 18:31:18 WARN scheduler.TaskSetManager: Loss was due to helpers.WorkerViolation
A worker violation occurred: Bad number detected.
at SparkExceptions$.func(SparkExceptions.scala:26)
at SparkExceptions$$anonfun$1.apply$mcDI$sp(SparkExceptions.scala:14)
at SparkExceptions$$anonfun$1.apply(SparkExceptions.scala:14)
at SparkExceptions$$anonfun$1.apply(SparkExceptions.scala:14)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/12/07 18:31:18 INFO executor.Executor: Serialized size of result for 5 is 5565
15/12/07 18:31:18 INFO executor.Executor: Sending result for 5 directly to driver
15/12/07 18:31:18 INFO executor.Executor: Finished task ID 5
15/12/07 18:31:18 ERROR scheduler.TaskSetManager: Task 0.0:0 failed 1 times; aborting job
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Finished TID 2 in 27 ms on localhost (progress: 1/8)
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Finished TID 1 in 30 ms on localhost (progress: 2/8)
15/12/07 18:31:18 INFO scheduler.TaskSchedulerImpl: Cancelling stage 0
15/12/07 18:31:18 INFO scheduler.TaskSchedulerImpl: Stage 0 was cancelled
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Finished TID 4 in 11 ms on localhost (progress: 3/8)
15/12/07 18:31:18 INFO scheduler.DAGScheduler: Failed to run collect at SparkExceptions.scala:16
15/12/07 18:31:18 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 1 times, most recent failure: Exception failure in TID 0 on host localhost: A worker violation occurred: Bad number detected.
SparkExceptions$.func(SparkExceptions.scala:26)
SparkExceptions$$anonfun$1.apply$mcDI$sp(SparkExceptions.scala:14)
SparkExceptions$$anonfun$1.apply(SparkExceptions.scala:14)
SparkExceptions$$anonfun$1.apply(SparkExceptions.scala:14)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
scala.collection.AbstractIterator.to(Iterator.scala:1157)
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
org.apache.spark.scheduler.Task.run(Task.scala:51)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Finished TID 5 in 11 ms on localhost (progress: 4/8)
15/12/07 18:31:18 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/12/07 18:31:18 INFO scheduler.TaskSetManager: Finished TID 3 in 34 ms on localhost (progress: 5/8)
15/12/07 18:31:18 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
Process finished with exit code 1
So my second question would then be: Why does throwing an exception of a class derived from SparkException kill the driver program as well? Is there a different strategy I can use for executor-driver communication?
FWIW, I have decided that, in order to allow a higher degree of message passing between nodes, dropping down to the level of Akka actors is the preferred way to go.
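That said, the driver-side trace above hints at why the typed catch arm never fires: the failure reaches the main thread as a fresh org.apache.spark.SparkException ("Job aborted due to stage failure: ...") whose message merely embeds the remote WorkerViolation text, so case w: WorkerViolation can never match and the case t: Throwable arm rethrows, taking the driver down. Under that assumption, here is a minimal sketch of a catch that at least recognizes the failure; the string matching is illustrative, not a stable API:

import org.apache.spark.SparkException

try {
  val res = mappedRDD.collect()
  println(res)
} catch {
  // The WorkerViolation instance itself never crosses back to the driver;
  // only its toString survives inside the wrapping SparkException's message.
  case s: SparkException if s.getMessage != null &&
      s.getMessage.contains("A worker violation occurred") =>
    println("A worker reported a violation: " + s.getMessage)
  case t: Throwable => throw t
}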

Spark cluster can't assign resources from remote scala application

So, I've been trying to get Spark with Scala off the ground. I've written a simple test program, which just extends the SparkPi example a bit:
def main(args: Array[String]): Unit = {
  test()
}

def calcPi(spark: SparkContext, args: Array[String], numSlices: Long): Array[Double] = {
  val start = System.nanoTime()
  val slices = if (args.length > 0) args(0).toInt else 2
  val n = math.min(numSlices * slices, Int.MaxValue).toInt // avoid overflow
  val count = spark.parallelize(1 until n, slices).map { i =>
    val x = random * 2 - 1
    val y = random * 2 - 1
    if (x*x + y*y < 1) 1 else 0
  }.reduce(_ + _)
  val piVal = 4.0 * count / n
  println("Pi is roughly " + piVal)
  spark.stop()
  val end = System.nanoTime()
  return Array(piVal, end - start, (piVal - Math.PI)/Math.PI)
}

def test(): Unit = {
  val conf = new SparkConf().setAppName("Pi Test")
  conf.setSparkHome("/usr/local/spark")
  conf.setMaster("spark://<URL_OF_SPARK_CLUSTER>:7077")
  conf.set("spark.executor.memory", "512m")
  conf.set("spark.cores.max", "1")
  conf.set("spark.blockManager.port", "33291")
  conf.set("spark.executor.port", "33292")
  conf.set("spark.broadcast.port", "33293")
  conf.set("spark.fileserver.port", "33294")
  conf.set("spark.driver.port", "33296")
  conf.set("spark.replClassServer.port", "33297")
  val sc = new SparkContext(conf)
  val pi = calcPi(sc, Array(), 1000)
  for (item <- pi) {
    println(item)
  }
}
I then made sure that ports 33291-33300 are open on my machine.
When I run the program, it successfully hits the Spark cluster and seems to assign cores.
But when the program gets to the point where it's actually running the Hadoop job, the application logs say:
15/12/07 11:50:21 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at BotDetector.scala:49), which has no missing parents
15/12/07 11:50:21 INFO MemoryStore: ensureFreeSpace(1840) called with curMem=0, maxMem=2061647216
15/12/07 11:50:21 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1840.0 B, free 1966.1 MB)
15/12/07 11:50:21 INFO MemoryStore: ensureFreeSpace(1194) called with curMem=1840, maxMem=2061647216
15/12/07 11:50:21 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1194.0 B, free 1966.1 MB)
15/12/07 11:50:21 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.5.106:33291 (size: 1194.0 B, free: 1966.1 MB)
15/12/07 11:50:21 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:874
15/12/07 11:50:21 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at BotDetector.scala:49)
15/12/07 11:50:21 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
15/12/07 11:50:36 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
15/12/07 11:50:51 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
15/12/07 11:51:06 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
15/12/07 11:51:21 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
15/12/07 11:51:36 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
15/12/07 11:51:51 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
15/12/07 11:52:06 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
15/12/07 11:52:21 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
15/12/07 11:52:22 INFO AppClient$ClientActor: Executor updated: app-20151207175020-0003/0 is now EXITED (Command exited with code 1)
15/12/07 11:52:22 INFO SparkDeploySchedulerBackend: Executor app-20151207175020-0003/0 removed: Command exited with code 1
15/12/07 11:52:22 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 0
15/12/07 11:52:22 INFO AppClient$ClientActor: Executor added: app-20151207175020-0003/1 on worker-20151207173821-10.240.0.7-33295 (10.240.0.7:33295) with 5 cores
15/12/07 11:52:22 INFO SparkDeploySchedulerBackend: Granted executor ID app-20151207175020-0003/1 on hostPort 10.240.0.7:33295 with 5 cores, 512.0 MB RAM
15/12/07 11:52:22 INFO AppClient$ClientActor: Executor updated: app-20151207175020-0003/1 is now LOADING
15/12/07 11:52:23 INFO AppClient$ClientActor: Executor updated: app-20151207175020-0003/1 is now RUNNING
15/12/07 11:52:36 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
When I go onto the remote server and look at the worker logs, they say:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hduser/apache-tez-0.7.0-src/tez-dist/target/tez-0.7.0/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/12/07 17:50:21 INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
15/12/07 17:50:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/12/07 17:50:21 INFO spark.SecurityManager: Changing view acls to: hduser,jschirmer
15/12/07 17:50:21 INFO spark.SecurityManager: Changing modify acls to: hduser,jschirmer
15/12/07 17:50:21 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hduser, jschirmer); users with modify permissions: Set(hduser, jschirmer)
15/12/07 17:50:22 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/12/07 17:50:22 INFO Remoting: Starting remoting
15/12/07 17:50:22 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverPropsFetcher@10.240.0.7:33292]
15/12/07 17:50:22 INFO util.Utils: Successfully started service 'driverPropsFetcher' on port 33292.
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1672)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:65)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:146)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:245)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:97)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:159)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:65)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
... 4 more
15/12/07 17:52:22 INFO util.Utils: Shutdown hook called
I've tried setting the driver and executor ports to explicitly opened ports, with the same result. It's unclear what the problem is. Does anyone have any advice?
Also note that if I compile this exact same code into a fat jar, copy it to the remote server, and run it through spark-submit, it runs successfully. I do have a YARN configuration defined on my server, and I'm open to running Spark on YARN, but my understanding is that this cannot be done from a remote server, since you specify the master as yarn-cluster and there's no place in the config to put the cluster's host.
It seems you have a firewall problem. First, check whether all of the required ports are actually open in your cluster. Beyond that, Spark chooses several of its ports at random by default, so you need to fix those ports to known values for your cluster; only then can you use Spark remotely.
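For example (a minimal sketch, assuming Spark 1.x standalone mode; <ROUTABLE_IP_OF_DRIVER> is a placeholder for an address of your driver machine that the workers can route back to), pinning the ports and advertising a reachable driver address might look like:
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .setAppName("Pi Test")
  .setMaster("spark://<URL_OF_SPARK_CLUSTER>:7077")
  .set("spark.driver.host", "<ROUTABLE_IP_OF_DRIVER>") // executors connect back to this address
  .set("spark.driver.port", "33296")       // fixed instead of random
  .set("spark.blockManager.port", "33291") // likewise for the other services
  .set("spark.fileserver.port", "33294")
  .set("spark.broadcast.port", "33293")
The "Futures timed out after [120 seconds]" error in your worker log is consistent with this: the executor starts but cannot open a connection back to the driver to fetch its properties, which is exactly what a blocked or unroutable driver port would cause.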

How to read HDF data from HDFS in Hadoop

I am working on image processing on Hadoop, using HDF satellite data. I can access and use jpg and other image types in Hadoop streaming, but with HDF data it fails: Hadoop could not read the HDF data from HDFS, and it takes more than twenty minutes before the error even appears. My HDF data is a single file of more than 150 MB.
How can I solve this problem and make Hadoop read this HDF data from HDFS?
Here is the output from my run:
hadoop#master:/usr/local/master/hdf/examples$ ./runD1.sh
Buildfile: /usr/local/master/hdf/build.xml
downloader:
setup:
test_settings:
compile:
BUILD SUCCESSFUL
Total time: 0 seconds
Output HIB: /var/www/html/uploads/
14/09/26 15:28:46 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
Found host successfully: 0
Repeated host: 1
Repeated host: 2
Repeated host: 3
Tried to get 2 nodes, got 1
14/09/26 15:28:46 INFO input.FileInputFormat: Total input paths to process : 1
First n-1 nodes responsible for 1592259 images
Last node responsible for 1592259 images
14/09/26 15:29:04 INFO mapred.JobClient: Running job: job_201409191212_0006
14/09/26 15:29:05 INFO mapred.JobClient: map 0% reduce 0%
14/09/26 15:39:15 INFO mapred.JobClient: Task Id : attempt_201409191212_0006_m_000000_0, Status : FAILED
Task attempt_201409191212_0006_m_000000_0 failed to report status for 600 seconds. Killing!
14/09/26 15:49:17 INFO mapred.JobClient: Task Id : attempt_201409191212_0006_m_000000_1, Status : FAILED
Task attempt_201409191212_0006_m_000000_1 failed to report status for 600 seconds. Killing!
14/09/26 15:59:19 INFO mapred.JobClient: Task Id : attempt_201409191212_0006_m_000000_2, Status : FAILED
Task attempt_201409191212_0006_m_000000_2 failed to report status for 600 seconds. Killing!
Error log is:
2014-09-26 15:38:45,133 INFO org.apache.hadoop.mapred.JvmManager: In JvmRunner constructed JVM ID: jvm_201409191212_0006_m_-1211757488
2014-09-26 15:38:45,133 INFO org.apache.hadoop.mapred.JvmManager: JVM Runner jvm_201409191212_0006_m_-1211757488 spawned.
2014-09-26 15:38:45,136 INFO org.apache.hadoop.mapred.TaskController: Writing commands to /usr/local/master/temp/mapred/local/ttprivate/taskTracker/hadoop/jobcache/job_201409191212_0006/attempt_201409191212_0006_m_000000_0.cleanup/taskjvm.sh
2014-09-26 15:38:45,631 INFO org.apache.hadoop.mapred.TaskTracker: JVM with ID: jvm_201409191212_0006_m_-1211757488 given task: attempt_201409191212_0006_m_000000_0
2014-09-26 15:38:46,145 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201409191212_0006_m_000000_0 0.0%
2014-09-26 15:38:46,198 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201409191212_0006_m_000000_0 0.0% cleanup
2014-09-26 15:38:46,200 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_201409191212_0006_m_000000_0 is done.
2014-09-26 15:38:46,200 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_201409191212_0006_m_000000_0 was -1
2014-09-26 15:38:46,200 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 2
2014-09-26 15:38:46,340 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201409191212_0006_m_-1211757488 exited with exit code 0. Number of tasks it ran: 1
Can anyone please help me solve this problem?

Sqoop installation, export and import from PostgreSQL

I've just installed Sqoop and was testing it. I tried to export some data from HDFS to PostgreSQL using Sqoop. When I run it, it throws the following exception: java.io.IOException: Can't export data, please check task tracker logs. I think there may also have been a problem with the installation.
The file content is:
ustNU 45
MB1bA 0
gNbCO 76
iZP10 39
B2aoo 45
SI7eG 93
5sC4k 60
2IhFV 2
u2A48 16
yvy6R 51
LNhsV 26
mZ2yn 65
80Gp3 43
Wk5Ag 85
VUfyp 93
P077j 94
f1Oj5 11
LxJkg 72
0H7NP 99
Dk406 25
g4KRp 76
Fw3U0 80
6LD59 1
07KHx 91
F1S88 72
Bnb0v 85
A2qM7 79
Z6cAt 81
0M3DO 23
m0s09 44
KIvwd 13
GNUD0 78
um93a 20
19bHv 75
4Of3s 75
5hFen 16
This is the Postgres table:
Table "public.mysort"
Column | Type | Modifiers
--------+---------+-----------
name | text |
marks | integer |
The sqoop command is:
sqoop export --connect jdbc:postgresql://localhost/testdb --username akshay --password akshay --table mysort -m 1 --export-dir MySort/input
Followed by the error:
Warning: /usr/lib/hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
14/06/11 18:28:06 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
14/06/11 18:28:06 INFO manager.SqlManager: Using default fetchSize of 1000
14/06/11 18:28:06 INFO tool.CodeGenTool: Beginning code generation
14/06/11 18:28:06 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "mysort" AS t LIMIT 1
14/06/11 18:28:06 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/local/hadoop
Note: /tmp/sqoop-hduser/compile/0402ad4b5cf7980040264af35de406cb/mysort.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
14/06/11 18:28:07 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hduser/compile/0402ad4b5cf7980040264af35de406cb/mysort.jar
14/06/11 18:28:07 INFO mapreduce.ExportJobBase: Beginning export of mysort
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hbase/lib/slf4j-log4j12-1.6.4.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /usr/local/hadoop/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
14/06/11 18:28:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/06/11 18:28:22 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
14/06/11 18:28:23 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
14/06/11 18:28:23 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
14/06/11 18:28:23 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
14/06/11 18:28:23 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/06/11 18:28:24 INFO input.FileInputFormat: Total input paths to process : 1
14/06/11 18:28:24 INFO input.FileInputFormat: Total input paths to process : 1
14/06/11 18:28:25 INFO mapreduce.JobSubmitter: number of splits:1
14/06/11 18:28:25 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1402488523460_0003
14/06/11 18:28:25 INFO impl.YarnClientImpl: Submitted application application_1402488523460_0003
14/06/11 18:28:25 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1402488523460_0003/
14/06/11 18:28:25 INFO mapreduce.Job: Running job: job_1402488523460_0003
14/06/11 18:28:46 INFO mapreduce.Job: Job job_1402488523460_0003 running in uber mode : false
14/06/11 18:28:46 INFO mapreduce.Job: map 0% reduce 0%
14/06/11 18:29:04 INFO mapreduce.Job: Task Id : attempt_1402488523460_0003_m_000000_0, Status : FAILED
Error: java.io.IOException: Can't export data, please check task tracker logs
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.util.NoSuchElementException
at java.util.ArrayList$Itr.next(ArrayList.java:839)
at mysort.__loadFromFields(mysort.java:198)
at mysort.parse(mysort.java:147)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:83)
... 10 more
14/06/11 18:29:23 INFO mapreduce.Job: Task Id : attempt_1402488523460_0003_m_000000_1, Status : FAILED
Error: java.io.IOException: Can't export data, please check task tracker logs
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.util.NoSuchElementException
at java.util.ArrayList$Itr.next(ArrayList.java:839)
at mysort.__loadFromFields(mysort.java:198)
at mysort.parse(mysort.java:147)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:83)
... 10 more
14/06/11 18:29:42 INFO mapreduce.Job: Task Id : attempt_1402488523460_0003_m_000000_2, Status : FAILED
Error: java.io.IOException: Can't export data, please check task tracker logs
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.util.NoSuchElementException
at java.util.ArrayList$Itr.next(ArrayList.java:839)
at mysort.__loadFromFields(mysort.java:198)
at mysort.parse(mysort.java:147)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:83)
... 10 more
14/06/11 18:30:03 INFO mapreduce.Job: map 100% reduce 0%
14/06/11 18:30:03 INFO mapreduce.Job: Job job_1402488523460_0003 failed with state FAILED due to: Task failed task_1402488523460_0003_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0
14/06/11 18:30:03 INFO mapreduce.Job: Counters: 9
Job Counters
Failed map tasks=4
Launched map tasks=4
Other local map tasks=3
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=69336
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=69336
Total vcore-seconds taken by all map tasks=69336
Total megabyte-seconds taken by all map tasks=71000064
14/06/11 18:30:03 WARN mapreduce.Counters: Group FileSystemCounters is deprecated. Use org.apache.hadoop.mapreduce.FileSystemCounter instead
14/06/11 18:30:03 INFO mapreduce.ExportJobBase: Transferred 0 bytes in 100.1476 seconds (0 bytes/sec)
14/06/11 18:30:03 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
14/06/11 18:30:03 INFO mapreduce.ExportJobBase: Exported 0 records.
14/06/11 18:30:03 ERROR tool.ExportTool: Error during export: Export job failed!
This is the log file :
2014-06-11 17:54:37,601 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2014-06-11 17:54:37,602 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2014-06-11 17:54:52,678 WARN [main] org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2014-06-11 17:54:52,777 INFO [main] org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2014-06-11 17:54:52,846 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2014-06-11 17:54:52,847 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system started
2014-06-11 17:54:52,855 INFO [main] org.apache.hadoop.mapred.YarnChild: Executing with tokens:
2014-06-11 17:54:52,855 INFO [main] org.apache.hadoop.mapred.YarnChild: Kind: mapreduce.job, Service: job_1402488523460_0002, Ident: (org.apache.hadoop.mapreduce.security.token.JobTokenIdentifier@971d0d8)
2014-06-11 17:54:52,901 INFO [main] org.apache.hadoop.mapred.YarnChild: Sleeping for 0ms before retrying again. Got null now.
2014-06-11 17:54:53,165 INFO [main] org.apache.hadoop.mapred.YarnChild: mapreduce.cluster.local.dir for child: /tmp/hadoop-hduser/nm-local-dir/usercache/hduser/appcache/application_1402488523460_0002
2014-06-11 17:54:53,249 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2014-06-11 17:54:53,249 WARN [main] org.apache.hadoop.conf.Configuration: job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2014-06-11 17:54:53,393 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
2014-06-11 17:54:53,689 INFO [main] org.apache.hadoop.mapred.Task: Using ResourceCalculatorProcessTree : [ ]
2014-06-11 17:54:53,899 INFO [main] org.apache.hadoop.mapred.MapTask: Processing split: Paths:/user/hduser/MySort/input/data.txt:0+891082
2014-06-11 17:54:53,904 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: map.input.file is deprecated. Instead, use mapreduce.map.input.file
2014-06-11 17:54:53,904 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: map.input.start is deprecated. Instead, use mapreduce.map.input.start
2014-06-11 17:54:53,904 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: map.input.length is deprecated. Instead, use mapreduce.map.input.length
2014-06-11 17:54:54,028 ERROR [main] org.apache.sqoop.mapreduce.TextExportMapper:
2014-06-11 17:54:54,028 ERROR [main] org.apache.sqoop.mapreduce.TextExportMapper: Exception raised during data export
2014-06-11 17:54:54,028 ERROR [main] org.apache.sqoop.mapreduce.TextExportMapper:
2014-06-11 17:54:54,028 ERROR [main] org.apache.sqoop.mapreduce.TextExportMapper: Exception:
java.util.NoSuchElementException
at java.util.ArrayList$Itr.next(ArrayList.java:839)
at mysort.__loadFromFields(mysort.java:198)
at mysort.parse(mysort.java:147)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:83)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
2014-06-11 17:54:54,030 ERROR [main] org.apache.sqoop.mapreduce.TextExportMapper: On input: ustNU 45
2014-06-11 17:54:54,031 ERROR [main] org.apache.sqoop.mapreduce.TextExportMapper: On input file: hdfs://localhost:9000/user/hduser/MySort/input/data.txt
2014-06-11 17:54:54,031 ERROR [main] org.apache.sqoop.mapreduce.TextExportMapper: At position 0
2014-06-11 17:54:54,031 ERROR [main] org.apache.sqoop.mapreduce.TextExportMapper:
2014-06-11 17:54:54,031 ERROR [main] org.apache.sqoop.mapreduce.TextExportMapper: Currently processing split:
2014-06-11 17:54:54,031 ERROR [main] org.apache.sqoop.mapreduce.TextExportMapper: Paths:/user/hduser/MySort/input/data.txt:0+891082
2014-06-11 17:54:54,031 ERROR [main] org.apache.sqoop.mapreduce.TextExportMapper:
2014-06-11 17:54:54,031 ERROR [main] org.apache.sqoop.mapreduce.TextExportMapper: This issue might not necessarily be caused by current input
2014-06-11 17:54:54,031 ERROR [main] org.apache.sqoop.mapreduce.TextExportMapper: due to the batching nature of export.
2014-06-11 17:54:54,031 ERROR [main] org.apache.sqoop.mapreduce.TextExportMapper:
2014-06-11 17:54:54,032 INFO [Thread-12] org.apache.sqoop.mapreduce.AutoProgressMapper: Auto-progress thread is finished. keepGoing=false
2014-06-11 17:54:54,033 WARN [main] org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hduser (auth:SIMPLE) cause:java.io.IOException: Can't export data, please check task tracker logs
2014-06-11 17:54:54,033 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.io.IOException: Can't export data, please check task tracker logs
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.util.NoSuchElementException
at java.util.ArrayList$Itr.next(ArrayList.java:839)
at mysort.__loadFromFields(mysort.java:198)
at mysort.parse(mysort.java:147)
at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:83)
... 10 more
2014-06-11 17:54:54,037 INFO [main] org.apache.hadoop.mapred.Task: Runnning cleanup for the task
Any help in resolving the issue is appreciated.
Here is the complete procedure for installing Sqoop, along with export and import commands. Hopefully it will be helpful to someone; I have tried and tested this myself and it actually works.
Download: apache.mirrors.tds.net/sqoop/1.4.4/sqoop-1.4.4.bin__hadoop-2.0.4-alpha.tar.gz
tar -xzf sqoop-1.4.4.bin__hadoop-2.0.4-alpha.tar.gz
sudo mv sqoop-1.4.4.bin__hadoop-2.0.4-alpha /usr/lib/sqoop
Copy and paste the following two lines into your .bashrc:
export SQOOP_HOME=/usr/lib/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
Go to the /usr/lib/sqoop/conf folder, copy sqoop-env-template.sh to a new file sqoop-env.sh, and change the export HADOOP_HOME, HBASE_HOME, etc. lines to point at your installation directories.
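For example (a sketch of sqoop-env.sh; the paths assume Hadoop lives in /usr/local/hadoop and HBase in /usr/lib/hbase, as in the logs above, so adjust them to your own layout):
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
export HBASE_HOME=/usr/lib/hbase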
Download the PostgreSQL connector jar file from jdbc.postgresql.org/download/postgresql-9.3-1101.jdbc41.jar
Create a directory manager.d in sqoop/conf/.
Create a file named postgresql in conf/ and add the following line to it:
org.postgresql.Driver=/usr/lib/sqoop/lib/postgresql-9.3-1101.jdbc41.jar
Put the downloaded connector jar into /usr/lib/sqoop/lib/, named so that it matches the path above.
For Export:
Create a user in postgres:
createuser -P -s -e ace
Enter password for new role: ace
Enter it again: ace
CREATE DATABASE testdb OWNER ace TABLESPACE ace;
create table stud1(id int, name text);
Create a file student.txt and add lines such as:
1,Ace
2,iloveapis
hadoop fs -put student.txt
sqoop export --connect jdbc:postgresql://localhost:5432/testdb --username ace --password ace --table stud1 -m 1 --export-dir student.txt
Check in Postgres: SELECT * FROM stud1;
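Note that this works because student.txt is comma-delimited, which matches Sqoop's default input field delimiter. My guess is that your original export failed with the NoSuchElementException in __loadFromFields because data.txt separates its two fields with whitespace instead of commas; if so, telling Sqoop about the delimiter should help, for example (assuming the fields are tab-separated; adjust the character to whatever your file really uses):
sqoop export --connect jdbc:postgresql://localhost/testdb --username akshay --password akshay --table mysort -m 1 --export-dir MySort/input --input-fields-terminated-by '\t'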
For Import:
sqoop import --connect jdbc:postgresql://localhost:5432/testdb --username ace --password ace --table stud1 -m 1
hadoop fs -ls -R stud1
Expected Output:
-rw-r--r-- 1 hduser supergroup 0 2014-06-13 18:10 stud1/_SUCCESS
-rw-r--r-- 1 hduser supergroup 21 2014-06-13 18:10 stud1/part-m-00000
hadoop fs -cat stud1/part-m-00000
Expected Output:
1,Ace
2,iloveapis
hadoop fs -copyToLocal stud1/part-m-00000 $HOME/imported_data.txt