I am trying to run KMeans on Hadoop, following these guidelines:
http://www.slideshare.net/titusdamaiyanti/hadoop-installation-k-means-clustering-mapreduce?qid=44b5881c-089d-474b-b01d-c35a2f91cc67&v=qf1&b=&from_search=1#likes-panel
I am running this in Eclipse Luna. When I execute the job, both map and reduce show 100% complete, but I am not getting any output. Instead, I get the following error at the end. Please help me solve this.
15/03/20 11:29:44 INFO mapred.JobClient: Cleaning up the staging area file:/tmp/hadoop-hduser/mapred/staging/hduser378797276/.staging/job_local378797276_0002
15/03/20 11:29:44 ERROR security.UserGroupInformation: PriviledgedActionException as:hduser cause:java.io.IOException: No input paths specified in job
Exception in thread "main" java.io.IOException: No input paths specified in job
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:193)
at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:55)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1054)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1071)
at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:550)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:580)
at com.clustering.mapreduce.KMeansClusteringJob.main(KMeansClusteringJob.java:114)
You have to provide the input file location before running the MapReduce program. There are two ways of providing it:
In Eclipse, go to Run Configurations and provide the file locations as program arguments.
Package your program as a jar file and run the command below inside your Hadoop cluster:
hadoop jar NameOfYourJarFile InputFileLocation OutputFileLocation
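In either case the driver has to pick those arguments up and register them with the job. A minimal driver sketch (the class name KMeansDriver and the omission of the mapper/reducer/k-means setup are illustrative assumptions, not the tutorial's actual code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KMeansDriver {
    public static void main(String[] args) throws Exception {
        // args[0] = input path, args[1] = output path
        // (passed via Eclipse Run Configurations or on the "hadoop jar" command line)
        Configuration conf = new Configuration();
        Job job = new Job(conf, "kmeans");
        job.setJarByClass(KMeansDriver.class);
        // Mapper, reducer, key/value classes and the k-means centroid setup go here.
        // Without the next two lines the job fails with "No input paths specified in job".
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}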
Related
I am running a Spark job in Cloudera Data Science Workbench. Sometimes it runs okay, but sometimes it fails with this error:
log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: /tmp/spark-driver.log (Permission denied)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at java.io.FileOutputStream.<init>(FileOutputStream.java:133)
at org.apache.log4j.FileAppender.setFile(FileAppender.java:294)
at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:165)
Upon checking, the file exists:
cdsw@jw4l5ll7jj0l3bcy:~$ ls /tmp/spark-driver.log
/tmp/spark-driver.log
I've already looked at the Spark UI logs and can't find any other error; this is the only one we found. I'm getting desperate for answers, so any leads would be appreciated.
Thanks!
The stack trace indicates a permission issue with the /tmp directory on your machines. You can refer to the following answer by Michail N. I hope it solves your issue.
I will try to describe my problem below:
We have a small Greenplum (gpdb) cluster, and we are trying to do data integration on it using the Talend tool.
We are trying to load incremental data from one table to another, quite simple... or so I thought...
The job data flow is:
tgreenplumconnection
|
tmssqlinput--->thdfsoutput-->tmap-->tgreenplumgpload--tgreenplumcommit
I am getting this error:
Exception in thread "Thread-1" java.lang.RuntimeException: Cannot run program "gpload": CreateProcess error=2, The system cannot find the file specified
at bigdata.sormaster_stg0_copy_0_1.SorMaster_stg0_Copy$2.run(SorMaster_stg0_Copy.java:6425)
Caused by: java.io.IOException: CreateProcess error=2, The system cannot find the file specified
at java.lang.ProcessImpl.create(Native Method)
at java.lang.ProcessImpl.<init>(ProcessImpl.java:386)
at java.lang.ProcessImpl.start(ProcessImpl.java:137)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
at java.lang.Runtime.exec(Runtime.java:620)
at java.lang.Runtime.exec(Runtime.java:528)
at bigdata.sormaster_stg0_copy_0_1.SorMaster_stg0_Copy$2.run(SorMaster_stg0_Copy.java:6413)
After configuring ssh and rsync, when I try to run the Scalding tutorial (https://github.com/Cascading/scalding-tutorial/) with the command:
$ scripts/scald.rb --hdfs tutorial/Tutorial0.scala
I get the following error:
com.twitter.scalding.InvalidSourceException: [com.twitter.scalding.TextLineWrappedArray(tutorial/data/hello.txt)] Data is missing from one or more paths in: List(tutorial/data/hello.txt)
This error happens even though the file tutorial/data/hello.txt really exists.
How can I fix this?
Stdout:
$ scripts/scald.rb --hdfs tutorial/Tutorial0.scala
scripts/scald.rb:194: warning: already initialized constant SCALA_LIB_DIR
dkondratev@hadoop-n002.maxus.lan's password:
dkondratev@hadoop-n002.maxus.lan's password:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/phoenix/phoenix-4.0.0.2.1.2.1-471-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
14/07/07 19:05:45 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
14/07/07 19:05:45 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
Exception in thread "main" java.lang.Throwable: GUESS: Data is missing from the path you provided.
If you know what exactly caused this error, please consider contributing to GitHub via following link.
https://github.com/twitter/scalding/wiki/Common-Exceptions-and-possible-reasons#comtwitterscaldinginvalidsourceexception
at com.twitter.scalding.Tool$.main(Tool.scala:132)
at com.twitter.scalding.Tool.main(Tool.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
I think the problem you are having is that you are telling Scalding to run on HDFS, but the file you are providing as input is in your local file system, not in HDFS. Before running the example, upload the file to HDFS:
hadoop fs -mkdir tutorial
hadoop fs -mkdir tutorial/data
hadoop fs -put tutorial/data/hello.txt tutorial/data/hello.txt
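If you want to double-check which filesystem a relative path like tutorial/data/hello.txt resolves to, a small sketch using the Hadoop FileSystem API can help (the class name CheckPath is just an illustrative choice):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckPath {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml, so fs.defaultFS decides whether this is HDFS or the local FS
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path p = new Path("tutorial/data/hello.txt");
        // Relative paths are resolved against the user's home directory on the default filesystem
        System.out.println("Default FS: " + fs.getUri());
        System.out.println("Resolved:   " + fs.makeQualified(p));
        System.out.println("Exists:     " + fs.exists(p));
    }
}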
Try packing your job into a fat jar using the Maven Shade plugin and then run your Scalding job via the hadoop command:
hadoop jar your-uber.jar com.twitter.scalding.Tool bar.foo.MyClassJob --hdfs --input ... --output ...
I often find that Spark fails on large jobs with a rather unhelpful, meaningless exception. The worker logs look normal, with no errors, but they end up in state "KILLED". This is extremely common for large shuffles, e.g. operations like .distinct.
The question is: how do I diagnose what's going wrong, and ideally, how do I fix it?
Given that a lot of these operations are monoidal, I've been working around the problem by splitting the data into, say, 10 chunks, running the app on each chunk, and then running the app on all of the resulting outputs. In other words: meta-map-reduce.
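(For context, here is a rough sketch of that chunking workaround using the Java RDD API; the paths, chunk count and class name are purely illustrative.)

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ChunkedDistinct {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("chunked-distinct"));
        int numChunks = 10;
        // Pass 1: run the (monoidal) operation on each chunk separately
        for (int i = 0; i < numChunks; i++) {
            JavaRDD<String> chunk = sc.textFile("hdfs:///input/chunk-" + i);
            chunk.distinct().saveAsTextFile("hdfs:///intermediate/chunk-" + i);
        }
        // Pass 2: combine the per-chunk outputs with one final, much smaller shuffle
        JavaRDD<String> combined = sc.textFile("hdfs:///intermediate/chunk-*");
        combined.distinct().saveAsTextFile("hdfs:///output/final");
        sc.stop();
    }
}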
14/06/04 12:56:09 ERROR client.AppClient$ClientActor: Master removed our application: FAILED; stopping client
14/06/04 12:56:09 WARN cluster.SparkDeploySchedulerBackend: Disconnected from Spark cluster! Waiting for reconnection...
14/06/04 12:56:09 WARN scheduler.TaskSetManager: Loss was due to java.io.IOException
java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:703)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:779)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:840)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:159)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:143)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:164)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:149)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:27)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:176)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:257)
at scala.collection.AbstractIterator.toList(Iterator.scala:1157)
at $line5.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:13)
at $line5.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:13)
at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:450)
at org.apache.spark.rdd.RDD$$anonfun$1.apply(RDD.scala:450)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:241)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:232)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
at org.apache.spark.scheduler.Task.run(Task.scala:53)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:41)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:41)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
As of September 1st, 2014, this is an "open improvement" in Spark; please see https://issues.apache.org/jira/browse/SPARK-3052. As syrza pointed out in that link, the shutdown hooks are likely run in the wrong order when an executor fails, which results in this message. You will have to do a little more investigation to figure out the main cause of the problem (i.e. why your executor failed). If it is a large shuffle, it might be an out-of-memory error that causes the executor to fail, which then causes the Hadoop FileSystem to be closed in its shutdown hook. So the RecordReaders in the running tasks of that executor throw the "java.io.IOException: Filesystem closed" exception. I guess it will be fixed in a subsequent release and then you will get a more helpful error message :)
Something calls DFSClient.close() or DFSClient.abort(), closing the client. The next file operation then results in the above exception.
I would try to figure out what calls close()/abort(). You could use a breakpoint in your debugger, or modify the Hadoop source code to throw an exception in these methods, so you would get a stack trace.
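For example, if you go the patched-Hadoop route, an illustrative sketch (not the real method body) would log the caller's stack trace at the top of DFSClient.close():

// Sketch only: inside org.apache.hadoop.hdfs.DFSClient in a locally patched build
public void close() throws IOException {
    // Record who is closing the client, with a full stack trace, before the original logic runs
    LOG.warn("DFSClient.close() called", new Exception("close() caller stack trace"));
    // ... original close() body follows unchanged ...
}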
The "Filesystem closed" exception can be resolved when the Spark job is running on a cluster. You can set properties like spark.executor.cores, spark.driver.cores and spark.akka.threads to the maximum values appropriate for your available resources. I had the same problem when my dataset was pretty large, about 20 million records of JSON data. I fixed it with the above properties and it ran like a charm. In my case, I set those properties to 25, 25 and 20 respectively. Hope it helps!
Reference Link:
http://spark.apache.org/docs/latest/configuration.html
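For reference, a minimal sketch of setting those properties programmatically (the values 25, 25 and 20 are the ones from this answer and the class name is illustrative; tune everything to your own cluster):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class TunedContext {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("tuned-job")
                .set("spark.executor.cores", "25")  // values from the answer above; adjust to your cluster
                .set("spark.driver.cores", "25")
                .set("spark.akka.threads", "20");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... job logic goes here ...
        sc.stop();
    }
}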
I'm trying to rewrite the Mule start script so it works as a service on RHEL.
Currently I have it mostly done.
It is starting, and I have most of the log files being written where I want them.
But there is a file literally named .log that I do not know what it is for, nor where to configure its name and path.
That file is adding the following nasty lines to mule_ee.log upon startup:
log4j:ERROR setFile(null,true) call failed.
java.io.FileNotFoundException: .log (Permission denied)
at java.io.FileOutputStream.openAppend(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:207)
at java.io.FileOutputStream.<init>(FileOutputStream.java:131)
at org.apache.log4j.FileAppender.setFile(FileAppender.java:294)
at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:165)
at org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:307)
at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:172)
at org.apache.log4j.config.PropertySetter.setProperties(PropertySetter.java:104)
at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:809)
at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:735)
at org.apache.log4j.PropertyConfigurator.configureRootCategory(PropertyConfigurator.java:615)
at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:502)
at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:547)
at org.mule.module.launcher.log4j.ApplicationAwareRepositorySelector.configureFrom(ApplicationAwareRepositorySelector.java:166)
at org.mule.module.launcher.log4j.ApplicationAwareRepositorySelector.getLoggerRepository(ApplicationAwareRepositorySelector.java:95)
at org.apache.log4j.LogManager.getLoggerRepository(LogManager.java:208)
at org.apache.log4j.LogManager.getLogger(LogManager.java:228)
at org.mule.module.logging.MuleLoggerFactory.getLogger(MuleLoggerFactory.java:77)
at org.mule.module.logging.DispatchingLogger.getLogger(DispatchingLogger.java:419)
at org.mule.module.logging.DispatchingLogger.isInfoEnabled(DispatchingLogger.java:191)
at org.apache.commons.logging.impl.SLF4JLog.isInfoEnabled(SLF4JLog.java:78)
at org.mule.module.launcher.application.DefaultMuleApplication.init(DefaultMuleApplication.java:188)
at org.mule.module.launcher.application.PriviledgedMuleApplication.init(PriviledgedMuleApplication.java:46)
at org.mule.module.launcher.application.ApplicationWrapper.init(ApplicationWrapper.java:64)
at org.mule.module.launcher.DefaultMuleDeployer.deploy(DefaultMuleDeployer.java:46)
at org.mule.module.launcher.DeploymentService.guardedDeploy(DeploymentService.java:398)
at org.mule.module.launcher.DeploymentService.start(DeploymentService.java:181)
at org.mule.module.launcher.MuleContainer.start(MuleContainer.java:157)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.mule.module.reboot.MuleContainerWrapper.start(MuleContainerWrapper.java:56)
at org.tanukisoftware.wrapper.WrapperManager$12.run(WrapperManager.java:3925)
What is that .log file for? Where is the configuration file to set it up so it is written somewhere the mule user has permission to write?
It seems that log4j can't write that file because of OS permissions. You can use chmod to add permissions in your MULE_HOME directory, and also review your log4j configuration to see why it is trying to write a file named .log.
The .log file seems to come from the 00_mmc-agent app.
The problem, therefore, was in the log4j.properties file located in ./apps/00_mmc-agent/classes; that file has the following appender configured: log4j.appender.file.File=${app.name}.log
The ${app.name} variable does not seem to be properly set (not even in the default, unmodified Mule start script), hence the .log file name.
In ./apps/mmc/webapps/mmc/WEB-INF/classes/log4j.properties the appender is configured like this: log4j.appender.R.File=${mule.home}/logs/mmc-console-app.log
So, to fix the startup error I modified the appender in the log4j.properties file located in ./apps/00_mmc-agent/classes to use this path:
log4j.appender.file.File=${mule.home}/logs/00_mmc-agent.log