Spark2 unable to find table or view on remote hdfs cluster

Spark2 unable to find table or view on remote hdfs cluster - scala

I am using HiveContext to query a hive table on a hdfs cluster remotely through spark 1.6.0 and am able to do so successfully. However, when doing so through spark 2.3.0, throws the following:
org.apache.spark.sql.AnalysisException:
Table or view not found: `hiveorc_replica`.`appointment`; line 1 pos 21;
'Aggregate [unresolvedalias(count(1), None)]
+- 'UnresolvedRelation `hiveorc_replica`.`appointment`
Through this message, I am able to interpret only one thing that it might be searching for the database locally instead of remotely. I am creating a spark context using:
val conf = new SparkConf().setAppName("SparkApp").setMaster("local")
val sc=new SparkContext(conf)
val hc = new HiveContext(sc)
val actualRecordCountHC = hc.sql("select count(*) from hiveorc_replica.appointment")
val records = hc.sql("select * from hiveorc_replica.appointment")
All the config files are present in the resources folder of my project. The following is my hive-site.xml:
<?xml version="1.0" encoding="UTF-8"?>
<!--Autogenerated by Cloudera Manager-->
<configuration>
<property>
<name>hive.metastore.uris</name>
<value>thrift://fqdn:9083</value>
</property>
<property>
<name>hive.metastore.client.socket.timeout</name>
<value>300</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
<property>
<name>hive.warehouse.subdir.inherit.perms</name>
<value>true</value>
</property>
<property>
<name>hive.auto.convert.join</name>
<value>true</value>
</property>
<property>
<name>hive.auto.convert.join.noconditionaltask.size</name>
<value>20971520</value>
</property>
<property>
<name>hive.optimize.bucketmapjoin.sortedmerge</name>
<value>false</value>
</property>
<property>
<name>hive.smbjoin.cache.rows</name>
<value>10000</value>
</property>
<property>
<name>hive.server2.logging.operation.enabled</name>
<value>true</value>
</property>
<property>
<name>hive.server2.logging.operation.log.location</name>
<value>/var/log/hive/operation_logs</value>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>-1</value>
</property>
<property>
<name>hive.exec.reducers.bytes.per.reducer</name>
<value>67108864</value>
</property>
<property>
<name>hive.exec.copyfile.maxsize</name>
<value>33554432</value>
</property>
<property>
<name>hive.exec.reducers.max</name>
<value>1099</value>
</property>
<property>
<name>hive.vectorized.groupby.checkinterval</name>
<value>4096</value>
</property>
<property>
<name>hive.vectorized.groupby.flush.percent</name>
<value>0.1</value>
</property>
<property>
<name>hive.compute.query.using.stats</name>
<value>false</value>
</property>
<property>
<name>hive.vectorized.execution.enabled</name>
<value>false</value>
</property>
<property>
<name>hive.vectorized.execution.reduce.enabled</name>
<value>false</value>
</property>
<property>
<name>hive.merge.mapfiles</name>
<value>true</value>
</property>
<property>
<name>hive.merge.mapredfiles</name>
<value>false</value>
</property>
<property>
<name>hive.cbo.enable</name>
<value>false</value>
</property>
<property>
<name>hive.fetch.task.conversion</name>
<value>minimal</value>
</property>
<property>
<name>hive.fetch.task.conversion.threshold</name>
<value>268435456</value>
</property>
<property>
<name>hive.limit.pushdown.memory.usage</name>
<value>0.1</value>
</property>
<property>
<name>hive.merge.sparkfiles</name>
<value>true</value>
</property>
<property>
<name>hive.merge.smallfiles.avgsize</name>
<value>16777216</value>
</property>
<property>
<name>hive.merge.size.per.task</name>
<value>268435456</value>
</property>
<property>
<name>hive.optimize.reducededuplication</name>
<value>true</value>
</property>
<property>
<name>hive.optimize.reducededuplication.min.reducer</name>
<value>4</value>
</property>
<property>
<name>hive.map.aggr</name>
<value>true</value>
</property>
<property>
<name>hive.map.aggr.hash.percentmemory</name>
<value>0.5</value>
</property>
<property>
<name>hive.optimize.sort.dynamic.partition</name>
<value>false</value>
</property>
<property>
<name>hive.execution.engine</name>
<value>mr</value>
</property>
<property>
<name>spark.executor.memory</name>
<value>268435456</value>
</property>
<property>
<name>spark.driver.memory</name>
<value>268435456</value>
</property>
<property>
<name>spark.executor.cores</name>
<value>1</value>
</property>
<property>
<name>spark.yarn.driver.memoryOverhead</name>
<value>26</value>
</property>
<property>
<name>spark.yarn.executor.memoryOverhead</name>
<value>26</value>
</property>
<property>
<name>spark.dynamicAllocation.enabled</name>
<value>true</value>
</property>
<property>
<name>spark.dynamicAllocation.initialExecutors</name>
<value>1</value>
</property>
<property>
<name>spark.dynamicAllocation.minExecutors</name>
<value>1</value>
</property>
<property>
<name>spark.dynamicAllocation.maxExecutors</name>
<value>2147483647</value>
</property>
<property>
<name>hive.metastore.execute.setugi</name>
<value>true</value>
</property>
<property>
<name>hive.support.concurrency</name>
<value>true</value>
</property>
<property>
<name>hive.zookeeper.quorum</name>
<value>fqdn</value>
</property>
<property>
<name>hive.zookeeper.client.port</name>
<value>2181</value>
</property>
<property>
<name>hive.zookeeper.namespace</name>
<value>hive_zookeeper_namespace_CD-HIVE-WAyDdBlP</value>
</property>
<property>
<name>hive.cluster.delegation.token.store.class</name>
<value>org.apache.hadoop.hive.thrift.MemoryTokenStore</value>
</property>
<property>
<name>hive.server2.enable.doAs</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.sasl.enabled</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.kerberos.principal</name>
<value>hive/_HOST#EXAMPLE.COM</value>
</property>
<property>
<name>hive.server2.authentication.kerberos.principal</name>
<value>hive/_HOST#EXAMPLE.COM</value>
</property>
<property>
<name>spark.shuffle.service.enabled</name>
<value>true</value>
</property>
<property>
<name>hive.server2.authentication</name>
<value>LDAP</value>
</property>
</configuration>
fqdn is being replaced by remote hdfs FQDN at runtime. Also, when I run the same code on the remote cluster itself where the hive database is present, through spark2, it gives results.
So, how do i run the code remotely ?

creating spark session for spark2 did the job. On seeing the logs, I found that somehow it was unable to get the value of hive.metastore.uris from hive-site.xml and setting it through spark-session was the answer.
val spark = SparkSession.builder.master("local").config("hive.metastore.uris", "thrift://"+hdfsFQDN+":9083").enableHiveSupport.getOrCreate
However, I still have a doubt as to why is it not able to get value of hive.metastore.uri for running remotely when its able to get the file from resources when running through HiveContext ?

Related

java.lang.IllegalArgumentException: Does not contain a valid host:port authority: http at org.apache.hadoop.net.NetUtils.createSocketAddr

Note that i have deployed statefulsets of 2 namenodes, 2 datanodes and 3 journalnodes for Apache Hadoop 3.3.3 HA on kubernetes.
but namenode is throwing the following error.
$ hdfs --config /opt/hadoop/etc/hadoop namenode
{"name":"org.apache.hadoop.hdfs.server.namenode.NameNode","time":1659593176018,"date":"2022-08-04 06:06:16,018","level":"ERROR","thread":"Listener at 0.0.0.0/8020","message":"Error encountered requiring NN shutdown. Shutting down immediately.","exceptionclass":"java.lang.IllegalArgumentException","stack":["java.lang.IllegalArgumentException: **Does not contain a valid host:port authority: http:**","\tat org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:232)","\tat org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:189)","\tat org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:169)","\tat org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:158)","\tat org.apache.hadoop.hdfs.DFSUtil.substituteForWildcardAddress(DFSUtil.java:1046)","\tat org.apache.hadoop.hdfs.DFSUtil.getInfoServerWithDefaultHost(DFSUtil.java:1014)","\tat org.apache.hadoop.hdfs.server.namenode.ha.RemoteNameNodeInfo.getRemoteNameNodes(RemoteNameNodeInfo.java:61)","\tat org.apache.hadoop.hdfs.server.namenode.ha.RemoteNameNodeInfo.getRemoteNameNodes(RemoteNameNodeInfo.java:42)","\tat org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.<init>(EditLogTailer.java:191)","\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startStandbyServices(FSNamesystem.java:1501)","\tat org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startStandbyServices(NameNode.java:2051)","\tat org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.enterState(StandbyState.java:69)","\tat org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:1024)","\tat org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:995)","\tat org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1769)","\tat org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1834)"]}
core-site.xml
<property>
<name>fs.defaultFS</name>
<value>hdfs://apache-hadoop-namenode:8020</value>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>zk-headless.backend.svc.cluster.local:2181</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/dfs/journal</value>
</property>
hdfs-site.xml
<property>
<name>dfs.nameservices</name>
<value>apache-hadoop-namenode</value>
</property>
<property>
<name>dfs.ha.namenodes.apache-hadoop-namenode</name>
<value>apache-hadoop-namenode-0.apache-hadoop-namenode.backend.svc.cluster.local,apache-hadoop-namenode-1.apache-hadoop-namenode.backend.svc.cluster.local</value>
</property>
<property>
<name>dfs.namenode.rpc-address.apache-hadoop-namenode.apache-hadoop-namenode-0.apache-hadoop-namenode.backend.svc.cluster.local</name>
<value>hdfs://apache-hadoop-namenode-0.apache-hadoop-namenode.backend.svc.cluster.local:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.apache-hadoop-namenode.apache-hadoop-namenode-1.apache-hadoop-namenode.backend.svc.cluster.local</name>
<value>hdfs://apache-hadoop-namenode-1.apache-hadoop-namenode.backend.svc.cluster.local:8020</value>
</property>
<property>
<name>dfs.namenode.http-address.apache-hadoop-namenode.apache-hadoop-namenode-0.apache-hadoop-namenode.backend.svc.cluster.local</name>
<value>http://apache-hadoop-namenode-0.apache-hadoop-namenode.backend.svc.cluster.local:9870</value>
</property>
<property>
<name>dfs.namenode.http-address.apache-hadoop-namenode.apache-hadoop-namenode-1.apache-hadoop-namenode.backend.svc.cluster.local</name>
<value>http://apache-hadoop-namenode-1.apache-hadoop-namenode.backend.svc.cluster.local:9870</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://apache-hadoop-journalnode.backend.svc.cluster.local:8485/apache-hadoop-namenode</value>
</property>

The solution for this is to remove the http in the following property from hdfs-site.xml
<property>
<name>dfs.namenode.http-address.apache-hadoop-namenode.apache-hadoop-namenode-0.apache-hadoop-namenode.backend.svc.cluster.local</name>
<value>apache-hadoop-namenode-0.apache-hadoop-namenode.backend.svc.cluster.local:9870</value>
</property>
<property>
<name>dfs.namenode.http-address.apache-hadoop-namenode.apache-hadoop-namenode-1.apache-hadoop-namenode.backend.svc.cluster.local</name>
<value>apache-hadoop-namenode-1.apache-hadoop-namenode.backend.svc.cluster.local:9870</value>
</property>
this http address property is required as metioned it the https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html#:~:text=dfs.namenode.http%2Daddress.%5Bnameservice%20ID%5D.%5Bname%20node%20ID%5D%20%2D%20the%20fully%2Dqualified%20HTTP%20address%20for%20each%20NameNode%20to%20listen%20on
but for my case it worked after removing the http:// out of this property.

Mapreduce job when submitted in a queue is not running on the labelled node and instead running under default partition

I have a Hortonworks(HDP 2.4.2) cluster with three data nodes (node1,node2 & node3).
I wanted to check the node labelling feature in hortoworks . For that I created a node label(x) and mapped it to a data node node1.
So now, there are two partitions:
Default partition - contains node2 & node3
Partition x - contains node1
Later on I created a queue named "protegrity* and mapped it to node label X.
Now, whene I am running any mapreduce job, this job is getting executed but on the "protegrity* queue of the default partition which I was not expecting. It ws supposed to get executed on queue "protegrity* on node1(labelled partition x). Please refer the attached screenshot of scheduler.
Scheduler
The job executed was : hadoop jar ./hadoop-mapreduce/hadoop-mapreduce-examples.jar pi -Dmapred.job.queue.name=protegrity -Dnode_label_expression=x 2 2
The configs of capacity_seheduler.xml file is mentiond below:
<property>
<name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
<value>0.2</value>
</property>
<property>
<name>yarn.scheduler.capacity.maximum-applications</name>
<value>10000</value>
</property>
<property>
<name>yarn.scheduler.capacity.node-locality-delay</name>
<value>40</value>
</property>
<property>
<name>yarn.scheduler.capacity.queue-mappings-override.enable</name>
<value>false</value>
</property>
<property>
<name>yarn.scheduler.capacity.resource-calculator</name>
<value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.accessible-node-labels</name>
<value>x</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.accessible-node-labels.x.capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.accessible-node-labels.x.maximum-capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.acl_administer_queue</name>
<value>yarn</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.default.acl_administer_queue</name>
<value>yarn </value>
</property>
<property>
<name>yarn.scheduler.capacity.root.default.acl_submit_applications</name>
<value>yarn</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.default.capacity</name>
<value>50</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.default.state</name>
<value>RUNNING</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.hawqque.acl_administer_queue</name>
<value>yarn</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.hawqque.acl_submit_applications</name>
<value>yarn</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.hawqque.capacity</name>
<value>20</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.hawqque.maximum-capacity</name>
<value>80</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.hawqque.state</name>
<value>RUNNING</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.hawqque.user-limit-factor</name>
<value>2</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.protegrity.accessible-node-labels</name>
<value>x</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.protegrity.accessible-node-labels.x.capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.protegrity.accessible-node-labels.x.maximum-capacity</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.protegrity.acl_administer_queue</name>
<value>yarn </value>
</property>
<property>
<name>yarn.scheduler.capacity.root.protegrity.acl_submit_applications</name>
<value>yarn </value>
</property>
<property>
<name>yarn.scheduler.capacity.root.protegrity.capacity</name>
<value>30</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.protegrity.maximum-capacity</name>
<value>60</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.protegrity.minimum-user-limit-percent</name>
<value>100</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.protegrity.ordering-policy</name>
<value>fifo</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.protegrity.state</name>
<value>RUNNING</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.protegrity.user-limit-factor</name>
<value>1</value>
</property>
<property>
<name>yarn.scheduler.capacity.root.queues</name>
<value>default,hawqque,protegrity</value>
</property>
But when I executed an example query given in - https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.6/bk_yarn_resource_mgt/content/using_node_labels.html which doesn't triggers mapreduce job, the job executed on the labelled node and used the mentioned queue.
Query was : sudo su yarn
hadoop jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar
-shell_command "sleep 100000" -jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar
-queue protegrity -node_label_expression x
So, I am a bit confused here, node labelling work for mapreduce jobs or not !!??.
If yes, then I need a little help

Node labelling doesn't work with MapReduce job until and unless we add a node labelling to a queue and run the MapReduce job using a queue with a node label. Per your given configuration: Queue protegrity will always use default partition with a capacity of 30 % and not inside the node which you have labeled as x because you need to make your queue protegrity attach to x as default node label. Please add the following configuration to make it work:
<property>
<name>yarn.scheduler.capacity.root.protegrity.default-node-label-expression</name>
<value>x</value>
</property>

BigInsights on cloud - Class org.apache.oozie.action.hadoop.SparkMain not found

I'm trying to execute the spark oozie example on the oozie_spark branch against a BigInsights for Apache Hadoop basic cluster.
The workflow.xml looks like this:
<workflow-app xmlns='uri:oozie:workflow:0.5' name='SparkWordCount'>
<start to='spark-node' />
<action name='spark-node'>
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>${master}</master>
<name>Spark-Wordcount</name>
<class>org.apache.spark.examples.WordCount</class>
<jar>${hdfsSparkAssyJar},${hdfsWordCountJar}</jar>
<spark-opts>--conf spark.driver.extraJavaOptions=-Diop.version=4.2.0.0</spark-opts>
<arg>${inputDir}/FILE</arg>
<arg>${outputDir}</arg>
</spark>
<ok to="end" />
<error to="fail" />
</action>
<kill name="fail">
<message>Workflow failed, error
message[${wf:errorMessage(wf:lastErrorNode())}]
</message>
</kill>
<end name='end' />
</workflow-app>
The configuration.xml:
<configuration>
<property>
<name>master</name>
<value>local</value>
</property>
<property>
<name>queueName</name>
<value>default</value>
</property>
<property>
<name>user.name</name>
<value>default</value>
</property>
<property>
<name>nameNode</name>
<value>default</value>
</property>
<property>
<name>jobTracker</name>
<value>default</value>
</property>
<property>
<name>jobDir</name>
<value>/user/snowch/test</value>
</property>
<property>
<name>inputDir</name>
<value>/user/snowch/test/input</value>
</property>
<property>
<name>outputDir</name>
<value>/user/snowch/test/output</value>
</property>
<property>
<name>hdfsWordCountJar</name>
<value>/user/snowch/test/lib/OozieWorkflowSparkGroovy.jar</value>
</property>
<property>
<name>oozie.wf.application.path</name>
<value>/user/snowch/test</value>
</property>
<property>
<name>hdfsSparkAssyJar</name>
<value>/iop/apps/4.2.0.0/spark/jars/spark-assembly.jar</value>
</property>
</configuration>
However, the error I see in the Yarn logs is:
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], exception invoking main(), java.lang.ClassNotFoundException: Class org.apache.oozie.action.hadoop.SparkMain not found
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.oozie.action.hadoop.SparkMain not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:234)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.runSubtask(LocalContainerLauncher.java:380)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.runTask(LocalContainerLauncher.java:301)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.access$200(LocalContainerLauncher.java:187)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler$1.run(LocalContainerLauncher.java:230)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: Class org.apache.oozie.action.hadoop.SparkMain not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
... 13 more
I've looked for SparkMain in spark-assembly:
$ hdfs dfs -get /iop/apps/4.2.0.0/spark/jars/spark-assembly.jar
$ jar tf spark-assembly.jar | grep -i SparkMain
And here:
$ jar tf /usr/iop/4.2.0.0/spark/lib/spark-examples-1.6.1_IBM_4-hadoop2.7.2-IBM-12.jar | grep SparkMain
I've seen another question similar to this one, but this question is specifically about BigInsights on cloud.

The issue was resolved with:
<property>
<name>oozie.use.system.libpath</name>
<value>true</value>
</property>
I should have RTFM properly.

HTTP/1.1 400 Bad Request executing oozie spark job

I'm trying to execute the spark oozie example on the oozie_spark branch against a BigInsights for Apache Hadoop basic cluster.
The workflow.xml looks like this:
<workflow-app xmlns='uri:oozie:workflow:0.5' name='SparkWordCount'>
<start to='spark-node' />
<action name='spark-node'>
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>${master}</master>
<name>Spark-Wordcount</name>
<class>org.apache.spark.examples.WordCount</class>
<jar>/iop/apps/4.2.0.0/spark/jars/spark-assembly.jar,${jobDir}/lib/spark-wordcount-example.jar</jar>
<spark-opts>--conf spark.driver.extraJavaOptions=-Diop.version=4.2.0.0</spark-opts>
<arg>${inputDir}/FILE</arg>
<arg>${outputDir}</arg>
<capture-output/>
</spark>
<ok to="end" />
<error to="fail" />
</action>
<kill name="fail">
<message>Workflow failed, error
message[${wf:errorMessage(wf:lastErrorNode())}]
</message>
</kill>
<end name='end' />
</workflow-app>
The configuration.xml:
<configuration>
<property>
<name>master</name>
<value>local</value>
</property>
<property>
<name>queueName</name>
<value>default</value>
</property>
<property>
<name>user.name</name>
<value>default</value>
</property>
<property>
<name>nameNode</name>
<value>default</value>
</property>
<property>
<name>jobTracker</name>
<value>default</value>
</property>
<property>
<name>jobDir</name>
<value>/user/snowch/test</value>
</property>
<property>
<name>inputDir</name>
<value>/user/snowch/test/input</value>
</property>
<property>
<name>outputDir</name>
<value>/user/snowch/test/output</value>
</property>
<property>
<name>oozie.wf.application.path</name>
<value>/user/snowch/test</value>
</property>
</configuration>
However, the error is:
Exception in thread "main" org.apache.hadoop.gateway.shell.HadoopException:
org.apache.hadoop.gateway.shell.ErrorResponse: HTTP/1.1 400 Bad Request
What am I doing wrong?

The problem appears to be due to the element <capture-output/>.
I removed this and the error disappeared.

Using keytab file in spark standalone program

I am trying to access a file from HDFS in my standalone scala program using apache-spark. I get the following error upon execution.
SIMPLE authentication is not enabled. Available:[TOKEN,KERBEROS]
I found this question that explains that i need to create a keytab file and then make my standalone program use it . I have generated the keytab file . Could somebody tell me how i can use it from my program.
Any help will be greatly appreciated.
ps - I am using Hadoop 2.3.0 and spark 0.9.0
update : this is how my core-site.xml looks now :
<?xml version="1.0" encoding="UTF-8"?>
<!--Autogenerated by Cloudera Manager-->
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://USHadoop</value>
</property>
<property>
<name>fs.trash.interval</name>
<value>1</value>
</property>
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.DeflateCodec,org.apache.hadoop.io.compress.SnappyCodec,org.apache.hadoop.io.compress.Lz4Codec</value>
</property>
<property>
<name>hadoop.security.authentication</name>
<value>kerberos</value>
</property>
<property>
<name>hadoop.rpc.protection</name>
<value>authentication</value>
</property>
<property>
<name>hadoop.security.auth_to_local</name>
<value>DEFAULT</value>
</property>
</configuration>

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Spark2 unable to find table or view on remote hdfs cluster - scala

Related

java.lang.IllegalArgumentException: Does not contain a valid host:port authority: http at org.apache.hadoop.net.NetUtils.createSocketAddr

Mapreduce job when submitted in a queue is not running on the labelled node and instead running under default partition

BigInsights on cloud - Class org.apache.oozie.action.hadoop.SparkMain not found

HTTP/1.1 400 Bad Request executing oozie spark job

Using keytab file in spark standalone program

Categories

Resources