Using a keytab file in a Spark standalone program - Scala

I am trying to access a file from HDFS in my standalone Scala program using Apache Spark. I get the following error upon execution:
SIMPLE authentication is not enabled. Available:[TOKEN,KERBEROS]
I found this question, which explains that I need to create a keytab file and then make my standalone program use it. I have generated the keytab file. Could somebody tell me how I can use it from my program?
Any help will be greatly appreciated.
PS - I am using Hadoop 2.3.0 and Spark 0.9.0.
Update: this is how my core-site.xml looks now:
<?xml version="1.0" encoding="UTF-8"?>
<!--Autogenerated by Cloudera Manager-->
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://USHadoop</value>
</property>
<property>
<name>fs.trash.interval</name>
<value>1</value>
</property>
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.DeflateCodec,org.apache.hadoop.io.compress.SnappyCodec,org.apache.hadoop.io.compress.Lz4Codec</value>
</property>
<property>
<name>hadoop.security.authentication</name>
<value>kerberos</value>
</property>
<property>
<name>hadoop.rpc.protection</name>
<value>authentication</value>
</property>
<property>
<name>hadoop.security.auth_to_local</name>
<value>DEFAULT</value>
</property>
</configuration>
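
For reference, here is a minimal sketch (not a tested solution) of how a keytab is typically used from a standalone Scala program via Hadoop's UserGroupInformation API; the principal, keytab path and HDFS path below are placeholders for your own values:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation
import org.apache.spark.{SparkConf, SparkContext}

object HdfsKeytabExample {
  def main(args: Array[String]): Unit = {
    // core-site.xml / hdfs-site.xml must be on the classpath (or in HADOOP_CONF_DIR)
    // so that hadoop.security.authentication=kerberos is picked up.
    val hadoopConf = new Configuration()
    UserGroupInformation.setConfiguration(hadoopConf)

    // Log in from the keytab before the SparkContext (and any FileSystem) is created.
    // Principal and keytab path are placeholders.
    UserGroupInformation.loginUserFromKeytab("someuser@EXAMPLE.COM", "/path/to/someuser.keytab")

    val sc = new SparkContext(new SparkConf().setAppName("HdfsKeytabExample").setMaster("local[2]"))
    val lines = sc.textFile("hdfs://USHadoop/path/to/file")
    println(lines.count())
    sc.stop()
  }
}

Keep in mind that on a standalone cluster the executors do not automatically receive Kerberos credentials, so a login like this only covers the JVM it runs in.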

Related

java.lang.IllegalArgumentException: Does not contain a valid host:port authority: http at org.apache.hadoop.net.NetUtils.createSocketAddr

I have deployed StatefulSets of 2 NameNodes, 2 DataNodes and 3 JournalNodes for Apache Hadoop 3.3.3 HA on Kubernetes, but the NameNode is throwing the following error:
$ hdfs --config /opt/hadoop/etc/hadoop namenode
{"name":"org.apache.hadoop.hdfs.server.namenode.NameNode","time":1659593176018,"date":"2022-08-04 06:06:16,018","level":"ERROR","thread":"Listener at 0.0.0.0/8020","message":"Error encountered requiring NN shutdown. Shutting down immediately.","exceptionclass":"java.lang.IllegalArgumentException","stack":["java.lang.IllegalArgumentException: **Does not contain a valid host:port authority: http:**","\tat org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:232)","\tat org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:189)","\tat org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:169)","\tat org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:158)","\tat org.apache.hadoop.hdfs.DFSUtil.substituteForWildcardAddress(DFSUtil.java:1046)","\tat org.apache.hadoop.hdfs.DFSUtil.getInfoServerWithDefaultHost(DFSUtil.java:1014)","\tat org.apache.hadoop.hdfs.server.namenode.ha.RemoteNameNodeInfo.getRemoteNameNodes(RemoteNameNodeInfo.java:61)","\tat org.apache.hadoop.hdfs.server.namenode.ha.RemoteNameNodeInfo.getRemoteNameNodes(RemoteNameNodeInfo.java:42)","\tat org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.<init>(EditLogTailer.java:191)","\tat org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startStandbyServices(FSNamesystem.java:1501)","\tat org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.startStandbyServices(NameNode.java:2051)","\tat org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.enterState(StandbyState.java:69)","\tat org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:1024)","\tat org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:995)","\tat org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1769)","\tat org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1834)"]}
core-site.xml
<property>
<name>fs.defaultFS</name>
<value>hdfs://apache-hadoop-namenode:8020</value>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>zk-headless.backend.svc.cluster.local:2181</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/dfs/journal</value>
</property>
hdfs-site.xml
<property>
<name>dfs.nameservices</name>
<value>apache-hadoop-namenode</value>
</property>
<property>
<name>dfs.ha.namenodes.apache-hadoop-namenode</name>
<value>apache-hadoop-namenode-0.apache-hadoop-namenode.backend.svc.cluster.local,apache-hadoop-namenode-1.apache-hadoop-namenode.backend.svc.cluster.local</value>
</property>
<property>
<name>dfs.namenode.rpc-address.apache-hadoop-namenode.apache-hadoop-namenode-0.apache-hadoop-namenode.backend.svc.cluster.local</name>
<value>hdfs://apache-hadoop-namenode-0.apache-hadoop-namenode.backend.svc.cluster.local:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.apache-hadoop-namenode.apache-hadoop-namenode-1.apache-hadoop-namenode.backend.svc.cluster.local</name>
<value>hdfs://apache-hadoop-namenode-1.apache-hadoop-namenode.backend.svc.cluster.local:8020</value>
</property>
<property>
<name>dfs.namenode.http-address.apache-hadoop-namenode.apache-hadoop-namenode-0.apache-hadoop-namenode.backend.svc.cluster.local</name>
<value>http://apache-hadoop-namenode-0.apache-hadoop-namenode.backend.svc.cluster.local:9870</value>
</property>
<property>
<name>dfs.namenode.http-address.apache-hadoop-namenode.apache-hadoop-namenode-1.apache-hadoop-namenode.backend.svc.cluster.local</name>
<value>http://apache-hadoop-namenode-1.apache-hadoop-namenode.backend.svc.cluster.local:9870</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://apache-hadoop-journalnode.backend.svc.cluster.local:8485/apache-hadoop-namenode</value>
</property>
The solution is to remove the http:// prefix from the values of the following properties in hdfs-site.xml:
<property>
<name>dfs.namenode.http-address.apache-hadoop-namenode.apache-hadoop-namenode-0.apache-hadoop-namenode.backend.svc.cluster.local</name>
<value>apache-hadoop-namenode-0.apache-hadoop-namenode.backend.svc.cluster.local:9870</value>
</property>
<property>
<name>dfs.namenode.http-address.apache-hadoop-namenode.apache-hadoop-namenode-1.apache-hadoop-namenode.backend.svc.cluster.local</name>
<value>apache-hadoop-namenode-1.apache-hadoop-namenode.backend.svc.cluster.local:9870</value>
</property>
This http-address property is required, as mentioned in the HDFS HA with QJM documentation: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html#:~:text=dfs.namenode.http%2Daddress.%5Bnameservice%20ID%5D.%5Bname%20node%20ID%5D%20%2D%20the%20fully%2Dqualified%20HTTP%20address%20for%20each%20NameNode%20to%20listen%20on
In my case, however, it only worked after removing the http:// from the property value.

HBase multi-node cluster setup issue

I have 3 nodes (master, slave1, slave2) and am trying to install HBase on this cluster.
I have started HBase on the master node, but I see that the daemons are not running on the slave nodes. Do I need to issue the start-hbase command on the slave nodes as well?
Could somebody please help.
hbase-env.sh content:
export JAVA_HOME=/opt/jdk1.8.0_151/
regionservers content:
slave1
slave2
master
hbase-site.xml content:
<configuration>
<property>
<name>hbase.rootdir</name>
<value>hdfs://1XX.1YY.1ZZ.1WW:9000/hbase</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>master,slave1,slave2</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/hadoop/zk_data</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2222</value>
</property>
</configuration>
My bashrc was pointing to another installed version of HBase; the problem was solved once it pointed to the correct version in the bashrc file. Also, to clarify, the start-hbase command has to be issued on the master node only.

BigInsights on cloud - Class org.apache.oozie.action.hadoop.SparkMain not found

I'm trying to execute the Spark Oozie example on the oozie_spark branch against a BigInsights for Apache Hadoop basic cluster.
The workflow.xml looks like this:
<workflow-app xmlns='uri:oozie:workflow:0.5' name='SparkWordCount'>
<start to='spark-node' />
<action name='spark-node'>
<spark xmlns="uri:oozie:spark-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>${master}</master>
<name>Spark-Wordcount</name>
<class>org.apache.spark.examples.WordCount</class>
<jar>${hdfsSparkAssyJar},${hdfsWordCountJar}</jar>
<spark-opts>--conf spark.driver.extraJavaOptions=-Diop.version=4.2.0.0</spark-opts>
<arg>${inputDir}/FILE</arg>
<arg>${outputDir}</arg>
</spark>
<ok to="end" />
<error to="fail" />
</action>
<kill name="fail">
<message>Workflow failed, error
message[${wf:errorMessage(wf:lastErrorNode())}]
</message>
</kill>
<end name='end' />
</workflow-app>
The configuration.xml:
<configuration>
<property>
<name>master</name>
<value>local</value>
</property>
<property>
<name>queueName</name>
<value>default</value>
</property>
<property>
<name>user.name</name>
<value>default</value>
</property>
<property>
<name>nameNode</name>
<value>default</value>
</property>
<property>
<name>jobTracker</name>
<value>default</value>
</property>
<property>
<name>jobDir</name>
<value>/user/snowch/test</value>
</property>
<property>
<name>inputDir</name>
<value>/user/snowch/test/input</value>
</property>
<property>
<name>outputDir</name>
<value>/user/snowch/test/output</value>
</property>
<property>
<name>hdfsWordCountJar</name>
<value>/user/snowch/test/lib/OozieWorkflowSparkGroovy.jar</value>
</property>
<property>
<name>oozie.wf.application.path</name>
<value>/user/snowch/test</value>
</property>
<property>
<name>hdfsSparkAssyJar</name>
<value>/iop/apps/4.2.0.0/spark/jars/spark-assembly.jar</value>
</property>
</configuration>
However, the error I see in the YARN logs is:
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], exception invoking main(), java.lang.ClassNotFoundException: Class org.apache.oozie.action.hadoop.SparkMain not found
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.oozie.action.hadoop.SparkMain not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:234)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.runSubtask(LocalContainerLauncher.java:380)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.runTask(LocalContainerLauncher.java:301)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler.access$200(LocalContainerLauncher.java:187)
at org.apache.hadoop.mapred.LocalContainerLauncher$EventHandler$1.run(LocalContainerLauncher.java:230)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: Class org.apache.oozie.action.hadoop.SparkMain not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
... 13 more
I've looked for SparkMain in spark-assembly:
$ hdfs dfs -get /iop/apps/4.2.0.0/spark/jars/spark-assembly.jar
$ jar tf spark-assembly.jar | grep -i SparkMain
And here:
$ jar tf /usr/iop/4.2.0.0/spark/lib/spark-examples-1.6.1_IBM_4-hadoop2.7.2-IBM-12.jar | grep SparkMain
I've seen another question similar to this one, but this question is specifically about BigInsights on cloud.
The issue was resolved with:
<property>
<name>oozie.use.system.libpath</name>
<value>true</value>
</property>
I should have RTFM properly.

Hive Metastore configuration with PostgreSQL

When I start the Hive metastore service, the command line says "Starting Hive Metastore Server" and nothing further. It doesn't actually start the server, nor does it throw any error messages.
Hive : 1.2.1
Hadoop : 2.7.1
Postgres: 9.3.8
hive-site.xml
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:postgresql://localhost:5432/metastore</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.postgresql.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hiveuser</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>*****</value>
</property>
<property>
<name>org.jpox.autoCreateSchema</name>
<value>true</value>
</property>
</configuration>
[metastore is the actual database created in PostgreSQL, and I can access it using: psql -U hiveuser -d metastore]
Please set the following property, especially for PostgreSQL. For more details, refer here.
<property>
<name>datanucleus.autoCreateSchema</name>
<value>false</value>
</property>
Since Hive 2.0, this property has been replaced by datanucleus.schema.autoCreateAll and others, as explained in this Apache cwiki page.
Please check the other specific configurations on the same page.

Error running Hadoop 2 map-reduce job in standalone mode in Eclipse?

I get the following error when I run my MR job in Eclipse.
2014-07-10 14:07:30 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
job done
2014-07-10 14:07:30 INFO JvmMetrics:76 - Initializing JVM Metrics with processName=JobTracker, sessionId=
2014-07-10 14:07:30 WARN JobSubmitter:149 - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2014-07-10 14:07:30 INFO JobSubmitter:439 - Cleaning up the staging area file:/Users/name/tmp/mapred/staging/sridhar519992773/.staging/job_local519992773_0001
2014-07-10 14:07:30 ERROR UserGroupInformation:1494 - PriviledgedActionException as:sridhar (auth:SIMPLE) cause:org.apache.hadoop.util.Shell$ExitCodeException: chmod: /Users/name/tmp/mapred/staging/sridhar519992773/.staging/job_local519992773_0001: No such file or directory
2014-07-10 14:07:30 ERROR App:43 - Error running MapReduce Job
org.apache.hadoop.util.Shell$ExitCodeException: chmod: /Users/name/tmp/mapred/staging/sridhar519992773/.staging/job_local519992773_0001: No such file or directory
at org.apache.hadoop.util.Shell.runCommand(Shell.java:464)
at org.apache.hadoop.util.Shell.run(Shell.java:379)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:678)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:661)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639)
at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:468)
at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:596)
at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:178)
at org.apache.hadoop.mapreduce.JobSubmitter.copyAndConfigureFiles(JobSubmitter.java:300)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:387)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1268)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1265)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:394)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1265)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1286)
at hadoopstandalone.standalone.App.doMapReduce(App.java:40)
at hadoopstandalone.standalone.App.main(App.java:27)
Here is my core-site.xml:
<configuration>
<property>
<name>fs.default.name</name>
<value>file:///</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/Users/sridhar/Desktop/tmp</value>
</property>
</configuration>
Here is my mapred-site.xml:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>local</value>
</property>
</configuration>
Your core-site.xml should look something like this:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.temp.dir</name>
<value>/home/myname/hdfs/temp</value>
</property>
</configuration>
and your hdfs-site.xml should look something like this:
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/home/myname/hdfs/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/myname/hdfs/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
Let me know if it works
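
As an aside, the warning in the log above ("Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner...") refers to the standard Tool/ToolRunner pattern. A rough sketch in Scala, with the driver name and job wiring made up for illustration (mapper/reducer setup omitted):

import org.apache.hadoop.conf.{Configuration, Configured}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import org.apache.hadoop.util.{Tool, ToolRunner}

// Hypothetical driver: ToolRunner parses the generic options (-conf, -D, -fs, ...)
// and hands the resulting Configuration to run() via getConf.
class ExampleDriver extends Configured with Tool {
  override def run(args: Array[String]): Int = {
    val job = Job.getInstance(getConf, "example-job")
    job.setJarByClass(classOf[ExampleDriver])
    // setMapperClass / setReducerClass / output key-value classes would go here.
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    if (job.waitForCompletion(true)) 0 else 1
  }
}

object ExampleDriver {
  def main(args: Array[String]): Unit =
    System.exit(ToolRunner.run(new Configuration(), new ExampleDriver, args))
}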