Cannot connect locally to Kerberized HDFS cluster using IntelliJ - Scala

I am trying to connect to HDFS locally via IntelliJ installed on my laptop. The cluster I am trying to connect to is Kerberized and has an edge node. I generated a keytab for the edge node and configured it in the code below. I am now able to log in to the edge node, but when I try to access the HDFS data that sits on the namenode, it throws an error.
Below is the Scala code that is trying to connect to HDFS:
import org.apache.spark.sql.SparkSession
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.{Credentials, UserGroupInformation}
import org.apache.hadoop.security.token.{Token, TokenIdentifier}
import java.security.{AccessController, PrivilegedAction, PrivilegedExceptionAction}
import java.io.PrintWriter

object DataframeEx {
  def main(args: Array[String]) {
    // $example on:init_session$
    val spark = SparkSession
      .builder()
      .master("local")
      .appName("Spark SQL basic example")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()

    runHdfsConnect(spark)

    spark.stop()
  }

  def runHdfsConnect(spark: SparkSession): Unit = {
    System.setProperty("HADOOP_USER_NAME", "m12345")
    val path = new Path("/data/interim/modeled/abcdef")

    val conf = new Configuration()
    conf.set("fs.defaultFS", "hdfs://namenodename.hugh.com:8020")
    conf.set("hadoop.security.authentication", "kerberos")
    conf.set("dfs.namenode.kerberos.principal.pattern", "hdfs/_HOST@HUGH.COM")

    UserGroupInformation.setConfiguration(conf)
    val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
      "m12345@HUGH.COM", "C:\\Users\\m12345\\Downloads\\m12345.keytab")

    println(UserGroupInformation.isSecurityEnabled())

    ugi.doAs(new PrivilegedExceptionAction[String] {
      override def run(): String = {
        val fs = FileSystem.get(conf)
        val output = fs.create(path)
        val writer = new PrintWriter(output)
        try {
          writer.write("this is a test")
          writer.write("\n")
        } finally {
          writer.close()
          println("Closed!")
        }
        "done"
      }
    })
  }
}
I am able to log in to the edge node, but when I try to write to HDFS (the doAs method) it throws the following error:
WARN Client: Exception encountered while connecting to the server : java.lang.IllegalArgumentException: Server has invalid Kerberos principal: hdfs/namenodename.hugh.com@HUGH.COM
18/06/11 12:12:01 ERROR UserGroupInformation: PriviledgedActionException m12345@HUGH.COM (auth:KERBEROS) cause:java.io.IOException: java.lang.IllegalArgumentException: Server has invalid Kerberos principal: hdfs/namenodename.hugh.com@HUGH.COM
18/06/11 12:12:01 ERROR UserGroupInformation: PriviledgedActionException as:m12345@HUGH.COM (auth:KERBEROS) cause:java.io.IOException: Failed on local exception: java.io.IOException: java.lang.IllegalArgumentException: Server has invalid Kerberos principal: hdfs/namenodename.hugh.com@HUGH.COM; Host Details : local host is: "INMBP-m12345/172.29.155.52"; destination host is: "namenodename.hugh.com":8020;
Exception in thread "main" java.io.IOException: Failed on local exception: java.io.IOException: java.lang.IllegalArgumentException: Server has invalid Kerberos principal: hdfs/namenodename.hugh.com@HUGH.COM; Host Details : local host is: "INMBP-m12345/172.29.155.52"; destination host is: "namenodename.hugh.com":8020
If I log in to the edge node, do a kinit and then access HDFS, it works fine. So why am I not able to access the HDFS namenode from my laptop when I am able to log in to the edge node?
Let me know if any more details are needed from my side.

The Hadoop Configuration object was set incorrectly. Below is what worked for me:
val conf = new Configuration()
conf.set("fs.defaultFS", "hdfs://namenodename.hugh.com:8020")
conf.set("hadoop.security.authentication", "kerberos")
conf.set("hadoop.rpc.protection", "privacy")                        // this parameter was missing
conf.set("dfs.namenode.kerberos.principal", "hdfs/_HOST@HUGH.COM")  // this was initially (and wrongly) set as dfs.namenode.kerberos.principal.pattern
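For context, here is a minimal sketch of the question's runHdfsConnect with these corrected settings applied (same principal, keytab path, and namenode host assumed as above; the rpc.protection value must match whatever your cluster is configured with):

// Sketch only: the original flow with the corrected configuration in place.
def runHdfsConnect(spark: SparkSession): Unit = {
  val path = new Path("/data/interim/modeled/abcdef")

  val conf = new Configuration()
  conf.set("fs.defaultFS", "hdfs://namenodename.hugh.com:8020")
  conf.set("hadoop.security.authentication", "kerberos")
  conf.set("hadoop.rpc.protection", "privacy")                        // must match the cluster's RPC protection setting
  conf.set("dfs.namenode.kerberos.principal", "hdfs/_HOST@HUGH.COM")  // the namenode's service principal

  UserGroupInformation.setConfiguration(conf)
  val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
    "m12345@HUGH.COM", "C:\\Users\\m12345\\Downloads\\m12345.keytab")

  ugi.doAs(new PrivilegedExceptionAction[String] {
    override def run(): String = {
      val fs = FileSystem.get(conf)
      val writer = new PrintWriter(fs.create(path))
      try writer.write("this is a test\n") finally writer.close()
      "done"
    }
  })
}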

Related

Connecting AWS Glue to Mongodb Atlas Cluster

Has anyone ever managed to get this to work? I've added a connection in AWS Glue to connect to my MongoDB cluster in Atlas, but the connection test fails in AWS with:
Check that your connection definition references your Mongo database with correct URL syntax, username, and password. Exiting with error code 30
I spun up an EC2 instance in the same subnet as the Glue connection in my VPC and it connects just fine. I also allowed all traffic in my security group but still get the same error.
You might need to take a look at authSource, an optional component of the MongoDB connection string URI.
Scala Example
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.MappingSpec
import com.amazonaws.services.glue.errors.CallSite
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.DynamicFrame
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

object GlueApp {
  val DEFAULT_URI: String = "mongodb://<an_ip_from_atlas_project_ip_access_list>:27017"
  val WRITE_URI: String = "mongodb://<an_ip_from_atlas_project_ip_access_list>:27017"

  lazy val defaultJsonOption = jsonOptions(DEFAULT_URI)
  lazy val writeJsonOption = jsonOptions(WRITE_URI)

  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    // Get DynamicFrame from MongoDB
    val resultFrame: DynamicFrame = glueContext.getSource("mongodb", defaultJsonOption).getDynamicFrame()

    // Write DynamicFrame to MongoDB and DocumentDB
    glueContext.getSink("mongodb", writeJsonOption).writeDynamicFrame(resultFrame)

    Job.commit()
  }

  private def jsonOptions(uri: String): JsonOptions = {
    new JsonOptions(
      s"""{"uri": "${uri}",
         |"database": "test",
         |"collection": "coll",
         |"username": "username",
         |"password": "pwd",
         |"ssl": "true",
         |"ssl.domain_match": "false",
         |"partitioner": "MongoSamplePartitioner",
         |"partitionerOptions.partitionSizeMB": "10",
         |"partitionerOptions.partitionKey": "_id"}""".stripMargin)
  }
}
You may need to set the authentication source to a database in the cluster you intend to connect to, e.g.:
mongodb://<an_ip_from_atlas_project_ip_access_list>:27017?authSource=test
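For example, a hedged tweak to the Scala example above: append authSource to the connection string that feeds the jsonOptions helper (this assumes your Atlas user was created in the test database; adjust the name to wherever your user is defined, and note the / before the query string per the connection string format):

// Inside the GlueApp object above (sketch, not verified against Atlas):
// the only change is the authSource query parameter on the URI.
val DEFAULT_URI: String =
  "mongodb://<an_ip_from_atlas_project_ip_access_list>:27017/?authSource=test"

lazy val defaultJsonOption = jsonOptions(DEFAULT_URI)   // same helper as in the example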
References
docs.atlas.mongodb.com. 2021. Connect to a Cluster. [ONLINE] Available at: https://docs.atlas.mongodb.com/connect-to-cluster.
docs.aws.amazon.com. 2021. Examples: Setting Connection Types and Options. [ONLINE] Available at: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect-samples.html.
docs.mongodb.com. 2021. Configuration Options. [ONLINE] Available at: https://docs.mongodb.com/spark-connector/master/configuration#partitioner-conf.
docs.mongodb.com. 2021. Connection String URI Format. [ONLINE] Available at: https://docs.mongodb.com/manual/reference/connection-string/.

Reading file from Azure Data Lake Storage V2 with Spark 2.4

I am trying to read a simple CSV file from Azure Data Lake Storage V2 with Spark 2.4 in my IntelliJ IDE on a Mac.
Code below:
package com.example

import org.apache.spark.SparkConf
import org.apache.spark.sql._

object Test extends App {
  val appName: String = "DataExtract"
  val master: String = "local[*]"

  val sparkConf: SparkConf = new SparkConf()
    .setAppName(appName)
    .setMaster(master)
    .set("spark.scheduler.mode", "FAIR")
    .set("spark.sql.session.timeZone", "UTC")
    .set("spark.sql.shuffle.partitions", "32")
    .set("fs.defaultFS", "abfs://development@xyz.dfs.core.windows.net/")
    .set("fs.azure.account.key.xyz.dfs.core.windows.net", "~~key~~")

  val spark: SparkSession = SparkSession
    .builder()
    .config(sparkConf)
    .getOrCreate()

  spark.time(run(spark))

  def run(spark: SparkSession): Unit = {
    val df = spark.read.csv("abfs://development@xyz.dfs.core.windows.net/development/sales.csv")
    df.show(10)
  }
}
Instead of reading the file, it throws the following exception:
Exception in thread "main" java.lang.NullPointerException
at org.wildfly.openssl.CipherSuiteConverter.toJava(CipherSuiteConverter.java:284)
at org.wildfly.openssl.OpenSSLEngine.toJavaCipherSuite(OpenSSLEngine.java:1094)
at org.wildfly.openssl.OpenSSLEngine.getEnabledCipherSuites(OpenSSLEngine.java:729)
at org.wildfly.openssl.OpenSSLContextSPI.getCiphers(OpenSSLContextSPI.java:333)
at org.wildfly.openssl.OpenSSLContextSPI$1.getSupportedCipherSuites(OpenSSLContextSPI.java:365)
at org.apache.hadoop.fs.azurebfs.utils.SSLSocketFactoryEx.<init>(SSLSocketFactoryEx.java:105)
at org.apache.hadoop.fs.azurebfs.utils.SSLSocketFactoryEx.initializeDefaultFactory(SSLSocketFactoryEx.java:72)
at org.apache.hadoop.fs.azurebfs.services.AbfsClient.<init>(AbfsClient.java:79)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:817)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.<init>(AzureBlobFileSystemStore.java:149)
at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:108)
Can anyone help me understand what the mistake is?
As per my research, you will receive this error message when you have a jar that is incompatible with your Hadoop version.
Please go through the issues below:
http://mail-archives.apache.org/mod_mbox/spark-issues/201907.mbox/%3CJIRA.13243325.1562321895000.591499.1562323440292#Atlassian.JIRA%3E
https://issues.apache.org/jira/browse/HADOOP-16410
I had the same issue; it was resolved by adding wildfly-openssl version 1.0.7, as mentioned in the docs shared by @cheekatlapradeep-msft:
<dependency>
    <groupId>org.wildfly.openssl</groupId>
    <artifactId>wildfly-openssl</artifactId>
    <version>1.0.7.Final</version>
</dependency>
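If your IntelliJ project builds with sbt rather than Maven (as is common for Scala), the equivalent of the Maven coordinates above would be:

// build.sbt -- sbt equivalent of the Maven dependency above
libraryDependencies += "org.wildfly.openssl" % "wildfly-openssl" % "1.0.7.Final"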

Can't connect from Spark to S3 - AmazonS3Exception Status Code: 400

I am trying to connect from Spark (running on my PC) to my S3 bucket:
val spark = SparkSession
  .builder
  .appName("S3Client")
  .config("spark.master", "local")
  .getOrCreate()

val sc = spark.sparkContext
sc.hadoopConfiguration.set("fs.s3a.access.key", ACCESS_KEY)
sc.hadoopConfiguration.set("fs.s3a.secret.key", SECRET_KEY)

val txtFile = sc.textFile("s3a://bucket-name/folder/file.txt")
val contents = txtFile.collect()
But getting the following exception:
Exception in thread "main"
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400,
AWS Service: Amazon S3, AWS Request ID: 07A7BDC9135BCC84, AWS Error
Code: null, AWS Error Message: Bad Request, S3 Extended Request ID:
6ly2vhZ2mAJdQl5UZ/QUdilFFN1hKhRzirw6h441oosGz+PLIvLW2fXsZ9xmd8cuBrNHCdh8UPE=
I have seen this question but it didn't help me.
Edit:
As Zack suggested, I added:
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")
But I still get the same exception.
I've solved the problem.
I was targeting a region (Frankfurt) that required using version 4 of the signature.
I've changed the region of the S3 bucket to Ireland and now it's working.
According to the S3 documentation, some regions only support Signature Version 4, so you need to add the configuration below:
--conf "spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true"
and
--conf "spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true"
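For a local-mode run like the one in the question, a hedged alternative is to set the same AWS SDK system property from the driver code before building the SparkSession (in a real cluster you would still need the executor option above). The eu-central-1 endpoint is only an example for the Frankfurt region mentioned earlier:

import org.apache.spark.sql.SparkSession

// Sketch, local mode only: driver and executors share one JVM, so setting the
// SDK's SigV4 system property here mirrors the --conf extraJavaOptions flags above.
System.setProperty("com.amazonaws.services.s3.enableV4", "true")

val spark = SparkSession
  .builder
  .appName("S3Client")
  .config("spark.master", "local")
  .getOrCreate()

val sc = spark.sparkContext
sc.hadoopConfiguration.set("fs.s3a.access.key", ACCESS_KEY)   // same placeholders as in the question
sc.hadoopConfiguration.set("fs.s3a.secret.key", SECRET_KEY)
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")  // example SigV4-only region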
Alon, try the configuration below:
val spark = SparkSession
  .builder
  .appName("S3Client")
  .config("spark.master", "local")
  .getOrCreate()

val sc = spark.sparkContext
sc.hadoopConfiguration.set("fs.s3a.access.key", ACCESS_KEY)
sc.hadoopConfiguration.set("fs.s3a.secret.key", SECRET_KEY)
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.us-east-1.amazonaws.com")

val txtFile = sc.textFile("s3a://bucket-name/folder/file.txt")
val contents = txtFile.collect()
I believe your issue was due to not specifying the endpoint in the configuration. Substitute us-east-1 with whichever region you use.
This works for me in PySpark (this is everything; no other export etc. is needed):
sparkContext._jsc.hadoopConfiguration().set("fs.s3a.access.key", AWS_KEY)
sparkContext._jsc.hadoopConfiguration().set("fs.s3a.secret.key", AWS_SECRET)
sparkContext._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")
to run:
spark-submit --conf spark.driver.extraJavaOptions='-Dcom.amazonaws.services.s3.enableV4' --conf spark.executor.extraJavaOptions='-Dcom.amazonaws.services.s3.enableV4' --packages org.apache.hadoop:hadoop-aws:2.7.1 spark_read_s3.py

Not able to read data from AWS S3(orc) through Intellij local(spark/Scala)

We are reading a table from AWS (Hive) through Spark/Scala using IntelliJ (which is on the local machine). We are able to see the schema of the table, but not able to read the data.
Please find the flow below for a better understanding:
IntelliJ (Spark/Scala) ------> Hive:9083 (remote) ------> S3 (ORC)
Note: IntelliJ is on the local machine; Hive and S3 are on AWS.
Please find the code below:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
//import org.apache.spark.sql.hive.HiveContext

object hiveconnect {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("SparkHiveExample")
      .config("hive.metastore.uris", "thrift://10.20.30.40:9083")
      .master("local[*]")
      .config("spark.sql.warehouse.dir", "s3://abc/test/main")
      .config("spark.driver.allowMultipleContexts", "true")
      .config("access-key", "key")
      .config("secret-key", "key")
      .enableHiveSupport()
      .getOrCreate()

    println("Start of SQL Session--------------------")

    spark.sql("show databases").show()
    spark.sql("select * from ace.visit limit 5").show()
  }
}
Error: Exception in thread "main" java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
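No answer is shown here, but the error message itself names the properties Hadoop expects for s3:// URLs; the "access-key"/"secret-key" settings in the code above are not Hadoop property names, so the credentials never reach the filesystem. A hedged sketch of the relevant builder lines, using the keys the error asks for (spark.hadoop.* settings are forwarded to the Hadoop configuration; for s3a:// paths the keys would be fs.s3a.access.key / fs.s3a.secret.key instead):

import org.apache.spark.sql.SparkSession

// Sketch based on the property names in the error message above, not a verified fix.
val spark = SparkSession
  .builder()
  .appName("SparkHiveExample")
  .config("hive.metastore.uris", "thrift://10.20.30.40:9083")
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "s3://abc/test/main")
  .config("spark.hadoop.fs.s3.awsAccessKeyId", "key")       // property name taken from the error
  .config("spark.hadoop.fs.s3.awsSecretAccessKey", "key")
  .enableHiveSupport()
  .getOrCreate()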

Not able to read conf file in spark scala

I would like to read a conf file into my Spark application. The conf file is located in a directory on the Hadoop edge node.
omega.conf
username = "surrender"
location = "USA"
My Spark code:
package com.test.spark

import org.apache.spark.{SparkConf, SparkContext}
import java.io.File
import com.typesafe.config.{Config, ConfigFactory}

object DemoMain {
  def main(args: Array[String]): Unit = {
    println("Lets Get Started ")
    val conf = new SparkConf().setAppName("SIMPLE")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

    val conf_loc = "/home/cloudera/localinputfiles/omega.conf"
    loadConfigFile(conf_loc)
  }

  def loadConfigFile(loc: String): Unit = {
    val config = ConfigFactory.parseFile(new File(loc))
    val username = config.getString("username")
    println(username)
  }
}
I am running this spark application using spark-submit
spark-submit --class com.test.spark.DemoMain --master local /home/cloudera/dev/jars/spark_examples.jar
The Spark job is initiated, but it throws the error below, saying that no configuration setting was found for key 'username':
17/03/29 12:57:37 INFO SparkContext: Created broadcast 0 from textFile at DemoMain.scala:25
Exception in thread "main" com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'username'
at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:115)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:136)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:150)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:155)
at com.typesafe.config.impl.SimpleConfig.getString (SimpleConfig.java:197)
at com.test.spark.DemoMain$.loadConfigFile(DemoMain.scala:53)
at com.test.spark.DemoMain$.main(DemoMain.scala:27)
at com.test.spark.DemoMain.main(DemoMain.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Please help me fix this issue.
I just tried it and it's working fine. I tested it with the code below:
val config = ConfigFactory.parseFile(new File("/home/sandy/my.conf"))
println("::::::::::::::::::::" + config.getString("username"))
and the conf file is:
username = "surrender"
location = "USA"
Please check the location of your file by printing it; a quick sketch of that check follows.
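For example, a hedged check (using the path from the question) before parsing; with --master local the driver runs on the machine you submit from, so the file must exist there, not just on the edge node:

import java.io.File
import com.typesafe.config.ConfigFactory

val loc = "/home/cloudera/localinputfiles/omega.conf"
val f = new File(loc)
// Show exactly which path the driver JVM resolves and whether it can see the file.
println(s"absolute path = ${f.getAbsolutePath}, exists = ${f.exists}, readable = ${f.canRead}")

if (f.exists) {
  val config = ConfigFactory.parseFile(f)
  println(config.getString("username"))
}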