Spark code to connect to Hive with a Kerberos keytab - Scala

I am trying to connect to Hive using Kerberos keytab authentication with Spark 2.4.5.
So far I have developed the code below, and I want to convert the result to a DataFrame.
How can I do that, and is there any other approach to connecting to Hive with a Kerberos keytab apart from this one?
import java.sql.DriverManager
import org.apache.hadoop.security.UserGroupInformation

val conf: org.apache.hadoop.conf.Configuration = new org.apache.hadoop.conf.Configuration()
System.setProperty("java.security.krb5.conf", krbloc)
conf.set("hadoop.security.authentication", "kerberos")
conf.set("hadoop.security.authorization", "true")
UserGroupInformation.setConfiguration(conf)
UserGroupInformation.loginUserFromKeytab("hive@drek-test.COM", keytab_path)
Class.forName("org.apache.hive.jdbc.HiveDriver")
val con = DriverManager.getConnection(url3)
con.prepareStatement("select * from table").executeQuery
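
One way to get a DataFrame instead of a raw ResultSet is to let Spark's generic JDBC source open the connection rather than DriverManager, after the same keytab login has run in the driver JVM. This is only a sketch under that assumption; spark is an existing SparkSession, url3 is the same JDBC URL used above, and "table" is the placeholder table name from the snippet. Depending on the HiveServer2 and driver versions, this combination can be finicky.

import org.apache.spark.sql.{DataFrame, SparkSession}

// Assumes UserGroupInformation.loginUserFromKeytab(...) above has already succeeded
// in this JVM, so the Hive JDBC driver can pick up the Kerberos credentials.
val spark: SparkSession = SparkSession.builder()
  .appName("HiveOverJdbc")
  .getOrCreate()

val df: DataFrame = spark.read
  .format("jdbc")
  .option("url", url3)                                   // same jdbc:hive2://... URL as above
  .option("driver", "org.apache.hive.jdbc.HiveDriver")   // same driver class as above
  .option("dbtable", "table")                            // placeholder table name from the question
  .load()

df.show()

As for other approaches: if the application runs on the same Kerberized cluster, a common alternative is to skip JDBC entirely, pass --principal and --keytab to spark-submit, build the SparkSession with enableHiveSupport(), and read the table with spark.sql("select * from table").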

Related

how to connect to mongodb Atlas from databricks cluster using pyspark

This is my simple code in a notebook:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("myApp") \
.config("spark.mongodb.input.uri", "mongodb+srv://admin:<password>#mongocluster.fxilr.mongodb.net/TestDatabase.Events") \
.getOrCreate()
df = spark.read.format("mongo").load()
df.printSchema()
But I am getting this error:
IllegalArgumentException: Missing database name. Set via the 'spark.mongodb.input.uri' or 'spark.mongodb.input.database' property
What am I doing wrong?
I followed these steps and was able to connect.
Install the org.mongodb.spark:mongo-spark-connector_2.12:3.0.2 Maven library on your cluster (I was using Scala 2.12).
Go to the cluster detail page and, in Advanced Options under the Spark tab, add the two config parameters below:
spark.mongodb.output.uri connection-string
spark.mongodb.input.uri connection-string
Note that the connection string should look like this (use your own user, password and database names):
mongodb+srv://user:password@cluster1.s5tuva0.mongodb.net/my_database?retryWrites=true&w=majority
Use the following Python code in your notebook and it should load your sample collection into a DataFrame:
# Reading from MongoDB
df = spark.read \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .option("uri", "mongodb+srv://user:password@cluster1.s5tuva0.mongodb.net/database?retryWrites=true&w=majority") \
    .option("database", "my_database") \
    .option("collection", "my_collection") \
    .load()
You can use the following to write to MongoDB:
events_df.write \
    .format('com.mongodb.spark.sql.DefaultSource') \
    .mode("append") \
    .option("uri", "mongodb+srv://user:password@cluster1.s5tuva0.mongodb.net/my_database.my_collection?retryWrites=true&w=majority") \
    .save()
Hope this works for you. Please let others know if it did.

register hive udf in scala - java.net.MalformedURLException: unknown protocol: s3

I am trying to register a UDF in Spark with Scala. Registering the same UDF works in Hive with:

create temporary function udf_parallax as 'com.abc.edw.hww.etl.udf.parallax.ParallaxHiveHash' USING JAR 's3://bx-analytics-softwares/gdpr_hive_udfs/gdpr-hive-udfs-hadoop.jar'

In Spark I run:
val sparkSess = SparkSession.builder()
.appName("Opens")
.enableHiveSupport()
.config("set hive.exec.dynamic.partition.mode", "nonstrict").getOrCreate()
sparkSess.sql("""create temporary function udf_parallax as 'com.abc.edw.hww.etl.udf.parallax.ParallaxHiveHash' USING JAR 's3://bx-analytics-softwares/gdpr_hive_udfs/gdpr-hive-udfs-hadoop.jar'""");
I get an error saying
Exception in thread "main" java.net.MalformedURLException: unknown protocol: s3
I would like to know whether I have to set something in the config or do anything else; I have just started learning.
Any help with this is appreciated.
Why not add this gdpr-hive-udfs-hadoop.jar as an external jar to your project and then do this to register the udf:
val sqlContext = sparkSess.sqlContext
val udf_parallax = sqlContext.udf.register("udf_parallax", com.abc.edw.hww.etl.udf.parallax.ParallaxHiveHash _)
Update:
1. If your Hive is running on a remote server:
val sparkSession = SparkSession.builder()
  .appName("Opens")
  .config("hive.metastore.uris", "thrift://METASTORE:9083")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .enableHiveSupport()
  .getOrCreate()

sparkSession.sql("""create temporary function udf_parallax as 'com.abc.edw.hww.etl.udf.parallax.ParallaxHiveHash' USING JAR 's3://bx-analytics-softwares/gdpr_hive_udfs/gdpr-hive-udfs-hadoop.jar'""");
2. If Hive is not running on a remote server:
Copy the hive-site.xml from your /hive/conf/ directory to the /spark/conf/ directory and create the SparkSession as you mentioned in the question.

How to access existing table in Hive?

I am trying to access Hive from a Spark application with Scala.
My code:
val hiveLocation = "hdfs://master:9000/user/hive/warehouse"
val conf = new SparkConf().setAppName("SOME APP NAME").setMaster("local[*]").set("spark.sql.warehouse.dir",hiveLocation)
val sc = new SparkContext(conf)
val spark = SparkSession
.builder()
.appName("SparkHiveExample")
.master("local[*]")
.config("spark.sql.warehouse.dir", hiveLocation)
.config("spark.driver.allowMultipleContexts", "true")
.enableHiveSupport()
.getOrCreate()
println("Start of SQL Session--------------------")
spark.sql("select * from test").show()
println("End of SQL session-------------------")
But it ends up with the error message:
Table or view not found
But when I run show tables; in the Hive console, I can see that table and can run select * from test. Everything is in the "user/hive/warehouse" location. Just for testing, I also tried creating a table from Spark, just to find out the table location.
val spark = SparkSession
.builder()
.appName("SparkHiveExample")
.master("local[*]")
.config("spark.sql.warehouse.dir", hiveLocation)
.config("spark.driver.allowMultipleContexts", "true")
.enableHiveSupport()
.getOrCreate()
println("Start of SQL Session--------------------")
spark.sql("CREATE TABLE IF NOT EXISTS test11(name String)")
println("End of SQL session-------------------")
This code also executed properly (with a success note), but the strange thing is that I cannot find this table from the Hive console.
Even when I run select * from TBLS; in MySQL (in my setup MySQL is configured as the metastore for Hive), I do not find the tables that were created from Spark.
Is the Spark warehouse location different from the one the Hive console uses?
What do I have to do to access an existing Hive table from Spark?
From the Spark SQL programming guide (I highlighted the relevant parts):
Configuration of Hive is done by placing your hive-site.xml, core-site.xml (for security configuration), and hdfs-site.xml (for HDFS configuration) file in conf/.

When working with Hive, one must instantiate SparkSession with Hive support, including connectivity to a persistent Hive metastore, support for Hive serdes, and Hive user-defined functions. Users who do not have an existing Hive deployment can still enable Hive support. When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory and creates a directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory that the Spark application is started.
You need to add a hive-site.xml config file to the resources dir.
Here are the minimum values needed for Spark to work with Hive (set the host to the host of the Hive metastore):
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://host:9083</value>
    <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
  </property>
</configuration>
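
With that hive-site.xml on the classpath (for example in src/main/resources), a SparkSession built with Hive support should see the same metastore as the Hive console. A minimal sketch, reusing the table name test from the question:

import org.apache.spark.sql.SparkSession

// hive-site.xml points Spark at the shared metastore, so no explicit
// spark.sql.warehouse.dir override is needed here.
val spark = SparkSession.builder()
  .appName("SparkHiveExample")
  .master("local[*]")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("show tables").show()         // should now list the tables visible from the Hive console
spark.sql("select * from test").show()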

Scala Code to connect to Spark and Cassandra

I have Scala (IntelliJ) running on my laptop. I also have Spark and Cassandra running on machines A, B, and C (a 3-node cluster using DataStax, running in Analytics mode).
I tried running Scala programs on the cluster, and they run fine.
I need to write code and run it using IntelliJ on my laptop. How do I connect and run? I know I am making a mistake in the code; I used general words. I need help writing the specific code. Example: localhost is incorrect.
import org.apache.spark.{SparkContext, SparkConf}

object HelloWorld {
  def main(args: Array[String]) {
    val conf = new SparkConf(true).set("spark:master", "localhost")
    val sc = new SparkContext(conf)
    val data = sc.cassandraTable("my_keyspace", "my_table")
  }
}
val conf = new SparkConf().setAppName("APP_NAME")
  .setMaster("local")
  .set("spark.cassandra.connection.host", "localhost")
  .set("spark.cassandra.auth.username", "")
  .set("spark.cassandra.auth.password", "")
Use the code above to connect to local Spark and Cassandra. If your Cassandra cluster has authentication enabled, set the username and password.
If you want to connect to a remote Spark and Cassandra cluster, replace localhost with the Cassandra host, and in setMaster use spark://SPARK_HOST:7077, as in the sketch below.
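
For example, a sketch of that remote variant; SPARK_MASTER_HOST and CASSANDRA_HOST are placeholders you would replace with your own addresses, and the keyspace/table names come from the question:

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._   // brings sc.cassandraTable into scope

val conf = new SparkConf()
  .setAppName("APP_NAME")
  .setMaster("spark://SPARK_MASTER_HOST:7077")              // remote Spark master instead of "local"
  .set("spark.cassandra.connection.host", "CASSANDRA_HOST")
  .set("spark.cassandra.auth.username", "cassandra")        // only if authentication is enabled
  .set("spark.cassandra.auth.password", "cassandra")

val sc = new SparkContext(conf)
val data = sc.cassandraTable("my_keyspace", "my_table")
println(data.count())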

Why does Spark Cassandra Connector fail with NoHostAvailableException?

I am having problems getting Spark Cassandra Connector working in Scala.
I'm using these versions:
Scala 2.10.4
spark-core 1.0.2
cassandra-thrift 2.1.0 (my installed cassandra is v2.1.0)
cassandra-clientutil 2.1.0
cassandra-driver-core 2.0.4 (recommended for connector?)
spark-cassandra-connector 1.0.0
I can connect and talk to Cassandra (w/o spark) and I can talk to Spark (w/o Cassandra) but the connector gives me:
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.0.0.194:9042 (com.datastax.driver.core.TransportException: [/10.0.0.194:9042] Cannot connect))
What am I missing? Cassandra is a default install (port 9042 for cql according to cassandra.yaml). I'm trying to connect locally ("local").
My code:
val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new SparkContext("local","test",conf)
val rdd = sc.cassandraTable("myks","users")
val rr = rdd.first
println(s"Result: $rr")
Local in this context is specifying the Spark master (telling it to run in local mode) and not the Cassandra connection host.
To set the Cassandra connection host you have to set a different property in the Spark config:
import org.apache.spark._
val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "IP Cassandra Is Listening On")
  .set("spark.cassandra.username", "cassandra") //Optional
  .set("spark.cassandra.password", "cassandra") //Optional

val sc = new SparkContext("spark://Spark Master IP:7077", "test", conf)
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/1_connecting.md