I would like to read a Hive table with Spark. The Hive tables' data is stored as text files under /user/hive/warehouse/problem7.db.
I do:
val warehouseLocation = "hdfs://localhost:9000/user/hive/warehouse"
// Create the Spark Conf and the Spark Session
val conf = new SparkConf().setAppName("Spark Hive").setMaster("local[2]").set("spark.sql.warehouse.dir", warehouseLocation)
val spark = SparkSession.builder.config(conf).enableHiveSupport().getOrCreate()
val table1 = spark.sql("select * from problem7.categories")
table1.show(false)
I get the following error:
Table or view not found: `problem7`.`categories`
I resolved it in the following way:
I created a hive-site.xml in spark/conf and added:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hive.metastore.uris</name>
<value>thrift://localhost:9083</value>
</property>
</configuration>
Then I started the Hive metastore service with the following command:
hive --service metastore
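For completeness, the metastore URI can also be set directly on the SparkSession builder instead of (or in addition to) hive-site.xml. This is only a minimal sketch, assuming the metastore service started above is listening on localhost:9083:
import org.apache.spark.sql.SparkSession
// Sketch: point Spark directly at the running Hive metastore.
val spark = SparkSession.builder()
  .appName("Spark Hive")
  .master("local[2]")
  .config("spark.sql.warehouse.dir", "hdfs://localhost:9000/user/hive/warehouse")
  .config("hive.metastore.uris", "thrift://localhost:9083")
  .enableHiveSupport()
  .getOrCreate()
spark.sql("select * from problem7.categories").show(false)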
I'm trying to connect to the Hive warehouse directory from Spark running in IntelliJ. The warehouse is located at the following path:
hdfs://localhost:9000/user/hive/warehouse
In order to do this, I'm using the following code :
import org.apache.spark.sql.SparkSession
// warehouseLocation points to the default location for managed databases and tables
val warehouseLocation = "hdfs://localhost:9000/user/hive/warehouse"
val spark = SparkSession
.builder()
.appName("Spark Hive Local Connector")
.config("spark.sql.warehouse.dir", warehouseLocation)
.config("spark.master", "local")
.enableHiveSupport()
.getOrCreate()
spark.catalog.listDatabases().show(false)
spark.catalog.listTables().show(false)
spark.conf.getAll.mkString("\n")
import spark.implicits._
import spark.sql
sql("USE test")
sql("SELECT * FROM test.employee").show()
As one can see, I have created a database 'test' and a table 'employee' in this database using the Hive console. I want to get the result of the last query.
The spark.catalog and spark.conf calls are used to print the warehouse path and database settings.
spark.catalog.listDatabases().show(false) gives me :
name : default
description : Default Hive database
locationUri : hdfs://localhost:9000/user/hive/warehouse
spark.catalog.listTables().show(false) gives me an empty result, so something is wrong at this step.
At the end of the execution of the job, I obtained the following error:
> Exception in thread "main" org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'test' not found;
I have also configured the hive-site.xml file for the Hive warehouse location:
<property>
<name>hive.metastore.warehouse.dir</name>
<value>hdfs://localhost:9000/user/hive/warehouse</value>
</property>
I have already created the database 'test' using the Hive console.
Below are the versions of my components:
Spark : 2.2.0
Hive : 1.1.0
Hadoop : 2.7.3
Any ideas?
Create a resources directory under src in your IntelliJ project and copy the conf files into this folder, then build the project. Make sure hive.metastore.uris is defined correctly; refer to the hive-site.xml. If the log shows INFO metastore: Connected to metastore, then you are good to go.
Note that connecting from IntelliJ and running the job there will be slower than packaging the jar and running it on your Hadoop cluster.
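As a quick way to confirm that the resources folder actually ended up on the classpath, a small check like the following can be run before building the SparkSession (just a sketch; the resource name matches the file described above):
// Sketch: fail fast if hive-site.xml did not make it onto the classpath.
// Without it, enableHiveSupport() silently falls back to a local Derby metastore
// and a spark-warehouse directory, which is why listTables() came back empty.
val hiveSite = Thread.currentThread().getContextClassLoader.getResource("hive-site.xml")
require(hiveSite != null, "hive-site.xml not found on the classpath (expected under src/main/resources)")
Once the file is picked up, spark.catalog.listDatabases() should list the 'test' database created from the Hive console.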
I am trying to access Hive from a Spark application written in Scala.
My code:
val hiveLocation = "hdfs://master:9000/user/hive/warehouse"
val conf = new SparkConf().setAppName("SOME APP NAME").setMaster("local[*]").set("spark.sql.warehouse.dir",hiveLocation)
val sc = new SparkContext(conf)
val spark = SparkSession
.builder()
.appName("SparkHiveExample")
.master("local[*]")
.config("spark.sql.warehouse.dir", hiveLocation)
.config("spark.driver.allowMultipleContexts", "true")
.enableHiveSupport()
.getOrCreate()
println("Start of SQL Session--------------------")
spark.sql("select * from test").show()
println("End of SQL session-------------------")
But it ends up with the error message:
Table or view not found
However, when I run show tables; in the Hive console, I can see that table and can run select * from test. Everything is under the /user/hive/warehouse location. Just for testing I also tried creating a table from Spark, to find out where the table ends up.
val spark = SparkSession
.builder()
.appName("SparkHiveExample")
.master("local[*]")
.config("spark.sql.warehouse.dir", hiveLocation)
.config("spark.driver.allowMultipleContexts", "true")
.enableHiveSupport()
.getOrCreate()
println("Start of SQL Session--------------------")
spark.sql("CREATE TABLE IF NOT EXISTS test11(name String)")
println("End of SQL session-------------------")
This code also executed properly (with a success note), but the strange thing is that I cannot find this table from the Hive console.
Even when I run select * from TBLS; in MySQL (in my setup MySQL is configured as the metastore for Hive), I do not find the tables that were created from Spark.
Is Spark using a different warehouse location than the Hive console?
What do I have to do to access an existing Hive table from Spark?
From the Spark SQL programming guide (I highlighted the relevant parts):
Configuration of Hive is done by placing your hive-site.xml,
core-site.xml (for security configuration), and hdfs-site.xml (for
HDFS configuration) file in conf/.
When working with Hive, one must instantiate SparkSession with Hive
support, including connectivity to a persistent Hive metastore,
support for Hive serdes, and Hive user-defined functions. Users who do
not have an existing Hive deployment can still enable Hive support.
When not configured by the hive-site.xml, the context automatically
creates metastore_db in the current directory and creates a directory
configured by spark.sql.warehouse.dir, which defaults to the directory
spark-warehouse in the current directory that the Spark application is
started
You need to add a hive-site.xml config file to the resources dir.
Here are the minimum values needed for Spark to work with Hive (set the host to the host of your Hive metastore):
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hive.metastore.uris</name>
<value>thrift://host:9083</value>
<description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>
</configuration>
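To see where the test11 table from the question actually ended up (most likely the local Derby metastore and a spark-warehouse directory rather than the MySQL-backed Hive metastore), a quick diagnostic can be run in the same session; this is only a sketch using the names from the question:
// Sketch: the Location row of DESCRIBE FORMATTED shows which warehouse the table landed in.
spark.sql("DESCRIBE FORMATTED test11").show(100, false)
// And this shows where Spark thinks the warehouse is.
println(spark.conf.get("spark.sql.warehouse.dir"))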
I need to connect to Hive in Cloudera CDH 5.8 running in VirtualBox, from a Spark Scala program created in IntelliJ on a local Windows machine. Please help.
Mostly what you need is HDFS and Hive support. You have two options:
1) Create core-site.xml and hive-site.xml where you configure:
core-site property:
<property>
<name>fs.defaultFS</name>
<value>maprfs://cdhdemo:7222</value>
</property>
hive-site properties:
<property>
<name>hive.metastore.uris</name>
<value>thrift://cdhdemo:9083</value>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
2) Or you can configure it programmatically on the SparkSession builder:
val spark = SparkSession.builder()
  .config("hive.metastore.uris", "thrift://cdhdemo:9083")
  .config("hive.metastore.warehouse.dir", "/user/hive/warehouse")
  .config("fs.defaultFS", "maprfs://cdhdemo:7222")
  .enableHiveSupport()
  .getOrCreate()
I'm reading a large dataset from an HDFS location and saving my DataFrame to Redshift:
df.write
.format("com.databricks.spark.redshift")
.option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
.option("dbtable", "my_table_copy")
.option("tempdir", "s3n://path/for/temp/data")
.mode("error")
.save()
After some time I get the following error:
s3.amazonaws.com:443 failed to respond
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:143)
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
at org.apache.http.impl.conn.AbstractClientConnAdapter.receiveResponseHeader(AbstractClientConnAdapter.java:223)
at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:334)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:281)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestPut(RestStorageService.java:1043)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.copyObjectImpl(RestStorageService.java:2029)
at org.jets3t.service.StorageService.copyObject(StorageService.java:871)
at org.jets3t.service.StorageService.copyObject(StorageService.java:916)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.copy(Jets3tNativeFileSystemStore.java:323)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.rename(NativeS3FileSystem.java:707)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:370)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:384)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:326)
at org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:230)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:
I found the same issue reported on GitHub:
s3.amazonaws.com:443 failed to respond
Am I doing something wrong? Please help.
I had the same issue; in my case I was using AWS EMR too.
The Databricks Redshift library uses Amazon S3 to transfer data in and out of Redshift efficiently. It first writes the data to Amazon S3 as Avro files, and these files are then loaded into Redshift via EMRFS.
You have to configure your EMRFS settings and it will work.
The EMR File System (EMRFS) and the Hadoop Distributed File System
(HDFS) are both installed on your EMR cluster. EMRFS is an
implementation of HDFS which allows EMR clusters to store data on
Amazon S3.
EMRFS will try to verify list consistency for objects tracked in its
metadata for a specific number of retries (emrfs-retry-logic). The default is 5. If
the number of retries is exceeded, the originating job returns a
failure. To overcome this issue you can override the default EMRFS
configuration with the following steps:
Step 1: Log in to your EMR master instance.
Step 2: Add the following properties to /usr/share/aws/emr/emrfs/conf/emrfs-site.xml:
sudo vi /usr/share/aws/emr/emrfs/conf/emrfs-site.xml
<property>
<name>fs.s3.consistent.throwExceptionOnInconsistency</name>
<value>false</value>
</property>
<property>
<name>fs.s3.consistent.retryPolicyType</name>
<value>fixed</value>
</property>
<property>
<name>fs.s3.consistent.retryPeriodSeconds</name>
<value>10</value>
</property>
<property>
<name>fs.s3.consistent</name>
<value>false</value>
</property>
Then restart your EMR cluster.
Also configure your hadoopConfiguration, for example with hadoopConf.set("fs.s3a.attempts.maximum", "30"):
val hadoopConf = SparkDriver.getContext.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3a.attempts.maximum", "30")
hadoopConf.set("fs.s3n.awsAccessKeyId", awsAccessKeyId)
hadoopConf.set("fs.s3n.awsSecretAccessKey", awsSecretAccessKey)
In my Spark application, AWS credentials are passed in via command-line arguments.
spark.sparkContext.hadoopConfiguration.set("fs.s3.awsAccessKeyId", awsAccessKeyId)
spark.sparkContext.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", awsSecretAccessKey)
spark.sparkContext.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
However, in cluster mode, explicitly passing credentials between nodes is a huge security issue since they are passed as plain text.
How do I make my application work with an IAM role, or another proper approach, so that it doesn't need these two lines of code in the Spark app:
spark.sparkContext.hadoopConfiguration.set("fs.s3.awsAccessKeyId", awsAccessKeyId)
spark.sparkContext.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", awsSecretAccessKey)
You can add the following config to core-site.xml in your Hadoop conf, so that you don't have to add it in your code base:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>my_aws_access_key_id_here</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>my_aws_secret_access_key_here</value>
</property>
</configuration>
To use the above file, simply export HADOOP_CONF_DIR=~/Private/.aws/hadoop_conf before running Spark, or set it in conf/spark-env.sh.
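With the keys supplied through core-site.xml, the application itself no longer has to touch credentials; a minimal sketch of what remains in code (the fs.s3.impl line is kept only because the question already sets it):
// Sketch: the access key and secret now come from core-site.xml via HADOOP_CONF_DIR,
// so no credentials appear in the application code.
spark.sparkContext.hadoopConfiguration
  .set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")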
And for IAM roles there is already a bug open against Spark 1.6: https://issues.apache.org/jira/browse/SPARK-16363