Spark: how to avoid using AWS credentials explicitly in a Spark application - Scala

In my Spark application, I have AWS credentials passed in via command-line arguments:
spark.sparkContext.hadoopConfiguration.set("fs.s3.awsAccessKeyId", awsAccessKeyId)
spark.sparkContext.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", awsSecretAccessKey)
spark.sparkContext.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
However, in cluster mode, explicitly passing credentials between nodes is a huge security issue, since they are passed as plain text.
How do I make my application work with an IAM role (or another proper approach) so that it doesn't need these two lines of code in the Spark app:
spark.sparkContext.hadoopConfiguration.set("fs.s3.awsAccessKeyId", awsAccessKeyId)
spark.sparkContext.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", awsSecretAccessKey)

You can add the following configuration to core-site.xml in your Hadoop conf directory, so that you don't have to set it in your code base:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>my_aws_access_key_id_here</value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>my_aws_secret_access_key_here</value>
  </property>
</configuration>
To use the above file, export HADOOP_CONF_DIR=~/Private/.aws/hadoop_conf before running Spark, or set it in conf/spark-env.sh.
As for IAM role support, there is already an open issue for Spark 1.6: https://issues.apache.org/jira/browse/SPARK-16363
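If your nodes already have an IAM role attached (for example on EMR or EC2), a common alternative is to use the s3a connector and let it resolve credentials on its own, so no keys appear in the application or on the command line. Below is a minimal sketch, assuming hadoop-aws and the AWS SDK are on the classpath; the bucket and path are placeholders, and the provider class should be adjusted to your setup (DefaultAWSCredentialsProviderChain also works if you want env vars or profiles as a fallback):

import org.apache.spark.sql.SparkSession

object S3ViaIamRole {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("s3-via-iam-role").getOrCreate()

    val hadoopConf = spark.sparkContext.hadoopConfiguration
    // Use the s3a connector instead of s3/s3n.
    hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    // Let the connector pick up credentials from the instance profile (IAM role);
    // no access key or secret key is set anywhere in the application.
    hadoopConf.set("fs.s3a.aws.credentials.provider",
      "com.amazonaws.auth.InstanceProfileCredentialsProvider")

    // Placeholder bucket and prefix, for illustration only.
    val df = spark.read.text("s3a://my-bucket/some/prefix/")
    df.show(5, truncate = false)

    spark.stop()
  }
}

With this approach the credentials never leave the instance metadata service, which avoids shipping them between nodes as plain text.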

Related

How to read Hive table with Spark

I would like to read a Hive table with Spark. The Hive table data is stored as text files in /user/hive/warehouse/problem7.db.
I do:
val warehouseLocation = "hdfs://localhost:9000/user/hive/warehouse"
// Create the Spark Conf and the Spark Session
val conf = new SparkConf().setAppName("Spark Hive").setMaster("local[2]").set("spark.sql.warehouse.dir", warehouseLocation)
val spark = SparkSession.builder.config(conf).enableHiveSupport().getOrCreate()
val table1 = spark.sql("select * from problem7.categories")
table1.show(false)
I have the following error:
Table or view not found: `problem7`.`categories`
I resolved it in the following way:
I created a hive-site.xml in spark/conf and added:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
  </property>
</configuration>
then I started the Hive metastore service with the following command:
hive --service metastore
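For reference, once hive-site.xml is in spark/conf and the metastore service is up, the table can be read without setting spark.sql.warehouse.dir by hand. A minimal sketch, assuming the same problem7.categories table as above:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Spark Hive")
  .master("local[2]")
  .enableHiveSupport()
  .getOrCreate()

// The metastore URI comes from hive-site.xml on the classpath.
val categories = spark.sql("select * from problem7.categories")
categories.show(false)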

How to connect to Hive in VirtualBox from IntelliJ IDEA with Spark Scala

I need to connect to Hive in Cloudera CDH 5.8 running in VirtualBox from a Spark Scala program created in IntelliJ on a local Windows machine. Please help.
Mostly what you need is HDFS and Hive support. You have two options:
1) Create core-site.xml and hive-site.xml where you configure:
core-site.xml property:
<property>
  <name>fs.defaultFS</name>
  <value>maprfs://cdhdemo:7222</value>
</property>
hive-site.xml properties:
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://cdhdemo:9083</value>
</property>
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
</property>
2) Or you can configure it programmatically with the SparkSession builder:
val spark = SparkSession.builder()
  .config("hive.metastore.uris", "thrift://cdhdemo:9083")
  .config("hive.metastore.warehouse.dir", "/user/hive/warehouse")
  .config("fs.defaultFS", "maprfs://cdhdemo:7222")
  .enableHiveSupport()
  .getOrCreate()
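Either way, once Hive support is enabled you can sanity-check the connection with a couple of queries (the table name below is hypothetical, for illustration only):

spark.sql("show databases").show()
spark.sql("select * from default.some_table limit 10").show(false)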

Unable to save dataframe in redshift

I'm reading a large dataset from an HDFS location and saving my DataFrame into Redshift.
df.write
.format("com.databricks.spark.redshift")
.option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
.option("dbtable", "my_table_copy")
.option("tempdir", "s3n://path/for/temp/data")
.mode("error")
.save()
After some time I am getting the following error:
s3.amazonaws.com:443 failed to respond
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:143)
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
at org.apache.http.impl.conn.AbstractClientConnAdapter.receiveResponseHeader(AbstractClientConnAdapter.java:223)
at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:334)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:281)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestPut(RestStorageService.java:1043)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.copyObjectImpl(RestStorageService.java:2029)
at org.jets3t.service.StorageService.copyObject(StorageService.java:871)
at org.jets3t.service.StorageService.copyObject(StorageService.java:916)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.copy(Jets3tNativeFileSystemStore.java:323)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.rename(NativeS3FileSystem.java:707)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:370)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:384)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:326)
at org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:230)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:
I found the same issue reported on GitHub:
s3.amazonaws.com:443 failed to respond
Am I doing something wrong?
Please help.
I had the same issue; in my case I was using AWS EMR too.
The Redshift Databricks library uses Amazon S3 to transfer data efficiently in and out of Redshift. The library first writes the data to Amazon S3 as Avro files, and these files are then loaded into Redshift using EMRFS.
You have to configure your EMRFS settings, and then it will work.
The EMR File System (EMRFS) and the Hadoop Distributed File System
(HDFS) are both installed on your EMR cluster. EMRFS is an
implementation of HDFS which allows EMR clusters to store data on
Amazon S3.
EMRFS will try to verify list consistency for objects tracked in its
metadata for a specific number of retries (emrfs-retry-logic). The default is 5.
If the number of retries is exceeded, the originating job returns a failure.
To overcome this issue, you can override the default EMRFS configuration
with the following steps:
Step 1: Log in to your EMR master instance
Step 2: Add the following properties to /usr/share/aws/emr/emrfs/conf/emrfs-site.xml
sudo vi /usr/share/aws/emr/emrfs/conf/emrfs-site.xml
<property>
  <name>fs.s3.consistent.throwExceptionOnInconsistency</name>
  <value>false</value>
</property>
<property>
  <name>fs.s3.consistent.retryPolicyType</name>
  <value>fixed</value>
</property>
<property>
  <name>fs.s3.consistent.retryPeriodSeconds</name>
  <value>10</value>
</property>
<property>
  <name>fs.s3.consistent</name>
  <value>false</value>
</property>
Then restart your EMR cluster.
Also configure your hadoopConfiguration, in particular raising fs.s3a.attempts.maximum to 30:
val hadoopConf = SparkDriver.getContext.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3a.attempts.maximum", "30")
hadoopConf.set("fs.s3n.awsAccessKeyId", awsAccessKeyId)
hadoopConf.set("fs.s3n.awsSecretAccessKey", awsSecretAccessKey)

Spark-Scala HBase table creation fails (MetaException(message:file:/user/hive/warehouse/src is not a directory or unable to create one))

I have the Hortonworks sandbox running in my VM. I have done all the hive-site.xml configuration and placed the file in Spark/conf.
I can access HBase using PySpark and create/update tables, but when I do the same implementation in Scala it gives me the following error:
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask.
MetaException(message:file:/user/hive/warehouse/src is not a directory
or unable to create one)
I have changed the permissions on the 'hive/warehouse' folder too, but it still gives me the same error.
[root@sandbox ~]# sudo -u hdfs hadoop fs -ls -d /user/hive/warehouse
drwxrwxrwt - hdfs hdfs 0 2015-02-02 09:19 /user/hive/warehouse
My hive-site.xml contains the following properties:
<property>
  <name>hive.security.authorization.enabled</name>
  <value>false</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive</value>
</property>
Thank you very much in advance.
I finally found the mistake I was making.
The correct SPARK_HOME location must be specified in the code that runs on the local machine:
import os
import sys
# Path for spark source folder
os.environ['SPARK_HOME']="/Users/renienj/spark-1.1.0/dist"
Basically, the local machine doesn't have access to HDFS because the classpath does not include HADOOP_CONF_DIR. Hence, the warehouse and tmp directories exist in Hadoop, but the table directory creation is attempted on the local file system (hence the file:/user/hive/warehouse/src path in the error) and fails.
So to solve the problem, we need to submit the packaged JAR with the local distribution package:
$SPARK_HOME/bin/spark-submit --class "Hello" --master local[4] hello-scala_2.10-1.0.jar

Scala Spark / Shark: How to access existing Hive tables in Hortonworks?

I am trying to find some docs / description of the approach on the subject, please help.
I have Hadoop 2.2.0 from Hortonworks installed with some existing Hive tables I need to query. Hive SQL works extremely and unreasonably slowly, on a single node and on the cluster as well. I hope Shark will work faster.
From the Spark/Shark docs I cannot figure out how to make Shark work with existing Hive tables. Any ideas how to achieve this? Thanks!
You need to configure the metastore within the Shark-specific Hive directory. Details are provided in a similar question I answered here.
In summary, you will need to copy hive-default.xml to hive-site.xml, then ensure the metastore properties are set.
Here is the basic info in hive-site.xml:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://myhost/metastore</value>
  <description>the URL of the MySQL database</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>mypassword</value>
</property>
You can get more details here: configuring hive metastore