I'm reading large dataset form hdfs location and saving my dataframe into redshift.
df.write
.format("com.databricks.spark.redshift")
.option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
.option("dbtable", "my_table_copy")
.option("tempdir", "s3n://path/for/temp/data")
.mode("error")
.save()
After some time i am getting following error
s3.amazonaws.com:443 failed to respond
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:143)
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
at org.apache.http.impl.conn.AbstractClientConnAdapter.receiveResponseHeader(AbstractClientConnAdapter.java:223)
at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:334)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:281)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestPut(RestStorageService.java:1043)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.copyObjectImpl(RestStorageService.java:2029)
at org.jets3t.service.StorageService.copyObject(StorageService.java:871)
at org.jets3t.service.StorageService.copyObject(StorageService.java:916)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.copy(Jets3tNativeFileSystemStore.java:323)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.rename(NativeS3FileSystem.java:707)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:370)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:384)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:326)
at org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:230)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:
I found the same issue on github
s3.amazonaws.com:443 failed to respond
am i doing something wrong ?
help me plz
I had the same issue in my case I was using AWS EMR too.
Redshift databricks library using the Amazon S3 for efficiently transfer data in and out of RedshiftSpark.This library firstly write the data in Amazon S3 and than this avro files loaded into Redshift using EMRFS.
You have to configure your EMRFS setting and it will be work.
The EMR File System (EMRFS) and the Hadoop Distributed File System
(HDFS) are both installed on your EMR cluster. EMRFS is an
implementation of HDFS which allows EMR clusters to store data on
Amazon S3.
EMRFS will try to verify list consistency for objects tracked in its
metadata for a specific number of retries(emrfs-retry-logic). The default is 5. In the
case where the number of retries is exceeded the originating job
returns a failure. To overcome this issue you can override your
default emrfs configuration in the following steps:
Step1: Login your EMR-master instance
Step2: Add following properties to /usr/share/aws/emr/emrfs/conf/emrfs-site.xml
sudo vi /usr/share/aws/emr/emrfs/conf/emrfs-site.xml
fs.s3.consistent.throwExceptionOnInconsistency
false
<property>
<name>fs.s3.consistent.retryPolicyType</name>
<value>fixed</value>
</property>
<property>
<name>fs.s3.consistent.retryPeriodSeconds</name>
<value>10</value>
</property>
<property>
<name>fs.s3.consistent</name>
<value>false</value>
</property>
And restart your EMR cluster
and also configure your hadoopConfiguration hadoopConf.set("fs.s3a.attempts.maximum", "30")
val hadoopConf = SparkDriver.getContext.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3a.attempts.maximum", "30")
hadoopConf.set("fs.s3n.awsAccessKeyId", awsAccessKeyId)
hadoopConf.set("fs.s3n.awsSecretAccessKey", awsSecretAccessKey)
Related
I am running a Spark/Scala App on AWS EMR - 12 node cluster. I have multiple transformations happening where i write to HDFS and read back from hdfs to complete transformations and finally write to S3.
During one of these transformations i recently started to get the following error"
2018-08-10 20:05:31,106 [task-result-getter-2] WARN org.apache.spark.scheduler.TaskSetManager:66 - Lost task 44.0 in stage 30.0 (TID 4300, IP-address-here, executor 5): org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/hadoop/working/xml/SD5256.20171030.5251246b-5475.xml.__temp could only be replicated to 0 nodes instead of minReplication (=1). There are 11 datanode(s) running and no node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1735)
at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:265)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2561)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:829)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:510)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:847)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:790)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2486)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1489)
at org.apache.hadoop.ipc.Client.call(Client.java:1435)
at org.apache.hadoop.ipc.Client.call(Client.java:1345)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy38.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:444)
at sun.reflect.GeneratedMethodAccessor239.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346)
at com.sun.proxy.$Proxy39.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DataStreamer.locateFollowingBlock(DataStreamer.java:1838)
at org.apache.hadoop.hdfs.DataStreamer.nextBlockOutputStream(DataStreamer.java:1638)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:704)
Based on some articles and forum member comments, i updated the hdfs-site.xml by adding the following configuration:
<property>
<name>dfs.client.use.datanode.hostname</name>
<value>true</value>
</property>
<property>
<name>dfs.datanode.use.datanode.hostname</name>
<value>true</value>
</property>
Can someone help me understand why i am getting this error? and what configuration do i need to update in hdfs-site.xml to address this issue. Any help is appreciated.
I think could be due to
1.Because of multiple transformation your job needs to open more files which might be exceeding the maximum number of open files specified in the ulimit.
2.your jobs are trying to write to the same HDFS file concurrently. HDFS does not allow concurrent writes to the same file.
Possible Solutions-
1. At any given point of time an HDFS file can have only one writer connection. you need ensure that your are not writing to same HDFS files.
2.consider increasing the ulimit maximum number of open files for the user runs the job on all nodes.
I'm trying to run a Spark streaming app from my local to connect to an S3 bucket and am running into a SocketTimeoutException. This is the code to read from the bucket:
val sc: SparkContext = createSparkContext(scName)
val hadoopConf=sc.hadoopConfiguration
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
val ssc = new StreamingContext(sc, Seconds(time))
val lines = ssc.textFileStream("s3a://foldername/subfolder/")
lines.print()
This is the error I get:
com.amazonaws.http.AmazonHttpClient executeHelper - Unable to execute HTTP request: connect timed out
java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
I thought it might be due to the proxy so I ran my spark-submit with the proxy options like so:
spark-submit --conf "spark.driver.extraJavaOptions=
-Dhttps.proxyHost=proxyserver.com -Dhttps.proxyPort=9000"
--class application.jar s3module 5 5 SampleApp
That still gave me the same error. Perhaps I'm not setting the proxy properly? Is there a way to set it in the code in SparkContext's conf?
there's specific options for proxy setup, covered in the docs
<property>
<name>fs.s3a.proxy.host</name>
<description>Hostname of the (optional) proxy server for S3 connections.</description>
</property>
<property>
<name>fs.s3a.proxy.port</name>
<description>Proxy server port. If this property is not set
but fs.s3a.proxy.host is, port 80 or 443 is assumed (consistent with
the value of fs.s3a.connection.ssl.enabled).</description>
</property>
Which can be set in spark defaults with the spark.hadoop prefix
spark.hadoop.fs.s3a.proxy.host=myproxy
spark.hadoop.fs.s3a.proxy.port-8080
I am trying to access HIVE from spark application with scala.
My code:
val hiveLocation = "hdfs://master:9000/user/hive/warehouse"
val conf = new SparkConf().setAppName("SOME APP NAME").setMaster("local[*]").set("spark.sql.warehouse.dir",hiveLocation)
val sc = new SparkContext(conf)
val spark = SparkSession
.builder()
.appName("SparkHiveExample")
.master("local[*]")
.config("spark.sql.warehouse.dir", hiveLocation)
.config("spark.driver.allowMultipleContexts", "true")
.enableHiveSupport()
.getOrCreate()
println("Start of SQL Session--------------------")
spark.sql("select * from test").show()
println("End of SQL session-------------------")
But it ends up with error message
Table or view not found
but when I run show tables; under hive console , I can see that table and can run Select * from test. All are in "user/hive/warehouse" location. Just for testing I tried with create table also from spark, just to find out the table location.
val spark = SparkSession
.builder()
.appName("SparkHiveExample")
.master("local[*]")
.config("spark.sql.warehouse.dir", hiveLocation)
.config("spark.driver.allowMultipleContexts", "true")
.enableHiveSupport()
.getOrCreate()
println("Start of SQL Session--------------------")
spark.sql("CREATE TABLE IF NOT EXISTS test11(name String)")
println("End of SQL session-------------------")
This code also executed properly (with success note) but strange thing is that I can find this table from hive console.
Even if I use select * from TBLS; in mysql (in my setup I configured mysql as metastore for hive), I did not found those tables which are created from spark.
Is spark location is different than hive console?
What I have to do if I need to access existing table in hive from spark?
from the spark sql programming guide:
(I highlighted the relevant parts)
Configuration of Hive is done by placing your hive-site.xml,
core-site.xml (for security configuration), and hdfs-site.xml (for
HDFS configuration) file in conf/.
When working with Hive, one must instantiate SparkSession with Hive
support, including connectivity to a persistent Hive metastore,
support for Hive serdes, and Hive user-defined functions. Users who do
not have an existing Hive deployment can still enable Hive support.
When not configured by the hive-site.xml, the context automatically
creates metastore_db in the current directory and creates a directory
configured by spark.sql.warehouse.dir, which defaults to the directory
spark-warehouse in the current directory that the Spark application is
started
you need to add a hive-site.xml config file to the resource dir.
here is the minimum needed values for spark to work with hive (set the host to the host of hive):
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hive.metastore.uris</name>
<value>thrift://host:9083</value>
<description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>
</configuration>
In my Spark application, I have aws credentials passed in via Command Line arguments.
spark.sparkContext.hadoopConfiguration.set("fs.s3.awsAccessKeyId", awsAccessKeyId)
spark.sparkContext.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", awsSecretAccessKey)
spark.sparkContext.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
However, in Cluster Mode explicitly passing credential between nodes is huge security issue since these credentials are being passed as text.
How do I make my application to work with IAmRole or other proper approach that doesn't need this two lines of code in Spark app:
spark.sparkContext.hadoopConfiguration.set("fs.s3.awsAccessKeyId", awsAccessKeyId)
spark.sparkContext.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", awsSecretAccessKey)
You can add following config in core-site.xml of hadoop conf and cannot add it in your code base
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>my_aws_access_key_id_here</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>my_aws_secret_access_key_here</value>
</property>
</configuration>
To use the above file simply export HADOOP_CONF_DIR=~/Private/.aws/hadoop_conf before running spark or conf/spark-env.sh
And for IAM Role there is already bug is open in spark 1.6 https://issues.apache.org/jira/browse/SPARK-16363
I have hortonworks sandbox running in my VM. I have done all the hive-site.xml configurations and placed in Spark/conf file.
I can access HBase using PySpark and create/update tables but when I do the same implementation in scala its giving me the following error:
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask.
MetaException(message:file:/user/hive/warehouse/src is not a directory
or unable to create one)
I have changed my permission on ‘hive/warehouse’ folder too but still its giving me the same error.
[root#sandbox ~]# sudo -u hdfs hadoop fs -ls -d /user/hive/warehouse
drwxrwxrwt - hdfs hdfs 0 2015-02-02 09:19 /user/hive/warehouse
My hive-site.xml contains following property
<property>
<name>hive.security.authorization.enabled</name>
<value>false</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<property>
<name>java.jdo.option.ConnectionPassword</name>
<value>hive</value>
<description>
</description>
</property>
Thank you very much in advance.
Finally found the mistake I was making.
The correct SPARK_HOME location must be specified in the code that runs on the local machine
import os
import sys
# Path for spark source folder
os.environ['SPARK_HOME']="/Users/renienj/spark-1.1.0/dist"
Basically, the local machine doesn’t have privileges to HDFS because the classpath does not include HADOOP_CONF_DIR. Hence, the warehouse and tmp directories are in Hadoop but the table directory creation failures are stored in the local file system.
So to solve the problem we need submit the packaged JAR with the local distribution package.
$SPARK_HOME/bin/spark-submit --class "Hello" --master local[4] hello-scala_2.10-1.0.jar