Spark streaming connecting to S3 gives socket timeout - scala

I'm trying to run a Spark streaming app from my local machine to connect to an S3 bucket and am running into a SocketTimeoutException. This is the code to read from the bucket:
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sc: SparkContext = createSparkContext(scName)
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
val ssc = new StreamingContext(sc, Seconds(time))
val lines = ssc.textFileStream("s3a://foldername/subfolder/")
lines.print()
This is the error I get:
com.amazonaws.http.AmazonHttpClient executeHelper - Unable to execute HTTP request: connect timed out
java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
I thought it might be due to the proxy so I ran my spark-submit with the proxy options like so:
spark-submit --conf "spark.driver.extraJavaOptions=
-Dhttps.proxyHost=proxyserver.com -Dhttps.proxyPort=9000"
--class application.jar s3module 5 5 SampleApp
That still gave me the same error. Perhaps I'm not setting the proxy properly? Is there a way to set it in the code in SparkContext's conf?

There are specific options for proxy setup, covered in the docs:
<property>
<name>fs.s3a.proxy.host</name>
<description>Hostname of the (optional) proxy server for S3 connections.</description>
</property>
<property>
<name>fs.s3a.proxy.port</name>
<description>Proxy server port. If this property is not set
but fs.s3a.proxy.host is, port 80 or 443 is assumed (consistent with
the value of fs.s3a.connection.ssl.enabled).</description>
</property>
These can be set in the Spark defaults with the spark.hadoop prefix:
spark.hadoop.fs.s3a.proxy.host=myproxy
spark.hadoop.fs.s3a.proxy.port=8080
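The same options can also be set from code, either through spark.hadoop.* properties on the SparkConf or directly on the SparkContext's Hadoop configuration. A minimal sketch, with the proxy host and port as placeholder values:
import org.apache.spark.{SparkConf, SparkContext}

// Option 1: via spark.hadoop.* properties before the context is created
val conf = new SparkConf()
  .setAppName("SampleApp")
  .set("spark.hadoop.fs.s3a.proxy.host", "proxyserver.com") // placeholder host
  .set("spark.hadoop.fs.s3a.proxy.port", "9000")            // placeholder port
val sc = new SparkContext(conf)

// Option 2: directly on an existing SparkContext's Hadoop configuration
sc.hadoopConfiguration.set("fs.s3a.proxy.host", "proxyserver.com")
sc.hadoopConfiguration.set("fs.s3a.proxy.port", "9000")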

Related

Hudi: Access to timeserver times out in embedded mode

I am testing Hudi 0.5.3 (supported by AWS Athena) by running it with Spark in embedded mode, i.e. in unit tests. At first the test succeeded, but now it's failing due to a timeout when accessing Hudi's timeserver.
The following is based on the Hudi Getting Started guide.
Spark Session setup:
private val spark = addSparkConfigs(SparkSession.builder()
.appName("spark testing")
.master("local"))
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.config("spark.ui.port", "4041")
.enableHiveSupport()
.getOrCreate()
Code which causes timeout exception:
val inserts = convertToStringList(dataGen.generateInserts(10))
var df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
df.write.format("hudi").
options(getQuickstartWriteConfigs).
option(PRECOMBINE_FIELD_OPT_KEY, "ts").
option(RECORDKEY_FIELD_OPT_KEY, "uuid").
option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
option(TABLE_NAME, tableName).
mode(Overwrite).
save(basePath)
The timeout and the exception thrown:
170762 [Executor task launch worker for task 47] INFO org.apache.hudi.common.table.view.FileSystemViewManager - Creating remote view for basePath /var/folders/z9/_9mf84p97hz1n45b0gnpxlj40000gp/T/HudiQuickStartSpec-hudi_trips_cow2193648737745630661. Server=xxx:59520
170766 [Executor task launch worker for task 47] INFO org.apache.hudi.common.table.view.FileSystemViewManager - Creating InMemory based view for basePath /var/folders/z9/_9mf84p97hz1n45b0gnpxlj40000gp/T/HudiQuickStartSpec-hudi_trips_cow2193648737745630661
170769 [Executor task launch worker for task 47] INFO org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView - Sending request : (http://xxx:59520/v1/hoodie/view/datafiles/beforeoron/latest/?partition=americas%2Funited_states%2Fsan_francisco&maxinstant=20201221180946&basepath=%2Fvar%2Ffolders%2Fz9%2F_9mf84p97hz1n45b0gnpxlj40000gp%2FT%2FHudiQuickStartSpec-hudi_trips_cow2193648737745630661&lastinstantts=20201221180946&timelinehash=70f7aa073fa3d86033278a59cbda71c6488f4883570d826663ebb51934a25abf)
246649 [Executor task launch worker for task 47] ERROR org.apache.hudi.common.table.view.PriorityBasedFileSystemView - Got error running preferred function. Trying secondary
org.apache.hudi.exception.HoodieRemoteException: Connect to xxx:59520 [/xxx] failed: Operation timed out (Connection timed out)
at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestBaseFilesFromParams(RemoteHoodieTableFileSystemView.java:223)
at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestBaseFilesBeforeOrOn(RemoteHoodieTableFileSystemView.java:230)
at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.execute(PriorityBasedFileSystemView.java:97)
at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.getLatestBaseFilesBeforeOrOn(PriorityBasedFileSystemView.java:134)
at org.apache.hudi.index.bloom.HoodieBloomIndex.lambda$loadInvolvedFiles$19c2c1bb$1(HoodieBloomIndex.java:201)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:125)
I wasn't able to experiment with different port settings for Hudi's timeserver, as I couldn't find the config setting that controls the port.
Any ideas why access to the timeserver times out?
The problem turned out to be rooted in the way Hudi resolves the Spark driver host. It seems that although it starts and binds its web server to localhost, Hudi's client subsequently uses the IP address to make calls to the server it started.
5240 [pool-1-thread-1-ScalaTest-running-HudiSimpleCdcSpec] INFO io.javalin.Javalin - Starting Javalin ...
5348 [pool-1-thread-1-ScalaTest-running-HudiSimpleCdcSpec] INFO io.javalin.Javalin - Listening on http://localhost:59520/
...
org.apache.hudi.exception.HoodieRemoteException: Connect to xxx:59520 [/xxx] failed: Operation timed out (Connection timed out)
The solution is to configure the "spark.driver.host" setting explicitly. The following worked for me:
private val spark = addSparkConfigs(SparkSession.builder()
.appName("spark testing")
.master("local"))
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.config("spark.driver.host", "localhost")
.config("spark.ui.port", "4041")
.enableHiveSupport()
.getOrCreate()
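If the application is launched with spark-submit rather than in an embedded test, the same setting can presumably be passed on the command line as well, for example:
spark-submit --conf spark.driver.host=localhost ...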

Spark streaming, how to see JMX metrics in UI

I have configured JMX metrics in a Spark streaming application. Following is the code:
val sc = spark.sparkContext
spark.conf.set("spark.sql.streaming.metricsEnabled", "true")
spark.conf.set(s"spark.metrics.conf.*.sink.jmx.class", "org.apache.spark.metrics.sink.JmxSink")
UserMetricsSystem.initialize(sc, config.getAppNamespace)
val listener = new EventCollector(isSingleStream) // Some custom code
spark.streams.addListener(listener)
The metrics.properties contents are:
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
After configuring all this and running the application, I can see inputRate, latency and processingRate for the application on:
jconsole <host>:<port> // driver host
But I want to see those metrics in JSON format in a browser.
Is there a way to access these configured JMX metrics through a Spark API?
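One possibility worth checking (an assumption, not something confirmed in this thread): Spark's metrics system ships a MetricsServlet sink that exposes the driver's metrics registry as JSON on the driver web UI, normally under /metrics/json/. A small sketch for fetching it from code, with the host and UI port as placeholders:
// Assumes the default driver UI port 4040 and the built-in MetricsServlet sink,
// which serves the metrics registry as JSON at /metrics/json/ on the driver UI.
import scala.io.Source
val driverUi = "http://localhost:4040" // placeholder: your driver host and UI port
val metricsJson = Source.fromURL(driverUi + "/metrics/json/").mkString
println(metricsJson)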

Connection to Cassandra from spark Error

I am using Spark 2.0.2 and Cassandra 3.11.2. I am using this code, but it gives me a connection error.
./spark-shell --jars ~/spark/spark-cassandra-connector/spark-cassandra-connector/target/full/scala-2.10/spark-cassandra-connector-assembly-2.0.5-121-g1a7fa1f8.jar
import com.datastax.spark.connector._
val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
val test = sc.cassandraTable("sensorkeyspace", "sensortable")
test.count
When I enter the test.count command, it gives me this error:
java.io.IOException: Failed to open native connection to Cassandra at {127.0.0.1}:9042
at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:168)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$8.apply(CassandraConnector.scala:154)
Can you check the cassandra.yaml file? It seems that not enough concurrent connections can be open at any instant of time.
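As an aside (an observation, not part of the original answer): in spark-shell the SparkContext already exists when the shell starts, so building a new SparkConf inside the shell does not change the connection host the connector actually uses. The host can instead be passed when starting the shell, for example:
./spark-shell --jars ~/spark/spark-cassandra-connector/spark-cassandra-connector/target/full/scala-2.10/spark-cassandra-connector-assembly-2.0.5-121-g1a7fa1f8.jar --conf spark.cassandra.connection.host=localhost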

Unable to save dataframe in redshift

I'm reading a large dataset from an HDFS location and saving my dataframe to Redshift:
df.write
.format("com.databricks.spark.redshift")
.option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass")
.option("dbtable", "my_table_copy")
.option("tempdir", "s3n://path/for/temp/data")
.mode("error")
.save()
After some time I am getting the following error:
s3.amazonaws.com:443 failed to respond
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:143)
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
at org.apache.http.impl.conn.AbstractClientConnAdapter.receiveResponseHeader(AbstractClientConnAdapter.java:223)
at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:685)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:487)
at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:882)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:334)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:281)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestPut(RestStorageService.java:1043)
at org.jets3t.service.impl.rest.httpclient.RestStorageService.copyObjectImpl(RestStorageService.java:2029)
at org.jets3t.service.StorageService.copyObject(StorageService.java:871)
at org.jets3t.service.StorageService.copyObject(StorageService.java:916)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.copy(Jets3tNativeFileSystemStore.java:323)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.rename(NativeS3FileSystem.java:707)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:370)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:384)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:326)
at org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:230)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:
I found the same issue on GitHub:
s3.amazonaws.com:443 failed to respond
Am I doing something wrong? Please help.
I had the same issue, and in my case I was using AWS EMR too.
The Redshift databricks library uses Amazon S3 to efficiently transfer data in and out of Redshift. This library first writes the data to Amazon S3 as Avro files, and then loads those files into Redshift using EMRFS.
You have to configure your EMRFS settings and it will work.
The EMR File System (EMRFS) and the Hadoop Distributed File System
(HDFS) are both installed on your EMR cluster. EMRFS is an
implementation of HDFS which allows EMR clusters to store data on
Amazon S3.
EMRFS will try to verify list consistency for objects tracked in its metadata for a specific number of retries (emrfs retry logic). The default is 5. If the number of retries is exceeded, the originating job returns a failure. To overcome this issue you can override your default EMRFS configuration with the following steps:
Step 1: Log in to your EMR master instance
Step 2: Add the following properties to /usr/share/aws/emr/emrfs/conf/emrfs-site.xml:
sudo vi /usr/share/aws/emr/emrfs/conf/emrfs-site.xml
<property>
<name>fs.s3.consistent.throwExceptionOnInconsistency</name>
<value>false</value>
</property>
<property>
<name>fs.s3.consistent.retryPolicyType</name>
<value>fixed</value>
</property>
<property>
<name>fs.s3.consistent.retryPeriodSeconds</name>
<value>10</value>
</property>
<property>
<name>fs.s3.consistent</name>
<value>false</value>
</property>
Then restart your EMR cluster.
Also configure your hadoopConfiguration, e.g. hadoopConf.set("fs.s3a.attempts.maximum", "30"):
val hadoopConf = SparkDriver.getContext.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3a.attempts.maximum", "30")
hadoopConf.set("fs.s3n.awsAccessKeyId", awsAccessKeyId)
hadoopConf.set("fs.s3n.awsSecretAccessKey", awsSecretAccessKey)

Spark atop of Docker not accepting jobs

I'm trying to make a hello world example work with spark+docker, and here is my code.
object Generic {
def main(args: Array[String]) {
val sc = new SparkContext("spark://172.17.0.3:7077", "Generic", "/opt/spark-0.9.0")
val NUM_SAMPLES = 100000
val count = sc.parallelize(1 to NUM_SAMPLES).map{i =>
val x = Math.random * 2 - 1
val y = Math.random * 2 - 1
if (x * x + y * y < 1) 1.0 else 0.0
}.reduce(_ + _)
println("Pi is roughly " + 4 * count / NUM_SAMPLES)
}
}
When I run sbt run, I get
14/05/28 15:19:58 INFO client.AppClient$ClientActor: Connecting to master spark://172.17.0.3:7077...
14/05/28 15:20:08 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
I checked both the cluster UI, where I have 3 nodes that each have 1.5g of memory, and the namenode UI, where I see the same thing.
The docker logs show no output from the workers and the following from the master
14/05/28 21:20:38 ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster#master:7077] -> [akka.tcp://spark#10.0.3.1:48085]: Error [Association failed with [akka.tcp://spark#10.0.3.1:48085]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark#10.0.3.1:48085]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: /10.0.3.1:48085
]
This happens a couple times, and then the program times out and dies with
[error] (run-main-0) org.apache.spark.SparkException: Job aborted: Spark cluster looks down
I did a tcpdump over the docker0 interface, and it looks like the workers and the master node are talking.
However, the Spark console works.
If I set sc as val sc = new SparkContext("local", "Generic", System.getenv("SPARK_HOME")), the program runs.
I've been there. It looks like the Akka actor subsystem in Spark is binding to a different interface than docker0.
While your master IP is: spark://172.17.0.3:7077
Akka is binding to: akka.tcp://spark#10.0.3.1:48085
If your masters/slaves are Docker containers, they should be communicating through the docker0 interface in the 172.17.x.x range.
Try providing the master and slaves with their correct local IP using the env variable SPARK_LOCAL_IP. See the config docs for details.
In our docker setup for Spark 0.9 we are using this command to start the slaves:
${SPARK_HOME}/bin/spark-class org.apache.spark.deploy.worker.Worker $MASTER_IP -i $LOCAL_IP
Which directly provides the local IP to the worker.
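A sketch of the environment-variable route mentioned above, with the worker's address as a placeholder for the container's own 172.17.x.x address:
export SPARK_LOCAL_IP=172.17.0.4   # placeholder: the container's docker0 address
${SPARK_HOME}/bin/spark-class org.apache.spark.deploy.worker.Worker spark://172.17.0.3:7077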
For running Spark on Docker it's crucial to:
Expose all necessary ports
Set the correct spark.broadcast.factory
Handle Docker aliases
Without handling all three issues the Spark cluster parts (master, worker, driver) can't communicate. You can read more about each issue at http://sometechshit.blogspot.ru/2015/04/running-spark-standalone-cluster-in.html or use a container ready for Spark from https://registry.hub.docker.com/u/epahomov/docker-spark/
You have to check the firewall if you are on a Windows host and make sure java.exe is allowed to access the public network, or change DockerNAT to private. In general, the worker must be able to connect back to the driver (the program you submitted).
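To illustrate that last point, a minimal sketch (the addresses and port are placeholders, not taken from the original posts) of pinning the driver's advertised host and port so that containerized workers can reach back to it:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("Generic")
  .setMaster("spark://172.17.0.3:7077")
  // placeholders: an address the containers can route to, and a fixed driver port
  .set("spark.driver.host", "172.17.42.1")
  .set("spark.driver.port", "51000")
val sc = new SparkContext(conf)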