Hudi: Access to timeserver times out in embedded mode - scala

I am testing Hudi 0.5.3 (the version supported by AWS Athena) by running it with Spark in embedded mode, i.e. in unit tests. At first the test succeeded, but now it fails with a timeout when accessing Hudi's timeserver.
The following is based on the Hudi Getting Started guide.
Spark Session setup:
private val spark = addSparkConfigs(SparkSession.builder()
  .appName("spark testing")
  .master("local"))
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.ui.port", "4041")
  .enableHiveSupport()
  .getOrCreate()
Code that causes the timeout exception:
val inserts = convertToStringList(dataGen.generateInserts(10))
var df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
df.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, tableName).
  mode(Overwrite).
  save(basePath)
The timeout and the exception thrown:
170762 [Executor task launch worker for task 47] INFO org.apache.hudi.common.table.view.FileSystemViewManager - Creating remote view for basePath /var/folders/z9/_9mf84p97hz1n45b0gnpxlj40000gp/T/HudiQuickStartSpec-hudi_trips_cow2193648737745630661. Server=xxx:59520
170766 [Executor task launch worker for task 47] INFO org.apache.hudi.common.table.view.FileSystemViewManager - Creating InMemory based view for basePath /var/folders/z9/_9mf84p97hz1n45b0gnpxlj40000gp/T/HudiQuickStartSpec-hudi_trips_cow2193648737745630661
170769 [Executor task launch worker for task 47] INFO org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView - Sending request : (http://xxx:59520/v1/hoodie/view/datafiles/beforeoron/latest/?partition=americas%2Funited_states%2Fsan_francisco&maxinstant=20201221180946&basepath=%2Fvar%2Ffolders%2Fz9%2F_9mf84p97hz1n45b0gnpxlj40000gp%2FT%2FHudiQuickStartSpec-hudi_trips_cow2193648737745630661&lastinstantts=20201221180946&timelinehash=70f7aa073fa3d86033278a59cbda71c6488f4883570d826663ebb51934a25abf)
246649 [Executor task launch worker for task 47] ERROR org.apache.hudi.common.table.view.PriorityBasedFileSystemView - Got error running preferred function. Trying secondary
org.apache.hudi.exception.HoodieRemoteException: Connect to xxx:59520 [/xxx] failed: Operation timed out (Connection timed out)
at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestBaseFilesFromParams(RemoteHoodieTableFileSystemView.java:223)
at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestBaseFilesBeforeOrOn(RemoteHoodieTableFileSystemView.java:230)
at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.execute(PriorityBasedFileSystemView.java:97)
at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.getLatestBaseFilesBeforeOrOn(PriorityBasedFileSystemView.java:134)
at org.apache.hudi.index.bloom.HoodieBloomIndex.lambda$loadInvolvedFiles$19c2c1bb$1(HoodieBloomIndex.java:201)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:125)
I wasn't able to experiment with different port settings for the Hudi timeserver because I couldn't find the config setting that controls the port.
Any ideas why access to the timeserver times out?

The problem turned out to be rooted in the way Hudi resolves the Spark driver host. It seems that although it starts and binds its web server to localhost, Hudi's client subsequently uses the driver's IP address to make calls to the server it just started.
5240 [pool-1-thread-1-ScalaTest-running-HudiSimpleCdcSpec] INFO io.javalin.Javalin - Starting Javalin ...
5348 [pool-1-thread-1-ScalaTest-running-HudiSimpleCdcSpec] INFO io.javalin.Javalin - Listening on http://localhost:59520/
...
org.apache.hudi.exception.HoodieRemoteException: Connect to xxx:59520 [/xxx] failed: Operation timed out (Connection timed out)
The solution is to configure the "spark.driver.host" setting explicitly. The following worked for me:
private val spark = addSparkConfigs(SparkSession.builder()
  .appName("spark testing")
  .master("local"))
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.driver.host", "localhost")
  .config("spark.ui.port", "4041")
  .enableHiveSupport()
  .getOrCreate()
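Separately, if you want the test to avoid the embedded timeline server altogether, Hudi exposes a write config for disabling it. This is only a sketch; the hoodie.embed.timeline.server key should be verified against the 0.5.3 configuration list:
// Sketch: disable the embedded timeline server so the writer falls back to the
// in-memory file system view instead of the remote one (verify the key for 0.5.3).
df.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option("hoodie.embed.timeline.server", "false").
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, tableName).
  mode(Overwrite).
  save(basePath)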

Related

ERROR: java.lang.IllegalStateException: User did not initialize spark context

Scala version: 2.11.12
Spark version: 2.4.0
emr-5.23.0
I get the following error when running the below command to create an Amazon EMR cluster:
spark-submit --class etl.SparkDataProcessor --master yarn --deploy-mode cluster --conf spark.yarn.appMasterEnv.ETL_NAME=foo --conf spark.yarn.appMasterEnv.ETL_SPARK_MASTER=yarn --conf spark.yarn.appMasterEnv.ETL_AWS_ACCESS_KEY_ID=123 --conf spark.yarn.appMasterEnv.ETL_AWS_SECRET_ACCESS_KEY=abc MY-Tool.jar
Exception
ERROR ApplicationMaster: Uncaught exception:
java.lang.IllegalStateException: User did not initialize spark context!
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:485)
at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:773)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:772)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:797)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
How I create my Spark session (where sparkMaster = yarn):
lazy val spark: SparkSession = {
  val logger: Logger = Logger.getLogger("etl")
  val sparkAppName = EnvConfig.ETL_NAME
  val sparkMaster = EnvConfig.ETL_SPARK_MASTER
  val sparkInstance = SparkSession
    .builder()
    .appName(sparkAppName)
    .master(sparkMaster)
    .getOrCreate()
  val hadoopConf = sparkInstance.sparkContext.hadoopConfiguration
  hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  hadoopConf.set("fs.s3a.access.key", EnvConfig.ETL_AWS_ACCESS_KEY_ID)
  hadoopConf.set("fs.s3a.secret.key", EnvConfig.ETL_AWS_SECRET_ACCESS_KEY)
  logger.info("Created My SparkSession")
  logger.info(s"Spark Application Name: $sparkAppName")
  logger.info(s"Spark Master: $sparkMaster")
  sparkInstance
}
UPDATE:
I determined that, due to the application logic, in certain cases we did not initialize the Spark session at all. Because of this, it seems that when the cluster terminates it also tries to do something with the session (perhaps close it) and is thus failing. Now that I have figured out this issue, the application runs but never actually completes. Currently, it seems to hang at a particular Spark step when running in cluster mode:
val data: DataFrame = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(s"s3://$csvPath/$fileKey")
  .toDF()
20/03/16 18:38:35 INFO Client: Application report for application_1584324418613_0031 (state: RUNNING)
AFAIK, EnvConfig.ETL_AWS_ACCESS_KEY_ID and ETL_AWS_SECRET_ACCESS_KEY are not getting populated, so the SparkSession cannot be instantiated with null or empty values. Try printing and debugging the values.
Also, reading the properties passed via --conf spark.xxx should look like this example. I hope you are following this:
spark.sparkContext.getConf.getOption("spark.yarn.appMasterEnv.ETL_AWS_ACCESS_KEY_ID")
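A minimal debugging sketch along those lines, checking both the Spark conf entry and the environment variable that --conf spark.yarn.appMasterEnv.ETL_AWS_ACCESS_KEY_ID should produce in YARN cluster mode:
// Debugging sketch: verify that the values actually reach the driver.
// In YARN cluster mode, spark.yarn.appMasterEnv.* entries are also exported as
// environment variables in the ApplicationMaster/driver container.
println(spark.sparkContext.getConf.getOption("spark.yarn.appMasterEnv.ETL_AWS_ACCESS_KEY_ID"))
println(sys.env.get("ETL_AWS_ACCESS_KEY_ID"))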
Once you have checked that, something like this example should work:
/**
* Hadoop-AWS Configuration
*/
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.proxy.host", proxyHost)
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.proxy.port", proxyPort)
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.server-side-encryption-algorithm", "AES256")
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3n.server-side-encryption-algorithm", "AES256")
sparkSession.sparkContext.hadoopConfiguration.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
Another thing: you can use --master yarn or --master local[*] instead of
--conf spark.yarn.appMasterEnv.ETL_SPARK_MASTER=yarn
UPDATE:
--conf spark.driver.port=20002 may solve this issue, where 20002 is an arbitrary free port. It seems like it waits for that particular port for some time, retries for a while, and then fails with the exception you got.
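As a sketch, the same property can also be pinned in the builder instead of on the command line (20002 is just the arbitrary free port from the suggestion above):
// Sketch: fix the driver port explicitly instead of letting Spark pick a random one.
val sparkInstance = SparkSession
  .builder()
  .appName(sparkAppName)
  .master(sparkMaster)
  .config("spark.driver.port", "20002") // any free port works
  .getOrCreate()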
I got this idea by walking through the Spark ApplicationMaster code from here
and the comment "This a bit hacky, but we need to wait until the spark.driver.port property has been set by the Thread executing the user class."
You can try this and let me know.
Further reading: Apache Spark: How to change the port the Spark driver listens to
In my case (after resolving the application issues), I needed to include core AND task node types when deploying in cluster mode.

Connection to Cassandra from Spark error

I am using Spark 2.0.2 and Cassandra 3.11.2. I am using this code, but it gives me a connection error.
./spark-shell --jars ~/spark/spark-cassandra-connector/spark-cassandra-connector/target/full/scala-2.10/spark-cassandra-connector-assembly-2.0.5-121-g1a7fa1f8.jar
import com.datastax.spark.connector._
val conf = new SparkConf(true).set("spark.cassandra.connection.host", "localhost")
val test = sc.cassandraTable("sensorkeyspace", "sensortable")
test.count
When I enter the test.count command, it gives me this error:
java.io.IOException: Failed to open native connection to Cassandra at {127.0.0.1}:9042
at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:168)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$8.apply(CassandraConnector.scala:154)
Can you check the yaml file? It may be that not enough concurrent connections are allowed to be open at any instant of time.
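Before digging into cassandra.yaml, it may also be worth confirming that the running SparkContext actually carries the connection host; in spark-shell, sc already exists when the shell starts, so a SparkConf built afterwards is not applied to it. A small check, assuming the default sc from spark-shell:
// Check which Cassandra host the already-running SparkContext will use.
// If it is empty, pass --conf spark.cassandra.connection.host=127.0.0.1
// when launching spark-shell instead of building a new SparkConf afterwards.
println(sc.getConf.getOption("spark.cassandra.connection.host"))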

Spark worker nodes timeout

When I run my Spark app using sbt run, with the configuration pointing to the master of a remote cluster, nothing useful gets executed by the workers and the following warning is printed repeatedly in the sbt run log.
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
This is what my Spark config looks like:
@transient lazy val conf: SparkConf = new SparkConf()
  .setMaster("spark://master-ip:7077")
  .setAppName("HelloWorld")
  .set("spark.executor.memory", "1g")
  .set("spark.driver.memory", "12g")
@transient lazy val sc: SparkContext = new SparkContext(conf)
val lines = sc.textFile("hdfs://master-public-dns:9000/test/1000.csv")
I know this warning usually appears when the cluster is misconfigured and the workers either don't have the resources or aren't started in the first place. However, according to my Spark UI (on master-ip:8080), the worker nodes seem to be alive with sufficient RAM and CPU cores; they even try to execute my app, but they exit and leave this in the stderr log:
INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled;
users with view permissions: Set(ubuntu, myuser);
groups with view permissions: Set(); users with modify permissions: Set(ubuntu, myuser); groups with modify permissions: Set()
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
...
Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply from 192.168.0.11:35996 in 120 seconds
... 8 more
ERROR RpcOutboxMessage: Ask timeout before connecting successfully
Any ideas?
Cannot receive any reply from 192.168.0.11:35996 in 120 seconds
Could you telnet to this port on this IP from a worker? Maybe your driver machine has multiple network interfaces; try setting SPARK_LOCAL_IP in $SPARK_HOME/conf/spark-env.sh.
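Equivalently, a sketch of pinning the driver's advertised address directly on the SparkConf from the question (driver-reachable-ip is a placeholder for an address the workers can actually route to):
// Sketch: advertise an address the workers can reach, instead of whatever
// interface Spark picks by default on a multi-homed machine.
@transient lazy val conf: SparkConf = new SparkConf()
  .setMaster("spark://master-ip:7077")
  .setAppName("HelloWorld")
  .set("spark.driver.host", "driver-reachable-ip") // placeholder address
  .set("spark.executor.memory", "1g")
  .set("spark.driver.memory", "12g")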

Unable to submit Spark job to yarn cluster using Scala

I am trying to submit a Spark job through the SparkSubmit class from a Scala application on my local Windows machine to a remote YARN cluster, but the client always tries to connect to the ResourceManager at 0.0.0.0.
val args = Array(
  "--master", "yarn",
  "--verbose",
  "--class", "application-class",
  "--num-executors", "1",
  "--executor-cores", "1",
  "--executor-memory", "10g",
  "--deploy-mode", "cluster",
  "--driver-memory", "10g",
  "path-to-jar", "1")
SparkSubmit.main(args)
Below is the error
Failed to connect to server: 0.0.0.0/0.0.0.0:8032: retries get failed due to exceeded maximum allowed retries number: 10
When I submit the Spark job through the Command Prompt/Windows shell with the same arguments, it works fine and submits the job to the cluster.
I already have HADOOP_CONF_DIR and YARN_CONF_DIR set as environment variables, and my yarn-site.xml has yarn.resourcemanager.address defined with the remote IP.
Am I missing anything here?
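For reference, a sketch of forcing the ResourceManager address through the submit arguments as well, in case the client is not picking up yarn-site.xml (rm-host:8032 is a placeholder; Spark forwards spark.hadoop.* properties into the Hadoop configuration):
// Sketch only: pass the ResourceManager address explicitly, bypassing whatever
// yarn-site.xml the client may (or may not) be reading.
val args = Array(
  "--master", "yarn",
  "--deploy-mode", "cluster",
  "--conf", "spark.hadoop.yarn.resourcemanager.address=rm-host:8032", // placeholder host
  "--class", "application-class",
  "path-to-jar", "1")
SparkSubmit.main(args)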

Checkpoint data corruption in Spark Streaming

I am testing checkpointing and write ahead logs with the basic Spark streaming code below. I am checkpointing into a local directory. After starting and stopping the application a few times (using Ctrl-C), it refuses to start, with what looks like data corruption in the checkpoint directory. I am getting:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 80.0 failed 1 times, most recent failure: Lost task 0.0 in stage 80.0 (TID 17, localhost): com.esotericsoftware.kryo.KryoException: Encountered unregistered class ID: 13994
at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:137)
at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:670)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:781)
at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:229)
at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:169)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
Full code:
import org.apache.hadoop.conf.Configuration
import org.apache.spark._
import org.apache.spark.streaming._

object ProtoDemo {
  def createContext(dirName: String) = {
    val conf = new SparkConf().setAppName("mything")
    conf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint(dirName)
    val lines = ssc.socketTextStream("127.0.0.1", 9999)
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)
    val runningCounts = wordCounts.updateStateByKey[Int] {
      (values: Seq[Int], oldValue: Option[Int]) =>
        val s = values.sum
        Some(oldValue.fold(s)(_ + s))
    }
    // Print the first ten elements of each RDD generated in this DStream to the console
    runningCounts.print()
    ssc
  }

  def main(args: Array[String]) = {
    val hadoopConf = new Configuration()
    val dirName = "/tmp/chkp"
    val ssc = StreamingContext.getOrCreate(dirName, () => createContext(dirName), hadoopConf)
    ssc.start()
    ssc.awaitTermination()
  }
}
Basically, what you are trying to do is a driver failure scenario. For this to work, depending on the cluster you are running, you have to follow the instructions below to monitor the driver process and relaunch the driver if it fails:
Configuring automatic restart of the application driver - To automatically recover from a driver failure, the deployment infrastructure that is used to run the streaming application must monitor the driver process and relaunch the driver if it fails. Different cluster managers have different tools to achieve this.
Spark Standalone - A Spark application driver can be submitted to run within the Spark Standalone cluster (see cluster deploy mode), that is, the application driver itself runs on one of the worker nodes. Furthermore, the Standalone cluster manager can be instructed to supervise the driver, and relaunch it if the driver fails either due to non-zero exit code, or due to failure of the node running the driver. See cluster mode and supervise in the Spark Standalone guide for more details.
YARN - Yarn supports a similar mechanism for automatically restarting an application. Please refer to YARN documentation for more details.
Mesos - Marathon has been used to achieve this with Mesos.
You also need to configure write ahead logs as below; there are special instructions for S3 which you need to follow.
While using S3 (or any file system that does not support flushing) for write ahead logs, please remember to enable
spark.streaming.driver.writeAheadLog.closeFileAfterWrite
spark.streaming.receiver.writeAheadLog.closeFileAfterWrite
See Spark Streaming Configuration for more details.
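A minimal sketch of those settings, applied where createContext builds its SparkConf (assuming the write ahead log lives on S3 or another filesystem without flush support):
// Sketch: WAL settings from above, set before the StreamingContext is created.
val conf = new SparkConf().setAppName("mything")
conf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
conf.set("spark.streaming.driver.writeAheadLog.closeFileAfterWrite", "true")
conf.set("spark.streaming.receiver.writeAheadLog.closeFileAfterWrite", "true")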
The issue looks more like a Kryo serializer issue than checkpoint corruption.
In the code example (including the GitHub project), Kryo serialization is not configured.
Since it is not configured, a KryoException should not be possible.
When using write ahead logs and restoring from a checkpoint directory, all the Spark configuration is taken from there.
In your example, the createContext method is not called when starting from the checkpoint.
I assume the issue is that another application was tested before with the same checkpoint directory, where the Kryo serializer was configured,
and the current application fails to restore from that checkpoint.
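If that is the cause, one option is simply to clear /tmp/chkp and restart; another is to make the serializer explicit in createContext so every run that writes or restores the checkpoint uses the same configuration. A sketch:
// Sketch: pin the serializer explicitly so the configuration recovered from the
// checkpoint matches what the current application expects.
val conf = new SparkConf()
  .setAppName("mything")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")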