Spark Streaming Kafka Timeout - scala

I am trying to run a simple Spark + Kafka integration example on Amazon EMR with spark-shell, but I keep getting timeout errors. However, when I publish with the plain org.apache.kafka client and the same settings as below, it works without failure.
Timeout error:
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.
I moved client.truststore.jks and client.keystore.p12 to HDFS and ran the following:
$ spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0
import org.apache.spark.sql.functions.col
val kafkaOptions = Map(
  "kafka.bootstrap.servers" -> s"$host:$port",
  "kafka.security.protocol" -> "SSL",
  "kafka.ssl.endpoint.identification.algorithm" -> "",
  "kafka.ssl.truststore.location" -> "/home/hadoop/client.truststore.jks",
  "kafka.ssl.truststore.password" -> "password",
  "kafka.ssl.keystore.type" -> "PKCS12",
  "kafka.ssl.key.password" -> "password",
  "kafka.ssl.keystore.location" -> "/home/hadoop/client.keystore.p12",
  "kafka.ssl.keystore.password" -> "password"
)
val df = spark
  .read
  .option("header", true)
  .option("escape", "\"")
  .csv("s3://bucket/file.csv")

val publishToKafkaDf = df.withColumn("value", col("body"))

publishToKafkaDf
  .selectExpr("CAST(value AS STRING)")
  .write
  .format("kafka")
  .option("topic", "test-topic")
  .options(kafkaOptions)
  .save()

Solved: it was an AWS security group outbound issue with the worker nodes.
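For anyone hitting the same symptom, one quick way to confirm that the executors (not just the driver) can actually reach the broker after adjusting the security group is a throwaway connectivity check run inside a Spark job. This is a minimal diagnostic sketch only; canReach is a hypothetical helper, and host and port are the same values used in kafkaOptions above:
import java.net.{InetSocketAddress, Socket}

// Hypothetical diagnostic: returns true if a TCP connection to the broker
// can be opened within the timeout. Mapping it over an RDD makes the check
// run on the worker nodes rather than only on the driver.
def canReach(brokerHost: String, brokerPort: Int, timeoutMs: Int = 5000): Boolean = {
  val socket = new Socket()
  try { socket.connect(new InetSocketAddress(brokerHost, brokerPort), timeoutMs); true }
  catch { case _: Exception => false }
  finally { socket.close() }
}

val reachableFromWorkers = spark.sparkContext
  .parallelize(1 to 10)
  .map(_ => canReach(host, port.toString.toInt))
  .collect()
  .forall(identity)
If reachableFromWorkers is false, the problem is still network-level (security groups, routing), not the Kafka client configuration.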

Related

Where to write HDFS data such that it can be read with Hive

Given that I write to HDFS with Apache Spark like this:
import org.apache.spark.sql.functions
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

var df = spark.readStream
  .format("kafka")
  //.option("kafka.bootstrap.servers", "kafka1:19092")
  .option("kafka.bootstrap.servers", "localhost:29092")
  .option("subscribe", "my_event")
  .option("includeHeaders", "true")
  .option("startingOffsets", "earliest")
  .load()

df = df.selectExpr("CAST(topic AS STRING)", "CAST(partition AS STRING)", "CAST(offset AS STRING)", "CAST(value AS STRING)")

val emp_schema = new StructType()
  .add("id", StringType, true)
  .add("timestamp", TimestampType, true)

df = df.select(
  functions.col("topic"),
  functions.col("partition"),
  functions.col("offset"),
  functions.from_json(functions.col("value"), emp_schema).alias("data"))

df = df.select("topic", "partition", "offset", "data.*")

val query = df.writeStream
  .format("csv")
  .option("path", "hdfs://172.30.0.5:8020/test")
  .option("checkpointLocation", "checkpoint")
  .start()

query.awaitTermination()
Here hdfs://172.30.0.5:8020 is the namenode. It seems this Spark program is writing data successfully to the namenode.
How can I query this data from Hive? Do I have to write the data into a special folder so that Hive can see it? Must I define a database for this folder? And how is this done? Where is the location of test then on the file-system?
Where is the location of test then on the file-system?
It's at /test
Note: if you properly configure fs.defaultFS in the core-site.xml, then you don't need to specify the full namenode address.
Do I have to write the data into a special folder that hive can see it?
You can, and that would be easiest, but the docs cover both kinds of Hive table: "managed" (a dedicated HDFS location) and "external" (any other directory, with some restrictions).
https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html
How can I query this data from hive?
See above link.
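For a concrete starting point, here is a minimal sketch that registers the /test directory as an external Hive table, which you can then query from Hive or Spark SQL. The table name my_events, the column types, and the delimiter are assumptions based on the streaming code above; adjust them to match what the CSV sink actually wrote:
// Assumes the SparkSession was created with .enableHiveSupport() and that the
// CSV sink wrote comma-delimited files without a header. The table name and
// the all-STRING column types are illustrative only.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS my_events (
    topic STRING,
    `partition` STRING,
    `offset` STRING,
    id STRING,
    `timestamp` STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION 'hdfs://172.30.0.5:8020/test'
""")
After that, SELECT * FROM my_events should work from both Hive and spark.sql.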
FWIW, Confluent has a Kafka connector that can write data to HDFS and create Hive tables.

How to parse confluent avro messages in Spark

Currently I am using the Abris library to deserialize Confluent Avro messages coming from Kafka. It works well when the topic only contains messages with one version of the schema, but as soon as the topic has data with different schema versions it starts giving me a "malformed data found" error, which is expected because while creating the config I am passing SchemaManager.PARAM_VALUE_SCHEMA_ID -> "latest".
But my question is how to know the schema ID at run time, basically for each record, and then pass it to the Abris config. Here is the sample code:
Spark version: 2.4.0
Scala: 2.11.12
Abris: 5.0.0
def getTopicSchemaMap(topicNm: String): Map[String, String] = {
  Map(
    SchemaManager.PARAM_SCHEMA_REGISTRY_TOPIC -> topicNm,
    SchemaManager.PARAM_SCHEMA_REGISTRY_URL -> schemaRegUrl,
    SchemaManager.PARAM_VALUE_SCHEMA_NAMING_STRATEGY -> "topic.name",
    SchemaManager.PARAM_VALUE_SCHEMA_ID -> "latest")
}

val kafkaDataFrameRaw = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaUrl)
  .option("subscribe", topics)
  .option("maxOffsetsPerTrigger", maxOffsetsPerTrigger)
  .option("startingOffsets", "earliest")
  .option("failOnDataLoss", false)
  .load()

val df = kafkaDataFrameRaw.select(
  from_confluent_avro(col("value"), getTopicSchemaMap(topicNm)) as 'value,
  col("offset").as("offsets"))
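One way to see which schema each record was written with, before calling Abris at all, relies on Confluent's wire format: every serialized value starts with a magic byte (0) followed by the 4-byte schema ID. A minimal sketch follows; the schemaIdOf UDF is a hypothetical helper, not part of Abris:
import java.nio.ByteBuffer
import org.apache.spark.sql.functions.{col, udf}

// Extract the Confluent schema ID from the payload prefix:
// byte 0 is the magic byte (0), bytes 1-4 are the schema ID as a big-endian int.
val schemaIdOf = udf { bytes: Array[Byte] =>
  if (bytes != null && bytes.length >= 5 && bytes(0) == 0.toByte)
    ByteBuffer.wrap(bytes, 1, 4).getInt
  else
    -1
}

val withSchemaId = kafkaDataFrameRaw.withColumn("schema_id", schemaIdOf(col("value")))
Grouping or filtering on schema_id then lets you route each subset of records to a matching Abris config instead of always using "latest".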

Can we fetch data from Kafka from specific offset in spark structured streaming batch mode

In Kafka I get new topics dynamically, and I have to process them using Spark streaming from a specific offset. Is there a way to pass the JSON value from a variable? For example, consider the code below:
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")
  .option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""")
  .load()
Here I want to update the value for startingOffsets dynamically. I tried to put the value in a string variable and pass that, but it did not work, even though passing the same value inline for startingOffsets works. How do I use a variable in this scenario?
val start_offset = """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}"""

val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")
  .option("startingOffsets", start_offset)
  .load()
java.lang.IllegalArgumentException: Expected e.g. {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}, got """{"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}"""
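The literal triple quotes in the "got ..." part of the message suggest the quote characters themselves ended up inside the option value. Building the JSON from plain strings avoids that; here is a minimal sketch, where the topic name and offsets are placeholders for whatever you discover at runtime:
// Assemble the startingOffsets JSON from variables instead of hard-coding it.
val topic = "topic1"
val partitionOffsets = Map(0 -> 23L, 1 -> -2L) // -2 means "earliest" for that partition

val startOffsetsJson = partitionOffsets
  .map { case (p, o) => s""""$p": $o""" }
  .mkString(s"""{"$topic": {""", ", ", "}}")
// produces: {"topic1": {"0": 23, "1": -2}}

val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", topic)
  .option("startingOffsets", startOffsetsJson)
  .load()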
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

def main(args: Array[String]) {
  val conf = new SparkConf().setMaster("local[*]").setAppName("ReadSpecificOffsetFromKafka")
  val spark = SparkSession.builder().config(conf).getOrCreate()
  spark.sparkContext.setLogLevel("error")
  import spark.implicits._

  val start_offset = """{"first_topic" : {"0" : 15, "1": -2, "2": 6}}"""

  val fromKafka = spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092, localhost:9093")
    .option("subscribe", "first_topic")
    // .option("startingOffsets", "earliest")
    .option("startingOffsets", start_offset)
    .load()

  val selectedValues = fromKafka.selectExpr("cast(value as string)", "cast(partition as integer)")

  selectedValues.writeStream
    .format("console")
    .outputMode("append")
    // .trigger(Trigger.Continuous("3 seconds"))
    .start()
    .awaitTermination()
}
This is the exact code to fetch a specific offset from Kafka using Spark Structured Streaming and Scala.
Looks like your job is checkpointing the Kafka offsets onto some persistent storage. Try cleaning those and re-run your job.
Also try renaming your job and running it again.
Spark can read the stream via readStream, so try it with offsets in the format displayed in the error message to get rid of the error.
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribePattern", "topic.*")
  .option("startingOffsets", """{"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}""")
  .load()

Spark Streaming - Join on multiple kafka stream operation is slow

I have 3 Kafka streams with 600k+ records each, and Spark streaming takes more than 10 minutes to process simple joins between the streams.
Spark Cluster config:
This is how I'm reading the Kafka streams into temp views in Spark (Scala):
spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "KAFKASERVER")
  .option("subscribe", TOPIC1)
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .load()
  .selectExpr("CAST(value AS STRING) as json")
  .select(from_json($"json", schema = SCHEMA1).as("data"))
  .select($"COL1", $"COL2")
  .createOrReplaceTempView("TABLE1")
I join the 3 tables using Spark SQL:
select COL1, COL2 from TABLE1
JOIN TABLE2 ON TABLE1.PK = TABLE2.PK
JOIN TABLE3 ON TABLE2.PK = TABLE3.PK
Execution of Job:
Am I missing some configuration on Spark that I have to look into?
I ran into the same problem. I found that a stream-to-stream join needs more memory than I imagined, and the problem disappeared when I increased the cores per executor.
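For illustration, these are the kinds of knobs meant by "increase the cores per executor". A minimal sketch with placeholder values that you would tune for your own cluster:
import org.apache.spark.sql.SparkSession

// Placeholder resource settings; stream-to-stream joins shuffle a lot of state,
// so more cores and memory per executor plus a sensible shuffle-partition count
// usually help. The numbers below are illustrative only.
val spark = SparkSession.builder()
  .appName("KafkaStreamJoins")
  .config("spark.executor.cores", "4")
  .config("spark.executor.memory", "8g")
  .config("spark.sql.shuffle.partitions", "200")
  .getOrCreate()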
Unfortunately there wasn't any test data, nor the result data you expected, for me to play with, so I cannot give an exact answer.
#Asteroid's comment is valid, as we see the number of tasks for each stage is 1. Normally a Kafka stream uses a receiver to consume the topic, and each receiver only creates one task. One approach is to use multiple receivers / split partitions / increase your resources (number of cores) to increase parallelism.
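If you stay on the DataFrame Kafka source from the question instead of switching to DStreams, one hedged option is the Kafka source's minPartitions option (available in newer Spark versions, roughly 2.4+; check your version's Kafka integration guide), which splits Kafka partitions into smaller offset ranges so more tasks read in parallel:
// Ask the Kafka source for at least 24 input partitions, even if the topic
// itself has fewer; the value 24 is a placeholder to illustrate the option.
spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "KAFKASERVER")
  .option("subscribe", TOPIC1)
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .option("minPartitions", "24")
  .load()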
If none of that helps, another way is to use the Kafka API createDirectStream. According to the documentation https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/streaming/kafka/KafkaUtils.html, this creates an input stream that directly pulls messages from the Kafka brokers without using any receiver.
I have crafted a preliminary code sample for creating a direct stream below. You might want to study this and customize it to your own preference.
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "KAFKASERVER",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "use_a_separate_group_id_for_each_stream",
  // the direct stream takes plain Kafka consumer configs, not Spark source options
  "auto.offset.reset" -> "earliest"
)

val topics = Array(TOPIC1)
val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)

val schema = StructType(Seq(StructField("data", StringType, nullable = true)))

// Accumulate each micro-batch into a growing DataFrame and re-register the view.
var df = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

val dstream = stream.map(_.value())
dstream.foreachRDD { (rdd: RDD[String], time: Time) =>
  val tdf = spark.read.schema(schema).json(rdd)
  df = df.union(tdf)
  df.createOrReplaceTempView("TABLE1")
}
Some related materials:
https://mapr.com/blog/monitoring-real-time-uber-data-using-spark-machine-learning-streaming-and-kafka-api-part-2/ (Scroll down to Kafka Consumer Code portion. The other section is irrelevant)
https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html (Spark Doc for create direct stream)
Good luck!

Spark - Is not stopping Spark Stream that consumes a Kafka topic

I'm trying to write a test for a Spark streaming example that consumes data from Kafka. I'm using EmbeddedKafka for this.
implicit val config = EmbeddedKafkaConfig(kafkaPort = 12345)
EmbeddedKafka.start()
EmbeddedKafka.createCustomTopic(topic)
println(s"Kafka Running ${EmbeddedKafka.isRunning}")

val spark = SparkSession.builder.appName("StructuredStreaming").master("local[2]").getOrCreate
import spark.implicits._

val df = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:12345")
  .option("subscribe", topic)
  .load()

// pushing data to kafka
vfes.foreach(e => {
  val json = ...
  EmbeddedKafka.publishStringMessageToKafka(topic, json)
})

val query = df.selectExpr("CAST(value AS STRING)")
  .as[String]
  .writeStream.format("console")

query.start().awaitTermination()
spark.stop()
EmbeddedKafka.stop()
When I run this, it keeps running and doesn't stop or print anything to the console. I cannot figure out why that is.
I also tried terminating Kafka by calling EmbeddedKafka.stop() before calling stop on the stream.
Try setting a timeout with
query.start().awaitTermination(3000)
where 3000 is in milliseconds.
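For completeness, a minimal sketch of the overall pattern for a test: start the query, bound the wait, then shut everything down explicitly. The 3-second value is only illustrative:
// awaitTermination(timeoutMs) returns once the query stops or the timeout
// elapses, so the test no longer blocks forever.
val streamingQuery = query.start()
streamingQuery.awaitTermination(3000)
streamingQuery.stop()

spark.stop()
EmbeddedKafka.stop()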