How to stream a single Kafka topic, filtered by key, into multiple HDFS locations? - scala

I am not able to stream my data, filtered by key, to multiple HDFS locations. The code below is not working. Please help me find the correct way to write it:
val ER_stream_V1 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", configManager.getString("Kafka.Server"))
.option("subscribe", "Topic1")
.option("startingOffsets", "latest")
.option("failOnDataLoss", "false")
.load()
val ER_stream_V2 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", configManager.getString("Kafka.Server"))
.option("subscribe", "Topic1")
.option("startingOffsets", "latest")
.option("failOnDataLoss", "false")
.load()
ER_stream_V1.toDF()
.select(col("key"), col("value").cast("string"))
.filter(col("key")==="Value1")
.select(functions.from_json(col("value").cast("string"), Value1Schema.schemaExecution).as("value")).select("value.*")
.writeStream
.format("orc")
.option("metastoreUri", configManager.getString("spark.datasource.hive.warehouse.metastoreUri"))
.option("checkpointLocation", "/tmp/teststreaming/execution/checkpoint2005")
.option("path", "/tmp/test/value1")
.trigger(Trigger.ProcessingTime("5 Seconds"))
.partitionBy("jobid")
.start()
ER_stream_V2.toDF()
.select(col("key"), col("value").cast("string"))
.filter(col("key")==="Value2")
.select(functions.from_json(col("value").cast("string"), Value2Schema.schemaJobParameters).as("value"))
.select("value.*")
.writeStream
.format("orc")
.option("metastoreUri", configManager.getString("spark.datasource.hive.warehouse.metastoreUri"))
.option("checkpointLocation", "/tmp/teststreaming/jobparameters/checkpoint2006")
.option("path", "/tmp/test/value2")
.trigger(Trigger.ProcessingTime("5 Seconds"))
.partitionBy("jobid")
.start()

You should not need two readers. Create one and filter it twice. You might also want to consider setting startingOffsets to earliest so that existing topic data is read.
For example:
val ER_stream = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", configManager.getString("Kafka.Server"))
.option("subscribe", "Topic1")
.option("startingOffsets", "latest") // maybe change?
.option("failOnDataLoss", "false")
.load()
.toDF()
.select(col("key").cast("string").as("key"), col("value").cast("string"))
val value1Stream = ER_stream
.filter(col("key") === "Value1")
.select(functions.from_json(col("value"), Value1Schema.schemaExecution).as("value"))
.select("value.*")
val value2Stream = ER_stream
.filter(col("key") === "Value2")
.select(functions.from_json(col("value"), Value2Schema.schemaJobParameters).as("value"))
.select("value.*")
value1Stream.writeStream.format("orc")
...
.start()
value2Stream.writeStream.format("orc")
...
.start()
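For completeness, the two sinks might look like the following (a sketch that reuses the paths, trigger, and partitioning from the question; note that each query needs its own checkpointLocation):
value1Stream.writeStream
.format("orc")
.option("metastoreUri", configManager.getString("spark.datasource.hive.warehouse.metastoreUri"))
.option("checkpointLocation", "/tmp/teststreaming/execution/checkpoint2005")
.option("path", "/tmp/test/value1")
.trigger(Trigger.ProcessingTime("5 Seconds"))
.partitionBy("jobid")
.start()
value2Stream.writeStream
.format("orc")
.option("metastoreUri", configManager.getString("spark.datasource.hive.warehouse.metastoreUri"))
.option("checkpointLocation", "/tmp/teststreaming/jobparameters/checkpoint2006")
.option("path", "/tmp/test/value2")
.trigger(Trigger.ProcessingTime("5 Seconds"))
.partitionBy("jobid")
.start()
// Keep the application alive while both queries run.
spark.streams.awaitAnyTermination()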

Related

How do I use functions.from_csv with a Spark structured stream?

I read lines from a Kafka source and I want to build a Kafka consumer in Spark Structured Streaming.
I know how to tell Spark that the incoming lines are JSON; how do I do the same with from_csv?
val lines = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "topic2")
.option("startingOffsets", "earliest")
.load()
.selectExpr("CAST(value AS STRING)")
.select(functions.from_json($"value", retailDataSchema).as("data"))
lines.printSchema()
The schema is:
val retailDataSchema = new StructType()
.add("InvoiceNo", IntegerType)
.add("Quantity", IntegerType)
.add("Country", StringType)
Thank you!
The input data is comma-separated text (sample not shown).
You could use this workaround:
val lines = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "topic2")
.option("startingOffsets", "earliest")
.load()
.selectExpr("CAST(value AS STRING)")
.selectExpr(
"cast(split(value, ',')[0] as int) as InvoiceNo",
"cast(split(value, ',')[1] as int) as Quantity",
"cast(split(value, ',')[2] as string) as Country")
lines.printSchema()
Or, since Apache Spark 3.0.0, you could use the built-in from_csv function:
val lines = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "topic2")
.option("startingOffsets", "earliest")
.load()
.selectExpr("CAST(value AS STRING)")
.select(functions.from_csv($"value", retailDataSchema, Map.empty[String, String]).as("data"))
lines.printSchema()
Apache Spark Docs for from_csv built-in function
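The third argument of from_csv in the Scala API is an options map; it accepts the same options as the CSV data source, which helps if the data is not plain comma-separated. A rough sketch reusing the setup above (the option values here are just examples):
val parsedLines = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "topic2")
.option("startingOffsets", "earliest")
.load()
.selectExpr("CAST(value AS STRING)")
.select(functions.from_csv($"value", retailDataSchema, Map("sep" -> ",", "nullValue" -> "")).as("data"))
.select("data.*")
parsedLines.printSchema()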

How to ingest data from two producers in kafka and join using spark structured streaming?

I am trying to read data from two Kafka topics, but I am unable to join them and get the final dataframe.
My Kafka topics are CSVStreamRetail and OrderItems.
val spark = SparkSession
.builder
.appName("Spark-Stream-Example")
.master("local[*]")
.config("spark.sql.warehouse.dir", "file:///C:/temp")
.getOrCreate()
val ordersSchema = new StructType()
.add("order_id", IntegerType)
.add("order_date", StringType)
.add("order_customer_id", IntegerType)
.add("order_status", StringType)
val orderItemsSchema = new StructType()
.add("order_item_id",IntegerType)
.add("order_item_order_id",IntegerType)
.add("order_item_product_id",IntegerType)
.add("order_item_quantity",IntegerType)
.add("order_item_subtotal",DoubleType)
.add("order_item_product_price", DoubleType)
import spark.implicits._
val df1 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "CSVStreamRetail")
.load()
val df2 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "OrderItems")
.load()
val ordersDF = df1.selectExpr("CAST(value AS STRING)", "CAST(timestamp as TIMESTAMP)").as[(String,Timestamp)]
.select(from_json($"value", ordersSchema).as("orders_data"),$"timestamp")
.select("orders_data.*","timestamp")
val orderItemsDF = df2.selectExpr("CAST(value as STRING)", "CAST(timestamp as TIMESTAMP)").as[(String,Timestamp)]
.select(from_json($"value",orderItemsSchema).as("order_items_data"),$"timestamp")
.select("order_items_data.*","timestamp")
val finalDF = orderItemsDF.join(ordersDF, orderItemsDF("order_item_order_id")===ordersDF("order_id"))
finalDF
.writeStream
.format("console")
.option("truncate", "false")
.start()
.awaitTermination()
The output I am receiving is an empty dataframe.
First of all, please check whether you are actually receiving data in your Kafka topics.
In a stream-stream join you should always provide a watermark on at least one of the streams. I see you want to perform an inner join.
So I have added a 200-second watermark on both streams, and the output dataframe now shows data.
val spark = SparkSession
.builder
.appName("Spark-Stream-Example")
.master("local[*]")
.config("spark.sql.warehouse.dir", "file:///C:/temp")
.getOrCreate()
val ordersSchema = new StructType()
.add("order_id", IntegerType)
.add("order_date", StringType)
.add("order_customer_id", IntegerType)
.add("order_status", StringType)
val orderItemsSchema = new StructType()
.add("order_item_id",IntegerType)
.add("order_item_order_id",IntegerType)
.add("order_item_product_id",IntegerType)
.add("order_item_quantity",IntegerType)
.add("order_item_subtotal",DoubleType)
.add("order_item_product_price", DoubleType)
import spark.implicits._
val df1 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "CSVStreamRetail")
.load()
val df2 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "OrderItems")
.load()
val ordersDF = df1.selectExpr("CAST(value AS STRING)", "CAST(timestamp as TIMESTAMP)").as[(String,Timestamp)]
.select(from_json($"value", ordersSchema).as("orders_data"),$"timestamp")
.select("orders_data.*","timestamp")
.withWatermark("timestamp","200 seconds")
val orderItemsDF = df2.selectExpr("CAST(value as STRING)", "CAST(timestamp as TIMESTAMP)").as[(String,Timestamp)]
.select(from_json($"value",orderItemsSchema).as("order_items_data"),$"timestamp")
.select("order_items_data.*","timestamp")
.withWatermark("timestamp","200 seconds")
val finalDF = orderItemsDF.join(ordersDF, orderItemsDF("order_item_order_id")===ordersDF("order_id"))
finalDF
.writeStream
.format("console")
.option("truncate", "false")
.start()
.awaitTermination()
Use the event timestamp for joining.
Let me know if this helps.
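As a refinement, with watermarks on both streams you can also constrain how far apart the two event times may be in the join condition, so Spark can clean up old join state. A rough sketch of that variant (the event-time columns are renamed to order_ts and item_ts so they can be referenced in the condition; the 200-second window simply mirrors the watermark and may need tuning):
import org.apache.spark.sql.functions.expr
val ordersWithTime = df1.selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP) AS order_ts")
.select(from_json($"value", ordersSchema).as("orders_data"), $"order_ts")
.select("orders_data.*", "order_ts")
.withWatermark("order_ts", "200 seconds")
val orderItemsWithTime = df2.selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP) AS item_ts")
.select(from_json($"value", orderItemsSchema).as("order_items_data"), $"item_ts")
.select("order_items_data.*", "item_ts")
.withWatermark("item_ts", "200 seconds")
// Inner join on the key plus an event-time range, so old state can be dropped.
val finalDF = orderItemsWithTime.join(
ordersWithTime,
expr("order_item_order_id = order_id AND item_ts >= order_ts - interval 200 seconds AND item_ts <= order_ts + interval 200 seconds"))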

Why does streaming query not write any data to HDFS?

I'm using Spark Structured Streaming with Spark 2.3.1 and below is my code:
val sparkSession = SparkSession
.builder
.appName("xxx")
.config("spark.serializer",
"org.apache.spark.serializer.KryoSerializer")
.config("spark.rpc.netty.dispatcher.numThreads", "2")
.config("spark.shuffle.compress", "true")
.config("spark.rdd.compress", "true")
.config("spark.sql.inMemoryColumnarStorage.compressed", "true")
.config("spark.io.compression.codec", "snappy")
.config("spark.broadcast.compress", "true")
.config("spark.sql.hive.thriftServer.singleSession", "true")
.config("hive.exec.dynamic.partition", "true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.config("spark.streaming.receiver.writeAheadLog.enable","true")
.enableHiveSupport()
.getOrCreate()
val rawStreamDF = sparkSession
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", <value>)
.option("subscribe", <value>)
.option("key.serializer", <value>)
.option("value.serializer", <value>)
.option("startingOffsets", "earliest")
.option("auto.offset.reset",earliest)
.option("group.id", <value>)
.option("fetchOffset.numRetries", 3)
.option("fetchOffset.retryIntervalMs", 10)
.option("IncludeTimestamp", true)
.option("enable.auto.commit", <value>)
.option("security.protocol", <value>)
.option("ssl.keystore.location", <value>)
.option("ssl.keystore.password", <value>)
.option("ssl.truststore.location", <value>)
.option("ssl.truststore.password", <value>)
.load()
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
I'm trying to write the data to a file in the hdfs_path:
val query = rawStreamDF
.writeStream
.format("json")
.option("startingOffsets", "latest")
.option("path", "STREAM_DATA_PATH")
.option("checkpointLocation", "checkpointPath")
.trigger(Trigger.ProcessingTime("5 seconds"))
.outputMode("append")
.start
Logger.log.info("Status:"+query.status)
print("Streaming Status1:"+query.status)
query.awaitTermination(450)
But, I'm getting query.status value as below:
Status:{ "message" : "Initializing sources", "isDataAvailable" : false, "isTriggerActive" : false }
Could you let me know where I'm going wrong?
Everything seems fine. At the point where you print query.status, the Spark Structured Streaming engine has not actually started the query yet; start() only marks it to be started on a separate thread.
If you monitored the query from a separate thread, you would see the status change right after the very first batch has been processed. Note also that awaitTermination(450) waits at most 450 milliseconds, which is usually not long enough for the first batch to complete.
Consult the Structured Streaming Programming Guide in the official documentation.
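If you want to watch the query from code, a StreamingQueryListener is a convenient alternative to a hand-rolled monitoring thread; a minimal sketch (register it on the session before calling start()):
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}
sparkSession.streams.addListener(new StreamingQueryListener {
override def onQueryStarted(event: QueryStartedEvent): Unit =
println(s"Query started: ${event.id}")
override def onQueryProgress(event: QueryProgressEvent): Unit =
println(s"Batch ${event.progress.batchId}: ${event.progress.numInputRows} input rows")
override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
println(s"Query terminated: ${event.id}")
})
Once the first micro-batch completes, onQueryProgress fires and query.status changes accordingly.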

org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received

kafka: kafka_2.11-0.10.2.1
scala:2.12
val TOPIC_EVENT_XXX = "EVENT.xxx.ALL"
import org.apache.spark.sql.Encoders
val schema = Encoders.bean(classOf[Event]).schema
val allEventsDF = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "Streaming01.simon.com:9090,Streaming02.simon.com:9090,Streaming03.simon.com:9090,Streaming04.simon.com:9090")
.option("subscribe", TOPIC_EVENT_XXX)
.option("startingOffsets", "latest")
.option("maxOffsetsPerTrigger", 5000)//old,1000000
.load()
.select(from_json(col("value").cast("string"), schema).alias("parsed_value"))
.selectExpr("parsed_value.*")
val KAFKA_BOOTSTRAP_SERVERS = "Streaming01.simon.com:9090,Streaming02.simon.com:9090,Streaming03.simon.com:9090,Streaming04.simon.com:9090,Ingest01.simon.com:9090,Ingest02.simon.com:9090,Notify01.simon.com:9090,Notify02.simon.com:9090,Serving01.simon.com:9090,Serving02.simon.com:9090,"
var waybillStatesKafkaSinkQuery = waybillStates.selectExpr("to_json(struct(*)) AS value")
.writeStream
.outputMode("append")
.format("kafka") // can be "orc", "json", "csv",memory,console etc.
.option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS)
.option("topic", TOPIC_TIMECHAIN_WAYBILL) //TIMECHAIN.WAYBILL.ALL //TIMECHAIN.WAYBILL.TL //TOPIC_TIMECHAIN_WAYBILL
.option("checkpointLocation", CHECKPOINT_PATH_TL_EVENT_WAYBILL_STATES)
.option("kafka.max.request.size", "164217728")//134217728//209715200
.option("kafka.buffer.memory", "164217728")
.option("kafka.timeout.ms",180000)
.option("kafka.request.timeout.ms",180000)
.option("kafka.session.timeout.ms",180000)
.option("kafka.heartbeat.interval.ms",120000)
.option("kafka.retries",100)
.option("failOnDataLoss","false")//后添加的【2018-07-11】
.start()
The following error occurred while running the above program:
org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.

Queries with streaming sources must be executed with writeStream.start();

I'm trying to read messages from Kafka (version 10) in Spark and print them.
import spark.implicits._
val spark = SparkSession
.builder
.appName("StructuredNetworkWordCount")
.config("spark.master", "local")
.getOrCreate()
val ds1 = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "topicA")
.load()
ds1.collect.foreach(println)
ds1.writeStream
.format("console")
.start()
ds1.printSchema()
I am getting the error: Exception in thread "main"
org.apache.spark.sql.AnalysisException: Queries with streaming sources
must be executed with writeStream.start();;
You are branching the query plan: from the same ds1 you are trying to run both
ds1.collect.foreach(...)
ds1.writeStream.format(...){...}
but you only call .start() on the second branch, leaving the first one dangling without a termination, which in turn throws the exception you are getting back.
The solution is to start both branches and await their termination. Note that collect is not supported on a streaming Dataset, so the first branch also has to go through a streaming sink.
val ds1 = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "topicA")
.load()
// collect() cannot be used on a streaming Dataset, so print the rows through a console sink instead.
val query1 = ds1.writeStream
.format("console")
.start()
val query2 = ds1.writeStream
.format("console")
.start()
ds1.printSchema()
query1.awaitTermination()
query2.awaitTermination()
I struggled a lot with this issue and tried every solution suggested on various blogs.
In my case there were a few statements between calling start() on the query and finally calling awaitTermination(), and that is what caused this.
Please try it in this fashion; it works perfectly for me.
Working example:
val query = df.writeStream
.outputMode("append")
.format("console")
.start().awaitTermination();
If you write it this way, it will cause an exception/error:
val query = df.writeStream
.outputMode("append")
.format("console")
.start()
// some statement
// some statement
query.awaitTermination();
This will throw the given exception and close your streaming driver.
I fixed the issue by using the following code.
val df = session
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", brokers)
.option("subscribe", "streamTest2")
.load();
val query = df.writeStream
.outputMode("append")
.format("console")
.start()
query.awaitTermination()
Kindly remove ds1.collect.foreach(println) and ds1.printSchema(). Use outputMode and awaitAnyTermination for the background process, which waits until any of the queries on the associated spark.streams has terminated.
val spark = SparkSession
.builder
.appName("StructuredNetworkWordCount")
.config("spark.master", "local[*]")
.getOrCreate()
val ds1 = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "topicA") .load()
val consoleOutput1 = ds1.writeStream
.outputMode("update")
.format("console")
.start()
spark.streams.awaitAnyTermination()
+---+-----+-----+---------+------+
|key|value|topic|partition|offset|
+---+-----+-----+---------+------+
+---+-----+-----+---------+------+
I was able to resolve this issue with the following code. In my scenario, I had multiple intermediate DataFrames, which were basically transformations applied to the inputDF.
val query = joinedDF
.writeStream
.format("console")
.option("truncate", "false")
.outputMode(OutputMode.Complete())
.start()
.awaitTermination()
joinedDF is the result of the last transformation performed.
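A minimal end-to-end sketch of that pattern (the topic name and columns are hypothetical, and a groupBy/count stands in for the actual transformations, since Complete output mode requires an aggregation):
import org.apache.spark.sql.streaming.OutputMode
val inputDF = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "topicA")
.load()
// Intermediate streaming DataFrames are fine; only the final one is started.
val parsedDF = inputDF.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
val aggregatedDF = parsedDF.groupBy("key").count()
aggregatedDF.writeStream
.format("console")
.option("truncate", "false")
.outputMode(OutputMode.Complete())
.start()
.awaitTermination()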