How to ingest data from two producers in kafka and join using spark structured streaming? - apache-kafka

I am trying to read data from two Kafka topics, but I am unable to join them and get the final dataframe.
My kafka topics are CSVStreamRetail and OrderItems.
val spark = SparkSession
.builder
.appName("Spark-Stream-Example")
.master("local[*]")
.config("spark.sql.warehouse.dir", "file:///C:/temp")
.getOrCreate()
val ordersSchema = new StructType()
.add("order_id", IntegerType)
.add("order_date", StringType)
.add("order_customer_id", IntegerType)
.add("order_status", StringType)
val orderItemsSchema = new StructType()
.add("order_item_id",IntegerType)
.add("order_item_order_id",IntegerType)
.add("order_item_product_id",IntegerType)
.add("order_item_quantity",IntegerType)
.add("order_item_subtotal",DoubleType)
.add("order_item_product_price", DoubleType)
import spark.implicits._
val df1 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "CSVStreamRetail")
.load()
val df2 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "OrderItems")
.load()
val ordersDF = df1.selectExpr("CAST(value AS STRING)", "CAST(timestamp as TIMESTAMP)").as[(String,Timestamp)]
.select(from_json($"value", ordersSchema).as("orders_data"),$"timestamp")
.select("orders_data.*","timestamp")
val orderItemsDF = df2.selectExpr("CAST(value as STRING)", "CAST(timestamp as TIMESTAMP)").as[(String,Timestamp)]
.select(from_json($"value",orderItemsSchema).as("order_items_data"),$"timestamp")
.select("order_items_data.*","timestamp")
val finalDF = orderItemsDF.join(ordersDF, orderItemsDF("order_item_order_id")===ordersDF("order_id"))
finalDF
.writeStream
.format("console")
.option("truncate", "false")
.start()
.awaitTermination()
The output I am receiving is an empty dataframe.

First of all, check whether you are actually receiving data in your Kafka topics.
For a stream-stream join you should define a watermark on at least one of the streams; I see you want to perform an inner join.
So I have added a 200-second watermark on both sides, and now data shows up in the output dataframe.
val spark = SparkSession
.builder
.appName("Spark-Stream-Example")
.master("local[*]")
.config("spark.sql.warehouse.dir", "file:///C:/temp")
.getOrCreate()
val ordersSchema = new StructType()
.add("order_id", IntegerType)
.add("order_date", StringType)
.add("order_customer_id", IntegerType)
.add("order_status", StringType)
val orderItemsSchema = new StructType()
.add("order_item_id",IntegerType)
.add("order_item_order_id",IntegerType)
.add("order_item_product_id",IntegerType)
.add("order_item_quantity",IntegerType)
.add("order_item_subtotal",DoubleType)
.add("order_item_product_price", DoubleType)
import spark.implicits._
val df1 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "CSVStreamRetail")
.load()
val df2 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "OrderItems")
.load()
val ordersDF = df1.selectExpr("CAST(value AS STRING)", "CAST(timestamp as TIMESTAMP)").as[(String,Timestamp)]
.select(from_json($"value", ordersSchema).as("orders_data"),$"timestamp")
.select("orders_data.*","timestamp")
.withWatermark("timestamp","200 seconds")
val orderItemsDF = df2.selectExpr("CAST(value as STRING)", "CAST(timestamp as TIMESTAMP)").as[(String,Timestamp)]
.select(from_json($"value",orderItemsSchema).as("order_items_data"),$"timestamp")
.select("order_items_data.*","timestamp")
.withWatermark("timestamp","200 seconds")
val finalDF = orderItemsDF.join(ordersDF, orderItemsDF("order_item_order_id")===ordersDF("order_id"))
finalDF
.writeStream
.format("console")
.option("truncate", "false")
.start()
.awaitTermination()
Use the event timestamp (here, the Kafka record timestamp) for the watermark.
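For an inner join the watermarks alone are enough, but you can optionally add an event-time range condition so Spark can drop old join state sooner. A minimal sketch, assuming the event-time columns are renamed before the watermark is applied; parsedOrders and parsedOrderItems are hypothetical names for the parsed DataFrames above, before the withWatermark calls, and the 200-second bound is an assumption:
import org.apache.spark.sql.functions.expr
// Sketch only: rename the event-time columns so the range condition can
// reference both sides unambiguously (parsedOrders/parsedOrderItems are placeholders).
val orders = parsedOrders
.withColumnRenamed("timestamp", "order_ts")
.withWatermark("order_ts", "200 seconds")
val orderItems = parsedOrderItems
.withColumnRenamed("timestamp", "item_ts")
.withWatermark("item_ts", "200 seconds")
val joined = orderItems.join(
orders,
expr("order_item_order_id = order_id AND " +
"item_ts BETWEEN order_ts AND order_ts + interval 200 seconds"))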
Let me know if this helps.

Related

How do I use functions.from_csv in Spark Structured Streaming?

I read lines from a Kafka source and I want to build a Kafka consumer in Spark Structured Streaming.
I know how to tell Spark that the incoming lines are JSON; how do I do the same with from_csv?
val lines = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "topic2")
.option("startingOffsets", "earliest")
.load()
.selectExpr("CAST(value AS STRING)")
.select(functions.from_json($"value", retailDataSchema).as("data"))
lines.printSchema()
The schema is:
val retailDataSchema = new StructType()
.add("InvoiceNo", IntegerType)
.add("Quantity", IntegerType)
.add("Country", StringType)
The input data consists of comma-separated lines matching this schema.
Thank you!
You could use this workaround, splitting the CSV value yourself:
val lines = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "topic2")
.option("startingOffsets", "earliest")
.load()
.selectExpr("CAST(value AS STRING)")
.selectExpr(
"cast(split(value, ',')[0] as int) as InvoiceNo",
"cast(split(value, ',')[1] as int) as Quantity",
"cast(split(value, ',')[2] as string) as Country")
lines.printSchema()
Or, since Apache Spark 3.0.0, you can use the built-in from_csv function:
val lines = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "topic2")
.option("startingOffsets", "earliest")
.load()
.selectExpr("CAST(value AS STRING)")
.select(functions.from_csv($"value", retailDataSchema, Map.empty[String, String]).as("data")) // from_csv takes an options map; pass an empty map if none are needed
lines.printSchema()
Apache Spark Docs for from_csv built-in function
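A short usage sketch (the console sink here is just an illustrative assumption): flatten the parsed struct and stream the columns out.
// Flatten the "data" struct produced by from_csv and print each micro-batch.
val parsed = lines.select("data.*")
parsed.writeStream
.format("console")
.option("truncate", "false")
.start()
.awaitTermination()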

OneHotEncoder with Streaming Dataframe

I would like to apply OneHotEncoder to multiple columns in my streaming dataframe, but I get the following error.
Any suggestions?
Many thanks!
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources
must be executed with writeStream.start();;
CODE:
// Read csv
val Stream = spark.readStream
.format("csv")
.option("header", "true")
.option("delimiter", ";")
.option("header", "true")
.schema(DFschema)
.load("C:/[...]"/
// Kafka
val properties = new Properties()
//val topic = "mongotest"
properties.put("bootstrap.servers", "localhost:9092")
properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
Stream.selectExpr("CAST(Col AS STRING) AS KEY",
"to_json(struct(*)) AS value")
.writeStream
.format("kafka")
.option("topic", "predict")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("checkpointLocation", "C:[...]")
.start()
Subscribe to topic
val lines = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "predict")
.load()
val df = lines
.selectExpr("CAST(value AS STRING)")
val jsons = df.select(from_json($"value", DFschema) as "data").select("data.*")
ETL[...]
Apply the Bucketizer() function to a field:
val Msplits = Array(Double.NegativeInfinity,7, 14, 21, Double.PositiveInfinity)
val bucketizerM = new Bucketizer()
.setInputCol("MEASURE")
.setOutputCol("MEASURE_c")
.setSplits(Msplits)
val bucketedData1 = bucketizerD.transform(out)
val bucketedData2 = bucketizerM.transform(bucketedData1) // Works
Error using OneHotEncoder()
val indexer = new StringIndexer()
.setInputCol("CODE")
.setOutputCol("CODE_index")
val encoder = new OneHotEncoder()
.setInputCol("CODE")
.setOutputCol("CODE_encoded")
val vectorAssembler = new VectorAssembler()
.setInputCols(Array("A","B", "CODE_encoded"))
.setOutputCol("features")
val transformationPipeline = new Pipeline()
.setStages(Array(indexer, encoder, vectorAssembler))
val fittedPipeline = transformationPipeline.fit(bucketedData2) // Doesn't work

How to checkpoint many sources in Spark Streaming

I have many CSV spark.readStream sources in different locations and I have to checkpoint all of them in Scala. I specified a query for every stream, but when I run the job I get this message:
java.lang.IllegalArgumentException: Cannot start query with name "query1" as a query with that name is already active
I solved my problem by creating a separate streaming query for each stream, like this:
val spark = SparkSession
.builder
.appName("test")
.config("spark.local", "local[*]")
.getOrCreate()
spark.sparkContext.setCheckpointDir(path_checkpoint)
val event1 = spark
.readStream
.schema(schema_a)
.option("header", "true")
.option("sep", ",")
.csv(path_a)
val query = event1.writeStream
.outputMode("append")
.format("console")
.start()
spark.streams.awaitAnyTermination()
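For reference, a minimal sketch of the full pattern with two sources (schema_b, path_b, and the query names are placeholder assumptions; schema_a, path_a, and path_checkpoint are from the code above): give each stream its own query name and checkpoint location, then wait for any of them to terminate.
val eventA = spark.readStream
.schema(schema_a)
.option("header", "true")
.option("sep", ",")
.csv(path_a)
val eventB = spark.readStream
.schema(schema_b)
.option("header", "true")
.option("sep", ",")
.csv(path_b)
// Each query gets a distinct name and its own checkpoint directory,
// which avoids the "query with that name is already active" error.
val queryA = eventA.writeStream
.queryName("query_a")
.option("checkpointLocation", path_checkpoint + "/query_a")
.outputMode("append")
.format("console")
.start()
val queryB = eventB.writeStream
.queryName("query_b")
.option("checkpointLocation", path_checkpoint + "/query_b")
.outputMode("append")
.format("console")
.start()
spark.streams.awaitAnyTermination()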

Why does streaming query not write any data to HDFS?

I'm using Spark Structured Streaming with Spark 2.3.1 and below is my code:
val sparkSession = SparkSession
.builder
.appName("xxx")
.config("spark.serializer",
"org.apache.spark.serializer.KryoSerializer")
.config("spark.rpc.netty.dispatcher.numThreads", "2")
.config("spark.shuffle.compress", "true")
.config("spark.rdd.compress", "true")
.config("spark.sql.inMemoryColumnarStorage.compressed", "true")
.config("spark.io.compression.codec", "snappy")
.config("spark.broadcast.compress", "true")
.config("spark.sql.hive.thriftServer.singleSession", "true")
.config("hive.exec.dynamic.partition", "true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.config("spark.streaming.receiver.writeAheadLog.enable","true")
.enableHiveSupport()
.getOrCreate()
val rawStreamDF = sparkSession
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", <value>)
.option("subscribe", <value>)
.option("key.serializer", <value>)
.option("value.serializer", <value>)
.option("startingOffsets", "earliest")
.option("auto.offset.reset",earliest)
.option("group.id", <value>)
.option("fetchOffset.numRetries", 3)
.option("fetchOffset.retryIntervalMs", 10)
.option("IncludeTimestamp", true)
.option("enable.auto.commit", <value>)
.option("security.protocol", <value>)
.option("ssl.keystore.location", <value>)
.option("ssl.keystore.password", <value>)
.option("ssl.truststore.location", <value>)
.option("ssl.truststore.password", <value>)
.load()
.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
I'm trying to write the data to a file in the hdfs_path:
val query = rawStreamDF
.writeStream
.format("json")
.option("startingOffsets", "latest")
.option("path", "STREAM_DATA_PATH")
.option("checkpointLocation", "checkpointPath")
.trigger(Trigger.ProcessingTime("5 seconds"))
.outputMode("append")
.start
Logger.log.info("Status:"+query.status)
print("Streaming Status1:"+query.status)
query.awaitTermination(450)
But, I'm getting query.status value as below:
Status:{ "message" : "Initializing sources", "isDataAvailable" : false, "isTriggerActive" : false }
Could you let me know where I'm going wrong?
Everything seems fine. The Spark Structured Streaming engine simply has not started the query yet at that point: start() only marks the query to be started on a separate thread, and query.status is read immediately afterwards (awaitTermination(450) waits for less than half a second).
If you monitored the structured query from a separate thread, you would see the status change right after the very first batch is processed.
Consult the Structured Streaming Programming Guide in the official documentation.
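A minimal sketch of such a monitoring thread (the one-second poll interval is an arbitrary assumption; query is the StreamingQuery started above):
// Poll the query status from a separate daemon thread while the main
// thread blocks on awaitTermination.
val monitor = new Thread(new Runnable {
override def run(): Unit = {
while (query.isActive) {
println(s"Streaming status: ${query.status}")
Thread.sleep(1000)
}
}
})
monitor.setDaemon(true)
monitor.start()
query.awaitTermination()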

How to display the records from Kafka to console?

I'm learning Structured Streaming and I was not able to display the output to my console.
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types._
import org.apache.spark.sql.streaming.ProcessingTime
object kafka_stream {
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder()
.appName("kafka-consumer")
.master("local[*]")
.getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("WARN")
// val schema = StructType().add("a", IntegerType()).add("b", StringType())
val schema = StructType(Seq(
StructField("a", IntegerType, true),
StructField("b", StringType, true)
))
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "172.21.0.187:9093")
.option("subscribe", "test")
.option("startingOffsets", "earliest")
.load()
val values = df.selectExpr("CAST(value AS STRING)").as[String]
values.writeStream
.outputMode("append")
.format("console")
.start()
.awaitTermination()
}
}
My input to Kafka is:
my name is abc how are you ?
I just want to display the strings from Kafka on the Spark console.