How do I do functions.from_csv at spark structured stream - scala

I read lines from a kafka source and I want to build a kafka consumer... in spark structured streaming
I know how to tell spark that the incoming lines are json type... how do I do the same with from_csv ?
val lines = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "topic2")
.option("startingOffsets", "earliest")
.load()
.selectExpr("CAST(value AS STRING)")
.select(functions.from_json($"value", retailDataSchema).as("data"))
lines.printSchema()
The schema is:
val retailDataSchema = new StructType()
.add("InvoiceNo", IntegerType)
.add("Quantity", IntegerType)
.add("Country", StringType)
Thank you!
The input data looks like this:

You could do this work around:
val lines = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "topic2")
.option("startingOffsets", "earliest")
.load()
.select(col("value").cast("string")).as("data").select("data.*").selectExpr("cast(split(value,',')[0] as DataTypes.IntegerType) as InvoiceNo"
,"cast(split(value,',')[1] as DataTypes.IntegerType) as Quantity"
,"cast(split(value,',')[2] as DataTypes.StringType) as Country" );
lines.printSchema();
Or you could use the built-in function from_csv Since Apache spark 3.0.0
val lines = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "topic2")
.option("startingOffsets", "earliest")
.load()
.selectExpr("CAST(value AS STRING)")
.select(functions.from_csv($"value", retailDataSchema).as("data"))
lines.printSchema()
Apache Spark Docs for from_csv built-in function

Related

How to stream a single topic of kafka , filter by key into multiple location of hdfs?

I am not being to stream my data on multiple hdfs location , which is filtered by key. So below code is not working. Please help me to find the correct way to write this code
val ER_stream_V1 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", configManager.getString("Kafka.Server"))
.option("subscribe", "Topic1")
.option("startingOffsets", "latest")
.option("failOnDataLoss", "false")
.load()
val ER_stream_V2 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", configManager.getString("Kafka.Server"))
.option("subscribe", "Topic1")
.option("startingOffsets", "latest")
.option("failOnDataLoss", "false")
.load()
ER_stream_V1.toDF()
.select(col("key"), col("value").cast("string"))
.filter(col("key")==="Value1")
.select(functions.from_json(col("value").cast("string"), Value1Schema.schemaExecution).as("value")).select("value.*")
.writeStream
.format("orc")
.option("metastoreUri", configManager.getString("spark.datasource.hive.warehouse.metastoreUri"))
.option("checkpointLocation", "/tmp/teststreaming/execution/checkpoint2005")
.option("path", "/tmp/test/value1")
.trigger(Trigger.ProcessingTime("5 Seconds"))
.partitionBy("jobid")
.start()
ER_stream_V2.toDF()
.select(col("key"), col("value").cast("string"))
.filter(col("key")==="Value2")
.select(functions.from_json(col("value").cast("string"), Value2Schema.schemaJobParameters).as("value"))
.select("value.*")
.writeStream
.format("orc")
.option("metastoreUri", configManager.getString("spark.datasource.hive.warehouse.metastoreUri"))
.option("checkpointLocation", "/tmp/teststreaming/jobparameters/checkpoint2006")
.option("path", "/tmp/test/value2")
.trigger(Trigger.ProcessingTime("5 Seconds"))
.partitionBy("jobid")
.start()
You should not need two readers. Create one and filter twice. You might also want to consider startingOffsets as earliest to read existing topic data
For example.
val ER_stream = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", configManager.getString("Kafka.Server"))
.option("subscribe", "Topic1")
.option("startingOffsets", "latest") // maybe change?
.option("failOnDataLoss", "false")
.load()
.toDF()
.select(col("key").cast("string").as("key"), col("value").cast("string"))
val value1Stream = ER_stream
.filter(col("key") === "Value1")
.select(functions.from_json(col("value"), Value1Schema.schemaExecution).as("value"))
.select("value.*")
val value2Stream = ER_stream
.filter(col("key") === "Value2")
.select(functions.from_json(col("value"), Value2Schema.schemaJobParameters).as("value"))
.select("value.*")
value1Stream.writeStream.format("orc")
...
.start()
value2Stream.writeStream.format("orc")
...
.start()

Stream from kafka to kafka with Spark Structured Streaming in scala

I am trying to run simple variations of examples from official spark tutorial and a book "spark streaming in action".
The content of exceptions are strange. What is wrong with my code?
First of all I start kafka zookeeper, server, producer and 2 consumers. Then I run following code:
// read from kafka
val df = sparkService.sparkSession
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", topic1)
.load()
// write to kafka
import sparkService.sparkSession.implicits._
val query = df.selectExpr("CAST(key as STRING)", "CAST(value as STRING)")
.writeStream
.outputMode(OutputMode.Append())
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("topic", topic2)
.option("checkpointLocation", "/home/pt/Dokumenty/tmp/")
.option("failOnDataLoss", "false") // only when testing
.start()
query.awaitTermination(30000)
Error occurs on writting to kafka:
Exception in thread "main" org.apache.spark.sql.streaming.StreamingQueryException: Expected e.g. {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}, got 1
1609627750463

Can we fetch data from Kafka from specific offset in spark structured streaming batch mode

In kafka I get new topics dynamically and I have to process it using spark streaming from a specific offset. Is there a possibility to pass the json value from a variable. For example consider the below code
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribePattern", "topic.*")
.option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}""")
.load()
In this I want to dynamically update value for startingOffsets... I tried to pass the value in string and called it but it did not work... If I am giving the same value in startingOffsets it is working. How to use a variable in this scenario?
val start_offset= """{"topic1":{"0":23,"1":-2},"topic2":{"0":-2}}"""
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribePattern", "topic.*")
.option("startingOffsets", start_offset)
.load()
java.lang.IllegalArgumentException: Expected e.g. {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}, got """{"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}"""
def main(args: Array[String]) {
val conf = new SparkConf().setMaster("local[*]").setAppName("ReadSpecificOffsetFromKafka");
val spark = SparkSession.builder().config(conf).getOrCreate();
spark.sparkContext.setLogLevel("error");
import spark.implicits._;
val start_offset = """{"first_topic" : {"0" : 15, "1": -2, "2": 6}}"""
val fromKafka = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092, localhost:9093")
.option("subscribe", "first_topic")
// .option("startingOffsets", "earliest")
.option("startingOffsets", start_offset)
.load();
val selectedValues = fromKafka.selectExpr("cast(value as string)", "cast(partition as integer)");
selectedValues.writeStream
.format("console")
.outputMode("append")
// .trigger(Trigger.Continuous("3 seconds"))
.start()
.awaitTermination();
}
This is the exact code to fetch specific offset from kafka using spark structured streaming and scala
Looks like your job is check pointing the Kafka offsets onto some
persistent storage. Try cleaning those. and Re run your Job.
Also try renaming your job and running it.
Spark can read the stream via readStream. So try with an offset displayed in the error message to get rid of the error.
spark
.readStream
.format("kafka")
.option("subscribePattern", "topic.*")

How to ingest data from two producers in kafka and join using spark structured streaming?

I am trying to read data from two kafka topics, but I am unable to join and find teh final dataframe.
My kafka topics are CSVStreamRetail and OrderItems.
val spark = SparkSession
.builder
.appName("Spark-Stream-Example")
.master("local[*]")
.config("spark.sql.warehouse.dir", "file:///C:/temp")
.getOrCreate()
val ordersSchema = new StructType()
.add("order_id", IntegerType)
.add("order_date", StringType)
.add("order_customer_id", IntegerType)
.add("order_status", StringType)
val orderItemsSchema = new StructType()
.add("order_item_id",IntegerType)
.add("order_item_order_id",IntegerType)
.add("order_item_product_id",IntegerType)
.add("order_item_quantity",IntegerType)
.add("order_item_subtotal",DoubleType)
.add("order_item_product_price", DoubleType)
import spark.implicits._
val df1 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "CSVStreamRetail")
.load()
val df2 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "OrderItems")
.load()
val ordersDF = df1.selectExpr("CAST(value AS STRING)", "CAST(timestamp as TIMESTAMP)").as[(String,Timestamp)]
.select(from_json($"value", ordersSchema).as("orders_data"),$"timestamp")
.select("orders_data.*","timestamp")
val orderItemsDF = df2.selectExpr("CAST(value as STRING)", "CAST(timestamp as TIMESTAMP)").as[(String,Timestamp)]
.select(from_json($"value",orderItemsSchema).as("order_items_data"),$"timestamp")
.select("order_items_data.*","timestamp")
val finalDF = orderItemsDF.join(ordersDF, orderItemsDF("order_item_order_id")===ordersDF("order_id"))
finalDF
.writeStream
.format("console")
.option("truncate", "false")
.start()
.awaitTermination()
The output I am receiving is an empty dataframe.
First of all please check whether you are receiving data in your kafka topics.
You should always provide watermarking at least in one stream in case of a stream-stream join. I see you want to perform an inner join.
So I have added 200 seconds watermarking and now it is showing data in the output dataframe.
val spark = SparkSession
.builder
.appName("Spark-Stream-Example")
.master("local[*]")
.config("spark.sql.warehouse.dir", "file:///C:/temp")
.getOrCreate()
val ordersSchema = new StructType()
.add("order_id", IntegerType)
.add("order_date", StringType)
.add("order_customer_id", IntegerType)
.add("order_status", StringType)
val orderItemsSchema = new StructType()
.add("order_item_id",IntegerType)
.add("order_item_order_id",IntegerType)
.add("order_item_product_id",IntegerType)
.add("order_item_quantity",IntegerType)
.add("order_item_subtotal",DoubleType)
.add("order_item_product_price", DoubleType)
import spark.implicits._
val df1 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "CSVStreamRetail")
.load()
val df2 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "OrderItems")
.load()
val ordersDF = df1.selectExpr("CAST(value AS STRING)", "CAST(timestamp as TIMESTAMP)").as[(String,Timestamp)]
.select(from_json($"value", ordersSchema).as("orders_data"),$"timestamp")
.select("orders_data.*","timestamp")
.withWatermark("timestamp","200 seconds")
val orderItemsDF = df2.selectExpr("CAST(value as STRING)", "CAST(timestamp as TIMESTAMP)").as[(String,Timestamp)]
.select(from_json($"value",orderItemsSchema).as("order_items_data"),$"timestamp")
.select("order_items_data.*","timestamp")
.withWatermark("timestamp","200 seconds")
val finalDF = orderItemsDF.join(ordersDF, orderItemsDF("order_item_order_id")===ordersDF("order_id"))
finalDF
.writeStream
.format("console")
.option("truncate", "false")
.start()
.awaitTermination()
Use the eventTimestamp for joining.
Let me know if this helps.

org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received

kafka: kafka_2.11-0.10.2.1
scala:2.12
val TOPIC_EVENT_XXX = "EVENT.xxx.ALL"
import org.apache.spark.sql.Encoders
val schema = Encoders.bean(classOf[Event]).schema
val allEventsDF = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "Streaming01.simon.com:9090,Streaming02.simon.com:9090,Streaming03.simon.com:9090,Streaming04.simon.com:9090")
.option("subscribe", TOPIC_EVENT_XXX)
.option("startingOffsets", "latest")
.option("maxOffsetsPerTrigger", 5000)//old,1000000
.load()
.select(from_json(col("value").cast("string"), schema).alias("parsed_value"))
.selectExpr("parsed_value.*")
val KAFKA_BOOTSTRAP_SERVERS = "Streaming01.simon.com:9090,Streaming02.simon.com:9090,Streaming03.simon.com:9090,Streaming04.simon.com:9090,Ingest01.simon.com:9090,Ingest02.simon.com:9090,Notify01.simon.com:9090,Notify02.simon.com:9090,Serving01.simon.com:9090,Serving02.simon.com:9090,"
var waybillStatesKafkaSinkQuery = waybillStates.selectExpr("to_json(struct(*)) AS value")
.writeStream
.outputMode("append")
.format("kafka") // can be "orc", "json", "csv",memory,console etc.
.option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS)
.option("topic", TOPIC_TIMECHAIN_WAYBILL) //TIMECHAIN.WAYBILL.ALL //TIMECHAIN.WAYBILL.TL //TOPIC_TIMECHAIN_WAYBILL
.option("checkpointLocation", CHECKPOINT_PATH_TL_EVENT_WAYBILL_STATES)
.option("kafka.max.request.size", "164217728")//134217728//209715200
.option("kafka.buffer.memory", "164217728")
.option("kafka.timeout.ms",180000)
.option("kafka.request.timeout.ms",180000)
.option("kafka.session.timeout.ms",180000)
.option("kafka.heartbeat.interval.ms",120000)
.option("kafka.retries",100)
.option("failOnDataLoss","false")//后添加的【2018-07-11】
.start()
The following error occurred while running the above program.:
org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.