Transforming a Binary Column in Spark SQL - scala

What is the best way to transform a binary column in a Spark DataFrame? Here are the details.
I have a Kafka stream, which I read as follows:
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1")
.load()
I want to use the from_avro function, but because of the Kafka wire protocol I first have to drop the leading bytes of each value. from_avro takes a column and a schema and returns a new column.
val jsonSchema = ??? // I have the schema
df.select(
col("key").cast("string"),
col("value")
.as("binaryValue"),
from_avro(col("value"), jsonSchema)
.as("avro")
)
Basically, I want to convert col("value") into another column in which each byte array starts from the fifth element, e.g. in Scala arrayValue.slice(4, arrayValue.length).
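A minimal sketch of the slicing step in plain Scala, outside Spark. The object name HeaderStrip, the function dropHeader and the default header size of 4 are assumptions made for illustration (4 simply mirrors the slice(4, ...) above); use whatever header length the producer actually writes.

```scala
// Hypothetical helper, not part of Spark: removes a fixed-size header from a byte array.
object HeaderStrip {
  // headerSize = 4 mirrors slice(4, length) from the question; adjust it to the real header length.
  def dropHeader(bytes: Array[Byte], headerSize: Int = 4): Array[Byte] =
    if (bytes == null) null            // keep null values as null, as a nullable column would need
    else bytes.drop(headerSize)        // returns an empty array when the input is shorter than the header
}
```

A function like this could be wrapped in a Spark UDF and applied to col("value") before calling from_avro. Alternatively, Spark SQL's built-in substring also accepts binary columns and uses 1-based positions, so expr("substring(value, 5)") should express the same slice without a UDF.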

Related

How to join two streaming datasets when one dataset involves aggregation

I am getting the error below from the following code snippet:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Append output mode not supported when there are streaming aggregations
on streaming DataFrames/DataSets without watermark;;
Below is my input schema
val schema = new StructType()
.add("product",StringType)
.add("org",StringType)
.add("quantity", IntegerType)
.add("booked_at",TimestampType)
Creating streaming source dataset
val payload_df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "test1")
.option("startingOffsets", "latest")
.load()
.selectExpr("CAST(value AS STRING) as payload")
.select(from_json(col("payload"), schema).as("data"))
.select("data.*")
Creating another streaming DataFrame in which the aggregation is done, and then joining it with the original source DataFrame to filter out records:
payload_df.createOrReplaceTempView("orders")
val stage_df = spark.sql("select org, product, max(booked_at) as booked_at from orders group by 1,2")
stage_df.createOrReplaceTempView("stage")
val total_qty = spark.sql(
"select o.* from orders o join stage s on o.org = s.org and o.product = s.product and o.booked_at > s.booked_at ")
Finally, I tried to display the results on the console with Append output mode. I cannot figure out where I need to add a watermark or how to resolve this. My objective is to keep, in every trigger, only those events whose timestamp is higher than the maximum timestamp received in any of the earlier triggers.
total_qty
.writeStream
.format("console")
.outputMode("append")
.start()
.awaitTermination()
In Append mode, Spark Structured Streaming allows an aggregation directly on a stream only if a watermark is defined. If you have a column with the timestamp of the event, you can do it like this:
val payload_df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "test1")
.option("startingOffsets", "latest")
.load()
.selectExpr("CAST(value AS STRING) as payload")
.select(from_json(col("payload"), schema).as("data"))
.select("data.*")
.withWatermark("event_time", "1 minutes")
Queries with aggregation have three output modes:
Append mode uses the watermark to drop old aggregation state. The output of a windowed aggregation is delayed by the late threshold specified in withWatermark() because, by this mode's semantics, rows can be added to the Result Table only once, after they are finalized (i.e. after the watermark is crossed). See the Late Data section for more details.
Update mode uses the watermark to drop old aggregation state.
Complete mode does not drop old aggregation state, since by definition this mode preserves all data in the Result Table.
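To make the "finalized after the watermark is crossed" rule concrete, here is a simplified model in plain Scala. It is only a sketch, not Spark's implementation: the object and function names are hypothetical, and it assumes the watermark is the largest event time seen so far minus the configured delay, and that a window counts as final once the watermark has reached its end.

```scala
// Hypothetical, simplified model of watermark bookkeeping (all times in milliseconds).
object WatermarkSketch {
  // Assumption: the watermark trails the largest event time seen so far by a fixed delay.
  def watermark(maxEventTimeMs: Long, delayMs: Long): Long =
    maxEventTimeMs - delayMs

  // Assumption: a window ending at windowEndMs is final, and may be emitted in Append mode,
  // once the watermark has reached that end.
  def isFinal(windowEndMs: Long, watermarkMs: Long): Boolean =
    windowEndMs <= watermarkMs
}
```

With a one-minute delay and a latest event at minute 3, the watermark in this model sits at minute 2, so a window ending at minute 2 can be emitted while a window ending at minute 3 is still open.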
Edit later:
You have to add a window to your groupBy call:
val aggFg = payload_df.groupBy(window($"event_time", "1 minute"), $"org", $"product")
.agg(max($"booked_at").as("booked_at"))

pyspark : Spark Streaming via socket

I am trying to read a data stream from a socket (ps.pndsn.com) and write it into temp_table for further processing. The problem is that temp_table, which I create as part of writeStream, is empty even though data is streaming in real time. I am looking for help with this.
Below is the code snippet:
# Create DataFrame representing the stream of input streamingDF from connection to ps.pndsn.com:9999
streamingDF = spark \
.readStream \
.format("socket") \
.option("host", "ps.pndsn.com") \
.option("port", 9999) \
.load()
# Is this DF actually a streaming DF?
streamingDF.isStreaming
spark.conf.set("spark.sql.shuffle.partitions", "2") # keep the size of shuffles small
query = (
streamingDF
.writeStream
.format("memory")
.queryName("temp_table") # temp_table = name of the in-memory table
.outputMode("Append") # Append = OutputMode in which only the new rows in the streaming DataFrame/Dataset will be written to the sink
.start()
)
Streaming output :
{'channel': 'pubnub-sensor-network',
'message': {'ambient_temperature': '1.361',
'humidity': '81.1392',
'photosensor': '758.82',
'radiation_level': '200',
'sensor_uuid': 'probe-84d85b75',
'timestamp': 1581332619},
'publisher': None,
'subscription': None,
'timetoken': 15813326199534409,
'user_metadata': None}
The output of the temp_table is empty.

Word count group by window

I have a Structured Streaming program that counts words:
// #1
var inputTable = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "XX.XX.XXX.XX:9092")
.option("subscribe", "topic-name")
.option("startingOffsets", "earliest")
.load()
// #2
val df = inputTable.select(explode(split($"value".cast("string"), "\\s+")).as("word"))
.groupBy($"word")
.count
// #3
val query = df.select($"word", $"count").writeStream.outputMode("complete").format("console").start()
// #4
query.awaitTermination()
Now I want to window it by event time (there is a "timestamp" column in the input table).
So I need to change #2. I've tried:
val df = inputTable.select(explode(split($"value".cast("string"), "\\s+")).as("word"), "timestamp")
.groupBy(window($"timestamp", "1 minute", $"word"))
.count
But the compiler complains that the select call does not match the method signature.
All arguments need to be of type Column
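This should work (replaced "timestamp" with col("timestamp") in select, and moved $"word" out of window() so that it is a separate groupBy key, since the third parameter of window is a slide duration string, not a column):
import org.apache.spark.sql.functions._
val df = inputTable.select(explode(split($"value".cast("string"), "\\s+")).as("word"), col("timestamp"))
.groupBy(window($"timestamp", "1 minute"), $"word")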
.count
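As an illustration of what grouping by a one-minute tumbling window and a word means, here is a small plain-Scala sketch that runs without Spark. The names WindowedWordCount, Event, windowStart and count are hypothetical, timestamps are assumed to be non-negative epoch milliseconds, and the bucketing rule (round the timestamp down to a multiple of the window length) is a simplified stand-in for what window() computes.

```scala
// Hypothetical in-memory analogue of groupBy(window(timestamp, "1 minute"), word).count
object WindowedWordCount {
  final case class Event(timestampMs: Long, text: String)

  // Start of the tumbling window that contains the timestamp (assumes timestampMs >= 0).
  def windowStart(timestampMs: Long, windowMs: Long): Long =
    timestampMs - (timestampMs % windowMs)

  // Counts words per (window start, word) pair.
  def count(events: Seq[Event], windowMs: Long): Map[(Long, String), Int] =
    events
      .flatMap { e =>
        e.text.split("\\s+").filter(_.nonEmpty).map(w => (windowStart(e.timestampMs, windowMs), w))
      }
      .groupBy(identity)
      .map { case (key, hits) => key -> hits.size }
}
```

For example, the events (1000, "a b a") and (61000, "a") with a 60000 ms window produce separate counts for "a" in the window starting at 0 and in the window starting at 60000.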

Is there any way I can convert the value data into actual column names in Structured Streaming from Kafka?

I have a CSV file with columns. For testing purposes I push it manually to Kafka, read it from there into Spark, apply some parsing, and write the output to the console. I understand that the CSV data is streamed as the value column in Structured Streaming, which I cast to String. My requirement is to convert the value data into the actual columns. There are hundreds of columns in the CSV file, but I am only interested in two specific columns, "SERVICE_NAME8" and "_raw".
I use spark.sql to extract these columns when I read the CSV file from a path, but now that I use Structured Streaming I am not sure whether I can extract these specific columns into a new DataFrame and apply my parsing afterwards.
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "10.160.172.45:9092, 10.160.172.46:9092, 10.160.172.100:9092")
.option("subscribe", "TOPIC_WITH_COMP_P2_R2, TOPIC_WITH_COMP_P2_R2.DIT, TOPIC_WITHOUT_COMP_P2_R2.DIT")
.load()
val dfs = df.selectExpr("CAST(value AS STRING)").toDF()
val data =dfs.withColumn("splitted", split($"value", "/"))
.select($"splitted".getItem(4).alias("region"),$"splitted".getItem(5).alias("service"),col("value"))
.withColumn("service_type", regexp_extract($"service", """.*(Inbound|Outbound|Outound).*""",1))
.withColumn("region_type", concat(
when(col("region").isNotNull,col("region")).otherwise(lit("null")), lit(" "),
when(col("service").isNotNull,col("service_type")).otherwise(lit("null"))))
val extractedDF = data.filter(
col("region").isNotNull &&
col("service").isNotNull &&
col("value").isNotNull &&
col("service_type").isNotNull &&
col("region_type").isNotNull)
.filter("region != ''")
.filter("service != ''")
.filter("value != ''")
.filter("service_type != ''")
.filter("region_type != ''")
val query = extractedDF
.writeStream
.format("console")
.outputMode("append")
.trigger(ProcessingTime("20 seconds"))
.start()
After val dfs = df.selectExpr("CAST(value AS STRING)").toDF() I somehow need to extract only the two columns "SERVICE_NAME8" and "_raw"; the parsing should then do the rest and produce the output.
As the Spark Structured Streaming quick example shows,
dfs.as[String].map(_.split("/"))
should transform the stream into the same data as you have in your spark.sql code.
Next, you can extract only the desired columns and process them. For example, with data being the result of the previous step,
data.map(line => (line(SERVICE_NAME_COLUMN_INDEX), line(RAW_COLUMN_INDEX)))
will give a Tuple of the two columns for each line.
Please note that this is just an example; I haven't run it. Also, I suppose a Tuple is not the best solution.
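The index-based extraction can be sketched in plain Scala, without Spark, roughly as follows. The object CsvColumns and its project function are hypothetical names; the sketch assumes a comma separator, a header line that is known up front, and no quoted fields containing the separator, none of which is confirmed by the question.

```scala
// Hypothetical helper: picks named columns out of separator-delimited lines.
object CsvColumns {
  def project(header: String, lines: Seq[String], wanted: Seq[String], sep: String = ","): Seq[Seq[String]] = {
    // Naive split; sep is treated as a regex and quoted fields are NOT handled.
    val names = header.split(sep, -1).map(_.trim)
    // Resolve each wanted column name to its position in the header.
    val indices = wanted.map(name => names.indexOf(name))
    require(!indices.contains(-1), s"missing column(s) among: ${wanted.mkString(", ")}")
    lines.map { line =>
      val cells = line.split(sep, -1)
      // Short lines yield an empty string for the missing cell instead of failing.
      indices.map(i => if (i < cells.length) cells(i) else "")
    }
  }
}
```

Called with the header "A,SERVICE_NAME8,_raw", the line "1,svc,/x/y" and the wanted columns SERVICE_NAME8 and _raw, it returns one row containing "svc" and "/x/y". In the streaming job, the same lookup could run inside the map over the String values, after which the existing parsing applies unchanged.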

Why could streaming join of queries over kafka topics take so long?

I'm using Spark Structured Streaming and joining two streams from Kafka topics.
I noticed that the streaming query takes around 15 seconds for each record. In the below screenshot, the stage id 2 takes 15s. Why could that be?
The code is as follows:
val kafkaTopic1 = "demo2"
val kafkaTopic2 = "demo3"
val bootstrapServer = "localhost:9092"
val spark = SparkSession
.builder
.master("local")
.getOrCreate
import spark.implicits._
val df1 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", bootstrapServer)
.option("subscribe", kafkaTopic1)
.option("failOnDataLoss", false)
.load
val df2 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", bootstrapServer)
.option("subscribe", kafkaTopic2)
.option("failOnDataLoss", false)
.load
val order_details = df1
.withColumn(...)
.select(...)
val invoice_details = df2
.withColumn(...)
.where(...)
order_details
.join(invoice_details)
.where(order_details.col("s_order_id") === invoice_details.col("order_id"))
.select(...)
.writeStream
.format("console")
.option("truncate", false)
.start
.awaitTermination()
Code-wise everything works fine. The only problem is the time to join the two streams. How could this query be optimised?
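It is quite possible that the execution time is unsatisfactory because of the master URL, i.e. .master("local"), which runs Spark with a single worker thread. Change it to local[*] at the very least, so that Spark uses as many worker threads as there are logical cores, and you should find the join faster.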