Word count grouped by window - Scala

I have a Structured Streaming program that counts words:
#1
var inputTable = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "XX.XX.XXX.XX:9092")
.option("subscribe", "topic-name")
.option("startingOffsets", "earliest")
.load()
#2
val df = inputTable.select(explode(split($"value".cast("string"), "\\s+")).as("word"))
.groupBy($"word")
.count
#3
val query = df.select($"word", $"count").writeStream.outputMode("complete").format("console").start()
#4
query.awaitTermination()
Now I want to window it by event time (there is a "timestamp" column in the input table).
So I need to change #2. I've tried:
val df = inputTable.select(explode(split($"value".cast("string"), "\\s+")).as("word"), "timestamp")
.groupBy(window($"timestamp", "1 minute", $"word"))
.count
But, as expected, the compiler complains that the arguments do not match any select method signature (you cannot mix Column and String arguments).

All arguments need to be of type Column.
This should work (I replaced "timestamp" with col("timestamp") in the select, and moved $"word" out of the window call, since window only takes the time column plus duration strings):
import org.apache.spark.sql.functions._
val df = inputTable.select(explode(split($"value".cast("string"), "\\s+")).as("word"), col("timestamp"))
.groupBy(window($"timestamp", "1 minute"), $"word")
.count
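If you don't want the ever-growing state that complete output mode implies, a watermark on the timestamp column lets the same windowed count run in update mode. A minimal sketch, assuming a 10-minute lateness threshold (the threshold is not part of the original question):
import org.apache.spark.sql.functions._

val windowedCounts = inputTable
  .select(explode(split($"value".cast("string"), "\\s+")).as("word"), col("timestamp"))
  .withWatermark("timestamp", "10 minutes")            // assumed lateness threshold
  .groupBy(window($"timestamp", "1 minute"), $"word")  // tumbling 1-minute windows per word
  .count()

windowedCounts.writeStream
  .outputMode("update")   // emit only the windows that changed in each trigger
  .format("console")
  .start()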

Related

Transforming a Binary Column in Spark SQL

What is the best way to transform a binary column in a Spark DataFrame? Here are the details:
I have a Kafka stream with the standard Kafka source schema.
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1")
.load()
I want to use the from_avro function, but I have to take the bytes from the fifth element onward because of the Kafka wire format. The from_avro function takes a column and a schema and returns a new column.
val jsonSchema = ??? // I have the schema
df.select(
col("key").cast("string"),
col("value")
.as("binaryValue"),
from_avro(col("value"), jsonSchema)
.as("avro")
)
I basically want to convert col("value") into another column where each byte array starts from the fifth element, e.g. in Scala: arrayValue.slice(4, arrayValue.length).
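One way to get that sliced column (a sketch of my own, not an answer from this thread, reusing the slice(4, ...) offset from the question) is a small UDF that drops the leading bytes before handing the value to from_avro:
// from_avro comes from org.apache.spark.sql.avro.functions in Spark 3.x
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical helper: drop the first 4 bytes of each binary value,
// mirroring arrayValue.slice(4, arrayValue.length) from the question.
val stripPrefix = udf((bytes: Array[Byte]) => bytes.drop(4))

df.select(
  col("key").cast("string"),
  col("value").as("binaryValue"),
  from_avro(stripPrefix(col("value")), jsonSchema).as("avro")
)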

Multiple aggregations and Distinct Function in Spark Structured Streaming

I need to run some aggregations on streaming data from Kafka and output the top 10 rows of the result to the console every M seconds.
input_df = (
spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", brokers)
.option("subscribe", "page_views")
.load()
.selectExpr('cast(value as string)')
)
...
...
# info has 2 cols: domain, uid (info = transformation of input_df)
# It's an example of what I want to do (as in plain batch PySpark)
stat = (
info
.groupby('domain')
.agg(
F.count(F.col('UID')).alias('view'),
F.countDistinct(F.col('UID')).alias('unique')
)
.sort(F.col("view").desc())
.limit(10)
)
query = (
stat
.writeStream
.outputMode("complete")
.format("console")
.option("truncate", "true")
.start()
)
This example is without a time trigger, but I can add that myself.
Because countDistinct is not allowed on streaming DataFrames, I have no idea how to complete this exercise.
I tried creating 2 DataFrames, one per aggregation (df_1 = (domain, view), df_2 = (domain, unique)), and then joining df_1 with df_2, but having several aggregations is also not allowed. So it's a dead end for me.
It would be great to have a solution for this.
Thanks for your attention!
You can do it with flatMapGroupsWithState, which is an arbitrary stateful function. Besides, it supports append mode and update mode.
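A minimal Scala sketch of that idea (my own illustration, not from the answer above), assuming info can be typed as a Dataset of (domain, uid) rows. Note that the set of seen UIDs grows without bound unless you add a timeout or switch to an approximate structure:
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
import spark.implicits._

case class PageView(domain: String, uid: String)        // assumed shape of `info`
case class DomainState(views: Long, uids: Set[String])
case class DomainStats(domain: String, views: Long, uniques: Long)

val stats = info.as[PageView]
  .groupByKey(_.domain)
  .flatMapGroupsWithState[DomainState, DomainStats](
    OutputMode.Update(), GroupStateTimeout.NoTimeout()) {
    (domain: String, rows: Iterator[PageView], state: GroupState[DomainState]) =>
      val previous = state.getOption.getOrElse(DomainState(0L, Set.empty))
      val batch = rows.toSeq
      val updated = DomainState(previous.views + batch.size, previous.uids ++ batch.map(_.uid))
      state.update(updated)                              // state is carried across triggers
      Iterator(DomainStats(domain, updated.views, updated.uids.size))
  }
The query would then run with the update output mode; the top-10 sort/limit is not supported there, so it would have to happen downstream, e.g. inside foreachBatch.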

How to join two streaming datasets when one dataset involves aggregation

I am getting the error below in the following code snippet:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Append output mode not supported when there are streaming aggregations
on streaming DataFrames/DataSets without watermark;;
Below is my input schema
val schema = new StructType()
.add("product",StringType)
.add("org",StringType)
.add("quantity", IntegerType)
.add("booked_at",TimestampType)
Creating streaming source dataset
val payload_df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "test1")
.option("startingOffsets", "latest")
.load()
.selectExpr("CAST(value AS STRING) as payload")
.select(from_json(col("payload"), schema).as("data"))
.select("data.*")
Creating another streaming DataFrame where aggregation is done, and then joining it with the original source DataFrame to filter out records:
payload_df.createOrReplaceTempView("orders")
val stage_df = spark.sql("select org, product, max(booked_at) as booked_at from orders group by 1,2")
stage_df.createOrReplaceTempView("stage")
val total_qty = spark.sql(
"select o.* from orders o join stage s on o.org = s.org and o.product = s.product and o.booked_at > s.booked_at ")
Finally, I was trying to display the results on the console with the append output mode. I am not able to figure out where I need to add the watermark, or how to resolve this. My objective is to keep, in every trigger, only those events that have a higher timestamp than the maximum timestamp received in any of the earlier triggers.
total_qty
.writeStream
.format("console")
.outputMode("append")
.start()
.awaitTermination()
With Spark Structured Streaming you can aggregate directly on a stream only with a watermark. If you have a column with the event timestamp, you can do it like this:
val payload_df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "test1")
.option("startingOffsets", "latest")
.load()
.selectExpr("CAST(value AS STRING) as payload")
.select(from_json(col("payload"), schema).as("data"))
.select("data.*")
.withWatermark("event_time", "1 minutes")
On queries with aggregation you have three output modes:
Append mode uses the watermark to drop old aggregation state, but the output of a windowed aggregation is delayed by the late threshold specified in withWatermark(): by this mode's semantics, rows can be added to the Result Table only once, after they are finalized (i.e. after the watermark is crossed). See the Late Data section for more details.
Update mode uses the watermark to drop old aggregation state.
Complete mode does not drop old aggregation state, since by definition this mode preserves all data in the Result Table.
Edit later:
You have to add a window to your groupBy:
val aggFg = payload_df.groupBy(window($"booked_at", "1 minute"), $"org", $"product")
.agg(max($"booked_at").as("booked_at"))
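With that watermark in place, the windowed aggregation can also be written with the append output mode the question was aiming for, e.g. (a sketch):
aggFg
.writeStream
.format("console")
.outputMode("append") // allowed now: the aggregation is on a window of the watermarked column
.start()
.awaitTermination()
Rows for a given window are emitted only once, after the watermark passes the end of that window.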

Why could a streaming join of queries over Kafka topics take so long?

I'm using Spark Structured Streaming and joining two streams from Kafka topics.
I noticed that the streaming query takes around 15 seconds for each record. In the screenshot below, stage id 2 takes 15s. Why could that be?
The code is as follows:
val kafkaTopic1 = "demo2"
val kafkaTopic2 = "demo3"
val bootstrapServer = "localhost:9092"
val spark = SparkSession
.builder
.master("local")
.getOrCreate
import spark.implicits._
val df1 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", bootstrapServer)
.option("subscribe", kafkaTopic1)
.option("failOnDataLoss", false)
.load
val df2 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", bootstrapServer)
.option("subscribe", kafkaTopic2)
.option("failOnDataLoss", false)
.load
val order_details = df1
.withColumn(...)
.select(...)
val invoice_details = df2
.withColumn(...)
.where(...)
order_details
.join(invoice_details)
.where(order_details.col("s_order_id") === invoice_details.col("order_id"))
.select(...)
.writeStream
.format("console")
.option("truncate", false)
.start
.awaitTermination()
Code-wise everything works fine. The only problem is the time to join the two streams. How could this query be optimised?
It's quite possible that the execution time is unsatisfactory because of the master URL, i.e. .master("local"), which gives Spark a single thread. Change it to local[*] at the very least and you should find the join faster.
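For illustration, a sketch of that change (the spark.sql.shuffle.partitions tweak is my own addition, not part of the answer; the default of 200 shuffle partitions is usually far too many for a small local stream-stream join):
val spark = SparkSession
.builder
.master("local[*]")                          // use all local cores instead of a single thread
.config("spark.sql.shuffle.partitions", "4") // assumed value; the default is 200
.getOrCreate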

How to do custom partitioning of a Spark DataFrame with saveAsTextFile

I have created data in Spark and then performed a join operation; finally, I have to save the output to partitioned files.
I am converting the DataFrame into an RDD and then saving it as a text file, which allows me to use a multi-character delimiter. My question is how to use DataFrame columns as a custom partition in this case.
I cannot use the option below for custom partitioning, because it does not support a multi-character delimiter:
dfMainOutput.write.partitionBy("DataPartiotion","StatementTypeCode")
.format("csv")
.option("delimiter", "^")
.option("nullValue", "")
.option("codec", "gzip")
.save("s3://trfsdisu/SPARK/FinancialLineItem/output")
To use a multi-character delimiter I have converted this to an RDD, as in the code below:
dfMainOutput.rdd.map(x=>x.mkString("|^|")).saveAsTextFile("dir path to store")
But with the above approach, how would I do custom partitioning based on the columns "DataPartiotion" and "StatementTypeCode"?
Do I have to convert back from the RDD to a DataFrame again?
Here is the code that I have tried:
val dfMainOutput = df1result.join(latestForEachKey, Seq("LineItem_organizationId", "LineItem_lineItemId"), "outer")
.select($"LineItem_organizationId", $"LineItem_lineItemId",
when($"DataPartition_1".isNotNull, $"DataPartition_1").otherwise($"DataPartition_1").as("DataPartition_1"),
when($"StatementTypeCode_1".isNotNull, $"StatementTypeCode_1").otherwise($"StatementTypeCode").as("StatementTypeCode"),
when($"StatementTypeCode_1".isNotNull, $"StatementTypeCode_1").otherwise($"StatementTypeCode").alias("StatementtypeCode"),
when($"LineItemName_1".isNotNull, $"LineItemName_1").otherwise($"LineItemName").as("LineItemName"),
when($"LocalLanguageLabel_1".isNotNull, $"LocalLanguageLabel_1").otherwise($"LocalLanguageLabel").as("LocalLanguageLabel"),
when($"FinancialConceptLocal_1".isNotNull, $"FinancialConceptLocal_1").otherwise($"FinancialConceptLocal").as("FinancialConceptLocal"),
when($"FinancialConceptGlobal_1".isNotNull, $"FinancialConceptGlobal_1").otherwise($"FinancialConceptGlobal").as("FinancialConceptGlobal"),
when($"IsDimensional_1".isNotNull, $"IsDimensional_1").otherwise($"IsDimensional").as("IsDimensional"),
when($"InstrumentId_1".isNotNull, $"InstrumentId_1").otherwise($"InstrumentId").as("InstrumentId"),
when($"LineItemSequence_1".isNotNull, $"LineItemSequence_1").otherwise($"LineItemSequence").as("LineItemSequence"),
when($"PhysicalMeasureId_1".isNotNull, $"PhysicalMeasureId_1").otherwise($"PhysicalMeasureId").as("PhysicalMeasureId"),
when($"FinancialConceptCodeGlobalSecondary_1".isNotNull, $"FinancialConceptCodeGlobalSecondary_1").otherwise($"FinancialConceptCodeGlobalSecondary").as("FinancialConceptCodeGlobalSecondary"),
when($"IsRangeAllowed_1".isNotNull, $"IsRangeAllowed_1").otherwise($"IsRangeAllowed".cast(DataTypes.StringType)).as("IsRangeAllowed"),
when($"IsSegmentedByOrigin_1".isNotNull, $"IsSegmentedByOrigin_1").otherwise($"IsSegmentedByOrigin".cast(DataTypes.StringType)).as("IsSegmentedByOrigin"),
when($"SegmentGroupDescription".isNotNull, $"SegmentGroupDescription").otherwise($"SegmentGroupDescription").as("SegmentGroupDescription"),
when($"SegmentChildDescription_1".isNotNull, $"SegmentChildDescription_1").otherwise($"SegmentChildDescription").as("SegmentChildDescription"),
when($"SegmentChildLocalLanguageLabel_1".isNotNull, $"SegmentChildLocalLanguageLabel_1").otherwise($"SegmentChildLocalLanguageLabel").as("SegmentChildLocalLanguageLabel"),
when($"LocalLanguageLabel_languageId_1".isNotNull, $"LocalLanguageLabel_languageId_1").otherwise($"LocalLanguageLabel_languageId").as("LocalLanguageLabel_languageId"),
when($"LineItemName_languageId_1".isNotNull, $"LineItemName_languageId_1").otherwise($"LineItemName_languageId").as("LineItemName_languageId"),
when($"SegmentChildDescription_languageId_1".isNotNull, $"SegmentChildDescription_languageId_1").otherwise($"SegmentChildDescription_languageId").as("SegmentChildDescription_languageId"),
when($"SegmentChildLocalLanguageLabel_languageId_1".isNotNull, $"SegmentChildLocalLanguageLabel_languageId_1").otherwise($"SegmentChildLocalLanguageLabel_languageId").as("SegmentChildLocalLanguageLabel_languageId"),
when($"SegmentGroupDescription_languageId_1".isNotNull, $"SegmentGroupDescription_languageId_1").otherwise($"SegmentGroupDescription_languageId").as("SegmentGroupDescription_languageId"),
when($"SegmentMultipleFundbDescription_1".isNotNull, $"SegmentMultipleFundbDescription_1").otherwise($"SegmentMultipleFundbDescription").as("SegmentMultipleFundbDescription"),
when($"SegmentMultipleFundbDescription_languageId_1".isNotNull, $"SegmentMultipleFundbDescription_languageId_1").otherwise($"SegmentMultipleFundbDescription_languageId").as("SegmentMultipleFundbDescription_languageId"),
when($"IsCredit_1".isNotNull, $"IsCredit_1").otherwise($"IsCredit".cast(DataTypes.StringType)).as("IsCredit"),
when($"FinancialConceptLocalId_1".isNotNull, $"FinancialConceptLocalId_1").otherwise($"FinancialConceptLocalId").as("FinancialConceptLocalId"),
when($"FinancialConceptGlobalId_1".isNotNull, $"FinancialConceptGlobalId_1").otherwise($"FinancialConceptGlobalId").as("FinancialConceptGlobalId"),
when($"FinancialConceptCodeGlobalSecondaryId_1".isNotNull, $"FinancialConceptCodeGlobalSecondaryId_1").otherwise($"FinancialConceptCodeGlobalSecondaryId").as("FinancialConceptCodeGlobalSecondaryId"),
when($"FFAction_1".isNotNull, $"FFAction_1").otherwise((concat(col("FFAction"), lit("|!|"))).as("FFAction")))
.filter(!$"FFAction".contains("D"))
val dfMainOutputFinal = dfMainOutput.select(concat_ws("|^|", columns.map(c => col(c)): _*).as("concatenated"))
dfMainOutputFinal.write.partitionBy("DataPartition_1","StatementTypeCode")
.format("csv")
.option("codec", "gzip")
.save("s3://trfsdisu/SPARK/FinancialLineItem/output")
This can be done by using concat_ws; this function works similarly to mkString but can be applied directly to a DataFrame. It makes the conversion step to an RDD redundant, so the df.write.partitionBy() method can be used. Here is a small example that will concatenate all available columns:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(("01", "20000", "45.30"), ("01", "30000", "45.30"))
.toDF("col1", "col2", "col3")
val df2 = df.select($"DataPartiotion", $"StatementTypeCode",
concat_ws("|^|", df.schema.fieldNames.map(c => col(c)): _*).as("concatenated"))
This will give you a resulting DataFrame like this:
+--------------+-----------------+------------------+
|DataPartiotion|StatementTypeCode| concatenated|
+--------------+-----------------+------------------+
| 01| 20000|01|^|20000|^|45.30|
| 01| 30000|01|^|30000|^|45.30|
+--------------+-----------------+------------------+
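From there, a possible way to write the result (a sketch reusing the output path from the question): since "concatenated" is the only non-partition column left, the CSV writer simply emits it line by line and the multi-character "|^|" delimiter is preserved.
df2.write
.partitionBy("DataPartiotion", "StatementTypeCode")
.option("compression", "gzip")
.csv("s3://trfsdisu/SPARK/FinancialLineItem/output")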