I'm using Structured Streaming and I'm trying to send my result to a Kafka topic named "results".
I get the following error:
'Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;;
Can anyone help?
query1 = prediction.writeStream.format("kafka")\
.option("topic", "results")\
.option("kafka.bootstrap.servers", "localhost:9092")\
.option("checkpointLocation", "checkpoint")\
.start()
query1.awaitTermination()
prediction schema is:
root
|-- prediction: double (nullable = false)
|-- count: long (nullable = false)
Am I missing something?
The error message gives a hint on what is missing: a watermark.
Watermarks are used to handle late incoming data when you are aggregating stream data. Details can be found in the Spark documentation for Structured Streaming.
It is important that withWatermark is used on the same column as the timestamp column used in the aggregate.
An example on how to use withWatermark is given in the Spark documentation:
words = ...  # streaming DataFrame of schema { timestamp: Timestamp, word: String }

# Group the data by window and word and compute the count of each group
windowedCounts = words \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(
        window(words.timestamp, "10 minutes", "5 minutes"),
        words.word) \
    .count()
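To connect this to the question: if prediction comes from a streaming groupBy(...).count() (which would match the prediction/count schema above), the watermark has to be applied to the event-time column before that aggregation. Below is a minimal sketch in Scala, where scoredStream and its timestamp column are assumptions standing in for the unshown upstream code:
import org.apache.spark.sql.functions.{col, window}

// scoredStream stands in for the streaming DataFrame feeding the aggregation;
// "timestamp" is assumed to be its event-time column (e.g. the Kafka timestamp).
val prediction = scoredStream
  .withWatermark("timestamp", "10 minutes")    // watermark on the event-time column
  .groupBy(
    window(col("timestamp"), "10 minutes"),    // aggregate over the same column
    col("prediction"))
  .count()
With the watermark in place, the append write to the "results" topic should be accepted.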
Related
What is the best way to transform a binary column in a Spark DataFrame? Here are the details:
I have a Kafka stream which has this structure: Kafka Schema
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("subscribe", "topic1")
.load()
I want to use the from_avro function, but I have to get the bytes after the fifth element because of the Kafka wire protocol. The from_avro function takes a column and a schema and returns a new column.
val jsonSchema = ??? // I have the schema
df.select(
  col("key").cast("string"),
  col("value").as("binaryValue"),
  from_avro(col("value"), jsonSchema).as("avro")
)
I basically want to convert col("value") into another column in which each byte array starts from the fifth element, e.g. in Scala: arrayValue.slice(4, arrayValue.length).
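For what it's worth, one way to express that slice as a column is a small UDF that mirrors the slice call. This is only a sketch: the skipHeader name is made up, and dropping 4 bytes is taken from the slice example above, so check it against your wire format:
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.avro.from_avro   // in Spark 3.x: org.apache.spark.sql.avro.functions.from_avro

// Drop the leading bytes, mirroring arrayValue.slice(4, arrayValue.length)
val skipHeader = udf((bytes: Array[Byte]) =>
  if (bytes == null) null else bytes.slice(4, bytes.length))

df.select(
  col("key").cast("string"),
  col("value").as("binaryValue"),
  from_avro(skipHeader(col("value")), jsonSchema).as("avro")
)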
I need to run some aggregations on streaming data from Kafka and output the top 10 rows of the result to the console every M seconds.
input_df = (
spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", brokers)
.option("subscribe", "page_views")
.load()
.selectExpr('cast(value as string)')
)
...
...
# info has 2 cols: domain, uid (info = transformation of input_df)
# It's an example of what I want to do (like in simple pyspark)
stat = (
    info
    .groupby('domain')
    .agg(
        F.count(F.col('UID')).alias('view'),
        F.countDistinct(F.col('UID')).alias('unique')
    )
    .sort(F.col("view").desc())
    .limit(10)
)
query = (
stat
.writeStream
.outputMode("complete")
.format("console")
.option("truncate", "true")
.start()
)
This example has no time trigger, but I can add that myself.
Because countDistinct is not allowed on streaming data, I have no idea how to complete my exercise.
I tried building two DataFrames, one per aggregation (df_1 = (domain, view), df_2 = (domain, unique)), and then joining df_1 with df_2, but having several aggregations is also not allowed. So that is a dead end for me.
It would be great to have a solution for this.
Thanks for your attention!
You can do this with flatMapGroupsWithState, which is an arbitrary stateful operation. Besides, it supports both the append and update output modes.
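A rough sketch of that approach is below, in Scala, since flatMapGroupsWithState is not exposed in the PySpark API. The PageView fields mirror the domain/uid columns of info; the case classes and the update function are assumptions, not a drop-in implementation:
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
import spark.implicits._

case class PageView(domain: String, uid: String)
case class DomainState(view: Long, uids: Seq[String])
case class DomainStats(domain: String, view: Long, unique: Long)

// Keep a running view count and the distinct UIDs seen so far for each domain.
def updateStats(domain: String,
                rows: Iterator[PageView],
                state: GroupState[DomainState]): Iterator[DomainStats] = {
  val old     = state.getOption.getOrElse(DomainState(0L, Seq.empty))
  val batch   = rows.toSeq
  val updated = DomainState(old.view + batch.size, (old.uids ++ batch.map(_.uid)).distinct)
  state.update(updated)
  Iterator(DomainStats(domain, updated.view, updated.uids.size.toLong))
}

// info is the streaming Dataset with the domain and uid columns from the question
val stat = info.as[PageView]
  .groupByKey(_.domain)
  .flatMapGroupsWithState[DomainState, DomainStats](
    OutputMode.Update(), GroupStateTimeout.NoTimeout())(updateStats)
Each trigger then emits refreshed per-domain counts in update mode; the top-10 selection can be done on the sink side, for example inside foreachBatch.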
I am getting the below error in the code snippet below:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Append output mode not supported when there are streaming aggregations
on streaming DataFrames/DataSets without watermark;;
Below is my input schema
val schema = new StructType()
.add("product",StringType)
.add("org",StringType)
.add("quantity", IntegerType)
.add("booked_at",TimestampType)
Creating streaming source dataset
val payload_df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "test1")
.option("startingOffsets", "latest")
.load()
.selectExpr("CAST(value AS STRING) as payload")
.select(from_json(col("payload"), schema).as("data"))
.select("data.*")
Creating another streaming DataFrame where the aggregation is done, and then joining it with the original source DataFrame to filter out records:
payload_df.createOrReplaceTempView("orders")
val stage_df = spark.sql("select org, product, max(booked_at) as booked_at from orders group by 1,2")
stage_df.createOrReplaceTempView("stage")
val total_qty = spark.sql(
"select o.* from orders o join stage s on o.org = s.org and o.product = s.product and o.booked_at > s.booked_at ")
Finally, I was trying to display the results on the console with the append output mode. I am not able to figure out where I need to add the watermark or how to resolve this. My objective is to keep, in every trigger, only those events that have a higher timestamp than the maximum timestamp received in any of the earlier triggers.
total_qty
.writeStream
.format("console")
.outputMode("append")
.start()
.awaitTermination()
With Spark Structured Streaming, you can run an aggregation directly on a stream in append mode only with a watermark. If you have a column with the timestamp of the event, you can add one like this:
val payload_df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "test1")
.option("startingOffsets", "latest")
.load()
.selectExpr("CAST(value AS STRING) as payload")
.select(from_json(col("payload"), schema).as("data"))
.select("data.*")
.withWatermark("event_time", "1 minutes")
For queries with aggregation, there are three types of output modes:
Append mode uses the watermark to drop old aggregation state, but the output of a windowed aggregation is delayed by the late threshold specified in withWatermark(), because by the mode's semantics rows can be added to the Result Table only once, after they are finalized (i.e. after the watermark is crossed). See the Late Data section for more details.
Update mode uses the watermark to drop old aggregation state.
Complete mode does not drop old aggregation state, since by definition this mode preserves all data in the Result Table.
Edit later:
You also have to add a window to your groupBy call:
val aggFg = payload_df
  .groupBy(window($"event_time", "1 minute"), $"org", $"product")
  .agg(max($"booked_at").as("booked_at"))
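With the watermark and the windowed groupBy in place, the append query from the question should be accepted; note that, as quoted above, the windowed results only show up in append mode once the watermark has passed the end of each window:
aggFg
  .writeStream
  .format("console")
  .outputMode("append")
  .start()
  .awaitTermination()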
I have a Spark Structured Streaming job which gets records from Kafka (10,000 as maxOffsetsPerTrigger). I get all those records using Spark's readStream method. This dataframe has a column named "key".
I need string(set(all values in that column 'key')) to use this string in a query to ElasticSearch.
I have already tried df.select("key").collect().distinct() but it throws exception:
collect() is not supported with structured streaming.
Thanks.
EDIT:
DATAFRAME:
+-------+-------------------+----------+
| key| ex|new column|
+-------+-------------------+----------+
| fruits| [mango, apple]| |
|animals| [cat, dog, horse]| |
| human|[ram, shyam, karun]| |
+-------+-------------------+----------+
SCHEMA:
root
|-- key: string (nullable = true)
|-- ex: array (nullable = true)
| |-- element: string (containsNull = true)
|-- new column: string (nullable = true)
STRING I NEED:
'["fruits", "animals", "human"]'
You cannot apply collect() on a streaming DataFrame; streamingDf here refers to the DataFrame read from Kafka.
val query = streamingDf
.select(col("Key").cast(StringType))
.writeStream
.format("console")
.start()
query.awaitTermination()
It will print your data to the console. To write the data to an external source, you have to provide an implementation of ForeachWriter. For reference, see the linked example, where data is streamed using Kafka, read by Spark, and eventually written to Cassandra.
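For completeness, here is a bare-bones ForeachWriter sketch; the connection handling is only hinted at in comments and should be adapted to your external system:
import org.apache.spark.sql.{ForeachWriter, Row}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StringType

val externalWriter = new ForeachWriter[Row] {
  override def open(partitionId: Long, epochId: Long): Boolean = {
    // open a connection to the external system; return true to process this partition
    true
  }
  override def process(row: Row): Unit = {
    // write a single record, e.g. row.getString(0)
  }
  override def close(errorOrNull: Throwable): Unit = {
    // close the connection / handle errors
  }
}

val query = streamingDf
  .select(col("Key").cast(StringType))
  .writeStream
  .foreach(externalWriter)
  .start()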
Hope it helps.
For such a use case, I'd recommend using the foreachBatch operator:
foreachBatch(function: (Dataset[T], Long) ⇒ Unit): DataStreamWriter[T]
Sets the output of the streaming query to be processed using the provided function. This is supported only in the micro-batch execution modes (that is, when the trigger is not continuous).
The provided function will be called in every micro-batch with (i) the output rows as a Dataset and (ii) the batch identifier.
The batchId can be used to deduplicate and transactionally write the output (that is, the provided Dataset) to external systems. The output Dataset is guaranteed to be exactly the same for the same batchId (assuming all operations are deterministic in the query).
Quoting the official documentation (with a few modifications):
The foreachBatch operation allows you to apply arbitrary operations and custom writing logic on the output of each micro-batch of a streaming query.
And in the same official documentation you can find a sample code that shows that you could do your use case fairly easily.
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  batchDF.select("key").distinct().collect()
}
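Building on that, here is a small sketch that produces exactly the string shown above from each micro-batch; the Elasticsearch call itself is left as a comment:
import org.apache.spark.sql.DataFrame

streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  // distinct keys of this micro-batch, collected to the driver
  val keys = batchDF.select("key").distinct().collect().map(_.getString(0))
  val keyString = keys.mkString("[\"", "\", \"", "\"]")   // e.g. ["fruits", "animals", "human"]
  // use keyString in your Elasticsearch query here
}.start()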
I am looking for a solution for adding the Kafka timestamp value to my Spark Structured Streaming schema. I have extracted the value field from Kafka and am building a DataFrame from it. My issue is that I need to get the timestamp field (from Kafka) along with the other columns.
Here is my current code:
val kafkaDatademostr = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers","zzzz.xxx.xxx.xxx.com:9002")
.option("subscribe","csvstream")
.load
val interval = kafkaDatademostr.select(col("value").cast("string")).alias("csv")
.select("csv.*")
val xmlData = interval.selectExpr("split(value,',')[0] as ddd" ,
"split(value,',')[1] as DFW",
"split(value,',')[2] as DTG",
"split(value,',')[3] as CDF",
"split(value,',')[4] as DFO",
"split(value,',')[5] as SAD",
"split(value,',')[6] as DER",
"split(value,',')[7] as time_for",
"split(value,',')[8] as fort")
How can I get the timestamp from kafka and add as columns along with other columns?
The timestamp is included in the Kafka source schema. Just add the timestamp column to your select to get it, like below:
val interval = kafkaDatademostr
  .select(col("value").cast("string").as("value"), col("timestamp"))
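If you also want the timestamp to survive the later split step, just list it in that selectExpr as well. A shortened sketch (only the first and last of the nine split fields from the question are repeated here):
val xmlData = interval.selectExpr(
  "split(value,',')[0] as ddd",
  "split(value,',')[8] as fort",
  "timestamp")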
On the official Apache Spark web page you can find the guide: Structured Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher)
There you can find information about the schema of DataFrame that is loaded from Kafka.
Each row from the Kafka source has the following columns:
key - message key
value - message value
topic - message topic name
partition - partition the message came from
offset - offset of the message
timestamp - timestamp
timestampType - timestamp type
All of the above columns are available to query.
In your example you use only value, so to get the timestamp you just need to add timestamp to your select statement:
val allFields = kafkaDatademostr.selectExpr(
  "CAST(value AS STRING) AS csv",
  "CAST(key AS STRING) AS key",
  "topic as topic",
  "partition as partition",
  "offset as offset",
  "timestamp as timestamp",
  "timestampType as timestampType"
)
In my case, I was receiving the Kafka values in JSON format, which contained the actual data along with the original event time, not the Kafka timestamp. Below is the schema.
val mySchema = StructType(Array(
StructField("time", LongType),
StructField("close", DoubleType)
))
In order to use the watermarking feature of Spark Structured Streaming, I had to cast the time field into the timestamp format.
val df1 = df.selectExpr("CAST(value AS STRING)").as[(String)]
.select(from_json($"value", mySchema).as("data"))
.select(col("data.time").cast("timestamp").alias("time"),col("data.close"))
Now you can use the time field for window operations as well as for watermarking.
import spark.implicits._
val windowedData = df1.withWatermark("time", "1 minute")
  .groupBy(
    window(col("time"), "1 minute", "30 seconds"),
    $"close"
  ).count()
I hope this answer clarifies things.