I am looking for a way to add the Kafka timestamp to my Spark Structured Streaming schema. I have extracted the value field from Kafka and am building a DataFrame from it. My issue is that I also need the timestamp field (from Kafka) alongside the other columns.
Here is my current code:
val kafkaDatademostr = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers","zzzz.xxx.xxx.xxx.com:9002")
.option("subscribe","csvstream")
.load
val interval = kafkaDatademostr.select(col("value").cast("string")).alias("csv")
.select("csv.*")
val xmlData = interval.selectExpr("split(value,',')[0] as ddd" ,
"split(value,',')[1] as DFW",
"split(value,',')[2] as DTG",
"split(value,',')[3] as CDF",
"split(value,',')[4] as DFO",
"split(value,',')[5] as SAD",
"split(value,',')[6] as DER",
"split(value,',')[7] as time_for",
"split(value,',')[8] as fort")
How can I get the timestamp from Kafka and add it as a column along with the other columns?
The timestamp is included in the Kafka source schema. Just add timestamp to your select to get it, like below:
val interval = kafkaDatademostr.select(col("value").cast("string"), col("timestamp")).alias("csv").select("csv.*")
On the official Apache Spark website you can find the guide: Structured Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher).
There you can find information about the schema of the DataFrame that is loaded from Kafka.
Each row from the Kafka source has the following columns:
key - message key
value - message value
topic - name of the topic the message came from
partition - partition from which the message came
offset - offset of the message
timestamp - timestamp
timestampType - timestamp type
All of the above columns are available to query.
In your example you use only value, so to get the timestamp you just need to add timestamp to your select statement:
val allFields = kafkaDatademostr.selectExpr(
  "CAST(value AS STRING) AS csv",
  "CAST(key AS STRING) AS key",
  "topic",
  "partition",
  "offset",
  "timestamp",
  "timestampType"
)
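If only the timestamp is needed next to the parsed CSV fields from the question, a minimal, untested sketch (reusing the column names from the question) could look like this:
val csvWithTimestamp = kafkaDatademostr
  .selectExpr("CAST(value AS STRING) AS csv", "timestamp")
  .selectExpr(
    "split(csv,',')[0] AS ddd",
    "split(csv,',')[1] AS DFW",
    "split(csv,',')[2] AS DTG",
    // ... remaining split columns as in the question ...
    "timestamp")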
In my case, I was receiving the values from Kafka in JSON format, which contained the actual data along with the original event time, not the Kafka timestamp. Below is the schema.
val mySchema = StructType(Array(
StructField("time", LongType),
StructField("close", DoubleType)
))
In order to use the watermarking feature of Spark Structured Streaming, I had to cast the time field to the timestamp type.
val df1 = df.selectExpr("CAST(value AS STRING)").as[String]
  .select(from_json($"value", mySchema).as("data"))
  .select(col("data.time").cast("timestamp").alias("time"), col("data.close"))
Now you can use the time field for window operations as well as for watermarking.
import spark.implicits._
val windowedData = df1.withWatermark("time","1 minute")
.groupBy(
window(col("time"), "1 minute", "30 seconds"),
$"close"
).count()
I hope this answer clarifies things.
I'm using Structured Streaming and I'm trying to send my result to a Kafka topic named "results".
I get the following error:
'Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;;
Can anyone help?
query1 = prediction.writeStream.format("kafka")\
.option("topic", "results")\
.option("kafka.bootstrap.servers", "localhost:9092")\
.option("checkpointLocation", "checkpoint")\
.start()
query1.awaitTermination()
prediction schema is:
root
|-- prediction: double (nullable = false)
|-- count: long (nullable = false)
Am I missing something?
The error message gives a hint on what is missing: a watermark.
Watermarks are used to handle late incoming data when you are aggregating stream data. Details can be found in the Spark documentation for Structured Streaming.
It is important that withWatermark is used on the same column as the timestamp column used in the aggregate.
An example on how to use withWatermark is given in the Spark documentation:
words = ... # streaming DataFrame of schema { timestamp: Timestamp, word: String }
# Group the data by window and word and compute the count of each group
windowedCounts = words \
.withWatermark("timestamp", "10 minutes") \
.groupBy(
window(words.timestamp, "10 minutes", "5 minutes"),
words.word) \
.count()
I have just started my journey with Spark Streaming, where I am reading data from a Kafka queue using Spark Structured Streaming. I need to deduplicate each micro-batch based on key columns (product and org), ordering by a timestamp field (booked_at).
My deduplicator function creates a micro-batch from the streaming DataFrame every 2 seconds. The deduplicated DataFrame is returned to another function for further processing and is finally persisted to DynamoDB. My code works fine except when there are late-arriving records in Kafka. For example, consider the 4 events below, where the first two belong to the first micro-batch after starting the streaming job and the last two belong to the second micro-batch:
{"product":"p1","org":"US","quantity":1,"booked_at":"2020-02-05T00:00:05"}
{"product":"p1","org":"US","quantity":2,"booked_at":"2020-02-05T00:00:06"}
{"product":"p1","org":"US","quantity":3,"booked_at":"2020-02-05T00:00:01"}
{"product":"p1","org":"US","quantity":4,"booked_at":"2020-02-05T00:00:03"}
The current output of my deduplicator function for each batch is shown below:
+-------+---+--------+-------------------+
|product|org|quantity|          booked_at|
+-------+---+--------+-------------------+
|     p1| US|       2|2020-02-05 00:00:06|
+-------+---+--------+-------------------+
+-------+---+--------+-------------------+
|product|org|quantity|          booked_at|
+-------+---+--------+-------------------+
|     p1| US|       4|2020-02-05 00:00:03|
+-------+---+--------+-------------------+
I can safely assume that my late-arriving records will not be more than 60 seconds late in Kafka, and with this assumption I want my function to somehow consider the last 60 seconds of events when deduplicating. So if the micro-batch 2 records lie within the 60-second window, the output should be as below:
+-------+---+--------+-------------------+
|product|org|quantity|          booked_at|
+-------+---+--------+-------------------+
|     p1| US|       2|2020-02-05 00:00:06|
+-------+---+--------+-------------------+
+-------+---+--------+-------------------+
|product|org|quantity|          booked_at|
+-------+---+--------+-------------------+
|     p1| US|       2|2020-02-05 00:00:06|
+-------+---+--------+-------------------+
Some solutions I am able to think of:
Take 60 seconds' worth of data every 2 seconds. Currently it is set to consider only the last 2 seconds of data every 2 seconds. But this is probably not possible, since reads from Kafka are based on offsets rather than the Kafka timestamp.
Persist the last 60 seconds of data in a cached DataFrame and use it while deduplicating each 2-second micro-batch. This way, I will return only 2 seconds' worth of data to the next function, which does the actual processing (a sketch of this idea follows the existing code below).
Leave the deduplicator function as it is and add another function at the end which compares the processed record's timestamp with what is already in DynamoDB, and only writes if the processed record's timestamp is greater than the existing one in DynamoDB.
Is there any other way by which I can handle it better?
Existing code:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, from_json, row_number}
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{IntegerType, StringType, StructType, TimestampType}

def get_deduplicated_dataframe(inputDf: DataFrame, grpKeys: List[String], orderByKey: String): DataFrame = {
  // Keep only the most recent record (by orderByKey) for each combination of the grouping keys.
  val outputDf = inputDf.withColumn("rowNumber", row_number()
    .over(Window.partitionBy(grpKeys.head, grpKeys.tail: _*)
    .orderBy(col(orderByKey).desc)))
    .filter("rowNumber = 1").drop("rowNumber")
  outputDf.cache()
  outputDf
}
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "test1")
.option("startingOffsets", "latest")
.load()
val schema = new StructType()
.add("product",StringType)
.add("org",StringType)
.add("quantity", IntegerType)
.add("booked_at",TimestampType)
val payload_df = df.selectExpr("CAST(value AS STRING) as payload")
.select(from_json(col("payload"), schema).as("data"))
.select("data.*")
val query = payload_df.writeStream
.trigger(Trigger.ProcessingTime(2000L))
.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
val groupKeys = List("product", "org")
val orderByKey = "booked_at"
val outputDf = get_deduplicated_dataframe(batchDF, groupKeys, orderByKey)
outputDf.show()
}.start()
query.awaitTermination()
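To make option 2 above a bit more concrete, here is a minimal, untested sketch that builds on the code above. It keeps a rolling 60-second buffer of deduplicated rows on the driver and deduplicates each micro-batch against it inside foreachBatch; names such as bufferDf and the processing-time based 60-second cutoff are illustrative assumptions, not part of the original code.
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.expr
import org.apache.spark.sql.streaming.Trigger

// Rolling buffer holding roughly the last 60 seconds of deduplicated records.
// foreachBatch runs on the driver, so keeping this as a driver-side var is fine.
var bufferDf: DataFrame = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

val bufferedQuery = payload_df.writeStream
  .trigger(Trigger.ProcessingTime(2000L))
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Union the new 2-second micro-batch with the buffered history and
    // drop anything older than 60 seconds before deduplicating as before.
    val combined = bufferDf.union(batchDF)
      .filter(expr("booked_at >= current_timestamp() - INTERVAL 60 SECONDS"))
    val deduped = get_deduplicated_dataframe(combined, List("product", "org"), "booked_at")

    deduped.count()     // materialize the cached result before swapping the buffer
    bufferDf.unpersist()
    bufferDf = deduped  // carry the deduplicated 60-second window into the next trigger

    // Hand `deduped` to the downstream processing / DynamoDB writer here.
    deduped.show()
  }.start()
Whether downstream processing should receive all winners from the 60-second window or only rows that changed in the current trigger is left open here; it depends on how the DynamoDB write is done.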
I'm using Spark to read from a Postgres table and dump it to Google Cloud Storage as JSON. The table is quite big, many hundreds of GBs. The code is relatively straightforward (please see below), but it fails with OOM. It seems like Spark is trying to read the entire table into memory before starting to write it. Is this true? How can I change the behavior so that it reads and writes in a streaming fashion?
Thanks.
SparkSession sparkSession = SparkSession
.builder()
.appName("01-Getting-Started")
.getOrCreate();
Dataset<Row> dataset = sparkSession.read().jdbc("jdbc:postgresql://<ip>:<port>/<db>", "<table>", properties);
dataset.write().mode(SaveMode.Append).json("gs://some/path");
There are a couple of overloaded DataFrameReader.jdbc() methods that are useful for splitting up JDBC data on input.
jdbc(String url, String table, String[] predicates, java.util.Properties connectionProperties) - the resulting DataFrame will have one partition for each predicate given, e.g.
String[] preds = {"state='Alabama'", "state='Alaska'", "state='Arkansas'", ...};
Dataset<Row> dataset = sparkSession.read().jdbc("jdbc:postgresql://<ip>:<port>/<db>", "<table>", preds, properties);
jdbc(String url, String table, String columnName, long lowerBound, long upperBound, int numPartitions, java.util.Properties connectionProperties) - Spark will split the read on the numeric column columnName into numPartitions partitions, using lowerBound and upperBound to compute the partition stride (rows outside that range are still read, into the first and last partitions), e.g.:
Dataset<Row> dataset = sparkSession.read().jdbc("jdbc:postgresql://<ip>:<port>/<db>", "<table>", "<idColumn>", 1, 1000, 100, properties);
(I am new to Spark, Impala and Kudu.) I am trying to copy a table from an Oracle DB to an Impala table with the same structure, in Spark, via Kudu. I am getting an error when the code tries to map an Oracle NUMBER to a Kudu data type. How can I change the data type of a Spark DataFrame to make it compatible with Kudu?
This is intended to be a 1-to-1 copy of data from Oracle to Impala. I have extracted the Oracle schema of the source table and created a target Impala table with the same structure (same column names and a reasonable mapping of data types). I was hoping that Spark+Kudu would map everything automatically and just copy the data. Instead, Kudu complains that it cannot map DecimalType(38,0).
I would like to specify that "column #1, with name SOME_COL, which is a NUMBER in Oracle, should be mapped to a LongType, which is supported in Kudu".
How can I do that?
// This works
val df: DataFrame = spark.read
.option("fetchsize", 10000)
.option("driver", "oracle.jdbc.driver.OracleDriver")
.jdbc("jdbc:oracle:thin:#(DESCRIPTION=...)", "SCHEMA.TABLE_NAME", partitions, props)
// This does not work
kuduContext.insertRows(df.toDF(colNamesLower: _*), "impala::schema.table_name")
// Error: No support for Spark SQL type DecimalType(38,0)
// See https://github.com/cloudera/kudu/blob/master/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/SparkUtil.scala
// So let's see the Spark data types
df.dtypes.foreach{case (colName, colType) => println(s"$colName: $colType")}
// Spark data type: SOME_COL DecimalType(38,0)
// Oracle data type: SOME_COL NUMBER -- no precision specifier; values are int/long
// Kudu data type: SOME_COL BIGINT
Apparently, we can specify a custom schema when reading from a JDBC data source.
connectionProperties.put("customSchema", "id DECIMAL(38, 0), name STRING")
val jdbcDF3 = spark.read
.jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)
That worked. I was able to specify a customSchema like so:
col1 Long, col2 Timestamp, col3 Double, col4 String
and with that, the code works:
import spark.implicits._
val df: Dataset[case_class_for_table] = spark.read
  .option("fetchsize", 10000)
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .option("customSchema", "col1 Long, col2 Timestamp, col3 Double, col4 String")  // the customSchema described above
  .jdbc("jdbc:oracle:thin:@(DESCRIPTION=...)", "SCHEMA.TABLE_NAME", partitions, props)
  .as[case_class_for_table]
kuduContext.insertRows(df.toDF(colNamesLower: _*), "impala::schema.table_name")
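As an aside (not part of the original answer), an alternative that avoids the customSchema option is to cast the offending column right after the plain JDBC read from the "// This works" block in the question, before handing the DataFrame to Kudu. A hedged sketch, assuming SOME_COL is the only problematic column:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.LongType

// Cast the Oracle NUMBER column (read as DecimalType(38,0)) to a Kudu-supported LongType.
val dfForKudu = df.withColumn("SOME_COL", col("SOME_COL").cast(LongType))

kuduContext.insertRows(dfForKudu.toDF(colNamesLower: _*), "impala::schema.table_name")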
I want to retrieve the last 24 hours of data from my DataFrame.
val data = spark.read.parquet(path_to_parquet_file)
data.createOrReplaceTempView("table")
var df = spark.sql("SELECT datetime, product_PK FROM table WHERE datetime BETWEEN (datetime - 24*3600000) AND datetime")
However, I do not know how to convert datetime to milliseconds using Spark SQL (Spark 2.2.0 and Scala 2.11).
I can do it using the DataFrame API, but I don't know how to put everything together:
import org.apache.spark.sql.functions.unix_timestamp
df = df.withColumn("unix_timestamp",unix_timestamp(col("datetime"))).drop("datetime")