Spark structured streaming - how to queue bytes value to Kafka? - scala

I'm writing a Spark application that uses Structured Streaming. The app reads messages from a Kafka topic topic1, constructs a new message, serializes it to an Array[Byte], and publishes it to another Kafka topic topic2.
Serializing to a byte array is important because I use a specific serializer/deserializer that the downstream consumer of topic2 also uses.
I'm having trouble producing to Kafka, though. I'm not even sure how to do so; the examples online only cover queueing JSON data.
The code -
case class OutputMessage(id: String, bytes: Array[Byte])

implicit val encoder: Encoder[OutputMessage] = org.apache.spark.sql.Encoders.kryo

val outputMessagesDataSet: Dataset[OutputMessage] = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "server1")
  .option("subscribe", "topic1")
  .load()
  .select($"value")
  .mapPartitions { r =>
    val messages: Iterator[OutputMessage] = createMessages(r)
    messages
  }
outputMessagesDataSet
  .selectExpr("CAST(id AS String) AS key", "bytes AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "server1")
  .option("topic", "topic2")
  .option("checkpointLocation", loc)
  .trigger(trigger)
  .start()
  .awaitTermination()
However, that throws the exception org.apache.spark.sql.AnalysisException: cannot resolve 'id' given input columns: [value]; line 1 pos 5;
How do I queue to Kafka with id as the key and bytes as the value?

You can check the schema of the dataframe that "collects" the message. As you are collecting only the "value" field, incoming events arrive in the following form:
+-------------------+
| value |
+-------------------+
| field1,field2,.. |
+-------------------+
You need to query for the key as well, as in the Spark documentation:
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
or
df.select(col("key").cast(StringType), col("value").cast(StringType))

As @EmiCareOfCell44 suggested, I printed out the schema.
If I do messagesDataSet.printSchema(), I get only a single value field of binary type. But if I do
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "server1")
  .option("subscribe", "topic1")
  .load()

df.printSchema()
Then it prints
root
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: long (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampType: integer (nullable = true)
But the Dataframe hasn't undergone the transformation that is needed, which is done in
.mapPartitions { r =>
  val messages: Iterator[OutputMessage] = createMessages(r)
  messages
}
It looks like the Dataset has only a single binary value column.
I searched for some answers here, then found this post - Value Type is binary after Spark Dataset mapGroups operation even return a String in the function
I had an Encoder set up -
implicit val encoder: Encoder[OutputMessage] = org.apache.spark.sql.Encoders.kryo
That was causing the whole value to be serialized into a single binary column. Since OutputMessage is a case class, an explicit Kryo Encoder isn't required, so I removed it. After that, printing the schema showed two fields (the String id and the bytes, which is what I wanted), and the line .selectExpr("CAST(id AS String) AS key", "bytes AS value") worked perfectly well.
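A quick way to see the difference (a hedged sketch, not from the original post) is to compare the schemas of the two encoders directly:
import org.apache.spark.sql.{Encoder, Encoders}

// Kryo serializes the whole object into a single binary column, which is why
// selectExpr could not resolve id; the derived product encoder keeps the
// case class fields as separate columns.
val kryoEncoder: Encoder[OutputMessage] = Encoders.kryo[OutputMessage]
println(kryoEncoder.schema.treeString)                     // value: binary
println(Encoders.product[OutputMessage].schema.treeString) // id: string, bytes: binary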

Related

Spark Structured Streaming: StructField(..., ..., False) always returns `nullable=true` instead of `nullable=false`

I'm using Spark Structured Streaming (3.2.1) with Kafka.
I'm trying to simply read JSON from Kafka using a defined schema.
My problem is that the defined schema has a non-nullable field that is ignored when I read messages from Kafka. I use the from_json function, which seems to ignore that some fields can't be null.
Here is my code example:
val schemaTest = new StructType()
  .add("firstName", StringType)
  .add("lastName", StringType)
  .add("birthDate", LongType, nullable = false)
val loader = spark
  .readStream
  .format("kafka")
  .option("startingOffsets", "earliest")
  .option("kafka.bootstrap.servers", "BROKER:PORT")
  .option("subscribe", "TOPIC")
  .load()

val df = loader
  .selectExpr("CAST(value AS STRING)")
  .withColumn("value", from_json(col("value"), schemaTest))
  .select(col("value.*"))

df.printSchema()

val q = df.writeStream
  .format("console")
  .option("truncate", "false")
  .start()
  .awaitTermination()
This is what I get when printing the schema of df, which differs from my schemaTest:
root
|-- firstName: string (nullable = true)
|-- lastName: string (nullable = true)
|-- birthDate: long (nullable = true)
And the received data looks like this:
+---------+--------+----------+
|firstName|lastName|birthDate |
+---------+--------+----------+
|Toto     |Titi    |1643799912|
|Tutu     |Tata    |null      |
+---------+--------+----------+
We also tried changing the mode option of from_json from the default PERMISSIVE to the others (DROPMALFORMED, FAILFAST), but the second record, which doesn't respect the defined schema, is simply not treated as corrupted, because the field birthDate ends up nullable.
Maybe I missed something, but if that's not the case, I have the following questions.
Do you know why the printSchema of df doesn't match my schemaTest (with the non-nullable field)?
Also, how can I handle non-nullable values in my case? I know I can filter, but I would like to know if there is an alternative using the schema, the way it's supposed to work. It's also not that simple to filter when the schema has lots of non-nullable fields.
This is actually the intended behavior of the from_json function. You can read the following in the source code:
// The JSON input data might be missing certain fields. We force the nullability
// of the user-provided schema to avoid data corruptions. In particular, the parquet-mr encoder
// can generate incorrect files if values are missing in columns declared as non-nullable.
val nullableSchema = schema.asNullable
override def nullable: Boolean = true
If you have multiple mandatory fields, you can construct the filter expression from your schemaTest (or a list of columns) and use it like this:
val filterExpr = schemaTest.fields
  .filter(!_.nullable)
  .map(f => col(f.name).isNotNull)
  .reduce(_ and _)

val df = loader
  .selectExpr("CAST(value AS STRING)")
  .withColumn("value", from_json(col("value"), schemaTest))
  .select(col("value.*"))
  .filter(filterExpr)
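As a hedged illustration of what the reduce produces: if both firstName and birthDate were declared non-nullable, filterExpr would be equivalent to the hand-written conjunction below.
// Equivalent hand-written predicate for a schema where firstName and
// birthDate are both non-nullable (an assumed variant of schemaTest):
val equivalentFilter = col("firstName").isNotNull and col("birthDate").isNotNull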
I would like to propose a different way of doing it:
def isCorrupted(df: DataFrame): DataFrame = {
  val filterNullable = schemaTest
    .filter(e => !e.nullable)
    .map(_.name)
  filterNullable
    .foldLeft(df.withColumn("isCorrupted", lit(0))) { case (accumulator, columnName) =>
      // keep the flag raised once any non-nullable column turns out to be null
      accumulator.withColumn(
        "isCorrupted",
        when(col(columnName).isNull, 1).otherwise(col("isCorrupted")))
    }
    .filter(col("isCorrupted") === lit(0))
    .drop(col("isCorrupted"))
}
val df = loader
  .selectExpr("CAST(value AS STRING)")
  .withColumn("value", from_json(col("value"), schemaTest))
  .select(col("value.*"))
  .transform(isCorrupted)

Saving empty dataframe to parquet results in error - Spark 2.4.4

I have a piece of code where, at the end, I write a dataframe to a parquet file.
The logic is such that the dataframe can sometimes be empty, and hence I get the error below.
df.write.format("parquet").mode("overwrite").save(somePath)
org.apache.spark.sql.AnalysisException: Parquet data source does not support null data type.;
When I print the schema of df, I get the following.
df.schema
res2: org.apache.spark.sql.types.StructType =
StructType(
StructField(rpt_date_id,IntegerType,true),
StructField(rpt_hour_no,ShortType,true),
StructField(kpi_id,IntegerType,false),
StructField(kpi_scnr_cd,StringType,false),
StructField(channel_x_id,IntegerType,false),
StructField(brand_id,ShortType,true),
StructField(kpi_value,FloatType,false),
StructField(src_lst_updt_dt,NullType,true),
StructField(etl_insrt_dt,DateType,false),
StructField(etl_updt_dt,DateType,false)
)
Is there a workaround to just write the empty file with schema, or not write the file at all when empty?
Thanks
The error you are getting is not related to the fact that your dataframe is empty. I don't see the point of saving an empty dataframe, but you can do it if you want. Try this if you don't believe me:
val schema = StructType(
  Array(
    StructField("col1", StringType, true),
    StructField("col2", StringType, false)
  )
)

spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
  .write
  .format("parquet")
  .save("/tmp/test_empty_df")
You are getting that error because one of your columns is of NullType, and as the thrown exception indicates, "Parquet data source does not support null data type".
I can't say for sure why you have a column with NullType, but that usually happens when you read your data from a source and let Spark infer the schema. If that source has an empty column, Spark won't be able to infer its type and will set it to NullType.
If this is what's happening, my advice is to specify the schema on read.
If this is not the case, a possible solution is to cast all columns of NullType to a parquet-compatible type (like StringType). Here is an example of how to do it:
// df is a dataframe with a column of NullType
val df = Seq(("abc", null)).toDF("col1", "col2")
df.printSchema

root
|-- col1: string (nullable = true)
|-- col2: null (nullable = true)

// fold left to cast all NullType columns to StringType
val df1 = df.columns.foldLeft(df) { (acc, cur) =>
  if (df.schema(cur).dataType == NullType)
    acc.withColumn(cur, col(cur).cast(StringType))
  else
    acc
}
df1.printSchema
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
Hope this helps
'Or not write the file at all when empty?' Check whether df is non-empty and only then write it:
if (!df.isEmpty)
  df.write.format("parquet").mode("overwrite").save("somePath")

How to read redis map in spark using spark-redis

I have a normal Scala map in Redis (key and value). Now I want to read that map in one of my spark-streaming programs and use it as a broadcast variable so that my slaves can use the map to resolve key mappings. I am using the spark-redis 2.3.1 library, but I'm not sure how to read it.
Map in redis table "employee" -
name | value
-----|------
123  | David
124  | John
125  | Alex
This is how I am trying to read it in Spark (not sure if this is correct; please correct me):
val loadedDf = spark.read
  .format("org.apache.spark.sql.redis")
  .schema(
    StructType(Array(
      StructField("name", IntegerType),
      StructField("value", StringType)
    ))
  )
  .option("table", "employee")
  .option("key.column", "name")
  .load()

loadedDf.show()
The above code does not show anything; I get empty output.
You could use the code below for your task, but you need to use a Spark Dataset (map the DataFrame to a case class). Below is a full example of how to write to and read from Redis.
object DataFrameExample {
  case class employee(name: String, value: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("redis-df")
      .master("local[*]")
      .config("spark.redis.host", "localhost")
      .config("spark.redis.port", "6379")
      .getOrCreate()

    val personSeq = Seq(employee("John", 30), employee("Peter", 45))
    val df = spark.createDataFrame(personSeq)

    df.write
      .format("org.apache.spark.sql.redis")
      .option("table", "person")
      .mode(SaveMode.Overwrite)
      .save()

    val loadedDf = spark.read
      .format("org.apache.spark.sql.redis")
      .option("table", "person")
      .load()

    loadedDf.printSchema()
    loadedDf.show()
  }
}
Output is below
root
|-- name: string (nullable = true)
|-- value: integer (nullable = false)
+-----+-----+
| name|value|
+-----+-----+
| John|   30|
|Peter|   45|
+-----+-----+
You can also find more details in the Redis documentation.
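Since the question asks about using the map as a broadcast variable, here is a hedged sketch (column names taken from the person example above) of collecting the loaded DataFrame on the driver and broadcasting it:
// Collect the small lookup table to the driver and broadcast it so that
// executors can resolve keys locally; only do this for maps that fit in memory.
val lookup: Map[String, Int] = loadedDf
  .collect()
  .map(row => row.getAs[String]("name") -> row.getAs[Int]("value"))
  .toMap

val broadcastLookup = spark.sparkContext.broadcast(lookup)
// broadcastLookup.value.get("John") then returns Some(30)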

Spark : Kafka consumer getting data as base64 encoded strings even though Producer does not explicitly encode

I am trying out a simple example to publish data to Kafka and consume it using Spark.
Here is the Producer code:
var kafka_input = spark.sql("""
  SELECT CAST(Id AS STRING) AS key,
         to_json(
           named_struct(
             'Id', Id,
             'Title', Title
           )
         ) AS value
  FROM offer_data""")

kafka_input.write
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBrokers)
  .option("topic", topicName)
  .save()
I verified that kafka_input has a JSON string for value and a number cast as a string for key.
Here is the consumer code:
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaBrokers)
  .option("subscribe", topicName)
  .load()

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

df.take(50)
display(df)
The data I receive on the consumer side is a base64-encoded string.
How do I decode the value in Scala?
Also, this read statement is not flushing these records from the Kafka queue. I am assuming this is because I am not sending an ack signal back to Kafka. Is that correct? If so, how do I do that?
Try this:
df.foreach(row => {
  val key = row.getAs[Array[Byte]]("key")
  val value = row.getAs[Array[Byte]]("value")
  println(scala.io.Source.fromBytes(key, "UTF-8").mkString)
  println(scala.io.Source.fromBytes(value, "UTF-8").mkString)
})
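A slightly more direct alternative (a sketch, not part of the original answer) is to decode the raw bytes with new String:
import java.nio.charset.StandardCharsets

df.foreach(row => {
  // Kafka delivers key and value as raw byte arrays; decode them as UTF-8 text
  val key = new String(row.getAs[Array[Byte]]("key"), StandardCharsets.UTF_8)
  val value = new String(row.getAs[Array[Byte]]("value"), StandardCharsets.UTF_8)
  println(s"$key -> $value")
})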
The problem was with my usage of selectExpr. It does not do an in-place transformation; like all DataFrame transformations, it returns the transformed data as a new DataFrame.
Fix:
val df1 = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
display(df1)

Unable to find encoder for type stored in a Dataset. in spark structured streaming

I am trying the Spark Structured Streaming example given on the Spark website, but it throws the following errors:
1. Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
2. not enough arguments for method as: (implicit evidence$2: org.apache.spark.sql.Encoder[data])org.apache.spark.sql.Dataset[data].
Unspecified value parameter evidence$2.
val ds: Dataset[data] = df.as[data]
Here is my code
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types._
import org.apache.spark.sql.Encoders

object final_stream {
  def main(args: Array[String]): Unit = {

    val spark = SparkSession
      .builder()
      .appName("kafka-consumer")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._
    spark.sparkContext.setLogLevel("WARN")

    case class data(name: String, id: String)

    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "172.21.0.187:9093")
      .option("subscribe", "test")
      .load()

    println(df.isStreaming)

    val ds: Dataset[data] = df.as[data]
    val value = ds.select("name").where("id > 10")

    value.writeStream
      .outputMode("append")
      .format("console")
      .start()
      .awaitTermination()
  }
}
Any help on how to make this work? I want the final output to look like this:
+-----+---+
| name| id|
+-----+---+
|Jacek|  1|
+-----+---+
The reason for the error is that you are dealing with Array[Byte] values coming from Kafka, and there are no fields that match the data case class.
scala> println(schema.treeString)
root
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: long (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampType: integer (nullable = true)
Change the line df.as[data] to the following:
df
  .select($"value" cast "string")
  .map(value => ...parse the value to get name and id here...)
  .as[data]
I strongly recommend using select and the functions object to deal with the incoming data.
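As a hedged sketch of that parsing step, assuming the Kafka value is a JSON document with name and id fields (the schema below is an assumption, not from the question):
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StringType, StructType}

// Assumed JSON layout of the value, e.g. {"name":"Jacek","id":"1"}
val dataSchema = new StructType()
  .add("name", StringType)
  .add("id", StringType)

// Note: the data case class must be defined at the top level (outside main)
// for Spark to derive its implicit Encoder.
val ds: Dataset[data] = df
  .select(from_json($"value".cast("string"), dataSchema).as("parsed"))
  .select($"parsed.name", $"parsed.id")
  .as[data]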
The error is due to a mismatch between the columns of the dataframe and your case class.
The dataframe has the columns [topic, timestamp, value, key, offset, timestampType, partition],
whereas your case class has only two fields:
case class data(name: String, id: String)
You can display the content of the dataframe with:
val display = df.writeStream.format("console").start()
Sleep for a few seconds, and then:
display.stop()
Also use option("startingOffsets", "earliest"), as mentioned here.
Then create a case class matching your data.
Hope this helps!