Spark Structured Streaming: StructField(..., ..., False) always returns `nullable=true` instead of `nullable=false` - scala

I'm using Spark Structured Streaming (3.2.1) with Kafka.
I'm trying to simply read JSON from Kafka using a defined schema.
My problem is that the schema I define contains a non-nullable field that is ignored when I read messages from Kafka. I use the from_json function, which seems to ignore that some fields can't be null.
Here is my code example:
val schemaTest = new StructType()
  .add("firstName", StringType)
  .add("lastName", StringType)
  .add("birthDate", LongType, nullable = false)
val loader = spark
  .readStream
  .format("kafka")
  .option("startingOffsets", "earliest")
  .option("kafka.bootstrap.servers", "BROKER:PORT")
  .option("subscribe", "TOPIC")
  .load()
val df = loader
  .selectExpr("CAST(value AS STRING)")
  .withColumn("value", from_json(col("value"), schemaTest))
  .select(col("value.*"))
df.printSchema()
val q = df.writeStream
  .format("console")
  .option("truncate", "false")
  .start()
  .awaitTermination()
This is what I get when printing the schema of df, which differs from my schemaTest:
root
|-- firstName: string (nullable = true)
|-- lastName: string (nullable = true)
|-- birthDate: long (nullable = true)
And received data are like that:
+---------+--------+----------+
|firstName|lastName|birthDate |
+---------+--------+----------+
|Toto     |Titi    |1643799912|
|Tutu     |Tata    |null      |
+---------+--------+----------+
We also tried changing the mode option of from_json from the default PERMISSIVE to the others (DROPMALFORMED, FAILFAST), but the second record, which doesn't respect the defined schema, is simply not considered corrupted because the field birthDate is nullable.
Maybe I missed something, but if not, I have the following questions.
Do you know why the printSchema of df does not match my schemaTest (with the non-nullable field)?
Also, how can I manage non-nullable values in my case? I know that I can filter, but I would like to know if there is an alternative using the schema, the way it is supposed to work. Filtering is also not that simple when the schema has lots of non-nullable fields.

This is actually the intended behavior of the from_json function. You can read the following in the source code:
// The JSON input data might be missing certain fields. We force the nullability
// of the user-provided schema to avoid data corruptions. In particular, the parquet-mr encoder
// can generate incorrect files if values are missing in columns declared as non-nullable.
val nullableSchema = schema.asNullable
override def nullable: Boolean = true
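As an illustration, here is a minimal sketch (assuming a SparkSession named spark and the schemaTest defined in the question) showing that even a batch query reports the field as nullable after from_json:
import org.apache.spark.sql.functions.{from_json, lit}
val parsed = spark.range(1)
  .select(from_json(lit("""{"firstName":"Toto","lastName":"Titi"}"""), schemaTest).as("value"))
parsed.printSchema()
// root
//  |-- value: struct (nullable = true)
//  |    |-- firstName: string (nullable = true)
//  |    |-- lastName: string (nullable = true)
//  |    |-- birthDate: long (nullable = true)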
If you have multiple fields which are mandatory then you can construct the filter expression from your schemaTest (or list of columns) and use it like this:
val filterExpr = schemaTest.fields
.filter(!_.nullable)
.map(f => col(f.name).isNotNull)
.reduce(_ and _)
val df = loader
.selectExpr("CAST(value AS STRING)")
.withColumn("value", from_json(col("value"), schemaTest))
.select(col("value.*"))
.filter(filterExpr)
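If you also need to keep the rejected records (for example to send them to a dead-letter topic), a possible variant, not part of the original answer, is to reuse the same expression negated:
val parsed = loader
  .selectExpr("CAST(value AS STRING)")
  .withColumn("value", from_json(col("value"), schemaTest))
  .select(col("value.*"))
val validDf = parsed.filter(filterExpr)
val rejectedDf = parsed.filter(!filterExpr) // route these to a separate sink for inspection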

I would like to propose a different way of doing it:
def isCorrupted(df: DataFrame): DataFrame = {
  val nonNullableColumns = schemaTest
    .filter(e => !e.nullable)
    .map(_.name)
  nonNullableColumns
    .foldLeft(df.withColumn("isCorrupted", lit(0))) { case (accumulator, columnName) =>
      // flag the row if this non-nullable column is null, keeping flags set by previous columns
      accumulator.withColumn("isCorrupted",
        when(col(columnName).isNull, 1).otherwise(col("isCorrupted")))
    }
    .filter(col("isCorrupted") === lit(0))
    .drop("isCorrupted")
}
val df = loader
.selectExpr("CAST(value as STRING)")
.withColumn("value", from_json(col("value"), schemaTest))
.select(col("value.*"))
.transform(isCorrupted)
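Another option, not mentioned in the answers above, is the built-in DataFrameNaFunctions. A short sketch that drops every row containing a null in any of the non-nullable columns of schemaTest:
val requiredColumns = schemaTest.fields.filter(!_.nullable).map(_.name).toSeq
val dfNoNulls = loader
  .selectExpr("CAST(value AS STRING)")
  .withColumn("value", from_json(col("value"), schemaTest))
  .select(col("value.*"))
  .na.drop("any", requiredColumns)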

Related

How to Read data from kafka topic with different schema (has some optional objects) in structured streaming

I have data coming into a Kafka topic which has an optional object, and since it's optional I am missing those records when reading with a defined schema.
Example: the schema I have:
val schema = new StructType()
  .add("obj_1", new ArrayType(
    new StructType(
      Array(
        StructField("field1", StringType),
        StructField("field2", StringType),
        StructField("obj_2", new ArrayType(
          new StructType(
            Array(
              StructField("field3", StringType),
              StructField("field4", LongType),
              StructField("obj_3", new ArrayType(
                new StructType(
                  Array(
                    StructField("field5", StringType),
                    StructField("field6", StringType)
                  )
                ), containsNull = true
              ))
            )
          ), containsNull = true
        )),
        StructField("field7", StringType),
        StructField("field8", StringType)
      )
    ), containsNull = true
  ))
When publishing data to this topic, we sometimes do not send obj_3, based on some conditions.
So when reading the topic and mapping it to the above schema, we miss the records that do not contain obj_3 and only get the data where obj_3 is present.
How can we read the data that sometimes does not have obj_3?
Sample code:
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "bootstrap.servers")
  .option("subscribe", "topic.name")
  .option("startingOffsets", "offset.reset")
  .option("failOnDataLoss", "true")
  .load()
val cast = df.selectExpr("CAST(value AS STRING)")
  .as[String]
val resultedDf = cast.select(from_json(col("value"), schema).as("newDF"))
val finalDf = resultedDf.select(col("newDF.*"))
You could either
use a flag (e.g. called "obj3flag" within a JSON structure) in the key of the Kafka message that tells your Structured Streaming job whether obj_3 exists in the Kafka value, and then choose one schema or the other to parse the JSON string. Something like:
import org.apache.spark.sql.functions._
val schemaWithObj3 = ...
val schemaWithOutObj3 = ...
val cast = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
val resultedDf = cast
  .withColumn("obj_3_flag", get_json_object(col("key"), "$.obj3flag"))
  .withColumn("data",
    when(col("obj_3_flag") === lit(1), from_json(col("value"), schemaWithObj3))
      .otherwise(from_json(col("value"), schemaWithOutObj3)))
  .select(col("data.*"))
do a string search on "obj_3" in the Kafka value (cast as string) and, if the string is found, apply one or the other schema to parse the JSON. The code will look very similar to the one for the other option; a rough sketch follows below.
Please note that I have written the code on my mobile, so you may find some syntax issues. However, I hope the idea gets across.
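As a rough sketch of the second option (this is not from the original answer; it assumes the df, schemaWithObj3, and schemaWithOutObj3 values from above), one way is to split the stream on the search condition and parse each branch with its own schema:
import org.apache.spark.sql.functions._
val raw = df.selectExpr("CAST(value AS STRING) AS value")
val withObj3 = raw
  .filter(col("value").contains("obj_3"))
  .select(from_json(col("value"), schemaWithObj3).as("data"))
  .select(col("data.*"))
val withoutObj3 = raw
  .filter(!col("value").contains("obj_3"))
  .select(from_json(col("value"), schemaWithOutObj3).as("data"))
  .select(col("data.*"))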

Spark structured streaming - how to queue bytes value to Kafka?

I'm writing a Spark application that uses Structured Streaming. The app reads messages from a Kafka topic topic1, constructs a new message, serializes it to an Array[Byte], and publishes it to another Kafka topic topic2.
The serializing to a byte array is important because I use a specific serializer/deserializer that the downstream consumer of topic2 also uses.
I'm having trouble producing to Kafka, though. I'm not even sure how to do so; the examples online are mostly about queueing JSON data.
The code -
case class OutputMessage(id: String, bytes: Array[Byte])
implicit val encoder: Encoder[OutputMessage] = org.apache.spark.sql.Encoders.kryo
val outputMessagesDataSet: Dataset[OutputMessage] = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "server1")
  .option("subscribe", "topic1")
  .load()
  .select($"value")
  .mapPartitions { r =>
    val messages: Iterator[OutputMessage] = createMessages(r)
    messages
  }
outputMessagesDataSet
  .selectExpr("CAST(id AS String) AS key", "bytes AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "server1")
  .option("topic", "topic2")
  .option("checkpointLocation", loc)
  .trigger(trigger)
  .start
  .awaitTermination
However, that throws exception org.apache.spark.sql.AnalysisException: cannot resolve 'id' given input columns: [value]; line 1 pos 5;
How do I queue to Kafka with id as the key and bytes as the value?
You can check the schema of the dataframe that "collects" the message. As you are collecting only the "value" field, incoming events arrive in the following form:
+-------------------+
| value |
+-------------------+
| field1,field2,.. |
+-------------------+
You need to query for the key as well, as in the Spark documentation:
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
or
df.select(col("key").cast(StringType), col("value").cast(StringType))
As @EmiCareOfCell44 suggested, I printed out the schema.
If I do messagesDataSet.printSchema(), I get only a single value column of binary type. But if I do
val df = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "server1")
.option("subscribe", "topic1")
.load()
df.printSchema()
Then it prints
root
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: long (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampType: integer (nullable = true)
But the Dataframe hasn't undergone the transformation that is needed, which is done in
.mapPartitions{r =>
val messages: Iterator[OutputMessage] = createMessages(r)
messages
}
It looks like the Dataset's value has only one binary value.
I searched for some answers here, then found this post - Value Type is binary after Spark Dataset mapGroups operation even return a String in the function
I had an Encoder set up -
implicit val encoder: Encoder[OutputMessage] = org.apache.spark.sql.Encoders.kryo
That was causing the value to be converted into binary. Since OutputMessage is a Scala case class, the explicit Kryo encoder isn't required (the implicit product encoder from spark.implicits covers it), so I removed it. After that, printing the schema showed two fields (a String and the bytes, which is what I wanted). After that, the line .selectExpr("CAST(id AS String) AS key", "bytes AS value") worked perfectly well.
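For reference, a minimal sketch of that check (assuming the question's code with the Kryo encoder line removed, so the implicit product encoder from spark.implicits._ is picked up):
outputMessagesDataSet.printSchema()
// root
//  |-- id: string (nullable = true)
//  |-- bytes: binary (nullable = true)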

Cleaning CSV/Dataframe of size ~40GB using Spark and Scala

I am kind of a newbie to the big data world. I have an initial CSV with a data size of ~40GB, but in some kind of shifted order. I mean, if you look at the initial CSV, Jenny has no age, so the sex column value is shifted into age, and the remaining column values keep shifting until the last element in the row.
I want to clean/process this CSV using a DataFrame with Spark in Scala. I tried quite a few solutions with the withColumn() API, but nothing worked for me.
Can anyone suggest some sort of logic or an available API to solve this in a cleaner way? I might not need a full solution; pointers will also do. Help much appreciated!!
Initial CSV/Dataframe
Required CSV/Dataframe
EDIT:
This is how I'm reading the data:
val spark = SparkSession.builder
  .appName("SparkSQL")
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "file:///C:/temp")
  .getOrCreate()
import spark.implicits._
val df = spark.read.option("header", "true").csv("path/to/csv.csv")
This pretty much looks like the data is flawed. To handle this, I would suggest reading each line of the CSV file as a single string and then applying a map() function to handle the data:
case class myClass(name: String, age: Option[Int], sex: String, siblings: Int)
val myNewDf = myDf.map { row =>
  val myRow: String = row.getAs[String]("MY_SINGLE_COLUMN")
  val myRowValues = myRow.split(",")
  if (myRowValues.length == 4) {
    // everything as expected
    myClass(myRowValues(0), Some(myRowValues(1).toInt), myRowValues(2), myRowValues(3).toInt)
  } else {
    // do foo to guess the missing values, e.g. assume age is the missing field
    myClass(myRowValues(0), None, myRowValues(1), myRowValues(2).toInt)
  }
}
In your case the data is not properly formatted. To handle this, the data first has to be cleansed, i.e. all rows of the CSV should have the same schema, or the same number of delimiters/columns.
A basic approach to do this in Spark could be:
Load data as Text
Apply map operation on loaded DF/DS to clean it
Create Schema manually
Apply Schema on the cleansed DF/DS
Sample Code
//Sample CSV
John,28,M,3
Jenny,M,3
//Sample Code
val schema = StructType(
  List(
    StructField("name", StringType, nullable = true),
    StructField("age", IntegerType, nullable = true),
    StructField("sex", StringType, nullable = true),
    StructField("sib", IntegerType, nullable = true)
  )
)
import spark.implicits._
val rawdf = spark.read.text("test.csv")
rawdf.show(10)
val rdd = rawdf.map(row => {
  val raw = row.getAs[String]("value")
  // TODO: data cleansing has to be done here.
  val values = raw.split(",")
  if (values.length != 4) {
    s"${values(0)},,${values(1)},${values(2)}"
  } else {
    raw
  }
})
val df = spark.read.schema(schema).csv(rdd)
df.show(10)
You can try to define a case class with an Option field for age and load your CSV with the schema directly into a Dataset.
Something like this:
import org.apache.spark.sql.{Encoders}
import sparkSession.implicits._
case class Person(name: String, age: Option[Int], sex: String, siblings: Int)
val schema = Encoders.product[Person].schema
val dfInput = sparkSession.read
.format("csv")
.schema(schema)
.option("header", "true")
.load("path/to/csv.csv")
.as[Person]

How to read redis map in spark using spark-redis

I have a normal Scala map in Redis (key and value). Now I want to read that map in one of my Spark Streaming programs and use it as a broadcast variable so that my executors can use that map to resolve key mappings. I am using the spark-redis 2.3.1 library, but I'm not sure how to read it.
Map in the Redis table "employee":
name | value
-----+------
123  | David
124  | John
125  | Alex
This is how I am trying to read it in Spark (not sure if this is correct, please correct me):
val loadedDf = spark.read
  .format("org.apache.spark.sql.redis")
  .schema(StructType(Array(
    StructField("name", IntegerType),
    StructField("value", StringType)
  )))
  .option("table", "employee")
  .option("key.column", "name")
  .load()
loadedDf.show()
The above code does not show anything, I get empty output.
You could use the code below for your task, but you need to utilize a Spark Dataset (i.e. map the DataFrame to a case class) to do it. Below is a full example of how to write to and read from Redis.
import org.apache.spark.sql.{SaveMode, SparkSession}

object DataFrameExample {
  case class employee(name: String, value: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("redis-df")
      .master("local[*]")
      .config("spark.redis.host", "localhost")
      .config("spark.redis.port", "6379")
      .getOrCreate()

    val personSeq = Seq(employee("John", 30), employee("Peter", 45))
    val df = spark.createDataFrame(personSeq)

    df.write
      .format("org.apache.spark.sql.redis")
      .option("table", "person")
      .mode(SaveMode.Overwrite)
      .save()

    val loadedDf = spark.read
      .format("org.apache.spark.sql.redis")
      .option("table", "person")
      .load()
    loadedDf.printSchema()
    loadedDf.show()
  }
}
Output is below
root
|-- name: string (nullable = true)
|-- value: integer (nullable = false)
+-----+-----+
| name|value|
+-----+-----+
| John| 30 |
|Peter| 45 |
+-----+-----+
You can also find more details in the spark-redis documentation.
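To use the result as a broadcast variable, as asked in the question, a short sketch (assuming the read above returns the expected rows) is to collect the table into a plain Scala Map on the driver and broadcast it:
val lookup: Map[String, Int] = loadedDf
  .collect()
  .map(row => row.getAs[String]("name") -> row.getAs[Int]("value"))
  .toMap
val lookupBroadcast = spark.sparkContext.broadcast(lookup)
// executors can then resolve keys with lookupBroadcast.value.get(name)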

Unable to find encoder for type stored in a Dataset. in spark structured streaming

I am trying the Spark Structured Streaming example given on the Spark website, but it is throwing errors:
1. Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
2. not enough arguments for method as: (implicit evidence$2: org.apache.spark.sql.Encoder[data])org.apache.spark.sql.Dataset[data].
Unspecified value parameter evidence$2.
val ds: Dataset[data] = df.as[data]
Here is my code
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types._
import org.apache.spark.sql.Encoders

object final_stream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("kafka-consumer")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._
    spark.sparkContext.setLogLevel("WARN")

    case class data(name: String, id: String)

    val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "172.21.0.187:9093")
      .option("subscribe", "test")
      .load()

    println(df.isStreaming)

    val ds: Dataset[data] = df.as[data]
    val value = ds.select("name").where("id > 10")

    value.writeStream
      .outputMode("append")
      .format("console")
      .start()
      .awaitTermination()
  }
}
Any help on how to make this work? I want the final output to look like this:
+-----+---+
| name| id|
+-----+---+
|Jacek|  1|
+-----+---+
The reason for the error is that you are dealing with Array[Byte] values coming from Kafka and there are no fields to match the data case class.
scala> println(schema.treeString)
root
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: long (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampType: integer (nullable = true)
Change the line df.as[data] to the following:
df.
select($"value" cast "string").
map(value => ...parse the value to get name and id here...).
as[data]
I strongly recommend using select and functions object to deal with the incoming data.
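A hypothetical concrete version of that snippet, assuming the Kafka value is a JSON string such as {"name":"Jacek","id":"1"} (the actual format is not given in the question), could look like this; note that the case class usually needs to be defined at the top level (outside main) so that the implicit Encoder can be derived:
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StringType, StructType}
import spark.implicits._

val dataSchema = new StructType()
  .add("name", StringType)
  .add("id", StringType)

val ds: Dataset[data] = df
  .select(from_json($"value".cast("string"), dataSchema).as("value"))
  .select($"value.*")
  .as[data]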
The error is due to a mismatch between the number of columns in the dataframe and in your case class.
You have [topic, timestamp, value, key, offset, timestampType, partition] columns in the dataframe,
whereas your case class has only two columns:
case class data(name: String, id: String)
You can display the content of the dataframe with:
val display = df.writeStream.format("console").start()
Sleep for a few seconds and then:
display.stop()
And also use option("startingOffsets", "earliest") as mentioned here
Then create a case class as per your data.
Hope this helps!