Multiplying an Event case class per entry in a nested list of IDs - Scala

I am processing a DataFrame and converting it into a Dataset[Event] using an Event case class. However, there are nested IDs for which I need to multiply the events, based on flattening the nested device:os mapping.
I am able to return the Event case class at the Kafka event level, but I am not sure how to multiply the events.
Incoming Kafka event:
{
  "partition": 1,
  "key": "34768_20220203_MFETP501",
  "offset": 1841543,
  "createTime": 1646041475348,
  "topic": "topic_int",
  "publishTime": 1646041475344,
  "errorCode": 0,
  "userActions": {
    "productId": "3MFETP501",
    "createdDate": "2022-02-26T11:19:35.786Z",
    "events": [
      {
        "GUID": "dbb1-f38b-f7f0-44af-90da-80179412f89c",
        "eventDate": "2022-02-26T11:19:35.786Z",
        "familyId": 2010,
        "productTypeId": 1004678,
        "serialID": "890479804",
        "productName": "MFE Total Protection 2021 Family Pack",
        "features": {
          "mapping": [
            { "deviceId": 999795, "osId": [100] },
            { "deviceId": 987875, "osId": [101] }
          ]
        }
      }
    ]
  }
}
The expected output Event case classes:
Event("3MFETP501","1004678","2010","3MFETP501:890479804","MFE Total Protection 2021 Family Pack","999795_100", Map("targetId"->"999795_100"))
Event("3MFETP501","1004678","2010","3MFETP501:890479804","MFE Total Protection 2021 Family Pack","987875_101", Map("targetId"->"987875_101"))
case class Event(
  productId: String,
  familyId: String,
  productTypeId: String,
  key: String,
  productName: String,
  deviceOS: String,
  var featureMap: mutable.Map[String, String])
val finalDataset: Dataset[Event] = inputDataFrame.flatMap(
  row => {
    val productId = row.getAs[String]("productId")
    val userActions = row.getAs[Row]("userActions")
    val userEvents: mutable.Seq[Row] = userActions.getAs[mutable.WrappedArray[Row]]("events")
    val processedEvents: mutable.Seq[Row] = userEvents.map(
      event => {
        val productTypeId = event.getAs[Int]("productTypeId")
        val familyId = event.getAs[Int]("familyId")
        val features = event.getAs[mutable.WrappedArray[Row]]("features")
        val serialId = event.getAs[String]("serialId")
        val key = productId + ":" + serialId
        val featureMap = mutable.Map[String, String]().withDefaultValue(null)
        val device_os_list = List("999795_100", "987875_101")
        // The feature map is per device_os (example: "targetId" -> "999795_100" for 999795_100)
        if (familyId == 2010) {
          val a: Option[List[String]] = ??? // flatten the deviceId, osId ...
          a.get.map(i => {
            val key: String = methodToCombinedeviceIdAndosId
            val featureMapping: mutable.Map[String, String] = getfeatureMapForInvidualKey
            Event(productId, productTypeId, familyId, key, productName, device_os, feature) // ---> This is returning List[Event]
          })
        } else {
          Event(productId, productTypeId, familyId, key, productName, device_os, feature) // --> This is returning Event. THIS WORKS
        }
      }
    )
  }
)

I did not implement it exactly the same way, but I think the logic is easy to follow and apply to your case.
I created a JSON file named kafka.json and put your event in it:
[{
  "partition": 1,
  "key": "34768_20220203_MFETP501",
  "offset": 1841543,
  "createTime": 1646041475348,
  "topic": "topic_int",
  "publishTime": 1646041475344,
  "errorCode": 0,
  "userActions": {
    "productId": "3MFETP501",
    "createdDate": "2022-02-26T11:19:35.786Z",
    "events": [
      {
        "GUID": "dbb1-f38b-f7f0-44af-90da-80179412f89c",
        "eventDate": "2022-02-26T11:19:35.786Z",
        "familyId": 2010,
        "productTypeId": 1004678,
        "serialID": "890479804",
        "productName": "MFE Total Protection 2021 Family Pack",
        "features": {
          "mapping": [
            { "deviceId": 999795, "osId": [100] },
            { "deviceId": 987875, "osId": [101] }
          ]
        }
      }
    ]
  }
}]
Please find below the first solution, based on flatMap and a for comprehension.
case class Event(
  productId: String,
  familyId: String,
  productTypeId: String,
  key: String,
  productName: String,
  deviceOS: String,
  featureMap: Map[String, String])

import org.apache.spark.sql.{Dataset, Row, SparkSession}
import scala.collection.mutable

val spark = SparkSession
  .builder
  .appName("StructuredStreaming")
  .master("local[*]")
  .getOrCreate()

private val inputDataFrame = spark.read.option("multiline", "true").format("json").load("/absolute_path_to_kafka.json")
import spark.implicits._
val finalDataset: Dataset[Event] = inputDataFrame.flatMap(
  row => {
    val userActions = row.getAs[Row]("userActions")
    val productId = userActions.getAs[String]("productId")
    val userEvents = userActions.getAs[mutable.WrappedArray[Row]]("events")
    for (event <- userEvents;
         familyId = event.getAs[Int]("familyId").toString;
         productTypeId = event.getAs[Int]("productTypeId").toString;
         serialId = event.getAs[String]("serialID");
         productName = event.getAs[String]("productName");
         key = s"$productId:$serialId";
         features = event.getAs[Row]("features");
         mappings = features.getAs[mutable.WrappedArray[Row]]("mapping");
         mappingRow <- mappings;
         deviceId = mappingRow.getAs[Long]("deviceId");
         osIds = mappingRow.getAs[mutable.WrappedArray[Long]]("osId");
         osId <- osIds;
         deviceOs = deviceId + "_" + osId
    ) yield Event(productId, familyId, productTypeId, key, productName, deviceOs, Map("target" -> deviceOs))
  }
)
finalDataset.foreach(e => println(e))
// Event(3MFETP501,2010,1004678,3MFETP501:890479804,MFE Total Protection 2021 Family Pack,999795_100,Map(target -> 999795_100))
// Event(3MFETP501,2010,1004678,3MFETP501:890479804,MFE Total Protection 2021 Family Pack,987875_101,Map(target -> 987875_101))
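For clarity on why this multiplies events: the for comprehension above has three generators (event <-, mappingRow <-, osId <-), so the yield produces one Event per (event, mapping, osId) combination. A tiny standalone illustration of the same mechanism (not your data model, just the cross-product behaviour of nested generators):
// Nested generators yield one result per combination of their elements:
val combos = for (device <- Seq(999795L, 987875L); os <- Seq(100, 101)) yield s"${device}_$os"
// combos: Seq(999795_100, 999795_101, 987875_100, 987875_101)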
You can also solve this task using the select, withColumn, explode, and concat functions.
case class Event(
  productId: String,
  familyId: String,
  productTypeId: String,
  key: String,
  productName: String,
  deviceOS: String,
  featureMap: Map[String, String])

import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.{col, explode, concat, lit, map}

val spark = SparkSession
  .builder
  .appName("StructuredStreaming")
  .master("local[*]")
  .getOrCreate()

private val inputDataFrame = spark.read.option("multiline", "true").format("json").load("/absolute_path_to_kafka.json")
val transformedDataFrame = inputDataFrame
  .select(col("userActions.productId").as("productId"),
    explode(col("userActions.events")).as("event"))
  .select(col("productId"),
    col("event.familyId").as("familyId"),
    col("event.productTypeId").as("productTypeId"),
    col("event.serialID").as("serialID"),
    col("event.productName").as("productName"),
    explode(col("event.features.mapping")).as("features"))
  .select(
    col("productId"),
    col("familyId"),
    col("productTypeId"),
    col("serialID"),
    col("productName"),
    col("features.deviceId").as("deviceId"),
    explode(col("features.osId")).as("osId"))
  .withColumn("key", concat(col("productId"), lit(":"), col("serialID")))
  .withColumn("deviceOS", concat(col("deviceId"), lit("_"), col("osId")))
  .withColumn("featureMap", map(lit("target"), col("deviceOS")))

import spark.implicits._
private val result: Dataset[Event] = transformedDataFrame.as[Event]
result.foreach(e => println(e))
// Event(3MFETP501,2010,1004678,3MFETP501:890479804,MFE Total Protection 2021 Family Pack,999795_100,Map(target -> 999795_100))
// Event(3MFETP501,2010,1004678,3MFETP501:890479804,MFE Total Protection 2021 Family Pack,987875_101,Map(target -> 987875_101))
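One caveat (an assumption about possible data, not something shown in the question): explode silently drops rows whose array is null or empty. If some events could arrive without features.mapping, explode_outer keeps those rows with null values instead, e.g.:
import org.apache.spark.sql.functions.explode_outer

// Hedged variant: keep events even when features.mapping is null or empty
// (plain explode would drop those rows entirely).
val eventsKept = inputDataFrame
  .select(col("userActions.productId").as("productId"),
    explode(col("userActions.events")).as("event"))
  .select(col("productId"),
    explode_outer(col("event.features.mapping")).as("features"))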
Finally, here is an option to customize the response based on the value of one of the fields. I replaced the for comprehension with map/flatMap here, so you can return one or several events depending on the event type. I also changed the JSON a little to show more examples in the result.
New JSON:
[{
  "partition": 1,
  "key": "34768_20220203_MFETP501",
  "offset": 1841543,
  "createTime": 1646041475348,
  "topic": "topic_int",
  "publishTime": 1646041475344,
  "errorCode": 0,
  "userActions": {
    "productId": "3MFETP501",
    "createdDate": "2022-02-26T11:19:35.786Z",
    "events": [
      {
        "GUID": "dbb1-f38b-f7f0-44af-90da-80179412f89c",
        "eventDate": "2022-02-26T11:19:35.786Z",
        "familyId": 2010,
        "productTypeId": 1004678,
        "serialID": "890479804",
        "productName": "MFE Total Protection 2021 Family Pack",
        "features": {
          "mapping": [
            { "deviceId": 999795, "osId": [100, 110] },
            { "deviceId": 987875, "osId": [101] }
          ]
        }
      },
      {
        "GUID": "1111-2222-f7f0-44af-90da-80179412f89c",
        "eventDate": "2022-03-26T11:19:35.786Z",
        "familyId": 2011,
        "productTypeId": 1004679,
        "serialID": "890479805",
        "productName": "Product name",
        "features": {
          "mapping": [
            { "deviceId": 999796, "osId": [103] },
            { "deviceId": 987877, "osId": [104] }
          ]
        }
      }
    ]
  }
}]
Please find code below:
case class Event(
  productId: String,
  familyId: String,
  productTypeId: String,
  key: String,
  productName: String,
  deviceOS: String,
  featureMap: Map[String, String])

import org.apache.spark.sql.{Dataset, Row, SparkSession}
import scala.collection.mutable

val spark = SparkSession
  .builder
  .appName("StructuredStreaming")
  .master("local[*]")
  .getOrCreate()

private val inputDataFrame = spark.read.option("multiline", "true").format("json").load("/absolute_path_to_kafka.json")
import spark.implicits._
val finalDataset: Dataset[Event] = inputDataFrame.flatMap(
  row => {
    val userActions = row.getAs[Row]("userActions")
    val productId = userActions.getAs[String]("productId")
    val userEvents = userActions.getAs[mutable.WrappedArray[Row]]("events")
    userEvents.flatMap(event => {
      val productTypeId = event.getAs[Int]("productTypeId").toString
      val serialId = event.getAs[String]("serialID")
      val productName = event.getAs[String]("productName")
      val key = s"$productId:$serialId"
      val familyId = event.getAs[Long]("familyId")
      if (familyId == 2010) {
        // One Event per (deviceId, osId) combination
        val features = event.getAs[Row]("features")
        val mappings = features.getAs[mutable.WrappedArray[Row]]("mapping")
        mappings.flatMap(mappingRow => {
          val deviceId = mappingRow.getAs[Long]("deviceId")
          val osIds = mappingRow.getAs[mutable.WrappedArray[Long]]("osId")
          osIds.map(osId => {
            val deviceOs = deviceId + "_" + osId
            Event(productId, familyId.toString, productTypeId, key, productName, deviceOs, Map("target" -> deviceOs))
          })
        })
      } else {
        // Single Event wrapped in a Seq so both branches return a collection
        Seq(Event(productId, familyId.toString, productTypeId, key, productName, "default_device_os", Map("target" -> "default_device_os")))
      }
    })
  }
)
finalDataset.foreach(e => println(e))
// Event(3MFETP501,2010,1004678,3MFETP501:890479804,MFE Total Protection 2021 Family Pack,999795_100,Map(target -> 999795_100))
// Event(3MFETP501,2010,1004678,3MFETP501:890479804,MFE Total Protection 2021 Family Pack,999795_110,Map(target -> 999795_110))
// Event(3MFETP501,2010,1004678,3MFETP501:890479804,MFE Total Protection 2021 Family Pack,987875_101,Map(target -> 987875_101))
// Event(3MFETP501,2011,1004679,3MFETP501:890479805,Product name,default_device_os,Map(target -> default_device_os))

As this is under a Row of a DataFrame, returning the Event case class converts it into a Dataset. The issue here is that for one condition I am getting a List[Event], and for the other I am getting only a single Event.
FYI: this is not an answer, but my further attempt at a solution (see the sketch after this block).
if (familyId == 2010) {
  val a: Option[List[String]] = ??? // flatten the deviceId, osId ...
  a.get.map(i => {
    val key: String = methodToCombinedeviceIdAndosId
    val featureMapping: mutable.Map[String, String] = getfeatureMapForInvidualKey
    Event(productId, productTypeId, familyId, key, productName, device_os, feature) // ---> This returns List[Event]
  })
} else {
  Event(productId, productTypeId, familyId, key, productName, device_os, feature) // --> This returns Event
}
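The usual fix for this mismatch is to make both branches return a collection, so the enclosing flatMap can flatten either case uniformly. A minimal sketch (flattenedDeviceOs and featureMap are hypothetical stand-ins for the placeholders above):
// Both branches now have type List[Event], so flatMap flattens them uniformly.
val events: List[Event] =
  if (familyId == 2010) {
    flattenedDeviceOs.map { deviceOs => // one Event per device_os -> List[Event]
      Event(productId, familyId, productTypeId, key, productName, deviceOs, featureMap)
    }
  } else {
    // wrap the single Event in a List so the branch types line up
    List(Event(productId, familyId, productTypeId, key, productName, "default_device_os", featureMap))
  }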

Related

Extracting list from avro record and converting to new record

I am trying to extract the following Avro record:
{
  "StateName": "Alabama",
  "Capital": "Montgomery",
  "Counties": [
    {
      "CountyName": "Baldwin",
      "CountyPopulation": 200000,
      "Cities": [
        { "CityName": "Daphne", "CityPopulation": 20000 },
        { "CityName": "Foley", "CityPopulation": 14000 }
      ]
    },
    {
      "CountyName": "Calhoun",
      "CountyPopulation": 100000,
      "Cities": [
        { "CityName": "Anniston", "CityPopulation": 23000 },
        { "CityName": "Glencoe", "CityPopulation": 5000 }
      ]
    }
  ]
}
and modify it to create new individual records like this (extract each county and create a new record per county):
{
  "StateName": "Alabama",
  "Capital": "Montgomery",
  "CountyName": "Baldwin",
  "CountyPopulation": 200000,
  "Cities": [
    { "CityName": "Daphne", "CityPopulation": 20000 },
    { "CityName": "Foley", "CityPopulation": 14000 }
  ]
}
I am trying to extract the records using json4s. I took the reference from https://nmatpt.com/blog/2017/01/29/json4s-custom-serializer/
val StateName = avroRecord.get("StateName").asInstanceOf[Utf8].toString
val Capital = avroRecord.get("Capital").asInstanceOf[Utf8].toString
val CountyArray = avroRecord.get("Counties").toString
val jsonData = parse(CountyArray, useBigDecimalForDouble = true)
val CountyList = jsonData match {
  case JArray(_) => jsonData.extract[List[CountyArrayRecord]]
  case JObject(_) => List(jsonData.extract[CountyArrayRecord])
  case _ => List()
}
Custom serializer:
implicit val formats: Formats = Serialization.formats(NoTypeHints) + new TestSerializer

class TestSerializer extends CustomSerializer[CountyArrayRecord](format => (
  {
    case jsonObj: JObject =>
      val countyName = (jsonObj \ "CountyName").extract[String]
      val countyPopulation = (jsonObj \ "CountyPopulation").extract[Int]
      val cities = (jsonObj \ "Cities").extract[List[GenericRecord]]
      CountyArrayRecord(countyName, countyPopulation, cities)
  },
  // Serializing back to JSON is not needed here, so leave it empty
  PartialFunction.empty
))
Once extracted, I am trying to create a list of new records using avro4s. I took the reference from https://github.com/sksamuel/avro4s#avro-records
val returnList = CountyList.map { CountyListRecord =>
  val record = FinalCountyRecord(StateName, Capital, CountyListRecord.CountyName,
    CountyListRecord.CountyPopulation, CountyListRecord.Cities)
  val format = RecordFormat[FinalCountyRecord]
  format.to(record)
}
returnList
But this does not seem to work, since the county list has another list (Cities) inside.
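One way around the nested Cities list (a sketch under assumptions, not tested against the original Avro schema; City, CountyRecord, and the countyArrayJson string are hypothetical stand-ins for avroRecord.get("Counties").toString) is to let json4s extract the whole nested structure into case classes rather than List[GenericRecord]:
import org.json4s._
import org.json4s.jackson.JsonMethods._

case class City(CityName: String, CityPopulation: Int)
case class CountyRecord(CountyName: String, CountyPopulation: Int, Cities: List[City])

implicit val formats: Formats = DefaultFormats

// json4s extracts nested lists of case classes directly,
// so no custom serializer is needed for the Cities field.
val countyList: List[CountyRecord] = parse(countyArrayJson).extract[List[CountyRecord]]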

How to parse a list of dictionaries as a string in Scala?

I am trying to parse a list of dictionaries (which is in a string) in Scala. Basically, I want to build another list so that I can traverse it using a for loop.
When I have one single list of dictionaries, it works fine.
class CC[T] { def unapply(a: Any): Option[T] = Some(a.asInstanceOf[T]) }
object M extends CC[Map[String, Any]]
object A extends CC[List[Any]] // for s3
object I extends CC[Double]
object S extends CC[String]
object N extends CC[String] // for name
object E extends CC[String]
object F extends CC[String]
object G extends CC[Map[String, Any]]
val jsonString =
  """
    {
      "index": 1,
      "source": "a",
      "name": "v",
      "s3": [{
        "path": "s3://1",
        "bucket": "p",
        "key": "r"
      }]
    }
  """.stripMargin
//println(List(JSON.parseFull(jsonString)))
val result = for {
  Some(M(map)) <- List(JSON.parseFull(jsonString))
  //L(text) = map("text")
  //M(texts) <- text
  I(index) = map("index")
  S(source) = map("source")
  N(name) = map("name")
  A(s3q) = map("s3")
  G(s3data) <- s3q
  F(path) = s3data("path")
} yield {
  (index.toInt, source, name, path)
}
But when I added another list, it gives an error stating "java.lang.ClassCastException: scala.collection.immutable.$colon$colon cannot be cast to scala.collection.immutable.Map".
class CC[T] { def unapply(a: Any): Option[T] = Some(a.asInstanceOf[T]) }
object M extends CC[Map[String, Any]]
object A extends CC[List[Any]] // for s3
object I extends CC[Double]
object S extends CC[String]
object N extends CC[String] // for name
object E extends CC[String]
object F extends CC[String]
object G extends CC[Map[String, Any]]
val jsonString =
  """
    [{
      "index": 1,
      "source": "a",
      "name": "v",
      "s3": [{
        "path": "s3://1",
        "bucket": "p",
        "key": "r"
      }]
    },
    {
      "index": 1,
      "source": "a",
      "name": "v",
      "s3": [{
        "path": "s3://1",
        "bucket": "p",
        "key": "r"
      }]
    }]
  """.stripMargin
//println(List(JSON.parseFull(jsonString)))
val result = for {
  Some(M(map)) <- List(JSON.parseFull(jsonString))
  //L(text) = map("text")
  //M(texts) <- text
  I(index) = map("index")
  S(source) = map("source")
  N(name) = map("name")
  A(s3q) = map("s3")
  G(s3data) <- s3q
  F(path) = s3data("path")
} yield {
  (index.toInt, source, name, path)
}
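For reference, a sketch of a likely fix (not from the question itself): JSON.parseFull of a top-level array returns a List, not a Map, which is what triggers the ClassCastException. Extract the list first, then iterate over its element maps, reusing the extractors defined above:
val result = for {
  Some(A(list)) <- List(JSON.parseFull(jsonString)) // top-level value is now a List
  M(map) <- list                                    // each element is a Map
  I(index) = map("index")
  S(source) = map("source")
  N(name) = map("name")
  A(s3q) = map("s3")
  G(s3data) <- s3q
  F(path) = s3data("path")
} yield {
  (index.toInt, source, name, path)
}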

Scala JSON: if key matches value, return string

I have the JSON response given below.
If the metadata has organic=true then label='true-Organic', else label='non-Organic'.
In the end, return a List or Map[modelId, label].
import net.liftweb.json.{DefaultFormats, _}

object test1 extends App {
  val json_response =
    """{
      "requestId": "91ee60d5f1b45e#316",
      "error": null,
      "errorMessages": [],
      "entries": [
        {
          "modelId": "RT001",
          "sku": "SKU-ASC001",
          "store": "New Jersey",
          "ttlInSeconds": 8000,
          "metadata": {
            "manufactured_date": "2019-01-22T01:25Z",
            "organic": "true"
          }
        },
        {
          "modelId": "RT002",
          "sku": "SKU-ASC002",
          "store": "livingstone",
          "ttlInSeconds": 8000,
          "metadata": {
            "manufactured_date": "2019-10-03T01:25Z",
            "organic": "false"
          }
        }
      ] }"""
I tried like this:
val json = parse(json_response)
implicit val formats = DefaultFormats
var map = Map[String, String]()
case class Sales(modelId: String, sku: String, store: String, ttlInSeconds: Int,
  metadata: Map[String, String])
case class Response(entries: List[Sales])
val response = json.extract[Response]
After this, I am not sure how to proceed.
This is a straightforward map operation on the entries field:
response.entries.map { e =>
  e.modelId -> (if (e.metadata.get("organic").contains("true")) {
    "true-Organic"
  } else {
    "non-Organic"
  })
}
This will return List[(String, String)], but you can call toMap to turn this into a Map if required.
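For example, to get the Map[modelId, label] the question asks for (a hypothetical usage of the snippet above):
// Collect the (modelId, label) pairs into a Map[String, String]
val labelsById: Map[String, String] = response.entries.map { e =>
  e.modelId -> (if (e.metadata.get("organic").contains("true")) "true-Organic" else "non-Organic")
}.toMap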

Create a JSON deserializer and use it

How do you create a Jackson custom serializer and use it in your program? The serializer is used to deserialize data from a Kafka stream, because my job fails if it encounters a null.
I tried the following to create a serializer.
import org.json4s._
import org.json4s.jackson.JsonMethods._

case class Person(
  user: Option[String]
)

object PersonSerializer extends CustomSerializer[Person](formats => (
  {
    case JObject(JField("user", JString(user)) :: Nil) => Person(Some(user))
    case JObject(JField("user", null) :: Nil) => Person(None)
  },
  {
    case Person(Some(user)) => JObject(JField("user", JString(user)) :: Nil)
    case Person(None) => JObject(JField("user", JString(null)) :: Nil)
  }))
I am trying to use it this way.
object ConvertJsonTOASTDeSerializer extends App {
  case class Address(street: String, city: String)
  case class PersonAddress(name: String, address: Address)

  val testJson1 =
    """
      { "user": null,
        "address": {
          "street": "Bulevard",
          "city": "Helsinki",
          "country": {
            "code": "CD" }
        },
        "children": [
          {
            "name": "Mary",
            "age": 5,
            "birthdate": "2004-09-04T18:06:22Z"
          },
          {
            "name": "Mazy",
            "age": 3
          }
        ]
      }
    """

  implicit var formats: Formats = DefaultFormats + PersonSerializer
  val output = parse(testJson1).as[Person]
  println(output.user)
}
I am getting an error saying that
Error:(50, 35) No JSON deserializer found for type com.examples.json4s.Person. Try to implement an implicit Reader or JsonFormat for this type.
val output = parse(testJson1).as[Person]
I'm not sure if this answers your question, but here is runnable code:
import org.json4s._
import org.json4s.jackson.JsonMethods._

case class Person(
  user: Option[String],
  address: Address,
  children: List[Child]
)
case class Address(
  street: String,
  city: String,
  country: Country
)
case class Country(
  code: String
)
case class Child(
  name: String,
  age: Int
)

val s =
  """
    { "user": null,
      "address": {
        "street": "Bulevard",
        "city": "Helsinki",
        "country": {
          "code": "CD" }
      },
      "children": [
        {
          "name": "Mary",
          "age": 5,
          "birthdate": "2004-09-04T18:06:22Z"
        },
        {
          "name": "Mazy",
          "age": 3
        }
      ]
    }
  """

implicit val formats: Formats = DefaultFormats
parse(s).extract[Person] // Person(None,Address(Bulevard,Helsinki,Country(CD)),List(Child(Mary,5), Child(Mazy,3)))
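As a side note on the original PersonSerializer (a sketch, assuming the single-field Person from the question): json4s represents a JSON null as the JNull value, not as Scala null, so the second deserialization case above likely never matches. Matching on JNull should behave as intended:
object PersonSerializer extends CustomSerializer[Person](_ => (
  {
    case JObject(JField("user", JString(user)) :: Nil) => Person(Some(user))
    case JObject(JField("user", JNull) :: Nil) => Person(None) // JSON null is JNull in json4s
  },
  {
    case Person(Some(user)) => JObject(JField("user", JString(user)) :: Nil)
    case Person(None) => JObject(JField("user", JNull) :: Nil)
  }))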

How to map fields in RDD[String] to broadcast?

How does one get particular fields from an RDD[String] into a list of maps keyed on a specific field? I have an RDD[String]: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[19]. Each entry is JSON in this format:
{
  "count": 1,
  "itemId": "1122334",
  "country": {
    "code": { "preferred": "USA" },
    "name": { "preferred": "America" }
  },
  "states": "50",
  "self": {
    "otherInfo": [],
    "preferred": "National Parks"
  },
  "Rating": 4
}
How do I get a list of maps that have only itemId as the key and self.preferred as the value ({itemId, self.preferred}):
itemId : 1122334 self.preferred : "National Parks"
itemId : 3444444 self.preferred : "State Parks"
...
Is it efficient to broadcast the resulting map across all nodes? I need this map to be shared/referenced by further calculations.
You can try:
import org.json.JSONObject // assuming org.json's JSONObject, which matches the calls below

val filteredMappingsList = countryMapping.filter(x => {
  val jsonObj = new JSONObject(x)
  jsonObj.has("itemId")
})
val finalMapping = filteredMappingsList.map(x => {
  val jsonObj = new JSONObject(x)
  val itemId = jsonObj.get("itemId").toString()
  val preferred = jsonObj.getJSONObject("self").get("preferred").toString()
  (itemId, preferred)
}).collectAsMap
To broadcast:
val broadcastedAsins = sc.broadcast(finalMapping)
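Once broadcast, each executor reads the map through .value in later transformations, without the map being re-shipped with every task. A hypothetical usage (itemIdsRdd is a stand-in for whatever RDD drives the further calculations):
// Look up the preferred name for each itemId via the broadcast variable
val preferredNames = itemIdsRdd.map { itemId =>
  (itemId, broadcastedAsins.value.getOrElse(itemId, "unknown"))
}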