Multiplying an Event case class per entry in a nested list of IDs - Scala

I am processing a DataFrame and converting it into a Dataset[Event] using an Event case class. However, there are nested IDs for which I need to multiply the events, based on flattening the nested device:os mapping.
I am able to return the Event case class at the Kafka event level, but I am not sure how to multiply the events.
Incoming Kafka event:
{
  "partition": 1,
  "key": "34768_20220203_MFETP501",
  "offset": 1841543,
  "createTime": 1646041475348,
  "topic": "topic_int",
  "publishTime": 1646041475344,
  "errorCode": 0,
  "userActions": {
    "productId": "3MFETP501",
    "createdDate": "2022-02-26T11:19:35.786Z",
    "events": [
      {
        "GUID": "dbb1-f38b-f7f0-44af-90da-80179412f89c",
        "eventDate": "2022-02-26T11:19:35.786Z",
        "familyId": 2010,
        "productTypeId": 1004678,
        "serialID": "890479804",
        "productName": "MFE Total Protection 2021 Family Pack",
        "features": {
          "mapping": [
            { "deviceId": 999795, "osId": [100] },
            { "deviceId": 987875, "osId": [101] }
          ]
        }
      }
    ]
  }
}
The expected output Event case classes:
Event("3MFETP501","1004678","2010","3MFETP501:890479804","MFE Total Protection 2021 Family Pack","999795_100", Map("targetId"->"999795_100"))
Event("3MFETP501","1004678","2010","3MFETP501:890479804","MFE Total Protection 2021 Family Pack","987875_101", Map("targetId"->"987875_101"))
case class Event(
  productId: String,
  familyId: String,
  productTypeId: String,
  key: String,
  productName: String,
  deviceOS: String,
  var featureMap: mutable.Map[String, String])
val finalDataset: Dataset[Event] = inputDataFrame.flatMap(
  row => {
    val productId = row.getAs[String]("productId")
    val userActions = row.getAs[Row]("userActions")
    val userEvents: mutable.Seq[Row] = userActions.getAs[mutable.WrappedArray[Row]]("events")
    val processedEvents: mutable.Seq[Row] = userEvents.map(
      event => {
        val productTypeId = event.getAs[Int]("productTypeId")
        val familyId = event.getAs[Int]("familyId")
        val features = event.getAs[mutable.WrappedArray[Row]]("features")
        val serialId = event.getAs[String]("serialId")
        val key = productId + ":" + serialId
        val featureMap = mutable.Map[String, String]().withDefaultValue(null)
        val device_os_list = List("999795_100", "987875_101")
        // The feature map is per device_os (example: "targetId" -> "999795_100" for 999795_100)
        if (familyId == 2010) {
          val a: Option[List[String]] = ??? // flatten the deviceId, osId ...
          a.get.map(i => {
            val key: String = methodToCombinedeviceIdAndosId
            val featureMapping: mutable.Map[String, String] = getfeatureMapForInvidualKey
            Event(productId, productTypeId, familyId, key, productName, device_os, feature) // ---> This is returning List[Event]
          })
        } else {
          Event(productId, productTypeId, familyId, key, productName, device_os, feature) // --> This is returning Event. THIS WORKS
        }
      }
    )
  }
)

I did not implement it exactly the same way, but I think the logic is easy to follow and apply to your case.
I created a JSON file named kafka.json and put your event in it:
[{
  "partition": 1,
  "key": "34768_20220203_MFETP501",
  "offset": 1841543,
  "createTime": 1646041475348,
  "topic": "topic_int",
  "publishTime": 1646041475344,
  "errorCode": 0,
  "userActions": {
    "productId": "3MFETP501",
    "createdDate": "2022-02-26T11:19:35.786Z",
    "events": [
      {
        "GUID": "dbb1-f38b-f7f0-44af-90da-80179412f89c",
        "eventDate": "2022-02-26T11:19:35.786Z",
        "familyId": 2010,
        "productTypeId": 1004678,
        "serialID": "890479804",
        "productName": "MFE Total Protection 2021 Family Pack",
        "features": {
          "mapping": [
            { "deviceId": 999795, "osId": [100] },
            { "deviceId": 987875, "osId": [101] }
          ]
        }
      }
    ]
  }
}]
Please find below the first solution, based on flatMap and a for comprehension.
case class Event(
  productId: String,
  familyId: String,
  productTypeId: String,
  key: String,
  productName: String,
  deviceOS: String,
  featureMap: Map[String, String])

import org.apache.spark.sql.{Dataset, Row, SparkSession}
import scala.collection.mutable

val spark = SparkSession
  .builder
  .appName("StructuredStreaming")
  .master("local[*]")
  .getOrCreate()

private val inputDataFrame = spark.read.option("multiline", "true").format("json").load("/absolute_path_to_kafka.json")
import spark.implicits._
val finalDataset: Dataset[Event] = inputDataFrame.flatMap(
  row => {
    val userActions = row.getAs[Row]("userActions")
    val productId = userActions.getAs[String]("productId")
    val userEvents = userActions.getAs[mutable.WrappedArray[Row]]("events")
    for (event <- userEvents;
         familyId = event.getAs[Int]("familyId").toString;
         productTypeId = event.getAs[Int]("productTypeId").toString;
         serialId = event.getAs[String]("serialID");
         productName = event.getAs[String]("productName");
         key = s"$productId:$serialId";
         features = event.getAs[Row]("features");
         mappings = features.getAs[mutable.WrappedArray[Row]]("mapping");
         mappingRow <- mappings;
         deviceId = mappingRow.getAs[Long]("deviceId");
         osIds = mappingRow.getAs[mutable.WrappedArray[Long]]("osId");
         osId <- osIds;
         deviceOs = deviceId + "_" + osId
    ) yield Event(productId, familyId, productTypeId, key, productName, deviceOs, Map("target" -> deviceOs))
  }
)
finalDataset.foreach(e => println(e))
// Event(3MFETP501,2010,1004678,3MFETP501:890479804,MFE Total Protection 2021 Family Pack,999795_100,Map(target -> 999795_100))
// Event(3MFETP501,2010,1004678,3MFETP501:890479804,MFE Total Protection 2021 Family Pack,987875_101,Map(target -> 987875_101))
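For clarity on why this multiplies events: the for comprehension above has three generators (event <-, mappingRow <-, osId <-), so the yield produces one Event per (event, mapping, osId) combination. A tiny standalone illustration of the same mechanism (not your data model, just the cross-product behaviour of nested generators):
// Nested generators yield one result per combination of their elements:
val combos = for (device <- Seq(999795L, 987875L); os <- Seq(100, 101)) yield s"${device}_$os"
// combos: Seq(999795_100, 999795_101, 987875_100, 987875_101)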
You can also solve this task using the select, withColumn, explode, and concat functions.
case class Event(
  productId: String,
  familyId: String,
  productTypeId: String,
  key: String,
  productName: String,
  deviceOS: String,
  featureMap: Map[String, String])

import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.{col, explode, concat, lit, map}

val spark = SparkSession
  .builder
  .appName("StructuredStreaming")
  .master("local[*]")
  .getOrCreate()

private val inputDataFrame = spark.read.option("multiline", "true").format("json").load("/absolute_path_to_kafka.json")
val transformedDataFrame = inputDataFrame
  .select(col("userActions.productId").as("productId"),
    explode(col("userActions.events")).as("event"))
  .select(col("productId"),
    col("event.familyId").as("familyId"),
    col("event.productTypeId").as("productTypeId"),
    col("event.serialID").as("serialID"),
    col("event.productName").as("productName"),
    explode(col("event.features.mapping")).as("features"))
  .select(
    col("productId"),
    col("familyId"),
    col("productTypeId"),
    col("serialID"),
    col("productName"),
    col("features.deviceId").as("deviceId"),
    explode(col("features.osId")).as("osId"))
  .withColumn("key", concat(col("productId"), lit(":"), col("serialID")))
  .withColumn("deviceOS", concat(col("deviceId"), lit("_"), col("osId")))
  .withColumn("featureMap", map(lit("target"), col("deviceOS")))

import spark.implicits._
private val result: Dataset[Event] = transformedDataFrame.as[Event]
result.foreach(e => println(e))
// Event(3MFETP501,2010,1004678,3MFETP501:890479804,MFE Total Protection 2021 Family Pack,999795_100,Map(target -> 999795_100))
// Event(3MFETP501,2010,1004678,3MFETP501:890479804,MFE Total Protection 2021 Family Pack,987875_101,Map(target -> 987875_101))
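One caveat (an assumption about possible data, not something shown in the question): explode silently drops rows whose array is null or empty. If some events could arrive without features.mapping, explode_outer keeps those rows with null values instead, e.g.:
import org.apache.spark.sql.functions.explode_outer

// Hedged variant: keep events even when features.mapping is null or empty
// (plain explode would drop those rows entirely).
val eventsKept = inputDataFrame
  .select(col("userActions.productId").as("productId"),
    explode(col("userActions.events")).as("event"))
  .select(col("productId"),
    explode_outer(col("event.features.mapping")).as("features"))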
Finally, here is an option to customize the response based on the value of one of the fields. I replaced the for comprehension with map/flatMap here, so you can return one or several events depending on the event type. I also changed the JSON a little to show more examples in the result.
New JSON:
[{
  "partition": 1,
  "key": "34768_20220203_MFETP501",
  "offset": 1841543,
  "createTime": 1646041475348,
  "topic": "topic_int",
  "publishTime": 1646041475344,
  "errorCode": 0,
  "userActions": {
    "productId": "3MFETP501",
    "createdDate": "2022-02-26T11:19:35.786Z",
    "events": [
      {
        "GUID": "dbb1-f38b-f7f0-44af-90da-80179412f89c",
        "eventDate": "2022-02-26T11:19:35.786Z",
        "familyId": 2010,
        "productTypeId": 1004678,
        "serialID": "890479804",
        "productName": "MFE Total Protection 2021 Family Pack",
        "features": {
          "mapping": [
            { "deviceId": 999795, "osId": [100, 110] },
            { "deviceId": 987875, "osId": [101] }
          ]
        }
      },
      {
        "GUID": "1111-2222-f7f0-44af-90da-80179412f89c",
        "eventDate": "2022-03-26T11:19:35.786Z",
        "familyId": 2011,
        "productTypeId": 1004679,
        "serialID": "890479805",
        "productName": "Product name",
        "features": {
          "mapping": [
            { "deviceId": 999796, "osId": [103] },
            { "deviceId": 987877, "osId": [104] }
          ]
        }
      }
    ]
  }
}]
Please find code below:
case class Event(
  productId: String,
  familyId: String,
  productTypeId: String,
  key: String,
  productName: String,
  deviceOS: String,
  featureMap: Map[String, String])

import org.apache.spark.sql.{Dataset, Row, SparkSession}
import scala.collection.mutable

val spark = SparkSession
  .builder
  .appName("StructuredStreaming")
  .master("local[*]")
  .getOrCreate()

private val inputDataFrame = spark.read.option("multiline", "true").format("json").load("/absolute_path_to_kafka.json")
import spark.implicits._
val finalDataset: Dataset[Event] = inputDataFrame.flatMap(
  row => {
    val userActions = row.getAs[Row]("userActions")
    val productId = userActions.getAs[String]("productId")
    val userEvents = userActions.getAs[mutable.WrappedArray[Row]]("events")
    userEvents.flatMap(event => {
      val productTypeId = event.getAs[Int]("productTypeId").toString
      val serialId = event.getAs[String]("serialID")
      val productName = event.getAs[String]("productName")
      val key = s"$productId:$serialId"
      val familyId = event.getAs[Long]("familyId")
      if (familyId == 2010) {
        // One Event per (deviceId, osId) combination
        val features = event.getAs[Row]("features")
        val mappings = features.getAs[mutable.WrappedArray[Row]]("mapping")
        mappings.flatMap(mappingRow => {
          val deviceId = mappingRow.getAs[Long]("deviceId")
          val osIds = mappingRow.getAs[mutable.WrappedArray[Long]]("osId")
          osIds.map(osId => {
            val deviceOs = deviceId + "_" + osId
            Event(productId, familyId.toString, productTypeId, key, productName, deviceOs, Map("target" -> deviceOs))
          })
        })
      } else {
        // Single Event wrapped in a Seq so both branches return a collection
        Seq(Event(productId, familyId.toString, productTypeId, key, productName, "default_device_os", Map("target" -> "default_device_os")))
      }
    })
  }
)
finalDataset.foreach(e => println(e))
// Event(3MFETP501,2010,1004678,3MFETP501:890479804,MFE Total Protection 2021 Family Pack,999795_100,Map(target -> 999795_100))
// Event(3MFETP501,2010,1004678,3MFETP501:890479804,MFE Total Protection 2021 Family Pack,999795_110,Map(target -> 999795_110))
// Event(3MFETP501,2010,1004678,3MFETP501:890479804,MFE Total Protection 2021 Family Pack,987875_101,Map(target -> 987875_101))
// Event(3MFETP501,2011,1004679,3MFETP501:890479805,Product name,default_device_os,Map(target -> default_device_os))

As this is under a Row of a DataFrame, returning the Event case class converts it into a Dataset. The issue here is that for one condition I am getting a List[Event], and for the other I am getting only a single Event.
FYI: this is not an answer, but my further attempt at a solution (see the sketch after this block).
if (familyId == 2010) {
  val a: Option[List[String]] = ??? // flatten the deviceId, osId ...
  a.get.map(i => {
    val key: String = methodToCombinedeviceIdAndosId
    val featureMapping: mutable.Map[String, String] = getfeatureMapForInvidualKey
    Event(productId, productTypeId, familyId, key, productName, device_os, feature) // ---> This returns List[Event]
  })
} else {
  Event(productId, productTypeId, familyId, key, productName, device_os, feature) // --> This returns Event
}
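The usual fix for this mismatch is to make both branches return a collection, so the enclosing flatMap can flatten either case uniformly. A minimal sketch (flattenedDeviceOs and featureMap are hypothetical stand-ins for the placeholders above):
// Both branches now have type List[Event], so flatMap flattens them uniformly.
val events: List[Event] =
  if (familyId == 2010) {
    flattenedDeviceOs.map { deviceOs => // one Event per device_os -> List[Event]
      Event(productId, familyId, productTypeId, key, productName, deviceOs, featureMap)
    }
  } else {
    // wrap the single Event in a List so the branch types line up
    List(Event(productId, familyId, productTypeId, key, productName, "default_device_os", featureMap))
  }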

Related

Extracting list from avro record and converting to new record

I am trying to extract the following Avro record:
{
  "StateName": "Alabama",
  "Capital": "Montgomery",
  "Counties": [
    {
      "CountyName": "Baldwin",
      "CountyPopulation": 200000,
      "Cities": [
        { "CityName": "Daphne", "CityPopulation": 20000 },
        { "CityName": "Foley", "CityPopulation": 14000 }
      ]
    },
    {
      "CountyName": "Calhoun",
      "CountyPopulation": 100000,
      "Cities": [
        { "CityName": "Anniston", "CityPopulation": 23000 },
        { "CityName": "Glencoe", "CityPopulation": 5000 }
      ]
    }
  ]
}
and modify it to create new individual records like this (extract each county and create a new record per county):
{
  "StateName": "Alabama",
  "Capital": "Montgomery",
  "CountyName": "Baldwin",
  "CountyPopulation": 200000,
  "Cities": [
    { "CityName": "Daphne", "CityPopulation": 20000 },
    { "CityName": "Foley", "CityPopulation": 14000 }
  ]
}
I am trying to extract the records using json4s. I took the reference from https://nmatpt.com/blog/2017/01/29/json4s-custom-serializer/
val StateName = avroRecord.get("StateName").asInstanceOf[Utf8].toString
val Capital = avroRecord.get("Capital").asInstanceOf[Utf8].toString
val CountyArray = avroRecord.get("Counties").toString
val jsonData = parse(CountyArray, useBigDecimalForDouble = true)
val CountyList = jsonData match {
  case JArray(_) => jsonData.extract[List[CountyArrayRecord]]
  case JObject(_) => List(jsonData.extract[CountyArrayRecord])
  case _ => List()
}
Custom serializer:
implicit val formats: Formats = Serialization.formats(NoTypeHints) + new TestSerializer

class TestSerializer extends CustomSerializer[CountyArrayRecord](format => (
  {
    case jsonObj: JObject =>
      val countyName = (jsonObj \ "CountyName").extract[String]
      val countyPopulation = (jsonObj \ "CountyPopulation").extract[Int]
      val cities = (jsonObj \ "Cities").extract[List[GenericRecord]]
      CountyArrayRecord(countyName, countyPopulation, cities)
  },
  // Serializing back to JSON is not needed here, so leave it empty
  PartialFunction.empty
))
Once extracted, I am trying to create a list of new records using avro4s. I took the reference from https://github.com/sksamuel/avro4s#avro-records
val returnList = CountyList.map { CountyListRecord =>
  val record = FinalCountyRecord(StateName, Capital, CountyListRecord.CountyName,
    CountyListRecord.CountyPopulation, CountyListRecord.Cities)
  val format = RecordFormat[FinalCountyRecord]
  format.to(record)
}
returnList
But this does not seem to work, since the county list has another list (Cities) inside.
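One way around the nested Cities list (a sketch under assumptions, not tested against the original Avro schema; City, CountyRecord, and the countyArrayJson string are hypothetical stand-ins for avroRecord.get("Counties").toString) is to let json4s extract the whole nested structure into case classes rather than List[GenericRecord]:
import org.json4s._
import org.json4s.jackson.JsonMethods._

case class City(CityName: String, CityPopulation: Int)
case class CountyRecord(CountyName: String, CountyPopulation: Int, Cities: List[City])

implicit val formats: Formats = DefaultFormats

// json4s extracts nested lists of case classes directly,
// so no custom serializer is needed for the Cities field.
val countyList: List[CountyRecord] = parse(countyArrayJson).extract[List[CountyRecord]]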

How to parse a list of dictionaries as a string in Scala?

I am trying to parse a list of dictionaries (which is in a string) in Scala. Basically, I want to build another list so that I can traverse it using a for loop.
When I have one single list of dictionaries, it works fine.
class CC[T] { def unapply(a: Any): Option[T] = Some(a.asInstanceOf[T]) }
object M extends CC[Map[String, Any]]
object A extends CC[List[Any]] // for s3
object I extends CC[Double]
object S extends CC[String]
object N extends CC[String] // for name
object E extends CC[String]
object F extends CC[String]
object G extends CC[Map[String, Any]]
val jsonString =
  """
    {
      "index": 1,
      "source": "a",
      "name": "v",
      "s3": [{
        "path": "s3://1",
        "bucket": "p",
        "key": "r"
      }]
    }
  """.stripMargin
//println(List(JSON.parseFull(jsonString)))
val result = for {
  Some(M(map)) <- List(JSON.parseFull(jsonString))
  //L(text) = map("text")
  //M(texts) <- text
  I(index) = map("index")
  S(source) = map("source")
  N(name) = map("name")
  A(s3q) = map("s3")
  G(s3data) <- s3q
  F(path) = s3data("path")
} yield {
  (index.toInt, source, name, path)
}
But when I added another list, it gives an error stating "java.lang.ClassCastException: scala.collection.immutable.$colon$colon cannot be cast to scala.collection.immutable.Map".
class CC[T] { def unapply(a: Any): Option[T] = Some(a.asInstanceOf[T]) }
object M extends CC[Map[String, Any]]
object A extends CC[List[Any]] // for s3
object I extends CC[Double]
object S extends CC[String]
object N extends CC[String] // for name
object E extends CC[String]
object F extends CC[String]
object G extends CC[Map[String, Any]]
val jsonString =
  """
    [{
      "index": 1,
      "source": "a",
      "name": "v",
      "s3": [{
        "path": "s3://1",
        "bucket": "p",
        "key": "r"
      }]
    },
    {
      "index": 1,
      "source": "a",
      "name": "v",
      "s3": [{
        "path": "s3://1",
        "bucket": "p",
        "key": "r"
      }]
    }]
  """.stripMargin
//println(List(JSON.parseFull(jsonString)))
val result = for {
  Some(M(map)) <- List(JSON.parseFull(jsonString))
  //L(text) = map("text")
  //M(texts) <- text
  I(index) = map("index")
  S(source) = map("source")
  N(name) = map("name")
  A(s3q) = map("s3")
  G(s3data) <- s3q
  F(path) = s3data("path")
} yield {
  (index.toInt, source, name, path)
}
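For reference, a sketch of a likely fix (not from the question itself): JSON.parseFull of a top-level array returns a List, not a Map, which is what triggers the ClassCastException. Extract the list first, then iterate over its element maps, reusing the extractors defined above:
val result = for {
  Some(A(list)) <- List(JSON.parseFull(jsonString)) // top-level value is now a List
  M(map) <- list                                    // each element is a Map
  I(index) = map("index")
  S(source) = map("source")
  N(name) = map("name")
  A(s3q) = map("s3")
  G(s3data) <- s3q
  F(path) = s3data("path")
} yield {
  (index.toInt, source, name, path)
}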

Scala JSON: if key matches value, return string

I have the JSON response given below.
If the metadata has organic=true then label='true-Organic', else label='non-Organic'.
In the end, return a List or Map[modelId, label].
import net.liftweb.json.{DefaultFormats, _}

object test1 extends App {
  val json_response =
    """{
      "requestId": "91ee60d5f1b45e#316",
      "error": null,
      "errorMessages": [],
      "entries": [
        {
          "modelId": "RT001",
          "sku": "SKU-ASC001",
          "store": "New Jersey",
          "ttlInSeconds": 8000,
          "metadata": {
            "manufactured_date": "2019-01-22T01:25Z",
            "organic": "true"
          }
        },
        {
          "modelId": "RT002",
          "sku": "SKU-ASC002",
          "store": "livingstone",
          "ttlInSeconds": 8000,
          "metadata": {
            "manufactured_date": "2019-10-03T01:25Z",
            "organic": "false"
          }
        }
      ] }"""
I tried like this:
val json = parse(json_response)
implicit val formats = DefaultFormats
var map = Map[String, String]()
case class Sales(modelId: String, sku: String, store: String, ttlInSeconds: Int,
  metadata: Map[String, String])
case class Response(entries: List[Sales])
val response = json.extract[Response]
After this, I am not sure how to proceed.
This is a straightforward map operation on the entries field:
response.entries.map { e =>
  e.modelId -> (if (e.metadata.get("organic").contains("true")) {
    "true-Organic"
  } else {
    "non-Organic"
  })
}
This will return List[(String, String)], but you can call toMap to turn this into a Map if required.
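For example, to get the Map[modelId, label] the question asks for (a hypothetical usage of the snippet above):
// Collect the (modelId, label) pairs into a Map[String, String]
val labelsById: Map[String, String] = response.entries.map { e =>
  e.modelId -> (if (e.metadata.get("organic").contains("true")) "true-Organic" else "non-Organic")
}.toMap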

Create a JSON deserializer and use it

How do you create a Jackson custom serializer and use it in your program? The serializer is used to deserialize data from a Kafka stream, because my job fails if it encounters a null.
I tried the following to create a serializer.
import org.json4s._
import org.json4s.jackson.JsonMethods._

case class Person(
  user: Option[String]
)

object PersonSerializer extends CustomSerializer[Person](formats => (
  {
    case JObject(JField("user", JString(user)) :: Nil) => Person(Some(user))
    case JObject(JField("user", null) :: Nil) => Person(None)
  },
  {
    case Person(Some(user)) => JObject(JField("user", JString(user)) :: Nil)
    case Person(None) => JObject(JField("user", JString(null)) :: Nil)
  }))
I am trying to use it this way.
object ConvertJsonTOASTDeSerializer extends App {
  case class Address(street: String, city: String)
  case class PersonAddress(name: String, address: Address)

  val testJson1 =
    """
      { "user": null,
        "address": {
          "street": "Bulevard",
          "city": "Helsinki",
          "country": {
            "code": "CD" }
        },
        "children": [
          {
            "name": "Mary",
            "age": 5,
            "birthdate": "2004-09-04T18:06:22Z"
          },
          {
            "name": "Mazy",
            "age": 3
          }
        ]
      }
    """

  implicit var formats: Formats = DefaultFormats + PersonSerializer
  val output = parse(testJson1).as[Person]
  println(output.user)
}
I am getting an error saying that
Error:(50, 35) No JSON deserializer found for type com.examples.json4s.Person. Try to implement an implicit Reader or JsonFormat for this type.
val output = parse(testJson1).as[Person]
I'm not sure if this answers your question, but here is runnable code:
import org.json4s._
import org.json4s.jackson.JsonMethods._

case class Person(
  user: Option[String],
  address: Address,
  children: List[Child]
)
case class Address(
  street: String,
  city: String,
  country: Country
)
case class Country(
  code: String
)
case class Child(
  name: String,
  age: Int
)

val s =
  """
    { "user": null,
      "address": {
        "street": "Bulevard",
        "city": "Helsinki",
        "country": {
          "code": "CD" }
      },
      "children": [
        {
          "name": "Mary",
          "age": 5,
          "birthdate": "2004-09-04T18:06:22Z"
        },
        {
          "name": "Mazy",
          "age": 3
        }
      ]
    }
  """

implicit val formats: Formats = DefaultFormats
parse(s).extract[Person] // Person(None,Address(Bulevard,Helsinki,Country(CD)),List(Child(Mary,5), Child(Mazy,3)))
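As a side note on the original PersonSerializer (a sketch, assuming the single-field Person from the question): json4s represents a JSON null as the JNull value, not as Scala null, so the second deserialization case above likely never matches. Matching on JNull should behave as intended:
object PersonSerializer extends CustomSerializer[Person](_ => (
  {
    case JObject(JField("user", JString(user)) :: Nil) => Person(Some(user))
    case JObject(JField("user", JNull) :: Nil) => Person(None) // JSON null is JNull in json4s
  },
  {
    case Person(Some(user)) => JObject(JField("user", JString(user)) :: Nil)
    case Person(None) => JObject(JField("user", JNull) :: Nil)
  }))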

How to map fields in RDD[String] to broadcast?

How does one get particular fields from an RDD[String] into a list of maps keyed on a specific field? I have an RDD[String]: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[19]. Each entry is JSON in this format:
{
  "count": 1,
  "itemId": "1122334",
  "country": {
    "code": { "preferred": "USA" },
    "name": { "preferred": "America" }
  },
  "states": "50",
  "self": {
    "otherInfo": [],
    "preferred": "National Parks"
  },
  "Rating": 4
}
How do I get a list of maps that have only itemId as the key and self.preferred as the value ({itemId, self.preferred}):
itemId : 1122334 self.preferred : "National Parks"
itemId : 3444444 self.preferred : "State Parks"
...
Is it efficient to broadcast the resulting map across all nodes? I need this map to be shared/referenced by further calculations.
You can try:
import org.json.JSONObject // assuming org.json's JSONObject, which matches the calls below

val filteredMappingsList = countryMapping.filter(x => {
  val jsonObj = new JSONObject(x)
  jsonObj.has("itemId")
})
val finalMapping = filteredMappingsList.map(x => {
  val jsonObj = new JSONObject(x)
  val itemId = jsonObj.get("itemId").toString()
  val preferred = jsonObj.getJSONObject("self").get("preferred").toString()
  (itemId, preferred)
}).collectAsMap
To broadcast:
val broadcastedAsins = sc.broadcast(finalMapping)
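Once broadcast, each executor reads the map through .value in later transformations, without the map being re-shipped with every task. A hypothetical usage (itemIdsRdd is a stand-in for whatever RDD drives the further calculations):
// Look up the preferred name for each itemId via the broadcast variable
val preferredNames = itemIdsRdd.map { itemId =>
  (itemId, broadcastedAsins.value.getOrElse(itemId, "unknown"))
}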