How to add complex logic to updateExpr in a Delta Table - scala

I am updating a Delta table with some incremental records. Two of the fields require just a plain update, but there is another one, a collection of maps, where I would like to concatenate the incoming values with the existing ones instead of doing an update/replace:
val historicalDF = Seq(
  (1, 0, "Roger", Seq(Map("score" -> 5, "year" -> 2012)))
).toDF("id", "ts", "user", "scores")

historicalDF.write
  .format("delta")
  .mode("overwrite")
  .save(table_path)

val hist_dt: DeltaTable = DeltaTable.forPath(spark, table_path)

val incrementalDF = Seq(
  (1, 1, "Roger Rabbit", Seq(Map("score" -> 7, "year" -> 2013)))
).toDF("id", "ts", "user", "scores")
What I would like to have after the merge is something like this:
+---+---+------------+--------------------------------------------------------+
|id |ts |user |scores |
+---+---+------------+--------------------------------------------------------+
|1 |1 |Roger Rabbit|[{score -> 7, year -> 2013}, {score -> 5, year -> 2012}]|
+---+---+------------+--------------------------------------------------------+
What I tried in order to perform this concatenation is:
hist_dt
  .as("ex")
  .merge(incrementalDF.as("in"),
    "ex.id = in.id")
  .whenMatched
  .updateExpr(
    Map(
      "ts" -> "in.ts",
      "user" -> "in.user",
      "scores" -> "in.scores" ++ "ex.scores"
    )
  )
  .whenNotMatched
  .insertAll()
  .execute()
But the columns "in.scores" and "ex.scores" are interpreted as String, so I am getting the following error:
error: value ++ is not a member of (String, String)
Is there a way to add some complex logic to updateExpr?

Using update() instead of updateExpr() lets me pass the required columns to a UDF, so I can put more complex logic there:
import org.apache.spark.sql.functions.{col, udf}

// Concatenate the incoming and existing collections, handling nulls on either side
def join_seq_map(incremental: Seq[Map[String, Integer]], existing: Seq[Map[String, Integer]]): Seq[Map[String, Integer]] = {
  (incremental, existing) match {
    case (null, null) => null
    case (null, e)    => e
    case (i, null)    => i
    case (i, e)       => (i ++ e).distinct
  }
}

def join_seq_map_udf = udf(join_seq_map _)
hist_dt
  .as("ex")
  .merge(
    incrementalDF.as("in"),
    "ex.id = in.id")
  .whenMatched("ex.ts < in.ts")
  .update(Map(
    "ts"     -> col("in.ts"),
    "user"   -> col("in.user"),
    "scores" -> join_seq_map_udf(col("in.scores"), col("ex.scores"))
  ))
  .whenNotMatched
  .insertAll()
  .execute()
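For completeness, a UDF-free sketch (not part of the original answer): since Spark SQL's concat also accepts array columns in Spark 2.4+, the same concatenation can probably be expressed directly in updateExpr as a SQL expression. Note that, unlike the UDF above, concat returns null if either side is null and does not de-duplicate entries.

// Sketch: express the concatenation as a SQL expression instead of a UDF.
// Assumes Spark 2.4+ (concat on array columns); null on either side yields null.
hist_dt
  .as("ex")
  .merge(incrementalDF.as("in"), "ex.id = in.id")
  .whenMatched("ex.ts < in.ts")
  .updateExpr(Map(
    "ts"     -> "in.ts",
    "user"   -> "in.user",
    "scores" -> "concat(in.scores, ex.scores)"
  ))
  .whenNotMatched
  .insertAll()
  .execute()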

Related

Filter Seq[Map[k,v]] given a Seq[String]

I have this Seq[Map[String, String]]:
val val1 = Seq(
  Map("Name" -> "Heidi",
      "City" -> "Paris",
      "Age" -> "23"),
  Map("Country" -> "France"),
  Map("Color" -> "Blue",
      "City" -> "Paris"))
and I have this Seq[String]:
val val2 = Seq("Name", "Country", "City", "Department")
The expected output is val1 with only the keys present in val2 (I want to filter out the (k, v) pairs from val1 whose keys are not present in val2):
val expected = Seq(Map("Name" -> "Heidi", "City" -> "Paris"), Map("Country" -> "France"), Map("City" -> "Paris"))
Age and Color are keys that are not in val2, so I want to omit them from the maps in val1.
I'm not sure whether what you propose is the right approach, but nevertheless it can be done like this:
val1.map(_.filter {
  case (key, value) => val2.contains(key)
})
It seems you want something like this:
(note that I used a Set instead of a List to make contains faster)
def ensureMapsHaveOnlyValidKeys[K, V](validKeys: Set[K])(data: IterableOnce[Map[K, V]]): List[Map[K, V]] =
  data
    .iterator
    .filter(_.keysIterator.forall(validKeys.contains))
    .toList
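Note that this second version keeps or drops whole maps (a map survives only if all of its keys are valid), which is different from the per-entry filtering in the first snippet. A quick comparison with the sample data (a sketch; validKeys is val2 as a Set):

val validKeys = Set("Name", "Country", "City", "Department")

// Per-entry filtering: every map is kept, invalid entries are dropped
val filtered = val1.map(_.filter { case (k, _) => validKeys.contains(k) })
// List(Map(Name -> Heidi, City -> Paris), Map(Country -> France), Map(City -> Paris))

// Whole-map filtering: any map containing an invalid key is dropped entirely
val onlyValid = ensureMapsHaveOnlyValidKeys(validKeys)(val1)
// List(Map(Country -> France))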

How to avoid using of collect in Spark RDD in Scala?

I have a List and have to create a Map from it for further use. I am using an RDD, but with the use of collect() the job is failing on the cluster. Any help is appreciated.
Below is the sample code going from the List to rdd.collect().
I have to use this Map data further, but how can I use it without collect()?
This code creates a Map from the RDD (List) data. List format: (asdfg/1234/wert, asdf)
// List data to create the Map from
val listData = methodToGetListData(ListData).toList

// Creating an RDD from the above List
val rdd = sparkContext.makeRDD(listData)

implicit val formats = Serialization.formats(NoTypeHints)

val res = rdd
  .map(map => (getRPath(map._1), getAttribute(map._1), map._2))
  .groupBy(_._1)
  .map(tuple => {
    Map(
      "P_Id" -> "1234",
      "R_Time" -> "27-04-2020",
      "S_Time" -> "27-04-2020",
      "r_path" -> tuple._1,
      "S_Tag" -> "12345",
      tuple._1 -> (tuple._2.map(a => (a._2, a._3)).toMap)
    )
  })

res.collect()
Q: how to use it without collect?
Answer: collect() moves all the data to the driver node. If the data is huge, never do that.
I don't know exactly what the use case for preparing the map is, but it can be achieved using a built-in Spark API, i.e. collectionAccumulator. In detail:
collectionAccumulator[scala.collection.mutable.Map[String, String]]
Let's suppose this is your sample dataframe and you want to make a map from it.
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|Item_Id|Parent_Id|object_class_instance|Received_Time|CablesName|CablesStatus|CablesHInfoID|CablesIndex|object_class|ServiceTag|Scan_Time|relation_tree |
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|-0909 |1234 |Cables-1 |23-12-2020 |LC |Installed |ABCD1234 |0 |Cables |ASDF123 |12345 |Start~>HInfo->Cables->Cables-1 |
|-09091 |1234111 |Cables-11 |23-12-2022 |LC1 |Installed1 |ABCD12341 |0 |Cables1 |ASDF1231 |123451 |Start~>HInfo->Cables->Cables-11|
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
From this you want to make a map for each row. Below is the full example; have a look and modify it accordingly.
package examples

import org.apache.log4j.Level

object GrabMapbetweenClosure extends App {

  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession
    .builder()
    .master("local[*]")
    .appName(this.getClass.getName)
    .getOrCreate()

  import spark.implicits._

  var mutableMapAcc = spark.sparkContext.collectionAccumulator[scala.collection.mutable.Map[String, String]]("mutableMap")

  val df = Seq(
    ("-0909", "1234", "Cables-1", "23-12-2020", "LC", "Installed", "ABCD1234",
      "0", "Cables", "ASDF123", "12345", "Start~>HInfo->Cables->Cables-1"),
    ("-09091", "1234111", "Cables-11", "23-12-2022", "LC1", "Installed1", "ABCD12341",
      "0", "Cables1", "ASDF1231", "123451", "Start~>HInfo->Cables->Cables-11")
  ).toDF("Item_Id", "Parent_Id", "object_class_instance", "Received_Time", "CablesName", "CablesStatus", "CablesHInfoID",
    "CablesIndex", "object_class", "ServiceTag", "Scan_Time", "relation_tree")

  df.show(false)

  df.foreachPartition { partition => // for performance's sake I used foreachPartition
    partition.foreach { record =>
      mutableMapAcc.add(scala.collection.mutable.Map(
        "Item_Id" -> record.getAs[String]("Item_Id"),
        "CablesStatus" -> record.getAs[String]("CablesStatus"),
        "CablesHInfoID" -> record.getAs[String]("CablesHInfoID"),
        "Parent_Id" -> record.getAs[String]("Parent_Id"),
        "CablesIndex" -> record.getAs[String]("CablesIndex"),
        "object_class_instance" -> record.getAs[String]("object_class_instance"),
        "Received_Time" -> record.getAs[String]("Received_Time"),
        "object_class" -> record.getAs[String]("object_class"),
        "CablesName" -> record.getAs[String]("CablesName"),
        "ServiceTag" -> record.getAs[String]("ServiceTag"),
        "Scan_Time" -> record.getAs[String]("Scan_Time"),
        "relation_tree" -> record.getAs[String]("relation_tree")
      ))
    }
  }

  println("FinalMap : " + mutableMapAcc.value.toString)
}
Result :
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|Item_Id|Parent_Id|object_class_instance|Received_Time|CablesName|CablesStatus|CablesHInfoID|CablesIndex|object_class|ServiceTag|Scan_Time|relation_tree |
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|-0909 |1234 |Cables-1 |23-12-2020 |LC |Installed |ABCD1234 |0 |Cables |ASDF123 |12345 |Start~>HInfo->Cables->Cables-1 |
|-09091 |1234111 |Cables-11 |23-12-2022 |LC1 |Installed1 |ABCD12341 |0 |Cables1 |ASDF1231 |123451 |Start~>HInfo->Cables->Cables-11|
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
FinalMap : [Map(Scan_Time -> 123451, ServiceTag -> ASDF1231, Received_Time -> 23-12-2022, object_class_instance -> Cables-11, CablesHInfoID -> ABCD12341, Parent_Id -> 1234111, Item_Id -> -09091, CablesIndex -> 0, object_class -> Cables1, relation_tree -> Start~>HInfo->Cables->Cables-11, CablesName -> LC1, CablesStatus -> Installed1), Map(Scan_Time -> 12345, ServiceTag -> ASDF123, Received_Time -> 23-12-2020, object_class_instance -> Cables-1, CablesHInfoID -> ABCD1234, Parent_Id -> 1234, Item_Id -> -0909, CablesIndex -> 0, object_class -> Cables, relation_tree -> Start~>HInfo->Cables->Cables-1, CablesName -> LC, CablesStatus -> Installed)]
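One caveat (my note, not part of the original answer): a collection accumulator still merges every per-row map on the driver when you read mutableMapAcc.value, so it does not reduce driver memory pressure compared with collect(). If the real goal is to keep the data off the driver, one option is a sketch like the following, which leaves the result distributed and joins it with whatever needs it; "r_path" is assumed here to be the lookup key.

// Sketch: keep the per-row maps distributed and key them for a cluster-side join
// instead of collecting them to the driver. `res` is the RDD of maps built above.
val keyedRes = res.map(m => (m("r_path").toString, m))

// Any other keyed RDD can then be enriched on the cluster, e.g.:
// val enriched = otherKeyedRdd.join(keyedRes)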

How to create a Spark SQL Dataframe with list of Map objects

I have multiple Map[String, String] in a List (Scala). For example:
val map1 = Map("EMP_NAME" -> "Ahmad", "DOB" -> "01-10-1991", "CITY" -> "Dubai")
val map2 = Map("EMP_NAME" -> "Rahul", "DOB" -> "06-12-1991", "CITY" -> "Mumbai")
val map3 = Map("EMP_NAME" -> "John", "DOB" -> "11-04-1996", "CITY" -> "Toronto")
val list = List(map1, map2, map3)
Now I want to create a single dataframe with something like this:
EMP_NAME DOB CITY
Ahmad 01-10-1991 Dubai
Rahul 06-12-1991 Mumbai
John 11-04-1996 Toronto
How do I achieve this?
You can do it like this:
import spark.implicits._

val df = list
  .map(m => (m.get("EMP_NAME"), m.get("DOB"), m.get("CITY")))
  .toDF("EMP_NAME", "DOB", "CITY")

df.show()
+--------+----------+-------+
|EMP_NAME| DOB| CITY|
+--------+----------+-------+
| Ahmad|01-10-1991| Dubai|
| Rahul|06-12-1991| Mumbai|
| John|11-04-1996|Toronto|
+--------+----------+-------+
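A small aside (my note, not the answerer's): m.get returns Option[String], which Spark encodes as a nullable string column, so a map missing a key simply becomes null in that column. If you would rather fail fast on a missing key, a sketch using apply instead of get:

// Sketch: m("KEY") throws NoSuchElementException on a missing key
// instead of silently producing a null column value.
val dfStrict = list
  .map(m => (m("EMP_NAME"), m("DOB"), m("CITY")))
  .toDF("EMP_NAME", "DOB", "CITY")
dfStrict.show()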
A slightly less specific approach, e.g.:
val map1 = Map("EMP_NAME" -> "Ahmad", "DOB" -> "01-10-1991", "CITY" -> "Dubai")
val map2 = Map("EMP_NAME" -> "John", "DOB" -> "01-10-1992", "CITY" -> "Mumbai")
///...
val list = List(map1, map2) // map3, ...
val RDDmap = sc.parallelize(list)

// Get cols dynamically
val cols = RDDmap.take(1).flatMap(x => x.keys)

// Each Map entry is a K,V pair
val df = RDDmap.map { value =>
  val list = value.values.toList
  (list(0), list(1), list(2))
}.toDF(cols: _*) // dynamic column names assigned

df.show(false)
returns:
+--------+----------+------+
|EMP_NAME|DOB |CITY |
+--------+----------+------+
|Ahmad |01-10-1991|Dubai |
|John |01-10-1992|Mumbai|
+--------+----------+------+
Or, to answer your sub-question, something like the following (at least I think this is what you are asking, but possibly not):
val RDDmap = sc.parallelize(List(
  Map("EMP_NAME" -> "Ahmad", "DOB" -> "01-10-1991", "CITY" -> "Dubai"),
  Map("EMP_NAME" -> "John", "DOB" -> "01-10-1992", "CITY" -> "Mumbai")))
...
// Get cols dynamically
val cols = RDDmap.take(1).flatMap(x => x.keys)

// Each Map entry is a K,V pair
val df = RDDmap.map { value =>
  val list = value.values.toList
  (list(0), list(1), list(2))
}.toDF(cols: _*) // dynamic column names assigned
You can build the list dynamically, of course, but you still need to assign the Map elements; see "Appending Data to List or any other collection Dynamically in Scala". I would just read it in from a file and be done with it.
import org.apache.spark.SparkContext
import org.apache.spark.sql._
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object DataFrameTest2 extends Serializable {

  var sparkSession: SparkSession = _
  var sparkContext: SparkContext = _
  var sqlContext: SQLContext = _

  def main(args: Array[String]): Unit = {
    sparkSession = SparkSession.builder().appName("TestMaster").master("local").getOrCreate()
    sparkContext = sparkSession.sparkContext
    val sqlContext = new org.apache.spark.sql.SQLContext(sparkContext)

    val map1 = Map("EMP_NAME" -> "Ahmad", "DOB" -> "01-10-1991", "CITY" -> "Dubai")
    val map2 = Map("EMP_NAME" -> "Rahul", "DOB" -> "06-12-1991", "CITY" -> "Mumbai")
    val map3 = Map("EMP_NAME" -> "John", "DOB" -> "11-04-1996", "CITY" -> "Toronto")
    val list = List(map1, map2, map3)

    // create your rows
    val rows = list.map(m => Row(m.values.toSeq: _*))

    // create the schema from the header
    val header = list.head.keys.toList
    val schema = StructType(header.map(fieldName => StructField(fieldName, StringType, true)))

    // create your rdd
    val rdd = sparkContext.parallelize(rows)

    // create your dataframe using rdd
    val df = sparkSession.createDataFrame(rdd, schema)
    df.show()
  }
}
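One caveat with the last two approaches (an added note, not from the original answers): they build each row from the map's values and rely on every map iterating its values in the same order as the header taken from the first map. A slightly safer sketch is to look each value up by header name, so columns cannot shift and missing keys become nulls (this reuses list, header, schema, sparkContext and sparkSession from the example above):

// Sketch: build each Row in explicit header order, with null for missing keys
val alignedRows = list.map(m => Row(header.map(h => m.getOrElse(h, null)): _*))
val alignedDf = sparkSession.createDataFrame(sparkContext.parallelize(alignedRows), schema)
alignedDf.show()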

Scala: How to print a list of map using scala

I am writing the below code:
val maplist = List(
  Map("id" -> "1", "Name" -> "divya"),
  Map("id" -> "2", "Name" -> "gaya")
)
val header = maplist.flatMap(_.keys).distinct
val data = maplist.flatMap(_.values)
println(header)
println(data)
I am getting the below output:
List(id, Name)
List(1, divya, 2, gaya)
However, I am expecting output like below:
id Name
1 Divya
2 gaya
In this case I have only 2 headers, but my maps may contain more than 2 headers. How do I display them all in rows? Please help me.
val maplist = List(
  Map("id" -> "1", "Name" -> "divya"),
  Map("id" -> "2", "Name" -> "gaya")
)
val header = maplist.flatMap(_.keys).distinct
val data = maplist.map(_.values)
println(header.mkString(" "))
data.foreach(x => println(x.mkString(" ")))
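If the maps can have different or missing keys (the "more than 2 headers" case from the question), a sketch that reuses maplist and header from above and aligns every row to the full header, printing a blank where a map has no value for that key:

// Sketch: look each value up by header name so columns never shift
println(header.mkString(" "))
maplist.foreach { m =>
  println(header.map(h => m.getOrElse(h, "")).mkString(" "))
}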

Extracting Values from a Map[String, Any] where Any is a Map itself

I have a Map that contains another Map in its value field. Here is an example of some records:
(8702168053422489,Map(sequence -> 5, id -> 8702168053422489, type -> List(AppExperience, Session), time -> 527780267713))
(8702170626376335,Map(trackingInfo -> Map(trackId -> 14183197, location -> Browse, listId -> 3393626f-98e3-4973-8d38-6b2fb17454b5_27331247X28X6839X1506087469573, videoId -> 80161702, rank -> 0, row -> 1, imageKey -> boxshot|AD_e01f4a50-7e2b-11e7-a327-12789459b73e|en, requestId -> 662d92c2-6a1c-41a6-8ac4-bf2ae9f1ce68-417037), id -> 8702170626376335, sequence -> 59, time -> 527780275219, type -> List(NavigationLevel, Session), view -> details))
(8702168347359313,Map(muting -> false, id -> 8702168347359313, level -> 1, type -> List(Volume)))
(8702168321522401,Map(utcOffset -> 3600, type -> List(TimeZone), id -> 8702168321522401))
(8702171157207449,Map(trackingInfo -> Map(trackId -> 14183197, location -> Browse, listId -> 3393626f-98e3-4973-8d38-6b2fb17454b5_27331247X28X6839X1506087469573, videoId -> 80161356, rank -> 0, row -> 1, imageKey -> boxshot|AD_e01f4a50-7e2b-11e7-a327-12789459b73e|en, requestId -> 662d92c2-6a1c-41a6-8ac4-bf2ae9f1ce68-417037), id -> 8702171157207449, sequence -> 72, startOffset -> 0, time -> 527780278061, type -> List(StartPlay, Action, Session)))
The records I am actually interested in are the ones that contain trackingInfo: records 2 and 5.
What I would like to do is extract those and then extract some of the keys from them, such as trackId, something like this:
val trackingInfo = json("trackingInfo").asInstanceOf[Map[String, Any]]
val videoId = trackingInfo("videoId").asInstanceOf[Int]
val id = json("id").asInstanceOf[Long]
val sequence = json("sequence").asInstanceOf[Int]
val time = json("time").asInstanceOf[Long]
val eventType = json.get("type").getOrElse(List("")).asInstanceOf[List[String]]
To extract the inner map, I've tried:
myMap.map {case (k,v: collection.Map[_,_]) => v.toMap case _ => }
This brings back the inner map, but as a scala.collection.immutable.Iterable[Any], which leaves me puzzled about how to extract values from it.
Any help is appreciated.
Let's say you have a real map (I cut it down a little bit):
val data: Map[ BigInt, Any ] = Map(
BigInt( 8702168053422489L ) -> Map("sequence" -> "5", "id" -> BigInt( 8702168053422489L ) ),
BigInt( 8702170626376335L ) -> Map("trackingInfo" -> Map("trackId" -> BigInt( 14183197 ), "location" -> "Browse" ), "id" -> BigInt( 8702170626376335L ) ),
BigInt( 8702168347359313L ) -> Map("id" -> BigInt( 8702168347359313L ) ),
BigInt( 8702168321522401L ) -> Map("id" -> BigInt( 8702168321522401L ) ),
BigInt( 8702171157207449L ) -> Map("trackingInfo" -> Map("trackId" -> BigInt( 14183197 ), "location" -> "Browse" ), "id" -> BigInt( 8702171157207449L ) )
)
And you want to get the records which have a trackingInfo key:
val onlyWithTracking = data.filter( ( row ) => {
  val recordToFilter = row._2 match {
    case trackRecord: Map[ String, Any ] => trackRecord
    case _ => Map( "trackId" -> Map() )
  }
  recordToFilter.contains( "trackingInfo" )
} )
And then process those records in some way:
onlyWithTracking.foreach( ( row ) => {
  val record = row._2 match {
    case trackRecord: Map[ String, Any ] => trackRecord
    case _ => Map( "trackingInfo" -> Map() )
  }
  val trackingInfo = record( "trackingInfo" ) match {
    case trackRow: Map[ String, Any ] => trackRow
    case _ => Map( "trackId" -> "error" )
  }
  val trackId = trackingInfo( "trackId" )
  println( trackId )
} )
With this pattern matching I am trying to ensure that using keys like trackingInfo or trackId is somewhat safe. You should implement a stricter approach.
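For the specific task of pulling out every trackId, a more compact sketch (assuming the same data value as above) is to flatMap over the values and pattern-match only on the shapes you care about:

// Sketch: extract trackId from every record that has a trackingInfo map
val trackIds = data.values.flatMap {
  case record: Map[ String, Any ] @unchecked =>
    record.get( "trackingInfo" ) match {
      case Some( trackingInfo: Map[ String, Any ] @unchecked ) => trackingInfo.get( "trackId" )
      case _ => None
    }
  case _ => None
}
trackIds.foreach( println ) // prints 14183197 twice for the sample data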