How to convert a Map into key-value pairs in Scala

Given below:
test: Array[scala.collection.immutable.Map[String,Any]] = Array(
Map(_c3 -> "foobar", _c5 -> "impt", _c0 -> Key1, _c4 -> 20.0, _c1 -> "next", _c2 -> 1.0),
Map(_c3 -> "high", _c5 -> "low", _c0 -> Key2, _c4 -> 19.0, _c1 -> "great", _c2 -> 0.0),
Map(_c3 -> "book", _c5 -> "game", _c0 -> Key3, _c4 -> 42.0, _c1 -> "name", _c2 -> 0.5)
)
How can I transform it into key-value pairs, keyed on _c0, that include only the String values? Like below:
Key1 foobar
Key1 impt
Key1 next
Key2 high
Key2 low
Key2 great
Key3 book
Key3 game
Key3 name

Please check this out
test.map(
  // drop every value that looks like a number (integer or decimal; note the escaped dot)
  _.filter(!_._2.toString.matches("[+-]?\\d+(\\.\\d+)?"))
).flatMap { data =>
  val key = data.getOrElse("_c0", "key_not_found")
  data
    .filter(_._1 != "_c0")
    .map { case (_, value) => key + " " + value.toString }
}
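A slightly more direct sketch of the same idea (the pairs name is just illustrative): instead of matching numbers with a regex, keep only the values that actually are Strings via a type pattern:

val pairs = test.flatMap { row =>
  val key = row.getOrElse("_c0", "key_not_found").toString
  // the type pattern keeps String values and silently drops numbers and other types
  (row - "_c0").collect { case (_, s: String) => s"$key $s" }
}
// e.g. Array("Key1 foobar", "Key1 impt", "Key1 next", ...) -- the order of values
// within a key may differ, since Map entries are unordered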

Try this method
import org.apache.spark.sql.functions._
// first extract, for each row, the key and all values that are Strings
val rdd = sc.parallelize(test).map(x =>
  x.getOrElse("_c0", "no key").toString -> (x - "_c0").values.collect { case s: String => s }.toList)
val df = spark.createDataFrame(rdd).toDF("key", "vals")
// use the explode function to turn each value into its own row
df.withColumn("vals", explode(col("vals"))).show()

How about:
test
  .map(row => row.getOrElse("_c0", "") -> (row - "_c0").values.filter(_.isInstanceOf[String]))
  .flatMap { case (key, innerList) => innerList.map(key -> _) }

Related

How to add complex logic to updateExpr in a Delta Table

I am updating a Delta Table with some incremental records. Two of the fields require just a plain update, but there is another one, a collection of maps, where I would like to concatenate the new values with all the existing ones instead of doing an update/replace.
import io.delta.tables.DeltaTable
import spark.implicits._

val historicalDF = Seq(
(1, 0, "Roger", Seq(Map("score" -> 5, "year" -> 2012)))
).toDF("id", "ts", "user", "scores")
historicalDF.write
.format("delta")
.mode("overwrite")
.save(table_path)
val hist_dt : DeltaTable = DeltaTable.forPath(spark, table_path)
val incrementalDF = Seq(
(1, 1, "Roger Rabbit", Seq(Map("score" -> 7, "year" -> 2013)))
).toDF("id", "ts", "user", "scores")
What I would like to have after the merge is something like this:
+---+---+------------+--------------------------------------------------------+
|id |ts |user |scores |
+---+---+------------+--------------------------------------------------------+
|1 |1 |Roger Rabbit|[{score -> 7, year -> 2013}, {score -> 5, year -> 2012}]|
+---+---+------------+--------------------------------------------------------+
What I tried to perform this concatenation is:
hist_dt
.as("ex")
.merge(incrementalDF.as("in"),
"ex.id = in.id")
.whenMatched
.updateExpr(
Map(
"ts" -> "in.ts",
"user" -> "in.user",
"scores" -> "in.scores" ++ "ex.scores"
)
)
.whenNotMatched
.insertAll()
.execute()
But the columns "in.scores" and "ex.scores" are interpreted as String, so I am getting the following error:
error: value ++ is not a member of (String, String)
Is there a way to add some complex logic to updateExpr?
Using update() instead of updateExpr() lets me pass the required columns to a UDF, so I can add more complex logic there:
import org.apache.spark.sql.functions.{col, udf}

def join_seq_map(incremental: Seq[Map[String,Integer]], existing: Seq[Map[String,Integer]]) : Seq[Map[String,Integer]] = {
(incremental, existing) match {
case ( null , null) => null
case ( null, e ) => e
case ( i , null) => i
case ( i , e ) => (i ++ e).distinct
}
}
def join_seq_map_udf = udf(join_seq_map _)
hist_dt
.as("ex")
.merge(
incrementalDF.as("in"),
"ex.id = in.id")
.whenMatched("ex.ts < in.ts")
.update(Map(
"ts" -> col("in.ts"),
"user" -> col("in.user"),
"scores" -> join_seq_map_udf(col("in.scores"), col("ex.scores"))
))
.whenNotMatched
.insertAll()
.execute()
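If you would rather stay with updateExpr, here is a minimal sketch (assuming Spark 2.4 or later, where the SQL concat function also accepts array columns such as this array of maps):

hist_dt
  .as("ex")
  .merge(
    incrementalDF.as("in"),
    "ex.id = in.id")
  .whenMatched("ex.ts < in.ts")
  .updateExpr(Map(
    "ts" -> "in.ts",
    "user" -> "in.user",
    // concat() keeps all elements from both array columns
    "scores" -> "concat(in.scores, ex.scores)"
  ))
  .whenNotMatched
  .insertAll()
  .execute()

Note that concat returns null if either side is null, so the UDF with explicit null handling above is the safer choice when one of the columns can be missing.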

How to avoid using collect in a Spark RDD in Scala?

I have a List and need to create a Map from it for further use. I am using an RDD, but with the use of collect() the job is failing on the cluster. Any help is appreciated.
Below is the sample code, going from the List to rdd.collect().
I have to use this Map data further, but how can I do that without collect?
This code creates a Map from the RDD (List) data. List format -> (asdfg/1234/wert, asdf)
//List Data to create Map
val listData = methodToGetListData(ListData).toList
//Creating RDD from above List
val rdd = sparkContext.makeRDD(listData)
implicit val formats = Serialization.formats(NoTypeHints)
val res = rdd
  .map(map => (getRPath(map._1), getAttribute(map._1), map._2))
  .groupBy(_._1)
  .map(tuple => {
    Map(
      "P_Id" -> "1234",
      "R_Time" -> "27-04-2020",
      "S_Time" -> "27-04-2020",
      "r_path" -> tuple._1,
      "S_Tag" -> "12345",
      tuple._1 -> (tuple._2.map(a => (a._2, a._3)).toMap)
    )
  })
res.collect()
Q: How do I use it without collect?
A: collect() moves all the data to the driver node; if the data is huge, never do that.
I don't know exactly what the use case for preparing a Map is, but it can be achieved with a built-in Spark API, i.e. collectionAccumulator. In detail:
collectionAccumulator[scala.collection.mutable.Map[String, String]]
Let's suppose this is your sample DataFrame and you want to make a Map out of it:
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|Item_Id|Parent_Id|object_class_instance|Received_Time|CablesName|CablesStatus|CablesHInfoID|CablesIndex|object_class|ServiceTag|Scan_Time|relation_tree |
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|-0909 |1234 |Cables-1 |23-12-2020 |LC |Installed |ABCD1234 |0 |Cables |ASDF123 |12345 |Start~>HInfo->Cables->Cables-1 |
|-09091 |1234111 |Cables-11 |23-12-2022 |LC1 |Installed1 |ABCD12341 |0 |Cables1 |ASDF1231 |123451 |Start~>HInfo->Cables->Cables-11|
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
From this you want to make a Map (the nested map I prefixed with the nestedmap key name in your example).
Below is a full example; have a look and modify it accordingly.
package examples
import org.apache.log4j.Level
object GrabMapbetweenClosure extends App {
val logger = org.apache.log4j.Logger.getLogger("org")
logger.setLevel(Level.WARN)
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder()
.master("local[*]")
.appName(this.getClass.getName)
.getOrCreate()
import spark.implicits._
var mutableMapAcc = spark.sparkContext.collectionAccumulator[scala.collection.mutable.Map[String, String]]("mutableMap")
val df = Seq(
("-0909", "1234", "Cables-1", "23-12-2020", "LC", "Installed", "ABCD1234"
, "0", "Cables", "ASDF123", "12345", "Start~>HInfo->Cables->Cables-1")
, ("-09091", "1234111", "Cables-11", "23-12-2022", "LC1", "Installed1", "ABCD12341"
, "0", "Cables1", "ASDF1231", "123451", "Start~>HInfo->Cables->Cables-11")
).toDF("Item_Id", "Parent_Id", "object_class_instance", "Received_Time", "CablesName", "CablesStatus", "CablesHInfoID",
"CablesIndex", "object_class", "ServiceTag", "Scan_Time", "relation_tree"
)
df.show(false)
df.foreachPartition { partition => // for performance sake I used foreachPartition
partition.foreach {
record => {
mutableMapAcc.add(scala.collection.mutable.Map(
"Item_Id" -> record.getAs[String]("Item_Id")
, "CablesStatus" -> record.getAs[String]("CablesStatus")
, "CablesHInfoID" -> record.getAs[String]("CablesHInfoID")
, "Parent_Id" -> record.getAs[String]("Parent_Id")
, "CablesIndex" -> record.getAs[String]("CablesIndex")
, "object_class_instance" -> record.getAs[String]("object_class_instance")
, "Received_Time" -> record.getAs[String]("Received_Time")
, "object_class" -> record.getAs[String]("object_class")
, "CablesName" -> record.getAs[String]("CablesName")
, "ServiceTag" -> record.getAs[String]("ServiceTag")
, "Scan_Time" -> record.getAs[String]("Scan_Time")
, "relation_tree" -> record.getAs[String]("relation_tree")
)
)
}
}
}
println("FinalMap : " + mutableMapAcc.value.toString)
}
Result :
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|Item_Id|Parent_Id|object_class_instance|Received_Time|CablesName|CablesStatus|CablesHInfoID|CablesIndex|object_class|ServiceTag|Scan_Time|relation_tree |
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|-0909 |1234 |Cables-1 |23-12-2020 |LC |Installed |ABCD1234 |0 |Cables |ASDF123 |12345 |Start~>HInfo->Cables->Cables-1 |
|-09091 |1234111 |Cables-11 |23-12-2022 |LC1 |Installed1 |ABCD12341 |0 |Cables1 |ASDF1231 |123451 |Start~>HInfo->Cables->Cables-11|
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
FinalMap : [Map(Scan_Time -> 123451, ServiceTag -> ASDF1231, Received_Time -> 23-12-2022, object_class_instance -> Cables-11, CablesHInfoID -> ABCD12341, Parent_Id -> 1234111, Item_Id -> -09091, CablesIndex -> 0, object_class -> Cables1, relation_tree -> Start~>HInfo->Cables->Cables-11, CablesName -> LC1, CablesStatus -> Installed1), Map(Scan_Time -> 12345, ServiceTag -> ASDF123, Received_Time -> 23-12-2020, object_class_instance -> Cables-1, CablesHInfoID -> ABCD1234, Parent_Id -> 1234, Item_Id -> -0909, CablesIndex -> 0, object_class -> Cables, relation_tree -> Start~>HInfo->Cables->Cables-1, CablesName -> LC, CablesStatus -> Installed)]
A similar problem was solved here.
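If the downstream use of the Map can itself stay distributed, an alternative sketch (the cols and mapsDS names are just illustrative) avoids bringing anything to the driver at all, by keeping one Map per row as a Dataset:

import spark.implicits._

// capture the column list outside the closure so the DataFrame itself
// is not dragged into the task closure
val cols = df.columns.toSeq
val mapsDS = df.map(row => row.getValuesMap[String](cols))
// mapsDS: Dataset[Map[String, String]] -- stays on the executors and can be
// used in further transformations or joins instead of being collected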

toMap on List of List generates error

Given that this example:
val myList = List("age=21", "name=xyz", "profession=Tester", "city=cuba", "age=43", "name=abc", "profession=Programmer", "city=wellington")
val myMap = myList.map(text => text.split("=")).map(a => (a(0) -> a(1))).toMap
works fine, returning:
myList: List[String] = List(age=21, name=xyz, profession=Tester, city=cuba, age=43, name=abc, profession=Programmer, city=wellington)
myMap: scala.collection.immutable.Map[String,String] = Map(age -> 43, name -> abc, profession -> Programmer, city -> wellington)
I am wondering why the following, which is just N sets of values:
val myList = List("age=21", "name=xyz", "profession=Tester", "city=cuba", "age=43", "name=abc", "profession=Programmer", "city=Sydney")
val myMap = myList.grouped(4).toList.map(text => text.split("=")).map(a => (a(0) -> a(1))).toMap
generates the error below, and how to solve it:
notebook:9: error: value split is not a member of List[String]
val myMap = myList.grouped(4).toList.map(text => text.split("=")).map(a => (a(0) -> a(1))).toMap
I must be missing something elementary here.
myList.grouped(4).toList returns a nested list – List[List[String]].
To transform the grouped sublists into Maps:
val myMap = myList.grouped(4).toList.
map(_.map(_.split("=")).map(a => (a(0) -> a(1))).toMap)
// myMap: List[scala.collection.immutable.Map[String,String]] = List(
// Map(age -> 21, name -> xyz, profession -> Tester, city -> cuba),
// Map(age -> 43, name -> abc, profession -> Programmer, city -> Sydney)
// )
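If some entries might not be well-formed key=value pairs, a small defensive sketch (an assumption beyond the original question) drops anything that does not split into exactly two parts:

val myMaps: List[Map[String, String]] =
  myList.grouped(4).toList.map {
    _.flatMap { text =>
      // split with a limit of 2 so values containing '=' stay intact
      text.split("=", 2) match {
        case Array(k, v) => Some(k -> v)
        case _           => None
      }
    }.toMap
  }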

Extracting Values from a Map[String, Any] where Any is a Map itself

I have a Map which contains another Map in its value field. Here is an example of some records:
(8702168053422489,Map(sequence -> 5, id -> 8702168053422489, type -> List(AppExperience, Session), time -> 527780267713))
(8702170626376335,Map(trackingInfo -> Map(trackId -> 14183197, location -> Browse, listId -> 3393626f-98e3-4973-8d38-6b2fb17454b5_27331247X28X6839X1506087469573, videoId -> 80161702, rank -> 0, row -> 1, imageKey -> boxshot|AD_e01f4a50-7e2b-11e7-a327-12789459b73e|en, requestId -> 662d92c2-6a1c-41a6-8ac4-bf2ae9f1ce68-417037), id -> 8702170626376335, sequence -> 59, time -> 527780275219, type -> List(NavigationLevel, Session), view -> details))
(8702168347359313,Map(muting -> false, id -> 8702168347359313, level -> 1, type -> List(Volume)))
(8702168321522401,Map(utcOffset -> 3600, type -> List(TimeZone), id -> 8702168321522401))
(8702171157207449,Map(trackingInfo -> Map(trackId -> 14183197, location -> Browse, listId -> 3393626f-98e3-4973-8d38-6b2fb17454b5_27331247X28X6839X1506087469573, videoId -> 80161356, rank -> 0, row -> 1, imageKey -> boxshot|AD_e01f4a50-7e2b-11e7-a327-12789459b73e|en, requestId -> 662d92c2-6a1c-41a6-8ac4-bf2ae9f1ce68-417037), id -> 8702171157207449, sequence -> 72, startOffset -> 0, time -> 527780278061, type -> List(StartPlay, Action, Session)))
The records I'm actually interested in are the ones that contain trackingInfo, records 2 and 5.
What I would like to do is extract those and then pull out some of the keys from them, such as trackId. Something like this:
val trackingInfo = json("trackingInfo").asInstanceOf[Map[String, Any]]
val videoId = trackingInfo("videoId").asInstanceOf[Int]
val id = json("id").asInstanceOf[Long]
val sequence = json("sequence").asInstanceOf[Int]
val time = json("time").asInstanceOf[Long]
val eventType = json.get("type").getOrElse(List("")).asInstanceOf[List[String]]
To extract the inner map, I've tried:
myMap.map {case (k,v: collection.Map[_,_]) => v.toMap case _ => }
This brings back the inner map, but as a scala.collection.immutable.Iterable[Any], which leaves me puzzled about how to extract values from it.
Any help is appreciated.
Let's say you have a real map (I cut it down a little bit):
val data: Map[ BigInt, Any ] = Map(
BigInt( 8702168053422489L ) -> Map("sequence" -> "5", "id" -> BigInt( 8702168053422489L ) ),
BigInt( 8702170626376335L ) -> Map("trackingInfo" -> Map("trackId" -> BigInt( 14183197 ), "location" -> "Browse" ), "id" -> BigInt( 8702170626376335L ) ),
BigInt( 8702168347359313L ) -> Map("id" -> BigInt( 8702168347359313L ) ),
BigInt( 8702168321522401L ) -> Map("id" -> BigInt( 8702168321522401L ) ),
BigInt( 8702171157207449L ) -> Map("trackingInfo" -> Map("trackId" -> BigInt( 14183197 ), "location" -> "Browse" ), "id" -> BigInt( 8702171157207449L ) )
)
And you want to get the records which have a trackingInfo key:
val onlyWithTracking = data.filter( ( row ) => {
val recordToFilter = row._2 match {
case trackRecord: Map[ String, Any ] => trackRecord
case _ => Map( "trackId" -> Map() )
}
recordToFilter.contains( "trackingInfo" )
} )
And then process those records in some way
onlyWithTracking.foreach( ( row ) => {
val record = row._2 match {
case trackRecord: Map[ String, Any ] => trackRecord
case _ => Map( "trackingInfo" -> Map() )
}
val trackingInfo = record( "trackingInfo" ) match {
case trackRow: Map[ String, Any ] => trackRow
case _ => Map( "trackId" -> "error" )
}
val trackId = trackingInfo( "trackId" )
println( trackId )
} )
With this pattern matching I'm trying to ensure that using keys like trackingInfo or trackId is somewhat safe. You should implement a stricter approach.
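For comparison, here is a more compact sketch of the same extraction over the data map defined above (the @unchecked annotations only silence the type-erasure warnings that the pattern matches above trigger as well):

val trackIds = data.values.flatMap {
  case record: Map[String, Any] @unchecked =>
    // only rows whose trackingInfo value is itself a Map contribute a trackId
    record.get("trackingInfo").collect {
      case tracking: Map[String, Any] @unchecked => tracking.get("trackId")
    }.flatten
  case _ => None
}
// for the sample data above this yields the two trackId values, 14183197 and 14183197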

Get words and values Between Parentheses in Scala-Spark

Here is my data:
doc1: (Does,1) (just,-1) (what,0) (was,1) (needed,1) (to,0) (charge,1) (the,0) (Macbook,1)
doc2: (Pro,1) (G4,-1) (13inch,0) (laptop,1)
doc3: (Only,1) (beef,0) (was,1) (it,0) (no,-1) (longer,0) (lights,-1) (up,0) (the,-1)
etc...
I want to extract the words and values and then store them in two separate matrices: matrix_1 is (docID, words) and matrix_2 is (docID, values).
input.txt
=========
doc1: (Does,1) (just,-1) (what,0) (was,1) (needed,1) (to,0) (charge,1) (the,0) (Macbook,1)
doc2: (Pro,1) (G4,-1) (13inch,0) (laptop,1)
doc3: (Only,1) (beef,0) (was,1) (it,0) (no,-1) (longer,0) (lights,-1) (up,0) (the,-1)
val inputText = sc.textFile("input.txt")
val digested = inputText.map(line => line.split(":"))
  .map(row => row(0) -> row(1).trim.split(" "))
  .map(row => row._1 -> row._2.map(_.stripPrefix("(").stripSuffix(")").trim.split(",")))
val matrix_1 = digested.map(row => row._1 -> row._2.map(a => a(0)))
val matrix_2 = digested.map(row => row._1 -> row._2.map(a => a(1)))
gives:
List(
(doc1 -> Does,just,what,was,needed,to,charge,the,Macbook),
(doc2 -> Pro,G4,13inch,laptop),
(doc3 -> Only,beef,was,it,no,longer,lights,up,the)
)
List(
(doc1 -> 1,-1,0,1,1,0,1,0,1),
(doc2 -> 1,-1,0,1),
(doc3 -> 1,0,1,0,-1,0,-1,0,-1)
)
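As a more defensive variant (just a sketch; the token regex and the digested2/words/values names are my own additions), the (word,value) tokens can also be pulled out with a regular expression, which tolerates extra whitespace between tokens:

val token = """\(([^,]+),(-?\d+)\)""".r

val digested2 = inputText.map { line =>
  val Array(docId, rest) = line.split(":", 2)
  // every "(word,value)" token becomes a (word, value) pair
  docId.trim -> token.findAllMatchIn(rest).map(m => (m.group(1), m.group(2).toInt)).toArray
}

val words  = digested2.mapValues(_.map(_._1))   // (docID, words)
val values = digested2.mapValues(_.map(_._2))   // (docID, values)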