Given the data below:
test: Array[scala.collection.immutable.Map[String,Any]] = Array(
Map(_c3 -> "foobar", _c5 -> "impt", _c0 -> Key1, _c4 -> 20.0, _c1 -> "next", _c2 -> 1.0),
Map(_c3 -> "high", _c5 -> "low", _c0 -> Key2, _c4 -> 19.0, _c1 -> "great", _c2 -> 0.0),
Map(_c3 -> "book", _c5 -> "game", _c0 -> Key3, _c4 -> 42.0, _c1 -> "name", _c2 -> 0.5)
)
How can I transform it into key-value pairs, keyed on _c0, that include only the String values?
Like below:
Key1 foobar
Key1 impt
Key1 next
Key2 high
Key2 low
Key2 great
Key3 book
Key3 game
Key3 name
Please check this out:
test.map(
  _.filter(!_._2.toString.matches("[+-]?\\d+\\.?\\d+"))
).flatMap { data =>
  val key = data.getOrElse("_c0", "key_not_found")
  data
    .filter(_._1 != "_c0")
    .map(entry => key + " " + entry._2.toString)
}
Try this method:
import org.apache.spark.sql.functions._

// first extract all values which are Strings
val rdd = sc.parallelize(test).map(x =>
  x.getOrElse("_c0", "no key").toString -> (x - "_c0").values.collect { case s: String => s }.toSeq
)
val df = spark.createDataFrame(rdd).toDF("key", "vals")
// use the explode function to add one row per value
df.withColumn("vals", explode(col("vals"))).show()
How about:
test
  .map(row => row.getOrElse("_c0", "") -> (row - "_c0").values.filter(_.isInstanceOf[String]))
  .flatMap { case (key, innerList) => innerList.map(key -> _) }
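If you need the exact "Key value" lines from the question, a minimal follow-up sketch (binding the result of the expression above to a value, here called pairs, which is not in the original) would be:

// Bind the flattened (key, value) pairs and print one "Key value" line each.
val pairs = test
  .map(row => row.getOrElse("_c0", "") -> (row - "_c0").values.filter(_.isInstanceOf[String]))
  .flatMap { case (key, innerList) => innerList.map(key -> _) }

pairs.foreach { case (key, value) => println(s"$key $value") }

Note that iteration order over an immutable Map is not guaranteed, so the lines may come out in a different order than shown in the question.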
I am updating a Delta table with some incremental records. Two of the fields require just a plain update, but there is another one that is a collection of maps, and for that one I would like to concatenate the existing values instead of doing an update/replace.
val historicalDF = Seq(
(1, 0, "Roger", Seq(Map("score" -> 5, "year" -> 2012)))
).toDF("id", "ts", "user", "scores")
historicalDF.write
.format("delta")
.mode("overwrite")
.save(table_path)
val hist_dt : DeltaTable = DeltaTable.forPath(spark, table_path)
val incrementalDF = Seq(
(1, 1, "Roger Rabbit", Seq(Map("score" -> 7, "year" -> 2013)))
).toDF("id", "ts", "user", "scores")
What I would like to have after the merge is something like this:
+---+---+------------+--------------------------------------------------------+
|id |ts |user |scores |
+---+---+------------+--------------------------------------------------------+
|1  |1  |Roger Rabbit|[{score -> 7, year -> 2013}, {score -> 5, year -> 2012}]|
+---+---+------------+--------------------------------------------------------+
What I tried in order to perform this concatenation is:
hist_dt
.as("ex")
.merge(incrementalDF.as("in"),
"ex.id = in.id")
.whenMatched
.updateExpr(
Map(
"ts" -> "in.ts",
"user" -> "in.user",
"scores" -> "in.scores" ++ "ex.scores"
)
)
.whenNotMatched
.insertAll()
.execute()
But inside updateExpr, "in.scores" and "ex.scores" are just String literals, so I am getting the following error:
error: value ++ is not a member of (String, String)
Is there a way to add some complex logic to updateExpr?
Using update() instead of updateExpr() lets me pass the required columns to a UDF, so I can add more complex logic there:
import org.apache.spark.sql.functions.{col, udf}

def join_seq_map(incremental: Seq[Map[String, Integer]], existing: Seq[Map[String, Integer]]): Seq[Map[String, Integer]] = {
(incremental, existing) match {
case ( null , null) => null
case ( null, e ) => e
case ( i , null) => i
case ( i , e ) => (i ++ e).distinct
}
}
def join_seq_map_udf = udf(join_seq_map _)
hist_dt
.as("ex")
.merge(
incrementalDF.as("in"),
"ex.id = in.id")
.whenMatched("ex.ts < in.ts")
.update(Map(
"ts" -> col("in.ts"),
"user" -> col("in.user"),
"scores" -> join_seq_map_udf(col("in.scores"), col("ex.scores"))
))
.whenNotMatched
.insertAll()
.execute()
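For this particular case an alternative worth checking is to stay with updateExpr and use Spark SQL's concat, which also accepts array columns on Spark 2.4+. This is only a sketch under that assumption, and unlike the UDF above it does not deduplicate the maps:

hist_dt
  .as("ex")
  .merge(incrementalDF.as("in"), "ex.id = in.id")
  .whenMatched("ex.ts < in.ts")
  .updateExpr(Map(
    "ts"     -> "in.ts",
    "user"   -> "in.user",
    // concat works on array columns in Spark 2.4+, appending the existing
    // scores after the incoming ones (no distinct is applied here)
    "scores" -> "concat(in.scores, ex.scores)"
  ))
  .whenNotMatched
  .insertAll()
  .execute()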
I have a List and need to create a Map from it for further use. I am using an RDD, but when I call collect() the job fails on the cluster. Any help is appreciated.
Below is the sample code going from the List to rdd.collect().
I have to use this Map data further, but how can I do so without collect?
This code creates a Map from the RDD (List) data. List format -> (asdfg/1234/wert,asdf)
//List Data to create Map
val listData = methodToGetListData(ListData).toList
//Creating RDD from above List
val rdd = sparkContext.makeRDD(listData)
implicit val formats = Serialization.formats(NoTypeHints)
val res = rdd
.map(map => (getRPath(map._1), getAttribute(map._1), map._2))
.groupBy(_._1)
.map(tuple => {
Map(
"P_Id" -> "1234",
"R_Time" -> "27-04-2020",
"S_Time" -> "27-04-2020",
"r_path" -> tuple._1,
"S_Tag" -> "12345,
tuple._1 -> (tuple._2.map(a => (a._2, a._3)).toMap)
)
})
res.collect()
Q: How can I use the data without collect()?
Answer: collect() moves all of the data to the driver node; if the data is huge, never do that.
I don't know exactly what the use case for preparing the map is, but it can be achieved with the built-in Spark API collectionAccumulator. In detail:
collectionAccumulator[scala.collection.mutable.Map[String, String]]
Let's suppose this is your sample DataFrame and you want to make a map from it.
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|Item_Id|Parent_Id|object_class_instance|Received_Time|CablesName|CablesStatus|CablesHInfoID|CablesIndex|object_class|ServiceTag|Scan_Time|relation_tree |
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|-0909 |1234 |Cables-1 |23-12-2020 |LC |Installed |ABCD1234 |0 |Cables |ASDF123 |12345 |Start~>HInfo->Cables->Cables-1 |
|-09091 |1234111 |Cables-11 |23-12-2022 |LC1 |Installed1 |ABCD12341 |0 |Cables1 |ASDF1231 |123451 |Start~>HInfo->Cables->Cables-11|
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
From this you want to make a map (the nested map, which I prefixed with the nestedmap key name in your example); then...
Below is the full example; have a look and modify it accordingly.
package examples
import org.apache.log4j.Level
object GrabMapbetweenClosure extends App {
val logger = org.apache.log4j.Logger.getLogger("org")
logger.setLevel(Level.WARN)
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder()
.master("local[*]")
.appName(this.getClass.getName)
.getOrCreate()
import spark.implicits._
var mutableMapAcc = spark.sparkContext.collectionAccumulator[scala.collection.mutable.Map[String, String]]("mutableMap")
val df = Seq(
("-0909", "1234", "Cables-1", "23-12-2020", "LC", "Installed", "ABCD1234"
, "0", "Cables", "ASDF123", "12345", "Start~>HInfo->Cables->Cables-1")
, ("-09091", "1234111", "Cables-11", "23-12-2022", "LC1", "Installed1", "ABCD12341"
, "0", "Cables1", "ASDF1231", "123451", "Start~>HInfo->Cables->Cables-11")
).toDF("Item_Id", "Parent_Id", "object_class_instance", "Received_Time", "CablesName", "CablesStatus", "CablesHInfoID",
"CablesIndex", "object_class", "ServiceTag", "Scan_Time", "relation_tree"
)
df.show(false)
df.foreachPartition { partition => // for performance sake I used foreachPartition
partition.foreach {
record => {
mutableMapAcc.add(scala.collection.mutable.Map(
"Item_Id" -> record.getAs[String]("Item_Id")
, "CablesStatus" -> record.getAs[String]("CablesStatus")
, "CablesHInfoID" -> record.getAs[String]("CablesHInfoID")
, "Parent_Id" -> record.getAs[String]("Parent_Id")
, "CablesIndex" -> record.getAs[String]("CablesIndex")
, "object_class_instance" -> record.getAs[String]("object_class_instance")
, "Received_Time" -> record.getAs[String]("Received_Time")
, "object_class" -> record.getAs[String]("object_class")
, "CablesName" -> record.getAs[String]("CablesName")
, "ServiceTag" -> record.getAs[String]("ServiceTag")
, "Scan_Time" -> record.getAs[String]("Scan_Time")
, "relation_tree" -> record.getAs[String]("relation_tree")
)
)
}
}
}
println("FinalMap : " + mutableMapAcc.value.toString)
}
Result :
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|Item_Id|Parent_Id|object_class_instance|Received_Time|CablesName|CablesStatus|CablesHInfoID|CablesIndex|object_class|ServiceTag|Scan_Time|relation_tree |
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|-0909 |1234 |Cables-1 |23-12-2020 |LC |Installed |ABCD1234 |0 |Cables |ASDF123 |12345 |Start~>HInfo->Cables->Cables-1 |
|-09091 |1234111 |Cables-11 |23-12-2022 |LC1 |Installed1 |ABCD12341 |0 |Cables1 |ASDF1231 |123451 |Start~>HInfo->Cables->Cables-11|
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
FinalMap : [Map(Scan_Time -> 123451, ServiceTag -> ASDF1231, Received_Time -> 23-12-2022, object_class_instance -> Cables-11, CablesHInfoID -> ABCD12341, Parent_Id -> 1234111, Item_Id -> -09091, CablesIndex -> 0, object_class -> Cables1, relation_tree -> Start~>HInfo->Cables->Cables-11, CablesName -> LC1, CablesStatus -> Installed1), Map(Scan_Time -> 12345, ServiceTag -> ASDF123, Received_Time -> 23-12-2020, object_class_instance -> Cables-1, CablesHInfoID -> ABCD1234, Parent_Id -> 1234, Item_Id -> -0909, CablesIndex -> 0, object_class -> Cables, relation_tree -> Start~>HInfo->Cables->Cables-1, CablesName -> LC, CablesStatus -> Installed)]
A similar problem was solved here.
Given that this example:
val myList = List("age=21", "name=xyz", "profession=Tester", "city=cuba", "age=43", "name=abc", "profession=Programmer", "city=wellington")
val myMap = myList.map(text => text.split("=")).map(a => (a(0) -> a(1))).toMap
works fine, returning:
myList: List[String] = List(age=21, name=xyz, profession=Tester, city=cuba, age=43, name=abc, profession=Programmer, city=wellington)
myMap: scala.collection.immutable.Map[String,String] = Map(age -> 43, name -> abc, profession -> Programmer, city -> wellington)
I am wondering why the following, which is just N sets of values:
val myList = List("age=21", "name=xyz", "profession=Tester", "city=cuba", "age=43", "name=abc", "profession=Programmer", "city=Sydney")
val myMap = myList.grouped(4).toList.map(text => text.split("=")).map(a => (a(0) -> a(1))).toMap
generates the error below, and how to solve it:
notebook:9: error: value split is not a member of List[String]
val myMap = myList.grouped(4).toList.map(text => text.split("=")).map(a => (a(0) -> a(1))).toMap
I must be missing something elementary here.
myList.grouped(4).toList returns a nested list, List[List[String]], so split is being called on a List[String] rather than on a String.
To transform the grouped sublists into Maps:
val myMap = myList.grouped(4).toList.
map(_.map(_.split("=")).map(a => (a(0) -> a(1))).toMap)
// myMap: List[scala.collection.immutable.Map[String,String]] = List(
// Map(age -> 21, name -> xyz, profession -> Tester, city -> cuba),
// Map(age -> 43, name -> abc, profession -> Programmer, city -> Sydney)
// )
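A quick usage check of the result (a sketch, assuming the keys shown above):

// pull one field out of every grouped record
val ages = myMap.map(m => m("age"))
// ages: List[String] = List(21, 43)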
I have a Map which contains another Map in its value field. Here is an example of some records:
(8702168053422489,Map(sequence -> 5, id -> 8702168053422489, type -> List(AppExperience, Session), time -> 527780267713))
(8702170626376335,Map(trackingInfo -> Map(trackId -> 14183197, location -> Browse, listId -> 3393626f-98e3-4973-8d38-6b2fb17454b5_27331247X28X6839X1506087469573, videoId -> 80161702, rank -> 0, row -> 1, imageKey -> boxshot|AD_e01f4a50-7e2b-11e7-a327-12789459b73e|en, requestId -> 662d92c2-6a1c-41a6-8ac4-bf2ae9f1ce68-417037), id -> 8702170626376335, sequence -> 59, time -> 527780275219, type -> List(NavigationLevel, Session), view -> details))
(8702168347359313,Map(muting -> false, id -> 8702168347359313, level -> 1, type -> List(Volume)))
(8702168321522401,Map(utcOffset -> 3600, type -> List(TimeZone), id -> 8702168321522401))
(8702171157207449,Map(trackingInfo -> Map(trackId -> 14183197, location -> Browse, listId -> 3393626f-98e3-4973-8d38-6b2fb17454b5_27331247X28X6839X1506087469573, videoId -> 80161356, rank -> 0, row -> 1, imageKey -> boxshot|AD_e01f4a50-7e2b-11e7-a327-12789459b73e|en, requestId -> 662d92c2-6a1c-41a6-8ac4-bf2ae9f1ce68-417037), id -> 8702171157207449, sequence -> 72, startOffset -> 0, time -> 527780278061, type -> List(StartPlay, Action, Session)))
The actual records I'm interested in are the ones that contain trackingInfo, i.e. records 2 and 5.
What I would like to do is extract those and then extract some of the keys from them, such as trackId. Something like this:
val trackingInfo = json("trackingInfo").asInstanceOf[Map[String, Any]]
val videoId = trackingInfo("videoId").asInstanceOf[Int]
val id = json("id").asInstanceOf[Long]
val sequence = json("sequence").asInstanceOf[Int]
val time = json("time").asInstanceOf[Long]
val eventType = json.get("type").getOrElse(List("")).asInstanceOf[List[String]]
To extract the inner map, I've tried:
myMap.map {case (k,v: collection.Map[_,_]) => v.toMap case _ => }
This brings back the inner map, but as a scala.collection.immutable.Iterable[Any], which leaves me puzzled about how to extract values from it.
Any help is appreciated.
Let's say you have a real map (I cut it down a little bit):
val data: Map[ BigInt, Any ] = Map(
BigInt( 8702168053422489L ) -> Map("sequence" -> "5", "id" -> BigInt( 8702168053422489L ) ),
BigInt( 8702170626376335L ) -> Map("trackingInfo" -> Map("trackId" -> BigInt( 14183197 ), "location" -> "Browse" ), "id" -> BigInt( 8702170626376335L ) ),
BigInt( 8702168347359313L ) -> Map("id" -> BigInt( 8702168347359313L ) ),
BigInt( 8702168321522401L ) -> Map("id" -> BigInt( 8702168321522401L ) ),
BigInt( 8702171157207449L ) -> Map("trackingInfo" -> Map("trackId" -> BigInt( 14183197 ), "location" -> "Browse" ), "id" -> BigInt( 8702171157207449L ) )
)
And you want to get the records which have a trackingInfo key:
val onlyWithTracking = data.filter( ( row ) => {
val recordToFilter = row._2 match {
case trackRecord: Map[ String, Any ] => trackRecord
case _ => Map( "trackId" -> Map() )
}
recordToFilter.contains( "trackingInfo" )
} )
And then process those records in some way:
onlyWithTracking.foreach( ( row ) => {
val record = row._2 match {
case trackRecord: Map[ String, Any ] => trackRecord
case _ => Map( "trackingInfo" -> Map() )
}
val trackingInfo = record( "trackingInfo" ) match {
case trackRow: Map[ String, Any ] => trackRow
case _ => Map( "trackId" -> "error" )
}
val trackId = trackingInfo( "trackId" )
println( trackId )
} )
With this pattern matching I'm trying to ensure that accessing keys like trackingInfo or trackId is somewhat safe. You should implement a stricter approach.
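A more compact variant of the same idea (a sketch against the data value defined above) is to use collect, which does the trackingInfo filter and the extraction in a single pattern match:

// keep only rows whose value is a Map containing "trackingInfo",
// then pull the trackId out of the nested map in the same step
val trackIds = data.collect {
  case (id, record: Map[String, Any]) if record.contains("trackingInfo") =>
    record("trackingInfo") match {
      case tracking: Map[String, Any] => id -> tracking.get("trackId")
      case _                          => id -> None
    }
}
trackIds.foreach(println)

Like the version above, this relies on unchecked type patterns, so the compiler will warn about type erasure.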
Here is my data:
doc1: (Does,1) (just,-1) (what,0) (was,1) (needed,1) (to,0) (charge,1) (the,0) (Macbook,1)
doc2: (Pro,1) (G4,-1) (13inch,0) (laptop,1)
doc3: (Only,1) (beef,0) (was,1) (it,0) (no,-1) (longer,0) (lights,-1) (up,0) (the,-1)
etc...
I want to extract the words and the values and then store them in two separate matrices: matrix_1 holds (docID, words) and matrix_2 holds (docID, values).
input.txt
=========
doc1: (Does,1) (just,-1) (what,0) (was,1) (needed,1) (to,0) (charge,1) (the,0) (Macbook,1)
doc2: (Pro,1) (G4,-1) (13inch,0) (laptop,1)
doc3: (Only,1) (beef,0) (was,1) (it,0) (no,-1) (longer,0) (lights,-1) (up,0) (the,-1)
val input = sc.textFile("input.txt")

val digested = input.map(line => line.split(":"))
  .map(row => row(0) -> row(1).trim.split(" "))
  .map(row => row._1 -> row._2.map(_.stripPrefix("(").stripSuffix(")").trim.split(",")))

val matrix_1 = digested.map(row => row._1 -> row._2.map(a => a(0)))
val matrix_2 = digested.map(row => row._1 -> row._2.map(a => a(1)))
which gives, for matrix_1 and matrix_2 respectively:
List(
(doc1 -> Does,just,what,was,needed,to,charge,the,Macbook),
(doc2 -> Pro,G4,13inch,laptop),
(doc3 -> Only,beef,was,it,no,longer,lights,up,the)
)
List(
(doc1 -> 1,-1,0,1,1,0,1,0,1),
(doc2 -> 1,-1,0,1),
(doc3 -> 1,0,1,0,-1,0,-1,0,-1)
)