I have written the following code
val list = List(
Map("empid" -> "12", "empName" -> "Rohan", "depId" -> "201"),
Map("empid" -> "13", "empName" -> "swathi", "depId" -> "202")
).flatten.toMap
val mapRDD= sc.parallelize(Seq(list))
val columns=mapRDD.take(1).flatMap(a=>a.keys)
val columnval=mapRDD.take(2).flatMap(a=>a.keys)
val resultantDF=mapRDD.map{value=>
val list=value.values.toList
(list(0),list(1),list(2))
}.toDF(columns:_*)
resultantDF.show()
I am expecting the below output,
+-----+-------+-----+
|empid|empName|depId|
+-----+-------+-----+
|   12|  Rohan|  201|
|   13| swathi|  202|
+-----+-------+-----+
but I am getting only,
+-----+-------+-----+
|empid|empName|depId|
+-----+-------+-----+
|   13| swathi|  202|
+-----+-------+-----+
Please let me know where I am making a mistake.
The problem lies in your first line only,
scala> val list = List(
| Map("empid" -> "12", "empName" -> "Rohan", "depId" -> "201"),
| Map("empid" -> "13", "empName" -> "swathi", "depId" -> "202")
| ).flatten.toMap
// list: scala.collection.immutable.Map[String,String] = Map(empid -> 13, empName -> swathi, depId -> 202)
Your list actually ends up becoming a Map. And a Map can have only 1 value for each key.
Let's do the first line step by step,
So, first you created a list of maps,
scala> val listOfMaps = List(
| Map("empid" -> "12", "empName" -> "Rohan", "depId" -> "201"),
| Map("empid" -> "13", "empName" -> "swathi", "depId" -> "202")
| )
// listOfMaps: List[scala.collection.immutable.Map[String,String]] = List(Map(empid -> 12, empName -> Rohan, depId -> 201), Map(empid -> 13, empName -> swathi, depId -> 202))
Then, you flattened the maps inside listOfMaps, which results in a list of key-value pairs.
scala> val flattenedListOfMaps = listOfMaps.flatten
// flattenedListOfMaps: List[(String, String)] = List((empid,12), (empName,Rohan), (depId,201), (empid,13), (empName,swathi), (depId,202))
Now, you are converting it to a Map using toMap, which will keep on overriding the values of keys and result in a Map with unique keys,
scala> val yourMap = flattenedListOfMaps.toMap
// yourMap: scala.collection.immutable.Map[String,String] = Map(empid -> 13, empName -> swathi, depId -> 202)
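(As an aside, if you really did need a single Map that keeps every value for a duplicated key, a sketch of one option is to group the pairs instead of calling toMap; note the entry order of the resulting Map is not guaranteed:)
scala> val grouped = flattenedListOfMaps.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2) }
// grouped: Map(empid -> List(12, 13), empName -> List(Rohan, swathi), depId -> List(201, 202))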
As already pointed out in the previous answer and comment, at the moment your list variable is actually a map (which is confusing at least).
What you probably want initially as input is a list.
Hence what you need is:
1. Get rid of .flatten.toMap:
val list = List(
Map("empid" -> "12", "empName" -> "Rohan", "depId" -> "201"),
Map("empid" -> "13", "empName" -> "swathi", "depId" -> "202")
)
2. Also, when calling sc.parallelize you don't need to wrap the original input in a separate Seq (in fact, keeping that wrapper would now cause a compile error). So you also need to change it like this:
val mapRDD = sc.parallelize(list)
After making just those two changes you will get the expected result, i.e. two records shown in the console output.
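Putting both changes together, the corrected snippet (a sketch of the full flow in spark-shell, where sc and the toDF implicits are already available) would look like this:
val list = List(
  Map("empid" -> "12", "empName" -> "Rohan", "depId" -> "201"),
  Map("empid" -> "13", "empName" -> "swathi", "depId" -> "202")
)
val mapRDD = sc.parallelize(list)
val columns = mapRDD.take(1).flatMap(a => a.keys)
val resultantDF = mapRDD.map { value =>
  val list = value.values.toList
  (list(0), list(1), list(2))
}.toDF(columns: _*)
resultantDF.show()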
Related
val mapa = Map("a" -> Array(Map("b" -> "c", "d" -> Array("e"))))
val mapa2 = Map("a" -> Array(Map("b" -> "c", "d" -> Array("e"))))
Is there a way to get the keys and values from both of these identical maps and compare them?
Or how can I get all the keys from a map with such a structure?
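A rough sketch (not from the original post, assuming the nested values are only Maps, Arrays, or plain leaves) of one way to collect every key from such a structure and compare the two maps:
// Recursively collect every key appearing anywhere in a nested Map/Array structure
def allKeys(value: Any): Set[String] = value match {
  case m: Map[_, _] => m.keys.map(_.toString).toSet ++ m.values.flatMap(allKeys)
  case a: Array[_]  => a.flatMap(allKeys).toSet
  case _            => Set.empty
}
// allKeys(mapa) == allKeys(mapa2)  // compares the full key sets of both maps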
I have this Seq[Map[String, String]] :
val val1 = Seq(
Map("Name" -> "Heidi",
"City" -> "Paris",
"Age" -> "23"),
Map(("Country" -> "France")),
Map("Color" -> "Blue",
"City" -> "Paris"))
and I have this Seq[String]
val val2 = Seq("Name", "Country", "City", "Department")
Expected output is val1 with only the keys present in val2 (I want to filter out the (k, v) pairs from val1 whose keys are not present in val2):
val expected = Seq(Map("Name" -> "Heidi", "City" -> "Paris"), Map("Country" -> "France"), Map("City" -> "Paris"))
Age and Color are keys that are not in val2; I want to omit them from the val1 maps.
I'm not sure whether what you propose is the right approach, but nevertheless it can be done like this:
val1.map(_.filter {
case (key, value) => val2.contains(key)
})
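For example, applied to the sample val1 and val2 above, this yields the expected maps:
val filtered = val1.map(_.filter { case (key, _) => val2.contains(key) })
// filtered: Seq(Map(Name -> Heidi, City -> Paris), Map(Country -> France), Map(City -> Paris))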
It seems you want something like this:
(note that I used a Set instead of a List to make contains faster)
def ensureMapsHaveOnlyValidKeys[K, V](validKeys: Set[K])(data: IterableOnce[Map[K, V]]): List[Map[K, V]] =
data
.iterator
.filter(_.keysIterator.forall(validKeys.contains))
.toList
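A usage sketch with the values from the question (note that this version drops any map that contains a key outside the valid set, rather than trimming individual entries):
val validKeys = Set("Name", "Country", "City", "Department")
val kept = ensureMapsHaveOnlyValidKeys(validKeys)(val1)
// kept: List(Map(Country -> France))  -- the maps containing "Age" or "Color" are dropped entirely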
I have a List and have to create a Map from it for further use. I am using an RDD, but with the use of collect() the job is failing in the cluster. Any help is appreciated.
Below is the sample code, from the List to rdd.collect().
I have to use this Map data further, but how can I use it without collect?
This code creates a Map from the RDD (List) data. List format -> (asdfg/1234/wert, asdf)
//List Data to create Map
val listData = methodToGetListData(ListData).toList
//Creating RDD from above List
val rdd = sparkContext.makeRDD(listData)
implicit val formats = Serialization.formats(NoTypeHints)
val res = rdd
.map(map => (getRPath(map._1), getAttribute(map._1), map._2))
.groupBy(_._1)
.map(tuple => {
Map(
"P_Id" -> "1234",
"R_Time" -> "27-04-2020",
"S_Time" -> "27-04-2020",
"r_path" -> tuple._1,
"S_Tag" -> "12345,
tuple._1 -> (tuple._2.map(a => (a._2, a._3)).toMap)
)
})
res.collect()
}
Q: How to use it without collect?
Answer: collect will move all the data to the driver node; if the data is huge, never do that.
I don't know exactly what the use case for preparing such a map is, but it can be achieved using a built-in Spark API, i.e. collectionAccumulator ... in detail,
collectionAccumulator[scala.collection.mutable.Map[String, String]]
Let's suppose this is your sample dataframe and you want to make a map from it.
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|Item_Id|Parent_Id|object_class_instance|Received_Time|CablesName|CablesStatus|CablesHInfoID|CablesIndex|object_class|ServiceTag|Scan_Time|relation_tree |
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|-0909 |1234 |Cables-1 |23-12-2020 |LC |Installed |ABCD1234 |0 |Cables |ASDF123 |12345 |Start~>HInfo->Cables->Cables-1 |
|-09091 |1234111 |Cables-11 |23-12-2022 |LC1 |Installed1 |ABCD12341 |0 |Cables1 |ASDF1231 |123451 |Start~>HInfo->Cables->Cables-11|
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
From this you want to make a map (the nested map, which I prefixed with a "nestedmap" key name in your example), then...
Below is the full example; have a look and modify it accordingly.
package examples
import org.apache.log4j.Level
object GrabMapbetweenClosure extends App {
val logger = org.apache.log4j.Logger.getLogger("org")
logger.setLevel(Level.WARN)
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder()
.master("local[*]")
.appName(this.getClass.getName)
.getOrCreate()
import spark.implicits._
var mutableMapAcc = spark.sparkContext.collectionAccumulator[scala.collection.mutable.Map[String, String]]("mutableMap")
val df = Seq(
("-0909", "1234", "Cables-1", "23-12-2020", "LC", "Installed", "ABCD1234"
, "0", "Cables", "ASDF123", "12345", "Start~>HInfo->Cables->Cables-1")
, ("-09091", "1234111", "Cables-11", "23-12-2022", "LC1", "Installed1", "ABCD12341"
, "0", "Cables1", "ASDF1231", "123451", "Start~>HInfo->Cables->Cables-11")
).toDF("Item_Id", "Parent_Id", "object_class_instance", "Received_Time", "CablesName", "CablesStatus", "CablesHInfoID",
"CablesIndex", "object_class", "ServiceTag", "Scan_Time", "relation_tree"
)
df.show(false)
df.foreachPartition { partition => // for performance sake I used foreachPartition
partition.foreach {
record => {
mutableMapAcc.add(scala.collection.mutable.Map(
"Item_Id" -> record.getAs[String]("Item_Id")
, "CablesStatus" -> record.getAs[String]("CablesStatus")
, "CablesHInfoID" -> record.getAs[String]("CablesHInfoID")
, "Parent_Id" -> record.getAs[String]("Parent_Id")
, "CablesIndex" -> record.getAs[String]("CablesIndex")
, "object_class_instance" -> record.getAs[String]("object_class_instance")
, "Received_Time" -> record.getAs[String]("Received_Time")
, "object_class" -> record.getAs[String]("object_class")
, "CablesName" -> record.getAs[String]("CablesName")
, "ServiceTag" -> record.getAs[String]("ServiceTag")
, "Scan_Time" -> record.getAs[String]("Scan_Time")
, "relation_tree" -> record.getAs[String]("relation_tree")
)
)
}
}
}
println("FinalMap : " + mutableMapAcc.value.toString)
}
Result :
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|Item_Id|Parent_Id|object_class_instance|Received_Time|CablesName|CablesStatus|CablesHInfoID|CablesIndex|object_class|ServiceTag|Scan_Time|relation_tree |
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|-0909 |1234 |Cables-1 |23-12-2020 |LC |Installed |ABCD1234 |0 |Cables |ASDF123 |12345 |Start~>HInfo->Cables->Cables-1 |
|-09091 |1234111 |Cables-11 |23-12-2022 |LC1 |Installed1 |ABCD12341 |0 |Cables1 |ASDF1231 |123451 |Start~>HInfo->Cables->Cables-11|
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
FinalMap : [Map(Scan_Time -> 123451, ServiceTag -> ASDF1231, Received_Time -> 23-12-2022, object_class_instance -> Cables-11, CablesHInfoID -> ABCD12341, Parent_Id -> 1234111, Item_Id -> -09091, CablesIndex -> 0, object_class -> Cables1, relation_tree -> Start~>HInfo->Cables->Cables-11, CablesName -> LC1, CablesStatus -> Installed1), Map(Scan_Time -> 12345, ServiceTag -> ASDF123, Received_Time -> 23-12-2020, object_class_instance -> Cables-1, CablesHInfoID -> ABCD1234, Parent_Id -> 1234, Item_Id -> -0909, CablesIndex -> 0, object_class -> Cables, relation_tree -> Start~>HInfo->Cables->Cables-1, CablesName -> LC, CablesStatus -> Installed)]
Similar problem was solved here.
I have data like:
[Michael, 100, Montreal,Toronto, Male,30, DB:80, Product:DeveloperLead]
[Will, 101, Montreal, Male,35, Perl:85, Product:Lead,Test:Lead]
[Steven, 102, New York, Female,27, Python:80, Test:Lead,COE:Architect]
[Lucy, 103, Vancouver, Female,57, Sales:89,HR:94, Sales:Lead]
So I have to read this data and define a case class using Spark. I have written the below program, but I get an error while converting the case class to a data frame. What's wrong in my code, and how can I correct it?
case class Ayush(name: String,employee_id:String ,work_place: Array[String],sex_age: Map [String,String],skills_score: Map[String,String],depart_title: Map[String,Array[String]])
I get an error in the below line:
val d = df.map(w=> Ayush(w(0),w(1),w(2)._1,w(2)._2,w(3)._1,w(3)._2,w(4)._1,w(4)._2,w(5)._1,w(5)._2._1,w(5)._2._2))).toDF
I have changed your data: I wrapped the workplace and department data in double quotes so that values containing commas can be read as a single field, and added a custom separator that I can later use to split the data. You can use your own separator.
The data is as follows:
Michael,100," Montreal,Toronto", Male,30, DB:80," Product,DeveloperLead"
Will,101, Montreal, Male,35, Perl:85," Product,Lead,Test,Lead"
Steven,102, New York, Female,27, Python:80," Test,Lead,COE,Architect"
Lucy,103, Vancouver, Female,57, Sales:89_HR:94," Sales,Lead"
Below are the code changes I have performed which worked fine for me:
val df = spark.read.csv("CSV PATH HERE")
case class Ayush(name: String,employee_id:String ,work_place: Array[String],sex_age: Map [String,String],skills_score: Map[String,String],depart_title: Map[String,Array[String]])
val resultDF = df.map { x => {
val departTitleData = x(6).toString
val skill_score = x(5).toString
val skill_Map = scala.collection.mutable.Map[String, String]()
// Split the skills on underscore to get each skill:score pair, then add each one to the map
skill_score.split("_").foreach { x => skill_Map += (x.split(":")(0) -> x.split(":")(1)) }
// Putting data into case class
new Ayush(x(0).toString(), x(1).toString, x(2).toString.split(","), Map(x(3).toString -> x(4).toString), skill_Map.toMap, Map(x(6).toString.split(",")(0) -> x(6).toString.split(",")) )
}}
//End Here
The above code output is:
===============================================================================
+-------+-----------+--------------------+------------------+--------------------+--------------------+
| name|employee_id| work_place| sex_age| skills_score| depart_title|
+-------+-----------+--------------------+------------------+--------------------+--------------------+
|Michael| 100|[ Montreal, Toronto]| Map( Male -> 30)| Map( DB -> 80)|Map( Product -> W...|
| Will| 101| [ Montreal]| Map( Male -> 35)| Map( Perl -> 85)|Map( Product -> W...|
| Steven| 102| [ New York]|Map( Female -> 27)| Map( Python -> 80)|Map( Test -> Wrap...|
| Lucy| 103| [ Vancouver]|Map( Female -> 57)|Map(HR -> 94, Sa...|Map( Sales -> Wra...|
+-------+-----------+--------------------+------------------+--------------------+--------------------+
It may not be exactly what you expected, but it may help you achieve what you are trying to do...
#vishal I don't know if this question is still valid, but here is my solution without changing the source data; fair warning, it might be a little cringy :)
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

// Case class mirroring the one from the question; defined outside main so toDF can find an encoder
case class schema(name: String, employee_id: String, work_place: Array[String], sex_age: Map[String, String], skills_score: Map[String, String], depart_title: Map[String, Array[String]])

def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setAppName("first_demo").setMaster("local[*]")
  val sc = new SparkContext(conf)
  val spark = SparkSession.builder().getOrCreate()
  import spark.implicits._
  val rdd1 = sc.textFile("file:///C:/Users/k.sandeep.varma/Downloads/documents/documents/spark_data/employee_data.txt")
  val clean_rdd = rdd1.map(x => x.replace("[", "")).map(x => x.replace("]", ""))
  val schema_rdd = clean_rdd.map(x => x.split(", ")).map(x => schema(x(0), x(1), x(2).split(","), Map(x(3).split(",")(0) -> x(3).split(",")(1)), Map(x(4).split(":")(0) -> x(4).split(":")(1)), Map(x(5).split(":")(0) -> x(5).split(":"))))
  val df1 = schema_rdd.toDF()
  df1.printSchema()
  df1.show(false)
}
output:
+-------+-----------+-------------------+--------------+----------------+---------------------------------------+
|name   |employee_id|work_place         |sex_age       |skills_score    |depart_title                           |
+-------+-----------+-------------------+--------------+----------------+---------------------------------------+
|Michael|100        |[Montreal, Toronto]|[Male -> 30]  |[DB -> 80]      |[Product -> [Product, DeveloperLead]]  |
|Will   |101        |[Montreal]         |[Male -> 35]  |[Perl -> 85]    |[Product -> [Product, Lead,Test, Lead]]|
|Steven |102        |[New York]         |[Female -> 27]|[Python -> 80]  |[Test -> [Test, Lead,COE, Architect]]  |
|Lucy   |103        |[Vancouver]        |[Female -> 57]|[Sales -> 89,HR]|[Sales -> [Sales, Lead]]               |
+-------+-----------+-------------------+--------------+----------------+---------------------------------------+
I am writing the below code,
val maplist=List(Map("id" -> "1", "Name" -> "divya"),
Map("id" -> "2", "Name" -> "gaya")
)
val header=maplist.flatMap(_.keys).distinct
val data=maplist.flatMap(_.values)
println(header)
println(data)
I am getting the below output,
List(id, Name)
List(1, divya, 2, gaya)
however, I am expecting the output as below,
id Name
1 divya
2 gaya
Here I have only 2 headers in this case, but my maps may contain more than 2 headers; how can I display all of them in rows? Please help me.
val maplist=List(Map("id" -> "1", "Name" -> "divya"),
Map("id" -> "2", "Name" -> "gaya")
)
val header=maplist.flatMap(_.keys).distinct
val data=maplist.map(_.values)
println(header.mkString(" "))
data.foreach(x => println(x.mkString(" ")))
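For the two sample maps, this prints the header row followed by one line per map:
id Name
1 divya
2 gaya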