How to avoid using collect in a Spark RDD in Scala?

I have a List and need to create a Map from it for further use. I am using an RDD, but with collect() the job fails on the cluster. Any help is appreciated.
Below is the sample code, from the List through to rdd.collect. I have to use this Map data further on, but how can I use it without collect?
This code creates a Map from the RDD (List) data. List format: (asdfg/1234/wert, asdf)
//List Data to create Map
val listData = methodToGetListData(ListData).toList
//Creating RDD from above List
val rdd = sparkContext.makeRDD(listData)
implicit val formats = Serialization.formats(NoTypeHints)
val res = rdd
  .map(map => (getRPath(map._1), getAttribute(map._1), map._2))
  .groupBy(_._1)
  .map(tuple => {
    Map(
      "P_Id" -> "1234",
      "R_Time" -> "27-04-2020",
      "S_Time" -> "27-04-2020",
      "r_path" -> tuple._1,
      "S_Tag" -> "12345",
      tuple._1 -> (tuple._2.map(a => (a._2, a._3)).toMap)
    )
  })
res.collect()
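getRPath and getAttribute are not shown in the question; for the list format (asdfg/1234/wert, asdf) they presumably split a path such as asdfg/1234/wert into its prefix and its last segment. A hypothetical sketch, purely for illustration:
// Hypothetical implementations, inferred from the list format; the real ones are not shown in the question.
def getRPath(fullPath: String): String = fullPath.substring(0, fullPath.lastIndexOf("/"))      // "asdfg/1234"
def getAttribute(fullPath: String): String = fullPath.substring(fullPath.lastIndexOf("/") + 1) // "wert"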

Q: How can I use this without collect?
Answer: collect moves all the data to the driver node; if the data is huge, that will hurt. Never do that with large data.
I don't know exactly what the use case is for preparing the map, but it can be achieved using a built-in Spark API, namely collectionAccumulator. In detail:
collectionAccumulator[scala.collection.mutable.Map[String, String]]
Let's suppose this is your sample dataframe and you want to make a map from it.
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|Item_Id|Parent_Id|object_class_instance|Received_Time|CablesName|CablesStatus|CablesHInfoID|CablesIndex|object_class|ServiceTag|Scan_Time|relation_tree |
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|-0909 |1234 |Cables-1 |23-12-2020 |LC |Installed |ABCD1234 |0 |Cables |ASDF123 |12345 |Start~>HInfo->Cables->Cables-1 |
|-09091 |1234111 |Cables-11 |23-12-2022 |LC1 |Installed1 |ABCD12341 |0 |Cables1 |ASDF1231 |123451 |Start~>HInfo->Cables->Cables-11|
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
From this you want to make a map (a nested map, which in your example I prefixed with the nested-map key name).
Below is the full example; have a look and modify it accordingly.
package examples

import org.apache.log4j.Level

object GrabMapbetweenClosure extends App {
  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession
    .builder()
    .master("local[*]")
    .appName(this.getClass.getName)
    .getOrCreate()

  import spark.implicits._

  val mutableMapAcc = spark.sparkContext
    .collectionAccumulator[scala.collection.mutable.Map[String, String]]("mutableMap")

  val df = Seq(
    ("-0909", "1234", "Cables-1", "23-12-2020", "LC", "Installed", "ABCD1234",
      "0", "Cables", "ASDF123", "12345", "Start~>HInfo->Cables->Cables-1"),
    ("-09091", "1234111", "Cables-11", "23-12-2022", "LC1", "Installed1", "ABCD12341",
      "0", "Cables1", "ASDF1231", "123451", "Start~>HInfo->Cables->Cables-11")
  ).toDF("Item_Id", "Parent_Id", "object_class_instance", "Received_Time", "CablesName",
    "CablesStatus", "CablesHInfoID", "CablesIndex", "object_class", "ServiceTag",
    "Scan_Time", "relation_tree")

  df.show(false)

  df.foreachPartition { partition => // for performance's sake I used foreachPartition
    partition.foreach { record =>
      mutableMapAcc.add(scala.collection.mutable.Map(
        "Item_Id" -> record.getAs[String]("Item_Id"),
        "CablesStatus" -> record.getAs[String]("CablesStatus"),
        "CablesHInfoID" -> record.getAs[String]("CablesHInfoID"),
        "Parent_Id" -> record.getAs[String]("Parent_Id"),
        "CablesIndex" -> record.getAs[String]("CablesIndex"),
        "object_class_instance" -> record.getAs[String]("object_class_instance"),
        "Received_Time" -> record.getAs[String]("Received_Time"),
        "object_class" -> record.getAs[String]("object_class"),
        "CablesName" -> record.getAs[String]("CablesName"),
        "ServiceTag" -> record.getAs[String]("ServiceTag"),
        "Scan_Time" -> record.getAs[String]("Scan_Time"),
        "relation_tree" -> record.getAs[String]("relation_tree")
      ))
    }
  }

  println("FinalMap : " + mutableMapAcc.value.toString)
}
Result :
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|Item_Id|Parent_Id|object_class_instance|Received_Time|CablesName|CablesStatus|CablesHInfoID|CablesIndex|object_class|ServiceTag|Scan_Time|relation_tree |
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|-0909 |1234 |Cables-1 |23-12-2020 |LC |Installed |ABCD1234 |0 |Cables |ASDF123 |12345 |Start~>HInfo->Cables->Cables-1 |
|-09091 |1234111 |Cables-11 |23-12-2022 |LC1 |Installed1 |ABCD12341 |0 |Cables1 |ASDF1231 |123451 |Start~>HInfo->Cables->Cables-11|
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
FinalMap : [Map(Scan_Time -> 123451, ServiceTag -> ASDF1231, Received_Time -> 23-12-2022, object_class_instance -> Cables-11, CablesHInfoID -> ABCD12341, Parent_Id -> 1234111, Item_Id -> -09091, CablesIndex -> 0, object_class -> Cables1, relation_tree -> Start~>HInfo->Cables->Cables-11, CablesName -> LC1, CablesStatus -> Installed1), Map(Scan_Time -> 12345, ServiceTag -> ASDF123, Received_Time -> 23-12-2020, object_class_instance -> Cables-1, CablesHInfoID -> ABCD1234, Parent_Id -> 1234, Item_Id -> -0909, CablesIndex -> 0, object_class -> Cables, relation_tree -> Start~>HInfo->Cables->Cables-1, CablesName -> LC, CablesStatus -> Installed)]
Similar problem was solved here.
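As a side note, an accumulator's value is still merged on the driver when you read it, so if even that is too much data, keep the result distributed instead. A minimal sketch, assuming the same rdd, getRPath and getAttribute as in the question; the output path is hypothetical:
val grouped = rdd
  .map(kv => (getRPath(kv._1), getAttribute(kv._1), kv._2))
  .groupBy(_._1)
  .mapValues(rows => rows.map(a => (a._2, a._3)).toMap)
// Write the grouped maps to a distributed sink, or keep using `grouped` in
// further transformations and joins, without ever calling collect().
grouped.saveAsTextFile("hdfs:///tmp/r_path_maps")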

Related

How to add complex logic to updateExpr in a Delta Table

I am updating a Delta table with some incremental records. Two of the fields require just a plain update, but another one is a collection of maps for which I would like to concatenate the new values with the existing ones instead of doing an update/replace.
val historicalDF = Seq(
(1, 0, "Roger", Seq(Map("score" -> 5, "year" -> 2012)))
).toDF("id", "ts", "user", "scores")
historicalDF.write
.format("delta")
.mode("overwrite")
.save(table_path)
val hist_dt : DeltaTable = DeltaTable.forPath(spark, table_path)
val incrementalDF = Seq(
(1, 1, "Roger Rabbit", Seq(Map("score" -> 7, "year" -> 2013)))
).toDF("id", "ts", "user", "scores")
What I would like to have after the merge is something like this:
+---+---+------------+--------------------------------------------------------+
|id |ts |user |scores |
+---+---+------------+--------------------------------------------------------+
|1 |1 |Roger Rabbit|[{score -> 7, year -> 2013}, {score -> 5, year -> 2012}]|
+---+---+------------+--------------------------------------------------------+
What I tried to perform this concatenation is:
hist_dt
.as("ex")
.merge(incrementalDF.as("in"),
"ex.id = in.id")
.whenMatched
.updateExpr(
Map(
"ts" -> "in.ts",
"user" -> "in.user",
"scores" -> "in.scores" ++ "ex.scores"
)
)
.whenNotMatched
.insertAll()
.execute()
But the columns "in.scores" and "ex.scores" are interpreted as String, so I am getting the following error:
error: value ++ is not a member of (String, String)
Is there a way to add some complex logic to updateExpr?
Using update() instead of updateExpr() lets me pass the required columns to a UDF, so I can add more complex logic there:
def join_seq_map(incremental: Seq[Map[String,Integer]], existing: Seq[Map[String,Integer]]) : Seq[Map[String,Integer]] = {
(incremental, existing) match {
case ( null , null) => null
case ( null, e ) => e
case ( i , null) => i
case ( i , e ) => (i ++ e).distinct
}
}
def join_seq_map_udf = udf(join_seq_map _)
hist_dt
.as("ex")
.merge(
incrementalDF.as("in"),
"ex.id = in.id")
.whenMatched("ex.ts < in.ts")
.update(Map(
"ts" -> col("in.ts"),
"user" -> col("in.user"),
"scores" -> join_seq_map_udf(col("in.scores"), col("ex.scores"))
))
.whenNotMatched
.insertAll()
.execute()
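For reference, the answer above assumes the usual Delta Lake and Spark SQL imports are in scope; a likely set (an assumption, not shown in the original), plus a quick check that the merge concatenated the scores:
import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.{col, udf}
// Re-read the table after the merge (assumes the same `spark` session and `table_path` as above).
DeltaTable.forPath(spark, table_path).toDF.show(false)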

How to read from a csv file to create a scala Map object?

I have a path to a CSV file I'd like to read from. This CSV includes three columns: "topic, key, value". I am using Spark to read this file as a CSV. The file looks like the following (lookupFile.csv):
Topic,Key,Value
fruit,aaa,apple
fruit,bbb,orange
animal,ccc,cat
animal,ddd,dog
// I'm reading the file as follows
val lookup = spark.read.option("delimiter", ",").option("header", "true").csv(lookupFile)
I'd like to take what I just read and return a map that has the following properties:
The map uses the topic as a key
The value of this map is a map of the "Key" and "Value" columns
My hope is that I would get a map that looks like the following:
val result = Map("fruit" -> Map("aaa" -> "apple", "bbb" -> "orange"),
"animal" -> Map("ccc" -> "cat", "ddd" -> "dog"))
Any ideas on how I can do this?
scala> val in = spark.read.option("header", true).option("inferSchema", true).csv("""Topic,Key,Value
| fruit,aaa,apple
| fruit,bbb,orange
| animal,ccc,cat
| animal,ddd,dog""".split("\n").toSeq.toDS)
in: org.apache.spark.sql.DataFrame = [Topic: string, Key: string ... 1 more field]
scala> val res = in.groupBy('Topic).agg(map_from_entries(collect_list(struct('Key, 'Value))).as("subMap"))
res: org.apache.spark.sql.DataFrame = [Topic: string, subMap: map<string,string>]
scala> val scalaMap = res.collect.map{
| case org.apache.spark.sql.Row(k : String, v : Map[String, String]) => (k, v)
| }.toMap
<console>:26: warning: non-variable type argument String in type pattern scala.collection.immutable.Map[String,String] (the underlying of Map[String,String]) is unchecked since it is eliminated by erasure
case org.apache.spark.sql.Row(k : String, v : Map[String, String]) => (k, v)
^
scalaMap: scala.collection.immutable.Map[String,Map[String,String]] = Map(animal -> Map(ccc -> cat, ddd -> dog), fruit -> Map(aaa -> apple, bbb -> orange))
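If the unchecked-type-pattern warning bothers you, an alternative sketch (reusing the same res DataFrame) pulls the map column out with Row.getMap instead of pattern matching:
val scalaMap2 = res.collect().map(row => row.getString(0) -> row.getMap[String, String](1).toMap).toMap
// scalaMap2: Map[String, Map[String, String]], same content as scalaMap above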
Read in your data:
val df1 = spark.read.format("csv").option("inferSchema", "true").option("header", "true").load(path)
First put "Key,Value" into an array and group by Topic, so that your target is separated into a key part and a value part:
val df2 = df1.groupBy("Topic").agg(collect_list(array($"Key", $"Value")).as("arr"))
Now convert to a Dataset:
val ds = df2.as[(String, Seq[Seq[String]])]
Apply logic on the fields to get your map of maps, and collect:
val ds1 = ds.map(x => (x._1, x._2.map(y => (y(0), y(1))).toMap)).collect
Now your data is set up with the Topic as the key and "Key,Value" as the value, so apply toMap to get your result:
ds1.toMap
Map(animal -> Map(ccc -> cat, ddd -> dog), fruit -> Map(aaa -> apple, bbb -> orange))
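Putting the second approach together as one self-contained snippet (a sketch; lookupFile.csv is the hypothetical path from the question):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, collect_list}

val spark = SparkSession.builder().appName("csv-to-map").master("local[*]").getOrCreate()
import spark.implicits._

val df = spark.read.option("header", "true").csv("lookupFile.csv")

// Topic -> Map(Key -> Value)
val result: Map[String, Map[String, String]] = df
  .groupBy("Topic")
  .agg(collect_list(array($"Key", $"Value")).as("arr"))
  .as[(String, Seq[Seq[String]])]
  .map { case (topic, pairs) => topic -> pairs.map(p => (p(0), p(1))).toMap }
  .collect()
  .toMap
// result: Map(fruit -> Map(aaa -> apple, bbb -> orange), animal -> Map(ccc -> cat, ddd -> dog))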

How to create a Spark SQL Dataframe with list of Map objects

I have multiple Map[String, String] in a List (Scala). For example:
val map1 = Map("EMP_NAME" -> "Ahmad", "DOB" -> "01-10-1991", "CITY" -> "Dubai")
val map2 = Map("EMP_NAME" -> "Rahul", "DOB" -> "06-12-1991", "CITY" -> "Mumbai")
val map3 = Map("EMP_NAME" -> "John", "DOB" -> "11-04-1996", "CITY" -> "Toronto")
val list = List(map1, map2, map3)
Now I want to create a single dataframe with something like this:
EMP_NAME DOB CITY
Ahmad 01-10-1991 Dubai
Rahul 06-12-1991 Mumbai
John 11-04-1996 Toronto
How do I achieve this?
You can do it like this:
import spark.implicits._
val df = list
.map( m => (m.get("EMP_NAME"),m.get("DOB"),m.get("CITY")))
.toDF("EMP_NAME","DOB","CITY")
df.show()
+--------+----------+-------+
|EMP_NAME| DOB| CITY|
+--------+----------+-------+
| Ahmad|01-10-1991| Dubai|
| Rahul|06-12-1991| Mumbai|
| John|11-04-1996|Toronto|
+--------+----------+-------+
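A small caveat, not part of the original answer: m.get returns an Option[String], which Spark encodes as nullable string columns. If you would rather fail fast on a missing key, apply the map directly (dfStrict is just an illustrative name; the same import spark.implicits._ applies):
val dfStrict = list
  .map(m => (m("EMP_NAME"), m("DOB"), m("CITY"))) // throws NoSuchElementException if a key is absent
  .toDF("EMP_NAME", "DOB", "CITY")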
A slightly less specific approach, e.g.:
val map1 = Map("EMP_NAME" -> "Ahmad", "DOB" -> "01-10-1991", "CITY" -> "Dubai")
val map2 = Map("EMP_NAME" -> "John", "DOB" -> "01-10-1992", "CITY" -> "Mumbai")
///...
val list = List(map1, map2) // map3, ...
val RDDmap = sc.parallelize(list)
// Get cols dynamically
val cols = RDDmap.take(1).flatMap(x=> x.keys)
// Each Map entry is a K,V pair
val df = RDDmap.map{ value=>
val list=value.values.toList
(list(0), list(1), list(2))
}.toDF(cols:_*) // dynamic column names assigned
df.show(false)
returns:
+--------+----------+------+
|EMP_NAME|DOB |CITY |
+--------+----------+------+
|Ahmad |01-10-1991|Dubai |
|John |01-10-1992|Mumbai|
+--------+----------+------+
Or, to answer your sub-question, something like the following; at least I think this is what you are asking, but possibly not:
val RDDmap = sc.parallelize(List(
Map("EMP_NAME" -> "Ahmad", "DOB" -> "01-10-1991", "CITY" -> "Dubai"),
Map("EMP_NAME" -> "John", "DOB" -> "01-10-1992", "CITY" -> "Mumbai")))
...
// Get cols dynamically
val cols = RDDmap.take(1).flatMap(x=> x.keys)
// Each Map entry is a K,V pair
val df = RDDmap.map{ value=>
val list=value.values.toList
(list(0), list(1), list(2))
}.toDF(cols:_*) // dynamic column names assigned
You can build a list dynamically of course, but you still need to assign the Map elements. See "Appending Data to List or any other collection Dynamically in scala". I would just read it in from a file and be done with it.
import org.apache.spark.SparkContext
import org.apache.spark.sql._
import org.apache.spark.sql.types.{StringType, StructField, StructType}
object DataFrameTest2 extends Serializable {
  var sparkSession: SparkSession = _
  var sparkContext: SparkContext = _
  var sqlContext: SQLContext = _

  def main(args: Array[String]): Unit = {
    sparkSession = SparkSession.builder().appName("TestMaster").master("local").getOrCreate()
    sparkContext = sparkSession.sparkContext
    val sqlContext = new org.apache.spark.sql.SQLContext(sparkContext)

    val map1 = Map("EMP_NAME" -> "Ahmad", "DOB" -> "01-10-1991", "CITY" -> "Dubai")
    val map2 = Map("EMP_NAME" -> "Rahul", "DOB" -> "06-12-1991", "CITY" -> "Mumbai")
    val map3 = Map("EMP_NAME" -> "John", "DOB" -> "11-04-1996", "CITY" -> "Toronto")
    val list = List(map1, map2, map3)

    // create your rows
    val rows = list.map(m => Row(m.values.toSeq: _*))
    // create the schema from the header
    val header = list.head.keys.toList
    val schema = StructType(header.map(fieldName => StructField(fieldName, StringType, true)))
    // create your rdd
    val rdd = sparkContext.parallelize(rows)
    // create your dataframe using the rdd
    val df = sparkSession.createDataFrame(rdd, schema)
    df.show()
  }
}
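A side note on this last approach, not from the original answer: it relies on every map iterating its values in the same order as the header keys. A slightly more defensive sketch looks each key up explicitly (reusing list, header, schema, sparkContext and sparkSession from the code above):
val safeRows = list.map(m => Row(header.map(k => m.getOrElse(k, null)): _*))
val safeDf = sparkSession.createDataFrame(sparkContext.parallelize(safeRows), schema)
safeDf.show()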

Spark Scala case class with array and map datatype

I have data like:
[Michael, 100, Montreal,Toronto, Male,30, DB:80, Product:DeveloperLead]
[Will, 101, Montreal, Male,35, Perl:85, Product:Lead,Test:Lead]
[Steven, 102, New York, Female,27, Python:80, Test:Lead,COE:Architect]
[Lucy, 103, Vancouver, Female,57, Sales:89,HR:94, Sales:Lead]
So I have to read this data and define a case class using Spark. I have written the program below, but I get an error while converting the case class to a data frame. What's wrong with my code, and how can I correct it?
case class Ayush(name: String, employee_id: String, work_place: Array[String], sex_age: Map[String, String], skills_score: Map[String, String], depart_title: Map[String, Array[String]])
I get an error in the following line:
val d = df.map(w=> Ayush(w(0),w(1),w(2)._1,w(2)._2,w(3)._1,w(3)._2,w(4)._1,w(4)._2,w(5)._1,w(5)._2._1,w(5)._2._2))).toDF
I have changed your data: I wrapped the workplace and department data in double quotes so that I can read them as comma-separated values, and I added a custom separator that I use later to split the data. You can use your own separator.
The data is as follows:
Michael,100," Montreal,Toronto", Male,30, DB:80," Product,DeveloperLead"
Will,101, Montreal, Male,35, Perl:85," Product,Lead,Test,Lead"
Steven,102, New York, Female,27, Python:80," Test,Lead,COE,Architect"
Lucy,103, Vancouver, Female,57, Sales:89_HR:94," Sales,Lead"
Below are the code changes I made, which worked fine for me:
import spark.implicits._

val df = spark.read.csv("CSV PATH HERE")

case class Ayush(name: String, employee_id: String, work_place: Array[String], sex_age: Map[String, String], skills_score: Map[String, String], depart_title: Map[String, Array[String]])

val resultDF = df.map { x =>
  val departTitleData = x(6).toString
  val skill_score = x(5).toString
  val skill_Map = scala.collection.mutable.Map[String, String]()
  // Split the skills on underscore to get each skill:score pair, then add each one to the map
  skill_score.split("_").foreach { s => skill_Map += (s.split(":")(0) -> s.split(":")(1)) }
  // Put the data into the case class
  Ayush(x(0).toString, x(1).toString, x(2).toString.split(","), Map(x(3).toString -> x(4).toString), skill_Map.toMap, Map(x(6).toString.split(",")(0) -> x(6).toString.split(",")))
}
//End Here
The above code output is:
===============================================================================
+-------+-----------+--------------------+------------------+--------------------+--------------------+
| name|employee_id| work_place| sex_age| skills_score| depart_title|
+-------+-----------+--------------------+------------------+--------------------+--------------------+
|Michael| 100|[ Montreal, Toronto]| Map( Male -> 30)| Map( DB -> 80)|Map( Product -> W...|
| Will| 101| [ Montreal]| Map( Male -> 35)| Map( Perl -> 85)|Map( Product -> W...|
| Steven| 102| [ New York]|Map( Female -> 27)| Map( Python -> 80)|Map( Test -> Wrap...|
| Lucy| 103| [ Vancouver]|Map( Female -> 57)|Map(HR -> 94, Sa...|Map( Sales -> Wra...|
+-------+-----------+--------------------+------------------+--------------------+--------------------+
It may not be exactly what you expected, but it may help you achieve what you are trying to do...
@vishal I don't know if this question is still valid, but here is my solution without changing the source data; fair warning, it might be a little cringy :)
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

// Case class referenced below; not shown in the original post, reconstructed from the output schema
case class schema(name: String, employee_id: String, work_place: Array[String], sex_age: Map[String, String], skills_score: Map[String, String], depart_title: Map[String, Array[String]])

def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setAppName("first_demo").setMaster("local[*]")
  val sc = new SparkContext(conf)
  val spark = SparkSession.builder().getOrCreate()
  import spark.implicits._
  val rdd1 = sc.textFile("file:///C:/Users/k.sandeep.varma/Downloads/documents/documents/spark_data/employee_data.txt")
  val clean_rdd = rdd1.map(x => x.replace("[", "")).map(x => x.replace("]", ""))
  val schema_rdd = clean_rdd.map(x => x.split(", ")).map(x =>
    schema(x(0), x(1), x(2).split(","),
      Map(x(3).split(",")(0) -> x(3).split(",")(1)),
      Map(x(4).split(":")(0) -> x(4).split(":")(1)),
      Map(x(5).split(":")(0) -> x(5).split(":"))))
  val df1 = schema_rdd.toDF()
  df1.printSchema()
  df1.show(false)
}
output:
+-------+-----------+-------------------+--------------+----------------+---------------------------------------+
|name   |employee_id|work_place         |sex_age       |skills_score    |depart_title                           |
+-------+-----------+-------------------+--------------+----------------+---------------------------------------+
|Michael|100        |[Montreal, Toronto]|[Male -> 30]  |[DB -> 80]      |[Product -> [Product, DeveloperLead]]  |
|Will   |101        |[Montreal]         |[Male -> 35]  |[Perl -> 85]    |[Product -> [Product, Lead,Test, Lead]]|
|Steven |102        |[New York]         |[Female -> 27]|[Python -> 80]  |[Test -> [Test, Lead,COE, Architect]]  |
|Lucy   |103        |[Vancouver]        |[Female -> 57]|[Sales -> 89,HR]|[Sales -> [Sales, Lead]]               |
+-------+-----------+-------------------+--------------+----------------+---------------------------------------+

Spark-Scala: save as csv file (RDD) [duplicate]

This question already has answers here:
How to save a spark DataFrame as csv on disk?
(4 answers)
Closed 5 years ago.
I have tried to stream Twitter data using Apache Spark, and I want to save the streamed data as a CSV file, but I couldn't. How can I fix my code to get it into CSV? I am using an RDD.
This is my main code:
val ssc = new StreamingContext(conf, Seconds(3600))
val stream = TwitterUtils.createStream(ssc, None, filters)
val tweets = stream.map(t => {
Map(
// This is for tweet
"text" -> t.getText,
"retweet_count" -> t.getRetweetCount,
"favorited" -> t.isFavorited,
"truncated" -> t.isTruncated,
"id_str" -> t.getId,
"in_reply_to_screen_name" -> t.getInReplyToScreenName,
"source" -> t.getSource,
"retweeted" -> t.isRetweetedByMe,
"created_at" -> t.getCreatedAt,
"in_reply_to_status_id_str" -> t.getInReplyToStatusId,
"in_reply_to_user_id_str" -> t.getInReplyToUserId,
// This is for tweet's user
"listed_count" -> t.getUser.getListedCount,
"verified" -> t.getUser.isVerified,
"location" -> t.getUser.getLocation,
"user_id_str" -> t.getUser.getId,
"description" -> t.getUser.getDescription,
"geo_enabled" -> t.getUser.isGeoEnabled,
"user_created_at" -> t.getUser.getCreatedAt,
"statuses_count" -> t.getUser.getStatusesCount,
"followers_count" -> t.getUser.getFollowersCount,
"favorites_count" -> t.getUser.getFavouritesCount,
"protected" -> t.getUser.isProtected,
"user_url" -> t.getUser.getURL,
"name" -> t.getUser.getName,
"time_zone" -> t.getUser.getTimeZone,
"user_lang" -> t.getUser.getLang,
"utc_offset" -> t.getUser.getUtcOffset,
"friends_count" -> t.getUser.getFriendsCount,
"screen_name" -> t.getUser.getScreenName
)
})
tweets.repartition(1).saveAsTextFiles("~/streaming/tweets")
You need to convert tweets, which is an RDD[Map[String, String]], to a dataframe in order to save it as CSV. The reason is simple: an RDD doesn't have a schema, whereas the CSV format needs a specific one. So you have to convert the RDD to a dataframe, which has a schema.
There are several ways of doing that. One approach could be using a case class instead of putting the data into maps.
case class Tweet(text: String, retweetCount: Int, ...)
Now, instead of Map(...), you instantiate the case class with the proper parameters.
Finally, convert tweets to a dataframe using Spark's implicit conversions:
import spark.implicits._
tweets.toDF.write.csv(...) // saves as CSV
Alternatively, you can convert the Map to a dataframe using the solution given here.
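Since tweets here is a DStream rather than a single RDD, the write has to happen per micro-batch. A minimal sketch of that step, assuming a Tweet case class like the one above and an existing SparkSession named spark (the names and output directory are illustrative, not from the original answer):
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.DStream

// Hypothetical minimal case class; add whichever of the fields above you need.
case class Tweet(text: String, retweetCount: Int, createdAt: String)

def saveTweetsAsCsv(tweets: DStream[Tweet], spark: SparkSession, outputDir: String): Unit = {
  import spark.implicits._
  tweets.foreachRDD { (rdd, time) =>
    if (!rdd.isEmpty()) {
      rdd.toDF()
        .write
        .option("header", "true")
        .csv(s"$outputDir/batch-${time.milliseconds}") // one CSV directory per micro-batch
    }
  }
}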