Update a key within an object inside a JsObject in Play - scala

I am creating a JsObject using the following code:
var json = Json.obj(
"spec" -> Json.obj(
"type" -> "Scala",
"mode" -> "cluster",
"image" -> sparkImage,
"imagePullPolicy" -> "Always",
"mainClass" -> mainClass,
"mainApplicationFile" -> jarFile,
"sparkVersion" -> "2.4.4",
"sparkConf" -> Json.obj(
"spark.kubernetes.driver.volumes.persistentVolumeClaim.jar-volume.mount.path" -> "/opt/spark/work-dir/",
"spark.kubernetes.driver.volumes.persistentVolumeClaim.files-volume.mount.path" -> "/opt/spark/files/",
"spark.kubernetes.executor.volumes.persistentVolumeClaim.files-volume.mount.path" -> "/opt/spark/files/",
"spark.kubernetes.executor.volumes.persistentVolumeClaim.jar-volume.mount.path" -> "/opt/spark/work-dir/",
"spark.kubernetes.driver.volumes.persistentVolumeClaim.log-volume.mount.path" -> "/opt/spark/event-logs/",
"spark.eventLog.enabled" -> "true",
"spark.eventLog.dir" -> "/opt/spark/event-logs/"
)
)
)
Now I will be fetching some additional sparkConf parameters from my database. Once I fetch them, I will store them in a regular Scala Map (Map[String, String]) containing the key-value pairs that should go into the sparkConf.
I need to update the sparkConf within the spec inside my JsObject. Ideally, I would like to apply a transformation like this:
val sparkSession = Map[String, JsString]("spark.eventLog.enabled" -> JsString("true"))
val transformer = (__ \ "spec" \ "sparkConf").json.update(
__.read[JsObject].map(e => e + sparkSession)
)
However, I can't find a way to make this work.
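One way this could be approached, as a minimal sketch: convert the plain map into a JsObject and merge it into spec.sparkConf, either with a `json.update` transformer and `++`, or with `deepMerge`. The object is trimmed down from the one above, and the name `extraConf` and its sample values are illustrative assumptions, not part of the original code.

```scala
import play.api.libs.json._

object SparkConfUpdate {
  // Trimmed-down version of the object built in the question
  val json: JsObject = Json.obj(
    "spec" -> Json.obj(
      "type" -> "Scala",
      "sparkConf" -> Json.obj(
        "spark.eventLog.enabled" -> "true",
        "spark.eventLog.dir" -> "/opt/spark/event-logs/"
      )
    )
  )

  // Hypothetical extra settings, e.g. as fetched from the database
  val extraConf: Map[String, String] = Map(
    "spark.eventLog.enabled" -> "false",
    "spark.executor.instances" -> "2"
  )

  // Turn the plain map into a JsObject so it can be merged with `++`
  val extraJson: JsObject =
    JsObject(extraConf.toSeq.map { case (k, v) => k -> JsString(v) })

  // Option 1: transformer that merges the extra keys into spec.sparkConf
  val transformer: Reads[JsObject] =
    (__ \ "spec" \ "sparkConf").json.update(__.read[JsObject].map(_ ++ extraJson))
  val updated: JsResult[JsObject] = json.transform(transformer)

  // Option 2: deepMerge with an object that has the same nesting
  val merged: JsObject =
    json.deepMerge(Json.obj("spec" -> Json.obj("sparkConf" -> extraJson)))
}
```

In both variants, keys that already exist in sparkConf are overwritten by the values from the map, and all other keys are kept. The transformer returns a JsResult, so a missing spec.sparkConf path shows up as a JsError rather than an exception.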

Related

How to call a remote SQL function inside a PySpark or Scala Databricks notebook

I am writing a Databricks Scala/Python notebook that connects to a SQL Server database,
and I want to execute a SQL Server function from the notebook with custom parameters.
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._
val ID = "1"
val name = "A"
val config = Config(Map(
"url" -> "sample-p-vm.all.test.azure.com",
"databaseName" -> "DBsample",
"dbTable" -> "dbo.FN_cal_udf",
"user" -> "useer567",
"password" -> "pppp#345%",
"connectTimeout" -> "5", //seconds
"queryTimeout" -> "5" //seconds
))
val collection = sqlContext.read.sqlDB(config)
collection.show()
Here the function is FN_cal_udf, which is stored in the SQL Server database 'DBsample'.
I got this error:
jdbc.SQLServerException: Parameters were not supplied for the function
How can I pass parameters and call the SQL function inside the notebook in Scala or PySpark?
You can first build a query string that holds the function call with its dynamic parameters,
and then use it in the config.
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._
val ID = "1"
val name = "A"
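// Function call with the parameters filled in
val query = "[dbo].[FN_cal_udf]('" + ID + "','" + name + "')"
val config = Config(Map(
"url" -> "sample-p-vm.all.test.azure.com",
"databaseName" -> "DBsample",
"dbTable" -> query, // pass the parameterized call instead of the bare function name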
"user" -> "useer567",
"password" -> "pppp#345%",
"connectTimeout" -> "5", //seconds
"queryTimeout" -> "5" //seconds
))
val collection = sqlContext.read.sqlDB(config)
collection.show()
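Because the parameters are concatenated straight into the SQL text, a value containing a single quote would break the statement. A small sketch of building the same call with quoting; `sqlLiteral` and `functionCall` are hypothetical helpers introduced here for illustration, not part of the connector:

```scala
object FunctionCallQuery {
  // Hypothetical helper: wrap a value as a T-SQL string literal,
  // doubling any embedded single quotes
  def sqlLiteral(value: String): String =
    "'" + value.replace("'", "''") + "'"

  // Hypothetical helper: build "function('arg1', 'arg2', ...)"
  def functionCall(function: String, args: String*): String = {
    val argList = args.map(sqlLiteral).mkString(", ")
    s"$function($argList)"
  }

  val ID = "1"
  val name = "A"

  // Same call as in the answer, with each argument quoted by sqlLiteral
  val query: String = functionCall("[dbo].[FN_cal_udf]", ID, name)
}
```

The resulting `query` string can then be placed in the config in the same way as the hand-built one.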

How to avoid using collect in Spark RDD in Scala?

I have a List and have to create a Map from it for further use. I am using an RDD, but when I call collect(), the job fails on the cluster. Any help is appreciated.
Below is the sample code, from the List to rdd.collect.
I have to use this Map data further, but how can I use it without collect?
This code creates a Map from the RDD (List) data. List format -> (asdfg/1234/wert,asdf)
//List Data to create Map
val listData = methodToGetListData(ListData).toList
//Creating RDD from above List
val rdd = sparkContext.makeRDD(listData)
implicit val formats = Serialization.formats(NoTypeHints)
val res = rdd
.map(map => (getRPath(map._1), getAttribute(map._1), map._2))
.groupBy(_._1)
.map(tuple => {
Map(
"P_Id" -> "1234",
"R_Time" -> "27-04-2020",
"S_Time" -> "27-04-2020",
"r_path" -> tuple._1,
"S_Tag" -> "12345",
tuple._1 -> (tuple._2.map(a => (a._2, a._3)).toMap)
)
})
res.collect()
}
Q: How can I use this without collect?
Answer: collect moves all the data to the driver node. If the data is
huge, never do that.
I don't know exactly what your use case for building the map is, but it can be achieved with the built-in Spark API collectionAccumulator, in detail:
collectionAccumulator[scala.collection.mutable.Map[String, String]]
Let's suppose this is your sample DataFrame and you want to make a map.
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|Item_Id|Parent_Id|object_class_instance|Received_Time|CablesName|CablesStatus|CablesHInfoID|CablesIndex|object_class|ServiceTag|Scan_Time|relation_tree |
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|-0909 |1234 |Cables-1 |23-12-2020 |LC |Installed |ABCD1234 |0 |Cables |ASDF123 |12345 |Start~>HInfo->Cables->Cables-1 |
|-09091 |1234111 |Cables-11 |23-12-2022 |LC1 |Installed1 |ABCD12341 |0 |Cables1 |ASDF1231 |123451 |Start~>HInfo->Cables->Cables-11|
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
From this you want to make a map (a nested map, which I prefixed with the nestedmap key name in your example), then...
Below is the full example; have a look and modify it accordingly.
package examples
import org.apache.log4j.Level
object GrabMapbetweenClosure extends App {
val logger = org.apache.log4j.Logger.getLogger("org")
logger.setLevel(Level.WARN)
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder()
.master("local[*]")
.appName(this.getClass.getName)
.getOrCreate()
import spark.implicits._
var mutableMapAcc = spark.sparkContext.collectionAccumulator[scala.collection.mutable.Map[String, String]]("mutableMap")
val df = Seq(
("-0909", "1234", "Cables-1", "23-12-2020", "LC", "Installed", "ABCD1234"
, "0", "Cables", "ASDF123", "12345", "Start~>HInfo->Cables->Cables-1")
, ("-09091", "1234111", "Cables-11", "23-12-2022", "LC1", "Installed1", "ABCD12341"
, "0", "Cables1", "ASDF1231", "123451", "Start~>HInfo->Cables->Cables-11")
).toDF("Item_Id", "Parent_Id", "object_class_instance", "Received_Time", "CablesName", "CablesStatus", "CablesHInfoID",
"CablesIndex", "object_class", "ServiceTag", "Scan_Time", "relation_tree"
)
df.show(false)
df.foreachPartition { partition => // for performance sake I used foreachPartition
partition.foreach {
record => {
mutableMapAcc.add(scala.collection.mutable.Map(
"Item_Id" -> record.getAs[String]("Item_Id")
, "CablesStatus" -> record.getAs[String]("CablesStatus")
, "CablesHInfoID" -> record.getAs[String]("CablesHInfoID")
, "Parent_Id" -> record.getAs[String]("Parent_Id")
, "CablesIndex" -> record.getAs[String]("CablesIndex")
, "object_class_instance" -> record.getAs[String]("object_class_instance")
, "Received_Time" -> record.getAs[String]("Received_Time")
, "object_class" -> record.getAs[String]("object_class")
, "CablesName" -> record.getAs[String]("CablesName")
, "ServiceTag" -> record.getAs[String]("ServiceTag")
, "Scan_Time" -> record.getAs[String]("Scan_Time")
, "relation_tree" -> record.getAs[String]("relation_tree")
)
)
}
}
}
println("FinalMap : " + mutableMapAcc.value.toString)
}
Result :
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|Item_Id|Parent_Id|object_class_instance|Received_Time|CablesName|CablesStatus|CablesHInfoID|CablesIndex|object_class|ServiceTag|Scan_Time|relation_tree |
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|-0909 |1234 |Cables-1 |23-12-2020 |LC |Installed |ABCD1234 |0 |Cables |ASDF123 |12345 |Start~>HInfo->Cables->Cables-1 |
|-09091 |1234111 |Cables-11 |23-12-2022 |LC1 |Installed1 |ABCD12341 |0 |Cables1 |ASDF1231 |123451 |Start~>HInfo->Cables->Cables-11|
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
FinalMap : [Map(Scan_Time -> 123451, ServiceTag -> ASDF1231, Received_Time -> 23-12-2022, object_class_instance -> Cables-11, CablesHInfoID -> ABCD12341, Parent_Id -> 1234111, Item_Id -> -09091, CablesIndex -> 0, object_class -> Cables1, relation_tree -> Start~>HInfo->Cables->Cables-11, CablesName -> LC1, CablesStatus -> Installed1), Map(Scan_Time -> 12345, ServiceTag -> ASDF123, Received_Time -> 23-12-2020, object_class_instance -> Cables-1, CablesHInfoID -> ABCD1234, Parent_Id -> 1234, Item_Id -> -0909, CablesIndex -> 0, object_class -> Cables, relation_tree -> Start~>HInfo->Cables->Cables-1, CablesName -> LC, CablesStatus -> Installed)]
Similar problem was solved here.
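As an alternative to the accumulator, here is a sketch of two other ways to avoid materializing the whole result with collect: writing it out from the executors, or iterating over it on the driver one partition at a time with toLocalIterator. The helpers `getRPath` and `getAttribute` are hypothetical stand-ins for the functions in the question (their real logic is not shown there), and the sample data and output location are assumptions.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object KeepResultDistributed extends App {
  // Hypothetical stand-ins for the question's helpers:
  // everything before the last '/' is the path, the rest is the attribute
  def getRPath(key: String): String = key.substring(0, key.lastIndexOf('/'))
  def getAttribute(key: String): String = key.substring(key.lastIndexOf('/') + 1)

  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("KeepResultDistributed")
    .getOrCreate()

  // Sample data in the (asdfg/1234/wert,asdf) format from the question
  val listData = List(("asdfg/1234/wert", "asdf"), ("asdfg/1234/qwer", "zxcv"))

  // One (path, attribute -> value map) pair per path, still distributed
  val res = spark.sparkContext
    .parallelize(listData)
    .map { case (key, value) => (getRPath(key), (getAttribute(key), value)) }
    .groupByKey()
    .mapValues(_.toMap)

  // Option 1: write the result from the executors instead of collecting it
  val outputDir = Files.createTempDirectory("spark-out").resolve("maps").toString
  res.map { case (path, attrs) => path + "\t" + attrs.mkString(",") }
    .saveAsTextFile(outputDir)

  // Option 2: pull the result to the driver one partition at a time
  res.toLocalIterator.foreach { case (path, attrs) => println(path + " -> " + attrs) }

  spark.stop()
}
```

With toLocalIterator the driver only needs enough memory for the largest partition, whereas collect needs room for the entire result.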

Spark-Scala: save as csv file (RDD) [duplicate]

I have tried to stream Twitter data using Apache Spark, and I want to save the streamed data as a CSV file, but I couldn't.
How can I fix my code to get it in CSV?
I use RDD.
This is my main code:
val ssc = new StreamingContext(conf, Seconds(3600))
val stream = TwitterUtils.createStream(ssc, None, filters)
val tweets = stream.map(t => {
Map(
// This is for tweet
"text" -> t.getText,
"retweet_count" -> t.getRetweetCount,
"favorited" -> t.isFavorited,
"truncated" -> t.isTruncated,
"id_str" -> t.getId,
"in_reply_to_screen_name" -> t.getInReplyToScreenName,
"source" -> t.getSource,
"retweeted" -> t.isRetweetedByMe,
"created_at" -> t.getCreatedAt,
"in_reply_to_status_id_str" -> t.getInReplyToStatusId,
"in_reply_to_user_id_str" -> t.getInReplyToUserId,
// This is for tweet's user
"listed_count" -> t.getUser.getListedCount,
"verified" -> t.getUser.isVerified,
"location" -> t.getUser.getLocation,
"user_id_str" -> t.getUser.getId,
"description" -> t.getUser.getDescription,
"geo_enabled" -> t.getUser.isGeoEnabled,
"user_created_at" -> t.getUser.getCreatedAt,
"statuses_count" -> t.getUser.getStatusesCount,
"followers_count" -> t.getUser.getFollowersCount,
"favorites_count" -> t.getUser.getFavouritesCount,
"protected" -> t.getUser.isProtected,
"user_url" -> t.getUser.getURL,
"name" -> t.getUser.getName,
"time_zone" -> t.getUser.getTimeZone,
"user_lang" -> t.getUser.getLang,
"utc_offset" -> t.getUser.getUtcOffset,
"friends_count" -> t.getUser.getFriendsCount,
"screen_name" -> t.getUser.getScreenName
)
})
tweets.repartition(1).saveAsTextFiles("~/streaming/tweets")
You need to convert tweets to a DataFrame to save it as CSV. Here tweets is a DStream[Map[String, Any]] (one RDD per batch), and neither a DStream nor an RDD has a schema, whereas the CSV format needs one. So you have to convert each RDD to a DataFrame, which has a schema.
There are several ways of doing that. One approach is to use a case class instead of putting the data into maps.
case class Tweet(text: String, retweetCount: Int, ...) // Tweet is an illustrative name
Now instead of Map(...) you instantiate the case class with the proper parameters.
Finally, convert each batch of tweets to a DataFrame using the Spark implicit conversions:
tweets.foreachRDD { rdd =>
  import spark.implicits._
  rdd.toDF.write.mode("append").csv(...) // saves each batch as CSV
}
Alternatively you can convert the Map to a dataframe using the solution given here
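If you would rather stay with plain RDDs and saveAsTextFiles, another option is to turn each Map into one CSV line yourself, using a fixed column order. A minimal sketch in plain Scala; the column subset in `columns` is an assumption chosen for illustration, and the quoting follows the usual CSV convention of doubling embedded double quotes:

```scala
object MapToCsv {
  // Fixed column order for the output (an assumed subset of the tweet keys)
  val columns: Seq[String] = Seq("id_str", "screen_name", "text", "retweet_count")

  // Quote a field when it contains a comma, a double quote, or a line break,
  // doubling any embedded double quotes
  def csvField(value: Any): String = {
    val s = if (value == null) "" else value.toString
    if (s.exists(c => c == ',' || c == '"' || c == '\n' || c == '\r'))
      "\"" + s.replace("\"", "\"\"") + "\""
    else s
  }

  // Render one tweet map as a single CSV line; missing keys become empty fields
  def toCsvLine(row: Map[String, Any]): String =
    columns.map(c => csvField(row.getOrElse(c, ""))).mkString(",")
}
```

With this in place, the last line of the question's code could become tweets.map(MapToCsv.toCsvLine).repartition(1).saveAsTextFiles("~/streaming/tweets"). Note that this writes no header row.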

Get multiple columns from database?

I'm using the following code to get a list of columns from a database table.
val result =
sqlContext.read.format("jdbc").options(Map(
"driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
"url" -> jdbcSqlConn,
"dbtable" -> s"...."
)).load()
.select("column1") // Now I need to select("col1", "col2", "col3")
.as[Int]
Now I need to get multiple columns from the database table, and I want the result to be strongly typed (Dataset?). How should the code be written?
This should do the trick:
val colNames = Seq("column1", "col1", "col2", /* ... */ "coln")
val result = sqlContext.read.format("jdbc").options(Map(
"driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
"url" -> jdbcSqlConn,
"dbtable" -> s"...."
)).load().select(colNames.head, colNames.tail: _*)
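import org.apache.spark.sql.types.IntegerType
// Cast column1 to Int via a temporary column, then restore the original name
val newResult = result.withColumn("column1New", result("column1").cast(IntegerType))
.drop("column1").withColumnRenamed("column1New", "column1")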
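To get a strongly typed result, the selected columns can be mapped onto a case class or a tuple with `.as[...]`. A minimal sketch, assuming a hypothetical row type `Record` and an in-memory DataFrame standing in for the JDBC load; the column names and types are illustrative:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical row type; field names must match the selected column names
final case class Record(col1: Int, col2: String, col3: Double)

object TypedSelect extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("TypedSelect")
    .getOrCreate()
  import spark.implicits._

  // Stand-in for the DataFrame returned by the JDBC load in the question
  val df = Seq((1, "a", 1.5, "unused"), (2, "b", 2.5, "unused"))
    .toDF("col1", "col2", "col3", "other")

  // Select several columns and convert them to a typed Dataset
  val typed: Dataset[Record] = df.select("col1", "col2", "col3").as[Record]

  // A tuple works as well when a dedicated case class is not wanted
  val tuples: Dataset[(Int, String, Double)] =
    df.select("col1", "col2", "col3").as[(Int, String, Double)]

  // Collected here only because the sample is tiny
  val rows: List[Record] = typed.collect().toList
  rows.foreach(println)

  spark.stop()
}
```

If a column's database type does not match the field type, a cast like the one above can be applied before calling `.as[Record]`.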

Adding JsValue to Mongo

Let's say I have a JsValue.
val dbo = MongoDBObject("id" -> "0001", "name" -> "Kevin", "age" -> "100")
val json: JsValue = Json.parse(dbo.toString)
I tried to insert json via:
val obj = MongoDBObject("key" -> json)
collection.insert(obj)
However, there are many brackets [ and ] added to the json part when I do a db.collection.findOne() from the Mongo shell.
How can I properly add a JsValue into Mongo in Casbah?
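One possible approach, as a sketch rather than a definitive answer: serialize the JsValue back to JSON text and parse that text with the Java driver's `com.mongodb.util.JSON` helper, so that a DBObject is stored under the key instead of the Play JSON object. This assumes a legacy driver version that still ships `com.mongodb.util.JSON` and that the JsValue holds a JSON object; `toDBObject` and `insertUnderKey` are hypothetical helper names.

```scala
import com.mongodb.casbah.Imports._
import com.mongodb.util.JSON
import play.api.libs.json.{JsValue, Json}

object JsValueToMongo {
  // Convert a Play JsValue (assumed to be a JSON object) into a DBObject
  // by round-tripping through its JSON text
  def toDBObject(json: JsValue): DBObject =
    JSON.parse(Json.stringify(json)).asInstanceOf[DBObject]

  // `collection` is assumed to be an existing Casbah MongoCollection
  def insertUnderKey(collection: MongoCollection, json: JsValue): Unit = {
    collection.insert(MongoDBObject("key" -> toDBObject(json)))
    ()
  }
}
```

With the `collection` and `json` values from the question, the insert would then be JsValueToMongo.insertUnderKey(collection, json).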