Nested fields in MongoDB Documents with Scala - mongodb

While upgrading the MongoDB connection of a Scala application from MongoDB + Casbah to mongo-scala-driver 2.3.0 (Scala 2.11.8), we are running into problems when creating the Documents to insert into the DB. Specifically, the trouble is with nested fields of type Map[String, Any] or Map[Int, Int].
If my field is of type Map[String, Int] there's no problem and the code compiles fine:
val usersPerPage = Map("home" -> 23, "contact" -> 12) //Map[String,Int]
Document("page_id" -> pageId, "users_per_page" -> Document(usersPerPage))
//Compiles
val usersPerTime = Map(180 -> 23, 68 -> 34) //Map[Int,Int]
Document("page_id" -> pageId, "users_per_time" -> Document(usersPerTime))
//Doesn't compile
val usersConf = Map("age" -> 32, "country" -> "Spain") //Map[String,Any]
Document("user_id" -> userId, "user_conf" -> Document(usersConf))
//Doesn't compile
I've tried many workarounds, but I'm not able to create a whole Document to insert that has fields of type Map[Int, Int] or Map[String, Any]. I thought upgrading to a newer version of the Mongo driver would make things easier. What am I missing?

Keep in mind that Map[Int, Int] is not a valid Document map, as a Document is a map from String keys to BsonValue values.
This will therefore compile:
val usersPerTime = Map("180" -> 23, "68" -> 34) //Map[String,Int]
Document("page_id" -> pageId, "users_per_time" -> Document(usersPerTime))
For both cases, build the nested value directly as a Document instead of a Map:
val usersConf = Document("age" -> 32, "country" -> "Spain")
Document("user_id" -> userId, "user_conf" -> usersConf)
This works well with "org.mongodb.scala" %% "mongo-scala-driver" % "2.1.0"
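If the Map[Int, Int] already exists elsewhere in the code, a minimal sketch of the same idea is to stringify the keys before wrapping the map in a Document. This relies only on the (String, Int) conversions shown above and reuses the pageId value from the question:
val usersPerTime: Map[Int, Int] = Map(180 -> 23, 68 -> 34)
// Stringify the keys so the map has the String -> BsonValue shape a Document needs
val usersPerTimeStr: Map[String, Int] = usersPerTime.map { case (k, v) => k.toString -> v }
Document("page_id" -> pageId, "users_per_time" -> Document(usersPerTimeStr))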

Related

Update a key within an object inside a JsObject in Play

I am creating a JsObject using the following code:
var json = Json.obj(
  "spec" -> Json.obj(
    "type" -> "Scala",
    "mode" -> "cluster",
    "image" -> sparkImage,
    "imagePullPolicy" -> "Always",
    "mainClass" -> mainClass,
    "mainApplicationFile" -> jarFile,
    "sparkVersion" -> "2.4.4",
    "sparkConf" -> Json.obj(
      "spark.kubernetes.driver.volumes.persistentVolumeClaim.jar-volume.mount.path" -> "/opt/spark/work-dir/",
      "spark.kubernetes.driver.volumes.persistentVolumeClaim.files-volume.mount.path" -> "/opt/spark/files/",
      "spark.kubernetes.executor.volumes.persistentVolumeClaim.files-volume.mount.path" -> "/opt/spark/files/",
      "spark.kubernetes.executor.volumes.persistentVolumeClaim.jar-volume.mount.path" -> "/opt/spark/work-dir/",
      "spark.kubernetes.driver.volumes.persistentVolumeClaim.log-volume.mount.path" -> "/opt/spark/event-logs/",
      "spark.eventLog.enabled" -> "true",
      "spark.eventLog.dir" -> "/opt/spark/event-logs/"
    )
  )
)
Now, I will be fetching some additional sparkConf parameters from my database. Once fetched, I will store them in a regular Scala Map (Map[String, String]) containing the key-value pairs that should go into sparkConf.
I need to update the sparkConf within the spec inside my JsObject. So ideally I would want to apply a transformation like this:
val sparkSession = Map[String, JsString]("spark.eventLog.enabled" -> JsString("true"))
val transformer = (__ \ "spec" \ "sparkConf").json.update(
__.read[JsObject].map(e => e + sparkSession)
)
However, I haven't found a way to do this.
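One possible fix, sketched below under the assumption that Play JSON's branch-update transformer ((__ \ ...).json.update) behaves as in the Play documentation: JsObject's + expects a single (String, JsValue) pair, which is why e + sparkSession does not compile; merging the fetched map as a JsObject with ++ should work instead.
import play.api.libs.json._

val sparkSession = Map[String, JsString]("spark.eventLog.enabled" -> JsString("true"))

val transformer = (__ \ "spec" \ "sparkConf").json.update(
  __.read[JsObject].map(existing => existing ++ JsObject(sparkSession))
)

// transform returns a JsResult; keep the original object if the path is missing
val updated = json.transform(transformer).getOrElse(json)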

Java HashMap in Scala throws error

The following Scala code that uses java.util.HashMap (I need to use Java because it's for a Java interface) works fine:
val row1 = new HashMap[String,String](Map("code" -> "B1", "name" -> "Store1"))
val row2 = new HashMap[String,String](Map("code" -> "B2", "name" -> "Store 2"))
val map1 = Array[Object](row1,row2)
Now, I'm trying to create map1 dynamically:
val values: Seq[Seq[String]] = ....
val mapx = values.map {
row => new HashMap[String,String](Map(row.map( col => "someName" -> col))) <-- error
}
val map1 = Array[Object](mapx)
But I get the following compilation error:
type mismatch; found : Seq[(String, String)] required: (?, ?)
How to fix this?
We can simplify your code a bit to isolate the problem:
val mapx = Map(Seq("someKey" -> "someValue"))
This still produces the same error message, so the error wasn't actually related to your use of Java HashMaps, but to passing a Seq as an argument to Scala's Map.
The problem is that Map is variadic and expects key-value pairs as its arguments, not a data structure containing them. In Java a variadic method can also be called with an array instead, without any kind of conversion. This isn't true in Scala: there you need to use : _* to explicitly convert a sequence into a list of arguments when calling a variadic method. So this works:
val mapx = Map(mySequence : _*)
Alternatively, you can just use .toMap to create a Map from a sequence of tuples:
val mapx = mySequence.toMap
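Applied to the original snippet, a rough sketch of the fix (assuming scala.collection.JavaConverters is used for the Scala-to-Java map conversion; the original code presumably had an implicit conversion in scope for the HashMap constructor):
import java.util.HashMap
import scala.collection.JavaConverters._

val values: Seq[Seq[String]] = Seq(Seq("B1", "Store1"), Seq("B2", "Store 2"))

val mapx = values.map { row =>
  // Expand the Seq of pairs into Map's varargs with : _*, then hand a Java map to the constructor
  new HashMap[String, String](Map(row.map(col => "someName" -> col): _*).asJava)
}
val map1 = mapx.toArray[Object]
Note that with the constant key "someName" each row's map keeps only its last column; real code would presumably pair each column with a distinct key.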

Spark-Scala: save as csv file (RDD) [duplicate]

This question already has answers here:
How to save a spark DataFrame as csv on disk?
I have tried to stream Twitter data using Apache Spark and I want to save the streamed data as a CSV file, but I couldn't.
How can I fix my code so it writes out CSV?
I use RDDs.
This is my main code:
val ssc = new StreamingContext(conf, Seconds(3600))
val stream = TwitterUtils.createStream(ssc, None, filters)
val tweets = stream.map(t => {
  Map(
    // This is for tweet
    "text" -> t.getText,
    "retweet_count" -> t.getRetweetCount,
    "favorited" -> t.isFavorited,
    "truncated" -> t.isTruncated,
    "id_str" -> t.getId,
    "in_reply_to_screen_name" -> t.getInReplyToScreenName,
    "source" -> t.getSource,
    "retweeted" -> t.isRetweetedByMe,
    "created_at" -> t.getCreatedAt,
    "in_reply_to_status_id_str" -> t.getInReplyToStatusId,
    "in_reply_to_user_id_str" -> t.getInReplyToUserId,
    // This is for tweet's user
    "listed_count" -> t.getUser.getListedCount,
    "verified" -> t.getUser.isVerified,
    "location" -> t.getUser.getLocation,
    "user_id_str" -> t.getUser.getId,
    "description" -> t.getUser.getDescription,
    "geo_enabled" -> t.getUser.isGeoEnabled,
    "user_created_at" -> t.getUser.getCreatedAt,
    "statuses_count" -> t.getUser.getStatusesCount,
    "followers_count" -> t.getUser.getFollowersCount,
    "favorites_count" -> t.getUser.getFavouritesCount,
    "protected" -> t.getUser.isProtected,
    "user_url" -> t.getUser.getURL,
    "name" -> t.getUser.getName,
    "time_zone" -> t.getUser.getTimeZone,
    "user_lang" -> t.getUser.getLang,
    "utc_offset" -> t.getUser.getUtcOffset,
    "friends_count" -> t.getUser.getFriendsCount,
    "screen_name" -> t.getUser.getScreenName
  )
})
tweets.repartition(1).saveAsTextFiles("~/streaming/tweets")
You need to convert tweets, which is an RDD[Map[String, String]], to a DataFrame in order to save it as CSV. The reason is simple: an RDD doesn't have a schema, whereas the CSV format requires one, so you have to convert the RDD to a DataFrame, which does have a schema.
There are several ways of doing that. One approach is to use a case class (named Tweet here) instead of putting the data into maps:
case class Tweet(text: String, retweetCount: Int, ...)
Now instead of Map(...) you instantiate the case class with proper parameters.
Finally, convert tweets to a DataFrame using Spark's implicit conversions:
import spark.implicits._
tweets.toDF.write.csv(...) // saves as CSV
Alternatively, you can convert the Map to a DataFrame using the solution given here.
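A minimal sketch of the case-class route, assuming Spark 2.x with a SparkSession, keeping only a few of the fields from the question, writing to a placeholder output path, and saving each micro-batch inside foreachRDD (toDF is available on the per-batch RDDs once spark.implicits._ is in scope):
import org.apache.spark.sql.SparkSession

// Case class covering a subset of the fields from the question
case class Tweet(text: String, retweetCount: Int, favorited: Boolean, screenName: String)

val spark = SparkSession.builder.getOrCreate()

val tweets = stream.map(t =>
  Tweet(t.getText, t.getRetweetCount, t.isFavorited, t.getUser.getScreenName)
)

tweets.foreachRDD { rdd =>
  import spark.implicits._
  // Each batch becomes a DataFrame whose columns are the case class fields
  rdd.toDF().write.mode("append").csv("/tmp/streaming/tweets")
}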

Get multiple columns from database?

I'm using the following code to get a column from a database table.
val result =
sqlContext.read.format("jdbc").options(Map(
"driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
"url" -> jdbcSqlConn,
"dbtable" -> s"...."
)).load()
.select("column1") // Now I need to select("col1", "col2", "col3")
.as[Int]
Now I need to get multiple columns from the database table and I want the result to be strongly typed (DataSet?). How should the code be written?
This should do the trick:
val colNames = Seq("column1","col1","col2",....."coln")
val result = sqlContext.read.format("jdbc").options(Map(
"driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
"url" -> jdbcSqlConn,
"dbtable" -> s"...."
)).load().select(colNames.head, colNames.tail: _*)
import org.apache.spark.sql.types.IntegerType
val newResult = result.withColumn("column1New", result("column1").cast(IntegerType))
  .drop("column1").withColumnRenamed("column1New", "column1")
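If a strongly typed Dataset is wanted, as the question suggests, one option is to map the selected columns onto a case class. A sketch with hypothetical column names and types (it assumes the column types line up with the case class fields, e.g. after the cast above):
// Hypothetical case class matching the selected columns and their types
case class ResultRow(column1: Int, col2: String, col3: String)

import sqlContext.implicits._

val typedResult = result
  .select("column1", "col2", "col3")
  .as[ResultRow] // Dataset[ResultRow]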

How to convert a map's keys from String to Int?

This is my initial RDD output
scala> results
scala.collection.Map[String,Long] = Map(4.5 -> 1534824, 0.5 -> 239125, 3.0 -> 4291193, 3.5 -> 2200156, 2.0 -> 1430997, 1.5 -> 279252, 4.0 -> 5561926,
rating -> 1, 1.0 -> 680732, 2.5 -> 883398, 5.0 -> 2898660)
I am removing the string key "rating" so that only numeric keys remain.
scala> val resultsInt = results.filterKeys(_ != "rating")
resultsInt: scala.collection.Map[String,Long] = Map(4.5 -> 1534824, 0.5 -> 239125, 3.0 -> 4291193, 3.5 -> 2200156, 2.0 -> 1430997, 1.5 -> 279252, 4.0 -> 5561926, 1.0 -> 680732, 2.5 -> 883398, 5.0 -> 2898660)
Sorting this gives the expected output, but I would like to convert the keys from String to Int before sorting, to get consistent output.
scala> val sortedOut2 = resultsInt.toSeq.sortBy(_._1)
sortedOut2: Seq[(String, Long)] = ArrayBuffer((0.5,239125), (1.0,680732), (1.5,279252), (2.0,1430997), (2.5,883398), (3.0,4291193), (3.5,2200156), (4.0,5561926), (4.5,1534824), (5.0,2898660))
I am new to Scala and have just started writing my Spark program. Please share some insight on how to convert the keys of a Map.
Based on your sample output, I suppose you meant converting the keys to Double?
val results: scala.collection.Map[String, Long] = Map(
"4.5" -> 1534824, "0.5" -> 239125, "3.0" -> 4291193, "3.5" -> 2200156,
"2.0" -> 1430997, "1.5" -> 279252, "4.0" -> 5561926, "rating" -> 1,
"1.0" -> 680732, "2.5" -> 883398, "5.0" -> 2898660
)
results.filterKeys(_ != "rating").
map{ case(k, v) => (k.toDouble, v) }.
toSeq.sortBy(_._1)
res1: Seq[(Double, Long)] = ArrayBuffer((0.5,239125), (1.0,680732), (1.5,279252), (2.0,1430997),
(2.5,883398), (3.0,4291193), (3.5,2200156), (4.0,5561926), (4.5,1534824), (5.0,2898660))
To map between different types, you just need to use the map operator (in Spark or plain Scala).
You can check the syntax here:
Convert a Map[String, String] to Map[String, Int] in Scala
The same method can be used with both Spark and Scala.
Please also see Scala - Convert keys from a Map to lower case?
The approach should be similar:
case class row(id: String, value: String)
val rddData = sc.parallelize(Seq(row("1", "hello world"), row("2", "hello there")))
rddData.map { currentRow =>
  (currentRow.id.toInt, currentRow.value)
}
// scala> org.apache.spark.rdd.RDD[(Int, String)]
Even if you didn't define a case class for the structure of the RDD and used something like a Tuple2 instead, you can just write
currentRow._1.toInt // instead of currentRow.id.toInt
Please read up on converting a String to an Int; there are a few ways to go about that.
Hope this helps! Good luck :)
Distilling your RDD into a Map is legal, but it defeats the purpose of using Spark in the first place. If you are operating at scale, your current approach renders the RDD meaningless. If you aren't, then you can just do Scala collection manipulation as you suggest, but then why bother with the overhead of Spark at all?
I would instead operate at the DataFrame level of abstraction and transform that String column into a Double like this:
import org.apache.spark.sql.types.DoubleType
import sparkSession.implicits._

dataFrame
  .select("key", "value")
  .withColumn("key", 'key.cast(DoubleType))
This is of course assuming that Spark didn't already recognize the key as a Double after setting inferSchema to true on the initial data ingest.
If you are trying to filter out the entries whose keys are not numbers, you may just do the following:
import scala.util.{Try, Success, Failure}

(results map { case (k, v) => Try(k.toFloat) match {
  case Success(x) => Some((x, v))
  case Failure(_) => None
}}).flatten
res1: Iterable[(Float, Long)] = List((4.5,1534824), (0.5,239125), (3.0,4291193), (3.5,2200156), (2.0,1430997), (1.5,279252), (4.0,5561926), (1.0,680732), (2.5,883398), (5.0,2898660))