Spark-Scala: save as csv file (RDD) [duplicate]

This question already has answers here:
How to save a spark DataFrame as csv on disk?
(4 answers)
Closed 5 years ago.
I have tried to stream Twitter data using Apache Spark, and I want to save the streamed data as a CSV file, but I haven't been able to.
How can I fix my code so that the output is written as CSV?
I am using an RDD.
This is my main code:
val ssc = new StreamingContext(conf, Seconds(3600))
val stream = TwitterUtils.createStream(ssc, None, filters)

val tweets = stream.map(t => {
  Map(
    // This is for the tweet
    "text" -> t.getText,
    "retweet_count" -> t.getRetweetCount,
    "favorited" -> t.isFavorited,
    "truncated" -> t.isTruncated,
    "id_str" -> t.getId,
    "in_reply_to_screen_name" -> t.getInReplyToScreenName,
    "source" -> t.getSource,
    "retweeted" -> t.isRetweetedByMe,
    "created_at" -> t.getCreatedAt,
    "in_reply_to_status_id_str" -> t.getInReplyToStatusId,
    "in_reply_to_user_id_str" -> t.getInReplyToUserId,
    // This is for the tweet's user
    "listed_count" -> t.getUser.getListedCount,
    "verified" -> t.getUser.isVerified,
    "location" -> t.getUser.getLocation,
    "user_id_str" -> t.getUser.getId,
    "description" -> t.getUser.getDescription,
    "geo_enabled" -> t.getUser.isGeoEnabled,
    "user_created_at" -> t.getUser.getCreatedAt,
    "statuses_count" -> t.getUser.getStatusesCount,
    "followers_count" -> t.getUser.getFollowersCount,
    "favorites_count" -> t.getUser.getFavouritesCount,
    "protected" -> t.getUser.isProtected,
    "user_url" -> t.getUser.getURL,
    "name" -> t.getUser.getName,
    "time_zone" -> t.getUser.getTimeZone,
    "user_lang" -> t.getUser.getLang,
    "utc_offset" -> t.getUser.getUtcOffset,
    "friends_count" -> t.getUser.getFriendsCount,
    "screen_name" -> t.getUser.getScreenName
  )
})

tweets.repartition(1).saveAsTextFiles("~/streaming/tweets")

You need to convert tweets, which is an RDD[Map[String, String]], to a DataFrame in order to save it as CSV. The reason is simple: an RDD has no schema, whereas the CSV format requires one, so you have to convert the RDD to a DataFrame, which does have a schema.
There are several ways of doing that. One approach is to use a case class instead of putting the data into maps:
case class Tweet(text: String, retweetCount: Int, ...)
Now, instead of Map(...), you instantiate the case class with the proper parameters.
Finally, convert tweets to a DataFrame using Spark's implicit conversions:
import spark.implicits._
tweets.toDF.write.csv(...) // saves as CSV
Alternatively, you can convert the Map to a DataFrame using the solution given here.
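For the streaming case specifically, here is a minimal sketch of how the pieces could fit together. It assumes an active SparkSession named spark, a trimmed-down hypothetical Tweet case class with only a few of the fields above, and the same filters value as in the question; each micro-batch is converted to a DataFrame inside foreachRDD and appended as CSV:
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

// Hypothetical, trimmed-down schema; extend with the remaining fields as needed.
case class Tweet(text: String, retweetCount: Int, screenName: String, createdAt: String)

val spark = SparkSession.builder().appName("TwitterToCsv").getOrCreate()
import spark.implicits._

val ssc = new StreamingContext(spark.sparkContext, Seconds(3600))
val stream = TwitterUtils.createStream(ssc, None, filters) // filters as defined in the question

val tweets = stream.map(t =>
  Tweet(t.getText, t.getRetweetCount, t.getUser.getScreenName, t.getCreatedAt.toString)
)

// Each batch interval yields one RDD; convert it to a DataFrame and append it as CSV.
tweets.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    rdd.toDF().write.mode("append").csv("/tmp/streaming/tweets") // placeholder output path
  }
}

ssc.start()
ssc.awaitTermination()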

Related

Update a key within an object inside a JsObject in Play

I am creating a JsObject using the following code:
var json = Json.obj(
  "spec" -> Json.obj(
    "type" -> "Scala",
    "mode" -> "cluster",
    "image" -> sparkImage,
    "imagePullPolicy" -> "Always",
    "mainClass" -> mainClass,
    "mainApplicationFile" -> jarFile,
    "sparkVersion" -> "2.4.4",
    "sparkConf" -> Json.obj(
      "spark.kubernetes.driver.volumes.persistentVolumeClaim.jar-volume.mount.path" -> "/opt/spark/work-dir/",
      "spark.kubernetes.driver.volumes.persistentVolumeClaim.files-volume.mount.path" -> "/opt/spark/files/",
      "spark.kubernetes.executor.volumes.persistentVolumeClaim.files-volume.mount.path" -> "/opt/spark/files/",
      "spark.kubernetes.executor.volumes.persistentVolumeClaim.jar-volume.mount.path" -> "/opt/spark/work-dir/",
      "spark.kubernetes.driver.volumes.persistentVolumeClaim.log-volume.mount.path" -> "/opt/spark/event-logs/",
      "spark.eventLog.enabled" -> "true",
      "spark.eventLog.dir" -> "/opt/spark/event-logs/"
    )
  )
)
Now, I will be fetching some additional sparkConf parameters from my database. Once I fetch them, I will store them in a regular Scala Map (Map[String, String]) containing the key-value pairs that should go into the sparkConf.
I need to update the sparkConf within the spec inside my JsObject. So ideally I would want to apply a transformation like this:
val sparkSession = Map[String, JsString]("spark.eventLog.enabled" -> JsString("true"))
val transformer = (__ \ "spec" \ "sparkConf").json.update(
  __.read[JsObject].map(e => e + sparkSession)
)
However, I haven't found a way to do this.
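One possible way to do this (a sketch, not from the original post): convert the Map fetched from the database into a JsObject and merge it into the existing sparkConf branch with ++ inside the same json.update transformer. The extraConf value below is a hypothetical stand-in for the data coming from the database:
import play.api.libs.json._

// Hypothetical stand-in for the key-value pairs fetched from the database.
val extraConf: Map[String, String] = Map("spark.eventLog.enabled" -> "true")

// Turn the plain Scala map into a JsObject so it can be merged with ++.
val extraConfJson = JsObject(extraConf.map { case (k, v) => k -> JsString(v) }.toSeq)

// Read the existing sparkConf object and merge the new keys into it.
val transformer = (__ \ "spec" \ "sparkConf").json.update(
  __.read[JsObject].map(conf => conf ++ extraConfJson)
)

// Apply the transformer to the JsObject built above.
val updated: JsResult[JsObject] = json.transform(transformer)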

How to avoid using of collect in Spark RDD in Scala?

I have a List and need to create a Map from it for further use. I am using an RDD, but when I use collect(), the job fails on the cluster. Any help is appreciated.
Below is the sample code going from the List to rdd.collect. I have to use this Map data later, but how can I use it without collect?
This code creates a Map from the RDD (List) data. List format: (asdfg/1234/wert, asdf)
// List data used to create the Map
val listData = methodToGetListData(ListData).toList

// Creating an RDD from the above List
val rdd = sparkContext.makeRDD(listData)

implicit val formats = Serialization.formats(NoTypeHints)

val res = rdd
  .map(map => (getRPath(map._1), getAttribute(map._1), map._2))
  .groupBy(_._1)
  .map(tuple => {
    Map(
      "P_Id" -> "1234",
      "R_Time" -> "27-04-2020",
      "S_Time" -> "27-04-2020",
      "r_path" -> tuple._1,
      "S_Tag" -> "12345",
      tuple._1 -> tuple._2.map(a => (a._2, a._3)).toMap
    )
  })

res.collect()
Q: How can I use it without collect?
Answer: collect() moves all the data to the driver node. If the data is huge, never do that.
I don't know exactly what the use case for preparing the map is, but it can be achieved using a built-in Spark API, namely collectionAccumulator. In detail:
collectionAccumulator[scala.collection.mutable.Map[String, String]]
Let's suppose this is your sample dataframe and you want to make a map:
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|Item_Id|Parent_Id|object_class_instance|Received_Time|CablesName|CablesStatus|CablesHInfoID|CablesIndex|object_class|ServiceTag|Scan_Time|relation_tree |
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|-0909 |1234 |Cables-1 |23-12-2020 |LC |Installed |ABCD1234 |0 |Cables |ASDF123 |12345 |Start~>HInfo->Cables->Cables-1 |
|-09091 |1234111 |Cables-11 |23-12-2022 |LC1 |Installed1 |ABCD12341 |0 |Cables1 |ASDF1231 |123451 |Start~>HInfo->Cables->Cables-11|
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
From this you want to make a map (a nested map; I prefixed it with a nestedmap key name in your example).
Below is a full example; have a look and modify it accordingly:
package examples

import org.apache.log4j.Level

object GrabMapbetweenClosure extends App {
  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession
    .builder()
    .master("local[*]")
    .appName(this.getClass.getName)
    .getOrCreate()

  import spark.implicits._

  var mutableMapAcc = spark.sparkContext.collectionAccumulator[scala.collection.mutable.Map[String, String]]("mutableMap")

  val df = Seq(
    ("-0909", "1234", "Cables-1", "23-12-2020", "LC", "Installed", "ABCD1234",
      "0", "Cables", "ASDF123", "12345", "Start~>HInfo->Cables->Cables-1"),
    ("-09091", "1234111", "Cables-11", "23-12-2022", "LC1", "Installed1", "ABCD12341",
      "0", "Cables1", "ASDF1231", "123451", "Start~>HInfo->Cables->Cables-11")
  ).toDF("Item_Id", "Parent_Id", "object_class_instance", "Received_Time", "CablesName", "CablesStatus", "CablesHInfoID",
    "CablesIndex", "object_class", "ServiceTag", "Scan_Time", "relation_tree")

  df.show(false)

  df.foreachPartition { partition => // for performance's sake I used foreachPartition
    partition.foreach { record =>
      mutableMapAcc.add(scala.collection.mutable.Map(
        "Item_Id" -> record.getAs[String]("Item_Id"),
        "CablesStatus" -> record.getAs[String]("CablesStatus"),
        "CablesHInfoID" -> record.getAs[String]("CablesHInfoID"),
        "Parent_Id" -> record.getAs[String]("Parent_Id"),
        "CablesIndex" -> record.getAs[String]("CablesIndex"),
        "object_class_instance" -> record.getAs[String]("object_class_instance"),
        "Received_Time" -> record.getAs[String]("Received_Time"),
        "object_class" -> record.getAs[String]("object_class"),
        "CablesName" -> record.getAs[String]("CablesName"),
        "ServiceTag" -> record.getAs[String]("ServiceTag"),
        "Scan_Time" -> record.getAs[String]("Scan_Time"),
        "relation_tree" -> record.getAs[String]("relation_tree")
      ))
    }
  }

  println("FinalMap : " + mutableMapAcc.value.toString)
}
Result :
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|Item_Id|Parent_Id|object_class_instance|Received_Time|CablesName|CablesStatus|CablesHInfoID|CablesIndex|object_class|ServiceTag|Scan_Time|relation_tree |
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|-0909 |1234 |Cables-1 |23-12-2020 |LC |Installed |ABCD1234 |0 |Cables |ASDF123 |12345 |Start~>HInfo->Cables->Cables-1 |
|-09091 |1234111 |Cables-11 |23-12-2022 |LC1 |Installed1 |ABCD12341 |0 |Cables1 |ASDF1231 |123451 |Start~>HInfo->Cables->Cables-11|
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
FinalMap : [Map(Scan_Time -> 123451, ServiceTag -> ASDF1231, Received_Time -> 23-12-2022, object_class_instance -> Cables-11, CablesHInfoID -> ABCD12341, Parent_Id -> 1234111, Item_Id -> -09091, CablesIndex -> 0, object_class -> Cables1, relation_tree -> Start~>HInfo->Cables->Cables-11, CablesName -> LC1, CablesStatus -> Installed1), Map(Scan_Time -> 12345, ServiceTag -> ASDF123, Received_Time -> 23-12-2020, object_class_instance -> Cables-1, CablesHInfoID -> ABCD1234, Parent_Id -> 1234, Item_Id -> -0909, CablesIndex -> 0, object_class -> Cables, relation_tree -> Start~>HInfo->Cables->Cables-1, CablesName -> LC, CablesStatus -> Installed)]
Similar problem was solved here.
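If the maps only need to be persisted or consumed by downstream jobs rather than inspected on the driver, another option is to keep res distributed and write each map out from the executors. This is a sketch, not part of the original answer; it reuses the json4s Serialization setup already present in the question and a hypothetical output path:
import org.json4s.NoTypeHints
import org.json4s.jackson.Serialization

// As in the question's code, json4s needs an implicit Formats in scope.
implicit val formats = Serialization.formats(NoTypeHints)

// Serialize each Map to a JSON string on the executors and write straight to storage,
// so nothing is ever pulled back to the driver.
res
  .map(m => Serialization.write(m))
  .saveAsTextFile("/tmp/rpath-maps") // hypothetical output path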

Nested fields in mongodb Documents with scala

When upgrading the MongoDB connection of a Scala application from MongoDB+Casbah to mongo-scala-driver 2.3.0 (Scala 2.11.8), we are facing some problems when creating the Documents to insert into the DB. Basically I'm facing problems with nested fields of type Map[String,Any] or Map[Int,Int].
If my field is of type Map[String, Int] there's no problem and the code compiles fine:
val usersPerPage = Map("home" -> 23, "contact" -> 12) //Map[String,Int]
Document("page_id" -> pageId, "users_per_page" -> Document(usersPerPage))
//Compiles
val usersPerTime = Map(180 -> 23, 68 -> 34) //Map[Int,Int]
Document("page_id" -> pageId, "users_per_time" -> Document(usersPerTime))
//Doesn't compile
val usersConf = Map("age" -> 32, "country" -> "Spain") //Map[String,Any]
Document("user_id" -> userId, "user_conf" -> Document(usersConf))
//Doesn't compile
I've tried many workarounds, but I'm not able to create a whole Document to insert with fields of type Map[Int,Int] or Map[String,Any]. I thought upgrading to a newer version of the Mongo driver would make things easier. What am I missing?
Keep in mind that the type Map[Int,Int] is not a valid Document map, as Documents use the (String, BsonValue) key-value format.
This will therefore compile:
val usersPerTime = Map("180" -> 23, "68" -> 34) //Map[String,Int]
Document("page_id" -> pageId, "users_per_time" -> Document(usersPerTime))
For both cases, do it directly with a Document class instead of Map:
val usersConf = Document("age" -> 32, "country" -> "Spain")
Document("user_id" -> userId, "user_conf" -> usersConf)
This works well with "org.mongodb.scala" %% "mongo-scala-driver" % "2.1.0"
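If the Map[Int,Int] already exists, a possible workaround (a sketch, not part of the original answer) is to stringify its keys first, which reduces it to the Map[String,Int] case shown above; pageId is reused from the question:
import org.mongodb.scala.Document

// Hypothetical data keyed by an Int.
val usersPerTime: Map[Int, Int] = Map(180 -> 23, 68 -> 34)

// Stringify the keys so the map matches the String -> value shape Documents expect.
val usersPerTimeDoc = Document(usersPerTime.map { case (k, v) => k.toString -> v })

Document("page_id" -> pageId, "users_per_time" -> usersPerTimeDoc)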

Get multiple columns from database?

I'm using the following code to get a list of columns from a database table.
val result =
  sqlContext.read.format("jdbc").options(Map(
    "driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "url" -> jdbcSqlConn,
    "dbtable" -> s"...."
  )).load()
    .select("column1") // Now I need to select("col1", "col2", "col3")
    .as[Int]
Now I need to get multiple columns from the database table and I want the result to be strongly typed (DataSet?). How should the code be written?
This should do the trick:
import org.apache.spark.sql.types.IntegerType

val colNames = Seq("column1","col1","col2",....."coln")

val result = sqlContext.read.format("jdbc").options(Map(
  "driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
  "url" -> jdbcSqlConn,
  "dbtable" -> s"...."
)).load().select(colNames.head, colNames.tail: _*)

val newResult = result.withColumn("column1New", result("column1").cast(IntegerType))
  .drop("column1").withColumnRenamed("column1New", "column1")

How to convert a map's keys from String to Int?

This is my initial RDD output
scala> results
scala.collection.Map[String,Long] = Map(4.5 -> 1534824, 0.5 -> 239125, 3.0 -> 4291193, 3.5 -> 2200156, 2.0 -> 1430997, 1.5 -> 279252, 4.0 -> 5561926,
rating -> 1, 1.0 -> 680732, 2.5 -> 883398, 5.0 -> 2898660)
I am removing the string key to keep only numeric keys.
scala> val resultsInt = results.filterKeys(_ != "rating")
resultsInt: scala.collection.Map[String,Long] = Map(4.5 -> 1534824, 0.5 -> 239125, 3.0 -> 4291193, 3.5 -> 2200156, 2.0 -> 1430997, 1.5 -> 279252, 4.0 -> 5561926, 1.0 -> 680732, 2.5 -> 883398, 5.0 -> 2898660)
Sorting the collection gives the expected output, but I would like to convert the key from String to int before sorting to get consistent output.
scala> val sortedOut2 = resultsInt.toSeq.sortBy(_._1)
sortedOut2: Seq[(String, Long)] = ArrayBuffer((0.5,239125), (1.0,680732), (1.5,279252), (2.0,1430997), (2.5,883398), (3.0,4291193), (3.5,2200156), (4.0,5561926), (4.5,1534824), (5.0,2898660))
I am new to Scala and have just started writing my Spark program. Please share some insight on how to convert the keys of the Map.
Based on your sample output, I suppose you meant converting the key to Double?
val results: scala.collection.Map[String, Long] = Map(
"4.5" -> 1534824, "0.5" -> 239125, "3.0" -> 4291193, "3.5" -> 2200156,
"2.0" -> 1430997, "1.5" -> 279252, "4.0" -> 5561926, "rating" -> 1,
"1.0" -> 680732, "2.5" -> 883398, "5.0" -> 2898660
)
results.filterKeys(_ != "rating").
map{ case(k, v) => (k.toDouble, v) }.
toSeq.sortBy(_._1)
res1: Seq[(Double, Long)] = ArrayBuffer((0.5,239125), (1.0,680732), (1.5,279252), (2.0,1430997),
(2.5,883398), (3.0,4291193), (3.5,2200156), (4.0,5561926), (4.5,1534824), (5.0,2898660))
To map between different types, you just need to use the map operator (available on both Spark RDDs and Scala collections).
You can check the syntax here:
Convert a Map[String, String] to Map[String, Int] in Scala
The same method can be used with both Spark and Scala.
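For illustration, a minimal sketch of that approach on a plain Scala Map (using two entries from the output above); note that for keys such as "2.5" a toInt call would throw, so toDouble is the safer conversion here:
val m: Map[String, Long] = Map("1.0" -> 680732L, "2.5" -> 883398L)

// toInt would fail on keys like "2.5"; toDouble handles all of the rating keys.
val converted: Map[Double, Long] = m.map { case (k, v) => (k.toDouble, v) }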
Please see Scala - Convert keys from a Map to lower case?
The approach should be similar:
case class row(id: String, value: String)

val rddData = sc.parallelize(Seq(row("1", "hello world"), row("2", "hello there")))

rddData.map { currentRow => (currentRow.id.toInt, currentRow.value) }
// scala> org.apache.spark.rdd.RDD[(Int, String)]
Even if you didn't define a case class for the structure of the RDD and used something like Tuple2 instead, you can just write
currentRow._1.toInt // instead of currentRow.id.toInt
Please do some research on converting from String to Int; there are a few ways to go about it.
Hope this helps! Good luck :)
Distilling your RDD into a Map is legal, but it defeats the purpose of using Spark in the first place. If you are operating at scale, your current approach renders the RDD meaningless. If you aren't, then you can just do Scala collection manipulation as you suggest, but then why bother with the overhead of Spark at all?
I would instead operate at the DataFrame level of abstraction and transform that String column into a Double like this:
import org.apache.spark.sql.types.DoubleType
import sparkSession.implicits._

dataFrame
  .select("key", "value")
  .withColumn("key", 'key.cast(DoubleType))
And this is of course assuming that Spark didn't recognize the key as a Double already after setting the inferSchema to true on initial data ingest.
If you are trying to filter out the key being non-number, you may just do the following:
import scala.util.{Try, Success, Failure}

(results map { case (k, v) =>
  Try(k.toFloat) match {
    case Success(x) => Some((x, v))
    case Failure(_) => None
  }
}).flatten
res1: Iterable[(Float, Long)] = List((4.5,1534824), (0.5,239125), (3.0,4291193), (3.5,2200156), (2.0,1430997), (1.5,279252), (4.0,5561926), (1.0,680732), (2.5,883398), (5.0,2898660))