I'm using the following code to get a list of columns from a database table.
val result =
sqlContext.read.format("jdbc").options(Map(
"driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
"url" -> jdbcSqlConn,
"dbtable" -> s"...."
)).load()
.select("column1") // Now I need to select("col1", "col2", "col3")
.as[Int]
Now I need to get multiple columns from the database table, and I want the result to be strongly typed (a Dataset?). How should the code be written?
This should do the trick:
val colNames = Seq("column1","col1","col2",....."coln")
val result = sqlContext.read.format("jdbc").options(Map(
"driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
"url" -> jdbcSqlConn,
"dbtable" -> s"...."
)).load().select(colNames.head, colNames.tail: _*)
val newResult = result.withColumn("column1New", result("column1").cast(IntegerType)) // IntegerType comes from org.apache.spark.sql.types
.drop("column1").withColumnRenamed("column1New", "column1")
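If you also want the result to be strongly typed as a Dataset (as asked), you can map the selected columns onto a case class. A minimal sketch, assuming hypothetical column names and types that match the selection, on Spark 2.x:
import org.apache.spark.sql.types.IntegerType
import sqlContext.implicits._ // encoders for case classes

// Hypothetical row type; field names must match the selected column names
case class TableRow(column1: Int, col1: String, col2: String)

val typedResult = result
  .withColumn("column1", result("column1").cast(IntegerType)) // same-name withColumn replaces the column
  .as[TableRow] // Dataset[TableRow]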
I am creating a JsObject using the following code:
var json = Json.obj(
"spec" -> Json.obj(
"type" -> "Scala",
"mode" -> "cluster",
"image" -> sparkImage,
"imagePullPolicy" -> "Always",
"mainClass" -> mainClass,
"mainApplicationFile" -> jarFile,
"sparkVersion" -> "2.4.4",
"sparkConf" -> Json.obj(
"spark.kubernetes.driver.volumes.persistentVolumeClaim.jar-volume.mount.path" -> "/opt/spark/work-dir/",
"spark.kubernetes.driver.volumes.persistentVolumeClaim.files-volume.mount.path" -> "/opt/spark/files/",
"spark.kubernetes.executor.volumes.persistentVolumeClaim.files-volume.mount.path" -> "/opt/spark/files/",
"spark.kubernetes.executor.volumes.persistentVolumeClaim.jar-volume.mount.path" -> "/opt/spark/work-dir/",
"spark.kubernetes.driver.volumes.persistentVolumeClaim.log-volume.mount.path" -> "/opt/spark/event-logs/",
"spark.eventLog.enabled" -> "true",
"spark.eventLog.dir" -> "/opt/spark/event-logs/"
)
)
)
Now, I will be fetching some additional sparkConf parameters from my database. Once fetched, I will store them in a regular Scala Map (Map[String, String]) containing the key-value pairs that should go into the sparkConf.
I need to update the sparkConf within the spec inside my JsObject. So ideally I would want to apply a transformation like this:
val sparkSession = Map[String, JsString]("spark.eventLog.enabled" -> JsString("true"))
val transformer = (__ \ "spec" \ "sparkConf").json.update(
__.read[JsObject].map(e => e + sparkSession)
)
However, I haven't found a way to do this.
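For reference, a minimal sketch of such a merge with Play JSON, assuming the extra entries have already been lifted into a Map[String, JsString] (the names and error handling are illustrative):
import play.api.libs.json._

// Hypothetical extra sparkConf entries fetched from the database
val extraConf: Map[String, JsString] = Map("spark.eventLog.enabled" -> JsString("true"))

// ++ on JsObject merges the new fields into spec.sparkConf, overwriting duplicate keys
val transformer = (__ \ "spec" \ "sparkConf").json.update(
  __.read[JsObject].map(existing => existing ++ JsObject(extraConf))
)

json = json.transform(transformer) match {
  case JsSuccess(updated, _) => updated
  case JsError(errors)       => sys.error(s"Could not update sparkConf: $errors")
}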
I am writing a Databricks Scala/Python notebook which connects to a SQL Server database,
and I want to execute a SQL Server function from the notebook with custom parameters.
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._
val ID = "1"
val name = "A"
val config = Config(Map(
"url" -> "sample-p-vm.all.test.azure.com",
"databaseName" -> "DBsample",
"dbTable" -> "dbo.FN_cal_udf",
"user" -> "useer567",
"password" -> "pppp#345%",
"connectTimeout" -> "5", //seconds
"queryTimeout" -> "5" //seconds
))
val collection = sqlContext.read.sqlDB(config)
collection.show()
Here the function is FN_cal_udf, which is stored in the SQL Server database 'DBsample'.
I got this error:
jdbc.SQLServerException: Parameters were not supplied for the function
How can I pass parameters and call the SQL function inside the notebook in Scala or PySpark?
Here you can first build a query string that holds the function-call statement with dynamic parameters,
and then use it in the config.
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._
val ID = "1"
val name = "A"
val query = " [dbo].[FN_cal_udf]('"+ID+"','"+name+"')"
val config = Config(Map(
"url" -> "sample-p-vm.all.test.azure.com",
"databaseName" -> "DBsample",
"dbTable" -> query, // the parameterised function call built above, instead of the bare function name
"user" -> "useer567",
"password" -> "pppp#345%",
"connectTimeout" -> "5", //seconds
"queryTimeout" -> "5" //seconds
))
val collection = sqlContext.read.sqlDB(config)
collection.show()
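If the azure-sqldb-spark connector is not available, a similar approach works with the plain Spark JDBC reader by wrapping the table-valued function call in a subquery. A sketch with the same placeholder connection details:
// Generic JDBC reader; Spark issues SELECT * FROM (subquery) against the database
val jdbcUrl = "jdbc:sqlserver://sample-p-vm.all.test.azure.com;databaseName=DBsample"

val fnResult = sqlContext.read
  .format("jdbc")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .option("url", jdbcUrl)
  .option("dbtable", s"(SELECT * FROM [dbo].[FN_cal_udf]('$ID','$name')) AS fn")
  .option("user", "useer567")
  .option("password", "pppp#345%")
  .load()

fnResult.show()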
I have a List and have to create a Map from it for further use. I am using an RDD, but with the use of collect(), the job is failing in the cluster. Any help is appreciated.
Below is the sample code from the List to rdd.collect().
I have to use this Map data further, but how can I use it without collect?
This code creates a Map from the RDD (List) data. List format: (asdfg/1234/wert, asdf)
//List Data to create Map
val listData = methodToGetListData(ListData).toList
//Creating RDD from above List
val rdd = sparkContext.makeRDD(listData)
implicit val formats = Serialization.formats(NoTypeHints)
val res = rdd
.map(map => (getRPath(map._1), getAttribute(map._1), map._2))
.groupBy(_._1)
.map(tuple => {
Map(
"P_Id" -> "1234",
"R_Time" -> "27-04-2020",
"S_Time" -> "27-04-2020",
"r_path" -> tuple._1,
"S_Tag" -> "12345",
tuple._1 -> (tuple._2.map(a => (a._2, a._3)).toMap)
)
})
res.collect()
Q: How can I use it without collect?
Answer: collect() moves all of the data to the driver node; if the data is huge, never do that.
I don't know exactly what your use case for building the map is, but it can be achieved with the built-in Spark API collectionAccumulator. In detail:
collectionAccumulator[scala.collection.mutable.Map[String, String]]
Let's suppose this is your sample dataframe and you want to make a map:
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|Item_Id|Parent_Id|object_class_instance|Received_Time|CablesName|CablesStatus|CablesHInfoID|CablesIndex|object_class|ServiceTag|Scan_Time|relation_tree |
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|-0909 |1234 |Cables-1 |23-12-2020 |LC |Installed |ABCD1234 |0 |Cables |ASDF123 |12345 |Start~>HInfo->Cables->Cables-1 |
|-09091 |1234111 |Cables-11 |23-12-2022 |LC1 |Installed1 |ABCD12341 |0 |Cables1 |ASDF1231 |123451 |Start~>HInfo->Cables->Cables-11|
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
From this you want to make a map (a nested map, as in your example).
Below is the full example; have a look and modify it accordingly.
package examples
import org.apache.log4j.Level
object GrabMapbetweenClosure extends App {
val logger = org.apache.log4j.Logger.getLogger("org")
logger.setLevel(Level.WARN)
import org.apache.spark.sql.SparkSession
val spark = SparkSession
.builder()
.master("local[*]")
.appName(this.getClass.getName)
.getOrCreate()
import spark.implicits._
var mutableMapAcc = spark.sparkContext.collectionAccumulator[scala.collection.mutable.Map[String, String]]("mutableMap")
val df = Seq(
("-0909", "1234", "Cables-1", "23-12-2020", "LC", "Installed", "ABCD1234"
, "0", "Cables", "ASDF123", "12345", "Start~>HInfo->Cables->Cables-1")
, ("-09091", "1234111", "Cables-11", "23-12-2022", "LC1", "Installed1", "ABCD12341"
, "0", "Cables1", "ASDF1231", "123451", "Start~>HInfo->Cables->Cables-11")
).toDF("Item_Id", "Parent_Id", "object_class_instance", "Received_Time", "CablesName", "CablesStatus", "CablesHInfoID",
"CablesIndex", "object_class", "ServiceTag", "Scan_Time", "relation_tree"
)
df.show(false)
df.foreachPartition { partition => // for performance sake I used foreachPartition
partition.foreach {
record => {
mutableMapAcc.add(scala.collection.mutable.Map(
"Item_Id" -> record.getAs[String]("Item_Id")
, "CablesStatus" -> record.getAs[String]("CablesStatus")
, "CablesHInfoID" -> record.getAs[String]("CablesHInfoID")
, "Parent_Id" -> record.getAs[String]("Parent_Id")
, "CablesIndex" -> record.getAs[String]("CablesIndex")
, "object_class_instance" -> record.getAs[String]("object_class_instance")
, "Received_Time" -> record.getAs[String]("Received_Time")
, "object_class" -> record.getAs[String]("object_class")
, "CablesName" -> record.getAs[String]("CablesName")
, "ServiceTag" -> record.getAs[String]("ServiceTag")
, "Scan_Time" -> record.getAs[String]("Scan_Time")
, "relation_tree" -> record.getAs[String]("relation_tree")
)
)
}
}
}
println("FinalMap : " + mutableMapAcc.value.toString)
}
Result:
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|Item_Id|Parent_Id|object_class_instance|Received_Time|CablesName|CablesStatus|CablesHInfoID|CablesIndex|object_class|ServiceTag|Scan_Time|relation_tree |
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
|-0909 |1234 |Cables-1 |23-12-2020 |LC |Installed |ABCD1234 |0 |Cables |ASDF123 |12345 |Start~>HInfo->Cables->Cables-1 |
|-09091 |1234111 |Cables-11 |23-12-2022 |LC1 |Installed1 |ABCD12341 |0 |Cables1 |ASDF1231 |123451 |Start~>HInfo->Cables->Cables-11|
+-------+---------+---------------------+-------------+----------+------------+-------------+-----------+------------+----------+---------+-------------------------------+
FinalMap : [Map(Scan_Time -> 123451, ServiceTag -> ASDF1231, Received_Time -> 23-12-2022, object_class_instance -> Cables-11, CablesHInfoID -> ABCD12341, Parent_Id -> 1234111, Item_Id -> -09091, CablesIndex -> 0, object_class -> Cables1, relation_tree -> Start~>HInfo->Cables->Cables-11, CablesName -> LC1, CablesStatus -> Installed1), Map(Scan_Time -> 12345, ServiceTag -> ASDF123, Received_Time -> 23-12-2020, object_class_instance -> Cables-1, CablesHInfoID -> ABCD1234, Parent_Id -> 1234, Item_Id -> -0909, CablesIndex -> 0, object_class -> Cables, relation_tree -> Start~>HInfo->Cables->Cables-1, CablesName -> LC, CablesStatus -> Installed)]
A similar problem was solved here.
This question already has answers here:
How to save a spark DataFrame as csv on disk?
(4 answers)
Closed 5 years ago.
I have tried to stream Twitter data using Apache Spark, and I want to save the streamed data as a CSV file, but I couldn't.
How can I fix my code to get the data as CSV?
I use RDDs.
This is my main code:
val ssc = new StreamingContext(conf, Seconds(3600))
val stream = TwitterUtils.createStream(ssc, None, filters)
val tweets = stream.map(t => {
Map(
// This is for tweet
"text" -> t.getText,
"retweet_count" -> t.getRetweetCount,
"favorited" -> t.isFavorited,
"truncated" -> t.isTruncated,
"id_str" -> t.getId,
"in_reply_to_screen_name" -> t.getInReplyToScreenName,
"source" -> t.getSource,
"retweeted" -> t.isRetweetedByMe,
"created_at" -> t.getCreatedAt,
"in_reply_to_status_id_str" -> t.getInReplyToStatusId,
"in_reply_to_user_id_str" -> t.getInReplyToUserId,
// This is for tweet's user
"listed_count" -> t.getUser.getListedCount,
"verified" -> t.getUser.isVerified,
"location" -> t.getUser.getLocation,
"user_id_str" -> t.getUser.getId,
"description" -> t.getUser.getDescription,
"geo_enabled" -> t.getUser.isGeoEnabled,
"user_created_at" -> t.getUser.getCreatedAt,
"statuses_count" -> t.getUser.getStatusesCount,
"followers_count" -> t.getUser.getFollowersCount,
"favorites_count" -> t.getUser.getFavouritesCount,
"protected" -> t.getUser.isProtected,
"user_url" -> t.getUser.getURL,
"name" -> t.getUser.getName,
"time_zone" -> t.getUser.getTimeZone,
"user_lang" -> t.getUser.getLang,
"utc_offset" -> t.getUser.getUtcOffset,
"friends_count" -> t.getUser.getFriendsCount,
"screen_name" -> t.getUser.getScreenName
)
})
tweets.repartition(1).saveAsTextFiles("~/streaming/tweets")
You need to convert tweets, which holds Map[String, Any] records, into a DataFrame to save it as CSV. The reason is simple: an RDD doesn't have a schema, whereas the CSV format requires one. So you have to convert the RDD into a DataFrame, which does have a schema.
There are several ways of doing that. One approach could be to use a case class instead of putting the data into maps:
case class Tweet(text: String, retweetCount: Int, ...) // the name "Tweet" is illustrative
Now, instead of Map(...), you instantiate the case class with the proper parameters.
Finally, convert tweets to a DataFrame using Spark's implicit conversions:
import spark.implicits._
tweets.toDF.write.csv(...) // saves as CSV
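Because tweets is a DStream, toDF and write run per micro-batch inside foreachRDD. A minimal sketch, assuming a hypothetical Tweet case class with only a few of the fields and an existing SparkSession named spark:
import spark.implicits._

// Hypothetical case class covering a subset of the tweet fields
case class Tweet(text: String, retweetCount: Int, screenName: String)

val typedTweets = stream.map(t => Tweet(t.getText, t.getRetweetCount, t.getUser.getScreenName))

typedTweets.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    rdd.toDF()
      .coalesce(1)                  // one CSV part file per batch
      .write
      .mode("append")
      .csv("/tmp/streaming/tweets") // hypothetical output directory
  }
}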
Alternatively, you can convert the Map to a DataFrame using the solution given here.
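As a rough sketch of that alternative, assuming a hypothetical subset of keys and stringifying every value for CSV output:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical subset of the map keys; the real maps hold mixed value types
val columns = Seq("text", "screen_name", "created_at")
val schema = StructType(columns.map(c => StructField(c, StringType, nullable = true)))

tweets.foreachRDD { rdd =>
  val rows = rdd.map { m =>
    Row(columns.map(c => Option(m.getOrElse(c, null)).map(_.toString).orNull): _*)
  }
  spark.createDataFrame(rows, schema)
    .write.mode("append").csv("/tmp/streaming/tweets") // hypothetical output directory
}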
I'm trying to do a fairly straightforward join on two tables, nothing complicated. Load both tables, do a join, and update columns, but it keeps throwing an exception.
I noticed the task is stuck on the last partition (199/200) and eventually crashes. My suspicion is that the data is skewed, causing all the data to be loaded into the last partition (199).
SELECT COUNT(DISTINCT report_audit) FROM ReportDs returns 1.5 million, while SELECT COUNT(*) FROM ReportDs returns 57 million.
Cluster details: CPU: 40 cores, memory: 160 GB.
Here is my sample code:
...
def main(args: Array[String]) {
val log = LogManager.getRootLogger
log.setLevel(Level.INFO)
val conf = new SparkConf().setAppName("ExampleJob")
//.setMaster("local[*]")
//.set("spark.sql.shuffle.partitions", "3000")
//.set("spark.sql.crossJoin.enabled", "true")
.set("spark.storage.memoryFraction", "0.02")
.set("spark.shuffle.memoryFraction", "0.8")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.default.parallelism", (CPU * 3).toString)
val sparkSession = SparkSession.builder()
.config(conf)
.getOrCreate()
val reportOpts = Map(
"url" -> s"jdbc:postgresql://$DB_HOST:$DB_PORT/$DATABASE",
"driver" -> "org.postgresql.Driver",
"dbtable" -> "REPORT_TBL",
"user" -> DB_USER,
"password"-> DB_PASSWORD,
"partitionColumn" -> RPT_NUM_PARTITION,
"lowerBound" -> RPT_LOWER_BOUND,
"upperBound" -> RPT_UPPER_BOUND,
"numPartitions" -> "200"
)
val accountOpts = Map(
"url" -> s"jdbc:postgresql://$DB_HOST:$DB_PORT/$DATABASE",
"driver" -> "org.postgresql.Driver",
"dbtable" -> ACCOUNT_TBL,
"user" -> DB_USER,
"password"-> DB_PASSWORD,
"partitionColumn" -> ACCT_NUM_PARTITION,
"lowerBound" -> ACCT_LOWER_BOUND,
"upperBound" -> ACCT_UPPER_BOUND,
"numPartitions" -> "200"
)
val sc = sparkSession.sparkContext;
import sparkSession.implicits._
val reportDs = sparkSession.read.format("jdbc").options(reportOpts).load.cache().alias("a")
val accountDs = sparkSession.read.format("jdbc").options(accountOpts).load.cache().alias("c")
val reportData = reportDs.join(accountDs, reportDs("report_audit") === accountDs("reference_id"))
.withColumn("report_name", when($"report_id" === "xxxx-xxx-asd", $"report_id_ref_1")
.when($"report_id" === "demoasd-asdad-asda", $"report_id_ref_2")
.otherwise($"report_id_ref_1" + ":" + $"report_id_ref_2"))
.withColumn("report_version", when($"report_id" === "xxxx-xxx-asd", $"report_version_1")
.when($"report_id" === "demoasd-asdad-asda", $"report_version_2")
.otherwise($"report_version_3"))
.withColumn("status", when($"report_id" === "xxxx-xxx-asd", $"report_status")
.when($"report_id" === "demoasd-asdad-asda", $"report_status_1")
.otherwise($"report_id"))
.select("...")
val prop = new Properties()
prop.setProperty("user", DB_USER)
prop.setProperty("password", DB_PASSWORD)
prop.setProperty("driver", "org.postgresql.Driver")
reportData.write
.mode(SaveMode.Append)
.jdbc(s"jdbc:postgresql://$DB_HOST:$DB_PORT/$DATABASE", "cust_report_data", prop)
sparkSession.stop()
I think there should be an elegant way to handle this sort of data skewness.
Your values for partitionColumn, upperBound, and lowerBound could cause this exact behavior if they aren't set correctly. For instance, if lowerBound == upperBound, then all of the data would be loaded into a single partition, regardless of numPartitions.
The combination of these attributes determines which (or how many) records get loaded into your DataFrame partitions from your SQL database.
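For example, a sketch with a hypothetical numeric partition column and bounds: if that column ranges roughly from 1 to 57,000,000, options like these let Spark split the read into 200 non-overlapping ranges instead of piling everything into one partition.
val reportOpts = Map(
  "url" -> s"jdbc:postgresql://$DB_HOST:$DB_PORT/$DATABASE",
  "driver" -> "org.postgresql.Driver",
  "dbtable" -> "REPORT_TBL",
  "user" -> DB_USER,
  "password" -> DB_PASSWORD,
  "partitionColumn" -> "report_num", // hypothetical numeric column of REPORT_TBL
  "lowerBound" -> "1",               // roughly MIN(report_num)
  "upperBound" -> "57000000",        // roughly MAX(report_num)
  "numPartitions" -> "200"           // Spark builds 200 range predicates on report_num
)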