Apache Spark join with dynamic re-partitioning - Scala

I'm trying to do a fairly straightforward join on two tables, nothing complicated: load both tables, join, and update some columns, but it keeps throwing an exception.
I noticed the job gets stuck on the last partition (199/200) and eventually crashes. My suspicion is that the data is skewed, causing most of it to land in partition 199.
SELECT COUNT(DISTINCT report_audit) FROM ReportDs returns ~1.5 million, while
SELECT COUNT(*) FROM ReportDs returns ~57 million.
Cluster details: CPU: 40 cores, Memory: 160G.
Here is my sample code:
...
def main(args: Array[String]) {
  val log = LogManager.getRootLogger
  log.setLevel(Level.INFO)

  val conf = new SparkConf().setAppName("ExampleJob")
    //.setMaster("local[*]")
    //.set("spark.sql.shuffle.partitions", "3000")
    //.set("spark.sql.crossJoin.enabled", "true")
    .set("spark.storage.memoryFraction", "0.02")
    .set("spark.shuffle.memoryFraction", "0.8")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.default.parallelism", (CPU * 3).toString)

  val sparkSession = SparkSession.builder()
    .config(conf)
    .getOrCreate()

  val reportOpts = Map(
    "url" -> s"jdbc:postgresql://$DB_HOST:$DB_PORT/$DATABASE",
    "driver" -> "org.postgresql.Driver",
    "dbtable" -> "REPORT_TBL",
    "user" -> DB_USER,
    "password" -> DB_PASSWORD,
    "partitionColumn" -> RPT_NUM_PARTITION,
    "lowerBound" -> RPT_LOWER_BOUND,
    "upperBound" -> RPT_UPPER_BOUND,
    "numPartitions" -> "200"
  )

  val accountOpts = Map(
    "url" -> s"jdbc:postgresql://$DB_HOST:$DB_PORT/$DATABASE",
    "driver" -> "org.postgresql.Driver",
    "dbtable" -> ACCOUNT_TBL,
    "user" -> DB_USER,
    "password" -> DB_PASSWORD,
    "partitionColumn" -> ACCT_NUM_PARTITION,
    "lowerBound" -> ACCT_LOWER_BOUND,
    "upperBound" -> ACCT_UPPER_BOUND,
    "numPartitions" -> "200"
  )

  val sc = sparkSession.sparkContext
  import sparkSession.implicits._

  val reportDs = sparkSession.read.format("jdbc").options(reportOpts).load.cache().alias("a")
  val accountDs = sparkSession.read.format("jdbc").options(accountOpts).load.cache().alias("c")

  val reportData = reportDs.join(accountDs, reportDs("report_audit") === accountDs("reference_id"))
    .withColumn("report_name", when($"report_id" === "xxxx-xxx-asd", $"report_id_ref_1")
      .when($"report_id" === "demoasd-asdad-asda", $"report_id_ref_2")
      // Column.+ is numeric addition; use concat for string concatenation
      .otherwise(concat($"report_id_ref_1", lit(":"), $"report_id_ref_2")))
    .withColumn("report_version", when($"report_id" === "xxxx-xxx-asd", $"report_version_1")
      .when($"report_id" === "demoasd-asdad-asda", $"report_version_2")
      .otherwise($"report_version_3"))
    .withColumn("status", when($"report_id" === "xxxx-xxx-asd", $"report_status")
      .when($"report_id" === "demoasd-asdad-asda", $"report_status_1")
      .otherwise($"report_id"))
    .select("...")

  val prop = new Properties()
  prop.setProperty("user", DB_USER)
  prop.setProperty("password", DB_PASSWORD)
  prop.setProperty("driver", "org.postgresql.Driver")

  reportData.write
    .mode(SaveMode.Append)
    .jdbc(s"jdbc:postgresql://$DB_HOST:$DB_PORT/$DATABASE", "cust_report_data", prop)

  sparkSession.stop()
}
I think there should be an elegant way to handle this sort of data skewness.

Your values for partitionColumn, upperBound, and lowerBound could cause this exact behavior if they aren't set correctly. For instance, if lowerBound == upperBound, then all of the data would be loaded into a single partition, regardless of numPartitions.
The combination of these attributes determines which (or how many) records get loaded into your DataFrame partitions from your SQL database.
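If the real min/max of the partition column are unknown up front, one option is to query them first and feed the result into the JDBC options. A minimal sketch, reusing reportOpts from the question; the report_id_seq column is hypothetical, so substitute the actual numeric partition column:

// Sketch: derive lowerBound/upperBound from the table itself so the 200
// JDBC read partitions cover the real key range. report_id_seq is hypothetical.
val bounds = sparkSession.read.format("jdbc")
  .options(reportOpts - "partitionColumn" - "lowerBound" - "upperBound" - "numPartitions")
  .option("dbtable", "(SELECT MIN(report_id_seq) AS lo, MAX(report_id_seq) AS hi FROM REPORT_TBL) AS b")
  .load()
  .first()

val balancedOpts = reportOpts ++ Map(
  "partitionColumn" -> "report_id_seq",
  "lowerBound"      -> bounds.get(0).toString, // assumes an integral column
  "upperBound"      -> bounds.get(1).toString,
  "numPartitions"   -> "200"
)
val reportDs = sparkSession.read.format("jdbc").options(balancedOpts).load()

If the read partitions are balanced but the join stage itself still hangs on one shuffle partition, the skew is in the join key (report_audit), and the usual mitigation there is salting the hot keys before the join.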

Related

How to create a Spark SQL Dataframe with list of Map objects

I have multiple Map[String, String] in a List (Scala). For example:
val map1 = Map("EMP_NAME" -> "Ahmad", "DOB" -> "01-10-1991", "CITY" -> "Dubai")
val map2 = Map("EMP_NAME" -> "Rahul", "DOB" -> "06-12-1991", "CITY" -> "Mumbai")
val map3 = Map("EMP_NAME" -> "John", "DOB" -> "11-04-1996", "CITY" -> "Toronto")
val list = List(map1, map2, map3)
Now I want to create a single dataframe with something like this:
EMP_NAME DOB CITY
Ahmad 01-10-1991 Dubai
Rahul 06-12-1991 Mumbai
John 11-04-1996 Toronto
How do I achieve this?
You can do it like this:
import spark.implicits._
val df = list
  .map(m => (m.get("EMP_NAME"), m.get("DOB"), m.get("CITY")))
  .toDF("EMP_NAME", "DOB", "CITY")
df.show()
+--------+----------+-------+
|EMP_NAME| DOB| CITY|
+--------+----------+-------+
| Ahmad|01-10-1991| Dubai|
| Rahul|06-12-1991| Mumbai|
| John|11-04-1996|Toronto|
+--------+----------+-------+
A slightly more generic approach, e.g.:
val map1 = Map("EMP_NAME" -> "Ahmad", "DOB" -> "01-10-1991", "CITY" -> "Dubai")
val map2 = Map("EMP_NAME" -> "John", "DOB" -> "01-10-1992", "CITY" -> "Mumbai")
///...
val list = List(map1, map2) // map3, ...
val RDDmap = sc.parallelize(list)

// Get cols dynamically
val cols = RDDmap.take(1).flatMap(x => x.keys)

// Map is K,V like per Map entry
val df = RDDmap.map { m =>
  val vals = m.values.toList
  (vals(0), vals(1), vals(2))
}.toDF(cols: _*) // dynamic column names assigned

df.show(false)
returns:
+--------+----------+------+
|EMP_NAME|DOB |CITY |
+--------+----------+------+
|Ahmad |01-10-1991|Dubai |
|John |01-10-1992|Mumbai|
+--------+----------+------+
Or, to answer your sub-question, as follows - at least I think this is what you are asking:
val RDDmap = sc.parallelize(List(
  Map("EMP_NAME" -> "Ahmad", "DOB" -> "01-10-1991", "CITY" -> "Dubai"),
  Map("EMP_NAME" -> "John", "DOB" -> "01-10-1992", "CITY" -> "Mumbai")))
...
// Get cols dynamically
val cols = RDDmap.take(1).flatMap(x => x.keys)

// Map is K,V like per Map entry
val df = RDDmap.map { m =>
  val vals = m.values.toList
  (vals(0), vals(1), vals(2))
}.toDF(cols: _*) // dynamic column names assigned
You can build a list dynamically of course, but you still need to assign the Map elements. See Appending Data to List or any other collection Dynamically in scala. I would just read in from a file and be done with it.
import org.apache.spark.SparkContext
import org.apache.spark.sql._
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object DataFrameTest2 extends Serializable {
  var sparkSession: SparkSession = _
  var sparkContext: SparkContext = _
  var sqlContext: SQLContext = _

  def main(args: Array[String]): Unit = {
    sparkSession = SparkSession.builder().appName("TestMaster").master("local").getOrCreate()
    sparkContext = sparkSession.sparkContext
    sqlContext = new org.apache.spark.sql.SQLContext(sparkContext)

    val map1 = Map("EMP_NAME" -> "Ahmad", "DOB" -> "01-10-1991", "CITY" -> "Dubai")
    val map2 = Map("EMP_NAME" -> "Rahul", "DOB" -> "06-12-1991", "CITY" -> "Mumbai")
    val map3 = Map("EMP_NAME" -> "John", "DOB" -> "11-04-1996", "CITY" -> "Toronto")
    val list = List(map1, map2, map3)

    // create your rows (relies on all Maps sharing the same key order)
    val rows = list.map(m => Row(m.values.toSeq: _*))

    // create the schema from the header
    val header = list.head.keys.toList
    val schema = StructType(header.map(fieldName => StructField(fieldName, StringType, true)))

    // create your rdd
    val rdd = sparkContext.parallelize(rows)

    // create your dataframe using rdd
    val df = sparkSession.createDataFrame(rdd, schema)
    df.show()
  }
}
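All of the Row-based variants above rely on every Map sharing the same key order, which Scala's small immutable Maps (up to four entries) happen to preserve but larger ones do not. A hedged sketch that instead looks each value up by key, reusing list, sparkContext, and sparkSession from the example above:

// Sketch: build rows by key lookup so Map ordering does not matter.
// Missing keys become nulls rather than silently shifting columns.
val header = list.flatMap(_.keys).distinct // union of all keys seen
val rows = list.map(m => Row(header.map(k => m.getOrElse(k, null)): _*))
val schema = StructType(header.map(name => StructField(name, StringType, nullable = true)))
val df = sparkSession.createDataFrame(sparkContext.parallelize(rows), schema)
df.show()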

Change forEachRDD

I am having a problem; we are using Kafka and Spark.
We are using foreachRDD like this:
messages.foreachRDD { rdd =>
  val newRDD = rdd.map { message =>
    processMessage(message)
  }
  println(newRDD.count())
}
The problem is processMessage(message): this method calls a class that creates the SparkContext. From what I have read, it will throw an error if you create the SparkContext inside the foreachRDD.
I have changed it like this:
messages.map {
  case (msg) =>
    val newRDD3 = processMessage(msg)
    newRDD3
}
but I am not sure if this does the same thing as the foreachRDD.
Could you please help me with this?
Any help would be really appreciated.
Use SparkSession:
SparkConf conf = new SparkConf()
        .setAppName("appName")
        .setMaster("local");
SparkSession sparkSession = SparkSession
        .builder()
        .config(conf)
        .getOrCreate();
return sparkSession;
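In Scala, the same idea looks like the sketch below: fetch the already-running session with getOrCreate() inside foreachRDD (whose body executes on the driver) instead of constructing a new SparkContext inside processMessage. This is a sketch, not the poster's exact code:

// Sketch: the foreachRDD body runs on the driver, so it can safely reuse the
// existing session; only the rdd.map closure is shipped to the executors.
messages.foreachRDD { rdd =>
  val spark = SparkSession.builder().getOrCreate() // returns the running session
  val values = rdd.map(record => record.value())   // executor-side work stays context-free
  println("Batch size: " + values.count())
}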
I created the StreamingContext, then declared the topics and the KafkaParams.
Finally, I created the messages. Please see the code below:
def main(args: Array[String]) {
  val date_today = new SimpleDateFormat("yyyy_MM_dd")
  val date_today_hour = new SimpleDateFormat("yyyy_MM_dd_HH")
  val PATH_SEPERATOR = "/"

  val conf = ConfigFactory.load("spfin.conf")
  println("kafka.duration --- " + conf.getString("kafka.duration").toLong)

  val mlFeatures: MLFeatures = new MLFeatures()

  // Create context with custom second batch interval
  //val sparkConf = new SparkConf().setAppName("SpFinML")
  //val ssc = new StreamingContext(sparkConf, Seconds(conf.getString("kafka.duration").toLong))
  val ssc = new StreamingContext(mlFeatures.sc, Seconds(conf.getString("kafka.duration").toLong))

  // Create direct kafka stream with brokers and topics
  val topicsSet = conf.getString("kafka.requesttopic").split(",").toSet
  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> conf.getString("kafka.brokers"),
    "zookeeper.connect" -> conf.getString("kafka.zookeeper"),
    "group.id" -> conf.getString("kafka.consumergroups2"),
    "auto.offset.reset" -> conf.getString("kafka.autoOffset"),
    "enable.auto.commit" -> (conf.getString("kafka.autoCommit").toBoolean: java.lang.Boolean),
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "security.protocol" -> "SASL_PLAINTEXT")

  /* this code is to get messages from request topic */
  val messages = KafkaUtils.createDirectStream[String, String](
    ssc,
    LocationStrategies.PreferConsistent,
    ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams))

  messages.foreachRDD { rdd =>
    val newRDD = rdd.map { message =>
      processMessage(message, "Inside processMessage")
    }
    println("Inside map")
    // println(newRDD.count())
  }
}
The processMessage method is below. I think I might need to change this method, right?
def processMessage(message: ConsumerRecord[String, String], msg: String): ConsumerRecord[String, String] = {
  println(msg)
  println("Message processed is " + message.value())
  val req_message = message.value()
  //val tableName = conf.getString("hbase.tableName")
  //println("Hbase table name : " + tableName)
  //val decisionTree_res = "PredictionModelOutput "
  //val decisionTree_res = PriorAuthPredict.processPriorAuthPredict(req_message)
  //println(decisionTree_res)
  //kafkaProducer(conf.getString("kafka.responsetopic"), decisionTree_res)

  // The hardcoded response below is a very long JSON array of sample prediction
  // records; it is truncated here for readability.
  kafkaProducer(conf.getString("kafka.responsetopic"),
    """[{"payorId":53723,"therapyType":"RMIV","ndcNumber":"'66302010201","patientId":9556908,"procedureCode":"J3285","authNbr":"'9427535101","serviceDate":"20161102","serviceBranchId":65,"serviceDuration":30,"placeOfService":12,"charges":22957.55,"daysOrUnits":140,"algorithmType":"LR ,RF ,NB ,VoteResult","label":0.0,"prediction":"0.0 ,0.0 ,0.0 ,0.0","finalPrediction":"Approved","rejectOutcome":"Y", ... }]""")

  //saveToHBase(conf.getString("hbase.tableName"), req_message, decisionTree_res)
  message
}
In case someone is interested in the solution: for messages (an InputDStream) I use foreachRDD, which exposes each batch as an RDD, then use map to get the JSON from each ConsumerRecord. Next, I collect the RDD to an Array[String] and pass it to the processMessage method.
messages.foreachRDD { rdd =>
  val newRDD = rdd.map { message =>
    message.value()
  }
  println("Request messages: " + newRDD.count())
  var resultrows = newRDD.collect() //.collectAsList()
  processMessage(resultrows, mlFeatures: MLFeatures)
}
Inside the processMessage method there is a for loop to process all the strings. We also insert the request and response messages into an HBase table.
def processMessage(message: Array[String], mlFeatures: MLFeatures) = {
  for (j <- 0 until message.size) {
    val req_message = message(j)
    val decisionTree_res = PriorAuthPredict.processPriorAuthPredict(req_message, mlFeatures)
    println("Message processed is " + req_message)
    kafkaProducer(conf.getString("kafka.responsetopic"), decisionTree_res)

    var startTime = new Date().getTime()
    saveToHBase(conf.getString("hbase.tableName"), req_message, decisionTree_res)
    var endTime = new Date().getTime()
    println("Kafka Consumer savetoHBase took : " + (endTime - startTime) / 1000 + " seconds")
  }
}
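One caveat with this solution: collect() pulls the whole batch onto the driver, which is fine for small batches. For larger ones, a foreachPartition variant keeps the processing on the executors; a hedged sketch, assuming the Kafka producer and HBase client can be created per partition:

// Sketch: process records on the executors instead of collecting to the driver.
messages.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // create non-serializable clients (Kafka producer, HBase connection) here,
    // once per partition, then reuse them for every record
    records.foreach { record =>
      val req_message = record.value()
      // ... run the prediction and write the response, as in processMessage
    }
  }
}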

Add and delete directly from PostgreSQL from Spark and Scala

I would like to compare two DataFrames that have been extracted from Oracle and PostgreSQL databases, then either add new rows or delete rows. How does one add or delete rows directly in PostgreSQL? Here is what I did:
System.setProperty("hadoop.home.dir", "C:\\hadoop")
System.setProperty("spark.sql.warehouse.dir", "file:///C:/spark-warehouse")

val sparkSession = SparkSession.builder.master("local").appName("spark session example").getOrCreate()

// connect to table TMP_STRUCTURE in Oracle
val spark = sparkSession.sqlContext
val df = spark.load("jdbc",
  Map("url" -> "jdbc:oracle:thin:System/maher#//localhost:1521/XE",
    "dbtable" -> "IPTECH.TMP_STRUCTURE"))

import sparkSession.implicits._
val usedGold = df.filter(length($"CODE") === 2) // keep rows whose CODE has length 2

val article_groups = spark.load("jdbc", Map(
  "url" -> "jdbc:postgresql://localhost:5432/gemodb?user=postgres&password=maher",
  "dbtable" -> "article_groups")).select("id", "name")

val usedArticleGroup = article_groups.select($"*", $"id".cast(StringType) as "newId") // cast column id to String
val usedPostg = usedArticleGroup.select("newId", "name")

// val df3 = usedGold.join(usedPostg, $"code" === $"newId", "outer")

// get differing rows
val differentData = usedGold.except(usedPostg).toDF("code", "name")
if (usedGold.count > usedPostg.count) {
  // insert into usedPostg values(differentData("code"), differentData("name"))
} else if (usedGold.count < usedPostg.count) {
  // delete from usedPostg where newId = differentData("code") in postgresql
}
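For the insert side, the JDBC writer can append the missing rows; Spark has no DataFrame delete, so the delete side has to go through a plain JDBC connection. A hedged sketch against the question's tables (the except comparisons assume both frames share the same two-column shape, and the id handling is illustrative):

import java.sql.DriverManager
import java.util.Properties
import org.apache.spark.sql.SaveMode

val pgUrl = "jdbc:postgresql://localhost:5432/gemodb?user=postgres&password=maher"

// Rows in Oracle but not in Postgres: append them via the JDBC writer.
val toInsert = usedGold.except(usedPostg).toDF("id", "name")
toInsert.write.mode(SaveMode.Append).jdbc(pgUrl, "article_groups", new Properties())

// Rows in Postgres but not in Oracle: delete them with plain JDBC.
val toDelete = usedPostg.except(usedGold).collect()
val conn = DriverManager.getConnection(pgUrl)
try {
  val stmt = conn.prepareStatement("DELETE FROM article_groups WHERE id = ?")
  toDelete.foreach { row =>
    stmt.setInt(1, row.getString(0).toInt) // newId was cast to String earlier
    stmt.addBatch()
  }
  stmt.executeBatch()
} finally {
  conn.close()
}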

Get multiple columns from database?

I'm using the following code to get a single column from a database table.
val result =
  sqlContext.read.format("jdbc").options(Map(
    "driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "url" -> jdbcSqlConn,
    "dbtable" -> s"...."
  )).load()
    .select("column1") // Now I need to select("col1", "col2", "col3")
    .as[Int]
Now I need to get multiple columns from the database table and I want the result to be strongly typed (DataSet?). How should the code be written?
This should do the trick:
val colNames = Seq("column1", "col1", "col2", ..., "coln")
val result = sqlContext.read.format("jdbc").options(Map(
  "driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
  "url" -> jdbcSqlConn,
  "dbtable" -> s"...."
)).load().select(colNames.head, colNames.tail: _*)

val newResult = result.withColumn("column1New", result("column1").cast(IntegerType))
  .drop("column1").withColumnRenamed("column1New", "column1")
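To get the strongly typed Dataset the question asks about, the usual route is a case class plus .as[...]; a sketch with illustrative column names and types:

// Sketch: a case class gives the result a static type. Names and types are
// illustrative; match them to the real columns in the table.
case class ResultRow(column1: Int, col2: String, col3: String)

import sqlContext.implicits._
val typed = newResult
  .select($"column1".cast("int"), $"col2", $"col3")
  .as[ResultRow]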

How to write spark DataFrames to Postgres DB

I use Spark 1.3.0.
Let's say I have a DataFrame in Spark and I need to store it to a Postgres DB (postgresql-9.2.18-1-linux-x64) on a 64-bit Ubuntu machine.
I use postgresql9.2jdbc41.jar as the driver to connect to Postgres.
I was able to read data from the Postgres DB using the commands below:
import org.postgresql.Driver

val url = "jdbc:postgresql://localhost/postgres?user=user&password=pwd"
val driver = "org.postgresql.Driver"

val users = {
  sqlContext.load("jdbc", Map(
    "url" -> url,
    "driver" -> driver,
    "dbtable" -> "cdimemployee",
    "partitionColumn" -> "intempdimkey",
    "lowerBound" -> "0",
    "upperBound" -> "500",
    "numPartitions" -> "50"
  ))
}

val get_all_emp = users.select("*")
val empDF = get_all_emp.toDF
get_all_emp.foreach(println)
I want to write this DF back to Postgres after some processing.
Is the code below right?
empDF.write.jdbc("jdbc:postgresql://localhost/postgres", "test", Map("user" -> "user", "password" -> "pwd"))
Any pointers (Scala) would be helpful.
You should follow the code below. (Note that the df.write API requires Spark 1.4+; on 1.3.0 the equivalents were DataFrame.createJDBCTable and insertIntoJDBC.)
import java.util.Properties
import org.apache.spark.sql.SaveMode

val database = jobConfig.getString("database")
val url: String = s"jdbc:postgresql://localhost/$database"
val tableName: String = jobConfig.getString("tableName")
val user: String = jobConfig.getString("user")
val password: String = jobConfig.getString("password")
val sql = jobConfig.getString("sql")

val df = sqlContext.sql(sql) // sc.sql in the original; a SparkContext has no sql method

val properties = new Properties()
properties.setProperty("user", user)
properties.setProperty("password", password)
properties.put("driver", "org.postgresql.Driver")

df.write.mode(SaveMode.Overwrite).jdbc(url, tableName, properties)
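One caveat: with the JDBC writer, SaveMode.Overwrite drops and recreates the target table by default (newer Spark versions offer a truncate option to keep the existing schema), so use SaveMode.Append if the table definition must survive.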