How to write Spark DataFrames to a Postgres DB - postgresql

I use Spark 1.3.0.
Let's say I have a DataFrame in Spark that I need to store to a Postgres DB (postgresql-9.2.18-1-linux-x64) on a 64-bit Ubuntu machine.
I use postgresql9.2jdbc41.jar as the driver to connect to Postgres.
I was able to read data from the Postgres DB using the commands below:
import org.postgresql.Driver

val url = "jdbc:postgresql://localhost/postgres?user=user&password=pwd"
val driver = "org.postgresql.Driver"

val users = sqlContext.load("jdbc", Map(
  "url" -> url,
  "driver" -> driver,
  "dbtable" -> "cdimemployee",
  "partitionColumn" -> "intempdimkey",
  "lowerBound" -> "0",
  "upperBound" -> "500",
  "numPartitions" -> "50"
))

val get_all_emp = users.select("*")
val empDF = get_all_emp.toDF
get_all_emp.foreach(println)
I want to write this DataFrame back to Postgres after some processing.
Is the code below right?
empDF.write.jdbc("jdbc:postgresql://localhost/postgres", "test", Map("user" -> "user", "password" -> "pwd"))
Any pointers (Scala) would be helpful.

You should follow the code below.
import java.util.Properties
import org.apache.spark.sql.SaveMode

val database = jobConfig.getString("database")
val url: String = s"jdbc:postgresql://localhost/$database"
val tableName: String = jobConfig.getString("tableName")
val user: String = jobConfig.getString("user")
val password: String = jobConfig.getString("password")
val sql = jobConfig.getString("sql")

// use the SQLContext here (sc is the SparkContext and has no sql method)
val df = sqlContext.sql(sql)

val properties = new Properties()
properties.setProperty("user", user)
properties.setProperty("password", password)
properties.setProperty("driver", "org.postgresql.Driver")

df.write.mode(SaveMode.Overwrite).jdbc(url, tableName, properties)
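A note on versions: the question mentions Spark 1.3.0, but the DataFrameWriter API used above (df.write...) only arrived in Spark 1.4.0. On 1.3.x the JDBC write methods live on the DataFrame itself; a minimal sketch, reusing url and empDF from the question (these 1.3-era methods were deprecated in 1.4 and later removed, so verify against your exact version):
// Spark 1.3.x only
empDF.createJDBCTable(url, "test", false)  // create table "test" and write the rows
// or, if the table already exists:
empDF.insertIntoJDBC(url, "test", false)   // append; pass true to overwrite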

Related

Spark unit test -- Mock Azure SQLJDBC connection

I want to unit test the piece of code below so that I can get good code coverage. I am using FunSuite with Mockito. Can you please let me know how I can mock the database connection and do the unit testing?
def getSummaryConfig(): Config = {
  Config(Map(
    "url" -> configUtil.getProperty("azure.host.name"),
    "databaseName" -> configUtil.getProperty("azure.database.name"),
    "dbTable" -> configUtil.getProperty("azure.summary.table"),
    "user" -> configUtil.getProperty("azure.user.name"),
    "password" -> configUtil.getProperty("azure.database.password")
  ))
}

def getSummaryDF(summaryConfig: Config): DataFrame = {
  val summaryDF = spark.read.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver").sqlDB(summaryConfig)
  summaryDF
}

val summaryConfig = getSummaryConfig()
val summaryDF = getSummaryDF(summaryConfig)
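No answer is shown here, but a common way to make this testable without a real Azure SQL connection is to hide the sqlDB read behind a small trait and substitute a hand-written stub in the FunSuite test (often simpler than mocking a DataFrame with Mockito). A hedged sketch; the names SummaryReader, AzureSummaryReader and SummaryJob are hypothetical, not part of the question:
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.scalatest.FunSuite

// Production code depends on this trait instead of calling sqlDB directly.
trait SummaryReader {
  def readSummary(): DataFrame
}

// Real implementation, used outside of tests.
class AzureSummaryReader(spark: SparkSession, config: Config) extends SummaryReader {
  override def readSummary(): DataFrame =
    spark.read.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver").sqlDB(config)
}

// The logic you want coverage for takes the trait as a dependency.
class SummaryJob(reader: SummaryReader) {
  def summaryCount(): Long = reader.readSummary().count()
}

class SummaryJobSuite extends FunSuite {
  test("summaryCount counts the rows returned by the reader") {
    val spark = SparkSession.builder().master("local[1]").appName("test").getOrCreate()
    import spark.implicits._
    // The stub returns a locally built DataFrame, so no database connection is opened.
    val stub = new SummaryReader {
      override def readSummary(): DataFrame = Seq((1, "a"), (2, "b")).toDF("id", "name")
    }
    assert(new SummaryJob(stub).summaryCount() == 2)
  }
}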

How to call remote SQL function inside PySpark or Scala Databricks notebook

I am writing a Databricks Scala/Python notebook which connects to a SQL Server database,
and I want to execute a SQL Server function from the notebook with custom parameters.
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._

val ID = "1"
val name = "A"

val config = Config(Map(
  "url" -> "sample-p-vm.all.test.azure.com",
  "databaseName" -> "DBsample",
  "dbTable" -> "dbo.FN_cal_udf",
  "user" -> "useer567",
  "password" -> "pppp#345%",
  "connectTimeout" -> "5", // seconds
  "queryTimeout" -> "5"    // seconds
))

val collection = sqlContext.read.sqlDB(config)
collection.show()
Here the function is FN_cal_udf, which is stored in the SQL Server database 'DBsample'.
I got this error:
jdbc.SQLServerException: Parameters were not supplied for the function
How can I pass parameters and call the SQL function inside the notebook in Scala or PySpark?
Here you can first build a query string that holds the function call with its dynamic parameters, and then use it in the config.
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._

val ID = "1"
val name = "A"

// Build the function call with the dynamic parameters
// (assumes FN_cal_udf is a table-valued function, so it can be selected from).
val query = s"SELECT * FROM [dbo].[FN_cal_udf]('$ID','$name')"

val config = Config(Map(
  "url" -> "sample-p-vm.all.test.azure.com",
  "databaseName" -> "DBsample",
  "queryCustom" -> query, // run the query instead of reading a whole table
  "user" -> "useer567",
  "password" -> "pppp#345%",
  "connectTimeout" -> "5", // seconds
  "queryTimeout" -> "5"    // seconds
))

val collection = sqlContext.read.sqlDB(config)
collection.show()
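If you would rather not depend on the azure-sqldb-spark connector, the same call can go through Spark's plain JDBC source by wrapping the function call in a derived table. A small sketch under the same assumption that FN_cal_udf is a table-valued function; the exact JDBC URL format and the alias fn_result are illustrative:
val ID = "1"
val name = "A"

val fnDF = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://sample-p-vm.all.test.azure.com;databaseName=DBsample")
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  // wrap the function call in a subquery so the JDBC source can treat it as a table
  .option("dbtable", s"(SELECT * FROM dbo.FN_cal_udf('$ID','$name')) AS fn_result")
  .option("user", "useer567")
  .option("password", "pppp#345%")
  .load()

fnDF.show()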

Add and delete directly from PostgreSQL from Spark and Scala

I would like to compare the size of two DataFrames that have been extracted from Oracle and PostgreSQL databases. I would like to compare them, then either add new rows or delete rows. How does one directly add to or delete from PostgreSQL? Here is what I did:
System.setProperty("hadoop.home.dir", "C:\\hadoop")
System.setProperty("spark.sql.warehouse.dir", "file:///C:/spark-warehouse")

val sparkSession = SparkSession.builder.master("local").appName("spark session example").getOrCreate()
val spark = sparkSession.sqlContext

// connect to table TMP_STRUCTURE in Oracle
val df = spark.load("jdbc",
  Map("url" -> "jdbc:oracle:thin:System/maher#//localhost:1521/XE",
    "dbtable" -> "IPTECH.TMP_STRUCTURE"))

import sparkSession.implicits._
val usedGold = df.filter(length($"CODE") === 2) // keep rows whose CODE has length 2

val article_groups = spark.load("jdbc", Map(
  "url" -> "jdbc:postgresql://localhost:5432/gemodb?user=postgres&password=maher",
  "dbtable" -> "article_groups")).select("id", "name")

val usedArticleGroup = article_groups.select($"*", $"id".cast(StringType) as "newId") // cast column id to String
val usedPostg = usedArticleGroup.select("newId", "name")

// val df3 = usedGold.join(usedPostg, $"code" === $"newId", "outer")

// get the differing rows
val differentData = usedGold.except(usedPostg).toDF("code", "name")

if (usedGold.count > usedPostg.count) {
  // insert into usedPostg values(differentData("code"), differentData("name"))
} else if (usedGold.count < usedPostg.count) {
  // delete from usedPostg where newId = differentData("code") in postgresql
}
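No answer is shown for this one, but a common pattern is to append the missing rows with the DataFrameWriter in Append mode and to issue deletes through a plain java.sql connection, since Spark's JDBC source cannot delete rows. A rough sketch mirroring the two if/else branches above, under the assumptions that differentData holds the rows to reconcile and that its columns can be lined up with article_groups; matching on CAST(id AS text) mirrors the newId cast:
import java.sql.DriverManager
import java.util.Properties
import org.apache.spark.sql.{Row, SaveMode}

val pgUrl = "jdbc:postgresql://localhost:5432/gemodb"
val props = new Properties()
props.setProperty("user", "postgres")
props.setProperty("password", "maher")
props.setProperty("driver", "org.postgresql.Driver")

// Branch 1: rows missing in Postgres -> append them
// (column names and types must match the target table; rename/cast as needed).
differentData.toDF("id", "name").write.mode(SaveMode.Append).jdbc(pgUrl, "article_groups", props)

// Branch 2: rows that should no longer exist in Postgres -> delete them partition by partition.
differentData.select("code").foreachPartition { rows: Iterator[Row] =>
  val conn = DriverManager.getConnection(pgUrl, "postgres", "maher")
  val stmt = conn.prepareStatement("DELETE FROM article_groups WHERE CAST(id AS text) = ?")
  rows.foreach { r =>
    stmt.setString(1, r.getString(0))
    stmt.addBatch()
  }
  stmt.executeBatch()
  stmt.close()
  conn.close()
}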

Get multiple columns from database?

I've been using the following code to read a column from a database table.
val result = sqlContext.read.format("jdbc").options(Map(
  "driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
  "url" -> jdbcSqlConn,
  "dbtable" -> s"...."
)).load()
  .select("column1") // Now I need to select("col1", "col2", "col3")
  .as[Int]
Now I need to get multiple columns from the database table, and I want the result to be strongly typed (Dataset?). How should the code be written?
This should do the trick:
val colNames = Seq("column1", "col1", "col2", ....., "coln")

val result = sqlContext.read.format("jdbc").options(Map(
  "driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
  "url" -> jdbcSqlConn,
  "dbtable" -> s"...."
)).load().select(colNames.head, colNames.tail: _*)

val newResult = result.withColumn("column1New", result("column1").cast(IntegerType))
  .drop("column1").withColumnRenamed("column1New", "column1")

Apache Spark join with dynamic re-partitioning

I'm trying to do a fairly straightforward join on two tables, nothing complicated: load both tables, join them, and update some columns, but it keeps throwing an exception.
I noticed the task gets stuck on the last partition (199/200) and eventually crashes. My suspicion is that the data is skewed, causing all of it to be loaded into the last partition (199).
SELECT COUNT(DISTINCT report_audit) FROM ReportDs returns 1.5 million, while
SELECT COUNT(*) FROM ReportDs returns 57 million.
Cluster details: CPU: 40 cores, Memory: 160G.
Here is my sample code:
...
def main(args: Array[String]) {
  val log = LogManager.getRootLogger
  log.setLevel(Level.INFO)

  val conf = new SparkConf().setAppName("ExampleJob")
    //.setMaster("local[*]")
    //.set("spark.sql.shuffle.partitions", "3000")
    //.set("spark.sql.crossJoin.enabled", "true")
    .set("spark.storage.memoryFraction", "0.02")
    .set("spark.shuffle.memoryFraction", "0.8")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.default.parallelism", (CPU * 3).toString)

  val sparkSession = SparkSession.builder()
    .config(conf)
    .getOrCreate()

  val reportOpts = Map(
    "url" -> s"jdbc:postgresql://$DB_HOST:$DB_PORT/$DATABASE",
    "driver" -> "org.postgresql.Driver",
    "dbtable" -> "REPORT_TBL",
    "user" -> DB_USER,
    "password" -> DB_PASSWORD,
    "partitionColumn" -> RPT_NUM_PARTITION,
    "lowerBound" -> RPT_LOWER_BOUND,
    "upperBound" -> RPT_UPPER_BOUND,
    "numPartitions" -> "200"
  )

  val accountOpts = Map(
    "url" -> s"jdbc:postgresql://$DB_HOST:$DB_PORT/$DATABASE",
    "driver" -> "org.postgresql.Driver",
    "dbtable" -> ACCOUNT_TBL,
    "user" -> DB_USER,
    "password" -> DB_PASSWORD,
    "partitionColumn" -> ACCT_NUM_PARTITION,
    "lowerBound" -> ACCT_LOWER_BOUND,
    "upperBound" -> ACCT_UPPER_BOUND,
    "numPartitions" -> "200"
  )

  val sc = sparkSession.sparkContext
  import sparkSession.implicits._

  val reportDs = sparkSession.read.format("jdbc").options(reportOpts).load.cache().alias("a")
  val accountDs = sparkSession.read.format("jdbc").options(accountOpts).load.cache().alias("c")

  val reportData = reportDs.join(accountDs, reportDs("report_audit") === accountDs("reference_id"))
    .withColumn("report_name", when($"report_id" === "xxxx-xxx-asd", $"report_id_ref_1")
      .when($"report_id" === "demoasd-asdad-asda", $"report_id_ref_2")
      .otherwise($"report_id_ref_1" + ":" + $"report_id_ref_2"))
    .withColumn("report_version", when($"report_id" === "xxxx-xxx-asd", $"report_version_1")
      .when($"report_id" === "demoasd-asdad-asda", $"report_version_2")
      .otherwise($"report_version_3"))
    .withColumn("status", when($"report_id" === "xxxx-xxx-asd", $"report_status")
      .when($"report_id" === "demoasd-asdad-asda", $"report_status_1")
      .otherwise($"report_id"))
    .select("...")

  val prop = new Properties()
  prop.setProperty("user", DB_USER)
  prop.setProperty("password", DB_PASSWORD)
  prop.setProperty("driver", "org.postgresql.Driver")

  reportData.write
    .mode(SaveMode.Append)
    .jdbc(s"jdbc:postgresql://$DB_HOST:$DB_PORT/$DATABASE", "cust_report_data", prop)

  sparkSession.stop()
}
I think there should be an elegant way to handle this sort of data skewness.
Your values for partitionColumn, upperBound, and lowerBound could cause this exact behavior if they aren't set correctly. For instance, if lowerBound == upperBound, then all of the data would be loaded into a single partition, regardless of numPartitions.
The combination of these attributes determines which (or how many) records get loaded into your DataFrame partitions from your SQL database.
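To make that concrete, here is a hedged sketch of how the report read options would normally be set; the column report_id_num and the bound values are made up. Spark splits the range [lowerBound, upperBound] of partitionColumn into numPartitions stride-sized WHERE clauses, and rows falling outside that range all end up in the first or last partition, which is one way to end up with a single giant partition 199:
// Illustrative only: partitionColumn must be a numeric column that is reasonably
// evenly distributed; lowerBound/upperBound should approximate its MIN and MAX.
val reportOpts = Map(
  "url" -> s"jdbc:postgresql://$DB_HOST:$DB_PORT/$DATABASE",
  "driver" -> "org.postgresql.Driver",
  "dbtable" -> "REPORT_TBL",
  "user" -> DB_USER,
  "password" -> DB_PASSWORD,
  "partitionColumn" -> "report_id_num", // hypothetical numeric key column
  "lowerBound" -> "1",                  // ~ MIN(report_id_num)
  "upperBound" -> "57000000",           // ~ MAX(report_id_num)
  "numPartitions" -> "200"
)
val reportDs = sparkSession.read.format("jdbc").options(reportOpts).load()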