I have a Spark (v1.2.1) job that inserts the contents of an RDD into Postgres using org.postgresql.Driver from Scala:
rdd.foreachPartition(iter => {
  // connect to the Postgres database on localhost
  val driver = "org.postgresql.Driver"
  val connection: Connection = DriverManager.getConnection(url, username, password)
  val statement = connection.createStatement()
  try {
    iter.foreach(row => {
      val mapRequest = Utils.getInsertMap(row)
      val query = Utils.getInsertRequest(squares_table, mapRequest)
      try { statement.execute(query) }
      catch {
        case pe: PSQLException => println("exception caught: " + pe)
      }
    })
  } finally {
    // close the per-partition statement and connection
    statement.close()
    connection.close()
  }
})
In the code above I open a new connection to Postgres for each partition of the RDD and then close it. I think the right way to go would be to use a connection pool that I can take connections from (as described here), but that is just pseudo-code:
rdd.foreachPartition { partitionOfRecords =>
  // ConnectionPool is a static, lazily initialized pool of connections
  val connection = ConnectionPool.getConnection()
  partitionOfRecords.foreach(record => connection.send(record))
  ConnectionPool.returnConnection(connection) // return to the pool for future reuse
}
What is the right way to connect to Postgres through a connection pool from Spark?
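One way the pseudo-code above could be made concrete is sketched below. It is an illustration under assumptions, not an established recipe: SimplePool and ExecutorConnectionPool are hypothetical names, the URL and credentials are placeholders, and the pool neither validates nor closes connections. It relies on the fact that a Scala object is initialized once per JVM, so each executor lazily creates its own pool.

```scala
import java.sql.{Connection, DriverManager}
import java.util.concurrent.ConcurrentLinkedQueue

// Hypothetical minimal pool: reuses idle resources and creates a new one only when none is idle.
final class SimplePool[A](create: () => A) {
  private val idle = new ConcurrentLinkedQueue[A]()

  // poll() returns null when the queue is empty, in which case a new resource is created.
  def borrow(): A = Option(idle.poll()).getOrElse(create())

  def giveBack(resource: A): Unit = { idle.offer(resource); () }

  // Borrow a resource, run the body, and hand the resource back even if the body throws.
  def withResource[B](body: A => B): B = {
    val resource = borrow()
    try body(resource) finally giveBack(resource)
  }
}

// A Scala object is initialized once per JVM, so each executor builds its own pool on first use.
object ExecutorConnectionPool {
  // Placeholder settings (assumptions); replace with your own.
  private val url = "jdbc:postgresql://localhost:5432/mydb"
  private val user = "username"
  private val password = "password"

  lazy val pool: SimplePool[Connection] =
    new SimplePool[Connection](() => DriverManager.getConnection(url, user, password))
}

// Intended use inside the job (not executed here):
// rdd.foreachPartition { rows =>
//   ExecutorConnectionPool.pool.withResource { connection =>
//     val statement = connection.createStatement()
//     try rows.foreach(row =>
//       statement.execute(Utils.getInsertRequest(squares_table, Utils.getInsertMap(row))))
//     finally statement.close()
//   }
// }
```

A dedicated pooling library such as HikariCP additionally handles connection validation, sizing and shutdown, which this sketch deliberately leaves out.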
This code works for Spark 2 or greater with Scala. First you have to add the JDBC driver as a dependency.
If you are using Maven, add this setting to your pom file:
Then write this code in a Scala file:
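import org.apache.spark.sql.SparkSession
object PostgresConnection {
  def main(args: Array[String]): Unit = {
    // the builder settings were cut off in the original post; the app name is a placeholder
    val spark = SparkSession.builder().appName("PostgresConnection").getOrCreate()
    val prop = new java.util.Properties
    prop.setProperty("user", "username")
    prop.setProperty("password", "password")
    // complete the URL with your host, port and database
    val url = "jdbc:postgresql://"
    // load the table into a DataFrame over JDBC
    val df = spark.read.jdbc(url, "table_name", prop)
  }
}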
I have sample Spark code in which I am trying to read table names from the Spark configuration supplied through the spark_conf option, using the Typesafe application.conf and the Spark conf in the Databricks UI.
The code I am using is below.
When I hit the Run button in the Databricks UI, the job finishes successfully, but println prints dummyValue instead of ThisIsTableAOne, ThisIsTableBOne...
I can see in the Spark UI that the table-name configurations are passed to the Spark job, but these values are not reflected in the code.
try {
  val inputConfig = AppConfig.getConfig("input")
  val outputConfig = AppConfig.getConfig("output")
  val tableA = inputConfig.getString("tableA")
  val tableB = inputConfig.getString("tableB")
  val tableC = outputConfig.getString("tableC")
  val userDataInTable = sparkSession.createDataFrame(Seq(
    (1, "dummy", "dummy", "dummy")
  )).toDF("id", "col2", "col3", "col4")
  println("Completed Entry ")
} catch {
  case e: Exception =>
    // handler body not shown in the original post
}
//application.conf contains below text,
import com.typesafe.config.{Config, ConfigFactory}
// AppConfig loads application.conf and returns the sub-config of one module
object AppConfig {
  val appConfig = ConfigFactory.load("application.conf")
  def getConfig(moduleName: String): Config = {
    val config = appConfig.getConfig(moduleName)
    config
  }
}
I'm using Azure Databricks and want to push down a query to Azure SQL using PySpark. I've tried many ways and found a solution using Scala (code below), but with it I need to convert part of my code to Scala and then switch back to PySpark.
import java.util.Properties
import java.sql.DriverManager
val jdbcUsername = username
val jdbcPassword = password
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = "entire-string-connection-to-Azure-SQL"
// Create a Properties() object to hold the parameters.
val connectionProperties = new Properties()
connectionProperties.put("user", jdbcUsername)
connectionProperties.put("password", jdbcPassword)
connectionProperties.setProperty("Driver", driverClass)
val connection = DriverManager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
val stmt = connection.createStatement()
val sql = "TRUNCATE TABLE dbo.table"
// run the statement on the database, then release the connection
stmt.execute(sql)
connection.close()
Is there a way to push down DML using PySpark instead of Scala?
I found something related, but it only works for reading data and for DDL commands:
jdbcUrl = "jdbc:mysql://{0}:{1}/{2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
connectionProperties = {
    "user" : jdbcUsername,
    "password" : jdbcPassword,
    "driver" : "com.mysql.jdbc.Driver"
}
pushdown_query = "(select * from employees where emp_no < 10008) emp_alias"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
You can actually achieve the same thing as the Scala example you provided in Python.
driver_manager = spark._sc._gateway.jvm.java.sql.DriverManager
connection = driver_manager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
query = "YOUR SQL QUERY"
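exec_statement = connection.prepareCall(query)
# run the statement on the database, then release the connection
exec_statement.execute()
connection.close()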
For your case I would try:
driver_manager = spark._sc._gateway.jvm.java.sql.DriverManager
connection = driver_manager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
stmt = connection.createStatement()
sql = "TRUNCATE TABLE dbo.table"
# run the statement on the database, then release the connection
stmt.execute(sql)
connection.close()
I am trying to create a Spark application that can create, read, write and update MySQL data. Is there any way to create a MySQL table using Spark?
Below is Scala JDBC code that creates a table in a MySQL database. How can I do this through Spark?
package SparkMysqlJdbcConnectivity
import org.apache.spark.sql.SparkSession
import java.util.Properties
import java.lang.Class
import java.sql.Connection
import java.sql.DriverManager
object MysqlSparkJdbcProgram {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MysqlJDBC Connections")
      .getOrCreate()
    val driver = "com.mysql.jdbc.Driver"
    val url = "jdbc:mysql://localhost:3306/world"
    val operationtype = "create table"
    val tablename = "country"
    val tablename2 = "state"
    val connectionProperties = new Properties()
    connectionProperties.put("user", "root")
    connectionProperties.put("password", "root")
    val jdbcDf = spark.read.jdbc(url, s"${tablename}", connectionProperties)
    operationtype.trim() match {
      case "create table" => {
        // Class.forName(driver)
        val con: Connection = DriverManager.getConnection(url, connectionProperties)
        // execute() returns false for statements that produce no result set, such as DDL
        val result = con.prepareStatement(s"create table ${tablename2} (name varchar(255), country varchar(255))").execute()
        if (result) println("table creation is unsuccessful") else println("table creation is successful")
        con.close()
      }
      case "read table" => {
        val jdbcDf = spark.read.jdbc("jdbc:mysql://localhost:3306/world", s"${tablename}", connectionProperties)
      }
      case "write table" => {}
      case "drop table" => {}
    }
  }
}
The table is created automatically when you write the jdbcDf DataFrame:
jdbcDf.write
  .jdbc("jdbc:mysql://localhost:3306/world", s"${tablename}", connectionProperties)
If you want to specify the column types of the created table:
jdbcDf.write
  .option("createTableColumnTypes", "name VARCHAR(500), col1 VARCHAR(1024), col3 int")
  .jdbc("jdbc:mysql://localhost:3306/world", s"${tablename}", connectionProperties)
object App {
  def main(args: Array[String]): Unit = {
    val conf = new spark.SparkConf().setMaster("local[2]").setAppName("mySparkApp")
    val sc = new spark.SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val jdbcUrl = ""
    val jdbcUser = "someUser"
    val jdbcPassword = "xxxxxxxxxxxxxxxxxxxx"
    val tableName = "myTable"
    val driver = "org.postgresql.Driver"
    val df = sqlContext
      .read
      .format("jdbc")
      .option("driver", driver)
      .option("url", jdbcUrl)
      .option("userName", jdbcUser)
      .option("password", jdbcPassword)
      .option("dbtable", tableName) // NullPointerException occurs here
      .load()
  }
}
I want to connect to a Postgres database on my LAN from Spark. During runtime, the following error occurs:
Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:71)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
at <redacted>?.main(App.scala:42)
at <redacted>.App.main(App.scala)
Is there an obvious reason why a NullPointerException is thrown at the option("dbtable", tableName) line? I'm using spark-2.3.1-bin-hadoop2.7 with Scala 2.11.12. For the Postgres dependency, I'm using this version:
The error message (which isn't very helpful for troubleshooting) probably does not point at the dbtable option but at the url option.
It looks like your jdbcUrl is missing the URL protocol jdbc:postgresql:// as its prefix. Here's a link re: Spark's JDBC data sources.
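A cheap guard against this class of error is to check the URL prefix before handing it to Spark. The sketch below is an illustration only: JdbcUrlCheck and its default prefix are hypothetical and not part of Spark or the PostgreSQL driver, and the host name in the comments is a placeholder.

```scala
object JdbcUrlCheck {
  // Hypothetical helper: Right(url) if the URL starts with the expected prefix,
  // Left(message) describing the problem otherwise.
  def validate(url: String, expectedPrefix: String = "jdbc:postgresql://"): Either[String, String] =
    if (url == null || url.trim.isEmpty)
      Left("JDBC URL is empty")
    else if (!url.startsWith(expectedPrefix))
      Left(s"JDBC URL '$url' does not start with '$expectedPrefix'")
    else
      Right(url)
}

// JdbcUrlCheck.validate("")                                    is a Left: empty URL
// JdbcUrlCheck.validate("dbhost:5432/mydb")                    is a Left: prefix missing
// JdbcUrlCheck.validate("jdbc:postgresql://dbhost:5432/mydb")  is a Right: URL accepted
```

Failing early with a readable message on the driver side is easier to diagnose than a NullPointerException raised deep inside the JDBC data source.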
I want to create a new RDD from MongoDB each time foreachRDD runs. However, I have serialization issues:
mydstream.foreachRDD(rdd => {
  val mongoClient = MongoClient("localhost", 27017)
  val db = mongoClient(mongoDatabase)
  val coll = db(mongoCollection)
  // ssc is my StreamingContext
  val modelsRDDRaw = ssc.sparkContext.parallelize(coll.find().toList)
})
This will give me an error:
object not serializable (class: org.apache.spark.streaming.StreamingContext, value: org.apache.spark.streaming.StreamingContext@31133b6e)
Any idea?
You might try to use rdd.context, which returns the SparkContext the RDD was created on (calling context on a DStream returns the StreamingContext instead). That way the closure no longer refers to ssc.
mydstream.foreachRDD { rdd =>
  val mongoClient = MongoClient("localhost", 27017)
  val db = mongoClient(mongoDatabase)
  val coll = db(mongoCollection)
  val modelsRDDRaw = rdd.context.parallelize(coll.find().toList)
}
Actually, RDD also has a .sparkContext method; as far as I can tell, the two are aliases that return the same SparkContext.
As I understand it, if you have a "not serializable" object, you need to create it inside foreachPartition, so that each node makes its own connection to the database before running your processing.
mydstream.foreachRDD(rdd => {
val mongoClient = MongoClient("localhost", 27017)
val db = mongoClient(mongoDatabase)
val coll = db(mongoCollection)
// ssc is my StreamingContext
val modelsRDDRaw = ssc.sparkContext.parallelize(coll.find().toList) }})
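The underlying rule can be reproduced without Spark: a closure that captures a non-serializable object cannot itself be serialized, while a closure that captures only plain values can. The sketch below uses plain Java serialization; DriverOnlyHandle and ClosureCheck are hypothetical names introduced for illustration, with DriverOnlyHandle standing in for something like a StreamingContext or a database client.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for a driver-side handle; deliberately not Serializable.
class DriverOnlyHandle(val name: String)

object ClosureCheck {
  // Returns true if Java serialization accepts the value, false if it hits a non-serializable object.
  def isSerializable(value: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      try out.writeObject(value) finally out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  // The returned closure captures the handle, so serializing it must also serialize the handle.
  def capturing(): String => String = {
    val handle = new DriverOnlyHandle("ssc")
    (record: String) => handle.name + ":" + record
  }

  // The returned closure captures only a String, which is serializable.
  def selfContained(): String => String = {
    val name = "ssc"
    (record: String) => name + ":" + record
  }
}
```

This is why creating the client inside the function that runs on each partition, or obtaining the context from the rdd argument, keeps such objects out of the closure.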