I have a spark (1.2.1 v) job that inserts a content of an rdd to postgres using postgresql.Driver for scala:
rdd.foreachPartition(iter => {
//connect to postgres database on the localhost
val driver = "org.postgresql.Driver"
var connection:Connection = null
Class.forName(driver)
connection = DriverManager.getConnection(url, username, password)
val statement = connection.createStatement()
iter.foreach(row => {
val mapRequest = Utils.getInsertMap(row)
val query = Utils.getInsertRequest(squares_table, mapRequest)
try { statement.execute(query) }
catch {
case pe: PSQLException => println("exception caught: " + pe);
}
})
connection.close()
})
In the above code I open new connection to postgres for each partition of the rdd and close it. I think that the right way to go would be to use connection pool to postgres that I can take connections from (as described here), but its just pseudo-code:
rdd.foreachPartition { partitionOfRecords =>
// ConnectionPool is a static, lazily initialized pool of connections
val connection = ConnectionPool.getConnection()
partitionOfRecords.foreach(record => connection.send(record))
ConnectionPool.returnConnection(connection) // return to the pool for future reuse
}
What is the right way to connect to postgres with connection pool from spark?
This code will work for spark 2 or grater version and scala , First you have to add spark jdbc driver.
If you are using Maven then you can work this way. add this setting to your pom file
<dependency>
<groupId>postgresql</groupId>
<artifactId>postgresql</artifactId>
<version>9.1-901-1.jdbc4</version>
</dependency>
write this code to scala file
import org.apache.spark.sql.SparkSession
object PostgresConnection {
def main(args: Array[String]) {
val spark =
SparkSession.builder()
.appName("DataFrame-Basic")
.master("local[4]")
.getOrCreate()
val prop = new java.util.Properties
prop.setProperty("driver","org.postgresql.Driver")
prop.setProperty("user", "username")
prop.setProperty("password", "password")
val url = "jdbc:postgresql://127.0.0.1:5432/databaseName"
val df = spark.read.jdbc(url, "table_name",prop)
println(df.show(5))
}
}
Related
I have a sample Spark Code where I am trying to access the Values for tables from the Spark Configurations provided by spark_conf Option by using the typeSafe application.conf and Spark Conf in the Databricks UI.
The code I am using is below,
When I hit the Run Button in the Databricks UI, the job is finishing successfully, but the println function is printing dummyValue instead of ThisIsTableAOne,ThisIsTableBOne...
I can see from the Spark UI that, the Configurations for TableNames are being passed to the Spark job, but these values are not getting reflected in the Code.
try {
val inputConfig = AppConfig.getConfig("input")
val outputConfig = AppConfig.getConfig("output")
val tableA = inputConfig.getString("tableA")
val tableB = inputConfig.getString("tableB")
val tableC = outputConfig.getString("tableC")
println(
s"""
|$tableA
|$tableB
|$tableC
|""".stripMargin)
val userDataInTable = sparkSession.createDataFrame(Seq(
(1, "dummy", "dummy", "dummy")
)).toDF("id", "col2", "col3", "col4")
userDataInTable.show(false)
println("Completed Entry ")
} catch {
case e: Exception =>
sparkSession.stop()
e.printStackTrace
}
//application.conf contains below text,
spark.tableNameA="dummyValue"
spark.tableNameB="dummyValue"
spark.tableNameC="dummyValue"
input{
tableA=${spark.tableNameA}
tableB=${spark.tableNameB}
}
output{
tableC=${spark.tableNameC}
}
//AppConfig
val appConfig = ConfigFactory.load("application.conf")
def getConfig(moduleName: String): Config = {
val config = appConfig.getConfig(moduleName)
config
}
I'm using Azure's Databricks and want to pushdown a query to a Azure SQL using PySpark. I've tried many ways and found a solution using Scala (code below), but doing this I need to convert part of my code to scala then bring back to PySpark again.
%scala
import java.util.Properties
import java.sql.DriverManager
val jdbcUsername = username
val jdbcPassword = password
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = "entire-string-connection-to-Azure-SQL"
// Create a Properties() object to hold the parameters.
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
connectionProperties.setProperty("Driver", driverClass)
val connection = DriverManager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
val stmt = connection.createStatement()
val sql = "TRUNCATE TABLE dbo.table"
stmt.execute(sql)
connection.close()
Is there a way to achieve the pushdown of a DML code using PySpark instead of Scala language?
Found something related but only works to read data and DDL commands:
jdbcUrl = "jdbc:mysql://{0}:{1}/{2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
connectionProperties = {
"user" : jdbcUsername,
"password" : jdbcPassword,
"driver" : "com.mysql.jdbc.Driver"
}
pushdown_query = "(select * from employees where emp_no < 10008) emp_alias"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
You can actually achieve the same thing as the Scala example you provided in Python.
driver_manager = spark._sc._gateway.jvm.java.sql.DriverManager
connection = driver_manager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
query = "YOUR SQL QUERY"
exec_statement = connection.prepareCall(query)
exec_statement.execute()
exec_statement.close()
connection.close()
For your case I would try
driver_manager = spark._sc._gateway.jvm.java.sql.DriverManager
connection = driver_manager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
stmt = connection.createStatement()
sql = "TRUNCATE TABLE dbo.table"
stmt.execute(sql)
connection.close()
I am trying to create a spark application which is useful to
create, read, write and update MySQL data. So, is there any way to create a MySQL table using Spark?
Below I have a Scala-JDBC code that creates a table in MySQL
database. How can I do this through Spark?
package SparkMysqlJdbcConnectivity
import org.apache.spark.sql.SparkSession
import java.util.Properties
import java.lang.Class
import java.sql.Connection
import java.sql.DriverManager
object MysqlSparkJdbcProgram {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder()
.appName("MysqlJDBC Connections")
.master("local[*]")
.getOrCreate()
val driver = "com.mysql.jdbc.Driver"
val url = "jdbc:mysql://localhost:3306/world"
val operationtype = "create table"
val tablename = "country"
val tablename2 = "state"
val connectionProperties = new Properties()
connectionProperties.put("user", "root")
connectionProperties.put("password", "root")
val jdbcDf = spark.read.jdbc(url, s"${tablename}", connectionProperties)
operationtype.trim() match {
case "create table" => {
// Class.forName(driver)
try{
val con:Connection = DriverManager.getConnection(url,connectionProperties)
val result = con.prepareStatement(s"create table ${tablename2} (name varchar(255), country varchar(255))").execute()
println(result)
if(result) println("table creation is unsucessful") else println("table creation is unsucessful")
}
}
case "read table" => {
val jdbcDf = spark.read.jdbc("jdbc:mysql://localhost:3306/world", s"${tablename}", connectionProperties)
jdbcDf.show()
}
case "write table" => {}
case "drop table" => {}
}
}
}
The tables will be created automatically when you write the jdbcDf dataframe.
jdbcDf
.write
.jdbc("jdbc:mysql://localhost:3306/world", s"${tablename}", connectionProperties)
In case if you want to specify the table schema,
jdbcDf
.write
.option("createTableColumnTypes", "name VARCHAR(500), col1 VARCHAR(1024), col3 int")
.jdbc("jdbc:mysql://localhost:3306/world", s"${tablename}", connectionProperties)
object App {
def main(args: Array[String]) {
val conf = new spark.SparkConf().setMaster("local[2]").setAppName("mySparkApp")
val sc = new spark.SparkContext(conf)
val sqlContext = new SQLContext(sc)
val jdbcUrl = "1.2.34.567"
val jdbcUser = "someUser"
val jdbcPassword = "xxxxxxxxxxxxxxxxxxxx"
val tableName = "myTable"
val driver = "org.postgresql.Driver"
Class.forName(driver)
val df = sqlContext
.read
.format("jdbc")
.option("driver", driver)
.option("url", jdbcUrl)
.option("userName", jdbcUser)
.option("password", jdbcPassword)
.option("dbtable", tableName) // NullPointerException occurs here
.load()
}
}
I want to connect to a Postgres database on my LAN from Spark. During runtime, the following error occurs:
Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:71)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
at <redacted>?.main(App.scala:42)
at <redacted>.App.main(App.scala)
Is there an obvious reason why there's a nullpointer exception at the option("dbtable", tableName) line? I'm using spark-2.3.1-bin-hadoop2.7 with Scala 2.11.12. For the postgres dependency, I'm using this version:
<dependency>
<groupId>org.postgresql</groupId>
<artifactId>postgresql</artifactId>
<version>9.4-1200-jdbc41</version>
</dependency>
The error message (which isn't very helpful for troubleshooting) is probably not against option dbtable, but option url.
It looks like your jdbcUrl is missing the URL protocol jdbc:postgresql:// as its prefix. Here's a link re: Spark's JDBC data sources.
I want to create a new mongodb RDD each time I enter inside foreachRDD. However I have serialization issues:
mydstream
.foreachRDD(rdd => {
val mongoClient = MongoClient("localhost", 27017)
val db = mongoClient(mongoDatabase)
val coll = db(mongoCollection)
// ssc is my StreamingContext
val modelsRDDRaw = ssc.sparkContext.parallelize(coll.find().toList) })
This will give me an error:
object not serializable (class: org.apache.spark.streaming.StreamingContext, value: org.apache.spark.streaming.StreamingContext#31133b6e)
Any idea?
You might try to use rdd.context that returns either a SparkContext or a SparkStreamingContext (if rdd is a DStream).
mydstream foreachRDD { rdd => {
val mongoClient = MongoClient("localhost", 27017)
val db = mongoClient(mongoDatabase)
val coll = db(mongoCollection)
val modelsRDDRaw = rdd.context.parallelize(coll.find().toList) })
Actually, it seems that RDD has also a .sparkContext method. I honestly don't know the difference, maybe they are aliases (?).
In my understanding you have to add if you have a "not serializable" object, you need to pass it through foreachPartition so you can make a connection to database on each node before running your processing.
mydstream.foreachRDD(rdd => {
rdd.foreachPartition{
val mongoClient = MongoClient("localhost", 27017)
val db = mongoClient(mongoDatabase)
val coll = db(mongoCollection)
// ssc is my StreamingContext
val modelsRDDRaw = ssc.sparkContext.parallelize(coll.find().toList) }})