Scala SQL query remote access error from GCP to on-premises - scala

I had the following code:
import org.jooq._
import org.jooq.impl._
import org.jooq.impl.DSL._
import java.sql.DriverManager
import org.apache.log4j.receivers.db.dialect.SQLDialect
import org.apache.spark.sql.SparkSession
val session = SparkSession.builder().getOrCreate()
var df1 = session.emptyDataFrame
var df2 = session.emptyDataFrame
val userName = "user"
val password = "pass"
val c = DriverManager.getConnection("jdbc:mysql://blah_blah.com", userName, password)
df1 = sql(s"select * from $db1_name.$tb1_name")
df2 = c.prepareStatement(s"select * from $db2_name.$tb2_name")
Then I got the following error:
found : org.jooq.SQL
required: org.apache.spark.sql.DataFrame
(which expands to)
org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
df1 = sql(s"select * from $db1_name.$tb1_name")
^
found : java.sql.PreparedStatement
required: org.apache.spark.sql.DataFrame
(which expands to)
org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
df2 = c.prepareStatement(s"select * from $db2_name.$tb2_name")
Then, per the suggestions in the comments, I changed my code to the following:
val userName = "user"
val password = "pass"
val session = SparkSession.builder().getOrCreate()
var df1 = session.emptyDataFrame
var df2 = session.emptyDataFrame
....
....
df1 = sql(s"select * from $db1_name.$tb1_name")
df2 = session.read.format("jdbc").
option("url", "jdbc:mysql://blah_blah.com").
option("driver", "com.mysql.jdbc.Driver").
option("useUnicode", "true").
option("continueBatchOnError","true").
option("useSSL", "false").
option("user", userName).
option("password", password).
option("dbtable",s"select * from $db2_name.$tb2_name").load()
I am getting the following error:
The last packet sent successfully to the server was 0 milliseconds
ago. The driver has not received any packets from the server.
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:425)
at com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:989)
at com.mysql.jdbc.MysqlIO.readPacket(MysqlIO.java:632)
at com.mysql.jdbc.MysqlIO.doHandshake(MysqlIO.java:1016)
at com.mysql.jdbc.ConnectionImpl.coreConnect(ConnectionImpl.java:2194)
at com.mysql.jdbc.ConnectionImpl.connectOneTryOnly(ConnectionImpl.java:2225)
at com.mysql.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:2024)
at com.mysql.jdbc.ConnectionImpl.<init>(ConnectionImpl.java:779)
at com.mysql.jdbc.JDBC4Connection.<init>(JDBC4Connection.java:47)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:425)
at com.mysql.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:389)
at com.mysql.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:330)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:63)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:54)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:56)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:115)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:52)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:341)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
... 78 elided
Caused by: java.io.EOFException: Can not read response from server.
Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost.
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:3011)
at com.mysql.jdbc.MysqlIO.readPacket(MysqlIO.java:567)
... 100 more
Any solution or suggestion for these two errors?
I have also tried the PostgreSQL and H2 drivers (e.g. org.postgresql.Driver), but I get similar (though not identical) errors.

Your issue is that the Scala compiler has already inferred the types of the vars df1 and df2 as empty DataFrames, so you cannot assign a jOOQ SQL or a java.sql.PreparedStatement to them.
You have to read directly through Spark:
spark.read.format("jdbc")
.option("url", jdbcUrl)
.option("query", "select c1, c2 from t1")
.load()
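For your specific case, a minimal sketch might look like the following; it reuses the placeholders from your question (session, userName, password, db2_name, tb2_name), while the port is an assumption. Note that the query option needs Spark 2.4+; on older versions use dbtable with a parenthesized subquery and an alias instead.
// minimal sketch; host, port and driver class should be adjusted to your environment
val df2 = session.read
  .format("jdbc")
  .option("url", "jdbc:mysql://blah_blah.com:3306")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("user", userName)
  .option("password", password)
  .option("query", s"select * from $db2_name.$tb2_name")
  .load()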
For more information you can check the Apache Spark documentation directly:
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html

You can simply get a DataFrame by reading as shown below. First, set your connection details:
val jdbcHostname = "some.host.name"
val jdbcDatabase = "some_db"
val driver = "com.mysql.cj.jdbc.Driver" // update the driver as needed; in your case it would be org.postgresql.Driver
// url to DB
val jdbcUrl = s"jdbc:mysql://$jdbcHostname:3306/$jdbcDatabase"
val username = "someUser"
val password = "somePass"
// create a Properties object for your DB connection
import java.util.Properties
val connectionProperties = new Properties()
connectionProperties.put("user", s"${username}")
connectionProperties.put("password", s"${password}")
connectionProperties.setProperty("Driver", driver)
and then read from JDBC as:
// use above created url and connection properties to fetch data
val tableName = "some-table"
val mytable = spark.read.jdbc(jdbcUrl, tableName, connectionProperties)
Spark automatically reads the schema from the database table and maps its types back to Spark SQL types.
You can use the above mytable dataframe to run your queries or save data.
Say you want to select some columns, like c1 and c2, and save the result:
// your select query
val selectedDF = mytable.select("c1", "c2")
// now you can save above dataframe
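For example, a minimal sketch of saving it; the output path and the second table name below are assumptions:
// write the selected columns out as Parquet files (the path is hypothetical)
selectedDF.write.mode("overwrite").parquet("/tmp/selected_output")
// or append them to another table over JDBC (the table name is hypothetical)
selectedDF.write.mode("append").jdbc(jdbcUrl, "some_other_table", connectionProperties)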

Related

Push down DML commands to SQL using Pyspark on Databricks

I'm using Azure Databricks and want to push down a query to an Azure SQL database using PySpark. I've tried many ways and found a solution using Scala (code below), but doing this I need to convert part of my code to Scala and then bring it back to PySpark again.
%scala
import java.util.Properties
import java.sql.DriverManager
val jdbcUsername = username
val jdbcPassword = password
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = "entire-string-connection-to-Azure-SQL"
// Create a Properties() object to hold the parameters.
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
connectionProperties.setProperty("Driver", driverClass)
val connection = DriverManager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
val stmt = connection.createStatement()
val sql = "TRUNCATE TABLE dbo.table"
stmt.execute(sql)
connection.close()
Is there a way to achieve the pushdown of DML commands using PySpark instead of the Scala language?
I found something related, but it only works for reading data and DDL commands:
jdbcUrl = "jdbc:mysql://{0}:{1}/{2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
connectionProperties = {
  "user" : jdbcUsername,
  "password" : jdbcPassword,
  "driver" : "com.mysql.jdbc.Driver"
}
pushdown_query = "(select * from employees where emp_no < 10008) emp_alias"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
You can actually achieve the same thing in Python as in the Scala example you provided.
driver_manager = spark._sc._gateway.jvm.java.sql.DriverManager
connection = driver_manager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
query = "YOUR SQL QUERY"
exec_statement = connection.prepareCall(query)
exec_statement.execute()
exec_statement.close()
connection.close()
For your case I would try
driver_manager = spark._sc._gateway.jvm.java.sql.DriverManager
connection = driver_manager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
stmt = connection.createStatement()
sql = "TRUNCATE TABLE dbo.table"
stmt.execute(sql)
connection.close()

Not able to create a table locally, getting Hive support is required

I am getting the error even after setting the configuration
config("spark.sql.catalogImplementation","hive")
override def beforeAll(): Unit = {
  super[SharedSparkContext].beforeAll()
  SparkSessionProvider._sparkSession = SparkSession.builder()
    .master("local[*]")
    .config("spark.sql.catalogImplementation", "hive")
    .getOrCreate()
}
Edited:
This is how I am setting up my local DB and tables for testing.
val stgDb = "test_stagingDB"
val stgTbl_exp ="test_stagingDB_expected"
val stgTbl_result="test_stg_table_result"
val trgtDb = "test_activeDB"
val trgtTbl_exp ="test_activeDB_expected"
val trgtTbl_result ="test_activeDB_results"
def setUpDb = {
  println("Set up DB started")
  val localPath = "file:/C:/Users/vmurthyms/Code-prdb/prdb/com.rxcorp.prdb"
  spark.sql(s"CREATE DATABASE IF NOT EXISTS test_stagingDB LOCATION '$localPath/test_stagingDB.db'")
  spark.sql(s"CREATE DATABASE IF NOT EXISTS test_activeDB LOCATION '$localPath/test_sctiveDB.db'")
  spark.sql(s"CREATE TABLE IF NOT EXISTS $trgtDb.${trgtTbl_exp}_ina (Id String, Name String)")
  println("Set up DB done")
}
setUpDb
While running the spark.sql("CREATE TABLE ...") command, I am getting the error below:
Error:
Hive support is required to CREATE Hive TABLE (AS SELECT);;
'CreateTable test_activeDB.test_activeDB_expected_ina, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Ignore
org.apache.spark.sql.AnalysisException: Hive support is required to CREATE Hive TABLE (AS SELECT);;
'CreateTable test_activeDB.test_activeDB_expected_ina, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Ignore
at org.apache.spark.sql.execution.datasources.HiveOnlyCheck$$anonfun$apply$12.apply(rules.scala:392)
at org.apache.spark.sql.execution.datasources.HiveOnlyCheck$$anonfun$apply$12.apply(rules.scala:390)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:117)
at org.apache.spark.sql.execution.datasources.HiveOnlyCheck$.apply(rules.scala:390)
at org.apache.spark.sql.execution.datasources.HiveOnlyCheck$.apply(rules.scala:388)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$2.apply(CheckAnalysis.scala:349)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$2.apply(CheckAnalysis.scala:349)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:349)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:92)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:641)
at com.rxcorp.prdb.exe.SitecoreAPIExtractTest$$anonfun$2.setUpDb$1(SitecoreAPIExtractTest.scala:127)
at com.rxcorp.prdb.exe.SitecoreAPIExtractTest$$anonfun$2.apply$mcV$sp(SitecoreAPIExtractTest.scala:130)
It seems you are almost there (your error message is also giving you the clue): you need to call enableHiveSupport() when you are creating the Spark session. E.g.
SparkSession.builder()
.master("local[*]")
.config("spark.sql.catalogImplementation","hive")
.enableHiveSupport()
.getOrCreate()
Also, when using enableHiveSupport(), setting config("spark.sql.catalogImplementation", "hive") is redundant; I think you can safely remove that part.
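For reference, a minimal standalone sketch of a local test session with Hive support; the app name and warehouse path are assumptions, and it also assumes the spark-hive module is on the test classpath:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("hive-local-test") // hypothetical app name
  .config("spark.sql.warehouse.dir", "file:/C:/tmp/spark-warehouse") // hypothetical local warehouse path
  .enableHiveSupport()
  .getOrCreate()

// with Hive support enabled, CREATE DATABASE / CREATE TABLE work locally
spark.sql("CREATE DATABASE IF NOT EXISTS test_stagingDB")
spark.sql("CREATE TABLE IF NOT EXISTS test_stagingDB.sample_table (Id String, Name String)") // hypothetical table name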

NullPointerException when connecting to Postgres from Spark -- why?

object App {
  def main(args: Array[String]) {
    val conf = new spark.SparkConf().setMaster("local[2]").setAppName("mySparkApp")
    val sc = new spark.SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val jdbcUrl = "1.2.34.567"
    val jdbcUser = "someUser"
    val jdbcPassword = "xxxxxxxxxxxxxxxxxxxx"
    val tableName = "myTable"
    val driver = "org.postgresql.Driver"
    Class.forName(driver)
    val df = sqlContext
      .read
      .format("jdbc")
      .option("driver", driver)
      .option("url", jdbcUrl)
      .option("userName", jdbcUser)
      .option("password", jdbcPassword)
      .option("dbtable", tableName) // NullPointerException occurs here
      .load()
  }
}
I want to connect to a Postgres database on my LAN from Spark. During runtime, the following error occurs:
Exception in thread "main" java.lang.NullPointerException
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:71)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation$.getSchema(JDBCRelation.scala:210)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:35)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
at <redacted>?.main(App.scala:42)
at <redacted>.App.main(App.scala)
Is there an obvious reason why there's a NullPointerException at the option("dbtable", tableName) line? I'm using spark-2.3.1-bin-hadoop2.7 with Scala 2.11.12. For the Postgres dependency, I'm using this version:
<dependency>
  <groupId>org.postgresql</groupId>
  <artifactId>postgresql</artifactId>
  <version>9.4-1200-jdbc41</version>
</dependency>
The error message (which isn't very helpful for troubleshooting) is probably not about the dbtable option but about the url option.
It looks like your jdbcUrl is missing the URL protocol jdbc:postgresql:// as its prefix. See Spark's JDBC data sources documentation (https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html).
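For example, a sketch of the corrected reader; the port and database name are assumptions, and note that Spark's JDBC option key for the user is user, not userName:
// the URL needs the jdbc:postgresql:// prefix, plus (assumed here) the default port and a database name
val fixedJdbcUrl = "jdbc:postgresql://1.2.34.567:5432/mydatabase"

val df = sqlContext.read
  .format("jdbc")
  .option("driver", "org.postgresql.Driver")
  .option("url", fixedJdbcUrl)
  .option("user", jdbcUser) // "user", not "userName"
  .option("password", jdbcPassword)
  .option("dbtable", tableName)
  .load()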

write to a JDBC source in scala

I am trying to write a classic SQL query using Scala to insert some information into a SQL Server database table.
The connection to my database works perfectly and I succeed in reading data from JDBC, from a recently created table called "textspark" which has only one column called "firstname" (create table textspark(firstname varchar(10))).
However, when I try to write data into the table, I get the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: textspark
This is my code:
//Step 1: Check that the JDBC driver is available
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")
//Step 2: Create the JDBC URL
val jdbcHostname = "localhost"
val jdbcPort = 1433
val jdbcDatabase ="mydatabase"
val jdbcUsername = "mylogin"
val jdbcPassword = "mypwd"
// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase}"
// Create a Properties() object to hold the parameters.
import java.util.Properties
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
//Step 3: Check connectivity to the SQLServer database
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
connectionProperties.setProperty("Driver", driverClass)
//Read data from JDBC
val textspark_table = spark.read.jdbc(jdbcUrl, "textspark", connectionProperties)
textspark_table.show()
//the read operation works perfectly!!
//Write data to JDBC
import org.apache.spark.sql.SaveMode
spark.sql("insert into textspark values('test') ")
.write
.mode(SaveMode.Append) // <--- Append to the existing table
.jdbc(jdbcUrl, "textspark", connectionProperties)
//the write operation generates error!!
Can anyone please help me fix this error?
You don't use an INSERT statement in Spark. You specified append mode, which is OK, but you shouldn't insert the data with SQL; you should select / create it. Try something like this:
spark.sql("select 'text'")
.write
.mode(SaveMode.Append)
.jdbc(jdbcUrl, "textspark", connectionProperties)
or
Seq("test").toDS
.write
.mode(SaveMode.Append)
.jdbc(jdbcUrl, "textspark", connectionProperties)

Getting error while converting DynamicFrame to a Spark DataFrame using toDF

I started using AWS Glue to read data using the Data Catalog and GlueContext and to transform it as per my requirements.
val spark: SparkContext = new SparkContext()
val glueContext: GlueContext = new GlueContext(spark)
val sparkSession = glueContext.getSparkSession
// Data Catalog: database and table name
val dbName = "abcdb"
val tblName = "xyzdt_2017_12_05"
// S3 location for output
val outputDir = "s3://output/directory/abc"
// Read data into a DynamicFrame using the Data Catalog metadata
val stGBDyf = glueContext.getCatalogSource(database = dbName, tableName = tblName).getDynamicFrame()
val revisedDF = stGBDyf.toDf() // This line is getting an error
While executing the above code I got the following error:
Error: Syntax Error: error: value toDf is not a member of com.amazonaws.services.glue.DynamicFrame
val revisedDF = stGBDyf.toDf()
one error found.
I followed this example to convert the DynamicFrame to a Spark DataFrame.
Please suggest the best way to resolve this problem.
There's a typo. It should work fine with a capital F in toDF:
val revisedDF = stGBDyf.toDF()
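Once converted, revisedDF is a regular Spark DataFrame, so the standard DataFrame API applies. A minimal usage sketch (the output format and mode are arbitrary choices here, not part of the original answer):
revisedDF.printSchema()
// write to the S3 output location defined in the question, using the plain Spark writer rather than a Glue sink
revisedDF.write.mode("overwrite").parquet(outputDir)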