Push down DML commands to SQL using PySpark on Databricks

I'm using Azure Databricks and want to push down a query to an Azure SQL database using PySpark. I've tried many approaches and found a solution in Scala (code below), but with it I have to convert part of my code to Scala and then switch back to PySpark.
%scala
import java.util.Properties
import java.sql.DriverManager
val jdbcUsername = username
val jdbcPassword = password
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = "entire-string-connection-to-Azure-SQL"
// Create a Properties() object to hold the parameters.
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
connectionProperties.setProperty("Driver", driverClass)
val connection = DriverManager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
val stmt = connection.createStatement()
val sql = "TRUNCATE TABLE dbo.table"
stmt.execute(sql)
connection.close()
Is there a way to push down a DML statement using PySpark instead of Scala?
I found something related, but it only works for reading data and for DDL commands:
jdbcUrl = "jdbc:mysql://{0}:{1}/{2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
connectionProperties = {
"user" : jdbcUsername,
"password" : jdbcPassword,
"driver" : "com.mysql.jdbc.Driver"
}
pushdown_query = "(select * from employees where emp_no < 10008) emp_alias"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)

You can actually achieve the same thing as the Scala example you provided in Python.
driver_manager = spark._sc._gateway.jvm.java.sql.DriverManager
connection = driver_manager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
query = "YOUR SQL QUERY"
exec_statement = connection.prepareCall(query)
exec_statement.execute()
exec_statement.close()
connection.close()
For your case I would try
driver_manager = spark._sc._gateway.jvm.java.sql.DriverManager
connection = driver_manager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
stmt = connection.createStatement()
sql = "TRUNCATE TABLE dbo.table"
stmt.execute(sql)
connection.close()
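A minimal sketch of the same approach wrapped in a helper that closes the statement and the connection even when the statement fails. The helper name `run_statement` is hypothetical; `driver_manager` is assumed to be the py4j handle obtained as above from `spark._sc._gateway.jvm.java.sql.DriverManager`.

```python
def run_statement(driver_manager, jdbc_url, user, password, sql):
    """Run one SQL statement (DML or DDL) over a plain JDBC connection.

    driver_manager is assumed to be the py4j handle to java.sql.DriverManager,
    e.g. spark._sc._gateway.jvm.java.sql.DriverManager on the driver node.
    """
    connection = driver_manager.getConnection(jdbc_url, user, password)
    try:
        statement = connection.createStatement()
        try:
            # executeUpdate returns the affected row count reported by the driver.
            return statement.executeUpdate(sql)
        finally:
            statement.close()
    finally:
        # Always release the connection, even if the statement raised.
        connection.close()
```

With the variables from the question this would be called as `run_statement(driver_manager, jdbcUrl, jdbcUsername, jdbcPassword, "TRUNCATE TABLE dbo.table")`.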

Related

How to create table in mysql database using apache spark

I am trying to create a Spark application that can create, read, write and update MySQL data. Is there any way to create a MySQL table using Spark?
Below is Scala JDBC code that creates a table in a MySQL database. How can I do this through Spark?
package SparkMysqlJdbcConnectivity
import org.apache.spark.sql.SparkSession
import java.util.Properties
import java.lang.Class
import java.sql.Connection
import java.sql.DriverManager
object MysqlSparkJdbcProgram {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder()
.appName("MysqlJDBC Connections")
.master("local[*]")
.getOrCreate()
val driver = "com.mysql.jdbc.Driver"
val url = "jdbc:mysql://localhost:3306/world"
val operationtype = "create table"
val tablename = "country"
val tablename2 = "state"
val connectionProperties = new Properties()
connectionProperties.put("user", "root")
connectionProperties.put("password", "root")
val jdbcDf = spark.read.jdbc(url, s"${tablename}", connectionProperties)
operationtype.trim() match {
case "create table" => {
// Class.forName(driver)
try{
val con:Connection = DriverManager.getConnection(url,connectionProperties)
val result = con.prepareStatement(s"create table ${tablename2} (name varchar(255), country varchar(255))").execute()
println(result)
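// execute() returns false for DDL (no ResultSet); a failed CREATE TABLE throws an SQLException instead
if(!result) println("table creation is successful") else println("table creation is unsuccessful")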
}
}
case "read table" => {
val jdbcDf = spark.read.jdbc("jdbc:mysql://localhost:3306/world", s"${tablename}", connectionProperties)
jdbcDf.show()
}
case "write table" => {}
case "drop table" => {}
}
}
}
The tables will be created automatically when you write the jdbcDf dataframe.
jdbcDf
.write
.jdbc("jdbc:mysql://localhost:3306/world", s"${tablename}", connectionProperties)
If you want to specify the column types of the created table:
jdbcDf
.write
.option("createTableColumnTypes", "name VARCHAR(500), col1 VARCHAR(1024), col3 int")
.jdbc("jdbc:mysql://localhost:3306/world", s"${tablename}", connectionProperties)
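The value of `createTableColumnTypes` is a comma-separated list of `column type` pairs. If the column types are kept in a mapping, the string could be assembled as in this Python sketch. The helper name `column_types_option` is hypothetical and not part of Spark.

```python
def column_types_option(column_types):
    """Build the createTableColumnTypes value from a {column: sql_type} mapping.

    The insertion order of the mapping is preserved (Python 3.7+).
    """
    return ", ".join(
        f"{name} {sql_type}" for name, sql_type in column_types.items()
    )


option_value = column_types_option(
    {"name": "VARCHAR(500)", "col1": "VARCHAR(1024)", "col3": "int"}
)
print(option_value)  # name VARCHAR(500), col1 VARCHAR(1024), col3 int
```

In PySpark the result would be passed as `.option("createTableColumnTypes", option_value)` on the writer, in the same position as in the Scala snippet.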

Delete azure sql database rows from azure databricks

I have a table in an Azure SQL database. From Azure Databricks, I want to either delete selected rows based on some criteria or clear the entire table. Currently I use the truncate property of JDBC to truncate the entire table without dropping it, and then rewrite it with a new dataframe.
df.write \
.option('user', jdbcUsername) \
.option('password', jdbcPassword) \
.jdbc('<connection_string>', '<table_name>', mode = 'overwrite', properties = {'truncate' : 'true'} )
Going forward, I don't want to truncate and overwrite the entire table every time; I would rather use a DELETE command. I was not able to achieve this with a pushdown query either. Any help would be greatly appreciated.
You can also drop down to Scala to do this, as the SQL Server JDBC driver is already installed. For example:
%scala
import java.util.Properties
import java.sql.DriverManager
val jdbcUsername = "xxxxx"
val jdbcPassword = "xxxxxx"
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:sqlserver://xxxxxx.database.windows.net:1433;database=AdventureWorks;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
// Create a Properties() object to hold the parameters.
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
connectionProperties.setProperty("Driver", driverClass)
val connection = DriverManager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
val stmt = connection.createStatement()
val sql = "delete from sometable where someColumn > 4"
stmt.execute(sql)
connection.close()
Use pyodbc to execute a SQL statement.
import pyodbc
conn = pyodbc.connect( 'DRIVER={ODBC Driver 17 for SQL Server};'
'SERVER=mydatabe.database.azure.net;'
'DATABASE=AdventureWorks;UID=jonnyFast;'
'PWD=MyPassword')
conn.execute('DELETE TableBlah WHERE 1=2')
# pyodbc connections do not autocommit by default, so commit the delete explicitly
conn.commit()
conn.close()
It's a bit of a pain to get pyodbc working on Databricks - see details here: https://datathirst.net/blog/2018/10/12/executing-sql-server-stored-procedures-on-databricks-pyspark
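When the delete criteria come from variables, passing them as parameters avoids building SQL by string concatenation. The sketch below uses Python's built-in `sqlite3` module with an in-memory database purely as a stand-in so that it runs anywhere. pyodbc also uses `?` placeholders, so the same `execute` call shape should carry over to a pyodbc connection. The table `sometable`, the column `someColumn` and the helper `delete_rows_above` are illustrative names.

```python
import sqlite3


def delete_rows_above(conn, threshold):
    """Delete rows whose someColumn exceeds threshold; return the row count."""
    cursor = conn.cursor()
    # The value is bound as a parameter instead of being formatted into the SQL.
    cursor.execute("DELETE FROM sometable WHERE someColumn > ?", (threshold,))
    deleted = cursor.rowcount
    conn.commit()  # without a commit the delete is not made permanent
    cursor.close()
    return deleted


# Stand-in database: an in-memory SQLite table holding the values 1..6.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sometable (someColumn INTEGER)")
conn.executemany("INSERT INTO sometable VALUES (?)", [(i,) for i in range(1, 7)])
conn.commit()

deleted = delete_rows_above(conn, 4)
remaining = [
    row[0]
    for row in conn.execute("SELECT someColumn FROM sometable ORDER BY someColumn")
]
print(deleted, remaining)  # 2 [1, 2, 3, 4]
```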

write to a JDBC source in scala

I am trying to use a classic SQL query from Scala to insert some data into a SQL Server database table.
The connection to my database works and I can read data over JDBC from a recently created table called "textspark", which has a single column called "firstname": create table textspark(firstname varchar(10)).
However, when I try to write data into the table, I get the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: textspark
This is my code:
//Step 1: Check that the JDBC driver is available
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")
//Step 2: Create the JDBC URL
val jdbcHostname = "localhost"
val jdbcPort = 1433
val jdbcDatabase ="mydatabase"
val jdbcUsername = "mylogin"
val jdbcPassword = "mypwd"
// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase}"
// Create a Properties() object to hold the parameters.
import java.util.Properties
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
//Step 3: Check connectivity to the SQLServer database
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
connectionProperties.setProperty("Driver", driverClass)
//Read data from JDBC
val textspark_table = spark.read.jdbc(jdbcUrl, "textspark", connectionProperties)
textspark_table.show()
//the read operation works perfectly!!
//Write data to JDBC
import org.apache.spark.sql.SaveMode
spark.sql("insert into textspark values('test') ")
.write
.mode(SaveMode.Append) // <--- Append to the existing table
.jdbc(jdbcUrl, "textspark", connectionProperties)
//the write operation generates error!!
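Can anyone help me fix this error?
You don't use an INSERT statement in Spark. spark.sql resolves table names against Spark's own catalog, not against the SQL Server database, which is why it reports that textspark was not found. Specifying the append mode is fine. Instead of inserting the data, select or create it as a DataFrame and let the writer append it. Try something like this: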
// alias the column so that its name matches the table's firstname column
spark.sql("select 'text' as firstname")
.write
.mode(SaveMode.Append)
.jdbc(jdbcUrl, "textspark", connectionProperties)
or
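// toDF names the single column firstname to match the table (needs import spark.implicits._)
Seq("test").toDF("firstname")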
.write
.mode(SaveMode.Append)
.jdbc(jdbcUrl, "textspark", connectionProperties)

Getting error while converting DynamicFrame to a Spark DataFrame using toDF

I started using AWS Glue to read data through the Data Catalog and GlueContext and to transform it as required.
val spark: SparkContext = new SparkContext()
val glueContext: GlueContext = new GlueContext(spark)
val sparkSession = glueContext.getSparkSession
// Data Catalog: database and table name
val dbName = "abcdb"
val tblName = "xyzdt_2017_12_05"
// S3 location for output
val outputDir = "s3://output/directory/abc"
// Read data into a DynamicFrame using the Data Catalog metadata
val stGBDyf = glueContext.getCatalogSource(database = dbName, tableName = tblName).getDynamicFrame()
val revisedDF = stGBDyf.toDf() // This line getting error
While executing the above code I got the following error:
Error : Syntax Error: error: value toDf is not a member of com.amazonaws.services.glue.DynamicFrame
val revisedDF = stGBDyf.toDf()
one error found.
I followed this example to convert a DynamicFrame to a Spark DataFrame.
Please suggest the best way to resolve this problem.
There's a typo. Scala method names are case-sensitive, so it should work fine with a capital F in toDF:
val revisedDF = stGBDyf.toDF()

Connect to SQLite in Apache Spark

I want to run a custom function on all tables in a SQLite database. The function is more or less the same, but depends on the schema of the individual table. Also, the tables and their schemata are only known at runtime (the program is called with an argument that specifies the path of the database).
This is what I have so far:
val conf = new SparkConf().setAppName("MyApp")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// somehow bind sqlContext to DB
val allTables = sqlContext.tableNames
for( t <- allTables) {
val df = sqlContext.table(t)
val schema = df.columns
sqlContext.sql("SELECT * FROM " + t + "...").map(x => myFunc(x,schema))
}
The only hint I found so far needs to know the table in advance, which is not the case in my scenario:
val tableData =
sqlContext.read.format("jdbc")
.options(Map("url" -> "jdbc:sqlite:/path/to/file.db", "dbtable" -> t))
.load()
I am using the xerial sqlite jdbc driver. So how can I connect solely to a database, not to a table?
Edit: Using Beryllium's answer as a start I updated my code to this:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val metaData = sqlContext.read.format("jdbc")
.options(Map("url" -> "jdbc:sqlite:/path/to/file.db",
"dbtable" -> "(SELECT * FROM sqlite_master) AS t")).load()
val myTableNames = metaData.select("tbl_name").distinct()
for (t <- myTableNames) {
println(t.toString)
val tableData = sqlContext.table(t.toString)
for (record <- tableData.select("*")) {
println(record)
}
}
At least I can read the table names at runtime which is a huge step forward for me. But I can't read the tables. I tried both
val tableData = sqlContext.table(t.toString)
and
val tableData = sqlContext.read.format("jdbc")
.options(Map("url" -> "jdbc:sqlite:/path/to/file.db",
"dbtable" -> t.toString)).load()
in the loop, but in both cases I get a NullPointerException. Although I can print the table names it seems I cannot connect to them.
Last but not least I always get an SQLITE_ERROR: Connection is closed error. It looks to be the same issue described in this question: SQLITE_ERROR: Connection is closed when connecting from Spark via JDBC to SQLite database
There are two options you can try:
1. Use JDBC directly
- Open a separate, plain JDBC connection in your Spark job
- Get the table names from the JDBC metadata
- Feed these into your for comprehension
2. Use a SQL query for the "dbtable" argument
You can specify a query as the value for the dbtable argument. Syntactically this query must "look" like a table, so it must be wrapped in a sub query.
In that query, get the meta data from the database:
val df = sqlContext.read.format("jdbc").options(
Map(
"url" -> "jdbc:postgresql:xxx",
"user" -> "x",
"password" -> "x",
"dbtable" -> "(select * from pg_tables) as t")).load()
This example works with PostgreSQL; you have to adapt it for SQLite.
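The wrapping rule described above, a parenthesised query followed by an alias, can be captured in a small helper. This is a Python sketch: the function name `as_dbtable` is hypothetical, and `sqlite_master` is SQLite's built-in catalog table, the counterpart of `pg_tables` in the PostgreSQL example.

```python
def as_dbtable(query, alias="t"):
    """Wrap a SELECT so it can be passed where a table name is expected."""
    # Drop surrounding whitespace and a trailing semicolon, which is not
    # valid inside a subquery.
    cleaned = query.strip().rstrip(";").strip()
    return f"({cleaned}) AS {alias}"


dbtable = as_dbtable("SELECT * FROM sqlite_master;")
print(dbtable)  # (SELECT * FROM sqlite_master) AS t
```

The resulting string is what would go into the "dbtable" entry of the options map.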
Update
It seems that the JDBC driver only supports iterating over one result set at a time.
If you first materialize the list of table names using collect(), the following snippet should work:
val myTableNames = metaData.select("tbl_name").map(_.getString(0)).collect()
for (t <- myTableNames) {
println(t.toString)
val tableData = sqlContext.read.format("jdbc")
.options(
Map(
"url" -> "jdbc:sqlite:/x.db",
"dbtable" -> t)).load()
tableData.show()
}
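The first option, reading the table names over a plain database connection before involving Spark, could look like the following Python sketch. It uses the standard-library `sqlite3` module rather than JDBC, which assumes the database file is reachable from the driver. The helper names and the two example tables are made up for illustration.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables recorded in sqlite_master."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in rows]


def jdbc_options(db_path, table):
    """Build the option map for one JDBC read of a single table."""
    return {"url": f"jdbc:sqlite:{db_path}", "dbtable": table}


# Demo on an in-memory database with two made-up tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE alpha (id INTEGER)")
conn.execute("CREATE TABLE beta (name TEXT)")

tables = list_tables(conn)
options = [jdbc_options("/path/to/file.db", t) for t in tables]
conn.close()
print(tables)  # ['alpha', 'beta']
```

Because the names are plain Python strings on the driver, each option map can then be handed to a separate JDBC read, for example `spark.read.format("jdbc").options(**opts).load()` in PySpark.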