Delete azure sql database rows from azure databricks - pyspark

I have a table in Azure SQL database from which I want to either delete selected rows based on some criteria or entire table from Azure Databricks. Currently I am using the truncate property of JDBC to truncate the entire table without dropping it and then re-write it with new dataframe.
df.write \
.option('user', jdbcUsername) \
.option('password', jdbcPassword) \
.jdbc('<connection_string>', '<table_name>', mode = 'overwrite', properties = {'truncate' : 'true'} )
But going forward I don't want to truncate and overwrite the entire table every time but rather use delete command. I was not able to achieve this using pushdown query either. Any help on this would be greatly appreciated.

You can also drop down to scala to do this, as the SQL Server JDBC driver is already installed. EG:
%scala
import java.util.Properties
import java.sql.DriverManager
val jdbcUsername = "xxxxx"
val jdbcPassword = "xxxxxx"
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:sqlserver://xxxxxx.database.windows.net:1433;database=AdventureWorks;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
// Create a Properties() object to hold the parameters.
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
connectionProperties.setProperty("Driver", driverClass)
val connection = DriverManager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
val stmt = connection.createStatement()
val sql = "delete from sometable where someColumn > 4"
stmt.execute(sql)
connection.close()

Use pyodbc to execute a SQL Statement.
import pyodbc
conn = pyodbc.connect( 'DRIVER={ODBC Driver 17 for SQL Server};'
'SERVER=mydatabe.database.azure.net;'
'DATABASE=AdventureWorks;UID=jonnyFast;'
'PWD=MyPassword')
conn.execute('DELETE TableBlah WHERE 1=2')
It's a bit of a pain to get pyodbc working on Databricks - see details here: https://datathirst.net/blog/2018/10/12/executing-sql-server-stored-procedures-on-databricks-pyspark

Related

Push down DML commands to SQL using Pyspark on Databricks

I'm using Azure's Databricks and want to pushdown a query to a Azure SQL using PySpark. I've tried many ways and found a solution using Scala (code below), but doing this I need to convert part of my code to scala then bring back to PySpark again.
%scala
import java.util.Properties
import java.sql.DriverManager
val jdbcUsername = username
val jdbcPassword = password
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = "entire-string-connection-to-Azure-SQL"
// Create a Properties() object to hold the parameters.
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
connectionProperties.setProperty("Driver", driverClass)
val connection = DriverManager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
val stmt = connection.createStatement()
val sql = "TRUNCATE TABLE dbo.table"
stmt.execute(sql)
connection.close()
Is there a way to achieve the pushdown of a DML code using PySpark instead of Scala language?
Found something related but only works to read data and DDL commands:
jdbcUrl = "jdbc:mysql://{0}:{1}/{2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
connectionProperties = {
"user" : jdbcUsername,
"password" : jdbcPassword,
"driver" : "com.mysql.jdbc.Driver"
}
pushdown_query = "(select * from employees where emp_no < 10008) emp_alias"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
You can actually achieve the same thing as the Scala example you provided in Python.
driver_manager = spark._sc._gateway.jvm.java.sql.DriverManager
connection = driver_manager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
query = "YOUR SQL QUERY"
exec_statement = connection.prepareCall(query)
exec_statement.execute()
exec_statement.close()
connection.close()
For your case I would try
driver_manager = spark._sc._gateway.jvm.java.sql.DriverManager
connection = driver_manager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
stmt = connection.createStatement()
sql = "TRUNCATE TABLE dbo.table"
stmt.execute(sql)
connection.close()

write to a JDBC source in scala

I am trying to write classic sql query using scala to insert some information into a sql server database table.
The connection to my database works perfectly and I succeed to read data from JDBC, from a table recently created called "textspark" which has only 1 column called "firstname" create table textspark(firstname varchar(10)).
However, when I try to write data into the table , I get the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: textspark
this is my code:
//Step 1: Check that the JDBC driver is available
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")
//Step 2: Create the JDBC URL
val jdbcHostname = "localhost"
val jdbcPort = 1433
val jdbcDatabase ="mydatabase"
val jdbcUsername = "mylogin"
val jdbcPassword = "mypwd"
// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase}"
// Create a Properties() object to hold the parameters.
import java.util.Properties
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
//Step 3: Check connectivity to the SQLServer database
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
connectionProperties.setProperty("Driver", driverClass)
//Read data from JDBC
val textspark_table = spark.read.jdbc(jdbcUrl, "textspark", connectionProperties)
textspark_table.show()
//the read operation works perfectly!!
//Write data to JDBC
import org.apache.spark.sql.SaveMode
spark.sql("insert into textspark values('test') ")
.write
.mode(SaveMode.Append) // <--- Append to the existing table
.jdbc(jdbcUrl, "textspark", connectionProperties)
//the write operation generates error!!
Can anyone help me please to fix this error?
You don't use insert statement in Spark. You specified the append mode what is ok. You shouldn't insert data, you should select / create it. Try something like this:
spark.sql("select 'text'")
.write
.mode(SaveMode.Append)
.jdbc(jdbcUrl, "textspark", connectionProperties)
or
Seq("test").toDS
.write
.mode(SaveMode.Append)
.jdbc(jdbcUrl, "textspark", connectionProperties)

Table not found error while loading DataFrame into a Hive partition

I am trying to insert data into Hive table like this:
val partfile = sc.textFile("partfile")
val partdata = partfile.map(p => p.split(","))
val partSchema = StructType(Array(StructField("id",IntegerType,true),StructField("name",StringType,true),StructField("salary",IntegerType,true),StructField("dept",StringType,true),StructField("location",StringType,true)))
val partRDD = partdata.map(p => Row(p(0).toInt,p(1),p(2).toInt,p(3),p(4)))
val partDF = sqlContext.createDataFrame(partRDD, partSchema)
Packages I imported:
import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType,StructField,StringType,IntegerType}
import org.apache.spark.sql.types._
This is how I tried to insert the dataframe into Hive partition:
partDF.write.mode(saveMode.Append).partitionBy("location").insertInto("parttab")
Im getting the below error even though I have the Hive Table:
org.apache.spark.sql.AnalysisException: Table not found: parttab;
Could anyone tell me what is the mistake I am doing here and how can I correct it ?
To write data to Hive warehouse, you need to initialize hiveContext instance.
Upon doing that, it will take confs from Hive-Site.xml (from classpath); and connects to underlying Hive warehouse.
HiveContext is an extension to SQLContext to support and connect to hive.
To do so, try this::
val hc = new HiveContext(sc)
And perform your append-query onn this instance.
partDF.registerAsTempTable("temp")
hc.sql(".... <normal sql query to pick data from table `temp`; and insert in to Hive table > ....")
Please make sure that the table parttab is under db - default.
If the table in under another db, table name should be specified as : <db-name>.parttab
If you need to directly save the dataframe in to hive; use this:
df.saveAsTable("<db-name>.parttab")

How To Read .MDB files in scala

I have task to convert .mdb files to .csv files. with the help of the code below I am able to read only one table file from .mdb file. I am not able to read if .mdb files contains more than one table and I want to store all the files individually. Kindly help me on this.
object mdbfiles {
Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession.builder().appName("Positional File Reading").master("local[*]").getOrCreate()
val sc = spark.sparkContext // Just used to create test RDDs
def main(args: Array[String]): Unit = {
val inputfilepath = "C:/Users/phadpa01/Desktop/InputFiles/sample.mdb"
val outputfilepath ="C:/Users/phadpa01/Desktop/sample_mdb_output"
val db = DatabaseBuilder.open(new File(inputfilepath))
try {
val table = db.getTable("table1");
for ( row <- table) {
//System.out.println(row)
val opresult = row.values()
}
}
}
}
Your problem is that you are calling only one table to be read in with this bit of code
val table = db.getTable("table1");
You should get a list of available tables in the db and then loop over them.
val tableNames = db.getTableNames
Then you can iterate over the tableNames. That should solve the issue for you in reading in more than one table. You may need to update the rest of the code to get it how you want it though.
You should really find a JDBC driver that works with MS Access rather than manually trying to parse the file yourself.
For example UCanAccess
Then, it's a simple SparkSQL command, and you have a DataFrame
val jdbcDF = spark.read
.format("jdbc")
.option("url", "jdbc:ucanaccess://c:/Users/phadpa01/Desktop/InputFiles/sample.mdb;memory=false")
.option("dbtable", "table1")
.load()
And one line to a CSV
jdbcDF.write.format("csv").save("table1.csv")
Don't forget to insert UcanAccess jars into context:
ucanaccess-4.0.2.jar,jackcess-2.1.6.jar,hsqldb.jar
Alernative solution
Run a terminal command
http://ucanaccess.sourceforge.net/site.html#clients

Writing DataFrame to MemSQL Table in Spark

Im trying to load a .parquet file into a MemSQL Database with Spark and MemSQL Connector.
package com.memsql.spark
import com.memsql.spark.context._
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import com.memsql.spark.connector._
import com.mysql.jdbc._
object readParquet {
def main(args: Array[String]){
val conf = new SparkConf().setAppName("ReadParquet")
val sc = new SparkContext(conf)
sc.addJar("/data/applications/spark-1.5.1-bin-hadoop2.6/lib/mysql-connector-java-5.1.37-bin.jar")
sc.addJar("/data/applications/spark-1.5.1-bin-hadoop2.6/lib/memsql-connector_2.10-1.1.0.jar")
Class.forName("com.mysql.jdbc.Driver")
val host = "xxxx"
val port = 3306
val dbName = "WP1"
val user = "root"
val password = ""
val tableName = "rt_acc"
val memsqlContext = new com.memsql.spark.context.MemSQLContext(sc, host, port, user, password)
val rt_acc = memsqlContext.read.parquet("tachyon://localhost:19998/rt_acc.parquet")
val func_rt_acc = new com.memsql.spark.connector.DataFrameFunctions(rt_acc)
func_rt_acc.saveToMemSQL(dbName, tableName, host, port, user, password)
}
}
I'm fairly certain that Tachyon is not causing the problem, as the same exceptions occur if loaded from disk and i can use sql-queries on the dataframe.
I've seen people suggest df.saveToMemSQL(..) however it seems this method is in DataFrameFunctions now.
Also the table doesnt exist yet but saveToMemSQL should do CREATE TABLE as documentation and source code tell me.
Edit: Ok i guess i misread something. saveToMemSQL doesn't create the table. Thanks.
Try using createMemSQLTableAs instead of saveToMemSQL.
saveToMemSQL loads a dataframe into an existing table, where as createMemSQLTableAs creates the table and then loads it.
It also returns a handy dataframe wrapping that MemSQL table :).