Writing DataFrame to MemSQL Table in Spark - scala

I'm trying to load a .parquet file into a MemSQL database with Spark and the MemSQL connector.
package com.memsql.spark

import com.memsql.spark.context._
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import com.memsql.spark.connector._
import com.mysql.jdbc._

object readParquet {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("ReadParquet")
    val sc = new SparkContext(conf)
    sc.addJar("/data/applications/spark-1.5.1-bin-hadoop2.6/lib/mysql-connector-java-5.1.37-bin.jar")
    sc.addJar("/data/applications/spark-1.5.1-bin-hadoop2.6/lib/memsql-connector_2.10-1.1.0.jar")
    Class.forName("com.mysql.jdbc.Driver")

    val host = "xxxx"
    val port = 3306
    val dbName = "WP1"
    val user = "root"
    val password = ""
    val tableName = "rt_acc"

    val memsqlContext = new com.memsql.spark.context.MemSQLContext(sc, host, port, user, password)
    val rt_acc = memsqlContext.read.parquet("tachyon://localhost:19998/rt_acc.parquet")
    val func_rt_acc = new com.memsql.spark.connector.DataFrameFunctions(rt_acc)
    func_rt_acc.saveToMemSQL(dbName, tableName, host, port, user, password)
  }
}
I'm fairly certain that Tachyon is not causing the problem, as the same exceptions occur when the file is loaded from disk, and I can run SQL queries on the DataFrame.
I've seen people suggest df.saveToMemSQL(...), but it seems this method lives in DataFrameFunctions now.
Also, the table doesn't exist yet, but saveToMemSQL should issue CREATE TABLE, as the documentation and source code tell me.
Edit: OK, I guess I misread something. saveToMemSQL doesn't create the table. Thanks.

Try using createMemSQLTableAs instead of saveToMemSQL.
saveToMemSQL loads a DataFrame into an existing table, whereas createMemSQLTableAs creates the table and then loads it.
It also returns a handy dataframe wrapping that MemSQL table :).
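Applied to the code in the question, a minimal sketch might look like the following, assuming createMemSQLTableAs in connector 1.1.0 takes the same connection arguments as saveToMemSQL (check the exact signature in your version):
// Sketch: create the MemSQL table from the DataFrame instead of inserting into
// an existing one. The argument order is assumed to mirror saveToMemSQL.
val func_rt_acc = new com.memsql.spark.connector.DataFrameFunctions(rt_acc)
val memsqlDf = func_rt_acc.createMemSQLTableAs(dbName, tableName, host, port, user, password)
// memsqlDf is a DataFrame backed by the newly created MemSQL table
memsqlDf.printSchema()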

Related

Delete Azure SQL database rows from Azure Databricks

I have a table in an Azure SQL database from which I want to delete either selected rows based on some criteria or the entire table, from Azure Databricks. Currently I am using the truncate option of the JDBC writer to truncate the entire table without dropping it, and then re-write it with a new DataFrame.
df.write \
  .option('user', jdbcUsername) \
  .option('password', jdbcPassword) \
  .jdbc('<connection_string>', '<table_name>', mode='overwrite', properties={'truncate': 'true'})
But going forward I don't want to truncate and overwrite the entire table every time; I would rather use a DELETE command. I was not able to achieve this with a pushdown query either. Any help on this would be greatly appreciated.
You can also drop down to Scala to do this, as the SQL Server JDBC driver is already installed. E.g.:
%scala
import java.util.Properties
import java.sql.DriverManager

val jdbcUsername = "xxxxx"
val jdbcPassword = "xxxxxx"
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"

// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:sqlserver://xxxxxx.database.windows.net:1433;database=AdventureWorks;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"

// Create a Properties() object to hold the parameters.
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
connectionProperties.setProperty("Driver", driverClass)

val connection = DriverManager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
val stmt = connection.createStatement()
val sql = "delete from sometable where someColumn > 4"
stmt.execute(sql)
connection.close()
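If the delete criteria come from variables, a parameterized variant of the same idea avoids splicing values into the SQL string. This is a sketch reusing the jdbcUrl and credentials defined above; sometable and someColumn are the placeholder names from the example:
// Sketch: parameterized DELETE over the same JDBC connection details as above.
val conn = DriverManager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
val ps = conn.prepareStatement("delete from sometable where someColumn > ?")
ps.setInt(1, 4) // bind the criterion instead of concatenating it into the SQL text
val rowsDeleted = ps.executeUpdate()
println(s"Deleted $rowsDeleted rows")
ps.close()
conn.close()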
Use pyodbc to execute a SQL Statement.
import pyodbc

conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};'
                      'SERVER=mydatabe.database.azure.net;'
                      'DATABASE=AdventureWorks;UID=jonnyFast;'
                      'PWD=MyPassword')
conn.execute('DELETE TableBlah WHERE 1=2')
It's a bit of a pain to get pyodbc working on Databricks - see details here: https://datathirst.net/blog/2018/10/12/executing-sql-server-stored-procedures-on-databricks-pyspark

JDBC dialect when writing Spark to Hive

I am trying to write a DataFrame from Spark 2.2.0 to Hive 2.1 through a JDBC connection. I know this is not the recommended way and that a direct connection using hive-site.xml should be configured, but that is not an option for me currently due to factors outside of my control... so I'm stuck with JDBC at the moment.
I can read from Hive using JDBC, but I have to override the JDBC dialect's quoteIdentifier method and specify the fetchsize to actually see any output from the DataFrame in Spark.
Although inconvenient, this is okay with me for now. However, I am now having trouble writing back to Hive. I think I need to make additional changes to the JDBC dialect to write back, as I'm getting this error message:
Caused by: org.apache.hive.service.cli.HiveSQLException: Error while
compiling statement: FAILED: ParseException line 1:28 cannot recognize input
near '.' 'name' 'TEXT' in column type
This is my process for reading via JDBC:
1.) Created a db-properties.flat file with my username, password, URL, and driver:
url=jdbc:hive2://xxxx.com:10000/default
driver=org.apache.hive.jdbc.HiveDriver
user=xxxxxx
password=xxxxxx
2.) Open the Spark shell and run the code below to read the table:
import java.io.File
import java.util.Properties
import java.io.FileInputStream
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, DataFrame}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.jdbc.{JdbcDialects, JdbcType, JdbcDialect}
val conf = new SparkConf()
val sqlContext = new HiveContext(sc)
val dbProperties = new Properties
dbProperties.load(new FileInputStream(new File("/home/xxxxxx/db-properties.flat")))
val url = dbProperties.getProperty("url")
val jdbcDriver = dbProperties.getProperty("driver")
val jdbcFetchsize = dbProperties.setProperty("fetchsize","10")
/* Update JDBC dialect */
val HiveDialect = new JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2") || url.contains("hive2")
  override def quoteIdentifier(colName: String): String = { s"$colName" }
}
JdbcDialects.registerDialect(HiveDialect)
val myTable = "xxxxxx"
val df = spark.read.jdbc(url,myTable,dbProperties)
df.show()
3.) I am able to read the data without issues in step 2, but I cannot write to Hive from Spark using JDBC. Below is the code:
df.write.mode("error").jdbc(url,"newtable",dbProperties)
...which results in this error:
Caused by: org.apache.hive.service.cli.HiveSQLException: Error while
compiling statement: FAILED: ParseException line 1:28 cannot recognize input
near '.' 'name' 'TEXT' in column type
Does anyone have recommendations as to how I can modify the dialect to write back to Hive from Spark JDBC or any other suggestions? Thank you!
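No answer is recorded for this question, but one direction worth exploring (a sketch only, not a verified fix) is to extend the same custom dialect with a getJDBCType override: Spark's default type mapping emits column types such as TEXT in the generated CREATE TABLE, which Hive's DDL parser rejects, and STRING is the usual Hive equivalent. The mapping below only covers StringType and is an assumption about what the failing column needs:
// Sketch: redefine the dialect from step 2 with a write-side type mapping for Hive.
// If the earlier dialect is still registered, unregister it first with
// JdbcDialects.unregisterDialect(...) so two dialects don't match the same URL.
import org.apache.spark.sql.types._

val HiveDialect = new JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2") || url.contains("hive2")
  override def quoteIdentifier(colName: String): String = { s"$colName" }
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("STRING", java.sql.Types.VARCHAR)) // instead of TEXT
    case _          => None // fall back to Spark's default mappings
  }
}
JdbcDialects.registerDialect(HiveDialect)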

Table not found error while loading DataFrame into a Hive partition

I am trying to insert data into a Hive table like this:
val partfile = sc.textFile("partfile")
val partdata = partfile.map(p => p.split(","))
val partSchema = StructType(Array(StructField("id",IntegerType,true),StructField("name",StringType,true),StructField("salary",IntegerType,true),StructField("dept",StringType,true),StructField("location",StringType,true)))
val partRDD = partdata.map(p => Row(p(0).toInt,p(1),p(2).toInt,p(3),p(4)))
val partDF = sqlContext.createDataFrame(partRDD, partSchema)
Packages I imported:
import org.apache.spark.sql.{Row, SaveMode}
import org.apache.spark.sql.types.{StructType,StructField,StringType,IntegerType}
import org.apache.spark.sql.types._
This is how I tried to insert the dataframe into Hive partition:
partDF.write.mode(SaveMode.Append).partitionBy("location").insertInto("parttab")
I'm getting the below error even though I have the Hive table:
org.apache.spark.sql.AnalysisException: Table not found: parttab;
Could anyone tell me what mistake I am making here and how I can correct it?
To write data to the Hive warehouse, you need to initialize a HiveContext instance.
It will then pick up the configuration from hive-site.xml (on the classpath) and connect to the underlying Hive warehouse.
HiveContext is an extension of SQLContext that adds support for connecting to Hive.
To do so, try this:
val hc = new HiveContext(sc)
Then perform your append query on this instance:
partDF.registerTempTable("temp")
hc.sql(".... <normal sql query to pick data from table `temp`; and insert in to Hive table > ....")
Please make sure that the table parttab is under the default database.
If the table is under another database, the table name should be specified as <db-name>.parttab.
If you need to save the DataFrame directly into Hive, use this:
df.write.saveAsTable("<db-name>.parttab")
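Putting it together for this question, a sketch of the full flow might look like the following, assuming the Hive table parttab is partitioned by location, has the remaining columns (id, name, salary, dept), and allows dynamic partition inserts:
import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)
// Allow dynamic partition inserts, since the partition value comes from the data itself.
hc.setConf("hive.exec.dynamic.partition", "true")
hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

// Recreate the DataFrame against the HiveContext so the temp table is visible to hc.sql.
val partHiveDF = hc.createDataFrame(partRDD, partSchema)
partHiveDF.registerTempTable("temp")

// The partition column must be the last column in the SELECT.
hc.sql("INSERT INTO TABLE parttab PARTITION (location) " +
  "SELECT id, name, salary, dept, location FROM temp")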

How to read .mdb files in Scala

I have a task to convert .mdb files to .csv files. With the code below I am able to read only one table from an .mdb file. I am not able to handle an .mdb file that contains more than one table, and I want to store each table as its own file. Kindly help me with this.
import java.io.File

import com.healthmarketscience.jackcess.DatabaseBuilder
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

import scala.collection.JavaConversions._ // lets us iterate over Jackcess tables as Scala collections

object mdbfiles {
  Logger.getLogger("org").setLevel(Level.ERROR)
  val spark = SparkSession.builder().appName("Positional File Reading").master("local[*]").getOrCreate()
  val sc = spark.sparkContext // Just used to create test RDDs

  def main(args: Array[String]): Unit = {
    val inputfilepath = "C:/Users/phadpa01/Desktop/InputFiles/sample.mdb"
    val outputfilepath = "C:/Users/phadpa01/Desktop/sample_mdb_output"
    val db = DatabaseBuilder.open(new File(inputfilepath))
    try {
      val table = db.getTable("table1")
      for (row <- table) {
        //System.out.println(row)
        val opresult = row.values()
      }
    } finally {
      db.close()
    }
  }
}
Your problem is that you are reading in only one table with this bit of code:
val table = db.getTable("table1");
You should get a list of available tables in the db and then loop over them.
val tableNames = db.getTableNames
Then you can iterate over tableNames. That should solve the issue of reading more than one table. You may need to update the rest of the code to get it the way you want it, though.
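For example, a rough sketch of that loop, reusing db and outputfilepath from the question. Quoting/escaping of CSV values and creation of the output directory are left out, and the row/column accessors are the standard Jackcess ones:
import java.io.{File, PrintWriter}
import scala.collection.JavaConversions._

// Sketch: write every table in the .mdb file to its own CSV file.
for (tableName <- db.getTableNames) {
  val table = db.getTable(tableName)
  val writer = new PrintWriter(new File(s"$outputfilepath/$tableName.csv"))
  try {
    // header row built from the table's column names
    writer.println(table.getColumns.map(_.getName).mkString(","))
    // one CSV line per row
    for (row <- table) {
      writer.println(row.values().map(v => String.valueOf(v)).mkString(","))
    }
  } finally {
    writer.close()
  }
}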
You should really find a JDBC driver that works with MS Access rather than manually trying to parse the file yourself.
For example UCanAccess
Then, it's a simple SparkSQL command, and you have a DataFrame
val jdbcDF = spark.read
.format("jdbc")
.option("url", "jdbc:ucanaccess://c:/Users/phadpa01/Desktop/InputFiles/sample.mdb;memory=false")
.option("dbtable", "table1")
.load()
And one line to a CSV
jdbcDF.write.format("csv").save("table1.csv")
Don't forget to add the UCanAccess jars to the classpath:
ucanaccess-4.0.2.jar,jackcess-2.1.6.jar,hsqldb.jar
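To cover the multi-table requirement with this approach as well, one option (a sketch; it reuses the same connection URL and the spark session from the question) is to list the table names through standard JDBC metadata and then load and export each one:
import java.sql.DriverManager

// Sketch: enumerate the Access tables via JDBC metadata, then export each one to CSV.
val accessUrl = "jdbc:ucanaccess://c:/Users/phadpa01/Desktop/InputFiles/sample.mdb;memory=false"
val metaConn = DriverManager.getConnection(accessUrl)
val rs = metaConn.getMetaData.getTables(null, null, "%", Array("TABLE"))
val tableNames = scala.collection.mutable.ListBuffer[String]()
while (rs.next()) {
  tableNames += rs.getString("TABLE_NAME")
}
metaConn.close()

for (t <- tableNames) {
  val jdbcDF = spark.read
    .format("jdbc")
    .option("url", accessUrl)
    .option("dbtable", t)
    .load()
  jdbcDF.write.format("csv").save(s"$t.csv")
}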
Alternative solution
Run a terminal command, as described here:
http://ucanaccess.sourceforge.net/site.html#clients

JDBC-HiveServer:'client_protocol is unset!'-Both 1.1.1 in CS

Before asking this question, I had already read many articles through Google. Many answers say it is a version mismatch between the client side and the server side. So I decided to copy the jars from the server side to the client side directly, and the result is... as you know, the same exception:
org.apache.thrift.TApplicationException: Required field 'client_protocol' is unset! Struct:TOpenSessionReq(client_protocol:null, configuration:{use:database=default})
It works fine when I connect to hiveserver2 through beeline :)
See my connection.
So I thought it would work when I use JDBC too. But, unfortunately, it throws that exception. Below are the jars in my project:
hive-jdbc-1.1.1.jar
hive-jdbc-standalone.jar
hive-metastore-1.1.1.jar
hive-service-1.1.1.jar
Those Hive jars are copied from the server side.
def connect_hive(master: String) {
  val conf = new SparkConf()
    .setMaster(master)
    .setAppName("Hive")
    .set("spark.local.dir", "./tmp")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)
  val url = "jdbc:hive2://192.168.40.138:10000"
  val prop = new Properties()
  prop.setProperty("user", "hive")
  prop.setProperty("password", "hive")
  prop.setProperty("driver", "org.apache.hive.jdbc.HiveDriver")
  val conn = DriverManager.getConnection(url, prop)
  sc.stop()
}
The configuration of my server:
hadoop 2.7.3
spark 1.6.0
hive 1.1.1
Has anyone encountered the same situation when connecting to Hive through Spark JDBC?
Since beeline works, your program should be expected to execute correctly as well.
Print the current project classpath; you can try something like this to see it for yourself:
import java.net.URL
import java.net.URLClassLoader
import scala.collection.JavaConversions._

object App {
  def main(args: Array[String]) {
    val cl = ClassLoader.getSystemClassLoader
    val urls = cl.asInstanceOf[URLClassLoader].getURLs
    for (url <- urls) {
      println(url.getFile)
    }
  }
}
Also check hive.aux.jars.path=<file urls> to understand what jars are present in the classpath.
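Another quick check (not from the original answer, just a sketch): print which jar the Hive JDBC driver class is actually loaded from, to confirm the client really uses the jars copied from the server:
// Sketch: locate the jar that provides the Hive JDBC driver at runtime.
// Note: getCodeSource can be null for classes loaded by the bootstrap classloader.
val driverClass = Class.forName("org.apache.hive.jdbc.HiveDriver")
val jarLocation = driverClass.getProtectionDomain.getCodeSource.getLocation
println(s"HiveDriver loaded from: $jarLocation")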