I am trying to insert data into a Hive table like this:
val partfile = sc.textFile("partfile")
val partdata = partfile.map(p => p.split(","))
val partSchema = StructType(Array(StructField("id",IntegerType,true),StructField("name",StringType,true),StructField("salary",IntegerType,true),StructField("dept",StringType,true),StructField("location",StringType,true)))
val partRDD = partdata.map(p => Row(p(0).toInt,p(1),p(2).toInt,p(3),p(4)))
val partDF = sqlContext.createDataFrame(partRDD, partSchema)
Packages I imported:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType,StructField,StringType,IntegerType}
import org.apache.spark.sql.types._
This is how I tried to insert the dataframe into the partitioned Hive table:
partDF.write.mode(SaveMode.Append).partitionBy("location").insertInto("parttab")
I'm getting the error below even though the Hive table exists:
org.apache.spark.sql.AnalysisException: Table not found: parttab;
Could anyone tell me what mistake I am making here and how I can correct it?
To write data to the Hive warehouse, you need to initialize a HiveContext instance.
Once you do that, it will pick up the configuration from hive-site.xml (on the classpath) and connect to the underlying Hive warehouse.
HiveContext is an extension of SQLContext that adds support for connecting to Hive.
To do so, try this:
val hc = new HiveContext(sc)
Then perform your append query on this instance:
partDF.registerTempTable("temp")
hc.sql(".... <normal sql query to pick data from table `temp`; and insert in to Hive table > ....")
Please make sure that the table parttab is in the default database.
If the table is in another database, the table name should be specified as <db-name>.parttab.
If you need to save the dataframe directly to Hive, use this:
df.write.saveAsTable("<db-name>.parttab")
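Tying this back to the original example, here is a minimal sketch of that direct route (assuming Spark 1.6-style APIs, since the question uses sqlContext and RDDs; note that appending with saveAsTable to a table created outside Spark can still fail depending on the table's format, in which case fall back to the temp-table route above):
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)
// Build the DataFrame through the HiveContext so table names resolve against the Hive metastore.
val partDF = hc.createDataFrame(partRDD, partSchema)
partDF.write.mode(SaveMode.Append).partitionBy("location").saveAsTable("parttab")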
Related
I have a table in an Azure SQL database from which I want to delete either selected rows based on some criteria or the entire table, from Azure Databricks. Currently I am using the truncate property of JDBC to truncate the entire table without dropping it and then re-writing it with a new dataframe.
df.write \
.option('user', jdbcUsername) \
.option('password', jdbcPassword) \
.jdbc('<connection_string>', '<table_name>', mode = 'overwrite', properties = {'truncate' : 'true'} )
But going forward I don't want to truncate and overwrite the entire table every time, but rather use a delete command. I was not able to achieve this using a pushdown query either. Any help on this would be greatly appreciated.
You can also drop down to Scala to do this, as the SQL Server JDBC driver is already installed. For example:
%scala
import java.util.Properties
import java.sql.DriverManager
val jdbcUsername = "xxxxx"
val jdbcPassword = "xxxxxx"
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:sqlserver://xxxxxx.database.windows.net:1433;database=AdventureWorks;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"
// Create a Properties() object to hold the parameters.
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
connectionProperties.setProperty("Driver", driverClass)
val connection = DriverManager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
val stmt = connection.createStatement()
val sql = "delete from sometable where someColumn > 4"
stmt.execute(sql)
connection.close()
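If the threshold comes from a variable, a parameterized variant of the same delete (just a sketch, reusing the jdbcUrl and credentials defined above) avoids building the SQL string by concatenation:
val connection = DriverManager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
try {
  // Bind the value instead of embedding it in the SQL text.
  val ps = connection.prepareStatement("delete from sometable where someColumn > ?")
  ps.setInt(1, 4)
  val rowsDeleted = ps.executeUpdate()
  println(s"Deleted $rowsDeleted rows")
} finally {
  connection.close()
}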
Use pyodbc to execute a SQL Statement.
import pyodbc
conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};'
                      'SERVER=mydatabe.database.azure.net;'
                      'DATABASE=AdventureWorks;UID=jonnyFast;'
                      'PWD=MyPassword')
conn.execute('DELETE TableBlah WHERE 1=2')
conn.commit()  # pyodbc autocommit is off by default, so commit the delete
It's a bit of a pain to get pyodbc working on Databricks - see details here: https://datathirst.net/blog/2018/10/12/executing-sql-server-stored-procedures-on-databricks-pyspark
I am trying to write a classic SQL query using Scala to insert some information into a SQL Server database table.
The connection to my database works perfectly, and I can read data over JDBC from a recently created table called "textspark", which has only one column, "firstname" (create table textspark(firstname varchar(10))).
However, when I try to write data into the table, I get the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: textspark
This is my code:
//Step 1: Check that the JDBC driver is available
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")
//Step 2: Create the JDBC URL
val jdbcHostname = "localhost"
val jdbcPort = 1433
val jdbcDatabase ="mydatabase"
val jdbcUsername = "mylogin"
val jdbcPassword = "mypwd"
// Create the JDBC URL without passing in the user and password parameters.
val jdbcUrl = s"jdbc:sqlserver://${jdbcHostname}:${jdbcPort};database=${jdbcDatabase}"
// Create a Properties() object to hold the parameters.
import java.util.Properties
val connectionProperties = new Properties()
connectionProperties.put("user", s"${jdbcUsername}")
connectionProperties.put("password", s"${jdbcPassword}")
//Step 3: Check connectivity to the SQLServer database
val driverClass = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
connectionProperties.setProperty("Driver", driverClass)
//Read data from JDBC
val textspark_table = spark.read.jdbc(jdbcUrl, "textspark", connectionProperties)
textspark_table.show()
//the read operation works perfectly!!
//Write data to JDBC
import org.apache.spark.sql.SaveMode
spark.sql("insert into textspark values('test') ")
.write
.mode(SaveMode.Append) // <--- Append to the existing table
.jdbc(jdbcUrl, "textspark", connectionProperties)
//the write operation generates error!!
Can anyone help me please to fix this error?
You don't use an INSERT statement like that in Spark. The spark.sql("insert into textspark ...") call fails because textspark only exists in SQL Server, not in Spark's catalog. You specified the append mode, which is fine, but instead of inserting the data yourself you should select or create it as a DataFrame (with a column name matching the target column, firstname) and let the JDBC writer do the insert. Try something like this:
spark.sql("select 'text'")
.write
.mode(SaveMode.Append)
.jdbc(jdbcUrl, "textspark", connectionProperties)
or
Seq("test").toDS
.write
.mode(SaveMode.Append)
.jdbc(jdbcUrl, "textspark", connectionProperties)
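If you really do want to run a literal INSERT statement rather than going through Spark's JDBC writer, you can drop to plain JDBC, the same approach as the Databricks answer earlier (a sketch, reusing the jdbcUrl and credentials already defined):
import java.sql.DriverManager

val connection = DriverManager.getConnection(jdbcUrl, jdbcUsername, jdbcPassword)
try {
  // Run the insert directly against SQL Server.
  connection.createStatement().executeUpdate("insert into textspark values ('test')")
} finally {
  connection.close()
}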
I am trying to write a dataframe from Spark 2.2.0 to Hive 2.1 through a JDBC connection. I know this is not the recommended way and that a direct connection using hive-site.xml should be configured but that is not an option for me currently due to factors outside of my control...so I'm stuck with JDBC at the moment.
I can read from Hive using JDBC, but I have to override the JDBC dialect's quoteIdentifier method and specify the fetchsize to actually see any output from the dataframe in Spark.
Although inconvenient, this is okay with me for now. However, I am now having trouble writing back to Hive. I think I need to make additional changes to the JDBC dialect to write back, as I'm getting this error message:
Caused by: org.apache.hive.service.cli.HiveSQLException: Error while
compiling statement: FAILED: ParseException line 1:28 cannot recognize input
near '.' 'name' 'TEXT' in column type
This is my process for reading via JDBC:
1.) Created a db-properties.flat file with my username, password, url, and driver.
url=jdbc:hive2://xxxx.com:10000/default
driver=org.apache.hive.jdbc.HiveDriver
user=xxxxxx
password=xxxxxx
2.) Open the Spark shell and run the code below to read the table:
import java.io.File
import java.util.Properties
import java.io.FileInputStream
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, DataFrame}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.jdbc.{JdbcDialects, JdbcType, JdbcDialect}
val conf = new SparkConf()
val sqlContext = new HiveContext(sc)
val dbProperties = new Properties
dbProperties.load(new FileInputStream(new File("/home/xxxxxx/db-properties.flat")))
val url = dbProperties.getProperty("url")
val jdbcDriver = dbProperties.getProperty("driver")
val jdbcFetchsize = dbProperties.setProperty("fetchsize","10")
/* Update JDBC dialect */
val HiveDialect = new JdbcDialect {
override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2") || url.contains("hive2")
override def quoteIdentifier(colName: String): String ={ s"$colName" }
}
JdbcDialects.registerDialect(HiveDialect)
val myTable = "xxxxxx"
val df = spark.read.jdbc(url,myTable,dbProperties)
df.show()
3.) I am able to read the data without issues in step 2, but I cannot write to Hive from Spark using JDBC. Below is the code:
df.write.mode("error").jdbc(url,"newtable",dbProperties)
...which results in this error:
Caused by: org.apache.hive.service.cli.HiveSQLException: Error while
compiling statement: FAILED: ParseException line 1:28 cannot recognize input
near '.' 'name' 'TEXT' in column type
Does anyone have recommendations as to how I can modify the dialect to write back to Hive from Spark JDBC or any other suggestions? Thank you!
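I don't have a confirmed fix, but the ParseException suggests the CREATE TABLE statement Spark generates uses column types Hive's parser rejects (TEXT is Spark's default mapping for StringType). One direction worth trying is extending the dialect with a getJDBCType override; an untested sketch, where the type mappings are assumptions and this dialect would replace the read-only one from step 2:
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

val HiveWriteDialect = new JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2") || url.contains("hive2")
  // Hive accepts backtick-quoted identifiers.
  override def quoteIdentifier(colName: String): String = s"`$colName`"
  // Map Spark SQL types to Hive column types for the generated CREATE TABLE.
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("STRING", Types.VARCHAR))
    case _          => None
  }
}
JdbcDialects.registerDialect(HiveWriteDialect)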
I am new to Spark. Here is something I want to do.
I have created two data streams. The first one reads data from a text file and registers it as a temp table using a HiveContext. The other one continuously gets RDDs from Kafka, and for each RDD it creates a DataFrame and registers the contents as a temp table. Finally, I join these two temp tables on a key to get the final result set. I want to insert that result set into a Hive table, but I am out of ideas. I tried to follow some examples, but those only create a table with one column in Hive, and that is not readable either. Could you please show me how to insert the results into a particular database and table in Hive? Please note that I can see the results of the join using the show function, so the real challenge lies with the insertion into the Hive table.
Below is the code I am using.
imports.....
object MSCCDRFilter {
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("Flume, Kafka and Spark MSC CDRs Manipulation")
val sc = new SparkContext(sparkConf)
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
val cgiDF = sc.textFile("file:///tmp/omer-learning/spark/dim_cells.txt").map(_.split(",")).map(p => CGIList(p(0).trim, p(1).trim, p(2).trim,p(3).trim)).toDF()
cgiDF.registerTempTable("my_cgi_list")
val CGITable=sqlContext.sql("select *"+
" from my_cgi_list")
CGITable.show() // this CGITable is a structure I defined in the project
val streamingContext = new StreamingContext(sc, Seconds(10))
val zkQuorum="hadoopserver:2181"
val topics=Map[String, Int]("FlumeToKafka"->1)
val messages: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream(streamingContext,zkQuorum,"myGroup",topics)
val logLinesDStream = messages.map(_._2) // extract the message payload
logLinesDStream.print()
val MSCCDRDStream = logLinesDStream.map(MSC_KPI.parseLogLine) // change MSC_KPI to MCSCDR_GO if you wanna change the class
// MSCCDR_GO and MSC_KPI are structures defined in the project
MSCCDRDStream.foreachRDD(MSCCDR => {
println("+++++++++++++++++++++NEW RDD ="+ MSCCDR.count())
if (MSCCDR.count() == 0) {
println("==================No logs received in this time interval=================")
} else {
val dataf=sqlContext.createDataFrame(MSCCDR)
dataf.registerTempTable("hive_msc")
cgiDF.registerTempTable("my_cgi_list")
val sqlquery=sqlContext.sql("select a.cdr_type,a.CGI,a.cdr_time, a.mins_int, b.Lat, b.Long,b.SiteID from hive_msc a left join my_cgi_list b"
+" on a.CGI=b.CGI")
sqlquery.show()
sqlContext.sql("SET hive.exec.dynamic.partition = true;")
sqlContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict;")
sqlquery.write.mode("append").partitionBy("CGI").saveAsTable("omeralvi.msc_data")
val FilteredCDR = sqlContext.sql("select p.*, q.* " +
" from MSCCDRFiltered p left join my_cgi_list q " +
"on p.CGI=q.CGI ")
println("======================print result =================")
FilteredCDR.show()
}
})
streamingContext.start()
streamingContext.awaitTermination()
}
}
I have had some success writing to Hive, using the following:
dataFrame
.coalesce(n)
.write
.format("orc")
.options(Map("path" -> savePath))
.mode(SaveMode.Append)
.saveAsTable(fullTableName)
We didn't follow through with our attempts to use partitions, because I think there was some issue with our desired partitioning column.
The only limitation is with concurrent writes: if the table does not exist yet, any task that tries to create it (because it didn't exist when that task first attempted to write) will throw an exception.
Be aware that writing to Hive from streaming applications is usually bad design, as you will often write many small files, which are very inefficient to read and store. So if you write to Hive more often than every hour or so, you should make sure you include logic for compaction, or add an intermediate storage layer better suited to transactional data.
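For what it's worth, a compaction pass can be as simple as periodically rewriting the table into fewer files. A rough sketch (the table name matches the question, the file count is made up, and this assumes ingestion is paused while it runs):
import org.apache.spark.sql.SaveMode

// Read the small-file table back and collapse it into a handful of larger ORC files.
val compacted = sqlContext.table("omeralvi.msc_data").coalesce(8)
compacted.write
  .format("orc")
  .mode(SaveMode.Overwrite)
  .saveAsTable("omeralvi.msc_data_compacted")
// Then swap the compacted table in for the original (e.g. ALTER TABLE ... RENAME TO ...)
// once no readers or writers are active.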
I'm trying to load a .parquet file into a MemSQL database with Spark and the MemSQL Connector.
package com.memsql.spark
import com.memsql.spark.context._
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import com.memsql.spark.connector._
import com.mysql.jdbc._
object readParquet {
def main(args: Array[String]){
val conf = new SparkConf().setAppName("ReadParquet")
val sc = new SparkContext(conf)
sc.addJar("/data/applications/spark-1.5.1-bin-hadoop2.6/lib/mysql-connector-java-5.1.37-bin.jar")
sc.addJar("/data/applications/spark-1.5.1-bin-hadoop2.6/lib/memsql-connector_2.10-1.1.0.jar")
Class.forName("com.mysql.jdbc.Driver")
val host = "xxxx"
val port = 3306
val dbName = "WP1"
val user = "root"
val password = ""
val tableName = "rt_acc"
val memsqlContext = new com.memsql.spark.context.MemSQLContext(sc, host, port, user, password)
val rt_acc = memsqlContext.read.parquet("tachyon://localhost:19998/rt_acc.parquet")
val func_rt_acc = new com.memsql.spark.connector.DataFrameFunctions(rt_acc)
func_rt_acc.saveToMemSQL(dbName, tableName, host, port, user, password)
}
}
I'm fairly certain that Tachyon is not causing the problem, as the same exceptions occur if the file is loaded from disk, and I can use SQL queries on the dataframe.
I've seen people suggest df.saveToMemSQL(...); however, it seems this method is in DataFrameFunctions now.
Also, the table doesn't exist yet, but saveToMemSQL should do a CREATE TABLE, as the documentation and source code tell me.
Edit: OK, I guess I misread something. saveToMemSQL doesn't create the table. Thanks.
Try using createMemSQLTableAs instead of saveToMemSQL.
saveToMemSQL loads a dataframe into an existing table, whereas createMemSQLTableAs creates the table and then loads it.
It also returns a handy dataframe wrapping that MemSQL table :).