I am trying to write a dataframe from Spark 2.2.0 to Hive 2.1 through a JDBC connection. I know this is not the recommended approach, and that a direct connection using hive-site.xml should be configured, but that is not an option for me currently due to factors outside of my control, so I'm stuck with JDBC for the moment.
I can read from Hive using JDBC, but I have to override the JDBC dialect's quoteIdentifier method and specify the fetchsize to actually see any output from the dataframe in Spark.
Although inconvenient, this is okay with me for now. However, I am now having trouble writing back to Hive. I think I need to make additional changes to the JDBC dialect, because the write fails with the ParseException shown after my steps below.
This is my process for reading via JDBC:
1.) I created a db-properties.flat file with my username, password, URL, and driver:
url=jdbc:hive2://xxxx.com:10000/default
driver=org.apache.hive.jdbc.HiveDriver
user=xxxxxx
password=xxxxxx
2.) Open the Spark shell and run the code below to read the table:
import java.io.File
import java.util.Properties
import java.io.FileInputStream
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, DataFrame}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.jdbc.{JdbcDialects, JdbcType, JdbcDialect}
val conf = new SparkConf()
val sqlContext = new HiveContext(sc)
val dbProperties = new Properties
dbProperties.load(new FileInputStream(new File("/home/xxxxxx/db-properties.flat")))
val url = dbProperties.getProperty("url")
val jdbcDriver = dbProperties.getProperty("driver")
dbProperties.setProperty("fetchsize", "10")  // needed to see any output when reading from Hive over JDBC
/* Update JDBC dialect */
val HiveDialect = new JdbcDialect {
override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2") || url.contains("hive2")
override def quoteIdentifier(colName: String): String = colName
}
JdbcDialects.registerDialect(HiveDialect)
val myTable = "xxxxxx"
val df = spark.read.jdbc(url,myTable,dbProperties)
df.show()
3.) I am able to read the data without issues in step 2, but I cannot write to Hive from Spark using JDBC. Below is the code:
df.write.mode("error").jdbc(url,"newtable",dbProperties)
...which results in this error:
Caused by: org.apache.hive.service.cli.HiveSQLException: Error while
compiling statement: FAILED: ParseException line 1:28 cannot recognize input
near '.' 'name' 'TEXT' in column type
Does anyone have recommendations on how I can modify the dialect to write back to Hive from Spark over JDBC, or any other suggestions? Thank you!
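One direction that may be worth exploring (a sketch, not a verified fix): the CREATE TABLE statement Spark generates uses the dialect's getJDBCType mapping, and the default maps StringType to TEXT, which matches the 'TEXT' in the ParseException. The dotted column names ('.' 'name') also suggest the columns read back over JDBC are still prefixed with the source table name, so stripping that prefix may be needed too. Extending the dialect and renaming the columns could look like this:
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialects, JdbcType, JdbcDialect}
import org.apache.spark.sql.types._
// If the read-time HiveDialect is still registered, unregister it first so only one dialect matches:
// JdbcDialects.unregisterDialect(HiveDialect)
val HiveWriteDialect = new JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2") || url.contains("hive2")
  override def quoteIdentifier(colName: String): String = colName
  // Hand back type names Hive's DDL parser accepts; anything not listed falls through to Spark's defaults.
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType  => Some(JdbcType("STRING", Types.VARCHAR))
    case BooleanType => Some(JdbcType("BOOLEAN", Types.BOOLEAN))
    case _           => None
  }
}
JdbcDialects.registerDialect(HiveWriteDialect)
// Drop any "table." prefix from column names picked up during the JDBC read.
val cleaned = df.toDF(df.columns.map(_.split("\\.").last): _*)
cleaned.write.mode("error").jdbc(url, "newtable", dbProperties)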
Related
While using GeoMesa and Scala, I have been attempting to encode two columns in a Spark DataFrame using the snippets below, but I keep getting an issue where it appears that Scala cannot serialize the returned objects into a DataFrame. When using Postgres and PostGIS, life is easy. Is this an easy issue to fix, or is there a better library that can handle geospatial querying from a Spark DataFrame containing latitude and longitude in Double format?
The versions that I am using in my SBT are:
spark: 2.3.0
scala: 2.11.12
geomesa: 2.2.1
jst-*: 1.17.0-SNAPSHOT
The code below fails with:
Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for org.locationtech.jts.geom.Point
import org.apache.spark.sql.SparkSession
import org.locationtech.jts.geom.{Coordinate, GeometryFactory}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._
import org.locationtech.geomesa.spark.jts._
object GetRandomData {
def main(sysArgs: Array[String]) {
@transient val spark: SparkSession = {
SparkSession
.builder()
.config("spark.ui.enabled", "false")
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.config("spark.kryoserializer.buffer.mb","24")
.appName("GetRandomData")
.master("local[*]")
.getOrCreate()
}
val sc = spark.sparkContext
sc.setLogLevel("ERROR")
import spark.sqlContext.implicits._
var coordinates = sc.parallelize(
List(
(35.40466, -80.905458),
(35.344079, -80.872267),
(35.139606, -80.840845),
(35.537786, -80.780051),
(35.525361, -83.031932),
(34.928323, -80.766732),
(35.533865, -82.72344),
(35.50997, -80.588572),
(35.286251, -83.150514),
(35.558519, -81.067069),
(35.569311, -80.916993),
(35.835867, -81.067904),
(35.221695, -82.662141)
)
).
toDS().
toDF("geo_lat", "geo_lng")
coordinates = coordinates.select(coordinates.columns.map(c => col(c).cast(DoubleType)) : _*)
coordinates.show()
val testing = coordinates.map(r => new GeometryFactory().createPoint(new Coordinate(3.4, 5.6)))
val coordinatesPointDf = coordinates.withColumn("point", st_makePoint(col("geo_lat"), col("geo_lng")))
}
}
The exception is:
Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for org.locationtech.jts.geom.Point
- root class: "org.locationtech.jts.geom.Point"
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:643)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor$1.apply(ScalaReflection.scala:445)
at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824)
at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:445)
at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:434)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
at org.locationtech.geomesa.spark.jts.encoders.SpatialEncoders$class.jtsPointEncoder(SpatialEncoders.scala:21)
at org.locationtech.geomesa.spark.jts.package$.jtsPointEncoder(package.scala:17)
at GetRandomData$.main(Main.scala:50)
at GetRandomData.main(Main.scala)
If you aren't using an underlying GeoMesa store to load data into a Spark session, you'll need to explicitly register the JTS types with:
org.apache.spark.sql.SQLTypes.init(spark.sqlContext)
This will register the ST_ operations as well as the JTS encoders.
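For placement, a small sketch against the question's code, assuming the rest of main stays as posted:
// Right after the SparkSession is built and before any JTS columns are used:
org.apache.spark.sql.SQLTypes.init(spark.sqlContext)
// With the types registered, the st_makePoint call from the question should then resolve:
val coordinatesPointDf = coordinates.withColumn("point", st_makePoint(col("geo_lat"), col("geo_lng")))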
In plain English, the exception is saying:
I don't know how to convert a Point to a Spark type.
If you keep the latitude and longitude as doubles in your Dataset, then you should be fine, but as soon as you use an object like Point, you'll need to tell Spark how to convert it. In Spark terms, these are called Encoders, and you can create custom ones.
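For illustration, a minimal way to hand-roll such an encoder is to fall back on Kryo serialization; this is just a sketch (not the GeoMesa-recommended route above), and the resulting column is an opaque binary blob rather than a structured geometry type:
import org.apache.spark.sql.Encoders
import org.locationtech.jts.geom.{Coordinate, GeometryFactory, Point}
// Pass a Kryo-backed Encoder[Point] explicitly so Spark can hold JTS Points in a Dataset.
val testing = coordinates.map(_ =>
  new GeometryFactory().createPoint(new Coordinate(3.4, 5.6)))(Encoders.kryo[Point])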
Or you can switch to an RDD, where no conversion is necessary, as long as you don't mind losing the Spark SQL functionality.
I started using AWS Glue to read data using the Data Catalog and GlueContext and to transform it as per requirements.
val spark: SparkContext = new SparkContext()
val glueContext: GlueContext = new GlueContext(spark)
val sparkSession = glueContext.getSparkSession
// Data Catalog: database and table name
val dbName = "abcdb"
val tblName = "xyzdt_2017_12_05"
// S3 location for output
val outputDir = "s3://output/directory/abc"
// Read data into a DynamicFrame using the Data Catalog metadata
val stGBDyf = glueContext.getCatalogSource(database = dbName, tableName = tblName).getDynamicFrame()
val revisedDF = stGBDyf.toDf() // This line throws the error
While executing the above code I got the following error:
Error : Syntax Error: error: value toDf is not a member of
com.amazonaws.services.glue.DynamicFrame val revisedDF =
stGBDyf.toDf() one error found.
I followed this example to convert a DynamicFrame to a Spark DataFrame.
Please suggest the best way to resolve this problem.
There's a typo. It should work fine with a capital F in toDF:
val revisedDF = stGBDyf.toDF()
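From there, the converted DataFrame can be written to the S3 location declared earlier in the question; a minimal sketch using the plain Spark writer (format and save mode are just examples):
// Write the converted DataFrame to s3://output/directory/abc as Parquet.
revisedDF.write.mode("overwrite").parquet(outputDir)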
Spark version 2.0.2.6
Scala version 2.11.11
Using DataStax 5.0
import org.apache.log4j.{Level, Logger}
import java.util.Calendar
import org.apache.spark.sql.functions._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import com.datastax.spark.connector._
import org.apache.spark.sql._
object csvtocassandra {
def main(args: Array[String]): Unit = {
val key_space = scala.io.StdIn.readLine("Please enter cassandra Key Space Name: ")
val table_name = scala.io.StdIn.readLine("Please enter cassandra Table Name: ")
// Cassandra Part
val conf = new SparkConf().setAppName("Sample1").setMaster("local[*]")
val sc = new SparkContext(conf)
sc.setLogLevel("ERROR")
println(Calendar.getInstance.getTime)
// Scala Read CSV Part
val spark1 = org.apache.spark.sql.SparkSession.builder().master("local").config("spark.cassandra.connection.host", "127.0.0.1")
.appName("Spark SQL basic example").getOrCreate()
val csv_input = scala.io.StdIn.readLine("Please enter csv file location: ")
val df_csv = spark1.read.format("csv").option("header", "true").option("inferschema", "true").load(csv_input)
df_csv.printSchema()
}
}
Why am I not able to run this program as a job when I submit it to Spark? When I run this program using IntelliJ it works.
But when I create a JAR and run it, I get an error.
Command:
> dse spark-submit --class "csvtospark" /Users/del/target/scala-2.11/csvtospark_2.11-1.0.jar
The error:
ERROR 2017-11-02 11:46:10,245 org.apache.spark.deploy.DseSparkSubmitBootstrapper: Failed to start or submit Spark application
org.apache.spark.sql.AnalysisException: Path does not exist: dsefs://127.0.0.1/Users/Desktop/csv/example.csv;
Why is it prepending the dsefs://127.0.0.1 part even though I am giving just the path /Users/Desktop/csv/example.csv when prompted?
I tried giving the --master option as well, but I am getting the same error. I am running DataStax Spark on my local machine, not a cluster.
Please correct me where I am going wrong.
Got it. Never mind. Sorry about that.
The input should be file:///file_name.
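For example, the load from the question works once the path carries the file:// scheme, so DSE no longer resolves it against dsefs (a sketch using the path from the error message above):
val csv_input = "file:///Users/Desktop/csv/example.csv"  // local path, explicitly prefixed with file://
val df_csv = spark1.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(csv_input)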
I am trying to insert data into a Hive table like this:
val partfile = sc.textFile("partfile")
val partdata = partfile.map(p => p.split(","))
val partSchema = StructType(Array(StructField("id",IntegerType,true),StructField("name",StringType,true),StructField("salary",IntegerType,true),StructField("dept",StringType,true),StructField("location",StringType,true)))
val partRDD = partdata.map(p => Row(p(0).toInt,p(1),p(2).toInt,p(3),p(4)))
val partDF = sqlContext.createDataFrame(partRDD, partSchema)
Packages I imported:
import org.apache.spark.sql.{Row, SaveMode}
import org.apache.spark.sql.types._
This is how I tried to insert the dataframe into Hive partition:
partDF.write.mode(SaveMode.Append).partitionBy("location").insertInto("parttab")
I'm getting the below error even though the Hive table exists:
org.apache.spark.sql.AnalysisException: Table not found: parttab;
Could anyone tell me what mistake I am making here and how I can correct it?
To write data to the Hive warehouse, you need to initialize a HiveContext instance.
Upon doing that, it will pick up the configuration from hive-site.xml (on the classpath) and connect to the underlying Hive warehouse.
HiveContext is an extension of SQLContext that adds support for connecting to Hive.
To do so, try this:
val hc = new HiveContext(sc)
And perform your append query on this instance:
partDF.registerTempTable("temp")
hc.sql(".... <normal sql query to pick data from table `temp`; and insert in to Hive table > ....")
Please make sure that the table parttab is under the default database.
If the table is under another database, the table name should be specified as <db-name>.parttab.
If you need to save the dataframe directly into Hive, use this:
partDF.write.saveAsTable("<db-name>.parttab")
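Putting that together against the question's dataframe, a minimal sketch; the INSERT statement is only illustrative and assumes parttab is partitioned by location, as the partitionBy call in the question suggests:
import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)
val partDF = hc.createDataFrame(partRDD, partSchema)  // build the dataframe against the HiveContext
partDF.registerTempTable("temp")
// Dynamic-partition insert into parttab; nonstrict mode is needed because no static partition value is given.
hc.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
hc.sql("INSERT INTO TABLE parttab PARTITION (location) SELECT id, name, salary, dept, location FROM temp")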
I'm trying to load a .parquet file into a MemSQL database with Spark and the MemSQL connector.
package com.memsql.spark
import com.memsql.spark.context._
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import com.memsql.spark.connector._
import com.mysql.jdbc._
object readParquet {
def main(args: Array[String]){
val conf = new SparkConf().setAppName("ReadParquet")
val sc = new SparkContext(conf)
sc.addJar("/data/applications/spark-1.5.1-bin-hadoop2.6/lib/mysql-connector-java-5.1.37-bin.jar")
sc.addJar("/data/applications/spark-1.5.1-bin-hadoop2.6/lib/memsql-connector_2.10-1.1.0.jar")
Class.forName("com.mysql.jdbc.Driver")
val host = "xxxx"
val port = 3306
val dbName = "WP1"
val user = "root"
val password = ""
val tableName = "rt_acc"
val memsqlContext = new com.memsql.spark.context.MemSQLContext(sc, host, port, user, password)
val rt_acc = memsqlContext.read.parquet("tachyon://localhost:19998/rt_acc.parquet")
val func_rt_acc = new com.memsql.spark.connector.DataFrameFunctions(rt_acc)
func_rt_acc.saveToMemSQL(dbName, tableName, host, port, user, password)
}
}
I'm fairly certain that Tachyon is not causing the problem, as the same exceptions occur when the file is loaded from disk, and I can run SQL queries on the dataframe.
I've seen people suggest df.saveToMemSQL(...), however it seems this method is in DataFrameFunctions now.
Also, the table doesn't exist yet, but saveToMemSQL should do a CREATE TABLE, as the documentation and source code tell me.
Edit: OK, I guess I misread something. saveToMemSQL doesn't create the table. Thanks.
Try using createMemSQLTableAs instead of saveToMemSQL.
saveToMemSQL loads a dataframe into an existing table, whereas createMemSQLTableAs creates the table and then loads it.
It also returns a handy dataframe wrapping that MemSQL table :).
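A sketch of the swap, assuming createMemSQLTableAs accepts the same connection parameters as the saveToMemSQL call in the question; the parameter list here is an assumption, so check the connector's API docs for your version:
// Hypothetical parameter list, mirroring the saveToMemSQL call above.
val rtAccTable = func_rt_acc.createMemSQLTableAs(dbName, tableName, host, port, user, password)
// Per the answer above, createMemSQLTableAs returns a dataframe wrapping the newly created MemSQL table.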