Remove Temporary Tables from Apache Spark SQL - Scala

I have registered a temp table in Apache Spark using Zeppelin as below:
val hvacText = sc.textFile("...")
case class Hvac(date: String, time: String, targettemp: Integer, actualtemp: Integer, buildingID: String)
val hvac = hvacText.map(s => s.split(","))
  .filter(s => s(0) != "Date")
  .map(s => Hvac(s(0), s(1), s(2).toInt, s(3).toInt, s(6)))
  .toDF()
hvac.registerTempTable("hvac")
After I am done with my queries on this temp table, how do I remove it?
I checked all the docs and it seems I am getting nowhere.
Any guidance?

Spark 2.x
For temporary views you can use Catalog.dropTempView:
spark.catalog.dropTempView("df")
For global views you can use Catalog.dropGlobalTempView:
spark.catalog.dropGlobalTempView("df")
Both methods are safe to call if the view doesn't exist and, since Spark 2.1, return a Boolean indicating whether the operation succeeded.
Spark 1.x
You can use SQLContext.dropTempTable:
scala.util.Try(sqlContext.dropTempTable("df"))
It can still be used in Spark 2.0, but it delegates processing to Catalog.dropTempView and is safe to use if the table doesn't exist.
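A minimal end-to-end sketch of that lifecycle in Spark 2.x, assuming the hvac DataFrame from the question: register a temporary view, query it, then drop it.
hvac.createOrReplaceTempView("hvac")
spark.sql("SELECT COUNT(*) FROM hvac").show()
// since Spark 2.1, dropTempView returns true if the view existed and was dropped
val dropped: Boolean = spark.catalog.dropTempView("hvac")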

If you want to remove your temp table in Zeppelin, try it like this:
sqlc.dropTempTable("hvac")
or
%sql DROP VIEW hvac
And you can get the information you need from the Spark API docs (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package).

In newer versions of Spark (2.0 and later), one should use createOrReplaceTempView in place of registerTempTable (deprecated),
and the corresponding method to deallocate it is dropTempView:
spark.catalog.dropTempView("temp_view_name") // drops the temporary view

You can also use a SQL DROP TABLE/DROP VIEW statement to remove it, like below:
spark.sql("drop view hvac");

Related

Column value not properly passed to hive udf spark scala

I have created a Hive UDF like below:
import org.apache.hadoop.hive.ql.exec.UDF

class customUdf extends UDF {
  def evaluate(col: String): String = {
    col + "abc"
  }
}
I then registered the UDF in the SparkSession by:
sparksession.sql("""CREATE TEMPORARY FUNCTION testUDF AS 'testpkg.customUdf'""");
When I try to query a Hive table using the query below in Scala code, it does not progress and does not throw an error either:
SELECT testUDF(value) FROM t;
However, when I pass a string literal like below from Scala code, it works:
SELECT testUDF('str1') FROM t;
I am running the queries via SparkSession. I tried with GenericUDF, but am still facing the same issue. This happens only when I pass a Hive column. What could be the reason?
Try referencing your jar from hdfs:
create function testUDF as 'testpkg.customUdf' using jar 'hdfs:///jars/customUdf.jar';
I am not sure about the implementation of UDFs in Scala, but when I faced a similar issue in Java, I noticed a difference: if you plug in a literal
select udf("some literal value")
then it is received by the UDF as a String.
But when you select from a Hive table
select udf(some_column) from some_table
you may get what's called a LazyString, for which you would need to use getObject to retrieve the actual value. I am not sure if Scala handles these lazy values automatically.
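For reference, here is a minimal sketch of what that lazy-value handling could look like in a Scala GenericUDF, resolving the column through its ObjectInspector instead of assuming a plain String. The class name is a placeholder and this is only an illustration of the approach described above, not the asker's actual code.
import org.apache.hadoop.hive.ql.exec.UDFArgumentException
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector
import org.apache.hadoop.hive.serde2.objectinspector.primitive.{PrimitiveObjectInspectorFactory, StringObjectInspector}

class CustomGenericUdf extends GenericUDF {
  private var inputOI: StringObjectInspector = _

  override def initialize(arguments: Array[ObjectInspector]): ObjectInspector = {
    if (arguments.length != 1)
      throw new UDFArgumentException("customUdf expects exactly one string argument")
    inputOI = arguments(0).asInstanceOf[StringObjectInspector]
    PrimitiveObjectInspectorFactory.javaStringObjectInspector
  }

  override def evaluate(arguments: Array[DeferredObject]): AnyRef = {
    // getPrimitiveJavaObject unwraps LazyString (and similar lazy SerDe types) into a plain String
    val value = inputOI.getPrimitiveJavaObject(arguments(0).get())
    if (value == null) null else value + "abc"
  }

  override def getDisplayString(children: Array[String]): String =
    "customUdf(" + children.mkString(", ") + ")"
}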

spark-shell load existing hive table by partition?

In spark-shell, how do I load an existing Hive table, but only one of its partitions?
val df = spark.read.format("orc").load("mytable")
I was looking for a way to load only one particular partition of this table.
Thanks!
There is no direct way to do this with spark.read.format, but you can use a where condition:
val df = spark.read.format("orc").load("mytable").where(yourPartitionColumnCondition)
Unless and until you perform an action, nothing is loaded, since load (pointing to your ORC file location) is just a function on DataFrameReader, like below; it doesn't load anything until an action is triggered.
see here DataFrameReader
def load(paths: String*): DataFrame = {
...
}
In the code above, i.e. spark.read..., where is just a where condition; when you specify it, the data still won't be loaded immediately :-)
When you call df.count, the partition column filter will be applied to the data paths of the ORC table.
There is no function in the Spark API to load only a partition directory, but another way around this is that the partition directory is nothing but a column in the where clause. You can write a simple SQL query with the partition column in the where clause, which will read data only from that partition directory. See if that works for you.
val df = spark.sql("SELECT * FROM mytable WHERE <partition_col_name> = <expected_value>")
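Putting the two suggestions together, a minimal sketch (assuming a hypothetical partition column named dt); in both forms only the matching partition directories are read once an action runs.
// filter pushed down to the partition directories of the ORC table
val dfByPath = spark.read.format("orc").load("mytable").where("dt = '2021-01-01'")
dfByPath.count()

// equivalent SQL form against the Hive table
val dfBySql = spark.sql("SELECT * FROM mytable WHERE dt = '2021-01-01'")
dfBySql.count()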

Spark Job simply stalls when querying full cassandra table

I have a rather peculiar problem. In a DSE spark analytics engine I produce frequent stats that I store to cassandra in a small table. Since I keep the table trimmed and it is supposed to serve a web interface with consolidated information, I simply want to query the whole table in spark and send the results over an API. I have tried two methods for this:
Method 1:
val a = Try(sc.cassandraTable[Data](keyspace, table).collect()).toOption
Method 2:
val query = "SELECT * FROM keyspace.table"
val df = spark.sqlContext.sql(query)
val list = df.collect()
I am doing this in a scala program. When I use method 1, spark job mysteriously gets stuck showing stage 10 of 12 forever. Verified in logs and spark jobs page. When I use the second method it simply tells me that no such table exists:
Unknown exception: org.apache.spark.sql.AnalysisException: Table or view not found: keyspace1.table1; line 1 pos 15;
'Project [*]
+- 'UnresolvedRelation keyspace1.table1
Interestingly, I tested both methods in spark shell on the cluster and they work just fine. My program has plenty of other queries done using method 1 and they all work fine, the key difference being that in each of them the main partition key always has a condition on it unlike in this query (holds true for this particular table too).
Here is the table structure:
CREATE TABLE keyspace1.table1 (
userid text,
stat_type text,
event_time bigint,
stat_value double,
PRIMARY KEY (userid, stat_type))
WITH CLUSTERING ORDER BY (stat_type ASC)
Any solid diagnosis of the problem or a work around would be much appreciated
When you do select * without a where clause in Cassandra, you're actually performing a full range query. This is not an intended use case in Cassandra (aside from peeking at the data, perhaps). Just for the fun of it, try replacing it with select * from keyspace.table limit 10 and see if it works; it might...
Anyway, my gut feeling says your problem isn't with Spark but with Cassandra. If you have visibility into Cassandra metrics, look at the range query latencies.
Now, if your code above is complete, the reason that method 1 freezes while method 2 doesn't is that method 1 contains an action (collect), while method 2 doesn't involve any Spark action, just schema inference. If you add df.collect to method 2, you will face the same issue with Cassandra.
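A quick sanity check along those lines, sketched with the connector call from method 1: pull only a handful of rows. If this returns promptly while the full collect() stalls, the bottleneck is the Cassandra full-range scan rather than Spark itself.
import scala.util.Try
val sample = Try(sc.cassandraTable[Data](keyspace, table).take(10)).toOption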

check if table exists in hive using spark 1.6 scala code

I am trying to check if a table exists in Hive using Spark 1.6 and Scala code.
I tried to explore the internet but couldn't find anything more useful than this:
spark - scala - How can I check if a table exists in hive
There it is mentioned that if we use
sqlContext.tableNames.contains("mytable")
then it returns a Boolean, but when I try this it checks the default database and gives me false.
How can I set which database is looked into while doing this check?
You could set the database first like this:
scala> sqlContext.sql("use dbName")
and then search for the table:
scala> sqlContext.tableNames.contains("tabName")
res3: Boolean = true
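Alternatively, SQLContext in Spark 1.x also has a tableNames overload that takes a database name, which avoids switching the current database (a small sketch, with dbName/tabName as placeholders):
val exists: Boolean = sqlContext.tableNames("dbName").contains("tabName")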

Spark Streaming: NullPointerException inside foreachPartition

I have a Spark Streaming job which reads from Kafka and does some comparisons with an existing table in Postgres before writing to Postgres again. This is what it looks like:
val message = KafkaUtils.createStream(...).map(_._2)
message.foreachRDD( rdd => {
  if (!rdd.isEmpty) {
    val kafkaDF = sqlContext.read.json(rdd)
    println("First")
    kafkaDF.foreachPartition( i => {
      val jdbcDF = sqlContext.read.format("jdbc").options(
        Map("url" -> "jdbc:postgresql://...",
            "dbtable" -> "table", "user" -> "user", "password" -> "pwd" )).load()
      createConnection()
      i.foreach( row => {
        println("Second")
        connection.sendToTable()
      })
      closeConnection()
    })
  }
})
This code is giving me a NullPointerException at the line val jdbcDF = ...
What am I doing wrong? Also, my log "First" works, but "Second" doesn't show up anywhere in the logs. I tried the entire code with kafkaDF.collect().foreach(...) and it works perfectly, but has very poor performance. I am looking to replace it with foreachPartition.
Thanks
It is not clear if there are any issues inside createConnection, closeConnection or connection.sendToTable, but the fundamental problem is an attempt to nest actions / transformations. This is not supported in Spark, and Spark Streaming is no different.
It means that the nested DataFrame initialization (val jdbcDF = sqlContext.read.format ...) simply cannot work and should be removed. If you use it as a reference, it should be created at the same level as kafkaDF and referenced using standard transformations (unionAll, join, ...).
If for some reason that is not an acceptable solution, you can create a plain JDBC connection inside foreachPartition and operate on the PostgreSQL table directly (I guess that is what you already do inside sendToTable).
As @zero323 correctly pointed out, you can't broadcast your JDBC connection around and you cannot create nested RDDs either. Spark simply does not support using the sparkContext or sqlContext within an existing closure, i.e. foreachPartition, hence the null pointer exception.
The only way to solve this efficiently is to create a JDBC connection within foreachPartition and execute SQL directly on it to do whatever you intended, and then use that same connection to write back the records.
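A minimal sketch of that approach (the URL, table name, credentials and insert statement are all placeholders, not the asker's schema):
import java.sql.DriverManager

kafkaDF.foreachPartition { rows =>
  // one plain JDBC connection per partition, created inside the closure
  val conn = DriverManager.getConnection("jdbc:postgresql://host:5432/db", "user", "pwd")
  try {
    val stmt = conn.prepareStatement("INSERT INTO target_table (payload) VALUES (?)")
    rows.foreach { row =>
      // compare / transform the row as needed, then write it back over the same connection
      stmt.setString(1, row.toString)
      stmt.executeUpdate()
    }
  } finally {
    conn.close()
  }
}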
As to your second, edited, question:
Change:
kafkaDF.foreachPartition(..)
to
kafkaDF.repartition(numPartition).foreachPartition(..)
where numPartition is the desired number of partitions. This will increase the number of partitions. If you have multiple executors (and multiple tasks per executor), these will run in parallel.