Check if a table exists in Hive using Spark 1.6 Scala code - scala

I am trying to check if a table exists in Hive using Spark 1.6 and Scala.
I tried to explore over the internet but couldn't find anything more useful than this:
spark - scala - How can I check if a table exists in hive
There it is mentioned that if we use
sqlContext.tableNames.contains("mytable")
then it returns a Boolean. But when I try this, it checks the default database and gives me false.
How can I set the database to be looked into while doing this check?

You could set the database first like this:
scala> sqlContext.sql("use dbName")
and then search for the table:
scala> sqlContext.tableNames.contains("tabName")
res3: Boolean = true
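If you prefer not to switch the current database, Spark 1.6's SQLContext also has an overload of tableNames that takes a database name. A minimal sketch, assuming the database is called dbName:
scala> // check for the table in a specific database without changing the current one
scala> val exists = sqlContext.tableNames("dbName").contains("mytable")
exists: Boolean = true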

Related

Column value not properly passed to hive udf spark scala

I have created a Hive UDF like below:
import org.apache.hadoop.hive.ql.exec.UDF

class customUdf extends UDF {
  def evaluate(col: String): String = {
    col + "abc"
  }
}
I then registered the UDF in the SparkSession by:
sparksession.sql("""CREATE TEMPORARY FUNCTION testUDF AS 'testpkg.customUdf'""");
When I try to query a Hive table using the query below in Scala code, it does not progress and does not throw an error either:
SELECT testUDF(value) FROM t;
However, when I pass a string literal like below from Scala code, it works:
SELECT testUDF('str1') FROM t;
I am running the queries via the SparkSession. I also tried with GenericUDF, but I still face the same issue. It happens only when I pass a Hive column. What could be the reason?
Try referencing your jar from HDFS:
create function testUDF as 'testpkg.customUdf' using jar 'hdfs:///jars/customUdf.jar';
I am not sure about the implementation of UDFs in Scala, but when I faced a similar issue in Java, I noticed a difference: if you plug in a literal
select udf("some literal value")
then it is received by the UDF as a String.
But when you select from a Hive table
select udf(some_column) from some_table
you may get what's called a LazyString, for which you need to use getObject to retrieve the actual value. I am not sure whether Scala handles these lazy values automatically.
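One way to handle that explicitly is the GenericUDF API, where an ObjectInspector converts lazy Hive objects (such as LazyString) into plain Java values. A minimal, untested sketch; the class name CustomGenericUdf is hypothetical:
import org.apache.hadoop.hive.ql.exec.UDFArgumentException
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject
import org.apache.hadoop.hive.serde2.objectinspector.{ObjectInspector, PrimitiveObjectInspector}
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory

class CustomGenericUdf extends GenericUDF {
  private var inputOI: PrimitiveObjectInspector = _

  override def initialize(arguments: Array[ObjectInspector]): ObjectInspector = {
    if (arguments.length != 1)
      throw new UDFArgumentException("testUDF expects exactly one argument")
    inputOI = arguments(0).asInstanceOf[PrimitiveObjectInspector]
    PrimitiveObjectInspectorFactory.javaStringObjectInspector
  }

  override def evaluate(arguments: Array[DeferredObject]): AnyRef = {
    val raw = arguments(0).get() // may be a LazyString rather than a java.lang.String
    if (raw == null) null
    else inputOI.getPrimitiveJavaObject(raw).toString + "abc"
  }

  override def getDisplayString(children: Array[String]): String =
    "testUDF(" + children.mkString(", ") + ")"
}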

Spark - Hive UDF is working with Spark-SQL but not with DataFrame

If I use a Hive UDF in Spark SQL it works, as shown below.
val df=List(("$100", "$90", "$10")).toDF("selling_price", "market_price", "profit")
df.registerTempTable("test")
spark.sql("select default.encrypt(selling_price,'sales','','employee','id') from test").show
However, the following is not working.
//following is not working. not sure if you need to register a function for this
val encDF = df.withColumn("encrypted", default.encrypt($"selling_price","sales","","employee","id"))
encDF.show
Error
error: not found: value default
The Hive UDF is only available if you access it through Spark SQL. It is not available in the Scala DataFrame DSL, because it was never defined there. But you can still access the Hive UDF using expr:
df.withColumn("encrypted", expr("default.encrypt(selling_price,'sales','','employee','id')"))
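For completeness, a small sketch of how that might look end to end, assuming the Hive function default.encrypt is already registered in the metastore:
import org.apache.spark.sql.functions.expr

// expr() hands the expression string to Spark SQL, which resolves the Hive UDF by name.
val encDF = df.withColumn(
  "encrypted",
  expr("default.encrypt(selling_price, 'sales', '', 'employee', 'id')"))

// selectExpr is an equivalent shortcut for the same thing.
df.selectExpr("default.encrypt(selling_price, 'sales', '', 'employee', 'id') AS encrypted").show()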

Reading ES from spark with elasticsearch-spark connector: all the fields are returned

I've done some experiments in the spark-shell with the elasticsearch-spark connector. Invoking spark:
] $SPARK_HOME/bin/spark-shell --master local[2] --jars ~/spark/jars/elasticsearch-spark-20_2.11-5.1.2.jar
In the scala shell:
scala> import org.elasticsearch.spark._
scala> val es_rdd = sc.esRDD("myindex/mytype",query="myquery")
It works well; the result contains the correct records as specified in myquery. The only thing is that I get all the fields, even if I specify a subset of fields in the query. Example:
myquery = """{"query":..., "fields":["a","b"], "size":10}"""
returns all the fields, not only a and b (by the way, I noticed that the size parameter is not taken into account either: the result contains more than 10 records). It may be important to add that the fields are nested; a and b are actually doc.a and doc.b.
Is it a bug in the connector or do I have the wrong syntax?
The Spark Elasticsearch connector uses fields, so you cannot apply a projection.
If you want fine-grained control over the mapping, you should use a DataFrame instead, which is basically an RDD plus a schema.
The pushdown option should also be enabled to translate (push down) Spark SQL into the Elasticsearch Query DSL.
Now a semi-full example:
val myQuery = """{"query": ...}"""
val df = spark.read.format("org.elasticsearch.spark.sql")
.option("query", myQuery)
.option("pushdown", "true")
.load("myindex/mytype")
.limit(10) // instead of size
.select("a","b") // instead of fields
what about calling:
scala> val es_rdd = sc.esRDD("myindex/mytype",query="myquery", Map[String, String] ("es.read.field.include"->"a,b"))
You want to restrict the fields returned from the Elasticsearch _search HTTP API? (I guess to improve download speed.)
First of all, use an HTTP proxy to see what the elasticsearch-hadoop plugin is doing (I use Apache Zeppelin with Charles proxy on macOS). This will help you understand how pushdown works.
There are several solutions to achieve this:
1. dataframe and pushdown
You specify fields, and the plugin will "forward" to ES (here the _source parameter):
POST ../events/_search?search_type=scan&scroll=5m&size=50&_source=client&preference=_shards%3A3%3B_local
(-) Not fully working for nested fields.
(+) Simple, straightforward, easy to read
2. RDD & query fields
With JavaEsSpark.esRDD, you can specify fields inside the JSON query, like you did. This only works with RDDs (with a DataFrame, the fields parameter is not sent).
(-) no DataFrame -> no Spark SQL way
(+) more flexible, more control
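For reference, a hedged sketch of the DataFrame route combined with source filtering via es.read.field.include (the option mentioned above); doc.a and doc.b are the nested field paths from the question:
// Sketch only: asks Elasticsearch to return just the listed fields, and lets
// Spark handle the projection and the row limit.
val df = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("pushdown", "true")
  .option("es.read.field.include", "doc.a,doc.b") // nested fields use dotted paths
  .load("myindex/mytype")
  .select("doc.a", "doc.b")
  .limit(10)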

Spark streaming - Custom receiver and dataframe infer schema

Consider the code snippet below at the receiver:
val incomingMessage = subscriberSocket.recv(0)
val stringMessages = new String(incomingMessage).stripLineEnd.split(',')
store(Row.fromSeq(Array(stringMessages(0)) ++ stringMessages.drop(2)))
At the receiver, I do not want to convert each of the columns to its actual table type (the table is indicated by stringMessages(0)).
In the main section of the code, when I do
val df = sqlContext.createDataFrame(eachGDNRdd,getSchemaAsStructField)
println(df.collect().length)
I get the below error
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Double
at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:119)
at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getDouble(rows.scala:44)
Now, the schema consists of both String and Int fields. I have cross-verified that the fields match by type. However, it looks like the Spark DataFrame is not inferring the types.
Questions
1. Shouldn't Spark infer the types of the schema at run time (unless there is a contradiction)?
2. Since the table is dynamic, the schema varies based on the first element of each row (which contains the table name). Is there any simple suggested way to modify the schema on the fly?
Or am I missing something obvious?
I'm new to Spark and you didn't say what version you're running, but in v2.1.0, schema inference is disabled by default due to the specific reason you mentioned; if the record structure is inconsistent, Spark can't reliably infer the schema. You can enable schema inference by setting spark.sql.streaming.schemaInference to true, but I think you're better off specifying the schema yourself.
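A minimal sketch of that last suggestion, assuming each incoming Row currently holds only Strings and the target schema is a String column followed by an Int column (the column names here are hypothetical):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Declare the schema explicitly instead of relying on inference.
val schema = StructType(Seq(
  StructField("tableName", StringType, nullable = false),
  StructField("metric", IntegerType, nullable = true)))

// Cast each string to the type the schema declares before building the DataFrame,
// so Spark never has to unbox a String as an Int/Double later on.
val typedRdd = eachGDNRdd.map { row =>
  Row(row.getString(0), row.getString(1).trim.toInt)
}

val df = sqlContext.createDataFrame(typedRdd, schema)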

Remove Temporary Tables from Apache Spark SQL

I have registered a temp table in Apache Spark using Zeppelin, as shown below:
val hvacText = sc.textFile("...")
case class Hvac(date: String, time: String, targettemp: Integer, actualtemp: Integer, buildingID: String)
val hvac = hvacText.map(s => s.split(","))
  .filter(s => s(0) != "Date")
  .map(s => Hvac(s(0), s(1), s(2).toInt, s(3).toInt, s(6)))
  .toDF()
hvac.registerTempTable("hvac")
After I am done with my queries on this temp table, how do I remove it?
I checked all the docs and it seems I am getting nowhere.
Any guidance?
Spark 2.x
For temporary views you can use Catalog.dropTempView:
spark.catalog.dropTempView("df")
For global views you can use Catalog.dropGlobalTempView:
spark.catalog.dropGlobalTempView("df")
Both methods are safe to call if the view doesn't exist and, since Spark 2.1, return a Boolean indicating whether the operation succeeded.
Spark 1.x
You can use SQLContext.dropTempTable:
scala.util.Try(sqlContext.dropTempTable("df"))
It can still be used in Spark 2.0, but it delegates processing to Catalog.dropTempView and is safe to use if the table doesn't exist.
If you want to remove your temp table in Zeppelin, try something like this:
sqlc.dropTempTable("hvac")
or
%sql DROP VIEW hvac
And you can get the information you need from the Spark API docs (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package).
In newer versions of Spark (2.0 and later),
one should use createOrReplaceTempView in place of registerTempTable (deprecated),
and the corresponding method to deallocate it is dropTempView:
spark.catalog.dropTempView("temp_view_name") // drops the view
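Putting that together, a minimal Spark 2.x sketch (the CSV source and path are just illustrations):
// Create a temporary view, use it, then drop it when done.
val hvacDF = spark.read.option("header", "true").csv("/path/to/hvac.csv") // hypothetical path
hvacDF.createOrReplaceTempView("hvac")

spark.sql("SELECT COUNT(*) FROM hvac").show()

spark.catalog.dropTempView("hvac") // returns true in Spark 2.1+ if the view existed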
You can use a SQL DROP VIEW statement to remove it, like below:
spark.sql("drop view hvac");