I have created a Hive UDF like the one below:
class customUdf extends UDF {
  def evaluate(col: String): String = {
    col + "abc"
  }
}
I then registered the UDF in the SparkSession with:
sparksession.sql("""CREATE TEMPORARY FUNCTION testUDF AS 'testpkg.customUdf'""");
When I try to query a Hive table with the query below from Scala code, it does not progress and does not throw an error either:
SELECT testUDF(value) FROM t;
However, when I pass a string literal like below from the Scala code, it works:
SELECT testUDF('str1') FROM t;
I am running the queries via the SparkSession. I tried with GenericUDF, but I am still facing the same issue. This happens only when I pass a Hive column. What could be the reason?
Try referencing your jar from hdfs:
create function testUDF as 'testpkg.customUdf' using jar 'hdfs:///jars/customUdf.jar';
I am not sure about the implementation of UDFs in Scala, but when I faced a similar issue in Java, I noticed a difference: if you plug in a literal
select udf("some literal value")
then it is received by the UDF as a String.
But when you select from a Hive table
select udf(some_column) from some_table
you may get what's called a LazyString, for which you would need to use getObject to retrieve the actual value. I am not sure if Scala handles these lazy values automatically.
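For reference, here is a minimal Scala sketch of a GenericUDF that resolves the incoming value through its ObjectInspector instead of assuming a plain String; the class name and the single-argument contract are illustrative, not taken from the question:

import org.apache.hadoop.hive.ql.exec.UDFArgumentException
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector
import org.apache.hadoop.hive.serde2.objectinspector.primitive.{PrimitiveObjectInspectorFactory, StringObjectInspector}

class CustomGenericUdf extends GenericUDF {
  private var inputOI: StringObjectInspector = _

  override def initialize(arguments: Array[ObjectInspector]): ObjectInspector = {
    if (arguments.length != 1) {
      throw new UDFArgumentException("customGenericUdf expects exactly one string argument")
    }
    // Remember the inspector so evaluate() can unwrap whatever Hive hands over
    inputOI = arguments(0).asInstanceOf[StringObjectInspector]
    PrimitiveObjectInspectorFactory.javaStringObjectInspector
  }

  override def evaluate(arguments: Array[DeferredObject]): AnyRef = {
    val raw = arguments(0).get() // may be a lazy/writable wrapper rather than a String
    if (raw == null) null
    else inputOI.getPrimitiveJavaObject(raw) + "abc" // the inspector resolves the wrapper
  }

  override def getDisplayString(children: Array[String]): String = "customGenericUdf(col)"
}

The point is that evaluate() never casts the raw object directly; it hands it to the inspector captured in initialize().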
If I use the Hive UDF in Spark SQL, it works, as shown below.
val df=List(("$100", "$90", "$10")).toDF("selling_price", "market_price", "profit")
df.registerTempTable("test")
spark.sql("select default.encrypt(selling_price,'sales','','employee','id') from test").show
However, the following is not working:
//following is not working. not sure if you need to register a function for this
val encDF = df.withColumn("encrypted", default.encrypt($"selling_price","sales","","employee","id"))
encDF.show
Error
error: not found: value default
The Hive UDF is only available if you access it through Spark SQL. It is not available in the Scala environment, because it was not defined there. But you can still access the Hive UDF using expr:
df.withColumn("encrypted", expr("default.encrypt(selling_price,'sales','','employee','id')"))
I'm trying to create a Dataset from a JSON string within a DataFrame in Databricks 3.5 (Spark 2.2.1). In the code block below, 'jsonSchema' is a StructType with the correct layout for the JSON string, which is in the 'body' column of the DataFrame.
val newDF = oldDF.select(from_json($"body".cast("string"), jsonSchema))
This returns a dataframe where the root object is
jsontostructs(CAST(body AS STRING)):struct
followed by the fields in the schema (looks correct). When I try another select on the newDF
val transform = newDF.select($"propertyNameInTheParsedJsonObject")
it throws the exception
org.apache.spark.sql.AnalysisException: cannot resolve '`columnName`' given
input columns: [jsontostructs(CAST(body AS STRING))];;
I'm apparently missing something. I hoped from_json would return a DataFrame I could manipulate further.
My ultimate objective is to cast the json-string within the oldDF body-column to a dataset.
from_json returns a struct (or array<struct<...>>) column. That means it is a nested object. If you've provided a meaningful name:
val newDF = oldDF.select(from_json($"body".cast("string"), jsonSchema) as "parsed")
and the schema describes a plain struct, you could use standard methods like
newDF.select($"parsed.propertyNameInTheParsedJsonObject")
otherwise please follow the instructions for accessing arrays.
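Putting it together for the stated goal of ending up with a typed Dataset, here is a minimal sketch assuming a flat schema; the case class and field names are illustrative, and spark.implicits._ is assumed to be in scope (it already is wherever $ works):

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Illustrative stand-ins for the real jsonSchema and its fields
case class Payload(id: Long, name: String)
val jsonSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType)
))

// Name the parsed struct, expand its fields, then cast to a typed Dataset
val parsed = oldDF.select(from_json($"body".cast("string"), jsonSchema) as "parsed")
val ds = parsed.select("parsed.*").as[Payload]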
I am trying to add some runtime type checks when writing a Spark DataFrame. Basically, I want to make sure that the DataFrame schema is compatible with a type T; compatible doesn't mean that it has to be exactly the same. Here is my code:
def save[T: Encoder](dataframe: DataFrame, url: String): Unit = {
  val encoder = implicitly[Encoder[T]]
  assert(dataframe.schema == encoder.schema, s"Unable to save, schemas don't match")
  dataframe.write.parquet(url)
}
Currently I am checking that the schemas are equal; how could I check that they are compatible with the type T?
By compatible I mean that if I execute dataframe.as[T] it will work (but I don't want to execute that, because it is quite expensive).
Create an empty DataFrame with the same schema and call .as[T] on it. If it works, the schema should be compatible!
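A minimal sketch of that check; the helper name is illustrative, and it relies on the fact that .as[T] on an empty frame only triggers analysis, not a job:

import scala.util.Try
import org.apache.spark.sql.{DataFrame, Encoder, Row}

def isCompatibleWith[T: Encoder](dataframe: DataFrame): Boolean = {
  val spark = dataframe.sparkSession
  // Empty DataFrame that shares the schema under test
  val empty = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], dataframe.schema)
  // as[T] fails analysis (AnalysisException) if the schema cannot be mapped to T
  Try(empty.as[T]).isSuccess
}

The save method above could then assert isCompatibleWith[T](dataframe) instead of comparing the schemas for equality.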
The following code does an aggregation and creates a column with a list datatype:
groupBy(
    "column_name_1"
).agg(
    expr("collect_list(column_name_2) AS column_name_3")
)
So it seems it is possible to have 'list' as a column datatype in a DataFrame.
I was wondering if I can write a UDF that returns a custom datatype, for example a Python dict?
The list is a representation of Spark's array datatype. You can try using the map datatype (pyspark.sql.types.MapType).
An example of something that creates it is pyspark.sql.functions.create_map, which builds a map from several columns.
That said, if you want to create a custom aggregation function to do anything not already available in pyspark.sql.functions, you will need to use Scala.
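Keeping to this document's Scala examples, a hedged sketch of the map idea: a UDF that returns a Scala Map yields a MapType column, much like a dict would in Python. The DataFrame df, its column names, and the UDF name are all assumptions for illustration:

import org.apache.spark.sql.functions.{col, udf}

// A UDF returning Map[String, String] produces a MapType(StringType, StringType) column
val pairToMap = udf((k: String, v: String) => Map(k -> v))

// Hypothetical usage: fold two string columns into a single map-typed column
val withMap = df.withColumn("kv_map", pairToMap(col("column_name_1"), col("column_name_2")))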
I have registered a temp table in Apache Spark using Zeppelin as below:
val hvacText = sc.textFile("...")
case class Hvac(date: String, time: String, targettemp: Integer, actualtemp: Integer, buildingID: String)
val hvac = hvacText.map(s => s.split(",")).filter(s => s(0) != "Date").map(
  s => Hvac(s(0),
            s(1),
            s(2).toInt,
            s(3).toInt,
            s(6))).toDF()
hvac.registerTempTable("hvac")
After I am done with my queries on this temp table, how do I remove it?
I checked all the docs and it seems I am getting nowhere.
Any guidance?
Spark 2.x
For temporary views you can use Catalog.dropTempView:
spark.catalog.dropTempView("df")
For global views you can use Catalog.dropGlobalTempView:
spark.catalog.dropGlobalTempView("df")
Both methods are safe to call if the view doesn't exist and, since Spark 2.1, return a boolean indicating whether the operation succeeded.
Spark 1.x
You can use SQLContext.dropTempTable:
scala.util.Try(sqlContext.dropTempTable("df"))
It can still be used in Spark 2.0, but it delegates processing to Catalog.dropTempView and is safe to use if the table doesn't exist.
If you want to remove your temp table in Zeppelin, try it like this:
sqlc.dropTempTable("hvac")
or
%sql DROP VIEW hvac
And you can get the information you need from the Spark API docs (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.package).
In newer versions of Spark (2.0 and later), one should use createOrReplaceTempView in place of registerTempTable (deprecated).
The corresponding method to deallocate it is dropTempView:
spark.catalog.dropTempView("temp_view_name") // drops the view
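Putting the two halves together, a minimal sketch using the hvac DataFrame from the question:

// Register with the non-deprecated API, query, then drop the view
hvac.createOrReplaceTempView("hvac")
spark.sql("SELECT COUNT(*) FROM hvac").show()
spark.catalog.dropTempView("hvac") // since Spark 2.1 this returns true if the view existed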
You can use a SQL DROP TABLE/VIEW statement to remove it, like below:
spark.sql("drop view hvac");