I created a project 'spark-udf' and wrote a Hive UDF as below:
package com.spark.udf

import org.apache.hadoop.hive.ql.exec.UDF

class UpperCase extends UDF with Serializable {
  def evaluate(input: String): String = {
    input.toUpperCase
  }
}
I built it and created a jar for it. Then I tried to use this UDF in another Spark program:
spark.sql("CREATE OR REPLACE FUNCTION uppercase AS 'com.spark.udf.UpperCase' USING JAR '/home/swapnil/spark-udf/target/spark-udf-1.0.jar'")
But the following line is giving me an exception:
spark.sql("select uppercase(Car) as NAME from cars").show
Exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: No handler for UDAF 'com.spark.udf.UpperCase'. Use sparkSession.udf.register(...) instead.; line 1 pos 7
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeFunctionExpression(SessionCatalog.scala:1105)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$makeFunctionBuilder$1.apply(SessionCatalog.scala:1085)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$makeFunctionBuilder$1.apply(SessionCatalog.scala:1085)
  at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:115)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction(SessionCatalog.scala:1247)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$16$$anonfun$applyOrElse$6$$anonfun$applyOrElse$52.apply(Analyzer.scala:1226)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$16$$anonfun$applyOrElse$6$$anonfun$applyOrElse$52.apply(Analyzer.scala:1226)
  at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
Any help around this is really appreciated.
As mentioned in the comments, it's better to write a Spark UDF:
val uppercaseUDF = spark.udf.register("uppercase", (s : String) => s.toUpperCase)
spark.sql("select uppercase(Car) as NAME from cars").show
The main cause is that you didn't set enableHiveSupport when creating the SparkSession. In that situation the default SessionCatalog is used, and its makeFunctionExpression function scans only for user-defined aggregate functions. If the function is not a UDAF, it won't be found.
I created a Jira task to implement this.
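If you want to keep the Hive UDF instead, here's a minimal sketch of building a Hive-enabled session (assuming Spark is built with Hive support and the jar is reachable from the driver):

import org.apache.spark.sql.SparkSession

// With Hive support enabled, the catalog can resolve permanent functions
// created via CREATE FUNCTION ... USING JAR.
val spark = SparkSession.builder()
  .appName("spark-udf")
  .enableHiveSupport()
  .getOrCreate()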
The issue is that the class needs to be public.
package com.spark.udf

import org.apache.hadoop.hive.ql.exec.UDF

// Scala classes are public by default; make sure this one is not declared
// private or nested inside another object.
class UpperCase extends UDF with Serializable {
  def evaluate(input: String): String = {
    input.toUpperCase
  }
}
I'm trying to offload some computations from Python to Scala when using Apache Spark. I would like to use the class interface from Java to be able to use a persistent variable, like so (this is a nonsensical MWE based on my more complex use case):
package mwe

import org.apache.spark.sql.api.java.UDF1

class SomeFun extends UDF1[Int, Int] {
  private var prop: Int = 0

  override def call(input: Int): Int = {
    if (prop == 0) {
      prop = input
    }
    prop + input
  }
}
Now I'm attempting to use this class from within pyspark:
import pyspark
from pyspark.sql import SQLContext
from pyspark import SparkContext
conf = pyspark.SparkConf()
conf.set("spark.jars", "mwe.jar")
sc = SparkContext.getOrCreate(conf)
sqlContext = SQLContext.getOrCreate(sc)
sqlContext.registerJavaFunction("fun", "mwe.SomeFun")
df0 = sc.parallelize((i,) for i in range(6)).toDF(["num"])
df1 = df0.selectExpr("fun(num) + 3 as new_num")
df1.show()
And I get the following exception:
pyspark.sql.utils.AnalysisException: u"cannot resolve '(UDF:fun(num) + 3)' due to data type mismatch: differing types in '(UDF:fun(num) + 3)' (struct<> and int).; line 1 pos 0;\n'Project [(UDF:fun(num#0L) + 3) AS new_num#2]\n+- AnalysisBarrier\n +- LogicalRDD [num#0L], false\n"
What is the correct way to implement this? Will I have to resort to Java itself for the class? I'd greatly appreciate hints!
The source of the exception is the use of incompatible types:
First of all, o.a.s.sql.api.java.UDF* objects require external Java types (not Scala types), so a UDF expecting integers should take a boxed Integer (java.lang.Integer), not Int.
class SomeFun extends UDF1[Integer, Integer] {
...
override def call(input: Integer): Integer = {
...
Unless you use legacy Python, the num column uses LongType, not IntegerType:
df0.printSchema()
root
|-- num: long (nullable = true)
So the actual signature should be
class SomeFun extends UDF1[java.lang.Long, java.lang.Long] {
...
override def call(input: java.lang.Long): java.lang.Long = {
...
or the data should be cast before applying the UDF:
df0.selectExpr("fun(cast(num as integer)) + 3 as new_num")
Finally, mutable state is not allowed in UDFs. It won't cause an exception, but the overall behavior will be non-deterministic.
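For illustration, a minimal stateless sketch with the boxed long signature (the increment body is just a placeholder, not the original accumulating logic):

package mwe

import org.apache.spark.sql.api.java.UDF1

// java.lang.Long matches the LongType column that Python integers produce,
// and the call has no mutable state, so results stay deterministic.
class SomeFun extends UDF1[java.lang.Long, java.lang.Long] {
  override def call(input: java.lang.Long): java.lang.Long = {
    if (input == null) null
    else java.lang.Long.valueOf(input + 1L)
  }
}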
I'm trying to figure out why I'm getting an error on encoders; any insight would be helpful!
ERROR Unable to find encoder for type SolrNewsDocument, An implicit Encoder[SolrNewsDocument] is needed to store `
Clearly I have imported spark.implicits._. I have also provided an encoder as a case class.
def ingestDocsToSolr(newsItemDF: DataFrame) = {
  case class SolrNewsDocument(
    title: String,
    body: String,
    publication: String,
    date: String,
    byline: String,
    length: String
  )
  import spark.implicits._
  val solrDocs = newsItemDF.as[SolrNewsDocument].map { doc =>
    val solrDoc = new SolrInputDocument
    solrDoc.setField("title", doc.title.toString)
    solrDoc.setField("body", doc.body)
    solrDoc.setField("publication", doc.publication)
    solrDoc.setField("date", doc.date)
    solrDoc.setField("byline", doc.byline)
    solrDoc.setField("length", doc.length)
    solrDoc
  }
  // can be used for stream SolrSupport.
  SolrSupport.indexDocs("localhost:2181", "collection", 10, solrDocs.rdd)
  val solrServer = SolrSupport.getCachedCloudClient("localhost:2181")
  solrServer.setDefaultCollection("collection")
  solrServer.commit(false, false)
}
// Check this one: move the case class declaration before the function declaration.
// The encoder is created once the case class definition has been compiled; only then can the compiler resolve the encoder inside the function declaration.
import spark.implicits._

case class SolrNewsDocument(title: String, body: String, publication: String, date: String, byline: String, length: String)

def ingestDocsToSolr(newsItemDF: DataFrame) = {
  val solrDocs = newsItemDF.as[SolrNewsDocument]
}
I got this error when trying to iterate over a text file; in my case, as of Spark 2.4.x, the issue was that I had to convert it to an RDD first (that conversion used to be implicit):
textFile
  .rdd
  .flatMap(line => line.split(" "))
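For completeness, a self-contained sketch of that pattern (the input path is hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("wordcount-sketch").getOrCreate()

// spark.read.textFile gives a Dataset[String]; .rdd drops back to the RDD API
// so the usual flatMap-over-lines word split works as before.
val textFile = spark.read.textFile("/path/to/input.txt") // hypothetical path
val words = textFile.rdd.flatMap(line => line.split(" "))
println(words.count())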
I'm currently playing with custom transformers in my spark-shell, using both Spark 2.0.1 and 2.2.1.
While writing a custom ML transformer, in order to add it to a pipeline, I noticed that there is an issue with the override of the copy method.
The copy method is called by the fit method of the TrainValidationSplit in my case.
The error I get:
java.lang.NoSuchMethodException: Custom.<init>(java.lang.String)
at java.lang.Class.getConstructor0(Class.java:3082)
at java.lang.Class.getConstructor(Class.java:1825)
at org.apache.spark.ml.param.Params$class.defaultCopy(params.scala:718)
at org.apache.spark.ml.PipelineStage.defaultCopy(Pipeline.scala:42)
at Custom.copy(<console>:16)
... 48 elided
I then tried to call the copy method directly, but I still get the same error.
Here is my class and the call I perform:
import org.apache.spark.ml.Transformer
import org.apache.spark.sql.{Dataset, DataFrame}
import org.apache.spark.sql.types.{StructField, StructType, DataTypes}
import org.apache.spark.ml.param.{Param, ParamMap}
// Simple DF
val doubles = Seq((0, 5d, 100d), (1, 4d,500d), (2, 9d,700d)).toDF("id", "rating","views")
class Custom(override val uid: String) extends org.apache.spark.ml.Transformer {
  def this() = this(org.apache.spark.ml.util.Identifiable.randomUID("custom"))

  def copy(extra: org.apache.spark.ml.param.ParamMap): Custom = {
    defaultCopy(extra)
  }

  override def transformSchema(schema: org.apache.spark.sql.types.StructType): org.apache.spark.sql.types.StructType = {
    schema.add(org.apache.spark.sql.types.StructField("trending", org.apache.spark.sql.types.IntegerType, false))
  }

  def transform(df: org.apache.spark.sql.Dataset[_]): org.apache.spark.sql.DataFrame = {
    df.withColumn("trending", (df.col("rating") > 4 && df.col("views") > 40))
  }
}
val mycustom = new Custom("Custom")
// This call throws the exception.
mycustom.copy(new org.apache.spark.ml.param.ParamMap())
Does anyone know if this is a known issue? I can't seem to find it anywhere.
Is there another way to implement the copy method in a custom transformer?
Thanks
These are a couple of things I would change about your custom Transformer (also to enable SerDe operations on your PipelineModel):
Implement the DefaultParamsWritable trait.
Add a companion object that extends the DefaultParamsReadable interface.
e.g.
class Custom(override val uid: String) extends Transformer
with DefaultParamsWritable {
...
...
}
object Custom extends DefaultParamsReadable[Custom]
Do take a look at UnaryTransformer if you have only one input/output column.
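For the single-column case, a rough sketch of what a UnaryTransformer-based version could look like (the class name, column types, and threshold are assumptions, not your original logic):

import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable}
import org.apache.spark.sql.types.{BooleanType, DataType, DoubleType}

// Hypothetical single input/output column transformer: flags rows whose
// rating exceeds a fixed threshold.
class TrendingTransformer(override val uid: String)
    extends UnaryTransformer[Double, Boolean, TrendingTransformer]
    with DefaultParamsWritable {

  def this() = this(Identifiable.randomUID("trending"))

  override protected def createTransformFunc: Double => Boolean = _ > 4.0

  override protected def outputDataType: DataType = BooleanType

  override protected def validateInputType(inputType: DataType): Unit =
    require(inputType == DoubleType, s"Input type must be DoubleType but got $inputType")
}

// Companion object so the transformer can be read back from disk.
object TrendingTransformer extends DefaultParamsReadable[TrendingTransformer]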
Finally, what's the need to call mycustom.copy(new ParamMap()) exactly??
I am getting the above error when calling a function in Spark SQL. I have written the function in a different Scala file and am calling it from another Scala file.
Example: Function.scala
import java.text.SimpleDateFormat

object Utils extends Serializable {
  def Formater(d: String): java.sql.Date = {
    val df = new SimpleDateFormat("yyyy-MM-dd")
    val newFormat = df.format(d)
    val dat = java.sql.Date.valueOf(newFormat)
    return dat
  }
}
I am calling the above function in another Scala file.
I registered the UDF here:
sqlContext.udf.register("Formater",(s:String) => Utils.Formater(s))
and use it here:
val startdate =sqlContext.sql("select dateFormater(parameterValue) from Table").show()
If I remove the above function, the code runs without any issue; if I include it, it throws the above error.
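For comparison, a minimal self-contained sketch where the SQL call uses the exact name passed to register (the names, parsing logic, and table are illustrative, not your code):

import java.text.SimpleDateFormat
import org.apache.spark.sql.SparkSession

object Utils extends Serializable {
  def Formater(d: String): java.sql.Date = {
    val df = new SimpleDateFormat("yyyy-MM-dd")
    // Parse the incoming string, then convert it to java.sql.Date.
    new java.sql.Date(df.parse(d).getTime)
  }
}

val spark = SparkSession.builder().getOrCreate()
spark.udf.register("Formater", (s: String) => Utils.Formater(s))

// The name in the SQL text must match the name used at registration time.
spark.sql("select Formater(parameterValue) from Table").show()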