I am getting the above error when calling a function in Spark SQL. The function is defined in one Scala file and called from another.
Example, Function.Scala:
object Utils extends Serializable {
  def Formater(d: String): java.sql.Date = {
    val df = new SimpleDateFormat("yyyy-MM-dd")
    val newFormat = df.format(d)
    val dat = java.sql.Date.valueOf(newFormat)
    return dat
  }
}
I call this function from another Scala file.
The UDF is registered here:
sqlContext.udf.register("Formater",(s:String) => Utils.Formater(s))
and used here:
val startdate =sqlContext.sql("select dateFormater(parameterValue) from Table").show()
If I remove the above function, the code runs without any issue; if I include it, it throws the above error.
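The error text itself is not quoted above, so the following is only a guess at two likely culprits. First, SimpleDateFormat.format expects a Date (or a Number), so passing it a String throws an IllegalArgumentException at runtime; parsing the string is probably what was intended. Second, the UDF is registered as "Formater" but the query calls dateFormater, and the two names have to match. A minimal sketch, in which the object name DateUtils and the method name formatter are hypothetical:

```scala
import java.text.SimpleDateFormat

// Hypothetical replacement for Utils.Formater: parses "yyyy-MM-dd" text into a java.sql.Date.
object DateUtils extends Serializable {
  def formatter(d: String): java.sql.Date = {
    // SimpleDateFormat is not thread-safe, so a fresh instance is created per call.
    val df = new SimpleDateFormat("yyyy-MM-dd")
    df.setLenient(false) // reject impossible dates such as 2020-02-30
    // parse (not format) turns the String into a java.util.Date
    new java.sql.Date(df.parse(d).getTime)
  }
}

// Assumed registration and query, using one consistent name:
//   sqlContext.udf.register("dateFormater", (s: String) => DateUtils.formatter(s))
//   sqlContext.sql("select dateFormater(parameterValue) from Table").show()
```

If the input is already in yyyy-MM-dd form, calling java.sql.Date.valueOf(d) directly would also work and avoids SimpleDateFormat altogether.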
I am trying to broadcast a List and pass the broadcast variable to a UDF (the Scala code is in a separate file), but I am running into issues.
val Lookup_BroadCast = SC.broadcast(lookup_data)
UDF creation with 3 arguments
val Call_Sub_Pgm = udf(foo(_: String, Lookup_BroadCast: org.apache.spark.broadcast.Broadcast[List[String]], Trace: String))
Calling the UDF using "withColumn"
Out_DF = Out_DF.withColumn("Col-1", Call_Sub_Pgm(col("Col-1"),Lookup_BroadCast,lit(Trace)))
I get a compilation error for the above code: "found broadcast variable, required Sql Column"
If I remove the "Lookup_BroadCast" variable from the call above
Out_DF = Out_DF.withColumn("Col-1", Call_Sub_Pgm(col("Col-1"),lit(Trace)))
then I get the error below:
java.lang.ClassCastException: org.spark.masking.ExtractData$$anonfun$7 cannot be cast to scala.Function0
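A serializable wrapper class can be created for the function, taking the Broadcast in its constructor. Arguments passed when calling a UDF must be Columns, so the broadcast variable has to reach the function this way rather than as a call argument:
import org.apache.spark.broadcast.Broadcast

class Wrapper(Lookup_BroadCast: Broadcast[List[String]]) extends Serializable {
  def foo(v: String, s: String): String = {
    // usage example: read the broadcast value on the executor
    Lookup_BroadCast.value.head
  }
}
It is then used like this: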
val wrapper = new Wrapper(Lookup_BroadCast)
val Call_Sub_Pgm = udf(wrapper.foo(_: String, _: String))
I am trying to do some transformations on a data set. After reading the data set, df.show() lists the rows in the Spark shell. But when I try df.count or any aggregate function, I get java.lang.ArrayIndexOutOfBoundsException: 1.
val itpostsrow = sc.textFile("/home/jayk/Downloads/spark-data")
import scala.util.control.Exception.catching
import java.sql.Timestamp
implicit class StringImprovements(val s: String) {
  def toIntSafe = catching(classOf[NumberFormatException]) opt s.toInt
  def toLongsafe = catching(classOf[NumberFormatException]) opt s.toLong
  def toTimeStampsafe = catching(classOf[IllegalArgumentException]) opt Timestamp.valueOf(s)
}
case class Post(commentcount:Option[Int],lastactivitydate:Option[java.sql.Timestamp],ownerUserId:Option[Long],body:String,score:Option[Int],creattiondate:Option[java.sql.Timestamp],viewcount:Option[Int],title:String,tags:String,answerCount:Option[Int],acceptedanswerid:Option[Long],posttypeid:Option[Long],id:Long)
def stringToPost(row: String): Post = {
  val r = row.split("~")
  Post(r(0).toIntSafe,
    r(1).toTimeStampsafe,
    r(2).toLongsafe,
    r(3),
    r(4).toIntSafe,
    r(5).toTimeStampsafe,
    r(6).toIntSafe,
    r(7),
    r(8),
    r(9).toIntSafe,
    r(10).toLongsafe,
    r(11).toLongsafe,
    r(12).toLong)
}
val itpostsDFcase1 = itpostsrow.map{x=>stringToPost(x)}
val itpostsDF = itpostsDFcase1.toDF()
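Your function stringToPost() throws java.lang.ArrayIndexOutOfBoundsException whenever a line yields fewer than 13 fields after the split, for example an empty line or a line without any "~". The 1 in the message is the failing index, so r(1) was accessed on an array with a single element. Note also that split("~") drops trailing empty strings, so a row whose last fields are empty produces a shorter array.
Because Spark evaluates lazily, the parsing only runs when an action is executed. show() only needs the first few rows, so it can succeed without ever reaching the bad line, whereas count has to process every line and therefore hits the error.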
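One way to guard against such rows is to check the field count before indexing and to return an Option instead of throwing. The sketch below is plain Scala with no Spark dependency; MiniPost, SafeParse and parseLine are hypothetical names, and only three of the 13 fields are modelled to keep it short:

```scala
import scala.util.Try

// Hypothetical, trimmed-down record: only three of the 13 fields are modelled.
final case class MiniPost(commentCount: Option[Int], body: String, id: Long)

object SafeParse {
  val ExpectedFields = 13

  // Returns None for any line that cannot be turned into a record.
  def parseLine(row: String): Option[MiniPost] = {
    // A limit of -1 keeps trailing empty fields, which split("~") would drop.
    val r = row.split("~", -1)
    if (r.length != ExpectedFields) None
    else Try(r(12).toLong).toOption.map { id =>
      MiniPost(Try(r(0).toInt).toOption, r(3), id)
    }
  }
}
```

Assuming the same itpostsrow RDD as above, itpostsrow.flatMap(x => SafeParse.parseLine(x)) would then keep only the rows that parse, and the malformed ones could be counted separately to see how many were dropped.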
I created a project 'spark-udf' and wrote a Hive UDF as below:
package com.spark.udf
import org.apache.hadoop.hive.ql.exec.UDF
class UpperCase extends UDF with Serializable {
  def evaluate(input: String): String = {
    input.toUpperCase
  }
}
I built it and created a jar for it, then tried to use this UDF in another Spark program:
spark.sql("CREATE OR REPLACE FUNCTION uppercase AS 'com.spark.udf.UpperCase' USING JAR '/home/swapnil/spark-udf/target/spark-udf-1.0.jar'")
But the following line throws an exception:
spark.sql("select uppercase(Car) as NAME from cars").show
Exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: No handler for UDAF 'com.spark.udf.UpperCase'. Use sparkSession.udf.register(...) instead.; line 1 pos 7
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeFunctionExpression(SessionCatalog.scala:1105)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$makeFunctionBuilder$1.apply(SessionCatalog.scala:1085)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$makeFunctionBuilder$1.apply(SessionCatalog.scala:1085)
  at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:115)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction(SessionCatalog.scala:1247)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$16$$anonfun$applyOrElse$6$$anonfun$applyOrElse$52.apply(Analyzer.scala:1226)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$16$$anonfun$applyOrElse$6$$anonfun$applyOrElse$52.apply(Analyzer.scala:1226)
  at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
Any help around this is really appreciated.
As mentioned in the comments, it is better to write a Spark UDF:
val uppercaseUDF = spark.udf.register("uppercase", (s : String) => s.toUpperCase)
spark.sql("select uppercase(Car) as NAME from cars").show
The main cause is that you didn't call enableHiveSupport when creating the SparkSession. In that situation the default SessionCatalog is used, and the makeFunctionExpression function in SessionCatalog only looks for a User Defined Aggregate Function. If the function is not a UDAF, it won't be found.
I created a Jira task to implement this.
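The issue is that the class needs to be public. In Scala a top-level class is public by default and there is no public modifier (writing one does not compile), so make sure the class is not declared private or nested inside another class or object:
package com.spark.udf
import org.apache.hadoop.hive.ql.exec.UDF
class UpperCase extends UDF with Serializable {
  def evaluate(input: String): String = {
    input.toUpperCase
  }
}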
case class Response(jobCompleted: String, detailedMessage: String)
override def runJob(sc: HiveContext, runtime: JobEnvironment, data: JobData): JobOutput = {
  val generateResponse = new GenerateResponse(data, sc)
  val response = generateResponse.generateResponse()
  response.prettyPrint
}
I am trying to get output from Spark Job Server in this format from my Scala code:
"result":{
"jobCompleted":true,
"detailedMessage":"all good"
}
However, what comes back is the following: result:{"{\"jobCompleted\":\"true\",\"detailedMessage.."}.
Can someone please point out what I am doing wrong and how to get the correct format? I also tried response.toJson, which returns the AST format:
"result": [{
"jobCompleted": ["true"],
"detailedMessage": ["all good"]
}],
I finally figured it out, based on this Stack Overflow question: Convert DataFrame to RDD[Map] in Scala. If there is a better way, kindly post it here, as I am new to Scala and Spark Job Server.
So the key is to convert the response to a Map[String,JsValue]. Below is the sample code I was playing with.
import spray.json._

case class Response(param1: String, param2: String, param3: List[SubResult])
case class SubResult(lst: List[String])

object ResultFormat extends DefaultJsonProtocol {
  implicit val subresultformat = jsonFormat1(SubResult)
  implicit val responsefomat = jsonFormat3(Response)
}

type JobOutput = Map[String, JsValue]

def runJob(....): JobOutput = {
  import ResultFormat._ // bring the implicit formats into scope for toJson
  val xlst = List("one", "two")
  val ylst = List("three", "four")
  val subresult1 = SubResult(xlst)
  val subresult2 = SubResult(ylst)
  val subResultlist = List(subresult1, subresult2)
  val r = Response("xxxx", "yyy", subResultlist)
  // Returns a Map[String,JsValue], which Spark Job Server serializes correctly.
  r.toJson.asJsObject.fields
}
Please take a look at the following spark streaming code written in scala:
object HBase {
  var hbaseTable = ""
  val hConf = new HBaseConfiguration()
  hConf.set("hbase.zookeeper.quorum", "zookeeperhost")
  def init(input: (String)) {
    hbaseTable = input
  }
  def display() {
    print(hbaseTable)
  }
  def insertHbase(row: (String)) {
    val hTable = new HTable(hConf,hbaseTable)
  }
}
object mainHbase {
  def main(args : Array[String]) {
    if (args.length < 5) {
      System.err.println("Usage: MetricAggregatorHBase <zkQuorum> <group> <topics> <numThreads> <hbaseTable>")
      System.exit(1)
    }
    val Array(zkQuorum, group, topics, numThreads, hbaseTable) = args
    HBase.init(hbaseTable)
    HBase.display()
    val sparkConf = new SparkConf().setAppName("mainHbase")
    val ssc = new StreamingContext(sparkConf, Seconds(10))
    ssc.checkpoint("checkpoint")
    val topicpMap = topics.split(",").map((_, numThreads.toInt)).toMap
    val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicpMap).map(_._2)
    val storeStg = lines.foreachRDD(rdd => rdd.foreach(HBase.insertHbase))
    lines.print()
    ssc.start()
  }
}
I am trying to initialize the parameter hbaseTable in the object HBase by calling the HBase.init method. It sets the parameter properly; I confirmed that by calling the HBase.display method on the next line.
However, when the HBase.insertHbase method is called inside foreachRDD, it throws an error saying that hbaseTable is not set.
Update with exception:
java.lang.IllegalArgumentException: Table qualifier must not be empty
org.apache.hadoop.hbase.TableName.isLegalTableQualifierName(TableName.java:179)
org.apache.hadoop.hbase.TableName.isLegalTableQualifierName(TableName.java:149)
org.apache.hadoop.hbase.TableName.<init>(TableName.java:303)
org.apache.hadoop.hbase.TableName.createTableNameIfNecessary(TableName.java:339)
org.apache.hadoop.hbase.TableName.valueOf(TableName.java:426)
org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:156)
Can you please let me know how to make this code work?
"Where is this code running" - that's the question that we need to ask in order to understand what's going on.
HBase is a Scala object - by definition it's a singleton construct that gets initialized with 'only once' semantics in the JVM.
At the initialization point, HBase.init(hbaseTable) is executed in the driver of this Spark application, initializing this object with the given value in the VM of the driver.
But when we do rdd.foreach(HBase.insertHbase), the closure is executed as a task on each executor that hosts a partition of the given RDD. At that point a fresh HBase object is instantiated in the JVM of each executor, with hbaseTable still set to its default empty string. HBase.init was never called in those JVMs, which is why the table name is empty there.
There are two options:
We can add an "isInitialized" check to the HBase object and make the (now conditional) initialization call on each call inside foreach.
Another option would be to use
rdd.foreachPartition { partition =>
  HBase.init(...)
  partition.foreach(elem => HBase.insertHbase(elem))
}
This construction amortizes the cost of initialization over the elements of each partition. It can also be combined with an initialization check to avoid unnecessary bootstrap work.
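The combination of both options can be sketched in plain Scala, without any Spark or HBase classes. TableSink and PartitionDemo are hypothetical stand-ins: TableSink plays the role of the HBase object, and processPartition mimics the function body that would be passed to foreachPartition. The important assumption is that the table name reaches the executor as an ordinary value captured by the closure, not through the singleton's mutable state.

```scala
// Hypothetical stand-in for the HBase object; no HBase or Spark classes are used.
object TableSink {
  @volatile private var table: Option[String] = None

  // Idempotent: only the first call in a given JVM has any effect.
  def ensureInitialized(name: String): Unit = synchronized {
    if (table.isEmpty) table = Some(name)
  }

  def isInitialized: Boolean = table.isDefined

  // Stands in for insertHbase; fails loudly if the table name was never set.
  def insert(row: String): String = {
    val t = table.getOrElse(
      throw new IllegalStateException("table name not set in this JVM"))
    s"$t:$row"
  }
}

object PartitionDemo {
  // Mimics the foreachPartition body: initialise once, then handle every element.
  def processPartition(tableName: String, partition: Iterator[String]): List[String] = {
    TableSink.ensureInitialized(tableName)
    partition.map(TableSink.insert).toList
  }
}
```

In the streaming job this would correspond to calling the initialization with the local hbaseTable value from main inside foreachPartition, so that each executor JVM sets the name once before its first insert.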