I am trying to run mapPartitions on a Dataset of a user-defined case class.
On my local box it works fine, but when I run it on a YARN cluster I get an exception:
scala.ScalaReflectionException: Person in JavaMirror with sun.misc.Launcher$AppClassLoader#1761e840 of type class sun.misc.Launcher$AppClassLoader in [class path]
I am trying to do this operation:
case class Person(name: String, age: Long)
val df = sparkSession.read.parquet("path").as[Person]
df.mapPartitions((iter: Iterator[Person]) => {
  // computations
})
I am working on Spark 2.0. Can someone help with this?
Related
For a few days I have been wondering whether it is possible to infer a schema for Spark in Scala for a given case class that is unknown at compile time.
The only input is a string containing the FQN of the class (which could be used, for example, to create an instance of the case class at runtime via reflection).
I was wondering whether it is possible to do something like:
package com.my.namespace
case class MyCaseClass (name: String, num: Int)
//Somewhere else in codebase
// coming from external configuration file, so unknown at compile time
val fqn = "com.my.namespace.MyCaseClass"
val schema = Encoders.product[getXYZ(fqn)].schema
Of course, any other technique that does not use Encoders is fine (building a StructType by analysing an instance of the case class? Is that even possible?).
What is the best approach? Is this feasible?
You can use the reflective toolbox:
package com.my.namespace

import org.apache.spark.sql.types.StructType
import scala.reflect.runtime
import scala.tools.reflect.ToolBox

case class MyCaseClass(name: String, num: Int)

object Main extends App {
  val fqn = "com.my.namespace.MyCaseClass"
  val runtimeMirror = runtime.currentMirror
  val toolbox = runtimeMirror.mkToolBox()
  val res = toolbox.eval(toolbox.parse(s"""
    import org.apache.spark.sql.Encoders
    Encoders.product[$fqn].schema
  """)).asInstanceOf[StructType]
  println(res) // StructType(StructField(name,StringType,true), StructField(num,IntegerType,false))
}
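If you prefer not to go through Encoders at all, another option is Spark's internal ScalaReflection helper. This is only a sketch, under the assumption that relying on a Catalyst-internal API is acceptable (it is not public and may change between Spark versions):
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType
import scala.reflect.runtime.{universe => ru}

// Resolve the case class type from its fully qualified name at runtime,
// then let Catalyst derive the schema directly from that type.
val fqn = "com.my.namespace.MyCaseClass"
val mirror = ru.runtimeMirror(getClass.getClassLoader)
val tpe = mirror.staticClass(fqn).toType
val schema = ScalaReflection.schemaFor(tpe).dataType.asInstanceOf[StructType]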
I am trying to broadcast a List and pass the broadcast variable to a UDF (the Scala code is in a separate file), but I am facing issues.
val Lookup_BroadCast = SC.broadcast(lookup_data)
UDF creation with 3 arguments:
val Call_Sub_Pgm = udf(foo(_: String, Lookup_BroadCast: org.apache.spark.broadcast.Broadcast[List[String]], Trace: String))
Calling the UDF using withColumn:
Out_DF = Out_DF.withColumn("Col-1", Call_Sub_Pgm(col(Col-1),Lookup_BroadCast,lit(Trace)))
I am getting a compilation error for the above code: "found broadcast variable, required Sql Column".
If I remove the "Lookup_BroadCast" variable from the call above,
Out_DF = Out_DF.withColumn("Col-1", Call_Sub_Pgm(col(Col-1), lit(Trace)))
then I get the error below:
java.lang.ClassCastException: org.spark.masking.ExtractData$$anonfun$7 cannot be cast to scala.Function0
A Serializable wrapper class can be created for the function, with the Broadcast variable in its constructor:
class Wrapper(Lookup_BroadCast: Broadcast[List[String]]) extends Serializable {
  def foo(v: String, s: String): String = {
    // usage example
    Lookup_BroadCast.value.head
  }
}
And used like this:
val wrapper = new Wrapper(Lookup_BroadCast)
val Call_Sub_Pgm = udf(wrapper.foo(_: String, _: String))
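Since the broadcast is now captured by the wrapper rather than passed as an argument, the call site only receives real column expressions. A sketch, reusing the column name and Trace literal from the question:
// Only genuine columns go to the UDF; the broadcast travels inside the wrapper.
Out_DF = Out_DF.withColumn("Col-1", Call_Sub_Pgm(col("Col-1"), lit(Trace)))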
I am trying to figure out why I am getting an error about encoders; any insight would be helpful!
ERROR Unable to find encoder for type SolrNewsDocument, An implicit Encoder[SolrNewsDocument] is needed to store `
Clearly I have imported spark.implicits._. I have also provided an encoder as a case class.
def ingestDocsToSolr(newsItemDF: DataFrame) = {
  case class SolrNewsDocument(
    title: String,
    body: String,
    publication: String,
    date: String,
    byline: String,
    length: String
  )
  import spark.implicits._
  val solrDocs = newsItemDF.as[SolrNewsDocument].map { doc =>
    val solrDoc = new SolrInputDocument
    solrDoc.setField("title", doc.title.toString)
    solrDoc.setField("body", doc.body)
    solrDoc.setField("publication", doc.publication)
    solrDoc.setField("date", doc.date)
    solrDoc.setField("byline", doc.byline)
    solrDoc.setField("length", doc.length)
    solrDoc
  }
  // can be used for stream SolrSupport.
  SolrSupport.indexDocs("localhost:2181", "collection", 10, solrDocs.rdd)
  val solrServer = SolrSupport.getCachedCloudClient("localhost:2181")
  solrServer.setDefaultCollection("collection")
  solrServer.commit(false, false)
}
Check this one: move the case class declaration before the function declaration.
The encoder is created once the case class definition has been compiled; only then can the compiler use the encoder inside the function declaration.
import spark.implicits._

case class SolrNewsDocument(title: String, body: String, publication: String, date: String, byline: String, length: String)

def ingestDocsToSolr(newsItemDF: DataFrame) = {
  val solrDocs = newsItemDF.as[SolrNewsDocument]
}
I got this error while trying to iterate over a text file. In my case, as of Spark 2.4.x, the issue was that I had to convert it to an RDD first (which used to be implicit):
textFile
  .rdd
  .flatMap(line => line.split(" "))
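For context, here is a minimal sketch assuming textFile comes from spark.read.textFile, which returns a Dataset[String] in Spark 2.x (unlike sparkContext.textFile, which returns an RDD[String] directly):
val textFile = spark.read.textFile("path/to/input.txt") // Dataset[String]; path is hypothetical
val words = textFile
  .rdd                                // drop down to RDD[String]
  .flatMap(line => line.split(" "))   // RDD API, so no implicit Encoder is required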
I am using Scala 2.12 and Avro (org.apache.avro) 1.8.
I have the following schema:
Schema: {"name": "person","type": "record","fields": [{"name": "address","type": {"type" : "record","name" : "AddressUSRecord","fields" : [{"name": "streetaddress", "type": "string"},{"name": "city", "type":"string"}]}}]}
Corresponding Scala case classes are:
case class AddressUSRecord(streetaddress: String, name: String)
case class Address (addressUSRecord: List[AddressUSRecord])
case class Person (person: Address)
I am using GenericRecord to convert an object of my case class PnlRecord into Avro.
val schema = new Schema.Parser().parse(new File(schemaFileName))
val avroRecord = new GenericData.Record(schema)
val writer = new GenericDatumWriter[GenericRecord](schema)
val out = new ByteArrayOutputStream()
val encoder = EncoderFactory.get().binaryEncoder(out, null)
val producer = new KafkaProducer[String, Array[Byte]](properties)
avroRecord.put("header", record.header)
//Please note that this pnlData (see above case class) is complex and created accordingly.
avroRecord.put("pnlData", record.pnlData)
writer.write(avroRecord, encoder)
encoder.flush()
val bytes = out.toByteArray // read the bytes only after flushing the encoder
out.close()
I am getting the following error:
2019-03-13 21:57:29.832 [application-akka.actor.default-dispatcher-4] ERROR controllers.SAController.$anonfun$publishToSA$2(34) - ca.company.project.sa.model.MessageHeader cannot be cast to org.apache.avro.generic.IndexedRecord
java.lang.ClassCastException: ca.company.project.sa.model.MessageHeader cannot be cast to org.apache.avro.generic.IndexedRecord
at org.apache.avro.generic.GenericData.getField(GenericData.java:697)
at org.apache.avro.generic.GenericData.getField(GenericData.java:712)
at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:164)
at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
Why can't my MessageHeader case class be cast to IndexedRecord? What am I missing here?
How do we convert such a complex case class to an Avro object? Can someone help with an example of converting a nested case class like this to an Avro record?
Thanks in advance.
The Confluent Kafka Avro serializer is Java-based and, as such, is most likely not designed to work with Scala objects. I see that your pnlBreakdown is declared as List[PnlBreakdown]; if this is a Scala list, the serializer won't even recognize it as a collection. The same goes for case classes: these won't be recognized as Java Beans without the @BeanProperty annotations.
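The ClassCastException itself comes from GenericDatumWriter: any field whose Avro type is a record must be populated with a GenericRecord (an IndexedRecord), not with a raw Scala case class instance. A minimal sketch for the nested schema posted in the question, with made-up field values:
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}

// Build the nested record explicitly instead of putting a case class instance
// into the parent record. Schema and field names come from the question's schema;
// the concrete values are hypothetical.
val addressSchema: Schema = schema.getField("address").schema() // nested AddressUSRecord schema
val addressRecord: GenericRecord = new GenericData.Record(addressSchema)
addressRecord.put("streetaddress", "123 Main St")
addressRecord.put("city", "Toronto")

val personRecord: GenericRecord = new GenericData.Record(schema)
personRecord.put("address", addressRecord) // a GenericRecord, so the writer can treat it as an IndexedRecord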
I created a project 'spark-udf' and wrote a Hive UDF as below:
package com.spark.udf

import org.apache.hadoop.hive.ql.exec.UDF

class UpperCase extends UDF with Serializable {
  def evaluate(input: String): String = {
    input.toUpperCase
  }
}
Built it & created jar for it. Tried to use this udf in another spark program:
spark.sql("CREATE OR REPLACE FUNCTION uppercase AS 'com.spark.udf.UpperCase' USING JAR '/home/swapnil/spark-udf/target/spark-udf-1.0.jar'")
But the following line is giving me an exception:
spark.sql("select uppercase(Car) as NAME from cars").show
Exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: No
handler for UDAF 'com.spark.udf.UpperCase'. Use
sparkSession.udf.register(...) instead.; line 1 pos 7 at
org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeFunctionExpression(SessionCatalog.scala:1105)
at
org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$makeFunctionBuilder$1.apply(SessionCatalog.scala:1085)
at
org.apache.spark.sql.catalyst.catalog.SessionCatalog$$anonfun$org$apache$spark$sql$catalyst$catalog$SessionCatalog$$makeFunctionBuilder$1.apply(SessionCatalog.scala:1085)
at
org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:115)
at
org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction(SessionCatalog.scala:1247)
at
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$16$$anonfun$applyOrElse$6$$anonfun$applyOrElse$52.apply(Analyzer.scala:1226)
at
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$16$$anonfun$applyOrElse$6$$anonfun$applyOrElse$52.apply(Analyzer.scala:1226)
at
org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
Any help around this is really appreciated.
As mentioned in the comments, it's better to write a Spark UDF:
val uppercaseUDF = spark.udf.register("uppercase", (s : String) => s.toUpperCase)
spark.sql("select uppercase(Car) as NAME from cars").show
The main cause is that you didn't call enableHiveSupport when creating the SparkSession. In that situation the default SessionCatalog is used, and its makeFunctionExpression function scans only for user-defined aggregate functions. If the function is not a UDAF, it won't be found.
I created a Jira task to implement this.
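For illustration, a minimal sketch of building the session with Hive support enabled, so the Hive-aware catalog can resolve functions registered via CREATE FUNCTION ... USING JAR (the app name is just an example):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-udf-demo") // hypothetical app name
  .enableHiveSupport()       // use the Hive session catalog instead of the default one
  .getOrCreate()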
The issue is that the class needs to be public.
package com.spark.udf

import org.apache.hadoop.hive.ql.exec.UDF

// Scala classes are public by default, so no modifier is needed;
// just make sure the class is not declared private.
class UpperCase extends UDF with Serializable {
  def evaluate(input: String): String = {
    input.toUpperCase
  }
}