I am trying to define a UDF in Spark (2.0) from a string containing a Scala function definition. Here is the snippet:
val universe: scala.reflect.runtime.universe.type = scala.reflect.runtime.universe
import universe._
import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox
val toolbox = currentMirror.mkToolBox()
val f = udf(toolbox.eval(toolbox.parse("(s:String) => 5")).asInstanceOf[String => Int])
sc.parallelize(Seq("1","5")).toDF.select(f(col("value"))).show
This gives me an error:
Caused by: java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2133)
at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1305)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2024)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
However, when I define the UDF as:
val f = udf((s:String) => 5)
it works just fine. What is the issue here? The end objective is to take a string that contains the definition of a Scala function and use it as a UDF.
As Giovanny observed, the problem lies in the class loaders being different (you can investigate this by calling .getClass.getClassLoader on the objects involved). Then, when the workers try to deserialize your reflected function, all hell breaks loose.
Here is a solution that does not involve any class loader hackery. The idea is to move the reflection step to the workers. We'll end up having to redo the reflection, but only once per worker. I think this is close to optimal: even if you did the reflection only once on the master node, you would still have to do a fair bit of work per worker to get the workers to recognize the function.
val f = udf {
  new Function1[String, Int] with Serializable {
    import scala.reflect.runtime.universe._
    import scala.reflect.runtime.currentMirror
    import scala.tools.reflect.ToolBox

    lazy val toolbox = currentMirror.mkToolBox()

    lazy val func = {
      println("reflected function") // triggered at every worker
      toolbox.eval(toolbox.parse("(s: String) => 5")).asInstanceOf[String => Int]
    }

    def apply(s: String): Int = func(s)
  }
}
Then, calling sc.parallelize(Seq("1","5")).toDF.select(f(col("value"))).show works just fine.
Feel free to comment out the println - it is just an easy way of counting how many times the reflection happened. In spark-shell --master 'local' that's only once, but in spark-shell --master 'local[2]' it's twice.
How it works
The UDF gets evaluated immediately, but it never gets used until it reaches the worker nodes, so the lazy values toolbox and func only get evaluated on the workers. Furthermore, since they are lazy, they only ever get evaluated once per worker.
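The same idea extends to any expensive or non-serializable helper: keep a small Serializable wrapper and build the helper lazily (and @transient) so it is constructed where it is used instead of being shipped from the driver. A rough sketch, with a hypothetical regex helper that is not taken from the question:

class MatcherHolder(pattern: String) extends Serializable {
  // Excluded from serialization; rebuilt lazily on first use after the closure is deserialized.
  @transient lazy val compiled: java.util.regex.Pattern =
    java.util.regex.Pattern.compile(pattern)
}

val holder = new MatcherHolder("[0-9]+")
val digitsOnly = sc.parallelize(Seq("a1", "b", "c22"))
  .filter(s => holder.compiled.matcher(s).find())
digitsOnly.collect() // Array(a1, c22)

Only the small wrapper (with its pattern string) crosses the wire; the compiled Pattern is rebuilt lazily wherever the closure runs.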
I had the same error. It does not surface as a ClassNotFoundException because the JavaDeserializationStream class catches that exception; depending on your environment, the real failure is that it couldn't find the class you're trying to execute from your RDD/Dataset. To fix this, I had to generate a jar with all the classes of my project (including the function and its dependencies) and include that jar in the Spark environment.
This is for a standalone cluster:
conf.setJars(Array("/fullpath/yourgeneratedjar.jar", "/fullpath/otherdependencies.jar"))
and this for a YARN cluster:
conf.set("spark.yarn.jars", "/fullpath/yourgeneratedjar.jar,/fullpath/otherdependencies.jar")
Related
I'm trying to read a single row from a Hive table and use its fields as constants in the rest of my Spark application.
object IniConstHive extends Serializable {
  val techTbl: DataFrame = spark.table("my_db.my_conf_table").orderBy(col("my_date").desc)
  val firstrow: Row = techTbl.head
  val my_firstfield: Double = firstrow.getAs[java.math.BigDecimal](0).doubleValue
  val my_secondfield: Double = firstrow.getAs[java.math.BigDecimal](1).doubleValue
}
Unfortunately, I get this exception when I call IniConstHive.my_firstfield:
Caused by: java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
at org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:676)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:329)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Is this the right approach to do this? Is there a more efficient way to achieve the result?
It turned out the problem was not in the IniConstHive object but in the place where I called it. In fact, I was trying to call IniConstHive.my_firstfield inside a custom UDAF, which runs inside the executors. For this reason my application tried to instantiate multiple SparkContexts outside the driver.
I solved it by calling IniConstHive.my_firstfield on the driver and then passing the result as a parameter to the UDAF constructor.
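A minimal sketch of that approach, not taken from the original post: the ScaledSum aggregator, the df DataFrame, and the column names are hypothetical; the point is only that the constant is resolved on the driver and travels into the UDAF by value.

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Hypothetical UDAF that sums (value * factor); factor is fixed at construction time.
class ScaledSum(factor: Double) extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("sum", DoubleType) :: Nil)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0.0
  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) buffer(0) = buffer.getDouble(0) + input.getDouble(0) * factor
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
  def evaluate(buffer: Row): Double = buffer.getDouble(0)
}

// Resolve the constant on the driver, then pass it in by value:
val factor = IniConstHive.my_firstfield
val scaledSum = new ScaledSum(factor)
df.groupBy(col("key")).agg(scaledSum(col("amount"))).show()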
Below I provide my code. I iterate over the DataFrame prodRows and for each product_PK I find some matching sub-list of product_PKs from prodRows.
val numRecProducts = 10
var listOfProducts: Map[Long, Array[(Long, Int)]] = Map()
prodRows.foreach { row: Row =>
  val product_PK = row.get(row.fieldIndex("product_PK")).toString.toLong
  val gender = row.get(row.fieldIndex("gender_PK")).toString
  val selection = prodRows.filter($"gender_PK" === gender || $"gender_PK" === "UNISEX").limit(numRecProducts).select($"product_PK")
  var productList: Array[(Long, Int)] = Array()
  if (!selection.rdd.isEmpty()) {
    productList = selection.rdd.map(x => (x(0).toString.toLong, 1)).collect()
  }
  listOfProducts = listOfProducts + (product_PK -> productList)
}
But when I execute it, it gives me the following error. It looks like selection is empty in some iterations. However, I do not understand how I can handle this error:
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1690)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1678)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1677)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1677)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:855)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:855)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:855)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1905)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1860)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1849)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:671)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:918)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:916)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.foreach(RDD.scala:916)
at org.apache.spark.sql.Dataset$$anonfun$foreach$1.apply$mcV$sp(Dataset.scala:2325)
at org.apache.spark.sql.Dataset$$anonfun$foreach$1.apply(Dataset.scala:2325)
at org.apache.spark.sql.Dataset$$anonfun$foreach$1.apply(Dataset.scala:2325)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2823)
at org.apache.spark.sql.Dataset.foreach(Dataset.scala:2324)
at org.test.ComputeNumSim.run(ComputeNumSim.scala:69)
at org.test.ComputeNumSimRunner$.main(ComputeNumSimRunner.scala:19)
at org.test.ComputeNumSimRunner.main(ComputeNumSimRunner.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:635)
Caused by: java.lang.NullPointerException
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:170)
at org.apache.spark.sql.Dataset$.apply(Dataset.scala:61)
at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2877)
at org.apache.spark.sql.Dataset.filter(Dataset.scala:1304)
at org.test.ComputeNumSim$$anonfun$run$1.apply(ComputeNumSim.scala:74)
at org.test.ComputeNumSim$$anonfun$run$1.apply(ComputeNumSim.scala:69)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
What does it mean and how can I handle it?
You cannot access any of Spark's "driver-side" abstractions (RDDs, DataFrames, Datasets, SparkSession...) from within a function passed on to one of Spark's DataFrame/RDD transformations. You also cannot update driver-side mutable objects from within these functions.
In your case - you're trying to use prodRows and selection (both are DataFrames) within a function passed to DataFrame.foreach. You're also trying to update listOfProducts (a local driver-side variable) from within that same function.
Why?
DataFrames, RDDs, and SparkSession only exist on your Driver application. They serve as a "handle" to access data distributed over the cluster of worker machines.
Functions passed to RDD/DataFrame transformations get serialized and sent to that cluster, to be executed on the data partitions on each of the worker machines. When the serialized DataFrames/RDDs get deserialized on those machines, they are useless: they can no longer represent the data on the cluster, as they are just hollow copies of the ones created on the driver application, which is what actually maintains the connection to the cluster machines.
For the same reason, attempting to update driver-side variables will fail: the variables (starting out as empty, in most cases) will be serialized, deserialized on each of the workers, get updated locally on the workers, and stay there... the original driver-side variable will remain unchanged.
How can you solve this?
When working with Spark, especially with DataFrames, you should try to avoid "iteration" over the data, and use DataFrame's declarative operations instead. In most cases, when you want to reference data of another DataFrame for each record in your DataFrame, you'd want to use join to create a new DataFrame with records combining data from the two DataFrames.
In this specific case, here's a roughly equivalent solution that does what you're trying to do, if I understood it correctly. Try to use this and read the DataFrame documentation to figure out the details:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val numRecProducts = 10

val result = prodRows.as("left")
  // self-join by gender:
  .join(prodRows.as("right"), $"left.gender_PK" === $"right.gender_PK" || $"right.gender_PK" === "UNISEX")
  // limit to 10 results per record:
  .withColumn("rn", row_number().over(Window.partitionBy($"left.product_PK").orderBy($"right.product_PK")))
  .filter($"rn" <= numRecProducts).drop($"rn")
  // group and collect_list to create products column:
  .groupBy($"left.product_PK" as "product_PK")
  .agg(collect_list(struct($"right.product_PK", lit(1))) as "products")
The problem is that you try to access prodRows from within prodRows.foreach. You cannot use a DataFrame within a transformation; DataFrames only exist on the driver.
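If the looked-up data is small, another option (not from the original answers) is to collect it to the driver and broadcast it, so the executors work with a plain local collection instead of a DataFrame. A rough sketch, assuming product_PK is a Long column and gender_PK a String column; adjust the getAs calls otherwise:

// Collect the small lookup data to the driver as plain Scala values...
val lookup: Array[(Long, String)] = prodRows
  .select($"product_PK", $"gender_PK")
  .collect()
  .map(r => (r.getAs[Long]("product_PK"), r.getAs[String]("gender_PK")))

// ...and broadcast it so every executor gets a read-only local copy.
val lookupB = spark.sparkContext.broadcast(lookup)

val listOfProducts: Map[Long, Array[(Long, Int)]] = prodRows
  .rdd
  .map { row =>
    val pk = row.getAs[Long]("product_PK")
    val gender = row.getAs[String]("gender_PK")
    val matches = lookupB.value
      .filter { case (_, g) => g == gender || g == "UNISEX" }
      .take(numRecProducts)
      .map { case (otherPk, _) => (otherPk, 1) }
    pk -> matches
  }
  .collect()
  .toMap

This only makes sense when prodRows is small enough to fit on the driver; otherwise the join-based solution above is the right tool.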
I have the following method:
def loadData(a:String, b:String)(implicit sparkContext: SparkContext) : RDD[Result]
I am trying to test it using this SharedSparkContext: https://github.com/holdenk/spark-testing-base/wiki/SharedSparkContext.
So, I made my test class extend SharedSparkContext:
class Ingest$Test extends FunSuite with SharedSparkContext
And within the test method I made this call:
val res: RDD[Result] = loadData("x", "y")
However, I am getting this error:
Error:(113, 64) could not find implicit value for parameter sparkContext: org.apache.spark.SparkContext
val result: RDD[Result] = loadData("x", "y")
So how can I make the SparkContext from the testing method visible?
EDIT:
I don't see how the question is related to Understanding implicit in Scala
What is the variable name of your SparkContext? If it is sc, as is typically the case, you will have to alias it to the variable name the method is looking for via implicit val sparkContext = sc, and then call your method in the same scope.
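A minimal sketch of what that might look like in the test class (the class name, test name, arguments, and assertion are placeholders; loadData and Result are the method and type from the question and are assumed to be in scope):

import com.holdenkarau.spark.testing.SharedSparkContext
import org.apache.spark.rdd.RDD
import org.scalatest.FunSuite

class IngestTest extends FunSuite with SharedSparkContext {
  test("loadData returns results") {
    // SharedSparkContext exposes the context as `sc`; alias it so the
    // implicit parameter of loadData can be resolved.
    implicit val sparkContext: org.apache.spark.SparkContext = sc

    val res: RDD[Result] = loadData("x", "y")
    assert(res.count() >= 0)
  }
}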
I am trying to use Spark 1.5.1 (with Scala 2.10.2) to read some .avro files from HDFS (with spark-avro 1.7.7), in order to do some computation on them.
Now, assuming that I have already searched the web thoroughly for a solution (the best link so far is this one, which suggests using a GenericRecord, while this one reports the same issue, and this one just does not work for me because it gives almost the same code I have used), I ask here, because someone may have had the same problem. This is the code:
import org.apache.avro.mapred.{AvroInputFormat, AvroWrapper}
import org.apache.hadoop.io.NullWritable
import org.apache.spark.{SparkConf, SparkContext}

object SparkPOC {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SparkPOC")
      .set("spark.master", "local[4]")
    val sc = new SparkContext(conf)
    val path = args(0)

    val profiles = sc.hadoopFile(
      path,
      classOf[AvroInputFormat[MyRecord]],
      classOf[AvroWrapper[MyRecord]],
      classOf[NullWritable]
    )

    val timeStamps = profiles.map { p => p._1.datum.getTimeStamp().toString }
    timeStamps.foreach(print)
  }
}
And I get the following message:
java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to packagename.MyRecord
at packagename.SparkPOC$$anonfun$1.apply(SparkPOC.scala:24)
at packagename.SparkPOC$$anonfun$1.apply(SparkPOC.scala:24)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:890)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:890)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Does anybody have a clue? I was also considering the possibility of using spark-avro, but it doesn't support reading from multiple files at the same time (while .hadoopFile supports wildcards). Otherwise, it seems that I have to go for the GenericRecord and use the .get method, losing the advantage of the coded schema (MyRecord).
Thanks in advance.
I usually read it in as a GenericRecord and explicitly cast as necessary, i.e.:
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.avro.mapred.AvroKey
import org.apache.hadoop.io.NullWritable

val conf = sc.hadoopConfiguration
sc.newAPIHadoopFile(
    path,
    classOf[AvroKeyInputFormat[GenericRecord]],
    classOf[AvroKey[GenericRecord]],
    classOf[NullWritable],
    conf
  )
  .map(_._1.datum().asInstanceOf[MyRecord])
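If the cast still fails because the datum really is a GenericData.Record, the same data can be read through the GenericRecord API instead, looking fields up by name. A sketch reusing the imports above; the "timeStamp" field name is a guess based on the getTimeStamp accessor in the question:

val timeStamps = sc.newAPIHadoopFile(
    path,
    classOf[AvroKeyInputFormat[GenericRecord]],
    classOf[AvroKey[GenericRecord]],
    classOf[NullWritable],
    sc.hadoopConfiguration
  )
  .map { case (key, _) =>
    // Field access by name, no generated class required:
    key.datum().get("timeStamp").toString
  }
timeStamps.foreach(print)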
The problem went away after I set the KryoSerializer and a spark.kryo.registrator class, as follows:
val config = new SparkConf()
  .setAppName(appName)
  .set("spark.master", master)
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.mypackage.AvroKryoRegistrator")
where AvroKryoRegistrator is something like this.
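The linked registrator is not reproduced in the post; as a rough sketch of what such a class can look like (assuming a generated Avro class MyRecord, and without whatever dedicated Avro serializer the original link may have used):

package com.mypackage

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Registers the Avro-generated class with Kryo so executors serialize it consistently;
// a real registrator may additionally attach a dedicated Avro serializer.
class AvroKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[MyRecord])
  }
}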