Spark Scala Error while saving DataFrame to Hive - scala

I have framed a DataFrame by combining multiple arrays. When I try to save it into a Hive table, I get an ArrayIndexOutOfBoundsException. Below are the code and the error. I tried adding the case class both outside and inside the main def, but I still get the same error.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext, DataFrame}
import org.apache.spark.ml.feature.RFormula
import java.text._
import java.util.Date
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.spark.ml.regression.LinearRegressionModel
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.sql.hive.HiveContext
//case class Rows(col1: String, col2: String, col3: String, col4: String, col5: String, col6: String)
object MLRCreate{
  // case class Row(col1: String, col2: String, col3: String, col4: String, col5: String, col6: String)
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("MLRCreate")
    val sc = new SparkContext(conf)
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    import hiveContext.implicits._
    import hiveContext.sql
    val ReqId = new java.text.SimpleDateFormat("yyyyMMddHHmmss").format(new java.util.Date())
    val dirName = "/user/ec2-user/SavedModels/"+ReqId
    val df = Functions.loadData(hiveContext,args(0),args(1))
    val form = args(1).toLowerCase
    val lbl = form.split("~")
    var lrModel:LinearRegressionModel = null;
    val Array(training, test) = df.randomSplit(Array(args(3).toDouble, (1-args(3).toDouble)), seed = args(4).toInt)
    lrModel = Functions.mlr(training)
    var columnnames = Functions.resultColumns(df).substring(1)
    var columnsFinal = columnnames.split(",")
    columnsFinal = "intercept" +: columnsFinal
    var coeff = lrModel.coefficients.toArray.map(_.toString)
    coeff = lrModel.intercept.toString +: coeff
    var stdErr = lrModel.summary.coefficientStandardErrors.map(_.toString)
    var tval = lrModel.summary.tValues.map(_.toString)
    var pval = lrModel.summary.pValues.map(_.toString)
    var Signif:Array[String] = new Array[String](pval.length)
    for(j <- 0 to pval.length-1){
      var sign = pval(j).toDouble;
      sign = Math.abs(sign);
      if(sign <= 0.001){
        Signif(j) = "***";
      }else if(sign <= 0.01){
        Signif(j) = "**";
      }else if(sign <= 0.05){
        Signif(j) = "*";
      }else if(sign <= 0.1){
        Signif(j) = ".";
      }else{
        Signif(j) = " ";
      }
      println(columnsFinal(j)+"#########"+coeff(j)+"#########"+stdErr(j)+"#########"+tval(j)+"#########"+pval(j)+"########"+Signif)
    }
    case class Row(col1: String, col2: String, col3: String, col4: String, col5: String, col6: String)
    // print(columnsFinali.mkString("#"),coeff.mkString("#"),stdErr.mkString("#"),tval.mkString("#"),pval.mkString("#"))
    val sums = Array(columnsFinal, coeff, stdErr, tval, pval, Signif).transpose
    val rdd = sc.parallelize(sums).map(ys => Row(ys(0), ys(1), ys(2), ys(3),ys(4),ys(5)))
    // val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    // import hiveContext.implicits._
    // import hiveContext.sql
    val result = rdd.toDF("Name","Coefficients","Std_Error","tValue","pValue","Significance")
    result.show()
    result.saveAsTable("iaw_model_summary.IAW_"+ReqId)
    print(ReqId)
    lrModel.save(dirName)
  }
}
And the following is the error I get:
16/05/12 07:17:25 ERROR Executor: Exception in task 2.0 in stage 23.0 (TID 839)
java.lang.ArrayIndexOutOfBoundsException: 1
at IAWMLRCreate$$anonfun$5.apply(IAWMLRCreate.scala:96)
at IAWMLRCreate$$anonfun$5.apply(IAWMLRCreate.scala:96)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

I suggest you check the lengths of the arrays you are transposing: columnsFinal, coeff, stdErr, tval, pval, Signif. If any of them is shorter or longer than the others, some of the rows after the transpose will be incomplete; Scala does not pad with nulls when transposing:
val a1 = Array(1,2,3)
val a2 = Array(5,6)
Array(a1, a2).transpose.foreach(x => println(x.toList))
prints:
List(1, 5)
List(2, 6)
List(3)
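A quick driver-side check along those lines, as a sketch (the names come from the question's code):
val lengths = Map(
  "columnsFinal" -> columnsFinal.length,
  "coeff"        -> coeff.length,
  "stdErr"       -> stdErr.length,
  "tval"         -> tval.length,
  "pval"         -> pval.length,
  "Signif"       -> Signif.length)
println(lengths)                          // inspect the sizes side by side
require(lengths.values.toSet.size == 1,   // fail fast before the transpose produces ragged rows
  s"Arrays must all have the same length, got: $lengths")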

Related

Update to the delta table in spark not working

package jobs
import io.delta.tables.DeltaTable
import model.RandomUtils
import org.apache.spark.sql.streaming.{ OutputMode, Trigger }
import org.apache.spark.sql.{ DataFrame, Dataset, Encoder, Encoders, SparkSession }
import jobs.SystemJob.Rate
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
import java.sql.Timestamp // needed for the Rate case class below
case class Student(firstName: String, lastName: String, age: Long, percentage: Long)
case class Rate(timestamp: Timestamp, value: Long)
case class College(name: String, address: String, principal: String)
object RCConfigDSCCDeltaLake {
def getSpark(): SparkSession = {
SparkSession.builder
.appName("Delta table demo")
.master("local[*]")
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
.getOrCreate()
}
def main(args: Array[String]): Unit = {
val spark = getSpark()
val rate = 1
val studentProfile = "student_profile"
if (!DeltaTable.isDeltaTable(s"spark-warehouse/$studentProfile")) {
val deltaTable: DataFrame = spark.sql(s"CREATE TABLE `$studentProfile` (firstName String, lastName String, age Long, percentage Long) USING delta")
deltaTable.show()
deltaTable.printSchema()
}
val studentProfileDT = DeltaTable.forPath(spark, s"spark-warehouse/$studentProfile")
def processStream(student: Dataset[Student], college: Dataset[College]) = {
val studentQuery = student.writeStream.outputMode(OutputMode.Update()).foreachBatch {
(st: Dataset[Student], y: Long) =>
val listOfStudents = st.collect().toList
println("list of students :::" + listOfStudents)
val (o, n) = ("oldData", "newData")
val colMap = Map(
"firstName" -> col(s"$n.firstName"),
"lastName" -> col(s"$n.lastName"),
"age" -> col(s"$n.age"),
"percentage" -> col(s"$n.percentage"))
studentProfileDT.as(s"$o").merge(st.toDF.as(s"$n"), s"$o.firstName = $n.firstName AND $o.lastName = $n.lastName")
.whenMatched.update(colMap)
.whenNotMatched.insert(colMap)
.execute()
}.start()
val os = spark.readStream.format("delta").load(s"spark-warehouse/$studentProfile").writeStream.format("console")
.outputMode(OutputMode.Append())
.option("truncate", value = false)
.option("checkpointLocation", "retrieved").start()
studentQuery.awaitTermination()
os.awaitTermination()
}
import spark.implicits._
implicit val encStudent: Encoder[Student] = Encoders.product[Student]
implicit val encCollege: Encoder[College] = Encoders.product[College]
def rateStream = spark
.readStream
.format("rate") // <-- use RateStreamSource
.option("rowsPerSecond", rate)
.load()
.as[Rate]
val studentStream: Dataset[Student] = rateStream.filter(_.value % 25 == 0).map {
stu =>
Student(...., ....., ....., .....) //fill with values
}
val collegeStream: Dataset[College] = rateStream.filter(_.value % 40 == 0).map {
stu =>
College(...., ....., ......) //fill with values
}
processStream(studentStream, collegeStream)
}
}
What I am trying to do is a simple UPSERT operation with streaming datasets, but it fails with the error:
22/04/13 19:50:33 ERROR MicroBatchExecution: Query [id = 8cf759fd-9bee-460f-
b0d9-91889c59c524, runId = 55723708-fd3c-4425-a2bc-83d737c37589] terminated with
error
java.lang.UnsupportedOperationException: Detected a data update (for example part-
00000-d026d92e-1798-4d21-a505-67ec72d334e2-c000.snappy.parquet) in the source table
at version 4. This is currently not supported. If you'd like to ignore updates, set
the option 'ignoreChanges' to 'true'. If you would like the data update to be
reflected, please restart this query with a fresh checkpoint directory.
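For reference, the 'ignoreChanges' option named in the error would be set on the Delta stream read. This is only a sketch of where the option goes; whether ignoring rewritten files is acceptable, or whether a fresh checkpoint directory is needed instead, depends on how the table is updated:
val os = spark.readStream
  .format("delta")
  .option("ignoreChanges", "true") // the option suggested by the error message
  .load(s"spark-warehouse/$studentProfile")
  .writeStream
  .format("console")
  .outputMode(OutputMode.Append())
  .option("truncate", value = false)
  .option("checkpointLocation", "retrieved")
  .start()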
Dependencies :
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2
--packages io.delta:delta-core_2.12:0.7.0
The update query works when the datasets are not streamed but hardcoded.
Am I doing something wrong here?

How do I serialize org.joda.time.DateTime in Spark Streaming using Scala?

I created a DummySource that reads lines from a file and converts them to TaxiRide objects. The problem is that some fields are of type org.joda.time.DateTime, which I parse using org.joda.time.format.{DateTimeFormat, DateTimeFormatter}, and Spark Streaming cannot serialize those fields.
How do I make Spark Streaming serialize them? My code is below, together with the error.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.sense.spark.util.TaxiRideSource
object TaxiRideCountCombineByKey {
def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf()
.setAppName("TaxiRideCountCombineByKey")
.setMaster("local[4]")
val ssc = new StreamingContext(sparkConf, Seconds(1))
val stream = ssc.receiverStream(new TaxiRideSource())
stream.print()
ssc.start() // Start the computation
ssc.awaitTermination() // Wait for the computation to terminate
}
}
import java.io.{BufferedReader, FileInputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import java.util.Locale
import java.util.zip.GZIPInputStream
import org.apache.spark.storage._
import org.apache.spark.streaming.receiver._
import org.joda.time.DateTime
import org.joda.time.format.{DateTimeFormat, DateTimeFormatter}
case class TaxiRide(rideId: Long, isStart: Boolean, startTime: DateTime, endTime: DateTime,
startLon: Float, startLat: Float, endLon: Float, endLat: Float,
passengerCnt: Short, taxiId: Long, driverId: Long)
class TaxiRideSource extends Receiver[TaxiRide](StorageLevel.MEMORY_AND_DISK_2) {
val timeFormatter: DateTimeFormatter = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss").withLocale(Locale.US).withZoneUTC()
val dataFilePath = "/home/flink/nycTaxiRides.gz";
val delayInNanoSeconds: Long = 1000000
def onStart() {new Thread("TaxiRide Source") {override def run() {receive()}}.start()}
def onStop() {}
private def receive() {
while (!isStopped()) {
val gzipStream = new GZIPInputStream(new FileInputStream(dataFilePath))
val reader: BufferedReader = new BufferedReader(new InputStreamReader(gzipStream, StandardCharsets.UTF_8))
var line: String = ""
while (reader.ready() && (line = reader.readLine()) != null) {
val startTime = System.nanoTime
// read the line on the file and yield the object
val taxiRide: TaxiRide = getTaxiRideFromString(line)
store(Iterator(taxiRide))
Thread.sleep(1000)
}
}
}
def getTaxiRideFromString(line: String): TaxiRide = {
// println(line)
val tokens: Array[String] = line.split(",")
if (tokens.length != 11) {
throw new RuntimeException("Invalid record: " + line)
}
val rideId: Long = tokens(0).toLong
val (isStart, startTime, endTime) = tokens(1) match {
case "START" => (true, DateTime.parse(tokens(2), timeFormatter), DateTime.parse(tokens(3), timeFormatter))
case "END" => (false, DateTime.parse(tokens(2), timeFormatter), DateTime.parse(tokens(3), timeFormatter))
case _ => throw new RuntimeException("Invalid record: " + line)
}
val startLon: Float = if (tokens(4).length > 0) tokens(4).toFloat else 0.0f
val startLat: Float = if (tokens(5).length > 0) tokens(5).toFloat else 0.0f
val endLon: Float = if (tokens(6).length > 0) tokens(6).toFloat else 0.0f
val endLat: Float = if (tokens(7).length > 0) tokens(7).toFloat else 0.0f
val passengerCnt: Short = tokens(8).toShort
val taxiId: Long = tokens(9).toLong
val driverId: Long = tokens(10).toLong
TaxiRide(rideId, isStart, startTime, endTime, startLon, startLat, endLon, endLat, passengerCnt, taxiId, driverId)
}
}
The error is this:
20/06/17 11:27:26 ERROR TaskSetManager: Failed to serialize task 59, not attempting to retry it.
java.io.NotSerializableException: org.joda.time.format.DateTimeFormatter
Serialization stack:
- object not serializable (class: org.joda.time.format.DateTimeFormatter, value: org.joda.time.format.DateTimeFormatter@19bf2eee)
- field (class: org.sense.spark.util.TaxiRideSource, name: timeFormatter, type: class org.joda.time.format.DateTimeFormatter)
- object (class org.sense.spark.util.TaxiRideSource, org.sense.spark.util.TaxiRideSource@2fef7647)
- element of array (index: 0)
- array (class [Lorg.apache.spark.streaming.receiver.Receiver;, size 1)
- field (class: scala.collection.mutable.WrappedArray$ofRef, name: array, type: class [Ljava.lang.Object;)
- object (class scala.collection.mutable.WrappedArray$ofRef, WrappedArray(org.sense.spark.util.TaxiRideSource@2fef7647))
- writeObject data (class: org.apache.spark.rdd.ParallelCollectionPartition)
- object (class org.apache.spark.rdd.ParallelCollectionPartition, org.apache.spark.rdd.ParallelCollectionPartition@107f)
- field (class: org.apache.spark.scheduler.ResultTask, name: partition, type: interface org.apache.spark.Partition)
- object (class org.apache.spark.scheduler.ResultTask, ResultTask(59, 0))
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.scheduler.TaskSetManager$$anonfun$resourceOffer$1.apply(TaskSetManager.scala:472)
at org.apache.spark.scheduler.TaskSetManager$$anonfun$resourceOffer$1.apply(TaskSetManager.scala:453)
at scala.Option.map(Option.scala:146)
at org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:453)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet$1.apply$mcVI$sp(TaskSchedulerImpl.scala:295)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at org.apache.spark.scheduler.TaskSchedulerImpl.org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet(TaskSchedulerImpl.scala:290)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4$$anonfun$apply$9.apply(TaskSchedulerImpl.scala:375)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4$$anonfun$apply$9.apply(TaskSchedulerImpl.scala:373)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4.apply(TaskSchedulerImpl.scala:373)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4.apply(TaskSchedulerImpl.scala:370)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:370)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:85)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:64)
at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
20/06/17 11:27:26 INFO TaskSchedulerImpl: Removed TaskSet 59.0, whose tasks have all completed, from pool
AFAIK you can't serialize it.
The best option is to define it as a constant in a separate object:
package abc.xyz
import java.util.Locale
import org.joda.time.format.{DateTimeFormat, DateTimeFormatter}

object Constant {
  val timeFormatter: DateTimeFormatter = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss")
    .withLocale(Locale.US).withZoneUTC()
}
Add the import below wherever you need the formatter.
import abc.xyz.Constant._
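An alternative to a shared constant (not from the answer above, just a common Spark pattern) is to mark the formatter @transient lazy val, so it is skipped during serialization and rebuilt lazily on the executor. A sketch with a hypothetical holder class:
import java.util.Locale
import org.joda.time.format.{DateTimeFormat, DateTimeFormatter}

// @transient excludes the field from Java serialization; lazy re-creates it on first use.
class SafeFormatterHolder extends Serializable {
  @transient private lazy val timeFormatter: DateTimeFormatter =
    DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss").withLocale(Locale.US).withZoneUTC()

  def parseMillis(s: String): Long = timeFormatter.parseDateTime(s).getMillis
}
The same annotation could be applied directly to the timeFormatter field inside TaxiRideSource.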

How to read data from dynamo db table into dataframe?

Below is the code where I am trying to read data from DynamoDB and load it into a DataFrame.
Is it also possible to do the same using Scanamo?
import org.apache.hadoop.io.Text;
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat
import org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.io.LongWritable
var jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("dynamodb.servicename", "dynamodb")
jobConf.set("dynamodb.input.tableName", "GenreRatingCounts") // Pointing to DynamoDB table
jobConf.set("dynamodb.endpoint", "dynamodb.us-east-2.amazonaws.com")
jobConf.set("dynamodb.regionid", "us-east-2")
jobConf.set("dynamodb.throughput.read", "1")
jobConf.set("dynamodb.throughput.read.percent", "1")
jobConf.set("dynamodb.version", "2011-12-05")
jobConf.set("dynamodb.awsAccessKeyId", "XXXXX")
jobConf.set("dynamodb.awsSecretAccessKey", "XXXXXXX")
jobConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
jobConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
var orders = sc.hadoopRDD(jobConf, classOf[DynamoDBInputFormat], classOf[Text], classOf[DynamoDBItemWritable])
orders.map(t => t._2.getItem()).collect.foreach(println)
val simple2: RDD[(String)] = orders.map { case (text, dbwritable) => (dbwritable.toString)}
spark.read.json(simple2).registerTempTable("gooddata")
The output is of type: org.apache.spark.sql.DataFrame = [count: struct<n: string>, genre: struct<s: string> ... 1 more field]
+------+---------+------+
| count| genre|rating|
+------+---------+------+
|[4450]| [Action]| [4]|
|[5548]|[Romance]| [3.5]|
+------+---------+------+
How can I convert these DataFrame column types to String instead of Struct?
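One way to flatten the struct columns shown above: the connector wraps each value in its DynamoDB type tag (n for numbers, s for strings), so the nested fields can be selected out directly. A sketch based on the printed schema (the rating field is assumed to also carry the n tag):
import org.apache.spark.sql.functions.col

val flattened = spark.table("gooddata").select(
  col("count.n").as("count"),
  col("genre.s").as("genre"),
  col("rating.n").as("rating")) // assumption: rating is a numeric attribute
flattened.show()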
EDIT-1
Now I am able to create a DataFrame using the code below and can read data from the DynamoDB table as long as it doesn't contain nulls.
var orders = sc.hadoopRDD(jobConf, classOf[DynamoDBInputFormat], classOf[Text], classOf[DynamoDBItemWritable])
def extractValue : (String => String) = (aws:String) => {
val pat_value = "\\s(.*),".r
val matcher = pat_value.findFirstMatchIn(aws)
matcher match {
case Some(number) => number.group(1).toString
case None => ""
}
}
val col_extractValue = udf(extractValue)
val rdd_add = orders.map {
case (text, dbwritable) => (dbwritable.getItem().get("genre").toString(), dbwritable.getItem().get("rating").toString(),dbwritable.getItem().get("ratingCount").toString())
val df_add = rdd_add.toDF()
.withColumn("genre", col_extractValue($"_1"))
.withColumn("rating", col_extractValue($"_2"))
.withColumn("ratingCount", col_extractValue($"_3"))
.select("genre","rating","ratingCount")
df_add.show
But I am getting the error below if a record has no data in one of the columns (null or blank).
ERROR Executor: Exception in task 0.0 in stage 10.0 (TID 14)
java.lang.NullPointerException
at $line117.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:67)
at $line117.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:66)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1334)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1334)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1334)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
19/12/20 07:48:21 WARN TaskSetManager: Lost task 0.0 in stage 10.0 (TID 14, localhost, executor driver): java.lang.NullPointerException
at $line117.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:67)
at $line117.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:66)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1334)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1334)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1334)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
How do I handle null/blank values while reading from DynamoDB into a DataFrame in Spark/Scala?
After lots of trial and error, below is the solution I have implemented. I was still getting an error while reading from DynamoDB into a DataFrame (via an RDD) whenever a column was blank (no data), so I made sure that instead of leaving that column blank, I write the string "null".
Another option would be to create external Hive tables on top of the DynamoDB tables and then read from those.
Below is the code that first writes the data into DynamoDB and then reads it back using Spark/Scala.
package com.esol.main
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.rdd.RDD
import scala.util.matching.Regex
import java.util.HashMap
import com.amazonaws.services.dynamodbv2.model.AttributeValue
import org.apache.hadoop.io.Text;
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat
import org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.io
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
object dynamoDB {
def main(args: Array[String]): Unit = {
// val enum = Configurations.initializer()
//val table_name = args(0).trim()
implicit val spark = SparkSession.builder().appName("dynamoDB").master("local").getOrCreate()
val sc = spark.sparkContext
import spark.implicits._
// Writing data into table
var jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("dynamodb.output.tableName", "eSol_MapSourceToRaw")
jobConf.set("dynamodb.throughput.write.percent", "0.5")
jobConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
jobConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
jobConf.set("dynamodb.awsAccessKeyId", "XXXXXXXX")
jobConf.set("dynamodb.awsSecretAccessKey", "XXXXXX")
jobConf.set("dynamodb.endpoint", "dynamodb.us-east-1.amazonaws.com")
jobConf.set("dynamodb.regionid", "us-east-1")
jobConf.set("dynamodb.servicename", "dynamodb")
//giving column names is mandatory in below query. else it will fail.
var MapSourceToRaw = spark.sql("select RowKey,ProcessName,SourceType,Source,FileType,FilePath,FileName,SourceColumnDelimeter,SourceRowDelimeter,SourceColumn,TargetTable,TargetColumnFamily,TargetColumn,ColumnList,SourceColumnSequence,UniqueFlag,SourceHeader from dynamo.hive_MapSourceToRaw")
println("read data from hive table : "+ MapSourceToRaw.show())
val df_columns = MapSourceToRaw.columns.toList
var ddbInsertFormattedRDD = MapSourceToRaw.rdd.map(a => {
var ddbMap = new HashMap[String, AttributeValue]()
for(i <- 0 to df_columns.size -1)
{
val col=df_columns(i)
var column= new AttributeValue()
if(a.get(i) == null || a.get(i).toString.isEmpty)
{ column.setS("null")
ddbMap.put(col, column)
}
else
{
column.setS(a.get(i).toString)
ddbMap.put(col, column)
} }
var item = new DynamoDBItemWritable()
item.setItem(ddbMap)
(new Text(""), item)
})
println("ready to write into table")
ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf)
println("data written in dynamo db")
// READING DATA BACK
println("reading data from dynamo db")
jobConf.set("dynamodb.input.tableName", "eSol_MapSourceToRaw")
def extractValue : (String => String) = (aws:String) => {
val pat_value = "\\s(.*),".r
val matcher = pat_value.findFirstMatchIn(aws)
matcher match {
case Some(number) => number.group(1).toString
case None => ""
}
}
val col_extractValue = udf(extractValue)
var dynamoTable = sc.hadoopRDD(jobConf, classOf[DynamoDBInputFormat], classOf[Text], classOf[DynamoDBItemWritable])
val rdd_add = dynamoTable.map {
case (text, dbwritable) => (dbwritable.getItem().get("RowKey").toString(), dbwritable.getItem().get("ProcessName").toString(),dbwritable.getItem().get("SourceType").toString(),
dbwritable.getItem().get("Source").toString(),dbwritable.getItem().get("FileType").toString(),
dbwritable.getItem().get("FilePath").toString(),dbwritable.getItem().get("TargetColumn").toString())
}
val df_add = rdd_add.toDF()
.withColumn("RowKey", col_extractValue($"_1"))
.withColumn("ProcessName", col_extractValue($"_2"))
.withColumn("SourceType", col_extractValue($"_3"))
.withColumn("Source", col_extractValue($"_4"))
.withColumn("FileType", col_extractValue($"_5"))
.withColumn("FilePath", col_extractValue($"_6"))
.withColumn("TargetColumn", col_extractValue($"_7"))
.select("RowKey","ProcessName","SourceType","Source","FileType","FilePath","TargetColumn")
df_add.show
}
}
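If writing the literal string "null" into DynamoDB is not desirable, the NullPointerException on the read side can also be avoided by wrapping each attribute lookup in an Option. A sketch with a hypothetical helper (dynamoTable and AttributeValue come from the code above):
import com.amazonaws.services.dynamodbv2.model.AttributeValue

// Missing attributes come back as null from getItem(), so wrap the lookup in Option.
def attrAsString(item: java.util.Map[String, AttributeValue], name: String): String =
  Option(item.get(name)).map(_.toString).getOrElse("")

val rdd_safe = dynamoTable.map { case (_, dbwritable) =>
  val item = dbwritable.getItem()
  (attrAsString(item, "RowKey"), attrAsString(item, "ProcessName"), attrAsString(item, "SourceType"))
}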

java.lang.UnsupportedOperationException: fieldIndex on a Row without schema is undefined: Exception on row.getAs[String]

The following code throws an exception: Caused by: java.lang.UnsupportedOperationException: fieldIndex on a Row without schema is undefined. It happens when groupByKey and flatMapGroups (with an ExpressionEncoder) are invoked on a DataFrame that was itself produced by a previous groupByKey/flatMapGroups call.
Logical flow:
originalDf->groupByKey->flatMap->groupByKey->flatMap->show
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{ IntegerType, StructField, StructType}
import scala.collection.mutable.ListBuffer
object Test {
def main(args: Array[String]): Unit = {
val values = List(List("1", "One") ,List("1", "Two") ,List("2", "Three"),List("2","4")).map(x =>(x(0), x(1)))
val session = SparkSession.builder.config("spark.master", "local").getOrCreate
import session.implicits._
val dataFrame = values.toDF
dataFrame.show()
dataFrame.printSchema()
val newSchema = StructType(dataFrame.schema.fields
++ Array(
StructField("Count", IntegerType, false)
)
)
val expr = RowEncoder.apply(newSchema)
val tranform = dataFrame.groupByKey(row => row.getAs[String]("_1")).flatMapGroups((key, inputItr) => {
val inputSeq = inputItr.toSeq
val length = inputSeq.size
var listBuff = new ListBuffer[Row]()
var counter : Int= 0
for(i <- 0 until(length))
{
counter+=1
}
for(i <- 0 until length ) {
var x = inputSeq(i)
listBuff += Row.fromSeq(x.toSeq ++ Array[Int](counter))
}
listBuff.iterator
})(expr)
tranform.show
val newSchema1 = StructType(tranform.schema.fields
++ Array(
StructField("Count1", IntegerType, false)
)
)
val expr1 = RowEncoder.apply(newSchema1)
val tranform2 = tranform.groupByKey(row => row.getAs[String]("_1")).flatMapGroups((key, inputItr) => {
val inputSeq = inputItr.toSeq
val length = inputSeq.size
var listBuff = new ListBuffer[Row]()
var counter : Int= 0
for(i <- 0 until(length))
{
counter+=1
}
for(i <- 0 until length ) {
var x = inputSeq(i)
listBuff += Row.fromSeq(x.toSeq ++ Array[Int](counter))
}
listBuff.iterator
})(expr1)
tranform2.show
}
}
Following is the stacktrace
18/11/21 19:39:03 WARN TaskSetManager: Lost task 144.0 in stage 11.0 (TID 400, localhost, executor driver): java.lang.UnsupportedOperationException: fieldIndex on a Row without schema is undefined.
at org.apache.spark.sql.Row$class.fieldIndex(Row.scala:342)
at org.apache.spark.sql.catalyst.expressions.GenericRow.fieldIndex(rows.scala:166)
at org.apache.spark.sql.Row$class.getAs(Row.scala:333)
at org.apache.spark.sql.catalyst.expressions.GenericRow.getAs(rows.scala:166)
at com.quantuting.sparkutils.main.Test$$anonfun$4.apply(Test.scala:59)
at com.quantuting.sparkutils.main.Test$$anonfun$4.apply(Test.scala:59)
at org.apache.spark.sql.execution.AppendColumnsWithObjectExec$$anonfun$9$$anonfun$apply$3.apply(objects.scala:300)
at org.apache.spark.sql.execution.AppendColumnsWithObjectExec$$anonfun$9$$anonfun$apply$3.apply(objects.scala:298)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
How to fix this code?
The reported problem can be avoided by replacing the field-name version of the getAs[T] method (used in the function passed to groupByKey):
groupByKey(row => row.getAs[String]("_1"))
with the field-position version:
groupByKey(row => row.getAs[String](fieldIndexMap("_1")))
where fieldIndexMap maps field names to their corresponding field indexes:
val fieldIndexMap = tranform.schema.fieldNames.zipWithIndex.toMap
As a side note, your function for flatMapGroups can be simplified into something like below:
val tranform2 = tranform.groupByKey(_.getAs[String](fieldIndexMap("_1"))).
flatMapGroups((key, inputItr) => {
val inputSeq = inputItr.toSeq
val length = inputSeq.size
inputSeq.map(r => Row.fromSeq(r.toSeq :+ length))
})(expr1)
The inconsistent behavior between applying the original groupByKey/flatMapGroups methods to "dataFrame" versus "tranform" is apparently related to how the methods handle a DataFrame versus a Dataset[Row].
Solution as received from JIRA on Spark project: https://issues.apache.org/jira/browse/SPARK-26436
This issue is caused by how you create the row:
listBuff += Row.fromSeq(x.toSeq ++ Array[Int](counter))
Row.fromSeq creates a GenericRow, and GenericRow's fieldIndex is not implemented because a GenericRow has no schema.
Changing the line to create a GenericRowWithSchema solves it:
listBuff += new GenericRowWithSchema((x.toSeq ++ Array[Int](counter)).toArray, newSchema)
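For completeness, GenericRowWithSchema takes the row values plus the schema and lives in a Catalyst package, so the fix needs the import below (a sketch of the changed line in context):
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema

// x, counter and listBuff come from the flatMapGroups body above; newSchema is the RowEncoder schema
listBuff += new GenericRowWithSchema((x.toSeq ++ Array[Int](counter)).toArray, newSchema)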

Spark Hadoop Failed to get broadcast

I'm running a spark-submit job and receiving a "Failed to get broadcast_58_piece0..." error. I'm really not sure what I'm doing wrong. Am I overusing UDFs? Is the function too complicated?
As a summary of my objective, I am parsing text from pdfs, which are stored as base64 encoded strings in JSON objects. I'm using Apache Tika to get the text, and trying to make copious use of data frames to make things easier.
I had written a piece of code that ran the text extraction through Tika as a function outside of "main" on the data as an RDD, and that worked flawlessly. When I try to bring the extraction into main as a UDF on data frames, though, it borks in various different ways. Before I got here I was actually trying to write the final data frame as:
valid.toJSON.saveAsTextFile(hdfs_dir)
This was giving me all sorts of "File/Path already exists" headaches.
Current code:
object Driver {
def main(args: Array[String]):Unit = {
val hdfs_dir = args(0)
val spark_conf = new SparkConf().setAppName("Spark Tika HDFS")
val sc = new SparkContext(spark_conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
// load json data into dataframe
val df = sqlContext.read.json("hdfs://hadoophost.com:8888/user/spark/data/in/*")
val extractInfo: (Array[Byte] => String) = (fp: Array[Byte]) => {
val parser:Parser = new AutoDetectParser()
val handler:BodyContentHandler = new BodyContentHandler(Integer.MAX_VALUE)
val config:TesseractOCRConfig = new TesseractOCRConfig()
val pdfConfig:PDFParserConfig = new PDFParserConfig()
val inputstream:InputStream = new ByteArrayInputStream(fp)
val metadata:Metadata = new Metadata()
val parseContext:ParseContext = new ParseContext()
parseContext.set(classOf[TesseractOCRConfig], config)
parseContext.set(classOf[PDFParserConfig], pdfConfig)
parseContext.set(classOf[Parser], parser)
parser.parse(inputstream, handler, metadata, parseContext)
handler.toString
}
val extract_udf = udf(extractInfo)
val df2 = df.withColumn("unbased_media", unbase64($"media_file")).drop("media_file")
val dfRenamed = df2.withColumn("media_corpus", extract_udf(col("unbased_media"))).drop("unbased_media")
val depuncter: (String => String) = (corpus: String) => {
val r = corpus.replaceAll("""[\p{Punct}]""", "")
val s = r.replaceAll("""[0-9]""", "")
s
}
val depuncter_udf = udf(depuncter)
val withoutPunct = dfRenamed.withColumn("sentence", depuncter_udf(col("media_corpus")))
val model = sc.objectFile[org.apache.spark.ml.PipelineModel]("hdfs://hadoophost.com:8888/user/spark/hawkeye-nb-ml-v2.0").first()
val with_predictions = model.transform(withoutPunct)
val fullNameChecker: ((String, String, String, String, String) => String) = (fname: String, mname: String, lname: String, sfx: String, text: String) =>{
val newtext = text.replaceAll(" ", "").replaceAll("""[0-9]""", "").replaceAll("""[\p{Punct}]""", "").toLowerCase
val new_fname = fname.replaceAll(" ", "").replaceAll("""[0-9]""", "").replaceAll("""[\p{Punct}]""", "").toLowerCase
val new_mname = mname.replaceAll(" ", "").replaceAll("""[0-9]""", "").replaceAll("""[\p{Punct}]""", "").toLowerCase
val new_lname = lname.replaceAll(" ", "").replaceAll("""[0-9]""", "").replaceAll("""[\p{Punct}]""", "").toLowerCase
val new_sfx = sfx.replaceAll(" ", "").replaceAll("""[0-9]""", "").replaceAll("""[\p{Punct}]""", "").toLowerCase
val name_full = new_fname.concat(new_mname).concat(new_lname).concat(new_sfx)
val c = name_full.r.findAllIn(newtext).length
c match {
case 0 => "N"
case _ => "Y"
}
}
val fullNameChecker_udf = udf(fullNameChecker)
val stringChecker: ((String, String) => String) = (term: String, text: String) => {
val termLower = term.replaceAll("""[\p{Punct}]""", "").toLowerCase
val textLower = text.replaceAll("""[\p{Punct}]""", "").toLowerCase
val c = termLower.r.findAllIn(textLower).length
c match {
case 0 => "N"
case _ => "Y"
}
}
val stringChecker_udf = udf(stringChecker)
val stringChecker2: ((String, String) => String) = (term: String, text: String) => {
val termLower = term takeRight 4
val textLower = text
val c = termLower.r.findAllIn(textLower).length
c match {
case 0 => "N"
case _ => "Y"
}
}
val stringChecker2_udf = udf(stringChecker)
val valids = with_predictions.withColumn("fname_valid", stringChecker_udf(col("first_name"), col("media_corpus")))
.withColumn("lname_valid", stringChecker_udf(col("last_name"), col("media_corpus")))
.withColumn("fname2_valid", stringChecker_udf(col("first_name_2"), col("media_corpus")))
.withColumn("lname2_valid", stringChecker_udf(col("last_name_2"), col("media_corpus")))
.withColumn("camt_valid", stringChecker_udf(col("chargeoff_amount"), col("media_corpus")))
.withColumn("ocan_valid", stringChecker2_udf(col("original_creditor_account_nbr"), col("media_corpus")))
.withColumn("dpan_valid", stringChecker2_udf(col("debt_provider_account_nbr"), col("media_corpus")))
.withColumn("full_name_valid", fullNameChecker_udf(col("first_name"), col("middle_name"), col("last_name"), col("suffix"), col("media_corpus")))
.withColumn("full_name_2_valid", fullNameChecker_udf(col("first_name_2"), col("middle_name_2"), col("last_name_2"), col("suffix_2"), col("media_corpus")))
valids.write.mode(SaveMode.Overwrite).format("json").save(hdfs_dir)
}
}
Full stack trace starting with error:
16/06/14 15:02:01 WARN TaskSetManager: Lost task 0.0 in stage 4.0 (TID 53, hdpd11n05.squaretwofinancial.com): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:272)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_58_piece0 of broadcast_58
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1222)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.ml.feature.CountVectorizerModel$$anonfun$9$$anonfun$apply$7.apply(CountVectorizer.scala:222)
at org.apache.spark.ml.feature.CountVectorizerModel$$anonfun$9$$anonfun$apply$7.apply(CountVectorizer.scala:221)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
at org.apache.spark.ml.feature.CountVectorizerModel$$anonfun$9.apply(CountVectorizer.scala:221)
at org.apache.spark.ml.feature.CountVectorizerModel$$anonfun$9.apply(CountVectorizer.scala:218)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalExpr43$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51)
at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:263)
... 8 more
Caused by: org.apache.spark.SparkException: Failed to get broadcast_58_piece0 of broadcast_58
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:137)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:175)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1219)
... 25 more
I encountered a similar error.
It turned out to be caused by the broadcast usage in CountVectorizerModel. Following is the detailed cause in my case:
When model.transform() is called, the vocabulary is broadcast and implicitly saved as an attribute broadcastDic in the model. Therefore, if the CountVectorizerModel is saved after calling model.transform(), the private var attribute broadcastDic is saved too. Unfortunately, a broadcast object in Spark is context-sensitive, meaning it is tied to the SparkContext that created it; if that CountVectorizerModel is loaded in a different SparkContext, it will fail to find the previously saved broadcastDic.
So the solution is either to avoid calling model.transform() before saving the model, or to clone the model with model.copy().
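A sketch of the second suggestion (cloning the model before persisting it), assuming the model is a fitted PipelineModel as in the question; the names here are placeholders, not part of the original code:
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.DataFrame

// Per the answer above: persist a fresh copy before transform() captures broadcast state,
// then run predictions with the original instance.
def saveThenTransform(fittedModel: PipelineModel, input: DataFrame, path: String): DataFrame = {
  fittedModel.copy(ParamMap.empty).write.overwrite().save(path)
  fittedModel.transform(input)
}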
For anyone coming across this, it turns out the model I was loading was malformed. I found out by using spark-shell in yarn-client mode and stepping through the code. Loading the model was fine, but running it against the dataframe (model.transform) threw errors about not finding a metadata directory.
I went back, found a good model, ran against that, and it worked fine. This code is actually sound.
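Not stated in either answer, but related to the metadata error mentioned above: if the model was persisted with Spark ML's own writer, it is normally reloaded with PipelineModel.load rather than sc.objectFile. A sketch reusing the path and dataframe names from the question (this assumes the model was saved with model.save(...)):
import org.apache.spark.ml.PipelineModel

// Standard ML persistence API; fails fast if the saved model's metadata directory is missing.
val model = PipelineModel.load("hdfs://hadoophost.com:8888/user/spark/hawkeye-nb-ml-v2.0")
val with_predictions = model.transform(withoutPunct)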