I am pretty new to GraphFrames and Scala. I am writing a label propagation algorithm (quite different from the one in the library). Essentially each vertex has an array "memVector" and each edge has a float value "floatWeights". I want to update each vertex's memVector to be the sum of (floatWeights * memVector) over all of its neighbors. Here is the code I have written for this:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SQLContext
import org.apache.spark.SparkContext
import org.graphframes._
import org.graphframes.lib.AggregateMessages
import org.apache.spark.sql.functions.{udf, sum}
val sqlContext = new SQLContext(sc)
val edges = spark.read.parquet("code/SampleGraphEdge")
val vertices = spark.read.parquet("code/SampleGraphVer")
val toInteger: String => Int = _.toInt
val toIntegerUDF = udf(toInteger)
val newEdges = edges.withColumn("floatWeights", toIntegerUDF('weights)).drop("weights")
val graph = GraphFrame(vertices, newEdges)
val AM = AggregateMessages
val msgToSrc = AM.dst("memVector")
val msgToDst = AM.src("memVector")
val msgFromEdge = AM.edge("floatWeights")
def aggfunc(msg: org.apache.spark.sql.Column) = sum(msg.getField("weights") * AM.msg.getField("memVector"))
val agg = graph.aggregateMessages
  .sendToSrc(msgToSrc).sendToDst(msgToDst)
  .sendToSrc(msgFromEdge).sendToDst(msgFromEdge)
  .agg(aggfunc(AM.msg).as("UpdatedVector"))
Now, the aggfunc I wrote is not right, as I cannot multiply an array and a float directly. I am running the above in spark-shell and I get the following error at the last line:
org.apache.spark.sql.AnalysisException: Can't extract value from MSG#750;
at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:613)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:605)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$2.apply(TreeNode.scala:312)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:311)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:305)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:269)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:279)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:283)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:283)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$8.apply(QueryPlan.scala:288)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:288)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9.applyOrElse(Analyzer.scala:605)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$9.applyOrElse(Analyzer.scala:547)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:547)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:484)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:65)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:63)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:51)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.RelationalGroupedDataset.toDF(RelationalGroupedDataset.scala:62)
at org.apache.spark.sql.RelationalGroupedDataset.agg(RelationalGroupedDataset.scala:222)
at org.graphframes.lib.AggregateMessages.agg(AggregateMessages.scala:127)
... 50 elided
Am I approaching it right? Any workarounds/solutions will be greatly appreciated.
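One possible workaround, sketched below and not tested against this exact schema (it assumes memVector holds doubles and floatWeights is numeric): since Spark SQL has no built-in array-times-scalar operator, do the multiplication in a UDF when building the message, collect the incoming messages per vertex with collect_list, and sum them element-wise with a second UDF instead of the plain sum aggregate.
import org.apache.spark.sql.functions.{udf, col, collect_list}
import org.graphframes.lib.AggregateMessages

val AM = AggregateMessages

// array * scalar is not a built-in SQL operation, so do it in a UDF
val weighted = udf((vec: Seq[Double], w: Double) => vec.map(_ * w))

// element-wise sum of all incoming weighted vectors
val sumVectors = udf((vecs: Seq[Seq[Double]]) =>
  vecs.reduce((a, b) => a.zip(b).map { case (x, y) => x + y }))

val agg = graph.aggregateMessages
  .sendToDst(weighted(AM.src("memVector"), AM.edge("floatWeights")))
  .sendToSrc(weighted(AM.dst("memVector"), AM.edge("floatWeights")))
  .agg(collect_list(AM.msg).as("msgs"))

val updated = agg
  .withColumn("UpdatedVector", sumVectors(col("msgs")))
  .drop("msgs")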
Related
I'm working in Scala & Spark to load a big (60+ GB) JSON file and process it. Since using sparksession.read.json leads to an out-of-memory exception, I've gone the RDD route.
import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.json4s.jackson.Serialization.{read}
val submissions_rdd = sc.textFile("/home/user/repos/concepts/abcde/RS_2019-09")
//val columns_subset = Set("author", "title", "selftext", "score", "created_utc", "subreddit")
case class entry(title: String,
                 selftext: String,
                 score: Double,
                 created_utc: Double,
                 subreddit: String,
                 author: String)
def jsonExtractObject(jsonStr: String) = {
  implicit val formats = org.json4s.DefaultFormats
  read[entry](jsonStr)
}
Upon testing my function on a single entry, I get the desired result:
val res = jsonExtractObject(submissions_rdd.take(1)(0))
res: entry =
entry(Last Ditch Effort,​
https://preview.redd.it/9x4ld036ivj31.jpg?width=780&format=pjpg&auto=webp&s=acaed6cc0d913ec31b54235ca8bb73971bcfe598,1.0,1.567296E9,YellowOnlineUnion,Sgedelta)
The problem is, after mapping the same function over the RDD I'm getting an error:
val subset = submissions_rdd.map(line => jsonExtractObject(line) )
subset.take(5)
org.apache.spark.SparkDriverExecutionException: Execution error
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1485)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2236)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2120)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2139)
at org.apache.spark.rdd.RDD.$anonfun$take$1(RDD.scala:1423)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
at org.apache.spark.rdd.RDD.take(RDD.scala:1396)
... 41 elided
Caused by: java.lang.ArrayStoreException: [Lentry;
at scala.runtime.ScalaRunTime$.array_update(ScalaRunTime.scala:75)
at org.apache.spark.SparkContext.$anonfun$runJob$4(SparkContext.scala:2120)
at org.apache.spark.SparkContext.$anonfun$runJob$4$adapted(SparkContext.scala:2120)
at org.apache.spark.scheduler.JobWaiter.taskSucceeded(JobWaiter.scala:59)
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1481)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2236)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
I would appreciate it if anyone has any hints on how to get around that. Thanks!
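For what it's worth, an ArrayStoreException on take often shows up when the case class is defined interactively in the REPL/notebook, so the deserialized elements end up with a different class than the Array the driver allocates. Assuming that is the cause here, a minimal sketch that sidesteps the case class entirely by extracting plain tuples (field names taken from the entry class above):
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

def jsonExtractTuple(jsonStr: String) = {
  implicit val formats = org.json4s.DefaultFormats
  val j = parse(jsonStr)
  ((j \ "title").extract[String],
   (j \ "selftext").extract[String],
   (j \ "score").extract[Double],
   (j \ "created_utc").extract[Double],
   (j \ "subreddit").extract[String],
   (j \ "author").extract[String])
}

val subset = submissions_rdd.map(jsonExtractTuple)
subset.take(5)   // tuples serialize without the REPL-generated wrapper class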
I am having an issue using a UDF in Spark (Scala). This is a code sample:
import org.apache.spark.sql.{SparkSession, DataFrame}
import org.apache.spark.sql.functions.{col, udf}
val spark = SparkSession.builder.appName("test")
.master("local[*]")
.getOrCreate()
import spark.implicits._
def func(a: Array[Int]): Array[Int] = a
val funcUDF = udf((a: Array[Int]) => func(a))
var data = Seq(Array(1, 2, 3), Array(3, 4, 5), Array(6, 2, 4)).toDF("items")
data = data.withColumn("a", funcUDF(col("items")))
data.show()
The error I get is a ClassCastException saying that scala.collection.mutable.WrappedArray$ofRef cannot be cast to [I (a primitive Int array), thrown from org.apache.spark.sql.catalyst.expressions.ScalaUDF.$anonfun$f$2. I add a part of the stack below. If it can help, I am using https://community.cloud.databricks.com/.
Caused by: java.lang.ClassCastException:
scala.collection.mutable.WrappedArray$ofRef cannot be cast to [I at
org.apache.spark.sql.catalyst.expressions.ScalaUDF.$anonfun$f$2(ScalaUDF.scala:155)
at
org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1125)
at
org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:156)
at
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(InterpretedMutableProjection.scala:83)
at
org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$15.$anonfun$applyOrElse$70(Optimizer.scala:1557)
at
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.immutable.List.foreach(List.scala:392) at
scala.collection.TraversableLike.map(TraversableLike.scala:238) at
scala.collection.TraversableLike.map$(TraversableLike.scala:231) at
scala.collection.immutable.List.map(List.scala:298) at
org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$15.applyOrElse(Optimizer.scala:1557)
at
org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$15.applyOrElse(Optimizer.scala:1552)
at
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:322)
at
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:80)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:322)
at
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
at
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:153)
at
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:151)
at
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
at
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
at
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:327)
at
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:412)
at
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:250)
at
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:410)
at
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:363)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:327)
at
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
at
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:153)
at
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:151)
at
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
at
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
at
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:327)
at
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:412)
at
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:250)
at
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:410)
at
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:363)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:327)
at
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
at
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:153)
at
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:151)
at
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
at
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
at
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:327)
at
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:412)
at
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:250)
at
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:410)
at
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:363)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:327)
at
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
at
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:153)
at
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:151)
at
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
at
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
at
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:311)
at
org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$.apply(Optimizer.scala:1552)
at
org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$.apply(Optimizer.scala:1551)
at
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:152)
at
scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
at
scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
at
scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:38)
at
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:149)
at
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:141)
at scala.collection.immutable.List.foreach(List.scala:392) at
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:141)
at
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:119)
at
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
at
org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:119)
at
org.apache.spark.sql.execution.QueryExecution.$anonfun$optimizedPlan$1(QueryExecution.scala:107)
at
org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
at
org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:171)
at
org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:836)
at
org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:171)
at
org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:104)
at
org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:104)
at
org.apache.spark.sql.execution.QueryExecution.$anonfun$writePlans$4(QueryExecution.scala:246)
at
org.apache.spark.sql.catalyst.plans.QueryPlan$.append(QueryPlan.scala:466)
at
org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$writePlans(QueryExecution.scala:246)
at
org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:256)
at
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:109)
at
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:249)
at
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:101)
at
org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:836)
at
org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:77)
at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:199)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3700) at
org.apache.spark.sql.Dataset.head(Dataset.scala:2711) at
org.apache.spark.sql.Dataset.take(Dataset.scala:2918) at
org.apache.spark.sql.Dataset.getRows(Dataset.scala:305) at
org.apache.spark.sql.Dataset.showString(Dataset.scala:342) at
org.apache.spark.sql.Dataset.show(Dataset.scala:838) at
org.apache.spark.sql.Dataset.show(Dataset.scala:797) at
org.apache.spark.sql.Dataset.show(Dataset.scala:806) at
lineedcf33d032244134ad784ac9de826d3b265.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(command-1114467142343660:14)
at
lineedcf33d032244134ad784ac9de826d3b265.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(command-1114467142343660:164)
at
lineedcf33d032244134ad784ac9de826d3b265.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(command-1114467142343660:166)
at
lineedcf33d032244134ad784ac9de826d3b265.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(command-1114467142343660:168)
at
lineedcf33d032244134ad784ac9de826d3b265.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(command-1114467142343660:170)
at
lineedcf33d032244134ad784ac9de826d3b265.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(command-1114467142343660:172)
at
lineedcf33d032244134ad784ac9de826d3b265.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(command-1114467142343660:174)
at
lineedcf33d032244134ad784ac9de826d3b265.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(command-1114467142343660:176)
at
lineedcf33d032244134ad784ac9de826d3b265.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(command-1114467142343660:178)
at
lineedcf33d032244134ad784ac9de826d3b265.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(command-1114467142343660:180)
at
lineedcf33d032244134ad784ac9de826d3b265.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(command-1114467142343660:182)
at
lineedcf33d032244134ad784ac9de826d3b265.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(command-1114467142343660:184)
at
lineedcf33d032244134ad784ac9de826d3b265.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(command-1114467142343660:186)
at
lineedcf33d032244134ad784ac9de826d3b265.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(command-1114467142343660:188)
at
lineedcf33d032244134ad784ac9de826d3b265.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(command-1114467142343660:190)
at
lineedcf33d032244134ad784ac9de826d3b265.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(command-1114467142343660:192)
at
lineedcf33d032244134ad784ac9de826d3b265.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(command-1114467142343660:194)
at
lineedcf33d032244134ad784ac9de826d3b265.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(command-1114467142343660:196)
at
lineedcf33d032244134ad784ac9de826d3b265.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(command-1114467142343660:198)
at
lineedcf33d032244134ad784ac9de826d3b265.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(command-1114467142343660:200)
at
lineedcf33d032244134ad784ac9de826d3b265.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(command-1114467142343660:202)
at
lineedcf33d032244134ad784ac9de826d3b265.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw.(command-1114467142343660:204)
at
lineedcf33d032244134ad784ac9de826d3b265.$read$$iw$$iw$$iw$$iw$$iw$$iw.(command-1114467142343660:206) at
lineedcf33d032244134ad784ac9de826d3b265.$read$$iw$$iw$$iw$$iw$$iw.(command-1114467142343660:208)
at
lineedcf33d032244134ad784ac9de826d3b265.$read$$iw$$iw$$iw$$iw.(command-1114467142343660:210)
at
lineedcf33d032244134ad784ac9de826d3b265.$read$$iw$$iw$$iw.(command-1114467142343660:212)
at
lineedcf33d032244134ad784ac9de826d3b265.$read$$iw$$iw.(command-1114467142343660:214)
at
lineedcf33d032244134ad784ac9de826d3b265.$read$$iw.(command-1114467142343660:216)
at
lineedcf33d032244134ad784ac9de826d3b265.$read.(command-1114467142343660:218)
at
lineedcf33d032244134ad784ac9de826d3b265.$read$.(command-1114467142343660:222)
at
lineedcf33d032244134ad784ac9de826d3b265.$read$.(command-1114467142343660)
at
lineedcf33d032244134ad784ac9de826d3b265.$eval$.$print$lzycompute(:7)
at
lineedcf33d032244134ad784ac9de826d3b265.$eval$.$print(:6)
at lineedcf33d032244134ad784ac9de826d3b265.$eval.$print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498) at
scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:745)
at
scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1021)
at
scala.tools.nsc.interpreter.IMain.$anonfun$interpret$1(IMain.scala:574)
at
scala.reflect.internal.util.ScalaClassLoader.asContext(ScalaClassLoader.scala:41)
at
scala.reflect.internal.util.ScalaClassLoader.asContext$(ScalaClassLoader.scala:37)
at
scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:41)
at scala.tools.nsc.interpreter.IMain.loadAndRunReq$1(IMain.scala:573)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:600) at
scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:570) at
com.databricks.backend.daemon.driver.DriverILoop.execute(DriverILoop.scala:219)
at
com.databricks.backend.daemon.driver.ScalaDriverLocal.$anonfun$repl$1(ScalaDriverLocal.scala:204)
at
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at
com.databricks.backend.daemon.driver.DriverLocal$TrapExitInternal$.trapExit(DriverLocal.scala:773)
at
com.databricks.backend.daemon.driver.DriverLocal$TrapExit$.apply(DriverLocal.scala:726)
at
com.databricks.backend.daemon.driver.ScalaDriverLocal.repl(ScalaDriverLocal.scala:204)
at
com.databricks.backend.daemon.driver.DriverLocal.$anonfun$execute$10(DriverLocal.scala:431)
at
com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:239)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) at
com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:234)
at
com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:231)
at
com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:48)
at
com.databricks.logging.UsageLogging.withAttributionTags(UsageLogging.scala:276)
at
com.databricks.logging.UsageLogging.withAttributionTags$(UsageLogging.scala:269)
at
com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:48)
at
com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:408)
at
com.databricks.backend.daemon.driver.DriverWrapper.$anonfun$tryExecutingCommand$1(DriverWrapper.scala:653)
at scala.util.Try$.apply(Try.scala:213) at
com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:645)
at
com.databricks.backend.daemon.driver.DriverWrapper.getCommandOutputAndError(DriverWrapper.scala:486)
at
com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:598)
at
com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:391)
at
com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:337)
at
com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:219)
at java.lang.Thread.run(Thread.java:748)
The problem is that Spark passes your "items" column to the UDF as a WrappedArray (the runtime representation of every array-like column), and a WrappedArray cannot be cast to Array[Int]. So I would suggest declaring the UDF parameter as Seq, because WrappedArray is a subclass of Seq but not of Array.
This works:
import org.apache.spark.sql.{SparkSession, DataFrame}
import org.apache.spark.sql.functions.{col, udf}
val spark = SparkSession.builder.appName("test")
.master("local[*]")
.getOrCreate()
import spark.implicits._
def func(a: Array[Int]): Array[Int] = a
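// Take the argument as Seq[Int]: Spark delivers array columns as a WrappedArray, which is a Seq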
val funcUDF = udf((a: Seq[Int]) => func(a.toArray))
var data = Seq(Array(1, 2, 3), Array(3, 4, 5), Array(6, 2, 4)).toDF("items")
data = data.withColumn("a", funcUDF(col("items")))
data.show()
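The only change from the failing version is the parameter type of the lambda passed to udf; func itself is untouched, since a.toArray turns the Seq back into an Array[Int] before the call.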
Below is the code where I am trying to read data from DynamoDB and load it into a dataframe.
Is it possible to do the same using Scanamo?
import org.apache.hadoop.io.Text;
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat
import org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.io.LongWritable
var jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("dynamodb.servicename", "dynamodb")
jobConf.set("dynamodb.input.tableName", "GenreRatingCounts") // Pointing to DynamoDB table
jobConf.set("dynamodb.endpoint", "dynamodb.us-east-2.amazonaws.com")
jobConf.set("dynamodb.regionid", "us-east-2")
jobConf.set("dynamodb.throughput.read", "1")
jobConf.set("dynamodb.throughput.read.percent", "1")
jobConf.set("dynamodb.version", "2011-12-05")
jobConf.set("dynamodb.awsAccessKeyId", "XXXXX")
jobConf.set("dynamodb.awsSecretAccessKey", "XXXXXXX")
jobConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
jobConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
var orders = sc.hadoopRDD(jobConf, classOf[DynamoDBInputFormat], classOf[Text], classOf[DynamoDBItemWritable])
orders.map(t => t._2.getItem()).collect.foreach(println)
val simple2: RDD[(String)] = orders.map { case (text, dbwritable) => (dbwritable.toString)}
spark.read.json(simple2).registerTempTable("gooddata")
The output is of type: org.apache.spark.sql.DataFrame = [count: struct<n: string>, genre: struct<s: string> ... 1 more field]
+------+---------+------+
| count| genre|rating|
+------+---------+------+
|[4450]| [Action]| [4]|
|[5548]|[Romance]| [3.5]|
+------+---------+------+
How can I convert these dataframe columns to String instead of Struct?
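A sketch of one way to flatten those struct columns, assuming the rating struct, like count, stores its value under n (which is how the DynamoDB connector exposes number attributes; that field name is an assumption, since the schema above only shows "1 more field"):
import org.apache.spark.sql.functions.col

val flat = spark.table("gooddata")
  .select(
    col("count.n").as("count"),
    col("genre.s").as("genre"),
    col("rating.n").as("rating"))
flat.show()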
EDIT-1
Now I am able to create a dataframe using the code below and read data from the DynamoDB table, as long as it doesn't contain nulls.
var orders = sc.hadoopRDD(jobConf, classOf[DynamoDBInputFormat], classOf[Text], classOf[DynamoDBItemWritable])
def extractValue : (String => String) = (aws:String) => {
val pat_value = "\\s(.*),".r
val matcher = pat_value.findFirstMatchIn(aws)
matcher match {
case Some(number) => number.group(1).toString
case None => ""
}
}
val col_extractValue = udf(extractValue)
val rdd_add = orders.map {
  case (text, dbwritable) => (dbwritable.getItem().get("genre").toString(), dbwritable.getItem().get("rating").toString(), dbwritable.getItem().get("ratingCount").toString())
}
val df_add = rdd_add.toDF()
.withColumn("genre", col_extractValue($"_1"))
.withColumn("rating", col_extractValue($"_2"))
.withColumn("ratingCount", col_extractValue($"_3"))
.select("genre","rating","ratingCount")
df_add.show
But I am getting the error below if there is a record with no data in one of the columns (null or blank).
ERROR Executor: Exception in task 0.0 in stage 10.0 (TID 14)
java.lang.NullPointerException
at $line117.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:67)
at $line117.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:66)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1334)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1334)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1334)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
19/12/20 07:48:21 WARN TaskSetManager: Lost task 0.0 in stage 10.0 (TID 14, localhost, executor driver): java.lang.NullPointerException
at $line117.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:67)
at $line117.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:66)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1334)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1334)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1334)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.apply(RDD.scala:945)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
How do I handle null/blank values while reading from DynamoDB into a dataframe in Spark/Scala?
After lots of trial and error, below is the solution I implemented. I was still getting an error while reading from DynamoDB into a dataframe via an RDD if any column was blank (no data), so I made sure to write the string "null" instead of leaving that column blank.
Another option would be to create external Hive tables on top of the DynamoDB tables and read from those.
Below is the code to first write the data into DynamoDB and then read it back using Spark/Scala.
package com.esol.main
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.rdd.RDD
import scala.util.matching.Regex
import java.util.HashMap
import com.amazonaws.services.dynamodbv2.model.AttributeValue
import org.apache.hadoop.io.Text;
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat
import org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat
import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.io
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
object dynamoDB {
def main(args: Array[String]): Unit = {
// val enum = Configurations.initializer()
//val table_name = args(0).trim()
implicit val spark = SparkSession.builder().appName("dynamoDB").master("local").getOrCreate()
val sc = spark.sparkContext
import spark.implicits._
// Writing data into table
var jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("dynamodb.output.tableName", "eSol_MapSourceToRaw")
jobConf.set("dynamodb.throughput.write.percent", "0.5")
jobConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
jobConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
jobConf.set("dynamodb.awsAccessKeyId", "XXXXXXXX")
jobConf.set("dynamodb.awsSecretAccessKey", "XXXXXX")
jobConf.set("dynamodb.endpoint", "dynamodb.us-east-1.amazonaws.com")
jobConf.set("dynamodb.regionid", "us-east-1")
jobConf.set("dynamodb.servicename", "dynamodb")
//giving column names is mandatory in below query. else it will fail.
var MapSourceToRaw = spark.sql("select RowKey,ProcessName,SourceType,Source,FileType,FilePath,FileName,SourceColumnDelimeter,SourceRowDelimeter,SourceColumn,TargetTable,TargetColumnFamily,TargetColumn,ColumnList,SourceColumnSequence,UniqueFlag,SourceHeader from dynamo.hive_MapSourceToRaw")
println("read data from hive table : "+ MapSourceToRaw.show())
val df_columns = MapSourceToRaw.columns.toList
var ddbInsertFormattedRDD = MapSourceToRaw.rdd.map(a => {
  var ddbMap = new HashMap[String, AttributeValue]()
  for (i <- 0 to df_columns.size - 1) {
    val col = df_columns(i)
    var column = new AttributeValue()
    if (a.get(i) == null || a.get(i).toString.isEmpty) {
      column.setS("null")
      ddbMap.put(col, column)
    } else {
      column.setS(a.get(i).toString)
      ddbMap.put(col, column)
    }
  }
  var item = new DynamoDBItemWritable()
  item.setItem(ddbMap)
  (new Text(""), item)
})
println("ready to write into table")
ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf)
println("data written in dynamo db")
// READING DATA BACK
println("reading data from dynamo db")
jobConf.set("dynamodb.input.tableName", "eSol_MapSourceToRaw")
def extractValue : (String => String) = (aws:String) => {
val pat_value = "\\s(.*),".r
val matcher = pat_value.findFirstMatchIn(aws)
matcher match {
case Some(number) => number.group(1).toString
case None => ""
}
}
val col_extractValue = udf(extractValue)
var dynamoTable = sc.hadoopRDD(jobConf, classOf[DynamoDBInputFormat], classOf[Text], classOf[DynamoDBItemWritable])
val rdd_add = dynamoTable.map {
case (text, dbwritable) => (dbwritable.getItem().get("RowKey").toString(), dbwritable.getItem().get("ProcessName").toString(),dbwritable.getItem().get("SourceType").toString(),
dbwritable.getItem().get("Source").toString(),dbwritable.getItem().get("FileType").toString(),
dbwritable.getItem().get("FilePath").toString(),dbwritable.getItem().get("TargetColumn").toString())
}
val df_add = rdd_add.toDF()
.withColumn("RowKey", col_extractValue($"_1"))
.withColumn("ProcessName", col_extractValue($"_2"))
.withColumn("SourceType", col_extractValue($"_3"))
.withColumn("Source", col_extractValue($"_4"))
.withColumn("FileType", col_extractValue($"_5"))
.withColumn("FilePath", col_extractValue($"_6"))
.withColumn("TargetColumn", col_extractValue($"_7"))
.select("RowKey","ProcessName","SourceType","Source","FileType","FilePath","TargetColumn")
df_add.show
}
}
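For reference, a hedged alternative to writing the literal string "null" on the write side would be to guard each attribute lookup with Option on the read side, so missing DynamoDB attributes become empty strings instead of throwing a NullPointerException. This is only a sketch, reusing dynamoTable and a few column names from the code above:
import com.amazonaws.services.dynamodbv2.model.AttributeValue

// Returns "" when the attribute is missing instead of blowing up on null
def attr(item: java.util.Map[String, AttributeValue], name: String): String =
  Option(item.get(name)).map(_.toString).getOrElse("")

val safeRdd = dynamoTable.map { case (_, dbwritable) =>
  val item = dbwritable.getItem()
  (attr(item, "RowKey"), attr(item, "ProcessName"), attr(item, "SourceType"))
}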
I'm trying to prepare a DataFrame to be stored in HFile format on HBase using Apache Spark. I'm using Spark 2.1.0, Scala 2.11, and HBase 1.1.2.
Here is my code:
val df = createDataframeFromRow(Row("mlk", "kpo", "opi"), "a b c")
val cols = df.columns.sorted
val colsorteddf = df.select(cols.map(x => col(x)): _*)
val valcols = cols.filterNot(x => x.equals("U_ID"))
So far so good. I only sort the columns of my dataframe
val pdd = colsorteddf.map(row => {
(row(0).toString, (row(1).toString, row(2).toString))
})
val tdd = pdd.flatMap(x => {
val rowKey = PLong.INSTANCE.toBytes(x._1)
for(i <- 0 until valcols.length - 1) yield {
val colname = valcols(i).toString
val colvalue = x._2.productElement(i).toString
val colfam = "data"
(rowKey, (colfam, colname, colvalue))
}
})
After this, I transform each row into this key-value format: (rowKey, (colfam, colname, colvalue)).
Now here's where the problem happens. I try to map each row of tdd into a pair of (ImmutableBytesWritable, KeyValue).
import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes
val output = tdd.map(x => {
val rowKey: Array[Byte] = x._1
val immutableRowKey = new ImmutableBytesWritable(rowKey)
val colfam = x._2._1
val colname = x._2._2
val colvalue = x._2._3
val kv = new KeyValue(
rowKey,
colfam.getBytes(),
colname.getBytes(),
Bytes.toBytes(colvalue.toString)
)
(immutableRowKey, kv)
})
It renders this stack trace:
java.lang.AssertionError: assertion failed: no symbol could be loaded from interface org.apache.hadoop.hbase.classification.InterfaceAudience$Public in object InterfaceAudience with name Public and classloader scala.reflect.internal.util.ScalaClassLoader$URLClassLoader#3269cbb7
at scala.reflect.runtime.JavaMirrors$JavaMirror.scala$reflect$runtime$JavaMirrors$JavaMirror$$classToScala1(JavaMirrors.scala:1021)
at scala.reflect.runtime.JavaMirrors$JavaMirror$$anonfun$classToScala$1.apply(JavaMirrors.scala:980)
at scala.reflect.runtime.JavaMirrors$JavaMirror$$anonfun$classToScala$1.apply(JavaMirrors.scala:980)
at scala.reflect.runtime.JavaMirrors$JavaMirror$$anonfun$toScala$1.apply(JavaMirrors.scala:97)
at scala.reflect.runtime.TwoWayCaches$TwoWayCache$$anonfun$toScala$1.apply(TwoWayCaches.scala:38)
at scala.reflect.runtime.Gil$class.gilSynchronized(Gil.scala:19)
at scala.reflect.runtime.JavaUniverse.gilSynchronized(JavaUniverse.scala:16)
at scala.reflect.runtime.TwoWayCaches$TwoWayCache.toScala(TwoWayCaches.scala:33)
at scala.reflect.runtime.JavaMirrors$JavaMirror.toScala(JavaMirrors.scala:95)
at scala.reflect.runtime.JavaMirrors$JavaMirror.classToScala(JavaMirrors.scala:980)
at scala.reflect.runtime.JavaMirrors$JavaMirror$JavaAnnotationProxy.<init>(JavaMirrors.scala:163)
at scala.reflect.runtime.JavaMirrors$JavaMirror$JavaAnnotationProxy$.apply(JavaMirrors.scala:162)
at scala.reflect.runtime.JavaMirrors$JavaMirror$JavaAnnotationProxy$.apply(JavaMirrors.scala:162)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
at scala.reflect.runtime.JavaMirrors$JavaMirror.scala$reflect$runtime$JavaMirrors$JavaMirror$$copyAnnotations(JavaMirrors.scala:683)
at scala.reflect.runtime.JavaMirrors$JavaMirror$FromJavaClassCompleter.load(JavaMirrors.scala:733)
at scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anonfun$typeParams$1.apply(SynchronizedSymbols.scala:140)
at scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anonfun$typeParams$1.apply(SynchronizedSymbols.scala:133)
at scala.reflect.runtime.Gil$class.gilSynchronized(Gil.scala:19)
at scala.reflect.runtime.JavaUniverse.gilSynchronized(JavaUniverse.scala:16)
at scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$class.gilSynchronizedIfNotThreadsafe(SynchronizedSymbols.scala:123)
at scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$8.gilSynchronizedIfNotThreadsafe(SynchronizedSymbols.scala:168)
at scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$class.typeParams(SynchronizedSymbols.scala:132)
at scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$8.typeParams(SynchronizedSymbols.scala:168)
at scala.reflect.internal.Types$NoArgsTypeRef.typeParams(Types.scala:1926)
at scala.reflect.internal.Types$NoArgsTypeRef.isHigherKinded(Types.scala:1925)
at scala.reflect.internal.transform.UnCurry$class.scala$reflect$internal$transform$UnCurry$$expandAlias(UnCurry.scala:22)
at scala.reflect.internal.transform.UnCurry$$anon$2.apply(UnCurry.scala:26)
at scala.reflect.internal.transform.UnCurry$$anon$2.apply(UnCurry.scala:24)
at scala.collection.immutable.List.loop$1(List.scala:173)
at scala.collection.immutable.List.mapConserve(List.scala:189)
at scala.reflect.internal.tpe.TypeMaps$TypeMap.mapOver(TypeMaps.scala:115)
at scala.reflect.internal.transform.UnCurry$$anon$2.apply(UnCurry.scala:46)
at scala.reflect.internal.transform.Transforms$class.transformedType(Transforms.scala:43)
at scala.reflect.internal.SymbolTable.transformedType(SymbolTable.scala:16)
at scala.reflect.internal.Types$TypeApiImpl.erasure(Types.scala:225)
at scala.
It seems like a Scala reflection issue. Has anyone ever run into the same problem? If so, how did you overcome it?
PS: I'm running this code through spark-shell.
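A sketch of one workaround that has worked for similar errors, assuming the assertion is triggered by Spark trying to derive a Dataset encoder for the annotated HBase classes: drop to the RDD API before mapping to (ImmutableBytesWritable, KeyValue), since RDD transformations don't need encoders.
import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes

// tdd.rdd avoids encoder derivation; the map closure only uses plain serialization
val output = tdd.rdd.map { case (rowKey, (colfam, colname, colvalue)) =>
  val kv = new KeyValue(rowKey, colfam.getBytes, colname.getBytes, Bytes.toBytes(colvalue))
  (new ImmutableBytesWritable(rowKey), kv)
}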
I have framed a DataFrame by combining multiple arrays. When I try to save it into a Hive table, I get an ArrayIndexOutOfBoundsException. Following is the code and the error I got. I tried adding the case class both outside and inside the main def but am still getting the same error.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext, DataFrame}
import org.apache.spark.ml.feature.RFormula
import java.text._
import java.util.Date
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
import org.apache.spark.ml.regression.LinearRegressionModel
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.sql.hive.HiveContext
//case class Rows(col1: String, col2: String, col3: String, col4: String, col5: String, col6: String)
object MLRCreate{
// case class Row(col1: String, col2: String, col3: String, col4: String, col5: String, col6: String)
def main(args: Array[String]) {
val conf = new SparkConf().setAppName("MLRCreate")
val sc = new SparkContext(conf)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext.implicits._
import hiveContext.sql
val ReqId = new java.text.SimpleDateFormat("yyyyMMddHHmmss").format(new java.util.Date())
val dirName = "/user/ec2-user/SavedModels/"+ReqId
val df = Functions.loadData(hiveContext,args(0),args(1))
val form = args(1).toLowerCase
val lbl = form.split("~")
var lrModel:LinearRegressionModel = null;
val Array(training, test) = df.randomSplit(Array(args(3).toDouble, (1-args(3).toDouble)), seed = args(4).toInt)
lrModel = Functions.mlr(training)
var columnnames = Functions.resultColumns(df).substring(1)
var columnsFinal = columnnames.split(",")
columnsFinal = "intercept" +: columnsFinal
var coeff = lrModel.coefficients.toArray.map(_.toString)
coeff = lrModel.intercept.toString +: coeff
var stdErr = lrModel.summary.coefficientStandardErrors.map(_.toString)
var tval = lrModel.summary.tValues.map(_.toString)
var pval = lrModel.summary.pValues.map(_.toString)
var Signif:Array[String] = new Array[String](pval.length)
for(j <- 0 to pval.length-1){
var sign = pval(j).toDouble;
sign = Math.abs(sign);
if(sign <= 0.001){
Signif(j) = "***";
}else if(sign <= 0.01){
Signif(j) = "**";
}else if(sign <= 0.05){
Signif(j) = "*";
}else if(sign <= 0.1){
Signif(j) = ".";
}else{Signif(j) = " ";
}
println(columnsFinal(j)+"#########"+coeff(j)+"#########"+stdErr(j)+"#########"+tval(j)+"#########"+pval(j)+"########"+Signif)
}
case class Row(col1: String, col2: String, col3: String, col4: String, col5: String, col6: String)
// print(columnsFinali.mkString("#"),coeff.mkString("#"),stdErr.mkString("#"),tval.mkString("#"),pval.mkString("#"))
val sums = Array(columnsFinal, coeff, stdErr, tval, pval, Signif).transpose
val rdd = sc.parallelize(sums).map(ys => Row(ys(0), ys(1), ys(2), ys(3),ys(4),ys(5)))
// val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
// import hiveContext.implicits._
// import hiveContext.sql
val result = rdd.toDF("Name","Coefficients","Std_Error","tValue","pValue","Significance")
result.show()
result.saveAsTable("iaw_model_summary.IAW_"+ReqId)
print(ReqId)
lrModel.save(dirName)
}
}
And the following is the error I get:
16/05/12 07:17:25 ERROR Executor: Exception in task 2.0 in stage 23.0 (TID 839)
java.lang.ArrayIndexOutOfBoundsException: 1
at IAWMLRCreate$$anonfun$5.apply(IAWMLRCreate.scala:96)
at IAWMLRCreate$$anonfun$5.apply(IAWMLRCreate.scala:96)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I suggest you check the lengths of the arrays you are transposing: columnsFinal, coeff, stdErr, tval, pval, Signif. If any of these is shorter or longer than the others, some of the rows after the transpose will be incomplete. Scala does not fill in nulls or anything for you when transposing:
val a1 = Array(1,2,3)
val a2 = Array(5,6)
Array(a1, a2).transpose.foreach(x => println(x.toList))
prints:
List(1, 5)
List(2, 6)
List(3)
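A small sketch of a guard you could put in front of the transpose (reusing the array names from the question) so a length mismatch fails fast with a readable message instead of an ArrayIndexOutOfBoundsException inside the Spark job:
val named = Seq("columnsFinal" -> columnsFinal, "coeff" -> coeff, "stdErr" -> stdErr,
                "tval" -> tval, "pval" -> pval, "Signif" -> Signif)
val lengths = named.map { case (name, arr) => s"$name=${arr.length}" }.mkString(", ")
// Fail on the driver with the offending lengths before building the RDD
require(named.map(_._2.length).distinct.size == 1, s"array length mismatch: $lengths")

val sums = Array(columnsFinal, coeff, stdErr, tval, pval, Signif).transpose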