Split a DataFrame Spark Scala [duplicate] - scala

I have this dataframe called rankedDF:
+----------+-----------+----------+-------------+-----------------------+
|TimePeriod|TPStartDate| TPEndDate|TXN_HEADER_ID| Items|
+----------+-----------+----------+-------------+-----------------------+
| 1| 2017-03-01|2017-05-30| TxnHeader1|Womens Socks, Men...|
| 1| 2017-03-01|2017-05-30| TxnHeader4|Mens Pants, Mens ... |
| 1| 2017-03-01|2017-05-30| TxnHeader7|Womens Socks, Men...|
| 2| 2019-03-01|2017-05-30| TxnHeader1|Calcetas Mujer, Calc...|
| 2| 2019-03-01|2017-05-30| TxnHeader4|Pantalones H, Pan ... |
| 2| 2019-03-01|2017-05-30| TxnHeader7|Calcetas Mujer, Pan...|
I need to split this DataFrame by each "TimePeriod" value and pass each piece, with only the Items column, as input to another function.
I've tried this:
val timePeriods = rankedDF.select("TimePeriod").distinct()
At this point I have:
| Time Period |
| 1 |
| 2 |
Based on timePeriods, I need to call my function twice:
timePeriods.foreach{
n=> val justItems = rankedDF.filter(col("TimePeriod")===n.getInt(0)) .select("Items")
}
I was expecting this DataFrame:
|TimePeriod|
|Womens Socks, Men...
|Mens Pants, Mens ...
|Womens Socks, Men...
Instead of that, I’m getting this error:
task 170.0 in stage 40.0 (TID 2223)
java.lang.NullPointerException
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:131)
at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$withPlan(DataFrame.scala:2165)
at org.apache.spark.sql.DataFrame.filter(DataFrame.scala:799)
at com.brierley.versions.FpGrowth$$anonfun$PfpGrowth$1$$anonfun$apply$3.apply(FpGrowth.scala:720)
at com.brierley.versions.FpGrowth$$anonfun$PfpGrowth$1$$anonfun$apply$3.apply(FpGrowth.scala:718)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$32.apply(RDD.scala:912)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$32.apply(RDD.scala:912)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/04/24 11:49:32 WARN TaskSetManager: Lost task 170.0 in stage 40.0 (TID 2223, localhost): java.lang.NullPointerException
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:131)
at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$withPlan(DataFrame.scala:2165)
at org.apache.spark.sql.DataFrame.filter(DataFrame.scala:799)
at com.brierley.versions.FpGrowth$$anonfun$PfpGrowth$1$$anonfun$apply$3.apply(FpGrowth.scala:720)
at com.brierley.versions.FpGrowth$$anonfun$PfpGrowth$1$$anonfun$apply$3.apply(FpGrowth.scala:718)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$32.apply(RDD.scala:912)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$32.apply(RDD.scala:912)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/04/24 11:49:32 ERROR TaskSetManager: Task 170 in stage 40.0 failed 1 times; aborting job
When I run the filter directly with a literal value, it works:
val justItems = rankedDF.filter(col("TimePeriod")=== 1 ) .select("Items")
val justItems = rankedDF.filter(col("TimePeriod")=== 2 ) .select("Items")
Why am I unable to access my DataFrame dynamically?

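You need to collect the distinct values to the driver first; then you can use map on the resulting local array:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import spark.implicits._ // or sqlContext.implicits._ on Spark 1.x; provides the encoder for .as[Int]
val rankedDF : DataFrame = ???
val timePeriods = rankedDF.select("TimePeriod").distinct().as[Int].collect()
val dataFrames: Array[DataFrame] = timePeriods.map(tp => rankedDF.where(col("TimePeriod")===tp))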
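A likely explanation for the NullPointerException, offered as a sketch rather than a confirmed diagnosis: the function passed to timePeriods.foreach runs on the executors, where rankedDF has no usable SQL context, so rankedDF.filter fails inside the DataFrame constructor (the first frames of the trace). Looping over a collected, driver-side array avoids that. The example below assumes the Spark 2+ SparkSession API, uses made-up rows, and calls a hypothetical processItems function that stands in for "another function" from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("split-by-period").getOrCreate()
import spark.implicits._

// Made-up stand-in for rankedDF, with only the two columns that matter here.
val rankedDF = Seq(
  (1, "item A, item B"),
  (1, "item C, item D"),
  (2, "item E, item F")
).toDF("TimePeriod", "Items")

// Hypothetical downstream function; here it simply counts the rows it receives.
def processItems(items: DataFrame): Long = items.count()

// collect() brings the distinct keys to the driver as a local Array[Int].
val timePeriods: Array[Int] = rankedDF.select("TimePeriod").distinct().as[Int].collect()

// This map runs on the driver, so building DataFrames inside it is safe.
val rowsPerPeriod: Map[Int, Long] = timePeriods.map { tp =>
  val justItems = rankedDF.filter(col("TimePeriod") === tp).select("Items")
  tp -> processItems(justItems)
}.toMap
```

Each iteration still launches a distributed job for the filter; only the loop over the keys is local, which is fine when the number of distinct keys is small.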

Related

How to properly create GraphX with attributes for Nodes and Edges

I'm running on Jupyter Notebook, using spylon kernel, a scala program that performs some actions on a network.
After some preprocessing I end up having two DataFrames, one for nodes and one for edges, of the following kind:
For Nodes
+---+--------------------+-------+--------+-----+
| id| node|trip_id| stop_id| time|
+---+--------------------+-------+--------+-----+
| 0|0b29d98313189b650...| 209518|u0007405|56220|
| 1|45adb49a23257198e...| 209518|u0007409|56340|
| 2|fe5f4e2dc48b97f71...| 209518|u0007406|56460|
| 3|7b32330b6fe10b073...| 209518|u0007407|56580|
+---+--------------------+-------+--------+-----+
only showing top 4 rows
vertices_complete: org.apache.spark.sql.DataFrame = [id: bigint, node: string ... 3 more fields]
For edges
+------+-----+----+------+------+---------+---------+--------+
| src| dst|time|method|weight|walk_time|wait_time|bus_time|
+------+-----+----+------+------+---------+---------+--------+
| 65465|52067|2640| walk|2640.0| 1112| 1528| 0|
| 68744|52067|1740| walk|1740.0| 981| 759| 0|
| 55916|52067|2700| walk|2700.0| 1061| 1639| 0|
|124559|52067|1440| walk|1440.0| 1061| 379| 0|
| 23036|52067|1800| walk|1800.0| 1112| 688| 0|
+------+-----+----+------+------+---------+---------+--------+
only showing top 5 rows
edges_DF: org.apache.spark.sql.DataFrame = [src: bigint, dst: bigint ... 6 more fields]
I want to create a Graph object out of this, to do PageRank, find shortest paths, etc. Therefore I convert these objects to RDD:
val verticesRDD : RDD[(VertexId, (String, Long, String, Long))] = vertices_complete.rdd
.map(row =>
(row.getAs[Long](0),
(row.getAs[String]("node"), row.getAs[Long]("trip_id"), row.getAs[String]("stop_id"), row.getAs[Long]("time"))))
val edgesRDD : RDD[Edge[Long]] = edges_DF.rdd
.map(row =>
Edge(
row.getAs[Long]("src"), row.getAs[Long]("dst"), row.getAs[Long]("weight")))
val my_graph = Graph(verticesRDD, edgesRDD)
Any operation on the graph, for example PageRank (I also tried shortest paths, and the error persists):
val ranks = my_graph.pageRank(0.0001).vertices
raises the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 233.0 failed 1 times, most recent failure: Lost task 5.0 in stage 233.0 (TID 9390, DESKTOP-A7EPMQG.mshome.net, executor driver): java.lang.ClassCastException
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:973)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2239)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2114)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2209)
at org.apache.spark.rdd.RDD.$anonfun$fold$1(RDD.scala:1157)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
at org.apache.spark.rdd.RDD.fold(RDD.scala:1151)
at org.apache.spark.graphx.impl.VertexRDDImpl.count(VertexRDDImpl.scala:90)
at org.apache.spark.graphx.Pregel$.apply(Pregel.scala:140)
at org.apache.spark.graphx.lib.PageRank$.runUntilConvergenceWithOptions(PageRank.scala:431)
at org.apache.spark.graphx.lib.PageRank$.runUntilConvergence(PageRank.scala:346)
at org.apache.spark.graphx.GraphOps.pageRank(GraphOps.scala:380)
... 40 elided
Caused by: java.lang.ClassCastException
I think there is something wrong with the initialization of the RDD objects (and also I would like to add attributes to edges [time, walk_time etc...], too, in addition to the weight), but I cannot figure out how to do it properly. Any help, please?
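One plausible cause, judging only from the printed tables: weight is displayed as 2640.0, which suggests a double column, and reading a double with getAs[Long] fails with a ClassCastException once the value is unboxed. Checking edges_DF.printSchema() (and the vertex schema) would confirm the actual types. The sketch below is not a confirmed fix; it uses made-up rows with assumed column types, and a hypothetical EdgeAttr case class to show how an edge could carry several attributes instead of a single Long.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical holder for the edge columns, so each edge carries more than the weight.
case class EdgeAttr(time: Long, method: String, weight: Double,
                    walkTime: Long, waitTime: Long, busTime: Long)

val spark = SparkSession.builder().master("local[*]").appName("graphx-sketch").getOrCreate()
import spark.implicits._

// Made-up stand-ins for vertices_complete and edges_DF; the column types are assumptions.
val verticesDF = Seq(
  (0L, "n0", 209518L, "u0007405", 56220L),
  (1L, "n1", 209518L, "u0007409", 56340L),
  (2L, "n2", 209518L, "u0007406", 56460L)
).toDF("id", "node", "trip_id", "stop_id", "time")

val edgesDF = Seq(
  (0L, 1L, 2640L, "walk", 2640.0, 1112L, 1528L, 0L),
  (1L, 2L, 1740L, "walk", 1740.0, 981L, 759L, 0L)
).toDF("src", "dst", "time", "method", "weight", "walk_time", "wait_time", "bus_time")

val verticesRDD: RDD[(VertexId, (String, Long, String, Long))] = verticesDF.rdd.map { row =>
  (row.getAs[Long]("id"),
    (row.getAs[String]("node"), row.getAs[Long]("trip_id"),
      row.getAs[String]("stop_id"), row.getAs[Long]("time")))
}

val edgesRDD: RDD[Edge[EdgeAttr]] = edgesDF.rdd.map { row =>
  Edge(row.getAs[Long]("src"), row.getAs[Long]("dst"),
    EdgeAttr(row.getAs[Long]("time"), row.getAs[String]("method"),
      row.getAs[Double]("weight"), // read as Double, matching the assumed column type
      row.getAs[Long]("walk_time"), row.getAs[Long]("wait_time"), row.getAs[Long]("bus_time")))
}

val graph = Graph(verticesRDD, edgesRDD)
val ranks = graph.pageRank(0.0001).vertices
```

Every getAs type parameter has to match the column's actual Spark type (bigint to Long, double to Double, string to String); a mismatch compiles but fails at runtime on the executors.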

Spark RDD Or SQL operations to compute conditional counts

As a bit of background, I'm trying to implement the Kaplan-Meier estimator in Spark. In particular, I assume I have a DataFrame/Dataset with a Double column named data and an Int column named censorFlag (0 if censored, 1 if not; I prefer this over a Boolean type).
Example:
val df = Seq((1.0, 1), (2.3, 0), (4.5, 1), (0.8, 1), (0.7, 0), (4.0, 1), (0.8, 1)).toDF("data", "censorFlag").as[(Double, Int)]
Now I need to compute a column wins that counts instances of each data value. I achieve that with the following code:
val distDF = df.withColumn("wins", sum(col("censorFlag")).over(Window.partitionBy("data").orderBy("data")))
The problem comes when I need to compute a quantity called atRisk which counts, for each value of data, the number of data points that are greater than or equal to it (a cumulative filtered count, if you will).
The following code works:
// We perform the counts per value of "bins". This is an array of doubles
val bins = df.select(col("data").as("dataBins")).distinct().sort("dataBins").as[Double].collect
val atRiskCounts = bins.map(x => (x, df.filter(col("data").geq(x)).count)).toSeq.toDF("data", "atRisk")
// this works:
atRiskCounts.show
However, the use case involves deriving bins from the column data itself, which I'd rather leave as a single column data set (or RDD at worst), but certainly not local array. But this doesn't work:
// Here, 'bins' rightfully come from the data itself.
val bins = df.select(col("data").as("dataBins")).distinct().as[Double]
val atRiskCounts = bins.map(x => (x, df.filter(col("data").geq(x)).count)).toSeq.toDF("data", "atRisk")
// This doesn't work -- NullPointerException
atRiskCounts.show
Nor does this:
// Manually creating the bins and then parallelizing them.
val bins = Seq(0.7, 0.8, 1.0, 3.0).toDS
val atRiskCounts = bins.map(x => (x, df.filter(col("data").geq(x)).count)).toDF("data", "atRisk")
// Also fails with a NullPointerException
atRiskCounts.show
Another approach that does work, but is also not satisfactory from a parallelization perspective is using Window:
// Do the counts in one fell swoop using a giant window per value.
val atRiskCounts = df.withColumn("atRisk", count("censorFlag").over(Window.orderBy("data").rowsBetween(0, Window.unboundedFollowing))).groupBy("data").agg(first("atRisk").as("atRisk"))
// Works, BUT, we get a "WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation."
atRiskCounts.show
This last solution isn't useful, as it ends up shuffling my data to a single partition (and in that case, I might as well go with Option 1, which works).
The successful approaches are fine except that the bins are not parallel, which is something I'd really like to keep if possible. I've looked at groupBy aggregations, pivot type of aggregations, but none seem to make sense.
My question is: is there any way to compute atRisk column in a distributed way? Also, why do I get a NullPointerException in the failed solutions?
EDIT PER COMMENT:
I didn't originally post the NullPointerException as it didn't seem to include anything useful. I'll make a note that this is Spark installed via homebrew on my Macbook Pro (Spark version 2.2.1, standalone localhost mode).
18/03/12 11:41:00 ERROR ExecutorClassLoader: Failed to check existence of class <root>.package on REPL class server at spark://10.37.109.111:53360/classes
java.net.URISyntaxException: Illegal character in path at index 36: spark://10.37.109.111:53360/classes/<root>/package.class
at java.net.URI$Parser.fail(URI.java:2848)
at java.net.URI$Parser.checkChars(URI.java:3021)
at java.net.URI$Parser.parseHierarchical(URI.java:3105)
at java.net.URI$Parser.parse(URI.java:3053)
at java.net.URI.<init>(URI.java:588)
at org.apache.spark.rpc.netty.NettyRpcEnv.openChannel(NettyRpcEnv.scala:327)
at org.apache.spark.repl.ExecutorClassLoader.org$apache$spark$repl$ExecutorClassLoader$$getClassFileInputStreamFromSparkRPC(ExecutorClassLoader.scala:90)
at org.apache.spark.repl.ExecutorClassLoader$$anonfun$1.apply(ExecutorClassLoader.scala:57)
at org.apache.spark.repl.ExecutorClassLoader$$anonfun$1.apply(ExecutorClassLoader.scala:57)
at org.apache.spark.repl.ExecutorClassLoader.findClassLocally(ExecutorClassLoader.scala:162)
at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:80)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
. . . .
18/03/12 11:41:00 ERROR ExecutorClassLoader: Failed to check existence of class <root>.scala on REPL class server at spark://10.37.109.111:53360/classes
java.net.URISyntaxException: Illegal character in path at index 36: spark://10.37.109.111:53360/classes/<root>/scala.class
at java.net.URI$Parser.fail(URI.java:2848)
at java.net.URI$Parser.checkChars(URI.java:3021)
at java.net.URI$Parser.parseHierarchical(URI.java:3105)
at java.net.URI$Parser.parse(URI.java:3053)
at java.net.URI.<init>(URI.java:588)
at org.apache.spark.rpc.netty.NettyRpcEnv.openChannel(NettyRpcEnv.scala:327)
at org.apache.spark.repl.ExecutorClassLoader.org$apache$spark$repl$ExecutorClassLoader$$getClassFileInputStreamFromSparkRPC(ExecutorClassLoader.scala:90)
at org.apache.spark.repl.ExecutorClassLoader$$anonfun$1.apply(ExecutorClassLoader.scala:57)
at org.apache.spark.repl.ExecutorClassLoader$$anonfun$1.apply(ExecutorClassLoader.scala:57)
at org.apache.spark.repl.ExecutorClassLoader.findClassLocally(ExecutorClassLoader.scala:162)
at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:80)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
. . .
18/03/12 11:41:00 ERROR ExecutorClassLoader: Failed to check existence of class <root>.org on REPL class server at spark://10.37.109.111:53360/classes
java.net.URISyntaxException: Illegal character in path at index 36: spark://10.37.109.111:53360/classes/<root>/org.class
at java.net.URI$Parser.fail(URI.java:2848)
at java.net.URI$Parser.checkChars(URI.java:3021)
at java.net.URI$Parser.parseHierarchical(URI.java:3105)
at java.net.URI$Parser.parse(URI.java:3053)
at java.net.URI.<init>(URI.java:588)
at org.apache.spark.rpc.netty.NettyRpcEnv.openChannel(NettyRpcEnv.scala:327)
at org.apache.spark.repl.ExecutorClassLoader.org$apache$spark$repl$ExecutorClassLoader$$getClassFileInputStreamFromSparkRPC(ExecutorClassLoader.scala:90)
at org.apache.spark.repl.ExecutorClassLoader$$anonfun$1.apply(ExecutorClassLoader.scala:57)
at org.apache.spark.repl.ExecutorClassLoader$$anonfun$1.apply(ExecutorClassLoader.scala:57)
at org.apache.spark.repl.ExecutorClassLoader.findClassLocally(ExecutorClassLoader.scala:162)
at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:80)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
. . .
18/03/12 11:41:00 ERROR ExecutorClassLoader: Failed to check existence of class <root>.java on REPL class server at spark://10.37.109.111:53360/classes
java.net.URISyntaxException: Illegal character in path at index 36: spark://10.37.109.111:53360/classes/<root>/java.class
at java.net.URI$Parser.fail(URI.java:2848)
at java.net.URI$Parser.checkChars(URI.java:3021)
at java.net.URI$Parser.parseHierarchical(URI.java:3105)
at java.net.URI$Parser.parse(URI.java:3053)
at java.net.URI.<init>(URI.java:588)
at org.apache.spark.rpc.netty.NettyRpcEnv.openChannel(NettyRpcEnv.scala:327)
at org.apache.spark.repl.ExecutorClassLoader.org$apache$spark$repl$ExecutorClassLoader$$getClassFileInputStreamFromSparkRPC(ExecutorClassLoader.scala:90)
at org.apache.spark.repl.ExecutorClassLoader$$anonfun$1.apply(ExecutorClassLoader.scala:57)
at org.apache.spark.repl.ExecutorClassLoader$$anonfun$1.apply(ExecutorClassLoader.scala:57)
at org.apache.spark.repl.ExecutorClassLoader.findClassLocally(ExecutorClassLoader.scala:162)
at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:80)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
. . .
18/03/12 11:41:00 ERROR Executor: Exception in task 0.0 in stage 55.0 (TID 432)
java.lang.NullPointerException
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:171)
at org.apache.spark.sql.Dataset$.apply(Dataset.scala:62)
at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2889)
at org.apache.spark.sql.Dataset.filter(Dataset.scala:1301)
at $line124.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:33)
at $line124.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:33)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
18/03/12 11:41:00 WARN TaskSetManager: Lost task 0.0 in stage 55.0 (TID 432, localhost, executor driver): java.lang.NullPointerException
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:171)
at org.apache.spark.sql.Dataset$.apply(Dataset.scala:62)
at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2889)
at org.apache.spark.sql.Dataset.filter(Dataset.scala:1301)
at $line124.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:33)
at $line124.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:33)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
18/03/12 11:41:00 ERROR TaskSetManager: Task 0 in stage 55.0 failed 1 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 55.0 failed 1 times, most recent failure: Lost task 0.0 in stage 55.0 (TID 432, localhost, executor driver): java.lang.NullPointerException
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:171)
at org.apache.spark.sql.Dataset$.apply(Dataset.scala:62)
at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2889)
at org.apache.spark.sql.Dataset.filter(Dataset.scala:1301)
at $anonfun$1.apply(<console>:33)
at $anonfun$1.apply(<console>:33)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
... 50 elided
Caused by: java.lang.NullPointerException
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:171)
at org.apache.spark.sql.Dataset$.apply(Dataset.scala:62)
at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2889)
at org.apache.spark.sql.Dataset.filter(Dataset.scala:1301)
at $anonfun$1.apply(<console>:33)
at $anonfun$1.apply(<console>:33)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
My best guess is that the line df("data").geq(x).count might be the part that barfs as not every node may have x and thus a null pointer?
I have not tested this so the syntax may be goofy, but I would do a series of joins:
I believe your first statement is equivalent to this--for each data value, count how many wins there are:
val distDF = df.groupBy($"data").agg(sum($"censorFlag").as("wins"))
Then, as you noted, we can build a dataframe of the bins:
val distinctData = df.select($"data".as("dataBins")).distinct()
And then join with a >= condition:
val atRiskCounts = distDF.join(distinctData, distDF("data") >= distinctData("dataBins"))
.groupBy($"data", $"wins")
.count()
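As a hedged sketch of the join idea on the question's sample data: to count data points rather than distinct values, the bins are joined against the full df, keeping the rows whose data is greater than or equal to the bin, and the matches are counted per bin. The names wins, bins, atRisk and result are chosen here for illustration, and the session setup assumes Spark 2+.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, sum}

val spark = SparkSession.builder().master("local[*]").appName("at-risk-join").getOrCreate()
import spark.implicits._

// Sample data from the question.
val df = Seq((1.0, 1), (2.3, 0), (4.5, 1), (0.8, 1), (0.7, 0), (4.0, 1), (0.8, 1))
  .toDF("data", "censorFlag")

// wins: sum of censorFlag per distinct data value.
val wins = df.groupBy("data").agg(sum("censorFlag").as("wins"))

// bins: the distinct data values, renamed so the join condition is unambiguous.
val bins = df.select(col("data").as("dataBins")).distinct()

// For each bin, count the rows of df whose data is >= that bin.
val atRisk = bins.join(df, col("dataBins") <= col("data"))
  .groupBy("dataBins")
  .agg(count("data").as("atRisk"))

val result = wins.join(atRisk, wins("data") === atRisk("dataBins"))
  .select("data", "wins", "atRisk")
```

The non-equi join stays distributed, but it compares every bin with every row, so its cost grows with the number of distinct values times the number of rows.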
When, as in your case, each value in a column has to be compared with all the other values in that column, collecting is the key step. Checking against all the values means that all the data of that column has to be gathered in one place, either on one executor or on the driver. You cannot avoid that step with a requirement like yours.
The main question is then how to define the remaining steps so that they benefit from Spark's parallelism. I would suggest broadcasting the collected set (it holds only the distinct values of one column, so it should not be huge) and using a udf to check the gte condition, as below.
First, you can optimize your collection step:
import org.apache.spark.sql.functions._
val collectedData = df.select(sort_array(collect_set("data"))).collect()(0)(0).asInstanceOf[collection.mutable.WrappedArray[Double]]
Then you broadcast the collected set
val broadcastedArray = sc.broadcast(collectedData)
Next step is to define a udf function and check the gte condition and return counts
def checkingUdf = udf((data: Double)=> broadcastedArray.value.count(x => x >= data))
and use it as
distDF.withColumn("atRisk", checkingUdf(col("data"))).show(false)
So that finally you should have
+----+----------+----+------+
|data|censorFlag|wins|atRisk|
+----+----------+----+------+
|4.5 |1 |1 |1 |
|0.7 |0 |0 |6 |
|2.3 |0 |0 |3 |
|1.0 |1 |1 |4 |
|0.8 |1 |2 |5 |
|0.8 |1 |2 |5 |
|4.0 |1 |1 |2 |
+----+----------+----+------+
I hope that's the required DataFrame.
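Note that the table above gives 0.7 an atRisk of 6 because the broadcast set holds distinct values, while the sample has 7 data points (0.8 appears twice). If atRisk should count data points, as the question defines it, one possible variation is to broadcast per-value frequencies and sum them in the udf. This is only a sketch checked against the sample data; valueCounts, countsBroadcast and atRiskUdf are names introduced here, and the session setup assumes Spark 2+.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().master("local[*]").appName("at-risk-broadcast").getOrCreate()
import spark.implicits._

// Sample data from the question.
val df = Seq((1.0, 1), (2.3, 0), (4.5, 1), (0.8, 1), (0.7, 0), (4.0, 1), (0.8, 1))
  .toDF("data", "censorFlag")

// One (value, frequency) pair per distinct value, collected to the driver.
val valueCounts: Array[(Double, Long)] =
  df.groupBy("data").count().as[(Double, Long)].collect()

// Ship the small lookup array to every executor once.
val countsBroadcast = spark.sparkContext.broadcast(valueCounts)

// Sum the frequencies of all values greater than or equal to the given one.
val atRiskUdf = udf((d: Double) =>
  countsBroadcast.value.collect { case (v, n) if v >= d => n }.sum)

val result = df.withColumn("atRisk", atRiskUdf(col("data")))
```

The udf scans the whole broadcast array for every row, so this suits a modest number of distinct values; the array only grows with the distinct values, not with the row count.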
I tried the above examples (albeit not the most rigorously!), and it seems the left join works best in general.
The data:
import org.apache.spark.mllib.random.RandomRDDs._
val df = logNormalRDD(sc, 1, 3.0, 10000, 100).zip(uniformRDD(sc, 10000, 100).map(x => if(x <= 0.4) 1 else 0)).toDF("data", "censorFlag").withColumn("data", round(col("data"), 2))
The join example:
def runJoin(sc: SparkContext, df:DataFrame): Unit = {
val bins = df.select(col("data").as("dataBins")).distinct().sort("dataBins")
val wins = df.groupBy(col("data")).agg(sum("censorFlag").as("wins"))
val atRiskCounts = bins.join(df, bins("dataBins") <= df("data")).groupBy("dataBins").count().withColumnRenamed("count", "atRisk")
val finalDF = wins.join(atRiskCounts, wins("data") === atRiskCounts("dataBins")).select("data", "wins", "atRisk").sort("data")
finalDF.show
}
The broadcast example:
def runBroadcast(sc: SparkContext, df: DataFrame): Unit = {
val bins = df.select(sort_array(collect_set("data"))).collect()(0)(0).asInstanceOf[collection.mutable.WrappedArray[Double]]
val binsBroadcast = sc.broadcast(bins)
val df2 = binsBroadcast.value.map(x => (x, df.filter(col("data").geq(x)).select(count(col("data"))).as[Long].first)).toDF("data", "atRisk")
val finalDF = df.groupBy(col("data")).agg(sum("censorFlag").as("wins")).join(df2, "data")
finalDF.show
binsBroadcast.destroy
}
And the testing code:
import java.util.concurrent.TimeUnit
// sampleDF is the DataFrame under test (the random data built above is named df there)
var start = System.nanoTime()
runJoin(sc, sampleDF)
val joinTime = TimeUnit.SECONDS.convert(System.nanoTime() - start, TimeUnit.NANOSECONDS)
start = System.nanoTime()
runBroadcast(sc, sampleDF)
val broadTime = TimeUnit.SECONDS.convert(System.nanoTime() - start, TimeUnit.NANOSECONDS)
I ran this code for different sizes of the random data and with manually provided bins arrays (some very granular, at 50% of the original distinct data; some very small, at 10% of the original distinct data). The join approach was consistently the fastest (and both approaches arrive at the same result, which is a plus!).
On average, the smaller the bin array, the better the broadcast approach performs, while the join doesn't seem much affected. If I had more time and resources, I'd run many simulations to see what the average run time looks like, but for now I'll accept #hoyland's solution.
I'm still not sure why the original approach didn't work, so I'm open to comments on that.
Kindly let me know of any issues in my code, or improvements! Thank you both :)

NullPointerException: creating dataset/dataframe inside foreachPartition/foreach

1) If I run the following in either local or cluster mode, I get a NullPointerException:
import sparkSession.implicits._
val testDS = sparkSession.createDataFrame(
Seq(
ABC("1","2", 1),
ABC("3","9", 3),
ABC("8","2", 2),
ABC("1","2", 3),
ABC("3","9", 1),
ABC("2","7", 1),
ABC("1","3", 2))
).as[ABC]
val t = testDS
.rdd
.groupBy(_.c)
.foreachPartition(
p => p.foreach(
a => {
val id = a._1
println("inside foreach, id: " + id)
val itABC = a._2
val itSeq = itABC.toSeq
println(itSeq.size)
val itDS = itSeq.toDS // Get "Caused by: java.lang.NullPointerException" here
itDS.show()
funcA(itDS, id)
}
)
)
println(t.toString)
Or
import sparkSession.implicits._
val testDS = sparkSession.createDataFrame(
Seq(
ABC("1","2", 1),
ABC("3","9", 3),
ABC("8","2", 2),
ABC("1","2", 3),
ABC("3","9", 1),
ABC("2","7", 1),
ABC("1","3", 2))
).as[ABC]
testDS
.rdd
.groupBy(_.c)
.foreachPartition(
p => p.foreach(
a => {
val id = a._1
println("inside foreach, id: " + id)
val itABC = a._2
import sparkSession.implicits._
val itDS = sparkSession.createDataFrame(
sparkSession.sparkContext.parallelize(itABC.toList, numSlices=200)) // get "NullPointerException" here
itDS.show()
funcA(itDS, id)
}
)
)
Here's the output log for 1):
17/10/26 15:07:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[Stage 0:> (0 + 4) / 4]17/10/26 15:07:29 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 8, 10.142.17.137, executor 0): java.lang.NullPointerException
at com.a.data_pipeline.SL$$anonfun$generateScaleGraphs$1$$anonfun$apply$1.apply(SL.scala:176)
at com.a.data_pipeline.SL$$anonfun$generateScaleGraphs$1$$anonfun$apply$1.apply(SL.scala:167)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at com.a.data_pipeline.SL$$anonfun$generateScaleGraphs$1.apply(SL.scala:166)
at com.a.data_pipeline.SL$$anonfun$generateScaleGraphs$1.apply(SL.scala:166)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/10/26 15:07:29 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 12, 10.142.17.137, executor 0): java.lang.NullPointerException
at com.a.data_pipeline.SL$$anonfun$generateScaleGraphs$1$$anonfun$apply$1.apply(SL.scala:176)
at com.a.data_pipeline.SL$$anonfun$generateScaleGraphs$1$$anonfun$apply$1.apply(SL.scala:167)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at com.a.data_pipeline.SL$$anonfun$generateScaleGraphs$1.apply(SL.scala:166)
at com.a.data_pipeline.SL$$anonfun$generateScaleGraphs$1.apply(SL.scala:166)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:926)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:924)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:924)
at com.a.data_pipeline.SL.generateScaleGraphs(SL.scala:165)
at com.a.data_pipeline.GA$$anonfun$generateGraphsDataScale$1.apply(GA.scala:23)
at com.a.data_pipeline.GA$$anonfun$generateGraphsDataScale$1.apply(GA.scala:21)
at scala.collection.immutable.List.foreach(List.scala:381)
at com.a.data_pipeline.GA$.generateGraphsDataScale(GA.scala:21)
at com.a.data_pipeline.GA$.main(GA.scala:52)
at com.a.data_pipeline.GA.main(GA.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.NullPointerException
at com.a.data_pipeline.SL$$anonfun$generateScaleGraphs$1$$anonfun$apply$1.apply(SL.scala:176)
at com.a.data_pipeline.SL$$anonfun$generateScaleGraphs$1$$anonfun$apply$1.apply(SL.scala:167)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at com.a.data_pipeline.SL$$anonfun$generateScaleGraphs$1.apply(SL.scala:166)
at com.a.data_pipeline.SL$$anonfun$generateScaleGraphs$1.apply(SL.scala:166)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$29.apply(RDD.scala:926)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2) The following code works fine in local mode, but in cluster mode it fails with either a NullPointerException or "Caused by: org.apache.spark.SparkException: A master URL must be set in your configuration":
import sparkSession.implicits._
val testDS = sparkSession.createDataFrame(
  Seq(
    ABC("1","2", 1),
    ABC("3","9", 3),
    ABC("8","2", 2),
    ABC("1","2", 3),
    ABC("3","9", 1),
    ABC("2","7", 1),
    ABC("1","3", 2))
).as[ABC]
val test = testDS
  .rdd
  .groupBy(_.c)
  .foreachPartition(
    p => p.foreach(
      a => {
        val id = a._1
        println("inside foreach, id: " + id)
        val itABC = a._2
        val ss = SparkSessionUtil.getInstance(clusterMode)
        import ss.implicits._
        val itDS = ss.createDataFrame(
          ss.sparkContext.parallelize(itABC.toList, numSlices=200)).as[ABC]
        itDS.show()
        funcA(itDS, id) // in funcA, I'd like to use this itDS(Dataset) to do some calculation, like itDS.groupby().agg().filter()
      }
    )
  )
Here's the system out log for 2):
17/10/26 14:19:12 WARN SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect.
inside foreach, id: 1
17/10/26 14:19:13 WARN SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect.
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 2| 1|
| 3| 9| 1|
| 2| 7| 1|
+---+---+---+
inside foreach, id: 2
17/10/26 14:19:14 WARN SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect.
17/10/26 14:19:14 WARN SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect.
+---+---+---+
| a| b| c|
+---+---+---+
| 8| 2| 2|
| 1| 3| 2|
+---+---+---+
inside foreach, id: 3
+---+---+---+
| a| b| c|
+---+---+---+
| 3| 9| 3|
| 1| 2| 3|
+---+---+---+
I would like to use the Dataset for each id (itDS) inside funcA(itDS, id) to compute something like itDS.groupBy().agg().filter(). How should I solve this problem? Thank you in advance.
I recently ran into the same issue, and since there was no answer I am adding one here.
faustineinsun's comment is the answer:
Thank you, @AlexandreDupriez! The problem has been solved by
restructuring the code from sparkSession.sql() to Seq[ABC] so that
sparkSession isn't referenced in the map/foreach function closure,
since sparkSession isn't serializable; it's designed to run on the
driver, not on the workers.
Conclusion:
Within foreach, foreachPartition, map or mapPartitions you can't create a new DataFrame through the Spark session (.read, .sql, createDataFrame and so on). Those closures run on the executors, while the session and its SparkContext are only usable on the driver, so the attempt throws a NullPointerException.
Also have a look at:
How to use SQLContext and SparkContext inside foreachPartition
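A minimal sketch of one way to restructure the code so that everything touching the session stays on the driver: collect the distinct keys to the driver (assuming there are only a few of them), then filter the original Dataset once per key in an ordinary Scala loop. The ABC case class mirrors the one in the question; funcA here is a hypothetical stand-in that only counts rows, and the local master is an assumption for trying it out.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Mirrors the ABC case class used in the question.
case class ABC(a: String, b: String, c: Int)

object SplitByKeyOnDriver {
  // Hypothetical stand-in for the question's funcA; it only counts the rows.
  def funcA(ds: Dataset[ABC], id: Int): Long = ds.count()

  def main(args: Array[String]): Unit = {
    // Assumption: local master for a quick run; on a cluster spark-submit usually sets it.
    val spark = SparkSession.builder.master("local[*]").appName("split-by-key").getOrCreate()
    import spark.implicits._

    val testDS = Seq(
      ABC("1", "2", 1), ABC("3", "9", 3), ABC("8", "2", 2), ABC("1", "2", 3),
      ABC("3", "9", 1), ABC("2", "7", 1), ABC("1", "3", 2)
    ).toDS()

    // collect() returns a local Array on the driver, so the loop below is plain Scala
    // running on the driver, not a Spark closure running on executors.
    val ids: Array[Int] = testDS.select($"c".as[Int]).distinct().collect()

    ids.sorted.foreach { id =>
      val itDS = testDS.filter($"c" === id) // still a distributed Dataset[ABC]
      itDS.show()
      println(s"id=$id rows=${funcA(itDS, id)}")
    }

    spark.stop()
  }
}
```

This launches one Spark job per key, which is fine for a handful of keys. With many keys, a single groupBy().agg() over the whole Dataset is usually the better fit.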

creating spark data frame based on condition

I have 2 data frames:
dataframe1 has 70000 rows like:
location_id, location, flag
1,Canada,active
2,Paris,active
3,London,active
4,Berlin,active
The second DataFrame, lookup, has modified ids for each location (this DataFrame is modified from time to time), like:
id,location
1,Canada
10,Paris
4,Berlin
3,London
My problem: I need to take the new id from lookup as the location_id. If a location's location_id differs from its id in lookup, I want to keep the old id for that location with the flag inactive (to maintain historic data) and add the new id with the flag active. So the output table in Hive should look like:
location_id,location,flag
1,Canada,active
2,Paris,inactive
10,Paris,active
3,London,active
4,Berlin,active
I tried joining both frames first. Then, on the joined DataFrame, I perform an action to save all records in Hive. This is what I tried:
val joinedFrame = dataframe1.join(lookup, "location")
val df_temp = joinedFrame.withColumn("flag1", when($"tag_id" === $"tag_number", "active").otherwise("inactive"))
var count = 1
df_temp.foreach(x => {
  val flag1 = x.getAs[String]("flag1").toString
  val flag = x.getAs[String]("flag").toString
  val location_id = x.getAs[String]("location_id").toString
  val location = x.getAs[String]("location").toString
  val id = x.getAs[String]("id").toString
  if ((count != 1)&&(flag1 != flag)){
    println("------not equal-------",flag1,"-------",flag,"---------",id,"---------",location,"--------",location_id)
    val df_main = sc.parallelize(Seq((location_id, location,flag1), (id, location, flag))).toDF("location_id", "location", "flag")
    df_main.show
    df_main.write.insertInto("location_coords")
  }
  count += 1
})
It prints the locations that have different ids, but when it saves those values as a DataFrame I get an exception:
not equal------inactive------active---10---------Paris---------2
17/09/29 03:43:29 ERROR Executor: Exception in task 0.0 in stage 25.0 (TID 45)
java.lang.NullPointerException
at $line83.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:75)
at $line83.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:65)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
17/09/29 03:43:29 WARN TaskSetManager: Lost task 0.0 in stage 25.0 (TID 45, localhost, executor driver): java.lang.NullPointerException
at $line83.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:75)
at $line83.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:65)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Based on your comments, I think the easiest method would be to join on the ids instead. With an outer join, the columns missing from one side end up null; those rows are the ones that have been updated, and they are the ones you are interested in.
After that, all that is left is to fill in the location column where it is empty and to set the flag column. See my code below (note that I changed the column names somewhat):
val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
val df = Seq((1,"Canada","active"),(2,"Paris","active"),(3,"London","active"),(4,"Berlin","active"))
  .toDF("id", "location", "flag")
val df2 = Seq((1,"Canada"),(10,"Paris"),(4,"Berlin"),(3,"London"))
  .toDF("id", "location_new")
val df3 = df.join(df2, Seq("id"), "outer")
  // keep only ids that exist on one side, i.e. the old and the new id of a changed location
  .filter($"location".isNull or $"location_new".isNull)
  // fill in the location for rows that only came from df2
  .withColumn("location", when($"location_new".isNull, $"location").otherwise($"location_new"))
  // rows that came from df2 carry the new id and are active; the old ids become inactive
  .withColumn("flag", when($"location" === $"location_new", "active").otherwise("inactive"))
  .drop("location_new")
> df3.show()
+---+--------+--------+
| id|location| flag|
+---+--------+--------+
| 10| Paris| active|
| 2| Paris|inactive|
+---+--------+--------+
After this you can use this new dataframe to update the hive table.
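The df3 above holds only the changed rows. If you would rather build the full table from the question in one go, here is a hedged alternative sketch that joins on location instead of id. It is not the code from the answer above: the withHistory helper and the object name are hypothetical, the column names follow the question, and the local master is an assumption for trying it out.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}

object LocationHistory {
  // Hypothetical helper. Assumes `current` has (location_id, location, flag)
  // and `lookup` has (id, location), as in the question.
  def withHistory(current: DataFrame, lookup: DataFrame): DataFrame = {
    val joined = current.join(lookup, Seq("location"), "left")
    // True only when the lookup has an id for this location and it differs from the old one.
    val idChanged = col("id").isNotNull && (col("id") =!= col("location_id"))

    // Existing rows: the flag becomes inactive where the id changed, otherwise it is kept.
    val oldRows = joined.select(
      col("location_id"), col("location"),
      when(idChanged, "inactive").otherwise(col("flag")).as("flag"))

    // One extra active row per changed location, carrying the new id.
    val newRows = joined.filter(idChanged).select(
      col("id").as("location_id"), col("location"), lit("active").as("flag"))

    oldRows.union(newRows)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local master for a quick run.
    val spark = SparkSession.builder.master("local[*]").appName("location-history").getOrCreate()
    import spark.implicits._

    val current = Seq((1, "Canada", "active"), (2, "Paris", "active"),
      (3, "London", "active"), (4, "Berlin", "active"))
      .toDF("location_id", "location", "flag")
    val lookup = Seq((1, "Canada"), (10, "Paris"), (4, "Berlin"), (3, "London"))
      .toDF("id", "location")

    withHistory(current, lookup).orderBy("location", "location_id").show()
    spark.stop()
  }
}
```

Everything here is a DataFrame transformation built on the driver, so no session is referenced inside a foreach closure.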

Date type null value in dataframe not storing in cassandra

I am working with Apache Spark 1.6.0. I have a DataFrame with 280 columns, some of which are of type timestamp. A few values in the timestamp fields are null. When I try to write this DataFrame to Cassandra, I get an IllegalArgumentException.
The column looks like this:
+-------------------------+
| LoginDate|
+-------------------------+
| null|
| 2014-06-25T12:27:...|
| 2014-06-25T12:27:...|
| null|
| 2014-06-25T12:27:...|
| 2014-06-25T12:27:...|
| null|
| null|
| 2014-06-25T12:27:...|
| 2014-06-25T12:27:...|
+-------------------------+
When I try to save the whole DataFrame to Cassandra, I get this error:
05:39:22 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 106.0 (TID 5136,): java.lang.IllegalArgumentException: Invalid date:
at com.datastax.spark.connector.types.TimestampParser$.parse(TimestampParser.scala:50)
at com.datastax.spark.connector.types.TypeConverter$DateConverter$$anonfun$convertPF$13.applyOrElse(TypeConverter.scala:323)
at com.datastax.spark.connector.types.TypeConverter$class.convert(TypeConverter.scala:43)
at com.datastax.spark.connector.types.TypeConverter$DateConverter$.com$datastax$spark$connector$types$NullableTypeConverter$$super$convert(TypeConverter.scala:313)
at com.datastax.spark.connector.types.NullableTypeConverter$class.convert(TypeConverter.scala:56)
at com.datastax.spark.connector.types.TypeConverter$DateConverter$.convert(TypeConverter.scala:313)
at com.datastax.spark.connector.types.TypeConverter$OptionToNullConverter$$anonfun$convertPF$31.applyOrElse(TypeConverter.scala:812)
at com.datastax.spark.connector.types.TypeConverter$class.convert(TypeConverter.scala:43)
at com.datastax.spark.connector.types.TypeConverter$OptionToNullConverter.com$datastax$spark$connector$types$NullableTypeConverter$$super$convert(TypeConverter.scala:795)
at com.datastax.spark.connector.types.NullableTypeConverter$class.convert(TypeConverter.scala:56)
at com.datastax.spark.connector.types.TypeConverter$OptionToNullConverter.convert(TypeConverter.scala:795)
at com.datastax.spark.connector.writer.SqlRowWriter$$anonfun$readColumnValues$1.apply$mcVI$sp(SqlRowWriter.scala:26)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at com.datastax.spark.connector.writer.SqlRowWriter.readColumnValues(SqlRowWriter.scala:24)
at com.datastax.spark.connector.writer.SqlRowWriter.readColumnValues(SqlRowWriter.scala:12)
at com.datastax.spark.connector.writer.BoundStatementBuilder.bind(BoundStatementBuilder.scala:100)
at com.datastax.spark.connector.writer.GroupingBatchBuilder.next(GroupingBatchBuilder.scala:106)
at com.datastax.spark.connector.writer.GroupingBatchBuilder.next(GroupingBatchBuilder.scala:31)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at com.datastax.spark.connector.writer.GroupingBatchBuilder.foreach(GroupingBatchBuilder.scala:31)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:157)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:134)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:110)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:109)
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:139)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:109)
at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:134)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
The corresponding field in Cassandra is of type timestamp.
Can anyone help me solve this issue?
Add the following parameter to your Spark Cassandra connector settings:
spark.cassandra.output.ignoreNulls=true
It ignores null values in the input and also has the benefit of not creating the corresponding tombstones in Cassandra.
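A minimal sketch of setting this in code, assuming you build the SparkConf yourself. The object name, app name and host below are placeholders, not values from the question. I believe the option needs a connector version that supports it and Cassandra 2.2 or later, so check the connector reference for your version.

```scala
import org.apache.spark.SparkConf

object CassandraWriteConf {
  // Builds a SparkConf with the connector option from the answer switched on.
  def build(): SparkConf = new SparkConf()
    .setAppName("cassandra-write")                        // hypothetical app name
    .set("spark.cassandra.connection.host", "127.0.0.1")  // placeholder host
    .set("spark.cassandra.output.ignoreNulls", "true")    // leave null columns unset on write
}
```

The same setting can be passed at submit time with --conf spark.cassandra.output.ignoreNulls=true instead of hard-coding it.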