How to properly create GraphX with attributes for Nodes and Edges - scala

I'm running a Scala program in a Jupyter Notebook (using the spylon kernel) that performs some operations on a network.
After some preprocessing I end up with two DataFrames, one for nodes and one for edges, of the following kind:
For Nodes
+---+--------------------+-------+--------+-----+
| id| node|trip_id| stop_id| time|
+---+--------------------+-------+--------+-----+
| 0|0b29d98313189b650...| 209518|u0007405|56220|
| 1|45adb49a23257198e...| 209518|u0007409|56340|
| 2|fe5f4e2dc48b97f71...| 209518|u0007406|56460|
| 3|7b32330b6fe10b073...| 209518|u0007407|56580|
+---+--------------------+-------+--------+-----+
only showing top 4 rows
vertices_complete: org.apache.spark.sql.DataFrame = [id: bigint, node: string ... 3 more fields]
For edges
+------+-----+----+------+------+---------+---------+--------+
| src| dst|time|method|weight|walk_time|wait_time|bus_time|
+------+-----+----+------+------+---------+---------+--------+
| 65465|52067|2640| walk|2640.0| 1112| 1528| 0|
| 68744|52067|1740| walk|1740.0| 981| 759| 0|
| 55916|52067|2700| walk|2700.0| 1061| 1639| 0|
|124559|52067|1440| walk|1440.0| 1061| 379| 0|
| 23036|52067|1800| walk|1800.0| 1112| 688| 0|
+------+-----+----+------+------+---------+---------+--------+
only showing top 5 rows
edges_DF: org.apache.spark.sql.DataFrame = [src: bigint, dst: bigint ... 6 more fields]
I want to create a Graph object out of this, to do PageRank, find shortest paths, etc. Therefore I convert these DataFrames to RDDs:
val verticesRDD : RDD[(VertexId, (String, Long, String, Long))] = vertices_complete.rdd
  .map(row =>
    (row.getAs[Long](0),
     (row.getAs[String]("node"), row.getAs[Long]("trip_id"), row.getAs[String]("stop_id"), row.getAs[Long]("time"))))

val edgesRDD : RDD[Edge[Long]] = edges_DF.rdd
  .map(row =>
    Edge(row.getAs[Long]("src"), row.getAs[Long]("dst"), row.getAs[Long]("weight")))

val my_graph = Graph(verticesRDD, edgesRDD)
Any operation on this graph, even PageRank (I also tried shortest paths, and the error still persists), such as
val ranks = my_graph.pageRank(0.0001).vertices
raises the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 233.0 failed 1 times, most recent failure: Lost task 5.0 in stage 233.0 (TID 9390, DESKTOP-A7EPMQG.mshome.net, executor driver): java.lang.ClassCastException
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:973)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2239)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2114)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2209)
at org.apache.spark.rdd.RDD.$anonfun$fold$1(RDD.scala:1157)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
at org.apache.spark.rdd.RDD.fold(RDD.scala:1151)
at org.apache.spark.graphx.impl.VertexRDDImpl.count(VertexRDDImpl.scala:90)
at org.apache.spark.graphx.Pregel$.apply(Pregel.scala:140)
at org.apache.spark.graphx.lib.PageRank$.runUntilConvergenceWithOptions(PageRank.scala:431)
at org.apache.spark.graphx.lib.PageRank$.runUntilConvergence(PageRank.scala:346)
at org.apache.spark.graphx.GraphOps.pageRank(GraphOps.scala:380)
... 40 elided
Caused by: java.lang.ClassCastException
I think something is wrong with the initialization of the RDD objects (and I would also like to add attributes to the edges [time, walk_time, etc.] in addition to the weight), but I cannot figure out how to do it properly. Any help, please?
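For reference, here is a minimal sketch (not part of the original post) of how the edge attributes could be carried in a case class instead of a single Long. It assumes the column types match the printed schemas above (id/src/dst are bigint and weight prints as a double; adjust the getAs types to the actual schema, since calling getAs[Long] on a double column is one common source of a ClassCastException):
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Hypothetical attribute class; field types assume the schema shown above.
case class EdgeAttr(time: Long, method: String, weight: Double,
                    walkTime: Long, waitTime: Long, busTime: Long)

val verticesRDD: RDD[(VertexId, (String, Long, String, Long))] = vertices_complete.rdd
  .map(row => (
    row.getAs[Long]("id"),
    (row.getAs[String]("node"), row.getAs[Long]("trip_id"),
     row.getAs[String]("stop_id"), row.getAs[Long]("time"))))

val edgesRDD: RDD[Edge[EdgeAttr]] = edges_DF.rdd
  .map(row => Edge(
    row.getAs[Long]("src"),
    row.getAs[Long]("dst"),
    EdgeAttr(
      row.getAs[Long]("time"),
      row.getAs[String]("method"),
      row.getAs[Double]("weight"),   // weight shows values like 2640.0, i.e. a double
      row.getAs[Long]("walk_time"),
      row.getAs[Long]("wait_time"),
      row.getAs[Long]("bus_time"))))

val my_graph: Graph[(String, Long, String, Long), EdgeAttr] = Graph(verticesRDD, edgesRDD)
GraphX's pageRank is defined for any edge attribute type, so my_graph.pageRank(0.0001) can still be called unchanged; the richer attributes then remain available for weighting or filtering.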

Related

Split a DataFrame Spark Scala [duplicate]

This question already has answers here:
How to split a dataframe into dataframes with same column values?
I have this dataframe called rankedDF:
+----------+-----------+----------+-------------+-----------------------+
|TimePeriod|TPStartDate| TPEndDate|TXN_HEADER_ID|                  Items|
+----------+-----------+----------+-------------+-----------------------+
|         1| 2017-03-01|2017-05-30|   TxnHeader1|   Womens Socks, Men...|
|         1| 2017-03-01|2017-05-30|   TxnHeader4|   Mens Pants, Mens ...|
|         1| 2017-03-01|2017-05-30|   TxnHeader7|   Womens Socks, Men...|
|         2| 2019-03-01|2017-05-30|   TxnHeader1|Calcetas Mujer, Calc...|
|         2| 2019-03-01|2017-05-30|   TxnHeader4|  Pantalones H, Pan ...|
|         2| 2019-03-01|2017-05-30|   TxnHeader7| Calcetas Mujer, Pan...|
+----------+-----------+----------+-------------+-----------------------+
So, I need to split this dataframe by each TimePeriod as input for another function, keeping only the Items column.
I've tried this:
val timePeriods = rankedDF.select("TimePeriod").distinct()
so at this point I have:
| Time Period |
| 1 |
| 2 |
Based on this timePeriods, I need to call my function twice:
timePeriods.foreach { n =>
  val justItems = rankedDF.filter(col("TimePeriod") === n.getAs[Int](0)).select("Items")
}
Well... I was expecting this DataFrame:
|               Items|
|Womens Socks, Men...
|Mens Pants, Mens ...
|Womens Socks, Men...
Instead, I'm getting this error:
task 170.0 in stage 40.0 (TID 2223)
java.lang.NullPointerException
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:131)
at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$withPlan(DataFrame.scala:2165)
at org.apache.spark.sql.DataFrame.filter(DataFrame.scala:799)
at com.brierley.versions.FpGrowth$$anonfun$PfpGrowth$1$$anonfun$apply$3.apply(FpGrowth.scala:720)
at com.brierley.versions.FpGrowth$$anonfun$PfpGrowth$1$$anonfun$apply$3.apply(FpGrowth.scala:718)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$32.apply(RDD.scala:912)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$32.apply(RDD.scala:912)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/04/24 11:49:32 WARN TaskSetManager: Lost task 170.0 in stage 40.0 (TID 2223, localhost): java.lang.NullPointerException
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:131)
at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$withPlan(DataFrame.scala:2165)
at org.apache.spark.sql.DataFrame.filter(DataFrame.scala:799)
at com.brierley.versions.FpGrowth$$anonfun$PfpGrowth$1$$anonfun$apply$3.apply(FpGrowth.scala:720)
at com.brierley.versions.FpGrowth$$anonfun$PfpGrowth$1$$anonfun$apply$3.apply(FpGrowth.scala:718)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$32.apply(RDD.scala:912)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$32.apply(RDD.scala:912)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
18/04/24 11:49:32 ERROR TaskSetManager: Task 170 in stage 40.0 failed 1 times; aborting job
When I run this instead, it actually works:
val justItems = rankedDF.filter(col("TimePeriod") === 1).select("Items")
val justItems = rankedDF.filter(col("TimePeriod") === 2).select("Items")
Why am I unable to access my DataFrame dynamically?
You need to collect the distinct values to the driver first; then you can use map. The DataFrame cannot be used inside a foreach closure that runs on the executors, which is why the filter inside foreach fails with a NullPointerException:
val rankedDF : DataFrame = ???
val timePeriods = rankedDF.select("TimePeriod").distinct().as[Int].collect()
val dataFrames: Array[DataFrame] = timePeriods.map(tp => rankedDF.where(col("TimePeriod")===tp))
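If only the Items column is needed downstream, a select can be appended per period. A small sketch building on the snippet above (processItems is a placeholder for the downstream function; col is imported from org.apache.spark.sql.functions as before):
val itemsPerPeriod: Array[DataFrame] =
  timePeriods.map(tp => rankedDF.where(col("TimePeriod") === tp).select("Items"))

// Hypothetical downstream call, once per time period.
itemsPerPeriod.foreach(itemsDF => processItems(itemsDF))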

cannot select a specific column for ReduceByKey operation Spark

I create a DataFrame, shown below, and I want to apply a map-reduce algorithm to the column 'title', but when I use the reduceByKey function I encounter some problems.
+-------+--------------------+------------+-----------+
|project| title|requests_num|return_size|
+-------+--------------------+------------+-----------+
| aa|%CE%92%CE%84_%CE%...| 1| 4854|
| aa|%CE%98%CE%B5%CF%8...| 1| 4917|
| aa|%CE%9C%CF%89%CE%A...| 1| 4832|
| aa|%CE%A0%CE%B9%CE%B...| 1| 4828|
| aa|%CE%A3%CE%A4%CE%8...| 1| 4819|
| aa|%D0%A1%D0%BE%D0%B...| 1| 4750|
| aa| 271_a.C| 1| 4675|
| aa|Battaglia_di_Qade...| 1| 4765|
| aa| Category:User_th| 1| 4770|
| aa| Chiron_Elias_Krase| 1| 4694|
| aa|County_Laois/en/Q...| 1| 4752|
| aa| Dassault_rafaele| 2| 9372|
| aa|Dyskusja_wikiproj...| 1| 4824|
| aa| E.Desv| 1| 4662|
| aa|Enclos-apier/fr/E...| 1| 4772|
| aa|File:Wiktionary-l...| 1| 10752|
| aa|Henri_de_Sourdis/...| 1| 4748|
| aa|Incentive_Softwar...| 1| 4777|
| aa|Indonesian_Wikipedia| 1| 4679|
| aa| Main_Page| 5| 266946|
+-------+--------------------+------------+-----------+
I try this, but it doesn't work:
dataframe.select("title").map(word => (word,1)).reduceByKey(_+_);
It seems that I should convert the dataframe to a list first, then use a map function to generate key-value pairs (word, 1), and finally sum up the values by key.
I found a method on Stack Overflow for converting a dataframe to a list, for example:
val text =dataframe.select("title").map(r=>r(0).asInstanceOf[String]).collect()
but an error occurs
scala> val text = dataframe.select("title").map(r=>r(0).asInstanceOf[String]).collect()
2018-04-08 21:49:35 WARN NettyRpcEnv:66 - Ignored message: HeartbeatResponse(false)
2018-04-08 21:49:35 WARN Executor:87 - Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:785)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply$mcV$sp(Executor.scala:814)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988)
at org.apache.spark.executor.Executor$$anon$2.run(Executor.scala:814)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 14 more
java.lang.OutOfMemoryError: Java heap space
at org.apache.spark.sql.execution.SparkPlan$$anon$1.next(SparkPlan.scala:280)
at org.apache.spark.sql.execution.SparkPlan$$anon$1.next(SparkPlan.scala:276)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.sql.execution.SparkPlan$$anon$1.foreach(SparkPlan.scala:276)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:298)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:297)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:297)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722)
at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252)
at org.apache.spark.sql.Dataset.collect(Dataset.scala:2722)
... 16 elided
scala> val text = dataframe.select("title").map(r=>r(0).asInstanceOf[String]).collect()
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.spark.sql.execution.SparkPlan$$anon$1.next(SparkPlan.scala:280)
at org.apache.spark.sql.execution.SparkPlan$$anon$1.next(SparkPlan.scala:276)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.sql.execution.SparkPlan$$anon$1.foreach(SparkPlan.scala:276)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:298)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:297)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:297)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722)
at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252)
at org.apache.spark.sql.Dataset.collect(Dataset.scala:2722)
... 16 elided
Collecting your DataFrame into a Scala collection imposes a constraint on your dataset size, since everything must fit in driver memory. Instead, you could convert the DataFrame to an RDD and then apply map and reduceByKey, as below:
val df = Seq(
  ("aa", "271_a.C", 1, 4675),
  ("aa", "271_a.C", 1, 4400),
  ("aa", "271_a.C", 1, 4600),
  ("aa", "Chiron_Elias_Krase", 1, 4694),
  ("aa", "Chiron_Elias_Krase", 1, 4500)
).toDF("project", "title", "request_num", "return_size")

import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row

val rdd = df.rdd.
  map{ case Row(_, title: String, _, _) => (title, 1) }.
  reduceByKey(_ + _)
rdd.collect
// res1: Array[(String, Int)] = Array((Chiron_Elias_Krase,2), (271_a.C,3))
You could also transform your DataFrame directly using groupBy:
df.groupBy($"title").agg(count($"title").as("count")).
show
// +------------------+-----+
// | title|count|
// +------------------+-----+
// | 271_a.C| 3|
// |Chiron_Elias_Krase| 2|
// +------------------+-----+
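If you prefer the typed Dataset API, groupByKey gives the same per-title counts without collecting anything to the driver. A sketch, assuming Spark 2.x with spark.implicits._ in scope:
import spark.implicits._

df.select($"title").as[String]
  .groupByKey(identity)
  .count()
  .show()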

Spark - How to use QuantileDiscretizer with RandomForestClassifier

Is it possible to use QuantileDiscretizer, keeping NaN values, with a RandomForestClassifier?
I have been getting an error like this:
18/03/23 17:38:15 ERROR Executor: Exception in task 3.0 in stage 133.0 (TID 381)
java.lang.IllegalArgumentException: DecisionTree given invalid data: Feature 1 is categorical with values in {0,...,1, but a data point gives it value 2.0.
Bad data point: (1.0,[1.0,2.0])
Example
The idea here is to create a numeric column and discretize it using quantiles, keeping invalid numbers (NaN) in a special bucket.
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler,
QuantileDiscretizer}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassifier}
val tseq = Seq((0, "a", 1.0), (1, "b", 0.0), (2, "c", 2.0),
(3, "a", 1.0), (4, "a", 3.0), (5, "c", Double.NaN))
val tdf = SparkInit.ss.createDataFrame(tseq).toDF("id", "category", "class")
val indexer = new StringIndexer()
.setInputCol("category")
.setOutputCol("categoryIndex")
val discr = new QuantileDiscretizer()
.setInputCol("class")
.setOutputCol("quant")
.setNumBuckets(2)
.setHandleInvalid("keep")
val assembler = new VectorAssembler()
.setInputCols(Array("categoryIndex", "quant"))
.setOutputCol("features")
val rf = new RandomForestClassifier()
.setLabelCol("categoryIndex")
.setFeaturesCol("features")
.setNumTrees(3)
new Pipeline()
.setStages(Array(indexer, discr, assembler, rf))
.fit(tdf)
.transform(tdf)
.show()
Without trying to fit the Random Forest, I was getting a DataFrame like this:
+---+--------+-----+-------------+-----+---------+
| id|category|class|categoryIndex|quant| features|
+---+--------+-----+-------------+-----+---------+
| 0| a| 1.0| 0.0| 1.0|[0.0,1.0]|
| 1| b| 0.0| 2.0| 0.0|[2.0,0.0]|
| 2| c| 2.0| 1.0| 1.0|[1.0,1.0]|
| 3| a| 1.0| 0.0| 1.0|[0.0,1.0]|
| 4| a| 3.0| 0.0| 1.0|[0.0,1.0]|
| 5| c| NaN| 1.0| 2.0|[1.0,2.0]|
+---+--------+-----+-------------+-----+---------+
If I try to fit the model, I get the error:
18/03/23 17:54:12 WARN DecisionTreeMetadata: DecisionTree reducing maxBins from 32 to 6 (= number of training instances)
18/03/23 17:54:12 WARN BlockManager: Putting block rdd_490_3 failed due to an exception
18/03/23 17:54:12 WARN BlockManager: Block rdd_490_3 could not be removed as it was not found on disk or in memory
18/03/23 17:54:12 ERROR Executor: Exception in task 3.0 in stage 143.0 (TID 414)
java.lang.IllegalArgumentException: DecisionTree given invalid data: Feature 1 is categorical with values in {0,...,1, but a data point gives it value 2.0.
Bad data point: (1.0,[1.0,2.0])
at org.apache.spark.ml.tree.impl.TreePoint$.findBin(TreePoint.scala:124)
at org.apache.spark.ml.tree.impl.TreePoint$.org$apache$spark$ml$tree$impl$TreePoint$$labeledPointToTreePoint(TreePoint.scala:93)
at org.apache.spark.ml.tree.impl.TreePoint$$anonfun$convertToTreeRDD$2.apply(TreePoint.scala:73)
at org.apache.spark.ml.tree.impl.TreePoint$$anonfun$convertToTreeRDD$2.apply(TreePoint.scala:72)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
Does QuantileDiscretizer insert some kind of metadata about the special extra bucket? It's strange that I was able to build a model before using columns with the same values, just without forcing any discretization.
Update
Yes, the column does have metadata attached, and it looks like this:
org.apache.spark.sql.types.Metadata = {"ml_attr":
{"ord":true,
"vals":["-Infinity, 5.0","5.0, 10.0","10.0, Infinity"],
"type":"nominal"}
}
The question now might be: how do I correctly set the metadata to include values like Double.NaN?
The workaround I used was simply to remove the associated metadata from the discretized columns, letting the decision tree implementation decide what to do with the data. I think the column would actually become a numerical column ([0, 1, 2, 2, 1], for example), but, if too many categories are created, the column could be discretized again (look for the parameter maxBins).
In my case, the simplest way to remove the metadata was to fill the DataFrame after applying QuantileDiscretizer:
// Nothing is actually filled in my case, since there was no missing
// values before this operation.
df.na.fill(Double.NaN, Array("quant"))
I'm almost sure you could also remove the metadata manually by accessing the column object directly.
Update
We can change a column's metadata by creating an alias (reference):
val metadata: Metadata = ...
df.select($"colA".as("colB", metadata))
This answer describes a way to get the column's metadata by getting the respective StructField of a DataFrame's schema.
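As a concrete sketch of the "strip the metadata" route: re-alias the discretized column with empty metadata so downstream stages treat it as a plain numeric column. Here, discretized stands for the DataFrame produced after the QuantileDiscretizer stage in the example above, and quant is its output column:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.Metadata

val withoutMeta = discretized.select(
  discretized.columns.map {
    case "quant" => col("quant").as("quant", Metadata.empty)  // drop the attached ml_attr metadata
    case other   => col(other)                                // pass every other column through unchanged
  }: _*
)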

Date type null value in dataframe not storing in cassandra

I am working with Apache Spark 1.6.0. I have a dataframe of 280 columns, some of which are of type timestamp. A few values of the timestamp fields are null. When I try to write this dataframe to Cassandra, I get an IllegalArgumentException.
The column looks like -
+--------------------+
|           LoginDate|
+--------------------+
|                null|
|2014-06-25T12:27:...|
|2014-06-25T12:27:...|
|                null|
|2014-06-25T12:27:...|
|2014-06-25T12:27:...|
|                null|
|                null|
|2014-06-25T12:27:...|
|2014-06-25T12:27:...|
+--------------------+
When I try to save the whole dataframe to Cassandra, I get this error:
05:39:22 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 106.0 (TID 5136,): java.lang.IllegalArgumentException: Invalid date:
at com.datastax.spark.connector.types.TimestampParser$.parse(TimestampParser.scala:50)
at com.datastax.spark.connector.types.TypeConverter$DateConverter$$anonfun$convertPF$13.applyOrElse(TypeConverter.scala:323)
at com.datastax.spark.connector.types.TypeConverter$class.convert(TypeConverter.scala:43)
at com.datastax.spark.connector.types.TypeConverter$DateConverter$.com$datastax$spark$connector$types$NullableTypeConverter$$super$convert(TypeConverter.scala:313)
at com.datastax.spark.connector.types.NullableTypeConverter$class.convert(TypeConverter.scala:56)
at com.datastax.spark.connector.types.TypeConverter$DateConverter$.convert(TypeConverter.scala:313)
at com.datastax.spark.connector.types.TypeConverter$OptionToNullConverter$$anonfun$convertPF$31.applyOrElse(TypeConverter.scala:812)
at com.datastax.spark.connector.types.TypeConverter$class.convert(TypeConverter.scala:43)
at com.datastax.spark.connector.types.TypeConverter$OptionToNullConverter.com$datastax$spark$connector$types$NullableTypeConverter$$super$convert(TypeConverter.scala:795)
at com.datastax.spark.connector.types.NullableTypeConverter$class.convert(TypeConverter.scala:56)
at com.datastax.spark.connector.types.TypeConverter$OptionToNullConverter.convert(TypeConverter.scala:795)
at com.datastax.spark.connector.writer.SqlRowWriter$$anonfun$readColumnValues$1.apply$mcVI$sp(SqlRowWriter.scala:26)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at com.datastax.spark.connector.writer.SqlRowWriter.readColumnValues(SqlRowWriter.scala:24)
at com.datastax.spark.connector.writer.SqlRowWriter.readColumnValues(SqlRowWriter.scala:12)
at com.datastax.spark.connector.writer.BoundStatementBuilder.bind(BoundStatementBuilder.scala:100)
at com.datastax.spark.connector.writer.GroupingBatchBuilder.next(GroupingBatchBuilder.scala:106)
at com.datastax.spark.connector.writer.GroupingBatchBuilder.next(GroupingBatchBuilder.scala:31)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at com.datastax.spark.connector.writer.GroupingBatchBuilder.foreach(GroupingBatchBuilder.scala:31)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:157)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$write$1.apply(TableWriter.scala:134)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:110)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:109)
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:139)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:109)
at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:134)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:37)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
The corresponding column in Cassandra is of timestamp type.
Can anyone help solve this issue?
Add the following parameter to your Spark Cassandra connection settings:
spark.cassandra.output.ignoreNulls=true
It will ignore NULL values in the input, and it also has the benefit of avoiding the creation of corresponding tombstones in Cassandra.
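For reference, a minimal sketch of where this setting might go; the host, keyspace, and table names are placeholders, and it assumes the DataStax spark-cassandra-connector with Spark 1.6:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf()
  .setAppName("cassandra-write")
  .set("spark.cassandra.connection.host", "127.0.0.1")  // placeholder host
  .set("spark.cassandra.output.ignoreNulls", "true")    // skip null values instead of writing tombstones

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// df is the 280-column DataFrame from the question
df.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))  // placeholder names
  .mode("append")
  .save()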

How to force DataFrame evaluation in Spark

Sometimes (e.g. for testing and benchmarking) I want to force the execution of the transformations defined on a DataFrame. AFAIK, calling an action like count does not ensure that all Columns are actually computed; show may only compute a subset of all Rows (see the examples below).
My solution is to write the DataFrame to HDFS using df.write.saveAsTable, but this "clutters" my system with tables I don't want to keep around.
So what is the best way to trigger the evaluation of a DataFrame?
Edit:
Note that there is also a recent discussion on the spark developer list : http://apache-spark-developers-list.1001551.n3.nabble.com/Will-count-always-trigger-an-evaluation-of-each-row-td21018.html
I made a small example which shows that count on DataFrame does not evaluate everything (tested using Spark 1.6.3 and spark-master = local[2]):
val df = sc.parallelize(Seq(1)).toDF("id")
val myUDF = udf((i:Int) => {throw new RuntimeException;i})
df.withColumn("test",myUDF($"id")).count // runs fine
df.withColumn("test",myUDF($"id")).show() // gives Exception
Using the same logic, here is an example showing that show does not evaluate all rows:
val df = sc.parallelize(1 to 10).toDF("id")
val myUDF = udf((i:Int) => {if(i==10) throw new RuntimeException;i})
df.withColumn("test",myUDF($"id")).show(5) // runs fine
df.withColumn("test",myUDF($"id")).show(10) // gives Exception
Edit 2 : For Eliasah: The Exception says this:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 1 times, most recent failure: Lost task 0.0 in stage 6.0 (TID 6, localhost): java.lang.RuntimeException
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply$mcII$sp(<console>:68)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:68)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:68)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51)
at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
.
.
.
.
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:212)
at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1500)
at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1500)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2087)
at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1499)
at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1506)
at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1376)
at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2100)
at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1375)
at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1457)
at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:170)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:350)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:311)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:319)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:74)
.
.
.
.
It's a bit late, but here's the fundamental reason: count does not behave the same way on an RDD as on a DataFrame.
In DataFrames there is an optimization: in some cases you do not need to load the data to know how many elements it has (especially in a case like yours where no data shuffling is involved). Hence, when count is called, the DataFrame will not load any data and will never reach your exception-throwing UDF. You can easily verify this by defining your own DefaultSource and Relation and observing that calling count on a DataFrame always ends up in the method buildScan with no requiredColumns, no matter how many columns you selected (cf. org.apache.spark.sql.sources.interfaces to understand more). It's actually a very efficient optimization ;-)
In RDDs, though, there are no such optimizations (which is why one should always try to use DataFrames when possible). Hence, count on an RDD executes the whole lineage and returns the sum of the sizes of the iterators over all partitions.
Calling dataframe.count falls under the first explanation, but calling dataframe.rdd.count falls under the second, since you did build an RDD out of your DataFrame. Note that calling dataframe.cache().count forces the dataframe to be materialized, since you required Spark to cache the results (so it needs to load all the data and transform it). But it does have the side effect of caching your data...
I guess simply getting the underlying RDD from the DataFrame and triggering an action on it should achieve what you're looking for:
df.withColumn("test",myUDF($"id")).rdd.count // this gives proper exceptions
It appears that df.cache.count is the way to go:
scala> val myUDF = udf((i:Int) => {if(i==1000) throw new RuntimeException;i})
myUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,IntegerType,Some(List(IntegerType)))
scala> val df = sc.parallelize(1 to 1000).toDF("id")
df: org.apache.spark.sql.DataFrame = [id: int]
scala> df.withColumn("test",myUDF($"id")).show(10)
[rdd_51_0]
+---+----+
| id|test|
+---+----+
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
| 5| 5|
| 6| 6|
| 7| 7|
| 8| 8|
| 9| 9|
| 10| 10|
+---+----+
only showing top 10 rows
scala> df.withColumn("test",myUDF($"id")).count
res13: Long = 1000
scala> df.withColumn("test",myUDF($"id")).cache.count
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (int) => int)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
.
.
.
Caused by: java.lang.RuntimeException
Source
I prefer to use df.write.parquet(...). This does add disk I/O time that you can estimate and subtract out later, but you can be positive that Spark performed each step you expected and did not trick you with lazy evaluation.
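A minimal sketch of that approach; the output path is a placeholder and the files can be deleted afterwards:
df.withColumn("test", myUDF($"id"))
  .write
  .mode("overwrite")
  .parquet("/tmp/force_full_evaluation")  // writing forces every column of every row to be computed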