Scala Reduce() operation on a RDD[Array[Int]] - eclipse

I have an RDD of a one-dimensional matrix. I am trying to do a very basic reduce operation to sum up the values at the same position of the matrix across the various partitions.
I am using:
var z=x.reduce((a,b)=>a+b)
or
var z=x.reduce(_ + _)
But I am getting an error saying:
type mismatch; found: Array[Int], expected: String
I looked it up and found this link:
Is there a better way for reduce operation on RDD[Array[Double]]
So I tried using
import spire.implicits._
Now I don't have any compilation errors, but when I run the code I get a java.lang.NoSuchMethodError. I have provided the entire error below. Any help would be appreciated.
java.lang.NoSuchMethodError: scala.Product.$init$(Lscala/Product;)V
at spire.math.NumberTag$Integral$.<init>(NumberTag.scala:9)
at spire.math.NumberTag$Integral$.<clinit>(NumberTag.scala)
at spire.std.BigIntInstances.$init$(bigInt.scala:80)
at spire.implicits$.<init>(implicits.scala:6)
at spire.implicits$.<clinit>(implicits.scala)
at main.scala.com.ucr.edu.SparkScala.HistogramRDD$$anonfun$9.apply(HistogramRDD.scala:118)
at main.scala.com.ucr.edu.SparkScala.HistogramRDD$$anonfun$9.apply(HistogramRDD.scala:118)
at scala.collection.TraversableOnce$$anonfun$reduceLeft$1.apply(TraversableOnce.scala:190)
at scala.collection.TraversableOnce$$anonfun$reduceLeft$1.apply(TraversableOnce.scala:185)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:185)
at scala.collection.AbstractIterator.reduceLeft(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1012)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$15.apply(RDD.scala:1010)
at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2125)
at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2125)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

From my understanding, you're trying to reduce the items by position in the arrays. You should consider zipping your arrays while reducing the RDD:
val a: RDD[Array[Int]] = ss.createDataset[Array[Int]](Seq(Array(1,2,3), Array(4,5,6))).rdd
a.reduce { case (a: Array[Int], b: Array[Int]) =>
  val zipped = a.zip(b)
  zipped.map { case (i1, i2) => i1 + i2 }
}.foreach(println)
outputs:
5
7
9
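If you're not going through a Dataset, the same element-wise sum works directly on an RDD built with sc.parallelize; a minimal sketch, assuming a SparkContext sc is in scope:
import org.apache.spark.rdd.RDD

// Element-wise sum; assumes all arrays have the same length
val x: RDD[Array[Int]] = sc.parallelize(Seq(Array(1, 2, 3), Array(4, 5, 6)))
val z: Array[Int] = x.reduce((a, b) => a.zip(b).map { case (i, j) => i + j })
z.foreach(println)  // 5, 7, 9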

Related

How can I cast WrappedArray to List in Spark Scala?

I use a DataFrame to handle data in Spark. I have an array column in this dataframe. At the end of all the transformations I want to do, I have a dataframe with one array column and one row. In order to apply groupBy, map and reduce I want to have this array as a list, but I can't do it.
.drop("ScoresArray")
.filter($"min_score" < 0.2)
.select("WordsArray")
.agg(collect_list("WordsArray"))
.withColumn("FlattenWords", flatten($"collect_list(WordsArray)"))
.drop("collect_list(WordsArray)")
.collect()
val test1 = words(0).getAs[immutable.List[String]](0)
Here is the error message:
[error] (run-main-0) java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to scala.collection.immutable.List
[error] java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to scala.collection.immutable.List
[error] at analysis.Analysis$.main(Analysis.scala:37)
[error] at analysis.Analysis.main(Analysis.scala)
[error] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[error] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[error] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[error] at java.lang.reflect.Method.invoke(Method.java:498)
[error] stack trace is suppressed; run last Compile / bgRun for the full output
Thoughts?
You can't cast a WrappedArray to a List, but you can convert one to the other:
val test1 = words(0).getSeq[String](0).toList
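For example, a minimal sketch, assuming words is the Array[Row] returned by the collect() above and the flattened words sit in its first column:
import org.apache.spark.sql.Row

val row: Row = words(0)
val asSeq: Seq[String] = row.getSeq[String](0)  // backed by a WrappedArray
val asList: List[String] = asSeq.toList         // explicit conversion, not a cast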

set findFirstMatchIn to ignore line if match not found- spark - scala

I am new to Scala. I am trying to write code to map parsed numbers in a series of XML files. My code works for a small RDD, as below:
val myrdd = sc.parallelize(Array("FavoriteCount=\"23\" Score=\"43\"","FavoriteCount=\"12\" Score=\"32\"","FavoriteCount=\"32\" Score=\"2\""))
def successMatches(s: String): (String, Int) = {
  val fcountMatcher = """FavoriteCount=\"(\d+)\"""".r
  val scoreMatcher = """Score=\"(\d+)\"""".r
  val fcount = fcountMatcher.findFirstMatchIn(s).get.group(1)
  val score = scoreMatcher.findFirstMatchIn(s).get.group(1)
  (fcount, score.toInt)
}
val myWords = myrdd.map(x => successMatches(x))
myWords.take(3)
myrdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[4] at parallelize at <console>:29
successMatches: (s: String)(String, Int)
myWords: Array[(String, Int)] = Array((23,43), (12,32), (32,2))
res1: Array[(String, Int)] = Array((23,43), (12,32), (32,2))
For the actual XML RDD it returns the error message below:
val myWords = valid_lines.take(1).map(x => successMatches(x))
myWords.take(1)
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at successMatches(<console>:52)
at $anonfun$1.apply(<console>:57)
at $anonfun$1.apply(<console>:57)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
... 42 elided
What am I missing?
This is what the first element of the XML RDD looks like:
valid_lines.take(1)
res51: Array[String] = Array(" <row AnswerCount="0" Body="<p>I'm having trouble with a basic machine learning methodology question. I understand the concept of not using the same data to both train and evaluate a classifier, and furthermore when there are parameters in an algorithm to be optimized, you should use an independent third test set to get the final reportable performance figures (e.g. recall rate). However, using a <em>single</em> test set to measure performance seems to be problematic because the measures of performance will likely differ depending on how the data is partitioned into training (plus validation) and test sets, especially for small datasets. It would be better to average the results of N different partitions.</p>
<p>For t...
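A hedged sketch of one way to skip lines where a pattern is absent, by returning an Option instead of calling .get (the names below are illustrative, not part of the original code):
def successMatchesOpt(s: String): Option[(String, Int)] = {
  val fcountMatcher = """FavoriteCount=\"(\d+)\"""".r
  val scoreMatcher = """Score=\"(\d+)\"""".r
  for {
    fcount <- fcountMatcher.findFirstMatchIn(s).map(_.group(1))
    score  <- scoreMatcher.findFirstMatchIn(s).map(_.group(1))
  } yield (fcount, score.toInt)
}

// flatMap keeps only the lines where both patterns matched
val myWords = valid_lines.flatMap(x => successMatchesOpt(x))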

Combining entire column of Arrays into one Array

I have this dataframe and I'd like to combine all the arrays in the data column into one big array, separate from the DataFrame.
Scala and the DataFrame API are still pretty new to me, but I gave it a shot:
case class Tile(data: Array[Int])
val ta = Tile(Array(1,2))
val tb = Tile(Array(3,4))
val tc = Tile(Array(5,6))
val df = ListBuffer(ta,tb,tc).toDF()
// Combine contents of DF into one array
val result = new Array[Int](6)
var offset = 0
val combine = (t: WrappedArray[Int]) => {
  Array.copy(t, 0, result, offset, t.length)
  offset += t.length
}
df.foreach(r => combine(r(0).asInstanceOf[WrappedArray[Int]]))
df.show()
+------+
| data|
+------+
|[1, 1]|
|[2, 2]|
|[3, 3]|
+------+
When I run this, I get the following error:
16/08/23 11:21:32 ERROR executor.Executor: Exception in task 0.0 in stage 17.0 (TID 17)
scala.MatchError: WrappedArray(1, 1) (of class scala.collection.mutable.WrappedArray$ofRef)
at scala.runtime.ScalaRunTime$.array_apply(ScalaRunTime.scala:71)
at scala.Array$.slowcopy(Array.scala:81)
at scala.Array$.copy(Array.scala:107)
at $line150.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:32)
at $line150.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:31)
at $line190.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:46)
at $line190.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:46)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$32.apply(RDD.scala:912)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$32.apply(RDD.scala:912)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:74
Can anyone point me in the right direction? Thanks!
When working with Spark you cannot accumulate into driver-side variables from a foreach the way you would locally: the closure is serialized and shipped to the executors, so each executor mutates its own copy of result and offset, and the driver never sees those updates (and anything the closure captures must be serializable).
If you still want to do things in a way similar to what you would do locally, use an Accumulator, which supports Spark's distributed model.
val myRdd: RDD[List[Int]] = sc.parallelize(List(List(1,2), List(3,4), List(5,6)))
val acc = sc.collectionAccumulator[Int]("MyAccumulator")
myRdd.foreach(l => l.foreach(i => acc.add(i)))
Or in your case
case class Tile(data: Array[Int])
val myRdd: RDD[Tile] = sc.parallelize(List(
Tile(Array(1,2)),
Tile(Array(3,4)),
Tile(Array(5,6))
))
val acc = sc.collectionAccumulator[Int]("MyAccumulator")
myRdd.foreach(t => t.data.foreach(i => acc.add(i)))
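Once the foreach action has finished, the accumulated values can be read back on the driver. A small sketch; note that collectionAccumulator exposes a java.util.List, so the conversion back to a Scala collection below is this snippet's own addition:
import scala.collection.JavaConverters._

// Element order across partitions is not guaranteed
val combined: List[Int] = acc.value.asScala.toList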

How to convert RDD[Row] to RDD[Vector]

I'm trying to implement k-means method using scala.
I created a RDD something like that
val df = sc.parallelize(data).groupByKey().collect().map((chunk) => {
  sc.parallelize(chunk._2.toSeq).toDF()
})
val examples = df.map(dataframe => {
  dataframe.selectExpr(
    "avg(time) as avg_time",
    "variance(size) as var_size",
    "variance(time) as var_time",
    "count(size) as examples"
  ).rdd
})
val rdd_final=examples.reduce(_ union _)
val kmeans= new KMeans()
val model = kmeans.run(rdd_final)
With this code I obtain an error:
type mismatch;
[error] found : org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
[error] required:org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
So I tried to cast it by doing:
val rdd_final_Vector = rdd_final.map{x:Row => x.getAs[org.apache.spark.mllib.linalg.Vector](0)}
val model = kmeans.run(rdd_final_Vector)
But then I obtain an error:
java.lang.ClassCastException: java.lang.Double cannot be cast to org.apache.spark.mllib.linalg.Vector
So I'm looking for a way to do that cast, but I can't find any method.
Any idea?
Best regards
At least a couple of issues here:
No, you really cannot cast a Row to a Vector: a Row is a collection of potentially disparate types understood by Spark SQL, and a Vector is not a native Spark SQL type.
There seems to be a mismatch between the content of your SQL statement and what you are attempting to achieve with KMeans: the SQL is performing aggregations, but KMeans expects a series of individual data points in the form of a Vector (which encapsulates an Array[Double]). So then, why are you supplying sums and averages to a KMeans operation?
Addressing just #1 here: you will need to do something along the lines of:
val doubVals = <rows rdd>.map{ row => row.getAs[Double]("colname") }
val vector = Vectors.dense(doubVals.collect)
Then you have a properly encapsulated Array[Double] (within a Vector) that can be supplied to KMeans.
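For the KMeans input itself, you typically want one Vector per aggregated row rather than a single collected Vector. A hedged sketch, assuming the column order from the selectExpr above (k = 2 is an arbitrary placeholder):
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

val rddVectors: RDD[Vector] = rdd_final.map { row =>
  Vectors.dense(
    row.getDouble(0),        // avg_time
    row.getDouble(1),        // var_size
    row.getDouble(2),        // var_time
    row.getLong(3).toDouble  // count(size) comes back as a long
  )
}
val model = new KMeans().setK(2).run(rddVectors)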

Reducing potentially empty RDD's

So I'm running into an issue where a filter I'm using on an RDD can potentially create an empty RDD. I feel that doing a count() in order to test for emptiness would be very expensive, and was wondering if there is a more performant way to handle this situation.
Here is an example of what this issue might look like:
val b:RDD[String] = sc.parallelize(Seq("a","ab","abc"))
println(b.filter(a => !a.contains("a")).reduce(_+_))
would give the result
empty collection
java.lang.UnsupportedOperationException: empty collection
at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$36.apply(RDD.scala:1005)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$36.apply(RDD.scala:1005)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1005)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:985)
Does anyone have any suggestions for how I should go about addressing this edge case?
Consider .fold("")(_ + _) instead of .reduce(_ + _)
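Applied to the example above, a quick sketch of what that looks like (fold's zero element means an empty filtered RDD simply yields ""):
val b = sc.parallelize(Seq("a", "ab", "abc"))
println(b.filter(a => !a.contains("a")).fold("")(_ + _))  // prints an empty line instead of throwing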
How about:
scala> val b = sc.parallelize(Seq("a","ab","abc"))
b: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[1] at parallelize at <console>:24
scala> b.isEmpty
res1: Boolean = false
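Combining the two suggestions, a guard with isEmpty might look like the sketch below; isEmpty only needs to find a single element, so it is typically much cheaper than count():
val filtered = b.filter(a => !a.contains("a"))
// Note: without caching, the filter is evaluated twice (once for isEmpty, once for reduce)
if (!filtered.isEmpty) println(filtered.reduce(_ + _))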