scala spark UDF ClassCastException: WrappedArray$ofRef cannot be cast to [Lscala.Tuple2

I perform the necessary imports:
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types._
import spark.implicits._
Then I define some lat/long points:
val london = (1.0, 1.0)
val suburbia = (2.0, 2.0)
val southampton = (3.0, 3.0)
val york = (4.0, 4.0)
I then create a Spark DataFrame like this and check that it works:
val exampleDF = Seq((List(london,suburbia),List(southampton,york)),
(List(york,london),List(southampton,suburbia))).toDF("AR1","AR2")
exampleDF.show()
The DataFrame has the following schema:
DataFrame = [AR1: array<struct<_1:double,_2:double>>, AR2: array<struct<_1:double,_2:double>>]
I create a function to build all combinations of points:
// function to do what I want
val latlongexplode = (x: Array[(Double,Double)], y: Array[(Double,Double)]) => {
for (a <- x; b <-y) yield (a,b)
}
I check that the function works
latlongexplode(Array(london,york),Array(suburbia,southampton))
and it does. However, after I create a UDF out of this function
// declare function into a Spark UDF
val latlongexplodeUDF = udf (latlongexplode)
when I try to use it on the Spark DataFrame I created above, like this:
exampleDF.withColumn("latlongexplode", latlongexplodeUDF($"AR1",$"AR2")).show(false)
I get a really long stack trace which basically boils down to:
java.lang.ClassCastException:
scala.collection.mutable.WrappedArray$ofRef cannot be cast to
[Lscala.Tuple2;
org.apache.spark.sql.catalyst.expressions.ScalaUDF.$anonfun$f$3(ScalaUDF.scala:121)
org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1063)
org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:151)
org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:50)
org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:32)
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:273)
How can I get this UDF to work in Scala Spark? (I'm using Spark 2.4 at the moment, if that helps.)
EDIT: it could be that the way I construct my example DataFrame has an issue, but my actual data is an array (of unknown size) of lat/long tuples in each column.

When working with struct types in a UDF, they are represented as Row objects, and array columns are represented as Seq. You also need to return structs as Rows, and you need to define a schema for the return type.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val london = (1.0, 1.0)
val suburbia = (2.0, 2.0)
val southampton = (3.0, 3.0)
val york = (4.0, 4.0)
val exampleDF = Seq((List(london,suburbia),List(southampton,york)),
(List(york,london),List(southampton,suburbia))).toDF("AR1","AR2")
exampleDF.show(false)
+------------------------+------------------------+
|AR1 |AR2 |
+------------------------+------------------------+
|[[1.0, 1.0], [2.0, 2.0]]|[[3.0, 3.0], [4.0, 4.0]]|
|[[4.0, 4.0], [1.0, 1.0]]|[[3.0, 3.0], [2.0, 2.0]]|
+------------------------+------------------------+
val latlongexplode = (x: Seq[Row], y: Seq[Row]) => {
for (a <- x; b <- y) yield Row(a, b)
}
val udf_schema = ArrayType(
StructType(Seq(
StructField(
"city1",
StructType(Seq(
StructField("lat", FloatType),
StructField("long", FloatType)
))
),
StructField(
"city2",
StructType(Seq(
StructField("lat", FloatType),
StructField("long", FloatType)
))
)
))
)
// include this line if you see errors like
// "You're using untyped Scala UDF, which does not have the input type information."
// spark.sql("set spark.sql.legacy.allowUntypedScalaUDF = true")
val latlongexplodeUDF = udf(latlongexplode, udf_schema)
val result = exampleDF.withColumn("latlongexplode", latlongexplodeUDF($"AR1",$"AR2"))
result.show(false)
+------------------------+------------------------+--------------------------------------------------------------------------------------------------------+
|AR1 |AR2 |latlongexplode |
+------------------------+------------------------+--------------------------------------------------------------------------------------------------------+
|[[1.0, 1.0], [2.0, 2.0]]|[[3.0, 3.0], [4.0, 4.0]]|[[[1.0, 1.0], [3.0, 3.0]], [[1.0, 1.0], [4.0, 4.0]], [[2.0, 2.0], [3.0, 3.0]], [[2.0, 2.0], [4.0, 4.0]]]|
|[[4.0, 4.0], [1.0, 1.0]]|[[3.0, 3.0], [2.0, 2.0]]|[[[4.0, 4.0], [3.0, 3.0]], [[4.0, 4.0], [2.0, 2.0]], [[1.0, 1.0], [3.0, 3.0]], [[1.0, 1.0], [2.0, 2.0]]]|
+------------------------+------------------------+--------------------------------------------------------------------------------------------------------+
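If you'd rather avoid the UDF entirely, Spark 2.4's higher-order SQL functions (transform and flatten) can build the same cross product. This is only a minimal sketch against the same exampleDF, not tested here; note the resulting struct field names will differ from the UDF version unless you alias them inside struct():
import org.apache.spark.sql.functions.expr
val noUdfDF = exampleDF.withColumn(
  "latlongexplode",
  // for each element a of AR1, pair it with every element b of AR2, then flatten the nested arrays
  expr("flatten(transform(AR1, a -> transform(AR2, b -> struct(a, b))))")
)
noUdfDF.show(false)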

Related

scala spire interval giving wrong result

Scala Spire is giving the following result. As I understand it, it should give List((0.0,0.1], [3.0,5.0)). Why this result?
scala> val x = Interval.openLower(0.0,0.1)
x: spire.math.Interval[Double] = (0.0, 0.1]
scala> val y = Interval.openUpper(3.0,5.0)
y: spire.math.Interval[Double] = [3.0, 5.0)
scala> x.union(y)
res0: spire.math.Interval[Double] = (0.0, 5.0)
And also
val S = Interval.open(1.0, 4.5)
val A = Interval.open(1.0, 3.0)
val B = Interval.open(2.0, 4.0)
val C = Interval.openUpper(3.0, 4.5)
println(S \ (A ∩ B))
val list = (S \ A).union(S \ B)
println(list)
The result is
List((1.0, 2.0], [3.0, 4.5))
List([3.0, 4.5), (1.0, 2.0], [4.0, 4.5))
How can I unify the lower result with the upper one so that both are equal?
Interval#union returns a single Interval, i.e. the smallest interval covering both operands, which is why the disjoint pieces get merged. I ran into the same issue and found that Spire's IntervalSeq (from spire-extras) gets the job done, since it represents unions of intervals exactly.
// ammonite script intervals.sc
import $ivy.`org.typelevel::spire:0.17.0-M1`
import $ivy.`org.typelevel::spire-extras:0.17.0-M1`
import spire.math.Interval
import spire.math.extras.interval.IntervalSeq
import spire.implicits._
val S = IntervalSeq(Interval.open(1.0, 4.5))
val A = IntervalSeq(Interval.open(1.0, 3.0))
val B = IntervalSeq(Interval.open(2.0, 4.0))
val C = IntervalSeq(Interval.openUpper(3.0, 4.5))
val r1 = (S ^ (A & B))
println("r1=>" + r1.intervals.toList)
val r2 = ((S ^ A) | (S ^ B))
println("r2=>" + r2.intervals.toList)
Running this using the Ammonite REPL results in the following output:
r1=>List((1.0, 2.0], [3.0, 4.5))
r2=>List((1.0, 2.0], [3.0, 4.5))
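As a quick sanity check (reusing r1 and r2 from the script above), the two normalized interval lists can be compared directly, since Interval values have structural equality:
// both expressions reduce to the same list of intervals
println(r1.intervals.toList == r2.intervals.toList) // expected: true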

Shuffling elements of an RDD[List[Double]] in Spark

In a program I am developing using Spark 2.3 in Scala, I have an RDD[List[Double]]. Every List[Double] has the same size. I can't figure out how to perform a transformation that, given the RDD
[1.0, 1.5, 4.0, 3.0],
[2.3, 5.6, 3.4, 9.0],
[4.5, 2.0, 1.0, 5.7]
transforms it into the RDD
[2.3, 2.0, 1.0, 3.0],
[1.0, 5.6, 4.0, 5.7],
[4.5, 1.5, 3.4, 9.0]
where the elements are shuffled among the lists while each keeps its position (column index).
For example, the first element of the first list is moved to the first position of the second list, the second element of the first list is moved to the second position of the third list, and so on.
Thanks a lot.
One approach to shuffling column-wise is to break the dataset into individual single-column DataFrames, shuffle each of them using orderBy(rand), and then piece them back together.
To join the shuffled DataFrames back together, RDD zipWithIndex is applied to each of them to create row-identifying ids. Note that monotonically_increasing_id won't cut it, as it doesn't guarantee generating the same list of ids needed for the final join. This makes the approach rather expensive due to the required conversions between RDD and DataFrame.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val rdd0 = sc.parallelize(Seq(
List(1.0, 1.5, 4.0, 3.0),
List(2.3, 5.6, 3.4, 9.0),
List(4.5, 2.0, 1.0, 5.7)
))
//rdd0: org.apache.spark.rdd.RDD[List[Double]] = ...
val rdd = rdd0.map{ case x: Seq[Double] => (x(0), x(1), x(2), x(3)) }
val df = rdd.toDF("c1", "c2", "c3", "c4")
val shuffledDFs = df.columns.filter(_.startsWith("c")).map{ c =>
val subDF = df.select(c)
val subRDD = subDF.orderBy(rand).rdd.zipWithIndex.map{
case (row: Row, id: Long) => Row.fromSeq(row.toSeq :+ id)
}
spark.createDataFrame( subRDD,
StructType(subDF.schema.fields :+ StructField("idx", LongType, false))
)
}
shuffledDFs.reduce( _.join(_, Seq("idx")) ).
show
// +---+---+---+---+---+
// |idx| c1| c2| c3| c4|
// +---+---+---+---+---+
// | 0|2.3|2.0|4.0|9.0|
// | 1|1.0|5.6|3.4|3.0|
// | 2|4.5|1.5|1.0|5.7|
// +---+---+---+---+---+
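If you prefer to stay at the RDD level (the question starts from an RDD[List[Double]]), here is a rough sketch of the same idea, reusing rdd0 from above: shuffle each column independently, key the shuffled values by their new row index, and regroup. Note that the random sort key is non-deterministic, so the result is not stable across task retries.
import scala.util.Random
// number of columns; assumes every list has the same size, as stated in the question
val nCols = rdd0.first().length
// one shuffled RDD[(rowId, (colIdx, value))] per column
val shuffledCols = (0 until nCols).map { i =>
  rdd0.map(_(i))
    .sortBy(_ => Random.nextDouble())
    .zipWithIndex()
    .map { case (v, rowId) => (rowId, (i, v)) }
}
// regroup by rowId and restore the original column order
val shuffledRdd = shuffledCols
  .reduce(_ union _)
  .groupByKey()
  .map { case (_, cells) => cells.toList.sortBy(_._1).map(_._2) }
shuffledRdd.collect().foreach(println)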

Displaying output under a certain format

I'm quite new to Scala and Spark, and have some questions about writing results to an output file.
I have a Map in which each key is associated with a list of lists (Map[Int, List[List[Double]]]), such as:
2 -> List(List(x1, x2, x3), List(y1, y2, y3), ...)
I am supposed to display for each key the values inside the lists of lists, such as:
2 x1,x2,x3
2 y1,y2,y3
1 z1,z2,z3
and so on.
When I use the saveAsTextFile function, it doesn't give me what I want in the output. Does anybody know how I can do it?
EDIT :
This is one of my functions:
def PrintCluster(vectorsByKey : Map[Int, List[Double]], vectCentroidPairs : Map[Int, Int]) : Map[Int, List[Double]] = {
var vectorsByCentroid: Map[Int, List[Double]] = Map()
val SortedCentroid = vectCentroidPairs.groupBy(_._2).mapValues(x => x.map(_._1).toList).toSeq.sortBy(_._1).toMap
SortedCentroid.foreach { case (centroid, vect) =>
var nbVectors = vect.length
for (i <- 0 to nbVectors - 1) {
var vectValues = vectorsByKey(vect(i))
println(centroid + " " + vectValues)
vectorsByCentroid += (centroid -> (vectValues))
}
}
return vectorsByCentroid
}
I know it's wrong, because I can only assign one unique key to a group of values; that is why it returns only the first list for each key in the Map. I thought that in order to use saveAsTextFile I necessarily had to use a Map structure, but I'm not really sure.
Create a sample RDD matching your input data:
import org.apache.spark.rdd.RDD
val rdd: RDD[Map[Int, List[List[Double]]]] = spark.sparkContext.parallelize(
Seq(Map(
2 -> List(List(-4.4, -2.0, 1.5), List(-3.3, -5.4, 3.9), List(-5.8, -3.3, 2.3), List(-5.2, -4.0, 2.8)),
1 -> List(List(7.3, 1.0, -2.0), List(9.8, 0.4, -1.0), List(7.5, 0.3, -3.0), List(6.1, -0.5, -0.6), List(7.8, 2.2, -0.7), List(6.6, 1.4, -1.1), List(8.1, -0.0, 2.7)),
3 -> List(List(-3.0, 4.0, 1.4), List(-4.0, 3.9, 0.8), List(-1.4, 4.3, -0.5), List(-1.6, 5.2, 1.0)))
)
)
Transform RDD[Map[Int, List[List[Double]]]] to RDD[(Int, String)]
val result: RDD[(Int, String)] = rdd.flatMap(i => {
i.map {
case (x, y) => y.map(list => (x, list.mkString(" ")))
}
}).flatMap(z => z)
result.foreach(println)
result.saveAsTextFile("location")
Printing a Map[Int, List[List[Double]]] in the wanted format is simple: first convert the map to a list, then apply flatMap. Using the data supplied in a comment:
val map: Map[Int, List[List[Double]]] = Map(
2 -> List(List(-4.4, -2.0, 1.5), List(-3.3, -5.4, 3.9), List(-5.8, -3.3, 2.3), List(-5.2, -4.0, 2.8)),
1 -> List(List(7.3, 1.0, -2.0), List(9.8, 0.4, -1.0), List(7.5, 0.3, -3.0), List(6.1, -0.5, -0.6), List(7.8, 2.2, -0.7), List(6.6, 1.4, -1.1), List(8.1, -0.0, 2.7)),
3 -> List(List(-3.0, 4.0, 1.4), List(-4.0, 3.9, 0.8), List(-1.4, 4.3, -0.5), List(-1.6, 5.2, 1.0))
)
val list = map.toList.flatMap(t => t._2.map((t._1, _)))
val result = for (t <- list) yield t._1 + "\t" + t._2.mkString(",")
// Saving the result to file
import java.io._
val pw = new PrintWriter(new File("fileName.txt"))
result.foreach{ line => pw.println(line)}
pw.close
Will print out:
2 -4.4,-2.0,1.5
2 -3.3,-5.4,3.9
2 -5.8,-3.3,2.3
2 -5.2,-4.0,2.8
1 7.3,1.0,-2.0
1 9.8,0.4,-1.0
1 7.5,0.3,-3.0
1 6.1,-0.5,-0.6
1 7.8,2.2,-0.7
1 6.6,1.4,-1.1
1 8.1,-0.0,2.7
3 -3.0,4.0,1.4
3 -4.0,3.9,0.8
3 -1.4,4.3,-0.5
3 -1.6,5.2,1.0

How to define Tuple1 in Scala?

I tried to use (1,), but it doesn't work. What's the syntax to define a Tuple1 in Scala?
scala> val a=(1,)
<console>:1: error: illegal start of simple expression
val a=(1,)
For tuples with cardinality 2 or more you can use parentheses; however, for cardinality 1, you need to use Tuple1:
scala> val tuple1 = Tuple1(1)
tuple1: (Int,) = (1,)
scala> val tuple2 = ('a', 1)
tuple2: (Char, Int) = (a,1)
scala> val tuple3 = ('a', 1, "name")
tuple3: (Char, Int, java.lang.String) = (a,1,name)
scala> tuple1._1
res0: Int = 1
scala> tuple2._2
res1: Int = 1
scala> tuple3._1
res2: Char = a
scala> tuple3._3
res3: String = name
To declare the type, use Tuple1[T], for example val t : Tuple1[Int] = Tuple1(22)
A tuple is, by definition, an ordered list of elements. While Tuple1 exists, I haven't seen it used explicitly, since you'd normally just use the single element directly. Nevertheless, there is no syntactic sugar; you need to write Tuple1(1).
There is a valid use case in Spark that requires Tuple1: creating a DataFrame with one column.
import org.apache.spark.ml.linalg.Vectors
val data = Seq(
Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
data.toDF("features").show()
It will throw an error:
"value toDF is not a member of Seq[org.apache.spark.ml.linalg.Vector]"
To make it work, we have to convert each row to Tuple1:
val data = Seq(
Tuple1(Vectors.sparse(5, Seq((1, 1.0), (3, 7.0)))),
Tuple1(Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0)),
Tuple1(Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0))
)
or a better way:
val data = Seq(
Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
).map(Tuple1.apply)
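With each vector wrapped in Tuple1, the original call works and yields a single-column DataFrame (a small usage sketch, assuming a SparkSession named spark with its implicits imported):
import spark.implicits._
// prints roughly: root / |-- features: vector (nullable = true)
data.toDF("features").printSchema()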

simple matrix multiplication in Spark

I am struggling with some very basic Spark code. I would like to define a matrix x with 2 columns. This is what I have tried:
scala> val s = breeze.linalg.linspace(-3,3,5)
s: breeze.linalg.DenseVector[Double] = DenseVector(-3.0, -1.5, 0.0, 1.5, 3.0) // in this case I want s to be both column 1 and column 2 of x
scala> val ss = s.toArray ++ s.toArray
ss: Array[Double] = Array(-3.0, -1.5, 0.0, 1.5, 3.0, -3.0, -1.5, 0.0, 1.5, 3.0)
scala> import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
scala> val mat = new RowMatrix(ss, 5, 2)
<console>:17: error: type mismatch;
found : Array[Double]
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
val mat = new RowMatrix(ss, 5, 2)
I do not understand how to get the right transformation in order to pass the values to the distributed matrix.
EDIT:
Maybe I have been able to solve it:
scala> val s = breeze.linalg.linspace(-3,3,5)
s: breeze.linalg.DenseVector[Double] = DenseVector(-3.0, -1.5, 0.0, 1.5, 3.0)
scala> val ss = s.to
toArray toDenseMatrix toDenseVector toScalaVector toString
toVector
scala> val ss = s.toArray ++ s.toArray
ss: Array[Double] = Array(-3.0, -1.5, 0.0, 1.5, 3.0, -3.0, -1.5, 0.0, 1.5, 3.0)
scala> val x = new breeze.linalg.Dense
DenseMatrix DenseVector
scala> val x = new breeze.linalg.DenseMatrix(5, 2, ss)
x: breeze.linalg.DenseMatrix[Double] =
-3.0 -3.0
-1.5 -1.5
0.0 0.0
1.5 1.5
3.0 3.0
scala> val xDist = sc.parallelize(x.toArray)
xDist: org.apache.spark.rdd.RDD[Double] = ParallelCollectionRDD[0] at parallelize at <console>:18
Something like this. This typechecks, but for some reason won't run in my Scala worksheet.
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.linalg.distributed._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
val conf = new SparkConf().setAppName("spark-scratch").setMaster("local")
val sc= new SparkContext(conf)
// the values for the column in each row
val col = List(-3.0, -1.5, 0.0, 1.5, 3.0) ;
// make two rows of the column values, transpose it,
// make Vectors of the result
val t = List(col,col).transpose.map(r=>Vectors.dense(r.toArray))
// make an RDD from the resultant sequence of Vectors, and
// make a RowMatrix from that.
val rm = new RowMatrix(sc.makeRDD(t));
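To go on and actually multiply (the original goal), RowMatrix#multiply takes a local, non-distributed Matrix as the right-hand operand. A small follow-up sketch using an arbitrary 2x2 local matrix (here the identity, purely as a placeholder):
// 2x2 local matrix, values given in column-major order
val local = Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0))
// (5 x 2) * (2 x 2) = a distributed 5 x 2 RowMatrix
val product: RowMatrix = rm.multiply(local)
product.rows.collect().foreach(println)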