How to define Tuple1 in Scala?

I tried to use (1,), but it doesn't work. What's the syntax to define a Tuple1 in Scala?
scala> val a=(1,)
<console>:1: error: illegal start of simple expression
val a=(1,)

For tuples with cardinality 2 or more you can use parentheses; however, for cardinality 1 you need to use Tuple1:
scala> val tuple1 = Tuple1(1)
tuple1: (Int,) = (1,)
scala> val tuple2 = ('a', 1)
tuple2: (Char, Int) = (a,1)
scala> val tuple3 = ('a', 1, "name")
tuple3: (Char, Int, java.lang.String) = (a,1,name)
scala> tuple1._1
res0: Int = 1
scala> tuple2._2
res1: Int = 1
scala> tuple3._1
res2: Char = a
scala> tuple3._3
res3: String = name
To declare the type, use Tuple1[T], for example val t : Tuple1[Int] = Tuple1(22)

A tuple is, by definition, an ordered list of elements. While Tuple1 exists, I haven't seen it used explicitly, since you would normally just use the single element directly. Nevertheless, there is no syntactic sugar for it; you need to write Tuple1(1).
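For illustration only (not from the original answer), a small sketch showing that a Tuple1 constructs and destructures like any other tuple; the value names are arbitrary:
val single = Tuple1("hello")   // a Tuple1[String]
val Tuple1(inner) = single     // pattern match to extract the only element
println(inner)                 // hello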

There is a valid use case in Spark that requires Tuple1: creating a DataFrame with a single column.
import org.apache.spark.ml.linalg.Vectors
val data = Seq(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
data.toDF("features").show()
It will throw an error:
"value toDF is not a member of Seq[org.apache.spark.ml.linalg.Vector]"
To make it work, we have to wrap each row in a Tuple1:
val data = Seq(
  Tuple1(Vectors.sparse(5, Seq((1, 1.0), (3, 7.0)))),
  Tuple1(Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0)),
  Tuple1(Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0))
)
Or, a better way:
val data = Seq(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
).map(Tuple1.apply)
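For completeness, here is a minimal self-contained sketch of the working version; it assumes Spark 2.x or later with a local SparkSession (the names spark, "tuple1-demo" and "features" are just illustrative):
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("tuple1-demo").getOrCreate()
import spark.implicits._   // provides toDF for a Seq of Products (and thus Tuple1)

val data = Seq(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
).map(Tuple1.apply)        // wrap each Vector so every element is a one-field Product

data.toDF("features").show()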

Related

Update specific indices of Seq by another Seq in Scala

I have a first seq, for example:
val s: Seq[Double] = List.fill(6)(0.0)
and a subsequence of the indices of s:
val subInd: Seq[Int] = List(2, 4, 5)
Now, what I want to do is update s at positions 2, 4 and 5 with the values of another Seq, which has the same length as subInd:
val t: Seq[Double] = List(5.0, 6.0, 7.0)
such that:
updateBySeq(s, subInd, t) == List(0.0, 0.0, 5.0, 0.0, 6.0, 7.0)
I have searched on this site and found Update multiple values in a sequence where the second answer comes close to the functionality I want to have.
However, the difference is that the function provided updates s at the indices in subInd with a single value. I, on the other hand, want each index to correspond to its own, distinct value in a third Seq t.
I have tried various things, like using recursion and ListBuffers instead of Lists to incrementally update the elements of s, but they either left s unchanged or produced an error because I violated some immutability constraint.
This should work:
def updateListByIndexes[T](data: List[T])(indexes: List[Int], update: List[T]): List[T] = {
  val updateMap = (indexes lazyZip update).toMap
  data.iterator.zipWithIndex.map {
    case (elem, idx) => updateMap.getOrElse(key = idx, default = elem)
  }.toList
}
Which you can use like this:
val s = List.fill(6)(0.0)
// s: List[Double] = List(0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
val subInd = List(2, 4, 5)
// subInd: List[Int] = List(2, 4, 5)
val t = List(5.0, 6.0, 7.0)
// t: List[Double] = List(5.0, 6.0, 7.0)
updateListByIndexes(s)(subInd, t)
// res: List[Double] = List(0.0, 0.0, 5.0, 0.0, 6.0, 7.0)
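As an alternative sketch (not part of the original answer), the same update can be expressed by folding the index/value pairs over the list with updated; this assumes all indices in subInd are within the bounds of s:
def updateByFold[T](data: List[T])(indexes: List[Int], update: List[T]): List[T] =
  indexes.zip(update).foldLeft(data) {
    case (acc, (idx, value)) => acc.updated(idx, value)   // replace one position per step
  }

updateByFold(s)(subInd, t)
// List(0.0, 0.0, 5.0, 0.0, 6.0, 7.0)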

How to Sum a part of a list in RDD

I have an RDD of (key, list) pairs, and I would like to sum part of each list, producing (key, element2 + element3). Given:
(1, List(2.0, 3.0, 4.0, 5.0)), (2, List(1.0, -1.0, -2.0, -3.0))
the output should look like this:
(1, 7.0), (2, -3.0)
Thanks
You can map and index into the second part:
yourRddOfTuples.map(tuple => {val list = tuple._2; list(1) + list(2)})
Update after your comment: convert it to a Vector first:
yourRddOfTuples.map(tuple => {val vs = tuple._2.toVector; vs(1) + vs(2)})
Or if you do not want to use conversions:
yourRddOfTuples.map(_._2.drop(1).take(2).sum)
This takes the second element of each tuple (.map(_._2)), drops its first element (.drop(1)), takes the next two (.take(2)), possibly fewer if the list is shorter, and sums them (.sum).
You can map the key-list pair to obtain the 2nd and 3rd list elements as follows:
val rdd = sc.parallelize(Seq(
  (1, List(2.0, 3.0, 4.0, 5.0)),
  (2, List(1.0, -1.0, -2.0, -3.0))
))
rdd.map{ case (k, l) => (k, l(1) + l(2)) }.collect
// res1: Array[(Int, Double)] = Array((1,7.0), (2,-3.0))
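If some lists might have fewer than three elements and you would rather skip those rows than fail, a hedged variant (not from the original answers) using a pattern match:
rdd.flatMap {
  case (k, _ :: second :: third :: _) => Some((k, second + third))   // lists with at least 3 elements
  case _                              => None                        // drop rows that are too short
}.collect
// Array((1,7.0), (2,-3.0))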

Displaying output under a certain format

I'm quite new to Scala and Spark, and have some questions about displaying results in an output file.
I actually have a Map in which each key is associated with a list of lists (Map[Int, List[List[Double]]]), such as:
(2, List(List(x1, x2, x3), List(y1, y2, y3), ...)).
I am supposed to display for each key the values inside the lists of lists, such as:
2 x1,x2,x3
2 y1,y2,y3
1 z1,z2,z3
and so on.
When I use the saveAsTextFile function, it doesn't give me what I want in the output. Does anybody know how I can do it?
EDIT:
This is one of my functions:
def PrintCluster(vectorsByKey: Map[Int, List[Double]], vectCentroidPairs: Map[Int, Int]): Map[Int, List[Double]] = {
  var vectorsByCentroid: Map[Int, List[Double]] = Map()
  val SortedCentroid = vectCentroidPairs.groupBy(_._2).mapValues(x => x.map(_._1).toList).toSeq.sortBy(_._1).toMap
  SortedCentroid.foreach { case (centroid, vect) =>
    var nbVectors = vect.length
    for (i <- 0 to nbVectors - 1) {
      var vectValues = vectorsByKey(vect(i))
      println(centroid + " " + vectValues)
      vectorsByCentroid += (centroid -> (vectValues))
    }
  }
  return vectorsByCentroid
}
I know it's wrong, because I can only associate a single unique key with a group of values. That is why it returns only one List for each key in the Map. I thought that, to use the saveAsTextFile function, I necessarily had to use a Map structure, but I don't really know.
Create a sample RDD matching your input data:
import org.apache.spark.rdd.RDD

val rdd: RDD[Map[Int, List[List[Double]]]] = spark.sparkContext.parallelize(
  Seq(Map(
    2 -> List(List(-4.4, -2.0, 1.5), List(-3.3, -5.4, 3.9), List(-5.8, -3.3, 2.3), List(-5.2, -4.0, 2.8)),
    1 -> List(List(7.3, 1.0, -2.0), List(9.8, 0.4, -1.0), List(7.5, 0.3, -3.0), List(6.1, -0.5, -0.6), List(7.8, 2.2, -0.7), List(6.6, 1.4, -1.1), List(8.1, -0.0, 2.7)),
    3 -> List(List(-3.0, 4.0, 1.4), List(-4.0, 3.9, 0.8), List(-1.4, 4.3, -0.5), List(-1.6, 5.2, 1.0)))
  )
)
Transform the RDD[Map[Int, List[List[Double]]]] into an RDD[(Int, String)]:
val result: RDD[(Int, String)] = rdd.flatMap(i => {
  i.map {
    case (x, y) => y.map(list => (x, list.mkString(" ")))
  }
}).flatMap(z => z)
result.foreach(println)
result.saveAsTextFile("location")
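A hedged note on the saved format: saveAsTextFile writes each element's toString, so an RDD[(Int, String)] ends up as lines like (2,-4.4 -2.0 1.5). If you want plain key/values lines instead, format each pair into a single string first:
result
  .map { case (k, values) => s"$k\t$values" }   // e.g. "2\t-4.4 -2.0 1.5"
  .saveAsTextFile("location")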
Printing a Map[Int, List[List[Double]]] in the wanted format is simple; it can be done by first converting it to a list and then applying flatMap. Using the data supplied in a comment:
val map: Map[Int, List[List[Double]]] = Map(
  2 -> List(List(-4.4, -2.0, 1.5), List(-3.3, -5.4, 3.9), List(-5.8, -3.3, 2.3), List(-5.2, -4.0, 2.8)),
  1 -> List(List(7.3, 1.0, -2.0), List(9.8, 0.4, -1.0), List(7.5, 0.3, -3.0), List(6.1, -0.5, -0.6), List(7.8, 2.2, -0.7), List(6.6, 1.4, -1.1), List(8.1, -0.0, 2.7)),
  3 -> List(List(-3.0, 4.0, 1.4), List(-4.0, 3.9, 0.8), List(-1.4, 4.3, -0.5), List(-1.6, 5.2, 1.0))
)
val list = map.toList.flatMap(t => t._2.map((t._1, _)))
val result = for (t <- list) yield t._1 + "\t" + t._2.mkString(",")
// Saving the result to file
import java.io._
val pw = new PrintWriter(new File("fileName.txt"))
result.foreach{ line => pw.println(line)}
pw.close
Will print out:
2 -4.4,-2.0,1.5
2 -3.3,-5.4,3.9
2 -5.8,-3.3,2.3
2 -5.2,-4.0,2.8
1 7.3,1.0,-2.0
1 9.8,0.4,-1.0
1 7.5,0.3,-3.0
1 6.1,-0.5,-0.6
1 7.8,2.2,-0.7
1 6.6,1.4,-1.1
1 8.1,-0.0,2.7
3 -3.0,4.0,1.4
3 -4.0,3.9,0.8
3 -1.4,4.3,-0.5
3 -1.6,5.2,1.0

Scala, Spark: find element-wise average of N maps

I have N maps (Map[String, Double]) each having the same set of keys. Let's say something like the following:
map1 = ("elem1": 2.0, "elem2": 4.0, "elem3": 3.0)
map2 = ("elem1": 4.0, "elem2": 1.0, "elem3": 1.0)
map3 = ("elem1": 3.0, "elem2": 10.0, "elem3": 2.0)
I need to return a new map with element-wise average of those input maps:
resultMap = ("elem1": 3.0, "elem2": 5.0, "elem3": 2.0)
What's the cleanest way to do that in Scala? Preferably without using extra external libraries.
This all happens in Spark, so answers suggesting Spark-specific usage could also be helpful.
One option is to convert all the Maps to Seqs, concatenate them into a single Seq, group by key and take the average of the values:
val maps = Seq(map1, map2, map3)
maps.map(_.toSeq).reduce(_++_).groupBy(_._1).mapValues(x => x.map(_._2).sum/x.length)
// res6: scala.collection.immutable.Map[String,Double] = Map(elem1 -> 3.0, elem3 -> 2.0, elem2 -> 5.0)
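As a side note (not part of the original answer), on Scala 2.13+ the same computation can be written with groupMapReduce; this assumes every map contains every key exactly once:
val maps = Seq(map1, map2, map3)
val averages: Map[String, Double] =
  maps.flatten                                  // Seq[(String, Double)]
    .groupMapReduce(_._1)(_._2)(_ + _)          // sum the values per key
    .view.mapValues(_ / maps.size).toMap        // divide each sum by the number of maps
// Map(elem1 -> 3.0, elem2 -> 5.0, elem3 -> 2.0)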
Since your question is tagged with apache-spark, you can get your desired output by combining the maps into an RDD[Map[String, Double]] as
scala> val rdd = sc.parallelize(Seq(Map("elem1"-> 2.0, "elem2"-> 4.0, "elem3"-> 3.0),Map("elem1"-> 4.0, "elem2"-> 1.0, "elem3"-> 1.0),Map("elem1"-> 3.0, "elem2"-> 10.0, "elem3"-> 2.0)))
rdd: org.apache.spark.rdd.RDD[scala.collection.immutable.Map[String,Double]] = ParallelCollectionRDD[1] at parallelize at <console>:24
Then you can use flatMap to flatten the map entries into individual rows, group them by key with groupBy, sum the grouped values, and divide by the number of grouped entries. You should get your desired output as:
scala> rdd.flatMap(row => row).groupBy(kv => kv._1).mapValues(values => values.map(value => value._2).sum/values.size)
res0: org.apache.spark.rdd.RDD[(String, Double)] = MapPartitionsRDD[5] at mapValues at <console>:27
scala> res0.foreach(println)
(elem2,5.0)
(elem3,2.0)
(elem1,3.0)
Hope the answer is helpful.
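A hedged Spark-side alternative (not from the original answers): groupBy collects all values for a key in memory, so for larger data you could keep running (sum, count) pairs with reduceByKey instead; the names here are illustrative:
val averaged = rdd
  .flatMap(row => row)                                        // RDD[(String, Double)]
  .mapValues(v => (v, 1L))                                    // pair each value with a count of 1
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .mapValues { case (sum, count) => sum / count }             // average per key
averaged.collect().foreach(println)
// (elem1,3.0), (elem2,5.0), (elem3,2.0) in some order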

simple matrix multiplication in Spark

I am struggling with some very basic Spark code. I would like to define a matrix x with 2 columns. This is what I have tried:
scala> val s = breeze.linalg.linspace(-3,3,5)
s: breeze.linalg.DenseVector[Double] = DenseVector(-3.0, -1.5, 0.0, 1.5, 3.0) // in this case I want s to be both column 1 and column 2 of x
scala> val ss = s.toArray ++ s.toArray
ss: Array[Double] = Array(-3.0, -1.5, 0.0, 1.5, 3.0, -3.0, -1.5, 0.0, 1.5, 3.0)
scala> import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
scala> val mat = new RowMatrix(ss, 5, 2)
<console>:17: error: type mismatch;
found : Array[Double]
required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
val mat = new RowMatrix(ss, 5, 2)
I do not understand how I can get the right transformation in order to pass the values to the distributed matrix.
EDIT:
Maybe I have been able to solve it:
scala> val s = breeze.linalg.linspace(-3,3,5)
s: breeze.linalg.DenseVector[Double] = DenseVector(-3.0, -1.5, 0.0, 1.5, 3.0)
scala> val ss = s.toArray ++ s.toArray
ss: Array[Double] = Array(-3.0, -1.5, 0.0, 1.5, 3.0, -3.0, -1.5, 0.0, 1.5, 3.0)
scala> val x = new breeze.linalg.DenseMatrix(5, 2, ss)
x: breeze.linalg.DenseMatrix[Double] =
-3.0 -3.0
-1.5 -1.5
0.0 0.0
1.5 1.5
3.0 3.0
scala> val xDist = sc.parallelize(x.toArray)
xDist: org.apache.spark.rdd.RDD[Double] = ParallelCollectionRDD[0] at parallelize at <console>:18
Something like this. This typechecks, but for some reason won't run in my Scala worksheet.
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.linalg.distributed._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
val conf = new SparkConf().setAppName("spark-scratch").setMaster("local")
val sc= new SparkContext(conf)
// the values for the column in each row
val col = List(-3.0, -1.5, 0.0, 1.5, 3.0) ;
// make two rows of the column values, transpose it,
// make Vectors of the result
val t = List(col,col).transpose.map(r=>Vectors.dense(r.toArray))
// make an RDD from the resultant sequence of Vectors, and
// make a RowMatrix from that.
val rm = new RowMatrix(sc.makeRDD(t));
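Since the original goal was matrix multiplication, a hedged follow-up sketch using the rm built above: RowMatrix.multiply takes a local mllib Matrix (here an arbitrary 2x2 example) and returns another distributed RowMatrix:
// Matrices.dense is column-major, so this local matrix is
//   1.0  3.0
//   2.0  4.0
val local = Matrices.dense(2, 2, Array(1.0, 2.0, 3.0, 4.0))
val product = rm.multiply(local)          // a 5x2 RowMatrix
product.rows.collect().foreach(println)   // one Vector per row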