Merge List[List[_]] conditionally in Scala

I want to merge a List[List[Double]] based on the values of the elements in the inner Lists. Here's what I have so far:
// inner Lists are (timestamp, ID, measurement)
val data = List(List(60, 0, 3.4), List(60, 1, 2.5), List(120, 0, 1.1),
  List(180, 0, 5.6), List(180, 1, 4.4), List(180, 2, 6.7))
data
  .foldLeft(List[List[Double]]())(
    (ret, ll) =>
      // if this is the first list, just add it to the return val
      if (ret.isEmpty) {
        List(ll)
      // if the timestamps match, add a new (ID, measurement) pair to this inner list
      } else if (ret(0)(0) == ll(0)) {
        ((ret(0) :+ ll(1)) :+ ll(2)) :: ret.drop(1)
      // if this is a new timestamp, add it to the beginning of the return val
      } else {
        ll :: ret
      }
  )
This works, but it doesn't smell optimal to me (especially the right-additions ':+'). For my use case, I have a pretty big (~25,000 inner Lists) List of elements, which are themselves all length-3 Lists. At most, there will be a fourfold degeneracy, because the inner lists are List(timestamp, ID, measurement) groups, and there are only four unique IDs. Essentially, I want to smush together all of the measurements that have the same timestamps.
Does anyone see a more optimal way of doing this?
I actually start with a List[Double] of timestamps and a List[Double] of measurements for each of the four IDs, if there's a better way of starting from that point.
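Starting from that representation, one hedged sketch (the names timestampsById and measurementsById and their exact shapes are assumptions, since the question doesn't show that data) is to tag each (timestamp, measurement) pair with its ID and group by timestamp directly, skipping the List[List[Double]] intermediate entirely:

```scala
// Hypothetical starting point: for each ID, a list of timestamps and
// a parallel list of measurements (IDs may cover different timestamps).
val timestampsById: Map[Int, List[Double]] = Map(
  0 -> List(60, 120, 180),
  1 -> List(60, 180),
  2 -> List(180)
)
val measurementsById: Map[Int, List[Double]] = Map(
  0 -> List(3.4, 1.1, 5.6),
  1 -> List(2.5, 4.4),
  2 -> List(6.7)
)

// Build (timestamp, id, measurement) triples, then group by timestamp.
val merged: Map[Double, List[(Int, Double)]] =
  timestampsById.toList
    .flatMap { case (id, ts) =>
      ts.zip(measurementsById(id)).map { case (t, m) => (t, id, m) }
    }
    .groupBy(_._1)
    .map { case (t, triples) => t -> triples.map(tr => (tr._2, tr._3)) }
```

This keeps the (ID, measurement) pairs as tuples rather than flattening them into one list, which may or may not suit the downstream code.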

Here is a slightly shorter way to do it:
data.
  groupBy(_(0)).
  mapValues(_.flatMap(_.tail)).
  toList.
  map(kv => kv._1 :: kv._2)
The result is exactly the same as what your algorithm produces (this version's output first, then yours):
List(List(180.0, 0.0, 5.6, 1.0, 4.4, 2.0, 6.7), List(120.0, 0.0, 1.1), List(60.0, 0.0, 3.4, 1.0, 2.5))
List(List(180.0, 0.0, 5.6, 1.0, 4.4, 2.0, 6.7), List(120.0, 0.0, 1.1), List(60.0, 0.0, 3.4, 1.0, 2.5))
Explanation:
group by timestamp
in the grouped values, drop the redundant timestamps, and flatten to single list
tack the timestamp back onto the flat list of ids-&-measurements
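One caveat not in the original answer: on Scala 2.13+, the strict mapValues on a Map is deprecated in favor of going through view. A sketch of the same pipeline in 2.13 style, using the question's data:

```scala
val data = List(List(60, 0, 3.4), List(60, 1, 2.5), List(120, 0, 1.1),
  List(180, 0, 5.6), List(180, 1, 4.4), List(180, 2, 6.7))

// Same grouping as above, but via .view.mapValues (Scala 2.13+)
// to avoid the deprecated Map#mapValues; .toList forces the view.
val merged: List[List[Double]] =
  data
    .groupBy(_(0))
    .view
    .mapValues(_.flatMap(_.tail))
    .toList
    .map { case (t, rest) => t :: rest }
```

Note that groupBy returns an unordered Map, so the order of the outer list is unspecified either way; sort by the head timestamp afterwards if order matters.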

Here is a possibility:
input
  .groupBy(_(0))
  .map { case (tstp, values) => tstp :: values.flatMap(_.tail) }
The idea is just to group the inner lists by their first element and then flatten the resulting values, which returns:
List(List(180.0, 0.0, 5.6, 1.0, 4.4, 2.0, 6.7), List(120.0, 0.0, 1.1), List(60.0, 0.0, 3.4, 1.0, 2.5))

What about representing your measurements with a case class?
case class Measurement(timestamp: Int, id: Int, value: Double)
val measurementData = List(
  Measurement(60, 0, 3.4), Measurement(60, 1, 2.5),
  Measurement(120, 0, 1.1), Measurement(180, 0, 5.6),
  Measurement(180, 1, 4.4), Measurement(180, 2, 6.7))
Note that the fold should collect same-timestamp measurements together; with a flat List[Measurement] accumulator, replacing the head would silently drop data, so a List[List[Measurement]] accumulator works better:
measurementData.foldLeft(List[List[Measurement]]())({
  // first measurement starts the first group
  case (Nil, m) => List(List(m))
  // same timestamp as the current group: add the measurement to it
  case ((x :: xs) :: rest, m) if x.timestamp == m.timestamp => (m :: x :: xs) :: rest
  // new timestamp: start a new group
  case (groups, m) => List(m) :: groups
})

Related

Update specific indices of Seq by another Seq in Scala

I have a first seq, for example:
val s: Seq[Double] = List.fill(6)(0.0)
and a subsequence of the indices of s:
val subInd: Seq[Int] = List(2, 4, 5)
Now, what I want to do is update s at positions 2, 4 and 5 with the values of another Seq, which has the same length as subInd:
val t: Seq[Double] = List(5.0, 6.0, 7.0)
such that:
updateBySeq(s, subInd, t) // List(0.0, 0.0, 5.0, 0.0, 6.0, 7.0)
I have searched on this site and found Update multiple values in a sequence where the second answer comes close to the functionality I want to have.
However, the difference is, that the function provided would update s on the indices contained in subInd by one value. I, on the other hand, would want them to correspond to multiple, unique values in a third Seq t.
I have tried various things, like using recursion and ListBuffers, instead of Lists, to incrementally update the elements of s, but either they left s unchanged or I got an error because I violated some immutability constraint.
This should work:
def updateListByIndexes[T](data: List[T])(indexes: List[Int], update: List[T]): List[T] = {
  val updateMap = (indexes lazyZip update).toMap
  data.iterator.zipWithIndex.map {
    case (elem, idx) => updateMap.getOrElse(key = idx, default = elem)
  }.toList
}
Which you can use like this:
val s = List.fill(6)(0.0)
// s: List[Double] = List(0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
val subInd = List(2, 4, 5)
// subInd: List[Int] = List(2, 4, 5)
val t = List(5.0, 6.0, 7.0)
// t: List[Double] = List(5.0, 6.0, 7.0)
updateListByIndexes(s)(subInd, t)
// res: List[Double] = List(0.0, 0.0, 5.0, 0.0, 6.0, 7.0)
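An alternative sketch not in the original answer: fold the index/value pairs over List#updated. Each step copies the list prefix, so it is best for short index lists, but it reads very directly (updateBySeq is the name the question wished for):

```scala
// zip pairs each index with its replacement value; updated is the
// immutable "set element at index" on List, returning a new list.
def updateBySeq[T](data: List[T], indexes: List[Int], update: List[T]): List[T] =
  indexes.zip(update).foldLeft(data) {
    case (acc, (i, v)) => acc.updated(i, v)
  }

val s = List.fill(6)(0.0)
val updated = updateBySeq(s, List(2, 4, 5), List(5.0, 6.0, 7.0))
// updated: List(0.0, 0.0, 5.0, 0.0, 6.0, 7.0)
```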

How to Sum a part of a list in RDD

I have an RDD of (key, list) pairs, and I would like to sum part of each list, producing (key, element2 + element3). Given
(1, List(2.0, 3.0, 4.0, 5.0)), (2, List(1.0, -1.0, -2.0, -3.0))
the output should look like this:
(1, 7.0), (2, -3.0)
Thanks
You can map and index into the second part of each tuple:
yourRddOfTuples.map(tuple => {val list = tuple._2; list(1) + list(2)})
Update after your comment, convert it to Vector:
yourRddOfTuples.map(tuple => {val vs = tuple._2.toVector; vs(1) + vs(2)})
Or if you do not want to use conversions:
yourRddOfTuples.map(_._2.drop(1).take(2).sum)
This takes the second element of each tuple (.map(_._2)), skips its first list element (.drop(1)), takes the next two (.take(2), possibly fewer if the list is shorter), and sums them (.sum).
You can map the key-list pair to obtain the 2nd and 3rd list elements as follows:
val rdd = sc.parallelize(Seq(
  (1, List(2.0, 3.0, 4.0, 5.0)),
  (2, List(1.0, -1.0, -2.0, -3.0))
))
rdd.map { case (k, l) => (k, l(1) + l(2)) }.collect
// res1: Array[(Int, Double)] = Array((1,7.0), (2,-3.0))
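Since the function passed to map is plain Scala, the same logic can be sanity-checked without a Spark cluster on an ordinary Seq (sc above is assumed to be an existing SparkContext):

```scala
val pairs = Seq(
  (1, List(2.0, 3.0, 4.0, 5.0)),
  (2, List(1.0, -1.0, -2.0, -3.0))
)

// Same function as in rdd.map: indices 1 and 2 are the second and
// third elements of each list.
val sums = pairs.map { case (k, l) => (k, l(1) + l(2)) }
// sums: List((1,7.0), (2,-3.0))
```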

Displaying output under a certain format

I'm quite new to Scala and Spark, and had some questions about displaying results in output file.
I actually have a Map in which each key is associated with a list of lists (Map[Int, List[List[Double]]]), such as:
2 -> List(List(x1,x2,x3), List(y1,y2,y3), ...)
I am supposed to display for each key the values inside the lists of lists, such as:
2 x1,x2,x3
2 y1,y2,y3
1 z1,z2,z3
and so on.
When I use the saveAsTextFile function, it doesn't give me what I want in the output. Does anybody know how I can do it?
EDIT :
This is one of my functions:
def PrintCluster(vectorsByKey: Map[Int, List[Double]], vectCentroidPairs: Map[Int, Int]): Map[Int, List[Double]] = {
  var vectorsByCentroid: Map[Int, List[Double]] = Map()
  val SortedCentroid = vectCentroidPairs.groupBy(_._2).mapValues(x => x.map(_._1).toList).toSeq.sortBy(_._1).toMap
  SortedCentroid.foreach { case (centroid, vect) =>
    val nbVectors = vect.length
    for (i <- 0 to nbVectors - 1) {
      val vectValues = vectorsByKey(vect(i))
      println(centroid + " " + vectValues)
      vectorsByCentroid += (centroid -> vectValues)
    }
  }
  vectorsByCentroid
}
I know it's wrong, because I can only associate one List with each unique key, which is why the Map ends up holding a single List per key. I thought that to use the saveAsTextFile function I necessarily had to use a Map structure, but I don't really know.
Create a sample RDD matching your input data:
val rdd: RDD[Map[Int, List[List[Double]]]] = spark.sparkContext.parallelize(
  Seq(Map(
    2 -> List(List(-4.4, -2.0, 1.5), List(-3.3, -5.4, 3.9), List(-5.8, -3.3, 2.3), List(-5.2, -4.0, 2.8)),
    1 -> List(List(7.3, 1.0, -2.0), List(9.8, 0.4, -1.0), List(7.5, 0.3, -3.0), List(6.1, -0.5, -0.6), List(7.8, 2.2, -0.7), List(6.6, 1.4, -1.1), List(8.1, -0.0, 2.7)),
    3 -> List(List(-3.0, 4.0, 1.4), List(-4.0, 3.9, 0.8), List(-1.4, 4.3, -0.5), List(-1.6, 5.2, 1.0))
  ))
)
Transform the RDD[Map[Int, List[List[Double]]]] to RDD[(Int, String)]:
val result: RDD[(Int, String)] = rdd.flatMap(i => {
  i.map {
    case (x, y) => y.map(list => (x, list.mkString(" ")))
  }
}).flatMap(z => z)
result.foreach(println)
result.saveAsTextFile("location")
Printing a Map[Int, List[List[Double]]] in the wanted format is simple: first convert it to a list, then apply flatMap. Using the data supplied in a comment:
val map: Map[Int, List[List[Double]]] = Map(
2 -> List(List(-4.4, -2.0, 1.5), List(-3.3, -5.4, 3.9), List(-5.8, -3.3, 2.3), List(-5.2, -4.0, 2.8)),
1 -> List(List(7.3, 1.0, -2.0), List(9.8, 0.4, -1.0), List(7.5, 0.3, -3.0), List(6.1, -0.5, -0.6), List(7.8, 2.2, -0.7), List(6.6, 1.4, -1.1), List(8.1, -0.0, 2.7)),
3 -> List(List(-3.0, 4.0, 1.4), List(-4.0, 3.9, 0.8), List(-1.4, 4.3, -0.5), List(-1.6, 5.2, 1.0))
)
val list = map.toList.flatMap(t => t._2.map((t._1, _)))
val result = for (t <- list) yield t._1 + "\t" + t._2.mkString(",")
// Saving the result to file
import java.io._
val pw = new PrintWriter(new File("fileName.txt"))
result.foreach(line => pw.println(line))
pw.close()
Will print out:
2 -4.4,-2.0,1.5
2 -3.3,-5.4,3.9
2 -5.8,-3.3,2.3
2 -5.2,-4.0,2.8
1 7.3,1.0,-2.0
1 9.8,0.4,-1.0
1 7.5,0.3,-3.0
1 6.1,-0.5,-0.6
1 7.8,2.2,-0.7
1 6.6,1.4,-1.1
1 8.1,-0.0,2.7
3 -3.0,4.0,1.4
3 -4.0,3.9,0.8
3 -1.4,4.3,-0.5
3 -1.6,5.2,1.0

Sort list of tuple in descending order with category alternating condition

Given a list of tuples:
val mylist = List(("orange", 0.9, 1), ("apple", 0.8, 1), ("mellon", 0.7, 1),
  ("car", 0.5, 2), ("truck", 0.5, 2),
  ("tablet", 0.3, 3))
I would like to sort them in descending order with respect to the second element of the tuple. However, I would like to pick them by category, one at a time (third element) alternatively. The output should be the following list:
("orange", 0.9, 1)
("car", 0.5, 2)
("tablet", 0.3, 3)
("apple", 0.8, 1)
("truck", 0.5, 2)
("mellon", 0.7, 1)
What would be the functional way of doing it in Scala?
Try:
mylist.groupBy(_._3) // group by category
.toList
.sortBy(_._1) // sort by asc category
.map(_._2.sortBy(-_._2)) // drop the category key + sort each group by desc rank
.flatMap(_.zipWithIndex)
.sortBy(_._2) // sort by index (stable sort)
.map(_._1) // drop the index
> res: List[(String, Double, Int)] = List((orange,0.9,1), (car,0.5,2), (tablet,0.3,3), (apple,0.8,1), (truck,0.5,2), (mellon,0.7,1))

how to compare to previous values in Seq[Double]

I am new to functional programming. I have a Seq[Double] and I'd like to check, for each value, whether it is higher (1), lower (-1) or equal (0) compared to the previous value, like:
val g = Seq(0.1, 0.3, 0.5, 0.5, 0.5, 0.3)
and I'd like to have a result like this (one comparison per adjacent pair, so one element fewer than g):
val result = Seq(1, 1, 0, 0, -1)
is there a more concise way than:
val g = Seq(0.1, 0.3, 0.5, 0.5, 0.5, 0.3)
g.sliding(2).toList.map(xs =>
  if (xs(0) == xs(1)) {
    0
  } else if (xs(0) > xs(1)) {
    -1
  } else {
    1
  }
)
Use compare:
g.sliding(2).map{ case Seq(x, y) => y compare x }.toList
compare is added by an enrichment trait called OrderedProxy
That's rather concise in my opinion but I'd make it a function and pass it into map to make it more readable. I used pattern matching and guards.
//High, low, equal
scala> def hlo(x: Double, y: Double): Int = y - x match {
| case 0.0 => 0
| case x if x < 0.0 => -1
| case x if x > 0.0 => 1
| }
hlo: (x: Double, y: Double)Int
scala> g.sliding(2).map(xs => hlo(xs(0), xs(1))).toList
res9: List[Int] = List(1, 1, 0, 0, -1)
I agree with Travis Brown's comment from above so am proposing it as an answer.
Reversing the order of the values in the zip, just to match the order of g. This has the added benefit of using tuples instead of a sequence so no pattern matching is needed.
(g, g.tail).zipped.toList.map(t => t._2 compare t._1)
res0: List[Int] = List(1, 1, 0, 0, -1)
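A final note not in the original answers: Tuple2#zipped has been deprecated since Scala 2.13 in favor of lazyZip. The same idea also works with a plain zip on any Scala version:

```scala
val g = Seq(0.1, 0.3, 0.5, 0.5, 0.5, 0.3)

// Pair each element with its successor, then compare the successor
// to its predecessor; on 2.13+, g.lazyZip(g.tail) avoids building
// the intermediate Seq of pairs.
val result = g.zip(g.tail).map { case (prev, next) => next compare prev }
// result: Seq(1, 1, 0, 0, -1)
```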