Map within filter in Spark - scala

How can I filter within a mapping?
Example:
val test1 = sc.parallelize(Array(('a', (1, Some(4))), ('b', (2, Some(5))), ('c', (3, Some(6))), ('d', (0, None))))
What I want:
Array(('a', (1, Some(4))), ('b', (2, Some(5))), ('c', (3, Some(6))), ('d', (613, None)))
What I tried (I've changed the 0 to 613):
val test2 = test1.filter(value => value._2._1 == 0).mapValues(value => (613, value._2))
But it returns only:
Array(('d', (613, None)))

Use map with pattern matching:
test1.map {
case (x, (0, y)) => (x, (613, y))
case z => z
}.collect
// res2: Array[(Char, (Int, Option[Int]))] = Array((a,(1,Some(4))), (b,(2,Some(5))), (c,(3,Some(6))), (d,(613,None)))

test1.map{
case (a, (0, b)) => (a, (613, b))
case other => other
}
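To see why the filter-based attempt loses rows while the pattern-matching map keeps them, here is a minimal sketch on a plain Scala collection. It assumes a local Seq as a stand-in for the RDD (no Spark dependency); the names SparkMapSketch, data, viaFilter and viaMap are illustrative only.

```scala
object SparkMapSketch {
  // Local stand-in for the RDD contents (assumption: plain Seq instead of sc.parallelize).
  val data: Seq[(Char, (Int, Option[Int]))] =
    Seq(('a', (1, Some(4))), ('b', (2, Some(5))), ('c', (3, Some(6))), ('d', (0, None)))

  // filter first: every row whose first value is not 0 is discarded before the update.
  val viaFilter: Seq[(Char, (Int, Option[Int]))] =
    data.filter { case (_, (n, _)) => n == 0 }.map { case (k, (_, opt)) => (k, (613, opt)) }

  // map with a fallback case: matching rows are rewritten, all others pass through unchanged.
  val viaMap: Seq[(Char, (Int, Option[Int]))] =
    data.map {
      case (k, (0, opt)) => (k, (613, opt))
      case other         => other
    }
}
```

The same two-case function should carry over to an RDD's map, since it only relies on ordinary pattern matching.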

Related

How to group a list of tuples sorted by first element into two lists containing overlapping and non overlapping tuples

I have a sorted list of tuples (sorted by first element), e.g.
[(1, 6), (5, 9), (6, 8), (11, 12), (16, 19)]
I need to split the list into a list of overlapping and a list of non overlapping tuples. So the output for the above list would be
overlapping: [(1, 6), (5, 9), (6, 8)]
non overlapping: [(11, 12), (16, 19)]
I am trying to use foldLeft but not sure if it's possible that way
.foldLeft(List[(Long, Long)]()){(list, tuple) => list match {
case Nil => List(tuple)
case head :: tail => if (head._2 >= tuple._1) {
// Not sure what should my logic be
} else {
// Not sure
}
}}
Input: [(1, 6) (5, 9) (6, 8) (11, 12) (16, 19)]
Output: [(1, 6), (5, 9), (6, 8)] and [(11, 12), (16, 19)]
Here's what I've understood: you want to find each tuple in the input whose Longs are the bounds of a range (so I can use Range, by the way) such that this range doesn't contain any Long from another tuple in the input.
Here's my suggestion:
Seq((1L, 6L), (5L, 9L), (6L, 8L), (11L, 12L), (16L, 19L))
.map { case (start, end) => start to end }
.foldLeft(Set[(Long, Long)]() -> Set[(Long, Long)]()) {
case ((overlapping, nonoverlapping), range) =>
(overlapping ++ nonoverlapping).find { case (start, end) =>
range.contains(start) || range.contains(end) || (start to end).containsSlice(range)
}.fold(overlapping -> (nonoverlapping + (range.start -> range.end)))(matchedTuple =>
(overlapping + (matchedTuple, range.start -> range.end), nonoverlapping - matchedTuple)
)
}
It may not work for tuples like (10, 0), because start to end is then an empty range; you have to decide how to handle such edge cases if they matter to you.
Hope it helps.
I agree with Dima that this question is unclear. It's important to note that the approach above will also fail because you return a single list, not one list of overlapping intervals and one of non-overlapping intervals.
A possible approach to this problem -- especially if you're set on using foldLeft -- would be to do something like this:
ls.foldLeft((List[(Int, Int)](), List[(Int, Int)]()))((a, b) => (a, b) match {
case ((Nil, _), (h1, t1)) => (a._1 ::: List((h1, t1)), a._2)
case ((head :: tail, _), (h2, t2)) if head._2 >= h2 => (a._1 ::: List((h2, t2)), a._2)
case ((head :: tail, _), (h2, t2)) => (a._1, a._2 ::: List((h2, t2)))
})
Of course, if we don't address the problem of having several non-overlapping subsets of overlapping intervals, this solution also fails.
I'd say, find overlapping ones first, and then compute the rest. This will do it in linear time.
import scala.annotation.tailrec
@tailrec
def findOverlaps(
ls: List[(Int, Int)],
boundary: Int = Int.MinValue,
out: List[(Int, Int)] = Nil
): List[(Int, Int)] = ls match {
case (a, b) :: tail if a < boundary =>
findOverlaps(tail, b max boundary, (a, b) :: out)
case _ :: Nil | Nil => out.reverse
case (a, b) :: (c, d) :: tail if b > c =>
findOverlaps(ls.tail, b max boundary, (a, b) :: out)
case _ :: tail => findOverlaps(tail, boundary, out)
}
val overlaps = findOverlaps(ls)
val nonOverlaps = ls.filterNot(overlaps.toSet)
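For the case raised above of several separate groups of overlapping intervals, a foldLeft can first chain the sorted intervals into clusters and then split the clusters by size. This is only a sketch under assumptions: the input is sorted by start, an interval that starts exactly where the running cluster ends counts as overlapping (as in the question's >= test), and the names OverlapSketch, clusters and split are made up for illustration.

```scala
object OverlapSketch {
  type Interval = (Int, Int)

  // Chain intervals (sorted by start) into clusters. Each accumulator entry keeps
  // the largest end seen so far in the cluster plus its members in reverse order.
  def clusters(sorted: List[Interval]): List[List[Interval]] =
    sorted.foldLeft(List.empty[(Int, List[Interval])]) {
      // Next interval starts at or before the current cluster's end: extend the cluster.
      case ((maxEnd, members) :: rest, iv @ (start, end)) if start <= maxEnd =>
        (maxEnd max end, iv :: members) :: rest
      // Otherwise open a new cluster.
      case (acc, iv @ (_, end)) =>
        (end, List(iv)) :: acc
    }.reverse.map(_._2.reverse)

  // Clusters with more than one member are overlapping; singletons are not.
  def split(sorted: List[Interval]): (List[Interval], List[Interval]) = {
    val (multi, single) = clusters(sorted).partition(_.size > 1)
    (multi.flatten, single.flatten)
  }
}
```

Calling OverlapSketch.split on the question's input returns the overlapping list first and the non-overlapping list second, in a single pass plus one partition.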

scala combining multiple sequences

I have a couple of lists:
val aa = Seq(1,2,3,4)
val bb = Seq(Seq(2.0,3.0,4.0,5.0), Seq(1.0,2.0,3.0,4.0))
val cc = Seq("a", "B")
And want to combine them in the desired format of:
(1, 2.0, a), (2, 3.0, a), (3, 4.0, a), (4, 5.0, a), (1, 1.0, b), (2, 2.0, b), (3, 3.0, b), (4, 4.0, b)
but my combination of zip and flatMap
(aa, bb,cc).zipped.flatMap{
case (a, b,c) => {
b.map(b1 => (a,b1,c))
}
}
is only producing
(1,2.0,a), (1,3.0,a), (1,4.0,a), (1,5.0,a), (2,1.0,B), (2,2.0,B), (2,3.0,B), (2,4.0,B)
In Java I would just iterate over bb and then, in a nested loop, iterate over the values.
What do I need to change to get the data in the desired format using neat functional scala?
How about this:
for {
(bs, c) <- bb zip cc
(a, b) <- aa zip bs
} yield (a, b, c)
Produces:
List(
(1,2.0,a), (2,3.0,a), (3,4.0,a), (4,5.0,a),
(1,1.0,B), (2,2.0,B), (3,3.0,B), (4,4.0,B)
)
I doubt this could be made any more neat & functional.
Not exactly pretty to read but here is an option:
bb
.map(b => aa.zip(b)) // List(List((1,2.0), (2,3.0), (3,4.0), (4,5.0)), List((1,1.0), (2,2.0), (3,3.0), (4,4.0)))
.zip(cc) // List((List((1,2.0), (2,3.0), (3,4.0), (4,5.0)),a), (List((1,1.0), (2,2.0), (3,3.0), (4,4.0)),B))
.flatMap{ case (l, c) => l.map(t => (t._1, t._2, c)) } // List((1,2.0,a), (2,3.0,a), (3,4.0,a), (4,5.0,a), (1,1.0,B), (2,2.0,B), (3,3.0,B), (4,4.0,B))
Another approach using collect and map
scala> val result = bb.zip(cc).collect{
case bc => (aa.zip(bc._1).map(e => (e._1,e._2, bc._2)))
}.flatten
result: Seq[(Int, Double, String)] = List((1,2.0,a), (2,3.0,a), (3,4.0,a), (4,5.0,a), (1,1.0,B), (2,2.0,B), (3,3.0,B), (4,4.0,B))
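The nested-loop idea from the question maps directly onto zip plus flatMap/map: the outer step pairs each inner sequence of bb with its label from cc, the inner step pairs aa with that sequence. The sketch below uses the question's data and shows the for-comprehension next to its hand-written flatMap/map equivalent; the object and value names are illustrative.

```scala
object CombineSketch {
  val aa = Seq(1, 2, 3, 4)
  val bb = Seq(Seq(2.0, 3.0, 4.0, 5.0), Seq(1.0, 2.0, 3.0, 4.0))
  val cc = Seq("a", "B")

  // Outer generator: one (inner sequence, label) pair per element of bb.
  // Inner generator: pair aa positionally with that inner sequence.
  val viaFor: Seq[(Int, Double, String)] =
    for {
      (bs, c) <- bb zip cc
      (a, b)  <- aa zip bs
    } yield (a, b, c)

  // The same computation written with explicit flatMap and map.
  val viaFlatMap: Seq[(Int, Double, String)] =
    bb.zip(cc).flatMap { case (bs, c) =>
      aa.zip(bs).map { case (a, b) => (a, b, c) }
    }
}
```

Note that zip truncates to the shorter side, so an inner sequence with fewer than four values would silently drop the trailing elements of aa.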

Looking for an FP ranking implementation which handles ties (i.e. equal values)

Starting from a sorted sequence of values, my goal is to assign a rank to each value, using identical ranks for equal values (aka ties):
Input: Vector(1, 1, 3, 3, 3, 5, 6)
Output: Vector((0,1), (0,1), (1,3), (1,3), (1,3), (2,5), (3,6))
A few type aliases for readability:
type Rank = Int
type Value = Int
type RankValuePair = (Rank, Value)
An imperative implementation using a mutable rank variable could look like this:
var rank = 0
val ranked1: Vector[RankValuePair] = for ((value, index) <- values.zipWithIndex) yield {
if ((index > 0) && (values(index - 1) != value)) rank += 1
(rank, value)
}
// ranked1: Vector((0,1), (0,1), (1,3), (1,3), (1,3), (2,5), (3,6))
To hone my FP skills, I was trying to come up with a functional implementation:
val ranked2: Vector[RankValuePair] = values.sliding(2).foldLeft((0 , Vector.empty[RankValuePair])) {
case ((rank: Rank, rankedValues: Vector[RankValuePair]), Vector(currentValue, nextValue)) =>
val newRank = if (nextValue > currentValue) rank + 1 else rank
val newRankedValues = rankedValues :+ (rank, currentValue)
(newRank, newRankedValues)
}._2
// ranked2: Vector((0,1), (0,1), (1,3), (1,3), (1,3), (2,5))
It is less readable, and – more importantly – is missing the last value (due to using sliding(2) on an odd number of values).
How could this be fixed and improved?
This works well for me:
// scala
val vs = Vector(1, 1, 3, 3, 3, 5, 6)
val rank = vs.distinct.zipWithIndex.toMap
val result = vs.map(i => (rank(i), i))
The same in Java 8 using Javaslang:
// java(slang)
Vector<Integer> vs = Vector(1, 1, 3, 3, 3, 5, 6);
Function<Integer, Integer> rank = vs.distinct().zipWithIndex().toMap(t -> t);
Vector<Tuple2<Integer, Integer>> result = vs.map(i -> Tuple(rank.apply(i), i));
The output of both variants is
Vector((0, 1), (0, 1), (1, 3), (1, 3), (1, 3), (2, 5), (3, 6))
*) Disclosure: I'm the creator of Javaslang
This is nice and concise but it assumes that your Values don't go negative. (Actually it just assumes that they can never start with -1.)
val vs: Vector[Value] = Vector(1, 1, 3, 3, 3, 5, 6)
val rvps: Vector[RankValuePair] =
vs.scanLeft((-1,-1)){ case ((r,p), v) =>
if (p == v) (r, v) else (r + 1, v)
}.tail
Edit: a modification that makes no assumptions, as suggested by @Kolmar.
vs.scanLeft((0,vs.headOption.getOrElse(0))){ case ((r,p), v) =>
if (p == v) (r, v) else (r + 1, v)
}.tail
Here's an approach with recursion, pattern matching and guards.
The interesting part is where the head and the head of the tail (h and ht respectively) are deconstructed from the list and a guard checks whether they are equal. Each case adjusts the rank accordingly and proceeds with the remaining part of the list.
def rank(xs: Vector[Value]): List[RankValuePair] = {
def rankR(xs: List[Value], acc: List[RankValuePair], rank: Rank): List[RankValuePair] = xs match{
case Nil => acc.reverse
case h :: Nil => rankR(Nil, (rank, h) :: acc, rank)
case h :: ht :: t if (h == ht) => rankR(xs.tail, (rank, h) :: acc, rank)
case h :: ht :: t if (h != ht) => rankR(xs.tail, (rank, h) :: acc, rank + 1)
}
rankR(xs.toList, List[RankValuePair](), 0)
}
Output:
scala> rank(xs)
res14: List[RankValuePair] = List((0,1), (0,1), (1,3), (1,3), (1,3), (2,5), (3,6))
This is a modification of the solution by @jwvh that doesn't make any assumptions about the values:
val vs = Vector(1, 1, 3, 3, 3, 5, 6)
vs.sliding(2).scanLeft(0, vs.head) {
case ((rank, _), Seq(a, b)) => (if (a != b) rank + 1 else rank, b)
}.toVector
Note that it would throw if vs is empty, so you'd have to use vs.headOption getOrElse 0, or check whether the input is empty beforehand: if (vs.isEmpty) Vector.empty else ...
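The empty-input caveat can also be handled by matching on headOption and scanning only the tail, which avoids both the sentinel value and the final .tail call. This is a hedged sketch, not any of the answers' exact code; RankSketch and denseRank are hypothetical names, and the input is assumed to be sorted already.

```scala
object RankSketch {
  // Assign ranks to an already sorted vector; equal neighbours share a rank.
  def denseRank(vs: Vector[Int]): Vector[(Int, Int)] =
    vs.headOption match {
      case None => Vector.empty // nothing to rank
      case Some(first) =>
        // Seed with (rank 0, first value), then walk the remaining values.
        vs.tail.scanLeft((0, first)) { case ((rank, prev), v) =>
          if (v == prev) (rank, v) else (rank + 1, v)
        }
    }
}
```

Because the seed is the first real element, the result has exactly as many entries as the input.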
import scala.annotation.tailrec
type Rank = Int
// defined type alias Rank
type Value = Int
// defined type alias Value
type RankValuePair = (Rank, Value)
// defined type alias RankValuePair
def rankThem(values: List[Value]): List[RankValuePair] = {
// Assumes that the "values" are sorted
@tailrec
def _rankThem(currentRank: Rank, currentValue: Value, ranked: List[RankValuePair], values: List[Value]): List[RankValuePair] = values match {
case value :: tail if value == currentValue => _rankThem(currentRank, value, (currentRank, value) +: ranked, tail)
case value :: tail if value > currentValue => _rankThem(currentRank + 1, value, (currentRank + 1, value) +: ranked, tail)
case Nil => ranked.reverse
}
_rankThem(0, Int.MinValue, List.empty[RankValuePair], values.sorted)
}
// rankThem: rankThem[](val values: List[Value]) => List[RankValuePair]
val valueList = List(1, 1, 3, 3, 5, 6)
// valueList: List[Int] = List(1, 1, 3, 3, 5, 6)
val rankValueList = rankThem(valueList)
// rankValueList: List[RankValuePair] = List((1,1), (1,1), (2,3), (2,3), (3,5), (4,6))
val list = List(1, 1, 3, 3, 5, 6)
val result = list
.groupBy(identity)
.mapValues(_.size)
.toArray
.sortBy(_._1)
.zipWithIndex
.flatMap(tuple => List.fill(tuple._1._2)((tuple._2, tuple._1._1)))
result: Array[(Int, Int)] = Array((0,1), (0,1), (1,3), (1,3), (2,5), (3,6))
The idea is to use groupBy to find identical elements and count their occurrences, then sort, then flatMap. I would say the time complexity is O(n log n): groupBy is O(n), sort is O(n log n), fl

How to merge adjacent, similar entries of a sorted `Stream` or `List`

Given a
large ( > 1,000,000 entries, don't expect it to fit into memory)
sorted (wrt. the first value of the tuple)
stream like
val ss = List( (1, "2.5"), (1, "5.0"), (2, "3.0"), (2, "4.0"), (2, "6.0"), (3, "1.0")).toStream
// just for demo
val xs = List( (1, "2.5"), (1, "5.0"), (2, "3.0"), (2, "4.0"), (2, "6.0"), (3, "1.0"))
I want to join adjacent entries such that the output of transformation becomes
List( (1, "2.5 5.0"), (2, "3.0 4.0 6.0"), (3, "1.0") )
The second tuple value is to be merged by some monoid function (string concatenation here)
Ideas / attempts / tries
groupBy
groupBy does not seem to be a valid alternative, as entries are collected in a map in memory.
scanLeft
val ss: Stream[(Int, String)] = List( (1, "2.5"), (1, "5.0"), (2, "3.0")).toStream
val transformed = ss.scanLeft(Joiner(0, "a"))( (j, t) => {
j.x match {
case t._1 => j.copy(y = j.y + " " + t._2)
case _ => Joiner(t._1, t._2)
}
})
println(transformed.toList)
which ends up in
List(Joiner(0,a), Joiner(1,2.5), Joiner(1,2.5 5.0), Joiner(2,3.0))
(please ignore wrapping Joiner)
but I didn't find a way to get rid of the "incomplete" entries.
Emit true to mark the initial element of each run (when the value switches) rather than the final one; that's easy, right? Then you can just collect those entries that are followed by an initial one.
Something like this perhaps:
ss.scanLeft((0, "", true)) {
case ((a, str, _), (b, c)) if (str == "" || a == b) => (b, str + " " + c, false)
case (_, (b, c)) => (b, c.toString, true)
} .:+ (0, "", true)
.sliding(2)
.collect { case Seq(a, (_, _, true)) => (a._1, a._2) }
(note the .:+ thingy - it appends a "dummy" entry to the end of the stream, so that the last "real" element is also followed by the "true" entry, and does not get filtered out).
This seems to do okay.
def makeEm(s: Stream[(Int, String)]) = {
import Stream._
import scala.annotation.tailrec
@tailrec
def z(source: Stream[(Int, String)], curr: (Int, List[String]), acc: Stream[(Int, String)]): Stream[(Int, String)] = source match {
case Empty =>
Empty
case x #:: Empty =>
acc :+ (x._1 -> (x._2 :: curr._2).mkString(","))
case x #:: y #:: etc if x._1 != y._1 =>
val c = x._1 -> (x._2 :: curr._2).mkString(",")
z(y #:: etc, (y._1, List[String]()), acc :+ c)
case x #:: etc =>
z(etc, (x._1, x._2 :: curr._2), acc)
}
z(s, (0, List()), Stream())
}
Tests:
val ss = List( (1, "2.5"), (1, "5.0"), (2, "3.0"), (2, "4.0"), (2, "6.0"), (3, "1.0")).toStream
makeEm(ss).toList.mkString(",")
val s = List().toStream
makeEm(s).toList.mkString(",")
val ss2 = List( (1, "2.5"), (1, "5.0")).toStream
makeEm(ss2).toList.mkString(",")
val s3 = List((1, "2.5"),(2, "4.0"),(3, "1.0")).toStream
makeEm(s3).toList.mkString(",")
Output
ss: scala.collection.immutable.Stream[(Int, String)] = Stream((1,2.5), ?)
res0: String = (1,5.0,2.5),(2,6.0,4.0,3.0),(3,1.0)
s: scala.collection.immutable.Stream[Nothing] = Stream()
res1: String =
ss2: scala.collection.immutable.Stream[(Int, String)] = Stream((1,2.5), ?)
res2: String = (1,5.0,2.5)
s3: scala.collection.immutable.Stream[(Int, String)] = Stream((1,2.5), ?)
res3: String = (1,2.5),(2,4.0),(3,1.0)
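Since the question stresses that the data may not fit into memory, another option is to work on an Iterator and peek at the next element through a buffered iterator, so that only one run of equal keys is held at a time. This is a sketch under assumptions: the source can be exposed as an Iterator sorted by key, string concatenation with a separator stands in for the monoid operation, and MergeSketch and mergeAdjacent are invented names.

```scala
object MergeSketch {
  // Lazily merge runs of adjacent entries that share a key.
  // Assumes the input is sorted by key; values are joined in input order.
  def mergeAdjacent[K](it: Iterator[(K, String)], sep: String = " "): Iterator[(K, String)] = {
    val buffered = it.buffered // lets us look at the next element without consuming it
    new Iterator[(K, String)] {
      def hasNext: Boolean = buffered.hasNext
      def next(): (K, String) = {
        val (key, first) = buffered.next()
        var merged = first
        // Consume the rest of the run with the same key.
        while (buffered.hasNext && buffered.head._1 == key) {
          merged = merged + sep + buffered.next()._2
        }
        (key, merged)
      }
    }
  }
}
```

For example, MergeSketch.mergeAdjacent(xs.iterator).toList on the demo list xs from the question yields one entry per key; for very long runs a StringBuilder would avoid the repeated string copies.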

How does one replace the first matching item in a list in Scala?

Let's say you have:
List(('a', 1), ('b', 1), ('c', 1), ('b', 1))
and you want to replace the first ('b', 1) with ('b', 2), and you don't want it to (a) waste time evaluating past the first match and (b) update any further matching tuples.
Is there a relatively concise way of doing this in Scala (i.e., without taking the list apart and re-concatenating it)? Something like an imaginary function mapFirst that returns the list with the first matching value incremented:
testList.mapFirst { case ('b', num) => ('b', num + 1) }
You don't have to take the whole List apart, I guess (only until the element is found).
def replaceFirst[A](a : List[A], repl : A, replwith : A) : List[A] = a match {
case Nil => Nil
case head :: tail => if(head == repl) replwith :: tail else head :: replaceFirst(tail, repl, replwith)
}
The call for example:
replaceFirst(List(('a', 1), ('b', 1), ('c', 1), ('b', 1)), ('b', 1), ('b', 2))
Result:
List((a,1), (b,2), (c,1), (b,1))
A way with a partial function and implicits (which looks more like your mapFirst):
implicit class MyRichList[A](val list: List[A]) {
def mapFirst(func: PartialFunction[A, A]) = {
def mapFirst2[A](a: List[A], func: PartialFunction[A, A]): List[A] = a match {
case Nil => Nil
case head :: tail => if (func.isDefinedAt(head)) func.apply(head) :: tail else head :: mapFirst2(tail, func)
}
mapFirst2(list, func)
}
}
And use it like this:
List(('a', 1), ('b', 1), ('c', 1), ('b', 1)).mapFirst {case ('b', num) => ('b', num + 1)}
You can emulate such function relatively easily. The quickest (implementation-wise, not necessarily performance-wise) I could think of was something like this:
def replaceFirst[A](a: List[A], condition: A => Boolean, transform: A => A): List[A] = {
val cutoff = a.indexWhere(condition)
if (cutoff < 0) a // no match: return the list unchanged
else {
val (h, t) = a.splitAt(cutoff)
h ++ (transform(t.head) :: t.tail)
}
}
scala> replaceFirst(List(1,2,3,4,5),{x:Int => x%2==0}, { x:Int=> x*2 })
res4: List[Int] = List(1, 4, 3, 4, 5)
scala> replaceFirst(List(('a',1),('b',2),('c',3),('b',4)), {m:(Char,Int) => m._1=='b'},{m:(Char,Int) => (m._1,m._2*2)})
res6: List[(Char, Int)] = List((a,1), (b,4), (c,3), (b,4))
Using span to find the first matching element only. It shouldn't throw an exception even when no case is satisfied. Needless to say, you can specify as many cases as you like.
implicit class MyRichieList[A](val l: List[A]) {
def mapFirst(pf : PartialFunction[A, A]) =
l.span(!pf.isDefinedAt(_)) match {
case (x, Nil) => x
case (x, y :: ys) => (x :+ pf(y)) ++ ys
}
}
val testList = List(('a', 1), ('b', 1), ('c', 1), ('b', 1))
testList.mapFirst {
case ('b', n) => ('b', n + 1)
case ('a', 9) => ('z', 9)
}
// result --> List((a,1), (b,2), (c,1), (b,1))
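The recursive versions above build the result as head :: recursiveCall, which is not tail-recursive and could overflow the stack on very long lists. A hedged alternative is to carry the already-visited prefix in an accumulator; the sketch below assumes that returning the original list unchanged is the desired behaviour when nothing matches, and MapFirstSketch, mapFirst and loop are illustrative names.

```scala
import scala.annotation.tailrec

object MapFirstSketch {
  // Apply pf to the first element it is defined at; leave everything else untouched.
  def mapFirst[A](list: List[A])(pf: PartialFunction[A, A]): List[A] = {
    @tailrec
    def loop(rest: List[A], seen: List[A]): List[A] = rest match {
      case Nil => list // no element matched: return the input as-is
      case head :: tail if pf.isDefinedAt(head) =>
        // Rebuild: visited prefix (stored reversed) + replaced element + untouched suffix.
        seen.reverse ::: (pf(head) :: tail)
      case head :: tail => loop(tail, head :: seen)
    }
    loop(list, Nil)
  }
}
```

Usage mirrors the imaginary function from the question: MapFirstSketch.mapFirst(testList) { case ('b', n) => ('b', n + 1) }. The suffix after the match is shared with the original list rather than copied.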