Spark Scala Sequence Split on Integer Criteria

I am using Scala on Spark and need some help splitting a sequence of sets based on specific values within the sets.
Here is an example:
val sets = Array(
  Seq(Set("A", 15, 20), Set("B", 17, 21), Set("C", 22, 34)),
  Seq(Set("D", 15, 20), Set("E", 17, 21), Set("F", 21, 23), Set("G", 25, 34)))
I am trying to split each sequence within the array based on the criterion that the first integer of a set falls between the two integers of another set in the same sequence, and to return the character values of the grouped sets.
For the first sequence, the first set contains the integers 15 and 20 and the second set contains 17 and 21. Those two sets would be grouped together because 17 is between 15 and 20, while the third set would be left on its own.
In the second sequence, 15 and 20 overlap with 17 and 21, and 17 and 21 in turn overlap with 21 and 23; the last set would be left on its own.
Essentially I would like to have it return Set(A, B), Set(C), Set(D, E), Set(D, F), Set(G)
I realize this is not great phrasing, but if someone could give me a hand it would be much appreciated.

As zero323 noted, Set("A", 15, 20) should probably not be a set. I suggest converting it to a case class:
case class Item(name: String, start: Int, end: Int) {
  val range = Range.inclusive(start, end)
}
With this class, if you described your problem correctly it could be solved like this:
sets.map { seq =>
  seq.foldLeft(Vector[Vector[Item]]()) { (list, item) =>
    list.lastOption match {
      case Some(lastGroup) if lastGroup.last.range.contains(item.start) =>
        list.init :+ (lastGroup :+ item)
      case _ =>
        list :+ Vector(item)
    }
  }.map(l => l.map(i => i.name).toSet)
}.flatten
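For reference, here is a rough sketch of the question's input rewritten with the Item case class (the val name items is mine, not from the question), together with the result the fold above should produce. Note that under this grouping rule F lands in the same group as D and E, because 21 falls inside E's range:

val items = Array(
  Seq(Item("A", 15, 20), Item("B", 17, 21), Item("C", 22, 34)),
  Seq(Item("D", 15, 20), Item("E", 17, 21), Item("F", 21, 23), Item("G", 25, 34)))

// expected result of the fold above:
// Array(Set(A, B), Set(C), Set(D, E, F), Set(G))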

Related

Spark rdd split doesn't return last column?

I have the following data:
17|ABC|3|89|89|0|0|2|
17|DFD|3|89|89|0|0|2|
17|RFG|3|89|89|0|0|2|
17|TRF|3|89|89|0|0|2|
When I use the following code, I get only 8 elements instead of 9, since the last one doesn't contain any value. I can't use DataFrames because my CSV is not fixed; every line can have a different number of elements. How can I get the last column value even if it's null/None?
My current code:
data_rdd.filter(x => x contains '|').map{line => line.split('|')}.foreach(elem => {
  println("size of element ->" + elem.size)
  elem.foreach{elem =>
    println(elem)
  }
})
In both Scala and Java, split will not return any trailing empty strings by default. Instead, you can use a slightly different version of split that takes a second argument (an overload inherited from java.lang.String and described in the Java docs).
The method definition is:
split(String regex, int limit)
The second argument limits how many times the regex pattern is applied; passing a negative number applies it as many times as possible and keeps trailing empty strings.
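For example, on the sample row from the question, a quick REPL check shows the default call dropping the trailing empty string while a limit of -1 keeps it:

scala> "17|ABC|3|89|89|0|0|2|".split("\\|").length
res0: Int = 8

scala> "17|ABC|3|89|89|0|0|2|".split("\\|", -1).length
res1: Int = 9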
Therefore, change the code to use:
.map{line => line.split("\\|", -1)}
Note that this split function takes a regex and not a normal string or char.
You can split your string as below:
scala> "17|ABC|3|89|89|0|0|2|".split("\\|", -1)
res24: Array[String] = Array(17, ABC, 3, 89, 89, 0, 0, 2, "")
Updated code:
data_rdd.filter(x => x contains '|').map{line => line.split("\\|", -1)}.foreach(elem => {
  println("size of element ->" + elem.size)
  elem.foreach{elem =>
    println(elem)
  }
})

How to consecutive and non-consecutive list in scala?

val keywords = List("do", "abstract", "if")

val resMap = io.Source
  .fromFile("src/demo/keyWord.txt")
  .getLines()
  .zipWithIndex
  .foldLeft(Map.empty[String, Seq[Int]].withDefaultValue(Seq.empty[Int])) {
    case (m, (line, idx)) =>
      val subMap = line.split("\\W+")
        .toSeq                          //separate the words
        .filter(keywords.contains)      //keep only keywords
        .groupBy(identity)              //make a Map with the keyword as key
        .mapValues(_.map(_ => idx + 1)) //and a list of line numbers as value
        .withDefaultValue(Seq.empty[Int])
      keywords.map(kw => (kw, m(kw) ++ subMap(kw))).toMap
  }

println("keyword\t\tlines\t\tcount")
keywords.sorted.foreach { kw =>
  println(kw + "\t\t" +
    resMap(kw).distinct.mkString("[", ",", "]") + "\t\t" +
    resMap(kw).length)
}
This code is not mine and I don't own it ... I am using it for study purposes. However, I am still learning and I am stuck on collapsing consecutive line numbers: the word "if" appears on many lines, and when three or more consecutive line numbers appear they should be written with a dash in between, e.g. 20-22 rather than 20, 21, 22. How can I implement this? I just want to learn how it's done.
output:
keyword lines count
abstract [1] 1
do [6] 1
if [14,15,16,17,18] 5
But I want the result to be [14-18], because the word "if" appears on lines 14 through 18.
First off, I'll give the customary caution that SO isn't meant to be a place to crowdsource answers to homework or projects. I'll give you the benefit of the doubt that this isn't the case.
That said, I hope you gain some understanding about breaking down this problem from this suggestion:
1. Your existing implementation has nothing in place to determine whether the Int values are actually consecutive, so you will need to add some code that sorts the Ints returned from resMap(kw).distinct to set yourself up for the next steps. You can figure out how to do this.
2. You will then need to group the Ints by their consecutive nature. For example, (14,15,16,18,19,20,22) needs to be further grouped into ((14,15,16),(18,19,20),(22)). You can come up with your own algorithm for this.
3. Map over the outer collection (which is a Seq[Seq[Int]] at this point), with different handling depending on whether the length of the inner Seq is greater than 1. If it is, you can safely call head and last to get the Ints you need for rendering your range. Alternatively, and more idiomatically, you can use a for-comprehension that composes the values from headOption and lastOption to build the same range string. You said something about a length of 3 in your question, so you can adjust this step to meet that need as necessary.
4. Lastly, you now have a Seq[String] looking like ("14-16","18-20","22"), which you join together using a mkString call similar to the one you already have with the square brackets.
For reference, you should get further acquainted with the Scaladoc for the Seq trait:
https://www.scala-lang.org/api/2.12.8/scala/collection/Seq.html
Here's one way to go about it.
def collapseConsecutives(nums: Seq[Int]): List[String] =
  nums.foldRight((nums.last, List.empty[List[Int]])) {
    case (n, (prev, acc)) if prev - n == 1 => (n, (n :: acc.head) :: acc.tail)
    case (n, (_, acc))                     => (n, List(n) :: acc)
  }._2.map { ns =>
    if (ns.length < 3) ns.mkString(",")  //1 or 2 non-collapsables
    else s"${ns.head}-${ns.last}"        //3 or more, collapsed
  }
usage:
println(kw + "\t\t" +
  collapseConsecutives(resMap(kw).distinct).mkString("[", ",", "]") + "\t\t" +
  resMap(kw).length)
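For example, applied to the sample numbers from the grouping suggestion above, the function should produce something like:

scala> collapseConsecutives(Seq(14, 15, 16, 18, 19, 20, 22))
res0: List[String] = List(14-16, 18-20, 22)

Note that, as written, it assumes a non-empty input, since it calls nums.last up front.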

How to find the last occurrence of an element in a Scala List?

I have a List of Students from which I want to find the last matching student whose age is 23.
I know that the find() method gives us the first matching occurrence, as shown below:
case class Student(id: Int, age: Int)
val students = List(Student(1, 23), Student(2, 24), Student(3, 23))
val firstStudentWithAge23 = students.find(student => student.age == 23)
// Some(Student(1, 23))
This code gives me the first matching student. But I need the last matching student.
For now, I am using reverse method followed by find:
val lastStudentWithAge23 = students.reverse.find(student => student.age == 23)
// Some(Student(3,23))
This gives me the last matching student.
But this doesn't seem like a good approach, since the whole list has to be reversed first. How can I achieve this in a better, yet still functional, way?
As of Scala 2.13, you can use findLast to find the last element of a Seq satisfying a predicate, if one exists:
val students = List(Student(1, 23), Student(2, 24), Student(3, 23))
students.findLast(_.age == 23) // Option[Student] = Some(Student(3, 23))
chengpohi's and jwvh's answers work, but will traverse the list twice or more.
Here's a generic way to find the last occurrence that will only traverse the list once:
def findLast[A](la: List[A])(f: A => Boolean): Option[A] =
  la.foldLeft(Option.empty[A]) { (acc, cur) =>
    if (f(cur)) Some(cur)
    else acc
  }
We walk through the collection exactly once and always take the last element that matches our predicate f.
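Used with the students list from the question, it should behave the same as the built-in findLast:

findLast(students)(_.age == 23) // Some(Student(3,23))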
I think maybe you can achieve this by filter with lastOption, like:
list.filter(student => student.age == 23).lastOption
list.lift(list.lastIndexWhere(_.age == 23))
Indexing on a List isn't optimum, but it's an option.
Yet another approach, using partition, for instance as follows,
val (l, _) = list.partition(_.age == 23)
l.lastOption
which bisects the list into a tuple of those aged 23, in their order of occurrence in the original list, and the rest. From the first element of the tuple, take the last item.

Functional Programming way to calculate something like a rolling sum

Let's say I have a list of numerics:
val list = List(4,12,3,6,9)
For every element in the list, I need to find the rolling sum, i.e. the final output should be:
List(4, 16, 19, 25, 34)
Is there any transformation that allows us to take as input two elements of the list (the current and the previous) and compute based on both?
Something like map(initial)((curr,prev) => curr+prev)
I want to achieve this without maintaining any shared global state.
EDIT: I would like to be able to do the same kinds of computation on RDDs.
You may use scanLeft
list.scanLeft(0)(_ + _).tail
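For the example list from the question this gives the expected output; scanLeft produces the running totals including the initial 0, and tail drops that seed:

scala> List(4, 12, 3, 6, 9).scanLeft(0)(_ + _)
res0: List[Int] = List(0, 4, 16, 19, 25, 34)

scala> List(4, 12, 3, 6, 9).scanLeft(0)(_ + _).tail
res1: List[Int] = List(4, 16, 19, 25, 34)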
The cumSum method below should work for any RDD[N], where N has an implicit Numeric[N] available, e.g. Int, Long, BigInt, Double, etc.
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def cumSum[N : Numeric : ClassTag](rdd: RDD[N]): RDD[N] = {
  val num = implicitly[Numeric[N]]
  val nPartitions = rdd.partitions.length
  val partitionCumSums = rdd.mapPartitionsWithIndex((index, iter) =>
    if (index == nPartitions - 1) Iterator.empty
    else Iterator.single(iter.foldLeft(num.zero)(num.plus))
  ).collect
   .scanLeft(num.zero)(num.plus)
  rdd.mapPartitionsWithIndex((index, iter) =>
    if (iter.isEmpty) iter
    else {
      val start = num.plus(partitionCumSums(index), iter.next)
      iter.scanLeft(start)(num.plus)
    }
  )
}
It should be fairly straightforward to generalize this method to any associative binary operator with a "zero" (i.e. any monoid.) It is the associativity that is key for the parallelization. Without this associativity you're generally going to be stuck with running through the entries of the RDD in a serial fashion.
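As a rough sketch of that generalization (this is my own adaptation of cumSum above, not code from the original answer; the name cumFold and its parameters are made up), the Numeric can be replaced by an explicit zero and an associative operation:

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Hypothetical generalization of cumSum to any associative op with an identity
// (a monoid): zero is the identity element and op must be associative.
def cumFold[A: ClassTag](rdd: RDD[A])(zero: A)(op: (A, A) => A): RDD[A] = {
  val nPartitions = rdd.partitions.length
  // Reduce every partition except the last to a single value, then prefix-scan
  // those per-partition results on the driver.
  val partitionTotals = rdd.mapPartitionsWithIndex((index, iter) =>
    if (index == nPartitions - 1) Iterator.empty
    else Iterator.single(iter.foldLeft(zero)(op))
  ).collect
   .scanLeft(zero)(op)
  // Each partition then scans its own elements, seeded by the combined total
  // of all preceding partitions.
  rdd.mapPartitionsWithIndex((index, iter) =>
    if (iter.isEmpty) iter
    else {
      val start = op(partitionTotals(index), iter.next)
      iter.scanLeft(start)(op)
    }
  )
}

// e.g. cumFold(rdd)(0)(_ + _) should match cumSum(rdd) for an RDD[Int]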
I don't know which functionalities are supported by Spark RDDs, so I am not sure if this satisfies your conditions, because I don't know whether zipWithIndex is supported (if the answer is not helpful, please let me know in a comment and I will delete my answer):
list.zipWithIndex.map{x => list.take(x._2+1).sum}
This code works for me: for each element it takes the element's index and sums the first n+1 elements of the list (note the +1, since zipWithIndex starts at 0).
When printing it, I get the following:
List(4, 16, 19, 25, 34)

Combine values with same keys in Scala

I currently have two lists, List('a','b','a') and List(45,65,12), with many more elements; the elements of the second list are linked to the elements of the first list in a key-value relationship. I want to combine elements with the same key by adding their corresponding values, and create a map which should look like Map('a' -> 57, 'b' -> 65), since 57 = 45 + 12.
I have currently implemented it as
val keys = List('a','b','a')
val values = List(45,65,12)
val finalMap = scala.collection.mutable.Map[Char, Int]().withDefaultValue(0)
0 until keys.length map (w => finalMap(keys(w)) += values(w))
I feel that there should be a better (more functional) way of creating the desired map than what I am doing. How could I improve my code and do the same thing in a more functional way?
val m = keys.zip(values).groupBy(_._1).mapValues(l => l.map(_._2).sum)
EDIT: To explain how the code works, zip pairs corresponding elements of two input sequences, so
keys.zip(values) = List((a, 45), (b, 65), (a, 12))
Now you want to group together all the pairs with the same first element. This can be done with groupBy:
keys.zip(values).groupBy(_._1) = Map((a, List((a, 45), (a, 12))), (b, List((b, 65))))
groupBy returns a map whose keys are the type being grouped on, and whose values are a list of the elements in the input sequence with the same key.
The keys of this map are the characters in keys, and the values are the list of associated pairs from keys and values. Since the keys are already the ones you want in the output map, you only need to transform the values from List[(Char, Int)] to List[Int].
You can do this by summing the values from the second element of each pair in the list.
You can extract the values from each pair using map e.g.
List((a, 45), (a, 12)).map(_._2) = List(45,12)
Now you can sum these values using sum:
List(45, 12).sum = 57
You can apply this transform to all the values in the map using mapValues to get the result you want.
I was going to +1 Lee's first version, but mapValues is a view, and the variable name l (ell) always looks like 1 (one) to me. Just so as not to seem petty.
scala> (keys zip values) groupBy (_._1) map { case (k,v) => (k, (v map (_._2)).sum) }
res0: scala.collection.immutable.Map[Char,Int] = Map(b -> 65, a -> 57)
Hey, the answer with fold disappeared. You can't blink on SO, the action is so fast.
I'm going to +1 Lee's typing speed anyway.
Edit: to explain how mapValues is a view:
scala> keys.zip(values).groupBy(_._1).mapValues(l => l.map { v =>
| println("OK mapping")
| v._2
| }.sum)
OK mapping
OK mapping
OK mapping
res2: scala.collection.immutable.Map[Char,Int] = Map(b -> 65, a -> 57)
scala> res2('a') // recomputes
OK mapping
OK mapping
res4: Int = 57
Sometimes that is what you want, but often it is surprising. I think there is a puzzler for it.
You were actually on the right track to a reasonably efficient functional solution. If we just switch to an immutable collection and use a fold on a key-value zip, we get:
( Map[Char,Int]() /: (keys,values).zipped ) ( (m,kv) =>
m + ( kv._1 -> ( m.getOrElse( kv._1, 0 ) + kv._2 ) )
)
Or you could use withDefaultValue 0, as you did, if you want the final map to have that default. Note that .zipped is faster than zip because it doesn't create an intermediate collection. And a groupBy would create a number of other intermediate collections. Of course it may not be worth optimizing, and if it is you could do even better than this, but I wanted to show you that your line of thinking wasn't far off the mark.