Optimal Way to Achieve Traditional Loop Based Tasks in Scala - scala

I am new to Scala and working on implementing an algorithm. In C#, this would have been a much easier task with necessary loops, but it is a bit confusing to implement with Scala functional programming semantics.
Assume I have to fill a spreadsheet (S) with N rows and M cols with values that I have in a one-dimensional list (L).
While filling an individual cell in the spreadsheet, there is some back-and-forth logic involved.
2a. The system will walk through the items in L sequentially and will put each one in the next empty cell in sheet S.
2b. While filling the item value of the currently processed item from L in a cell, the system will check, can the current cell accept the item value. If yes, it will fill, and move on to the next item and follow Step 2a. If not, it will see if it could fill the next item from L. Until it finds a value that could fit in, the system will continue to evaluate till it runs out of values and will leave it blank.
2c. The system after filling the cell in Step 2b will move to the next cell. Now, it will first check whether any of the unprocessed values from the previous step (2b) could be accepted by the currently processed cell. If yes, it will fill the same and continue to do work with unprocessed values. If it cannot find an unprocessed value that could fit in, it will pull the next item from L based on the position of the pointer on Step 2b.
It would be great if I could get ideas on how to structure this in Scala. As I mentioned earlier, in C# this would have been easy with foreach loops, but I am not sure what the best way is to do this with functional programming constructs.

You can remember that imperative:
for (init; condition; afterEach) {
instructions
}
is just a syntactic sugar for:
init
while (condition) {
instructions
afterEach
}
(at least until you use break or continue). So if you are able to rewrite your for-loop code into while-loop code the translation is pretty straightforward.
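As a small illustration of that translation, here is a sketch of a C-style counting loop written as a Scala while loop. The function name sumBelow and the summing task are made up for the example; only the init / condition / afterEach layout comes from the text above.

```scala
object LoopTranslation {
  // C-style original: for (int i = 0; i < n; i++) sum += i;
  def sumBelow(n: Int): Int = {
    var sum = 0
    var i = 0          // init
    while (i < n) {    // condition
      sum += i         // instructions
      i += 1           // afterEach
    }
    sum
  }
}
```

The mutable state stays local to the function, so callers still see a pure function.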
If you are not interested in such solution you could do something like
val indices = for {
i <- (0 until n).toStream // or .to(LazyList) if on 2.13
j <- (0 until m).toStream // or .to(LazyList) if on 2.13
} yield i -> j
indices.foldLeft(allItemsToInsert) { case (itemsLeft, (i, j)) =>
  itemsLeft.find(item => /* predicate if item can be inserted at (i, j) */) match {
    case Some(item) =>
      // insert item to spreadsheet
      itemsLeft diff List(item) // remove found element - use other data structure if you find this too costly
    case None =>
      itemsLeft // nothing could be inserted, move on
  }
}
This would go through all indices one after another and, for each one, try to find the first remaining element that can be inserted there. If it finds one, it inserts it and takes it off the list; if nothing can be inserted, it leaves the cell blank and moves on.
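To make the fold concrete, here is a self-contained sketch that threads both the sheet and the remaining items through foldLeft. The sheet is modelled as an immutable Map from (row, column) to value, and canInsert is a purely hypothetical acceptance rule (parity matching) invented for the example; substitute your own predicate and cell type.

```scala
object SheetFill {
  type Cell = (Int, Int)

  // Hypothetical rule for illustration only: a cell accepts a non-negative
  // value whose parity matches the parity of (row + column).
  def canInsert(item: Int, cell: Cell): Boolean =
    (cell._1 + cell._2) % 2 == item % 2

  // Returns the filled sheet and whatever items could not be placed.
  def fill(items: List[Int], n: Int, m: Int): (Map[Cell, Int], List[Int]) = {
    val cells = for { i <- 0 until n; j <- 0 until m } yield (i, j)
    cells.foldLeft((Map.empty[Cell, Int], items)) {
      case ((sheet, itemsLeft), cell) =>
        itemsLeft.find(canInsert(_, cell)) match {
          case Some(item) =>
            // diff removes only the first occurrence of item
            (sheet + (cell -> item), itemsLeft diff List(item))
          case None =>
            (sheet, itemsLeft) // no item fits: leave the cell blank
        }
    }
  }
}
```

Because skipped items stay at the front of itemsLeft, they are tried first for the next cell, which is the behaviour described in step 2c of the question.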
You can tweak the logic to e.g. partition on items that can be inserted if there could be more than one:
indices.foldLeft(allItemsToInsert) { case (itemsLeft, (i, j)) =>
val (insertable, nonInsertable) = itemsLeft.partition(item => /* predicate if item can be inserted */)
// insert insertable
nonInsertable // pass non-insertable for the next index
}
Alternatively you could also use tail recursion if you really need to go back and forth:
@scala.annotation.tailrec
def insertValues(items: List[Item], i: Int, j: Int): Unit = {
  if (items.nonEmpty) {
    // insert what you can into spreadsheet
    val itemsLeft = ... // items that you haven't inserted
    val (newI, newJ) = ... // next cell to visit
    insertValues(itemsLeft, newI, newJ)
  }
}

Related

Iterating through Seq[row] till a particular condition is met using Scala

I need to iterate over a Scala Seq of Row until a particular condition is met. I don't need to process anything after that point.
I have a seq[Row] r->WrappedArray([1/1/2020,abc,1],[1/2/2020,pqr,1],[1/3/2020,stu,0],[1/4/2020,opq,1],[1/6/2020,lmn,0])
I want to iterate through this collection, checking r.getInt(2), until I encounter 0. As soon as I encounter 0, I need to stop iterating and collect r.getString(1) for the rows seen up to and including that one. I don't need to look at any data after that.
My output should be: Array(abc,pqr,stu)
I am new to Scala programming. This Seq was originally a DataFrame. I know how to handle this using Spark DataFrames, but due to restrictions imposed by my organization, window functions and the createDataFrame function are not available in our environment. Hence I have resorted to plain Scala to achieve the same result.
All I could come up with was something like the code below, but it does not really work!
breakable{
for(i <- r)
var temp = i.getInt(3)===0
if(temp ==true)
{
val = i.getInt(2)
break()
}
}
Can someone please help me here!
You can use the takeWhile method to grab the elements while the value is 1:
s.takeWhile(_.getInt(2) == 1).map(_.getString(1))
That will give you
List(abc, pqr)
So you still need to get the first element where the int value is 0, which you can do as follows:
s.find(_.getInt(2)== 0).map(_.getString(1)).get
Putting it all together (and handling the case where no row with 0 exists):
s.takeWhile(_.getInt(2) == 1).map(_.getString(1)) ++ s.find(_.getInt(2)== 0).map(r => List(r.getString(1))).getOrElse(Nil)
Result:
Seq[String] = List(abc, pqr, stu)
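A possible alternative that walks the sequence only once is span, which splits a collection at the first element that fails a predicate. The sketch below uses a small case class Rec as a stand-in for Spark's Row (an assumption made so the example runs without Spark); with real rows you would call getInt(2) and getString(1) instead. Note that it stops at the first row whose flag is not 1, which matches the answer above only if the flag is always 0 or 1.

```scala
object TakeUntilZero {
  // Stand-in for Spark's Row with the question's column order:
  // (date, name, flag). This type is an assumption for the example.
  final case class Rec(date: String, name: String, flag: Int)

  def namesThroughFirstZero(rows: Seq[Rec]): Seq[String] = {
    // span splits at the first row whose flag is not 1
    val (ones, rest) = rows.span(_.flag == 1)
    // keep the leading run plus the row that ended it, if there is one
    (ones ++ rest.take(1)).map(_.name)
  }
}
```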

Scala: For loop that matches ints in a List

New to Scala. I'm iterating a for loop 100 times. 10 times I want condition 'a' to be met and 90 times condition 'b'. However I want the 10 a's to occur at random.
The best way I can think is to create a val of 10 random integers, then loop through 1 to 100 ints.
For example:
val z = List.fill(10)(100).map(scala.util.Random.nextInt)
z: List[Int] = List(71, 5, 2, 9, 26, 96, 69, 26, 92, 4)
Then something like:
for (i <- 1 to 100) {
whenever i == to a number in z: 'Condition a met: do something'
else {
'condition b met: do something else'
}
}
I tried using contains and == and =! but nothing seemed to work. How else can I do this?
Your generation of random numbers could yield duplicates... is that OK? Here's how you can easily generate 10 unique numbers 1-100 (by generating a randomly shuffled sequence of 1-100 and taking first ten):
val r = scala.util.Random.shuffle(1 to 100).toList.take(10)
Now you can simply partition a range 1-100 into those who are contained in your randomly generated list and those who are not:
val (listOfA, listOfB) = (1 to 100).partition(r.contains(_))
Now do whatever you want with those two lists, e.g.:
println(listOfA.mkString(","))
println(listOfB.mkString(","))
Of course, you can always simply go through the list one by one:
(1 to 100).map {
case i if (r.contains(i)) => println("yes: " + i) // or whatever
case i => println("no: " + i)
}
What you consider to be a simple for-loop actually isn't one. It's a for-comprehension, which is syntactic sugar that desugars into chained calls to map, flatMap, withFilter and foreach. Yes, it can be used in the same way as you would use the classical for-loop, but this is only because List is in fact a monad. Without going into too much detail, if you want to do things the idiomatic Scala way (the "functional" way), you should avoid trying to write classical iterative for loops and prefer getting a collection of your data and then mapping over its elements to perform whatever it is that you need. Note that collections have a really rich library behind them which allows you to invoke useful methods such as partition.
EDIT (for completeness):
Also, you should avoid side-effects, or at least push them as far down the road as possible. I'm talking about the second example from my answer. Let's say you really need to log that stuff (you would be using a logger, but println is good enough for this example). Doing it like this is bad. Btw note that you could use foreach instead of map in that case, because you're not collecting results, just performing the side effects.
A better way would be to compute the needed values by mapping each element to an appropriate string. So, calculate the needed strings and accumulate them into results:
val results = (1 to 100).map {
case i if (r.contains(i)) => ("yes: " + i) // or whatever
case i => ("no: " + i)
}
// do whatever with results, e.g. print them
Now results contains a list of a hundred "yes x" and "no x" strings, but you didn't do the ugly thing and perform logging as a side effect in the mapping process. Instead, you mapped each element of the collection into a corresponding string (note that original collection remains intact, so if (1 to 100) was stored in some value, it's still there; mapping creates a new collection) and now you can do whatever you want with it, e.g. pass it on to the logger. Yes, at some point you need to do "the ugly side effect thing" and log the stuff, but at least you will have a special part of code for doing that and you will not be mixing it into your mapping logic which checks if number is contained in the random sequence.
(1 to 100).foreach { x =>
if(z.contains(x)) {
// do something
} else {
// do something else
}
}
or you can use a partial function, like so:
(1 to 100).foreach {
case x if(z.contains(x)) => // do something
case _ => // do something else
}
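Since the answers above call contains on a List for each of the 100 numbers, one optional refinement is to keep the picked numbers in a Set, where membership checks do not scan the whole collection. The sketch below is only an illustration; the names pick and labels are made up, and the Random instance is passed in so that a seed can be fixed when testing.

```scala
import scala.util.Random

object RandomPicks {
  // Ten distinct numbers in 1..100, stored as a Set for fast lookup.
  def pick(rng: Random): Set[Int] =
    rng.shuffle((1 to 100).toList).take(10).toSet

  // Label every number in 1..100 as "a" (picked) or "b" (not picked).
  def labels(picked: Set[Int]): Seq[String] =
    (1 to 100).map(i => if (picked(i)) "a" else "b")
}
```

Because the ten numbers are drawn by shuffling, they are guaranteed to be distinct, so exactly 10 of the 100 labels are "a".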

Understanding spark process behaviour

I would like to understand the behavior of a process. Basically, this Spark process must create at most five files, one for each territory, and save them into HDFS.
Territories are provided by an array of five strings. But when I look at the Spark UI, I see the same action being executed many times.
These are my questions:
Why has the isEmpty action been executed 4 times for each territory instead of once? I expect just one action per territory.
How is the number of tasks decided when isEmpty is calculated? The first time there is just one task, the second time there are 4, the third time 20 and the fourth time 35. What is the logic behind that sizing? Can I control that number in some way?
NOTE: if someone has a more, say, big-data-style solution for accomplishing the same goal, please suggest it.
This is the code excerpt for the Spark process:
class IntegrationStatusD1RequestProcess {
logger.info(s"Retrieving all measurement point from DB")
val allMPoints = registryData.createIncrementalRegistryByMPointID()
.setName("allMPoints")
.persist(StorageLevel.MEMORY_AND_DISK)
logger.info("getTerritories return always an array of five String")
intStatusHelper.getTerritories.foreach { territory =>
logger.info(s"Retrieving measurement point for territory $territory")
val intStatusesChanged = allMPoints
.filter { m => m.getmPoint.substring(0, 3) == territory }
.setName(s"intStatusesChanged_${territory}")
.persist(StorageLevel.MEMORY_AND_DISK)
intStatusesChanged.isEmpty match {
case true => logger.info(s"No changes detected for territory")
case false =>
//create file and save it into hdfs
}
}
}
This is the image showing all the spark jobs:
The following first two images showing isEmpty stages:
isEmpty is inefficient if you expect it to be true!
Here's the RDD code for isEmpty:
def isEmpty(): Boolean = withScope {
partitions.length == 0 || take(1).length == 0
}
It calls take. This is an efficient implementation if you think the RDD isn't empty, but is a horrible implementation if you think that it is.
The implementation of take follows this recursive step, starting at parts = 1:
Collect the first parts partitions.
Check if this result contains >= n items.
If yes, take the first n
If no, repeat step 1 with parts = parts * 4.
This implementation strategy lets the execution short-circuit if the RDD has more elements than you want to take, which is usually true. But if your RDD has fewer elements than you want to take, you end up computing the partition #1 log4(nPartitions) + 1 times, partitions #2-4 log4(nPartitions) times, partitions #5-16 log4(nPartitions) - 1 times, and so on.
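The task counts reported in the question (1, 4, 20, 35) can be reproduced with a small model of the retry schedule. The model below is an assumption inferred from those numbers, not a copy of Spark's source: the first job scans one partition, each later job tries four times as many partitions as have been scanned so far, partitions already scanned are not scanned again, and the RDD has 60 partitions in total.

```scala
object TakeSchedule {
  // Number of partitions scanned by each successive job while every
  // scanned partition turns out to be empty.
  // Assumptions: growth factor 4; already-scanned partitions are skipped.
  def jobSizes(totalParts: Int, factor: Int = 4): List[Int] = {
    @scala.annotation.tailrec
    def loop(scanned: Int, acc: List[Int]): List[Int] =
      if (scanned >= totalParts) acc.reverse
      else {
        val toTry = if (scanned == 0) 1 else scanned * factor
        val size = math.min(toTry, totalParts - scanned)
        loop(scanned + size, size :: acc)
      }
    loop(0, Nil)
  }
}
```

Under these assumptions, TakeSchedule.jobSizes(60) yields List(1, 4, 20, 35), which is four separate jobs per empty territory.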
A better implementation for this use case
This implementation only computes each partition once by sacrificing short-circuit capability:
import org.apache.spark.rdd.RDD

def fasterIsEmpty(rdd: RDD[_]): Boolean = {
  // one Boolean per partition, then AND them together
  rdd.mapPartitions(it => Iterator(it.isEmpty))
    .fold(true)(_ && _)
}

Calculate sums of even/odd pairs on Hadoop?

I want to create a parallel scanLeft (which computes prefix sums for an associative operator) function for Hadoop (Scalding in particular; see below for how this is done).
Given a sequence of numbers in a hdfs file (one per line) I want to calculate a new sequence with the sums of consecutive even/odd pairs. For example:
input sequence:
0,1,2,3,4,5,6,7,8,9,10
output sequence:
0+1, 2+3, 4+5, 6+7, 8+9, 10
i.e.
1,5,9,13,17,10
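As a local, non-distributed reference for the expected output above, the pairing step can be sketched with grouped on an in-memory sequence. This says nothing about how to split the input on Hadoop; the name pairSums is made up for the example.

```scala
object PairSums {
  // Sum each consecutive (even index, odd index) pair; a trailing
  // unpaired element is carried over unchanged.
  def pairSums(xs: Seq[Int]): Vector[Int] =
    xs.grouped(2).map(_.sum).toVector
}
```

For the input 0 to 10 this produces 1, 5, 9, 13, 17, 10, matching the sequence listed above.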
I think in order to do this, I need to write an InputFormat and InputSplits classes for Hadoop, but I don't know how to do this.
See this section 3.3 here. Below is an example algorithm in Scala:
// for simplicity assume input length is a power of 2
def scanadd(input: IndexedSeq[Int]): IndexedSeq[Int] =
  if (input.length == 1)
    input
  else {
    // calculate a new collapsed sequence which is the sum of sequential even/odd pairs
    val collapsed = IndexedSeq.tabulate(input.length / 2)(i => input(2 * i) + input(2 * i + 1))
    // recursively scan collapsed values
    val scancollapse = scanadd(collapsed)
    // now we can use the scan of the collapsed seq to calculate the full sequence
    IndexedSeq.tabulate(input.length) { i =>
      // odd 0-based index: the collapsed scan already holds the prefix sum
      if (i % 2 == 1) scancollapse(i / 2)
      else if (i == 0) input(0)
      // even index: take the prefix sum just before it and add the current value
      else scancollapse(i / 2 - 1) + input(i)
    }
  }
I understand that this might need a fair bit of optimization for it to work nicely with Hadoop. Translating this directly would, I think, lead to pretty inefficient Hadoop code. For example, in Hadoop you obviously can't use an IndexedSeq. I would appreciate it if you pointed out any specific problems you see. I think it can probably be made to work well, though.
Superfluous. You meant this code?
val vv = (0 to 1000000).grouped(2).toVector
vv.par.foldLeft((0L, 0L, false))((a, v) =>
if (a._3) (a._1, a._2 + v.sum, !a._3) else (a._1 + v.sum, a._2, !a._3))
This was the best tutorial I found for writing an InputFormat and RecordReader. I ended up reading the whole split as one ArrayWritable record.

Scala vals vs vars

I'm pretty new to Scala, but I'd like to know the preferred way of solving this problem. Say I have a list of items and I want to know the total amount of the items that are checks. I could do something like this:
val total = items.filter(_.itemType == CHECK).map(_.amount).sum
That would give me what I need, the sum of all checks in an immutable variable. But it does it with what seems like 3 iterations: once to filter the checks, again to map the amounts, and then the sum. Another way would be to do something like:
var total = new BigDecimal(0)
for (
item <- items
if item.itemType == CHECK
) total += item.amount
This gives me the same result, but with 1 iteration and a mutable variable, which seems fine too. But if I wanted to extract more information, say the total number of checks, that would require more counters or mutable variables, although I wouldn't have to iterate over the list again. It doesn't seem like the "functional" way of achieving what I need.
var numOfChecks = 0
var total = new BigDecimal(0)
items.foreach { item =>
if (item.itemType == CHECK) {
numOfChecks += 1
total += item.amount
}
}
So if you find yourself needing a bunch of counters or totals over a list, is it preferred to keep mutable variables, or to not worry about it and do something along the lines of:
val checks = items.filter(_.itemType == CHECK)
val total = checks.map(_.amount).sum
return (checks.size, total)
which seems easier to read and only uses vals
Another way of solving your problem in one iteration would be to use views or iterators:
items.iterator.filter(_.itemType == CHECK).map(_.amount).sum
or
items.view.filter(_.itemType == CHECK).map(_.amount).sum
This way the evaluation of the expression is delayed until the call of sum.
If your items are case classes you could also write it like this:
items.iterator.collect { case Item(amount, CHECK) => amount }.sum
I find that speaking of doing "three iterations" is a bit misleading. After all, each iteration does less work than a single iteration that does everything. So it doesn't automatically follow that iterating three times will take longer than iterating once.
Creating temporary objects, now that is a concern, because you'll be hitting memory (even if cached), which isn't the case of the single iteration. In those cases, view will help, even though it adds more method calls to do the same work. Hopefully, JVM will optimize that away. See Moritz's answer for more information on views.
You may use foldLeft for that:
(0 /: items) ((total, item) =>
if(item.itemType == CHECK)
total + item.amount
else
total
)
The following code will return a tuple (number of checks -> sum of amounts):
((0, 0) /: items) ((total, item) =>
if(item.itemType == CHECK)
(total._1 + 1, total._2 + item.amount)
else
total
)
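For reference, the same single-pass fold can be written with foldLeft and a typed accumulator, which avoids the symbolic /: operator and keeps the total as a BigDecimal from the start. The Item and ItemType definitions below are hypothetical, since the question does not show them.

```scala
object CheckTotals {
  sealed trait ItemType
  case object CHECK extends ItemType
  case object CASH extends ItemType

  // Hypothetical item shape; the question does not show its definition.
  final case class Item(amount: BigDecimal, itemType: ItemType)

  // One pass: count the checks and sum their amounts.
  def countAndTotal(items: Seq[Item]): (Int, BigDecimal) =
    items.foldLeft((0, BigDecimal(0))) {
      case ((count, total), item) if item.itemType == CHECK =>
        (count + 1, total + item.amount)
      case (acc, _) => acc // not a check: keep the accumulator as is
    }
}
```

The result is an immutable tuple, so no var is needed even when several totals are collected at once.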