Handle different states - scala

I was wondering if it was possible to maintain radically different states across an application? For example, have the update function of the first state call the one from the second state?
I do not recall seeing any such example, nor did I find anything advising against it... Based on the example from https://docs.cloud.databricks.com/docs/spark/1.6/examples/Streaming%20mapWithState.html, I see no reason why I wouldn't be able to have different trackStateFuncs with different States, and still update them by their key, as shown below:
def firstTrackStateFunc(batchTime: Time,
                        key: String,
                        value: Option[Int],
                        state: State[Long]): Option[(String, Long)] = {
  val sum = value.getOrElse(0).toLong + state.getOption.getOrElse(0L)
  val output = (key, sum)
  state.update(sum)
  Some(output)
}
and
def secondTrackStateFunc(batchTime: Time,
                         key: String,
                         value: Option[Int],
                         state: State[Int]): Option[(String, Int)] = {
  // disregard the problems this example would cause
  val dif = value.getOrElse(0) - state.getOption.getOrElse(0)
  val output = (key, dif)
  state.update(dif)
  Some(output)
}
I think this is possible, but I'm still unsure. I would like someone to validate or invalidate this assumption...

I was wondering if it was possible to maintain radically different
states across an application?
Every call to mapWithState on a DStream[(Key, Value)] can hold one State[T] object. This T needs to be the same for every invocation of mapWithState. In order to use different states, you can either chain mapWithState calls, where one's Option[U] is another's input, or you can split the DStream and apply a different mapWithState call to each one. You cannot, however, call a different State[T] object inside another, as they are isolated from one another, and one cannot mutate the state of the other.
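For instance, here is a minimal sketch of the chaining approach, assuming the firstTrackStateFunc from the question and a keyed input stream. Note that the second pass's mapping function must accept the first pass's output value type (Long), so a Long-valued variant is defined inline:
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{State, StateSpec, Time}

def chainedStates(input: DStream[(String, Int)]): DStream[(String, Long)] = {
  // First stateful pass: running sum per key, with its own State[Long]
  val sums: DStream[(String, Long)] =
    input.mapWithState(StateSpec.function(firstTrackStateFunc _))

  // The second pass consumes the (key, sum) pairs, so its value type is Long
  def diffTrackStateFunc(batchTime: Time,
                         key: String,
                         value: Option[Long],
                         state: State[Long]): Option[(String, Long)] = {
    val dif = value.getOrElse(0L) - state.getOption.getOrElse(0L)
    state.update(dif)
    Some((key, dif))
  }

  // Second stateful pass over the first pass's output, with its own isolated State
  sums.mapWithState(StateSpec.function(diffTrackStateFunc _))
}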

@Yuval gave a great answer on chaining mapWithState calls. However, I have another approach: instead of two mapWithState calls, you can keep both the sum and the diff in the same State[(Long, Int)].
In this case, you would only need one mapWithState function, in which you could update both values. Something like this:
def trackStateFunc(batchTime: Time,
                   key: String,
                   value: Option[Int],
                   state: State[(Long, Int)]): Option[(String, (Long, Int))] = {
  val (prevSum, prevDif) = state.getOption.getOrElse((0L, 0))
  val sum = value.getOrElse(0).toLong + prevSum
  val dif = value.getOrElse(0) - prevDif
  val output = (key, (sum, dif))
  state.update((sum, dif))
  Some(output)
}


Is there a better functional method to operate on Map[String,List[Int]]

I'm fairly new to Scala and functional programming, and I'm working on a project where I have grocery prices over 30 days and want to apply some analysis to the data that I have.
The data is saved as a Map[String, List[Int]].
What I'm trying to do is get the lowest and highest price for each item. I did it like this, and then I have another function that loops over the returned Map and prints it.
def f(): Map[String, List[Int]] = {
  var result = Map.empty[String, List[Int]]
  for ((k, v) <- data) {
    var low = v.min
    var high = v.max
    result += (k -> List(low, high))
  }
  result
}
I think this is not the most functional method to do it, can anyone elaborate if there is a way to iterate over the data and return the result without creating an empty map?
The computation does not depend on the keys in any way, so there is no reason to introduce the ks anywhere; they just distract from the main goal. Just map the values:
data.view.mapValues(v => (v.min, v.max)).toMap
Also, your signature f() doesn't tell you anything useful. How do you know what it's doing? If you deleted the body of that function and were given only "f()", would you be able to unambiguously reconstruct the body? Would GPT be able to reconstruct it? Probably not.
Ideally, the signature should be precise enough so you never need to dig into the implementation bodies (and also that you don't actually have to write them). Here is a possible improvement:
def priceRanges(itemsToPrices: Map[String, List[Int]]): Map[String, (Int, Int)] =
  itemsToPrices.view.mapValues(v => (v.min, v.max)).toMap
There are several ways to achieve this. I think one key aspect is readability, so while the following can be done as a pure one-liner, I think this could be a viable and readable solution:
data.map { case (k, v) =>
  k -> Seq(v.min, v.max)
}
Feel free to shorten it if you like.
This would also work, but it may be less readable for someone not used to functional programming.
data.map(kv => kv._1 -> Seq(kv._2.min, kv._2.max))
Another thing you may want to consider:
There is nothing that protects the List/Seq in the result type from containing more than two elements. You may want to use a tuple or create a custom type for it.
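For example, a minimal sketch of such a custom type (PriceRange and priceRanges are illustrative names):
final case class PriceRange(low: Int, high: Int)

def priceRanges(data: Map[String, List[Int]]): Map[String, PriceRange] =
  data.view.mapValues(v => PriceRange(v.min, v.max)).toMap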
I love it when people encourage themselves to do functional Scala instead of the imperative style, so congratulations on that.
Returning to your question, I think the easiest way to solve this problem is with the famous map function: it takes a function as a parameter which describes how you want to transform each element within the collection. In your case, this function goes from the tuple (item, values), which in your question would be the (k, v), to a new similar tuple, but this time only with the "prices" we are interested in:
def getLowAndHighPrices(itemsWithPrices: Map[String, List[Int]]): Map[String, List[Int]] =
  itemsWithPrices.map((item, prices) => (item, List(prices.min, prices.max)))
You can read the previous map implementation as: for each value (item, prices), convert it into the tuple (item, List(prices.min, prices.max)). The map function literally describes what you want to do, without telling it exactly what steps to follow, because map takes care of that for you; that is, to me, one of the advantages of functional programming.
You can also print the results in a very “functional” way (ignoring the side effects):
// For demonstration purposes
val allItemPrices: Map[String, List[Int]] =
  Map(
    "Milk" -> List(9, 8, 7, 10),
    "Eggs" -> List(1, 3, 4, 3, 5, 2)
  )

def main(args: Array[String]): Unit =
  getLowAndHighPrices(allItemPrices).foreach((item, prices) => println(s"$item -> $prices"))

/**
 * Which prints out:
 * Milk -> List(7, 10)
 * Eggs -> List(1, 5)
 */
In this case, foreach does something very similar to map, with the difference that foreach is designed to perform side effects such as printing to the console.
I hope I made myself clear. Good luck on your Scala journey!

lazy val function vs def method

When calling a function from an external class, in the case of many calls, which will give me better performance: a lazy val function or a def method?
So far, what I understand is:
def method -
Defined on and tied to a class; it needs to be declared inside an object in order to be called in a Java-static style.
Call-by-name, evaluated only when accessed, and every time it is accessed.
lazy val lambda expression -
Tied to a Function1/2...22 object.
Call-by-value, evaluated the first time it is accessed, and evaluated only once.
Is actually a def apply method tied to a class.
So it may seem that using a lazy val will reduce the need to evaluate the function every time; should it be preferred?
I ran into this when writing UDFs for Spark code, and I'm trying to understand which approach is better.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, udf}

object sql {
  def emptyStringToNull(str: String): Option[String] = {
    Option(str).getOrElse("").trim match {
      case "" => None
      case "[]" => None
      case "null" => None
      case _ => Some(str.trim)
    }
  }

  def udfEmptyStringToNull: UserDefinedFunction = udf(emptyStringToNull _)

  def repairColumn_method(dataFrame: DataFrame, colName: String): DataFrame = {
    dataFrame.withColumn(colName, udfEmptyStringToNull(col(colName)))
  }

  lazy val repairColumn_fun: (DataFrame, String) => DataFrame = { (df, colName) =>
    df.withColumn(colName, udfEmptyStringToNull(col(colName)))
  }
}
There's no need for you to use a lazy val in this specific case. When you assign a function to a lazy val, its results are not memoized, as you seem to think they are. Since the function itself is a plain function literal and not the result of an expensive computation (regardless of what goes on inside it), making it lazy is not useful. All it does is add overhead when accessing and calling it. A simple val would be better, but making it a proper method would be best.
If you want memoization, see Is there a generic way to memoize in Scala? instead.
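To make the non-memoization point concrete, here is a small sketch (runnable in a REPL or worksheet; the names are illustrative):
// The lazy val only delays building the function object; every call still runs the body.
lazy val squareFn: Int => Int = { x =>
  println(s"computing square of $x")   // printed on every call
  x * x
}

def squareDef(x: Int): Int = {
  println(s"computing square of $x")   // also printed on every call
  x * x
}

squareFn(3); squareFn(3)   // prints twice: only the function object is cached, not the result
squareDef(3); squareDef(3) // prints twice as well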
Ignoring your specific example: if the def in question didn't take any arguments, and both it and the lazy val were simple values that were expensive to compute, I would go with the lazy val if you're going to access it many times, to avoid computing it over and over again.
If they were values that were very cheap to compute and you're not going to access them many times, or if they're expensive to compute but you're only going to access them once, I would go with a def instead. There wouldn't be much difference if you used a lazy val, but a def avoids creating a couple of extra fields.
If they're somewhat cheap to compute but accessed many times, it may be better to use a lazy val simply because the value will be cached. However, you might want to look at your overall design before looking at such micro-optimizations.
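As a sketch of that expensive-value trade-off (the object and the simulated work are illustrative):
object ExpensiveValues {
  private def compute(): Long = {
    println("computing...")        // shows how often the body runs
    (1L to 10000000L).sum
  }

  def viaDef: Long = compute()          // recomputed on every access
  lazy val viaLazyVal: Long = compute() // computed once on first access, then cached
}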

How to sort on multiple columns using takeOrdered?

How do I sort by two or more columns using the takeOrdered(4)(Ordering[Int]) approach in Spark/Scala?
I can achieve this using sortBy, like this:
lines.sortBy(x => (x.split(",")(1).toInt, -x.split(",")(4).toInt)).map(p => println(p)).take(50)
But when I try to sort using the takeOrdered approach, it fails.
tl;dr Do something like this (but consider rewriting your code to call split only once):
lines.map(x => (x.split(",")(1).toInt, -x.split(",")(4).toInt)).takeOrdered(50)
Here is the explanation.
When you call takeOrdered directly on lines, the implicit Ordering that takes effect is Ordering[String] because lines is an RDD[String]. You need to transform lines into a new RDD[(Int, Int)]. Because there is an implicit Ordering[(Int, Int)] available, it takes effect on your transformed RDD.
Meanwhile, sortBy works a little differently. Here is the signature:
sortBy[K](f: (T) ⇒ K, ascending: Boolean = true, numPartitions: Int = this.partitions.length)(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]
I know that is an intimidating signature, but if you cut through the noise, you can see that sortBy takes a function that maps your original type to a new type just for sorting purposes and applies the Ordering for that return type if one is in implicit scope.
In your case, you are applying a function to the Strings in your RDD to transform them into a "view" of how Spark should treat them merely for sorting purposes, i.e. as an (Int, Int), and then relying on the fact that the implicit Ordering[(Int, Int)] is available, as mentioned.
The sortBy approach allows you to keep lines intact as an RDD[String] and use the mapping just to sort while the takeOrdered approach operates on a brand new RDD containing (Int, Int) derived from the original lines. Whichever approach is more suitable for your needs depends on what you wish to accomplish.
On another note, you probably want to rewrite your code to only split your text once.
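A possible split-once rewrite, as a sketch (same column indices as in the question):
val topFifty = lines
  .map { line =>
    val cols = line.split(",")
    (cols(1).toInt, -cols(4).toInt)  // ascending on column 1, descending on column 4
  }
  .takeOrdered(50)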
You could implement your custom Ordering:
lines.takeOrdered(4)(new Ordering[String] {
  override def compare(x: String, y: String): Int = {
    val xs = x.split(",")
    val ys = y.split(",")
    val d1 = xs(1).toInt - ys(1).toInt
    if (d1 != 0) d1 else ys(4).toInt - xs(4).toInt
  }
})

How to filter on a Seq[PartialFunction]

I have a list of business rules and multiple rules can apply to a given input.
type Input = …
type Output = …
type Rule = PartialFunction[Input, Output]
I want to write a method that computes all valid outputs. I've come up with this implementation:
def applyRules(i: Input, rules: Seq[Rule]): Seq[Output] = {
  rules.flatMap(_.lift.apply(i))
}
Is there a better way?
One proposed variant of your solution (which I'd say is satisfactory) involves filtering and then mapping over the filtered results. This works, but involves two passes over the same collection, which for smaller collections can be fine. We can, however, reach the same result with three possible further variants:
using the collect method: rules.collect { case r if r.isDefinedAt(i) => r(i) }
using the lazy withFilter instead of filter: rules.withFilter(_.isDefinedAt(i)).map(_.apply(i)), or
using a for comprehension (semantically identical to the one above but perhaps more readable): for (r <- rules if r.isDefinedAt(i)) yield r(i)
These solutions may produce slightly less garbage than lift (each call to lift creates a new instance of a function object), but if the number of Rules is small, I'm sure in most cases that's a non-issue.
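Putting the collect variant back into the original method, as a sketch:
def applyRules(i: Input, rules: Seq[Rule]): Seq[Output] =
  rules.collect { case r if r.isDefinedAt(i) => r(i) }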
You can use isDefinedAt to check whether a given Input can be handled by a Rule:
scala> val pf: PartialFunction[Any, Int] = { case s: String => 42 }
pf: PartialFunction[Any,Int] = <function1>
scala> pf.isDefinedAt(10)
res0: Boolean = false
scala> pf.isDefinedAt("")
res1: Boolean = true
So you can do something like:
val validInputs = rules.filter(_.isDefinedAt(i))
val result = validInputs.map(_(i))
Also, PartialFunction has methods such as applyOrElse, orElse, ..., which might improve readability.
Please correct me if I misunderstood your problem.

What are good examples of: "operation of a program should map input values to output values rather than change data in place"

I came across this sentence in Scala in explaining its functional behavior.
operation of a program should map input values to output values rather than change data in place
Could somebody explain it with a good example?
Edit: Please explain the above sentence or give an example of it in its context; please don't overcomplicate things and create more confusion.
The most obvious pattern that this is referring to is the difference between how you would write collection-handling code in Java compared with Scala. If you were writing Scala but in the idiom of Java, then you would be working with collections by mutating data in place. The idiomatic Scala code to do the same would favour mapping input values to output values.
Let's have a look at a few things you might want to do to a collection:
Filtering
In Java, if I have a List<Trade> and I am only interested in those trades executed with Deutsche Bank, I might do something like:
for (Iterator<Trade> it = trades.iterator(); it.hasNext();) {
    Trade t = it.next();
    if (t.getCounterparty() != DEUTSCHE_BANK) it.remove(); // MUTATION
}
Following this loop, my trades collection only contains the relevant trades. But I have achieved this using mutation - a careless programmer could easily have missed that trades was an input parameter or an instance variable, or that it is used elsewhere in the method. As such, it is quite possible their code is now broken. Furthermore, such code is extremely brittle to refactor for this same reason; a programmer wishing to refactor a piece of code must be very careful not to let mutated collections escape the scope in which they are intended to be used and, vice versa, not to accidentally use an un-mutated collection where they should have used a mutated one.
Compare with Scala:
val db = trades filter (_.counterparty == DeutscheBank) //MAPPING INPUT TO OUTPUT
This creates a new collection! It doesn't affect anyone who is looking at trades and is inherently safer.
Mapping
Suppose I have a List<Trade> and I want to get a Set<Stock> for the unique stocks which I have been trading. Again, the idiom in Java is to create a collection and mutate it.
Set<Stock> stocks = new HashSet<Stock>();
for (Trade t : trades) stocks.add(t.getStock()); //MUTATION
Using Scala, the correct thing to do is map the input collection and then convert it to a set:
val stocks = (trades map (_.stock)).toSet //MAPPING INPUT TO OUTPUT
Or, if we are concerned about performance:
(trades.view map (_.stock)).toSet
(trades.iterator map (_.stock)).toSet
What are the advantages here? Well:
My code can never observe a partially-constructed result
The application of a function A => B to a Coll[A] to get a Coll[B] is clearer.
Accumulating
Again, in Java the idiom has to be mutation. Suppose we are trying to sum the decimal quantities of the trades we have done:
BigDecimal sum = BigDecimal.ZERO;
for (Trade t : trades) {
    sum.add(t.getQuantity()); // MUTATION
}
Again, we must be very careful not to accidentally observe a partially-constructed result! In scala, we can do this in a single expression:
val sum = (0 /: trades)(_ + _.quantity) //MAPPING INPUT TO OUTPUT
Or the various other forms:
trades.foldLeft(0)(_ + _.quantity)
(trades.iterator map (_.quantity)).sum
(trades.view map (_.quantity)).sum
Oh, by the way, there is a bug in the Java implementation! Did you spot it?
I'd say it's the difference between:
var counter = 0
def updateCounter(toAdd: Int): Unit = {
  counter += toAdd
}
updateCounter(8)
println(counter)
and:
val originalValue = 0
def addToValue(value: Int, toAdd: Int): Int = value + toAdd
val firstNewResult = addToValue(originalValue, 8)
println(firstNewResult)
This is a gross oversimplification, but fuller examples include things like using foldLeft to build up a result rather than doing the hard work yourself: foldLeft example
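For instance, a small foldLeft sketch with illustrative values:
val quantities = List(1, 2, 3, 4)
// Each step maps (accumulator, element) to a new accumulator; nothing is mutated in place.
val total = quantities.foldLeft(0)((acc, q) => acc + q) // 10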
What it means is that if you write pure functions like this you always get the same output from the same input, and there are no side effects, which makes it easier to reason about your programs and ensure that they are correct.
So, for example, the function:
def times2(x: Int): Int = x * 2
is pure, while
def add5ToList(xs: MutableList[Int]): Unit = {
  xs += 5
}
is impure, because it edits data in place as a side effect. This is a problem because that same list could be in use elsewhere in the program, and now we can't guarantee its behaviour because it has changed.
A pure version would use immutable lists and return a new list
def add5ToList(xs: List[Int]): List[Int] = {
  5 :: xs
}
There are plenty of examples with collections, which are easy to come by but might give the wrong impression. This concept works at all levels of the language (it doesn't at the VM level, however). One example is case classes. Consider these two alternatives:
// Java-style
class Person(initialName: String, initialAge: Int) {
  def this(initialName: String) = this(initialName, 0)

  private var name = initialName
  private var age = initialAge

  def getName = name
  def getAge = age
  def setName(newName: String): Unit = { name = newName }
  def setAge(newAge: Int): Unit = { age = newAge }
}

val employee = new Person("John")
employee.setAge(40) // we changed the object

// Scala-style
case class Person(name: String, age: Int) {
  def this(name: String) = this(name, 0)
}

val employee = new Person("John")
val employeeWithAge = employee.copy(age = 40) // employee still exists!
This concept is applied in the construction of the immutable collections themselves: a List never changes. Instead, new List objects are created when necessary. The use of persistent data structures reduces the copying that would happen with a mutable data structure.
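A tiny sketch of that structural sharing:
val xs = List(2, 3, 4)
val ys = 1 :: xs // ys reuses xs's nodes rather than copying them; xs itself is unchanged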