Improve the efficiency of the algorithm - scala

I'm trying to improve the algorithm.
It currently runs in O(n) and always iterates through every element of the collection. My attempts to stop early, without visiting all n elements, led me to introduce var variables. It would be great to do without var.
Task:
We need a class that suggests company names by prefix: from the list of all available names, it outputs a given number of companies that start with the entered string.
It is assumed that the class will be called when filling out a form on a website/mobile application with a high RPS (Requests per second).
My solution:
class SuggestService(companyNames : Seq[String]) {
def suggest(input: String, numberOfSuggest : Int) : Seq[String] = {
val resultCompanyNames =
for {
name <- companyNames if input.equals(name.take(input.length))
} yield name
resultCompanyNames.take(numberOfSuggest)
} //TODO: My code
}
Scastie: https://scastie.scala-lang.org/mIC5ZTGwRyKnuJAbhM1VlA

The solution proposed by @jwvh in the comments:
companyNames.view.filter(_.startsWith(input)).take(numberOfSuggest).toSeq
is good when you have several dozen company names, but in the worst case you'll have to check every single company name. If you have thousands of company names and many requests per second, this has a chance to become a serious bottleneck.
A better approach might be to sort the company names and use binary search to find the first potential result in O(L log N), where L is the average length of a company name:
import scala.collection.immutable.ArraySeq // in Scala 2.13
class SuggestService(companyNames: Seq[String]) {
// in Scala 2.12 use companyNames.toIndexedSeq.sorted
private val sortedNames = companyNames.to(ArraySeq).sorted
@annotation.tailrec
private def binarySearch(input: String, from: Int = 0, to: Int = sortedNames.size): Int = {
if (from == to) from
else {
val cur = (from + to) / 2
if (sortedNames(cur) < input) binarySearch(input, cur + 1, to)
else binarySearch(input, from, cur)
}
}
def suggest(input: String, numberOfSuggest: Int): Seq[String] = {
val start = binarySearch(input)
sortedNames.view
.slice(start, start + numberOfSuggest)
.takeWhile(_.startsWith(input))
.toSeq
}
}
Note that the built-in binary search in scala (sortedNames.search) returns any result, not necessarily the first one. And the built-in binary search from Java works either on Arrays (Arrays.binarySearch) or on Java collections (Collections.binarySearch). So in the code above I've provided an explicit implementation of the lower bound binary search.
If you need still better performance, you can use a trie data structure. After traversing the trie and finding the node corresponding to input in O(L) (may depend on the trie implementation), you can continue down from this node to retrieve the first numberOfSuggest results. The query time doesn't depend on N at all, so you can use this method with millions of company names.
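The trie idea could be sketched roughly as follows. This is only an illustration, not a tuned implementation: TrieNode and TrieSuggestService are hypothetical names, children are kept in a TreeMap so that suggestions come out in lexicographic order, and duplicate company names collapse into one entry.

```scala
import scala.collection.immutable.TreeMap

// Hypothetical immutable trie node; children are sorted by character
// so that words are enumerated in lexicographic order.
final case class TrieNode(
    children: TreeMap[Char, TrieNode] = TreeMap.empty[Char, TrieNode],
    isWord: Boolean = false
) {
  // Returns a new trie that also contains `word` (from index `i` onward).
  def insert(word: String, i: Int = 0): TrieNode =
    if (i == word.length) copy(isWord = true)
    else {
      val child = children.getOrElse(word(i), TrieNode())
      copy(children = children.updated(word(i), child.insert(word, i + 1)))
    }

  // Walks down the trie along `prefix`; None if no stored word starts with it.
  def find(prefix: String): Option[TrieNode] =
    prefix.foldLeft(Option(this))((node, c) => node.flatMap(_.children.get(c)))

  // Lazily enumerates every word stored at or below this node.
  // `prefix` is the string that leads from the root to this node.
  def words(prefix: String): Iterator[String] = {
    val here = if (isWord) Iterator.single(prefix) else Iterator.empty
    here ++ children.iterator.flatMap { case (c, child) => child.words(prefix + c) }
  }
}

class TrieSuggestService(companyNames: Seq[String]) {
  // Built once, up front; queries never touch the original sequence.
  private val root = companyNames.foldLeft(TrieNode())((trie, name) => trie.insert(name))

  def suggest(input: String, numberOfSuggest: Int): Seq[String] =
    root.find(input).fold(Seq.empty[String])(_.words(input).take(numberOfSuggest).toList)
}
```

Because words returns an Iterator, take(numberOfSuggest) stops the traversal as soon as enough names have been produced.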

Related

Data analysis on a subset with scala

I'm new to learning Scala and exploring the ways it can do things, and am now trying to learn to implement some slightly more sophisticated data analysis.
I have weather data for various cities in different countries in a text file loaded into the program. I have so far figured out how to calculate simple things like the average temperature across a country per day, or the average temperature of each city grouped by country across the whole file, using Maps/Mapvalues to bind keys to the values I'm looking for.
Now would like to be able to specify a time window (say, a week) and, from there, grouped by country, figure out things like the average temperature of each city in that time window. For simplicity, I've made the dates simple INTs rather than go with MM/DD/YY format.
In another language I would likely go for loops to do this, but I'm not quite sure of the best "Scala" way to do this. At first I thought maybe "sliding" or "grouped," but have found this would split the list entirely and therefore I could not specify an arbitrary day to calculate the week from. I've included example code for my method, which calculates the average temperature per city over the whole time period:
def citytempaverages(): Map[String, Map[String, Double]] = {
weatherpatterns.groupBy(_.country)
.mapValues(_.groupBy(_.city)
.mapValues(cityavg => cityavg.map(_.temperature).sum / cityavg.length))
}
Does it even still make sense to use maps for this new problem, or perhaps another method in the collections API is more suited?
UPDATE #1: so I've built a collection like so:
def dailycities(): Map[Int, Map[String,Map[String, List[Double]]]] = {
weatherpatterns.groupBy(_.day)
.mapValues(_.groupBy(_.country).mapValues(_.groupBy(_.city)
.mapValues(_.map(_.temperature))))
}
And then created a new map using filterKeys and the Set function to give me back just a list of the days I'm looking for. So I suppose now it's just a matter of formatting to get the averages out correctly.
I wouldn't call it the best way in Scala to do this. Rather, any way that minimizes iteration is the best, in my opinion, in that case:
def averageOfDay(country: String, city: String, day: Int) = {
val temps = weatherPatterns.collect {
case WeatherPattern(`day`, `country`, `city`, temp) => temp
}
temps.sum / temps.length
}
Edit
I just noticed you mainly need an operation that calculates averages for all cities and countries. In that case I'd say that, instead of forming the hierarchical relationship of country -> city -> temp in every operation, you'd rather build the hierarchy once beforehand and then operate on that:
case class DailyTemperature(day: Int, temperature: Double)
object DailyTemperature {
def sequence(patterns: List[WeatherPattern]): List[DailyTemperature] =
patterns.map(p => DailyTemperature(p.day,p.temperature))
}
case class CityTempInfo(city: String, dailyTemperatures: List[DailyTemperature])
object CityTempInfo {
def sequence(patterns: List[WeatherPattern]): List[CityTempInfo] =
patterns.groupBy(_.city).map {
case (city, ps) => CityTempInfo(city,DailyTemperature.sequence(ps))
}.toList
}
case class CountryTempInfo(country: String, citiesInfo: List[CityTempInfo])
object CountryTempInfo {
def sequence(patterns: List[WeatherPattern]) =
patterns.groupBy(_.country).map {
case (country, ps) => CountryTempInfo(country, CityTempInfo.sequence(ps))
}.toList
}
Now, to get your tree of country -> city -> temp, call CountryTempInfo.sequence and feed it your list of WeatherPatterns. Any other method you want to operate on DailyTemperature, CityTempInfo, or CountryTempInfo can be defined on their respective classes.
I am not sure what exactly you mean when you say that you use "simple ints" for the date, but if it is something sensible, like, for instance "days since epoch", you could fairly easily come up with a grouping function, that maps to weeks:
def weekOf(n: Int, start: Int) = (start + n) / 7
patterns
.groupBy { p => (weekOf(p.day, startDay), p.country, p.city) }
.mapValues(_.map(_.temperature))
.mapValues { s => s.sum / s.size }
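For an arbitrary window rather than fixed weeks, filtering before grouping may be enough. The sketch below assumes a WeatherPattern record with the field order used in the pattern match earlier (day, country, city, temperature); the object and method names are made up for illustration.

```scala
// Assumed shape of the record, inferred from the pattern match used earlier.
final case class WeatherPattern(day: Int, country: String, city: String, temperature: Double)

object WindowAverages {
  // Average temperature per (country, city) over the days
  // firstDay until firstDay + length (end exclusive).
  def averages(
      patterns: Seq[WeatherPattern],
      firstDay: Int,
      length: Int = 7
  ): Map[(String, String), Double] =
    patterns
      .filter(p => p.day >= firstDay && p.day < firstDay + length)
      .groupBy(p => (p.country, p.city))
      .map { case (key, ps) => key -> (ps.map(_.temperature).sum / ps.size) }
}
```

Cities with no readings inside the window simply do not appear in the result, so there is no division by zero.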

Migrate from MurmurHash to MurmurHash3

In Scala 2.10, MurmurHash is deprecated for some reason, and the deprecation message says I should use MurmurHash3 now. But the API is different, and there are no useful scaladocs for MurmurHash3 -> fail.
For instance, current code:
trait Foo {
type Bar
def id: Int
def path: Bar
override def hashCode = {
import util.MurmurHash._
var h = startHash(2)
val c = startMagicA
val k = startMagicB
h = extendHash(h, id, c, k)
h = extendHash(h, path.##, nextMagicA(c), nextMagicB(k))
finalizeHash(h)
}
}
How would I do this using MurmurHash3 instead? This needs to be a fast operation, preferably without allocations, so I do not want to construct a Product, Seq, Array[Byte] or whatever MurmurHash3 seems to be offering me.
The MurmurHash3 algorithm was changed, confusingly, from an algorithm that mixed in its own salt, essentially (c and k), to one that just does more bit-mixing. The basic operation is now mix, which you should fold over all your values, after which you should finalizeHash (the Int argument for length is for convenience also, to help with distinguishing collections of different length). If you want to replace your last mix by mixLast, it's a little faster and removes redundancy with finalizeHash. If it takes you too long to detect what the last mix is, just mix.
Typically for a collection you'll want to mix in an extra value to indicate what type of collection it is.
So minimally you'd have
override def hashCode = finalizeHash(mixLast(id, path.##), 0)
and "typically" you'd
// Pick any string or number that suits you, put in companion object
val fooSeed = MurmurHash3.stringHash("classOf[Foo]")
// I guess "id" plus "path" is two things?
override def hashCode = finalizeHash(mixLast( mix(fooSeed,id), path.## ), 2)
Note that the length field is NOT there to give a high-quality hash that mixes in that number. All mixing of important hash values should be done with mix.
Looking at the source code of MurmurHash3 suggests something like this:
override def hashCode = {
import util.hashing.MurmurHash3._
val h = symmetricSeed // I'm not sure which seed to use here
val h1 = mix(h, id)
val h2 = mixLast(h1, path.##)
finalizeHash(h2, 2)
}
or, in (almost) one line:
import util.hashing.MurmurHash3._
override def hashCode = finalizeHash(mix(mix(symmetricSeed, id), path.##), 2)
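Putting these pieces together, a self-contained version might look like the sketch below. The concrete Foo class with a String path and the seed taken from stringHash("Foo") are assumptions made for illustration; only mix, mixLast, finalizeHash and stringHash come from scala.util.hashing.MurmurHash3.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical concrete version of the question's Foo.
final class Foo(val id: Int, val path: String) {
  // Mixes both fields without allocating any intermediate collection.
  override def hashCode: Int = {
    val h1 = MurmurHash3.mix(Foo.seed, id)
    val h2 = MurmurHash3.mixLast(h1, path.##)
    MurmurHash3.finalizeHash(h2, 2) // 2 = number of values mixed in
  }

  // equals must agree with hashCode: the same fields decide both.
  override def equals(other: Any): Boolean = other match {
    case that: Foo => id == that.id && path == that.path
    case _         => false
  }
}

object Foo {
  // Computed once; any fixed Int would do as a type-specific seed.
  val seed: Int = MurmurHash3.stringHash("Foo")
}
```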

Building an immutable List based on conditions

I have to build a list whose members should be included or not based on a different condition for each.
Let's say I have to validate a purchase order and, depending on the price, I have to notify a number of people: if the price is more than 10, the supervisor has to be notified. If the price is more than 100 then both the supervisor and the manager. If the price is more than 1000 then the supervisor, the manager, and the director.
My function should take a price as input and output a list of people to notify. I came up with the following:
def whoToNotify(price:Int) = {
addIf(price>1000, "director",
addIf(price>100, "manager",
addIf(price>10, "supervisor", Nil)
)
)
}
def addIf[A](condition:Boolean, elem:A, list:List[A]) = {
if(condition) elem :: list else list
}
Are there better ways to do this in plain Scala? Am I reinventing some wheel here with my addIf function?
Please note that the check on price is just a simplification. In real life, the checks would be more complex, on a number of database fields, and including someone in the organizational hierarchy will not imply including all the people below, so truncate-a-list solutions won't work.
EDIT
Basically, I want to achieve the following, using immutable lists:
def whoToNotify(po:PurchaseOrder) = {
val people = scala.collection.mutable.Buffer[String]()
if(directorCondition(po)) people += "director"
if(managerCondition(po)) people += "manager"
if(supervisorCondition(po)) people += "supervisor"
people.toList
}
You can use List#flatten() to build a list from subelements. It would even let you add two people at the same time (I'll do that for the managers in the example below):
def whoToNotify(price:Int) =
List(if (price > 1000) List("director") else Nil,
if (price > 100) List("manager1", "manager2") else Nil,
if (price > 10) List("supervisor") else Nil).flatten
Well, it's a matter of style, but I would do it this way to make all the conditions easier to manage:
case class Condition(price: Int, designation: String)
val list = List(
Condition(10, "supervisor"),
Condition(100, "manager") ,
Condition(1000, "director")
)
def whoToNotify(price: Int) = {
list.filter(_.price < price).map(_.designation)
}
You can accommodate all your conditions in Condition class and filter function as per your requirements.
Well, it is a matter of style, but I would prefer keeping a list of people to notify with rules rather than function nesting. I don't see much value in having something like addIf in above example.
My solution.
val notifyAbovePrice = List(
(10, "supervisor"),
(100, "manager"),
(1000, "director"))
def whoToNotify(price: Int): Seq[String] = {
notifyAbovePrice.takeWhile(price > _._1).map(_._2)
}
In the real world, you may have objects in notifyAbovePrice instead of tuples, and use filter instead of takeWhile if there is no order or the order doesn't imply notifications at lower levels.
If you have an original list of members, you would probably want to consider using the filter method. If you also want to transform the member object so as to have a different type of list at the end, take a look at the collect method, which takes a partial function.
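As a rough sketch of the collect approach: the Member case class, its fields and the sample data below are hypothetical stand-ins for whatever the real member objects look like.

```scala
object NotifyWithCollect {
  // Hypothetical member type; a real one would carry database-backed fields.
  final case class Member(name: String, role: String, threshold: Int)

  val members: List[Member] = List(
    Member("Ann", "director", 1000),
    Member("Bob", "manager", 100),
    Member("Cid", "supervisor", 10)
  )

  // collect filters and transforms in one pass: it keeps the members whose
  // threshold is exceeded and returns only their roles.
  def whoToNotify(price: Int): List[String] =
    members.collect { case Member(_, role, threshold) if price > threshold => role }
}
```

Each member is tested independently, so including someone does not imply including anyone else in the hierarchy.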

What are good examples of: "operation of a program should map input values to output values rather than change data in place"

I came across this sentence in Scala in explaining its functional behavior.
operation of a program should map input of values to output values rather than change data in place
Could somebody explain it with a good example?
Edit: Please explain the sentence above, or give an example of it, in its context. Please keep it simple so that it does not add to the confusion.
The most obvious pattern that this is referring to is the difference between how you would write code which uses collections in Java when compared with Scala. If you were writing scala but in the idiom of Java, then you would be working with collections by mutating data in place. The idiomatic scala code to do the same would favour the mapping of input values to output values.
Let's have a look at a few things you might want to do to a collection:
Filtering
In Java, if I have a List<Trade> and I am only interested in those trades executed with Deutsche Bank, I might do something like:
for (Iterator<Trade> it = trades.iterator(); it.hasNext();) {
Trade t = it.next();
if (t.getCounterparty() != DEUTSCHE_BANK) it.remove(); // MUTATION
}
Following this loop, my trades collection only contains the relevant trades. But, I have achieved this using mutation - a careless programmer could easily have missed that trades was an input parameter, an instance variable, or is used elsewhere in the method. As such, it is quite possible their code is now broken. Furthermore, such code is extremely brittle for refactoring for this same reason; a programmer wishing to refactor a piece of code must be very careful to not let mutated collections escape the scope in which they are intended to be used and, vice-versa, that they don't accidentally use an un-mutated collection where they should have used a mutated one.
Compare with Scala:
val db = trades filter (_.counterparty == DeutscheBank) //MAPPING INPUT TO OUTPUT
This creates a new collection! It doesn't affect anyone who is looking at trades and is inherently safer.
Mapping
Suppose I have a List<Trade> and I want to get a Set<Stock> for the unique stocks which I have been trading. Again, the idiom in Java is to create a collection and mutate it.
Set<Stock> stocks = new HashSet<Stock>();
for (Trade t : trades) stocks.add(t.getStock()); //MUTATION
Using scala the correct thing to do is to map the input collection and then convert to a set:
val stocks = (trades map (_.stock)).toSet //MAPPING INPUT TO OUTPUT
Or, if we are concerned about performance:
(trades.view map (_.stock)).toSet
(trades.iterator map (_.stock)).toSet
What are the advantages here? Well:
My code can never observe a partially-constructed result
The application of a function A => B to a Coll[A] to get a Coll[B] is clearer.
Accumulating
Again, in Java the idiom has to be mutation. Suppose we are trying to sum the decimal quantities of the trades we have done:
BigDecimal sum = BigDecimal.ZERO;
for (Trade t : trades) {
sum.add(t.getQuantity()); //MUTATION
}
Again, we must be very careful not to accidentally observe a partially-constructed result! In scala, we can do this in a single expression:
val sum = (0 /: trades)(_ + _.quantity) //MAPPING INPUT TO OUTPUT
Or the various other forms:
trades.foldLeft(0)(_ + _.quantity)
(trades.iterator map (_.quantity)).sum
(trades.view map (_.quantity)).sum
Oh, by the way, there is a bug in the Java implementation! Did you spot it?
I'd say it's the difference between:
var counter = 0
def updateCounter(toAdd: Int): Unit = {
counter += toAdd
}
updateCounter(8)
println(counter)
and:
val originalValue = 0
def addToValue(value: Int, toAdd: Int): Int = value + toAdd
val firstNewResult = addToValue(originalValue, 8)
println(firstNewResult)
This is a gross oversimplification, but fuller examples are things like using a foldLeft to build up a result rather than doing the hard work yourself: foldLeft example
What it means is that if you write pure functions like this you always get the same output from the same input, and there are no side effects, which makes it easier to reason about your programs and ensure that they are correct.
so for example the function:
def times2(x:Int) = x*2
is pure, while
def add5ToList(xs: MutableList[Int]) {
xs += 5
}
is impure because it edits data in place as a side effect. This is a problem because that same list could be in use elsewhere in the program and now we can't guarantee the behaviour because it has changed.
A pure version would use immutable lists and return a new list
def add5ToList(xs: List[Int]) = {
5::xs
}
There are plenty of examples with collections, which are easy to come by but might give the wrong impression. This concept works at all levels of the language (it doesn't at the VM level, however). One example is case classes. Consider these two alternatives:
// Java-style
class Person(initialName: String, initialAge: Int) {
def this(initialName: String) = this(initialName, 0)
private var name = initialName
private var age = initialAge
def getName = name
def getAge = age
def setName(newName: String) { name = newName }
def setAge(newAge: Int) { age = newAge }
}
val employee = new Person("John")
employee.setAge(40) // we changed the object
// Scala-style
case class Person(name: String, age: Int) {
def this(name: String) = this(name, 0)
}
val employee = new Person("John")
val employeeWithAge = employee.copy(age = 40) // employee still exists!
This concept is applied in the construction of the immutable collections themselves: a List never changes. Instead, new List objects are created when necessary. The use of persistent data structures reduces the copying that this would otherwise require.
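A small sketch of that sharing, using reference equality (eq) to show that prepending to a List reuses the existing list as the tail instead of copying it. The object and value names are only for illustration.

```scala
object SharingDemo {
  val original: List[Int] = List(2, 3, 4)

  // Prepending allocates exactly one new cell that points at `original`.
  val extended: List[Int] = 1 :: original

  // `eq` compares references: the tail of `extended` is the very same
  // object as `original`, not a copy of it.
  val sharesTail: Boolean = extended.tail eq original
}
```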

When imperative style fits better?

From the Programming in Scala (second edition), bottom of the p.98:
A balanced attitude for Scala programmers
Prefer vals, immutable objects, and methods without side effects.
Reach for them first. Use vars, mutable objects, and methods with side effects when you have a specific need and justification for them.
It is explained on previous pages why to prefer vals, immutable objects, and methods without side effects so this sentence makes perfect sense.
But second sentence:"Use vars, mutable objects, and methods with side effects when you have a specific need and justification for them." is not explained so well.
So my question is:
What counts as a justification or specific need to use vars, mutable objects and methods with side effects?
P.s.: It would be great if someone could provide some examples for each of those (besides explanation).
In many cases functional programming increases the level of abstraction and hence makes your code more concise and easier/faster to write and understand. But there are situations where the resulting bytecode cannot be as optimized (fast) as for an imperative solution.
Currently (Scala 2.9.1) one good example is summing up ranges:
(1 to 1000000).foldLeft(0)(_ + _)
Versus:
var x = 1
var sum = 0
while (x <= 1000000) {
sum += x
x += 1
}
If you profile these you will notice a significant difference in execution speed. So sometimes performance is a really good justification.
Ease of Minor Updates
One reason to use mutability is if you're keeping track of some ongoing process. For example, let's suppose I am editing a large document and have a complex set of classes to keep track of the various elements of the text, the editing history, the cursor position, and so on. Now suppose the user clicks on a different part of the text. Do I recreate the document object, copying many fields but not the EditState field; recreate the EditState with new ViewBounds and documentCursorPosition? Or do I alter a mutable variable in one spot? As long as thread safety is not an issue, it is much simpler and less error-prone to just update a variable or two than to copy everything. If thread safety is an issue, then protecting from concurrent access may be more work than using the immutable approach and dealing with out-of-date requests.
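To make the trade-off concrete, here is a minimal sketch of both styles. Only the EditState name comes from the text above; Document and its fields, the cursor and viewTop fields, and MutableEditState are invented for illustration.

```scala
object EditorDemo {
  // Hypothetical, heavily simplified document model.
  final case class EditState(cursor: Int, viewTop: Int)
  final case class Document(text: String, history: List[String], editState: EditState)

  // Immutable update: every object on the path from the root to the
  // changed field has to be rebuilt with copy.
  def moveCursor(doc: Document, pos: Int): Document =
    doc.copy(editState = doc.editState.copy(cursor = pos))

  // Mutable alternative: the change is a single assignment in one spot.
  final class MutableEditState(var cursor: Int, var viewTop: Int)
}
```

With deeper nesting the copy chain grows by one level per layer, while the mutable update stays a one-liner.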
Computational efficiency
Another reason to use mutability is for speed. Object creation is cheap, but simple method calls are cheaper, and operations on primitive types are cheaper yet.
Let's suppose, for example, that we have a map and we want to sum the values and the squares of the values.
val xs = List.range(1,10000).map(x => x.toString -> x).toMap
val sum = xs.values.sum
val sumsq = xs.values.map(x => x*x).sum
If you do this every once in a while, it's no big deal. But if you pay attention to what's going on, for every list element you first recreate it (values), then sum it (boxed), then recreate it again (values), then recreate it yet again in squared form with boxing (map), then sum it. This is at least six object creations and five full traversals just to do two adds and one multiply per item. Incredibly inefficient.
You might try to do better by avoiding the repeated traversals and passing through the map only once, using a fold:
val (sum,sumsq) = ((0,0) /: xs){ case ((sum,sumsq),(_,v)) => (sum + v, sumsq + v*v) }
And this is much better, with about 15x better performance on my machine. But you still have three object creations every iteration. If instead you write
case class SSq(var sum: Int = 0, var sumsq: Int = 0) {
def +=(i: Int) { sum += i; sumsq += i*i }
}
val ssq = SSq()
xs.foreach(x => ssq += x._2)
you're about twice as fast again because you cut the boxing down. If you have your data in an array and use a while loop, then you can avoid all object creation and boxing and speed up by another factor of 20.
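The array-plus-while-loop variant mentioned above could look roughly like this. It is a sketch (ArraySums and sums are made-up names), and it makes no claim about the exact speed-up on any particular machine.

```scala
object ArraySums {
  // Returns (sum, sum of squares) in a single pass. The loop itself works
  // only on primitive Ints; the one allocation is the result tuple.
  def sums(xs: Array[Int]): (Int, Int) = {
    var i = 0
    var sum = 0
    var sumsq = 0
    while (i < xs.length) {
      val v = xs(i)
      sum += v
      sumsq += v * v
      i += 1
    }
    (sum, sumsq)
  }
}
```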
Now, that said, you could also have chosen a recursive function for your array:
val ar = Array.range(0,10000)
def suma(xs: Array[Int], start: Int = 0, sum: Int = 0, sumsq: Int = 0): (Int,Int) = {
if (start >= xs.length) (sum, sumsq)
else suma(xs, start+1, sum+xs(start), sumsq + xs(start)*xs(start))
}
and written this way it's just as fast as the mutable SSq. But if we instead do this:
def sumb(xs: Array[Int], start: Int = 0, ssq: (Int,Int) = (0,0)): (Int,Int) = {
if (start >= xs.length) ssq
else sumb(xs, start+1, (ssq._1+xs(start), ssq._2 + xs(start)*xs(start)))
}
we're now 10x slower again because we have to create an object on each step.
So the bottom line is that it really only matters that you have immutability when you cannot conveniently carry your updating structure along as independent arguments to a method. Once you go beyond the complexity where that works, mutability can be a big win.
Cumulative Object Creation
If you need to build up a complex object with n fields from potentially faulty data, you can use a builder pattern that looks like so:
abstract class Built {
def x: Int
def y: String
def z: Boolean
}
private class Building extends Built {
var x: Int = _
var y: String = _
var z: Boolean = _
}
def buildFromWhatever: Option[Built] = {
val b = new Building
b.x = something
if (thereIsAProblem) return None
b.y = somethingElse
// check
...
Some(b)
}
This only works with mutable data. There are other options, of course:
case class Built(x: Int = 0, y: String = "", z: Boolean = false)
def buildFromWhatever: Option[Built] = {
val b0 = new Built
val b1 = b0.copy(x = something)
if (thereIsAProblem) return None
...
Some(bN)
}
which in many ways is even cleaner, except you have to copy your object once for each change that you make, which can be painfully slow. And neither of these are particularly bulletproof; for that you'd probably want
class Built(val x: Int, val y: String, val z: Boolean) {}
case class Building(
val x: Option[Int] = None, val y: Option[String] = None, val z: Option[Boolean] = None
) {
def build: Option[Built] = for (x0 <- x; y0 <- y; z0 <- z) yield new Built(x0, y0, z0)
}
def buildFromWhatever: Option[Built] = {
val b0 = new Building
val b1 = b0.copy(x = somethingIfNotProblem)
...
bN.build
}
but again, there's lots of overhead.
I've found that imperative / mutable style is a better fit for dynamic programming algorithms. If you insist on immutability, it's harder to program for most people, and you end up using vast amounts of memory and / or overflowing the stack. One example: Dynamic programming in the functional paradigm
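As an illustration of the kind of algorithm meant here, the following sketch computes the length of the longest common subsequence of two strings by filling a mutable table in place. The choice of problem and the names are mine, not taken from the linked question.

```scala
object Lcs {
  // table(i)(j) holds the LCS length of the first i chars of `a`
  // and the first j chars of `b`; row 0 and column 0 stay at 0.
  def length(a: String, b: String): Int = {
    val table = Array.ofDim[Int](a.length + 1, b.length + 1)
    for (i <- 1 to a.length; j <- 1 to b.length) {
      table(i)(j) =
        if (a(i - 1) == b(j - 1)) table(i - 1)(j - 1) + 1
        else math.max(table(i - 1)(j), table(i)(j - 1))
    }
    table(a.length)(b.length)
  }
}
```

The mutation is confined to a local array that never escapes the method, so callers still see a pure function.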
Some examples:
(Originally a comment) Any program has to do some input and output (otherwise, it's useless). But by definition, input/output is a side effect and can't be done without calling methods with side effects.
One major advantage of Scala is the ability to use Java libraries. Many of them rely on mutable objects and methods with side effects.
Sometimes you need a var due to scoping. See Temperature4 in this blog post for an example.
Concurrent programming. If you use actors, sending and receiving messages are a side effect; if you use threads, synchronizing on locks is a side effect and locks are mutable; event-driven concurrency is all about side effects; futures, concurrent collections, etc. are mutable.