Data structure to represent mapping of intervals - scala

Here is a function which deduces the status of a person given their age:
def getStatus(age: Int): String = {
  age match {
    case age if 0 until 2 contains age => "infant"
    case age if 2 until 10 contains age => "child"
    case age if 10 until 18 contains age => "teen"
    case _ => "adult"
  }
}
Let's say the boundaries can change: we might decide a person is considered an infant until 3 years old. Because they change, we do not want the boundaries to be hard-coded; they will be stored externally.
What is a data structure that can store mappings based on an interval?
Something like
val intervalMap = IntervalMap(
  (0, 2)    -> "infant",
  (2, 10)   -> "child",
  (10, 18)  -> "teen",
  (18, 200) -> "adult"
)
intervalMap(1) // "infant"
intervalMap(12) // "teen"
I'm developing in Scala, but a language-agnostic answer would be much appreciated.

Easy Answer
There isn't anything in the Scala standard library that does that, but if the number of "categories" is low, as in your example, there's no harm in implementing a naive O(N) apply method on your IntervalMap class:
def apply(in: Int) = categories.collectFirst {
  case ((min, max), value) if in >= min && in < max => value
}
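For instance, a minimal self-contained sketch might look like this (the case class and its Option-returning apply are illustrative, not a standard-library API):

case class IntervalMap[V](categories: ((Int, Int), V)*) {
  // Naive linear scan over half-open intervals [min, max).
  def apply(in: Int): Option[V] = categories.collectFirst {
    case ((min, max), value) if in >= min && in < max => value
  }
}

val intervalMap = IntervalMap(
  (0, 2)    -> "infant",
  (2, 10)   -> "child",
  (10, 18)  -> "teen",
  (18, 200) -> "adult"
)

intervalMap(1)  // Some("infant")
intervalMap(12) // Some("teen")

Returning Option avoids a MatchError for inputs outside every interval; call .getOrElse on the result if you want a default.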
Guava
Looks like the Guava library has a RangeMap class that seems to fit your use-case.
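For example, a hedged sketch from memory of Guava's API (boxing is needed because RangeMap keys must be Comparable):

import com.google.common.collect.{Range, TreeRangeMap}

val statusMap = TreeRangeMap.create[Integer, String]()
statusMap.put(Range.closedOpen(Int.box(0), Int.box(2)), "infant")
statusMap.put(Range.closedOpen(Int.box(2), Int.box(10)), "child")
statusMap.put(Range.closedOpen(Int.box(10), Int.box(18)), "teen")
statusMap.put(Range.closedOpen(Int.box(18), Int.box(200)), "adult")

statusMap.get(Int.box(1))   // "infant"
statusMap.get(Int.box(12))  // "teen"
statusMap.get(Int.box(999)) // null: no range contains the key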
Fancier DIY Idea
To get an O(log N) lookup characteristic, you could represent your category data as a binary tree:
- Each node defines a min and max.
- The root node represents the absolute minimum to the absolute maximum, e.g. Int.MinValue to Int.MaxValue.
- Leaf nodes define a value (e.g. "child").
- Non-leaf nodes define a split value, where the left child's max is equal to the split and the right child's min is equal to the split.
- Find values in the tree by traversing left/right depending on whether your input number (e.g. age) is less than or greater than the current node's split.
You'd have to deal with balancing the tree as it gets built... and TBH this is probably how Guava's doing it under the hood (I did not look into the implementation)
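A minimal sketch of that idea (hand-balanced for brevity; each node's min/max are implicit in the splits, and a real implementation would balance the tree as it is built):

sealed trait IntervalTree
case class Bucket(value: String) extends IntervalTree
case class Split(at: Int, left: IntervalTree, right: IntervalTree) extends IntervalTree

// O(log N) lookup: go left when the key is below the split, right otherwise.
@annotation.tailrec
def lookup(tree: IntervalTree, key: Int): String = tree match {
  case Bucket(v)              => v
  case Split(at, left, right) => if (key < at) lookup(left, key) else lookup(right, key)
}

// Root spans Int.MinValue to Int.MaxValue, with splits at 2, 10 and 18.
val ageTree = Split(10,
  Split(2, Bucket("infant"), Bucket("child")),
  Split(18, Bucket("teen"), Bucket("adult")))

lookup(ageTree, 1)  // "infant"
lookup(ageTree, 12) // "teen"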

Related

Random value in Tree (Scala)

I'm working with the Tree.scala defined in Functional Programming in Scala. I have implemented all the functions I needed, but there is one more I'm trying to write. The idea is to generate, with a home-made RNG, a value between 1 and the number of leaves in the Tree (the count function), where the generated value represents the position of a leaf. So, for example, if the generated value is 3, I will return the leaf at the 3rd position from left to right.
Data structure of a Tree
So I've made this function (where n is the generated random number):
def loop[A](tree: Tree[A], n: Int): A = tree match {
  case Leaf(a) => if (n == 0) a else loop(tree, n - 1)
  case Branch(left, right) => Branch(loop(left, n), loop(right, n))
}
The idea behind this is: if n > 0 and we are on a leaf, we can subtract one from the position; if n == 0 we return the value; and if we are on a branch we continue to traverse the tree. But it's not working and I'm out of ideas; I also tried some other things before, but without the expected result.
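For reference, a hedged sketch of one way to make the traversal work, assuming the book's Tree definition and the right-biased Either of Scala 2.12+: thread the remaining count through the recursion and report back how many leaves each subtree consumed. Positions are 1-based here, matching "a value between 1 and the number of leaves".

sealed trait Tree[+A]
case class Leaf[A](value: A) extends Tree[A]
case class Branch[A](left: Tree[A], right: Tree[A]) extends Tree[A]

def nthLeaf[A](tree: Tree[A], n: Int): A = {
  // Right(a) means "found"; Left(rest) means "not in this subtree, rest leaves still to skip".
  def loop(t: Tree[A], n: Int): Either[Int, A] = t match {
    case Leaf(a) =>
      if (n == 1) Right(a) else Left(n - 1)
    case Branch(left, right) =>
      loop(left, n) match {
        case Right(a)   => Right(a)          // found in the left subtree
        case Left(rest) => loop(right, rest) // keep looking on the right
      }
  }
  loop(tree, n).getOrElse(sys.error(s"position $n is out of bounds"))
}

val t = Branch(Branch(Leaf("a"), Leaf("b")), Leaf("c"))
nthLeaf(t, 3) // "c"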

Finding the average of Option values containing an object in a list

I'm new to Scala and I have been struggling with Option and Lists. I have the following object:
object Person {
  case class Person(fName: String,
                    lName: String,
                    jeans: Option[Jeans],
                    jacket: Option[Jacket],
                    location: List[Locations],
                    age: Int)
  case class Jeans(brand: String, price: Int, color: String)
  ...
}
And I'm trying to write the function that takes a list of Person as input and returns the average price of their jeans:
def avgPriceJeans(input: List[Person]): Int
When you have a list of values and want to reduce them to a single value by applying some operation, you need a fold; the most common one is foldLeft.
As you can see in the Scaladoc, this method receives an initial value and a combination function.
It should be obvious that the initial value should be a zero, and that the combination function should take the current accumulator and add the price of the current jeans to it.
Nevertheless, we have another problem: the jeans may or may not exist, which is why we use Option. We need a way to say: if they exist, give me their price; if not, give me a default value (which in this case makes sense to be another zero).
And that is precisely what Option.fold gives us.
Thus we end up with something like:
val sum = input.foldLeft(0) { (acc, person) =>
  acc + person.jeans.fold(ifEmpty = 0)(_.price)
}
Now that you have the sum, you only need to divide it by the count to get the average.
However, we can compute the count in the same foldLeft, to avoid an extra iteration.
(I changed the return type, as well as the price property, to Double to ensure accurate results).
def avgPriceJeans(input: List[Person]): Double = {
  val (sum, count) = input.foldLeft((0.0d, 0)) {
    case ((accSum, accCount), person) =>
      (accSum + person.jeans.fold(ifEmpty = 0.0d)(_.price), accCount + 1)
  }
  sum / count
}
As @SethTissue points out, this is relatively straightforward:
val prices = persons.flatMap(_.jeans.map(_.price))
prices.sum.toDouble / prices.length
The first line needs some unpicking. Taking it from the inside out:
- The expression jeans.map(_.price) takes the value of jeans, which is Option[Jeans], and extracts the price field to give Option[Int].
- The flatMap call is equivalent to map followed by flatten.
- The map call applies this inner expression to the jeans field of each element of persons, turning the List[Person] into a List[Option[Int]].
- The flatten call extracts all the Int values from the Some[Int] elements and discards all the None values. This gives a List[Int] with one element for each Person that had a non-empty jeans field.
The second line simply sums the values in the List, converts it to Double, and then divides by the length of the list. Adding error checking is left as an exercise!
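To make the intermediate types concrete, here is an illustrative run with made-up data (with the question's case classes in scope, e.g. via import Person._):

val persons = List(
  Person("Ann", "A", Some(Jeans("SomeBrand", 100, "blue")), None, Nil, 30),
  Person("Bob", "B", None, None, Nil, 25)
)

persons.map(_.jeans.map(_.price))     // List(Some(100), None): a List[Option[Int]]
persons.flatMap(_.jeans.map(_.price)) // List(100): the None is discarded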
One approach consists in calculating the jeans price for each element of the list. After that you can sum all the values (with the sum method) and divide by the list size.
I handle the case when jeans is None by using 0 as the price, so such persons still count toward the average.
Here is the code:
def avgPriceJeans(input: List[Person]): Int =
  input.map(_.jeans.map(_.price).getOrElse(0)).sum / input.size

Understanding ScalaCheck's 'generation size'

ScalaCheck's Gen API docs explain sized:
def sized[T](f: (Int) ⇒ Gen[T]): Gen[T]
Creates a generator that can access its generation size
Looking at the following example:
import org.scalacheck.Gen.{sized, posNum}
scala> sized( s => { println(s); posNum[Int] })
res12: org.scalacheck.Gen[Int] = org.scalacheck.Gen$$anon$3#6a071cc0
scala> res12.sample
100
res13: Option[Int] = Some(12)
scala> res12.sample
100
res14: Option[Int] = Some(40)
What is the meaning of generation size, namely 100 in the above output?
sized provides access to the "size" parameter of ScalaCheck. This parameter indicates how "big" the values generated by a generator should be. It is useful in a couple of situations:
- You want to limit the amount of data generated, to make property evaluation, and thus the test run, faster.
- You need to fit generated data into external constraints, such as form validators that check the length of a string, or databases that put limits on columns.
- You need to generate recursive data structures that terminate at some point.
The companion of Gen.sized is Gen.resize, which lets you change the size of a generator, as in Gen.resize(10, Gen.alphaNumStr), which will generate an alphanumeric string of no more than ten characters.
Most built-in generators use sized in some way, for instance Gen.buildableOf (which is the underpinning of all generators for lists and containers):
[…] The size of the container is bounded by the size parameter used when generating values.
A simple example
To get an idea of how Gen.sized is used, take a look at the example in "Sized Generators" (Generators, ScalaCheck user guide):
def matrix[T](g: Gen[T]): Gen[Seq[Seq[T]]] = Gen.sized { size =>
  val side = scala.math.sqrt(size).asInstanceOf[Int]
  Gen.listOfN(side, Gen.listOfN(side, g))
}
This generator uses the "size" to limit the dimensions of the matrix so that the entire matrix will never have more entries than the "size" parameter. In other words, with a size of 100 as in your question, the generated matrix would have 10 rows and 10 columns, amounting to a total of 100 entries.
Recursive data structures
"size" is particularly useful to make sure that generators for recursive data structures terminate. Consider the following example which generates instances of a binary tree and uses size to limit the height of each branch to make sure that the generator terminates at some point:
import org.scalacheck.Gen
import org.scalacheck.Arbitrary.arbitrary

sealed abstract class Tree
case class Node(left: Tree, right: Tree, v: Int) extends Tree
case object Leaf extends Tree

val genLeaf = Gen.const(Leaf)

val genNode = for {
  v <- arbitrary[Int]
  left <- Gen.sized(h => Gen.resize(h / 2, genTree))
  right <- Gen.sized(h => Gen.resize(h / 2, genTree))
} yield Node(left, right, v)

def genTree: Gen[Tree] = Gen.sized { height =>
  if (height <= 0) {
    genLeaf
  } else {
    Gen.oneOf(genLeaf, genNode)
  }
}
Note how the generator for nodes recursively generates trees, but allows them only half of the "size". The tree generator, in turn, will only generate leaves once its size is exhausted. Thus the "size" of the generator is an upper bound for the height of the generated tree, ensuring that the generator terminates at some point and does not generate excessively large trees.
Note that the size only sets an upper bound for the height of the tree in this example. It does not affect the balancing of the generated tree or the likelihood of generating a tree of a certain depth. These depend solely on the bias defined in genTree.
With oneOf, each subtree has a 50% chance of being just a Leaf, ending the growth of the tree at that branch, which makes generating a complete tree that exhausts the "whole" size somewhat unlikely.
frequency (see below) lets you encode a different bias: in the example below, nodes are far more likely than leaves, so the generated tree is more likely to grow, but it's still unlikely to be complete.
Relation to Gen.frequency
Gen.frequency is for a different use case: you would not use it to limit the depth or size of a data structure, but to add a certain bias to a choice of generators. Take a look at the definition of Gen.option:
def option[T](g: Gen[T]): Gen[Option[T]] =
  frequency(1 -> const(None), 9 -> some(g))
This definition uses frequency to make the less-interesting case of None less likely than the more interesting case of Some.
In fact, we could combine Gen.sized and Gen.frequency in our binary tree example above to make genTree more likely to generate "interesting" nodes rather than "boring" leafs:
def genTree: Gen[Tree] = Gen.sized { height =>
  if (height <= 0) {
    genLeaf
  } else {
    Gen.frequency(1 -> genLeaf, 9 -> genNode)
  }
}
Generation size is a parameter that ScalaCheck feeds to every generator to indicate how "big" the generated values should be. The sized method simply lets you write generators that can see their own size, so you can use that information as a factor in what you generate.
For example, this generator (from this resource) produces two lists of numbers where 1/3 are positive and 2/3 are negative:
import org.scalacheck.Gen
import org.scalacheck.Prop.forAll

val myGen = Gen.sized { size =>
  val positiveNumbers = size / 3
  val negativeNumbers = size * 2 / 3
  for {
    posNumList <- Gen.listOfN(positiveNumbers, Gen.posNum[Int])
    negNumList <- Gen.listOfN(negativeNumbers, Gen.posNum[Int] map (n => -n))
  } yield (size, posNumList, negNumList)
}
check {
  forAll(myGen) { case (genSize, posN, negN) =>
    posN.length == genSize / 3 && negN.length == genSize * 2 / 3
  }
}
So, somewhat like zipWithIndex in Scala collections, sized just provides you with meta-information to help you do what you need to do.

How to write an efficient groupBy-size filter in Scala, can be approximate

Given a List[Int] in Scala, I wish to get the Set[Int] of all Ints which appear at least thresh times. I can do this using groupBy or foldLeft, then filter. For example:
val thresh = 3
val myList = List(1,2,3,2,1,4,3,2,1)
myList.foldLeft(Map[Int, Int]()) { case (m, i) =>
  m + (i -> (m.getOrElse(i, 0) + 1))
}.filter(_._2 >= thresh).keys
will give Set(1,2).
Now suppose the List[Int] is very large. How large is hard to say, but in any case this seems wasteful, because I don't care about each Int's exact frequency; I only care whether it reaches thresh. Once it has passed thresh there's no need to check any more; just add the Int to the Set[Int].
The question is: can I do this more efficiently for a very large List[Int],
a) if I need a true, accurate result (no room for mistakes)
b) if the result can be approximate, e.g. by using some hashing trick or a Bloom filter, where the Set[Int] might include some false positives, or where {the frequency of an Int > thresh} isn't really a Boolean but a Double in [0, 1].
First of all, you can't do better than O(N), as you need to check each element of the initial list at least once. Your current approach is O(N), presuming that operations with IntMap are effectively constant.
Now what you can try in order to increase efficiency:
- Update the map only when the current counter value is less than or equal to the threshold. This eliminates a huge number of the most expensive operations: map updates.
- Try a faster map instead of IntMap. If you know the values of the initial List fall within a fixed range, you can use an Array instead of IntMap (the index as the key). Another option is a mutable HashMap with sufficient initial capacity. As my benchmark shows, this actually makes a significant difference.
- As @ixx proposed, after incrementing a value in the map, check whether it equals thresh and in that case add it immediately to the result list. This saves one linear traversal (which appears not to be that significant for large inputs).
I don't see how any approximate solution can be faster (unless you ignore some elements at random); it will still be O(N).
Update
I created a microbenchmark to measure the actual performance of the different implementations. For sufficiently large input and output, Ixx's suggestion of immediately adding elements to the result list doesn't produce a significant improvement. However, a similar approach can be used to eliminate unnecessary map updates (which turn out to be the most expensive operation).
Results of the benchmarks (average run times on 1,000,000 elements, with pre-warming):
- Author's solution: 447 ms
- Ixx's solution: 412 ms
- Ixx's solution 2 (eliminated excessive map writes): 150 ms
- My solution: 57 ms
My solution uses a mutable HashMap instead of an immutable IntMap and includes all the other possible optimizations.
Ixx's updated solution:
val tuple = (Map[Int, Int](), List[Int]())
val res = myList.foldLeft(tuple) { case ((m, s), i) =>
  val count = m.getOrElse(i, 0) + 1
  (if (count <= thresh) m + (i -> count) else m, if (count == thresh) i :: s else s)
}
My solution:
import scala.collection.mutable
import scala.collection.mutable.ListBuffer

val map = new mutable.HashMap[Int, Int]()
val res = new ListBuffer[Int]
myList.foreach { i =>
  val c = map.getOrElse(i, 0) + 1
  if (c == thresh) {
    res += i // record i the moment it reaches the threshold
  }
  if (c <= thresh) {
    map(i) = c // stop updating the map once the threshold is passed
  }
}
The full microbenchmark source is available here.
You could use foldLeft to collect the matching items, like this:
val tuple = (Map[Int, Int](), List[Int]())
myList.foldLeft(tuple) { case ((m, s), i) =>
  val count = m.getOrElse(i, 0) + 1
  (m + (i -> count), if (count == thresh) i :: s else s)
}
I could measure a performance improvement of about 40% with a small list, so it's definitely an improvement...
Edited to use List and prepend, which takes constant time (see comments).
If by "more efficiently" you mean the space efficiency (in extreme case when the list is infinite), there's a probabilistic data structure called Count Min Sketch to estimate the frequency of items inside it. Then you can discard those with frequency below your threshold.
There's a Scala implementation from Algebird library.
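A hedged sketch of what that could look like (API names from memory of "com.twitter" %% "algebird-core"; they may differ slightly between versions):

import com.twitter.algebird.CMS

// eps and delta trade accuracy for memory; the last argument seeds the hash functions.
val cmsMonoid = CMS.monoid[Long](0.001, 1e-8, 1)
val cms = cmsMonoid.create(myList.map(_.toLong))

// A Count-Min Sketch never undercounts, so this can only add false positives;
// it never misses an element that truly occurs at least thresh times.
val approxResult = myList.toSet.filter(i => cms.frequency(i.toLong).estimate >= thresh)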
You can change your foldLeft example a bit, using a mutable.Set that is built incrementally and at the same time used as a filter while iterating over your Seq, by means of withFilter. However, because I'm using withFilter I cannot use foldLeft and have to make do with foreach and a mutable map:
import scala.collection.mutable

def getItems[A](in: Seq[A], threshold: Int): Set[A] = {
  val counts: mutable.Map[A, Int] = mutable.Map.empty
  val result: mutable.Set[A] = mutable.Set.empty
  in.withFilter(!result(_)).foreach { x =>
    counts.update(x, counts.getOrElse(x, 0) + 1)
    if (counts(x) >= threshold) {
      result += x
    }
  }
  result.toSet
}
So this discards items that have already been added to the result set while running through the Seq the first time, because withFilter filters the Seq inside the appended function (map, flatMap, foreach) rather than returning a filtered Seq.
EDIT:
I changed my solution to not use Seq.count, which was stupid, as Aivean correctly pointed out.
Using Aivean's microbenchmark I can see that it is still slightly slower than his approach, but still better than the author's first approach.
- Author's solution: 377 ms
- Ixx's solution: 399 ms
- Ixx's solution 2 (eliminated excessive map writes): 110 ms
- Sascha Kolberg's solution: 72 ms
- Aivean's solution: 54 ms

Combination of elements

Problem:
Given a Seq seq and an Int n, I basically want all combinations of the elements up to size n. The arrangement matters, meaning e.g. [1,2] is different from [2,1].
def combinations[T](seq: Seq[T], size: Int) = ...
Example:
combinations(List(1,2,3), 0)
//Seq(Seq())
combinations(List(1,2,3), 1)
//Seq(Seq(), Seq(1), Seq(2), Seq(3))
combinations(List(1,2,3), 2)
//Seq(Seq(), Seq(1), Seq(2), Seq(3), Seq(1,2), Seq(2,1), Seq(1,3), Seq(3,1),
//Seq(2,3), Seq(3,2))
...
What I have so far:
def combinations[T](seq: Seq[T], size: Int) = {
  @annotation.tailrec
  def inner(seq: Seq[T], soFar: Seq[Seq[T]]): Seq[Seq[T]] = seq match {
    case head +: tail => inner(tail, soFar ++ {
      val insertList = Seq(head)
      for {
        comb <- soFar
        if comb.size < size
        index <- 0 to comb.size
      } yield comb.patch(index, insertList, 0)
    })
    case _ => soFar
  }
  inner(seq, IndexedSeq(IndexedSeq.empty))
}
What would be your approach to this problem? This method will be called a lot, so it should be as efficient as possible.
There are methods in the library like subsets or combinations (yes, I chose the same name) which return iterators. I thought about that too, but I have no idea yet how to design this lazily.
Not sure if this is efficient enough for your purpose, but it's the simplest approach:
def combinations[T](seq: Seq[T], size: Int): Seq[Seq[T]] = {
  // Starting at 0 includes the empty Seq, matching the examples in the question.
  (0 to size).flatMap(i => seq.combinations(i).flatMap(_.permutations))
}
Edit: to make it lazy you can use a view:
def combinations[T](seq: Seq[T], size: Int): Iterable[Seq[T]] = {
  (0 to size).view.flatMap(i => seq.combinations(i).flatMap(_.permutations))
}
From the theory of permutations we know that the number of permutations of K elements taken from a set of N elements is
N! / (N - K)!
(see http://en.wikipedia.org/wiki/Permutation)
Therefore, if you want to build them all, you will have
algorithm complexity = number of permutations * cost of building each permutation
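As a quick sanity check (a hedged aside, not part of the original answer), summing these falling factorials for every size up to N reproduces the counts in the example from the question:

// Number of arrangements of up to `size` elements drawn from n, i.e. the sum of n!/(n-k)!.
def arrangementCount(n: Int, size: Int): BigInt =
  (0 to size).map(k => ((n - k + 1) to n).map(BigInt(_)).product).sum

arrangementCount(3, 2) // 10 = 1 empty + 3 singletons + 6 ordered pairs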
The potential optimization of the algorithm lies in minimizing the cost of building each permutation, by using a data structure that has some appending / prepending operation that runs in O(1).
You are using an IndexedSeq, a collection optimized for O(1) random access. Collections optimized for random access are backed by arrays, and when using them (this also holds for Java's ArrayList) you give up the guarantee of low-cost insertion: sometimes the array won't be big enough, and the collection will have to create a new one and copy all the elements.
When using linked lists instead (such as Scala's List, which is the default implementation for Seq), you make the opposite trade-off: you give up constant-time access for constant-time insertion. In particular, Scala's List is a recursive data structure with constant-time insertion at the front.
So if you are looking for high performance and you need the collection to be available eagerly, use Seq.empty instead of IndexedSeq.empty and at each iteration prepend the new element at the head of the Seq, as in the sketch below. If you need something lazy, use Stream, which will minimize memory use. An additional strategy worth exploring is to create an IndexedSeq for your first iteration, not through IndexedSeq.empty but through the builder, trying to provide an array of the right size (N! / (N - K)!).
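Applied to the code from the question, that advice might look like the following sketch (same algorithm, but the accumulator is a List and new results are prepended, so each cons is O(1)):

def combinationsList[T](seq: Seq[T], size: Int): Seq[Seq[T]] = {
  @annotation.tailrec
  def inner(rest: Seq[T], soFar: List[List[T]]): List[List[T]] = rest match {
    case head +: tail =>
      val grown = for {
        comb  <- soFar
        if comb.size < size
        index <- 0 to comb.size
      } yield comb.patch(index, List(head), 0)
      inner(tail, grown ::: soFar)
    case _ => soFar
  }
  inner(seq, List(Nil))
}

The sequences come out in a different order than in the original version, but the same set of arrangements is produced.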