Understanding ScalaCheck's 'generation size' - scala

ScalaCheck's Gen API docs explain lazy val sized:
def sized[T](f: (Int) ⇒ Gen[T]): Gen[T]
Creates a generator that can access its generation size
Looking at the following example:
import org.scalacheck.Gen.{sized, posNum}
scala> sized( s => { println(s); posNum[Int] })
res12: org.scalacheck.Gen[Int] = org.scalacheck.Gen$$anon$3#6a071cc0
scala> res12.sample
100
res13: Option[Int] = Some(12)
scala> res12.sample
100
res14: Option[Int] = Some(40)
What is the meaning of generation size, namely 100 in the above output?

sized provides access to the "size" parameter of ScalaCheck. This parameter indicates how "big" the values generated by a generator should be. It is useful in several situations:
You want to limit the size of the generated values to make data generation, and thus the test run, faster.
You need to fit generated data into external constraints, like form validators which check the length of a string, or databases which put limits on columns.
You need to generate recursive data structures and terminate at some point.
The companion of Gen.sized is Gen.resize, which lets you change the size of a generator, as in Gen.resize(10, Gen.alphaNumStr), which will generate an alphanumeric string of no more than ten characters.
Most built-in generators use sized in some way, for instance Gen.buildableOf (which is the underpinning of all generators for lists and containers):
[…] The size of the container is bounded by the size parameter used when generating values.
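For instance (a rough sketch, not from the original answer; sampled values vary), resizing a list generator caps how many elements each generated list can have:
import org.scalacheck.Gen

// listOf is bounded by the size parameter; resize fixes that bound to 5 here.
val smallLists: Gen[List[Int]] = Gen.resize(5, Gen.listOf(Gen.posNum[Int]))

smallLists.sample // e.g. Some(List(3, 17, 2)) -- never more than 5 elements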
A simple example
To get an idea of how the size parameter is used, take a look at the example in "Sized Generators" (Generators, ScalaCheck user guide):
def matrix[T](g: Gen[T]): Gen[Seq[Seq[T]]] = Gen.sized { size =>
  val side = scala.math.sqrt(size).asInstanceOf[Int]
  Gen.listOfN(side, Gen.listOfN(side, g))
}
This generator uses the "size" to limit the dimensions of the matrix so that the entire matrix will never have more entries than the "size" parameter allows. In other words, with a size of 100 as in your question, the generated matrix would have 10 rows and 10 columns, amounting to a total of 100 entries.
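Assuming the matrix generator above, resizing it is a quick way to see the effect of the size parameter (a sketch; sampled values vary):
import org.scalacheck.Gen

// With a size of 9, side = sqrt(9) = 3, so we expect a 3 x 3 matrix.
val smallMatrix: Gen[Seq[Seq[Int]]] = Gen.resize(9, matrix(Gen.choose(0, 9)))

smallMatrix.sample // e.g. Some(List(List(4, 0, 7), List(2, 9, 1), List(5, 5, 3)))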
Recursive data structures
"size" is particularly useful to make sure that generators for recursive data structures terminate. Consider the following example which generates instances of a binary tree and uses size to limit the height of each branch to make sure that the generator terminates at some point:
import org.scalacheck.Gen
import org.scalacheck.Arbitrary.arbitrary
sealed abstract class Tree
case class Node(left: Tree, right: Tree, v: Int) extends Tree
case object Leaf extends Tree
val genLeaf = Gen.const(Leaf)
val genNode = for {
  v <- arbitrary[Int]
  left <- Gen.sized(h => Gen.resize(h/2, genTree))
  right <- Gen.sized(h => Gen.resize(h/2, genTree))
} yield Node(left, right, v)

def genTree: Gen[Tree] = Gen.sized { height =>
  if (height <= 0) {
    genLeaf
  } else {
    Gen.oneOf(genLeaf, genNode)
  }
}
Note how the generator for nodes recursively generates trees, but allows them only half of the "size". The tree generator in turn will only generate leaves once its size is exhausted. Thus the "size" of the generator is an upper bound for the height of the generated tree, ensuring that the generator terminates at some point and does not generate excessively large trees.
Note that the size only sets an upper bound for the height of the tree in this example. It does not affect the balancing of the generated tree or the likeliness of generating a tree with a certain depth. These depend solely on the bias defined in genTree.
With oneOf each subtree has a 50% chance of being just a Leaf, ending the growth of the tree at this branch, which makes generating a complete tree that exhausts the "whole" size somewhat unlikely.
frequency (see below) lets you encode a different bias. In the example below, nodes are far more likely than leaves, so the generated tree is more likely to grow, but it is still unlikely to be complete.
Relation to Gen.frequency
Gen.frequency is for a different use case: you would not use it to limit the depth or size of a data structure, but to add a certain bias to a choice of generators. Take a look at the definition of Gen.option:
def option[T](g: Gen[T]): Gen[Option[T]] =
  frequency(1 -> const(None), 9 -> some(g))
This definition uses frequency to make the less-interesting case of None less likely than the more interesting case of Some.
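As a quick sanity check of that bias (a sketch, not from the original answer; exact counts vary from run to run), sampling such a generator many times should produce None roughly 10% of the time, given the 1-to-9 weighting shown above:
import org.scalacheck.Gen

val opts: Gen[Option[Int]] = Gen.option(Gen.posNum[Int])

// sample returns Option[Option[Int]]; flatten drops the (rare) failed samples.
val samples = List.fill(1000)(opts.sample).flatten
val noneRatio = samples.count(_.isEmpty).toDouble / samples.size
// noneRatio should come out near 0.1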
In fact, we could combine Gen.sized and Gen.frequency in our binary tree example above to make genTree more likely to generate "interesting" nodes rather than "boring" leaves:
def genTree: Gen[Tree] = Gen.sized { height =>
  if (height <= 0) {
    genLeaf
  } else {
    Gen.frequency(1 -> genLeaf, 9 -> genNode)
  }
}
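To try these generators out, one can resize genTree to choose the initial height budget explicitly (a sketch, not from the original answer; the sampled tree will vary):
import org.scalacheck.Gen

// Bound the initial size (and hence the maximum tree height) to 15.
val smallTrees: Gen[Tree] = Gen.resize(15, genTree)

smallTrees.sample // e.g. Some(Node(Leaf, Node(Leaf, Leaf, 42), -7))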

Generation size is a parameter that tells a generator how "big" the values it produces should be. The sized method simply lets you write generators that know their own size so you can use that information as a factor in what you generate.
For example, this generator (from this resource) produces two lists of numbers where 1/3 are positive and 2/3 are negative:
import org.scalacheck.Gen
import org.scalacheck.Prop.forAll
val myGen = Gen.sized { size =>
  val positiveNumbers = size / 3
  val negativeNumbers = size * 2 / 3
  for {
    posNumList <- Gen.listOfN(positiveNumbers, Gen.posNum[Int])
    negNumList <- Gen.listOfN(negativeNumbers, Gen.posNum[Int] map (n => -n))
  } yield (size, posNumList, negNumList)
}

forAll(myGen) {
  case (genSize, posN, negN) =>
    posN.length == genSize / 3 && negN.length == genSize * 2 / 3
}.check
So, somewhat like zipWithIndex in Scala collections, sized just provides you with meta-information to help you do what you need to do.

Related

Data structure to represent mapping of intervals

Here is a function which can deduce the status of a person given their age:
def getStatus(age: Int): String = {
  age match {
    case age if 0 until 2 contains age => "infant"
    case age if 2 until 10 contains age => "child"
    case age if 10 until 18 contains age => "teen"
    case _ => "adult"
  }
}
Let's say the boundaries can change: we might decide that a person is considered an infant until 3 years old. Since the boundaries change, we do not want them hard-coded; they will be stored externally.
What is a data-structure that can store mappings based on an interval?
Something like
val intervalMap = IntervalMap(
  (0, 2) -> "infant",
  (2, 10) -> "child",
  (10, 18) -> "teen",
  (18, 200) -> "adult"
)
intervalMap(1) // "infant"
intervalMap(12) // "teen"
I'm developing in Scala, but a language-agnostic answer will be much appreciated.
Easy Answer
There isn't anything in the Scala standard library that does that, but if the number of "categories" is low like in your example, there's no harm in implementing a naive O(N) apply method on your IntervalMap class.
def apply(in: Int) = categories.collectFirst {
  case ((min, max), value) if in >= min && in < max => value
}
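Pulled together, a minimal self-contained sketch (the IntervalMap name and the half-open [min, max) convention are taken from the question; returning an Option is a design choice of this sketch):
// Naive O(N) lookup over half-open intervals [min, max).
case class IntervalMap[V](categories: ((Int, Int), V)*) {
  def apply(in: Int): Option[V] = categories.collectFirst {
    case ((min, max), value) if in >= min && in < max => value
  }
}

val statuses = IntervalMap(
  (0, 2) -> "infant",
  (2, 10) -> "child",
  (10, 18) -> "teen",
  (18, 200) -> "adult"
)

statuses(1)  // Some("infant")
statuses(12) // Some("teen")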
Guava
Looks like the Guava library has a RangeMap class that seems to fit your use-case.
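For example (a sketch; it requires the Guava dependency, and Scala Ints need to be boxed to java.lang.Integer):
import com.google.common.collect.{Range, TreeRangeMap}

val ageMap = TreeRangeMap.create[Integer, String]()
ageMap.put(Range.closedOpen(Int.box(0), Int.box(2)), "infant")
ageMap.put(Range.closedOpen(Int.box(2), Int.box(10)), "child")
ageMap.put(Range.closedOpen(Int.box(10), Int.box(18)), "teen")
ageMap.put(Range.closedOpen(Int.box(18), Int.box(200)), "adult")

ageMap.get(Int.box(12)) // "teen" (null if no range contains the key)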
Fancier DIY Idea
To get an O(log N) lookup characteristic, you could represent your category data as a binary tree (a rough sketch follows the list):
Each node defines a min and max
Root node represents the absolute minimum to the absolute maximum, e.g. Int.MinValue to Int.MaxValue
Leaf nodes define a value (e.g. "child")
Non-leaf nodes define a split value, where the left child's max will be equal to the split, and the right child's min will be equal to the split
Find values in the tree by traversing left/right depending on whether your input number (e.g. age) is greater than or less than the current node's split
You'd have to deal with balancing the tree as it gets built... and TBH this is probably how Guava's doing it under the hood (I did not look into the implementation)
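Here is a rough Scala sketch of that idea (all names are made up for illustration, and the example tree is built by hand rather than balanced automatically):
sealed trait IntervalTree[+V] { def min: Int; def max: Int }
case class LeafRange[V](min: Int, max: Int, value: V) extends IntervalTree[V]
case class SplitNode[V](min: Int, max: Int, split: Int,
                        left: IntervalTree[V], right: IntervalTree[V]) extends IntervalTree[V]

// O(log N) lookup for a balanced tree: descend left or right of the split.
def lookup[V](tree: IntervalTree[V], key: Int): Option[V] = tree match {
  case LeafRange(min, max, v) => if (key >= min && key < max) Some(v) else None
  case SplitNode(_, _, split, l, r) => if (key < split) lookup(l, key) else lookup(r, key)
}

// Example: [0,2) infant, [2,10) child, [10,18) teen, [18,200) adult
val tree: IntervalTree[String] =
  SplitNode(0, 200, 10,
    SplitNode(0, 10, 2, LeafRange(0, 2, "infant"), LeafRange(2, 10, "child")),
    SplitNode(10, 200, 18, LeafRange(10, 18, "teen"), LeafRange(18, 200, "adult")))

lookup(tree, 12) // Some("teen")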

Is the map generator from the EPFL online course able to generate every possible map?

https://www.coursera.org/learn/progfun2 assignment for Week 1 shows, as an example, a generator for maps of type Map[Int, Int]:
lazy val genMap: Gen[Map[Int,Int]] = oneOf(
  const(Map.empty[Int,Int]),
  for {
    k <- arbitrary[Int]
    v <- arbitrary[Int]
    m <- oneOf(const(Map.empty[Int,Int]), genMap)
  } yield m.updated(k, v)
)
I'm new to Scala, but I'm familiar with generators in imperative programming languages. My understanding of the generator's execution flow is as follows:
arbitrary[Int] is called, it returns a generator yielding an endless sequence of Ints, the first generated value is assigned to k
arbitrary[Int] is called again, it returns a new generator, the first generated value is assigned to v
A random map is created recursively, updated with k->v, and yielded to the consumer
When the next value from the generator is requested, the execution resumes at m <- ... definition, proceeding with a new random m and the same k->v mapping
Neither const nor the recursive genMap ever run out of values, meaning that the "loop" for m never terminates, so new values for v and k are never requested from the corresponding arbitrary generators.
My conclusion is that all generated maps would either be empty or include the k->v mapping generated in the first iteration of the outermost invocation, i.e. genMap can never generate a non-empty map without such a mapping.
Q1: are my analysis and my conclusion correct?
Q2: if they are, how can I implement a generator which, after generating a first map, would have non-zero chance of generating any possible map?
Q3: if I simplify the last definition in the for-expression to m <- genMap, does that change the generator's behaviour in any way?
In short, your analysis and conclusion aren't correct.
I suspect the root of the misunderstanding is in interpreting for as a loop: it isn't one in general, and specifically not in this context. (When dealing with things that are more explicitly collections, thinking of for as a loop is close enough, I guess.)
I'll explain from the top down.
oneOf, given one or more generators, will create a generator which, when asked to generate a value, defers to one of the given generators by random selection. So
oneOf(
  const(Map.empty[Int, Int]),
  k: Gen[Map[Int, Int]] // i.e. some generator for Map[Int, Int]
)
The output might be
someMapFromK, Map.empty, someMapFromK, someMapFromK, Map.empty, Map.empty...
In this case, our k is
for {
  k <- arbitrary[Int]
  v <- arbitrary[Int]
  m <- oneOf(const(Map.empty[Int, Int]), genMap) // genMap being the name the outermost generator will be bound to
} yield m.updated(k, v)
for is syntactic sugar for calls to flatMap and map:
arbitrary[Int].flatMap { k =>
  arbitrary[Int].flatMap { v =>
    oneOf(const(Map.empty[Int, Int]), genMap).map { m =>
      m.updated(k, v)
    }
  }
}
For something like List, map and flatMap consume the entire collection. Gen is lazier:
flatMap basically means generate a value, and feed that value to a function that results in a Gen
map basically means generate a value, and transform it
If we imagined a method on Gen named sample which gave us the "next" generated value (for this purpose, we'll say that for a Gen[T] it will result in T and never throw an exception, etc.) genMap is exactly analogous to:
trait SimpleGen[T] { def sample: T }

lazy val genMap: SimpleGen[Map[Int, Int]] = new SimpleGen[Map[Int, Int]] {
  def sample: Map[Int, Int] =
    if (scala.util.Random.nextBoolean) Map.empty
    else {
      val k = arbitrary[Int].sample
      val v = arbitrary[Int].sample
      val m =
        if (scala.util.Random.nextBoolean) Map.empty
        else genMap.sample // Since genMap is lazy, we can recurse
      m.updated(k, v)
    }
}
Regarding the third question, in the original definition, the extra oneOf serves to bound the recursion depth to prevent the stack from being blown. For that definition, there's a 1/4 chance of going recursive, while replacing the inner oneOf with genMap would have a 1/2 chance of going recursive. Thus (ignoring the chance of a collision in the ks), for the first:
50% chance of empty (50% chance of 1+)
37.5% chance of size 1 (12.5% chance of 2+)
9.375% chance of size 2 (3.125% chance of 3+)
2.34375% chance of size 3 (0.78125% chance of 4+)...
While for the second:
50% chance of empty
25% chance of size 1
12.5% chance of size 2
6.25% chance of size 3...
Technically the possibility of stack overflow implies that depending on how many recursions you can make there's a maximum number of k -> v pairs in the Map you can generate, so there are almost certainly Maps that could not be generated.
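As for Q2 (not addressed above; just a sketch): one straightforward option is to build the map from arbitrary key/value pairs, for example with ScalaCheck's Gen.mapOf, so that any Map[Int, Int] whose number of entries fits within the current generation size has a non-zero chance of being produced:
import org.scalacheck.Gen
import org.scalacheck.Arbitrary.arbitrary

// Builds the map from arbitrary (key, value) pairs; the number of entries is
// bounded by the generation size.
val genAnyMap: Gen[Map[Int, Int]] = Gen.mapOf(arbitrary[(Int, Int)])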

How to pick a random value from a collection in Scala

I need a method to pick uniformly a random value from a collection.
Here is my current impl.
implicit class TraversableOnceOps[A, Repr](val elements: TraversableOnce[A]) extends AnyVal {
  def pickRandomly: A = elements.toSeq(Random.nextInt(elements.size))
}
But this code instantiates a new collection, so it's not ideal in terms of memory. Any way to improve?
[update] make it work with Iterator
implicit class TraversableOnceOps[A, Repr](val elements: TraversableOnce[A]) extends AnyVal {
  def pickRandomly: A = {
    val seq = elements.toSeq
    seq(Random.nextInt(seq.size))
  }
}
It may seem at first glance that you can't do this without counting the elements first, but you can!
Iterate through the sequence f and take each element f_i with probability 1/i:
def choose[A](it: Iterator[A], r: util.Random): A =
  it.zip(Iterator.iterate(1)(_ + 1)).reduceLeft((x, y) =>
    if (r.nextInt(y._2) == 0) y else x
  )._1
A quick demonstration of uniformity:
scala> ((1 to 1000000)
| .map(_ => choose("abcdef".iterator, r))
| .groupBy(identity).values.map(_.length))
res45: Iterable[Int] = List(166971, 166126, 166987, 166257, 166698, 166961)
Here's a discussion of the math I wrote a while back, though I'm afraid it's a bit unnecessarily long-winded. It also generalizes to choosing any fixed number of elements instead of just one.
The simplest way is to think of the problem as zipping the collection with an equal-sized list of random numbers and then extracting the element paired with the maximum. You can do this without actually realizing the zipped sequence. This does require traversing the entire collection, though:
val maxElement = s.maxBy(_=>Random.nextInt)
Or, for the implicit version
implicit class TraversableOnceOps[A, Repr](val elements: TraversableOnce[A]) extends AnyVal {
  def pickRandomly: A = elements.maxBy(_ => Random.nextInt)
}
It's possible to select an element uniformly at random from a collection, traversing it once without copying the collection.
The following algorithm will do the trick:
def choose[A](elements: TraversableOnce[A]): A = {
  var x: A = null.asInstanceOf[A]
  var i = 1
  for (e <- elements) {
    if (Random.nextDouble <= 1.0 / i) {
      x = e
    }
    i += 1
  }
  x
}
The algorithm works by making a choice at each iteration: take the new element with probability 1/i, or keep the previous one.
To understand why the algorithm chooses an element uniformly at random, consider this: start with an element in the collection, for example the first one (in this example the collection has only three elements).
At iteration:
1. Chosen with probability 1.
2. Chosen with probability (probability of having been chosen at the previous iteration) * (probability of keeping it at the current iteration) = 1 * 1/2 = 1/2.
3. Chosen with probability 1/2 * 2/3 = 1/3 (in other words, uniformly).
If we take another element, for example the second one:
1. 0 (not possible to choose the element at this iteration).
2. 1/2.
3. 1/2 * 2/3 = 1/3.
Finally, for the third one:
1. 0.
2. 0.
3. 1/3.
This shows that the algorithm selects an element uniformly at random. This can be proved formally using induction.
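In general, for a collection of n elements, element i ends up selected exactly when it is picked at iteration i and then kept at every later iteration j > i, which happens with probability (1/i) * (i/(i+1)) * ((i+1)/(i+2)) * … * ((n-1)/n) = 1/n: the product telescopes, so every element is equally likely.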
If the collection is large enough that you care about instantiations, here is a constant-memory solution (I assume it contains Ints, but that only matters for passing the initial parameter to the fold):
collection.foldLeft((0, 0)) {
  case ((0, _), x) => (1, x)
  // Keep the current pick with probability n/(n+1), i.e. replace it with probability 1/(n+1).
  case ((n, x), _) if (Random.nextDouble() > 1.0 / (n + 1)) => (n + 1, x)
  case ((n, _), x) => (n + 1, x)
}._2
I am not sure if this requires further explanation... Basically, it does the same thing that @svenslaggare suggested above, but in a functional way, since this is tagged as a Scala question.

Combination of elements

Problem:
Given a Seq seq and an Int n.
I basically want all combinations of the elements up to size n. The arrangement matters, meaning e.g. [1,2] is different from [2,1].
def combinations[T](seq: Seq[T], size: Int) = ...
Example:
combinations(List(1,2,3), 0)
//Seq(Seq())
combinations(List(1,2,3), 1)
//Seq(Seq(), Seq(1), Seq(2), Seq(3))
combinations(List(1,2,3), 2)
//Seq(Seq(), Seq(1), Seq(2), Seq(3), Seq(1,2), Seq(2,1), Seq(1,3), Seq(3,1),
//Seq(2,3), Seq(3,2))
...
What I have so far:
def combinations[T](seq: Seq[T], size: Int) = {
  @tailrec
  def inner(seq: Seq[T], soFar: Seq[Seq[T]]): Seq[Seq[T]] = seq match {
    case head +: tail => inner(tail, soFar ++ {
      val insertList = Seq(head)
      for {
        comb <- soFar
        if comb.size < size
        index <- 0 to comb.size
      } yield comb.patch(index, insertList, 0)
    })
    case _ => soFar
  }
  inner(seq, IndexedSeq(IndexedSeq.empty))
}
What would be your approach to this problem? This method will be called a lot, so it should be as efficient as possible.
There are methods in the library like subsets or combinations (yea I chose the same name), which return iterators. I also thought about that, but I have no idea yet how to design this lazily.
Not sure if this is efficient enough for your purpose but it's the simplest approach.
def combinations[T](seq: Seq[T], size: Int): Seq[Seq[T]] = {
  // Starting at 0 includes the empty Seq, matching the expected output above.
  (0 to size).flatMap(i => seq.combinations(i).flatMap(_.permutations))
}
edit:
to make it lazy you can use a view
def combinations[T](seq: Seq[T], size: Int): Iterable[Seq[T]] = {
  (0 to size).view.flatMap(i => seq.combinations(i).flatMap(_.permutations))
}
From permutations theory we know that the number of permutations of K elements taken from a set of N elements is
N! / (N - K)!
(see http://en.wikipedia.org/wiki/Permutation)
Therefore, if you want to build them all, you will have
algorithm complexity = number of permutations * cost of building each permutation
The potential optimization of the algorithm lies in minimizing the cost of building each permutation, by using a data structure that has some appending / prepending operation that runs in O(1).
You are using an IndexedSeq, which is a collection optimized for O(1) random access. When collections are optimized for random access they are backed by arrays. When using such collections (this is also valid for java ArrayList) you give up the guarantee of a low cost insertion operation: sometimes the array won't be big enough and the collection will have to create a new one and copy all the elements.
When using instead linked lists (such as scala List, which is the default implementation for Seq) you take the opposite choice: you give up constant time access for constant time insertion. In particular, scala List is a recursive data structure with constant time insertion at the front.
So if you are looking for high performance and you need the collection to be available eagerly, use a Seq.empty instead of IndexedSeq.empty and at each iteration prepend the new element at the head of the Seq. If you need something lazy, use Stream, which will minimize memory use. An additional strategy worth exploring is to build the IndexedSeq for your first iteration not with IndexedSeq.empty but with a builder, giving it a size hint of the right size (N! / (N - K)!).
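If laziness is the main concern, an Iterator-based variant of the earlier answer (a sketch, not from the original answers) yields the arrangements on demand without materializing them all:
// Lazily enumerate all arrangements of up to `size` elements, including the empty one.
def combinationsIterator[T](seq: Seq[T], size: Int): Iterator[Seq[T]] =
  (0 to size).iterator.flatMap(i => seq.combinations(i).flatMap(_.permutations))

combinationsIterator(List(1, 2, 3), 2).take(4).toList
// List(List(), List(1), List(2), List(3))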

Generate a DAG from a poset using strictly functional programming

Here is my problem: I have a sequence S of (nonempty but possibly not distinct) sets s_i, and for each s_i I need to know how many sets s_j in S (i ≠ j) are subsets of s_i.
I also need incremental performance: once I have all my counts, I may replace one set s_i by some subset of s_i and update the counts incrementally.
Performing all this using purely functional code would be a huge plus (I code in Scala).
As set inclusion is a partial ordering, I thought the best way to solve my problem would be to build a DAG that would represent the Hasse diagram of the sets, with edges representing inclusion, and join an integer value to each node representing the size of the sub-dag below the node plus 1. However, I have been stuck for several days trying to develop the algorithm that builds the Hasse diagram from the partial ordering (let's not talk about incrementality!), even though I thought it would be some standard undergraduate material.
Here is my data structure:
case class HNode[A](
  val v: A,
  val child: List[HNode[A]]) {
  val rank = 1 + child.map(_.rank).sum
}
My DAG is defined by a list of roots and some partial ordering:
class Hasse[A](val po: PartialOrdering[A], val roots: List[HNode[A]]) {
  def +(v: A): Hasse[A] = new Hasse[A](po, add(v, roots))

  private def collect(v: A, roots: List[HNode[A]], collected: List[HNode[A]]): List[HNode[A]] =
    if (roots == Nil) collected
    else {
      val (subsets, remaining) = roots.partition(r => po.lteq(r.v, v))
      collect(v, remaining.map(_.child).flatten, subsets.filter(r => !collected.exists(c => po.lteq(r.v, c.v))) ::: collected)
    }
}
I am pretty stuck here. The best I have come up with to add a new value v to the DAG is:
1. Find all "root subsets" rs_i of v in the DAG, i.e., subsets of v such that no superset of rs_i is a subset of v. This can be done quite easily by performing a search (BFS or DFS) on the graph (the collect function, possibly non-optimal or even flawed).
2. Build the new node n_v, whose children are the previously found rs_i.
3. Now, let's find out where n_v should be attached: for a given list of roots, find out the supersets of v. If none are found, add n_v to the roots and remove subsets of n_v from the roots. Otherwise, perform step 3 recursively on the supersets' children.
I have not yet fully implemented this algorithm, but it seems unnecessarily convoluted and suboptimal for my apparently simple problem. Is there some simpler algorithm available (Google was clueless on this)?
After some work, I finally ended up solving my problem, following my initial intuition. The collect method and rank evaluation were flawed; I rewrote them, with tail recursion as a bonus. Here is the code I obtained:
final case class HNode[A](
  val v: A,
  val child: List[HNode[A]]) {
  val rank: Int = 1 + count(child, Set.empty)

  @tailrec
  private def count(stack: List[HNode[A]], c: Set[HNode[A]]): Int =
    if (stack == Nil) c.size
    else {
      val head :: rem = stack
      if (c(head)) count(rem, c)
      else count(head.child ::: rem, c + head)
    }
}
// ...
private def add(v: A, roots: List[HNode[A]]): List[HNode[A]] = {
  val newNode = HNode(v, collect(v, roots, Nil))
  attach(newNode, roots)
}

private def attach(n: HNode[A], roots: List[HNode[A]]): List[HNode[A]] =
  if (roots.contains(n)) roots
  else {
    val (supersets, remaining) = roots.partition { r =>
      // Strict superset to avoid creating cycles in case of equal elements
      po.tryCompare(n.v, r.v) == Some(-1)
    }
    if (supersets.isEmpty) n :: remaining.filter(r => !po.lteq(r.v, n.v))
    else {
      supersets.map(s => HNode(s.v, attach(n, s.child))) ::: remaining
    }
  }

@tailrec
private def collect(v: A, stack: List[HNode[A]], collected: List[HNode[A]]): List[HNode[A]] =
  if (stack == Nil) collected
  else {
    val head :: tail = stack
    if (collected.exists(c => po.lteq(head.v, c.v))) collect(v, tail, collected)
    else if (po.lteq(head.v, v)) collect(v, tail, head :: (collected.filter(c => !po.lteq(c.v, head.v))))
    else collect(v, head.child ::: tail, collected)
  }
I now must check some optimization:
- cut off branches with totally distinct sets when collecting subsets (as Rex Kerr suggested)
- see if sorting the sets by size improves the process (as mitchus suggested)
The remaining problem is to work out the (worst-case) complexity of the add() operation.
With n the number of sets and d the size of the largest set, the complexity will probably be O(n²d), but I hope it can be refined. Here is my reasoning: if no set is a subset of another, the DAG degenerates to a flat list of roots/leaves. Thus, every time I try to add a node to the data structure, I still have to check for inclusion with each node already present (both in the collect and attach procedures). This leads to 1 + 2 + … + n = n(n+1)/2 ∈ O(n²) inclusion checks.
Each set inclusion test is O(d), hence the result.
Suppose your DAG G contains a node v for each set, with attributes v.s (the set) and v.count (the number of instances of the set), including a node G.root with G.root.s = union of all sets (where G.root.count=0 if this set never occurs in your collection).
Then to count the number of distinct subsets of s you could do the following (in a bastardized mixture of Scala, Python and pseudo-code):
sum(apply(lambda x: x.count, get_subsets(s, G.root)))
where
get_subsets(s, v):
  if(v.s is not a subset of s, {},
     union({v} :: apply(v.children, lambda x: get_subsets(s, x))))
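A rough Scala rendering of the idea, assuming a hypothetical SetNode(s, count, children) node shape (all names are illustrative); note that this sketch keeps descending into the children even when v.s itself is not a subset of s, since a deeper node may still be one:
// Hypothetical node type for the DAG sketched above.
case class SetNode[A](s: Set[A], count: Int, children: List[SetNode[A]])

// Collect the distinct nodes whose set is a subset of `s`.
def getSubsets[A](s: Set[A], v: SetNode[A]): Set[SetNode[A]] = {
  val below = v.children.map(getSubsets(s, _)).foldLeft(Set.empty[SetNode[A]])(_ union _)
  if (v.s.subsetOf(s)) below + v else below
}

// Number of set instances in the collection that are subsets of `s`.
def countSubsets[A](s: Set[A], root: SetNode[A]): Int =
  getSubsets(s, root).iterator.map(_.count).sum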
In my opinion though, for performance reasons you would be better off abandoning this kind of purely functional solution... it works well on lists and trees, but beyond that the going gets tough.