Overriding `Comparison method violates its general contract` exception - scala

I have a comparator like this:
lazy val seq = mapping.toSeq.sortWith { case ((_, set1), (_, set2)) =>
  // Just propose all the most connected nodes first to the users
  // But also allow less connected nodes to pop out sometimes
  val popOutChance = random.nextDouble <= 0.1D && set2.size > 5
  if (popOutChance) set1.size < set2.size else set1.size > set2.size
}
It is my intention to compare set sizes such that smaller sets may appear higher in a sorted list, with a 10% chance.
The compiler accepts this, but at runtime it throws java.lang.IllegalArgumentException: Comparison method violates its general contract! as soon as I try to use it. How can I override it?

I think the problem here is that, every time two elements are compared, the outcome is random, thus violating the transitive property required of a comparator function in any sorting algorithm.
For example, let's say that some instance a compares as less than b, and then b compares as less than c. These results should imply that a compares as less than c. However, since your comparisons are stochastic, you can't guarantee that outcome. In fact, you can't even guarantee that a will be less than b next time they're compared.
So don't do that. No sort algorithm can handle it. (Such an approach also violates the referential transparency principle of functional programming and will make your program much harder to reason about.)
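To see the failure concretely, here is a minimal sketch (the input values and sizes are made up for illustration). java.util.TimSort, which backs sortWith on the JVM, only detects the inconsistency when it happens to compare overlapping elements, so the exception is thrown non-deterministically - usually, but not always, on inputs of this size:
import scala.util.Random

val xs = Seq.fill(10000)(Random.nextInt(100))
// A comparator whose outcome is random violates transitivity and consistency;
// sorting will typically fail with:
// java.lang.IllegalArgumentException: Comparison method violates its general contract!
xs.sortWith((_, _) => Random.nextBoolean())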
Instead, what you need to do is to decorate your map's members with a randomly assigned weighting - before attempting to sort them - so that they can be sorted consistently. However, since this happens at the start of a sort operation, the result of the sort will be different each time, which I think is what you're looking for.
It's not clear what type mapping has in your example, but it appears to be something like Map[Any, Set[_]]. (You can replace the types as required; it's not that important to this approach. For example, if mapping actually has the type Map[String, Set[SomeClass]], then you would replace the references below to Any with String and to Set[_] with Set[SomeClass].)
First, we'll create a case class that we'll use to score and compare the map elements. Then we'll map the contents of mapping to a sequence of elements of this case class. Next, we sort those elements. Finally, we extract the tuple from the decorated class. The result should look something like this:
import scala.util.Random

final case class Decorated(x: (Any, Set[_]), rand: Double = Random.nextDouble())
  extends Ordered[Decorated] {

  // Calculate a rank for this element. You'll need to change this to suit your precise
  // requirements. Here, if rand is less than 0.1 (a 10% chance), I'm adding 5 to the size;
  // otherwise, I'll report the actual size. This allows transitive comparisons, since
  // rand doesn't change once defined. Values are negated so bigger sets come to the fore
  // when sorted.
  private def rank: Int = {
    if (rand < 0.1) -(x._2.size + 5)
    else -x._2.size
  }

  // Compare this element with another, by their ranks.
  override def compare(that: Decorated): Int = rank.compare(that.rank)
}
// Now sort your mapping elements as follows and convert back to tuples.
lazy val seq = mapping.map(x => Decorated(x)).toSeq.sorted.map(_.x)
This should put the elements with larger sets towards the front, but there's a 10% chance that a set appears 5 bigger and so moves up the list. The result will be different each time the last line is re-executed, since map will create new random values for each element. However, during sorting, the ranks are fixed and will not change.
(Note that I'm setting the rank to a negative value. The Ordered[T] trait sorts elements in ascending order, so that - if we sorted purely by set size - smaller sets would come before larger sets. By negating the rank value, sorting will put larger sets before smaller sets. If you don't want this behavior, remove the negations.)
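As a quick usage sketch (the mapping contents here are hypothetical, purely for illustration):
val mapping: Map[Any, Set[_]] = Map(
  "big"    -> Set(1, 2, 3, 4, 5, 6, 7, 8),
  "medium" -> Set(1, 2, 3, 4),
  "small"  -> Set(1)
)

// Each evaluation decorates the elements with fresh random weights, so
// repeated runs may order them differently.
def ranked = mapping.map(x => Decorated(x)).toSeq.sorted.map(_.x)
println(ranked.map(_._1)) // usually: List(big, medium, small)
println(ranked.map(_._1)) // occasionally a smaller set jumps ahead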

Related

Swift - should I create a local variable for a string's "count"?

Does it matter if I use a string's 'count' multiple times within a function? That is, does Swift cache the 'count' after it first computes it? Below are two examples; does it matter which one I use? I assume the second is definitely okay, but what about the first? I see example code like the first one all the time.
func Foo1(str: String) {
    ...
    // calling str.count twice
    if x < str.count && y < str.count {
        ...
    }
}

func Foo2(str: String) {
    ...
    // calling str.count once
    let c = str.count
    if x < c && y < c {
        ...
    }
}
.count is defined by the Collection protocol with the following complexity:
Complexity: O(1) if the collection conforms to RandomAccessCollection; otherwise, O(n), where n is the length of the collection.
String is not a RandomAccessCollection. It's a BidirectionalCollection, so it does not promise O(1). It only promises O(n).
It definitely does not promise any caching (and you shouldn't expect any).
It happens to be true that in many (probably most) cases, String's count is cached. It's part of _StringObject, which is part of the low-level storage abstraction, and it's often inlined by the optimizer. But none of this is promised.
That said, unless you expect the String to be extremely large (10kB at a minimum, possibly more), it is difficult to imagine this being a major bottleneck by being called twice outside a tight loop. As with most things, you should write clearly, and then profile. I would likely create an extra variable just for clarity, but you shouldn't second-guess here too much. Write clearly. Then profile.
Do you have particularly large strings that you're working with?

Difference between size and sizeIs

What is the semantic difference between size and sizeIs? For example,
List(1,2,3).sizeIs > 1 // true
List(1,2,3).size > 1 // true
Luis mentions in a comment that
...on 2.13+ one can use sizeIs > 1 which will be more efficient than
size > 1 as the first one does not compute all the size before
returning
Add size comparison methods to IterableOps #6950 seems to be the pull request that introduced it.
Reading the scaladoc
Returns a value class containing operations for comparing the size of
this $coll to a test value. These operations are implemented in terms
of sizeCompare(Int)
it is not clear to me why sizeIs is more efficient than regular size.
As far as I understand the changes:
The idea is that, for collections that do not have an O(1) (constant) size, sizeIs can be more efficient, especially for comparisons with small values (like 1 in the comment).
But why?
Simple: instead of computing the entire size and then doing the comparison, sizeIs returns an object which, when performing the comparison, can return early.
For example, let's check the code:
def sizeCompare(otherSize: Int): Int = {
  if (otherSize < 0) 1
  else {
    val known = knownSize
    if (known >= 0) Integer.compare(known, otherSize)
    else {
      var i = 0
      val it = iterator
      while (it.hasNext) {
        if (i == otherSize) return if (it.hasNext) 1 else 0 // HERE!!! - return as fast as possible.
        it.next()
        i += 1
      }
      i - otherSize
    }
  }
}
Thus, in the example of the comment, suppose a List of three elements. sizeIs > 1 will return as soon as it knows that the List has at least one element and the iterator has more, saving the cost of traversing the other two elements to compute a size of 3 and only then doing the comparison.
Note that if the size of the collection is greater than the comparing value, the performance would be roughly the same (maybe slower than plain size, due to the extra comparison on each cycle). Thus, I would only recommend this for comparisons with small values, or when you believe the values will be smaller than the size of the collection.
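To make the early return concrete: in 2.13, coll.sizeIs > n is sugar over sizeCompare, so the traversal stops after roughly n + 1 elements. A small sketch:
// This:
List(1, 2, 3).sizeIs > 1         // true
// is equivalent to:
List(1, 2, 3).sizeCompare(1) > 0 // true - examines at most 2 elements

// The difference matters most when computing the full size is expensive
// or impossible:
val naturals = LazyList.from(1)  // conceptually infinite
naturals.sizeIs > 1              // true - looks at only two elements
// naturals.size > 1             // would never terminate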

Strange slowdown in some simple scala code

I am processing a large number of records (CDRs) that are essentially (who, where, how much). To save space, I use a lookup to map the strings to integers, and I aggregate the traffic in a map of maps (who maps to a map of (where maps to how much)):
type CDR = (String, String, Int)
type Lookup = scala.collection.mutable.HashMap[String, (Int, Float)]
type Traffic = scala.collection.mutable.HashMap[Int, scala.collection.mutable.HashMap[Int, Int]]
I have found a strange behavior: when I build the lookup tables in advance, the code runs as expected; however, when I start processing and build the maps on the fly, it slows down as it processes the records.
I use the same function to build the lookup tables for this comparison. I essentially check if the code for the lookup is there; if not, I create a new entry (it is a mutable map), like this:
def index(id: String, map: Lookup, reverse: Reverse): Int = {
  if (map.contains(id)) {
    map(id)._1
  } else {
    val number = if (map.keys.size == 0) 0 else reverse.keys.max + 1
    reverse += (number -> id)
    map += (id -> (number, 0.toFloat))
    number
  }
}
Am I missing something here?
EDIT: I can no longer reproduce the slowdown. I will assume I was either too tired or dumber than usual. Running time now seems to be the same as I expected it to be.
What is mapCellRvs? The default Scala Map's .size (and .keys.size, which is the same thing) simply counts all elements by scanning them linearly.
Try replacing mapCellRvs.keys.size == 0 with mapCellRvs.isEmpty ...
Also, reverse.keys.max is linear as well. You may want to just remember the max somewhere separately, rather than compute it every time.
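Putting both suggestions together, here is a minimal sketch of a constant-time version of index. It assumes Reverse is a mutable.HashMap[Int, String] (the question doesn't show its definition) and relies on ids being handed out densely from 0, so that map.size doubles as the next free id:
import scala.collection.mutable

type Lookup = mutable.HashMap[String, (Int, Float)]
type Reverse = mutable.HashMap[Int, String]

def index(id: String, map: Lookup, reverse: Reverse): Int =
  map.get(id) match {
    case Some((number, _)) => number
    case None =>
      val number = map.size // next free id, since ids are assigned 0, 1, 2, ...
      reverse += (number -> id)
      map += (id -> (number, 0f))
      number
  }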

Scala: For loop that matches ints in a List

New to Scala. I'm iterating a for loop 100 times. 10 times I want condition 'a' to be met, and 90 times condition 'b'. However, I want the 10 a's to occur at random.
The best way I can think of is to create a val of 10 random integers, then loop through the ints 1 to 100.
For example:
val z = List.fill(10)(100).map(scala.util.Random.nextInt)
z: List[Int] = List(71, 5, 2, 9, 26, 96, 69, 26, 92, 4)
Then something like:
for (i <- 1 to 100) {
  whenever i == a number in z: 'Condition a met: do something'
  else {
    'condition b met: do something else'
  }
}
I tried using contains and == and != but nothing seemed to work. How else can I do this?
Your generation of random numbers could yield duplicates... is that OK? Here's how you can easily generate 10 unique numbers from 1 to 100 (by generating a randomly shuffled sequence of 1-100 and taking the first ten):
val r = scala.util.Random.shuffle(1 to 100).toList.take(10)
Now you can simply partition the range 1-100 into those numbers that are contained in your randomly generated list and those that are not:
val (listOfA, listOfB) = (1 to 100).partition(r.contains(_))
Now do whatever you want with those two lists, e.g.:
println(listOfA.mkString(","))
println(listOfB.mkString(","))
Of course, you can always simply go through the list one by one:
(1 to 100).map {
  case i if (r.contains(i)) => println("yes: " + i) // or whatever
  case i => println("no: " + i)
}
What you consider to be a simple for loop actually isn't one. It's a for-comprehension, and it's syntax sugar that desugars into chained calls of map, flatMap and withFilter. Yes, it can be used in the same way as you would use the classical for loop, but this is only because List is in fact a monad. Without going into too much detail: if you want to do things the idiomatic Scala way (the "functional" way), you should avoid trying to write classical iterative for loops and instead prefer getting a collection of your data and then mapping over its elements to perform whatever it is that you need. Note that collections have a really rich library behind them which allows you to invoke cool methods such as partition.
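For instance, a minimal sketch of that desugaring:
// This for-comprehension:
val doubled = for (x <- List(1, 2, 3) if x > 1) yield x * 2

// desugars into (roughly):
val doubledDesugared = List(1, 2, 3).withFilter(_ > 1).map(_ * 2)

// Both yield List(4, 6).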
EDIT (for completeness):
Also, you should avoid side effects, or at least push them as far down the road as possible. I'm talking about the second example from my answer. Let's say you really need to log that stuff (you would be using a logger, but println is good enough for this example). Doing it like that is bad. By the way, note that you could use foreach instead of map in that case, because you're not collecting results, just performing side effects.
A good way would be to compute the needed stuff by modifying each element into an appropriate string. So, calculate the needed strings and accumulate them into results:
val results = (1 to 100).map {
  case i if (r.contains(i)) => ("yes: " + i) // or whatever
  case i => ("no: " + i)
}
// do whatever with results, e.g. print them
Now results contains a list of a hundred "yes x" and "no x" strings, but you didn't do the ugly thing and perform logging as a side effect in the mapping process. Instead, you mapped each element of the collection into a corresponding string (note that original collection remains intact, so if (1 to 100) was stored in some value, it's still there; mapping creates a new collection) and now you can do whatever you want with it, e.g. pass it on to the logger. Yes, at some point you need to do "the ugly side effect thing" and log the stuff, but at least you will have a special part of code for doing that and you will not be mixing it into your mapping logic which checks if number is contained in the random sequence.
(1 to 100).foreach { x =>
  if (z.contains(x)) {
    // do something
  } else {
    // do something else
  }
}
or you can use a partial function, like so:
(1 to 100).foreach {
  case x if (z.contains(x)) => // do something
  case _ => // do something else
}

Remember suchThat clauses when shrinking

If I have a custom generator then the shrinker will remember my suchThat clause and not shrink with invalid values:
val myGen = Gen.identifier.suchThat { _.length > 3 }
// all shrinks have > 3 characters
property("failing case") = forAll (myGen) { (a: String) =>
println(s"Gen suchThat Value: $a")
a == "Impossible"
}
If I do something further to the generated value (i.e. map it), then the shrinker "forgets" my suchThat clause:
// the shrinker will shrink all the way down to ""
property("failing case") = forAll (myGen.map{_ + "bbb"}) { (a: String) =>
println(s"Gen with map Value: $a")
a == "Impossible"
}
Is it possible to have suchThat constraints propagate through generators? In my real project I am doing more than a simple map, but that seems to be the simplest example of the limitation I am hitting.
I'm fairly certain the answer is no (at least at this point in time).
This is quite annoying, although perhaps not as trivial as it seems. The generator result does attempt to keep track of the sieve, although it gets lost in map and flatMap. Apart from applying the sieve to the result of the shrink, there isn't any other connection back to the generator. Even if there were, all the intermediate results would need to be retained and applied to each sieve at the correct points. That then raises the question: what exactly is being shrunk, the generated result or the original generator(s)?
The only solutions that I have found so far are to:
Disable shrinking, or
Implement a custom Shrink, or
Add a whenever clause that rechecks the generated value (sketched below).
This can be quite challenging, especially when composing multiple generators.
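For illustration, here is a minimal sketch of the first and third workarounds, reusing the myGen and "bbb" example from the question (assumes ScalaCheck 1.14+; the ==> implication, i.e. the whenever-style recheck, comes from Prop.propBoolean):
import org.scalacheck.{Gen, Prop, Shrink}
import org.scalacheck.Prop.{forAll, propBoolean}

val myGen = Gen.identifier.suchThat { _.length > 3 }

// Workaround 3: recheck the invariant with ==>, so shrunk candidates that
// violate it are discarded instead of being reported as counterexamples.
val rechecked: Prop = forAll(myGen.map { _ + "bbb" }) { (a: String) =>
  (a.length > 6) ==> (a == "Impossible")
}

// Workaround 1: disable shrinking for String by supplying a Shrink
// instance that produces no shrink candidates.
val unshrunk: Prop = {
  implicit val noShrink: Shrink[String] = Shrink.shrinkAny
  forAll(myGen.map { _ + "bbb" }) { (a: String) =>
    a == "Impossible"
  }
}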