Difference between size and sizeIs - scala

What is the semantic difference between size and sizeIs? For example,
List(1,2,3).sizeIs > 1 // true
List(1,2,3).size > 1 // true
Luis mentions in a comment that
...on 2.13+ one can use sizeIs > 1 which will be more efficient than
size > 1 as the first one does not compute all the size before
returning
Add size comparison methods to IterableOps #6950 seems to be the pull request that introduced it.
Reading the scaladoc
Returns a value class containing operations for comparing the size of
this $coll to a test value. These operations are implemented in terms
of sizeCompare(Int)
it is not clear to me why is sizeIs more efficient than regular size?

As far as I understand the changes.
The idea is that for collections that do not have a O(1) (constant) size. Then, sizeIs can be more efficient, specially for comparisons with small values (like 1 in the comment).
But why?
Simple, because instead of computing all the size and then doing the comparison, sizeIs returns an object which when computing the comparison, can return early.
For example, lets check the code
def sizeCompare(otherSize: Int): Int = {
if (otherSize < 0) 1
else {
val known = knownSize
if (known >= 0) Integer.compare(known, otherSize)
else {
var i = 0
val it = iterator
while (it.hasNext) {
if (i == otherSize) return if (it.hasNext) 1 else 0 // HERE!!! - return as fast as possible.
it.next()
i += 1
}
i - otherSize
}
}
}
Thus, in the example of the comment, suppose a very very very long List of three elements. sizeIs > 1 will return as soon as it knows that the List has at least one element and hasMore. Thus, saving the cost of traversing the other two elements to compute a size of 3 and then doing the comparison.
Note that: If the size of the collection is greater than the comparing value, then the performance would be roughly the same (maybe slower than just size due the extra comparisons on each cycle). Thus, I would only recommend this for comparisons with small values, or when you believe the values will be smaller than the collection.

Related

"Find the Parity Outlier Code Wars (Scala)"

I am doing some of CodeWars challenges recently and I've got a problem with this one.
"You are given an array (which will have a length of at least 3, but could be very large) containing integers. The array is either entirely comprised of odd integers or entirely comprised of even integers except for a single integer N. Write a method that takes the array as an argument and returns this "outlier" N."
I've looked at some solutions, that are already on our website, but I want to solve the problem using my own approach.
The main problem in my code, seems to be that it ignores negative numbers even though I've implemented Math.abs() method in scala.
If you have an idea how to get around it, that is more than welcome.
Thanks a lot
object Parity {
var even = 0
var odd = 0
var result = 0
def findOutlier(integers: List[Int]): Int = {
for (y <- 0 until integers.length) {
if (Math.abs(integers(y)) % 2 == 0)
even += 1
else
odd += 1
}
if (even == 1) {
for (y <- 0 until integers.length) {
if (Math.abs(integers(y)) % 2 == 0)
result = integers(y)
}
} else {
for (y <- 0 until integers.length) {
if (Math.abs(integers(y)) % 2 != 0)
result = integers(y)
}
}
result
}
Your code handles negative numbers just fine. The problem is that you rely on mutable sate, which leaks between runs of your code. Your code behaves as follows:
val l = List(1,3,5,6,7)
println(Parity.findOutlier(l)) //6
println(Parity.findOutlier(l)) //7
println(Parity.findOutlier(l)) //7
The first run is correct. However, when you run it the second time, even, odd, and result all have the values from your previous run still in them. If you define them inside of your findOutlier method instead of in the Parity object, then your code gives correct results.
Additionally, I highly recommend reading over the methods available to a Scala List. You should almost never need to loop through a List like that, and there are a number of much more concise solutions to the problem. Mutable var's are also a pretty big red flag in Scala code, as are excessive if statements.

Overriding `Comparison method violates its general contract` exception

I have a comparator like this:
lazy val seq = mapping.toSeq.sortWith { case ((_, set1), (_, set2)) =>
// Just propose all the most connected nodes first to the users
// But also allow less connected nodes to pop out sometimes
val popOutChance = random.nextDouble <= 0.1D && set2.size > 5
if (popOutChance) set1.size < set2.size else set1.size > set2.size
}
It is my intention to compare sets sizes such that smaller sets may appear higher in a sorted list with 10% chance.
But compiler does not let me do that and throws an Exception: java.lang.IllegalArgumentException: Comparison method violates its general contract! once I try to use it in runtime. How can I override it?
I think the problem here is that, every time two elements are compared, the outcome is random, thus violating the transitive property required of a comparator function in any sorting algorithm.
For example, let's say that some instance a compares as less than b, and then b compares as less than c. These results should imply that a compares as less than c. However, since your comparisons are stochastic, you can't guarantee that outcome. In fact, you can't even guarantee that a will be less than b next time they're compared.
So don't do that. No sort algorithm can handle it. (Such an approach also violates the referential transparency principle of functional programming and will make your program much harder to reason about.)
Instead, what you need to do is to decorate your map's members with a randomly assigned weighting - before attempting to sort them - so that they can be sorted consistently. However, since this happens at the start of a sort operation, the result of the sort will be different each time, which I think is what you're looking for.
It's not clear what type mapping has in your example, but it appears to be something like: Map[Any, Set[_]]. (You can replace the types as required - it's not that important to this approach. For example, say mapping actually has the type Map[String, Set[SomeClass]], then you would replace references below to Any with String and Set[_] to Set[SomeClass].)
First, we'll create a case class that we'll use to score and compare the map elements. Then we'll map the contents of mapping to a sequence of elements of this case class. Next, we sort those elements. Finally, we extract the tuple from the decorated class. The result should look something like this:
final case class Decorated(x: (Any, Set[_]), rand: Double = random.nextDouble)
extends Ordered[Decorated] {
// Calculate a rank for this element. You'll need to change this to suit your precise
// requirements. Here, if rand is less than 0.1 (a 10% chance), I'm adding 5 to the size;
// otherwise, I'll report the actual size. This allows transitive comparisons, since
// rand doesn't change once defined. Values are negated so bigger sets come to the fore
// when sorted.
private def rank: Int = {
if(rand < 0.1) -(x._2.size + 5)
else -x._2.size
}
// Compare this element with another, by their ranks.
override def compare(that: Decorated): Int = rank.compare(that.rank)
}
// Now sort your mapping elements as follows and convert back to tuples.
lazy val seq = mapping.map(x => Decorated(x)).toSeq.sorted.map(_.x)
This should put the elements with larger sets towards the front, but there's 10% chance that sets appear 5 bigger and so move up the list. The result will be different each time the last line is re-executed, since map will create new random values for each element. However, during sorting, the ranks will be fixed and will not change.
(Note that I'm setting the rank to a negative value. The Ordered[T] trait sorts elements in ascending order, so that - if we sorted purely by set size - smaller sets would come before larger sets. By negating the rank value, sorting will put larger sets before smaller sets. If you don't want this behavior, remove the negations.)

Whats faster: Insert at index 0 or array.Reverse?

I'm getting data from my database in the reverse order of how I need it to be. In order to correctly order it I have a couple choices: I can insert each new piece of data gotten at index 0 of my array, or just append it then reverse the array at the end. Something like this:
let data = ["data1", "data2", "data3", "data4", "data5", "data6"]
var reversedArray = [String]()
for var item in data {
reversedArray.insert(item, 0)
}
// OR
reversedArray = data.reverse()
Which one of these options would be faster? Would there be any significant difference between the 2 as the number of items increased?
Appending new elements has an amortized complexity of roughly O(1). According to the documentation, reversing an array has also a constant complexity.
Insertion has a complexity O(n), where n is the length of the array and you're inserting all elements one by one.
So appending and then reversing should be faster. But you won't see a noticeable difference if you're only dealing with a few dozen elements.
Creating the array by repeatedly inserting items at the beginning will be slowest because it will take time proportional to the square of the number of items involved.
(Clarification: I mean building the entire array reversed will take time proportional to n^2, because each insert will take time proportional to the number of items currently in the array, which will therefore be 1 + 2 + 3 + ... + n which is proportional to n squared)
Reversing the array after building it will be much faster because it will take time proportional to the number of items involved.
Just accessing the items in reverse order will be even faster because you avoid reversing the array.
Look up 'big O notation' for more information. Also note that an algorithm with O(n^2) runtime can outperform one with O(n) for small values of n.
My test results…
do {
let start = Date()
(1..<100).forEach { _ in
for var item in data {
reversedArray.insert(item, at: 0)
}
}
print("First: \(Date().timeIntervalSince1970 - start.timeIntervalSince1970)")
}
do {
let start = Date()
(1..<100).forEach { _ in
reversedArray = data.reversed()
}
print("Second: \(Date().timeIntervalSince1970 - start.timeIntervalSince1970)")
}
First: 0.0124959945678711
Second: 0.00890707969665527
Interestingly, running them 10,000 times…
First: 7.67399883270264
Second: 0.0903480052947998

Merge Sort algorithm efficiency

I am currently taking an online algorithms course in which the teacher doesn't give code to solve the algorithm, but rather rough pseudo code. So before taking to the internet for the answer, I decided to take a stab at it myself.
In this case, the algorithm that we were looking at is merge sort algorithm. After being given the pseudo code we also dove into analyzing the algorithm for run times against n number of items in an array. After a quick analysis, the teacher arrived at 6nlog(base2)(n) + 6n as an approximate run time for the algorithm.
The pseudo code given was for the merge portion of the algorithm only and was given as follows:
C = output [length = n]
A = 1st sorted array [n/2]
B = 2nd sorted array [n/2]
i = 1
j = 1
for k = 1 to n
if A(i) < B(j)
C(k) = A(i)
i++
else [B(j) < A(i)]
C(k) = B(j)
j++
end
end
He basically did a breakdown of the above taking 4n+2 (2 for the declarations i and j, and 4 for the number of operations performed -- the for, if, array position assignment, and iteration). He simplified this, I believe for the sake of the class, to 6n.
This all makes sense to me, my question arises from the implementation that I am performing and how it effects the algorithms and some of the tradeoffs/inefficiencies it may add.
Below is my code in swift using a playground:
func mergeSort<T:Comparable>(_ array:[T]) -> [T] {
guard array.count > 1 else { return array }
let lowerHalfArray = array[0..<array.count / 2]
let upperHalfArray = array[array.count / 2..<array.count]
let lowerSortedArray = mergeSort(array: Array(lowerHalfArray))
let upperSortedArray = mergeSort(array: Array(upperHalfArray))
return merge(lhs:lowerSortedArray, rhs:upperSortedArray)
}
func merge<T:Comparable>(lhs:[T], rhs:[T]) -> [T] {
guard lhs.count > 0 else { return rhs }
guard rhs.count > 0 else { return lhs }
var i = 0
var j = 0
var mergedArray = [T]()
let loopCount = (lhs.count + rhs.count)
for _ in 0..<loopCount {
if j == rhs.count || (i < lhs.count && lhs[i] < rhs[j]) {
mergedArray.append(lhs[i])
i += 1
} else {
mergedArray.append(rhs[j])
j += 1
}
}
return mergedArray
}
let values = [5,4,8,7,6,3,1,2,9]
let sortedValues = mergeSort(values)
My questions for this are as follows:
Do the guard statements at the start of the merge<T:Comparable> function actually make it more inefficient? Considering we are always halving the array, the only time that it will hold true is for the base case and when there is an odd number of items in the array.
This to me seems like it would actually add more processing and give minimal return since the time that it happens is when we have halved the array to the point where one has no items.
Concerning my if statement in the merge. Since it is checking more than one condition, does this effect the overall efficiency of the algorithm that I have written? If so, the effects to me seems like they vary based on when it would break out of the if statement (e.g at the first condition or the second).
Is this something that is considered heavily when analyzing algorithms, and if so how do you account for the variance when it breaks out from the algorithm?
Any other analysis/tips you can give me on what I have written would be greatly appreciated.
You will very soon learn about Big-O and Big-Theta where you don't care about exact runtimes (believe me when I say very soon, like in a lecture or two). Until then, this is what you need to know:
Yes, the guards take some time, but it is the same amount of time in every iteration. So if each iteration takes X amount of time without the guard and you do n function calls, then it takes X*n amount of time in total. Now add in the guards who take Y amount of time in each call. You now need (X+Y)*n time in total. This is a constant factor, and when n becomes very large the (X+Y) factor becomes negligible compared to the n factor. That is, if you can reduce a function X*n to (X+Y)*(log n) then it is worthwhile to add the Y amount of work because you do fewer iterations in total.
The same reasoning applies to your second question. Yes, checking "if X or Y" takes more time than checking "if X" but it is a constant factor. The extra time does not vary with the size of n.
In some languages you only check the second condition if the first fails. How do we account for that? The simplest solution is to realize that the upper bound of the number of comparisons will be 3, while the number of iterations can be potentially millions with a large n. But 3 is a constant number, so it adds at most a constant amount of work per iteration. You can go into nitty-gritty details and try to reason about the distribution of how often the first, second and third condition will be true or false, but often you don't really want to go down that road. Pretend that you always do all the comparisons.
So yes, adding the guards might be bad for your runtime if you do the same number of iterations as before. But sometimes adding extra work in each iteration can decrease the number of iterations needed.

Scala vals vs vars

I'm pretty new to Scala but I like to know what is the preferred way of solving this problem. Say I have a list of items and I want to know the total amount of the items that are checks. I could do something like so:
val total = items.filter(_.itemType == CHECK).map(._amount).sum
That would give me what I need, the sum of all checks in a immutable variable. But it does it with what seems like 3 iterations. Once to filter the checks, again to map the amounts and then the sum. Another way would be to do something like:
var total = new BigDecimal(0)
for (
item <- items
if item.itemType == CHECK
) total += item.amount
This gives me the same result but with 1 iteration and a mutable variable which seems fine too. But if I wanted to to extract more information, say the total number of checks, that would require more counters or mutable variables but I wouldn't have to iterate over the list again. Doesn't seem like the "functional" way of achieving what I need.
var numOfChecks = 0
var total = new BigDecimal(0)
items.foreach { item =>
if (item.itemType == CHECK) {
numOfChecks += 1
total += item.amount
}
}
So if you find yourself needing a bunch of counters or totals on a list is it preferred to keep mutable variables or not worry about it do something along the lines of:
val checks = items.filter(_.itemType == CHECK)
val total = checks.map(_.amount).sum
return (checks.size, total)
which seems easier to read and only uses vals
Another way of solving your problem in one iteration would be to use views or iterators:
items.iterator.filter(_.itemType == CHECK).map(._amount).sum
or
items.view.filter(_.itemType == CHECK).map(._amount).sum
This way the evaluation of the expression is delayed until the call of sum.
If your items are case classes you could also write it like this:
items.iterator collect { case Item(amount, CHECK) => amount } sum
I find that speaking of doing "three iterations" is a bit misleading -- after all, each iteration does less work than a single iteration with everything. So it doesn't automatically follows that iterating three times will take longer than iterating once.
Creating temporary objects, now that is a concern, because you'll be hitting memory (even if cached), which isn't the case of the single iteration. In those cases, view will help, even though it adds more method calls to do the same work. Hopefully, JVM will optimize that away. See Moritz's answer for more information on views.
You may use foldLeft for that:
(0 /: items) ((total, item) =>
if(item.itemType == CHECK)
total + item.amount
else
total
)
The following code will return a tuple (number of checks -> sum of amounts):
((0, 0) /: items) ((total, item) =>
if(item.itemType == CHECK)
(total._1 + 1, total._2 + item.amount)
else
total
)