I am going through the List methods in Scala.
val mylist = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 3, 10)
I am quite confused by hasDefiniteSize and knownSize.
For List, hasDefiniteSize returns true and knownSize returns -1.
What is the exact theory behind these methods?
These methods are defined on a supertype that List shares with potentially infinite collections (such as Stream, LazyList and Iterator).
For more details, I believe the documentation puts it best.
Here is the one for hasDefiniteSize in version 2.13.1:
Tests whether this collection is known to have a finite size. All
strict collections are known to have finite size. For a non-strict
collection such as Stream, the predicate returns true if all elements
have been computed. It returns false if the stream is not yet
evaluated to the end. Non-empty Iterators usually return false even if
they were created from a collection with a known finite size.
Note: many collection methods will not work on collections of infinite
sizes. The typical failure mode is an infinite loop. These methods
always attempt a traversal without checking first that hasDefiniteSize
returns true. However, checking hasDefiniteSize can provide an
assurance that size is well-defined and non-termination is not a
concern.
Note that hasDefiniteSize is deprecated with the following message:
(Since version 2.13.0) Check .knownSize instead of .hasDefiniteSize
for more actionable information (see scaladoc for details)
The documentation for knownSize further states:
The number of elements in this collection, if it can be cheaply
computed, -1 otherwise. Cheaply usually means: Not requiring a
collection traversal.
List is an implementation of a linked list, which is why List(1, 2, 3).hasDefiniteSize returns true (the collection is not boundless) but List(1, 2, 3).knownSize returns -1 (computing the collection size requires traversing the whole list).
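To make the contrast concrete, here is a minimal sketch assuming Scala 2.13 or later. The object name KnownSizeDemo and the value names are made up for illustration; only knownSize itself comes from the standard library.

```scala
object KnownSizeDemo {
  // A strict linked list: finite, but its length is only found by walking it.
  val listKnown: Int = List(1, 2, 3).knownSize

  // Vector keeps track of its length, so no traversal is needed.
  val vectorKnown: Int = Vector(1, 2, 3).knownSize

  // An unevaluated, infinite LazyList cannot report a size cheaply either.
  val lazyKnown: Int = LazyList.from(1).knownSize

  def main(args: Array[String]): Unit =
    println(s"List: $listKnown, Vector: $vectorKnown, LazyList: $lazyKnown")
}
```

A result of -1 therefore only means "not cheaply known"; it does not distinguish a finite List from an endless LazyList.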
Some collections know their size
Vector(1,2,3).knownSize // 3
and some do not
List(1,2,3).knownSize // -1
If a collection knows its size, then some operations can be optimised. For example, consider how Iterable#sizeCompare uses knownSize to return early where possible:
def sizeCompare(that: Iterable[_]): Int = {
  val thatKnownSize = that.knownSize
  if (thatKnownSize >= 0) this sizeCompare thatKnownSize
  else {
    val thisKnownSize = this.knownSize
    if (thisKnownSize >= 0) {
      val res = that sizeCompare thisKnownSize
      // can't just invert the result, because `-Int.MinValue == Int.MinValue`
      if (res == Int.MinValue) 1 else -res
    } else {
      val thisIt = this.iterator
      val thatIt = that.iterator
      while (thisIt.hasNext && thatIt.hasNext) {
        thisIt.next()
        thatIt.next()
      }
      java.lang.Boolean.compare(thisIt.hasNext, thatIt.hasNext)
    }
  }
}
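As a rough usage sketch (Scala 2.13+ assumed; SizeCompareDemo and its value names are hypothetical), sizeCompare can even be called with an endless collection on one side, because it stops as soon as the shorter side is exhausted. Only the sign of the result is meaningful.

```scala
object SizeCompareDemo {
  val finite: List[Int] = List(1, 2, 3)
  val endless: LazyList[Int] = LazyList.from(1)

  // Neither side has a known size, so both are walked until the List runs out.
  val listVsEndless: Int = finite.sizeCompare(endless)

  // Vector knows its size, so only the first few LazyList elements are inspected.
  val vectorVsEndless: Int = Vector(1, 2).sizeCompare(endless)

  // Same number of elements on both sides.
  val vectorVsList: Int = Vector(1, 2, 3).sizeCompare(finite)

  def main(args: Array[String]): Unit =
    println(s"$listVsEndless $vectorVsEndless $vectorVsList")
}
```

Calling finite.size == endless.size instead would never terminate, since size has to traverse the whole LazyList.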
See related question Difference between size and sizeIs
I would like to know: if I use a Set instead of an Array, does first(where:) become O(1)?
Apple says that the first(where:) method is O(n). Is that always the case, or does it depend on how we use it?
For example, look at these two ways of coding:
var numbers: [Int] = [Int]()
numbers = [3, 7, 4, -2, 9, -6, 10, 1]
if let searchResult = numbers.first(where: { value in value == -2 })
{
    print("The number \(searchResult) Exist!")
}
else
{
    print("The number does not Exist!")
}
and this:
var numbers: Set<Int> = Set<Int>()
numbers = [3, 7, 4, -2, 9, -6, 10, 1]
if let searchResult = numbers.first(where: { value in value == -2 })
{
    print("The number \(searchResult) Exist!")
}
else
{
    print("The number does not Exist!")
}
Can we say that the complexity of the second version is O(1)?
It's still O(n) even when you use a Set. .first(where:) is defined on a sequence, and it is necessary to check the items in the sequence one at a time to find the first one that makes the predicate true.
Your example is simply checking if the item exists in the Set, but since you are using .first(where:) and a predicate { value in value == -2} Swift will run that predicate for each element in the sequence in turn until it finds one that returns true. Swift doesn't know that you are really just checking to see if the item is in the set.
If you want O(1), then use .contains(-2) on the Set.
I recommend learning more about Big-O notation. O(1) is a strict subset of O(n), so every function that is O(1) is also O(n).
That said, Apple's documentation is somewhat misleading, as it does not take the complexity of the predicate into account. With numbers being an Array, where contains is itself O(n), the following is clearly O(n^2):
numbers.first(where: { value in numbers.contains(value + 42) })
Both Set and Dictionary conform to the Sequence protocol, which is the one that exposes the first(where:) function. And this function has the following requirement, taken from the documentation:
Complexity: O(n), where n is the length of the sequence.
Now, this is the upper limit of the function's complexity; it might well be that some sequences optimize the search based on their data type and storage details.
Bottom line: you need to consult the documentation for a particular type if you want to know more about the performance of some feature. However, if you are only passing around protocol references, you should assume the "worst", i.e. what is in the protocol documentation.
This is the implementation of the first(where:) function in the sequence:
/// - Complexity: O(*n*), where *n* is the length of the sequence.
@inlinable
public func first(
  where predicate: (Element) throws -> Bool
) rethrows -> Element? {
  for element in self {
    if try predicate(element) {
      return element
    }
  }
  return nil
}
From the Swift source code on GitHub.
As you can see, it's a simple for loop, and the complexity is O(n) (assuming the predicate itself is O(1)).
The predicate executes at most n times, so the worst case is O(n).
Set does not have an overload for this function (it would make little sense, since a Set holds each value at most once). If you are just looking for a value rather than testing a predicate, use contains or firstIndex(of:). On a Set, both have overloads with O(1) complexity.
From the Swift source code on GitHub.
What is the semantic difference between size and sizeIs? For example,
List(1,2,3).sizeIs > 1 // true
List(1,2,3).size > 1 // true
Luis mentions in a comment that
...on 2.13+ one can use sizeIs > 1 which will be more efficient than
size > 1 as the first one does not compute all the size before
returning
Add size comparison methods to IterableOps #6950 seems to be the pull request that introduced it.
Reading the scaladoc
Returns a value class containing operations for comparing the size of
this $coll to a test value. These operations are implemented in terms
of sizeCompare(Int)
it is not clear to me why sizeIs is more efficient than regular size.
As far as I understand the changes, the idea is that sizeIs can be more efficient for collections whose size is not O(1) (constant time), especially for comparisons with small values (like 1 in the comment).
But why?
Simply because, instead of computing the whole size and then doing the comparison, sizeIs returns an object which can return early while performing the comparison.
For example, let's check the code:
def sizeCompare(otherSize: Int): Int = {
  if (otherSize < 0) 1
  else {
    val known = knownSize
    if (known >= 0) Integer.compare(known, otherSize)
    else {
      var i = 0
      val it = iterator
      while (it.hasNext) {
        if (i == otherSize) return if (it.hasNext) 1 else 0 // HERE!!! - return as fast as possible.
        it.next()
        i += 1
      }
      i - otherSize
    }
  }
}
Thus, in the example from the comment, suppose a very, very, very long List of three elements. sizeIs > 1 will return as soon as it knows that the List has at least one element and that there are more after it, saving the cost of traversing the remaining elements to compute a size of 3 before doing the comparison.
Note that if the size of the collection is smaller than or equal to the comparison value, the whole collection is traversed anyway, so the performance would be roughly the same (maybe slower than just size, due to the extra comparison on each cycle). Thus, I would only recommend this for comparisons with small values, or when you believe the value will be smaller than the size of the collection.
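The early return can be observed with a small sketch, assuming Scala 2.13+. The Counting class below is a hypothetical helper, not part of the standard library: an Iterable with no known size that counts how often next() is called on its iterators.

```scala
object SizeIsDemo {
  // Hypothetical helper: yields 1..n and records every next() call.
  final class Counting(n: Int) extends Iterable[Int] {
    var steps = 0
    def iterator: Iterator[Int] = new Iterator[Int] {
      private var i = 0
      def hasNext: Boolean = i < n
      def next(): Int = { steps += 1; i += 1; i }
    }
  }

  // sizeIs stops as soon as the comparison can be decided.
  val a = new Counting(1000)
  val viaSizeIs: Boolean = a.sizeIs > 1
  val stepsSizeIs: Int = a.steps

  // size has to walk all 1000 elements before comparing.
  val b = new Counting(1000)
  val viaSize: Boolean = b.size > 1
  val stepsSize: Int = b.steps

  def main(args: Array[String]): Unit =
    println(s"sizeIs: $stepsSizeIs step(s), size: $stepsSize step(s)")
}
```

Both comparisons give the same answer; only the number of elements visited differs.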
I just encountered an issue with degrading fs2 performance using a stream of strings to be written to a file via text.utf8encode. I tried to change my source to use chunked strings to increase performance, but the observation was performance degradation instead.
As far as I can see, it boils down to the following: Invoking flatMap on a stream that originates from Stream.emits() can be very expensive. Time usage seems to be exponential based on the size of the sequence passed to Stream.emits(). The code snippet below shows an example:
/*
Test done with scala 2.11.11 and fs2 version 0.10.0-M7.
*/
val rangeSize = 20000
val integers = (1 to rangeSize).toVector
// Note that the last flatMaps are just added to show extreme load for streamA.
val streamA = Stream.emits(integers).flatMap(Stream.emit(_))
val streamB = Stream.range(1, rangeSize + 1).flatMap(Stream.emit(_))
streamA.toVector // Uses approx. 25 seconds (!)
streamB.toVector // Uses approx. 15 milliseconds
Is this a bug, or should usage of Stream.emits() for large sequences be avoided?
TLDR: Allocations.
Longer answer:
Interesting question. I ran a JFR profile on both methods separately, and looked at the results. First thing which immediately attracted my eye was the amount of allocations.
[Profiler screenshots: allocations for Stream.emits and for Stream.range]
We can see that Stream.emits allocates a significant number of Append instances. Append is a concrete implementation of Catenable[A], the type that is folded over in this case:
private[fs2] final case class Append[A](left: Catenable[A], right: Catenable[A]) extends Catenable[A]
This comes from how the Catenable[A] is built up with foldLeft:
foldLeft(empty: Catenable[B])((acc, a) => acc :+ f(a))
Where :+ allocates a new Append object for each element. This means we're at least generating 20000 such Append objects.
There is also a hint in the documentation of Stream.range: range produces its elements lazily, whereas emits produces the whole sequence in a single chunk, which may be bad if we are generating a big range:
/**
  * Lazily produce the range `[start, stopExclusive)`. If you want to produce
  * the sequence in one chunk, instead of lazily, use
  * `emits(start until stopExclusive)`.
  *
  * @example {{{
  * scala> Stream.range(10, 20, 2).toList
  * res0: List[Int] = List(10, 12, 14, 16, 18)
  * }}}
  */
def range(start: Int, stopExclusive: Int, by: Int = 1): Stream[Pure,Int] =
  unfold(start){i =>
    if ((by > 0 && i < stopExclusive && start < stopExclusive) ||
        (by < 0 && i > stopExclusive && start > stopExclusive))
      Some((i, i + by))
    else None
  }
You can see that there is no additional wrapping here, only the integers that get emitted as part of the range. On the other hand, Stream.emits creates an Append object for every element in the sequence, where we have a left containing the tail of the stream, and right containing the current value we're at.
Is this a bug? I would say no, but I would definitely open this up as a performance issue to the fs2 library maintainers.
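To illustrate the allocation pattern described above, here is a toy sketch. The Cat, Empty, Single and Append types and the snoc and countAppends helpers are simplified stand-ins invented for this example; they are not fs2's actual Catenable implementation. The point is only that appending element by element in a fold creates one wrapper node per element.

```scala
object AppendSketch {
  // Toy stand-ins, NOT fs2's real Catenable.
  sealed trait Cat[+A]
  case object Empty extends Cat[Nothing]
  final case class Single[A](value: A) extends Cat[A]
  final case class Append[A](left: Cat[A], right: Cat[A]) extends Cat[A]

  // Appending one element wraps everything accumulated so far in a new node.
  def snoc[A](acc: Cat[A], a: A): Cat[A] = Append(acc, Single(a))

  // Count Append nodes with an explicit stack, because the structure is
  // nested 20000 levels deep and plain recursion could overflow the stack.
  def countAppends[A](c: Cat[A]): Int = {
    var stack: List[Cat[A]] = List(c)
    var count = 0
    while (stack.nonEmpty) {
      val head = stack.head
      stack = stack.tail
      head match {
        case Append(l, r) =>
          count += 1
          stack = l :: r :: stack
        case _ => ()
      }
    }
    count
  }

  // Same shape as the fold quoted above: one append per element.
  val built: Cat[Int] = (1 to 20000).foldLeft(Empty: Cat[Int])((acc, a) => snoc(acc, a))
  val appendCount: Int = countAppends(built)

  def main(args: Array[String]): Unit =
    println(s"Append nodes allocated: $appendCount")
}
```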
New to Scala. I'm iterating a for loop 100 times. 10 times I want condition 'a' to be met and 90 times condition 'b'. However I want the 10 a's to occur at random.
The best way I can think is to create a val of 10 random integers, then loop through 1 to 100 ints.
For example:
val z = List.fill(10)(100).map(scala.util.Random.nextInt)
z: List[Int] = List(71, 5, 2, 9, 26, 96, 69, 26, 92, 4)
Then something like:
for (i <- 1 to 100) {
whenever i == to a number in z: 'Condition a met: do something'
else {
'condition b met: do something else'
}
}
I tried using contains and == and =! but nothing seemed to work. How else can I do this?
Your generation of random numbers could yield duplicates... is that OK? Here's how you can easily generate 10 unique numbers 1-100 (by generating a randomly shuffled sequence of 1-100 and taking first ten):
val r = scala.util.Random.shuffle(1 to 100).toList.take(10)
Now you can simply partition a range 1-100 into those who are contained in your randomly generated list and those who are not:
val (listOfA, listOfB) = (1 to 100).partition(r.contains(_))
Now do whatever you want with those two lists, e.g.:
println(listOfA.mkString(","))
println(listOfB.mkString(","))
Of course, you can always simply go through the list one by one:
(1 to 100).map {
  case i if (r.contains(i)) => println("yes: " + i) // or whatever
  case i => println("no: " + i)
}
What you consider to be a simple for-loop actually isn't one. It's a for-comprehension, which is syntactic sugar that desugars into chained calls of foreach, map, flatMap and withFilter. Yes, it can be used in the same way as you would use a classical for-loop, but this is only because List is in fact a monad. Without going into too much detail, if you want to do things the idiomatic Scala way (the "functional" way), you should avoid writing classical iterative for loops and prefer getting a collection of your data and then mapping over its elements to perform whatever you need. Note that collections have a really rich library behind them, which allows you to invoke handy methods such as partition.
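As a minimal sketch of that desugaring (the object and value names are made up for illustration), each for-comprehension below is paired with the method calls the compiler rewrites it into, roughly speaking:

```scala
object ForDesugarDemo {
  val xs: List[Int] = List(1, 2, 3, 4)

  // A generator with a guard and a yield ...
  val viaFor: List[Int] = for (x <- xs if x % 2 == 0) yield x * 10
  // ... corresponds to withFilter followed by map.
  val viaCalls: List[Int] = xs.withFilter(x => x % 2 == 0).map(x => x * 10)

  // Two generators correspond to flatMap around map.
  val pairsFor: List[(Int, String)] =
    for (x <- List(1, 2); y <- List("a", "b")) yield (x, y)
  val pairsCalls: List[(Int, String)] =
    List(1, 2).flatMap(x => List("a", "b").map(y => (x, y)))

  // Without yield, the loop corresponds to foreach and is run only for its side effects.
  val seen = scala.collection.mutable.ListBuffer.empty[Int]
  for (x <- xs) seen += x

  def main(args: Array[String]): Unit =
    println(s"$viaFor $pairsFor $seen")
}
```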
EDIT (for completeness):
Also, you should avoid side effects, or at least push them as far down the road as possible. I'm talking about the second example from my answer. Let's say you really need to log that stuff (you would be using a logger, but println is good enough for this example). Doing it like this is bad. By the way, note that you could use foreach instead of map in that case, because you're not collecting results, just performing the side effects.
A good way would be to compute what you need by turning each element into an appropriate string. So, calculate the needed strings and accumulate them into results:
val results = (1 to 100).map {
  case i if (r.contains(i)) => ("yes: " + i) // or whatever
  case i => ("no: " + i)
}
// do whatever with results, e.g. print them
Now results contains a list of a hundred "yes x" and "no x" strings, but you didn't do the ugly thing and perform logging as a side effect in the mapping process. Instead, you mapped each element of the collection into a corresponding string (note that original collection remains intact, so if (1 to 100) was stored in some value, it's still there; mapping creates a new collection) and now you can do whatever you want with it, e.g. pass it on to the logger. Yes, at some point you need to do "the ugly side effect thing" and log the stuff, but at least you will have a special part of code for doing that and you will not be mixing it into your mapping logic which checks if number is contained in the random sequence.
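Putting the pieces of this answer together, a self-contained sketch could look like the following. The fixed seed, the conversion to a Set and the names RandomSplitDemo, chosen and labels are assumptions made for illustration: the seed only makes the run repeatable, and the Set makes each membership check cheap.

```scala
import scala.util.Random

object RandomSplitDemo {
  // Fixed seed only so the example is repeatable; drop it for real randomness.
  val rng = new Random(42)

  // Ten distinct numbers from 1 to 100, kept in a Set for fast lookups.
  val chosen: Set[Int] = rng.shuffle((1 to 100).toList).take(10).toSet

  // Pure mapping step: decide per element, without any side effects.
  val labels: Seq[String] =
    (1 to 100).map(i => if (chosen(i)) s"a: $i" else s"b: $i")

  // Side effects are kept in one place, at the very end.
  def main(args: Array[String]): Unit =
    labels.foreach(println)
}
```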
(1 to 100).foreach { x =>
  if (z.contains(x)) {
    // do something
  } else {
    // do something else
  }
}
or you can use a partial function, like so:
(1 to 100).foreach {
  case x if (z.contains(x)) => // do something
  case _ => // do something else
}
I am trying to build a test case that checks if two streams are the same. Zip can be used to check the value elements are the same, but it doesn't help if one stream is the wrong length. Any ideas on how to approach this?
There's an operator for that: sequenceEqual.
Returns
(Observable): An observable sequence that contains a single element which indicates whether both sequences are of equal length and their corresponding elements are equal according to the specified equality comparer.
Here's a simple example showing the length equality check.
var log = console.log.bind(console);
Rx.Observable.of(1, 2, 3)
  .sequenceEqual(Rx.Observable.of(1, 2, 3))
  .subscribe(log); // logs true
Rx.Observable.of(1, 2, 3)
  .sequenceEqual(Rx.Observable.of(1, 2))
  .subscribe(log); // logs false