Split scala treeset from ordered object - scala

My use-case is very simple and looks like caching so maybe something like Guava is useful but I use scala and prefer not to pull in Guava if I dont need to.
case class AAA(index:Double) extends Ordered[AAA] {
override def compare(that: AAA): Int = index.compare(that.index)
}
var aaaSet = mutable.TreeSet[AAA]()
AAA's mostly come in the set in increasing order but the index value might be lower then what already exists. What I need is a simple function that removes elements lower than a certain index (a Double). This does not -need- to be exact as long nothing above the index gets deleted. I can do this with O(log(n)) complexity but since I always can start at the bottom of the set(or head) I think it can be done more efficient. Obviously I quickly end up with caching libs but these indexes are not time-based and I need up to millions of these sets in my program (hence the wish to go faster then O(log(n))).
Some help and direction to possible solutions are much appreciated. Even if it means that O(log(n)) means best performance.

Even though it is not really the solution I was looking for I think this will be an ok solution:
aaaSet = aaaSet.dropWhile(aa => aa.index < 1.3)

Related

Should Scala immutable case classes be defined to hold Seq[T], immutable.Seq[T], List[T] or Vector[T]?

If we want to define a case class that holds a single object, say a tuple, we can do it easily:
sealed case class A(x: (Int, Int))
In this case, retrieving the "x" value will take a small constant amount of time, and this class will only take a small constant amount of space, regardless of how it was created.
Now, let's assume we want to hold a sequence of values instead; we could it like this:
sealed final case class A(x: Seq[Int])
This might seem to work as before, except that now storage and time to read all of x is proportional to x.length.
However, this is not actually the case, because someone could do something like this:
val hugeList = (1 to 1000000000).toList
val a = A(hugeList.view.filter(_ == 500000000))
In this case, the a object looks like an innocent case class holding a single int in a sequence, but in fact it requires gigabytes of memory, and it will take on the order of seconds to access that single element every time.
This could be fixed by specifying something like List[T] as the type instead of Seq[T]; however, this seems ugly since it adds a reference to a specific implementation, while in fact other well behaved implementations, like Vector[T], would also do.
Another worrying issue is that one could pass a mutable Seq[T], so it seems that one should at least use immutable.Seq instead of scala.collection.Seq (although the compiler can't actually enforce the immutability at the moment).
Looking at most libraries it seems that the common pattern is to use scala.collection.Seq[T], but is this really a good idea?
Or perhaps Seq is being used just because it's the shortest to type, and in fact it would be best to use immutable.Seq[T], List[T], Vector[T] or something else?
New text added in edit
Looking at the class library, some of the most core functionality like scala.reflect.api.Trees does in fact use List[T], and in general using a concrete class seems a good idea.
But then, why use List and not Vector?
Vector has O(1)/O(log(n)) length, prepend, append and random access, is asymptotically smaller (List is ~3-4 times bigger due to vtable and next pointers), and supports cache efficient and parallelized computation, while List has none of those properties except O(1) prepend.
So, personally I'm leaning towards Vector[T] being the correct choice for something exposed in a library data structure, where one doesn't know what operations the library user will need, despite the fact that it seems less popular.
First of all, you talk both about space and time requirements. In terms of space, your object will always be as large as the collection. It doesn't matter whether you wrap a mutable or immutable collection, that collection for obvious reasons needs to be in memory, and the case class wrapping it doesn't take any additional space (except its own small object reference). So if your collection takes "gigabytes of memory", that's a problem of your collection, not whether you wrap it in a case class or not.
You then go on to argue that a problem arises when using views instead of eager collections. But again the question is what the problem actually is? You use the example of lazily filtering a collection. In general running a filter will be an O(n) operation just as if you were iterating over the original list. In that example it would be O(1) for successive calls if that collection was made strict. But that's a problem of the calling site of your case class, not the definition of your case class.
The only valid point I see is with respect to mutable collections. Given the defining semantics of case classes, you should really only use effectively immutable objects as arguments, so either pure immutable collections or collections to which no instance has any more write access.
There is a design error in Scala in that scala.Seq is not aliased to collection.immutable.Seq but a general seq which can be either mutable or immutable. I advise against any use of unqualified Seq. It is really wrong and should be rectified in the Scala standard library. Use collection.immutable.Seq instead, or if the collection doesn't need to be ordered, collection.immutable.Traversable.
So I agree with your suspicion:
Looking at most libraries it seems that the common pattern is to use scala.collection.Seq[T], but is this really a good idea?
No! Not good. It might be convenient, because you can pass in an Array for example without explicit conversion, but I think a cleaner design is to require immutability.

Optimal HashSet Initialization (Scala | Java)

I'm writing an A.I. to solve a "Maze of Life" puzzle. Attempting to store states to a HashSet slows everything down. It's faster to run it without a set of explored states. I'm fairly confident my node (state storage) implements equals and hashCode well as tests show a HashSet doesn't add duplicate states. I may need to rework the hashCode function, but I believe what's slowing it down is the HashSet rehashing and resizing.
I've tried setting the initial capacity to a very large number, but it's still extremely slow:
val initCapacity = java.lang.Math.pow(initialGrid.width*initialGrid.height,3).intValue()
val frontier = new QuickQueue[Node](initCapacity)
Here is the quick queue code:
class QuickQueue[T](capacity: Int) {
val hashSet = new HashSet[T](capacity)
val queue = new Queue[T]
//methods below
For more info, here is the hash function. I store the grid values in bytes in two arrays and access it using tuples:
override def hashCode(): Int = {
var sum = Math.pow(grid.goalCoords._1, grid.goalCoords._2).toInt
for (y <- 0 until grid.height) {
for (x <- 0 until grid.width) {
sum += Math.pow(grid((x, y)).doubleValue(), x.toDouble).toInt
}
sum += Math.pow(sum, y).toInt
}
return sum
}
Any suggestions on how to setup a HashSet that wont slow things down? Maybe another suggestion of how to remember explored states?
P.S. using java.util.HashSet, and even with initial capacity set, it takes 80 seconds vs < 7 seconds w/o the set
Okay, for a start, please replace
override def hashCode(): Int =
with
override lazy val hashCode: Int =
so you don't calculate (grid.height*grid.width) floating point powers every time you need to access the hash code. That should speed things up by an enormous amount.
Then, unless you somehow rely upon close cells having close hash codes, don't re-invent the wheel. Use scala.util.hashing.MurmurHash3.seqHash or somesuch to calculate your hash. This should speed your hash up by another factor of 20 or so. (Still keep the lazy val.)
Then you only have overhead from the required set operations. Right now, unless you have a lot of 0x0 grids, you are using up the overwhelming majority of your time waiting for math.pow to give you a result (and risking everything becoming Double.PositiveInfinity or 0.0, depending on how big the values are, which will create hash collisions which will slow things down still further).
Note that the following assumes all your objects are immutable. This is a sane assumption when using hashing.
Also you should profile your code before applying optimization (use e.g. the free jvisualvm, that comes with the JDK).
Memoization for fast hashCode
Computing the hash code is usually a bottleneck. By computing the hash code only once for each object and storing the result you can reduce the cost of hash code computation to a minimum (once at object creation) at the expense of increased space consumption (probably moderate). To achieve this turn the def hashCode into a lazy val or val.
Interning for fast equals
Once you have the cost of hashCode eliminated, computing equals becomes a problem. equals is particularly expensive for collection fields and deep structures in general.
You can minimize the cost of equals by interning. This means that you acquire new objects of the class through a factory method, which checks whether the requested new object already exists, and if so, returns a reference to the existing object. If you assert that every object of this type is constructed in this way you know that there is only one instance of each distinct object and equals becomes equivalent to object identity, which is a cheap reference comparison (eq in Scala).

Complexity of List.reverse?

In Scala, there is reverse method for lists. What is the complexity of this method? Is it better to simply use the original list and always remember that the list is the reverse of what we expect, or to explicitly use reverse before operating on it.
EDIT: What I am really interested in is to get the last two elements of the original list (or the first two of the reversed list).
So I would do something like:
val myList = origList.reverse
val a = myList(0)
val b = myList(1)
This is not in a loop, just a one-time thing in my library... but if someone else uses the library and puts it in a loop, it is not under my control.
Looking at the source, it's O(n) as you might reasonably expect:
override def reverse: List[A] = {
var result: List[A] = Nil
var these = this
while (!these.isEmpty) {
result = these.head :: result
these = these.tail
}
result
}
If in your code you're able to iterate through the list in reverse order at the same cost of iterating in forward order, then it would be more efficient to do this rather than reversing the List.
In fact, if your alternative operation which involves using the original list works in less than O(n) time, then there's a real argument for going with that. Making an algorithm asymptotically faster will make a huge difference if you ever rely on it more (especially if used inside other loops, as oxbow_lakes points out below).
On the whole though I'd expect that anything where you're reversing a list means that you care about the relative ordering of a non-trivial number of elements, and so whatever you're doing is inherently O(n) anyway. (This might not be true for other data structures such as a binary tree; but lists are linear, and in the extreme case even reverse . head can't be done in O(1) time with a singly-linked list.)
So if you're choosing between two O(n) options - for the vast majority of applications, shaving a few nanoseconds off the iteration time isn't going to really gain you anything. Hence it would be "best" to make your code as readable as possible - which means calling reverse and then iterating, if that's closest to your intention.
(And if your app is too slow, and profiling shows that this list manipulation is a hotspot, then you can think about how to make it more efficient. Which by that point may well involve a different option to both of your current candidates, given the extra context you'll have at that point.)

Scala: read and save all elements of an Iterable

I have an Iterable[T] that is really a stream of unknown length, and want to read it all and save it into something that is still an instance of Iterable. I really do have to read it and save it; I can't do it in a lazy way. The original Iterable can have a few thousand elements, at least. What's the most efficient/best/canonical way? Should I use an ArrayBuffer, a List, a Vector?
Suppose xs is my Iterable. I can think of doing these possibilities:
xs.toArray.toIterable // Ugh?
xs.toList // Fast?
xs.copyToBuffer(anArrayBuffer)
Vector(xs: _*) // There's no toVector, sadly. Is this construct as efficient?
EDIT: I see by the questions I should be more specific. Here's a strawman example:
def f(xs: Iterable[SomeType]) { // xs might a stream, though I can't be sure
val allOfXS = <xs all read in at once>
g(allOfXS)
h(allOfXS) // Both g() and h() take an Iterable[SomeType]
}
This is easy. A few thousand elements is nothing, so it hardly matters unless it's a really tight loop. So the flippant answer is: use whatever you feel is most elegant.
But, okay, let's suppose that this is actually in some tight loop, and you can predict or have benchmarked your code enough to know that this is performance-limiting.
Your best performance for an immutable solution will likely be a Vector, used like so:
Vector() ++ xs
In my hands, this can copy a 10k iterable about 4k-5k times per second. List is about half the speed.
If you're willing to try a mutable solution under the hood, xs.toArray.toIterable usually takes the cake with about 10k copies per second. ArrayBuffer is about the same speed as List.
If you actually know the size of the target (i.e. size is O(1) or you know it from somewhere else), you can shave off another 20-30% of the execution speed by allocating just the right size and writing a while loop.
If it's actually primitives, you can gain a factor of 10 by writing your own specialized Iterable-like-thing that acts on arrays and converts to regular collections via the underlying array.
Bottom line: for a great blend of power, speed, and flexibility, use Vector() ++ xs in most situations. xs.toIndexedSeq defaults to the same thing, with the benefit that if it's already a Vector that it will take no time at all (and chains nicely without using parens), and the drawback that you are relying upon a convention, not a specification for behavior (and it takes 1-3 more characters to type).
How about Stream.force?
Forces evaluation of the whole stream and returns it.
This is hard. An Iterable's methods are defined in terms of its iterator, but that gets overridden by subtraits. For instance, IndexedSeq methods are usually defined in terms of apply.
There is the question of why do you want to copy the Iterable, but I suppose you might be guarding against the possibility of it being mutable. If you do not want to copy it, then you need to rephrase your question.
If you are going to copy it, and you want to be sure all elements are copied in a strict manner, you could use .toList. That will not copy a List, but a List does not need to be copied. For anything else, it will produce a new copy.

Scala linked list stackoverflow

Using scala I have added about 100000 nodes to a linked list. When I use the function length, for example mylist.length. I get a 'java.lang.StackOverflowError' error, is my list to big to process? The list is only string objects.
It appears the library implementation is not tail-recursive override def length: Int = if (isEmpty) 0 else next.length + 1. It seems like this is something that could be discussed on the mailing list to check if an enhancement ticket should be opened.
You can compute the length like this:
def length[T](l:LinkedList[T], acc:Int=0): Int =
if (l.isEmpty) acc else length(l.tail, acc + 1)
In Scala, computing the length of a List is an order n operation, therefore you should try to avoid it. You might consider switching to an Array, as that is a constant time operation.
You could try increasing the stack/heap size available to the JVM.
scala JAVA_OPTS="-Xmx512M -Xms16M -Xss16M" MyClass.scala
Where
-Xss<size> maximum native stack size for any thread
-Xms<size> set initial Java heap size
-Xmx<size> set maximum Java heap size
This question has some more information.
See also this This Scala document.
Can you confirm that you truly need to use the length method? It sounds like you might not be using the correct collection type for your use-case (hard to tell without any extra information). Lists are optimised to be mapped over using folds or a tail-recursive function.
Despite saying that, this is absolutely an oversight that can easily be fixed in the standard library with a tail-recursive function. Hopefully we can get it in time for 2.9.0.