How to efficiently select a random element from a Scala immutable HashSet - scala

I have a scala.collection.immutable.HashSet that I want to randomly select an element from.
I could solve the problem with an extension method like this:
import scala.collection.immutable.HashSet
import scala.util.Random
implicit class HashSetExtensions[T](h: HashSet[T]) {
  def nextRandomElement(): Option[T] = {
    val list = h.toList // copies the whole set: O(n)
    list match {
      case null | Nil => None
      case _ => Some(list(Random.nextInt(list.length)))
    }
  }
}
...but converting to a list will be slow. What would be the most efficient solution?

WARNING: This answer is for experimental use only. For a real project you should probably use your own collection types.
I did some research in the HashSet source, and I think there is little opportunity to extract the inner structure of its most important class, HashTrieSet, without a package violation.
I did come up with this code, which extends Ben Reich's solution:
package scala.collection
import scala.collection.immutable.HashSet
import scala.util.Random
package object random {
  implicit class HashSetRandom[T](set: HashSet[T]) {
    def randomElem: Option[T] = set match {
      case trie: HashSet.HashTrieSet[T] => {
        trie.elems(Random.nextInt(trie.elems.length)).randomElem
      }
      case _ => Some(set.size) collect {
        case size if size > 0 => set.iterator.drop(Random.nextInt(size)).next
      }
    }
  }
}
The file should be created somewhere in the src/scala/collection/random folder.
Note the scala.collection package: it is what makes the elems member of HashTrieSet visible. This is the only solution I could think of that could run better than O(n). The current version should have complexity O(ln(n)), like any of immutable.HashSet's operations. Note, however, that each level picks a child uniformly without weighting it by subtree size, so the result is not guaranteed to be uniformly distributed over the elements.
Another warning: the private structure of HashSet is not part of Scala's standard library API, so it could change in any version and break this code (though it hasn't changed since 2.8).

Since size is O(1) on HashSet, and iterator is as lazy as possible, I think this solution would be relatively efficient:
implicit class RichHashSet[T](val h: HashSet[T]) extends AnyVal {
  def nextRandom: Option[T] = Some(h.size) collect {
    case size if size > 0 => h.iterator.drop(Random.nextInt(size)).next
  }
}
And if you're trying to squeeze out every ounce of efficiency, you could use match instead of the more concise Some/collect idiom used here.
You can look at the mutable HashSet implementation to see the size method. The iterator method defined there basically just calls iterator on FlatHashTable. The same basic efficiencies of these methods apply to immutable HashSet if that's what you're working with. As a comparison, you can see the toList implementation on HashSet is all the way up the type hierarchy at TraversableOnce and uses far more primitive elements which are probably less efficient and (of course) the entire collection must be iterated to generate the List. If you were going to convert the entire set to a Traversable collection, you should use Array or Vector which have constant-time lookup.
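If many random draws are needed from the same set, the advice about converting to an indexed collection could be sketched roughly as follows. The names RepeatedSampling and sampler are hypothetical, and the sketch assumes the set does not change between draws.

```scala
import scala.util.Random

object RepeatedSampling {
  // Copy the set once (O(n)); afterwards each draw is an indexed lookup on the Vector.
  // Returns None for an empty set, otherwise a function that yields a random element.
  def sampler[T](set: Set[T], rng: Random): Option[() => T] = {
    val indexed = set.toVector
    if (indexed.isEmpty) None
    else Some(() => indexed(rng.nextInt(indexed.length)))
  }
}
```

The one-time copy only pays off when several elements are drawn; for a single draw the iterator-based version avoids the copy.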
You might also note that there is nothing special about HashSet in the above methods, and you could enrich Set[T] instead, if you so chose (although there would be no guarantee that this would be as efficient on other Set implementations, of course).
As a side note, when implementing enriched classes for extension methods, you should always consider making an implicit, user-defined value class by extending AnyVal. You can read about some of the advantages and limitations in the docs, and on this answer.
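For illustration, here is a minimal sketch of the same idea enriched onto Set[T] rather than HashSet[T], written with a plain if/else instead of Some/collect. The names RandomElementSketch, RichSet and randomElement are hypothetical, and passing the Random instance explicitly is an assumption made so that results can be reproduced with a seed.

```scala
import scala.util.Random

object RandomElementSketch {
  // Value class wrapper, so the enrichment normally allocates no extra object.
  implicit class RichSet[T](val s: Set[T]) extends AnyVal {
    // Skips a random number of elements on the lazy iterator and takes the next one.
    // Still O(n) in the worst case, but no intermediate collection is built.
    def randomElement(rng: Random): Option[T] =
      if (s.isEmpty) None
      else Some(s.iterator.drop(rng.nextInt(s.size)).next())
  }
}
```

With `import RandomElementSketch._` in scope, `Set(1, 2, 3).randomElement(new Random())` returns one of the three elements wrapped in Some.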

Related

How to model a bloom filter in Scala

I am trying to model a bloom filter in Scala. The logic itself is actually pretty straightforward, but I'm struggling to figure out how to use Scala's data structures adequately to make it nice, idiomatic and functional.
My problem is this: if I use a case class, I need the constructor to generate the hash functions and the bits array that will store the actual bloom filter's data.
But then, in a method like "add" that will change the contents of the bits array, I need to return a new bloom filter instead of mutating the contents of the existing one in order for my method to be referentially transparent.
Unfortunately I can't construct a new bloom filter, because I don't want the new one to re-create a new bits array and new hash functions, and I also can't pass it the existing ones because neither the bits array nor the hash functions are part of the bloom filter case class.
So how am I supposed to model this in Scala?
[ Modified to use BitSet, following comment ]
This is an outline of how it might work.
import scala.collection.immutable.BitSet
trait HashFunctions[T] {
  def apply(value: T): BitSet
}
object Bloom {
  class BloomFactory[T](hash: HashFunctions[T]) {
    case class Bloom(flags: BitSet) {
      // returns a new filter; the hash functions are shared via the enclosing factory
      def add(value: T): Bloom =
        Bloom(flags union hash(value))
      def test(value: T): Boolean =
        hash(value).subsetOf(flags)
    }
  }
  def apply[T](): BloomFactory[T]#Bloom = new BloomFactory(DefaultHashFunctions[T]).Bloom(BitSet.empty)
}
Note that this does create a new Bloom each time you add a value, but this makes the class immutable which is a good idea. The hash functions are created in the companion object so that this does not happen each time you add to the filter.
Clearly this can be made significantly more efficient in both speed and memory usage.
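A self-contained variant of this outline might look like the sketch below. It is not the answer's code: the names BloomFilter, numBits, numHashes and mightContain are hypothetical, and deriving the bit positions by mixing a seed into the element's hash code with MurmurHash3 is just one assumed way to obtain several hash functions.

```scala
import scala.collection.immutable.BitSet
import scala.util.hashing.MurmurHash3

// Immutable Bloom filter sketch: add returns a new filter that shares the configuration.
final class BloomFilter[T] private (bits: BitSet, numBits: Int, numHashes: Int) {
  // One bit position per seed, derived from the element's hash code (assumed scheme).
  private def positions(value: T): BitSet = {
    val idx = (0 until numHashes).map { seed =>
      val h = MurmurHash3.finalizeHash(MurmurHash3.mix(seed, value.##), 1)
      java.lang.Math.floorMod(h, numBits) // always in [0, numBits)
    }
    BitSet(idx: _*)
  }

  def add(value: T): BloomFilter[T] =
    new BloomFilter(bits | positions(value), numBits, numHashes)

  // false means "definitely absent"; true means "possibly present".
  def mightContain(value: T): Boolean = positions(value).subsetOf(bits)
}

object BloomFilter {
  def empty[T](numBits: Int, numHashes: Int): BloomFilter[T] = {
    require(numBits > 0 && numHashes > 0)
    new BloomFilter(BitSet.empty, numBits, numHashes)
  }
}
```

Keeping the constructor private and the configuration in plain fields sidesteps the case-class problem from the question: callers only ever see empty, add and mightContain.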

How to design abstract classes if methods don't have the exact same signature?

This is a "real life" OO design question. I am working with Scala, and interested in specific Scala solutions, but I'm definitely open to hear generic thoughts.
I am implementing a branch-and-bound combinatorial optimization program. The algorithm itself is pretty easy to implement. For each different problem we just need to implement a class that contains information about what are the allowed neighbor states for the search, how to calculate the cost, and then potentially what is the lower bound, etc...
I also want to be able to experiment with different data structures. For instance, one way to store a logic formula is using a simple list of lists of integers. This represents a set of clauses, each integer a literal. We can have a much better performance though if we do something like a "two-literal watch list", and store some extra information about the formula in general.
That all would mean something like this
object BnBSolver {
  def solve[S <: BnBState](states: Seq[S], best_state: Option[S]): Option[S] =
    if (states.isEmpty) best_state
    else {
      val next_state = states.head
      /* compare to best state, etc... */
      val new_states = new_branches ++ states.tail
      solve(new_states, new_best_state)
    }
}
class BnBState[F <: Formula](clauses: F, assigned_variables: List[Int]) {
  def cost: Int
  def branches: Seq[BnBState] = {
    val ll = clauses.pick_variable
    List(
      BnBState(clauses.assign(ll), ll :: assigned_variables),
      BnBState(clauses.assign(-ll), -ll :: assigned_variables)
    )
  }
}
case class Formula[F <: Formula[F]](clauses: List[List[Int]]) {
  def assign(ll: Int): F =
    Formula(clauses.filterNot(_ contains ll)
      .map(_.filterNot(_ == -ll)))
}
Hopefully this is not too crazy, wrong or confusing. The whole issue here is that this assign method from a formula would usually take just the current literal that is going to be assigned. In the case of two-literal watch lists, though, you are doing some lazy thing that requires you to know later what literals have been previously assigned.
One way to fix this is you just keep this list of previously assigned literals in the data structure, maybe as a private thing. Make it a self-standing lazy data structure. But this list of the previous assignments is actually something that may be naturally available by whoever is using the Formula class. So it makes sense to allow whoever is using it to just provide the list every time you assign, if necessary.
The problem here is that we cannot now have an abstract Formula class that just declares an assign(ll: Int): Formula. In the normal case this is OK, but if this is a two-literal watch list Formula, it is actually an assign(literal: Int, previous_assignments: Seq[Int]).
From the point of view of the classes using it, it is kind of OK. But then how do we write generic code that can take all these different versions of Formula? Because of the drastic signature change, it cannot simply be an abstract method. We could maybe force the user to always provide the full assigned variables, but then this is a kind of a lie too. What to do?
The idea is the watch list class just becomes a kind of regular assign(Int) class if I write down some kind of adapter method that knows where to take the previous assignments from... I am thinking maybe with implicit we can cook something up.
I'll try to make my answer a bit general, since I'm not convinced I'm completely following what you are trying to do. Anyway...
Generally, the first thought should be to accept a common super-class as a parameter. Obviously that won't work with Int and Seq[Int].
You could just have two methods; have one call the other. For instance just wrap an Int into a Seq[Int] with one element and pass that to the other method.
You can also wrap the parameter in some custom class, e.g.
class Assignment {
  ...
}
def int2Assignment(n: Int): Assignment = ...
def seq2Assignment(s: Seq[Int]): Assignment = ...
case class Formula[F <: Formula[F]](clauses: List[List[Int]]) {
  def assign(ll: Assignment): F = ...
}
And of course you would have the option to make those conversion methods implicit so that callers just have to import them, not call them explicitly.
Lastly, you could do this with a typeclass:
trait Assigner[A] {
  ...
}
implicit val intAssigner = new Assigner[Int] {
  ...
}
implicit val seqAssigner = new Assigner[Seq[Int]] {
  ...
}
case class Formula[F <: Formula[F]](clauses: List[List[Int]]) {
  def assign[A : Assigner](ll: A): F = ...
}
You could also make that type parameter at the class level:
case class Formula[A: Assigner, F <: Formula[A, F]](clauses: List[List[Int]]) {
  def assign(ll: A): F = ...
}
Which one of these paths is best is up to preference and how it might fit in with the rest of the code.
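To make the type class option concrete, here is a small runnable sketch. The members literal and previous, the instance for (Int, Seq[Int]) and the simplified non-generic Formula are all assumptions made for illustration; the elided bodies in the answer may well have been intended differently.

```scala
// Hypothetical type class: extracts what assign needs from whatever the caller passes in.
trait Assigner[A] {
  def literal(a: A): Int
  def previous(a: A): Seq[Int]
}

object Assigner {
  // A bare literal carries no history.
  implicit val intAssigner: Assigner[Int] = new Assigner[Int] {
    def literal(a: Int): Int = a
    def previous(a: Int): Seq[Int] = Seq.empty
  }
  // A literal paired with the previously assigned literals.
  implicit val withHistoryAssigner: Assigner[(Int, Seq[Int])] =
    new Assigner[(Int, Seq[Int])] {
      def literal(a: (Int, Seq[Int])): Int = a._1
      def previous(a: (Int, Seq[Int])): Seq[Int] = a._2
    }
}

final case class Formula(clauses: List[List[Int]]) {
  // One generic entry point; the implicit instance decides how to read the argument.
  def assign[A](arg: A)(implicit ev: Assigner[A]): Formula = {
    val ll = ev.literal(arg)
    // A watch-list implementation would also consult ev.previous(arg); this one ignores it.
    Formula(clauses.filterNot(_.contains(ll)).map(_.filterNot(_ == -ll)))
  }
}
```

Generic code can then call `formula.assign(1)` or `formula.assign((1, Seq(2)))` through the same method, and a watch-list formula can read the history without the signature changing.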

Override equality for floating point values in Scala

Note: Bear with me, I'm not asking how to override equals or how to create a custom method to compare floating point values.
Scala is very nice in allowing comparison of objects by value, and by providing a series of tools to do so with little code. In particular, case classes, tuples and allowing comparison of entire collections.
I often call methods that do intensive computations and generate a non-trivial data structure to return. I can then write a unit test that, given a certain input, calls the method and compares the result against a hardcoded value. For instance:
def compute() = {
  // do a lot of computations here to produce the set below...
  Set(('a', 1), ('b', 3))
}
val A = compute()
val equal = A == Set(('a', 1), ('b', 3))
// equal = true
This is a bare-bones example and I'm omitting here any code from specific test libraries, etc.
Given that floating point values are not reliably compared with equals, the following, and rather equivalent example, fails:
def compute() = {
  // do a lot of computations here to produce the set below...
  Set(('a', 1.0/3.0), ('b', 3.1))
}
val A = compute()
val equal2 = A == Set(('a', 0.33333), ('b', 3.1)) // Use some arbitrary precision here
// equal2 = false
What I would want is to have a way to make all floating-point comparisons in that call to use an arbitrary level of precision. But note that I don't control (or want to alter in any way) either Set or Double.
I tried defining an implicit conversion from double to a new class and then overloading that class to return true. I could then use instances of that class in my hardcoded validations.
implicit class DoubleAprox(d: Double) {
  override def hashCode = d.hashCode()
  override def equals(other: Any): Boolean = other match {
    case that: Double => (d - that).abs < 1e-5
    case _ => false
  }
}
val equals3 = DoubleAprox(1.0/3.0) == 0.33333 // true
val equals4 = 0.33333 == DoubleAprox(1.0/3.0) // false
But as you can see, it breaks symmetry. Given that I'm then comparing more complex data-structures (sets, tuples, case classes), I have no way to define a priori if equals() will be called on the left or the right. Seems like I'm bound to traverse all the structures and then do single floating-point comparisons on the branches... So, the question is: is there any way to do this at all??
As a side note: I gave a good read to an entire chapter on object equality and several blogs, but they only provide solutions for inheritance problems and require you to basically own all classes involved and change all of them. And all of it seems rather convoluted given what it is trying to solve.
Seems to me that equality is one of those things that is fundamentally broken in Java due to the method having to be added to each class and permanently overridden time and again. What seems more intuitive to me would be to have comparison methods that the compiler can find. Say, you would provide equals(DoubleAprox, Double) and it would be used every time you want to compare 2 objects of those classes.
I think that changing the meaning of equality to mean anything fuzzy is a bad idea. See my comments in Equals for case class with floating point fields for why.
However, it can make sense to do this in a very limited scope, e.g. for testing. I think for numerical problems you should consider using the spire library as a dependency. It contains a large amount of useful things. Among them a type class for equality and mechanisms to derive type class instances for composite types (collections, tuples, etc) based on the type class instances for the individual scalar types.
Since, as you observe, equality in the Java world is fundamentally broken, spire uses other operators (=== for type-safe equality).
Here is an example of how you would redefine equality for a limited scope to get fuzzy equality for comparing test results:
// import the machinery for operators like === (when an Eq type class instance is in scope)
import spire.syntax.all._
object Test extends App {
  // redefine the equality for double, just in this scope, to mean fuzzy equality
  implicit object FuzzyDoubleEq extends spire.algebra.Eq[Double] {
    def eqv(a: Double, b: Double) = (a - b).abs < 1e-5
  }
  // this passes. === looks up the Eq instance for Double in the implicit scope. And
  // since we have not imported the default instance but defined our own, this will
  // find the Eq instance defined above and use its eqv method
  require(0.0 === 0.000001)
  // import automatic generation of type class instances for tuples based on type class instances of the scalars
  // if there is an Eq available for each scalar type of the tuple, this will also make an Eq instance available for the tuple
  import spire.std.tuples._
  require((0.0, 0.0) === (0.000001, 0.0)) // works also for tuples containing doubles
  // import automatic generation of type class instances for arrays based on type class instances of the scalars
  // if there is an Eq instance for the element type of the array, there will also be one for the entire array
  import spire.std.array._
  require(Array(0.0, 1.0) === Array(0.000001, 1.0)) // and for arrays of doubles
  import spire.std.seq._
  require(Seq(1.0, 0.0) === Seq(1.000000001, 0.0))
}
Java equals is indeed not as principled as it should be - people who are very bothered about this use something like Scalaz' Equal and ===. But even that assumes a symmetry of the types involved; I think you would have to write a custom typeclass to allow comparing heterogeneous types.
It's quite easy to write a new typeclass and have instances recursively derived for case classes, using Shapeless' automatic type class instance derivation. I'm not sure that extends to a two-parameter typeclass though. You might find it best to create distinct EqualityLHS and EqualityRHS typeclasses, and then your own equality method for comparing A: EqualityLHS and B: EqualityRHS, which could be pimped onto A as an operator if desired. (Of course it should be possible to extend the technique generically to support two-parameter typeclasses in full generality rather than needing such workarounds, and I'm sure shapeless would greatly appreciate such a contribution).
Best of luck - hopefully this gives you enough to find the rest of the answer yourself. What you want to do is by no means trivial, but with the help of modern Scala techniques it should be very much within the realms of possibility.
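For readers who want to see the type class idea without an extra dependency, here is a hand-rolled sketch. ApproxEq, instance, the =~= operator and the 1e-5 tolerance are all hypothetical choices, and the instances are written by hand rather than derived with Shapeless. The example uses Seq instead of Set because membership in a Set depends on exact equality, which fuzzy comparison cannot provide.

```scala
// Hypothetical type class for tolerance-based equality.
trait ApproxEq[A] { def eqv(x: A, y: A): Boolean }

object ApproxEq {
  def instance[A](f: (A, A) => Boolean): ApproxEq[A] =
    new ApproxEq[A] { def eqv(x: A, y: A): Boolean = f(x, y) }

  // Scalars: doubles compare within a tolerance, chars compare exactly.
  implicit val doubleEq: ApproxEq[Double] =
    instance[Double]((a, b) => math.abs(a - b) < 1e-5)
  implicit val charEq: ApproxEq[Char] = instance[Char](_ == _)

  // Composite instances are built from the instances of their parts.
  implicit def tupleEq[A, B](implicit ea: ApproxEq[A], eb: ApproxEq[B]): ApproxEq[(A, B)] =
    instance[(A, B)]((x, y) => ea.eqv(x._1, y._1) && eb.eqv(x._2, y._2))
  implicit def seqEq[A](implicit ea: ApproxEq[A]): ApproxEq[Seq[A]] =
    instance[Seq[A]]((xs, ys) =>
      xs.length == ys.length && xs.zip(ys).forall { case (x, y) => ea.eqv(x, y) })

  // Operator syntax; both operands must have the same static type A.
  implicit class Ops[A](val x: A) extends AnyVal {
    def =~=(y: A)(implicit e: ApproxEq[A]): Boolean = e.eqv(x, y)
  }
}
```

After `import ApproxEq._`, an expression such as `Seq(('a', 1.0 / 3.0)) =~= Seq(('a', 0.333333))` compares element by element using the tolerance. Because the comparison is a function of both operands rather than a method of one of them, the symmetry problem from the question does not arise.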

Scala BitSet and shift operations

I'm looking for a way to represent a set of integers with a bit vector (which would be the characteristic function of that set of integers) and be able to perform bitwise operations on this set.
Initially I thought scala's BitSet would be the ideal candidate. However, it seems BitSet doesn't support shifting operations according to the documentation 1. Upon further investigation I also found that the related Java BitSet implementation doesn't support shift operations either 2.
Am I left with the only option of implementing my own BitSet class which supports shift operations? Moreover, according to the description given in 3 it doesn't sound that difficult to support shift operations on the Scala's BitSet implementation, or have I misunderstood something here?
Thanks in advance.
The usual trick when faced with a need for retrofitting new functionality is the "Pimp My Library" pattern. Implicitly convert the BitSet to a dedicated type intended to perform the added operation:
class ShiftableBitSet(bs: BitSet) {
  def shiftLeft(n: Int): BitSet = ... //impl goes here
}
implicit def bitsetIsShiftable(bs: BitSet) = new ShiftableBitSet(bs)
val sample = BitSet(1,2,3,5,7,9)
val shifted = sample.shiftLeft(2)
Alter shiftLeft to whatever name and with whatever arguments you prefer.
UPDATE
If you know for certain that you'll have an immutable BitSet, then a (slightly hacky) approach to access the raw underlying array is to pattern match. Not too painful either, as there are only 3 possible concrete subclasses for an immutable BitSet:
import collection.immutable.BitSet
val bitSet = BitSet(1,2,3)
bitSet match {
  case bs: BitSet.BitSet1 => Array(bs.elems) // single Long word
  case bs: BitSet.BitSetN => bs.elems        // Array[Long] of words
  case _ => sys.error("unusable BitSet")
}
Annoyingly, the elems1 param to BitSet2 isn't a val, and the elems param to a mutable BitSet is marked protected. So it's not perfect, but should do the trick if your set is non-trivial and immutable. For the trivial cases, "normal" access to the set won't be too expensive.
And yes, this technique would be used within the wrapper as described above.
You can just use map, for example to shift to left by 4 positions:
import collection.immutable.BitSet
val bitSet = BitSet(1,2,3)
bitSet map (_ + 4)
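Putting the wrapper idea and the map approach together, a shiftable BitSet could be sketched as below. The names BitSetShiftSketch, shiftLeft and shiftRight are hypothetical, and this version works element by element rather than on the underlying Long words, so it favours simplicity over speed.

```scala
import scala.collection.immutable.BitSet

object BitSetShiftSketch {
  implicit class ShiftableBitSet(val bs: BitSet) extends AnyVal {
    // Shifting left by n adds n to every member of the set.
    def shiftLeft(n: Int): BitSet = {
      require(n >= 0, "shift amount must be non-negative")
      bs.map(_ + n)
    }
    // Shifting right by n drops members below n and subtracts n from the rest.
    def shiftRight(n: Int): BitSet = {
      require(n >= 0, "shift amount must be non-negative")
      bs.filter(_ >= n).map(_ - n)
    }
  }
}
```

With `import BitSetShiftSketch._` in scope, `BitSet(1, 2, 3).shiftLeft(4)` yields a set containing 5, 6 and 7.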

How would I implement a fixed size List in Scala?

For example suppose I want a list that contains 0 up to a max of 1000 elements. Above this, the oldest insertions should be dropped first. Do collections support this functionality natively? If not how would I go about the implementation? I understand that certain operations are very slow on Lists so maybe I need a different data type?
Looking at an element should not affect the list. I would like insert and size operations only.
It sounds like you want a size-bounded queue. Here's a similar question: Maximum Length for scala queue
There are three solutions presented in that question. You can,
Write a queue from scratch (paradigmatic gave code for this),
Extend Scala's Queue implementation by subclassing, or
Use the typeclass extension pattern (aka, "pimp my library") to extend Scala's Queue.
Here is my first-pass implementation, in case someone else finds it useful:
import scala.collection._
import mutable.ListBuffer
class FixedList[A](max: Int) extends Traversable[A] {
  val list: ListBuffer[A] = ListBuffer()
  def append(elem: A): Unit = {
    if (list.size == max) {
      list.trimStart(1) // drop the oldest element
    }
    list.append(elem)
  }
  def foreach[U](f: A => U) = list.foreach(f)
}
A circular array is the fastest implementation. It's basically an array with a read and write index which are wrapped when reaching the end of the array. Size is defined as:
def size = writeIndex - readIndex + (if (readIndex > writeIndex) array.size else 0)
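A minimal sketch of such a circular array follows. The class name RingBuffer and its methods are hypothetical, and instead of separate read and write indices it tracks the index of the oldest element plus an element count, which is an assumed variation that avoids the ambiguity between a full and an empty buffer.

```scala
import scala.reflect.ClassTag

// Fixed-capacity buffer: once full, each append overwrites the oldest element.
final class RingBuffer[A: ClassTag](capacity: Int) {
  require(capacity > 0, "capacity must be positive")
  private val array = new Array[A](capacity)
  private var start = 0 // index of the oldest element
  private var count = 0 // number of elements currently stored

  def size: Int = count

  def append(elem: A): Unit = {
    val writeIndex = (start + count) % capacity
    array(writeIndex) = elem
    if (count == capacity) start = (start + 1) % capacity // oldest was overwritten
    else count += 1
  }

  // Elements from oldest to newest.
  def toList: List[A] = List.tabulate(count)(i => array((start + i) % capacity))
}
```

Both append and size run in constant time, and no allocation happens after construction.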
While not an answer to the question's details (though it does somewhat answer the question's title), List.fill(1000){0} would create a List of length 1000 with an initial value of 0. This is from: Scala - creating a type parametrized array of specified length