What's the difference between a set and a mapping to boolean?

In Scala, I sometimes use a Map[A, Boolean], sometimes a Set[A]. There's really not much difference between these two concepts, and an implementation might use the same data structure to implement them. So why bother to have Sets? As I said, this question occurred to me in connection with Scala, but it would arise in any programming language whose library implements a Set abstraction.

There are some specific convenience methods defined on Set (intersect, diff and more). Not a big deal, but often useful.

My first thoughts are two:
efficiency: if you only want to signal presence, why bother with a flag that can be either true or false?
meaning: a set is about the existence of something, while a map is about a correlation between a key and a value (generally speaking); these two ideas are quite different and should be used accordingly, to make the code easier to read and understand
Furthermore, the semantics of application change:
val map: Map[String, Boolean] = Map("hello" -> true, "world" -> false)
val set: Set[String] = Set("hello")
map("hello") // true
map("world") // false
map("none!") // EXCEPTION
set("hello") // true
set("world") // false
set("none!") // false
All this without having to actually store an extra pair to indicate absence (not to mention the boolean that indicates it).
Sets are very good at indicating the presence of something, which makes them very good for filtering:
val map = (0 until 10).map(_.toString).zipWithIndex.toMap
val set = (3 to 5).map(_.toString).toSet
map.filterKeys(set) // keeps only the pairs whose keys are "3", "4" and "5"
Maps, in terms of processing collections, are good at expressing relationships, which makes them very good for collecting:
set.collect(map) // more or less equivalent to the above, but only the values are returned
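This works because Set[A] extends A => Boolean, so a set can be passed directly wherever a one-argument predicate is expected. A small self-contained sketch (the values are made up):
val vowels = Set('a', 'e', 'i', 'o', 'u')
"scala".filter(vowels) // "aa"
"scala".count(vowels) // 2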
You can read more about using collections as functions to process other collections here.

There are several reasons:
1) It is easier to think about and work with a data structure that only holds single elements, as opposed to one that maps every key to a dummy true.
For example, it is easier to convert a list to a Set than to a Map:
scala> val myList = List(1,2,3,2,1)
myList: List[Int] = List(1, 2, 3, 2, 1)
scala> myList.toSet
res9: scala.collection.immutable.Set[Int] = Set(1, 2, 3)
scala> myList.map(x => (x, true)).toMap
res1: scala.collection.immutable.Map[Int,Boolean] = Map(1 -> true, 2 -> true, 3 -> true)
2) As Kombajn zbożowy pointed out, Sets have additional helper methods: union, intersect, diff and subsetOf (see the sketch after this list).
3) Since a Set does not store a mapping to a dummy value, it takes less memory; this is more noticeable for small keys.
Having said that, not all languages have a Set data structure; Go, for example, does not.
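To illustrate point 2, here is a minimal sketch of those helper methods (the element values are made up for illustration):
val a = Set(1, 2, 3)
val b = Set(2, 3, 4)
a union b // Set(1, 2, 3, 4)
a intersect b // Set(2, 3)
a diff b // Set(1)
Set(2, 3) subsetOf a // true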

Related

Can I mutate a variable in place in a purely functional way?

I know I can use state passing and state monads for purely functional mutation, but afaik that's not in-place and I want the performance benefits of doing it in-place.
An example would be great, e.g. adding 1 to a number, preferably in Idris but Scala will also be good
p.s. is there a tag for mutation? can't see one
No, this is not possible in Scala.
It is however possible to achieve the performance benefits of in-place mutation in a purely functional language. For instance, let's take a function that updates an array in a purely functional way:
def update(arr: Array[Int], idx: Int, value: Int): Array[Int] =
  arr.take(idx) ++ Array(value) ++ arr.drop(idx + 1)
We need to copy the array here in order to maintain purity. The reason is that if we mutated it in place, we'd be able to observe that after calling the function:
def update(arr: Array[Int], idx: Int, value: Int): Array[Int] = {
  arr(idx) = value
  arr
}
The following code will work fine with the first implementation but break with the second:
val arr = Array(1, 2, 3)
assert(arr(1) == 2)
val arr2 = update(arr, 1, 42)
assert(arr2(1) == 42) // so far, so good…
assert(arr(1) == 2) // oh noes!
The solution in a purely functional language is to simply forbid the last assert. If you can't observe the fact that the original array was mutated, then there's nothing wrong with updating the array in place! The means to achieve this is called linear types. Linear values are values that you can use exactly once. Once you've passed a linear value to a function, the compiler will not allow you to use it again, which fixes the problem.
There are two languages I know of that have this feature: ATS and Haskell. If you want more details, I'd recommend this talk by Simon Peyton Jones, where he explains the implementation in Haskell:
https://youtu.be/t0mhvd3-60Y
Support for linear types has since been merged into GHC: https://www.tweag.io/blog/2020-06-19-linear-types-merged/

Any concise way to define a constant function (mapping) in scala

I am a newbie in Scala. I have the following code to define a constant function that returns true for 1, 2, 3 and false for all other Integers (in effect, the function defines the set {1, 2, 3} of integers):
val a = Node1 _
def Node1(x: Int): Boolean = {
  if (x == 1 || x == 2 || x == 3) { true }
  else { false }
}
Is there any way to define this function more concisely?
val a: Int => Boolean = Set(1, 2, 3).contains(_)
which is effectively the same as
val a: Int => Boolean = Set(1, 2, 3)
This is because Set's apply method is the same as its contains method.
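For instance, the resulting function behaves exactly like a membership test (a small usage sketch):
val a: Int => Boolean = Set(1, 2, 3)
a(2) // true
a(5) // false
List(1, 4, 3).filter(a) // List(1, 3)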
On lookup performance: although, as mentioned above, Set.apply conveys the desired semantics, note that Set is a trait with several implementations, for instance ListSet, HashSet and TreeSet, each designed for specific uses.
For the purpose of looking up an element, the default HashSet offers effectively constant-time lookup, while ListSet is linear and TreeSet logarithmic (see Collections Performance Characteristics for details). This becomes relevant especially for large collections.
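If you do want to pin a particular implementation rather than rely on whatever Set(...) returns, a minimal sketch (the element values are just for illustration):
import scala.collection.immutable.{HashSet, TreeSet}
val fast: Int => Boolean = HashSet(1, 2, 3) // effectively constant-time lookup
val ordered: Int => Boolean = TreeSet(1, 2, 3) // logarithmic lookup, keeps elements sorted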

How to remove duplicates from collection (without creating new ones in-between)?

So first up, I'm fully aware mutation is a bad idea, but I need to keep object creation down to a minimum as I have an incredibly huge amount of data to process (keeps GC hang time down and speeds up my code).
What I want is a scala collection that has a method like distinct or similar, or possibly a library or code snippet (but native scala preferred) such that the method is side effecting / mutating the collection rather than creating a new collection.
I've explored the usual suspects like ArrayBuffer, mutable.List, Array, MutableList, Vector and they all "create a new sequence" from the original rather than mutate the original in place. Am I trying to find something that does not exist? Will I just have to write my own??
I think this exists in C++ http://www.cplusplus.com/reference/algorithm/unique/
Also, mega mega bonus points if there is some kind of awesome tail recursive way of doing this so that any bookkeeping structures created are kept in a single stack frame that is thus deallocated from memory once the method exits. The reason this would be uber cool is then even if the method creates some instances of things in order to perform the removal of duplicates, those instance will not need to be garbage collected and therefore not contribute to massive GC hangs. It doesn't have to be recursion, as long as it's likely to cause the objects to go on the stack (see escape analysis here http://www.ibm.com/developerworks/java/library/j-jtp09275/index.html)
(Also if I can specify and fix the capacity (size in memory) of the collection that would also be great)
The algorithm you mentioned (for C++) is for consecutive duplicates. So if you only need to handle consecutive duplicates, you could use some LinkedList, but mutable lists have been deprecated. On the other hand, if you want something memory-efficient and can live with linear access, you could wrap your collection (mutable or immutable) with a distinct iterator (O(N)):
def toConsDist[T](c: Traversable[T]) = new Iterator[T] {
  val i = c.toIterator
  var prev: Option[T] = None
  var _nxt: Option[T] = None
  def nxt = {
    if (_nxt.isEmpty) _nxt = i.find(x => !prev.toList.contains(x))
    prev = _nxt
    _nxt
  }
  def hasNext = nxt.nonEmpty
  def next = {
    val next = nxt.get
    _nxt = None
    next
  }
}
scala> toConsDist(List(1,1,1,2,2,3,3,3,2,2)).toList
res44: List[Int] = List(1, 2, 3, 2)
If you need to remove all duplicates, it will be O(N*N), but you can't use the Scala collections for that because of https://github.com/scala/scala/commit/3cc99d7b4aa43b1b06cc837a55665896993235fc (see the LinkedList part) and https://stackoverflow.com/a/27645224/1809978.
But you may use Java's LinkedList:
import scala.collection.JavaConverters._
scala> val mlist = new java.util.LinkedList[Integer]
mlist: java.util.LinkedList[Integer] = []
scala> mlist.asScala ++= List(1,1,1,2,2,3,3,3,2,2)
res74: scala.collection.mutable.Buffer[Integer] = Buffer(1, 1, 1, 2, 2, 3, 3, 3, 2, 2)
scala> var i = 0
i: Int = 0
scala> for(x <- mlist.asScala){ if (mlist.indexOf(x) != i) mlist.set(i, null); i+=1} //O(N*N)
scala> while(mlist.remove(null)){} //O(N*N)
scala> mlist
res77: java.util.LinkedList[Integer] = [1, 2, 3]
mlist.asScala just creates a wrapper without any copying. You can't structurally modify Java's LinkedList during iteration, which is why I used nulls. You may try Java's ConcurrentLinkedQueue, but it doesn't support indexOf, so you would have to implement that yourself (Scala maps it to an Iterator, so asScala.indexOf won't work).
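If a contiguous buffer rather than a linked list works for you, another option is a single compacting pass over a mutable.ArrayBuffer. This is only a sketch (the name dedupInPlace is made up), and it still allocates one HashSet for bookkeeping, but the buffer itself is reused rather than copied:
import scala.collection.mutable

def dedupInPlace[T](xs: mutable.ArrayBuffer[T]): Unit = {
  val seen = mutable.HashSet.empty[T]
  var write = 0
  var read = 0
  while (read < xs.length) {
    val x = xs(read)
    if (seen.add(x)) { // add returns true only for the first occurrence
      xs(write) = x
      write += 1
    }
    read += 1
  }
  xs.remove(write, xs.length - write) // drop the now-unused trailing slots
}

val buf = mutable.ArrayBuffer(1, 1, 1, 2, 2, 3, 3, 3, 2, 2)
dedupInPlace(buf) // buf is now ArrayBuffer(1, 2, 3)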
By definition, immutability forces you to create new objects whenever you want to change your collection.
What Scala provides for some collections are buffers and builders, which let you build a collection through a mutable interface and finally return an immutable version; but once you have your immutable collection, you cannot change it in any way, and that includes filtering operations such as distinct. The furthest you can go with mutability in an immutable collection is changing the state of its elements, when those elements are themselves mutable objects.
On the other hand, some collections, such as Vector, are implemented as trees (in this case as a trie), and insert or delete operations are implemented not by copying the entire tree but only the required branches.
From Martin Odersky's Programming in Scala:
Updating an element in the middle of a vector can be done by copying the node that contains the element, and every node that points to it, starting from the root of the tree. This means that a functional update creates between one and five nodes that each contain up to 32 elements or subtrees. This is certainly more expensive than an in-place update in a mutable array, but still a lot cheaper than copying the whole vector.
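In code, that structural sharing is what Vector.updated relies on (a tiny sketch):
val v = Vector.fill(1000)(0)
val v2 = v.updated(500, 42) // copies only the handful of nodes on the path to index 500
v(500) // 0, the original is untouched
v2(500) // 42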

Creating a Map from a Set of keys

I have a set of keys, say Set[MyKey], and for each of the keys I want to compute the value through some value function, let's say computeValueOf(key: MyKey). In the end I want to have a Map which maps key -> value.
What is the most efficient way to do this without iterating too much?
A collection of Tuple2s can be converted to a Map, where the tuple's first element will be the key and the second element will be the value.
val setOfKeys = Set[MyKey]()
setOfKeys.map(key => (key, computeValueOf(key))).toMap
This is actually a pretty neat application for collection.breakOut, one of my favorite pieces of bizarre Scala voodoo:
type MyKey = Int
def computeValueOf(key: MyKey) = "value" * key
val mySet: Set[MyKey] = Set(1, 2, 3)
val myMap: Map[MyKey, String] =
  mySet.map(k => k -> computeValueOf(k))(collection.breakOut)
See this answer for some discussion of what's going on here. Unlike the version with toMap, this won't construct an intermediate Set, saving you some allocations and a traversal. It's also much less readable, though—I only offer it because you mentioned that you wanted to avoid "iterating too much".
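For what it's worth, on Scala 2.13 and later (where collection.breakOut is no longer available) you can get a similar effect, building the Map directly without an intermediate Set, by going through an iterator (a sketch reusing the same toy computeValueOf):
val myMap2: Map[MyKey, String] =
  mySet.iterator.map(k => k -> computeValueOf(k)).toMap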

Scala Set Hashcode

Assume we have three sets of strings in Scala: One has elements A, B, C; Two has elements B, C, D; and Three has elements J, K, I.
My first question is, is there any way that the hashcodes for any two of these sets could be the same?
My second question is, if I add D to One and A to Two to get new Sets One.n and Two.n, are the hashcodes for One.n and Two.n the same?
Question 1) In general, yes, entirely possible. A hash code is a limited number of bytes long, while a Set can be any size, so hash codes cannot be guaranteed unique (although collisions are rare in practice).
Question 2) Why not try it?
scala> val One = collection.mutable.Set[String]("A", "B", "C")
One: scala.collection.mutable.Set[String] = Set(A, B, C)
scala> One.hashCode
res3: Int = 1491157345
scala> val Two = collection.mutable.Set[String]("B", "C", "D")
Two: scala.collection.mutable.Set[String] = Set(B, D, C)
scala> Two.hashCode
res4: Int = -967442916
scala> One += "D"
res5: One.type = Set(A, B, D, C)
scala> Two += "A"
res6: Two.type = Set(B, D, A, C)
scala> One.hashCode
res7: Int = -232075924
scala> Two.hashCode
res8: Int = -232075924
So, yes, they are, as you might expect, given that the == method returns true for these two instances.
Sets which are equal and don't have anything strange inside them (i.e. anything with an unstable hash code, or where the hash code is inconsistent with equals) should have equal hash codes. If this is not true, and the sets are the same type of set, it is a bug and should be reported. If the sets are different types of sets, it may or may not be a bug to have different hash codes (but in any case it should agree with equals). I am not aware of any cases where different set implementations are not equal (e.g. even mutable BitSet agrees with immutable Set), however.
So:
hashCode is never guaranteed to be unique, but it should be well-distributed in that the probability of collisions should be low
hashCode of sets should always be consistent with equals (as long as everything you put in the set has hashCode consistent with equals) in that equal sets have equal hash codes. (The converse is not true because of point (1).)
Sets care only about which elements they contain, not the order in which the elements were added (that's the point of having a set instead of, say, a List)
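A quick way to see the second point in action (a sketch; REPL output will vary by Scala version):
val s1 = Set("A", "B", "C", "D") // immutable
val s2 = collection.mutable.Set("D", "C", "B", "A") // mutable, different insertion order
s1 == s2 // true: same elements
s1.hashCode == s2.hashCode // true: equal sets have equal hash codes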