How to remove duplicates from collection (without creating new ones in-between)? - scala

So first up, I'm fully aware mutation is a bad idea, but I need to keep object creation down to a minimum as I have an incredibly huge amount of data to process (keeps GC hang time down and speeds up my code).
What I want is a scala collection that has a method like distinct or similar, or possibly a library or code snippet (but native scala preferred) such that the method is side effecting / mutating the collection rather than creating a new collection.
I've explored the usual suspects like ArrayBuffer, mutable.List, Array, MutableList, Vector and they all "create a new sequence" from the original rather than mutate the original in place. Am I trying to find something that does not exist? Will I just have to write my own??
I think this exists in C++ http://www.cplusplus.com/reference/algorithm/unique/
Also, mega mega bonus points if there is some kind of awesome tail recursive way of doing this so that any bookkeeping structures created are kept in a single stack frame that is thus deallocated from memory once the method exits. The reason this would be uber cool is that even if the method creates some instances of things in order to perform the removal of duplicates, those instances will not need to be garbage collected and therefore not contribute to massive GC hangs. It doesn't have to be recursion, as long as it's likely to cause the objects to go on the stack (see escape analysis here http://www.ibm.com/developerworks/java/library/j-jtp09275/index.html)
(Also if I can specify and fix the capacity (size in memory) of the collection that would also be great)

The algorithm you mentioned (for C++) removes consecutive duplicates. So if consecutive duplicates are all you need to remove, you could use some LinkedList, but mutable lists were deprecated. On the other hand, if you want something memory-efficient and can accept linear access, you could wrap your collection (mutable or immutable) with a distinct iterator (O(N)):
def toConsDist[T](c: Traversable[T]) = new Iterator[T] {
  val i = c.toIterator
  var prev: Option[T] = None
  var _nxt: Option[T] = None

  def nxt = {
    if (_nxt.isEmpty) _nxt = i.find(x => !prev.toList.contains(x))
    prev = _nxt
    _nxt
  }

  def hasNext = nxt.nonEmpty

  def next = {
    val next = nxt.get
    _nxt = None
    next
  }
}
scala> toConsDist(List(1,1,1,2,2,3,3,3,2,2)).toList
res44: List[Int] = List(1, 2, 3, 2)
If you need to remove all duplicates, it will be O(N*N), but you can't use Scala collections for that because of https://github.com/scala/scala/commit/3cc99d7b4aa43b1b06cc837a55665896993235fc (see the LinkedList part) and https://stackoverflow.com/a/27645224/1809978.
But you may use Java's LinkedList:
import scala.collection.JavaConverters._
scala> val mlist = new java.util.LinkedList[Integer]
mlist: java.util.LinkedList[Integer] = []
scala> mlist.asScala ++= List(1,1,1,2,2,3,3,3,2,2)
res74: scala.collection.mutable.Buffer[Integer] = Buffer(1, 1, 1, 2, 2, 3, 3, 3, 2, 2)
scala> var i = 0
i: Int = 0
scala> for(x <- mlist.asScala){ if (mlist.indexOf(x) != i) mlist.set(i, null); i+=1} //O(N*N)
scala> while(mlist.remove(null)){} //O(N*N)
scala> mlist
res77: java.util.LinkedList[Integer] = [1, 2, 3]
mlist.asScala just creates a wrapper without any copying. You can't modify Java's LinkedList during iteration, which is why I used nulls. You may try Java's ConcurrentLinkedQueue, but it doesn't support indexOf, so you would have to implement that yourself (Scala maps it to an Iterator, so asScala.indexOf won't work).
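If one short-lived bookkeeping structure per call is acceptable, you can get an effect similar to C++'s std::unique followed by erase on an ArrayBuffer by compacting it in place. Here is a minimal sketch (not from the answer above) that keeps the first occurrence of every element and allocates only a single temporary HashSet:

import scala.collection.mutable

def dedupInPlace[T](buf: mutable.ArrayBuffer[T]): Unit = {
  val seen = mutable.HashSet.empty[T]    // the only temporary allocation
  var read = 0
  var write = 0
  while (read < buf.length) {
    if (seen.add(buf(read))) {           // true only the first time we see this value
      buf(write) = buf(read)             // shift it left into the kept prefix
      write += 1
    }
    read += 1
  }
  buf.remove(write, buf.length - write)  // trim the now-unused suffix
}

Calling dedupInPlace on ArrayBuffer(1, 1, 1, 2, 2, 3, 3, 3, 2, 2) leaves the buffer as ArrayBuffer(1, 2, 3), in O(N) time and without building a new collection.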

By definition, immutability forces you to create new objects whenever you want to change your collection.
What Scala provides for some collections are buffers, which allow you to build a collection through a mutable interface and finally return an immutable version. But once you have your immutable collection you can't change it in any way, and that includes filtering operations such as distinct. The furthest you can go with mutability in an immutable collection is changing the state of its elements, when those are mutable objects.
On the other hand, some collections such as Vector are implemented as trees (in this case a trie), and insert or delete operations are implemented not by copying the entire tree but only the required branches.
From Martin Odersky's Programming in Scala:
Updating an element in the middle of a vector can be done by copying the node that contains the element, and every node that points to it, starting from the root of the tree. This means that a functional update creates between one and five nodes that each contain up to 32 elements or subtrees. This is certainly more expensive than an in-place update in a mutable array, but still a lot cheaper than copying the whole vector.
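For illustration, updating a single slot of a large Vector leaves the original untouched and shares almost all of its structure with the copy (sizes here are arbitrary):

val v1 = Vector.fill(1000000)(0)
val v2 = v1.updated(500000, 42) // copies only the handful of trie nodes on the path to that slot
v1(500000)                      // 0  -- the original is unchanged
v2(500000)                      // 42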

Related

Can I mutate a variable in place in a purely functional way?

I know I can use state passing and state monads for purely functional mutation, but afaik that's not in-place and I want the performance benefits of doing it in-place.
An example would be great, e.g. adding 1 to a number, preferably in Idris but Scala will also be good
p.s. is there a tag for mutation? can't see one
No, this is not possible in Scala.
It is however possible to achieve the performance benefits of in-place mutation in a purely functional language. For instance, let's take a function that updates an array in a purely functional way:
def update(arr: Array[Int], idx: Int, value: Int): Array[Int] =
  arr.take(idx) ++ Array(value) ++ arr.drop(idx + 1)
We need to copy the array here in order to maintain purity. The reason is that if we mutated it in place, we'd be able to observe that after calling the function:
def update(arr: Array[Int], idx: Int, value: Int): Array[Int] = {
  arr(idx) = value
  arr
}
The following code will work fine with the first implementation but break with the second:
val arr = Array(1, 2, 3)
assert(arr(1) == 2)
val arr2 = update(arr, 1, 42)
assert(arr2(1) == 42) // so far, so good…
assert(arr(1) == 2) // oh noes!
The solution in a purely functional language is to simply forbid the last assert. If you can't observe the fact that the original array was mutated, then there's nothing wrong with updating the array in place! The means to achieve this is called linear types. Linear values are values that you can use exactly once. Once you've passed a linear value to a function, the compiler will not allow you to use it again, which fixes the problem.
There are two languages I know of that have this feature: ATS and Haskell. If you want more details, I'd recommend this talk by Simon Peyton-Jones where he explains the implementation in Haskell:
https://youtu.be/t0mhvd3-60Y
Support for linear types has since been merged into GHC: https://www.tweag.io/blog/2020-06-19-linear-types-merged/

Scala: Update Array inside a Map

I am creating a Map which has an Array inside it. I need to keep adding values to that Array. How do I do that?
var values: Map[String, Array[Float]] = Map()
I tried several ways such as:
myobject.values.getOrElse("key1", Array()).++(Array(float1))
I tried a few other ways too, but nothing updates the array inside the Map.
There is a problem with this code:
values.getOrElse("key1", Array()).++(Array(float1))
This does not update the Map in values, it just creates a new Array and then throws it away.
You need to replace the original Map with a new, updated Map, like this:
values = values.updated("key1", values.getOrElse("key1", Array.empty[Float]) :+ float1)
To understand this you need to be clear on the distinction between mutable variables and mutable data.
var is used to create a mutable variable which means that the variable can be assigned a new value, e.g.
var name = "John"
name = "Peter" // Would not work if name was a val
By contrast, mutable data is held in objects whose contents can be changed:
val a = Array(1,2,3)
a(0) = 12 // Works even though a is a val not a var
In your example values is a mutable variable but the Map is immutable so it can't be changed. You have to create a new, immutable, Map and assign it to the mutable var.
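Putting the two together with the asker's values variable, a short (hypothetical) session looks like this:

var values: Map[String, Array[Float]] = Map()

// reassign the var to a new Map holding a new, extended Array
values = values.updated("key1", values.getOrElse("key1", Array.empty[Float]) :+ 1.0f)
values = values.updated("key1", values.getOrElse("key1", Array.empty[Float]) :+ 2.0f)

values("key1").toList // List(1.0, 2.0)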
From what I can see (judging by the ++), you want to append one more element to the Array. But Array is a fixed-length structure, so I'd recommend using Vector instead. Since, I suppose, you are using an immutable Map, you need to update it as well.
So the final solution might look like:
var values: Map[String, Vector[Float]] = Map()
val key = "key1"
val value = 1.0f
values = values + (key -> (values.getOrElse(key, Vector.empty[Float]) :+ value))
Hope this helps!
You can use Scala 2.13's transform function to transform your map any way you want.
val values = Map("key" -> Array(1f, 2f, 3f), "key2" -> Array(4f,5f,6f))
values.transform {
  case ("key", v) => v ++ Array(6f)
  case (_, v)     => v
}
Result:
Map(key -> Array(1.0, 2.0, 3.0, 6.0), key2 -> Array(4.0, 5.0, 6.0))
Note that appending to arrays takes linear time so you might want to consider a more efficient data structure such as Vector or Queue or even a List (if you can afford to prepend rather than append).
Update:
However, if it is only one key you want to update, it is probably better to use updatedWith:
values.updatedWith("key")(_.map(_ ++ Array(6f)))
which will give the same result. The nice thing about the above code is that if the key does not exist, it will not change the map at all without throwing any error.
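A small demonstration of that behaviour (values made up):

val values = Map("key" -> Array(1f, 2f, 3f))

values.updatedWith("key")(_.map(_ ++ Array(6f)))     // key -> Array(1.0, 2.0, 3.0, 6.0)
values.updatedWith("missing")(_.map(_ ++ Array(6f))) // returns the map unchanged, no error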
Immutable vs Mutable Collections
You need to choose which type of collection you will use: an immutable or a mutable one. Both are great and work totally differently. I guess you are familiar with the mutable kind (from other languages), but immutable collections are the default in Scala and you are probably already using them (because they don't need any imports). An immutable Map cannot be changed... you can only create a new one with updated values (Tim's and Ivan's answers cover that).
There are a few ways to solve your problem, and each is good depending on the use case.
See the implementations below (m1 to m6):
//just for convenience
type T = String
type E = Long
import scala.collection._
//immutable map with immutable seq (default).
var m1 = immutable.Map.empty[T,List[E]]
//mutable map with immutable seq. This is great for most use-cases.
val m2 = mutable.Map.empty[T,List[E]]
//mutable concurrent map with immutable seq.
//should be fast and threadsafe (if you know how to deal with it)
val m3 = collection.concurrent.TrieMap.empty[T,List[E]]
//mutable map with mutable seq.
//should be fast but could be unsafe. This is default in most imperative languages (PHP/JS/JAVA and more).
//Probably this is what You have tried to do
val m4 = mutable.Map.empty[T,mutable.ArrayBuffer[E]]
//immutable map with mutable seq.
//still could be unsafe
var m5 = immutable.Map.empty[T,mutable.ArrayBuffer[E]]
//immutable map with mutable seq v2 (used in next snipped)
var m6 = immutable.Map.empty[T,mutable.ArrayBuffer[E]]
//Oh... and NEVER DO THAT, this is wrong
//I mean... don't keep mutable Map in `var`
//var mX = mutable.Map.empty[T,...]
Other answers show immutable.Map with immutable.Seq, and this is the preferred way (or the default, at least). It costs something, but for most apps it is perfectly ok. Here is a nice source of info about immutable data structures: https://stanch.github.io/reftree/talks/Immutability.html.
Each variant has its own pros and cons. Each deals with updates differently, and that makes this question much harder than it looks at first glance.
Solutions
val k = "The Ultimate Answer"
val v: E = 42
//immutable map with immutable seq (default).
m1 = m1.updated(k, v :: m1.getOrElse(k, Nil))
//mutable map with immutable seq.
m2.update(k, v :: m2.getOrElse(k, Nil))
//mutable concurrent map with immutable seq.
//m3 is bit harder to do in scala 2.12... sorry :)
//mutable map with mutable seq.
m4.getOrElseUpdate(k, mutable.ArrayBuffer.empty[E]) += v
//immutable map with mutable seq.
m5 = m5.updated(k, {
  val col = m5.getOrElse(k, mutable.ArrayBuffer.empty[E])
  col += v
  col
})
//or another implementation of immutable map with mutable seq.
m6.get(k) match {
  case None      => m6 = m6.updated(k, mutable.ArrayBuffer(v))
  case Some(col) => col += v
}
Check the scalafiddle with these implementations: https://scalafiddle.io/sf/WFBB24j/3.
It is a great tool (ps: you can always save your changes with CTRL+S and share a link when asking a question about your snippet).
Oh... and if you care about concurrency (the m3 case), write another question. Such a topic deserves its own thread :)
(im)mutable api VS (im)mutable Collections
You can have a mutable collection and still use an immutable API that copies the original seq. For example, Array is mutable:
val example = Array(1,2,3)
example(0) = 33 //edit in place
println(example.mkString(", ")) //33, 2, 3
But some functions on it (e.g. ++) will create a new sequence rather than change the existing one:
val example2 = example ++ Array(42, 41) //++ is immutable operator
println(example.mkString(", ")) //33, 2, 3 //example stays unchanged
println(example2.mkString(", ")) //33, 2, 3, 42, 41 //but new sequence is created
There is a method updateWith that mutates and exists only on mutable collections. There is also updatedWith, which exists on both immutable AND mutable collections, and if you are not careful enough you will use the wrong one (yes... just one letter more).
This means you need to be careful about which functions you use, immutable or mutable. Most of the time you can distinguish them by the result type: if something returns a collection, it is probably some kind of copy of the original seq; if the result is Unit, then it mutates for sure.
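A small sketch of that one-letter difference, using Map on Scala 2.13 (example values made up):

import scala.collection.mutable

val mm = mutable.Map("k" -> List(1L))
mm.updateWith("k")(_.map(2L :: _))            // mutates mm in place, returns Option[List[Long]]
// mm is now Map(k -> List(2, 1))

val im = Map("k" -> List(1L))
val im2 = im.updatedWith("k")(_.map(2L :: _)) // builds and returns a new Map
// im is unchanged; im2 is Map(k -> List(2, 1))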

Scala's mutable.ListBuffer seems to use List's tail function yet it is documented as having linear complexity?

As of the current documentation for Scala 2.12.8, List's tail is constant and ListBuffer's tail is linear. However, looking at the source code, it looks like there is no override for the tail function, and in most use cases (such as removing the head element) List's tail function is explicitly called. Since ListBuffer seems to be little more than a List wrapper with a length var and a pointer to the last element, why is it linear?
I timed both methods, and indeed it seems that List's tail is constant while ListBuffer's tail is linear:
import scala.collection.mutable
import scala.collection.immutable
val immutableList: immutable.List[Int] = (1 to 10000).toList
val mutableListBuffer: mutable.ListBuffer[Int] = mutable.ListBuffer.empty[Int] ++= (1 to 10000).toList
// Warm-up
(1 to 100000).foreach(_ => immutableList.tail)
(1 to 100000).foreach(_ => mutableListBuffer.tail)
// Measure
val start = System.nanoTime()
(1 to 1000).foreach(_ => immutableList.tail)
val middle = System.nanoTime()
(1 to 1000).foreach(_ => mutableListBuffer.tail)
val stop = System.nanoTime()
println((middle - start) / 1000)
println((stop - middle) / 1000)
The results were, as documented:
1076
86010
However, if you use functions such as remove(0) that use List's tail, it is constant with the following results:
1350
1724
I expect that the linear complexity comes from building a whole new list to return, but since the internal structure is a List, why not return the List's tail?
ListBuffer doesn't extend List, and the fact that it doesn't override tail doesn't mean it's using List#tail. If you look at the Definition Classes section of the docs for tail on ListBuffer, you'll see that it comes from TraversableLike, where it's defined like this:
override def tail: Repr = {
  if (isEmpty) throw new UnsupportedOperationException("empty.tail")
  drop(1)
}
And if you look at drop, you'll see that it uses a builder to construct a new collection containing all but the first element, which explains why it's linear.
As talex hints at in the comments above, ListBuffer#tail has to return a new collection because the original buffer could be modified, and the standard library designers have decided that you wouldn't want those modifications reflected in the result you get from tail.
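A quick illustration of that design decision:

import scala.collection.mutable

val buf = mutable.ListBuffer(1, 2, 3)
val t = buf.tail // a fresh ListBuffer(2, 3), built by drop(1)
buf(1) = 99      // mutate the original buffer
t                // still ListBuffer(2, 3): the copy is unaffected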
since the internal structure is a List
If you look at the source, the internal structure is actually two lists:
private var start: List[A] = Nil
private var last0: ::[A] = _
last0 has this type because it's mutated using internal API (and it has to be, for ListBuffer to make sense). (Actually, just having two lists for the front and back parts, with the back part in reverse order, should support all ListBuffer operations quite efficiently, including (amortized) O(1) tail; presumably the current implementation wins on constant factors, or maybe I am missing some operation it does much better.)
So tail can't just "return the List's tail": it would at the least have to copy last0, because you can't share the same mutable part between two buffers. Even if the designers wanted changes to tail to be reflected in the original ListBuffer and vice versa, sharing last0 wouldn't have that effect (without a lot more effort). And that copy is already linear.
Note that if the return type of ListBuffer#tail were List, you would still need to copy last0, or copy the contents from last0 into start before returning its tail, etc. So that doesn't make tail constant-time either. But it does create additional problems: should ArrayBuffer#tail return Array? How do you declare tail's return type in GenTraversableLike, if it's still available there?

What is the easiest and most efficient way to make a min heap in Scala?

val maxHeap = scala.collection.mutable.PriorityQueue[Int]() //Gives MaxHeap
What is the most concise and efficient way to use Ordering to turn a PriorityQueue into a minHeap?
You'll have to define your own Ordering:
scala> object MinOrder extends Ordering[Int] {
     |   def compare(x: Int, y: Int) = y compare x
     | }
defined object MinOrder
Then use that when creating the heap:
scala> val minHeap = scala.collection.mutable.PriorityQueue.empty(MinOrder)
minHeap: scala.collection.mutable.PriorityQueue[Int] = PriorityQueue()
scala> minHeap.ord
res1: Ordering[Int] = MinOrder$#158ac84e
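Alternatively (not in the original answer), you can reuse the built-in ordering reversed, which avoids defining a separate object:

val minHeap = scala.collection.mutable.PriorityQueue.empty[Int](Ordering[Int].reverse)
minHeap.enqueue(3, 1, 2)
minHeap.dequeue() // 1 -- the smallest element comes out first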
Update August 2016: you can consider the proposal chrisokasaki/scads/scala/heapTraits.scala from Chris Okasaki (chrisokasaki).
That proposal illustrates the "not-so-easy" part of a Heap:
Proof of concept for typesafe heaps with a merge operation.
Here, "typesafe" means that the interface will never allow different orderings to be mixed within the same heap.
In particular,
when adding an element to an existing heap, that insertion cannot involve an ordering different from the one used to create the existing heap, and
when merging two existing heaps, the heaps are guaranteed to have been created with the same ordering.
See its design.
val h1 = LeftistHeap.Min.empty[Int] // an empty min-heap of integers

How do I insert something at a specific position of a mutable LinkedList?

Again, this seems like something that should be obvious.
I would like to insert an element into a linked list at a specific position.
In one case, this is where a field in the element is less than a certain value, so I can do it this way:
def Add(act: Elem): Unit = {
  val (before, after) = myList.partition(act.n >= _.n)
  myList = (before :+ act) ++ after
}
... but this is really an immutable approach disguised as a mutable one. I don't think I can get at the LinkedList node that corresponds to the insertion point, so I can't mess with the "next" attribute.
It shouldn't be this difficult. Half the point of linked lists is so you insert things in the middle.
I'm still messing with a compiler generator (as in this question). Replacing lists with copies is just not the way to do this, as there are many recursive calls during which the lists are quite deliberately modified, so you may find that some of the recursive calls are still using the lists you just replaced.
I really want mutable lists, and straightforward mutable operations. I guess I can write my own collection classes, but I don't think the need is that unusual. Has anyone implemented "proper" mutable linked lists already?
EDIT
Some more detail
I should have perhaps chosen a different example. Typically, I've got a reference to the element by some other route, and I want to insert a new element into one of the linked lists this element is on (I'd be happy with the element being in one linked list as a start).
In the naive Java implementation I'm starting with, the element itself contains a next field (which I can then manipulate).
In the Scala LinkedList case, the linked list node contains a reference to the element, and so, given the element, I cannot easily find the LinkedList node and so the next field.
I can traverse the list again, but it might be very long.
It might help to assume a DoublyLinkedList and deleting the element as the operation I want to do, as it's clearer then that traversal isn't needed and so should be avoided. So in that case, assume I have found the element by some other means than traversing the linked list. I now want to delete the element. In the Java/naive case, the back and forward pointers are part of the element. In the Scala collections case, there's a DoublyLinkedList node somewhere that contains a reference to my element. But I can't go from element to that node without traversing the list again.
Random thoughts follow: I'm getting somewhere by mixing in a trait that defines a next field (for my singly linked case). This trait might support iterating over the objects in the list, for example. But that would help me only for elements that are on one list at a time, and I have objects that are on three (with, currently, three different "next" pointers called things like "next", "across" and "down").
I don't want a List of Nodes pointing to Elements; I want a List of Elements that are Nodes (i.e. have a next field).
Unfortunately, LinkedList does not implement Buffer, so there isn't AFAIK a good way to do this out of the box. You actually do have access to the next field, however, so you can write your own. Something like this (warning, untested!; warning, low-level code!):
def Add(act: Elem) {
  var last = myList
  while (last.next ne null && last.next.n >= act.n) last = last.next
  val ins = LinkedList(act)
  ins.next = last.next
  last.next = ins
}
(You might want to add a special case for myList being empty, and for insertion before the first element. But you get the idea.)
Edit after clarification: Don't keep copies of the elements; keep copies of the list starting at that element. (That's what last is.) Then write an implicit conversion from a list of your thingy of choice to the head thingy itself. Unless you duplicate the collections methods in your element, you get all the power of the collections library and all the syntactic convenience of having an element with a next pointer, with only an extra object allocation as drawback.
Of course, you can always implement any low-level data structure you want, if you want to reinvent the wheel so that it fits your car better (so to speak).
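For completeness, here is a rough sketch of the "elements that are nodes" idea from the question (all names here are made up, not a library API): the element itself carries the next pointer, so insertion and removal after a known node are O(1) and never traverse the list.

// Hypothetical intrusive list node: the element owns its own next pointer.
trait Node[T <: Node[T]] { self: T =>
  var next: T = _ // null marks the end of the chain

  // splice n in right after this node, in place
  def insertAfter(n: T): Unit = {
    n.next = next
    next = n
  }

  // unlink the node that follows this one, in place
  def removeNext(): Unit =
    if (next != null) next = next.next
}

final class Elem(val n: Int) extends Node[Elem]

// val a = new Elem(1); val c = new Elem(3)
// a.insertAfter(c); a.insertAfter(new Elem(2)) // the chain is now 1 -> 2 -> 3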
Why are people going to such trouble?
scala> LinkedList(1, 2, 3)
res21: scala.collection.mutable.LinkedList[Int] = LinkedList(1, 2, 3)
scala> val ll = LinkedList(1, 2, 3)
ll: scala.collection.mutable.LinkedList[Int] = LinkedList(1, 2, 3)
scala> ll.next.insert(LinkedList(0))
scala> ll
res23: scala.collection.mutable.LinkedList[Int] = LinkedList(1, 2, 0, 3)
scala> ll.insert(LinkedList(-1, -2))
scala> ll
res25: scala.collection.mutable.LinkedList[Int] = LinkedList(1, -1, -2, 2, 0, 3)
Of course, this doesn't answer the question after clarification, but I think Rex Kerr's idea of implicit conversions might be the way to go here. That, or just add a .elem before any method using the value. In fact, here's the implicit:
implicit def linkedListToA[A](ll: LinkedList[A]): A = ll.elem
Unpolished version: Inserts other into l the first time that the predicate p is true.
import scala.collection.mutable.LinkedList
import scala.annotation.tailrec

val list = LinkedList(1, 2, 3, 10, 11, 12)

def insertAfter[T](l: LinkedList[T], other: LinkedList[T], p: (T) => Boolean) {
  @tailrec
  def loop(x: LinkedList[T]) {
    if (p(x.head)) {
      other.next = x.next
      x.next = other
    } else if (x.next.nonEmpty) loop(x.next)
  }
  loop(l)
}
insertAfter(list, LinkedList(100), (_:Int) >= 10)