Scala Array.view memory usage - scala

I'm learning Scala and have been trying some LeetCode problems with it, but I'm having trouble with the memory limit being exceeded. One problem I have tried goes like this:
A swap is defined as taking two distinct positions in an array and swapping the values in them.
A circular array is defined as an array where we consider the first element and the last element to be adjacent.
Given a binary circular array nums, return the minimum number of swaps required to group all 1's present in the array together at any location.
and my attempted solution looks like
object Solution {
  def minSwaps(nums: Array[Int]): Int = {
    val count = nums.count(_ == 1)
    if (count == 0) return 0
    val circular = nums.view ++ nums.view
    circular.sliding(count).map(_.count(_ == 0)).min
  }
}
however, when I submit it, I'm hit with Memory Limit Exceeded for one of the test cases where nums is very large.
My understanding is that, because I'm using .view, I shouldn't be allocating more than O(1) extra memory. Is that understanding incorrect? To be clear, I realise this isn't the most time-efficient way of solving it, but I didn't expect it to be memory-inefficient.
The version used is Scala 2.13.7, in case that makes a difference.
Update
I did some inspection of the types, and it seems circular is only a View unless I replace ++ with concat, which makes it an IndexedSeqView. Why is that? I thought ++ was just an alias for concat.
If I make the above change, and replace circular.sliding(count) with (0 to circular.size - count).view.map(i => circular.slice(i, i + count)), it "succeeds" in hitting the time limit instead, so I think sliding might not be optimised for IndexedSeqView.
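For comparison, a hand-rolled sliding window over the original array keeps the extra memory constant; a rough sketch of that same counting idea, using modular indexing instead of a doubled view:

object Solution {
  def minSwaps(nums: Array[Int]): Int = {
    val n = nums.length
    val count = nums.count(_ == 1)
    if (count == 0) return 0
    // zeros inside the initial window [0, count)
    var zeros = (0 until count).count(nums(_) == 0)
    var best = zeros
    var i = 0
    while (i < n) {
      if (nums(i) == 0) zeros -= 1                 // element leaving the window
      if (nums((i + count) % n) == 0) zeros += 1   // element entering the window
      best = math.min(best, zeros)
      i += 1
    }
    best
  }
}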

Related

Scala immutable list internal implementation

Suppose I have a huge list with elements from 1 to 1 million.
val initialList = List(1,2,3,.....1 million)
and
val myList = List(1,2,3)
Now I apply an operation such as foldLeft on myList, giving initialList as the starting value, such as
val output = myList.foldLeft(initialList)(_ :+ _)
// result ==>> List(1,2,3,.....1 million, 1 , 2 , 3)
Now here is my question: both lists being immutable, the intermediate lists that were produced were
List(1,2,3,.....1 million, 1)
List(1,2,3,.....1 million, 1 , 2)
List(1,2,3,.....1 million, 1 , 2 , 3)
By the concept of immutability, a new list is created each time and the old one is discarded. So isn't this operation a performance killer in Scala, since a list of 1 million elements has to be copied each time to create the new list?
Please correct me if I am wrong as I am trying to understand the internal implementation of an immutable list.
Thanks in advance.
Yup, this is a performance killer, but it is the cost of having immutable structures (which are amazing and safe, and make programs much less buggy). That's why you should avoid appending to a list if you can. There are many tricks that can avoid this issue (try to use accumulators).
For example:
Instead of:
val initialList = List(1,2,3,.....1 million)
val myList = List(1,2,3,...,100)
val output = myList.foldLeft(initialList)(_ :+ _)
You can write:
val initialList = List(1,2,3,.....1 million)
val myList = List(1,2,3,...,100)
val output = List(initialList,myList).flatten
flatten is implemented so that the first list is copied only once, instead of being copied for every single append.
P.S.
At least adding an element to the front of a list is fast (O(1)), because the old list can be shared. Let's look at an example:
Picture several lists that share a common tail: the computer keeps only one copy of the shared (b, c, d) tail. But if you want to append an element to the end of one of those lists, you cannot modify it in place, because you would also change every other list that shares the tail! That's why the first list has to be copied.
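A tiny sketch of that sharing (the names are illustrative):

val tail = List("b", "c", "d")
val foo  = "a" :: tail        // O(1): foo just points at tail's cells
val bar  = "x" :: tail        // O(1): bar shares the very same cells
// foo, bar and tail all reference one copy of ("b", "c", "d") in memory.

val appended = tail :+ "e"    // O(n): tail has to be copied, because its cells
                              // are still referenced by foo and bar and must not change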
Appending to a List is not a good idea because List has linear cost for appending. So, if you can,
either prepend to the List (List has constant-time prepend),
or choose another collection that is efficient for appending, such as a Queue (both options are sketched below).
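A minimal sketch of both options, assuming Scala 2.13 (the concrete values are placeholders):

import scala.collection.immutable.Queue

val initialList = (1 to 1000000).toList
val myList      = List(1, 2, 3)

// Option 1: build the result in reverse using O(1) prepends, then reverse once
// at the end (O(n + m) overall instead of O(n * m)).
val viaPrepend =
  myList.foldLeft(initialList.reverse)((acc, x) => x :: acc).reverse

// Option 2: an immutable Queue supports efficient enqueue at the end.
val viaQueue =
  myList.foldLeft(Queue.from(initialList))((q, x) => q.enqueue(x)).toList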
For the performance characteristics per operation of most Scala collections, see:
https://docs.scala-lang.org/overviews/collections/performance-characteristics.html
Note that, depending on your requirements, you may also build your own smarter collection, such as a chained iterable, for example.

views in collections in scala

I understand that a view is a light-weight collection and that it is lazy. I would like to understand what makes a view light-weight.
Say I have a list of 1000 random numbers. I'd like to find the even numbers in this list and pick only the first 10 of them. I believe using a view here is better because we can avoid creating an intermediate list, especially because I'll pick only the first 10 even numbers. Initially, I thought that the optimisation is achieved because the function I use in the filter method will not get executed till the method force is called, but I believe this isn't correct. I am struggling to understand what makes using the view better in this scenario. Or have I picked a wrong example?
val r = scala.util.Random
val l:List[Int] = List.tabulate(1000)(x=>r.nextInt())
//without view, I'll get an intermediate list. The function x%2==0 will be applied to each element of l
val l1 = l.filter(x=>(x%2 == 0))
//this will give the size of l1. I got 508, but yours could be different depending on the random numbers generated in your case
l1.size
//pick 1st 10 even numbers
val l2 = l1.take(10)
//using view. I thought that x%2==0 will not be executed right now
val lv1 = l.view.filter(x=>(x%2 == 0))
lv1: scala.collection.SeqView[Int,List[Int]] = SeqViewF(...)
lv1.size //this is the same as l1's size, so my assumption that x%2==0 will not be executed is wrong; else lv1.size would not be the same as l1.size
val lv2 = lv1.take(10).force
Question 1 - if I use view, how is the processing optimised?
Question 2 - lv1 is of type SeqViewF; F is related to filter, but what does it mean?
Question 3 - what do the elements of lv1 look like (the elements of l1, for example, are integers)?
You wrote:
lv1.size //this is same as l1 size so my assumption that x%2==0 will
not be executed is wrong else lv1.size will not be same as l1.size
Your assumption is actually correct; it's just that your means of measuring the difference is faulty.
val l:List[Int] = List.fill(10)(util.Random.nextInt) // ten random Ints
// print every Int that gets tested in the filter
val lv1 = l.view.filter{x => println(x); x%2 == 0} // no lines printed
lv1.size // ten Ints sent to STDOUT
So, as you see, taking the size of your view also forces its completion.
Yeah, that's not a very fitting example. What you are doing is better done with an iterator: list.iterator.filter(_ % 2 == 0).take(10). This doesn't create intermediate collections, and does not scan the list past the first 10 even elements (a view wouldn't either; it's just a bit of an overcomplication for this case).
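One way to see the optimisation asked about in Question 1 is to count how often the predicate actually runs; a small sketch, assuming Scala 2.13 view semantics:

val l: List[Int] = List.tabulate(1000)(_ => scala.util.Random.nextInt())

var evaluated = 0
val firstTenEven = l.view
  .filter { x => evaluated += 1; x % 2 == 0 }
  .take(10)
  .toList   // forcing the view here

// `evaluated` ends up roughly at the index of the 10th even number, not at 1000:
// no intermediate list is built, and the scan stops as soon as 10 matches are found.
println(s"predicate ran $evaluated times for ${firstTenEven.size} results")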
A view is a sequence of delayed operations. It has a reference to the collection, and a bunch of operations to be applied when it is forced. The way operations to be applied are recorded is rather complicated, and not really important. You guessed right - SeqViewF means a view of a sequence with a filter applied. If you map over it, you'll get a SeqViewFM etc.
When would this be needed?
One example is when you need to "massage" a sequence that you are passing somewhere else. Suppose you have a function that somehow combines the elements of a sequence you pass in:
def combine(s: Seq[Int]) = s.iterator.zipWithIndex.map {
  case (x, i) if i % 2 == 0 => x
  case (x, _)               => -x
}.sum
Now suppose you have a huge stream of numbers, and you want to combine only the even ones, while dropping the others. You can use your existing function for that:
val result = combine(stream.view.filter(_ % 2 == 0))
Of course, if combine's parameter had been declared as an iterator to begin with, you would not need the view, but that is not always possible; sometimes you just have to use some standard interface that wants a sequence.
Here is a fancier example that also takes advantage of the fact that the elements are computed on access:
def notifyUsers(users: Seq[User]) = users
  .iterator
  .filter(_.needsNotification)
  .foreach(_.notify)
timer.schedule(60 seconds) { notifyUsers(userIDs.view.map(getUser)) }
So, I have some ids of the users that may need to be notified of some external events. I have them stored in userIDs.
Every minute a task runs, that finds all users that need to be notified, and sends a notification to each of them.
Here is the trick: notifyUsers takes a collection of User as a parameter. But what we are really passing in is a view, composed of the initial set of user ids and a .map operation that gets the User object for each of them. As a result, every time the task runs, a new User object will be obtained for each id (perhaps from the database), so if the needsNotification flag gets changed, the new value is picked up.
Surely, I could change notifyUsers to receive the list of ids and do getUser on its own instead, but that wouldn't be as neat. First, this way it is easier to unit-test - I can just pass a list of test objects directly in, without bothering to mock out getUser. And second, a generic utility like this is more useful - a User could be a trait, for example, that could be representing many different domain objects.

Optimal HashSet Initialization (Scala | Java)

I'm writing an A.I. to solve a "Maze of Life" puzzle. Attempting to store states to a HashSet slows everything down. It's faster to run it without a set of explored states. I'm fairly confident my node (state storage) implements equals and hashCode well as tests show a HashSet doesn't add duplicate states. I may need to rework the hashCode function, but I believe what's slowing it down is the HashSet rehashing and resizing.
I've tried setting the initial capacity to a very large number, but it's still extremely slow:
val initCapacity = java.lang.Math.pow(initialGrid.width*initialGrid.height,3).intValue()
val frontier = new QuickQueue[Node](initCapacity)
Here is the quick queue code:
class QuickQueue[T](capacity: Int) {
  val hashSet = new HashSet[T](capacity)
  val queue = new Queue[T]
  //methods below
For more info, here is the hash function. I store the grid values in bytes in two arrays and access it using tuples:
override def hashCode(): Int = {
  var sum = Math.pow(grid.goalCoords._1, grid.goalCoords._2).toInt
  for (y <- 0 until grid.height) {
    for (x <- 0 until grid.width) {
      sum += Math.pow(grid((x, y)).doubleValue(), x.toDouble).toInt
    }
    sum += Math.pow(sum, y).toInt
  }
  return sum
}
Any suggestions on how to set up a HashSet that won't slow things down? Maybe another suggestion of how to remember explored states?
P.S. using java.util.HashSet, and even with initial capacity set, it takes 80 seconds vs < 7 seconds w/o the set
Okay, for a start, please replace
override def hashCode(): Int =
with
override lazy val hashCode: Int =
so you don't calculate (grid.height*grid.width) floating point powers every time you need to access the hash code. That should speed things up by an enormous amount.
Then, unless you somehow rely upon close cells having close hash codes, don't re-invent the wheel. Use scala.util.hashing.MurmurHash3.seqHash or somesuch to calculate your hash. This should speed your hash up by another factor of 20 or so. (Still keep the lazy val.)
Then you only have overhead from the required set operations. Right now, unless you have a lot of 0x0 grids, you are using up the overwhelming majority of your time waiting for math.pow to give you a result (and risking everything becoming Double.PositiveInfinity or 0.0, depending on how big the values are, which will create hash collisions which will slow things down still further).
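To make both suggestions concrete, here is a rough sketch of what the state class could look like; Node and its cells field are stand-ins for the asker's actual class and byte arrays:

import scala.util.hashing.MurmurHash3

class Node(val cells: Array[Byte]) {
  // Computed once per (immutable) object, not on every HashSet lookup.
  override lazy val hashCode: Int = MurmurHash3.bytesHash(cells)

  override def equals(other: Any): Boolean = other match {
    case that: Node => java.util.Arrays.equals(this.cells, that.cells)
    case _          => false
  }
}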
Note that the following assumes all your objects are immutable. This is a sane assumption when using hashing.
Also you should profile your code before applying optimization (use e.g. the free jvisualvm, that comes with the JDK).
Memoization for fast hashCode
Computing the hash code is usually a bottleneck. By computing the hash code only once for each object and storing the result you can reduce the cost of hash code computation to a minimum (once at object creation) at the expense of increased space consumption (probably moderate). To achieve this turn the def hashCode into a lazy val or val.
Interning for fast equals
Once you have the cost of hashCode eliminated, computing equals becomes a problem. equals is particularly expensive for collection fields and deep structures in general.
You can minimize the cost of equals by interning. This means that you acquire new objects of the class through a factory method, which checks whether the requested new object already exists, and if so, returns a reference to the existing object. If you assert that every object of this type is constructed in this way you know that there is only one instance of each distinct object and equals becomes equivalent to object identity, which is a cheap reference comparison (eq in Scala).
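A rough, self-contained sketch of such a factory; Node and its cells field are again hypothetical stand-ins for the asker's state class:

import scala.collection.mutable

final class Node(val cells: Array[Byte]) {
  override lazy val hashCode: Int = java.util.Arrays.hashCode(cells)
  override def equals(other: Any): Boolean = other match {
    case that: Node => java.util.Arrays.equals(this.cells, that.cells)
    case _          => false
  }
}

object Node {
  private val pool = mutable.HashMap.empty[Node, Node]

  // Every Node is obtained through this factory, so two Nodes with equal
  // contents are always the same instance and can be compared with `eq`.
  def intern(cells: Array[Byte]): Node = {
    val candidate = new Node(cells)
    pool.getOrElseUpdate(candidate, candidate)
  }
}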

Scala: Hash ignores initial size (fast hash table for billions of entries)

I am trying to find out how well Scala's hash functions scale for big hash tables (with billions of entries, e.g. to store how often a particular bit of DNA appeared).
Interestingly, however, both HashMap and OpenHashMap seem to ignore the parameters which specify initial size (2.9.2. and 2.10.0, latest build).
I think that this is so because adding new elements becomes much slower after the first 800.000 or so.
I have tried increasing the entropy in the strings which are to be inserted (only the chars ACGT in the code below), without effect.
Any advice on this specific issue? I would also be grateful to hear your opinion on whether using Scala's inbuilt types is a good idea for a hash table with billions of entries.
import scala.collection.mutable.{ HashMap, OpenHashMap }
import scala.util.Random
object HelloWorld {
  def main(args: Array[String]) {
    val h = new collection.mutable.HashMap[String, Int] {
      override def initialSize = 8388608
    }
    // val h = new scala.collection.mutable.OpenHashMap[Int,Int](8388608);
    for (i <- 0 until 10000000) {
      val kMer = genkMer()
      if (!h.contains(kMer)) {
        h(kMer) = 0;
      }
      h(kMer) = h(kMer) + 1;
      if (i % 100000 == 0) {
        println(h.size);
      }
    }
    println("Exit. Hashmap size:\n");
    println(h.size);
  }

  def genkMer(): String = {
    val nucs = "A" :: "C" :: "G" :: "T" :: Nil
    var s: String = "";
    val r = new scala.util.Random
    val nums = for (i <- 1 to 55 toList) yield r.nextInt(4)
    for (i <- 0 until 55) {
      s = s + nucs(nums(i))
    }
    s
  }
}
I wouldn't use Java data structures to manage a map of billions of entries. Reasons:
The max buckets in a Java HashMap is 2^30 (~1B), so
with default load factor you'll fail when the map tries to resize after 750 M entries
you'll need to use a load factor > 1 (5 would theoretically get you 5 billion items, for example)
With a high load factor you're going to get a lot of hash collisions and both read and write performance is going to start to degrade badly
Once you actually exceed Integer.MAX_VALUE entries I have no idea what gotchas exist -- .size() on the map wouldn't be able to return the real count, for example
I would be very worried about running a 256 GB heap in Java -- if you ever hit a full GC it is going to lock the world for a long time to check the billions of objects in old gen
If it was me I'd be looking at an off-heap solution: a database of some sort. If you're just storing (hashcode, count) then one of the many key-value stores out there might work. The biggest hurdle is finding one that can support many billions of records (some max out at 2^32).
If you can accept some error, probabilistic methods might be worth looking at. I'm no expert here, but the stuff listed here sounds relevant.
First, you can't override initialSize; I don't think Scala lets you, because it's package private in HashTable:
private[collection] final def initialSize: Int = 16
Second, if you want to set the initial size, you have to give it a HashTable of the initial size that you want. So there's really no good way of constructing this map without starting at 16, but it does grow by powers of 2, so each resize should get better.
Third, Scala collections are relatively slow; I would recommend Java/Guava/etc. collections instead.
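As an illustration of that third point, a sketch of pre-sizing a plain java.util.HashMap so it never has to rehash for the expected number of keys (the figures are placeholders):

import java.util.{HashMap => JHashMap}

val expectedEntries = 10000000
val loadFactor      = 0.75f
// Capacity large enough that expectedEntries stays below the resize threshold.
val initialCapacity = math.ceil(expectedEntries / loadFactor).toInt
val counts = new JHashMap[String, Int](initialCapacity, loadFactor)

def bump(kMer: String): Unit = {
  val current = if (counts.containsKey(kMer)) counts.get(kMer) else 0
  counts.put(kMer, current + 1)
}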
Finally, billions of entries is a bit much for most hardware; you'll probably run out of memory. You'll most likely need to use memory-mapped files; here's a good example (no hashing though):
https://github.com/peter-lawrey/Java-Chronicle
UPDATE 1
Here's a good drop in replacement for java collections:
https://github.com/boundary/high-scale-lib
UPDATE 2
I ran your code and it did slow down around 800,000 entries, but then I boosted the Java heap size and it ran fine. Try using something like this for the JVM:
-Xmx2G
Or, if you want to use every last bit of your memory:
-Xmx256G
These are the wrong data structures. You will hit a RAM limit pretty fast (unless you have 100+ GB, and even then you will still hit limits very fast).
I don't know if suitable data structures exist for Scala, although someone has probably done something with Java.

How to correctly get current loop count from an Iterator in scala

I am looping over the following lines from a csv file to parse them. I want to identify the first line since it's the header. What's the best way of doing this instead of keeping a var counter?
var counter = 0
for (line <- lines) {
  println(CsvParser.parse(line, counter))
  counter += 1
}
I know there has got to be a better way to do this; I'm a newbie to Scala.
Try zipWithIndex:
for (line <- lines.zipWithIndex) {
  println(CsvParser.parse(line._1, line._2))
}
#tenshi suggested the following improvement with pattern matching:
for ((line, count) <- lines.zipWithIndex) {
  println(CsvParser.parse(line, count))
}
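Since the goal in the question is to single out the header line, the index from zipWithIndex can also be checked directly (a sketch reusing the asker's lines and CsvParser):

for ((line, index) <- lines.zipWithIndex) {
  if (index == 0) println("header: " + line)      // first line is the header
  else            println(CsvParser.parse(line, index))
}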
I totally agree with the given answer; still, I have to point out something important, and initially I planned to put it in a simple comment.
But it would be quite long, so let me set it out as a variant answer.
It's perfectly true that the zip* methods are helpful for creating tables from lists, but they have the drawback that they loop over the lists in order to build the result.
So a common recommendation is to sequence the actions required on the lists in a view, so that they are all combined and applied only when a result has to be produced. Producing a result means the return value isn't an Iterable; foreach is one such case, for instance.
Now, talking about the first answer: if lines is the list of lines of a very big file (or even an enumeratee on it), zipWithIndex will go through all of them and produce a table (an Iterable of tuples). Then the for comprehension will go through the same number of items again.
Finally, you've increased the running time by n, where n is the length of lines, and added a memory footprint of roughly m + n*16, where m is the footprint of lines.
Proposition
lines.view.zipWithIndex map Function.tupled(CsvParser.parse) foreach println
A few more words (I promise): lines.view will create something like a scala.collection.SeqView that holds all the further "mapping" functions that produce a new Iterable, such as zipWithIndex and map.
Moreover, I think the expression is more elegant because it reads naturally and follows the logic:
"For lines, create a view that zips each item with its index, map the result through the parser, and print each parsed result."
HTH.