Dot product in Scala without heap allocation

I have a Scala project with some arithmetic-intensive code, and it sometimes allocates Floats faster than the GC can collect them. (This is not about memory leaks caused by retained references, just heavy churn of temporary values.) I try to use Arrays of primitive types and reuse them where I can, but some allocations still sneak in.
One piece that puzzles me, for instance:
import org.specs2.mutable.Specification

class CalcTest extends Specification {

  def dot(a: Array[Float], b: Array[Float]): Float = {
    require(a.length == b.length, "array size mismatch")
    val n = a.length
    var sum: Float = 0f
    var i = 0
    while (i < n) {
      sum += a(i) * b(i)
      i += 1
    }
    sum
  }

  val vector = Array.tabulate(1000)(_.toFloat)

  "calculation" should {
    "use memory sparingly" >> {
      val before = Runtime.getRuntime().freeMemory()
      for (i <- 0 to 1000000)
        dot(vector, vector)
      val after = Runtime.getRuntime().freeMemory()
      (before - after) must be_<(1000L) // actual result above 4M
    }
  }
}
I would expect it to compute the dot products using only stack memory, but apparently it allocates about 4 bytes per call on the heap. This may not sound like much, but it adds up quickly in my code.
I suspected the sum, but from the bytecode output it looks like it lives on the stack:
  aload 1
  arraylength
  istore 3
  fconst_0
  fstore 4
  iconst_0
  istore 5
l2:
  iload 5
  iload 3
  if_icmpge l3
  fload 4
  aload 1
  iload 5
  faload
  aload 2
  iload 5
  faload
  fmul
  fadd
  fstore 4
  iload 5
  iconst_1
  iadd
  istore 5
  goto l2
l3:
  fload 4
  freturn
Is it the return value that goes on the heap? Is there any way to avoid this overhead entirely? Is there a better way to investigate and solve such memory problems?
From the VisualVM output for my project, I can only see that I have an awful lot of Floats allocated. It is hard to track small objects like these that are allocated rapidly; VisualVM is more useful for large objects and for memory snapshots taken at long intervals.
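One way to make such measurements less fragile, assuming a HotSpot JVM, is the per-thread allocation counter exposed by the com.sun.management extension of ThreadMXBean; unlike freeMemory(), it is not disturbed by the GC or by other threads. A minimal sketch:
import java.lang.management.ManagementFactory

// Cast to the HotSpot-specific extension of the standard ThreadMXBean
val tmx = ManagementFactory.getThreadMXBean
  .asInstanceOf[com.sun.management.ThreadMXBean]
val tid = Thread.currentThread.getId

val before = tmx.getThreadAllocatedBytes(tid)
// ... run the code under test on this thread ...
val after = tmx.getThreadAllocatedBytes(tid)
println("allocated ~" + (after - before) + " bytes on this thread")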
Update:
I was so focused on the function code that I missed the problem in the test. If I rewrite it with a while loop, the test succeeds:
var i = 0
while (i < 1000000) {
  dot(vector, vector)
  i += 1
}
I would still appreciate more ideas for other ways to debug this sort of issue, in addition to tests like this and VisualVM memory snapshots.

The Range implementation in
for (i <- 0 to 1000000)
  dot(vector, vector)
might use some memory itself, or might just be slow enough to let the JVM allocate something else in the background, breaking the fragile measurement method used in the test.
Try rewriting these lines into a while loop, for example.
(The original version of this post said that for() was equivalent to map(), which was wrong. It is equivalent to foreach() here because it does not have a yield clause.)
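To make the mechanism concrete: without a yield clause, the for comprehension desugars to roughly the following (a sketch of the rewrite, not literal compiler output):
// The test loop, the way the compiler sees it:
// a Range plus a foreach call taking the loop body as a closure
(0 to 1000000).foreach { i =>
  dot(vector, vector)
}
Whether this allocates per iteration depends on the Scala version and the JIT, which is exactly why a bare while loop is the safer baseline for such a fragile measurement.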

Related

"Find the Parity Outlier Code Wars (Scala)"

I have been doing some CodeWars challenges recently, and I have a problem with this one:
"You are given an array (which will have a length of at least 3, but could be very large) containing integers. The array is either entirely comprised of odd integers or entirely comprised of even integers except for a single integer N. Write a method that takes the array as an argument and returns this "outlier" N."
I have looked at some solutions that are already posted, but I want to solve the problem using my own approach.
The main problem with my code seems to be that it ignores negative numbers, even though I use Scala's Math.abs() method.
If you have an idea how to get around that, it is more than welcome.
Thanks a lot
object Parity {
  var even = 0
  var odd = 0
  var result = 0

  def findOutlier(integers: List[Int]): Int = {
    for (y <- 0 until integers.length) {
      if (Math.abs(integers(y)) % 2 == 0)
        even += 1
      else
        odd += 1
    }
    if (even == 1) {
      for (y <- 0 until integers.length) {
        if (Math.abs(integers(y)) % 2 == 0)
          result = integers(y)
      }
    } else {
      for (y <- 0 until integers.length) {
        if (Math.abs(integers(y)) % 2 != 0)
          result = integers(y)
      }
    }
    result
  }
}
Your code handles negative numbers just fine. The problem is that you rely on mutable state, which leaks between runs of your code. Your code behaves as follows:
val l = List(1,3,5,6,7)
println(Parity.findOutlier(l)) //6
println(Parity.findOutlier(l)) //7
println(Parity.findOutlier(l)) //7
The first run is correct. However, when you run it the second time, even, odd, and result all still hold the values from the previous run. If you define them inside your findOutlier method instead of in the Parity object, your code gives correct results.
Additionally, I highly recommend reading over the methods available on a Scala List. You should almost never need to loop through a List like that, and there are a number of much more concise solutions to this problem. Mutable vars are also a pretty big red flag in Scala code, as are excessive if statements.
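For instance, here is a minimal var-free sketch along these lines, using partition (one of the standard List methods; note that n % 2 == 0 already classifies negative numbers correctly, so Math.abs is unnecessary):
object Parity {
  def findOutlier(integers: List[Int]): Int = {
    // One pass: split into (evens, odds); the outlier is in the singleton half
    val (evens, odds) = integers.partition(_ % 2 == 0)
    if (evens.length == 1) evens.head else odds.head
  }
}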

How to do bitwise operation decently?

I'm doing analysis on binary data. Suppose I have two uint8 data values:
a = uint8(0xAB);
b = uint8(0xCD);
I want to take the lower two bits from a and the whole content of b to make a 10-bit value. In C style, it would look like:
(a[2:1] << 8) | b
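For the example values that is (0xAB & 0x03) << 8 | 0xCD = (0x03 << 8) | 0xCD = 0x3CD, i.e. 973 in decimal, which is a handy value to check the solutions below against.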
I tried bitget:
bitget(a,2:-1:1)
But this just gave me two separate logical values ([1, 1]), which is not a scalar and cannot be used in the bit-shift operation later.
My current solution is:
Combine the shifted a with b using a bitwise OR:
temp1 = bitor(bitshift(uint16(a), 8), uint16(b));
Left-shift by six bits to discard the upper six bits of a:
temp2 = bitshift(temp1, 6);
Right-shift by six bits to drop the low zeros introduced in the previous step:
temp3 = bitshift(temp2, -6);
Putting all these on one line:
result = bitshift(bitshift(bitor(bitshift(uint16(a), 8), uint16(b)), 6), -6);
This doesn't seem efficient, does it? I only want to compute (a[2:1] << 8) | b, and it takes a long expression to get the value.
Please let me know if there is a well-known solution for this problem.
Since you are using Octave, you can make use of bitpack and bitunpack:
octave> A = bitunpack (uint8 (0xAB))
A =
  1  1  0  1  0  1  0  1
octave> B = bitunpack (uint8 (0xCD))
B =
  1  0  1  1  0  0  1  1
Once you have them in this form (note that bitunpack returns the bits least-significant first), it's dead easy to do what you want:
octave> [B A(1:2)]
ans =
1 0 1 1 0 0 1 1 1 1
Then simply pad with zeros accordingly and pack it back into an integer:
octave> postpad ([B A(1:2)], 16, false)
ans =
1 0 1 1 0 0 1 1 1 1 0 0 0 0 0 0
octave> bitpack (ans, "uint16")
ans = 973
That OR is equivalent to an addition here, because the shifted operands have no overlapping bits:
result = bitshift(bi2de(bitget(a,1:2)), 8) + double(b); % cast b, or the uint8 addition saturates at 255
e.g.
a      = 01010111
b      = 10010010
result = 00000011 10010010
       = a[2]*2^9 + a[1]*2^8 + b
an alternative method could be
result = mod(a,2^x)*2^y + b;
where x is the number of bits you want to extract from a and y is the number of bits in b; in your case:
result = mod(uint16(a),4)*256 + uint16(b); % cast to uint16 first, or the uint8 arithmetic saturates at 255
an extra alternative, close to the C solution:
result = bitor(bitshift(bitand(uint16(a),3), 8), uint16(b));
I think it is important to explain exactly what "(a[2:1] << 8) | b" is doing.
In assembly, referencing individual bits is not a single operation. If you assume all operations take the exact same time, the "efficient" a[2:1] starts looking rather inefficient: the convenience notation actually performs (a & 0x03).
If your compiler converts a uint8 to a uint16 because of the shift, that is not a "free" operation either. Effectively, the compiler first clears a uint16-sized register and then copies a into it, an extra step that would not otherwise be needed.
This means your statement really is (uint16(a & 0x03) << 8) | uint16(b).
Now yes, because you are shifting by a power of two, the compiler could just move a into AH, move b into AL, AND AH with 0x03, and move it all out, but that is a compiler optimization and not what your C code said to do.
The point is that directly translating that statement into MATLAB yields
bitor(bitshift(uint16(bitand(a,3)),8),uint16(b))
It is not as terse as (a[2:1] << 8) | b, but the number of "high level operations" is the same.
Note that all scripting languages are very slow to initiate each instruction, but complete each instruction rapidly. The terse nature of Python is not because "terse is better" but to create simple structures that the language can recognize, so it can easily switch into vectorized-operation mode and start executing code very quickly.
The point here is that you pay an "overhead" cost for calling bitand, but when operating on an array that overhead is paid only once, and the operation can use SSE. A JIT (just-in-time) compiler, which optimizes scripting languages by reducing overhead calls and emitting temporary machine code for the currently executing sections, may even recognize that the type checks for a chain of bitwise operations need only occur on the initial inputs, further reducing runtime.
Very high level languages are quite different from (and frustrating compared to) high level languages such as C. You give up a large amount of control over code execution for ease of code production; whether MATLAB has actually implemented a native uint8 or is really using a double and truncating it, you do not know. A bitwise operation on a native uint8 is extremely fast, but converting from float to uint8, performing the bitwise operation, and converting back is slow. (Historically, MATLAB used doubles for everything and only rounded according to the "type" you specified.)
Even now, Octave 4.0.3 has a compiled bitshift function for which bitshift(ones('uint32'),-32) wraps back to 1. Brilliant! VHLLs place you at the mercy of the language: it is not about how terse or verbose you write the code, it is about how the language decides to interpret it and execute machine-level code. So instead of shifting, uint32(floor(ones / (2^32))) is actually faster and more accurate.

Chisel compiler is very slow

I am working on a matrix-summation kind of design. The compiler takes 4+ hours to generate 1+ million lines of code, where every line is an "assign ...". I don't know whether this is inefficiency in the compiler or bad coding style on my part. If someone could suggest some alternatives, that would be great!
Here is a description of the code:
The input is ANDed element by element with a random matrix and summed up using .reduce, so the result should be a 140x6-bit Vec; concatenating the entries gives an 840-bit output.
(rndvec is supposed to be a 140x840x6-bit random matrix. Since I don't know how to generate random values, I started with a fixed 140x6 block to represent one row and feed it the input over and over again.)
The following is my code:
import Chisel._
import scala.collection.mutable.HashMap
import util.Random

class LBio(n: Int) extends Bundle {
  var myinput  = UInt(INPUT, 840)
  var myoutput = UInt(OUTPUT, 840)
}

class Lbi(q: Int, n: Int, m: Int) extends Module {
  def mask(orig: Vec[UInt], maska: UInt, mi: Int) = {
    val result = Vec.fill(840){ UInt(width = 6) }
    for (i <- 0 until 840) {
      result(i) := orig(i) & Fill(6, maska(i)) // every bit of input ANDed with the random vector
    }
    result
  }

  val io = new LBio(840)
  val rndvec = Vec.fill(840){ UInt("h13", 6) } // random vector; for now just a replication of 0x13
  val resultvec = Vec.fill(140){ UInt(width = 6) }
  for (i <- 0 until 140) {
    resultvec(i) := mask(rndvec, io.myinput, m).reduce(_ + _) // add the entire row of 6-bit elements with reduce
  }
  io.myoutput := resultvec.toBits
}
The terminal report:
started inference
finished inference (4)
start width checking
finished width checking
started flattenning
finished flattening (941783)
resolving nodes to the components
finished resolving
started transforms
finished transforms
checking for combinational loops
NO COMBINATIONAL LOOP FOUND
COMPILING class TutorialExamples.Lbi 0 CHILDREN (0,0)
[success] Total time: 33453 s, completed Oct 16, 2013 10:32:10 PM
There's nothing obviously wrong with your Chisel code, but I should point out that if rndvec is really 140x840x6 bits, that's ~705 kilobits (~86 kB) of state! And each reduce operation runs over ~5 kilobits of state.
Chisel uses "assign" statements because your code is entirely combinational and Chisel produces a very structural form of Verilog.
I suspect the part that is killing the compile time (aside from the huge amount of state) is that you are generating and manipulating 140 Vecs with the mask() function.
I tried my hand at rewriting your code and got it down from 941,783 nodes to 202,723 (takes about 10-15 minutes to compile, but generates 11MB of Verilog code). I'm pretty sure this does what your code was doing:
class Hello(q: Int, dim_n: Int) extends Module {
  val io = new LBio(dim_n)
  val rndvec = Vec.fill(dim_n){ UInt("h13", 6) }
  val resultvec = Vec.fill(dim_n/6){ UInt(width = 6) }

  // lift this work outside of the for loop
  val padded_input = Vec.fill(dim_n){ UInt(width = 6) }
  for (i <- 0 until dim_n) {
    padded_input(i) := Fill(6, io.myinput(i)) // replicate bit i of the input six times
  }

  for (i <- 0 until dim_n/6) {
    val result = Bits(width = dim_n*6)
    result := rndvec.toBits & padded_input.toBits

    var sum = UInt(0) // advanced Chisel - be careful with the use of var!
    for (j <- 0 until dim_n*6 by 6) {
      sum = sum + result(j+5, j) // one 6-bit lane per step
    }
    resultvec(i) := sum
  }

  io.myoutput := resultvec.toBits
}
What I did was avoid doing the same work over and over again, like padding out the myinput Vec inside the for loop's mask() function. I also kept everything in Bits() instead of Vecs. Sadly, that means I lose the awesome .reduce() function.
I think the answer here is "be cognizant of how much state you're creating" and "Vecs are awesome, but use them carefully".
Do you have a Verilog version that's short and concise? It'd be interesting to see if there are areas where Chisel is losing out efficiency wise.

Why is SIGAR returning NaN or zero randomly

I am trying to run SIGAR rapidly to get a number of hardware metric samples, and I see this behavior:
val sig: Sigar = new Sigar()
val steady_cpu: Double = (for (i <- 1 to 100) yield sig.getCpuPerc().getUser()).sum / 100.0
where steady_cpu comes out as NaN. Looking at the generated list, the NaNs come from the getUser() call returning NaN.
The general issue seems to be that SIGAR calls are stateful, and calling the functions too quickly does not give SIGAR time to rebuild its internal state. I'd guess that they are counting CPU cycles or something similar, which is typically an approximate science, and if you call the functions too rapidly the internal libraries end up dividing by zero. The fix is to add a short sleep between calls:
val sig: Sigar = new Sigar()
val steady_cpu: Double = (for (i <- 1 to 100) yield {
  Thread.sleep(10)
  sig.getCpuPerc().getUser()
}).sum / 100.0

Adding immutable Vectors

I am trying to work more with Scala's immutable collections, since they are easy to parallelize, but I am struggling with some newbie problems. I am looking for an efficient way to create a new Vector from an operation. To be precise, I want something like
val v : Vector[Double] = RandomVector(10000)
val w : Vector[Double] = RandomVector(10000)
val r = v + w
I tested the following:
// 1)
val r: Vector[Double] = (v.zip(w)).map { t: (Double, Double) => t._1 + t._2 }

// 2)
import scala.collection.immutable.VectorBuilder
val vb = new VectorBuilder[Double]()
var i = 0
while (i < v.length) {
  vb += v(i) + w(i)
  i += 1
}
val r = vb.result
Both take really long compared to working with an Array:
[Vector Zip/Map ] Elapsed time 0.409 msecs
[Vector While Loop] Elapsed time 0.374 msecs
[Array While Loop ] Elapsed time 0.056 msecs
// with warm-up (10000) and avg. over 10000 runs
Is there a better way to do it? I think the zip/map/reduce approach has the advantage that it can run in parallel as soon as the collections support it.
Thanks
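(For reference, the Array baseline from the timings above is not shown; it would presumably be something like this sketch, with illustrative names:)
// Elementwise add over plain arrays of unboxed doubles
val va = Array.fill(10000)(math.random)
val wa = Array.fill(10000)(math.random)
val ra = new Array[Double](va.length)
var j = 0
while (j < va.length) {
  ra(j) = va(j) + wa(j)
  j += 1
}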
Vector is not specialized for Double, so you're going to pay a sizable performance penalty for using it. If you are doing a simple operation, you're probably better off using an array on a single core than a Vector or other generic collection on the entire machine (unless you have 12+ cores). If you still need parallelization, there are other mechanisms you can use, such as using scala.actors.Futures.future to create instances that each do the work on part of the range:
val a = Array(1,2,3,4,5,6,7,8)
(0 to 4).map(_ * (a.length/4)).sliding(2).map(i => scala.actors.Futures.future {
  var s = 0
  var j = i(0)
  while (j < i(1)) {
    s += a(j)
    j += 1
  }
  s
}).map(_()).sum // _() applies the future, blocking until it's done
Of course, you'd need to use this on a much longer array (and on a machine with four cores) for the parallelization to improve things.
You should use lazily built collections when you chain more than one higher-order method:
v1.view zip v2 map { case (a,b) => a+b }
If you don't use a view or an iterator, each method will create a new immutable collection, even when it is not needed.
The immutable code probably won't be as fast as the mutable version, but the lazy collection will improve the execution time of your code a lot.
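Note that the zipped view is itself lazy; if a concrete Vector is needed at the end, force it once (assuming Scala 2.10+, where toVector is available):
// Evaluate the whole lazy pipeline exactly once, at the end
val r: Vector[Double] = (v1.view zip v2).map { case (a, b) => a + b }.toVector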
Arrays are not type-erased; Vectors are. Basically, the JVM gives Array an advantage over other collections when handling primitives that generic collections cannot overcome. Scala's specialization might decrease that advantage, but, given its cost in code size, it can't be used everywhere.