CoffeeScript for efficiency

I have this CoffeeScript code:
for y in [coY - limit .. coY + limit]
  for x in [coX - limit .. coX + limit]
I was looking for ways to improve the speed of my code and found what it compiles into:
for (y = _i = _ref = coY - limit, _ref1 = coY + limit; _ref <= _ref1 ? _i <= _ref1 : _i >= _ref1; y = _ref <= _ref1 ? ++_i : --_i) {
for (x = _j = _ref2 = coX - limit, _ref3 = coX + limit; _ref2 <= _ref3 ? _j <= _ref3 : _j >= _ref3; x = _ref2 <= _ref3 ? ++_j : --_j) {
When I replaced that with my own JavaScript:
for (y = coY - limit; y <= coY + limit; y++) {
  for (x = coX - limit; x <= coX + limit; x++) {
I measured the script to be significantly faster (from 25 ms down to 15 ms). Can I somehow force CoffeeScript to compile into code similar to mine? Or is there another solution?
Thank you.

Assuming your loop will always go from a smaller number to a bigger number, you can use by 1:
for y in [coY - limit .. coY + limit] by 1
  for x in [coX - limit .. coX + limit] by 1
Which compiles to:
for (y = _i = _ref = coY - limit, _ref1 = coY + limit; _i <= _ref1; y = _i += 1) {
for (x = _j = _ref2 = coX - limit, _ref3 = coX + limit; _j <= _ref3; x = _j += 1) {
It's not HEAPS better, but possibly a bit.

I dunno buddy, the code in your edit compiles to this for me:
// Generated by CoffeeScript 1.4.0
var x, y, _i, _j, _ref, _ref1, _ref2, _ref3;
for (y = _i = _ref = coY - limit, _ref1 = coY + limit; _i <= _ref1; y = _i += 1) {
for (x = _j = _ref2 = coX - limit, _ref3 = coX + limit; _j <= _ref3; x = _j += 1) {
}
}
To get it exactly like you want it, you might just have to actually write it in JavaScript. Luckily, CoffeeScript has syntax for inserting literal JS into a CS file. If you surround JS with backticks (`), the CS compiler will include it in the output but it won't change what's in the backticks in any way.
Here's an example:
console.log "regular coffeescript"
# surround inline JS with backticks, like so:
`for(y = coY - limit; y <= coY + limit; y++) {
  for(x = coX - limit; x <= coX + limit; x++) {
    console.log('inline JS!');
  }
}`
console.log "continue writing regular CS after"
Source: http://coffeescript.org/#embedded

Related

Question on the dimension of CUDA block indexing

In the following CUDA code, taken from the book "Accelerating MATLAB with GPU Computing: A Primer with Examples", I think
int row = blockIdx.x * blockDim.x + threadIdx.x;
if (row < 1 || row > numRows - 1)
    return;
int col = blockIdx.y * blockDim.y + threadIdx.y;
if (col < 1 || col > numCols - 1)
    return;
should actually be
int row = blockIdx.x * blockDim.x + threadIdx.x;
if (row < 0 || row > numRows - 1)
    return;
int col = blockIdx.y * blockDim.y + threadIdx.y;
if (col < 0 || col > numCols - 1)
    return;
Am I right?
The following is the whole code, which does image convolution in CUDA and is called from MATLAB.
#include "conv2Mex.h"
__global__ void conv2MexCuda(float* src,
float* dst,
int numRows,
int numCols,
float* mask)
{
int row = blockIdx.x * blockDim.x + threadIdx.x;
if (row < 1 || row > numRows - 1)
return;
int col = blockIdx.y * blockDim.y + threadIdx.y;
if (col < 1 || col > numCols - 1)
return;
int dstIndex = col * numRows + row;
dst[dstIndex] = 0;
int mskIndex = 3 * 3 - 1;
for (int kc = -1; kc < 2; kc++)
{
int srcIndex = (col + kc) * numRows + row;
for (int kr = -1; kr < 2; kr++)
{
dst[dstIndex] += mask[mskIndex--] * src[srcIndex + kr];
}
}
}
void conv2Mex(float* src, float* dst, int numRows, int numCols, float* msk)
{
...
conv2MexCuda<<<gridSize, blockSize>>>...
...
}
Am I right?
I don't think you are right.
The construction of the row and col indices in the kernel code is such that they will vary (across threads in the grid) from 0 to numRows-1 and 0 to numCols-1 (and perhaps larger, depending on actual grid sizing, which you haven't shown).
Based on the code you have shown, the mask is evidently a 3x3 mask, which means that it acts as a stencil over the current (row, col) position, and extends plus and minus one row, and plus and minus one column. Let's take a careful look at the indexing here for the case where (row, col) = (0,0); this is one of the positions you have allowed to execute based on your proposed change:
for (int kc = -1; kc < 2; kc++)
{
    int srcIndex = (col + kc) * numRows + row;
    for (int kr = -1; kr < 2; kr++)
    {
        dst[dstIndex] += mask[mskIndex--] * src[srcIndex + kr];
At the first iteration of the outer for loop, kc will be -1, therefore srcIndex is (0-1)*numRows+0. Let's assume numRows is reasonably large, like 256. So srcIndex is -1*256 or -256. At the first iteration of the inner for-loop, kr is -1, so the computed index for the access to src is -256-1 = -257. That is almost never sensible.
If anything, the upper bounds look incorrect to me. If we assume that the valid image index ranges are 0..numRows-1 and 0..numCols-1, then I think the restrictions should be as follows:
int row = blockIdx.x * blockDim.x + threadIdx.x;
if (row < 1 || row > numRows - 2)
    return;
int col = blockIdx.y * blockDim.y + threadIdx.y;
if (col < 1 || col > numCols - 2)
    return;
That appears to be the classic computer science off-by-1 error.
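As a side note, here is a minimal CPU sketch (mine, not from the book) of the same column-major 3x3 convolution with those corrected interior bounds; it makes it easy to see that every src access stays within 0..numRows*numCols-1. The name conv2Reference is just for illustration:
void conv2Reference(const float* src, float* dst,
                    int numRows, int numCols, const float* mask)
{
    // Only interior positions are computed, matching row in 1..numRows-2
    // and col in 1..numCols-2 from the corrected bounds above.
    for (int col = 1; col <= numCols - 2; col++)
    {
        for (int row = 1; row <= numRows - 2; row++)
        {
            float sum = 0.0f;
            int mskIndex = 3 * 3 - 1;                      // walk the mask backwards, as the kernel does
            for (int kc = -1; kc < 2; kc++)
            {
                int srcIndex = (col + kc) * numRows + row; // column-major layout
                for (int kr = -1; kr < 2; kr++)
                {
                    sum += mask[mskIndex--] * src[srcIndex + kr];
                }
            }
            dst[col * numRows + row] = sum;
        }
    }
}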

Where does the huge performance difference between the two versions of the code come from?

I am working on the following problem: given a string s, partition s such that every substring of the partition is a palindrome, and return the minimum cuts needed for a palindrome partitioning of s. The problem can also be found here: https://oj.leetcode.com/problems/palindrome-partitioning-ii/
Version 1 is a solution I found online.
Version 2 is my code.
They both seem to work in very similar ways. However, with a reasonably large input, version 2 takes more than 6000 milliseconds whereas version 1 takes around 71 milliseconds.
Can anyone provide any idea where the time difference comes from?
Version 1:
int minSol(string s) {
    int len = s.size();
    vector<int> D(len + 1);
    vector<vector<int>> P;
    for (int i = 0; i < len; i++){
        vector<int> t(len);
        P.push_back(t);
    }
    for (int i = 0; i <= len; i++)
        D[i] = len - i;
    for (int i = 0; i < len; i++)
        for (int j = 0; j < len; j++)
            P[i][j] = false;
    for (int i = len - 1; i >= 0; i--){
        for (int j = i; j < len; j++){
            if (s[i] == s[j] && (j - i < 2 || P[i + 1][j - 1])){
                P[i][j] = true;
                D[i] = min(D[i], D[j + 1] + 1);
            }
        }
    }
    return D[0] - 1;
}
Version 2:
int minCut(string s) {
    int size = s.size();
    vector<vector<bool>> map;
    for (int i = 0; i < size; i++){
        vector<bool> t;
        for (int j = 0; j < size; j++){
            t.push_back(false);
        }
        map.push_back(t);
    }
    vector<int> minCuts;
    for (int i = 0; i < size; i++){
        map[i][i] = true;
        minCuts.push_back(size - i - 1);
    }
    for (int i = size - 1; i >= 0; i--){
        for (int j = size - 1; j >= i; j--){
            if (s[i] == s[j] && (j - i <= 1 || map[i + 1][j - 1])){
                map[i][j] = true;
                if (j == size - 1){
                    minCuts[i] = 0;
                }else if (minCuts[i] > minCuts[j + 1] + 1){
                    minCuts[i] = minCuts[j + 1] + 1;
                }
            }
        }
    }
    return minCuts[0];
}
I would guess it's because in the second version you're doing size^2 push_back's, whereas in the first version you're just doing size push_back's.
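If that is the cause, one quick way to test it is to size both containers up front so no per-element push_back happens at all. A minimal sketch of Version 2 rewritten that way (my rewrite, not the original code; the name minCutPrealloc is made up) would be:
#include <string>
#include <vector>
using namespace std;

// Sketch: same DP as Version 2, but both tables are sized in their
// constructors, so there are no per-element push_back calls.
int minCutPrealloc(const string& s) {
    int size = s.size();
    vector<vector<bool>> isPal(size, vector<bool>(size, false));
    vector<int> minCuts(size);
    for (int i = 0; i < size; i++) {
        isPal[i][i] = true;
        minCuts[i] = size - i - 1;           // worst case: cut after every character
    }
    for (int i = size - 1; i >= 0; i--) {
        for (int j = size - 1; j >= i; j--) {
            if (s[i] == s[j] && (j - i <= 1 || isPal[i + 1][j - 1])) {
                isPal[i][j] = true;
                if (j == size - 1)
                    minCuts[i] = 0;
                else if (minCuts[i] > minCuts[j + 1] + 1)
                    minCuts[i] = minCuts[j + 1] + 1;
            }
        }
    }
    return size ? minCuts[0] : 0;
}
Whether this fully closes the gap would need measuring; vector<bool>'s bit-packed element access may also cost more than the plain int matrix used in Version 1.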

for loop with range in CoffeeScript

Noob question. I am trying to write a for loop with a range. For example, this is what I want to produce in JavaScript:
var i, a, j, b, len = arr.length;
for (i = 0; i < len - 1; i++) {
    a = arr[i];
    for (j = i + 1; j < len; j++) {
        b = arr[j];
        doSomething(a, b);
    }
}
The closest I've come so far is the following, but:
- it generates unnecessary and expensive slice calls
- it accesses the array length inside the inner loop
CoffeeScript:
for a, i in a[0...a.length-1]
  for b, j in a[i+1...a.length]
    doSomething a, b
Generated code:
var a, b, i, j, _i, _j, _len, _len1, _ref, _ref1;
_ref = a.slice(0, a.length - 1);
for (i = _i = 0, _len = _ref.length; _i < _len; i = ++_i) {
    a = _ref[i];
    _ref1 = a.slice(i + 1, a.length);
    for (j = _j = 0, _len1 = _ref1.length; _j < _len1; j = ++_j) {
        b = _ref1[j];
        doSomething(a, b);
    }
}
(How) can this be expressed in CoffeeScript?
Basically, transcribing your first JS code to CS:
len = arr.length
for i in [0...len - 1] by 1
  a = arr[i]
  for j in [i + 1...len] by 1
    b = arr[j]
    doSomething a, b
It seems like the only way to avoid the extra variables is with a while loop (see http://js2.coffee):
i = 0
len = arr.length
while i < len - 1
  a = arr[i]
  j = i + 1
  while j < len
    b = arr[j]
    doSomething a, b
    j++
  i++
or a bit less readable:
i = 0; len = arr.length - 1
while i < len
  a = arr[i++]; j = i
  while j <= len
    doSomething a, arr[j++]

Looking for SLAB6 implementation

I'm looking to implement SLAB6 in my raycaster, especially the kv6 support for voxel models. However, the SLAB6 source by Ken Silverman is totally unreadable (mostly ASM), so I was hoping someone could point me to a proper C / Java source for loading kv6 models, or preferably explain the workings to me in some pseudocode (since I want to know how to support the kv6; I know how it works). Thanks, Kaj
EDIT: the implementation would be in Java.
I found some code in an application called VoxelGL (author not mentioned in the source code):
void CVoxelWorld::generateSlabFromData(unsigned char *data, VoxelData *vdata, Slab *slab)
{
    // data[] is a 256-entry column of 0/1 flags (1 = solid); runs alternate
    // solid/empty, starting with the solid pattern.
    // First pass: count the runs (n) and the number of solid voxels (v).
    int currentpattern = 1;
    int i = 0;
    int n, totalcount, v, count;
    n = 0;
    v = 0;
    while (1)
    {
        while (data[i] == currentpattern)
        {
            if (currentpattern == 1)
                v++;
            i++;
            if (i == 256)
                break;
        }
        n++;
        if (i == 256)
        {
            if (currentpattern == 0)
                n--;
            break;
        }
        currentpattern ^= 1;
    }
    slab->nentries = n;
    if (slab->description != 0) delete [] slab->description;
    if (slab->data != 0) delete [] slab->data;
    slab->description = new int[n];
    slab->data = new VoxelData[v];
    // Second pass: store each run length (minus one) and copy the voxel
    // payload for the solid runs (runs with an even index).
    totalcount = 0;
    v = 0;
    currentpattern = 1;
    for (i = 0; i < n; i++)
    {
        count = 0;
        while (data[totalcount] == currentpattern)
        {
            count++;
            totalcount++;
            if (totalcount == 256)
                break;
        }
        slab->description[i] = count - 1;
        if (i % 2 == 0)
        {
            memcpy(slab->data + v, vdata + totalcount - count, 3 * count);
            v += count;
        }
        currentpattern ^= 1;
    }
}
And:
#define clustersize 8

Slab *CVoxelWorld::getSlab(int x, int z)
{
    int xgrid = x / clustersize;
    int ygrid = z / clustersize;
    int clusteroffset = xgrid * 1024 * clustersize + ygrid * clustersize * clustersize;
    return &m_data[clusteroffset + (x & (clustersize - 1)) + (z & (clustersize - 1)) * clustersize];
}
And:
int CVoxelWorld::isSolid(int x, int y, int z)
{
    Slab *slab;
    if (y < 0 || y > 256)
        return 0;
    slab = getSlab(x, z);
    // Walk the run-length entries of the column; even-indexed runs are solid.
    int counter = 0;
    for (int i = 0; i < slab->nentries; i++)
    {
        int height = slab->description[i] + 1;
        if (i % 2 == 0)
        {
            if (y >= counter && y < counter + height)
                return 1;
        }
        counter += height;
    }
    return 0;
}
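From the code above, a slab looks like a plain run-length encoding of a 256-voxel column: runs alternate solid/empty with the solid run first, each description entry stores the run length minus one, and data holds the per-voxel payload for the solid runs only. As a rough illustration of that reading (my interpretation of the VoxelGL code, not anything taken from SLAB6 or the kv6 format), expanding a slab back into a flat column could look like this:
// Minimal stand-in for the fields of Slab that this sketch needs
// (field names follow the VoxelGL code above).
struct SlabRuns
{
    int nentries;       // number of runs
    int *description;   // run length - 1 for each run
};

// Sketch: expand one slab into a flat 256-entry solid/empty column,
// mirroring how generateSlabFromData/isSolid interpret the runs.
void expandSlab(const SlabRuns *slab, unsigned char column[256])
{
    int y = 0;
    for (int i = 0; i < slab->nentries && y < 256; i++)
    {
        int height = slab->description[i] + 1;        // stored as length - 1
        unsigned char solid = (i % 2 == 0) ? 1 : 0;   // even-indexed runs are solid
        for (int k = 0; k < height && y < 256; k++)
            column[y++] = solid;
    }
    while (y < 256)                                   // anything past the last run is empty
        column[y++] = 0;
}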

Teaching a Neural Net: Bipolar XOR

I'm trying to teach a neural net with 2 inputs, 4 hidden nodes (all in the same layer) and 1 output node. The binary representation works fine, but I have problems with the bipolar one. I can't figure out why, but the total error will sometimes converge to the same number, around 2.xx. My sigmoid is 2 / (1 + exp(-x)) - 1. Perhaps I'm sigmoiding in the wrong place. For example, to calculate the output error, should I be comparing the sigmoided output with the expected value or with the sigmoided expected value?
I was following this website: http://galaxy.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html , but they use different functions than I was instructed to use. Even when I did try to implement their functions, I still ran into the same problem. Either way, I get stuck about half the time at the same number (a different number for different implementations). Please tell me if I have made a mistake in my code somewhere or if this is normal (I don't see how it could be). Momentum is set to 0. Is this a common zero-momentum problem? The error functions we are supposed to be using are:
if ui is an output unit:
  Error(i) = (Ci - ui) * f'(Si)
if ui is a hidden unit:
  Error(i) = Error(Output) * weight(i to output) * f'(Si)
public double sigmoid( double x ) {
    double fBipolar, fBinary, temp;
    temp = (1 + Math.exp(-x));
    fBipolar = (2 / temp) - 1;
    fBinary = 1 / temp;
    if(bipolar){
        return fBipolar;
    }else{
        return fBinary;
    }
}
// Initialize the weights to random values.
private void initializeWeights(double neg, double pos) {
    for(int i = 0; i < numInputs + 1; i++){
        for(int j = 0; j < numHiddenNeurons; j++){
            inputWeights[i][j] = Math.random() - pos;
            if(inputWeights[i][j] < neg || inputWeights[i][j] > pos){
                print("ERROR ");
                print(inputWeights[i][j]);
            }
        }
    }
    for(int i = 0; i < numHiddenNeurons + 1; i++){
        hiddenWeights[i] = Math.random() - pos;
        if(hiddenWeights[i] < neg || hiddenWeights[i] > pos){
            print("ERROR ");
            print(hiddenWeights[i]);
        }
    }
}
// Computes output of the NN without training. I.e. a forward pass
public double outputFor ( double[] argInputVector ) {
    for(int i = 0; i < numInputs; i++){
        inputs[i] = argInputVector[i];
    }
    double weightedSum = 0;
    for(int i = 0; i < numHiddenNeurons; i++){
        weightedSum = 0;
        for(int j = 0; j < numInputs + 1; j++){
            weightedSum += inputWeights[j][i] * inputs[j];
        }
        hiddenActivation[i] = sigmoid(weightedSum);
    }
    weightedSum = 0;
    for(int j = 0; j < numHiddenNeurons + 1; j++){
        weightedSum += (hiddenActivation[j] * hiddenWeights[j]);
    }
    return sigmoid(weightedSum);
}
// Computes the derivative of f
public static double fPrime(double u){
    double fBipolar, fBinary;
    fBipolar = 0.5 * (1 - Math.pow(u,2));
    fBinary = u * (1 - u);
    if(bipolar){
        return fBipolar;
    }else{
        return fBinary;
    }
}
// This method is used to update the weights of the neural net.
public double train ( double [] argInputVector, double argTargetOutput ){
    double output = outputFor(argInputVector);
    double lastDelta;
    double outputError = (argTargetOutput - output) * fPrime(output);
    if(outputError != 0){
        for(int i = 0; i < numHiddenNeurons + 1; i++){
            hiddenError[i] = hiddenWeights[i] * outputError * fPrime(hiddenActivation[i]);
            deltaHiddenWeights[i] = learningRate * outputError * hiddenActivation[i] + (momentum * lastDelta);
            hiddenWeights[i] += deltaHiddenWeights[i];
        }
        for(int in = 0; in < numInputs + 1; in++){
            for(int hid = 0; hid < numHiddenNeurons; hid++){
                lastDelta = deltaInputWeights[in][hid];
                deltaInputWeights[in][hid] = learningRate * hiddenError[hid] * inputs[in] + (momentum * lastDelta);
                inputWeights[in][hid] += deltaInputWeights[in][hid];
            }
        }
    }
    return 0.5 * (argTargetOutput - output) * (argTargetOutput - output);
}
General coding comments:
initializeWeights(-1.0, 1.0);
may not actually get the initial values you were expecting.
initializeWeights should probably have:
inputWeights[i][j] = Math.random() * (pos - neg) + neg;
// ...
hiddenWeights[i] = (Math.random() * (pos - neg)) + neg;
instead of:
Math.random() - pos;
so that this works:
initializeWeights(0.0, 1.0);
and gives you initial values between 0.0 and 1.0 rather than between -1.0 and 0.0.
lastDelta is used before it is assigned a value:
deltaHiddenWeights[i] = learningRate * outputError * hiddenActivation[i] + (momentum * lastDelta);
I'm not sure if the + 1 on numInputs + 1 and numHiddenNeurons + 1 is necessary.
Remember to watch out for rounding of ints: 5/2 = 2, not 2.5!
Use 5.0/2.0 instead. In general, add the .0 in your code when the output should be a double.
Most importantly, have you trained the NeuralNet long enough?
Try running it with numInputs = 2, numHiddenNeurons = 4, learningRate = 0.9, and train for 1,000 or 10,000 times.
Using numHiddenNeurons = 2 it sometimes get "stuck" when trying to solve the XOR problem.
See also XOR problem - simulation