Speeding up GPU neural network

I'm trying to implement a neural network to run on a GPU using the Thrust and CUBLAS libraries, but I'm having a lot of trouble getting it to run faster than our current multithreaded and vectorized CPU implementation. The network has a single hidden layer with logistic units and an output layer with linear units, and here is the code for that:
// Functor to add bias before computing logistic
template <typename T>
struct bias_logistic_f {
    __host__ __device__
    T operator()(const T& x, const T& y) const {
        return 1/(1+exp(-(x+y)));
    }
};
bias_logistic_f<FLT> bias_logistic;
// Thrust vectors for input/hidden/output units
thrust::device_vector<FLT> batch(batch_rows*ndim);
thrust::device_vector<FLT> hid(batch_rows*nhid);
thrust::device_vector<FLT> gpu_code(ndata*ncode);
// ...Load data and network weights...
// Multiply input (batch) by weights (vis2hid)
// Our matrices are stored row-major, but BLAS wants column-major,
// so pretend they're transposed and compute hid' = vis2hid' * batch'
cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, nhid, batch_rows, ndim,
            &alpha, thrust::raw_pointer_cast(&vis2hid[0]), nhid,
            thrust::raw_pointer_cast(&batch[0]), ndim,
            &beta, thrust::raw_pointer_cast(&hid[0]), nhid);
// Add hidbiases to hid and compute logistic
thrust::transform(hid.begin(), hid.end(), hidbiases.begin(), hid.begin(),
                  bias_logistic);
// Multiply hid by weights (hid2code)
cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, ncode, batch_rows, nhid,
            &alpha, thrust::raw_pointer_cast(&hid2code[0]), ncode,
            thrust::raw_pointer_cast(&hid[0]), nhid,
            &beta, thrust::raw_pointer_cast(&gpu_code[b*batch_rows*ncode]), ncode);
// Add codebiases
thrust::transform(gpu_code.begin() + b*batch_rows*ncode, gpu_code.begin() + (b+1)*batch_rows*ncode,
                  codebiases.begin(), gpu_code.begin() + b*batch_rows*ncode,
                  thrust::plus<FLT>());
Our input data is a sparse matrix with about 150,000 rows and 6,500 columns, with about 100 non-zero elements per row on average. This is too large to store as a dense matrix on the GPU, so I loop through the sparse matrix, expanding batches of 1,000 rows at a time to feed into the neural network:
for(int b=0; b<nbatch; ++b) {
    // Zero out batch b
    thrust::fill(batch.begin(), batch.end(), 0.0f);
    // batch_val contains the non-zero values for the current batch, batch_idx the indices within the batch,
    // and batch_ptr indexes into batch_val/batch_idx
    // This is like CSR format except instead of compressing rows, it's compressing submatrices of 1,000 rows
    thrust::scatter(batch_val.begin() + batch_ptr[b],
                    batch_val.begin() + batch_ptr[b+1],
                    batch_idx.begin() + batch_ptr[b],
                    batch.begin());
    // ...Input batch to network (shown above)...
}
Our CPU implementation does the same thing, using STL vectors. When I ran both and compared their run times, I was surprised to find that the GPU code takes about 38 seconds on average to process our data, while the CPU code only takes about 27 seconds. Some of this difference may be due to the GPU being a few years old (a Tesla C1060) while the server is a newer 24-core machine, but I still would have thought that with thousands of threads available it wouldn't end up being about 40% slower.
Any ideas how I can make this code run faster? I'm new to GPU programming so I'm at a loss as to what I could be doing wrong. Is there a more efficient way to deal with sparse matrices than what I'm doing here, such as using the CUSPARSE library? Or would it be a better idea to forget about the high-level libraries altogether and just write my own kernels in CUDA to combine the matrix multiplication/logistic/addition steps?
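For reference, here is a minimal sketch of the kind of fused kernel that last question is asking about: it adds the hidden biases and applies the logistic in a single pass over the first GEMM's output. It assumes FLT is double (consistent with the use of cublasDgemm), that hid is the nhid x batch_rows column-major matrix written by that GEMM, and that hidbiases holds one bias per hidden unit; it is an illustration of the idea, not a tuned implementation.
// Sketch of a fused bias + logistic kernel (layout assumptions as noted above).
// hid is nhid x batch_rows, column-major, so element idx belongs to
// hidden unit (idx % nhid) of sample (idx / nhid).
__global__ void bias_logistic_kernel(double *hid, const double *hidbiases,
                                     int nhid, int batch_rows)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < nhid * batch_rows) {
        double v = hid[idx] + hidbiases[idx % nhid];  // add the per-unit bias
        hid[idx] = 1.0 / (1.0 + exp(-v));             // logistic
    }
}
// Possible replacement for the thrust::transform call:
//   int total   = nhid * batch_rows;
//   int threads = 256;
//   int blocks  = (total + threads - 1) / threads;
//   bias_logistic_kernel<<<blocks, threads>>>(
//       thrust::raw_pointer_cast(&hid[0]),
//       thrust::raw_pointer_cast(&hidbiases[0]),
//       nhid, batch_rows);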

Related

Problem understanding loss function behavior using Flux.jl in Julia

So, first of all, I am new to Neural Networks (NN).
As part of my PhD, I am trying to solve some problem through NN.
For this, I have created a program that creates some data set made of
a collection of input vectors (each with 63 elements) and their corresponding
output vectors (each with 6 elements).
So, my program looks like this:
Nₜᵣ = 25; # number of inputs in the data set
xtrain, ytrain = dataset_generator(Nₜᵣ); # generates In/Out vectors: xtrain/ytrain
datatrain = zip(xtrain,ytrain); # assemble my data
Now, both xtrain and ytrain are of type Array{Array{Float64,1},1}, meaning that
if (say) Nₜᵣ = 2, they look like:
julia> xtrain #same for ytrain
2-element Array{Array{Float64,1},1}:
[1.0, -0.062, -0.015, -1.0, 0.076, 0.19, -0.74, 0.057, 0.275, ....]
[0.39, -1.0, 0.12, -0.048, 0.476, 0.05, -0.086, 0.85, 0.292, ....]
The first 3 elements of each vector are normalized to unity (they represent x, y, z coordinates), and the following 60 numbers are also normalized to unity and correspond to some measurable attributes.
The program continues like:
layer1 = Dense(length(xtrain[1]),46,tanh); # setting 6 layers
layer2 = Dense(46,36,tanh) ;
layer3 = Dense(36,26,tanh) ;
layer4 = Dense(26,16,tanh) ;
layer5 = Dense(16,6,tanh) ;
layer6 = Dense(6,length(ytrain[1])) ;
m = Chain(layer1,layer2,layer3,layer4,layer5,layer6); # composing the layers
squaredCost(ym,y) = (1/2)*norm(y - ym).^2;
loss(x,y) = squaredCost(m(x),y); # define loss function
ps = Flux.params(m); # initializing mod.param.
opt = ADAM(0.01, (0.9, 0.8));
and finally:
trainmode!(m,true)
itermax = 700; # set max number of iterations
losses = [];
for iter in 1:itermax
    Flux.train!(loss,ps,datatrain,opt);
    push!(losses, sum(loss.(xtrain,ytrain)));
end
It runs perfectly; however, it has come to my attention that as I train my model with an increasing data set (Nₜᵣ = 10, 15, 25, etc...), the loss function seems to increase. See the image below:
Where: y1: Nₜᵣ=10, y2: Nₜᵣ=15, y3: Nₜᵣ=25.
So, my main question:
Why is this happening? I cannot see an explanation for this behavior. Is this somehow expected?
Remarks: Note that
All elements from the training data set (input and output) are normalized to [-1,1].
I have not tried changing the activ. functions
I have not tried changing the optimization method
Considerations: I need a training data set of nearly 10,000 input vectors, so I am expecting an even worse scenario...
Some personal thoughts:
Am I arranging my training dataset correctly? Say, if every single data vector is made of 63 numbers, is it correct to group them in an array and then pile them into an Array{Array{Float64,1},1}? I have no experience using NNs and Flux. How could I build a data set of 10,000 I/O vectors differently? Could this be the issue? (I am very inclined to this.)
Can this behavior be related to the chosen act. functions? (I am not inclined to this)
Can this behavior be related to the opt. algorithm? (I am not inclined to this)
Am I training my model wrong? Is the iteration loop really iterations, or are they epochs? I am struggling to put (and differentiate) the concepts of "epochs" and "iterations" into practice.
loss(x,y) = squaredCost(m(x),y); # define loss function
Your losses aren't normalized, so adding more data can only increase this cost function. However, the cost per data doesn't seem to be increasing. To get rid of this effect, you might want to use a normalized cost function by doing something like using the mean squared cost.
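In symbols, one normalized variant (a sketch of the idea, with N the number of training pairs; the exact Flux code is left to you) would be
L = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{2} \lVert y_i - m(x_i) \rVert^2
so the reported loss no longer grows simply because N does.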

Memory allocation Coregionalized Kernel

I'm currently using a linear model of coregionalization
(see e.g. the Alvarez notes, https://arxiv.org/pdf/1106.6251.pdf),
which is optimized via SVGP.
I noticed that the upper limit on the number of inducing points before running out of memory (OOM) is greatly reduced (now about ~5k inducing points instead of 8k when not using a coregionalized kernel). From my understanding, the limiting bottleneck should have stayed the same (still the MxM kernel matrix), but it seems like more has changed.
In addition, I now get the warning:
.../lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:112: UserWarning:
Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
The Kernel Matrices are constructed as follows.
I don't use big Qs or Rs (Q=3, R=3).
def coreg_k(Q, R, output_dim, active_dims):
    # create Q different kernels with rank R
    coreg = []
    k_q = []
    # lengthscales = np.logspace(-1, 3, 5)
    lengthscales = [0.1, 1, 5]
    for q in range(Q):
        coreg_tmp = gpflow.kernels.Coregion(input_dim=1, output_dim=output_dim, rank=R, active_dims=active_dims)
        coreg_tmp.W = np.random.randn(output_dim, R)
        coreg.append(coreg_tmp)
        k_tmp = []
        k_tmp.append(Matern52(input_dim=len(kernel_idxs["coords"]), active_dims=kernel_idxs["coords"],
                              lengthscales=lengthscales[q], ARD=False))
        k_tmp.append(RBF(input_dim=len(kernel_idxs["rest"]), active_dims=kernel_idxs["rest"],
                         ARD=True, lengthscales=lengthscales[q]))
        k = k_tmp[0]
        for i in range(1, len(k_tmp)):
            k += k_tmp[i]
        k_q.append(k)
    # combine all those kernels
    kern_lcm = coreg[0] * k_q[0]
    for q in range(1, Q):
        kern_lcm += coreg[q] * k_q[q]
    return kern_lcm
What is taking up so much memory? The few extra parameters from the additional kernels should not change things that much.
Thanks.
In the computation of the Kuu matrix, the coregionalisation kernel constructs an M x M matrix. So if you have Q Coregion kernels, TensorFlow actually needs to allocate Q x M x M memory. This is not orders of magnitude more, but it is linear in the number of kernels, which seems to roughly match up with how many fewer inducing points you can fit in memory on your machine.
For a more efficient implementation of the intrinsic coregionalisation model, have a look at the multi-output framework notebook in the GPflow documentation. Hope this helps!
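As a rough sanity check of that scaling (a back-of-the-envelope estimate only, assuming the M x M blocks are stored as float64 and ignoring everything else TensorFlow allocates):
3 x 5000^2 x 8 B ≈ 0.6 GB versus 1 x 8000^2 x 8 B ≈ 0.5 GB
so three coregionalised M x M blocks at ~5k inducing points occupy roughly the same memory as a single block at 8k, which is consistent with the drop observed.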

CUDA fft 1d different results from MATLAB fft

I want to use the GPU to speed up my Matlab program, but I have found a problem:
the FFT result from CUDA is different from Matlab's.
I have tried many times but can't solve it,
so I have come here for help.
The original data: name cj1, size 1*8.
In Matlab I use the code:
a1=fft(cj1)';
This gives the result shown in the attached image (the FFT result from Matlab).
And cuda code:
cuFloatComplex *idata_m;
idata_m = (cuFloatComplex*)malloc(M * sizeof(cuFloatComplex));
for (int i = 0; i < 8; i++)
{
    idata_m[i].x = initA[i];
    idata_m[i].y = initB[i];
}
cuComplex *dev_test;
cudaMalloc((void**)&dev_test, M * sizeof(cuFloatComplex));
cudaMemcpy(dev_test, idata_m, M * sizeof(cuFloatComplex), cudaMemcpyHostToDevice);
cufftHandle plantest;
cufftPlan1d(&plantest, 8, CUFFT_C2C, 1);
cufftExecC2C(plantest, dev_test, dev_test, CUFFT_FORWARD);//forward
cuComplex *test_out;
test_out = (cuFloatComplex*)malloc( M * sizeof(cuFloatComplex));
cudaMemcpy(test_out, dev_test, 8 * sizeof(cuFloatComplex), cudaMemcpyDeviceToHost);
The input data is the same as the original data in Matlab.
The result is shown in the attached image (the FFT result from CUDA).
The interesting thing is that these two results are very similar, but in the wrong order.
So what can I do to make the result the same as Matlab's?
The imaginary part of the input data used with the CUDA code is the negative of that used with Matlab. So you really are computing the FFT of the complex conjugated input, which inverts the order of the result. To obtain the same results with CUDA you should be using the same input.
Also of note, in Matlab, the ' operator computes the complex-conjugate transpose, so you probably want to compare your CUDA results with a1=transpose(fft(cj1)); instead.
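For illustration, if the conjugation really does come from how initB is filled, one minimal fix (a sketch; the sign flip may instead belong wherever initA/initB are produced) is to negate the imaginary part when packing the complex input:
for (int i = 0; i < 8; i++)
{
    idata_m[i].x = initA[i];
    idata_m[i].y = -initB[i];  /* flip the sign so the imaginary part matches the Matlab input */
}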

How do I efficiently replace a function with a lookup?

I am trying to increase the speed of code that operates on large datasets. I need to perform the function out = sinc(x), where x is a 2048-by-37499 matrix of doubles. This is very expensive and is the bottleneck of my program (even when computed on the GPU).
I am looking for any solution which improves the speed of this operation.
I expect that this might be achieved by pre-computing a vector LookUp = sinc(y) where y is the vector y = min(min(x)):dy:max(max(x)), i.e. a vector spanning the whole range of expected x elements.
How can I efficiently generate an approximation of sinc(x) from this LookUp vector?
I need to avoid generating a three dimensional array, since this would consume more memory than I have available.
Here is a test for the interp1 solution:
a = -15;
b = 15;
rands = (b-a).*rand(1024,37499) + a;
sincx = -15:0.000005:15;
sincy = sinc(sincx);
tic
res1 = interp1(sincx,sincy,rands);
toc
tic
res2 = sinc(rands);
toc
sincx = gpuArray(sincx);
sincy = gpuArray(sincy);
r = gpuArray(rands);
tic
r = interp1(sincx,sincy,r);
toc
r = gpuArray(rands);
tic
r = sinc(r);
toc
Elapsed time is 0.426091 seconds.
Elapsed time is 0.472551 seconds.
Elapsed time is 0.004311 seconds.
Elapsed time is 0.130904 seconds.
These correspond to CPU interp1, CPU sinc, GPU interp1, and GPU sinc, respectively.
I'm not sure I completely understood your problem.
But once you have LookUp = sinc(y) you can use the Matlab function interp1:
out = interp1(y,LookUp,x)
where x can be a matrix of any size.
I came to the conclusion that your code cannot be improved significantly. The fastest possible lookup table is based on simple indexing. For a performance test, let's just run the test on random data:
%test data:
x=rand(2048,37499);
%relevant code:
out = sinc(x);
Now the lookup based on integer indices:
a=min(x(:));
b=max(x(:));
n=1000;
x2=round((x-a)/(b-a)*(n-1)+1);
lookup=sinc(linspace(a,b,n));
out2=lookup(x2);
Regardless of the size of the lookup table or the input data, the last lines in both code blocks take roughly the same time. Since sinc evaluates roughly as fast as an indexing operation, I can only assume that it is already implemented using a lookup table.
I found a faster way (if you have an NVIDIA GPU on your PC); however, this will return NaN for x = 0. If, for any reason, you can deal with having NaN, or you know the input will never be zero, then:
if you define r = gpuArray(rands); and actually evaluate the sinc function yourself on the GPU as:
tic
r=rdivide(sin(pi*r),pi*r);
toc
This generally gives me about 3.2x the speed of the interp1 version on the GPU, and it's more accurate (tested using your code above, iterating 100 times with different random data; both methods have a similar std).
This works because sin and element-wise division rdivide are also implemented on the GPU (while for some reason sinc isn't). See: http://uk.mathworks.com/help/distcomp/run-built-in-functions-on-a-gpu.html
m = min(x(:));
y = m:dy:max(x(:));
LookUp = sinc(y);
now sinc(n) should equal
LookUp((n-m)/dy + 1)
assuming n - m is an integer multiple of dy and n lies within the range m to max(x(:)). To get the LookUp index (i.e. an integer between 1 and numel(y)), we first shift n by the minimum m, then divide by dy, and finally add 1 because MATLAB indexes from 1 instead of 0.
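As a quick worked example using the test values above (m = -15, dy = 0.000005, so the table has 6,000,001 entries): looking up n = 0 gives
(0 - (-15))/0.000005 + 1 = 3,000,001
which is the midpoint of the table, where sinc evaluates to 1.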
I don't know what that will do for your efficiency though, but give it a try.
Also, you can put this into an anonymous function to help readability:
sinc_lookup = @(n)(LookUp((n-m)/dy + 1))
and now you can just call
sinc_lookup(n)

Operations on a 2D array in a CUDA kernel for Matlab

Suppose I have the following serial C code:
int add(int **a, int **b, int n)
{
    for(int i=0; i<n; i++)
    {
        for(int j=0; j<n; j++)
        {
            a[i][j] += b[i][j];
        }
    }
    return 0;
}
I think the best way to parallelise it is to realise it is a 2D problem and use 2D thread blocks, as per "CUDA kernel - nested for loop".
With that in mind, I started writing my CUDA kernel like this:
__global__ void calc(int **A, int **B, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i>=n || j>=n)
        return;
    A[i][j] += B[i][j];
}
nvcc tells me that:
./addm.cu(13): Warning: Cannot tell what pointer points to, assuming global memory space
./addm.cu(13): Warning: Cannot tell what pointer points to, assuming global memory space
./addm.cu(13): Warning: Cannot tell what pointer points to, assuming global memory space
1) Am I correct in my philosophy?
2) I think I understand blocks, threads, etc., but I don't understand what
int i= blockIdx.x * blockDim.x + threadIdx.x;
int j= blockIdx.y * blockDim.y + threadIdx.y;
does
3) Is this the most efficient/fastest way of performing operations on a 2D array in general? I.e. not just matrix addition; it could be any "element by element" operation.
4) Will I be able to call it from Matlab? Normally it freaks out when the prototype is of the form type** var.
Thanks guys
The compiler warnings you are getting come from the fact that on older GPUs, the memory structure is not "flat". The compiler can't know which memory space the addresses held by the pointer arrays your kernel works with belong to, so it is warning you that it is assuming the operation is being performed in global memory. If you compile the code for a Fermi card (sm_20 or sm_21 architecture), you won't see that warning because the memory model on those cards is "flat", and pointers are correctly interpreted by the hardware at runtime. The compiler doesn't need to handle it at compile time.
To answer each of your questions:
Yes. And no. The overall idea is about 90% right, but there are several implementation issues which will become apparent from the answers which follow.
CUDA C has built-in variables that allow each thread to determine its "coordinates" within the execution grid it is running in, and the dimensions of each block and of the grid itself. threadIdx.{xyz} provides the thread coordinates within a block, and blockIdx.{xyz} the block coordinates within the grid. blockDim.{xyz} and gridDim.{xyz} provide the dimensions of the block and the grid, respectively (note that not all hardware supports 3D grids). CUDA uses column-major order for numbering threads within each block and blocks within each grid. The calculation you are querying computes the equivalent {i,j} coordinate in a 2D grid using the thread and block coordinates and the block size. This is discussed in some detail in the first few pages of the "Programming model" chapter of the CUDA programming guide.
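As a concrete (hypothetical) example, with blockDim = (16, 16), the thread with threadIdx = (3, 4) in the block with blockIdx = (1, 2) computes
i = blockIdx.x * blockDim.x + threadIdx.x = 1*16 + 3 = 19
j = blockIdx.y * blockDim.y + threadIdx.y = 2*16 + 4 = 36
so, subject to the bounds check, it handles element (19, 36) of the 2D problem.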
No, and I say that for two reasons.
Firstly, using arrays of pointers for memory access is not a good idea in CUDA. Two levels of pointer indirection hugely increase the latency penalty of getting to your data. The key difference of a typical GPU architecture compared to a modern CPU architecture is the memory system. GPUs have stunningly high peak memory bandwidth but very high access latency, whereas CPUs are designed for minimal latency. So having to read and dereference two pointers to fetch a value from memory is a very big performance penalty. Store your 2D array or matrix in linear memory instead. This is what BLAS, LAPACK and Matlab do anyway.
Secondly, every thread in your code is performing four integer arithmetic operations of setup overhead (the index calculations) for every one "productive" integer operation (the addition). There are strategies to reduce that, usually involving having each thread process more than one array element.
If I was to write a kernel for that operation I would do it something like the code at the bottom of my answer. This uses linear memory and a 1D grid. A suitable number of threads to properly occupy the GPU process the whole input array, with each thread processing many inputs.
No. As I mentioned earlier in my answer, Matlab uses linear memory to store matrices, not an array of pointers. This doesn't match the layout your kernel code is expecting.
Sample code:
__global__ void calc(int *A, int *B, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int s = blockDim.x * gridDim.x;
    for( ; i<N; i+=s) {
        A[i] += B[i];
    }
}
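For completeness, a possible host-side launch for that grid-stride kernel (the block and grid sizes here are illustrative, not tuned, and d_A/d_B are assumed to be device pointers to the flattened n*n arrays):
int N = n * n;            // total number of elements in the flattened arrays
int threads = 256;        // threads per block
int blocks  = 128;        // enough blocks to keep the GPU occupied; each thread
                          // strides through the array and handles many elements
calc<<<blocks, threads>>>(d_A, d_B, N);
cudaDeviceSynchronize();  // wait for the kernel to finish (and surface any errors)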
I am assuming you are working with an n-by-n, row-major order array. Try the following:
__global__ void calc(int *A, int *B, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i<n && j<n) {
        A[i*n+j] += B[i*n+j];
    }
}