In neural networks, why conventionally set number of neurons to 2^n? - neural-network

For example, when pile up Dense layers, conventionally, we always set the number of neurons as 256 neurons, 128 neurons, 64 neurons, ... and so on.
My question is:
What's the reason for conventionally use 2^n neurons? Will this implementation makes the code runs faster? Saves memory? Or are there any other reasons?

It's historical. Early neural network implementations for GPU Computing (written in CUDA, OpenCL etc) had to concern themselves with efficient memory management to do data parallelism.
Generally speaking, you have to align N computations on physical processors. The number of physical processors is usually a power of 2. Therefore, if the number of computations is not a power of 2, the computations can't be mapped 1:1 and have to be moved around, requiring additional memory management (further reading here). This was only relevant for parallel batch processing, i.e. having the batch size as a power of 2 gave you slightly better performance. Interestingly, having other hyperparameters such as the number of hidden units as a power of 2 never had a measurable benefit - I assume as neural networks got more popular, people simply started adapting this practice without knowing why and spreading it to other hyperparameters.
Nowadays, some low-level implementations might still benefit from this convention but if you're using CUDA with Tensorflow or Pytorch in 2020 with a modern GPU architecture, you're very unlikely to encounter any difference between a batch size of 128 and 129 as these systems are highly optimized for very efficient data parallelism.

Related

Why Ethernet is using CRC-32 and not CRC-32C?

It has being shown that CRC32C provides better results (an improve Hamming distance and a faster implementation) than CRC32. Why Ethernet is stil using the old CRC 32 and not CRC32C?
"Faster implementation" is not true for the hardware that is normally used to implement the data link layer.
You may be referring to the fact that one particular processor architecture, x86-64, has a CRC instruction that uses the CRC-32C polynomial. However the ARM architecture (aarch64) CRC instruction uses the CRC-32 polynomial. Go figure.
One could argue that yet another polynomial should be used, since Koopman has characterized the performance of many polynomials with better performance than either of the ones you mention. But none of that really matters since ...
All of the legacy hardware has to support the original CRC, and there is little motivation to provide an alternate CRC would use would need to somehow be negotiated between the transmitter and receiver. There is not a noticeable performance advantage for typical noise sources, which are rare single bit errors.

Application performance vs Peak performance

I have questions about real application performance running on a cluster vs cluster peak performance.
Let's say one HPC cluster report that it has peak performance of 1 Petaflops. How is this calculated?
To me, it seems that there are two measuring matrixes. One is the performance calculated based on the hardware. The other one is from running HPL? Is my understanding correct?
When I am reading one real application running on the system at full scale, the developer mentions that it could achieve 10% of the peak performance. How is this measured and why it can't achieve peak performance?
Thanks
Peak performance is what the system is theoretically able to deliver. It is the product of the total number of CPU cores, the core clock frequency, and the number of FLOPs one core makes per clock tick. That performance can never be reached in practice because no real application consists of 100% fully vectorised tight loops that only operate on data held in the L1 data cache. In many cases data doesn't even fit in the last-level cache and the memory interface is usually not fast enough to deliver data at the same rate at which the CPU is able to process it. One ubiquitous example from HPC is the multiplication of a sparse matrix with a vector. It is so memory intensive (i.e. many loads and stores per arithmetic operation) that on many platforms it only achieves a fraction of the peak performance.
Things get even worse when multiple nodes are networked together on a massive scale as data transfers could introduce huge additional delays. Performance in those cases is determined mainly by the ratio of local data processing and data transfer. HPL is a particularly good in that aspect - it does a lot of vectorised local processing and does not move much data across the CPUs/nodes. That's not the case with many real-world parallel programs and also the reason why many are questioning the applicability of HPL in assessing cluster performance nowadays. Alternative benchmarks are already emerging, for example the HPCG benchmark (from the people who brought you HPL).
The theoretical (peak) value is based on the capability of each individual core in the cluster, which depends on clock frequency, number of floating point units, parallel instruction issuing capacity, vector register sizes, etc. which are design characteristics of the core. The flops/s count for each core in the cluster is then aggregated to get the cluster flops/s count.
For a car the equivalent theoretical performance would be the maximum speed it can reach given the specification of its engine.
For a program to reach the theoretical count, it has to perform specific operations in a specific order so that the instruction-level parallelism is maximum and all floating-point units are working constantly without delay due to synchronization or memory access, etc. (See this SO question for more insights)
For a car, it is equivalent to measuring top speed on a straight line with no wind.
But of course, chances that such a program computes something of interest are small. So benchmarks like HPL use actual problems in linear algebra, with a highly optimized and tuned implementation, but which is still imperfect due to IO operations and the fact that the order of operations is not optimal.
For a car, it could be compared to measuring the top average speed on a race track with straight lines, curves, etc.
If the program requires a lot of network, or disk communications, which are operations that require a lot of clock cycle, then the CPU has often to stay idle waiting for data before it can perform arithmetic operations, effectively wasting away a lot of computing power. Then, the actual performance is estimated by dividing the number of floating points operations (addition and multiplications) the program is performing by the time it takes to perform them.
For a car, this would correspond to measuring the top average speed in town with red lights, etc. by calculating the length of the trip divided by the time needed to accomplish it.

Ways to accelerate reduce operation on Xeon CPU, GPU and Xeon Phi

I have an application where reduce operations (like sum, max) on a large matrix are bottleneck. I need to make this as fast as possible. Are there vector instructions in mkl to do that?
Is there a special hardware unit to deal with it on xeon cpu, gpu or mic?
How are reduce operations implemented in these hardware in general?
You can implement your own simple reductions using the KNC vpermd and vpermf32x4 instructions as well as the swizzle modifiers to do cross lane operations inside the vector units.
The C intrinsic function equivalents of these would be the mm512{mask}permute* and mm512{mask}swizzle* family.
However, I recommend that you first look at the array notation reduce operations, that already have high performance implementations on the MIC.
Look at the reduction operations available here and also check out this video by Taylor Kidd from Intel talking about array notation reductions on the Xeon Phi starting at 20mins 30s.
EDIT: I noticed you are also looking for CPU based solutions. The array notation reductions will work very well on the Xeon also.
Turns out none of the hardware have reduce operation circuit built-in.
I imagined a sixteen 17 bit adders attached to 128 bit vector register for reduce-sum operation. Maybe this is because no one has encountered a significant bottleneck with reduce operation. Well, the best solution i found is #pragma omp parallel for reduction in openmp. I am yet to test its performance though.
This operation is going to be bandwidth-limited and thus vectorization almost certainly doesn't matter. You want the hardware with the most memory bandwidth. An Intel Xeon Phi processor has more aggregate bandwidth (but not bandwidth-per-core) than a Xeon processor.

Why hidden layers of neural networks often contain 64, 128, 256 neurons?

When looking for classic neural nets architectures on the web, I find that very often, hidden layers of neural networks contain 32, 64, 128, 256... neurons.
Is there a reason for that?
Usually the number of hidden neurons is either picked through experimentation or as part of a model selection algorithm doing some sort of grid search. The numbers you give are all powers of two, and I have never seen any research to indicate that a power of two is more effective than another number as a hidden neuron count. I think it is just a matter of a programmer picking conveniently aligned memory block, which MIGHT give some small processing power boost if you hit the page boundary just right.
How often have you seen C/Java/C# code with something like this.
char buffer[1024];
Apart from memory alignment, there is one more hardware based reasoning for this choice. Matrix multiplication is at the heart of deep learning computations. Modern CPUs support SIMD operations via extensions like AVX, which process numbers in batch sizes, which are powers of 2.

Neural Nets: computing weight deltas is a bottleneck

After profiling my Neural Nets' code I've realized that the method, which computes the weight changes for each arc in the network (-rate*gradient + momentum*previous_delta - decay*rate*weight), already given the gradient, is the bottleneck (55% inclusive samples).
Is there any trick to compute these values in a efficient manner?
This is normal behaviour. I am assuming that you are using an iterative process to solve the weights at each evolution step (such as backpropagation?). If the number of neurons is large and the training (back-testing) algorithm is short, then it is normal that weight mutation such as this will consume a larger fraction of compute time during training of the neural network.
Did you get this result using a simple XOR problem or similar? If so, you will probably find that if you start to solve more complex problems (such as pattern detection in multidimensional arrays, image processing, etc.) that those functions will begin to consume an insignificant fraction of compute time.
If you are profiling, I would suggest you profile with a problem that is closer to the purpose for which the neural network is designed (I am guessing you didn't design it to solve XOR or play tic tac toe) and you will probably find that optimising code such as -rate*gradient + momentum*previous_delta - decay*rate*weight is more or less a waste of time, at least this is my experience.
If you do find that this code is compute-intensive in real-world applications then I would suggest trying to reduce the number of times this line of code is executed via structural changes. Neural network optimization is a rich field and I can't possibly give you useful advise from such a broad question, but I will say that if your program is unusually slow, you're unlikely to see significant improvements by tinkering at such low-level code. I will however suggest the following from my own experience:
Consider parallelisation. Many search algorithms such as those implemented in back-propagation techniques are amenable to parallel attempts to improve convergence. As weight-adjustments are identical in terms of computation demand for a given network, think static loops in Open MP.
Modify the convergence criterion (the critical convergence rate before you stop adjustments of weights) to perform less of these calculations
Consider an alternative to deterministic solutions such as back-propagations, which are slightly more prone to local optimisation anyway. Consider gaussian mutation (All things being equal gaussian mutation will 1) reduce time spent on mutation relative to backtesting 2) increase convergence time and 3) be less prone to getting caught in local minima of the error search space)
Please note that this is a non-technical answer to what I have interpreted as a non-technical question.