I know that there are a ton of factors to this, but assuming fairly average hardware (around an Intel i5 or i7, or a Ryzen 5 or 7) and a very fast implementation of both approaches, under what circumstances would a GPU be a faster option than the CPU, or vice versa? (To help visualize what I mean by "very fast implementation", let's say the CPU implementation is a multithreaded C program that keeps nodes in registers, and the GPU implementation uses a GLES2 compute shader.)
The GPU would have to be dispatched once for each layer, but it can process every node in a layer in parallel. The CPU can run the whole thing seamlessly, but its parallelism is limited by the number of CPU cores. At what number of nodes or layers could I reasonably assume that a CPU- or GPU-based implementation would be the better choice?
I think the most important factor in choosing a GPU or CPU for a neural network is compute time. GPUs generally have far more floating-point units than CPUs and can perform many operations concurrently, so for large, parallel workloads they are usually faster.
However, training on a GPU requires transferring the data to GPU memory, and it requires dedicated GPU hardware in the first place. This means that if your dataset or network is small, or your machine has no suitable GPU, using the CPU can be faster.
As for your last question, it is hard to give a specific number of nodes or layers: it depends on factors such as the relative performance of your GPU and CPU, the size and complexity of the dataset, and the architecture of the network.
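As a rough illustration of that trade-off, here is a timing sketch. PyTorch is used purely as a convenient stand-in for the hand-written C and compute-shader implementations in the question, and the batch size, layer width, and depth are arbitrary assumptions:

```python
# Rough sketch: compare CPU vs GPU time for one forward pass of a small MLP.
# PyTorch is only a stand-in here; the sizes below are arbitrary assumptions.
import time
import torch

def time_forward(device, batch=1024, width=2048, layers=8, reps=10):
    x = torch.randn(batch, width, device=device)
    weights = [torch.randn(width, width, device=device) for _ in range(layers)]
    if device == "cuda":
        torch.cuda.synchronize()          # make sure setup kernels have finished
    start = time.perf_counter()
    for _ in range(reps):
        h = x
        for w in weights:
            h = torch.relu(h @ w)         # one dense layer per iteration
    if device == "cuda":
        torch.cuda.synchronize()          # wait for queued GPU kernels to finish
    return (time.perf_counter() - start) / reps

print("CPU :", time_forward("cpu"))
if torch.cuda.is_available():
    print("GPU :", time_forward("cuda"))
```

At sizes like these you would typically expect the GPU to win; shrink the layers far enough and transfer plus dispatch overhead tends to dominate, at which point the CPU comes out ahead.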
Given the derivative of the cost function with respect to the weights or biases of the neurons of a neural network, how do I adjust these neurons to minimize the cost function?
Do I just subtract the derivative multiplied by a constant from the individual weight and bias? If constants are involved, how do I know what a reasonable value is to pick?
You're right about how to perform the update: subtracting the derivative times a constant is exactly what gradient descent does, in its various forms. Learning rates (the constant you are referring to) are generally very small, on the order of 1e-6 to 1e-8. There are numerous articles on the web covering both of these concepts.
In the interest of a direct answer though, it is good to start out with a small learning rate (on the order suggested above) and check that the loss is decreasing (via plotting). If the loss decreases, you can raise the learning rate a bit; I recommend raising it to about 3x its current value. For example, if it is 1e-6, raise it to 3e-6 and check again that your loss is still decreasing. Keep doing this until the loss is no longer decreasing nicely. The learning-rate plot from Stanford's CS231n lecture series gives some nice intuition on how the learning rate affects the loss curve.
You want to raise the learning rate so that the model doesn't take as long to train, but you don't want to raise it too much, because then it is possible to overshoot the local minimum you're descending towards and for the loss to increase (the yellow, divergent curve in that plot). This is an oversimplification, because the loss landscape of a neural network is very non-convex, but it is the general intuition.
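For concreteness, here is a minimal sketch of that update rule in plain NumPy; grad_fn is a made-up stand-in for whatever gradients backpropagation gives you:

```python
# Minimal gradient-descent sketch: w <- w - lr * dL/dw (biases are updated the same way).
import numpy as np

def grad_fn(w):
    # Hypothetical gradient of a simple quadratic loss L(w) = ||w||^2 / 2,
    # standing in for your backpropagated gradients.
    return w

w = np.random.randn(10)          # weights (or biases)
lr = 1e-6                        # start small, as suggested above
for step in range(1000):
    w -= lr * grad_fn(w)         # subtract the derivative times the learning rate
    # If the loss is still decreasing after a while, try lr *= 3 and re-check.
```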
As I understand it, all CNNs are quite similar. They all have convolutional layers followed by pooling and ReLU layers. Some, such as FlowNet and SegNet, have specialised layers. My question is how we should decide how many layers to use and how to set the kernel size for each layer in the network. I have searched for an answer but couldn't find a concrete one. Is the network designed by trial and error, or are there specific rules that I am not aware of? If you could clarify this, I would be very grateful.
Short answer: if there are design rules, we haven't discovered them yet.
Note that there are comparable questions in computing. For instance, there are only a handful of basic electronic logic units, the gates from which all computing hardware is built. All computing devices use the same Boolean logic; some have specialised additions, such as photoelectric input or mechanical output.
How do you decide how to design your computing device?
The design depends on the purpose of the CNN. Input characteristics, accuracy, training speed, scoring speed, adaptation, computing resources, ... all of these affect the design. There is no generalized solution, even for a given problem (yet).
For instance, consider the ImageNet classification problem. Note the structural differences between the winners and contenders so far: AlexNet, GoogleNet, ResNet, VGG, etc. If you change inputs (say, to MNIST), then these are overkill. If you change the paradigm, they may be useless. GoogleNet may be a prince of image processing, but it's horrid for translating spoken French to written English. If you want to track a hockey puck in real time on your video screen, forget these implementations entirely.
So far, we're doing this the empirical way: a lot of people try a lot of different things to see what works. We get feelings for what will improve accuracy, or training time, or whatever factor we want to tune. We find what works well with total CPU time, or what we can do in parallel. We change algorithms to take advantage of vector math in lengths that are powers of 2. We change problems slightly and see how the learning adapts elsewhere. We change domains (say, image processing to written text), and start all over -- but with a vague feeling of what might tune a particular bottleneck, once we get down to considering certain types of layers.
Remember, CNNs really haven't been popular for that long, barely 6 years. For the most part, we're still trying to learn what the important questions might be. Welcome to the research team.
We know, for example, that Moore's law states that the number of transistors on a chip doubles every 1.8-2 years (and hence computing power has been increasing at roughly that rate). This got me thinking about compiler optimizations. Are compilers getting better at making code run faster as time goes on? If they are, is there any theory as to how this performance increase scales? If I took a piece of code written in 1970 and compiled it with 1970 compiler optimizations, would that same code run faster on the same machine when compiled with today's optimizations? Can I expect a piece of code written today to run faster in, say, 100 years solely as a result of better optimizers/compilers (independent of improvements in hardware and algorithms)?
This is a complex, multi-faceted question, so let me try to hit on a few key points:
Compiler optimization theory is highly complex and is often (far) more difficult than the actual design of the language in the first place. The domain incorporates many other complex mathematical subdomains (e.g., directed graph theory). Some problems in compiler optimization are known to be NP-complete or even undecidable, which puts them among the hardest categories of problems to solve.
While there are hundreds of known techniques (see here, for example), the implementation of those techniques is highly dependent on both the language and the targeted CPU (its instruction set, pipelines, and so on). Because languages and CPUs are constantly evolving, the optimal implementation of even a well-known technique can change over time. New CPU features and architectures can also open up previously unavailable optimizations. Some of the most cutting-edge techniques may be proprietary and thus not available to the general public for reuse. For example, several commercial JVMs offer specialty optimizations in their JIT compilation of Java bytecode that are, on a statistical basis, quantitatively superior to the (default) open-source JVMs.
There is an unmistakable historical trend toward better and better compiler optimization. This is why, for example, it is quite rare nowadays that any manual assembly coding is done regularly. But due to the factors already discussed (and others), the evolution of the efficiency and benefits provided by automatic compiler optimizations has been quite non-linear historically. This is in contrast to the fairly consistent curvature of Moore's law and other laws relating to computer hardware improvements. Compiler optimization's track record is probably better visualized as a line with many "fits and starts". Because the factors driving the non-linearity of compiler optimization theory will not likely change in the immediate future, it's likely this trajectory will remain non-linear for at least the near future.
It would be quite difficult to state even an average rate of improvement when languages themselves are coming and going, not to mention CPU models with different hardware features coming and going. CPUs have evolved different instruction sets and instruction set extensions over time, so it's quite difficult to even do an "apples to apples" comparison. This is true regardless of which metric you use: program length in terms of discrete instructions, program execution time (highly dependent on CPU clock speed and pipelining capabilities), or others.
Compiler optimization theory is probably now in the regime of diminishing returns. That is to say that most of the low-hanging fruit has been addressed, and much of the remaining optimization is either quite complex or provides relatively small marginal improvements. Perhaps the greatest coming factor that will disruptively impact compiler optimization theory is the advent of weak (or strong) AI. Because many future gains in compiler optimization will require highly complex predictive capabilities, the best optimizers will effectively need some level of innate intelligence (for example, to predict the most common user inputs, to predict the most common execution paths, and to reduce NP-hard optimization problems into solvable sub-problems). It could very well be possible in the future that every piece of software you use is custom compiled in a way tailored to your specific use cases, interests, and requirements. Imagine that your OS (operating system) is compiled or re-compiled specifically for you based on your use cases as a scientist vs. a video gamer vs. a corporate executive, old vs. young, or any other combination of demographics that might affect code execution.
Can we improve performance by calculating some parts of the CPU's parfor or spmd blocks using gpuArray or GPU functions? Is this a rational way to improve performance, or are there limitations to this approach? I read somewhere that we can use this approach when we have several GPUs. Is this the only way to use GPU computing besides CPU parallel loops?
It is possible that using gpuArray within a parfor loop or spmd block can give you a performance benefit, but really it depends on several factors:
How many GPUs you have on your system
What type of GPUs you have (some are better than others at dealing with being "oversubscribed" - i.e. where there are multiple processes using the same GPU)
How many workers you run
How much GPU memory you need for your algorithm
How well suited the problem is to the GPU in the first place.
So, if you had two high-powered GPUs in your machine and ran two workers in a parallel pool on a problem that could keep a single GPU fully occupied - you'd expect to see good speedup. You might still get decent speedup if you ran 4 workers.
One thing that I would recommend is: if possible, try to avoid transferring gpuArray data from client to workers, as this is slower than usual data transfers (the gpuArray is first gathered to the CPU and then reconstituted on the worker).
I have questions about the performance of a real application running on a cluster versus the cluster's peak performance.
Let's say one HPC cluster reports that it has a peak performance of 1 petaflops. How is this calculated?
To me, it seems there are two metrics: one is the performance calculated from the hardware specifications, and the other comes from running HPL. Is my understanding correct?
When I read about a real application running on the system at full scale, the developers mention that it achieves about 10% of peak performance. How is this measured, and why can't it reach peak performance?
Thanks
Peak performance is what the system is theoretically able to deliver. It is the product of the total number of CPU cores, the core clock frequency, and the number of FLOPs one core makes per clock tick. That performance can never be reached in practice because no real application consists of 100% fully vectorised tight loops that only operate on data held in the L1 data cache. In many cases data doesn't even fit in the last-level cache and the memory interface is usually not fast enough to deliver data at the same rate at which the CPU is able to process it. One ubiquitous example from HPC is the multiplication of a sparse matrix with a vector. It is so memory intensive (i.e. many loads and stores per arithmetic operation) that on many platforms it only achieves a fraction of the peak performance.
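As a back-of-the-envelope sketch of that formula (the node count, clock speed, and FLOPs-per-cycle figures below are made-up assumptions, not any real machine):

```python
# Hypothetical cluster: peak = cores * clock frequency * FLOPs per core per cycle.
nodes = 1000
cores_per_node = 24
clock_hz = 2.5e9            # 2.5 GHz
flops_per_cycle = 16        # assumed, e.g. two 256-bit FMA units on doubles

peak_flops = nodes * cores_per_node * clock_hz * flops_per_cycle
print(f"Theoretical peak: {peak_flops / 1e15:.2f} PFLOP/s")  # ~0.96 PFLOP/s
```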
Things get even worse when multiple nodes are networked together on a massive scale, as data transfers can introduce huge additional delays. Performance in those cases is determined mainly by the ratio of local data processing to data transfer. HPL is particularly good in that respect: it does a lot of vectorised local processing and does not move much data across the CPUs/nodes. That's not the case with many real-world parallel programs, which is also why many are questioning the applicability of HPL to assessing cluster performance nowadays. Alternative benchmarks are already emerging, for example the HPCG benchmark (from the people who brought you HPL).
The theoretical (peak) value is based on the capability of each individual core in the cluster, which depends on clock frequency, number of floating point units, parallel instruction issuing capacity, vector register sizes, etc. which are design characteristics of the core. The flops/s count for each core in the cluster is then aggregated to get the cluster flops/s count.
For a car the equivalent theoretical performance would be the maximum speed it can reach given the specification of its engine.
For a program to reach the theoretical count, it has to perform specific operations in a specific order so that the instruction-level parallelism is maximum and all floating-point units are working constantly without delay due to synchronization or memory access, etc. (See this SO question for more insights)
For a car, it is equivalent to measuring top speed on a straight line with no wind.
But of course, chances that such a program computes something of interest are small. So benchmarks like HPL use actual problems in linear algebra, with a highly optimized and tuned implementation, but which is still imperfect due to IO operations and the fact that the order of operations is not optimal.
For a car, it could be compared to measuring the top average speed on a race track with straight lines, curves, etc.
If the program requires a lot of network or disk communication, which are operations that cost many clock cycles, then the CPU often has to sit idle waiting for data before it can perform arithmetic operations, effectively wasting a lot of computing power. The actual performance is then estimated by dividing the number of floating-point operations (additions and multiplications) the program performs by the time it takes to perform them.
For a car, this would correspond to measuring the top average speed in town with red lights, etc. by calculating the length of the trip divided by the time needed to accomplish it.
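For reference, here is a small, entirely hypothetical sketch of how that "achieved vs. peak" percentage is computed from the definition above; all numbers are made up:

```python
# Achieved performance = floating-point operations performed / wall-clock time,
# then compared against the theoretical peak. All figures are hypothetical.
flop_count = 3.2e18         # operations the application performed (counted or estimated)
runtime_s = 40_000          # wall-clock time in seconds
peak_flops = 1.0e15         # 1 PFLOP/s theoretical peak from the spec sheet

achieved = flop_count / runtime_s
print(f"Achieved: {achieved / 1e12:.1f} TFLOP/s "
      f"({100 * achieved / peak_flops:.1f}% of peak)")   # 80.0 TFLOP/s, 8.0% of peak
```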