Just getting started with Theano and my GTX 960. When training a network (specifically recurrent regression), GPU usage only spikes for less than a second, maybe two or three times over the several minutes the training takes. Is there a way to force it to actually use the GPU the entire time?
It seems to take an obscenely long time for not much data.
What is actually happening in your code is that much of the time is spent getting data onto and off of the GPU, but any GPU monitor will show that the GPU is "not being used" during this time, even though what it is doing is loading data. You can't avoid doing this, other than being smart about how and when you load data onto the GPU. The spikes you are describing are completely normal - that is when the GPU actually does the computation. Nothing you can do!
So I have my .exe ready to deploy, and for distribution I need to know the minimum requirements for my program to run on a machine... and I really don't know how to do that.
Is there a way to find out? Some kind of benchmark? Or must I just set things to what I think will work?
Or should I just buy every existing component until I find the minimum? :')
Well, thanks for your answers.
Start by determining the earliest Windows version you can deploy on (Windows XP? Vista?).
If your program is CPU- or GPU-intensive and has a fixed-time loop (e.g. a game), then you'll have to benchmark.
Look at several old vs. new CPUs/GPUs and try to "guess" the minimum requirement based on specs posted online. For example, if your program can't run on an old CPU but runs blazingly fast on a new one, try to find the model that barely runs it, which will obviously be one somewhere in the middle.
If your program requires other special things, specify them (e.g. USB 3.0, supported controllers...).
Otherwise, if your program loads slowly on weaker hardware but doesn't have runtime issues, the minimum specs should correspond to a reasonable loading time (a minute seems to be the standard now, sadly).
Additionally, if your program is memory-hungry (either hard drive space or RAM), you must indicate this.
For hard drive space, simply state your program's size, along with the files included with it.
For RAM, use a profiler - it will tell you how much memory your program is using.
I've completely skipped over the fact that in some computers the bottleneck might be the CPU, and in others the GPU. You need to know which is the bottleneck to make your judgement.
Finding out is a rather simple process: remove expensive GPU operations (lower texture resolutions, turn off shaders). If the program still runs slowly, the bottleneck is the CPU.
edit: this is a simplification of the problem, and hardware is a little more complicated than this (slower multi-core CPUs vs. faster single-core CPUs vary in relative performance depending on how many cores a program uses and how; a program may require a GPU with less memory but more processing power, or the opposite; even heat dissipation can affect component efficiency, so your program might run fine for 20 minutes and then start to slow down if the CPU isn't cooled properly). But "minimum hardware requirements" aren't exactly precise, so this method is appropriate.
tl;dr of the spoiler:
In short, there are so many factors affecting performance that you can't measure it precisely, so a rough estimate is good enough.
I have a system with an Intel(R) Core(TM) i3-5020U CPU @ 2.20 GHz and 4 GB RAM. But in order to compare the performance of my MATLAB program in terms of execution time, I need to execute it on a machine with an Intel(R) Core(TM) i5-3570 CPU @ 3.40 GHz and 16 GB RAM. Is there a way to perform this kind of simulation?
TL;DR: No. Performance differences between Broadwell and Ivy Bridge depend on lots of complicated details. (See Agner Fog's microarch pdf for the low-level microarchitectural details, and also other stuff in the x86 tag wiki.)
It's likely that performance will scale with either clock speed or memory speed to within maybe 10%, even between different microarchitectures, but it might not.
Using your own system, you can probably figure out how your code scales with CPU frequency, by forcing it to stay at minimum frequency for a test run. If it's a lot less than perfect scaling, then memory speed is a big factor. (The slower your CPU, the fewer cycles are spent waiting for memory.)
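For instance, a minimal sketch of that test in MATLAB, assuming the workload can be wrapped in a function handle (myWorkload is a hypothetical stand-in for your simulation) and that you cap the clock between runs via the OS power settings:

    t_full = timeit(@() myWorkload());   % timed at the normal clock frequency
    % ... now pin the CPU to its minimum frequency (e.g. lower the "maximum
    % processor state" in the Windows power plan), then time it again:
    t_slow = timeit(@() myWorkload());
    scaling = t_slow / t_full            % compare this to f_full / f_min:
    % if scaling is much smaller than the frequency ratio, memory speed is a big factor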
You can't extrapolate IvB i5 3.4GHz performance from BDW 2.2GHz performance without knowing a lot more details about exactly what your code bottlenecks on. It's possible that it bottlenecks on the same simple thing on both CPUs, in which case you could extrapolate. e.g. if it turns out that it bottlenecks on FP multiply latency, then run-time on IvB would be 5/3rds the run time on Broadwell (times the clock frequency ratio), since BDW has 3 cycle FP multiply and add, but SnB/IvB/Haswell have 5 cycle multiply. (FMA is 5 cycles on BDW, if I recall correctly. IvB doesn't support FMA, so if Matlab takes advantage of that on BDW, it's not even running the same machine code).
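As a back-of-the-envelope illustration of that extrapolation (the 120 s measurement below is invented, and the estimate only holds if FP multiply latency really is the bottleneck on both machines):

    t_bdw     = 120;          % seconds, hypothetical measurement on the 2.2 GHz Broadwell
    lat_ratio = 5/3;          % IvB 5-cycle vs. BDW 3-cycle FP multiply latency
    clk_ratio = 2.2 / 3.4;    % each cycle takes less wall time at 3.4 GHz
    t_ivb_est = t_bdw * lat_ratio * clk_ratio   % roughly 129 s in this made-up example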
More likely, it's not that simple and cache / memory performance comes into it, too. Haswell/Broadwell don't have L1 cache-bank conflicts, but SnB/IvB do.
Depending on how you run the workload on the i5 CPU, it might or might not be able to turbo up to higher than its rated 3.4GHz, further confounding any attempt to guess at performance.
It's hard to measure practical efficiency across different computers. That's why theoretical efficiency (Big-O) is usually used instead; check the wiki pages for algorithm efficiency and Big-O notation.
If you have access to both codes (yours and the other person's), you can test them on the same computer using the performance-measurement methods proposed by MathWorks, which are mainly timing functions for wall-clock time and CPU time.
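Minimal examples of those timing functions (myAlgorithm and data are hypothetical placeholders for the code under test):

    tic;                                    % wall-clock (real) time
    y = myAlgorithm(data);
    tReal = toc;

    tStart = cputime;                       % CPU time consumed by MATLAB
    y = myAlgorithm(data);
    tCpu = cputime - tStart;

    tBest = timeit(@() myAlgorithm(data));  % runs it several times, returns a robust estimate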
Lastly, you can see here several challenges about benchmarking that might be interesting to consider.
Is there a way to calculate the electricity consumed to load and render a webpage (frontend)? I was thinking of a 'test' made with phantomjs for example:
load a web page
scroll to the bottom
And measure how much electricity was needed. I could perhaps extrapolate from CPU cycles. But phantomjs is headless, and rendering in a real browser is certainly different. Perhaps it's impossible to do real measurements... but with an index it may be possible to compare websites.
Do you have other suggestions?
It's pretty much impossible to measure this internally in modern processors (anything more recent than a 286). By internally, I mean by counting cycles. This is because different parts of the processor consume different amounts of energy per cycle depending upon the instruction.
That said, you can make your measurements. Stick a power meter between the wall outlet and the computer. Here's a procedure:
Measure the baseline energy usage, i.e. nothing running except the OS and the browser, with the browser completely static (i.e. not doing anything). You need to make sure that everything is in steady state (SS), meaning start your measurements only after several minutes of idle.
Measure the usage while performing the operation you want. Again, you want to avoid any start-up and shutdown work, so make sure you start measuring at least 15 seconds after you start the operation. Stopping isn't an issue, since the browser will execute any termination code after you finish your measurement. (A small worked example of the arithmetic follows.)
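For instance, with entirely made-up meter readings (substitute your own), in MATLAB:

    P_idle   = 42.0;   % watts: OS + browser open, steady state
    P_active = 47.5;   % watts: average reading while the page loads and scrolls
    t_active = 30;     % seconds over which P_active was averaged
    E_extra  = (P_active - P_idle) * t_active   % joules attributable to the page
    E_wh     = E_extra / 3600                   % the same figure in watt-hours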
Sounds simple, right? Unfortunately, because of the nature of your measurements, there are some gotchas.
Do you recall your physics (or EE) classes that talked about signal-to-noise ratios? Well, scrolling down uses very little energy, so the signal (scrolling) is well within the noise (normal background processes). This means you have to take a LOT of samples to get anything useful.
Your browser startup energy usage, or anything else that uses a decent amount of processing, is much easier to measure (better signal to noise ratio).
Also, make sure you understand the underlying electronics. For example, real power equals voltage times current only when the two are in phase; otherwise you need to account for the power factor. I don't think this will be an issue, since I'm pretty sure they are close to in phase for computers, and any decent power meter understands the difference anyway.
I'm guessing you intend to do this for mobile devices. Your measurements will only carry over roughly from one processor to another, due to architectural differences from generation to generation and from manufacturer to manufacturer.
Good luck.
I am working on image processing. My computer has an Intel(R) Core(TM) i7-3770 CPU @ 3.40 GHz and 4 GB of RAM. I want to parallelize an image-processing algorithm using the SPMD command of the Parallel Computing Toolbox (PCT). To do this, I divided the image vertically into 8 parts, sent them to different labs (workers), and used SPMD to run the algorithm on each part in parallel.
I get the same result as the sequential code, but it takes more time than the sequential version. I have tried this with images ranging from very large to very small, but did not see a significant improvement.
How can I get a significant speed-up using the SPMD command?
Since you did not provide any code, I'll have to stick to a general answer. In all parallel computing there are several design considerations; the two most important are: first, is your code actually able to run in parallel, and second, how much communication overhead do you create?
Calling workers means sending information back and forth, so there is an optimum amount of parallelism. Make sure you give your workers enough work that the time spent communicating with them is less than the time saved by computing in parallel.
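A minimal sketch of that idea with spmd, assuming a grayscale image and using a Gaussian blur as a hypothetical stand-in for your algorithm; each worker gets one whole vertical strip, so the per-worker chunk of work is large compared with the cost of moving it:

    img = imread('cameraman.tif');            % any grayscale test image
    gcp();                                    % start (or reuse) a parallel pool
    spmd
        % Distribute the image by columns: each worker holds its own strip locally,
        % so data is sent to the workers once instead of bouncing back and forth.
        imgDist  = codistributed(img, codistributor1d(2));
        strip    = getLocalPart(imgDist);
        stripOut = imgaussfilt(strip, 2);     % <-- your algorithm goes here
    end
    result = cat(2, stripOut{:});             % stitch the strips back together

Note that for neighbourhood operations like this blur, the strip boundaries would need a few pixels of overlap to be exact; the blockproc suggestion below handles that via its BorderSize option.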
Last but not least: if you provide a working code example the community is able to help you much better!
If you want to apply the same operation to several blocks within an image, then rather than worry about constructs such as spmd, you can just apply the command blockproc and set the UseParallel option to true. It will parallelize everything for you, without you needing to do anything.
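For example, a minimal blockproc sketch under the same assumptions (a grayscale image, with a Gaussian blur standing in for the real per-block operation); UseParallel needs the Parallel Computing Toolbox and an available parallel pool:

    img = imread('cameraman.tif');
    fun = @(blk) imgaussfilt(blk.data, 2);                     % blk.data is one tile
    out = blockproc(img, [64 64], fun, 'UseParallel', true);   % splits, dispatches, reassembles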
If that doesn't work for you and you really have a requirement to implement your algorithm directly using spmd, you'll need to post some example code to indicate what you've tried, and where it's going wrong.
I often have to run calculation-intensive simulations using Matlab. These simulations often take a long time, and I expect my computer to use all its resources so that they complete in as little time as possible.
However, when I open the Activity Monitor on my computer, processor usage is never above 55% and there is often about 1GB of unused RAM.
My question is: why is the processor not used to its full potential, and is there a safe and easy way to change this? Indeed, it would be great if I could get my simulations to be completed in half the time they currently take!
It's probably because you have a processor with multiple cores and the code you are executing isn't written to run in multiple threads/processes. Unless you specifically write your code to take advantage of multiple cores, it will only be able to use a single core at a time.
A relatively easy way to enable parallel computing is to use the Parallel Computing Toolbox.
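For example, a minimal sketch assuming your simulation is a loop of independent runs (runSimulation is a hypothetical stand-in for your own function):

    parpool;                              % open a pool of workers, typically one per core
    results = zeros(1, 100);
    parfor k = 1:100
        results(k) = runSimulation(k);    % independent iterations run on different cores
    end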
Additionally you might consider reading this: http://www.mathworks.com/company/newsletters/articles/parallel-matlab-multiple-processors-and-multiple-cores.html