I want to load a large set of MAT-files in a loop. I'm testing different ways to make the files load faster, and I have a subset of 10,000 files I'm working with, each containing about 50 variables of different sizes. I noticed an interesting detail:
1. If I load 10,000 files using load(filename) in a loop, one after another, it takes about 5 minutes.
2. If I load the same set of files a few more times (basically repeat the test), the time doesn't change.
3. If I load only one variable from each file using load(filename, 'varname'), it takes about the same amount of time.
4. If I repeat step 3, the load completes in about 15 seconds. Same files, same variable being loaded.
5. If I now run step 1 and then repeat step 3, I'm back to the load taking about 5 minutes; but a second repetition of step 3 takes a very short time again.
I'm puzzled. Is Matlab somehow keeping the data in memory once it has loaded it from a file? This behavior, however, survives Matlab restarts and clear commands, so could it actually be Windows 7 keeping a memory cache of some of the data?
Needless to say, I would like to determine what's causing the unexpected improvement and, if possible, reproduce it to make the first load as fast as the subsequent ones.
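For reference, a minimal sketch of the kind of timing loop described above (dataDir and 'varname' are placeholders, not taken from the original test):

dataDir = 'C:\data';                       % hypothetical folder with the MAT-files
files = dir(fullfile(dataDir, '*.mat'));
tic
for k = 1:numel(files)
    S = load(fullfile(dataDir, files(k).name), 'varname');  % one variable per file
end
toc   % slow on a cold run, fast once the OS file cache is warm (per the observations above)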
After a long time searching for an answer and not finding one, my last resort is asking a new question. I create multiple (N = 1000) v6 MAT-files, each about 100 MB in size and containing a single matrix. In a separate part of my code, I need to load each file. The problem I'm running into is that loading the files suddenly becomes very time-consuming around file number 600, and I'm not sure why it's happening. Thanks in advance for any suggestions.
I'm using Matlab R2014b on a Mac with 16 GB of RAM.
Sample code
c = nan(1,1000);                        % preallocate the timing vector
for h = 1:1000
    tic
    filename = [basefilename, '_', num2str(h), '.mat'];
    transition = load(filename, 'P');   % semicolons added: echoing output to the command window would skew the timings
    c(h) = toc;
end
Here is an image of the recorded loading times using the exact code above. [Image not reproduced; the load times jump sharply around file number 600.]
I am getting some readings off an accelerometer connected to an Arduino, which is in turn connected to MATLAB through serial communication. I would like to write the readings into a text file. A 10-second reading writes around 1000 entries, making the text file around 1 KB in size.
I will be using the following code:
%%%%%// Communication %%%%%
arduino = serial('COM6','BaudRate',9600);
fopen(arduino);
fileID = fopen('Readings.txt','w');
%%%%%// Reading from Serial %%%%%
vib = [];                              % accumulator for the samples
for i = 1:Samples                      % Samples must be defined beforehand
    scan = fscanf(arduino,'%f');
    if isfloat(scan)
        vib = [vib; scan];             % grows every iteration; see the answer below
        fprintf(fileID,'%0.3f\r\n',scan);
    end
end
fclose(fileID);                        % close both streams when done
fclose(arduino);
Any suggestions on improving this code? Will it hit a time or size limit? The code is to be run for 3 days.
Do not use text files, use binary files. A value like 42718123229.123123 takes 18 bytes in ASCII but only 4 bytes as a single-precision binary value (8 bytes as a double). Don't waste space unnecessarily. If your data is going to be used later in MATLAB, then I simply suggest you save .mat files.
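A quick way to verify the size difference (the file names here are illustrative):

x = 42718123229.123123;
ftxt = fopen('demo.txt','w'); fprintf(ftxt,'%0.6f\r\n',x); fclose(ftxt);
fbin = fopen('demo.bin','w'); fwrite(fbin,x,'double');     fclose(fbin);
dtxt = dir('demo.txt'); dbin = dir('demo.bin');
fprintf('text: %d bytes, binary: %d bytes\n', dtxt.bytes, dbin.bytes);  % e.g. 20 vs 8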
Do not use a single file! Choose a reasonable file size (e.g. 100 MB) and make sure that when you reach that much data you switch to another file. You could do this by e.g. saving one file per hour. This way you minimize the data lost if the software crashes 2 minutes before finishing.
Now, knowing the real dimensions of your problem, writing a text file is totally fine; nothing special is required to process such small data. But there is a problem with your code: you are writing to a variable vib that grows over time. That can cause bad performance, because you are not preallocating, and it may consume a lot of memory. I strongly recommend not keeping this variable; if you need the data, read it back from the file afterwards.
Another thing you should consider is verification of your data. What do you do when you receive fewer samples than you expect? Include timestamps! Be aware that these timestamps are not precise, since you add them after reception, but they let you identify whether just some random samples are missing (which may be interpolated afterwards) or a consecutive series of maybe 100 samples is missing.
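A minimal sketch combining the file-per-hour rotation with the timestamp advice; the log-file naming and the two-double record layout are assumptions for illustration, not prescribed above:

currentHour = '';
fid = -1;
for i = 1:Samples
    scan = fscanf(arduino,'%f');
    if isfloat(scan)
        thisHour = datestr(now,'yyyymmdd_HH');
        if ~strcmp(thisHour, currentHour)          % hour rolled over: rotate files
            if fid ~= -1, fclose(fid); end
            fid = fopen(['log_' thisHour '.bin'],'w');
            currentHour = thisHour;
        end
        fwrite(fid, [now; scan], 'double');        % timestamp + sample, 16 bytes per entry
    end
end
if fid ~= -1, fclose(fid); end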
I have a DAQ for temperature measurement. I sample continuously, and during acquisition I calculate the temperature difference per minute (cooling rate, CR). The CR and temperature values are fed into a Matlab script that runs a physical model (predicting the temperature drop for the next 30 sec). Then I record and compare the predicted and experimental values in LabVIEW.
What I am trying to do is have the Matlab model execute every 30 sec and send out its predictions as outputs from the Matlab script. One of these outputs lets me change the air blower motor speed until the next Matlab run (which in turn affects the temperature drop over the next 30 sec, so the whole thing becomes a closed loop). After 30 sec, while the main process is still running, the CR and temperature values are sent to the Matlab model again, and so on.
I have a case structure for this Matlab script, and inside the case structure I applied an elapsed-time function to control the timing of the Matlab script, but this is not working.
Yes. Short answer: I believe (one of) the reasons the program behaves weirdly as the timing changes is the several race conditions present in the code.
The part of the diagram you presented shows several big problems with the code:
Local variables lead to race conditions. Use dataflow. E.g. you are writing to the Tinitial local variable, and reading from the Tinitial local variable, in chunks of code with no data dependencies between them. It is not known whether the read or the write will happen first. This may not manifest itself badly with small delays, while big delays may be an issue. Solution: rewrite your program along the lines of the following example:
From Bad: [image: block diagram using local variables]
To Good: [image: the same logic rewritten with dataflow wires] (never mind the broken wires)
The Matlab script node executes in the main UI execution system. If it executes for a long time, it may freeze indicators/controls, as well as the execution of other pieces of code. Change the execution system of the other VIs in your program (say, to "other 1") and see if the situation improves.
I have noticed that the first time I run a script, it takes considerably more time than the second and third time [1]. The "warm-up" is mentioned in this question without an explanation.
Why does the code run faster after it is "warmed up"?
I don't clear all between calls [2], but the input parameters change for every function call. Does anyone know why this is?
[1] I have my license locally, so it's not a problem related to license checking.
[2] Actually, the behavior doesn't change if I clear all.
One reason it runs faster after the first time is that many things are initialized once, and their results are cached and reused the next time. For example, on the M-side, variables can be defined as persistent inside functions, which can themselves be locked in memory. The same can occur on the MEX side of things.
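A minimal sketch of that persistent-variable caching pattern (the function and its "expensive setup" are made up for illustration):

function y = cachedDemo(ix)
% The costly initialization runs only on the first call; subsequent
% calls reuse the stored table and return much faster.
persistent lookupTable
if isempty(lookupTable)
    lookupTable = rand(1000);   % stand-in for an expensive one-time setup
end
y = lookupTable(ix, ix);
end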
In addition, many dependencies are loaded after the first time and remain in memory to be reused. These include M-functions, OOP classes, Java classes, MEX-functions, and so on. This applies to both built-in and user-defined ones.
For example, issue the following command before and after running your script for the first time, then compare:
[M,X,C] = inmem('-completenames')
Note that clear all does not necessarily clear all of the above, not to mention locked functions...
Finally, let us not forget the role of the JIT accelerator. Instead of the M-code being interpreted every time a function is invoked, it gets compiled into machine-code instructions at runtime. JIT compilation occurs only on the first invocation, so ideally the efficiency of running the compiled code on subsequent invocations outweighs the overhead of re-interpreting the program every time it runs.
Matlab is interpreted. If you don't warm up the code, you lose a lot of time to interpretation rather than to the actual algorithm. This can skew timing results significantly.
Running the code at least once lets Matlab actually compile the appropriate code segments.
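A quick way to see (and control for) this effect in a benchmark; myFunction and x are placeholders:

x = rand(1000);                         % placeholder input
tic; y = myFunction(x); tcold = toc;    % first call pays the load/JIT cost
tic; y = myFunction(x); twarm = toc;    % later calls run the compiled code
fprintf('cold: %.3f s, warm: %.3f s\n', tcold, twarm);
% timeit(@() myFunction(x)) runs warm-up iterations internally for this reason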
Besides Matlab-specific reasons like JIT-compilation, modern CPUs have large caches, branch predictors, and so on. Warming these up is an issue for benchmarking even in assembly language.
Also, more importantly, modern CPUs usually idle at low clock speed, and only jump to full speed after several milliseconds of sustained load.
Intel's Turbo feature gets even more funky: when power and thermal limits allow, the CPU can run faster than its sustainable max frequency. So the first ~20 seconds to 1 minute of your benchmark may run faster than the rest of it, if you aren't careful to control for these factors.
Another issue, not mentioned by Amro and Marc, is memory (pre)allocation.
If your script does not pre-allocate its memory, its first run will be very slow due to repeated memory allocation. Once it has completed its first run, all the memory is allocated, so consecutive invocations of the script are more efficient.
An illustrative example
for ii = 1:1000
    vec(ii) = ii; %// vec grows inside the loop the first time this code is executed only
end
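For contrast, a preallocated version pays no allocation cost on any run:

vec = zeros(1,1000); %// preallocate once
for ii = 1:1000
    vec(ii) = ii;    %// no reallocation, so the first run is as fast as later ones
end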
IMPORTANT UPDATE
I just discovered that after restarting Matlab and the computer, this simplified code no longer reproduces the problem for me either... I am so sorry for taking up your time with a script that didn't work. However, the old problem still persists in my original script if I save anything in any folder (that I have tried) in the inner for loop. For my purposes, I have worked around it by simply not making this save unless I absolutely need it. The original script has the following structure in terms of for loops and use of save or load:
load()                   % .mat file, size 365x92x240
for day = 1:365
    load()               % .mat file, size 8x92x240
    for type = 1:17
        load()           % .mat file, size 17x92x240
        load()           % .mat file, size 92x240
        for step = 1:8
            % only calculations
        end
        save()           % .mat file, size 8x92x240
    end
    save()               % .mat file, size 8x92x240
end
% the loads and saves below are in for loops too, but do not seem to
% affect the described behavior in the above script
load()                   % .mat file, size 8x92x240
save()                   % .mat file, size 2920x92x240
load()
save()                   % .mat file, size 365x92x240
load()
save()                   % .mat file, size 12x92x240
If run in full, the script saves approx. 10 GB and loads approx. 2 GB of data.
The entire script is rather lengthy and makes a lot of saves and loads. It would be rather impractical to share it all here before I have managed to reproduce the problem in a reduced version, unfortunately. As I frustratingly discovered that the very same code could behave differently from time to time, finding a simplification that consistently reproduces the behavior immediately got more tedious than anticipated. I will get back as soon as I am sure of a manageable piece of code that reproduces the problem.
PREVIOUS PROBLEM DESCRIPTION
(NB: the code below does not reliably reproduce the described problem.)
I just learnt the hard way that, in Matlab, you can't name a save folder temp in a for loop without slowing down data loading in the next round of the loop. My question is: why?
If you are interested in reproducing the problem yourself, please see the code below. To run it, you will also need a MAT-file called anyData.mat to load, and two folders for saving, one called temp and the other called temporary.
clear all; clc; close all; profile off;
profile on
endDay = 2;                          % endDay was undefined in the original
tT  = zeros(1, endDay+1);
tTD = zeros(1, endDay+1);
for day = 0:endDay
    tic
    T = importdata('anyData.mat');
    tT(day+1) = toc;                 % loading time in seconds
    tic
    TD = importdata('anyData.mat');
    tTD(day+1) = toc;
    for type = 0:1
        saveFile = ones(92,240);
        save('AnyFolder\temporary\saveFile.mat', 'saveFile')  % leads to fast data loading
        %save('AnyFolder\temp\saveFile.mat', 'saveFile')      % leads to slow data loading
    end % end of type
end % end of day
profile off
profile report
plot(tT)
You will see on the y-axis of the plot that data loading takes significantly longer when the inner for loop saves to temp rather than to temporary. Is there anyone out there who knows why this occurs?
There are two things at play here.
Storage during a for loop is an expensive operation, as it usually opens a file stream and closes it before moving on. You might not be able to avoid this.
The second thing is the speed of the storage and its cache. Most likely, other programs use the temp folder for their own temporary files, with a garbage collector or cleanup software looking after it. If you start opening and closing file streams to this folder, you have to request exclusive write access to it each time, which again adds to the time.
If you are doing image-processing operations on multiple images, you can also run into a bottleneck writing to the hard drive, due to its speed, its cache, and the memory currently available to MATLAB.
I can't reproduce the problem; I suspect it's system- and data-size-specific. But here are some general comments which could help you out of the predicament:
As pointed out by commenters and the above answers, file I/O within a double for loop can be extremely costly, especially in cases where you only need to access part of the data in the file, where other system operations delay the process, or where the data files are large enough to require virtual memory (Windows) / swap space (Linux) just to load them. In the latter case, you could effectively be moving a file from one part of the hard disk to another every time you open it!
I assume that you're loading/saving because you don't have the c. 10 GB of RAM needed to hold everything in memory for computation. The actual problem is not described, so I can't be certain, but I think you might find the matfile class useful... TMW documentation. It maps variables directly to/from a MAT-file (a short sketch follows the list below). This:
reduces file stream opening and closing IOPS
allows arbitrarily large variable sizes (governed by disk size, not memory)
allows you to read/write partially (i.e. write only some elements of an array without loading the whole file)
in the case that your mat file is too large to be held in memory, avoids loading it into swap space which would be extremely cumbersome.
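A minimal sketch of partial read/write with matfile; the file and variable names are illustrative:

m = matfile('results.mat', 'Writable', true);  % maps to the file; nothing is loaded yet
m.P = zeros(8, 92, 240);                       % create the variable on disk
m.P(1, :, :) = rand(1, 92, 240);               % write a single slice, not the whole array
slab = m.P(2:3, :, 1:10);                      % read back just a small piece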
Hope this helps.
Tom