From a Monte-Carlo simulation I have a range of files, say: file_1.mat, file_2.mat,...,file_n.mat, where n is large. Each file contains one or several (maximum 3 if it matters) large 1D arrays in time of interest, say var1, var2, var3.
I am now as always interested in finding the mean value of these variables. My question is now, how do I do this in the most efficient way? The keyword here is efficiency. Below you will find the MWE which is done the standard way, but this is quite time consuming as the files are large and there are many.
I am programming in Matlab, however ideas presented in pseudo code is also very well received.
MWE:(The standard way)
meanVar1 = zeros(1,1e6); %I do not remember the exact size, just use 1e6
meanVar2 = zeros(1,1e6);
meanVar3 = zeros(1,1e6);
for i 1=1:n
load(strcat('file_',int2str(i)),'var1','var2','var3')
meanVar1 = meanVar1 + var1;
meanVar2 = meanVar2 + var2;
meanVar3 = meanVar3 + var3;
end
meanVar1 = meanVar1/n;
meanVar2 = meanVar2/n;
meanVar3 = meanVar3/n;
Related
I am trying to estimate a SUR model of the form
y_{1,t} = \alpha_1 +\beta_1 x_{1,t} + \beta_2 x_{2,t} + \beta_3 y_{1,t-1} +\epsilon_{1,t}
y_{2,t} = \alpha_2 +\beta_4 x_{1,t} + \beta_5 x_{2,t} + \beta_6 y_{2,t-1} +\epsilon_{2,t}
Define mY = [y_1 y_2] and mX = [x_1 x_2].
For this purpose I am doing
iT = size(mY,1); iN = size(mY,2);
mXsur = kron(mX, eye(iN));
mXsurCell = mat2cell(mXsur, iN*ones(iT,1));
iR = size(mXsur,2);
Mdl = vgxset('n', iN, 'nAR',1, 'nX',iR,'Constant',true);
[SurOutput, SurSDerror, ~,SURcov] = vgxvarx(Mdl, mY, mXsurCell);
The issue is that the bit of code nAR, 1 seems to add 1 lag of both y variables to each equation and I only wish to add one per equation. Is there a quick way to do this?
(Of course I can include the lagged terms manually in the mX matrix, but my question is whether we can do this via vgxset in a quicker way. I think not based on my reading of the help file, but still want to double check). Thanks
I have a CSV file 1.6 GB large, that I need to feed into matlab. I will have to do this frequently and I need it to run quickly. The file is of the form:
20111205 00:00.2 99.18 6 E
20111205 00:00.2 99.18 5 E
20111205 00:00.2 99.18 1 E
20111205 00:00.2 99.195 5 E
20111205 00:00.2 99.195 5 E
20111205 01:27.0 99.19 5 E
20111205 02:01.4 99.185 1 E
20111205 02:01.4 99.185 1 E
20111205 02:01.4 99.185 1 E
20111205 02:01.4 99.185 1 E
The code I have right now is the following:
tic;
format long g
fid = fopen('C:\Program Files\MATLAB\R2013a\EDU13.csv','r');
[c] = fscanf(fid, '%d,%d:%d.%d,%f,%d,%c');
c = reshape(c, 7, length(c)/7)
toc;
But this is far too slow. I would appreciate a method of getting this CSV file into matlab in the most efficient manner possible. Thank you!
Consider using a binary file format. Binary files are much smaller and don't need to be converted by MATLAB into the binary format. Hence they are much faster to read and write. They may also be more accurate (precision may be higher).
http://www.mathworks.com.au/help/matlab/ref/fread.html
The recommended syntax is textscan (http://www.mathworks.com/help/matlab/ref/textscan.html)
Your code would look like this:
fid = fopen('C:\Program Files\MATLAB\R2013a\EDU13.csv','r');
c = textscan(fid, '%d,%d:%d.%d,%f,%d,%c');
fclose(fid);
You end up with a cell array... whether it's worth converting that to another shape really depends on how you want to access the data afterwards.
It is quite likely that this would be faster if you include a loop that allows you to use a smaller, fixed amount of memory for much of the operation. One problem with reading large files is the fact that you don't know ahead of time how big it will be - and that very likely means that Matlab guesses the amount of memory it needs, and frequently has to rescale. That is a very slow operation - if it happens every 1MB, say, then it copies 1 MB once, next 2 MB, then again 3 MB, etc - as you can see it is quadratic in the size of the array.
If instead you allocate a fixed amount of memory for the final result, and process in smaller batches, you avoid all that overhead. I'm pretty sure it will be much faster - but you would have to experiment a bit with the block size. That would look something like this:
block = 1000;
Nlines = 35E6;
fid = fopen('C:\Program Files\MATLAB\R2013a\EDU13.csv','r');
c = struct(field1, field2, fieldn, value); %... initialize structure array or other storage for c ...
c_offset = 0;
while ~feof(fid)
temp = textscan(fid, '%d,%d:%d.%d,%f,%d,%c', block);
bt = size(temp, 1); % first dimension - should be `block`, except for last loop
%... extract, process, store in c(c_offset + (1:bt))...
c_offset = c_offset + bt;
end
fclose(fid);
Inspired by #Axon's answer, I implemented a "fast" C program to convert the file to binary, then read it in using Matlab's fread function. Spoiler alert: reading is then 20x faster... although the initial conversion takes a little bit of time.
To make the job in Matlab easier, and the file size smaller, I am converting each of the number fields into an int16 (short integer). For the first field - which looks like a yyyymmdd field - that involves splitting into two smaller numbers; similarly the decimal numbers are converted to two short integers (given the apparent range I think that is valid). All this is recognizing that "to really optimize, you must really know your problem" - so if assumptions are invalid, the results will be too.
Here is the C code:
#include <stdio.h>
int main(){
FILE *fp, *fo;
long int ld1;
int d2, d3, d4, d5, d6, d7;
short int buf[9];
char c8;
int n;
short int year, monthday;
fp = fopen("bigdata.txt", "r");
fo = fopen("bigdata.bin", "wb");
if (fp == NULL || fo == NULL) {
printf("unable to open file\n");
return 1;
}
while(!feof(fp)) {
n = fscanf(fp, "%ld %d:%d.%d %d.%d %d %c\n", \
&ld1, &d2, &d3, &d4, &d5, &d6, &d7, &c8);
year = d1 / 10000;
monthday = d1 - 10000 * year;
// move everything into buffer for single call to fwrite:
buf[0] = year;
buf[1] = monthday;
buf[2] = d2;
buf[3] = d3;
buf[4] = d4;
buf[5] = d5;
buf[6] = d6;
buf[7] = d7;
buf[8] = c8;
fwrite(buf, sizeof(short int), 9, fo);
}
fclose(fp);
fclose(fo);
return 0;
}
The resulting file is about half the size of the original - which is encouraging and will speed up access. Note that it would be a good idea if the output file could be written to a different disk than the input file - it really helps keep data streaming without a lot of time wasted in seek operations.
Benchmark: using a file of 2 M lines as input, this ran in about 2 seconds (same disk). The resulting binary file is read in Matlab with the following:
tic
fid = fopen('bigdata.bin');
d = fread(fid, 'int16');
d = reshape(d, 9, []);
toc
Of course, now if you want to recover the numbers as floating point numbers, you will have to do a little bit of work; but I think it's worth it. One possible problem you will have to solve is the situation where the value after the decimal point has a different number of digits: converting (a,b) into float isn't as simple as "a + b/100" when b > 100... "exercise for the student"?
A little benchmarking: The above code took about 0.4 seconds. By comparison, my first suggestion with textread took about 9 seconds on the same file; and your original code took a little over 11 seconds. The difference may get bigger when the file gets bigger.
If you do this a lot (as you said), it clearly is worth converting your files once to binary format, and using them that way. Especially if the file needs to be converted only once, and read many times, the savings will be considerable.
update
I repeated the benchmark with a 13M line file. The conversion took 13 seconds, the binary read < 3 seconds. By contrast each of the other two methods took over a minute (textscan: 61s; fscanf: 77s). It seems that things are scaling linearly (file size 470M text, 240M binary)
I am new to Mathematica and I am having difficulties with one thing. I have this Table that generates 10 000 times 13 numbers (12 numbers + 1 that is a starting number). I need to create a Histogram from all 10 000 13th numbers. I hope It's quite clear, quite tricky to explain.
This is the table:
F = Table[(Xi = RandomVariate[NormalDistribution[], 12];
Mu = -0.00644131;
Sigma = 0.0562005;
t = 1/12; s = 0.6416;
FoldList[(#1*Exp[(Mu - Sigma^2/2)*t + Sigma*Sqrt[t]*#2]) &, s,
Xi]), {SeedRandom[2]; 10000}]
The result for the following histogram could be a table that will take all the 13th numbers to one table - than It would be quite easy to create an histogram. Maybe with "select"? Or maybe you know other ways to solve this.
You can access different parts of a list using Part or (depending on what parts you need) some of the more specialised commands, such as First, Rest, Most and (the one you need) Last. As noted in comments, Histogram[Last/#F] or Histogram[F[[All,-1]]] will work fine.
Although it wasn't part of your question, I would like to note some things you could do for your specific problem that will speed it up enormously. You are defining Mu, Sigma etc 10,000 times, because they are inside the Table command. You are also recalculating Mu - Sigma^2/2)*t + Sigma*Sqrt[t] 120,000 times, even though it is a constant, because you have it inside the FoldList inside the Table.
On my machine:
F = Table[(Xi = RandomVariate[NormalDistribution[], 12];
Mu = -0.00644131;
Sigma = 0.0562005;
t = 1/12; s = 0.6416;
FoldList[(#1*Exp[(Mu - Sigma^2/2)*t + Sigma*Sqrt[t]*#2]) &, s,
Xi]), {SeedRandom[2]; 10000}]; // Timing
{4.19049, Null}
This alternative is ten times faster:
F = Module[{Xi, beta}, With[{Mu = -0.00644131, Sigma = 0.0562005,
t = 1/12, s = 0.6416},
beta = (Mu - Sigma^2/2)*t + Sigma*Sqrt[t];
Table[(Xi = RandomVariate[NormalDistribution[], 12];
FoldList[(#1*Exp[beta*#2]) &, s, Xi]), {SeedRandom[2];
10000}] ]]; // Timing
{0.403365, Null}
I use With for the local constants and Module for the things that are other redefined within the Table (Xi) or are calculations based on the local constants (beta). This question on the Mathematica StackExchange will help explain when to use Module versus Block versus With. (I encourage you to explore the Mathematica StackExchange further, as this is where most of the Mathematica experts are hanging out now.)
For your specific code, the use of Part isn't really required. Instead of using FoldList, just use Fold. It only retains the final number in the folding, which is identical to the last number in the output of FoldList. So you could try:
FF = Module[{Xi, beta}, With[{Mu = -0.00644131, Sigma = 0.0562005,
t = 1/12, s = 0.6416},
beta = (Mu - Sigma^2/2)*t + Sigma*Sqrt[t];
Table[(Xi = RandomVariate[NormalDistribution[], 12];
Fold[(#1*Exp[beta*#2]) &, s, Xi]), {SeedRandom[2];
10000}] ]];
Histogram[FF]
Calculating FF in this way is even a little faster than the previous version. On my system Timing reports 0.377 seconds - but such a difference from 0.4 seconds is hardly worth worrying about.
Because you are setting the seed with SeedRandom, it is easy to verify that all three code examples produce exactly the same results.
Making my comment an answer:
Histogram[Last /# F]
I have a NetCDF file, which contains data representing total precipitation across the globe over several months (so it's stored in a three dimensional array). I first ensured that the data was sensible, and the way it was formed, both in XConv and ncdump. All looks sensible - values vary from very small (~10^-10 - this makes sense, as this is model data, and effectively represents zero) to about 5x10^-3.
The problems start when I try to handle this data in IDL or MatLab. The arrays generated in these programs are full of huge negative numbers such as -4x10^4, with occasional huge positive numbers, such as 5000. Strangely, looking at a plot of the data in MatLab with respect to latitude and longitude (at a specific time), the pattern of rainfall looks sensible, but the values are just completely wrong.
In IDL, I'm reading the file in to write it to a text file so it can be handled by some software that takes very basic text files. Here's the code I'm using:
PRO nao_heaps
address = '/Users/levyadmin/Downloads/'
file_base = 'output'
ncid = ncdf_open(address + file_base + '.nc')
MONTHS=['january','february','march','april','may','june','july','august','september','october','november','december']
varid_field = ncdf_varid(ncid, "tp")
varid_lon = ncdf_varid(ncid, "longitude")
varid_lat = ncdf_varid(ncid, "latitude")
varid_time = ncdf_varid(ncid, "time")
ncdf_varget,ncid, varid_field, total_precip
ncdf_varget,ncid, varid_lat, lats
ncdf_varget,ncid, varid_lon, lons
ncdf_varget,ncid, varid_time, time
ncdf_close,ncid
lats = reform(lats)
lons = reform(lons)
time = reform(time)
total_precip = reform(total_precip)
total_precip = total_precip*1000. ;put in mm
noLats=(size(lats))(1)
noLons=(size(lons))(1)
noMonths=(size(time))(1)
; the data may not be an integer number of years (otherwise we could make this next loop cleaner)
av_precip=fltarr(noLons,noLats,12)
for month=0, 11 do begin
year = 0
while ( (year*12) + month lt noMonths ) do begin
av_precip(*,*,month) = av_precip(*,*,month) + total_precip(*,*, (year*12)+month )
year++
endwhile
av_precip(*,*,month) = av_precip(*,*,month)/year
endfor
fname = address + file_base + '.dat'
OPENW,1,fname
PRINTF,1,'longitude'
PRINTF,1,lons
PRINTF,1,'latitude'
PRINTF,1,lats
for month=0,11 do begin
PRINTF,1,MONTHS(month)
PRINTF,1,av_precip(*,*,month)
endfor
CLOSE,1
END
Anyone have any ideas why I'm getting such strange values in MatLab and IDL?!
AH! Found the answer. NetCDF files use an offset, and a scale factor for the data to keep the size of the file to a minimum. To get the correct values, I simply need to:
total_precip = offset + (scale_factor * total_precip) ;put into correct range
At present I'm getting the scale factor and offset from ncdump, and hard coding them into my IDL program, but does anyone know how I can get them dynamically in my IDL code..?
hey guys i want your suggestion that how can change value of two variables without 3rd one. in objective cc.
is there any way so please inform me,
it can be done in any language. x and y are 2 variables and we want to swap them
{
//lets say x , y are 1 ,2
x = x + y; // 1+2 =3
y = x - y; // 3 -2 = 1
x = x -y; // 3-1 = 2;
}
you can use these equation in any language to achieve this
Do you mean exchange the value of two variables, as in the XOR swap algorithm? Unless you're trying to answer a pointless interview question, programming in assembly language, or competing in the IOCCC, don't bother. A good optimizing compiler will probably handle the standard tmp = a; a = b; b = tmp; better than whatever trick you might come up with.
If you are doing one of those things (or are just curious), see the Wikipedia article for more info.
As far as number is concerned you can swap numbers in any language without using the third one whether it's java, objective-C OR C/C++,
For more info
Potential Problem in "Swapping values of two variables without using a third variable"
Since this is explicitly for iPhone, you can use the ARM instruction SWP, but it's almost inconceivable why you'd want to. The complier is much, much better at this kind of optimization. If you just want to avoid the temporary variable in code, write an inline function to handle it. The compiler will optimize it away if it can be done more efficiently.
NSString * first = #"bharath";
NSString * second = #"raj";
first = [NSString stringWithFormat:#"%#%#",first,second];
NSRange needleRange = NSMakeRange(0,
first.length - second.length);
second = [first substringWithRange:needleRange];
first = [first substringFromIndex:second.length];
NSLog(#"first---> %#, Second---> %#",first,second);