Massive CSV file into Matlab - matlab

I have a CSV file 1.6 GB large, that I need to feed into matlab. I will have to do this frequently and I need it to run quickly. The file is of the form:
20111205 00:00.2 99.18 6 E
20111205 00:00.2 99.18 5 E
20111205 00:00.2 99.18 1 E
20111205 00:00.2 99.195 5 E
20111205 00:00.2 99.195 5 E
20111205 01:27.0 99.19 5 E
20111205 02:01.4 99.185 1 E
20111205 02:01.4 99.185 1 E
20111205 02:01.4 99.185 1 E
20111205 02:01.4 99.185 1 E
The code I have right now is the following:
tic;
format long g
fid = fopen('C:\Program Files\MATLAB\R2013a\EDU13.csv','r');
[c] = fscanf(fid, '%d,%d:%d.%d,%f,%d,%c');
c = reshape(c, 7, length(c)/7)
toc;
But this is far too slow. I would appreciate a method of getting this CSV file into matlab in the most efficient manner possible. Thank you!

Consider using a binary file format. Binary files are much smaller and don't need to be converted by MATLAB into the binary format. Hence they are much faster to read and write. They may also be more accurate (precision may be higher).
http://www.mathworks.com.au/help/matlab/ref/fread.html

The recommended syntax is textscan (http://www.mathworks.com/help/matlab/ref/textscan.html)
Your code would look like this:
fid = fopen('C:\Program Files\MATLAB\R2013a\EDU13.csv','r');
c = textscan(fid, '%d,%d:%d.%d,%f,%d,%c');
fclose(fid);
You end up with a cell array... whether it's worth converting that to another shape really depends on how you want to access the data afterwards.
It is quite likely that this would be faster if you include a loop that allows you to use a smaller, fixed amount of memory for much of the operation. One problem with reading large files is the fact that you don't know ahead of time how big it will be - and that very likely means that Matlab guesses the amount of memory it needs, and frequently has to rescale. That is a very slow operation - if it happens every 1MB, say, then it copies 1 MB once, next 2 MB, then again 3 MB, etc - as you can see it is quadratic in the size of the array.
If instead you allocate a fixed amount of memory for the final result, and process in smaller batches, you avoid all that overhead. I'm pretty sure it will be much faster - but you would have to experiment a bit with the block size. That would look something like this:
block = 1000;
Nlines = 35E6;
fid = fopen('C:\Program Files\MATLAB\R2013a\EDU13.csv','r');
c = struct(field1, field2, fieldn, value); %... initialize structure array or other storage for c ...
c_offset = 0;
while ~feof(fid)
temp = textscan(fid, '%d,%d:%d.%d,%f,%d,%c', block);
bt = size(temp, 1); % first dimension - should be `block`, except for last loop
%... extract, process, store in c(c_offset + (1:bt))...
c_offset = c_offset + bt;
end
fclose(fid);

Inspired by #Axon's answer, I implemented a "fast" C program to convert the file to binary, then read it in using Matlab's fread function. Spoiler alert: reading is then 20x faster... although the initial conversion takes a little bit of time.
To make the job in Matlab easier, and the file size smaller, I am converting each of the number fields into an int16 (short integer). For the first field - which looks like a yyyymmdd field - that involves splitting into two smaller numbers; similarly the decimal numbers are converted to two short integers (given the apparent range I think that is valid). All this is recognizing that "to really optimize, you must really know your problem" - so if assumptions are invalid, the results will be too.
Here is the C code:
#include <stdio.h>
int main(){
FILE *fp, *fo;
long int ld1;
int d2, d3, d4, d5, d6, d7;
short int buf[9];
char c8;
int n;
short int year, monthday;
fp = fopen("bigdata.txt", "r");
fo = fopen("bigdata.bin", "wb");
if (fp == NULL || fo == NULL) {
printf("unable to open file\n");
return 1;
}
while(!feof(fp)) {
n = fscanf(fp, "%ld %d:%d.%d %d.%d %d %c\n", \
&ld1, &d2, &d3, &d4, &d5, &d6, &d7, &c8);
year = d1 / 10000;
monthday = d1 - 10000 * year;
// move everything into buffer for single call to fwrite:
buf[0] = year;
buf[1] = monthday;
buf[2] = d2;
buf[3] = d3;
buf[4] = d4;
buf[5] = d5;
buf[6] = d6;
buf[7] = d7;
buf[8] = c8;
fwrite(buf, sizeof(short int), 9, fo);
}
fclose(fp);
fclose(fo);
return 0;
}
The resulting file is about half the size of the original - which is encouraging and will speed up access. Note that it would be a good idea if the output file could be written to a different disk than the input file - it really helps keep data streaming without a lot of time wasted in seek operations.
Benchmark: using a file of 2 M lines as input, this ran in about 2 seconds (same disk). The resulting binary file is read in Matlab with the following:
tic
fid = fopen('bigdata.bin');
d = fread(fid, 'int16');
d = reshape(d, 9, []);
toc
Of course, now if you want to recover the numbers as floating point numbers, you will have to do a little bit of work; but I think it's worth it. One possible problem you will have to solve is the situation where the value after the decimal point has a different number of digits: converting (a,b) into float isn't as simple as "a + b/100" when b > 100... "exercise for the student"?
A little benchmarking: The above code took about 0.4 seconds. By comparison, my first suggestion with textread took about 9 seconds on the same file; and your original code took a little over 11 seconds. The difference may get bigger when the file gets bigger.
If you do this a lot (as you said), it clearly is worth converting your files once to binary format, and using them that way. Especially if the file needs to be converted only once, and read many times, the savings will be considerable.
update
I repeated the benchmark with a 13M line file. The conversion took 13 seconds, the binary read < 3 seconds. By contrast each of the other two methods took over a minute (textscan: 61s; fscanf: 77s). It seems that things are scaling linearly (file size 470M text, 240M binary)

Related

Efficiently calculate mean value from files

From a Monte-Carlo simulation I have a range of files, say: file_1.mat, file_2.mat,...,file_n.mat, where n is large. Each file contains one or several (maximum 3 if it matters) large 1D arrays in time of interest, say var1, var2, var3.
I am now as always interested in finding the mean value of these variables. My question is now, how do I do this in the most efficient way? The keyword here is efficiency. Below you will find the MWE which is done the standard way, but this is quite time consuming as the files are large and there are many.
I am programming in Matlab, however ideas presented in pseudo code is also very well received.
MWE:(The standard way)
meanVar1 = zeros(1,1e6); %I do not remember the exact size, just use 1e6
meanVar2 = zeros(1,1e6);
meanVar3 = zeros(1,1e6);
for i 1=1:n
load(strcat('file_',int2str(i)),'var1','var2','var3')
meanVar1 = meanVar1 + var1;
meanVar2 = meanVar2 + var2;
meanVar3 = meanVar3 + var3;
end
meanVar1 = meanVar1/n;
meanVar2 = meanVar2/n;
meanVar3 = meanVar3/n;

How to do bitwise operation decently?

I'm doing analysis on binary data. Suppose I have two uint8 data values:
a = uint8(0xAB);
b = uint8(0xCD);
I want to take the lower two bits from a, and whole content from b, to make a 10 bit value. In C-style, it should be like:
(a[2:1] << 8) | b
I tried bitget:
bitget(a,2:-1:1)
But this just gave me separate [1, 1] logical type values, which is not a scalar, and cannot be used in the bitshift operation later.
My current solution is:
Make a|b (a or b):
temp1 = bitor(bitshift(uint16(a), 8), uint16(b));
Left shift six bits to get rid of the higher six bits from a:
temp2 = bitshift(temp1, 6);
Right shift six bits to get rid of lower zeros from the previous result:
temp3 = bitshift(temp2, -6);
Putting all these on one line:
result = bitshift(bitshift(bitor(bitshift(uint16(a), 8), uint16(b)), 6), -6);
This is doesn't seem efficient, right? I only want to get (a[2:1] << 8) | b, and it takes a long expression to get the value.
Please let me know if there's well-known solution for this problem.
Since you are using Octave, you can make use of bitpack and bitunpack:
octave> a = bitunpack (uint8 (0xAB))
a =
1 1 0 1 0 1 0 1
octave> B = bitunpack (uint8 (0xCD))
B =
1 0 1 1 0 0 1 1
Once you have them in this form, it's dead easy to do what you want:
octave> [B A(1:2)]
ans =
1 0 1 1 0 0 1 1 1 1
Then simply pad with zeros accordingly and pack it back into an integer:
octave> postpad ([B A(1:2)], 16, false)
ans =
1 0 1 1 0 0 1 1 1 1 0 0 0 0 0 0
octave> bitpack (ans, "uint16")
ans = 973
That or is equivalent to an addition when dealing with integers
result = bitshift(bi2de(bitget(a,1:2)),8) + b;
e.g
a = 01010111
b = 10010010
result = 00000011 100010010
= a[2]*2^9 + a[1]*2^8 + b
an alternative method could be
result = mod(a,2^x)*2^y + b;
where the x is the number of bits you want to extract from a and y is the number of bits of a and b, in your case:
result = mod(a,4)*256 + b;
an extra alternative solution close to the C solution:
result = bitor(bitshift(bitand(a,3), 8), b);
I think it is important to explain exactly what "(a[2:1] << 8) | b" is doing.
In assembly, referencing individual bits is a single operation. Assume all operations take the exact same time and "efficient" a[2:1] starts looking extremely inefficient.
The convenience statement actually does (a & 0x03).
If your compiler actually converts a uint8 to a uint16 based on how much it was shifted, this is not a 'free' operation, per se. Effectively, what your compiler will do is first clear the "memory" to the size of uint16 and then copy "a" into the location. This requires an extra step (clearing the "memory" (register)) that wouldn't normally be needed.
This means your statement actually is (uint16(a & 0x03) << 8) | uint16(b)
Now yes, because you're doing a power of two shift, you could just move a into AH, move b into AL, and AH by 0x03 and move it all out but that's a compiler optimization and not what your C code said to do.
The point is that directly translating that statement into matlab yields
bitor(bitshift(uint16(bitand(a,3)),8),uint16(b))
But, it should be noted that while it is not as TERSE as (a[2:1] << 8) | b, the number of "high level operations" is the same.
Note that all scripting languages are going to be very slow upon initiating each instruction, but will complete said instruction rapidly. The terse nature of Python isn't because "terse is better" but to create simple structures that the language can recognize so it can easily go into vectorized operations mode and start executing code very quickly.
The point here is that you have an "overhead" cost for calling bitand; but when operating on an array it will use SSE and that "overhead" is only paid once. The JIT (just in time) compiler, which optimizes script languages by reducing overhead calls and creating temporary machine code for currently executing sections of code MAY be able to recognize that the type checks for a chain of bitwise operations need only occur on the initial inputs, hence further reducing runtime.
Very high level languages are quite different (and frustrating) from high level languages such as C. You are giving up a large amount of control over code execution for ease of code production; whether matlab actually has implemented uint8 or if it is actually using a double and truncating it, you do not know. A bitwise operation on a native uint8 is extremely fast, but to convert from float to uint8, perform bitwise operation, and convert back is slow. (Historically, Matlab used doubles for everything and only rounded according to what 'type' you specified)
Even now, octave 4.0.3 has a compiled bitshift function that, for bitshift(ones('uint32'),-32) results in it wrapping back to 1. BRILLIANT! VHLL place you at the mercy of the language, it isn't about how terse or how verbose you write the code, it's how the blasted language decides to interpret it and execute machine level code. So instead of shifting, uint32(floor(ones / (2^32))) is actually FASTER and more accurate.

How to select the last column of numbers from a table created by FoldList in Mathematica

I am new to Mathematica and I am having difficulties with one thing. I have this Table that generates 10 000 times 13 numbers (12 numbers + 1 that is a starting number). I need to create a Histogram from all 10 000 13th numbers. I hope It's quite clear, quite tricky to explain.
This is the table:
F = Table[(Xi = RandomVariate[NormalDistribution[], 12];
Mu = -0.00644131;
Sigma = 0.0562005;
t = 1/12; s = 0.6416;
FoldList[(#1*Exp[(Mu - Sigma^2/2)*t + Sigma*Sqrt[t]*#2]) &, s,
Xi]), {SeedRandom[2]; 10000}]
The result for the following histogram could be a table that will take all the 13th numbers to one table - than It would be quite easy to create an histogram. Maybe with "select"? Or maybe you know other ways to solve this.
You can access different parts of a list using Part or (depending on what parts you need) some of the more specialised commands, such as First, Rest, Most and (the one you need) Last. As noted in comments, Histogram[Last/#F] or Histogram[F[[All,-1]]] will work fine.
Although it wasn't part of your question, I would like to note some things you could do for your specific problem that will speed it up enormously. You are defining Mu, Sigma etc 10,000 times, because they are inside the Table command. You are also recalculating Mu - Sigma^2/2)*t + Sigma*Sqrt[t] 120,000 times, even though it is a constant, because you have it inside the FoldList inside the Table.
On my machine:
F = Table[(Xi = RandomVariate[NormalDistribution[], 12];
Mu = -0.00644131;
Sigma = 0.0562005;
t = 1/12; s = 0.6416;
FoldList[(#1*Exp[(Mu - Sigma^2/2)*t + Sigma*Sqrt[t]*#2]) &, s,
Xi]), {SeedRandom[2]; 10000}]; // Timing
{4.19049, Null}
This alternative is ten times faster:
F = Module[{Xi, beta}, With[{Mu = -0.00644131, Sigma = 0.0562005,
t = 1/12, s = 0.6416},
beta = (Mu - Sigma^2/2)*t + Sigma*Sqrt[t];
Table[(Xi = RandomVariate[NormalDistribution[], 12];
FoldList[(#1*Exp[beta*#2]) &, s, Xi]), {SeedRandom[2];
10000}] ]]; // Timing
{0.403365, Null}
I use With for the local constants and Module for the things that are other redefined within the Table (Xi) or are calculations based on the local constants (beta). This question on the Mathematica StackExchange will help explain when to use Module versus Block versus With. (I encourage you to explore the Mathematica StackExchange further, as this is where most of the Mathematica experts are hanging out now.)
For your specific code, the use of Part isn't really required. Instead of using FoldList, just use Fold. It only retains the final number in the folding, which is identical to the last number in the output of FoldList. So you could try:
FF = Module[{Xi, beta}, With[{Mu = -0.00644131, Sigma = 0.0562005,
t = 1/12, s = 0.6416},
beta = (Mu - Sigma^2/2)*t + Sigma*Sqrt[t];
Table[(Xi = RandomVariate[NormalDistribution[], 12];
Fold[(#1*Exp[beta*#2]) &, s, Xi]), {SeedRandom[2];
10000}] ]];
Histogram[FF]
Calculating FF in this way is even a little faster than the previous version. On my system Timing reports 0.377 seconds - but such a difference from 0.4 seconds is hardly worth worrying about.
Because you are setting the seed with SeedRandom, it is easy to verify that all three code examples produce exactly the same results.
Making my comment an answer:
Histogram[Last /# F]

IDL and MatLab getting strange values from NetCDF file

I have a NetCDF file, which contains data representing total precipitation across the globe over several months (so it's stored in a three dimensional array). I first ensured that the data was sensible, and the way it was formed, both in XConv and ncdump. All looks sensible - values vary from very small (~10^-10 - this makes sense, as this is model data, and effectively represents zero) to about 5x10^-3.
The problems start when I try to handle this data in IDL or MatLab. The arrays generated in these programs are full of huge negative numbers such as -4x10^4, with occasional huge positive numbers, such as 5000. Strangely, looking at a plot of the data in MatLab with respect to latitude and longitude (at a specific time), the pattern of rainfall looks sensible, but the values are just completely wrong.
In IDL, I'm reading the file in to write it to a text file so it can be handled by some software that takes very basic text files. Here's the code I'm using:
PRO nao_heaps
address = '/Users/levyadmin/Downloads/'
file_base = 'output'
ncid = ncdf_open(address + file_base + '.nc')
MONTHS=['january','february','march','april','may','june','july','august','september','october','november','december']
varid_field = ncdf_varid(ncid, "tp")
varid_lon = ncdf_varid(ncid, "longitude")
varid_lat = ncdf_varid(ncid, "latitude")
varid_time = ncdf_varid(ncid, "time")
ncdf_varget,ncid, varid_field, total_precip
ncdf_varget,ncid, varid_lat, lats
ncdf_varget,ncid, varid_lon, lons
ncdf_varget,ncid, varid_time, time
ncdf_close,ncid
lats = reform(lats)
lons = reform(lons)
time = reform(time)
total_precip = reform(total_precip)
total_precip = total_precip*1000. ;put in mm
noLats=(size(lats))(1)
noLons=(size(lons))(1)
noMonths=(size(time))(1)
; the data may not be an integer number of years (otherwise we could make this next loop cleaner)
av_precip=fltarr(noLons,noLats,12)
for month=0, 11 do begin
year = 0
while ( (year*12) + month lt noMonths ) do begin
av_precip(*,*,month) = av_precip(*,*,month) + total_precip(*,*, (year*12)+month )
year++
endwhile
av_precip(*,*,month) = av_precip(*,*,month)/year
endfor
fname = address + file_base + '.dat'
OPENW,1,fname
PRINTF,1,'longitude'
PRINTF,1,lons
PRINTF,1,'latitude'
PRINTF,1,lats
for month=0,11 do begin
PRINTF,1,MONTHS(month)
PRINTF,1,av_precip(*,*,month)
endfor
CLOSE,1
END
Anyone have any ideas why I'm getting such strange values in MatLab and IDL?!
AH! Found the answer. NetCDF files use an offset, and a scale factor for the data to keep the size of the file to a minimum. To get the correct values, I simply need to:
total_precip = offset + (scale_factor * total_precip) ;put into correct range
At present I'm getting the scale factor and offset from ncdump, and hard coding them into my IDL program, but does anyone know how I can get them dynamically in my IDL code..?

How to read data in chunks from notepad file in Matlab?

My data is in following format:
TABLE NUMBER 1
FILE: name_1
name_2
TIME name_3
day name_4
-0.01 0
364.99 35368.4
729.99 29307
1094.99 27309.5
1460.99 26058.8
1825.99 25100.4
2190.99 24364
2555.99 23757.1
2921.99 23240.8
3286.99 22785
3651.99 22376.8
4016.99 22006.1
4382.99 21664.7
4747.99 21348.3
5112.99 21052.5
5477.99 20774.1
5843.99 20509.9
6208.99 20259.7
6573.99 20021.3
6938.99 19793.5
7304.99 19576.6
TABLE NUMBER 2
FILE: name_1
name_5
TIME name_6
day name_7
-0.01 0
364.99 43110.4
729.99 37974.1
1094.99 36175.9
1460.99 34957.9
1825.99 34036.3
2190.99 33293.3
2555.99 32665.8
2921.99 32118.7
3286.99 31626.4
3651.99 31175.1
4016.99 30758
4382.99 30368.5
4747.99 30005.1
5112.99 29663
5477.99 29340
5843.99 29035.2
6208.99 28752.4
6573.99 28489.7
6938.99 28244.2
7304.99 28012.9
TABLE NUMBER 3
Till now I was splitting this data and reading the variables (time and name_i) from each file in following way:
[TIME(:,j), name_i(:,j)]=textread('filename','%f\t%f','headerlines',5);
But now I am producing the data of those files into 1 file as shown in beginning. For example I want to read and store TIME data in vectors TIME1, TIME2, TIME3, TIME4, TIME5 for name_3, name_6, _9 respectively, and similarly for others.
First of all, I suggest you don't use variable names such as TIME1,TIME2 etc, since that gets messy quickly. Instead, you can e.g. use a cell array with five rows (one for each well), and one or two columns. In the sample code below, wellData{2,1} is the time for the second well, wellData{2,2} is the corresponding Oil Rate SC - Yearly.
There might be more elegant ways to do the reading; here's something quick:
%# open the file
fid = fopen('Reportq.rwo');
%# read it into one big array, row by row
fileContents = textscan(fid,'%s','Delimiter','\n');
fileContents = fileContents{1};
fclose(fid); %# don't forget to close the file again
%# find rows containing TABLE NUMBER
wellStarts = strmatch('TABLE NUMBER',fileContents);
nWells = length(wellStarts);
%# loop through the wells and read the numeric data
wellData = cell(nWells,2);
wellStarts = [wellStarts;length(fileContents)];
for w = 1:nWells
%# read lines containing numbers
tmp = fileContents(wellStarts(w)+5:wellStarts(w+1)-1);
%# convert strings to numbers
tmp = cellfun(#str2num,tmp,'uniformOutput',false);
%# catenate array
tmp = cat(1,tmp{:});
%# assign output
wellData(w,:) = mat2cell(tmp,size(tmp,1),[1,1]);
end