MATLAB: Large text file to Matrix conversion

I need to read a text file in MATLAB containing 436 float values in each line (text file size is 25GB, so you can estimate the number of lines) and then convert it to a matrix so that I can take the transpose. How to do it? Won't the format specifier be too long?

Let's assume that the floats in your file are written with 15 digits after the decimal point, so each float takes 17 characters. Let's also assume that they are separated by one character and that each line ends with a \n, so each line has 436*18=7848 characters, which in ASCII take one byte each. Your file is about 25G on disk, so it has roughly (25*2^30)/7848=3.4204e+06 lines (using 2^30 bytes per gigabyte; the scale is roughly the same if you prefer the definition as 10^9 bytes).
So, a matrix of size 4e6 by 436 (I'm assuming a generous upper bound on your matrix size) will, assuming each float takes 4 bytes (single precision), take roughly 6.5G. Keep in mind that MATLAB stores numbers as double by default (8 bytes each), which doubles that to about 13G. This is nothing crazy, and you can find this amount of contiguous memory to allocate when reading the matrix using the load function, given you have a decent amount of RAM on your machine. I have 8G at the moment, and rand(4*1e6,436) allocates the memory, although it ends up using the swap space and slowing down. I assume that load itself will have some amount of overhead, but if you have 16G of RAM (which is not crazy nowadays) you should be able to go with load.
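The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch under the same assumptions (17 characters per float, one separator, one byte per character); the variable names are made up for illustration.

```matlab
% Rough size estimate for the 25 GB text file (all figures are assumptions).
chars_per_line = 436 * 18;              % 17 chars per float + 1 separator or newline
n_lines = 25 * 2^30 / chars_per_line;   % about 3.42e6 lines

n_elements = 4e6 * 436;                 % generous upper bound on the matrix size
gib_single = n_elements * 4 / 2^30;     % about 6.5 GiB as single
gib_double = n_elements * 8 / 2^30;     % about 13 GiB as double (MATLAB's default)

fprintf('lines: %.4g, single: %.2f GiB, double: %.2f GiB\n', ...
        n_lines, gib_single, gib_double);
```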
Now if you think that you won't find that much contiguous memory, I suggest that you split your file into chunks, say 10 matrices, and load and transpose them separately. How you do that is up to you and depends on the application, and on whether there are any sparsity patterns in the data. Also, make sure (if you don't need the extra precision) that you are using single precision.
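As for the format specifier: one way to sidestep it is fscanf with a single '%f' and a size argument. Because fscanf fills its output column by column, asking for [n_cols, chunk_lines] values returns each line of the file as one column, which is already the transpose. The sketch below is a minimal illustration, not a tuned solution: the file name, the tiny sizes, and the chunk length are all made up, and n_lines is known here only because the demo writes the file itself.

```matlab
% Write a small demo file: n_lines lines of n_cols floats (436 in the question).
fname = fullfile(tempdir, 'floats_demo.txt');
n_cols = 4;
n_lines = 10;
original = rand(n_lines, n_cols);
fid = fopen(fname, 'w');
fprintf(fid, [repmat('%.15f ', 1, n_cols - 1) '%.15f\n'], original.');
fclose(fid);

% Read it back a few lines at a time, storing the transpose in single precision.
chunk_lines = 3;                           % lines per chunk (assumption)
T = zeros(n_cols, n_lines, 'single');      % preallocated transposed result
fid = fopen(fname, 'r');
col = 0;
while true
    block = fscanf(fid, '%f', [n_cols, chunk_lines]);  % one file line per column
    if isempty(block)
        break;                             % nothing left to read
    end
    n = size(block, 2);
    T(:, col+1:col+n) = single(block);
    col = col + n;
end
fclose(fid);
```

For the real file you would replace n_cols with 436, pick a much larger chunk_lines, and size T from an estimate of the line count.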

Related

Loading a CSV File taking forever

I am trying to load a 4.0 GB CSV file into MATLAB. I have 40GB of RAM. However, the table does not seem to finish loading. (Activity Monitor showed a fast increase of RAM use up to 38.64GB, which stopped after that. The CPU is still heavily in use.)
According to Apple's Force Quit menu, MATLAB has not gotten stuck. (I'd guess the missing "not responding" message signals that.)
1st Question: Why does it even take up that much RAM? I've read RAM duplicates. Can I do something in this regard?
2nd Question: Can I speed this process up? Split the CSV somehow?
3rd Question: Can I speed up my computer? It is taking forever, while using only 30% of CPU capacity... Why does it not use more? The vents are not crazy loud, so I guess "it's chilling".
Edit: It went up to 72.80 and is now decreasing...
Edit: Now back down at 55.something
There are a few concepts you should be aware of with Matlab.
Strings are stored as UINT16 (sort of, I can never get this right). What matters is that every character requires 2 bytes. If you stored the entire 4 GB file as one long string it would take up 8 GB.
Values, whether they are arrays or scalars, are stored with headers. This means that storing a string (technically a character array; strings, the ones with double quotes instead of single quotes, may be different) requires a header that is roughly 104 bytes. So something like 'test' requires roughly 112 bytes (104 for the header plus 2 for each of the 4 characters)! If you can store an array of numbers then the 104 byte overhead is negligible. If you have a cell array of scalars, then each scalar takes up 112 bytes (assuming the scalar is an 8 byte double). This might be a bit confusing, but in the end it means that if you're not careful when reading a CSV file, your memory requirements can explode.
So what can you do? Tables store columns as arrays where possible. You can try readtable, although I think the underlying implementation might not be memory efficient.
For large files Matlab suggests using the datastore function. It will fix your memory problem although it may be a bit slow.
The other option is to read the entire file into memory and to do your own custom processing. For example, assuming you don't have anything escaped (i.e. commas that are not actually delimiters), you can find all relevant delimiters by using:
%Find comma or newline
I = regexp(temp,',|\n')
Here's an example of extracting various columns. As indicated above, this has a large overhead for strings (character arrays) but is efficient for numbers.
%Fake data as an example, 3 columns with middle one numeric
temp = sprintf('asdf,1234,temp\nfred,324,chip\ncheese,12,you are always right');
I = regexp(temp,',|\n');
starts = [0 I];
ends = [I length(temp)+1];
n_columns = 3;
%extract column 2
c2 = arrayfun(@(x,y) str2double(temp(x+1:y-1)),starts(2:n_columns:end),ends(2:n_columns:end));
%extract column 1
c1 = arrayfun(@(x,y) temp(x+1:y-1),...
starts(1:n_columns:end),ends(1:n_columns:end),'UniformOutput',false);
Depending on your use case this may work or it may not. To read the file into memory you can use fileread.
In answer to question(2): It is quite straightforward to split the csv up, assuming there are more rows than columns...
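bigfile = csvread(filename);
%number of rows (length() would return the largest dimension instead)
nRows = size(bigfile,1);
half = floor(nRows/2);
csvwrite('first.csv', bigfile(1:half,:));
%start at half+1 so the middle row is not written twice
csvwrite('second.csv', bigfile(half+1:nRows,:));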
You could even do this with several files; it may not make it faster overall, but it would allow you to observe the process as each file is read.
I think that MATLAB itself has a limit on how much input it is allowed to take in. I believe you can set that in the preferences if you have a recent enough version.
Check this out: http://www.mathworks.com/help/matlab/matlab_env/set-workspace-and-variable-preferences.html

MATLAB taking huge 350 mb memory to write one column vector to txt file

I have a variable sndpwr which has 18000 rows and just one column. When I use fprintf, the resulting txt file is 350mb. Even csvwrite and dlmwrite produce files of 200+ mb.
Can anyone tell me any function or method that will write it in a small text file. I am importing it in another program which is not able to import such large files.
fid = fopen('sndpwr.txt','wt');
fprintf(fid,'%0.6f\r\n',sndpwr');
fclose(fid);
Thanks!
Edit: in the workspace it is described as 31957476x1 double. Sorry for my previous incorrect data.
Unfortunately, there is no way to shrink your data without using an actual compression algorithm. You have about 3x10^7 numbers, each written with six digits after the decimal point, at least one before it, the decimal point itself, and a couple of newline characters. This gives 3x10^7 * 10 = 3x10^8 bytes as a bare minimum. Since 1MB is approximately 10^6 bytes, you are getting a file on the order of 300MB.
If you were to write the file in binary using the double datatype, the file would likely be about 20% smaller since doubles are generally 64-bit (8 byte) numbers. If you were to use the single datatype, there might be some information lost since single can only hold approximately 7 significant decimal digits, but the file would only be 40% of its current size.
If binary is not an option, you can always split the data into smaller text files.
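For reference, the binary route could look roughly like the sketch below. It assumes the importing program can read raw 8-byte doubles, which the question does not confirm; the file name and the short stand-in vector are made up for illustration.

```matlab
% Stand-in for the real 31957476x1 vector.
sndpwr = rand(1000, 1);

% Write raw doubles: 8 bytes per value, no text formatting.
bin_name = fullfile(tempdir, 'sndpwr_demo.bin');
fid = fopen(bin_name, 'w');
fwrite(fid, sndpwr, 'double');
fclose(fid);

% Read the values back to check the round trip.
fid = fopen(bin_name, 'r');
restored = fread(fid, Inf, 'double');
fclose(fid);

info = dir(bin_name);   % info.bytes is the file size on disk
```

Passing 'single' instead of 'double' to fwrite and fread halves the file again, at the cost of precision.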

loading large csv files in Matlab

I had csv files of size 6GB and I tried using the import function on Matlab to load them but it failed due to memory issue. Is there a way to reduce the size of the files?
I think the number of columns is causing the problem. I have 133076 rows by 2329 columns. I had another file with the same number of rows but only 12 columns, and Matlab could handle that. However, once the number of columns increases, the files get really big.
Ultimately, if I can read the data column-wise so that I have 2329 column vectors of length 133076, that will be great.
I am using Matlab 2014a
Numeric data are by default stored by Matlab in double precision format, which takes up 8 bytes per number. Data of size 133076 x 2329 therefore take up 2.3 GiB in memory. Do you have that much free memory? If not, reducing the file size won't help.
If the problem is not that the data themselves don't fit into memory, but is really about the process of reading such a large csv-file, then maybe using the syntax
M = csvread(filename,R1,C1,[R1 C1 R2 C2])
might help, which allows you to only read part of the data at one time. Read the data in chunks and assemble them in a (preallocated!) array.
If you do not have enough memory, another possibility is to read chunkwise and then convert each chunk to single precision before storing it. This reduces memory consumption by a factor of two.
And finally, if you don't process the data all at once, but can implement your algorithm such that it uses only a few rows or columns at a time, that same syntax may help you to avoid having all the data in memory at the same time.
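A chunked read into a preallocated single-precision array might look like the following sketch. It uses dlmread with an explicit comma delimiter, which takes the same zero-based [R1 C1 R2 C2] range as the csvread call above. The file name, the tiny 12-by-5 data set, and the chunk size are hypothetical; for the real file n_rows and n_cols would be 133076 and 2329.

```matlab
% Create a small CSV file so the sketch is self-contained.
fname = fullfile(tempdir, 'chunk_demo.csv');
data = reshape(1:60, 12, 5);
csvwrite(fname, data);

n_rows = 12;
n_cols = 5;
chunk = 5;                                  % rows per chunk (assumption)
M = zeros(n_rows, n_cols, 'single');        % preallocate the full result

for r1 = 0:chunk:n_rows-1                   % ranges are zero-based
    r2 = min(r1 + chunk, n_rows) - 1;       % last row of this chunk
    block = dlmread(fname, ',', [r1 0 r2 n_cols-1]);
    M(r1+1:r2+1, :) = single(block);        % convert before storing
end
```

Only one chunk is ever held in double precision, so peak memory is roughly the single-precision result plus one chunk.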

Largest Matrix Matlab Linprog can Support

I want to use MATLAB linprog to solve a problem, and I check it by a much smaller, much simpler example.
But I wonder if MATLAB can support my real problem, there may be a 300*300*300*300 matrix...
Maybe I should give the exact problem. There is a directed graph of network nodes, and I want to get the lowest utilization of the edge capacity under some constraints. Let m be the number of edges, and n be the number of nodes. There are mn² variables and nm² constraints. Unfortunately, n may reach 300...
I want to use MATLAB linprog to solve it. As described above, I am afraid MATLAB can not support it...Lastly the matrix must be sparse, can some way simplify it?
First: a 300*300*300*300 array is not called a matrix, but a tensor (or simply array). Therefore you can not use matrix/vector algebra on it, because that is not defined for arrays with dimensionality greater than 2, and you can certainly not use it in linprog without some kind of interpretation step.
Second: if I interpret that 300⁴ to represent the number of elements in the matrix (and not the size), it really depends if MATLAB (or any other software) can support that.
As already answered by ben, if your matrix is full, then the answer is likely to be no. 300^4 doubles would consume almost 65GB of memory, so it's quite unlikely that any software package is going to be capable of handling that all from memory (unless you actually have > 65 GB of RAM). You could use a blockproc-type scheme, where you only load parts of the matrix in memory and leave the rest on harddisk, but that is insanely slow. Moreover, if you have matrices that huge, it's entirely possible you're overlooking some ways in which your problem can be simplified.
If your matrix is sparse (i.e., contains lots of zeros), then maybe. Have a look at MATLAB's sparse command.
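To give a feel for the difference, here is a small sketch that builds a sparse matrix from index/value triplets and compares its footprint with the full equivalent. The size and the diagonal pattern are made up; the real constraint matrix would have its own structure.

```matlab
n = 1e4;                                   % hypothetical size, far smaller than the real problem
idx = (1:n).';
S = sparse(idx, idx, ones(n, 1), n, n);    % n-by-n matrix, only the diagonal is stored

info = whos('S');                          % info.bytes is the memory actually used
full_bytes = n * n * 8;                    % what a full double matrix would need
fprintf('sparse: %d bytes, full: %d bytes\n', info.bytes, full_bytes);
```

Only the nonzero entries and their indices are stored, so memory grows with the number of nonzeros rather than with the number of elements.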
So, what exactly is your problem? Where does that enormous matrix come from? Perhaps I or someone else sees a way in which to reduce that matrix to something more manageable.
On my system, with 24GByte RAM installed, running Matlab R2013a, memory gives me:
Maximum possible array: 44031 MB (4.617e+10 bytes) *
Memory available for all arrays: 44031 MB (4.617e+10 bytes) *
Memory used by MATLAB: 1029 MB (1.079e+09 bytes)
Physical Memory (RAM): 24574 MB (2.577e+10 bytes)
* Limited by System Memory (physical + swap file) available.
On a 64-bit version of Matlab, if you have enough RAM, it should be possible to at least create a full matrix as big as the one you suggest, but whether linprog can do anything useful with it in a realistic time is another question entirely.
As well as investigating the use of sparse matrices, you might consider working in single precision: that halves your memory usage for a start.
well you could simply try: X=zeros( 300*300*300*300 )
on my system it gives me a very clear statement:
>> X=zeros( 300*300*300*300 )
Error using zeros
Maximum variable size allowed by the program is exceeded.
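Since zeros is a built-in function that only fills an array of the given size with zeros, you can assume that handling such an array will not be possible. (Note that zeros(n) with a single argument requests an n-by-n matrix, so this call actually asks for far more than 300^4 elements; zeros(300,300,300,300) requests exactly that many.)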
you can also use the memory command
>> memory
Maximum possible array: 21549 MB (2.260e+10 bytes) *
Memory available for all arrays: 21549 MB (2.260e+10 bytes) *
Memory used by MATLAB: 685 MB (7.180e+08 bytes)
Physical Memory (RAM): 12279 MB (1.288e+10 bytes)
* Limited by System Memory (physical + swap file) available.
>> 2.278e+10 /8
%max bytes avail for arrays divided by 8 bytes for double-precision real values
ans =
2.8475e+09
>> 300*300*300*300
ans =
8.1000e+09
which means I don't even have the memory to store such an array.
While this may not answer your question directly, it might still give you some insight.

Does MATLAB execute basic array operations in constant space?

I am getting an out of memory error on this line of MATLAB code:
result = (A(1:xmax,1:ymax,1:zmax) .* B(2:xmax+1,2:ymax+1,2:zmax+1) + ...
A(2:xmax+1,2:ymax+1,2:zmax+1) .* B(1:xmax,1:ymax,1:zmax)) ./ C
where C is another array. This is on 32 bit MATLAB (I can't seem to get the 64 bit version at the moment, which would temporarily fix my problems).
The arrays result, A, B, and C are pre-initialized and never change size. It is then my guess that this computation is not being performed in constant space.
Is this correct? Is there a way to make it run or check if it is running in constant space?
These arrays are of approximate size (250, 250, 250).
If MATLAB does not run this in constant size, does anyone have any experience as to whether Octave or Julia or (insert similar language) does?
edit 1:
I eliminated excess arrays. There are 10 arrays that are 258 x 258 x 338, which corresponds to 1.67 GB. There are a bunch of other variables but they are much smaller. The calculation presented is simplified, the form of the calculation is:
R = (A(3Drange) .* B(3Drange) + A(new_3Drange) .* D(new_3Drange) + . . . ) ./ C
where the ranges generally just differ by a shift of plus or minus 1 or 2.
The output of memory command:
Maximum possible array: 669 MB (7.013e+08 bytes) *
Memory available for all arrays: 1541 MB (1.616e+09 bytes) **
Memory used by MATLAB: 2209 MB (2.316e+09 bytes)
Physical Memory (RAM): 8154 MB (8.550e+09 bytes)
* Limited by contiguous virtual address space available.
** Limited by virtual address space available.
Apparently I should be violating the second line. However, the code runs fine until the first operation that I actually do with the arrays. Perhaps MATLAB is being lazy and not allocating when I type:
A=zeros(xmax+2,ymax+2,zmax+2);
but still telling me in the workspace that the variable is allocated.
This code has worked before with smaller arrays. (edit: but it seems the actual memory size is the problem, not the size of each individual array).
The very curious thing to me is why it does not error during allocation, but instead errors during the first calculation.
edit 2:
I have confirmed that the loop is not running in constant space. About 0.8 GB of memory is allocated during the calculation. [Image: resource usage while the command is executed in a loop]
However, I tried breaking up the computation into multiple lines. I split the computation at each addition and added each part in a new command, treating R as an accumulator. The result is that less memory is allocated at one time, but presumably more often. [Image: resource usage with the computation split up]
I am still curious as to why MATLAB doesn't want to execute this in constant space. I think it perhaps has something to do with the indexing being shifted - I am planning on investigating it more later and then putting this all together in an answer, but someone may beat me to it, which would be great also. Now, though, I can run the array size I was looking for and can finish my project.
I guess that most of the question has already been answered:
Does it operate in constant space?
No as you verified, it does not.
Why doesn't it operate in constant space?
Matlab claims to be fast at vectorized matrix operations; not so much emphasis is placed on memory efficiency. Each indexed sub-array and each intermediate result is stored in a temporary array of its own.
What to do now?
Here are different options. The first one is preferred if possible; the other two are certainly possible.
Make it fit, for example by upgrading to 64 bit Matlab or by not putting other stuff in your workspace
Work on parts of the matrix, so for example cut it in half
Don't use vectorization at all but make a simple for loop
If you don't vectorize, you will have a minimal space solution.
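A loop version of the expression from the question could look like the sketch below. The array names follow the question, but the sizes and random contents are made up so the example runs on its own. The vectorized line is included only to check that both forms agree.

```matlab
% Small hypothetical sizes; the question uses arrays of roughly 250^3.
xmax = 4; ymax = 3; zmax = 2;
A = rand(xmax+1, ymax+1, zmax+1);
B = rand(xmax+1, ymax+1, zmax+1);
C = rand(xmax, ymax, zmax) + 1;            % shifted away from zero
result = zeros(xmax, ymax, zmax);          % preallocated output

% Element-wise loop: no temporary sub-arrays are created.
for k = 1:zmax
    for j = 1:ymax
        for i = 1:xmax
            result(i,j,k) = (A(i,j,k) * B(i+1,j+1,k+1) + ...
                             A(i+1,j+1,k+1) * B(i,j,k)) / C(i,j,k);
        end
    end
end

% Vectorized form from the question, for comparison only.
vectorized = (A(1:xmax,1:ymax,1:zmax) .* B(2:xmax+1,2:ymax+1,2:zmax+1) + ...
              A(2:xmax+1,2:ymax+1,2:zmax+1) .* B(1:xmax,1:ymax,1:zmax)) ./ C;
```

The innermost loop runs over the first index so that memory is accessed in MATLAB's column-major order.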