Loading a CSV File taking forever - matlab

I am trying to load a 4.0 GB CSV file into Matlab. I have 40 GB of RAM. However, the table does not seem to finish loading. (Activity Monitor showed a fast increase of RAM use up to 38.64 GB and stopped after that. CPU still heavily in use.)
According to Apple's force quit menu, Matlab has not gotten stuck. (I'd guess a missing "Matlab is not responding" message signals that.)
1st Question: Why does it even take up that much RAM? I've read that Matlab duplicates data in RAM. Can I do something in this regard?
2nd Question: Can I speed this up? Split the CSV somehow?
3rd Question: Can I speed up my computer? It is taking forever, while using only 30% of CPU capacity... Why does it not use more? The vents are not crazy loud, so I guess "it's chilling".
Edit: It went up to 72.80 and is now decreasing...
Edit: Now back down at 55.something

There are a few concepts you should be aware of with Matlab.
Strings are stored as UINT16 (sort of, I can never get this right). Importantly what this means is that every character requires 2 bytes. If you stored the entire file as a long string it would take up 8 GB.
Values, whether they are arrays or scalars, are stored with headers. This means that storing a string (technically a character array; strings - the ones with double quotes instead of single quotes - may be different) requires a header that is roughly 104 bytes. So something like 'test' requires roughly 108 bytes! If you can store an array of numbers then the 104-byte overhead is minimal. If you have a cell array of scalars, then each scalar is taking up 112 bytes (assuming the scalar is an 8-byte double). This might be a bit confusing, but in the end it means that if you're not careful reading a CSV file your memory requirements can explode.
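For instance, here is a rough way to see the overhead yourself (the exact byte counts vary by MATLAB version and platform, so treat the numbers as illustrative):
%Compare a plain numeric array with a cell array of the same scalars
x = rand(1000,1);                %one array: ~8 bytes per value plus one header
c = num2cell(x);                 %cell of scalars: one header per element
sx = whos('x'); sc = whos('c');
fprintf('array: %d bytes, cell array: %d bytes\n', sx.bytes, sc.bytes);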
So what can you do? Tables store columns as arrays where possible. You can try readtable, although I think the underlying implementation might not be memory efficient.
For large files Matlab suggests using the datastore function. It will fix your memory problem although it may be a bit slow.
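A minimal sketch of the datastore approach (the file name and chunk size are placeholders for your own):
ds = datastore('bigfile.csv');   %tabular text datastore over the CSV
ds.ReadSize = 100000;            %rows per chunk
while hasdata(ds)
    chunk = read(ds);            %one chunk as a table
    %process chunk here instead of holding the whole file in memory
end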
The other option is to read the entire file into memory and to do your own custom processing. For example, assuming you don't have anything escaped (i.e. commas that are not actually delimiters), you can find all relevant delimiters by using:
%Find comma or newline
I = regexp(temp,',|\n')
Here's an example of extracting various columns. As indicated above, this has a large overhead for strings (character arrays) but is efficient for numbers.
%Fake data as an example, 3 columns with middle one numeric
temp = sprintf('asdf,1234,temp\nfred,324,chip\ncheese,12,you are always right');
I = regexp(temp,',|\n');
starts = [0 I];
ends = [I length(temp)+1];
n_columns = 3;
%extract column 2
c2 = arrayfun(@(x,y) str2double(temp(x+1:y-1)),starts(2:n_columns:end),ends(2:n_columns:end));
%extract column 1
c1 = arrayfun(@(x,y) temp(x+1:y-1),...
starts(1:n_columns:end),ends(1:n_columns:end),'un',0);
Depending on your use case this may work or it may not. To read the file into memory you can use fileread.
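For example, a minimal sketch of that approach (the file name is a placeholder):
temp = fileread('bigfile.csv');  %whole file as one char array
I = regexp(temp,',|\n');         %delimiter positions, as above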

In answer to question (2): it is quite straightforward to split the csv up, assuming there are more rows than columns...
bigfile = csvread(filename);
bigLen = length(bigfile);
halfLen = uint64(bigLen/2);
csvwrite('first.csv', bigfile(1:halfLen,:));
csvwrite('second.csv', bigfile(halfLen+1:bigLen,:));
Or even doing this with SEVERAL files; it may not make it faster overall, but it would allow you to observe the process as each file is read.
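A sketch of the several-files variant, reusing bigfile and bigLen from above (nChunks and the file names are arbitrary choices):
nChunks = 4;
edges = round(linspace(0, bigLen, nChunks+1));
for k = 1:nChunks
    csvwrite(sprintf('part%d.csv', k), bigfile(edges(k)+1:edges(k+1), :));
end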

I think that MatLab itself has a limit on how much input it is allowed to take in. I'm sure you can set that in the preferences if you have a high enough version.
Check this out: http://www.mathworks.com/help/matlab/matlab_env/set-workspace-and-variable-preferences.html

Related

MATLAB: Large text file to Matrix conversion

I need to read a text file in MATLAB containing 436 float values in each line (text file size is 25GB, so you can estimate the number of lines) and then convert it to a matrix so that I can take the transpose. How to do it? Won't the format specifier be too long?
Let's assume that the floats in your file are written in a format such that you have 15 digits after the decimal point. So we have 17 characters for each float. Then, let's also assume that they are separated by a character, and that we have a \n at the end of each line, so that we will have 436*18 = 7848 characters per line, which, assuming ASCII characters, will use one byte each. Then, your file uses about 25G of memory, so you can say that you have (25*2^30)/7848 ≈ 3.42e+06 lines (using 2^30 bytes in a gigabyte; the scale is roughly the same if you prefer the definition as 10^9 bytes).
So, a matrix of size 4e6-by-436 (I'm assuming a somewhat larger upper bound on your matrix size) will, assuming each float takes 4 bytes, take roughly 6.5G. This is nothing crazy, and you can find this amount of contiguous memory to allocate when reading the matrix using the load function, given you have a decent amount of RAM on your machine. I have 8G at the moment, and rand(4*1e6,436) allocates a matrix of that size (as doubles, so about twice that memory), although it ends up using the swap space and slowing down. I assume that load itself will have some amount of overhead, but if you have 16G of RAM (which is not crazy nowadays) you can safely go with load.
Now if you think that you won't find that much contiguous memory, I suggest that you just separate your file into chunks, like 10 matrices, and load and transpose them separately. How you do that is up to you and depends on the application, and on whether there are any sparsity patterns in the data. Also, make sure (if you don't need the extra precision) that you are using single float precision.
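A minimal sketch of reading in chunks with textscan and storing single precision (the file name, chunk size and the 436-column format are the assumptions here):
fid = fopen('bigdata.txt','r');
rowsPerChunk = 100000;
fmt = repmat('%f', 1, 436);               %436 floats per line
chunks = {};
while ~feof(fid)
    raw = textscan(fid, fmt, rowsPerChunk, 'CollectOutput', true);
    if isempty(raw{1}), break; end
    chunks{end+1} = single(raw{1}).';     %convert to single and transpose
end
fclose(fid);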

How to write "Big Data" to a text file using Matlab

I am getting some readings off an accelerometer connected to an Arduino which is in turn connected to MATLAB through serial communication. I would like to write the readings into a text file. A 10 second reading will write around 1000 entries that make the text file size around 1 kbyte.
I will be using the following code:
%%%%%// Communication %%%%%
arduino=serial('COM6','BaudRate',9600);
fopen(arduino);
fileID = fopen('Readings.txt','w');
%%%%%// Reading from Serial %%%%%
vib = [];                          %readings kept in memory (not preallocated)
for i = 1:Samples
    scan = fscanf(arduino,'%f');
    if isfloat(scan)
        vib = [vib; scan];
        fprintf(fileID,'%0.3f\r\n',scan);
    end
end
Any suggestions on improving this code ? Will this have a time or Size limit? This code is to be run for 3 days.
Do not use text files, use binary files. 42718123229.123123 is 18 bytes in ASCII, but only 4 bytes in a binary file as a single-precision float (8 bytes as a double). Don't waste space unnecessarily. If your data is going to be used later in MATLAB, then I suggest you just save it in .mat files.
Do not use a single file! Choose a reasonable file size (e.g. 100 MB) and make sure that when you reach that amount of data you switch to another file. You could do this by e.g. saving a file per hour. This way you minimize the possible errors that may happen if the software crashes 2 minutes before finishing.
Now, knowing the real dimensions of your problem, writing a text file is totally fine; nothing special is required to process such small data. But there is a problem with your code. You are growing a variable vib over time. That may cause bad performance because you are not using preallocation, and it may consume a lot of memory. I strongly recommend not keeping this variable; if you need the data, read it back from the file afterwards.
Another thing you should consider is verification of your data. What do you do when you receive less samples than you expect? Include timestamps! Be aware that these timestamps are not precise because you add them afterwards, but it allows you to identify if just some random samples are missing (may be interpolated afterwards) or some consecutive series of maybe 100 samples is missing.
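A rough sketch that combines these suggestions - binary output, one file per hour, and a timestamp stored with each sample (the port, Samples and the file names are placeholders, and it assumes one value per read):
arduino = serial('COM6','BaudRate',9600);
fopen(arduino);
fid = -1; currentHour = -1;
for i = 1:Samples
    scan = fscanf(arduino,'%f');
    if isfloat(scan)
        c = clock;
        if c(4) ~= currentHour                 %rotate to a new file every hour
            if fid > 0, fclose(fid); end
            fid = fopen(sprintf('readings_%s.bin', datestr(now,'yyyymmdd_HH')), 'w');
            currentHour = c(4);
        end
        fwrite(fid, [now; scan], 'double');    %timestamp followed by the value
    end
end
if fid > 0, fclose(fid); end
fclose(arduino);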

matlab cell array size much larger than actual data

I recently discovered a strange behaviour of MATLAB cell arrays that was not happening before.
If I create a cell array with
a=cell(1,4)
its size is 32 bytes.
If then I put something inside, e.g.
a{2}='abcd'
its size becomes 144 bytes. But if I remove this content by putting
a{2}=[]
the size becomes 132 bytes and so on. What is the problem?
Simply put, the Matlab cell array needs some internal data structures to keep track of what is stored within.
As it seems, Matlab allocates memory as needed, and thus extends the storage needed by the cell array as you insert data.
Removing the data doesn't mean that Matlab can return the now unused memory to the OS or to an internal memory pool -- that might either be impossible with the internal storage structure, or unwise with respect to performance, because cell arrays from which data is removed are (speaking across all use cases of cell arrays) structures that get updated often, so that "prematurely" returning memory just to acquire it back again a few instructions later would be pretty CPU-intense.
As a general note: Matlab has pretty terrible storage approaches for nearly everything but matrices and sparse matrices (vectors of course being special cases of matrices). That's because it's not Matlab's job to be e.g. a string parser etc.
If memory becomes a problem, it might be worth considering implementing the math core of your problem in Matlab and doing the rest in other, more generally usable programming languages and somehow interfacing your Matlab code with that -- I haven't tried that myself, but Mathworks has a Matlab engine for python, and I'd take writing python for things like storing arbitrary data over using Matlab every day; with that engine, you can call Matlab to do your dirty math work, and use python to do your everyday scripting/programming work.
Notice that my bottom line here is that Matlab has great Math routines and impressive documentation, but if you want to actually develop software, using a general purpose tool/language is much more likely to be satisfying quickly.
I'd even go as far as saying that it's probably worth your time to learn python, just to be able to circumvent having to deal with things that Matlab wasn't designed for (and cell arrays are a prime example of something that is really complicated in Matlab and extremely easy in python).
You use
a{2}=[]
to 'kill' the data in that field. In reality you still leave a non-empty cell entry: it now holds an empty double array. (Thanks to Matlab for representing empty cells as empty doubles...)
but if you use (no curly braces, but parentheses):
a(2) = cell(1,1)
then the cell array size is back to "empty" = 32 bytes.
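You can check the difference yourself (the exact byte counts vary with the MATLAB version):
a = cell(1,4);    w = whos('a'); w.bytes     %32: four empty cells
a{2} = 'abcd';    w = whos('a'); w.bytes     %grows: header plus 2-byte chars
a{2} = [];        w = whos('a'); w.bytes     %still carries an empty double
a(2) = cell(1,1); w = whos('a'); w.bytes     %back down to the empty-cell size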

loading large csv files in Matlab

I had csv files of size 6 GB and I tried using the import function in Matlab to load them, but it failed due to a memory issue. Is there a way to reduce the size of the files?
I think the no. of columns is causing the problem. I have 133076 rows by 2329 columns. I had another file with the same no. of rows but only 12 columns and Matlab could handle that. However, once the number of columns increases, the files get really big.
Ultimately, if I can read the data column-wise so that I have 2329 column vectors of length 133076, that would be great.
I am using Matlab 2014a
Numeric data are by default stored by Matlab in double precision format, which takes up 8 bytes per number. Data of size 133076 x 2329 therefore take up 2.3 GiB in memory. Do you have that much free memory? If not, reducing the file size won't help.
If the problem is not that the data themselves don't fit into memory, but is really about the process of reading such a large csv-file, then maybe using the syntax
M = csvread(filename,R1,C1,[R1 C1 R2 C2])
might help, which allows you to only read part of the data at one time. Read the data in chunks and assemble them in a (preallocated!) array.
If you do not have enough memory, another possibility is to read chunkwise and then convert each chunk to single precision before storing it. This reduces memory consumption by a factor of two.
And finally, if you don't process the data all at once, but can implement your algorithm such that it uses only a few rows or columns at a time, that same syntax may help you to avoid having all the data in memory at the same time.
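A minimal sketch of that idea, reading column blocks with the range syntax and keeping the result in single precision (the dimensions match the question; the block size is arbitrary):
nRows = 133076; nCols = 2329; blk = 100;
M = zeros(nRows, nCols, 'single');          %preallocated output
for c0 = 0:blk:nCols-1                      %csvread ranges are zero-based
    c1 = min(c0+blk, nCols) - 1;
    M(:, c0+1:c1+1) = single(csvread(filename, 0, c0, [0 c0 nRows-1 c1]));
end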

Elias Gamma Coding and upper bound

While reading about Elias Gamma coding on wikipedia, I see it mentions that:
"Gamma coding is used in applications where the largest encoded value is not known ahead of time."
and that:
"It is used most commonly when coding integers whose upper-bound cannot be determined beforehand."
I don't really understand what is meant by these sentences, because whenever this algorithm is coded, the largest value of the test data or the range of the test data would be known beforehand. Any help is appreciated!
As far as I'm acquainted with Elias-gamma/delta encoding, the first sentence simply states that these compression methods are global, which means that it does not rely on the input data to generate the code. In other words, these methods do not need to process the input before performing the compression (as local methods do); it compresses the data with a function that does not depend on information from the database.
As for the second sentence, it may be taken as a guarantee that, although there may be some very large integers, the encoding will still perform well (and will represent such values with a feasible number of bytes, i.e., it is a universal method). Notice that, if you knew the biggest integer, some approaches (like minimal hashes) could perform better.
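For reference, a minimal sketch of the gamma code itself in MATLAB (N leading zeros followed by the (N+1)-bit binary form of x), which works without knowing any upper bound in advance:
x = 10;
N = floor(log2(x));                         %number of leading zeros
bits = [zeros(1,N), double(dec2bin(x)) - '0']
%bits = 0 0 0 1 0 1 0, the Elias gamma code for 10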
As a last consideration, the same page you referred to also states that:
Gamma coding is used in applications where the largest encoded value is not known ahead of time, or to compress data in which small values are much more frequent than large values.
This may be obtained by generating lists of differences from the original lists of integers, and passing such differences to be compressed instead. For example, in a list of increasing numbers, you could generate:
list: 1 5 29 32 35 36 37
diff: 1 4 24 3 3 1 1
Which will give you many more small numbers, and therefore a greater level of compression, than the first list.
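In MATLAB, for example, the difference list and its inverse are one-liners:
list = [1 5 29 32 35 36 37];
d = [list(1), diff(list)]                   %-> 1 4 24 3 3 1 1
orig = cumsum(d)                            %recovers the original list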