Loading large CSV files in MATLAB

I have CSV files of size 6 GB and I tried using the import function in MATLAB to load them, but it failed due to a memory issue. Is there a way to reduce the size of the files?
I think the number of columns is causing the problem. I have 133076 rows by 2329 columns. I had another file with the same number of rows but only 12 columns, and MATLAB could handle that. However, once the number of columns increases, the files get really big.
Ultimately, if I could read the data column-wise so that I get 2329 column vectors of 133076 elements each, that would be great.
I am using MATLAB R2014a.

Numeric data are by default stored by MATLAB in double-precision format, which takes up 8 bytes per number. Data of size 133076 x 2329 therefore take up about 2.3 GiB in memory. Do you have that much free memory? If not, reducing the file size won't help.
If the problem is not that the data themselves don't fit into memory, but is really about the process of reading such a large CSV file, then maybe using the syntax
M = csvread(filename,R1,C1,[R1 C1 R2 C2])
might help, which allows you to read only part of the data at a time. Read the data in chunks and assemble them in a (preallocated!) array.
If you do not have enough memory, another possibility is to read chunkwise and then convert each chunk to single precision before storing it. This reduces memory consumption by a factor of two.
And finally, if you don't process the data all at once, but can implement your algorithm such that it uses only a few rows or columns at a time, that same syntax may help you to avoid having all the data in memory at the same time.
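Combining the chunked reading and the single-precision conversion, a minimal sketch might look like this (the file name 'data.csv' and the chunk size are assumptions; the dimensions are taken from the question):
%Read the file in row chunks into a preallocated single-precision array
nRows = 133076;
nCols = 2329;
chunk = 10000;                         % rows per read; tune to your memory
M = zeros(nRows, nCols, 'single');     % preallocate in single precision
for r1 = 0:chunk:nRows-1               % csvread uses zero-based indices
    r2 = min(r1 + chunk - 1, nRows - 1);
    block = csvread('data.csv', r1, 0, [r1 0 r2 nCols-1]);
    M(r1+1:r2+1, :) = single(block);   % convert and store (1-based rows)
end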

Related

How much storage do I need for Postgres compared to the data size

I did some analysis using some sample data and found that the table size is usually about twice as much as the raw data (importing a CSV file into a Postgres table, where the CSV file size is the raw data size).
The disk space used seems to be about 4 times the raw data, most likely because of the WAL log.
Is there any commonly used formula to estimate how much disk space I need if I want to store, say, 1 GB of data?
I know there are many factors affecting this; I would just like a quick estimate.

Loading a CSV file is taking forever

I am trying to load a 4.0 GB CSV file into MATLAB. I have 40 GB of RAM. However, the table does not seem to finish loading. (Activity Monitor showed a fast increase of RAM use up to 38.64 GB and it stopped after that. The CPU is still heavily in use.)
According to Apple's force-quit menu, MATLAB has not gotten stuck. (I'd guess the absence of a "MATLAB is not responding" message signals that.)
1st question: Why does it even take up that much RAM? I've read that the data gets duplicated in RAM. Can I do something in this regard?
2nd question: Can I speed this up? Split the CSV somehow?
3rd question: Can I speed up my computer? It is taking forever while using only 30% of CPU capacity. Why does it not use more? The fans are not crazy loud, so I guess it's "chilling".
Edit: RAM use went up to 72.80 GB and is now decreasing.
Edit: Now back down at 55-something.
There are a few concepts you should be aware of with MATLAB.
Characters are stored as 2-byte values (roughly UTF-16). Importantly, this means that every character requires 2 bytes, so if you stored the entire file as one long string it would take up 8 GB.
Values, whether they are arrays or scalars, are stored with headers. This means that storing a character array (strings, the ones with double quotes instead of single quotes, may behave differently) requires a header of roughly 104 bytes, so something like 'test' takes roughly 108 bytes! If you can store an array of numbers, the 104-byte overhead is negligible. If you have a cell array of scalars, however, each scalar takes up 112 bytes (assuming the scalar is an 8-byte double). This might be a bit confusing, but in the end it means that if you're not careful reading a CSV file, your memory requirements can explode.
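As a rough illustration of that overhead (exact byte counts vary between MATLAB versions and platforms), compare what whos reports for one numeric array and for a cell array holding the same values as separate scalars:
%Illustration only: one array versus many boxed scalars
x = rand(1000, 1);        % a single array: about 8 bytes per double plus one header
c = num2cell(x);          % 1000 separate scalars, each carrying its own header
whos x c                  % compare the Bytes column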
So what can you do? Tables store columns as arrays where possible. You can try readtable, although I think the underlying implementation might not be memory efficient.
For large files, MATLAB suggests using the datastore function. It will fix your memory problem, although it may be a bit slow.
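A minimal datastore sketch, assuming a hypothetical file name 'big.csv' (datastore is available from R2014b onward):
%Pull the file in block by block instead of all at once
ds = datastore('big.csv');          % or tabularTextDatastore('big.csv')
ds.ReadSize = 20000;                % rows returned per call to read()
while hasdata(ds)
    T = read(ds);                   % a table containing the next block
    % ... process T here ...
end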
The other option is to read the entire file into memory and to do your own custom processing. For example, assuming you don't have anything escaped (i.e. commas that are not actually delimiters), you can find all relevant delimiters by using:
%Find comma or newline
I = regexp(temp,',|\n')
Here's an example of extracting various columns. As indicated above, this has a large overhead for strings (character arrays) but is efficient for numbers.
%Fake data as an example, 3 columns with middle one numeric
temp = sprintf('asdf,1234,temp\nfred,324,chip\ncheese,12,you are always right');
I = regexp(temp,',|\n');
starts = [0 I];
ends = [I length(temp)+1];
n_columns = 3;
%extract column 2
c2 = arrayfun(@(x,y) str2double(temp(x+1:y-1)),starts(2:n_columns:end),ends(2:n_columns:end));
%extract column 1
c1 = arrayfun(@(x,y) temp(x+1:y-1),...
starts(1:n_columns:end),ends(1:n_columns:end),'un',0);
Depending on your use case this may work or it may not. To read the file into memory you can use fileread.
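For completeness, a minimal sketch of that read step feeding the extraction above (the file name is an assumption):
%Read the whole file into one character array, then locate the delimiters
temp = fileread('big.csv');
I = regexp(temp,',|\n');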
In answer to question 2: it is quite straightforward to split the CSV up, assuming there are more rows than columns...
bigfile = csvread(filename);
bigLen = size(bigfile, 1);                       % number of rows
halfLen = floor(bigLen/2);
csvwrite('first.csv', bigfile(1:halfLen, :));
csvwrite('second.csv', bigfile(halfLen+1:bigLen, :));
Or even do this with SEVERAL files; it may not make things faster overall, but it would allow you to observe the process as each file is read.
I think that MATLAB itself has a limit on how much input it is allowed to take in. I'm sure you can set that in the preferences if you have a high enough version.
Check this out: http://www.mathworks.com/help/matlab/matlab_env/set-workspace-and-variable-preferences.html

Performance issue with reading DICOM data into cell array

I need to read 4000 or more DICOM files. I have written the following code to read the files and store the data in a cell array so I can process it later. A single DICOM file contains 128 x 931 pixels of data. But once I execute the code, it takes more than 55 minutes to complete the iterations. Can someone point out the performance issues in the following code?
% read the file information from the disk into memory
readFile=dir('d:\images\*.dcm');
for i=1:4000
% Read the information from the dicom files into arrays
data{i}=dicomread(readFile(i).name);
info{i}=dicominfo(readFile(i).name);
data_double{i}=double(data{1,i}); % convert 16 bit data into double
first_chip{i}=data_double{1,i}(1:129,1:129); % extracting first chip data into an array
end
You are reading 128*931*4000 pixels into memory (assuming 16-bit values, that's nearly 1 GB), converting that to doubles (4 GB) and extracting a region (129*129*4000*8 = 0.5 GB). You are keeping all three of these copies, which is a terrible amount of data! Try not keeping all that data around:
readFile = dir('d:\images\*.dcm');
first_chip = cell(size(readFile));
info = cell(size(readFile));
for ii = 1:numel(readFile)
info{ii} = dicominfo(readFile(ii).name);
data = dicomread(info{ii});
data = data(1:129,1:129); % extracting first chip data
first_chip{ii} = double(data); % convert 16 bit data into double
end
Here, I have pre-allocated the first_chip and info arrays. If you don't do this, the arrays will be re-allocated every time you add an element, causing expensive copies. I have also extracted the ROI first, then converted to double, as suggested by Rahul in his answer. Finally, I am re-using the DICOM info structure to read the file. I don't know if this makes a big difference in speed, but it saves the dicomread function some effort.
But note that this process will still take a considerable amount of time. Reading DICOM files is complex, and takes time. I suggest you read them all in once, then save the first_chip and info cell arrays into a MAT-file, which will be a lot faster to read in at a later time.
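That last step might look like this (the MAT-file name is an assumption; the -v7.3 flag is only needed if the variables exceed 2 GB):
%Save the processed results once; later sessions only need to load this file
save('first_chips.mat', 'first_chip', 'info', '-v7.3');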
You can run the profiler to check which part of the code is taking up most of the time. But as far as I can tell, it is simply your iteration count, and the time taken is genuine. You could try parallel computing (a parfor loop) if you have a multicore processor; that should decrease the runtime significantly, depending on the number of cores you have.
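A rough parfor version of the loop above might look like this (it requires the Parallel Computing Toolbox; the folder path is taken from the question):
%Each worker reads and crops its own files independently
readFile = dir('d:\images\*.dcm');
first_chip = cell(size(readFile));
info = cell(size(readFile));
parfor ii = 1:numel(readFile)
    info{ii} = dicominfo(fullfile('d:\images', readFile(ii).name));
    data = dicomread(info{ii});
    first_chip{ii} = double(data(1:129, 1:129));   % crop first, then convert
end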
One suggestion would be to extract the 'first chip data' first and then convert it to double, as the conversion process takes a significant amount of time.

MATLAB taking a huge 350 MB to write one column vector to a txt file

I have a variable sndpwr which has 18000 rows and just one column. When I use fprintf, the resulting txt file is 350 MB. Even csvwrite and dlmwrite produce files of 200+ MB.
Can anyone suggest a function or method that will write it to a smaller text file? I am importing it into another program which cannot handle such large files.
fid = fopen('sndpwr.txt','wt');
fprintf(fid,'%0.6f\r\n',sndpwr');
fclose(fid);
Thanks!
EDIT: In the workspace it is described as 31957476x1 double. Sorry for my previous incorrect numbers.
Unfortunately, there is no way to compress your data without using an actual compression algorithm. You have 3x10^7 numbers, each written with six digits after the decimal point, at least one before it, and a couple of newline characters. This gives 3x10^7 * 10 = 3x10^8 bytes as a bare minimum. Since 1 MB is approximately 10^6 bytes, you get a file on the order of 300 MB.
If you were to write the file in binary using the double datatype, the file would likely be about 20% smaller, since doubles are 64-bit (8-byte) numbers. If you were to use the single datatype, some information might be lost, since single precision only holds about 7 significant decimal digits, but the file would be only 40% of its current size.
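If your other program can read raw binary, a minimal fwrite sketch (the output file name is an assumption) would be:
%Write the vector in binary: 4 bytes per value in single, 8 bytes in double
fid = fopen('sndpwr.bin', 'w');
fwrite(fid, sndpwr, 'single');    % use 'double' to keep full precision
fclose(fid);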
If binary is not an option, you can always split the data into smaller text files.

Large data file in MATLAB doesn't load/import

I have been trying to load a data file (CSV) into 64-bit MATLAB running on 64-bit Windows 7, but I get memory-related errors. The file size is about 3 GB, containing a date (dd/mm/yyyy hh:mm:ss) in the first column and bid and ask prices in the other two columns. The memory command returns the following:
Maximum possible array: 19629 MB (2.058e+010 bytes) *
Memory available for all arrays: 19629 MB (2.058e+010 bytes) *
Memory used by MATLAB: 522 MB (5.475e+008 bytes)
Physical Memory (RAM): 16367 MB (1.716e+010 bytes)
* Limited by System Memory (physical + swap file) available.
Can somebody here please explain why, if the maximum possible array size is 19.6 GB, MATLAB throws a memory error while importing a data array that is only about 3 GB? Apologies if this is a simple question for the experienced; I have little experience in process/application memory management.
I would also greatly appreciate a suggestion on how to load this dataset into the MATLAB workspace.
Thank you.
I am no expert in memory management, but from experience I can tell you that you will run into all kinds of problems if you're importing/exporting 3 GB text files.
I would either use an external tool to split your data before you read it, or look into storing that data in another format that is better suited to large datasets. Personally, I have used HDF5 in the past; it is designed for large sets of data and is also supported by MATLAB.
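A rough HDF5 sketch (the file and dataset names are assumptions, and the date column would need converting to numbers first, e.g. with datenum): write the prices in chunks once, then read back only the rows you need.
%Create a chunked dataset, append blocks, and read a slice back later
h5create('ticks.h5', '/prices', [3e7 2], 'ChunkSize', [1e5 2]);
block = rand(1e5, 2);                                         % stand-in for one parsed block
h5write('ticks.h5', '/prices', block, [1 1], size(block));
firstRows = h5read('ticks.h5', '/prices', [1 1], [1000 2]);   % partial read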
In the meantime, these links may help:
Working with a big CSV file in MATLAB
Handling Large Data Sets Efficiently in MATLAB
I've posted before showing how to use memmapfile() to read huge text files in MATLAB. This technique may help you as well.
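For reference, a minimal memmapfile sketch (the file name and slice size are assumptions): map the raw bytes of the text file and process them slice by slice instead of loading everything at once.
%Map the file as raw bytes and walk through it in manageable slices
m = memmapfile('bigdata.csv', 'Format', 'uint8');
nBytes = numel(m.Data);
sliceSize = 100*1024*1024;                       % 100 MB per slice
for startByte = 1:sliceSize:nBytes
    stopByte = min(startByte + sliceSize - 1, nBytes);
    txt = char(m.Data(startByte:stopByte))';     % convert this slice to text
    % ... parse txt here (mind lines split across slice boundaries) ...
end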