Loading huge data efficiently in MATLAB

I have a data set with a million rows in the following format:
User Gameid Count
A 1 2
A 2 3
A 10 2
A 8 2
B 10 2
B 1 6
....
I want to create a sparse matrix from this, with each row representing a user, each column representing a gameid, and each cell holding the count for the corresponding user and gameid.
The matrix is sparse, so MATLAB should be able to handle it.
How can I load such data directly, without iterating through each line, which would take a lot of time? Any suggestions on how to do it efficiently?
This is what I want:
Column  1  2  3  4  5  6  7  8  9  10
A       2  3                 2     2
B       6                          2

I think you can use textscan.
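fid = fopen(file);
content = textscan(fid, '%s%d%d', 'HeaderLines', 1); % skip the "User Gameid Count" header
fclose(fid);
user = content{1}; % cell array of user names
field = content{2}; % int32 vector of game IDs
count = content{3}; % int32 vector of counts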
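Once the three columns are in memory, the sparse matrix can be built without a loop. The following is only a minimal sketch under stated assumptions: the variables user, field and count are small hypothetical stand-ins for the textscan output, and they are already of class double (with real data read via %d, they would first need converting with double(...), since sparse stores values as double).

```matlab
% Hypothetical stand-ins for the columns returned by textscan
user  = {'A';'A';'A';'A';'B';'B'};
field = [1; 2; 10; 8; 10; 1];      % game IDs, used directly as column indices
count = [2; 3; 2; 2; 2; 6];

% Map each user name to a row number; names(k) labels row k
[names, ~, row] = unique(user);

% One call builds the whole matrix: S(row(k), field(k)) = count(k)
S = sparse(row, field, count, numel(names), max(field));
full(S)
```

Keeping names around is what lets you translate a row of S back to a user afterwards.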

Say your file is file.txt.
With a little manipulation you can get what you want (the sscanf step below assumes single-character user names, as in your sample):
fil = 'file.txt';
fid = fopen(fil,'r');
s = textscan(fid,'%s','Delimiter','\n'); % read every line as one string
fclose(fid);
s = s{1};
s = s(2:end); % drop the header line
ls = length(s);
fmat = '%s %u %u';
C = sscanf(char(s)',fmat); % user names come back as character codes
C = reshape(C,[3 ls])'; % one row per record: [name gameid count]
A = char(C(:,1));
data = cellstr(A);
data(:,2:3) = num2cell(C(:,2:3));
Users = unique(A);
S_MAT_DATA = spalloc(length(Users),max(C(:,2)),ls); % at most one nonzero per record
for aa = 1:length(Users)
    % game IDs (row 1) and counts (row 2) of the current user
    data_user = cell2mat(data(strncmp(Users(aa),data(:,1),1),2:3)');
    S_MAT_DATA(aa,data_user(1,:)) = data_user(2,:);
end

Related

Optimal way to find max value and indices for all indices in cell of matrices

I have 100 images each of size 512 by 512 stored in a cell array.
I want to find the max value and indices for each pixel location by searching all the images.
Here is the sample representation:
My code:
imgs = cell(1,5);
imgs{1} = [2,3,2;3,2,2;3,1,1];
imgs{2} = [2,3,1;4,2,3;2,2,1];
imgs{3} = [3,2,1;5,3,5;3,2,3];
imgs{4} = [4,4,2;5,3,4;4,2,2];
imgs{5} = [4,5,2;4,2,5;3,3,1];
[nrows, ncols] = size(imgs{1});
maxVal_Mat = zeros(nrows,ncols);
maxIdx_Mat = zeros(nrows,ncols);
for nrow = 1:nrows
    for ncol = 1:ncols
        [maxVal_Mat(nrow, ncol), maxIdx_Mat(nrow, ncol)] = max(cellfun(@(x) x(nrow, ncol) , imgs));
    end
end
maxVal_Mat =
4 5 2
5 3 5
4 3 3
maxIdx_Mat =
4 5 1
3 3 3
4 5 3
Any ideas on how to optimize this code to save execution time and memory?
Note: This is a sample demonstration of the problem, the original cell and matrices are quite large.
Thanks,
Gopi
Since all of your images are the same size it makes more sense to store them in a 3D matrix than a cell array, which also greatly simplifies performing operations like this on them. You can convert imgs from a cell array to a 3D matrix and find the maxima and indices like so:
imgs = cat(3, imgs{:}); % Concatenate into a 3D matrix
[maxValue, index] = max(imgs, [], 3) % Find max across third dimension
maxValue =
4 5 2
5 3 5
4 3 3
index =
4 5 1
3 3 3
4 5 3
There is some discussion of using cell arrays versus multidimensional arrays in this post. In general, a multidimensional array will give you better performance for many operations, but requires contiguous memory space for storage (which can cause you to hit memory limits quicker for increasing array size). Cell arrays don't require contiguous memory space and can therefore be more memory-efficient, but complicate certain operations.
I propose another solution that will probably:
- Increase the execution time
- Consume less memory
It is an option if your images are large and memory limitations prevent you from concatenating all of them.
Instead of loading all the images into a single 3D matrix, I compare the images pairwise, keeping a running maximum.
If I take your example:
imgs = cell(1,5);
imgs{1} = [2,3,2;3,2,2;3,1,1];
imgs{2} = [2,3,1;4,2,3;2,2,1];
imgs{3} = [3,2,1;5,3,5;3,2,3];
imgs{4} = [4,4,2;5,3,4;4,2,2];
imgs{5} = [4,5,2;4,2,5;3,3,1];
% Only for the first image
Mmax = imgs{1};
Mind = ones(size(imgs{1}));
for ii = 2:numel(imgs)
% Pairwise comparison with the running maximum
[Mmax,ind] = max(cat(3,Mmax,imgs{ii}),[],3);
Mind(ind == 2) = ii;
end
Results:
Mmax =
4 5 2
5 3 5
4 3 3
Mind =
4 5 1
3 3 3
4 5 3
In concrete terms, the same code would look like this:
% your list of images
file = {'a.img','b.img','c.img'};
I = imread(file{1});
Mmax = I;
Mind = ones(size(I));
for ii = 2:numel(file)
    I = imread(file{ii});
    % pairwise comparison with the running maximum
    [Mmax,ind] = max(cat(3,Mmax,I),[],3);
    Mind(ind == 2) = ii;
end
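As a variation on the pairwise idea (not part of the answer above), the temporary two-layer array created by cat can be avoided with a logical mask. This is a sketch only, reusing the sample imgs from the question; the variable name mask is introduced here for illustration. The strict > keeps the earlier image on ties, which matches how max reports the first index.

```matlab
imgs = {[2,3,2;3,2,2;3,1,1], [2,3,1;4,2,3;2,2,1], [3,2,1;5,3,5;3,2,3], ...
        [4,4,2;5,3,4;4,2,2], [4,5,2;4,2,5;3,3,1]};

Mmax = imgs{1};                 % running maximum, starts as the first image
Mind = ones(size(Mmax));        % index of the image holding the maximum

for ii = 2:numel(imgs)
    mask = imgs{ii} > Mmax;     % pixels where the new image is strictly larger
    Mmax(mask) = imgs{ii}(mask);
    Mind(mask) = ii;
end
```

Only one image plus the two result matrices and the mask are held at any time, so the same loop body could be used with images read from disk one by one.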

How to efficiently find in some dataset the number of occurrences of a given list of items, without using loops?

I have a dataset, M, where some items and their category types are stored in columns 1 and 2 respectively. The vector cat stores the unique category types present in M. Vector Y is a subset of items in M. I want to find how many times each category type is associated with the items in Y. This is the code I have written to do this:
cat(:,1) = unique(M(:,2)); % Unique category types in M
cat(:,2) = zeros(size(cat,1),1); % initialize column 2 of cat to 0s
N = size(Y,1);
for i=1:N
item = Y(i,1);
temp = M(M(:,1)==item,:);
C(:,1) = unique(temp(:,2));
C(:,2) = histc(temp(:,2), unique(temp(:,2))); % Frequency of items in temp(:,2)
for j=1:size(cat,1)
for k=1:size(C,1)
if cat(j,1)==C(k,1)
cat(j,2) = cat(j,2)+C(k,2);
end
end
end
clear C; clear temp; clear item;
end
But this is obviously slow for even moderately sized M, Y and cat. How do I make it faster?
To illustrate with an example, say:
M=[3 2
4 12
1 7
3 4
2 10
1 6
4 19
4 6
3 12
1 10
2 12];
And,
Y=[2;3];
Then I want the output cat to be the following:
cat=[2 1
4 1
6 0
7 0
10 1
12 2
19 0];
If I understand correctly, you want a histogram of the categories of those items in M that also appear in Y.
First store the unique categories in the first column of Cat:
Cat(:,1) = unique(M(:,2));
Using ismember, you can find which rows of M hold items that also appear in Y:
idx = ismember(M(:,1), Y);
Use that index to keep only the desired rows and save them to temp:
temp = M(idx, :);
Form the histogram of temp over the unique values in Cat(:,1):
Cat(:,2) = histc(temp(:, 2), Cat(:, 1));
If you avoid saving intermediate results, the code above can be simplified:
idx = ismember(M(:,1),Y);
Cat(:,2) = histc(M(idx, 2), Cat(:,1));
Or all in one line:
Cat(:,2) = histc(M(ismember(M(:,1),Y), 2), Cat(:,1));
Note: cat is the name of a built-in function in MATLAB, so I renamed your variable cat to Cat.
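The same count can also be obtained with accumarray in place of histc. This is an alternative sketch, not the answer's method; it uses the example M and Y from the question, and the names cats, catIdx, keep and counts are introduced here purely for illustration.

```matlab
M = [3 2; 4 12; 1 7; 3 4; 2 10; 1 6; 4 19; 4 6; 3 12; 1 10; 2 12];
Y = [2; 3];

% cats holds the unique categories; catIdx(k) is the position in cats
% of the category found in row k of M
[cats, ~, catIdx] = unique(M(:,2));

% Rows of M whose item appears in Y
keep = ismember(M(:,1), Y);

% Add 1 to the bin of each kept row; unused categories stay at 0
counts = accumarray(catIdx(keep), 1, [numel(cats) 1]);

Cat = [cats counts];
```

Passing the size [numel(cats) 1] explicitly is what guarantees a row for every category, including those with a zero count.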

Using Matlab to make modification to a text file

Essentially I am writing a Matlab file to change the 2nd, 3rd and 4th numbers in the line below "STR" and above "CON" in the text file (which is given below and is called '0.dat'). Currently, my Matlab code makes no changes to the text file.
Text File
pri
3
len 0.03
vic 5 5
MAT
1 147E9 0.3 0 4.9E9 8.5E9
LAY
1 0.000125 1 45
2 0.000125 1 0
3 0.000125 1 -45
4 0.000125 1 90
5 0.000125 1 45
WAL
1 1 2 3 4 5
PLATE
1 0.005 1 1
STR
1 32217.442335442 3010.34241024889 2689.48842888812
CON
1 2 1 2 3 1 3 4 1 4 5 1 5 6 1 6 7 1
ATT
1 901 7 901
LON
34
POI
123456
1 7
X 0.015
123456
2 6
X 0.00381966011250105 0.026180339887499
123456
3 5
X 0.000857864376269049 0.0291421356237309
123456
4
X 0
PLO
2 3
CRO
0
RES
INMOD=1
END
Matlab code:
impafp = importdata('0.dat','\t');
afp = impafp.textdata;
fileID = fopen('0.dat','r+');
for i = 1:length(afp)
if (strncmpi(afp{i},'con',3))
newNx = 100;
newNxy = 50;
newNy = 500;
myformat = '%0.6f %0.9f %0.9f %0.9f\n';
newData = [1 newNx newNxy newNy];
afp{i-1} = fprintf(fileID, myformat, newData);
fclose(fileID);
end
end
From the help for importdata:
For ASCII files and spreadsheets, importdata expects to find numeric
data in a rectangular form (that is, like a matrix). Text headers can
appear above or to the left of numeric data. To import ASCII files
with numeric characters anywhere else, including columns of character
data or formatted dates or times, use TEXTSCAN instead of import data.
Indeed, if you print out the value of afp, you'll see that it contains only the first line. As a result the if branch never runs, so nothing is written to the file, and because fclose sits inside that branch the file ID is never closed either.
Here is one way to do this with textscan (which is probably faster too):
% Read in data as strings using textscan
fid = fopen('0.dat','r');
afp = textscan(fid,'%s','Delimiter','');
fclose(fid);
isSTR = strncmpi(afp{:},'str',3); % True for all lines starting with STR
isCON = strncmpi(afp{:},'con',3); % True for all lines starting with CON
% Find indices to replace - create logical indices
% True if line before is STR and line after is CON
% Offset isSTR and isCON by 2 elements in opposite directions to align
% Use & to perform vectorized AND
% Pad with FALSE on either side to make output the same length as afp{1}{:}
datIDX = [false;(isSTR(1:end-2)&isCON(3:end));false];
% Overwrite data using sprintf
myformat = '%0.6f %0.9f %0.9f %0.9f';
newNx = 100;
newNxy = 50;
newNy = 500;
newData = [1 newNx newNxy newNy];
afp{1}{datIDX} = sprintf(myformat, newData); % Set only elements that pass test
% Overwrite old file using fprintf (or change filename to new one)
fid = fopen('0.dat','w');
fprintf(fid,'%s\r\n',afp{1}{1:end-1});
fprintf(fid,'%s',afp{1}{end}); % Avoid blank line at end
fclose(fid);
If you're unfamiliar with logical indexing, you might read this blog post and this.
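To see how the shifted logical vectors line up, here is a miniature sketch on a hypothetical five-line cell array (the contents of lines are made up for illustration; only the indexing expression comes from the code above).

```matlab
% Hypothetical miniature "file", one cell per line
lines = {'PLATE'; 'STR'; '1 2 3 4'; 'CON'; 'ATT'};

isSTR = strncmpi(lines, 'str', 3);   % true only for line 2
isCON = strncmpi(lines, 'con', 3);   % true only for line 4

% Element k of the middle part tests line k+1:
% is line k an STR line and line k+2 a CON line?
datIDX = [false; (isSTR(1:end-2) & isCON(3:end)); false];

target = find(datIDX);               % line number to overwrite
```

Here datIDX is true only for the third line, the one sandwiched between STR and CON.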
I would recommend just reading the entire file in, finding which lines contain your "keywords", modifying specific lines, and then writing it back out to a file, which can have the same name or a different one.
file = fileread('file.dat');
parts = regexp(file,'\n','split');
startIndex = find(~cellfun('isempty',regexp(parts,'STR')));
endIndex = find(~cellfun('isempty',regexp(parts,'CON')));
ind2Change = startIndex+1:endIndex-1;
tempCell{1} = sprintf('%0.6f %0.9f %0.9f %0.9f',[1,100,50,500]);
parts(ind2Change) = deal(tempCell);
out = sprintf('%s\n',parts{:});
out = out(1:end-1);
fh = fopen('file2.dat','w');
fwrite(fh,out);
fclose(fh);

Loading data efficiently in MATLAB

I have the data of the following form in a text file
Userid Gameid Count
Jason 1 2
Jason 2 10
Jason 4 20
Mark 1 2
Mark 2 10
................
.................
There are a total of 81 Gameids and I have around 2 million distinct users.
What I want is to read this text file and create a sparse matrix of the form
Column 1 2 3 4 5 6 .
Row1 Jason 2 10 20
Row2 Mark 2 10
Now I can load this text file in matlab and read the users one by one, reading their count and initializing the sparse array. I have tried this, it takes 1 second to initialize the row of one user. So for a total of 2 million users, it will take me a lot of time.
What would be the most efficient way to do this?
Here is my code
data = sparse(10000000, num_games);
loc = 1;
for f=1:length(files)
file = files(f).name;
fid = fopen(file,'r');
s = textscan(fid,'%s%d%d');
count = (s(:,2));
count = count{1};
position = (s(:,3));
position = position{1};
A=s{:,1};
A=cellstr(A);
Users = unique(A);
for aa = 1:length(Users)
a = strfind(A, char(Users(aa)));
ix=cellfun('isempty',a);
index = find(ix==0);
data(loc,position(index,:)) = count(index,:);
loc = loc + 1;
end
end
Avoid the inner loop by using unique once more, this time for the game IDs.
Store the user names, because in your original code you can't tell which name relates to which row. The same goes for the game IDs.
Make sure to close each file after reading it.
Sparse matrices do not support int32, so you need to store your data as double.
% Place holders for Count
Rows = [];
Cols = [];
for f = 1:length(files)
% Read the data into 's'
fid = fopen(files(f).name,'r');
s = textscan(fid,'%s%f%f');
fclose(fid);
% Spread the data
[U, G, Count{f}] = s{:};
[Users{f},~, r] = unique(U); % Unique user names
[GameIDs{f},~,c] = unique(G); % Unique GameIDs
Rows = [Rows; r + max([Rows; 0])];
Cols = [Cols; c + max([Cols; 0])];
end
% Convert to linear vectors
Count = cell2mat(Count');
Users = vertcat(Users{:}); % stack the per-file name lists into one column
GameIDs = cell2mat(GameIDs');
% Create the sparse matrix
Data = sparse(Rows, Cols, Count, length(Users), length(GameIDs), length(Count));
Users will then hold the row headers (user names) and GameIDs the column headers.
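If the same user can appear in several files and should end up in a single row, one option is to concatenate the columns from all files first and call unique once. The following is a rough sketch under assumptions: U, G and C are hypothetical stand-ins for the per-file textscan columns, and the game IDs are assumed to be integers from 1 to 81 so they can serve directly as column indices.

```matlab
% Hypothetical stand-ins for the columns read from two separate files
U = {{'Jason';'Jason';'Mark'}, {'Jason';'Mark'}};   % user names per file
G = {[1; 2; 1], [4; 2]};                            % game IDs per file
C = {[2; 10; 2], [20; 10]};                         % counts per file

% Stack everything into single columns
allU = vertcat(U{:});
allG = vertcat(G{:});
allC = vertcat(C{:});

% One global mapping from user name to row number
[Users, ~, r] = unique(allU);

% sparse adds up values that share the same (row, column) pair
Data = sparse(r, allG, allC, numel(Users), 81);
```

With this layout the column number is the game ID itself, so no separate column header is needed.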

Matlab: Series of Variables in a Loop

This is a solution from another stackoverflow participant who helped me out.
Data is coming from a csv file:
States Damage Blizzards
Indiana 1 3
Alabama 2 3
Ohio 3 2
Alabama 4 2
%// Parse CSV file
[States, Damage, Blizzards] = textread(csvfilename, '%s %d %d', ...
'delimiter', ',', 'headerlines', 1);
%// Parse data and store in an array of structs
[U, ix, iu] = unique(States); %// Find unique state names
S = struct('state', U); %// Create a struct for each state
for k = 1:numel(U)
idx = (iu == k); %// Indices of rows matching current state
S(k).damage = Damage(idx); %// Add damage information
S(k).blizzards = Blizzards(idx); %// Add blizzards information
end
In MATLAB, I need to create a series of variables (A1, A2, A3) in a loop. I have a structure array S with 3 fields: state, damage, blizzards.
I attempted the following to assign A1, A2, and so on, but got an error because it does not work with structures:
for n = 1:numel(S)
eval(sprintf('A%d = [1:n]',S(n).state));
end
Output goal is a series of assigned variables in the loop to the fields of the structure:
A1 = 2 3
A2 = 2 3
A3 = 4 5
I'm not 100% sure I understand your question.
But maybe you are looking for something like this:
for n = 1:numel(S)
eval(sprintf('A%d = [S(n).damage S(n).blizzards]',n));
end
BTW using evalc instead of eval will suppress the command line output.
A brief explanation of why
eval(sprintf('A%d = [1:n]',S(n).state));
does not work:
S(1).state
returns
ans =
Alabama
which is a string. However,
A%d
expects a number (see this for number formatting).
Additionally,
numel(S)
yields
ans =
3
Therefore,
eval(sprintf('A%d = [1:n]',n));
will simply return the following output:
A1 =
1
A2 =
1 2
A3 =
1 2 3
Hence, you want n as a counter for the variable name, but compose the vector of the entries in the other struct-fields (damage and blizzards), again, using n as a counter.
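If numbered variable names are not strictly required, a cell array indexed by n is a common alternative to eval. This is a different technique from the answer above, shown only as a sketch; S is rebuilt here by hand from the sample CSV so the block runs on its own, and the cell array name A is a choice made for illustration.

```matlab
% S rebuilt by hand from the sample data (states sorted as unique would return them)
S = struct('state',     {'Alabama', 'Indiana', 'Ohio'}, ...
           'damage',    {[2; 4], 1, 3}, ...
           'blizzards', {[3; 2], 3, 2});

A = cell(1, numel(S));                      % A{n} plays the role of variable An
for n = 1:numel(S)
    A{n} = [S(n).damage S(n).blizzards];    % columns: damage, blizzards
end
```

A{1} then holds both Alabama records, one per row, and later code can loop over A without building variable names as strings.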