Loading data efficiently in MATLAB

I have data in the following form in a text file:
Userid Gameid Count
Jason 1 2
Jason 2 10
Jason 4 20
Mark 1 2
Mark 2 10
................
.................
There are a total of 81 Gameids and I have around 2 million distinct users.
What I want is to read this text file and create a sparse matrix of the form
Column        1    2    3    4    5    6  ...
Row1 (Jason)  2   10        20
Row2 (Mark)   2   10
I can load this text file in MATLAB and process the users one by one, reading each user's counts and filling in the corresponding row of the sparse array. I have tried this, and it takes about 1 second to initialize the row for one user, so 2 million users would take far too long.
What would be the most efficient way to do this?
Here is my code:
data = sparse(10000000, num_games);
loc = 1;
for f=1:length(files)
file = files(f).name;
fid = fopen(file,'r');
s = textscan(fid,'%s%d%d');
count = (s(:,2));
count = count{1};
position = (s(:,3));
position = position{1};
A=s{:,1};
A=cellstr(A);
Users = unique(A);
for aa = 1:length(Users)
a = strfind(A, char(Users(aa)));
ix=cellfun('isempty',a);
index = find(ix==0);
data(loc,position(index,:)) = count(index,:);
loc = loc + 1;
end
end

Avoid the inner loop by using unique once more, this time for the game IDs.
Store the user names, because in your original code you can't tell which name relates to which row. The same goes for the game IDs.
Make sure to close the file after opening it.
Sparse matrices do not support 'int32'; you need to store your data as double.
% Placeholders for the row/column indices and the per-file results
Rows = [];
Cols = [];
Count = cell(1, length(files));
Users = cell(1, length(files));
GameIDs = cell(1, length(files));
for f = 1:length(files)
    % Read the data into 's'
    fid = fopen(files(f).name,'r');
    s = textscan(fid,'%s%f%f');
    fclose(fid);
    % Spread the data
    [U, G, Count{f}] = s{:};
    [Users{f},~, r] = unique(U); % Unique user names
    [GameIDs{f},~,c] = unique(G); % Unique GameIDs
    % Offset the indices so each file gets its own block of rows and columns
    Rows = [Rows; r + max([Rows; 0])];
    Cols = [Cols; c + max([Cols; 0])];
end
% Convert to linear vectors
Count = cell2mat(Count');
Users = vertcat(Users{:}); % [Users{:}] fails when files differ in user count
GameIDs = cell2mat(GameIDs');
% Create the sparse matrix
Data = sparse(Rows, Cols, Count, length(Users), length(GameIDs), length(Count));
Users will then hold the row headers (user names) and GameIDs the column headers.
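As a minimal illustration of the unique-plus-sparse idea on in-memory data (the toy values are taken from the question; everything else is an assumption for the sake of the example), the index outputs of unique can be passed straight to sparse:

```matlab
% Toy data in the same layout as the question
U     = {'Jason'; 'Jason'; 'Jason'; 'Mark'; 'Mark'};
G     = [1; 2; 4; 1; 2];
Count = [2; 10; 20; 2; 10];   % already double, as sparse requires

% The third output of unique maps each record to its position in the unique list
[Users, ~, r]   = unique(U);  % r: row index of every record
[GameIDs, ~, c] = unique(G);  % c: column index of every record

% (:) forces column vectors so the three index/value inputs line up
Data = sparse(r(:), c(:), Count(:), numel(Users), numel(GameIDs));

full(Data)   % rows follow Users, columns follow GameIDs
```

Note that the columns follow the sorted unique game IDs, so game 4 lands in column 3 here rather than column 4.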

Related

Loop to create cell array

I have a structure named data. The structure has 250 elements and one field called codes (whose dimension varies).
As an example: data(1).codes is a 300 x 1 cell of strings and data(2).codes is a 100 x 1 cell of strings.
What I am trying to do is to create a big cell with three columns: id count codes where id indexes the element number (1 to 250), count indexes the row of the string and codes are just the codes.
An example to make it clear:
for k = 1:size(data,2)
id = repmat(k,size(data(k).codes,1),1);
count = linspace(1, size(data(k).codes,1), size(data(k).codes,1))';
codes= data(k).codes;
end
The loop above creates the columns I want. Now I just need to append them one below the other and then save to Excel. If these were only numbers, I would know how to concatenate/append matrices, but with cells I am unsure how to do it.
Here is what I have tried:
output = {};
for k = 1:size(data,2)
id = repmat(k,size(data(k).codes,1),1);
count = linspace(1, size(data(k).codes,1), size(data(k).codes,1))';
codes= data(k).codes;
output{1,1} = {output{1,1}; id};
output{1,2} = {output{1,2}; count};
output{1,3} = {output{1,3};
end
Build up your output into a new cell array, allowing for pre-allocation, then concatenate all of your results.
% Initialise
output = cell(size(data,2), 1);
% Create output for each element of data
for k = 1:size(data,2)
id = repmat(k,size(data(k).codes,1),1);
count = linspace(1, size(data(k).codes,1), size(data(k).codes,1))';
codes = data(k).codes;
% add to output
output{k} = [id, count, codes];
end
% Vertically concatenate all cell elements
output = vertcat(output{:});
Note: this assumes codes is numerical, and the output will be a numerical matrix. If it isn't, you will need to do some cell conversions for your numerical data (id and count) like so:
id = repmat({k}, size(data(k).codes,1), 1);
count = num2cell(linspace( ... )');
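Putting the cell conversions together, a minimal sketch for the case where codes holds strings could look like this. The two-element data struct is a hypothetical stand-in for the 250-element one in the question:

```matlab
% Hypothetical stand-in for the struct array from the question
data(1).codes = {'a1'; 'b2'; 'c3'};
data(2).codes = {'d4'; 'e5'};

output = cell(numel(data), 1);              % one block per struct element
for k = 1:numel(data)
    n     = numel(data(k).codes);
    id    = repmat({k}, n, 1);              % n-by-1 cell, every entry is k
    count = num2cell((1:n)');               % n-by-1 cell holding 1..n
    output{k} = [id, count, data(k).codes]; % n-by-3 cell block
end
output = vertcat(output{:});                % stack all blocks into one cell array
```

The result is a 5-by-3 cell array here, with numeric id and count columns and a string column.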

Importing Multiple text File to Matlab

I need to import some text files as matrices in MATLAB. Can anyone help me with the code, please? Here are my text file names:
elist_S06n1.txt
elist_S06n2.txt
elist_S06n3.txt
elist_S06n4.txt
elist_S07n1.txt
elist_S07n2.txt
elist_S07n3.txt
elist_S07n4.txt
.
.
.
elist_S27n5.txt
So, up to elist_S09, n runs from 1 through 4; after that, it runs from 1 through 5.
Thank you in advance.
Thanks for your update so we can see what you have tried so far.
It seems to me that you have difficulties generating the proper filename. Instead of looping over your cell array index, you could use two loops, one from 6 to 27 and the other from 1 to 4 or 5. Based on these values, you can easily generate the desired filename (mind the leading zero!). Within the loop, you keep track of an index for the resulting cell array.
By the way, if I count the number of files, I arrive at a total of 18*5 + 4*4 = 106 and not 95.
The code:
numfiles = (27-9)*5 + (9-5)*4;
mydata = cell(1, numfiles);
idx = 0; % index for mydata
n = 4;
for k1 = 6:27
if k1 == 10
n = 5; % switch to 5 files if k1 reaches 10
end
for k2 = 1:n
idx = idx+1;
myfilename = sprintf('elist_S%02dn%d.txt', k1, k2);
mydata{idx} = importdata(myfilename);
end
end
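One way to sanity-check the file count before importing anything is to build the list of names first. This is only a sketch using the same naming pattern as above; the variable names names and notFound are illustrative:

```matlab
names = {};
for k1 = 6:27
    if k1 < 10
        n = 4;      % S06..S09 have 4 files each
    else
        n = 5;      % S10..S27 have 5 files each
    end
    for k2 = 1:n
        names{end+1} = sprintf('elist_S%02dn%d.txt', k1, k2);
    end
end

numel(names)   % total number of expected files

% List the expected files that are not present in the current folder
notFound = names(~cellfun(@(f) exist(f, 'file') == 2, names));
```

If notFound is empty, every generated name exists and the import loop should run without file errors.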

Loading huge data efficiently in matlab

I have a data set with a million rows in the following format:
User Gameid Count
A 1 2
A 2 3
A 10 2
A 8 2
B 10 2
B 1 6
....
I want to create a sparse matrix from this, with each row representing a user, each column representing a gameid, and the value in each cell being the count for the corresponding user and gameid.
The matrix is sparse, so MATLAB should be able to handle it.
How can I load such data directly, without iterating through each line, which would take a lot of time? Any suggestions on how to do it efficiently?
This is what I want:
Column  1  2  3  4  5  6  7  8  9  10
A       2  3                 2      2
B       6                           2
I think you can use textscan.
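fid = fopen(file);
content = textscan(fid, '%s%d%d', 'HeaderLines', 1); % skip the header row
fclose(fid);
user = content{1}; % cell array of user names
field = content{2}; % int32 vector of game IDs
count = content{3}; % int32 vector of counts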
OK, assuming your file is file.txt, a little manipulation gets you what you want:
fil = 'file.txt';
fid = fopen(fil,'r');
s = textscan(fid,'%s','Delimiter','\n');
fclose(fid);
s = s{1};
s = s(2:end); % drop the header line
ls = length(s);
% Note: this parsing only works for single-character user names
fmat = '%s %u %u';
C = sscanf(char(s)',fmat);
C = reshape(C,[3 ls])';
A = char(C(:,1));
data = cellstr(A);
data(:,2:3) = num2cell(C(:,2:3));
Users = unique(A);
% Reserve room for at most one nonzero per input line
S_MAT_DATA = spalloc(length(Users),max(C(:,2)),ls);
for aa = 1:length(Users)
    data_user = cell2mat(data(strncmp(Users(aa),data(:,1),1),2:3)');
    S_MAT_DATA(aa,data_user(1,:)) = data_user(2,:);
end
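For comparison, a loop-free variant is sketched below. It is not the method from the answers above, just one possible combination of textscan, unique and sparse. The sample file is written by the sketch itself so that it is self-contained, and it assumes the game IDs are positive integers that can be used directly as column indices:

```matlab
% Write a small sample file (contents copied from the question)
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'User Gameid Count\nA 1 2\nA 2 3\nA 10 2\nA 8 2\nB 10 2\nB 1 6');
fclose(fid);

% Read all rows at once, skipping the header line; %f yields doubles
fid = fopen(fname, 'r');
content = textscan(fid, '%s %f %f', 'HeaderLines', 1);
fclose(fid);
delete(fname);

% Map each user name to a row index, then build the matrix in one call
[Users, ~, rowIdx] = unique(content{1});
S = sparse(rowIdx(:), content{2}, content{3});

full(S)   % one row per user, one column per game ID
```

Unlike the sscanf approach, this does not depend on the length of the user names, and Users keeps the name that belongs to each row.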

Matlab processing data from text file

I am trying to read data from a text file. I can do it via import, and that works fine.
My data is imported as:
UserID|SportID|Rating
There are many users, and each can rate any sport with any rating, for example:
User|SportID|Rating
1 2 10
1 3 5
2 1 10
2 3 2
I am trying to create a new matrix like the one below:
UserID Sport1 Sport2 Sport3
1 (null) 10 5
2 10 (null) 2
I tried to do this with a "for" loop; however, there are almost 2000 users and 1000 sports, and almost 100000 ratings. How can I do this?
To do this fast, you can use a sparse matrix with one dimension UserID and the other Sports. The sparse matrix will behave for most things like a normal matrix. Construct it like so
out = sparse(User, SportID, Rating)
where User, SportID and Rating are the vectors corresponding to the columns of your text file.
Note 1: for duplicate (User, SportID) pairs, the ratings are summed.
Note 2: empty entries, written as (null) in the question, are not stored in sparse matrices; only the non-zero ones are (that is the main point of sparse matrices).
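A small sketch of Note 1, using the sample data from the question plus one extra, made-up duplicate rating for user 2 and sport 3:

```matlab
User    = [1; 1; 2; 2; 2];
SportID = [2; 3; 1; 3; 3];   % the last row repeats the pair (2, 3); illustrative
Rating  = [10; 5; 10; 2; 4];

out = sparse(User, SportID, Rating);

full(out)   % the two ratings for (2, 3) are added together
nnz(out)    % only the non-zero entries are stored
```

If summing is not what you want for repeated pairs, the duplicates need to be resolved before calling sparse.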
You can do the following:
% Test Input
inputVar = [1 2 10; 1 3 5; 2 1 10; 2 3 2];
% Determine number of users, and sports to create the new table
numSports = max(inputVar(1:end,2));
numUsers = max(inputVar(1:end,1));
newTable = NaN(numUsers, numSports);
% Iterate for each row of the new table (# of users)
for ii = 1:numUsers
% Determine where the user rated from input mat, which sport he/she rated, and the rating
userRating = find(inputVar(1:end,1) == ii);
sportIndex = inputVar(userRating, 2)';
sportRating = inputVar(userRating, 3)';
newTable(ii, sportIndex) = sportRating; % Create the new table based on the ratings.
end
newTable
Which produced the following:
newTable =
NaN 10 5
10 NaN 2
The loop only has to run once for each user in your input table.
I suppose you have already defined null as a number for simplification.
Null = -1; % or any other value which could not be a rating.
Considering:
nSports = 1000; % Number of sports
nUsers = 2000; % Number of users
Pre-allocate the result:
Rating_Mat = ones(nUsers, nSports) * Null; % Pre-allocation
Then use sub2ind (similar to this answer):
Rating_Mat(sub2ind([nUsers nSports], User, SportID)) = Rating;
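Or accumarray, passing the size and the fill value so that missing entries also become Null:
Rating_Mat = accumarray([User, SportID], Rating, [nUsers nSports], [], Null);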
assuming that User and SportID are Nx1.
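To check that both routes give the same table, here is a minimal sketch on the four sample rows from the question. The sizes are shrunk to 2 users and 3 sports for the example, and the names A and B are illustrative:

```matlab
User    = [1; 1; 2; 2];
SportID = [2; 3; 1; 3];
Rating  = [10; 5; 10; 2];
nUsers  = 2;
nSports = 3;
Null    = -1;   % placeholder for "no rating"

% Route 1: pre-allocate with Null, then write via linear indices
A = ones(nUsers, nSports) * Null;
A(sub2ind([nUsers nSports], User, SportID)) = Rating;

% Route 2: accumarray with explicit size and fill value
B = accumarray([User, SportID], Rating, [nUsers nSports], [], Null);

isequal(A, B)   % both tables agree when there are no duplicate pairs
```

The two differ when a (User, SportID) pair appears more than once: accumarray sums the ratings, while the indexed assignment keeps only one of them.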
Hope it helps.

Matlab: Series of Variables in a Loop

This is a solution from another stackoverflow participant who helped me out.
Data is coming from a csv file:
States Damage Blizzards
Indiana 1 3
Alabama 2 3
Ohio 3 2
Alabama 4 2
%// Parse CSV file
[States, Damage, Blizzards] = textread(csvfilename, '%s %d %d', ...
'delimiter', ',', 'headerlines', 1);
%// Parse data and store in an array of structs
[U, ix, iu] = unique(States); %// Find unique state names
S = struct('state', U); %// Create a struct for each state
for k = 1:numel(U)
idx = (iu == k); %// Indices of rows matching current state
S(k).damage = Damage(idx); %// Add damage information
S(k).blizzards = Blizzards(idx); %// Add blizzards information
end
In MATLAB, I need to create a series of assigned variables (A1, A2, A3) in a loop. So I have a structure S with 3 fields: state, damage, blizzards.
I attempted the following to assign A1, A2, and so on, but I got an error because it does not work for structures:
for n = 1:numel(S)
eval(sprintf('A%d = [1:n]',S(n).state));
end
Output goal is a series of assigned variables in the loop to the fields of the structure:
A1 = 2 3
A2 = 2 3
A3 = 4 5
I'm not 100% sure I understand your question.
But maybe you are looking for something like this:
for n = 1:numel(S)
eval(sprintf('A%d = [S(n).damage S(n).blizzards]',n));
end
BTW using evalc instead of eval will suppress the command line output.
A little explanation, why
eval(sprintf('A%d = [1:n]',S(n).state));
does not work:
S(1).state
returns
ans =
Alabama
which is a string. However,
A%d
expects a number (see this for number formatting).
Additionally,
numel(S)
yields
ans =
3
Therefore,
eval(sprintf('A%d = [1:n]',n));
will simply return the following output:
A1 =
1
A2 =
1 2
A3 =
1 2 3
Hence, you want n as a counter for the variable name, but compose the vector of the entries in the other struct-fields (damage and blizzards), again, using n as a counter.
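If separately named variables are not strictly required, a cell array avoids eval altogether. This is only a sketch of that alternative, not what the question asked for; S is rebuilt by hand from the sample CSV so the example runs on its own:

```matlab
% Struct array as the parsing code above would produce it (states sorted by unique)
S = struct('state', {'Alabama'; 'Indiana'; 'Ohio'});
S(1).damage = [2; 4];  S(1).blizzards = [3; 2];
S(2).damage = 1;       S(2).blizzards = 3;
S(3).damage = 3;       S(3).blizzards = 2;

% A{n} plays the role of the variable An
A = cell(numel(S), 1);
for n = 1:numel(S)
    A{n} = [S(n).damage, S(n).blizzards];   % one row per CSV line of that state
end

A{1}   % damage and blizzards for Alabama
```

Indexing with A{n} keeps the counter n usable in later loops, which is harder to do with generated variable names.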