Matlab processing data from text file - matlab

I am trying to read data from a text file. Importing it works fine.
My data is imported as:
UserID|SportID|Rating
There are many users, and each of them can rate any sport, for example:
User|SportID|Rating
1 2 10
1 3 5
2 1 10
2 3 2
I am trying to create a new matrix like the one below:
UserID Sport1 Sport2 Sport3
1 (null) 10 5
2 10 (null) 2
I tried to do this with nested "for" loops, but there are almost 2000 users and 1000 sports, and almost 100000 ratings in total. How can I do this?

To do this fast, you can use a sparse matrix with UserID along one dimension and SportID along the other. For most purposes the sparse matrix behaves like a normal matrix. Construct it like so:
out = sparse(User, SportID, Rating)
where User, SportID and Rating are the vectors corresponding to the columns of your text file.
Note 1: if the same combination of User and SportID occurs more than once, the corresponding Ratings are summed.
Note 2: empty entries, written as (null) in the question, are not stored in a sparse matrix; only the non-zero ones are (that is the main point of sparse matrices).
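A minimal sketch of the construction above, using the four sample rows from the question. Replacing zeros with NaN at the end is only valid under the assumption that 0 is never a real rating; that step is an illustration, not part of the original answer.

```matlab
% Sample data from the question: one entry per (user, sport, rating) row
User    = [1; 1; 2; 2];
SportID = [2; 3; 1; 3];
Rating  = [10; 5; 10; 2];

% Rows are users, columns are sports; unrated pairs are simply not stored
out = sparse(User, SportID, Rating);

% Dense copy for display; entries that were never rated read back as 0
dense = full(out);

% Assumption: 0 is never a valid rating, so zeros can be marked as missing
dense(dense == 0) = NaN;
disp(dense)   % [NaN 10 5; 10 NaN 2]
```

For the full data set (about 2000 by 1000 with 100000 ratings) the sparse form stores only the rated pairs, so the dense conversion is optional.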

You can do the following:
% Test Input
inputVar = [1 2 10; 1 3 5; 2 1 10; 2 3 2];
% Determine number of users, and sports to create the new table
numSports = max(inputVar(1:end,2));
numUsers = max(inputVar(1:end,1));
newTable = NaN(numUsers, numSports);
% Iterate for each row of the new table (# of users)
for ii = 1:numUsers
% Determine where the user rated from input mat, which sport he/she rated, and the rating
userRating = find(inputVar(1:end,1) == ii);
sportIndex = inputVar(userRating, 2)';
sportRating = inputVar(userRating, 3)';
newTable(ii, sportIndex) = sportRating; % Create the new table based on the ratings.
end
newTable
Which produced the following:
newTable =
NaN 10 5
10 NaN 2
The loop only runs once per user in your input table, not once per rating.

I suppose you have already defined null as a number for simplification.
Null = -1; % or any other value which could not be a rating.
Considering:
nSports = 1000; % Number of sports
nUsers = 2000; % Number of users
Pre-allocate the result:
Rating_Mat = ones(nUsers, nSports) * Null; % Pre-allocation
Then use sub2ind (similar to this answer):
Rating_Mat(sub2ind([nUsers nSports], User, SportID)) = Rating;
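Or accumarray, passing the output size and Null as the fill value (otherwise unrated entries come out as 0):
Rating_Mat = accumarray([User, SportID], Rating, [nUsers nSports], [], Null);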
assuming that User and SportID are Nx1.
Hope it helps.
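If the same user rates the same sport more than once, accumarray sums the ratings by default. A minimal sketch of averaging the duplicates instead, using made-up sample data (the extra rating of 4 is hypothetical) and NaN as the fill value for unrated pairs:

```matlab
% Hypothetical sample: user 2 rated sport 3 twice (2 and 4)
User    = [1; 1; 2; 2; 2];
SportID = [2; 3; 1; 3; 3];
Rating  = [10; 5; 10; 2; 4];

nUsers  = max(User);
nSports = max(SportID);

% @mean is applied to each group of duplicates; NaN fills unrated pairs
Rating_Mat = accumarray([User, SportID], Rating, [nUsers nSports], @mean, NaN);
disp(Rating_Mat)   % [NaN 10 5; 10 NaN 3]
```

Any other reduction (for example @max, or @(v) v(end) to keep one of the values) can be passed in place of @mean.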

Related

Matlab find unique column-combinations in matrix and respective index

I have a large matrix with multiple rows and a limited (but larger than 1) number of columns containing values between 0 and 9, and I would like to find an efficient way to identify unique row-wise combinations and their indices to then build sums (somewhat like a pivot logic). Here is an example of what I am trying to achieve:
a =
1 2 3
2 2 3
3 2 1
1 2 3
3 2 1
uniqueCombs =
1 2 3
2 2 3
3 2 1
numOccurrences =
2
1
2
indices:
[1;4]
[2]
[3;5]
From matrix a, I want to first identify the unique combinations (row-wise), then count the number of occurrences and identify the row indices of each combination.
I have achieved this through generating strings with num2str and strcat, but this method appears to be very slow. Along these thoughts I have tried to find a way to form a new unique number through concatenating the values horizontally, but Matlab does not seem to support this (e.g. from [1;2;3] build 123). Sums won't work because they would remove the possibility to identify unique combinations. Any suggestions on how to best achieve this? Thanks!
To get the unique rows you can use unique with the 'rows' option enabled:
[C, ix, ic] = unique(a, 'rows', 'stable');
C contains the unique rows; ix holds the indices of the first occurrence of each unique row in a; ic maps every row of a to its row in C, which is basically the information you want. To collect the row indices of each combination, loop over the unique rows and save them in a cell array:
indexes = cell(1, length(ix));
for k = 1:length(ix)
indexes{k} = find(ic == k);
end
indexes will be a cell array containing the indexes you were looking for. For example:
indexes{1}
% ans =
%
% 1
% 4
And to count the occurrences of a particular combination you can just use numel. For example:
numel(indexes{1})
% ans =
%
% 2
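The grouping described above can also be sketched without an explicit loop by handing ic to accumarray. This is an alternative formulation, not part of the original answer; it assumes ic is (or is reshaped into) a column vector.

```matlab
a = [1 2 3;
     2 2 3;
     3 2 1;
     1 2 3;
     3 2 1];

% ic(k) is the row of C that row k of a maps to
[C, ~, ic] = unique(a, 'rows', 'stable');
ic = ic(:);

% Number of occurrences of each unique row
numOccurrences = accumarray(ic, 1);

% Row indices of each unique row, collected into a cell array
rowIdx  = (1:numel(ic)).';
indexes = accumarray(ic, rowIdx, [], @(v) {sort(v)});

disp(numOccurrences.')   % 2 1 2
celldisp(indexes)
```

The anonymous function wraps each group in a cell so that groups of different lengths can be returned; sort is used because accumarray does not promise the order of values inside a group.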

Grouping by nested unique values

I have a matrix A in Matlab:
A = [176 5406 1 4 7903;
155 5406 1 5 7903;
122 5407 0 4 7903;
140 5407 0 5 7904;
130 5407 0 3 7904];
Just for information: the second column is a user ID, while the fifth column is a time. So 5406 is one user and 5407 is another user. Both of these users have some information stored in the first column and the 4th column which I am interested in accessing.
So basically what I want to do is:
For each user take the median of their values in the first column. I have written code (below) that works for this.
If there are two equal "time" values in column 5 for a user, then I want to average the corresponding values in column 4. For example, for user 5406 the time values are both 7903, so I want the average of the values in column 4, i.e. the average of 4 and 5, to end up with one value (4.5).
But for example for the next user 5407 I will have two final values - one will be the average of 5 and 3 (because 7904 is repeated) and one will be 4 (because 7903 is not repeated).
I am a bit confused about how to do this, I know there needs to be an if statement of some sort, but I've been stuck on it for ages. Can anyone help?
Thanks
Code for the first part:
u=unique(A(:,2));
for i=1:size(u,1)
M=find(A(:,2)==u(i)); % rows belonging to user u(i)
med(i)=median(A(M,1));
end
You could run unique on the time values of each user (within the loop) and add a similar sub-loop to collect the mean for each unique timestamp of that user.
But here I think it's neater to use accumarray. In the first example below, I've modified your code just a bit.
% Get unique
[user, ~, userIdx] = unique(A(:,2));
nUser = numel(user);
% Allocate container for result
med = zeros(nUser,1);
men = cell(nUser,1); % <-- Need a cell since length of result could vary
for i = 1:nUser
% Median of col #1
med(i) = median(A(userIdx == i, 1));
% Mean of col #4 for unique times
[~, ~, timeIdx] = unique(A(userIdx == i, 5));
men{i} = accumarray(timeIdx, A(userIdx == i, 4), [], @mean);
end
Result:
>> med =
165.5
130
>> celldisp(men)
men{1} =
4.5
men{2} =
4
4
To condense it further, you could take the unique times over the entire A and use accumarray for both:
[~, ~, userIdx] = unique(A(:,2));
[~, ~, timeIdx] = unique(A(:,5));
med = accumarray(userIdx, A(:,1), [], @median);
men = accumarray([userIdx timeIdx], A(:,4), [], @mean, NaN);
This gives men not as a cell but as a matrix. Therefore the blank spaces have to be filled (here I chose NaN since 0 could be a result of @mean).
men =
4.5 NaN
4 4
If you want it as a cell without NaN you could just loop over the rows and pick non-NaN values, or place only the men calculation in the loop, or any other way...
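A minimal sketch of the first option mentioned above (looping over the rows and keeping the non-NaN values). It assumes column 4 of A never contains NaN itself, so NaN can only come from the fill value; menCell is a name introduced here for illustration.

```matlab
A = [176 5406 1 4 7903;
     155 5406 1 5 7903;
     122 5407 0 4 7903;
     140 5407 0 5 7904;
     130 5407 0 3 7904];

[~, ~, userIdx] = unique(A(:,2));
[~, ~, timeIdx] = unique(A(:,5));

% One row per user, one column per unique time, NaN where no data exists
men = accumarray([userIdx timeIdx], A(:,4), [], @mean, NaN);

% Convert each row to a column vector without the NaN padding
nUser   = size(men, 1);
menCell = cell(nUser, 1);
for k = 1:nUser
    rowVals    = men(k, :);
    menCell{k} = rowVals(~isnan(rowVals)).';
end
celldisp(menCell)
```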
If you are sure that column 4 of A doesn't contain any negative or zero numbers (so the mean can never be 0), you could collect the result of men as a sparse matrix instead:
men = accumarray([userIdx timeIdx], A(:,4), [], @mean, 0, true);
men =
(1,1) 4.5
(2,1) 4
(2,2) 4
Here is another solution for your task without any explicit loops:
Median values.
u=unique(A(:,2));
umedians = arrayfun(@(x) median(A(A(:,2)==x, 1)), u);
Explanation:
First find all unique users. Then use arrayfun to select the data of each user and calculate its median.
Average values of column 4.
This task is a bit harder. We can go this way:
temp = arrayfun(@(x) unique(A(A(:,2)==x, 5)), u, 'UniformOutput', false);
result = cellfun(@(y,z) arrayfun(@(x) mean(A(A(:,2) == u(z) & A(:,5) == x, 4)), ...
y, 'UniformOutput', false), temp, num2cell((1:size(u,1))'), 'UniformOutput', false)
Explanation: first find all unique times for each user and save them to the cell array temp. Then, for each cell, find the rows with the same time and calculate their mean. To do that, use cellfun to process each cell of temp, with an arrayfun inside it to calculate the mean for each time.
Hope it helps!

How to efficiently find in some dataset the number of occurrences of a given list of items, without using loops?

I have a dataset, M, where some items and their category types are stored in columns 1 and 2 respectively. The vector cat stores the unique category types present in M. Vector Y is a subset of items in M. I want to find how many times each category type is associated with the items in Y. This is the code I have written to do this:
cat(:,1) = unique(M(:,2)); % Unique category types in M
cat(:,2) = zeros(size(cat,1),1); % initialize column 2 of cat to 0s
N = size(Y,1);
for i=1:N
item = Y(i,1);
temp = M(M(:,1)==item,:);
C(:,1) = unique(temp(:,2));
C(:,2) = histc(temp(:,2), unique(temp(:,2))); % Frequency of items in temp(:,2)
for j=1:size(cat,1)
for k=1:size(C,1)
if cat(j,1)==C(k,1)
cat(j,2) = cat(j,2)+C(k,2);
end
end
end
clear C; clear temp; clear item;
end
But this is obviously slow for even moderately sized M, Y and cat. How do I make it faster?
To illustrate with an example, say:
M=[3 2
4 12
1 7
3 4
2 10
1 6
4 19
4 6
3 12
1 10
2 12];
And,
Y=[2;3];
Then I want the output cat to be the following:
cat=[2 1
4 1
6 0
7 0
10 1
12 2
19 0];
If I understand correctly, you want a histogram of the categories of those items from M that also appear in Y.
Using ismember you can find the indices of the items of M that also appear in Y:
idx = ismember(M(:,1), Y);
Use that index to keep only the desired items and save them to temp:
temp = M(idx, :);
Form the histogram of temp over the unique category values stored in Cat(:,1), i.e. Cat(:,1) = unique(M(:,2)) as in your code:
Cat(:,2) = histc(temp(:, 2), Cat(:, 1));
If you avoid saving the intermediate result, the above code can be simplified:
idx = ismember(M(:,1),Y);
Cat(:,2) = histc(M(idx, 2), Cat(:,1));
Or all in one line:
Cat(:,2) = histc(M(ismember(M(:,1),Y), 2), Cat(:,1));
Note: cat is the name of a built-in function in MATLAB, so I renamed your variable cat to Cat.
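For reference, a self-contained sketch of the same idea on the example data from the question. It swaps histc for ismember plus accumarray, since the MATLAB documentation marks histc as not recommended in newer releases; this substitution is my own and not part of the answer above.

```matlab
M = [3 2; 4 12; 1 7; 3 4; 2 10; 1 6; 4 19; 4 6; 3 12; 1 10; 2 12];
Y = [2; 3];

% Unique category types, sorted ascending
Cat = unique(M(:,2));

% Rows of M whose item appears in Y
idx = ismember(M(:,1), Y);

% Position of each selected category within Cat, then count per position
[~, loc] = ismember(M(idx, 2), Cat(:,1));
Cat(:,2) = accumarray(loc, 1, [size(Cat,1) 1]);

disp(Cat)   % [2 1; 4 1; 6 0; 7 0; 10 1; 12 2; 19 0]
```

Passing the output size to accumarray keeps the zero counts for categories that never occur among the selected items.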

Loading a data efficiently in matlab

I have the data of the following form in a text file
Userid Gameid Count
Jason 1 2
Jason 2 10
Jason 4 20
Mark 1 2
Mark 2 10
................
.................
There are a total of 81 Gameids and I have around 2 million distinct users.
What I want is to read this text file and create a sparse matrix of the form
Column 1 2 3 4 5 6 .
Row1 Jason 2 10 20
Row2 Mark 2 10
Now I can load this text file in MATLAB and read the users one by one, reading their counts and filling the sparse array. I have tried this, and it takes 1 second to fill the row of one user. So for a total of 2 million users, it will take far too long.
What would be the most efficient way to do this?
Here is my code
data = sparse(10000000, num_games);
loc = 1;
for f=1:length(files)
file = files(f).name;
fid = fopen(file,'r');
s = textscan(fid,'%s%d%d');
count = (s(:,2));
count = count{1};
position = (s(:,3));
position = position{1};
A=s{:,1};
A=cellstr(A);
Users = unique(A);
for aa = 1:length(Users)
a = strfind(A, char(Users(aa)));
ix=cellfun('isempty',a);
index = find(ix==0);
data(loc,position(index,:)) = count(index,:);
loc = loc + 1;
end
end
Avoid the inner loop by using unique once more, this time for the GameID.
Store the user names, because in your original code you can't tell which name relates to each row. The same goes for the game IDs.
Make sure to close the file after opening it.
A sparse matrix does not support 'int32'; you need to store your data as double.
% Placeholders for the row and column indices
Rows = [];
Cols = [];
for f = 1:length(files)
% Read the data into 's'
fid = fopen(files(f).name,'r');
s = textscan(fid,'%s%f%f');
fclose(fid);
% Spread the data
[U, G, Count{f}] = s{:};
[Users{f},~, r] = unique(U); % Unique user names
[GameIDs{f},~,c] = unique(G); % Unique GameIDs
Rows = [Rows; r + max([Rows; 0])];
Cols = [Cols; c + max([Cols; 0])];
end
% Convert to linear vectors
Count = cell2mat(Count');
Users = vertcat(Users{:}); % stack the per-file name lists into one column
GameIDs = cell2mat(GameIDs');
% Create the sparse matrix
Data = sparse(Rows, Cols, Count, length(Users), length(GameIDs), length(Count));
Users will contain the row headers (user names) and GameIDs the column headers.
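To show the core idea in isolation, here is a minimal in-memory sketch that skips the file reading. The variables U, G and Count stand in for the three columns that textscan would return for a single file; the values are the five sample rows from the question.

```matlab
% Stand-ins for the three columns returned by textscan
U     = {'Jason'; 'Jason'; 'Jason'; 'Mark'; 'Mark'};
G     = [1; 2; 4; 1; 2];
Count = [2; 10; 20; 2; 10];    % double, as sparse requires

% The third output of unique gives a row/column number for every entry
[Users, ~, r]   = unique(U);   % r(k) is the row of entry k
[GameIDs, ~, c] = unique(G);   % c(k) is the column of entry k

% Build the whole matrix in one call instead of filling it row by row
Data = sparse(r, c, Count, numel(Users), numel(GameIDs));

disp(Users.')      % row headers
disp(GameIDs.')    % column headers: 1 2 4
disp(full(Data))   % [2 10 20; 2 10 0]
```

Building the index vectors first and calling sparse once avoids the repeated indexed assignments into a sparse matrix, which is the slow part of the loop in the question.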

Matlab - insert/append rows into matrix iteratively

How can I iteratively append rows to a matrix in MATLAB?
For example, let's say I have an empty matrix:
m = [];
and when I run the for loop, I get rows that I need to insert into matrix.
For example:
for i=1:5
row = v - x; % for example getting 1 2 3
% m.append(row)?
end
so after inserting it should look something like:
m = [
1 2 3
3 2 1
1 2 3
4 3 2
1 1 1
]
In most programming languages you can simply append rows to an array/matrix, but I find it hard to do in MATLAB.
Use m = [m ; new_row]; in your loop. If you already know the total number of rows, define m=zeros(row_num,column_num); first, then use m(i,:) = new_row; in your loop.
Just use
m = [m; row];
Take into account that extending a matrix is slow, as it involves memory reallocation. It's better to preallocate the matrix to its full size,
m = NaN(numRows,numCols);
and then fill the row values at each iteration:
m(ii,:) = row;
Also, it's better not to use i as a variable name, because by default it represents the imaginary unit (that's why I'm using ii here as iteration index).
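Put together, the preallocation approach above could look like the following sketch. The row computed inside the loop is a placeholder, since the question does not say how its rows are produced.

```matlab
numRows = 5;
numCols = 3;

% Preallocate once; NaN makes rows that were never filled easy to spot
m = NaN(numRows, numCols);

for ii = 1:numRows
    % Placeholder computation standing in for the real row
    row = ii * ones(1, numCols);
    m(ii, :) = row;   % write into the preallocated slot
end

disp(m)
```

If the number of rows is not known in advance, m = [m; row] still works; it just reallocates on every iteration.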
To build the matrix by entering its values one at a time, you can do the following and end up with a complete matrix like yours.
Here there are 5 rows and 3 columns, hence the two for loops.
Assigning to M(i, j) inserts the value at that location in the matrix.
for i=1:5
for j=1:3
M(i, j) = input('Enter a value = ')
end
fprintf('Row %d inserted successfully\n', i)
end
disp('Full Matrix is = ')
disp(M)
If you enter the same values as in your example, the output will look like yours:
Full Matrix is =
1 2 3
3 2 1
1 2 3
4 3 2
1 1 1