Using Matlab to randomly split an Excel Sheet

I have an Excel sheet containing 1838 records and I need to RANDOMLY split these records into 3 Excel Sheets. I am trying to use Matlab but I am quite new to it and I have just managed the following code:
[xlsn, xlst, raw] = xlsread('data.xls');
numrows = 1838;
randindex = ceil(3*rand(numrows, 1));
raw1 = raw(:,randindex==1);
raw2 = raw(:,randindex==2);
raw3 = raw(:,randindex==3);

Your general procedure will be to read the spreadsheet into some matlab variables, operate on those matrices such that you end up with three thirds and then write each third back out.
So you've got the read covered with xlsread, which in your code returns the numeric and text matrices as xlsn and xlst. Since you only need the third output, I would suggest using the syntax
[~, ~, raw] = xlsread('data.xls');
In the xlsread help file (you can access this by typing doc xlsread into the command window) it says that the three output arguments hold the numeric cells, the text cells and the whole lot. This is because a matlab matrix can only hold one type of value, whereas a spreadsheet will usually contain both text and numbers. The raw value will hold all of the values but in a 'cell array' instead, a different kind of matlab data type.
So then you will have a cell array called raw. From here you want to do four things:
work out how many rows you have (I assume each record is a row) by using the size function and specifying the appropriate dimension (again check the help file to see how to do this)
create an index of random numbers between 1 and 3 inclusive, which you can use as a mask
randindex = ceil(3*rand(numrows, 1));
apply the mask to your cell array to extract the records matching each index
raw1 = raw(randindex==1, :); % each record is a row, so mask the rows; do the same for the other two index values
write each cell back to a file
xlswrite('output1.xls', raw1);
You will probably have to fettle the arguments to get it to work the way you want but be sure to check the doc functionname page to get the syntax just right. Your main concern will be to get the indexing correct - matlab indexes row-first whereas spreadsheets tend to be column-first (e.g. cell A2 is column A and row 2, but matlab matrix element M(1,2) is the first row and the second column of matrix M, i.e. cell B1).
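The read, mask and extract steps above can be sketched without touching any files. This is only an illustration: the small cell array below is a made-up stand-in for the third output of xlsread, and randi is used in place of ceil(3*rand(...)) to draw the group numbers.

```matlab
% Hypothetical stand-in for the 'raw' output of xlsread: 5 records, 2 columns
raw = {1, 'a'; 2, 'b'; 3, 'c'; 4, 'd'; 5, 'e'};

% Number of records = number of rows (dimension 1)
numrows = size(raw, 1);

% Random group number (1, 2 or 3) for every record
randindex = randi(3, numrows, 1);

% Logical masks select whole rows, so every column of a record is kept
raw1 = raw(randindex == 1, :);
raw2 = raw(randindex == 2, :);
raw3 = raw(randindex == 3, :);
```

Each of raw1, raw2 and raw3 could then be passed to xlswrite as in the last step. The groups will generally not be the same size.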
UPDATE: splitting the file evenly is surprisingly more trouble: because we're using random numbers for the index, it's not guaranteed to split evenly. So instead we can generate a vector of random floats and then pick out the lowest 33% of them to make index 1, the highest 33% to make index 3 and let the rest be 2.
randvec = rand(numrows, 1); % float between 0 and 1
pct33 = prctile(randvec,100/3); % value of 33rd percentile
pct67 = prctile(randvec,200/3); % value of 67th percentile
randindex = ones(numrows,1);
randindex(randvec>pct33) = 2;
randindex(randvec>pct67) = 3;
It probably still won't be absolutely even - 1838 isn't a multiple of 3. You can see how many members each group has this way
numel(find(randindex==1))
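An alternative sketch, not part of the approach above: shuffle the row numbers with randperm and deal them out to the three groups in turn. This avoids prctile and guarantees that group sizes differ by at most one. The variable names groupOf and counts, and the stand-in raw array, are assumptions made for illustration.

```matlab
numrows = 1838;
nGroups = 3;

% Hypothetical stand-in for the cell array read from the spreadsheet
raw = num2cell((1:numrows)');

% Random ordering of the row numbers
perm = randperm(numrows);

% Deal the shuffled rows out to groups 1, 2, 3, 1, 2, 3, ...
groupOf = zeros(numrows, 1);
groupOf(perm) = mod((0:numrows-1)', nGroups) + 1;

% Number of records in each group
counts = accumarray(groupOf, 1);

% Extract the records of each group (rows, all columns)
raw1 = raw(groupOf == 1, :);
raw2 = raw(groupOf == 2, :);
raw3 = raw(groupOf == 3, :);
```

With 1838 records this gives two groups of 613 and one of 612, since 1838 = 3*612 + 2.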

Related

MATLAB automatically assigns unwanted dimensions to a dynamically updated cell array

I want to generate a 3D cell array called timeData so that timeData(:,:,a) for some integer a is an nx1 matrix of data, and the number of rows n varies with the value of a in a 1:1 correspondence. To do this, I am generating a 2D array of data called data that is nx1. This assignment statement takes place within a for loop as follows:
% Before iterating, I define an array of indices where I want to store the
% data sets in timeData. This choice of storage location is for
% organizational purposes.
A = [2, 5, 9, 21, 34, 100]; % Notice they are in ascending order, but have
% gaps that have no predictability.
sizeA = size(A);
numIter = sizeA(2); % number of entries in A
for m = 1:numIter % numIter is the number of data sets that I need to store
% in timeData
% At this point, some code that is entirely irrelevant to my question
% generates a nx1 array of data. One example of this data array is below.
data = [1.1;2.3;5.5;4.4]; % This is one example of what data could be. Its
% number of rows, n, changes each iteration, as
% do its contents.
B = size(data);
timeData(1:B(1),1,A(m)) = num2cell(data);
end
This code does put all contents of data in the appropriate locations within timeData as I want. However, it also adds {0x0 double} rows to all 2D arrays of timeData(:,:,a) for any a whose corresponding number of rows n was not the largest number of rows. Thus, there are many of these 2D arrays that have 10 to a couple hundred 0-valued rows that I don't want. For values of a that did not have a corresponding data set, the content of timeData(:,:,a) is an nx1 array of {0x0 double}.
I need to iterate over the contents of timeData in subsequent code, and I need to be able to find the size of the data set that is in timeData(:,:,a) without somehow discounting all the {0x0 double}.
How can I modify my assignment statement to fix this?
Edit: Desired output of the above example is the following with n = 5. Let this data set be represented by a = 9.
timeData(:,:,9) = {[1.1]}
{[2.3]}
{[5.5]}
{[8.6]}
{[4.4]}
Now, consider the possibility that a previous or subsequent value of the A matrix had a data set with n = 7, and n = 7 is the largest data set (largest n value). timeData(:,:,9) outputs like so in my code:
timeData(:,:,9) = {[1.1]}
{[2.3]}
{[5.5]}
{[8.6]}
{[4.4]}
{[0x0 double]}
{[0x0 double]}
@Dev-iL, as I understand it, your answer gives me the ability to delete the cells that have {[0x0 double]} in them (this is what I mean by "discounting"). This is a good plan B, but is there a way to prevent the {[0x0 double]} cells from showing up in the first place?
Edit 2: Update to the above statement "your answer gives me the ability to delete the cells that have {[0x0 double]} in them (this is what I mean by "discounting")". The cellfun(@isempty, ...) call makes the {[0x0 double]} cells go to {[0x0 cell]}; it does not remove them. In other words, size(timeData(:,:,9)) is the same before and after the command is performed. This is not what I want. I want size(timeData(:,:,9)) to be 5x1 no matter what n is for any other value of a.
Edit 3: I just realized that the most desired output would be the following:
timeData(:,:,9) = {[1.1;2.3;5.5;8.6;4.4]} % An n x 1 column matrix within
% the cell.
but I can work with this outcome or the outcome as described above.
Unfortunately, I don't understand the structure of your dataset, which is why I can't suggest a better assignment method. However, I'd like to point out an operation that can help you deal with your data after it's been created:
cellfun(@isempty,timeData);
What the above does is return a logical array the size of timeData, indicating which cells contain something "empty". Typically, an array of arbitrary datatype is considered "empty" when it has at least one dimension that is equal to 0.
How can you use it to your advantage?
%% Example 1: counting non-empty cells:
nData = sum(~cellfun(@isempty,timeData(:)));
%% Example 2: assigning empty cells in place of empty double arrays:
timeData(cellfun(@isempty,timeData)) = {{}};
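For the layout described in the question's Edit 3 (one n x 1 column per cell), a possible sketch is to store each data vector whole in a single cell instead of spreading it over rows with num2cell. A cell array has to be rectangular, so per-row storage always pads to the longest data set; one vector per cell sidesteps that. The dataSets variable and the shortened A below are made-up stand-ins for the data generated inside the loop.

```matlab
% Storage indices (shortened from the question for illustration)
A = [2, 5, 9];

% Hypothetical data sets with different numbers of rows
dataSets = {[1.1; 2.3; 5.5; 4.4], (1:7)', [1.1; 2.3; 5.5; 8.6; 4.4]};

% One cell per storage index; unused indices simply stay empty
timeData = cell(1, 1, max(A));
for m = 1:numel(A)
    data = dataSets{m};            % n x 1, n varies per iteration
    timeData{1, 1, A(m)} = data;   % whole column goes into one cell
end

% Size of the data set stored at a = 9, with no padding to discount
n9 = size(timeData{1, 1, 9}, 1);

% Logical vector marking which indices actually hold data
hasData = ~cellfun(@isempty, timeData(:));
```

Here n9 is 5 regardless of how long the other data sets are, and hasData is true only at indices 2, 5 and 9.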

Printing progress in command window

I'd like to use fprintf to show code execution progress in the command window.
I've got a N x 1 array of structures, let's call it myStructure. Each element has the fields name and data. I'd like to print the name side by side with the number of data points, like such:
name1 number1
name2 number2
name3 number3
name4 number4
...
I can use repmat N times along with fprintf. The problem with that is that all the numbers have to come in between the names in a cell array C.
fprintf(repmat('%s\t%d\n',1,N),C{:})
I can use cellfun to get the names and number of datapoints.
names = {myStructure.name};
numpoints = cellfun(@numel,{myStructure.data});
However I'm not sure how to get this into a cell array with alternating elements for C to make the fprintf work.
Is there a way to do this? Is there a better way to get fprintf to behave as I desire?
You're very close. What I would do is change your cellfun call so that the output is a cell array instead of a numeric array. Use the 'UniformOutput' flag and set this to 0 or false.
When you're done, make a new cell array where both the name cell array and the size cell array are stacked on top of each other. You can then call fprintf once.
% Save the names in a cell array
A = {myStructure.name};
% Save the sizes in another cell array
B = cellfun(@numel, {myStructure.data}, 'UniformOutput', 0);
% Create a master cell array where the first row are the names
% and the second row are the sizes
out = [A; B];
% Print out the elements side-by-side
fprintf('%s\t%d\n', out{:});
The trick with the last line of code is that when you unroll the cell array using {:}, this creates a comma-separated list unrolled in column-major format, and so doing out{:} actually gives you:
A{1}, B{1}, A{2}, B{2}, ..., A{n}, B{n}
... which provides the interleaving you need. Therefore, providing this order into fprintf coincides with the format specifiers that are specified and thus gives you what you need. That's why it's important to stack the cell arrays so that each column gives the information you need.
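The column-major unrolling can be checked on a tiny, made-up example. The names and sizes below are hypothetical, and sprintf is used instead of fprintf only so that the result can be inspected as a string.

```matlab
% Hypothetical names and sizes
A = {'hello', 'hi'};
B = {5, 9};

% Names in the first row, sizes in the second row
out = [A; B];

% out(:) walks down each column in turn: A{1}, B{1}, A{2}, B{2}
flat = out(:)';

% Same format string as above; the format is reused for every name/size pair
str = sprintf('%s\t%d\n', out{:});
```

If the two cell arrays were concatenated side by side as [A, B] instead, the unrolled list would be all the names followed by all the sizes, which would not match the format string.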
Minor Note
Of course, one should never forget that one of the easiest ways to tackle your problem is to just use a simple for loop. Even though for loops are often discouraged in MATLAB in favor of vectorized code, their performance has come a long way throughout MATLAB's evolution.
Simply put, just do this:
for ii = 1 : numel(myStructure)
fprintf('%s\t%d\n', myStructure(ii).name, numel(myStructure(ii).data));
end
The above code is arguably more readable in comparison to what we did above with cell arrays. You're accessing the structure directly rather than having to create intermediate variables for the purpose of calling fprintf once.
Example Run
Here's an example of this running. Using the data shown below:
clear myStructure;
myStructure(1).name = 'hello';
myStructure(1).data = rand(5,1);
myStructure(2).name = 'hi';
myStructure(2).data = zeros(3,3);
myStructure(3).name = 'huh';
myStructure(3).data = ones(6,4);
I get the following output after running the printing code:
hello 5
hi 9
huh 24
We can see that the sizes are correct as the first element in the structure is simply a random 5 element vector, the second element is a 3 x 3 = 9 zeroes matrix while the last element is a 6 x 4 = 24 ones matrix.

Importing text file into matrix form with indexes as strings?

I'm new to Matlab so bear with me. I have a text file in this form :
b0002 b0003 999
b0002 b0004 999
b0002 b0261 800
I need to read this file and convert it into a matrix. The first and second column in the text file are analogous to row and column of a matrix(the indices). I have another text file with a list of all values of 'indices'. So it should be possible to create an empty matrix beforehand.
b0002
b0003
b0004
b0005
b0006
b0007
b0008
Is there anyway to access matrix elements using custom string indices(I doubt it but just wondering)? If not, I'm guessing the only way to do this is to assign the first row and first column the index string values and then assign the third column values based on the first text file. Can anyone help me with that?
You can easily convert those strings to numbers and then use those as indices. For a given string, b0002:
s = 'b0002'
str2num(s(2:end)) % output = 2
Furthermore, you can also do this with a char matrix:
t = ['b0002';'b0003';'b0004']
t =
b0002
b0003
b0004
str2num(t(:,2:end))
ans =
2
3
4
First, we use textscan to read the data in as two strings and a float (other numerical formats could be used). We have to open the file for reading first, and close it again once we're done.
fid = fopen('myfile.txt');
A = textscan(fid,'%s%s%f');
fclose(fid);
textscan returns a cell array, so we have to extract your three variables. x and y are converted to single char arrays using cell2mat (works only if all the strings inside are the same length), n is a list of numbers.
x = cell2mat(A{1});
y = cell2mat(A{2});
n = A{3};
We can now convert x and y to numbers by taking every row (:) but only the second to final characters of each row (2:end), e.g. 0002, 0003, not b0002, b0003.
x = str2num(x(:,2:end));
y = str2num(y(:,2:end));
Slight problem with indexing - if I have a matrix A and I do this:
A = magic(8);
A([1,5],[3,8])
Then it returns four elements - [1,3],[5,3],[1,8],[5,8] - not two. But what you want is the location in your matrix equivalent to x(1),y(1) to be set to n(1) and so on. To do this, we need to 1) work out the final size of matrix. 2) use sub2ind to calculate the right locations.
% find the size
% if you have a specified size you want the output to be use that instead
xsize = max(x);
ysize = max(y);
% initialise the output matrix - not always necessary but good practice
out = zeros(xsize,ysize);
% turn our x,y into linear indices
ind = sub2ind([xsize,ysize],x,y);
% put our numbers in our output matrix
out(ind) = n;
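If the matrix should be indexed by position in the separate list of index strings rather than by the number embedded in each string, one possible sketch is to look the labels up with ismember and then use sub2ind as above. The variables labels, rowLabels and colLabels are made-up stand-ins for the contents of the two text files.

```matlab
% Hypothetical stand-ins for the two text files
labels    = {'b0002'; 'b0003'; 'b0004'; 'b0261'};  % full list of index strings
rowLabels = {'b0002'; 'b0002'; 'b0002'};           % first column of the data file
colLabels = {'b0003'; 'b0004'; 'b0261'};           % second column of the data file
n         = [999; 999; 800];                       % third column of the data file

% Position of each label in the master list
[~, r] = ismember(rowLabels, labels);
[~, c] = ismember(colLabels, labels);

% Square matrix with one row and one column per label
out = zeros(numel(labels));

% Linear indices pair r(k) with c(k) instead of taking every combination
out(sub2ind(size(out), r, c)) = n;
```

This keeps the matrix as small as the label list (4 x 4 here) instead of sizing it by the largest number found in the strings (261).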

Increasing the length of a column in MATLAB

I'm just beginning to teach myself MATLAB, and I'm making a 501x6 array. The columns will contain probabilities for rolling a 101-sided die, and as such, the columns contain 101, 201, 301 entries, not 501. Is there a way to 'stretch the column' so that I add 0s above and below the useful data? So far I've only thought of making a column like a=[zeros(200,1);die;zeros(200,1)] so that only the data shows up in rows 201-301, and similarly, b=[zeros(150,1);die2;zeros(150,1)], if I wanted 200 or 150 zeros to precede and follow the data, respectively, in order for it to fit in the array.
Thanks for any suggestions.
You can do several things:
Start with an all-zero matrix, and only modify the elements you need to be non-zero:
A = zeros(501,6);
A(someValue:someOtherValue, 5) = value;
% OR: assign the range to a vector:
A(someValue:someOtherValue, 5) = 1:20; % if someValue:someOtherValue is the same length as 1:20
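Applied to the 501x6 case from the question, the same idea might look like the sketch below. The uniform dieProbs vector is a made-up stand-in for the real probabilities, and centering the data is an assumption based on the question's zeros(200,1) example.

```matlab
nRows = 501;
A = zeros(nRows, 6);

% Hypothetical stand-in for the 101 probabilities of a single die
dieProbs = ones(101, 1) / 101;

% Number of zero rows to leave above (and below) the data
pad = (nRows - numel(dieProbs)) / 2;

% Write the data into the middle of column 1; the rest stays zero
A(pad + 1 : pad + numel(dieProbs), 1) = dieProbs;
```

Here pad is 200, so the data lands in rows 201 to 301, matching the hand-built [zeros(200,1);die;zeros(200,1)] column from the question.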

MATLAB - Index exceeds matrix dimensions

Hi, I have a problem with a matrix.
I have many .txt files with different numbers of rows but the same number of columns (1 column),
e.g. s1.txt = 1234 rows
s2.txt = 1200 rows
s3.txt = 1100 rows
I want to combine the three files. Since they have different numbers of rows, when I write them to a new file I get this error: Index exceeds matrix dimensions.
How can I solve this problem?
You can combine three matrices simply by stacking them: Assuming that s1, etc are the matrices you read in, you can make a new one like this:
snew = [s1; s2; s3];
You could also use the [] style stacking without creating the new matrix variable if you only need to do it once.
You have provided far too little information for an accurate diagnosis of your problem. Perhaps you have loaded the data from your files into variables in your workspace. Perhaps s1 has 1 column and 1234 rows, etc. Then you can concatenate the variables into one column vector like this:
totalVector = [s1; s2; s3];
and write it out to a file with a save() statement.
Does that help?
Let me assume that this question is related to another question of yours, and that you want to combine those matrices by columns, leaving empty values in columns with fewer data.
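A minimal sketch of the stacking approach, under the assumption that each file holds a single column of numbers. The short vectors below are made-up stand-ins for the loaded files, and dlmwrite with a temporary file name is used here in place of save() purely for illustration.

```matlab
% Hypothetical stand-ins for the columns read from s1.txt, s2.txt, s3.txt
s1 = (1:4)';
s2 = (5:7)';
s3 = (8:9)';

% Vertical concatenation works for any row counts as long as the
% number of columns (here 1) is the same
totalVector = [s1; s2; s3];

% Write the combined column to a temporary text file and read it back
outFile = [tempname '.txt'];
dlmwrite(outFile, totalVector);
readBack = dlmread(outFile);
delete(outFile);
```

Because the vectors are stacked rather than placed side by side, their different lengths never trigger an indexing error.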
In this case this code should work:
BaseFile ='s';
n=3;
A = cell(1,n);
for k=1:n
A{k} = dlmread([BaseFile num2str(k) '.txt']);
end
% create cell array with maximum number of rows and n number of columns
B = cell(max(cellfun(@numel,A)),n);
% convert each matrix in A to cell array and store in B
for k=1:n
B(1:numel(A{k}),k) = num2cell(A{k});
end
% save the data
xlswrite('output.txt',B)
The code assumes you have one column in each file, otherwise it will not work.