Convert dataset column to obsnames - matlab

I have many large dataset arrays in my workspace (loaded from a .mat file).
A minimal working example is like this
>> disp(old_ds)
Date firm1 firm2 firm3 firm4
734692 880,0 102,1 32,7 204,2
734695 880,0 102,0 30,9 196,4
734696 880,0 100,0 30,9 200,2
734697 880,0 101,4 30,9 200,2
734698 880,0 100,8 30,9 202,2
where the first row (the strings) already holds the variable names of the dataset; that is, they are what old_ds.Properties.VarNames returns.
I'm wondering whether there is an easy and/or fast way to turn the first column into the ObsNames.
As a first approach, I've thought of "exporting" the data matrix (columns 2 to 5, in the example), the vector of dates and then creating a new dataset where the rows have names.
Namely:
>> mat = double(old_ds(:,2:5)); % taking the data, making it a matrix array
>> head = old_ds.Properties.VarNames % saving headers
>> head(1,1) = []; % getting rid of 'Date' from head
>> dates = dataset2cell(old_ds(:,1)); % taking dates as column cell array
>> dates(1) = []; % getting rid of 'Date' from dates
>> new_ds = mat2dataset(mat,'VarNames',head,'ObsNames',dates);
Apart from the fact that the last line returns the following error, ...
Error using setobsnames (line 25)
NEWNAMES must be a nonempty string or a cell array of nonempty strings.
Error in dataset (line 377)
a = setobsnames(a,obsnamesArg);
Error in mat2dataset (line 75)
d = dataset(vars{:},args{:});
...my plan was to find a solution, wrap it in a function (to generalize the process to all 22 dataset arrays that I have) and then run the function 22 times (once for each dataset array).
To put things into perspective, each dataset has 7660 rows and a number of columns that ranges from 2 to 1320.
I have no idea how (or whether) I could make the dataset directly "eat" the first column as ObsNames.
Can anyone give me a hint?
EDIT: attached a sample file.

Actually, it should be quite easy (though the fact that I'm reading your question means I had the same problem and googled it before looking up the documentation... ;)
When loading the dataset, use the following command (adjusted to your case of course):
cell_dat{1} = dataset('File', 'YourDataFile.csv', 'Delimiter', ';',...
'ReadObsNames', true);
'ReadObsNames' defaults to false. When it is true, the first column of the file or range is read as observation names; if 'ReadVarNames' is also true, the header of that first column is saved as the name of the first dimension in A.Properties.DimNames.
(see the Documentation, Section: "Name/value pairs available when using text files or Excel spreadsheets as inputs")
I can't download your sample file, but if you haven't solved the problem some other way yet, try the suggested solution and let me know whether it works. Glad if I could help.

You are almost there: the error message is basically saying that ObsNames have to be strings. In your case the 'dates' variable is a cell array containing doubles, so you just need to convert them to strings.
mat = double(piHU(:,2:end)); % taking the data, making it a matrix array
head = piHU.Properties.VarNames; % saving headers
head(1) = []; % getting rid of 'Date' from head
dates = dataset2cell(piHU(:,1)); % taking dates as a column cell array; the entries are doubles (class(dates{2}) returns 'double')
dates(1) = []; % getting rid of 'Date' from dates
dates_str = cellfun(@(s) num2str(s),dates,'UniformOutput',false); % convert dates to strings; class(dates_str{2}) should now return 'char'
new_ds = mat2dataset(mat,'VarNames',head,'ObsNames',dates_str); % construct new dataset.
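As an alternative to rebuilding the dataset, the observation names can probably be assigned in place through Properties.ObsNames, with the Date variable deleted afterwards. The following is only a minimal sketch, assuming a dataset array from the Statistics Toolbox whose numeric Date variable holds unique values; the toy data and the name ds are made up for illustration.

```matlab
% Toy dataset standing in for old_ds (values are illustrative only)
ds = dataset([734692; 734695; 734696], [880.0; 880.0; 880.0], ...
             [102.1; 102.0; 100.0], 'VarNames', {'Date', 'firm1', 'firm2'});

% ObsNames must be a cell array of unique, nonempty strings,
% so convert the numeric dates to strings first
names = strtrim(cellstr(num2str(ds.Date)));

ds.Properties.ObsNames = names;   % use the dates as row names
ds.Date = [];                     % drop the now redundant Date variable
```

Placed inside a loop or a small function, the same three steps could be applied to each of the 22 dataset arrays in turn.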

Related

Read specific portions of an excel file based on string values in MATLAB

I have an excel file and I need to read it based on string values in the 4th column. I have written the following but it does not work properly:
[num,txt,raw] = xlsread('Coordinates','Centerville');
zn={};
ctr=0;
for i = 3:size(raw,1)
tf = strcmp(char(raw{i,4}),char(raw{i-1,4}));
if tf == 0
ctr = ctr+1;
end
zn{ctr}=raw{i,4};
end
data=zeros(1,10); % 10 corresponds to the number of columns I want to read (herein, columns 'J' to 'S')
ctr=0;
for j = 1:length(zn)
for i=3:size(raw,1)
tf=strcmp(char(raw{i,4}),char(zn{j}));
if tf==1
ctr=ctr+1;
data(ctr,:,j)=num(i-2,10:19);
end
end
end
It gives me a "15129x10x22 double" thing and when I try to open it I get the message "Cannot display summaries of variables with more than 524288 elements". It might be obvious but what I am trying to get as the output is 'N = length(zn)' number of matrices which represent the data for different strings in the 4th column (so I probably need a struct; I just don't know how to make it work). Any ideas on how I could fix this? Thanks!
Did not test it, but this should help you get going:
EDIT: corrected wrong indexing into raw vector. Also, depending on the format you might want to restrict also the rows of the raw matrix. From your question, I assume something like selector = raw(3:end,4); and data = raw(3:end,10:19); should be correct.
[~,~,raw] = xlsread('Coordinates','Centerville');
selector = raw(:,4);
data = raw(:,10:19);
[selector,~,grpidx] = unique(selector);
nGrp = numel(selector);
out = cell(nGrp,1);
for i=1:nGrp
idx = grpidx==i;
out{i} = cell2mat(data(idx,:));
end
out is the output variable. The key here is the variable grpidx, which is an output of the unique function and allows you to trace the unique values back to their positions in the original vector. Note that unique as I used it may change the order of the string values. If that is an issue for you, use the setOrder parameter of the unique function and set it to 'stable'.
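To illustrate the difference between the default ordering and 'stable', here is a small sketch on made-up labels (the zone names are hypothetical and merely stand in for the 4th spreadsheet column). It assumes a MATLAB release in which unique accepts the 'stable' option.

```matlab
% Hypothetical group labels, standing in for the 4th spreadsheet column
selector = {'zoneB'; 'zoneA'; 'zoneB'; 'zoneC'; 'zoneA'};

% Default call: unique values come back sorted
[sortedVals, ~, sortedIdx] = unique(selector);

% 'stable' keeps the order of first appearance instead
[stableVals, ~, stableIdx] = unique(selector, 'stable');

% In both cases the third output maps each row back to its group
rowsOfFirstGroup = find(stableIdx == 1);   % rows holding stableVals{1}
```

With 'stable', the first group is 'zoneB' because it appears first in the input; with the default call, the first group is 'zoneA'.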

How to extract columns of data from .txt files MATLAB

I have some data in a .txt file, separated by commas.
For example:
1.4,2,3,4,5
2,3,4.2,5,6
24,5,2,33.4,62
What if you want the average of a column, such as the first column (1.4, 2 and 24) or the second column (2, 3 and 5)?
I think putting the column in an array and using the built-in mean function would work, but so far I am only able to extract rows, not columns.
Instead of making another thread, I thought I'd edit this one. I am working on getting the average of each column of the well-known iris data set.
I cut a small portion of the data:
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
delimiterln= ',';
data = importdata('iris.txt', delimiterln);
meanCol1 = mean(data(:,1))
meanCol2 = mean(data(:,2))
meanCol3 = mean(data(:,3))
meanCol4 = mean(data(:,4))
Undefined function 'sum' for input arguments of type 'cell'.
Error in mean (line 115)
y = sum(x, dim, flag)/size(x,dim);
Error in irisData(line 6)
meanCol1 = mean(data(:,1))
It looks like there is an error in handling the data type. Any thoughts on this? I tried getting rid of the last column, which contains strings, and it seems to work without error, so I am thinking that it's because of the strings.
Use the comma-separated file reading function (note that csvread handles numeric data only, so the string column has to be removed first, as you found):
M = csvread(filename);
Now you have the matrix M:
col1Mean=mean(M(:,1));
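If the string column should stay in the file, one possible alternative is textscan with a mixed format. This is only a sketch: it parses three of the sample rows from a string rather than from iris.txt, and the variable names (txt, C, M, colMeans, labels) are illustrative.

```matlab
% Three sample rows in the iris layout: four numbers, then a label
txt = sprintf(['5.1,3.5,1.4,0.2,Iris-setosa\n', ...
               '4.9,3.0,1.4,0.2,Iris-setosa\n', ...
               '4.7,3.2,1.3,0.2,Iris-setosa']);

% Read four numeric columns (%f) and one text column (%s)
C = textscan(txt, '%f%f%f%f%s', 'Delimiter', ',');

M = [C{1:4}];            % numeric part as a 3-by-4 matrix
colMeans = mean(M, 1);   % one mean per column
labels = C{5};           % species names as a cell array of strings
```

For the actual file, the first argument would be a file identifier obtained with fopen('iris.txt') instead of the string, followed by fclose once reading is done.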

Save a string, double and table Matlab

I have a loop which runs 100 times. In each iteration a string, a double and a table are assigned, and in the next iteration new values are assigned to them. What I want to do is accumulate these values and, after the loop finishes, save the total result as result.mat using the MATLAB save function. I've tried putting them in a cell array but it's not working so far. Could anyone please advise how this can be done?
This is what I did:
results_cell=cell(100,3);
.
.
.
results_cell(i,1)=stringA;
results_cell(i,2)=TableA;
results_cell(i,3)=DoubleA;
But it gives this error: "Conversion to Cell from Table is not possible". So I've tried converting TableA to an array of doubles using table2array, but I still get "Conversion to Cell from Double is not possible".
I think using a structure would be a good way to store your data, since they are of different types and you can assign it meaningful field names for easy reference.
For example, let's call the structure Results. You can initialize it like so.
Results = struct('StringData',[],'TableData',[],'DoubleData',[])
Since you know its dimensions, you can even do this:
N = 100;
Results(N).StringData = [];
Results(N).TableData = [];
Results(N).DoubleData = [];
This automatically creates a 1xN structure with 3 fields.
Then in your loop you can assign each field with its associated data like so:
for k = 1:N
Results(k).StringData = String(k);
Results(k).TableData = Table(k);
Results(k).DoubleData = Double(k);
end
where String(k), Table(k) and Double(k) are just generic names for your actual data.
When you're done with the loop you can access any type of data using a single index and the right field name.
In order to save a .mat file, use something like this:
save SomeFileName.mat Results
Which you can load into the workspace as you would with any .mat file:
Eg:
S = load('SomeFileName.mat')
R = S.Results
Hope that helps!
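For completeness, the cell-array attempt from the question would probably also work once curly braces are used for the assignment, since parentheses on a cell array expect a cell on the right-hand side. A minimal sketch with placeholder data (the values assigned to stringA, TableA and DoubleA are made up, and table requires a release that provides it):

```matlab
N = 3;                          % 100 in the question; small here for brevity
results_cell = cell(N, 3);      % preallocate an N-by-3 cell array

for i = 1:N
    stringA = sprintf('run %d', i);   % placeholder string
    TableA  = table((1:i)');          % placeholder table
    DoubleA = i / 2;                  % placeholder double

    % Curly braces store the value itself inside the cell
    results_cell{i, 1} = stringA;
    results_cell{i, 2} = TableA;
    results_cell{i, 3} = DoubleA;
end
```

After the loop, save('result.mat', 'results_cell') would write the cell array to disk in the same way as the structure above.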

Matlab: How do you separate text in existing cell

I'm a bit new to the matlab world, and I'm running into an issue that I'm sure has an easy solution.
I've imported some data from a text file and parsed out the headers, which resulted in a 1x35 cell called Data. In each cell (for example Data{1,1,1}) is data that looks like:
'600000 -947.772827 -107.045776 -70.818062'
'600001 -920.431396 -86.098122 -56.485119'
'600002 -878.332886 -88.673630 -85.249130'
'600003 -851.637695 -68.546539 -96.691711'
'600004 -834.707642 -28.951260 -73.218872'
'600005 -783.431580 40.657402 24.242268'
The problem is, each line is contained in a single column. I'd like to parse it out so that I have 4 columns instead of one.
I tried parsing out the Data cell even further using:
textscan(Data{1,1,1}, '%u%f10%f10%f10', 1)
But it resulted in the following error:
Error using textscan
First input must be of type double or string.
Can I use textscan this way, or do I need to use some other method to break out the text?
With textscan, you can only specify a single string or a single number (a file identifier) as the first input. Your input, I suspect, is a 6 x 1 cell array of strings. As such, you have no choice but to iterate over the cells and convert the contents of each one with textscan. Also, get rid of the 10 after each %f, as it is actually throwing off where the string gets parsed. Finally, read the first number as a double (%f) rather than an unsigned integer (%u) to allow for easier conversion.
Therefore, do something like this:
>> Data{1,1,1} = {'600000 -947.772827 -107.045776 -70.818062'
'600001 -920.431396 -86.098122 -56.485119'
'600002 -878.332886 -88.673630 -85.249130'
'600003 -851.637695 -68.546539 -96.691711'
'600004 -834.707642 -28.951260 -73.218872'
'600005 -783.431580 40.657402 24.242268'};
>> format long g;
>> vals = cell2mat(cellfun(@(x) cell2mat(textscan(x, '%f%f%f%f', 1)), Data{1,1,1}, 'uni', 0))
vals =
Columns 1 through 3
600000 -947.772827 -107.045776
600001 -920.431396 -86.098122
600002 -878.332886 -88.67363
600003 -851.637695 -68.546539
600004 -834.707642 -28.95126
600005 -783.43158 40.657402
Column 4
-70.818062
-56.485119
-85.24913
-96.691711
-73.218872
24.242268
That statement vals = ... is quite a mouthful, but easy to explain. Start with this statement:
cell2mat(textscan(x, '%f%f%f%f', 1))
For a given cell x in Data{1,1,1}, we want to parse out four numbers for each string that is stored in x. textscan will place these numbers as individual cell elements into a cell array. We want to convert each element into a numeric array, and so cell2mat is required for us to do so.
In order to operate over all of the elements in Data{1,1,1}, we need to use cellfun to allow us to do so:
cellfun(@(x) cell2mat(textscan(x, '%f%f%f%f', 1)), Data{1,1,1}, 'uni', 0)
The first input is a function that operates on each cell stored in Data{1,1,1} (the second input). We are basically telling cellfun that we want to operate on each cell in the cell array stored in Data{1,1,1} in the way I talked about before. This function has input parameter x, which is one cell from Data{1,1,1}. Now, the uni flag is set to 0 because the output of cellfun will not be a single number, but an array of numbers - one array per line that you have in your cell array. The output of this stage would be a 6 element cell array where each location is a 4 element numeric array. To finish it off, we call cell2mat on this output to finally convert our text into a 2D matrix and therefore:
vals = cell2mat(cellfun(@(x) cell2mat(textscan(x, '%f%f%f%f', 1)), Data{1,1,1}, 'uni', 0))
format long g allows for better display formatting so we can see both the dominant number as well as the floating point numbers neatly.
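A shorter route that may be worth trying is to glue the strings together and let sscanf read all numbers in one pass. This is a sketch of a different technique from the cellfun/textscan approach, assuming every string holds exactly four whitespace-separated numbers; the variable names rows and joined are illustrative, and only the first three strings from the example are used.

```matlab
% First three strings from the example above
rows = {'600000 -947.772827 -107.045776 -70.818062'
        '600001 -920.431396 -86.098122 -56.485119'
        '600002 -878.332886 -88.673630 -85.249130'};

% Glue all strings into one, separated by spaces
joined = sprintf('%s ', rows{:});

% Read every number, filling a 4-by-N matrix column by column,
% then transpose so each original string becomes one row
vals = sscanf(joined, '%f', [4 Inf])';
```

The transpose is needed because sscanf fills its output in column order, so each string first lands in one column of the 4-by-N result.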

candlestick chart in matlab

I have a .csv file which contains some data, e.g. date (30/10/2013), closePrice (361.08), volume (4500014), openPrice (362.62), highPrice (365), lowPrice (358.65). The file contains 2510x6 data, and I want to plot a candlestick chart. Can someone help me? This is what I have done:
fid = fopen('Amazon.csv');
HDRS = textscan(fid,'%s %s %s %s %s %s',1, 'delimiter',',');
DATA = textscan(fid,'%s %f %f %f %f %f','delimiter',',');
fclose(fid);
outCell = cell(size(DATA{1},1), length(HDRS));
for i = 1:length(HDRS);
if isnumeric(DATA{i});
outCell(:,i) = num2cell(DATA{i});
else
outCell(:,i) = DATA{i};
end
end
candle (outCell{:,5}, outCell{:,6}, outCell{:,2}, outCell{:,4}, 'b', outCell{:,1});
When running the file I get an error saying "Error using candle Too many input arguments". I am using a cell array because I have dates, and I decided to use a cell array in order to convert the dates into a vector.
Curly-bracket dereferencing, as in outCell{:, 5} in your call to candle, expands to what Matlab calls a "comma-separated list". Whenever you see curly-bracket dereferencing, you can think of it as being exactly equivalent to typing out the separate elements that are implied, separated by commas. So if size(outCell, 1) is 3, then this is as if you had typed outCell{1, 5}, outCell{2, 5}, outCell{3, 5}. That's three input arguments to candle right there, where you thought you were passing just one.
I'm unfamiliar with candle itself, but if it wants a single-column cell array as its first argument, then the way to get a single-column cell array out of outCell is to slice it with ordinary round-bracket dereferencing: outCell(:, 5)
If on the other hand candle wants a numeric vector rather than a cell array, you can say cell2mat(outCell(:, 5)). Another way (and this second example is where the power of curly-bracket dereferencing and comma-separated lists becomes apparent) would be to say [outCell{:, 5}]' - that's a comma-separated list, caught inside square brackets, which means horizontal concatenation of the elements.
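The difference between the two kinds of indexing can be seen on a toy cell array. This is just an illustrative sketch; the variable names are made up and candle itself is not called.

```matlab
c = {10; 20; 30};          % 3-by-1 cell array of numbers

nArgs = numel({c{:}});     % c{:} expands to 3 separate values
sub   = c(:, 1);           % parentheses return one 3-by-1 cell array

v1 = [c{:}]';              % comma-separated list caught in brackets
v2 = cell2mat(c);          % same numeric column via cell2mat
```

Here nArgs counts the elements produced by the comma-separated list, which is the number of arguments a function would receive if c{:} were passed to it directly.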
I found the following way to do this:
Firstly, I observed that you need the dates as a column vector and not as a cell array. The only way to achieve that is to convert each date into some numerical representation, which is exactly what datenum does. An example follows:
DateString = '11/12/2013';
formatIn = 'mm/dd/yyyy';
datenum(DateString,formatIn)
ans =
735550
Convert all your dates in this format. Next, I feel that if you construct the time series object, it would be much easier to plot as shown here. This needs a financial time series object to work. No problem. It can be constructed as shown here. In this case, I believe it can be constructed as (dummy example):
dates={'11/12/2013';'11/13/2013'}
higPrice=[100;100]
lowPrice=[10;10]
closePrice=[90;80]
openPrice=[80;70]
%construct a financial time series object
tsobj = fints(datenum(dates,formatIn), [higPrice lowPrice closePrice openPrice], {'high','low','close','open'}) %put in correct order
candle(tsobj); %I get the plot
EDIT: I forgot to mention that if I try to give any other names than 'high','low','open','close' it doesn't work. For example, I tried with 'highPrice','lowPrice','openPrice','closePrice'. I do not know the reason for this as I am also using candle for the first time.