I have used HashingTF with below:
tf = HashingTF(inputCol="text_tokenized", outputCol="text_tf", numFeatures=1000)
pipeline = Pipeline(stages=[tf])
hash_fit = pipeline.fit(df_disco_train_o)
df_disco_train_h = hash_fit.transform(df_disco_train_o)
And would like to get dictionary of {index: term} or {term: index} pair
One way to get it is via
tf.indexOf("first") # will give you index of "0"
But in order to do that I need to get all unique terms:
df_term_count = df_disco_train_h.select(explode('text_tokenized').alias('exploded')).groupby('exploded').count()
which is extremely costly. Is there a somekind of metadata or some function I can call to retrieve it without a query?
Here's what the output looks like, which is List Of text & sparse matrix.
Related
I have a dataset that I would like to categorise and store in a structure based on the value in one column of the dataset. For example, the data can be categorised into element 'label_100', 'label_200' or 'label_300' as I attempt below:
%The labels I would like are based on the dataset
example_data = [repmat(100,1,100),repmat(200,1,100),repmat(300,1,100)];
data_names = unique(example_data);
%create a cell array of strings for the structure fieldnames
for i = 1:length(data_names)
cell_data_names{i}=sprintf('label_%d', data_names(i));
end
%create a cell array of data (just 0's for now)
others = num2cell(zeros(size(cell_data_names)));
%try and create the structure
data = struct(cell_data_names{:},others{:})
This fails and I get the following error message:
"Error using struct
Field names must be strings."
(Also, is there a more direct method to achieve what I am trying to do above?)
According to the documentation of struct,
S = struct('field1',VALUES1,'field2',VALUES2,...) creates a
structure array with the specified fields and values.
So you need to have each value right after its field name. The way you are calling struct now is
S = struct('field1','field2',VALUES1,VALUES2,...)
instead of the correct
S = struct('field1',VALUES1,'field2',VALUES2,...).
You can solve that by concatenating cell_data_names and others vertically and then using {:} to produce a comma-separated list. This will give the cells' contents in column-major order, so each field name fill be immediately followed by the corresponding value:
cell_data_names_others = [cell_data_names; others]
data = struct(cell_data_names_others{:})
I have a loop which runs 100 times. In each iteration there is a string, double and a table assigned, and in the next iteration new values are assigned for them. What I want to do is to accumulate these values and after the loop finishes save the total result as result.mat using the matlab save function. I've tried putting them in cell-array but its not working so far, so if anyone could please advise how this can be done.
This is what I did:
results_cell=(100,3);
.
.
.
results_cell(i,1)=stringA;
results_cell(i,2)=TableA;
results_cell(i,3)=DoubleA;
But it gives this error Coversion to Cell from Table is not possible. So I've tried converting TableA to array of Doubles using table2array but I still get this Coversion to Cell from Double is not possible
I think using a structure would be a good way to store your data, since they are of different types and you can assign it meaningful field names for easy reference.
For example, let's call the structure Results. You can initialize it like so.
Results = struct('StringData',[],'TableData',[],'DoubleData',[])
Since you know its dimensions, you can even do this:
N = 100;
Results(N).StringData = [];
Results(N).TableData = [];
Results(N).DoubleData = [];
This automatically create a 1xN structure with 3 fields.
Then in your loop you can assign each field with its associated data like so:
for k = 1:N
Results(k).StringData = String(k);
Results(k).TableData = Table(k);
Results(k).DoubleData = Double(k);
end
where String(k), Table(k) and Double(k) are just generic names for your actual data.
When you're done with the loop you can access any type of data using a single index and the right field name.
In order to save a .mat file, use something like this:
save SomeFileName.mat Results
Which you can load into the workspace as you would with any .mat file:
Eg:
S = load('SomeFileName.mat')
R = S.Results
Hope that helps!
I am trying to take the averages of a pretty large set of data, so i have created a function to do exactly that.
The data is stored in some struct1.struct2.data(:,column)
there are 4 struct1 and each of these have between 20 and 30 sub-struct2
the data that I want to average is always stored in column 7 and I want to output the average of each struct2.data(:,column) into a 2xN array/double (column 1 of this output is a reference to each sub-struct2 column 2 is the average)
The omly problem is, I can't find a way (lots and lots of reading) to point at each structure properly. I am using a string to refer to the structures, but I get error Attempt to reference field of non-structure array. So clearly it doesn't like this. Here is what I used. (excuse the inelegence)
function [avrg] = Takemean(prefix,numslits)
% place holder arrays
avs = [];
slits = [];
% iterate over the sub-struct (struct2)
for currslit=1:numslits
dataname = sprintf('%s_slit_%02d',prefix,currslit);
% slap the average and slit ID on the end
avs(end+1) = mean(prefix.dataname.data(:,7));
slits(end+1) = currslit;
end
% transpose the arrays
avs = avs';
slits = slits';
avrg = cat(2,slits,avs); % slap them together
It falls over at this line avs(end+1) = mean(prefix.dataname.data,7); because as you can see, prefix and dataname are strings. So, after hunting around I tried making these strings variables with genvarname() still no luck!
I have spent hours on what should have been 5min of coding. :'(
Edit: Oh prefix is a string e.g. 'Hs' and the structure of the structures (lol) is e.g. Hs.Hs_slit_XX.data() where XX is e.g. 01,02,...27
Edit: If I just run mean(Hs.Hs_slit_01.data(:,7)) it works fine... but then I cant iterate over all of the _slit_XX
If you simply want to iterate over the fields with the name pattern <something>_slit_<something>, you need neither the prefix string nor numslits for this. Pass the actual structure to your function, extract the desired fields and then itereate them:
function avrg = Takemean(s)
%// Extract only the "_slit_" fields
names = fieldnames(s);
names = names(~cellfun('isempty', strfind(names, '_slit_')));
%// Iterate over fields and calculate means
avrg = zeros(numel(names), 2);
for k = 1:numel(names)
avrg(k, :) = [k, mean(s.(names{k}).data(:, 7))];
end
This method uses dynamic field referencing to access fields in structs using strings.
First of all, think twice before you use string construction to access variables.
If you really really need it, here is how it can be used:
a.b=123;
s1 = 'a';
s2 = 'b';
eval([s1 '.' s2])
In your case probably something like:
Hs.Hs_slit_01.data= rand(3,7);
avs = [];
dataname = 'Hs_slit_01';
prefix = 'Hs';
eval(['avs(end+1) = mean(' prefix '.' dataname '.data(:,7))'])
I have a matlab structure that follows the following pattern:
S.field1.data1
...
.field1.dataN
...
.fieldM.data1
...
.fieldM.dataN
I would like to assign values to one data field (say, data3) from all fields simultaneously. That would be semantically similar to:
S.*.data3 = value
Where the wildcard "*" represents all fields (field1,...,fieldM) in the structure. Is this something that can be done without a loop in matlab?
Since field1 .. fieldM are structure arrays with identical fields, why not make a struct array for "field"? Then you can easily set all "data" members to a specific value using deal.
field(1).data1 = 1;
field(1).data2 = 2;
field(2).data1 = 3;
field(2).data2 = 4;
[field.data1] = deal(5);
disp([field.data1]);
A loop-based solution can be flexible and easily readable:
names = strtrim(cellstr( num2str((1:5)','field%d') )); %'# field1,field2,...
values = num2cell(1:5); %# any values you want
S = struct();
for i=1:numel(names)
S.(names{i}).data3 = values{i};
end
In simple cases, you could do that by converting your struct into a cell array using struct2cell(). As you have a nested structure, I don't think that will work here.
On the other side, is there any reason why your data is structured like this. Your description gives the impression that a simple MxN array or cell array would be more suitable.
I'm interested in the general problem of accessing a field which may be buried an arbitrary number of levels deep in a containing structure. A concrete example using two levels is below.
Say I have a structure toplevel, which I define from the MATLAB command line with the following:
midlevel.bottomlevel = 'foo';
toplevel.midlevel = midlevel;
I can access the midlevel structure by passing the field name as a string, e.g.:
fieldnameToAccess = 'midlevel';
value = toplevel.(fieldnameToAccess);
but I can't access the bottomlevel structure the same way -- the following is not valid syntax:
fieldnameToAccess = 'midlevel.bottomlevel';
value = toplevel.(fieldnameToAccess); %# throws ??? Reference to non-existent field 'midlevel.bottomlevel'
I could write a function that looks through fieldnameToAccess for periods and then recursively iterates through to get the desired field, but I am wondering if there's some clever way to use MATLAB built-ins to just get the field value directly.
You would have to split the dynamic field accessing into two steps for your example, such as:
>> field1 = 'midlevel';
>> field2 = 'bottomlevel';
>> value = toplevel.(field1).(field2)
value =
foo
However, there is a way you can generalize this solution for a string with an arbitrary number of subfields delimited by periods. You can use the function TEXTSCAN to extract the field names from the string and the function GETFIELD to perform the recursive field accessing in one step:
>> fieldnameToAccess = 'midlevel.bottomlevel';
>> fields = textscan(fieldnameToAccess,'%s','Delimiter','.');
>> value = getfield(toplevel,fields{1}{:})
value =
foo