Scala- Databricks- Linear Regression - scala

Can someone please explain me the meaning of below line of code (in scala-databricks)
val categoricalIndexers = categoricalVariables
.map(i => new StringIndexer().setHandleInvalid("skip").setInputCol(i)
.setOutputCol(i+"Index"))

The aim of the StringIndexer is to map a string column of labels to an ML column of label indices (see the documentation of Apache Spark for more details and codes(.
You can use setHandleInvalid to choose how the unseen labels are handled using the StringIndexer. If you choose the "skip" setting the rows containing the unseen labels will be skipped (removed from the output).
And the aim of the end of the code .setInputCol(i) and .setOutputCol(i+"Index") is to index the string variables and to return the indexed variables; the names of the indexed variables are the names of the original string variables + "Index". For example, let's use the string variable "City". Once the string variable is indexed, the name of the indexed variable will be CityIndex.
You can use the following lines to index all the string variables of your data set :
var categoricalCols = DataSet.dtypes.filter(_._2 == "StringType").map(_._1)
var indexOutputCols = categoricalCols.map(_ + "_Index")
// Handle string variables
var stringIndexer = new StringIndexer()
.setInputCols(categoricalCols)
.setOutputCols(indexOutputCols)
.setHandleInvalid("skip")
The indexed variables have the original name + "_indexed".

Related

readtable start at specific row in Matlab

How can I read a file with readtable (docs) starting at row 6?
I tried the following but this only reads the first two columns (I have columns A:L):
opts = detectImportOptions(fileName);
opts.VariableNamesRange = 'A6';
opts.DataRange = 'A7';
raw = readtable(fileName,opts,'ReadVariableNames',true)
When I do
opts.VariableNamesRange = 'A6:L6';
opts.DataRange = 'A7:L7';
I get the error message:
Invalid 'VariableNamesRange'. The column size must match the number of
variables.
Before setting the VariableNamesRange and DataRange fields of opts, try setting the VariableNames field to something like opts.VariableNames = cellstr(['A':'L']').
A couple of notes on this:
The number of columns in the VariableNamesRange and DataRange fields needs to match the length of the VariableNames field. Check the results detectImportOptions to see how many columns it detected;
If you do this, then check the VariableTypes field to make sure that all variables are the correct type (double or char).

MATLAB: Loop through the values of a list from 'who' function

I have a long list of variables in my workspace.
First, I'm finding the potential variables I could be interested in using the who function. Next, I'd like to loop through this list to find the size of each variable, however who outputs only the name of the variables as a string.
How could I use this list to refer to the values of the variables, rather than just the name?
Thank you,
list = who('*time*')
list =
'time'
'time_1'
'time_2'
for i = 1:size(list,1);
len(i,1) = length(list(i))
end
len =
1
1
1
If you want details about the variables, you can use whos instead which will return a struct that contains (among other things) the dimensions (size) and storage size (bytes).
As far as getting the value, you could use eval but this is not recommended and you should instead consider using cell arrays or structs with dynamic field names rather than dynamic variable names.
S = whos('*time*');
for k = 1:numel(S)
disp(S(k).name)
disp(S(k).bytes)
disp(S(k).size)
% The number of elements
len(k) = prod(S(k).size);
% You CAN get the value this way (not recommended)
value = eval(S(k).name);
end
#Suever nicely explained the straightforward way to get this information. As I noted in a comment, I suggest that you take a step back, and don't generate those dynamically named variables to begin with.
You can access structs dynamically, without having to resort to the slow and unsafe eval:
timestruc.field = time;
timestruc.('field1') = time_1;
fname = 'field2';
timestruc.(fname) = time_2;
The above three assignments are all valid for a struct, and so you can address the fields of a single data struct by generating the field strings dynamically. The only constraint is that field names have to be valid variable names, so the first character of the field has to be a letter.
But here's a quick way out of the trap you got yourself into: save your workspace (well, the relevant part) in a .mat file, and read it back in. You can do this in a way that will give you a struct with fields that are exactly your variable names:
time = 1;
time_1 = 2;
time_2 = rand(4);
save('tmp.mat','time*'); % or just save('tmp.mat')
S = load('tmp.mat');
afterwards S will be a struct, each field will correspond to a variable you saved into 'tmp.mat':
>> S
S =
time: 1
time_1: 2
time_2: [4x4 double]
An example writing variables from workspace to csv files:
clear;
% Writing variables of myfile.mat to csv files
load('myfile.mat');
allvars = who;
for i=1:length(allvars)
varname = strjoin(allvars(i));
evalstr = strcat('csvwrite(', char(39), varname, '.csv', char(39), ', ', varname, ')');
eval(evalstr);
end

Create structure fieldnames from array of numbers

I have a dataset that I would like to categorise and store in a structure based on the value in one column of the dataset. For example, the data can be categorised into element 'label_100', 'label_200' or 'label_300' as I attempt below:
%The labels I would like are based on the dataset
example_data = [repmat(100,1,100),repmat(200,1,100),repmat(300,1,100)];
data_names = unique(example_data);
%create a cell array of strings for the structure fieldnames
for i = 1:length(data_names)
cell_data_names{i}=sprintf('label_%d', data_names(i));
end
%create a cell array of data (just 0's for now)
others = num2cell(zeros(size(cell_data_names)));
%try and create the structure
data = struct(cell_data_names{:},others{:})
This fails and I get the following error message:
"Error using struct
Field names must be strings."
(Also, is there a more direct method to achieve what I am trying to do above?)
According to the documentation of struct,
S = struct('field1',VALUES1,'field2',VALUES2,...) creates a
structure array with the specified fields and values.
So you need to have each value right after its field name. The way you are calling struct now is
S = struct('field1','field2',VALUES1,VALUES2,...)
instead of the correct
S = struct('field1',VALUES1,'field2',VALUES2,...).
You can solve that by concatenating cell_data_names and others vertically and then using {:} to produce a comma-separated list. This will give the cells' contents in column-major order, so each field name fill be immediately followed by the corresponding value:
cell_data_names_others = [cell_data_names; others]
data = struct(cell_data_names_others{:})

Give value, return field name in matlab structure

I have a Matlab structure like this:
Columns.T21=6;
Columns.ws21=9;
Columns.wd21=10;
Columns.u21=11;
Is there some elegant way I can give the value and return the field name? For instance, if I give 6 and it would return 'T21.' I know that fieldnames() will return all the field names, but I want the fieldname for a specific value. Many thanks!
Assuming that the structure contains fields with scalar numeric values, you can use this struct2array based approach -
search_num = 6; %// Edit this for a different search number
fns=fieldnames(Columns) %// Get field names
out = fns(struct2array(Columns)==search_num) %// Logically index into names to find
%// the one that matches our search
Goal:
Construct two vectors from your struct, one for the names of fields and the other for their respective values. This has analogy to the dict in Python or map in C++, where you have unique keys being mapped to possibly non-unique values.
Simple Solution:
You can do this very simply using the various functions defined for struct in Matlab, namely: struc2cell() and cell2mat()
For the particular element of interest, say 1 of your struct Columns, get the names of all fields in the form of a cell array, using fieldnames() function:
fields = fieldnames( Columns(1) )
Similarly, get the values of all the fields of that element of Columns, in the form of a matrix
vals = cell2mat( struct2cell( Columns(1) ) )
Next, find the field with the corresponding value, say 6 here, using the find function and convert the resulting 1x1 cell into a char using cell2mat() function :
cell2mat( fields( find( vals == 6 ) ) )
which will yield:
T21
Now, you can define a function that does this for you, e.g.:
function fieldname = getFieldForValue( myStruct, value)
Advanced Solution using Map Container Data Abstraction:
You can also choose to define an object of the containers.map class using the field-names of your struct as the keySet and values as valueSet.
myMap = containers.Map( fieldnames( Columns(1) ), struct2cell( Columns(1) ) );
This allows you to get keys and values using corresponding built-in functions:
myMapKeys = keys(myMap);
myMapValues = values(myMap);
Now, you can find all the keys corresponding to a particular value, say 6 in this case:
cell2mat( myMapKeys( find( myMapValues == 6) )' )
which again yields:
T21
Caution: This method, or for that matter all methods for doing so, will only work if all the fields have the values of the same type, because the matrix to which we are converting vals to, need to have a uniform type for all its elements. But I assume from your example that this would always be the case.
Customized function/ logic:
struct consists of elements that contain fields which have values, all in that order. An element is thus a key for which field is a value. The essence of "lookup" is to find values (which are non-unique) for specific keys (which are unique). Thus, Matlab has a built-in way of doing so. But what you want is the other way around, i.e. to find keys for specific values. Since its not a typical use case, you need to write up your own logic or function for it.
Suppose your structure is called S. First extract all the field names into an array:
fNames=fieldnames(S);
Now define a following anonymous function in your code:
myfun=#(yourArray,desiredValue) yourArray==desiredValue;
Then you can get the desired field name as:
desiredFieldIndex=myfun(structfun(#(x) x,S),3) %desired value is 3 (say)
desiredFieldName=fNames(desiredFieldIndex)
Alternative using containers.Map
Assuming each field in the structure contains one scalar value as in the question (not an array).
Aim is to create a Map object with the field values as keys and the field names as values
myMap = containers.Map(struct2cell(Columns),fieldnames(Columns))
Now to get the fieldname for a value index into myMap with the value
myMap(6)
ans =
T21
This has the advantage that if the structure doesn't change you can repeatedly use myMap to find other value-field name pairs

How can I dynamically access a field of a field of a structure in MATLAB?

I'm interested in the general problem of accessing a field which may be buried an arbitrary number of levels deep in a containing structure. A concrete example using two levels is below.
Say I have a structure toplevel, which I define from the MATLAB command line with the following:
midlevel.bottomlevel = 'foo';
toplevel.midlevel = midlevel;
I can access the midlevel structure by passing the field name as a string, e.g.:
fieldnameToAccess = 'midlevel';
value = toplevel.(fieldnameToAccess);
but I can't access the bottomlevel structure the same way -- the following is not valid syntax:
fieldnameToAccess = 'midlevel.bottomlevel';
value = toplevel.(fieldnameToAccess); %# throws ??? Reference to non-existent field 'midlevel.bottomlevel'
I could write a function that looks through fieldnameToAccess for periods and then recursively iterates through to get the desired field, but I am wondering if there's some clever way to use MATLAB built-ins to just get the field value directly.
You would have to split the dynamic field accessing into two steps for your example, such as:
>> field1 = 'midlevel';
>> field2 = 'bottomlevel';
>> value = toplevel.(field1).(field2)
value =
foo
However, there is a way you can generalize this solution for a string with an arbitrary number of subfields delimited by periods. You can use the function TEXTSCAN to extract the field names from the string and the function GETFIELD to perform the recursive field accessing in one step:
>> fieldnameToAccess = 'midlevel.bottomlevel';
>> fields = textscan(fieldnameToAccess,'%s','Delimiter','.');
>> value = getfield(toplevel,fields{1}{:})
value =
foo