How to cast a numeric column to string within Pyspark ML Pipeline?


Let's say I have a simple pipeline like this:
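from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler
feature_columns = ["x1", "x2"]
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features")
pipeline = Pipeline(stages=[assembler, rf])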
It has two input features x1 and x2 and a target column named label.
I want to add a typecast step that converts x1 from int to double. I could cast it explicitly before feeding the data to the model, but I want the cast to be part of the pipeline itself.
How can I do that? One option would be a custom transformer that does the casting, but I would like to know whether something built in already does this.
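One built-in stage that may fit is SQLTransformer from pyspark.ml.feature, which runs a SQL statement against the incoming DataFrame through the __THIS__ placeholder. The following is a minimal sketch rather than a confirmed answer from this thread: the helper names cast_statement and build_pipeline and the output column x1_cast are made up for illustration, and build_pipeline assumes an active SparkSession.

```python
def cast_statement(column, new_type="DOUBLE"):
    """Build a SQLTransformer statement that adds a cast copy of `column`.

    The `<column>_cast` output name is an illustrative choice, not a Spark convention.
    """
    # __THIS__ is the placeholder SQLTransformer replaces with the input DataFrame.
    return f"SELECT *, CAST({column} AS {new_type}) AS {column}_cast FROM __THIS__"


def build_pipeline():
    """Assemble cast -> vector assembly -> classifier (needs a running SparkSession)."""
    # Imported lazily so cast_statement can be used without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.feature import SQLTransformer, VectorAssembler

    caster = SQLTransformer(statement=cast_statement("x1"))
    # The assembler reads the cast column instead of the original x1.
    assembler = VectorAssembler(inputCols=["x1_cast", "x2"], outputCol="features")
    rf = RandomForestClassifier(labelCol="label", featuresCol="features")
    return Pipeline(stages=[caster, assembler, rf])
```

Because the statement uses SELECT *, the original x1 column is kept next to x1_cast, so only the assembler's inputCols need to change.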

Related

When subclassing "double" with new properties in MATLAB, is there an easy way to access the data value?

Say I have a class subclassing double, and I want to add a string (similar to the 'ExtendDouble' example in the documentation). Is there an easy way to access the actual numeric value without the extra properties, particularly for reassignment? Or, if I want to change the value, will I have to recreate it as a new instance of the class with the new value and the same string?
e.g.
classdef myDouble < double
properties
string
end
methods
function obj = myDouble(s)
% Construct object (simplified)
obj.string = s;
end
end
end
----------
x = myDouble(2,'string')
x =
2 string
x = 3
x =
3 string

Short answer: NO. There is no easy way to access a single member of a class when the class contains more than one member. You'll always have to let MATLAB know which part of the class you want to manipulate.
You have multiple questions in your post but let's tackle the most interesting one first:
% you'd like to instantiate a new object this way (fine)
x = myDouble(2,'string')
x =
2 string
% then you'd like to easily refer to the only numeric part of your class
% for assignment => This can NEVER work in MATLAB.
x = 3
x =
3 string
This can never work in MATLAB because of how the interpreter works. Consider the following statements:
% direct assignment
(1) dummy = 3
% indexed assignments
(2) dummy(1) = 3
(3) dummy{1} = 3
(4) dummy.somefieldname = 3
You would like the simplicity of the first statement for assignment, but this is the one we cannot achieve. Statements 2, 3 and 4 are all possible with some fiddling with subsasgn and subsref.
The main difference between (1) and [2,3,4] is this:
Direct assignment:
In MATLAB, when you execute a direct assignment to a plain variable name (without indexing with () or {} or a field name), as in dummy=3, MATLAB does not check the type of dummy beforehand. In fact, it does not even check whether the variable dummy exists at all. With this kind of assignment MATLAB takes the quickest route: it immediately creates a new variable dummy and assigns it the type and value accordingly. If a variable dummy existed before, too bad for it, that one is lost forever. (A lot of MATLAB users have had their fingers bitten once or twice by this behavior, as it is easy to overwrite a variable by mistake and MATLAB will not raise any warning or complaint.)
Indexed assignments:
In all the other cases, something different happens. When you execute dummy(1)=3, you are not telling MATLAB "create a new dummy variable with that value"; you are telling MATLAB "find the existing dummy variable, find the subindex I am giving you, then assign the value to that specific subindex". MATLAB will happily go on: if it finds everything, it does the sub-assignment, or it may complain or error about any kind of misassignment (wrong index, type mismatch, index length mismatch...).
To find the subindex, MATLAB calls the subsasgn method of dummy. If dummy is a built-in class, the subsasgn method is also built in and usually hidden under the hood. If dummy is a custom class, you can write your own subsasgn and have full control over how MATLAB treats the assignment. You can check the type of the input and decide to apply it to this field or to another one if that is more suitable. You can even do a range check and reject the assignment altogether if it is not suitable, or just assign a default value. You have full control; MATLAB will not force anything on you in your own subsasgn.
The problem is that, to make MATLAB relinquish control and hand over to your own subsasgn, you have to use an indexed assignment (like 2, 3 or 4 above). You cannot do that with a type (1) assignment.
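The same distinction can be sketched in Python, purely as an analogy and not as MATLAB code: rebinding a name never consults the old object, while indexed assignment is dispatched to a method the class controls (__setitem__ here plays the role of subsasgn). TaggedNumber is a hypothetical stand-in for myDouble.

```python
class TaggedNumber:
    """Hypothetical stand-in for myDouble: a numeric value plus a label."""

    def __init__(self, value, label):
        self.value = value
        self.label = label

    def __setitem__(self, index, new_value):
        # Indexed assignment is routed here, so the class decides what changes.
        if index != 0:
            raise IndexError("TaggedNumber holds a single value")
        self.value = new_value


x = TaggedNumber(2, "string")
x[0] = 3                    # indexed assignment: __setitem__ runs, the label survives
kept = (x.value, x.label)   # state of the object after the indexed assignment

x = 3                       # direct assignment: the name is rebound to a plain int
```

After the last line, x is an ordinary integer and the TaggedNumber object is no longer reachable through that name, which mirrors what dummy = 3 does in MATLAB.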
Other considerations: You also ask whether you can change the numeric part of the class without creating a new object. The answer to that is no as well. This is because of the way value classes work in MATLAB. There could be a long explanation of what happens under the hood, but the best illustration is the MATLAB example you referenced yourself. If we look at the class definition of ExtendDouble and at its custom subsasgn method, which performs the change of numeric value, what happens there is:
obj = ExtendDouble(b,obj.DataString);
So even MathWorks, to change the numeric value of their extended double class, has to create a brand new object (with the new numeric value b, transferring the old string value obj.DataString).

How to return integer value from notebook in adf pipeline

I have a use case where I need to return an integer as output from a Synapse notebook in a pipeline and pass this output to the next stage of my pipeline.
Currently mssparkutils.notebook.exit() takes only string values. Are there any utility methods available for this?
I know we can cast the integer to a string and pass it to the exit("") method. I wanted to know whether I could achieve this without casting.

The cast() function is the standard and official method suggested by Spark itself. AFAIK, there is no other method. Otherwise, you need to manage it programmatically.
You can also try @equals in dynamic content to check whether the exitValue fetched from the notebook activity output equals some specific value.
@equals(activity('Notebook').output.status.Output.result.exitValue, '<value>')
Refer: Spark Cast String Type to Integer Type (int), Transform data by running a Synapse notebook

Instead, you can convert the numeric string to an integer in dynamic content, like this:
@equals(
int(activity('Notebook').output.status.Output.result.exitValue)
,1)
Or add an activity that sets the string value to a variable that is an int.
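On the notebook side, the string round trip that these answers rely on can be sketched as follows. The helper to_exit_value and the sample row_count are illustrative assumptions; the mssparkutils call is shown only as a comment because it exists only inside a Synapse notebook session.

```python
def to_exit_value(number: int) -> str:
    """Serialize an integer for mssparkutils.notebook.exit, which accepts strings."""
    return str(number)


row_count = 1  # hypothetical result computed earlier in the notebook
exit_value = to_exit_value(row_count)

# In the last cell of a Synapse notebook you would then call:
#   mssparkutils.notebook.exit(exit_value)
# and the pipeline converts it back with int(...) in dynamic content.
```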

MATLAB equivalent to Python argmax with array of user defined objects

I have an array as shown here. Here Bandit is a class that I created.
bandits = [Bandit(m1),Bandit(m2),Bandit(m3)];
Now, I want to do the following. Below is Python code that immediately gives me the argmax of the mean of each of these objects.
j = np.argmax([b.mean for b in bandits])
How can I do the same in MATLAB? To clarify, every bandit object has an attribute mean_value. That is, if b1 is a bandit object, I can get that value using the dot operator (b1.mean_value). I want to find which of b1, b2, b3 has the maximum mean_value and get its index. (See the Python code above. If b2 has the highest mean_value, then j will finally contain index 2.)
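For reference, here is a self-contained version of the Python side, using only the standard library instead of np.argmax. The Bandit class below is a hypothetical minimal stand-in with just the mean attribute. Note that Python indices are zero-based, so when the second bandit has the largest mean the result is 1, whereas MATLAB's max would report index 2.

```python
class Bandit:
    """Hypothetical minimal Bandit holding only the attribute used here."""

    def __init__(self, mean):
        self.mean = mean


bandits = [Bandit(0.2), Bandit(0.9), Bandit(0.5)]

# Collect the attribute from every object, then take the index of the largest one.
means = [b.mean for b in bandits]
j = max(range(len(means)), key=means.__getitem__)  # pure-Python argmax, zero-based
```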

arrayfun applies a function to each element of an array. It results in a new array with the results of the operation. To this result you can then apply max as usual:
[~,arg] = max(arrayfun(@mean,bandits));
Note that this might not work if you have overloaded the subsref or size methods for the Bandit class.
Edit:
So now I understand that mean was not a function but an attribute. The operation x.mean can be expressed as the function call subsref(x,substruct('.','mean')). Thus, it is possible to change the solution above to call this function on each array element:
op = @(x)subsref(x,substruct('.','mean'))
[~,arg] = max(arrayfun(op,bandits));
That is, instead of calling the function mean, we call the function subsref to index the attribute mean.
If bandits is a simple struct array, then the following will work also:
[~,arg] = max([bandits.mean]);
Here, bandits.mean will extract the mean value for each element of the struct array, yielding a comma-separated list. This list is captured with the square brackets to form a vector. This vector is again input into the max function as usual.
I'm not sure if this latter solution works also for custom classes. I don't have your Bandit class to test. Please let me know if this latter solution works, so I can update the post with correct information.

Print value of variable in tensor flow from a member function of a class

I have a variable defined as follows, which is also the weight matrix for a regular neural net.
W1 = tf.Variable(tf.truncated_normal([feature_space_size, hidden1], stddev=1.0 / math.sqrt(feature_space_size),dtype=tf.float64), name='W1')
How can I print its value while I am debugging? The issue is that it's defined in a constructor and I need to access it in a member function of the same class. I tried fetching it using
tf.get_variable('W1',[4,300])
But I am not able to print its value using self.sess.run(). Please advise. There should really be a simpler way to print the value of variables. Moreover, it seems that after I call get_variable, it's no longer in the op graph for TF.

Use this line of code to list the names that TensorFlow assigns to all the trainable tf.Variable objects:
v = [a.name for a in tf.trainable_variables()]
Each a.name is a string, part of which is the variable name.
The value can be accessed with sess.run(a.name).

Set the initial type of a vector in Matlab

I'd like to declare an empty vector that accepts insertion of user-defined types. In the following examples node is a type I've defined with classdef node ...
The following code is rejected by the Matlab interpreter because the empty vector is automatically initialized as type double, so it can't have a node inserted into it.
>> a = [];
>> a(1) = node(1,1,1);
The following error occurred converting from node to double:
Conversion to double from node is not possible.
The code below is accepted because the vector is initialized with a node in it, so it can later have nodes inserted.
>> a = [node(1,1,1)];
>> a(1) = node(1,2,1);
However, I want to create an empty vector that can have nodes inserted into it. I can do it awkwardly like this:
>> a = [node(1,1,1)];
>> a(1) = [];
What's a better way? I'm looking for something that declares the initial type of the empty vector to be node. If I could make up the syntax, it would look like:
>> a = node[];
But that's not valid Matlab syntax. Is there a good way to do this?

An empty object array can be created with
A = MyClass.empty;
It works with your own class, but also with MATLAB's built-in classes, such as
A = int16.empty;
This method is able to create multi-dimensional empty objects with this syntax
A = MyClass.empty(n,m,0,p,q);
as long as one dimension is set to zero.
See the doc.

You don't specify what your class contains, but yes, generally speaking it is possible to use array creation functions such as zeros, ones, and others for user-defined classes as well.
For in-built classes, you might have a call like
A = zeros(2,3,'uint8');
to create a 2-by-3 matrix of zeros of datatype uint8. The similar syntax can also be applied for appropriate types of user-defined classes, for instance:
A = zeros(2,3,'MyClass');
where 'MyClass' is the name of your class, or by giving an example:
p = MyClass(...);
A = zeros(2,3,'like',p);
The source for this information, along with a specification of how to implement support for array creation functions in user-defined classes, may be found here.
A call such as zeros(0,0,'MyClass') would then produce an empty vector of type MyClass.
