How to use length and rlike using logical operator inside when clause - scala

I want to check whether a column's values have a certain length and contain only digits.
The problem is that .rlike and .contains return a Column. Something like
.when(length(col("abc")) == 20 & col("abc").rlike(...), myValue)
won't work: col("abc").rlike(...) returns a Column, whereas length(col("abc")) == 20 evaluates to a Boolean (even though length() itself returns a Column). How do I combine the two?

After a bit of searching in the compiled code, I found this:
def when(condition : org.apache.spark.sql.Column, value : scala.Any) : org.apache.spark.sql.Column
Therefore the condition passed to when must be a Column. length(col("abc")) == 20 was evaluating to a Boolean, because == on a Column is ordinary Scala equality rather than a column expression.
I also found this method:
def equalTo(other : scala.Any) : org.apache.spark.sql.Column
So I changed the whole expression to this:
.when(length(col("abc")).equalTo(20) && col("abc").rlike(...), myValue)
Note that the logical operator is && and not &.
Edit/Update: @Histro's comment is correct.

Check if Column is Vector Type

I am trying to determine if a column is a vector type, but am running into issues.
After I run a model and create a dataframe called predictions, there is a field called probability.
When I run this code to see the data type, it shows a vector:
predictions.schema['probability'].dataType
Out[128]: VectorUDT
But when I run this, I get False:
predictions.schema["probability"].dataType == 'VectorUDT'
Out[129]: False
So I tried this
dict(predictions.dtypes)['probability'] == 'vector'
Out[130]: True
However, when I try to use that in my dataframe I get an error stating TypeError: unhashable type: 'Column'
.withColumn('test', when(dict(predictions.dtypes)['probability'] == 'vector',1)
.otherwise(0)) \
predictions.schema['probability'] returns a StructField, and StructField.dataType returns a DataType. So
predictions.schema["probability"].dataType == 'VectorUDT'
compares a DataType with a string, which is always false.
One way to check for a vector column is to test the dataType with isinstance:
from pyspark.ml.linalg import VectorUDT
isinstance(predictions.schema["probability"].dataType, VectorUDT) #True
Another way to check the column type is to use DataType.simpleString or DataType.typeName. Both return the same string, vector for VectorUDT fields:
predictions.schema['probability'].dataType.simpleString() == 'vector' #True

Scala characteristic function

We have three definitions. The first one defines a type alias for a Boolean condition:
type Set = Int => Boolean
I understand that this is an alias definition. Now the second function
def contains(set: Set, elem: Int): Boolean = set(elem)
calls the (Int => Boolean) function on elem: Int.
QUESTION 1: Where is the logic of the function behind Set?
I mean, do I have to pass the actual Set function as a parameter (in which case contains is a higher-order function) when calling contains, e.g. for the set of even numbers:
val in:Boolean = contains({x=>(x%2)==0},2)
In the third function:
def singletonSet(elem: Int): Set = set => set == elem
QUESTION 2: Where does set come from? It's not in the formal parameter list.
QUESTION 1: Yes, you have to pass a Set which would be the "implementation" of the function. The point of this exercise (Odersky's course?) is to show that a Set can be defined not as a collection of items (the "usual" definition of a set), but rather as a function that says whether an item is included in the set or not. So the Set is the function.
QUESTION 2: set is the name given to the argument of the anonymous function we're returning here: Since singletonSet's return type is Set, which as we've said is actually a function of type Int => Boolean, we return an (anonymous) function. To create such a function, one uses the syntax x => f(x), where x is any name you'd like and f(x) is an expression using it (or not).
1) Since a Set is a function, contains is indeed a higher order function which takes a function and an element of the appropriate type and applies the function to the element. The logic of it is that sets are being represented by Boolean-valued functions where an element evaluates to true if and only if it is in the corresponding set. The function contains evaluates the function at the element and returns its value, which is either true or false depending on whether or not it is in the set.
2) singletonSet returns an anonymous function, one that evaluates to true if and only if its input (named set) equals the element in question.
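To make the "a set is a function" idea concrete, here is a minimal Python transcription of the three Scala definitions. It is only a sketch: the names IntSet, singleton_set and evens are chosen for this example and do not come from the question.

```python
from typing import Callable

# Counterpart of `type Set = Int => Boolean`: a set is just a predicate on ints.
IntSet = Callable[[int], bool]


def contains(s: IntSet, elem: int) -> bool:
    # Membership is simply "call the predicate on the element".
    return s(elem)


def singleton_set(elem: int) -> IntSet:
    # Returns an anonymous function; x is its parameter name, in the same
    # role as `set` in the Scala version `set => set == elem`.
    return lambda x: x == elem


# The caller supplies the logic of the set, here "all even numbers".
evens: IntSet = lambda x: x % 2 == 0

print(contains(evens, 2))
print(contains(singleton_set(5), 4))
```

The parameter name inside the lambda is arbitrary, which is why it does not need to appear in singleton_set's own parameter list.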

Julia: Immutable composite types

I am still totally new to Julia and very irritated by the following behaviour:
immutable X
x::ASCIIString
end
"Foo" == "Foo"
true
X("Foo") == X("Foo")
false
but with Int instead of ASCIIString
immutable Y
y::Int
end
3 == 3
true
Y(3) == Y(3)
true
I had expected X("Foo") == X("Foo") to be true. Can anyone clarify why it is not?
Thank you.
Julia has two types of equality comparison:
The first checks whether x and y are identical, in the sense that no program could distinguish them. For this, use the is(x,y) function or the equivalent === operator. The tricky part is that two mutable objects are identical only if their memory addresses are the same, whereas two immutable objects are identical if their contents are the same at the bit level.
2 === 2 #=> true, because numbers are immutable
"Foo" === "Foo" #=> false
The second is the == operator (and the closely related isequal(x,y) function), called generic comparison. It returns true if, first, a suitable method exists for the argument types and, second, that method returns true. What if no such method is defined? Then == falls back to ===.
In the case above, you have an immutable type with no == method defined for it, so you are really calling ===, which checks whether the two objects have identical contents at the bit level. They do not, because they refer to different string objects and "Foo" !== "Foo".
Check source Base.operators.jl.
Read documentation
EDIT:
As @Andrew mentioned, according to the Julia documentation strings are immutable data types, so why is "test" !== "test" #=> true? If you look into the structure of a String, e.g. with xdump("test") #=> ASCIIString data: Array(UInt8,(4,)) UInt8[0x74,0x65,0x73,0x74], you find that strings are composite types with an important data field. A Julia string is essentially a sequence of bytes stored in the data field of the String type, and isimmutable("test".data) #=> false.

Match function response when no match found

I have a Match function nested within an IF function. If the Match is found within the range, the formula works. If the match is not found, I get "N/A": the match is not found. I would like to get a blank cell instead.
Here is a link to an example spreadsheet that illustrates my question: https://docs.google.com/spreadsheets/d/1_yLiQK_-7ygxAkoIgtFJLvWM3d-IA_rf1qo0_O_klWQ/edit?usp=sharing
Have you checked what the Match function returns on its own? Is it 0 for no result, or N/A?
If it returns 0, you could replace
if (match(..) >= 1, "1")
with
if (match(..) >= 1, "1", "")
Otherwise you should check the result of match with a logical function such as ISNA or ISNUMBER.
Edit: apparently it had to be:
=iferror((then all your formula),"")

Algorithm to evaluate value of Boolean expression

I had a programming interview with 3 interviewers, 45 minutes each.
While the first two interviewers gave me 2-3 short coding questions (e.g. reverse a linked list, implement rand(7) using rand(5), etc.), the third interviewer used the whole timeslot for a single question:
You are given a string representing a correctly formed and parenthesized
boolean expression consisting of the characters T, F, &, |, !, (, ) and
spaces. T stands for True, F for False, & for logical AND, | for
logical OR, ! for negation. & has higher priority than |. Each of these
characters is followed by a space in the input string. I was to evaluate the
expression and print its value (the output should be T or F). Example: Input:
! ( T | F & F ) Output: F
I tried to implement a variation of the Shunting Yard algorithm to solve the problem (convert the input to postfix form, then evaluate the postfix expression), but failed to code it properly in the given timeframe, so I ended up explaining in pseudocode and words what I wanted.
My recruiter said that first two interviewers gave me "HIRE", while third interviewer gave me "NO HIRE", and since the final decision is "logical AND", he thanked me for my time.
My questions:
Do you think that this question is appropriate to code on a whiteboard in approx. 40 mins? To me it seems too much code for such a short timeslot and the dimensions of a whiteboard.
Is there a shorter approach than the Shunting Yard algorithm for this problem?
Well, once you have some experience with parsers, the postfix algorithm is quite simple.
1. From left to right, evaluate each char:
if it's an operand, push it on the stack.
if it's an operator, pop A, then pop B, then push B operator A onto the stack. The last item on the stack will be the result. If there is none, or more than one, you're doing it wrong (assuming the postfix notation is valid).
Infix to postfix is quite simple as well. That being said, I don't think it's an appropriate task for 40 minutes if you don't know the algorithms. Here is a boolean postfix evaluation method I wrote at some stage (it uses a lambda as well):
import java.util.ArrayDeque;
import java.util.Deque;

// Place this method inside a class.
// Evaluates a postfix expression: '0' and '1' are operands,
// 'A' = AND, 'R' = OR, 'X' = XOR, 'N' = NOT. Other characters are ignored.
public static boolean evaluateBool(String s) {
    Deque<Boolean> stack = new ArrayDeque<>();
    s.chars().forEach(ch -> {
        if (ch == '0') {
            stack.push(false);
        } else if (ch == '1') {
            stack.push(true);
        } else if (ch == 'A' || ch == 'R' || ch == 'X') {
            boolean op1 = stack.pop(); // right operand (pushed last)
            boolean op2 = stack.pop(); // left operand
            switch (ch) {
                case 'A': stack.push(op2 && op1); break;
                case 'R': stack.push(op2 || op1); break;
                case 'X': stack.push(op2 ^ op1); break;
            }
        } else if (ch == 'N') {
            stack.push(!stack.pop());
        }
    });
    return stack.pop(); // the remaining value is the result
}
To make that snippet work in your case, you would first have to parse the expression and replace special characters like "!", "|", "^" etc. with something plain like letters, or just use the integer char values in your if cases.
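As a rough illustration of that preprocessing step, here is a minimal Python sketch (not the answerer's code). It converts the question's space-separated infix input into a postfix string in the same 0/1/A/R/N alphabet the Java snippet uses, via a small shunting-yard pass, and then evaluates it with the stack procedure described above. The names SYMBOLS, PRECEDENCE, to_postfix and evaluate_postfix are made up for this example, and it assumes well-formed input, as the problem statement promises.

```python
# Map the question's symbols onto the alphabet used by the Java snippet.
SYMBOLS = {"T": "1", "F": "0", "&": "A", "|": "R", "!": "N"}
PRECEDENCE = {"!": 3, "&": 2, "|": 1}  # higher number binds tighter


def to_postfix(expression: str) -> str:
    """Convert a space-separated infix expression to a postfix string."""
    output = []
    operators = []  # operator stack
    for token in expression.split():
        if token in ("T", "F"):
            output.append(SYMBOLS[token])
        elif token in ("!", "("):
            # Prefix operator or open paren: nothing to pop yet.
            operators.append(token)
        elif token in ("&", "|"):
            # Pop operators that bind at least as tightly (left-associative).
            while (operators and operators[-1] != "("
                   and PRECEDENCE[operators[-1]] >= PRECEDENCE[token]):
                output.append(SYMBOLS[operators.pop()])
            operators.append(token)
        elif token == ")":
            while operators[-1] != "(":
                output.append(SYMBOLS[operators.pop()])
            operators.pop()  # discard the "("
        else:
            raise ValueError(f"unexpected token: {token!r}")
    while operators:
        output.append(SYMBOLS[operators.pop()])
    return "".join(output)


def evaluate_postfix(postfix: str) -> bool:
    """Evaluate a postfix string: push operands, apply operators to the stack."""
    stack = []
    for ch in postfix:
        if ch in "01":
            stack.append(ch == "1")
        elif ch == "N":
            stack.append(not stack.pop())
        else:
            right, left = stack.pop(), stack.pop()
            stack.append((left and right) if ch == "A" else (left or right))
    return stack.pop()


postfix = to_postfix("! ( T | F & F )")
print(postfix)                                   # 100ARN
print("T" if evaluate_postfix(postfix) else "F")  # F
```

The two stages could be fused so that operators are applied as soon as they are popped, but keeping them separate makes it easier to check the postfix string by hand.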