Spark Scala case when with multiple conditions

I'm trying to do a CASE expression on a DataFrame I have, but I'm getting an error. I want to implement it with the built-in Spark functions withColumn, when, and otherwise:
CASE WHEN vehicle = "BMW"
AND MODEL IN ("2020", "2019", "2018", "2017")
AND value > 100000 THEN 1
ELSE 0 END AS NEW_COLUMN
Currently I have this:
DF.withColumn(NEW_COLUMN, when(col(vehicle) === "BMW"
and col(model) isin(listOfYears:_*)
and col(value) > 100000, 1).otherwise(0))
But I'm getting an error due to a data type mismatch (boolean and string)... I understand my condition returns booleans and strings, which is causing the error. What's the correct syntax for executing a case like that one? Also, I was using && instead of and, but the third && was giving me a "cannot resolve symbol &&" error.
Thanks for the help!

I think && is correct: with the built-in Spark functions, all of the expressions are of type Column, and checking the API, && is defined on Column and should work fine. Could it be as simple as an order-of-operations issue, where you need parentheses around each of the boolean conditions? The isin function, used as an infix "operator", has lower precedence than &&, which might trip things up.
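For example, the corrected expression might look like this (a sketch, assuming the column names are the string literals "vehicle", "model", and "value", and listOfYears is defined as in the question):

import org.apache.spark.sql.functions.{col, when}

// Parentheses around each condition avoid precedence surprises with isin and &&
val result = DF.withColumn("NEW_COLUMN",
  when((col("vehicle") === "BMW") &&
       (col("model").isin(listOfYears: _*)) &&
       (col("value") > 100000), 1)
    .otherwise(0))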

Related

How does a missing boolean operator still compile?

I have code like this:
val pop: Bool = (
  (fsm === Fsm.None && canPop) ||
  (fsm === Fsm.Some && canPop && fooBar)
  (fsm === Fsm.Other && canPop && dragonFruit) ||
  (fsm === Fsm.Mom && canPop))
fsm is a ChiselEnum-based state machine and the other signals are just Bools. There is a missing || at the end of the second condition line (after fooBar). This compiles and runs, although it causes a bug in the behavior. My question is: why does this compile? This doesn't seem like valid syntax. Is it?
Please help me understand this. I just spent numerous days debugging a large circuit to track down this issue.
Correct Answer
The problem is that this is a bit extract. Chisel lets you write foo(bar) to extract the bit at index bar out of foo. The above code, as written, is a bit extract, even though from the user's point of view it is a mistake.
Unfortunately, this is a known gotcha, and there is no reasonable way to report an error here without false positives while still allowing users to use the () syntax for bit extracts.
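Here's a plain-Scala sketch of the parse (no Chisel involved): inside parentheses, newlines do not terminate statements, so two adjacent parenthesized expressions become a single method application.

val xs = Vector(10, 20, 30)
val v = (
  (xs)
  (1 + 1)
) // parses as xs.apply(2), i.e. 30 -- one application, not two statements

In the question, (fsm === Fsm.Some && canPop && fooBar) is a hardware value, and Chisel defines apply on it as a bit extract, so the missing || silently turns the next parenthesized condition into a bit index.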
Incorrect Original Answer
It's valid syntax per Scala rules. What's going on here is that the code is really two statements, but the indentation is confusing. It's equivalent to:
val foo = a + b // Create an immutable value equal to "a + b".
c + d // A bare expression that computes "c + d" but discards the result.
Or infamous C bugs like:
if (cond)
  foo();
bar(); // bar() is always executed, but the indentation is confusing.
Usually this is the type of thing that you'd notice thanks to a code formatter or a linter. However, the problem is common to almost every language.
Setting up Scalafmt for your project may help.

Slick "for/yield"-query doesn't compile with negative comparison

I ran into an odd problem with my Slick query:
As you can see, the function below compiles although it's basically the same query, just with a positive comparison (I don't know whether it actually does what it's supposed to do, though). When I swap the order of the if conditions, it tells me that && cannot be resolved. I'm not sure, but I suspect the second table query object, in this case contents, isn't fully constructed at that point. However, that begs the question of why the second function/query compiles properly.
Do you have an answer to this? Am I doing something wrong here?
Thanks in advance!
You should use =!= for inequality and === for equality in queries, according to the Slick docs.
I guess I've fixed the problem.
Instead of:
if a.documentId === documentId && b.contentTypeId !== ContentType.PROCESS
I needed to write:
if a.documentId === documentId && !(b.contentTypeId === ContentType.PROCESS)
Still weird behavior that I can't really explain, especially since negative comparisons like !== are generally allowed in those if statements.
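For reference, with the docs-recommended =!= operator the query would look something like this (a sketch assuming Slick's lifted embedding, with hypothetical documents and contents table queries mirroring the question):

val q = for {
  a <- documents
  b <- contents
  if a.documentId === documentId &&
     b.contentTypeId =!= ContentType.PROCESS // =!= is Slick's inequality operator
} yield (a, b)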

What's the meaning of "$" in Dataset's operators (like select or filter)?

I am a bit confused about using $ to reference columns in DataFrame operators like select or filter.
The following statements work:
df.select("app", "renders").show
df.select($"app", $"renders").show
But only the first statement in the following works:
df.filter("renders = 265").show // <-- this works
df.filter($"renders" = 265).show // <-- this does not work (!) Why?!
However, this again works:
df.filter($"renders" > 265).show
Basically, what is this $ in DataFrame's operators and when/how should I use it?
Implicits are a major feature of the Scala language that take a lot of different forms, like the implicit classes we will see shortly. They have different purposes, and they all come with varying levels of debate regarding how useful or dangerous they are. Ultimately, though, implicits generally come down to having the compiler convert one class to another when you bring them into scope.
Why does this matter? Because in Spark there is an implicit class called StringToColumn that endows a StringContext with additional functionality. As you can see below, StringToColumn adds the $ method to the Scala class StringContext. This method produces a ColumnName, which extends Column.
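Its definition in Spark's SQLImplicits looks roughly like this (reconstructed, so treat it as a sketch rather than the verbatim source):

implicit class StringToColumn(val sc: StringContext) {
  def $(args: Any*): ColumnName = {
    new ColumnName(sc.s(args: _*))
  }
}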
The end result of all this is that the $ method allows you to treat the name of a column, represented as a String, as if it were the Column itself. Implicits, when used wisely, can produce convenient conversions like this to make development easier.
So let's use this to understand what you found:
df.select("app","renders").show -- succeeds because select takes multiple Strings
df.select($"app",$"renders").show -- succeeds because select takes multiple Columnss that result after the implicit conversions are applied
df.filter("renders = 265").show -- succeeds because Spark supports SQL-like filters
df.filter($"renders" = 265).show -- fails because $"renders" is of type Column after implicit conversion, and Columns use the custom === operator for equality (unlike the case in SQL).
df.filter($"renders" > 265).show -- succeeds because you're using a Column after implicit conversion and > is a function on Column.
$ is a way to convert a string into the column with that name.
Both options of select work because select can receive either columns or strings.
In the filter, $"renders" = 265 is an attempt to assign a number to the column, which is not what you want; >, on the other hand, is a comparison method. You should be using === instead of =.
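In other words (assuming spark.implicits._ is in scope):

df.filter($"renders" === 265).show // === builds an equality Column instead of assigning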

Squeryl: inhibitWhen function doesn't seem to work

I'm using Squeryl with the inhibitWhen function to ignore a where clause when a Scala value is empty. I tried this code, but Squeryl is still checking the where clause that I want to ignore!
(l.book_id in params.get("book_id").toList.flatMap(_.split(",")).map(_.toInt)).inhibitWhen(params.get("book_id").map(_.isEmpty) == Some(true))
What I need is this: when book_id has some values (1,4,6), use the in clause, but when it's empty, the in clause should be ignored!
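One way to express that intent might be to parse the parameter once and inhibit on the result; this is only a sketch reusing the question's names, not a verified fix:

val bookIds: List[Int] =
  params.get("book_id").toList.flatMap(_.split(",")).filter(_.nonEmpty).map(_.toInt)

// inhibit the `in` clause entirely when no ids were supplied
(l.book_id in bookIds).inhibitWhen(bookIds.isEmpty)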

Workaround for final == and != (equals and not equals) methods in a Scala DSL

So I'm wrapping bits of the Mechanical Turk API, and you need to specify qualification requirements such as:
Worker_Locale == "US"
Worker_PercentAssignmentsApproved > 95
...
In my code, I'd like to allow the syntax above and have these translated into something like:
QualificationRequirement("00000000000000000071", "LocaleValue.Country", "EqualTo", "US")
QualificationRequirement("000000000000000000L0", "IntegerValue", "GreaterThan", 95)
I can achieve most of what I want by declaring an object like:
object Worker_PercentAssignmentsApproved {
  def >(x: Int) = {
    QualificationRequirement("000000000000000000L0", "IntegerValue", "GreaterThan", x)
  }
}
But I can't do the same thing for the "==" (equals) or "!=" (not equals) methods, since they're declared final on Any. Is there a standard workaround for this? Perhaps I should just use "===" and "!==" instead?
(I guess one good answer might be a summary of how a few different Scala DSLs have chosen to work around this issue, and then I can just do whatever the majority of those do.)
Edit: Note that I'm not actually trying to perform an equality comparison. Instead, I'm trying to observe which comparison operator the user wrote in Scala code, build an object-based description of that comparison, and send that description to the server. Specifically, the following Scala code:
Worker_Locale == "US"
will result in the following parameters being added to my request:
&QualificationRequirement.1.QualificationTypeId=000000000000000000L0
&QualificationRequirement.1.Comparator=EqualTo
&QualificationRequirement.1.LocaleValue.Country=US
So I can't override equals since it returns a Boolean, and I need to return a structure that represents all these parameters.
If you look at the definition of == and != in the Scala Language Specification (§ 12.1), you'll find that they are defined in terms of eq and equals.
eq is reference equality and is also final (in this context it is only used to check for null), but you should be able to override equals.
Note that you'll probably also need to write the hashCode method to ensure
∀ o1, o2: o1.equals(o2) ⇒ o1.hashCode.equals(o2.hashCode).
However, if you need some return type for your DSL other than Boolean, or more flexibility in general, you should maybe use ===, as has been done in Squeryl, for example.
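Concretely, the === approach might look like this (a sketch built from the question's own QualificationRequirement calls; NotEqualTo is the corresponding MTurk comparator):

object Worker_Locale {
  def ===(x: String) =
    QualificationRequirement("00000000000000000071", "LocaleValue.Country", "EqualTo", x)
  def !==(x: String) =
    QualificationRequirement("00000000000000000071", "LocaleValue.Country", "NotEqualTo", x)
}

// Worker_Locale === "US" now yields a QualificationRequirement instead of a Boolean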
Here's a little survey of what various DSLs use for this kind of thing.
Liftweb uses === in Javascript expressions:
JsIf(ValById("username") === value.toLowerCase, ...)
Squeryl uses === for SQL expressions:
authors.where(a=> a.lastName === "Pouchkine")
querydsl uses $eq for SQL expressions:
person.firstName $eq "Ben"
Prolog-in-Scala uses === for Prolog expressions:
'Z === 'A
Scalatest uses === to get an Option instead of a Boolean:
assert("hello" === "world")
So I think the consensus is mostly to use ===.
I've been considering a similar problem. I was thinking of creating a DSL for writing domain-specific formulas. The trouble is that users might want to do string manipulation too, and you end up with expressions like
"some string" + <someDslConstruct>
No matter what you do, it's going to lex this as something like
stringToLiteralString("some string" + <someDslConstruct>)
I think the only potential way out of this pit would be to try using macros. In your example, perhaps you could have a macro that wraps a Scala expression and converts the raw AST into a query? Doing this for arbitrary expressions wouldn't be feasible, but if your domain is sufficiently well constrained, it might be a workable alternative solution.