How to set the order of multiple conditions in the `when` function? - pyspark

I have some complex code and I am using `when` to create a new column under certain conditions. Consider the following code:
df.select(
    '*',
    F.when((A) | (B) | (C), top_val['val']).alias('match'))
Let A, B and C be my conditions. I want to impose an order on these conditions, like this:
If A is satisfied, don't check B and C.
If B is satisfied, don't check C.
Is there any way to enforce this order?

As stated in this blog and quoted in this answer, I don't think you can guarantee the order of evaluation of an OR expression.
Spark SQL (including SQL and the DataFrame and Dataset API) does not guarantee the order of evaluation of subexpressions. In particular, the inputs of an operator or function are not necessarily evaluated left-to-right or in any other fixed order. For example, logical AND and OR expressions do not have left-to-right “short-circuiting” semantics.
However, you can nest when() inside .otherwise() to form a chain like this and achieve what you want:
df.select(
    '*',
    F.when(A, top_val['val'])
     .otherwise(F.when(B, top_val['val'])
     .otherwise(F.when(C, top_val['val']))).alias('match'))
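Note that when() can also be chained instead of nested; the chain compiles to a single CASE WHEN expression whose branches are tested top to bottom with the first match winning, so it gives the same ordering. A sketch with the question's A, B, C and top_val:
# Conditions are evaluated in order; the first true branch wins.
df.select(
    '*',
    F.when(A, top_val['val'])
     .when(B, top_val['val'])
     .when(C, top_val['val'])
     .alias('match'))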

Related

Spark - adding multiple columns under the same when condition

I need to add a couple of columns to a Spark DataFrame.
The value for both columns is conditional, using a when clause, but the condition is the same for both of them.
val df: DataFrame = ???
df
  .withColumn("colA", when(col("condition").isNull, f1).otherwise(f2))
  .withColumn("colB", when(col("condition").isNull, f3).otherwise(f4))
Since the condition in both when clauses is the same, is there a way I can rewrite this without repeating myself? I don't mean just extracting the condition to a variable, but actually reducing it to a single when clause, to avoid having to run the test multiple times on the DataFrame.
Also, in case I leave it like that, will Spark calculate the condition twice, or will it be able to optimize the work plan and run it only once?
The corresponding columns f1/f3 and f2/f4 can be packed into an array and then separated into two different columns after evaluating the condition.
df.withColumn("colAB", when(col("condition").isNull, array('f1, 'f3)).otherwise(array('f2, 'f4)))
.withColumn("colA", 'colAB(0))
.withColumn("colB", 'colAB(1))
The physical plans for my code and the code in the question are (ignoring the intermediate column colAB) the same:
== Physical Plan ==
LocalTableScan [f1#16, f2#17, f3#18, f4#19, condition#20, colA#71, colB#78]
== Physical Plan ==
LocalTableScan [f1#16, f2#17, f3#18, f4#19, condition#20, colAB#47, colA#54, colB#62]
So in both cases the condition is evaluated only once. This is at least true if condition is a regular column.
A reason to combine the two when statements could be better readability, although that judgement depends on the reader.
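For reference, here is a runnable PySpark sketch of the same array trick (the column names f1..f4 and condition are the hypothetical ones from the question); calling explain() lets you confirm that the condition appears only once in the plan:
from pyspark.sql import functions as F

result = (df
    # The condition is evaluated once for the packed array column.
    .withColumn("colAB", F.when(F.col("condition").isNull, F.array("f1", "f3"))
                          .otherwise(F.array("f2", "f4")))
    # Split the array into the two target columns, then drop the helper.
    .withColumn("colA", F.col("colAB")[0])
    .withColumn("colB", F.col("colAB")[1])
    .drop("colAB"))
result.explain()  # inspect the physical plan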

How to avoid column names like 'sum(<column>)' in aggregation in Spark/Scala?

The aggregation
df.groupBy($"whatever").sum("A","B","C")
produces a DataFrame with column names like sum(A), sum(B) and sum(C). Often the names A, B and C are already correct names for the final aggregates. Is there a way to avoid doing this:
df.groupBy($"whatever").sum($"A".as("A"), $"B".as("B"), $"C".as("C"))
No, there is not.
You need to use an alias via .as, as you state yourself.
You can of course rename the columns afterwards. scala - how to substring column names after the last dot? provides good guidance here, using replaceAll on the column name.
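For what it's worth, the equivalent workaround in PySpark goes through agg(), where each aggregate can carry its own alias (a sketch assuming the columns from the question):
from pyspark.sql import functions as F

df.groupBy("whatever").agg(
    F.sum("A").alias("A"),
    F.sum("B").alias("B"),
    F.sum("C").alias("C"))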

Why can you nest aggregate functions when using a window function in PostgreSQL?

I'm trying to understand window functions a bit better, and I'm stumped as to why I can't run a nested aggregate function normally, but I can when using a window function.
This is the dbfiddle I'm working off of: https://dbfiddle.uk/?rdbms=postgres_11&fiddle=76d62fcf4066053db18783e70269438c
Before running the window function, basically everything else in my query is evaluated (JOIN and GROUP BY).
So I believe the data the window function is working off of is something like this (after grouping):
Or is it something like this?
So why can I do this: SUM(COUNT(votes.option_id)) OVER(), but I can't do it without OVER()?
As far as I understand, OVER() makes the SUM(COUNT(votes.option_id)) run on this related data set, but it's still a nested aggregate function.
What am I missing?
Thank you very much!
If you have something like SUM(COUNT(votes.option_id)) OVER() you can think of COUNT(votes.option_id) as a column generated in the GROUP BY clause.
According to the documentation:
The rows considered by a window function are those of the “virtual table” produced by the query's FROM clause as filtered by its WHERE, GROUP BY, and HAVING clauses if any.
This means that window functions operate at a level above the GROUP BY clause and any aggregates, and therefore aggregates are available to be used inside window functions. In your example the "virtual table" corresponds to the second picture.
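To make the layering concrete, here is a plain-Python sketch of the two steps with a hypothetical list of votes: the GROUP BY / COUNT runs first and produces the virtual table, and the windowed SUM then runs over those per-group counts:
from collections import Counter

# Hypothetical votes: one entry per vote, keyed by option_id.
votes = ["opt1", "opt1", "opt2", "opt1", "opt2"]

# Step 1: GROUP BY option_id with COUNT(*) -- this yields the "virtual table".
counts = Counter(votes)        # {"opt1": 3, "opt2": 2}

# Step 2: SUM(COUNT(...)) OVER () runs on that virtual table, so the
# per-group counts are available to be summed.
total = sum(counts.values())   # 5
result = [(opt, n, total) for opt, n in counts.items()]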
The reason you cannot nest aggregate functions is that you cannot have multiple levels of GROUP BY in the same query. Similarly, you cannot nest window functions. The documentation is clear about which expressions are allowed inside aggregate and window functions. For aggregate functions we can use:
any value expression that does not itself contain an aggregate expression or a window function call
while for window functions we can use:
any value expression that does not itself contain window function calls

array_agg guaranteed consistent across multiple columns in Postgres?

Suppose I have the following table in Postgres 9.4:
a | b
---+---
1 | 2
3 | 1
2 | 3
1 | 1
If I run
select array_agg(a) as a_agg, array_agg(b) as b_agg from foo
I get what I want
a_agg | b_agg
-----------+-----------
{1,3,2,1} | {2,1,3,1}
The orderings of the two arrays are consistent: the first element of each comes from a single row, as does the second, as does the third. I don't actually care about the order of the arrays, only that they be consistent across columns.
It seems natural that this would "just happen", and it seems to. But is it reliable? Generally, the ordering of SQL things is undefined unless an ORDER BY clause is specified. It is perfectly possible to get postgres to generate inconsistent pairings with inconsistent ORDER BY clauses within array_agg (with some explicitly counterproductive extra work):
select array_agg(a order by b) as agg_a, array_agg(b order by a) as agg_b from foo;
yields
agg_a | agg_b
-----------+-----------
{3,1,1,2} | {2,1,3,1}
This is no longer consistent. The first array elements 3 and 2 did not come from the same original row.
I'd like to be certain that, without any ORDER BY clause, the natural thing just happens. Even with an ordering on either column, ambiguity would remain because of the duplicate elements. I'd prefer to avoid imposing an unambiguous sort, because in my real application, the tables will be large and the sorting might be costly. But I can't find any documentation that guarantees or specifies that, absent imposition of inconsistent orderings, multiple array_agg calls will be ordered consistently, even though it'd be very surprising if they weren't.
Is it safe to assume that the ordering of multiple array_agg columns will be consistently ordered when no ordering is explicitly imposed on the query or within the aggregate functions?
According to the PostgreSQL documentation:
Ordinarily, the input rows are fed to the aggregate function in an unspecified order. [...]
However, some aggregate functions (such as array_agg and string_agg) produce results that depend on the ordering of the input rows. When using such an aggregate, the optional order_by_clause can be used to specify the desired ordering.
The way I understand it: you can't be sure that the order of rows is preserved unless you use ORDER BY.
It seems there is a similar (or almost the same) question here:
PostgreSQL array_agg order
I prefer ebk's answer:
So I think it's fine to assume that all the aggregates in your query, none of which uses ORDER BY, will see the input data in the same order. The order itself is unspecified, though (it depends on the order in which the FROM clause supplies rows).
But you can still add an ORDER BY inside the array_agg calls to force the same order.

How to obtain the symmetric difference between two DataFrames?

In the SparkSQL 1.6 API (Scala), DataFrame has functions for intersect and except, but not one for the symmetric difference. Obviously, a combination of union and except can be used to generate the difference:
df1.except(df2).union(df2.except(df1))
But this seems a bit awkward. In my experience, if something seems awkward, there's a better way to do it, especially in Scala.
You can always rewrite it as:
df1.unionAll(df2).except(df1.intersect(df2))
Seriously though, UNION, INTERSECT and EXCEPT / MINUS are pretty much the standard set of SQL combining operators. I am not aware of any system which provides an XOR-like operation out of the box, most likely because it is trivial to implement using the other three and there is not much to optimize there.
Why not the below?
df1.except(df2)
(Note that this returns only the rows of df1 that are absent from df2, i.e. one direction of the symmetric difference.)
If you are looking for a PySpark solution, you should use subtract() (docs).
Also, unionAll is deprecated in 2.0; use union() instead.
df1.union(df2).subtract(df1.intersect(df2))
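A quick toy check of that line (hypothetical data, assuming an active SparkSession named spark):
df1 = spark.createDataFrame([(1,), (2,), (3,)], ["x"])
df2 = spark.createDataFrame([(2,), (3,), (4,)], ["x"])
df1.union(df2).subtract(df1.intersect(df2)).show()
# only the rows with x = 1 and x = 4 remain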
Notice that EXCEPT (or MINUS, which is just an alias for EXCEPT) de-duplicates results. So if you expect the "except" set (the diff you mentioned) plus the "intersect" set to equal the original dataframe, consider this feature request that keeps duplicates:
https://issues.apache.org/jira/browse/SPARK-21274
As I wrote there, "EXCEPT ALL" can be rewritten in Spark SQL as
SELECT a, b, c
FROM tab1 t1
LEFT OUTER JOIN tab2 t2
  ON ((t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c))
WHERE COALESCE(t2.a, t2.b, t2.c) IS NULL
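For what it's worth, that ticket has since been resolved: Spark 2.4 added duplicate-preserving counterparts of these operators.
# Spark 2.4+: set operations that keep duplicates.
except_all = df1.exceptAll(df2)
intersect_all = df1.intersectAll(df2)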
I think it could be more efficient using a left join and then filtering out the nulls.
df1.join(df2, Seq("some_join_key", "some_other_join_key"), "left")
   .where(col("column_just_present_in_df2").isNull)
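A left anti join expresses the same no-match filter directly. A PySpark sketch with the same hypothetical key columns (each anti join gives one direction; the union of both is the symmetric difference):
# left_anti keeps the rows of the left side that have no match on the right.
left_only = df1.join(df2, ["some_join_key", "some_other_join_key"], "left_anti")
right_only = df2.join(df1, ["some_join_key", "some_other_join_key"], "left_anti")
sym_diff = left_only.unionByName(right_only)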