PySpark: Finding the mean of a variable excluding the top 1 percentile of data

I have a dataset that is grouped by multiple variables, and we compute aggregates like mean, standard deviation, etc. Now I want to find the mean of a variable excluding the top 1 percentile of the data.
I am trying something like:
df_final = (df.groupby(groupbyElement)
    .agg(mean('value').alias('Mean'),
         stddev('value').alias('Stddev'),
         expr('percentile(value, array(0.99))')[0].alias('99_percentile'),
         mean(when(col('value') <= col('99_percentile'), col('value')))))
But it seems Spark cannot reference an aggregate alias that is defined in the same agg statement.
I even tried this:
df_final = (df.groupby(groupbyElement)
    .agg(mean('value').alias('Mean'),
         stddev('value').alias('Stddev'),
         mean(when(col('value') <= expr('percentile(value, array(0.99))')[0], col('value')))))
But it throws the error below:
pyspark.sql.utils.AnalysisException: 'It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query.
I hope someone will be able to answer this.
Update:
I tried doing it the other way.

Here's a straightforward modification of your code. It will aggregate df twice. As far as I can tell, that's what is required.
df_final = (
    df.join(df
            .groupby(groupbyElement)
            .agg(expr('percentile(value, array(0.99))')[0].alias('99_percentile')),
            on=groupbyElement,  # join on the actual grouping columns, not the string "groupbyElement"
            how="left")
      .groupby(groupbyElement)
      .agg(mean('value').alias('Mean'),
           stddev('value').alias('Stddev'),
           mean(when(col('value') <= col('99_percentile'), col('value')))
               .alias('Mean_99'))
)
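An alternative sketch that avoids the explicit self-join: compute the per-group 99th percentile as a window function, then aggregate once. This reuses the question's df and groupbyElement (assumed to be a list of column names); the p99 and Mean_99 names are just illustrative.
from pyspark.sql import Window
from pyspark.sql.functions import col, expr, mean, stddev, when

w = Window.partitionBy(*groupbyElement)   # same keys as the groupby
df_final = (df
    .withColumn('p99', expr('percentile(value, 0.99)').over(w))
    .groupby(groupbyElement)
    .agg(mean('value').alias('Mean'),
         stddev('value').alias('Stddev'),
         mean(when(col('value') <= col('p99'), col('value'))).alias('Mean_99')))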

Related

PySpark group by collect_list over a window

I have a data frame with multiple columns. I'm trying to aggregate a few columns using collect_list, grouped on id, over a window function. I'm trying something like this:
exprs = [(collect_list(x).over(window)).alias(f"{x}_list") for x in cols]
df = df.groupBy('id').agg(*exprs)
I'm getting the below error:
expression is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get
If I do the same for a single column instead of multiple columns, it works.
I found a way around this. I guess window functions won't work inside agg(*exprs) operations, so I modified the above to:
for col_name in cols:   # cols: the columns to collect
    df = df.withColumn(col_name + "_list",
                       collect_list(col(col_name)).over(window_spec))
This served my purpose.
Thank you.
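For completeness, here is a minimal self-contained sketch of that loop-based workaround; the id column, the cols list and the sample data are assumptions for illustration only.
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, collect_list

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "a", 10), (1, "b", 20), (2, "c", 30)],
    ["id", "name", "value"])

cols = ["name", "value"]                 # columns to collect (assumed)
window_spec = Window.partitionBy("id")   # one list per id

for col_name in cols:
    df = df.withColumn(col_name + "_list",
                       collect_list(col(col_name)).over(window_spec))

df.show()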

Why can you nest aggregate functions when using a window function in PostgreSQL?

I'm trying to understand window functions a bit better, and I'm stumped as to why I can't run a nested aggregate function normally, but I can when using a window function.
This is the dbfiddle I'm working off of: https://dbfiddle.uk/?rdbms=postgres_11&fiddle=76d62fcf4066053db18783e70269438c
Before running the window function, basically everything else in my query is evaluated (JOIN and GROUP BY).
So I believe the data the window function is working off of is something like this (after grouping):
Or is it something like this?
So why can I do this: SUM(COUNT(votes.option_id)) OVER(), but I can't do it without OVER()?
As far as I understand, OVER() makes the SUM(COUNT(votes.option_id)) run on this related data set, but it's still a nested aggregate function.
What am I missing?
Thank you very much!
If you have something like SUM(COUNT(votes.option_id)) OVER() you can think of COUNT(votes.option_id) as a column generated in the GROUP BY clause.
According to the documentation:
The rows considered by a window function are those of the “virtual table” produced by the query's FROM clause as filtered by its WHERE, GROUP BY, and HAVING clauses if any.
This means that window functions operate at a level above the GROUP BY clause and any aggregates, and therefore aggregates are available to be used inside window functions. In your example the "virtual table" corresponds to the second picture.
The reason you cannot nest aggregate functions is that you cannot have multiple levels of GROUP BY in the same query. Similarly, you cannot nest window functions. The documentation is clear on what types of expressions are allowed inside aggregate and window functions. For aggregate functions we can use:
any value expression that does not itself contain an aggregate expression or a window function call
while for window functions we can use:
any value expression that does not itself contain window function calls
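Since this compilation is mostly about Spark, the same two-level behaviour can be made explicit there: first aggregate with groupBy, then apply the window over the already-aggregated rows. The votes data and column names below are assumptions for illustration only.
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import count, sum as spark_sum

spark = SparkSession.builder.getOrCreate()
votes = spark.createDataFrame(
    [(1, 1), (2, 1), (3, 2)], ["vote_id", "option_id"])

(votes
 .groupBy("option_id")
 .agg(count("option_id").alias("option_count"))                      # aggregate level (GROUP BY)
 .withColumn("total",
             spark_sum("option_count").over(Window.partitionBy()))   # window level, like OVER ()
 .show())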

Using min/max operations in groupByKey on a spark dataset

I am trying to compute min and max inside the agg of a groupByKey operation. The code looks like this:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.TypedColumn
import org.apache.spark.sql.expressions.scalalang.typed.{
count => typedCount,
sum => typedSum }
inputFlowRecords.groupByKey(inputFlowRecrd => inputFlowRecrd.FlowInformation)
.agg(typedSum[InputFlowRecordV1](_.FlowStatistics.minFlowTime).name("minFlowTime"),
typedSum[InputFlowRecordV1](_.FlowStatistics.maxFlowTime).name("maxFlowTime"),
typedSum[InputFlowRecordV1](_.FlowStatistics.flowStartedCount).name("flowStartedCount"),
typedSum[InputFlowRecordV1](_.FlowStatistics.flowEndedCount).name("flowEndedCount"),
typedSum[InputFlowRecordV1](_.FlowStatistics.packetsCountFromSource).name("packetsCountFromSource"),
typedSum[InputFlowRecordV1](_.FlowStatistics.bytesCountFromSource).name("bytesCountFromSource"),
typedSum[InputFlowRecordV1](_.FlowStatistics.packetsCountFromDestination).name("packetsCountFromDestination"),
typedSum[InputFlowRecordV1](_.FlowStatistics.bytesCountFromDestination).name("bytesCountFromDestination"))
I am facing 2 problems here:
Instead of sum I want to take min/max on a few columns. When I try to use org.apache.spark.sql.functions.min/max operations, the error says TypedColumns should be used. How can this be solved?
The agg function lets us specify at most 4 columns inside it, while I have 8 columns to aggregate. How can this be achieved?
Unfortunately it seems that:
min/max are not yet supported (see "todos" in typed.scala)
agg function indeed only supports up to 4 columns (see in KeyValueGroupedDataset.scala)
In your case a reasonable thing to do might be to define your own specialized aggregator that would aggregate InputFlowStatistics objects, so you only have a single argument to agg.
Typed aggregators are defined in typedaggregators.scala, and the Spark documentation provides some information on creating custom ones.
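If the typed API is not a hard requirement, the untyped DataFrame agg has neither limitation: it accepts any number of columns and min/max are available directly. A hedged sketch (shown here in PySpark; the flattened view inputFlowRecordsDF and the nested column layout are assumptions based on the question):
from pyspark.sql.functions import min as min_, max as max_, sum as sum_

(inputFlowRecordsDF                       # hypothetical DataFrame view of the Dataset
 .groupBy("FlowInformation")
 .agg(min_("FlowStatistics.minFlowTime").alias("minFlowTime"),
      max_("FlowStatistics.maxFlowTime").alias("maxFlowTime"),
      sum_("FlowStatistics.flowStartedCount").alias("flowStartedCount"),
      sum_("FlowStatistics.flowEndedCount").alias("flowEndedCount"),
      sum_("FlowStatistics.packetsCountFromSource").alias("packetsCountFromSource"),
      sum_("FlowStatistics.bytesCountFromSource").alias("bytesCountFromSource")))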

Spark SQL group by and sum changing column name?

In this data frame I am finding the total salary for each group. In Oracle I'd use this code:
select job_id, sum(salary) as "Total" from hr.employees group by job_id;
In Spark SQL I tried the same, but I am facing two issues:
empData.groupBy($"job_id").sum("salary").alias("Total").show()
The alias Total is not displayed; instead, a "sum(salary)" column is shown.
I could not use $ (the Scala column syntax) inside sum; I get a compilation error:
empData.groupBy($"job_id").sum($"salary").alias("Total").show()
Any idea?
Use the aggregate function .agg() if you want to provide an alias name. It accepts the Scala column syntax ($"..."):
empData.groupBy($"job_id").agg(sum($"salary") as "Total").show()
If you don't want to use .agg(), the alias name can also be provided using .select():
empData.groupBy($"job_id").sum("salary").select($"job_id", $"sum(salary)".alias("Total")).show()
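For reference, a rough PySpark equivalent of the .agg() form (assuming empData is also available as a PySpark DataFrame):
from pyspark.sql.functions import sum as spark_sum

empData.groupBy("job_id").agg(spark_sum("salary").alias("Total")).show()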

Spark groupBy agg not working as expected

I am getting a similar issue:
(df
.groupBy("email")
.agg(last("user_id") as "user_id")
.select("user_id").count,
df
.groupBy("email")
.agg(last("user_id") as "user_id")
.select("user_id")
.distinct
.count)
When run on one computer it gives: (15123144,15123144)
When run on cluster it gives: (15123144,24)
The first one is expected and looks correct, but the second one is horribly wrong. One more observation: even if I change the data so that the total count is more or less than 15123144, I still get distinct = 24 on the cluster.
Even if I interchange user_id and email, it gives the same distinct count.
I am more confused by seeing: https://spark.apache.org/docs/1.5.2/api/scala/index.html#org.apache.spark.sql.DataFrame
The agg doc says: "Aggregates on the entire DataFrame without groups." Without groups? What does that mean?
Any clue? Or a Jira ticket? Or what can be a fix for now?
Let's start with the "without group" part. As it is described in the docs:
df.agg(...) is a shorthand for df.groupBy().agg(...)
If it is still not clear it translates to SQL:
SELECT SOME_AGGREGATE_FUNCTION(some_column) FROM table
Regarding your second problem, it is hard to give you a good answer without access to the data, but generally speaking these two queries are not equivalent. The first simply counts distinct email values; the second counts unique values of the last user_id per email. Moreover, last without explicit ordering is meaningless.
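To make "last user_id per email" deterministic, one option is to order explicitly inside a window. A sketch (shown in PySpark), assuming a hypothetical updated_at column that defines what "last" means:
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

w = Window.partitionBy("email").orderBy(col("updated_at").desc())
last_per_email = (df
    .withColumn("rn", row_number().over(w))
    .filter(col("rn") == 1)                 # keep the most recent row per email
    .select("email", "user_id"))

last_per_email.select("user_id").distinct().count()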