pyspark - difference between select and agg

What is the difference between the following two -
df.select(min("salary")).show()
and
df.agg({'salary':'min'}).show()
Also, what is the difference between these two -
df.groupBy("department").agg({'salary':'min'}).show()
and
df.groupBy("Company").min('salary').show()

In Spark, there are many different ways to write the same thing. It depends mostly on whether you prefer a SQL style or a Python style of writing.
df.select(min("salary")) is the equivalent of SQL :
select min(salary) from df
This query computes the min of the column salary without any group by clause.
It is equivalent to
from pyspark.sql import functions as F
df.groupBy().agg(F.min("salary"))
# OR
df.groupBy().agg({'salary':'min'})
As you can see, the groupBy is empty, so you do not group by anything. Spark can also interpret the dict {'salary':'min'}, which is equivalent to the function F.min("salary").
The method agg depends on the object. Applied to a DataFrame, it is equivalent to df.groupBy().agg. agg is also a method of the GroupedData object, which is created when you call df.groupBy(). See the official documentation for the difference between the two methods.
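To make the equivalence concrete, here is a minimal runnable sketch (the departments and salaries are made up purely for illustration):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
# Hypothetical data, just for illustration
df = spark.createDataFrame(
    [("IT", 3000), ("IT", 4000), ("HR", 3500)],
    ["department", "salary"])

df.select(F.min("salary")).show()           # one row: min(salary) = 3000
df.agg(F.min("salary")).show()              # shorthand for df.groupBy().agg(...)
df.groupBy().agg({'salary': 'min'}).show()  # dict form, same result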
When writing df.groupBy("department").agg({'salary':'min'}), you can specify several different aggregations inside the agg method. When using just min, you are limited to one column. For example, you can do this:
from pyspark.sql import functions as F
df.groupBy("department").agg(F.min("salary"), F.max("age"))
# OR
df.groupBy("department").agg({'salary':'min', 'age':'max'})
# But you cannot do
df.groupBy("department").min("salary").max("age")
>> AttributeError: 'DataFrame' object has no attribute 'max'

Related

Pyspark apply multiple groupBy UDFs

I am trying to call two UDFs within the same groupBy function.
I have one UDF that takes a group and returns a Pandas dataframe with one row and multiple columns.
I have another that takes just one feature and returns a single value.
Is there a way to run both of them in the same groupBy? I run the first UDF with the applyInPandas function, but I can't find a way to run any other function alongside it.
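For reference, a minimal sketch of the applyInPandas pattern described above (the data, column names and output schema here are hypothetical):
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Hypothetical data for illustration
df = spark.createDataFrame(
    [("a", 1.0), ("a", 2.0), ("b", 3.0)], ["group", "feature"])

# Takes a whole group as a pandas DataFrame and returns one summary row
def summarize(pdf: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({"group": [pdf["group"].iloc[0]],
                         "mean": [pdf["feature"].mean()]})

df.groupBy("group").applyInPandas(
    summarize, schema="group string, mean double").show()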

PySpark: Finding the mean of a variable excluding the top 1 percentile of data

I have a dataset which is grouped by multiple variables, for which we find aggregates like mean, std dev, etc. Now I want to find the mean of a variable excluding the top 1 percentile of the data.
I am trying something like:
df_final = df.groupby(groupbyElement).agg(
    mean('value').alias('Mean'),
    stddev('value').alias('Stddev'),
    expr('percentile(value, array(0.99))')[0].alias('99_percentile'),
    mean(when(col('value') <= col('99_percentile'), col('value'))))
But it seems Spark cannot reference an alias that is defined in the same agg statement.
I even tried this:
df_final = df.groupby(groupbyElement).agg(
    mean('value').alias('Mean'),
    stddev('value').alias('Stddev'),
    mean(when(col('value') <= expr('percentile(value, array(0.99))')[0], col('value'))))
But it throws below error:
pyspark.sql.utils.AnalysisException: 'It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query.
I hope someone can answer this.
Update:
I tried doing it the other way.
Here's a straightforward modification of your code. It will aggregate df twice. As far as I can tell, that's what is required.
from pyspark.sql.functions import col, expr, mean, stddev, when

df_final = (
    df.join(df
            .groupby(groupbyElement)
            .agg(expr('percentile(value, array(0.99))')[0]
                 .alias('99_percentile')),
            on=groupbyElement, how="left")
      .groupby(groupbyElement)
      .agg(mean('value').alias('Mean'),
           stddev('value').alias('Stddev'),
           # mean of values at or below the group's 99th percentile
           mean(when(col('value') <= col('99_percentile'), col('value'))))
)

Using min/max operations in groupByKey on a spark dataset

I am trying to compute min and max inside the agg of a groupByKey operation. The code looks like this:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.TypedColumn
import org.apache.spark.sql.expressions.scalalang.typed.{
count => typedCount,
sum => typedSum }
inputFlowRecords.groupByKey(inputFlowRecrd => inputFlowRecrd.FlowInformation)
.agg(typedSum[InputFlowRecordV1](_.FlowStatistics.minFlowTime).name("minFlowTime"),
typedSum[InputFlowRecordV1](_.FlowStatistics.maxFlowTime).name("maxFlowTime"),
typedSum[InputFlowRecordV1](_.FlowStatistics.flowStartedCount).name("flowStartedCount"),
typedSum[InputFlowRecordV1](_.FlowStatistics.flowEndedCount).name("flowEndedCount"),
typedSum[InputFlowRecordV1](_.FlowStatistics.packetsCountFromSource).name("packetsCountFromSource"),
typedSum[InputFlowRecordV1](_.FlowStatistics.bytesCountFromSource).name("bytesCountFromSource"),
typedSum[InputFlowRecordV1](_.FlowStatistics.packetsCountFromDestination).name("packetsCountFromDestination"),
typedSum[InputFlowRecordV1](_.FlowStatistics.bytesCountFromDestination).name("bytesCountFromDestination"))
I am facing 2 problems here:
Instead of sum, I want to take the min/max of a few columns. When I try to use the org.apache.spark.sql.functions.min/max operations, the error says TypedColumns should be used. How can this be solved?
The agg function lets us specify at most 4 columns inside it, while I have 8 columns to aggregate. How can this be achieved?
Unfortunately it seems that:
min/max are not yet supported (see the "todos" in typed.scala)
the agg function indeed only supports up to 4 columns (see KeyValueGroupedDataset.scala)
In your case a reasonable thing to do might be to define your own specialized aggregator that aggregates InputFlowStatistics objects, so you only have a single argument to agg.
Typed aggregators are defined in typedaggregators.scala, and the Spark documentation provides some information on creating custom ones.

how to replace missing values from another column in PySpark?

I want to use the values in t5 to replace some missing values in t4. I searched for code, but it doesn't work for me.
Current: [image: example of current data]
Goal: [image: example of target data]
df is a dataframe. Code:
pdf = df.toPandas()
from pyspark.sql.functions import coalesce
pdf.withColumn("t4", coalesce(pdf.t4, pdf.t5))
Error: 'DataFrame' object has no attribute 'withColumn'
I also tried the following code previously; it didn't work either.
new_pdf=pdf['t4'].fillna(method='bfill', axis="columns")
Error: No axis named columns for object type
As the error indicates, .withColumn() is not a method of pandas dataframes but of Spark dataframes. Note that when you call .toPandas(), your pdf becomes a pandas dataframe, so if you want to use .withColumn(), avoid that conversion.
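In other words, a minimal sketch that stays in Spark instead of converting to pandas (t4 and t5 are the columns from the question):
from pyspark.sql.functions import coalesce

# df stays a Spark dataframe, so .withColumn() works;
# coalesce fills nulls in t4 with the value from t5
df = df.withColumn("t4", coalesce(df.t4, df.t5))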
UPDATE:
If pdf is a pandas dataframe you can do:
pdf['t4'] = pdf['t4'].fillna(pdf['t5'])

Column having list datatype : Spark HiveContext

The following code does aggregation and create a column with list datatype:
from pyspark.sql.functions import expr

df.groupBy(
    "column_name_1"
).agg(
    expr("collect_list(column_name_2) "
         "AS column_name_3")
)
So it seems it is possible to have 'list' as column datatype in a dataframe.
I was wondering if I can write a udf that returns custom datatype, for example a python dict?
The list is a representation of Spark's Array datatype. You can try using the Map datatype (pyspark.sql.types.MapType).
An example of a function that creates one is pyspark.sql.functions.create_map, which builds a map from several columns.
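For illustration, a minimal sketch (with made-up data) of both points: building a map column with create_map, and a Python UDF returning a dict by declaring a MapType return type:
from pyspark.sql import SparkSession
from pyspark.sql.functions import create_map, udf
from pyspark.sql.types import MapType, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()
# Hypothetical data for illustration
df = spark.createDataFrame([("a", 1), ("b", 2)], ["k", "v"])

# Build a map column directly from existing columns
df.withColumn("m", create_map(df.k, df.v)).show()

# A Python UDF may return a dict when its return type is declared as MapType
to_map = udf(lambda k, v: {k: v}, MapType(StringType(), IntegerType()))
df.withColumn("m", to_map(df.k, df.v)).show()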
That said, if you want to create a custom aggregation function to do anything not already available in pyspark.sql.functions, you will need to use Scala.