pyspark - difference between select and agg

What is the difference between the following two -
df.select(min("salary")).show()
and
df.agg({'salary':'min'}).show()
Also, what is the difference between these two -
df.groupBy("department").agg({'salary':'min'}).show()
and
df.groupBy("Company").min('salary').show()

In Spark, there are many different ways to write the same thing. It depends mostly on whether you prefer a SQL style or a Python style of writing.
df.select(min("salary")) is the equivalent of SQL :
select min(salary) from df
This query computes the min of the column salary without any group by clause.
It is equivalent to
from pyspark.sql import functions as F
df.groupBy().agg(F.min("salary"))
# OR
df.groupBy().agg({'salary':'min'})
As you can see, the groupBy is empty, so you do not group by anything. Spark can also interpret the dict {'salary':'min'}, which is equivalent to the function F.min("salary").
The method agg depends on the object. Applied to a DataFrame, it is equivalent to df.groupBy().agg. agg is also a method of the GroupedData object, which is created when you call df.groupBy(). See the official documentation for the difference between the two methods.
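To make the equivalence concrete, here is a minimal runnable sketch (the departments and salaries are made up purely for illustration):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
# Hypothetical data, just for illustration
df = spark.createDataFrame(
    [("IT", 3000), ("IT", 4000), ("HR", 3500)],
    ["department", "salary"])

df.select(F.min("salary")).show()           # one row: min(salary) = 3000
df.agg(F.min("salary")).show()              # shorthand for df.groupBy().agg(...)
df.groupBy().agg({'salary': 'min'}).show()  # dict form, same result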
When writing df.groupBy("department").agg({'salary':'min'}), you can specify several different aggregations inside the agg method. When using just min, you are limited to one column. For example, you can do this:
from pyspark.sql import functions as F
df.groupBy("department").agg(F.min("salary"), F.max("age"))
# OR
df.groupBy("department").agg({'salary':'min', 'age':'max'})
# But you cannot do
df.groupBy("department").min("salary").max("age")
>> AttributeError: 'DataFrame' object has no attribute 'max'

Related

Pyspark apply multiple groupBy UDFs

I am trying to call two UDFs within the same groupBy function.
I have one UDF that takes a group and returns a Pandas dataframe with one row and multiple columns.
I have another that takes just one feature and returns a single value.
Is there a way to run both of them in the same groupBy? I run the first UDF with the applyInPandas function, but I can't find a way to run any other function alongside it.
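For reference, a minimal sketch of the applyInPandas pattern described above (the data, column names and output schema here are hypothetical):
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Hypothetical data for illustration
df = spark.createDataFrame(
    [("a", 1.0), ("a", 2.0), ("b", 3.0)], ["group", "feature"])

# Takes a whole group as a pandas DataFrame and returns one summary row
def summarize(pdf: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({"group": [pdf["group"].iloc[0]],
                         "mean": [pdf["feature"].mean()]})

df.groupBy("group").applyInPandas(
    summarize, schema="group string, mean double").show()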

PySpark: Finding the mean of a variable excluding the top 1 percentile of data

I have a dataset which is grouped by multiple variables, for which we find aggregates like mean, std dev, etc. Now I want to find the mean of a variable excluding the top 1 percentile of the data.
I am trying something like:
df_final = df.groupby(groupbyElement).agg(
    mean('value').alias('Mean'),
    stddev('value').alias('Stddev'),
    expr('percentile(value, array(0.99))')[0].alias('99_percentile'),
    mean(when(col('value') <= col('99_percentile'), col('value'))))
But it seems Spark cannot reference an alias that is defined in the same agg statement.
I even tried this:
df_final = df.groupby(groupbyElement).agg(
    mean('value').alias('Mean'),
    stddev('value').alias('Stddev'),
    mean(when(col('value') <= expr('percentile(value, array(0.99))')[0], col('value'))))
But it throws below error:
pyspark.sql.utils.AnalysisException: 'It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query.
I hope someone can answer this.
Update:
I tried doing it the other way.
Here's a straightforward modification of your code. It will aggregate df twice. As far as I can tell, that's what is required.
from pyspark.sql.functions import col, expr, mean, stddev, when

df_final = (
    df.join(df
            .groupby(groupbyElement)
            .agg(expr('percentile(value, array(0.99))')[0]
                 .alias('99_percentile')),
            on=groupbyElement, how="left")
      .groupby(groupbyElement)
      .agg(mean('value').alias('Mean'),
           stddev('value').alias('Stddev'),
           # mean of values at or below the group's 99th percentile
           mean(when(col('value') <= col('99_percentile'), col('value'))))
)

Using min/max operations in groupByKey on a spark dataset

I am trying to compute min and max inside the agg of a groupByKey operation. The code looks like this:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.TypedColumn
import org.apache.spark.sql.expressions.scalalang.typed.{
count => typedCount,
sum => typedSum }
inputFlowRecords.groupByKey(inputFlowRecrd => inputFlowRecrd.FlowInformation)
.agg(typedSum[InputFlowRecordV1](_.FlowStatistics.minFlowTime).name("minFlowTime"),
typedSum[InputFlowRecordV1](_.FlowStatistics.maxFlowTime).name("maxFlowTime"),
typedSum[InputFlowRecordV1](_.FlowStatistics.flowStartedCount).name("flowStartedCount"),
typedSum[InputFlowRecordV1](_.FlowStatistics.flowEndedCount).name("flowEndedCount"),
typedSum[InputFlowRecordV1](_.FlowStatistics.packetsCountFromSource).name("packetsCountFromSource"),
typedSum[InputFlowRecordV1](_.FlowStatistics.bytesCountFromSource).name("bytesCountFromSource"),
typedSum[InputFlowRecordV1](_.FlowStatistics.packetsCountFromDestination).name("packetsCountFromDestination"),
typedSum[InputFlowRecordV1](_.FlowStatistics.bytesCountFromDestination).name("bytesCountFromDestination"))
I am facing 2 problems here:
Instead of sum, I want to take the min/max of a few columns. When I try to use the org.apache.spark.sql.functions.min/max operations, the error says TypedColumns should be used. How can this be solved?
The agg function lets us specify at most 4 columns inside it, while I have 8 columns to aggregate. How can this be achieved?
Unfortunately it seems that:
min/max are not yet supported (see the "todos" in typed.scala)
the agg function indeed only supports up to 4 columns (see KeyValueGroupedDataset.scala)
In your case a reasonable thing to do might be to define your own specialized aggregator that aggregates InputFlowStatistics objects, so you only have a single argument to agg.
Typed aggregators are defined in typedaggregators.scala, and the Spark documentation provides some information on creating custom ones.

how to replace missing values from another column in PySpark?

I want to use the values in t5 to replace some missing values in t4. I searched for code, but it doesn't work for me.
Current: [image: example of current data]
Goal: [image: example of target data]
df is a dataframe. Code:
pdf = df.toPandas()
from pyspark.sql.functions import coalesce
pdf.withColumn("t4", coalesce(pdf.t4, pdf.t5))
Error: 'DataFrame' object has no attribute 'withColumn'
I also tried the following code previously; it didn't work either.
new_pdf=pdf['t4'].fillna(method='bfill', axis="columns")
Error: No axis named columns for object type
As the error indicates, .withColumn() is not a method of pandas dataframes but of Spark dataframes. Note that when you call .toPandas(), your pdf becomes a pandas dataframe, so if you want to use .withColumn(), avoid that conversion.
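In other words, a minimal sketch that stays in Spark instead of converting to pandas (t4 and t5 are the columns from the question):
from pyspark.sql.functions import coalesce

# df stays a Spark dataframe, so .withColumn() works;
# coalesce fills nulls in t4 with the value from t5
df = df.withColumn("t4", coalesce(df.t4, df.t5))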
UPDATE:
If pdf is a pandas dataframe you can do:
pdf['t4'] = pdf['t4'].fillna(pdf['t5'])

Column having list datatype : Spark HiveContext

The following code does aggregation and create a column with list datatype:
from pyspark.sql.functions import expr

df.groupBy(
    "column_name_1"
).agg(
    expr("collect_list(column_name_2) "
         "AS column_name_3")
)
So it seems it is possible to have 'list' as column datatype in a dataframe.
I was wondering if I can write a udf that returns custom datatype, for example a python dict?
The list is a representation of Spark's Array datatype. You can try using the Map datatype (pyspark.sql.types.MapType).
An example of a function that creates one is pyspark.sql.functions.create_map, which builds a map from several columns.
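For illustration, a minimal sketch (with made-up data) of both points: building a map column with create_map, and a Python UDF returning a dict by declaring a MapType return type:
from pyspark.sql import SparkSession
from pyspark.sql.functions import create_map, udf
from pyspark.sql.types import MapType, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()
# Hypothetical data for illustration
df = spark.createDataFrame([("a", 1), ("b", 2)], ["k", "v"])

# Build a map column directly from existing columns
df.withColumn("m", create_map(df.k, df.v)).show()

# A Python UDF may return a dict when its return type is declared as MapType
to_map = udf(lambda k, v: {k: v}, MapType(StringType(), IntegerType()))
df.withColumn("m", to_map(df.k, df.v)).show()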
That said, if you want to create a custom aggregation function to do anything not already available in pyspark.sql.functions, you will need to use Scala.