aggregrated_table = df_input.groupBy('city', 'income_bracket') \
.agg(
count('suburb').alias('suburb'),
sum('population').alias('population'),
sum('gross_income').alias('gross_income'),
sum('no_households').alias('no_households'))
Would like to group by city and income bracket but within each city certain suburbs have different income brackets. How do I group by the most frequently occurring income bracket per city?
for example:
city1 suburb1 income_bracket_10
city1 suburb1 income_bracket_10
city1 suburb2 income_bracket_10
city1 suburb3 income_bracket_11
city1 suburb4 income_bracket_10
Would be grouped by income_bracket_10
Using a window function before aggregating might do the trick:
from pyspark.sql import Window
import pyspark.sql.functions as psf
w = Window.partitionBy('city')
aggregrated_table = df_input.withColumn(
"count",
psf.count("*").over(w)
).withColumn(
"rn",
psf.row_number().over(w.orderBy(psf.desc("count")))
).filter("rn = 1").groupBy('city', 'income_bracket').agg(
psf.count('suburb').alias('suburb'),
psf.sum('population').alias('population'),
psf.sum('gross_income').alias('gross_income'),
psf.sum('no_households').alias('no_households'))
you can also use a window function after aggregating since you're keeping a count of (city, income_bracket) occurrences.
You don't necessarily need Window functions:
aggregrated_table = (
df_input.groupby("city", "suburb","income_bracket")
.count()
.withColumn("count_income", F.array("count", "income_bracket"))
.groupby("city", "suburb")
.agg(F.max("count_income").getItem(1).alias("most_common_income_bracket"))
)
I think this does what you require. I don't really know if it performs better than the window based solution.
For pyspark version >=3.4 you can use the mode function directly to get the most frequent element per group:
from pyspark.sql import functions as f
df = spark.createDataFrame([
... ("Java", 2012, 20000), ("dotNET", 2012, 5000),
... ("Java", 2012, 20000), ("dotNET", 2012, 5000),
... ("dotNET", 2013, 48000), ("Java", 2013, 30000)],
... schema=("course", "year", "earnings"))
>>> df.groupby("course").agg(f.mode("year")).show()
+------+----------+
|course|mode(year)|
+------+----------+
| Java| 2012|
|dotNET| 2012|
+------+----------+
https://github.com/apache/spark/blob/7f1b6fe02bdb2c68d5fb3129684ca0ed2ae5b534/python/pyspark/sql/functions.py#L379
The solution by mfcabrera gave wrong results when F.max was used on F.array column as the values in ArrayType are treated as String and integer max didnt work as expected.
The below solution worked.
w = Window.partitionBy('city', "suburb").orderBy(f.desc("count"))
aggregrated_table = (
input_df.groupby("city", "suburb","income_bracket")
.count()
.withColumn("max_income", f.row_number().over(w2))
.filter(f.col("max_income") == 1).drop("max_income")
)
aggregrated_table.display()
Related
I have a dataset that has Id, Value and Timestamp columns. Id and Value columns are strings. Sample:
Id
Value
Timestamp
Id1
100
1658919600
Id1
200
1658919602
Id1
300
1658919601
Id2
433
1658919677
I want to concatenate Values that belong to the same Id, and order them by Timestamp. E.g. for rows with Id1 the result would look like:
Id
Values
Id1
100;300;200
Some pseudo code would be:
res = SELECT Id,
STRING_AGG(Value,";") WITHING GROUP ORDER BY Timestamp AS Values
FROM table
GROUP BY Id
Can someone help me write this in Databricks? PySpark and SQL are both fine.
You can collect lists of struct ofTimestamp and Value (in that order) for each Id, sort them (sort_array will sort by the first value of struct, i.e Timestamp) and combine Value's values into string using concat_ws.
PySpark (Spark 3.1.2)
import pyspark.sql.functions as F
(df
.groupBy("Id")
.agg(F.expr("concat_ws(';', sort_array(collect_list(struct(Timestamp, Value))).Value) as Values"))
).show(truncate=False)
# +---+-----------+
# |Id |Values |
# +---+-----------+
# |Id1|100;300;200|
# |Id2|433 |
# +---+-----------+
in SparkSQL
SELECT Id, concat_ws(';', sort_array(collect_list(struct(Timestamp, Value))).Value) as Values
FROM table
GROUP BY Id
This is a beautiful question!! This is a perfect use case for Fugue which can port Python and Pandas code to PySpark. I think this is something that is hard to express in Spark but easy to express in native Python or Pandas.
Let's just concern ourselves with 1 ID first. For one ID, using pure native Python, it would look like below. Assume the Timestamps are already sorted when this is applied.
import pandas as pd
df = pd.DataFrame({"Id": ["Id1", "Id1", "Id1", "Id2","Id2","Id2"],
"Value": [100,200,300,433, 500,600],
"Timestamp": [1658919600, 1658919602, 1658919601, 1658919677, 1658919670, 1658919672]})
from typing import Iterable, List, Dict, Any
def logic(df: List[Dict[str,Any]]) -> Iterable[Dict[str,Any]]:
_id = df[0]['Id']
items = []
for row in df:
items.append(row['Value'])
yield {"Id": _id, "Values": items}
Now we can call Fugue with one line of code to run this on Pandas. Fugue uses the type annotation from the logic function to handle conversions for you as it enters the function. We can run this for 1 ID (not sorted yet).
from fugue import transform
transform(df.loc[df["Id"] == "Id1"], logic, schema="Id:str,Values:[int]")
and that generates this:
Id Values
0 Id1 [100, 200, 300]
Now we are ready to bring it to Spark. All we need to do is add the engine and partitioning strategy to the transform call.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sdf = transform(df,
logic,
schema="Id:str,Values:[int]",
partition={"by": "Id", "presort": "Timestamp asc"},
engine=spark)
sdf.show()
Because we passed in the SparkSession, this code will run on Spark.sdf is a SparkDataFrame so we need .show() because it evaluates lazily. Schema is a requirement for Spark so we need it too on Fugue but it's significantly simplified. The partitioning strategy will run logic on each Id, and will sort the items by Timestamp for each partition.
For the FugueSQL version, you can do:
from fugue_sql import fsql
fsql(
"""
SELECT *
FROM df
TRANSFORM PREPARTITION BY Id PRESORT Timestamp ASC USING logic SCHEMA Id:str,Values:[int]
PRINT
"""
).run(spark)
Easiest Solution :
df1=df.sort(asc('Timestamp')).groupBy("id").agg(collect_list('Value').alias('newcol'))
+---+---------------+
| id| newcol|
+---+---------------+
|Id1|[100, 300, 200]|
|Id2| [433]|
+---+---------------+
df1.withColumn('newcol',concat_ws(";",col("newcol"))).show()
+---+-----------+
| id| newcol|
+---+-----------+
|Id1|100;300;200|
|Id2| 433|
+---+-----------+
Let's say I have the following two dataframes:
DF1:
+----------+----------+----------+
| Place|Population| IndexA|
+----------+----------+----------+
| A| Int| X_A|
| B| Int| X_B|
| C| Int| X_C|
+----------+----------+----------+
DF2:
+----------+----------+
| City| IndexB|
+----------+----------+
| D| X_D|
| E| X_E|
| F| X_F|
| ....| ....|
| ZZ| X_ZZ|
+----------+----------+
The dataframes above are normally of much larger size.
I want to determine to which City(DF2) the shortest distance is from every Place from DF1. The distance can be calculated based on the index. So for every row in DF1, I have to iterate over every row in DF2 and look for the shortest distances based on the calculations with the indexes. For the distance calculation there is a function defined:
val distance = udf(
(indexA: Long, indexB: Long) => {
h3.instance.h3Distance(indexA, indexB)
})
I tried the following:
val output = DF1.agg(functions.min(distance(col("IndexA"), DF2.col("IndexB"))))
But this, the code compiles but I get the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved attribute(s)
H3Index#220L missing from Places#316,Population#330,IndexAx#338L in operator !Aggregate
[min(if ((isnull(IndexA#338L) OR isnull(IndexB#220L))) null else
UDF(knownnotnull(IndexA#338L), knownnotnull(IndexB#220L))) AS min(UDF(IndexA, IndexB))#346].
So I suppose I do something wrong with iterating over each row in DF2 when taking one row from DF1 but I couldn't find a solution.
What am I doing wrong? And am I in the right direction?
You are getting this error because the index column you are using only exists in DF2 and not DF1 where you are attempting to perform the aggregation.
In order to make this field accessible and determine the distance from all points, you would need to
Cross join DF1 and Df2 to have every index of Df1 matching every index of DF2
Determine the distance using your udf
Find the min on this new cross joined udf with the distances
This may look like :
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, min, udf}
val distance = udf(
(indexA: Long, indexB: Long) => {
h3.instance.h3Distance(indexA, indexB)
})
val resultDF = DF1.crossJoin(DF2)
.withColumn("distance", distance(col("IndexA"), col("IndexB")))
//instead of using a groupby then matching the min distance of the aggregation with the initial df. I've chosen to use a window function min to determine the min_distance of each group (determined by Place) and filter by the city with the min distance to each place
.withColumn("min_distance", min("distance").over(Window.partitionBy("Place")))
.where(col("distance") === col("min_distance"))
.drop("min_distance")
This will result in a dataframe with columns from both dataframes and and additional column distance.
NB. Your current approach which is comparing every item in one df to every item in another df is an expensive operation. If you have the opportunity to filter early (eg joining on heuristic columns, i.e. other columns which may indicate a place may be closer to a city), this is recommended.
Let me know if this works for you.
If you have only a few cities (less than or around 1000), you can avoid crossJoin and Window shuffle by collecting cities in an array and then perform distance computation for each place using this collected array:
import org.apache.spark.sql.functions.{array_min, col, struct, transform, typedLit, udf}
val citiesIndexes = df2.select("City", "IndexB")
.collect()
.map(row => (row.getString(0), row.getLong(1)))
val result = df1.withColumn(
"City",
array_min(
transform(
typedLit(citiesIndexes),
x => struct(distance(col("IndexA"), x.getItem("_2")), x.getItem("_1"))
)
).getItem("col2")
)
This piece of code works for Spark 3 and greater. If you are on a Spark version smaller than 3.0, you should replace array_min(...).getItem("col2") part by an user-defined function.
I am pre-processing data for machine learning inputs, a target value column, call it "price" has many outliers, and rather than winsorizing price over the whole set I want to winsorize within groups labeled by "product_category". There are other features, product_category is just a price-relevant label.
There is a Scala stat function that works great:
df_data.stat.approxQuantile("price", Array(0.01, 0.99), 0.00001)
// res19: Array[Double] = Array(3.13, 318.54)
Unfortunately, it doesn't support computing the quantiles within groups. Nor does is support window partitions.
df_data
.groupBy("product_category")
.approxQuantile($"price", Array(0.01, 0.99), 0.00001)
// error: value approxQuantile is not a member of
// org.apache.spark.sql.RelationalGroupedDataset
What is the best way to compute say the p01 and p99 within groups of a spark dataframe, for the purpose of replacing values beyond that range, ie winsorizing?
My dataset schema can be imagined like this, and its over 20MM rows with appx 10K different labels for "product_category", so performance is also a concern.
df_data and a winsorized price column:
+---------+------------------+--------+---------+
| item | product_category | price | pr_winz |
+---------+------------------+--------+---------+
| I000001 | XX11 | 1.99 | 5.00 |
| I000002 | XX11 | 59.99 | 59.99 |
| I000003 | XX11 |1359.00 | 850.00 |
+---------+------------------+--------+---------+
supposing p01 = 5.00, p99 = 850.00 for this product_category
Here is what I came up with, after struggling with the documentation (there are two functions approx_percentile and percentile_approx that apparently do the same thing).
I was not able to figure out how to implement this except as a spark sql expression, not sure exactly why grouping only works there. I suspect because its part of Hive?
Spark DataFrame Winsorizor
Tested on DF in 10 to 100MM rows range
// Winsorize function, groupable by columns list
// low/hi element of [0,1]
// precision: integer in [1, 1E7-ish], in practice use 100 or 1000 for large data, smaller is faster/less accurate
// group_col: comma-separated list of column names
import org.apache.spark.sql._
def grouped_winzo(df: DataFrame, winz_col: String, group_col: String, low: Double, hi: Double, precision: Integer): DataFrame = {
df.createOrReplaceTempView("df_table")
spark.sql(s"""
select distinct
*
, percentile_approx($winz_col, $low, $precision) over(partition by $group_col) p_low
, percentile_approx($winz_col, $hi, $precision) over(partition by $group_col) p_hi
from df_table
""")
.withColumn(winz_col + "_winz", expr(s"""
case when $winz_col <= p_low then p_low
when $winz_col >= p_hi then p_hi
else $winz_col end"""))
.drop(winz_col, "p_low", "p_hi")
}
// winsorize the price column of a dataframe at the p01 and p99
// percentiles, grouped by 'product_category' column.
val df_winsorized = grouped_winzo(
df_data
, "price"
, "product_category"
, 0.01, 0.99, 1000)
Hi how's it going? I'm a Python developer trying to learn Spark Scala. My task is to create date range bins, and count the frequency of occurrences in each bin (histogram).
My input dataframe looks something like this
My bin edges are this (in Python):
bins = ["01-01-1990 - 12-31-1999","01-01-2000 - 12-31-2009"]
and the output dataframe I'm looking for is (counts of how many values in original dataframe per bin):
Is there anyone who can guide me on how to do this is spark scala? I'm a bit lost. Thank you.
Are You Looking for A result Like Following:
+------------------------+------------------------+
|01-01-1990 -- 12-31-1999|01-01-2000 -- 12-31-2009|
+------------------------+------------------------+
| 3| null|
| null| 2|
+------------------------+------------------------+
It can be achieved with little bit of spark Sql and pivot function as shown below
check out the left join condition
val binRangeData = Seq(("01-01-1990","12-31-1999"),
("01-01-2000","12-31-2009"))
val binRangeDf = binRangeData.toDF("start_date","end_date")
// binRangeDf.show
val inputDf = Seq((0,"10-12-1992"),
(1,"11-11-1994"),
(2,"07-15-1999"),
(3,"01-20-2001"),
(4,"02-01-2005")).toDF("id","input_date")
// inputDf.show
binRangeDf.createOrReplaceTempView("bin_range")
inputDf.createOrReplaceTempView("input_table")
val countSql = """
SELECT concat(date_format(c.st_dt,'MM-dd-yyyy'),' -- ',date_format(c.end_dt,'MM-dd-yyyy')) as header, c.bin_count
FROM (
(SELECT
b.st_dt, b.end_dt, count(1) as bin_count
FROM
(select to_date(input_date,'MM-dd-yyyy') as date_input , * from input_table) a
left join
(select to_date(start_date,'MM-dd-yyyy') as st_dt, to_date(end_date,'MM-dd-yyyy') as end_dt from bin_range ) b
on
a.date_input >= b.st_dt and a.date_input < b.end_dt
group by 1,2) ) c"""
val countDf = spark.sql(countSql)
countDf.groupBy("bin_count").pivot("header").sum("bin_count").drop("bin_count").show
Although, since you have 2 bin ranges there will be 2 rows generated.
We can achieve this by looking at the date column and determining within which range each record falls.
// First we set up the problem
// Create a format that looks like yours
val dateFormat = java.time.format.DateTimeFormatter.ofPattern("MM-dd-yyyy")
// Get the current local date
val now = java.time.LocalDate.now
// Create a range of 1-10000 and map each to minusDays
// so we can have range of dates going 10000 days back
val dates = (1 to 10000).map(now.minusDays(_).format(dateFormat))
// Create a DataFrame we can work with.
val df = dates.toDF("date")
So far so good. We have date entries to work with, and they are like your format (MM-dd-yyyy).
Next up, we need a function which returns 1 if the date falls within range, and 0 if not. We create a UserDefinedFunction (UDF) from this function so we can apply it to all rows simultaneously across Spark executors.
// We will process each range one at a time, so we'll take it as a string
// and split it accordingly. Then we perform our tests. Using Dates is
// necessary to cater to your format.
import java.text.SimpleDateFormat
def isWithinRange(date: String, binRange: String): Int = {
val format = new SimpleDateFormat("MM-dd-yyyy")
val startDate = format.parse(binRange.split(" - ").head)
val endDate = format.parse(binRange.split(" - ").last)
val testDate = format.parse(date)
if (!(testDate.before(startDate) || testDate.after(endDate))) 1
else 0
}
// We create a udf which uses an anonymous function taking two args and
// simply pass the values to our prepared function
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
def isWithinRangeUdf: UserDefinedFunction =
udf((date: String, binRange: String) => isWithinRange(date, binRange))
Now that we have our UDF setup, we create new columns in our DataFrame and group by the given bins and sum the values over (hence why we made our functions evaluate to an Int)
// We define our bins List
val bins = List("01-01-1990 - 12-31-1999",
"01-01-2000 - 12-31-2009",
"01-01-2010 - 12-31-2020")
// We fold through the bins list, creating a column from each bin at a time,
// enriching the DataFrame with more columns as we go
import org.apache.spark.sql.functions.{col, lit}
val withBinsDf = bins.foldLeft(df){(changingDf, bin) =>
changingDf.withColumn(bin, isWithinRangeUdf(col("date"), lit(bin)))
}
withBinsDf.show(1)
//+----------+-----------------------+-----------------------+-----------------------+
//| date|01-01-1990 - 12-31-1999|01-01-2000 - 12-31-2009|01-01-2010 - 12-31-2020|
//+----------+-----------------------+-----------------------+-----------------------+
//|09-01-2020| 0| 0| 1|
//+----------+-----------------------+-----------------------+-----------------------+
//only showing top 1 row
Finally we select our bin columns and groupBy them and sum.
val binsDf = withBinsDf.select(bins.head, bins.tail:_*)
val sums = bins.map(b => sum(b).as(b)) // keep col name as is
val summedBinsDf = binsDf.groupBy().agg(sums.head, sums.tail:_*)
summedBinsDf.show
//+-----------------------+-----------------------+-----------------------+
//|01-01-1990 - 12-31-1999|01-01-2000 - 12-31-2009|01-01-2010 - 12-31-2020|
//+-----------------------+-----------------------+-----------------------+
//| 2450| 3653| 3897|
//+-----------------------+-----------------------+-----------------------+
2450 + 3653 + 3897 = 10000, so it seems our work was correct.
Perhaps I overdid it and there is a simpler solution, please let me know if you know a better way (especially to handle MM-dd-yyyy dates).
I have a pyspark dataframe where i am finding out min/max values and count of min/max values for each columns. I am able to select min/max values using:
df.select([min(col(c)).alias(c) for c in df.columns])
I want to have the count of min/max values as well in same dataframe.
Specific output I need:
...| col_n | col_m |...
...| xn | xm |... min(col(coln))
...| count(col_n==xn) | count(col_m==xm) |...
try this,
from pyspark.sql.functions import udf,struct,array
from pyspark.sql.window import Window
tst= sqlContext.createDataFrame([(1,7,2,11),(1,3,4,12),(1,5,6,13),(1,7,8,14),(2,9,10,15),(2,11,12,16),(2,13,14,17)],schema=['col1','col2','col3','col4'])
expr=[F.max(coln).alias(coln+'_max') for coln in tst.columns]
tst_mx = tst.select(*expr)
#%%
tst_dict = tst_mx.collect()[0].asDict()
#%%
expr1=( [F.count(F.when(F.col(coln)==tst_dict[coln+'_max'],F.col(coln))).alias(coln+'_max_count') for coln in tst.columns])
#%%
tst_res = tst.select(*(expr+expr1))
In the expr, I have just tried out for max function. you can scale this to other functions like min,mean etc., even use a list comprehension for the function list. Refer this answer for such a scaling : pyspark: groupby and aggregate avg and first on multiple columns
It is explained for aggregate, can also be done to select statement.