+-------+--------------------+-------+
|  brand|       category_code|  count|
+-------+--------------------+-------+
|samsung|electronics.smart...|1782386|
|  apple|electronics.smart...|1649525|
| xiaomi|electronics.smart...| 924383|
| huawei|electronics.smart...| 477946|
|   oppo|electronics.smart...| 242022|
|samsung|electronics.video.tv| 183988|
|  apple|electronics.audio...| 165277|
|   acer|  computers.notebook| 154599|
|  casio|  electronics.clocks| 141403|
+-------+--------------------+-------+
I want to select a value from the column brand corresponding to the max value of column count after performing a groupBy on column category_code. So in the first row for the group electronics.smartphone in column category_code I want string samsung from column brand because it has the highest value in the count column...
First groupBy to identify rows with the largest count for each category_code, then join with the original dataframe to retrieve brand value corresponding to max count:
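import pyspark.sql.functions as F
# groupBy finds the max count per category_code; the join recovers the matching brand
df1 = df.groupBy("category_code").agg(F.max("count").alias("count"))
df2 = df.join(df1, ["count", "category_code"]).drop("count")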
This will produce df2 as follows:
category_code         brand
-----------------------------
electronics.smart...  samsung
electronics.video.tv  samsung
electronics.audio     apple
computers.notebook    acer
electronics.clocks    casio
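The two steps above (max per group, then join back) can be traced in plain Python without a Spark session. This is only a sketch of the logic, not the Spark implementation: the function name top_brand_per_category is made up for illustration, and the truncated category strings are copied as they appear in the table.

```python
# Rows copied from the table in the question: (brand, category_code, count).
rows = [
    ("samsung", "electronics.smart...", 1782386),
    ("apple", "electronics.smart...", 1649525),
    ("xiaomi", "electronics.smart...", 924383),
    ("huawei", "electronics.smart...", 477946),
    ("oppo", "electronics.smart...", 242022),
    ("samsung", "electronics.video.tv", 183988),
    ("apple", "electronics.audio...", 165277),
    ("acer", "computers.notebook", 154599),
    ("casio", "electronics.clocks", 141403),
]


def top_brand_per_category(rows):
    """Hypothetical helper mirroring groupBy + max followed by a join."""
    # Step 1 (groupBy + max): largest count seen for each category_code.
    max_count = {}
    for _, category, count in rows:
        max_count[category] = max(count, max_count.get(category, count))
    # Step 2 (join on count and category_code): keep rows that hit the maximum.
    return [
        (category, brand)
        for brand, category, count in rows
        if count == max_count[category]
    ]


result = top_brand_per_category(rows)
for category, brand in result:
    print(category, brand)
```

Note that if two brands tie for the maximum count in a category, both rows survive, which is also how the join-based approach behaves.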
I have a dataframe from which I want to create a pivot table using two columns. I'm pivoting on the question header column, whose values become columns such as age and age_numeric,
and the answer header holds the values. I collect the answer header values into a list with collect_list, but I want a column like age_numeric to be a list of ints and the column age to be a list of strings, depending on the question type column. Whatever I try, I always get a list of strings. Any idea how to solve this?
this is the code
y = (output.groupby("sessionId").pivot("questionHeader")
     .agg(collect_list(
         when(col("questionType") == "numericAnswer", col("answerHeader").cast("float"))
         .when(col("questionType") != "numericAnswer", col("answerHeader")))))
this is what i get
| session id | Age | Age_numeric
| 1 | ["20-25 years"] | ["20"]
| 3 | ["20-25 years"] | ["20"]
This is what i want
| session id | Age | Age_numeric
| 1 | ["20-25 years"] | [20]
| 3 | ["20-25 years"] | [20]
If you want the output shown in the last two rows, you do not need a pivot; just groupBy and apply collect_list to each of the two columns. To get a list of integers for Age_numeric, apply .cast("array<int>") to the collected list, or change the type of the Age_numeric column before calling collect_list().
Replicate the data
import pyspark.sql.functions as F
data = [(1, "20-25 years", "20"), (3, "20-25 years", "20")]
df = spark.createDataFrame(data, schema=["session_id", "Age", "Age_numeric"])
Replicate the output
df_out = (df.groupBy("session_id")
          .agg(F.collect_list("Age").alias("Age"),
               F.collect_list("Age_numeric")
                .cast("array<int>")  # cast the collected strings to integers
                .alias("Age_numeric")))
I've looked into my job and have identified that I do indeed have a skewed task. How do I determine what the actual value is inside this task that is causing the skew?
My Python Transforms code looks like this:
from transforms.api import Input, Output, transform
@transform(
...
)
def my_compute_function(...):
...
df = df.join(df_2, ["joint_col"])
...
Theory
Skew problems originate from anything that causes an exchange in your job. Things that cause exchanges include but are not limited to: joins, windows, groupBys.
These operations result in data movement across your Executors based upon the found values inside the DataFrames used. This means that when a used DataFrame has many repeated values on the column dictating the exchange, those rows all end up in the same task, thus increasing its size.
Example
Let's consider the following example distribution of data for your join:
DataFrame 1 (df1)
| col_1 | col_2 |
|-------|-------|
| key_1 | 1 |
| key_1 | 2 |
| key_1 | 3 |
| key_1 | 1 |
| key_1 | 2 |
| key_2 | 1 |
DataFrame 2 (df2)
| col_1 | col_2 |
|-------|-------|
| key_1 | 1 |
| key_1 | 2 |
| key_1 | 3 |
| key_1 | 1 |
| key_2 | 2 |
| key_3 | 1 |
These DataFrames when joined together on col_1 will have the following data distributed across the executors:
Task 1:
Receives: 5 rows of key_1 from df1
Receives: 4 rows of key_1 from df2
Total Input: 9 rows of data sent to task_1
Result: 5 * 4 = 20 rows of output data
Task 2:
Receives: 1 row of key_2 from df1
Receives: 1 row of key_2 from df2
Total Input: 2 rows of data sent to task_2
Result: 1 * 1 = 1 rows of output data
Task 3:
Receives: 1 row of key_3 from df2
Total Input: 1 rows of data sent to task_3
Result: 1 * 0 = 0 rows of output data (missed key; no key found in df1)
If you therefore look at the counts of input and output rows per task, you'll see that Task 1 has far more data than the others. This task is skewed.
Identification
The question now becomes how we identify that key_1 is the culprit of the skew since this isn't visible in Spark (the underlying engine powering the join).
If we look at the above example, we see that all we need to know is the actual row counts per key of the join column. This means we can:
Aggregate each side of the join on the join key and count the rows per key
Multiply the counts of each side of the join to determine the output row counts
The easiest way to do this is by opening the Analysis (Contour) tool in Foundry and performing the following analysis:
Add df1 as input to a first path
Add Pivot Table board, using col_1 as the rows, and Row count as the aggregate
Click the ⇄ Switch to pivoted data button
Use the Multi-Column Editor board to keep only col_1 and the COUNT column. Prefix each of them with df1_, resulting in an output from the path which is only df1_col_1 and df1_COUNT.
Add df2 as input to a second path
Add Pivot Table board, again using col_1 as the rows, and Row count as the aggregate
Click the ⇄ Switch to pivoted data button
Use the Multi-Column Editor board to keep only col_1 and the COUNT column. Prefix each of them with df2_, resulting in an output from the path which is only df2_col_1 and df2_COUNT.
Create a third path, using the result of the first path (df1_col_1 and df1_COUNT)
Add a Join board, making the right side of the join the result of the second path (df2_col_1 and df2_COUNT). Ensure the join type is Full join
Add all columns from the right side (you don't need to add a prefix, all the columns are unique)
Configure the join board to join on df1_col_1 equals df2_col_1
Add an Expression board to create a new column, output_row_count which multiplies the two COUNT columns together
Add a Sort board that sorts on output_row_count descending
If you now preview the resultant data, you will have a list of keys from both sides of the join, sorted so that the keys causing the skew appear first.
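The same analysis (count rows per key on each side, multiply the counts, sort descending) can be checked in plain Python on the example data. This is a minimal sketch that needs no Spark or Foundry; the function name estimate_join_output is made up for illustration, and only the join-key values of df1 and df2 are used.

```python
from collections import Counter

# Join-key values (col_1) from the example tables df1 and df2 above.
df1_keys = ["key_1"] * 5 + ["key_2"]
df2_keys = ["key_1"] * 4 + ["key_2", "key_3"]


def estimate_join_output(left_keys, right_keys):
    """Return (key, left_count, right_count, output_rows), largest output first."""
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    report = []
    # Taking the union of keys mirrors the full join used in the analysis,
    # so keys present on only one side still show up (with a count of 0).
    for key in sorted(set(left_counts) | set(right_counts)):
        left_n = left_counts[key]
        right_n = right_counts[key]
        report.append((key, left_n, right_n, left_n * right_n))
    report.sort(key=lambda row: row[3], reverse=True)
    return report


skew_report = estimate_join_output(df1_keys, df2_keys)
for row in skew_report:
    print(row)
```

The first entry of the report is the key that produces the most output rows, which is the key to investigate for skew.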
I have a PySpark DF, with ID and Date column, looking like this.
| ID | Date       |
|----|------------|
| 1  | 2021-10-01 |
| 2  | 2021-10-01 |
| 1  | 2021-10-02 |
| 3  | 2021-10-02 |
I want to count the number of unique IDs that did not exist in the date one day before. So, here the result would be 1 as there is only one new unique ID in 2021-10-02.
| ID | Date       | Count |
|----|------------|-------|
| 1  | 2021-10-01 | -     |
| 2  | 2021-10-01 | -     |
| 1  | 2021-10-02 | 1     |
| 3  | 2021-10-02 | 1     |
I tried following this solution but it does not work on date type value. Any help would be highly appreciated.
Thank you!
If you want to avoid a self-join (e.g. for performance reasons), you could work with Window functions:
from pyspark.sql import Row, Window
import pyspark.sql.functions as F
import datetime
df = spark.createDataFrame([
Row(ID=1, date=datetime.date(2021,10,1)),
Row(ID=2, date=datetime.date(2021,10,1)),
Row(ID=1, date=datetime.date(2021,10,2)),
Row(ID=2, date=datetime.date(2021,10,2)),
Row(ID=1, date=datetime.date(2021,10,3)),
Row(ID=3, date=datetime.date(2021,10,3)),
])
First, add the number of days since an ID was last seen (this will be None if the ID never appeared before):
df = df.withColumn('days_since_last_occurrence', F.datediff('date', F.lag('date').over(Window.partitionBy('ID').orderBy('date'))))
Second, we add a column marking rows where this number of days is not 1. We add a 1 into this column so that we can later sum over this column to count the rows
df = df.withColumn('is_new', F.when(F.col('days_since_last_occurrence') == 1, None).otherwise(1))
Now we do the sum of all rows with the same date and then remove the column we do not require anymore:
(
df
.withColumn('count', F.sum('is_new').over(Window.partitionBy('date'))) # sum over all rows with the same date
.drop('is_new', 'days_since_last_occurrence')
.sort('date', 'ID')
.show()
)
# Output:
+---+----------+-----+
| ID|      date|count|
+---+----------+-----+
|  1|2021-10-01|    2|
|  2|2021-10-01|    2|
|  1|2021-10-02| null|
|  2|2021-10-02| null|
|  1|2021-10-03|    1|
|  3|2021-10-03|    1|
+---+----------+-----+
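As a cross-check of the logic, here is a plain-Python sketch that counts, per date, the IDs that were not present on the previous calendar day. It is not the Spark code above: the function name new_ids_per_day is made up for illustration, and a date with no new IDs yields 0 here, where the window version shows null.

```python
import datetime
from collections import defaultdict

# Same sample rows as the DataFrame above: (ID, date).
rows = [
    (1, datetime.date(2021, 10, 1)),
    (2, datetime.date(2021, 10, 1)),
    (1, datetime.date(2021, 10, 2)),
    (2, datetime.date(2021, 10, 2)),
    (1, datetime.date(2021, 10, 3)),
    (3, datetime.date(2021, 10, 3)),
]


def new_ids_per_day(rows):
    """Map each date to the number of IDs absent on the previous calendar day."""
    ids_by_date = defaultdict(set)
    for id_, day in rows:
        ids_by_date[day].add(id_)
    counts = {}
    for day, ids in ids_by_date.items():
        # .get avoids creating an empty entry for a missing previous day.
        previous_ids = ids_by_date.get(day - datetime.timedelta(days=1), set())
        counts[day] = len(ids - previous_ids)
    return counts


counts = new_ids_per_day(rows)
for day in sorted(counts):
    print(day, counts[day])
```

On the first date every ID counts as new because there is no earlier day to compare against, which matches the count of 2 in the output above.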
Collect the set of IDs for the current day and for the previous day, then take the size of the difference between the two sets to get the final result.
Updated with a solution that avoids the join:
df = df.select('date', F.expr('collect_set(id) over (partition by date) as id_arr')).dropDuplicates() \
.select('*', F.expr('size(array_except(id_arr, lag(id_arr,1,id_arr) over (order by date))) as count')) \
.select(F.explode('id_arr').alias('id'), 'date', 'count')
df.show(truncate=False)
I have a DataFrame with columns "id" and "date". The date is in yyyy-mm-dd format. Here is an example:
+---------+----------+
|  item_id|        ds|
+---------+----------+
| 25867869|2018-05-01|
| 17190474|2018-01-02|
| 19870756|2018-01-02|
|172248680|2018-07-29|
| 41148162|2018-03-01|
+---------+----------+
I want to create a new column in which each date is associated with an integer starting from 1, such that the smallest (earliest) date gets 1, the next (second-earliest) date gets 2, and so on.
I want my DataFrame to look like this... :
+---------+----------+---------+
|  item_id|        ds|   number|
+---------+----------+---------+
| 25867869|2018-05-01|        3|
| 17190474|2018-01-02|        1|
| 19870756|2018-01-02|        1|
|172248680|2018-07-29|        4|
| 41148162|2018-03-01|        2|
+---------+----------+---------+
Explanation:
2018-01-02 is the earliest date, so its number is 1. Since two rows share that date, 1 appears twice. The next date after 2018-01-02 is 2018-03-01, so its number is 2, and so on. How can I create such a column?
This can be achieved by dense_rank in Window functions.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val win = Window.orderBy(to_date(col("ds"),"yyyy-MM-dd").asc)
val df1 = df.withColumn("number", dense_rank() over win)
df1 will have the column number as you required.
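Note: to_date(col("ds"),"yyyy-MM-dd") makes the window order the values as dates rather than as strings. Zero-padded yyyy-MM-dd strings happen to sort in the same order, but other date formats would not, so the conversion is the safe choice.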
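To see what dense_rank produces on the sample data, here is a plain-Python sketch of the same ranking rule (equal dates share a number, and numbers have no gaps). It is an illustration of the semantics only, not Spark code; the function name dense_rank_by_date is made up.

```python
from datetime import date

# Sample rows from the question: (item_id, ds).
rows = [
    (25867869, "2018-05-01"),
    (17190474, "2018-01-02"),
    (19870756, "2018-01-02"),
    (172248680, "2018-07-29"),
    (41148162, "2018-03-01"),
]


def dense_rank_by_date(rows):
    """Attach a 1-based dense rank of ds (earliest date gets 1) to every row."""
    # Parse to real dates so the ordering does not depend on string format.
    distinct_dates = sorted({date.fromisoformat(ds) for _, ds in rows})
    rank_of = {d: rank for rank, d in enumerate(distinct_dates, start=1)}
    return [
        (item_id, ds, rank_of[date.fromisoformat(ds)])
        for item_id, ds in rows
    ]


ranked = dense_rank_by_date(rows)
for row in ranked:
    print(row)
```

The resulting numbers match the expected column in the question: 3, 1, 1, 4, 2.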
You could write a function that fetches the oldest row that has no number yet, something like:
SELECT * FROM tablename WHERE number IS NULL ORDER BY ds ASC
then make another query where you get the greatest number:
SELECT * FROM tablename ORDER BY number DESC
Then, if both queries return the same date, update the table with the same number:
UPDATE tablename SET number = 'greatest number from second query' WHERE ds = 'the date from first query'
or, if the dates are different, do the same but add 1 to the number:
UPDATE tablename SET number = 'greatest number from second query' + 1 WHERE ds = 'the date from first query'
To make this work, you should first assign the number 1 to the oldest entry.
You should do this in a loop until the first query (which checks whether any number is still unset) returns nothing.
The first query assumes that the empty column is all NULL; if that is not the case, change the WHERE condition to match however an empty value is represented.
I'm trying to get the frequency of distinct values in a Spark dataframe column, something like "value_counts" from Python Pandas. By frequency I mean the highest-occurring values in a table column (the rank 1 value, rank 2, rank 3, etc.). In the expected output, 1 has occurred 9 times in column a, so it has the topmost frequency.
I'm using Spark SQL but it is not working out, maybe because the reduce operation I have written is wrong.
**Pandas Example**
value_counts().index[1]
**Current Code in Spark**
val x= parquetRDD_subset.schema.fieldNames
val dfs = x.map(field => spark.sql
(s"select 'ParquetRDD' as TableName,
'$field' as column,
min($field) as min, max($field) as max,
SELECT number_cnt FROM (SELECT $field as value,
approx_count_distinct($field) as number_cnt FROM peopleRDDtable
group by $field) as frequency from peopleRDDtable"))
val withSum = dfs.reduce((x, y) => x.union(y)).distinct()
withSum.show()
The problem area is with query below.
SELECT number_cnt FROM (SELECT $field as value,
approx_count_distinct($field) as number_cnt FROM peopleRDDtable
group by $field)
**Expected output**
TableName | column | min | max | frequency1 |
_____________+_________+______+_______+____________+
ParquetRDD | a | 1 | 30 | 9 |
_____________+_________+______+_______+____________+
ParquetRDD | b | 2 | 21 | 5 |
How do I solve this? Please help.
I solved the issue by using count($field) instead of approx_count_distinct($field), and then used the rank analytic function to pick the first-ranked value. It worked.
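The idea behind that fix (count rows per distinct value, then rank by the count) can be sketched in plain Python. This is an illustration only, not the Spark SQL that was used: the function name ranked_value_counts is made up, and the sample values are hypothetical apart from the fact that 1 occurs 9 times in column a.

```python
from collections import Counter

# Hypothetical values for column "a"; only "1 occurs 9 times",
# min 1 and max 30 come from the expected output in the question.
column_a = [1] * 9 + [30] * 3 + [7] * 2


def ranked_value_counts(values):
    """Return (rank, value, count) tuples, most frequent value first."""
    # Counter plays the role of count(field) ... group by field.
    counts = Counter(values)
    # most_common() orders by count descending; tied values would simply
    # get consecutive positions here, unlike SQL RANK.
    ordered = counts.most_common()
    return [
        (rank, value, count)
        for rank, (value, count) in enumerate(ordered, start=1)
    ]


ranking = ranked_value_counts(column_a)
top_rank, top_value, top_count = ranking[0]
print(top_value, top_count)
```

The first entry corresponds to the frequency1 column of the expected output: the count of the most frequent value.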