Selecting subset spark dataframe by months - pyspark

I have a dataset with a date column, and I want to take a three-month subset of it (e.g. the months April, May and August) using pyspark.
I haven't found anything yet that would let me filter this dataframe by month in pyspark.

You can extract the month using month() and then apply isin() to keep only the rows matching the filter criteria.
from pyspark.sql import functions as F

data = [(1, "2021-01-01"), (2, "2021-04-01"), (3, "2021-05-01"),
        (4, "2021-06-01"), (5, "2021-07-01"), (6, "2021-08-01")]
df = (spark.createDataFrame(data, ("cod_item", "date_emissao"))
      .withColumn("date_emissao", F.to_date("date_emissao")))

df.filter(F.month("date_emissao").isin(4, 5, 8)).show()
"""
+--------+------------+
|cod_item|date_emissao|
+--------+------------+
| 2| 2021-04-01|
| 3| 2021-05-01|
| 6| 2021-08-01|
+--------+------------+
"""

to_date gives null on format yyyyww (202001 and 202053)

I have a dataframe with a yearweek column that I want to convert to a date. The code I wrote seems to work for every week except for weeks '202001' and '202053'. Example:
from pyspark.sql import functions as F

df = spark.createDataFrame([
    (1, "202001"),
    (2, "202002"),
    (3, "202003"),
    (4, "202052"),
    (5, "202053")
], ['id', 'week_year'])

df.withColumn("date", F.to_date(F.col("week_year"), "yyyyw")).show()
I can't figure out what the error is or how to fix these weeks. How can I convert weeks 202001 and 202053 to a valid date?
Dealing with ISO week in Spark is indeed a headache - in fact this functionality was deprecated (removed?) in Spark 3. I think using Python datetime utilities within a UDF is a more flexible way to do this.
import datetime
import pyspark.sql.functions as F

@F.udf('date')
def week_year_to_date(week_year):
    # the '1' specifies the first day (Monday) of the ISO week
    return datetime.datetime.strptime(week_year + '1', '%G%V%u')

df = spark.createDataFrame([
    (1, "202001"),
    (2, "202002"),
    (3, "202003"),
    (4, "202052"),
    (5, "202053")
], ['id', 'week_year'])

df.withColumn("date", week_year_to_date('week_year')).show()
+---+---------+----------+
| id|week_year| date|
+---+---------+----------+
| 1| 202001|2019-12-30|
| 2| 202002|2020-01-06|
| 3| 202003|2020-01-13|
| 4| 202052|2020-12-21|
| 5| 202053|2020-12-28|
+---+---------+----------+
Based on mck's answer, this is the solution I ended up using for Python version 3.5.2:
import datetime
from dateutil.relativedelta import relativedelta
import pyspark.sql.functions as F

@F.udf('date')
def week_year_to_date(week_year):
    # the '1' specifies the first day of the week
    return datetime.datetime.strptime(week_year + '1', '%Y%W%w') - relativedelta(weeks=1)
df = spark.createDataFrame([
    (9, "201952"),
    (1, "202001"),
    (2, "202002"),
    (3, "202003"),
    (4, "202052"),
    (5, "202053")
], ['id', 'week_year'])

df.withColumn("date", week_year_to_date('week_year')).show()
Without the '%G%V%u' format codes added in Python 3.6, I had to subtract a week from the date to get the correct dates.
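As a plain-Python illustration of why that one-week offset appears (my own sketch, not Spark-specific):

import datetime

# '%G%V%u' (Python 3.6+) parses ISO year/week/day directly;
# '%Y%W%w' counts weeks from the first Monday of the calendar year,
# so the same input lands one week later and needs the correction above.
print(datetime.datetime.strptime("2020011", "%G%V%u"))  # 2019-12-30 00:00:00
print(datetime.datetime.strptime("2020011", "%Y%W%w"))  # 2020-01-06 00:00:00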
The following does not use a plain udf but instead a more efficient, vectorized pandas_udf:
import pandas as pd
import pyspark.sql.functions as F

@F.pandas_udf('date')
def week_year_to_date(week_year: pd.Series) -> pd.Series:
    return pd.to_datetime(week_year + '1', format='%G%V%u')

df.withColumn('date', week_year_to_date('week_year')).show()
# +---+---------+----------+
# | id|week_year| date|
# +---+---------+----------+
# | 1| 202001|2019-12-30|
# | 2| 202002|2020-01-06|
# | 3| 202003|2020-01-13|
# | 4| 202052|2020-12-21|
# | 5| 202053|2020-12-28|
# +---+---------+----------+

Weighted mean median quartiles in Spark

I have a Spark SQL dataframe:
id  Value  Weights
1   2      4
1   5      2
2   1      4
2   6      2
2   9      4
3   2      4
I need to group by 'id' and aggregate to get the weighted mean, median, and quartiles of the values per 'id'. What is the best way to do this?
Before the calculation you should do a small transformation to your Value column:
F.explode(F.array_repeat('Value', F.col('Weights').cast('int')))
array_repeat creates an array out of your number - the value will be repeated as many times as specified in the column 'Weights' (casting to int is necessary, because array_repeat expects this column to be of int type). After this step the first value of 2 will be transformed into [2, 2, 2, 2].
Then, explode will create a row for every element in the array, so the array [2, 2, 2, 2] becomes 4 rows, each containing the integer 2.
Then you can calculate statistics, the results will have weights applied, as your dataframe is now transformed according to the weights.
Full example:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 2, 4),
     (1, 5, 2),
     (2, 1, 4),
     (2, 6, 2),
     (2, 9, 4),
     (3, 2, 4)],
    ['id', 'Value', 'Weights']
)
df = df.select('id', F.explode(F.array_repeat('Value', F.col('Weights').cast('int'))))
df = (df
      .groupBy('id')
      .agg(F.mean('col').alias('weighted_mean'),
           F.expr('percentile(col, 0.5)').alias('weighted_median'),
           F.expr('percentile(col, 0.25)').alias('weighted_lower_quartile'),
           F.expr('percentile(col, 0.75)').alias('weighted_upper_quartile')))
df.show()
#+---+-------------+---------------+-----------------------+-----------------------+
#| id|weighted_mean|weighted_median|weighted_lower_quartile|weighted_upper_quartile|
#+---+-------------+---------------+-----------------------+-----------------------+
#| 1| 3.0| 2.0| 2.0| 4.25|
#| 2| 5.2| 6.0| 1.0| 9.0|
#| 3| 2.0| 2.0| 2.0| 2.0|
#+---+-------------+---------------+-----------------------+-----------------------+
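If you only need the weighted mean, you can cross-check the exploded approach against the direct formula sum(Value * Weights) / sum(Weights). A minimal sketch, assuming the original dataframe (before the explode) is still available under a hypothetical name df_orig:

# Direct weighted mean without exploding rows (sanity check only).
check = (df_orig
         .groupBy('id')
         .agg((F.sum(F.col('Value') * F.col('Weights')) / F.sum('Weights'))
              .alias('weighted_mean_check')))
check.show()
# id 1: (2*4 + 5*2) / (4 + 2) = 3.0, matching weighted_mean above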

pyspark join two rdds and flatten the results

Environment is pyspark, Spark version 2.2.
We have two rdds, test1 and test2; below is sample data:
test1 = [('a', 20), ('b', 10), ('c', 2)]
test2 = [('a', 2), ('b', 3)]
Now we want to generate output1 as below; any help is appreciated.
[('a', 20, 2), ('b', 10, 3)]
You can accomplish this with a simple join followed by a call to map to flatten the values.
test1.join(test2).map(lambda kv: (kv[0],) + kv[1]).collect()
#[('a', 20, 2), ('b', 10, 3)]
To explain, the result of the join is the following:
test1.join(test2).collect()
#[('a', (20, 2)), ('b', (10, 3))]
This is almost the desired output, but you want to flatten the results. We can accomplish this by calling map and returning a new tuple with the desired format. The syntax (kv[0],) creates a one-element tuple containing just the key, which we then concatenate with the tuple of values.
You can also use the DataFrame API, by using pyspark.sql.DataFrame.toDF() to convert your RDDs to DataFrames:
test1.toDF(["key", "value1"]).join(test2.toDF(["key", "value2"]), on="key").show()
#+---+------+------+
#|key|value1|value2|
#+---+------+------+
#| b| 10| 3|
#| a| 20| 2|
#+---+------+------+
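If you need the flattened result back as an RDD of plain tuples rather than a DataFrame, one option (my own sketch, not part of the original answer) is to convert the joined DataFrame's rows back into tuples:

flat = (test1.toDF(["key", "value1"])
             .join(test2.toDF(["key", "value2"]), on="key")
             .rdd.map(tuple))   # Row is a tuple subclass, so tuple(row) yields its values
flat.collect()
#[('b', 10, 3), ('a', 20, 2)]  (row order may vary)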

Problem in converting MS-SQL Query to spark SQL

I want to convert this basic SQL query to Spark:
select Grade, count(*) * 100.0 / sum(count(*)) over()
from StudentGrades
group by Grade
I have tried using window functions in Spark like this:
val windowSpec = Window.rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df1.select($"Arrest")
  .groupBy($"Arrest")
  .agg(sum(count("*")) over windowSpec, count("*"))
  .show()
+------+------------------------------------------------------------------------------+--------+
|Arrest|sum(count(1)) OVER (RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)|count(1)|
+------+------------------------------------------------------------------------------+--------+
|  true|                                                                        665517|  184964|
| false|                                                                        665517|  480553|
+------+------------------------------------------------------------------------------+--------+
But when I try dividing by count(*) it throws an error:
df1.select($"Arrest")
  .groupBy($"Arrest")
  .agg(count("*") / sum(count("*")) over windowSpec, count("*"))
  .show()
It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query.;;
My question is: since I'm already using count() inside sum() in the first query without getting any error about using an aggregate function inside another aggregate function, why do I get the error in the second one?
An example:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

val df = sc.parallelize(Seq(
  ("A", "X", 2, 100), ("A", "X", 7, 100), ("B", "X", 10, 100),
  ("C", "X", 1, 100), ("D", "X", 50, 100), ("E", "X", 30, 100)
)).toDF("c1", "c2", "Val1", "Val2")

val df2 = df
  .groupBy("c1")
  .agg(sum("Val1").alias("sum"))
  .withColumn("fraction", col("sum") / sum("sum").over())
df2.show
You will need to tailor this to your own situation, e.g. count instead of sum, as follows:
val df2 = df
  .groupBy("c1")
  .agg(count("*"))
  .withColumn("fraction", col("count(1)") / sum("count(1)").over())
returning:
+---+--------+-------------------+
| c1|count(1)| fraction|
+---+--------+-------------------+
| E| 1|0.16666666666666666|
| B| 1|0.16666666666666666|
| D| 1|0.16666666666666666|
| C| 1|0.16666666666666666|
| A| 2| 0.3333333333333333|
+---+--------+-------------------+
You can multiply by 100 to get the percentage. I note that the alias does not seem to carry through for count the way it does for sum, so I worked around this and left the comparison above. Again, you will need to tailor this to your specifics; it is part of my general modules for research and such.
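For reference, since the rest of this page is pyspark, the same percentage-of-total idea in PySpark would look roughly like this (a sketch, assuming a dataframe df with a Grade column as in the original query):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# An empty window spec spans the whole result, like OVER () in SQL.
w = Window.partitionBy()
result = (df.groupBy('Grade')
            .agg(F.count('*').alias('cnt'))
            .withColumn('pct', F.col('cnt') * 100.0 / F.sum('cnt').over(w)))
result.show()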

Multiplying two columns from different data frames in spark

I have two dataframes representing the following csv data:
Store  Date        Weekly_Sales
1      05/02/2010  249
2      12/02/2010  455
3      19/02/2010  415
4      26/02/2010  194

Store  Date        Weekly_Sales
5      05/02/2010  400
6      12/02/2010  460
7      19/02/2010  477
8      26/02/2010  345
What I'm attempting to do is, for each date, read the associated weekly sales in both dataframes and find the average of the two numbers. I'm not sure how to accomplish this.
Assuming that you want to keep the individual store data in the result set, one approach would be to union the two dataframes and use a Window function to calculate the average weekly sales (along with the corresponding list of stores, if wanted), as follows:
val df1 = Seq(
  (1, "05/02/2010", 249),
  (2, "12/02/2010", 455),
  (3, "19/02/2010", 415),
  (4, "26/02/2010", 194)
).toDF("Store", "Date", "Weekly_Sales")

val df2 = Seq(
  (5, "05/02/2010", 400),
  (6, "12/02/2010", 460),
  (7, "19/02/2010", 477),
  (8, "26/02/2010", 345)
).toDF("Store", "Date", "Weekly_Sales")

import org.apache.spark.sql.expressions.Window

val window = Window.partitionBy($"Date")

df1.union(df2).
  withColumn("Avg_Sales", avg($"Weekly_Sales").over(window)).
  withColumn("Store_List", collect_list($"Store").over(window)).
  orderBy($"Date", $"Store").
  show
// +-----+----------+------------+---------+----------+
// |Store| Date|Weekly_Sales|Avg_Sales|Store_List|
// +-----+----------+------------+---------+----------+
// | 1|05/02/2010| 249| 324.5| [1, 5]|
// | 5|05/02/2010| 400| 324.5| [1, 5]|
// | 2|12/02/2010| 455| 457.5| [2, 6]|
// | 6|12/02/2010| 460| 457.5| [2, 6]|
// | 3|19/02/2010| 415| 446.0| [3, 7]|
// | 7|19/02/2010| 477| 446.0| [3, 7]|
// | 4|26/02/2010| 194| 269.5| [4, 8]|
// | 8|26/02/2010| 345| 269.5| [4, 8]|
// +-----+----------+------------+---------+----------+
You should first merge them using the union function, then group by the Date column and find the average (using the built-in avg function) as follows:
import org.apache.spark.sql.functions._

df1.union(df2)
  .groupBy("Date")
  .agg(collect_list("Store").as("Stores"), avg("Weekly_Sales").as("average_weekly_sales"))
  .show(false)
which should give you
+----------+------+--------------------+
|Date |Stores|average_weekly_sales|
+----------+------+--------------------+
|26/02/2010|[4, 8]|269.5 |
|12/02/2010|[2, 6]|457.5 |
|19/02/2010|[3, 7]|446.0 |
|05/02/2010|[1, 5]|324.5 |
+----------+------+--------------------+
I hope the answer is helpful
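Since the rest of this page is pyspark, the same union-and-average would look roughly like this in PySpark (a sketch, assuming df1 and df2 are the PySpark equivalents of the two dataframes above):

from pyspark.sql import functions as F

(df1.union(df2)
    .groupBy('Date')
    .agg(F.collect_list('Store').alias('Stores'),
         F.avg('Weekly_Sales').alias('average_weekly_sales'))
    .show(truncate=False))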