Multiplying two columns from different data frames in spark - scala

I have two dataframes representing the following csv data:
Store Date Weekly_Sales
1 05/02/2010 249
2 12/02/2010 455
3 19/02/2010 415
4 26/02/2010 194
Store Date Weekly_Sales
5 05/02/2010 400
6 12/02/2010 460
7 19/02/2010 477
8 26/02/2010 345
What I'm attempting to do is, for each date, read the associated weekly sales from both dataframes and find the average of the two numbers. I'm not sure how to accomplish this.

Assuming that you want to keep the individual store rows in the result, one approach is to union the two dataframes and use a Window function to calculate the average weekly sales per date (along with the corresponding list of stores, if wanted), as follows:
val df1 = Seq(
(1, "05/02/2010", 249),
(2, "12/02/2010", 455),
(3, "19/02/2010", 415),
(4, "26/02/2010", 194)
).toDF("Store", "Date", "Weekly_Sales")
val df2 = Seq(
(5, "05/02/2010", 400),
(6, "12/02/2010", 460),
(7, "19/02/2010", 477),
(8, "26/02/2010", 345)
).toDF("Store", "Date", "Weekly_Sales")
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, collect_list}
val window = Window.partitionBy($"Date")
df1.union(df2).
withColumn("Avg_Sales", avg($"Weekly_Sales").over(window)).
withColumn("Store_List", collect_list($"Store").over(window)).
orderBy($"Date", $"Store").
show
// +-----+----------+------------+---------+----------+
// |Store| Date|Weekly_Sales|Avg_Sales|Store_List|
// +-----+----------+------------+---------+----------+
// | 1|05/02/2010| 249| 324.5| [1, 5]|
// | 5|05/02/2010| 400| 324.5| [1, 5]|
// | 2|12/02/2010| 455| 457.5| [2, 6]|
// | 6|12/02/2010| 460| 457.5| [2, 6]|
// | 3|19/02/2010| 415| 446.0| [3, 7]|
// | 7|19/02/2010| 477| 446.0| [3, 7]|
// | 4|26/02/2010| 194| 269.5| [4, 8]|
// | 8|26/02/2010| 345| 269.5| [4, 8]|
// +-----+----------+------------+---------+----------+

You should first merge the two dataframes with the union function. Then group by the Date column and compute the average with the built-in avg function:
import org.apache.spark.sql.functions._
df1.union(df2)
.groupBy("Date")
.agg(collect_list("Store").as("Stores"), avg("Weekly_Sales").as("average_weekly_sales"))
.show(false)
which should give you
+----------+------+--------------------+
|Date |Stores|average_weekly_sales|
+----------+------+--------------------+
|26/02/2010|[4, 8]|269.5 |
|12/02/2010|[2, 6]|457.5 |
|19/02/2010|[3, 7]|446.0 |
|05/02/2010|[1, 5]|324.5 |
+----------+------+--------------------+
I hope the answer is helpful
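The union-then-group logic used in both answers can be checked outside Spark. Below is a minimal plain-Python sketch of it, not Spark code; the names `df1_rows`, `df2_rows` and `average_by_date` are hypothetical and only the sample data comes from the question.

```python
from collections import defaultdict

# Rows mirror the two sample dataframes: (Store, Date, Weekly_Sales).
df1_rows = [(1, "05/02/2010", 249), (2, "12/02/2010", 455),
            (3, "19/02/2010", 415), (4, "26/02/2010", 194)]
df2_rows = [(5, "05/02/2010", 400), (6, "12/02/2010", 460),
            (7, "19/02/2010", 477), (8, "26/02/2010", 345)]


def average_by_date(*tables):
    """Union the tables, group by Date, and average Weekly_Sales (hypothetical helper)."""
    sales = defaultdict(list)
    stores = defaultdict(list)
    for table in tables:  # iterating over every table is the "union" step
        for store, date, weekly_sales in table:
            sales[date].append(weekly_sales)
            stores[date].append(store)
    # One entry per date: (list of stores, average weekly sales).
    return {date: (stores[date], sum(v) / len(v)) for date, v in sales.items()}


result = average_by_date(df1_rows, df2_rows)
for date, (store_list, avg_sales) in result.items():
    print(date, store_list, avg_sales)
```

The averages agree with the Avg_Sales and average_weekly_sales columns shown above, for example 324.5 for 05/02/2010.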

Related

How to create a column with the maximum number in each row of another column in PySpark?

I have a PySpark dataframe, each row of the column 'TAGID_LIST' is a set of numbers such as {426,427,428,430,432,433,434,437,439,447,448,450,453,460,469,469,469,469}, but I only want to keep the maximum number in each set, 469 for this row. I tried to create a new column with:
wechat_userinfo.withColumn('TAG', f.when(wechat_userinfo['TAGID_LIST'] != 'null', max(wechat_userinfo['TAGID_LIST'])).otherwise('null'))
but got TypeError: Column is not iterable.
How do I correct it?
If the column for which you want to retrieve the max value is an array, you can use the array_max function:
import pyspark.sql.functions as F
new_df = wechat_userinfo.withColumn("TAG", F.array_max(F.col("TAGID_LIST")))
To illustrate with an example,
df = spark.createDataFrame( [(1, [1, 772, 3, 4]), (2, [5, 6, 44, 8, 9])], ('a','d'))
df2 = df.withColumn("maxd", F.array_max(F.col("d")))
df2.show()
+---+----------------+----+
| a| d|maxd|
+---+----------------+----+
| 1| [1, 772, 3, 4]| 772|
| 2|[5, 6, 44, 8, 9]| 44|
+---+----------------+----+
In your particular case, the column in question is not an array of numbers but a string, formatted as comma-separated numbers surrounded by { and }. What I'd suggest is turning your string into an array and then operating on that array as described above. You can use the regexp_replace function to remove the braces, and then split() the comma-separated string into an array. It would look like this:
df = spark.createDataFrame( [(1, "{1,2,3,4}"), (2, "{5,6,7,8}")], ('a','d'))
# Parentheses let the method chain span several lines
df2 = (df
    # strip the leading { and the trailing }
    .withColumn("as_str", F.regexp_replace(F.col("d"), r'^\{|\}$', ''))
    .withColumn("as_arr", F.split(F.col("as_str"), ",").cast("array<long>"))
    .withColumn("maxd", F.array_max(F.col("as_arr"))).drop("as_str"))
df2.show()
+---+---------+------------+----+
| a| d| as_arr|maxd|
+---+---------+------------+----+
| 1|{1,2,3,4}|[1, 2, 3, 4]| 4|
| 2|{5,6,7,8}|[5, 6, 7, 8]| 8|
+---+---------+------------+----+
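The same strip, split and max steps can be tried on a single value in plain Python before wiring them into the dataframe. This is only a sketch of the logic, not PySpark code; `max_tag` is a hypothetical helper, and treating the literal string 'null' as missing is an assumption taken from the `!= 'null'` check in the question.

```python
import re


def max_tag(tagid_list):
    """Return the largest number in a string like '{1,2,3}', or None if there is none."""
    # Assumption: missing values arrive as None or as the literal string 'null'.
    if tagid_list is None or tagid_list == "null":
        return None
    # Same idea as regexp_replace: drop the leading '{' and the trailing '}'.
    stripped = re.sub(r"^\{|\}$", "", tagid_list)
    if not stripped:
        return None
    # Same idea as split(",") followed by a cast to numbers and array_max.
    return max(int(part) for part in stripped.split(","))


sample = "{426,427,428,430,432,433,434,437,439,447,448,450,453,460,469,469,469,469}"
print(max_tag(sample))
```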

Weighted mean median quartiles in Spark

I have a Spark SQL dataframe:
id Value Weights
1 2 4
1 5 2
2 1 4
2 6 2
2 9 4
3 2 4
I need to groupBy by 'id' and aggregate to get the weighted mean, median, and quartiles of the values per 'id'. What is the best way to do this?
Before the calculation you should do a small transformation to your Value column:
F.explode(F.array_repeat('Value', F.col('Weights').cast('int')))
array_repeat creates an array out of your number: the number inside the array is repeated as many times as specified in the column 'Weights' (casting to int is necessary because array_repeat expects the count to be of int type). After this step, the first value of 2 is transformed into [2,2,2,2].
Then, explode will create a row for every element in the array. So, the line [2,2,2,2] will be transformed into 4 rows, each containing an integer 2.
You can then calculate the statistics. The results have the weights applied because each value now appears as many times as its weight.
Full example:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
[(1, 2, 4),
(1, 5, 2),
(2, 1, 4),
(2, 6, 2),
(2, 9, 4),
(3, 2, 4)],
['id', 'Value', 'Weights']
)
df = df.select('id', F.explode(F.array_repeat('Value', F.col('Weights').cast('int'))))
df = (df
.groupBy('id')
.agg(F.mean('col').alias('weighted_mean'),
F.expr('percentile(col, 0.5)').alias('weighted_median'),
F.expr('percentile(col, 0.25)').alias('weighted_lower_quartile'),
F.expr('percentile(col, 0.75)').alias('weighted_upper_quartile')))
df.show()
#+---+-------------+---------------+-----------------------+-----------------------+
#| id|weighted_mean|weighted_median|weighted_lower_quartile|weighted_upper_quartile|
#+---+-------------+---------------+-----------------------+-----------------------+
#| 1| 3.0| 2.0| 2.0| 4.25|
#| 2| 5.2| 6.0| 1.0| 9.0|
#| 3| 2.0| 2.0| 2.0| 2.0|
#+---+-------------+---------------+-----------------------+-----------------------+
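To see where a figure like 4.25 comes from, here is a plain-Python sketch of the same expand-then-aggregate idea. It is not Spark code; `expand` and `percentile` are hypothetical helpers, and the linear interpolation between neighbouring ranks is my understanding of how Spark's exact `percentile` behaves, so treat that part as an assumption.

```python
import math


def expand(values_with_weights):
    """Repeat each value as many times as its (integer) weight, like array_repeat + explode."""
    expanded = []
    for value, weight in values_with_weights:
        expanded.extend([value] * int(weight))
    return expanded


def percentile(values, p):
    """Exact percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * p
    lower, upper = math.floor(position), math.ceil(position)
    if lower == upper:
        return float(ordered[lower])
    return (upper - position) * ordered[lower] + (position - lower) * ordered[upper]


# (Value, Weights) pairs for id 1 and id 2 from the example dataframe.
id1 = expand([(2, 4), (5, 2)])          # [2, 2, 2, 2, 5, 5]
id2 = expand([(1, 4), (6, 2), (9, 4)])

print(sum(id1) / len(id1), percentile(id1, 0.5), percentile(id1, 0.25), percentile(id1, 0.75))
print(sum(id2) / len(id2), percentile(id2, 0.5), percentile(id2, 0.25), percentile(id2, 0.75))
```

Two caveats follow directly from the approach: the cast to int truncates fractional weights, and large weights multiply the number of rows, so this suits small integer weights best.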

How to find sum of arrays in a column which is grouped by another column values in a spark dataframe using scala

I have a dataframe like below
c1 Value
A Array[47,97,33,94,6]
A Array[59,98,24,83,3]
A Array[77,63,93,86,62]
B Array[86,71,72,23,27]
B Array[74,69,72,93,7]
B Array[58,99,90,93,41]
C Array[40,13,85,75,90]
C Array[39,13,33,29,14]
C Array[99,88,57,69,49]
I need an output as below.
c1 Value
A Array[183,258,150,263,71]
B Array[218,239,234,209,75]
C Array[178,114,175,173,153]
In other words, I want to group by column c1 and sum the arrays in column Value element by element.
Please help; I couldn't find a way to do this by searching Google.
It is not very complicated. As you mention, you can simply group by "c1" and aggregate the array values index by index.
Let's first generate some data:
val df = spark.range(6)
.select('id % 3 as "c1",
array((1 to 5).map(_ => floor(rand * 10)) : _*) as "Value")
df.show()
+---+---------------+
| c1| Value|
+---+---------------+
| 0|[7, 4, 7, 4, 0]|
| 1|[3, 3, 2, 8, 5]|
| 2|[2, 1, 0, 4, 4]|
| 0|[0, 4, 2, 1, 8]|
| 1|[1, 5, 7, 4, 3]|
| 2|[2, 5, 0, 2, 2]|
+---+---------------+
Then we need to iterate over the values of the array so as to aggregate them. It is very similar to the way we created them:
// Pick one of the two definitions of n:
val n = 5 // if you know the size of the array
// val n = df.select(size('Value)).first.getAs[Int](0) // if you do not
df
.groupBy("c1")
.agg(array((0 until n).map(i => sum(col("Value").getItem(i))) :_* ) as "Value")
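.show()
// Note: the sums below do not match the sample rows printed above, so this output appears to come from a different random draw.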
+---+------------------+
| c1| Value|
+---+------------------+
| 0|[11, 18, 15, 8, 9]|
| 1| [2, 10, 5, 7, 4]|
| 2|[7, 14, 15, 10, 4]|
+---+------------------+
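Because the sample output above comes from random data, it is hard to check by eye. The plain-Python sketch below applies the same index-by-index sum to the data from the question instead. It is not Spark code, and `sum_arrays_by_key` is a hypothetical helper.

```python
# (c1, Value) rows copied from the question.
rows = [
    ("A", [47, 97, 33, 94, 6]), ("A", [59, 98, 24, 83, 3]), ("A", [77, 63, 93, 86, 62]),
    ("B", [86, 71, 72, 23, 27]), ("B", [74, 69, 72, 93, 7]), ("B", [58, 99, 90, 93, 41]),
    ("C", [40, 13, 85, 75, 90]), ("C", [39, 13, 33, 29, 14]), ("C", [99, 88, 57, 69, 49]),
]


def sum_arrays_by_key(rows):
    """Group by key and add the arrays position by position."""
    totals = {}
    for key, values in rows:
        if key not in totals:
            totals[key] = list(values)  # first array seen for this key
        else:
            # Same idea as sum(col("Value").getItem(i)) for every index i.
            totals[key] = [a + b for a, b in zip(totals[key], values)]
    return totals


totals = sum_arrays_by_key(rows)
for key, summed in totals.items():
    print(key, summed)
```

The result matches the output the question asks for. Like the Spark version, this assumes every array in a group has the same length.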

scala how to drop lines from df based on the column value

I have a DataFrame with the values below. For each (id, count) group I need to drop the row with the minimum date, and the summary of the row that remains should change from equal to more.
id secid count date summary
1 2 9 20170608 equal
1 3 9 20160608 equal
2 3 8 20170608 less
3 3 9 20160608 equal
I need to show
id secid count date summary
1 2 9 20170608 more
2 3 8 20170608 less
3 3 9 20160608 equal
You can use groupBy to group by id and count, and then use when and otherwise to change the summary field to more when there is more than one date for the same id and count.
//create your original DF
val df = Seq((1, 2, 9, 20170608, "equal"),
(1, 3, 9, 20160608, "equal"),
(2, 3, 8, 20170608, "less"),
(3, 3, 9, 20160608, "equal"),
(1, 2, 8, 20170608, "random"),
(1, 2, 8, 20170608, "random"))
.toDF("id", "secid", "count", "date", "summary")
//Create a UDF to find the length of datelist after grouping
val isMoreThanOne = udf((lst: Seq[Int], summary: String) => lst.size > 1 && summary.equals("equal"))
//apply groupby and other operations to get the result
df.groupBy("id", "count")
.agg(collect_list("date").as("datelist"),
max("date").as("date"),
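// note: first() takes whichever row Spark sees first in the group, not necessarily the max-date row
first("secid").as("secid"),
first("summary").as("summary"))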
.withColumn("summary",
when(isMoreThanOne($"datelist", $"summary"), "more").otherwise($"summary"))
.drop("datelist")
.show()
// output
// +---+-----+--------+-----+-------+
// | id|count| date|secid|summary|
// +---+-----+--------+-----+-------+
// | 1| 9|20170608| 2| more|
// | 1| 8|20170608| 2| random|
// | 3| 9|20160608| 3| equal|
// | 2| 8|20170608| 3| less|
// +---+-----+--------+-----+-------+
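For comparison, here is a plain-Python sketch of the same rule applied to the four rows from the question. It is not Spark code and `keep_latest` is a hypothetical helper. One deliberate difference from the answer above: it takes secid and summary from the row with the latest date rather than from the first row of the group.

```python
from collections import defaultdict

# (id, secid, count, date, summary) rows from the question.
rows = [
    (1, 2, 9, 20170608, "equal"),
    (1, 3, 9, 20160608, "equal"),
    (2, 3, 8, 20170608, "less"),
    (3, 3, 9, 20160608, "equal"),
]


def keep_latest(rows):
    """Keep the latest-dated row per (id, count); relabel 'equal' as 'more' if rows were dropped."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row[0], row[2])].append(row)

    result = []
    for group in groups.values():
        # The row with the largest date survives; the others are dropped.
        id_, secid, count, date, summary = max(group, key=lambda r: r[3])
        if len(group) > 1 and summary == "equal":
            summary = "more"
        result.append((id_, secid, count, date, summary))
    return sorted(result)


latest = keep_latest(rows)
for row in latest:
    print(row)
```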

How to delete duplicated pairs of nodes in Spark?

I have the following DataFrame in Spark:
nodeFrom nodeTo value date
1 2 11 2016-10-12T12:10:00.000Z
1 2 12 2016-10-12T12:11:00.000Z
1 2 11 2016-10-12T12:09:00.000Z
4 2 34 2016-10-12T14:00:00.000Z
4 2 34 2016-10-12T14:00:00.000Z
5 3 11 2016-10-12T14:00:00.000Z
I need to delete duplicated pairs of nodeFrom and nodeTo, while keeping the earliest and latest date and the average of the corresponding values in the value column.
The expected output is the following one:
nodeFrom nodeTo value date
1 2 11.5 [2016-10-12T12:09:00.000Z,2016-10-12T12:11:00.000Z]
4 2 34 [2016-10-12T14:00:00.000Z]
5 3 11 [2016-10-12T14:00:00.000Z]
Using the struct function with min and max, only a single groupBy and agg step is necessary.
Assuming that this is your data:
val data = Seq(
(1, 2, 11, "2016-10-12T12:10:00.000Z"),
(1, 2, 12, "2016-10-12T12:11:00.000Z"),
(1, 2, 11, "2016-10-12T12:09:00.000Z"),
(4, 2, 34, "2016-10-12T14:00:00.000Z"),
(4, 2, 34, "2016-10-12T14:00:00.000Z"),
(5, 3, 11, "2016-10-12T14:00:00.000Z")
).toDF("nodeFrom", "nodeTo", "value", "date")
data.show()
You can get the average and the array with earliest/latest date as follows:
import org.apache.spark.sql.functions._
data
.groupBy('nodeFrom, 'nodeTo).agg(
min(struct('date, 'value)) as 'date1,
max(struct('date, 'value)) as 'date2
)
.select(
'nodeFrom, 'nodeTo,
($"date1.value" + $"date2.value") / 2.0d as 'value,
array($"date1.date", $"date2.date") as 'date
)
.show(60, false)
This will give you almost what you want, with the minor difference that every array of dates has size 2:
+--------+------+-----+----------------------------------------------------+
|nodeFrom|nodeTo|value|date |
+--------+------+-----+----------------------------------------------------+
|1 |2 |11.5 |[2016-10-12T12:09:00.000Z, 2016-10-12T12:11:00.000Z]|
|5 |3 |11.0 |[2016-10-12T14:00:00.000Z, 2016-10-12T14:00:00.000Z]|
|4 |2 |34.0 |[2016-10-12T14:00:00.000Z, 2016-10-12T14:00:00.000Z]|
+--------+------+-----+----------------------------------------------------+
If you really (really?) want to eliminate the duplicates from the array column, it seems that the easiest way is to use a custom udf for that:
val elimDuplicates = udf((_: collection.mutable.WrappedArray[String]).distinct)
data
.groupBy('nodeFrom, 'nodeTo).agg(
min(struct('date, 'value)) as 'date1,
max(struct('date, 'value)) as 'date2
)
.select(
'nodeFrom, 'nodeTo,
($"date1.value" + $"date2.value") / 2.0d as 'value,
elimDuplicates(array($"date1.date", $"date2.date")) as 'date
)
.show(60, false)
This will produce:
+--------+------+-----+----------------------------------------------------+
|nodeFrom|nodeTo|value|date |
+--------+------+-----+----------------------------------------------------+
|1 |2 |11.5 |[2016-10-12T12:09:00.000Z, 2016-10-12T12:11:00.000Z]|
|5 |3 |11.0 |[2016-10-12T14:00:00.000Z] |
|4 |2 |34.0 |[2016-10-12T14:00:00.000Z] |
+--------+------+-----+----------------------------------------------------+
Brief explanation:
min(struct('date, 'value)) as date1 selects the earliest date together with the corresponding value
Same with max
The average is computed directly from these two tuples by summing and dividing by 2
The corresponding values are written to array column
(optional) the array is de-duplicated
Hope that helps.
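The struct trick relies on structs being compared field by field, date first. Python tuples compare the same way, so the logic can be sketched in plain Python. This is not Spark code and `summarize` is a hypothetical helper. The sketch also computes the mean over all rows, which shows that averaging only the earliest and latest values (11.5) is not the same as averaging every value in the group (about 11.33).

```python
# (nodeFrom, nodeTo, value, date) rows from the question.
rows = [
    (1, 2, 11, "2016-10-12T12:10:00.000Z"),
    (1, 2, 12, "2016-10-12T12:11:00.000Z"),
    (1, 2, 11, "2016-10-12T12:09:00.000Z"),
    (4, 2, 34, "2016-10-12T14:00:00.000Z"),
    (4, 2, 34, "2016-10-12T14:00:00.000Z"),
    (5, 3, 11, "2016-10-12T14:00:00.000Z"),
]


def summarize(rows):
    """Per (nodeFrom, nodeTo): endpoint average, mean of all values, de-duplicated dates."""
    groups = {}
    for node_from, node_to, value, date in rows:
        # (date, value) plays the role of struct('date, 'value).
        groups.setdefault((node_from, node_to), []).append((date, value))

    out = {}
    for key, pairs in groups.items():
        first_date, first_value = min(pairs)  # earliest date and its value
        last_date, last_value = max(pairs)    # latest date and its value
        endpoint_avg = (first_value + last_value) / 2.0
        full_mean = sum(value for _, value in pairs) / len(pairs)
        dates = list(dict.fromkeys([first_date, last_date]))  # drop a repeated date
        out[key] = (endpoint_avg, full_mean, dates)
    return out


summary = summarize(rows)
for key, stats in summary.items():
    print(key, stats)
```

Comparing the dates as strings works here because they all share the same ISO-8601 layout, so lexicographic order is chronological order.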
Alternatively, you could do a normal groupBy and then use a udf to build the date column in the desired shape, as below:
val df = Seq(
(1, 2, 11, "2016-10-12T12:10:00.000Z"),
(1, 2, 12, "2016-10-12T12:11:00.000Z"),
(1, 2, 11, "2016-10-12T12:09:00.000Z"),
(4, 2, 34, "2016-10-12T14:00:00.000Z"),
(4, 2, 34, "2016-10-12T14:00:00.000Z"),
(5, 3, 11, "2016-10-12T14:00:00.000Z")
).toDF("nodeFrom", "nodeTo", "value", "date")
val zipDates = udf((date1: String, date2: String) => {
if (date1 == date2)
Seq(date1)
else
Seq(date1, date2)
})
val dfT = df
.groupBy('nodeFrom, 'nodeTo)
.agg(avg('value) as "value", min('date) as "minDate", max('date) as "maxDate")
.select('nodeFrom, 'nodeTo, 'value, zipDates('minDate, 'maxDate) as "date")
dfT.show(10, false)
// +--------+------+------------------+----------------------------------------------------+
// |nodeFrom|nodeTo|value |date |
// +--------+------+------------------+----------------------------------------------------+
// |1 |2 |11.333333333333334|[2016-10-12T12:09:00.000Z, 2016-10-12T12:11:00.000Z]|
// |5 |3 |11.0 |[2016-10-12T14:00:00.000Z] |
// |4 |2 |34.0 |[2016-10-12T14:00:00.000Z] |
// +--------+------+------------------+----------------------------------------------------+