I have a Spark SQL dataframe:
id  Value  Weights
1   2      4
1   5      2
2   1      4
2   6      2
2   9      4
3   2      4
I need to groupBy 'id' and aggregate to get the weighted mean, median, and quartiles of the values per 'id'. What is the best way to do this?
Before the calculation you should do a small transformation to your Value column:
F.explode(F.array_repeat('Value', F.col('Weights').cast('int')))
array_repeat creates an array out of your number: the value is repeated as many times as specified in the 'Weights' column (casting to int is necessary, because array_repeat expects the count column to be of int type). After this step the first Value of 2 becomes [2, 2, 2, 2].
Then explode creates a row for every element of the array, so [2, 2, 2, 2] is turned into 4 rows, each containing the integer 2.
After that you can calculate the statistics as usual; the results already have the weights applied, because the dataframe has been expanded according to them.
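For illustration, here is a minimal sketch of what that intermediate step produces (a toy frame with a single id is assumed). Note that explode names its output column col unless you alias it, which is why the aggregations in the full example below refer to 'col':
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# toy frame with a single id, just to show the intermediate result
toy = spark.createDataFrame([(1, 2, 4), (1, 5, 2)], ['id', 'Value', 'Weights'])

toy.select('id', F.explode(F.array_repeat('Value', F.col('Weights').cast('int')))).show()
#+---+---+
#| id|col|
#+---+---+
#|  1|  2|
#|  1|  2|
#|  1|  2|
#|  1|  2|
#|  1|  5|
#|  1|  5|
#+---+---+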
Full example:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
[(1, 2, 4),
(1, 5, 2),
(2, 1, 4),
(2, 6, 2),
(2, 9, 4),
(3, 2, 4)],
['id', 'Value', 'Weights']
)
df = df.select('id', F.explode(F.array_repeat('Value', F.col('Weights').cast('int'))))
df = (df
.groupBy('id')
.agg(F.mean('col').alias('weighted_mean'),
F.expr('percentile(col, 0.5)').alias('weighted_median'),
F.expr('percentile(col, 0.25)').alias('weighted_lower_quartile'),
F.expr('percentile(col, 0.75)').alias('weighted_upper_quartile')))
df.show()
#+---+-------------+---------------+-----------------------+-----------------------+
#| id|weighted_mean|weighted_median|weighted_lower_quartile|weighted_upper_quartile|
#+---+-------------+---------------+-----------------------+-----------------------+
#| 1| 3.0| 2.0| 2.0| 4.25|
#| 2| 5.2| 6.0| 1.0| 9.0|
#| 3| 2.0| 2.0| 2.0| 2.0|
#+---+-------------+---------------+-----------------------+-----------------------+
I have a PySpark dataframe, each row of the column 'TAGID_LIST' is a set of numbers such as {426,427,428,430,432,433,434,437,439,447,448,450,453,460,469,469,469,469}, but I only want to keep the maximum number in each set, 469 for this row. I tried to create a new column with:
wechat_userinfo.withColumn('TAG', f.when(wechat_userinfo['TAGID_LIST'] != 'null', max(wechat_userinfo['TAGID_LIST'])).otherwise('null'))
but got TypeError: Column is not iterable.
How do I correct it?
If the column for which you want to retrieve the max value is an array, you can use the array_max function:
import pyspark.sql.functions as F
new_df = wechat_userinfo.withColumn("TAG", F.array_max(F.col("TAGID_LIST")))
To illustrate with an example,
df = spark.createDataFrame( [(1, [1, 772, 3, 4]), (2, [5, 6, 44, 8, 9])], ('a','d'))
df2 = df.withColumn("maxd", F.array_max(F.col("d")))
df2.show()
+---+----------------+----+
| a| d|maxd|
+---+----------------+----+
| 1| [1, 772, 3, 4]| 772|
| 2|[5, 6, 44, 8, 9]| 44|
+---+----------------+----+
In your particular case, the column in question is not an array of numbers but a string, formatted as comma-separated numbers surrounded by { and }. What I'd suggest is turning the string into an array and then operating on that array as described above. You can use the regexp_replace function to strip the braces and then split() the comma-separated string into an array. It would look like this:
df = spark.createDataFrame( [(1, "{1,2,3,4}"), (2, "{5,6,7,8}")], ('a','d'))
df2 = (df
    .withColumn("as_str", F.regexp_replace(F.col("d"), r'^\{|\}$', ''))
    .withColumn("as_arr", F.split(F.col("as_str"), ",").cast("array<long>"))
    .withColumn("maxd", F.array_max(F.col("as_arr")))
    .drop("as_str"))
df2.show()
+---+---------+------------+----+
| a| d| as_arr|maxd|
+---+---------+------------+----+
| 1|{1,2,3,4}|[1, 2, 3, 4]| 4|
| 2|{5,6,7,8}|[5, 6, 7, 8]| 8|
+---+---------+------------+----+
As the title states, I would like to subtract each value of a specific column by the mean of that column.
Here is my code attempt:
val test = moviePairs.agg(avg(col("rating1")).alias("avgX"), avg(col("rating2")).alias("avgY"))
val subMean = moviePairs.withColumn("meanDeltaX", col("rating1") - test.select("avgX").collect())
.withColumn("meanDeltaY", col("rating2") - test.select("avgY").collect())
subMean.show()
You can either use Spark's DataFrame functions or a plain SQL query on the DataFrame to compute the means of the columns you are focusing on (rating1, rating2).
import org.apache.spark.sql.functions.{avg, col}

val moviePairs = spark.createDataFrame(
Seq(
("Moonlight", 7, 8),
("Lord Of The Drinks", 10, 1),
("The Disaster Artist", 3, 5),
("Airplane!", 7, 9),
("2001", 5, 1),
)
).toDF("movie", "rating1", "rating2")
// find the means for each column and isolate the first (and only) row to get their values
val means = moviePairs.agg(avg("rating1"), avg("rating2")).head()
// alternatively, by using a simple SQL query:
// moviePairs.createOrReplaceTempView("movies")
// val means = spark.sql("select AVG(rating1), AVG(rating2) from movies").head()
val subMean = moviePairs.withColumn("meanDeltaX", col("rating1") - means.getDouble(0))
.withColumn("meanDeltaY", col("rating2") - means.getDouble(1))
subMean.show()
Output for the test input DataFrame moviePairs (with the good ol' double precision loss which you can manage as seen here):
+-------------------+-------+-------+-------------------+-------------------+
| movie|rating1|rating2| meanDeltaX| meanDeltaY|
+-------------------+-------+-------+-------------------+-------------------+
| Moonlight| 7| 8| 0.5999999999999996| 3.2|
| Lord Of The Drinks| 10| 1| 3.5999999999999996| -3.8|
|The Disaster Artist| 3| 5|-3.4000000000000004|0.20000000000000018|
| Airplane!| 7| 9| 0.5999999999999996| 4.2|
| 2001| 5| 1|-1.4000000000000004| -3.8|
+-------------------+-------+-------+-------------------+-------------------+
I have a fairly large dataset with roughly 5 billion records. I would like to take about 1 million random samples out of it. The issue is that the labels are not balanced:
+---------+----------+
| label| count|
+---------+----------+
|A | 768866802|
|B |4241039902|
|C | 584150833|
+---------+----------+
Label B has a lot more data than the other labels. I know there are techniques for down- and up-sampling, but given the large quantity of data I probably don't need them, since I can easily find 1 million records for each of the labels.
I was wondering how I could efficiently take ~1 million random samples (without replacement) so that I have an even amount across all labels, ~333k per label, using PySpark.
One idea I have is to split the dataset into 3 different dataframes, take ~333k random samples from each, and stitch them back together. But maybe there are more efficient ways of doing it.
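A rough sketch of that split-and-union idea, purely for illustration (the dataframe name df and its label column are assumptions here):
from pyspark.sql import functions as F

n_per_label = 333333

# sample roughly n_per_label rows per label, then union the pieces back together
parts = []
for row in df.groupBy("label").count().collect():
    frac = min(1.0, n_per_label / row["count"])
    parts.append(df.filter(F.col("label") == row["label"]).sample(False, frac, seed=42))

balanced = parts[0]
for part in parts[1:]:
    balanced = balanced.union(part)
Note that sample() is itself approximate, which is one reason the row_number approach below gives tighter control over the exact counts.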
You can create a column of random values and use row_number to keep exactly n random samples for each label:
from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark.sql.functions import *
from pyspark.sql.window import Window
n = 333333 # number of samples
df = df.withColumn('rand_col', F.rand())
sample_df1 = df.withColumn("row_num", row_number().over(Window.partitionBy("label")\
    .orderBy("rand_col"))).filter(col("row_num") <= n)\
    .drop("rand_col", "row_num")
sample_df1.groupBy("label").count().show()
This will always give you exactly n samples (here, 333,333) for each label.
Another way of doing this is stratified sampling using Spark's stat.sampleBy:
n = 333333
seed = 12345
# Creating a dictionary of fractions for each label
fractions = df.groupBy("label").count().withColumn("required_n", n/col("count"))\
.drop("count").rdd.collectAsMap()
sample_df2 = df.stat.sampleBy("label", fractions, seed)
sample_df2.groupBy("label").count().show()
sampleBy, however, is approximate: the counts vary from run to run, and it does not guarantee an exact number of records for each label.
Example dataframe:
schema = StructType([StructField("id", IntegerType()), StructField("label", IntegerType())])
data = [[1, 2], [1, 2], [1, 3], [2, 3], [1, 2],[1, 1], [1, 2], [1, 3], [2, 2], [1, 1],[1, 2], [1, 2], [1, 3], [2, 3], [1, 1]]
df = spark.createDataFrame(data,schema=schema)
df.groupBy("label").count().show()
+-----+-----+
|label|count|
+-----+-----+
| 1| 3|
| 2| 7|
| 3| 5|
+-----+-----+
Method 1:
# Sampling 3 records from each label
n = 3
# Assign a column with random values
df = df.withColumn('rand_col', F.rand())
sample_df1 = df.withColumn("row_num", row_number().over(Window.partitionBy("label")\
    .orderBy("rand_col"))).filter(col("row_num") <= n)\
    .drop("rand_col", "row_num")
sample_df1.groupBy("label").count().show()
+-----+-----+
|label|count|
+-----+-----+
| 1| 3|
| 2| 3|
| 3| 3|
+-----+-----+
Method 2:
# Sampling 3 records from each label
n = 3
seed = 12
fractions = df.groupBy("label").count().withColumn("required_n", n/col("count"))\
.drop("count").rdd.collectAsMap()
sample_df2 = df.stat.sampleBy("label", fractions, seed)
sample_df2.groupBy("label").count().show()
+-----+-----+
|label|count|
+-----+-----+
| 1| 3|
| 2| 3|
| 3| 4|
+-----+-----+
As you can see, sampleBy gives you an approximately equal distribution, but not an exact one. I'd prefer Method 1 for your problem.
I have two dataframes representing the following csv data:
Store Date Weekly_Sales
1 05/02/2010 249
2 12/02/2010 455
3 19/02/2010 415
4 26/02/2010 194
Store Date Weekly_Sales
5 05/02/2010 400
6 12/02/2010 460
7 19/02/2010 477
8 26/02/2010 345
What I'm attempting to do is, for each date, read the associated weekly sales in both dataframes and find the average of the two numbers. I'm not sure how to accomplish this.
Assuming that you want to keep individual store data in the result dataset, one approach would be to union the two dataframes and use a Window function to calculate the average weekly sales (along with the corresponding list of stores, if wanted), as follows:
import org.apache.spark.sql.functions._
import spark.implicits._

val df1 = Seq(
(1, "05/02/2010", 249),
(2, "12/02/2010", 455),
(3, "19/02/2010", 415),
(4, "26/02/2010", 194)
).toDF("Store", "Date", "Weekly_Sales")
val df2 = Seq(
(5, "05/02/2010", 400),
(6, "12/02/2010", 460),
(7, "19/02/2010", 477),
(8, "26/02/2010", 345)
).toDF("Store", "Date", "Weekly_Sales")
import org.apache.spark.sql.expressions.Window
val window = Window.partitionBy($"Date")
df1.union(df2).
withColumn("Avg_Sales", avg($"Weekly_Sales").over(window)).
withColumn("Store_List", collect_list($"Store").over(window)).
orderBy($"Date", $"Store").
show
// +-----+----------+------------+---------+----------+
// |Store| Date|Weekly_Sales|Avg_Sales|Store_List|
// +-----+----------+------------+---------+----------+
// | 1|05/02/2010| 249| 324.5| [1, 5]|
// | 5|05/02/2010| 400| 324.5| [1, 5]|
// | 2|12/02/2010| 455| 457.5| [2, 6]|
// | 6|12/02/2010| 460| 457.5| [2, 6]|
// | 3|19/02/2010| 415| 446.0| [3, 7]|
// | 7|19/02/2010| 477| 446.0| [3, 7]|
// | 4|26/02/2010| 194| 269.5| [4, 8]|
// | 8|26/02/2010| 345| 269.5| [4, 8]|
// +-----+----------+------------+---------+----------+
You should first merge them using the union function, then group on the Date column and find the average (using the built-in avg function):
import org.apache.spark.sql.functions._
df1.union(df2)
.groupBy("Date")
.agg(collect_list("Store").as("Stores"), avg("Weekly_Sales").as("average_weekly_sales"))
.show(false)
which should give you
+----------+------+--------------------+
|Date |Stores|average_weekly_sales|
+----------+------+--------------------+
|26/02/2010|[4, 8]|269.5 |
|12/02/2010|[2, 6]|457.5 |
|19/02/2010|[3, 7]|446.0 |
|05/02/2010|[1, 5]|324.5 |
+----------+------+--------------------+
I hope the answer is helpful.
I have a Spark dataframe with several columns. I want to add a column on to the dataframe that is a sum of a certain number of the columns.
For example, my data looks like this:
ID var1 var2 var3 var4 var5
a 5 7 9 12 13
b 6 4 3 20 17
c 4 9 4 6 9
d 1 2 6 8 1
I want a column added summing the rows for specific columns:
ID var1 var2 var3 var4 var5 sums
a 5 7 9 12 13 46
b 6 4 3 20 17 50
c 4 9 4 6 9 32
d 1 2 6 8 1 18
I know it is possible to add columns together if you know the specific columns to add:
val newdf = df.withColumn("sumofcolumns", df("var1") + df("var2"))
But is it possible to pass a list of column names and add them together? Based on this answer, which is basically what I want but uses the Python API instead of Scala (Add column sum as new column in PySpark dataframe), I think something like this would work:
//Select columns to sum
val columnstosum = ("var1", "var2","var3","var4","var5")
// Create new column called sumofcolumns which is sum of all columns listed in columnstosum
val newdf = df.withColumn("sumofcolumns", df.select(columstosum.head, columnstosum.tail: _*).sum)
This throws the error value sum is not a member of org.apache.spark.sql.DataFrame. Is there a way to sum across columns?
Thanks in advance for your help.
You should try the following:
import org.apache.spark.sql.functions._
val sc: SparkContext = ...
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val input = sc.parallelize(Seq(
("a", 5, 7, 9, 12, 13),
("b", 6, 4, 3, 20, 17),
("c", 4, 9, 4, 6 , 9),
("d", 1, 2, 6, 8 , 1)
)).toDF("ID", "var1", "var2", "var3", "var4", "var5")
val columnsToSum = List(col("var1"), col("var2"), col("var3"), col("var4"), col("var5"))
val output = input.withColumn("sums", columnsToSum.reduce(_ + _))
output.show()
Then the result is:
+---+----+----+----+----+----+----+
| ID|var1|var2|var3|var4|var5|sums|
+---+----+----+----+----+----+----+
| a| 5| 7| 9| 12| 13| 46|
| b| 6| 4| 3| 20| 17| 50|
| c| 4| 9| 4| 6| 9| 32|
| d| 1| 2| 6| 8| 1| 18|
+---+----+----+----+----+----+----+
Plain and simple:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, col}
def sum_(cols: Column*) = cols.foldLeft(lit(0))(_ + _)
val columnstosum = Seq("var1", "var2", "var3", "var4", "var5").map(col _)
df.select(sum_(columnstosum: _*))
with Python equivalent:
from functools import reduce
from operator import add
from pyspark.sql.functions import lit, col
def sum_(*cols):
return reduce(add, cols, lit(0))
columnstosum = [col(x) for x in ["var1", "var2", "var3", "var4", "var5"]]
select("*", sum_(*columnstosum))
Both will return null if there is a missing value in the row. You can use DataFrameNaFunctions.fill or the coalesce function to avoid that.
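As a rough sketch of that null handling in PySpark (reusing the column names from the example; df is assumed):
from functools import reduce
from operator import add
from pyspark.sql import functions as F

cols = ["var1", "var2", "var3", "var4", "var5"]

# option 1: replace nulls with 0 before summing
df_filled = df.na.fill(0, subset=cols)

# option 2: coalesce each column to 0 inside the sum itself
row_sum = reduce(add, [F.coalesce(F.col(c), F.lit(0)) for c in cols])
df_with_sum = df.select("*", row_sum.alias("sums"))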
Assuming you have a dataframe df, you can sum up all columns except your ID column. This is helpful when you have many columns and don't want to list them all manually, as in the answers above. This post has the same answer.
val sumAll = df.columns.collect{ case x if x != "ID" => col(x) }.reduce(_ + _)
df.withColumn("sum", sumAll)
Here's a concise solution using Python:
NewDF = OldDF.withColumn('sums', sum(OldDF[col] for col in OldDF.columns[1:]))
Hopefully this will inspire something similar in Spark ... anyone?