I have a DataFrame that looks like the one below. I need to break the annual data down into quarters, i.e. for each company and each year create a row per quarter-end date, with the new EV simply being the annual value divided by 4. Any suggestions on how to do it?
+----------+------+---+
| date|entity| EV|
+----------+------+---+
|2018-12-31| x| 40|
|2019-12-31| x| 80|
|2018-12-31| y|120|
+----------+------+---+
Expected output:
+----------+------+---+
| date|entity| EV|
+----------+------+---+
|2018-03-31| x| 10|
|2018-06-30| x| 10|
|2018-09-30| x| 10|
|2018-12-31| x| 10|
|2019-03-31| x| 20|
|2019-06-30| x| 20|
|2019-09-30| x| 20|
|2019-12-31| x| 20|
|2018-03-31| y| 30|
|2018-06-30| y| 30|
|2018-09-30| y| 30|
|2018-12-31| y| 30|
+----------+------+---+
Here's one way to do it using arrays and transform.
import pyspark.sql.functions as func

data_sdf. \
    withColumn('qtr_dt_suffix',
               func.array(func.lit('03-31'), func.lit('06-30'), func.lit('09-30'), func.lit('12-31'))
               ). \
    withColumn('qtr_dts',
               func.transform('qtr_dt_suffix', lambda x: func.concat(func.year('date'), func.lit('-'), x).cast('date'))
               ). \
    select(func.explode('qtr_dts').alias('qtr_dt'), 'entity', (func.col('ev') / 4).alias('ev')). \
    show()
# +----------+------+----+
# |qtr_dt |entity|ev |
# +----------+------+----+
# |2018-03-31|x |10.0|
# |2018-06-30|x |10.0|
# |2018-09-30|x |10.0|
# |2018-12-31|x |10.0|
# |2019-03-31|x |20.0|
# |2019-06-30|x |20.0|
# |2019-09-30|x |20.0|
# |2019-12-31|x |20.0|
# |2018-03-31|y |30.0|
# |2018-06-30|y |30.0|
# |2018-09-30|y |30.0|
# |2018-12-31|y |30.0|
# +----------+------+----+
The idea is to create an array containing each quarter's end-date suffix - [03-31, 06-30, 09-30, 12-31]. Use transform on this array to build the full dates for that row's year - [2018-03-31, 2018-06-30, 2018-09-30, 2018-12-31]. Then explode the resulting array to create one row per quarter date.
If transform is not available as a DataFrame function in your Spark version (pyspark.sql.functions.transform was only added in 3.1), you can use transform inside expr, since the SQL higher-order function has been available since Spark 2.4.
data_sdf. \
    withColumn('qtr_dt_suffix',
               func.array(func.lit('03-31'), func.lit('06-30'), func.lit('09-30'), func.lit('12-31'))
               ). \
    withColumn('qtr_dts',
               func.expr('transform(qtr_dt_suffix, x -> cast(concat(year(date), "-", x) as date))')
               ). \
    show(truncate=False)
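For reference, here is a minimal setup sketch for the data_sdf used above, assuming the column names from the question (with EV as lowercase ev):

from pyspark.sql import SparkSession
import pyspark.sql.functions as func

spark = SparkSession.builder.getOrCreate()

# sample data from the question; the date strings are cast to proper dates
data_sdf = spark.createDataFrame(
    [('2018-12-31', 'x', 40), ('2019-12-31', 'x', 80), ('2018-12-31', 'y', 120)],
    ['date', 'entity', 'ev']
).withColumn('date', func.to_date('date'))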
I have monthly data with columns ym and val, and would like to calculate a 3-month rolling sum and average across time. However, there are two rows for "2022-07-01", and I would like a result for both of those rows.
If you can use rowsBetween instead of rangeBetween, you can assign a row number to each date in order (so the two rows for the same date get distinct row numbers) and then order the window by that row number for the rolling sum and avg.
Below is an example.
import pyspark.sql.functions as func
from pyspark.sql.window import Window as wd

data_sdf. \
    withColumn('rn', func.row_number().over(wd.partitionBy().orderBy('ym'))). \
    withColumn('sum_val_roll3m',
               func.sum('val').over(wd.partitionBy().orderBy('rn').rowsBetween(-2, 0))
               ). \
    withColumn('mean_val_roll3m',
               func.avg('val').over(wd.partitionBy().orderBy('rn').rowsBetween(-2, 0))
               ). \
    show()
# +----------+---+---+--------------+-----------------+
# | ym|val| rn|sum_val_roll3m| mean_val_roll3m|
# +----------+---+---+--------------+-----------------+
# |2022-03-01| 8| 1| 8| 8.0|
# |2022-04-01| 7| 2| 15| 7.5|
# |2022-05-01| 7| 3| 22|7.333333333333333|
# |2022-06-01| 10| 4| 24| 8.0|
# |2022-07-01| 4| 5| 21| 7.0|
# |2022-07-01| 1| 6| 15| 5.0|
# +----------+---+---+--------------+-----------------+
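For reference, a minimal setup sketch for the data_sdf assumed above; the ym/val values are reconstructed from the output shown:

from pyspark.sql import SparkSession
import pyspark.sql.functions as func

spark = SparkSession.builder.getOrCreate()

# monthly values, including the two rows for 2022-07-01
data_sdf = spark.createDataFrame(
    [('2022-03-01', 8), ('2022-04-01', 7), ('2022-05-01', 7),
     ('2022-06-01', 10), ('2022-07-01', 4), ('2022-07-01', 1)],
    ['ym', 'val']
).withColumn('ym', func.to_date('ym'))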
I'm trying to create a column of standardized (z-score) values of a column x on a Spark DataFrame, but I am missing something because none of it is working.
Here's my example:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType
from scipy.stats import zscore

@pandas_udf('float')
def zscore_udf(x: pd.Series) -> pd.Series:
    return zscore(x)

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
columns = ["id","x"]
data = [("a", 81.0),
("b", 36.2),
("c", 12.0),
("d", 81.0),
("e", 36.3),
("f", 12.0),
("g", 111.7)]
df = spark.createDataFrame(data=data,schema=columns)
df.show()
df = df.withColumn('y', zscore_udf(df.x))
df.show()
Which results in obviously wrong calculations:
+---+-----+----+
| id| x| y|
+---+-----+----+
| a| 81.0|null|
| b| 36.2| 1.0|
| c| 12.0|-1.0|
| d| 81.0| 1.0|
| e| 36.3|-1.0|
| f| 12.0|-1.0|
| g|111.7| 1.0|
+---+-----+----+
Thank you for your help.
How to fix:
Instead of using a UDF, calculate the stddev_pop and the avg over the whole DataFrame and compute the z-score manually.
I suggest using a window function over the entire DataFrame for the first step and then simple arithmetic to get the z-score.
See the suggested code:
from pyspark.sql.functions import avg, col, stddev_pop
from pyspark.sql.window import Window
df2 = df \
.select(
"*",
avg("x").over(Window.partitionBy()).alias("avg_x"),
stddev_pop("x").over(Window.partitionBy()).alias("stddev_x"),
) \
.withColumn("manual_z_score", (col("x") - col("avg_x")) / col("stddev_x"))
Why didn't the UDF work?
Spark is built for distributed computation. When you perform operations on a DataFrame, Spark distributes the workload across partitions on the available executors/workers.
pandas_udf is no different. When running a UDF of type pd.Series -> pd.Series, some rows are sent to partition X and some to partition Y; when zscore is then run, it calculates the mean and std of the data within that partition and writes the z-score based on that data only.
I'll use spark_partition_id to "prove" this.
Rows a, b, c were mapped to partition 0, while d, e, f, g went to partition 1. I manually calculated the mean/stddev_pop of both the entire set and the per-partition data and then calculated the z-scores; the UDF z-score was equal to the z-score of the partition.
from pyspark.sql.functions import pandas_udf, spark_partition_id, avg, stddev, col, stddev_pop
from pyspark.sql.window import Window
df2 = df \
.select(
"*",
zscore_udf(df.x).alias("z_score"),
spark_partition_id().alias("partition"),
avg("x").over(Window.partitionBy(spark_partition_id())).alias("avg_partition_x"),
stddev_pop("x").over(Window.partitionBy(spark_partition_id())).alias("stddev_partition_x"),
) \
.withColumn("partition_z_score", (col("x") - col("avg_partition_x")) / col("stddev_partition_x"))
df2.show()
+---+-----+-----------+---------+-----------------+------------------+--------------------+
| id| x| z_score|partition| avg_partition_x|stddev_partition_x| partition_z_score|
+---+-----+-----------+---------+-----------------+------------------+--------------------+
| a| 81.0| 1.327058| 0|43.06666666666666|28.584533502500186| 1.3270579815484989|
| b| 36.2|-0.24022315| 0|43.06666666666666|28.584533502500186|-0.24022314955974558|
| c| 12.0| -1.0868348| 0|43.06666666666666|28.584533502500186| -1.0868348319887526|
| d| 81.0| 0.5366879| 1| 60.25|38.663063768925504| 0.5366879387524718|
| e| 36.3|-0.61945426| 1| 60.25|38.663063768925504| -0.6194542714757446|
| f| 12.0| -1.2479612| 1| 60.25|38.663063768925504| -1.247961110593097|
| g|111.7| 1.3307275| 1| 60.25|38.663063768925504| 1.3307274433163698|
+---+-----+-----------+---------+-----------------+------------------+--------------------+
I also added df.repartition(8) prior to the calculation and managed to get results similar to those in the original question: partitions with 0 stddev give a null z-score, and a partition with 2 rows gives (-1, 1) z-scores. The sketch and output below illustrate this.
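A sketch of what that repartition experiment might look like (same select as before, just with a repartition first; the exact row-to-partition mapping can vary between environments):

# split the small DataFrame over 8 partitions before applying the UDF; each
# partition now holds only one or two rows, so the per-partition stats degenerate
df_repart = df.repartition(8)

df2 = df_repart \
    .select(
        "*",
        zscore_udf(df_repart.x).alias("z_score"),
        spark_partition_id().alias("partition"),
        avg("x").over(Window.partitionBy(spark_partition_id())).alias("avg_partition_x"),
        stddev_pop("x").over(Window.partitionBy(spark_partition_id())).alias("stddev_partition_x"),
    ) \
    .withColumn("partition_z_score", (col("x") - col("avg_partition_x")) / col("stddev_partition_x"))

df2.show()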
+---+-----+-------+---------+---------------+------------------+-----------------+
| id| x|z_score|partition|avg_partition_x|stddev_partition_x|partition_z_score|
+---+-----+-------+---------+---------------+------------------+-----------------+
| a| 81.0| null| 0| 81.0| 0.0| null|
| d| 81.0| null| 0| 81.0| 0.0| null|
| f| 12.0| null| 1| 12.0| 0.0| null|
| b| 36.2| -1.0| 6| 73.95| 37.75| -1.0|
| g|111.7| 1.0| 6| 73.95| 37.75| 1.0|
| c| 12.0| -1.0| 7| 24.15|12.149999999999999| -1.0|
| e| 36.3| 1.0| 7| 24.15|12.149999999999999| 1.0|
+---+-----+-------+---------+---------------+------------------+-----------------+
Are there any examples of how to convert an RDD to a DataFrame, and a DataFrame back to an RDD, in PySpark 1.6.1?
Can toDF() not be used in 1.6.1?
For example, I have an RDD like this:
data = sc.parallelize([('a','b','c', 1,4), ('o','u','w', 9,3), ('s','q','a', 8,6), ('l','g','z', 8,3), \
('a','b','c', 9,8), ('s','q','a', 10,10), ('l','g','z', 20,20), ('o','u','w', 77,77)])
If for some reason you can't use the .toDF() method, the solution I propose is this:
data = sqlContext.createDataFrame(sc.parallelize([('a','b','c', 1,4), ('o','u','w', 9,3), ('s','q','a', 8,6), ('l','g','z', 8,3), \
('a','b','c', 9,8), ('s','q','a', 10,10), ('l','g','z', 20,20), ('o','u','w', 77,77)]))
This will create a DataFrame with column names "_n", where n is the column number. If you want to rename the columns, I suggest you look at this post: How to change dataframe column names in pyspark?. But all you need to do is:
data_named = data.selectExpr("_1 as One", "_2 as Two", "_3 as Three", "_4 as Four", "_5 as Five")
Now let's see the DF:
data_named.show()
And this will output:
+---+---+-----+----+----+
|One|Two|Three|Four|Five|
+---+---+-----+----+----+
| a| b| c| 1| 4|
| o| u| w| 9| 3|
| s| q| a| 8| 6|
| l| g| z| 8| 3|
| a| b| c| 9| 8|
| s| q| a| 10| 10|
| l| g| z| 20| 20|
| o| u| w| 77| 77|
+---+---+-----+----+----+
EDIT: Try again, because you should be able to use .toDF() in Spark 1.6.1.
I do not see a reason why rdd.toDF() cannot be used in PySpark with Spark 1.6.1. Please check the Spark 1.6.1 Python docs for an example of toDF(): https://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.SQLContext
As per your requirement,
rdd = sc.parallelize([('a','b','c', 1,4), ('o','u','w', 9,3), ('s','q','a', 8,6), ('l','g','z', 8,3), ('a','b','c', 9,8), ('s','q','a', 10,10), ('l','g','z', 20,20), ('o','u','w', 77,77)])
#rdd to dataframe
df = rdd.toDF()
## can also provide column names, e.g. df2 = df.toDF('col1', 'col2', 'col3', 'col4', 'col5')
#dataframe to rdd
rdd2 = df.rdd
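A quick round-trip check, as a sketch (the column names here are just illustrative, not from the original post):

# name the columns explicitly while converting; toDF also accepts a list of names
df2 = rdd.toDF(['c1', 'c2', 'c3', 'c4', 'c5'])
df2.printSchema()

# going back: df.rdd returns an RDD of Row objects
rdd2 = df2.rdd
print(rdd2.first())   # e.g. Row(c1='a', c2='b', c3='c', c4=1, c5=4)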
df2000.drop('jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec').show()
This shows the DataFrame without the dropped columns. But when I then run the show command alone to check the table:
df2000.show()
the dropped columns are still there.
drop is not a side-effecting function; it returns a new DataFrame with the specified columns removed. So you have to assign the new DataFrame to a variable to reference it later, as shown below.
>>> df2000 = spark.createDataFrame([('a',10,20,30),('a',10,20,30),('a',10,20,30),('a',10,20,30)],['key', 'jan', 'feb', 'mar'])
>>> cols = ['jan', 'feb', 'mar']
>>> df2000.show()
+---+---+---+---+
|key|jan|feb|mar|
+---+---+---+---+
| a| 10| 20| 30|
| a| 10| 20| 30|
| a| 10| 20| 30|
| a| 10| 20| 30|
+---+---+---+---+
>>> from functools import reduce   # needed in Python 3
>>> df2000_dropped_col = reduce(lambda x, y: x.drop(y), cols, df2000)
>>> df2000_dropped_col.show()
+---+
|key|
+---+
| a|
| a|
| a|
| a|
+---+
Doing a show on the new DataFrame now yields the desired result, with all the month columns dropped.
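As a side note, in newer Spark versions (2.x and later) drop accepts several column names at once, so the reduce is not needed; the key point remains the same, assign the returned DataFrame. A quick sketch:

# drop takes multiple names in Spark 2.x+; remember to assign the result
df2000_dropped_col = df2000.drop('jan', 'feb', 'mar')
df2000_dropped_col.show()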
I have the following DataFrame:
January | February | March
-----------------------------
10 | 10 | 10
20 | 20 | 20
50 | 50 | 50
I'm trying to add a column to this which is the sum of the values of each row.
January | February | March | TOTAL
----------------------------------
10 | 10 | 10 | 30
20 | 20 | 20 | 60
50 | 50 | 50 | 150
As far as I can see, all the built-in aggregate functions seem to be for calculating values in single columns. How do I go about using values across columns on a per-row basis (using Scala)?
I've gotten as far as
val newDf: DataFrame = df.select(colsToSum.map(col):_*).foreach ...
You were very close with this:
val newDf: DataFrame = df.select(colsToSum.map(col):_*).foreach ...
Instead, try this:
import org.apache.spark.sql.functions.col

val newDf = df.select(colsToSum.map(col).reduce((c1, c2) => c1 + c2) as "sum")
I think this is the best of the answers, because it is as fast as the answer with the hard-coded SQL query and as convenient as the one that uses the UDF. It's the best of both worlds, and I didn't even add a full line of code!
Alternatively, using Hugo's approach and example, you can create a UDF that receives any number of columns and sums them all.
from functools import reduce
from pyspark.sql.functions import udf

def superSum(*cols):
    return reduce(lambda a, b: a + b, cols)

add = udf(superSum)
df.withColumn('total', add(*[df[x] for x in df.columns])).show()
+-------+--------+-----+-----+
|January|February|March|total|
+-------+--------+-----+-----+
| 10| 10| 10| 30|
| 20| 20| 20| 60|
+-------+--------+-----+-----+
This code is in Python, but it can be easily translated:
# First we create a RDD in order to create a dataFrame:
rdd = sc.parallelize([(10, 10,10), (20, 20,20)])
df = rdd.toDF(['January', 'February', 'March'])
df.show()
# Here, we create a new column called 'TOTAL' which has results
# from add operation of columns df.January, df.February and df.March
df.withColumn('TOTAL', df.January + df.February + df.March).show()
Output:
+-------+--------+-----+
|January|February|March|
+-------+--------+-----+
| 10| 10| 10|
| 20| 20| 20|
+-------+--------+-----+
+-------+--------+-----+-----+
|January|February|March|TOTAL|
+-------+--------+-----+-----+
| 10| 10| 10| 30|
| 20| 20| 20| 60|
+-------+--------+-----+-----+
You can also create a User Defined Function if you want; here is a link to the Spark documentation:
UserDefinedFunction (udf)
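For example, a minimal UDF sketch over the three columns used in this example (the earlier superSum answer shows a variadic version of the same idea):

from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

# UDF that sums the three month columns of a row
sum_cols = udf(lambda jan, feb, mar: jan + feb + mar, LongType())
df.withColumn('TOTAL', sum_cols(df.January, df.February, df.March)).show()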
Working Scala example with dynamic column selection:
import sqlContext.implicits._
val rdd = sc.parallelize(Seq((10, 10, 10), (20, 20, 20)))
val df = rdd.toDF("January", "February", "March")
df.show()
+-------+--------+-----+
|January|February|March|
+-------+--------+-----+
| 10| 10| 10|
| 20| 20| 20|
+-------+--------+-----+
import org.apache.spark.sql.functions.col

val sumDF = df.withColumn("TOTAL", df.columns.map(c => col(c)).reduce((c1, c2) => c1 + c2))
sumDF.show()
+-------+--------+-----+-----+
|January|February|March|TOTAL|
+-------+--------+-----+-----+
| 10| 10| 10| 30|
| 20| 20| 20| 60|
+-------+--------+-----+-----+
You can use expr() for this. In Scala use:
import org.apache.spark.sql.functions.expr

df.withColumn("TOTAL", expr("January + February + March"))