I'm collecting dataframe statistics.
For each column: the maximum, minimum, and average value, the number of zeros, and the number of empty (null) values.
Conditions:
Number of columns n < 2000
Number of dataframe entries r < 10^9
The stack() function is used for the solution
https://www.hadoopinrealworld.com/understanding-stack-function-in-spark/
What worries me:
The number of rows in the intermediate dataframe resultDF: columnsNames.size * r, which is a lot.
Input:
val columnsNames = List("col_name1", "col_name2")
+---------+---------+-----------+
|col_name1|col_name2|period_date|
+---------+---------+-----------+
| 11| 21| 2022-01-31|
| 12| 22| 2022-01-31|
| 13| 23| 2022-03-31|
+---------+---------+-----------+
Output:
+-----------+---------+----------+----------+---------+---------+---------+
|period_date| columns|count_null|count_zero|avg_value|min_value|max_value|
+-----------+---------+----------+----------+---------+---------+---------+
| 2022-01-31|col_name2| 0| 0| 21.5| 21| 22|
| 2022-03-31|col_name1| 0| 0| 13.0| 13| 13|
| 2022-03-31|col_name2| 0| 0| 23.0| 23| 23|
| 2022-01-31|col_name1| 0| 0| 11.5| 11| 12|
+-----------+---------+----------+----------+---------+---------+---------+
My solution:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import org.apache.spark.sql.functions._
val spark = SparkSession.builder().master("local").appName("spark test5").getOrCreate()
import spark.implicits._
case class RealStructure(col_name1: Int, col_name2: Int, period_date: String)
val userTableDf = List(
  RealStructure(11, 21, "2022-01-31"),
  RealStructure(12, 22, "2022-01-31"),
  RealStructure(13, 23, "2022-03-31")
).toDF()
//userTableDf.show()
//Start
new StatisticCollector(userTableDf)
class StatisticCollector(userTableDf: DataFrame) {
val columnsNames = List("col_name1", "col_name2")
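// Build a stack() expression that unpivots every listed column into (columns, values) pairs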
val stack = s"stack(${columnsNames.length}, ${columnsNames.map(name => s"'$name', $name").mkString(",")})"
val resultDF = userTableDf.select(col("period_date"),
  expr(s"$stack as (columns, values)")
)
//resultDF.show()
println(stack)
/**
+-----------+---------+------+
|period_date| columns|values|
+-----------+---------+------+
| 2022-01-31|col_name1| 11|
| 2022-01-31|col_name2| 21|
| 2022-01-31|col_name1| 12|
| 2022-01-31|col_name2| 22|
| 2022-03-31|col_name1| 13|
| 2022-03-31|col_name2| 23|
+-----------+---------+------+
stack(2, 'col_name1', col_name1,'col_name2', col_name2)
**/
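// Aggregate the stacked rows: one output row per (period_date, column) pair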
val superResultDF = resultDF.groupBy(col("period_date"), col("columns")).agg(
  sum(when(col("values").isNull, 1).otherwise(0)).alias("count_null"),
  sum(when(col("values") === 0, 1).otherwise(0)).alias("count_zero"),
  avg("values").cast("double").alias("avg_value"),
  min(col("values")).alias("min_value"),
  max(col("values")).alias("max_value")
)
superResultDF.show()
}
Please give your assessment; if you see a way to solve this more efficiently, please describe how you would do it.
Calculation speed is important.
It needs to be as fast as possible.
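A rough sketch of one possible leaner variant (not benchmarked at the stated volumes; it reuses columnsNames and userTableDf from the code above and assumes every listed column is numeric): compute all per-column aggregates in a single groupBy("period_date") pass, so that stack() only reshapes the already-aggregated rows, one per period_date, instead of all r input rows.
val aggExprs = columnsNames.flatMap { c =>
  Seq(
    sum(when(col(c).isNull, 1).otherwise(0)).alias(s"${c}_count_null"),
    sum(when(col(c) === 0, 1).otherwise(0)).alias(s"${c}_count_zero"),
    avg(col(c)).cast("double").alias(s"${c}_avg_value"),
    min(col(c)).alias(s"${c}_min_value"),
    max(col(c)).alias(s"${c}_max_value")
  )
}
// Wide but short: one row per period_date, five aggregate columns per source column.
val aggregated = userTableDf.groupBy(col("period_date")).agg(aggExprs.head, aggExprs.tail: _*)
// stack() now runs over the aggregated rows only, not over all r input rows.
val stackStats = s"stack(${columnsNames.length}, " +
  columnsNames.map(c => s"'$c', ${c}_count_null, ${c}_count_zero, ${c}_avg_value, ${c}_min_value, ${c}_max_value").mkString(", ") +
  ") as (columns, count_null, count_zero, avg_value, min_value, max_value)"
val statsDF = aggregated.select(col("period_date"), expr(stackStats))
statsDF.show()
The trade-off is a single very wide aggregation (five expressions per column, so close to 10000 aggregate columns when n approaches 2000), so whether it actually wins at that width would need to be measured.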
I have tried the rank function, but it ranks individual transactions, so the results go beyond 180 days.
This is the result I am getting, but it is not what I want; it is wrong because it includes transactions beyond 180 days.
from pyspark.sql.functions import col, rank
from pyspark.sql.window import Window

window = Window.partitionBy(df3['acctno']).orderBy(df3['trans_date'])
df3.select('*', rank().over(window).alias('rank')) \
    .filter(col('rank') <= 180) \
    .show(500)
+----+-----+---+----------+----------+--------------+------+-------+----+
|year|month|day|      date|  txnrefid|           acc|branch|channel|rank|
+----+-----+---+----------+----------+--------------+------+-------+----+
|2020|    2|  6|2020-02-06| 1234abcd6| 2074-556-1111|  6666|    CBS|   1|
|2020|    2|  7|2020-02-07| 1234abcd7| 2074-556-1111|  6666|    CBS|   2|
|2020|    2|  8|2020-02-08| 1234abcd8| 2074-556-1111|  6666|    CBS|   3|
|2020|    2|  9|2020-02-09| 1234abcd9| 2074-556-1111|  6666|    CBS|   4|
But I want it like this:
+----+-----+---+----------+----------+--------------+------+-------+----+
|year|month|day|      date|  txnrefid|           acc|branch|channel|rank|
+----+-----+---+----------+----------+--------------+------+-------+----+
|2020|    2|  6|2020-02-06| 1234abcd6| 2074-556-1111|  6666|    CBS|   1|
|2020|    2|  7|2020-02-07| 1234abcd7| 2074-556-1111|  6666|    CBS|   2|
|2020|    2|  8|2020-02-08| 1234abcd8| 2074-556-1111|  6666|    CBS|   3|
|2020|    2|  9|2020-02-09| 1234abcd9| 2074-556-1111|  6666|    CBS|   4|
+----+-----+---+----------+----------+--------------+------+-------+----+
Since you edited your question, here is a new answer that uses a different approach.
The idea is to get, for each account number, the minimum date, compute the limit date (min date + 180 days), and then remove all the rows that fall beyond it.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df.count() # I used your sample data, so 60 lines
> 60
w = Window.partitionBy(df["acctno"])
df = df.withColumn("min_date", F.min(F.col("trans_date").cast("date")).over(w))
df = df.where(
    F.col("trans_date").cast("date")
    <= F.date_add( # Use F.date_add to add days or F.add_months to add month.
        F.col("min_date"), 180
    )
).drop("min_date")
df.count() # Final dataframe limited to 180 days, nothing older than 2020-08-04
> 54
If you want the first 6 months, then you should use the fields "year" and "month", not "trans_date".
Something like window = Window.partitionBy(df3['acctno']).orderBy(df3['year'], df3['month']) should give you better results.
Then filter on rank <= 6:
df3.select("*", dense_rank().over(window).alias("rank")).filter(col("rank") <= 6).show(500)
EDIT: you need to use dense_rank rather than rank here: with the window ordered by year and month, dense_rank numbers the distinct months consecutively (1, 2, 3, ...), so rank <= 6 keeps exactly the first 6 months.
I have 100 float columns in a Dataframe which are ordered by date.
ID Date C1 C2 ....... C100
1 02/06/2019 32.09 45.06 99
1 02/04/2019 32.09 45.06 99
2 02/03/2019 32.09 45.06 99
2 05/07/2019 32.09 45.06 99
I need the cumulative sum of C1 to C100, partitioned by ID and ordered by date.
Target dataframe should look like this:
ID Date C1 C2 ....... C100
1 02/04/2019 32.09 45.06 99
1 02/06/2019 64.18 90.12 198
2 02/03/2019 32.09 45.06 99
2 05/07/2019 64.18 90.12 198
I want to achieve this without looping over C1 to C100.
Initial code for one column:
var DF1 = DF.withColumn("CumSum_c1", sum("C1").over(
  Window.partitionBy("ID")
    .orderBy(col("date").asc)))
I found a similar question here, but he did it manually for two columns: Cumulative sum in Spark
It's a classic use case for foldLeft. Let's generate some data first:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
import spark.implicits._
val df = spark.range(1000)
  .withColumn("c1", 'id + 3)
  .withColumn("c2", 'id % 2 + 1)
  .withColumn("date", monotonically_increasing_id)
  .withColumn("id", 'id % 10 + 1)
// We will select the columns we want to compute the cumulative sum of.
val columns = df.drop("id", "date").columns
val w = Window.partitionBy(col("id")).orderBy(col("date").asc)
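// With an orderBy and no explicit frame, the window defaults to running from the start of the partition up to the current row, so sum(...).over(w) is a running total.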
val results = columns.foldLeft(df)((tmp_, column) => tmp_.withColumn(s"cum_sum_$column", sum(column).over(w)))
results.orderBy("id", "date").show
// +---+---+---+-----------+----------+----------+
// | id| c1| c2| date|cum_sum_c1|cum_sum_c2|
// +---+---+---+-----------+----------+----------+
// | 1| 3| 1| 0| 3| 1|
// | 1| 13| 1| 10| 16| 2|
// | 1| 23| 1| 20| 39| 3|
// | 1| 33| 1| 30| 72| 4|
// | 1| 43| 1| 40| 115| 5|
// | 1| 53| 1| 8589934592| 168| 6|
// | 1| 63| 1| 8589934602| 231| 7|
Here is another way, using a simple select expression:
val w = Window.partitionBy($"id").orderBy($"date".asc).rowsBetween(Window.unboundedPreceding, Window.currentRow)
// get columns you want to sum
val columnsToSum = df.drop("ID", "Date").columns
// map over those columns and create new sum columns
val selectExpr = Seq(col("ID"), col("Date")) ++ columnsToSum.map(c => sum(col(c)).over(w).alias(c)).toSeq
df.select(selectExpr:_*).show()
Gives:
+---+----------+-----+-----+----+
| ID| Date| C1| C2|C100|
+---+----------+-----+-----+----+
| 1|02/04/2019|32.09|45.06| 99|
| 1|02/06/2019|64.18|90.12| 198|
| 2|02/03/2019|32.09|45.06| 99|
| 2|05/07/2019|64.18|90.12| 198|
+---+----------+-----+-----+----+
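If you also want to keep the original values and add the running totals as separate columns instead of overwriting C1..C100, the same select can alias the windowed sums under a prefix (a small variation reusing w and columnsToSum from above):
val selectExprKeep = df.columns.map(col) ++
  columnsToSum.map(c => sum(col(c)).over(w).alias(s"cum_sum_$c"))
df.select(selectExprKeep: _*).show()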
I'm having a hard time framing the following PySpark dataframe manipulation.
Essentially I am trying to group by category, then pivot/unmelt the subcategories, and add new columns.
I've tried a number of ways, but they are very slow and do not leverage Spark's parallelism.
Here is my existing (slow, verbose) code:
from pyspark.sql.functions import lit
df = sqlContext.table('Table')
#loop over category
listids = [x.asDict().values()[0] for x in df.select("category").distinct().collect()]
dfArray = [df.where(df.category == x) for x in listids]
for d in dfArray:
    #loop over subcategory
    listids_sub = [x.asDict().values()[0] for x in d.select("sub_category").distinct().collect()]
    dfArraySub = [d.where(d.sub_category == x) for x in listids_sub]
    num = 1
    for b in dfArraySub:
        #renames all columns to append a number
        for c in b.columns:
            if c not in ['category','sub_category','date']:
                column_name = str(c)+'_'+str(num)
                b = b.withColumnRenamed(str(c), str(c)+'_'+str(num))
        b = b.drop('sub_category')
        num += 1
        #if no df exists, create one and continually join new columns
        try:
            all_subs = all_subs.drop('sub_category').join(b.drop('sub_category'), on=['cateogry','date'], how='left')
        except:
            all_subs = b
    #Fixes missing columns on union
    try:
        try:
            diff_columns = list(set(all_cats.columns) - set(all_subs.columns))
            for d in diff_columns:
                all_subs = all_subs.withColumn(d, lit(None))
            all_cats = all_cats.union(all_subs)
        except:
            diff_columns = list(set(all_subs.columns) - set(all_cats.columns))
            for d in diff_columns:
                all_cats = all_cats.withColumn(d, lit(None))
            all_cats = all_cats.union(all_subs)
    except Exception as e:
        print e
        all_cats = all_subs
But this is very slow. Any guidance would be greatly appreciated!
Your expected output is not entirely logical, but we can achieve this result using the pivot function. You need to make your rules precise, otherwise I can see a lot of cases where it may fail.
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df.show()
+----------+---------+------------+------------+------------+
| date| category|sub_category|metric_sales|metric_trans|
+----------+---------+------------+------------+------------+
|2018-01-01|furniture| bed| 100| 75|
|2018-01-01|furniture| chair| 110| 85|
|2018-01-01|furniture| shelf| 35| 30|
|2018-02-01|furniture| bed| 55| 50|
|2018-02-01|furniture| chair| 45| 40|
|2018-02-01|furniture| shelf| 10| 15|
|2018-01-01| rug| circle| 2| 5|
|2018-01-01| rug| square| 3| 6|
|2018-02-01| rug| circle| 3| 3|
|2018-02-01| rug| square| 4| 5|
+----------+---------+------------+------------+------------+
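# Number each sub_category within its (date, category) group with row_number, then pivot on that position so each sub-category slot becomes its own column group.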
df.withColumn("fg", F.row_number().over(Window().partitionBy('date', 'category').orderBy("sub_category"))).groupBy('date', 'category', ).pivot('fg').sum('metric_sales', 'metric_trans').show()
+----------+---------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
| date| category|1_sum(CAST(`metric_sales` AS BIGINT))|1_sum(CAST(`metric_trans` AS BIGINT))|2_sum(CAST(`metric_sales` AS BIGINT))|2_sum(CAST(`metric_trans` AS BIGINT))|3_sum(CAST(`metric_sales` AS BIGINT))|3_sum(CAST(`metric_trans` AS BIGINT))|
+----------+---------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
|2018-02-01| rug| 3| 3| 4| 5| null| null|
|2018-02-01|furniture| 55| 50| 45| 40| 10| 15|
|2018-01-01|furniture| 100| 75| 110| 85| 35| 30|
|2018-01-01| rug| 2| 5| 3| 6| null| null|
+----------+---------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+-------------------------------------+
I have a dataset and I need to drop the columns which have a standard deviation equal to 0. I've tried:
val df = spark.read.option("header", true)
  .option("inferSchema", "false")
  .csv("C:/gg.csv")
val finalresult = df.agg(df.columns.map(stddev(_)).head, df.columns.map(stddev(_)).tail: _*)
I want to compute the standard deviation of each column and drop the column if it is equal to zero.
RowNumber,Poids,Age,Taille,0MI,Hmean,CoocParam,LdpParam,Test2,Classe,
0,87,72,160,5,0.6993,2.9421,2.3745,3,4,
1,54,70,163,5,0.6301,2.7273,2.2205,3,4,
2,72,51,164,5,0.6551,2.9834,2.3993,3,4,
3,75,74,170,5,0.6966,2.9654,2.3699,3,4,
4,108,62,165,5,0.6087,2.7093,2.1619,3,4,
5,84,61,159,5,0.6876,2.938,2.3601,3,4,
6,89,64,168,5,0.6757,2.9547,2.3676,3,4,
7,75,72,160,5,0.7432,2.9331,2.3339,3,4,
8,64,62,153,5,0.6505,2.7676,2.2255,3,4,
9,82,58,159,5,0.6748,2.992,2.4043,3,4,
10,67,49,160,5,0.6633,2.9367,2.333,3,4,
11,85,53,160,5,0.6821,2.981,2.3822,3,4,
You can try this: use getValuesMap and filter to get the names of the columns you want to drop, and then drop them:
//Extract the standard deviation from the data frame summary:
val stddev = df.describe().filter($"summary" === "stddev").drop("summary").first()
// Use `getValuesMap` and `filter` to get the columns names where stddev is equal to 0:
val to_drop = stddev.getValuesMap[String](df.columns).filter{ case (k, v) => v.toDouble == 0 }.keys
//Drop 0 stddev columns
df.drop(to_drop.toSeq: _*).show
+---------+-----+---+------+------+---------+--------+
|RowNumber|Poids|Age|Taille| Hmean|CoocParam|LdpParam|
+---------+-----+---+------+------+---------+--------+
| 0| 87| 72| 160|0.6993| 2.9421| 2.3745|
| 1| 54| 70| 163|0.6301| 2.7273| 2.2205|
| 2| 72| 51| 164|0.6551| 2.9834| 2.3993|
| 3| 75| 74| 170|0.6966| 2.9654| 2.3699|
| 4| 108| 62| 165|0.6087| 2.7093| 2.1619|
| 5| 84| 61| 159|0.6876| 2.938| 2.3601|
| 6| 89| 64| 168|0.6757| 2.9547| 2.3676|
| 7| 75| 72| 160|0.7432| 2.9331| 2.3339|
| 8| 64| 62| 153|0.6505| 2.7676| 2.2255|
| 9| 82| 58| 159|0.6748| 2.992| 2.4043|
| 10| 67| 49| 160|0.6633| 2.9367| 2.333|
| 11| 85| 53| 160|0.6821| 2.981| 2.3822|
+---------+-----+---+------+------+---------+--------+
OK, I have written a solution that is independent of your dataset. Required imports and example data:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{lit, stddev, col}
val df = spark.range(1, 1000).withColumn("X2", lit(0)).toDF("X1","X2")
df.show(5)
// +---+---+
// | X1| X2|
// +---+---+
// | 1| 0|
// | 2| 0|
// | 3| 0|
// | 4| 0|
// | 5| 0|
First compute standard deviation by column:
// no need to rename, but it makes the result easier to read when you show stddevs
val aggs = df.columns.map(c => stddev(c).as(c))
val stddevs = df.select(aggs: _*)
stddevs.show // stddevs contains the stddev of each column
// +-----------------+---+
// | X1| X2|
// +-----------------+---+
// |288.5307609250702|0.0|
// +-----------------+---+
Collect the first row and filter columns to keep:
val columnsToKeep: Seq[Column] = stddevs.first // Take first row
  .toSeq                                       // convert to Seq[Any]
  .zip(df.columns)                             // zip with column names
  .collect {
    // keep only names where stddev != 0
    case (s: Double, c) if s != 0.0 => col(c)
  }
Select and check the results:
df.select(columnsToKeep: _*).show
// +---+
// | X1|
// +---+
// | 1|
// | 2|
// | 3|
// | 4|
// | 5|