Quicksight: Get Year to Date totals in table visual - visualization

I want to show a year-to-date (YTD) total in a QuickSight table visual, in addition to the overall total. Currently my visualization looks like this:
Current:
|**Dates**|Jan20|Feb20|Mar20|Apr20|May20|Jun20|Jul20|Aug20|Sep20|Oct20|Nov20|Dec20|Jan21|**Total**|
|**number**|20 | 30 | 40 |50 | 60 | 75| 14 |15 | 25 | 45| 25 | 36| 21| **456**|
Expected:
|**Dates**|Jan20|Feb20|Mar20|Apr20|May20|Jun20|Jul20|Aug20|Sep20|Oct20|Nov20|Dec20|Jan21|**Total YTD21**| **Total**|
|**number**|20 | 30 | 40 |50 | 60 | 75| 14 |15 | 25 | 45| 25 | 36| 21| **21**| **456**|
Is there any way to get a YTD total like the one above in a QuickSight table visual? Please refer to the attached image for more detail.
Any help would be appreciated.
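To make the expected numbers concrete, here is a small plain-Python sketch (not QuickSight syntax) of the two totals in the tables above. The names `monthly`, `total` and `ytd_total` are hypothetical, and the sketch assumes that "YTD21" means every month of 2021 present in the data.

```python
from datetime import date

# Monthly values copied from the table in the question (Jan20 .. Jan21).
values = [20, 30, 40, 50, 60, 75, 14, 15, 25, 45, 25, 36, 21]
months = [date(2020, m, 1) for m in range(1, 13)] + [date(2021, 1, 1)]
monthly = dict(zip(months, values))


def total(data):
    """Overall total across every month in the data."""
    return sum(data.values())


def ytd_total(data, year):
    """Total restricted to the months that fall in the given year."""
    return sum(v for d, v in data.items() if d.year == year)


print(total(monthly))            # 456
print(ytd_total(monthly, 2021))  # 21
```

The YTD column is therefore just the same sum with a filter on the year of each date.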

PySpark Relate multiple rows

I have a problem where I need to relate rows to each other. I have tried many things but I am now completely stuck. I have tried partitioning, lag and groupBy, but nothing works.
The rows below a row with ID 26 should relate to the MPAN of that ID 26 row:
ID | MPAN | Value
---------------------
26 | 12345678 | Hello
27 | 99900234 | Bye
30 | 77563820 | Help
33 | 89898937 | Stuck
26 | 54877273 | Need a genius
29 | 54645643 | So close
30 | 22222222 | Thanks
e.g.
ID | MPAN | Value | Relation
----------------------------------------
26 | 12345678 | Hello | NULL
27 | 99900234 | Bye | 12345678
30 | 77563820 | Help | 12345678
33 | 89898937 | Stuck | 12345678
26 | 54877273 | Genius | NULL
29 | 54645643 | So close | 54877273
30 | 22222222 | Thanks | 54877273
The code below only works for the previous row; it does not look back to the ID 26 record:
df = spark.read.load('abfss://Files/', format='parquet')
df = df.withColumn("identity", F.monotonically_increasing_id())
win = Window.orderBy("identity")
condition = F.col("Prop_0") != '026'
df = df.withColumn("FlagY", F.when(condition, mpanlookup))
df.show()
As I said in my comment, you need a column that maintains the order. In your example, you used monotonically_increasing_id to create that "ordering" column, but it is not a reliable choice. As the documentation puts it:
The function is non-deterministic because its result depends on partition IDs.
Assuming you have a proper "ordering" column:
df.show()
+---+---+--------+-------------+
|idx| ID| MPAN| Value|
+---+---+--------+-------------+
| 1| 26|12345678|Hello |
| 2| 27|99900234|Bye |
| 3| 30|77563820|Help |
| 4| 33|89898937|Stuck |
| 5| 26|54877273|Need a genius|
| 6| 29|54645643|So close |
| 7| 30|22222222|Thanks |
+---+---+--------+-------------+
you can do this with the last function, ignoring nulls, over a window ordered by that column:
from pyspark.sql import functions as F, Window
df.withColumn(
    "Relation",
    F.last(F.when(F.col("ID") == 26, F.col("MPAN")), ignorenulls=True).over(
        Window.orderBy("idx")
    ),
).show()
+---+---+--------+-------------+--------+
|idx| ID| MPAN| Value|Relation|
+---+---+--------+-------------+--------+
| 1| 26|12345678|Hello |12345678|
| 2| 27|99900234|Bye |12345678|
| 3| 30|77563820|Help |12345678|
| 4| 33|89898937|Stuck |12345678|
| 5| 26|54877273|Need a genius|54877273|
| 6| 29|54645643|So close |54877273|
| 7| 30|22222222|Thanks |54877273|
+---+---+--------+-------------+--------+
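The logic of that window expression can be sketched in plain Python (no Spark involved): walk the rows in `idx` order and carry forward the MPAN of the most recent ID 26 row. The function `relate` and its `include_anchor` flag are hypothetical names for this illustration. With `include_anchor=False` the ID 26 rows themselves get no relation, which matches the NULLs in the question's expected output rather than the output shown above.

```python
# Rows as (idx, ID, MPAN, Value), copied from the example above.
rows = [
    (1, 26, "12345678", "Hello"),
    (2, 27, "99900234", "Bye"),
    (3, 30, "77563820", "Help"),
    (4, 33, "89898937", "Stuck"),
    (5, 26, "54877273", "Need a genius"),
    (6, 29, "54645643", "So close"),
    (7, 30, "22222222", "Thanks"),
]


def relate(rows, anchor_id=26, include_anchor=True):
    """Attach to each row the MPAN of the most recent anchor row (by idx)."""
    result = []
    last_mpan = None
    for idx, row_id, mpan, value in sorted(rows, key=lambda r: r[0]):
        is_anchor = row_id == anchor_id
        if is_anchor:
            last_mpan = mpan  # a new group starts here
        # Optionally blank out the relation on the anchor row itself.
        relation = None if (is_anchor and not include_anchor) else last_mpan
        result.append((idx, row_id, mpan, value, relation))
    return result


for row in relate(rows, include_anchor=False):
    print(row)
```

The sketch only works because `idx` gives a stable order, which is the same reason the Spark version needs a proper ordering column.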

Create a Dataframe based on ranges of other Dataframe

I have a Spark Dataframe containing ranges of numbers (column start and column end), and a column containing the type of this range.
I want to create a new Dataframe with two columns, the first one lists all ranges (incremented by one), and the second one lists the range's type.
To explain more, this is the input Dataframe :
+-------+------+---------+
| start | end | type |
+-------+------+---------+
| 10 | 20 | LOW |
| 21 | 30 | MEDIUM |
| 31 | 40 | HIGH |
+-------+------+---------+
And this is the desired result :
+-------+---------+
| nbr | type |
+-------+---------+
| 10 | LOW |
| 11 | LOW |
| 12 | LOW |
| 13 | LOW |
| 14 | LOW |
| 15 | LOW |
| 16 | LOW |
| 17 | LOW |
| 18 | LOW |
| 19 | LOW |
| 20 | LOW |
| 21 | MEDIUM |
| 22 | MEDIUM |
| .. | ... |
+-------+---------+
Any ideas?
Try this.
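import org.apache.spark.sql.functions.{explode, sequence}
import spark.implicits._
val data = List((10, 20, "Low"), (21, 30, "MEDIUM"), (31, 40, "High"))
val df = data.toDF("start", "end", "type")
// sequence builds an array start..end (inclusive); explode turns it into one row per element
df.withColumn("nbr", explode(sequence($"start", $"end"))).drop("start", "end").show(false)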
output:
+------+---+
|type |nbr|
+------+---+
|Low |10 |
|Low |11 |
|Low |12 |
|Low |13 |
|Low |14 |
|Low |15 |
|Low |16 |
|Low |17 |
|Low |18 |
|Low |19 |
|Low |20 |
|MEDIUM|21 |
|MEDIUM|22 |
|MEDIUM|23 |
|MEDIUM|24 |
|MEDIUM|25 |
|MEDIUM|26 |
|MEDIUM|27 |
|MEDIUM|28 |
|MEDIUM|29 |
+------+---+
only showing top 20 rows
The solution provided by @Learn-Hadoop works if you're on Spark 2.4+.
For older Spark versions, consider creating a simple UDF to mimic the sequence function:
import org.apache.spark.sql.functions.{explode, udf}
// Returns lower, lower + 1, ..., upper (inclusive)
val sequence = udf { (lower: Int, upper: Int) =>
  Seq.iterate(lower, upper - lower + 1)(_ + 1)
}
df.withColumn("nbr", explode(sequence($"start", $"end"))).drop("start", "end").show(false)
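As a rough illustration of what the sequence-then-explode step does, here is a plain-Python sketch (not Spark code). The function name `expand_ranges` is hypothetical. Each `(start, end, type)` row becomes one output row per integer in the inclusive range.

```python
# Input rows as (start, end, type), copied from the question.
ranges = [(10, 20, "LOW"), (21, 30, "MEDIUM"), (31, 40, "HIGH")]


def expand_ranges(ranges):
    """Emit one (nbr, type) pair for every integer in each inclusive range."""
    return [
        (nbr, range_type)
        for start, end, range_type in ranges
        for nbr in range(start, end + 1)  # +1 because `end` is inclusive
    ]


expanded = expand_ranges(ranges)
print(expanded[:3])   # [(10, 'LOW'), (11, 'LOW'), (12, 'LOW')]
print(len(expanded))  # 31
```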

How can I sort the months in calendar order (January to December) in a Scala DataFrame

+---------+------------------+
| Month|sum(buss_days)|
+---------+------------------+
| April| 83.93|
| August| 94.895|
| December| 53.47|
| February| 22.90|
| January| 97.45|
| July| 95.681|
| June| 23.371|
| March| 35.957|
| May| 4.24|
| November| 1.56|
| October| 1.00|
|September| 93.51|
+---------+------------------+
and I want output like this:
+---------+------------------+
| Month|sum(avg_buss_days)|
+---------+------------------+
| January| 97.45|
| February| 22.90|
| March| 35.957|
| April| 83.93|
| May| 4.24|
| June| 23.371|
| July| 95.681|
| August| 94.895|
|September| 93.51|
| October| 1.00|
| November| 1.56|
| December| 53.47|
+---------+------------------+
This is what I tried:
df.groupBy("Month[order(match(month$month, month.abb)), ]")
And I got this:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name "Month[order(match(month$month, month.abb)), ]".
Here, Month is a column name in the DataFrame.
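Converting the Month column to a date form and sorting on that should do it.
The key expression is unix_timestamp(col("Month"), "MMMM"), where the pattern "MMMM" parses the full month name:
import org.apache.spark.sql.functions.{col, unix_timestamp}
df.sort(unix_timestamp(col("Month"), "MMMM")).show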
+---------+-------------+
| Month|avg_buss_days|
+---------+-------------+
| January| 97.45|
| February| 22.90|
| March| 35.957|
| April| 83.93|
| May| 4.24|
| June| 23.371|
| July| 95.681|
| August| 94.895|
|September| 93.51|
| October| 1.00|
| November| 1.56|
| December| 53.47|
+---------+-------------+
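The idea behind the sort is to replace each month name with something that orders chronologically. A plain-Python sketch of the same idea (not Spark code) maps each name to its position in the calendar; `MONTHS`, `MONTH_ORDER` and `rows` are hypothetical names, and English month names are assumed.

```python
# Calendar order, hard-coded so the result does not depend on the system locale.
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
MONTH_ORDER = {name: position for position, name in enumerate(MONTHS, start=1)}

# Rows in the alphabetical order shown in the question.
rows = [
    ("April", 83.93), ("August", 94.895), ("December", 53.47),
    ("February", 22.90), ("January", 97.45), ("July", 95.681),
    ("June", 23.371), ("March", 35.957), ("May", 4.24),
    ("November", 1.56), ("October", 1.00), ("September", 93.51),
]

# Sort by calendar position instead of by the month name itself.
sorted_rows = sorted(rows, key=lambda row: MONTH_ORDER[row[0]])

for month, days in sorted_rows:
    print(f"{month:>9} {days}")
```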

Time series with scala and spark. Rolling window

I'm trying to work on the following exercise using Scala and spark.
Given a file containing two columns: a time in seconds and a value
Example:
|---------------------|------------------|
| seconds | value |
|---------------------|------------------|
| 225 | 1,5 |
| 245 | 0,5 |
| 300 | 2,4 |
| 319 | 1,2 |
| 320 | 4,6 |
|---------------------|------------------|
and given a value V to be used for the rolling window this output should be created:
Example with V=20
|--------------|---------|--------------------|----------------------|
| seconds | value | num_row_in_window |sum_values_in_windows |
|--------------|---------|--------------------|----------------------|
| 225 | 1,5 | 1 | 1,5 |
| 245 | 0,5 | 2 | 2 |
| 300 | 2,4 | 1 | 2,4 |
| 319 | 1,2 | 2 | 3,6 |
| 320 | 4,6 | 3 | 8,2 |
|--------------|---------|--------------------|----------------------|
num_row_in_window is the number of rows contained in the current window and
sum_values_in_windows is the sum of the values contained in the current window.
I've been trying with the sliding function and with the SQL API, but it's unclear to me which is the best way to tackle this problem, considering that I'm a Spark/Scala novice.
This is a perfect application for window functions. By using rangeBetween you can set your sliding window to 20s. Note that in the example below no partitioning is specified (no partitionBy). Without partitioning, Spark moves all rows into a single partition to evaluate the window, so this code will not scale:
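import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, sum}
import ss.implicits._ // ss is the SparkSession
val df = Seq(
  (225, 1.5),
  (245, 0.5),
  (300, 2.4),
  (319, 1.2),
  (320, 4.6)
).toDF("seconds", "value")
// Range frame: all rows whose seconds value lies in [current - 20, current]
val window = Window.orderBy($"seconds").rangeBetween(-20L, 0L) // add partitioning here
df
  .withColumn("num_row_in_window", sum(lit(1)).over(window))
  .withColumn("sum_values_in_window", sum($"value").over(window))
  .show()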
+-------+-----+-----------------+--------------------+
|seconds|value|num_row_in_window|sum_values_in_window|
+-------+-----+-----------------+--------------------+
| 225| 1.5| 1| 1.5|
| 245| 0.5| 2| 2.0|
| 300| 2.4| 1| 2.4|
| 319| 1.2| 2| 3.6|
| 320| 4.6| 3| 8.2|
+-------+-----+-----------------+--------------------+
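To see what the range frame computes, here is a plain-Python sketch (not Spark code) of the same rule: for each row, take every row whose `seconds` lies in the closed interval from `seconds - V` to `seconds`. The function name `rolling` is hypothetical, and the quadratic scan is only meant for illustration on small inputs.

```python
# Rows as (seconds, value), copied from the example above.
rows = [(225, 1.5), (245, 0.5), (300, 2.4), (319, 1.2), (320, 4.6)]


def rolling(rows, v):
    """For each row, count and sum the rows with seconds in [seconds - v, seconds]."""
    ordered = sorted(rows)
    result = []
    for seconds, value in ordered:
        in_window = [val for s, val in ordered if seconds - v <= s <= seconds]
        # Round the sum only to hide floating-point noise in the printed output.
        result.append((seconds, value, len(in_window), round(sum(in_window), 10)))
    return result


for row in rolling(rows, 20):
    print(row)
```

Both bounds are inclusive, which is why the row at 245 still sees the row at 225 when V=20.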

Convert a single row into multiple rows by the columns in Postgresql

I have a table cash_drawer which stores quantity for each denomination of currency for each day at day end:
cash_drawer(
date DATE,
100 SMALLINT,
50 SMALLINT,
20 SMALLINT,
10 SMALLINT,
5 SMALLINT,
1 SMALLINT
)
Now, for any given day, I wish to get each denomination as a row.
Let's say that for the day 2016-11-25 we have the following row:
+------------+-------+------+------+------+-----+-----+
| date | 100 | 50 | 20 | 10 | 5 | 1 |
+------------+-------+------+------+------+-----+-----+
| 2016-11-25 | 5 | 12 | 27 | 43 | 147 | 129 |
+------------+-------+------+------+------+-----+-----+
I wish to get the output of the query as:
+------------+--------+
|denomination|quantity|
+------------+--------+
|100 |5 |
+------------+--------+
|50 |12 |
+------------+--------+
|20 |27 |
+------------+--------+
|10 |43 |
+------------+--------+
|5 |147 |
+------------+--------+
|1 |129 |
+------------+--------+
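Is there a method by which this is possible? If you have any other suggestions, please feel free to share them.
Use the JSON functions: row_to_json turns the row into a JSON object, and json_each expands that object into one (key, value) row per column: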
select key as denomination, value as quantity
from cash_drawer c,
    lateral json_each(row_to_json(c))
where key <> 'date'
    and date = '2016-11-25';
denomination | quantity
--------------+----------
100 | 5
50 | 12
20 | 27
10 | 43
5 | 147
1 | 129
(6 rows)
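The same row-to-rows idea can be sketched in plain Python (not SQL): treat the row as a key/value mapping, then emit one pair per column while skipping the date. The names `row` and `unpivot` are hypothetical and only serve this illustration.

```python
# One cash_drawer row as a column -> value mapping, copied from the question.
row = {"date": "2016-11-25", "100": 5, "50": 12, "20": 27, "10": 43, "5": 147, "1": 129}


def unpivot(row, skip=("date",)):
    """Turn one wide row into (denomination, quantity) pairs, one per column."""
    return [(key, value) for key, value in row.items() if key not in skip]


for denomination, quantity in unpivot(row):
    print(denomination, quantity)
```

As with the JSON approach, no column names are hard-coded apart from the one being excluded, so adding a denomination column needs no change to the logic.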