I have the following table:
|Type | Year | amount |
|-----|------|--------|
|t1   | 2001 | 40     |
|t1   | 2000 | 50     |
|t2   | 2003 | 30     |
|t2   | 2003 | 20     |
|t3   | 2004 | 10     |
and I would like to show it as:
|Type | 2001 | 2000 | 2003 | 2004 |
|-----|------|------|------|------|
|t1   | 40   | 50   | 0    | 0    |
|t2   | 0    | 0    | 50   | 0    |
|t3   | 0    | 0    | 0    | 10   |
I don't want to hard-code the years, and I need to do this in PostgreSQL 8.4, which doesn't support:
CREATE EXTENSION IF NOT EXISTS tablefunc;
I have pivoted a table before, using the following code:
sum(CASE
        WHEN year = 2000 THEN total
        ELSE 0
    END)
Here total = sum(amount) for each year, which I had calculated in another CTE. At that time the years were known in advance, but for the table above I need to loop through the years, read each one, and calculate its sum(amount), and the years in the main table may change.
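Since tablefunc is unavailable, one workaround is to build the pivot query dynamically in two steps: first generate one sum(CASE ...) column per distinct year, then run the generated statement. A minimal sketch, assuming the table is named mytable (8.4 has array_agg and array_to_string, but no string_agg and no ordered aggregates, hence the ordered subquery):

-- Step 1: generate the pivot statement from the years present in the data
SELECT 'SELECT type, '
       || array_to_string(array_agg(
              'sum(CASE WHEN year = ' || year
              || ' THEN amount ELSE 0 END) AS "' || year || '"'
          ), ', ')
       || ' FROM mytable GROUP BY type ORDER BY type;'
FROM (SELECT DISTINCT year FROM mytable ORDER BY year) y;

-- Step 2: run the statement this query returns, either by hand in the
-- client or via EXECUTE in a PL/pgSQL function whose return type you
-- declare to match the current set of years.

That way nothing is hard-coded: when new years appear in the table, step 1 simply emits more columns.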
I have a problem with pivot tables and I don't understand what to do. My table is as follows:
|CODART|MONTH|QT |
|------|-----|----|
|ART1 |1 |100 |
|ART2 |1 |30 |
|ART3 |1 |30 |
|ART1 |2 |10 |
|ART4 |2 |40 |
|ART3 |4 |50 |
|ART5 |4 |60 |
I would like to get a summary table by month:
|CODART|1  |2  |3  |4  |5  |6  |7  |8  |9  |10 |11 |12 |
|------|---|---|---|---|---|---|---|---|---|---|---|---|
|ART1  |100|10 |   |   |   |   |   |   |   |   |   |   |
|ART2  |30 |   |   |   |   |   |   |   |   |   |   |   |
|ART3  |30 |   |   |50 |   |   |   |   |   |   |   |   |
|ART4  |   |40 |   |   |   |   |   |   |   |   |   |   |
|ART5  |   |   |   |60 |   |   |   |   |   |   |   |   |
|TOTAL |160|50 |   |110|   |   |   |   |   |   |   |   |
Am I asking too much? :-)
Thanks for the support.
WITH MYTAB (CODART, MONTH, QT) AS
(
VALUES
('ART1', 1, 100)
, ('ART2', 1, 30)
, ('ART3', 1, 30)
, ('ART1', 2, 10)
, ('ART4', 2, 40)
, ('ART3', 4, 50)
, ('ART5', 4, 60)
)
SELECT
CASE GROUPING (CODART) WHEN 0 THEN CODART ELSE 'TOTAL' END AS CODART
, SUM (CASE MONTH WHEN 1 THEN QT END) AS "1"
, SUM (CASE MONTH WHEN 2 THEN QT END) AS "2"
, SUM (CASE MONTH WHEN 3 THEN QT END) AS "3"
, SUM (CASE MONTH WHEN 4 THEN QT END) AS "4"
-- ... repeat for months 5 through 11 ...
, SUM (CASE MONTH WHEN 12 THEN QT END) AS "12"
FROM MYTAB T
GROUP BY ROLLUP (T.CODART)
ORDER BY GROUPING (T.CODART), T.CODART
The result of this query:
|CODART|1  |2  |3  |4  |12 |
|------|---|---|---|---|---|
|ART1  |100|10 |   |   |   |
|ART2  |30 |   |   |   |   |
|ART3  |30 |   |   |50 |   |
|ART4  |   |40 |   |   |   |
|ART5  |   |   |   |60 |   |
|TOTAL |160|50 |   |110|   |
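If your DBMS lacks GROUPING/ROLLUP (PostgreSQL, for instance, only added them in 9.5), a sketch of an alternative is plain conditional aggregation plus a UNION ALL for the TOTAL row, reusing the MYTAB CTE from above:

SELECT CODART
     , SUM(CASE MONTH WHEN 1 THEN QT END) AS "1"
     , SUM(CASE MONTH WHEN 2 THEN QT END) AS "2"
     -- ... repeat for months 3 through 12 ...
FROM MYTAB
GROUP BY CODART
UNION ALL
SELECT 'TOTAL'
     , SUM(CASE MONTH WHEN 1 THEN QT END)
     , SUM(CASE MONTH WHEN 2 THEN QT END)
     -- ... repeat for months 3 through 12 ...
FROM MYTAB
ORDER BY 1;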
I have a task involving hierarchical data, but the source data contains errors in the hierarchy, namely: some parent-child links are broken. I have an algorithm for re-establishing such links, but I have not yet been able to implement it on my own.
Example:
Initial data is
+------+----+----------+-------+
| NAME | ID | PARENTID | LEVEL |
+------+----+----------+-------+
| A1 | 1 | 2 | 1 |
| B1 | 2 | 3 | 2 |
| C1 | 18 | 4 | 3 |
| C2 | 3 | 5 | 3 |
| D1 | 4 | NULL | 4 |
| D2 | 5 | NULL | 4 |
| D3 | 10 | 11 | 4 |
| E1 | 11 | NULL | 5 |
+------+----+----------+-------+
Schematically it looks like a tree (diagram omitted). As you can see, the connections to C1 and D3 are lost here.
To restore the connections, I need to apply the following algorithm to this table:
if the ID of some NAME does not appear in the PARENTID column (like ID = 18 or 10), then create a 'parent' row with LEVEL = (current LEVEL - 1) and PARENTID = (current ID), taking the ID and NAME of a node from the LEVEL above whose ID is less than the current ID.
The result must look like:
+------+----+----------+-------+
| NAME | ID | PARENTID | LEVEL |
+------+----+----------+-------+
| A1 | 1 | 2 | 1 |
| B1 | 2 | 3 | 2 |
| B1 | 2 | 18 | 2 |#
| C1 | 18 | 4 | 3 |
| C2 | 3 | 5 | 3 |
| C2 | 3 | 10 | 3 |#
| D1 | 4 | NULL | 4 |
| D2 | 5 | NULL | 4 |
| D3 | 10 | 11 | 4 |
| E1 | 11 | NULL | 5 |
+------+----+----------+-------+
The rows marked with # are the newly created rows, and the new schema forms a connected tree (diagram omitted). Are there any ideas on how to implement this algorithm in Spark/Scala? Thanks!
You can build a createdRows dataframe from your current dataframe, then union it with your current dataframe to obtain your final dataframe.
You can build this createdRows dataframe in several steps:
The first step is to get the IDs (and LEVELs) that do not appear in the PARENTID column. You can use a self left anti join to do that.
Then, you rename the ID column to PARENTID and update the LEVEL column, decreasing it by 1.
Then, you take the ID and NAME columns of the new rows by joining with your input dataframe on the LEVEL column.
Finally, you apply your condition ID < PARENTID.
You end up with the following code, where dataframe is the dataframe with your initial data:
import org.apache.spark.sql.functions.col
val createdRows = dataframe
// if for some NAME the ID is not in the PARENTID column (like ID = 18, 10)
.select("LEVEL", "ID")
.filter(col("LEVEL") > 1) // Remove root node from created rows
.join(dataframe.select("PARENTID"), col("PARENTID") === col("ID"), "left_anti")
// then create a row with a 'parent' with LEVEL = (current LEVEL - 1) and PARENTID = (current ID)
.withColumnRenamed("ID", "PARENTID")
.withColumn("LEVEL", col("LEVEL") - 1)
// and take ID and NAME
.join(dataframe.select("NAME", "ID", "LEVEL"), Seq("LEVEL"))
// such that the current ID < ID of the node from the LEVEL above.
.filter(col("ID") < col("PARENTID"))
val result = dataframe
.unionByName(createdRows)
.orderBy("NAME", "PARENTID") // Optional, if you want an ordered result
And in the result dataframe you get:
+----+---+--------+-----+
|NAME|ID |PARENTID|LEVEL|
+----+---+--------+-----+
|A1 |1 |2 |1 |
|B1 |2 |3 |2 |
|B1 |2 |18 |2 |
|C1 |18 |4 |3 |
|C2 |3 |5 |3 |
|C2 |3 |10 |3 |
|D1 |4 |null |4 |
|D2 |5 |null |4 |
|D3 |10 |11 |4 |
|E1 |11 |null |5 |
+----+---+--------+-----+
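If you'd rather express the same steps in SQL, here is a sketch in Spark SQL, assuming the initial dataframe has been registered as a temporary view named tree via dataframe.createOrReplaceTempView("tree"); the IS NOT NULL guard matters, because NOT IN returns no rows when the subquery yields a NULL:

WITH orphans AS (
    -- IDs that never appear as someone's PARENTID
    SELECT ID AS PARENTID, LEVEL - 1 AS LEVEL
    FROM tree
    WHERE LEVEL > 1
      AND ID NOT IN (SELECT PARENTID FROM tree WHERE PARENTID IS NOT NULL)
)
SELECT t.NAME, t.ID, o.PARENTID, o.LEVEL    -- created parent rows
FROM orphans o
JOIN tree t ON t.LEVEL = o.LEVEL AND t.ID < o.PARENTID
UNION ALL
SELECT NAME, ID, PARENTID, LEVEL FROM tree  -- original rows
ORDER BY NAME, PARENTID;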
Let's say we have
|bin | min | end | start |
|----|-----|-----|-------|
|1   | 5   | 10  |       |
|2   | 12  | 24  |       |
|3   | 28  | 36  |       |
|4   | 40  | 50  |       |
|5   |null |null |       |
I want to populate start with the previous row's end to make the bin values continuous. For the first row, where there is no previous end, I would like to fill in the current row's min instead. The null row I consider treating separately.
What lag gives us would be:
df.withColumn("start", F.lag(F.col("end"), 1, ***default_value***).over(Window.orderBy(F.col("bin"))))
|bin | min | end | start      |
|----|-----|-----|------------|
|1   | 5   | 10  | (5 wanted) |
|2   | 12  | 24  | 10         |
|3   | 28  | 36  | 24         |
|4   | 40  | 50  | 36         |
|5   |null |null | null       |
My questions:
1/ What do we put in default_value so that lag takes another column of the current row, in this case min?
2/ Is there a way to treat the null row at the same time, without separating it out? I intend to filter the non-null rows, perform lag, then union back with the null rows. How does the answer differ if the null row is the first (bin 1) or the last (bin 5)?
lag's default value is a literal, not another column, so use coalesce to fall back to min for the first row in a group.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df.withColumn("start", F.coalesce(F.lag(F.col("end"), 1).over(Window.orderBy("bin")), F.col("min")))
lag currently doesn't support an ignorenulls option, so you might have to separate out the null rows, compute the start column for the non-null rows, and union the data frames back together.
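For reference, a sketch of the same coalesce-plus-lag logic in Spark SQL, assuming the dataframe is exposed as a temp view named bins (backticks because min and end collide with SQL keywords):

SELECT bin, `min`, `end`,
       COALESCE(LAG(`end`) OVER (ORDER BY bin), `min`) AS start
FROM bins;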
I have a table which contains transaction dates and the balances of those transactions. Example Below:
select id, transaction_bal1, transaction_bal2, transaction_date
from transactions
Which results in
ID | transaction_bal1 | transaction_bal2 | transaction_date
1 | -10000 | 1000 | 2017.01.02
2 | 4000 | 1000 | 2017.02.02
3 | 4000 | 1000 | 2017.03.02
etc...
What I want to do is generate a series with '1 day'::interval, selecting all days between the transaction dates so that the rows in the above table fall under the right day. Something like this:
Gen_series | ID | transaction_bal1 | transaction_bal2 | transaction_date
2017.01.01 |null| 0 | 0 | null
2017.01.02 |1 | -10000 | 1000 | 2017.01.02
2017.01.03 |null| 0 | 0 | null
...
2017.02.01 |null| 0 | 0 | null
2017.02.02 |2   | 4000          | 1000          | 2017.02.02
2017.02.03 |null| 0 | 0 | null
etc...
I use PostgreSQL (I don't know which version), but I use pgAdmin 4 3.2, if that is of any help.
Feel free to ask any questions if I need to flesh out anything.
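A sketch of one way to do this with generate_series and a LEFT JOIN, assuming transaction_date is a DATE column; the series bounds (one day before the first and one day after the last transaction) and the zero defaults are guesses based on the sample output:

SELECT d.day::date AS gen_series,
       t.id,
       COALESCE(t.transaction_bal1, 0) AS transaction_bal1,
       COALESCE(t.transaction_bal2, 0) AS transaction_bal2,
       t.transaction_date
FROM generate_series(
         (SELECT min(transaction_date) - interval '1 day' FROM transactions),
         (SELECT max(transaction_date) + interval '1 day' FROM transactions),
         '1 day'::interval
     ) AS d(day)
LEFT JOIN transactions t ON t.transaction_date = d.day::date
ORDER BY d.day;

Days with no transaction come out of the LEFT JOIN with NULL id and transaction_date, and the COALESCE turns their balances into 0, matching the desired output.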
I have a table cash_drawer which stores the quantity of each denomination of currency for each day, at day end:
cash_drawer(
    date DATE,
    "100" SMALLINT,
    "50" SMALLINT,
    "20" SMALLINT,
    "10" SMALLINT,
    "5" SMALLINT,
    "1" SMALLINT
)

(The denomination column names have to be double-quoted, since identifiers cannot start with a digit.)
Now, for any given day, I wish to get each denomination as a row.
Let's say for day 2016-11-25 we have the following row:
+------------+-------+------+------+------+-----+-----+
| date | 100 | 50 | 20 | 10 | 5 | 1 |
+------------+-------+------+------+------+-----+-----+
| 2016-11-25 | 5 | 12 | 27 | 43 | 147 | 129 |
+------------+-------+------+------+------+-----+-----+
Now I wish to get the output of the query as:
+------------+--------+
|denomination|quantity|
+------------+--------+
|100 |5 |
+------------+--------+
|50 |12 |
+------------+--------+
|20 |27 |
+------------+--------+
|10 |43 |
+------------+--------+
|5 |147 |
+------------+--------+
|1 |129 |
+------------+--------+
Is there a method by which this is possible? If you have any other suggestions, please feel free to share them.
Use json functions:
select key as denomination, value as quantity
from cash_drawer c,
lateral json_each(row_to_json(c))
where key <> 'date'
and date = '2016-11-25';
denomination | quantity
--------------+----------
100 | 5
50 | 12
20 | 27
10 | 43
5 | 147
1 | 129
(6 rows)
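Since you asked for other suggestions: an alternative sketch that avoids the JSON round-trip is a LATERAL (VALUES ...) unpivot. It hardcodes the denomination columns, which must be double-quoted because they start with digits:

SELECT v.denomination, v.quantity
FROM cash_drawer c,
     LATERAL (VALUES (100, c."100"), (50, c."50"), (20, c."20"),
                     (10, c."10"), (5, c."5"), (1, c."1")
             ) AS v(denomination, quantity)
WHERE c.date = '2016-11-25';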