Postgres weighted average given two time interval columns and a separate table

I want to calculate a weighted average for parent orders that are made up of several child orders. The first table defines the parent orders, with the product id, parent order, start_time, and end_time. The second table contains the data I need to aggregate.
Groups Definition Table:
id | order_Parent | start_time          | end_time
---+--------------+---------------------+---------------------
 1 | 1            | 2018-01-26 15:53:00 | 2018-01-26 15:54:00
 2 | 2            | 2018-01-26 15:51:00 | 2018-01-26 16:01:00
 2 | 3            | 2018-01-26 15:27:00 | 2018-01-26 15:35:00
Data Table To Calculate Weighted Average:
id | order_child | time_stamp          | weight | target_value
---+-------------+---------------------+--------+--------------
 1 | 1           | 2018-01-26 15:53:00 |    100 |         99.99
 1 | 1           | 2018-01-26 15:53:00 |    200 |         89.99
 1 | 1           | 2018-01-26 15:53:30 |     50 |        114.99
 2 | 2           | 2018-01-26 15:49:00 |    100 |         49.99
 2 | 2           | 2018-01-26 15:55:00 |    100 |         59.99
 2 | 2           | 2018-01-26 15:57:30 |    250 |         54.99
 2 | 3           | 2018-01-26 15:27:30 |    100 |         54.99
 2 | 3           | 2018-01-26 15:31:30 |     75 |         49.99
 2 | 3           | 2018-01-26 15:34:30 |    100 |         54.99
Ideal Output:
id | order_Parent | start_time          | end_time            | WgtAvg
---+--------------+---------------------+---------------------+-------
 1 | 1            | 2018-01-26 15:53:00 | 2018-01-26 15:54:00 |  96.41
 2 | 2            | 2018-01-26 15:51:00 | 2018-01-26 16:01:00 |  54.99
 2 | 3            | 2018-01-26 15:27:00 | 2018-01-26 15:35:00 |  53.62
The problem seems clear, but I am stumped on how to use the first table for my group definitions in order to calculate the weighted average per parent order.
Any thoughts are greatly appreciated.
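For reference, one common approach is to join the data table to the definition table on the product id and the child/parent order, restrict rows to the parent's time window, and divide the weight-scaled sum by the total weight. Here is a minimal, untested sketch; the table names group_def and child_data are placeholders for whatever the real schema uses:

SELECT g.id,
       g.order_parent,
       g.start_time,
       g.end_time,
       -- weighted average: sum of weight * value divided by total weight;
       -- casts keep ROUND(numeric, 2) valid even for float columns
       ROUND(SUM(d.weight * d.target_value)::numeric
             / NULLIF(SUM(d.weight), 0)::numeric, 2) AS wgt_avg
FROM group_def g    -- assumed name for the groups definition table
JOIN child_data d   -- assumed name for the data table
  ON d.id = g.id
 AND d.order_child = g.order_parent
 AND d.time_stamp BETWEEN g.start_time AND g.end_time
GROUP BY g.id, g.order_parent, g.start_time, g.end_time
ORDER BY g.id, g.order_parent;

One caveat: in the sample data, the 15:49:00 row for order 2 falls outside its 15:51-16:01 window, yet the ideal output of 54.99 only results if that row is included, so the BETWEEN filter may need adjusting to match how the windows are actually meant to apply.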

Related

KDB query: Get blank column values from other table if value is null

I have 2 tables as the result of the following queries:
select customer,date,product,orderId,version,size from tableA where date=2020.04.08,product in (`Derivative)
+----------+----------+------------+---------+---------+------+
| customer | date     | product    | orderId | version | size |
+----------+----------+------------+---------+---------+------+
| XYZ fund | 4/8/2020 | Derivative |       1 |       6 |      |
| XYZ fund | 4/8/2020 | Derivative |       2 |       6 | 1000 |
| XYZ fund | 4/8/2020 | Derivative |       3 |       4 |      |
+----------+----------+------------+---------+---------+------+
select sum size by date,product,parent_orderId,parent_version from tableB where date=2020.04.08,product in (`Derivative)
+----------+------------+----------------+----------------+------+
| date     | product    | parent_orderId | parent_version | size |
+----------+------------+----------------+----------------+------+
| 4/8/2020 | Derivative |              1 |              1 |   10 |
| 4/8/2020 | Derivative |              1 |              2 |   10 |
| 4/8/2020 | Derivative |              1 |              3 |   10 |
| 4/8/2020 | Derivative |              1 |              4 |   10 |
| 4/8/2020 | Derivative |              1 |              5 |   10 |
| 4/8/2020 | Derivative |              1 |              6 |   10 |
| 4/8/2020 | Derivative |              3 |              1 |   20 |
| 4/8/2020 | Derivative |              3 |              2 |   20 |
| 4/8/2020 | Derivative |              3 |              3 |   20 |
| 4/8/2020 | Derivative |              3 |              4 |   20 |
+----------+------------+----------------+----------------+------+
So basically, if Result 1 has a missing size, I want it populated from Result 2 based on the matching columns: date=date, product=product, orderId=parent_orderId, version=parent_version. Is there any way to do this with a query in KDB?
The expected output is:
+----------+----------+------------+---------+---------+------+
| customer | date     | product    | orderId | version | size |
+----------+----------+------------+---------+---------+------+
| XYZ fund | 4/8/2020 | Derivative |       1 |       6 |   10 |
| XYZ fund | 4/8/2020 | Derivative |       2 |       6 | 1000 |
| XYZ fund | 4/8/2020 | Derivative |       3 |       4 |   20 |
+----------+----------+------------+---------+---------+------+
You can use the left join operator to achieve this:
q)res1:select customer,date,product,orderId,version,size from tableA where date=2020.04.08,product in (`Derivative);
q)res2:select sum size by date,product,orderId:parent_orderId,version:parent_version from tableB where date=2020.04.08,product in (`Derivative);
q)res1 lj res2
customer  date     product    orderId version size
--------------------------------------------------
XYZ fund  4/8/2020 Derivative 1       6       10
XYZ fund  4/8/2020 Derivative 2       6       1000
XYZ fund  4/8/2020 Derivative 3       4       20
Note that we had to ensure that the column names in the second table matched those we wanted to join on in the first table.
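This works because select ... by returns a keyed table, and lj joins on the key columns of its right-hand argument; renaming parent_orderId and parent_version to orderId and version in res2 is what lets lj line them up with res1.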

Copy the date in an Org table

Suppose a spreadsheet like this in an Org table:
|------------+-------+------------+--------+--------+------------|
| Date       | Items | Unit Price | Amount | Amount | Categories |
|------------+-------+------------+--------+--------+------------|
| 2019/09/17 | A     |       2.64 |      1 |   2.64 | materials  |
|            | B     |      52.67 |      2 | 105.34 | diagnosis  |
|            | C     |       3.08 |      1 |   3.08 | materials  |
|            | D     |       3.85 |      2 |    7.7 | materials  |
|            | E     |      33.66 |      2 |  67.32 | materials  |
|            | F     |         40 |      1 |     40 | treatments |
|            | G     |       16.5 |      1 |   16.5 | materials  |
|            | H     |          4 |      3 |     12 | treatments |
|            | I     |         40 |      1 |     40 | bed        |
|            | M     |          6 |     13 |     78 | treatments |
|------------+-------+------------+--------+--------+------------|
#+TBLFM: $5=$3*$4
How could I copy the date 2019/09/17 down to the bottom of the Date column?
The link that #manandearth posted in the comments describes how to duplicate (perhaps with slight modifications) the entries in a column. Briefly, pressing S-RET in an empty cell copies the contents of the non-empty cell above it; if the current cell is full and the cell below is empty, it copies the full cell down into the empty one. If the contents are numeric, the "duplication" involves a slight modification: it increases the value by 1. The same happens with a date: it advances the date to the next day, but the date has to be in a format that Org mode recognizes, either an active date <YYYY-MM-DD> or an inactive date [YYYY-MM-DD]. The increment is 1 by default in these cases, but can be changed by setting the variable org-table-copy-increment to a different value. That's the "interactive" case I mention in my comment.
The other way to fill a column in a table is by using a formula. For example here's a formula to fill the first column with a copy of the first entry in the column:
#+TBLFM: #3$1..#>$1 = #2$1
This says: Set all rows from row 3 (#3) to the last row (#>) of column 1 ($1) to the value of the cell in row 2 (#2), column 1 ($1). Note that row 1 is the header. Press C-c C-c on the table formula line above and ... wait, what happened?
|------------+-------+------------+--------+--------+------------|
| Date       | Items | Unit Price | Amount | Amount | Categories |
|------------+-------+------------+--------+--------+------------|
| 2019/09/17 | A     |       2.64 |      1 |   2.64 | materials  |
|  13.196078 | B     |      52.67 |      2 | 105.34 | diagnosis  |
|  13.196078 | C     |       3.08 |      1 |   3.08 | materials  |
|  13.196078 | D     |       3.85 |      2 |    7.7 | materials  |
|  13.196078 | E     |      33.66 |      2 |  67.32 | materials  |
|  13.196078 | F     |         40 |      1 |     40 | treatments |
|  13.196078 | G     |       16.5 |      1 |   16.5 | materials  |
|  13.196078 | H     |          4 |      3 |     12 | treatments |
|  13.196078 | I     |         40 |      1 |     40 | bed        |
|  13.196078 | M     |          6 |     13 |     78 | treatments |
|------------+-------+------------+--------+--------+------------|
#+TBLFM: #3$1..#>$1 = #2$1
It does not quite work in this case for a technical reason: Org mode uses Calc in table formula calculations and Calc looks at 2019/09/17 and says: "Aha, I have to divide 2019 by 9 and then divide the result by 17", and fills the rest of the column with the result of the divisions: 13.196078. You may have meant 2019/09/17 to be a date, but Org mode does not know that: it gives it to Calc which interprets it as an arithmetic expression. The solution here is the same as in the linked answer: make Org mode aware that it's a date by making it either an active date: <2019-09-17> or an inactive date: [2019-09-17]:
|------------------+-------+------------+--------+--------+------------|
| Date             | Items | Unit Price | Amount | Amount | Categories |
|------------------+-------+------------+--------+--------+------------|
| [2019-09-17]     | A     |       2.64 |      1 |   2.64 | materials  |
| [2019-09-17 Tue] | B     |      52.67 |      2 | 105.34 | diagnosis  |
| [2019-09-17 Tue] | C     |       3.08 |      1 |   3.08 | materials  |
| [2019-09-17 Tue] | D     |       3.85 |      2 |    7.7 | materials  |
| [2019-09-17 Tue] | E     |      33.66 |      2 |  67.32 | materials  |
| [2019-09-17 Tue] | F     |         40 |      1 |     40 | treatments |
| [2019-09-17 Tue] | G     |       16.5 |      1 |   16.5 | materials  |
| [2019-09-17 Tue] | H     |          4 |      3 |     12 | treatments |
| [2019-09-17 Tue] | I     |         40 |      1 |     40 | bed        |
| [2019-09-17 Tue] | M     |          6 |     13 |     78 | treatments |
|------------------+-------+------------+--------+--------+------------|
#+TBLFM: #3$1..#>$1 = #2$1
This does not do automatic incrementation but if that's what you want, it's easy to accomplish: Calc can do calculations on dates, so we can increment daily by adding to the date in each row the row number minus 2 (e.g. row 3 would get an increment of 3 - 2 = 1, row 4 would get 4 - 2 = 2, etc). To accomplish this, you have to get the row number of the current row: the idiom is ##. Then the formula becomes:
#+TBLFM: #3$1..#>$1 = #2$1 + ## - 2
and the table becomes:
|------------------+-------+------------+--------+--------+------------|
| Date             | Items | Unit Price | Amount | Amount | Categories |
|------------------+-------+------------+--------+--------+------------|
| [2019-09-17]     | A     |       2.64 |      1 |   2.64 | materials  |
| [2019-09-18 Wed] | B     |      52.67 |      2 | 105.34 | diagnosis  |
| [2019-09-19 Thu] | C     |       3.08 |      1 |   3.08 | materials  |
| [2019-09-20 Fri] | D     |       3.85 |      2 |    7.7 | materials  |
| [2019-09-21 Sat] | E     |      33.66 |      2 |  67.32 | materials  |
| [2019-09-22 Sun] | F     |         40 |      1 |     40 | treatments |
| [2019-09-23 Mon] | G     |       16.5 |      1 |   16.5 | materials  |
| [2019-09-24 Tue] | H     |          4 |      3 |     12 | treatments |
| [2019-09-25 Wed] | I     |         40 |      1 |     40 | bed        |
| [2019-09-26 Thu] | M     |          6 |     13 |     78 | treatments |
|------------------+-------+------------+--------+--------+------------|
#+TBLFM: #3$1..#>$1 = #2$1 + ## - 2
The various anomalies of the display of dates (do we include the day of the week? do we include the time?) might be worked around using org-time-stamp-custom-formats but that gets us into waters that I have not explored.

Count continuously in PostgreSQL

I need help with counting some data. This is what I want:
| user_id | action_id | count |
-------------------------------
|       1 |         1 |     1 |
|       2 |         2 |     1 |
|       3 |         2 |     2 |
|       4 |         3 |     1 |
|       5 |         3 |     2 |
|       6 |         3 |     3 |
|       7 |         4 |     1 |
|       8 |         5 |     1 |
|       9 |         5 |     2 |
|      10 |         6 |     1 |
This is what I have:
| user_id | action_id | count |
-------------------------------
|       1 |         1 |     1 |
|       2 |         2 |     1 |
|       3 |         2 |     1 |
|       4 |         3 |     1 |
|       5 |         3 |     1 |
|       6 |         3 |     1 |
|       7 |         4 |     1 |
|       8 |         5 |     1 |
|       9 |         5 |     1 |
|      10 |         6 |     1 |
I really need this for some research about users' second actions. How do I do it? Thank you.
Using ROW_NUMBER should work here:
SELECT
    user_id,
    action_id,
    ROW_NUMBER() OVER (PARTITION BY action_id ORDER BY user_id) AS count
FROM yourTable
ORDER BY
    user_id;
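ROW_NUMBER() restarts at 1 for each action_id partition and numbers the rows in user_id order, which gives exactly the running per-action count in the desired output.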

Redshift Distribution By Child Columns

My Situation
I have some tables in my Redshift cluster that all break down into either an order_id, shipment_id, or shipment_item_id, depending on how granular the table is. order_id is 1-to-many on shipment_id, and shipment_id is 1-to-many on shipment_item_id.
My Question
I distribute on order_id, so all shipment_id and shipment_item_id records should be on the same nodes across the tables, since they are grouped by order_id. My question is: when I have to join on shipment_id or shipment_item_id, will Redshift know that the records are on the same nodes, or will it still broadcast the tables since the join isn't on order_id?
Example Tables
unified_order
+----------+-------------+------------------+
| order_id | shipment_id | shipment_item_id |
+----------+-------------+------------------+
|        1 |           1 |                1 |
|        1 |           1 |                2 |
|        1 |           1 |                3 |
|        1 |           2 |                4 |
|        1 |           2 |                5 |
|        1 |           3 |                6 |
|        2 |           4 |                7 |
|        2 |           4 |                8 |
|        3 |           5 |                9 |
|        3 |           5 |               10 |
|        4 |           6 |               11 |
|        5 |           7 |               12 |
|        5 |           7 |               13 |
+----------+-------------+------------------+

shipment_details
+-------------+-----------+--------------+
| shipment_id | ship_day  | ship_details |
+-------------+-----------+--------------+
|           1 | 1/1/2017  | stuff        |
|           2 | 5/1/2017  | other stuff  |
|           3 | 6/14/2017 | more stuff   |
|           4 | 5/13/2017 | less stuff   |
|           5 | 6/19/2017 | that stuff   |
|           6 | 7/31/2017 | what stuff   |
|           7 | 2/5/2017  | things       |
+-------------+-----------+--------------+
Distribution
distribution_by_node
+------+----------+-------------+------------------+
| node | order_id | shipment_id | shipment_item_id |
+------+----------+-------------+------------------+
|    1 |        1 |           1 |                1 |
|    1 |        1 |           1 |                2 |
|    1 |        1 |           1 |                3 |
|    1 |        1 |           2 |                4 |
|    1 |        1 |           2 |                5 |
|    1 |        1 |           3 |                6 |
|    1 |        5 |           7 |               12 |
|    1 |        5 |           7 |               13 |
|    2 |        2 |           4 |                7 |
|    2 |        2 |           4 |                8 |
|    3 |        3 |           5 |                9 |
|    3 |        3 |           5 |               10 |
|    4 |        4 |           6 |               11 |
+------+----------+-------------+------------------+
The Amazon Redshift documentation does not go into detail about how information is shared between nodes, but it is doubtful that it simply "broadcasts the tables".
Rather, information is probably sent between nodes based on need -- only the relevant columns would be shared, and possibly only sub-ranges of the data.
Rather than worrying too much about the internal implementation, you should test various DISTKEY and SORTKEY strategies against real queries to determine performance.
Follow the recommendations from Choose the Best Distribution Style to minimize the amount of data that needs to be sent between nodes and consult Amazon Redshift Best Practices for Designing Queries to improve queries.
You can EXPLAIN your query to see how data will be distributed (or not) during execution. This doc shows how to read the query plan:
Evaluating the Query Plan
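As a quick illustration using the example tables above (a sketch, not output from a real cluster), you could run:

-- Ask Redshift how it plans to move data for a shipment_id join
EXPLAIN
SELECT u.order_id, s.ship_day
FROM unified_order u
JOIN shipment_details s
  ON s.shipment_id = u.shipment_id;

In the resulting plan, a join step labelled DS_DIST_NONE means the joining rows were already collocated and no redistribution was needed, while DS_BCAST_INNER means the inner table was broadcast to every node, and DS_DIST_BOTH means both tables had to be redistributed.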

Spotfire - Calculate average only if there are minimum 3 values

I want to create a cross table in Spotfire in which the average is calculated only when there are at least 3 values. If there are no values, or fewer than 3, the average should be blank.
+-------+-----+---------+
| Month | Age | Average |
+-------+-----+---------+
|     1 |  10 |         |
|     2 |  11 |         |
|     3 |   2 |     7.7 |
|     4 |     |         |
|     5 |  13 |         |
|     6 |  14 |         |
|     7 |     |         |
|     8 |  19 |         |
|     9 |  20 |         |
|    10 |  21 |      20 |
+-------+-----+---------+
If I'm understanding you correctly, you want to group by Month, and then have something like this as your aggregation:
If(Count()>2,Avg([Age]),null) as [AverageAge_3Min]
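In a cross table this would typically go in as a custom expression on the value (aggregation) axis, assuming Age is the measure being averaged; Count() counts the rows behind each cell, so any cell backed by fewer than three rows comes out blank.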