I have a file with a [start] and [end] date in Tableau and would like to create a calculated field that counts the number of rows, on a rolling basis, that occur between [start] and [end] for each [person]. The data looks like this:
| Start    | End       | Person
| 1/1/2019 | 1/7/2019  | A
| 1/3/2019 | 1/9/2019  | A
| 1/8/2019 | 1/15/2019 | A
| 1/1/2019 | 1/7/2019  | B
I'd like to create a calculated field [count] with results like so:
| Start    | End       | Person | Count
| 1/1/2019 | 1/7/2019  | A      | 1
| 1/3/2019 | 1/9/2019  | A      | 2
| 1/8/2019 | 1/15/2019 | A      | 2
| 1/1/2019 | 1/7/2019  | B      | 1
EDITED: A good analogy for what [count] represents is: "how many videos does each person have rented at the same time as of that moment?" In the 1st row for person A, [count] is 1, with 1 item rented. As of row 2, person A has 2 items rented. For the 3rd row, [count] = 2 again, since the video rented in the first row is no longer rented.
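In SQL terms (just to pin down the logic; the table name rentals and the column names start_date/end_date are assumptions), the count for each row is the number of the same person's rentals whose date range contains that row's start date:

-- For each row, count how many of the same person's rentals
-- were still active on that row's start date.
SELECT t.start_date, t.end_date, t.person,
       (SELECT count(*)
        FROM   rentals r
        WHERE  r.person = t.person
        AND    t.start_date BETWEEN r.start_date AND r.end_date) AS count
FROM   rentals t;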
I have a Postgres table where one ID may have multiple Channel values, as follows:
 ID  | Channel | Column 3 | Column 4
-----|---------|----------|---------
  1  | Sports  | x        | null
  1  | Organic | x        | z
  2  | Organic | null     | q
  3  | Arts    | b        | w
  3  | Organic | e        | r
  4  | Sports  | sp       | t
No ID will have a duplicate channel name, and no ID will have both Sports and Arts. That is, ID 1 could have a Sports and an Organic channel, or an Arts and an Organic channel, but not two Sports or two Organic entries, and not both a Sports and an Arts channel. I want all IDs to appear in the query result, but if an ID has a non-organic channel I prefer that row. The result I would want would be
 ID  | Channel | Column 3 | Column 4
-----|---------|----------|---------
  1  | Sports  | x        | null
  2  | Organic | null     | q
  3  | Arts    | b        | w
  4  | Sports  | sp       | t
I feel like there is some CTE here, a rank and partition or something, that could do the trick, but I'm just not getting it. I'm only including Columns 3 and 4 to show that there are extra columns.
Does anyone have any ideas on the code to deploy here?
You could use DISTINCT ON with an appropriate ORDER BY clause:
SELECT DISTINCT ON (id)
       id, channel, column3, column4
FROM   atable
ORDER  BY id, channel = 'Organic';
This relies on the fact that FALSE < TRUE: rows with channel = 'Organic' sort last within each id, so DISTINCT ON (id) keeps a non-Organic row whenever one exists.
I ended up using a ranking window function:
ROW_NUMBER() OVER (PARTITION BY salesforce_id
                   ORDER BY CASE WHEN channel = 'Organic' THEN 0 ELSE 1 END DESC,
                            timestamp DESC) AS id_rank
I didn't mention in the original question that I also had a timestamp! This works now. Thanks
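For completeness, a sketch of how that expression might be used to keep only the preferred row per ID (the table name atable is taken from the answer above; the rest follows the snippet):

-- Keep one row per salesforce_id, preferring a non-Organic channel
-- and, within that, the most recent timestamp.
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY salesforce_id
                              ORDER BY CASE WHEN channel = 'Organic' THEN 0 ELSE 1 END DESC,
                                       "timestamp" DESC) AS id_rank
    FROM atable
)
SELECT *
FROM   ranked
WHERE  id_rank = 1;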
I will be very grateful for your advice regarding the following issue.
Given:
PostgreSQL database
Initial (basic) query
select day, Value_1, Value_2, Value_3
from table
where day=current_date
which returns a row with the following columns:
Day        | Value_1 (int) | Value_2 (int) | Value_3 (int)
2019-11-14 | 10            | 10            | 14
I need to create a view that starts with this information and adds a new row every day, based on the outcome of the initial query executed at 22:00.
The expected outcome tomorrow at 22:01 would be:
Day        | Value_1 | Value_2 | Value_3
2019-11-14 | 10      | 10      | 14
2019-11-15 | N       | M       | P
Many thanks in advance for your time and support.
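A view by itself cannot accumulate rows over time; a common approach is a small history table filled by a scheduled INSERT (for example via cron or the pg_cron extension). A minimal sketch, where daily_values and source_table are assumed names:

-- History table that accumulates one snapshot per day.
CREATE TABLE daily_values (
    day     date PRIMARY KEY,
    value_1 int,
    value_2 int,
    value_3 int
);

-- Statement to run on a schedule, e.g. every day at 22:00:
INSERT INTO daily_values (day, value_1, value_2, value_3)
SELECT day, value_1, value_2, value_3
FROM   source_table
WHERE  day = current_date
ON CONFLICT (day) DO NOTHING;  -- reruns on the same day are harmless

-- The view is then just an ordered read over the history table:
CREATE VIEW daily_values_view AS
SELECT day, value_1, value_2, value_3
FROM   daily_values
ORDER  BY day;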
I have two tables,
1. items
2. festival_rents
Sample records:
items
id | name | rent
------------------------
1 | Car | 100
2 | Truck | 150
3 | Van | 200
Sample records:
festival_rents
id | items_id | start_date | end_date | rent
------------------------------------------------
1 | 1 | 2018-07-01 | 2018-07-02 | 200
2 | 1 | 2018-07-04 | 2018-07-06 | 300
3 | 3 | 2018-07-06 | 2018-07-07 | 400
The table items contains a list of items with name and rent. Each item in the items table may or may not have festival_rents. The table festival_rents holds higher rents for each item over a date range given by start_date and end_date. It is possible for an item to have multiple festival_rents with different date ranges, but it is guaranteed that the date ranges of multiple festival_rents belonging to the same item won't collide; all date ranges are disjoint.
The query that I'm looking for is: for a given start_date and end_date range, for each item in the items table, calculate the total rent and display each item with its calculated total rent. The rent calculation for each item should also include the festival_rents, if any of the item's festival_rents fall within the given start_date and end_date.
Expected result:
Input: start_date=2018-07-01 and end_date=2018-07-06
Output:
id | name | total_price
------------------------
1 | Car | 1100 // 1st 2 days festival rent + 1 day normal rent + last 3 days festival rent (2 * 200) + (1 * 100) + (3 * 200)
2 | Truck | 900 // 6 days normal rent (6 * 150)
3 | Van | 1400 // 5 days normal rent + 1 day festival rent (200 * 5) + (400 * 1)
You need a list of days, either stored in a table or created on the fly:
How to get list of dates between two dates in mysql select query
Generating time series between two dates in PostgreSQL
SELECT i.name, SUM(f.rent)
FROM   allDays a
JOIN   festival_rents f
  ON   a.day >= f.start_date
 AND   a.day <  f.end_date
JOIN   items i
  ON   f.items_id = i.id   -- join on the actual key columns
WHERE  a.day BETWEEN #start_date AND #end_date
GROUP  BY i.name
I assume end_date is an open (exclusive) range, so if you have ranges [A,B) and [B,C), date B takes the rent from [B,C).
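Note that the query above only sums festival rent: items with no festival_rents in the range drop out, and the normal rent on non-festival days is not counted. A fuller sketch, assuming inclusive festival end dates (which is what the worked arithmetic in the question's expected output implies) and generate_series() to build the day list on the fly:

-- For each item, charge the festival rent on days covered by a
-- festival_rents row and the normal rent otherwise.
SELECT i.id, i.name,
       SUM(COALESCE(f.rent, i.rent)) AS total_price
FROM   generate_series(DATE '2018-07-01', DATE '2018-07-06',
                       INTERVAL '1 day') AS a(day)
CROSS  JOIN items i
LEFT   JOIN festival_rents f
  ON   f.items_id = i.id
 AND   a.day BETWEEN f.start_date AND f.end_date  -- inclusive end date
GROUP  BY i.id, i.name
ORDER  BY i.id;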
I have some data from different systems which can be joined only in a certain case because of different granularity between the data sets.
Given three columns:
call_date, login_id, customer_id
How can I efficiently 'flag' each row that has a unique combination of those three columns? I didn't want to SELECT DISTINCT, because I do not know which of the rows actually matches up with the other. I want to know which records (combinations of columns) exist only once on a single date.
For example, if a customer called in 5 times on a single date and ordered a product, I do not know which of those specific call records ties back to the product order (lack of timestamps in the raw data). However, if a customer only called in once on a specific date and had a product order, I know for sure that the order ties back to that call record. (This is just an example - I am doing something similar across about 7 different tables from different source data).
timestamp customer_id login_name score unique
01/24/2017 18:58:11 441987 abc123 .25 TRUE
03/31/2017 15:01:20 783356 abc123 1 FALSE
03/31/2017 16:51:32 783356 abc123 0 FALSE
call_date customer_id login_name order unique
01/24/2017 441987 abc123 0 TRUE
03/31/2017 783356 abc123 1 TRUE
In the above example, I would only want to join rows where the 'uniqueness' is True for both tables. So on 1/24, I know that there was no order for the call which had a score of 0.25.
To find whether the row (or some set of columns) is unique within the list of rows, you need to make use of PostgreSQL window functions.
SELECT *,
(count(*) OVER(PARTITION BY b, c, d) = 1) as unique_within_b_c_d_columns
FROM unnest(ARRAY[
row(1, 2, 3, 1),
row(2, 2, 3, 2),
row(3, 2, 3, 2),
row(4, 2, 3, 4)
]) as t(a int, b int, c int, d int)
Output:
| a | b | c | d | unique_within_b_c_d_columns |
-----------------------------------------------
| 1 | 2 | 3 | 1 | true |
| 2 | 2 | 3 | 2 | false |
| 3 | 2 | 3 | 2 | false |
| 4 | 2 | 3 | 4 | true |
In the PARTITION BY clause you specify the list of columns you want to compare on. Note that in the example above, column a doesn't take part in the comparison.
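Applied to the original columns, the same idea might look like this (the table name calls is an assumption):

-- Flag rows whose (call_date, login_id, customer_id) combination
-- appears exactly once in the table.
SELECT *,
       (count(*) OVER (PARTITION BY call_date, login_id, customer_id) = 1)
           AS is_unique
FROM calls;

Rows flagged true on both sides can then be joined safely, as described in the question.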
Situation
Using Python 3, Django 1.9, Cubes 1.1, and Postgres 9.5.
These are my data tables in text format:
Store table
------------------------------
| id | code | address |
|-----|------|---------------|
| 1 | S1 | Kings Row |
| 2 | S2 | Queens Street |
| 3 | S3 | Jacks Place |
| 4 | S4 | Diamonds Alley|
| 5 | S5 | Hearts Road |
------------------------------
Product table
------------------------------
| id | code | name |
|-----|------|---------------|
| 1 | P1 | Saucer 12 |
| 2 | P2 | Plate 15 |
| 3 | P3 | Saucer 13 |
| 4 | P4 | Saucer 14 |
| 5 | P5 | Plate 16 |
| and many more .... |
|1000 |P1000 | Bowl 25 |
|----------------------------|
Sales table
----------------------------------------
| id | product_id | store_id | amount |
|-----|------------|----------|--------|
| 1 | 1 | 1 |7.05 |
| 2 | 1 | 2 |9.00 |
| 3 | 2 | 3 |1.00 |
| 4 | 2 | 3 |1.00 |
| 5 | 2 | 5 |1.00 |
| and many more .... |
| 1000| 20 | 4 |1.00 |
|--------------------------------------|
The relationships are:
Sales belongs to Store
Sales belongs to Product
Store has many Sales
Product has many Sales
What I want to achieve
I want to use cubes to be able to do a display by pagination in the following manner:
Given the stores S1-S3:
-------------------------
| product | S1 | S2 | S3 |
|---------|----|----|----|
|Saucer 12|7.05|9 | 0 |
|Plate 15 |0 |0 | 2 |
| and many more .... |
|------------------------|
Note the following:
Even though there are no records in sales for Saucer 12 under store S3, I display 0 instead of null or none.
I want to be able to sort by store, say, in descending order for S3.
The cells indicate the SUM total of sales of that particular product in that particular store.
I also want to have pagination.
What I tried
This is the configuration I used:
"cubes": [
{
"name": "sales",
"dimensions": ["product", "store"],
"joins": [
{"master":"product_id", "detail":"product.id"},
{"master":"store_id", "detail":"store.id"}
]
}
],
"dimensions": [
{ "name": "product", "attributes": ["code", "name"] },
{ "name": "store", "attributes": ["code", "address"] }
]
This is the code I used:
result = browser.aggregate(drilldown=['Store', 'Product'],
                           order=[("Product.name", "asc"),
                                  ("Store.name", "desc"),
                                  ("total_products_sale", "desc")])
I didn't get what I wanted. Instead, I got this:
----------------------------------------------
| product_id | store_id | total_products_sale |
|------------|----------|---------------------|
| 1 | 1 | 7.05 |
| 1 | 2 | 9 |
| 2 | 3 | 2.00 |
| and many more .... |
|---------------------------------------------|
which is the whole table with no pagination, and if a product was not sold in a store, it doesn't show up as zero.
My question
How do I get what I want?
Do I need to create another data table that aggregates everything by store and product before I use cubes to run the query?
Update
I have read more. I realised that what I want is called dicing as I needed to go across 2 dimensions. See: https://en.wikipedia.org/wiki/OLAP_cube#Operations
Cross-posted at Cubes GitHub issues to get more attention.
This is a pure SQL solution using crosstab() from the additional tablefunc module to pivot the aggregated data. It typically performs better than any client-side alternative. If you are not familiar with crosstab(), read this first:
PostgreSQL Crosstab Query
And this about the "extra" column in the crosstab() output:
Pivot on Multiple Columns using Tablefunc
SELECT product_id, product
, COALESCE(s1, 0) AS s1 -- 1. ... displayed 0 instead of null
, COALESCE(s2, 0) AS s2
, COALESCE(s3, 0) AS s3
, COALESCE(s4, 0) AS s4
, COALESCE(s5, 0) AS s5
FROM crosstab(
'SELECT s.product_id, p.name, s.store_id, s.sum_amount
FROM product p
JOIN (
SELECT product_id, store_id
, sum(amount) AS sum_amount -- 3. SUM total of product spent in store
FROM sales
GROUP BY product_id, store_id
) s ON p.id = s.product_id
ORDER BY s.product_id, s.store_id;'
, 'VALUES (1),(2),(3),(4),(5)' -- desired store_id's
) AS ct (product_id int, product text -- "extra" column
, s1 numeric, s2 numeric, s3 numeric, s4 numeric, s5 numeric)
ORDER BY s3 DESC; -- 2. ... descending order for S3
Produces your desired result exactly (plus product_id).
To include products that have never been sold replace [INNER] JOIN with LEFT [OUTER] JOIN.
SQL Fiddle with base query.
The tablefunc module is not installed on sqlfiddle.
Major points
Read the basic explanation in the reference answer for crosstab().
I am including product_id because product.name is hardly unique. This might otherwise lead to sneaky errors conflating two different products.
You don't need the store table in the query if referential integrity is guaranteed.
ORDER BY s3 DESC works, because s3 references the output column where NULL values have been replaced with COALESCE. Else we would need DESC NULLS LAST to sort NULL values last:
PostgreSQL sort by datetime asc, null first?
For building crosstab() queries dynamically consider:
Dynamic alternative to pivot with CASE and GROUP BY
I also want to have pagination.
That last item is fuzzy. Simple pagination can be had with LIMIT and OFFSET:
Displaying data in grid view page by page
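For example, with a stable page size of 10, page n of the ordered result is just LIMIT 10 OFFSET 10 * (n - 1) appended to the query above.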
I would consider a MATERIALIZED VIEW to materialize results before pagination. If you have a stable page size I would add page numbers to the MV for easy and fast results.
To optimize performance for big result sets, consider:
SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'
Optimize query with OFFSET on large table
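For illustration, a keyset-pagination sketch over a materialized copy of the pivoted result (the MV name sales_pivot is an assumption; it is taken to hold the output columns of the crosstab() query above):

-- Fetch the next page of 10 products, ordered by s3 descending.
-- (9.00, 42) stand for the s3 and product_id values of the last row
-- on the previous page; both are placeholders.
SELECT product_id, product, s1, s2, s3, s4, s5
FROM   sales_pivot
WHERE  (s3, product_id) < (9.00, 42)
ORDER  BY s3 DESC, product_id DESC
LIMIT  10;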