I have data from different systems that can only be joined in certain cases because of the different granularity of the data sets.
Given three columns:
call_date, login_id, customer_id
How can I efficiently 'flag' each row that has a unique combination of those three values? I don't want to SELECT DISTINCT because I would not know which of the rows actually matches up with the others. I want to know which records (combinations of columns) exist only once on a single date.
For example, if a customer called in 5 times on a single date and ordered a product, I do not know which of those specific call records ties back to the product order (lack of timestamps in the raw data). However, if a customer only called in once on a specific date and had a product order, I know for sure that the order ties back to that call record. (This is just an example - I am doing something similar across about 7 different tables from different source data).
| timestamp           | customer_id | login_name | score | unique |
--------------------------------------------------------------------
| 01/24/2017 18:58:11 | 441987      | abc123     | .25   | TRUE   |
| 03/31/2017 15:01:20 | 783356      | abc123     | 1     | FALSE  |
| 03/31/2017 16:51:32 | 783356      | abc123     | 0     | FALSE  |

| call_date  | customer_id | login_name | order | unique |
----------------------------------------------------------
| 01/24/2017 | 441987      | abc123     | 0     | TRUE   |
| 03/31/2017 | 783356      | abc123     | 1     | TRUE   |
In the above example, I would only want to join rows where the 'uniqueness' is True for both tables. So on 1/24, I know that there was no order for the call which had a score of 0.25.
To find out whether a row (or some set of its columns) is unique within the set of rows, you can use a PostgreSQL window function.
SELECT *,
       (count(*) OVER (PARTITION BY b, c, d) = 1) as unique_within_b_c_d_columns
FROM unnest(ARRAY[
         row(1, 2, 3, 1),
         row(2, 2, 3, 2),
         row(3, 2, 3, 2),
         row(4, 2, 3, 4)
     ]) as t(a int, b int, c int, d int)
Output:
| a | b | c | d | unique_within_b_c_d_columns |
-----------------------------------------------
| 1 | 2 | 3 | 1 | true                        |
| 2 | 2 | 3 | 2 | false                       |
| 3 | 2 | 3 | 2 | false                       |
| 4 | 2 | 3 | 4 | true                        |
In the PARTITION BY clause you specify the list of columns you want to compare on. Note that in the example above, column a does not take part in the comparison.
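Applied to the asker's two sample tables, the same window-function flag plus the "both sides unique" join could look roughly like this. This is a sketch only: the table names calls and orders, the column name order_flag (renamed because ORDER is a reserved word), and the timestamp-to-date cast are assumptions based on the sample data above, not the real schema.

WITH c AS (
    SELECT *,
           (count(*) OVER (PARTITION BY "timestamp"::date, customer_id, login_name) = 1) AS is_unique
    FROM calls
), o AS (
    SELECT *,
           (count(*) OVER (PARTITION BY call_date, customer_id, login_name) = 1) AS is_unique
    FROM orders
)
SELECT c.*, o.order_flag                    -- the "order" column from the sample, renamed
FROM c
JOIN o ON  o.call_date   = c."timestamp"::date
       AND o.customer_id = c.customer_id
       AND o.login_name  = c.login_name
WHERE c.is_unique
  AND o.is_unique;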
For the sake of simplicity, suppose you have a table with numbers like:
| number |
----------
|123 |
|1234 |
|12345 |
|123456 |
|111 |
|1111 |
|2 |
|700 |
What would be an efficient way of retrieving the shortest numbers (call them roots) and all the values derived from them, e.g.:
| root | derivatives |
--------------------------------
| 123 | 1234, 12345, 123456 |
| 111 | 1111 |
Numbers 2 & 700 are excluded from the list because they're unique, and thus have no derivatives.
An output like the above would be ideal, but since it's probably difficult to achieve, the next best thing would be something like the below, which I can then post-process:
| root | derivative |
-----------------------
| 123 | 1234 |
| 123 | 12345 |
| 123 | 123456 |
| 111 | 1111 |
My naive initial attempt to at least identify roots (see below) has been running for 4h now with a dataset of ~500k items, but the real one I'd have to inspect consists of millions.
select number
from numbers n1
where exists(
select number
from numbers n2
where n2.number <> n1.number
and n2.number like n1.number || '_%'
);
This works if number is an integer or bigint:
select min(a.number) as root, b.number as derivative
from nums a
cross join lateral generate_series(1, 18) as gs(power)
join nums b
on b.number / (10^gs.power)::bigint = a.number
group by b.number
order by root, derivative;
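To see how the division works: for root 123 and derivative 123456, 10^3 is cast to bigint as 1000 and integer division gives 123456 / 1000 = 123, so the pair matches at power = 3; min(a.number) then keeps the smallest (i.e. shortest) matching root for each derivative.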
EDIT: I moved a non-working query to the bottom. It fails for reasons outlined by @Morfic in the comments.
We can do a similar and simpler join using like for character types:
select min(a.number) as root, b.number as derivative
from numchar a
join numchar b on b.number like a.number||'%'
and b.number != a.number
group by b.number
order by root, derivative;
Updated fiddle.
Faulty Solution Follows
If number is a character type, then try this:
with groupings as (
    select number,
           case
               when number like (lag(number) over (order by number))||'%' then 0
               else 1
           end as newgroup
    from numchar
), groupnums as (
    select number, sum(newgroup) over (order by number) as groupnum
    from groupings
), matches as (
    select min(number) over (partition by groupnum) as root,
           number as derivative
    from groupnums
)
select *
from matches
where root != derivative;
There should be only a single sort, on groupnum, in this execution, since number is your table's primary key.
db<>fiddle here
I am trying to create a "queue" system by adding a derived column that assigns a number based on a condition and a date, to rank the importance of each row.
For example, below is the query result I pulled in Postgres:
Table: task
Result:
description | status/condition | task_created
------------------------------------------------
bla         | A                | 2019-12-01 07:00:00
pikachu     | A                | 2019-12-01 16:32:10
abcdef      | B                | 2019-12-02 18:34:22
doremi      | B                | 2019-12-02 15:09:43
lalala      | A                | 2019-12-03 22:10:59
In the above, each task has a date/timestamp and a status/condition. I would like to create another column that assigns a number to each row, prioritising the older tasks first, BUT if the condition is B, then the oldest of the B tasks takes first priority.
The expected end result (based on the example) should be:
Table1: task
description | status/condition | task_created        | priority index
----------------------------------------------------------------------
bla         | A                | 2019-12-01 07:00:00 | 3
pikachu     | A                | 2019-12-01 16:32:10 | 4
abcdef      | B                | 2019-12-02 18:34:22 | 2
doremi      | B                | 2019-12-02 15:09:43 | 1
lalala      | A                | 2019-12-03 22:10:59 | 5
For the priority number, 1 is the most urgent to do/resolve and 5 is the least.
How would I go about adding this additional column to the existing query, especially since there is another condition apart from just the task_created date/time?
Any help is appreciated. Many thanks!
You probably want the rank() or dense_rank() window function (depending on your needs).
If you don't need a custom order based on the status values, you can use this one:
SELECT *,
rank() OVER (
ORDER BY status desc, task_created
) as priority_index
FROM task
If you need a custom order based on the value of the status:
SELECT *,
rank() OVER (
ORDER BY
CASE status
WHEN 'B' THEN 1
WHEN 'A' THEN 2
WHEN 'C' THEN 3
ELSE 4
END, task_created
) as priority_index
FROM task
If you have only a few values, this is good enough, because we can simply spell out your custom order. But if you have many values and the ordering information is fixed, then it should have its own table, as sketched below.
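If the priorities live in their own table, the rank query can simply join to it. Below is a minimal sketch, assuming a hypothetical lookup table named status_priority and the same task table as in the queries above:

CREATE TABLE status_priority (
    status   text PRIMARY KEY,
    priority int  NOT NULL
);

INSERT INTO status_priority VALUES ('B', 1), ('A', 2), ('C', 3);

SELECT t.*,
       rank() OVER (
           ORDER BY coalesce(sp.priority, 4), t.task_created   -- unknown statuses sort last
       ) AS priority_index
FROM task t
LEFT JOIN status_priority sp ON sp.status = t.status;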
I have a file with a [start] and [end] date in Tableau and would like to create a calculated field that counts, on a rolling basis, the number of rows that occur between [start] and [end] for each [person]. The data looks like this:
| Start | End | Person
|1/1/2019 |1/7/2019 | A
|1/3/2019 |1/9/2019 | A
|1/8/2019 |1/15/2019| A
|1/1/2019 |1/7/2019 | B
I'd like to create a calculated field [count] with results like so:
| Start | End | Person | Count
|1/1/2019 |1/7/2019 | A | 1
|1/3/2019 |1/9/2019 | A | 2
|1/8/2019 |1/15/2019| A | 2
|1/1/2019 |1/7/2019 | B | 1
EDITED: A good analogy for what [count] represents is: "how many videos does each person have rented at that moment?" For the 1st row for person A, the count is 1, with 1 item rented. As of row 2, person A has 2 items rented. But for the 3rd row, [count] = 2, since the video rented in the first row is no longer rented.
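For reference, the same count can be written in SQL with a correlated subquery. This is only a sketch, assuming a hypothetical table rentals(start_date, end_date, person) that mirrors the rows above:

SELECT r.start_date, r.end_date, r.person,
       (SELECT count(*)
        FROM rentals r2
        WHERE r2.person      = r.person
          AND r2.start_date <= r.start_date
          AND r2.end_date   >= r.start_date) AS count   -- rentals active as of this row's start
FROM rentals r
ORDER BY r.person, r.start_date;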
I am trying to calculate an aggregate function for a field for a subset of rows in a table. The problem is that I'd like to find the mean of every combination of rows taken k at a time --- so for all the rows, I'd like to find (say) the mean of every combination of 10 rows. So:
id | count
----|------
1 | 5
2 | 3
3 | 6
...
30 | 16
should give me
mean of ids 1..10; ids 1, 3..11; ids 1, 4..12; and so on. I know this will yield a lot of rows.
There are SO answers for finding combinations from arrays. I could do this programmatically by taking 30 ids 10 at a time and then SELECTing them. Is there a way to do this with PARTITION BY, TABLESAMPLE, or another function (something like python's itertools.combinations())? (TABLESAMPLE by itself won't guarantee which subset of rows I am selecting as far as I can tell.)
The method described in the cited answer is static. A more convenient solution may be to use recursion.
Example data:
drop table if exists my_table;
create table my_table(id int primary key, number int);
insert into my_table values
(1, 5),
(2, 3),
(3, 6),
(4, 9),
(5, 2);
Query which finds 2-element subsets of the 5-element set (k-combinations with k = 2):
with recursive recur as (
    select
        id,
        array[id] as combination,
        array[number] as numbers,
        number as sum
    from my_table
    union all
    select
        t.id,
        r.combination || t.id,
        r.numbers || t.number,
        r.sum + t.number
    from my_table t
    join recur r on r.id < t.id
    and cardinality(r.combination) < 2              -- param k
)
select combination, numbers, sum/2.0 as average     -- param k
from recur
where cardinality(combination) = 2;                 -- param k
combination | numbers | average
-------------+---------+--------------------
{1,2} | {5,3} | 4.0000000000000000
{1,3} | {5,6} | 5.5000000000000000
{1,4} | {5,9} | 7.0000000000000000
{1,5} | {5,2} | 3.5000000000000000
{2,3} | {3,6} | 4.5000000000000000
{2,4} | {3,9} | 6.0000000000000000
{2,5} | {3,2} | 2.5000000000000000
{3,4} | {6,9} | 7.5000000000000000
{3,5} | {6,2} | 4.0000000000000000
{4,5} | {9,2} | 5.5000000000000000
(10 rows)
The same query for k = 3 gives:
combination | numbers | average
-------------+---------+--------------------
{1,2,3} | {5,3,6} | 4.6666666666666667
{1,2,4} | {5,3,9} | 5.6666666666666667
{1,2,5} | {5,3,2} | 3.3333333333333333
{1,3,4} | {5,6,9} | 6.6666666666666667
{1,3,5} | {5,6,2} | 4.3333333333333333
{1,4,5} | {5,9,2} | 5.3333333333333333
{2,3,4} | {3,6,9} | 6.0000000000000000
{2,3,5} | {3,6,2} | 3.6666666666666667
{2,4,5} | {3,9,2} | 4.6666666666666667
{3,4,5} | {6,9,2} | 5.6666666666666667
(10 rows)
Of course, you can remove the numbers column from the query if you do not need it.
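As a follow-up, k appears in three places above. One way to keep it in a single spot is a small parameter CTE; this is just a sketch along the lines of the query above (the CTE name params is made up):

with recursive params(k) as (
    values (3)                                      -- set k here
), recur as (
    select
        id,
        array[id] as combination,
        array[number] as numbers,
        number as sum
    from my_table
    union all
    select
        t.id,
        r.combination || t.id,
        r.numbers || t.number,
        r.sum + t.number
    from my_table t
    join recur r on r.id < t.id
    cross join params p
    where cardinality(r.combination) < p.k
)
select combination, numbers, sum::numeric / p.k as average
from recur
cross join params p
where cardinality(combination) = p.k;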
Situation
Using Python 3, Django 1.9, Cubes 1.1, and Postgres 9.5.
These are my data tables in text format:
Store table
------------------------------
| id | code | address |
|-----|------|---------------|
| 1 | S1 | Kings Row |
| 2 | S2 | Queens Street |
| 3 | S3 | Jacks Place |
| 4 | S4 | Diamonds Alley|
| 5 | S5 | Hearts Road |
------------------------------
Product table
------------------------------
| id | code | name |
|-----|------|---------------|
| 1 | P1 | Saucer 12 |
| 2 | P2 | Plate 15 |
| 3 | P3 | Saucer 13 |
| 4 | P4 | Saucer 14 |
| 5 | P5 | Plate 16 |
| and many more .... |
|1000 |P1000 | Bowl 25 |
|----------------------------|
Sales table
----------------------------------------
| id | product_id | store_id | amount |
|-----|------------|----------|--------|
| 1 | 1 | 1 |7.05 |
| 2 | 1 | 2 |9.00 |
| 3 | 2 | 3 |1.00 |
| 4 | 2 | 3 |1.00 |
| 5 | 2 | 5 |1.00 |
| and many more .... |
| 1000| 20 | 4 |1.00 |
|--------------------------------------|
The relationships are:
Sales belongs to Store
Sales belongs to Product
Store has many Sales
Product has many Sales
What I want to achieve
I want to use Cubes to display a paginated result in the following manner:
Given the stores S1-S3:
-------------------------
| product | S1 | S2 | S3 |
|---------|----|----|----|
|Saucer 12|7.05|9 | 0 |
|Plate 15 |0 |0 | 2 |
| and many more .... |
|------------------------|
Note the following:
Even though there are no records in sales for Saucer 12 under store S3, I display 0 instead of null or none.
I want to be able to sort by store, say in descending order for S3.
Each cell shows the SUM of the amount spent on that particular product in that particular store.
I also want to have pagination.
What I tried
This is the configuration I used:
"cubes": [
{
"name": "sales",
"dimensions": ["product", "store"],
"joins": [
{"master":"product_id", "detail":"product.id"},
{"master":"store_id", "detail":"store.id"}
]
}
],
"dimensions": [
{ "name": "product", "attributes": ["code", "name"] },
{ "name": "store", "attributes": ["code", "address"] }
]
This is the code I used:
result = browser.aggregate(
    drilldown=['Store', 'Product'],
    order=[("Product.name", "asc"), ("Store.name", "desc"), ("total_products_sale", "desc")])
I didn't get what I wanted.
I got it like this:
----------------------------------------------
| product_id | store_id | total_products_sale |
|------------|----------|---------------------|
| 1 | 1 | 7.05 |
| 1 | 2 | 9 |
| 2 | 3 | 2.00 |
| and many more .... |
|---------------------------------------------|
which is the whole table with no pagination, and products not sold in a given store do not show up as zero.
My question
How do I get what I want?
Do I need to create another data table that aggregates everything by store and product before I use cubes to run the query?
Update
I have read more and realised that what I want is called dicing, as I need to go across 2 dimensions. See: https://en.wikipedia.org/wiki/OLAP_cube#Operations
Cross-posted at Cubes GitHub issues to get more attention.
This is a pure SQL solution using crosstab() from the additional tablefunc module to pivot the aggregated data. It typically performs better than any client-side alternative. If you are not familiar with crosstab(), read this first:
PostgreSQL Crosstab Query
And this about the "extra" column in the crosstab() output:
Pivot on Multiple Columns using Tablefunc
SELECT product_id, product
, COALESCE(s1, 0) AS s1 -- 1. ... displayed 0 instead of null
, COALESCE(s2, 0) AS s2
, COALESCE(s3, 0) AS s3
, COALESCE(s4, 0) AS s4
, COALESCE(s5, 0) AS s5
FROM crosstab(
'SELECT s.product_id, p.name, s.store_id, s.sum_amount
FROM product p
JOIN (
SELECT product_id, store_id
, sum(amount) AS sum_amount -- 3. SUM total of product spent in store
FROM sales
GROUP BY product_id, store_id
) s ON p.id = s.product_id
ORDER BY s.product_id, s.store_id;'
, 'VALUES (1),(2),(3),(4),(5)' -- desired store_id's
) AS ct (product_id int, product text -- "extra" column
, s1 numeric, s2 numeric, s3 numeric, s4 numeric, s5 numeric)
ORDER BY s3 DESC; -- 2. ... descending order for S3
Produces your desired result exactly (plus product_id).
To include products that have never been sold, replace the [INNER] JOIN with a LEFT [OUTER] JOIN, as sketched below.
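For illustration, only the query string passed to crosstab() changes. A sketch, with one adjustment of my own: select p.id instead of s.product_id, which is NULL for unsold products, so that the row name stays non-null.

SELECT p.id AS product_id, p.name, s.store_id, s.sum_amount
FROM   product p
LEFT   JOIN (
   SELECT product_id, store_id
        , sum(amount) AS sum_amount
   FROM   sales
   GROUP  BY product_id, store_id
   ) s ON p.id = s.product_id
ORDER  BY p.id, s.store_id;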
SQL Fiddle with base query.
The tablefunc module is not installed on sqlfiddle.
Major points
Read the basic explanation in the reference answer for crosstab().
I am including product_id because product.name is hardly unique. Otherwise you might get sneaky errors conflating two different products.
You don't need the store table in the query if referential integrity is guaranteed.
ORDER BY s3 DESC works, because s3 references the output column where NULL values have been replaced with COALESCE. Else we would need DESC NULLS LAST to sort NULL values last:
PostgreSQL sort by datetime asc, null first?
For building crosstab() queries dynamically consider:
Dynamic alternative to pivot with CASE and GROUP BY
I also want to have pagination.
That last item is fuzzy. Simple pagination can be had with LIMIT and OFFSET:
Displaying data in grid view page by page
I would consider a MATERIALIZED VIEW to materialize results before pagination. If you have a stable page size, I would add page numbers to the MV for easy and fast results. A rough sketch follows.
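This reuses the crosstab() query from above; the view name product_sales_pivot is made up:

CREATE MATERIALIZED VIEW product_sales_pivot AS
SELECT product_id, product
     , COALESCE(s1, 0) AS s1
     , COALESCE(s2, 0) AS s2
     , COALESCE(s3, 0) AS s3
     , COALESCE(s4, 0) AS s4
     , COALESCE(s5, 0) AS s5
FROM crosstab(
   'SELECT s.product_id, p.name, s.store_id, s.sum_amount
    FROM product p
    JOIN (
       SELECT product_id, store_id
            , sum(amount) AS sum_amount
       FROM sales
       GROUP BY product_id, store_id
       ) s ON p.id = s.product_id
    ORDER BY s.product_id, s.store_id;'
   , 'VALUES (1),(2),(3),(4),(5)'
   ) AS ct (product_id int, product text
          , s1 numeric, s2 numeric, s3 numeric, s4 numeric, s5 numeric);

-- page 2 with a page size of 20, sorted by store S3
SELECT *
FROM   product_sales_pivot
ORDER  BY s3 DESC, product_id DESC
LIMIT  20 OFFSET 20;

-- refresh after the underlying data changes
REFRESH MATERIALIZED VIEW product_sales_pivot;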
To optimize performance for big result sets, consider:
SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'
Optimize query with OFFSET on large table
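For completeness, a keyset-pagination sketch on such a materialized result, using the row-wise comparison from the first link (the literal values are placeholders for the s3 and product_id of the previous page's last row):

SELECT *
FROM   product_sales_pivot
WHERE  (s3, product_id) < (42, 17)          -- last row of the previous page
ORDER  BY s3 DESC, product_id DESC
LIMIT  20;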