Counting by earliest date found from an inner join? - postgresql

I have two tables, customerusermap and users. Whenever a user signs up with our product, they immediately get added into a table called users but it isn't until they start paying for a user that they get added to a table called customerusermap.
The users table looks like this:
id | customer_id | firstname | lastname | created_at
-------------------------------------------------------
1725 | cus_3hEmhErE2jbwsO | Abby | Smith | 2015-03-19
1726 | cus_7oNweUrE4jbwr2 | Sam | Peters | 2015-06-20
The customerusermap table looks like this:
customer_id | user_id | created_at
------------------------------------------
cus_3hEmhErE2jbwsO | 9275 | 2015-09-01
cus_3hEmhErE2jbwsO | 2628 | 2015-09-05
cus_3hEmhErE2jbwsO | 2358 | 2015-07-05
cus_3hEmhErE2jbwsO | 3158 | 2015-08-05
cus_3hEmhErE2jbwsO | 2487 | 2015-08-05
cus_3hEmhErE2jbwsO | 6044 | 2015-08-05
cus_7oNweUrE4jbwr2 | 8094 | 2015-08-25
cus_7oNweUrE4jbwr2 | 2345 | 2015-09-02
In this example, Abby (cus_3hEmhErE2jbwsO) is paying for 6 users. She started paying for user 2358 on 2015-07-05, so she should be counted as a paying customer for 07-2015, not 03-2015. Sam is paying for 2 users and started paying for user 8094 in 08-2015, so he should be counted as a paying customer for 08-2015, not 06-2015. I have a query that counts the number of paying customers, grouped by month:
SELECT concat(extract(MONTH from u.created_at),'-',extract(year from u.created_at)) as "Month",
COUNT(distinct u.email) as "Total AB Paying Customers"
FROM customerusermap AS cm, users AS u
WHERE cm.customer_id=u.customer_id AND cm.user_id <> u.id
GROUP BY 1,extract(month from u.created_at),extract(year from u.created_at)
ORDER BY extract(year from u.created_at),extract(month from u.created_at);
But this grabs and counts by the date the customer was added to the users table, not the date they actually started paying. How would I grab the counts so that it grabs for the earliest date in the customerusermap table? What the needed output should look like in this example is:
Month | Total AB Paying Customers
-------------------------------------
07-2015 | 1
08-2015 | 1

You can use the following query:
SELECT CONCAT(EXTRACT(MONTH FROM startedPayingDate), '-',
EXTRACT(YEAR FROM startedPayingDate)) AS "Month",
COUNT(*) AS "Total AB Paying Customers"
FROM (
-- earliest paid-for row per customer
SELECT customer_id, MIN(created_at) AS startedPayingDate
FROM customerusermap AS cm
-- skip rows where the customer is paying for their own user record
WHERE NOT EXISTS (SELECT 1
FROM users AS u
WHERE cm.customer_id = u.customer_id
AND cm.user_id = u.id)
GROUP BY customer_id ) AS t
GROUP BY 1
I used a NOT EXISTS operator to exclude records that relate to 'paying for themselves' customers (if that is really your intention).
Once you get the MIN(created_at) date per customer_id, then you can easily count per date in an outer query.
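The two steps (earliest row per customer, then a count per month) can be checked against the sample data. This is a minimal sketch that assumes Python's built-in sqlite3 as a stand-in for PostgreSQL, so strftime replaces CONCAT/EXTRACT, and the NOT EXISTS filter is left out because none of the sample user_id values appear in users.

```python
import sqlite3

# In-memory SQLite database standing in for PostgreSQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customerusermap (customer_id TEXT, user_id INTEGER, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO customerusermap VALUES (?, ?, ?)",
    [
        ("cus_3hEmhErE2jbwsO", 9275, "2015-09-01"),
        ("cus_3hEmhErE2jbwsO", 2628, "2015-09-05"),
        ("cus_3hEmhErE2jbwsO", 2358, "2015-07-05"),
        ("cus_3hEmhErE2jbwsO", 3158, "2015-08-05"),
        ("cus_3hEmhErE2jbwsO", 2487, "2015-08-05"),
        ("cus_3hEmhErE2jbwsO", 6044, "2015-08-05"),
        ("cus_7oNweUrE4jbwr2", 8094, "2015-08-25"),
        ("cus_7oNweUrE4jbwr2", 2345, "2015-09-02"),
    ],
)

# Inner query: earliest created_at per customer.
# Outer query: count customers per month of that earliest date.
query = """
SELECT strftime('%m-%Y', started_paying) AS month, COUNT(*) AS paying_customers
FROM (
    SELECT customer_id, MIN(created_at) AS started_paying
    FROM customerusermap
    GROUP BY customer_id
) AS t
GROUP BY month
ORDER BY MIN(started_paying)
"""
monthly_counts = conn.execute(query).fetchall()
print(monthly_counts)
```

Abby's earliest row is 2015-07-05 and Sam's is 2015-08-25, so each month ends up with one paying customer.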

Related

PostgreSQL query to return free rooms for booking availability calender

I need a bit of help writing an SQL query.
The scenario is simple: I have a table named BookedRooms whose three most-used columns are checkInDate and checkOutDate, both of type timestamp, and roomId, which is a foreign key to the Rooms table.
The Rooms table has a PK, a name column and a roomNo column.
This is BookedRooms table
+----+----------------------------+-------------------------+------------------+--+
| PK | checkInDate | checkOutDate | roomId | |
+----+----------------------------+-------------------------+------------------+--+
| 1 | 2022-05-26T00:00:00Z | 2022-05-29T00:00:00Z | 2 | |
| 2 | 2022-05-29T00:00:00Z | 2022-05-30T00:00:00Z | 3 | |
+----+----------------------------+-------------------------+------------------+--+
This is Rooms table
+----+------------+-------------------+--+
| PK | name | roomNo | |
+----+------------+-------------------+--+
| 2 | Deluxe | 102 | |
| 3 | King | 103 | |
+----+------------+-------------------+--+
Now I want to write a query that, given a month number such as 4, returns the name and roomNo of the rooms that are free on each day of that month.
A room counts as occupied from its check-in date until its checkout date. For example, if room 102 has a check-in date of 3 April and a checkout date of 6 April, the query should leave it out of the results for 3 to 5 April and include it again from 6 April onwards, until the room appears in another booking's checkInDate.
Thank you
I recommend creating an exclusion constraint on bookedrooms. Not only can the GiST index that implements the constraint speed up the search you want, but it will also exclude double booking.
CREATE EXTENSION IF NOT EXISTS btree_gist;
ALTER TABLE bookedrooms ADD EXCLUDE USING gist (
tstzrange(checkindate, checkoutdate) WITH &&,
roomid WITH =
);
The query you need is
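-- all rooms, minus the rooms with a booking that overlaps month 4
SELECT name, roomno FROM rooms
EXCEPT
SELECT r.name, r.roomno
FROM bookedrooms AS b
JOIN rooms AS r ON r.pk = b.roomid
WHERE tstzrange(b.checkindate, b.checkoutdate) &&
tstzrange(
date_trunc('year', current_timestamp) + INTERVAL '1 month' * (4 - 1), -- first day of month 4
date_trunc('year', current_timestamp) + INTERVAL '1 month' * 4 -- first day of the following month
);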
&& is the "overlaps" operator for ranges.
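To make the overlap test concrete, here is a small Python sketch of what && does for two non-empty ranges with the default bounds (lower bound inclusive, upper bound exclusive). The function name ranges_overlap and the sample dates are illustrative assumptions, not part of the original answer.

```python
from datetime import datetime


def ranges_overlap(start_a, end_a, start_b, end_b):
    """True if the half-open ranges [start_a, end_a) and [start_b, end_b) intersect.

    Assumes both ranges are non-empty (start < end).
    """
    return start_a < end_b and start_b < end_a


april = (datetime(2022, 4, 1), datetime(2022, 5, 1))
stay_in_april = (datetime(2022, 4, 3), datetime(2022, 4, 6))
stay_from_may_1 = (datetime(2022, 5, 1), datetime(2022, 5, 4))

booked_in_april = ranges_overlap(*stay_in_april, *april)
touches_only = ranges_overlap(*stay_from_may_1, *april)
print(booked_in_april, touches_only)
```

Because the upper bound is exclusive, a stay that begins exactly when the month range ends does not count as an overlap. For the same reason, a checkout and a new check-in at the same instant do not conflict under the exclusion constraint.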

How to compute a daily rolling balance in PostgreSQL using a txn table

The three tables being used here are
1) customer - cust_id , email
2) account - cust_id, account_id, account_balance
3) Txn - txn_id, account_id, timestamp, credit/debit, amount
I need to calculate the account balance of xyz@abc.com for each of the past 10 days,
eg:
Date | balance
10/6 | 100
09/6 | 100
08/6 | 250
07/6 | 200
.
.
01/6 | 200
txn table would look like this for the above example
account id | txn id | time_stamp | type | amount
1 | 4 | 08/6 | credit | 50
1 | 5 | 09/6 | debit | 150
I wrote the following code after creating another table, daily_txn, that holds the daily transaction totals, but only the second row is being generated:
with cte as (
select dates, daily_balance, daily_total, row_number() over (order by dates desc) as seqnum
from Daily_txn
)
select t.dates, t.daily_total, tprev.daily_balance - coalesce(tprev.daily_total, 0) as new_balance
from cte t left outer join
cte tprev
on t.seqnum = tprev.seqnum + 1;
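One likely reason only the second row is filled in: daily_balance is known only for the most recent day, so the self-join can step back exactly one day. Each earlier day needs all later daily totals undone, not just one. The sketch below shows that backward accumulation in Python. It assumes the balances in the example are end-of-day values and that account_balance holds the balance as of the latest day. The function name daily_balances and the year 2023 are illustrative assumptions.

```python
from datetime import date, timedelta


def daily_balances(current_balance, txns, last_day, days):
    """Walk backward from last_day, undoing each day's net change.

    txns is a list of (day, txn_type, amount), txn_type being 'credit' or 'debit'.
    Returns [(day, end_of_day_balance), ...], newest day first.
    """
    # Net change per day: credits add to the balance, debits subtract.
    net_by_day = {}
    for day, txn_type, amount in txns:
        signed = amount if txn_type == "credit" else -amount
        net_by_day[day] = net_by_day.get(day, 0) + signed

    balances = []
    balance = current_balance
    for offset in range(days):
        day = last_day - timedelta(days=offset)
        balances.append((day, balance))
        # The previous day's closing balance is this day's minus this day's net change.
        balance -= net_by_day.get(day, 0)
    return balances


# Sample transactions from the question; the year is an arbitrary assumption.
txns = [
    (date(2023, 6, 8), "credit", 50),
    (date(2023, 6, 9), "debit", 150),
]
history = daily_balances(100, txns, date(2023, 6, 10), 4)
for day, balance in history:
    print(day, balance)
```

In SQL the same idea would typically be a running SUM of daily_total ordered by date descending, subtracted from the current balance, rather than a one-step self-join.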

Fill in missing rows when aggregating over multiple fields in Postgres

I am aggregating sales for a set of products per day using Postgres and need to know not just when sales do happen, but also when they do not for further processing.
SELECT
sd.date,
COUNT(sd.sale_id) AS sales,
sd.product
FROM sales_data sd
-- sales per product, per day
GROUP BY sd.product, sd.date
ORDER BY sd.product, sd.date
This produces the following:
date | sales | product
------------+-------+-------------------
2017-08-17 | 10 | soap
2017-08-19 | 2 | soap
2017-08-20 | 5 | soap
2017-08-17 | 2 | shower gel
2017-08-21 | 1 | shower gel
As you can see, the date ranges per product are not continuous, because sales_data simply has no rows for these products on some days.
What I'm aiming to do is add a sales = 0 row for each product on every day in a range on which it was not sold. For example, for the range 2017-08-17 to 2017-08-21, that would give something like the following:
date | sales | product
------------+-------+-------------------
2017-08-17 | 10 | soap
2017-08-18 | 0 | soap
2017-08-19 | 2 | soap
2017-08-20 | 5 | soap
2017-08-21 | 0 | soap
2017-08-17 | 2 | shower gel
2017-08-18 | 0 | shower gel
2017-08-19 | 0 | shower gel
2017-08-20 | 0 | shower gel
2017-08-21 | 1 | shower gel
In a simpler case where there was only a single product, it seems like the solution would be to use generate_series() i.e.:
create a full range of dates using generate_series
LEFT JOIN the already aggregated sales data onto the date series
COALESCE any NULL counts to 0 in the missing rows
The problem I have is that this approach does not seem to work when dates repeat in the aggregated data, as I'm grouping not just over multiple dates but over multiple products as well.
It feels like I should be able to do something cunning with window functions here to solve this e.g. joining onto the full date range over partitions defined by the product name - but I can't see a way of actually getting this to work.
You could use:
WITH cte AS (
SELECT date, s.product
FROM ... -- some way to generate date series
CROSS JOIN (SELECT DISTINCT product FROM sales_data) s
)
SELECT
c.date,
c.product,
COUNT(sd.sale_id) AS sales
FROM cte c
LEFT JOIN sales_data sd
ON c.date = sd.date AND c.product= sd.product
GROUP BY c.date, c.product
ORDER BY c.date, c.product;
First create Cartesian product of dates and products, then LEFT JOIN to actual data and do calculations.
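The same idea, sketched in plain Python on the question's aggregated data: build every (product, date) pair first, then look up the real count and fall back to 0. This only illustrates the logic and is not PostgreSQL code. The variable names are assumptions.

```python
from datetime import date, timedelta
from itertools import product

# Aggregated sales from the question: (date, product) -> sales
sales = {
    (date(2017, 8, 17), "soap"): 10,
    (date(2017, 8, 19), "soap"): 2,
    (date(2017, 8, 20), "soap"): 5,
    (date(2017, 8, 17), "shower gel"): 2,
    (date(2017, 8, 21), "shower gel"): 1,
}

# Continuous date series from the first to the last sale date.
first = min(d for d, _ in sales)
last = max(d for d, _ in sales)
all_dates = [first + timedelta(days=i) for i in range((last - first).days + 1)]
all_products = sorted({p for _, p in sales})

# Cartesian product of products and dates; pairs without sales default to 0,
# which plays the role of LEFT JOIN plus COUNT over the missing rows.
filled = [(d, sales.get((d, p), 0), p) for p, d in product(all_products, all_dates)]

for row in filled:
    print(row)
```

Two products over five days give ten rows, five of which are the zero-sales rows that the original aggregation left out.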
Oracle has a great feature for this scenario called partitioned outer joins:
SELECT times.time_id, product, quantity
FROM inventory PARTITION BY (product)
RIGHT OUTER JOIN times ON (times.time_id = inventory.time_id)
WHERE times.time_id BETWEEN TO_DATE('01/04/01', 'DD/MM/YY')
AND TO_DATE('06/04/01', 'DD/MM/YY')
ORDER BY 2,1;
select
date,
count(sale_id) as sales,
product
from
sales_data
right join (
(
select d::date as date
from generate_series (
(select min(date) from sales_data),
(select max(date) from sales_data),
'1 day'
) gs (d)
) gs
cross join
(select distinct product from sales_data) p
) cj using (product, date)
group by product, date
order by product, date

select thresholds dynamically from sql

I have a range of data on search queries across different merchants.
I have a Python script that first creates the head, torso & tail query sets from the main table in qsql, based on count(query) thresholds such as 1000, 100 etc.
Since the merchants my script runs on may or may not have queries that meet those thresholds, the script does not always produce "head.csv", "torso.csv" and "tail.csv".
How can I break the queries into head, torso & tail groups while respecting the logic above?
I also tried ntile to break the groups by percentiles(33, 33, 33), but that skews both the head & torso, if a merchant has a very long tail.
Current :
# head
select trim(query) as query, count(*)
from my_merchant_table
-- other conditions & date range
GROUP BY trim(query)
having count(*) >=1000
#torso
select trim(query) as query, count(*)
from my_merchant_table
-- other conditions & date range
GROUP BY trim(query)
having count(*) <1000 and count(*) >=100
#tail
select trim(query) as query, count(*)
from my_merchant_table
-- other conditions & date range
GROUP BY trim(query)
having count(*) <100
# using ntile - but note that I have percentiles of "3" , 33.#% each, which introduces the skew
select trim(query), count(*) as query_count,
ntile(3) over(order by query_count desc) AS group_ntile
from my_merchant_table
group by trim(query)
order by query_count desc limit 100;
Ideally the solution can build on top of this -:
select trim(query), count(*) as query_count,
ntile(100) over(order by query_count desc) AS group_ntile
from my_merchant_table
-- other conditions & date range
group by trim(query)
order by query_count desc
This gives,
btrim query_count group_ntile
q0 1277 1
q1 495 1
q2 357 1
q3 246 1
# so on till group_ntile =100 , while the query_count reduces.
Question :
What is the best way to make the overall logic merchant-agnostic, without hard-coding the thresholds?
Note : I am fetching the data in Redshift, the solution should be compatible to postgres 8.0 & redshift in particular.
I imagine that you run these queries from some programming language. My recommendation is to fetch all the records and apply the filter to them in the application. Keep in mind that running several operations over the data inside the database can hurt the response time of the application.
Assuming that the main challenge is to create the 'tiles' from a list of values, here is some sample code. It takes the 13 provinces and territories of Canada and breaks them into a requested number of groups. It uses the province names, but numbers would work just as well.
SELECT * FROM Provinces ORDER BY province; -- To see what we are working with
+---------------------------+
| province |
+---------------------------+
| Alberta |
| British Columbia |
| Manitoba |
| New Brunswick |
| Newfoundland and Labrador |
| Northwest Territories |
| Nova Scotia |
| Nunavut |
| Ontario |
| Prince Edward Island |
| Quebec |
| Saskatchewan |
| Yukon |
+---------------------------+
13 rows in set (0.00 sec)
Now for the code:
SELECT @n := COUNT(*), -- Find total count (13)
@j := 0.5, -- 'trust me'
@tiles := 3 -- The number of groupings
FROM Provinces;
SELECT group_start
FROM (
SELECT
IF((@j * @tiles) % @n < @tiles, province, NULL) AS group_start,
@j := @j + 1
FROM Provinces
ORDER BY province
) x
WHERE group_start IS NOT NULL;
+---------------------------+
| group_start |
+---------------------------+
| Alberta |
| Newfoundland and Labrador |
| Prince Edward Island |
+---------------------------+
3 rows in set (0.00 sec)
With @tiles set to 4:
+---------------+
| group_start |
+---------------+
| Alberta |
| New Brunswick |
| Nova Scotia |
| Quebec |
+---------------+
4 rows in set (0.00 sec)
It is reasonably efficient: 1 pass to count the number of rows, 1 pass to do the computation, 1 pass to filter out the non-break values.
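For readers who want to see why the modulo test picks those rows, here is the same arithmetic as a short Python sketch. The function name group_starts is an assumption. The counter j and the condition are taken directly from the SQL above.

```python
def group_starts(items, tiles):
    """Return the items that start each of `tiles` roughly equal groups.

    Mirrors the SQL: j runs 0.5, 1.5, 2.5, ... and an item starts a new group
    whenever (j * tiles) % n falls below tiles.
    """
    n = len(items)
    starts = []
    j = 0.5
    for item in items:
        if (j * tiles) % n < tiles:
            starts.append(item)
        j += 1
    return starts


provinces = [
    "Alberta", "British Columbia", "Manitoba", "New Brunswick",
    "Newfoundland and Labrador", "Northwest Territories", "Nova Scotia",
    "Nunavut", "Ontario", "Prince Edward Island", "Quebec",
    "Saskatchewan", "Yukon",
]

print(group_starts(provinces, 3))
print(group_starts(provinces, 4))
```

With 13 rows and 3 tiles the groups start at positions 0, 4 and 9, so the group sizes are 4, 5 and 4.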

Selecting rows ordered by some column and distinct on another

Related to - PostgreSQL DISTINCT ON with different ORDER BY
I have table purchases (product_id, purchased_at, address_id)
Sample data:
| id | product_id | purchased_at | address_id |
| 1 | 2 | 20 Mar 2012 21:01 | 1 |
| 2 | 2 | 20 Mar 2012 21:33 | 1 |
| 3 | 2 | 20 Mar 2012 21:39 | 2 |
| 4 | 2 | 20 Mar 2012 21:48 | 2 |
The result I expect is the most recently purchased product (full row) for each address_id, and that result must be sorted in descending order by the purchased_at field:
| id | product_id | purchased_at | address_id |
| 4 | 2 | 20 Mar 2012 21:48 | 2 |
| 2 | 2 | 20 Mar 2012 21:33 | 1 |
Using query:
SELECT DISTINCT ON (address_id) purchases.address_id, purchases.*
FROM "purchases"
WHERE "purchases"."product_id" = 2
ORDER BY purchases.address_id ASC, purchases.purchased_at DESC
I'm getting:
| id | product_id | purchased_at | address_id |
| 2 | 2 | 20 Mar 2012 21:33 | 1 |
| 4 | 2 | 20 Mar 2012 21:48 | 2 |
So the rows are the same, but the order is wrong. Is there any way to fix it?
Quite a clear question :)
SELECT t1.* FROM purchases t1
LEFT JOIN purchases t2
ON t1.address_id = t2.address_id AND t1.purchased_at < t2.purchased_at
WHERE t2.purchased_at IS NULL
ORDER BY t1.purchased_at DESC
And most likely a faster approach:
SELECT t1.* FROM purchases t1
JOIN (
SELECT address_id, max(purchased_at) max_purchased_at
FROM purchases
GROUP BY address_id
) t2
ON t1.address_id = t2.address_id AND t1.purchased_at = t2.max_purchased_at
ORDER BY t1.purchased_at DESC
Your ORDER BY is used by DISTINCT ON for picking which row for each distinct address_id to produce. If you then want to order the resulting records, make the DISTINCT ON a subselect and order its results:
SELECT * FROM
(
SELECT DISTINCT ON (address_id) purchases.address_id, purchases.*
FROM "purchases"
WHERE "purchases"."product_id" = 2
ORDER BY purchases.address_id ASC, purchases.purchased_at DESC
) distinct_addrs
order by distinct_addrs.purchased_at DESC
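The two stages can be pictured separately: first pick one row per address_id, then sort the survivors. Here is a minimal Python sketch on the question's sample data. It illustrates the logic only, and the variable names are assumptions.

```python
from datetime import datetime

purchases = [
    {"id": 1, "product_id": 2, "purchased_at": datetime(2012, 3, 20, 21, 1), "address_id": 1},
    {"id": 2, "product_id": 2, "purchased_at": datetime(2012, 3, 20, 21, 33), "address_id": 1},
    {"id": 3, "product_id": 2, "purchased_at": datetime(2012, 3, 20, 21, 39), "address_id": 2},
    {"id": 4, "product_id": 2, "purchased_at": datetime(2012, 3, 20, 21, 48), "address_id": 2},
]

# Stage 1 (the inner DISTINCT ON): keep the latest row for each address_id.
latest = {}
for row in purchases:
    if row["product_id"] != 2:
        continue
    best = latest.get(row["address_id"])
    if best is None or row["purchased_at"] > best["purchased_at"]:
        latest[row["address_id"]] = row

# Stage 2 (the outer ORDER BY): sort the chosen rows by purchased_at, newest first.
result = sorted(latest.values(), key=lambda r: r["purchased_at"], reverse=True)
result_ids = [r["id"] for r in result]
print(result_ids)
```

The inner ordering only decides which row wins within each address. The final order comes from the outer sort alone.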
This query is trickier to rephrase properly than it looks.
The currently accepted, join-based answer doesn’t correctly handle the case where two candidate rows have the same given purchased_at value: it will return both rows.
You can get the right behaviour this way:
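SELECT * FROM purchases AS given
WHERE product_id = 2
AND NOT EXISTS (
SELECT NULL FROM purchases AS other
WHERE given.address_id = other.address_id
AND other.product_id = given.product_id
-- a later purchase, or the same timestamp with a higher id, beats this row
AND (given.purchased_at < other.purchased_at
OR (given.purchased_at = other.purchased_at AND given.id < other.id))
)
ORDER BY purchased_at DESC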
Note how it has a fallback of comparing id values to disambiguate the case in which the purchased_at values match. This ensures that the condition can only ever be true for a single row among those that have the same address_id value.
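Here is a quick way to check the tie-breaking on data where two rows share a purchased_at value. This sketch assumes Python's built-in sqlite3 rather than PostgreSQL, uses made-up rows with a deliberate tie, and spells the fallback out as "same purchased_at and a lower id" so that the id comparison applies only to ties.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (id INTEGER, product_id INTEGER, purchased_at TEXT, address_id INTEGER)"
)
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?, ?)",
    [
        (1, 2, "2012-03-20 21:01", 1),
        (2, 2, "2012-03-20 21:33", 1),
        (3, 2, "2012-03-20 21:48", 2),  # ties with row 4 on purchased_at
        (4, 2, "2012-03-20 21:48", 2),
    ],
)

# A row survives only if no other row for the same address and product is newer,
# or equally new with a higher id.
query = """
SELECT id FROM purchases AS given
WHERE product_id = 2
AND NOT EXISTS (
    SELECT NULL FROM purchases AS other
    WHERE other.address_id = given.address_id
    AND other.product_id = given.product_id
    AND (given.purchased_at < other.purchased_at
         OR (given.purchased_at = other.purchased_at AND given.id < other.id))
)
ORDER BY purchased_at DESC
"""
winner_ids = [row[0] for row in conn.execute(query)]
print(winner_ids)
```

Exactly one row per address survives even though rows 3 and 4 tie on purchased_at.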
The original query using DISTINCT ON handles this case automatically!
Also note the way that you are forced to encode the fact that you want “the latest for each address_id” twice, both in the given.purchased_at < other.purchased_at condition and the ORDER BY purchased_at DESC clause, and you have to make sure they match. I had to spend a few extra minutes to convince myself that this query is really positively correct.
It’s much easier to write this query correctly and understandably by using DISTINCT ON together with an outer subquery, as suggested by dbenhur.
Try this:
SELECT DISTINCT ON (address_id) *
FROM purchases
WHERE product_id = 2
ORDER BY address_id, purchased_at DESC