I have to create a report which shows the cross-selling of a product. That is, if I have a product X, I need to find the combinations in which it is purchased together with other products, and each combination's count.
So I have the table structure as follows.
Below is the sample data. Reference is the order number, and each item has a separate row showing the product and category details.
Reference | Product Name  | Prod ID | Category
----------+---------------+---------+----------
1000001   | Honda x12     | 10023   | Machinery
1000001   | Honda cv12    | 10025   | Machinery
1000002   | Medic. 12x    | 10026   | Medicine
1000002   | Honda x12     | 10023   | Machinery
1000003   | Honda x12     | 10023   | Machinery
1000004   | Appliance x12 | 10033   | Household
1000004   | Honda x12     | 10023   | Machinery
1000005   | Bag x234      | 100265  | Bags
I want the output to be like this:
Suppose I want to find the cross-sell products for Honda x12. That means I want to know which products are sold in combination with Honda x12, and the number of times each particular combination occurred.
Can anyone suggest how I can do this in PostgreSQL (version 11)?
Thanks in advance.
I think that's a self-join with an inequality condition:
select t.prod_id prod_id1, x.prod_id prod_id2, count(*) cnt
from mytable t
inner join mytable x
on x.reference = t.reference
and x.prod_id > t.prod_id
group by t.prod_id, x.prod_id
order by 1, 2
The > in the join predicate (rather than <>) is on purpose: it avoids "mirror" records in the result set, where each pair would otherwise appear twice, once in each order.
For your sample data, this generates:
prod_id1 | prod_id2 | cnt
---------+----------+----
   10023 |    10025 |   1
   10023 |    10026 |   1
   10023 |    10033 |   1
This gives you the results for all products at once. If you want the list of "pairs" for one given product only, then it is slightly different:
select t.prod_id, count(*) cnt
from mytable t
inner join mytable x
on x.reference = t.reference
and x.prod_id <> t.prod_id
where x.prod_id = 10023
group by t.prod_id
order by 1
Demo:
prod_id | cnt
--------+----
  10025 |   1
  10026 |   1
  10033 |   1
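If the report also needs product names rather than just the ids, the pair query can carry them along. A sketch, assuming the name column in mytable is called product_name:
select t.prod_id           as prod_id1,
       min(t.product_name) as name1,  -- name is functionally dependent on prod_id
       x.prod_id           as prod_id2,
       min(x.product_name) as name2,
       count(*)            as cnt
from mytable t
inner join mytable x
        on x.reference = t.reference
       and x.prod_id > t.prod_id
group by t.prod_id, x.prod_id
order by 1, 3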
I have a database that has product names in column 1 and product release dates in column 2. I want to find 'old' products by their release date. However, I'm only interested in 'old' products that were released a minimum of 1 year ago. I cannot make any edits to the original database infrastructure.
The table looks like this:
Product | Release_Day
--------+------------
A       | 2018-08-23
A       | 2017-08-23
A       | 2019-08-21
B       | 2018-08-22
B       | 2016-08-22
B       | 2017-08-22
C       | 2018-10-25
C       | 2016-10-25
C       | 2019-08-19
I have already tried multiple versions with DISTINCT, MAX, BETWEEN, >, <, etc.:
SELECT DISTINCT product,MAX(release_day) as most_recent_release
FROM Product_Release
WHERE
release_day between '2015-08-22' and '2018-08-22'
and release_day not between '2018-08-23' and '2019-08-22'
GROUP BY 1
ORDER BY MAX(release_day) DESC
The expected results should not contain any products found by this query:
SELECT DISTINCT product,MAX(release_day) as most_recent_release
FROM Product_Release
WHERE
release_day between '2018-08-23' and '2019-08-22'
AND product = 'A'
GROUP BY 1
However, every check I complete returns a product from this date range.
This is the output of the initial query:
Product|Most_Recent_Release
A | 2018-08-23
B | 2018-08-22
C | 2015-10-25
And, for example, if I run the check query on Product A, I get this:
Product|Most_Recent_Release
A | 2019-08-21
Use HAVING to filter on most_recent_release
SELECT product, MAX(release_day) as most_recent_release
FROM Product_Release
GROUP BY product
HAVING MAX(release_day) < '2018-08-23'
ORDER BY most_recent_release DESC
There's no need to use DISTINCT when you use GROUP BY -- you can't get duplicates if there's only one row per product.
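If the cutoff should track the current date instead of being hard-coded, the same filter can be written relative to today. A sketch, assuming standard PostgreSQL date arithmetic:
SELECT product, MAX(release_day) AS most_recent_release
FROM Product_Release
GROUP BY product
HAVING MAX(release_day) < CURRENT_DATE - INTERVAL '1 year'
ORDER BY most_recent_release DESC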
Within my PostgreSQL database, I have an id column that shows each unique lead that comes in. I also have a connected_lead_id column which shows whether accounts are related to each other (i.e. husband and wife, parents and children, a group of friends, a group of investors, etc.).
When we count the number of ids created during a time period, we want to see the number of unique "groups" of connected_ids during a period. In other words, we wouldn't want to count both the husband and wife pair, we would only want to count one since they are truly one lead.
We want to be able to create a view that only has the "first" id based on the "created_at" date and then contains additional columns at the end for "connected_lead_id_1", "connected_lead_id_2", "connected_lead_id_3", etc.
We want to add in additional logic so that we take the "first" id's source, unless that is null, then take the "second" connected_lead_id's source unless that is null and so on. Finally, we want to take the earliest on_boarded_date from the connected_lead_id group.
id  | created_at    | connected_lead_id | on_boarded_date | source
----+---------------+-------------------+-----------------+----------
2   | 9/24/15 23:00 | 8                 |                 |
4   | 9/25/15 23:00 | 7                 |                 | event
7   | 9/26/15 23:00 | 4                 |                 |
8   | 9/26/15 23:00 | 2                 |                 | referral
11  | 9/26/15 23:00 | 336               | 7/1/17          | online
142 | 4/27/16 23:00 | 336               |                 |
336 | 7/4/16 23:00  | 11                | 9/20/18         | referral
End Goal:
id | created_at    | on_boarded_date | source
---+---------------+-----------------+----------
2  | 9/24/15 23:00 |                 | referral
4  | 9/25/15 23:00 |                 | event
11 | 9/26/15 23:00 | 7/1/17          | online
Ideally, we would also have a number of extra columns at the end to show each connected_lead_id that is attached to the base id.
Thanks for the help!
OK, the best I can come up with at the moment is to first build maximal groups of related IDs, and then join back to your table of leads to get the rest of the data (see this SQL Fiddle for the setup, full queries and results).
To get the maximal groups you can use a recursive common table expression to first grow the groups, followed by a query to filter the CTE results down to just the maximal groups:
with recursive cte(grp) as (
select case when l.connected_lead_id is null then array[l.id]
else array[l.id, l.connected_lead_id]
end from leads l
union all
select grp || l.id
from leads l
join cte
on l.connected_lead_id = any(grp)
and not l.id = any(grp)
)
select * from cte c1
The CTE above outputs several similar groups as well as intermediary groups. The query predicate below prunes out the non-maximal groups and limits the results to just one permutation of each possible group:
where not exists (select 1 from cte c2
                  where c1.grp && c2.grp
                  and ((not c1.grp @> c2.grp)
                       or (c2.grp < c1.grp
                           and c1.grp @> c2.grp
                           and c1.grp <@ c2.grp)));
Results:
| grp |
|------------|
| 2,8 |
| 4,7 |
| 14 |
| 11,336,142 |
| 12,13 |
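The array operators doing the work in that predicate are && (overlap), @> (contains) and <@ (is contained by). A quick standalone check in plain PostgreSQL:
select array[2,8] && array[8,2,5] as overlaps,  -- true: they share elements
       array[8,2,5] @> array[2,8] as contains,  -- true: {8,2,5} contains {2,8}
       array[2,8] <@ array[8,2,5] as contained  -- true: {2,8} is contained by {8,2,5}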
Next join the final query above back to your leads table and use window functions to get the remaining column values, along with the distinct operator to prune it down to the final result set:
with recursive cte(grp) as (
...
)
select distinct
first_value(l.id) over (partition by grp order by l.created_at) id
, first_value(l.created_at) over (partition by grp order by l.created_at) create_at
, first_value(l.on_boarded_date) over (partition by grp order by l.created_at) on_boarded_date
, first_value(l.source) over (partition by grp
order by case when l.source is null then 2 else 1 end
, l.created_at) source
, grp CONNECTED_IDS
from cte c1
join leads l
on l.id = any(grp)
where not exists (select 1 from cte c2
                  where c1.grp && c2.grp
                  and ((not c1.grp @> c2.grp)
                       or (c2.grp < c1.grp
                           and c1.grp @> c2.grp
                           and c1.grp <@ c2.grp)));
Results:
| id | create_at | on_boarded_date | source | connected_ids |
|----|----------------------|-----------------|----------|---------------|
| 2 | 2015-09-24T23:00:00Z | (null) | referral | 2,8 |
| 4 | 2015-09-25T23:00:00Z | (null) | event | 4,7 |
| 11 | 2015-09-26T23:00:00Z | 2017-07-01 | online | 11,336,142 |
| 12 | 2015-09-26T23:00:00Z | 2017-07-01 | event | 12,13 |
| 14 | 2015-09-26T23:00:00Z | (null) | (null) | 14 |
demo:db<>fiddle
Main idea - sketch:
Loop through the ordered set. Take all ids that haven't been seen before in any connected_lead_id (cli). These are your starting points for the recursion.
The problem is your number 142, which hasn't been seen before but is in the same group as 11 because of its cli. So it is better to collect the clis of the unseen ids instead. With these values it's much simpler to calculate the ids of the groups later, in the recursion part. Because of the loop, a function/stored procedure is necessary.
The recursion part: The first step is to get the ids of the starting clis, calculating the first referring id by using the created_at timestamp. After that, a simple tree recursion over the clis can be done.
1. The function:
CREATE OR REPLACE FUNCTION filter_groups() RETURNS int[] AS $$
DECLARE
_seen_values int[];
_new_values int[];
_temprow record;
BEGIN
FOR _temprow IN
-- 1:
SELECT array_agg(id ORDER BY created_at) as ids, connected_lead_id FROM groups GROUP BY connected_lead_id ORDER BY MIN(created_at)
LOOP
-- 2:
IF array_length(_seen_values, 1) IS NULL
OR (_temprow.ids || _temprow.connected_lead_id) && _seen_values = FALSE THEN
_new_values := _new_values || _temprow.connected_lead_id;
END IF;
_seen_values := _seen_values || _temprow.ids;
_seen_values := _seen_values || _temprow.connected_lead_id;
END LOOP;
RETURN _new_values;
END;
$$ LANGUAGE plpgsql;
Grouping all ids that refer to the same cli.
Loop through the id arrays. If no element of the array was seen before, add the referred cli to the output variable (_new_values). In both cases, add the ids and the cli to the variable which stores all ids seen so far (_seen_values).
Return the clis.
The result so far is {8, 7, 336} (which is equivalent to the ids {2,4,11,142}!)
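For reference, the function can be run on its own to inspect those starting clis (assuming the sample rows live in the groups table the function reads from):
SELECT filter_groups();         -- {8,7,336}
SELECT unnest(filter_groups()); -- one row per starting cli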
2. The recursion:
-- 1:
WITH RECURSIVE start_points AS (
SELECT unnest(filter_groups()) as ids
),
filtered_groups AS (
SELECT DISTINCT
1 as depth, -- 3
first_value(id) OVER w as id, -- 4
ARRAY[(MIN(id) OVER w)] as visited, -- 5
MIN(created_at) OVER w as created_at,
connected_lead_id,
MIN(on_boarded_date) OVER w as on_boarded_date, -- 6
first_value(source) OVER w as source
FROM groups
WHERE connected_lead_id IN (SELECT ids FROM start_points)
-- 2:
WINDOW w AS (PARTITION BY connected_lead_id ORDER BY created_at)
UNION
SELECT
fg.depth + 1,
fg.id,
array_append(fg.visited, g.id), -- 8
LEAST(fg.created_at, g.created_at),
g.connected_lead_id,
LEAST(fg.on_boarded_date, g.on_boarded_date), -- 9
COALESCE(fg.source, g.source) -- 10
FROM groups g
JOIN filtered_groups fg
-- 7
ON fg.connected_lead_id = g.id AND NOT (g.id = ANY(visited))
)
SELECT DISTINCT ON (id) -- 11
id, created_at,on_boarded_date, source
FROM filtered_groups
ORDER BY id, depth DESC;
The WITH part gives out the results of the function. unnest() expands the id array into one row per id.
Creating a window: the window function groups all values by their clis and orders the window by the created_at timestamp. In your example all values are in their own window except 11 and 142, which are grouped together.
This is a helper variable to get the latest rows later on.
first_value() gives the first value of the ordered window frame. Had 142 had a smaller created_at timestamp, the result would have been 142; but it is 11 nevertheless.
A variable is needed to record which ids have been visited already. Without this information an infinite loop would be created: 2-8-2-8-2-8-2-8-...
The minimum date of the window is taken (same thing here: if 142 had a smaller date than 11, that would be the result).
Now the starting query of the recursion is calculated. The following describes the recursion part:
Joining the table (the original function results) against the previous recursion result. The second condition is the stop for the infinite loop I mentioned above.
Appending the currently visited id to the visited variable.
If the current on_boarded_date is earlier, it is taken.
COALESCE gives the first NOT NULL value, so the first NOT NULL source is saved throughout the whole recursion.
After the recursion, which gives a result of all recursion steps, we want to filter out only the deepest visit for every starting id.
DISTINCT ON (id) gives out the row with the first occurrence of each id. To get the last (deepest) one, the whole set is ordered descending by the depth variable.
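A minimal, self-contained illustration of that DISTINCT ON ordering trick (plain PostgreSQL, hypothetical values):
-- DISTINCT ON (id) keeps the first row per id in the given sort order,
-- so ordering by depth DESC keeps the deepest recursion step
SELECT DISTINCT ON (id) id, depth
FROM (VALUES (11, 1), (11, 2), (11, 3)) AS v(id, depth)
ORDER BY id, depth DESC; -- returns the single row (11, 3)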
I am aggregating sales for a set of products per day using Postgres and need to know not just when sales do happen, but also when they do not for further processing.
SELECT
sd.date,
COUNT(sd.sale_id) AS sales,
sd.product
FROM sales_data sd
-- sales per product, per day
GROUP BY sd.product, sd.date
ORDER BY sd.product, sd.date
This produces the following:
date | sales | product
------------+-------+-------------------
2017-08-17 | 10 | soap
2017-08-19 | 2 | soap
2017-08-20 | 5 | soap
2017-08-17 | 2 | shower gel
2017-08-21 | 1 | shower gel
As you can see - the date ranges per product are not continuous as sales_data just didn't contain any info for these products on some days.
What I'm aiming to do is to add a sales = 0 row for each product that is not sold on any day in a range - for example here, between 2017-08-17 and 2017-08-21 - to give something like the following:
date | sales | product
------------+-------+-------------------
2017-08-17 | 10 | soap
2017-08-18 | 0 | soap
2017-08-19 | 2 | soap
2017-08-20 | 5 | soap
2017-08-21 | 0 | soap
2017-08-17 | 2 | shower gel
2017-08-18 | 0 | shower gel
2017-08-19 | 0 | shower gel
2017-08-20 | 0 | shower gel
2017-08-21 | 1 | shower gel
In a simpler case where there was only a single product, it seems like the solution would be to use generate_series() i.e.:
create a full range of dates using generate_series
LEFT JOIN the already aggregated sales data onto the date series
COALESCE any NULL counts to 0 in the missing rows
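For the single-product case, a minimal sketch of those three steps (assuming the 2017-08-17 to 2017-08-21 window from the example; COUNT(sd.sale_id) already yields 0 for the NULL-extended rows, so the COALESCE step is implicit):
SELECT gs.d::date AS date,
       COUNT(sd.sale_id) AS sales  -- 0 where the LEFT JOIN found no sales
FROM generate_series('2017-08-17'::date,
                     '2017-08-21'::date,
                     INTERVAL '1 day') AS gs(d)
LEFT JOIN sales_data sd ON sd.date = gs.d::date
GROUP BY gs.d
ORDER BY gs.d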
The problem I have is that this approach does not seem to work when dates repeat in the aggregated data, as I'm grouping over not just multiple dates but multiple products as well.
It feels like I should be able to do something cunning with window functions here to solve this e.g. joining onto the full date range over partitions defined by the product name - but I can't see a way of actually getting this to work.
You could use:
WITH cte AS (
SELECT date, s.product
FROM ... -- some way to generate date series
CROSS JOIN (SELECT DISTINCT product FROM sales_data) s
)
SELECT
c.date,
c.product,
COUNT(sd.sale_id) AS sales
FROM cte c
LEFT JOIN sales_data sd
ON c.date = sd.date AND c.product= sd.product
GROUP BY c.date, c.product
ORDER BY c.date, c.product;
First create the Cartesian product of dates and products, then LEFT JOIN it to the actual data and do the calculations.
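For the elided date series, one option is generate_series() spanning the data, as the question itself suggests. A sketch:
WITH cte AS (
    SELECT gs.d::date AS date, s.product
    FROM generate_series((SELECT MIN(date) FROM sales_data),
                         (SELECT MAX(date) FROM sales_data),
                         INTERVAL '1 day') AS gs(d)
    CROSS JOIN (SELECT DISTINCT product FROM sales_data) s
)
-- ... rest of the query as above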
Oracle has a great feature for these scenarios called partitioned outer joins:
SELECT times.time_id, product, quantity
FROM inventory PARTITION BY (product)
RIGHT OUTER JOIN times ON (times.time_id = inventory.time_id)
WHERE times.time_id BETWEEN TO_DATE('01/04/01', 'DD/MM/YY')
AND TO_DATE('06/04/01', 'DD/MM/YY')
ORDER BY 2,1;
select
date,
count(sale_id) as sales,
product
from
sales_data
right join (
(
select d::date as date
from generate_series (
(select min(date) from sales_data),
(select max(date) from sales_data),
'1 day'
) gs (d)
) gs
cross join
(select distinct product from sales_data) p
) cj using (product, date)
group by product, date
order by product, date
I have two tables, customerusermap and users. Whenever a user signs up with our product, they immediately get added into a table called users but it isn't until they start paying for a user that they get added to a table called customerusermap.
The users table looks like this:
id | customer_id | firstname | lastname | created_at
-------------------------------------------------------
1725 | cus_3hEmhErE2jbwsO | Abby | Smith | 2015-03-19
1726 | cus_7oNweUrE4jbwr2 | Sam | Peters | 2015-06-20
The customerusermap table looks like this:
customer_id | user_id | created_at
------------------------------------------
cus_3hEmhErE2jbwsO | 9275 | 2015-09-01
cus_3hEmhErE2jbwsO | 2628 | 2015-09-05
cus_3hEmhErE2jbwsO | 2358 | 2015-07-05
cus_3hEmhErE2jbwsO | 3158 | 2015-08-05
cus_3hEmhErE2jbwsO | 2487 | 2015-08-05
cus_3hEmhErE2jbwsO | 6044 | 2015-08-05
cus_7oNweUrE4jbwr2 | 8094 | 2015-08-25
cus_7oNweUrE4jbwr2 | 2345 | 2015-09-02
In this example, Abby (cus_3hEmhErE2jbwsO) is paying for 6 users. She started paying for user 2358 on 2015-07-05, so she should be considered a paying customer as of 07-2015, not 03-2015. Sam is paying for 2 users, and he started paying for user 8094 in 08-2015, so he is considered a paying customer as of 08-2015, not 06-2015. I have a query that grabs and groups the number of paying customers each month:
SELECT concat(extract(MONTH from u.created_at),'-',extract(year from u.created_at)) as "Month",
COUNT(distinct u.email) as "Total AB Paying Customers"
FROM customerusermap AS cm, users AS u
WHERE cm.customer_id=u.customer_id AND cm.user_id <> u.id
GROUP BY 1,extract(month from u.created_at),extract(year from u.created_at)
ORDER BY extract(year from u.created_at),extract(month from u.created_at);
But this grabs and counts by the date the customer was added to the users table, not the date they actually started paying. How would I grab the counts so that they are based on the earliest date in the customerusermap table? The needed output in this example should look like:
Month | Total AB Paying Customers
-------------------------------------
07-2015 | 1
08-2015 | 1
You can use the following query:
SELECT CONCAT(EXTRACT(MONTH FROM startedPayingDate), '-',
EXTRACT(YEAR FROM startedPayingDate)) AS "Month",
COUNT(*) AS "Total AB Paying Customers"
FROM (
SELECT customer_id, MIN(created_at) AS startedPayingDate
FROM customerusermap AS cm
WHERE NOT EXISTS (SELECT 1
FROM users AS u
WHERE cm.user_id = u.id)
GROUP BY customer_id ) AS t
GROUP BY 1
I used a NOT EXISTS operator to exclude records that relate to 'paying for themselves' customers (if that is really your intention).
Once you get the MIN(created_at) date per customer_id, then you can easily count per date in an outer query.
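Worked through by hand on the sample data, the inner query alone would produce:
customer_id        | startedPayingDate
-------------------+------------------
cus_3hEmhErE2jbwsO | 2015-07-05
cus_7oNweUrE4jbwr2 | 2015-08-25
which the outer query then rolls up into the expected 07-2015 | 1 and 08-2015 | 1 rows.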
Demo here
I have a range of data on search queries across different merchants.
I have a Python script that first creates the head, torso & tail query sets from the main table in qsql, based on count(query) thresholds such as 1000, 100, etc.
Since the merchants my script runs over may or may not have queries that meet those thresholds, the script does not always produce head.csv, torso.csv and tail.csv.
How can I break the queries into head, torso & tail groups while respecting the logic above?
I also tried ntile to break the groups into percentiles (33, 33, 33), but that skews both the head & torso if a merchant has a very long tail.
Current :
# head
select trim(query) as query, count(*)
from my_merchant_table
-- other conditions & date range
GROUP BY trim(query)
having count(*) >=1000
#torso
select trim(query) as query, count(*)
from my_merchant_table
-- other conditions & date range
GROUP BY trim(query)
having count(*) <1000 and count(*) >=100
#tail
select trim(query) as query, count(*)
from my_merchant_table
-- other conditions & date range
GROUP BY trim(query)
having count(*) <100
# using ntile - but note that I have 3 percentiles of 33.3% each, which introduces the skew
select trim(query), count(*) as query_count,
ntile(3) over(order by query_count desc) AS group_ntile
from my_merchant_table
group by trim(query)
order by query_count desc limit 100;
Ideally the solution can build on top of this:
select trim(query), count(*) as query_count,
ntile(100) over(order by query_count desc) AS group_ntile
from my_merchant_table
-- other conditions & date range
group by trim(query)
order by query_count desc
This gives,
btrim | query_count | group_ntile
------+-------------+------------
q0    | 1277        | 1
q1    | 495         | 1
q2    | 357         | 1
q3    | 246         | 1
# and so on until group_ntile = 100, while query_count decreases.
Question:
What is the best way to implement this logic so that it is merchant-agnostic, with no hard-coded configs?
Note: I am fetching the data in Redshift; the solution should be compatible with Postgres 8.0, and Redshift in particular.
I imagine that you invoke these queries from some programming language to process the information. My recommendation in this regard is to get all the records and then apply the filtering over them in the application. Consider that if you run queries with several operations over the data inside the database, the response time of the application will be affected.
Assuming that the main challenge is to create the 'tiles' from a list of values, here is some sample code (MySQL syntax, using session variables; a Postgres/Redshift sketch follows at the end). It takes the 13 provinces of Canada and breaks them into a requested number of groups. It uses the province names, but numbers would work just as well.
SELECT * FROM Provinces ORDER BY province; -- To see what we are working with
+---------------------------+
| province |
+---------------------------+
| Alberta |
| British Columbia |
| Manitoba |
| New Brunswick |
| Newfoundland and Labrador |
| Northwest Territories |
| Nova Scotia |
| Nunavut |
| Ontario |
| Prince Edward Island |
| Quebec |
| Saskatchewan |
| Yukon |
+---------------------------+
13 rows in set (0.00 sec)
Now for the code:
SELECT @n := COUNT(*),  -- Find total count (13)
       @j := 0.5,       -- 'trust me'
       @tiles := 3      -- The number of groupings
    FROM Provinces;
SELECT group_start
    FROM (
        SELECT
            IF((@j * @tiles) % @n < @tiles, province, NULL) AS group_start,
            @j := @j + 1
        FROM Provinces
        ORDER BY province
         ) x
    WHERE group_start IS NOT NULL;
+---------------------------+
| group_start |
+---------------------------+
| Alberta |
| Newfoundland and Labrador |
| Prince Edward Island |
+---------------------------+
3 rows in set (0.00 sec)
With @tiles set to 4:
+---------------+
| group_start |
+---------------+
| Alberta |
| New Brunswick |
| Nova Scotia |
| Quebec |
+---------------+
4 rows in set (0.00 sec)
It is reasonably efficient: 1 pass to count the number of rows, 1 pass to do the computation, 1 pass to filter out the non-break values.
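Since the question asks for Postgres/Redshift compatibility and those engines lack MySQL-style session variables, the same break-point test can be sketched with window functions instead (an untested sketch; it assumes the same Provinces table and 3 groupings, and requires window-function support, i.e. PostgreSQL 8.4+ or Redshift):
SELECT province AS group_start
FROM (
    SELECT province,
           ROW_NUMBER() OVER (ORDER BY province) AS rn, -- plays the role of @j, offset by 0.5
           COUNT(*) OVER () AS n                        -- plays the role of @n
    FROM Provinces
) x
WHERE ((rn - 0.5) * 3) % n < 3 -- same test as (@j * @tiles) % @n < @tiles
ORDER BY province;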