Calculate Mode - "Highest frequency row" in DB2

What would be the most efficient way to calculate the mode across tables with joins in DB2?
I am trying to get the value with the highest frequency (count) for a given column (ID, the candidate key for the joined table) on a given date.
The idea is to get the most common value from the table, which has different values for some accounts (for the same ID and date). We need to make it unique for use in another table.

You can use common table expressions (CTEs), indicated by WITH, to break the logic down into logical steps. First we'll build the summary rows, then we'll assign a ranking to the rows within each group, then pick out the ones with the highest count of records.
Let's say we want to know which flavor of each item sells the most frequently on each date (perhaps assuming a record is quantity one).
WITH s AS
(
  SELECT itemID, saleDate, flavor, COUNT(*) AS tally
  FROM sales
  GROUP BY itemID, saleDate, flavor
), r AS
(
  SELECT itemID, saleDate, flavor, tally,
         RANK() OVER (PARTITION BY itemID, saleDate ORDER BY tally DESC) AS pri
  FROM s
)
SELECT itemID, saleDate, flavor, tally
FROM r
WHERE pri = 1
Here the names "s" and "r" refer to the result sets of their respective CTEs. These names can then be used to represent a table in another part of the statement.
The pri column holds the RANK() of the tally value on each summary row from the first section "s", within the window of itemID and saleDate. The tally is ordered descending because we want the largest value first, which gets a RANK() of 1. Then in the main SELECT we simply pick the summary records that were first in their partition.
By using RANK() or DENSE_RANK() we could get back multiple flavors for an itemID and saleDate if they are tied for first place. That could be eliminated by replacing RANK() with ROW_NUMBER(), but it would arbitrarily pick one of the tied flavors as the winner, and that may not be the correct answer for the problem at hand.
If we had a sales quantity column in the table, we could replace COUNT(*) with SUM(salesqty) and find what had sold the most units.
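As a hedged sketch of that variant, assuming the sales table really does have a salesqty column as described above:
WITH s AS
(
  -- total units sold per item, date and flavor instead of a row count
  SELECT itemID, saleDate, flavor, SUM(salesqty) AS units
  FROM sales
  GROUP BY itemID, saleDate, flavor
), r AS
(
  SELECT itemID, saleDate, flavor, units,
         RANK() OVER (PARTITION BY itemID, saleDate ORDER BY units DESC) AS pri
  FROM s
)
SELECT itemID, saleDate, flavor, units
FROM r
WHERE pri = 1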

Related

How to get latest data for a column when using grouping in postgres

I am using Postgres alongside Sequelize. I have encountered a case where I need to write a custom query that groups the records by a particular field. I know that for the remaining columns that are not used for grouping, I need to use an aggregate function like SUM. But the problem is that for some columns I need the value from the latest record (sorted DESC by created_at). I see no function in SQL to do so. Is my only option to write subqueries, or is there a better way? Thanks!
For better understanding: I want to group the records by address. So after the query there should only be two records, one with sydney and the other with new york. But when it comes to the distance, I want the result of the query to contain the distance from the row that was most recently created, i.e. the one with the latest created_at.
so the final two query results should be:
sydney 100 2022-09-05 18:14:53.492131+05:45
new york 40 2022-09-05 18:14:46.23328+05:45
select address, distance, created_at
from (
  select address, distance, created_at,
         row_number() over (partition by address order by created_at desc) as rn
  from my_table
) x
where rn = 1
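As an aside, Postgres also offers DISTINCT ON, which keeps the first row per address according to the ORDER BY; a minimal sketch, with my_table standing in for the real table name:
select distinct on (address) address, distance, created_at
from my_table
order by address, created_at desc;
Both forms return one row per address, carrying the distance from the most recently created record.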

Postgres query filter by a non-column value

I have a challenge that consists of filtering a query not by a value present in a table, but by a value retrieved from a function.
Let's consider a table that contains all sales in the database:
id, description, category, price, col1, ..., col n
I have a function that, for one sale, retrieves the similar sales (based on rules and business logic). This function runs a query against all records in the sales table and performs match validation on some fields.
similar_sales(sale_id integer) -> returns integer[]
Now I need to list all similar sales for each row present in the sales table.
select s.id, similar_sales(s.id)
from sales s
but similar_sales can return null, and I am only interested in returning sales that have at least one similar sale.
select id, similar
from (
  select s.id, similar_sales(s.id) as similar
  from sales s
) q
where #similar > 1 (Pseudocode)
limit x
I can't put the limit in the subquery because I don't know in advance which sales have similar ones and which don't.
I just wanted to run the subquery over a small set of rows rather than the entire table, to gain query performance (pagination strategy).
You can try this:
select s.id, similar
from sales s
cross join lateral similar_sales(s.id) as similar
where cardinality(similar) > 0 -- isempty() is for range types; cardinality() counts array elements and also filters out nulls
limit x
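If pagination is the goal, one hedged option is keyset pagination on s.id (assuming an index on sales.id and a :last_seen_id parameter carried over from the previous page), so the scan can stop as soon as x qualifying rows have been found:
select s.id, similar
from sales s
cross join lateral similar_sales(s.id) as similar
where cardinality(similar) > 0
  and s.id > :last_seen_id -- start after the last row of the previous page
order by s.id
limit x
This is only a sketch; whether Postgres actually stops early depends on the plan, so check it with EXPLAIN ANALYZE.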

How to limit to just one result per condition when looking through multiple OR/IN conditions in the WHERE clause (Postgresql)

For Example:
SELECT * FROM Customers
WHERE Country IN ('Germany', 'France', 'UK')
I want to LIMIT 1 for each of the countries in my IN clause so I only see a total of 3 rows: one customer per country (1 from Germany, 1 from France, 1 from the UK). Is there a simple way to do that?
Normally a simple GROUP BY would suffice for this type of problem; however, as you have specified that you want to include ALL of the columns in the result, we can use the ROW_NUMBER() window function to provide a value to filter on.
As a general rule it is important to specify the column to sort on (ORDER BY) for all windowing or paged queries to make the result repeatable.
As no schema has been supplied, I have used Name as the field to sort on for the window; please update that (or the question) with any other field you would like. The PK is a good candidate if you have nothing else to go on.
SELECT * FROM
(
  SELECT *
       , ROW_NUMBER() OVER (PARTITION BY Country ORDER BY Name) AS _rn
  FROM Customers
  WHERE Country IN ('Germany', 'France', 'UK')
) AS t -- Postgres requires an alias on a derived table
WHERE _rn = 1
The PARTITION BY makes the ROW_NUMBER restart at 1 for each set of records with the same Country value, so in this case we only select the rows that get a row number (aliased as _rn) of 1.
The Country filter could have been placed in the outer query if you really wanted, but ROW_NUMBER() can only be specified in the SELECT or ORDER BY clauses of the query it is defined in, so to use it as filter criteria we are forced to wrap the results in some way.

Order picking in warehouse

In implementing the warehouse management system for an ecommerce store, I'm trying to create a picking list for warehouse workers, who will walk around a warehouse picking products in orders from different shelves.
One type of product can be on different shelves, and on each shelf there can be many of the same type of product.
If there are many of the same product in one order, sometimes the picker has to pick from multiple shelves to get all the items in an order.
To make things trickier, sometimes a product will run out of stock as well.
My data model looks like this (simplified):
CREATE TABLE order_product (
id SERIAL PRIMARY KEY,
product_id integer,
order_id text
);
INSERT INTO "public"."order_product"("id","product_id","order_id")
VALUES
(1,1,'order1'),
(2,1,'order1'),
(3,1,'order1'),
(4,2,'order1'),
(5,2,'order2'),
(6,2,'order2');
CREATE TABLE warehouse_placement (
id SERIAL PRIMARY KEY,
product_id integer,
shelf text,
quantity integer
);
INSERT INTO "public"."warehouse_placement"("id","product_id","shelf","quantity")
VALUES
(1,1,E'A',2),
(2,2,E'B',2),
(3,1,E'C',2);
Is it possible, in postgres, to generate a picking list of instructions like the following:
order_id product_id shelf quantity_left_on_shelf
order1 1 A 1
order1 1 A 0
order1 2 B 1
order1 1 C 1
order2 2 B 0
order2 2 NONE null
I currently do this in the application code, but that feels quite clunky, and somehow I feel like there should be a way to do this directly in SQL.
Thanks for any help!
Here we go:
WITH product_on_shelf AS (
SELECT warehouse_placement.*,
generate_series(1, quantity) AS order_on_shelf,
quantity - generate_series(1, quantity) AS quantity_left_on_shelf
FROM warehouse_placement
)
, product_on_shelf_with_product_order AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY product_id
ORDER BY quantity, shelf, order_on_shelf
) AS order_among_product
FROM product_on_shelf
)
, order_product_with_order_among_product AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY product_id
ORDER BY id
) AS order_among_product
FROM order_product
)
SELECT order_product_with_order_among_product.id,
order_product_with_order_among_product.order_id,
order_product_with_order_among_product.product_id,
product_on_shelf_with_product_order.shelf,
product_on_shelf_with_product_order.quantity_left_on_shelf
FROM order_product_with_order_among_product
LEFT JOIN product_on_shelf_with_product_order
ON order_product_with_order_among_product.product_id = product_on_shelf_with_product_order.product_id
AND order_product_with_order_among_product.order_among_product = product_on_shelf_with_product_order.order_among_product
ORDER BY order_product_with_order_among_product.id
;
Here's the idea:
We build an intermediate result product_on_shelf, which is the same as warehouse_placement except that each row is duplicated n times, n being the quantity of the product on the shelf.
We assign a number order_among_product to each row in product_on_shelf, so that each object on shelf knows its order among the same products.
We assign a symmetric number order_among_product to each row in order_product.
For each row in order_product, we try to find the product on a shelf with the same order_among_product. If we can't find any, it means we've exhausted the product across all shelves.
Side note #1: Picking products off shelves is a concurrent activity. You should make sure, either on the application side or on the DB side via suitable locks, that any product on a shelf can be attributed to one single order; a locking sketch follows after these notes. Treating each row of order_product on the application side might be the best option to deal with concurrency.
Side note #2: I've written this query using CTEs for clarity. To boost performance, consider using subqueries instead. Make sure to run EXPLAIN ANALYZE on both versions before deciding.
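As a rough illustration of side note #1, a sketch of locking the shelf rows for one product while a picker allocates stock, using plain Postgres row locks against the warehouse_placement table from the question (the ids and quantities are only the sample data):
BEGIN;
-- lock the candidate shelf rows so a concurrent picker cannot allocate the same units;
-- FOR UPDATE SKIP LOCKED could be used instead to skip shelves another picker is already handling
SELECT id, shelf, quantity
FROM warehouse_placement
WHERE product_id = 1 AND quantity > 0
ORDER BY shelf
FOR UPDATE;
-- decide in application code how many units to take from each shelf, then e.g.
UPDATE warehouse_placement SET quantity = quantity - 2 WHERE id = 1;
COMMIT;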

Finding duplicate records using cascading criteria, then combining into one record

I am using MS SQL Server 2012, and have done simple querying and data loading, but not looping or case statements, or nested selects. I am looking for some assistance to get me started on the approach.
We are in a project where we are combining the customer listings from multiple legacy systems. I have a raw customer table in a staging database that contains records from those multiple sources. We need to do the following before writing the final table to a data mart. I would think this is a quite common scenario in the data cleansing/golden record world, but after much searching I have not been able to locate a similar post.
First, we need to find records that represent the same customer. These records are coming from multiple sources so there could be more than 2 records that represent the same customer. Each source uses a similar model. The criteria that determines whether the record(s) represent the same customer changes in a cascading hierarchy depending on the values available. The first criteria we want to use for a record is the DOB and SSN. But if the SSN is missing, then the criteria for that row becomes the Last Name, First Name and DOB. If both the SSN and the DOB are missing, then the duplicate test changes to last name + first name + another criteria field. There are other criteria even after this if one of the names is missing. And since records that represent the same customer may have different fields available, we would have to use the test that both records can use. There may not be duplicate records if it turns out that a given customer only exists in one system.
Once duplicated records have been identified, we then wish to combine those records that represent a customer, so that we end up with 1 record representing the customer written to a new table, using the same fields. Combining is done by comparing values of like fields. If the SSN is missing from one source, but is available in another, then that SSN is used. If there are more than 2 records that represent a customer, and more than 1 has an SSN, and those SSN numbers are different, there is a hierarchy based on which system the record came from, and we want to write the SSN value from the system highest in the hierarchy. This kind of logic would be applied to each field we need to examine.
I think the piece that is hardest for me to conceptualize is how do you store values of one record so that you can compare against one or more other records in the same table, do the actual compare logic, then write the "winning" value to a new table? If I can get some help with that, it would be greatly appreciated.
The basic requirements that you have outlined are fulfilled by this query:
SELECT a.ID,
       -- DENSE_RANK() needs an ORDER BY in its OVER clause; ordering by the match
       -- fields gives every distinct combination of values its own group number
       DENSE_RANK() OVER( ORDER BY DOB, SSN ) AS Match1,
       DENSE_RANK() OVER( ORDER BY [Last Name], [First Name], DOB ) AS Match2,
       DENSE_RANK() OVER( ORDER BY [Last Name], [First Name], [Another criteria] ) AS Match3
INTO #Matchmaking
FROM tCustStaging AS a
What you will likely find though is that you will need to "prepare" (cleanse) your data first, that is ensure that it is all in the same format and remove "rubbish". A common problem may be phone numbers where various formats can be used e.g. '02 1234 1234', '0212341234', '+212341234' etc. Names may also have variations in spelling especially for Compound Names.
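As an example of that preparation step, a hedged cleansing sketch; the [Last Name] and [First Name] columns come from the query above, while Phone is a hypothetical column used only to illustrate normalising the formats:
-- strip spaces, dashes and a leading '+' so the phone formats compare equal (Phone is hypothetical)
UPDATE tCustStaging
SET Phone = REPLACE(REPLACE(REPLACE(Phone, ' ', ''), '-', ''), '+', '');
-- trim and upper-case names so spelling comparisons are not thrown off by case or padding
UPDATE tCustStaging
SET [Last Name]  = UPPER(LTRIM(RTRIM([Last Name]))),
    [First Name] = UPPER(LTRIM(RTRIM([First Name])));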
Another way to do matching is to calculate matches on all fields individually:
SELECT a.ID,
       DENSE_RANK() OVER( ORDER BY SSN ) AS SSNMatch,
       DENSE_RANK() OVER( ORDER BY DOB ) AS DOBMatch,
       DENSE_RANK() OVER( ORDER BY LEFT( [Last Name], 10 ) ) AS LNMatch10,
       DENSE_RANK() OVER( ORDER BY LEFT( [Last Name], 9 ) ) AS LNMatch9,
       DENSE_RANK() OVER( ORDER BY LEFT( [Last Name], 8 ) ) AS LNMatch8,
       -- etc.
       DENSE_RANK() OVER( ORDER BY LEFT( [Last Name], 3 ) ) AS LNMatch3,
       DENSE_RANK() OVER( ORDER BY LEFT( [First Name], 10 ) ) AS FNMatch10,
       -- etc.
       DENSE_RANK() OVER( ORDER BY [Other criteria1] ) AS OC1,
       DENSE_RANK() OVER( ORDER BY [Other criteria2] ) AS OC2
INTO #Matchmaking
FROM tCustStaging AS a
You then create the strongest match (SSN, DOB). You can also experiment with various combinations of fields to see what you get.
-- You can play around with various combinations to see what results you get
SELECT c.*
FROM #Matchmaking AS a
INNER JOIN #Matchmaking AS b
        ON a.SSNMatch = b.SSNMatch
       AND a.DOBMatch = b.DOBMatch
       AND a.LNMatch10 = b.LNMatch10
       AND a.ID <> b.ID -- ignore the trivial match of a row with itself
INNER JOIN tCustStaging AS c ON a.ID = c.ID
After each iteration of matching you save the results.
You then keep relaxing the matching criteria, while carefully checking for false matches, until the matching criteria are so weak that you no longer get useful results.
You will eventually end up with a set of results based on different strengths of matching criteria.
In the end the number of "questionable matches" (where you are not sure if two customers are the same or not) will depend on the initial quality of the data and its quality after "preparation". You will likely still have to analyse some data manually.
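To illustrate the "winning value" step from the question, here is a hedged sketch that builds one combined record per match group. It assumes a hypothetical SourcePriority column (1 = most trusted source system) and uses the Match1 group id from #Matchmaking, picking for each field the highest-priority non-null value in the group:
SELECT m.Match1 AS MatchGroup,
       -- SSN from the highest-priority record in the group that actually has one
       (SELECT TOP (1) c.SSN
        FROM #Matchmaking AS m2
        INNER JOIN tCustStaging AS c ON c.ID = m2.ID
        WHERE m2.Match1 = m.Match1 AND c.SSN IS NOT NULL
        ORDER BY c.SourcePriority) AS SSN,
       -- same survivorship rule applied to DOB; repeat for each field to combine
       (SELECT TOP (1) c.DOB
        FROM #Matchmaking AS m2
        INNER JOIN tCustStaging AS c ON c.ID = m2.ID
        WHERE m2.Match1 = m.Match1 AND c.DOB IS NOT NULL
        ORDER BY c.SourcePriority) AS DOB
INTO #GoldenCustomer
FROM #Matchmaking AS m
GROUP BY m.Match1;
This is only one possible survivorship rule; the correlated TOP (1) subqueries make the per-field hierarchy explicit, at the cost of one subquery per field.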