How to create a working pivot table without killing the system - db2

Let's say you have a Customer table, a simple table with just 4 columns:
customerCode numeric(7,0)
customerName char(50)
customerVATNumber char(11)
customerLocation char(35)
Keep in mind that the Customer table contains 3 million rows, because it holds every customer from the last 40 years, but only about 980,000 of them are active.
Suppose we then have a table called Sales structured in this way:
saleID integer
customerCode numeric(7,0)
agentID numeric(6,0)
productID char(2)
dateBeginSale date
dateEndSale date
There are about three and a half million rows in this table (here too we have data going back 40 years), but the current supplies for the various products total about one million. The company only sells 4 products. Each customer can purchase up to 4 products, under 4 different contracts, even from 4 different agents. Most customers (90%) buy only one product; the rest buy from two to 4 (those who take the complete assortment are only a handful).
I was asked to build a pivot table showing, for each customer with its name and location, all the products they purchased and from which agent.
The proposed layout for this pivot table is:
customerCode
customerName
customerLocation
productID1
agentID1
saleID1
dateBeginSale1
dateEndSale1
productID2
agentID2
saleID2
dateBeginSale2
dateEndSale2
productID3
agentID3
saleID3
dateBeginSale3
dateEndSale3
productID4
agentID4
saleID4
dateBeginSale4
dateEndSale4
I built the pivot with a view.
First I created 4 views, one for each product ID on the Sales table, also useful for other statistical and reporting purposes (a sketch of the first one follows the list):
View1 as
customerCode1
productID1
agentID1
saleID1
dateBeginSale1
dateEndSale1
View2 as
customerCode2
productID2
agentID2
saleID2
dateBeginSale2
dateEndSale2
and so on till View4
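Roughly, each one looks like this; the sketch below assumes the four productID values are '01' through '04':
create view View1 as
select customerCode  as customerCode1,
       productID     as productID1,
       agentID       as agentID1,
       saleID        as saleID1,
       dateBeginSale as dateBeginSale1,
       dateEndSale   as dateEndSale1
from Sales
where productID = '01';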
Then I joined the 4 views with the Customer table and created the PivotView I needed, roughly as sketched below.
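A sketch of that join (assuming left joins from the customer table so that customers with fewer than four products still appear):
create view PivotView as
select c.customerCode, c.customerName, c.customerLocation,
       v1.productID1, v1.agentID1, v1.saleID1, v1.dateBeginSale1, v1.dateEndSale1,
       v2.productID2, v2.agentID2, v2.saleID2, v2.dateBeginSale2, v2.dateEndSale2,
       v3.productID3, v3.agentID3, v3.saleID3, v3.dateBeginSale3, v3.dateEndSale3,
       v4.productID4, v4.agentID4, v4.saleID4, v4.dateBeginSale4, v4.dateEndSale4
from Customers c
left join View1 v1 on v1.customerCode1 = c.customerCode
left join View2 v2 on v2.customerCode2 = c.customerCode
left join View3 v3 on v3.customerCode3 = c.customerCode
left join View4 v4 on v4.customerCode4 = c.customerCode;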
Now Select * from PivotView works perfectly.
So does Select * from PivotView where customerLocation='NEW YORK CITY'.
Any other request, for example selecting and counting the customers residing in LOS ANGELES who purchased their products from the same agent or from different agents, literally brings the machine to its knees: I watch memory usage grow (probably due to the construction of some temporary table or view), and the query often crashes.
However, if I build the same pivot as a table instead of a view, the query times drop sharply and, even though the queries are heavy (there are always about a million records to scan to check the various conditions), they become acceptable.
For sure I am doing something wrong and/or there must be a better way to achieve the result: having a pivot built on live data instead of one built from data extracted nightly.
I'll be happy to read your comments and suggestions.

I don't clearly understand your data layout and what you need, but I'll say that the usual problem with pivoting data on Db2 for IBM i is that there's no built-in way to dynamically pivot the data.
Given that you only have 4 products, the above limitation doesn't really apply.
Your problem would seem to be that by creating 4 views over the same table, you're processing records repeatedly. Instead, try to touch the data one time.
create view PivotSales as
select
    customerCode,
    -- product 1
    max(case productID when '01' then productID end) as productID1,
    max(case productID when '01' then agentID end) as agentID1,
    max(case productID when '01' then saleID end) as saleID1,
    max(case productID when '01' then dateBeginSale end) as dateBeginSale1,
    max(case productID when '01' then dateEndSale end) as dateEndSale1,
    -- product 2
    max(case productID when '02' then productID end) as productID2,
    max(case productID when '02' then agentID end) as agentID2,
    max(case productID when '02' then saleID end) as saleID2,
    max(case productID when '02' then dateBeginSale end) as dateBeginSale2,
    max(case productID when '02' then dateEndSale end) as dateEndSale2
    -- repeat the same five expressions (with a leading comma) for products 3 and 4
from Sales
group by customerCode;
Now you can have a CustomerSales view:
create view CustomerSales as
select *
from Customers join PivotSales using (customerCode);
Run your queries, using Visual Explain to see what indexes the system suggests. At a minimum, you should have indexes on the following (a sketch of the CREATE INDEX statements follows the list):
Customer (customerCode)
Customer (location, customerCode)
Sales (customerCode)
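A sketch of those indexes (the index names are hypothetical, and the question's customerLocation column is used for "location"):
create index custCodeIdx  on Customers (customerCode);
create index custLocIdx   on Customers (customerLocation, customerCode);
create index salesCustIdx on Sales (customerCode);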
I suspect that some Encoded Vector Indexes (EVIs) over various columns in Sales and Customer would prove helpful, especially since you mention counting: an EVI keeps track of the counts of its symbol values, so counting is essentially "free". An example:
create encoded vector index customerLocEvi
    on Customers (customerLocation);

-- this doesn't have to read any rows in Customers
select count(*)
from Customers
where customerLocation = 'LOS ANGELES';
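Along the same lines, hypothetical EVIs over the low-cardinality Sales columns might help the agent and product counting queries:
create encoded vector index salesProductEvi on Sales (productID);
create encoded vector index salesAgentEvi on Sales (agentID);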
"For sure I am doing something wrong and/or there must be a better way to achieve the result: having a pivot built on live data instead of one built from data extracted nightly."
Don't be too sure about that. The DB structure that best supports Business Intelligence type queries usually doesn't match the typical transactional data structure. A periodic extract, transform, load (ETL) process is pretty typical.
For your particular use case, you could turn CustomerSales into a Materialized Query Table (MQT), build some supporting indexes for it, and just run queries directly over it. A nightly rebuild would be as simple as REFRESH TABLE CustomerSales;
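A rough sketch of what that could look like, replacing the CustomerSales view with a user-maintained MQT of the same definition (MAINTAINED BY USER is the only option on Db2 for IBM i, since system-maintained MQTs aren't supported):
-- assumes the CustomerSales view above has been dropped first
create table CustomerSales as
    (select * from Customers join PivotSales using (customerCode))
    data initially deferred
    refresh deferred
    maintained by user
    enable query optimization;

-- nightly rebuild
refresh table CustomerSales;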
Or, if you wanted to, since Db2 for IBM i doesn't support SYSTEM MAINTAINED MQTs, a trigger over Sales could automatically propagate data to CustomerSales instead of rebuilding it nightly.
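A very rough sketch of that idea, assuming the user-maintained CustomerSales MQT from the sketch above (user-maintained MQTs can be modified directly); it only handles inserts into Sales, and updates or deletes would need similar triggers:
create trigger Sales_insert_to_CustomerSales
after insert on Sales
referencing new as n
for each row
begin
    -- rebuild just the affected customer's pivot row
    delete from CustomerSales
    where customerCode = n.customerCode;
    insert into CustomerSales
        select * from Customers join PivotSales using (customerCode)
        where customerCode = n.customerCode;
end;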

Related

How to get date range between dates from the records in the same table?

I have a table with employment records. It has Employee code, status, and date when table was updated.
Like this:
Employee  Status   Date
001       termed   01/01/2020
001       rehired  02/02/2020
001       termed   03/03/2020
001       rehired  04/04/2021
Problem: I need to get the length of the period when the employee was working for the company, and if it was less than a year, not display that record.
There could be multiple hire-rehire cycles for each Employee. 10-20 is normal.
So I'm thinking about two separate selects into two tables, and then looking for the closest date from a hire in table 1 to a termination in table 2. But that seems like an overcomplicated idea.
Is there a better way?
Many approaches, but something like this could work:
SELECT
    Employee,
    SUM(DaysWorked)
FROM (
    SELECT
        a1.Employee,
        IsNull(
            DateDiff(DD, a1.[Date],
                (SELECT TOP 1 [Date]
                 FROM aaa a2
                 WHERE a2.Employee = a1.Employee
                   AND a2.[Date] > a1.[Date]
                   AND [Status] <> 'termed'
                 ORDER BY [Date])),
            DateDiff(DD, a1.[Date], getDate())
        ) AS DaysWorked
    FROM aaa a1
    WHERE [Status] = 'termed'
) Totals
GROUP BY Totals.Employee
HAVING SUM(DaysWorked) >= 365
Also using a CROSS JOIN is an option, and perhaps more efficient. In this example, replace 'aaa' with the actual table name; the IsNull deals with an employee who is still working.
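For reference, a minimal setup to try the query above, keeping the placeholder table name aaa and assuming SQL Server (dates written as ISO literals):
CREATE TABLE aaa (
    Employee varchar(10),
    [Status] varchar(10),
    [Date]   date
);

INSERT INTO aaa (Employee, [Status], [Date])
VALUES
    ('001', 'termed',  '2020-01-01'),
    ('001', 'rehired', '2020-02-02'),
    ('001', 'termed',  '2020-03-03'),
    ('001', 'rehired', '2021-04-04');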

Order picking in warehouse

In implementing the warehouse management system for an ecommerce store, I'm trying to create a picking list for warehouse workers, who will walk around a warehouse picking products in orders from different shelves.
One type of product can be on different shelves, and on each shelf there can be many of the same type of product.
If there are many of the same product in one order, sometimes the picker has to pick from multiple shelves to get all the items in an order.
To make things even trickier, sometimes a product will run out of stock as well.
My data model looks like this (simplified):
CREATE TABLE order_product (
    id SERIAL PRIMARY KEY,
    product_id integer,
    order_id text
);

INSERT INTO "public"."order_product" ("id","product_id","order_id")
VALUES
    (1, 1, 'order1'),
    (2, 1, 'order1'),
    (3, 1, 'order1'),
    (4, 2, 'order1'),
    (5, 2, 'order2'),
    (6, 2, 'order2');

CREATE TABLE warehouse_placement (
    id SERIAL PRIMARY KEY,
    product_id integer,
    shelf text,
    quantity integer
);

INSERT INTO "public"."warehouse_placement" ("id","product_id","shelf","quantity")
VALUES
    (1, 1, E'A', 2),
    (2, 2, E'B', 2),
    (3, 1, E'C', 2);
Is it possible, in postgres, to generate a picking list of instructions like the following:
order_id  product_id  shelf  quantity_left_on_shelf
order1    1           A      1
order1    1           A      0
order1    2           B      1
order1    1           C      1
order2    2           B      0
order2    2           NONE   null
I currently do this in the application code, but that feels quite clunky, and somehow I feel like there should be a way to do this directly in SQL.
Thanks for any help!
Here we go:
WITH product_on_shelf AS (
    SELECT warehouse_placement.*,
           generate_series(1, quantity) AS order_on_shelf,
           quantity - generate_series(1, quantity) AS quantity_left_on_shelf
    FROM warehouse_placement
),
product_on_shelf_with_product_order AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY product_id
               ORDER BY quantity, shelf, order_on_shelf
           ) AS order_among_product
    FROM product_on_shelf
),
order_product_with_order_among_product AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY product_id
               ORDER BY id
           ) AS order_among_product
    FROM order_product
)
SELECT order_product_with_order_among_product.id,
       order_product_with_order_among_product.order_id,
       order_product_with_order_among_product.product_id,
       product_on_shelf_with_product_order.shelf,
       product_on_shelf_with_product_order.quantity_left_on_shelf
FROM order_product_with_order_among_product
LEFT JOIN product_on_shelf_with_product_order
       ON order_product_with_order_among_product.product_id = product_on_shelf_with_product_order.product_id
      AND order_product_with_order_among_product.order_among_product = product_on_shelf_with_product_order.order_among_product
ORDER BY order_product_with_order_among_product.id;
Here's the idea:
We create a temporary table product_on_shelf, which is the same as warehouse_placement, except the rows are duplicated n times, n being the quantity of the product on the shelf.
We assign a number order_among_product to each row in product_on_shelf, so that each object on shelf knows its order among the same products.
We assign a symmetric number order_among_product to each row in order_product.
For each row in order_product, we try to find the product on shelf with the same order_among_product. If we can't find any, it means we've exhausted the products on any shelf.
Side note #1: Picking products off shelves is a concurrent action. You should make sure, either on the application side or on the DB side via smart locks, that any product on a shelf can be attributed to one single order. Treating each row of order_product on the application side might be the best option for dealing with concurrency.
Side note #2: I've written this query using CTEs for clarity. To boost performance, consider using subqueries instead. Make sure to run EXPLAIN ANALYZE.
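One small gap versus the expected output above: for the out-of-stock row, the LEFT JOIN returns a NULL shelf rather than the literal NONE. If you want NONE, replacing the shelf column in the final SELECT with a COALESCE would do it:
COALESCE(product_on_shelf_with_product_order.shelf, 'NONE') AS shelf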

Stored procedure help for taking IDs and getting result from another table

In a table (store100) I have a list of store IDs for stores that make more than 100 sales a day. In another table I have sales. What I want to do is, for every store ID in table store100, see how many of product x they sold in the sales table. How do I achieve this? Obviously I don't want to be manually entering the store IDs all the time, so I want it to take all the IDs in the table and compare them against sales of x in the sales table.
Table structure:
store100 table:
ID
lon1
lon2
glas4
edi5
etc
Sales Table:
ID    Location  Product  Quantity  Total Price
lon1  London    Wallet   5         50
edi5  Manc      Shoes    4         100
So for example I want a query where it takes all the store100 IDs and shows how many wallets they sold.
If anyone has a better idea of achieving this, please tell me.
You will need a join for this:
SELECT S100.ID
,S.Product
,S.Quantity
FROM Store100 S100
INNER JOIN Sales S
ON (S100.ID = S.ID)
Of course, you will still need a WHERE clause if you want to filter, and you can modify the SELECT to fit your needs; a sketch for the wallet example follows.
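For instance, to see how many wallets each store100 store sold (a sketch; summing Quantity assumes that column holds the units per sale row):
SELECT S100.ID,
       SUM(S.Quantity) AS WalletsSold
FROM Store100 S100
INNER JOIN Sales S
        ON S100.ID = S.ID
WHERE S.Product = 'Wallet'
GROUP BY S100.ID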

Calculate Mode - "Highest frequency row" DB2

What would be the most efficient way to calculate the mode across tables with joins in DB2?
I am trying to get the value with the highest frequency (count) for a given column (ID, a candidate key for the joined table) on a given date.
The idea is to get the most common value from the table, which has different values for some accounts (for the same ID and date). We need to make it unique for use in another table.
You can use common table expressions (CTEs), indicated by WITH, to break the logic down into logical steps. First we'll build the summary rows, then we'll assign a ranking to the rows within each group, then pick out the ones with the highest count of records.
Let's say we want to know which flavor of each item sells the most frequently on each date (perhaps assuming a record is quantity one).
WITH s AS (
    SELECT itemID, saleDate, flavor, COUNT(*) AS tally
    FROM sales
    GROUP BY itemID, saleDate, flavor
),
r AS (
    SELECT itemID, saleDate, flavor, tally,
           RANK() OVER (PARTITION BY itemID, saleDate ORDER BY tally DESC) AS pri
    FROM s
)
SELECT itemID, saleDate, flavor, tally
FROM r
WHERE pri = 1
Here the names "s" and "r" refer to the result sets of their respective CTEs. These names can then be used to represent a table in another part of the statement.
The pri column holds the RANK() of the tally value of each summary row from the first section "s", within the window of itemID and saleDate. Tally is ordered descending because we want the largest value first, which gets a RANK() of 1. Then in the main SELECT we simply pick the summary records that ranked first within their partition.
By using RANK() or DENSE_RANK() we could get back multiple flavors for an itemID and saleDate if they are tied for first place. This could be eliminated by replacing RANK() with ROW_NUMBER(), but that would arbitrarily pick one of the tied flavors as the winner, which may not be the correct answer for the problem at hand.
If we had a sales quantity column in the table, we could replace COUNT(*) with SUM(salesqty) and find what had sold the most units.
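For example, assuming a hypothetical salesqty column, that variant would be:
WITH s AS (
    SELECT itemID, saleDate, flavor, SUM(salesqty) AS tally
    FROM sales
    GROUP BY itemID, saleDate, flavor
),
r AS (
    SELECT itemID, saleDate, flavor, tally,
           RANK() OVER (PARTITION BY itemID, saleDate ORDER BY tally DESC) AS pri
    FROM s
)
SELECT itemID, saleDate, flavor, tally
FROM r
WHERE pri = 1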

Fully matching sets of records of two many-to-many tables

I have Users, Positions and Licenses.
Relations are:
users may have many licenses
positions may require many licenses
So I can easily get license requirements per position(s) as well as effective licenses per user(s).
But I wonder what would be the best way to match the two sets? As logic goes user needs at least those licenses that are required by a certain position. May have more, but remaining are not relevant.
I would like to get results with users and eligible positions.
PersonID  PositionID
1         1           -> user 1 is eligible to work on position 1
1         2           -> user 1 is eligible to work on position 2
2         1           -> user 2 is eligible to work on position 1
3         2           -> user 3 is eligible to work on position 2
4         ...
As you can see I need a result for all users, not a single one per call, which would make things much much easier.
There are actually 5 tables here:
create table Person ( PersonID, ...)
create table Position (PositionID, ...)
create table License (LicenseID, ...)
and relations
create table PersonLicense (PersonID, LicenseID, ...)
create table PositionLicense (PositionID, LicenseID, ...)
So basically I need to find positions that a particular person is licensed to work on. There's of course a much more complex problem here, because there are other factors, but the main objective is the same:
How do I match multiple records of one relational table to multiple records of the other? This could also be described as an inner join per set of records rather than per single record, as it's usually done in TSQL.
I'm thinking of these TSQL language constructs:
rowsets, but I've never used them before and don't know how to use them anyway
INTERSECT statements maybe, although these probably only work over whole sets and not groups
Final solution (for future reference)
In the meantime, while you fellow developers answered my question, this is something I came up with. It uses CTEs and partitioning, which can of course be used on SQL Server 2008 R2. I've never used result partitioning before, so I had to learn something new (which is a plus altogether). Here's the code:
with CTEPositionLicense as (
select
PositionID,
LicenseID,
checksum_agg(LicenseID) over (partition by PositionID) as RequiredHash
from PositionLicense
)
select per.PersonID, pos.PositionID
from CTEPositionLicense pos
join PersonLicense per
on (per.LicenseID = pos.LicenseID)
group by pos.PositionID, pos.RequiredHash, per.PersonID
having pos.RequiredHash = checksum_agg(per.LicenseID)
order by per.PersonID, pos.PositionID;
So I made a comparison between these three techniques that I named as:
Cross join (by Andriy M)
Table variable (by Petar Ivanov)
Checksum - this one here (by Robert Koritnik, me)
Mine already orders results per person and position, so I added the same ordering to the other two to make them return identical results.
Resulting estimated execution plan
Checksum: 7%
Table variable: 2% (table creation) + 9% (execution) = 11%
Cross join: 82%
I also changed the table variable version into a CTE version (a CTE was used instead of the table variable), removed the ORDER BY at the end, and compared their estimated execution plans. Just for reference, the CTE version came in at 43%, while the original version had 53% (10% + 43%).
One way to write this efficiently is to join PositionLicense with PersonLicense on LicenseID. Then count the matches grouped by position and person and compare with the count of all licenses for the position; if they are equal, then that person qualifies:
DECLARE @tmp TABLE (PositionId INT, LicenseCount INT)

INSERT INTO @tmp
SELECT PositionId AS PositionId,
       COUNT(1)   AS LicenseCount
FROM PositionLicense
GROUP BY PositionId

SELECT per.PersonId, pos.PositionId
FROM PositionLicense AS pos
INNER JOIN PersonLicense AS per ON (pos.LicenseId = per.LicenseId)
GROUP BY pos.PositionId, per.PersonId
HAVING COUNT(1) = (
    SELECT LicenseCount FROM @tmp WHERE PositionId = pos.PositionId
)
I would approach the problem like this:
Get all the (distinct) users from PersonLicense.
Cross join them with PositionLicense.
Left join the resulting set with PersonLicense using PersonID and LicenseID.
Group the results by PersonID and PositionID.
Filter out those (PersonID, PositionID) pairs where the number of licenses in PositionLicense does not match the number of those in PersonLicense.
And here's my implementation:
SELECT
u.PersonID,
pl.PositionID
FROM (SELECT DISTINCT PersonID FROM PersonLicense) u
CROSS JOIN PositionLicense pl
LEFT JOIN PersonLicense ul ON u.PersonID = ul.PersonID
AND pl.LicenseID = ul.LicenseID
GROUP BY
u.PersonID,
pl.PositionID
HAVING COUNT(pl.LicenseID) = COUNT(ul.LicenseID)