TSQL Comparing sets of similar and repeating items

I have a table of orders waiting to be fulfilled and a table of returned orders. There is only one product but you can order in different quantity packs. My job is to pair these orders with returns but the return order must match exactly with the current order in terms of the number of packs ordered and quantity in each pack. So matching on the number of packs ordered is no issue but matching up the quantities is giving me a headache. My orders and returns are pipe delimited fields. An order/return of 3 packs of 30 each will look like "30|30|30". An order of 3 packs, 2 of 15 and 1 of 30 will look like "15|15|30", "15|30|15", or "30|15|15". An order of "15|15|30" can be paired up with a return of "15|30|15" as they are the same. I know I need to parse out the items in the fields into a table first. But how do I compare them?
I have gone through all the examples here: TSQL Comparing two Sets
The intersect and cross join examples don't work when there are duplicates in the set - (a,a,b) and (b,a,a) do not match.
The full join approach doesn't work when the 2 sets have different quantities of the same elements - (a,a,c) and (a,c,c) are incorrectly reported as a match.
So my thoughts at this point are to maybe parse into a table, sort, reassemble back into an ordered piped string and compare the 2 strings. That would work, but is it the best way to do this?
Editing to add - I cannot change the data model. SQL Server 2017
Sample data (records 2 and 3 would be a match):
declare @comp table(
    OrderNo int,
    OrderPackCount int,
    OrderTtlPieces int,
    OrderQtys varchar(50),
    ReturnNo int,
    RtnPackCount int,
    RtnTtlPieces int,
    RtnQtys varchar(50))
insert into @comp values
(55500, 2, 100, '50|50|', 401, 2, 100, '75|25|'),
(55501, 2, 60, '20|40|', 404, 2, 60, '40|20|'),
(55504, 3, 75, '15|30|30|', 385, 7, 75, '30|15|30|'),
(55508, 3, 90, '30|30|30|', 422, 7, 75, '50|30|10|')

A couple of options to try.
Here's an example of splitting the values, reordering them, and putting them back together to do the compare. Since you mention SQL Server 2017, we can use STRING_SPLIT to split the strings and STRING_AGG with WITHIN GROUP ( ORDER BY <order_by_expression_list> [ ASC | DESC ] ) to reorder and concatenate the values.
SELECT [comp].*
FROM @comp [comp]
CROSS APPLY (
    SELECT STRING_AGG([value], '|') WITHIN GROUP(ORDER BY [value]) AS [OrdStr]
    FROM STRING_SPLIT([comp].[OrderQtys], '|')
    WHERE [value] <> ''
) AS [ord]
CROSS APPLY (
    SELECT STRING_AGG([value], '|') WITHIN GROUP(ORDER BY [value]) AS [RtnStr]
    FROM STRING_SPLIT([comp].[RtnQtys], '|')
    WHERE [value] <> ''
) AS [Rtn]
WHERE [ord].[OrdStr] = [Rtn].[RtnStr];
Another option would be to identify those that do not match and then use EXCEPT. Split the values, aggregate to get a count per value, then OUTER APPLY the return side where the value is equal and has the same count, and keep the rows where no match was found. EXCEPT then returns the rows that are not in that non-matching set.
SELECT *
FROM @comp
EXCEPT
SELECT [comp].*
FROM @comp [comp]
OUTER APPLY (
    SELECT [value] AS [OrdValue]
        , COUNT(*) AS [OrdCnt]
    FROM STRING_SPLIT([comp].[OrderQtys], '|')
    WHERE [value] <> ''
    GROUP BY [value]
) AS [ord]
OUTER APPLY (
    SELECT [value] AS [RtnValue]
        , COUNT(*) AS [RtnCnt]
    FROM STRING_SPLIT([comp].[RtnQtys], '|')
    WHERE [value] <> ''
        AND [value] = [ord].[OrdValue]
    GROUP BY [value]
    HAVING COUNT(*) = [ord].[OrdCnt]
) AS [Rtn]
WHERE [Rtn].[RtnValue] IS NULL;
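Both approaches above key off the value counts. For comparison, here is a sketch of the same counting idea written as a single FULL JOIN per row against the sample @comp data: split both sides, group by value, and join the two grouped sets on value and count; any unmatched row on either side means the quantities differ.
SELECT [comp].*
FROM @comp [comp]
CROSS APPLY (
    SELECT COUNT(*) AS [MismatchCount]
    FROM (
        SELECT [value], COUNT(*) AS [cnt]
        FROM STRING_SPLIT([comp].[OrderQtys], '|')
        WHERE [value] <> ''
        GROUP BY [value]
    ) AS [o]
    FULL JOIN (
        SELECT [value], COUNT(*) AS [cnt]
        FROM STRING_SPLIT([comp].[RtnQtys], '|')
        WHERE [value] <> ''
        GROUP BY [value]
    ) AS [r]
        ON [r].[value] = [o].[value]
        AND [r].[cnt] = [o].[cnt]
    -- any side left NULL means a value/count pair exists on one side only
    WHERE [o].[value] IS NULL
        OR [r].[value] IS NULL
) AS [x]
WHERE [x].[MismatchCount] = 0;
With the sample data this keeps 55501 and 55504, the two rows the question flags as a match.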

Related

Update null values in a column based on the percentage of non-null values in the column

I need to update the null values of a column in a table, for each category, based on the percentage of the non-null values. The following table shows the null values for a particular category -
There are only two types of values in the column. The percentage of each type, based on row counts, is -
The number of rows with null values is 7; I need to randomly populate the null values based on the percentage share of the non-null values, as shown below - 38% (CV) of 7 = 3, 63% (NCV) of 7 = 4
If you want to dynamically calculate the "NULL rate", one way to do it could be:
with pcts as (
select
(select count(*)::numeric from the_table where type = 'cv') / (select count(*) from the_table where type is not null) as cv_pct,
(select count(*)::numeric from the_table where type = 'ncv') / (select count(*) from the_table where type is not null) as ncv_pct,
(select count(*) from the_table where type is null) as null_count
), calc as (
select d.ctid,
p.cv_pct,
p.ncv_pct,
row_number() over () as rn,
case
when row_number() over () <= round(null_count * p.cv_pct) then 'cv'
else 'ncv'
end as new_type
from the_table d
cross join pcts p
where type is null
)
update the_table t
set type = c.new_type
from calc c
where t.ctid = c.ctid
The first CTE calculates the percentage of each type and the total number of NULL values (in theory the percentage of the NCV type isn't really needed, but I included it for completeness).
The second CTE then calculates, for each NULL row, which new type should be used. This is done by comparing the "current" row number with the expected number of rows for each type (null_count multiplied by the percentage, i.e. the CASE expression).
This is then used to update the target table. I have used the ctid as an alternative for a primary key, because your sample data does not have any unique column (or combination of columns). If you do have a primary key that you haven't shown, replace ctid with that primary key column.
I wouldn't be surprised though, if there was a shorter, more efficient way to do it, but for now I can't think of a better alternative.
Online example
If you are on PG11 or later, you can use a GROUPS frame to do this with window functions, in what should be close to a single pass (apart from reordering the output when sorting by tid):
select tid, category, id, type,
case
when type is not null then type
when round(
(count(*) over (partition by category
order by type nulls last
groups between 2 preceding
and 2 preceding))::numeric /
coalesce(
nullif(
count(*) over (partition by category
order by type nulls last
groups 2 preceding
exclude group), 0), 1
) *
count(*) over (partition by category
order by type nulls last
groups current row)
) >= row_number() over (partition by category, type
order by tid)
then
first_value(type) over (partition by category
order by type nulls last
groups between 2 preceding
and 2 preceding)
else
first_value(type) over (partition by category
order by type nulls last
groups 1 preceding
exclude group)
end as extended_type
from cv_ncv
order by tid;
Working fiddle here.

Convert comma-separated ids to comma-separated strings in PostgreSQL

I have a comma-separated column which represents the ids of emergency types, like:
ID | Name
1 | 1,2,3
2 | 1,2
3 | 1
I want to write a query to get the names for the values in this field.
1 - Ambulance
2 - Fire
3 - Police
EXPECTED OUTPUT
1 - Ambulance, Fire, Police
2 - Ambulance, Fire
3 - Ambulance
I just need to write a SELECT statement in PostgreSQL that displays the string values instead of the integer values in the comma-separated field.
Comma-separated values are bad database design practice, but PostgreSQL is so feature-rich that you can handle this task easily.
-- just simulate tables
with t1(ID, Name) as(
select 1 ,'1,2,3' union all
select 2 ,'1,2' union all
select 3 ,'1'
),
t2(id, name) as(
select 1, 'Ambulance' union all
select 2, 'Fire' union all
select 3, 'Police'
)
-- here is actual query
select s1.id, string_agg(t2.name, ',') from
( select id, unnest(string_to_array(Name, ','))::INT as name_id from t1 ) s1
join t2
on s1.name_id = t2.id
group by s1.id
demo
Though, if you can, change your approach. The right database design means easier queries and better performance.
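A minimal sketch of what that normalized design could look like (the table and column names here are assumptions, not your actual schema): one row per (record, emergency type) instead of a comma-separated list, which turns the report into a plain join.
-- hypothetical normalized schema
create table emergency_type (
    id   int primary key,
    name text not null
);

create table record_emergency_type (
    record_id         int not null,
    emergency_type_id int not null references emergency_type (id),
    primary key (record_id, emergency_type_id)
);

-- the report then becomes a simple join + string_agg, no string parsing needed
select ret.record_id, string_agg(et.name, ', ' order by et.id) as names
from record_emergency_type ret
join emergency_type et on et.id = ret.emergency_type_id
group by ret.record_id;
With that in place, the string_to_array/unnest step disappears entirely.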
To get the values for each of the ids is a simple query: select * from <table>;. Once you have the values you will have to parse the strings with a delimiter of ','. Then you would have to assign the parsed string values to the appropriate names and remake the list. Are you writing this in a specific language?
Or you could just assign the sorted value to something, like: 1,2,3 is equal to "some string", 1,2 is equal to "some other string", etc.
Assuming you have a table with the ids and values for Ambulance, Police and Fire, then you can use something like the following.
CREATE TABLE public.test1
(
id integer NOT NULL,
commastring character varying,
CONSTRAINT pk_test1 PRIMARY KEY (id)
);
INSERT INTO public.test1
VALUES (1, '1,2,3'), (2, '1,2'), (3, '1');
CREATE TABLE public.test2
(
id integer NOT NULL,
description character varying,
CONSTRAINT pk_test2 PRIMARY KEY (id)
);
INSERT INTO public.test2
VALUES (1, 'Ambulance'), (2, 'Fire'), (3, 'Police');
with descs as (
    with splits as (
        SELECT id,
               split_part(commastring, ',', 1) as col2,
               split_part(commastring, ',', 2) as col3,
               split_part(commastring, ',', 3) as col4
        FROM test1
    )
    select splits.id, t21.description as d1, t22.description as d2, t23.description as d3
    from splits
    inner join test2 t21 on t21.id::character varying = splits.col2
    left join test2 t22 on t22.id::character varying = splits.col3
    left join test2 t23 on t23.id::character varying = splits.col4
)
SELECT descs.id,
       CASE WHEN d2 IS NOT NULL AND d3 IS NOT NULL THEN CONCAT_WS(',', d1, d2, d3)
            ELSE CASE WHEN d2 IS NOT NULL THEN CONCAT_WS(',', d1, d2) ELSE d1 END
       END
FROM descs
ORDER BY id;
By way of explanation, I give the create table and insert commands, so that you (and others) can follow the logic. It would help enormously, if you were to do this in your questions, as it saves everyone time and avoids misunderstandings.
My inner CTE then splits the string using split_part. The syntax here is quite simple, field, separator and desired column within the field to be split (so in this case we need one, two and three). I then join the split columns to test2. Note two things here: the first join is an inner join, as there will always be at least one column in the split (I am assuming!!!), whereas the other two are left joins; secondly, the split of a character varying field in turn produces character varying splits, so I have to cast the int id to character varying for the join to work. Doing the cast this way round (i.e. id to character varying rather than character varying to id) means I don't have to bother about nulls. Finally depending on the number of nulls present, I concatenate the results with the given separator. Again I am assuming d1 will always have a value.
HTH

How to dynamically specify columns in the UNPIVOT operator

I currently have the following query:
WITH History AS (
SELECT
kz.*,
kz.__$operation AS operation,
map.tran_begin_time as beginT,
map.tran_end_time as endT
FROM cdc.fn_cdc_get_all_changes_dbo_EXT_GeolObject_KategZalezh(sys.fn_cdc_get_min_lsn('dbo_EXT_GeolObject_KategZalezh'), sys.fn_cdc_get_max_lsn(), 'all') AS kz
INNER JOIN [cdc].[lsn_time_mapping] map
ON kz.[__$start_lsn] = map.start_lsn
where kz.GUID_BalanceHC_Zalezh = 'DDA9AB3A-A0AF-4623-9362-0000C8C83D63'
),
UnpivotedValues AS(
SELECT guid, GUID_another, field, val, operation, beginT, endT
FROM History
UNPIVOT ( [val] FOR field IN
(
area,
oilwidthmin,
oilwidthmax,
efectivwidthmin,
efectivwidthmax,
etc...
))t
),
UnpivotedWithLastValue AS (
SELECT
*,
--Use LAG() to get the last value for the same field
LAG(val, 1) OVER (PARTITION BY guid, GUID_another, field ORDER BY BeginT) LastVal
FROM UnpivotedValues
)
SELECT * FROM UnpivotedWithLastValue WHERE val <> LastVal OR LastVal IS NULL ORDER BY guid
This query returns the changed values for a single table that has CDC (Change Data Capture) enabled.
I want to create a stored procedure that receives the columns to be unpivoted, and the cdc function (e.g. cdc.fn_cdc_get_all_...) as parameters and returns the result set.
The results for these tables must be joined into one report.
In my case parameter 1 is cdc.fn_cdc_get_all_changes_dbo_EXT_GeolObject_KategZalezh(sys.fn_cdc_get_min_lsn('dbo_EXT_GeolObject_KategZalezh'), sys.fn_cdc_get_max_lsn(), 'all'). This is the CDC function.
How should I send the list of fields that I want in the result? What should the string look like?
Also, is there a way to do this without dynamic SQL? Dynamic SQL is not the best solution for performance.
As you know, SQL Server is declarative by design and does not support macro substitution.
UNPIVOT would clearly be more performant, but here is a simplified example of unpivoting that does not require dynamic SQL, only a little XML.
Example
Let's assume your table/results look like this:
You may notice that we only specify key fields to EXCLUDE in the final WHERE.
Declare @YourData table (ID int, Active bit, First_Name varchar(50), Last_Name varchar(50), EMail varchar(50), Salary decimal(10,2))
Insert into @YourData values
(1,1,'John','Smith','john.smith@email.com',85600),
(2,0,'Jane','Doe' ,'jane.doe@email.com',83200)
;with cte as (
    -- Replace with your Complex Query
    Select * from @YourData
)
Select A.ID
      ,A.Active
      ,C.*
From cte A
Cross Apply (Select XMLData=cast((Select A.* for XML RAW) as xml)) B
Cross Apply (
    Select Item  = attr.value('local-name(.)','varchar(100)')
          ,Value = attr.value('.','varchar(max)')
    From B.XMLData.nodes('/row') C1(n)
    Cross Apply C1.n.nodes('./@*') C2(attr)
    Where attr.value('local-name(.)','varchar(100)') not in ('ID','Active')
) C
Returns ID and Active plus one Item/Value row for each of the remaining columns (First_Name, Last_Name, EMail, Salary).
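If dynamic SQL does turn out to be acceptable after all, here is a minimal sketch of passing the column list as a parameter. The procedure and parameter names are made up for illustration, the CDC function is the one from the question, STRING_SPLIT/STRING_AGG assume SQL Server 2017, and all unpivoted columns are assumed to share the same data type (which UNPIVOT requires).
CREATE PROCEDURE dbo.usp_GetUnpivotedChanges      -- hypothetical name
    @Columns nvarchar(max)                        -- e.g. N'area,oilwidthmin,oilwidthmax'
AS
BEGIN
    -- quote each requested column so it can be spliced into the IN () list safely
    DECLARE @ColList nvarchar(max) =
        (SELECT STRING_AGG(CAST(QUOTENAME(LTRIM(RTRIM([value]))) AS nvarchar(max)), ',')
         FROM STRING_SPLIT(@Columns, ','));

    DECLARE @sql nvarchar(max) = N'
        SELECT guid, field, val
        FROM cdc.fn_cdc_get_all_changes_dbo_EXT_GeolObject_KategZalezh(
                 sys.fn_cdc_get_min_lsn(''dbo_EXT_GeolObject_KategZalezh''),
                 sys.fn_cdc_get_max_lsn(), ''all'') AS kz
        UNPIVOT ( val FOR field IN (' + @ColList + N') ) AS u;';

    EXEC sys.sp_executesql @sql;
END;
QUOTENAME keeps the spliced-in names safe, but the type restriction on the IN list applies exactly as it does with the static UNPIVOT.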

How to rewrite SQL joins into window functions?

Database is HP Vertica 7 or PostgreSQL 9.
create table test (
id int,
card_id int,
tran_dt date,
amount int
);
insert into test values (1, 1, '2017-07-06', 10);
insert into test values (2, 1, '2017-06-01', 20);
insert into test values (3, 1, '2017-05-01', 30);
insert into test values (4, 1, '2017-04-01', 40);
insert into test values (5, 2, '2017-07-04', 10);
Of the payment cards used in the last 1 day, what is the maximum amount charged on that card in the last 90 days.
select t.card_id, max(t2.amount) max
from test t
join test t2 on t2.card_id=t.card_id and t2.tran_dt>='2017-04-06'
where t.tran_dt>='2017-07-06'
group by t.card_id
order by t.card_id;
Results are correct
card_id max
------- ---
1 30
I want to rewrite the query using SQL window functions.
select card_id, max(amount) over(partition by card_id order by tran_dt range between '60 days' preceding and current row) max
from test
where card_id in (select card_id from test where tran_dt>='2017-07-06')
order by card_id;
But the result set does not match; how can this be done?
Test data here:
http://sqlfiddle.com/#!17/db317/1
I can't try PostgreSQL, but in Vertica, you can apply the ANSI standard OLAP window function.
But you'll need to nest two queries: The window function only returns sensible results if it has all rows that need to be evaluated in the result set.
But you only want the row from '2017-07-06' to be displayed.
So you'll have to filter for that date in an outer query:
WITH olap_output AS (
SELECT
card_id
, tran_dt
, MAX(amount) OVER (
PARTITION BY card_id
ORDER BY tran_dt
RANGE BETWEEN '90 DAYS' PRECEDING AND CURRENT ROW
) AS the_max
FROM test
)
SELECT
card_id
, the_max
FROM olap_output
WHERE tran_dt='2017-07-06'
;
card_id|the_max
1| 30
As far as I know, PostgreSQL window functions don't support a bounded RANGE preceding, thus range between '90 days' preceding won't work. They do support bounded ROWS preceding, such as rows between 90 preceding, but then you would need to assemble a time-series query similar to the following for the window function to operate on time-based rows:
SELECT c.card_id, t.amount, g.d as d_series
FROM generate_series(
'2017-04-06'::timestamp, '2017-07-06'::timestamp, '1 day'::interval
) g(d)
CROSS JOIN ( SELECT distinct card_id from test ) c
LEFT JOIN test t ON t.card_id = c.card_id and t.tran_dt = g.d
ORDER BY c.card_id, d_series
For what you need (based on your question description), I would stick to using group by.
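To make that concrete, here is a sketch of one way the time series above could be fed into a ROWS-based frame. The 89 PRECEDING offset assumes one row per card per day and counts the current day as day 90 of the window; adjust it to however you define "last 90 days".
WITH daily AS (
    -- one row per card per day; day_amount is NULL on days with no transaction
    SELECT c.card_id, g.d::date AS d_series, max(t.amount) AS day_amount
    FROM generate_series(
        '2017-04-06'::timestamp, '2017-07-06'::timestamp, '1 day'::interval
    ) g(d)
    CROSS JOIN ( SELECT distinct card_id FROM test ) c
    LEFT JOIN test t ON t.card_id = c.card_id AND t.tran_dt = g.d
    GROUP BY c.card_id, g.d
), rolling AS (
    SELECT card_id, d_series, day_amount,
           max(day_amount) OVER (PARTITION BY card_id
                                 ORDER BY d_series
                                 ROWS BETWEEN 89 PRECEDING AND CURRENT ROW) AS max_90d
    FROM daily
)
SELECT card_id, max_90d AS max
FROM rolling
WHERE d_series = '2017-07-06'     -- keep only the last day...
  AND day_amount IS NOT NULL      -- ...and only cards actually used that day
ORDER BY card_id;
On the sample data this returns card_id 1 with max 30, the same as the group by version.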

GROUP BY clause sees all VARCHAR fields as different

I have witnessed a strange behaviour while trying to GROUP BY a VARCHAR field.
Take the following example, where I try to spot customers that have changed their name at least once in the past.
CREATE TABLE #CustomersHistory
(
Id INT IDENTITY(1,1),
CustomerId INT,
Name VARCHAR(200)
)
INSERT INTO #CustomersHistory (CustomerId, Name) VALUES (12, 'AAA')
INSERT INTO #CustomersHistory (CustomerId, Name) VALUES (12, 'AAA')
INSERT INTO #CustomersHistory (CustomerId, Name) VALUES (12, 'BBB')
INSERT INTO #CustomersHistory (CustomerId, Name) VALUES (44, '444')
SELECT ch.CustomerId, count(ch.Name) AS cnt
FROM #CustomersHistory ch
GROUP BY ch.CustomerId HAVING count(ch.Name) != 1
Which oddly yields (as if 'AAA' from first INSERT was different from the second one)
CustomerId cnt // (I was expecting)
12 3 // 2
44 1 // 1
Is this behaviour specific to T-SQL?
Why does it behave in this rather counter-intuitive way?
How is it customary to overcome this limitation?
Note: This question is very similar to GROUP BY problem with varchar, where I didn't find the answer to the "why".
Side Note: Is it good practice to use HAVING count(ch.Name) != 1 instead of HAVING count(ch.Name) > 1 ?
The COUNT() operator will count all rows regardless of value. I think you might want to use a COUNT(DISTINCT ch.Name) which will only count unique names.
SELECT ch.CustomerId, count(DISTINCT ch.Name) AS cnt
FROM #CustomersHistory ch
GROUP BY ch.CustomerId HAVING count(DISTINCT ch.Name) > 1
For more information, take a look at the COUNT() article in Books Online.