Identifying groups of duplicated records (TSQL)

I have a massive table (over 95,000,000 records) in an MSSQL database:
id  configuration_id  equipment_group_id  name   price
1   1                 100                 item1  10
2   1                 100                 item2  20
3   1                 100                 item3  30
4   2                 100                 item1  10
5   2                 100                 item2  20
6   2                 100                 item3  30
7   3                 100                 item1  10
8   3                 100                 item2  20
9   3                 100                 item3  31
I want to identify duplicated groups of records.
Configuration 1 Group
id  configuration_id  equipment_group_id  name   price
1   1                 100                 item1  10
2   1                 100                 item2  20
3   1                 100                 item3  30
Configuration 2 Group
id  configuration_id  equipment_group_id  name   price
4   2                 100                 item1  10
5   2                 100                 item2  20
6   2                 100                 item3  30
Configuration 3 Group
id  configuration_id  equipment_group_id  name   price
7   3                 100                 item1  10
8   3                 100                 item2  20
9   3                 100                 item3  31
In my logic, Group 1 and Group 2 are duplicates: they have the same number of records and the same content in the fields equipment_group_id, name, and price.
Group 1 and Group 3 are NOT duplicates because at least one element differs (the last record has price 31, not 30).
How can I construct a query to find all groups (not individual records) that are duplicated across the table?

Performance of this query for 95M records will probably not be ideal, but this should do the trick.
Find Exact Matches of Groups Containing Multiple Rows
DROP TABLE IF EXISTS #Config
CREATE TABLE #Config
(id int
,configuration_id int
,equipment_group_id int
,name VARCHAR(100)
,price INT
)
INSERT INTO #Config
VALUES
(1,1,100,'item1',10)
,(2,1,100,'item2',20)
,(3,1,100,'item3',30)
,(4,2,100,'item1',10)
,(5,2,100,'item2',20)
,(6,2,100,'item3',30)
,(7,3,100,'item1',10)
,(8,3,100,'item2',20)
,(9,3,100,'item3',31)
;WITH cte_ConfigCount AS (
    SELECT *, ConfigTotalRowCnt = COUNT(*) OVER (PARTITION BY A.configuration_id) /*Counts how many rows in each config*/
    FROM #Config AS A
)
SELECT
    A.configuration_id
    ,B.configuration_id
    ,TextDescription = CONCAT('Config #',A.configuration_id,' matches Config #',B.configuration_id)
    ,A.ConfigTotalRowCnt
    ,RowsMatch = COUNT(*)
FROM cte_ConfigCount AS A
INNER JOIN cte_ConfigCount AS B
    ON A.configuration_id < B.configuration_id /*Don't join to self and only join 1 way (so don't have one row with A-B and another row with B-A)*/
    AND A.ConfigTotalRowCnt = B.ConfigTotalRowCnt /*Both configs must have the same number of rows, otherwise a config that merely contains all rows of A would also match*/
    AND A.equipment_group_id = B.equipment_group_id
    AND A.name = B.name
    AND A.price = B.price
GROUP BY A.configuration_id, A.ConfigTotalRowCnt, B.configuration_id
/*Only return pairs where every row of the config found a match. Assumes (equipment_group_id, name, price) is unique within a config*/
HAVING A.ConfigTotalRowCnt = COUNT(*)
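Another way to think about the problem is to reduce each configuration to a single comparable signature and then group on that signature, which avoids comparing configurations pair by pair. The following is only a minimal sketch of that idea in plain Python on the sample rows from the question, not a replacement for the query above. `find_duplicate_groups` is a hypothetical helper name, and it assumes two groups are duplicates when their sorted (equipment_group_id, name, price) rows are identical.

```python
from collections import defaultdict

# Sample rows from the question:
# (id, configuration_id, equipment_group_id, name, price)
ROWS = [
    (1, 1, 100, "item1", 10),
    (2, 1, 100, "item2", 20),
    (3, 1, 100, "item3", 30),
    (4, 2, 100, "item1", 10),
    (5, 2, 100, "item2", 20),
    (6, 2, 100, "item3", 30),
    (7, 3, 100, "item1", 10),
    (8, 3, 100, "item2", 20),
    (9, 3, 100, "item3", 31),
]


def find_duplicate_groups(rows):
    """Return lists of configuration_ids whose rows match exactly (ignoring id)."""
    # Collect the comparable columns of every row under its configuration_id.
    content = defaultdict(list)
    for _id, config_id, group_id, name, price in rows:
        content[config_id].append((group_id, name, price))

    # A sorted tuple of rows is an order-independent signature of the whole group.
    by_signature = defaultdict(list)
    for config_id, items in content.items():
        by_signature[tuple(sorted(items))].append(config_id)

    # Keep only signatures shared by at least two configurations.
    return [sorted(ids) for ids in by_signature.values() if len(ids) > 1]


print(find_duplicate_groups(ROWS))  # [[1, 2]]
```

Because the signature contains every row of the group, a configuration with extra or missing rows gets a different signature and is never reported as a duplicate.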

Related

How to create Group By LINQ query

I'm using Entity Framework and I have 3 linked tables (shown in an image in the original post).
I can create a query that returns this:
purchase_number album_name purchase_amount purchase_price
1 name_1 5 1000
1 name_2 10 2000
2 name_1 3 1000
2 name_3 7 1500
3 name_2 2 2000
How can I create a query like this using LINQ?
purchase_number purchase_price(purchase_price * purchase_amount)
1 25000
2 13500
3 4000
Where q is your original query:
var result = q
    .GroupBy(x => x.purchase_number)
    .Select(x => new {
        purchase_number = x.Key,
        purchase_price = x.Sum(z => z.purchase_amount * z.purchase_price)
    });
Optionally, add an OrderBy at the end to guarantee the order, if that is a requirement.
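To check the arithmetic behind the grouping, here is a small sketch of the same group-and-sum logic in plain Python on the rows from the question. It is only an illustration of what the GroupBy/Sum computes; `total_per_purchase` is a hypothetical name, and the final sort plays the role of the optional ordering step.

```python
from collections import defaultdict

# Rows from the question:
# (purchase_number, album_name, purchase_amount, purchase_price)
PURCHASES = [
    (1, "name_1", 5, 1000),
    (1, "name_2", 10, 2000),
    (2, "name_1", 3, 1000),
    (2, "name_3", 7, 1500),
    (3, "name_2", 2, 2000),
]


def total_per_purchase(rows):
    """Sum purchase_amount * purchase_price for each purchase_number."""
    totals = defaultdict(int)
    for purchase_number, _album, amount, price in rows:
        totals[purchase_number] += amount * price
    # Sorting by key makes the output order deterministic.
    return sorted(totals.items())


print(total_per_purchase(PURCHASES))  # [(1, 25000), (2, 13500), (3, 4000)]
```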

Create Pivot Table using PostgreSQL

I have a table like this:
type  code  desc  store  Sales/Day  Stock
-----------------------------------------------
1     AA1   abc   101    3          6
1     AA2   abd   101    4          0
1     AA3   abf   101    4          3
2     BA1   bba   101    5          1
2     BA2   bbc   101    2          1
1     AA1   abc   102    1          4
1     AA2   abd   102    2          0
2     BA1   bba   102    4          2
2     BA2   bbc   102    5          5
etc.
How can I show the result table like this:
type  code  desc  Store 101            Store 102
                  Sales/Day | Stock    Sales/Day | Stock
--------------------------------------------------------------
1     AA1   abc   3           6        1           4
1     AA2   abd   4           0        2           0
1     AA3   abf   4           3        0           0
2     BA1   bba   5           1        4           2
2     BA2   bbc   2           1        5           5
etc.
Note: the colspan is for display only.
First way: FILTER
SELECT
type,
code,
"desc",
COALESCE(SUM(sales_day) FILTER (WHERE store = 101), 0) as sales_day_101,
COALESCE(SUM(stock) FILTER (WHERE store = 101), 0) as stock_101,
COALESCE(SUM(sales_day) FILTER (WHERE store = 102), 0) as sales_day_102,
COALESCE(SUM(stock) FILTER (WHERE store = 102), 0) as stock_102
FROM mytable
GROUP BY type, code, "desc"
ORDER BY type, code
This aggregates the values per (type, code, desc) group. I used SUM, but since your rows are distinct, many other aggregate functions (MAX, for example) would work just as well. FILTER restricts each aggregate to the rows of one store.
The COALESCE is to avoid NULL values if no values are present for one aggregation (like AA3 in store 102).
Second way, CASE WHEN
SELECT
type,
code,
"desc",
SUM(CASE WHEN store = 101 THEN sales_day ELSE 0 END) as sales_day_101,
SUM(CASE WHEN store = 101 THEN stock ELSE 0 END) as stock_101,
SUM(CASE WHEN store = 102 THEN sales_day ELSE 0 END) as sales_day_102,
SUM(CASE WHEN store = 102 THEN stock ELSE 0 END) as stock_102
FROM mytable
GROUP BY type, code, "desc"
ORDER BY type, code
The idea is the same, but the newer FILTER clause is replaced by the more common CASE expression.
Note that "desc" is a reserved word in Postgres, so I strongly recommend renaming your column.
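The CASE WHEN variant uses only standard SQL, so it can be tried out without a Postgres server. The sketch below runs it against an in-memory SQLite database from Python. Using SQLite as a stand-in for Postgres is an assumption made purely for illustration, and the column names `sales_day` and `stock` follow the queries above rather than the headers in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE mytable (type INT, code TEXT, "desc" TEXT, '
    "store INT, sales_day INT, stock INT)"
)
# Sample rows from the question.
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "AA1", "abc", 101, 3, 6),
        (1, "AA2", "abd", 101, 4, 0),
        (1, "AA3", "abf", 101, 4, 3),
        (2, "BA1", "bba", 101, 5, 1),
        (2, "BA2", "bbc", 101, 2, 1),
        (1, "AA1", "abc", 102, 1, 4),
        (1, "AA2", "abd", 102, 2, 0),
        (2, "BA1", "bba", 102, 4, 2),
        (2, "BA2", "bbc", 102, 5, 5),
    ],
)

PIVOT_SQL = """
SELECT type, code, "desc",
       SUM(CASE WHEN store = 101 THEN sales_day ELSE 0 END) AS sales_day_101,
       SUM(CASE WHEN store = 101 THEN stock     ELSE 0 END) AS stock_101,
       SUM(CASE WHEN store = 102 THEN sales_day ELSE 0 END) AS sales_day_102,
       SUM(CASE WHEN store = 102 THEN stock     ELSE 0 END) AS stock_102
FROM mytable
GROUP BY type, code, "desc"
ORDER BY type, code
"""

pivot_rows = conn.execute(PIVOT_SQL).fetchall()
for row in pivot_rows:
    print(row)
```

AA3 has no row for store 102, so its two store-102 columns come out as 0, which is the behaviour the ELSE 0 branch (or COALESCE in the FILTER variant) is there to provide.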

TSQL - Max per group?

I have a table that looks like this:
GroupID UserID Value
1 1 10
1 2 20
1 3 30
1 4 40
1 5 45
1 6 49
1 7 80
1 8 90
2 1 2
2 2 24
2 3 34
2 4 48
2 5 56
3 1 etc.
3 2
3 3
3 4
4 1
4 2
4 3
I am trying to write a LEAD function that will give me the midpoint between each value. To do this I have written the following:
SELECT
[GroupID]
, [UserID]+0.5
, (LEAD ([Value], 1) OVER (ORDER BY GroupID, UserID) + [Value])/2 as [Value]
from dbo.myTable
The problem with this function is that when it gets to the last User in the group, it gives me a bad value because it's taking the [Value] on the current row and the value from the next row.
What I want to do is stop it when it reaches the maximum UserID for each Group. In other words, when it gets to GroupID = 1 and UserID = 8, it should end and start at the next Group. I do not want a row that looks like this:
GroupID UserID Value
1 8.5 46
I could run a DELETE statement after I INSERT the rows into the original table, but I don't have anything to identify when a row is the "maximum" User for its Group. Ideally, I would like to somehow tell the LEAD function not to calculate it in the first place.
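One possible approach is to add PARTITION BY GroupID to the OVER clause, so that LEAD never looks past the end of a group: for the last user of each group it returns NULL, and those rows can then be filtered out. The sketch below tries this on a few of the sample rows using an in-memory SQLite database from Python. SQLite (3.25 or later, for window functions) is only a stand-in for SQL Server here, the alias names `NextValue` and `MidUserID` are made up for the example, and dividing by 2.0 is an added assumption to avoid integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (GroupID INT, UserID INT, Value INT)")
# The last two users of group 1 and the first two of group 2 from the question.
conn.executemany(
    "INSERT INTO myTable VALUES (?, ?, ?)",
    [(1, 7, 80), (1, 8, 90), (2, 1, 2), (2, 2, 24)],
)

MIDPOINT_SQL = """
SELECT GroupID,
       UserID + 0.5              AS MidUserID,
       (NextValue + Value) / 2.0 AS Value
FROM (
    SELECT GroupID, UserID, Value,
           -- PARTITION BY restarts the window for every group, so the last
           -- user of a group has no next row and NextValue is NULL.
           LEAD(Value, 1) OVER (PARTITION BY GroupID ORDER BY UserID) AS NextValue
    FROM myTable
) AS t
WHERE NextValue IS NOT NULL
ORDER BY GroupID, MidUserID
"""

midpoints = conn.execute(MIDPOINT_SQL).fetchall()
print(midpoints)  # [(1, 7.5, 85.0), (2, 1.5, 13.0)]
```

No row is produced for GroupID 1 / UserID 8.5, because the value 2 from group 2 is never paired with the value 90 from group 1.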

Build a query that pulls records based on a value in a column

My table has a parent/child relationship, along the lines of parent.id,id. There is also a column that contains a quantity, and another ID representing a grand-parent, like so:
id parent.id qty Org
1 1 1 100
2 1 0 100
3 1 4 100
4 4 1 101
5 4 2 101
6 6 1 102
7 6 0 102
8 6 1 102
What this is supposed to show is that ID 1 is the parent, IDs 2 and 3 are children that belong to ID 1, and IDs 1, 2, and 3 all belong to the grandparent 100.
If any child or parent has QTY = 0, I would like to know all the other IDs associated with that parent, and all the other parents associated with that grandparent.
For example, I would want to see a report that shows me this:
Org id parent.id qty
100 1 1 1
100 2 1 0
100 3 1 4
102 6 6 1
102 7 6 0
102 8 6 1
I would much appreciate any help you can offer in building an MS SQL 2000 (yeah, I know) query to handle this.
Try this
select * from tablename a
where exists (select 1 from tablename x
where x.parent_id = a.parent_id and qty = 0)
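Example (the CTE only supplies the sample data and requires SQL Server 2005 or later; on SQL Server 2000, run the query above against the real table instead):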
;with cte as
( select 1 id,1 parent_id, 1 qty, 100 org
union all select 2,1,0,100
union all select 3,1,4,100
union all select 4,4,1,101
union all select 5,4,2,101
union all select 6,6,1,102
union all select 7,6,0,102
union all select 8,6,1,102
)
select * from cte a
where exists (select 1 from cte x
where x.parent_id = a.parent_id and qty = 0)
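The query above matches on the parent. If the report should instead include every row under the same grandparent (Org), one reading of the question, the correlation column can be switched from parent_id to org. The sketch below compares both variants on an in-memory SQLite database from Python. SQLite is only a stand-in for SQL Server, and row 9 is a hypothetical extra parent under Org 102, added so that the two variants give different results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (id INT, parent_id INT, qty INT, org INT)")
conn.executemany(
    "INSERT INTO tablename VALUES (?, ?, ?, ?)",
    [
        (1, 1, 1, 100), (2, 1, 0, 100), (3, 1, 4, 100),
        (4, 4, 1, 101), (5, 4, 2, 101),
        (6, 6, 1, 102), (7, 6, 0, 102), (8, 6, 1, 102),
        (9, 9, 5, 102),  # hypothetical: a second parent under Org 102
    ],
)

# Rows whose parent has at least one row with qty = 0 (the answer's query).
BY_PARENT_SQL = """
SELECT a.id FROM tablename a
WHERE EXISTS (SELECT 1 FROM tablename x
              WHERE x.parent_id = a.parent_id AND x.qty = 0)
ORDER BY a.id
"""

# Rows whose Org has at least one row with qty = 0.
BY_ORG_SQL = """
SELECT a.id FROM tablename a
WHERE EXISTS (SELECT 1 FROM tablename x
              WHERE x.org = a.org AND x.qty = 0)
ORDER BY a.id
"""

ids_by_parent = [r[0] for r in conn.execute(BY_PARENT_SQL)]
ids_by_org = [r[0] for r in conn.execute(BY_ORG_SQL)]
print(ids_by_parent)  # [1, 2, 3, 6, 7, 8]
print(ids_by_org)     # [1, 2, 3, 6, 7, 8, 9]
```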

Querying sql table with multiple values

I would like to query the SQL table below
ID Val
-------------
1 5
1 7
1 8
1 9
2 5
2 7
2 9
3 1
3 5
so that it returns the following result set:
query > select distinct ID from dbo.table where val in (5,7,9)
result
--------
ID
1
2
I run into a problem where a single row can match only one val from the subset and not all of them...
Assuming the rows are distinct:
SELECT ID
FROM your_table
WHERE Val IN (5,7,9)
GROUP BY ID
HAVING COUNT(*) = 3
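If the rows are not guaranteed to be distinct, counting distinct values instead of rows removes that assumption. The sketch below shows the difference on an in-memory SQLite database from Python. SQLite is only a stand-in for SQL Server, and the two extra (3, 5) rows are hypothetical duplicates added to trigger the false positive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (ID INT, Val INT)")
conn.executemany(
    "INSERT INTO your_table VALUES (?, ?)",
    [
        (1, 5), (1, 7), (1, 8), (1, 9),
        (2, 5), (2, 7), (2, 9),
        (3, 1), (3, 5),
        (3, 5), (3, 5),  # hypothetical duplicate rows
    ],
)

# Counting rows: ID 3 has three rows with Val = 5 and is wrongly included.
COUNT_ROWS_SQL = """
SELECT ID FROM your_table
WHERE Val IN (5, 7, 9)
GROUP BY ID
HAVING COUNT(*) = 3
ORDER BY ID
"""

# Counting distinct values: only IDs that have all of 5, 7 and 9 qualify.
COUNT_DISTINCT_SQL = """
SELECT ID FROM your_table
WHERE Val IN (5, 7, 9)
GROUP BY ID
HAVING COUNT(DISTINCT Val) = 3
ORDER BY ID
"""

ids_count_rows = [r[0] for r in conn.execute(COUNT_ROWS_SQL)]
ids_count_distinct = [r[0] for r in conn.execute(COUNT_DISTINCT_SQL)]
print(ids_count_rows)      # [1, 2, 3]
print(ids_count_distinct)  # [1, 2]
```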