How to count matches across two sets of data

I am using Microsoft SQL Server. I have two tables, t1 and t2, each consisting of the following variables: PatientID, AdmissionDate, DiagnosisCode. Note that multiple diagnoses within an admission appear as multiple rows. Each table contains a different list of patients. These tables are large (up to 400,000 rows), so the solution has to be efficient. I would like to calculate the similarity of each patient in table 1 to each patient in table 2. Similarity is defined as the number of diagnoses the two patients share divided by the following sum:
0.8 * (number of diagnoses of the patient in table 1 that are not matched to the patient in table 2) +
0.2 * (number of diagnoses of the patient in table 2 that are not matched to the patient in table 1) +
(number of diagnoses the two patients share)
Any suggestions on how to organize this problem are appreciated.

Here is my attempt at solving this problem; I hope others can find more efficient ways:
-- Cross join the two tables and flag matching diagnosis codes
select #t1.id1, #t1.adm1, #t1.dx1, #t2.id2, #t2.adm2, #t2.dx2, iif(#t1.dx1 = #t2.dx2, 1, 0) as shared into #t3 from #t1 cross join #t2
-- For each table-1 diagnosis and patient pair: is it matched anywhere in table 2?
select id1, adm1, dx1, id2, adm2, sum(shared) as In1In2, iif(sum(shared) = 0, 1, 0) as In1Not2 into #t4 from #t3 group by id1, adm1, dx1, id2, adm2
-- Per patient pair: count of table-1 diagnoses with no match in table 2
select id1, adm1, id2, adm2, sum(In1Not2) as nIn1Not2 into #t5 from #t4 group by id1, adm1, id2, adm2
-- Same from the table-2 side
select id1, adm1, dx2, id2, adm2, iif(sum(shared) = 0, 1, 0) as In2Not1 into #t6 from #t3 group by id1, adm1, dx2, id2, adm2
select id1, adm1, id2, adm2, sum(In2Not1) as nIn2Not1 into #t7 from #t6 group by id1, adm1, id2, adm2
In the next step the calculated values are combined into a common table. The problem with this attempt is that running it on a t1 of 100,000 rows and a t2 of 400,000 rows takes more than 2 hours.
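One way to avoid materializing the full cross join is to join the two tables only on matching diagnosis codes and derive the unmatched counts from per-admission totals. Below is an untested sketch under two assumptions: the temp tables are #t1(id1, adm1, dx1) and #t2(id2, adm2, dx2) as in the attempt above, and a diagnosis code appears at most once per admission; #c1, #c2, and #shared are hypothetical helper tables.
-- Diagnosis counts per admission on each side
select id1, adm1, count(*) as nDx1 into #c1 from #t1 group by id1, adm1
select id2, adm2, count(*) as nDx2 into #c2 from #t2 group by id2, adm2
-- Shared diagnoses per patient pair: an equality join instead of a cross join
select t1.id1, t1.adm1, t2.id2, t2.adm2, count(*) as nShared
into #shared
from #t1 as t1
join #t2 as t2 on t1.dx1 = t2.dx2
group by t1.id1, t1.adm1, t2.id2, t2.adm2
-- Unmatched counts fall out of the totals: nDx1 - nShared and nDx2 - nShared
select s.id1, s.adm1, s.id2, s.adm2,
    s.nShared * 1.0
    / (0.8 * (c1.nDx1 - s.nShared)
     + 0.2 * (c2.nDx2 - s.nShared)
     + s.nShared) as similarity
from #shared as s
join #c1 as c1 on c1.id1 = s.id1 and c1.adm1 = s.adm1
join #c2 as c2 on c2.id2 = s.id2 and c2.adm2 = s.adm2
Pairs that share no diagnosis code never appear in #shared, so the intermediate result stays far smaller than the 100,000 x 400,000 cross join; such pairs would have a similarity of 0 anyway, so nothing is lost by omitting them.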

Related

How to query two tables with same schema but different time ranges in a date column

I have two tables:
main_products
old_products
They have the same info and schema with only one difference:
main_products has min(date) = 2022-01 and max(date) = 2022-05
and
old_products has min(date) = 2020-01 and max(date) = 2020-12
How can I query to get all records from old_products plus all records from main_products, so I get products from 2020-01 to 2022-05?
Both tables have a product_id field.
I tried to join both tables on product_id but the output is a table with twice the number of columns.
select t1.*, t2.* from t1
inner join t2
on t1.product_id = t2.product_id
I think you are looking for a UNION or UNION ALL:
SELECT *
FROM t1
WHERE ...
UNION ALL
SELECT *
FROM t2
WHERE ...
If the columns in t1 and t2 are the same (same number of columns and same types), this will pull the data from both of them. Use UNION if you want duplicates removed or UNION ALL to include duplicates. (In your case it won't make a functional difference since the tables don't overlap by date, but UNION ALL will be faster.)
In the above example, you can put your condition (to only get 2020-01 to 2022-05) in both WHERE clauses. If you don't like repeating the condition, you can use the UNION ALL query in a subquery with the condition outside:
SELECT *
FROM
(
    SELECT *
    FROM t1
    UNION ALL
    SELECT *
    FROM t2
) sq
WHERE ...
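If the date column is literally named date (an assumption; substitute your actual column name), the filled-in version for the question's range would look like:
SELECT *
FROM
(
    SELECT * FROM old_products
    UNION ALL
    SELECT * FROM main_products
) sq
WHERE sq.date BETWEEN '2020-01-01' AND '2022-05-31'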

Find the next oldest row in Redshift

I have a table called user_activity in Redshift that has department, user_id, activity_type, activity_id, activity_date.
I'd like to query a daily report of how many days since the last event (of any type). Using CROSS APPLY (SQL Server) or LATERAL JOIN (Postgres 9+), I'd do something like...
SELECT d.date, a.last_activity_date
FROM date_table d
CROSS JOIN (
    SELECT DISTINCT user_id FROM activity_table
) u
CROSS APPLY (
    SELECT TOP 1 activity_date as last_activity_date
    FROM activity_table
    WHERE user_id = u.user_id AND activity_date <= d.date
    ORDER BY activity_date DESC
) a
For now, I write it like the query below, but it is a bit slow and I am afraid it'll only get slower.
with user_activity as (
select distinct activity_date, user_id from activity_table
)
select
d.date, u.user_id,
max(u.activity_date) as last_activity_date
from date_table d
inner join user_activity u on u.activity_date <= d.date
where d.date between '2020-01-01' and current_date
group by 1, 2
Can someone suggest a good alternative for my needs, or a substitute for CROSS APPLY / LATERAL JOIN?
As you are seeing, cross joins and inequality joins slow down as your data grows, and they are generally not the approach you want in Redshift. This is due to the increase in intermediate data size that comes with this type of join when it is applied to the large tables that are typical in Redshift.
You want to use window functions to perform this type of analysis, but you will need to step back and rethink how you structure the SQL. A MAX(activity_date) window function, partitioned by user_id, ordered by date, and with a frame clause of all preceding rows, will find the most recent activity as of each date.
Now this will produce rows only for the user_ids and dates that exist in the data table, and it looks like you want 1 row for each date for each user_id, right? To do this you need to UNION in a framework of data that has 1 row per date per user_id ahead of the window function. You will need NULLs for the other columns so that the column lists match, and you will want the calendar dates in a separate column from activity_date. Now all dates for all user_ids will be in the source, and the window function will give you the result you want.
You also ask 'how is this better than the joins?' Well, in the joins you are replicating all the data records by the number of dates, which can get really big. In this approach you just have the original data records plus one row per user_id per date (which is the size of your output), and as the number of records per user_id grows, this approach doesn't blow up with it.
——— Request to modify asker’s code per comments made to their approach ———
Your code is definitely on the right track, as you have removed the massive inequality join of your original. I made 2 comments about it. The first is that I believe you need GROUP BY user_id, date to prevent the multiple rows per user_id per date that would result if there are records for the same user_id on a single date with differing activity_types. This is a simple oversight.
The second is to state that I intended for you to use UNION ALL, not LEFT JOIN, to combine the actual data with the user_id/date framework. Your approach works fine, but I have found that unioning very large amounts of data is generally faster than joining, though you do need to make sure the columns match up. Either way we end up with a data segment of 3 columns: 2 date columns (one of which holds NULLs for framework rows) and 1 user_id. Your approach is fine, and the difference in performance is likely very small unless you have huge tables.
Since you asked for a rewrite, here it is with both changes. (NOTE: my laptop is in the shop so I don't have ready access to Redshift at the moment and this SQL is untested. If the intent is not clear from this and you need me to debug it, there will be a delay of a few days. I'm keeping your setup methods and SQL structure.)
with date_table as (
    select '2000-01-01'::date as date
    union all
    select '2000-01-02'::date
    union all
    select '2000-01-03'::date
    union all
    select '2000-01-04'::date
    union all
    select '2000-01-05'::date
    union all
    select '2000-01-06'::date
),
users as (
    select 1 as user_id
    union all
    select 2
    union all
    select 3
),
user_activity as (
    select 1 as user_id, '2000-01-01'::date as activity_date
    union all
    select 1 as user_id, '2000-01-04'::date as activity_date
    union all
    select 3 as user_id, '2000-01-03'::date as activity_date
    union all
    select 1 as user_id, '2000-01-05'::date as activity_date
    union all
    select 1 as user_id, '2000-01-06'::date as activity_date
),
user_dates as (
    select d.date, u.user_id
    from date_table d
    cross join users u
),
user_date_activity as (
    select cal_date, user_id,
        lag(max(activity_date), 1) ignore nulls over (partition by user_id order by cal_date) as last_activity_date
    from (
        select user_id, date as cal_date, NULL as activity_date from user_dates
        union all
        select user_id, activity_date as cal_date, activity_date from user_activity
    ) combined
    group by user_id, cal_date
)
select * from user_date_activity
order by user_id, cal_date
This was my query based on Bill's answer.
with date_table as (
    select '2000-01-01'::date as date
    union all
    select '2000-01-02'::date
    union all
    select '2000-01-03'::date
    union all
    select '2000-01-04'::date
    union all
    select '2000-01-05'::date
    union all
    select '2000-01-06'::date
),
users as (
    select 1 as user_id
    union all
    select 2
    union all
    select 3
),
user_activity as (
    select 1 as user_id, '2000-01-01'::date as activity_date
    union all
    select 1 as user_id, '2000-01-04'::date as activity_date
    union all
    select 3 as user_id, '2000-01-03'::date as activity_date
    union all
    select 1 as user_id, '2000-01-05'::date as activity_date
    union all
    select 1 as user_id, '2000-01-06'::date as activity_date
),
user_dates as (
    select d.date, u.user_id
    from date_table d
    cross join users u
),
user_date_activity as (
    select ud.date, ud.user_id,
        lag(ua.activity_date, 1) ignore nulls over (partition by ud.user_id order by ud.date) as last_activity_date
    from user_dates ud
    left join user_activity ua on ud.date = ua.activity_date and ud.user_id = ua.user_id
)
select * from user_date_activity
order by user_id, date

Postgres count total matches per group

Input data
I have the following association table:
AssociationTable
- Item ID: Integer
- Tag ID: Integer
Referring to the following example data:
Item  Tag
1     1
1     2
1     3
2     1
and some input list of tags T (e.g. [1, 2])
What I want
For each item, I would like to know which tags were not provided in the input list T.
With our sample data, we'd get:
Item  Num missing
1     1
2     0
My thoughts
The best I've done so far is:
select "ItemId", count("TagId") as "Num missing" from "AssociationTab" where "TagId" not in (1, 2) group by "ItemId";
The problem here is that items where all tags match will not be included in the output.
You could use a calendar table with an anti-join approach:
WITH cte AS (
    SELECT t1.Item, t2.Tag
    FROM (SELECT DISTINCT Item FROM AssociationTable) t1
    CROSS JOIN (SELECT 1 AS Tag UNION ALL SELECT 2) t2
)
SELECT
    t1.Item,
    COUNT(*) FILTER (WHERE t2.Item IS NULL) AS num_missing
FROM cte t1
LEFT JOIN AssociationTable t2
    ON t1.Item = t2.Item AND
       t1.Tag = t2.Tag AND
       t2.Tag IN (1, 2)
GROUP BY
    t1.Item;
The strategy here is to build a calendar/reference table in the first CTE which contains all combinations of items and tags. Then, we left join this CTE to your association table, aggregate by item, and then detect how many tags are missing for each item.
The simplest solution is:
SELECT
ItemId,
count(*) FILTER (WHERE TagId NOT IN (1,2))
FROM AssociationTab
GROUP BY ItemId
Alternatively, if you already have an Items table with the item list, you could do this:
SELECT
i.ItemId,
count(a.TagId)
FROM Items i
LEFT JOIN AssociationTab a ON a.ItemId = i.ItemId AND a.TagId NOT IN (1,2)
GROUP BY i.ItemId
The key is that LEFT JOIN does not remove the Items row if no tags match.
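A quick way to see this is a self-contained demo with the question's sample data (hypothetical names, runnable as-is in Postgres):
WITH "Items"(ItemId) AS (VALUES (1), (2)),
     "AssociationTab"(ItemId, TagId) AS (VALUES (1, 1), (1, 2), (1, 3), (2, 1))
SELECT i.ItemId, count(a.TagId) AS num_missing
FROM "Items" i
LEFT JOIN "AssociationTab" a
    ON a.ItemId = i.ItemId AND a.TagId NOT IN (1, 2)
GROUP BY i.ItemId
ORDER BY i.ItemId;
Item 1 keeps one joined row (tag 3) and gets a count of 1; item 2 matches nothing but still appears with 0, because count(a.TagId) ignores the NULLs produced by the outer join.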

Using Derived Tables and CTEs to Display Details?

I am teaching myself T-SQL and am struggling to comprehend the following example...
Suppose you want to display several nonaggregated columns along with
some aggregate expressions that apply to the entire result set or to a
larger grouping level. For example, you may need to display several
columns from the Sales.SalesOrderHeader table and calculate the
percent of the TotalDue for each sale compared to the TotalDue for all
the customer’s sales. If you group by CustomerID, you can’t include
other nonaggregated columns from Sales.SalesOrderHeader unless you
group by those columns. To get around this, you can use a derived
table or a CTE.
Here are two examples given...
SELECT c.CustomerID, SalesOrderID, TotalDue, AvgOfTotalDue,
TotalDue/SumOfTotalDue * 100 AS SalePercent
FROM Sales.SalesOrderHeader AS soh
INNER JOIN
(SELECT CustomerID, SUM(TotalDue) AS SumOfTotalDue,
AVG(TotalDue) AS AvgOfTotalDue
FROM Sales.SalesOrderHeader
GROUP BY CustomerID) AS c ON soh.CustomerID = c.CustomerID
ORDER BY c.CustomerID;
WITH c AS
(SELECT CustomerID, SUM(TotalDue) AS SumOfTotalDue,
AVG(TotalDue) AS AvgOfTotalDue
FROM Sales.SalesOrderHeader
GROUP BY CustomerID)
SELECT c.CustomerID, SalesOrderID, TotalDue,AvgOfTotalDue,
TotalDue/SumOfTotalDue * 100 AS SalePercent
FROM Sales.SalesOrderHeader AS soh
INNER JOIN c ON soh.CustomerID = c.CustomerID
ORDER BY c.CustomerID;
Why doesn't this query produce the same result?
SELECT CustomerID, SalesOrderID, TotalDue, AVG(TotalDue) AS AvgOfTotalDue,
TotalDue/SUM(TotalDue) * 100 AS SalePercent
FROM Sales.SalesOrderHeader
GROUP BY CustomerID, SalesOrderID, TotalDue
ORDER BY CustomerID
I'm looking for someone to explain the above examples in another way or step through them logically so I can understand how they work.
The aggregates in this statement (i.e. SUM and AVG) don't do anything:
SELECT CustomerID, SalesOrderID, TotalDue, AVG(TotalDue) AS AvgOfTotalDue,
TotalDue/SUM(TotalDue) * 100 AS SalePercent
FROM Sales.SalesOrderHeader
GROUP BY CustomerID, SalesOrderID, TotalDue
ORDER BY CustomerID
The reason for this is that you're grouping by TotalDue, so all records in the same group have the same value for this field. In the case of AVG this means you're guaranteed that AvgOfTotalDue will always equal TotalDue. For SUM it's possible you'd get a different result, but as you're also grouping by SalesOrderID (which I'd imagine is unique in the SalesOrderHeader table) you will only have one record per group, so again this will always equal the TotalDue value.
With the CTE example you're only grouping by CustomerId; as a customer may have many sales orders associated with it, these aggregate values will be different to the TotalDue.
EDIT
Explanation of the aggregate of field included in group by:
When you group by a value, all rows with that same value are collected together and aggregate functions are performed over them. Say you had 3 rows with a total due of 1 and 2 rows with a total due of 2: you'd get two result lines, one for the 1s and one for the 2s. Now if you sum each group you have 3*1 and 2*2. Divide by the number of rows in each group (to get the average) and you have 3*1/3 and 2*2/2; the row counts cancel out, leaving you with 1 and 2.
select totalDue, avg(totalDue)
from (
select 1 totalDue
union all select 1 totalDue
union all select 1 totalDue
union all select 2 totalDue
union all select 2 totalDue
) x
group by totalDue
select uniqueId, totalDue, avg(totalDue), sum(totalDue)
from (
select 1 uniqueId, 1 totalDue
union all select 2 uniqueId, 1 totalDue
union all select 3 uniqueId, 1 totalDue
union all select 4 uniqueId, 2 totalDue
union all select 5 uniqueId, 2 totalDue
) x
group by uniqueId
Runnable Example: http://sqlfiddle.com/#!2/d41d8/21263
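As an aside, the same detail-plus-aggregate output can be produced without a derived table or CTE by using window functions. Here is a sketch against the same Sales.SalesOrderHeader table (an alternative, not the book's example):
SELECT CustomerID, SalesOrderID, TotalDue,
    AVG(TotalDue) OVER (PARTITION BY CustomerID) AS AvgOfTotalDue,
    TotalDue / SUM(TotalDue) OVER (PARTITION BY CustomerID) * 100 AS SalePercent
FROM Sales.SalesOrderHeader
ORDER BY CustomerID;
The OVER (PARTITION BY CustomerID) clause computes each aggregate per customer while leaving the detail rows intact, which is exactly what the join against the grouped derived table achieves.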

Find duplicate row "details" in table

OrderId  OrderCode  Description
-------  ---------  -----------
1        Z123       Stuff
2        ABC999     Things
3        Z123       Stuff
I have duplicates in a table like the above. I'm trying to get a report of which Orders are duplicates, and what Order they are duplicates of, so I can figure out how they got into the database.
So ideally I'd like to get an output something like:
OrderId  IsDuplicatedBy
-------  --------------
1        3
3        1
I can't work out how to code this in SQL.
You can use the same table twice in one query and join on the fields you need to check against. The condition T1.OrderID <> T2.OrderID is needed so that a row is not reported as a duplicate of itself.
declare #T table (OrderID int, OrderCode varchar(10), Description varchar(50))
insert into #T values
(1, 'Z123', 'Stuff'),
(2, 'ABC999', 'Things'),
(3, 'Z123', 'Stuff')
select
T1.OrderID,
T2.OrderID as IsDuplicatedBy
from #T as T1
inner join #T as T2
on T1.OrderCode = T2.OrderCode and
T1.Description = T2.Description and
T1.OrderID <> T2.OrderID
Result:
OrderID  IsDuplicatedBy
1        3
3        1
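If an order could have more than one duplicate, the self-join would return one row per duplicate pair. A variant that collapses them into a single row per order (a sketch assuming SQL Server 2017+ for STRING_AGG, reusing the same #T table variable):
select
    T1.OrderID,
    string_agg(cast(T2.OrderID as varchar(10)), ', ') as IsDuplicatedBy
from #T as T1
inner join #T as T2
    on T1.OrderCode = T2.OrderCode and
       T1.Description = T2.Description and
       T1.OrderID <> T2.OrderID
group by T1.OrderID
With the sample data the output matches the result above; with a three-way duplicate, each OrderID would get one comma-separated list instead of multiple rows.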