Choosing the first child record in a self-join in T-SQL - tsql

I've got a visits table that looks like this:
id identity(1,1) not null,
visit_date datetime not null,
patient_id int not null,
flag bit not null
For each record, I need to find a matching record that is at the same time or earlier, has the same patient_id, and has flag set to 1. What I am doing now is:
select
    parent.id as parent_id,
    (
        select top 1
            child.id
        from visits as child
        where child.visit_date <= parent.visit_date
          and child.patient_id = parent.patient_id
          and child.flag = 1
        order by child.visit_date desc
    ) as child_id
from visits as parent
So, this query works correctly, except that it runs too slowly -- I suspect this is because of the subquery. Is it possible to rewrite it as a joined query?

View the query execution plan. Where you have thick arrows, look at those statements. You should learn the different operators and what they imply, like Clustered Index Scan/Seek etc.
Usually when a query is going slow, however, I find that there are no good indexes.
For the tables and columns affected and used in the join, create an index that covers all these columns. This is usually called a covering index in the forums. It's something you can do for a query that really needs it, but keep in mind that too many indexes will slow down insert statements.
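For example (my illustration, not part of the original answer), a covering index for the query in the question might look like the following; the index name and exact key order are assumptions:
-- Seeks on (patient_id, flag), scans visit_date backwards,
-- and INCLUDEs id so the subquery is answered from the index alone.
create nonclustered index IX_visits_patient_flag_date
    on visits (patient_id, flag, visit_date desc)
    include (id);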

/*
id identity(1,1) not null,
visit_date datetime not null,
patient_id int not null,
flag bit not null
*/
SELECT
    T.parentId,
    T.patientId,
    V.id AS childId
FROM
(
    SELECT
        visit.id AS parentId,
        visit.patient_id AS patientId,
        MAX(previousVisit.visit_date) AS previousVisitDate
    FROM
        visits AS visit
        LEFT JOIN visits AS previousVisit ON
            visit.patient_id = previousVisit.patient_id
            AND visit.visit_date >= previousVisit.visit_date
            AND visit.id <> previousVisit.id
            AND previousVisit.flag = 1
    GROUP BY
        visit.id,
        visit.visit_date,
        visit.patient_id,
        visit.flag
) AS T
LEFT JOIN visits AS V ON
    T.patientId = V.patient_id
    AND T.previousVisitDate = V.visit_date
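As an aside (my addition, not part of the original answers): on SQL Server 2005 and later, the same "top 1 per parent row" lookup can be written with OUTER APPLY. It keeps the correlated logic but avoids the double pass, and it cannot return duplicate rows when two flagged visits share the same visit_date; with a covering index it typically becomes one index seek per parent row. A minimal sketch, assuming the schema from the question:
select
    parent.id as parent_id,
    prev.child_id
from visits as parent
outer apply
(
    -- most recent flagged visit at or before this one
    select top (1) child.id as child_id
    from visits as child
    where child.patient_id = parent.patient_id
      and child.visit_date <= parent.visit_date
      and child.flag = 1
    order by child.visit_date desc
) as prev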

Related

Optional filter on a column of an outer joined table in the where clause

I have got two tables:
create table student
(
studentid bigint primary key not null,
name varchar(200) not null
);
create table courseregistration
(
studentid bigint not null,
coursenamename varchar(200) not null,
isfinished boolean default false
);
--insert some data
insert into student values(1,'Dave');
insert into courseregistration values(1,'SQL',true);
The student is fetched by id, so it should always be returned in the result. The entry in courseregistration is optional and should be returned if there are matching rows, and those matching rows should be filtered on isfinished=false. This means I want to get the course registrations that are not finished yet. I tried to outer join student with courseregistration and filter courseregistration on isfinished=false. Note that I still want to retrieve the student.
Trying this returns no rows:
select * from student
left outer join courseregistration using(studentid)
where studentid = 1
and courseregistration.isfinished = false
What I'd want in the example above is a result set with one student row, but the course columns null (because the only example row has isfinished=true). One more constraint though: if there is no corresponding row in courseregistration, there should still be a result for the student entry.
This is an adjusted example. I can tweak my code to solve the problem, but I really wonder: what is the "correct/smart way" of solving this in PostgreSQL?
PS I have used the (+) in Oracle previously to solve similar issues.
Isn't this what you are looking for? Moving the isfinished filter from the where clause into the join's on condition keeps the join optional, so the student row survives even when no course row qualifies:
select * from student s
left outer join courseregistration cr
on s.studentid = cr.studentid
and cr.isfinished = false
where s.studentid = 1
db<>fiddle here
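An equivalent formulation (my addition, not from the original answer) pre-filters courseregistration in a derived table before the outer join:
select *
from student s
left outer join (
    -- only unfinished registrations take part in the join
    select * from courseregistration where isfinished = false
) cr on s.studentid = cr.studentid
where s.studentid = 1;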

Postgres: Need to match records from two tables based on key value and earliest dates in each table

I'm dealing with a pretty unique record matching problem within Postgres right now. Essentially I have a table (A) with a lot of records in it, including a key value that I need to match on and the date of the record. Then I have this other table (B) that I want to match to the first table on that key value. However, there can be multiples of the same key value in both tables. To get around this I need to match the earliest key value from table A to the earliest key value in table B, the second earliest to the second earliest, and so on. However, if table B runs out of matches for a key value, then I want the remaining table A records to default to the latest match in table B, even though something else has already matched on it.
My initial thought is to use something like this on both tables:
ROW_NUMBER() OVER ( PARTITION BY key_value ORDER BY date) AS rank
Then I would join on the rank and key_value fields. However, I'm not exactly sure how to get that default scenario to work with this method. And if records are added to one table and not the other and I try the join again, I feel like it might get out of sync.
My other thought was to use a cursor, but I'm really struggling to see how I'd implement that.
Any help would be greatly appreciated!
First you need to number all your rows, then find the ones with matching ranks.
After that, match the ones without a match to the latest date.
with cteA as (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY key_value ORDER BY date) AS rank
    FROM tableA
), cteB as (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY key_value ORDER BY date) AS rank
    FROM tableB
), ranked_match as (
    -- alias table B's columns explicitly so the outer queries can test them,
    -- and so both UNION ALL branches below return the same column list
    SELECT cteA.key_value, cteA.date AS a_date,
           cteB.key_value AS b_key_value, cteB.date AS b_date
    FROM cteA
    LEFT JOIN cteB
        ON cteA.key_value = cteB.key_value
        AND cteA.rank = cteB.rank
), latest_row as (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY key_value ORDER BY date DESC) AS rank
    FROM tableB
)
-- rows that found a positional match in table B
SELECT key_value, a_date, b_key_value, b_date
FROM ranked_match
WHERE b_key_value IS NOT NULL
UNION ALL
-- unmatched rows default to the latest table B row for that key
SELECT ranked_match.key_value, ranked_match.a_date,
       latest_row.key_value, latest_row.date
FROM ranked_match
JOIN latest_row
    ON ranked_match.key_value = latest_row.key_value
WHERE ranked_match.b_key_value IS NULL
    AND latest_row.rank = 1

How can I execute a least cost routing query in postgresql, without temporary tables?

How can I execute a telecoms least cost routing query in PostgreSQL?
The purpose is to generate a result set ordered by the lowest price for the carriers. The table structure is below.
SQL Fiddle
CREATE TABLE tariffs (
trf_tariff_id integer,
trf_carrier_id integer,
trf_prefix character varying,
trf_destination character varying,
trf_price numeric(15,6),
trf_connect_charge numeric(15,6),
trf_billing_interval integer,
trf_minimum_interval integer
);
For instance, to check the cost of a call if passed through a particular carrier (carrier_id), the query is:
SELECT trf_price, trf_prefix as lmp
FROM tariffs
WHERE SUBSTRING(dialled_number, 1, LENGTH(trf_prefix)) = trf_prefix
  and trf_carrier_id = carrier_id
ORDER BY trf_prefix DESC
limit 1
For the cost of the call for each carrier, i.e. the least cost query, the query is:
-- select * from tariffs
select distinct banana2.longest_prefix, banana2.trf_carrier_id_2,
       apple2.trf_carrier_id, apple2.lenprefix, apple2.trf_price, apple2.trf_destination
from
    (select banana.longest_prefix, banana.trf_carrier_id_2
     from (select max(length(trf_prefix)) as longest_prefix,
                  trf_carrier_id as trf_carrier_id_2
           from (select *, length(trf_prefix) as lenprefix
                 from tariffs
                 where substring('35567234567', 1, length(trf_prefix)) = trf_prefix) as apple
           group by apple.trf_carrier_id) as banana) as banana2,
    (select *, length(trf_prefix) as lenprefix
     from tariffs
     where substring('35567234567', 1, length(trf_prefix)) = trf_prefix) as apple2
where banana2.trf_carrier_id_2 = apple2.trf_carrier_id
  and banana2.longest_prefix = apple2.lenprefix
order by trf_price
The query works on the basis that, for each carrier, the longest matching prefix for a dialled number is unique. So a join on the longest prefix and carrier gives the set for all the carriers.
I have one problem with my query:
I don't want to do the apple(X) query twice
(select *, length(trf_prefix) as lenprefix from tariffs where substring('35567234567', 1, length(trf_prefix) )= trf_prefix) as apple
There must be a more elegant way, probably declaring it once and using it twice.
What I want to do is run the single-carrier query for each carrier:
SELECT trf_price, trf_prefix as lmp
FROM tariffs
WHERE SUBSTRING(dialled_number, 1, LENGTH(trf_prefix)) = trf_prefix
  and trf_carrier_id = carrier_id
ORDER BY trf_prefix DESC
limit 1
and combine them into one set which will be sorted by price.
In fact I want to generalize the method for any such query where the output for the various values of a particular column or set of columns is combined into one set for further querying. I am told that CTEs are the way to accomplish that kind of query, but I find the docs rather confusing. It is much easier to learn from your own use cases.
PS. I am aware that the prefix length can be precomputed and stored.
Common Table Expressions:
with apple as (
select *, length(trf_prefix) as lenprefix
from tariffs
where substring('35567234567', 1, length(trf_prefix)) = trf_prefix
)
select distinct banana2.longest_prefix, banana2.trf_carrier_id_2,
apple.trf_carrier_id, apple.lenprefix, apple.trf_price,
apple.trf_destination
from (select banana.longest_prefix, banana.trf_carrier_id_2
from (select max(length(trf_prefix)) as longest_prefix,
trf_carrier_id as trf_carrier_id_2
from apple
group by apple.trf_carrier_id) as banana) as banana2,
apple
where banana2.trf_carrier_id_2 = apple.trf_carrier_id
and banana2.longest_prefix = apple.lenprefix
order by trf_price
You can just pull out the repeated table definition. Even if I'm just using one of those sub-select-in-a-from things a single time, I still use CTEs. I find the style you're using basically unreadable.
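A further aside (my addition, not from the original answer): in PostgreSQL, DISTINCT ON can collapse the whole "longest matching prefix per carrier" step into a single pass over the matching rows; a minimal sketch against the same tariffs table:
select *
from (
    -- one row per carrier: the row with the longest matching prefix
    select distinct on (trf_carrier_id)
           trf_carrier_id, trf_prefix, trf_price, trf_destination
    from tariffs
    where substring('35567234567', 1, length(trf_prefix)) = trf_prefix
    order by trf_carrier_id, length(trf_prefix) desc
) as per_carrier
order by trf_price;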

Random order by performance

What is the best way to get top n rows by random order?
I use the query like:
Select top(10) field1,field2 .. fieldn
from Table1
order by checksum(newid())
The problem with the above query is that it keeps getting slower as the table size increases. It will always do a full clustered index scan to find the top(10) rows in random order.
Is there any other better way to do it?
I have tested this and got better performance by changing the query.
The DDL for the table I used in my tests:
CREATE TABLE [dbo].[TestTable]
(
[ID] [int] IDENTITY(1,1) NOT NULL,
[Col1] [nvarchar](100) NOT NULL,
[Col2] [nvarchar](38) NOT NULL,
[Col3] [datetime] NULL,
[Col4] [nvarchar](50) NULL,
[Col5] [int] NULL,
CONSTRAINT [PK_TestTable] PRIMARY KEY CLUSTERED
(
[ID] ASC
)
)
GO
CREATE NONCLUSTERED INDEX [IX_TestTable_Col5] ON [dbo].[TestTable]
(
[Col5] ASC
)
The table has 722888 rows.
First query:
select top 10
T.ID,
T.Col1,
T.Col2,
T.Col3,
T.Col4,
T.Col5
from TestTable as T
order by newid()
Statistics for first query:
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 13 ms.
(10 row(s) affected)
Table 'TestTable'. Scan count 1, logical reads 12492, physical reads 14, read-ahead reads 6437, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 859 ms, elapsed time = 1700 ms.
Execution plan first query:
Second query:
select
T.ID,
T.Col1,
T.Col2,
T.Col3,
T.Col4,
T.Col5
from TestTable as T
inner join (select top 10 ID
from TestTable
order by newid()) as C
on T.ID = C.ID
Statistics for second query:
SQL Server parse and compile time:
CPU time = 125 ms, elapsed time = 183 ms.
(10 row(s) affected)
Table 'TestTable'. Scan count 1, logical reads 1291, physical reads 10, read-ahead reads 399, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 516 ms, elapsed time = 706 ms.
Execution plan second query:
Summary:
The second query is using the index on Col5 to order the rows by newid() and then it does a Clustered Index Seek 10 times to get the values for the output.
The performance gain is there because the index on Col5 is narrower than the clustered key and that causes fewer reads.
Thanks to Martin Smith for pointing that out.
One way to reduce the size of the scan necessary is by using a combination of TABLESAMPLE with ORDER BY newid() in order to select a random number of rows from a selection of pages in the table, rather than scanning the entire table.
The idea is to calculate the average number of rows per page, then use tablesample to select 1 random page of data for each row you want to output. Then you will run the ORDER BY newid() query on just that subset of data. This approach will be slightly less random than the original approach, but is much better than just using tablesample and involves reading much less data off the table.
Unfortunately, the TABLESAMPLE clause does not accept a variable, so dynamic SQL is necessary in order to use a dynamic rows value based on the record size of the input table.
declare @factor int
select @factor = 8000/avg_record_size_in_bytes
from sys.dm_db_index_physical_stats(db_id(), object_id('sample'), null, null, 'detailed')
where index_level = 0
declare @numRows int = 10
declare @sampledRows int = @factor * @numRows
declare @stmt nvarchar(max) = N'select top (@numRows) * from sample tablesample (' + convert(varchar(32), @sampledRows) + N' rows) order by checksum(newid())'
exec sp_executesql @stmt, N'@numRows int', @numRows
This question is 7 years old and had no accepted answer, but it ranked high when I searched for SQL performance on selecting random rows. None of the current answers seems to give a simple, quick solution for large tables, so I want to add my suggestion.
Assumptions:
The primary key is a numeric data type (typical int / +1 per row)
The primary key is the clustered index
The table has many rows, and only a few should be selected
I think this is fairly common, so it would help in a lot of cases.
Given a typical set of data, my suggestion is to:
1. Find max and min
2. Pick a random number
3. Check if the number is a valid id in the table
4. Repeat as needed
These operations should all be very fast as they all are on the clustered index. Only at the end will the rest of the data be read, by selecting a set based on a list of primary keys so we only pull in the data we actually need.
Example (MS SQL):
--
-- First, create a table with some dummy data to select from
--
DROP TABLE IF EXISTS MainTable
CREATE TABLE MainTable(
Id int IDENTITY(1,1) NOT NULL,
[Name] nvarchar(50) NULL,
[Content] text NULL
)
GO
DECLARE @I INT = 0
WHILE @I < 40
BEGIN
    INSERT INTO MainTable VALUES('Foo', 'bar')
    SET @I = @I + 1
END
UPDATE MainTable SET [Name] = [Name] + CAST(Id as nvarchar(50))
-- Create a gap in IDs at the start
DELETE FROM MainTable
WHERE ID < 10
-- Create a gap in IDs in the middle
DELETE FROM MainTable
WHERE ID >= 20 AND ID < 30
-- We now have our "source" data we want to select random rows from
--
-- Then we select random data from our table
--
-- Get the interval of values to pick random values from
DECLARE @MaxId int
SELECT @MaxId = MAX(Id) FROM MainTable
DECLARE @MinId int
SELECT @MinId = MIN(Id) FROM MainTable
DECLARE @RandomId int
DECLARE @NumberOfIdsTofind int = 10
-- Make temp table to insert ids into
DROP TABLE IF EXISTS #Ids
CREATE TABLE #Ids (Id int)
WHILE (@NumberOfIdsTofind > 0)
BEGIN
    SET @RandomId = ROUND(((@MaxId - @MinId - 1) * RAND() + @MinId), 0)
    -- Verify that the random ID is a real id in the main table
    IF EXISTS (SELECT Id FROM MainTable WHERE Id = @RandomId)
    BEGIN
        -- Verify that the random ID has not already been inserted
        IF NOT EXISTS (SELECT Id FROM #Ids WHERE Id = @RandomId)
        BEGIN
            -- It's a valid, new ID, add it to the list.
            INSERT INTO #Ids VALUES (@RandomId)
            SET @NumberOfIdsTofind = @NumberOfIdsTofind - 1;
        END
    END
END
-- Select the random rows of data by joining the main table with our random Ids
SELECT MainTable.* FROM MainTable
INNER JOIN #Ids ON #Ids.Id = MainTable.Id
No, there is no way to improve the performance here. Since you want the rows in "random" order, indexes will be useless. You could, however, try ordering by newid() instead of its checksum, but that only changes how the random order is produced, not the cost of the sort itself.
The server has no way of knowing that you want a random selection of 10 rows from the table. The query is going to evaluate the order by expression for every row in the table, since it's a computed value that cannot be determined by index values. This is why you're seeing a full clustered index scan.

Designing a SQL view and improving performance

I have a requirement to create a view, and the business scenario is explained below.
Consider that I have the tables Products (all product information) and Settings (settings for a country/state/city).
Now I have to create a view which gives product information by taking the settings into account. Cities, states, and countries may each have their own settings.
Design of the view
It means that first I need to check:
1. Does any city have its own custom settings? If so, output those records
UNION ALL
2. Does any state have its own custom settings? If so, output those records, excluding the cities covered in step 1
UNION ALL
3. Does any country have its own custom settings? If so, output those records, excluding the cities and states covered in steps 1 and 2
A rough sketch of this precedence logic is shown below.
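(Sketch added for illustration only; the table and column names are assumed from the answer below, since the sample data was posted as images.)
select p.*, s.MinPrice, s.MaxPrice
from dbo.Product p
join dbo.Settings s on s.CityId = p.CityId            -- 1. city-level settings
union all
select p.*, s.MinPrice, s.MaxPrice
from dbo.Product p
join dbo.Settings s on s.StateId = p.StateId
                   and s.CityId is null               -- 2. state-level settings
where not exists (select 1 from dbo.Settings c where c.CityId = p.CityId)
union all
select p.*, s.MinPrice, s.MaxPrice
from dbo.Product p
join dbo.Settings s on s.CountryId = p.CountryId
                   and s.StateId is null
                   and s.CityId is null               -- 3. country-level settings
where not exists (select 1 from dbo.Settings c
                  where c.CityId = p.CityId or c.StateId = p.StateId)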
This is the design I thought of -- is there anything wrong with it?
Improving performance
With this existing design, a query takes 5 minutes to run without any indexes on the view or base tables.
Now what is the best option for me to improve the performance?
Creating indexed views, or creating indexes on the base tables? Which one helps me make the query run in seconds :)
Sample data: the Product table, the Settings table, and the expected output were posted as images.
I can't work out why your (P2 - Blue) result is showing. I re-wrote your samples as SQL, and created what I thought you wanted (whilst waiting for your expected output), and mine only produces one row (P1 - Red).
create table dbo.Product (
ProductID int not null,
Name char(2) not null,
StateId char(2) not null,
CityId char(2) not null,
CountryId char(2) not null,
Price int not null,
Colour varchar(10) not null,
constraint PK_Product PRIMARY KEY (ProductID)
)
go
insert into dbo.Product (ProductID,Name,StateId,CityId,CountryId,Price,Colour)
select 1,'P1','S1','C1','C1',150,'Red' union all
select 2,'P2','S2','C2','C1',100,'Blue' union all
select 3,'P3','S1','C3','C1',200,'Green'
go
create table dbo.Settings (
SettingsID int not null,
StateId char(2) null,
CityId char(2) null,
CountryId char(2) null,
MaxPrice int not null,
MinPrice int not null,
constraint PK_Settings PRIMARY KEY (SettingsID)
)
go
insert into dbo.Settings (SettingsID,StateId,CityId,CountryId,MaxPrice,MinPrice)
select 1,null,null,'C1',1000,150 union all
select 2,'S1',null,'C1',2000,100 union all
select 3,'S1','C3','C1',3000,300
go
And now the actual view:
create view dbo.Products_Filtered
with schemabinding
as
with MatchedSettings as (
select p.ProductID,MAX(MinPrice) as MinPrice,MIN(MaxPrice) as MaxPrice
from
dbo.Product p
inner join
dbo.Settings s
on
(p.CountryId = s.CountryId or s.CountryId is null) and
(p.CityId = s.CityId or s.CityId is null) and
(p.StateId = s.StateId or s.StateId is null)
group by
p.ProductID
)
select
p.ProductID,p.Name,p.CityID,p.StateId,p.CountryId,p.Price,p.Colour
from
dbo.Product p
inner join
MatchedSettings ms
on
p.ProductID = ms.ProductID and
p.Price between ms.MinPrice and ms.MaxPrice
What I did was to combine all applicable settings, and then assumed that we applied the most restrictive settings (so take the MAX MinPrice specified and MIN MaxPrice).
Using those rules, the (P2 - Blue) row is ruled out, since the only applicable setting is setting 1 - which has a Min price of 150.
If I reverse it, so that we try to be as inclusive as possible (MIN MinPrice and MAX MaxPrice), then that returns (P1 - Red) and (P3 - Green) - but still not (P2 - Blue)
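For reference (added for clarity), the inclusive variant only changes the aggregates inside the view's MatchedSettings CTE; the rest of the view stays exactly the same:
with MatchedSettings as (
    -- Inclusive variant: widest price window across all applicable settings
    select p.ProductID, MIN(MinPrice) as MinPrice, MAX(MaxPrice) as MaxPrice
    from dbo.Product p
    inner join dbo.Settings s on
        (p.CountryId = s.CountryId or s.CountryId is null) and
        (p.CityId = s.CityId or s.CityId is null) and
        (p.StateId = s.StateId or s.StateId is null)
    group by p.ProductID
)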