Join Solution for offset columns - tsql

I would like to select data from two columns, both of similar length (@fetch rows) but at a certain offset, in the same table. However, right off the bat I am running into syntax errors. I would prefer a join solution for this. Thanks.
SELECT c.[Close], h.[High]
FROM
(
    SELECT [Close], [CityID], [Time]
    FROM [dataSQL].[dbo].[temperature]
    WHERE [Time] < @time
    ORDER BY [Time] DESC
    OFFSET 0 ROWS
    FETCH NEXT (@fetch) ROWS ONLY
) AS c
JOIN
(
    SELECT [High], [CityID], [Time]
    FROM [dataSQL].[dbo].[temperature]
    WHERE [Time] < @time
    ORDER BY [Time] DESC
    OFFSET (@offset) ROWS
    FETCH NEXT (@fetch) ROWS ONLY
) AS h
    ON c.[CityID] = h.[CityID] AND c.[Time] = h.[Time]
WHERE c.[CityID] = @name AND h.[CityID] = @name;
EDIT:
This is now returning more results than expected, with repetitions in both columns.
EDIT:
This now returns columns from the same row, with no offset, because I required the primary keys to match! There has to be an offset, and the problem is that this table contains data for more than two cities, so you cannot simply use ROW_NUMBER()!
Here is my table schema:
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[Temperature](
[Time] [datetime] NOT NULL,
[CityID] [tinyint] NOT NULL,
[High] [real] NULL,
[Close] [real] NULL,
CONSTRAINT [pk_time_cityid] PRIMARY KEY CLUSTERED
(
[Time] ASC,
[CityID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[Temperature] WITH CHECK ADD CONSTRAINT [FK_Temperature_Cities] FOREIGN KEY([CityID])
REFERENCES [dbo].[Cities] ([CityID])
GO
ALTER TABLE [dbo].[Temperature] CHECK CONSTRAINT [FK_Temperature_Cities]
GO

Keep in mind you can edit your question to correct your query.
In addition to the OFFSET 0 ROWS and FETCH NEXT (...) ROWS ONLY issues, you need to use SELECT [CityID],[Close] and SELECT [CityID],[High] in your subqueries respectively. The only fields from a subquery that are available outside the subquery are the ones you explicitly specify, and that includes fields used in JOIN conditions.
Regarding your comment that you're getting the number of rows you expect, squared: you've probably got an implicit cross join that's creating a Cartesian product. You need to join your subqueries using the fields in the primary key of dbo.Temperature. Try adding some sample data to your question, and possibly the table schema.
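Building on that, here is a minimal sketch (my assumption about your intent, not a tested fix) of one way to pair each [Close] with the [High] from @offset rows earlier. It replaces the two OFFSET/FETCH subqueries with a single ROW_NUMBER() numbering; because the numbering is partitioned by [CityID], having data for more than two cities is not a problem:
-- Sketch only: assumes @name, @time, @fetch and @offset are declared elsewhere.
-- Number each city's rows from newest to oldest, then pair row n with row n + @offset.
WITH numbered AS (
    SELECT [Close], [High], [CityID], [Time],
           ROW_NUMBER() OVER (PARTITION BY [CityID] ORDER BY [Time] DESC) AS rn
    FROM [dataSQL].[dbo].[temperature]
    WHERE [Time] < @time
)
SELECT c.[Close], h.[High]
FROM numbered AS c
JOIN numbered AS h
    ON h.[CityID] = c.[CityID]
    AND h.rn = c.rn + @offset   -- the offset between the two windows
WHERE c.[CityID] = @name
    AND c.rn <= @fetch;         -- keep only @fetch pairs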

Related

Postgres: Need to match records from two tables based on key value and earliest dates in each table

I'm dealing with a pretty unique record-matching problem in Postgres right now. Essentially I have a table (A) with a lot of records in it, including a key value that I need to match on and the date of the record. Then I have another table (B) that I want to match to the first table on that key value. However, the same key value can appear multiple times in both tables. To get around this I need to match the earliest key value from table A to the earliest key value in table B, the second earliest to the second earliest, and so on... However, if table B runs out of matches for a key value, then I want the remaining table A records to default to the latest match, even though something else already matched on it.
My initial thought is to use something like this on both tables:
ROW_NUMBER() OVER ( PARTITION BY key_value ORDER BY date) AS rank
And then join on the rank and key_value fields. However, I'm not exactly sure how to get that default scenario to work with this method. And if records are added to one table and not the other and I try the join again, I feel like it might get out of sync.
My other thought was to use a cursor, but I'm really struggling to see how I'd implement that.
Any help would be greatly appreciated!
First you need to number all your rows, then find the ones with matching ranks.
After that, match the rows without a match to the latest row in table B:
with cteA as (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY key_value ORDER BY date) AS rank
    FROM tableA
), cteB as (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY key_value ORDER BY date) AS rank
    FROM tableB
), ranked_match as (
    SELECT cteA.key_value, cteA.date, cteA.rank,
           cteB.key_value AS b_key_value, cteB.date AS b_date
    FROM cteA
    LEFT JOIN cteB
        ON cteA.key_value = cteB.key_value
        AND cteA.rank = cteB.rank
), latest_row as (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY key_value ORDER BY date DESC) AS rank
    FROM tableB
)
SELECT key_value, date, rank, b_key_value, b_date
FROM ranked_match
WHERE b_key_value IS NOT NULL
UNION ALL
SELECT rm.key_value, rm.date, rm.rank, lr.key_value, lr.date
FROM ranked_match AS rm
JOIN latest_row AS lr
    ON rm.key_value = lr.key_value
WHERE rm.b_key_value IS NULL
    AND lr.rank = 1

Violation of PRIMARY KEY constraint. Cannot insert duplicate key in object

I have a table with a primary key constraint created like so:
CONSTRAINT [APP_NOTIFICATION_LOG_PK] PRIMARY KEY CLUSTERED
(
[ID] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
I had some records in the table that I have now deleted.
I manually find the next ID to insert like so:
SELECT @maxid_log = MAX(ID) + 1 FROM APP_NOTIFICATION_LOG;
And then I try to insert the record:
INSERT INTO [dbo].[APP_NOTIFICATION_LOG]([ID],[COLOR],[ACTIVE],[FK_SYS_USERS_ID],[FK_APP_NOTIFICATIONS_ID],[MESSAGE],[WIN_USER_CREATOR],[FK_JOBR_RESOURCE_ID])
SELECT -- log notification created
    @maxid_log,
    anc.COLOR,
    1,
    anc.[FK_SYS_USERS_ID],
    an.id,
    'Notification cancelled!',
    @creatorUserId,
    @jobrResourceDbId
FROM [dbo].[APP_NOTIFICATIONS] an
INNER JOIN [dbo].[APP_NOTIFICATION_CONFIG] anc on anc.id = @configId
WHERE an.[FK_JOBR_RESOURCE_ID] = @jobrResourceDbId
At this stage I get the error in the title. It also says that the value 5 is a duplicate. But running a select:
SELECT * FROM APP_NOTIFICATION_LOG WHERE ID = 5
...returns zero records.
What could be the problem here?
Is the SELECT returning more than one record?
Run it by itself and see how many rows come back.
Your INNER JOIN returns more than one row, so you are trying to insert several rows with the same ID.
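If inserting several rows is actually the intent, one hedged sketch of a fix (assuming the same variables and tables as the statement above) is to derive a distinct ID per row with ROW_NUMBER() instead of reusing the single @maxid_log value:
-- Sketch only: gives each inserted row its own ID, starting at @maxid_log.
INSERT INTO [dbo].[APP_NOTIFICATION_LOG]
    ([ID],[COLOR],[ACTIVE],[FK_SYS_USERS_ID],[FK_APP_NOTIFICATIONS_ID],
     [MESSAGE],[WIN_USER_CREATOR],[FK_JOBR_RESOURCE_ID])
SELECT
    @maxid_log - 1 + ROW_NUMBER() OVER (ORDER BY an.id),  -- distinct ID per row
    anc.COLOR,
    1,
    anc.[FK_SYS_USERS_ID],
    an.id,
    'Notification cancelled!',
    @creatorUserId,
    @jobrResourceDbId
FROM [dbo].[APP_NOTIFICATIONS] an
INNER JOIN [dbo].[APP_NOTIFICATION_CONFIG] anc ON anc.id = @configId
WHERE an.[FK_JOBR_RESOURCE_ID] = @jobrResourceDbId;
Longer term, an IDENTITY column or a sequence avoids the race-prone MAX(ID) + 1 pattern entirely.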

Get cartesian product of two columns

How can I get the cartesian product of two columns in one table?
I have this table:
A 1
A 2
B 3
B 4
and I want a new table:
A 1
A 2
A 3
A 4
B 1
B 2
B 3
B 4
fiddle demo
Given your table, try this using a self-join:
select distinct b.let, a.id
from [dbo].[cartesian] a
join [dbo].[cartesian] b on a.id <> b.id
which returns the eight rows shown above.
Create this table:
CREATE TABLE [dbo].[Table_1]
(
[A] [int] NOT NULL ,
[B] [nvarchar](50) NULL ,
CONSTRAINT [PK_Table_1] PRIMARY KEY CLUSTERED ( [A] ASC )
WITH ( PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON ) ON [PRIMARY]
)
ON [PRIMARY]
Fill the table like this:
INSERT INTO [dbo].[Table_1]
VALUES ( 1, 'A' )
INSERT INTO [dbo].[Table_1]
VALUES ( 2, 'A' )
INSERT INTO [dbo].[Table_1]
VALUES ( 3, 'B' )
INSERT INTO [dbo].[Table_1]
VALUES ( 4, 'C' )
SELECT *
FROM [dbo].[Table_1]
Use this query:
SELECT DISTINCT
T1.B ,
T2.A
FROM dbo.Table_1 AS T1 ,
dbo.Table_1 AS T2
ORDER BY T1.B
To clarify loup's answer (in more detail than is allowable in a comment), any join with no relevant criteria specified will naturally produce a Cartesian product (which is why a glib answer to your question might be "all too easily" -- mistakenly writing t1 INNER JOIN t2 ON t1.Key = t1.Key will produce the same result).
However, SQL Server does offer an explicit option to make your intentions known. The CROSS JOIN is essentially what you're looking for. But just as an INNER JOIN devolves to a Cartesian product without a useful join condition, a CROSS JOIN devolves to a simple inner join if you go out of your way to add join criteria in the WHERE clause.
If this is a one-off operation, it probably doesn't matter which you use. But if you want to make your intent clear for posterity, consider CROSS JOIN instead.
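For the sample data above, a minimal CROSS JOIN version might look like this (assuming the [dbo].[cartesian] table from the first answer, with columns let and id):
-- Pair every distinct letter with every distinct id.
SELECT DISTINCT a.let, b.id
FROM [dbo].[cartesian] AS a
CROSS JOIN [dbo].[cartesian] AS b
ORDER BY a.let, b.id;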

Choosing the first child record in a self-join in TSQL

I've got a visits table that looks like this:
id identity(1,1) not null,
visit_date datetime not null,
patient_id int not null,
flag bit not null
For each record, I need to find a matching record that is at the same time or earlier, has the same patient_id, and has flag set to 1. What I am doing now is:
select parent.id as parent_id,
(
select top 1
child.id as child_id
from
visits as child
where
child.visit_date <= parent.visit_date
and child.patient_id = parent.patient_id
and child.flag = 1
order by
visit_date desc
) as child_id
from
visits as parent
So, this query works correctly, except that it runs too slowly -- I suspect that this is because of the subquery. Is it possible to rewrite it as a joined query?
View the query execution plan. Where you have thick arrows, look at those statements. You should learn what the different operators imply, like Clustered Index Scan, Clustered Index Seek, etc.
Usually when a query is going slow, however, I find that there are no good indexes.
Look at the tables and columns used to join and filter, and create an index that covers all of those columns. This is usually called a covering index in the forums. It's something you can do for a query that really needs it, but keep in mind that too many indexes will slow down insert statements.
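As a sketch (the index name and exact column order are assumptions, not measured advice), a covering index for the subquery in the question might put the equality predicates (patient_id, flag) first, the range/sort column (visit_date) last, and use INCLUDE (id) so the index alone can satisfy the SELECT:
CREATE NONCLUSTERED INDEX IX_visits_patient_flag_date
    ON dbo.visits (patient_id, flag, visit_date DESC)
    INCLUDE (id);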
/*
id identity(1,1) not null,
visit_date datetime not null,
patient_id int not null,
flag bit not null
*/
SELECT
    T.parentId,
    T.patientId,
    V.id AS childId
FROM
(
    SELECT
        visits.id AS parentId,
        visits.patient_id AS patientId,
        MAX(previousVisit.visit_date) AS previousVisitDate
    FROM
        visits
        LEFT JOIN visits AS previousVisit ON
            visits.patient_id = previousVisit.patient_id
            AND visits.visit_date >= previousVisit.visit_date
            AND visits.id <> previousVisit.id
            AND previousVisit.flag = 1
    GROUP BY
        visits.id,
        visits.visit_date,
        visits.patient_id,
        visits.flag
) AS T
LEFT JOIN visits AS V ON
    T.patientId = V.patient_id
    AND T.previousVisitDate = V.visit_date
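For comparison, here is a hedged sketch of another common rewrite (not from either answer above): OUTER APPLY, available since SQL Server 2005, expresses the "top 1 matching row per row" pattern directly and pairs well with the covering index suggested earlier:
-- One probe into visits per parent row; child_id is NULL when no match exists.
SELECT
    parent.id AS parent_id,
    child.id AS child_id
FROM visits AS parent
OUTER APPLY
(
    SELECT TOP (1) c.id
    FROM visits AS c
    WHERE c.patient_id = parent.patient_id
        AND c.visit_date <= parent.visit_date
        AND c.flag = 1
    ORDER BY c.visit_date DESC
) AS child;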

Random order by performance

What is the best way to get the top n rows in random order?
I use a query like:
Select top(10) field1,field2 .. fieldn
from Table1
order by checksum(newid())
The problem with the above query is that it keeps getting slower as the table size increases: it always does a full clustered index scan to find the top(10) rows in random order.
Is there a better way to do it?
I have tested this and got better performance by changing the query.
The DDL for the table I used in my tests.
CREATE TABLE [dbo].[TestTable]
(
[ID] [int] IDENTITY(1,1) NOT NULL,
[Col1] [nvarchar](100) NOT NULL,
[Col2] [nvarchar](38) NOT NULL,
[Col3] [datetime] NULL,
[Col4] [nvarchar](50) NULL,
[Col5] [int] NULL,
CONSTRAINT [PK_TestTable] PRIMARY KEY CLUSTERED
(
[ID] ASC
)
)
GO
CREATE NONCLUSTERED INDEX [IX_TestTable_Col5] ON [dbo].[TestTable]
(
[Col5] ASC
)
The table has 722888 rows.
First query:
select top 10
    T.ID,
    T.Col1,
    T.Col2,
    T.Col3,
    T.Col4,
    T.Col5
from TestTable as T
order by newid()
Statistics for first query:
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 13 ms.
(10 row(s) affected)
Table 'TestTable'. Scan count 1, logical reads 12492, physical reads 14, read-ahead reads 6437, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 859 ms, elapsed time = 1700 ms.
Execution plan first query:
Second query:
select
    T.ID,
    T.Col1,
    T.Col2,
    T.Col3,
    T.Col4,
    T.Col5
from TestTable as T
inner join (select top 10 ID
            from TestTable
            order by newid()) as C
    on T.ID = C.ID
Statistics for second query:
SQL Server parse and compile time:
CPU time = 125 ms, elapsed time = 183 ms.
(10 row(s) affected)
Table 'TestTable'. Scan count 1, logical reads 1291, physical reads 10, read-ahead reads 399, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 516 ms, elapsed time = 706 ms.
Execution plan second query:
Summary:
The second query uses the nonclustered index on Col5 to order the rows by newid() and pick the 10 random IDs, and then does a Clustered Index Seek 10 times to fetch the values for the output.
The performance gain is there because the index on Col5 is narrower than the clustered key and that causes fewer reads.
Thanks to Martin Smith for pointing that out.
One way to reduce the size of the scan is to combine TABLESAMPLE with ORDER BY newid(), selecting a random number of rows from a selection of pages in the table rather than scanning the entire table.
The idea is to calculate the average number of rows per page, then use TABLESAMPLE to select one random page of data for each row you want to output. You then run the ORDER BY newid() query on just that subset of data. This approach is slightly less random than the original approach, but it is much better than just using TABLESAMPLE alone and involves reading much less data off the table.
Unfortunately, the TABLESAMPLE clause does not accept a variable, so dynamic SQL is necessary in order to use a dynamic rows value based on the record size of the input table.
declare @factor int
select @factor = 8000 / avg_record_size_in_bytes
from sys.dm_db_index_physical_stats(db_id(), object_id('sample'), null, null, 'detailed')
where index_level = 0
declare @numRows int = 10
declare @sampledRows int = @factor * @numRows
declare @stmt nvarchar(max) = N'select top (@numRows) * from sample tablesample (' + convert(varchar(32), @sampledRows) + N' rows) order by checksum(newid())'
exec sp_executesql @stmt, N'@numRows int', @numRows
This question is 7 years old and had no accepted answer, but it ranked high when I searched for SQL performance on selecting random rows. None of the current answers seems to give a simple, quick solution in the case of large tables, so I want to add my suggestion.
Assumptions:
The primary key is a numeric data type (typical int / +1 per row)
The primary key is the clustered index
The table has many rows, and only a few should be selected
I think this is fairly common, so it would help in a lot of cases.
Given a typical set of data, my suggestion is to:
Find the max and min IDs
Pick a random number in that range
Check if the number is a valid ID in the table
Repeat as needed
These operations should all be very fast as they all hit the clustered index. Only at the end is the rest of the data read, by selecting a set based on the collected list of primary keys, so we only pull in the data we actually need.
Example (MS SQL):
--
-- First, create a table with some dummy data to select from
--
DROP TABLE IF EXISTS MainTable
CREATE TABLE MainTable(
Id int IDENTITY(1,1) NOT NULL,
[Name] nvarchar(50) NULL,
[Content] text NULL
)
GO
DECLARE @I INT = 0
WHILE @I < 40
BEGIN
INSERT INTO MainTable VALUES('Foo', 'bar')
SET @I = @I + 1
END
UPDATE MainTable SET [Name] = [Name] + CAST(Id as nvarchar(50))
-- Create a gap in IDs at the end
DELETE FROM MainTable
WHERE ID < 10
-- Create a gap in IDs in the middle
DELETE FROM MainTable
WHERE ID >= 20 AND ID < 30
-- We now have our "source" data we want to select random rows from
--
-- Then we select random data from our table
--
-- Get the interval of values to pick random values from
DECLARE @MaxId int
SELECT @MaxId = MAX(Id) FROM MainTable
DECLARE @MinId int
SELECT @MinId = MIN(Id) FROM MainTable
DECLARE @RandomId int
DECLARE @NumberOfIdsTofind int = 10
-- Make temp table to insert ids into
DROP TABLE IF EXISTS #Ids
CREATE TABLE #Ids (Id int)
WHILE (@NumberOfIdsTofind > 0)
BEGIN
SET @RandomId = ROUND(((@MaxId - @MinId - 1) * RAND() + @MinId), 0)
-- Verify that the random ID is a real id in the main table
IF EXISTS (SELECT Id FROM MainTable WHERE Id = @RandomId)
BEGIN
-- Verify that the random ID has not already been inserted
IF NOT EXISTS (SELECT Id FROM #Ids WHERE Id = @RandomId)
BEGIN
-- It's a valid, new ID, add it to the list.
INSERT INTO #Ids VALUES (@RandomId)
SET @NumberOfIdsTofind = @NumberOfIdsTofind - 1;
END
END
END
-- Select the random rows of data by joining the main table with our random Ids
SELECT MainTable.* FROM MainTable
INNER JOIN #Ids ON #Ids.Id = MainTable.Id
No, there is no way to improve the performance here. Since you want the rows in "random" order, indexes will be useless. You could, however, try ordering by newid() itself instead of its checksum, but that is just an optimization of the random ordering, not of the sorting itself.
The server has no way of knowing that you want a random selection of 10 rows from the table. The query has to evaluate the ORDER BY expression for every row in the table, since it is a computed value that cannot be derived from index keys. This is why you're seeing a full clustered index scan.