Random ORDER BY performance - T-SQL

What is the best way to get top n rows by random order?
I use a query like:
select top(10) field1, field2, .. fieldn
from Table1
order by checksum(newid())
The problem with the above query is that it keeps getting slower as the table grows: it always does a full clustered index scan to find the top(10) rows in random order.
Is there any other better way to do it?

I have tested this and got better performance by changing the query.
Here is the DDL for the table I used in my tests:
CREATE TABLE [dbo].[TestTable]
(
[ID] [int] IDENTITY(1,1) NOT NULL,
[Col1] [nvarchar](100) NOT NULL,
[Col2] [nvarchar](38) NOT NULL,
[Col3] [datetime] NULL,
[Col4] [nvarchar](50) NULL,
[Col5] [int] NULL,
CONSTRAINT [PK_TestTable] PRIMARY KEY CLUSTERED
(
[ID] ASC
)
)
GO
CREATE NONCLUSTERED INDEX [IX_TestTable_Col5] ON [dbo].[TestTable]
(
[Col5] ASC
)
The table has 722888 rows.
First query:
select top 10
T.ID,
T.Col1,
T.Col2,
T.Col3,
T.Col4,
T.Col5
from TestTable as T
order by newid()
Statistics for first query:
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 13 ms.
(10 row(s) affected)
Table 'TestTable'. Scan count 1, logical reads 12492, physical reads 14, read-ahead reads 6437, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 859 ms, elapsed time = 1700 ms.
Execution plan for the first query: (image not included)
Second query:
select
T.ID,
T.Col1,
T.Col2,
T.Col3,
T.Col4,
T.Col5
from TestTable as T
inner join (select top 10 ID
from TestTable
order by newid()) as C
on T.ID = C.ID
Statistics for second query:
SQL Server parse and compile time:
CPU time = 125 ms, elapsed time = 183 ms.
(10 row(s) affected)
Table 'TestTable'. Scan count 1, logical reads 1291, physical reads 10, read-ahead reads 399, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 516 ms, elapsed time = 706 ms.
Execution plan for the second query: (image not included)
Summary:
The second query scans the nonclustered index on Col5 to produce the newid() ordering, then performs ten Clustered Index Seeks to fetch the columns for the output.
The performance gain is there because the index on Col5 is narrower than the clustered index, so ordering all the rows requires far fewer reads.
Thanks to Martin Smith for pointing that out.
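Note that any nonclustered index narrower than the clustered index should give a similar gain, since every nonclustered index implicitly carries the clustering key. If no suitable index existed, a minimal one could be created just for this purpose (a sketch; the index name is illustrative):
CREATE NONCLUSTERED INDEX IX_TestTable_Narrow
ON dbo.TestTable (ID)
GO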

One way to reduce the size of the required scan is to combine TABLESAMPLE with ORDER BY newid(), selecting random rows from a sample of the table's pages rather than scanning the entire table.
The idea is to calculate the average number of rows per page, then use TABLESAMPLE to select one random page of data for each row you want to output, and run the ORDER BY newid() query on just that subset of data. This approach is slightly less random than the original approach, but it is much better than using TABLESAMPLE alone and it reads much less data from the table.
Unfortunately, the TABLESAMPLE clause does not accept a variable, so dynamic SQL is necessary to build a rows value based on the record size of the input table.
declare @factor int
select @factor = 8000 / avg_record_size_in_bytes
from sys.dm_db_index_physical_stats(db_id(), object_id('sample'), null, null, 'detailed')
where index_level = 0
declare @numRows int = 10
declare @sampledRows int = @factor * @numRows
declare @stmt nvarchar(max) = N'select top (@numRows) * from sample tablesample (' + convert(varchar(32), @sampledRows) + ' rows) order by checksum(newid())'
exec sp_executesql @stmt, N'@numRows int', @numRows
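If the average row size is already known, the same idea can be written without dynamic SQL. A minimal sketch, assuming roughly 100 rows per page (so 10 pages sampled for a top-10 pick) and the same table name as above:
select top (10) *
from sample tablesample (1000 rows)
order by checksum(newid())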

This question is seven years old and has no accepted answer, yet it ranked high when I searched for SQL performance on selecting random rows. None of the current answers seems to give a simple, quick solution for large tables, so I want to add my suggestion.
Assumptions:
The primary key is a numeric data type (typical int / +1 per row)
The primary key is the clustered index
The table has many rows, and only a few should be selected
I think this is fairly common, so it would help in a lot of cases.
Given a typical set of data, my suggestion is to:
Find the max and min id
Pick a random number in that range
Check whether the number is a valid id in the table
Repeat as needed
These operations should all be very fast because they all work on the clustered index. Only at the end is the rest of the data read, by selecting a set based on a list of primary keys, so we only pull in the data we actually need.
Example (MS SQL):
--
-- First, create a table with some dummy data to select from
--
DROP TABLE IF EXISTS MainTable
CREATE TABLE MainTable(
Id int IDENTITY(1,1) NOT NULL,
[Name] nvarchar(50) NULL,
[Content] text NULL
)
GO
DECLARE @I INT = 0
WHILE @I < 40
BEGIN
INSERT INTO MainTable VALUES('Foo', 'bar')
SET @I = @I + 1
END
UPDATE MainTable SET [Name] = [Name] + CAST(Id as nvarchar(50))
-- Create a gap in IDs at the start
DELETE FROM MainTable
WHERE ID < 10
-- Create a gap in IDs in the middle
DELETE FROM MainTable
WHERE ID >= 20 AND ID < 30
-- We now have our "source" data we want to select random rows from
--
-- Then we select random data from our table
--
-- Get the interval of values to pick random values from
DECLARE @MaxId int
SELECT @MaxId = MAX(Id) FROM MainTable
DECLARE @MinId int
SELECT @MinId = MIN(Id) FROM MainTable
DECLARE @RandomId int
DECLARE @NumberOfIdsTofind int = 10
-- Make a temp table to collect the ids
DROP TABLE IF EXISTS #Ids
CREATE TABLE #Ids (Id int)
WHILE (@NumberOfIdsTofind > 0)
BEGIN
SET @RandomId = ROUND(((@MaxId - @MinId - 1) * RAND() + @MinId), 0)
-- Verify that the random ID is a real id in the main table
IF EXISTS (SELECT Id FROM MainTable WHERE Id = @RandomId)
BEGIN
-- Verify that the random ID has not already been inserted
IF NOT EXISTS (SELECT Id FROM #Ids WHERE Id = @RandomId)
BEGIN
-- It's a valid, new ID, add it to the list.
INSERT INTO #Ids VALUES (@RandomId)
SET @NumberOfIdsTofind = @NumberOfIdsTofind - 1;
END
END
END
-- Select the random rows of data by joining the main table with our random Ids
SELECT MainTable.* FROM MainTable
INNER JOIN #Ids ON #Ids.Id = MainTable.Id

No, there is no way to improve the performance here. Since you want the rows in "random" order, indexes are useless. You could, however, try ordering by newid() directly instead of by its checksum, but that only changes how the random values are generated, not the cost of the sort itself.
The server has no way of knowing that you want a random selection of 10 rows from the table. It has to evaluate the ORDER BY expression for every row, since the value is computed and cannot be derived from any index. This is why you are seeing a full clustered index scan.

Related

postgresql: How to grab an existing id at random from a table whose ids are not consecutive

Postgresql version 9.4
I have a table with an integer column whose values have gaps, as in the sample below. I'm trying to get an existing id from the column at random with the following query, but it occasionally returns NULL:
CREATE TABLE
IF NOT EXISTS test_tbl(
id INTEGER);
INSERT INTO test_tbl
VALUES (10),
(13),
(14),
(16),
(18),
(20);
-------------------------------
SELECT * FROM test_tbl;
-------------------------------
SELECT COALESCE(tmp.id, 20) AS classification_id
FROM (
SELECT tt.id,
row_number() over(
ORDER BY tt.id) AS row_num
FROM test_tbl tt
) tmp
WHERE tmp.row_num = floor(random() * 10);
Please let me know what I'm doing wrong.
but it returns NULL occasionally
and I must add that it sometimes returns more than one row, right?
In your sample data there are 6 rows, so the column row_num will have a value from 1 to 6.
This:
floor(random() * 10)
yields a number from 0 to 9, because random() returns a value from 0 up to 0.9999..., so it can miss the 1-to-6 range entirely.
You should use:
floor(random() * 6 + 1)::int
to get a random integer from 1 to 6.
But this alone would not solve the problem, because random() is evaluated anew for each row the WHERE clause checks, so row_num may never match the generated number (returning nothing) or may match more than once (returning several rows).
The proper (although sometimes not the most efficient) way to get a random row is:
SELECT id FROM test_tbl ORDER BY random() LIMIT 1
Also check other links from SO, like:
quick random row selection in Postgres
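That link also shows how to avoid sorting the whole table by computing a random offset once. A sketch, assuming the test_tbl above (counting the rows still requires a scan, but no sort is needed):
SELECT id
FROM test_tbl
OFFSET floor(random() * (SELECT count(*) FROM test_tbl))::int
LIMIT 1;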
You could select one row and order by random(); this way you are assured of hitting an existing row:
select id
from test_tbl
order by random()
LIMIT 1;

How to write a SQL query that selects rows where a column value changed from the previous row

CREATE TABLE data (
id serial NOT NULL,
val integer,
plan smallint,
time timestamp without time zone,
CONSTRAINT data_pkey PRIMARY KEY (id))
WITH (OIDS=FALSE);
ALTER TABLE data
OWNER TO postgres;
Index: data_idx
CREATE INDEX data_idx
ON data
USING btree
(time, id);
I have a table like this
id val plan time
1 8300 1 2011-01-01
2 8300 1 2011-01-02
3 8300 2 2011-01-03
4 9600 1 2011-01-04
5 9600 2 2011-01-05
How do I select the rows where plan changed from the previous row for the same val?
In the example above, the query should return the rows for
2011-01-03 (plan changed from 1 to 2 between 2011-01-02 and 2011-01-03 for val 8300) and
2011-01-05 (plan changed from 1 to 2 between 2011-01-04 and 2011-01-05 for val 9600).
The table contains a lot of data, so the query should be optimized.
What I have tried:
SELECT val, plan, MAX(time) FROM data
GROUP BY val, plan
HAVING COUNT(1) > 1 AND MAX(time) > 'XXXXX' AND MAX(time) < 'XXXXX'
The annoying part is figuring out the id of the previous row with the same val. After that it is pretty easy, joining the table with itself:
SELECT t1.* FROM data t1, data t2
WHERE t1.val = t2.val
AND t1.plan != t2.plan
AND t2.id = (SELECT MAX(t3.id) FROM data t3 WHERE t3.id < t1.id AND t3.val = t1.val)
If the table is moderately (not extremely) large, I would consider doing this in application code instead, or storing the change flag in its own column when writing a new row (a sketch of that follows); a correlated subquery for each row of the table performs very poorly.
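A sketch of the write-time variant, with a hypothetical plan_changed column maintained by a trigger (assumes the data table from the question):
ALTER TABLE data ADD COLUMN plan_changed boolean NOT NULL DEFAULT false;
CREATE OR REPLACE FUNCTION set_plan_changed() RETURNS trigger AS $$
DECLARE
prev_plan smallint;
BEGIN
-- Find the plan of the most recent earlier row for the same val
SELECT d.plan INTO prev_plan
FROM data d
WHERE d.val = NEW.val AND d.id < NEW.id
ORDER BY d.id DESC
LIMIT 1;
NEW.plan_changed := prev_plan IS NOT NULL AND prev_plan <> NEW.plan;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER data_plan_changed
BEFORE INSERT ON data
FOR EACH ROW EXECUTE PROCEDURE set_plan_changed();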
This version doesn't have a sub-query, but it does assume that the ids are consecutive:
SELECT t1.*
FROM data AS t1, data AS t2
WHERE
t1.val = t2.val
AND t1.plan <> t2.plan
AND t1.id - t2.id = 1
ORDER BY
t1.time
When comparing with previous rows, the LAG window function does the job for you:
SELECT sub.*
FROM (
SELECT
plan AS curr_plan,
LAG(plan) OVER (PARTITION BY val ORDER BY time) AS prev_plan,
val,
time
FROM data
) sub
WHERE
sub.prev_plan IS NOT NULL AND sub.prev_plan <> sub.curr_plan;
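Since the question stresses table size, an index matching the window's partitioning and ordering may let the planner avoid an explicit sort (a sketch; the index name is illustrative):
CREATE INDEX data_val_time_idx ON data (val, time);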

PostgreSQL: Statistics on partial index?

PostgreSQL version: 9.3.13
Consider the following tables, index and data:
CREATE TABLE orders (
order_id bigint,
status smallint,
owner int,
CONSTRAINT orders_pkey PRIMARY KEY (order_id)
)
CREATE INDEX owner_index ON orders
USING btree
(owner) WHERE status > 0;
CREATE TABLE orders_appendix (
order_id bigint,
note text
)
Data
orders:
(IDs, 0, 1337) * 1000000 rows
(IDs, 10, 1337) * 1000 rows
(IDs, 10, 777) * 1000 rows
orders_appendix:
one row for each order
My problem is:
select * from orders where owner=1337 and status>0
The query planner estimates the number of rows to be 1000000, but the actual number of rows is 1000.
In the following, more complicated query:
SELECT note FROM orders JOIN orders_appendix using (order_id)
WHERE owner=1337 AND status>0
Instead of using an inner join (which is preferable for a small number of rows), it picks a bitmap join plus a full table scan of orders_appendix, which is very slow.
If the condition is "owner=777", it chooses the preferable inner join instead.
I believe this is because of the statistics, as AFAIK Postgres can only collect and consider stats for each column independently.
However, if I...
CREATE INDEX owner_abs ON orders (abs(owner)) where status>0;
Now, a slightly changed query...
SELECT note FROM orders JOIN orders_appendix using (order_id)
WHERE abs(owner)=1337 AND status>0
...results in the inner join that I wanted.
Is there a better solution? Perhaps "statistics on partial index"?
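One note on the expression-index trick above: ANALYZE is what gathers statistics on indexed expressions, so the improved estimates only appear after the table has been analyzed (a sketch):
ANALYZE orders;
On PostgreSQL 10 and later (not available on 9.3), extended statistics are a more direct tool for cross-column correlation, e.g. CREATE STATISTICS orders_owner_status (dependencies) ON owner, status FROM orders; followed by ANALYZE. Note, though, that functional-dependency statistics only inform equality clauses, so they may not cover a status > 0 predicate.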

How do I replace an SSN with a 9-digit random number in SQL Server 2008 R2?

To satisfy security requirements, I need to find a way to replace SSNs with unique, random 9-digit numbers before providing said database to a developer. The SSN is a column in a table of a database. There may be tens of thousands of rows in that table. The numbers do not need hyphens. I am a beginner with SQL and programming in general.
I have been unable to find a solution for my specific needs; nothing seems quite right. But if you know of a thread that I have missed, please let me know.
Thanks for any help!
Here is one way.
I'm assuming that you already have a backup of the real data as this update is not reversible.
Below I've assumed your table name is Person with your ssn column named SSN.
UPDATE Person SET
SSN = CAST(LEFT(CAST(ABS(CAST(CAST(NEWID() as BINARY(10)) as int)) as varchar(max)) + '00000000',9) as int)
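Since the values come from NEWID(), collisions are unlikely but possible, so a quick check afterwards may be worthwhile (a sketch, using the same Person/SSN names as above):
SELECT SSN, COUNT(*) FROM Person GROUP BY SSN HAVING COUNT(*) > 1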
If they do not have to be random, you could just replace them with ascending numeric values. Failing that, you'd have to generate a random number. As you may have discovered, the RAND function will only generate a single value per query statement (select, update, etc.); the work-around is the newid() function, which generates a GUID for each row produced by a query (run SELECT newid() FROM MyTable to see how this works). Wrap this in checksum() to generate an integer; take it modulus 1,000,000,000 to get a value within the SSN range (0 to 999,999,999); and, assuming you're storing it as a char(9), prefix it with leading zeros.
The next trick is ensuring it's unique for all values in your table. This gets tricky, and I'd do it by setting up a temp table with the values, populating it, then copying them over. Lessee now...
DECLARE @DummySSN as table
(
PrimaryKey int not null
,NewSSN char(9) not null
)
-- Load initial values
INSERT @DummySSN
select
UserId
,right('000000000' + cast(abs(checksum(newid()))%1000000000 as varchar(9)), 9)
from Users
-- Check for dups
select NewSSN from @DummySSN group by NewSSN having count(*) > 1
-- Loop until values are unique
WHILE exists (SELECT 1 from @DummySSN group by NewSSN having count(*) > 1)
UPDATE @DummySSN
set NewSSN = right('000000000' + cast(abs(checksum(newid()))%1000000000 as varchar(9)), 9)
where NewSSN in (select NewSSN from @DummySSN group by NewSSN having count(*) > 1)
-- Check for dups
select NewSSN from @DummySSN group by NewSSN having count(*) > 1
This works for a small table I have, and it should work for a large one. I don't see this turning into an infinite loop, but even so you might want to add a check to exit the loop after, say, 10 iterations.
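A sketch of such a guard around the de-duplication step (the @Attempts variable is illustrative):
DECLARE @Attempts int = 0
WHILE exists (SELECT 1 from @DummySSN group by NewSSN having count(*) > 1)
AND @Attempts < 10
BEGIN
UPDATE @DummySSN
set NewSSN = right('000000000' + cast(abs(checksum(newid()))%1000000000 as varchar(9)), 9)
where NewSSN in (select NewSSN from @DummySSN group by NewSSN having count(*) > 1)
SET @Attempts = @Attempts + 1
END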
I've run a couple million tests of this and it seems to generate random 9-digit numbers (no leading zeros).
I cannot think of a more efficient way to do this.
SELECT CAST(FLOOR(RAND(CHECKSUM(NEWID())) * 900000000 ) + 100000000 AS BIGINT)
The test used;
;WITH Fn(N) AS
(
SELECT CAST(FLOOR(RAND(CHECKSUM(NEWID())) * 900000000 ) + 100000000 AS BIGINT)
UNION ALL
SELECT CAST(FLOOR(RAND(CHECKSUM(NEWID())) * 900000000 ) + 100000000 AS BIGINT)
FROM Fn
)
,Tester AS
(
SELECT TOP 5000000 *
FROM Fn
)
SELECT LEN(MIN(N))
,LEN(MAX(N))
,MIN(N)
,MAX(N)
FROM Tester
OPTION (MAXRECURSION 0)
Not as fast, but easiest... I added some dots...
DECLARE #tr NVARCHAR(40)
SET #tr = CAST(ROUND((888*RAND()+111),0) AS CHAR(3)) + '.' +
CAST(ROUND((8888*RAND()+1111),0) AS CHAR(4)) + '.' + CAST(ROUND((8888*RAND()+1111),0) AS
CHAR(4)) + '.' + CAST(ROUND((88*RAND()+11),0) AS CHAR(2))
PRINT #tr
If the requirement is to obfuscate a database, this will return the same value for each distinct SSN in any table, preserving referential integrity in the output without having to do a lookup and translate:
SELECT CAST(RAND(SSN)*999999999 AS INT)
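For example, applied to the hypothetical Person table used earlier (a sketch; RAND(seed) is deterministic but not collision-free, so uniqueness should still be verified afterwards):
UPDATE Person SET
SSN = CAST(RAND(SSN) * 999999999 AS INT)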

Choosing the first child record in a selfjoin in TSQL

I've got a visits table that looks like this:
id identity(1,1) not null,
visit_date datetime not null,
patient_id int not null,
flag bit not null
For each record, I need to find a matching record that is same time or earlier, has the same patient_id, and has flag set to 1. What I am doing now is:
select parent.id as parent_id,
(
select top 1
child.id as child_id
from
visits as child
where
child.visit_date <= parent.visit_date
and child.patient_id = parent.patient_id
and child.flag = 1
order by
visit_date desc
) as child_id
from
visits as parent
So, this query works correctly, except that it runs too slowly; I suspect the correlated subquery is to blame. Is it possible to rewrite it as a joined query?
View the query execution plan. Where you see thick arrows, look at those operators; it is worth learning what the different operators imply, such as Clustered Index Scan versus Seek.
Usually, however, when a query is slow I find that there are no good indexes.
Look at the tables and columns involved in the join and create an index that covers all of them; this is usually called a covering index in the forums. It's worth doing for a query that really needs it, but keep in mind that too many indexes will slow down insert statements. For this query, a candidate index is sketched below.
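A sketch of a candidate index for the query in the question (the name is illustrative; the filter keeps only flag = 1 rows, and the key order serves the patient_id equality plus the visit_date sort):
CREATE NONCLUSTERED INDEX IX_visits_patient_flagged
ON visits (patient_id, visit_date DESC)
WHERE flag = 1;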
/*
id identity(1,1) not null,
visit_date datetime not null,
patient_id int not null,
flag bit not null
*/
SELECT
T.parentId,
T.patientId,
V.id AS childId
FROM
(
SELECT
visits.id AS parentId,
visits.patient_id AS patientId,
MAX(previousVisit.visit_date) AS previousVisitDate
FROM
visits
LEFT JOIN visits previousVisit ON
visits.patient_id = previousVisit.patient_id
AND visits.visit_date >= previousVisit.visit_date
AND visits.id <> previousVisit.id
AND previousVisit.flag = 1
GROUP BY
visits.id,
visits.visit_date,
visits.patient_id,
visits.flag
) AS T
LEFT JOIN visits V ON
T.patientId = V.patient_id
AND T.previousVisitDate = V.visit_date
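An alternative worth trying on SQL Server 2005 and later is OUTER APPLY, which keeps the TOP 1 logic of the original query but often gets a better plan when a supporting index exists (a sketch, using the visits table from the question):
SELECT parent.id AS parent_id, child.id AS child_id
FROM visits AS parent
OUTER APPLY (
SELECT TOP 1 c.id
FROM visits AS c
WHERE c.visit_date <= parent.visit_date
AND c.patient_id = parent.patient_id
AND c.flag = 1
ORDER BY c.visit_date DESC
) AS child;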