I replaced a large disk-based table with a memory-optimized table.
For some queries I get good response times; for others the query effectively dies.
The table has a composite primary key.
The only way I can get it to use the primary key is to seek on a specific row (supplying the entire PK).
It will NOT use the PK for sorting, or when only one component of the composite key is supplied.
I sized the hash buckets from the existing data.
I used the syntax from these links for the composite primary key:
Hekaton: Composite Primary Key in create table statement
CREATE TABLE (SQL Server)
CREATE TABLE [dbo].[FTSindex]
(
[sID] [int] NOT NULL,
[wordPos] [int] NOT NULL,
[wordID] [int] NOT NULL,
[charPos] [int] NOT NULL,
INDEX [ix_wordID_MO_2] NONCLUSTERED HASH
(
[wordID]
)WITH ( BUCKET_COUNT = 524288),
CONSTRAINT [pk_FTSindexMO_2] PRIMARY KEY NONCLUSTERED HASH
(
[sID],
[wordPos]
)WITH ( BUCKET_COUNT = 268435456)
)WITH ( MEMORY_OPTIMIZED = ON , DURABILITY = SCHEMA_ONLY )
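For reference, the BUCKET_COUNT values above were sized from the existing data; here is a sketch (not from the original post) of the kind of counting query that sizing implies, run against the original disk-based copy of the data. The usual guidance is a bucket count of roughly 1-2x the number of distinct key values, which SQL Server rounds up to a power of two.
-- sizes ix_wordID_MO_2
SELECT COUNT(DISTINCT wordID) AS distinct_wordIDs FROM dbo.FTSindex;
-- sizes pk_FTSindexMO_2 (distinct composite key values)
SELECT COUNT(*) AS distinct_pk_values
FROM (SELECT DISTINCT sID, wordPos FROM dbo.FTSindex) AS k;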
select top 10 * from [FTSindex] where [sID] = 100
-- runs in 0 seconds
-- Index Seek on ix_wordID_MO_2
-- it is NOT using the PRIMARY KEY pk_FTSindexMO_2
select top 10 * from [FTSindex] where [wordPos] = 100
-- never finishes (I only waited 10 minutes)
-- will not even display an execution plan
select top 10 * from [FTSindex] where [sID] = 100 and [wordPos] < 1000
-- never finishes (I only waited 10 minutes)
-- will not even display an execution plan
select top 10 * from [FTSindex] order by [sID]
-- never finishes (I only waited 10 minutes)
-- query plan is Table Scan
select top 10 * from [FTSindex] order by [sID], [wordPos]
-- never finishes (I only waited 10 minutes)
-- will not even display an execution plan
select top 10 * from [FTSindex] where [wordID] = 100 and [sID] = 856515
-- runs in 0 seconds
-- Index Seek on ix_wordID_MO_2
select top 10 * from [FTSindex] where [wordID] = 100 and [sID] = 856515 and [wordPos] < 1000
-- never finishes (I only waited 10 minutes)
-- will not even display an execution plan
select * from [FTSindex] where [sID] = 100
-- 45 seconds to return 1500 rows
-- table scan
select * from [FTSindex] where [sID] = 100 and [wordPos] = 1133
-- runs in 0 seconds
-- this uses the pk_FTSindexMO_2
-- this is the only way I could get it to use the primary key
Note: against the original (non-memory-optimized) table, all of these queries run in 0 seconds.
I don't just mean each one individually; ALL of them run in 0 seconds.
I think this link summarizes my problem:
Troubleshooting Common Performance Problems with Memory-Optimized Hash Indexes
Not using HASH for the primary key seems to have fixed it:
CREATE TABLE [dbo].[FTSindex]
(
[sID] [int] NOT NULL,
[wordPos] [int] NOT NULL,
[wordID] [int] NOT NULL,
[charPos] [int] NOT NULL,
INDEX [ix_wordID_MO_2] NONCLUSTERED HASH
(
[wordID]
)WITH ( BUCKET_COUNT = 524288),
CONSTRAINT [pk_FTSindexMO_2] PRIMARY KEY NONCLUSTERED
(
[sID] ASC,
[wordPos] ASC
)
)WITH ( MEMORY_OPTIMIZED = ON , DURABILITY = SCHEMA_ONLY )
Note: in the end I went back to the old disk-based tables.
In the actual queries used by the application, the memory-optimized table was slower.
The memory-optimized table did load faster, but this table is written once and read many times.
Hash indexes are not usable for range scans. Range indexes are.
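To make that concrete, here is an illustrative sketch (not from the original post) of which predicates each index type on (sID, wordPos) can serve:
-- Hash index: only an equality seek on the FULL key can use it
select * from [FTSindex] where [sID] = 100 and [wordPos] = 1133
-- Range (NONCLUSTERED) index: supports leading-column equality, inequalities,
-- and ordered scans in the declared (ascending) key order
select * from [FTSindex] where [sID] = 100
select * from [FTSindex] where [sID] = 100 and [wordPos] < 1000
select top 10 * from [FTSindex] order by [sID], [wordPos]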
PostgreSQL version: 9.3.13
Consider the following tables, index and data:
CREATE TABLE orders (
order_id bigint,
status smallint,
owner int,
CONSTRAINT orders_pkey PRIMARY KEY (order_id)
)
CREATE INDEX owner_index ON orders
USING btree
(owner) WHERE status > 0;
CREATE TABLE orders_appendix (
order_id bigint,
note text
)
Data
orders:
(IDs, 0, 1337) * 1000000 rows
(IDs, 10, 1337) * 1000 rows
(IDs, 10, 777) * 1000 rows
orders_appendix:
one row for each order
My problem is:
select * from orders where owner=1337 and status>0
The query planner estimated the number of rows to be 1000000, but the actual number of rows is 1000.
In the following, more complicated query:
SELECT note FROM orders JOIN orders_appendix using (order_id)
WHERE owner=1337 AND status>0
Instead of using an inner join (which is preferable for a small number of rows), it picks a bitmap join + a full table scan on orders_appendix, which is very slow.
If the condition is "owner=777", it will choose the preferable inner join instead.
I believe it is because of the statistics, as AFAIK Postgres can only collect and consider stats for each column independently.
However, if I...
CREATE INDEX onwer_abs ON orders (abs(owner)) where status>0;
Now, a slightly changed query...
SELECT note FROM orders JOIN orders_appendix using (order_id)
WHERE abs(owner)=1337 AND status>0
will result in the inner join that I wanted.
Is there a better solution? Perhaps "statistics on partial index"?
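Not from the original question, but two levers worth knowing about: on PostgreSQL 10+ extended statistics let the planner model the owner/status correlation directly, while on 9.3 the closest option is raising the per-column statistics targets and re-analyzing:
-- PostgreSQL 10+ only: multi-column (functional dependency) statistics
CREATE STATISTICS orders_owner_status (dependencies) ON owner, status FROM orders;
ANALYZE orders;
-- On 9.3 the closest lever is a larger per-column sample / MCV list
ALTER TABLE orders ALTER COLUMN owner SET STATISTICS 1000;
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;
ANALYZE orders;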
I would like to select data from two columns of the same table, both of the same length (@fetch rows) but at a certain offset from each other.
However, right off the bat I am running into syntax errors. I would prefer a join-based solution for this. Thanks.
SELECT c.[Close],h.[High]
FROM
(
SELECT [Close],[CityID],[Time]
FROM [dataSQL].[dbo].[temperature]
WHERE [Time]<@time
ORDER BY [Time] DESC
OFFSET 0 ROWS
FETCH NEXT (@fetch) ROWS ONLY
) AS c
JOIN
(
SELECT [High],[CityID],[Time]
FROM [dataSQL].[dbo].[temperature]
WHERE [Time]<@time
ORDER BY [Time] DESC
OFFSET (@offset) ROWS
FETCH NEXT (@fetch) ROWS ONLY
) AS h
ON c.[CityID]=h.[CityID] AND c.[Time]=h.[Time]
WHERE c.[CityID]=@name AND h.[CityID]=@name;
EDIT:
This is now returning more rows than expected, with repetitions in both columns.
EDIT:
This now returns values that are in the same row, with no offset, because I required the primary keys to match! There has to be an offset, and the problem is that this table contains data for more than two cities, so you cannot simply use ROW_NUMBER()!
Here is my table schema:
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[Temperature](
[Time] [datetime] NOT NULL,
[CityID] [tinyint] NOT NULL,
[High] [real] NULL,
[Close] [real] NULL,
CONSTRAINT [pk_time_cityid] PRIMARY KEY CLUSTERED
(
[Time] ASC,
[CityID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[Temperature] WITH CHECK ADD CONSTRAINT [FK_Temperature_Cities] FOREIGN KEY([CityID])
REFERENCES [dbo].[Cities] ([CityID])
GO
ALTER TABLE [dbo].[Temperature] CHECK CONSTRAINT [FK_Temperature_Cities]
GO
Keep in mind you can edit your question to correct your query.
In addition to OFFSET 0 ROWS and FETCH NEXT (...) ROWS ONLY issues, you need to use SELECT [CityID],[Close] and SELECT [CityID],[High] in your subqueries respectively. The only fields from a subquery that are available outside the subquery are those you explicitly specify. That includes JOIN conditions.
Regarding your comment that you're getting the number of rows you expect squared, you've probably got an implicit cross join that's creating a Cartesian product. You need to join your tables using the fields in the primary key of dbo.temperature. Try adding some sample data to your question and possibly the table schema.
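Not from the original answer, but to sketch one way to get the positional pairing once a single city is filtered (which the WHERE clause already does), using ROW_NUMBER(); the variable values below are made up for illustration:
DECLARE @name tinyint = 1, @time datetime = '20230101', @fetch int = 10, @offset int = 5;
WITH ordered AS (
    SELECT [Close], [High], [Time],
           ROW_NUMBER() OVER (ORDER BY [Time] DESC) AS rn
    FROM [dataSQL].[dbo].[Temperature]
    WHERE [Time] < @time AND [CityID] = @name   -- one city, so rn is unambiguous
)
SELECT c.[Close], h.[High]
FROM ordered AS c
JOIN ordered AS h
    ON h.rn = c.rn + @offset                    -- pair row n with row n + @offset
WHERE c.rn <= @fetch;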
I've got a visits table that looks like this:
id identity(1,1) not null,
visit_date datetime not null,
patient_id int not null,
flag bit not null
For each record, I need to find a matching record that is same time or earlier, has the same patient_id, and has flag set to 1. What I am doing now is:
select parent.id as parent_id,
(
select top 1
child.id as child_id
from
visits as child
where
child.visit_date <= parent.visit_date
and child.patient_id = parent.patient_id
and child.flag = 1
order by
visit_date desc
) as child_id
from
visits as parent
So, this query works correctly, except that it runs too slow -- I suspect that this is because of the subquery. Is it possible to rewrite it as a joined query?
View the query execution plan. Where you see thick arrows, look at those operators; it is worth learning what the different operators imply, e.g. Clustered Index Scan vs. Seek.
Usually, though, when a query is running slow I find that there are no good indexes.
Look at the tables and columns that are filtered and joined on, and create an index that covers all of those columns. This is usually called a covering index. It's something you can do for a query that really needs it, but keep in mind that too many indexes will slow down insert statements.
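To make that concrete for the query above (a hedged sketch; the index name is made up, the table and column names are from the question), a filtered covering index for the correlated lookup could look like:
-- Equality on patient_id, ordered range on visit_date, only flagged rows,
-- and id available from the index itself
CREATE NONCLUSTERED INDEX ix_visits_patient_flagged
    ON dbo.visits (patient_id, visit_date DESC)
    INCLUDE (id)
    WHERE flag = 1;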
/*
id identity(1,1) not null,
visit_date datetime not null,
patient_id int not null,
flag bit not null
*/
SELECT
T.parentId,
T.patientId,
V.id AS childId
FROM
(
-- For each visit, find the date of the latest same-time-or-earlier flagged visit
SELECT
visits.id AS parentId,
visits.patient_id AS patientId,
MAX(previousVisit.visit_date) AS previousVisitDate
FROM
visits
LEFT JOIN visits AS previousVisit ON
visits.patient_id = previousVisit.patient_id
AND visits.visit_date >= previousVisit.visit_date
AND visits.id <> previousVisit.id
AND previousVisit.flag = 1
GROUP BY
visits.id,
visits.visit_date,
visits.patient_id,
visits.flag
) AS T
-- Join back to pick up the id of the flagged visit with that date
LEFT JOIN visits AS V ON
T.patientId = V.patient_id
AND T.previousVisitDate = V.visit_date
AND V.flag = 1
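As an alternative sketch (not part of the original answer): on SQL Server 2005 and later, OUTER APPLY keeps the "top 1 most recent flagged visit" logic explicit and usually pairs well with the covering index sketched earlier:
SELECT
    parent.id AS parent_id,
    child.id AS child_id
FROM visits AS parent
OUTER APPLY
(
    -- latest flagged visit at or before the parent visit, same patient
    SELECT TOP (1) c.id
    FROM visits AS c
    WHERE c.patient_id = parent.patient_id
      AND c.visit_date <= parent.visit_date
      AND c.flag = 1
    ORDER BY c.visit_date DESC
) AS child;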
INSERT INTO contacts_lists (contact_id, list_id)
SELECT contact_id, 110689 AS list_id
FROM plain_contacts
WHERE TRUE
AND is_print = TRUE
AND ( ( TRUE
AND country_id IN (231,39)
AND company_type_id IN (2,8,12,5,6,4,3,9,10,13,11,1,7)
AND is_broadcast = TRUE )
OR ( TRUE
AND country_id IN (15,59,73,74,81,108,155,165,204,210,211,230)
AND company_type_id IN (2,8,12,5,6,4,3,9,10,13,11,1,7)
AND is_broadcast = TRUE )
OR ( TRUE
AND country_id IN (230)
AND company_type_id IN (2,8,12,5,6,4,3,9,10,13,11,1,7)
AND is_broadcast = TRUE ))
AND (NOT EXISTS (
SELECT title_id
FROM company_types_lists_titles
WHERE company_types_list_id = 92080)
OR title_id IN (
SELECT title_id
FROM company_types_lists_titles
WHERE company_types_list_id = 92080))
AND company_type_id = 2
AND country_id IN (
SELECT country_id
FROM countries_lists
WHERE list_id = 110689)
AND ((state_id IS NULL
OR country_id NOT IN (231,39)
OR state_id IN (
SELECT state_id
FROM lists_states
WHERE list_id = 110689))
OR zone_ids && ARRAY(
SELECT zone_id
FROM lists_zones
WHERE list_id = 110689)
)
AND (NOT EXISTS (
SELECT award_id
FROM company_types_lists_top_awards
WHERE company_types_list_id = 92080)
OR top_award_ids && ARRAY(
SELECT award_id
FROM company_types_lists_top_awards
WHERE company_types_list_id = 92080))
I am using PostgreSQL. The SELECT part pulls about 30000 rows from various tables and takes less than a second, but inserting that data into another table takes more and more time. How can I reduce the insert time? This is the query I have; the SELECT part returns nearly 30000 records.
"take more and more time to insert"
That usually means you're missing an index.
Edit: now that you've posted the query... you're definitely missing one or more indexes to speed up the lookups performed during the insert. You probably also want to rewrite that huge SELECT statement to reduce the nesting.
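A hedged sketch of the kind of indexes meant here (the index names are made up, the table and column names are taken from the query; which of them already exist, and which ones the plan actually needs, isn't shown in the question):
-- Support the repeated sub-selects against the list tables
CREATE INDEX idx_ctl_titles_list ON company_types_lists_titles (company_types_list_id);
CREATE INDEX idx_ctl_awards_list ON company_types_lists_top_awards (company_types_list_id);
CREATE INDEX idx_countries_lists_list ON countries_lists (list_id);
CREATE INDEX idx_lists_states_list ON lists_states (list_id);
CREATE INDEX idx_lists_zones_list ON lists_zones (list_id);
-- Narrow the scan of the source table for this kind of filter
CREATE INDEX idx_plain_contacts_type_country ON plain_contacts (company_type_id, country_id)
WHERE is_print = TRUE AND is_broadcast = TRUE;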
If no other sessions are working with the target table in the meantime, you could drop the table's indexes, insert the data, and recreate the indexes afterwards.
This may lead to a speed-up and is worth considering if your data is reliable and you can guarantee that you won't violate any unique constraints.
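A minimal sketch of that approach (the index name below is hypothetical; list the real indexes on the target table with \d contacts_lists in psql):
-- 1) Drop the index(es) on the target table before the bulk insert
DROP INDEX IF EXISTS contacts_lists_contact_id_list_id_idx;   -- hypothetical name
-- 2) Run the big INSERT INTO contacts_lists ... SELECT from the question, unchanged
-- 3) Recreate the index(es) afterwards
CREATE INDEX contacts_lists_contact_id_list_id_idx
    ON contacts_lists (contact_id, list_id);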
What is the best way to get top n rows by random order?
I use a query like:
Select top(10) field1,field2 .. fieldn
from Table1
order by checksum(newid())
The problem with the above query is that it keeps getting slower as the table size increases: it will always do a full clustered index scan to find the top(10) rows in random order.
Is there any other better way to do it?
I have tested this and got better performance by changing the query.
The DDL for the table I used in my tests.
CREATE TABLE [dbo].[TestTable]
(
[ID] [int] IDENTITY(1,1) NOT NULL,
[Col1] [nvarchar](100) NOT NULL,
[Col2] [nvarchar](38) NOT NULL,
[Col3] [datetime] NULL,
[Col4] [nvarchar](50) NULL,
[Col5] [int] NULL,
CONSTRAINT [PK_TestTable] PRIMARY KEY CLUSTERED
(
[ID] ASC
)
)
GO
CREATE NONCLUSTERED INDEX [IX_TestTable_Col5] ON [dbo].[TestTable]
(
[Col5] ASC
)
The table has 722888 rows.
First query:
select top 10
T.ID,
T.Col1,
T.Col2,
T.Col3,
T.Col5,
T.Col5
from TestTable as T
order by newid()
Statistics for first query:
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 13 ms.
(10 row(s) affected)
Table 'TestTable'. Scan count 1, logical reads 12492, physical reads 14, read-ahead reads 6437, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 859 ms, elapsed time = 1700 ms.
Execution plan first query:
Second query:
select
T.ID,
T.Col1,
T.Col2,
T.Col3,
T.Col5,
T.Col5
from TestTable as T
inner join (select top 10 ID
from TestTable
order by newid()) as C
on T.ID = C.ID
Statistics for second query:
SQL Server parse and compile time:
CPU time = 125 ms, elapsed time = 183 ms.
(10 row(s) affected)
Table 'TestTable'. Scan count 1, logical reads 1291, physical reads 10, read-ahead reads 399, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 516 ms, elapsed time = 706 ms.
Execution plan second query:
Summary:
The second query is using the index on Col5 to order the rows by newid() and then it does a Clustered Index Seek 10 times to get the values for the output.
The performance gain is there because the index on Col5 is narrower than the clustered key and that causes fewer reads.
Thanks to Martin Smith for pointing that out.
One way to reduce the size of the necessary scan is to combine TABLESAMPLE with ORDER BY newid(), selecting random rows from a sample of the table's pages rather than scanning the entire table.
The idea is to calculate the average number of rows per page, then use TABLESAMPLE to select roughly one random page of data for each row you want to output, and finally run the ORDER BY newid() query on just that subset of data. This approach is slightly less random than the original approach, but it is much better than using TABLESAMPLE alone and involves reading far less data off the table.
Unfortunately, the TABLESAMPLE clause does not accept a variable, so dynamic SQL is necessary in order to use a ROWS value based on the record size of the input table.
declare @factor int
select @factor = 8000 / avg_record_size_in_bytes
from sys.dm_db_index_physical_stats(db_id(), object_id('sample'), null, null, 'detailed')
where index_level = 0
declare @numRows int = 10
declare @sampledRows int = @factor * @numRows
declare @stmt nvarchar(max) = N'select top (@numRows) * from sample tablesample (' + convert(varchar(32), @sampledRows) + ' rows) order by checksum(newid())'
exec sp_executesql @stmt, N'@numRows int', @numRows
This question is 7 years old and has no accepted answer, but it ranked high when I searched for SQL performance on selecting random rows, and none of the current answers gives a simple, quick solution for large tables, so I want to add my suggestion.
Assumptions:
The primary key is a numeric data type (typically int, incremented by 1 per row)
The primary key is the clustered index
The table has many rows, and only a few should be selected
I think this is fairly common, so it would help in a lot of cases.
Given a typical set of data, my suggestion is to:
Find the max and min IDs
Pick a random number in that range
Check if the number is a valid ID in the table
Repeat as needed
These operations should all be very fast as they all are on the clustered index. Only at the end will the rest of the data be read, by selecting a set based on a list of primary keys so we only pull in the data we actually need.
Example (MS SQL):
--
-- First, create a table with some dummy data to select from
--
DROP TABLE IF EXISTS MainTable
CREATE TABLE MainTable(
Id int IDENTITY(1,1) NOT NULL,
[Name] nvarchar(50) NULL,
[Content] text NULL
)
GO
DECLARE @I INT = 0
WHILE @I < 40
BEGIN
INSERT INTO MainTable VALUES('Foo', 'bar')
SET @I=@I+1
END
UPDATE MainTable SET [Name] = [Name] + CAST(Id as nvarchar(50))
-- Create a gap in IDs at the start
DELETE FROM MainTable
WHERE ID < 10
-- Create a gap in IDs in the middle
DELETE FROM MainTable
WHERE ID >= 20 AND ID < 30
-- We now have our "source" data we want to select random rows from
--
-- Then we select random data from our table
--
-- Get the interval of values to pick random values from
DECLARE @MaxId int
SELECT @MaxId = MAX(Id) FROM MainTable
DECLARE @MinId int
SELECT @MinId = MIN(Id) FROM MainTable
DECLARE @RandomId int
DECLARE @NumberOfIdsTofind int = 10
-- Make temp table to insert ids from
DROP TABLE IF EXISTS #Ids
CREATE TABLE #Ids (Id int)
WHILE (@NumberOfIdsTofind > 0)
BEGIN
SET @RandomId = ROUND(((@MaxId - @MinId - 1) * RAND() + @MinId), 0)
-- Verify that the random ID is a real id in the main table
IF EXISTS (SELECT Id FROM MainTable WHERE Id = @RandomId)
BEGIN
-- Verify that the random ID has not already been inserted
IF NOT EXISTS (SELECT Id FROM #Ids WHERE Id = @RandomId)
BEGIN
-- It's a valid, new ID, add it to the list.
INSERT INTO #Ids VALUES (@RandomId)
SET @NumberOfIdsTofind = @NumberOfIdsTofind - 1;
END
END
END
-- Select the random rows of data by joining the main table with our random Ids
SELECT MainTable.* FROM MainTable
INNER JOIN #Ids ON #Ids.Id = MainTable.Id
No, there is no way to improve the performance here. Since you want the rows in "random" order, indexes will be useless. You could, however, try ordering by newid() instead of its checksum, but that's just an optimization to the random ordering, not the sorting itself.
The server has no way of knowing that you want a random selection of 10 rows from the table. The query is going to evaluate the order by expression for every row in the table, since it's a computed value that cannot be determined by index values. This is why you're seeing a full clustered index scan.