I would like to have an equivalent Oracle query for the below SQL Query
SQL QUERY:
CREATE UNIQUE NONCLUSTERED INDEX ValidSub_Category ON ValidSub (Category ASC) WHERE (category IS NOT NULL)
purpose: This index is created to make sure that the column has more than 1 NULL records but does not have duplicate strings.
Thanks in advance
I found it
CREATE UNIQUE INDEX VALIDSUB_CATEGORY ON VALIDSUB (Case WHEN Category
IS NOT NULL THEN CATEGORY END);
I could use some help optimizing a query that compares rows in a single table with millions of entries. Here's the table's definition:
CREATE TABLE IF NOT EXISTS data.row_check (
id uuid NOT NULL DEFAULT NULL,
version int8 NOT NULL DEFAULT NULL,
row_hash int8 NOT NULL DEFAULT NULL,
table_name text NOT NULL DEFAULT NULL,
CONSTRAINT row_check_pkey
PRIMARY KEY (id, version)
);
I'm reworking our push code and have a test bed with millions of records across about 20 tables. I run my tests, get the row counts, and can spot when some of my insert code has changed. The next step is to checksum each row, and then compare the rows for differences between versions of my code. Something like this:
-- Run my test of "version 0" of the push code, the base code I'm refactoring.
-- Insert the ID and checksum for each pushed row.
INSERT INTO row_check (id,version,row_hash,table_name)
SELECT id, 0, hashtext(record_changes_log::text),'record_changes_log'
FROM record_changes_log
ON CONFLICT ON CONSTRAINT row_check_pkey DO UPDATE SET
row_hash = EXCLUDED.row_hash,
table_name = EXCLUDED.table_name;
truncate table record_changes_log;
-- Run my test of "version 1" of the push code, the new code I'm validating.
-- Insert the ID and checksum for each pushed row.
INSERT INTO row_check (id,version,row_hash,table_name)
SELECT id, 1, hashtext(record_changes_log::text),'record_changes_log'
FROM record_changes_log
ON CONFLICT ON CONSTRAINT row_check_pkey DO UPDATE SET
row_hash = EXCLUDED.row_hash,
table_name = EXCLUDED.table_name;
That gets two rows in row_check for every row in record_changes_log, or any other table I'm checking. For the two runs of record_changes_log, I end up with more than 8.6M rows in row_check. They look like this:
id version row_hash table_name
e6218751-ab78-4942-9734-f017839703f6 0 -142492569 record_changes_log
6c0a4111-2f52-4b8b-bfb6-e608087ea9c1 0 -1917959999 record_changes_log
7fac6424-9469-4d98-b887-cd04fee5377d 0 -323725113 record_changes_log
1935590c-8d22-4baf-85ba-00b563022983 0 -1428730186 record_changes_log
2e5488b6-5b97-4755-8a46-6a46317c1ae2 0 -1631086027 record_changes_log
7a645ffd-31c5-4000-ab66-a565e6dad7e0 0 1857654119 record_changes_log
I asked yesterday for some help on the comparison query, and it lead to this:
select v0.table_name,
v0.id,
v0.row_hash as v0,
v1.row_hash as v1
from row_check v0
join row_check v1 on v0.id = v1.id and
v0.version = 0 and
v1.version = 1 and
v0.row_hash <> v1.row_hash
That works, but now I'm hoping to optimize it a bit. As an experiment, I clustered the data on version and then built a BRIN index, like this:
drop index if exists row_check_version_btree;
create index row_check_version_btree
on row_check
using btree(version);
cluster row_check using row_check_version_btree;
drop index row_check_version_btree; -- Eh? I want to see how the BRIN performs.
drop index if exists row_check_version_brin;
create index row_check_version_brin
on row_check
using brin(row_hash);
vacuum analyze row_check;
I ran the query through explain analyze and get this:
Merge Join (cost=1.12..559750.04 rows=4437567 width=51) (actual time=1511.987..14884.045 rows=10 loops=1)
Output: v0.table_name, v0.id, v0.row_hash, v1.row_hash
Inner Unique: true
Merge Cond: (v0.id = v1.id)
Join Filter: (v0.row_hash <> v1.row_hash)
Rows Removed by Join Filter: 4318290
Buffers: shared hit=8679005 read=42511
-> Index Scan using row_check_pkey on ascendco.row_check v0 (cost=0.56..239156.79 rows=4252416 width=43) (actual time=0.032..5548.180 rows=4318300 loops=1)
Output: v0.id, v0.version, v0.row_hash, v0.table_name
Index Cond: (v0.version = 0)
Buffers: shared hit=4360752
-> Index Scan using row_check_pkey on ascendco.row_check v1 (cost=0.56..240475.33 rows=4384270 width=24) (actual time=0.031..6070.790 rows=4318300 loops=1)
Output: v1.id, v1.version, v1.row_hash, v1.table_name
Index Cond: (v1.version = 1)
Buffers: shared hit=4318253 read=42511
Planning Time: 1.073 ms
Execution Time: 14884.121 ms
...which I did not really get the right idea from...so I ran it again to JSON and fed the results into this wonderful plan visualizer:
http://tatiyants.com/pev/#/plans
The tips there are right, the top node estimate is bad. The result is 10 rows, the estimate is for about 443,757 rows.
I'm hoping to learn more about optimizing this kind of thing, and this query seems like a good opportunity. I have a lot of notions about what might help:
-- CREATE STATISTICS?
-- Rework the query to move the where comparison?
-- Use a better index? I did try a GIN index and a straight B-tree on version, but neither was superior.
-- Rework the row_check format to move the two hashes into the same row instead of splitting them over two rows, compare on insert/update, flag non-matches, and add a partial index for the non-matching values.
Granted, it's funny to even try to index something where there are only two values (0 and 1 in the case above), so there's that. In fact, is there any sort of clever trick for Booleans? I'll always be comparing two versions, so "old" and "new", which I can express however makes life best. I understand that Postgres only has bitmap indexes internally at search/merge (?) time and that it does not have a bitmap type index. Would there be some kind of INTERSECT that might help? I don't know how Postgres implements set math operators internally.
Thanks for any suggestions on how to rethink this data or the query to make it faster for comparisons involving millions, or tens of millions, of rows.
I'm going to add an answer to my own question, but am still interested in what anyone else has to say. In the process of writing out my original question, I realized that maybe a redesign is in order. This hinges on my plan to only ever compare two versions at a time. That's a good solution here, but there are other cases where it wouldn't work. Anyway, here's a slightly different table design that folds the two results into a single row:
DROP TABLE IF EXISTS data.row_compare;
CREATE TABLE IF NOT EXISTS data.row_compare (
id uuid NOT NULL DEFAULT NULL,
hash_1 int8, -- Want NULL to defer calculating hash comparison until after both hashes are entered.
hash_2 int8, -- Ditto
hashes_match boolean, -- Likewise
table_name text NOT NULL DEFAULT NULL,
CONSTRAINT row_compare_pkey
PRIMARY KEY (id)
);
The following expression index should, hopefully, be very small as I shouldn't have any non-matching entries:
CREATE INDEX row_compare_fail ON row_compare (hashes_match)
WHERE hashes_match = false;
The trigger below does the column calculation, once hash_1 and hash_2 are both provided:
-- Run this as a BEFORE INSERT or UPDATE ROW trigger.
CREATE OR REPLACE FUNCTION data.on_upsert_row_compare()
RETURNS trigger AS
$BODY$
BEGIN
IF NEW.hash_1 = NULL OR
NEW.hash_2 = NULL THEN
RETURN NEW; -- Don't do the comparison, hash_1 hasn't been populated yet.
ELSE-- Do the comparison. The point of this is to avoid constantly thrashing the expression index.
NEW.hashes_match := NEW.hash_1 = NEW.hash_2;
RETURN NEW; -- important!
END IF;
END;
$BODY$
LANGUAGE plpgsql;
This now adds 4.3M rows instead of 8.6M rows:
-- Add the first set of results and build out the row_compare records.
INSERT INTO row_compare (id,hash_1,table_name)
SELECT id, hashtext(record_changes_log::text),'record_changes_log'
FROM record_changes_log
ON CONFLICT ON CONSTRAINT row_compare_pkey DO UPDATE SET
hash_1 = EXCLUDED.hash_1,
table_name = EXCLUDED.table_name;
-- I'll truncate the record_changes_log and push my sample data again here.
-- Add the second set of results and update the row compare records.
-- This time, the hash is going into the hash_2 field for comparison
INSERT INTO row_compare (id,hash_2,table_name)
SELECT id, hashtext(record_changes_log::text),'record_changes_log'
FROM record_changes_log
ON CONFLICT ON CONSTRAINT row_compare_pkey DO UPDATE SET
hash_2 = EXCLUDED.hash_2,
table_name = EXCLUDED.table_name;
And now the results are simple to find:
select * from row_compare where hashes_match = false;
This changes the query time from around 17 seconds to around 24 milliseconds.
I have a query where I'm trying to find a null field from millions of records. There will only be one or two.
The query looks like this:
SELECT *
FROM “table”
WHERE “id” = $1
AND “end_time” IS NULL
ORDER BY “start_time” DESC LIMIT 1
How can I make this query more performant eg using indexes in the database.
try a partial index, smth like:
create index iname on "table" (id, start_time) where end_time is null;
I'd like to count all the Orders that are not urgent and whose order status = 1 (shipped).
This should be a very simple query to optimize. I'd like to put a simple filtered index on the Orders table to cover this query to make it a constant time/O(1) operation. However, when I look at the query plan, it looks like it's using a Index Scan which doesn't make sense. Ideally, this query should just returning the number of items in the index.
The table look like this (simplified to get to the essence):
CREATE TABLE [dbo].[Orders](
[Id] [int] IDENTITY(1,1) NOT NULL,
[IsUrgent] [bit] NOT NULL,
[Status] [tinyint] NOT NULL
CONSTRAINT [PK_Orders] PRIMARY KEY CLUSTERED ( [Id] ASC )
I've created this filtered index:
CREATE INDEX IX_Orders_ShippedNonUrgent ON Orders(Id) WHERE IsUrgent = 0 AND Status = 1;
Now, when I do this query:
SELECT COUNT(*) FROM Orders WHERE IsUrgent = 0 AND Status = 1
I see that the query plan is using IX_Orders_ShippedNonUrgent, but it's doing an Index Scan and performing around 200 reads across the ~150,000 rows in Orders.
Is it possible to always have this query run in constant time assuming the filtered index is kept up to date? Ideally, it should only perform 1 read to get the size of the index.
If I switch to a non-filtered index like this:
CREATE INDEX IX_Orders_IsUrgentStatus ON Orders(IsUrgent, Status);
The query plan uses an Index Seek, but still performs many more reads than should be necessary to answer this simple query.
UPDATE
I'm able to do this
SELECT TOP 1 rows FROM sys.partitions p
INNER JOIN sys.indexes i
ON i.name = 'IX_Orders_ShippedNonUrgent'
AND i.object_id = p.object_id
AND i.index_id = p.index_id
and get the result in 9 reads but it seems like there should be a much easier and less brittle way of using the simple COUNT(*) query.
It seems like what I'm wanting isn't possible. The best answer was left in the comments by Nikola Markovinović which is to forget about the filtered index and use an indexed view instead:
CREATE VIEW [dbo].vw_Orders_TotalShippedNonUrgent WITH SCHEMABINDING
AS
SELECT COUNT_BIG(*) AS TotalOrders
FROM dbo.Orders WHERE IsUrgent = 0 AND Status = 1;
with
CREATE UNIQUE CLUSTERED INDEX IX_vw_Orders_TotalShippedNonUrgent ON vw_Orders_TotalShippedNonUrgent(TotalOrders);
This forces creating views and their index for each summary statistic that I want as well as rewriting the query to ask the view instead of the simple approach, but it is fast at only 2 reads.
I'll leave this question open for awhile in case anyone has a simpler approach that's just as fast.
I have a large database, that I want to do some logic to update new fields.
The primary key is id for the table harvard_assignees
The LOGIC GOES LIKE THIS
Select all of the records based on id
For each record (WHILE), if (state is NOT NULL && country is NULL), update country_out = "US" ELSE update country_out=country
I see step 1 as a PostgreSQL query and step 2 as a function. Just trying to figure out the easiest way to implement natively with the exact syntax.
====
The second function is a little more interesting, requiring (I believe) DISTINCT:
Find all DISTINCT foreign_keys (a bivariate key of pat_type,patent)
Count Records that contain that value (e.g., n=3 records have fkey "D","388585")
Update those 3 records to identify percent as 1/n (e.g., UPDATE 3 records, set percent = 1/3)
For the first one:
UPDATE
harvard_assignees
SET
country_out = (CASE
WHEN (state is NOT NULL AND country is NULL) THEN 'US'
ELSE country
END);
At first it had condition "id = ..." but I removed that because I believe you actually want to update all records.
And for the second one:
UPDATE
example_table
SET
percent = (SELECT 1/cnt FROM (SELECT count(*) AS cnt FROM example_table AS x WHERE x.fn_key_1 = example_table.fn_key_1 AND x.fn_key_2 = example_table.fn_key_2) AS tmp WHERE cnt > 0)
That one will be kind of slow though.
I'm thinking on a solution based on window functions, you may want to explore those too.