Suppose we have a query like below:
select top 1 X from dbo.Tab
where SomeDateTime < @P1 and SomeID = @P2
order by SomeDateTime desc
What indexes or other techniques do you use to effectively optimize these types of queries?
The table may still have a primary key but on a different column like Id:
Id BIGINT IDENTITY PK
X something
SomeDateTime DateTime2
SomeID BIGINT FK to somewhere
Suppose there are millions of records for a given SomeDateTime or SomeID - table is "long".
Let's assume that data is read and written with a similar frequency.
For posterity, an article with pictures on the topic:
https://use-the-index-luke.com/sql/where-clause/searching-for-ranges/greater-less-between-tuning-sql-access-filter-predicates
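One common answer (sketched here with the names from the question; the descending key order and the INCLUDE column are assumptions rather than anything stated above) is a composite index with the equality column first and the range/sort column second, covering X:
-- illustrative only: SomeID (equality) leads, SomeDateTime (range filter and
-- ORDER BY ... DESC) follows, and X is included so the TOP 1 can be answered
-- by a single index seek plus a short backward range scan
CREATE NONCLUSTERED INDEX IX_Tab_SomeID_SomeDateTime
    ON dbo.Tab (SomeID, SomeDateTime DESC)
    INCLUDE (X);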
Related
I have a choice in how a data table is created and am wondering which approach is more performant.
Making a table with a row for every data point,
Making a table with an array column that will allow repeated content to be unnested
That is, if I have the data:
day | val1 | val2
----+------+-----
Mon |    7 |   11
Tue |    7 |   11
Wed |    8 |    9
Thu |    1 |    4
Is it better to enter the data in as shown, or instead:
day       | val1 | val2
----------+------+-----
(Mon,Tue) |    7 |   11
(Wed)     |    8 |    9
(Thu)     |    1 |    4
And then use unnest() to explode those into unique rows when I need them?
Assume that we're talking about large data in reality - 100k rows of data generated every day x 20 columns. Using the array would greatly reduce the number of rows in the table but I'm concerned that unnest would be less performant than just having all of the rows.
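For comparison, here is a minimal sketch of what the array-column variant and the unnest() read would look like (the table and column names are hypothetical, reduced to two value columns):
-- hypothetical packed layout: one row covers several days
CREATE TABLE day_data_packed (
  days text[]   -- e.g. '{Mon,Tue}'
, val1 int
, val2 int
);

-- exploding back into one row per day when needed
SELECT unnest(days) AS day, val1, val2
FROM   day_data_packed;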
I believe making a table with a row for every data point is the option I would go for, as unnest() on large amounts of data would be just as slow, if not slower. Plus, unless your data is highly repetitive, 20 columns is a lot to align into arrays.
"100k rows of data generated every day x 20 columns"
And:
"the array would greatly reduce the number of rows" - so lots of duplicates.
Based on this I would suggest a third option:
Create a table with your 20 columns of data and add a surrogate bigint PK to it. To enforce uniqueness across all 20 columns, add a generated hash and make it UNIQUE. I suggest a custom function for the purpose:
-- hash function
CREATE OR REPLACE FUNCTION public.f_uniq_hash20(col1 text, col2 text, ... , col20 text)
RETURNS uuid
LANGUAGE sql IMMUTABLE COST 30 PARALLEL SAFE AS
'SELECT md5(textin(record_out(($1,$2, ... ,$20))))::uuid';
-- data table
CREATE TABLE data (
data_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, col1 text
, col2 text
, ...
, col20 text
, uniq_hash uuid GENERATED ALWAYS AS (public.f_uniq_hash20(col1, col2, ... , col20)) STORED
, CONSTRAINT data_uniq_hash_uni UNIQUE (uniq_hash)
);
-- reference data_id in next table
CREATE TABLE day_data (
day text
, data_id bigint REFERENCES data ON UPDATE CASCADE -- FK to enforce referential integrity
, PRIMARY KEY (day, data_id) -- must be unique?
);
With only text columns, the function is actually IMMUTABLE (which we need!). For other data types (like timestamptz) it would not be.
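A quick illustration of why: the text representation of a timestamptz depends on session settings, so a function hashing one could not honestly be declared IMMUTABLE (the timezone values below are only examples):
SET timezone = 'UTC';
SELECT '2021-01-01 10:00+02'::timestamptz::text;  -- '2021-01-01 08:00:00+00'
SET timezone = 'Europe/Vienna';
SELECT '2021-01-01 10:00+02'::timestamptz::text;  -- '2021-01-01 09:00:00+01'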
In-depth explanation in this closely related answer:
Why doesn't my UNIQUE constraint trigger?
You could use uniq_hash as PK directly, but for many references, a bigint is more efficient (8 vs. 16 bytes).
About generated columns:
Computed / calculated / virtual / derived columns in PostgreSQL
Basic technique to avoid duplicates while inserting new data:
INSERT INTO data (col1, col2) VALUES
('foo', 'bam')
ON CONFLICT DO NOTHING
RETURNING *;
If there can be concurrent writes, see:
How to use RETURNING with ON CONFLICT in PostgreSQL?
I have a table similar to this one:
create table request_journal
(
id bigint,
request_body text,
request_date timestamp,
user_id bigint
);
It is used for request logging purposes, so frequent inserts in it are expected (2k+ rps).
I want to create a composite index on columns request_date and user_id to speed up the execution of select queries like this:
select *
from request_journal
where request_date between '2021-07-08 10:00:00' and '2021-07-08 16:00:00'
and user_id = 123
order by request_date desc;
I tested select queries with a (request_date desc, user_id) btree index and a (user_id, request_date desc) btree index. With the index leading on request_date, the select queries execute about 10% faster, but in general the performance of either index is acceptable.
So my question is: does the index column order affect insertion time? I have not spotted any differences using EXPLAIN/EXPLAIN ANALYZE on the insert query. Which index will be more efficient to maintain under "high load"?
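For reference, the two candidates described above would be created roughly like this (the index names are placeholders):
CREATE INDEX request_journal_date_user_idx ON request_journal (request_date DESC, user_id);
CREATE INDEX request_journal_user_date_idx ON request_journal (user_id, request_date DESC);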
It is hard to believe your tests were done on any vaguely realistic data size.
At the rate you indicate, a 6 hour range would include over 43 million records. If the user_ids are spread evenly over 1e6 different values, I get the index leading with user_id to be a thousand times faster for that query than the one leading with request_date.
But anyway, for loading new data, assuming the new data is all from recent times, then the one with request_date should be faster as the part of the index needing maintenance while loading will be more concentrated in part of the index, and so better cached. But this would depend on how much RAM you have, what your disk system is like, and how many distinct user_ids you are loading data for.
I have 2 tables, approximately this:
Parent_table: Parent_id bigint, Loc geometry
Child_table:  Child_id bigint,
              parent_id bigint,
              record_date timestamp,
              value double precision,
              category character varying(10)
I need to query subsets of the child table for varying conditions (location, date range, value range, category). As part of this, I winnow down the locations from the parent table and then want to get all the matching child records.
The obvious way to do this is:
with limited_parents as
(
select parent_id from parent_table where [location condition]
)
select [ columns ] from child_table where parent_id in
(select parent_id from limited_parents)
and [ other conditions for record_date, value, category ]
Child_table has >200m records, partitioned by year. It has an index leading on parent_id, with the other columns all included in that index, in that order.
Parent_table has <10k records. Each Parent could easily have > 1m+ child records (the numbers of child records per parent are widely distributed from a few hundred to million+). The set of parents which are in scope in any query (and therefore included in that sub-select) might be from 1 to several hundred.
Database is currently Postgres 10.
The query is performant for ranges of a couple of years/partitions, but gets significantly slower as the amount of data in scope increases.
I have freedom to adjust indexes and change the queries. Is there a more efficient way of doing this query? (Flattening the two tables, and putting the location on the child table and doing the geo intersection there, makes the whole thing orders of magnitude slower)
Your query is written in a complicated and inefficient fashion. In particular, CTEs are an optimizer fence in older PostgreSQL versions.
Try this:
SELECT [ columns ] FROM child_table AS c
WHERE EXISTS (SELECT 1 FROM parent_table AS p
WHERE [location condition on p]
AND p.parent_id = c.parent_id)
AND [ other conditions for record_date, value, category ]
Create an index on the foreign key column of the large table to speed up a nested loop join. Set work_mem high to speed up a hash join. PostgreSQL will automatically pick the best solution.
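A rough sketch of that advice (the partition name, index name, and work_mem value are illustrative; on PostgreSQL 10 the index has to be created on each partition, as indexes on the partitioned parent only arrive in version 11):
-- one per yearly partition, foreign key column leading
CREATE INDEX child_2020_parent_id_idx ON child_table_2020 (parent_id);

-- per session; size it so the hash join's hash table fits in memory
SET work_mem = '256MB';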
Let's say I have a 1:many relation:
Tree:Apples
Tree/Apple each have a primary key ID column, and Apple has a date attribute/column (created_at).
Using Sphinx, I want to retrieve all trees, sorted by the number of apples created during a given period of time. So, for example:
All trees, sorted by the total number of apples created between 1/1/2010 and 1/1/2011.
Is that possible?
So you have two tables
create table tree ( tree_id int unsigned primary key,...);
and
create table apple ( apple_id int unsigned primary key, tree_id int unsigned, created_at timestamp default current_timestamp,...);
So you can then just build an index on apples:
sql_query = select apple_id,tree_id,unix_timestamp(created_at) as created_at from apple
then run group-by queries on it:
$cl->setGroupBy('tree_id',SPH_GROUPBY_ATTR,'#count DESC');
The #count virtual attribute will give you the number of apples on that tree.
To set the filter:
$cl->setFilterRange('created_at',strtotime('2010-01-01'),strtotime('2011-01-01'));
Also, because you're not using it, you can set the ranking mode to none:
$cl->setRankingMode(SPH_RANK_NONE);
To be clear, you just use a blank query:
$res = $cl->Query('',$index);
I'm creating a robust date table and want to know the best way to link to it. The Primary Key Clustered Index will be on the smart date integer key (per Kimball spec) with a name of DateID. Until now I have been running queries against it like so:
select Foo.orderdate -- a bunch of fields from Foo
,DTE.FiscalYearName
,DTE.FiscalPeriod
,DTE.FiscalYearPeriod
,DTE.FiscalYearWeekName
,DTE.FiscalWeekName
FROM SomeTable Foo
INNER JOIN
DateDatabase.dbo.MyDateTable DTE
ON DTE.date = CAST(FLOOR(CAST(Foo.forderdate AS FLOAT)) AS DATETIME)
Keep in mind that Date is a nonclustered index field with values such as:
2000-01-01 00:00:00.000
It just occurred to me that since I have a clustered integer index (DateID), perhaps I should be converting the datetime field in my database to match it and linking based upon that field.
What do you folks think?
Also, depending on your first answer, if I am typically pulling those fields from the date table, what kind of index would optimize the retrieval of those fields? A covering index?
Even without changing the database structure, you'd get much better performance using a date range join like this:
select Foo.orderdate -- a bunch of fields from Foo
,DTE.FiscalYearName
,DTE.FiscalPeriod
,DTE.FiscalYearPeriod
,DTE.FiscalYearWeekName
,DTE.FiscalWeekName
FROM SomeTable Foo
INNER JOIN
DateDatabase.dbo.MyDateTable DTE
ON Foo.forderdate >= DTE.date AND Foo.forderdate < DATEADD(dd, 1, DTE.date)
However, if you can change it so that your Foo table includes a DateID field then, yes, you'd get the best performance by joining with that instead of any converted date value or date range.
If you change it to join on DateID, and DateID is the first column of the clustered index of MyDateTable, then it's already covering (a clustered index always includes all other fields).
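For completeness, a rough sketch of the DateID-based join suggested above (this assumes Foo gains a DateID column holding the same smart integer key, e.g. 20100101, which the question does not show):
SELECT Foo.orderdate -- a bunch of fields from Foo
     , DTE.FiscalYearName
     , DTE.FiscalPeriod
     , DTE.FiscalYearPeriod
     , DTE.FiscalYearWeekName
     , DTE.FiscalWeekName
FROM   SomeTable Foo
INNER JOIN
       DateDatabase.dbo.MyDateTable DTE
       ON DTE.DateID = Foo.DateID;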