I recently created a collection that stores a limited number (fewer than 5 so far) of kinds of data. I created the collection with entity_type and entity_id fields, as shown below:
+-------------+-----------+------+-----
| entity_type | entity_id | date | ...
+-------------+-----------+------+-----
Since queries for the latest records are the most frequent, I created the following compound index (note that entity_id is not globally unique):
[('date', -1), ('entity_type', 1), ('entity_id', 1)]
However, I am not sure whether such an index is okay, since I have read that an index should not be created on a field (here, entity_type) that has only a limited number of distinct values.
It is recommended to create MongoDB indexes on fields that ensure high selectivity.
So if the entity_type field contains only a limited set of values, I would suggest removing that field from the compound index.
MongoDB Index Selectivity
Related
How do you add a compound / composite index on a PostgreSQL table with TimescaleDB installed?
Following https://docs.timescale.com/latest/using-timescaledb/schema-management, you can add a compound / composite index to TimescaleDB by simply doing:
CREATE INDEX ON conditions (time DESC, cityid)
WHERE cityid IS NOT NULL;
time is a column with timestamps (the one used as the primary key in TimescaleDB).
cityid is a column holding a city identifier that we might often query for (as the second parameter after the time-series dates).
This can be done before or after converting the table to a hypertable.
The WHERE cityid IS NOT NULL clause is there to avoid bloating the index when the column cityid is often NULL. Use it by default, unless you often search for missing data (cityid IS NULL).
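Such an index then serves queries that filter on a city and read the latest rows first. A sketch (the temperature column is assumed, following the conditions example in the Timescale docs):
SELECT time, temperature
FROM   conditions
WHERE  cityid = 42
ORDER  BY time DESC
LIMIT  100;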
I need to create a varchar category column in a table and search for rows that belong to a particular category,
i.e. ALTER TABLE items ADD COLUMN category VARCHAR(30)
The number of categories is very small (repeated across the table)
and the intention is to only use = in the WHERE clause,
e.g. select * from items where category = 'food'
What kind of index would be ideal in postgres?
Especially if the table is never expected to be too big (less than 5,000 rows always)
This is a textbook use case for a hash index: you have a very small number of distinct values and only use the equality operator to query them. A hash index stores a relatively small hash of each value, which allows for faster equality lookups.
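A minimal sketch, assuming the items table from the question:
CREATE INDEX items_category_hash_idx ON items USING HASH (category);  -- hash indexes support only =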
This might be an obvious and simple question.
I read through the jsonb data type documentation, but nowhere does it mention the lookup cost of a key within a jsonb value.
For example, let's say I have a table with the following schema:
CREATE TABLE a (
    id   character varying(20),
    info jsonb
);
I want to know how Postgres would handle a WHERE query like the one below:
SELECT * FROM A WHERE info->>'city' = 'portland';
While going through the jsonb field of a row, is the key lookup within that jsonb value constant time (O(1)) or linear time (checking each key one by one)?
My intuition is that it must be constant time (else what's the point of a dictionary style data?) but I can't see it in the official documentation to convince my team.
Any help would be great!
Thanks!
As with any WHERE condition in SQL: if there is no index, the database has to go through all rows of the table to find those that satisfy your condition.
You can either index a specific expression, or you can index the whole jsonb value using a GIN index, which then enables Postgres to use the index whenever one of the supported operators is used.
If you always check for the city, you can create a regular B-Tree index:
create index on a ( (info->>'city') );
If you don't know what you will be looking for, a GIN index might be a better choice:
create index on a using gin (info);
But you will need to change your query to use one of the operators that are supported by a GIN index, e.g. the containment operator @>:
select *
from a
where info @> '{"city": "portland"}'::jsonb;
Note that an index lookup is not always the most efficient solution. Sometimes it's faster to simply go through all rows, sometimes the index lookup is faster.
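To see which plan Postgres actually chooses for your data, run EXPLAIN. A minimal sketch against the table above:
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   a
WHERE  info @> '{"city": "portland"}'::jsonb;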
If you want to learn more about indexes in relational databases, go through the material here: http://use-the-index-luke.com/
Suppose I have key/value/timerange tuples, e.g.:
CREATE TABLE historical_values(
    key       TEXT,
    value     NUMERIC,
    from_time TIMESTAMPTZ,
    to_time   TIMESTAMPTZ
);
and would like to be able to efficiently query values (sorted descending) for a specific key and time, e.g.:
SELECT value
FROM historical_values
WHERE
key = [KEY]
AND from_time <= [TIME]
AND to_time >= [TIME]
ORDER BY value DESC
What kind of index/types should I use to get the best lookup performance? I suspect my solution will involve a tstzrange and a GiST index, but I'm not sure how to make that play well with the key matching and value ordering requirements.
Edit: Here's some more information about usage.
Ideally uses features available in Postgres v9.6.
Relation will contain approx. 1k keys and 5m values per key. Values are large integers (up to 32 bytes), mostly unique. Time ranges between few hours to a couple years. Time horizon is 5 years. No NULL values allowed, but some time ranges are open-ended (could either use NULL or a time far into the future for to_time).
The primary key is the key and time range (as there is only one historical value for a time range, per key).
Common operations are a) updating to_time to "close" a historical value, and b) inserting a new value with from_time = NOW.
All values may be queried. Partitioning is an option.
DB design
For a big table like that ("1k keys and 5m values per key"), I would suggest optimizing storage like this:
CREATE TABLE hist_keys (
   key_id serial PRIMARY KEY
 , key    text NOT NULL UNIQUE
);

CREATE TABLE hist_values (
   hist_value_id bigserial PRIMARY KEY  -- optional, see below!
 , key_id        int NOT NULL REFERENCES hist_keys
 , value         numeric
 , from_time     timestamptz NOT NULL
 , to_time       timestamptz NOT NULL
 , CONSTRAINT range_valid CHECK (from_time <= to_time)  -- or < ?
);
Also helps index performance.
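For the two common operations from the question, a sketch against this schema, assuming open-ended ranges are stored as 'infinity' (one of the options the question mentions):
-- b) insert a new value, open-ended for now:
INSERT INTO hist_values (key_id, value, from_time, to_time)
VALUES (1, 42, now(), 'infinity');

-- a) later, "close" the open value for this key:
UPDATE hist_values
SET    to_time = now()
WHERE  key_id = 1
AND    to_time = 'infinity';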
And consider partitioning: list-partitioning on key_id. Maybe even add sub-partitioning (range partitioning this time) on from_time. Read the manual here.
With one partition per key_id (and constraint exclusion enabled!), Postgres would only look at the small partition (and index) for the given key, instead of the whole big table. Major win.
But I would strongly suggest upgrading to at least Postgres 10 first, which added "declarative partitioning". It makes managing partitions a lot easier.
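A minimal sketch of declarative list-partitioning (Postgres 10+); names are assumed and the constraints from above are omitted for brevity:
CREATE TABLE hist_values_part (
   key_id    int NOT NULL
 , value     numeric
 , from_time timestamptz NOT NULL
 , to_time   timestamptz NOT NULL
) PARTITION BY LIST (key_id);

-- one partition per key_id:
CREATE TABLE hist_values_part_1 PARTITION OF hist_values_part FOR VALUES IN (1);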
Better yet, skip forward to Postgres 11 (currently beta), which adds major improvements for partitioning (incl. performance improvements). Most notably, for your goal to get the best lookup performance, quoting the chapter on partitioning in the release notes:
Allow faster partition elimination during query processing (Amit Langote, David Rowley, Dilip Kumar)
This speeds access to partitioned tables with many partitions.
Allow partition elimination during query execution (David Rowley, Beena Emerson)
Previously partition elimination could only happen at planning time,
meaning many joins and prepared queries could not use partition elimination.
Index
From the perspective of the value column, the small subset of selected rows is arbitrary for every new query. I don't expect you'll find a useful way to support ORDER BY value DESC with an index. I'd concentrate on the other columns. Maybe add value as the last column to each index if you can get index-only scans out of it (possible for B-tree and GiST).
Without partitioning:
CREATE UNIQUE INDEX hist_btree_idx ON hist_values (key_id, from_time, to_time DESC);
UNIQUE is optional, but see below.
Note the importance of opposing sort orders for from_time and to_time. See (closely related!):
Optimizing queries on a range of timestamps (two columns)
This is almost the same index as the one implementing your PK on (key_id, from_time, to_time). Unfortunately, we cannot use it as PK index. Quoting the manual:
Also, it must be a b-tree index with default sort ordering.
So I added a bigserial column as surrogate primary key in my suggested table design above, plus NOT NULL constraints and the UNIQUE index to enforce your uniqueness rule.
In Postgres 10 or later consider an IDENTITY column instead:
Auto increment table column
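A sketch of the same table with an identity column replacing the bigserial (Postgres 10+):
CREATE TABLE hist_values (
   hist_value_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY
 , key_id        int NOT NULL REFERENCES hist_keys
 , value         numeric
 , from_time     timestamptz NOT NULL
 , to_time       timestamptz NOT NULL
 , CONSTRAINT range_valid CHECK (from_time <= to_time)
);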
You might even do without the PK constraint in this exceptional case, to avoid duplicating the index and keep the table at minimum size. It depends on the complete situation. You may need it for FK constraints or similar. See:
How does PostgreSQL enforce the UNIQUE constraint / what type of index does it use?
A GiST index, like you already suspected, may be even faster. I suggest keeping your original timestamptz columns in the table (16 bytes instead of 32 bytes for a tstzrange) and adding key_id, after installing the additional module btree_gist:
CREATE INDEX hist_gist_idx ON hist_values
USING GiST (key_id, tstzrange(from_time, to_time, '[]'));
The expression tstzrange(from_time, to_time, '[]') constructs a range including upper and lower bound. Read the manual here.
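If not installed yet, the module is added once per database:
CREATE EXTENSION btree_gist;  -- provides GiST operator classes for plain types like int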
Your query needs to match the index:
SELECT value
FROM hist_values
WHERE key_id = [KEY_ID]
AND tstzrange(from_time, to_time, '[]') @> [TIME]::timestamptz
ORDER BY value DESC;
It's equivalent to your original (with key resolved to key_id).
@> being the range containment operator.
With list-partitioning on key_id
With a separate table for each key_id, we can omit key_id from the index, improving size and performance, especially for the GiST index, for which we then also don't need the additional module btree_gist. This results in ~ 1000 partitions and the corresponding indexes:
-- one such index per partition, e.g. for a partition hist_values_999 (name assumed):
CREATE INDEX hist999_gist_idx ON hist_values_999 USING GiST (tstzrange(from_time, to_time, '[]'));
Related:
Store the day of the week and time?
I have a postgres table with 3 fields:
a : postgis geometry
b : array varchar[]
c : integer
and I have a query that involves all of them. I would like to add a multicolumn index to speed it up, but I cannot, as the 3 fields cannot go under the same index because of their nature.
What is the strategy in this case? Do I add 3 indexes (GiST, GIN and B-tree), and will Postgres use them all during the query?
Single-column index
Postgres can combine multiple indexes very efficiently in a single query with bitmap index scans. Most of the time, the most selective index is picked (or two, combined with bitmap index scans) and the rest is filtered. Once the result set is narrow enough, it's not efficient to scan more indexes.
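For illustration, the three single-column indexes for the table from the question could look like this (table and index names assumed):
CREATE INDEX tbl_a_gist_idx ON tbl USING gist (a);  -- PostGIS geometry
CREATE INDEX tbl_b_gin_idx  ON tbl USING gin  (b);  -- varchar[] array
CREATE INDEX tbl_c_idx      ON tbl (c);             -- plain B-tree for the integer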
Multicolumn index
It is still faster to have a perfectly matching multicolumn index, but not by orders of magnitude.
Since you want to include an array type, I suggest using a GIN index. AFAIK, operator classes are missing for general-purpose GiST indexes on array types. (The exception being intarray for integer arrays.)
To include the integer column, first install the additional module btree_gin, which provides the necessary GIN operator classes. Run once per database:
CREATE EXTENSION btree_gin;
Then you should be able to create a multicolumn index covering the array and the integer column (the geometry column has no GIN operator class, so it cannot be included; see the nearest-neighbour section below):
CREATE INDEX tbl_bc_gin_idx ON tbl USING GIN (b, c);
The order of index columns is irrelevant for GIN indexes. The manual:
A multicolumn GIN index can be used with query conditions that involve
any subset of the index's columns. Unlike B-tree or GiST, index search
effectiveness is the same regardless of which index column(s) the
query conditions use.
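For example, a query filtering only on the array and the integer column could still use this index (the values are made up):
SELECT *
FROM   tbl
WHERE  b @> ARRAY['some_tag']::varchar[]  -- array containment, native GIN support
AND    c = 5;                             -- integer equality, supported via btree_gin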
Nearest neighbour search
Since you are including a PostGIS geometry type, chances are you want to do a nearest-neighbour search, for which you need a GiST index. In this case I suggest two indexes:
CREATE INDEX tbl_ac_gist_idx ON tbl USING GiST(a, c); -- geometry type
CREATE INDEX tbl_bc_gin_idx ON tbl USING GIN(b, c);
You could add the integer column c to either or both. It depends.
For that, you need either btree_gin or btree_gist or both, respectively.
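With the GiST index in place, a nearest-neighbour query can use the PostGIS distance operator <-> for index-assisted ordering. A sketch, assuming point geometries in SRID 4326:
SELECT *
FROM   tbl
ORDER  BY a <-> ST_SetSRID(ST_MakePoint(-71.06, 42.36), 4326)  -- distance to a given point
LIMIT  10;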
the 3 fields cannot go under the same index because of their nature
The geometry and the integer columns can go under the same GiST index using the btree_gist module; the array column still needs a separate GIN index (see above).