PostgreSQL Trigram indexes vs btree - postgresql

I have this searching query for my cake database, which is currently very slow and I'm looking to improve it. I am running PostgreSQL v. 9.6.
Table structure:
Table: Cakes
=====
id int
cake_id varchar
cake_short_name varchar
cake_full_name varchar
has_recipe boolean
createdAt DateTime
updatedAt DateTime
Table: CakeViews
=========
id int
cake_id varchar
createdAt DateTime
updatedAt DateTime
Query:
WITH myconstants (myquery, queryLiteral) as (
values ('%a%', 'a')
)
select
full_count,
cake_id,
cake_short_name,
cake_full_name,
has_recipe,
views
from (
select
count(*) OVER() AS full_count,
cake_id,
cake_short_name,
cake_full_name,
has_recipe,
cast((select count(*) from "CakeViews" cv where cv."createdAt" > CURRENT_DATE - 3 and c.cake_id = cv.cake_id) as integer) as views
from "Cakes" c, myconstants
where has_recipe = true
and (cake_full_name ilike myquery or cake_short_name ilike myquery)
or cake_full_name ilike lower(queryLiteral) or cake_short_name ilike lower(queryLiteral)) t, myconstants
order by views desc,
case
when cake_short_name ilike lower(queryLiteral) then 1
when cake_full_name ilike lower(queryLiteral) then 1
end,
case
when has_recipe = true and cake_short_name ilike myquery then length(cake_short_name)
when has_recipe = true and cake_full_name ilike myquery then length(cake_full_name)
end
limit 10
I have ideas for the following indices, but they don't speed up the query that much:
CREATE EXTENSION pg_trgm;
CREATE INDEX idx_cakes_cake_short_name_lower ON public."Cakes" (lower(cake_short_name) varchar_pattern_ops);
CREATE INDEX idx_cakes_cake_id ON public."Cakes" (cake_id);
CREATE INDEX idx_cakeviews_cake_id ON public."CakeViews" (cake_id);
CREATE INDEX idx_cakes_cake_short_name ON public."Cakes" USING gin (cake_short_name gin_trgm_ops);
CREATE INDEX idx_cakes_cake_full_name ON public."Cakes" USING gin (cake_full_name gin_trgm_ops);
Questions:
What indices would be better or which am I missing?
Is my query inefficient?
EDIT: Explain Analyze output: here

The pattern '%a%' doesn't contain any complete trigram, so the trigram indexes cannot be used there and the whole table must be scanned. But if you used a longer search string, they might be useful.
The index on "CakeViews" (cake_id) would be better if it were on "CakeViews" (cake_id, "createdAt"). Except that none of your cakes seem to have any views, so if that is generally the case I guess it wouldn't matter.
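In DDL form (the index name here is my choice, not from the original answer), that suggestion would look like:

```sql
-- Equality on cake_id plus a range condition on "createdAt"
-- can then be answered by a single index scan.
CREATE INDEX idx_cakeviews_cake_id_createdat
    ON public."CakeViews" (cake_id, "createdAt");
```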

Related

PostgreSQL: optimization of query for select with overlap

I created the following query to select data with overlapping periods (for campaigns that have the same business identifier):
select
campaign_instance_1.campaign_id,
campaign_instance_1.start_time
from campaign_instance as campaign_instance_1
inner join campaign_instance as campaign_instance_2
on campaign_instance_1.campaign_id = campaign_instance_2.campaign_id
and (
(campaign_instance_1.start_time between campaign_instance_2.start_time and campaign_instance_2.finish_time)
or (campaign_instance_1.finish_time between campaign_instance_2.start_time and campaign_instance_2.finish_time)
or (campaign_instance_1.start_time<campaign_instance_2.start_time and campaign_instance_1.finish_time>campaign_instance_2.finish_time)
or (campaign_instance_1.start_time>campaign_instance_2.start_time and campaign_instance_1.finish_time<campaign_instance_2.finish_time))
With index, created as:
CREATE INDEX IF NOT EXISTS camp_inst_idx_campaign_id_and_finish_time
ON public.campaign_instance_without_index USING btree
(campaign_id ASC NULLS LAST, finish_time DESC NULLS LAST)
TABLESPACE pg_default;
Even at 100,000 rows it runs very slowly: 43 seconds!
To optimize it, I tried adding start_time to the index:
(campaign_id ASC NULLS LAST, finish_time DESC NULLS LAST, start_time DESC NULLS LAST)
But the result is the same.
As I understand the EXPLAIN ANALYZE output, start_time is not used in the Index Condition:
I tried the query with this index on both 10,000 and 100,000 rows, so it does not seem to depend on sample size (at least at these scales).
Source table contains the following structure:
campaign_id bigint,
fire_time bigint,
start_time bigint,
finish_time bigint,
recap character varying,
details json
Why is my index not used, and what are possible ways to improve the query?
Joining campaign_instance to itself doesn't really serve any purpose here other than an existence check, and presumably your intention is not to get duplicates back for matching records. So you could simplify the query with EXISTS or a LATERAL join. Your join condition on time can also be simplified; you seem to be looking for overlapping time intervals:
select campaign_id,start_time
from campaign_instance c1
where exists( select * from campaign_instance c2
where c1.campaign_id = c2.campaign_id
and (c1.start_time <= c2.finish_time and c1.finish_time >= c2.start_time));
The overlap test should probably use < and > instead of <= and >=, but I don't know your exact requirements; BETWEEN implicitly uses <= and >=.
EDIT: Ensure that the match is not the row itself:
(This table should have a primary key to make things easier. As it doesn't, I assume there is no duplication across campaign_id, start_time and finish_time, so that combination can serve as a composite key.)
select campaign_id,start_time
from campaign_instance c1
where exists( select * from campaign_instance c2
where c1.campaign_id = c2.campaign_id
and (c1.start_time != c2.start_time or c1.finish_time != c2.finish_time)
and (c1.start_time <= c2.finish_time and c1.finish_time >= c2.start_time));
This takes around 230-250 milliseconds on my system (iMac i5 7500, 3.4 Ghz, 64 Gb mem).
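Another option worth sketching (my assumption, not something from the original answer): since start_time and finish_time are bigint, the interval can be indexed as an int8range with a GiST index, letting the overlap operator && express the condition directly. The btree_gist extension is needed so the plain bigint campaign_id can share the GiST index:

```sql
-- Sketch: assumes start_time <= finish_time in every row.
CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE INDEX camp_inst_gist_overlap
    ON campaign_instance
    USING gist (campaign_id, int8range(start_time, finish_time, '[]'));

-- The EXISTS query above, rewritten with the overlap operator:
SELECT c1.campaign_id, c1.start_time
FROM campaign_instance c1
WHERE EXISTS (
    SELECT 1
    FROM campaign_instance c2
    WHERE c2.campaign_id = c1.campaign_id
      AND (c1.start_time <> c2.start_time OR c1.finish_time <> c2.finish_time)
      AND int8range(c2.start_time, c2.finish_time, '[]')
          && int8range(c1.start_time, c1.finish_time, '[]')
);
```

With inclusive '[]' bounds this is equivalent to the <= / >= overlap test; whether it beats the btree approach depends on the data distribution.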

Does a clustered index on time increase the speed of a query where we want the max time grouped by a certain id?

Consider the following query
SELECT my_id, my_info FROM my_table as r
JOIN (
SELECT my_id, max(my_time) as max_time FROM my_table
WHERE my_time > timestamp '2019-01-10 00:00:00'
GROUP BY my_id) as k
ON k.my_id = r.my_id and k.max_time = r.my_time
And the following table
my_table
my_id [text, secondary index]
my_info [arbitrary]
my_time [timestamp with timezone, clustered index]
I think the most efficient plan, if the cardinality of my_id is not big, would be the following:
Get the set of all unique my_id values from the secondary index.
Scan the entire table from the first row (guaranteed to have the highest timestamp due to clustering) and fetch my_info for each my_id not fetched before.
I am not sure if Postgres does exactly that, but I am interested in knowing whether having a clustered index helps with my original query.
If the answer is no, is there a way to increase the speed of the query above, given the table structure?
I believe the clustered index should assist the filtering predicate WHERE my_time > timestamp '2019-01-10 00:00:00', but you need to look at the explain plans to determine how the query is actually handled. You might also want to consider a window-function approach instead:
SELECT k.my_id, k.my_info
FROM (
SELECT my_id, my_info,
ROW_NUMBER() OVER (PARTITION BY my_id ORDER BY my_time DESC) AS rn
FROM my_table
WHERE my_time > timestamp '2019-01-10 00:00:00'
) AS k
WHERE k.rn = 1
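For completeness, Postgres also offers DISTINCT ON as a concise way to express "latest row per my_id" (a sketch on my part, not from the original answer):

```sql
-- One row per my_id: the one with the greatest my_time in range.
-- An index on (my_id, my_time DESC) can support this ordering.
SELECT DISTINCT ON (my_id) my_id, my_info
FROM my_table
WHERE my_time > timestamp '2019-01-10 00:00:00'
ORDER BY my_id, my_time DESC;
```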

Cassandra filter with ordering query modeling

I am new to Cassandra and I am trying to model a table in Cassandra. My queries look like the following
Query #1: select * from TableA where Id = "123"
Query #2: select * from TableA where name="test" orderby startTime DESC
Query #3: select * from TableA where state="running" orderby startTime DESC
I have been able to build the table for Query #1 which looks like
val tableAStatement = SchemaBuilder.createTable("tableA").ifNotExists.
addPartitionKey(Id, DataType.uuid).
addColumn(Name, DataType.text).
addColumn(StartTime, DataType.timestamp).
addColumn(EndTime, DataType.timestamp).
addColumn(State, DataType.text)
session.execute(tableAStatement)
but for Query #2 and Query #3 I have tried many different things and failed; every time I get stuck on a different error from Cassandra.
Considering the above queries, what would be the right table model? What is the right way to model such queries?
Query #2: select * from TableB where name="test"
CREATE TABLE TableB (
name text,
start_time timestamp,
PRIMARY KEY (name, start_time)
) WITH CLUSTERING ORDER BY (start_time DESC)
Query #3: select * from TableC where state="running"
CREATE TABLE TableC (
state text,
start_time timestamp,
PRIMARY KEY (state, start_time)
) WITH CLUSTERING ORDER BY (start_time DESC)
In Cassandra you model your tables around your queries; data denormalization and duplication are expected. Notice the clustering order: this way you can omit the ORDER BY in your query.
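As a quick usage sketch (assuming the TableB definition above), Query #2 then needs no explicit ordering:

```sql
-- Rows come back in start_time DESC order automatically,
-- because that is TableB's clustering order.
SELECT * FROM TableB WHERE name = 'test';
```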

Optimizing SQL query with multiple joins and grouping (Postgres 9.3)

I've browsed around some other posts and managed to make my queries run a bit faster. However, I'm at a loss as to how to further optimize this query. I'm going to use it on a website, where it executes when the page loads, but 5.5 seconds is far too long to wait for something that should be much simpler. The largest table has around 4,000,000 rows and the other ones around 400,000 each.
Table Structure
match
id BIGINT PRIMARY KEY,
region TEXT,
matchType TEXT,
matchVersion TEXT
team
matchid BIGINT REFERENCES match(id),
id INTEGER,
PRIMARY KEY(matchid, id),
winner TEXT
champion
id INTEGER PRIMARY KEY,
version TEXT,
name TEXT
item
id INTEGER PRIMARY KEY,
name TEXT
participant
PRIMARY KEY(matchid, id),
id INTEGER NOT NULL,
matchid BIGINT REFERENCES match(id),
championid INTEGER REFERENCES champion(id),
teamid INTEGER,
FOREIGN KEY (matchid, teamid) REFERENCES team(matchid, id),
magicDamageDealtToChampions REAL,
damageDealtToChampions REAL,
item0 TEXT,
item1 TEXT,
item2 TEXT,
item3 TEXT,
item4 TEXT,
item5 TEXT,
highestAchievedSeasonTier TEXT
Query
select champion.name,
sum(case when participant.item0 = '3285' then 1::int8 else 0::int8 end) as it0,
sum(case when participant.item1 = '3285' then 1::int8 else 0::int8 end) as it1,
sum(case when participant.item2 = '3285' then 1::int8 else 0::int8 end) as it2,
sum(case when participant.item3 = '3285' then 1::int8 else 0::int8 end) as it3,
sum(case when participant.item4 = '3285' then 1::int8 else 0::int8 end) as it4,
sum(case when participant.item5 = '3285' then 1::int8 else 0::int8 end) as it5
from participant
left join champion
on champion.id = participant.championid
left join team
on team.matchid = participant.matchid and team.id = participant.teamid
left join match
on match.id = participant.matchid
where (team.winner = 'True' and matchversion = '5.14' and matchtype='RANKED_SOLO_5x5')
group by champion.name;
Output of EXPLAIN ANALYZE: http://explain.depesz.com/s/ZYX
What I've done so far
I've created separate indexes on match.region and participant.championid, and a partial index on team where winner = 'True' (since that is all I am interested in). Note that enable_seqscan = on, since when it's off the query is extremely slow. Essentially, the result I'm trying to get is something like this:
Champion |item0 | item1 | ... | item5
champ_name | num | num1 | ... | num5
...
Since I'm still a beginner with respect to database design, I wouldn't be surprised if there is a flaw in my overall table design. I'm still leaning towards the query being absolutely inefficient, though. I've played with both inner joins and left joins -- there is no significant difference though. Additionally, match needs to be bigint (or something larger than integer, since it's too small).
Database design
I suggest:
CREATE TABLE matchversion (
matchversion_id int PRIMARY KEY
, matchversion text UNIQUE NOT NULL
);
CREATE TABLE matchtype (
matchtype_id int PRIMARY KEY
, matchtype text UNIQUE NOT NULL
);
CREATE TABLE region (
region_id int PRIMARY KEY
, region text NOT NULL
);
CREATE TABLE match (
match_id bigint PRIMARY KEY
, region_id int REFERENCES region
, matchtype_id int REFERENCES matchtype
, matchversion_id int REFERENCES matchversion
);
CREATE TABLE team (
match_id bigint REFERENCES match
, team_id integer -- better name !
, winner boolean -- ?!
, PRIMARY KEY(match_id, team_id)
);
CREATE TABLE champion (
champion_id int PRIMARY KEY
, version text
, name text
);
CREATE TABLE participant (
participant_id serial PRIMARY KEY -- use proper name !
, champion_id int NOT NULL REFERENCES champion
, match_id bigint NOT NULL REFERENCES match -- this FK might be redundant
, team_id int
, magic_damage_dealt_to_champions real
, damage_dealt_to_champions real
, item0 text -- or integer ??
, item1 text
, item2 text
, item3 text
, item4 text
, item5 text
, highest_achieved_season_tier text -- integer ??
, FOREIGN KEY (match_id, team_id) REFERENCES team
);
More normalization in order to get smaller tables and indexes and faster access. Create lookup-tables for matchversion, matchtype and region and only write a small integer ID in match.
Seems like the columns participant.item0 .. item5 and highestAchievedSeasonTier could be integer, but are defined as text?
The column team.winner seems to be boolean, but is defined as text.
I also changed the order of columns to be more efficient. Details:
Calculating and saving space in PostgreSQL
Query
Building on the above modifications, and for Postgres 9.3:
SELECT c.name, *
FROM (
SELECT p.champion_id
, count(p.item0 = '3285' OR NULL) AS it0
, count(p.item1 = '3285' OR NULL) AS it1
, count(p.item2 = '3285' OR NULL) AS it2
, count(p.item3 = '3285' OR NULL) AS it3
, count(p.item4 = '3285' OR NULL) AS it4
, count(p.item5 = '3285' OR NULL) AS it5
FROM matchversion mv
CROSS JOIN matchtype mt
JOIN match m USING (matchtype_id, matchversion_id)
JOIN team t USING (match_id)
JOIN participant p USING (match_id, team_id)
WHERE mv.matchversion = '5.14'
AND mt.matchtype = 'RANKED_SOLO_5x5'
AND t.winner = 'True' -- should be boolean
GROUP BY p.champion_id
) p
JOIN champion c USING (champion_id); -- probably just JOIN ?
Since champion.name is not defined UNIQUE, it's probably wrong to GROUP BY it. It's also inefficient. Use participant.championid instead (and join to champion later if you need the name in the result).
All instances of LEFT JOIN are pointless, since you have predicates on the left tables anyway and / or use the column in GROUP BY.
Parentheses around AND-ed WHERE conditions are not needed.
In Postgres 9.4 or later you could use the new aggregate FILTER syntax instead. Details and alternatives:
How can I simplify this game statistics query?
Index
The partial index on team you already have should look like this to allow index-only scans:
CREATE INDEX on team (matchid, id) WHERE winner -- boolean
But from what I see, you might just add a winner column to participant and drop the table team completely (unless there is more to it).
Also, that index is not going to help much, because (telling from your query plan) the table has 800k rows, half of which qualify:
rows=399999 ... Filter: (winner = 'True'::text) ... Rows Removed by Filter: 399999
This index on match will help a little more (later) when you have more different matchtypes and matchversions:
CREATE INDEX on match (matchtype_id, matchversion_id, match_id);
Still, while 100k rows out of 400k qualify, the index is only useful for an index-only scan; otherwise a sequential scan will be faster. An index typically pays off when selecting about 5 % of the table or less.
Your main problem is that you are obviously running a test case with hardly realistic data distribution. With more selective predicates, indexes will be used more readily.
Aside
Make sure you have configured basic Postgres settings like random_page_cost or work_mem etc.
enable_seqscan = on goes without saying. It is only turned off for debugging, or locally as a desperate measure of last resort.
I'd try using
count(*) filter (where item0 = '3285') as it0
for your counts instead of sums.
Also, why are you left-joining your last two tables and then filtering them in the WHERE clause? That defeats the purpose of the outer join, and a regular inner join is faster:
select champion.name,
count(*) filter (where participant.item0 = '3285') as it0,
count(*) filter (where participant.item1 = '3285') as it1,
count(*) filter (where participant.item2 = '3285') as it2,
count(*) filter (where participant.item3 = '3285') as it3,
count(*) filter (where participant.item4 = '3285') as it4,
count(*) filter (where participant.item5 = '3285') as it5
from participant
join champion on champion.id = participant.championid
join team on team.matchid = participant.matchid and team.id = participant.teamid
join match on match.id = participant.matchid
where (team.winner = 'True' and matchversion = '5.14' and matchtype='RANKED_SOLO_5x5')
group by champion.name;

Indexing to reduce cost of SORT

I have this table:
TopScores
Username char(255)
Score int
DateAdded datetime2
which will have a lot of rows.
I run the following query (the body of a stored procedure) against it to get the top 5 high scorers, plus the score for a particular Username together with the person directly above them in position and the person directly below:
WITH Rankings
AS (SELECT Row_Number() OVER (ORDER BY Score DESC, DateAdded DESC) AS Pos,
--if score same, latest date higher
Username,
Score
FROM TopScores)
SELECT TOP 5 Pos,
Username,
Score
FROM Rankings
UNION ALL
SELECT Pos,
Username,
Score
FROM Rankings
WHERE Pos BETWEEN (SELECT Pos FROM Rankings WHERE Username = @User) - 1
              AND (SELECT Pos FROM Rankings WHERE Username = @User) + 1
I had to index the table so I added clustered: ci_TopScores(Username) first and nonclustered: nci_TopScores(Dateadded, Score).
Query plan showed that clustered was completely ignored (before I created the nonclustered I tested and it was used by the query), and logical reads were more (as compared to a table scan without any index).
Sort was the highest costing operator. So I adjusted indexes to clustered: ci_TopScores(Score desc, Dateadded desc) and nonclustered: nci_TopScores(Username).
Still sort costs the same. Nonclustered: nci_TopScores(Username) is completely ignored again.
How can I avoid the high cost of sort and index this table effectively?
The CTE does not use Username, so it is no surprise it does not use that index.
A CTE is just syntax: you are evaluating that CTE four times.
Try a #temp table so it is only evaluated once.
But you need to think about the indexes.
I would skip the ROW_NUMBER and just put an identity primary key on the #temp to serve as Pos.
I would skip any other indexes on #temp.
For TopScores, an index on (Score DESC, DateAdded DESC, Username ASC) will help.
But it won't help if it is fragmented, and that is an index that will fragment as you insert.
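In DDL form (the index name is my assumption), the suggested TopScores index would be:

```sql
-- Supports the ORDER BY used when loading #temp below;
-- as noted, it will still fragment on insert.
CREATE INDEX nci_TopScores_Score_DateAdded_Username
    ON TopScores (Score DESC, DateAdded DESC, Username ASC);
```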
create table #temp (
pos int identity(1,1) primary key, -- the identity serves as the ranking position
Score int,
DateAdded datetime2,
Username char(255)
)

insert into #temp (Score, DateAdded, Username)
select Score, DateAdded, Username
from TopScores
order by Score desc, DateAdded desc, Username asc

select top 5 *
from #temp
order by pos
union
select three.*
from #temp
join #temp as three
on #temp.Username = @User
and abs(three.pos - #temp.pos) <= 1
So what if there is a table scan on #temp.Username?
One scan does not take as long as creating an index.
That index would be severely fragmented anyway.