Designing a SQL view and improving performance - tsql

I have a requirement to create a view; the business scenario is explained below.
Consider that I have two tables: Products (all product information) and Settings (settings for a country/state/city).
Now I have to create a view that returns product information filtered by those settings. Cities, states, and countries may each have their own settings.
Design of the view
It means I first need to check:
1. If any city has its own custom settings, output those records
UNION ALL
2. If any state has its own custom settings, output those records, excluding the cities under this state that were already output in step 1
UNION ALL
3. If any country has its own custom settings, output those records, excluding the city and state records from steps 1 and 2
This is the design I came up with. Is there anything wrong with it?
Improving performance
With this existing design, a query takes 5 minutes to run, without any indexes on the view or the base tables.
What is the best option for improving the performance?
Creating an indexed view, or creating indexes on the base tables? Which one would make the query run in seconds? :)
Sample Data
Product Table
Settings table
Expected Output

I can't work out why your (P2 - Blue) result is showing. I re-wrote your samples as SQL, and created what I thought you wanted (whilst waiting for your expected output), and mine only produces one row (P1 - Red)
create table dbo.Product (
ProductID int not null,
Name char(2) not null,
StateId char(2) not null,
CityId char(2) not null,
CountryId char(2) not null,
Price int not null,
Colour varchar(10) not null,
constraint PK_Product PRIMARY KEY (ProductID)
)
go
insert into dbo.Product (ProductID,Name,StateId,CityId,CountryId,Price,Colour)
select 1,'P1','S1','C1','C1',150,'Red' union all
select 2,'P2','S2','C2','C1',100,'Blue' union all
select 3,'P3','S1','C3','C1',200,'Green'
go
create table dbo.Settings (
SettingsID int not null,
StateId char(2) null,
CityId char(2) null,
CountryId char(2) null,
MaxPrice int not null,
MinPrice int not null,
constraint PK_Settings PRIMARY KEY (SettingsID)
)
go
insert into dbo.Settings (SettingsID,StateId,CityId,CountryId,MaxPrice,MinPrice)
select 1,null,null,'C1',1000,150 union all
select 2,'S1',null,'C1',2000,100 union all
select 3,'S1','C3','C1',3000,300
go
And now the actual view:
create view dbo.Products_Filtered
with schemabinding
as
with MatchedSettings as (
select p.ProductID,MAX(MinPrice) as MinPrice,MIN(MaxPrice) as MaxPrice
from
dbo.Product p
inner join
dbo.Settings s
on
(p.CountryId = s.CountryId or s.CountryId is null) and
(p.CityId = s.CityId or s.CityId is null) and
(p.StateId = s.StateId or s.StateId is null)
group by
p.ProductID
)
select
p.ProductID,p.Name,p.CityID,p.StateId,p.CountryId,p.Price,p.Colour
from
dbo.Product p
inner join
MatchedSettings ms
on
p.ProductID = ms.ProductID and
p.Price between ms.MinPrice and ms.MaxPrice
What I did was to combine all applicable settings, and then assumed that we applied the most restrictive settings (so take the MAX MinPrice specified and MIN MaxPrice).
Using those rules, the (P2 - Blue) row is ruled out, since the only applicable setting is setting 1 - which has a Min price of 150.
If I reverse it, so that we try to be as inclusive as possible (MIN MinPrice and MAX MaxPrice), then that returns (P1 - Red) and (P3 - Green) - but still not (P2 - Blue)
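To check both aggregation rules against the sample data, here is a minimal sketch that loads the same rows into an in-memory SQLite database from Python. SQLite and the helper name filtered_products are assumptions made purely for illustration; the view above targets SQL Server, so schemabinding and the dbo schema are left out here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Product (
    ProductID INTEGER PRIMARY KEY, Name TEXT, StateId TEXT, CityId TEXT,
    CountryId TEXT, Price INTEGER, Colour TEXT);
INSERT INTO Product VALUES
    (1, 'P1', 'S1', 'C1', 'C1', 150, 'Red'),
    (2, 'P2', 'S2', 'C2', 'C1', 100, 'Blue'),
    (3, 'P3', 'S1', 'C3', 'C1', 200, 'Green');
CREATE TABLE Settings (
    SettingsID INTEGER PRIMARY KEY, StateId TEXT, CityId TEXT,
    CountryId TEXT, MaxPrice INTEGER, MinPrice INTEGER);
INSERT INTO Settings VALUES
    (1, NULL, NULL, 'C1', 1000, 150),
    (2, 'S1', NULL, 'C1', 2000, 100),
    (3, 'S1', 'C3', 'C1', 3000, 300);
""")

# Same join as the view: a NULL in Settings acts as a wildcard for that level.
QUERY = """
WITH MatchedSettings AS (
    SELECT p.ProductID,
           {lo}(s.MinPrice) AS MinPrice,
           {hi}(s.MaxPrice) AS MaxPrice
    FROM Product p
    JOIN Settings s
      ON (p.CountryId = s.CountryId OR s.CountryId IS NULL)
     AND (p.CityId = s.CityId OR s.CityId IS NULL)
     AND (p.StateId = s.StateId OR s.StateId IS NULL)
    GROUP BY p.ProductID
)
SELECT p.Name
FROM Product p
JOIN MatchedSettings ms
  ON p.ProductID = ms.ProductID
 AND p.Price BETWEEN ms.MinPrice AND ms.MaxPrice
ORDER BY p.ProductID
"""

def filtered_products(lo, hi):
    """Hypothetical helper: lo/hi name the aggregates applied to MinPrice/MaxPrice."""
    return [row[0] for row in conn.execute(QUERY.format(lo=lo, hi=hi))]

restrictive = filtered_products("MAX", "MIN")  # tightest price window
inclusive = filtered_products("MIN", "MAX")    # widest price window
print(restrictive, inclusive)
```

With the sample rows, the restrictive rule keeps only P1 and the inclusive rule keeps P1 and P3, which agrees with the reasoning above: P2 matches only setting 1, whose minimum price of 150 excludes it either way.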

Related

Optional filter on a column of an outer joined table in the where clause

I have got two tables:
create table student
(
studentid bigint primary key not null,
name varchar(200) not null
);
create table courseregistration
(
studentid bigint not null,
coursenamename varchar(200) not null,
isfinished boolean default false
);
--insert some data
insert into student values(1,'Dave');
insert into courseregistration values(1,'SQL',true);
The student is fetched by id, so it should always be returned in the result. A matching entry in courseregistration is optional: if there are matching rows, they should be returned, filtered on isfinished=false. In other words, I want the course registrations that are not finished yet. I tried to outer join student with courseregistration and filter courseregistration on isfinished=false. Note that I still want to retrieve the student.
Trying this returns no rows:
select * from student
left outer join courseregistration using(studentid)
where studentid = 1
and courseregistration.isfinished = false
What I want in the example above is a result set with one row for the student, with the course columns null (because the only registration has isfinished=true). One more constraint: if there is no corresponding row in courseregistration at all, there should still be a result for the student entry.
This is an adjusted example. I can tweak my code to solve the problem, but I really wonder what the "correct/smart way" of solving this in PostgreSQL is.
PS I have used the (+) in Oracle previously to solve similar issues.
Isn't this what you are looking for?
select * from student s
left outer join courseregistration cr
on s.studentid = cr.studentid
and cr.isfinished = false
where s.studentid = 1
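The difference between filtering in WHERE and filtering in ON can be reproduced with a minimal sketch. It uses Python's sqlite3 module with an in-memory database as a stand-in for PostgreSQL (an assumption for illustration); isfinished is stored as 0/1, and the column is called coursename here rather than coursenamename.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (studentid INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE courseregistration (
    studentid INTEGER NOT NULL,
    coursename TEXT NOT NULL,
    isfinished INTEGER DEFAULT 0   -- 0/1 stands in for boolean
);
INSERT INTO student VALUES (1, 'Dave');
INSERT INTO courseregistration VALUES (1, 'SQL', 1);
""")

# Filter in WHERE: applied after the outer join, so the NULL-extended row
# (and with it the student) is thrown away.
where_rows = conn.execute("""
    SELECT s.name, cr.coursename
    FROM student s
    LEFT JOIN courseregistration cr ON s.studentid = cr.studentid
    WHERE s.studentid = 1 AND cr.isfinished = 0
""").fetchall()

# Filter in ON: only decides which registrations match; the student row
# is preserved and padded with NULLs when nothing matches.
on_rows = conn.execute("""
    SELECT s.name, cr.coursename
    FROM student s
    LEFT JOIN courseregistration cr
           ON s.studentid = cr.studentid AND cr.isfinished = 0
    WHERE s.studentid = 1
""").fetchall()

print(where_rows)
print(on_rows)
```

The first query returns no rows, while the second returns one row for Dave with a NULL course, which is the behaviour the question asks for.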

Query where NOT NULL but only if NOT used as a FK

area
-----
id BIGSERIAL PRIMARY KEY
deleted_at TIMESTAMP WITH TIME ZONE DEFAULT NULL
and
registration
-----
area_id BIGINT REFERENCES area(id) NOT NULL
I want to get all records from area that have deleted_at IS NULL, plus the ones that have deleted_at IS NOT NULL but are referenced as a FK in registration.
SELECT * FROM area
JOIN registration AS reg
ON reg.area_id=area.id
WHERE area.deleted_at IS NULL;
will omit the area records which are FKs in registration but have been marked as "deleted".
Adding an AND clause regarding the deleted_at column in the JOIN ON clause doesn't make sense, since it will only strip out valid records.
I can't quite wrap my head around it, since the two where conditions kind of contradict each other.
Try something like this:
SELECT *
FROM area
LEFT JOIN registration AS reg ON reg.area_id = area.id
WHERE (area.deleted_at IS NULL) <> (reg.area_id IS NOT NULL)
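The LEFT JOIN lists all area rows, even those without a matching row in registration (the registration columns are NULL for those rows).
The WHERE clause keeps a row only when exactly one of the two conditions holds: the area is not deleted and has no registration, or it is deleted and has one. Note that a non-deleted area that does have a registration fails this test; use OR instead of <> if those rows should be returned as well.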
I think this is what you're asking for. When you use a left join, the registration columns show up as null for areas that have no row in the registration table.
select * from area
left join registration as reg
on reg.area_id= area.id
where area.deleted_at is null or reg.area_id is not null;
-- I need (0) all area records EXCEPT the ones where (1) deleted_at IS NOT NULL
-- AND (2) are NOT present as FKs in registration.
SELECT * FROM area a
WHERE NOT(
a.deleted_at IS NOT NULL -- (1)
AND NOT EXISTS (
SELECT * -- (2)
FROM registration r
WHERE r.area_id=a.id
)
);
Note: your textual phrasing is confusing: EXCEPT a AND b could mean two things
And, after the rephrasing of the question:
-- I want to get (0) all records from area (1) which have deleted_at IS NULL (1a)
-- and (2) the ones that can have deleted_at is NOT NULL but are present as a FK in the registration.
SELECT * FROM area a
WHERE a.deleted_at IS NULL -- (1)
OR a.deleted_at IS NOT NULL AND EXISTS ( -- (1a)
SELECT * -- (2)
FROM registration r
WHERE r.area_id=a.id
);
If I understand correctly, you mean "plus the ones at (1a)"; if so, the "and" in (1a) translates into an OR.
Are you simply searching for the following query?
SELECT * FROM Area
LEFT OUTER JOIN registration on id = area_id
WHERE deleted_at IS NULL OR area_id IS NOT NULL
This will return the same area.id multiple times if registration.area_id is not unique though (since you have no UNIQUE constraints).
If that is a problem, you may want the following query instead.
SELECT * FROM Area
WHERE deleted_at IS NULL OR id IN (SELECT area_id FROM registration)
Or this, built with a COUNT:
SELECT id, deleted_at, COUNT(*) FROM Area
LEFT OUTER JOIN registration on id = area_id
WHERE (deleted_at IS NULL or area_id IS NOT NULL)
GROUP BY id, deleted_at
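The duplicate-row concern raised above can be seen in a small sketch. It runs the LEFT JOIN form and an EXISTS form side by side in an in-memory SQLite database via Python's sqlite3 (a stand-in for PostgreSQL, assumed for illustration); the four sample areas are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE area (id INTEGER PRIMARY KEY, deleted_at TEXT);
CREATE TABLE registration (area_id INTEGER NOT NULL REFERENCES area(id));
-- 1: live, unused   2: deleted, used twice   3: deleted, unused   4: live, used
INSERT INTO area VALUES (1, NULL), (2, '2020-01-01'), (3, '2020-01-01'), (4, NULL);
INSERT INTO registration VALUES (2), (2), (4);
""")

# LEFT JOIN form: one output row per matching registration.
join_ids = [row[0] for row in conn.execute("""
    SELECT area.id
    FROM area
    LEFT JOIN registration reg ON reg.area_id = area.id
    WHERE area.deleted_at IS NULL OR reg.area_id IS NOT NULL
    ORDER BY area.id
""")]

# EXISTS form: at most one output row per area.
exists_ids = [row[0] for row in conn.execute("""
    SELECT a.id
    FROM area a
    WHERE a.deleted_at IS NULL
       OR EXISTS (SELECT 1 FROM registration r WHERE r.area_id = a.id)
    ORDER BY a.id
""")]

print(join_ids)
print(exists_ids)
```

Area 2 appears twice in the join result because it has two registrations, whereas the EXISTS form lists each qualifying area once. Area 3 (deleted and unreferenced) is excluded by both.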

PostgreSQL: Delete all but most recent date

I have a table defined like so:
CREATE TABLE contracts (
ContractID TEXT DEFAULT NULL,
ContractName TEXT DEFAULT NULL,
ContractEndDate TIMESTAMP WITHOUT TIME ZONE,
ContractPOC TEXT DEFAULT NULL
);
In this table, a ContractID may have more than one record. For each ContractID, I want to delete all records except the one with the latest ContractEndDate. I know how to do this in MySQL using:
DELETE contracts
FROM contracts
INNER JOIN (
SELECT
ContractID,
ContractName,
max(ContractEndDate) as lastDate,
ContractPOC
FROM contracts
GROUP BY ContractID
HAVING COUNT(*) > 0) Duplicate on Duplicate.ContractID = contracts.ContractID
WHERE contracts.ContractEndDate < Duplicate.lastDate;
But I need help to get this working in PostgreSQL.
You could use this
delete
from
contracts c
using (SELECT
ContractID,
max(ContractEndDate) as lastDate
FROM
contracts
GROUP BY
ContractID) d
where
d.ContractID = c.ContractID
and c.ContractEndDate < d.lastDate;
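To see which rows such a delete keeps, here is a minimal sketch run against an in-memory SQLite database from Python. SQLite has no DELETE ... USING, so the sketch swaps in a correlated subquery that applies the same condition (end date earlier than the latest end date for that ContractID); that substitution, the TEXT date column, and the sample rows are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contracts (
    ContractID TEXT, ContractName TEXT,
    ContractEndDate TEXT,   -- ISO dates compare correctly as text
    ContractPOC TEXT);
INSERT INTO contracts VALUES
    ('A', 'old',  '2020-12-31', 'x'),
    ('A', 'new',  '2022-12-31', 'x'),
    ('A', 'dup',  '2022-12-31', 'y'),
    ('B', 'only', '2019-06-30', 'z');
""")

# Delete every row that is strictly older than the newest row
# sharing its ContractID.
conn.execute("""
    DELETE FROM contracts
    WHERE ContractEndDate < (
        SELECT MAX(c2.ContractEndDate)
        FROM contracts c2
        WHERE c2.ContractID = contracts.ContractID
    )
""")

remaining = conn.execute("""
    SELECT ContractID, ContractName
    FROM contracts
    ORDER BY ContractID, ContractName
""").fetchall()
print(remaining)
```

Because the comparison is a strict less-than, rows that tie on the latest ContractEndDate all survive ('new' and 'dup' here). If exactly one row per ContractID must remain, an additional tie-breaker is needed.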

Optimizing SQL query with multiple joins and grouping (Postgres 9.3)

I've browsed around some other posts and managed to make my queries run a bit faster. However, I'm at a loss as to how to optimize this query further. I'm going to use it on a website, where the query will be executed when the page is loaded, but 5.5 seconds is far too long to wait for something that should be a lot simpler. The largest table has around 4,000,000 rows and the other ones are around 400,000 each.
Table Structure
match
id BIGINT PRIMARY KEY,
region TEXT,
matchType TEXT,
matchVersion TEXT
team
matchid BIGINT REFERENCES match(id),
id INTEGER,
PRIMARY KEY(matchid, id),
winner TEXT
champion
id INTEGER PRIMARY KEY,
version TEXT,
name TEXT
item
id INTEGER PRIMARY KEY,
name TEXT
participant
PRIMARY KEY(matchid, id),
id INTEGER NOT NULL,
matchid BIGINT REFERENCES match(id),
championid INTEGER REFERENCES champion(id),
teamid INTEGER,
FOREIGN KEY (matchid, teamid) REFERENCES team(matchid, id),
magicDamageDealtToChampions REAL,
damageDealtToChampions REAL,
item0 TEXT,
item1 TEXT,
item2 TEXT,
item3 TEXT,
item4 TEXT,
item5 TEXT,
highestAchievedSeasonTier TEXT
Query
select champion.name,
sum(case when participant.item0 = '3285' then 1::int8 else 0::int8 end) as it0,
sum(case when participant.item1 = '3285' then 1::int8 else 0::int8 end) as it1,
sum(case when participant.item2 = '3285' then 1::int8 else 0::int8 end) as it2,
sum(case when participant.item3 = '3285' then 1::int8 else 0::int8 end) as it3,
sum(case when participant.item4 = '3285' then 1::int8 else 0::int8 end) as it4,
sum(case when participant.item5 = '3285' then 1::int8 else 0::int8 end) as it5
from participant
left join champion
on champion.id = participant.championid
left join team
on team.matchid = participant.matchid and team.id = participant.teamid
left join match
on match.id = participant.matchid
where (team.winner = 'True' and matchversion = '5.14' and matchtype='RANKED_SOLO_5x5')
group by champion.name;
Output of EXPLAIN ANALYZE: http://explain.depesz.com/s/ZYX
What I've done so far
I've created separate indexes on match.region, participant.championid, and a partial index on team where winner = 'True' (since that is only what I am interested in). Note that enable_seqscan = on since when it's off the query is extremely slow. Essentially, the result I'm trying to get is something like this:
Champion |item0 | item1 | ... | item5
champ_name | num | num1 | ... | num5
...
Since I'm still a beginner with respect to database design, I wouldn't be surprised if there is a flaw in my overall table design. I'm still leaning towards the query being inefficient, though. I've tried both inner joins and left joins, with no significant difference. Additionally, the match id needs to be bigint, since integer is too small.
Database design
I suggest:
CREATE TABLE matchversion (
matchversion_id int PRIMARY KEY
, matchversion text UNIQUE NOT NULL
);
CREATE TABLE matchtype (
matchtype_id int PRIMARY KEY
, matchtype text UNIQUE NOT NULL
);
CREATE TABLE region (
region_id int PRIMARY KEY
, region text NOT NULL
);
CREATE TABLE match (
match_id bigint PRIMARY KEY
, region_id int REFERENCES region
, matchtype_id int REFERENCES matchtype
, matchversion_id int REFERENCES matchversion
);
CREATE TABLE team (
match_id bigint REFERENCES match
, team_id integer -- better name !
, winner boolean -- ?!
, PRIMARY KEY(match_id, team_id)
);
CREATE TABLE champion (
champion_id int PRIMARY KEY
, version text
, name text
);
CREATE TABLE participant (
participant_id serial PRIMARY KEY -- use proper name !
, champion_id int NOT NULL REFERENCES champion
, match_id bigint NOT NULL REFERENCES match -- this FK might be redundant
, team_id int
, magic_damage_dealt_to_champions real
, damage_dealt_to_champions real
, item0 text -- or integer ??
, item1 text
, item2 text
, item3 text
, item4 text
, item5 text
, highest_achieved_season_tier text -- integer ??
, FOREIGN KEY (match_id, team_id) REFERENCES team
);
More normalization in order to get smaller tables and indexes and faster access. Create lookup-tables for matchversion, matchtype and region and only write a small integer ID in match.
Seems like the columns participant.item0 .. item5 and highestAchievedSeasonTier could be integer, but are defined as text?
The column team.winner seems to be boolean, but is defined as text.
I also changed the order of columns to be more efficient. Details:
Calculating and saving space in PostgreSQL
Query
Building on above modifications and for Postgres 9.3:
SELECT c.name, *
FROM (
SELECT p.champion_id
, count(p.item0 = '3285' OR NULL) AS it0
, count(p.item1 = '3285' OR NULL) AS it1
, count(p.item2 = '3285' OR NULL) AS it2
, count(p.item3 = '3285' OR NULL) AS it3
, count(p.item4 = '3285' OR NULL) AS it4
, count(p.item5 = '3285' OR NULL) AS it5
FROM matchversion mv
CROSS JOIN matchtype mt
JOIN match m USING (matchtype_id, matchversion_id)
JOIN team t USING (match_id)
JOIN participant p USING (match_id, team_id)
WHERE mv.matchversion = '5.14'
AND mt.matchtype = 'RANKED_SOLO_5x5'
AND t.winner = 'True' -- should be boolean
GROUP BY p.champion_id
) p
JOIN champion c USING (champion_id); -- probably just JOIN ?
Since champion.name is not defined UNIQUE, it's probably wrong to GROUP BY it. It's also inefficient. Use participant.championid instead (and join to champion later if you need the name in the result).
All instances of LEFT JOIN are pointless, since you have predicates on the left-joined tables anyway and / or use the column in GROUP BY.
Parentheses around AND-ed WHERE conditions are not needed.
In Postgres 9.4 or later you could use the new aggregate FILTER syntax instead. Details and alternatives:
How can I simplify this game statistics query?
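The count(expr OR NULL) idiom used in the query above relies on count() skipping NULLs: a true comparison stays true, while a false or NULL comparison becomes NULL and is not counted. A minimal sketch, using SQLite through Python's sqlite3 as a stand-in for Postgres (an assumption for illustration, with made-up rows), compares it with the sum(case ...) form from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE participant (champion_id INTEGER, item0 TEXT, item1 TEXT);
INSERT INTO participant VALUES
    (1, '3285', '1001'),
    (1, '3285', '3285'),
    (1, NULL,   '3285'),
    (2, '1001', NULL);
""")

# Form used in the question: add 1 for each match, 0 otherwise.
sum_case = conn.execute("""
    SELECT champion_id,
           SUM(CASE WHEN item0 = '3285' THEN 1 ELSE 0 END) AS it0,
           SUM(CASE WHEN item1 = '3285' THEN 1 ELSE 0 END) AS it1
    FROM participant
    GROUP BY champion_id
    ORDER BY champion_id
""").fetchall()

# Form used in the answer: non-matches turn into NULL, which COUNT ignores.
count_or_null = conn.execute("""
    SELECT champion_id,
           COUNT(item0 = '3285' OR NULL) AS it0,
           COUNT(item1 = '3285' OR NULL) AS it1
    FROM participant
    GROUP BY champion_id
    ORDER BY champion_id
""").fetchall()

print(sum_case)
print(count_or_null)
```

Both forms return the same counts per champion, including for rows where the item column is NULL.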
Index
The partial index on team you already have should look like this to allow index-only scans:
CREATE INDEX on team (matchid, id) WHERE winner -- boolean
But from what I see, you might just add a winner column to participant and drop the table team completely (unless there is more to it).
Also, that index is not going to help much, because (telling from your query plan) the table has 800k rows, half of which qualify:
rows=399999 ... Filter: (winner = 'True'::text) ... Rows Removed by Filter: 399999
This index on match will help a little more (later) when you have more different matchtypes and matchversions:
CREATE INDEX on match (matchtype_id, matchversion_id, match_id);
Still, while 100k rows qualify out of 400k, the index is only useful for an index-only scan. Otherwise, a sequential scan will be faster. An index typically pays off when selecting about 5 % of the table or less.
Your main problem is that you are obviously running a test case with hardly realistic data distribution. With more selective predicates, indexes will be used more readily.
Aside
Make sure you have configured basic Postgres settings like random_page_cost or work_mem etc.
enable_seqscan = on goes without saying. This is only turned off for debugging or locally as a desperate measure of last resort.
I'd try using
count(*) filter (where item0 = '3285' ) as it0
for your counts instead of sums.
Also, why are you left joining your last two tables and then filtering on them in the where clause? That defeats the purpose, and a regular inner join is faster:
select champion.name,
count(*) filter( where participant.item0 = '3285') as it0,
count(*) filter( where participant.item1 = '3285') as it1,
count(*) filter( where participant.item2 = '3285') as it2,
count(*) filter( where participant.item3 = '3285') as it3,
count(*) filter( where participant.item4 = '3285') as it4,
count(*) filter( where participant.item5 = '3285') as it5
from participant
join champion on champion.id = participant.championid
join team on team.matchid = participant.matchid and team.id = participant.teamid
join match on match.id = participant.matchid
where (team.winner = 'True' and matchversion = '5.14' and matchtype='RANKED_SOLO_5x5')
group by champion.name;

Choosing the first child record in a selfjoin in TSQL

I've got a visits table that looks like this:
id int identity(1,1) not null,
visit_date datetime not null,
patient_id int not null,
flag bit not null
For each record, I need to find a matching record that is same time or earlier, has the same patient_id, and has flag set to 1. What I am doing now is:
select parent.id as parent_id,
(
select top 1
child.id as child_id
from
visits as child
where
child.visit_date <= parent.visit_date
and child.patient_id = parent.patient_id
and child.flag = 1
order by
visit_date desc
) as child_id
from
visits as parent
So, this query works correctly, except that it runs too slowly. I suspect that this is because of the subquery. Is it possible to rewrite it as a joined query?
View the query execution plan. Where you see thick arrows, look at those operators. You should learn the different operators and what they imply, such as Clustered Index Scan versus Clustered Index Seek.
Usually, however, when a query is slow I find that there are no good indexes.
Take the tables and columns that are used in the query and its joins, and create an index that covers all of these columns. This is usually called a covering index. It's something you can do for a query that really needs it, but keep in mind that too many indexes will slow down insert statements.
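As a rough illustration of such an index for the query in the question, here is a minimal sketch using SQLite through Python's sqlite3 (a stand-in for SQL Server, assumed for illustration). The index name, the column order (equality columns first, then the date) and the sample rows are my own choices, and LIMIT 1 stands in for TOP 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE visits (
    id INTEGER PRIMARY KEY,
    visit_date TEXT NOT NULL,      -- ISO dates compare correctly as text
    patient_id INTEGER NOT NULL,
    flag INTEGER NOT NULL);
INSERT INTO visits VALUES
    (1, '2020-01-01', 10, 1),
    (2, '2020-02-01', 10, 0),
    (3, '2020-03-01', 10, 1),
    (4, '2020-03-15', 10, 0),
    (5, '2020-01-10', 20, 0),
    (6, '2020-02-10', 20, 1);
-- Hypothetical index: lets the subquery seek on patient_id and flag,
-- then walk visit_date backwards from the parent's date.
CREATE INDEX ix_visits_patient_flag_date
    ON visits (patient_id, flag, visit_date);
""")

pairs = conn.execute("""
    SELECT parent.id,
           (SELECT child.id
            FROM visits AS child
            WHERE child.visit_date <= parent.visit_date
              AND child.patient_id = parent.patient_id
              AND child.flag = 1
            ORDER BY child.visit_date DESC
            LIMIT 1) AS child_id
    FROM visits AS parent
    ORDER BY parent.id
""").fetchall()
print(pairs)
```

Each visit is paired with the latest flagged visit of the same patient on or before its date, or with NULL when there is none (visit 5 here). On SQL Server, the equivalent would be a nonclustered index on the same three columns; whether it is actually used should be confirmed in the execution plan.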
/*
id identity(1,1) not null,
visit_date datetime not null,
patient_id int not null,
flag bit not null
*/
SELECT
T.parentId,
T.patientId,
V.id AS childId
FROM
(
SELECT
visit.id AS parentId,
visit.patient_id AS patientId,
-- latest flagged visit of the same patient, on or before this visit
MAX (previousVisit.visit_date) previousVisitDate
FROM
visits AS visit
LEFT JOIN visits AS previousVisit ON
visit.patient_id = previousVisit.patient_id
AND visit.visit_date >= previousVisit.visit_date
AND visit.id <> previousVisit.id
AND previousVisit.flag = 1
GROUP BY
visit.id,
visit.visit_date,
visit.patient_id,
visit.flag
) AS T
-- look up the id of the visit that has that date
LEFT JOIN visits AS V ON
T.patientId = V.patient_id
AND T.previousVisitDate = V.visit_date