I have a table, defined below using SQLAlchemy on Postgres, with a 3-column composite index. I'm trying to figure out what order to put internal_date and email_id in, or whether I need the composite index at all.
I want to query by folder_id and get the 100 most recent emails. Ordering the index by folder_id, internal_date and then email_id would appear to be the most logical, but I'm not sure whether a composite index applied like this will work well, or whether I should just have a single index on each column.
This model assumes there are many thousands of folder_ids and millions of email_ids (i.e. a folder can have many emails).
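For reference, the query I'm trying to make efficient looks roughly like this (the folder_id value is illustrative):
SELECT email_id
FROM email_message_mapped_model
WHERE folder_id = 123
ORDER BY internal_date DESC
LIMIT 100;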
class EmailMessages(db.Model):
    __tablename__ = "email_message_mapped_model"
    folder_id = db.Column(
        sa.Integer,
        db.ForeignKey("folders.folder_id", ondelete="CASCADE"),
        primary_key=True,
    )
    email_id = db.Column(
        sa.Integer,
        db.ForeignKey("emails.email_id", ondelete="CASCADE"),
        primary_key=True,
    )
    internal_date = db.Column(db.DateTime, default=datetime.utcnow)
Indexes:
    __table_args__ = (
        db.Index(
            'ix_index1',
            folder_id, email_id, internal_date,
        ),
    )
OR
    __table_args__ = (
        db.Index(
            'ix_index2',
            folder_id, internal_date, email_id,
        ),
    )
OR
    __table_args__ = (
        db.Index('ix_index3a', folder_id),
        db.Index('ix_index3b', internal_date),
        db.Index('ix_index3c', email_id),
    )
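For reference, my understanding is that the second option would correspond to this DDL:
CREATE INDEX ix_index2 ON email_message_mapped_model (folder_id, internal_date, email_id);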
I am trying to populate NULL rows in a table with data from the same table. Here is my code:
create table public.testdata(
id INTEGER,
person INTEGER,
name varchar(10));
insert into testdata (id, person,name) VALUES ( 1,1,'Jane' ), ( 2,1,'Jane' ), ( 3,1,NULL ), ( 4,2,'Tom' ), ( 5,2,NULL );
select * from testdata;
 id | person | name
----+--------+------
  1 |      1 | Jane
  2 |      1 | Jane
  3 |      1 | NULL
  4 |      2 | Tom
  5 |      2 | NULL
Basically I would like to have the name 'Jane' in the 3rd row and the name 'Tom' in the 5th.
Here is the answer which I found online to a similar problem:
Update testdata
SET name = COALESCE(a1.name, b1.name)
FROM testdata a1
JOIN testdata b1
on a1.person = b1.person
and a1.id <> b1.id
where a1.name is NULL;
But if I run this code, I get the name 'Jane' in every row, which is not what I want. I appreciate any help and suggestions.
Example for you:
select t1.id, t1.person, t2.name from testdata t1
left join
(
select distinct person, name from testdata
where name is not null
) t2 on t1.person = t2.person
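If you want to persist this rather than just select it, the same derived table can drive an UPDATE (a sketch, not part of the original answer; it assumes each person has at most one non-NULL name, as in the sample data):
update testdata t1
set name = t2.name
from (
    select distinct person, name
    from testdata
    where name is not null
) t2
where t2.person = t1.person
  and t1.name is null;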
Get the person (id?) and the desired name via a CTE. Then use the results to update names. So (see demo):
with namer (person, name) as
( select distinct on (person)
person, name
from testdata
where name is not null
order by person, name
)
update testdata d1
set name = (select n1.name
from namer n1
where n1.person = d1.person
)
where d1.name is null;
NOTE: The demo contains additional rows where the entry sequence of the rows is not ideal, and where not every person value has an associated name.
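Applied to just the five sample rows from the question, the updated table should read (ordered by id):
 id | person | name
----+--------+------
  1 |      1 | Jane
  2 |      1 | Jane
  3 |      1 | Jane
  4 |      2 | Tom
  5 |      2 | Tom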
I have a database with three tables. Ultimately I want to JOIN the three tables and sort them by a column shared by two of the tables.
A main item table with foreign keys (product_id) to the two sub-tables:
items
CREATE TABLE items (
id INT NOT NULL,
product_id varchar(40) NOT NULL,
type CHAR NOT NULL
);
and then a table corresponding to each typeA and typeB. They have differing columns, but for the sake of this exercise I'm only including the columns they have in common:
CREATE TABLE products_a (
id varchar(40) NOT NULL,
name varchar(40) NOT NULL,
price INT NOT NULL
);
CREATE TABLE products_b (
id varchar(40) NOT NULL,
name varchar(40) NOT NULL,
price INT NOT NULL
);
Some example rows:
INSERT INTO items VALUES
( 1, 'abc', 'a' ),
( 2, 'def', 'b' ),
( 3, 'ghi', 'a' ),
( 4, 'jkl', 'b' );
INSERT INTO products_a VALUES
( 'abc', 'product 1', 10 ),
( 'ghi', 'product 2', 50 );
INSERT INTO products_b VALUES
( 'def', 'product 3', 20 ),
( 'jkl', 'product 4', 100 );
I have a JOIN working, but my sorting is not interleaving the rows as I would expect.
Query:
SELECT
items.id AS item_id,
products_a.name AS product_a_name,
products_a.price AS product_a_price,
products_b.name AS product_b_name,
products_b.price AS product_b_price
FROM items
FULL JOIN products_a ON items.product_id = products_a.id
FULL JOIN products_b ON items.product_id = products_b.id
ORDER BY 3, 5 ASC;
Actual result:
 item_id | product_a_name | product_a_price | product_b_name | product_b_price
---------+----------------+-----------------+----------------+-----------------
       1 | product 1      |              10 | NULL           | NULL
       3 | product 2      |              50 | NULL           | NULL
       2 | NULL           |            NULL | product 3      |              20
       4 | NULL           |            NULL | product 4      |             100
Desired result:
 item_id | product_a_name | product_a_price | product_b_name | product_b_price
---------+----------------+-----------------+----------------+-----------------
       1 | product 1      |              10 | NULL           | NULL
       2 | NULL           |            NULL | product 3      |              20
       3 | product 2      |              50 | NULL           | NULL
       4 | NULL           |            NULL | product 4      |             100
I realize this is a weird table setup, but simplified like this it looks more contrived than it is. The sorting matches the real use case, though, and changing the DB schema is not an option. I feel like I'm missing something simple here: sorting by either one column or the other. Any help is appreciated.
Use COALESCE in the ORDER BY clause to always sort by the first non-NULL price. Note that ordinal references like 3 and 5 only work as bare numbers in ORDER BY; inside an expression such as COALESCE they are treated as plain integer constants, so spell the columns out:
SELECT
items.id AS item_id,
products_a.name AS product_a_name,
products_a.price AS product_a_price,
products_b.name AS product_b_name,
products_b.price AS product_b_price
FROM items
FULL JOIN products_a ON items.product_id = products_a.id
FULL JOIN products_b ON items.product_id = products_b.id
ORDER BY
    COALESCE(products_a.price, products_b.price);
I have the following table:
Table creation:
CREATE TABLE accident_info
(
accident_index varchar(20),
first_road_class varchar(20),
accident_severity varchar(20),
date date,
urban_or_rural_area varchar(20),
weather_conditions varchar(40),
year int,
inscotland varchar(20)
);
Index:
CREATE INDEX index1 ON accident_info (first_road_class, date);
CREATE INDEX index2 ON vehicle_info (age_band_of_driver);
Query:
SELECT COUNT(accident_info.accident_index) as max, vehicle_info.make
FROM vehicle_info
INNER JOIN accident_info on vehicle_info.accident_index = accident_info.accident_index
WHERE vehicle_info.age_band_of_driver = '26 - 35' AND accident_info.first_road_class = 'A' AND accident_info.date > '2009-12-31' and accident_info.date < '2013-01-01'
GROUP BY make
ORDER BY max DESC
LIMIT 1
Even though I have created two indexes, Postgres doesn't use either of them. Why is this happening?
Maybe your dataset is just too small to benefit from an index: on small tables a sequential scan is often cheaper than an index scan, so the planner skips the index.
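One way to check is to temporarily discourage sequential scans for the session and compare the plans; this is a debugging aid only, never a production setting (a sketch using the query from the question):
SET enable_seqscan = off;
EXPLAIN ANALYZE
SELECT COUNT(accident_info.accident_index) AS max, vehicle_info.make
FROM vehicle_info
INNER JOIN accident_info ON vehicle_info.accident_index = accident_info.accident_index
WHERE vehicle_info.age_band_of_driver = '26 - 35'
  AND accident_info.first_road_class = 'A'
  AND accident_info.date > '2009-12-31' AND accident_info.date < '2013-01-01'
GROUP BY make
ORDER BY max DESC
LIMIT 1;
SET enable_seqscan = on;
If the index plan shows a higher cost than the sequential scan did, the planner's choice was expected.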
I have 3 tables: Transaction, Transaction_Items and Transaction_History, where Transaction is the parent table and Transaction_Items and Transaction_History are the child tables, each with a one-to-many relationship.
When I try to join those tables together, if I have 2+ Transaction_History records or 2+ Transaction_Items, I get duplicated or triplicated rows in the result.
This is the SQL query I'm currently using. It works, but what worries me is that if I have to join another one-to-many table in the future, it will duplicate the results again.
I found a workaround for this, but I was wondering if there is a better and cleaner way to do it.
The result should be a PostgreSQL JSON array containing the Transaction_Items and Transaction_History:
SELECT
TR.id AS transaction_id,
TR.transaction_number,
TR.status,
to_json(TR_INV.list),
COUNT(TR_INV) item_cnt,
COUNT(THR) tr_cnt,
json_agg(THR)
FROM transaction_transaction AS TR
LEFT JOIN (
SELECT
array_agg(t) list, -- this is a workaround method
t.transaction_id
FROM (
SELECT
TR_INV.transaction_id transaction_id,
IT.id,
IT.stock_number,
CAT.key category_key,
ITP.description description,
ITP.serial_number serial_number,
ITP.color color,
ITP.manufacturer manufacturer,
ITP.inventory_model inventory_model,
ITP.average_cost average_cost,
ITP.location_in_store location_in_store,
ITP.firearm_caliber firearm_caliber,
ITP.federal_firearm_number federal_firearm_number,
ITP.sold_price sold_price
FROM transaction_transaction_item TR_INV
LEFT JOIN inventory_item IT ON IT.id = TR_INV.item_id
LEFT JOIN inventory_itemprofile ITP ON ITP.id = IT.current_profile_id
LEFT JOIN inventory_category CAT ON CAT.id = ITP.category_id
LEFT JOIN inventory_categorytype CAT_T ON CAT_T.id = CAT.category_type_id
) t
GROUP BY t.transaction_id
) TR_INV ON TR_INV.transaction_id = TR.id
LEFT JOIN transaction_transactionhistory THR ON THR.transaction_id = TR.id
AND (THR.audit_code_id = 44 OR THR.audit_code_id = 27 OR THR.audit_code_id = 28)
WHERE TR.store_id = 21
AND TR.transaction_type = 'Pawn_Loan' AND TR.date_made >= '2018-10-08'
GROUP BY TR.id, TR_INV.list
What you want to do can be achieved without joins, as shown below.
Your actual tables have many more columns that I don't know about and shouldn't need to care about, so I created the simplest forms of them for demonstration:
CREATE TABLE transactions (
tid serial PRIMARY KEY,
name varchar(40) NOT NULL
);
CREATE TABLE transaction_histories (
hid serial PRIMARY KEY ,
tid integer REFERENCES transactions(tid),
history varchar(40) NOT NULL
);
CREATE TABLE transaction_items (
iid serial PRIMARY KEY ,
tid integer REFERENCES transactions(tid),
item varchar(40) NOT NULL
);
INSERT INTO transactions(tid, name) VALUES (1, 'transaction');
INSERT INTO transaction_histories(tid, history) VALUES (1, 'history1');
INSERT INTO transaction_histories(tid, history) VALUES (1, 'history2');
INSERT INTO transaction_items(tid, item) VALUES (1, 'item1');
INSERT INTO transaction_items(tid, item) VALUES (1, 'item2');
select
t.*,
(select count(*) from transaction_histories h where h.tid= t.tid) h_count ,
(select json_agg(h) from transaction_histories h where h.tid= t.tid) h ,
(select count(*) from transaction_items i where i.tid= t.tid) i_count ,
(select json_agg(i) from transaction_items i where i.tid= t.tid) i
from transactions t;
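Not part of the answer above, but the same one-row-per-transaction shape can also be written with LEFT JOIN LATERAL (PostgreSQL 9.3+), aggregating each child table before joining so the rows never multiply. A sketch against the demo tables:
select t.*,
       hs.h_count, hs.h,
       its.i_count, its.i
from transactions t
left join lateral (
    -- aggregate histories for this transaction before joining
    select count(*) as h_count, json_agg(th) as h
    from transaction_histories th
    where th.tid = t.tid
) hs on true
left join lateral (
    -- aggregate items for this transaction before joining
    select count(*) as i_count, json_agg(ti) as i
    from transaction_items ti
    where ti.tid = t.tid
) its on true;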
I've browsed around some other posts and managed to make my queries run a bit faster. However, I'm at a loss as to how to further optimize this query. I'm going to use it on a website, where it will execute when the page loads, but 5.5 seconds is far too long to wait for something that should be a lot simpler. The largest table has around 4,000,000 rows and the other ones around 400,000 each.
Table Structure
match
id BIGINT PRIMARY KEY,
region TEXT,
matchType TEXT,
matchVersion TEXT
team
    matchid BIGINT REFERENCES match(id),
    id INTEGER,
    winner TEXT,
    PRIMARY KEY(matchid, id)
champion
id INTEGER PRIMARY KEY,
version TEXT,
name TEXT
item
id INTEGER PRIMARY KEY,
name TEXT
participant
    id INTEGER NOT NULL,
    matchid BIGINT REFERENCES match(id),
    championid INTEGER REFERENCES champion(id),
    teamid INTEGER,
    magicDamageDealtToChampions REAL,
    damageDealtToChampions REAL,
    item0 TEXT,
    item1 TEXT,
    item2 TEXT,
    item3 TEXT,
    item4 TEXT,
    item5 TEXT,
    highestAchievedSeasonTier TEXT,
    PRIMARY KEY(matchid, id),
    FOREIGN KEY (matchid, teamid) REFERENCES team(matchid, id)
Query
select champion.name,
sum(case when participant.item0 = '3285' then 1::int8 else 0::int8 end) as it0,
sum(case when participant.item1 = '3285' then 1::int8 else 0::int8 end) as it1,
sum(case when participant.item2 = '3285' then 1::int8 else 0::int8 end) as it2,
sum(case when participant.item3 = '3285' then 1::int8 else 0::int8 end) as it3,
sum(case when participant.item4 = '3285' then 1::int8 else 0::int8 end) as it4,
sum(case when participant.item5 = '3285' then 1::int8 else 0::int8 end) as it5
from participant
left join champion
on champion.id = participant.championid
left join team
on team.matchid = participant.matchid and team.id = participant.teamid
left join match
on match.id = participant.matchid
where (team.winner = 'True' and matchversion = '5.14' and matchtype='RANKED_SOLO_5x5')
group by champion.name;
Output of EXPLAIN ANALYZE: http://explain.depesz.com/s/ZYX
What I've done so far
I've created separate indexes on match.region, participant.championid, and a partial index on team where winner = 'True' (since that is all I am interested in). Note that enable_seqscan = on, since when it's off the query is extremely slow. Essentially, the result I'm trying to get looks something like this:
Champion |item0 | item1 | ... | item5
champ_name | num | num1 | ... | num5
...
Since I'm still a beginner with respect to database design, I wouldn't be surprised if there is a flaw in my overall table design. I'm still leaning towards the query being absolutely inefficient, though. I've played with both inner joins and left joins; there is no significant difference. Additionally, match.id needs to be bigint (or something larger than integer, since integer is too small).
Database design
I suggest:
CREATE TABLE matchversion (
matchversion_id int PRIMARY KEY
, matchversion text UNIQUE NOT NULL
);
CREATE TABLE matchtype (
matchtype_id int PRIMARY KEY
, matchtype text UNIQUE NOT NULL
);
CREATE TABLE region (
region_id int PRIMARY KEY
, region text NOT NULL
);
CREATE TABLE match (
match_id bigint PRIMARY KEY
, region_id int REFERENCES region
, matchtype_id int REFERENCES matchtype
, matchversion_id int REFERENCES matchversion
);
CREATE TABLE team (
match_id bigint REFERENCES match
, team_id integer -- better name !
, winner boolean -- ?!
, PRIMARY KEY(match_id, team_id)
);
CREATE TABLE champion (
champion_id int PRIMARY KEY
, version text
, name text
);
CREATE TABLE participant (
participant_id serial PRIMARY KEY -- use proper name !
, champion_id int NOT NULL REFERENCES champion
, match_id bigint NOT NULL REFERENCES match -- this FK might be redundant
, team_id int
, magic_damage_dealt_to_champions real
, damage_dealt_to_champions real
, item0 text -- or integer ??
, item1 text
, item2 text
, item3 text
, item4 text
, item5 text
, highest_achieved_season_tier text -- integer ??
, FOREIGN KEY (match_id, team_id) REFERENCES team
);
More normalization, in order to get smaller tables and indexes and faster access: create lookup tables for matchversion, matchtype and region and only store a small integer ID in match.
It seems like the columns participant.item0 .. item5 and highestAchievedSeasonTier could be integer, but are defined as text.
The column team.winner seems to be boolean, but is defined as text.
I also changed the order of columns to be more efficient. Details:
Calculating and saving space in PostgreSQL
Query
Building on above modifications and for Postgres 9.3:
SELECT c.name, p.*
FROM (
SELECT p.champion_id
, count(p.item0 = '3285' OR NULL) AS it0
, count(p.item1 = '3285' OR NULL) AS it1
, count(p.item2 = '3285' OR NULL) AS it2
, count(p.item3 = '3285' OR NULL) AS it3
, count(p.item4 = '3285' OR NULL) AS it4
, count(p.item5 = '3285' OR NULL) AS it5
FROM matchversion mv
CROSS JOIN matchtype mt
JOIN match m USING (matchtype_id, matchversion_id)
JOIN team t USING (match_id)
JOIN participant p USING (match_id, team_id)
WHERE mv.matchversion = '5.14'
AND mt.matchtype = 'RANKED_SOLO_5x5'
AND t.winner -- boolean in the new schema
GROUP BY p.champion_id
) p
JOIN champion c USING (champion_id); -- probably just JOIN ?
Since champion.name is not defined UNIQUE, it's probably wrong to GROUP BY it. It's also inefficient. Use participant.championid instead (and join to champion later if you need the name in the result).
All instances of LEFT JOIN are pointless, since you have predicates on the left tables anyway and / or use the column in GROUP BY.
Parentheses around AND-ed WHERE conditions are not needed.
In Postgres 9.4 or later you could use the new aggregate FILTER syntax instead. Details and alternatives:
How can I simplify this game statistics query?
Index
The partial index on team you already have should look like this to allow index-only scans:
CREATE INDEX on team (matchid, id) WHERE winner -- boolean
But from what I see, you might just add a winner column to participant and drop the table team completely (unless there is more to it).
Also, that index is not going to help much, because (judging by your query plan) the table has 800k rows, half of which qualify:
rows=399999 ... Filter: (winner = 'True'::text) ... Rows Removed by Filter: 399999
This index on match will help a little more (later) when you have more different matchtypes and matchversions:
CREATE INDEX on match (matchtype_id, matchversion_id, match_id);
Still, while 100k rows qualify out of 400k, the index is only useful for an index-only scan; otherwise a sequential scan will be faster. An index typically pays off when selecting about 5 % of the table or less.
Your main problem is that you are obviously running a test case with a hardly realistic data distribution. With more selective predicates, indexes will be used more readily.
Aside
Make sure you have configured basic Postgres settings like random_page_cost or work_mem etc.
enable_seqscan = on goes without saying: it is only turned off for debugging, or locally as a desperate measure of last resort.
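For instance, these can be experimented with per session before changing them globally (the values are illustrative, not recommendations):
SET work_mem = '64MB';          -- more memory for sorts and hash aggregates
SET random_page_cost = 1.1;     -- closer to seq_page_cost on SSD storage
Both settings feed the planner's cost estimates, and therefore influence whether it picks your indexes at all.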
I'd try using
count(*) filter (where item0 = '3285') as it0
for your counts instead of sums.
Also, why are you left joining your last 2 tables and then filtering them in the WHERE clause? That defeats the purpose of the LEFT JOIN, and a regular inner join is faster:
select champion.name,
count(*) filter (where participant.item0 = '3285') as it0,
count(*) filter (where participant.item1 = '3285') as it1,
count(*) filter (where participant.item2 = '3285') as it2,
count(*) filter (where participant.item3 = '3285') as it3,
count(*) filter (where participant.item4 = '3285') as it4,
count(*) filter (where participant.item5 = '3285') as it5
from participant
join champion on champion.id = participant.championid
join team on team.matchid = participant.matchid and team.id = participant.teamid
join match on match.id = participant.matchid
where (team.winner = 'True' and matchversion = '5.14' and matchtype='RANKED_SOLO_5x5')
group by champion.name;