Many-to-many table - performance is bad - PostgreSQL

The following tables are given:
--- player ---
id SERIAL
name VARCHAR(100)
birthday DATE
country VARCHAR(3)
PRIMARY KEY id
--- club ---
id SERIAL
name VARCHAR(100)
country VARCHAR(3)
PRIMARY KEY id
--- playersinclubs ---
id SERIAL
player_id INTEGER (with INDEX)
club_id INTEGER (with INDEX)
joined DATE
left DATE
PRIMARY KEY id
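For reference, the schema above translates roughly into this PostgreSQL DDL (a sketch; the foreign keys and index names are illustrative, not part of the original post):

CREATE TABLE player (
    id       serial PRIMARY KEY,
    name     varchar(100),
    birthday date,
    country  varchar(3)
);

CREATE TABLE club (
    id      serial PRIMARY KEY,
    name    varchar(100),
    country varchar(3)
);

CREATE TABLE playersinclubs (
    id        serial PRIMARY KEY,
    player_id integer REFERENCES player(id),
    club_id   integer REFERENCES club(id),
    joined    date,
    "left"    date  -- quoted because LEFT is a reserved word in PostgreSQL
);

CREATE INDEX playersinclubs_player_id_idx ON playersinclubs (player_id);
CREATE INDEX playersinclubs_club_id_idx   ON playersinclubs (club_id);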
Every player has a row in table player (with his attributes). Likewise, every club has a row in table club.
For every station in his career, a player has an entry in table playersinclubs (n:m), with the date he joined and, optionally, the date he left the club.
My main problem is the performance of these tables. Table player has over 10 million rows. If I want to display the history of a club with all players who played for this club, my select looks like the following:
SELECT * FROM player
JOIN playersinclubs ON player.id = playersinclubs.player_id
JOIN club ON club.id = playersinclubs.club_id
WHERE club.id = 3;
But with this massive number of players, a sequential scan on table player is executed, and the selection takes a lot of time.
Before I added some new features to my app, every player had exactly one team (only current teams and players), so I didn't have the table playersinclubs. Instead, I had a team_id column in table player, and I could select the players of a team directly with the where clause team_id = 3.
Does anyone have performance tips for my database structure to speed up these selections?

Most importantly, you need an index on playersinclubs(club_id, player_id). The rest is details (that may still make quite a difference).
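In PostgreSQL that is a single statement (the index name is illustrative):

CREATE INDEX playersinclubs_club_player_idx ON playersinclubs (club_id, player_id);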
You need to be precise about your actual goals. You write that you want:
all players who played for this club
You don't need to join to club for this at all:
SELECT p.*
FROM playersinclubs pc
JOIN player p ON p.id = pc.player_id
WHERE pc.club_id = 3;
And you don't need columns from playersinclubs in the output either, which is a small gain for performance - unless it allows an index-only scan on playersinclubs, in which case it may be substantial.
You probably don't need all columns of player in the result, either. Only SELECT the columns you actually need.
The PK on player provides the index you need on that table.
You need an index on playersinclubs(club_id, player_id), but do not make it unique unless players are not allowed to join the same club a second time.
If players can join multiple times and you just want a list of "all players", you also need to add a DISTINCT step to fold duplicate entries. You could just:
SELECT DISTINCT p.* ...
But since you are trying to optimize performance: it's cheaper to eliminate dupes early:
SELECT p.*
FROM  (
   SELECT DISTINCT player_id
   FROM   playersinclubs
   WHERE  club_id = 3
   ) pc
JOIN  player p ON p.id = pc.player_id;
Maybe you really want all entries in playersinclubs and all columns of the table, too. But your description says otherwise, and the query and indexes would then be different.
Closely related answer:
Find overlapping date ranges in PostgreSQL

The tables look fine and so does the query. So let's see what the query is supposed to do:
1. Select the club with ID 3: one record that can be accessed via the PK index.
2. Select all playersinclubs records for club ID 3, so we need an index starting with this column. If you don't have it, create it.
I suggest:
create unique index idx_playersinclubs on playersinclubs(club_id, player_id, joined);
This would be the table's unique business key. I know that in many databases with technical IDs these unique constraints are not established, but I consider this a flaw in those databases and would always create these constraints/indexes.
3. Use the player IDs obtained this way and select the players accordingly. We can get the player ID from the playersinclubs records, but it is also the second column in our index, so the DBMS may choose either to perform the join. (It will probably use the column from the index.)
So maybe it is simply that the above index does not exist yet.
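To check whether the index is actually used, EXPLAIN ANALYZE is the standard diagnostic (not part of the original answer):

EXPLAIN ANALYZE
SELECT p.*
FROM playersinclubs pc
JOIN player p ON p.id = pc.player_id
WHERE pc.club_id = 3;

Comparing the plan before and after creating the suggested index shows whether the sequential scan on player disappears.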

Related

How to query "has no linked records in this table"

I have two simple tables: table one with primary key id, and table two with primary key id and a foreign key oneId.
I want to get all rows from one that have no references in two.oneId.
I could do
SELECT ... FROM one LEFT JOIN two ON two.oneId = one.id WHERE two.id IS NULL
SELECT ... FROM one WHERE NOT EXISTS (SELECT 1 FROM two WHERE oneId = one.id)
SELECT ... FROM one WHERE id NOT IN (SELECT oneId FROM two)
probably other options exist
Which option is better, and why?
The second choice is the best: it will be translated to an antijoin.
Number one looks pretty good too; it may well produce the same execution plan.
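One caveat worth adding (not in the original answer): the third form misbehaves if two.oneId is nullable, because id NOT IN (..., NULL) never evaluates to true, so the query silently returns no rows. A hypothetical repair:

SELECT o.*
FROM one o
WHERE o.id NOT IN (SELECT t.oneId FROM two t WHERE t.oneId IS NOT NULL)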

ST_contains taking too much time

I am trying to match latitude/longitude pairs to a particular neighborhood using the query below:
create table address_classification as (
  select distinct buildingid, street, city, state, neighborhood, borough
  from master_data
  join Borough_GEOM
    on st_contains(st_astext(geom), coordinates) = 'true'
);
Here, coordinates is built as follows:
ST_GeometryFromText('POINT('||longitude||' '||latitude||')') as coordinates
and geom is a column of type geometry.
I have already created the following indexes:
CREATE INDEX coordinates_gix ON master_data USING GIST (coordinates);
CREATE INDEX boro_geom_indx ON Borough_GEOM USING gist(geom);
I have almost 3 million records in the main table and about 200 geometry rows in the GEOM table. An EXPLAIN ANALYZE of the query takes far too long (2 hrs).
Please let me know how I can optimize this query.
Thanks in advance.
As mentioned in the comments, don't use ST_AsText(): it doesn't belong there. It casts the geom to text, only for it to be cast straight back to geom. More importantly, that round-trip is likely to defeat the index.
If you're unique on only one column, use DISTINCT ON; there's no need to compare the others.
If you're unique on the ID column and you're only joining to add selectivity, consider using EXISTS instead (see the sketch after the query below). Do any of the selected columns other than geom come from borough_GEOM?
I'd start with something like this:
CREATE TABLE address_classification AS
SELECT DISTINCT ON (buildingid)
       buildingid,
       street,
       city,
       state,
       neighborhood,
       borough
FROM master_data
JOIN borough_GEOM
  ON ST_Contains(geom, coordinates);
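For comparison, here is the EXISTS variant mentioned above. This is only a sketch: it assumes buildingid is unique in master_data and that every selected column (including borough) lives in master_data rather than in borough_GEOM:

CREATE TABLE address_classification AS
SELECT buildingid,
       street,
       city,
       state,
       neighborhood,
       borough
FROM master_data m
WHERE EXISTS (
    SELECT 1
    FROM borough_GEOM b
    WHERE ST_Contains(b.geom, m.coordinates)  -- container geometry first, point second
);

With EXISTS, each master_data row appears at most once, so the DISTINCT step disappears entirely.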

SARGable way to find records near each other based on time window?

We have events inserted into a table: a start event and an end event. Related events share the same internal_id, and are inserted within a 90-second window. We frequently do a self-join on the table:
create table mytable (id bigint identity, internal_id bigint,
internal_date datetime, event_number int, field_a varchar(50))
select * from mytable a inner join mytable b on a.internal_id = b.internal_id
and a.event_number = 1 and b.event_number = 2
However, we can have millions of linked events each day. Our clustered key is the internal_date, so we can filter down to a partition level, but the performance can still be mediocre:
and a.internal_date >='20120807' and a.internal_date < '20120808'
and b.internal_date >='20120807' and b.internal_date < '20120808'
Is there a SARGable way to narrow it down further?
Adding this doesn't work - non-SARGable:
and a.internal_date <= b.internal_date +.001 --about 90 seconds
and a.internal_date > b.internal_date - .001 --make sure they're within the window
This isn't for a point query, so doing one-offs doesn't help - we're searching for thousands of records and need event details from the start event and the end event.
Thanks!
With this index your query will be much cheaper:
CREATE UNIQUE INDEX idx_iid on mytable(event_number, internal_id)
INCLUDE (id, internal_date, field_a);
The index allows you to seek on event_number rather than doing a clustered index scan, and enables a merge join on internal_id rather than a hash join. The uniqueness constraint makes the merge join even cheaper by eliminating the possibility of a many-to-many join.
See this for a more detailed explanation of merge join.
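Once the join itself is cheap, the 90-second window remains a residual predicate either way; expressing it with DATEADD (a sketch, equivalent in spirit to the ± .001-day arithmetic above) at least makes the intent explicit:

select *
from mytable a
inner join mytable b
    on  a.internal_id = b.internal_id
    and a.event_number = 1
    and b.event_number = 2
    and a.internal_date >  DATEADD(second, -90, b.internal_date)  -- window lower bound
    and a.internal_date <= DATEADD(second,  90, b.internal_date)  -- window upper bound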

Define a computed column that references another table

I have two database tables, Team (ID, NAME, CITY, BOSS, TOTALPLAYER) and
Player (ID, NAME, TEAMID, AGE). The relationship between the two tables is one-to-many: one team can have many players.
I want to know: is there a way to define the TOTALPLAYER column in the Team table as computed?
For example, if there are 10 players whose TEAMID is 1, then the row in the Team table with ID 1 has a TOTALPLAYER value of 10. If I add a player, the value goes up to 11. I don't want to assign the value explicitly; it should be generated by the database. Does anyone know how to achieve this?
Thanks in advance.
BTW, the database is SQL Server 2008 R2.
Yes, you can do that - you need a function to count the players for the team, and use that in the computed column:
CREATE FUNCTION dbo.CountPlayers (@TeamID INT)
RETURNS INT
AS BEGIN
    DECLARE @PlayerCount INT

    SELECT @PlayerCount = COUNT(*)
    FROM dbo.Player
    WHERE TeamID = @TeamID

    RETURN @PlayerCount
END
and then define your computed column:
ALTER TABLE dbo.Team
ADD TotalPlayers AS dbo.CountPlayers(ID)
Now every select calls that function, once for each team in the result. The value is not persisted in the Team table; it's calculated on the fly each time you select from it.
Since its value isn't persisted, the real question is: does it need to be a computed column on the table at all, or could you just call the function to compute the number of players when needed?
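If the computed column turns out to be unnecessary, the function can simply be called in queries (using the schema from the question):

SELECT ID, NAME, dbo.CountPlayers(ID) AS TotalPlayers
FROM dbo.Team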
You don't have to store the total in the table at all; it can be computed in a query. SQL Server requires every non-aggregated column to appear in the GROUP BY, so with the schema above it looks like this:
SELECT t.ID, t.NAME, t.CITY, t.BOSS, COUNT(p.ID) AS TotalPlayers
FROM Team t
LEFT JOIN Player p ON t.ID = p.TEAMID
GROUP BY t.ID, t.NAME, t.CITY, t.BOSS;
This produces an additional column TotalPlayers with the count of players on each team (zero for teams without any, thanks to the LEFT JOIN).

MySQL cross-database WHERE clause

I am working on a project that obtains values from many measurement stations (e.g. 50,000) located all over the world. I have 2 databases: one stores information on the measurement stations, the other stores the values obtained from them (e.g. several million rows). A super-simplified version of the database structure looks like this:
database measurement_stations
  table measurement_station
    id      : primary key
    name    : colloquial station name
    country : foreign key into table country
  table country
    id   : primary key
    name : name of the country
database measurement_values
  table measurement_value
    id      : primary key
    station : id of the station the value came from
    value   : measured value
I need a list of the names of all countries in the first database for which values exist in the second database. I am using MySQL with InnoDB, so cross-database foreign keys are supported.
I am lost on the SELECT statement, more specifically the WHERE clause.
Selecting the IDs of the stations for which values exist seems easy:
SELECT DISTINCT station FROM measurement_values.measurement_value
This takes a couple of minutes the first time, but is really fast in subsequent calls, even after database server restarts; I assume that's normal.
I think the COUNT trick mentioned in Problem with Query Data in a Table and Mysql Complex Where Clause could help, but I can't seem to get it right.
SELECT country.name FROM measurement_stations WHERE country.id = measurement_station.id
AND (id is in the result of the previous SELECT statement)
Can anyone help me ?
This should do it (note that MySQL separates the database and table name with a single dot):
select distinct m.country, ct.name
from measurement_stations.measurement_station m
inner join measurement_values.measurement_value mv on mv.station = m.id
inner join measurement_stations.country ct on ct.id = m.country;
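As a side note on the observation that the first DISTINCT scan takes minutes: an index on the station column would let that query read the index instead of the whole table. This goes beyond the original answer, and the index name is illustrative:

CREATE INDEX idx_measurement_value_station
    ON measurement_values.measurement_value (station);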