I am working on a project that obtains values from many measurement stations (e.g. 50000) located all over the world. I have 2 databases, one storing information on the measurement stations, the other one storing values obtained from these stations (e.g. several million). A super-simplified version of the database structure could look like this:
database measurement_stations
  table measurement_station
    id : primary key
    name : colloquial station name
    country : foreign key into table country
  table country
    id : primary key
    name : name of the country
database measurement_values
  table measurement_value
    id : primary key
    station : id of the station the value came from
    value : measured value
I need a list of the names of all countries from the first database for which values exist in the second database. I am using MySQL with InnoDB, so cross-database foreign keys are supported.
I am lost on the SELECT statement, more specifically on the WHERE clause.
Selecting the IDs of the countries for which values exist seems easy:
SELECT DISTINCT id FROM measurement_values.measurement_value
This takes a couple of minutes the first time, but is really fast in subsequent calls, even after database server restarts; I assume that's normal.
I think the COUNT trick mentioned in Problem with Query Data in a Table and Mysql Complex Where Clause could help, but I can't seem to get it right.
SELECT country.name FROM measurement_stations WHERE country.id = measurement_station.id
AND (id is in the result of the previous SELECT statement)
Can anyone help me ?
This should do it:
select distinct m.country, ct.name
from measurement_stations.measurement_station m
inner join measurement_values.measurement_value mv on mv.station = m.id
inner join measurement_stations.country ct on ct.id = m.country
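If the DISTINCT join over millions of measurement rows turns out to be slow, a semi-join is an alternative worth testing: it lets the engine stop probing measurement_value as soon as one matching row per station is found. A sketch against the same schema:

```sql
-- Countries that have at least one station with measured values.
SELECT ct.name
FROM measurement_stations.country ct
WHERE EXISTS (
    SELECT 1
    FROM measurement_stations.measurement_station m
    JOIN measurement_values.measurement_value mv ON mv.station = m.id
    WHERE m.country = ct.id
);
```

Whether this beats the DISTINCT join depends on the data distribution; compare both with EXPLAIN.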
The database used is PostgreSQL.
Suppose I have a dimension table dim_orders with fields describing an order line (one order number can have many order lines, differing in item name):
order_id (auto-increment primary key)
order_number
order_status (NEW, PAID, ORDERED, ...)
item_name
...
Then I have an ETL process that runs hourly. The data source is the sales database. The problem is that we have an order_line_status field in the data source, which can change from hour to hour (e.g. one cycle is NEW > PAID > ORDERED > DELIVERED > CLOSED), and this status can differ per order line. For example, when order X has line X1 (item: chocolate) and line X2 (item: coffee), X1 (chocolate) might already be DELIVERED while X2 (coffee) is still ORDERED.
My fact_sales table is something like this:
sales_id (auto-increment primary key)
order_id (foreign key to dim_orders), which is basically the order line: chocolate or coffee in the sample above
quantity (taken from sales data)
discount_amount (taken from sales data)
...
To maintain speed, I'd like to avoid a network / SQL call every time the ETL process runs, because the network is sometimes saturated and quite slow.
Right now, for every sales row the ETL processes, I query dim_orders:
**existing_order_id = "SELECT order_id FROM dim_orders WHERE order_number = [staging_data.order_number] AND item_name = [staging_data.item_name]"**
if existing_order_id found then:
- update dim_orders, set order_status = new order status from the staging data (it might or might not have changed since last hour's ETL)
- update fact_sales, using data from the staging table
else:
- insert into dim_orders and get the new order_id, something like (INSERT INTO dim_orders ... RETURNING order_id)
- insert into fact_sales, using that order_id
The problem is the bold statement, where I always query for the existing order ID. If I have 10k rows, this means 10k selects before processing the data. I'm trying to change dim_orders to use a unique key on (order_number, item_name), so that instead of select-then-insert/update I can have something like this (I think):
A. upsert data from staging_table into dim_orders; if an existing (order_number, item_name) matches, update its order_status.
B. process the fact table, using returned order id
Since we need the order id for the fact table, query A can be achieved by:
INSERT INTO dim_orders (order_number, item_name, order_status, ...) VALUES (...)
ON CONFLICT(order_number, item_name) DO UPDATE
SET order_status = excluded.order_status
RETURNING order_id
Now, the only problem is that to do this I have to add a unique constraint on dim_orders over (order_number, item_name).
However, this design was rejected, because we have never had unique constraints on dimension tables before. But as far as I can tell, the only reason is that we have never done it before.
So my question is:
Is it OK to add a unique constraint on star-schema tables (fact / dim)? Or is it actually bad design, and why?
In terms of data warehousing, is there any other approach to this kind of select-then-insert/update problem?
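For what it's worth, steps A and B can be combined into a single round trip with a data-modifying CTE in PostgreSQL. This is only a sketch: staging_sales and its column names are assumed here, and it requires that the staging data contains at most one row per (order_number, item_name), since ON CONFLICT cannot touch the same target row twice in one statement:

```sql
WITH upserted AS (
    -- Upsert the dimension rows and hand back their surrogate keys.
    INSERT INTO dim_orders (order_number, item_name, order_status)
    SELECT order_number, item_name, order_status
    FROM staging_sales
    ON CONFLICT (order_number, item_name) DO UPDATE
        SET order_status = excluded.order_status
    RETURNING order_id, order_number, item_name
)
-- Load the fact table using the keys returned above.
INSERT INTO fact_sales (order_id, quantity, discount_amount)
SELECT u.order_id, s.quantity, s.discount_amount
FROM staging_sales s
JOIN upserted u USING (order_number, item_name);
```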
Thanks
I'm having trouble creating a query for the following task: I want to return exactly one row with columns region_id, region_name, province_name, province_code, country_name, country_code for any given regionid. The database has 3 tables: "countrylist", "provinces" and "regionlist".
the table countrylist has the following columns : countryid, language code, countryname, countrycode and continentid
provinces : country_code, country_name, province_code, province_name
regionlist: regionid, regiontype.
So I tried writing a query joining the tables, but I'm not sure if I'm doing it correctly.
exactly one row with columns: region_id, region_name, province_name, province_code, country_name, country_code for any given regionid.
I am not 100% aware of the differences between Postgres and MySQL, but I guess you get the idea at the very least.
One way to do it is to fetch your row with WHERE regionlist.regionid = and join the other tables. You can use LIMIT to restrict the number of rows returned.
Apparently neither provinces nor countrylist has a column in common with regionlist, so I cannot tell where the link between them is. However, once you have the one row from regionlist, you should have no trouble joining it with the others (if the links are trivial).
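Assuming, purely for illustration, that regionlist carries a province_code linking column and a region_name column (neither appears in your column list), the query would have this shape:

```sql
SELECT r.regionid      AS region_id,
       r.region_name,                -- assumed column
       p.province_name,
       p.province_code,
       c.countryname   AS country_name,
       c.countrycode   AS country_code
FROM regionlist r
JOIN provinces   p ON p.province_code = r.province_code  -- assumed link
JOIN countrylist c ON c.countrycode   = p.country_code
WHERE r.regionid = 42                -- the given regionid
LIMIT 1;
```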
The following tables are given:
--- player --
id serial
name VARCHAR(100)
birthday DATE
country VARCHAR(3)
PRIMARY KEY id
--- club ---
id SERIAL
name VARCHAR(100)
country VARCHAR(3)
PRIMARY KEY id
--- playersinclubs ---
id SERIAL
player_id INTEGER (with INDEX)
club_id INTEGER (with INDEX)
joined DATE
left DATE
PRIMARY KEY id
Every player has a row in table player (with his attributes). Equally every club has an entry in table club.
For every station in his career, a player has an entry in table playersInClubs (n-m) with the date when the player joined and optionally when the player left the club.
My main problem is the performance of these tables. Table player has over 10 million rows. If I want to display the history of a club with all players who have played for it, my select looks like the following:
SELECT * FROM player
JOIN playersinclubs ON player.id = playersinclubs.player_id
JOIN club ON club.id = playersinclubs.club_id
WHERE club.id = 3;
But for this massive number of players, a sequential scan on table player is executed, and the selection takes a lot of time.
Before I implemented some new features in my app, every player had exactly one team (only today's teams and players), so I didn't have the table playersinclubs. Instead I had a team_id in table player, and I could select the players of a team directly from table player with the WHERE clause team_id = 3.
Does anyone have performance tips for my database structure to speed up these selections?
Most importantly, you need an index on playersinclubs(club_id, player_id). The rest is details (that may still make quite a difference).
You need to be precise about your actual goals. You write:
all players who have played for this club:
You don't need to join to club for this at all:
SELECT p.*
FROM playersinclubs pc
JOIN player p ON p.id = pc.player_id
WHERE pc.club_id = 3;
And you don't need columns from playersinclubs in the output either; that is a small gain for performance, unless it allows an index-only scan on playersinclubs, in which case it may be substantial.
How does PostgreSQL perform ORDER BY if a b-tree index is built on that field?
You probably don't need all columns of player in the result, either. Only SELECT the columns you actually need.
The PK on player provides the index you need on that table.
You need an index on playersinclubs(club_id, player_id), but do not make it unique unless players are not allowed to join the same club a second time.
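Creating such an index is a single statement (the index name is illustrative):

```sql
CREATE INDEX playersinclubs_club_player_idx
    ON playersinclubs (club_id, player_id);
```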
If players can join multiple times and you just want a list of "all players", you also need to add a DISTINCT step to fold duplicate entries. You could just:
SELECT DISTINCT p.* ...
But since you are trying to optimize performance: it's cheaper to eliminate dupes early:
SELECT p.*
FROM (
   SELECT DISTINCT player_id
   FROM   playersinclubs
   WHERE  club_id = 3
   ) pc
JOIN player p ON p.id = pc.player_id;
Maybe you really want all entries in playersinclubs and all columns of the table, too. But your description says otherwise. Query and indexes would be different.
Closely related answer:
Find overlapping date ranges in PostgreSQL
The tables look fine and so does the query. So let's see what the query is supposed to do:
Select the club with ID 3. One record that can be accessed via the PK index.
Select all playersinclub records for club ID 3. So we need an index starting with this column. If you don't have it, create it.
I suggest:
create unique index idx_playersinclubs on playersinclubs(club_id, player_id, joined);
This would be the table's unique business key. I know that in many databases with technical IDs these unique constraints are not established, but I consider this a flaw in those databases and would always create these constraints/indexes.
Use the player IDs obtained this way and select the players accordingly. We can get the player ID from the playersinclubs records, but it is also the second column in our index, so the DBMS may choose either to perform the join. (It will probably use the column from the index.)
So maybe it is simply that above index does not exist yet.
I have three tables representing some geographical data:
- one with the actual data,
- one storing the names of the streets,
- one storing the combination of street number and street name (table address).
I already have some addresses in my table. In order to perform an INSERT INTO ... SELECT into a fourth table, I am looking for how to build the SELECT query that retrieves only the objects not already present in the address table.
I tried different approaches, including the NOT EXISTS and the id_street IS NULL conditions, but I didn't manage to make it work.
Here is an example : http://rextester.com/KMSW4349
Thanks
You can simply use EXCEPT to remove the rows already in address:
INSERT INTO address(street_number,id_street)
SELECT DISTINCT datas.street_number, street.id_street
FROM datas
LEFT JOIN street USING (street_name)
EXCEPT
SELECT street_number, id_street FROM address;
You could end up with duplicates if there are concurrent data modifications on address.
To avoid that, you'd add a unique constraint and use INSERT ... ON CONFLICT DO NOTHING.
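Sketched out, that safer variant would look like this (the constraint name is illustrative, and the ALTER TABLE will only succeed if address contains no duplicates yet):

```sql
ALTER TABLE address
    ADD CONSTRAINT address_street_number_id_street_key
    UNIQUE (street_number, id_street);

-- ON CONFLICT silently skips rows that already exist,
-- so the EXCEPT is no longer needed.
INSERT INTO address (street_number, id_street)
SELECT DISTINCT datas.street_number, street.id_street
FROM datas
LEFT JOIN street USING (street_name)
ON CONFLICT DO NOTHING;
```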
Your subquery is not correct. You have to correlate it with the outer tables:
INSERT INTO address(street_number,id_street)
SELECT DISTINCT street_number, id_street
FROM datas
LEFT JOIN street ON street.street_name=datas.street_name
WHERE NOT EXISTS (SELECT * FROM address a2 WHERE a2.street_number = datas.street_number AND a2.id_street = street.id_street);
I'm trying to build a query to get the data from a table, but some of its columns are foreign keys that I would like to replace with the associated name, in a single query.
Basically there's
table A with column 1:PKA-ID and column 2:name.
table B with column 1:PKB-ID, column 2:FKA-ID, column 3:amount.
I want to get all the lines in table B but with all foreign keys replaced by the associated names in table A.
I started building a query with a subquery + alias to get that, but of course I get more than one result per subquery, and I can't find a way to link that subquery to the ID of table B [might be exhausted, dumb or both] from the main query. I did something like this:
SELECT (SELECT "NAME" FROM A JOIN B ON ID = FKA-ID) AS name, amount FROM TABLEB;
it feels like such a simple request, yet...
You don't need a join in the subselect.
SELECT pkb_id,
(SELECT name FROM a WHERE a.pka_id = b.fka_id),
amount
FROM b;
(See it live in SQL Fiddle).
The subselect query runs for each and every row of its parent select and has the parent row available from the context.
You can also use a simple join.
SELECT b.pkb_id, a.name, b.amount
FROM b
JOIN a ON a.pka_id = b.fka_id;
Note that the join version puts fewer restrictions on the PostgreSQL query optimizer, so in some cases the join version might run faster. (For example, in PostgreSQL 9.6 the join might utilize multiple CPU cores, cf. Parallel Query.)