SQL Select rows by comparison of value to aggregated function result

I have a table listing (gameid, playerid, team, max_minions) and I want to get the players within each team that have the lowest max_minions (within each team, within each game). I.e. I want a list (gameid, team, playerid_with_lowest_minions) for each game/team combination.
I tried this:
SELECT * FROM MinionView GROUP BY gameid, team
HAVING MIN(max_minions) = max_minions;
Unfortunately, this doesn't work: SQLite seems to pick an arbitrary row from each (gameid, team) group and only then applies the HAVING comparison. If that row doesn't happen to hold the minimum, the whole group is skipped.
Using WHERE won't work either since you can't use aggregate functions within WHERE clauses.
LIMIT won't work since I have many more games and LIMIT limits the total number of rows returned.
Is there any way to do this without adding another table/view that contains (gameid, teamid, MIN(max_minions))?
Example data:
sqlite> SELECT * FROM MinionView;
gameid|playerid|team|champion|max_minions
21|49|100|Champ1|124
21|52|100|Champ2|18
21|53|100|Champ3|303
21|54|200|Champ4|356
21|57|200|Champ5|180
21|58|200|Champ6|21
64|49|100|Champ7|111
64|50|100|Champ8|208
64|53|100|Champ9|8
64|54|200|Champ0|226
64|55|200|ChampA|182
64|58|200|ChampB|15
...
Expected result (I mostly care about playerid, but included champion, max_minions here for better overview):
21|52|100|Champ2|18
21|58|200|Champ6|21
64|53|100|Champ9|8
64|58|200|ChampB|15
...
I'm using Sqlite3 under Python 3.1 if that matters.

This is in SQL Server, hopefully the syntax works for you too:
SELECT
MV.*
FROM
(
SELECT
team, gameid, min(max_minions) as maxmin
FROM
MinionView
GROUP BY
team, gameid
) groups
JOIN MinionView MV ON
MV.team = groups.team
AND MV.gameid = groups.gameid
AND MV.max_minions = groups.maxmin
In words, first you make the usual grouping query (the nested one). At this point you have the min value for each group but you don't know to which row it belongs. For this you join with the original table and match the "keys" (team, game and min) to get the other columns as well.
Note that if a team has more than one member with the same minimum value for max_minions, all of those rows will be selected. If you only want one of them, you need an extra tie-breaking step, which is a bit more complicated.
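Since the question mentions sqlite3 under Python, here is a minimal sketch of the same join run against the sample data. It assumes a plain table named MinionView stands in for the asker's view, and it uses the alias g instead of groups to stay clear of SQLite keywords; the helper name lowest_per_team is made up for illustration.

```python
import sqlite3

# Sample rows copied from the question:
# (gameid, playerid, team, champion, max_minions)
ROWS = [
    (21, 49, 100, "Champ1", 124),
    (21, 52, 100, "Champ2", 18),
    (21, 53, 100, "Champ3", 303),
    (21, 54, 200, "Champ4", 356),
    (21, 57, 200, "Champ5", 180),
    (21, 58, 200, "Champ6", 21),
    (64, 49, 100, "Champ7", 111),
    (64, 50, 100, "Champ8", 208),
    (64, 53, 100, "Champ9", 8),
    (64, 54, 200, "Champ0", 226),
    (64, 55, 200, "ChampA", 182),
    (64, 58, 200, "ChampB", 15),
]

QUERY = """
SELECT MV.gameid, MV.playerid, MV.team, MV.champion, MV.max_minions
FROM (
    SELECT gameid, team, MIN(max_minions) AS min_minions
    FROM MinionView
    GROUP BY gameid, team
) AS g
JOIN MinionView AS MV
  ON MV.gameid = g.gameid
 AND MV.team = g.team
 AND MV.max_minions = g.min_minions
ORDER BY MV.gameid, MV.team, MV.playerid
"""


def lowest_per_team(rows):
    """Return the row(s) with the lowest max_minions per (gameid, team)."""
    conn = sqlite3.connect(":memory:")
    # Assumption: a plain table stands in for the asker's view.
    conn.execute(
        "CREATE TABLE MinionView (gameid INTEGER, playerid INTEGER, "
        "team INTEGER, champion TEXT, max_minions INTEGER)"
    )
    conn.executemany("INSERT INTO MinionView VALUES (?, ?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


result = lowest_per_team(ROWS)
for row in result:
    print("|".join(str(value) for value in row))
```

With the sample data this yields one row per game/team combination. Adding a second player with max_minions = 18 to team 100 of game 21 makes the query return both tied rows for that group.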


Replace correlated subquery with join

I'd like to replace the following ABAP OpenSQL snippet (in the where clause of a much bigger statement) with an equivalent join.
... AND tf~tarifart = ( SELECT MAX( tf2~tarifart ) FROM ertfnd AS tf2 WHERE tf2~tariftyp = e1~tariftyp AND tf2~bis >= e1~bis AND tf2~ab <= e1~ab ) ...
My motivation: I'm migrating the query to ABAP CDS views (basically plain SQL, but somewhat less expressive). Alas, correlated subqueries and EXISTS statements are not supported.
I googled a bit and found a possible solution (last post) here https://archive.sap.com/discussions/thread/3824523
However, the proposal
Selecting MAX(value)
Your scenario using inner join to first CDS view
doesn't work in my case.
tf.bis (and tf.ab) need to be in the selection list of the new view to limit the rhs of the join (new view) to the correct time frames.
Alas, there could be multiple (non overlapping) sub time frames (contained within [tf.ab, tf.bis]) with the same tf.tarifart.
Since these couldn't be grouped together, this results in multiple rows on the rhs.
The original query does not have a problem with that (no join -> no Cartesian product).
I hope the following fiddle (working example) clears things up a bit: http://sqlfiddle.com/#!9/8d1f48/3
Given these constraints, to me it seems that an equivalent join is indeed impossible. Suggestions or even confirmations?
select doc_belzart,
doc_tariftyp,
doc_ab,
doc_bis,
max(tar_tarifart)
from
(
select document.belzart as doc_belzart,
document.tariftyp as doc_tariftyp,
document.ab as doc_ab,
document.bis as doc_bis,
tariff.tarifart as tar_tarifart,
tariff.tariftyp as tar_tariftyp,
tariff.ab as tar_ab,
tariff.bis as tar_bis
from dberchz1 as document
inner join ertfnd as tariff
on tariff.tariftyp = document.tariftyp and
tariff.ab <= document.ab and
tariff.bis >= document.bis
) as max_tariff
group by doc_belzart,
doc_tariftyp,
doc_ab,
doc_bis
Translated into English, you seem to want to determine the max applicable tariff for a set of documents.
I'd refactor this into separate steps:
1. Determine all applicable tariffs, meaning all tariffs that completely cover the document's time interval. This will become your first CDS view, and in my answer forms the sub-query.
2. Determine for all documents the max applicable tariff. This will form your second CDS view, and in my answer forms the outer query. This one has the MAX / GROUP BY to reduce the result set to one per document.
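The two steps can be tried out as two stacked views. This is only a sketch in SQLite via Python, not CDS syntax: the table contents are made up, the tables are cut down to the columns used in the query, the view names applicable_tariffs and max_tariff are hypothetical, and dates are stored as ISO strings so that plain comparison works.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Made-up miniature versions of the two tables.
CREATE TABLE dberchz1 (belzart TEXT, tariftyp TEXT, ab TEXT, bis TEXT);
CREATE TABLE ertfnd   (tariftyp TEXT, tarifart TEXT, ab TEXT, bis TEXT);

INSERT INTO dberchz1 VALUES
    ('DOC1', 'T1', '2016-02-01', '2016-02-29'),
    ('DOC2', 'T1', '2016-07-01', '2016-07-31');
INSERT INTO ertfnd VALUES
    ('T1', 'A', '2016-01-01', '2016-12-31'),  -- covers both documents
    ('T1', 'B', '2016-01-01', '2016-03-31'),  -- covers DOC1 only
    ('T1', 'B', '2016-02-01', '2016-02-29'),  -- second time frame, same tarifart
    ('T1', 'C', '2016-08-01', '2016-12-31');  -- covers neither document

-- Step 1: every tariff whose interval completely covers the document's interval.
CREATE VIEW applicable_tariffs AS
SELECT d.belzart, d.tariftyp, d.ab, d.bis, t.tarifart
FROM dberchz1 AS d
JOIN ertfnd AS t
  ON t.tariftyp = d.tariftyp
 AND t.ab <= d.ab
 AND t.bis >= d.bis;

-- Step 2: collapse to one row per document, keeping the highest tarifart.
CREATE VIEW max_tariff AS
SELECT belzart, tariftyp, ab, bis, MAX(tarifart) AS tarifart
FROM applicable_tariffs
GROUP BY belzart, tariftyp, ab, bis;
""")

applicable_for_doc1 = conn.execute(
    "SELECT COUNT(*) FROM applicable_tariffs WHERE belzart = 'DOC1'"
).fetchone()[0]
rows = conn.execute(
    "SELECT belzart, tarifart FROM max_tariff ORDER BY belzart"
).fetchall()
print(applicable_for_doc1, rows)
```

Because the tariff's own ab and bis are not part of the second view's grouping, several time frames with the same tarifart collapse into a single row per document.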

Laravel 4.2 order by another collections field or result of a function

I have a MongoDB database and I'm trying to write Eloquent code that transforms some fields before using them in WHERE or ORDER BY clauses, something like these SQL queries:
Select ag.*, ht.*
from agency as ag inner join hotel as ht on ag.hotel_id = ht.id
Where ht.title = 'OrangeHotel'
-- or --
Select ag.*, ht.*
from agency as ag inner join hotel as ht on ag.hotel_id = ht.id
Order by ht.title
Sometimes there is no other table and I just need to use a calculated field in the WHERE or ORDER BY clause:
Select *
from agency
Where func(agency_admin) = 'testAdmin'
Select *
from agency
Order by func(agency_admin)
where func() is my custom function.
Any suggestions?
I have read "Laravel 4/5, order by a foreign column", which covers half of my problem, but I don't know how to use it.
For the first query: MongoDB only partially supports "joins" through the aggregation pipeline, which confines an aggregation to a single collection. For "joins" across collections, just query the collections one by one: first the one containing the "where" field, then the one that should be "joined" to it, and so on.
The second question puzzled me for a few minutes until I saw this question and realized it's the same as your first one: sort the collection that contains your sort field and retrieve some data, then go to the other collection.
For the 3rd question, this question should serve you well.

Converting complex query with inner join to tableau

I have a query like this, which we use to generate data for our custom dashboard (A Rails app) -
SELECT AVG(wait_time) FROM (
SELECT TIMESTAMPDIFF(MINUTE,a.finished_time,b.start_time) wait_time
FROM (
SELECT max(start_time + INTERVAL avg_time_spent SECOND) finished_time, branch
FROM mytable
WHERE name IN ('test_name')
AND status = 'SUCCESS'
GROUP by branch) a
INNER JOIN
(
SELECT MIN(start_time) start_time, branch
FROM mytable
WHERE name IN ('test_name_specific')
GROUP by branch) b
ON a.branch = b.branch
HAVING avg_time_spent between 0 and 1000)t
GROUP BY week
Now I am trying to port this to Tableau, and I can't find a way to represent this data there. I am stuck on how to represent the inner GROUP BY in a calculated field. I could also just use a custom SQL data source, but I am already using another data source.
columns in mytable -
start_time
avg_time_spent
name
branch
status
I think this could be achieved with the new Level of Detail formulas, but unfortunately I am stuck at version 8.3.
Save custom SQL for rare cases. This doesn't look like a rare case. Let Tableau generate the SQL for you.
If you simply connect to your table, then you can usually write calculated fields to get the information you want. I'm not exactly sure why you have test_name in one part of your query but test_name_specific in another, so ignoring that, here is a simplified example of a similar query.
If you define a calculated field called worst_case_test_time as
DATEDIFF('minute', MIN([start_time]), MAX(DATEADD('second', [avg_time_spent], [start_time])))
you get something that seems close to what your original query says.
It would help if you explained what exactly you are trying to compute. It appears to be some sort of worst case bound for avg test time. There may be an even simpler formula, but it's hard to know without a little context.
You could filter on status = "Success" and avg_time_spent < 1000, and place branch and WEEK(start_time) on say the row and column shelves.
P.S. Your query seems a little off. Don't you need an aggregation function like MAX or AVG after the HAVING keyword?

Fastest way to update a Postgres table, given a set of unique column values?

I've been running into this same issue repeatedly when trying to execute Postgres updates. First I'll run a SELECT query, like so:
SELECT stock_number
FROM products
WHERE available = true
EXCEPT
SELECT stock_number
FROM new_inventory_list;
This selects the stock numbers of all products that indicate that they're available in the current database, but no longer appear in the new list of inventory that's just been downloaded. This command runs very quickly. However, virtually any method I use to update this list seems to take at least ten minutes to run, slowing the server down in the process. For instance:
UPDATE products
SET available = false
WHERE stock_number IN (
SELECT stock_number
FROM products
WHERE available = true
AND stock_number IS NOT NULL
EXCEPT
SELECT stock_number
FROM new_inventory_list
);
There are usually at least 10,000 rows that need to be updated, and often a lot more if a supplier pushes a lot of new inventory at once. Additionally, we need to check for price updates. It's relatively fast and easy to get a list of stock numbers for products that have been changed in price:
WITH overlap AS (
SELECT stock_number
FROM products
INTERSECT
SELECT stock_number
FROM new_inventory_list
),
unchanged AS (
SELECT stock_number, price
FROM products
INTERSECT
SELECT stock_number, price
FROM new_inventory_list
)
SELECT * FROM overlap EXCEPT SELECT stock_number FROM unchanged;
For this query, I don't even try to do it in SQL. Instead I pull the list out into a script, then run UPDATE on each modified value individually. It's slow, but still seems to be faster than any pure-SQL command I've tried. Plus, with an external script, I can output the progress periodically, so I can approximate how long it will run. Stock numbers are unique, although they're occasionally NULL. (Those should be ignored.)
I feel like there has to be a much faster way of doing this, but so far I haven't had any luck figuring it out. Any thoughts?
edit:
I think I found a better solution to this problem than any that I've tried so far:
WITH removed AS (
SELECT stock_number
FROM products
WHERE available = true
EXCEPT
SELECT stock_number
FROM new_inventory_list
)
UPDATE products AS p
SET available = false
FROM removed
WHERE removed.stock_number = p.stock_number;
I hadn't considered the idea of using UPDATE and WITH together, and didn't even know it was possible until I read the UPDATE documentation for Postgres. Even though it's considerably faster, it still takes a few minutes to run, so to monitor it, I just run the above command in a loop, with LIMIT 1000 at the end of the SELECT clause, printing a message to the console every time it successfully updates another block.
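For reference, the batching loop looks roughly like this. It is a sketch only: it uses Python's sqlite3 as a stand-in for Postgres, so the UPDATE is written with IN (...) rather than UPDATE ... FROM, the tables are cut down to two columns with made-up data, and the function name mark_removed_in_batches is hypothetical. The loop terminates because rows that were already flagged no longer match available = 1.

```python
import sqlite3


def mark_removed_in_batches(conn, batch_size=1000):
    """Flag products missing from new_inventory_list, batch_size rows at a time."""
    total = 0
    while True:
        cur = conn.execute(
            """
            UPDATE products
            SET available = 0
            WHERE stock_number IN (
                SELECT stock_number FROM products
                WHERE available = 1 AND stock_number IS NOT NULL
                EXCEPT
                SELECT stock_number FROM new_inventory_list
                LIMIT ?
            )
            """,
            (batch_size,),
        )
        conn.commit()
        if cur.rowcount == 0:  # nothing left to flag
            break
        total += cur.rowcount
        print(f"updated {total} rows so far")
    return total


# Made-up data: 2500 available products, every fifth one is still in stock.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (stock_number INTEGER, available INTEGER)")
conn.execute("CREATE TABLE new_inventory_list (stock_number INTEGER)")
conn.executemany("INSERT INTO products VALUES (?, 1)", [(n,) for n in range(2500)])
conn.executemany(
    "INSERT INTO new_inventory_list VALUES (?)", [(n,) for n in range(0, 2500, 5)]
)

updated = mark_removed_in_batches(conn)
still_available = conn.execute(
    "SELECT COUNT(*) FROM products WHERE available = 1"
).fetchone()[0]
```

Committing after every batch keeps each transaction short, at the cost of recomputing the EXCEPT on every pass.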
This query:
WITH removed AS (
SELECT stock_number
FROM products
WHERE available = true
EXCEPT
SELECT stock_number
FROM new_inventory_list
)
UPDATE products AS p
SET available = false
FROM removed
WHERE removed.stock_number = p.stock_number;
… will, I trust, do a superfluous join on the entire table with itself. And probably a poorly performing one, at that, because of the except clause in the with statement.
Think of it this way: suppose a products table with a million rows, around 250k marked as available, and 50k of those that don't appear in a 200k-item strong inventory list. The with query runs like this: 1) find the 50k rows in products that need to be updated; 2) then, for each row in products, check if the id is in those 50k rows in order to re-select those same 50k rows; 3) and update the row.
For improved performance, the update query should select the candidate rows from products that need to be updated directly, and use an anti-join to eliminate unwanted rows. The query @wildplasser posted earlier seems fine:
UPDATE products dst
SET available = false
WHERE available
AND NOT EXISTS (
SELECT 1
FROM new_inventory_list nx
WHERE nx.stock_number = dst.stock_number
);
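A quick way to check the anti-join's behaviour on a handful of rows, sketched here with Python's sqlite3 rather than Postgres (so the dst alias is dropped and booleans are 0/1). The data is made up, and the extra stock_number IS NOT NULL condition is my addition: without it, a product with a NULL stock number never finds a match in the subquery and would be flagged as unavailable, whereas the question says those rows should be ignored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (stock_number INTEGER, available INTEGER);
CREATE TABLE new_inventory_list (stock_number INTEGER);
INSERT INTO products VALUES (1, 1), (2, 1), (3, 1), (4, 0), (NULL, 1);
INSERT INTO new_inventory_list VALUES (1), (3), (5);
""")

cur = conn.execute("""
UPDATE products
SET available = 0
WHERE available = 1
  AND stock_number IS NOT NULL   -- added: leave NULL stock numbers alone
  AND NOT EXISTS (
        SELECT 1
        FROM new_inventory_list AS nx
        WHERE nx.stock_number = products.stock_number
  )
""")
conn.commit()

changed = cur.rowcount
state = conn.execute(
    "SELECT stock_number, available FROM products ORDER BY stock_number"
).fetchall()
print(changed, state)
```

Only product 2 is touched: products 1 and 3 are still in the new list, product 4 was already unavailable, and the NULL row is skipped.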
Another point is the "about 50 columns, 20 of which are indexed" you mentioned in the comments: that will slow down updates considerably. Imagine: each row that gets updated needs to be written not just into that table, but into an additional 20 tables. Are you sure this shouldn't be normalized a bit more and that you actually need each of those indexes?
Have you tried
WITH removed AS (
SELECT p1.stock_number
FROM products p1
LEFT JOIN new_inventory_list n1
ON p1.stock_number=n1.stock_number
WHERE p1.available AND n1.stock_number IS NULL
)
I don't know how the EXCEPT is being done; perhaps this will retain some indexing for use in the UPDATE. Also, if available is usually false, I would add a partial index
CREATE INDEX product_available ON products(stock_number) WHERE available;

what's the utility of array type?

I'm a total newbie with PostgreSQL, but I have good experience with MySQL. I was reading the documentation and discovered that PostgreSQL has an array type. I'm quite confused, since I can't see in which context this type would be useful within an RDBMS. Why would I choose this type instead of a classical one-to-many relationship?
Thanks in advance.
I've used them to make working with trees (such as comment threads) easier. You can store the path from the tree's root to a single node in an array, each number in the array is the branch number for that node. Then, you can do things like this:
SELECT id, content
FROM nodes
WHERE tree = X
ORDER BY path -- The array is here.
PostgreSQL will compare arrays element by element in the natural fashion so ORDER BY path will dump the tree in a sensible linear display order; then, you check the length of path to figure out a node's depth and that gives you the indentation to get the rendering right.
The above approach gets you from the database to the rendered page with one pass through the data.
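To illustrate the ordering, here is a small sketch in plain Python, whose list comparison behaves like the element-by-element array comparison described above (a shorter path that is a prefix of a longer one sorts first). The node data and the render helper are made up for illustration.

```python
# Each node is (id, path, content); path lists the branch numbers from the root.
nodes = [
    (1, [1], "first comment"),
    (2, [2], "second comment"),
    (3, [1, 1], "reply to first"),
    (4, [1, 1, 1], "reply to the reply"),
    (5, [1, 2], "another reply to first"),
    (6, [2, 1], "reply to second"),
]


def render(nodes):
    """Sort by path (the ORDER BY path step) and indent by depth."""
    lines = []
    for node_id, path, content in sorted(nodes, key=lambda node: node[1]):
        depth = len(path) - 1  # path length gives the node's depth
        lines.append("  " * depth + content)
    return lines


display_order = [node[0] for node in sorted(nodes, key=lambda node: node[1])]
for line in render(nodes):
    print(line)
```

Every reply ends up directly below its parent, so a single pass over the sorted rows is enough to draw the thread.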
PostgreSQL also has geometric types, simple key/value types, and supports the construction of various other composite types.
Usually it is better to use traditional association tables but there's nothing wrong with having more tools in your toolbox.
One SO user is using it for what appears to be machine-aided translation. The comments to a follow-up question might be helpful in understanding his approach.
I've been using them successfully to aggregate recursive tree references using triggers.
For instance, suppose you have a tree of categories, and you want to find products in any of categories (1,2,3) or any of their subcategories.
One way to do it is to use an ugly with recursive statement. Doing so will output a plan stuffed with merge/hash joins on entire tables and an occasional materialize.
with recursive categories as (
select id
from categories
where id in (1,2,3)
union all
...
)
select products.*
from products
join product2category on...
join categories on ...
group by products.id, ...
order by ... limit 10;
Another is to pre-aggregate the needed data:
categories (
id int,
parents int[] -- (array_agg(parent_id) from parents) || id
)
products (
id int,
categories int[] -- array_agg(category_id) from product2category
)
index on categories using gin (parents)
index on products using gin (categories)
select products.*
from products
where categories && array(
select id from categories where parents && array[1,2,3]
)
order by ... limit 10;
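To make the semantics of the pre-aggregated arrays concrete, here is a sketch in plain Python that mimics the && (overlap) test with sets. The category tree, the product data and the helper names are all made up; it only shows what the query selects, not how PostgreSQL executes it.

```python
# Hypothetical category tree: 1 -> 4 -> 6, 2 -> 5, and 7 standing alone.
# parents holds every ancestor id plus the category's own id.
categories = {
    1: [1],
    2: [2],
    4: [1, 4],
    5: [2, 5],
    6: [1, 4, 6],
    7: [7],
}

# Pre-aggregated category ids per product.
products = {
    "p1": [6],
    "p2": [5, 7],
    "p3": [7],
}


def overlaps(a, b):
    """Mimic the && operator: true if the two arrays share an element."""
    return not set(a).isdisjoint(b)


def products_in(wanted):
    """Products in any wanted category or in any of its subcategories."""
    matching = [cid for cid, parents in categories.items() if overlaps(parents, wanted)]
    return sorted(pid for pid, cats in products.items() if overlaps(cats, matching))


print(products_in([1, 2, 3]))
```

Asking for categories (1,2,3) picks up p1 through its deep subcategory 6 and p2 through category 5, without walking the tree at query time.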
One issue with the above approach is that row estimates for the && operator are junk. (The selectivity is a stub function that has yet to be written, and results in something like 1/200 rows irrespective of the values in your aggregates.) Put another way, you may very well end up with an index scan where a seq scan would be correct.
To work around it, I increased the statistics on the gin-indexed column and I periodically look into pg_stats to extract more appropriate stats. When a cursory look at those stats reveals that using && for the specified values will return an incorrect plan, I rewrite applicable occurrences of && with arrayoverlap() (the latter has a stub selectivity of 1/3), e.g.:
select products.*
from products
where arrayoverlap(categories, array(
select id from categories where arrayoverlap(parents, array[1,2,3])
))
order by ... limit 10;
(The same goes for the <@ operator...)