I have a database with around 300,000 users and 800,000 relationships between those users. The data can be described like:
User - Contact -> User
I want to know the number of possible new contacts that a specific user can have, so I wrote this query to get that number:
SELECT COUNT(*) FROM (TRAVERSE OUT() FROM (SELECT FROM Usuario WHERE user_id=12345) WHILE $depth <=2) WHERE $depth = 2
The query takes about 5 seconds. I have the same data in a Neo4j database, where the count for the same level takes 450 ms. So I want to know if there is some way to obtain this information (the number of possible new contacts) with better performance.
You can get a good improvement by putting a NOTUNIQUE_HASH_INDEX on the user_id field.
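For example, a minimal sketch (assuming the class and property names from the query above):
CREATE INDEX Usuario.user_id NOTUNIQUE_HASH_INDEX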
EDIT 1
Another tip: you can try using MAXDEPTH instead of WHILE $depth <= 2.
SELECT COUNT(*) FROM (TRAVERSE OUT() FROM (SELECT FROM Usuario WHERE user_id = 12345) MAXDEPTH 2) WHERE $depth = 2
There is a slight difference in terms of calculation time: the WHILE $depth condition is also evaluated at level 3, and those records are then skipped because they don't match the WHILE, but in the meantime they were loaded, which costs execution time. With MAXDEPTH you simply stop the traversal at level 2.
Question
I would like to know: how can I rewrite/alter my search query/strategy to get acceptable performance for my end users?
The search
I'm implementing a search for our users: they can search for candidates on our system based on:
A professional group they fall into,
A location + radius,
A full text search.
The query
select v.id
from (
select
c.id,
c.ts_description,
c.latitude,
c.longitude,
g.group
from entities.candidates c
join entities.candidates_connections cc on cc.candidates_id = c.id
join system.groups g on cc.systems_id = g.id
) v
-- Group selection
where v.group = 'medical'
-- Location + radius
and earth_distance(ll_to_earth(v.latitude, v.longitude), ll_to_earth(50.87050439999999, -1.2191283)) < 48270
-- Full text search
and v.ts_description @@ to_tsquery('simple', 'nurse | doctor')
;
Data size & benchmarks
I am working with 1.7 million records
I have the 3 conditions in order of impact which were benchmarked in isolation:
Group clause: 3s & reduces to 700k records
Location clause: 8s & reduces to 54k records
Full text clause: 60s+ & reduces to 10k records
When combined they seem to take 71s, the full cost of the 3 clauses in isolation. My expectation was that, when all 3 clauses were put together, they would work sequentially, i.e. each on the subset of data left by the previous clause, so the total time should reduce dramatically - but this has not happened.
What I've tried
All join conditions & where clauses are indexed
Notably the ts_description index (GIN) is 2GB
lat/lng is indexed with ll_to_earth() to reduce the impact inline
I nested each where clause into a different subquery in order
Changed the order of all clauses & subqueries
Increased the shared_buffers size to increase the potential cache hits
It seems you do not need the subquery, and it is also good practice to filter on numeric fields. So instead of filtering with where v.group = 'medical', for example, create a dictionary table and just filter with where v.group = 1:
select distinct c.id
from entities.candidates c
join entities.candidates_connections cc on cc.candidates_id = c.id
join system.groups g on cc.systems_id = g.id
where g.group = 1
and earth_distance(ll_to_earth(c.latitude, c.longitude), ll_to_earth(50.87050439999999, -1.2191283)) < 48270
and c.ts_description @@ to_tsquery('simple', 'nurse | doctor')
Also, use EXPLAIN ANALYZE to check your execution plan. These quick tips will help you improve it.
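For instance, to see why the full text clause dominates, you could run the plan for that clause alone (BUFFERS is optional, but shows how much data came from cache rather than disk):
-- shows the executed plan with per-node timing and buffer usage
explain (analyze, buffers)
select c.id
from entities.candidates c
where c.ts_description @@ to_tsquery('simple', 'nurse | doctor');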
There were some best practices that I had not considered; I have subsequently implemented them to gain a substantial performance increase:
tsvector Index Size Reduction
I was storing up to 25,000 characters in the tsvector, which meant that more complicated full text search queries had an immense amount of work to do. I reduced this to 10,000 characters, which has made a big difference, and for my use case it is an acceptable trade-off.
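A sketch of that reduction, assuming the vector is rebuilt from a raw text column (the description column here is hypothetical; adjust to however ts_description is actually populated):
-- rebuild the tsvector from only the first 10,000 characters
update entities.candidates
set ts_description = to_tsvector('simple', left(description, 10000));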
Create a Materialised View
I created a materialised view that contains the join, which offloads a little of the work. Additionally, I built my indexes on the view and run a concurrent refresh on a 2-hour interval. This gives me a pretty stable table to work with.
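A sketch of that setup (the candidate_search name is illustrative; "group" is quoted because it is a reserved word, and REFRESH ... CONCURRENTLY requires a unique index, so this assumes (id, group) is unique in the join result):
create materialized view candidate_search as
select c.id, c.ts_description, c.latitude, c.longitude, g."group"
from entities.candidates c
join entities.candidates_connections cc on cc.candidates_id = c.id
join system.groups g on cc.systems_id = g.id;
-- CONCURRENTLY needs a unique index on the view
create unique index on candidate_search (id, "group");
-- run this on a schedule, e.g. every 2 hours
refresh materialized view concurrently candidate_search;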
Even though my search yields 10k records, I paginate on the front-end and only ever bring back up to 100 results to the screen; this allows me to join onto the original table for only the 100 records I'm going to send back.
Increase RAM & utilise pg_prewarm
I increased the server RAM to give me enough space to hold my materialised view, then ran pg_prewarm on the view. Keeping it in memory yielded the biggest performance increase for me, bringing a 2-minute query down to 3s.
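For reference, a minimal sketch of the prewarm step (pg_prewarm is a contrib extension; candidate_search is the illustrative view name from above):
create extension if not exists pg_prewarm;
-- load the whole materialised view into shared buffers
select pg_prewarm('candidate_search');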
I'm using FB 2.5.5 and I'm trying to understand why a very simple query does not use an index and thus takes forever to execute. I've read a lot of articles about why existing indices might be ignored by the query optimizer, but I don't understand how that can happen in my case. I recomputed the selectivity of all my indices within IB Expert, and I've also done a backup/restore of the database to be sure I wasn't missing something.
The index selectivity, as displayed by IB Expert, is approx. 0.000024, which is far from 1:
CREATE INDEX TVERSIONS_IDX_LASTMODDATE ON TVERSIONS (LASTMODDATE)
The table I'm querying contains approx. 2M records:
SELECT COUNT(ID) FROM TVERSIONS
2479518
I'm trying to fetch all records based on the LASTMODDATE field (a TIMESTAMP, indexed by TVERSIONS_IDX_LASTMODDATE). An oversimplified version of the query would be:
SELECT COUNT(ID) FROM TVERSIONS WHERE LASTMODDATE > :TheDate
In this case, the execution plan shows that the index is actually used:
Plan
PLAN (TVERSIONS INDEX (TVERSIONS_IDX_LASTMODDATE))
...and records matching the condition are fetched very quickly:
------ Performance info ------
Prepare time = 172ms
Execute time = 16ms <----
Avg fetch time = 16,00 ms
Current memory = 2 714 672
Max memory = 10 128 480
Memory buffers = 90
Reads from disk to cache = 57
Writes from cache to disk = 0
Fetches from cache = 387
Now, the "real" query fetches the same fields using the same condition on LASTMODDATE but adds a JOIN over 3 tables :
SELECT COUNT(ID) FROM TVERSIONS
JOIN TFILES ON TFILES.ID = TVERSIONS.FILEID
JOIN TROOTS ON TROOTS.ID = TFILES.ROOTID
JOIN TUSERSBACKUPS ON TROOTS.BACKUPID = TUSERSBACKUPS.BACKUPID
WHERE TUSERSBACKUPS.USERID= :UserID
AND TVERSIONS.LASTMODDATE >:TheDate
Now the query plan does not use the index anymore:
Plan
PLAN JOIN (TUSERSBACKUPS INDEX (RDB$FOREIGN4), TROOTS INDEX (RDB$FOREIGN3), TFILES INDEX (RDB$FOREIGN2), TVERSIONS INDEX (RDB$FOREIGN6))
Without any surprise, execution time is far slower (approx. 1 minute):
------ Performance info ------
Prepare time = 329ms
Execute time = 53s 593ms <---
Avg fetch time = 53 593,00 ms
Current memory = 3 044 736
Max memory = 10 128 480
Memory buffers = 90
Reads from disk to cache = 55 732
Writes from cache to disk = 0
Fetches from cache = 6 952 648
In other words, searching the WHOLE table is magnitudes faster than searching the subset of rows returned by the JOIN.
I can't understand why the index on the LASTMODDATE field is no longer used just because I added the join clause. The selectivity of the index is good and the query is very simple. What am I missing?
It seems Firebird decided to start with the condition TUSERSBACKUPS.USERID = :UserID using index RDB$FOREIGN4. This probably happens because that condition is an equality, while the condition TVERSIONS.LASTMODDATE > :TheDate is an inequality, which could match a potentially larger set of records (for example, if TheDate is a date 200 years ago, it will include the whole table).
To force Firebird to use the plan that you (but not its optimizer) prefer, use the PLAN clause; see http://www.firebirdfaq.org/faq224/
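A sketch of what that could look like here, forcing TVERSIONS to be read first through its LASTMODDATE index (PK_TFILES, PK_TROOTS and FK_TUSERSBACKUPS_BACKUPID are placeholder index names; substitute the real ones from your schema):
SELECT COUNT(TVERSIONS.ID) FROM TVERSIONS
JOIN TFILES ON TFILES.ID = TVERSIONS.FILEID
JOIN TROOTS ON TROOTS.ID = TFILES.ROOTID
JOIN TUSERSBACKUPS ON TROOTS.BACKUPID = TUSERSBACKUPS.BACKUPID
WHERE TUSERSBACKUPS.USERID = :UserID
AND TVERSIONS.LASTMODDATE > :TheDate
-- the order of tables in the PLAN defines the join order
PLAN JOIN (TVERSIONS INDEX (TVERSIONS_IDX_LASTMODDATE),
TFILES INDEX (PK_TFILES),
TROOTS INDEX (PK_TROOTS),
TUSERSBACKUPS INDEX (FK_TUSERSBACKUPS_BACKUPID))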
I think I've understood what happened, and... I guess it was my fault.
I forgot that the table I'm querying has been "denormalized" to avoid such long JOINs. The problematic query can indeed be rewritten in a much shorter way:
SELECT COUNT(TVERSIONS.ID) FROM TVERSIONS
JOIN TUSERSBACKUPS ON TUSERSBACKUPS.BACKUPID = TVERSIONS.RD_BACKUPID
WHERE TUSERSBACKUPS.USERID= :UserID
AND TVERSIONS.LASTMODDATE >:TheDate
This one properly uses the indices I set before and has a very short execution time.
I have the impression that when Firebird detects you're deliberately using a sub-optimal path to access records in a table, it does not even try to use your indices and lets you shoot yourself in the foot...
Anyway, the problem is solved. Thank you all for your suggestions.
I am running the following query:
SELECT * FROM foo WHERE name = 'Bob' ORDER BY address DESC LIMIT 25 OFFSET 1
Because I have records in the table with name = 'Bob', the query is fast on a table of 10M records (<0.5 seconds).
However, if I search for name = 'Susan' the query takes over 45 seconds. I have no records in the table where name = 'Susan'.
I have an index on each of name and address. I've vacuumed the table, analyzed it and have even tried to re-write the query:
SELECT * FROM (SELECT * FROM foo WHERE name = 'Bob' ORDER BY address DESC) f LIMIT 25 OFFSET 1
and can't find any solution. I'm not really sure how to proceed. Please note this is different from this post, as my slowness only happens when there are no records.
EDIT:
If I take out the ORDER BY address then it runs quickly. Obviously, I need it there. I've tried re-writing the query (with no success):
SELECT * FROM (SELECT * FROM foo WHERE name = 'Bob') f ORDER BY address DESC LIMIT 25 OFFSET 1
Examine the execution plan to see which index is being used. In this case, the separate indexes on name and address are not enough. You should create a combined index on name, then address, for this query.
Think of an index as a system-maintained copy of certain columns, in a different order from the original. In this case, you want to first find matches by name, then tie-break on address, then take rows until you have enough or run out of name matches.
By making name first in the multi-column index, the index will be sorted by name first. Then address will serve as our tie-breaker.
Under the original indexes, if the address index is the one chosen then the query's speed will vary based on how quickly it can find matches.
The plan (in English) would be: proceed through all of the rows, which happen to already be sorted by address; discard any that do not match the name; keep going until we have enough.
So if you do not get 25 matches, you read the whole table!
With my proposed multi-column index, the plan (in English) would be: Proceed through all of the name matching rows which happen to already be sorted by address. Start with the first one and take them until you have enough. If you run out, stop.
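A minimal sketch of that index (the index name is illustrative; DESC matches the ORDER BY direction so rows come back pre-sorted):
-- name first for the equality filter, address second as the sort/tie-breaker
CREATE INDEX foo_name_address_idx ON foo (name, address DESC);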
Since a query without the ORDER BY is much faster than one with the ORDER BY clause, I'd run 2 queries:
- One without the ORDER BY, with LIMIT 1, just to know whether you have at least one record. If you do, it's safe to run the query with the ORDER BY.
- If there's no record, there's no need to run the second query.
Yes, it's not a solution, but it will let you deliver your project. Just ensure you create a ticket to handle the technical debt after delivery ;) otherwise your lead developer will set you on fire.
Then, to solve the real technical problem, it would be useful to know which indices you have created. Without those it will be very hard to give you a proper solution!
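In Postgres you can list them like this (worth adding to the question):
-- shows every index on foo together with its definition
SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'foo';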
I've been running into this same issue repeatedly when trying to execute Postgres updates. First I'll run a SELECT query, like so:
SELECT stock_number
FROM products
WHERE available = true
EXCEPT
SELECT stock_number
FROM new_inventory_list;
This selects the stock numbers of all products that are marked available in the current database but no longer appear in the newly downloaded inventory list. This command runs very quickly. However, virtually any method I use to update the table accordingly seems to take at least ten minutes to run, slowing the server down in the process. For instance:
UPDATE products
SET available = false
WHERE stock_number IN (
SELECT stock_number
FROM products
WHERE available = true
AND stock_number IS NOT NULL
EXCEPT
SELECT stock_number
FROM new_inventory_list
);
There are usually at least 10,000 rows that need to be updated, and often a lot more if a supplier pushes a lot of new inventory at once. Additionally, we need to check for price updates. It's relatively fast and easy to get a list of stock numbers for products whose price has changed:
WITH overlap AS (
SELECT stock_number
FROM products
INTERSECT
SELECT stock_number
FROM new_inventory_list
),
unchanged AS (
SELECT stock_number, price
FROM products
INTERSECT
SELECT stock_number, price
FROM new_inventory_list
)
SELECT * FROM overlap EXCEPT SELECT stock_number FROM unchanged;
For this query, I don't even try to use SQL commands to do the update; instead, I pull the list out into a script and run an UPDATE on each modified value individually. It's slow, but still seems faster than any command I've tried that was strictly in SQL. Plus, with an external script I can output the progress periodically and approximate how long it will run. Stock numbers are unique, although they're occasionally NULL (those should be ignored).
I feel like there has to be a much faster way of doing this, but so far I haven't had any luck figuring it out. Any thoughts?
edit:
I think I found a better solution to this problem than any that I've tried so far:
WITH removed AS (
SELECT stock_number
FROM products
WHERE available = true
EXCEPT
SELECT stock_number
FROM new_inventory_list
)
UPDATE products AS p
SET available = false
FROM removed
WHERE removed.stock_number = p.stock_number;
I hadn't considered the idea of using UPDATE and WITH together, and didn't even know it was possible until I read the UPDATE documentation for Postgres. Even though it's considerably faster, it still takes a few minutes to run, so to monitor it I run the above command in a loop with LIMIT 1000 at the end of the SELECT clause, printing a message to the console every time it successfully updates another block.
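Concretely, each iteration of that loop looks like this (the trailing LIMIT applies to the whole EXCEPT result; every pass flips up to 1000 rows, so the loop ends once an iteration updates zero rows):
WITH removed AS (
SELECT stock_number
FROM products
WHERE available = true
EXCEPT
SELECT stock_number
FROM new_inventory_list
LIMIT 1000
)
UPDATE products AS p
SET available = false
FROM removed
WHERE removed.stock_number = p.stock_number;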
This query:
WITH removed AS (
SELECT stock_number
FROM products
WHERE available = true
EXCEPT
SELECT stock_number
FROM new_inventory_list
)
UPDATE products AS p
SET available = false
FROM removed
WHERE removed.stock_number = p.stock_number;
… will, I suspect, do a superfluous join of the entire table with itself. And probably a poorly performing one at that, because of the EXCEPT clause in the WITH statement.
Think of it this way: suppose a products table with a million rows, around 250k marked as available, and 50k of those that don't appear in a 200k-item strong inventory list. The with query runs like this: 1) find the 50k rows in products that need to be updated; 2) then, for each row in products, check if the id is in those 50k rows in order to re-select those same 50k rows; 3) and update the row.
For improved performance, the update query should select the candidate rows from products that need to be updated directly, and use an anti-join to eliminate unwanted rows. The query #wildplasser posted earlier seems fine:
UPDATE products dst
SET available = false
WHERE available
AND NOT EXISTS (
SELECT 1
FROM new_inventory_list nx
WHERE nx.stock_number = dst.stock_number
);
Another point is the "about 50 columns, 20 of which are indexed" you mentioned in the comments: that will slow down updates considerably. Imagine: each row that gets updated needs to be written not just to the table itself, but to 20 additional index structures as well. Are you sure this shouldn't be normalized a bit more, and that you actually need each of those indexes?
Have you tried
WITH removed AS (
SELECT p1.stock_number
FROM products p1
LEFT JOIN new_inventory_list n1
ON p1.stock_number = n1.stock_number
WHERE p1.available AND n1.stock_number IS NULL
)
UPDATE products AS p
SET available = false
FROM removed
WHERE removed.stock_number = p.stock_number;
I don't know how the EXCEPT is being done; perhaps this will retain some indexing for use in the UPDATE. Also, if available is usually false, I would add a partial index
CREATE INDEX product_available ON products(stock_number) WHERE available;
I have a table listing (gameid, playerid, team, max_minions), and I want to get the players within each team that have the lowest max_minions (within each team, within each game). I.e. I want a list of (gameid, team, playerid_with_lowest_minions) rows for each game/team combination.
I tried this:
SELECT * FROM MinionView GROUP BY gameid, team
HAVING MIN(max_minions) = max_minions;
Unfortunately, this doesn't seem to work: it selects a random row from the available rows for each (gameid, team) and then does the HAVING comparison; if the randomly selected row doesn't match, it's simply skipped.
Using WHERE won't work either since you can't use aggregate functions within WHERE clauses.
LIMIT won't work since I have many more games and LIMIT limits the total number of rows returned.
Is there any way to do this without adding another table/view that contains (gameid, teamid, MIN(max_minions))?
Example data:
sqlite> SELECT * FROM MinionView;
gameid|playerid|team|champion|max_minions
21|49|100|Champ1|124
21|52|100|Champ2|18
21|53|100|Champ3|303
21|54|200|Champ4|356
21|57|200|Champ5|180
21|58|200|Champ6|21
64|49|100|Champ7|111
64|50|100|Champ8|208
64|53|100|Champ9|8
64|54|200|Champ0|226
64|55|200|ChampA|182
64|58|200|ChampB|15
...
Expected result (I mostly care about playerid, but included champion, max_minions here for better overview):
21|52|100|Champ2|18
21|58|200|Champ6|21
64|53|100|Champ9|8
64|58|200|ChampB|15
...
I'm using Sqlite3 under Python 3.1 if that matters.
This is in SQL Server, hopefully the syntax works for you too:
SELECT
MV.*
FROM
(
SELECT
team, gameid, min(max_minions) as maxmin
FROM
MinionView
GROUP BY
team, gameid
) groups
JOIN MinionView MV ON
MV.team = groups.team
AND MV.gameid = groups.gameid
AND MV.max_minions = groups.maxmin
In words, first you make the usual grouping query (the nested one). At this point you have the min value for each group but you don't know to which row it belongs. For this you join with the original table and match the "keys" (team, game and min) to get the other columns as well.
Note that if a team has more than one member with the same value for max_minions, then all of those rows will be selected. If you only want one of them, that's probably a bit more complicated.
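If you do want exactly one row per (gameid, team), one option (a sketch that breaks ties by taking the lowest playerid) is to aggregate the joined result once more:
SELECT MV.gameid, MV.team, MIN(MV.playerid) AS playerid
FROM
(
SELECT team, gameid, MIN(max_minions) AS maxmin
FROM MinionView
GROUP BY team, gameid
) groups
JOIN MinionView MV ON
MV.team = groups.team
AND MV.gameid = groups.gameid
AND MV.max_minions = groups.maxmin
GROUP BY MV.gameid, MV.team;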