I have a table that represents an organizational hierarchy. Simplifying a bit (because the complexity isn't relevant here), imagine the classic bill-of-materials or management hierarchy: each record has an id and a parent_id, plus other relevant data, let's just say name. Let's call this table "node".
Then I have records saying which users are authorized to access data about any given node in the hierarchy. Call this node_user.
So I want to find all the node_user records for nodes that are at or below a given node.
Easy enough for a small finite number of levels. Say ...
select nu.userid
from node_user nu
join node n1 on n1.id=nu.node_id
left join node n2 on n2.id=n1.parent_id
left join node n3 on n3.id=n2.parent_id
where #root in (n1.id, n2.id, n3.id)
But what if I don't know the maximum depth, or it's large? I was thinking I could do the "chase up the tree" using a CTE, doing a "from node_user join (with hierarchy ...)", but that's a syntax error; apparently you can't use a CTE as a sub-query. But I can't make the CTE the main query either, because I don't know in advance which nodes I'll find, unless I make the CTE build a tree for the entire database, which I would expect to be a performance killer.
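For concreteness, here's roughly the shape of what I was trying to write (a sketch only, using the names above; this is the form that gets rejected for me):

select nu.userid
from node_user nu
join (
    with recursive hierarchy as (
        select id from node where id = #root
        union all
        select n.id
        from node n
        join hierarchy h on n.parent_id = h.id
    )
    select id from hierarchy
) t on t.id = nu.node_id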
Is there an easy way to do this and I'm just having a brain freeze?
I've been given a table that I'm not sure how to design. I'm hoping for some design suggestions, or pointers in the right direction. The table is called edge and is meant to store some event traces, and IDs that link out to a host of possible lookup tables. Leaving out everything but IDs, here's what the table contains, all UUIDs:
ID
InvID
OrgID
FacilityID
FromAssemblyID
FromAssociatedTo
FromAssociatedToID
FromClinicID
FromFacilityDepartmentID
FromFacilityID
FromFacilityLocationID
FromScanAtFacilityID
FromScanID
FromSCaseID
FromSterilizerLoadID
FromWasherLoadID
FromWebUserID
ToAssemblyID
ToAssociatedTo
ToAssociatedToID
ToClinicID
ToFacilityDepartmentID
ToFacilityID
ToFacilityLocationID
ToNodeDTS
ToScanAtFacilityID
ToScanID
ToSCaseID
ToSterilizerLoadID
ToUserName
ToWasherLoadID
ToWebUserID
That's an overwhelming number of IDs to possibly join on. I remember reading that the Postgres planner kind of gives up when you've got a dozen-plus joins, the idea being that there are so many permutations to explore that the planning time could quickly overwhelm the query time. If you boil it down, the "from" and "to" links are only ever going to have one key value across all of those fields. So, implemented as a polymorphic/promiscuous relation, it would look something like this:
ID
InvID
OrgID
FacilityID
FromID
FromType
ToID
ToType
ToWebUserID
This table is going to be ginormous, so speed is/will be a consideration.
I encouraged the author not to use a polymorphic design, although the appeal is obvious. (I like Karwin's SQL Antipatterns book.) But now, confronted with nearly three dozen IDs, I'm a bit stumped.
Is there a common solution to this kind of problem? Namely, where you've got a central table like this with connections to a wide variety of possible tables? I don't have a Data Warehousing background, but this looks somewhat like that. (The author of this table has read Kimball's books, but not done any Data Warehouse implementations either.)
Important: We're using JOIN to do lookups on related values that might change; we're not using it to change the size of the result set. Just pretend it would always be LEFT JOIN.
With that in mind, what I've thought of is to skip joining on the From and To IDs, and instead use custom function calls to look up required values from the related tables, like this (pseudo-code):
GetUserName(uuid) : citext
...and so on for other values of interest in this and other tables...
The function would return '' when the UUID is the all-zeros placeholder (0000...).
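A rough sketch of what I have in mind, in PL/pgSQL (web_user and its user_name column here are hypothetical stand-ins for the real lookup table):

create or replace function GetUserName(p_id uuid) returns citext
language plpgsql stable
as $$
declare
    v_name citext;
begin
    -- Treat NULL and the all-zeros UUID as "no reference".
    if p_id is null or p_id = '00000000-0000-0000-0000-000000000000' then
        return '';
    end if;

    select user_name into v_name
    from web_user
    where id = p_id;

    return coalesce(v_name, '');
end;
$$;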
I appreciate that this isn't the crispest question in the history of SO; what I'm hoping for is pointers in a fruitful direction.
This smacks of “premature optimization” (which is a source of evil) based on something that you “remember reading”, so maybe some enlightenment about join optimization will help.
One rule of thumb that I follow in questions like this is to model things so that your queries become simple and natural. Experience shows that that often leads to good performance.
I assume that the table you show is the fact table of a star schema, and the foreign keys point to the many dimension tables, so that your query will look like
SELECT ...
FROM fact
JOIN dim1 ON fact.dim1_id = dim1.id
JOIN dim2 ON fact.dim2_id = dim2.id
JOIN dim3 ON fact.dim3_id = dim3.id
...
WHERE dim1.col1 = ...
AND dim2.col2 BETWEEN ... AND ...
AND dim3.col3 < ...
...
Now PostgreSQL will by default only consider all join permutations of the first eight tables (join_collapse_limit), and the rest of the tables are just joined in the order in which they appear in the query.
Moreover, if the number of tables reaches the threshold of 12 (geqo_threshold), the genetic query optimizer takes over, a component that simulates evolution by mutation and survival of the fittest with randomly chosen execution plans (really!) and consequently doesn't always come up with the same execution plan for the same query.
So my advice would be to write the queries in a way that the first seven dimension tables are the ones with the biggest chance of reducing the number of result rows most significantly (based on the WHERE conditions). You can also increase join_collapse_limit, because if your queries take a long time to run anyway, you can easily afford to let the planner spend more time thinking about the best plan.
Then you'd set geqo = off to disable the genetic query optimizer.
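For example (session-level settings; the values are only illustrative):

SET join_collapse_limit = 32;  -- default is 8; lets the planner reorder joins across more tables
SET geqo = off;                -- disable the genetic query optimizer (it kicks in at geqo_threshold, default 12)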
If you design your queries according to these principles, you should be able to get good execution plans without messing up the data model.
Question
I would like to know: how can I rewrite/alter my search query/strategy to get acceptable performance for my end users?
The search
I'm implementing a search for our users; they are given the ability to search for candidates on our system based on:
A professional group they fall into,
A location + radius,
A full text search.
The query
select v.id
from (
    select
        c.id,
        c.ts_description,
        c.latitude,
        c.longitude,
        g.group
    from entities.candidates c
    join entities.candidates_connections cc on cc.candidates_id = c.id
    join system.groups g on cc.systems_id = g.id
) v
-- Group selection
where v.group = 'medical'
-- Location + radius
and earth_distance(ll_to_earth(v.latitude, v.longitude), ll_to_earth(50.87050439999999, -1.2191283)) < 48270
-- Full text search
and v.ts_description @@ to_tsquery('simple', 'nurse | doctor')
;
Data size & benchmarks
I am working with 1.7 million records
I have the 3 conditions listed in order of impact; each was benchmarked in isolation:
Group clause: 3s & reduces to 700k records
Location clause: 8s & reduces to 54k records
Full text clause: 60s+ & reduces to 10k records
When combined they seem to take 71s, which is the full cost of the 3 clauses in isolation. My expectation was that when putting all 3 clauses together they would work sequentially, i.e. each on the subset of data left by the previous clause, so the timing should reduce dramatically; but this has not happened.
What I've tried
All join conditions & where clauses are indexed (rough index definitions are sketched after this list)
Notably the ts_description index (GIN) is 2GB
lat/lng is indexed with ll_to_earth() to reduce the impact inline
I nested each where clause into a different subquery in order
Changed the order of all clauses & subqueries
Increased the shared_buffers size to increase the potential cache hits
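For reference, the indexes mentioned above look roughly like this (a sketch only; index placement and names are illustrative):

CREATE INDEX ON entities.candidates USING gin (ts_description);                     -- full text
CREATE INDEX ON entities.candidates USING gist (ll_to_earth(latitude, longitude));  -- location
CREATE INDEX ON entities.candidates_connections (candidates_id);                    -- join keys
CREATE INDEX ON entities.candidates_connections (systems_id);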
It seems you do not need the subquery, and it is also good practice to filter on numeric fields. So, instead of filtering with where v.group = 'medical', for example, create a dictionary (lookup) table and just filter on a numeric key, e.g. where g.group = 1:
select DISTINCT c.id
from entities.candidates c
join entities.candidates_connections cc on cc.candidates_id = c.id
join system.groups g on cc.systems_id = g.id
where g.group = 1
and earth_distance(ll_to_earth(c.latitude, c.longitude), ll_to_earth(50.87050439999999, -1.2191283)) < 48270
and c.ts_description @@ to_tsquery('simple', 'nurse | doctor')
Also, use EXPLAIN ANALYZE to check your execution plan. These quick tips should help you improve it.
There were some best-practice cases that I had not considered; I have subsequently implemented these to gain a substantial performance increase:
tsvector Index Size Reduction
I was storing up to 25,000 characters in the tsvector, which meant that more complicated full text search queries had an immense amount of work to do. I reduced this to 10,000 characters, which has made a big difference, and for my use case it is an acceptable trade-off.
Create a Materialised View
I created a materialised view that contains the join, which offloads a little bit of the work. Additionally, I built my indexes on it and run a concurrent refresh on a 2 hour interval. This gives me a pretty stable table to work with.
Even though my search yields 10k records, I end up paginating on the front-end, so I only ever bring back up to 100 results to the screen. This allows me to join onto the original table for only the 100 records I'm going to send back.
Increase RAM & utilise pg_prewarm
I increased the server RAM to give me enough space to hold my materialised view, then ran pg_prewarm on it. Keeping it in memory yielded the biggest performance increase for me, bringing a 2 minute query down to 3 seconds.
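A rough sketch of the setup described above (object names are illustrative; REFRESH ... CONCURRENTLY requires a unique index, so (id, group_name) is assumed to be unique here):

create materialized view candidate_search as
select c.id, c.ts_description, c.latitude, c.longitude, g.group as group_name
from entities.candidates c
join entities.candidates_connections cc on cc.candidates_id = c.id
join system.groups g on cc.systems_id = g.id;

-- Concurrent refresh needs a unique index; (id, group_name) is assumed unique.
create unique index on candidate_search (id, group_name);
create index on candidate_search using gin (ts_description);
create index on candidate_search using gist (ll_to_earth(latitude, longitude));

-- Refreshed on a schedule (every 2 hours in my case):
refresh materialized view concurrently candidate_search;

-- Pull the view into shared buffers so it stays hot:
create extension if not exists pg_prewarm;
select pg_prewarm('candidate_search');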
I am writing a query to get records from Table A which satisfy a condition based on records in Table B. For example:
Table A is:
Name Profession City
John Engineer Palo Alto
Jack Doctor SF
Table B is:
Profession City NewJobOffer
Engineer SF Yes
and I'm interested to get Table c:
Name Profession City NewJobOffer
Jack Engineer SF Yes
I can do this in two ways: using a WHERE clause or putting the condition in the join. Which one is faster in Spark SQL, and why?
Is it better to compare the columns in a WHERE clause and select those records, or to join on the columns themselves?
It's better to provide the filter in the WHERE clause. These two expressions are not equivalent.
When you provide filtering in the JOIN clause, both data sources are retrieved first and then joined on the specified condition. Since the join is done through shuffling (redistributing data between executors) first, you are going to shuffle a lot of data.
When you provide the filter in the WHERE clause, Spark can recognize it and both data sources are filtered before being joined. This way you shuffle less data. What might be even more important is that Spark may also be able to do a filter pushdown, filtering data at the data source level, which means even less network pressure.
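To make that concrete, here's a sketch of the two shapes being compared, assuming the question's tables are registered as A and B and the offer flag is what's being filtered on:

-- Filter placed in the JOIN condition:
SELECT a.Name, a.Profession, a.City, b.NewJobOffer
FROM A a
JOIN B b
  ON a.Profession = b.Profession
 AND a.City = b.City
 AND b.NewJobOffer = 'Yes';

-- Filter placed in the WHERE clause, where it can be recognized and pushed down:
SELECT a.Name, a.Profession, a.City, b.NewJobOffer
FROM A a
JOIN B b
  ON a.Profession = b.Profession
 AND a.City = b.City
WHERE b.NewJobOffer = 'Yes';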
My scenario is like this:
I have a couple of views in my database (SQL Server 2005).
These views are queried from Excel across the organization.
My goal is to identify those views which have not been used by anyone for a long time.
Is there a way to count the number of times a view has been requested since a specific date?
Thanks
Avi
You can use the following query to see the queries that have been executed recently. You can add a LIKE predicate on dest.text to check for your views.
SELECT deqs.last_execution_time AS [Time], dest.text AS [Query]
FROM sys.dm_exec_query_stats AS deqs
CROSS APPLY sys.dm_exec_sql_text(deqs.sql_handle) AS dest
ORDER BY deqs.last_execution_time DESC
I think a combination of DMVs and sys.objects could tell you this. This should hopefully show you all queries run that refer to a view, the name of the view, when it was last run, etc.
SELECT s2.text AS Query,
so.name AS ViewName,
creation_time,
last_execution_time,
execution_count
FROM sys.dm_exec_query_stats AS s1
CROSS APPLY sys.dm_exec_sql_text(sql_handle) AS s2
INNER JOIN sys.objects so
ON so.object_id = s2.objectid
AND so.type = 'V'
I don't think that you'll be able to do this unless you are running a trace 24/7. You could turn on auditing in order to monitor it, but it would be a big task for whoever has to read through the logs.
I'm a total newbie with PostgreSQL, but I have good experience with MySQL. I was reading the documentation and discovered that PostgreSQL has an array type. I'm quite confused, since I can't understand in which context this type can be useful within an RDBMS. Why would I choose this type instead of using a classical one-to-many relationship?
Thanks in advance.
I've used them to make working with trees (such as comment threads) easier. You can store the path from the tree's root to a single node in an array; each number in the array is the branch number for that node. Then, you can do things like this:
SELECT id, content
FROM nodes
WHERE tree = X
ORDER BY path -- The array is here.
PostgreSQL will compare arrays element by element in the natural fashion so ORDER BY path will dump the tree in a sensible linear display order; then, you check the length of path to figure out a node's depth and that gives you the indentation to get the rendering right.
The above approach gets you from the database to the rendered page with one pass through the data.
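A minimal sketch of the layout (hypothetical table; the path column holds the branch numbers from the root down to the node):

create table nodes (
    id      serial primary key,
    tree    int   not null,   -- which thread/tree the node belongs to
    path    int[] not null,   -- branch numbers from the root down to this node
    content text
);

-- Depth falls out of the path length, which drives the indentation when rendering:
select id, content, array_length(path, 1) as depth
from nodes
where tree = 42
order by path;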
PostgreSQL also has geometric types, simple key/value types, and supports the construction of various other composite types.
Usually it is better to use traditional association tables but there's nothing wrong with having more tools in your toolbox.
One SO user is using it for what appears to be machine-aided translation. The comments to a follow-up question might be helpful in understanding his approach.
I've been using them successfully to aggregate recursive tree references using triggers.
For instance, suppose you have a tree of categories, and you want to find products in any of the categories (1,2,3) or any of their subcategories.
One way to do it is to use an ugly with recursive statement. Doing so will output a plan stuffed with merge/hash joins on entire tables and an occasional materialize.
with recursive categories as (
select id
from categories
where id in (1,2,3)
union all
...
)
select products.*
from products
join product2category on...
join categories on ...
group by products.id, ...
order by ... limit 10;
Another is to pre-aggregate the needed data:
create table categories (
    id int primary key,
    parents int[]      -- (array_agg(parent_id) from parents) || id
);

create table products (
    id int primary key,
    categories int[]   -- array_agg(category_id) from product2category
);

create index on categories using gin (parents);
create index on products using gin (categories);
select products.*
from products
where categories && array(
select id from categories where parents && array[1,2,3]
)
order by ... limit 10;
One issue with the above approach is that row estimates for the && operator are junk. (The selectivity is a stub function that has yet to be written, and results in something like 1/200 rows irrespective of the values in your aggregates.) Put another way, you may very well end up with an index scan where a seq scan would be correct.
To work around it, I increased the statistics target on the gin-indexed column, and I periodically look into pg_stats to extract more appropriate stats. When a cursory look at those stats reveals that using && for the specified values will return an incorrect plan, I rewrite the applicable occurrences of && with arrayoverlap() (the latter has a stub selectivity of 1/3), e.g.:
select products.*
from products
where arrayoverlap(categories, array(
select id from categories where arrayoverlap(parents, array[1,2,3])
))
order by ... limit 10;
(The same goes for the <# operator...)
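For reference, the statistics bump and the pg_stats check mentioned above look roughly like this (table and column names follow the earlier sketch):

-- Collect more detailed element statistics for the array column:
alter table products alter column categories set statistics 1000;
analyze products;

-- Inspect the collected element frequencies:
select most_common_elems, most_common_elem_freqs
from pg_stats
where tablename = 'products' and attname = 'categories';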