So I have three tables:
Locations (LocationID [PK])
Track_Locations (TrackID [FK], LocationID [FK])
Tracks (TrackID [PK])
I'm building an app for iPhone and I was told that a JOIN would probably work faster than a subquery would. Is this true?
I have searched around but can't find a straight answer to this and would like to know what is best to use in this case.
My queries in both cases would look like this:
Subquery:
SELECT LocationID, TimestampGPS, Longitude, Latitude, ...
FROM locations
WHERE LocationID
IN (SELECT LocationID FROM track_locations WHERE TrackID = ?)
Using JOIN:
SELECT l.LocationID, l.TimestampGPS, l.Longitude, l.Latitude, ...
FROM locations l
JOIN track_locations tl
ON (l.LocationID = tl.LocationID)
JOIN tracks t
ON (t.TrackID = tl.TrackID)
WHERE t.TrackID = ?
A JOIN is almost always faster than a subquery, because the engine can execute it as a single query, whereas a subquery is effectively a second query run inside the first. Some databases, however, know how to optimize such simple subqueries into JOINs, though I doubt SQLite does.
That said, sometimes, given the structure of one's data, it is faster to do a subquery. The only real way to find out is to load your database with a likely collection of sample data from the wild and benchmark both approaches.
My suspicion is that it will be a wash, however. Most iPhone apps don't have enough rows in the database to make much difference.
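If you want to see what SQLite actually decides to do before benchmarking, EXPLAIN QUERY PLAN is a cheap first look. A sketch using the question's tables, with the parameter replaced by a literal:

-- Plan for the subquery form:
EXPLAIN QUERY PLAN
SELECT LocationID, TimestampGPS, Longitude, Latitude
FROM locations
WHERE LocationID IN (SELECT LocationID FROM track_locations WHERE TrackID = 1);

-- Plan for the join form (note the join to tracks can be dropped,
-- since TrackID already lives in track_locations):
EXPLAIN QUERY PLAN
SELECT l.LocationID, l.TimestampGPS, l.Longitude, l.Latitude
FROM locations l
JOIN track_locations tl ON l.LocationID = tl.LocationID
WHERE tl.TrackID = 1;

If both plans show the same index lookups on track_locations and locations, the two forms will perform about the same, which supports the "it will be a wash" suspicion.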
I have two dataframes (from a Delta Lake table) that do a left join via an id column.
sd1, sd2
%sql
select
a.columnA,
b.columnB
from sd1 a
left outer join sd2 b
on a.id = b.id
The problem is that my query takes a long time. Looking for ways to improve it, I found OPTIMIZE ZORDER BY in a YouTube video; according to the video, it seems to be useful for ordering columns when they are going to be part of the WHERE condition.
But since the two dataframes use the id in the join condition, could it be interesting to order that column?
spark.sql(f'OPTIMIZE delta.`{sd1_delta_table_path}` ZORDER BY (id)')
The logic that follows in my head is that if we first order that column, it will take less time to look the rows up to make the match. Is this correct?
Thanks in advance
OPTIMIZE ZORDER may help a bit by placing related data together, but its usefulness may depend on the data type used for the ID column. OPTIMIZE ZORDER relies on data-skipping functionality that only gives you min & max statistics, which may not be useful when you have big ranges in your joins.
You can also tune file sizes, to avoid scanning too many smaller files.
But from my personal experience, for joins, bloom filters give better performance because they allow files to be skipped more efficiently than min/max data skipping does. Just build a bloom filter on the ID column...
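On Databricks Delta that looks roughly like this. A sketch: the table name, fpp and numItems are illustrative and need tuning to your own row counts, and note that the index only covers files written after it is created:

-- Illustrative: build a bloom filter index on the join key.
-- fpp = target false-positive rate, numItems = expected distinct values per file.
CREATE BLOOMFILTER INDEX ON TABLE sd1
FOR COLUMNS (id OPTIONS (fpp = 0.1, numItems = 50000000));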
I've been given a table that I'm not sure how to design. I'm hoping for some design suggestions, or pointers in the right direction. The table is called edge and is meant to store some event traces, and IDs that link out to a host of possible lookup tables. Leaving out everything but IDs, here's what the table contains, all UUIDs:
ID
InvID
OrgID
FacilityID
FromAssemblyID
FromAssociatedTo
FromAssociatedToID
FromClinicID
FromFacilityDepartmentID
FromFacilityID
FromFacilityLocationID
FromScanAtFacilityID
FromScanID
FromSCaseID
FromSterilizerLoadID
FromWasherLoadID
FromWebUserID
ToAssemblyID
ToAssociatedTo
ToAssociatedToID
ToClinicID
ToFacilityDepartmentID
ToFacilityID
ToFacilityLocationID
ToNodeDTS
ToScanAtFacilityID
ToScanID
ToSCaseID
ToSterilizerLoadID
ToUserName
ToWasherLoadID
ToWebUserID
That's an overwhelming number of IDs to possibly join on. I remember reading that the Postgres planner kind of gives up when you've got a dozen-plus joins, the idea being that there are so many permutations to explore that the planning time could quickly overwhelm the query time. If you boil it down, the "from" and "to" links are only ever going to have one key value across all of those fields. So, implemented as polymorphic/promiscuous relations, it would be something like this:
ID
InvID
OrgID
FacilityID
FromID
FromType
ToID
ToType
ToWebUserID
This table is going to be ginormous, so speed is/will be a consideration.
I encouraged the author not to use a polymorphic design, although the appeal is obvious. (I like Karwin's SQL Antipatterns book.) But now, confronted with nearly three dozen IDs, I'm a bit stumped.
Is there a common solution to this kind of problem? Namely, where you've got a central table like this with connections to a wide variety of possible tables? I don't have a Data Warehousing background, but this looks somewhat like that. (The author of this table has read Kimball's books, but not done any Data Warehouse implementations either.)
Important: We're using JOIN to do lookups on related values that might change, we're not using it to change the size of the result set. Just pretend it would always be LEFT JOIN.
With that in mind, what I've thought of is to skip joining on the From and To IDs, and instead use custom function calls to look up required values from the related tables, like (pseudo-code):
GetUserName(uuid) : citext
...and so on for other values of interest in this and other tables...
The function would return '' when the UUID is the all-zeros UUID.
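A minimal sketch of such a function in PostgreSQL (GetUserName, the web_user table and its columns are illustrative, not a real schema):

-- Hypothetical lookup: returns the user name for a UUID,
-- or '' for the all-zeros placeholder.
CREATE FUNCTION GetUserName(p_id uuid) RETURNS citext AS $$
    SELECT CASE
        WHEN p_id = '00000000-0000-0000-0000-000000000000' THEN ''::citext
        ELSE (SELECT user_name FROM web_user WHERE id = p_id)
    END;
$$ LANGUAGE sql STABLE;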
I appreciate that this isn't the crispest question in the history of SO, and what I'm hoping for is pointers in a fruitful direction.
This smacks of “premature optimization” (which is a source of evil) based on something that you “remember reading”, so maybe some enlightenment about join optimization will help.
One rule of thumb that I follow in questions like this is to model things so that your queries become simple and natural. Experience shows that that often leads to good performance.
I assume that the table you show is the fact table of a star schema, and the foreign keys point to the many dimension tables, so that your query will look like
SELECT ...
FROM fact
JOIN dim1 ON fact.dim1_id = dim1.id
JOIN dim2 ON fact.dim2_id = dim2.id
JOIN dim3 ON fact.dim3_id = dim3.id
...
WHERE dim1.col1 = ...
AND dim2.col2 BETWEEN ... AND ...
AND dim3.col3 < ...
...
Now PostgreSQL will by default only consider all join permutations of the first eight tables (join_collapse_limit), and the rest of the tables are just joined in the order in which they appear in the query.
Moreover, if the number of tables reaches the threshold of 12 (geqo_threshold), the genetic query optimizer takes over, a component that simulates evolution by mutation and survival of the fittest with randomly chosen execution plans (really!) and consequently doesn't always come up with the same execution plan for the same query.
So my advice would be to write the queries in a way that the first seven dimension tables are the ones with the biggest chance of reducing the number of result rows most significantly (based on the WHERE conditions). You can also increase join_collapse_limit, because if your queries take a long time to run anyway, you can easily afford to let the planner spend more time thinking about the best plan.
Then you'd set geqo = off to disable the genetic query optimizer.
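Per session, that could look like this (the limit value is illustrative):

-- Consider all join orderings for up to 20 tables, and keep the
-- deterministic exhaustive planner instead of the genetic one.
SET join_collapse_limit = 20;
SET geqo = off;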
If you design your queries according to these principles, you should be able to get good execution plans without messing up the data model.
I have a custom query along these lines. I get the list of orderIds from outside. I have the entire order object list with me, so I can change the query in any way, if needed.
@Query("SELECT p FROM Person p INNER JOIN p.orders o WHERE o.orderId IN :orderIds")
public List<Person> findByOrderIds(@Param("orderIds") List<String> orderIds);
This query works fine, but sometimes it may have anywhere between 50-1000 entries in the orderIds list sent in from outside, so it becomes very slow, taking as much as 5-6 seconds, which is not fast enough. My question is: is there a better, faster way to do this? When I googled, and on this site, I see we can use ANY or EXISTS: Postgresql: alternative to WHERE IN respective WHERE NOT IN, or create a temporary table: https://dba.stackexchange.com/questions/12607/ways-to-speed-up-in-queries-under-postgresql, or join this to a VALUES clause: Alternative when IN clause is inputed A LOT of values (postgreSQL). All these answers are tailored towards direct SQL calls, nothing based on JPA. The ANY keyword is not supported by spring-data. I'm not sure about creating temporary tables in custom queries; I think I can do it with native queries, but have not tried it. I am using spring-data + OpenJPA + PostgreSQL.
Can you please suggest a solution or give pointers? I apologize if I missed anything.
thanks,
Alice
You can use WHERE EXISTS instead of an IN clause in a native SQL query, as well as in HQL in JPA, which can result in significant performance benefits. Please see the sample below.
Sample JPA Query:
SELECT emp FROM Employee emp JOIN emp.projects p where NOT EXISTS (SELECT project from Project project where p = project AND project.status <> 'Active')
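Adapted to the query in the question, that would look roughly like this. A sketch: it still binds the same :orderIds list, so measure whether your provider and PostgreSQL actually plan it better:

SELECT p FROM Person p
WHERE EXISTS (SELECT o FROM p.orders o WHERE o.orderId IN :orderIds)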
First of all, English is not my first language, so feel free to edit my question; I'm sorry for any mistakes or for not being so clear in exposing the problem.
I have a few SQL queries with lots of joins; these joins are based on clustered indexes (no worries about that). Some of the joins are used only to respect normalization and because it is intuitive for maintenance, but sometimes it's possible to skip some of them. It's not clear to me what best practice says about these joins.
Edit:
A simple example:
select *
from things
join things_categories on
things_categories.id_thing = things.id_thing
join categories on
categories.id_category = things_categories.id_category
join categories_properties on
categories_properties.id_category = categories.id_category
where
categories_properties.bo_default = 1
But it's possible to do:
select *
from things
join things_categories on
things_categories.id_thing = things.id_thing
join categories_properties on
categories_properties.id_category = things_categories.id_category
where
categories_properties.bo_default = 1
The second join isn't necessary (I do have integrity at the database level); it's there only because it makes the code more intuitive and respects the database normalization. I'm not sure if I should follow the smallest, most efficient path or leave unnecessary joins in to respect normalization and make the code more intuitive.
Any tips?
All the best.
It depends on whether or not you already have integrity.
On the one hand, if the categories_properties table has a foreign key on the id_category column, then the integrity exists and you don't need to make the join with the categories table.
On the other hand, if the integrity might not exist (i.e., there are id_category values in the categories_properties table that are not defined in the categories table), then you should make the join.
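In DDL terms, the first case means a constraint like this already exists (illustrative constraint name, using the question's columns):

-- With this in place, every id_category in categories_properties is
-- guaranteed to exist in categories, so the extra join adds nothing.
ALTER TABLE categories_properties
    ADD CONSTRAINT fk_categories_properties_category
    FOREIGN KEY (id_category) REFERENCES categories (id_category);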
The join:
join categories on
categories.id_category = things_categories.id_category
is very necessary, since the categories table is used in the next join:
join categories_properties on
categories_properties.id_category = categories.id_category
So it's definitely required if it's not already defined, since SQL requires you to establish the links it needs to join one table to the next.
What is very painful, however, is the select *.
You don't need all that info, since * will bring back all data from all tables.
Perhaps you could specify what you need from each table or, at worst, use things.* to select all columns of a specific table, as in the sketch below.
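For example, using the question's query (a guess at which columns matter; pick your own):

-- Only fetch what the caller actually uses.
select things.id_thing,
       categories_properties.id_category,
       categories_properties.bo_default
from things
join things_categories on things_categories.id_thing = things.id_thing
join categories_properties on categories_properties.id_category = things_categories.id_category
where categories_properties.bo_default = 1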
If you do not need a join, do not use it. You are taking a totally unneeded performance hit. Don't force the database to do work it doesn't need to do because you think it looks more complete; you should consider performance ahead of readability in a query. After all, once you start writing performant SQL code, it will become more readable to you. However, make sure you actually don't need the join before eliminating it, by checking that both versions of the query return the same result set.
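One way to check that, if your database supports EXCEPT (a sketch on the question's two queries; run it in both directions, swapping the halves):

-- Rows returned by the three-join version but not the two-join one;
-- an empty result in both directions means the extra join can go.
select things.id_thing
from things
join things_categories on things_categories.id_thing = things.id_thing
join categories on categories.id_category = things_categories.id_category
join categories_properties on categories_properties.id_category = categories.id_category
where categories_properties.bo_default = 1
except
select things.id_thing
from things
join things_categories on things_categories.id_thing = things.id_thing
join categories_properties on categories_properties.id_category = things_categories.id_category
where categories_properties.bo_default = 1;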
I usually use an ORM instead of SQL and I am slightly out of touch with the different JOINs...
SELECT `order_invoice`.*
, `client`.*
, `order_product`.*
, SUM(product.cost) as net
FROM `order_invoice`
LEFT JOIN `client`
ON order_invoice.client_id = client.client_id
LEFT JOIN `order_product`
ON order_invoice.invoice_id = order_product.invoice_id
LEFT JOIN `product`
ON order_product.product_id = product.product_id
WHERE (order_invoice.date_created >= '2009-01-01')
AND (order_invoice.date_created <= '2009-02-01')
GROUP BY `order_invoice`.`invoice_id`
The tables/columns are logically named... it's a shop-type application... the query works... it's just very, very slow...
I use the Zend Framework and would usually use Zend_Db_Table_Row::find(Parent|Dependent)Row(set)('TableClass'), but I have to make lots of joins, and I thought it would improve performance to do it all in one query instead of hundreds...
Can I improve the above query by using more appropriate JOINs or a different implementation? Many thanks.
The query is wrong: the GROUP BY is wrong. All columns in the SELECT part that are not in an aggregate function have to be in the GROUP BY, and you mention only one column.
Change the SQL mode and set it to ONLY_FULL_GROUP_BY, so that MySQL rejects such queries instead of silently returning indeterminate values for the non-aggregated columns.
When this is done and you have a correct query, use EXPLAIN to find out how the query is executed and what indexes are used. Then start optimizing.
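A sketch of both steps in MySQL; here the SELECT list is trimmed to columns that are valid under ONLY_FULL_GROUP_BY, and note that SET SESSION replaces any other modes you had enabled:

-- Make MySQL reject invalid GROUP BY queries for this session.
SET SESSION sql_mode = 'ONLY_FULL_GROUP_BY';

-- Inspect the plan: the type, key and rows columns show which
-- joins are missing a usable index.
EXPLAIN
SELECT order_invoice.invoice_id, SUM(product.cost) AS net
FROM order_invoice
LEFT JOIN order_product ON order_invoice.invoice_id = order_product.invoice_id
LEFT JOIN product ON order_product.product_id = product.product_id
WHERE order_invoice.date_created >= '2009-01-01'
  AND order_invoice.date_created <= '2009-02-01'
GROUP BY order_invoice.invoice_id;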