Postgres - UNION ALL vs INSERT INTO which is better? - postgresql

I have two query by union all and insert into temp table.
Query 1
select *
from (
select a.id as id, a.name as name from a
union all
select b.id as id, b.name as name from b
)
Query 2
drop table if exists temporary;
create temp table temporary as
select id as id, name as name
from a;
insert into temporary
select id as id, name as name
from b;
select * from temp;
Please tell me which one is better for performance?

I would expect the second option to have better performance, at least at the the database level. Both versions require doing a full table scan of both the a and b tables. But the first version would create an unnecessary intermediate table, used only for the purpose of the insert.
The only potential issue with doing two separate inserts is latency, i.e. the time it might take some process to get to and from the database. If you are worried about this, then you can limit to one insert statement:
INSERT INTO temporary (id, name)
SELECT id, name FROM a
UNION ALL
SELECT id, name FROM b;
This would just require one trip to the database.

I think use union all is the better performance way, not sure, you can try it your self. In tab run of SQL application alway show time to run. I take a snapshot in oracle; mysql and sql sv have the same tool to see it
click here to see image

Related

When is it better to use CTE or temp table postgres

I am doing a query on a very large data set and i am using WITH (CTE) syntax.. this seems to take a while and i was reading online that temp tables could be faster to use in these cases can someone advise me in which direction to go. In the CTE we join to a lot of tables then we filter on the CTE result..
Only interesting in postgres answers
What version of PostgreSQL are you using? CTEs perform differently in PostgreSQL versions 11 and older than versions 12 and above.
In PostgreSQL 11 and older, CTEs are optimization fences (outer query restrictions are not passed on to CTEs) and the database evaluates the query inside the CTE and caches the results (i.e., materialized results) and outer WHERE clauses are applied later when the outer query is processed, which means either a full table scan or a full index seek is performed and results in horrible performance for large tables. To avoid this, apply as much filters in the WHERE clause inside the CTE:
WITH UserRecord AS (SELECT * FROM Users WHERE Id = 100)
SELECT * FROM UserRecord;
PostgreSQL 12 addresses this problem by introducing query optimizer hints to enable us to control if the CTE should be materialized or not: MATERIALIZED, NOT MATERIALIZED.
WITH AllUsers AS NOT MATERIALIZED (SELECT * FROM Users)
SELECT * FROM AllUsers WHERE Id = 100;
Note: Text and code examples are taken from my book Migrating your SQL Server Workloads to PostgreSQL
Summary:
PostgreSQL 11 and older: Use Subquery
PostgreSQL 12 and above: Use CTE with NOT MATERIALIZED clause
My follow up comment is more than I can fit in a comment... so understand this may not be an answer to the OP per se.
Take the following query, which uses a CTE:
with sales as (
select item, sum (qty) as sales_qty, sum (revenue) as sales_revenue
from sales_data
where country = 'USA'
group by item
),
inventory as (
select item, sum (on_hand_qty) as inventory_qty
from inventory_data
where country = 'USA' and on_hand_qty != 0
group by item
)
select
a.item, a.description, s.sales_qty, s.sales_revenue,
i.inventory_qty, i.inventory_qty * a.cost as inventory_cost
from
all_items a
left join sales s on
a.item = s.item
left join inventory i on
a.item = i.item
There are times where I cannot explain why that the query runs slower than I would expect. Some times, simply materializing the CTEs makes it run better, as expected. Other times it does not, but when I do this:
drop table if exists sales;
drop table if exists inventory;
create temporary table sales as
select item, sum (qty) as sales_qty, sum (revenue) as sales_revenue
from sales_data
where country = 'USA'
group by item;
create temporary table inventory as
select item, sum (on_hand_qty) as inventory_qty
from inventory_data
where country = 'USA' and on_hand_qty != 0
group by item;
select
a.item, a.description, s.sales_qty, s.sales_revenue,
i.inventory_qty, i.inventory_qty * a.cost as inventory_cost
from
all_items a
left join sales s on
a.item = s.item
left join inventory i on
a.item = i.item;
Suddenly all is right in the world.
Temp tables may persist across sessions, but to my knowledge the data in them will be session-based. I'm honestly not even sure if the structures persist, which is why to be safe I always drop:
drop table if exists sales;
And use "if exists" to avoid any errors about the object not existing.
I rarely use these in common queries for the simple reason that they are not as portable as a simple SQL statement (you can't give the final query to another user without having the temp tables). My most common use case is when I am processing within a procedure/function:
create procedure sales_and_inventory()
language plpgsql
as
$BODY$
BEGIN
create temp table sales...
insert into sales_inventory
select ...
drop table sales;
END;
$BODY$
Hopefully this helps.
Also, to answer your question on indexes... typically I don't, but nothing says that's always the right answer. If I put data into a temp table, I assume I'm going to use all or most of it. That said, if you plan to query it multiple times with conditions where an index makes sense, then by all means do it.

postgress: insert rows to table with multiple records from other join tables

ّ am trying to insert multiple records got from the join table to another table user_to_property. In the user_to_property table user_to_property_id is primary, not null it is not autoincrementing. So I am trying to add user_to_property_id manually by an increment of 1.
WITH selectedData AS
( -- selection of the data that needs to be inserted
SELECT t2.user_id as userId
FROM property_lines t1
INNER JOIN user t2 ON t1.account_id = t2.account_id
)
INSERT INTO user_to_property (user_to_property_id, user_id, property_id, created_date)
VALUES ((SELECT MAX( user_to_property_id )+1 FROM user_to_property),(SELECT
selectedData.userId
FROM selectedData),3,now());
The above query gives me the below error:
ERROR: more than one row returned by a subquery used as an expression
How to insert multiple records to a table from the join of other tables? where the user_to_property table contains a unique record for the same user-id and property_id there should be only 1 record.
Typically for Insert you use either values or select. The structure values( select...) often (generally?) just causes more trouble than it worth, and it is never necessary. You can always select a constant or an expression. In this case convert to just select. For generating your ID get the max value from your table and then just add the row_number that you are inserting: (see demo)
insert into user_to_property(user_to_property_id
, user_id
, property_id
, created
)
with start_with(current_max_id) as
( select max(user_to_property_id) from user_to_property )
select current_max_id + id_incr, user_id, 3, now()
from (
select t2.user_id, row_number() over() id_incr
from property_lines t1
join users t2 on t1.account_id = t2.account_id
) js
join start_with on true;
A couple notes:
DO NOT use user for table name, or any other object name. It is a
documented reserved word by both Postgres and SQL standard (and has
been since Postgres v7.1 and the SQL 92 Standard at lest).
You really should create another column or change the column type
user_to_property_id to auto-generated. Using Max()+1, or
anything based on that idea, is a virtual guarantee you will generate
duplicate keys. Much to the amusement of users and developers alike.
What happens in an MVCC when 2 users run the query concurrently.

Update or insert with outer join in postgres

Is it possible to add a new column to an existing table from another table using insert or update in conjunction with full outer join .
In my main table i am missing some records in one column in the other table i have all those records i want to take the full record set into the maintable table. Something like this;
UPDATE maintable
SET all_records= othertable.records
FROM
FULL JOIN othertable on maintable.col = othertable.records;
Where maintable.col has same id a othertable.records.
I know i could simply join the tables but i have a lot of comments in the maintable i don't want to have to copy paste back in if possible. As i understand using where is equivalent of a left join so won't show me what i'm missing
EDIT:
What i want is effectively a new maintable.col with all the records i can then pare down based on presence of records in other cols from other tables
Try this:
UPDATE maintable
SET all_records = o.records
FROM othertable o
WHERE maintable.col = o.records;
This is the general syntax to use in postgres when updating via a join.
HTH
EDIT
BTW you will need to change this - I used your example, but you are updating the maintable with the column used for the join! Your set needs to be something like SET missingcol = o.extracol
AMENDED GENERALISED ANSWER (following off-line chat)
To take a simplified example, suppose that you have two tables maintable and subtable, each with the same columns, but where the subtable has extra records. For both tables id is the primary key. To fill maintable with the missing records, for pre 9.5 versions of Postgres you must use the following syntax:
INSERT INTO maintable (SELECT * FROM subtable s WHERE NOT EXISTS
(SELECT 1 FROM maintable m WHERE m.id = s.id));
Since 9.5 there is a (preferred) alternative:
INSERT INTO maintable (SELECT * FROM subtable) ON CONFLICT DO NOTHING;
This is preferred because (apart from being simpler) it avoids the situation that has been known to arise in the former, where a race condition is created between the INSERT and the sub-SELECT.
Obviously when the columns are different, you need to specify in the INSERT statement which columns are inserted from which. Something like:
INSERT INTO maintable (id, ColA, ColB)
(SELECT id, ColE, ColG FROM subtable ....)
Similarly the common field might not be id in both tables. However, the simplified example should be enough to point you in the right direction.

Postgres subquery has access to column in a higher level table. Is this a bug? or a feature I don't understand?

I don't understand why the following doesn't fail. How does the subquery have access to a column from a different table at the higher level?
drop table if exists temp_a;
create temp table temp_a as
(
select 1 as col_a
);
drop table if exists temp_b;
create temp table temp_b as
(
select 2 as col_b
);
select col_a from temp_a where col_a in (select col_a from temp_b);
/*why doesn't this fail?*/
The following fail, as I would expect them to.
select col_a from temp_b;
/*ERROR: column "col_a" does not exist*/
select * from temp_a cross join (select col_a from temp_b) as sq;
/*ERROR: column "col_a" does not exist
*HINT: There is a column named "col_a" in table "temp_a", but it cannot be referenced from this part of the query.*/
I know about the LATERAL keyword (link, link) but I'm not using LATERAL here. Also, this query succeeds even in pre-9.3 versions of Postgres (when the LATERAL keyword was introduced.)
Here's a sqlfiddle: http://sqlfiddle.com/#!10/09f62/5/0
Thank you for any insights.
Although this feature might be confusing, without it, several types of queries would be more difficult, slower, or impossible to write in sql. This feature is called a "correlated subquery" and the correlation can serve a similar function as a join.
For example: Consider this statement
select first_name, last_name from users u
where exists (select * from orders o where o.user_id=u.user_id)
Now this query will get the names of all the users who have ever placed an order. Now, I know, you can get that info using a join to the orders table, but you'd also have to use a "distinct", which would internally require a sort and would likely perform a tad worse than this query. You could also produce a similar query with a group by.
Here's a better example that's pretty practical, and not just for performance reasons. Suppose you want to delete all users who have no orders and no tickets.
delete from users u where
not exists (select * from orders o where o.user_d = u.user_id)
and not exists (select * from tickets t where t.user_id=u.ticket_id)
One very important thing to note is that you should fully qualify or alias your table names when doing this or you might wind up with a typo that completely messes up the query and silently "just works" while returning bad data.
The following is an example of what NOT to do.
select * from users
where exists (select * from product where last_updated_by=user_id)
This looks just fine until you look at the tables and realize that the table "product" has no "last_updated_by" field and the user table does, which returns the wrong data. Add the alias and the query will fail because no "last_updated_by" column exists in product.
I hope this has given you some examples that show you how to use this feature. I use them all the time in update and delete statements (as well as in selects-- but I find an absolute need for them in updates and deletes often)

Finding duplicates between two tables

I've got two SQL2008 tables, one is a "Import" table containing new data and the other a "Destination" table with the live data. Both tables are similar but not identical (there's more columns in the Destination table updated by a CRM system), but both tables have three "phone number" fields - Tel1, Tel2 and Tel3. I need to remove all records from the Import table where any of the phone numbers already exist in the destination table.
I've tried knocking together a simple query (just a SELECT to test with just now):
select t2.account_id
from ImportData t2, Destination t1
where
(t2.Tel1!='' AND (t2.Tel1 IN (t1.Tel1,t1.Tel2,t1.Tel3)))
or
(t2.Tel2!='' AND (t2.Tel2 IN (t1.Tel1,t1.Tel2,t1.Tel3)))
or
(t2.Tel3!='' AND (t2.Tel3 IN (t1.Tel1,t1.Tel2,t1.Tel3)))
... but I'm aware this is almost certainly Not The Way To Do Things, especially as it's very slow. Can anyone point me in the right direction?
this query requires a little more that this information. If You want to write it in the efficient way we need to know whether there is more duplicates each load or more new records. I assume that account_id is the primary key and has a clustered index.
I would use the temporary table approach that is create a normalized table #r with an index on phone_no and account_id like
SELECT Phone, Account into #tmp
FROM
(SELECT account_id, tel1, tel2, tel3
FROM destination) p
UNPIVOT
(Phone FOR Account IN
(Tel1, tel2, tel3)
)AS unpvt;
create unclustered index on this table with the first column on the phone number and the second part the account number. You can't escape one full table scan so I assume You can scan the import(probably smaller). then just join with this table and use the not exists qualifier as explained. Then of course drop the table after the processing
luke
I am not sure on the perforamance of this query, but since I made the effort of writing it I will post it anyway...
;with aaa(tel)
as
(
select Tel1
from Destination
union
select Tel2
from Destination
union
select Tel3
from Destination
)
,bbb(tel, id)
as
(
select Tel1, account_id
from ImportData
union
select Tel2, account_id
from ImportData
union
select Tel3, account_id
from ImportData
)
select distinct b.id
from bbb b
where b.tel in
(
select a.tel
from aaa a
intersect
select b2.tel
from bbb b2
)
Exists will short-circuit the query and not do a full traversal of the table like a join. You could refactor the where clause as well, if this still doesn't perform the way you want.
SELECT *
FROM ImportData t2
WHERE NOT EXISTS (
select 1
from Destination t1
where (t2.Tel1!='' AND (t2.Tel1 IN (t1.Tel1,t1.Tel2,t1.Tel3)))
or
(t2.Tel2!='' AND (t2.Tel2 IN (t1.Tel1,t1.Tel2,t1.Tel3)))
or
(t2.Tel3!='' AND (t2.Tel3 IN (t1.Tel1,t1.Tel2,t1.Tel3)))
)