Redshift nulling columns when joining another table - left join

table_1 has 35 columns and table_2 has 20 columns.
The query is:
select table1.*,
       table2.f1,
       ...
       table2.f20
from public.table_1 as table1
left join public.table_2 as table2
    on table1.id = table2.id
    and table1.arrival_time::date <= table2.end_date::date
    and table2.activity_date < table2.end_date
;
This works: I expect 469 rows to be returned, and that's what I get. However, several fields from table_1 are displayed as NULL instead of the values stored in the table.
These fields are NOT part of the join.
Due to IP concerns I can't provide the full details of the tables. Every field in table_1 and table_2 is a varchar (don't ask me why a timestamp is stored as a varchar - it's a long story that I have no control over).
This query WORKS in RDS PostgreSQL!
Any ideas why it has a problem in Redshift?

Well, I'll be. I was very confused, but the problem wasn't the query at all.
table_1 is data from two sources joined together - I didn't even think to look at the sources. It turns out the linked source had no data for one value.
Just goes to show that when looking at just a piece of the data, you need to look HARD at all of it.
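For anyone who hits something similar: a quick sanity check against the base table on its own would have surfaced this right away. The column name below is a placeholder, since I can't share the real schema:

select count(*)
from public.table_1
where some_field is null;  -- some_field: placeholder for one of the columns that came back NULL

If that count is non-zero, the NULLs are in the data itself, not something the join is doing.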
Now I'm off to find a better source for the missing data.
Thanks for your time!
James

Related

How to insert new rows only on tables without Primary or Foreign Keys?

Scenario: I have two tables, Table A and Table B, and both have exactly the same columns. My task is to create a master table. I need to ensure no duplicates end up in the master table unless a row is a new record.
Problem: Whoever built the tables did not assign a Primary Key to them.
Attempts: I attempted running an INSERT INTO ... WHERE NOT EXISTS query (below as an example, not the actual query I ran).
Question: The portion of the query below, WHERE t2.id = t1.id, confuses me. My table has a multitude of columns and there is no id column - like I said, it has no PRIMARY KEY to anchor the match. So, in a scenario where all I have are values without primary keys, how can I append only new records? Also, perhaps I am going about this the wrong way, but are there other functions or options in T-SQL worth considering? Maybe not an INSERT INTO statement, or perhaps something else? My SQL skills aren't yet that advanced, so I am not asking for a solution, just ideas or other methods worth considering. Any ideas are welcome.
INSERT INTO TABLE_2 (id, name)
SELECT t1.id,
       t1.name
FROM TABLE_1 t1
WHERE NOT EXISTS (SELECT id
                  FROM TABLE_2 t2
                  WHERE t2.id = t1.id)
If I understand your question correctly, you would need to amend the SQL sample you posted by changing the condition t2.id = t1.id to whatever columns you do have.
Say your 2 tables have name and brand columns and you don't want duplicates, just change the sample to:
WHERE t2.name = t1.name
  AND t2.brand = t1.brand
This will ensure you don't insert any rows from table 1 into table 2 which are duplicates. You would have to make sure the WHERE condition contains all the columns (you said the table schemas are identical).
Also, the above code sample copies everything into table 2 - but you said you want a master table - so you'd have to change it to insert into the master table, not table 2.
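Putting both changes together, a sketch might look like the following. Note that master_table and the name/brand columns are assumptions carried over from the example above, not your real schema:

INSERT INTO master_table (name, brand)
SELECT t1.name,
       t1.brand
FROM TABLE_1 t1
WHERE NOT EXISTS (SELECT 1
                  FROM master_table t2
                  WHERE t2.name = t1.name
                    AND t2.brand = t1.brand)

One caveat: comparisons with = never match NULLs, so if the columns are nullable, two rows that agree except for NULLs will be treated as different records.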

NOT IN query performance issue with large data

I was trying to get the id and the number from a table, with the condition that the number isn't among the ids.
select id,number from tmp_t where number not in (select id from tmp_t)
I tried the query and it takes soooo looonggg... almost 40 minutes, and then I got disconnected from the server.
So what should I do? The data is around 500K rows.
What I want to show is: "here you go, the id and the number, where the number doesn't exist among the ids."
I tried to insert the number, but the number is an FK that depends on the ID, so I wanted to know the id and the number - that's why I'm using NOT IN.
Maybe someone knows? Btw, I'm using PostgreSQL 13.
You can write it with NOT EXISTS instead, although these queries will have different results if any value of id is NULL (in which case NOT IN probably doesn't yield the answer you want, so NOT EXISTS is better from that perspective as well):
select id, number
from tmp_t
where not exists (select 1
                  from tmp_t a
                  where a.id = tmp_t.number);
But your formulation is also efficient as long as work_mem is large enough.
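If you want to experiment with that, you can raise work_mem for the current session (the value below is only illustrative) and inspect the plan; with enough memory the planner can satisfy NOT IN with a hashed SubPlan instead of rescanning the subquery per row:

set work_mem = '256MB';
explain
select id, number from tmp_t where number not in (select id from tmp_t);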
Typically NOT EXISTS is faster (and doesn't suffer from surprises if NULL values are involved):
select t1.id, t1.number
from tmp_t t1
where not exists (select *
                  from tmp_t t2
                  where t2.id = t1.number)
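To see why NOT IN misbehaves with NULLs, here is a minimal, self-contained illustration (toy values, not the question's data):

-- 2 NOT IN (1, NULL) evaluates to NULL rather than true, because 2 <> NULL is unknown,
-- so the row is filtered out even though 2 is clearly not among the non-NULL ids.
select 2 as number
where 2 not in (select unnest(array[1, null::int]));
-- returns 0 rows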

Is this a JOIN, Lookup or how to select only records matching a col from two tables

I have two Postgres tables that share a column listing a city name. I'm trying to create a view of some records which I'm displaying on a map via WMS on my GeoServer.
I need to select only the records from table1 (100k records) whose city name matches one of the cities listed in table2 (20 records).
Listing everything I've tried would be a waste of your time. I've tried every join tutorial and example but am perplexed as to why I can't get any success. I would really appreciate some direction.
Here's my latest query; if this is the wrong approach, just ignore it, since I have about 50 similar attempts.
SELECT t1.id,
       t1.dba,
       t1.prem_city,
       t1.geom,
       t2.city_label
FROM schema1.table1 AS t1
LEFT JOIN schema2.table2 AS t2
    ON t2.city_label = t1.prem_city;
Thanks for any help!
Your query seems correct; just a minor change is needed. A LEFT JOIN keeps all the records from the left table and only the matching records from the right one. If you want only those that appear in both, an INNER JOIN is required:
SELECT t1.id,
       t1.dba,
       t1.prem_city,
       t1.geom,
       t2.city_label
FROM schema1.table1 t1
JOIN schema2.table2 t2
    ON t2.city_label = t1.prem_city;

How to optimise tables in Netezza to complement a join with date conditions

I have two tables that I need to join in Netezza, and one of them is very large.
I have a dimension table - a customer table - which has two fields, customer id and an observation date, i.e.:
cust_id, obs_date
'a','2015-01-05'
'b','2016-02-03'
'c','2014-05-21'
'd','2016-01-31'
I have a fact table that is transactional and very high in volume. It has a lot of transactions per customer per date, i.e.:
cust_id, tran_date, transaction_amt
'a','2015-01-01',1
'a','2015-01-01',2
'a','2015-01-01',5
'a','2015-01-02',7
'a','2015-01-02',2
'b','2016-01-02',12
Both tables are distributed on the same key - cust_id.
However, when I join the tables, I need to join with the date condition applied. The query is very fast when I just join them together, but when I add the date condition it does not seem optimised. Does anyone have tips on how to set up the underlying tables or write the join?
I.e., sum transaction_amt for each customer across all their transactions for the 3 months up to their obs_date:
SELECT CUSTOMER_TABLE.cust_id,
       SUM(TRANSACTION_TABLE.transaction_amt) AS total_amt
FROM CUSTOMER_TABLE
INNER JOIN TRANSACTION_TABLE
    ON CUSTOMER_TABLE.cust_id = TRANSACTION_TABLE.cust_id
    AND TRANSACTION_TABLE.TRAN_DATE BETWEEN CUSTOMER_TABLE.OBS_DATE - 30 AND CUSTOMER_TABLE.OBS_DATE
GROUP BY CUSTOMER_TABLE.cust_id
If your transaction table is sufficiently large, it may benefit from using CBTs (clustered base tables).
If you can, create a copy of the table that uses TRAN_DATE to organize (I'm guessing at your DDL here):
create table transaction_table (
    cust_id varchar(20)
    ,tran_date date
    ,transaction_amt numeric(10,0)
)
distribute on (cust_id)
organize on (tran_date);
Join to that and see if performance is improved. You could also use a materialized view for just those columns, but I think a CBT would be more useful here.
As Scott mentions in the comments below, you should either sort by the date on insert or groom the records after to make sure that they are sorted appropriately.
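If you go the CBT route on a table that already has data (or that receives unsorted inserts), the groom step mentioned above would look something like this - a sketch, so check the GROOM options on your Netezza version:

groom table transaction_table records all;

GROOM rewrites the table so rows are physically ordered per the ORGANIZE ON clause, which is what lets zone maps skip extents when the join filters on TRAN_DATE.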

Postgres subquery has access to a column in a higher-level table. Is this a bug, or a feature I don't understand?

I don't understand why the following doesn't fail. How does the subquery have access to a column from a different table at the higher level?
drop table if exists temp_a;
create temp table temp_a as
(
select 1 as col_a
);
drop table if exists temp_b;
create temp table temp_b as
(
select 2 as col_b
);
select col_a from temp_a where col_a in (select col_a from temp_b);
/*why doesn't this fail?*/
The following fail, as I would expect them to.
select col_a from temp_b;
/*ERROR: column "col_a" does not exist*/
select * from temp_a cross join (select col_a from temp_b) as sq;
/*ERROR: column "col_a" does not exist
*HINT: There is a column named "col_a" in table "temp_a", but it cannot be referenced from this part of the query.*/
I know about the LATERAL keyword (link, link), but I'm not using LATERAL here. Also, this query succeeds even in pre-9.3 versions of Postgres (9.3 being the version in which the LATERAL keyword was introduced).
Here's a sqlfiddle: http://sqlfiddle.com/#!10/09f62/5/0
Thank you for any insights.
Although this feature might be confusing, without it several types of queries would be more difficult, slower, or impossible to write in SQL. This feature is called a "correlated subquery", and the correlation can serve a similar function to a join.
For example, consider this statement:
select first_name, last_name
from users u
where exists (select *
              from orders o
              where o.user_id = u.user_id)
This query will get the names of all the users who have ever placed an order. Now, I know you can get that info using a join to the orders table, but you'd also have to use DISTINCT, which would internally require a sort and would likely perform a tad worse than this query. You could also produce a similar query with GROUP BY.
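For comparison, here is the join formulation just described; note the DISTINCT needed to collapse users with multiple orders back down to one row each:

select distinct u.first_name, u.last_name
from users u
join orders o on o.user_id = u.user_id;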
Here's a better example that's pretty practical, and not just for performance reasons. Suppose you want to delete all users who have no orders and no tickets:
delete from users u
where not exists (select * from orders o where o.user_id = u.user_id)
  and not exists (select * from tickets t where t.user_id = u.user_id)
One very important thing to note is that you should fully qualify or alias your table names when doing this, or you might wind up with a typo that completely messes up the query - one that silently "just works" while returning bad data.
The following is an example of what NOT to do.
select * from users
where exists (select * from product where last_updated_by = user_id)
This looks just fine until you look at the tables and realize that the "product" table has no "last_updated_by" field while the users table does, so the query returns the wrong data. Add the alias and the query will fail, because no "last_updated_by" column exists in product.
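For illustration, here is the same query with aliases added; assuming product really has no last_updated_by column, it now fails loudly instead of silently returning bad data:

select * from users u
where exists (select *
              from product p
              where p.last_updated_by = u.user_id)
-- ERROR: column p.last_updated_by does not exist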
I hope this has given you some examples that show how to use this feature. I use correlated subqueries all the time in UPDATE and DELETE statements (as well as in SELECTs), and I often find an absolute need for them in updates and deletes.