Postgres LEFT JOIN is creating more rows than in left table

Postgres LEFT JOIN is creating more rows than in left table - postgresql

I am running Postgres 9.1.3 32-bit on Windows 7 x64. (Have to use 32 bit because there is no Windows PostGIS release compatible with 64 bit Postgres.) (EDIT: As of PostGIS 2.0, it is compatible with Postgres 64 bit on windows.)
I have a query that left joins a table (consistent.master) with a temporary table, then inserts the resulting data into a third table (consistent.masternew).
Since this is a left join, the resulting table should have the same number of rows as the left table in the query. However, if I run this:
SELECT count(*)
FROM consistent.master
I get 2085343. But if I run this:
SELECT count(*)
FROM consistent.masternew
I get 2085703.
How can masternew have more rows than master? Shouldn't masternew have the same number of rows as master, the left table in the query?
Below is the query. The master and masternew tables should be identically-structured.
--temporary table created here
--I am trying to locate where multiple tickets were written on
--a single traffic stop
WITH stops AS (
SELECT citation_id,
rank() OVER (ORDER BY offense_timestamp,
defendant_dl,
offense_street_number,
offense_street_name) AS stop
FROM consistent.master
WHERE citing_jurisdiction=1
)
--Here's the insert statement. Below you'll see it's
--pulling data from a select query
INSERT INTO consistent.masternew (arrest_id,
citation_id,
defendant_dl,
defendant_dl_state,
defendant_zip,
defendant_race,
defendant_sex,
defendant_dob,
vehicle_licenseplate,
vehicle_licenseplate_state,
vehicle_registration_expiration_date,
vehicle_year,
vehicle_make,
vehicle_model,
vehicle_color,
offense_timestamp,
offense_street_number,
offense_street_name,
offense_crossstreet_number,
offense_crossstreet_name,
offense_county,
officer_id,
offense_code,
speed_alleged,
speed_limit,
work_zone,
school_zone,
offense_location,
source,
citing_jurisdiction,
the_geom)
--Here's the select query that the insert statement is using.
SELECT stops.stop,
master.citation_id,
defendant_dl,
defendant_dl_state,
defendant_zip,
defendant_race,
defendant_sex,
defendant_dob,
vehicle_licenseplate,
vehicle_licenseplate_state,
vehicle_registration_expiration_date,
vehicle_year,
vehicle_make,
vehicle_model,
vehicle_color,
offense_timestamp,
offense_street_number,
offense_street_name,
offense_crossstreet_number,
offense_crossstreet_name,
offense_county,
officer_id,
offense_code,
speed_alleged,
speed_limit,
work_zone,
school_zone,
offense_location,
source,
citing_jurisdiction,
the_geom
FROM consistent.master LEFT JOIN stops
ON stops.citation_id = master.citation_id
In case it matters, I have run a VACUUM FULL ANALYZE and reindexed both tables. (Not sure of exact commands; did it through pgAdmin III.)

A left join does not necessarily have the same number of rows as the number of rows in the left table. Basically, it is like a normal join, except rows of the left table that would not appear in the normal join are also added. So, if you have more than one row in the right table that matches one row in the left table, you can have more rows in your results than the number of rows of the left table.
In order to do what you want to do, you should use a group by, and a count to detect multiples.
select citation_id
from stops join master on stops.citation_id = master.citation_id
group by citation_id
having count(*) > 1

Sometimes you know there are multiples, but don't care. You just want to take the first or top entry.
If so, you can use SELECT DISTINCT ON:
FROM consistent.master LEFT JOIN (SELECT DISTINCT ON (citation_id) * FROM stops) s
ON s.citation_id = master.citation_id
Where citation_id is the column that you want to take the first (any) row for each match.
You might want to ensure this is deterministic and use ORDER BY with some other orderable column:
SELECT DISTINCT ON (citation_id) * FROM stops ORDER BY citation_id, created_at

Related

Unexpected sort order on postgres left outer join

Background
I'm using Postgres 11 and pgAdmin4 v5.2. The problem I describe below is on my dev machine which has both the postgres server and pgAdmin client.
Questions I've looked at on SO that deal with incorrect ordering seem related to collation-related issues with ordering of text fields, whereas my problem is on an integer field.
Setup
I have a table norm_plans that contains ~5k records.
Column | Type
---------------------------------
canon_id | integer
name | character varying(200)
...
other fields
canon_id is autopopulated using a sequence.
I've created a new table norm_plans_cmp as a copy of norm_plans (CREATE TABLE norm_plans_cmp AS TABLE norm_plans WITH DATA;)
I next insert some new records into norm_plans and update some existing records (fields other than canon_id.
The new records increment the sequence and are assigned canon_id values as expected.
I now want to compare norm_plans against norm_plans_cmp so I perform a left outer join:
select a.*, b.*
from norm_plans a
left outer join norm_plans_cmp b
on a.canon_id = b.canon_id
order by a.canon_id
Problem
I would expect records to be sorted by canon_id. This holds true from 1-2000, but after 2,000 I get canon_ids from 5,001 to 5,111 (which is the last canon_id) and then it again picks up from 2,001. I'm viewing this data from pgAdmin, see screenshot 1 below showing the shift from 2,000 to 5,001, and screenshot 2 showing the transition again from 5,111 back to 2,001.
Additional observations
While incorrect, the ordering seems consistent. Running the query multiple times results in the same (incorrect) ordering.
Despite my question title, I'm not totally sure the left join has anything to do with this.
Running SELECT * ... ORDER BY canon_id on norm_plans or norm_plans_cmp alone also result in incorrect ordering, albeit at different points in the order.
Answers to this SO question suggest index corruption may be a contributing problem, but I have no indexes on either norm_plans or norm_plans_cmp (canon_id is not defined as a PK).
At this point, I'm stumped!

When is it better to use CTE or temp table postgres

I am doing a query on a very large data set and i am using WITH (CTE) syntax.. this seems to take a while and i was reading online that temp tables could be faster to use in these cases can someone advise me in which direction to go. In the CTE we join to a lot of tables then we filter on the CTE result..
Only interesting in postgres answers

What version of PostgreSQL are you using? CTEs perform differently in PostgreSQL versions 11 and older than versions 12 and above.
In PostgreSQL 11 and older, CTEs are optimization fences (outer query restrictions are not passed on to CTEs) and the database evaluates the query inside the CTE and caches the results (i.e., materialized results) and outer WHERE clauses are applied later when the outer query is processed, which means either a full table scan or a full index seek is performed and results in horrible performance for large tables. To avoid this, apply as much filters in the WHERE clause inside the CTE:
WITH UserRecord AS (SELECT * FROM Users WHERE Id = 100)
SELECT * FROM UserRecord;
PostgreSQL 12 addresses this problem by introducing query optimizer hints to enable us to control if the CTE should be materialized or not: MATERIALIZED, NOT MATERIALIZED.
WITH AllUsers AS NOT MATERIALIZED (SELECT * FROM Users)
SELECT * FROM AllUsers WHERE Id = 100;
Note: Text and code examples are taken from my book Migrating your SQL Server Workloads to PostgreSQL
Summary:
PostgreSQL 11 and older: Use Subquery
PostgreSQL 12 and above: Use CTE with NOT MATERIALIZED clause

My follow up comment is more than I can fit in a comment... so understand this may not be an answer to the OP per se.
Take the following query, which uses a CTE:
with sales as (
select item, sum (qty) as sales_qty, sum (revenue) as sales_revenue
from sales_data
where country = 'USA'
group by item
),
inventory as (
select item, sum (on_hand_qty) as inventory_qty
from inventory_data
where country = 'USA' and on_hand_qty != 0
group by item
)
select
a.item, a.description, s.sales_qty, s.sales_revenue,
i.inventory_qty, i.inventory_qty * a.cost as inventory_cost
from
all_items a
left join sales s on
a.item = s.item
left join inventory i on
a.item = i.item
There are times where I cannot explain why that the query runs slower than I would expect. Some times, simply materializing the CTEs makes it run better, as expected. Other times it does not, but when I do this:
drop table if exists sales;
drop table if exists inventory;
create temporary table sales as
select item, sum (qty) as sales_qty, sum (revenue) as sales_revenue
from sales_data
where country = 'USA'
group by item;
create temporary table inventory as
select item, sum (on_hand_qty) as inventory_qty
from inventory_data
where country = 'USA' and on_hand_qty != 0
group by item;
select
a.item, a.description, s.sales_qty, s.sales_revenue,
i.inventory_qty, i.inventory_qty * a.cost as inventory_cost
from
all_items a
left join sales s on
a.item = s.item
left join inventory i on
a.item = i.item;
Suddenly all is right in the world.
Temp tables may persist across sessions, but to my knowledge the data in them will be session-based. I'm honestly not even sure if the structures persist, which is why to be safe I always drop:
drop table if exists sales;
And use "if exists" to avoid any errors about the object not existing.
I rarely use these in common queries for the simple reason that they are not as portable as a simple SQL statement (you can't give the final query to another user without having the temp tables). My most common use case is when I am processing within a procedure/function:
create procedure sales_and_inventory()
language plpgsql
as
$BODY$
BEGIN
create temp table sales...
insert into sales_inventory
select ...
drop table sales;
END;
$BODY$
Hopefully this helps.
Also, to answer your question on indexes... typically I don't, but nothing says that's always the right answer. If I put data into a temp table, I assume I'm going to use all or most of it. That said, if you plan to query it multiple times with conditions where an index makes sense, then by all means do it.

Multiple table UPDATE using SELECT that requires JOIN using PGAdmin

I am using PGAdmin III and postgres 9.6
I have postgres schemas that consists of 250 tables with various columns. Two tables are of similar names and have the same columns, but unique data and IDs. I want to UPDATE one column in both tables with a specific number but choosing which rows get updated is complicated.
I have a script which correctly works on one table at a time, but I wonder if it can be done in a single script. I have tried using DO and LOOP but keep running into roadblocks.
Here's the working single-table script:
UPDATE table1
SET number = 44
WHERE table1.id IN
(SELECT table1.id FROM plan
JOIN name ON name.id = plan.nameid
JOIN type ON type.id = name.typeid
JOIN simul ON simul.id = plan.simulid
LEFT JOIN table1 ON table1.id = tableid
WHERE type.tag = 'X' AND plan.class LIKE '%Table%' AND simul.label IN
('SIMUL90','SIMUL99','SIMUL87'));
"plan.class" contains text which determines class of simul and thus either Table1 or Table2. Table1 and Table2 are of identical column structure but their IDs may be identical so UNION is out of consideration. TH other joins are to fine tune the SELECT so I get only a small subset of the table. The number to update is the same for either table.

Why are rows in temp table not showing in count?

I have a T-SQL query with temporary tables which is running and showing the count of rows per table however for one of the temp tables (which has 124 rows) the SELECT COUNT is showing as a zero and not the 124 that are in the temporary table.
I have tried changing the joins in the select statement and retyping the entire query. I have added aliases for all the fields and double checked that all of the temporary tables are giving the correct results. As far as I can see I have written the COUNT the same way for all 4 of the temporary tables but the COUNT(#ML.MembershipID) is the only one that is showing as a zero and not matching the row count.
--this is the select part of my query
SELECT #Type.MembershipType,
COUNT(#MS.MembershipID) AS MembersStartCount,
COUNT(#ME.MembershipID) AS MembersEndCount,
COUNT(#ML.MembershipID) AS MembersLostCount,
COUNT(#MN.MembershipID) AS MembersGainedCount
FROM Filteredccx_Membership mem INNER JOIN #Type ON mem.ccx_membershipid=#Type.MembershipID
LEFT OUTER JOIN #MS ON mem.ccx_membershipid=#MS.MembershipID
LEFT OUTER JOIN #ME ON mem.ccx_membershipid=#ME.MembershipID
LEFT OUTER JOIN #MN ON mem.ccx_membershipid=#MN.MembershipID
LEFT OUTER JOIN #ML ON mem.ccx_membershipid=#ML.MembershipID
GROUP BY #type.MembershipType
The COUNT(#ML.MembershipID) AS MembersLostCount, should be showing the 124 in total across the 2 membership types but it is showing 0 in both rows. All of the other COUNT's are showing the number of rows in the temp tables.

Your LEFT OUTER JOIN is probably not returning anything from the #ML table.
Meaning that #ML.MembershipID is NULL for all rows.
Try changing it to:
COUNT(ISNULL(#ML.MembershipID, 0)) AS MembersLostCount,

Updating multiple rows in one table based on multiple rows in a second

I have two tables, table1 and table2, both of which contain columns that store postgis geometries. What I want to do is see where the geometry stored in any row of table2 geometrically intersects with the geometry stored in any row of table1 and update a count column in table1 with the number of intersections. Therefore, if I have a geometry in row 1 of table1 that intersects with the geometries stored in 5 rows in table2, I want to store a count of 5 in a separate column in table one. The tricky part for me is that I want to do this for every row of column 1 at the same time.
I have the following:
UPDATE circles SET intersectCount = intersectCount + 1 FROM rectangles
WHERE ST_INTERSECTS(cirlces.geom, rectangles.geom);
...which doesn't seem to be working. I'm not too familiar with postgres (or sql in general) and I'm wondering if I can do this all in one statement or if I need a few. I have some ideas for how I would do this with multiple statements (or using for loop) but I'm really looking for a concise solution. Any help would be much appreciated.
Thanks!

something like:
update t1 set ctr=helper.ctr
from (
select t1.id, count(*) as cnt
from t1, t2
where st_intersects(t1.col, t2.col)
group by t1.id
) helper
where helper.id=t1.id
?
btw: Your version does not work, because a row can get updated only once in a single update statement.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse