Using EXCEPT and flagging column differences - postgresql

What I'm looking to do is select data from a Postgres table that does not appear in another. Both tables have identical columns, bar the use of boolean instead of varchar(1), but the issue is that the data in those columns does not match up.
I know I can do this with a SELECT ... EXCEPT SELECT statement, which I have implemented and which is working.
What I would like to do is find a method to flag the columns that do not match up. As an idea, I have thought of prepending a character to the data in the fields that do not match.
For example, if updateflag differs between the two tables, I would get '* f' back instead of 'f'.
SELECT id, number, "updateflag" from dbc.person
EXCEPT
SELECT id, number, "updateflag"::bool from dbg.person;
Should I be joining the two tables together after executing this statement, to identify the differences in what's returned?
I have tried to research methods to implement this but have not found anything on the topic.

I prefer a full outer join for this:
select *
from dbc.person p1
full join dbg.person p2 on p1.id = p2.id
where p1 is distinct from p2;
The id column is assumed to be the primary key column that "links" the two tables together.
This will only return rows where at least one column is different, including rows that exist in only one of the two tables.
If you want to see the differences, you could use the hstore extension (installed with CREATE EXTENSION hstore):
select hstore(p1) - hstore(p2) as columns_diff_p1,
hstore(p2) - hstore(p1) as columns_diff_p2
from dbc.person p1
full join dbg.person p2 on p1.id = p2.id
where p1 is distinct from p2;
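If the goal is the '* f' style of flag from the question rather than an hstore diff, a per-column CASE is one way to get it. This is only a sketch: it assumes id is the primary key in both tables, that dbg.person.updateflag is the varchar(1) column and casts cleanly to boolean, and that only rows present in both tables are of interest (an inner join drops the rest).

```sql
-- Sketch: prefix a value with '* ' when it differs between the two tables.
-- Assumptions: id is the key in both tables; dbg.person.updateflag is
-- varchar(1) and can be cast to boolean, as in the question's EXCEPT query.
SELECT p1.id,
       -- CASE without ELSE yields NULL, and concat() ignores NULL arguments,
       -- so unflagged values come back unchanged (as text).
       concat(CASE WHEN p1.number IS DISTINCT FROM p2.number
                   THEN '* ' END,
              p1.number) AS number,
       concat(CASE WHEN p1.updateflag IS DISTINCT FROM p2.updateflag::bool
                   THEN '* ' END,
              p1.updateflag) AS updateflag
FROM dbc.person p1
JOIN dbg.person p2 ON p1.id = p2.id
WHERE p1.number IS DISTINCT FROM p2.number
   OR p1.updateflag IS DISTINCT FROM p2.updateflag::bool;
```

IS DISTINCT FROM is used instead of <> so that a NULL on one side and a value on the other counts as a difference. Every flagged column becomes text, so this is better suited to a report than to further processing.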

Related

sql restriction for join table with string similarity rule

My database is built from several tables that are similar to each other and share the same column names. The purpose is to compare the data coming from each source.
table_A and table_B: id, product_id, capacitor_name, ressitance
It is easy to join the tables by product_id and see the comparison,
but I need to compare rows by product_id when it exists in both tables and, when it does not, compare by name similarity and restrict the output to at most 3 results.
The names are usually not equal, which is why I'm using similarity.
SELECT * FROM table_a ta
JOIN table_b tb
ON
ta.product_id = tb.product_id
OR
similarity(ta.name,tb.name) > 0.8
It works fine. But the problem is that sometimes I'm getting more data than I need. How can I restrict it? (And, moreover, order it by similarity so that the most similar names come first.)
If you want to benefit from a trigram index, you need to use the operator form (%), not the function form. Then order on two expressions: the first puts exact product_id matches first, and the second sorts the remaining matches from most to least similar. Use LIMIT to cap the number of rows. I've assumed you have some WHERE condition that restricts this to just one row of table_a. If not, your question is not well formed: what is the limit supposed to apply to? Each what should be limited to just 3?
SELECT * FROM table_a ta
JOIN table_b tb
ON
ta.product_id = tb.product_id
OR
ta.name % tb.name
WHERE ta.id=$1
ORDER BY ta.product_id = tb.product_id desc, similarity(ta.name,tb.name) desc
LIMIT 3
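If the intent is instead "at most 3 matches for each row of table_a", a LATERAL subquery is one way to apply the limit per row. The following is a sketch under that assumption. It uses the name column as in the queries above, the index name is made up, and the threshold setting assumes a PostgreSQL version that has the pg_trgm.similarity_threshold parameter.

```sql
-- Sketch only; assumes the pg_trgm extension is available.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Hypothetical index name; lets the % operator use a trigram index.
CREATE INDEX table_b_name_trgm_idx ON table_b USING gin (name gin_trgm_ops);

-- The % operator compares against this threshold (default 0.3);
-- 0.8 mirrors the cutoff used in the question.
SET pg_trgm.similarity_threshold = 0.8;

SELECT ta.*, m.*
FROM table_a ta
CROSS JOIN LATERAL (
    SELECT tb.*
    FROM table_b tb
    WHERE tb.product_id = ta.product_id
       OR tb.name % ta.name
    ORDER BY (tb.product_id = ta.product_id) DESC NULLS LAST,  -- exact matches first
             similarity(ta.name, tb.name) DESC                 -- then most similar
    LIMIT 3                                                    -- per row of table_a
) m;
```

CROSS JOIN LATERAL drops rows of table_a that have no match at all; LEFT JOIN LATERAL ... ON true would keep them with NULLs on the table_b side.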

Is it a function or a table in this lateral join case?

It is often particularly handy to LEFT JOIN to a LATERAL subquery, so that source rows will appear in the result even if the LATERAL subquery produces no rows for them. For example, if get_product_names() returns the names of products made by a manufacturer, but some manufacturers in our table currently produce no products, we could find out which ones those are like this:
SELECT m.name
FROM manufacturers m LEFT JOIN LATERAL get_product_names(m.id) pname ON true
WHERE pname IS NULL;
All of the above is taken from the PostgreSQL manual.
I think I finally understand what LATERAL means. In this case, however,
I am not sure whether get_product_names is a table or a function. The following is my understanding.
A: get_product_names(m.id) is a function that takes m.id as an input parameter and returns a table. The returned table is aliased as pname. Overall, it is table m joined to a table that is empty (because of the WHERE condition).
B: get_product_names is a table, and table m is left joined to table get_product_names on m.id. pname is an alias for get_product_names. Overall, it is table m joined to a table that is empty (because of the WHERE condition).
get_product_names is a table function (also known as set returning function or SRF in PostgreSQL slang). Such a function does not necessarily return a single result row, but arbitrarily many rows.
Since the result of such a function is a table, you typically use it in SQL statements where you would use a table, that is, in the FROM clause.
A simple example is
SELECT * FROM generate_series(1, 5);
generate_series
-----------------
1
2
3
4
5
(5 rows)
You can also use normal functions in this way; they are then treated as a table function that returns exactly one row.
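To make the manual's example concrete, here is a sketch of how get_product_names could be defined. The manual does not show its definition, so the products table, its columns, and the function body are assumptions made up for illustration.

```sql
-- Hypothetical schema, for illustration only.
CREATE TABLE manufacturers (id int PRIMARY KEY, name text);
CREATE TABLE products (
    manufacturer_id int REFERENCES manufacturers (id),
    name            text
);

-- A set-returning (table) function: zero or more rows per call.
CREATE FUNCTION get_product_names(int)
RETURNS SETOF text
LANGUAGE sql STABLE
AS $$
    SELECT name FROM products WHERE manufacturer_id = $1;
$$;

-- The function is evaluated once per row of m; pname aliases its result.
-- Manufacturers for which it returns no rows survive the LEFT JOIN with
-- pname set to NULL, and the WHERE clause keeps exactly those.
SELECT m.name
FROM manufacturers m
LEFT JOIN LATERAL get_product_names(m.id) pname ON true
WHERE pname IS NULL;
```

This corresponds to reading A in the question: a function call whose result is used like a table.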

SQL with table as becomes ambiguous

Perhaps I'm approaching this all wrong, in which case feel free to point out a better way to solve the overall question, which is "How do I use an intermediate table for future queries?"
Let's say I've got tables foo and bar, which join on some baz_id, and I want to combine them into an intermediate table to be fed into upcoming queries. I know of the WITH .. AS (...) statement, but am running into problems as such:
WITH foobar AS (
SELECT *
FROM foo
INNER JOIN bar ON bar.baz_id = foo.baz_id
)
SELECT
baz_id
-- some other things as well
FROM
foobar
The issue is that (Postgres 9.4) tells me baz_id is ambiguous. I understand this happens because SELECT * includes all the columns in both tables, so baz_id shows up twice; but I'm not sure how to get around it. I was hoping to avoid copying the column names out individually, like
SELECT
foo.var1, foo.var2, foo.var3, ...
bar.other1, bar.other2, bar.other3, ...
FROM foo INNER JOIN bar ...
because there are hundreds of columns in these tables.
Is there some way around this I'm missing, or some altogether different way to approach the question at hand?
WITH foobar AS (
SELECT *
FROM foo
INNER JOIN bar USING(baz_id)
)
SELECT
baz_id
-- some other things as well
FROM
foobar
Joining with USING instead of ON leaves only one instance of the baz_id column in the select list.
From the documentation:
The USING clause is a shorthand that allows you to take advantage of the specific situation where both sides of the join use the same name for the joining column(s). It takes a comma-separated list of the shared column names and forms a join condition that includes an equality comparison for each one. For example, joining T1 and T2 with USING (a, b) produces the join condition ON T1.a = T2.a AND T1.b = T2.b.
Furthermore, the output of JOIN USING suppresses redundant columns: there is no need to print both of the matched columns, since they must have equal values. While JOIN ON produces all columns from T1 followed by all columns from T2, JOIN USING produces one output column for each of the listed column pairs (in the listed order), followed by any remaining columns from T1, followed by any remaining columns from T2.
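A minimal illustration of the column suppression described above, using throwaway tables. The column names var1 and other1 are borrowed from the question, and the types are assumptions.

```sql
-- Throwaway tables for illustration; types are assumed.
CREATE TEMP TABLE foo (baz_id int, var1 text);
CREATE TEMP TABLE bar (baz_id int, other1 text);

-- JOIN ... ON: output columns are baz_id, var1, baz_id, other1
SELECT * FROM foo INNER JOIN bar ON bar.baz_id = foo.baz_id;

-- JOIN ... USING: output columns are baz_id, var1, other1
SELECT * FROM foo INNER JOIN bar USING (baz_id);
```

Note that USING only merges the columns it lists. If foo and bar share any other column name, that name still appears twice in SELECT * and remains ambiguous when referenced from the CTE.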

How to specify two expressions in the select list when the subquery is not introduced with EXISTS

I have a query that uses a subquery and I am having a problem returning the expected results. The error I receive is..."Only one expression can be specified in the select list when the subquery is not introduced with EXISTS." How can I rewrite this to work?
SELECT
a.Part,
b.Location,
b.LeadTime
FROM
dbo.Parts a
LEFT OUTER JOIN dbo.Vendor b ON b.Part = a.Part
WHERE
b.Location IN ('A','B','C')
AND
Date IN (SELECT Location, MAX(Date) FROM dbo.Vendor GROUP BY Location)
GROUP BY
a.Part,
b.Location,
b.LeadTime
ORDER BY
a.Part
I think something like this may be what you're looking for. You didn't say what version of SQL Server--this works in SQL 2005 and up:
SELECT
p.Part,
p.Location, -- from *p*, otherwise if no match we'll get a NULL
v.LeadTime
FROM
dbo.Parts p
OUTER APPLY (
SELECT TOP (1) * -- * here is okay because we specify columns outside
FROM dbo.Vendor v
WHERE p.Location = v.Location -- the correlation part
ORDER BY v.Date DESC
) v
WHERE
p.Location IN ('A','B','C')
ORDER BY
p.Part
;
Now, your query can be repaired as is by adding the "correlation" part to change your query into a correlated subquery as demonstrated in Kory's answer (you'd also remove the GROUP BY clause). However, that method still requires an additional and unnecessary join, hurting performance, plus it can only pull one column at a time. This method allows you to pull all the columns from the other table, and has no extra join.
Note: this gives logically the same results as Lamak's answer, however I prefer it for a few reasons:
When there is an index on the correlation columns (Location, here) this can be satisfied with seeks, but the Row_Number solution has to scan (I believe).
I prefer the way this expresses the intent of the query more directly and succinctly. In the Row_Number method, one must get out to the outer condition to see that we are only grabbing the rn = 1 values, then bop back into the CTE to see what that is.
Using CROSS APPLY or OUTER APPLY, all the other tables not involved in the single-inner-row-per-outer-row selection are outside where (to me) they belong. We aren't squishing concerns together. Using Row_Number feels a bit like throwing a DISTINCT on a query to fix duplication rather than dealing with the underlying issue. I guess this is basically the same issue as the previous point worded in a different way.
The moment you have TWO tables from which you wish to pull the most recent value, the Row_Number() solution blows up completely. With this syntax, you just easily add another APPLY clause, and it's crystal clear what you're doing. There is a way to use Row_Number for the multiple tables scenario by moving the other tables outside, but I still don't prefer that syntax.
Using this syntax allows you to perform additional joins based on whether the selected row exists or not (in the case that no matching row was found). In the Row_Number solution, you can only reasonably do that NOT NULL checking in the outer query--so you are forced to split up the query into multiple, separated parts (you don't want to be joining to values you will be discarding!).
P.S. I strongly encourage you to use aliases that hint at the table they represent. Please don't use a and b. I used p for Parts and v for Vendor--this helps you and others make sense of the query more quickly in the future.
If I understood you correctly, you want the rows with the max date for locations A, B and C. Now, assuming SQL Server 2005+, you can do this:
;WITH CTE AS
(
SELECT
a.Part,
b.Location,
b.LeadTime,
RN = ROW_NUMBER() OVER(PARTITION BY a.Part ORDER BY [Date] DESC)
FROM
dbo.Parts a
LEFT OUTER JOIN dbo.Vendor b ON b.Part = a.Part
WHERE
b.Location IN ('A','B','C')
)
SELECT Part,
Location,
LeadTime
FROM CTE
WHERE RN = 1
ORDER BY Part
In your subquery you need to correlate the Location and Part to the outer query.
Example:
Date = (SELECT MAX(Date)
FROM dbo.Vendor v
WHERE v.Location = b.Location
AND v.Part = b.Part
)
So this will bring back one date for each combination of location and part.
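Put together with the original query, the correlated version might look like the sketch below. It assumes Date, Location, LeadTime and Part are all columns of dbo.Vendor, which the question implies but does not state, and it uses the p and v aliases in place of a and b.

```sql
-- Sketch of the original query with a correlated subquery (T-SQL).
-- Assumes [Date], Location, LeadTime and Part are columns of dbo.Vendor.
SELECT
    p.Part,
    v.Location,
    v.LeadTime
FROM dbo.Parts p
LEFT OUTER JOIN dbo.Vendor v ON v.Part = p.Part
WHERE v.Location IN ('A', 'B', 'C')
  AND v.[Date] = (SELECT MAX(v2.[Date])
                  FROM dbo.Vendor v2
                  WHERE v2.Location = v.Location   -- correlate on location
                    AND v2.Part = v.Part)          -- and on part
ORDER BY p.Part;
```

Because the WHERE clause filters on columns of v, parts with no vendor row are dropped, so the LEFT OUTER JOIN behaves like an inner join here. The GROUP BY from the original is no longer needed unless several vendor rows share the same maximum date.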

TSQL selecting unique value from multiple ranges in a column

A question from a beginner.
I have two tables. The first (A) contains Start_time, End_time, Status. The second (B) contains Timestamp, Error_code. The second table is logged automatically by the system every few seconds, so it contains lots of non-unique values of Error_code (the code changes randomly, but within a time range from table A). What I need is to select the unique error codes for every time range (in my case, every row) in table A:
A.Start_time, A.End_time, B.Error_code.
I have come to this:
select A.Start_time,
A.End_time,
B.Error_code
from B
inner join A
on B.Timestamp between A.Start_time and A.End_time
This is wrong, I know.
Any thoughts are welcome.
If your query returns a lot of duplicates, use DISTINCT to remove them:
select DISTINCT A.Start_time, A.End_time, B.Error_code
from B
inner join A on B.Timestamp between A.Start_time and A.End_time