I want to compare column values that come from two different queries. Can anyone suggest a query that compares two columns in Postgres?
Well, the easiest to understand--but not necessarily the fastest--is probably something like this. (But you might mean something else by "compare".)
-- Values in column1 that aren't in column2.
SELECT column1 FROM query1
WHERE column1 NOT IN (SELECT column2 FROM query2);
-- Values in column2 that aren't in column1.
SELECT column2 FROM query2
WHERE column2 NOT IN (SELECT column1 FROM query1);
-- Values common to both column1 and column2
SELECT q1.column1 FROM query1 q1
INNER JOIN query2 q2 ON (q1.column1 = q2.column2);
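One caveat with the NOT IN form above: if the subquery's column contains a NULL, NOT IN returns no rows at all. If you only care about the values themselves, EXCEPT sidesteps that (and deduplicates as a bonus):

```sql
-- Values in column1 that aren't in column2, safe even when
-- column2 contains NULLs (NOT IN would return nothing then).
SELECT column1 FROM query1
EXCEPT
SELECT column2 FROM query2;
```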
You can also do this in a single statement to give you a visual comparison. A FULL OUTER JOIN returns all the values in both columns, with matching values in the same row, and NULL where one column is missing a value that's in the other column.
SELECT q1.column1, q2.column2 FROM query1 q1
FULL OUTER JOIN query2 q2 ON (q1.column1 = q2.column2);
I have a PostgreSQL database that is partitioned into multiple schemas, one for each tenant and workspace (the exact meaning of these terms doesn't matter, they're just dimensions of the partition scheme):
reports/tenant1/workspace1
reports/tenant1/workspace2
reports/tenant2/workspace1
reports/tenant3/workspace1
reports/tenant3/workspace2
reports/tenant3/workspace3
Each workspace schema has the same set of tables with identical definitions, and each table includes "_tenant" and "_workspace" columns with the values of its enclosing schema, e.g., tenant1 and workspace1.
In the public schema, there is one view per table definition that unions the tables with that definition across all workspace schemas. For example, the view for "example_table" would be:
SELECT _tenant, _workspace, column1, column2, column3
FROM "reports/tenant1/workspace1".example_table
WHERE _tenant = 'tenant1' AND _workspace = 'workspace1'
UNION ALL
SELECT _tenant, _workspace, column1, column2, column3
FROM "reports/tenant1/workspace2".example_table
WHERE _tenant = 'tenant1' AND _workspace = 'workspace2'
UNION ALL
SELECT _tenant, _workspace, column1, column2, column3
FROM "reports/tenant2/workspace1".example_table
WHERE _tenant = 'tenant2' AND _workspace = 'workspace1'
UNION ALL
-- ... one SELECT per remaining workspace schema
Note the "redundant" partition predicates in each SELECT. I added them because they seem to hint to PostgreSQL that it can skip the tables in unrelated partitions when the view is queried with the same predicates. Indeed, EXPLAIN ANALYZE shows "(never executed)" for those subplans.
Queries are made from a BI tool to the views, and the BI tool automatically adds predicates on the "_tenant" and "_workspace" columns based on attributes of the logged-in user.
Now that there are 50+ workspaces, I've noticed that queries on the views can have non-optimal plans when compared to equivalent queries on the underlying tables. For example, the following query on the views might use a nested loop join that takes 1 minute:
SELECT * FROM
(
SELECT column1, column2, column3
FROM example_view1
WHERE _tenant = 'tenant1' AND _workspace = 'workspace1'
) v1
JOIN
(
SELECT column4, column5, column6
FROM example_view2
WHERE _tenant = 'tenant1' AND _workspace = 'workspace1'
) v2
ON v1.column1 = v2.column4
Whereas the equivalent query on the underlying tables would use a hash join and complete in under a second:
SELECT * FROM
(
SELECT column1, column2, column3
FROM "reports/tenant1/workspace1".example_table1
WHERE _tenant = 'tenant1' AND _workspace = 'workspace1'
) v1
JOIN
(
SELECT column4, column5, column6
FROM "reports/tenant1/workspace1".example_table2
WHERE _tenant = 'tenant1' AND _workspace = 'workspace1'
) v2
ON v1.column1 = v2.column4
I know the subqueries are pointless, but it's how the BI tool's query builder generates the SQL for the join.
Is there a way to let the query planner know that all tables outside the selected partition won't return results and can be ignored? As I said before, EXPLAIN ANALYZE shows queries are never executed on these tables due to the "redundant" partition predicates in the view definition, but that doesn't seem to be used at planning time.
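For what it's worth, one thing to try (a sketch, not tested against this exact schema): since each physical table stores exactly one (_tenant, _workspace) pair, declaring that fact as a CHECK constraint lets the planner prove at plan time that non-matching branches of the UNION ALL can never return rows. PostgreSQL applies constraint exclusion to UNION ALL subqueries under the default constraint_exclusion = partition setting. The constraint name below is hypothetical:

```sql
-- Repeat for each table in each workspace schema,
-- with that schema's tenant/workspace values.
ALTER TABLE "reports/tenant1/workspace1".example_table
  ADD CONSTRAINT example_table_partition_chk
  CHECK (_tenant = 'tenant1' AND _workspace = 'workspace1');
```

With the constraints in place, plain EXPLAIN (without ANALYZE) should show the unrelated branches dropped from the plan entirely, instead of being planned and then marked "(never executed)" at run time.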
I have two tables, let's say A and B. For each row in table A, I'd like to count the rows in table B whose column2 matches A's column2 and store that count in A's column1.
I am using the query shown here, but it's taking a really long time, so I'd appreciate it if somebody could provide a better and faster alternative:
UPDATE tableA
SET column1 = (SELECT COUNT(*)
FROM tableB
WHERE tableA.column2 = tableB.column2)
Use the proprietary UPDATE ... FROM to perform a join that can be something other than a nested loop:
UPDATE tableA SET column1 = tbc.count
FROM (SELECT column2,
count(*) AS count
FROM tableB
GROUP BY column2) AS tbc
WHERE tableA.column2 = tbc.column2;
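One thing to watch, assuming some column2 values in tableA have no match in tableB: the join form above simply skips those rows and leaves column1 unchanged, whereas the original correlated subquery would have set them to 0. If you need zeros there, a follow-up statement covers them:

```sql
-- Reset the rows that the UPDATE ... FROM above did not touch.
UPDATE tableA
SET column1 = 0
WHERE NOT EXISTS (SELECT 1 FROM tableB
                  WHERE tableB.column2 = tableA.column2);
```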
I have two tables (Table A and Table B) in a Postgres DB.
Both have an "id" column in common. Table A has a single column, "id", and Table B has three columns: "id", "date", and "value" (in $).
For each "id" of Table A there exists multiple rows in Table B in the following format - (id, date, value).
For instance, for Table A with "id" as 1 if there exists following rows in Table B:
(1, 2018-06-21, null)
(1, 2018-06-20, null)
(1, 2018-06-19, 202)
(1, 2018-06-18, 200)
I would like to extract the most recent dated non-null value. For example, for id 1 the result should be 202. Please share your thoughts, or let me know if more info is required.
Here is the solution I went ahead with:
with mapping as (
    select distinct table1.id, table2.value, table2.date,
           row_number() over (partition by table1.id order by table2.date desc nulls last) as row_number
    from table1
    left join table2 on table2.id = table1.id and table2.value is not null
)
select * from mapping where row_number = 1
Let me know if there is scope for improvement.
You may very well want an inner join, not an outer join. If you have an id in table1 that does not exist in table2, or that has only NULL values there, you will get NULL for both date and value. That is how an outer join works: if nothing in the right-side table matches the ON condition, it returns NULL for each column of that table. So:
with mapping as
(select distinct table1.id
, table2.value
, table2.date
, row_number() over (partition by table1.id order by table2.date desc nulls last) as row_number
from table1
join table2 on table2.id=table1.id and table2.value is not null
)
select *
from mapping
where row_number = 1;
Your query only worked because all your test data satisfied the first part of the ON condition. You really need test data that fails it to see what your query does.
Caution: DATE and VALUE are very poor choices for column names. Both are reserved words in the SQL standard, though not reserved in Postgres specifically. Further, DATE is a Postgres data type, and having a column with the same name as a data type invites confusion.
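As for scope for improvement: if ids that have no non-null value at all don't need to appear in the result, a Postgres-specific DISTINCT ON query (a sketch, assuming exactly that) avoids both the window function and the CTE:

```sql
-- DISTINCT ON keeps the first row per id under the given ordering,
-- i.e. the latest non-null value.
SELECT DISTINCT ON (id) id, value, date
FROM table2
WHERE value IS NOT NULL
ORDER BY id, date DESC;
```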
There is a single table named Products which has hundreds of columns. I am running a SELECT DISTINCT column1, column2, column3, ..., column6 PostgreSQL query, and the result is something like below:
2 Product A 300 2017 Null Null
2 Product A 300 2017 Null Null
Due to null values, instead of a single row I am getting two rows. How can I solve this? Your help is much appreciated.
NULL differs from itself; DISTINCT checks for equality under the hood. Instead of
select distinct field1, field2, ..., fieldn
you can have your select clause like this:
select distinct coalesce(field1, 'Empty') AS field1, ..., coalesce(fieldn, 'Empty') AS fieldn
You will only need coalesce for nullable fields.
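One detail, assuming not all of your columns are text: COALESCE requires both arguments to have a compatible type, so non-text columns need a cast (or a sentinel value of their own type). The field names below stand in for your actual columns:

```sql
-- 'Empty' only works for text columns; cast others first.
SELECT DISTINCT coalesce(field1, 'Empty')               AS field1,
                coalesce(cast(field2 AS text), 'Empty') AS field2
FROM Products;
```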
One way to remove the duplicates shown above is to GROUP BY the columns that you want distinct values for. So something like this:
SELECT column1, column2, column3, ...,column6
FROM sometable
GROUP BY column1, column2, column3, ...,column6
I have two tables like this
Table A (col1, col2): 300k rows
Table B (col1, col2): 400k rows
I need to count the distinct values of col1 in table A that also appear in col1 of table B.
I have written a query like this:
select count(distinct ab.col1) from A ab join B bc on(ab.col1=bc.col1)
but this takes too much time
You could try a GROUP BY. Also ensure that col1 is indexed in both tables:
SELECT COUNT(col1)
FROM
(
SELECT aa.col1
FROM A aa JOIN B bb ON aa.col1 = bb.col1
GROUP BY aa.col1
) t -- Postgres requires an alias on the derived table
It's difficult to answer without you posting more details: did you ANALYZE the tables? Do you have an index on col1 in each table? How many rows are you counting?
That being said, there aren't many potential query plans for your query. You likely have two seq scans that are hash-joined together, which is about the best you can do... If you have a material number of rows, you'll be counting a gazillion rows, and that takes time.
Perhaps you could rewrite the query differently? If every B.col1 is in A.col1, you could get the same result without the join:
select count(distinct col1) from B
If A has low cardinality, it might be faster to rely on exists():
with vals as (
select distinct A.col1 as val from A
)
select count(*) from vals
where exists(select 1 from B where B.col1 = vals.val)
Or, if you know every possible value from A.col1 and it's reasonably small, you could unnest an array without querying A at all:
select count(*) from unnest(Array[val1, val2, ...]) as vals (val)
where exists(select 1 from B where B.col1 = vals.val)
Or vice versa for each of the above, if B is the table that holds the reference values.