Join 2 sets based on default order - postgresql

How do I join 2 sets of records solely based on the default order?
So if I have a table x(col(1,2,3,4,5,6,7)) and another table z(col(a,b,c,d,e,f,g))
it will return
c1 c2
-- --
1 a
2 b
3 c
4 d
5 e
6 f
7 g
Actually, I wanted to join a pair of one dimensional arrays from parameters and treat them like columns from a table.
Sample code:
CREATE OR REPLACE FUNCTION "Test"(timestamp without time zone[],
timestamp without time zone[])
RETURNS refcursor AS
$BODY$
DECLARE
curr refcursor;
BEGIN
OPEN curr FOR
SELECT DISTINCT "Start" AS x, "End" AS y, COUNT("A"."id")
FROM UNNEST($1) "Start"
INNER JOIN
(
SELECT "End", ROW_NUMBER() OVER(ORDER BY ("End")) rn
FROM UNNEST($2) "End" ORDER BY ("End")
) "End" ON ROW_NUMBER() OVER(ORDER BY ("Start")) = "End".rn
LEFT JOIN "A" ON ("A"."date" BETWEEN x AND y)
GROUP BY 1,2
ORDER BY "Start";
return curr;
END
$BODY$

Now, to answer the real question that was revealed in comments, which appears to be something like:
Given two arrays 'a' and 'b', how do I pair up their elements so I can get the element pairs as column aliases in a query?
There are a couple of ways to tackle this:
If and only if the arrays are of equal length, use multiple unnest functions in the SELECT clause (a deprecated approach that should only be used for backward compatibility);
Use generate_subscripts to loop over the arrays;
Use generate_series over subqueries against array_lower and array_upper to emulate generate_subscripts if you need to support versions too old to have generate_subscripts;
Relying on the order that unnest returns tuples in and hoping - like in my other answer and as shown below. It'll work, but it's not guaranteed to work in future versions.
Use the WITH ORDINALITY functionality added in PostgreSQL 9.4 (see also its first posting) to get a row number for unnest when 9.4 comes out.
Use multiple-array UNNEST, which is SQL-standard but which PostgreSQL doesn't support yet.
So, say we have function arraypair with array parameters a and b:
CREATE OR REPLACE FUNCTION arraypair (a integer[], b text[])
RETURNS TABLE (col_a integer, col_b text) AS $$
-- blah code here blah
$$ LANGUAGE whatever IMMUTABLE;
and it's invoked as:
SELECT * FROM arraypair( ARRAY[1,2,3,4,5,6,7], ARRAY['a','b','c','d','e','f','g'] );
possible function definitions would be:
SRF-in-SELECT (deprecated)
CREATE OR REPLACE FUNCTION arraypair (a integer[], b text[])
RETURNS TABLE (col_a integer, col_b text) AS $$
SELECT unnest(a), unnest(b);
$$ LANGUAGE sql IMMUTABLE;
Will produce bizarre and unexpected results if the arrays aren't equal in length; see the documentation on set returning functions and their non-standard use in the SELECT list to learn why, and what exactly happens.
generate_subscripts
This is likely the safest option:
CREATE OR REPLACE FUNCTION arraypair (a integer[], b text[])
RETURNS TABLE (col_a integer, col_b text) AS $$
SELECT
a[i], b[i]
FROM generate_subscripts(CASE WHEN array_length(a,1) >= array_length(b,1) THEN a::text[] ELSE b::text[] END, 1) i;
$$ LANGUAGE sql IMMUTABLE;
If the arrays are of unequal length, as written it'll return null elements for the shorter, so it works like a full outer join. Reverse the sense of the case to get an inner-join like effect. The function assumes the arrays are one-dimensional and that they start at index 1. If an entire array argument is NULL then the function returns NULL.
A more generalized version would be written in PL/PgSQL and would check array_ndims(a) = 1, check array_lower(a, 1) = 1, test for null arrays, etc. I'll leave that to you.
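A rough sketch of what such a generalized version could look like follows. The name arraypair_checked is hypothetical, and raising an exception on bad input (rather than returning nothing) is my assumption, not something the answer prescribes:

```sql
-- Hypothetical variant of arraypair with the input checks mentioned above.
CREATE OR REPLACE FUNCTION arraypair_checked (a integer[], b text[])
RETURNS TABLE (col_a integer, col_b text) AS $$
BEGIN
    -- A NULL array yields an empty result set rather than an error.
    IF a IS NULL OR b IS NULL THEN
        RETURN;
    END IF;

    -- array_ndims() returns NULL for an empty array, so empty arrays pass.
    IF array_ndims(a) <> 1 OR array_ndims(b) <> 1 THEN
        RAISE EXCEPTION 'arraypair_checked: arrays must be one-dimensional';
    END IF;

    IF array_lower(a, 1) <> 1 OR array_lower(b, 1) <> 1 THEN
        RAISE EXCEPTION 'arraypair_checked: arrays must start at index 1';
    END IF;

    -- greatest() ignores NULLs, so one empty array still pairs with NULLs.
    RETURN QUERY
    SELECT a[i], b[i]
    FROM generate_series(1, greatest(array_length(a, 1), array_length(b, 1))) AS i;
END;
$$ LANGUAGE plpgsql IMMUTABLE;
```

Like the generate_subscripts version, this behaves like a full outer join when the lengths differ; swap greatest() for least() to get the inner-join-like effect.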
Hoping for pair-wise returns:
This isn't guaranteed to work, but does with PostgreSQL's current query executor:
CREATE OR REPLACE FUNCTION arraypair (a integer[], b text[])
RETURNS TABLE (col_a integer, col_b text) AS $$
WITH
rn_c1(rn, col) AS (
SELECT row_number() OVER (), c1.col
FROM unnest(a) c1(col)
),
rn_c2(rn, col) AS (
SELECT row_number() OVER (), c2.col
FROM unnest(b) c2(col)
)
SELECT
rn_c1.col AS c1,
rn_c2.col AS c2
FROM rn_c1
INNER JOIN rn_c2 ON (rn_c1.rn = rn_c2.rn);
$$ LANGUAGE sql IMMUTABLE;
I would consider using generate_subscripts much safer.
Multi-argument unnest:
This should work, but doesn't because PostgreSQL's unnest doesn't accept multiple input arrays (yet):
SELECT * FROM unnest(a,b);
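For readers on PostgreSQL 9.4 or later, both the multi-array form and WITH ORDINALITY are accepted in the FROM clause. A minimal sketch using the sample arrays from the invocation above (the aliases t, x, z, elem and ord are illustrative):

```sql
-- Multi-array unnest in FROM: pairs elements by position and
-- pads the shorter array with NULLs.
SELECT *
FROM unnest(ARRAY[1,2,3,4,5,6,7],
            ARRAY['a','b','c','d','e','f','g']) AS t(col_a, col_b);

-- WITH ORDINALITY: explicit element numbers, joined like ordinary columns.
SELECT x.elem AS col_a, z.elem AS col_b
FROM unnest(ARRAY[1,2,3,4,5,6,7]) WITH ORDINALITY AS x(elem, ord)
JOIN unnest(ARRAY['a','b','c','d','e','f','g']) WITH ORDINALITY AS z(elem, ord)
  USING (ord)
ORDER BY ord;
```

The second form uses an inner join, so unmatched trailing elements of the longer array are dropped; use FULL JOIN to keep them.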

select x.c1, z.c2
from
x
inner join
(
select
c2,
row_number() over(order by c2) rn
from z
order by c2
) z on x.c1 = z.rn
order by x.c1
If x.c1 is not 1,2,3... you can number x with row_number() in the same way as was done for z.
The middle order by is not necessary, as pointed out by Erwin. I tested it like this:
create table t (i integer);
insert into t
select ceil(random() * 100000)
from generate_series(1, 100000);
select
i,
row_number() over(order by i) rn
from t
;
And i comes out ordered. Before running this simple test, which I had never tried, I thought the rows might be numbered in any order.

By "default order" it sounds like you probably mean the order in which the rows are returned by select * from tablename without an ORDER BY.
If so, this ordering is undefined. The database can return rows in any order that it feels like. You'll find that if you UPDATE a row, it probably moves to a different position in the table.
If you're stuck in a situation where you assumed tables had an order and they don't, you can as a recovery option add a row number based on the on-disk ordering of the tuples within the table:
select row_number() OVER (ORDER BY ctid), *
from the_table
order by ctid
If the output looks right, I recommend that you CREATE TABLE a new table with an extra field, then do an INSERT INTO ... SELECT to insert the data ordered by ctid, then ALTER TABLE ... RENAME the tables and finally fix any foreign key references so they point to the new table.
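The rebuild described above might be sketched roughly as follows. The names the_table_new, the_table_old and row_order are hypothetical, and indexes, constraints and foreign keys are deliberately left out, so they still have to be recreated or repointed by hand:

```sql
BEGIN;

-- Same columns as the original, plus an explicit ordering column at the end.
CREATE TABLE the_table_new (LIKE the_table INCLUDING DEFAULTS);
ALTER TABLE the_table_new ADD COLUMN row_order bigint;

-- Number the rows by their current on-disk position once, and persist it.
INSERT INTO the_table_new
SELECT t.*, row_number() OVER (ORDER BY t.ctid)
FROM the_table t;

-- Swap the tables; the old one is kept until the result has been checked.
ALTER TABLE the_table RENAME TO the_table_old;
ALTER TABLE the_table_new RENAME TO the_table;

COMMIT;
```

After the swap, queries can use ORDER BY row_order instead of relying on ctid.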
ctid can be changed by autovacuum, UPDATE, CLUSTER, etc, so it is not something you should ever be using in applications. I'm using it here only because it sounds like you don't have any real ordering or identifier key.
If you need to pair up rows based on their on-disk ordering (an unreliable and unsafe thing to do as noted above), you could per this SQLFiddle try:
WITH
rn_c1(rn, col) AS (
SELECT row_number() OVER (ORDER BY ctid), c1.col
FROM c1
),
rn_c2(rn, col) AS (
SELECT row_number() OVER (ORDER BY ctid), c2.col
FROM c2
)
SELECT
rn_c1.col AS c1,
rn_c2.col AS c2
FROM rn_c1
INNER JOIN rn_c2 ON (rn_c1.rn = rn_c2.rn);
but never rely on this in a production app. If you're really stuck you can use this with CREATE TABLE AS to construct a new table that you can start with when you're working on recovering data from a DB that lacks a required key, but that's about it.
The same approach given above might work with an empty window clause () instead of (ORDER BY ctid) when using sets that lack a ctid, like interim results from functions. It's even less safe then though, and should be a matter of last resort only.
(See also this newer related answer: https://stackoverflow.com/a/17762282/398670)

Related

How to transform SETOF into useful thing?

SELECT GENERATE_SERIES(0, 2) and SELECT * FROM GENERATE_SERIES(0, 2) are useful because they return rows and columns. So is SELECT id, GENERATE_SERIES(0, 2) AS s FROM t.
Now consider a SETOF-returning function like json_each_text():
SELECT * FROM json_each_text('{"a":"foo", "b":"bar"}'); -- OK, useful...
SELECT id,json_each_text('{"a":"foo", "b":"bar"}') FROM t; -- Oops, ugly!
So how do I "cast" the second query into something useful, with rows and columns?
PS: the second query works fine, but it is not what I need (the result is not split into columns and rows).
I have solved one very specific case (it works fine!) with a pg 9.3+ feature:
SELECT t.id, key, value
FROM t, LATERAL json_each_text(t.info);
but I do not understand how to solve the general problem of "casting" a SETOF datatype to rows and columns.
SELECT id, (json_each_text(t.info)).* FROM t

Avoid putting PostgreSQL function result into one field

The end result of what I am after is a query that calls a function and that function returns a set of records that are in their own separate fields. I can do this but the results of the function are all in one field.
ie: http://i.stack.imgur.com/ETLCL.png and the results I am after are: http://i.stack.imgur.com/wqRQ9.png
Here's the code to create the table
CREATE TABLE tbl_1_hm
(
tbl_1_hm_id bigserial NOT NULL,
tbl_1_hm_f1 VARCHAR (250),
tbl_1_hm_f2 INTEGER,
CONSTRAINT tbl_1_hm_pkey PRIMARY KEY (tbl_1_hm_id)
);
-- do that for a few times to get some data
INSERT INTO tbl_1_hm (tbl_1_hm_f1, tbl_1_hm_f2)
VALUES ('hello', 1);
CREATE OR REPLACE FUNCTION proc_1_hm(id BIGINT)
RETURNS TABLE(tbl_1_hm_f1 VARCHAR (250), tbl_1_hm_f2 int) AS $$
SELECT tbl_1_hm_f1, tbl_1_hm_f2
FROM tbl_1_hm
WHERE tbl_1_hm_id = id
$$ LANGUAGE SQL;
--And here is the current query I am running for my results:
SELECT t1.tbl_1_hm_id, proc_1_hm(t1.tbl_1_hm_id) AS t3
FROM tbl_1_hm AS t1
Thanks for having a read. Please if you want to haggle about the semantics of what I am doing by hitting the same table twice or my naming convention --> this is a simplified test.
When a function returns a set of records, you should treat it as a table source:
SELECT t1.tbl_1_hm_id, t3.*
FROM tbl_1_hm AS t1, proc_1_hm(t1.tbl_1_hm_id) AS t3;
Note that functions are implicitly using a LATERAL join (scroll down to sub-sections 4 and 5) so you can use fields from tables listed previously without having to specify an explicit JOIN condition.
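If you would rather spell the lateral join out, for example to keep rows of tbl_1_hm for which the function returns nothing, something like this should be equivalent on PostgreSQL 9.3 or later (a sketch; choosing LEFT JOIN instead of the comma join is my assumption about what is wanted):

```sql
-- Explicit LATERAL: the function is evaluated once per row of t1.
-- LEFT JOIN ... ON true keeps t1 rows even when the function returns no rows.
SELECT t1.tbl_1_hm_id, t3.tbl_1_hm_f1, t3.tbl_1_hm_f2
FROM tbl_1_hm AS t1
LEFT JOIN LATERAL proc_1_hm(t1.tbl_1_hm_id) AS t3 ON true;
```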

SELECT clause in FOR LOOP control using plpgsql

I am trying to write a script that outputs all the foos that are used by only one user; if a foo is used by more than one user, it should not be output.
Here are my tables:
foos (id, value)
users (id, name)
used (foo_id, user_id)
and my non-working script:
FUNCTION output_unshared_foos ()
RETURNS foos AS
$a$
DECLARE
foocounts RECORD;
BEGIN
SELECT u.foo_id, count(*)
INTO foocounts -- store in the local variable
FROM used u
GROUP BY u.foo_id;
FOR f IN SELECT * FROM foos
LOOP
IF (SELECT fc.count < 2 FROM foocounts fc WHERE fc.foo_id = f.id) THEN
RETURN NEXT f;
END IF;
END LOOP;
END
$a$ language plpgsql;
It doesn't seem to work: every row is returned, and the condition seems to be always true.
Your first problem is that you can't store the result of a query that returns more than one row into a single variable (the SELECT u.foo_id, count(*) INTO ... part). I'm surprised you don't get a runtime error when you call your function.
Your function also doesn't compile, because the record f is not declared and a function defined as returns foos can't use return next.
But your approach is wrong (even if it worked). Doing row-by-row processing is almost always the wrong choice in SQL. SQL and relational databases are meant to handle sets, not single rows.
Your problem can be solved with a single query:
select foo_id
from used
group by foo_id
having count(distinct user_id) = 1
will return all foo ids that are used by exactly one user.
If you need the additional information from the foos table, you can join the above query to the foos table:
select f.*
from foos f
join (
select foo_id
from used
group by foo_id
having count(distinct user_id) = 1
) u on f.id = u.foo_id
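If you still want this behind the function name from the question, one option is a plain SQL function that wraps the query. This is only a sketch; declaring it RETURNS SETOF foos and STABLE is my assumption about what is needed:

```sql
-- Set-based replacement for the row-by-row loop in the question.
CREATE OR REPLACE FUNCTION output_unshared_foos()
RETURNS SETOF foos AS
$a$
    SELECT f.*
    FROM foos f
    JOIN (
        SELECT foo_id
        FROM used
        GROUP BY foo_id
        HAVING count(DISTINCT user_id) = 1
    ) u ON f.id = u.foo_id;
$a$ LANGUAGE sql STABLE;

-- Usage: call it like a table.
SELECT * FROM output_unshared_foos();
```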

Get columns that differ between 2 rows

I have a table company with 60 columns. The goal is to create a tool to find, compare and eliminate duplicates in this table.
Example: I find 2 companies that potentially are the same, but I need to know which values (columns) differ between these 2 rows in order to continue.
I think it is possible to compare column by column x 60, but I am looking for a simpler and more generic solution.
Something like:
SELECT * FROM company where co_id=22
SHOW DIFFERENCE
SELECT * FROM company where co_id=33
The result should be the column names that differ.
For this you may use an intermediate key/value representation of the rows, with JSON functions or alternatively with the hstore extension (now only of historical interest). JSON comes built-in with every reasonably recent version of PostgreSQL, whereas hstore must be installed in the database with CREATE EXTENSION.
Demo:
CREATE TABLE table1 (id int primary key, t1 text, t2 text, t3 text);
Let's insert two rows that differ by the primary key and one other column (t3).
INSERT INTO table1 VALUES
(1,'foo','bar','baz'),
(2,'foo','bar','biz');
Solution with json
First we get a key/value representation of the rows together with their original row number, then we pair consecutive rows by that row number and keep only the keys whose values differ.
WITH rowcols AS (
select rn, key, value
from (select row_number() over () as rn,
row_to_json(table1.*) as r from table1) AS s
cross join lateral json_each_text(s.r)
)
select r1.key from rowcols r1 join rowcols r2
on (r1.rn=r2.rn-1 and r1.key = r2.key)
where r1.value <> r2.value;
Sample result:
key
-----
id
t3
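The query above pairs rows by an unordered row_number(), which only works while the table holds exactly the two rows of interest. To compare two specific rows, as in the question, a variation along these lines may be easier to reason about (a sketch against the demo table; the ids 1 and 2 stand in for the question's co_id values):

```sql
-- Expand each chosen row into (key, value) pairs and compare them key by key.
-- IS DISTINCT FROM also reports a column that is NULL in only one of the rows.
SELECT a.key
FROM json_each_text((SELECT row_to_json(t) FROM table1 t WHERE t.id = 1)) AS a
JOIN json_each_text((SELECT row_to_json(t) FROM table1 t WHERE t.id = 2)) AS b
  ON a.key = b.key
WHERE a.value IS DISTINCT FROM b.value;
```

With the demo rows above this should report id and t3, the same columns as the sample result.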
Solution with hstore
SELECT skeys(h1-h2) from
(select hstore(t.*) as h1 from table1 t where id=1) h1
CROSS JOIN
(select hstore(t.*) as h2 from table1 t where id=2) h2;
h1-h2 computes the difference key by key and skeys() outputs the result as a set.
Result:
skeys
-------
id
t3
The select-list might be refined with skeys((h1-h2)-'id'::text) to always remove id which, as the primary key, will obviously always differ between rows.
Here's a stored procedure that should get you most of the way...
While this should work "as is", it has no error checking, which you should add.
It gets all the columns in the table, and loops over them. A difference is when the count of the distinct items is more than one.
Also, the output is:
The count of the number of differences
Messages for each column where there is a difference
It might be more useful to return a rowset of the columns with the differences. Anyway, good luck!
Usage:
SELECT showdifference('public','company','co_id',22,33)
CREATE OR REPLACE FUNCTION showdifference(p_schema text, p_tablename text,p_idcolumn text,p_firstid integer, p_secondid integer)
RETURNS INTEGER AS
$BODY$
DECLARE
l_diffcount INTEGER;
l_column text;
l_dupcount integer;
column_cursor CURSOR FOR select column_name from information_schema.columns where table_name = p_tablename and table_schema = p_schema and column_name <> p_idcolumn;
BEGIN
-- need error checking here, to ensure the table and schema exist and the columns exist
-- Should also check that the records ids exist.
-- Should also check that the column type of the id field is integer
-- Set the number of differences to zero.
l_diffcount := 0;
-- use a cursor to iterate over the columns found in information_schema.columns
-- open the cursor
OPEN column_cursor;
LOOP
FETCH column_cursor INTO l_column;
EXIT WHEN NOT FOUND;
-- build a query to see if there is a difference between the columns. If there is raise a notice
EXECUTE 'select count(distinct ' || quote_ident(l_column) || ' ) from ' || quote_ident(p_schema) || '.' || quote_ident(p_tablename) || ' where ' || quote_ident(p_idcolumn) || ' in ('|| p_firstid || ',' || p_secondid ||')'
INTO l_dupcount;
IF l_dupcount > 1 THEN
-- increment the counter
l_diffcount := l_diffcount +1;
RAISE NOTICE '% has % differences', l_column, l_dupcount ; -- for "real" you might want to return a rowset and could do something here
END IF;
END LOOP;
-- close the cursor
CLOSE column_cursor;
RETURN l_diffcount;
END;
$BODY$
LANGUAGE plpgsql VOLATILE STRICT
COST 100;
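As suggested above, returning the differing column names as a rowset may be more useful than notices. A possible variation follows; the name showdifference_cols is hypothetical, it still has no error checking, and it compares the two values with IS DISTINCT FROM instead of count(distinct ...), so a NULL on one side counts as a difference. Column types without an equality operator will raise an error.

```sql
-- Hypothetical rowset-returning variant of showdifference.
CREATE OR REPLACE FUNCTION showdifference_cols(p_schema text, p_tablename text, p_idcolumn text, p_firstid integer, p_secondid integer)
RETURNS SETOF text AS
$BODY$
DECLARE
    l_column  text;
    l_differs boolean;
BEGIN
    FOR l_column IN
        SELECT column_name::text
        FROM information_schema.columns
        WHERE table_schema = p_schema
          AND table_name = p_tablename
          AND column_name <> p_idcolumn
        ORDER BY ordinal_position
    LOOP
        -- Compare the column's value in the two rows; ids are passed as
        -- query parameters ($1, $2), identifiers are quoted by format(%I).
        EXECUTE format(
            'SELECT (SELECT %1$I FROM %2$I.%3$I WHERE %4$I = $1)
                    IS DISTINCT FROM
                    (SELECT %1$I FROM %2$I.%3$I WHERE %4$I = $2)',
            l_column, p_schema, p_tablename, p_idcolumn)
        INTO l_differs
        USING p_firstid, p_secondid;

        IF l_differs THEN
            RETURN NEXT l_column;
        END IF;
    END LOOP;
    RETURN;
END;
$BODY$
LANGUAGE plpgsql STABLE STRICT;

-- Usage:
SELECT * FROM showdifference_cols('public', 'company', 'co_id', 22, 33);
```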

Ordering a query on a field in a return record

I've got a query that calls a function in its select clause. The function returns a record type. In the calling query, I want to order by one of the fields in the returned record and if possible I'd also like to return the fields of the record as fields of the calling query. To make this clear, here's a simplified version of the code:
CREATE OR REPLACE FUNCTION getStatus(lastContact timestamptz, lastAlTime timestamptz, lastGps timestamptz, out status varchar, out toelichting varchar, out colorLevel integer)
RETURNS record AS
$BODY$
BEGIN
status := 'controle_status_ok';
toelichting := '';
colorLevel := 3;
END
$BODY$
LANGUAGE 'plpgsql' VOLATILE
COST 100;
ALTER FUNCTION getStatus(timestamptz, timestamptz, timestamptz, out varchar, out varchar, out integer) OWNER TO xyz;
Using this function, I want to have a query like this one:
SELECT
id,
name,
getStatus(tabel3.lastcontact, tabel4.lastchanged, tabel5.lastfound) as status
FROM
tabel1
left join tabel2 on ...
left join tabel3 on ...
left join tabel4 on ...
left join tabel5 on ...
ORDER BY
status
Postgres comes with the following error:
ERROR: could not identify an ordering operator for type record
HINT: Use an explicit ordering operator or modify the query.
The question: how should I order by the value of colorLevel that's been returned by getStatus?
Additional question: can I return the three fields of the getStatus function as fields of the query that calls the getStatus function?
Use
ORDER BY (status).colorlevel
to reference a column of your record type.
As an aside: I used lower case (colorlevel instead of colorLevel) because identifiers are folded to lower case unless double-quoted anyway, and using mixed-case identifiers is generally a bad idea in PostgreSQL.
As to your additional question, similar syntax requirement. I also use a subquery to optimize the query:
SELECT id
, name
, (x.status).status
, (x.status).toelichting
, (x.status).colorLevel
FROM tabel
, (SELECT getStatus(now(), now(), now()) as status) x
ORDER BY (x.status).colorlevel
Read about accessing composite types in the manual.
Answer after additional input
To use columns from your tables, put it all in a subquery. I am trying to avoid calling the function multiple times, because that may be expensive.
SELECT
id,
name,
(status).status,
(status).toelichting,
(status).colorLevel
FROM (
SELECT
id,
name,
getStatus(tabel3.lastcontact, tabel4.lastchanged, tabel5.lastfound) as status
FROM
tabel1
left join tabel2 on ...
left join tabel3 on ...
left join tabel4 on ...
left join tabel5 on ...
) x
ORDER BY
(status).colorlevel
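On PostgreSQL 9.3 or later, another way to call the function once per row and get its OUT parameters as ordinary columns is a LATERAL join. A minimal sketch, reusing the simplified tabel example with now() placeholders from above (the aliases t and s are illustrative):

```sql
-- The function appears in FROM, so its OUT parameters become columns of s
-- and no (status).field decomposition is needed.
SELECT t.id, t.name, s.status, s.toelichting, s.colorlevel
FROM tabel t
CROSS JOIN LATERAL getStatus(now(), now(), now()) AS s
ORDER BY s.colorlevel;
```

In the real query the now() placeholders would be replaced by columns from the joined tables, which a LATERAL function call is allowed to reference.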