R2DBC adjacency list get all children - postgresql

I have a table with id and parent_id columns; I refer to this structure as an adjacency list.
Now I want to get all children of an arbitrary id. The classic solution to this problem uses recursion; for example, here is a Postgres procedure or CTE implementation.
I'm currently using Spring WebFlux and Spring Data R2DBC + the Postgres R2DBC driver (which doesn't support stored procedures yet).
How can I approach this problem in reactive style? Is it even possible, or am I missing something conceptually?
UPD 1:
Let's imagine we have data like:
+-------------+---------+
|id |parent_id|
+-------------+---------+
|root |NULL |
|id1 |root |
|dir1 |root |
|dir1_id1 |dir1 |
|dir1_dir1 |dir1 |
|dir1_dir1_id1|dir1_dir1|
+-------------+---------+
Now I want to have a method inside a ReactiveCrudRepository which will return all children of a provided id.
For example, using the sample data: by providing id='dir1', I want to get the children with ids ['dir1_id1', 'dir1_dir1', 'dir1_dir1_id1'].

Using a procedure or a CTE has nothing to do with a full scan.
In your scenario you only have to use a recursive CTE, but adding an index on (parent_id, id) will surely help:
create index idx_name on tablename (parent_id, id);
Also, 10k rows is not that big; the index will definitely speed up the CTE a lot.

I think the best SQL approach is a recursive CTE (Common Table Expression). Did you try it? I have never tried it with many rows.
WITH RECURSIVE nodes AS (
    SELECT id, parent_id
    FROM t
    WHERE parent_id = 'dir1'
    UNION ALL
    SELECT t.id, t.parent_id
    FROM nodes n
    INNER JOIN t ON t.parent_id = n.id
)
SELECT id
FROM nodes;
Output for parent_id = 'dir1'
id
dir1_id1
dir1_dir1
dir1_dir1_id1
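SQLite also supports WITH RECURSIVE, so the query above can be sanity-checked without a Postgres instance. A minimal sketch using the question's sample data (the table name `t` is taken from the query above):

```python
import sqlite3

# In-memory database with the sample adjacency list from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT PRIMARY KEY, parent_id TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("root", None), ("id1", "root"), ("dir1", "root"),
     ("dir1_id1", "dir1"), ("dir1_dir1", "dir1"), ("dir1_dir1_id1", "dir1_dir1")],
)

# Recursive CTE: seed with the direct children of 'dir1', then repeatedly
# join back to collect children of children.
rows = conn.execute("""
    WITH RECURSIVE nodes AS (
        SELECT id, parent_id FROM t WHERE parent_id = 'dir1'
        UNION ALL
        SELECT t.id, t.parent_id FROM nodes n JOIN t ON t.parent_id = n.id
    )
    SELECT id FROM nodes
""").fetchall()
children = [r[0] for r in rows]
print(sorted(children))  # ['dir1_dir1', 'dir1_dir1_id1', 'dir1_id1']
```

The same SQL string can be placed in a Spring Data R2DBC `@Query` annotation on the repository method, since the driver just sends it as a plain statement.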

Related

Redshift: replace FULL OUTER for a CROSS JOIN

I would like to perform a full outer join using multiple OR conditions, but I've read that PostgreSQL can only do a full outer join when the join conditions are distinct on each side of the = sign.
In my scenario I have 2 tables: ticket and production. One record in ticket can be related to several values of production.code. Example:
TICKET | custom_field_1 | custom_field_2 | custom_field_3
-------+----------------+----------------+---------------
     1 | 10             | 9              |
     2 |                | 8              |

PRODUCTION | CODE
-----------+-----
         1 | 10
         5 | 8
        12 | 9
In the example above, ticket ID 1 is related to production codes 9 and 10, and ticket ID 2 is related to production code 8.
I'm trying to write a query to return column Status from table Production:
SELECT
production.status
FROM ticket
FULL OUTER JOIN production ON ticket.custom_field_1 = production.code
OR ticket.custom_field_2 = production.code
OR ticket.custom_field_3 = production.code
GROUP BY 1
ORDER BY 1
LIMIT 1000
When I try to run this query, I get an error: Invalid operation: FULL JOIN is only supported with merge-joinable join conditions;
So I've started to replace it with a CROSS JOIN. The query is almost working, but I'm getting a different number of rows:
SELECT count(production.id) FROM ticket
CROSS JOIN production
WHERE date(production.ts_real) >= '2019-03-01' AND
((ticket.custom_field_1 = sisweb_producao.proposta) OR
(ticket.custom_field_2 = sisweb_producao.proposta) OR
(ticket.custom_field_3 = sisweb_producao.proposta));
This query should return 202 rows but only gives 181 because of my conditions. How can I make the CROSS JOIN work like a FULL OUTER JOIN?
I'm using a tool called Looker, that's why I'm building this query on this way.
It's not quite clear what the schema of your tables is, as some of your example SQL contains columns not in the example schema, but it looks like you could use an alternative approach: pivot the ticket columns into rows and join them to the production table with an inner join to achieve the same thing:
SELECT
    t1.ticket,
    production.id,
    production.status
FROM (
    SELECT ticket, custom_field_1 AS code FROM ticket WHERE custom_field_1 IS NOT NULL
    UNION
    SELECT ticket, custom_field_2 AS code FROM ticket WHERE custom_field_2 IS NOT NULL
    UNION
    SELECT ticket, custom_field_3 AS code FROM ticket WHERE custom_field_3 IS NOT NULL
) t1
INNER JOIN production ON t1.code = production.code
Based on the example data you provided, it looks like a ticket can be related to more than one production code, and hence more than one "status", so whichever way you do this be aware you will potentially have multiple result rows per ticket.
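The unpivot-and-join idea can be sanity-checked outside Redshift. A minimal sketch in SQLite using the question's sample data (the sample has no status column, so production.id is selected instead to show the matches):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticket (ticket INTEGER, custom_field_1 INTEGER,"
             " custom_field_2 INTEGER, custom_field_3 INTEGER)")
conn.execute("CREATE TABLE production (id INTEGER, code INTEGER)")
conn.executemany("INSERT INTO ticket VALUES (?, ?, ?, ?)",
                 [(1, 10, 9, None), (2, None, 8, None)])
conn.executemany("INSERT INTO production VALUES (?, ?)",
                 [(1, 10), (5, 8), (12, 9)])

# Pivot the three custom_field columns into (ticket, code) pairs,
# then inner-join to production on code.
rows = conn.execute("""
    SELECT t1.ticket, production.id
    FROM (
        SELECT ticket, custom_field_1 AS code FROM ticket WHERE custom_field_1 IS NOT NULL
        UNION
        SELECT ticket, custom_field_2 AS code FROM ticket WHERE custom_field_2 IS NOT NULL
        UNION
        SELECT ticket, custom_field_3 AS code FROM ticket WHERE custom_field_3 IS NOT NULL
    ) t1
    JOIN production ON t1.code = production.code
    ORDER BY t1.ticket, production.id
""").fetchall()
print(rows)  # [(1, 1), (1, 12), (2, 5)]
```

Ticket 1 matches production codes 10 and 9 (ids 1 and 12), and ticket 2 matches code 8 (id 5), as described in the question.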

Top N rows by group in ClickHouse

What is the proper way to query top N rows by group in ClickHouse?
Let's take an example of tbl having id2, id4, v3 columns and N=2.
I tried the following
SELECT
id2,
id4,
v3 AS v3
FROM tbl
GROUP BY
id2,
id4
ORDER BY v3 DESC
LIMIT 2 BY
id2,
id4
but I get the error:
Received exception from server (version 19.3.4):
Code: 215. DB::Exception: Received from localhost:9000, 127.0.0.1. DB::Exception
: Column v3 is not under aggregate function and not in GROUP BY..
I could put v3 into GROUP BY and it does seem to work, but it is not efficient to group by a metric.
There is the any aggregate function, but we actually want all values (limited to 2 by the LIMIT BY clause), not just any single value, so it doesn't seem to be the proper solution here:
SELECT
id2,
id4,
any(v3) AS v3
FROM tbl
GROUP BY
id2,
id4
ORDER BY v3 DESC
LIMIT 2 BY
id2,
id4
Aggregate functions can be used like this:
SELECT
id2,
id4,
arrayJoin(arraySlice(arrayReverseSort(groupArray(v3)), 1, 2)) v3
FROM tbl
GROUP BY
id2,
id4
You can also do it the way you would do it in "normal" SQL as described in this thread
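The array-based answer above is equivalent to, per (id2, id4) group, sorting the v3 values descending and keeping the first two. A plain-Python sketch of that logic (the sample values are assumed for illustration):

```python
from collections import defaultdict

# Sample rows: (id2, id4, v3); values chosen just to illustrate the grouping.
data = [("a", 1, 10), ("a", 1, 30), ("a", 1, 20), ("b", 2, 5), ("b", 2, 7)]

# groupArray(v3) per (id2, id4) group ...
groups = defaultdict(list)
for id2, id4, v3 in data:
    groups[(id2, id4)].append(v3)

# ... then arrayReverseSort + arraySlice(..., 1, 2):
top2 = {key: sorted(vals, reverse=True)[:2] for key, vals in groups.items()}
print(top2)  # {('a', 1): [30, 20], ('b', 2): [7, 5]}
```

arrayJoin in the ClickHouse query then turns each two-element array back into two result rows per group.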
While vladimir's solution works for many cases, it didn't work for mine. I have a table that looks like this:
column | group by
-------+---------
A      | Yes
B      | Yes
C      | No
Now, imagine column A identifies the user and column B stands for whatever action a user could take, e.g. on your website or in your online game. Column C is the count of how often the user has performed this particular action. Vladimir's solution would give me columns A and C, but not the action the user has taken (column B), meaning I would know how often a user has done something, but not what.
The reason is that it doesn't make sense to group by both A and B: every row would be a unique group, and you can't find the top K rows when every group has only one member; the result is the same table you query against. If instead you group only by A, you can apply vladimir's solution but get only columns A and C; you can't output column B because it's not part of the GROUP BY, as explained.
If you would like to get the top 2 (or top 5, or top 100) actions a user has taken, you might look for a solution like this:
SELECT rs.id2, rs.id4, rs.v3
FROM (
    SELECT id2, id4, v3,
           row_number() OVER (PARTITION BY id2, id4 ORDER BY v3 DESC) AS rank
    FROM tbl
) rs
WHERE rs.rank <= 2
Note: To use this, you have to set allow_experimental_window_functions = 1.
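The window-function version can be tried in any engine that supports ROW_NUMBER(). A minimal sketch in SQLite mirroring the query above (sample values assumed; SQLite 3.25+ is required for window functions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id2 TEXT, id4 INTEGER, v3 INTEGER)")
conn.executemany("INSERT INTO tbl VALUES (?, ?, ?)",
                 [("a", 1, 10), ("a", 1, 30), ("a", 1, 20),
                  ("b", 2, 5), ("b", 2, 7)])

# Number the rows within each (id2, id4) partition by v3 descending,
# then keep only the top two per group.
rows = conn.execute("""
    SELECT rs.id2, rs.id4, rs.v3
    FROM (
        SELECT id2, id4, v3,
               ROW_NUMBER() OVER (PARTITION BY id2, id4 ORDER BY v3 DESC) AS rn
        FROM tbl
    ) rs
    WHERE rs.rn <= 2
    ORDER BY rs.id2, rs.id4, rs.v3 DESC
""").fetchall()
print(rows)  # [('a', 1, 30), ('a', 1, 20), ('b', 2, 7), ('b', 2, 5)]
```

The alias is `rn` here rather than `rank`, since RANK is itself a window-function keyword in some engines.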

PostgreSQL nested selects

Is there anybody who can help me with making a query with the following functionality:
Let's have a simple statement like:
SELECT relname FROM pg_catalog.pg_class WHERE relkind = 'r';
This will produce a nice result with a single column - the names of all tables.
Now let's imagine that one of the tables is named "table1". If we execute:
SELECT count(*) FROM table1;
we will get the number of rows of the table "table1".
Now the real question: how can these two queries be unified into one query that gives a result with two columns, the name of the table and its number of rows? Written in pseudo-SQL it would be something like this:
SELECT relname, (SELECT count(*) FROM relname::[as table name]) FROM pg_catalog.pg_class WHERE relkind = 'r';
And here is an example: if there are 3 tables in the database named table1, table2 and table3, with 20, 30 and 40 rows respectively, the query result should look like this:
+-------+------+
|relname| rows |
+-------+------+
|table1 |   20 |
|table2 |   30 |
|table3 |   40 |
+-------+------+
Thanks to everyone who is willing to help ;-)
P.S. Yes I know that the table name is not schema-qualified ;-) Let's hope that all tables in the database have unique names ;-)
(Corrected typos from rename to relname in last query)
EDIT1: The question is not about "how can I find the number of rows in a table". What I'm asking is: how to build a query with two selects, where the second uses as its FROM the value of a column from the result of the first select.
EDIT2: As @jdigital suggested, I've tried dynamic querying and it does the job, but it can be used only in PL/pgSQL, so it doesn't fit my needs. In addition, I tried the PREPARE and EXECUTE statements; yet again, it is not working. Anyway, I'll stick with the two-queries approach. But I'm damn sure that PostgreSQL is capable of this ...
With PL/pgSQL (the PostgreSQL procedural language), you can execute dynamic queries by building a string and then executing it as SQL. Note that this is Postgres-specific, but other databases are likely to have something equivalent. More generally, if you are willing to go beyond SQL, you can do this with any programming language (or a shell/cmd script).
By the way, you'll get better results searching for "postgres dynamic query" since "nested select" has a different meaning.
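The dynamic-query idea (query the catalog for table names, then build and execute one count per table) can also be driven from a client language. A minimal sketch in Python with SQLite, using sqlite_master in place of pg_catalog.pg_class (table names and row counts are assumed for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (x INTEGER);
    CREATE TABLE table2 (x INTEGER);
    INSERT INTO table1 VALUES (1), (2);
    INSERT INTO table2 VALUES (1), (2), (3);
""")

# First query: ask the catalog for table names
# (the equivalent of pg_catalog.pg_class WHERE relkind = 'r').
names = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

# Second step: build and run one count(*) per table dynamically.
# Identifiers cannot be bound as parameters, hence the string interpolation.
counts = {n: conn.execute(f'SELECT count(*) FROM "{n}"').fetchone()[0]
          for n in names}
print(counts)  # {'table1': 2, 'table2': 3}
```

This is exactly what a PL/pgSQL function with EXECUTE does server-side; in plain SQL a table name from a result row cannot be used as a FROM target.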

Alternative when the IN clause receives A LOT of values (PostgreSQL)

I'm using the IN clause to retrieve places that contains certain tags. For that I simply use
select .. FROM table WHERE tags IN (...)
For now the number of tags I provide in the IN clause is around 500, but soon (in the near future) the number of tags will probably jump to easily over 5000 (maybe even more).
I would guess there is some kind of limitation on both the size of the query AND the number of values in the IN clause (bonus question, out of curiosity: what is this limit?).
So my question is: what is a good alternative query that would be future-proof, even if in the future I would be matching against, let's say, 10'000 tags?
ps: I have looked around and seen people mentioning "temporary tables". I have never used those. How would they be used in my case? Will I need to create a temp table every time I make a query?
Thanks,
Francesco
One option is to join this to a VALUES clause:
with params (tag) as (
    values ('tag1'), ('tag2'), ('tag3')
)
select t.*
from the_table t
join params p on p.tag = t.tag;
You could create a table using:
tablename
id | tags
----+----------
1 | tag1
2 | tag2
3 | tag3
And then do:
select .. FROM table WHERE tags IN (SELECT tags FROM tablename)
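Both suggestions amount to the same thing: materialize the tags as rows and join (or use IN against a subquery) instead of a long literal list. A minimal sketch in SQLite (the table and column names follow the answers above; the sample tags are assumed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE the_table (id INTEGER, tag TEXT)")
conn.executemany("INSERT INTO the_table VALUES (?, ?)",
                 [(1, "tag1"), (2, "tag2"), (3, "tag9")])

# Load the (potentially huge) tag list into a temp table once per query ...
conn.execute("CREATE TEMP TABLE params (tag TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO params VALUES (?)",
                 [("tag1",), ("tag2",), ("tag3",)])

# ... then join against it instead of writing IN (...) with thousands of literals.
rows = conn.execute("""
    SELECT t.id, t.tag
    FROM the_table t
    JOIN params p ON p.tag = t.tag
    ORDER BY t.id
""").fetchall()
print(rows)  # [(1, 'tag1'), (2, 'tag2')]
```

The PRIMARY KEY on the temp table doubles as an index, so the join stays fast as the tag list grows.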

Postgres subquery has access to column in a higher level table. Is this a bug? or a feature I don't understand?

I don't understand why the following doesn't fail. How does the subquery have access to a column from a different table at the higher level?
drop table if exists temp_a;
create temp table temp_a as
(
select 1 as col_a
);
drop table if exists temp_b;
create temp table temp_b as
(
select 2 as col_b
);
select col_a from temp_a where col_a in (select col_a from temp_b);
/*why doesn't this fail?*/
The following fail, as I would expect them to.
select col_a from temp_b;
/*ERROR: column "col_a" does not exist*/
select * from temp_a cross join (select col_a from temp_b) as sq;
/*ERROR: column "col_a" does not exist
*HINT: There is a column named "col_a" in table "temp_a", but it cannot be referenced from this part of the query.*/
I know about the LATERAL keyword (link, link) but I'm not using LATERAL here. Also, this query succeeds even in pre-9.3 versions of Postgres (when the LATERAL keyword was introduced.)
Here's a sqlfiddle: http://sqlfiddle.com/#!10/09f62/5/0
Thank you for any insights.
Although this feature might be confusing, without it several types of queries would be more difficult, slower, or impossible to write in SQL. This feature is called a "correlated subquery", and the correlation can serve a similar function to a join.
For example: Consider this statement
select first_name, last_name from users u
where exists (select * from orders o where o.user_id=u.user_id)
This query will get the names of all users who have ever placed an order. Now, I know you could get that info using a join to the orders table, but you'd also have to use DISTINCT, which would internally require a sort and would likely perform a tad worse than this query. You could also produce a similar query with a GROUP BY.
Here's a better example that's pretty practical, and not just for performance reasons. Suppose you want to delete all users who have no orders and no tickets:
delete from users u where
not exists (select * from orders o where o.user_id = u.user_id)
and not exists (select * from tickets t where t.user_id = u.user_id)
One very important thing to note is that you should fully qualify or alias your table names when doing this or you might wind up with a typo that completely messes up the query and silently "just works" while returning bad data.
The following is an example of what NOT to do.
select * from users
where exists (select * from product where last_updated_by=user_id)
This looks just fine until you look at the tables and realize that the table "product" has no "last_updated_by" field and the user table does, which returns the wrong data. Add the alias and the query will fail because no "last_updated_by" column exists in product.
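The same scoping rules apply in SQLite, so both the silent behavior from the question and the fix-by-qualifying advice are easy to reproduce. A minimal sketch using the question's temp_a/temp_b tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE temp_a (col_a INTEGER)")
conn.execute("CREATE TABLE temp_b (col_b INTEGER)")
conn.execute("INSERT INTO temp_a VALUES (1)")
conn.execute("INSERT INTO temp_b VALUES (2)")

# col_a is not found in temp_b, so inside the subquery it silently resolves
# to the outer temp_a.col_a: the IN list becomes the row's own value.
rows = conn.execute(
    "SELECT col_a FROM temp_a WHERE col_a IN (SELECT col_a FROM temp_b)"
).fetchall()
print(rows)  # [(1,)]

# Qualifying the column turns the mistake into a hard error instead.
try:
    conn.execute(
        "SELECT col_a FROM temp_a WHERE col_a IN (SELECT b.col_a FROM temp_b b)")
except sqlite3.OperationalError as e:
    print(e)  # raises: no such column
```

This is why aliasing every table reference in a correlated subquery is cheap insurance: the unqualified version "just works" while returning the wrong data.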
I hope this has given you some examples that show how to use this feature. I use correlated subqueries all the time in UPDATE and DELETE statements (as well as in SELECTs), and I often find an absolute need for them in updates and deletes.