PostgreSQL: finding and counting unique tuples

I have a problem where I must find unique tuples in a table that have never appeared in another table. I then must count those tuples and display the ones that occur more than 10 times. That is, there are some jobs that don't require job staff; I am to find the jobs that have never required a job staff and have run in more than 10 terms.
create table JobStaff (
    job integer references Job(id),
    staff integer references Staff(id),
    role integer references JobRole(id),
    primary key (job, staff, role)
);
create table Job (
    id integer,
    branch integer not null references Branches(id),
    term integer not null references Terms(id),
    primary key (id)
);
Essentially my code is:
CREATE OR REPLACE VIEW example AS
SELECT * FROM Job
WHERE id NOT IN (SELECT DISTINCT job FROM JobStaff);
create or replace view exampleTwo as
select branch, count(*) as ct from example group by 1;
create or replace view sixThree as
select branch, ct from exampleTwo where ct > 30;
This returns two more rows than the expected result. I asked my lecturer and he said that it's because I'm counting jobs that only SOMETIMES had no job staff.
EDIT: this means that, for all the terms the job was available, there were no job staff assigned to it.
EDIT2: expected output and what I get:
What I got:
branch | cou
---------+-----
7973 | 34
7978 | 31
8386 | 33
8387 | 32
8446 | 32
8447 | 32
8448 | 31
61397 | 31
62438 | 32
63689 | 31
Expected:
branch | cou
---------+-----
7973 | 34
8387 | 32
8446 | 32
8447 | 32
8448 | 31
61397 | 31
62438 | 32
63689 | 31

You have to understand that with SQL you must know what you want before you can design the query.
In your question you write that you are looking for jobs that are not in JobStaff, but now that we have the answer here, it's clear that you were looking for branches that are not in JobStaff. My advice to you: take your time to word exactly what you want (in your language, or in English, but not in SQL) BEFORE trying to implement it. When you are more experienced that isn't always necessary, but for a beginner it's the best way to learn SQL.
The Solution:
The thing here is that you don't need views to cascade queries; you can just select from inner queries. Also, filtering on a computed value (like a count) can be done with the HAVING clause.
count(DISTINCT ...) causes duplicate entries to be counted only once, so if a job ran in a term twice, it now gets counted only once.
The query below selects all branches that ever got a job staff and then looks for jobs whose branch is not in this list.
As far as I understood your question this should help you:
SELECT branch, count(DISTINCT term) AS cou
FROM Job j
WHERE j.branch NOT IN (SELECT branch FROM Job WHERE Job.id IN (SELECT job FROM JobStaff))
GROUP BY branch
HAVING count(DISTINCT term) > 10;
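As an aside, the same logic can be written with NOT EXISTS, which, unlike NOT IN, is not tripped up if the subquery ever returns NULLs. A sketch using the corrected column names from above:
SELECT j.branch, count(DISTINCT j.term) AS cou
FROM Job j
WHERE NOT EXISTS (
    SELECT 1
    FROM Job j2
    JOIN JobStaff js ON js.job = j2.id
    WHERE j2.branch = j.branch   -- no job of this branch ever had staff
)
GROUP BY j.branch
HAVING count(DISTINCT j.term) > 10;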

Related

Optimise a simple update query for large dataset

I have some data migration that has to occur between a parent and child table. For the sake of simplicity, the schemas are as follows:
 -------        -----------
| event |      | parameter |
 -------        -----------
| id    |      | id        |
| order |      | eventId   |
 -------       | order     |
                -----------
Because of an oversight in business logic that now needs to be corrected, we need to update parameter.order to match the parent event.order. I have come up with the following SQL to do that:
UPDATE "parameter"
SET "order" = e."order"
FROM "event" e
WHERE "eventId" = e.id
The problem is that this query hadn't finished after more than 4 hours, and I had to clock out, so I cancelled it.
There are 11 million rows on parameter and 4 million rows on event. I've run EXPLAIN on the query and it tells me this:
Update on parameter  (cost=706691.80..1706622.39 rows=11217313 width=155)
  ->  Hash Join  (cost=706691.80..1706622.39 rows=11217313 width=155)
        Hash Cond: (parameter."eventId" = e.id)
        ->  Seq Scan on parameter  (cost=0.00..435684.13 rows=11217313 width=145)
        ->  Hash  (cost=557324.91..557324.91 rows=7724791 width=26)
              ->  Seq Scan on event e  (cost=0.00..557324.91 rows=7724791 width=26)
Based on this article, the "cost" referenced by EXPLAIN is an "arbitrary unit of computation".
Ultimately, this update needs to be performed, but I would accept it happening in one of two ways:
I am advised of a better way to do this query that executes in a timely manner (I'm open to all suggestions, including updating schemas, indexing, etc.)
The query remains the same, but I can somehow get an accurate prediction of execution time (even if it's hours long). That way, at least, I can manage the expectations of the team. I understand that the time can't be known without actually running the query, but is there an easy way to "convert" these arbitrary units into some millisecond execution time?
Edit for Jim Jones' comment:
I executed the following query:
SELECT psa.pid,locktype,mode,query,query_start,state FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid
I got 9 identical rows like the following:
  pid  | locktype |      mode       |    query    |     query_start     | state
-------+----------+-----------------+-------------+---------------------+--------
 23192 | relation | AccessShareLock | <see below> | 2021-10-26 14:10:01 | active
query column:
--update parameter
--set "order" = e."order"
--from "event" e
--where "eventId" = e.id
SELECT psa.pid,locktype,mode,query,query_start,state FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid
Edit 2: I think I've been stupid here... The query produced by checking these locks is just the commented query. I think that means there's actually nothing to report.
If some rows already have the target value, you can skip those empty updates (which are otherwise carried out at full cost). Like:
UPDATE parameter p
SET "order" = e."order"
FROM event e
WHERE p."eventId" = e.id
AND p."order" IS DISTINCT FROM e."order"; -- this
If both "order" columns are defined NOT NULL, simplify to:
...
AND p."order" <> e."order";
See:
How do I (or can I) SELECT DISTINCT on multiple columns?
If you have to update all or most rows - and can afford it! - writing a new table may be cheaper overall, like Mike already mentioned. But concurrency and dependent objects may stand in the way.
Aside: use legal, lower-case identifiers, so you don't have to double-quote. Makes your life with Postgres easier.
The query will be slow because each UPDATE operation has to look up the index by id. Even with an index, on a large table this is a per-row read/write, so it is slow.
I'm not sure how to get a good estimate, maybe do 1% of the table and multiply?
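One hedged way to try exactly that: run the same update on roughly 1% of the rows inside a transaction, time it with EXPLAIN ANALYZE, roll it back, and extrapolate. A sketch, assuming parameter.id is roughly uniformly distributed (the modulus filter is just one way to pick a slice):
BEGIN;

EXPLAIN ANALYZE                -- reports the actual execution time
UPDATE "parameter" p
SET    "order" = e."order"
FROM   "event" e
WHERE  p."eventId" = e.id
AND    p.id % 100 = 0;         -- roughly 1% of rows, assuming uniform ids

ROLLBACK;                      -- discard the trial changes
Multiplying the reported time by ~100 only gives a rough figure; caching and WAL effects can skew the extrapolation in either direction.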
I suggest creating a new table, then dropping the old one and renaming the new table.
CREATE TABLE parameter_new AS
SELECT
    parameter.id,
    parameter."eventId",
    e."order"
FROM
    parameter
    JOIN event AS e ON e.id = parameter."eventId";
Later, once you verify things:
ALTER TABLE parameter RENAME TO parameter_old;
ALTER TABLE parameter_new RENAME TO parameter;
Later, once you're completely certain:
DROP TABLE parameter_old;

PostgreSQL UPDATE doesn't seem to update some rows

I am trying to update a table from another table, but a few rows simply don't update, while the other million rows work just fine.
The statement I am using is as follows:
UPDATE lotes_infos l
SET quali_ambiental = s.quali_ambiental
FROM sirgas_lotes_centroid s
WHERE l.sql = s.sql AND l.quali_ambiental IS NULL;
It says 647 rows were updated, but I can't see the change.
I've also tried without the IS NULL clause; the results are the same.
If I do a join it seems to work as expected, the join query I used is this one:
SELECT sql, l.quali_ambiental, c.quali_ambiental FROM lotes_infos l
JOIN sirgas_lotes_centroid c
USING (sql)
WHERE l.quali_ambiental IS NULL;
It returns 787 rows (some are both null, that's OK). This is a sample of the result of the join:
sql | quali_ambiental | quali_ambiental
------------+-----------------+-----------------
1880040001 | | PA 10
1880040001 | | PA 10
0863690003 | | PA 4
0850840001 | | PA 4
3090500003 | | PA 4
1330090001 | | PA 10
1201410001 | | PA 9
0550620002 | | PA 6
0430790001 | | PA 1
1340180002 | | PA 9
I used QGIS to visualize the results, but could not find any clue as to why it is happening. The sirgas_lotes_centroid data comes from the other table, its geometry being the centroid of the polygon. I used the centroid to perform faster spatial joins and now need to place the information back into the table with the original polygon.
The sql column is type text, quali_ambiental is varchar(6) for both.
If I directly update one row using the following query, it works just fine:
UPDATE lotes_infos
SET quali_ambiental = 'PA 1'
WHERE sql LIKE '0040510001';
If you don't see results of a seemingly sound data-modifying query, the first question to ask is:
Did you commit your transaction?
Many clients work with auto-commit by default, but some do not. And even in the standard client psql you can start an explicit transaction with BEGIN (or syntax variants) to disable auto-commit. Results are then not visible to other transactions before the transaction is actually committed with COMMIT. The open transaction might hang indefinitely (which creates additional problems), or be rolled back by some later interaction.
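Schematically, using the query from the question: inside an explicit transaction, the update only becomes visible to other sessions after the final COMMIT:
BEGIN;

UPDATE lotes_infos l
SET    quali_ambiental = s.quali_ambiental
FROM   sirgas_lotes_centroid s
WHERE  l.sql = s.sql
AND    l.quali_ambiental IS NULL;

COMMIT;  -- without this, other transactions never see the 647 updated rows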
That said, you mention: some are both null, that's ok. You'll want to avoid costly empty updates with something like:
UPDATE lotes_infos l
SET quali_ambiental = s.quali_ambiental
FROM sirgas_lotes_centroid s
WHERE l.sql = s.sql
AND l.quali_ambiental IS NULL
AND s.quali_ambiental IS NOT NULL; --!
Related:
How do I (or can I) SELECT DISTINCT on multiple columns?
The duplicate 1880040001 in your sample can have two explanations: either lotes_infos.sql is not UNIQUE (even after filtering with l.quali_ambiental IS NULL), or sirgas_lotes_centroid.sql is not UNIQUE. Or both.
If it's just lotes_infos.sql, your query should still work. But duplicates in sirgas_lotes_centroid.sql make the query non-deterministic (as @jjanes also pointed out). A target row in lotes_infos can have multiple candidates in sirgas_lotes_centroid. The outcome is arbitrary for lack of definition. If one of them has quali_ambiental IS NULL, it can explain what you observed.
My suggested query fixes the observed problem superficially, in that it excludes NULL values in the source table. But if there can be more than one non-null, distinct quali_ambiental for the same sirgas_lotes_centroid.sql, your query remains broken, as the result is arbitrary. You'll have to define which source row to pick and translate that into SQL.
Here is one example how to do that (chapter "Multiple matches..."):
Updating the value of a column
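For illustration, a sketch using DISTINCT ON to make the pick deterministic; the ORDER BY below expresses one arbitrary choice (the alphabetically smallest non-null value) and is an assumption, not a recommendation:
UPDATE lotes_infos l
SET    quali_ambiental = s.quali_ambiental
FROM  (
   SELECT DISTINCT ON (sql)
          sql, quali_ambiental
   FROM   sirgas_lotes_centroid
   WHERE  quali_ambiental IS NOT NULL
   ORDER  BY sql, quali_ambiental  -- defines which duplicate wins
   ) s
WHERE  l.sql = s.sql
AND    l.quali_ambiental IS NULL;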
Always include exact table definitions (CREATE TABLE statements) with any such question. That would save a lot of time wasted on speculation.
Aside: why are the sql columns type text? Values like 1880040001 strike me as integer or bigint. If so, text is a costly design error.

Using results from PostgreSQL query in FROM clause of another query

I have a table that's designed as follows.
master_table
id -> serial
timestamp -> timestamp without time zone
fk_slave_id -> integer
fk_id -> id of the table
fk_table1_id -> foreign key relationship with table1
...
fk_table30_id -> foreign key relationship with table30
Every time a new table is added, this table gets altered to include a new column to link. I've been told it was designed as such to allow for deletes in the tables to cascade in the master.
The issue I'm having is finding a proper solution to linking the master table to the other tables. I can do it programmatically using loops and such, but that would be incredibly inefficient.
Here's the query being used to grab the id of the table and the id of the row within that table.
SELECT fk_slave_id, concat(fk_table1_id,...,fk_table30_id) AS id
FROM master_table
ORDER BY id DESC
LIMIT 100;
The results are:
fk_slave_id | id
-------------+-----
30 | 678
25 | 677
29 | 676
1 | 675
15 | 674
9 | 673
The next step is using this data to work out which table to query for the required data. For example, data is required from table30 with id 678.
This is where I'm stuck. If I use WITH it doesn't seem to accept the output in the FROM clause.
WITH items AS (
SELECT fk_slave_id, concat(fk_table1_id,...,fk_table30_id) AS id
FROM master_table
ORDER BY id DESC
LIMIT 100
)
SELECT data
FROM concat('table', items.fk_slave_id)
WHERE id = items.id;
This produces the following error.
ERROR: missing FROM-clause entry for table "items"
LINE x: FROM string_agg('table', items.fk_slave_id)
plpgsql is an option, using EXECUTE with format, but then I'd have to loop through each result and process it with EXECUTE.
Is there any way to achieve what I'm after using SQL or is it a matter of needing to do it programmatically?
Apologies on the bad title. I can't think of another way to word this question.
edit 1: Replaced rows with items
edit 2: Based on the responses it doesn't seem like this can be accomplished cleanly. I'll be resorting to creating an additional column and using triggers instead.
I don't think you can reference a dynamically named table like that in your FROM clause:
FROM concat('table', rows.fk_slave_id)
Have you tried building and executing that SQL from a stored procedure or function? You can create the SQL you want to execute as a string and then just EXECUTE it.
Take a look at this one:
PostgreSQL - Writing dynamic sql in stored procedure that returns a result set
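For illustration, a minimal sketch of such a function; the data column and its text type are assumptions taken from the question's SELECT data, and format() with %I safely quotes the dynamic table name:
CREATE OR REPLACE FUNCTION get_data(slave_id int, row_id int)
  RETURNS TABLE (data text)   -- column name/type assumed from the question
  LANGUAGE plpgsql AS
$$
BEGIN
   RETURN QUERY EXECUTE format(
      'SELECT data FROM %I WHERE id = $1',
      'table' || slave_id)
   USING row_id;
END
$$;

-- usage: SELECT data FROM get_data(30, 678);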

Build a list of grouped values

I'm new to this site and this is the first time I've posted a question, sorry for anything wrong. The question may be old, but I just can't find any answer for SQL Anywhere.
I have a table like
Order | Mark
======|======
  1   | AA
  2   | BB
  1   | CC
  2   | DD
  1   | EE
I want the result to be as follows:
Order | Mark
  1   | AA,CC,EE
  2   | BB,DD
My current SQL is
Select Order, Cast(Mark as NVARCHAR(20))
From #Order
Group by Order
and it just gives me a result exactly the same as the original table.
Any idea for this?
You can use the ASA LIST() aggregate function (untested; you might need to enclose the Order column name in quotes, as it is also a reserved word):
SELECT Order, LIST( Mark )
FROM #Order
GROUP BY Order;
You can customize the separator character and the ordering if you need to.
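For example (an untested sketch; the delimiter and ORDER BY arguments are standard options of LIST() in SQL Anywhere, and the quoting follows the caveat above):
SELECT "Order", LIST( Mark, ';' ORDER BY Mark ) AS Marks
FROM #Order
GROUP BY "Order";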
Note: it is rather a bad idea to
name your tables and columns like regular SQL keywords (ORDER BY)
use the same name for a column and a table (Order)

Counting entries depending on foreign key

I have 2 tables, let's call them T_FATHER and T_CHILD, where each father can have multiple children, like so:
T_FATHER
--------------------------
ID - BIGINT, from Generator
T_CHILD
-------------------------------
ID - BIGINT, from Generator
FATHER_ID - BIGINT, Foreign Key
Now I want to add a counter to the T_CHILD table that starts at 1 and increments by 1 for every new child, not globally but per father, like:
ID | FATHER_ID | COUNTER |
--------------------------
1 | 1 | 1 |
--------------------------
2 | 1 | 2 |
--------------------------
3 | 2 | 1 |
--------------------------
My initial thought was creating a before-insert trigger that counts how many children are present for the given father and adds 1 for the counter. This should work fine unless there are 2 inserts at the same time, which would end up with the same counter. Chances are high that this never actually happens - but better safe than sorry.
I don't know if it is possible to use a generator, but I don't think so, as there would have to be a generator per father.
My current approach is using the aforementioned trigger and adding a unique index to FATHER_ID + COUNTER, so that only one of two simultaneous inserts goes through. I will have to handle the exception client-side (and reattempt the failed insert).
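For reference, a minimal sketch of that trigger in Firebird PSQL (trigger and variable names are hypothetical):
SET TERM ^ ;

CREATE TRIGGER t_child_bi FOR t_child
ACTIVE BEFORE INSERT POSITION 0
AS
DECLARE VARIABLE cnt INTEGER;
BEGIN
  -- count existing children of this father and add 1;
  -- two concurrent inserts can still compute the same value,
  -- which the unique index on (FATHER_ID, COUNTER) then rejects
  SELECT COUNT(*) + 1
  FROM t_child
  WHERE father_id = NEW.father_id
  INTO :cnt;

  NEW.counter = cnt;
END^

SET TERM ; ^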
Is there a better way to handle this directly in Firebird?
PS: There won't be any deletes on any of the two tables, so this is not an issue.
Even with a generator per FATHER_ID you couldn't use them for this, because generators are not transaction safe. If your transaction is rolled back for whatever reason, the generator will have increased anyway, causing gaps.
If there are no deletes, I think your approach with a unique constraint is valid. I would consider an alternative however.
You could decide not to store the counter as such – storing counters in a database is often a bad idea. Instead, only track the insertion order. For that, a generator is usable, because every new record will have a higher value and gaps won't matter. In fact, you don't need anything but the ID you already have. Determine the numbering when selecting; for every child you want to know how many children there are with the same father but a lower ID. As a bonus, deletes would work normally.
Here's a proof of concept using a nested query:
SELECT ID, FATHER_ID,
       (SELECT 1 + COUNT(*)
        FROM T_CHILD AS OTHERS
        WHERE OTHERS.FATHER_ID = C.FATHER_ID
          AND OTHERS.ID < C.ID) AS COUNTER
FROM T_CHILD AS C
There's also the option of a window function. It needs a derived table so that rows that are ultimately not selected still get counted:
SELECT * FROM (
    SELECT ID, FATHER_ID,
           ROW_NUMBER() OVER (PARTITION BY FATHER_ID ORDER BY ID) AS COUNTER
    FROM T_CHILD
    -- Filtering that wouldn't affect COUNTER (e.g. WHERE FATHER_ID ... AND ID < ...)
)
-- Filtering that would affect COUNTER (e.g. WHERE ID > ...)
These two options have completely different performance characteristics. Which one, if either at all, is suitable for you depends on your data size and access patterns.
And what if you try a computed field with the SELECT solution of Thijs van Dien?
CREATE TABLE T_CHILD(
    ID INTEGER,
    FATHER_ID INTEGER,
    COUNTER COMPUTED BY (
        (SELECT 1 + COUNT(*)
         FROM T_CHILD AS OTHERS
         WHERE OTHERS.FATHER_ID = T_CHILD.FATHER_ID
           AND OTHERS.ID < T_CHILD.ID)
    )
);
During the insert, you could just do a "SELECT ... COUNT + 1" directly for that field.
But I would probably reconsider adding that field in the first place. It feels like redundant information that could easily be deduced at the moment you need it. (For example, by using DENSE_RANK: http://www.firebirdfaq.org/faq343/)