How can I fetch rows up to some condition for multiple groups of ids - postgresql

How can I fetch rows up to some condition for multiple groups?
Please refer to my_table below; it is currently ordered by
(1) fk_user_id ascending, (2) created_date descending.
I am using PostgreSQL with Spring Boot JPA and JPQL.
1) Create and insert statements:
CREATE TABLE public.my_table
(
id bigint,
condition boolean,
fk_user_id bigint,
created_date date
)
WITH (
OIDS=FALSE
);
ALTER TABLE public.my_table
OWNER TO postgres;
INSERT INTO public.my_table(
id, condition, fk_user_id, created_date)
VALUES
(137, FALSE, 23, '2019-08-28'),
(107, FALSE, 23, '2019-05-13'),
(83, TRUE, 23, '2019-04-28'),
(78, FALSE, 23, '2019-04-07'),
(67, TRUE, 23, '2019-03-18'),
(32, FALSE, 23, '2019-01-19'),
(181, FALSE, 57, '2019-11-04'),
(158, TRUE, 57, '2019-09-27'),
(146, FALSE, 57, '2019-09-16'),
(125, FALSE, 57, '2019-07-24'),
(378, TRUE, 71, '2020-02-16'),
(228, TRUE, 71, '2019-12-13'),
(179, FALSE, 71, '2019-10-06'),
(130, FALSE, 71, '2019-08-19'),
(114, TRUE, 71, '2019-06-29'),
(593, FALSE, 92, '2020-03-02'),
(320, FALSE, 92, '2020-01-30'),
(187, FALSE, 92, '2019-11-23'),
(180, TRUE, 92, '2019-10-17'),
(124, FALSE, 92, '2019-08-05');
For each user, I would like to fetch all of the latest rows with a FALSE condition, up to (but not including) the most recent TRUE row, and skip the remaining rows for that user.
For example:
1) User id = 23 -- the first 2 rows (ids 137, 107) are fetched, because the row dated 2019-04-28 has a TRUE condition, so the remaining rows are skipped
2) User id = 57 -- only 1 row (id 181)
3) User id = 71 -- no rows are fetched, because its latest row has a TRUE condition
Likewise for the remaining users: the result should contain, for each user, only the FALSE rows newer than that user's latest TRUE row.
I can find the rows for a single user with the query below:
select * from my_table
where fk_user_id = 23
  and created_date > (select max(created_date) from my_table
                      where fk_user_id = 23 and condition = TRUE
                      group by fk_user_id);
But I want the rows for all fk_user_id values.

SELECT t1.*
FROM sourcetable t1
WHERE t1.condition = 'FALSE'
  AND NOT EXISTS ( SELECT NULL
                   FROM sourcetable t2
                   WHERE t1.fk_user_id = t2.fk_user_id
                     AND t1.created_date < t2.created_date /* or <= */
                     AND t2.condition = 'TRUE' )
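For comparison, here is a sketch of an equivalent formulation (table and column names taken from the my_table DDL above): derive each user's latest TRUE date once and keep only the FALSE rows that are newer, which also keeps every FALSE row for users who have no TRUE row at all.
SELECT t.*
FROM my_table t
LEFT JOIN (
    SELECT fk_user_id, MAX(created_date) AS last_true_date
    FROM my_table
    WHERE condition = TRUE
    GROUP BY fk_user_id
) lt ON lt.fk_user_id = t.fk_user_id
WHERE t.condition = FALSE
  AND (lt.last_true_date IS NULL OR t.created_date > lt.last_true_date)
ORDER BY t.fk_user_id, t.created_date DESC;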

Related

Redshift duplicated rows count mismatch using CTE due to table primary key configuration

It looks like I've come across a Redshift bug/inconsistency. I explain my original question first and include below a reproducible example.
Original question
I have a table with many columns in Redshift with some duplicated rows. I've tried to determine the number of unique rows using CTEs and two different methods: DISTINCT and GROUP BY.
The GROUP BY method looks something like this:
WITH duplicated_rows as
(SELECT *, COUNT(*) AS q
FROM my_schema.my_table
GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30,
31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45,
46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75,
76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104)
---
SELECT COUNT(*) count_unique_rows, SUM(q) count_total_rows
FROM duplicated_rows
With this query I get this result:
count_unique_rows | count_total_rows
------------------------------------
27 | 83
Then I use the DISTINCT method
WITH unique_rows as
(SELECT DISTINCT *
FROM my_schema.my_table)
---
SELECT COUNT(*) as count_unique_rows
FROM unique_rows
And I get this result:
count_unique_rows
-----------------
63
So the CTE with GROUP BY seems to indicate 27 unique rows, while the CTE with DISTINCT shows 63 unique rows.
As the next troubleshooting step I executed the GROUP BY query outside the CTE, and it yielded 63 rows!
I also exported the 83 original rows to Excel and applied its remove-duplicates function; 63 rows remained, so that seems to be the correct number.
What I can't understand, for the life of me, is where the number 27 comes from when I use the CTE combined with the GROUP BY.
Is there a limitation with CTEs and Redshift that I'm not aware of? Is it a bug in my code?
Is it a bug in Redshift?
Any help in clarifying this mystery would be greatly appreciated!!
Reproducible Example
Create and populate the table
create table my_schema.students
(name VARCHAR(100),
day DATE,
course VARCHAR(100),
country VARCHAR(100),
address VARCHAR(100),
age INTEGER,
PRIMARY KEY (name));
INSERT INTO my_schema.students
VALUES
('Alan', '2000-07-15', 'Physics', 'CA', '12th Street', NULL),
('Alan', '2021-01-15', 'Math', 'USA', '8th Avenue', 21),
('Jane', '2021-01-16', 'Chemistry', 'USA', NULL, 21),
('Jane', '2021-01-16', 'Chemistry', 'USA', NULL, 21),
('Patrick', '2021-07-16', 'Chemistry', NULL, NULL, 21),
('Patrick', '2021-07-16', 'Chemistry', NULL, NULL, 21),
('Kate', '2018-07-20', 'Literature', 'AR', '8th and 23th', 18),
('Kate', '2021-10-20', 'Philosophy', 'ES', NULL, 30);
Calculate unique rows with CTE and GROUP BY
WITH duplicated_rows as
(SELECT *, COUNT(*) AS q
FROM my_schema.students
GROUP BY 1, 2, 3, 4, 5, 6)
---
SELECT COUNT(*) count_unique_rows, SUM(q) count_total_rows
FROM duplicated_rows
The result is INCORRECT!
count_unique_rows | count_total_rows
-------------------------------------
4 | 8
Calculate unique rows with CTE and DISTINCT
WITH unique_rows as
(SELECT DISTINCT *
FROM my_schema.students)
---
SELECT COUNT(*) as count_unique_rows
FROM unique_rows
The result is CORRECT!
count_unique_rows
-----------------
6
The core of the issue seems to be the primary key: Redshift doesn't enforce it, but uses it as a planning hint when deciding whether rows differ, which leads to inconsistent results inside the CTE.
The strange behaviour is caused by this line:
PRIMARY KEY (name)
From Defining table constraints - Amazon Redshift:
Uniqueness, primary key, and foreign key constraints are informational only; they are not enforced by Amazon Redshift. Nonetheless, primary keys and foreign keys are used as planning hints and they should be declared if your ETL process or some other process in your application enforces their integrity.
For example, the query planner uses primary and foreign keys in certain statistical computations. It does this to infer uniqueness and referential relationships that affect subquery decorrelation techniques. By doing this, it can order large numbers of joins and eliminate redundant joins.
The planner leverages these key relationships, but it assumes that all keys in Amazon Redshift tables are valid as loaded. If your application allows invalid foreign keys or primary keys, some queries could return incorrect results. For example, a SELECT DISTINCT query might return duplicate rows if the primary key is not unique. Do not define key constraints for your tables if you doubt their validity. On the other hand, you should always declare primary and foreign keys and uniqueness constraints when you know that they are valid.
In your sample data, the PRIMARY KEY clearly cannot be name because there are multiple rows with the same name. This violates assumptions made by Redshift and can lead to incorrect results.
If you remove the PRIMARY KEY (name) line, the data results are correct.
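For illustration, a minimal sketch of the corrected DDL (same columns as the reproducible example above, just without the constraint):
create table my_schema.students
(name VARCHAR(100),
day DATE,
course VARCHAR(100),
country VARCHAR(100),
address VARCHAR(100),
age INTEGER);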
(FYI, I discovered this by running your commands in sqlfiddle.com against a PostgreSQL database. It would not allow the data to be inserted because it violated the PRIMARY KEY condition.)

Postgres very hard dynamic select statement with COALESCE

Having a table and data like this
CREATE TABLE solicitations
(
id SERIAL PRIMARY KEY,
name text
);
CREATE TABLE donations
(
id SERIAL PRIMARY KEY,
solicitation_id integer REFERENCES solicitations, -- can be null
created_at timestamp without time zone NOT NULL DEFAULT (now() at time zone 'utc'),
amount bigint NOT NULL DEFAULT 0
);
INSERT INTO solicitations (name) VALUES
('solicitation1'), ('solicitation2');
INSERT INTO donations (created_at, solicitation_id, amount) VALUES
('2018-06-26', null, 10), ('2018-06-26', 1, 20), ('2018-06-26', 2, 30),
('2018-06-27', null, 10), ('2018-06-27', 1, 20),
('2018-06-28', null, 10), ('2018-06-28', 1, 20), ('2018-06-28', 2, 30);
How can I make the solicitation ids dynamic in the following select statement, using only Postgres?
SELECT
"created_at"
-- make dynamic this begins
, COALESCE("no_solicitation", 0) AS "no_solicitation"
, COALESCE("1", 0) AS "1"
, COALESCE("2", 0) AS "2"
-- make dynamic this ends
FROM crosstab(
$source_sql$
SELECT
created_at::date as row_id
, COALESCE(solicitation_id::text, 'no_solicitation') as category
, SUM(amount) as value
FROM donations
GROUP BY row_id, category
ORDER BY row_id, category
$source_sql$
, $category_sql$
-- parametrize with ids from here begins
SELECT unnest('{no_solicitation}'::text[] || ARRAY(SELECT DISTINCT id::text FROM solicitations ORDER BY id))
-- parametrize with ids from here ends
$category_sql$
) AS ct (
"created_at" date
-- make dynamic this begins
, "no_solicitation" bigint
, "1" bigint
, "2" bigint
-- make dynamic this ends
)
The select should return data like this
created_at no_solicitation 1 2
____________________________________
2018-06-26 10 20 30
2018-06-27 10 20 0
2018-06-28 10 20 30
The solicitation ids that should parametrize select are the same as in
SELECT unnest('{no_solicitation}'::text[] || ARRAY(SELECT DISTINCT id::text FROM solicitations ORDER BY id))
One can fiddle the code here
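Note that crosstab comes from the tablefunc extension, so if it is not already enabled in the database it has to be created once before the query above will run:
CREATE EXTENSION IF NOT EXISTS tablefunc;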
I decided to use json, which is much simpler than crosstab
WITH
all_solicitation_ids AS (
SELECT
unnest('{no_solicitation}'::text[] ||
ARRAY(SELECT DISTINCT id::text FROM solicitations ORDER BY id))
AS col
)
, all_days AS (
SELECT
-- TODO: compute days ad hoc, from min created_at day of donations to max created_at day of donations
generate_series('2018-06-26', '2018-06-28', '1 day'::interval)::date
AS col
)
, all_days_and_all_solicitation_ids AS (
SELECT
all_days.col AS created_at
, all_solicitation_ids.col AS solicitation_id
FROM all_days, all_solicitation_ids
ORDER BY all_days.col, all_solicitation_ids.col
)
, donations_ AS (
SELECT
created_at::date as created_at
, COALESCE(solicitation_id::text, 'no_solicitation') as solicitation_id
, SUM(amount) as amount
FROM donations
GROUP BY created_at, solicitation_id
ORDER BY created_at, solicitation_id
)
, donations__ AS (
SELECT
all_days_and_all_solicitation_ids.created_at
, all_days_and_all_solicitation_ids.solicitation_id
, COALESCE(donations_.amount, 0) AS amount
FROM all_days_and_all_solicitation_ids
LEFT JOIN donations_
ON all_days_and_all_solicitation_ids.created_at = donations_.created_at
AND all_days_and_all_solicitation_ids.solicitation_id = donations_.solicitation_id
)
SELECT
jsonb_object_agg(solicitation_id, amount) ||
jsonb_object_agg('date', created_at)
AS data
FROM donations__
GROUP BY created_at
which results in
data
______________________________________________________________
{"1": 20, "2": 30, "date": "2018-06-28", "no_solicitation": 10}
{"1": 20, "2": 30, "date": "2018-06-26", "no_solicitation": 10}
{"1": 20, "2": 0, "date": "2018-06-27", "no_solicitation": 10}
Though it's not the same as what I requested.
It returns only a data column, instead of date, no_solicitation, 1, 2, ....; to get those I would need to use json_to_record, but I don't know how to produce its AS argument dynamically.
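For what it's worth, with a known column list the json rows above can be expanded back into columns using jsonb_to_record. The sketch below hard-codes the record definition; producing that definition dynamically would still require dynamic SQL (for example in a PL/pgSQL function) or assembling the query in the application:
SELECT *
FROM jsonb_to_record(
    '{"1": 20, "2": 30, "date": "2018-06-26", "no_solicitation": 10}'::jsonb
) AS t("date" date, no_solicitation bigint, "1" bigint, "2" bigint);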

Performance Issue with finding recent date of each group and joining to all records

I have following tables:
CREATE TABLE person (
id INTEGER NOT NULL,
name TEXT,
CONSTRAINT person_pkey PRIMARY KEY(id)
);
INSERT INTO person ("id", "name")
VALUES
(1, E'Person1'),
(2, E'Person2'),
(3, E'Person3'),
(4, E'Person4'),
(5, E'Person5'),
(6, E'Person6');
CREATE TABLE person_book (
id INTEGER NOT NULL,
person_id INTEGER,
book_id INTEGER,
receive_date DATE,
expire_date DATE,
CONSTRAINT person_book_pkey PRIMARY KEY(id)
);
/* Data for the 'person_book' table (Records 1 - 9) */
INSERT INTO person_book ("id", "person_id", "book_id", "receive_date", "expire_date")
VALUES
(1, 1, 1, E'2016-01-18', NULL),
(2, 1, 2, E'2016-02-18', E'2016-10-18'),
(3, 1, 4, E'2016-03-18', E'2016-12-18'),
(4, 2, 3, E'2017-02-18', NULL),
(5, 3, 5, E'2015-02-18', E'2016-02-23'),
(6, 4, 34, E'2016-12-18', E'2018-02-18'),
(7, 5, 56, E'2016-12-28', NULL),
(8, 5, 34, E'2018-01-19', E'2018-10-09'),
(9, 5, 57, E'2018-06-09', E'2018-10-09');
CREATE TABLE book (
id INTEGER NOT NULL,
type TEXT,
CONSTRAINT book_pkey PRIMARY KEY(id)
) ;
/* Data for the 'book' table (Records 1 - 8) */
INSERT INTO book ("id", "type")
VALUES
( 1, E'Btype1'),
( 2, E'Btype2'),
( 3, E'Btype3'),
( 4, E'Btype4'),
( 5, E'Btype5'),
(34, E'Btype34'),
(56, E'Btype56'),
(67, E'Btype67');
My query should list the names of all persons. For persons who have recently received one of the book types in (book_id IN (2, 4, 34, 56, 67)), it should display the type and expire date of the most recently received such book; if a person hasn't received any of these book types, the book type and expire date should be blank.
My query looks like this:
SELECT p.name,
pb.expire_date,
b.type
FROM
(SELECT p.id AS person_id, MAX(pb.receive_date) recent_date
FROM
Person p
JOIN person_book pb ON pb.person_id = p.id
WHERE pb.book_id IN (2, 4, 34, 56, 67)
GROUP BY p.id
)tmp
JOIN person_book pb ON pb.person_id = tmp.person_id
AND tmp.recent_date = pb.receive_date AND pb.book_id IN
(2, 4, 34, 56, 67)
JOIN book b ON b.id = pb.book_id
RIGHT JOIN Person p ON p.id = pb.person_id
The (correct) result:
name | expire_date | type
---------+-------------+---------
Person1 | 2016-12-18 | Btype4
Person2 | |
Person3 | |
Person4 | 2018-02-18 | Btype34
Person5 | 2018-10-09 | Btype34
Person6 | |
The query works fine, but since I'm right-joining a small table to a huge one, it's slow. Is there any efficient way of rewriting this query?
My local PostgreSQL version is 9.3.18, but the query should work on version 8.4 as well, since that's our production version.
Problems with your setup
My local PostgreSQL version is 9.3.18, but the query should work on version 8.4 as well, since that's our production version.
That makes two major problems before even looking at the query:
Postgres 8.4 is just too old. Especially for "production". It has reached EOL in July 2014. No more security upgrades, hopelessly outdated. Urgently consider upgrading to a current version.
It's a loaded footgun to use very different versions for development and production. Confusion and errors that go undetected. We have seen more than one desperate request here on SO stemming from this folly.
Better query
This equivalent should be substantially simpler and faster (works in pg 8.4, too):
SELECT p.name, pb.expire_date, b.type
FROM (
SELECT DISTINCT ON (person_id)
person_id, book_id, expire_date
FROM person_book
WHERE book_id IN (2, 4, 34, 56, 67)
ORDER BY person_id, receive_date DESC NULLS LAST
) pb
JOIN book b ON b.id = pb.book_id
RIGHT JOIN person p ON p.id = pb.person_id;
To optimize read performance, this partial multicolumn index with matching sort order would be perfect:
CREATE INDEX ON person_book (person_id, receive_date DESC NULLS LAST)
WHERE book_id IN (2, 4, 34, 56, 67);
In modern Postgres versions (9.2 or later) you might append book_id, expire_date to the index columns to get index-only scans. See:
How does PostgreSQL perform ORDER BY if a b-tree index is built on that field?
About DISTINCT ON:
Select first row in each GROUP BY group?
About DESC NULLS LAST:
PostgreSQL sort by datetime asc, null first?
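As a concrete sketch of the covering variant mentioned above (the index-only-scan benefit needs Postgres 9.2 or later; same filter list as before):
CREATE INDEX ON person_book (person_id, receive_date DESC NULLS LAST, book_id, expire_date)
WHERE book_id IN (2, 4, 34, 56, 67);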

How can I insert IF in the WHERE clause? (TSQL)

I have this query in TSQL
DECLARE @DATA_INIZIO AS DATE
DECLARE @DATA_FINE AS DATE
SELECT 'TEMPERATURA RILEVATA', VS.EffectiveTimeMin as ORA, CAST(VS.Value AS INT), row_number() over (order by VS.EffectiveTimeMin) AS CONTEGGIO
FROM AA_V_PHR_CCD_VitalSigns V INNER JOIN AA_V_PHR_CCD_VitalSignsXResults VS ON V.ID = VS.IDVitalSigns
WHERE V.IDResultTypeValueSet = 23 AND V.IDClinicalDocument = 46
ORDER BY VS.EffectiveTimeMin asc
But I want to add this condition to the WHERE clause:
IF(@Data_Inizio is not null and @Data_Fine is not null)
v.effectivetime between @data_inizio and @data_fine
How can I do this?
(
    (@Data_Inizio is not null and @Data_Fine is not null)
    AND
    v.effectivetime between @data_inizio and @data_fine
)
Should be fine if I understood you properly.
Hmm, maybe you don't have proper values, or your dates are incorrect; look at my sample below:
DECLARE @DATA_INIZIO AS DATE = '2017/01/01'
DECLARE @DATA_FINE AS DATE = '2017/02/01'
Declare @Tmp Table
(
IdResult Int, IdClinical Int, EffectiveTime DateTime
)
Insert Into @Tmp (IdResult, IdClinical, EffectiveTime)
Values
(23, 46, '2017/01/23'), (22, 46, '2017/01/23'),
(23, 46, '2017/01/2'), (23, 47, '2017/01/23'),
(23, 46, '2017/01/5'), (23, 46, '2017/02/23')
SELECT *
FROM
@Tmp V
WHERE
V.IdResult = 23 AND V.IdClinical = 46
AND
(
(@Data_Inizio is null OR @Data_Fine is null)
OR
v.effectivetime between @data_inizio and @data_fine
)
Try adding these lines to your WHERE clause:
SELECT
'TEMPERATURA RILEVATA',
VS.EffectiveTimeMin as ORA,
CAST(VS.Value AS INT),
row_number() over (order by VS.EffectiveTimeMin) AS CONTEGGIO
FROM
AA_V_PHR_CCD_VitalSigns V
INNER JOIN AA_V_PHR_CCD_VitalSignsXResults VS
ON V.ID = VS.IDVitalSigns
WHERE
V.IDResultTypeValueSet = 23
AND V.IDClinicalDocument = 46
AND ( (@DATA_INIZIO is null OR @DATA_FINE is null)
      OR -- when both are not null
      (@DATA_INIZIO <= v.effectivetime AND v.effectivetime <= @DATA_FINE)
)
ORDER BY VS.EffectiveTimeMin ASC
You can 'translate' your IF statement to something like:
when DATA_INIZIO or DATA_FINE is not set (null), return the record;
when they are both set, check whether v.effectivetime fits their range.

PostgreSQL: run queries alternately

I have 3 queries.
They insert rows into account_move, built from a source table that has approx. 15k rows.
When I execute them, Query 1 first processes all 15k rows from the table at once; Query 2 and Query 3 do the same.
I want them to insert rows one by one, alternating between the queries, i.e. when executed:
Query 1 - insert one row.
Query 2 - insert one row.
Query 3 - insert one row.
Query 1 - ....
Query 2 - ...... and so on.
Query 1:
Insert into account_move(create_uid, partner_id, create_date, name, company_id, write_uid, journal_id, state, period_id, write_date, date, ref, to_check)
Select
1,
res_partner.id,
LOCALTIMESTAMP,
account_invoice.number,
1,
1,
2,
'posted',
6,
LOCALTIMESTAMP,
date(LOCALTIMESTAMP),
account_invoice.name,
FALSE
from account_invoice, sale_order, res_partner, temp_unicom
where account_invoice.origin = sale_order.name and
sale_order.partner_id = res_partner.id and
res_partner.ref = temp_unicom.sale_order_item_code;
Query 2:
Insert into account_move(create_uid, partner_id, create_date, name, company_id, write_uid, journal_id, state, period_id, write_date, date, ref, to_check)
Select
1,
res_partner.id,
LOCALTIMESTAMP,
'/',
1,
1,
9,
'draft',
6,
LOCALTIMESTAMP,
date(LOCALTIMESTAMP),
account_invoice.number,
FALSE
from account_invoice, sale_order, res_partner, temp_unicom
where account_invoice.origin = sale_order.name and
sale_order.partner_id = res_partner.id and
res_partner.ref = temp_unicom.sale_order_item_code;
Query 3:
Insert into account_move(create_uid, partner_id, create_date, name, company_id, write_uid, journal_id, state, period_id, write_date, date, ref, to_check)
Select
1,
res_partner.id,
LOCALTIMESTAMP,
'/',
1,
1,
9,
'draft',
6,
LOCALTIMESTAMP,
date(LOCALTIMESTAMP),
'ST_' || sale_order.name,
FALSE
from account_invoice, sale_order, res_partner, temp_unicom
where account_invoice.origin = sale_order.name and
sale_order.partner_id = res_partner.id and
res_partner.ref = temp_unicom.sale_order_item_code;
I also tried nesting the queries, but it did not work.
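One way to get the interleaved, row-by-row behaviour described above is a PL/pgSQL DO block that loops over the joined source rows once and issues the three inserts for each row. This is only a sketch that assumes the joins and column values from the three queries above are what you want; it has not been run against the real schema:
DO $$
DECLARE
    rec RECORD;
BEGIN
    -- One pass over the shared source join; each iteration emits one row
    -- for each of the three original queries, so the inserts are interleaved.
    FOR rec IN
        SELECT res_partner.id AS partner_id,
               account_invoice.number AS invoice_number,
               account_invoice.name AS invoice_name,
               sale_order.name AS order_name
        FROM account_invoice
        JOIN sale_order ON account_invoice.origin = sale_order.name
        JOIN res_partner ON sale_order.partner_id = res_partner.id
        JOIN temp_unicom ON res_partner.ref = temp_unicom.sale_order_item_code
    LOOP
        -- Query 1
        INSERT INTO account_move(create_uid, partner_id, create_date, name, company_id, write_uid, journal_id, state, period_id, write_date, date, ref, to_check)
        VALUES (1, rec.partner_id, LOCALTIMESTAMP, rec.invoice_number, 1, 1, 2, 'posted', 6, LOCALTIMESTAMP, date(LOCALTIMESTAMP), rec.invoice_name, FALSE);
        -- Query 2
        INSERT INTO account_move(create_uid, partner_id, create_date, name, company_id, write_uid, journal_id, state, period_id, write_date, date, ref, to_check)
        VALUES (1, rec.partner_id, LOCALTIMESTAMP, '/', 1, 1, 9, 'draft', 6, LOCALTIMESTAMP, date(LOCALTIMESTAMP), rec.invoice_number, FALSE);
        -- Query 3
        INSERT INTO account_move(create_uid, partner_id, create_date, name, company_id, write_uid, journal_id, state, period_id, write_date, date, ref, to_check)
        VALUES (1, rec.partner_id, LOCALTIMESTAMP, '/', 1, 1, 9, 'draft', 6, LOCALTIMESTAMP, date(LOCALTIMESTAMP), 'ST_' || rec.order_name, FALSE);
    END LOOP;
END $$;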