Redshift duplicated rows count mismatch using CTE due to table primary key configuration

It looks like I've come across a Redshift bug/inconsistency. I explain my original question first and include below a reproducible example.
Original question
I have a table with many columns in Redshift with some duplicated rows. I've tried to determine the number of unique rows using CTEs and two different methods: DISTINCT and GROUP BY.
The GROUP BY method looks something like this:
WITH duplicated_rows as
(SELECT *, COUNT(*) AS q
FROM my_schema.my_table
GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30,
31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45,
46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60,
61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75,
76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104)
---
SELECT COUNT(*) count_unique_rows, SUM(q) count_total_rows
FROM duplicated_rows
With this query I get this result:
count_unique_rows | count_total_rows
------------------------------------
27 | 83
Then I use the DISTINCT method
WITH unique_rows as
(SELECT DISTINCT *
FROM my_schema.my_table)
---
SELECT COUNT(*) as count_unique_rows
FROM unique_rows
And I get this result:
count_unique_rows
-----------------
63
So the CTE with GROUP BY seems to indicate 27 unique rows, while the CTE with DISTINCT shows 63 unique rows.
As the next troubleshooting step I executed the GROUP BY outside the CTE, and it yielded 63 rows!
I also exported the 83 original rows to Excel and applied the Remove Duplicates function; 63 rows remained, so that seems to be the correct number.
What I can't understand, for the life of me, is where the number 27 comes from when I use the CTE combined with the GROUP BY.
Is there a limitation with CTEs and Redshift that I'm not aware of? Is it a bug in my code?
Is it a bug in Redshift?
Any help in clarifying this mystery would be greatly appreciated!!
Reproducible Example
Create and populate the table
create table my_schema.students
(name VARCHAR(100),
day DATE,
course VARCHAR(100),
country VARCHAR(100),
address VARCHAR(100),
age INTEGER,
PRIMARY KEY (name));
INSERT INTO my_schema.students
VALUES
('Alan', '2000-07-15', 'Physics', 'CA', '12th Street', NULL),
('Alan', '2021-01-15', 'Math', 'USA', '8th Avenue', 21),
('Jane', '2021-01-16', 'Chemistry', 'USA', NULL, 21),
('Jane', '2021-01-16', 'Chemistry', 'USA', NULL, 21),
('Patrick', '2021-07-16', 'Chemistry', NULL, NULL, 21),
('Patrick', '2021-07-16', 'Chemistry', NULL, NULL, 21),
('Kate', '2018-07-20', 'Literature', 'AR', '8th and 23th', 18),
('Kate', '2021-10-20', 'Philosophy', 'ES', NULL, 30);
Calculate unique rows with CTE and GROUP BY
WITH duplicated_rows as
(SELECT *, COUNT(*) AS q
FROM my_schema.students
GROUP BY 1, 2, 3, 4, 5, 6)
---
SELECT COUNT(*) count_unique_rows, SUM(q) count_total_rows
FROM duplicated_rows
The result is INCORRECT!
count_unique_rows | count_total_rows
-------------------------------------
4 | 8
Calculate unique rows with CTE and DISTINCT
WITH unique_rows as
(SELECT DISTINCT *
FROM my_schema.students)
---
SELECT COUNT(*) as count_unique_rows
FROM unique_rows
The result is CORRECT!
count_unique_rows
-----------------
6
The core of the issue seems to be the primary key: Redshift doesn't enforce it, but apparently uses it as a planning shortcut when determining row differences within a CTE, which leads to inconsistent results.

The strange behaviour is caused by this line:
PRIMARY KEY (name)
From Defining table constraints - Amazon Redshift:
Uniqueness, primary key, and foreign key constraints are informational only; they are not enforced by Amazon Redshift. Nonetheless, primary keys and foreign keys are used as planning hints and they should be declared if your ETL process or some other process in your application enforces their integrity.
For example, the query planner uses primary and foreign keys in certain statistical computations. It does this to infer uniqueness and referential relationships that affect subquery decorrelation techniques. By doing this, it can order large numbers of joins and eliminate redundant joins.
The planner leverages these key relationships, but it assumes that all keys in Amazon Redshift tables are valid as loaded. If your application allows invalid foreign keys or primary keys, some queries could return incorrect results. For example, a SELECT DISTINCT query might return duplicate rows if the primary key is not unique. Do not define key constraints for your tables if you doubt their validity. On the other hand, you should always declare primary and foreign keys and uniqueness constraints when you know that they are valid.
In your sample data, the PRIMARY KEY clearly cannot be name because there are multiple rows with the same name. This violates assumptions made by Redshift and can lead to incorrect results.
If you remove the PRIMARY KEY (name) line, the data results are correct.
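For reference, the corrected table definition for the reproducible example is then simply the original DDL with the constraint dropped:
-- Same students table, just without the invalid PRIMARY KEY declaration
create table my_schema.students
(name VARCHAR(100),
day DATE,
course VARCHAR(100),
country VARCHAR(100),
address VARCHAR(100),
age INTEGER);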
(FYI, I discovered this by running your commands in sqlfiddle.com against a PostgreSQL database. It would not allow the data to be inserted because it violated the PRIMARY KEY condition.)

Related

How can I fetch rows up to some condition for multiple groups of ids

How can I fetch rows up to some condition for multiple groups
Please refer to my_table below, which is currently ordered by
(1) fk_user_id ascending and (2) created_date descending.
I am using PostgreSQL with Spring Boot JPA and JPQL.
Create and insert statements:
CREATE TABLE public.my_table
(
id bigint,
condition boolean,
fk_user_id bigint,
created_date date
)
WITH (
OIDS=FALSE
);
ALTER TABLE public.my_table
OWNER TO postgres;
INSERT INTO public.my_table(
id, condition, fk_user_id, created_date)
VALUES
(137, FALSE, 23, '2019-08-28'),
(107, FALSE, 23, '2019-05-13'),
(83, TRUE, 23, '2019-04-28'),
(78, FALSE, 23, '2019-04-07'),
(67, TRUE, 23, '2019-03-18'),
(32, FALSE, 23, '2019-01-19'),
(181, FALSE, 57, '2019-11-04'),
(158, TRUE, 57, '2019-09-27'),
(146, FALSE, 57, '2019-09-16'),
(125, FALSE, 57, '2019-07-24'),
(378, TRUE, 71, '2020-02-16'),
(228, TRUE, 71, '2019-12-13'),
(179, FALSE, 71, '2019-10-06'),
(130, FALSE, 71, '2019-08-19'),
(114, TRUE, 71, '2019-06-29'),
(593, FALSE, 92, '2020-03-02'),
(320, FALSE, 92, '2020-01-30'),
(187, FALSE, 92, '2019-11-23'),
(180, TRUE, 92, '2019-10-17'),
(124, FALSE, 92, '2019-08-05');
For each user, I would like to fetch all of the latest rows with a FALSE condition, going back until the last TRUE is found, and skip the remaining rows for that user.
For example:
1) User id = 23 -- the first 2 rows (ids 137, 107) are fetched, because the row dated 2019-04-28 has a TRUE condition, so the remaining rows are skipped
2) User id = 57 -- only 1 row (id 181)
3) User id = 71 -- no rows are fetched, since its latest row has a TRUE condition
Likewise, my result should contain those rows for every user.
I can find the rows for a single user with the query below:
select * from user_condition where
fk_user_id = 23 and created_date > (select max(created_date) from user_condition where fk_user_id = 23 and condition like 'TRUE' group by fk_user_id);
But I want the rows for all fk_user_id values.
SELECT t1.*
FROM sourcetable t1
WHERE t1.condition = 'FALSE'
AND NOT EXISTS ( SELECT NULL
FROM sourcetable t2
WHERE t1.fk_user_id = t2.fk_user_id
AND t1.created_date < t2.created_date /* or <= */
AND t2.condition = 'TRUE' )
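If you are on a Postgres version with window functions (8.4+), a single-pass variant of the same idea is also possible. This is only a sketch against the my_table definition from the question, and it assumes a user with no TRUE row at all should return every FALSE row, matching the NOT EXISTS query above:
-- Compute each user's latest TRUE date once, then keep only the FALSE rows
-- that come after it (or all FALSE rows if the user has no TRUE row at all)
SELECT id, condition, fk_user_id, created_date
FROM (
    SELECT t.*,
           MAX(CASE WHEN t.condition THEN t.created_date END)
               OVER (PARTITION BY t.fk_user_id) AS last_true_date
    FROM public.my_table t
) s
WHERE NOT condition
  AND (last_true_date IS NULL OR created_date > last_true_date)
ORDER BY fk_user_id, created_date DESC;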

T-SQL Grouping with LESS THAN {date} that breaks off on each occurrence of date

I am struggling with creating a grouping using LESS THAN that breaks off on each date for the parent row. I have created a contrived example to explain the data and what I would like out as a result:
CREATE TABLE [dbo].[CustomerOrderPoints](
[CustomerID] [int] NOT NULL,
[OrderPoints] [int] NOT NULL,
[OrderPointsExpiry] [date] NOT NULL
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[CustomerOrderPointsUsed](
[CustomerID] [int] NOT NULL,
[OrderPointsUsed] [int] NOT NULL,
[OrderPointsUseDate] [date] NOT NULL
) ON [PRIMARY]
GO
INSERT [dbo].[CustomerOrderPoints] ([CustomerID], [OrderPoints], [OrderPointsExpiry]) VALUES (10, 200, CAST(N'2018-03-18' AS Date))
GO
INSERT [dbo].[CustomerOrderPoints] ([CustomerID], [OrderPoints], [OrderPointsExpiry]) VALUES (10, 100, CAST(N'2018-04-18' AS Date))
GO
INSERT [dbo].[CustomerOrderPoints] ([CustomerID], [OrderPoints], [OrderPointsExpiry]) VALUES (20, 120, CAST(N'2018-05-10' AS Date))
GO
INSERT [dbo].[CustomerOrderPoints] ([CustomerID], [OrderPoints], [OrderPointsExpiry]) VALUES (30, 75, CAST(N'2018-02-10' AS Date))
GO
INSERT [dbo].[CustomerOrderPoints] ([CustomerID], [OrderPoints], [OrderPointsExpiry]) VALUES (30, 60, CAST(N'2018-04-24' AS Date))
GO
INSERT [dbo].[CustomerOrderPoints] ([CustomerID], [OrderPoints], [OrderPointsExpiry]) VALUES (30, 90, CAST(N'2018-06-25' AS Date))
GO
INSERT [dbo].[CustomerOrderPoints] ([CustomerID], [OrderPoints], [OrderPointsExpiry]) VALUES (40, 100, CAST(N'2018-06-13' AS Date))
GO
INSERT [dbo].[CustomerOrderPointsUsed] ([CustomerID], [OrderPointsUsed], [OrderPointsUseDate]) VALUES (10, 15, CAST(N'2018-02-10' AS Date))
GO
INSERT [dbo].[CustomerOrderPointsUsed] ([CustomerID], [OrderPointsUsed], [OrderPointsUseDate]) VALUES (10, 30, CAST(N'2018-02-17' AS Date))
GO
INSERT [dbo].[CustomerOrderPointsUsed] ([CustomerID], [OrderPointsUsed], [OrderPointsUseDate]) VALUES (10, 25, CAST(N'2018-03-16' AS Date))
GO
INSERT [dbo].[CustomerOrderPointsUsed] ([CustomerID], [OrderPointsUsed], [OrderPointsUseDate]) VALUES (10, 45, CAST(N'2018-04-10' AS Date))
GO
INSERT [dbo].[CustomerOrderPointsUsed] ([CustomerID], [OrderPointsUsed], [OrderPointsUseDate]) VALUES (20, 10, CAST(N'2018-02-08' AS Date))
GO
INSERT [dbo].[CustomerOrderPointsUsed] ([CustomerID], [OrderPointsUsed], [OrderPointsUseDate]) VALUES (20, 70, CAST(N'2018-04-29' AS Date))
GO
INSERT [dbo].[CustomerOrderPointsUsed] ([CustomerID], [OrderPointsUsed], [OrderPointsUseDate]) VALUES (20, 25, CAST(N'2018-05-29' AS Date))
GO
INSERT [dbo].[CustomerOrderPointsUsed] ([CustomerID], [OrderPointsUsed], [OrderPointsUseDate]) VALUES (30, 60, CAST(N'2018-02-05' AS Date))
GO
INSERT [dbo].[CustomerOrderPointsUsed] ([CustomerID], [OrderPointsUsed], [OrderPointsUseDate]) VALUES (30, 30, CAST(N'2018-03-13' AS Date))
GO
INSERT [dbo].[CustomerOrderPointsUsed] ([CustomerID], [OrderPointsUsed], [OrderPointsUseDate]) VALUES (40, 120, CAST(N'2018-06-10' AS Date))
Customers gain points, which have an expiry. We have a CustomerOrderPoints table which shows OrderPoints for customers together with the Expiry date for the points. A Customer may have many rows in this table.
We then also have the CustomerOrderPointsUsed table which shows the points that have been used and when they were used by a Customer.
I am trying to get a grouping of Customer data which will show OrderPoints used as a group against each customer but, separated on the ExpiryDate. The picture below shows an example of the Grouped Results that I would like to obtain.
We have bad, but working code that has been developed using a recursive method (RBAR), but it is very slow. I have tried a number of different SET Based grouping approaches, but cannot get the final Less Than grouping which takes into account the previous expiry dates.
This DB is on SQL Server 2008R2. Ideally I am looking for a solution that will work with SQL Server 2008R2, but will welcome options for later versions, as we may need to move this particular DB to solve this problem.
I have tried using a combination of RANK, DENSE_RANK and ROW_NUMBER (for later versions) and LAG, but have not been able to get anything working that can be built upon.
Is there a way to use SET based T-SQL to achieve this?
A caveat first: this ignores the question I raised in the comments above and simply allocates each usage row to the expiry date on or after the use date. You would need to rethink this if you need to split one use among multiple expiry dates.
The CTE allocates an expiry date to each points-used row. This is done by joining to all OrderPoints rows with an expiry date on or after the use date, then taking the minimum of those dates.
The second query then reports all OrderPoints rows, joining to the CTE on the allocated expiry date, which gives all the data needed.
WITH allocatedPoints as
(
Select U.CustomerID, U.OrderPointsUsed, MIN(P.OrderPointsExpiry) as OrderPointsExpiry
from CustomerOrderPointsUsed U
inner join CustomerOrderPoints P on P.CustomerID = U.CustomerID and P.OrderPointsExpiry >= U.OrderPointsUseDate
GROUP BY U.CustomerID, U.OrderPointsUseDate, U.OrderPointsUsed
)
Select P.CustomerID, P.OrderPoints, P.OrderPointsExpiry,
ISNULL(SUM(AP.OrderPointsUsed), 0) as used,
P.OrderPoints - ISNULL(SUM(AP.OrderPointsUsed), 0) as remaining
from CustomerOrderPoints P
left outer join allocatedPoints AP on AP.CustomerID = P.CustomerID and AP.OrderPointsExpiry = P.OrderPointsExpiry
GROUP BY P.CustomerID, P.OrderPoints, P.OrderPointsExpiry

Performance Issue with finding recent date of each group and joining to all records

I have following tables:
CREATE TABLE person (
id INTEGER NOT NULL,
name TEXT,
CONSTRAINT person_pkey PRIMARY KEY(id)
);
INSERT INTO person ("id", "name")
VALUES
(1, E'Person1'),
(2, E'Person2'),
(3, E'Person3'),
(4, E'Person4'),
(5, E'Person5'),
(6, E'Person6');
CREATE TABLE person_book (
id INTEGER NOT NULL,
person_id INTEGER,
book_id INTEGER,
receive_date DATE,
expire_date DATE,
CONSTRAINT person_book_pkey PRIMARY KEY(id)
);
/* Data for the 'person_book' table (Records 1 - 9) */
INSERT INTO person_book ("id", "person_id", "book_id", "receive_date", "expire_date")
VALUES
(1, 1, 1, E'2016-01-18', NULL),
(2, 1, 2, E'2016-02-18', E'2016-10-18'),
(3, 1, 4, E'2016-03-18', E'2016-12-18'),
(4, 2, 3, E'2017-02-18', NULL),
(5, 3, 5, E'2015-02-18', E'2016-02-23'),
(6, 4, 34, E'2016-12-18', E'2018-02-18'),
(7, 5, 56, E'2016-12-28', NULL),
(8, 5, 34, E'2018-01-19', E'2018-10-09'),
(9, 5, 57, E'2018-06-09', E'2018-10-09');
CREATE TABLE book (
id INTEGER NOT NULL,
type TEXT,
CONSTRAINT book_pkey PRIMARY KEY(id)
) ;
/* Data for the 'book' table (Records 1 - 8) */
INSERT INTO book ("id", "type")
VALUES
( 1, E'Btype1'),
( 2, E'Btype2'),
( 3, E'Btype3'),
( 4, E'Btype4'),
( 5, E'Btype5'),
(34, E'Btype34'),
(56, E'Btype56'),
(67, E'Btype67');
My query should list the names of all persons; for each person whose most recently received book is one of the types in (book_id IN (2, 4, 34, 56, 67)), it should also display that book's type and expire date, and if a person hasn't received any such book type it should display blanks for the book type and expire date.
My query looks like this:
SELECT p.name,
pb.expire_date,
b.type
FROM
(SELECT p.id AS person_id, MAX(pb.receive_date) recent_date
FROM
Person p
JOIN person_book pb ON pb.person_id = p.id
WHERE pb.book_id IN (2, 4, 34, 56, 67)
GROUP BY p.id
)tmp
JOIN person_book pb ON pb.person_id = tmp.person_id
AND tmp.recent_date = pb.receive_date AND pb.book_id IN
(2, 4, 34, 56, 67)
JOIN book b ON b.id = pb.book_id
RIGHT JOIN Person p ON p.id = pb.person_id
The (correct) result:
name | expire_date | type
---------+-------------+---------
Person1 | 2016-12-18 | Btype4
Person2 | |
Person3 | |
Person4 | 2018-02-18 | Btype34
Person5 | 2018-10-09 | Btype34
Person6 | |
The query works fine but since I'm right joining a small table with a huge one, it's slow. Is there any efficient way of rewriting this query?
My local PostgreSQL version is 9.3.18, but the query should work on version 8.4 as well since that's our production version.
Problems with your setup
My local PostgreSQL version is 9.3.18, but the query should work on version 8.4 as well since that's our production version.
That makes two major problems before even looking at the query:
Postgres 8.4 is just too old, especially for "production". It reached EOL in July 2014: no more security upgrades, hopelessly outdated. Urgently consider upgrading to a current version.
It's a loaded footgun to use very different versions for development and production; it invites confusion and errors that go undetected. We have seen more than one desperate request here on SO stemming from this folly.
Better query
This equivalent should be substantially simpler and faster (works in pg 8.4, too):
SELECT p.name, pb.expire_date, b.type
FROM (
SELECT DISTINCT ON (person_id)
person_id, book_id, expire_date
FROM person_book
WHERE book_id IN (2, 4, 34, 56, 67)
ORDER BY person_id, receive_date DESC NULLS LAST
) pb
JOIN book b ON b.id = pb.book_id
RIGHT JOIN person p ON p.id = pb.person_id;
To optimize read performance, this partial multicolumn index with matching sort order would be perfect:
CREATE INDEX ON person_book (person_id, receive_date DESC NULLS LAST)
WHERE book_id IN (2, 4, 34, 56, 67);
In modern Postgres versions (9.2 or later) you might append book_id, expire_date to the index columns to get index-only scans. See:
How does PostgreSQL perform ORDER BY if a b-tree index is built on that field?
About DISTINCT ON:
Select first row in each GROUP BY group?
About DESC NULLS LAST:
PostgreSQL sort by datetime asc, null first?
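Picking up the covering-index idea mentioned just above those links, a sketch of such an index might look like the following (the appended columns are an assumption based on what the subquery reads, and the index-only-scan benefit only applies to Postgres 9.2 or later):
-- Partial covering index: the trailing columns can allow the subquery
-- to be answered by an index-only scan on 9.2+
CREATE INDEX person_book_recent_idx
ON person_book (person_id, receive_date DESC NULLS LAST, book_id, expire_date)
WHERE book_id IN (2, 4, 34, 56, 67);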

How can I find distinct accountid's with no CEO contact?

I have a contact table which has unique contactid as primary key, but may or may not have multiple records for the same account. My task is to return a list of accountid's and account_name's that do not have any contact with a CEO designation.
I agree this should be simple, and I freely admit to being dumb, so what I did was create a temp table with all unique accountid's, then flag the ones that did have the CEO job title, then do a select distinct accountid, account_name where flag is null group by, etc., which worked quickly and correctly, but is pretty lame. I frequently write lame scripts which work great, but are shamefully elementary, namely because that's how I think.
There must a nice, elegant way to do this so maybe I can learn something. Can someone help out? Thanks heaps in advance for your help! (p.s. Using SS2014)
Sample data below, in which companies 2,3,5 do not have a CEO:
create table contact (
contactid int,
accountid int,
account_name varchar(10),
designation varchar(5));
insert into contact
values
(1, 100, 'COMPANY1', 'MGR'),
(2, 100, 'COMPANY1', 'MGR'),
(3, 100, 'COMPANY1', 'VP'),
(4, 100, 'COMPANY1', 'CEO'),
(5, 200, 'COMPANY2', 'COO'),
(6, 200, 'COMPANY2', 'CIO'),
(7, 200, 'COMPANY2', 'VP'),
(8, 200, 'COMPANY2', 'VP'),
(9, 300, 'COMPANY3', 'MGR'),
(10, 400, 'COMPANY4', 'MGR'),
(11, 400, 'COMPANY4', 'MGR'),
(12, 400, 'COMPANY4', 'CEO'),
(13, 500, 'COMPANY5', 'VP'),
(14, 500, 'COMPANY5', 'VP'),
(15, 500, 'COMPANY5', 'VP'),
(16, 500, 'COMPANY5', 'VP');
For something like this, I usually just go with a self-join where null, like this:
SELECT DISTINCT
C.accountid
FROM contact C
LEFT JOIN contact CEO
ON CEO.accountid = C.accountid
AND CEO.designation = 'CEO'
WHERE
CEO.contactid IS NULL
Something like this?
WITH CEO_IDs AS
(
SELECT DISTINCT accountID
FROM contact
WHERE designation='CEO'
)
SELECT DISTINCT accountID
FROM contact
WHERE accountid NOT IN(SELECT x.accountID FROM CEO_IDs AS x)
The CTE finds all accountIDs that do have a CEO and uses this as a negative filter to get all accountIDs that do not have a CEO...
You'd get the same with a sub-select:
SELECT DISTINCT accountID
FROM contact
WHERE accountid NOT IN
(SELECT x.accountID
FROM contact AS x
WHERE x.designation='CEO')
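Another common pattern, shown here only as a sketch alongside the answers above, is NOT EXISTS, which also makes it easy to return the account_name the question asks for:
-- Anti-join via NOT EXISTS: keep accounts that have no contact with a CEO designation
SELECT DISTINCT c.accountid, c.account_name
FROM contact AS c
WHERE NOT EXISTS
    (SELECT 1
     FROM contact AS x
     WHERE x.accountid = c.accountid
       AND x.designation = 'CEO');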

Postgres how can I merge 2 separate select queries into 1

I am using Postgres 9.4 and I would like to merge 2 separate queries into one statement. I've been looking at the How to merge these queries into 1 using subquery post but still can't figure out how to make it work. These 2 queries do work independently. Here they are:
#1: select * from votes v where v.user_id=32 and v.stream_id=130;
#2: select city, state, post, created_on, votes, id as Voted from streams
where latitudes >= 28.0363 AND 28.9059 >= latitudes order by votes desc limit 5;
I would like query #2 to be limited to 5 rows, but I don't want query #1 to count against that limit, so up to 6 rows could be returned in total. This works like a suggestion engine: query #1 returns the main thread and query #2 gives up to 5 different suggestions, which are obviously located in a different table.
Having neither your model nor your data, I simulated this problem with dummies of both in this SQL Fiddle.
CREATE TABLE votes
(
id smallint
, user_id smallint
);
CREATE TABLE streams
(
id smallint
, foo boolean
);
INSERT INTO votes
VALUES (1, 42), (2, 32), (3, 17), (4, 37), (5, 73), (6, 69), (7, 21), (8, 18), (9, 11), (10, 15), (11, 28);
INSERT INTO streams
VALUES (1, true), (2, true), (3, true), (4, true), (5, true), (6, true), (7, false), (8, false), (9, false), (10, false), (11, false);
SELECT id
FROM (SELECT id, 1 AS sort FROM votes WHERE user_id = 32) AS query_1
FULL JOIN (SELECT id FROM streams WHERE NOT foo LIMIT 5) AS query_2 USING (id) -- inner LIMIT caps the suggestions (see below)
ORDER BY sort
LIMIT 6;
Also, I have to point out that this isn't entirely my own work, but an adaptation of this answer I came across the other day. Maybe this is an approach here too.
So, what's going on? Column id stands for any column your tables and sub-queries have in common. I added votes.user_id to have something to select on in the one sub-query and streams.foo in the other.
As you demanded 6 rows at the most, I used the LIMIT clause twice: first in the sub-query, in case there is a huge number of rows in your table you don't want to select, and again in the outer query to finally restrict the number of rows. Fiddle about a little with the two limits and toggle WHERE foo / WHERE NOT foo and you'll see why.
In the first sub-query I added a sort column, as is done in that answer, because I guess you want the result of the first sub-query always on top too.
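For comparison, the same effect can also be sketched with UNION ALL against the dummy tables above. One behavioural difference to be aware of: the FULL JOIN version merges rows that share an id, while UNION ALL would return such an id twice:
-- UNION ALL variant: the main row gets sort 1, the suggestions sort 2
-- and are capped at 5 inside their own parenthesised branch
SELECT id
FROM (
    (SELECT id, 1 AS sort FROM votes WHERE user_id = 32)
    UNION ALL
    (SELECT id, 2 AS sort FROM streams WHERE NOT foo LIMIT 5)
) q
ORDER BY sort;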