postgres select count distinct returning unexpected extra row - postgresql

If there is one more UID in sessions than in users (obviously not supposed to be that way), then I expect a non-empty result set from the NOT IN query below, but I get no rows returned - this result just doesn't make logical sense to me...
select count(distinct(uid)) from users;
> 108736
select count(distinct(uid)) from sessions;
> 108737
select count(*) from sessions where uid not in (select uid from users);
> 0
and just for completeness:
select count(*) from users where uid not in (select uid from sessions);
> 0
I have checked for nulls:
select count( * ) from sessions where uid is null;
> 0
select count( * ) from users where uid is null;
> 14
The schema is defined in sqlalchemy and includes a foreign key in the session table:
uid = Column(Integer, ForeignKey('users.uid', use_alter=True, name='fk_uid'))
This schema is a static dump for analytics purposes so there is no chance of concurrency issues...

Your third query does not do what you think it does.
The following query illustrates the problem:
SELECT 1 NOT IN (SELECT unnest(ARRAY[NULL]::int[]));
This returns NULL, because it cannot be determined whether 1 <> NULL.
So in your query the WHERE condition never evaluates to TRUE: users contains a NULL uid, and comparing against a list containing a NULL yields NULL rather than TRUE for the unmatched row.
I recommend using EXCEPT to find the culprit in your sessions table.
SELECT uid from sessions EXCEPT SELECT uid from users;
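If you prefer to keep the counting style of the earlier queries, a NOT EXISTS anti-join is a NULL-safe alternative to NOT IN, since it only tests whether a matching row exists (a sketch against the same two tables):
SELECT count(*)
FROM sessions s
WHERE NOT EXISTS (SELECT 1 FROM users u WHERE u.uid = s.uid);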

Related

SQL statement timing out on large data set

Trying to complete a web process, I'm getting the error canceling statement due to statement timeout. Debugging the codebase, it turns out that the query below is timing out on a large data set. I'd appreciate any suggestions on how to improve its performance.
select userid, max(recent_activity_date) recent_activity_date
from (
    SELECT id AS userid,
           recent_logged_in AS recent_activity_date
    FROM user
    WHERE recent_logged_in > now() - cast('10 days' AS INTERVAL)
    UNION
    SELECT userid AS userid, max(recentaccessed) AS recent_activity_date
    FROM tokencreds
    WHERE recentaccessed > now() - cast('10 days' AS INTERVAL)
    GROUP BY userid
) recent_activity
WHERE EXISTS (select 1 from user where id = userid and not deleted)
group by userid
order by userid;
Index per table:
Table user: user_recent_logged_in on user (recent_logged_in)
Table tokencreds: tokencreds_userid_token on tokencreds (userid, token). tokencreds_userid_token is unique.
A lot depends on the 'layout' of your data. Are there a lot of records 'hit'? Are there a lot of records in users? in tokencreds? etc...
Personally I would go for this:
SELECT userid, MAX(recent_activity_date) AS recent_activity_date
FROM (
    SELECT id AS userid, MAX(recent_logged_in) AS recent_activity_date
    FROM user
    WHERE recent_logged_in > now() - cast('10 days' AS INTERVAL)
      AND NOT deleted
    GROUP BY id
    UNION ALL
    SELECT userid AS userid, MAX(recentaccessed) AS recent_activity_date
    FROM tokencreds
    WHERE recentaccessed > now() - cast('10 days' AS INTERVAL)
      AND EXISTS (SELECT * FROM user WHERE id = userid AND NOT deleted)
    GROUP BY userid
) recent_activity
GROUP BY userid
ORDER BY userid;
-- indexes 'needed' on :
CREATE INDEX idx_userid_not_deleted ON user (id) WHERE NOT deleted;
CREATE INDEX idx_recent_logged_in_user_id_not_deleted ON user (recent_logged_in, id) WHERE NOT deleted;
CREATE INDEX idx_recentaccessed_user_id ON tokencreds (recentaccessed, userid);
but YMMV. To get a better idea you really should provide the full EXPLAIN ANALYZE output, otherwise we're just flying blind and guessing here. It could be that the system refuses to use any of the suggested indexes, in which case you'd better remove them again, of course.
Reasoning:
the UNION will cause an implicit distinct on the sub-select, which you don't really need, as the MAX() and GROUP BY later on will pretty much do the same - so why do things twice? (see the quick demo below)
better to filter 'as soon as possible' rather than filter at the end, IMHO (**)
**: Do note that the results here ARE going to be different! (but I think mine are 'better'). E.g. suppose you have 3 records for user_id 5
user_id = 5, deleted = true, recent_activity_date = Dec 10
user_id = 5, deleted = false, recent_activity_date = Dec 8
user_id = 5, deleted = false, recent_activity_date = Dec 5
Ignoring the tokencreds table, the result for user_id 5 will be Dec 10 in your version while in mine it will be Dec 8. Check your requirements on which one you want!
edit: mistake in suggested indexes
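A quick standalone illustration of that first point - UNION deduplicates (an extra sort/hash step) while UNION ALL keeps every row (toy values, not the question's tables):
SELECT x FROM (VALUES (1), (1)) AS t(x)
UNION
SELECT 1;
-- one row: 1
SELECT x FROM (VALUES (1), (1)) AS t(x)
UNION ALL
SELECT 1;
-- three rows: 1, 1, 1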
Get rid of the union, and the exists(), and combine them into a straight join (grouping by u.id alone works here, assuming id is the user table's primary key, since the other user columns are then functionally dependent on it):
SELECT x.userid
, GREATEST(x.recent_logged_in, x.recent_activity_date ) AS recent_activity_date
FROM (
SELECT u.id AS userid
, u.recent_logged_in
, MAX(t.recentaccessed) AS recent_activity_date
FROM user u
LEFT JOIN tokencreds AS t ON t.userid = u.id
WHERE NOT u.deleted
AND (
u.recent_logged_in > now() - cast('10 days' AS INTERVAL)
OR t.recentaccessed > now() - cast('10 days' AS INTERVAL)
)
GROUP BY u.id
) x
ORDER by userid;

Postgres SQL array aggregate values from a single table based on conditions

I'm a SQL newbie and I'm trying to write the following query:
I have the following table:
user_id | chat_id
---------+---------
This represents a many-to-many mapping of users to chat rooms.
I'm looking to create a query that finds all the chat_ids that are associated with the input user_id, and then array aggregates all the user_ids associated with those chats excluding the input user_id.
So the result should look like this for example:
chat_id | user_id
---------+---------
      1 | {1,3,5,6}
I've kind of jumbled together the following query, but I'm pretty sure I got something wrong:
WITH chatIDs AS (SELECT user_chats.chat_id FROM user_chats WHERE user_chats.user_id=$1)
WITH userIDs AS (SELECT user_chats.user_id FROM user_chats WHERE user_chats.chat_id=chatIDs AND user_chats.user_id != $1)
SELECT chatIDs, array_agg(userIDs) FROM user_chats;
EDIT:
Edited for clarity
I believe you could just use a where clause to exclude the user:
SELECT chat_id, array_agg(user_id) FROM user_chats
WHERE user_id != $1 AND chat_id IN (SELECT chat_id FROM user_chats WHERE user_id = $1)
GROUP BY chat_id
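For example, with hypothetical sample data (user 2 shares chat 1 with users 1 and 3, and chat 4 with user 5), the query behaves like this:
CREATE TABLE user_chats (user_id int, chat_id int);
INSERT INTO user_chats VALUES (2, 1), (1, 1), (3, 1), (2, 4), (5, 4);
-- with $1 = 2 the query returns (array order is not guaranteed):
--  chat_id | array_agg
-- ---------+-----------
--        1 | {1,3}
--        4 | {5}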

Merge two tables in Postgresql giving preference to one particular table

I have two tables, Users and Masters. Users holds user-specific setting key-values; Masters holds the master setting key-values. I want to display key-values from the two tables, where:
1. if the user does not have that particular key, it needs to be taken from Masters
2. if the user does not exist in the table, all key-values need to be displayed from Masters
3. if the user has a key-value, the user's key-value has to be displayed
Example:
Inputs being - UserID and appID = 1.
I tried with a LEFT JOIN combination, but I am not getting the desired result if the user does not exist at all in the Users table.
Could you please give me some advice.
step-by-step demo:db<>fiddle
SELECT
COALESCE(m.app_id, u.app_id) as app_id,
COALESCE(m.setting_key, u.setting_key) as setting_key,
COALESCE(u.setting_value, m.setting_value) as setting_value -- 2
FROM
master_table m
FULL OUTER JOIN -- 1
user_table u
ON m.app_id = u.app_id AND m.setting_key = u.setting_key
WHERE COALESCE(m.app_id, u.app_id) = 1 -- 3
AND (u.user_id = 1 OR u.user_id IS NULL)
You need a FULL OUTER JOIN to also include rows that exist in only one of the two tables.
COALESCE(a, b) gives you the first non-NULL value. So if a (here the user value) is available, it is returned; otherwise b (here the master value).
Filter by app_id and user_id; the second filter also needs to allow user_id IS NULL in order to get all setting_keys. Of course, you could use COALESCE here as well: COALESCE(u.user_id, 1), where the trailing 1 is the specific user_id you're asking for.
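Here is a minimal, self-contained sketch of points 1 and 2, using inline VALUES instead of the question's tables (hypothetical keys):
SELECT COALESCE(u.k, m.k) AS setting_key,
       COALESCE(u.v, m.v) AS setting_value
FROM (VALUES ('theme', 'dark')) AS u(k, v)
FULL OUTER JOIN (VALUES ('theme', 'light'), ('lang', 'en')) AS m(k, v)
    ON u.k = m.k;
-- theme -> dark (the user value wins), lang -> en (falls back to master)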
Edit: If the user does not exist, return the Masters values for the app_id:
step-by-step demo:db<>fiddle:
SELECT DISTINCT ON (app_id, setting_key) -- 3
*
FROM (
SELECT
COALESCE(user_app_id, master_app_id) AS app_id, -- 2
COALESCE(user_setting_key, master_setting_key) AS setting_key,
COALESCE(user_setting_value, master_setting_value) AS setting_value,
user_id
FROM (
SELECT
app_id as master_app_id,
setting_key as master_setting_key,
setting_value as master_setting_value,
null as user_id,
null as user_app_id,
null as user_setting_key,
null as user_setting_value
FROM
master_table m
UNION -- 1
SELECT
*
FROM
master_table m
FULL OUTER JOIN
user_table u
ON m.app_id = u.app_id AND m.setting_key = u.setting_key
) s
) s
WHERE app_id = 1
AND (user_id = 2 OR user_id IS NULL)
ORDER BY app_id, setting_key, user_id NULLS LAST -- 3
This is a little more complicated. You need a separate data set with user_id IS NULL that can always be fetched as a fallback; this NULL user represents the unknown user.
You can achieve this by adding the Master table with NULL values using a UNION.
Now you can create the expected columns with the COALESCE() functions as described above.
The third trick is the DISTINCT ON clause on the app_id and setting_key columns. Because the ORDER BY sorts the NULL user_id rows from the UNION's default part in (1) last, DISTINCT ON fetches the user's record when it exists; when the user doesn't exist, it falls back to the default Master record.
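To see the DISTINCT ON / NULLS LAST mechanics in isolation (hypothetical values, not the question's tables):
SELECT DISTINCT ON (setting_key) setting_key, setting_value, user_id
FROM (VALUES ('theme', 'dark', 1),
             ('theme', 'light', NULL),
             ('lang', 'en', NULL)) AS s(setting_key, setting_value, user_id)
ORDER BY setting_key, user_id NULLS LAST;
-- theme -> dark/1 (the user's row sorts first), lang -> en/NULL (master fallback)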

Postgresql OR conditions with an empty subquery

How can I optimize a query whose WHERE conditions include a check for user_id = X OR user_id IN (some subquery that might return no results)?
In my example below, queries 1 and 2 are both extremely fast (< 1 ms), but query 3, which is simply an OR of the conditions in queries 1 and 2, is much slower (50 ms).
Can somebody please explain why query 3 is so slow, and in general what types of query optimization strategies I should be pursuing to avoid this problem? I realize the subquery in my example could easily be eliminated, but in real life sometimes subqueries seem like the least complicated way to get the data I want.
relevant code and data:
posts data
https://dl.dropbox.com/u/4597000/StackOverflow/sanitized_posts.csv
users data
https://dl.dropbox.com/u/4597000/StackOverflow/sanitized_users.csv
# from the shell:
# > createdb test
CREATE TABLE posts (
id integer PRIMARY KEY NOT NULL,
created_by_id integer NOT NULL,
created_at integer NOT NULL
);
CREATE INDEX index_posts ON posts (created_by_id, created_at);
CREATE INDEX index_posts_2 ON posts (created_at);
CREATE TABLE users (
id integer PRIMARY KEY NOT NULL,
login varchar(50) NOT NULL
);
CREATE INDEX index_users ON users (login);
COPY posts FROM '/path/to/sanitized_posts.csv' DELIMITERS ',' CSV;
COPY users FROM '/path/to/sanitized_users.csv' DELIMITERS ',' CSV;
-- queries:
-- query 1, fast:
EXPLAIN ANALYZE SELECT * FROM posts WHERE created_by_id = 123 LIMIT 100;
-- query 2, fast:
EXPLAIN ANALYZE SELECT * FROM posts WHERE created_by_id IN (SELECT id FROM users WHERE login = 'nobodyhasthislogin') LIMIT 100;
-- query 3, slow:
EXPLAIN ANALYZE SELECT * FROM posts WHERE created_by_id = 123 OR created_by_id IN (SELECT id FROM users WHERE login = 'nobodyhasthislogin') LIMIT 100;
Split the query (edited) so each branch can use the index on created_by_id - with the OR, the planner generally cannot turn the subquery result into an index condition and ends up filtering every row:
SELECT * FROM (
SELECT * FROM posts p WHERE p.created_by_id = 123
union
SELECT * FROM posts p
WHERE
EXISTS ( SELECT TRUE FROM users WHERE id = p.created_by_id AND login = 'nobodyhasthislogin')
) p
LIMIT 100;
How about:
EXPLAIN ANALYZE
SELECT *
FROM posts
WHERE created_by_id IN (
    SELECT 123
    UNION ALL
    SELECT id FROM users WHERE login = 'nobodyhasthislogin'
)
LIMIT 100;
Most of the time in this particular query is spent in an index scan. Here is a query that goes at it from a different angle to avoid that, but should return equivalent results.
SELECT posts.*
FROM users
JOIN posts ON posts.created_by_id = users.id
WHERE users.id = 123 OR users.login = 'nobodyhasthislogin';
This selects from the users table, doing the filter once, and then joins posts onto that.
I realize that the question is about tips for optimization, not really this specific query. To answer that, my suggestion is to run EXPLAIN ANALYZE and read up on interpreting the results - this answer was helpful to me.

SPROC T-SQL Syntax to return results if rows exist on multiple days

What I need to test for on my table is whether there are rows for a given user id and order id on two separate days (a DATETIME field holds the timestamp).
I'm pretty sure I'd need a having clause and that's why I'm here...that frightens me terribly.
Having shouldn't scare you, it is just a "Where" on an aggregated field:
Select UserID, Count(*) From OrderTbl Group By UserID Having Count(*) > 1
That'll give you all the Users that have multiple orders.
Select UserID, Count(*) From OrderTbl Where (UserID=@UserID) Group By UserID Having Count(*) > 1
will give you the count if there are multiple records for the user id in @UserID, and no rows if not.
if exists (Select UserID, Count(*) From OrderTbl Where (UserID=@UserID) Group By UserID
Having Count(*) > 1) Select 1 else Select 0
will return a 1 if there are multiple records for the User, 0 if not.
Update: Didn't realize that you could have multiple orders per day. This query will do what you want:
With DistinctDates as (Select Distinct UserID, [DATE] From OrderTbl Where (UserID=@UserID))
Select UserID, Count(*) From DistinctDates
Group By UserID Having Count(*) > 1
I am not sure if I understood your question, but this may work for you. The HAVING is your friend and you can still use the WHERE clause. This should let you know which order and user id combo occurs more than once in the table.
SELECT [UserId], [OrderId]
FROM OrderTable
WHERE UserId = @UserId
AND OrderId = @OrderId
GROUP BY UserId, OrderId
HAVING Count(*) > 1
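Note that this counts rows, not days. If the requirement is rows on two separate days and the table stores full DATETIME timestamps, count distinct dates instead (OrderDate is an assumed column name here):
SELECT UserId, OrderId
FROM OrderTable
WHERE UserId = @UserId
  AND OrderId = @OrderId
GROUP BY UserId, OrderId
HAVING COUNT(DISTINCT CAST(OrderDate AS DATE)) > 1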