SQL statement timing out on large data set

SQL statement timing out on large data set - postgresql

Trying to complete a web process, I'm getting the error canceling statement due to statement timeout. Debugging the codebase it turns out that the below query is timing out due to large data set. I appreciate any suggestions on how to increase the below query performance.
select userid, max(recent_activity_date) recent_activity_date
from (
SELECT id AS userid,
recent_logged_in AS recent_activity_date
FROM user
WHERE recent_logged_in > now() - cast('10 days' AS INTERVAL)
UNION
SELECT userid AS userid, max(recentaccessed) AS recent_activity_date
FROM tokencreds
WHERE recentaccessed > now() - cast('10 days' AS INTERVAL)
GROUP BY userid
) recent_activity
WHERE EXISTS(select 1 from user where id = userid and not deleted)
group by userid
order by userid;
Index per table:
Table user:
user_recent_logged_in on user (recent_logged_in)
Table tokencreds: tokencreds_userid_token on tokencreds (userid, token). tokencreds_userid_token is unique.

A lot depends on the 'layout' of your data. Are there a lot of records 'hit'? Are there a lot of records in users? in tokencreds? etc...
Personally I would go for this:
SELECT userid, max(recent_activity_date) recent_activity_date _user
FROM (
SELECT id AS userid, MAX(recent_logged_in) AS recent_activity_date
FROM user
WHERE recent_logged_in > now() - cast('10 days' AS INTERVAL)
AND NOT deleted
UNION ALL
SELECT userid AS userid, MAX(recentaccessed) AS recent_activity_date
FROM tokencreds
WHERE recentaccessed > now() - cast('10 days' AS INTERVAL)
AND EXISTS(SELECT * FROM user WHERE id = userid AND NOT deleted)
) recent_activity
GROUP BY userid
ORDER BY userid;
-- indexes 'needed' on :
CREATE INDEX idx_userid_not_deleted ON user (userid) WHERE NOT deleted;
CREATE INDEX idx_recent_logged_in_user_id_not_deleted ON user (recent_logged_in, userid) WHERE not deleted;
CREATE INDEX idx_recentaccessed_user_id ON tokencreds (recentaccessed, userid);
but YMMV. To get a better idea you really should provide the full EXPLAIN ANALYZE result otherwise we're just flying blind and guessing here. Could be the system will refuse to use any of the suggested indexes in which case you better remove then again off course.
Reasoning:
the UNION will cause an implicit distinct on the sub-select which you don't really need as te MAX() and GROUP BY later on will pretty much do the same so why do things twice?
better to filter 'as soon as possible' rather then filter in the end IMHO (**)
**: Do note that the results here ARE going to be different! (but I think mine are 'better'). E.g. suppose you have 3 records for user_id 5
user_id = 5, deleted = true, recent_activity_date = Dec 10
user_id = 5, deleted = false, recent_activity_date = Dec 8
user_id = 5, deleted = false, recent_activity_date = Dec 5
Ignoring the tokencreds table, the result for user_id 5 will be Dec 10 in your version while in mine it will be Dec 8. Check your requirements on which one you want!
edit: mistake in suggested indexes

Get rid of the union, and the exists(), and combine them into a straight join:
SELECT x.userid
, GREATEST(x.recent_logged_in, x.recent_activity_date ) AS recent_activity_date
FROM (
SELECT u.id AS userid
, u.recent_logged_in
, MAX(t.recentaccessed) AS recent_activity_date
FROM users u
LEFT JOIN tokencreds AS t ON t.userid = u.id
WHERE NOT u.deleted
AND (
u.recent_logged_in > now() - cast('10 days' AS INTERVAL)
OR t.recentaccessed > now() - cast('10 days' AS INTERVAL)
)
GROUP BY u.id
) x
ORDER by userid;

Related

Count distinct loop in sql

I am trying to pull unique active users before a date.
So specifically, I have a date range (let's say August - November) where I want to know the cumulative unique active users on or before a day within a month.
So, the pseudocode would look something like this:
SELECT COUNT(DISTINCT USERS) FROM USER_DB
WHERE
Month = [loop through months 8-11]
AND
DAY <= [day in loop of 1:31]
The output I desire is something Like this

step-by-step demo: db<>fiddle
SELECT
mydate,
SUM( -- 3
COUNT(DISTINCT username) -- 1, 2
) OVER (ORDER BY mydate) -- 3
FROM t
GROUP BY mydate -- 2
GROUP BY your date and count the users
Because you don't want to count ALL user accesses, but only one access per user and day, you need to add the DISTINCT
This is a window function. This one aggregates all counts which where previously done cumulatively.
If you want to get unique user over ALL days (count a user only on its first access) you can filter the users with a DISTINCT ON clause first:
demo: db<>fiddle
SELECT DISTINCT ON (username)
*
FROM t
ORDER BY username, mydate
This yields:
SELECT
mydate,
SUM(
COUNT(*)
) OVER (ORDER BY mydate)
FROM (
SELECT DISTINCT ON (username)
*
FROM t
ORDER BY username, mydate
) s
GROUP BY mydate

postgres select count distinct returning unexpected extra row

If there is one more UID in sessions than there is in users (obviously not supposed to be that way), then I expect to have a non-empty result set when I run the last select, but I get no rows returned - this result just doesn't make logical sense to me...
select count(distinct(uid)) from users;
> 108736
select count(distinct(uid)) from sessions;
> 108737
select count(*) from sessions where uid not in (select uid from users);
> 0
and just for completeness:
select count(*) from users where uid not in (select uid from sessions);
> 0
I have checked for nulls:
select count( * ) from sessions where uid is null;
> 0
select count( * ) from users where uid is null;
> 14
The schema is defined in sqlalchemy and includes a foreign key in the session table:
uid = Column(Integer, ForeignKey('users.uid', use_alter=True, name='fk_uid'))
This schema is a static dump for analytics purposes so there is no chance of concurrency issues...

Your third query does not do what you think it does.
The following query illustrates the problem:
SELECT 1 NOT IN (SELECT unnest(ARRAY[NULL]::int[]));
This returns NULL, because it can't say if 1 <> NULL.
So, in your query the where condition is always NULL, because users contains a NULL uid.
I recommend using EXCEPT do find the culprit in your sessions table.
SELECT uid from sessions EXCEPT SELECT uid from users;

TSQL get the Prev and Next ID on a list

Let's say I have a table Sales
SaleID int
UserID int
Field1 varchar(10)
Created Datetime
and right now I have loaded and viewing the record with SaleID = 23
What's the right way to find out, using a stored procedure, what's the PREVIOUS and NEXT SalesID value off the current SaleID = 23, that belongs to me (UserID = 1)?
I could do a
SELECT TOP 1 *
FROM Sales
WHERE SaleID > 23 AND UserID = 1
and the same for SaleID < 23 but that's 2 SQL calls.
Is there a better way?
I'm using the SQL Server 2012.

You can get the previous/next SaleID (or any other field) by using the LAG() and LEAD() functions introduced in SQL Server 2012.
For example:
SELECT *,
LAG(SaleID) OVER (PARTITION BY UserID ORDER BY SaleID) Prev,
LEAD(SaleID) OVER (PARTITION BY UserID ORDER BY SaleID) Next
FROM Sales S
SqlFiddle

If you omit the PARTIITION BY clause in the LAG() or LEAD() functions in the answer of thepirat000's, you can find the related previous or next records according to the SaleID column.
Here is the SQL query
SELECT *,
LAG(SaleID) OVER (ORDER BY SaleID) Prev,
LEAD(SaleID) OVER (ORDER BY SaleID) Next
FROM Sales S
The PARTITION BY clause enables you to use these functions within a grouping based on UserID as in the thepirat000's code

If you want the next and previous records only for a single row, or at least for a small set of item following query can also help in terms of performance (as an answer to Eager to Learn's comment)
select
(select top 1 t.SaleID from Sales t where t.SaleID < tab1.SaleID) as prev_id,
SaleID as current_id,
(select top 1 t.SaleID from Sales t where t.SaleID > tab1.SaleID) as next_id
from Sales where SaleID = 2

Tsql, returning rows with identical column values

Given an example table 'Users', which has an int column named 'UserID' (and some arbitrary number of other columns), what is the best way to select all rows from which UserID appears more than once?
So far I've come up with
select * from Users where UserID in
(select UserID from Users group by UserID having COUNT(UserID) > 1)
This seems like quite an innefficient way to do this though, is there a better way?

In SQL Server 2005+ you could use this approach:
;WITH UsersNumbered AS (
SELECT
UserID,
rownum = ROW_NUMBER() OVER (PARTITION BY UserID ORDER BY UserID)
FROM Users
)
SELECT u.*
FROM Users u
INNER JOIN UsersNumbered n ON u.UserID = n.UserID AND n.rownum = 2
Provided there exists a non-clustered index on UserID, this yields a slightly worse execution plan than your approach. To make it better (actually, same as yours), you'll need to use... a subquery, however counter-intuitive it may seem:
;WITH UsersNumbered AS (
SELECT
UserID,
rownum = ROW_NUMBER() OVER (PARTITION BY UserID ORDER BY UserID)
FROM Users
)
SELECT u.*
FROM Users u
WHERE EXISTS (
SELECT *
FROM UsersNumbered n
WHERE u.UserID = n.UserID AND n.rownum = 2
);
In case of a clustered index on UserID all three solutions give the same plan.

This would do the same thing but evaluate the performance and it would likely be faster/more efficient. Of course there should be an index on this UserID column.
select u.*
from Users u
join (select UserID,count(UserID) as CUserID from Users group by UserID) u1 on u1.UserID = u.UserID
where CUserID > 1

SPROC T-SQL Syntax to return results if rows exist on multiple days

what I need to test for on my table is if there are rows for a given user id and order id on two separate days (DATETIME field for a timestamp).
I'm pretty sure I'd need a having clause and that's why I'm here...that frightens me terribly.

Having shouldn't scare you, it is just a "Where" on an aggregated field:
Select UserID, Count(*) From OrderTbl Group By UserID Having Count(*) > 1
That'll give you all the Users that have multiple orders.
Select UserID, Count(*) From OrderTbl Where (UserID=#UserID) Group By UserID Having Count(*) > 1
will give you the count if there are multiple records for the user id in #UserID and null if not.
if exists (Select UserID, Count(*) From OrderTbl Where (UserID=#UserID) Group By UserID
Having Count(*) > 1) Select 1 else Select 0
will return a 1 if there are multiple records for the User, 0 if not.
Update: Didn't realize that you could have multiple orders per day. This query will do what you want:
With DistinctDates as (Select Distinct UserID, [DATE] From OrderTbl Where (UserID=#UserID))
Select UserID, Count(*) From DistinctDates
Group By UserID Having Count(*) > 1

I am not sure if I understood your question, but this may work for you. The HAVING is your friend and you can still use the WHERE clause. This should let you know what order and user id combo is occuring more than once in the table.
SELECT [UserId], [OrderId]
FROM OrderTable
WHERE UserId = #UserId
AND OrderId = #OrderId
GROUP BY UserId, OrderId
HAVING Count(*) > 1

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

SQL statement timing out on large data set - postgresql

Related

Count distinct loop in sql

postgres select count distinct returning unexpected extra row

TSQL get the Prev and Next ID on a list

Tsql, returning rows with identical column values

SPROC T-SQL Syntax to return results if rows exist on multiple days

Categories

Resources