Effect of GROUP BY depending on whether PARTITION BY is used

I'm working with Google Analytics data in BigQuery.
I wrote this query to fetch the previous pagepath using lag(pagepath) over (partition by pagepath order by hitnumber):
Select lag(pagepath) over (Partition by pagepath order by hitnumber) as previous_page, pagepath, hitnumber, clientid from table
What I'd like to know: if I group by pagepath and drop PARTITION BY from the OVER() clause, keeping only ORDER BY, why is the result different?
Select lag(pagepath) over (order by hitnumber) as previous_page, pagepath, hitnumber, clientid from table group by pagepath
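To see why the two forms behave differently, here is a small self-contained sketch, using SQLite in place of BigQuery with an invented `hits` table (SQLite 3.25+ supports these window functions):

```python
import sqlite3

# Invented sample data standing in for the Google Analytics export.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hits (clientid TEXT, hitnumber INTEGER, pagepath TEXT);
INSERT INTO hits VALUES
  ('c1', 1, '/home'),
  ('c1', 2, '/search'),
  ('c1', 3, '/home'),
  ('c1', 4, '/checkout');
""")

# PARTITION BY pagepath: LAG only sees earlier hits on the SAME page, so the
# "previous page" for '/home' at hit 3 is '/home' at hit 1.
partitioned = conn.execute("""
    SELECT pagepath, hitnumber,
           LAG(pagepath) OVER (PARTITION BY pagepath ORDER BY hitnumber)
    FROM hits ORDER BY hitnumber
""").fetchall()

# No PARTITION BY: one window over all rows, so LAG returns the page of the
# immediately preceding hit, whatever page that was.
unpartitioned = conn.execute("""
    SELECT pagepath, hitnumber,
           LAG(pagepath) OVER (ORDER BY hitnumber)
    FROM hits ORDER BY hitnumber
""").fetchall()
```

With PARTITION BY, hit 3 on '/home' gets '/home' (its predecessor within that page's partition); without it, the same row gets '/search' (the immediately preceding hit overall).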

Related

MAX(COUNT(...)) - ERROR: calls to aggregate functions cannot be nested

I have two tables - ticket(..,event_name,..) and event(..,name,..).
Task:
Write a query that displays the name of the event and the number of tickets sold for the event with the largest number of tickets sold.
The code below gets this error: ERROR: calls to aggregate functions cannot be nested
SELECT name, COUNT(event_name) as ticket_count
FROM event
INNER JOIN ticket
ON event.name = ticket.event_name
GROUP BY name
HAVING COUNT(event_name) = MAX(COUNT(event_name));
I know I can't use MAX(COUNT()), but what should I write instead to get the same logic?
The only hint I have from my lecturer is :)
COUNT(...)= (SELECT MAX(...) FROM (...))
You will have to rank the events based on the count, e.g. using the window function dense_rank().
select event_name, ticket_count
from (
    select event_name,
           count(*) as ticket_count,
           dense_rank() over (order by count(*) desc) as rnk
    from ticket
    group by event_name
) t
where rnk = 1;
The window function dense_rank() is applied after the group by and calculates the rank based on the number of tickets for each event. Because of the order by ... DESC, the event with the highest number of tickets gets rank 1.
If two events tie for the highest number of tickets, both will be listed. If you don't want that, use row_number() instead of dense_rank().
Note that I also removed the event table from the query as it is not needed for this.
The simplest solution I can think of makes use of ordering and LIMIT 1.
SELECT
name, COUNT(event_name) as ticket_count
FROM event
LEFT JOIN ticket ON event.name = ticket.event_name
GROUP BY
name
ORDER BY
ticket_count DESC
LIMIT 1;
(I used a LEFT JOIN since, I'd guess, you may have events without tickets.)
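As a sanity check, here are both approaches run against toy data in SQLite (3.25+ for window functions); the table contents are invented, and the event table is omitted as in the first answer:

```python
import sqlite3

# Invented ticket sales: 'concert' has the most tickets.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ticket (event_name TEXT);
INSERT INTO ticket VALUES ('concert'), ('concert'), ('concert'),
                          ('theatre'), ('theatre'), ('cinema');
""")

# dense_rank() approach: ties for the top count would all get rnk = 1.
rows = conn.execute("""
    SELECT event_name, ticket_count
    FROM (SELECT event_name,
                 COUNT(*) AS ticket_count,
                 DENSE_RANK() OVER (ORDER BY COUNT(*) DESC) AS rnk
          FROM ticket
          GROUP BY event_name) t
    WHERE rnk = 1
""").fetchall()

# ORDER BY ... LIMIT 1 approach: same result here, but on a tie it keeps
# an arbitrary single winner instead of listing all of them.
top = conn.execute("""
    SELECT event_name, COUNT(*) AS ticket_count
    FROM ticket
    GROUP BY event_name
    ORDER BY ticket_count DESC
    LIMIT 1
""").fetchone()
```

Both return ('concert', 3) on this data; they only diverge when two events tie for first place.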

Condition and max reference in redshift window function

I have a list of dates, accounts, and data sources. I'm taking the latest (max) date for each account and using that value in my window reference.
In my window reference, I'm using row_number() to assign a unique number to each row for each account and data source we're receiving, sorted by the max date for each account and data source. The end result should list one row for each unique account + data source combination, with the max date available in that combination. The record with the highest date should get 1.
I'm trying to set a condition on my window function so that only rows numbered 1 are returned by the query, while the other rows are not shown at all. This is what I have below, and where I get stuck:
SELECT
date,
account,
data source,
MAX(date) max_date,
ROW_NUMBER () OVER (PARTITION BY account ORDER BY max_date) ROWNUM
FROM table
GROUP BY
date,
account,
data source
Any help is greatly appreciated. I can elaborate on anything if necessary
If I understood your question correctly, this SQL should do the trick. Note the DESC in the window's ORDER BY (so the most recent row gets row number 1), the ROWNUM filter moved to the outer query (window function results can't be filtered in the same query's WHERE clause), and the alias on the derived table, which Redshift requires:
SELECT
date,
account,
data source,
max_date
FROM (
SELECT
date,
account,
data source,
MAX(date) max_date,
ROW_NUMBER() OVER (PARTITION BY account ORDER BY MAX(date) DESC) ROWNUM
FROM table
GROUP BY
date,
account,
data source
) t
where ROWNUM = 1
If you do not need the row number for anything other than uniqueness then a query like this should work:
select distinct t.account, data_source, date
from table t
join (select account, max(date) max_date from table group by account) m
on t.account=m.account and t.date=m.max_date
This can still generate two records for one account if two records for different data sources have the identical date. If that is a possibility then mdem7's approach is probably best.
It's a bit unclear from the question, but if you want each combination of account and data_source with its max date, making sure there are no duplicates, then distinct should be enough:
select distinct account, data_source, max(date) max_date
from table t
group by account, data_source
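A minimal sketch of the ROW_NUMBER() fix in SQLite, on invented data; the identifiers are stand-ins (date and table are renamed dt and t because they are reserved words here, and data source is written data_source):

```python
import sqlite3

# Invented rows: account a1 appears with two data sources and three dates.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (dt TEXT, account TEXT, data_source TEXT);
INSERT INTO t VALUES
  ('2023-01-01', 'a1', 'web'),
  ('2023-01-05', 'a1', 'app'),
  ('2023-01-03', 'a1', 'web'),
  ('2023-01-02', 'a2', 'web');
""")

# Number rows per account, newest first, then keep only row 1 in the
# outer query -- window results can't be filtered in the inner WHERE.
rows = conn.execute("""
    SELECT dt, account, data_source
    FROM (SELECT *,
                 ROW_NUMBER() OVER (PARTITION BY account
                                    ORDER BY dt DESC) AS rn
          FROM t) x
    WHERE rn = 1
""").fetchall()
```

This yields one row per account, carrying its latest date (and whichever data source that row had).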

select last of an item for each user in postgres

I want to get the last entry for each user, but the customer_id is a hash ('ASAG#...') and ordering by customer_id destroys the query. Is there an alternative?
Select Distinct On (l.customer_id)
l.customer_id
,l.created_at
,l.text
From likes l
Order By l.customer_id, l.created_at Desc
Your current query already appears to be working, q.v. here:
Demo
I don't know why your current query is not generating the results you expect. It should return one distinct record for every customer, corresponding to the most recent one, given your ORDER BY clause.
In any case, if it does not do what you want, an alternative is to use ROW_NUMBER() with a partition by customer. The inner query assigns a row number to each customer's records, with the value 1 going to the most recent record. The outer query then retains only that latest record.
SELECT
t.customer_id,
t.created_at,
t.text
FROM
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY created_at DESC) rn
FROM likes
) t
WHERE t.rn = 1
To speed up the inner query which uses ROW_NUMBER() you can try adding a composite index on the customer_id and created_at columns:
CREATE INDEX yourIdx ON likes (customer_id, created_at);
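Here is the ROW_NUMBER() variant run end-to-end against a toy likes table in SQLite (which, unlike Postgres, has no DISTINCT ON); the data is invented:

```python
import sqlite3

# Invented likes: customer ASAG#1 has two entries, the later one wins.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE likes (customer_id TEXT, created_at TEXT, text TEXT);
INSERT INTO likes VALUES
  ('ASAG#1', '2020-01-01', 'first'),
  ('ASAG#1', '2020-03-01', 'latest'),
  ('ASAG#2', '2020-02-01', 'only');
""")

# rn = 1 marks each customer's most recent row; the outer query keeps it.
rows = conn.execute("""
    SELECT t.customer_id, t.created_at, t.text
    FROM (SELECT *,
                 ROW_NUMBER() OVER (PARTITION BY customer_id
                                    ORDER BY created_at DESC) AS rn
          FROM likes) t
    WHERE t.rn = 1
    ORDER BY t.customer_id
""").fetchall()
```

Each customer comes back exactly once, with their newest created_at and text.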

Partitioning in window functions with same value in different chunk of rows

The picture below shows example data. I would like to get the first occurrence of batch_start for each batch. As you can see (green highlight), batch 1522049 occurs in two chunks: one has two rows and the second has one row.
SELECT FIRST_VALUE(batch_start) OVER (PARTITION BY batch ORDER BY batch_start)
does not solve the problem, since it joins both chunks into one and the result is '2013-01-29 10:27:23' for both of them.
Any idea how to distinguish these rows and get batch_start of each chunk of data?
This looks like a simple gaps-and-islands problem: you just need to calculate a value which is the same for every consecutive row with the same batch value, which will be
row_number() over (order by batch_start) - row_number() over (partition by batch order by batch_start)
From there, the solution depends on what you want to do with these "batch groups". For example, here is a variant that aggregates them to find the first batch_start of each:
select batch, min(batch_start)
select batch, min(batch_start)
from (select *,
             row_number() over (order by batch_start) -
             row_number() over (partition by batch order by batch_start) as batch_number
      from batches) b
group by batch, batch_number
http://rextester.com/XLX80303
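The difference-of-row-numbers trick can also be checked against a small invented dataset in SQLite, reproducing the two chunks of batch 1522049 from the question:

```python
import sqlite3

# Invented rows: batch 1522049 appears in two separate chunks, split by
# a row from another batch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE batches (batch INTEGER, batch_start TEXT);
INSERT INTO batches VALUES
  (1522049, '2013-01-29 10:27:23'),
  (1522049, '2013-01-29 10:28:01'),
  (9999999, '2013-01-29 10:30:00'),
  (1522049, '2013-01-29 10:35:00');
""")

# The global row number minus the per-batch row number is constant within
# each consecutive chunk ("island") of a batch, so grouping by it keeps
# the two chunks of 1522049 apart.
rows = conn.execute("""
    SELECT batch, MIN(batch_start)
    FROM (SELECT *,
                 ROW_NUMBER() OVER (ORDER BY batch_start) -
                 ROW_NUMBER() OVER (PARTITION BY batch
                                    ORDER BY batch_start) AS batch_number
          FROM batches) b
    GROUP BY batch, batch_number
    ORDER BY MIN(batch_start)
""").fetchall()
```

Batch 1522049 now yields two rows, one per chunk, each with that chunk's own first batch_start.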
Maybe: select batch, min(batch_start) firstOccurance, max(batch_start) lastOccurance from yourTable group by batch
or try (keeping your part of query):
SELECT FIRST_VALUE(a.batch_start) OVER (PARTITION BY a.batch ORDER BY a.batch_start) from yourTable a
join (select batch, min(batch_start) firstOccurance, max(batch_end) lastOccurance from yourTable group by batch) b on a.batch = b.batch

PostgreSQL RANK() function over aggregated column

I'm constructing a fairly complex query where I try to load users with their aggregated points, together with their rank. I found the RANK() function, which could help me achieve this, but I can't get it working.
Here's the query that is working without RANK:
SELECT users.*, SUM(received_points.count) AS pts
FROM users
LEFT JOIN received_points ON received_points.user_id = users.id AND ...other joining conditions...
GROUP BY users.id
ORDER BY pts DESC NULLS LAST
Now I would like to select also the rank - but this way using RANK function it's not working:
SELECT users.*, SUM(received_points.count) AS pts,
RANK() OVER (ORDER BY pts DESC NULLS LAST) AS position
FROM users
LEFT JOIN received_points ON received_points.user_id = users.id AND ...other joining conditions...
GROUP BY users.id
ORDER BY pts DESC NULLS LAST
It fails with: PG::UndefinedColumn: ERROR: column "pts" does not exist
I guess I'm getting the whole concept of window functions wrong. How can I select the rank of a user sorted by an aggregated value like pts in the example above?
I know I can assign ranks manually afterwards, but what if I also want to filter the rows by users.name in the query and still get each user's rank in the general (unfiltered) leaderboard...? I don't know if I'm being clear...
As Marth suggested in his comment:
You can't use pts here, as the alias doesn't exist yet (you can't reference an alias in the same SELECT where it's defined). RANK() OVER (ORDER BY SUM(received_points.count) DESC NULLS LAST) should work fine.
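A runnable sketch of the corrected query against invented data, using SQLite (3.30+ for NULLS LAST) in place of Postgres and omitting the unspecified extra joining conditions:

```python
import sqlite3

# Invented users and points; user 'cid' has no points at all.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE received_points (user_id INTEGER, count INTEGER);
INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (3, 'cid');
INSERT INTO received_points VALUES (1, 5), (1, 7), (2, 20);
""")

# The window ORDER BY repeats the aggregate expression instead of the
# alias pts, which is not visible within the same SELECT list.
rows = conn.execute("""
    SELECT users.name, SUM(received_points.count) AS pts,
           RANK() OVER (ORDER BY SUM(received_points.count)
                        DESC NULLS LAST) AS position
    FROM users
    LEFT JOIN received_points ON received_points.user_id = users.id
    GROUP BY users.id
    ORDER BY pts DESC NULLS LAST
""").fetchall()
```

bob (20 points) ranks 1, ann (12) ranks 2, and cid, with no points at all, sorts and ranks last thanks to NULLS LAST.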