Select top n users by post count - group by

I am trying to get the top n users by post count using Hive. The table looks like this:
Score User
10 1
20 2
50 1
20 2
0 3
3 1
40 2
...
I want to generate output that looks like this:
Rows Users
3 1
3 2
1 3
Here is my query:
SELECT * FROM (SELECT COUNT(score) as Score, UserID AS COUNT FROM A WHERE UserID IS NOT NULL GROUP BY UserID,score LIMIT 10) A;
The output I get is something like this
0 0
0 1
0 2
0 3
0 4
0 5
0 6
0 7
0 8
0 9
Can someone guide me as to where I am going wrong?

You are grouping by both UserID and score, so each (user, score) pair becomes its own group, and LIMIT 10 then returns ten arbitrary groups. Group by UserID alone, and order by the count descending so the LIMIT actually picks the top users:

SELECT COUNT(score) AS PostCount, UserID
FROM A
WHERE UserID IS NOT NULL
GROUP BY UserID
ORDER BY PostCount DESC
LIMIT 10

Related

query customer retention over range

I am trying to find the best way to accomplish the following.
Get the beginning customer count, which carries over from the previous period
Get New Customer count
Get the number of Customers who have not come in since the prior month
Get the number of Customers who have come back after lapsing
Get the number of total customers
Here is an example:
Customer ID  Store ID  Date    Amount
1            1         1/2/22  1.00
2            2         1/2/22  2.00
1            1         2/2/22  1.00
3            2         3/2/22  1.00
2            2         3/2/22  1.00
1            1         3/2/22  1.00
1            1         4/2/22  1.00
4            1         4/2/22  1.00
2            2         4/2/22  1.00
The result would be:
Date    Store  Beginning  New  Dropped  Returned  Total
1/2/22  1      0          1    0        0         1
1/2/22  2      0          1    0        0         1
2/2/22  1      1          0    0        0         1
2/2/22  2      1          0    1        0         0
3/2/22  1      1          0    0        0         1
3/2/22  2      0          1    0        1         2
4/2/22  1      1          1    0        0         2
4/2/22  2      2          0    1        0         1
I kind of have a query, but it's not getting the right results
WITH customerset AS (
    SELECT
        location_id,
        date,
        array_agg(DISTINCT customer_id ORDER BY customer_id ASC) AS customer_ids
    FROM customer_orders
    GROUP BY location_id, date
)
SELECT
    cset.location_id,
    cset.date,
    array_length(cset2.customer_ids, 1) AS beginning,
    array_length((cset2.customer_ids - cset.customer_ids), 1) AS dropped,
    array_length((cset.customer_ids - cset2.customer_ids), 1) AS returned
FROM (
    SELECT
        ords.location_id,
        ords.date,
        array_agg(DISTINCT ords.customer_id ORDER BY ords.customer_id ASC) AS customer_ids
    FROM customer_orders ords
    GROUP BY ords.location_id, ords.date
) cset
JOIN customerset cset2
    ON cset.date - '1 month'::interval = cset2.date
    AND cset2.location_id = cset.location_id
GROUP BY
    cset.location_id,
    cset.date,
    cset2.customer_ids,
    cset.customer_ids
ORDER BY cset.date ASC
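One way to classify customers without the array arithmetic is a window function. The sketch below is a different technique from the query above, not a fix of it: it assumes the same customer_orders table and monthly data, derives each customer's previous visit to the same store with LAG, and buckets visits into new (never seen), returned (seen before, but lapsed), with beginning carried over from the prior month's total. Dropped customers and store-months with no orders have no row to classify, so they would still need a generated calendar spine or anti-join on top of this:

WITH visits AS (
    SELECT DISTINCT
        location_id,
        customer_id,
        date_trunc('month', date) AS month
    FROM customer_orders
),
classified AS (
    SELECT
        location_id,
        customer_id,
        month,
        -- this customer's previous visit month at this store, if any
        LAG(month) OVER (PARTITION BY location_id, customer_id ORDER BY month) AS prev_month
    FROM visits
),
monthly AS (
    SELECT
        month,
        location_id,
        COUNT(*) FILTER (WHERE prev_month IS NULL) AS new,
        COUNT(*) FILTER (WHERE prev_month < month - interval '1 month') AS returned,
        COUNT(*) AS total
    FROM classified
    GROUP BY month, location_id
)
SELECT
    month,
    location_id,
    -- "beginning" carries over the prior month's total
    COALESCE(LAG(total) OVER (PARTITION BY location_id ORDER BY month), 0) AS beginning,
    new,
    returned,
    total
FROM monthly
ORDER BY month, location_id;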

TSQL - Max per group?

I have a table that looks like this:
GroupID UserID Value
1 1 10
1 2 20
1 3 30
1 4 40
1 5 45
1 6 49
1 7 80
1 8 90
2 1 2
2 2 24
2 3 34
2 4 48
2 5 56
3 1 etc.
3 2
3 3
3 4
4 1
4 2
4 3
I am trying to write a LEAD function that will give me the midpoint between each value. To do this I have written the following:
SELECT
[GroupID]
, [UserID]+0.5
, (LEAD ([Value], 1) OVER (ORDER BY GroupID, UserID) + [Value])/2 as [Value]
from dbo.myTable
The problem with this function is that when it gets to the last user in a group, it gives me a bad value, because it averages the current row's [Value] with the first value of the next group.
What I want to do is stop it when it reaches the maximum UserID for each group. In other words, when it gets to GroupID = 1 and UserID = 8, it should stop and start again at the next group. I do not want a row that looks like this:
GroupID UserID Value
1 8.5 46
I could run a DELETE statement after I INSERT the rows into the original table, but I don't have anything to identify when a row is the "maximum" user for its group. Ideally, I would like to somehow tell the LEAD statement not to calculate it in the first place.
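One way to get that behaviour (a sketch, not the only option): partition the window by GroupID, so LEAD returns NULL on the last row of each group instead of reading into the next group, then filter those rows out. The window result has to be wrapped in a derived table because window functions cannot appear directly in a WHERE clause:

SELECT [GroupID], [UserID], [Value]
FROM (
    SELECT
        [GroupID],
        [UserID] + 0.5 AS [UserID],
        -- NULL on each group's last row, so no cross-group midpoint is formed
        (LEAD([Value], 1) OVER (PARTITION BY [GroupID] ORDER BY [UserID]) + [Value]) / 2.0 AS [Value],
        LEAD([Value], 1) OVER (PARTITION BY [GroupID] ORDER BY [UserID]) AS [NextValue]
    FROM dbo.myTable
) t
WHERE t.[NextValue] IS NOT NULL;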

Recursive Cumulative Sum up to a certain value Postgres

I have data that looks like this:
user_id touchpoint_number days_difference
1 1 5
1 2 20
1 3 25
1 4 10
2 1 2
2 2 30
2 3 4
I would like to create one more column that holds a cumulative sum of days_difference, partitioned by user_id, but that resets to 0 whenever the running total reaches 30 and starts counting again. I have been trying, but I couldn't figure out how to do it in PostgreSQL, because it has to be recursive.
The outcome I would like to have would be something like:
user_id touchpoint_number days_difference cum_sum_upto30
1 1 5 5
1 2 20 25
1 3 25 0 --- new count all over again
1 4 10 10
2 1 2 2
2 2 30 0 --- new count all over again
2 3 4 4
Do you have any cool ideas how this could be done?
This should do what you want:
with recursive cte as (
      select t.a, t.b, t.c, t.c as sumc
      from t
      where b = 1
      union all
      select t.a, t.b, t.c,
             (case when t.c + cte.sumc > 30 then 0 else t.c + cte.sumc end)
      from t join
           cte
           on t.b = cte.b + 1 and t.a = cte.a
     )
select *
from cte
order by a, b;
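Mapped onto the question's columns (a sketch; the table name touchpoints is assumed, and the recursion relies on touchpoint_number increasing by exactly 1 per user):

with recursive cte as (
      select user_id, touchpoint_number, days_difference,
             days_difference as cum_sum_upto30
      from touchpoints
      where touchpoint_number = 1
      union all
      select t.user_id, t.touchpoint_number, t.days_difference,
             -- reset the running total once it passes 30
             (case when t.days_difference + cte.cum_sum_upto30 > 30 then 0
                   else t.days_difference + cte.cum_sum_upto30
              end)
      from touchpoints t join
           cte
           on t.touchpoint_number = cte.touchpoint_number + 1 and
              t.user_id = cte.user_id
     )
select *
from cte
order by user_id, touchpoint_number;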

Addition of columns after doing arithmetic operations in pyspark

I am new to PySpark and I am trying to do some data manipulation with it.
I have a DataFrame like below example:
Trxn Cust_ID Group
3370 A 1
8809 C 2
3525 B 3
8260 A 3
6349 B 3
3359 C 3
3701 NULL 3
5572 NULL 2
2580 A 1
In this DF, Trxn is unique while Cust_ID can repeat, and every Cust_ID belongs to some group. I need a final DataFrame with new group count columns (Group_1, Group_2, and so on), each holding the number of transactions that row's Cust_ID has in that group. Below is the expected output:
Trxn Cust_ID Group Group_1 Group_2 Group_3
3370 A 1 2 0 1
8809 C 2 0 1 1
3525 B 3 0 0 2
8260 A 3 2 0 1
6349 B 3 0 0 2
3359 C 3 0 1 1
3701 NULL 3 0 1 1
5572 NULL 2 0 1 1
2580 A 1 2 0 1
Can someone let me know how to get this exact output in pyspark? Any help or hints would be greatly appreciated.
Seems like you are trying to do a pivot here.
https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
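A minimal PySpark sketch of that pivot (the inline data mirrors the question; the explicit pivot value list and the null-safe join are my assumptions, the latter so the NULL Cust_ID rows keep their counts):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(3370, "A", 1), (8809, "C", 2), (3525, "B", 3),
     (8260, "A", 3), (6349, "B", 3), (3359, "C", 3),
     (3701, None, 3), (5572, None, 2), (2580, "A", 1)],
    ["Trxn", "Cust_ID", "Group"],
)

# One row per Cust_ID, one column per Group value, counting transactions
counts = (
    df.groupBy("Cust_ID")
      .pivot("Group", [1, 2, 3])
      .count()
      .na.fill(0)
      .select(F.col("Cust_ID").alias("Cust_ID_key"),
              *[F.col(str(g)).alias(f"Group_{g}") for g in (1, 2, 3)])
)

# eqNullSafe matches NULL Cust_IDs to each other; a plain equality join would drop them
result = (
    df.join(counts, df["Cust_ID"].eqNullSafe(counts["Cust_ID_key"]), "left")
      .drop("Cust_ID_key")
)
result.show()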

Reorder Ranked rows

Recently I needed to implement a way to allow table records to be ranked.
Initially I deployed an UPDATE statement to seed the ranks:
;with cte as (
    select
        t.id,
        Rank() Over (
            Partition by t.field2
            Order by t.id
        ) as [Rank],
        t.[index],
        t.field2,
        t.field3,
        t.field4
    from dbo.[Table] t
    where t.field2 = @fldValue
)
Update cte
set [index] = [Rank]
But now I need to let the end user re-order the ranks. Any suggestions on how to let an end user move the row at rank 92 to rank 15 and have everything re-ranked appropriately?
I had thought about doing this with a cursor, but I am trying to move away from procedural approaches and do this as a set-based operation.
Table Schema
Table:
id bigint
field2 int
field3 int ---> This field will be the key pivoting column for ranking
field4 int
Data:
id field2 field3 field4
1 0 1 1
2 0 1 1
3 0 1 1
4 0 1 2
5 0 1 2
6 0 1 1
7 0 1 1
8 0 1 1
9 0 1 1
10 0 1 2
11 0 1 2
12 0 1 1
13 0 1 1
14 0 1 1
15 0 1 2
16 0 1 1
17 0 1 2
18 0 1 2
19 0 1 1
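For the re-order itself, one set-based approach (a sketch; the variable names are illustrative, and it assumes the ranks in [index] are dense within a field2 partition) is a single UPDATE that shifts every row between the old and new positions by one and drops the moved row into the freed slot:

declare @fldValue int = 0,   -- the field2 partition being re-ranked
        @oldRank  int = 92,  -- current rank of the row to move
        @newRank  int = 15;  -- target rank

update dbo.[Table]
set [index] = case
        -- the moved row lands on its target rank
        when [index] = @oldRank then @newRank
        -- moving up: rows from the target slot to just above the old slot shift down one
        when @newRank < @oldRank and [index] >= @newRank and [index] < @oldRank
            then [index] + 1
        -- moving down: rows just below the old slot to the target slot shift up one
        when @newRank > @oldRank and [index] > @oldRank and [index] <= @newRank
            then [index] - 1
        else [index]
    end
where field2 = @fldValue;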