CASE WHEN with COLLECT_SET - hiveql

I have a toy table:
hive> SELECT * FROM ds.forgerock;
OK
forgerock.id forgerock.productname forgerock.description
1 OpenIDM Platform for building enterprise provisioning solutions
2 OpenAM Full-featured access management
3 OpenDJ Robust LDAP server for Java
4 OpenDJ desc2
4 OpenDJ desc2
Time taken: 0.083 seconds, Fetched: 5 row(s)
I am trying to get a table like:
id flag
1 0
2 0
3 1
4 1
I am using the toy table to iterate and develop working code.
SELECT id, CASE WHEN "OpenDJ" IN COLLECT_SET(productname) THEN 1 ELSE 0 END AS flag,
GROUP BY id FROM ds.forgerock;
Note that in the toy data set every id has only one distinct productname, so COLLECT_SET doesn't seem necessary. The actual data set has more than one distinct value per id, which is where what I am trying to do makes more sense.

Use max() for flag aggregation by id:
SELECT id, max(CASE WHEN productname='OpenDJ' THEN 1 ELSE 0 END) AS flag
FROM ds.forgerock
GROUP BY id;
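As a quick check of the max() approach, here is a minimal sketch that runs the same query against the toy rows. It assumes SQLite (through Python's sqlite3 module) as a stand-in for Hive, and the table is named forgerock without the ds schema; the aggregate itself is plain SQL and reads the same in both.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the Hive table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE forgerock (id INTEGER, productname TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO forgerock VALUES (?, ?, ?)",
    [
        (1, "OpenIDM", "Platform for building enterprise provisioning solutions"),
        (2, "OpenAM", "Full-featured access management"),
        (3, "OpenDJ", "Robust LDAP server for Java"),
        (4, "OpenDJ", "desc2"),
        (4, "OpenDJ", "desc2"),
    ],
)

# The flag is 1 if any row of the id matches, because MAX picks the largest 0/1 value.
flags = conn.execute(
    """
    SELECT id,
           MAX(CASE WHEN productname = 'OpenDJ' THEN 1 ELSE 0 END) AS flag
    FROM forgerock
    GROUP BY id
    ORDER BY id
    """
).fetchall()

print(flags)  # [(1, 0), (2, 0), (3, 1), (4, 1)]
```

Because the CASE expression is evaluated per row and then aggregated, this keeps working when an id has several distinct productname values.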

Related

User Sessions | Months Since Last Active Using SQL

UserID CalMonth ActiveFlag Months_since_last_active
A 1/1/2021 1 0
A 2/1/2021 NULL 1
A 3/1/2021 NULL 2
A 4/1/2021 1 0
B 1/1/2021 1 0
B 2/1/2021 NULL 1
B 3/1/2021 1 0
Problem --> The first 3 columns are given. Generate the last one, 'Months_since_last_active', by adding 1 until the user is active again.
My Solution as below:
With active_sessions as (
Select
User_Id
, CalMonth
, active flag as current_flag
, LAG (ActiveFlag,1) over (partition by User_Id order by CalMonth) as previous_flag
)
Select User_Id, CalMonth, current_flag, sum(case when current_flag =1 then 0
when current_flag IS NULL then Months_since_last_active + 1
END
) as Months_since_last_active
from active_sessions
order by 1,2
I was asked the above question in an interview and told that my proposed solution would not work because:
When it comes to 3/1/2021 and beyond, the previous values of 'Months_since_last_active' are not in the table yet -- they are only in the code
If I wanted to use LAG function, then it'd take innumerable LAG functions to achieve what I was trying to achieve
I would appreciate it if someone could comment on my solution.
Your solution has 3 major problems, 2 of which may be copy/paste errors. The active_sessions CTE is missing the FROM clause, so there is no data source. The main query uses the aggregate function SUM but has no GROUP BY, which is required because the select list also contains non-aggregated columns. These are easily corrected. The other issue concerns the LAG function and your use of it.
First off, in the CTE you alias the result as previous_flag, but then in the main query you reference Months_since_last_active, which does not exist yet. I think this is the source of the interviewer's first point.
The interviewer's second point also stems from the LAG function. As written, it always looks back exactly 1 row from the current row, yet it needs to look back 2 rows for (userid, calmonth) = ('A', 2021-03-01), 3 rows for ('A', 2021-04-01), etc. Basically, you need to look back to the last row with activeflag = 1. This leads directly to the point that it would take innumerable LAG functions, as you do not know how far back you need to look. Suppose you had 30-40 or more inactive rows between active rows: you would need a LAG(activeflag, n) ... for each possibility.
A solution. I dislike the problem statement: it should not contain "by adding 1 until the user is active again" (is that yours or theirs?). Either way, this is an XY problem. If it is theirs, they should be telling you what to solve, i.e. find the number of months since last active. If it is yours, you have created the problem for yourself. The problem statement should not say anything about how to solve it. I will ignore that portion of the problem (and in a real interview I would ignore it, and have, but be prepared to explain why).
What you have is a version of a Gaps and Islands problem (google it; you will find plenty to think about). In this version, let's consider each row with activeflag = 1 as an island, and anything else as a gap. What you are looking for is the length of the gaps between islands. In the following, the island_num CTE does 2 things: it assigns a sequence number to each row within a userid, ordered by calmonth, and it generates a boolean marking each island. The gap_points CTE then joins the result with itself, selecting the row number assigned to the latest island whose calmonth is less than the current row's calmonth. In the main part, Months_since_last_active is assigned 0 if the current row is an island, and the difference between the generated row numbers if it is a gap.
with island_num (userid, cal_month, active_flag, is_island, row_num) as
   ( -- number each user's rows by month and flag the active rows as islands
     select am.*
          , case when am.activeflag = 1 then true else false end is_island
          , row_number() over (partition by am.userid order by am.calmonth) rn
       from active_month am
   ) -- select * from island_num
   , gap_points (userid, cal_month, active_flag, is_island, row_num, island_row) as
   ( -- for each row, find the row number of the latest earlier island
     select *
       from island_num i1
       join lateral
            (select max(row_num)
               from island_num i2
              where i1.userid = i2.userid
                and i2.cal_month < i1.cal_month
                and i2.is_island
            ) s0
         on true
   ) --select * from gap_points;
select userid "User Id"
     , cal_month "Cal Month"
     , active_flag "Active Flag"
     , case when is_island then 0
            else row_num - island_row
       end "Months_since_last_active"
  from gap_points;
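For comparison, the same gap length can be computed without a self-join by taking a running maximum of the row number over the active rows. The following is a minimal sketch of that alternative, not the query above: it assumes SQLite 3.25+ (window functions) through Python's sqlite3 module, ISO-formatted dates so that text ordering matches date ordering, NULL for the inactive months, and exactly one row per user per month (as in the answer above, it counts rows, not calendar months).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE active_month (userid TEXT, calmonth TEXT, activeflag INTEGER)")
conn.executemany(
    "INSERT INTO active_month VALUES (?, ?, ?)",
    [
        ("A", "2021-01-01", 1),
        ("A", "2021-02-01", None),
        ("A", "2021-03-01", None),
        ("A", "2021-04-01", 1),
        ("B", "2021-01-01", 1),
        ("B", "2021-02-01", None),
        ("B", "2021-03-01", 1),
    ],
)

rows = conn.execute(
    """
    WITH numbered AS (
        -- sequence number per user, ordered by month
        SELECT userid, calmonth, activeflag,
               ROW_NUMBER() OVER (PARTITION BY userid ORDER BY calmonth) AS rn
        FROM active_month
    )
    SELECT userid, calmonth, activeflag,
           -- running MAX keeps the row number of the latest active row so far
           rn - MAX(CASE WHEN activeflag = 1 THEN rn END)
                    OVER (PARTITION BY userid ORDER BY rn) AS months_since_last_active
    FROM numbered
    ORDER BY userid, calmonth
    """
).fetchall()

for row in rows:
    print(row)
```

On an active row the running maximum equals the row's own number, so the difference is 0; on an inactive row it is the distance back to the last active row. A user whose first rows are inactive gets NULL there, which matches the behaviour of the lateral-join version.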

Postgres sequence that resets once the id is different

I am trying to make a Postgres sequence that will reset once the id of the item it is linked to changes, e.g:
ID SEQUENCE_VALUE
1 1
2 1
1 2
1 3
2 2
3 1
I don't know PSQL or SQL in general very well and I can't find a similar question. Any help is greatly appreciated!
Just use a normal sequence that does not reset and calculate the desired value in the query:
SELECT id,
row_number() OVER (PARTITION BY id
ORDER BY seq_col)
AS sequence_value
FROM mytable;
Here, seq_col is a column that is auto-generated from a sequence (an identity column).
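A small sketch of this idea, run against the rows from the question. It assumes SQLite 3.25+ (through Python's sqlite3 module) in place of Postgres, with an AUTOINCREMENT primary key standing in for the identity column; the table and column names mytable, id and seq_col follow the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# seq_col plays the role of the identity column: it only ever increases.
conn.execute("CREATE TABLE mytable (seq_col INTEGER PRIMARY KEY AUTOINCREMENT, id INTEGER)")
conn.executemany("INSERT INTO mytable (id) VALUES (?)", [(1,), (2,), (1,), (1,), (2,), (3,)])

rows = conn.execute(
    """
    SELECT seq_col,
           id,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY seq_col) AS sequence_value
    FROM mytable
    ORDER BY seq_col
    """
).fetchall()

for seq_col, id_, sequence_value in rows:
    print(id_, sequence_value)
```

Listed in insertion order, the (id, sequence_value) pairs match the table in the question. The per-id counter is derived at query time, so nothing has to be reset when a new id appears.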

Query a history table to find state on a given date in postgresql

I have created a history table that is populated by triggers on another "live" table. I now want to be able to see how it looked on a given date. I am able to query a single product using a where clause which gives me the desired output for a single product.
SELECT * FROM test
WHERE productid = 1
AND updated < '2020-02-15'
ORDER BY updated DESC
LIMIT 1
But how do I get the last updated value before my given date (mid-Feb in this example) for each product in the table?
A simple version of my table looks like this:-
productid amount updated
1 5 01/01/2020
1 6 01/02/2020
1 7 01/03/2020
2 13 01/01/2020
2 14 01/02/2020
2 15 01/04/2020
and my desired outcome is:
productid amount updated
1 6 01/02/2020
2 14 01/02/2020
Many thanks
You can use distinct on:
select distinct on (productid) t.*
from test t
where updated < date '2020-02-15'
order by productid, updated desc
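DISTINCT ON is specific to Postgres. As a hedged illustration of what it selects, the sketch below gets the same "latest row per product before the date" result with ROW_NUMBER(), which is a different technique from the query above. It assumes SQLite 3.25+ through Python's sqlite3 module and stores the dates as ISO text so that string comparison matches date order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (productid INTEGER, amount INTEGER, updated TEXT)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?)",
    [
        (1, 5, "2020-01-01"),
        (1, 6, "2020-02-01"),
        (1, 7, "2020-03-01"),
        (2, 13, "2020-01-01"),
        (2, 14, "2020-02-01"),
        (2, 15, "2020-04-01"),
    ],
)

as_of = "2020-02-15"
state = conn.execute(
    """
    SELECT productid, amount, updated
    FROM (
        -- rank each product's rows before the cutoff, newest first
        SELECT t.*,
               ROW_NUMBER() OVER (PARTITION BY productid ORDER BY updated DESC) AS rn
        FROM test t
        WHERE updated < ?
    ) AS ranked
    WHERE rn = 1
    ORDER BY productid
    """,
    (as_of,),
).fetchall()

print(state)  # [(1, 6, '2020-02-01'), (2, 14, '2020-02-01')]
```

In Postgres the DISTINCT ON form is shorter; the ROW_NUMBER() form is the portable equivalent.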

Merge selected group keys in KDB (Q) group by query

I have a query that essentially does counting by group key in KDB, in which I want to treat some of the groups as one for the purpose of this query. A simplified description of what I'm trying to do would be to count orders by customer in a month, where I have a couple of customers in the database that are actually subsidiaries of another customer, and I want to combine the counts of the subsidiaries with their parent organisation. The real scenario is much more complicated than that and, without getting into unnecessary detail, suffice to say that I can't just group by customer and manipulate the results to merge counts after the query is executed - I need the "by" clause of my query to do the merging directly.
In SQL, I would do something like this:
select case when customer_id = 1 then 2 when customer_id = 3 then 4 else customer_id end as customer_id, count(*) as order_count
from orders
group by case when customer_id = 1 then 2 when customer_id = 3 then 4 else customer_id end
In the above example, customer 1 is a subsidiary of customer 2, customer 3 is a subsidiary of customer 4 and every other customer is treated normally
Let's say the equivalent code in Q (without the manipulation of group keys) is:
select order_count:count i by customer_id from orders
How would I put in the equivalent select case statement to manipulate the group key? I tried this, but got a rank error:
select order_count:count i by $[customer_id=1;2;customer_id=3;4;customer_id] from orders
I'm terrible at Q so I'm probably making a very simple mistake. Any advice greatly appreciated.
One approach might be to have a dictionary of subsidiaries and use a lookup/re-map in your by clause:
q)dict:1 3!2 4
q)show t:([] order:1+til 10;customer:1+10?6)
order customer
--------------
1 1
2 1
3 6
4 2
5 3
6 4
7 5
8 5
9 3
10 5
q)select order_count:count i by customer^dict[customer] from t
customer| order_count
--------| -----------
2 | 3
4 | 3
5 | 3
6 | 1
You will lose some information about who actually owns the orders, though: you'll only know the counts at the parent level.
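For readers less familiar with q, the lookup-with-fill in the by clause (customer^dict[customer]) can be read as "use the parent if one is listed, otherwise keep the customer". A rough Python analogue, offered only as an illustration of the logic and using the customer column from the table t above (the name parent_of is hypothetical):

```python
from collections import Counter

# Subsidiary -> parent mapping, same as dict:1 3!2 4 in the q session.
parent_of = {1: 2, 3: 4}

# The customer column of table t from the session above.
customers = [1, 1, 6, 2, 3, 4, 5, 5, 3, 5]

# dict.get(c, c) falls back to the customer itself when no parent is listed,
# which mirrors filling the nulls of dict[customer] with customer.
order_count = Counter(parent_of.get(c, c) for c in customers)

print(sorted(order_count.items()))  # [(2, 3), (4, 3), (5, 3), (6, 1)]
```

The counts agree with the q output: customers 1 and 3 are folded into 2 and 4.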

Postgres Pivot based on variable column to create a new id

I have the following table
type attribute order
1 11 1
1 12 2
2 11 1
2 12 2
3 15 1
3 16 2
4 15 1
4 16 2
I need to understand which types have identical attributes and then assign them a new id. The order column can be used as well if it's helpful, because each attribute can only have one order, but you don't need to use it.
Ideally the result set would be the following where you have a new id for each type that is based on the attributes in the first table.
type new_id
1 1
2 1
3 2
4 2
I was planning on trying to pivot the table based on the order column and concatenating the attribute id's to create a new id, but I cannot use crosstab and the number of attributes a type has could vary and I need to account for that.
Any suggestions on what to do here?
This works, there's possibly a better way to do it but it's what came to mind:
SELECT UNNEST(types) AS type, new_id
FROM (
SELECT ARRAY_AGG(type) AS types, ROW_NUMBER() OVER() AS new_id
FROM (
SELECT type, ARRAY_AGG(attribute ORDER BY attribute) AS attr
FROM t
GROUP BY type
) x
GROUP BY attr
) y
Output:
1;1
2;1
3;2
4;2
So first it gets the list of attributes for each type, then it gets the list of types for each common list of attributes (this is where it makes sure each type shares the same attributes) and assigns a row number to each group of types. Then it unnests that list to put each type on its own row, and the group's row number is the new id.
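The same two-step grouping can be traced outside the database. This is a minimal Python sketch of the logic just described, using the rows from the question; it is an illustration only, and unlike ROW_NUMBER() OVER() it numbers the groups in order of the smallest type, which is an assumption made here to get a deterministic result.

```python
from collections import defaultdict

# (type, attribute) pairs from the question's table; the order column is not needed.
rows = [(1, 11), (1, 12), (2, 11), (2, 12), (3, 15), (3, 16), (4, 15), (4, 16)]

# Step 1: collect the attributes of each type (the inner ARRAY_AGG ... GROUP BY type).
attrs_by_type = defaultdict(list)
for type_, attribute in rows:
    attrs_by_type[type_].append(attribute)

# Step 2: types with the same sorted attribute list share one new id
# (the outer GROUP BY attr plus the row number).
new_id_by_signature = {}
new_id_by_type = {}
for type_ in sorted(attrs_by_type):
    signature = tuple(sorted(attrs_by_type[type_]))
    new_id = new_id_by_signature.setdefault(signature, len(new_id_by_signature) + 1)
    new_id_by_type[type_] = new_id

print(new_id_by_type)  # {1: 1, 2: 1, 3: 2, 4: 2}
```

Sorting the attributes before comparing is what makes the signature independent of row order, the same role ORDER BY attribute plays inside ARRAY_AGG.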