U-SQL: How to select the first value in a column that is different from the current row?

I am struggling with how to write a "multi-row" formula in U-SQL. I have ordered the data by Date, and for each row I want to find the first value of "Port" that is not equal to the current row's value. In a similar way, I want to find the earliest row (by date) with the current port value, to figure out how many days a vessel has been in the port. Keep in mind that it has to be rows with the same port name with no new/other ports in between.
I am loading my data in like this:
#res = SELECT
Port,
Date
FROM #data;
This is how my data is structured:
Port | Date |
Port A | 1/1/2017 |
Port A | 1/1/2017 |
Port A | 1/2/2017 |
Port B | 1/4/2017 |
Port B | 1/4/2017 |
Port B | 1/4/2017 |
Port B | 1/5/2017 |
Port B | 1/6/2017 |
Port C | 1/9/2017 |
Port C | 1/10/2017 |
Port C | 1/11/2017 |
Port A | 1/14/2017 |
Port A | 1/15/2017 |
How I would like the data to be structured:
Port | Date | Time in Port | Previous Port
Port A | 1/1/2017 | 0 | N/A
Port A | 1/1/2017 | 0 | N/A
Port A | 1/2/2017 | 1 | N/A
Port B | 1/4/2017 | 0 | Port A
Port B | 1/4/2017 | 0 | Port A
Port B | 1/4/2017 | 0 | Port A
Port B | 1/5/2017 | 1 | Port A
Port B | 1/6/2017 | 2 | Port A
Port C | 1/9/2017 | 0 | Port B
Port C | 1/10/2017 | 1 | Port B
Port C | 1/11/2017 | 2 | Port B
Port A | 1/14/2017 | 0 | Port C
Port A | 1/15/2017 | 1 | Port C
I am new to U-SQL and so I am having a bit of trouble on how to approach this.
My first instinct would be to use some combination of LEAD()/LAG() and ROW_NUMBER() OVER(PARTITION BY xx ORDER BY Date), but I am unsure of how to get the exact effect I am looking for.
Could anyone point me in the right direction?

You can do what you need with the so-called Ranking and Analytic functions like LAG, DENSE_RANK and the OVER clause, although it's not entirely straightforward. This simple rig worked for your test data; I would suggest testing thoroughly with a larger and more complex dataset.
// Test data
#input = SELECT *
FROM (
VALUES
( "Port A", DateTime.Parse("1/1/2017", new CultureInfo("en-US") ), 0 ),
( "Port A", DateTime.Parse("1/1/2017", new CultureInfo("en-US") ), 0 ),
( "Port A", DateTime.Parse("1/2/2017", new CultureInfo("en-US") ), 1 ),
( "Port B", DateTime.Parse("1/4/2017", new CultureInfo("en-US") ), 0 ),
( "Port B", DateTime.Parse("1/4/2017", new CultureInfo("en-US") ), 0 ),
( "Port B", DateTime.Parse("1/4/2017", new CultureInfo("en-US") ), 0 ),
( "Port B", DateTime.Parse("1/5/2017", new CultureInfo("en-US") ), 1 ),
( "Port B", DateTime.Parse("1/6/2017", new CultureInfo("en-US") ), 2 ),
( "Port C", DateTime.Parse("1/9/2017", new CultureInfo("en-US") ), 0 ),
( "Port C", DateTime.Parse("1/10/2017", new CultureInfo("en-US") ), 1 ),
( "Port C", DateTime.Parse("1/11/2017", new CultureInfo("en-US") ), 2 ),
( "Port A", DateTime.Parse("1/14/2017", new CultureInfo("en-US") ), 0 ),
( "Port A", DateTime.Parse("1/15/2017", new CultureInfo("en-US") ), 1 )
) AS x ( Port, Date, timeInPort );
// Add a group id to the dataset
#working =
SELECT Port,
Date,
timeInPort,
DENSE_RANK() OVER(ORDER BY Date) - DENSE_RANK() OVER(PARTITION BY Port ORDER BY Date) AS groupId
FROM #input;
// Use the group id to work out the datediff with previous row
#working =
SELECT Port,
Date,
timeInPort,
groupId,
Date.Date.Subtract((DateTime)(LAG(Date) OVER(PARTITION BY groupId ORDER BY Date) ?? Date)).TotalDays AS diff // datediff
FROM #working;
// Work out the previous port, based on group id
#ports =
SELECT Port, groupId
FROM #working
GROUP BY Port, groupId;
#ports =
SELECT Port, groupId, LAG(Port) OVER( ORDER BY groupId ) AS previousPort
FROM #ports;
// Prep the final output
#output =
SELECT w.Port,
w.Date.ToString("M/d/yyyy") AS Date,
SUM(w.diff) OVER( PARTITION BY w.groupId ORDER BY w.Date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS timeInPort, // running total of day gaps within the stay
p.previousPort
FROM #working AS w
INNER JOIN
#ports AS p
ON w.Port == p.Port
AND w.groupId == p.groupId;
OUTPUT #output TO "/output/output.csv"
ORDER BY Date, Port
USING Outputters.Csv(quoting:false);
My results matched the desired output above.
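The heart of this approach is the groupId: the difference between a global DENSE_RANK over Date and a per-port DENSE_RANK stays constant within one consecutive visit to a port (the classic gaps-and-islands trick). As a sanity check, here is the same ranking arithmetic sketched in plain Python (not U-SQL, just the logic):

```python
from datetime import date
from collections import defaultdict

rows = [("Port A", date(2017, 1, 1)), ("Port A", date(2017, 1, 1)),
        ("Port A", date(2017, 1, 2)), ("Port B", date(2017, 1, 4)),
        ("Port B", date(2017, 1, 4)), ("Port B", date(2017, 1, 4)),
        ("Port B", date(2017, 1, 5)), ("Port B", date(2017, 1, 6)),
        ("Port C", date(2017, 1, 9)), ("Port C", date(2017, 1, 10)),
        ("Port C", date(2017, 1, 11)), ("Port A", date(2017, 1, 14)),
        ("Port A", date(2017, 1, 15))]

# DENSE_RANK() OVER (ORDER BY Date): rank among all distinct dates
all_rank = {d: i + 1 for i, d in enumerate(sorted({d for _, d in rows}))}

# DENSE_RANK() OVER (PARTITION BY Port ORDER BY Date)
port_dates = defaultdict(set)
for port, d in rows:
    port_dates[port].add(d)
port_rank = {p: {d: i + 1 for i, d in enumerate(sorted(ds))}
             for p, ds in port_dates.items()}

# The difference is constant within one consecutive stay at a port
group_ids = [all_rank[d] - port_rank[p][d] for p, d in rows]
print(group_ids)  # each stay gets its own id: A, B, C, then A again
```

Note that the second visit to Port A gets a different groupId from the first, which is exactly what lets the previous-port lookup work.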

Related

Postgres Hierarchy output

I'm struggling to get the correct output using a hierarchy query.
I have one table which loads, per day, all products and their prices. Over time a product can be cancelled and activated again.
I believe with Oracle we could use CONNECT BY.
WITH RECURSIVE cte AS (
select min(event_date) event_date, item_code,sum(price::numeric)/1024/1024 price, 1 AS level
from rdpidevdat.raid_r_cbs_offer_accttype_map where product_type='cars' and item_code in ('Renault')
group by item_code
UNION ALL
SELECT e.event_date, e.item_code, e.price, cte.level + 1
from (select event_date, item_code,sum(price::numeric)/1024/1024 price
from rdpidevdat.raid_r_cbs_offer_accttype_map where product_type='cars' and item_code in ('9859')
group by event_date,item_code) e join cte ON e.event_date = cte.event_date and e.item_code = cte.item_code
)
SELECT *
FROM cte where item_code in ('Renault') ;
How do I produce an output that shows the date range of each product's price over time?
if we have the data:
EVENT_DATE | ITEM_COD| PRICE
20210910 | Renaut | 2500
20210915 | Renaut | 2500
20210920 | Renaut | 2600
20211020 | Renaut | 2900
20220101 | Renaut | 2500
the expected output should be:
-------------------------------------------------
FROM_EVENT_DATE | TO_EVENT_DATE | ITEM_COD| PRICE
20210910 | 20210915 | Renaut | 2500
20210915 | 20210920 | Renaut | 2600
20210920 | 20211020 | Renaut | 2900
20211020 | 20220101 | Renaut | 2500
Thanks in Advance and Regards!
I already found the solution, using the LAG and LAST_VALUE functions. No need to use the hierarchy one.
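For anyone landing here later, a sketch of the window-function idea: pair each event with the next one (what LEAD does in SQL) to turn point-in-time prices into from/to ranges. This is plain Python over the sample data, and it assumes the expected output above, which carries the price in effect at the end of each interval:

```python
# (event_date, price) rows, already ordered by event_date
rows = [("20210910", 2500), ("20210915", 2500), ("20210920", 2600),
        ("20211020", 2900), ("20220101", 2500)]

# zip each row with its successor -- the same pairing LEAD() produces in SQL
ranges = [(d1, d2, p2) for (d1, _), (d2, p2) in zip(rows, rows[1:])]
for r in ranges:
    print(r)
```

The last input row has no successor, so it only appears as a TO_EVENT_DATE, matching the four-row expected output.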

How to compute frequency/count of concurrent events by combination in postgresql?

I am looking for a way to identify event names that co-occur: i.e., to correlate event names with the same start (startts) and end (endts) times. The events are exactly concurrent (partial overlap is not a feature of this database, which makes this conditional criterion a bit simpler to satisfy).
toy dataframe
+------------------+
|name startts endts|
| A 02:20 02:23 |
| A 02:23 02:25 |
| A 02:27 02:28 |
| B 02:20 02:23 |
| B 02:23 02:25 |
| B 02:25 02:27 |
| C 02:27 02:28 |
| D 02:27 02:28 |
| D 02:28 02:31 |
| E 02:27 02:28 |
| E 02:29 02:31 |
+------------------+
Ideal output:
+---------------------------+
|combination| count |
+---------------------------+
| AB | 2 |
| AC | 1 |
| AE | 1 |
| AD | 1 |
| BC | 0 |
| BD | 0 |
| BE | 0 |
| CE | 0 |
+-----------+---------------+
Naturally, I would have tried a loop but I recognize PostgreSQL is not optimal for this.
What I've tried is generating a temporary table by selecting for distinct name and startts and endts combinations and then doing a left join on the table itself (selecting name).
User @GMB provided the following (modified) solution; however, the performance is not satisfactory given the size of the database (even running the query on a time window of 10 minutes never completes). For context, there are about 300-400 unique names, so about 80,200 combinations (if my math checks out). Order is not important for the combinations.
@GMB's attempt:
I understand this as a self-join, aggregation, and a conditional count of matching intervals:
select t1.name name1, t2.name name2,
sum(case when t1.startts = t2.startts and t1.endts = t2.endts then 1 else 0 end) cnt
from mytable t1
inner join mytable t2 on t2.name > t1.name
group by t1.name, t2.name
order by t1.name, t2.name
Demo on DB Fiddle:
name1 | name2 | cnt
:---- | :---- | --:
A | B | 2
A | C | 1
A | D | 1
A | E | 1
B | C | 0
B | D | 0
B | E | 0
C | D | 1
C | E | 1
D | E | 1
@GMB notes that, if you are looking for a count of overlapping intervals instead, all you have to do is change the sum() to:
sum(t1.startts <= t2.endts and t1.endts >= t2.startts) cnt
(that form works in MySQL; in Postgres you would cast the boolean, e.g. sum((t1.startts <= t2.endts and t1.endts >= t2.startts)::int).)
Version = PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.19097
Thank you.
Consider the following in MySQL (where your DBFiddle points to):
SELECT name, COUNT(*)
FROM (
SELECT group_concat(name ORDER BY name) name
FROM mytable
GROUP BY startts, endts
ORDER BY name
) as names
GROUP BY name
ORDER BY name
Equivalent in PostgreSQL:
SELECT name, COUNT(*)
FROM (
SELECT string_agg(name, '' ORDER BY name) name
FROM mytable
GROUP BY startts, endts
ORDER BY name
) as names
GROUP BY name
ORDER BY name
First, you create a list of concurrent events (in the subquery), and then you count them.
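The same two steps translate directly to plain Python, which is a quick way to check the expected counts before running the query (a sketch over the toy data, not a replacement for the set-based approach):

```python
from collections import Counter, defaultdict

events = [("A", "02:20", "02:23"), ("A", "02:23", "02:25"),
          ("A", "02:27", "02:28"), ("B", "02:20", "02:23"),
          ("B", "02:23", "02:25"), ("B", "02:25", "02:27"),
          ("C", "02:27", "02:28"), ("D", "02:27", "02:28"),
          ("D", "02:28", "02:31"), ("E", "02:27", "02:28"),
          ("E", "02:29", "02:31")]

# Step 1: group names by exactly-equal (startts, endts) -- the inner GROUP BY
by_interval = defaultdict(list)
for name, start, end in events:
    by_interval[(start, end)].append(name)

# Step 2: count each combination string -- the outer GROUP BY / COUNT(*)
combos = Counter("".join(sorted(names)) for names in by_interval.values())
print(combos)
```

Note this counts full combinations (e.g. "ACDE" once), not pairwise counts as in the ideal output; that is also what the aggregation query above produces.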

Calculate won, tie and lost games in postgresql

I have two tables "matches" and "opponents".
Matches
id | date
---+------------
1 | 2016-03-21 21:00:00
2 | 2016-03-22 09:00:00
...
Opponents
(score is null if not played)
id | match_id | team_id | score
---+----------+---------+------------
1 | 1 | 1 | 0
2 | 1 | 2 | 1
3 | 2 | 3 | 1
4 | 2 | 4 | 1
5 | 3 | 1 |
6 | 3 | 2 |
....
The goal is to create the following table
Team | won | tie | lost | total
-----+-----+-----+------+----------
2 | 1 | 0 | 0 | 1
3 | 0 | 1 | 0 | 1
4 | 0 | 1 | 0 | 1
1 | 0 | 0 | 1 | 1
Postgres v9.5
How do I do this? (I'm open to maybe moving the "score" to somewhere else in my model if it makes sense.)
Divide et impera my son
with teams as (
select distinct team_id from opponents
),
teamgames as (
select t.team_id, o.match_id, o.score as team_score, oo.score as opponent_score
from teams t
join opponents o on t.team_id = o.team_id
join opponents oo on (oo.match_id = o.match_id and oo.id != o.id)
),
rankgames as (
select
team_id,
case
when team_score > opponent_score then 1
else 0
end as win,
case
when team_score = opponent_score then 1
else 0
end as tie,
case
when team_score < opponent_score then 1
else 0
end as loss
from teamgames
),
rank as (
select
team_id, sum(win) as win, sum(tie) as tie, sum(loss) as loss,
sum( win * 3 + tie * 1 ) as score
from rankgames
group by team_id
order by score desc
)
select * from rank;
Note 1: You probably don't need the first "with", as you probably already have a table with one record per team.
Note 2: I think you could also achieve the same result with a single query, but this way the steps are clearer.
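The CTE chain can be mirrored in a few lines of Python to verify the expected standings (a sketch over the sample rows; None stands in for the NULL score of an unplayed match):

```python
from collections import defaultdict

# (match_id, team_id, score); score is None if the match was not played
opponents = [(1, 1, 0), (1, 2, 1), (2, 3, 1), (2, 4, 1),
             (3, 1, None), (3, 2, None)]

# "teamgames": pair each team's row with its opponent's row per match
by_match = defaultdict(list)
for match_id, team_id, score in opponents:
    by_match[match_id].append((team_id, score))

# "rankgames" + "rank": classify each played game and sum per team
table = defaultdict(lambda: [0, 0, 0])  # team_id -> [won, tie, lost]
for (t1, s1), (t2, s2) in by_match.values():  # two opponent rows per match
    if s1 is None or s2 is None:
        continue  # unplayed matches contribute nothing
    for (team, s), (_, opp_s) in (((t1, s1), (t2, s2)), ((t2, s2), (t1, s1))):
        if s > opp_s:
            table[team][0] += 1
        elif s == opp_s:
            table[team][1] += 1
        else:
            table[team][2] += 1
print(dict(table))
```

The double self-join on opponents in the SQL corresponds to pairing the two rows per match here, and the three CASE expressions to the if/elif/else.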

How to count the number of joins between a user and several tables?

I'm using postgresql 9.3.9 and have a table users that looks like this:
user_id | email
----------------------------
1001 | hello@world.com
1030 | mel@hotmail.com
2333 | jess@gmail.com
2502 | peter@gmail.com
3000 | olivia@hotmail.com
4000 | sharon@gmail.com
4900 | lisa@gmail.com
I then have several tables that list what users are connected on various platforms and when they connected. Ie platform_a, platform_b, platform_c, etc.
platform_a may look like this:
user_id | created_at
----------------------------
1001 | 2015-04-30
2333 | 2015-05-15
3000 | 2014-02-15
platform_b may look like this:
user_id | created_at
----------------------------
1001 | 2015-06-30
2333 | 2015-07-02
4900 | 2015-07-03
platform_c may look like this:
user_id | created_at
----------------------------
1001 | 2015-08-16
1030 | 2015-07-03
3000 | 2015-09-01
4000 | 2015-09-01
I want the end result to look like this:
user_id | # of connections | latest created_at | connected to a | connected to b | connected to c
--------------------------------------------------------------------------------------------------
1001 | 3 | 2015-08-16 | yes | yes | yes
1030 | 1 | 2015-07-03 | no | no | yes
2333 | 2 | 2015-07-02 | yes | yes | no
2502 | 0 | | no | no | no
3000 | 2 | 2015-09-01 | yes | no | yes
4000 | 1 | 2015-09-01 | no | no | yes
4900 | 1 | 2015-07-03 | no | yes | no
How would I do this?
First, make a union of all your tables:
SELECT user_id, created_at, 1 AS a, 0 AS b, 0 AS c FROM tableA
UNION
SELECT user_id, created_at, 0 AS a, 1 AS b, 0 AS c FROM tableB
UNION
SELECT user_id, created_at, 0 AS a, 0 AS b, 1 AS c FROM tableC
then group the result from this subquery
SELECT user_id, COUNT(user_id), MAX(created_at), MAX(a), MAX(b), MAX(c)
FROM subquery_above
GROUP BY user_id
This won't give you the zero results, but you can achieve that with a LEFT JOIN on the user list.
select
user_id,
count(p),
max(created_at),
coalesce(sum((pl = 'a')::int), 0) connected_to_a,
coalesce(sum((pl = 'b')::int), 0) connected_to_b,
coalesce(sum((pl = 'c')::int), 0) connected_to_c
from users u
left join (
select *, 'a' pl from platform_a
union all
select *, 'b' pl from platform_b
union all
select *, 'c' pl from platform_c
) p
using (user_id)
group by 1;
user_id | count | max | connected_to_a | connected_to_b | connected_to_c
---------+-------+------------+----------------+----------------+----------------
1001 | 3 | 2015-08-16 | 1 | 1 | 1
1030 | 1 | 2015-07-03 | 0 | 0 | 1
2333 | 2 | 2015-07-02 | 1 | 1 | 0
2502 | 0 | | 0 | 0 | 0
3000 | 2 | 2015-09-01 | 1 | 0 | 1
4000 | 1 | 2015-09-01 | 0 | 0 | 1
4900 | 1 | 2015-07-03 | 0 | 1 | 0
(7 rows)
Since you check all users, it's typically fastest to aggregate before you join:
SELECT *
FROM (SELECT user_id FROM users) u -- subquery to clip other columns
LEFT JOIN (
SELECT user_id, count(*) AS connections, max(created_at) AS latest_created_at
, bool_or(pl = 'a') AS connected_to_a
, bool_or(pl = 'b') AS connected_to_b
, bool_or(pl = 'c') AS connected_to_c
FROM ( SELECT user_id, created_at, 'a'::"char" AS pl FROM platform_a
UNION ALL SELECT user_id, created_at, 'b' FROM platform_b
UNION ALL SELECT user_id, created_at, 'c' FROM platform_c
) p1
) p2 USING (user_id)
ORDER BY user_id;
Result is exactly as desired - except that connections is NULL instead of '0' in your example. Use COALESCE() in the outer SELECT if you need to convert that. I didn't, because SELECT * is so convenient.
If you are going to list all columns in the outer SELECT you can as well just use users instead of the subquery u to clip other columns.
bool_or() is the perfect aggregate function for the job.
There might be multiple links to one platform. This query still returns a single row per user.
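The aggregate-before-join shape is easy to mimic in Python as a cross-check (a sketch over the sample data; a dict of dicts stands in for the three platform tables):

```python
users = [1001, 1030, 2333, 2502, 3000, 4000, 4900]
platforms = {  # platform label -> {user_id: created_at}
    "a": {1001: "2015-04-30", 2333: "2015-05-15", 3000: "2014-02-15"},
    "b": {1001: "2015-06-30", 2333: "2015-07-02", 4900: "2015-07-03"},
    "c": {1001: "2015-08-16", 1030: "2015-07-03", 3000: "2015-09-01",
          4000: "2015-09-01"},
}

# Aggregate first (per user), then "left join" onto the full user list
result = {}
for u in users:
    hits = [(pl, tbl[u]) for pl, tbl in platforms.items() if u in tbl]
    connected = dict(hits)  # platform label -> created_at
    result[u] = (
        len(hits),                                  # # of connections
        max((d for _, d in hits), default=None),    # latest created_at
        "a" in connected, "b" in connected, "c" in connected,  # bool_or
    )
print(result[1001], result[2502])
```

The three booleans play the role of bool_or(pl = 'a') etc., and the default=None mirrors the NULL you get for users with no connections.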

Comparing the latest record to previous record in postgresql

I have a table in PostgreSQL DB like this:
Client | Rate | StartDate|EndDate
A | 1000 | 2005-1-1 |2005-12-31
A | 2000 | 2006-1-1 |2006-12-31
A | 3000 | 2007-1-1 |2007-12-31
B | 5000 | 2006-1-1 |2006-12-31
B | 8000 | 2008-1-1 |2008-12-31
C | 2000 | 2006-1-1 |2006-12-31
I want to get the latest change per client, like this table. How?
Client | Rate | StartDate|EndDate |Pre Rate | Pre StartDate |Pre EndDate
A | 3000 | 2007-1-1 |2007-12-31 | 2000 | 2006-1-1 |2006-12-31
B | 8000 | 2008-1-1 |2008-12-31 | 5000 | 2006-1-1 |2006-12-31
C | 2000 | 2006-1-1 |2006-12-31
SELECT DISTINCT ON (Client) Client,
Rate,
StartDate,
EndDate,
LAG(Rate) OVER (PARTITION BY Client
ORDER BY StartDate) AS "Pre Rate",
LAG(StartDate) OVER (PARTITION BY Client
ORDER BY StartDate) AS "Pre StartDate",
LAG(EndDate) OVER (PARTITION BY Client
ORDER BY StartDate) AS "Pre EndDate"
FROM ClientRates
ORDER BY Client,
StartDate DESC;
I can't help thinking there's a simpler way to express this.
with current_start_dates as (
select client, max(startdate) cur_startdate
from client_rates
group by client
),
extended_client_rates as (
select client, rate, startdate, enddate,
lag(rate, 1) over (partition by client order by startdate) prev_rate,
lag(startdate,1) over (partition by client order by startdate) prev_startdate,
lag(enddate,1) over (partition by client order by startdate) prev_enddate
from client_rates
)
select cr.*
from extended_client_rates cr
inner join current_start_dates csd on csd.client = cr.client
and csd.cur_startdate = cr.startdate
Essentially the same as Tim's answer (+1), with some polishing and a full script for trying/checking:
CREATE TEMP TABLE client_rates (client VARCHAR, rate INTEGER,
start_date DATE, end_date DATE);
INSERT INTO client_rates VALUES ('A',1000,'2005-1-1','2005-12-31');
INSERT INTO client_rates VALUES ('A',2000,'2006-1-1','2006-12-31');
INSERT INTO client_rates VALUES ('A',3000,'2007-1-1','2007-12-31');
INSERT INTO client_rates VALUES ('B',5000,'2006-1-1','2006-12-31');
INSERT INTO client_rates VALUES ('B',8000,'2008-1-1','2008-12-31');
INSERT INTO client_rates VALUES ('C',2000,'2006-1-1','2006-12-31');
SELECT DISTINCT ON (client) * FROM
(
SELECT client, rate, start_date, end_date,
lag(rate) OVER w1 AS prev_rate,
lag(start_date) OVER w1 AS prev_start_date,
lag(end_date) OVER w1 AS prev_end_date
FROM client_rates
WINDOW w1 AS (PARTITION BY client ORDER BY start_date)
ORDER BY client,start_date desc
) AS foo;
client | rate | start_date | end_date | prev_rate | prev_start_date | prev_end_date
--------+------+------------+------------+-----------+-----------------+---------------
A | 3000 | 2007-01-01 | 2007-12-31 | 2000 | 2006-01-01 | 2006-12-31
B | 8000 | 2008-01-01 | 2008-12-31 | 5000 | 2006-01-01 | 2006-12-31
C | 2000 | 2006-01-01 | 2006-12-31 | | |
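For a quick sanity check of the DISTINCT ON + LAG combination, the "previous row per client, keep only the latest" semantics can be sketched in plain Python (not a substitute for the SQL, just the window logic):

```python
from itertools import groupby

rows = [("A", 1000, "2005-01-01", "2005-12-31"),
        ("A", 2000, "2006-01-01", "2006-12-31"),
        ("A", 3000, "2007-01-01", "2007-12-31"),
        ("B", 5000, "2006-01-01", "2006-12-31"),
        ("B", 8000, "2008-01-01", "2008-12-31"),
        ("C", 2000, "2006-01-01", "2006-12-31")]

result = {}
# PARTITION BY client ORDER BY start_date; DISTINCT ON keeps the latest
# row per client, and LAG supplies the row just before it (None if absent)
for client, grp in groupby(sorted(rows, key=lambda r: (r[0], r[2])),
                           key=lambda r: r[0]):
    grp = list(grp)
    latest, prev = grp[-1], (grp[-2] if len(grp) > 1 else None)
    result[client] = (latest[1:], prev[1:] if prev else None)
print(result["A"])
print(result["C"])
```

Client C has a single row, so its "previous" columns come out empty, matching the last row of the result table above.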