T-SQL Need assistance with complex join - tsql

I am really out of ideas on how to solve this issue and need some assistance - not only a solution but an idea of how to approach it would be welcome.
I have the following table:
TABLE Data
(
RecordID
,DateAdd
,Status
)
with sample data like this:
11 2012-10-01 OK
11 2012-10-04 NO
11 2012-11-05 NO
22 2012-10-01 OK
33 2012-11-01 NO
33 2012-11-15 OK
And this table with the following example data:
TABLE Periods
(
PeriodID
,PeriodName
,DateStart
,DateEnd
)
1 October 2012-10-01 2012-10-31
2 November 2012-11-01 2012-11-30
What I need to do, is to populate a new table:
TABLE DataPerPeriods
(
PeriodID,
RecordID,
Status
)
That will store all possible combinations of PeriodID and RecordID and the latest status for the period, if available. If no status is available for a given period, then the status from the most recent previous period. If there is no previous status at all - then NULL for status.
For example with the following data I need something like this:
1 11 NO //We have status "OK" and "NO", but "NO" is latest for the period
1 22 OK
1 33 NULL //Because there are no records for this or any previous period
2 11 NO //We get the previous status as there are no records in this period
2 22 OK //There are no records for this period, but a record from the last period is available
2 33 OK //We have status "OK" and "NO", but "OK" is latest for the period
EDIT: I have already populated the period IDs and record IDs in the last table; I need more help with the status update.

There might be a better way to do this. But this is the most straightforward path I know to get what you're looking for, unconventional as it appears. For larger datasets you may have to change your approach:
SELECT p.PeriodID, td.RecordID, statusData.[Status] FROM Periods p
CROSS JOIN (SELECT DISTINCT RecordID FROM Data) td
OUTER APPLY (SELECT TOP 1 [Status], [DateAdd]
FROM Data
WHERE [DateAdd] <= p.DateEnd
AND [RecordID] = td.RecordID
ORDER BY [DateAdd] DESC) statusData
ORDER BY p.PeriodID, td.RecordID
The CROSS JOIN is what gives you every possible combination of the distinct RecordIDs and the Periods.
The OUTER APPLY selects the latest Status before the end of each Period.
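To sanity-check the logic with the question's sample data, here is a minimal sketch in Python using SQLite. SQLite has no OUTER APPLY, so a correlated scalar subquery plays its role; everything else (table names, columns, data) follows the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Data (RecordID INT, DateAdd TEXT, Status TEXT);
CREATE TABLE Periods (PeriodID INT, PeriodName TEXT, DateStart TEXT, DateEnd TEXT);
INSERT INTO Data VALUES
 (11,'2012-10-01','OK'),(11,'2012-10-04','NO'),(11,'2012-11-05','NO'),
 (22,'2012-10-01','OK'),(33,'2012-11-01','NO'),(33,'2012-11-15','OK');
INSERT INTO Periods VALUES
 (1,'October','2012-10-01','2012-10-31'),
 (2,'November','2012-11-01','2012-11-30');
""")

# Latest status on or before each period's end, per record; NULL (None)
# when a record has no rows up to that date.
rows = conn.execute("""
SELECT p.PeriodID, td.RecordID,
       (SELECT d.Status FROM Data d
        WHERE d.RecordID = td.RecordID AND d.DateAdd <= p.DateEnd
        ORDER BY d.DateAdd DESC LIMIT 1) AS Status
FROM Periods p
CROSS JOIN (SELECT DISTINCT RecordID FROM Data) td
ORDER BY p.PeriodID, td.RecordID
""").fetchall()
```

This reproduces the expected output from the question, including `(1, 33, None)` because record 33 has no rows on or before 2012-10-31.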

Check out my answer on another question to see how to grab the first or last status: Aggregate SQL Function to grab only the first from each group

OK, here's an idea. Nobody likes cursors, including me, but sometimes for things like this they do come in handy.
The idea is that this cursor loops through each of the Data records, pulling out the ID as an identifier. Inside the loop it finds a single data record and gets the count of a join that meets your criteria.
If @Count = 0, the condition is not met and you should not insert a record for that period.
If @Count > 0, the condition is met, so insert a record for the period.
If these conditions need to be refreshed frequently, you can attach your query to a job and run it every minute or hour ... what have you.
Hope this helps.
DECLARE @ID int
DECLARE @Count int
DECLARE merge_cursor CURSOR FAST_FORWARD FOR
select recordID
from data
OPEN merge_cursor
FETCH NEXT FROM merge_cursor INTO @ID
WHILE @@FETCH_STATUS = 0
BEGIN
--get join if record is found in the periods
select @Count = count(*)
from data a inner join periods b
on a.[dateadd] between b.datestart and b.dateend
where a.recordID = @ID
if @Count > 0
--insert into DataPerPeriods(PeriodID, RecordID, Status)
select b.periodid, a.recordid, a.status
from data a inner join periods b on a.[dateadd] between b.datestart and b.dateend --between beginning of month and end of month
where a.recordid = @ID
else
--insert into DataPerPeriods(PeriodID, RecordID, Status)
select b.periodid, a.recordid, a.status
from data a inner join periods b on a.[dateadd] < b.dateend
where a.recordID = @ID --fix this area
FETCH NEXT FROM merge_cursor INTO @ID
END
CLOSE merge_cursor
DEALLOCATE merge_cursor

Related

Postgres : Need distinct records count

I have a table with duplicate entries, and the objective is to get the distinct entries based on the latest timestamp.
In my case 'serial_no' will have duplicate entries, but I select unique entries based on the latest timestamp.
The query below gives me the unique results with the latest timestamp.
But my concern is that I also need the total count of unique entries.
For example, assume my table has 40 entries overall. With the query below I am able to get 20 unique rows based on the serial number.
But 'total' is returned as 40 instead of 20.
Any help on this please?
SELECT
*
FROM
(
SELECT
DISTINCT ON (serial_no) id,
serial_no,
name,
timestamp,
COUNT(*) OVER() as total
FROM
product_info
INNER JOIN my.account ON id = accountid
WHERE
lower(name) = 'hello'
ORDER BY
serial_no,
timestamp DESC OFFSET 0
LIMIT
10
) AS my_info
ORDER BY
serial_no asc
The product_info table initially has this data:
serial_no name timestamp
11212 pulp12 2018-06-01 20:00:01
11213 mango 2018-06-01 17:00:01
11214 grapes 2018-06-02 04:00:01
11215 orange 2018-06-02 07:05:30
11212 pulp12 2018-06-03 14:00:01
11213 mango 2018-06-03 13:00:00
After the distinct query I got all unique results based on the latest timestamp:
serial_no name timestamp total
11212 pulp12 2018-06-03 14:00:01 6
11213 mango 2018-06-03 13:00:00 6
11214 grapes 2018-06-02 04:00:01 6
11215 orange 2018-06-02 07:05:30 6
But total is appearing as 6. I wanted the total to be 4 since there are only 4 unique entries.
I am not sure how to modify my existing query to get the desired result.
Postgres supports COUNT(DISTINCT column_name), so if I have understood your request, using that instead of COUNT(*) will work, and you can drop the OVER.
What you could do is move the window function to a higher-level select statement. This is because the window function is evaluated before the DISTINCT ON and LIMIT clauses are applied. Also, you cannot include the DISTINCT keyword within window functions - it has not been implemented yet (as of Postgres 9.6).
SELECT
*,
COUNT(*) OVER() as total -- here
FROM
(
SELECT
DISTINCT ON (serial_no) id,
serial_no,
name,
timestamp
FROM
product_info
INNER JOIN my.account ON id = accountid
WHERE
lower(name) = 'hello'
ORDER BY
serial_no,
timestamp DESC
LIMIT
10
) AS my_info
Additionally, the OFFSET is not required there, and the extra outer sort is superfluous. I've removed both.
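The before/after behaviour is easy to verify outside Postgres. Here is a minimal sketch in Python with SQLite (3.25+ for window functions), emulating `DISTINCT ON (serial_no) ... ORDER BY serial_no, timestamp DESC` with `ROW_NUMBER()`; the account join is dropped and the column is renamed `ts`, both assumptions for brevity:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_info (serial_no INT, name TEXT, ts TEXT);
INSERT INTO product_info VALUES
 (11212,'pulp12','2018-06-01 20:00:01'),
 (11213,'mango','2018-06-01 17:00:01'),
 (11214,'grapes','2018-06-02 04:00:01'),
 (11215,'orange','2018-06-02 07:05:30'),
 (11212,'pulp12','2018-06-03 14:00:01'),
 (11213,'mango','2018-06-03 13:00:00');
""")

# COUNT(*) OVER () sits in the OUTER query, so it counts rows AFTER the
# dedup (rn = 1) filter - giving 4, not 6.
rows = conn.execute("""
SELECT serial_no, name, ts, COUNT(*) OVER () AS total
FROM (
  SELECT serial_no, name, ts,
         ROW_NUMBER() OVER (PARTITION BY serial_no ORDER BY ts DESC) AS rn
  FROM product_info
) dedup
WHERE rn = 1
ORDER BY serial_no
""").fetchall()
```

Moving the same window expression into the inner query would count all six base rows, which is exactly the problem described in the question.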
Another way would be to include a computed column in the select clause, but this would not be as fast, since it requires one more scan of the table. This obviously assumes that your total is strictly connected to your result set, and not to rows stored in the table that get filtered out.
select count(*), serial_no from product_info group by serial_no
will give you the number of duplicates for each serial number
The most mindless way of incorporating that information would be to join in a subquery:
SELECT
*
FROM
(
SELECT
DISTINCT ON (serial_no) id,
serial_no,
name,
timestamp,
COUNT(*) OVER() as total
FROM
product_info
INNER JOIN my.account ON id = accountid
WHERE
lower(name) = 'hello'
ORDER BY
serial_no,
timestamp DESC OFFSET 0
LIMIT
10
) AS my_info
join (select count(*) as counts, serial_no from product_info group by serial_no) as X
on X.serial_no = my_info.serial_no
ORDER BY
serial_no asc

Looping SQL query - PostgreSQL

I'm trying to get a query to loop through a set of pre-defined integers:
I've made the query very simple for this question.. This is pseudo code as well obviously!
my_id = 0
WHILE my_id < 10
SELECT * from table where id = :my_id`
my_id += 1
END
I know that for this query I could just do something like where id < 10.. But the actual query I'm performing is about 60 lines long, with quite a few window statements all referring to the variable in question.
It works, and gets me the results I want when I have the variable set to a single figure.. I just need to be able to re-run the query 10 times with different variables hopefully ending up with one single set of results.
So far I have this:
CREATE OR REPLACE FUNCTION stay_prices ( a_product_id int ) RETURNS TABLE (
pid int,
pp_price int
) AS $$
DECLARE
nights int;
nights_arr INT[] := ARRAY[1,2,3,4];
j int;
BEGIN
j := 1;
FOREACH nights IN ARRAY nights_arr LOOP
-- query here..
END LOOP;
RETURN;
END;
$$ LANGUAGE plpgsql;
But I'm getting this back:
ERROR: query has no destination for result data
HINT: If you want to discard the results of a SELECT, use PERFORM instead.
So do I need to get my query to SELECT ... INTO the returning table somehow? Or is there something else I can do?
EDIT: this is an example of the actual query I'm running:
\x auto
\set nights 7
WITH x AS (
SELECT
product_id, night,
LAG(night, (:nights - 1)) OVER (
PARTITION BY product_id
ORDER BY night
) AS night_start,
SUM(price_pp_gbp) OVER (
PARTITION BY product_id
ORDER BY night
ROWS BETWEEN (:nights - 1) PRECEDING
AND CURRENT ROW
) AS pp_price,
MIN(spaces_available) OVER (
PARTITION BY product_id
ORDER BY night
ROWS BETWEEN (:nights - 1) PRECEDING
AND CURRENT ROW
) AS min_spaces_available,
MIN(period_date_from) OVER (
PARTITION BY product_id
ORDER BY night
ROWS BETWEEN (:nights - 1) PRECEDING
AND CURRENT ROW
) AS min_period_date_from,
MAX(period_date_to) OVER (
PARTITION BY product_id
ORDER BY night
ROWS BETWEEN (:nights - 1) PRECEDING
AND CURRENT ROW
) AS max_period_date_to
FROM products_nightlypriceperiod pnpp
WHERE
spaces_available >= 1
AND min_group_size <= 1
AND night >= '2016-01-01'::date
AND night <= '2017-01-01'::date
)
SELECT
product_id as pid,
CASE WHEN x.pp_price > 0 THEN x.pp_price::int ELSE null END as pp_price,
night_start as from_date,
night as to_date,
(night-night_start)+1 as duration,
min_spaces_available as spaces
FROM x
WHERE
night_start = night - (:nights - 1)
AND min_period_date_from = night_start
AND max_period_date_to = night;
That will get me all the 7-night periods available for all my products in 2016, along with the price for the period and the max number of spaces I could fill in that period.
I'd like to be able to run this query to get all the periods available between 2 and 30 days for all my products.
This is likely to produce a table with millions of rows. The plan is to re-create this table periodically to enable a very quick look up of what's available for a particular date. The products_nightlypriceperiod represents a night of availability of a product - e.g. Product X has 3 spaces left for Jan 1st 2016, and costs £100 for the night.
Why use a loop? You can do something like this (using your first query):
with params as (
select generate_series(1, 10) as id
)
select t.*
from params cross join
table t
where t.id = params.id;
You can modify params to have the values you really want. Then just use cross join and let the database "do the looping."
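To see the cross-join idea work end to end, here is a small sketch in Python with SQLite. SQLite has no `generate_series`, so a recursive CTE builds the 0-9 series instead; the table `t` and its contents are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, val TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(i, f"row{i}") for i in range(15)])

# Build the parameter set in SQL and let the database "do the looping":
# only ids 0..9 survive the join, matching the pseudo-code's WHILE loop.
rows = conn.execute("""
WITH RECURSIVE params(id) AS (
    SELECT 0
    UNION ALL
    SELECT id + 1 FROM params WHERE id < 9
)
SELECT t.*
FROM params
JOIN t ON t.id = params.id
ORDER BY t.id
""").fetchall()
```

The same shape carries over to the full 60-line query: replace the literal series with whatever parameter values you need, and reference `params.id` wherever the psql variable appeared.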

'View' (NOT DELETE) Duplicate Rows from a Postgresql table obtained from joins

So I have a temp table I created by joining three tables:
Trips
Stops
Stop_times
The Stop_times table has a list of trip_ids, the corresponding stops and the scheduled arrival and departure times of buses at those stops.
I searched online and everywhere I seem to find answers for how to delete duplicates (using ctid, nested queries) but not view them.
My query looks something like this :
CREATE TEMP TABLE temp as
SELECT
(CASE st.arrival_time < current_timestamp::time
WHEN true THEN (current_timestamp::date + interval '1 day') + st.arrival_time
ELSE (current_timestamp::date) + st.arrival_time
END) as arrival,
CASE st.departure_time < current_timestamp::time
WHEN true THEN (current_timestamp::date + interval '1 day') + st.departure_time
ELSE (current_timestamp::date) + st.departure_time
END as departure, st.trip_id, st.stop_id, st.stop_headsign,route_id, t.trip_headsign, s.stop_code, s.stop_name, s.stop_lat, s.stop_lon
FROM schema.stop_times st
JOIN schema.trips t ON t.trip_id=st.trip_id
JOIN schema.stops s ON s.stop_id=st.stop_id
order by arrival, departure;
I know that there are duplicates (by running the select * and select DISTINCT on temp), I just need to identify the duplicates...any help will be appreciated!
PS: I know I can use DISTINCT to get rid of duplicates, but it slows the query down a lot, so I need to rework the query - and for that I need to identify the duplicates. The result set is over 200,000 records, so exporting them to Excel and filtering duplicates is not an option either (I tried, but Excel can't handle it).
I believe this will give you what you want:
SELECT arrival, departure, trip_id, stop_id, stop_headsign, route_id,
trip_headsign, stop_code, stop_name, stop_lat, stop_lon, count(*)
FROM temp
GROUP BY arrival, departure, trip_id, stop_id, stop_headsign, route_id,
trip_headsign, stop_code, stop_name, stop_lat, stop_lon
HAVING count(*) > 1;
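To confirm that `GROUP BY ... HAVING count(*) > 1` surfaces exactly the repeated rows (with their multiplicity) without deleting anything, here is a trimmed sketch in Python with SQLite, using a hypothetical two-column stand-in for the temp table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE temp_trips (trip_id INT, stop_id INT)")
conn.executemany("INSERT INTO temp_trips VALUES (?, ?)",
                 [(1, 10), (1, 10), (2, 20), (3, 30), (3, 30), (3, 30)])

# Rows that appear more than once, plus how many times they appear.
dupes = conn.execute("""
SELECT trip_id, stop_id, COUNT(*) AS n
FROM temp_trips
GROUP BY trip_id, stop_id
HAVING COUNT(*) > 1
ORDER BY trip_id
""").fetchall()
```

The real query just extends the GROUP BY list to every selected column, so two rows only count as duplicates when they match on all of them.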

TSQL get the Prev and Next ID on a list

Let's say I have a table Sales
SaleID int
UserID int
Field1 varchar(10)
Created Datetime
and right now I have loaded and viewing the record with SaleID = 23
What's the right way to find out, using a stored procedure, what's the PREVIOUS and NEXT SalesID value off the current SaleID = 23, that belongs to me (UserID = 1)?
I could do a
SELECT TOP 1 *
FROM Sales
WHERE SaleID > 23 AND UserID = 1
and the same for SaleID < 23 but that's 2 SQL calls.
Is there a better way?
I'm using the SQL Server 2012.
You can get the previous/next SaleID (or any other field) by using the LAG() and LEAD() functions introduced in SQL Server 2012.
For example:
SELECT *,
LAG(SaleID) OVER (PARTITION BY UserID ORDER BY SaleID) Prev,
LEAD(SaleID) OVER (PARTITION BY UserID ORDER BY SaleID) Next
FROM Sales S
SqlFiddle
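The same window-function behaviour can be verified outside SQL Server; here is a minimal sketch in Python with SQLite (3.25+, which also provides LAG()/LEAD()), using made-up sales rows where SaleIDs 21, 23, 24 belong to user 1 and 22, 25 to user 2:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (SaleID INT, UserID INT)")
conn.executemany("INSERT INTO Sales VALUES (?, ?)",
                 [(21, 1), (22, 2), (23, 1), (24, 1), (25, 2)])

rows = conn.execute("""
SELECT SaleID, UserID,
       LAG(SaleID)  OVER (PARTITION BY UserID ORDER BY SaleID) AS Prev,
       LEAD(SaleID) OVER (PARTITION BY UserID ORDER BY SaleID) AS Next
FROM Sales
ORDER BY SaleID
""").fetchall()

# For SaleID = 23 (UserID = 1) the neighbours skip user 2's sales,
# because PARTITION BY UserID restricts LAG/LEAD to that user's rows.
current = next(r for r in rows if r[0] == 23)
```

Both directions come back in a single pass over the table, which is the advantage over issuing two separate TOP 1 queries.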
If you omit the PARTITION BY clause in the LAG() or LEAD() functions in thepirat000's answer, you get the previous or next record according to the SaleID column across the whole table.
Here is the SQL query
SELECT *,
LAG(SaleID) OVER (ORDER BY SaleID) Prev,
LEAD(SaleID) OVER (ORDER BY SaleID) Next
FROM Sales S
The PARTITION BY clause enables you to apply these functions within groupings based on UserID, as in thepirat000's code.
If you want the next and previous records for only a single row, or at least for a small set of rows, the following query can also help in terms of performance (as an answer to Eager to Learn's comment):
select
(select top 1 t.SaleID from Sales t where t.SaleID < tab1.SaleID order by t.SaleID desc) as prev_id,
SaleID as current_id,
(select top 1 t.SaleID from Sales t where t.SaleID > tab1.SaleID order by t.SaleID asc) as next_id
from Sales tab1 where SaleID = 2

Optimize recursive query using exclusion list

I'm trying to optimize a recursive query for speed. The full query runs for 15 minutes.
The part I'm trying to optimize takes ~3.5min to execute, and the same logic is used twice in the query.
Description:
Table Ret contains over 300K rows with 30 columns (Daily snapshot)
Table Ret_Wh is the warehouse for Ret with over 5 million rows (snapshot history, 90 days)
datadate - the day the info was recorded (like 10-01-2012)
statusA - a status like (Red, Blue) that an account can have.
statusB - a different status like (Large, Small) that an account can have.
Statuses can change day to day.
old - an integer age on the account. Age can be increased/decreased if there is a payment on the account; otherwise it increases by 1 each day.
account - the account number, and primary key of a row.
In Ret the account is unique.
In Ret_Wh, account is unique per datadate.
money - dollars in the account
Both Ret and Ret_Wh have the columns listed above
Query Goal: Select all accounts from Ret_Wh that had an age in a certain range at ANY time during the month, and had a specific status while in that range.
Then select from those results, matching accounts in Ret, with a specific age "today", no matter their status.
My Goal: Do this in a way that doesn't take 3.5 minutes
Pseudo_Code:
@sdt = '2012-10-01' -- or the beginning of any month
@dt = getdate()
create table #temp (account char(20))
create table #result (account char(20), money money)
while @sdt < @dt
BEGIN
insert into #temp
select
A.account
from Ret_Wh as A
where a.datadate = @sdt
and a.statusA = 'Red'
and a.statusB = 'Large'
and a.old between 61 and 80
set @sdt = dateadd(day, 1, @sdt)
END
------
select distinct
b.account
,b.money
into #result
from #temp as A
join (Select account, money from Ret where old = 81) as B
on A.account=B.account
I want to create a distinct list of accounts in Ret_Wh (call it #shrinking_list). Then, in the while loop, I join Ret_Wh to #shrinking_list. At the end of each iteration, I delete one account from #shrinking_list. Then the while loop iterates with a smaller list joined to Ret_Wh, thereby speeding up the query as @sdt increases by 1 day. However, I don't know how to pass the exact account number selected to an external variable in the while loop, so that I can delete it from #shrinking_list.
Any ideas on that, or how to speed this up in general?
Why are you using a loop to step through the dates from @sdt to @dt one at a time?
select distinct b.account, b.money
from Ret as B
join Ret_Wh as A
on A.account = B.account
and a.datadate >= @sdt
and a.datadate < @dt
and a.statusA = 'Red'
and a.statusB = 'Large'
and a.old between 61 and 80
where b.old = 81
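As a shape check of that set-based rewrite (one pass, no loop), here is a hedged sketch in Python with SQLite, with toy data and literal dates standing in for @sdt/@dt; the account values and amounts are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Ret    (account TEXT, old INT, money REAL);
CREATE TABLE Ret_Wh (account TEXT, datadate TEXT, statusA TEXT,
                     statusB TEXT, old INT);
INSERT INTO Ret VALUES ('A1', 81, 500.0), ('A2', 81, 750.0), ('A3', 50, 100.0);
INSERT INTO Ret_Wh VALUES
 ('A1', '2012-10-05', 'Red',  'Large', 70),  -- qualifies during the month
 ('A2', '2012-10-10', 'Blue', 'Large', 70),  -- wrong statusA
 ('A3', '2012-10-12', 'Red',  'Large', 75);  -- old is 50 "today", filtered by Ret
""")

# One range-joined query replaces the whole day-by-day WHILE loop.
rows = conn.execute("""
SELECT DISTINCT b.account, b.money
FROM Ret b
JOIN Ret_Wh a
  ON a.account = b.account
 AND a.datadate >= '2012-10-01'
 AND a.datadate <  '2012-11-01'
 AND a.statusA = 'Red'
 AND a.statusB = 'Large'
 AND a.old BETWEEN 61 AND 80
WHERE b.old = 81
""").fetchall()
```

Only 'A1' survives: it is the one account that both met the status/age criteria at some point in the month and has old = 81 in today's snapshot.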