SQL Query IF equal 2 or 3 - tsql

What is the best way to write a query that returns any rows where ReviewType is 2 and 3 for the same ElementID in the dataset below:
ReviewType  fiscalYear  ElementID  Cycle  dateCompleted  sNumber
2           1819        SI-063     2016   2018-09-24     128221
3           1819        SI-063     2016   2018-09-24     128221
2           1819        SI-065     2016   2018-09-24     128221
3           1819        SI-065     2016   2018-09-24     128221
2           1819        SI-066     2016   2018-09-25     128221
3           1819        SI-066     2016   2018-09-25     128221
This is what I have so far, but it is not working:
SELECT Distinct ElementID
WHERE ElementID Like '#element#'
AND ReviewType IN ('2','3')
ORDER BY ElementID
FROM [DATABASE]

This sounds like homework. If so, it's polite to let us know.
Instead of DISTINCT ElementID, I think you'll want GROUP BY ElementID with a HAVING COUNT(*) > 1 clause.
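A minimal sketch of that idea (the table name YourTable is a placeholder, since the post never names the table, and ReviewType is assumed numeric):

SELECT ElementID
FROM YourTable
WHERE ReviewType IN (2, 3)
GROUP BY ElementID
HAVING COUNT(DISTINCT ReviewType) = 2  -- both a 2 and a 3 exist for this ElementID
ORDER BY ElementID;

Counting distinct ReviewType values, rather than COUNT(*) > 1, guards against an ElementID that merely has two rows of the same type.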

I don't really understand your question, but your SQL syntax is incorrect.
It should be like this:
SELECT
Distinct ElementID
FROM [DATABASE]
WHERE ElementID Like '%St%'
AND ReviewType IN ('2','3')
ORDER BY ElementID
Please elaborate on what outcome/output you want from this data.

Related

PostgreSQL: Fill values for null rows based on rows for which we do have values

I have the following table:
country year rank
1 Austria 2019 1
2 Austria 2018 NA
3 Austria 2017 NA
4 Austria 2016 NA
5 Spain 2019 2
6 Spain 2018 NA
7 Spain 2017 NA
8 Spain 2016 NA
9 Belgium 2019 3
10 Belgium 2018 NA
11 Belgium 2017 NA
12 Belgium 2016 NA
I want to fill in the NA values for 2018, 2017 and 2016 for each country with the value for 2019 (which we have).
I want the output table to look like this:
country year rank
1 Austria 2019 1
2 Austria 2018 1
3 Austria 2017 1
4 Austria 2016 1
5 Spain 2019 2
6 Spain 2018 2
7 Spain 2017 2
8 Spain 2016 2
9 Belgium 2019 3
10 Belgium 2018 3
11 Belgium 2017 3
12 Belgium 2016 3
I do not know where to get started with this question. I typically work with R but am now working on a platform that uses PostgreSQL. I could do this in R, but thought it would be worthwhile to figure out how it is done with Postgres.
Any help with this would be greatly appreciated. Thank you.
Use an UPDATE with a join to find the non-NULL rank value for each country:
UPDATE yourTable AS t1
SET "rank" = t2.max_rank
FROM
(
    SELECT country, MAX("rank") AS max_rank
    FROM yourTable
    GROUP BY country
) t2
WHERE t2.country = t1.country;
-- AND year IN (2016, 2017, 2018)
Add the commented-out portion of the WHERE clause if you really only want to target certain years (your example seems to imply that you want to backfill all missing data).
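As printed, the commented line sits after the terminating semicolon, so enabling it verbatim would not compile; if you do want the year filter, it belongs inside the statement, roughly like this (same yourTable assumptions as above):

UPDATE yourTable AS t1
SET "rank" = t2.max_rank
FROM
(
    SELECT country, MAX("rank") AS max_rank
    FROM yourTable
    GROUP BY country
) t2
WHERE t2.country = t1.country
  AND t1.year IN (2016, 2017, 2018);  -- restrict the backfill to these years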
If you just want to view your data in the format of the output, then use MAX as an analytic function:
SELECT country, year, MAX("rank") OVER (PARTITION BY country) AS "rank"
FROM yourTable
ORDER BY country, year DESC;
If you just want the output, then try this:
with cte as (
    select distinct on (country) * from test
    order by country, year desc
)
select t1.id, t1.country, t1.year, t2.rank
from test t1
left join cte t2 on t1.country = t2.country
If you want to update your table, then try this:
with cte as (
    select distinct on (country) * from test
    order by country, year desc
)
update test set rank = cte.rank from cte
where test.country = cte.country
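For completeness, the same fill-down view can also be expressed with FIRST_VALUE instead of MAX (a minimal sketch, not from either answer above, assuming the same test table where the latest year always holds the non-NULL rank):

SELECT id, country, year,
       FIRST_VALUE(rank) OVER (PARTITION BY country ORDER BY year DESC) AS rank
FROM test
ORDER BY country, year DESC;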

Getting a ranking based on a number from a CTE

I have a complex situation in PostgreSQL 11 where I need to generate a numbering based on a single figure which I get from a CTE.
Below is the CTE
WITH pending_orders_to_be_processed_details
AS
(
    SELECT ROW_NUMBER() OVER (ORDER BY so.create_date) AS queue_no,
           name, so.create_date::TIMESTAMP
    FROM picking sp
    LEFT JOIN "order" so ON so.name = sp.origin
    WHERE sp.state IN ('assigned', 'confirmed')
)
, orders_which_can_be_processed_today AS
(
    -- This CTE gives me a count of orders and its hourly average;
    -- let's say the count is 400 and the hourly average is 3
)
Now I need to number the details according to the hourly average. That means the first 3 orders need to be ranked 1, the next 3 ranked 2, and so on, so that I am able to identify which orders can be processed based on this ranking.
The input will be
name  queue_number  create_date
so1 1 2021-03-11 12:00:00
so2 2 2021-03-11 13:00:00
so3 3 2021-03-11 14:00:00
so4 4 2021-03-11 15:00:00
so5 5 2021-03-11 16:00:00
so6 6 2021-03-11 17:00:00
so7 7 2021-03-11 18:00:00
so8 8 2021-03-11 19:00:00
so9 9 2021-03-11 20:00:00
The expected output will be
name rank
so1 1
so2 1
so3 1
so4 2
so5 2
so6 2
so7 3
so8 3
so9 3
Any help/suggestions would be appreciated.
Edit: I recently learned about a function which fits well here.
You can use the ntile() window function for that (note that ntile(n) splits the rows into n equally sized buckets, which matches groups of three here only because the sample has exactly nine rows; for other row counts use the arithmetic shown further below):
SELECT
    *,
    ntile(3) OVER (ORDER BY create_date) AS rank
FROM mytable
Since you already created a cumulative row count, you can use this to create your expected rank:
SELECT
*,
floor((queue_no - 1) / 3) + 1 as rank
FROM my_cte
queue_no - 1 shifts 1..3 down to 0..2.
Dividing by 3 turns 0..2 into 0.x and 3..5 into 1.x, and so on.
floor() rounds those down to 0, 1, 2, ...
If you want to start with 1 instead of 0, add 1.
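Equivalently, ceil can do the shift and the +1 in one step (a minimal sketch; my_cte stands for the first CTE above, and in practice the group size 3 would come from the orders_which_can_be_processed_today CTE rather than being hard-coded):

SELECT *,
       ceil(queue_no / 3.0) AS rank  -- 1..3 -> 1, 4..6 -> 2, ...
FROM my_cte;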

Pyspark Join using monthly range

There are three tables A, B and C in Hive.
Table A has the following columns and is partitioned by Day. We need to extract data from 1st Jan 2016 till 31st Dec 2016. I've only shown a sample, but these records run into the millions for one year.
ID Day Name Description
1 2016-09-01 Sam Retail
2 2016-01-28 Chris Retail
3 2016-02-06 ChrisTY Retail
4 2016-02-26 Christa Retail
3 2016-12-06 ChrisTu Retail
4 2016-12-31 Christi Retail
Table B
ID SkEY
1 1.1
2 1.2
3 1.3
Table C
Start_Date End_Date Month_No
2016-01-01 2016-01-31 1
2016-02-01 2016-02-29 2
2016-03-01 2016-03-31 3
2016-04-01 2016-04-30 4
2016-05-01 2016-05-31 5
2016-06-01 2016-06-30 6
2016-07-01 2016-07-31 7
2016-08-01 2016-08-31 8
2016-09-01 2016-09-30 9
2016-10-01 2016-10-31 10
2016-11-01 2016-11-30 11
2016-12-01 2016-12-31 12
I've tried to write the code in Spark, but it didn't work; it resulted in a cartesian product on the join, and performance was also very bad.
Df_A = spark.sql("select * from A a join C c where a.day >= c.start_date
                 and a.day <= c.end_date and c.month_no = (I)")
The actual output should be PySpark code where every month is processed: the value of I should automatically increment from 1 to 12 along with the month dates.
A joins C on the date range as shown above, and A joins B using ID, and performance should be good.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# SparkSession with Hive support replaces the old HiveContext
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

def UDF_df(i):
    print(i[0])
    # pull one day's partition from A
    ABC2 = spark.sql("select * from A where day = '{0}'".format(i[0]))
    # join to B on ID and keep the needed columns
    Join = ABC2.join(Tab2, ABC2.ID == Tab2.ID) \
               .select(Tab2.SkEY, ABC2.Day, ABC2.Name, ABC2.Description)
    Join.write \
        .mode("append") \
        .format("parquet") \
        .insertInto("Table")

ABC = spark.sql("select distinct day from A where day >= '2016-01-01' and day <= '2016-12-31'")
Tab2 = spark.sql("select * from B where ID is not null")

for i in ABC.collect():
    UDF_df(i)
The query above works but takes a long time, as the number of columns is around 60 (I just used a sample of 3). Also, I didn't join Table C, as I wasn't sure how to join it without a cartesian product. Performance isn't good, and I am not sure how to optimize the query.
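A single set-based join usually avoids both the per-day loop and the cartesian product. The trick is to make the join to Table C an inequality (range) join, so each row of A matches exactly one month row. A sketch in Spark SQL, assuming the Hive tables are named A, B and C as above (a suggestion, not the asker's final code):

SELECT a.ID, a.Day, a.Name, a.Description, b.SkEY, c.Month_No
FROM A a
JOIN B b ON a.ID = b.ID                                -- equi-join on ID
JOIN C c ON a.Day BETWEEN c.Start_Date AND c.End_Date  -- one month row per record, no cartesian product
WHERE a.Day BETWEEN '2016-01-01' AND '2016-12-31'      -- lets Hive prune partitions on Day

Because the BETWEEN predicate pairs each Day with a single (Start_Date, End_Date) interval, Month_No comes out of the join itself and no loop over I is needed.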

Optimized querying in PostgreSQL

Assume you have a table named tracker with the following records.
issue_id  ingest_date          verb  status
10        2015-01-24 00:00:00  1     1
10        2015-01-25 00:00:00  2     2
10        2015-01-26 00:00:00  2     3
10        2015-01-27 00:00:00  3     4
11        2015-01-10 00:00:00  1     3
11        2015-01-11 00:00:00  2     4
I need the following results
10  2015-01-26 00:00:00  2  3
11  2015-01-11 00:00:00  2  4
I am trying out this query
select *
from etl_change_fact
where ingest_date = (select max(ingest_date)
from etl_change_fact);
However, this gives me only this record:
10  2015-01-26 00:00:00  2  3
But I want all unique records (change_id) with
(a) max(ingest_date), AND
(b) the verb column prioritized as (2 - first preferred, 1 - second preferred, 3 - last preferred).
Hence, I need the following results
10  2015-01-26 00:00:00  2  3
11  2015-01-11 00:00:00  2  4
Please help me to query this efficiently.
P.S.:
I am not indexing ingest_date, because I am going to set it as the "distribution key" in a distributed computing setup.
I am a newbie to data warehousing and querying.
Hence, please help me with an optimized way to hit my TB-sized DB.
This is a typical "greatest-n-per-group" problem. If you search for that tag here, you'll find plenty of solutions, including for MySQL.
For Postgres, the quickest way to do it is using distinct on (a Postgres proprietary extension to the SQL language):
select distinct on (issue_id) issue_id, ingest_date, verb, status
from etl_change_fact
order by issue_id,
         case verb
             when 2 then 1
             when 1 then 2
             else 3
         end,
         ingest_date desc;
You can enhance your original query with a correlated sub-query to get the latest row per issue_id (note this alone does not apply the verb priority):
select f1.*
from etl_change_fact f1
where f1.ingest_date = (select max(f2.ingest_date)
from etl_change_fact f2
where f1.issue_id = f2.issue_id);
Edit
For an outdated and unsupported Postgres version, you can probably get away with using something like this:
select f1.*
from etl_change_fact f1
where f1.ingest_date = (select f2.ingest_date
from etl_change_fact f2
where f1.issue_id = f2.issue_id
order by case verb
when 2 then 1
when 1 then 2
else 3
end, ingest_date desc
limit 1);
SQLFiddle example: http://sqlfiddle.com/#!15/3bb05/1

T-SQL Determine Status Changes in History Table

I have an application which logs changes to records in the "production" table to a "history" table. The history table is basically a field-for-field copy of the production table, with a few extra columns like last modified date, last modified by user, etc.
This works well because we get a snapshot of the record anytime the record changes. However, it makes it hard to determine unique status changes to a record. An example is below.
BoxID StatusID SubStatusID ModifiedTime
1 4 27 2011-08-11 15:31
1 4 11 2011-08-11 15:28
1 4 11 2011-08-10 09:07
1 5 14 2011-08-09 08:53
1 5 14 2011-08-09 08:19
1 4 11 2011-08-08 14:15
1 4 9 2011-07-27 15:52
1 4 9 2011-07-27 15:49
1 2 8 2011-07-26 12:00
As you can see in the above table (data comes from the real system with other fields removed for brevity and security) BoxID 1 has had 9 changes to the production record. Some of those updates resulted in statuses being changed and some did not, which means other fields (those not shown) have changed.
I need to be able, in TSQL, to extract from this data the unique status changes. The output I am looking for, given the above input table, is below.
BoxID StatusID SubStatusID ModifiedTime
1 4 27 2011-08-11 15:31
1 4 11 2011-08-10 09:07
1 5 14 2011-08-09 08:19
1 4 11 2011-08-08 14:15
1 4 9 2011-07-27 15:49
1 2 8 2011-07-26 12:00
This is not as easy as grouping by StatusID and SubStatusID and taking the min(ModifiedTime) then joining back into the history table since statuses can go backwards as well (see StatusID 4, SubStatusID 11 gets set twice).
Any help would be greatly appreciated!
Does this work for you?
;WITH Boxes_CTE AS
(
    SELECT Boxid, StatusID, SubStatusID, ModifiedTime,
           ROW_NUMBER() OVER (PARTITION BY Boxid ORDER BY ModifiedTime) AS Sequence
    FROM Boxes
)
SELECT b1.Boxid, b1.StatusID, b1.SubStatusID, b1.ModifiedTime
FROM Boxes_CTE b1
LEFT OUTER JOIN Boxes_CTE b2
    ON b1.Boxid = b2.Boxid
   AND b1.Sequence = b2.Sequence + 1
WHERE b1.StatusID <> b2.StatusID
   OR b1.SubStatusID <> b2.SubStatusID
   OR b2.StatusID IS NULL
ORDER BY b1.ModifiedTime DESC;
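On SQL Server 2012 or later, the same comparison can also be written with LAG instead of a self-join (a sketch, not part of the original answers, assuming the same Boxes table):

WITH Ordered AS
(
    SELECT BoxID, StatusID, SubStatusID, ModifiedTime,
           LAG(StatusID)    OVER (PARTITION BY BoxID ORDER BY ModifiedTime) AS PrevStatusID,
           LAG(SubStatusID) OVER (PARTITION BY BoxID ORDER BY ModifiedTime) AS PrevSubStatusID
    FROM Boxes
)
SELECT BoxID, StatusID, SubStatusID, ModifiedTime
FROM Ordered
WHERE PrevStatusID IS NULL              -- first row ever seen for the box
   OR StatusID <> PrevStatusID          -- status changed
   OR SubStatusID <> PrevSubStatusID    -- sub-status changed
ORDER BY ModifiedTime DESC;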
-- Match each row to its immediately preceding row for the same box,
-- then keep only the rows where the status pair actually changed.
SELECT CurrentStaty.BoxID, CurrentStaty.StatusID, CurrentStaty.SubStatusID
FROM Staty AS CurrentStaty
INNER JOIN Staty AS PriorStaty
    ON PriorStaty.BoxID = CurrentStaty.BoxID
   AND PriorStaty.ModifiedTime =
       (SELECT MAX(p.ModifiedTime)
        FROM Staty AS p
        WHERE p.BoxID = CurrentStaty.BoxID
          AND p.ModifiedTime < CurrentStaty.ModifiedTime)
WHERE NOT (CurrentStaty.StatusID = PriorStaty.StatusID
       AND CurrentStaty.SubStatusID = PriorStaty.SubStatusID)