Removing Duplicate Rows using SQL (Hive / Impala syntax) - hiveql

I would like to remove duplicate rows based on event_date and event_id.
I have a query that looks like this (the query is much longer, this is just to show the problem):
SELECT
event_date,
event_id,
event_owner
FROM eventtable
This gives me results such as the following:
event_date event_id event_owner
2018-02-06 00:00:00 123456 UNASSIGNED
2018-02-07 00:00:00 123456 UNASSIGNED
2018-02-07 00:00:00 123456 Mickey Mouse
2018-02-08 00:00:00 123456 Mickey Mouse
2018-02-09 00:00:00 123456 Minnie Mouse
2018-02-10 00:00:00 123456 Minnie Mouse
2018-02-11 00:00:00 123456 Mickey Mouse
.
.
.
Problem:
I have duplicate entries on 2018-02-07. I would like to have only the second one remain (the row with the assigned owner).
So the result should be this:
event_date event_id event_owner
2018-02-06 00:00:00 123456 UNASSIGNED
2018-02-07 00:00:00 123456 Mickey Mouse
2018-02-08 00:00:00 123456 Mickey Mouse
2018-02-09 00:00:00 123456 Minnie Mouse
2018-02-10 00:00:00 123456 Minnie Mouse
2018-02-11 00:00:00 123456 Mickey Mouse
.
.
.
I've tried SELECT DISTINCT ..., but that returns all of the rows, since it takes all three columns into consideration and in that sense every row is unique. I only want to apply DISTINCT to two columns: event_date and event_id.
Should I use nested sub-queries, or is there a better way? All help is much appreciated.

You can use the ROW_NUMBER analytic function for this, but you should clarify the order implied when you say you want "only the second one" to remain. That order doesn't exist in the data by itself, so you need to generate it yourself; the query below prefers rows whose event_owner is not 'UNASSIGNED'.
Try this query:
select event_date, event_id, event_owner
from (
    select
        e.*,
        row_number() over (
            partition by event_date, event_id
            order by case when event_owner = 'UNASSIGNED' then 0 else 1 end desc
        ) as rn
    from eventtable e
) t
where rn = 1
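If you prefer to avoid the subquery, conditional aggregation gives the same result here, since you only want event_date and event_id to be distinct. This is just a sketch that assumes the duplicates differ only in event_owner (with several distinct assigned owners on the same day it would simply keep whichever sorts last):
SELECT
    event_date,
    event_id,
    -- MAX ignores NULLs, so any assigned owner wins over 'UNASSIGNED'
    COALESCE(MAX(CASE WHEN event_owner <> 'UNASSIGNED' THEN event_owner END), 'UNASSIGNED') AS event_owner
FROM eventtable
GROUP BY event_date, event_id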

Related

T-SQL Dynamic Date based on Today's Month

My fiscal year begins on April 1 and I need to include 1 full year of historical data plus current fiscal year as of today. In DAX this looks like:
DATESBETWEEN(Calendar_Date
,IF(MONTH(TODAY()) < 4
,DATE(YEAR(TODAY())-2, 4, 1)
,DATE(YEAR(TODAY())-1, 4, 1)
)
,DATE(TODAY())
)
I need to create this same range as a filter in a T-SQL query, preferably in the "WHERE" clause, but I am totally new to SQL and have been unsuccessful in finding a solution online. Any help from more experienced people would be much appreciated!
If you just want to find these values and use them as a WHERE filter, this is fairly straightforward date arithmetic, the logic for which you already have in your DAX code:
declare @dates table(d date);
insert into @dates values
('20190101')
,('20190601')
,('20200213')
,('20201011')
,('20190101')
,(getdate())
;
select d
,dateadd(month,3,dateadd(year,datediff(year,0,dateadd(month,-4,d))-1,0)) as TraditionalMethod
,case when month(d) < 4
then datetime2fromparts(year(d)-2,4,1,0,0,0,0,0)
else datetime2fromparts(year(d)-1,4,1,0,0,0,0,0)
end as YourDAXTranslated
from @dates;
Which outputs:
d            TraditionalMethod          YourDAXTranslated
2019-01-01   2017-04-01 00:00:00.000    2017-04-01 00:00:00
2019-06-01   2018-04-01 00:00:00.000    2018-04-01 00:00:00
2020-02-13   2018-04-01 00:00:00.000    2018-04-01 00:00:00
2020-10-11   2019-04-01 00:00:00.000    2019-04-01 00:00:00
2019-01-01   2017-04-01 00:00:00.000    2017-04-01 00:00:00
2021-07-22   2020-04-01 00:00:00.000    2020-04-01 00:00:00
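If all you need is the filter itself, the same CASE logic can be dropped straight into a WHERE clause. A minimal sketch, where YourTable and TransactionDate are placeholder names for your own transactional table and date column:
select *
from YourTable
where TransactionDate >= case when month(getdate()) < 4
                              then datefromparts(year(getdate())-2,4,1)
                              else datefromparts(year(getdate())-1,4,1)
                         end
  and TransactionDate < dateadd(day, 1, cast(getdate() as date)); -- i.e. up to and including today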
However, I would suggest that you may be better served by creating a Dates Table to which you apply filters and from which you join to your transactional data to return the values you require. In an appropriately configured environment this will make full use of available indexes and should provide very good performance.
A very basic tally table approach to generate such a Dates Table is as follows, which returns all dates and their fiscal year start dates for 2015-01-01 to 2042-05-18:
with t as (select t from(values(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) as t(t))
,d as (select dateadd(day,row_number() over (order by (select null))-1,'20150101') as d from t,t t2,t t3,t t4)
select d as DateValue
,case when month(d) < 4
then datetime2fromparts(year(d)-1,4,1,0,0,0,0,0)
else datetime2fromparts(year(d),4,1,0,0,0,0,0)
end as FinancialYearStart
from d
order by DateValue;
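To use it in the way suggested above, you would persist that result as a table and join it to your transactional data. A rough sketch, where DatesTable stands for the persisted output above and Sales/OrderDate are placeholder names for your own fact table and date column:
select s.*
from Sales s
join DatesTable dt on dt.DateValue = cast(s.OrderDate as date)
where dt.FinancialYearStart >= dateadd(year, -1,
          (select FinancialYearStart from DatesTable
           where DateValue = cast(getdate() as date)))
  and dt.DateValue <= cast(getdate() as date);
This keeps everything from the start of the previous fiscal year up to today, and the range predicates can use indexes on the Dates Table (if you add them) rather than wrapping the transactional data in functions.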

How can I fetch records month-wise between two dates in PostgreSQL (grouping dates by their months)?

I am using the query below to fetch the records month-wise, but it gives wrong data:
SELECT
(count( server_time::timestamp::date)) ,
min(server_time::timestamp::date) as "Month Date"
FROM
complaint_details_v2
WHERE
server_time between '2018/08/01' and '2018/10/30'
GROUP BY
floor((server_time::timestamp::date - '2018/08/01'::date)/30)
ORDER BY
2 ASC
Result
Count Month Date
2774 2018-08-01
5893 2018-08-31
1193 2018-09-30
But the result should be:
Count Month Date
2774 2018-08-01
5893 2018-09-01
1193 2018-10-01
Use date_trunc
demo:db<>fiddle
SELECT
count(*),
date_trunc('month', servertime)::date as month_date
FROM log
GROUP BY date_trunc('month', servertime)
ORDER BY 2
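Applied to the table and column names from the question (the fiddle above uses log and servertime), that would look roughly like this; note the upper bound is widened so the whole of October is covered, which the original BETWEEN '2018/08/01' AND '2018/10/30' would not do:
SELECT
    count(*),
    date_trunc('month', server_time)::date AS month_date
FROM complaint_details_v2
WHERE server_time >= '2018-08-01'
  AND server_time < '2018-11-01'
GROUP BY date_trunc('month', server_time)
ORDER BY month_date;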

Simple temporal view?

How can I write a query to compute the end date per ID in postgres? I am able to do this in-memory with Python, but I would rather keep it simple and just create a view.
Any new combination of system_1_id and system_2_id is appended to my table along with the date of the file the data came from (I am reading a snapshot mapping file which is sent a few times per week). It looks like this:
system_1_id system_2_id start_date is_current
123456 05236 2016-06-01 False
123456 98899 2017-01-03 False
123456 05236 2017-04-15 True
And I want to transform it to:
system_1_id system_2_id start_date end_date
123456 05236 2016-06-01 2017-01-02
123456 98899 2017-01-03 2017-04-14
123456 05236 2017-04-15
Note that there can only be one system_2_id assigned to a system_1_id at a time, but they can be recycled and even reassigned at a later date.
The end date is simply one day less than the start date of the next row for the same ID.
My goal is eventually to be able to join other tables to the data and pull the accurate ids per date:
where t1.system_2_id = t2.system_2_id and t1.report_date >= t2.start_date and t1.report_date <= t2.end_date
A simple temporal table without worrying about triggers or rules or using an extension.
The lead() window function will do this for you, with your example data:
select
system_1_id,
system_2_id,
start_date,
cast(lead(start_date, 1, Null) over(partition by system_1_id order by start_date) - interval '1 day' as date) as end_date
from
the_table;
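Since the goal is a view, that query can be wrapped directly. A sketch, where system_id_mapping is just a placeholder view name and the_table stands for your mapping table:
CREATE VIEW system_id_mapping AS
SELECT
    system_1_id,
    system_2_id,
    start_date,
    cast(lead(start_date) over (partition by system_1_id order by start_date)
         - interval '1 day' as date) AS end_date
FROM the_table;
When joining against it as described, remember that the current row has a NULL end_date, so the condition becomes something like t1.report_date >= t2.start_date and (t1.report_date <= t2.end_date or t2.end_date is null).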

PostgreSQL date_part on multiple records

I have multiple records with the same who value and different when values.
I have an attendance table with this structure:
who | when | why
"when" timestamp with time zone NOT NULL,
"who" text,
I need to calculate the difference between each record for the same who.
I've tried:
DATE_PART('hour', table."when"::timestamp - table."when"::timestamp)
but that doesn't seem to work.
i.e.
A | 2017-03-01 08:30
A | 2017-03-01 12:30
B | 2017-03-01 08:30
B | 2017-03-01 12:30
I need to get the total hours for A and B separately.
You need a window function in order to access the value of the "previous" row:
select who,
       "when",
       "when" - lag("when") over (partition by who order by "when") as diff
from the_table
order by who, "when";
If you only ever have two rows per who, or only care about the first and last value of when, use a simple aggregation:
select who,
       max("when") - min("when")
from the_table
group by who
order by who;
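Both of these return an interval. If you literally need the total as a number of hours per who, convert the interval via its epoch; a small sketch based on the aggregate version:
select who,
       extract(epoch from max("when") - min("when")) / 3600 as total_hours
from the_table
group by who
order by who;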

Creating sequence of dates and inserting each date into query

I need to find certain data within first day of current month to the last day of current month.
select count(*) from q_aggr_data as a
where a.filial_='fil1'
and a.operator_ like 'unit%'
and date_trunc('day',a.s_end_)='"+ date_to_search+ "'
group by a.s_name_,date_trunc('day',a.s_end_)
date_to_search here is 01.09.2014, 02.09.2014, 03.09.2014, ..., 30.09.2014.
I've tried to loop through i=0...30 and make 30 queries, but that takes too long and is extremely naive. Also, for the days where there is no entry it should return 0. I've seen how to generate date sequences, but can't get my head around how to inject those days one by one into the query.
By creating not only a series, but a set of 1-day ranges, any timestamp data can be joined to the ranges using >= with <.
Note in particular that this approach avoids functions on the data (such as truncating to date), and because of this it permits the use of indexes to assist query performance.
If some data looked like this:
CREATE TABLE my_data
("data_dt" timestamp)
;
INSERT INTO my_data
("data_dt")
VALUES
('2014-09-01 08:24:00'),
('2014-09-01 22:48:00'),
('2014-09-02 13:12:00'),
('2014-09-03 03:36:00'),
('2014-09-03 18:00:00'); -- further sample rows from the original demo are omitted here
Then that can be joined to a generated set of ranges (dt_start & dt_end pairs), using an outer join so that unmatched ranges are still reported:
SELECT
r.dt_start
, count(d.data_dt)
FROM (
SELECT
dt_start
, dt_start + INTERVAL '1 Day' dt_end
FROM
generate_series('2014-09-01 00:00'::timestamp,
'2014-09-30 00:00', '1 Day') AS dt_start
) AS r
LEFT OUTER JOIN my_data d ON d.data_dt >= r.dt_start
AND d.data_dt < r.dt_end
GROUP BY
r.dt_start
ORDER BY
r.dt_start
;
and a result such as this is produced:
| DT_START | COUNT |
|----------------------------------|-------|
| September, 01 2014 00:00:00+0000 | 2 |
| September, 02 2014 00:00:00+0000 | 1 |
| September, 03 2014 00:00:00+0000 | 2 |
| September, 04 2014 00:00:00+0000 | 2 |
...
| September, 29 2014 00:00:00+0000 | 0 |
| September, 30 2014 00:00:00+0000 | 0 |
See this SQLFiddle demo
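Adapted to the table from the question, it could look roughly like the sketch below. The filial_ and operator_ filters are kept in the join condition rather than a WHERE clause, so days with no matching rows still come back with a count of 0:
SELECT
r.dt_start
, count(a.s_end_) AS cnt
FROM (
SELECT
dt_start
, dt_start + INTERVAL '1 Day' dt_end
FROM
generate_series('2014-09-01 00:00'::timestamp,
'2014-09-30 00:00', '1 Day') AS dt_start
) AS r
LEFT OUTER JOIN q_aggr_data a ON a.s_end_ >= r.dt_start
AND a.s_end_ < r.dt_end
AND a.filial_ = 'fil1'
AND a.operator_ LIKE 'unit%'
GROUP BY
r.dt_start
ORDER BY
r.dt_start
;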
One way to solve this problem is to group by truncated date.
select count(*)
from q_aggr_data as a
where a.filial_='fil1'
and a.operator_ like 'unit%'
group by date_trunc('day',a.s_end_), a.s_name_;
Another way is to use a window function, for example to get the count over the truncated date.
Please check if this query satisfies your requirements:
select sum(matched) -- include s_name_, s_end_ if you want to verify the results
from
(select a.filial_
, a.operator_
, a.s_name_
, generate_series s_end_
, (case when a.filial_ = 'fil1' then 1 else 0 end) as matched
from q_aggr_data as a
right join generate_series('2014-09-01', '2014-09-30', interval '1 day')
on a.s_end_ = generate_series
and a.filial_ = 'fil1'
and a.operator_ like 'unit%') aa
group by s_name_, s_end_
order by s_end_, s_name_
http://sqlfiddle.com/#!15/e8edf/3