Dedup using SQL on a huge 1 billion data set - postgresql

I am running out of memory while trying to dedup a table that holds a huge amount of data.
Scenario:
Column A | Column B (Date)
Value1 | Date1
Value1 | Date2
Value2 | Date3
Value2 | Date4
I need to dedup on these two columns, picking the latest record for each Column A value using Column B.
Let's say Date2 and Date4 are the latest dates. My output should be:
Column A | Column B (Date)
Value1 | Date2
Value2 | Date4
Currently I am using the query below, which works. Is there a better way of doing this that uses less memory?
CREATE TABLE unique_tablename AS (
SELECT a.column_a, a.column_b, a.column_c, a.column_d
FROM tablename a,
(SELECT column_a, MAX(column_b) AS column_b FROM tablename GROUP BY column_a) b
WHERE a.column_a = b.column_a
AND a.column_b = b.column_b);
Thanks in advance!

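Use DISTINCT ON. It keeps only the first row of each col_a group, and the ORDER BY makes that first row the one with the latest col_b:
select distinct on (col_a)
col_a as value, col_b as "date"
from t
order by col_a, col_b desc;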
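As a rough illustration of the "latest row per key" selection that this query performs, the same result can be sketched in Python. This is not how PostgreSQL executes DISTINCT ON; the function name `latest_per_key` and the concrete dates are assumptions made up for the example.

```python
from datetime import date


def latest_per_key(rows):
    """Return (key, latest_date) pairs, one per distinct key, sorted by key."""
    latest = {}
    for key, day in rows:
        # Keep only the greatest date seen so far for this key.
        if key not in latest or day > latest[key]:
            latest[key] = day
    return sorted(latest.items())


# Hypothetical sample data standing in for Value1/Value2 and Date1..Date4.
rows = [
    ("Value1", date(2020, 1, 1)),
    ("Value1", date(2020, 2, 1)),
    ("Value2", date(2020, 3, 1)),
    ("Value2", date(2020, 4, 1)),
]
result = latest_per_key(rows)
print(result)
```

In this sketch the dictionary holds one entry per distinct key, so its size depends on the number of keys rather than the number of input rows.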

Related

How to find the average of the three maximum values in a specific group in a moving window in Big Query?

I have a data set as in the table below. I want to find the average of the maximum three values in a rolling 12 month window grouped by id.
id date value
id1 2020/01/01 500
id1 2021/02/01 300
id1 2021/03/01 150
id1 2021/08/01 100
id1 2021/12/01 400
id2 2020/01/01 50
id2 2020/02/01 900
id2 2021/12/01 100
So my expected output is:
id date value
id1 2020/01/01 500
id1 2021/02/01 300
id1 2021/03/01 225
id1 2021/08/01 183.33
id1 2021/12/01 283.33
id2 2020/01/01 50
id2 2020/02/01 500
id2 2021/12/01 100
I.e. for id1 2021/12/01: (400+300+150)/3 = 283.33 which is the average of the three largest values in a rolling 12 month window for group ID1.
I managed to get to this point:
CREATE TEMP FUNCTION avg_array(arr ANY TYPE) AS ((
SELECT AVG(val) FROM(
SELECT val FROM UNNEST(arr) val ORDER BY val DESC LIMIT 3)
)
);
SELECT id, date, avg_array(val_arr)
FROM (
SELECT
id, date, ARRAY_AGG(value) OVER (
PARTITION BY id
ORDER BY id, date DESC ROWS BETWEEN CURRENT ROW AND 11 FOLLOWING
) as val_arr
FROM `table` )
This works, but I feel like there must be a better way to do it. Specifically, I can't figure out how to get the average of the three largest values from the OVER clause as well, rather than creating a separate function.
(If it is not possible to combine the date window with finding the maximum values, it would also be useful to know how to find the average of the three largest values per group in any GROUP BY, without creating a separate function.)

Your PARTITION BY clause is missing the year of the date. It should read PARTITION BY id, EXTRACT(YEAR FROM date):
CREATE TEMP FUNCTION avg_array(arr ANY TYPE) AS ((
SELECT AVG(val) FROM(
SELECT val FROM UNNEST(arr) val ORDER BY val DESC LIMIT 3))
);
SELECT id, date, avg_array(val_arr)
FROM (
SELECT
id, date, ARRAY_AGG(value) OVER (
PARTITION BY id,EXTRACT(YEAR FROM date)
ORDER BY id, date DESC ROWS BETWEEN CURRENT ROW AND 11 FOLLOWING
) as val_arr
FROM `table` )
order by id,date asc
Here is sample code to get the average of the 3 largest values in each group:
select id, avg(value) as avg_top3 from (
select id, value,
row_number() over (partition by id order by value desc) as rn
from `table`
) a
where rn <= 3
group by id
See the BigQuery documentation on window function calls for more information about the OVER clause.

Consider the approach below:
select id, date,
(select round(avg(value), 2) from (
select value from t.arr value
order by value desc
limit 3
)) value
from (
select *, array_agg(value) over last_12_month arr from `table`
window last_12_month as (partition by id
order by 12 * (extract(year from date)) + extract(month from date)
range between 11 preceding and current row
)
) t
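To make the window logic easier to check by hand, here is a small Python sketch of the same idea: index each date by 12 * year + month, collect the values of the same id whose index lies within the 11 preceding months and the current one, and average the three largest. The function name `top3_rolling_avg` is an assumption for illustration, and only the id1 rows from the question are used.

```python
from datetime import date


def top3_rolling_avg(rows, months=12, top=3):
    """For each (id, date, value) row, average the `top` largest values of the
    same id that fall in the trailing `months`-month window (current month included)."""
    out = []
    for rid, day, _ in rows:
        idx = 12 * day.year + day.month
        window = [
            v for (i, d, v) in rows
            if i == rid and idx - (months - 1) <= 12 * d.year + d.month <= idx
        ]
        best = sorted(window, reverse=True)[:top]
        out.append((rid, day, round(sum(best) / len(best), 2)))
    return out


# The id1 rows from the question.
rows = [
    ("id1", date(2020, 1, 1), 500),
    ("id1", date(2021, 2, 1), 300),
    ("id1", date(2021, 3, 1), 150),
    ("id1", date(2021, 8, 1), 100),
    ("id1", date(2021, 12, 1), 400),
]
result = top3_rolling_avg(rows)
for row in result:
    print(row)
```

For the id1 rows this yields 500, 300, 225, 183.33 and 283.33, which agrees with the expected output listed in the question.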

DB2: SQL to return all rows in a group having a particular value of a column in two latest records of this group

I have a DB2 table in which column A holds either PQR or XYZ.
I need all rows of each group (grouped by column B) whose two latest records, by the date in column C, both have A = PQR.
Sample Table
A B C
--- ----- ----------
PQR Mark 08/08/2019
PQR Mark 08/01/2019
XYZ Mark 07/01/2019
PQR Joe 10/11/2019
XYZ Joe 10/01/2019
PQR Craig 06/06/2019
PQR Craig 06/20/2019
For this sample table, the output would be the Mark and Craig records.

Since 11.1
You can use the NTH_VALUE OLAP function.
Refer to the OLAP specification in the Db2 documentation.
SELECT A, B, C
FROM
(
SELECT
A, B, C
, NTH_VALUE (A, 1) OVER (PARTITION BY B ORDER BY C DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) C1
, NTH_VALUE (A, 2) OVER (PARTITION BY B ORDER BY C DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) C2
FROM TAB
)
WHERE C1 = 'PQR' AND C2 = 'PQR'
Older versions
SELECT T.*
FROM TAB T
JOIN
(
SELECT B
FROM
(
SELECT
A, B
, ROWNUMBER() OVER (PARTITION BY B ORDER BY C DESC) RN
FROM TAB
)
WHERE RN IN (1, 2)
GROUP BY B
HAVING MIN(A) = MAX(A) AND COUNT(1) = 2 AND MIN(A) = 'PQR'
) G ON G.B = T.B;
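The rule both queries implement (keep every row of a group whose two most recent records both have A = 'PQR') can be sketched in Python to check it against the sample table. The function name `groups_with_latest_two` is hypothetical, and the dates are the sample values read as month/day/year, which is an assumption.

```python
from collections import defaultdict
from datetime import date


def groups_with_latest_two(rows, wanted="PQR"):
    """Return all (A, B, C) rows whose group B has `wanted` in both of its two latest records."""
    by_b = defaultdict(list)
    for a, b, c in rows:
        by_b[b].append((c, a))
    keep = set()
    for b, recs in by_b.items():
        latest = sorted(recs, reverse=True)[:2]  # two most recent by date C
        if len(latest) == 2 and all(a == wanted for _, a in latest):
            keep.add(b)
    return [r for r in rows if r[1] in keep]


rows = [
    ("PQR", "Mark", date(2019, 8, 8)),
    ("PQR", "Mark", date(2019, 8, 1)),
    ("XYZ", "Mark", date(2019, 7, 1)),
    ("PQR", "Joe", date(2019, 10, 11)),
    ("XYZ", "Joe", date(2019, 10, 1)),
    ("PQR", "Craig", date(2019, 6, 6)),
    ("PQR", "Craig", date(2019, 6, 20)),
]
result = groups_with_latest_two(rows)
for row in result:
    print(row)
```

Joe is excluded because his second-latest record is XYZ, while Mark's older XYZ row is still returned because the whole group qualifies.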

A simple solution could be
SELECT A,B,C
FROM tab
WHERE A = 'PQR'
ORDER BY C DESC FETCH FIRST 2 ROWS only

How to convert timestamp to numbers

Suppose I have a table like this:
Id Types Timestamp
1 A 2014-02-04 00:00:00
2 A 2014-02-05 00:00:00
1 A 2014-02-05 03:59:00
3 C 2014-05-06 03:59:00
1 B 2014-02-04 03:00:00
2 D 2014-02-05 00:40:00
I would like the output to be like this:
Id 1 2 3 4 5 etc
1 A B A C D ...
2 A D NULL NULL NULL
3 C NULL NULL NULL NULL
Is it possible to use the timestamp to express the order of the types?
Thanks for any hints.

Preliminary comments:
SQL can only return a predefined number of columns. IMHO, the best you can get is the values concatenated into an array.
I have named your input table MyTable and renamed the column Timestamp to MyTimestamp to avoid a conflict with the keyword for the corresponding type.
You have put C and D in the first row of your output. I will treat that as a typo (they do not belong to ID = 1).
WITH RECURSIVE ConcatAndOrder(ID, MyResult, RowNumForOrder, RowCountForOrder) AS (
SELECT ID, ARRAY[Type], RowNumForOrder, RowCountForOrder
FROM IndexedTable
WHERE RowNumForOrder = 1
UNION ALL
SELECT I.ID, MyResult || I.Type, I.RowNumForOrder, I.RowCountForOrder
FROM IndexedTable I
JOIN ConcatAndOrder C on I.ID = C.ID and I.RowNumForOrder = C.RowNumForOrder + 1
), IndexedTable(ID, Type, RowNumForOrder, RowCountForOrder) AS (
SELECT ID, Type,
row_number() OVER (PARTITION BY ID ORDER BY MyTimestamp),
count(*) OVER (PARTITION BY ID)
FROM MyTable
)
SELECT ID, MyResult
FROM ConcatAndOrder
WHERE RowNumForOrder = RowCountForOrder
ORDER BY ID
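For comparison, the result this query builds (one array of types per ID, ordered by timestamp) can be sketched in Python. This is only an illustration of the expected output, not a translation of the recursive CTE; the function name `types_in_time_order` is made up for the example.

```python
from collections import defaultdict
from datetime import datetime


def types_in_time_order(rows):
    """Map each id to its list of types, ordered by timestamp."""
    per_id = defaultdict(list)
    for rid, typ, _ in sorted(rows, key=lambda r: (r[0], r[2])):
        per_id[rid].append(typ)
    return dict(per_id)


# The sample rows from the question.
rows = [
    (1, "A", datetime(2014, 2, 4, 0, 0)),
    (2, "A", datetime(2014, 2, 5, 0, 0)),
    (1, "A", datetime(2014, 2, 5, 3, 59)),
    (3, "C", datetime(2014, 5, 6, 3, 59)),
    (1, "B", datetime(2014, 2, 4, 3, 0)),
    (2, "D", datetime(2014, 2, 5, 0, 40)),
]
result = types_in_time_order(rows)
print(result)
```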

Cumulative sum with group by and join

I'm struggling to find a clean way to do this. Assume that I have the following records in my table named Records:
|Name| |InsertDate| |Size|
john 30.06.2015 1
john 10.01.2016 10
john 12.01.2016 100
john 05.03.2016 1000
doe 01.01.2016 1
How do I get the records for the year 2016 and months up to and including 3, grouped by month (even if a month has no records, e.g. month 2 in this case), with the cumulative sum of Size up to and including that month? I want to get the following result:
|Name| |Month| |Size|
john 1 111
john 2 111
john 3 1111
doe 1 1

As other commenters have already stated, you simply need a table with dates in that you can join from to give you the dates that your source table does not have records for:
-- Build the source data table.
declare @t table(Name nvarchar(10)
,InsertDate date
,Size int
);
insert into @t values
('john','20150630',1 )
,('john','20160110',10 )
,('john','20160112',100 )
,('john','20160305',1000)
,('doe' ,'20160101',1 );
-- Specify the year you want to search for by storing the first day here.
declare @year date = '20160101';
-- This derived table builds a set of dates that you can join from.
-- LEFT JOINing from here is what gives you rows for months without records in your source data.
with Dates
as
(
select @year as MonthStart
,dateadd(day,-1,dateadd(month,1,@year)) as MonthEnd
union all
select dateadd(month,1,MonthStart)
,dateadd(day,-1,dateadd(month,2,MonthStart))
from Dates
where dateadd(month,1,MonthStart) < dateadd(yyyy,1,@year)
)
select t.Name
,d.MonthStart
,sum(t.Size) as Size
from Dates d
left join @t t
on(t.InsertDate <= d.MonthEnd)
where d.MonthStart <= '20160301' -- Without knowing what your logic is for specifying values only up to March, I have left this part for you to automate.
group by t.Name
,d.MonthStart
order by t.Name
,d.MonthStart;
If you have a static date reference table in your database, you don't need to do the derived table creation and can just do:
select d.DateValue
,<Other columns>
from DatesReferenceTable d
left join <Other Tables> o
on(d.DateValue = o.AnyDateColumn)
etc
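The effect of joining from a month list (every month gets a row, and each row sums everything inserted up to the end of that month) can be sketched in Python as follows. The function name `cumulative_by_month` and its parameters are assumptions for illustration. Like the join on InsertDate <= MonthEnd, it emits a row for every name and every month, so doe also appears for months 2 and 3.

```python
from datetime import date


def cumulative_by_month(rows, year, last_month):
    """Return (name, month, running_total) for months 1..last_month of `year`,
    counting every record inserted up to the end of that month."""
    names = []
    for name, _, _ in rows:
        if name not in names:
            names.append(name)  # preserve first-seen order
    out = []
    for name in names:
        for month in range(1, last_month + 1):
            total = sum(
                size for n, d, size in rows
                if n == name and (d.year, d.month) <= (year, month)
            )
            out.append((name, month, total))
    return out


# The sample rows from the question.
rows = [
    ("john", date(2015, 6, 30), 1),
    ("john", date(2016, 1, 10), 10),
    ("john", date(2016, 1, 12), 100),
    ("john", date(2016, 3, 5), 1000),
    ("doe", date(2016, 1, 1), 1),
]
result = cumulative_by_month(rows, 2016, 3)
for row in result:
    print(row)
```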

Here's another approach that utilizes a tally table (aka numbers table) to create the date table. Note my comments.
-- Build the source data table.
declare @t table(Name nvarchar(10), InsertDate date, Size int);
insert into @t values
('john','20150630',1 )
,('john','20160110',10 )
,('john','20160112',100 )
,('john','20160305',1000)
,('doe' ,'20160101',1 );
-- A year is fine, don't need a date data type
declare @year smallint = 2016;
WITH -- dummy rows for a tally table:
E AS (SELECT E FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t(e)),
dateRange(totalDays, mn, mx) AS -- Get the range and number of months to create
(
SELECT DATEDIFF(MONTH, MIN(InsertDate), MAX(InsertDate)), MIN(InsertDate), MAX(InsertDate)
FROM @t
),
iTally(N) AS -- Tally Oh! Create an inline Tally (aka numbers) table starting with 0
(
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1))-1
FROM E a CROSS JOIN E b CROSS JOIN E c CROSS JOIN E d
),
RunningTotal AS -- perform a running total by year/month for each person (Name)
(
SELECT
yr = YEAR(DATEADD(MONTH, n, mn)),
mo = MONTH(DATEADD(MONTH, n, mn)),
Name,
Size = SUM(Size) OVER
(PARTITION BY Name ORDER BY YEAR(DATEADD(MONTH, n, mn)), MONTH(DATEADD(MONTH, n, mn)))
FROM iTally
CROSS JOIN dateRange
LEFT JOIN @t ON MONTH(InsertDate) = MONTH(DATEADD(MONTH, n, mn))
WHERE N <= totalDays
) -- Final output will only return rows where the year matches @year:
SELECT
name = ISNULL(name, LAG(Name, 1) OVER (ORDER BY yr, mo)),
yr, mo,
size = ISNULL(Size, LAG(Size, 1) OVER (ORDER BY yr, mo))
FROM RunningTotal
WHERE yr = @year
GROUP BY yr, mo, name, size;
Results:
name yr mo size
---------- ----------- ----------- -----------
doe 2016 1 1
john 2016 1 111
john 2016 2 111
john 2016 3 1111

Grouping consecutive dates in PostgreSQL

I have two tables that I need to combine, because some dates are found in table A but not in table B and vice versa. My desired result is that ranges which overlap or fall on consecutive days are combined.
I'm using PostgreSQL.
Table A
id startdate enddate
--------------------------
101 12/28/2013 12/31/2013
Table B
id startdate enddate
--------------------------
101 12/15/2013 12/15/2013
101 12/16/2013 12/16/2013
101 12/28/2013 12/28/2013
101 12/29/2013 12/31/2013
Desired Result
id startdate enddate
-------------------------
101 12/15/2013 12/16/2013
101 12/28/2013 12/31/2013

Right. I have a query that I think works. It certainly works on the sample records you provided. It uses a recursive CTE.
First, you need to merge the two tables. Next, use a recursive CTE to get the sequences of overlapping dates. Finally, get the start and end dates, and join back to the "merged" table to get the id.
with recursive allrecords as -- this merges the input tables. Add a unique row identifier
(
select *, row_number() over (ORDER BY startdate) as rowid from
(select * from table1
UNION
select * from table2) a
),
path as ( -- the recursive CTE. This gets the sequences
select rowid as parent,rowid,startdate,enddate from allrecords a
union
select p.parent,b.rowid,b.startdate,b.enddate from allrecords b join path p on (p.enddate + interval '1 day')>=b.startdate and p.startdate <= b.startdate
)
SELECT id,g.startdate,g.enddate FROM -- outer query to get the id
-- inner query to get the start and end of each sequence
(select parent,min(startdate) as startdate, max(enddate) as enddate from
(
select *, row_number() OVER (partition by rowid order by parent,startdate) as row_number from path
) a
where row_number = 1 -- We only want the first occurrence of each record
group by parent)g
INNER JOIN allrecords a on a.rowid = parent
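The three steps described above (merge the tables, chain overlapping or adjacent ranges, take each chain's start and end) can be checked against the sample data with a short Python sketch. It uses a sort-and-sweep rather than a recursive CTE, so it illustrates the intended result, not the query itself; the function name `merge_ranges` is an assumption.

```python
from datetime import date, timedelta


def merge_ranges(ranges):
    """Merge (id, start, end) ranges that overlap or touch on consecutive days."""
    merged = []
    # set() drops exact duplicates, like UNION; sorting groups by id, then start date.
    for rid, start, end in sorted(set(ranges)):
        if merged and merged[-1][0] == rid and start <= merged[-1][2] + timedelta(days=1):
            last = merged[-1]
            merged[-1] = (rid, last[1], max(last[2], end))  # extend the open range
        else:
            merged.append((rid, start, end))  # start a new range
    return merged


table_a = [(101, date(2013, 12, 28), date(2013, 12, 31))]
table_b = [
    (101, date(2013, 12, 15), date(2013, 12, 15)),
    (101, date(2013, 12, 16), date(2013, 12, 16)),
    (101, date(2013, 12, 28), date(2013, 12, 28)),
    (101, date(2013, 12, 29), date(2013, 12, 31)),
]
result = merge_ranges(table_a + table_b)
for row in result:
    print(row)
```

On the sample tables this gives the two ranges from the desired result: 12/15/2013 to 12/16/2013 and 12/28/2013 to 12/31/2013.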

The fragment below does what you intend (but it will probably be very slow). The problem is that detecting (non-)overlapping date ranges is impossible with the standard range operators, since a range could be split into two parts.
So, my code does the following:
split the date ranges from table_A into atomic records, with one date per record
[the same for table_b]
full outer join these two sets (we are only interested in A_not_in_B and B_not_in_A), remembering which side of the join each record came from
re-aggregate the resulting records into date ranges.
-- EXPLAIN ANALYZE
--
WITH RECURSIVE ranges AS (
-- Chop up the a-table into atomic date units
WITH ar AS (
SELECT generate_series(a.startdate,a.enddate , '1day'::interval)::date AS thedate
, 'A'::text AS which
, a.id
FROM a
)
-- Same for the b-table
, br AS (
SELECT generate_series(b.startdate,b.enddate, '1day'::interval)::date AS thedate
, 'B'::text AS which
, b.id
FROM b
)
-- combine the two sets, retaining a_not_in_b plus b_not_in_a
, moments AS (
SELECT COALESCE(ar.id,br.id) AS id
, COALESCE(ar.which, br.which) AS which
, COALESCE(ar.thedate, br.thedate) AS thedate
FROM ar
FULL JOIN br ON br.id = ar.id AND br.thedate = ar.thedate
WHERE ar.id IS NULL OR br.id IS NULL
)
-- use a recursive CTE to re-aggregate the atomic moments into ranges
SELECT m0.id, m0.which
, m0.thedate AS startdate
, m0.thedate AS enddate
FROM moments m0
WHERE NOT EXISTS ( SELECT * FROM moments nx WHERE nx.id = m0.id AND nx.which = m0.which
AND nx.thedate = m0.thedate -1
)
UNION ALL
SELECT rr.id, rr.which
, rr.startdate AS startdate
, m1.thedate AS enddate
FROM ranges rr
JOIN moments m1 ON m1.id = rr.id AND m1.which = rr.which AND m1.thedate = rr.enddate +1
)
SELECT * FROM ranges ra
WHERE NOT EXISTS (SELECT * FROM ranges nx
-- suppress partial subassemblies
WHERE nx.id = ra.id AND nx.which = ra.which
AND nx.startdate = ra.startdate
AND nx.enddate > ra.enddate
)
;
