Percentile calculation with a window function - postgresql

I know you can get the average, total, min, and max over a subset of the data using a window function. But is it possible to get, say, the median, or the 25th percentile instead of the average with the window function?
Put another way, how do I rewrite this to get the id and the 25th or 50th percentile sales numbers within each district rather than the average?
SELECT id, avg(sales)
OVER (PARTITION BY district) AS district_average
FROM t

You can write this as an aggregate using percentile_cont() or percentile_disc():
select district, percentile_cont(0.25) within group (order by sales)
from t
group by district;
Unfortunately, Postgres doesn't currently support these as window functions:
select id, percentile_cont(0.25) within group (order by sales) over (partition by district)
from t;
So, you can use a join:
select t.*, p_25, p_75
from t join
(select district,
percentile_cont(0.25) within group (order by sales) as p_25,
percentile_cont(0.75) within group (order by sales) as p_75
from t
group by district
) td
on t.district = td.district

Another way to do this, without joining as in Gordon's solution, is to exploit the array_agg function, which can be used as a window function:
create function pg_temp.percentile_cont_window
(c double precision[], p double precision)
returns double precision
language sql
as
$$
with t1 as (select unnest(c) as x)
select percentile_cont(p) WITHIN GROUP (ORDER BY x) from t1;
$$
;
-- -- -- -- -- -- -- -- --
-- Usage examples:
create temporary table t1 as (
select 1 as g, 1 as x
union select 1 as g, 2 as x
union select 2 as g, 3 as x
);
-- Built-in function raises an error if used without group:
-- Error: OVER is not supported for ordered-set aggregate percentile_cont
select *, percentile_cont(.1) within group (order by x) over() from t1;
-- Built-in function with grouping
select g, percentile_cont(.1) within group (order by x) from t1 group by g;
-- | g | percentile_cont |
-- |----:|------------------:|
-- | 1 | 1.1 |
-- | 2 | 3 |
-- Custom function basic usage (note that this is without grouping)
select t1.*, pg_temp.percentile_cont_window(array_agg(x) over(), .1) from t1;
-- | g | x | percentile_cont_window |
-- |----:|----:|-------------------------:|
-- | 1 | 2 | 1.2 |
-- | 1 | 1 | 1.2 |
-- | 2 | 3 | 1.2 |
-- Custom function usage with grouping is the same as using the built-in percentile_cont function
select t1.g, pg_temp.percentile_cont_window(array_agg(x), .1) from t1 group by g;
-- | g | percentile_cont_window |
-- |----:|-------------------------:|
-- | 2 | 3 |
-- | 1 | 1.1 |

Related

PostgreSQL How to merge two tables row to row without condition

I have two tables
The first table contains three text fields(username, email, num) the second have only one column with random birth_date DATE.
I need to merge tables without condition
For example
first table:
+----------+--------------+-----------+
| username | email | num |
+----------+--------------+-----------+
| 'user1' | 'user1@mail' | '+794949' |
| 'user2' | 'user2@mail' | '+799999' |
+----------+--------------+-----------+
second table:
+--------------+
| birth_date |
+--------------+
| '2001-01-01' |
| '2002-02-02' |
+--------------+
And I need result like
+----------+------------+-------------+--------------+
| username | email | num | birth_date |
+----------+------------+-------------+--------------+
| 'user1' | 'us1@mail' | '+7979797' | '2001-01-01' |
| 'user2' | 'us2@mail' | '+79898998' | '2002-02-02' |
+----------+------------+-------------+--------------+
I need to get in result table with 100 rows too
Tried different JOIN but there is no condition here
Sure there is a join condition, about the simplest there is: join on true, or a cross join. Either one is the basic "merge tables without condition". However, this does not give the result you want, as it generates a result set of 10,000 rows (100 × 100). But you can then use LIMIT:
select *
from table1
join table2 on true
order by random()
limit 100;
select *
from table1
cross join table2
order by random()
limit 100;
There is another option, which I think may be closer to what you want: assign a row number to each row of each table, then join on this assigned value:
select <column list>
from (select *, row_number() over() rn from table1) t1
join (select *, row_number() over() rn from table2) t2
on (t1.rn = t2.rn);
To eliminate the assigned value you must explicitly list each column desired in the result. But that is how it should be done anyway.
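For instance, using the column names shown in the question's tables (adjust to your actual schema), the final query might look like this sketch:

```sql
-- Sketch assuming the column names from the question's sample tables.
select t1.username, t1.email, t1.num, t2.birth_date
from (select *, row_number() over() as rn from table1) t1
join (select *, row_number() over() as rn from table2) t2
  on t1.rn = t2.rn;
```

Note that row_number() over() without an ORDER BY numbers the rows in an arbitrary order, which is fine here since the question asks for a pairing without any condition.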
See demo here. (The demo uses just 3 rows instead of 100.)

Finding duplicate records posted within a lapse of time, in PostgreSQL

I'm trying to find duplicate rows in a large database (300,000 records). Here's an example of how it looks:
| id | title | thedate |
|----|---------|------------|
| 1 | Title 1 | 2021-01-01 |
| 2 | Title 2 | 2020-12-24 |
| 3 | Title 3 | 2021-02-14 |
| 4 | Title 2 | 2021-05-01 |
| 5 | Title 1 | 2021-01-13 |
I found this excellent (i.e. fast) answer here: Find duplicate rows with PostgreSQL
-- adapted from @MatthewJ answering in https://stackoverflow.com/questions/14471179/find-duplicate-rows-with-postgresql/14471928#14471928
select * from (
SELECT id, title, TO_DATE(thedate,'YYYY-MM-DD'),
ROW_NUMBER() OVER(PARTITION BY title ORDER BY id asc) AS Row
FROM table1
) dups
where
dups.Row > 1
I'm trying to use this as a base to solve my specific problem: I need to find duplicates according to column values as in the example, but only for records posted within 15 days of each other (the date of record insertion is in the column "thedate" in my DB).
I reproduced it in this fiddle http://sqlfiddle.com/#!15/ae109/2, where id 5 (same title as id 1, and posted within 15 days of each other) should be the only acceptable answer.
How would I implement that condition in the query?
With the LAG function you can get the date from the previous row with the same title and then filter based on the time difference.
WITH with_prev AS (
SELECT
*,
LAG(thedate, 1) OVER (PARTITION BY title ORDER BY thedate) AS prev_date
FROM table1
)
SELECT id, title, thedate
FROM with_prev
WHERE thedate::timestamp - prev_date::timestamp < INTERVAL '15 days'
You don't necessarily need window functions for this; you can use a plain old self-join, like:
select p.id, p.thedate, n.id, n.thedate, p.title
from table1 p
join table1 n on p.title = n.title and p.thedate < n.thedate
where n.thedate::date - p.thedate::date < 15
http://sqlfiddle.com/#!15/a3a73a/7
This has the advantage that it might use some of your indexes on the table, and also you can decide whether you want to use the data (i.e. the ID) of the previous row or the next row from each pair.
If your date column however is not unique, you'll need to be a little more specific in your join condition, like:
select p.id, p.thedate, n.id, n.thedate, p.title
from table1 p
join table1 n on p.title = n.title and p.thedate <= n.thedate and p.id <> n.id
where n.thedate::date - p.thedate::date < 15

Extract words before and after a specific word

I need to extract the words before and after a word like '%don%' in an ntext column.
table A, column name: Text
Example:
TEXT
where it was done it will retrieve the...
at the end of the trip clare done everything to improve
it is the only one done in these times
I would like the following results:
was done it
clare done everything
one done in
I am using T-SQL; the LEFT and RIGHT functions did not work with the ntext data type of the column containing the text.
As others have said, you can use a string splitting function to split out each word and then return those you require. Using the previously linked DelimitedSplit8K:
CREATE FUNCTION dbo.DelimitedSplit8K
--===== Define I/O parameters
(@pString VARCHAR(8000), @pDelimiter CHAR(1))
--WARNING!!! DO NOT USE MAX DATA-TYPES HERE! IT WILL KILL PERFORMANCE!
RETURNS TABLE WITH SCHEMABINDING AS
RETURN
--===== "Inline" CTE Driven "Tally Table" produces values from 1 up to 10,000...
-- enough to cover VARCHAR(8000)
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
),
cteStart(N1) AS (--==== This returns N+1 (starting position of each "element" just once for each delimiter)
SELECT 1 UNION ALL
SELECT t.N+1 FROM cteTally t WHERE SUBSTRING(@pString,t.N,1) = @pDelimiter
),
cteLen(N1,L1) AS(--==== Return start and length (for use in substring)
SELECT s.N1,
ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1,8000)
FROM cteStart s
)
--===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
Item = SUBSTRING(@pString, l.N1, l.L1)
FROM cteLen l
;
go
declare @t table (t ntext);
insert into @t values('where it was done it will retrieve the...'),('at the end of the trip clare done everything to improve'),('we don''t take donut donations here'),('ending in don');
with t as (select cast(t as nvarchar(max)) as t from @t)
,d as (select t.t
,case when patindex('%don%',s.Item) > 0 then 1 else 0 end as d
,s.ItemNumber as i
,lag(s.Item,1,'') over (partition by t.t order by s.ItemNumber) + ' '
+ s.Item + ' '
+ lead(s.Item,1,'') over (partition by t.t order by s.ItemNumber) as r
from t
cross apply dbo.DelimitedSplit8K(t.t, ' ') as s
)
select t
,r
from d
where d = 1
order by t
,i;
Output:
+---------------------------------------------------------+-----------------------+
| t | r |
+---------------------------------------------------------+-----------------------+
| at the end of the trip clare done everything to improve | clare done everything |
| ending in don | in don |
| we don't take donut donations here | we don't take |
| we don't take donut donations here | take donut donations |
| we don't take donut donations here | donut donations here |
| where it was done it will retrieve the... | was done it |
+---------------------------------------------------------+-----------------------+
And a working example:
http://rextester.com/RND43071

Select query for selecting columns from those records from the inner query, where the inner query and outer query have different columns

I have a GROUP BY query which fetches me some records. What if I wish to find other column details representing those records?
Suppose I have a query as follows, to fetch the most recent entry in the table:
Select id, max(date) from records group by id;
I wish to fetch another column representing those records.
I want to do something like this (this incorrect query is just for example):
Select type from (Select id, max(date) from records group by id)
but here type doesn't exist in the inner query.
I am not able to define the question in a simpler manner; I apologise for that.
Any help is appreciated.
EDIT :
Column | Type | Modifiers
--------+-----------------------+-----------
id | integer |
rdate | date |
type | character varying(20) |
Sample Data :
id | rdate | type
----+------------+------
1 | 2013-11-03 | E1
1 | 2013-12-12 | E1
2 | 2013-12-12 | A3
3 | 2014-01-11 | B2
1 | 2014-01-15 | A1
4 | 2013-12-23 | C1
5 | 2014-01-05 | C
7 | 2013-12-20 | D
8 | 2013-12-20 | D
9 | 2013-12-23 | A1
While I was trying something like this (I'm no good at SQL): select type from records as r1 inner join (Select id, max(rdate) from records group by id) r2 on r1.rdate = r2.rdate;
or
select type from records as r1 ,(Select id,max(rdate) from records group by id) r2 inner join r1 on r1.rdate = r2.rdate ;
You can easily do this with a window function:
SELECT id, rdate, type
FROM (
  SELECT id, rdate, type,
         rank() OVER (PARTITION BY id ORDER BY rdate DESC) AS rnk
  FROM records
) foo
WHERE rnk = 1
ORDER BY id;
The window definition OVER (PARTITION BY id ORDER BY rdate DESC) takes all records with the same id value, then sorts them from most recent to least recent rdate and assigns a rank to each row. The rank of 1 is the most recent, so it is equivalent to max(rdate). (The rnk alias can only be filtered on in the outer query, not in the subquery that defines the window function.)
If I've understood the question right, then this should work (or at least get you something you can work with):
SELECT
b.id, b.maxdate, a.type
FROM
records a -- this is the records table, where you'll get the type
INNER JOIN -- now join it to the group by query
(select id, max(rdate) as maxdate FROM records GROUP BY id) b
ON -- join on both rdate and id, otherwise you'll get lots of duplicates
b.id = a.id
AND b.maxdate = a.rdate
Note that if you have records with different types for the same id and rdate combination you'll get duplicates.
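As a Postgres-specific alternative (a sketch, not mentioned in the question), the same "latest row per id" result can be obtained with DISTINCT ON, which sidesteps both the join and the duplicates:

```sql
-- DISTINCT ON keeps exactly one row per id: the first in the sort order,
-- i.e. the row with the greatest rdate. Ties on rdate are broken arbitrarily.
SELECT DISTINCT ON (id) id, rdate, type
FROM records
ORDER BY id, rdate DESC;
```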

PostgreSQL: select nearest rows according to sort order

I have a table like this:
a | user_id
----------+-------------
0.1133 | 2312882332
4.3293 | 7876123213
3.1133 | 2312332332
1.3293 | 7876543213
0.0033 | 2312222332
5.3293 | 5344343213
3.2133 | 4122331112
2.3293 | 9999942333
And I want to locate a particular row - 1.3293 | 7876543213 for example - and select the nearest 4 rows. 2 above, 2 below if possible.
Sort order is ORDER BY a ASC.
In this case I will get:
0.0033 | 2312222332
0.1133 | 2312882332
2.3293 | 9999942333
3.1133 | 2312332332
How can I achieve this using PostgreSQL? (BTW, I'm using PHP.)
P.S.: For the last or first row the nearest rows would be 4 above or 4 below.
Test case:
CREATE TEMP TABLE tbl(a float, user_id bigint);
INSERT INTO tbl VALUES
(0.1133, 2312882332)
,(4.3293, 7876123213)
,(3.1133, 2312332332)
,(1.3293, 7876543213)
,(0.0033, 2312222332)
,(5.3293, 5344343213)
,(3.2133, 4122331112)
,(2.3293, 9999942333);
Query:
WITH x AS (
SELECT a
,user_id
,row_number() OVER (ORDER BY a, user_id) AS rn
FROM tbl
), y AS (
SELECT rn, LEAST(rn - 3, (SELECT max(rn) - 5 FROM x)) AS min_rn
FROM x
WHERE (a, user_id) = (1.3293, 7876543213)
)
SELECT *
FROM x, y
WHERE x.rn > y.min_rn
AND x.rn <> y.rn
ORDER BY x.a, x.user_id
LIMIT 4;
Returns result as depicted in the question. Assuming that (a, user_id) is unique.
It is not clear whether a is supposed to be unique. That's why I sort by user_id additionally, to break ties. That's also why I use the window function row_number(), and not rank(), for this. row_number() is the correct tool in any case: we want exactly 4 rows, and rank() would give an undefined number of rows if there were peers in the sort order.
This always returns 4 rows as long as there are at least 5 rows in the table. Close to the first / last row, the first / last 4 rows are returned; in all other cases, the two rows before and the two rows after. The criteria row itself is excluded.
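To illustrate why rank() would be unreliable here, a minimal sketch with duplicate a values (the data is invented for this example):

```sql
SELECT a, user_id,
       rank()       OVER (ORDER BY a)          AS rnk,
       row_number() OVER (ORDER BY a, user_id) AS rn
FROM (VALUES (1.0, 1), (1.0, 2), (2.0, 3)) AS t(a, user_id);
-- rank() yields 1, 1, 3: peers share a rank and a gap follows,
-- so a filter on a rank range can match more or fewer than 4 rows.
-- row_number() yields 1, 2, 3 and is always unique.
```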
Improved performance
This is an improved version of what @Tim Landscheidt posted. Vote for his answer if you like the idea with the index. Don't bother with small tables, but it will boost performance for big tables - provided you have a fitting index in place. The best choice would be a multicolumn index on (a, user_id).
WITH params(_a, _user_id) AS (SELECT 5.3293, 5344343213) -- enter params once
,x AS (
(
SELECT a
,user_id
,row_number() OVER (ORDER BY a DESC, user_id DESC) AS rn
FROM tbl, params p
WHERE a < p._a
OR a = p._a AND user_id < p._user_id -- a is not defined unique
ORDER BY a DESC, user_id DESC
LIMIT 5 -- 4 + 1: including central row
)
UNION ALL -- UNION right away, trim one query level
(
SELECT a
,user_id
,row_number() OVER (ORDER BY a ASC, user_id ASC) AS rn
FROM tbl, params p
WHERE a > p._a
OR a = p._a AND user_id > p._user_id
ORDER BY a ASC, user_id ASC
LIMIT 5
)
)
, y AS (
SELECT a, user_id
FROM x, params p
WHERE (a, user_id) <> (p._a, p._user_id) -- exclude central row
ORDER BY rn -- no need to ORDER BY a
LIMIT 4
)
SELECT *
FROM y
ORDER BY a, user_id -- ORDER result as requested
Major differences to @Tim's version:
According to the question, (a, user_id) forms the search criterion, not just a. That changes the window frame, ORDER BY and WHERE clause in subtly different ways.
UNION right away, no need for an extra query level. You need parentheses around the two UNION queries to allow for individual ORDER BY.
Sort result as requested. Requires another query level (at hardly any cost).
As the parameters are used in multiple places, I centralized the input in a leading CTE.
For repeated use you can wrap this query almost 'as is' into an SQL or plpgsql function.
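As a rough sketch of that wrapping (the function name and the simplified body are assumptions; it keeps the row-wise comparison but omits the edge handling of the full query, so it returns fewer than 4 rows near the ends):

```sql
CREATE OR REPLACE FUNCTION nearest4(_a float, _user_id bigint)
  RETURNS SETOF tbl
  LANGUAGE sql STABLE AS
$func$
  SELECT a, user_id FROM (
    (SELECT a, user_id FROM tbl            -- two rows before the target
     WHERE (a, user_id) < (_a, _user_id)   -- row-wise comparison
     ORDER BY a DESC, user_id DESC LIMIT 2)
    UNION ALL
    (SELECT a, user_id FROM tbl            -- two rows after the target
     WHERE (a, user_id) > (_a, _user_id)
     ORDER BY a, user_id LIMIT 2)
  ) sub
  ORDER BY a, user_id;
$func$;
```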
And another one:
WITH prec_rows AS
(SELECT a,
user_id,
ROW_NUMBER() OVER (ORDER BY a DESC) AS rn
FROM tbl
WHERE a < 1.3293
ORDER BY a DESC LIMIT 4),
succ_rows AS
(SELECT a,
user_id,
ROW_NUMBER() OVER (ORDER BY a ASC) AS rn
FROM tbl
WHERE a > 1.3293
ORDER BY a ASC LIMIT 4)
SELECT a, user_id
FROM
(SELECT a,
user_id,
rn
FROM prec_rows
UNION ALL SELECT a,
user_id,
rn
FROM succ_rows) AS s
ORDER BY rn, a LIMIT 4;
AFAIR, WITH will instantiate an in-memory table, so the focus of this solution is to limit its size as much as possible (in this case, eight rows).
set search_path='tmp';
DROP TABLE lutser;
CREATE TABLE lutser
( val float
, num bigint
);
INSERT INTO lutser(val, num)
VALUES ( 0.1133 , 2312882332 )
,( 4.3293 , 7876123213 )
,( 3.1133 , 2312332332 )
,( 1.3293 , 7876543213 )
,( 0.0033 , 2312222332 )
,( 5.3293 , 5344343213 )
,( 3.2133 , 4122331112 )
,( 2.3293 , 9999942333 )
;
WITH ranked_lutsers AS (
SELECT val, num
,rank() OVER (ORDER BY val) AS rnk
FROM lutser
)
SELECT that.val, that.num
, (that.rnk-this.rnk) AS relrnk
FROM ranked_lutsers that
JOIN ranked_lutsers this ON (that.rnk BETWEEN this.rnk-2 AND this.rnk+2)
WHERE this.val = 1.3293
;
Results:
DROP TABLE
CREATE TABLE
INSERT 0 8
val | num | relrnk
--------+------------+--------
0.0033 | 2312222332 | -2
0.1133 | 2312882332 | -1
1.3293 | 7876543213 | 0
2.3293 | 9999942333 | 1
3.1133 | 2312332332 | 2
(5 rows)
As Erwin pointed out, the center row is not wanted in the output. Also, row_number() should be used instead of rank().
WITH ranked_lutsers AS (
SELECT val, num
-- ,rank() OVER (ORDER BY val) AS rnk
, row_number() OVER (ORDER BY val, num) AS rnk
FROM lutser
) SELECT that.val, that.num
, (that.rnk-this.rnk) AS relrnk
FROM ranked_lutsers that
JOIN ranked_lutsers this ON (that.rnk BETWEEN this.rnk-2 AND this.rnk+2 )
WHERE this.val = 1.3293
AND that.rnk <> this.rnk
;
Result2:
val | num | relrnk
--------+------------+--------
0.0033 | 2312222332 | -2
0.1133 | 2312882332 | -1
2.3293 | 9999942333 | 1
3.1133 | 2312332332 | 2
(4 rows)
UPDATE2: to always select four rows, even if we are at the top or bottom of the list. This makes the query a bit uglier (but not as ugly as Erwin's ;-).
WITH ranked_lutsers AS (
SELECT val, num
-- ,rank() OVER (ORDER BY val) AS rnk
, row_number() OVER (ORDER BY val, num) AS rnk
FROM lutser
) SELECT that.val, that.num
, ABS(that.rnk-this.rnk) AS srtrnk
, (that.rnk-this.rnk) AS relrnk
FROM ranked_lutsers that
JOIN ranked_lutsers this ON (that.rnk BETWEEN this.rnk-4 AND this.rnk+4 )
-- WHERE this.val = 1.3293
WHERE this.val = 0.1133
AND that.rnk <> this.rnk
ORDER BY srtrnk ASC
LIMIT 4
;
Output:
val | num | srtrnk | relrnk
--------+------------+--------+--------
0.0033 | 2312222332 | 1 | -1
1.3293 | 7876543213 | 1 | 1
2.3293 | 9999942333 | 2 | 2
3.1133 | 2312332332 | 3 | 3
(4 rows)
UPDATE: a version with a nested CTE (featuring a Cartesian self-join). For convenience, I added a primary key to the table, which sounds like a good idea anyway IMHO.
WITH distance AS (
WITH ranked_lutsers AS (
SELECT id
, row_number() OVER (ORDER BY val, num) AS rnk
FROM lutser
) SELECT l0.id AS one
,l1.id AS two
, ABS(l1.rnk-l0.rnk) AS dist
-- Warning: Cartesian product below
FROM ranked_lutsers l0
, ranked_lutsers l1 WHERE l0.id <> l1.id
)
SELECT lu.*
FROM lutser lu
JOIN distance di
ON lu.id = di.two
WHERE di.one= 1
ORDER by di.dist
LIMIT 4
;