Conditionally select most common value in an array in PostgreSQL

Conditionally select most common value in an array in PostgreSQL - postgresql

I have a table (only one row) in my PostgreSQL 9.5 db with two columns i.e., count (bigint) and array (text).
count array
6 "112,19.3,142,142,19.3,172,172,20.3,202,202,20.3,232,232,19.3,262,262,19.3,292"
The array represents six (thus count = 6) set of values i.e., Lower_limit, Value and Upper_limit. Now, I need to conditionally modify my array i.e., when upper limit and lower limits are coinciding then select the first upper limit and last lower limit and return the most common value (which is 19.3) among the limits. My desired output would be like:
count array
1 112, 19.3, 292
Could anyone help me to have some pointers towards my desired output?

I must admin - I dont understand how you get count =1, but below is an example of how you can build array with firsrt, last and most common values. Mind if there would be several mos common values it would unpredictably pick on of em
t=#
with a(r) as (values(array[112,19.3,142,142,19.3,172,172,20.3,202,202,20.3,232,232,19.3,262,262,19.3,292]))
, p as (select * from a,unnest(a.r) with ordinality)
, t as (
select count(1) over (partition by unnest)
, unnest u
, r[1] a
, r[array_length(r,1)] e
from p
order by unnest
limit 1
)
select array[a,u,e]
from t
;
array
----------------
{112,19.3,292}
(1 row)

Related

lead and lag on large table 1billion rows

I have a table T as follows with 1 Billion records. Currently, this table has no Primary key or Indexes.
create table T(
day_c date,
str_c varchar2(20),
comm_c varchar2(20),
src_c varchar2(20)
);
some sample data:
insert into T
select to_date('20171011','yyyymmdd') day_c,'st1' str_c,'c1' comm_c,'s1' src_c from dual
union
select to_date('20171012','yyyymmdd'),'st1','c1','s1' from dual
union
select to_date('20171013','yyyymmdd'),'st1','c1','s1' from dual
union
select to_date('20171014','yyyymmdd'),'st1','c1','s2' from dual
union
select to_date('20171015','yyyymmdd'),'st1','c1','s2' from dual
union
select to_date('20171016','yyyymmdd'),'st1','c1','s2' from dual
union
select to_date('20171017','yyyymmdd'),'st1','c1','s1' from dual
union
select to_date('20171018','yyyymmdd'),'st1','c1','s1' from dual
union
select to_date('20171019','yyyymmdd'),'st1','c1','s1' from dual
union
select to_date('20171020','yyyymmdd'),'st1','c1','s1' from dual;
The expected result is to generate the date ranges for the changes in column src_c.
I have the following code snippet which provides the desired result. However, it is slow as the cost of running lag and lead is quite high on the table.
WITH EndsMarked AS (
SELECT
day_c,str_c,comm_c,src_c,
CASE WHEN src_c= LAG(src_c,1) OVER (ORDER BY day_c)
THEN 0 ELSE 1 END AS IS_START,
CASE WHEN src_c= LEAD(src_c,1) OVER (ORDER BY day_c)
THEN 0 ELSE 1 END AS IS_END
FROM T
), GroupsNumbered AS (
SELECT
day_c,str_c,comm_c,
src_c,
IS_START,
IS_END,
COUNT(CASE WHEN IS_START = 1 THEN 1 END)
OVER (ORDER BY day_c) AS GroupNum
FROM EndsMarked
WHERE IS_START=1 OR IS_END=1
)
SELECT
str_c,comm_c,src_c,
MIN(day_c) AS GROUP_START,
MAX(day_c) AS GROUP_END
FROM GroupsNumbered
GROUP BY str_c,comm_c, src_c,GroupNum
ORDER BY groupnum;
Output :
STR_C COMM_C SRC_C GROUP_START GROUP_END
st1 c1 s1 11-OCT-17 13-OCT-17
st1 c1 s2 14-OCT-17 16-OCT-17
st1 c1 s1 17-OCT-17 20-OCT-17
Any suggestion to speed up?
Oracle database :12c.
SGA Memory:20GB
Total CPU:22
Explain plan:

Order by day_c only, or do you need to partition by str_c and comm_c first? It seems so - in which case I am not sure your query is correct, and Sentinel's solution will need to be adjusted accordingly.
Then:
For some reason (which escapes me), it appears that the match_recognize clause (available only since Oracle 12.1) is faster than analytic functions, even when the work done seems to be the same.
In your problem, (1) you must read 1 billion rows from disk, which can't be done faster than the hardware allows (do you REALLY need to do this on all 1 billion rows, or should you archive a large portion of your table, perhaps after performing this identification of GROUP_START and GROUP_END)? (2) you must order the data by day_c no matter what method you use, and that is time consuming.
With that said, the tabibitosan method (see Sentinel's answer) will be faster than the start-of-group method (which is close to, but simpler than what you currently have).
The match_recognize solution, which will probably be faster than any solution based on analytic functions, looks like this:
select str_c, comm_c, src_c, group_start, group_end
from t
match_recognize(
partition by str_c, comm_c
order by day_c
measures x.src_c as src_c,
first(day_c) as group_start,
last(day_c) as group_end
pattern ( x y* )
define y as src_c = x.src_c
)
-- Add ORDER BY clause here, if needed
;
Here is a quick explanation of how this works; for developers who are not familiar with match_recognize, I provided links to a few good tutorials in a Comment below this Answer.
The match_recognize clause partitions the input rows by str_c and comm_c and orders them by day_c. So far this is exactly the same work that analytic functions do.
Then in the PATTERN and DEFINE clauses I declare and define two "classes" of rows, which will be flagged as X and Y, respectively. X is any row (there are no restrictions on it in the DEFINE clause). However, Y is restricted: it must have the same src_c as the last X row preceding it.
So, in each partition, and reading from the earliest row to the latest (within the partition), I am looking for any number of matches, where a match consists of an arbitrary row (marked X), followed by as many Y rows as possible; where Y means "same src_c as the first row in this match. So, this will identify sequences of rows where the src_c did not change.
For each match that is found, the clause will output the src_c value from the X row (which is the same, really, for all the rows in that match), and the first and the last value in the day_c column for that match. That is what we need to put in the SELECT clause of the overall query.

You can eliminate one CTE by using the Tabibito-san (Traveler) method:
with Groups as (
select t.*
, row_number() over (order by day_c)
- row_number() over (partition by str_c
, comm_c
, src_c
order by day_c) GroupNum
from t
)
select str_c
, comm_c
, src_c
, min(day_c) GROUP_START
, max(day_c) GROUP_END
from Groups
group by str_c
, comm_c
, src_c
, GroupNum

Create multiple incrementing columns using with recursive in postgresql

I'm trying to create a table with the following columns:
I want to use a with recursive table to do this. The following code however is giving the following error:
'ERROR: column "b" does not exist'
WITH recursive numbers AS
(
SELECT 1,2,4 AS a, b, c
UNION ALL
SELECT a+1, b+1, c+1
FROM Numbers
WHERE a + 1 <= 10
)
SELECT * FROM numbers;
I'm stuck because when I just include one column this works perfectly. Why is there an error for multiple columns?

This appears to be a simple syntax issue: You are aliasing the columns incorrectly. (SELECT 1,2,4 AS a, b, c) is incorrect. Your attempt has 5 columns: 1,2,a,b,c
Break it down to just: Select 1,2,4 as a,b,c and you see the error but Select 1 a,2 b,4 c works fine.
b is unknown in the base select because it is being interpreted as a field name; yet no table exists having that field. Additionally the union would fail as you have 5 fields in the base and 3 in the recursive union.
DEMO: http://rextester.com/IUWJ67486
One can define the columns outside the select making it easier to manage or change names.
WITH recursive numbers (a,b,c) AS
(
SELECT 1,2,4
UNION ALL
SELECT a+1, b+1, c+1
FROM Numbers
WHERE a + 1 <= 10
)
SELECT * FROM numbers;
or this approach which aliases the fields internally so the 1st select column's names would be used. (a,b,c) vs somereallylongalias... in union query. It should be noted that not only the name of the column originates from the 1st query in the unioned sets; but also the datatype for the column; which, must match between the two queries.
WITH recursive numbers AS
(
SELECT 1 as a ,2 as b,4 as c
UNION ALL
SELECT a+1 someReallyLongAlias
, b+1 someReallyLongAliasAgain
, c+1 someReallyLongAliasYetAgain
FROM Numbers
WHERE a<5
)
SELECT * FROM numbers;
Lastly, If you truly want to stop at 5 then the where clause should be WHERE a < 5. The image depicts this whereas the query does not; so not sure what your end game is here.

Managing overflows in LISTAGG on Amazon Redshift

Using the example from this post: https://blogs.oracle.com/datawarehousing/entry/managing_overflows_in_listagg
The following statement:
SELECT
deptno,
LISTAGG(ename, ';') WITHIN GROUP (ORDER BY empno) AS namelist
FROM emp
GROUP BY deptno;
will generate the following output:
DEPTNO NAMELIST
---------- ----------------------------------------
10 CLARK;KING;MILLER
20 SMITH;JONES;SCOTT;ADAMS;FORD
30 ALLEN;WARD;MARTIN;BLAKE;TURNER;JAMES
Let’s assume that the above statement does not run and that we have a limit of 15 characters that can be returned by each row in our LISTAGG function. This is in actuality 65535 on Amazon Redshift.
We would want the following to be returned in this case:
DEPTNO NAMELIST
---------- ----------------------------------------
10 CLARK;KING
10 MILLER
20 SMITH;JONES
20 SCOTT;ADAMS
20 FORD
30 ALLEN;WARD
30 MARTIN;BLAKE
30 TURNER;JAMES
What would be the best way to recreate this result in Amazon Redshift to avoid any data loss and taking speed into consideration?

It's possible to achieve this with 2 subquery:
First:
SELECT id, field,
sum(length(field) + 1) over
(partition by id order by RANDOM() rows unbounded preceding) as total_length_now
from my_schema.my_table)
Initially we want to calculate how many chars we have for each id in our table. We can use a window function to calculate it incrementally for each row. In the 'order by' statement you can use any unique field that you have. If you don't have one, you can simply use random or an hash function, but is mandatory that the field is unique, if not, the function will not work as we want.
The '+1' in the length represent the semicolon that we will use in the listagg function.
Second:
SELECT id, field, total_length_now / 65535 as sub_id
FROM (sub_query_1)
Now we create a sub_id based on the length that we calculated before. If the total_length_now exceed the limit size (in this case 65535) the division's rest will return a new sub_id.
Last Step
SELECT id, sub_id, listagg(field, ';') as namelist
FROM (sub_query_2)
GROUP BY id, sub_id
ORDER BY id, sub_id
Now we can simply call the listagg function grouping by id and sub_id, since each group cannot exceed the size limit.
Complete query
SELECT id, sub_id, listagg(field, ';') as namelist
FROM (
SELECT id, field, total_length_now / 65535 as sub_id
FROM (SELECT id,
field,
sum(length(field) + 1) over
(partition by id order by field rows unbounded preceding) as total_length_now
from support.test))
GROUP BY id, sub_id
order by id, sub_id
Example with your data (with size limit = 10)
First and second query output:
id, field, total_length_now, sub_id
10,KING,5,0
10,CLARK,11,1
10,MILLER,18,1
20,ADAMS,6,0
20,SMITH,12,1
20,JONES,18,1
20,FORD,23,2
20,SCOTT,29,2
30,JAMES,6,0
30,BLAKE,12,1
30,WARD,17,1
30,MARTIN,24,2
30,TURNER,31,3
30,ALLEN,37,3
Final query output:
id,sub_id,namelist
10,0,KING
10,1,CLARK;MILLER
20,0,ADAMS
20,1,SMITH;JONES
20,2,FORD;SCOTT
30,0,JAMES
30,1,BLAKE;WARD
30,2,MARTIN
30,3,TURNER;ALLEN

It is possible to create a partial list, and then the rest of values as separate rows in one go, but if the number of rows is unconstrained you really need a loop statement to then convert that into a list, and the rows for remaining and so on.
So this is really a task for Apache Spark (or any other map-reduce technology).

How to calculate median in AWS Redshift?

Most databases have a built in function for calculating the median but I don't see anything for median in Amazon Redshift.
You could calculate the median using a combination of the nth_value() and count() analytic functions but that seems janky. I would be very surprised if an analytics db didn't have a built in method for computing median so I'm assuming I'm missing something.
http://docs.aws.amazon.com/redshift/latest/dg/r_Examples_of_NTH_WF.html
http://docs.aws.amazon.com/redshift/latest/dg/c_Window_functions.html

And as of 2014-10-17, Redshift supports the MEDIAN window function:
# select min(median) from (select median(num) over () from temp);
min
-----
4.0

Try the NTILE function.
You would divide your data into 2 ranked groups and pick the minimum value from the first group. That's because in datasets with an odd number of values, the first ntile will have 1 more value than the second. This approximation should work very well for large datasets.
create table temp (num smallint);
insert into temp values (1),(5),(10),(2),(4);
select num, ntile(2) over(order by num desc) from temp ;
num | ntile
-----+-------
10 | 1
5 | 1
4 | 1
2 | 2
1 | 2
select min(num) as median from (select num, ntile(2) over(order by num desc) from temp) where ntile = 1;
median
--------
4

I had difficulty with this also, but got some help from Amazon. Since the 2014-06-30 version of Redshift, you can do this with the PERCENTILE_CONT or PERCENTILE_DISC window functions.
They're slightly weird to use, as they will tack the median (or whatever percentile you choose) onto every row. You put that in a subquery and then take the MIN (or whatever) of the median column.
# select count(num), min(median) as median
from
(select num, percentile_cont (0.5) within group (order by num) over () as median from temp);
count | median
-------+--------
5 | 4.0
(The reason it's complicated is that window functions can also do their own mini-group-by and ordering to give you the median of many groups all at once, and other tricks.)
In the case of an even number of values, CONT(inuous) will interpolate between the two middle values, where DISC(rete) will pick one of them.

I typically use the NTILE function to split the data into two groups if I’m looking for an answer that’s close enough. However, if I want the exact median (e.g. the midpoint of an even set of rows), I use a technique suggested on the AWS Redshift Discussion Forum.
This technique orders the rows in both ascending and descending order, then if there is an odd number of rows, it returns the average of the middle row (that is, where row_num_asc = row_num_desc), which is simply the middle row itself.
CREATE TABLE temp (num SMALLINT);
INSERT INTO temp VALUES (1),(5),(10),(2),(4);
SELECT
AVG(num) AS median
FROM
(SELECT
num,
SUM(1) OVER (ORDER BY num ASC) AS row_num_asc,
SUM(1) OVER (ORDER BY num DESC) AS row_num_desc
FROM
temp) AS ordered
WHERE
row_num_asc IN (row_num_desc, row_num_desc - 1, row_num_desc + 1);
median
--------
4
If there is an even number of rows, it returns the average of the two middle rows.
INSERT INTO temp VALUES (9);
SELECT
AVG(num) AS median
FROM
(SELECT
num,
SUM(1) OVER (ORDER BY num ASC) AS row_num_asc,
SUM(1) OVER (ORDER BY num DESC) AS row_num_desc
FROM
temp) AS ordered
WHERE
row_num_asc IN (row_num_desc, row_num_desc - 1, row_num_desc + 1);
median
--------
4.5

Firebird get the list with all available id

In a table I have records with id's 2,4,5,8. How can I receive a list with values 1,3,6,7. I have tried in this way
SELECT t1.id + 1
FROM table t1
WHERE NOT EXISTS (
SELECT *
FROM table t2
WHERE t2.id = t1.id + 1
)
but it's not working correctly. It doesn't bring all available positions.
Is it possible without another table?

You can get all the missing ID's from a recursive CTE, like this:
with recursive numbers as (
select 1 number
from rdb$database
union all
select number+1
from rdb$database
join numbers on numbers.number < 1024
)
select n.number
from numbers n
where not exists (select 1
from table t
where t.id = n.number)
the number < 1024 condition in my example limit the query to the max 1024 recursion depth. After that, the query will end with an error. If you need more than 1024 consecutive ID's you have either run the query multiple times adjusting the interval of numbers generated or think in a different query that produces consecutive numbers without reaching that level of recursion, which is not too difficult to write.