I have a query, which returns a simple list of numbers:
SELECT unnest(c) FROM t ORDER BY f LIMIT 10;
And it returns something like:
1
1
3
4
2
3
5
1
5
6
3
2
I want to keep the result unique, but also preserve order:
1
3
4
2
5
6
select distinct(id) from (select ...) as c;
does not work, because it uses HashAggregate, which breaks the order (and processes all rows just to return 10?). I tried GROUP BY; it also HashAggregates the whole table(?), then sorts and returns the 10 required rows.
Is it possible to do this efficiently on the DB side? Or should I just read the rows from my first query in my application and filter them as a stream?
with ordinality is your friend to preserve the order.
select val
from unnest('{1,1,3,4,2,3,5,1,5,6,3,2}'::int[]) with ordinality t(val, ord)
group by val
order by min(ord); -- the first time that this item appeared
val
1
3
4
2
5
6
Or it may make sense to define this function:
create function arr_unique(arr anyarray)
  returns anyarray language sql immutable as
$$
    select array_agg(val order by ord)
    from (
        select val, min(ord) as ord
        from unnest(arr) with ordinality t(val, ord)
        group by val
    ) t;
$$;
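For example, applied to the sample array from above (just a usage sketch of the function defined here):
select arr_unique('{1,1,3,4,2,3,5,1,5,6,3,2}'::int[]);
-- expected: {1,3,4,2,5,6}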
select elem
from (
    select elem, elem_no, row_no,
           row_number() over (partition by elem order by row_no) as occurrence_no
    from (
        -- number every unnested element in its original order
        select elem, elem_no, row_number() over () as row_no
        from t, unnest(c) WITH ORDINALITY a(elem, elem_no)
    ) A
) B
where occurrence_no = 1  -- keep only the first appearance of each element
order by row_no
While trying to map some data to a table, I wanted to obtain each ID in a table together with its modulo with respect to the total number of rows in that same table. For example, given this table:
id
--
1
3
10
12
I would like this result:
id | mod
---+----
1 | 1 <- 1 mod 4
3 | 3 <- 3 mod 4
10 | 2 <- 10 mod 4
12 | 0 <- 12 mod 4
Is there an easy way to achieve this dynamically (as in, not counting the rows beforehand, but doing it in an atomic way)?
So far I've tried something like this:
SELECT t1.id, t1.id % COUNT(t1.id) mod FROM tbl t1, tbl t2 GROUP BY t1.id;
This works, but you must have the GROUP BY and tbl t2; otherwise it returns 0 for the mod column. That makes sense, because I think it works by multiplying the table by itself so each ID gets a full copy of the table. I guess for small enough tables this is OK, but I can see how this becomes problematic for larger tables.
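For what it's worth, the per-id count produced by that cross join can be checked directly (just a sketch against the same tbl):
SELECT t1.id, COUNT(*) AS total_rows
FROM tbl t1, tbl t2
GROUP BY t1.id;
-- with the ids {1,3,10,12} above, every id shows total_rows = 4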
Edit: Found another hack-ish way:
WITH total AS (
SELECT COUNT(*) cnt FROM tbl
)
SELECT t1.id, t1.id % t2.cnt mod FROM tbl t1, total t2
It is similar to the previous query, but it "collapses" the multiplication into a single row containing the precomputed count.
You can use the COUNT() window function:
SELECT id,
id % COUNT(*) OVER () mod
FROM tbl;
I'm sure that the optimizer is smart enough to calculate the result of the window function only once.
See the demo.
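A self-contained check with the sample ids from the question (the VALUES list just stands in for tbl):
WITH tbl(id) AS (VALUES (1), (3), (10), (12))
SELECT id,
       id % COUNT(*) OVER () AS mod
FROM tbl;
-- returns (1,1), (3,3), (10,2), (12,0), matching the expected output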
I'm stuck with a (Postgres 9.4.6) query that (a) is using too much memory (most likely due to array_agg()) and (b) does not return exactly what I need, making post-processing necessary. Any input (especially regarding memory consumption) is highly appreciated.
Explanation:
the table token_groups holds all words used in tweets I've parsed, with their respective occurrence frequencies in an hstore, one row per 10 minutes (for the last 7 days, so 7*24*6 rows in total). The rows are inserted in order of tweeted_at, so I can simply order by id. I'm using row_number() to identify when a word occurred.
# \d token_groups
Table "public.token_groups"
Column | Type | Modifiers
------------+-----------------------------+-----------------------------------------------------------
id | integer | not null default nextval('token_groups_id_seq'::regclass)
tweeted_at | timestamp without time zone | not null
tokens | hstore | not null default ''::hstore
Indexes:
"token_groups_pkey" PRIMARY KEY, btree (id)
"index_token_groups_on_tweeted_at" btree (tweeted_at)
What I'd ideally want is a list of words, each with the relative distances between its row numbers. So if e.g. the word 'hello' appears once in row 5, twice in row 8 and once in row 20, I'd want a column with the word and an array column returning {5,3,0,12} (meaning: first occurrence in the fifth row, next occurrence 3 rows later, next occurrence 0 rows later, next one 12 rows later). If anyone wonders why: 'relevant' words occur in clusters, so (simplified) the higher the standard deviation of the temporal distances, the more likely a word is a keyword. See more here: http://bioinfo2.ugr.es/Publicaciones/PRE09.pdf
For now, I return an array with positions and an array with frequencies, and use this info to calculate the distances in ruby.
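Just to make the clustering criterion concrete: the post-processing boils down to something like the standard deviation of such a distance array; a hypothetical sketch in SQL (not part of the actual query):
SELECT stddev_samp(d) AS distance_stddev
FROM unnest('{5,3,0,12}'::int[]) AS d;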
Currently the primary problem is a high memory spike, which seems to be caused by array_agg(). The (very helpful) Heroku staff tell me that some of my connections use 500-700 MB with very little shared memory, causing out-of-memory errors (I'm running Standard-0, which gives me 1 GB total for all connections), so I need to find an optimization.
The total number of hstore entries is ~100k; they are then aggregated (after skipping words with very low frequency):
SELECT COUNT(*)
FROM (SELECT row_number() over(ORDER BY id ASC) AS position,
(each(tokens)).key, (each(tokens)).value::integer
FROM token_groups) subquery;
count
--------
106632
Here is the query causing the memory load:
SELECT key, array_agg(pos) AS positions, array_agg(value) AS frequencies
FROM (
SELECT row_number() over(ORDER BY id ASC) AS pos,
(each(tokens)).key,
(each(tokens)).value::integer
FROM token_groups
) subquery
GROUP BY key
HAVING SUM(value) > 10;
The output is:
key | positions | frequencies
-------------+---------------------------------------------------------+-------------------------------
hello | {172,185,188,210,349,427,434,467,479} | {1,2,1,1,2,1,2,1,4}
world | {166,218,265,343,415,431,436,493} | {1,1,2,1,2,1,2,1}
some | {35,65,101,180,193,198,223,227,420,424,427,428,439,444} | {1,1,1,1,1,1,1,2,1,1,1,1,1,1}
other | {77,111,233,416,421,494} | {1,1,4,1,2,2}
word | {170,179,182,184,185,186,187,188,189,190,196} | {3,1,1,2,1,1,1,2,5,3,1}
(...)
Here's what explain says:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=12789.00..12792.50 rows=200 width=44) (actual time=309.692..343.064 rows=2341 loops=1)
Output: ((each(token_groups.tokens)).key), array_agg((row_number() OVER (?))), array_agg((((each(token_groups.tokens)).value)::integer))
Group Key: (each(token_groups.tokens)).key
Filter: (sum((((each(token_groups.tokens)).value)::integer)) > 10)
Rows Removed by Filter: 33986
Buffers: shared hit=2176
-> WindowAgg (cost=177.66..2709.00 rows=504000 width=384) (actual time=0.947..108.157 rows=106632 loops=1)
Output: row_number() OVER (?), (each(token_groups.tokens)).key, ((each(token_groups.tokens)).value)::integer, token_groups.id
Buffers: shared hit=2176
-> Sort (cost=177.66..178.92 rows=504 width=384) (actual time=0.910..1.119 rows=504 loops=1)
Output: token_groups.id, token_groups.tokens
Sort Key: token_groups.id
Sort Method: quicksort Memory: 305kB
Buffers: shared hit=150
-> Seq Scan on public.token_groups (cost=0.00..155.04 rows=504 width=384) (actual time=0.013..0.505 rows=504 loops=1)
Output: token_groups.id, token_groups.tokens
Buffers: shared hit=150
Planning time: 0.229 ms
Execution time: 570.534 ms
PS: if anyone wonders: every 10 minutes I append new data to the token_groups table and remove outdated data, which is easy when storing one row per 10 minutes. I still have to come up with a better data structure that e.g. uses one row per word, but that does not seem to be the main issue; I think it's the array aggregation.
Your presented query can be simpler, evaluating each() only once per row:
SELECT key, array_agg(pos) AS positions, array_agg(value) AS frequencies
FROM  (
   SELECT t.key, pos, t.value::int
   FROM  (SELECT row_number() OVER (ORDER BY id) AS pos, * FROM token_groups) tg
       , each(tg.tokens) t  -- implicit LATERAL join
   ORDER BY t.key, pos
   ) sub
GROUP BY key
HAVING sum(value) > 10;
Also preserving correct order of elements.
What I'd ideally want is a list of words with each the relative distances of their row numbers.
This would do it:
SELECT key, array_agg(step) AS occurrences
FROM  (
   SELECT key, CASE WHEN g = 1 THEN pos - last_pos ELSE 0 END AS step
   FROM  (
      SELECT key, value::int, pos
           , lag(pos, 1, 0) OVER (PARTITION BY key ORDER BY pos) AS last_pos
      FROM  (SELECT row_number() OVER (ORDER BY id)::int AS pos, * FROM token_groups) tg
          , each(tg.tokens) t
      ) t1
    , generate_series(1, t1.value) g
   ORDER BY key, pos, g
   ) sub
GROUP BY key;
-- HAVING count(*) > 10;
SQL Fiddle.
Interpreting each hstore key as a word and the respective value as number of occurrences in the row (= for the last 10 minutes), I use two cascading LATERAL joins: 1st step to decompose the hstore value, 2nd step to multiply rows according to value. (If your value (frequency) is mostly just 1, you can simplify.) About LATERAL:
What is the difference between LATERAL and a subquery in PostgreSQL?
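To see the two steps in isolation, here is a minimal standalone sketch with a made-up hstore literal (assuming the hstore extension is installed):
SELECT t.key, t.value::int AS freq, g AS occurrence
FROM   each('hello=>2, world=>1'::hstore) t     -- step 1: decompose the hstore
     , generate_series(1, t.value::int) g;      -- step 2: one row per occurrence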
Then I ORDER BY key, pos, g in the subquery before aggregating in the outer SELECT. This clause might seem redundant, and in fact I see the same result without it in my tests. That's a collateral benefit from the window definition of lag() in the inner query, whose ordering is carried over to the next step unless some other step triggers re-ordering. Without the explicit ORDER BY, however, we would depend on an implementation detail that's not guaranteed to work.
Ordering the whole query once should be substantially faster (and easier on the required sort memory) than per-aggregate sorting. This is not strictly according to the SQL standard either, but the simple case is documented for Postgres:
Alternatively, supplying the input values from a sorted subquery will usually work. For example:
SELECT xmlagg(x) FROM (SELECT x FROM test ORDER BY y DESC) AS tab;
But this syntax is not allowed in the SQL standard, and is not portable to other database systems.
Strictly speaking, we only need:
ORDER BY pos, g
You could experiment with that. Related:
PostgreSQL unnest() with element number
Possible alternative:
SELECT key
     , ('{' || string_agg(step || repeat(',0', value - 1), ',') || '}')::int[] AS occurrences
FROM  (
   SELECT key, pos, value::int
        , (pos - lag(pos, 1, 0) OVER (PARTITION BY key ORDER BY pos))::text AS step
   FROM  (SELECT row_number() OVER (ORDER BY id)::int AS pos, * FROM token_groups) g
       , each(g.tokens) t
   ORDER BY key, pos
   ) t1
GROUP BY key;
-- HAVING sum(value) > 10;
Might be cheaper to use text concatenation instead of generate_series().
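To illustrate the text trick in isolation with made-up steps and frequencies (a step of 3 with frequency 2 contributes '3,0'):
SELECT ('{' || string_agg(step || repeat(',0', value - 1), ',' ORDER BY ord) || '}')::int[] AS occurrences
FROM  (VALUES (1, '5', 1), (2, '3', 2), (3, '12', 1)) AS v(ord, step, value);
-- returns {5,3,0,12}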
I have a table segnature describing an item with a varchar field deno and a numeric field ord. A foreign key fk_collection tells which collection the row is part of.
I want to update field ord so that it contains the ordinal of that row per each collection, sorted by field deno.
E.g. if I have something like
 deno | ord | fk_collection
------+-----+---------------
 abc  |     | 10
 aab  |     | 10
 bcd  |     | 10
 zxc  |     | 20
 vbn  |     | 20
Then I want a result like
 deno | ord | fk_collection
------+-----+---------------
 abc  |   1 | 10
 aab  |   0 | 10
 bcd  |   2 | 10
 zxc  |   1 | 20
 vbn  |   0 | 20
I tried with something like
update segnature s1 set ord = (select count(*)
from segnature s2
where s1.fk_collection=s2.fk_collection and s2.deno<s1.deno
)
but the query is really slow: updating about 150 collections with roughly 30,000 items takes around 10 minutes.
Any suggestion to speed up the process?
Thank you!
You can use a window function to generate the "ordinal" number:
with numbered as (
select deno, fk_collection,
row_number() over (partition by fk_collection order by deno) as rn,
ctid as id
from segnature
)
update segnature
set ord = n.rn
from numbered n
where n.id = segnature.ctid;
This uses the internal column ctid to uniquely identify each row. The ctid comparison is quite slow, so if you have a real primary (or unique) key in that table, use that column instead.
Alternatively without the common table expression:
update segnature
set ord = n.rn
from (
select deno, fk_collection,
row_number() over (partition by fk_collection order by deno) as rn,
ctid as id
from segnature
) as n
where n.id = segnature.ctid;
SQLFiddle example: http://sqlfiddle.com/#!15/e997f/1
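If segnature has a real primary key, the same update could join on that instead of ctid; a sketch assuming a primary key column named pk (an assumption, adjust to your schema):
update segnature
set ord = n.rn
from (
    select pk,  -- assumed primary key column
           row_number() over (partition by fk_collection order by deno) as rn
    from segnature
) as n
where n.pk = segnature.pk;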
Given the following data:
 sequence | amount
----------+--------
        1 | 100000
        1 |  20000
        2 |  10000
        2 |  10000
I'd like to write a sql query that gives me the sum of the current sequence, plus the sum of the previous sequence. Like so:
 sequence | current | previous
----------+---------+----------
        1 |  120000 |        0
        2 |   20000 |   120000
I know the solution likely involves windowing functions but I'm not too sure how to implement it without subqueries.
SQL Fiddle
select
seq,
amount,
lag(amount::int, 1, 0) over(order by seq) as previous
from (
select seq, sum(amount) as amount
from sa
group by seq
) s
order by seq
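For reference, the grouping and the lag() can also be combined in one level, since window functions are evaluated after GROUP BY; a self-contained sketch with the sample data (the CTE stands in for the real table, names follow the fiddle):
with sa(seq, amount) as (
    values (1, 100000), (1, 20000), (2, 10000), (2, 10000)
)
select seq,
       sum(amount) as current,
       lag(sum(amount)::int, 1, 0) over (order by seq) as previous
from sa
group by seq
order by seq;
-- returns (1, 120000, 0) and (2, 20000, 120000)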
If your sequence is sequential, without holes, you can simply do:
SELECT t1.sequence,
SUM(t1.amount),
(SELECT SUM(t2.amount) from mytable t2 WHERE t2.sequence = t1.sequence - 1)
FROM mytable t1
GROUP BY t1.sequence
ORDER BY t1.sequence
Otherwise, instead of t2.sequence = t1.sequence - 1 you could do:
SELECT t1.sequence,
SUM(t1.amount),
(SELECT SUM(t2.amount)
from mytable t2
WHERE t2.sequence = (SELECT MAX(t3.sequence)
FROM mytable t3
WHERE t3.sequence < t1.sequence))
FROM mytable t1
GROUP BY t1.sequence
ORDER BY t1.sequence;
You can see both approaches in this fiddle
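Here is the same sample data run against the first correlated-subquery form, with COALESCE added so sequence 1 shows 0 instead of NULL (a small addition to match the desired output):
WITH mytable(sequence, amount) AS (
    VALUES (1, 100000), (1, 20000), (2, 10000), (2, 10000)
)
SELECT t1.sequence,
       SUM(t1.amount) AS current,
       COALESCE((SELECT SUM(t2.amount)
                 FROM mytable t2
                 WHERE t2.sequence = t1.sequence - 1), 0) AS previous
FROM mytable t1
GROUP BY t1.sequence
ORDER BY t1.sequence;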