PipelineDB: get counts for top K items

How to calculate frequencies of top K values in the stream?
Let's say we have a stream
CREATE STREAM stream (
    value integer
);
And we inserted ten rows
INSERT INTO stream (value) VALUES (1);
INSERT INTO stream (value) VALUES (1);
INSERT INTO stream (value) VALUES (1);
INSERT INTO stream (value) VALUES (2);
INSERT INTO stream (value) VALUES (2);
INSERT INTO stream (value) VALUES (3);
INSERT INTO stream (value) VALUES (4);
INSERT INTO stream (value) VALUES (5);
INSERT INTO stream (value) VALUES (6);
INSERT INTO stream (value) VALUES (7);
How can I get back the top 2 items and their frequencies?
 value | frequency
-------+-----------
     1 |       0.3
     2 |       0.2
I suppose it should somehow use both Top K and the Count-min Sketch together?

You can use fss_agg for that:
CREATE CONTINUOUS VIEW v AS
SELECT fss_agg(x, 10) AS top_10_x FROM some_stream
This will keep track of the top 10 most frequently occurring values of x. The weight associated with each value can also be given explicitly:
CREATE CONTINUOUS VIEW v AS
SELECT fss_agg_weighted(x, 10, y) AS top_10_x FROM some_stream
The first version implicitly uses a weight of 1.
There are various functions you can use to read the top-K values and their associated frequencies. For example, the following will return tuples of the form: (value, frequency):
SELECT fss_topk(top_10_x) FROM v
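As a sketch applied to the original question (top 2 values of the stream defined above; the stream and column names come from the question, the view name top_values is made up, and fss_topk is assumed to return the (value, frequency) tuples described here):
CREATE CONTINUOUS VIEW top_values AS
    SELECT fss_agg(value, 2) AS top_2 FROM stream;

-- After the ten inserts above, the two most frequent values (1 and 2)
-- should come back with their frequencies:
SELECT fss_topk(top_2) FROM top_values;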

Related

Combine JSONB array of values by consecutive pairs

In postgresql, I have a simple one JSONB column data store:
data
----------------------------
{"foo": [1,2,3,4]}
{"foo": [10,20,30,40,50,60]}
...
I need to convert consecutive pairs of values into data points, essentially calling the array variant of ST_MakeLine like this: ST_MakeLine(ARRAY[ST_MakePoint(10,20), ST_MakePoint(30,40), ST_MakePoint(50,60)]) for each row of the source data.
Needed result (note that the x,y order of each point might need to be reversed):
data                         | geometry (after decoding)
-----------------------------+---------------------------
{"foo": [1,2,3,4]}           | LINE (1 2, 3 4)
{"foo": [10,20,30,40,50,60]} | LINE (10 20, 30 40, 50 60)
...
Partial solution
I can already iterate over individual array values, but it is the pairing that is giving me trouble. Also, I am not certain if I need to introduce any ordering into the query to preserve the original ordering of the array elements.
SELECT ARRAY(
    SELECT elem::int
    FROM jsonb_array_elements(data -> 'foo') elem
) arr
FROM mytable;
You can achieve this by using window functions lead or lag, then picking only every second row:
SELECT (
    SELECT array_agg((a, b) ORDER BY o)
    FROM (
        SELECT elem::int AS a, lead(elem::int) OVER (ORDER BY o) AS b, o
        FROM jsonb_array_elements(data -> 'foo') WITH ORDINALITY els(elem, o)
    ) AS pairs
    WHERE o % 2 = 1
) AS arr
FROM mytable;
And yes, I would recommend specifying the ordering explicitly, making use of WITH ORDINALITY.
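To carry this through to the geometry the question asks for, here is a sketch along the same lines (assuming PostGIS is installed and reusing the question's mytable and data names; swap a and b in ST_MakePoint if the x,y order needs to be reversed):
SELECT (
    SELECT ST_MakeLine(ST_MakePoint(a, b) ORDER BY o)
    FROM (
        SELECT elem::int AS a, lead(elem::int) OVER (ORDER BY o) AS b, o
        FROM jsonb_array_elements(data -> 'foo') WITH ORDINALITY els(elem, o)
    ) AS pairs
    WHERE o % 2 = 1
) AS geom
FROM mytable;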

PostgreSQL10 - is it possible to do PARTITION BY LIST (col1, col2, .., colN)?

I am looking at the PostgreSQL official documentation page on Table Partitioning for my version of postgres.
I would like to partition a table over three columns, using declarative partitioning with the BY LIST method.
However, I cannot seem to find a good example of how to deal with multiple columns, and with BY LIST specifically.
In the aforementioned docs I only read:
You may decide to use multiple columns in the partition key for range partitioning, if desired. (...) For example, consider a table range partitioned using columns lastname and firstname (in that order) as the partition key.
It seems that declarative partitioning on multiple columns is only available for BY RANGE, or is that not right?
If it is not: I found an answer on SO that shows how to deal with BY LIST and one column, but in my case I have three columns.
My idea would be to do something like the following (I am pretty sure it's wrong):
CREATE TABLE my_partitioned_table (
    col1 type CONSTRAINT col1_constraint CHECK (col1 = 1 OR col1 = 0),
    col2 type CONSTRAINT col2_constraint CHECK (col2 = 'A' OR col2 = 'B'),
    col3 type,
    col4 type
) PARTITION BY LIST (col1, col2);

CREATE TABLE part_1a PARTITION OF my_partitioned_table
    FOR VALUES IN (1, 'A');
CREATE TABLE part_1b PARTITION OF my_partitioned_table
    FOR VALUES IN (1, 'B');
...
I would need a correct implementation, as the number of possible partition combinations in my case is quite large.
That is true: you cannot use list partitioning with more than one partitioning key. You also cannot bend range partitioning to do what you want.
But you could use a composite type to get what you want:
CREATE TYPE part_type AS (a integer, b text);
CREATE TABLE partme (p part_type, val text) PARTITION BY LIST (p);
CREATE TABLE partme_1_B PARTITION OF partme FOR VALUES IN (ROW(1, 'B'));
INSERT INTO partme VALUES (ROW(1, 'B'), 'x');
INSERT INTO partme VALUES (ROW(1, 'C'), 'x');
ERROR: no partition of relation "partme" found for row
DETAIL: Partition key of the failing row contains (p) = ((1,C)).
SELECT (p).a, (p).b, val FROM partme;
a | b | val
---+---+-----
1 | B | x
(1 row)
But perhaps the best way to go is to use subpartitioning: partition the original table by the first column and the partitions by the second column.
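A rough sketch of that subpartitioning idea, reusing the column names from the question (types and partition names are illustrative):
CREATE TABLE my_partitioned_table (
    col1 integer,
    col2 text,
    col3 text,
    col4 text
) PARTITION BY LIST (col1);

-- Each first-level partition is itself partitioned by the second column.
CREATE TABLE part_1 PARTITION OF my_partitioned_table
    FOR VALUES IN (1) PARTITION BY LIST (col2);
CREATE TABLE part_1a PARTITION OF part_1 FOR VALUES IN ('A');
CREATE TABLE part_1b PARTITION OF part_1 FOR VALUES IN ('B');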

Is the row order guaranteed when using a table value constructor? [duplicate]

When using a Table Value Constructor (http://msdn.microsoft.com/en-us/library/dd776382(v=sql.100).aspx) to insert multiple rows, is the order of any identity column populated guaranteed to match the rows in the TVC?
E.g.
CREATE TABLE A (a int identity(1, 1), b int)
INSERT INTO A(b) VALUES (1), (2)
Are the values of a guaranteed by the engine to be assigned in the same order as b, i.e. in this case so that they match a=1, b=1 and a=2, b=2?
Piggybacking on my comment above, and knowing that an insert / select + order by guarantees that identity values are generated in that order (#4 in this blog), you can use the table value constructor in the following fashion to accomplish your goal (not sure if this satisfies your other constraints), assuming you want the identity generation to be based on CategoryId:
insert into thetable(CategoryId, CategoryName)
select *
from (values
        (101, 'Bikes'),
        (103, 'Clothes'),
        (102, 'Accessories')
     ) AS Category(CategoryID, CategoryName)
order by CategoryId
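A quick way to check the result (this assumes thetable also has an identity column, here called Id): the generated identity values should follow CategoryId order.
select Id, CategoryId, CategoryName
from thetable
order by Id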
It depends: it holds as long as you insert the records in one shot. For example, if after inserting you delete the record where a = 2 and then re-insert the value b = 2, the identity column's value will be max(a) + 1.
To demonstrate
DECLARE @Sample TABLE
    (a int identity(1, 1), b int)

Insert into @Sample values (1), (2)

a  b
1  1
2  2

Delete from @Sample where a = 2
Insert into @Sample values (2)
Select * from @Sample

a  b
1  1
3  2

Divide records into groups - quick solution

I need to use an UPDATE command to divide rows of a PostgreSQL table (selected via a subselect) into groups; the groups will be identified by an integer value in one of the columns. The groups should all be the same size. The source table contains billions of records.
For example, I need to divide 213 selected rows into groups of 50 records each. The result will be:
rows   1 -  50 => group 1
rows  51 - 100 => group 2
rows 101 - 150 => group 3
rows 151 - 200 => group 4
rows 201 - 213 => group 5
It is no problem to do this with a loop (or with PostgreSQL window functions), but I need it to be very efficient and quick. I can't derive the group from a sequence-generated id because there may be gaps in these ids.
One idea was to use a random integer generator and set it as the default value for the column, but that is not usable when I need to control the group size.
The query below should display 213 rows with a group number from 0 to 4. Just add 1 if you want 1 to 5:
SELECT i, (row_number() OVER () - 1) / 50 AS grp
FROM generate_series(1001,1213) i
ORDER BY i;
create temporary sequence s minvalue 0 start with 0;
select *, nextval('s') / 50 grp
from t;
drop sequence s;
I think it has the potential to be faster than @Richard's row_number version, but the difference may not be relevant depending on the specifics.
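Either way, the group number can then be fed into the UPDATE the question asks for. A sketch (table and column names are illustrative: src has a primary key id and a grp column to be filled, and the WHERE clause stands in for the question's subselect):
UPDATE src t
SET grp = g.grp
FROM (
    SELECT id, (row_number() OVER (ORDER BY id) - 1) / 50 + 1 AS grp
    FROM src
    WHERE grp IS NULL  -- stand-in for the subselect that picks the rows
) g
WHERE t.id = g.id;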

Pixel values of raster records to be inserted in the table as columns

I have a table with following columns:
(ID, row_num, col_num, pix_centroid, pix_val1).
I have more than 1000 records. I am inserting my data using:
insert into pixelbased (row_num, col_num, pix_centroid, pix_val1)
select
    (ST_PixelAsPolygons(rast, 1)).x as X,
    (ST_PixelAsPolygons(rast, 1)).y as Y,
    (ST_Centroid((ST_PixelAsPolygons(rast, 1)).geom)) as geom,
    (ST_PixelAsPolygons(rast, 1)).val as pix_val1
from mytable
where rid = 1;
Now I am trying to insert all the other records as columns, and the pix_val1 column is the important one for me. All the other columns will remain the same. In other words, I want the final table to have these columns:
(ID, row_num, col_num, pix_centroid, pix_val1, pix_val2, pix_val3, ....)
Is there a way to do it?
I would want to store this data as a bitmap in a bytea if possible. Here's how to take a series of byte values and turn it into a bytea:
WITH bytes(b) AS (SELECT x % 256 FROM generate_series(1,53000) x)
SELECT ('\x'||string_agg(lpad(to_hex(b),2,'0'),''))::bytea FROM bytes;
You can access fields or ranges of the byte array using the substr function. This bytea is organized as a linear pixel array, but you may find it more useful to organize it into a more traditional bitmap format. Also, if your pixels are more than one byte you may need to cope with big-endian vs little-endian. You could do that in SQL, but it's likely to be much easier in a procedural language like PL/Perl.
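For example, with one byte per pixel and a known image width, a single pixel could be looked up with get_byte (table and column names are hypothetical):
-- pixel at row 42, column 7 of a 530-pixel-wide image stored in img
SELECT get_byte(img, 42 * 530 + 7) AS pix_val
FROM pixel_bitmaps;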
Failing that, a multidimensional array would be a somewhat reasonable choice.
Using a generate_series statement as a substitute for your pix_val field for convenient testing, this query produces a two-dimensional array of integers using two aggregation passes:
SELECT ('{'||string_agg(subarray, ',')||'}')::integer[] AS arr
FROM (
    SELECT array_agg(x order by x)::text
    FROM generate_series(1,53000) x
    GROUP BY width_bucket(x, 1, 53001, 100)
) a(subarray);
The unfortunate use of the string literal form of the two dimensional array is made necessary by the fact that array_agg cannot aggregate arrays. In my view this is a real wart in PostgreSQL; in general its multidimensional arrays are odd to work with and inconsistent with how most applications and languages implement arrays.
You can get fields out of the array by indexing it. Example:
regress=> SELECT ('{'||string_agg(subarray, ',')||'}')::integer[] AS arr INTO test FROM (SELECT array_agg(x order by x)::text from generate_series(1,53000) x GROUP BY width_bucket(x, 1, 53001, 100)) a(subarray);
regress=> \d test
Table "public.test"
Column | Type | Modifiers
--------+-----------+-----------
arr | integer[] |
test contains a single array with two dimensions:
regress=> \x
regress=> select array_dims(test.arr), array_ndims(test.arr), array_length(test.arr,1), array_length(test.arr,2) FROM test;
-[ RECORD 1 ]+---------------
array_dims | [1:100][1:530]
array_ndims | 2
array_length | 100
array_length | 530
I can get elements with two-level indexing:
regress=> SELECT test.arr[4][4] FROM test;
arr
------
1594
(1 row)
or a "column" with slicing:
regress=> SELECT test.arr[4:4][1:530] FROM test;
Oddly, this is still a two-dimensional array, the top dimension is just one element deep. You can flatten it (inefficiently) with unnest and array_agg if you need to.
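For instance, a sketch of that flattening, using the test table from above:
SELECT array_agg(v ORDER BY o) AS flat
FROM test, unnest(test.arr[4:4][1:530]) WITH ORDINALITY AS u(v, o);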
Two-dimensional arrays in PostgreSQL are somewhat weird, as you can see, but so is what you're trying to do.