Find clusters of values using PostgreSQL
Consider the following example table:
CREATE TABLE rndtbl AS
SELECT
generate_series(1, 10) AS id,
random() AS val;
I want to assign each id a cluster_id such that the clusters are at least 0.1 apart from each other. How would I calculate such a cluster assignment?
A specific example would be:
select * from rndtbl ;
id | val
----+-------------------
1 | 0.485714662820101
2 | 0.185201027430594
3 | 0.368477711919695
4 | 0.687312887981534
5 | 0.978742253035307
6 | 0.961830694694072
7 | 0.10397826647386
8 | 0.644958863966167
9 | 0.912827260326594
10 | 0.196085536852479
(10 rows)
The result would be: ids (2, 7, 10) in one cluster, (5, 6, 9) in another, (4, 8) in another, and (1) and (3) as singleton clusters.
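For the sample above, the desired id-to-cluster mapping might look like this (numbering the clusters by ascending val; the labels themselves are arbitrary):

 id | cluster_id
----+------------
  1 |          3
  2 |          1
  3 |          2
  4 |          4
  5 |          5
  6 |          5
  7 |          1
  8 |          4
  9 |          5
 10 |          1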
Starting from this sample data (the values differ from the question's because random() generates new ones each time the table is created):
SELECT * FROM rndtbl ;
┌────┬────────────────────┐
│ id │ val │
├────┼────────────────────┤
│ 1 │ 0.153776332736015 │
│ 2 │ 0.572575284633785 │
│ 3 │ 0.998213059268892 │
│ 4 │ 0.654628816060722 │
│ 5 │ 0.692200613208115 │
│ 6 │ 0.572836415842175 │
│ 7 │ 0.0788379465229809 │
│ 8 │ 0.390280921943486 │
│ 9 │ 0.611408909317106 │
│ 10 │ 0.555164183024317 │
└────┴────────────────────┘
(10 rows)
Use the LAG window function to determine whether the current row starts a new cluster:
SELECT *, val - LAG(val) OVER (ORDER BY val) > 0.1 AS new_cluster
FROM rndtbl ;
┌────┬────────────────────┬─────────────┐
│ id │ val │ new_cluster │
├────┼────────────────────┼─────────────┤
│ 7 │ 0.0788379465229809 │ (null) │
│ 1 │ 0.153776332736015 │ f │
│ 8 │ 0.390280921943486 │ t │
│ 10 │ 0.555164183024317 │ t │
│ 2 │ 0.572575284633785 │ f │
│ 6 │ 0.572836415842175 │ f │
│ 9 │ 0.611408909317106 │ f │
│ 4 │ 0.654628816060722 │ f │
│ 5 │ 0.692200613208115 │ f │
│ 3 │ 0.998213059268892 │ t │
└────┴────────────────────┴─────────────┘
(10 rows)
Finally, SUM the number of true values (still ordering by val) to get each row's cluster number (counting from 0):
SELECT *, SUM(COALESCE(new_cluster::int, 0)) OVER (ORDER BY val) AS nb_cluster
FROM (
SELECT *, val - LAG(val) OVER (ORDER BY val) > 0.1 AS new_cluster
FROM rndtbl
) t
;
┌────┬────────────────────┬─────────────┬────────────┐
│ id │ val │ new_cluster │ nb_cluster │
├────┼────────────────────┼─────────────┼────────────┤
│ 7 │ 0.0788379465229809 │ (null) │ 0 │
│ 1 │ 0.153776332736015 │ f │ 0 │
│ 8 │ 0.390280921943486 │ t │ 1 │
│ 10 │ 0.555164183024317 │ t │ 2 │
│ 2 │ 0.572575284633785 │ f │ 2 │
│ 6 │ 0.572836415842175 │ f │ 2 │
│ 9 │ 0.611408909317106 │ f │ 2 │
│ 4 │ 0.654628816060722 │ f │ 2 │
│ 5 │ 0.692200613208115 │ f │ 2 │
│ 3 │ 0.998213059268892 │ t │ 3 │
└────┴────────────────────┴─────────────┴────────────┘
(10 rows)
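Putting the two steps together, here is a minimal sketch of a single query that returns a cluster_id for each id, assuming the same rndtbl table and the 0.1 threshold from the question (cluster numbering starts at 1, and the cluster_id alias is just illustrative):

-- assign each id a cluster_id; rows closer than 0.1 to their predecessor share a cluster
SELECT id,
       val,
       1 + SUM(CASE WHEN new_cluster THEN 1 ELSE 0 END)
               OVER (ORDER BY val) AS cluster_id
FROM (
    SELECT id,
           val,
           -- a row starts a new cluster when it is more than 0.1 above the previous value
           COALESCE(val - LAG(val) OVER (ORDER BY val) > 0.1, false) AS new_cluster
    FROM rndtbl
) t
ORDER BY id;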
Related
Is it possible in Polars to "reset" cumsum() at a certain condition?
I need to cumsum the column b until a becomes True. After that, cumsum shall start again from this row, and so on.

a     | b
------|---
False | 1
False | 2
True  | 3
False | 4

Can I do it in Polars without looping over each row?
You could use the .cumsum() of the a column as the "group number".

>>> df.select(pl.col("a").cumsum())
shape: (4, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 0   │
├╌╌╌╌╌┤
│ 0   │
├╌╌╌╌╌┤
│ 1   │
├╌╌╌╌╌┤
│ 1   │
└─────┘

And use that with .over():

>>> df.select(pl.col("b").cumsum().over(pl.col("a").cumsum()))
shape: (4, 1)
┌─────┐
│ b   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
├╌╌╌╌╌┤
│ 3   │
├╌╌╌╌╌┤
│ 3   │
├╌╌╌╌╌┤
│ 7   │
└─────┘

You can .shift().backward_fill() to include the True row:

>>> df.select(pl.col("b").cumsum().over(
...     pl.col("a").cumsum().shift().backward_fill()))
shape: (4, 1)
┌─────┐
│ b   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
├╌╌╌╌╌┤
│ 3   │
├╌╌╌╌╌┤
│ 6   │
├╌╌╌╌╌┤
│ 4   │
└─────┘
Is there any way to return a result set (of different tables) from a stored procedure / function
I want to fetch data from more than one table in a single go. Is there any way to return a result set (of different tables) from a stored procedure / function?
A dynamic refcursor can be used:

CREATE OR REPLACE FUNCTION public.foo(tabname character varying)
RETURNS refcursor
LANGUAGE plpgsql
AS $function$
declare r refcursor;
begin
  open r for execute format('select * from %I', tabname);
  return r;
end;
$function$

postgres=# begin;
BEGIN
postgres=# select foo('pg_class');
┌────────────────────┐
│        foo         │
╞════════════════════╡
│ <unnamed portal 1> │
└────────────────────┘
(1 row)

postgres=# fetch 10 from "<unnamed portal 1>";
(10 rows of pg_class returned; wide output truncated in the original)

postgres=# close "<unnamed portal 1>";
CLOSE CURSOR
postgres=# select foo('pg_proc');
┌────────────────────┐
│        foo         │
╞════════════════════╡
│ <unnamed portal 2> │
└────────────────────┘
(1 row)

postgres=# fetch 3 from "<unnamed portal 2>";
(3 rows of pg_proc returned; wide output truncated in the original)

postgres=# close "<unnamed portal 2>";
CLOSE CURSOR
postgres=# commit;
COMMIT

For more cursors:

CREATE OR REPLACE FUNCTION public.foo2(tabname1 character varying, tabname2 character varying)
RETURNS SETOF refcursor
LANGUAGE plpgsql
AS $function$
declare r refcursor;
begin
  open r for execute format('select * from %I', tabname1);
  return next r;
  r := NULL; -- re-initialize the refcursor
  open r for execute format('select * from %I', tabname2);
  return next r;
end;
$function$

postgres=# begin;
BEGIN
postgres=# select * from foo2('pg_class', 'pg_proc');
┌────────────────────┐
│        foo2        │
╞════════════════════╡
│ <unnamed portal 5> │
│ <unnamed portal 6> │
└────────────────────┘
(2 rows)

postgres=# fetch 2 from "<unnamed portal 5>";
(2 rows of pg_class returned; wide output truncated in the original)

postgres=# fetch 2 from "<unnamed portal 6>";
(2 rows of pg_proc returned; wide output truncated in the original)

postgres=# commit;
COMMIT

A cursor is a pointer to an open (executing) query. A refcursor value is a handle to a cursor in string form. You can specify your own name, or PL/pgSQL generates a unique one when the refcursor variable is NULL at the time it is used in an OPEN statement.
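If you would rather have a predictable cursor name than "<unnamed portal N>", here is a minimal sketch of a variant (the function name foo_named and the cursor name mycur are illustrative, not part of the answer above): assigning a non-NULL string to the refcursor variable before OPEN makes PL/pgSQL use that string as the portal name.

CREATE OR REPLACE FUNCTION public.foo_named(tabname character varying)
RETURNS refcursor
LANGUAGE plpgsql
AS $function$
declare
  r refcursor := 'mycur';  -- a non-NULL value fixes the portal name
begin
  open r for execute format('select * from %I', tabname);
  return r;
end;
$function$;

-- usage, still inside a transaction:
-- begin;
-- select foo_named('pg_class');
-- fetch 5 from mycur;
-- commit;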
Filter by all parts of an LTREE field
Let's say I have a table people with the following columns: name (string) and mothers_hierarchy (ltree), e.g. "josef", "maria.jenny.lisa". How do I find all mothers of Josef in the people table? I'm searching for an expression like this one (that actually works):

SELECT * FROM people
WHERE name IN (
  SELECT mothers_hierarchy FROM people WHERE name = 'josef'
);
You can cast the names to ltree and then use index() to see if they are contained:

# select * from people;
┌───────┬───────────────────────┐
│ name  │   mothers_hierarchy   │
├───────┼───────────────────────┤
│ josef │ maria.jenny.lisa      │
│ maria │ maria                 │
│ jenny │ maria.jenny           │
│ lisa  │ maria.jenny.lisa      │
│ kate  │ maria.jenny.lisa.kate │
└───────┴───────────────────────┘
(5 rows)

# select *
  from people j
  join people m on index(j.mothers_hierarchy, m.name::ltree) >= 0
  where j.name = 'josef';
┌───────┬───────────────────┬───────┬───────────────────┐
│ name  │ mothers_hierarchy │ name  │ mothers_hierarchy │
├───────┼───────────────────┼───────┼───────────────────┤
│ josef │ maria.jenny.lisa  │ maria │ maria             │
│ josef │ maria.jenny.lisa  │ jenny │ maria.jenny       │
│ josef │ maria.jenny.lisa  │ lisa  │ maria.jenny.lisa  │
└───────┴───────────────────┴───────┴───────────────────┘
(3 rows)
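A hedged alternative sketch (not from the answer above): if, as in the sample data, each mother's own row stores her full path in mothers_hierarchy, the ltree ancestor operator @> can express the same join and can use a GiST index on the ltree column:

-- assumes every mother's row stores her own full path, as in the table above
select m.*
from people j
join people m
  on m.mothers_hierarchy @> j.mothers_hierarchy  -- m's path is an ancestor of (or equal to) j's
 and m.name <> j.name                            -- exclude josef's own row, which shares the same path
where j.name = 'josef';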
Counting the same positional bits in PostgreSQL bitmasks
I am trying to count each same-position bit across multiple bitmasks in PostgreSQL. Here is an example of the problem. Suppose I have three bitmasks (in binary) like:

011011011100110
100011010100101
110110101010101

Now what I want is the total count of set bits in each separate column, treating the masks above as three rows and multiple columns. E.g. the first column has count 2, the second one has count 2, the third one has count 1, and so on. In reality I have a total of 30 bits in each bitmask in my database. I want to do this in PostgreSQL. I am open to further explanation of the problem if needed.
You could do it by using the get_bit function and a couple of joins:

SELECT sum(bit) FILTER (WHERE i = 0) AS count_0,
       sum(bit) FILTER (WHERE i = 1) AS count_1,
       ...
       sum(bit) FILTER (WHERE i = 29) AS count_29
FROM bits
CROSS JOIN generate_series(0, 29) AS i
CROSS JOIN LATERAL get_bit(b, i) AS bit;

The column with the bit string is b in my example.
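If you would rather get one output row per bit position instead of 30 hand-written columns, a minimal sketch of the same idea (assuming the same table bits with bit-string column b as above):

-- one row per bit position: the position and how many masks have that bit set
SELECT i AS bit_position,
       sum(get_bit(b, i)) AS set_count
FROM bits
CROSS JOIN generate_series(0, 29) AS i
GROUP BY i
ORDER BY i;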
You could use the bitwise AND operator (&) and bigint arithmetic, as long as your bit strings contain 63 bits or fewer:

# create table bmasks (mask bit(15));
CREATE TABLE
# insert into bmasks values ('011011011100110'), ('100011010100101'), ('110110101010101');
INSERT 0 3

# with masks as (
    select (2 ^ x)::bigint::bit(15) as mask, x as posn
    from generate_series(0, 14) as gs(x)
  )
  select m.posn, m.mask, sum((b.mask & m.mask > 0::bit(15))::int) as set_bits
  from masks m
  cross join bmasks b
  group by m.posn, m.mask;
┌──────┬─────────────────┬──────────┐
│ posn │      mask       │ set_bits │
├──────┼─────────────────┼──────────┤
│    0 │ 000000000000001 │        2 │
│    1 │ 000000000000010 │        1 │
│    2 │ 000000000000100 │        3 │
│    3 │ 000000000001000 │        0 │
│    4 │ 000000000010000 │        1 │
│    5 │ 000000000100000 │        2 │
│    6 │ 000000001000000 │        2 │
│    7 │ 000000010000000 │        2 │
│    8 │ 000000100000000 │        1 │
│    9 │ 000001000000000 │        2 │
│   10 │ 000010000000000 │        3 │
│   11 │ 000100000000000 │        1 │
│   12 │ 001000000000000 │        1 │
│   13 │ 010000000000000 │        2 │
│   14 │ 100000000000000 │        2 │
└──────┴─────────────────┴──────────┘
(15 rows)
Create lag / lead time series with by groups in Julia?
I am wondering if there is an easy way to create a lag (or lead) of a time series variable in Julia according to a by-group or condition. For example, I have a dataset of the following form:

julia> df1 = DataFrame(var1=["a","a","a","a","b","b","b","b"], var2=[0,1,2,3,0,1,2,3])
8×2 DataFrame
│ Row │ var1   │ var2  │
│     │ String │ Int64 │
├─────┼────────┼───────┤
│ 1   │ a      │ 0     │
│ 2   │ a      │ 1     │
│ 3   │ a      │ 2     │
│ 4   │ a      │ 3     │
│ 5   │ b      │ 0     │
│ 6   │ b      │ 1     │
│ 7   │ b      │ 2     │
│ 8   │ b      │ 3     │

I want to create a variable lag2 that contains the values in var2 lagged by 2. However, this should be done grouped by var1 so that the first two observations in the 'b' group do not get the last two values of the 'a' group. Rather, they should be set to missing, zero, or some default value. I have tried the following code, which produces the error below.

julia> df2 = df1 |> @groupby(_.var1) |> @mutate(lag2 = lag(_.var2, 2)) |> DataFrame
ERROR: MethodError: no method matching merge(::Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}}, ::NamedTuple{(:lag2,),Tuple{ShiftedArray{Int64,Missing,1,QueryOperators.GroupColumnArrayView{Int64,Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},:var2}}}})
Closest candidates are:
  merge(::NamedTuple{,T} where T<:Tuple, ::NamedTuple) at namedtuple.jl:245
  merge(::NamedTuple{an,T} where T<:Tuple, ::NamedTuple{bn,T} where T<:Tuple) where {an, bn} at namedtuple.jl:233
  merge(::NamedTuple, ::NamedTuple, ::NamedTuple...) at namedtuple.jl:249
  ...
Stacktrace:
 [1] (::var"#437#442")(::Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}}) at /Users/kayvon/.julia/packages/Query/AwBtd/src/query_translation.jl:58
 [2] iterate at /Users/kayvon/.julia/packages/QueryOperators/g4G21/src/enumerable/enumerable_map.jl:25 [inlined]
 [3] iterate at /Users/kayvon/.julia/packages/Tables/TjjiP/src/tofromdatavalues.jl:45 [inlined]
 [4] buildcolumns at /Users/kayvon/.julia/packages/Tables/TjjiP/src/fallbacks.jl:185 [inlined]
 [5] columns at /Users/kayvon/.julia/packages/Tables/TjjiP/src/fallbacks.jl:237 [inlined]
 [6] #DataFrame#453(::Bool, ::Type{DataFrame}, ::QueryOperators.EnumerableMap{Union{},QueryOperators.EnumerableIterable{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},QueryOperators.EnumerableGroupBy{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},String,NamedTuple{(:var1, :var2),Tuple{String,Int64}},QueryOperators.EnumerableIterable{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.DataValueRowIterator{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.Schema{(:var1, :var2),Tuple{String,Int64}},Tables.RowIterator{NamedTuple{(:var1, :var2),Tuple{Array{String,1},Array{Int64,1}}}}}},var"#434#439",var"#435#440"}},var"#437#442"}) at /Users/kayvon/.julia/packages/DataFrames/S3ZFo/src/other/tables.jl:40
 [7] DataFrame(::QueryOperators.EnumerableMap{Union{},QueryOperators.EnumerableIterable{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},QueryOperators.EnumerableGroupBy{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},String,NamedTuple{(:var1, :var2),Tuple{String,Int64}},QueryOperators.EnumerableIterable{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.DataValueRowIterator{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.Schema{(:var1, :var2),Tuple{String,Int64}},Tables.RowIterator{NamedTuple{(:var1, :var2),Tuple{Array{String,1},Array{Int64,1}}}}}},var"#434#439",var"#435#440"}},var"#437#442"}) at /Users/kayvon/.julia/packages/DataFrames/S3ZFo/src/other/tables.jl:31
 [8] |>(::QueryOperators.EnumerableMap{Union{},QueryOperators.EnumerableIterable{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},QueryOperators.EnumerableGroupBy{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},String,NamedTuple{(:var1, :var2),Tuple{String,Int64}},QueryOperators.EnumerableIterable{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.DataValueRowIterator{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.Schema{(:var1, :var2),Tuple{String,Int64}},Tables.RowIterator{NamedTuple{(:var1, :var2),Tuple{Array{String,1},Array{Int64,1}}}}}},var"#434#439",var"#435#440"}},var"#437#442"}, ::Type) at ./operators.jl:854
 [9] top-level scope at none:0

I would appreciate any help with this approach or alternate approaches. Thanks.
EDIT (putting this edit at the top as it works in DataFrames 1.0 and so reflects the stable API): under DataFrames.jl 0.22.2 the correct syntax is:

julia> combine(groupby(df1, :var1), :var2 => Base.Fix2(lag, 2) => :var2_l2)
8×2 DataFrame
 Row │ var1    var2_l2
     │ String  Int64?
─────┼─────────────────
   1 │ a       missing
   2 │ a       missing
   3 │ a             0
   4 │ a             1
   5 │ b       missing
   6 │ b       missing
   7 │ b             0
   8 │ b             1

As an alternative to the maybe slightly arcane Base.Fix2 syntax, you could use an anonymous function (x -> lag(x, 2)) (note that the enclosing parentheses are required due to operator precedence).

Original answer: You definitely had the right idea - I don't work with Query.jl, but this can easily be done with basic DataFrames syntax:

julia> using DataFrames

julia> import ShiftedArrays: lag

julia> df1 = DataFrame(var1=["a","a","a","a","b","b","b","b"], var2=[0,1,2,3,0,1,2,3]);

julia> by(df1, :var1, var2_l2 = :var2 => Base.Fix2(lag, 2))
8×2 DataFrame
│ Row │ var1   │ var2_l2 │
│     │ String │ Int64⍰  │
├─────┼────────┼─────────┤
│ 1   │ a      │ missing │
│ 2   │ a      │ missing │
│ 3   │ a      │ 0       │
│ 4   │ a      │ 1       │
│ 5   │ b      │ missing │
│ 6   │ b      │ missing │
│ 7   │ b      │ 0       │
│ 8   │ b      │ 1       │

Note that I used Base.Fix2 here to get a single-argument version of lag. This is essentially the same as defining your own l2(x) = lag(x, 2) and then using l2 in the by call. If you do define your own l2 function, you can also set the default value, like l2(x) = lag(x, 2, default = -1000), if you want to avoid missing values:

julia> l2(x) = lag(x, 2, default = -1000)
l2 (generic function with 1 method)

julia> by(df1, :var1, var2_l2 = :var2 => l2)
8×2 DataFrame
│ Row │ var1   │ var2_l2 │
│     │ String │ Int64   │
├─────┼────────┼─────────┤
│ 1   │ a      │ -1000   │
│ 2   │ a      │ -1000   │
│ 3   │ a      │ 0       │
│ 4   │ a      │ 1       │
│ 5   │ b      │ -1000   │
│ 6   │ b      │ -1000   │
│ 7   │ b      │ 0       │
│ 8   │ b      │ 1       │