Aerospike - Query on Map Keys - nosql

I have a question about Aerospike DB.
I have a set of students, and each student (the record key is StudentId) has a map bin of <CourseId, Grade>.
I'm trying to write some queries, and I'm not sure what the correct way to do it is.
I have a variable containing a List<String> of courseIds.
The queries that I want to create are:
For each student, get all the courseIds that exist both in the map and in the list.
For each student, get all the courseIds that exist only in their map, and not in the list.
What is the best approach here? Should I use a UDF?
Thanks.

This is the kind of thing a record UDF is good for - extending functionality that doesn't yet exist in predicate filtering. The record UDF can take the bin name as the first argument, your list variable as its second argument, and an optional third argument for deciding whether this is an 'IN' or 'NOT IN', then iterate through it against the map of course IDs.
You can apply this record UDF to every record matched by a scan or query running against the set containing the students.
test.lua
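function list_compare(rec, bin, l, not_in_l)
  if rec[bin] then
    local b = rec[bin]
    local iter
    -- Pick an iterator depending on whether the bin holds a list or a map
    if (tostring(getmetatable(rec[bin])) == tostring(getmetatable(list()))) then
      iter = list.iterator
    elseif (tostring(getmetatable(rec[bin])) == tostring(getmetatable(map()))) then
      -- This compares the map's values; use map.keys to compare its keys instead
      iter = map.values
    else
      return nil
    end
    local s = {}
    local l_keys = {}
    -- For 'NOT IN', index the list elements for quick lookup
    if (not_in_l ~= nil) then
      for v in list.iterator(l) do
        l_keys[v] = 1
      end
    end
    for i in list.iterator(l) do
      for v in iter(b) do
        if (not_in_l == nil) then
          -- 'IN': keep bin elements that also appear in the list
          if (i == v) then
            s[v] = 1
          end
        else
          -- 'NOT IN': keep bin elements that do not appear in the list
          if (i ~= v and not l_keys[v]) then
            s[v] = 1
          end
        end
      end
    end
    -- Return the matches as a sorted list
    local keys = {}
    for k,v in pairs(s) do
      table.insert(keys, k)
    end
    table.sort(keys)
    return list(keys)
  end
end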
In AQL:
$ aql
Aerospike Query Client
Version 3.15.1.2
C Client Version 4.3.0
Copyright 2012-2017 Aerospike. All rights reserved.
aql> register module './test.lua'
OK, 1 module added.
aql> insert into test.demo (PK,i,s,m,l) values ('88',6,'six',MAP('{"a":2, "b":4, "c":8, "d":16}'),LIST('[2, 4, 8, 16, 32, 128, 256]'))
OK, 1 record affected.
aql> select * from test.demo where PK='88'
+---+-------+--------------------------------------+-------------------------------------+
| i | s | m | l |
+---+-------+--------------------------------------+-------------------------------------+
| 6 | "six" | MAP('{"a":2, "b":4, "c":8, "d":16}') | LIST('[2, 4, 8, 16, 32, 128, 256]') |
+---+-------+--------------------------------------+-------------------------------------+
1 row in set (0.001 secs)
aql> execute test.list_compare("l", LIST('[1,2,3,4]')) on test.demo where PK='88'
+----------------+
| list_compare |
+----------------+
| LIST('[2, 4]') |
+----------------+
1 row in set (0.002 secs)
aql> execute test.list_compare("l", LIST('[1,2,3,4]'),1) on test.demo where PK='88'
+-------------------------------+
| list_compare |
+-------------------------------+
| LIST('[8, 16, 32, 128, 256]') |
+-------------------------------+
1 row in set (0.001 secs)
aql> execute test.list_compare("m", LIST('[1,2,3,4]')) on test.demo where PK='88'
+----------------+
| list_compare |
+----------------+
| LIST('[2, 4]') |
+----------------+
1 row in set (0.001 secs)
aql> execute test.list_compare("m", LIST('[1,2,3,4]'), 1) on test.demo where PK='88'
+-----------------+
| list_compare |
+-----------------+
| LIST('[8, 16]') |
+-----------------+
1 row in set (0.000 secs)
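Note that the question asks about map keys (the course IDs), while the UDF above iterates over map values. As a rough illustration of the same 'IN' / 'NOT IN' logic applied to keys, here is a minimal Python sketch. It is not Aerospike client code: the function name compare_course_ids, the sample data, and the use of a plain dict in place of the map bin are all assumptions made for the example.

```python
def compare_course_ids(grades, course_ids, not_in=False):
    """Return the sorted map keys that are in course_ids,
    or, with not_in=True, the keys that are not in course_ids."""
    wanted = set(course_ids)  # set lookup avoids a nested loop
    if not_in:
        return sorted(k for k in grades if k not in wanted)
    return sorted(k for k in grades if k in wanted)


# Hypothetical student record: a map of <CourseId, Grade>
student_grades = {"math": 90, "physics": 75, "art": 82}
course_ids = ["math", "art", "history"]

print(compare_course_ids(student_grades, course_ids))               # keys in both
print(compare_course_ids(student_grades, course_ids, not_in=True))  # keys only in the map
```

The same set-based lookup could replace the nested loop in the Lua UDF if the lists get long.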

Related

How can I filter on the results of a set function

I have a function that returns a SETOF, and I want to filter on the set column.
Here's a minimal reproducible example of what I'm trying to do:
=> \d test1
Table "public.test1"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+---------
a | jsonb | | |
b | integer | | |
=> SELECT * FROM test1;
a | b
--------------------------+----
{"X": 1, "Y": 2} | 17
{"X": 4, "Y": 8, "Z": 3} | 22
(2 rows)
=> \ef test1function
CREATE OR REPLACE FUNCTION public.test1function(_item test1)
RETURNS SETOF text
LANGUAGE sql
AS $function$
SELECT jsonb_object_keys(_item.a)
$function$
With this setup, I can do queries like so:
=> SELECT test1.b, test1.test1function FROM test1;
b | test1function
----+---------------
17 | X
17 | Y
22 | X
22 | Y
22 | Z
(5 rows)
However, if I try to filter on the test1function field, I don't seem to be able to:
=> SELECT test1.b, test1.test1function FROM test1 HAVING test1function = "Z";
ERROR: column "test1function" does not exist
=> SELECT test1.b, test1.test1function FROM test1 HAVING test1.test1function = "Z";
ERROR: set-returning functions are not allowed in HAVING
Note: I am aware that, for this actual example, I could just write something like
SELECT b, 'Z' AS test1function FROM test1 WHERE a -> 'Z' IS NOT NULL;
b | test1function
----+---------------
22 | Z
(1 row)
As it happens, though, my actual analogue of test1function is more complicated than just a call to jsonb_object_keys.
Is it just impossible to filter on the results of something returning SETOF at all?
EDIT: I'm also aware that I can do something like
=> SELECT * FROM (SELECT test1.b, test1.test1function FROM test1) q WHERE q.test1function = 'X';
b | test1function
----+---------------
17 | X
22 | X
(2 rows)
But that's awful... do I really have to do a subquery just to give this field a name I can reference?
You need to put set returning functions into the FROM clause. If you have to pass a column from a table, you need a lateral join. Then you can reference the columns of the function in the WHERE clause:
select *
from test1 t
cross join lateral test1function(t) as x(item)
where x.item = 'Z';
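To make the semantics of the lateral join concrete, here is a small Python sketch of what the query above does conceptually: the function is called once per row of test1, each returned item is paired with that row, and the filter is applied to the paired rows. The helper cross_join_lateral and the dict-based rows are assumptions for illustration only; this is not how PostgreSQL executes the query.

```python
import json


def test1function(row):
    """Stand-in for the set-returning SQL function: yields the keys of column a."""
    yield from json.loads(row["a"]).keys()


def cross_join_lateral(rows, set_returning_function):
    """Pair every input row with each item the function returns for that row."""
    for row in rows:
        for item in set_returning_function(row):
            yield {**row, "item": item}


test1 = [
    {"a": '{"X": 1, "Y": 2}', "b": 17},
    {"a": '{"X": 4, "Y": 8, "Z": 3}', "b": 22},
]

# Equivalent of: ... WHERE x.item = 'Z'
matches = [(r["b"], r["item"])
           for r in cross_join_lateral(test1, test1function)
           if r["item"] == "Z"]
print(matches)
```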

pyspark: how to modify column value based on other columns for the same Id

I have a pyspark dataframe with 5 columns: Id, a value X, lower & upper bounds of X and the update date (this dataframe is ordered by "Id, Update"). I read it from a hive table:
spark.sql("SELECT * FROM table1 ORDER BY Update")
+---+----------+----------+----------+----------+
| Id| X| LB| UB| Update|
+---+----------+----------+----------+----------+
| 1|2019-01-20|2019-01-15|2019-01-25|2019-01-02|
| 1|2019-01-17|2019-01-15|2019-01-25|2019-01-03|
| 1|2019-01-10|2019-01-15|2019-01-25|2019-01-05|
| 1|2019-01-12|2019-01-15|2019-01-25|2019-01-07|
| 1|2019-01-15|2019-01-15|2019-01-25|2019-01-08|
| 2|2018-12-12|2018-12-07|2018-12-17|2018-11-17|
| 2|2018-12-15|2018-12-07|2018-12-17|2018-11-18|
When "X" is lower than "LB" or greater than "UB", "LB" and "UB" should be recomputed from X, and the new bounds should apply to all the following rows with the same Id:
if (X < LB or X > UB):
    LB = X - 5 (in days)
    UB = X + 5 (in days)
The result should be like that:
+---+----------+----------+----------+----------+
| Id| X| LB| UB| Update|
+---+----------+----------+----------+----------+
| 1|2019-01-20|2019-01-15|2019-01-25|2019-01-02|
| 1|2019-01-17|2019-01-15|2019-01-25|2019-01-03|
| 1|2019-01-10|2019-01-05|2019-01-15|2019-01-05|
| 1|2019-01-12|2019-01-05|2019-01-15|2019-01-07|
| 1|2019-01-15|2019-01-05|2019-01-15|2019-01-08|
| 2|2018-12-12|2018-12-07|2018-12-17|2018-11-17|
| 2|2018-12-15|2018-12-07|2018-12-17|2018-11-18|
The third, fourth and fifth rows are changed.
How can I achieve this?
Try a CASE statement within selectExpr:
df.selectExpr("Id AS Id",
"X AS X",
"CASE WHEN X<LB OR X>UB THEN date_sub(X,5) ELSE LB END AS LB",
"CASE WHEN X<LB OR X>UB THEN date_add(X,5) ELSE UB END AS UB",
"Update AS Update").show()

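One caveat about the CASE expression above: it is evaluated for each row on its own, so recomputed bounds are not carried over to the following rows of the same Id. With the sample data, the fourth row (X = 2019-01-12) would get LB = 2019-01-07 rather than the expected 2019-01-05. The carry-forward behaviour needs some per-Id state. Below is a minimal plain-Python sketch of that logic, assuming the rows are already sorted by Id and Update and that the first row of each Id holds the initial bounds; the function name recompute_bounds and the tuple layout are made up for the example. In Spark, something similar could be applied per group, for instance with groupBy("Id").applyInPandas, though that is not shown here.

```python
from datetime import date, timedelta


def recompute_bounds(rows, window_days=5):
    """rows: (id, x, lb, ub, update) tuples sorted by (id, update).
    Returns the rows with LB/UB carried forward within each id."""
    out = []
    current_id = None
    lb = ub = None
    for rec_id, x, row_lb, row_ub, update in rows:
        if rec_id != current_id:
            # New Id: start from this row's own bounds
            current_id, lb, ub = rec_id, row_lb, row_ub
        if x < lb or x > ub:
            # X left the window: recentre it and keep it for later rows
            lb = x - timedelta(days=window_days)
            ub = x + timedelta(days=window_days)
        out.append((rec_id, x, lb, ub, update))
    return out


rows = [
    (1, date(2019, 1, 20), date(2019, 1, 15), date(2019, 1, 25), date(2019, 1, 2)),
    (1, date(2019, 1, 17), date(2019, 1, 15), date(2019, 1, 25), date(2019, 1, 3)),
    (1, date(2019, 1, 10), date(2019, 1, 15), date(2019, 1, 25), date(2019, 1, 5)),
    (1, date(2019, 1, 12), date(2019, 1, 15), date(2019, 1, 25), date(2019, 1, 7)),
    (1, date(2019, 1, 15), date(2019, 1, 15), date(2019, 1, 25), date(2019, 1, 8)),
    (2, date(2018, 12, 12), date(2018, 12, 7), date(2018, 12, 17), date(2018, 11, 17)),
    (2, date(2018, 12, 15), date(2018, 12, 7), date(2018, 12, 17), date(2018, 11, 18)),
]

for row in recompute_bounds(rows):
    print(*row)
```
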
Aggregate all combinations of rows taken k at a time

I am trying to calculate an aggregate function for a field for a subset of rows in a table. The problem is that I'd like to find the mean of every combination of rows taken k at a time --- so for all the rows, I'd like to find (say) the mean of every combination of 10 rows. So:
id | count
----|------
1 | 5
2 | 3
3 | 6
...
30 | 16
should give me
mean of ids 1..10; ids 1, 3..11; ids 1, 4..12, and so on. I know this will yield a lot of rows.
There are SO answers for finding combinations from arrays. I could do this programmatically by taking 30 ids 10 at a time and then SELECTing them. Is there a way to do this with PARTITION BY, TABLESAMPLE, or another function (something like python's itertools.combinations())? (TABLESAMPLE by itself won't guarantee which subset of rows I am selecting as far as I can tell.)
The method described in the cited answer is static. A more convenient solution may be to use recursion.
Example data:
drop table if exists my_table;
create table my_table(id int primary key, number int);
insert into my_table values
(1, 5),
(2, 3),
(3, 6),
(4, 9),
(5, 2);
Query which finds 2 element subsets in 5 element set (k-combination with k = 2):
with recursive recur as (
    select
        id,
        array[id] as combination,
        array[number] as numbers,
        number as sum
    from my_table
    union all
    select
        t.id,
        combination || t.id,
        numbers || t.number,
        sum + number
    from my_table t
    join recur r on r.id < t.id
    and cardinality(combination) < 2 -- param k
)
select combination, numbers, sum/2.0 as average -- param k
from recur
where cardinality(combination) = 2 -- param k
combination | numbers | average
-------------+---------+--------------------
{1,2} | {5,3} | 4.0000000000000000
{1,3} | {5,6} | 5.5000000000000000
{1,4} | {5,9} | 7.0000000000000000
{1,5} | {5,2} | 3.5000000000000000
{2,3} | {3,6} | 4.5000000000000000
{2,4} | {3,9} | 6.0000000000000000
{2,5} | {3,2} | 2.5000000000000000
{3,4} | {6,9} | 7.5000000000000000
{3,5} | {6,2} | 4.0000000000000000
{4,5} | {9,2} | 5.5000000000000000
(10 rows)
The same query for k = 3 gives:
combination | numbers | average
-------------+---------+--------------------
{1,2,3} | {5,3,6} | 4.6666666666666667
{1,2,4} | {5,3,9} | 5.6666666666666667
{1,2,5} | {5,3,2} | 3.3333333333333333
{1,3,4} | {5,6,9} | 6.6666666666666667
{1,3,5} | {5,6,2} | 4.3333333333333333
{1,4,5} | {5,9,2} | 5.3333333333333333
{2,3,4} | {3,6,9} | 6.0000000000000000
{2,3,5} | {3,6,2} | 3.6666666666666667
{2,4,5} | {3,9,2} | 4.6666666666666667
{3,4,5} | {6,9,2} | 5.6666666666666667
(10 rows)
Of course, you can remove numbers from the query if you do not need them.
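For comparison with the programmatic route mentioned in the question, here is a short Python sketch that produces the same k-combinations and averages with itertools.combinations, using the example data above. The function name combination_means is made up for the example. The last line prints how many combinations the question's case (30 rows taken 10 at a time) would yield, which gives a sense of the result size either approach has to handle.

```python
from itertools import combinations
from math import comb

# (id, number) pairs matching the example table
rows = [(1, 5), (2, 3), (3, 6), (4, 9), (5, 2)]


def combination_means(rows, k):
    """Return (ids, numbers, average) for every k-element subset of rows."""
    result = []
    for combo in combinations(rows, k):
        ids = tuple(r[0] for r in combo)
        numbers = tuple(r[1] for r in combo)
        result.append((ids, numbers, sum(numbers) / k))
    return result


for ids, numbers, average in combination_means(rows, 2):
    print(ids, numbers, average)

# Number of subsets for 30 rows taken 10 at a time
print(comb(30, 10))
```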

Update intermediate result

EDIT
As requested a little background of what I want to achieve. I have a table that I want to query but I don't want to change the table itself. Next the result of the SELECT query (what I called the 'intermediate table') needs to be cleaned a bit. For example certain cells of certain rows need to be swapped and some strings need to be trimmed. Of course this could all be done as postprocessing in, e.g., Python, but I was hoping to do all of this with one query statement.
Being new to Postgresql I want to update the intermediate table that results from a SELECT statement. So I basically want to edit the resulting table from a SELECT statement in one query. I'd like to prevent having to store the intermediate result.
I've tried the following 'with clause':
with result as (
select
a
from
b
)
update result as r
set
a = 'd'
...but that results in ERROR: relation "result" does not exist, while the following does work:
with result as (
select
a
from
b
)
select
*
from
result
As I said, I'm new to Postgresql so it is entirely possible that I'm using the wrong approach.
Depending on the complexity of the transformations you want to perform, you might be able to munge it into the SELECT, which would let you get away with a single query:
WITH foo AS (SELECT lower(name), freq, cumfreq, rank, vec FROM names WHERE name LIKE 'G%')
SELECT ... FROM foo WHERE ...
Or, for more or less unlimited manipulation options, you could create a temp table that will disappear at the end of the current transaction. That doesn't get the job done in a single query, but it does get it all done on the SQL server, which might still be worthwhile.
db=# BEGIN;
BEGIN
db=# CREATE TEMP TABLE foo ON COMMIT DROP AS SELECT * FROM names WHERE name LIKE 'G%';
SELECT 4677
db=# SELECT * FROM foo LIMIT 5;
name | freq | cumfreq | rank | vec
----------+-------+---------+------+-----------------------
GREEN | 0.183 | 11.403 | 35 | 'KRN':1 'green':1
GONZALEZ | 0.166 | 11.915 | 38 | 'KNSL':1 'gonzalez':1
GRAY | 0.106 | 15.921 | 69 | 'KR':1 'gray':1
GONZALES | 0.087 | 18.318 | 94 | 'KNSL':1 'gonzales':1
GRIFFIN | 0.084 | 18.659 | 98 | 'KRFN':1 'griffin':1
(5 rows)
db=# UPDATE foo SET name = lower(name);
UPDATE 4677
db=# SELECT * FROM foo LIMIT 5;
name | freq | cumfreq | rank | vec
--------+-------+---------+-------+---------------------
grube | 0.002 | 67.691 | 7333 | 'KRP':1 'grube':1
gasper | 0.001 | 69.999 | 9027 | 'KSPR':1 'gasper':1
gori | 0.000 | 81.360 | 28946 | 'KR':1 'gori':1
goeltz | 0.000 | 85.471 | 47269 | 'KLTS':1 'goeltz':1
gani | 0.000 | 86.202 | 51743 | 'KN':1 'gani':1
(5 rows)
db=# COMMIT;
COMMIT
db=# SELECT * FROM foo;
ERROR: relation "foo" does not exist

How to traverse a hierarchical tree structure backwards using recursive queries

I'm using PostgreSQL 9.1 to query hierarchical tree-structured data, consisting of edges (or elements) with connections to nodes. The data are actually for stream networks, but I've abstracted the problem to simple data types. Consider the example tree table. Each edge has length and area attributes, which are used to determine some useful metrics from the network.
CREATE TEMP TABLE tree (
edge text PRIMARY KEY,
from_node integer UNIQUE NOT NULL, -- can also act as PK
to_node integer REFERENCES tree (from_node),
mode character varying(5), -- redundant, but illustrative
length numeric NOT NULL,
area numeric NOT NULL,
fwd_path text[], -- optional ordered sequence, useful for debugging
fwd_search_depth integer,
fwd_length numeric,
rev_path text[], -- optional unordered set, useful for debugging
rev_search_depth integer,
rev_length numeric,
rev_area numeric
);
CREATE INDEX ON tree (to_node);
INSERT INTO tree(edge, from_node, to_node, mode, length, area) VALUES
('A', 1, 4, 'start', 1.1, 0.9),
('B', 2, 4, 'start', 1.2, 1.3),
('C', 3, 5, 'start', 1.8, 2.4),
('D', 4, 5, NULL, 1.2, 1.3),
('E', 5, NULL, 'end', 1.1, 0.9);
In this network, the edges A-E are connected with nodes 1-5. The NULL to_node (Ø) represents the end node. The from_node is always unique, so it can act as PK. If this network flows like a drainage basin, it would be from top to bottom, where the starting tributary edges are A, B, C and the ending outflow edge is E.
The documentation for WITH provides a nice example of how to use search graphs in recursive queries. So, to get the "forwards" information, the query starts at the end, and works backwards:
WITH RECURSIVE search_graph AS (
-- Begin at ending nodes
SELECT E.from_node, 1 AS search_depth, E.length
, ARRAY[E.edge] AS path -- optional
FROM tree E WHERE E.to_node IS NULL
UNION ALL
-- Accumulate each edge, working backwards (upstream)
SELECT o.from_node, sg.search_depth + 1, sg.length + o.length
, o.edge|| sg.path -- optional
FROM tree o, search_graph sg
WHERE o.to_node = sg.from_node
)
UPDATE tree SET
fwd_path = sg.path,
fwd_search_depth = sg.search_depth,
fwd_length = sg.length
FROM search_graph AS sg WHERE sg.from_node = tree.from_node;
SELECT edge, from_node, to_node, fwd_path, fwd_search_depth, fwd_length
FROM tree ORDER BY edge;
edge | from_node | to_node | fwd_path | fwd_search_depth | fwd_length
------+-----------+---------+----------+------------------+------------
A | 1 | 4 | {A,D,E} | 3 | 3.4
B | 2 | 4 | {B,D,E} | 3 | 3.5
C | 3 | 5 | {C,E} | 2 | 2.9
D | 4 | 5 | {D,E} | 2 | 2.3
E | 5 | | {E} | 1 | 1.1
The above makes sense, and scales well for large networks. For example, I can see edge B is 3 edges from the end, and the forward path is {B,D,E} with a total length of 3.5 from the tip to the end.
However, I cannot figure out a good way to build a reverse query. That is, from each edge, what are the accumulated "upstream" edges, lengths and areas. Using WITH RECURSIVE, the best I have is:
WITH RECURSIVE search_graph AS (
-- Begin at starting nodes
SELECT S.from_node, S.to_node, 1 AS search_depth, S.length, S.area
, ARRAY[S.edge] AS path -- optional
FROM tree S WHERE from_node IN (
-- Starting nodes have a from_node without any to_node
SELECT from_node FROM tree EXCEPT SELECT to_node FROM tree)
UNION ALL
-- Accumulate edges, working forwards
SELECT c.from_node, c.to_node, sg.search_depth + 1, sg.length + c.length, sg.area + c.area
, c.edge || sg.path -- optional
FROM tree c, search_graph sg
WHERE c.from_node = sg.to_node
)
UPDATE tree SET
rev_path = sg.path,
rev_search_depth = sg.search_depth,
rev_length = sg.length,
rev_area = sg.area
FROM search_graph AS sg WHERE sg.from_node = tree.from_node;
SELECT edge, from_node, to_node, rev_path, rev_search_depth, rev_length, rev_area
FROM tree ORDER BY edge;
edge | from_node | to_node | rev_path | rev_search_depth | rev_length | rev_area
------+-----------+---------+----------+------------------+------------+----------
A | 1 | 4 | {A} | 1 | 1.1 | 0.9
B | 2 | 4 | {B} | 1 | 1.2 | 1.3
C | 3 | 5 | {C} | 1 | 1.8 | 2.4
D | 4 | 5 | {D,A} | 2 | 2.3 | 2.2
E | 5 | | {E,C} | 2 | 2.9 | 3.3
I would like to build aggregates into the second term of the recursive query, since each downstream edge connects to 1 or many upstream edges, but aggregates are not allowed with recursive queries. Also, I'm aware that the join is sloppy, since the with recursive result has multiple join conditions for edge.
The expected result for the reverse / backwards query is:
edge | from_node | to_node | rev_path | rev_search_depth | rev_length | rev_area
------+-----------+---------+-------------+------------------+------------+----------
A | 1 | 4 | {A} | 1 | 1.1 | 0.9
B | 2 | 4 | {B} | 1 | 1.2 | 1.3
C | 3 | 5 | {C} | 1 | 1.8 | 2.4
D | 4 | 5 | {A,B,D} | 3 | 3.5 | 3.5
E | 5 | | {A,B,C,D,E} | 5 | 6.4 | 6.8
How can I write this update query?
Note that I'm ultimately more concerned about accumulating accurate length and area totals, and that the path attributes are for debugging. In my real-world case, forwards paths are up to a couple hundred, and I expect reverse paths in the tens of thousands for large and complex catchments.
UPDATE 2:
I rewrote the original recursive query so that all accumulation/aggregation is done outside the recursive part. It should perform better than the previous version of this answer.
This is very similar to the answer from @a_horse_with_no_name for a similar question.
WITH
RECURSIVE search_graph(edge, from_node, to_node, length, area, start_node) AS
(
SELECT edge, from_node, to_node, length, area, from_node AS "start_node"
FROM tree
UNION ALL
SELECT o.edge, o.from_node, o.to_node, o.length, o.area, p.start_node
FROM tree o
JOIN search_graph p ON p.from_node = o.to_node
)
SELECT start_node
,array_agg(edge) AS "edges"
-- ,array_agg(from_node) AS "nodes"
,count(edge) AS "edge_count"
,sum(length) AS "length_sum"
,sum(area) AS "area_sum"
FROM search_graph
GROUP BY start_node
ORDER BY start_node
;
Results are as expected:
start_node | edges | edge_count | length_sum | area_sum
------------+-------------+------------+------------+------------
1 | {A} | 1 | 1.1 | 0.9
2 | {B} | 1 | 1.2 | 1.3
3 | {C} | 1 | 1.8 | 2.4
4 | {D,B,A} | 3 | 3.5 | 3.5
5 | {E,D,C,B,A} | 5 | 6.4 | 6.8
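As a cross-check of the expected totals, the same upstream accumulation can be sketched in plain Python: for each edge, walk to every edge that flows into it (directly or indirectly) and sum the lengths and areas afterwards, which mirrors doing the aggregation outside the recursion. The function name upstream_totals and the dict layout are assumptions for the example, and the sketch assumes the network is a tree with no cycles.

```python
from collections import defaultdict

# edge -> (from_node, to_node, length, area), as in the example table
edges = {
    "A": (1, 4, 1.1, 0.9),
    "B": (2, 4, 1.2, 1.3),
    "C": (3, 5, 1.8, 2.4),
    "D": (4, 5, 1.2, 1.3),
    "E": (5, None, 1.1, 0.9),
}


def upstream_totals(edges):
    """For each edge, return (upstream edges incl. itself, count, length sum, area sum)."""
    # Map each node to the edges that flow into it
    inflow = defaultdict(list)
    for name, (_, to_node, _, _) in edges.items():
        if to_node is not None:
            inflow[to_node].append(name)

    totals = {}
    for name in edges:
        stack, seen = [name], []
        while stack:
            current = stack.pop()
            seen.append(current)
            # Continue with the edges that end at this edge's from_node
            stack.extend(inflow[edges[current][0]])
        totals[name] = (
            sorted(seen),
            len(seen),
            round(sum(edges[e][2] for e in seen), 1),
            round(sum(edges[e][3] for e in seen), 1),
        )
    return totals


for name, total in upstream_totals(edges).items():
    print(name, *total)
```

This walks each edge's whole upstream subtree separately, so for very large catchments a single bottom-up pass that reuses child totals would likely be cheaper.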