Grouping and processing groups inside plpgsql functions - postgresql

I need to perform some sophisticated group processing on the result of a complex query. The row set looks like this:
key val
-------
foo 1
foo 2
foo 3
bar 10
bar 15
baz 22
baz 44
...
And here is the pseudocode I want to implement in plpgsql:
result = new array()
group = new array()
current_key = null
for (record in (select * from superComplexQuery())) {
    if (current_key == null) {
        current_key = record.key
    }
    if (current_key != record.key) {
        result.add(processRows(group))
        group.clear()
        current_key = record.key
    }
    group.add(record)
}
if (group.size() > 0) {
    result.add(processRows(group))
}
return result
I.e., we must process 3 "foo" rows, then 2 "bar" rows, then 2 "baz" rows, etc., and the result of each processRows call is added to the resulting collection.
Maybe I should use another approach, but I don't know what it would be.
EDIT: processRows should output a record. Thus, the output of the whole procedure will be a set of rows, where each row is the result of processRows(group). One example of such a calculation is given in the first sentence of this question: Selecting positive aggregate value and ignoring negative in Postgres SQL. I.e., the calculation involves some iteration and aggregation with complex rules.

The right approach was to use User-Defined Aggregates.
I.e., I successfully implemented my own aggregate function, and the code looks like
select my_complex_aggregation((input_field_1, input_field_2, input_field_3))
from my_table
group by key
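For reference, here is a minimal sketch of such an aggregate. The type and function names, the array-based state, and the trivial final computation (a row count) are illustrative assumptions; the real per-group iteration logic would live in the final function:

CREATE TYPE my_input AS (input_field_1 int, input_field_2 int, input_field_3 int);

-- State function: append each incoming row of the group to the state array.
CREATE FUNCTION my_agg_sfunc(state my_input[], item my_input)
RETURNS my_input[] LANGUAGE sql AS $$
    SELECT state || item;
$$;

-- Final function: runs once per group over the collected rows; here it just counts them.
CREATE FUNCTION my_agg_ffunc(state my_input[])
RETURNS int LANGUAGE sql AS $$
    SELECT coalesce(array_length(state, 1), 0);
$$;

CREATE AGGREGATE my_complex_aggregation(my_input) (
    SFUNC     = my_agg_sfunc,
    STYPE     = my_input[],
    INITCOND  = '{}',
    FINALFUNC = my_agg_ffunc
);

With a declared composite type like this, the row constructor in the query above may need an explicit cast: my_complex_aggregation((input_field_1, input_field_2, input_field_3)::my_input).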

Related

Postgres dynamic filter conditions

I want to dynamically filter data based on a condition that is stored in a specific column. This condition can change for every row.
For example, I have a table my_table with a couple of columns, one of them called foo, which holds filter conditions such as AND bar > 1 in one row, AND bar > 2 in the next, or AND bar = 33 in another.
I have a query which looks like:
SELECT something from somewhere
LEFT JOIN otherthing on some_condition
WHERE first_condition AND second_condition AND
here_i_want_dynamically_load_condition_from_my_table.foo
What is the correct way to do this? I have read some articles about dynamic queries, but I haven't been able to find the right approach.
This is impossible in pure SQL: at query time, the planner has to know your exact logic. Now, you can hide it away in a function (in pseudo-sql):
CREATE FUNCTION do_I_filter_or_not(some_id integer) RETURNS boolean AS $$
DECLARE
    value integer;
    condition_type text;
    condition_value integer;
BEGIN
    SELECT some_value INTO value FROM some_table WHERE id = some_id;
    SELECT ... INTO condition_type;   -- query a condition type for this row
    SELECT ... INTO condition_value;  -- query a condition value for this row
    IF condition_type = 'equals' AND condition_value = value THEN
        RETURN true;
    ELSIF condition_type = 'greater_than' AND condition_value < value THEN
        RETURN true;
    ELSIF condition_type = 'lower_than' AND condition_value > value THEN
        RETURN true;
    END IF;
    RETURN false;
END;
$$ LANGUAGE plpgsql;
And query it like this:
SELECT something
FROM somewhere
LEFT JOIN otherthing on some_condition
WHERE first_condition
AND second_condition
AND do_I_filter_or_not(somewhere.id)
Now the performance will be bad: you have to invoke that function on potentially every row in the query, triggering lots of subqueries.
Thinking about it, if you just want <, >, =, and you have a table (filter_criteria) describing for each id what the criteria is, you can do it:
CREATE TABLE filter_criteria (
    some_id integer,
    equals_threshold integer,
    greater_than_threshold integer,
    lower_than_threshold integer
    -- plus a CHECK constraint that exactly one threshold is not null
);

INSERT INTO filter_criteria VALUES (1, null, 5, null); -- for > 5
And query like this:
SELECT something
FROM somewhere
LEFT JOIN otherthing on some_condition
LEFT JOIN filter_criteria USING (some_id)
WHERE first_condition
AND second_condition
AND COALESCE(bar = equals_threshold, true)
AND COALESCE(bar > greater_than_threshold, true)
AND COALESCE(bar < lower_than_threshold, true)
The COALESCEs are here to default to not filtering (AND true) if the threshold is missing: bar = equals_threshold yields null instead of a boolean when equals_threshold is null. For example, with bar = 7 and the criteria row (1, null, 5, null), the three conditions evaluate to true (null coalesced), true (7 > 5), and true (null coalesced), so the row passes.
The planner now knows the exact logic at query time: you're just doing 3 passes of filtering, with a =, <, > check each time. That'd still be more performant than idea #1 with all the subquerying.
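To make the trick concrete, here is a toy demo against the criteria row inserted above; the driving table and its values are invented for illustration:

-- hypothetical driving table
CREATE TABLE somewhere (some_id integer, bar integer);
INSERT INTO somewhere VALUES (1, 7), (1, 3);

SELECT s.some_id, s.bar
FROM somewhere s
LEFT JOIN filter_criteria USING (some_id)
WHERE COALESCE(bar = equals_threshold, true)
  AND COALESCE(bar > greater_than_threshold, true)
  AND COALESCE(bar < lower_than_threshold, true);
-- returns only (1, 7): the criteria row (1, null, 5, null) demands bar > 5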

Unable to figure out filter in slickdb

Using Scala with slickdb. I have a table called persons, and I am filtering persons by name as below:
table.Persons.filter { row =>
    println("inside filter")
    req.personName.map(name => row.personName === name).getOrElse(true: Rep[Boolean])
}
The table contains 3 rows, but still the println() is executed only once. How does this filter work?
First of all, when you write something like
personTable.filter(p => { .... })
it evaluates to a Query, which can generate the SQL query when needed for actual DB querying. The generated SQL will be something like:
SELECT ...
FROM persons
WHERE ...
Now this SQL query is submitted to the DB for execution.
So, your code inside { ... } gets evaluated once to generate the Query itself, and it has no relation to how many rows you have in your DB table.
So, the println in your example will run just once even if your DB table has 0 rows, 1 row or a million rows.
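To see the two phases separately, here is a minimal sketch, assuming a Slick 3 table definition along the lines of the question's persons table (names invented for illustration):

import slick.jdbc.PostgresProfile.api._

class Persons(tag: Tag) extends Table[(Long, String)](tag, "persons") {
  def id = column[Long]("id", O.PrimaryKey)
  def personName = column[String]("person_name")
  def * = (id, personName)
}
val persons = TableQuery[Persons]

// Phase 1: the lambda runs exactly once, while Slick builds the query AST.
val query = persons.filter { row =>
  println("inside filter") // printed once, at query-construction time
  row.personName === "alice"
}

// Phase 2: SQL is generated from the AST and the database does the filtering.
// This prints the generated statement without touching the database at all.
println(query.result.statements.head)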

MDX: Define dimension sub set and show the total

Since in MDX you can specify the member [all] to add the aggregation across all the members of the dimension, if I want to show the totals of a certain measure I can build a query like
SELECT {
[MyGroup].[MyDimension].[MyDimension].members,
[MyGroup].[MyDimension].[all]
} *
[Measures].[Quantity] on 0
FROM [MyDatabase]
Now I want to filter MyDimension down to a handful of values and show the total of just the selected values, but of course if I generate the query
SELECT {
[MyGroup].[MyDimension].[MyDimension].&[MyValue1],
[MyGroup].[MyDimension].[MyDimension].&[MyValue2],
[MyGroup].[MyDimension].[all]
} *
[Measures].[Quantity] on 0
FROM [MyDatabase]
it shows the Quantity for MyValue1, MyValue2 and the total of all MyDimension members, not just the ones I selected.
I investigated a bit and came up with a solution that involves generating a sub query to filter my values:
SELECT {
[MyGroup].[MyDimension].[MyDimension].members, [MyGroup].[MyDimension].[all]
} * [Measures].[Quantity] ON 0
FROM (
SELECT {
[MyGroup].[MyDimension].[MyDimension].&[MyValue1],
[MyGroup].[MyDimension].[MyDimension].&[MyValue2]
} ON 0
FROM [MyDatabase]
)
Assuming this works, is there a simpler or more straightforward approach to achieve this?
I tried to use the SET statement to define my custom tuple sets, but then I couldn't manage to show the total.
Keep in mind that in my example I kept things as easy as possible, but in real cases I could have multiple dimensions on both rows and columns, as well as multiple calculated measures defined with the MEMBER statement.
Thanks!
What you have done is standard - it is the simple way!
One thing to bear in mind when using a sub-select is that it is not a full filter, in that the original All member is still available. I think this is connected to how MDX processes the query clauses - here is an example of what I mean:
WITH
MEMBER [Product].[Product Categories].[All].[All of the Products] AS
[Product].[Product Categories].[All]
SELECT
[Measures].[Internet Sales Amount] ON 0
,NON EMPTY
{
[Product].[Product Categories].[All] //X
,[Product].[Product Categories].[All].[All of the Products] //Y
,[Product].[Product Categories].[Category].MEMBERS
} ON 1
FROM
(
SELECT
{
[Product].[Product Categories].[Category].&[4]
,[Product].[Product Categories].[Category].&[1]
} ON 0
FROM [Adventure Works]
);
So the line marked X will be the sum of categories 4 and 1, but line Y will still refer to the whole of Adventure Works.
This behavior is useful although a little confusing when using All members in the WITH clause.
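As for the SET route the question mentions, a sketch of it would look like this (member and set names are illustrative): define the selection once as a named set, then add a calculated member built with Aggregate() to serve as the total of only the selected members.

WITH
SET [SelectedValues] AS {
    [MyGroup].[MyDimension].[MyDimension].&[MyValue1],
    [MyGroup].[MyDimension].[MyDimension].&[MyValue2]
}
MEMBER [MyGroup].[MyDimension].[all].[Selected Total] AS
    Aggregate([SelectedValues])
SELECT {
    [SelectedValues],
    [MyGroup].[MyDimension].[all].[Selected Total]
} * [Measures].[Quantity] ON 0
FROM [MyDatabase]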

Distinct MongoDB function - How to use some criteria with distinct

I have a situation where I need to fetch only distinct records that are greater than 0, plus all records with value 0.
For example, I have a column called mid whose rows hold "0,0,1,1,2,3,5,5,3"; then I should fetch only "0,0,1,2,5,3".
In short: the distinct records, plus every mid with value 0.
I have used this
def distinctMIdCursor = dataSetCollection.distinct("mid",whereObject)
def distinctMIdList = distinctMIdCursor.asList()
but it's fetching a result like "0,1,2,5,3".
Actual result: "0,1,2,5,3"
Expected result: "0,0,1,2,5,3"
How do I achieve this? What is the better way?
You cannot achieve this with distinct, because doing so would defeat the whole purpose of using distinct. Instead you can run two queries and concatenate the results.
import com.mongodb.BasicDBObject

// distinct mids, excluding 0 (single-quoted '$ne' so Groovy does not interpolate it)
def nonZeroDistinctList = dataSetCollection.distinct("mid", new BasicDBObject("mid", new BasicDBObject('$ne', 0)))
// every document with mid = 0, mapped to its mid value
def allZeroList = dataSetCollection.find(new BasicDBObject("mid", 0)).collect { it.mid }
// concatenating the two lists
def result = nonZeroDistinctList + allZeroList
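Alternatively, a single-query sketch using the aggregation framework, shown in mongo shell syntax against a hypothetical dataSet collection (the Groovy driver would build the same pipeline from DBObjects). The idea is to group non-zero mids by value while giving every zero document its own group, so duplicates of 0 survive:

db.dataSet.aggregate([
    // group key: the mid itself, plus the document _id but only when mid is 0,
    // so zeros never collapse while non-zero duplicates do
    { $group: { _id: { mid: "$mid", z: { $cond: [ { $eq: [ "$mid", 0 ] }, "$_id", null ] } } } },
    { $project: { _id: 0, mid: "$_id.mid" } }
])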

filtering on a range of values in a db column with sqlalchemy orm

I have a PostgreSQL database, and one particular table in it has many rows. One column in this table, called data, is a float array, REAL[], and gets filled with arrays of ~4500 elements. I want to access this table through some query via SQLAlchemy and the ORM.
How do I select all rows in the table where a subset of this column satisfies some condition, e.g. contains a range of values? For example, I want to select all rows where the data contains values >= 10, or values between >= 10 and <= 20.
Can I do this with a straight session query like
rows = session.query(Table).filter(Table.data.(some conditional)).all()
where my conditional is something like "VALUES >= 10 and VALUES <= 20"?
Or do I need to define some special methods, or setup, when I'm defining my SQLAlchemy table class? For example, I have my table set up as
class Table(Base):
    __tablename__ = 'table'
    __table_args__ = {'autoload': True, 'schema': 'testdb', 'extend_existing': True}
    data = deferred(Column(ARRAY(Float)))

    def __repr__(self):
        return '<Table (pk={0})>'.format(self.pk)
Ideally I'd like to set it up so I can just do simple filtering in my session.query calls. Is this possible? I'm not super familiar with the ORM, so maybe it is?
I've had a look at the ARRAY Comparator sqlalchemy docs but those only seem to work on exact values. My data is precise to 6 sigfigs, and I don't know the exact values ahead of time.
What's the best way to do this? Thanks.
EDIT:
Based on the below comment, here is the code I'm using in attempting to select all rows (out of 1000) that have data (from 1 column) >= 1.0. There should be 537 rows.
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0,operator=operators.le)).all()
This gives the correct subset number: len(rows) = 537. However, I don't understand the logic of this operator: to select data >= 1.0, I use the le operator? Also, along those same lines, there should be 234 rows that have data between the values >= 1.0 and <= 1.2, but this statement fails to give the correct subset:
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0,operator=operators.le)).filter(datadb.Table.data.any(1.2,operator=operators.ge)).all()
EDIT 2:
Here's an example of my database Table with a few rows. pk is an integer, and data is a real[].
db datadb
schema Table
pk data
0 [0.0,0.0,0.5,0.3,1.3,1.9,0.3,0.0,0.0]
1 [0.1,0.0,1.0,0.7,1.1,1.5,1.2,0.3,1.4]
2 [0.0,0.6,0.4,0.3,1.6,1.7,0.4,1.3,0.0]
3 [0.0,0.1,0.2,0.4,1.0,1.1,1.2,0.9,0.0]
4 [0.0,0.0,0.5,0.3,0.2,0.1,0.7,0.3,0.1]
I have 5 rows; 4 of them have data with values >= 1.0, while just 2 have values in the range >= 1.0 and <= 1.2. In the first case, the query I would use to grab the rows is
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0,operator=operators.le)).all()
This should return the 4 rows, at pk=0,1,2,3. This query does what I expect. The second case
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0,operator=operators.le)).filter(datadb.Table.data.any(1.2,operator=operators.ge)).all()
and should return the 2 rows at pk=1,3. However this query just returns the 4 rows from the first query. For the second query, I also tried
rows = session.query(datadb.Table).filter(datadb.Table.data.any(1.0,operator=operators.le),datadb.Table.data.any(1.2,operator=operators.ge)).all()
which also didn't work.
Please read the documentation on ARRAY.Comparator, according to which you should be able to do the following:

from sqlalchemy.sql import operators

rows = (session.query(Table)
        .filter(Table.data.any(10, operator=operators.le))  # renders as: 10 <= ANY (data)
        .filter(Table.data.any(20, operator=operators.ge))  # renders as: 20 >= ANY (data)
        .all())
EDIT:
# combined filter does not work,
# but applying one or the other is still useful as it reduces the result set
q = (session.query(MyTable)
     .filter(MyTable.data.any(1.0, operator=operators.le))
     # .filter(MyTable.data.any(1.2, operator=operators.ge))
     )

# filter in memory
items = [_row for _row in q.all()
         if any(1.0 <= item <= 1.2 for item in _row.data)]
for item in items:
    print(item)
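Why the combined filter fails: any(x, operator=op) renders as x op ANY (data), with the scalar on the left, which is why operators.le expresses "some element >= 1.0". Two such filters can therefore be satisfied by two different elements of the same array (pk=0 qualifies via 1.3 and 0.5, say), which is not the same as "some element between 1.0 and 1.2". A sketch of pushing the per-element range check into PostgreSQL itself with a raw SQL fragment, assuming the data column from the question:

from sqlalchemy import text

# EXISTS over unnest() checks a single element against both bounds at once
rows = (session.query(datadb.Table)
        .filter(text("EXISTS (SELECT 1 FROM unnest(data) AS elem "
                     "WHERE elem BETWEEN 1.0 AND 1.2)"))
        .all())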