Stored procedure in PostgreSQL, multiple queries w/ aggregates

I'm trying to write a stored procedure that takes some input parameters, runs multiple queries against them, does calculations on the output of those queries, and, from those calculations and the original query results, outputs a formatted text string like:
Number of Rows for max(Z) matching condition x and y of total rows matching x (x&y/x*100).
To explain the max(Z) bit: this will be the username field. It won't matter which actual entry is picked, because the WHERE clause will filter the results by user id. Is there a saner way to do this?

For starters, break the code up into multiple procedures. Don't create one procedure that does all of these things.
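As a rough illustration of splitting the work into small pieces, here is a minimal Python sketch that keeps the calculation and the formatting separate from the queries that produce the counts. The names percentage and format_summary, the sample username, and the exact wording of the output string are assumptions made for illustration; in the database the same split would be done with separate functions.

```python
def percentage(part, total):
    """Share of `part` in `total` as a percentage; 0.0 when there are no rows."""
    return 0.0 if total == 0 else part / total * 100


def format_summary(username, matching_xy, matching_x):
    """Build the report line from two counts computed elsewhere (hypothetical format)."""
    pct = percentage(matching_xy, matching_x)
    return (
        f"{matching_xy} rows for {username} matching x and y "
        f"of {matching_x} rows matching x ({pct:.1f}%)"
    )


# The two counts would come from two separate, simple queries.
print(format_summary("alice", 25, 200))
# 25 rows for alice matching x and y of 200 rows matching x (12.5%)
```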

Related

get postgres to use an index when querying timestamps in a function

I have a system with a large number of tables that contain historical data. Each table has a ts_from and ts_to column which are of type timestamptz. These represent the time period in which the data for a particular row was valid.
These columns are indexed.
If I want to query all rows that were valid at a particular timestamp, it is trivial to write the ts_from <= @at_timestamp AND ts_to >= @at_timestamp WHERE clause to utilise the index.
However, I wanted to create a function called Temporal.at which would take @at_timestamp and the ts_from / ts_to columns and hide the complexity of the comparison from the query that uses it. You might think this is trivial, but I would also like to extend the concept to create a function called Temporal.between which would take a @from_timestamp and @to_timestamp and select all rows that were valid between those two timestamps. That function would not be trivial, as one would have to check where rows partially overlap the period rather than always being fully enclosed by it.
The issue is this: I have written these functions but they do not cause the index to be used. The query performance is woefully slow on the history tables, some of which have hundreds of millions of rows.
The questions therefore are:
a) Is there a way to write these functions so that we can be sure the indexes will be used?
b) Am I going about this completely the wrong way and is there a better way to proceed?
This is complicated if you model ts_from and ts_to as two different timestamp columns. Instead, you should use a range type: tstzrange. Then everything will become simple:
for containment in an interval, use @at_timestamp <@ from_to
for interval overlap, use tstzrange(@from_timestamp, @to_timestamp) && from_to
Both queries can be supported by a GiST index on the range column.
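To make the containment and overlap checks concrete, here is a minimal Python sketch of the comparisons involved, assuming closed bounds on both ends as in the question's ts_from <= ... AND ts_to >= ... clause. The function names valid_at and overlaps and the helper utc are hypothetical, and the sketch only illustrates the logic; it says nothing about how PostgreSQL evaluates range operators or uses the index.

```python
from datetime import datetime, timezone


def valid_at(ts_from, ts_to, at):
    """True if the validity period [ts_from, ts_to] contains the instant `at`."""
    return ts_from <= at <= ts_to


def overlaps(ts_from, ts_to, period_from, period_to):
    """True if [ts_from, ts_to] shares at least one instant with [period_from, period_to].

    This also covers partial overlap, not only rows fully enclosed by the period.
    """
    return ts_from <= period_to and ts_to >= period_from


def utc(day):
    """Helper for the example: a UTC timestamp on the given day of January 2024."""
    return datetime(2024, 1, day, tzinfo=timezone.utc)


row = (utc(10), utc(20))                  # row valid from Jan 10 to Jan 20
print(valid_at(*row, utc(15)))            # True
print(overlaps(*row, utc(5), utc(12)))    # True (partial overlap)
print(overlaps(*row, utc(21), utc(25)))   # False
```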

Postgresql - Return column subset from cursor

I have a legacy stored procedure returning a number (row count) and a cursor with many columns; I need to retrieve a subset of the selected columns. I can think of three ways of doing it:
Invoke the existing procedure from the outside, and map columns to my own data structures trimming unneeded columns;
Write a new stored procedure, mostly identical to the existing one but returning different columns;
Write a new stored procedure, invoking the old one internally and filtering columns (the referenced entities and thus the number of rows are exactly the same as the existing procedure).
Number 2 is obviously a no-go.
Number 1 is viable. As far as I know, there is little difference in the computing cost between retrieving one or more columns, in that the engine has to read full rows regardless, before filtering unrequired columns; I do have a feeling it would be heavier on the runtime invoking the procedure from the outside, as objects representing unneeded columns would exist on returning from the DB call.
I would be interested in implementing Number 3, but I would prefer to maintain the same return type as the existing function (count + refcursor) for conformity.
I think I could transfer all the rows in the cursor returned by the existing function into a temporary table as described e.g. in this question, and use it as a source for the output cursor but:
I am not sure of how the output cursor would behave with a temporary table created with a drop-on-commit clause (would the results exist reliably after the procedure has terminated? Would the temporary table be dropped as expected?);
I read that temporary tables are expensive to use, and it feels like overkill for what in the end is a filtering of columns on the same rows from a pre-computed result.
Is there a way to query the existing cursor so that it may be used as a source for the output cursor, while filtering columns?

Transpose table with unknown amount of variables

I would like to transpose table a to table b without knowing exactly how many procedures there are. Is there a way to include a loop inside a query?
Thank you in advance!
So far I am just checking what the maximum number of 'procedures' is, putting all the procedures in an array, and then querying all elements from this array. However, I would like a query that always works without first defining the maximum number of procedures.

Data Lake Analytics - Large vertex query

I have a simple query which does a GROUP BY using two fields:
@facturas =
SELECT a.CodFactura,
Convert.ToInt32(a.Fecha.ToString("yyyyMMdd")) AS DateKey,
SUM(a.Consumo) AS Consumo
FROM @table_facturas AS a
GROUP BY a.CodFactura, a.DateKey;
@table_facturas has 4100 rows, but the query takes several minutes to finish. Looking at the graph explorer, I see it uses 2500 vertices because there are 2500 unique CodFactura+DateKey combinations. I don't know if this is normal ADLA behaviour. Is there any way to reduce the number of vertices and run this query faster?
First: I am not sure your query actually will compile. You would need the Convert expression in your GROUP BY or do it in a previous SELECT statement.
Secondly: In order to answer your question, we would need to know how the full query is defined. Where does @table_facturas come from? How was it produced?
Without this information, I can only give some wild speculative guesses:
If @table_facturas is coming from an actual U-SQL Table, your table is over-partitioned/fragmented. This could be because:
you inserted a lot of data originally with a distribution on the grouping columns and you either have a predicate that reduces the number of rows per partition and/or you do not have up-to-date statistics (run CREATE STATISTICS on the columns).
you did a lot of INSERT statements, each inserting a small number of rows into the table, thus creating a large number of individual files. This will "scale out" the processing as well. Use ALTER TABLE REBUILD to recompact.
If it is coming from a fileset, you may have too many small files in the input. See if you can merge them into fewer, larger files.
If the above does not help, you can also try hinting a small number of rows in the query that creates @table_facturas by adding OPTION(ROWCOUNT=4000).
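The first point, computing the key in an earlier SELECT and grouping on it afterwards, can be sketched as follows. This uses Python's sqlite3 module purely as a stand-in, so the SQL dialect is SQLite rather than U-SQL; the table name facturas and the sample rows are made up for illustration, and the sketch says nothing about vertex counts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facturas (CodFactura TEXT, Fecha TEXT, Consumo REAL)")
conn.executemany(
    "INSERT INTO facturas VALUES (?, ?, ?)",
    [
        ("F1", "2020-01-15", 10.0),
        ("F1", "2020-01-15", 5.0),
        ("F2", "2020-02-01", 7.0),
    ],
)

# Step 1 (inner SELECT): derive DateKey from the date column.
# Step 2 (outer SELECT): group on the already-computed DateKey.
rows = conn.execute(
    """
    SELECT CodFactura, DateKey, SUM(Consumo) AS Consumo
    FROM (
        SELECT CodFactura,
               CAST(strftime('%Y%m%d', Fecha) AS INTEGER) AS DateKey,
               Consumo
        FROM facturas
    )
    GROUP BY CodFactura, DateKey
    ORDER BY CodFactura, DateKey
    """
).fetchall()

print(rows)  # [('F1', 20200115, 15.0), ('F2', 20200201, 7.0)]
```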

Calculate hash for java.sql.ResultSet

I need to know whether the results of a SQL query have changed between two executions.
The solution I came up with is to calculate and compare a hash value based on the ResultSet content.
What is the preferred way?
There is no special hashCode method for ResultSet that is calculated from all the retrieved data. You definitely cannot use the default hashCode method.
To be 100% sure that you take all changes in the data into account, you have to retrieve all columns from all rows of the ResultSet one by one and calculate a hash code for them in whatever way you like (for example, put everything into a single String and get its hashCode).
But that is a very time-consuming operation. I would suggest executing an extra query that calculates a hash sum itself. For example, it could return the count of rows and the sum of all columns/rows, or something like that.
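The row-by-row approach can be sketched in Python, with a DB-API cursor standing in for the ResultSet. This is a minimal sketch under stated assumptions: the function name result_hash, the table t, and the choice of SHA-256 are illustrative and not from the original answer, and the result depends on row order, so the query needs a deterministic ORDER BY.

```python
import hashlib
import sqlite3


def result_hash(cursor):
    """Hash every column of every row, in the order the rows are returned."""
    digest = hashlib.sha256()
    for row in cursor:
        for value in row:
            # repr() keeps 1 and '1' distinct; separators mark column and row ends.
            digest.update(repr(value).encode("utf-8"))
            digest.update(b"\x1f")
        digest.update(b"\x1e")
    return digest.hexdigest()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b")])

query = "SELECT id, name FROM t ORDER BY id"  # fixed order, so hashes are comparable
before = result_hash(conn.execute(query))
conn.execute("UPDATE t SET name = 'c' WHERE id = 2")
after = result_hash(conn.execute(query))

print(before != after)  # True: the change in one column is detected
```

Like the String-based approach, this still reads every row, so the cost grows with the size of the result.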