Postgres "first" aggregation function - postgresql

I am aggregation a table using file ID field. Each file has a name which matched exactly one (his) file id.
select file_key, min(fullfilepath)
from table
group by file_key
Because I know the structure of the table, I know that I need ANY fullfilepath. The min and the max are ok, but it requires a lot of time.
I came across this aggregation function which returns the first value. Unfortunately, this function takes a long time, because it scans the whole table. For example, this is very slow:
select first(file_id) from table;
What is the fastest way to do that? With or without aggregation function.

There is no way to make your first query with the GROUP BY clause faster, because it has to scan the whole table to find all groups.
Your second query can be made faster:
SELECT (
SELECT file_id FROM "table"
WHERE file_id IS NOT NULL
LIMIT 1
);
There is no way to optimize the query as you wrote it, because the aggregate function is a black box to PostgreSQL.

I doubt that this will help performance but it may be useful if anyone actually wants a first agregate.
-- coaslesce isn't a function so make an equivalent function.
create function coalesce_("anyelement","anyelement") returns "anyelement"
language sql as $$ select coalesce( $1,$2 ) $$;
create aggregate first("anyelement") (sfunc=coalesce_, stype="anyelement");

select
distinct on (file_key)
file_key, fullfilepath
from table
order by file_key
That will return one record for each file_key

Related

Pivot function without manually typing values in `for in`?

Documentation provides an example of using the pivot() function.
SELECT *
FROM (SELECT partname, price FROM part) PIVOT (
AVG(price) FOR partname IN ('prop', 'rudder', 'wing')
);
I would like to use pivot() without having to manually specify each value of partname. I want all parts. I tried:
SELECT *
FROM (SELECT partname, price FROM part) PIVOT (
AVG(price) FOR partname);
That gave an error. Then tried:
SELECT *
FROM (SELECT partname, price FROM part) PIVOT (
AVG(price) FOR partname IN (select distinct partname from part)
);
That also threw an error.
How can I tell Redshift to include all values of partname in the pivot?
I don't think this can be done in a simple single query. This would mean that the query compiler would need to work without knowing how many output columns will be produced. I don't think it can do that.
You can do this in multiple queries - use a query to create the list of partnames and then use this to "generate" a second query that populates the IN list. So something needs issue these queries and generated the second. This can be some code external to Redshift (lots of options) or a stored procedure in Redshift. This code, no matter where it exists, should understand that Redshift has a max number of columns limit - 1,600.
The Redshift docs are fairly good on the topic of dynamic SQL for stored procedures. The EXECUTE statement will be used to fire off the second query in a stored procedure. See: https://docs.aws.amazon.com/redshift/latest/dg/c_PLpgSQL-statements.html

Why does this SQL unnest query result in 2 rows rather than 4?

Relatively new SQL user question....
If my postgresql query looks like this:
select
to_timestamp((unnest(enrolled_ranges) ->> 'start_time')::float) as start_time
, to_timestamp((unnest(enrolled_ranges) ->> 'end_time')::float) as end_time
from student_inclusions
where student_id = '123456'
And the initial enrolled_ranges json data is this:
{"{\"start_time\":1536652800.00007,\"end_time\":1563981839.966626}","{\"start_time\":1563982078.624668,\"end_time\":1563989693.830777}"}
Why does sql do this
instead of this
The first answer is what I want, I just don't understand how sql knows from the query to associate the matching start and end times. Do you have any insight?
The documentation on set-returning functions describes the behavior you observed:
For each row from the underlying query, there is an output row using the first result from each set-returning function, then an output row using the second result, and so on.
See also What is the expected behaviour for multiple set-returning functions in SELECT clause?

How to hash a query result with sha256 in PostgreSQL?

I want to somehow hash the result of a query in PostgreSQL. I have a query like
SELECT output FROM result;
And it returns a column composed only of integers. So I somehow want to hash the result of this query. Concatenate the values and hash, or somehow hash the query output directly. Simply I need a way to put it inside SELECT sha256(...). So please note that I do not want to get hash of every column entry, but one hash that somehow corresponds to the query output. Any ideas?
PostgreSQL doesn't come with a built-in streaming hash function exposed to the user, so the easiest way is to build the string in memory and then hash it. Of course this won't work with giant result sets. You can use digest from the pg_crypto extension. You also need to order your rows, or else you might get different results on the same data from one execution to the next if you get the rows in different orders.
select digest(string_agg(output::text,' ' order by output),'sha256')
from result;
Replace 1234 with your column name and add [from table_name] to this query:
select encode(digest(1234::text, 'sha256'), 'hex')
Or for multiple rows use this:
select encode(
digest(
(select array_agg(q1)::text[] from (select row(R.*)::text as q1 from (SELECT output FROM result)R)alias)::text
, 'sha256')
, 'hex')

Store the whole query result in variable using postgresql Stored procedure

I'm trying to get the whole result of a query into a variable, so I can loop through it and make inserts.
I don't know if it's possible.
I'm new to postgre and procedures, any help will be very welcome.
Something like:
declare result (I don't know what kind of data type I should use to get a query);
select into result label, number, desc from data
Thanks in advance!
I think you have to read PostgreSQL documentation about cursors.
But if you want just insert data from one table to another, you can do:
insert into data2 (label, number, desc)
select label, number, desc
from data
if you want to "save" data from query, you also can use temporary table, which you can create by usual create table or create table as:
create temporary table temp_data as
(
select label, number, desc
from data
)
see documentation

Calling a function on distinct values from table

I've got a SQL Server 2005 database. I need to get distinct values in addition to calling a function on those distinct values. I'm not sure how the distinct works when there is a function call involved. For example, I have this query:
SELECT DISTINCT a, b, c, fcn_DoSomething(a, b, c) AS z FROM users
I'm guessing that the function (fcn_DoSomething) is being called for all of the values in the table, not the distinct values. Am I correct? If so, how can I write the query to call the function only on distinct values of a,b,c? I know one option is to use a temporary table, but if anyone has better ideas that would be great.
Thanks
This got me curious, so I did a bit of basic testing. I created a small table with some distinct and some repeating values, a function that just does string concatenation, and then looked at the execution plans for:
Go
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
select distinct cola, colb, dbo.sillyfunc(cola, colb)
from distincttest
--Clear the cache
Go
DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE
select cola, colb, dbo.sillyfunc(cola, colb)
from (select distinct cola, colb from distincttest) as t
In this case, the execution plans showed clearly that the first one ran the concatenation function for every single row, but the second did the sort for distinct values first, then ran the function. But for a small number of rows, they had the same execution time, and when run together they showed each one using 50% of the total query resources.
So, I added a few hundred thousand repeating rows. and tried again. This changed the query plan so it was doing a hash match to get distinctness rather than the former sort, and now the second version which forced it to select for distinctness first executed more than ten times faster.
Finally, I thought there was a chance that this might just be because SQL Server had my sillyfunc marked as nondeterministic (select OBJECTPROPERTYEX(object_id('dbo.sillyfunc'), 'isdeterministic') returned 0), so I switched to patindex which was a builtin function and considered deterministic. This gave me the same results with the function being called for every row in the first version and just for the few distinct ones in the second version.
So, its possible that further testing would find situations that would coax the optimizer to do something more sophisticated, but it appears that if you want to apply the distinct before the function is called then you need to use something like a subquery, CTE, or temp table to limit what the function has access to.
This would ensure that the function only got called on distinct values.
select *, fcn_DoSomething(a, b, c)
from
(select distinct a,b,c FROM users) v
However, I believe that the function call will be optimised, so it may not make a difference. Give it a try.