How to hash a query result with sha256 in PostgreSQL?

I want to hash the result of a query in PostgreSQL. I have a query like
SELECT output FROM result;
and it returns a column composed only of integers. I want to hash the result of this query as a whole: concatenate the values and hash them, or hash the query output directly in some way. Simply put, I need a way to wrap it in SELECT sha256(...). Please note that I do not want a hash of every column entry, but one hash that corresponds to the whole query output. Any ideas?

PostgreSQL doesn't expose a built-in streaming hash function to the user, so the easiest way is to build the string in memory and then hash it; of course, this won't work with giant result sets. You can use digest() from the pgcrypto extension. You also need to order the rows, or you might get different results on the same data from one execution to the next if the rows arrive in a different order.
select digest(string_agg(output::text,' ' order by output),'sha256')
from result;
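digest() returns bytea; if you'd rather see a hex string (and the extension isn't installed yet), a minimal sketch, assuming you have the privilege to install extensions:
-- one-time setup
create extension if not exists pgcrypto;

-- hex-encode the bytea digest for readability
select encode(
         digest(string_agg(output::text, ' ' order by output), 'sha256'),
         'hex') as result_hash
from result;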

Replace 1234 with your column name and, if needed, add FROM table_name to this query:
select encode(digest(1234::text, 'sha256'), 'hex')
Or for multiple rows use this (as above, add an ORDER BY inside array_agg if the row order is not guaranteed):
select encode(
         digest(
           (select array_agg(q1)::text
            from (select row(R.*)::text as q1
                  from (SELECT output FROM result) R) alias),
           'sha256'),
         'hex')
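One way such a hash gets used is to check whether two runs of a query produced the same rows. A sketch, assuming a hypothetical result_snapshot table holding an earlier copy of result:
-- result_snapshot is assumed; both hashes use the same separator and ordering
select (select encode(digest(string_agg(output::text, ',' order by output), 'sha256'), 'hex')
        from result)
     = (select encode(digest(string_agg(output::text, ',' order by output), 'sha256'), 'hex')
        from result_snapshot) as same_data;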

Related

Searching DB table to determine unseen file names from list

I am processing flat files from disk and need to ensure that I never process the same file twice. The filename of every processed file is stored in a PostgreSQL DB, and at the next iteration I need to determine the unseen files on disk and process them, i.e. I need to compute the set difference between the filenames on disk and the filenames in the DB.
Currently my approach is to create a CTE from the filenames on disk and join that to the table of seen filenames. The list of files on disk is large and constantly changing, and processing is slowing down.
This is the current query:
WITH input(filename) AS (VALUES ${filenames.joinToString { "(?)" }})
SELECT input.filename FROM input
LEFT JOIN my_table pm ON input.filename ILIKE pm.filename
WHERE pm.filename IS NULL
${filenames.joinToString { "(?)" }} expands to something like (?), (?), (?), depending on the number of filename parameters.
What can I do to speed up this process?
One thing I assume I should do is add an index on the filename column. What kind of index is the correct choice?
Since you're using ILIKE, I wouldn't put an index on pm.filename but on LOWER(pm.filename). This should allow you to replace ILIKE with the more performant LIKE, and it also means you can use a simple B-tree index (note that a B-tree only helps LIKE with left-anchored patterns, and in non-C locales that requires the text_pattern_ops operator class). LIKE is useful if you use wildcards, but if you don't, just use normal = equality, which the expression index supports directly.
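As a sketch, the expression index described above would look like this (the index name is mine):
CREATE INDEX my_table_lower_filename_idx ON my_table (LOWER(filename));
With that in place, any predicate of the form LOWER(pm.filename) = LOWER(input.filename) can use the index.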
Finally, there is a good chance that the query optimiser already does a lot with this query, but I suggest you look at its EXPLAIN (ANALYSE) output. I have some suggestions for improvement below, but no idea whether they will help or whether they all boil down to the same query plan. That's for you to verify!
This first rewrite takes the input list of filenames and removes any matches found in my_table. The downside is that the returned filenames are lowercased.
SELECT LOWER(filename)
FROM (VALUES ${filenames.joinToString { "(?)" }}) AS input(filename)
EXCEPT ALL (SELECT LOWER(filename) FROM my_table pm)
This next query doesn't have that drawback; it simply returns all filenames that do not have a match in my_table.
SELECT filename
FROM (VALUES ${filenames.joinToString { "(?)" }}) AS input(filename)
WHERE NOT EXISTS (
    SELECT
    FROM my_table pm
    WHERE LOWER(pm.filename) = LOWER(input.filename)
)
This last query is probably equivalent to the previous one, but I'll add it for completeness (with one caveat, demonstrated after the query: NOT IN misbehaves when the subquery can return NULLs).
SELECT filename
FROM (VALUES ${filenames.joinToString { "(?)" }}) AS input(filename)
WHERE LOWER(filename) NOT IN (
    SELECT LOWER(pm.filename)
    FROM my_table pm
)
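To demonstrate the NOT IN caveat mentioned above: if the subquery can produce a NULL, NOT IN silently filters out every row. A minimal illustration:
-- 'a' = NULL evaluates to unknown, so the whole NOT IN is unknown and the row is dropped
SELECT 'a' AS filename
WHERE 'a' NOT IN (SELECT unnest(ARRAY['b', NULL]));
-- returns zero rows, even though 'a' is clearly not 'b'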

Postgres "first" aggregation function

I am aggregating a table by its file ID field. Each file has a name that matches exactly one file ID.
select file_key, min(fullfilepath)
from table
group by file_key
Because I know the structure of the table, I know that any fullfilepath will do. min and max work, but they require a lot of time.
I came across this aggregate function, first(), which returns the first value. Unfortunately it also takes a long time, because it scans the whole table. For example, this is very slow:
select first(file_id) from table;
What is the fastest way to do that? With or without aggregation function.
There is no way to make your first query with the GROUP BY clause faster, because it has to scan the whole table to find all groups.
Your second query can be made faster:
SELECT (
    SELECT file_id
    FROM "table"
    WHERE file_id IS NOT NULL
    LIMIT 1
);
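If file_id is indexed, this rewritten form can typically be answered from the index without touching the table; a hypothetical index:
CREATE INDEX table_file_id_idx ON "table" (file_id);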
There is no way to optimize the query as you wrote it, because the aggregate function is a black box to PostgreSQL.
I doubt that this will help performance, but it may be useful if anyone actually wants a first aggregate.
-- COALESCE isn't a regular function, so make an equivalent function to use as the transition function.
create function coalesce_("anyelement", "anyelement") returns "anyelement"
language sql as $$ select coalesce($1, $2) $$;
-- the state sticks at the first non-null value it receives, i.e. the "first" value
create aggregate first("anyelement") (sfunc = coalesce_, stype = "anyelement");
select distinct on (file_key)
       file_key, fullfilepath
from "table"
order by file_key;
That will return one record for each file_key.
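Note that distinct on picks an arbitrary matching row per file_key unless the order by contains a tie-breaker; adding one also gives the planner an index it can use. A sketch, with a hypothetical index name:
create index table_filekey_path_idx on "table" (file_key, fullfilepath);

select distinct on (file_key)
       file_key, fullfilepath
from "table"
order by file_key, fullfilepath;  -- secondary key makes the chosen row deterministic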

Why array_agg() is returning empty array in postgresql?

I have an integer-type column named start. I want to build an array from the values of this column. It seemed very easy, so I used array_agg(), but it gives an empty array as output. Following is my column data:
start
1
2
11
5
... (and so on)
And following is my query used to make the array:
select array_agg(start) as start_array from table1;
Why is it giving empty array?
It's not.
There is no way this can return an empty array. If there are no input rows at all, array_agg() returns NULL, not an empty array. Perhaps a JOIN or a WHERE clause is wrong and you have zero rows?
Also, as a micro-optimization: if your query is this simple,
select array_agg(start) as start_array from table1;
then it's probably better written with the ARRAY() constructor...
SELECT ARRAY(SELECT start FROM table1) AS start_array;
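One behavioural difference worth knowing: over zero input rows, array_agg() yields NULL while ARRAY() yields an empty array. A quick way to see it (WHERE false just forces zero rows):
SELECT array_agg(start) FROM table1 WHERE false;     -- NULL
SELECT ARRAY(SELECT start FROM table1 WHERE false);  -- {}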

PostgreSQL use function result in ORDER BY

Is there a way to use the results of a function call in the order by clause?
My current attempt (I've also tried some slight variations):
SELECT it.item_type_id, it.asset_tag, split_part(it.asset_tag, 'ASSET', 2)::INT as tag_num
FROM serials.item_types it
WHERE it.asset_tag LIKE 'ASSET%'
ORDER BY split_part(it.asset_tag, 'ASSET', 2)::INT;
While my general assumption is that this can't be done, I wanted to know if there was a way to accomplish this that I wasn't thinking of.
EDIT: The query above gives the following error [22P02] ERROR: invalid input syntax for integer: "******"
Your query is generally OK; the problem is that for some row the result of split_part(it.asset_tag, 'ASSET', 2) is the string ******, and that string cannot be cast to an integer.
You may want to remove the ORDER BY and the cast in the select list, and add a WHERE split_part(it.asset_tag, 'ASSET', 2) = '******', for instance, to narrow down that data issue.
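As a sketch, that diagnosis can be generalized to catch every offending row, not just the ****** one (the regex is mine):
SELECT it.item_type_id, it.asset_tag
FROM serials.item_types it
WHERE it.asset_tag LIKE 'ASSET%'
  AND split_part(it.asset_tag, 'ASSET', 2) !~ '^[0-9]+$';  -- suffixes that are not purely numeric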
Once that is resolved, having such a function in the ORDER BY list is perfectly fine. The quoted section of the documentation in the comments on the question refers to applying an ORDER BY clause to the results of UNION, INTERSECT, etc. queries. In other words, the ORDER BY found in this query:
(select column1 as result_column1 from table1
 union
 select column2 from table2)
order by result_column1
can only refer to the accumulated result columns, not to expressions on individual rows.
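For completeness: in a plain query (no set operation) PostgreSQL also lets ORDER BY reference an output-column alias, so the original query could avoid repeating the expression:
SELECT it.item_type_id, it.asset_tag,
       split_part(it.asset_tag, 'ASSET', 2)::INT AS tag_num
FROM serials.item_types it
WHERE it.asset_tag LIKE 'ASSET%'
ORDER BY tag_num;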

Number of rows returned in a sqlite statement

Is there any easy way to get the number of rows returned by a sqlite statement? I don't want to have to go through the process of doing a COUNT() first. Thanks.
On each call to sqlite3_step(), increment a variable by 1.
If you want the row count in advance, then there's no easy way.
To count the entries matching a condition, you can use the following SQL statement:
SELECT COUNT(*) FROM "mytable" where something=42;
Or just the following to get all entries:
SELECT COUNT(*) FROM "mytable";
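If the real statement is more complex than a single-table filter, you can still count its rows in advance by wrapping it in a subquery; a sketch reusing the earlier example:
SELECT COUNT(*)
FROM (SELECT * FROM "mytable" WHERE something = 42) AS sub;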
In case you have already executed the query and want to count things about its result, note that sqlite3_column_count() and sqlite3_data_count() count columns, not rows: the former returns the number of columns in the result set, the latter the number of columns in the current row.