Searching DB table to determine unseen file names from list - postgresql

I am processing flat files from disk and need to ensure that I never process the same file twice. The filename of every processed file is stored in a PostgreSQL DB, and at the next iteration I need to determine the unseen files on disk and process them, i.e. I need to compute the set difference between the filenames on disk and the filenames in the DB.
Currently my approach is to create a CTE from the filenames on disk and join that to the table of seen filenames. The list of files on disk is large and constantly changing, and processing is slowing down.
This is the current query:
WITH input(filename) AS (VALUES ${filenames.joinToString { "(?)" }})
SELECT input.filename FROM input
LEFT JOIN my_table pm ON input.filename ILIKE pm.filename
WHERE pm.filename IS NULL
${filenames.joinToString { "(?)" }} expands to something like (?), (?), (?), depending on the number of filename parameters.
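For example, with three files bound (hypothetical names, and with the ? placeholders shown as inlined literals for readability), the final statement looks like this:
WITH input(filename) AS (VALUES ('a.csv'), ('b.csv'), ('c.csv'))
SELECT input.filename FROM input
LEFT JOIN my_table pm ON input.filename ILIKE pm.filename
WHERE pm.filename IS NULL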
What can I do to speed up this process?
One thing that I have to do is add an index on the filename column. What kind of index is the correct choice?

Since you're using ILIKE, I wouldn't put an index on pm.filename, but on LOWER(pm.filename). This should allow you to replace ILIKE with the more performant LIKE. A simple B-tree expression index is the right choice here, with one caveat: a plain B-tree only helps LIKE with prefix patterns, and on most locales only when created with text_pattern_ops. LIKE is useful if you use wildcards, but since you don't, just use normal =-equality, which any B-tree index supports directly.
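As a concrete sketch (the index name is my own invention; adjust to taste), the expression index would be created like this:
CREATE INDEX my_table_filename_lower_idx ON my_table (LOWER(filename));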
Finally, there is a good chance that the query optimiser already does a lot with the query, so I suggest you look at its EXPLAIN (ANALYSE) output. I have some suggestions for improvement below, but no idea whether they will actually help or whether they all boil down to the same query plan. Checking that is completely up to you!
This first query takes the list of filenames and removes any that match a row in my_table. The downside is that the returned filenames are lowercased.
SELECT LOWER(filename)
FROM (VALUES ${filenames.joinToString { "(?)" }}) AS input(filename)
EXCEPT ALL (SELECT LOWER(filename) FROM my_table pm)
This next query doesn't have that drawback: it returns the filenames exactly as passed in, keeping only those without a match in my_table.
SELECT filename
FROM (VALUES ${filenames.joinToString { "(?)" }}) AS input(filename)
WHERE NOT EXISTS (
    SELECT
    FROM my_table pm
    WHERE LOWER(pm.filename) = LOWER(input.filename)
)
This last query is probably equivalent to the previous one, but I'll add it for completeness.
SELECT filename
FROM (VALUES ${filenames.joinToString { "(?)" }}) AS input(filename)
WHERE LOWER(filename) NOT IN (
    SELECT LOWER(pm.filename)
    FROM my_table pm
)
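One caveat with the NOT IN form, worth noting here: if the subquery ever produces a NULL, NOT IN returns no rows at all, because x NOT IN (..., NULL) evaluates to NULL rather than true. This doesn't bite if the filename column is NOT NULL, but it is a classic trap. A minimal demonstration:
-- Returns zero rows: 'a' NOT IN (NULL) evaluates to NULL, not true.
SELECT 1 WHERE 'a'::text NOT IN (SELECT NULL::text);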

Related

How to hash a query result with sha256 in PostgreSQL?

I want to somehow hash the result of a query in PostgreSQL. I have a query like
SELECT output FROM result;
And it returns a column composed only of integers. I want to hash the result of this query as a whole: either concatenate the values and hash them, or hash the query output directly. Simply put, I need a way to put it inside SELECT sha256(...). Please note that I do not want a hash of every column entry, but one hash that corresponds to the whole query output. Any ideas?
PostgreSQL doesn't come with a built-in streaming hash function exposed to the user, so the easiest way is to build the string in memory and then hash it. Of course this won't work with giant result sets. You can use digest from the pgcrypto extension. You also need to order your rows, or else you might get different results on the same data from one execution to the next, depending on the order in which the rows arrive.
select digest(string_agg(output::text,' ' order by output),'sha256')
from result;
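For completeness: digest comes from the pgcrypto extension, which must be installed once per database, and it returns bytea, so wrapping it in encode(..., 'hex') gives a readable hex digest. A sketch building on the query above:
CREATE EXTENSION IF NOT EXISTS pgcrypto;

SELECT encode(digest(string_agg(output::text, ' ' ORDER BY output), 'sha256'), 'hex')
FROM result;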
Replace 1234 with your column name and add [from table_name] to this query:
select encode(digest(1234::text, 'sha256'), 'hex')
Or for multiple rows use this:
select encode(
         digest(
           (select array_agg(q1)::text[]
            from (select row(R.*)::text as q1
                  from (SELECT output FROM result) R) alias)::text,
           'sha256'),
       'hex')

Postgres "first" aggregation function

I am aggregating a table by its file ID field. Each file has a name that matches exactly one file ID (its own).
select file_key, min(fullfilepath)
from table
group by file_key
Because I know the structure of the table, I know that any fullfilepath will do. min and max are OK, but they require a lot of time.
I came across an aggregate function which returns the first value. Unfortunately, that function also takes a long time, because it scans the whole table. For example, this is very slow:
select first(file_id) from table;
What is the fastest way to do that, with or without an aggregate function?
There is no way to make your first query with the GROUP BY clause faster, because it has to scan the whole table to find all groups.
Your second query can be made faster:
SELECT (
    SELECT file_id FROM "table"
    WHERE file_id IS NOT NULL
    LIMIT 1
);
There is no way to optimize the query as you wrote it, because the aggregate function is a black box to PostgreSQL.
I doubt that this will help performance, but it may be useful if anyone actually wants a first aggregate.
-- coalesce isn't a regular function, so make an equivalent function to use as the state function.
create function coalesce_("anyelement", "anyelement") returns "anyelement"
language sql as $$ select coalesce($1, $2) $$;
create aggregate first("anyelement") (sfunc=coalesce_, stype="anyelement");
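For illustration (my usage example, not part of the original answer), the aggregate is then used like any built-in one; note it returns the first non-null value the executor happens to feed it, in no guaranteed order:
select file_key, first(fullfilepath)
from "table"
group by file_key;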
select distinct on (file_key)
       file_key, fullfilepath
from "table"
order by file_key;
That will return one record for each file_key.
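If that is still slow, an index matching the ORDER BY should help, since DISTINCT ON can then read the rows pre-sorted instead of sorting the whole table. A hedged sketch, assuming the table name from the question:
create index on "table" (file_key);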

Filter data returned by a function

I have a very simple query, which looks like so:
SELECT (ST_DumpPoints(geom)).path as path FROM layer
It uses some library function which returns integer[] type. Data in the resulting table looks like:
{1}
{1,2}
{1}
I tried a lot of ways to filter this data - say, to get all rows which contain {2}. But I failed. The only query that works is this one:
SELECT path FROM (
    SELECT (ST_DumpPoints(geom)).path as path FROM layer_60_
) t WHERE path @> '{2}'::integer[]
But using a subquery in this case looks like overkill. Is there any possibility to filter it right away without a subquery? My own attempts ended in failure:
SELECT (ST_DumpPoints(geom)).path as path FROM layer_60_ WHERE path ... # indeed, wrong
SELECT (ST_DumpPoints(geom)).path as path FROM layer_60_ HAVING path ... # wrong
SELECT * FROM layer_60_ WHERE (ST_DumpPoints(geom)).path ... # does not work
SELECT (ST_DumpPoints(geom)).path as path FROM layer_60_ GROUP BY path HAVING path ... # wrong!!!
So, what is wrong with that and is there any simpler solution, than a subquery?
Using a sub-query should not be of any concern to you. The planner will fold sub-queries into a single plan anyway. If a query is easier to construct, understand and maintain using a sub-query, then - by all means - go ahead. There is no performance penalty (unless you use optimization barriers).
In your case, the function ST_DumpPoints() is a set-returning function, so you should really use it as a row source and then it becomes rather easy to do what you want without a sub-query:
SELECT p.path
FROM layer_60_, ST_DumpPoints(geom) p(path, geom)
WHERE p.path @> '{2}'::integer[];
The function is evaluated once for each row of the preceding table, and its result rows are joined in as if a join condition had been specified (an implicit LATERAL join). After that you can refer to the columns of the function's output.
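Equivalently, the join can be spelled out with an explicit LATERAL (PostgreSQL 9.3+); this is the same query, just with the dependency on geom made explicit:
SELECT p.path
FROM layer_60_
CROSS JOIN LATERAL ST_DumpPoints(geom) AS p(path, geom)
WHERE p.path @> '{2}'::integer[];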

Questionable performance using IF EXISTS with inner existence checks

This is in a stored procedure. If this IF statement is true, I do a little work. The @AsOfDate is a passed-in variable of the date datatype. The question I have is: why do I get better performance by removing the inner-most EXISTS, but ONLY when the entire statement is in an IF EXISTS?
The two tables:
dbo.TXXX_InventoryDetail -- 1.3 billion records, stats up to date
dbo.TXXX_InventoryFull -- 9.8 million records, stats up to date
Statement:
if exists (select 1
           from dbo.TXXX_InventoryDetail o
           where exists (select 1
                         from dbo.TXXX_InventoryFull i
                         where i.C001_AsOfDate = o.C001_AsOfDate
                           and i.C001_ProductID = o.C001_ProductID
                           and i.C001_StoreNumber = o.C001_StoreNumber
                           and i.C001_AsOfDate = @AsOfDate
                           and (i.C001_LastModelDate != o.C001_LastModelDate
                                or i.C001_InventoryQty != o.C001_InventoryQty
                                or i.C001_OnOrderQty != o.C001_OnOrderQty
                                or i.C001_TBOQty != o.C001_TBOQty
                                or i.C001_ModelQty != o.C001_ModelQty
                                or i.C001_TBOAdjustQty != o.C001_TBOAdjustQty
                                or i.C001_ReturnQtyPending != o.C001_ReturnQtyPending
                                or i.C001_ReturnQtyInProcess != o.C001_ReturnQtyInProcess
                                or i.C001_ReturnQtyDueOut != o.C001_ReturnQtyDueOut))
             and o.C001_AsOfDate = @AsOfDate)
IO output:
Table 'TXXX_InventoryFull'. Scan count 9240262, logical reads 29548864
Table 'TXXX_InventoryDetail'. Scan count 1, logical reads 17259
If I remove the second where exists and do a join:
if exists (select 1
           from dbo.TXXX_InventoryDetail o,
                dbo.TXXX_InventoryFull i
           where i.C001_AsOfDate = o.C001_AsOfDate
             and i.C001_ProductID = o.C001_ProductID
             and i.C001_StoreNumber = o.C001_StoreNumber
             and i.C001_AsOfDate = @AsOfDate
             and (i.C001_LastModelDate != o.C001_LastModelDate
                  or i.C001_InventoryQty != o.C001_InventoryQty
                  or i.C001_OnOrderQty != o.C001_OnOrderQty
                  or i.C001_TBOQty != o.C001_TBOQty
                  or i.C001_ModelQty != o.C001_ModelQty
                  or i.C001_TBOAdjustQty != o.C001_TBOAdjustQty
                  or i.C001_ReturnQtyPending != o.C001_ReturnQtyPending
                  or i.C001_ReturnQtyInProcess != o.C001_ReturnQtyInProcess
                  or i.C001_ReturnQtyDueOut != o.C001_ReturnQtyDueOut)
             and o.C001_AsOfDate = @AsOfDate)
IO output:
Table 'TXXX_InventoryDetail'. Scan count 0, logical reads 333952
Table 'TXXX_InventoryFull'. Scan count 1, logical reads 630
Now, the reason I think it is the IF EXISTS is that if I remove it and do a SELECT COUNT(*) like this:
select COUNT(*)
from dbo.TXXX_InventoryDetail o
where exists (select 1
              from dbo.TXXX_InventoryFull i
              where i.C001_AsOfDate = o.C001_AsOfDate
                and i.C001_ProductID = o.C001_ProductID
                and i.C001_StoreNumber = o.C001_StoreNumber
                and i.C001_AsOfDate = @AsOfDate
                and (i.C001_LastModelDate != o.C001_LastModelDate
                     or i.C001_InventoryQty != o.C001_InventoryQty
                     or i.C001_OnOrderQty != o.C001_OnOrderQty
                     or i.C001_TBOQty != o.C001_TBOQty
                     or i.C001_ModelQty != o.C001_ModelQty
                     or i.C001_TBOAdjustQty != o.C001_TBOAdjustQty
                     or i.C001_ReturnQtyPending != o.C001_ReturnQtyPending
                     or i.C001_ReturnQtyInProcess != o.C001_ReturnQtyInProcess
                     or i.C001_ReturnQtyDueOut != o.C001_ReturnQtyDueOut))
  and o.C001_AsOfDate = @AsOfDate
Table 'TXXX_InventoryFull'. Scan count 41, logical reads 692
Table 'TXXX_InventoryDetail'. Scan count 65, logical reads 17477
Table 'Worktable'. Scan count 0, logical reads 0
It is generally said that one should avoid correlated subqueries in the predicate, as these tend to force nested loop joins. When querying large datasets, especially when trying to discover a difference between the sets, it's important to allow the query optimizer to choose dynamically between hash, merge and nested loop algorithms, which may not be possible if the query is structured using a correlated subquery. Better to create these as derived tables in the FROM clause, as sketched below.
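A rough sketch of that rewrite (mine, abbreviated to a single difference predicate; the real query would carry all of them) pre-filters the smaller table in a derived table so the optimizer is free to pick a hash or merge join:
if exists (select 1
           from dbo.TXXX_InventoryDetail o
           join (select *                       -- derived table: filter the smaller set first
                 from dbo.TXXX_InventoryFull
                 where C001_AsOfDate = @AsOfDate) i
             on  i.C001_AsOfDate = o.C001_AsOfDate
             and i.C001_ProductID = o.C001_ProductID
             and i.C001_StoreNumber = o.C001_StoreNumber
           where o.C001_AsOfDate = @AsOfDate
             and i.C001_LastModelDate != o.C001_LastModelDate)  -- remaining predicates elided
    print 'differences found';  -- placeholder for the real work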
I have found similar issues using the EXISTS statement on a SQL 08 R2 server, where the exact same statement runs fine on SQL 08 and SQL 05.
I found that changing something like
WHILE EXISTS(SELECT * FROM X)
Would be super slow, but:
WHILE ISNULL((SELECT TOP 1 ID FROM X), 0) <> 0
Runs perfectly fast again.
To me, it seems like an R2 issue...
I would guess that the plan you get is quite different when you use the join. Perhaps the imbalance in the number of rows (very large outer table, smaller inner table) is giving the optimizer fits, but it can probably eliminate rows much easier with the join (you'll probably see additional loop operators with the worse query). Tough to really guess without seeing the plans or being able to reproduce, but you should always aim at eliminating the most rows as early in the plan as possible. Pulling back millions of rows through several operators / subqueries only to eliminate most of them later in the plan is almost certainly going to yield worse performance.

Cannot sort a row of size 8130, which is greater than the allowable maximum of 8094

SELECT DISTINCT tblJobReq.JobReqId
, tblJobReq.JobStatusId
, tblJobClass.JobClassId
, tblJobClass.Title
, tblJobReq.JobClassSubTitle
, tblJobAnnouncement.JobClassDesc
, tblJobAnnouncement.EndDate
, tblJobAnnouncement.AgencyMktgVerbage
, tblJobAnnouncement.SpecInfo
, tblJobAnnouncement.Benefits
, tblSalary.MinRateSal
, tblSalary.MaxRateSal
, tblSalary.MinRateHour
, tblSalary.MaxRateHour
, tblJobClass.StatementEval
, tblJobReq.ApprovalDate
, tblJobReq.RecruiterId
, tblJobReq.AgencyId
FROM ((tblJobReq
LEFT JOIN tblJobAnnouncement ON tblJobReq.JobReqId = tblJobAnnouncement.JobReqId)
INNER JOIN tblJobClass ON tblJobReq.JobClassId = tblJobClass.JobClassId)
LEFT JOIN tblSalary ON tblJobClass.SalaryCode = tblSalary.SalaryCode
WHERE (tblJobReq.JobClassId in (SELECT JobClassId
from tblJobClass
WHERE tblJobClass.Title like '%Family Therapist%'))
When I try to execute the query it results in the following error:
Cannot sort a row of size 8130, which is greater than the allowable maximum of 8094
I checked and didn't find any solution. The only way I found is to truncate (with substring()) the tblJobAnnouncement.JobClassDesc column in the query, which has a column size of around 8000.
Is there any workaround so that I need not truncate the values? Or can this query be optimised? Is there any setting in SQL Server 2000?
The [non-obvious] reason why SQL needs to SORT is the DISTINCT keyword.
Depending on the data and underlying table structures, you may be able to do away with this DISTINCT, and hence not trigger this error.
You readily found the alternative solution which is to truncate some of the fields in the SELECT list.
Edit: Answering "Can you please explain how DISTINCT would be the reason here?"
Generally, the fashion in which the DISTINCT requirement is satisfied varies with
the data context (expected number of rows, presence/absence of index, size of row...)
the version/make of the SQL implementation (the query optimizer in particular receives new or modified heuristics with each new version, sometimes resulting in alternate query plans for various constructs in various contexts)
Yet, all the possible plans associated with a "DISTINCT query" involve *some form* of sorting of the qualifying records. In its simplest form, the plan first produces the list of qualifying rows (the list of records which satisfy the WHERE/JOINs/etc. parts of the query) and then sorts this list (which possibly includes some duplicates), only retaining the very first occurrence of each distinct row. In other cases, for example when only a few columns are selected and when some index(es) covering these columns is(are) available, no explicit sorting step is used in the query plan, but the reliance on an index implicitly implies the "sortability" of the underlying columns. In yet other cases, steps involving various forms of merging or hashing are selected by the query optimizer, and these too, eventually, imply the ability to compare two rows.
Bottom line: DISTINCT implies some sorting.
In the specific case of the question, the error reported by SQL Server and preventing the completion of the query is that "Sorting is not possible on rows bigger than..." AND, the DISTINCT keyword is the only apparent reason for the query to require any sorting (BTW many other SQL constructs imply sorting: for example UNION) hence the idea of removing the DISTINCT (if it is logically possible).
In fact you should remove it, for test purposes, to assert that, without DISTINCT, the query completes OK (if only including some duplicates). Once this fact is confirmed, and if effectively the query could produce duplicate rows, look into ways of producing a duplicate-free query without the DISTINCT keyword; constructs involving subqueries can sometimes be used for this purpose, as in the sketch below.
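One such construct (a sketch of mine, with the column list abbreviated) applies DISTINCT only to a narrow key set in a derived table, so the sort sees small rows, and joins the wide text columns back on afterwards:
SELECT JR.JobReqId, JR.JobStatusId, JA.JobClassDesc, JA.SpecInfo, JA.Benefits
FROM (SELECT DISTINCT tblJobReq.JobReqId        -- DISTINCT over the narrow key only
      FROM tblJobReq
      INNER JOIN tblJobClass ON tblJobReq.JobClassId = tblJobClass.JobClassId
      WHERE tblJobClass.Title LIKE '%Family Therapist%') K
INNER JOIN tblJobReq AS JR ON JR.JobReqId = K.JobReqId
LEFT JOIN tblJobAnnouncement AS JA ON JA.JobReqId = JR.JobReqId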
An unrelated hint is to use short table aliases, to avoid repeating these long table names. For example (I only did a few tables, but you get the idea...):
SELECT DISTINCT JR.JobReqId, JR.JobStatusId,
tblJobClass.JobClassId, tblJobClass.Title,
JR.JobClassSubTitle, JA.JobClassDesc, JA.EndDate, JA.AgencyMktgVerbage,
JA.SpecInfo, JA.Benefits,
S.MinRateSal, S.MaxRateSal, S.MinRateHour, S.MaxRateHour,
tblJobClass.StatementEval,
JR.ApprovalDate, JR.RecruiterId, JR.AgencyId
FROM (
(tblJobReq AS JR
LEFT JOIN tblJobAnnouncement AS JA ON JR.JobReqId = JA.JobReqId)
INNER JOIN tblJobClass ON JR.JobClassId = tblJobClass.JobClassId)
LEFT JOIN tblSalary AS S ON tblJobClass.SalaryCode = S.SalaryCode
WHERE (JR.JobClassId in
(SELECT JobClassId from tblJobClass
WHERE tblJobClass.Title like '%Family Therapist%'))
FYI, running this SQL command on your DB can fix the problem if it is caused by space that needs to be reclaimed after dropping variable length columns:
DBCC CLEANTABLE (0, 'dbo.TableName')
See: http://msdn.microsoft.com/en-us/library/ms174418.aspx
This is a limitation of SQL Server 2000. You can:
Split it into two queries and combine elsewhere
SELECT ID, ColumnA, ColumnB FROM TableA JOIN TableB
SELECT ID, ColumnC, ColumnD FROM TableA JOIN TableB
Truncate the columns appropriately
SELECT LEFT(LongColumn,2000)...
Remove any redundant columns from the SELECT
SELECT ColumnA, ColumnB, --IDColumnNotUsedInOutput
FROM TableA
Migrate off of SQL Server 2000