Postgres alternative to BULK COLLECT LIMIT from Oracle PL/SQL

I have a procedure in Oracle PL/SQL which fetches transactional data based on certain conditions and then performs some logical calculations. I used a cursor to hold the SQL and then used FETCH (cursor) BULK COLLECT INTO (table type variable) LIMIT 10000, iterating over this table variable to perform the calculations and ultimately storing the values in a DB table. Once 10000 rows have been processed, the next set of records is fetched.
This helped me limit the number of times the SQL is executed via the cursor and the number of records loaded into memory.
I am trying to migrate this code to PL/pgSQL. How can I achieve this functionality in PL/pgSQL?

You cannot achieve this functionality in PostgreSQL.

I wrote an extension, https://github.com/okbob/dbms_sql, which can reduce the work needed to migrate from Oracle to Postgres.
But you don't need this feature in Postgres. Although PL/pgSQL is similar to PL/SQL, the architecture is very different - and bulk collect operations are not necessary.

Related

Spark data pipeline initial load impact on production DB

I want to write a Spark pipeline to perform aggregation on my production DB data and then write data back to the DB. My goal of writing the pipeline is to perform aggregation and not impact production DB while it runs, meaning I don't want users experiencing lag nor DB having heavy IOPS while the aggregation is performed. For example, an equivalent aggregation query just run as SQL would take a long time and also use up the RDS IOPS, which results in users not being able to get data - trying to avoid this. A few questions:
How is data loaded into Spark (AWS Glue) in general? Is there query load on prod DB?
Is there a difference in using a custom SQL query vs custom Spark code to filter items initially (initial loading of data, e.g. load 30 days sales data)? For example, does using custom SQL query end up performing a query on the prod DB, resulting in large load on prod DB?
When writing data back to DB, does that incur load on DB as well?
I'm using a PostgreSQL database in case this matters.
How is data loaded into Spark (AWS Glue) in general? Is there query load on prod DB?
By default, Glue reads the whole table into a single partition. But you can do parallel reads using this; make sure to choose a column that will not affect the DB performance.
Is there a difference in using a custom SQL query vs custom Spark code to filter items initially (initial loading of data, e.g. load 30 days sales data)?
Yes. When you pass a query instead of a table, you only read the result of that query from the DB, which avoids a large network and I/O transfer. This means you are delegating the calculation of the result to the DB engine. Refer to this on how you can do it.
For example, does using custom SQL query end up performing a query on the prod DB, resulting in large load on prod DB?
Yes. Depending on the table size and query complexity, this might affect DB performance. If you have a read replica, you can simply use that instead.
When writing data back to DB, does that incur load on DB as well?
Yes, and how much depends on how you write the result back to the DB. A moderate number of partitions is best, i.e. not too many and not too few.
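As a rough illustration of "pass a query instead of a table" combined with a parallel read, here is a minimal sketch that only builds the reader options. It assumes a plain Spark JDBC read rather than Glue's own connection options. The option names (dbtable, partitionColumn, lowerBound, upperBound, numPartitions) are standard Spark JDBC options; the function name, the URL, and the table and column names are hypothetical.

```python
from datetime import date, timedelta


def jdbc_read_options(url, table, date_column, days, id_column,
                      min_id, max_id, num_partitions, today=None):
    """Build Spark JDBC reader options for a filtered, partitioned read.

    Hypothetical helper: table and column names are assumed to come from
    trusted configuration, not from user input.
    """
    today = today or date.today()
    since = today - timedelta(days=days)
    # The filter lives in a subquery, so the database evaluates it and only
    # the matching rows cross the network.
    subquery = (
        f"(SELECT * FROM {table} "
        f"WHERE {date_column} >= DATE '{since.isoformat()}') AS src"
    )
    return {
        "url": url,
        # Spark does not allow "query" together with "partitionColumn",
        # so the subquery is passed through "dbtable" instead.
        "dbtable": subquery,
        # Spark splits [lowerBound, upperBound] on this column into
        # numPartitions ranges and issues one query per range.
        "partitionColumn": id_column,
        "lowerBound": str(min_id),
        "upperBound": str(max_id),
        "numPartitions": str(num_partitions),
    }
```

With PySpark available, the result could be used as spark.read.format("jdbc").options(**opts).load(). Every partition is still a query against the database, so pointing the URL at a read replica and keeping numPartitions modest is what actually protects production.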

Is there any size limitation of SQL result?

I am using Azure PostgreSQL, and I have a lot of files saved as the bytea datatype in a table. In my project, I will execute some SQL queries to get these files.
Sometimes a query will involve multiple files, so the result of the SQL query will be large. My questions: is there a data size limit for the result of one SQL query? Should I impose a limit myself? Any suggestion is appreciated.
There is no limit for the size of a result set in PostgreSQL.
However, many clients cache the whole result set in memory, which can easily lead to an out-of-memory condition on the client side.
There are ways around that:
Use cursors and fetch the result row by row or in batches. That should work with any client API.
With the C API (libpq), you could activate single-row mode.
With JDBC, you could set the fetch size.
Note that this means that you could get a runtime error from the database server in the middle of processing a result set.
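As a rough illustration of the batch approach from the list above, here is a minimal sketch using the fetchmany call from Python's DB-API. The standard-library sqlite3 module stands in for PostgreSQL so that it runs anywhere; the table files and the helper iter_batches are made up for this example.

```python
import sqlite3


def iter_batches(cursor, batch_size):
    """Yield lists of at most batch_size rows until the cursor is exhausted."""
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield rows


# Assumption: sqlite3 is only a stand-in for a PostgreSQL connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, payload BLOB)")
conn.executemany(
    "INSERT INTO files (payload) VALUES (?)",
    [(bytes([i % 256]) * 16,) for i in range(25)],
)

cur = conn.execute("SELECT id, payload FROM files ORDER BY id")
batch_sizes = []
for batch in iter_batches(cur, 10):
    # Only one batch of rows is held by this loop at any time.
    batch_sizes.append(len(batch))
print(batch_sizes)
```

Against PostgreSQL the loop would look the same, but the driver has to be told not to buffer the whole result first. With psycopg2, for instance, that means using a named (server-side) cursor.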

To what degree does PostgreSQL support parallel DDL?

Looking here, it is clear that Oracle supports execution of DDL commands in parallel with scenarios clearly listed. I was wondering whether Postgres does indeed offer such functionality? I can find a lot of material on "parallel queries" for PostgreSQL but not so much when DDL is involved.
For example, can I execute multiple 'CREATE TABLE...AS SELECT' in parallel? And if not, how can I achieve such functionality? What happens if I have a temporary table (CREATE TEMP TABLE)? Do I need to configure something for locks?
From here:
Even when it is in general possible for parallel query plans to be generated, the planner will not generate them for a given query if any of the following are true:
The query writes any data or locks any database rows. If a query contains a data-modifying operation either at the top level or within a CTE, no parallel plans for that query will be generated.
(emphasis mine).
Which seems to suggest that Postgres will not "parallelize" any query that modifies the database structure, under any circumstances.
Running multiple queries simultaneously in Postgres requires one connection per running query.
Those aren't generic DDL statements; they are index operations and partition operations that can be parallelized.
If you check the Notes section of the CREATE INDEX statement, you'll see that parallel index building is supported:
PostgreSQL can build indexes while leveraging multiple CPUs in order to process the table rows faster. This feature is known as parallel index build. For index methods that support building indexes in parallel (currently, only B-tree), maintenance_work_mem specifies the maximum amount of memory that can be used by each index build operation as a whole, regardless of how many worker processes were started. Generally, a cost model automatically determines how many worker processes should be requested, if any.
Update
I suspect the real question is about CREATE TABLE ... AS though.
This is essentially a CREATE TABLE followed by an INSERT .. SELECT. The CREATE TABLE part can't be parallelized and doesn't have to - it's essentially a metadata operation. The SELECT on the other hand, could be parallelized easily. INSERT is a bit harder, but it's a matter of implementation.
As a_horse_with_no_name explains in a comment to this question, parallelization for CREATE TABLE AS was added in PostgreSQL 11:
Improvements to parallelism, including:
CREATE INDEX can now use parallel processing while building a B-tree index
Parallelization is now possible in CREATE TABLE ... AS, CREATE MATERIALIZED VIEW, and certain queries using UNION
Parallelized hash joins and parallelized sequential scans now perform better

Need to join oracle and sql server tables in oledb source without using linked server

My SSIS package has an OLE DB source which joins Oracle and SQL Server to get the source data and loads it into a SQL Server OLE DB destination. Earlier we were using a linked server for this purpose, but we cannot use a linked server anymore.
So I am taking the data from SQL Server and want to pass it to the IN clause of the Oracle query, which I am keeping as the SQL command of the OLE DB source.
I tried reading an object type variable from SQL Server and putting it into the IN clause of the Oracle query in the OLE DB source, but I get an error that Oracle cannot have more than 1000 literals in the IN list. So basically I think I have to do something like this:
select * from oracle.db where id in (select id from sqlserver.db).
Since I cannot use a linked server, I was wondering whether I could have a temp table which can be used throughout the package.
I also tried using a Merge Join in SSIS, but my source data set is really large and the Merge Join returns fewer rows than expected. I am badly stuck at this point. I have tried a number of things and nothing seems to be working.
Can someone please help? Any help will be greatly appreciated.
A couple of options to try.
Lookup:
My first instinct was a Lookup Task, but that might not be a great solution depending on the size of your data sets, since all of the records from both tables have to be pulled over the wire and stored in memory on the SSIS server. If you were able to pull off a Merge Join, then a Lookup should also work, but it might be slow.
Set an OLE DB Source to pull the Oracle data, without the WHERE clause.
Set a Lookup to pull the id column from your SQL Server table.
On the General tab of the Lookup, under Specify how to handle rows with no matching entries, select Redirect rows to no-match output.
The output of the Lookup will just be the Oracle rows that found a matching row in your SQL Server query.
Working Table on the Oracle server
If you have the option of creating a table in the Oracle database, you could create a Data Flow Task to pipe the results of your SQL Server query into a working table on the Oracle box. Then, in a subsequent Data Flow, just construct your Oracle query to use that working table as a filter.
Probably follow that up with an Execute SQL Task to truncate that working table.
Although this requires write access to Oracle, it has the advantage of off-loading the heavy lifting of the query to the database machine, and only pulling the rows you care about over the wire.
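To make the working-table idea concrete outside of SSIS, here is a minimal sketch of the three steps (load the ids, filter with a subquery, clear the table). Two in-memory sqlite3 databases stand in for the SQL Server and Oracle boxes, and every table and column name is made up for this example.

```python
import sqlite3

# Assumption: two in-memory SQLite databases play the roles of the servers.
sqlserver = sqlite3.connect(":memory:")
oracle = sqlite3.connect(":memory:")

sqlserver.execute("CREATE TABLE wanted (id INTEGER PRIMARY KEY)")
sqlserver.executemany(
    "INSERT INTO wanted VALUES (?)", [(i,) for i in range(0, 5000, 2)]
)

oracle.execute("CREATE TABLE source_data (id INTEGER PRIMARY KEY, val TEXT)")
oracle.executemany(
    "INSERT INTO source_data VALUES (?, ?)",
    [(i, f"row {i}") for i in range(3000)],
)
oracle.execute("CREATE TABLE work_ids (id INTEGER PRIMARY KEY)")

# Step 1: pipe the SQL Server ids into the working table on the "Oracle" side.
ids = sqlserver.execute("SELECT id FROM wanted").fetchall()
oracle.executemany("INSERT INTO work_ids VALUES (?)", ids)

# Step 2: filter with a subquery, so no literal IN list is ever built.
matched = oracle.execute(
    "SELECT s.id, s.val FROM source_data s "
    "WHERE s.id IN (SELECT id FROM work_ids) ORDER BY s.id"
).fetchall()

# Step 3: clear the working table (SQLite has no TRUNCATE, so DELETE is used).
oracle.execute("DELETE FROM work_ids")
oracle.commit()
print(len(matched))
```

The 1000-item restriction applies to a literal list of expressions, not to IN (SELECT ...), which is why the working table sidesteps the error from the question.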

HSQL DB: is it possible to simulate Oracle IN clause item limit?

Is there some HSQL DB property which would say how many items can be in the list used in an IN clause? Oracle limits it to 1000 items. When I have more elements, I split the list into chunks of 1000 and execute more queries, but I'd need the HSQL database to simulate this scenario (I am writing an automated test and I'd like it to fail when someone removes this list splitting mechanism in the future).
No such limit can be set in HSQLDB. You should be able to check for the limit with a stored procedure in Oracle and in HSQLDB, so it is not affected by others modifying the application code.
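Another way to get a failing test, sketched here in Python rather than as a stored procedure, is to enforce the limit in a thin wrapper around whatever executes the query in the test. This is a different technique from the one suggested above, and all names in it (with_in_limit, fetch_by_ids, InListTooLong, the items table) are hypothetical. As a simplification it assumes every bind parameter belongs to the IN list.

```python
ORACLE_IN_LIMIT = 1000  # Oracle raises ORA-01795 above this many list items


class InListTooLong(Exception):
    """Raised by the test wrapper where Oracle would reject the statement."""


def with_in_limit(execute, limit=ORACLE_IN_LIMIT):
    """Wrap execute(sql, params) so that oversized IN lists fail in tests."""
    def wrapper(sql, params):
        # Simplification: every parameter is assumed to be an IN-list item.
        if len(params) > limit:
            raise InListTooLong(f"{len(params)} items exceeds limit {limit}")
        return execute(sql, params)
    return wrapper


def chunked(items, size=ORACLE_IN_LIMIT):
    """Yield consecutive slices of items, each at most size long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_by_ids(execute, ids):
    """Application code under test: splits the id list before querying."""
    rows = []
    for chunk in chunked(ids):
        placeholders = ", ".join("?" for _ in chunk)
        sql = f"SELECT id FROM items WHERE id IN ({placeholders})"
        rows.extend(execute(sql, chunk))
    return rows
```

In the test, the real execute function is passed through with_in_limit before it reaches fetch_by_ids. If someone later removes the chunking, the wrapper raises InListTooLong and the test fails, regardless of what the underlying test database would accept.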