Architecture to handle many concurrent SQL calls on the same tables during workflow execution - PostgreSQL

We have a project where we let users execute workflows built from a selection of steps.
Each step is linked to an execution, and an execution can be linked to one or more executionData rows (the data created or updated during that execution for that step, stored as a blob in Postgres).
Today, we run this through a queuing mechanism: executions are created in queues, and workers perform the executions and create the next job in the queue.
But this architecture and our implementation make our Postgres database slow when multiple jobs are scheduled at the same time:
We are constantly writing to and reading from the execution table (we create the execution to be scheduled, read it when starting the job, and update its status when the job finishes).
We are constantly writing to and reading from the executionData table (we add and update executionData during executions).
We have the following issues:
Our executionData table is growing very fast, and it's almost impossible to remove rows because there are constantly locks on the table. What could we do to avoid that? Is Postgres a good fit for that kind of data?
Our execution table is also growing very fast, which impacts overall performance since each run needs to create, read, and update execution rows. Deleting rows is likewise almost impossible. What could we do to improve this? Use a historical table? Other suggestions?
We also need to compute statistics on the total executions run and data saved; these queries hit the same tables, which slows the process down further.
We use RDS on AWS for our Postgres database.
Thanks for your insights!

Try a database architecture better suited to the access pattern. Your use case seems well suited to DynamoDB for your executions: you get O(1) key-value performance, and the blob can fit right into the record as long as you keep it under DynamoDB's item size limit (400 KB).
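Alternatively, if the executions stay in Postgres, the lock-heavy deletes described in the question can often be avoided with declarative partitioning: detaching and dropping a time-based partition is a near-instant metadata operation, unlike a bulk DELETE that fights row locks. A minimal sketch, with hypothetical table and column names:

```sql
-- Range-partition executions by creation date (illustrative schema).
-- The partition key must be part of the primary key.
CREATE TABLE execution (
    id          bigserial,
    status      text NOT NULL,
    created_at  timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

CREATE TABLE execution_2024_01 PARTITION OF execution
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE execution_2024_02 PARTITION OF execution
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

-- Retiring a month of finished executions is a metadata operation,
-- not a row-by-row DELETE:
ALTER TABLE execution DETACH PARTITION execution_2024_01;
DROP TABLE execution_2024_01;
```

The same approach works for executionData if it can be partitioned on the parent execution's date.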

Related

Optimize a trigger that writes data to a data warehouse

We are using a trigger to store data in the warehouse. Whenever some process executes, the trigger fires and stores some information in the data warehouse. When the number of transactions increases, it affects processing time.
What would be the best way to do this?
I was thinking about a Foreign Data Wrapper or an AWS read replica. Any other way to do this would be appreciated as well. Or might I not need to use a trigger at all?
Here are some quick tips:
Reduce latency between the database servers.
The target database table should have fewer indexes, to improve DML performance.
Logical replication may solve syncing data to the warehouse.
Option 3 is an architectural change, but you don't need to write triggers on each table to sync data.
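Tip 3 (logical replication) replaces per-table triggers entirely: the source publishes changes and the warehouse subscribes to them. A rough sketch, assuming the warehouse is also Postgres; publication, subscription, table, and connection names are illustrative:

```sql
-- On the source database (requires wal_level = logical in postgresql.conf):
CREATE PUBLICATION warehouse_pub FOR TABLE orders, order_items;

-- On the warehouse database:
CREATE SUBSCRIPTION warehouse_sub
    CONNECTION 'host=source-db dbname=app user=repl'
    PUBLICATION warehouse_pub;
```

Changes then stream to the warehouse asynchronously, so the source transactions no longer pay the trigger cost inline.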

Bidirectional Replication Design: best way to script and execute unmatched rows on a Source DB to multiple subscriber DBs, sequentially or concurrently?

Thank you for any help or suggestions offered.
I am trying to build my own multi-master replication on PostgreSQL 10 on Windows, for a situation where none of the current third-party tools for PG multi-master replication can be used, and where another DB platform (Sybase ADS) can also be part of a subscriber group. I have the following logic, partially inspired by Bucardo's design, to create bidirectional replication between 1 publisher and 2 subscribers:
When an INSERT, UPDATE, or DELETE is made on a source table, a trigger on that table adds a row to a meta table on the source DB; that row acts as a replication transaction to be performed on the 2 subscriber DBs which subscribe to it.
A NOTIFY signal will be sent to a service, or a script written in Python or some scripting language will monitor the meta table (or trigger execution) and either do a table compare or script the statement to run on each subscriber database.
***I believe that triggers on the subscribers will need to be paused to keep them from pushing the statements they receive on to their own subscribers; i.e. if node A and node B both subscribe to each other's table A, then an update to node A's table A should replicate to node B's table A without then replicating back to node A's table A in a bidirectional "ping-pong storm".
There will be a final compare between tables and the transaction will be closed. Triggers on subscribers are re-enabled if they were paused/disabled when pushing transactions in step 2.
This will hopefully be done bidirectionally, in timestamp order, FIFO, unless I can figure out a way to create child processes to run the synchronizations concurrently.
For this, I am trying to figure out the best way to set up the service logic (essentially Step 2 above), which has apparently been done with a daemon on Linux; since I have to work on Windows, it would need to run as, or resemble, a service/agent. Alternatively, I'm looking for a reasonably easy and efficient design to send the source DB's statements to the subscriber DBs.
Does anyone see that this plan is faulty or may not work?
Disclaimer: I don't know anything about Postgresql but have done plenty of custom replication.
The main problem with bidirectional replication is merge issues.
If the same key is used in both systems with different attributes, which one gets to push its change? If you nominate a master it's easier: the slave just gets overwritten every time.
How much latency can you handle? It's much easier to take the 'notify' part out and just have a five-minute Windows Task Scheduler job that inspects the log tables and pushes data around.
In other words, this kind of pattern:
Change occurs in a table. A database trigger on that table notes the change and writes the PK of the row to a change log table. A ReplicationBatch column in the log table is NULL by default
A Windows scheduled task inspects all change log tables to find all changes that happened since the last run and 'reserves' these records by setting their ReplicationBatch to a replication batch number,
i.e. you run UPDATE LogTable SET ReplicationBatch = BatchNumber WHERE ReplicationBatch IS NULL
All records that have been marked are replicated:
you run SELECT * FROM LogTable WHERE ReplicationBatch = BatchNumber to get the records to be processed
When complete, the reserved records are marked as complete so the next time around only subsequent changes are replicated. This completion flag might be in the log table or it might be in a ReplicationBatch number table
The main point is that you need to reserve records for replication so that, as you are replicating them out, additional log records can be added from the source without messing up the batch
Then periodically you clear out the log tables.
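The reserve-then-replicate cycle above might look like the following SQL sketch. Column names follow the log-table example, consolidated into a single ReplicationBatch column plus a hypothetical completion flag; the batch number would come from the scheduler job:

```sql
-- 1. Reserve all unclaimed changes under this run's batch number.
UPDATE LogTable
   SET ReplicationBatch = 42          -- current batch number
 WHERE ReplicationBatch IS NULL;

-- 2. Fetch exactly the reserved records; rows logged after step 1
--    stay NULL and wait for the next batch.
SELECT * FROM LogTable
 WHERE ReplicationBatch = 42;

-- 3. After pushing to the subscribers, mark the batch complete.
UPDATE LogTable
   SET Completed = TRUE
 WHERE ReplicationBatch = 42;

-- 4. Periodically purge completed history.
DELETE FROM LogTable WHERE Completed = TRUE;
```

Because step 1 claims a fixed set of rows, new changes arriving mid-replication simply fall into the next batch.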

DB2 LUW Parallel Jobs Execution

I have been working with a DB2 LUW database, and I want to submit procedures as parallel jobs. I have a procedure which executes some DDL and DML statements against one table. The table holds a huge amount of data, and the same procedure needs to run against a few more tables in parallel.
I submit each job using a DBMS_JOB.SUBMIT statement and execute it using DBMS_JOB.RUN. I have a job-handler procedure which is supposed to do this in parallel.
But the jobs execute sequentially (the first job completes, then the second job starts; after the second completes, the third starts).
**My First Question**
How do I run DBMS_JOB jobs in parallel?
The second issue I'm facing is that the current session waits for all the jobs to complete; I can't use that session until every job has finished.
**My Second Question**
*How do I keep the session accessible instead of waiting for all jobs to complete?*
Any help is appreciated.
DBMS_JOB is an interface to the Administrative Task Scheduler (ATS) of Db2-LUW, provided for some compatibility with Oracle RDBMS. However, you can also use the ATS directly, independently of DBMS_JOB, via ADMIN_TASK_ADD and related procedures.
My experience is that db2acd (the process that implements autonomic actions, including the ATS) is unreliable, especially when ulimits are misconfigured, and it silently won't run jobs in some circumstances. It also only wakes up every 5 minutes to check for new jobs, which can be frustrating, and it requires an already-activated database, which is inconvenient for some use cases.
I would not recommend usage of the Db2 ATS for application layer functionality. Full function enterprise schedulers exist for good reasons.
For parallel invocations, I would use an enterprise scheduling tool if available, or failing that, use the scheduler supplied by the operating system, either on the Db2 server or, at worst, on the client side, taking care in both cases that each stored-procedure invocation is its own scheduled job with its own Db2 connection.
By using a Db2-connection per stored-procedure invocation, and concurrently scheduling them, they run in parallel as long as their actions don't cause mutual contention.
Apart from the above, I believe the ATS will start jobs in parallel provided that the job definitions are correct.
Examine the contents of both ADMIN_TASK_LIST and ADMIN_TASK_STATUS administrative views, and corroborate with db2diag entries (diaglevel 4 may give more detail, even if you must use it only temporarily).
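If you do experiment with the ATS directly, a call might look roughly like the sketch below. The parameter list (task name, begin/end timestamps, max invocations, schedule, procedure schema, procedure name, procedure input, options, remarks) is from memory of the SYSPROC.ADMIN_TASK_ADD interface, so verify it against your Db2 version's documentation; all schema, procedure, and table names here are made up:

```sql
-- Two one-shot tasks; each runs on its own connection under the ATS,
-- so they can execute in parallel rather than sequentially.
CALL SYSPROC.ADMIN_TASK_ADD(
    'rebuild_table_a', NULL, NULL, 1, NULL,
    'MYSCHEMA', 'PROC_REBUILD', 'VALUES(''TABLE_A'')', NULL, NULL);

CALL SYSPROC.ADMIN_TASK_ADD(
    'rebuild_table_b', NULL, NULL, 1, NULL,
    'MYSCHEMA', 'PROC_REBUILD', 'VALUES(''TABLE_B'')', NULL, NULL);
```

Because the tasks run in background connections, the submitting session is not blocked, which also addresses the second question.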
Calls to SQL PL (or PL/SQL) stored procedures are synchronous relative to the caller, which means that the Db2-connection is blocked until the stored procedure returns. You cannot "make the session accessible" if it is waiting for a stored procedure to complete, but you can open a new connection.
Different options exist for stored procedures written in C, C++, Java, or C++/CLR; they have more freedom. Other options exist for messaging/broker-based solutions. Much depends on available skillsets, toolsets, and experience. But in general it's wiser to keep it simple.

Is it possible to delete a single execution plan from cache on Azure SQL DB?

Conclusion
You cannot. Microsoft explicitly states "you cannot manually remove an execution plan from the cache" in the article 'Understanding the Procedure Cache on SQL Azure'.
Original Question
On SQL Server, a single execution plan can be deleted from the cache using DBCC FREEPROCCACHE (plan_handle varbinary(64)). There is separate documentation about DBCC FREEPROCCACHE on SQL Azure. It seems that there it removes all cached execution plans from all compute and/or control nodes (whatever those nodes may be, I don't know). I do not understand why the Azure version would differ from the Server version in this respect.
However, I am looking for a way to delete a single execution plan from the cache on Azure. Is there any way to do so? Maybe using a query instead of DBCC?
There is no way to remove a single execution plan from the cache.
If your execution plan touches only one table (or a few) and you are OK with removing the cached plans for those tables as well, then you can alter the table: add a non-null column and then remove it. This will force a flush of those plans.
Changing the schema of a table causes a cache flush (not a single plan; all plans) for the tables involved.
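The add-and-drop-column trick might look like the following sketch (table and constraint names are illustrative). Note that the default constraint is named explicitly so it can be dropped before the column, and that both ALTERs take a brief schema-modification lock:

```sql
-- Force plan invalidation for dbo.Orders by changing its schema.
ALTER TABLE dbo.Orders
    ADD _flush_plans int NOT NULL CONSTRAINT DF_flush_plans DEFAULT 0;

-- Undo the change; the plans touching dbo.Orders are now invalidated.
ALTER TABLE dbo.Orders DROP CONSTRAINT DF_flush_plans;
ALTER TABLE dbo.Orders DROP COLUMN _flush_plans;
```

This is a blunt instrument: every cached plan referencing the table recompiles on next use, not just the one you wanted gone.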
I do not understand why the Azure version would differ from the Server version in this respect.
This has to do with database-as-a-service as an offering: you are offered a database (which may be on some server with multiple databases), and some DBCC commands affect the whole instance, so they essentially banned all DBCC commands. There is a newer offering called Managed Instance (which is the same as an on-premises server, but with the high availability of Azure SQL Database); you may want to check that as well.

Perform multiple tasks to database one at a time

I'm using SQLite.swift and I want to perform three different tasks that add data to my database. Each task gets its data from an external source.
So what I want to do is:
Get data for the first task
Add it to the first table
When this is done, go on to the next task
Add it to the second table
When this is done, go on to the last task
Add it to the last table
Right now I only have it like this:
dataService.getPlaces()
dataService.getTaxes()
dataService.getPersons()
But the issue is that there are over 2000 places, 100 taxes and 2000 persons, so each task takes some time to complete, and the database gets locked when these run at the same time.
Does anyone have any idea how to do these tasks one at a time?
Use NSOperationQueue, there is an excellent video online from last year's WWDC.
SQLite, whatever Swift library you use, does not support concurrent writes: you won't be able to write places, taxes, and persons in parallel.
This is the case even when you open multiple connections, as I guess you did, because you got locking errors.
What you can do is first load the data from the external sources into memory; this can be done in parallel. When all the data has been loaded, you write it to the database in a single transaction (SQLite performs much better when you group writes in a transaction).
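Grouping the writes in one explicit transaction, as suggested, turns thousands of implicit per-statement transactions into a single commit. A sketch in plain SQL, with illustrative tables and values:

```sql
-- One transaction for the whole import: a single fsync at COMMIT
-- instead of one per INSERT.
BEGIN TRANSACTION;
INSERT INTO places  (id, name) VALUES (1, 'Stockholm');
INSERT INTO taxes   (id, rate) VALUES (1, 0.25);
INSERT INTO persons (id, name) VALUES (1, 'Alice');
-- ...the remaining ~4000 rows are batched here the same way...
COMMIT;
```

In SQLite.swift this corresponds to wrapping the inserts in the connection's transaction API rather than issuing them one by one.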