How to defer a slow database operation with SQLAlchemy / Python - postgresql

I've got a REST interface to a database, implemented in Python FastAPI + SQLAlchemy. I want to run some (relatively) costly operations on my data, such as calculating cryptographic hashes or signatures. My hope is to be able to:
Perform the insert, so that all referential integrity is checked by the database.
Return the REST response to the front-end (primary key of insert).
Asynchronously calculate the costly hash/signature/whatever, and update into the DB (same table).
I'd prefer to have a cross-database solution, but it seems running a trigger on insert/update might be the way to go. If so, my target in production is PostgreSQL.
Does anyone have a suggested approach they'd favour?

I would try to separate the concerns.
Build a fast REST service that does items 1 and 2 (concern: consistent data persistence);
Create a scheduled job that does item 3 in the background using pgAgent or s/th similar (concern: data "decoration") or use a notify/listen setup to invoke a separate process.
Running a trigger would do the actions synchronously. As far as I understand this is not your intention.

Related

Architecture to be able to have a lot of SQL calls on the same tables for a workflow execution

We have a project where we let users execute workflows based on a selection of steps.
Basically each step is linked to an execution and an execution can be linked to one or multiple executionData (the data created or updated during that execution for that step, a blob in postgres).
Today, we execute this through a queuing mechanism where executions are created in queues and workers do the executions and create the next job in the queue.
But this architecture and our implementation make our postgres database slow as when multiple jobs are scheduled at the same time:
We are basically always creating and reading from the execution table (we create the execution to be scheduled, we read the execution when starting the job, we update the status when the job is finished)
We are basically always creating and reading from the executionData table (we add and update executionData during executions)
We have the following issues:
Our executionData table is growing very fast and it's almost impossible to remove rows as there are constantly locks on the table => what could we do to avoid that ? Postgres a good usage for that kind of data ?
Our execution table is growing as well very fast and it impacts the overall execution as to be able to execute we need to create, read & update execution. Delete of rows is as well almost impossible ... => what could we do to improve this ? Usage of historical table ? Suggestions ?
We need to perform statistics on the total executions executed & data saved, this is as well requested on the above table which slows down the process
We use RDS on AWS for our Postgres database.
Thanks for your insights!
Try going for a faster database architecture. Your use-case seems well optimized for a DynamoDB architecture for your executions. You can get O(1) performance, and the blob-storage can fit right into the record as long as you can keep it under 256K.

Azure Database, EF, Time out issues

I have taken over an existing MVC website which uses entity framework and hangfire and is hosted on Azure and uses Azure database. Every so often the website times out.
I'm new to Azure portal, entity framework and hangfire.
If I increase the DTU's it clears the timeout issues?
I'm looking for ways of how to diagnose why the website times out. I have added error logging using elmah and checked hangfire but this doesn't give me any further information.
Is there anything in azure portal that can help?
If it "times out" and if "increasing DTU resolves timeouts" and these observations are true (I think it's on you to really convince yourself this is absolutely true, don't make this assumption lightly) then the usual and obvious candidate is "a slow sql query". Entity Framework is often used with linq to create sql queries without writing sql. These queries are often fine for very simple tasks, such as someData.Where(x=>x.Id == 1).First(), however, if linq is used to join tables, or create complex associations, the generated sql can become monstrously bad, from a performance perspective. You can add logging to write out the sql generated by linq, or you can try to trace the database to see what sql is running on it. If tracing is out of the question, there are still meta queries you can use to view things like cached query plans and SQL Server can give you estimated costs and cached execution counts.
You can still hang yourself without using linq. You can still use stored procedures with EF. Way too many developers are naive about SQL performance still; you need to comb over your back end and learn the schema, the stored procedures; inspect the sql contents of everything. Check for any database triggers (easy to miss). Red flags are subqueries, too many joining, too many results from a query, lots of string manipulation in a query, joining tables on strings, or XML/JSON-based SQL work.
Be aware that "slow sql queries" will become slower when load is high. And when slow sql queries build up, they only take more time to resolve. This can also cause debilitating table locking, depending on the nature of the query.
But queries can be performant and still cause locking. ie One table is being written to often and it's blocking other writes or reads from that table. This is a little harder to diagnose, but you can figure it out by carefully inspecting logs of database calls and how long they take to execute. There are also sql queries you can run on the database to diagnose long-running queries, or what tables are locked at a given point in time.
Finally, check for any back end webjobs for your application. If timeouts occur at reoccurring days or times, then somebody's batch SQL could be blocking your production database from being read.
But this is all speculation. I think you need to do more research to determine what is actually causing the site to become unresponsive. If you can log response times for common queries, you can rule out SQL-based latency as being the culprit or not and work from there. There's nothing inherently "amiss" about any of the technologies you specified.
If queries are perfomant but still causing issues, a long term solution is to add something like a message queue and batch your sql work intelligently, or just make the database work asynchronous and not block the UI.
You should correlate any logged timeouts with azure's monitoring. Azure can give you CPU/RAM/page visits and such on the dashboard.
SQL Azure is a bit of a different beast. It doesn't have the on-demand performance of a dedicated DB unless you're prepared to throw serious $$ at it. And even then ...
EF, when written for well can perform quite well. When written poorly it can be a dog, and those problems are compounded on a platform like SQL Azure.
The first thing is to check that your EF contexts are set up to use an execution strategy suited to Azure: https://learn.microsoft.com/en-us/ef/ef6/fundamentals/connection-resiliency/retry-logic
The next thing would be to see what kinds of SQL tracing you can run on Azure. Tracing is essential to see what EF is doing behind the scenes. I'm not familiar with tools available for Azure, in my case my Azure experience was running SQL Server on VMs because SQL Azure was too immature, not HIPAA compliant at the time, and expensive for the DTU estimates we were able to get. Worst case, can you restore an database backup into an SQL Server instance and point a copy of your application environment temporarily at that to run through common usage scenarios? Using an SQL Trace you can pick up on exactly when and how often EF is executing queries, and what queries it is executing.
Things to look at:
How many queries are running? If you are loading a set of records and expect one query, are there a whole heap of queries getting sent? This would indicate lazy-load calls being triggered.
What queries are being run? Is it selecting a lot more fields than are being displayed? This would be potentially a case where entire entities are being loaded where a .Select() could be used to reduce the amount of data. Perhaps even the case where entire sets of entities are being loaded that aren't relevant to what is displayed/done, such as cases where someone is using .ToList() prior to just doing a .Count() or .Any() or doing a .FirstOrDefault() just to do a != null check.
Is the database properly indexed? Copy some of the heavier queries into SQL Manager and execute them with an execution plan. Are there indexing suggestions?
The common sins of developing with EF and other ORMs boil down to "pulling too much, too often." It's surprising how many clients I've worked with have development teams that have not used a profiler to inspect their ORM use efficiency. (and I'm talking 0% so far.)

Challenges in migrating from IBM DB2 to Netezza

Due to added advantage of high performance and reduction in turnaround time, I am trying to migrate all the data from IBM DB2 to Netezza in my organization.
But what I realized is there is no concept of primary key in Netezza? If true, I can try and take care of these issue by using duplicate removal stage in Datastage.
Also, could you guys please assist me understanding if there are any more constraints that I should consider or challenges I could face for DB2 to Netezza migration?
Netezza does allow you to specify Primary Key and Foreign Key restraints, but they are not enforced. Which is to say that they are purely informational (for bot the user and the optimizer). A well-formed upsert process in ETL is a good way to manage for this.
On the topic of other issues you may face, here are a few thoughts:
Surrogate Keys
Be sure that you generate your surrogate keys either with Netezza's SEQUENCE object, or with a surrogate key generator in your ETL tool. Avoid using ROW_NUMBER for this process as it will most often prevent you from exploiting the parallel nature of the system when used in this way.
Stored Procedures
Stored procedures should avoid row-by-row/cusor-based processing when possible, as this is another case where you may prevent yourself from exploiting the parallel nature of the system.
SQL Extension Functions
If you find that you rely on functions that exists in DB2 that you don't find natively in Netezza, be sure to check what is available in the SQL Extensions Toolkit, which is included with Netezza, but not automatically installed/configured.
MERGE
If you rely on MERGE in your current environment, be aware that you must be on v7.2.1 to use MERGE in Netezza. Otherwise you will have to break it out into an INSERT/UPDATE operation.
Once you load the data in Netezza, one method we have utilized is to create a View to access the data and only expose the view. The view would have the logic inside to remove the duplicates.
Good luck!
Delan

Upsert in Amazon RedShift without Function or Stored Procedures

As there is no support for user defined functions or stored procedures in RedShift, how can i achieve UPSERT mechanism in RedShift which is using ParAccel, a PostgreSQL 8.0.2 fork.
Currently, i'm trying to achieve UPSERT mechanism using IF...THEN...ELSE... statement
e.g:-
IF NOT EXISTS(SELECT...WHERE(SELECT..))
THEN INSERT INTO tblABC() SELECT... FROM tblXYZ
ELSE UPDATE tblABC SET.,.,.,. FROM tblXYZ WHERE...
which is giving me error. As i'm writing this code independently without including it in function or SP's.
So, is there any solution to achieve UPSERT.
Thanks
You should probably read this article on upsert by depesz. You can't rely on SERIALIABLE for this since, AFAIK, ParAccel doesn't support full serializability support like in Pg 9.1+. As outlined in that post, you can't really do what you want purely in the DB anyway.
The short version is that even on current PostgreSQL versions that support writable CTEs it's still hard. On an 8.0 based ParAccel, you're pretty much out of luck.
I'd do a staged merge. COPY the new data to a temporary table on the server, LOCK the destination table, then do an UPDATE ... FROM followed by an INSERT INTO ... SELECT. Doing the data uploads in big chunks and locking the table for the upserts is reasonably in keeping with how Redshift is used anyway.
Another approach is to externally co-ordinate the upserts via something local to your application cluster. Have all your tools communicate via an external tool where they take an "insert-intent lock" before doing an insert. You want a distributed locking tool appropriate to your system. If everything's running inside one application server, it might be as simple as a synchronized singleton object.

MongoDB stored procedures NOT in javascript

I have some collection and I want to perform action on every insert into that collection. The problem is that the code, that will do this actions is in Java. In Oracle it was possible to wrap Java or even C code into PL/SQL procedure, and then use this procedure in trigger. In CouchDB we could write a view. What would be the closest analog for MongoDB?
The best possibility I can think of is to wrap my code into REST server, and then interact with it using stored javascript.
I've already seen this question, but due to dependency on java libs, I can't use just javascript in my workflow, neither I don't want to run a new heavy service along with mongodb if there is some other way to do this.
There are a number of things to say about your request:
I have some collection and I want to perform action on every insert into that collection.
1) What you're asking for here is not really a "stored procedure", but really is a "database trigger". MongoDB does not provide any sort of "database trigger" functionality.
This is consistent with the general design goals of MongoDB, which is to provide a very fast, scalable data store without the heavy weight of traditional DBMS systems. See this presentation for more details about the design goals of MongoDB: http://www.10gen.com/presentations/mongosf2011/whymongodb
2) If there is some data processing that you'd like to perform on every insert, you'll need to do it on the client side of the MongoDB connection. This will necessarily involve writing some code in your application.
3) I'd suggest that you avoid running JavaScript within the mongod server if at all possible. The JavaScript is interpreted on the server side, so the speed of your queries will be affected. In addition, all JavaScript run in the mongod server is single-threaded, so there is no concurrency of any JavaScript execution.
I wish I had a better answer for you.