How to make SQLAlchemy emit custom DDL after an object is inserted? - postgresql

I have a PostgreSQL Materialized View that calculates some data about Manufacturers.
I created a SQLAlchemy custom DDL command to refresh the view:
from sqlalchemy.schema import DDLElement
from sqlalchemy.ext import compiler


class RefreshMaterializedView(DDLElement):
    '''Target is expected to be a view name string.'''

    def __init__(self, concurrently):
        self.concurrently = concurrently


@compiler.compiles(RefreshMaterializedView)
def compile(element, compiler, **kw):
    if element.concurrently:
        return "REFRESH MATERIALIZED VIEW CONCURRENTLY %s" % (element.target)
    return "REFRESH MATERIALIZED VIEW %s" % (element.target)


class ManufacturerMaterializedView(db.Model):

    @classmethod
    def refresh(cls, concurrently=True, bind=db.session):
        RefreshMaterializedView(concurrently).execute(
            target=cls.__table__.fullname, bind=bind)
Here's how I currently use it in my code:
db.session.add(new_manufacturer_instance)
ManufacturerMaterializedViewClass.refresh() # bound to the same session
db.session.flush()
# some other stuff
db.session.add(another_manufacturer_instance) # still in the same PostgreSQL transaction
ManufacturerMaterializedViewClass.refresh()
db.session.commit()
Desired behavior:
The materialized view is refreshed after the new_manufacturer_instance is created.
I can repeatedly insert new manufacturers and call ManufacturerMaterializedViewClass.refresh() multiple times within the same session, but the refresh will only be emitted once, at the end of the session, after all the INSERT/UPDATE/DELETE statements for all objects have been emitted. Other object types also affect the output of this materialized view, so the refresh statement needs to be emitted after those objects are modified as well.
Here's what's currently happening when I view the Flask-SQLAlchemy query log using SQLALCHEMY_ECHO = True:
$ python manage.py shell
>>> ManufacturerFactory() # creates a new manufacturer instance and adds it to the session
<Manufacturer #None:'est0'>
>>> ManufacturerMV.refresh()
2015-11-29 13:33:44,811 INFO sqlalchemy.engine.base.Engine BEGIN (implicit)
2015-11-29 13:33:44,812 INFO sqlalchemy.engine.base.Engine REFRESH MATERIALIZED VIEW CONCURRENTLY manufacturer_mv
2015-11-29 13:33:44,812 INFO sqlalchemy.engine.base.Engine {}
>>> db.session.flush()
2015-11-29 13:34:13,745 INFO sqlalchemy.engine.base.Engine INSERT INTO manufacturer (name, website, logo, time_updated) VALUES (%(name)s, %(website)s, %(logo)s, %(time_updated)s) RETURNING manufacturer.id
2015-11-29 13:34:13,745 INFO sqlalchemy.engine.base.Engine {'logo': '/static/images/16-rc_gear_essential.jpg', 'website': 'http://hermann.com/', 'time_updated': None, 'name': 'est0'}
>>> db.session.commit()
2015-11-29 13:42:58,160 INFO sqlalchemy.engine.base.Engine COMMIT
As you can see, calling refresh() immediately issues the SQL to the database, even before a session.flush(), pre-empting any additional insert/update statements. The statement runs inside the open transaction, so its effect only becomes visible once session.commit() closes the transaction.
How should I modify my DDL/classmethod to achieve my desired behavior?
I looked at the ORM events but wasn't quite sure how to leverage them for my use case. I don't want a refresh called on every session.commit() emitted by my application. Refreshing this particular view is a fairly expensive operation, so it should only happen when I have actually called refresh() within the current transaction/session.
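For illustration, here is the rough direction I have been experimenting with: record the refresh request on the session inside refresh(), and emit the DDL from a before_commit listener after a final flush. This is an untested sketch, not working code; the pending_mv_refreshes key is just a placeholder, and I am not sure whether the Flask-SQLAlchemy scoped session (db.session) can be used directly as an event target or whether the listener would have to be registered on the underlying Session class instead.
from sqlalchemy import event

class ManufacturerMaterializedView(db.Model):
    # ... columns as before ...

    @classmethod
    def refresh(cls, concurrently=True, session=None):
        # Only record the request here; the DDL is emitted later, before commit.
        session = session or db.session
        session.info.setdefault('pending_mv_refreshes', set()).add(
            (cls.__table__.fullname, concurrently))

@event.listens_for(db.session, 'before_commit')
def _emit_pending_refreshes(session):
    # Runs for every commit, but only does work in sessions that asked for it.
    pending = session.info.pop('pending_mv_refreshes', set())
    if not pending:
        return
    session.flush()  # emit all pending INSERT/UPDATE/DELETE statements first
    for view_name, concurrently in pending:
        RefreshMaterializedView(concurrently).execute(
            target=view_name, bind=session)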

Related

JDBC batch for multiple prepared statements

Is it possible to batch together commits from multiple JDBC prepared statements?
In my app the user will insert one or more records along with records in related tables. For example, we'll need to update a record in the "contacts" table, delete related records in the "tags" table, and then insert a fresh set of tags.
UPDATE contacts SET name=? WHERE contact_id=?;
DELETE FROM tags WHERE contact_id=?;
INSERT INTO tags (contact_id,tag) values (?,?);
// insert more tags as needed here...
These statements need to be part of a single transaction, and I want to do them in a single round trip to the server.
To send them in a single round trip, there are two choices: either use a plain Statement and call .addBatch(sql) with the full SQL text of each command, or create a PreparedStatement for each command, call .setString(), .setInt(), etc. for the parameter values, and then call .addBatch().
The problem with the first choice is that sending a full SQL string in the .addBatch() call is inefficient and you don't get the benefit of sanitized parameter inputs.
The problem with the second choice is that it may not preserve the order of the SQL statements. For example,
Connection con = ...;
PreparedStatement updateState = con.prepareStatement("UPDATE contacts SET name=? WHERE contact_id=?;");
PreparedStatement deleteState = con.prepareStatement("DELETE FROM tags WHERE contact_id=?;");
PreparedStatement insertState = con.prepareStatement("INSERT INTO tags (contact_id,tag) values (?,?);");
updateState.setString(1, "Bob");
updateState.setInt(2, 123);
updateState.addBatch();
deleteState.setInt(1, 123);
deleteState.addBatch();
... etc ...
... now add more parameters to updateState, and addBatch()...
... repeat ...
con.commit();
In the code above, are there any guarantees that all of the statements will execute in the order we called .addBatch(), even across different prepared statements? Ordering is obviously important; we need to delete tags before we insert new ones.
I haven't seen any documentation that says that ordering of statements will be preserved for a given connection.
I'm using Postgres and the default Postgres JDBC driver, if that matters.
The batch is per statement object, so a batch is executed per executeBatch() call on a Statement or PreparedStatement object. In other words, this only executes the statements (or value sets) associated with the batch of that statement object. It is not possible to 'order' execution across multiple statement objects. Within an individual batch, the order is preserved.
If you need statements executed in a specific order, then you need to explicitly execute them in that order. That means either individual calls to execute() per value set, or using a single Statement object and generating the statements on the fly. Because of the potential for SQL injection, this last approach is not recommended.

Slick insert into H2, but no data inserted

I'm sure I am missing something really stupidly obvious here - I have a unit test for a very simple Slick 3.2 setup. The DAO has basic retrieve and insert methods as follows:
override def questions: Future[Seq[Tables.QuestionRow]] =
  db.run(Question.result)

override def createQuestion(title: String, body: String, authorUuid: UUID): Future[Long] =
  db.run(Question returning Question.map(_.id) += QuestionRow(0l, UUID.randomUUID().toString, title, body, authorUuid.toString))
And I have some unit tests - for the tests I'm using in-memory H2 and have a setup script (passed via the JDBC URL) to initialise two basic rows in the table.
The unit tests for retrieving work fine, and they fetch the two rows inserted by the init script. I have just added a simple unit test to create a row and then retrieve them all - assuming it will fetch the three rows - but no matter what I do, it only ever retrieves the initial two:
it should "create a new question" in {
  whenReady(questionDao.createQuestion("Question three", "some body", UUID.randomUUID)) { s =>
    whenReady(questionDao.questions) { q =>
      println(s)
      println(q.map(_.title))
      assert(true)
    }
  }
}
The output shows that the original s (the ID returned by the autoincrement) is 3, as I would expect (I have also tried the insert without the returning step, just letting it return the number of rows inserted, which returns 1, as expected), but looking at the values returned in q, it's only ever the first two rows inserted by the init script.
What am I missing?
My assumptions are that your JDBC URL is something like jdbc:h2:mem:test;INIT=RUNSCRIPT FROM 'init.sql' and that no connection pooling is used.
There are two scenarios:
1. The connection is opened with keepAliveConnection = true (or with DB_CLOSE_DELAY=-1 appended to the JDBC URL) and init.sql is something like:
DROP TABLE IF EXISTS QUESTION;
CREATE TABLE QUESTION(...);
INSERT INTO QUESTION VALUES(null, ...);
INSERT INTO QUESTION VALUES(null, ...);
2. The connection is opened with keepAliveConnection = false (the default, without DB_CLOSE_DELAY=-1 appended to the JDBC URL) and init.sql is something like:
CREATE TABLE QUESTION(...);
INSERT INTO QUESTION VALUES(null, ...);
INSERT INTO QUESTION VALUES(null, ...);
The call to questionDao.createQuestion will open a new connection to your H2 database and will trigger the initialization script (init.sql).
In both scenarios, right after this call, the database contains a QUESTION table with 2 rows.
In scenario (2), after this call, the connection is closed and, according to the H2 documentation:
By default, closing the last connection to a database closes the database. For an in-memory database, this means the content is lost. To keep the database open, add ;DB_CLOSE_DELAY=-1 to the database URL. To keep the content of an in-memory database as long as the virtual machine is alive, use jdbc:h2:mem:test;DB_CLOSE_DELAY=-1.
The call to questionDao.questions will then open a new connection to your H2 database and will trigger the initialization script (init.sql) again.
In scenario (1) the first connection is kept alive (and so is the database content), but the new connection will re-execute the initialization script (init.sql), erasing the database content.
So, in both scenarios, questionDao.createQuestion returns 3, as expected, but the content is then lost and the subsequent call to questionDao.questions runs against a freshly initialized database.

How to make SQLAlchemy issue additional SQL after flushing the current session?

I have some SQL that I'd like SQLAlchemy to issue after it flushes the current session.
So I'm trying to write a Python function that will do the following "at the end of this specific SQLAlchemy session, 1) flush the session, 2) then send this extra SQL to the database as well, 3) then finally commit the session", but only if I call it within that particular session.
I don't want it on all sessions globally, so if I didn't call the function within this session, then don't execute the SQL.
I know SQLAlchemy has a built-in events system, and I played around with it, but I can't figure out how to register an event listener for only the current session, and not all sessions globally. I read the docs, but I'm still not getting it.
I am aware of database triggers, but they won't work for this particular scenario.
I'm using Flask-SQLAlchemy, which uses scoped sessions.
Not sure why it does not work for you. The sample code below runs as expected; note that the listener is registered on a specific Session instance, not globally on the Session class, so it only applies to that session:
class Stuff(Base):
    __tablename__ = 'stuff'
    id = Column(Integer, primary_key=True)
    name = Column(String)

Base.metadata.create_all(engine)
session = Session()

from sqlalchemy import event

@event.listens_for(session, 'after_flush')
def _handle_event(session, context):
    print('>> --- after_flush started ---')
    rows = session.execute("SELECT 1 AS XXX").fetchall()
    print(rows)
    print('>> --- after_flush finished ---')

# create test data
s1 = Stuff(name='uno')
session.add(s1)

print('--- before calling commit ---')
session.commit()
print('--- after calling commit ---')

How can I ensure that a materialized view is always up to date?

I'll need to invoke REFRESH MATERIALIZED VIEW on each change to the tables involved, right? I'm surprised to not find much discussion of this on the web.
How should I go about doing this?
I think the top half of the answer here is what I'm looking for: https://stackoverflow.com/a/23963969/168143
Are there any dangers to this? If updating the view fails, will the transaction on the invoking update, insert, etc. be rolled back? (this is what I want... I think)
I'll need to invoke REFRESH MATERIALIZED VIEW on each change to the tables involved, right?
Yes, PostgreSQL by itself will never call it automatically; you need to do it in some way.
How should I go about doing this?
There are many ways to achieve this. Before giving some examples, keep in mind that the REFRESH MATERIALIZED VIEW command locks the view in ACCESS EXCLUSIVE mode, so while it is running you can't even SELECT from it.
However, if you are on version 9.4 or newer, you can use the CONCURRENTLY option:
REFRESH MATERIALIZED VIEW CONCURRENTLY my_mv;
This acquires an EXCLUSIVE lock and does not block SELECT queries, but it may have more overhead (it depends on the amount of data changed; if few rows have changed it might be faster). Note that CONCURRENTLY requires a unique index on the materialized view, and you still can't run two REFRESH commands on the same view at the same time.
Refresh manually
This is an option to consider, especially for data loading or batch updates (e.g. a system that only loads lots of data after long periods of time), where it is common to have operations at the end that modify or process the data, so you can simply include a REFRESH operation at the end of the process.
Scheduling the REFRESH operation
The first and most widely used option is to use some scheduling system to invoke the refresh; for instance, you could configure it in a cron job like this:
*/30 * * * * psql -d your_database -c "REFRESH MATERIALIZED VIEW CONCURRENTLY my_mv"
And then your materialized view will be refreshed every 30 minutes.
Considerations
This option is really good, especially with the CONCURRENTLY option, but only if you can accept the data not being 100% up to date all the time. Keep in mind that, with or without CONCURRENTLY, the REFRESH command still needs to run the entire underlying query, so you have to take the time that query needs into account when choosing the schedule interval.
Refreshing with a trigger
Another option is to call the REFRESH MATERIALIZED VIEW in a trigger function, like this:
CREATE OR REPLACE FUNCTION tg_refresh_my_mv()
RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
REFRESH MATERIALIZED VIEW CONCURRENTLY my_mv;
RETURN NULL;
END;
$$;
Then, in any table that involves changes on the view, you do:
CREATE TRIGGER tg_refresh_my_mv AFTER INSERT OR UPDATE OR DELETE
ON table_name
FOR EACH STATEMENT EXECUTE PROCEDURE tg_refresh_my_mv();
Considerations
It has some critical pitfalls for performance and concurrency:
Any INSERT/UPDATE/DELETE operation will have to execute the query (which is probably slow if you are considering a materialized view in the first place);
Even with CONCURRENTLY, one REFRESH still blocks another one, so any INSERT/UPDATE/DELETE on the involved tables will be serialized.
The only situation where I can see this as a good idea is if the changes are really rare.
Refresh using LISTEN/NOTIFY
The problem with the previous option is that it is synchronous and imposes a big overhead on each operation. To ameliorate that, you can use a trigger like the one before, but one that only issues a NOTIFY:
CREATE OR REPLACE FUNCTION tg_refresh_my_mv()
RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
NOTIFY refresh_mv, 'my_mv';
RETURN NULL;
END;
$$;
You can then build an application that keeps a connection open and uses the LISTEN operation to identify when a REFRESH is needed. One nice project that you can use to test this is pgsidekick; with it you can LISTEN from a shell script, so you can trigger the REFRESH as:
pglisten --listen=refresh_mv --print0 | xargs -0 -n1 -I? psql -d your_database -c "REFRESH MATERIALIZED VIEW CONCURRENTLY ?;"
Or use pglater (also part of pgsidekick) to make sure you don't call REFRESH too often. For example, you can use the following trigger to request a REFRESH, but at most once per minute (60 seconds):
CREATE OR REPLACE FUNCTION tg_refresh_my_mv()
RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
NOTIFY refresh_mv, '60 REFRESH MATERIALIZED VIEW CONCURRENTLY my_mv';
RETURN NULL;
END;
$$;
So it will not run REFRESH more often than every 60 seconds, and if you NOTIFY many times within those 60 seconds, the REFRESH will be triggered only once.
Considerations
Like the cron option, this one is only good if you can bear a little stale data, but it has the advantage that the REFRESH is called only when really needed, so there is less overhead and the data is updated closer to the moment it is needed.
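If you prefer not to depend on pgsidekick, a minimal listener can also be written directly with psycopg2. The following is an untested sketch; the connection string, the channel name, and the hard-coded view name are assumptions you would adapt to your setup:
import select
import psycopg2

conn = psycopg2.connect("dbname=your_database")
conn.autocommit = True  # run LISTEN and REFRESH outside an explicit transaction block
cur = conn.cursor()
cur.execute("LISTEN refresh_mv;")

while True:
    # Block for up to 60 seconds waiting for a notification.
    if select.select([conn], [], [], 60) == ([], [], []):
        continue  # timed out, nothing to do
    conn.poll()
    if conn.notifies:
        del conn.notifies[:]  # coalesce duplicate notifications into one refresh
        cur.execute("REFRESH MATERIALIZED VIEW CONCURRENTLY my_mv;")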
Note: I haven't really tried these examples myself yet, so if someone finds a mistake or typo, or tries them and they work (or not), please let me know.
Now there is a PostgreSQL extension to keep materialized views updated: pg_ivm.
It only computes and applies the incremental changes, rather than recomputing the contents from scratch as REFRESH MATERIALIZED VIEW does. It has two approaches, IMMEDIATE and DEFERRED:
For IMMEDIATE, the views are updated in the same transaction in which their base tables are modified.
For DEFERRED, the views are updated after the transaction is committed.
Version 1.0 was released on 2022-04-28.
Let me point out three things about the previous answer by MatheusOl regarding the pglater tool.
1. The long_options array should include {0, 0, 0, 0} as its last element, as pointed out at https://linux.die.net/man/3/getopt_long by the phrase "The last element of the array has to be filled with zeros." So it should read:
static struct option long_options[] = {
    // ......
    {"help", no_argument, NULL, '?'},
    {0, 0, 0, 0}
};
2. On the malloc/free issue: one free() (for char *listen = malloc(...);) is missing. Moreover, malloc caused the pglater process to crash on CentOS (but not on Ubuntu; I don't know why). So I recommend using char arrays and assigning the array names to the char pointers (both char * and char **). You may need to force a type conversion when you do that (pointer assignment).
char block4[100];
...
password_prompt = block4;
...
char block1[500];
const char **keywords = (const char **)&block1;
...
char block3[300];
char *listen = block3;
sprintf(listen, "listen %s", id);
PQfreemem(id);
res = PQexec(db, listen);
3. Use the table below to calculate the timeout, where md is mature_duration, the time difference between the latest refresh (lr) and the current time (now).
when md >= callback_delay (cd)    ==> timeout: 0
when md + PING_INTERVAL >= cd     ==> timeout: cd - md [= cd - (now - lr)]
when md + PING_INTERVAL < cd      ==> timeout: PING_INTERVAL
To implement this algorithm (the third point), you should initialize lr as follows:
res = PQexec(db, command);
latest_refresh = time(0);
if (PQresultStatus(res) == PGRES_COMMAND_OK) {

Row-Level Update Lock using System.Transactions

I have an MSSQL procedure with the following code in it:
SELECT Id, Role, JurisdictionType, JurisdictionKey
FROM dbo.SecurityAssignment WITH(UPDLOCK, ROWLOCK)
WHERE Id = @UserIdentity
I'm trying to move that same behavior into a component that uses OleDb connections, commands, and transactions to achieve the same result. (It's a security component that uses the SecurityAssignment table shown above. I want it to work whether that table is in MSSQL, Oracle, or Db2)
Given the above SQL, if I run a test using the following code
Thread backgroundThread = new Thread(
    delegate()
    {
        using (var transactionScope = new TransactionScope())
        {
            Subject.GetAssignmentsHavingUser(userIdentity);
            Thread.Sleep(5000);
            backgroundWork();
            transactionScope.Complete();
        }
    });
backgroundThread.Start();
Thread.Sleep(3000);
var foregroundResults = Subject.GetAssignmentsHavingUser(userIdentity);
Where Subject.GetAssignmentsHavingUser runs the SQL above and returns a collection of results, and backgroundWork is an Action that updates rows in the table, like this:
delegate
{
Subject.UpdateAssignment(newAssignment(user1, role1));
}
Then the foregroundResults returned by the test should reflect the changes made in the backgroundWork action.
That is, I retrieve a list of SecurityAssignment table rows that have UPDLOCK, ROWLOCK applied by the SQL, and subsequent queries against those rows don't return until that update lock is released - thus the foregroundResult in the test includes the updates made in the backgroundThread.
This all works fine.
Now, I want to do the same with database-agnostic SQL, using OleDb transactions and isolation levels to achieve the same result. And I can't, for the life of me, figure out how to do it. Is it even possible, or does this row-level locking only apply at the database level?