Trying to use Knex onConflict times out my Cloud Function - postgresql

I am trying to insert GeoJSON data into a PostGIS instance on a regular schedule, and each run usually contains duplicate data. I loop through the GeoJSON features and use the Knex.js onConflict modifier to ignore rows whose key field already exists, but it times out my Cloud Function.
async function insertFeatures() {
  try {
    const results = await getGeoJSON();
    pool = pool || (await createPool());
    const st = knexPostgis(pool);
    for (const feature of results.features) {
      const { geometry, properties } = feature;
      const { region, date, type, name, url } = properties;
      const point = st.geomFromGeoJSON(geometry);
      await pool('observations').insert({
        region: region,
        url: url,
        date: date,
        name: name,
        type: type,
        geom: point,
      })
        .onConflict('url')
        .ignore()
    }
  } catch (error) {
    console.log(error)
    return res.status(500).json({
      message: error + "Poop"
    });
  }
}

A timeout like this can have several causes: the size of the batch your function is processing, the connection pool size, or limits on the database server. In your Cloud Function, check how the pool is set up. Knex lets you optionally register an afterCreate callback on the pool; if you registered one, make sure it calls the done callback that is passed as its last parameter, otherwise no connection can be acquired and the query times out.
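A minimal sketch of the afterCreate point, assuming a pool section that would be passed as `pool` in the Knex configuration. The `max` value and the `SET timezone` statement are placeholders, not part of the original question; the detail that matters is that `done` is called on every path.

```javascript
// Pool section of a Knex config (pass it as `pool` to knex({ ... })).
// `conn` is the raw driver connection; `done` must be called exactly once,
// otherwise Knex never treats the connection as ready and acquisition times out.
const poolConfig = {
  min: 0,
  max: 5, // assumption: size this to what the database server allows
  afterCreate(conn, done) {
    // Placeholder setup query; replace it with your own or drop it.
    conn.query("SET timezone = 'UTC';", (err) => {
      done(err, conn); // always hand the connection (or the error) back
    });
  },
};
```

If the setup query fails, passing `err` to `done` lets Knex discard the connection instead of waiting on it forever.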
To see what Knex is doing internally, set the DEBUG=knex:* environment variable before running the code; Knex then logs information about queries, transactions and pool connections as the code executes. It is also advisable to choose batch sizes, the connection pool size and the database server's connection limits to match the workload you push to the server, which avoids the most common causes of timeouts.
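One way to control the batch size is to send rows in chunks instead of one INSERT per feature. The sketch below is an illustration, not the asker's code: `chunk` and `insertObservations` are hypothetical helper names, the default of 500 rows is arbitrary, and it assumes the row objects (including any geometry expression built as in the question) can be passed to `insert` as an array.

```javascript
// Split an array into consecutive slices of at most `size` items.
function chunk(items, size) {
  const out = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Insert rows batch by batch; each batch becomes one
// INSERT ... ON CONFLICT (url) DO NOTHING statement.
// `knex` is an initialized Knex instance; `rows` are plain objects whose keys
// match the columns of the `observations` table.
async function insertObservations(knex, rows, batchSize = 500) {
  for (const batch of chunk(rows, batchSize)) {
    await knex('observations').insert(batch).onConflict('url').ignore();
  }
}
```

Fewer round trips usually shortens the total run time, which matters when the function has a fixed execution deadline.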
Also check for similar examples here:
Knex timeout error acquiring connection
When trying to mass insert timeout occurs for knexjs error
Having timeout error after upgrading knex
Knex timeout acquiring a connection

Related

Sequelize transaction retry doesn't work as expected

I don't understand how transaction retry works in Sequelize.
I am using a managed transaction, though I also tried an unmanaged one with the same outcome:
await sequelize.transaction({ isolationLevel: Sequelize.Transaction.ISOLATION_LEVELS.REPEATABLE_READ}, async (t) => {
  user = await User.findOne({
    where: { id: authenticatedUser.id },
    transaction: t,
    lock: t.LOCK.UPDATE,
  });
  user.activationCodeCreatedAt = new Date();
  user.activationCode = activationCode;
  await user.save({transaction: t});
});
Now if I run this when the row is already locked, I am getting
DatabaseError [SequelizeDatabaseError]: could not serialize access due to concurrent update
which is normal. This is my retry configuration:
retry: {
  match: [
    /concurrent update/,
  ],
  max: 5
}
At this point I want Sequelize to retry the whole transaction. Instead, I see that right after SELECT... FOR UPDATE it issues SELECT... FOR UPDATE again, which causes another error:
DatabaseError [SequelizeDatabaseError]: current transaction is aborted, commands ignored until end of transaction block
How can I use Sequelize's internal retry mechanism to retry the whole transaction?
Manual retry workaround function
Since Sequelize devs simply aren't interested in patching this for some reason after many years, here's my workaround:
const { Sequelize } = require('sequelize')
// Re-run the whole transaction until it stops failing with a serialization error
async function transactionWithRetry(sequelize, transactionArgs, cb) {
  let done = false
  while (!done) {
    try {
      await sequelize.transaction(transactionArgs, cb)
      done = true
    } catch (e) {
      if (
        sequelize.options.dialect === 'postgres' &&
        e instanceof Sequelize.DatabaseError &&
        e.original.code === '40001'
      ) {
        await sequelize.query(`ROLLBACK`)
      } else {
        // Error that we don't know how to handle.
        throw e;
      }
    }
  }
}
Sample usage:
const { Transaction } = require('sequelize');
await transactionWithRetry(sequelize,
  { isolationLevel: Transaction.ISOLATION_LEVELS.SERIALIZABLE },
  async t => {
    const rows = await sequelize.models.MyInt.findAll({ transaction: t })
    await sequelize.models.MyInt.update({ i: newI }, { where: {}, transaction: t })
  }
)
The error code 40001 is documented at: https://www.postgresql.org/docs/13/errcodes-appendix.html and it's the only one I've managed to observe so far for serialization failures; see also "Serialization failures: What are the conditions for encountering a serialization failure?". Let me know if you find any others that should be auto looped and I'll patch them in.
Here's a full runnable test for it which seems to indicate that it is working fine: https://github.com/cirosantilli/cirosantilli.github.io/blob/dbb2ec61bdee17d42fe7e915823df37c4af2da25/sequelize/parallel_select_and_update.js
Tested on:
"pg": "8.5.1",
"pg-hstore": "2.3.3",
"sequelize": "6.5.1",
PostgreSQL 13.5, Ubuntu 21.10.
Infinite list of related requests
https://github.com/sequelize/sequelize/issues/1478 from 2014. Original issue was MySQL but thread diverged everywhere.
https://github.com/sequelize/sequelize/issues/8294 from 2017. Also asked on Stack Overflow, but got Tumbleweed badge and the question appears to have been auto deleted, can't find it on search. Mentions MySQL. Is a bit of a mess, as it also includes connection errors, which are not clear retries such as PostgreSQL serialization failures.
https://github.com/sequelize/sequelize/issues/12608 mentions Postgres
https://github.com/sequelize/sequelize/issues/13380 by the OP of this question
Meaning of current transaction is aborted, commands ignored until end of transaction block
The error is pretty explicit, but just to clarify to other PostgreSQL newbies: in PostgreSQL, when you get a failure in the middle of a transaction, Postgres just auto-errors any following queries until a ROLLBACK or COMMIT happens and ends the transaction.
The DB client code is then supposed to notice that and just re-run the transaction.
These errors are therefore benign, and ideally Sequelize should not raise on them. Those errors are actually expected when using ISOLATION LEVEL SERIALIZABLE and ISOLATION LEVEL REPEATABLE READ, and prevent concurrent errors from happening.
But unfortunately sequelize does raise them just like any other errors, so it is inevitable for our workaround to have a while/try/catch loop.
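The while/try/catch loop can also be written with an upper bound on attempts and a short pause between them, so that a persistent conflict eventually surfaces as an error instead of looping forever. This is a sketch of that variation, not part of the workaround above: `retryTransaction`, `maxAttempts` and `baseDelayMs` are hypothetical names, and the error shape (`err.original.code`) is assumed to be the one used in the workaround.

```javascript
// PostgreSQL reports serialization failures with SQLSTATE 40001.
const isSerializationFailure = (err) =>
  Boolean(err && err.original && err.original.code === '40001');

// Retry `runTransaction` while it fails with a retryable error.
// `runTransaction` is any function returning a promise, for example
// () => sequelize.transaction(options, callback).
async function retryTransaction(
  runTransaction,
  { maxAttempts = 5, baseDelayMs = 50, isRetryable = isSerializationFailure } = {}
) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await runTransaction();
    } catch (err) {
      // Give up on unknown errors or once the attempt budget is spent.
      if (attempt >= maxAttempts || !isRetryable(err)) throw err;
      // Linear backoff: wait a little longer after each failed attempt.
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * attempt));
    }
  }
}
```

Passing the transaction as a function keeps the helper independent of Sequelize, so the same loop can wrap any client that exposes the SQLSTATE on the error.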

pg-promise and Row Level Security

I am looking at implementing Row Level security with our node express + pg-promise + postgres service.
We've tried a few approaches with no success:
create a getDb(tenantId) wrapper which calls the `SET app.current_tenant = '${tenantId}';` SQL statement before returning the db object
getDb(tenantId) wrapper that gets a new db object every time - this works for a few requests but eventually causes too many db connections and errors out (which is understandable as it is not using pg-promise's connection pool management)
getDb(tenantId) wrapper that uses a name/value map to store a list of db connections per tenant. This works for a short while but eventually results in too many db connections.
utilising the initOptions > connect event - have not found a way to get hold of the current request object (to then set the tenant_id)
Can someone (hopefully vitaly-t :)) please suggest the best strategy for injecting the current tenant before all SQL queries are run inside a connection?
Thank you very much.
Here is an abbreviated code example:
const promise = require('bluebird');
const initOptions = {
  promiseLib: promise,
  connect: async (client, dc, useCount) => {
    try {
      // "hook" into the db connect event - and set the tenantId so all future sql queries in this connection
      // have an implied WHERE tenant_id = current_setting('app.current_tenant')::UUID (aka Postgres Row Level Security)
      const tenantId = client.$ctx?.cn?.tenantId || client.$ctx?.cnOptions?.tenantId;
      if (tenantId) {
        await client.query(`SET app.current_tenant = '${tenantId}';`);
      }
    } catch (ex) {
      log.error('error in db.js initOptions', {ex});
    }
  }
};
const pgp = require('pg-promise')(initOptions);
const options = tenantIdOptional => {
  return {
    user: process.env.POSTGRES_USER,
    host: process.env.POSTGRES_HOST,
    database: process.env.POSTGRES_DATABASE,
    password: process.env.POSTGRES_PASSWORD,
    port: process.env.POSTGRES_PORT,
    max: 100,
    tenantId: tenantIdOptional
  };
};
const db = pgp(options());
const getDb = tenantId => {
  // how to inject tenantId into the db object
  // 1. this was getting an error "WARNING: Creating a duplicate database object for the same connection and Error: write EPIPE"
  // const tmpDb = pgp(options(tenantId));
  // return tmpDb;
  // 2. this was running the set app.current_tenant BEFORE the database connection was established
  // const setTenantId = async () => {
  //   await db.query(`SET app.current_tenant = '${tenantId}';`);
  // };
  // setTenantId();
  // return db;
  // 3. this is bypassing the connection pool management - and is not working
  // db.connect(options(tenantId));
  // return db;
  return db;
};
// Exporting the global database object for shared use:
const exportFunctions = {
  getDb,
  db // have to also export db for the legacy non-Row level security areas of the service
};
module.exports = exportFunctions;
The SET operation is connection-bound, i.e. it only has effect while the current connection session lasts. For fresh connections spawned by the pool, you need to re-apply the settings.
The standard way of controlling the current connection session is via tasks:
await db.task('my-task', async t => {
  await t.none('SET app.current_tenant = ${tenantId}', {tenantId});
  // ... run all session-related queries here
});
Or you can use method tx instead, if a transaction is needed.
But if you have tenantId known globally, and you want it automatically propagated through all connections, then you can use event connect instead:
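const initOptions = {
  connect(client) {
    // `client` is the raw driver client, which sends values as bind parameters;
    // SET does not accept bind parameters, so use set_config() instead
    client.query("SELECT set_config('app.current_tenant', $1, false)", [tenantId]);
  }
};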
The latter is kind of an after-thought work-around, but it does work reliably, has best performance, and avoids creating the extra tasks.
"have not found a way to get hold of the current request object (to then set the tenant_id)"
This should be very straightforward for any HTTP library out there, but is outside of scope here.
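For the task/tx approach, a per-request helper could look like the sketch below. It is an assumption-laden illustration rather than the library author's recommendation: `withTenant` is a hypothetical name, and it swaps SET for PostgreSQL's set_config() with is_local = true, which scopes the value to the transaction so it is discarded at COMMIT or ROLLBACK before the connection goes back to the pool.

```javascript
// Run `work` inside a transaction in which app.current_tenant is set for that
// transaction only. `db` is a pg-promise database object; `work` receives the
// transaction context `t` and must run all tenant-scoped queries through it.
function withTenant(db, tenantId, work) {
  return db.tx('tenant-tx', async (t) => {
    // set_config returns one row, so use t.one rather than t.none.
    await t.one("SELECT set_config('app.current_tenant', $1, true)", [tenantId]);
    return work(t);
  });
}
```

In an HTTP handler this would be called once per request with the tenant id taken from wherever the application stores it, for example `withTenant(db, tenantId, t => t.any('SELECT ...'))`.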

Initial Request Slow on Lambda Due to DB connection

When my Lambda function is activated it connects to my MongoDB Atlas instance, which slows down the response by 1000-2000 ms.
I can cache the DB connection, but the cache only lasts if requests are made shortly after the previous one; it would not persist for a request made an hour later.
Do any of the native AWS databases (DocumentDB, DynamoDB, etc.) avoid this problem and allow an instant connection every time?
CODE
import { MongoClient } from 'mongodb'
let response
let cachedDb = null
const uri = 'mongodb+srv://XXXX'
function connectToDatabase(uri) {
  // Reuse the connection left over from a previous warm invocation, if any
  if (cachedDb && cachedDb.serverConfig.isConnected()) {
    console.log('=> using cached database instance')
    return Promise.resolve(cachedDb)
  }
  const dbName = 'test'
  return MongoClient.connect(uri, { useNewUrlParser: true, useUnifiedTopology: true }).then(
    client => {
      cachedDb = client.db(dbName)
      return cachedDb
    }
  )
}
export async function lambdaHandler() {
  try {
    const client = await connectToDatabase(uri)
    const collection = client.collection('users')
    const profile = await collection.findOne({ user: 'myuser' })
    response = profile
  } catch (err) {
    console.log(err)
    return err
  }
  return response
}
We have the same issue with MySQL connections: the cached variables disappear when the Lambda function cold starts.
The only solution I have is to keep the cache alive with function warming.
Just set up a periodic cron job to trigger your function every 5-15 minutes, and rest assured, it will always stay warm.
You can check also this one: https://www.npmjs.com/package/lambda-warmer
You are facing a cold start. It's not related to the DB connection.
To keep your Lambda function warm, you can set up a CloudWatch event that triggers the Lambda periodically (normally once every 5 minutes should be enough).
Also, if you are using DocumentDB, you must put the Lambda into a VPC. That requires an ENI (Elastic Network Interface) to be provisioned, which adds more time to the start. So if you can avoid using a VPC, it could give you some performance advantage.
More info:
Good article about cold start
AWS Lambda Execution Context
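When a scheduled rule keeps the function warm, the handler can return early for those pings so they do not run the real workload. The sketch below assumes the ping arrives as a scheduled CloudWatch/EventBridge event, which carries `source: 'aws.events'`; `isWarmupPing` and `withWarmup` are hypothetical helper names.

```javascript
// Scheduled CloudWatch/EventBridge rules deliver events with source "aws.events".
function isWarmupPing(event) {
  return Boolean(event) && event.source === 'aws.events';
}

// Wrap a real handler so that warm-up pings return immediately and every
// other event is passed through unchanged.
function withWarmup(handler) {
  return async (event, context) => {
    if (isWarmupPing(event)) return { warmed: true };
    return handler(event, context);
  };
}
```

The module-level connection cache from the question still applies: a warm container keeps `cachedDb`, so only genuinely cold starts pay the connection cost.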

Mongoose how to listen for collection changes

I need to build a mongo updater process to download MongoDB data to local IoT devices (configuration data, etc.)
My goal is to watch some mongo collections at a fixed interval (1 minute, for example). If a collection has changed (deletion, insertion or update), I will download the full collection to my device. The collections will have no more than a few hundred simple records, so it's not going to be a lot of data to download.
Is there any mechanism to find out whether a collection has changed since the last poll? What mongo features should be used in that case?
To listen for changes to your MongoDB collection, set up a Mongoose Model.watch.
const PersonModel = require('./models/person')
const personEventEmitter = PersonModel.watch()
personEventEmitter.on('change', change => console.log(JSON.stringify(change)))
const person = new PersonModel({name: 'Thabo'})
person.save()
// Triggers console log on change stream
// {_id: '...', operationType: 'insert', ...}
Note: This functionality is only available on a MongoDB replica set.
See Mongoose Model Docs for more.
If you want to listen for changes to your DB, use Connection.watch.
See Mongoose Connection Docs for more
These functions listen for Change Events from MongoDB Change Streams, which are available as of MongoDB v3.6.
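To combine change events with the fixed-interval download from the question, one option is to keep a "dirty" flag that change events set and a periodic check clears. This is a sketch under assumptions: `createChangeTracker` and `downloadAll` are hypothetical names, and `changeSource` is anything that emits 'change' events, as the object returned by Model.watch() does in the example above.

```javascript
// Remember whether any change event arrived since the last successful download.
// `changeSource` must emit 'change'; `downloadAll` is a hypothetical async
// function that fetches the full collection to the device.
function createChangeTracker(changeSource, downloadAll) {
  let dirty = true; // force one initial download
  changeSource.on('change', () => { dirty = true; });
  // Call this on a timer; it resolves to true when a download happened.
  return async function checkOnce() {
    if (!dirty) return false;
    dirty = false;
    try {
      await downloadAll();
    } catch (err) {
      dirty = true; // keep the flag so the next check retries
      throw err;
    }
    return true;
  };
}
```

A caller would then schedule it with something like `setInterval(checkOnce, 60000)`, so several changes within one minute still cause a single download.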
I think the best solution would be to use post-update middleware.
You can read more about that here
http://mongoosejs.com/docs/middleware.html
I have the same requirement on an embedded device that runs fairly autonomously and must always adjust its operating parameters automatically, without rebooting the system.
For this I created a configuration manager class, and in its constructor I coded a "parameter monitor" that checks the database only for the parameters flagged for monitoring. If a new configuration needs to be monitored, another part of the code tells the config manager to reload it.
As you can see, the process is very simple. It can of course be improved to avoid overloading the config manager with many updates and to prevent checks with very short intervals from overlapping.
Since there are many settings to read, I open a cursor for a query as soon as the database connection is open. As the cursor streams documents, I create a proxy for each one so that it can be handled according to its type and the internal details of the config manager. I then check whether the property needs to be monitored. If so, I call an inner function named watch, which reads the document's field of the same name to get the interval at which to check the database for updates, and registers a timeout for that task. Each check re-creates the timeout with the updated interval, or stops monitoring if watch is no longer set.
this.connection.once('open', () => {
  let cursor = Config.find({}).cursor();
  cursor.on('data', (doc) => {
    this.config[doc.parametro] = criarProxy(doc.parametro, doc.valor);
    if (doc.watch) {
      console.log(sprintf("Preparing to monitor %s", doc.parametro));
      // Re-check one parameter after doc.watch milliseconds, then re-arm
      function watch(configManager, doc) {
        console.log("Monitoring parameter: %s", doc.parametro);
        if (doc.watch) setTimeout(() => {
          Config.findOne({
            parametro: doc.parametro
          }).then((latest) => {
            console.dir(latest);
            if (latest) {
              if (latest.valor != configManager.config[latest.parametro]) {
                console.log("Monitored parameter %s was changed!", latest.parametro);
                configManager.config[latest.parametro] = criarProxy(latest.parametro, latest.valor);
              } else
                console.log("Monitored parameter %s was not changed", latest.parametro);
              // Re-arm with the latest document so a changed or removed watch value takes effect
              watch(configManager, latest);
            } else
              console.log("Check the parameter: %s", doc.parametro)
          })
        },
        doc.watch)
      }
      watch(this, doc);
    }
  });
  cursor.on('close', () => {
    if (process.env.DEBUG_DETAIL > 2) console.log("ConfigManager closed cursor data");
    resolv(); // resolv is defined outside this excerpt
  });
  cursor.on('end', () => {
    if (process.env.DEBUG_DETAIL > 2) console.log("ConfigManager end data");
  });
});
As you can see, the code can be improved a lot. If you want to suggest improvements, whether specific to your environment or general, please use the gist: https://gist.github.com/carlosdelfino/929d7918e3d3a6172fdd47a59d25b150

AWS Lambda callback being blocked by open mongodb connection?

I have set up an AWS Lambda to save some data to MongoDB for me. I'd like to reuse the connection so I don't have to create a new connection every time the Lambda is invoked. But if I leave the db connection open, the callback for the Lambda handler doesn't work!
Is there something I'm doing wrong that's creating this behavior? Here is my code:
var MongoClient = require('mongodb').MongoClient
exports.handler = (event, context, callback) => {
  MongoClient.connect(process.env.MONGOURL, function (err, database) {
    //database.close();
    callback(null, "Successful db connection")
  });
}
This is caused by not setting context.callbackWaitsForEmptyEventLoop = false. If left at the default true, the callback does not cause Lambda to return the response because your database connection is keeping the event loop from being empty.
http://docs.aws.amazon.com/lambda/latest/dg/nodejs-prog-model-context.html
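A sketch of how the flag and a reused connection could fit together. `makeHandler` is a hypothetical helper, and the `connect` function is injected so the example stays independent of the driver; in the question's setup it could be `() => MongoClient.connect(process.env.MONGOURL)`, assuming the promise-returning form of that call.

```javascript
// Connection promise cached at module scope, so warm invocations reuse it.
let cachedClient = null;

// `connect` must return a promise for a connected client.
function makeHandler(connect) {
  return (event, context, callback) => {
    // Let Lambda respond as soon as callback() runs, even though the open
    // database socket keeps the event loop non-empty.
    context.callbackWaitsForEmptyEventLoop = false;
    cachedClient = cachedClient || connect();
    cachedClient.then(
      () => callback(null, 'Successful db connection'),
      (err) => {
        cachedClient = null; // try a fresh connection on the next invocation
        callback(err);
      }
    );
  };
}
```

The handler would then be exported as `exports.handler = makeHandler(connectFunction)`; the connection stays open between invocations for as long as the container is kept warm.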