How to leverage Redis and MongoDB in the following scenario?

I am using change streams to keep the Redis cache up to date whenever a change happens in MongoDB. I update the cache in two steps whenever the database changes:
Clear the cache
Load the data from scratch
Code:
redisTemplate.delete(BINDATA);
redisTemplate.opsForHash().putAll(BINDATA, binDataItemMap);
The problem is that in production this setup may be dangerous. Since Redis is single-threaded, there is a chance the events will happen in the following order in a multi-container application:
db update happens
data is cleared in redis
request comes in and queries an empty redis cache
data is pushed into redis
Ideally, it should happen like this:
db update happens
data is cleared in redis
data is pushed into redis
request comes in and queries an updated redis cache
How can both operations, clear and update, happen in Redis as one block?

Redis supports transactions, which fit your needs exactly.
All the commands in a transaction are serialized and executed sequentially. It can never happen that a request issued by another client is served in the middle of the execution of a Redis transaction. This guarantees that the commands are executed as a single isolated operation.
A Redis transaction is entered using the MULTI command. At this point the user can issue multiple commands. Instead of executing these commands, Redis will queue them. All the commands are executed once EXEC is called.
Pseudo Code:
> MULTI
OK
> redisTemplate.delete(BINDATA);
QUEUED
> redisTemplate.opsForHash().putAll(BINDATA, binDataItemMap);
QUEUED
> EXEC
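
With Spring Data Redis (the API used in the snippet above), the usual way to issue MULTI/EXEC is a SessionCallback, which keeps both commands on one connection and sends them as a single block. A minimal sketch, reusing the BINDATA key and binDataItemMap names from the question (the key value and the template types are assumptions):

import java.util.List;
import java.util.Map;

import org.springframework.dao.DataAccessException;
import org.springframework.data.redis.core.RedisOperations;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.data.redis.core.SessionCallback;

public class BinDataCacheRefresher {

    private static final String BINDATA = "binData"; // assumed key value

    private final RedisTemplate<String, Object> redisTemplate;

    public BinDataCacheRefresher(RedisTemplate<String, Object> redisTemplate) {
        this.redisTemplate = redisTemplate;
    }

    // Clears and reloads the hash inside one MULTI/EXEC block on a single connection.
    @SuppressWarnings({ "rawtypes", "unchecked" })
    public void refresh(Map<Object, Object> binDataItemMap) {
        redisTemplate.execute(new SessionCallback<List<Object>>() {
            public List<Object> execute(RedisOperations operations) throws DataAccessException {
                operations.multi();                                       // MULTI: start queueing
                operations.delete(BINDATA);                               // queued, not yet executed
                operations.opsForHash().putAll(BINDATA, binDataItemMap);  // queued
                return operations.exec();                                 // EXEC: run both atomically
            }
        });
    }
}

Until exec() is called nothing is executed, so a concurrent request sees either the old hash or the fully reloaded one, never the empty state described in the question.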

Related

How to locally test MongoDB multicollection transactions when standalone mode does not support them

I am just starting out with MongoDB and am using the docker mongo instance for local development and testing.
My code has to update 2 collections in the same transaction so that the data is logically consistent:
using (var session = _client.StartSession())
{
    session.StartTransaction();
    ec.InsertOne(evt);
    sc.InsertMany(snapshot.Selections.Select(ms => new SelectionEntity(snapshot.Id, ms)));
    session.CommitTransaction();
}
This is failing with the error:
Standalone servers do not support transactions
The error is obvious: my standalone docker container does not support transactions. I am confused, though, because this means it is impossible to test code such as the above unless I have a replica set running. This doesn't appear to be listed as a requirement in the documentation, and it only refers to the fact that transactions can be multi-document or distributed:
For situations that require atomicity of reads and writes to multiple documents (in a single or multiple collections), MongoDB supports multi-document transactions. With distributed transactions, transactions can be used across multiple operations, collections, databases, documents, and shards.
It's not clear to me how to create a multi-document transaction that does not require a replica set, or how to properly test code locally when there is no MongoDB replica set to work against.
How do people handle this?
For testing purposes, you could set up a local replica set using docker-compose. There are various blog posts on the topic, e.g. Create a replica set in MongoDB with docker-compose; a minimal compose file is sketched below.
Another option is to use a cluster on MongoDB Atlas. There is a free tier available so you can test this without any extra cost.
In addition, you could change your code so that transactions can be disabled depending on the configuration. This way, you can test the code without transactions locally and enable them on staging or production.
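
For the docker-compose approach, a single-member replica set is enough for transactions. A minimal sketch (image tag, replica set name and port mapping are assumptions):

# docker-compose.yml -- minimal single-node replica set for local testing;
# the image tag and replica set name are assumptions.
version: "3.8"
services:
  mongo:
    image: mongo:4.4
    # Start mongod as a (one-member) replica set so transactions are supported.
    command: ["mongod", "--replSet", "rs0", "--bind_ip_all"]
    ports:
      - "27017:27017"

After docker-compose up -d, the set still has to be initiated once, e.g. docker-compose exec mongo mongo --eval 'rs.initiate({_id: "rs0", members: [{_id: 0, host: "localhost:27017"}]})'; the blog post mentioned above walks through variants of this setup.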

Under which circumstances can documents insert with insert_many not appear in DB

We are using pymongo 3.12 and Python 3.12, MongoDB 4.2. We write the results of our tasks from Celery worker processes into MongoDB using pymongo. MongoClient is not instantiated each time; instead we rely on connection pooling and reuse the connections. There are multiple Celery worker instances competing to run a job, so our Celery server has multiple connections to MongoDB.
The problem: sometimes the results of this particular operation are not in MongoDB, yet no error is logged. Our code captures all exceptions, so it looks like no exception is ever raised. We use a plain insert_many with default parameters, which means the insert is ordered and any failure should trigger an error. Only this particular operation fails; others that read or write data from/to the same or another MongoDB instance work fine. The problem can be reproduced on different systems. We added the maxIdleTimeMS parameter to the MongoDB connection in order to close idle connections, but it did not help.
Is there a way to tell programmatically which local port is used by the pymongo connection that is going to serve my request?
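
For reference, the write path described above, written out with the error handling made explicit; the collection name, document shape, logging setup and the explicit write concern are assumptions for illustration, not part of the original code.

# Sketch of the insert path described above; names and the explicit write
# concern are assumptions for illustration.
import logging

from pymongo import MongoClient, WriteConcern
from pymongo.errors import BulkWriteError, PyMongoError

log = logging.getLogger(__name__)

client = MongoClient("mongodb://localhost:27017", maxIdleTimeMS=60000)
results = client.mydb.get_collection(
    "task_results",
    write_concern=WriteConcern(w=1, j=True),  # require an acknowledged, journaled write
)

def save_results(docs):
    try:
        # ordered=True is the default: the first failing document aborts the
        # batch and raises BulkWriteError.
        response = results.insert_many(docs, ordered=True)
        log.info("inserted %d documents", len(response.inserted_ids))
    except BulkWriteError as exc:
        log.error("bulk write failed: %s", exc.details)
        raise
    except PyMongoError:
        log.exception("insert_many failed")
        raise

With an acknowledged write concern like this, a write the server fails to apply should surface as an exception here rather than disappear silently.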

Sidecar redis instance in k8s job

My use case:
Running a bunch of my own processing in a kubernetes Job
Want to associate a short-lived redis cache for the lifetime of the Job processing
The Job should complete when my processing finishes, and redis should go away
I can't figure out how to accomplish this, though I feel like I can't be the first one to need this.
If I add redis as a second container to the job's spec, then my processing completes but redis keeps on going, and the Job never completes.
I could add redis to the same container as my processing, running it first as a daemon and then running my own code, but this feels wrong to me.
I've read about sidecar patterns and found other people looking for the same thing, but I didn't see any clear solution; what I did see looked like hacks involving shared volumes and livenessProbes.
How is this best accomplished?
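
For reference, the two-container Job spec described above looks roughly like the sketch below (image names, commands and env vars are placeholders); as written, the redis container keeps running after the worker exits, which is exactly why the Job never completes.

# Sketch of the Job described in the question; images and names are placeholders.
# The redis container never exits on its own, so the Job's pod never terminates.
apiVersion: batch/v1
kind: Job
metadata:
  name: processing-with-redis
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: example.com/my-processing:latest   # placeholder image
          env:
            - name: REDIS_URL
              value: redis://localhost:6379          # the sidecar shares the pod network
        - name: redis
          image: redis:6
          ports:
            - containerPort: 6379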

Out of box distributed job queue solution

Are there any existing out-of-the-box job queue frameworks? The basic idea is:
Someone enqueues a job with status New.
(Multiple) workers pick up a job and work on it, marking the job as Taken. One job can be running on at most one worker.
Something monitors worker status; if a running job exceeds a predefined timeout (which could be a worker health issue), it is re-queued with status New.
Once a worker completes a task, it marks the task as Completed in the queue.
Something keeps cleaning up completed tasks. Alternatively, at step #4, the worker simply dequeues the task when it completes it.
From my investigation, things like Kafka (pub/sub), MQ (push/pull & pub/sub), or caches (Redis, Memcached) are mostly sufficient for this work. However, they all require some development around their core functionality to become a fully functional job queue.
I also looked into relational databases; the ones that support the "SELECT ... FOR UPDATE SKIP LOCKED" syntax are also good candidates, but this again requires a daemon between the DB and the workers, which means extra effort.
I also looked into cloud solutions, such as Azure Queue Storage, with a similar assessment.
So my question is: is there any out-of-the-box job queue solution that is tailored and dedicated to one thing, job queuing, and does not require much effort to set up?
Thanks
Take a look at Python Celery. https://docs.celeryproject.org/en/stable/getting-started/introduction.html
The default mode uses RabbitMQ as the message broker, but other options are available. Results can be stored in a DB if needed.
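For illustration, a minimal Celery setup could look like the sketch below; the broker URL, module name, task name and payload shape are assumptions, not something prescribed by Celery.

# tasks.py -- a minimal Celery sketch; broker URL, task name and payload shape
# are placeholder assumptions.
from celery import Celery

app = Celery(
    "tasks",
    broker="amqp://guest:guest@localhost//",  # RabbitMQ, Celery's default broker
    backend="rpc://",                         # keep task results so status can be queried
)

@app.task(acks_late=True, time_limit=300)     # re-deliver if the worker dies; hard timeout
def process_job(payload):
    # ... the actual work goes here ...
    return {"status": "Completed", "payload": payload}

Start a worker with "celery -A tasks worker" and enqueue work with process_job.delay(...); acks_late plus the broker's redelivery roughly gives you the re-queue-on-worker-failure behaviour from the list above, and Celery's task states map roughly onto the New/Taken/Completed statuses.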

spring batch remote partitioning remote step performance

We are using remote partitioning in our POC, where we process around 20 million records. To process these records, the slave needs some static metadata, which is around 5000 rows. Our current POC uses EhCache to load this metadata from the DB into the slave once and put it in the cache, so that subsequent calls just get the data from the cache for better performance.
Since we are using remote partitioning, each slave has approximately 20 MDPs/threads, and each message listener first calls the DB to get the metadata, so basically 20 threads are hitting the DB at the same time on each remote machine. We have 2 machines for now but will grow to 4.
My question is: is there a better way to load this metadata only once, for example before the job starts, and make it accessible to all remote slaves?
Or can we use a step listener in the remote step? I don't think this is a good idea, as it will be executed for each remote step execution, but I need expert thoughts on this.
You could set up an EhCache server running as a separate application, or use another caching product such as Hazelcast instead. If commercial products are an option for you, Coherence might also work.
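
If you go the Hazelcast route, the idea would be to load the metadata into a cluster-wide map once, before the job starts, and let every slave thread read from that map instead of the DB. A rough sketch, assuming the metadata can be represented as simple key/value pairs (the map name, types and method names are placeholders):

import java.util.Map;

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class MetadataCache {

    private static final String MAP_NAME = "job-metadata"; // assumed map name

    private final HazelcastInstance hazelcast = Hazelcast.newHazelcastInstance();

    // Run once (e.g. on the master before launching the job): load the ~5000 rows.
    public void preload(Map<String, String> rowsFromDb) {
        Map<String, String> shared = hazelcast.getMap(MAP_NAME);
        shared.putAll(rowsFromDb); // the data now lives in the Hazelcast cluster
    }

    // Called from the slaves' listeners/threads: a cluster lookup instead of a DB hit.
    public String lookup(String key) {
        Map<String, String> shared = hazelcast.getMap(MAP_NAME);
        return shared.get(key);
    }
}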