I am building an application which will run in two phases.
Execute Phase - The first phase is very INSERT intensive (as many inserts as the hardware can possibly execute per second). This is essentially a logging trail of the work performed.
Validation Phase - The next phase will query the logs generated by phase 1, compare them to an external source, and perform an UPDATE on each record to store some statistics. This phase is second priority to phase 1.
I'm trying to see if it's feasible to do them in parallel and keep write locking to a minimum for the execution phase. I thought one way to do this would be to restrict my validation phase to query only older records that are not in the chunk currently being inserted into by the execution phase. Is there something in MongoDB that restricts a find() to only query chunks that have not been accessed in some configurable amount of time?
You probably want to set up a replica set: insert into the master and fetch from the secondaries. That way, your inserts won't be blocked at all.
You can use the replica set with slaveOk for the validation queries, and run the updates against the master.
You can use a timestamp field or the ObjectId (which already contains a timestamp) to filter for older records.
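As a rough illustration (not part of the original answer), here is a minimal pymongo sketch of that filtering idea; the database/collection names, the 5-minute cut-off, and the stand-in validation step are assumptions:

```python
from datetime import datetime, timedelta, timezone

from bson import ObjectId
from pymongo import MongoClient, ReadPreference

# Hypothetical names: "worklog" database, "entries" collection.
client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
entries = client.worklog.get_collection(
    "entries", read_preference=ReadPreference.SECONDARY_PREFERRED
)

# Build an ObjectId whose embedded timestamp is 5 minutes in the past,
# so the validation query only touches records older than that cut-off.
cutoff_id = ObjectId.from_datetime(datetime.now(timezone.utc) - timedelta(minutes=5))

for doc in entries.find({"_id": {"$lt": cutoff_id}}):
    stats = {"validated_at": datetime.now(timezone.utc)}  # stand-in for the real comparison
    # Updates still go to the master/primary.
    client.worklog.entries.update_one({"_id": doc["_id"]}, {"$set": {"stats": stats}})
```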
We are evaluating Citus for the large-scale data use cases in our organization. While analyzing it, I am trying to see if there is a way to achieve the following with Citus:
We want to create a distributed table (customers) with customer_id as the shard/distribution key (customer_id is a UUID generated at the application end).
While we can use regular SQL queries for all the CRUD operations on these entities, we also need to query the table periodically (a periodic task) to select multiple entries based on some filter criteria, fetch the result set into the application, update a few columns, and write them back (a read-and-update operation).
Our application is a horizontally scalable microservice with multiple instances of the service running in parallel.
So we want to split the periodic task into multiple sub-tasks that run on multiple instances of the service and execute in parallel.
So I am looking for a way to query results from a specific shard within a sub-task, so that each sub-task is responsible for fetching and updating the data on one shard only. This would let us run the periodic task in parallel without worrying about conflicts, as each sub-task operates on one shard.
I am not able to find anything in the documentation on how we can achieve this. Is this possible with Citus?
Citus (by default) distributes data across the shards using the hash value of the distribution column, which is customer_id in your case.
To achieve this, you might need to store a (customer_id - shard_id) mapping in your application, assign sub-tasks to shards, and send the queries from each sub-task using this mapping.
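If you go down that path, here is a small sketch of how such a mapping could be built using Citus's get_shard_id_for_distribution_column() function; the connection string, table name, and the list of customer_ids are assumptions:

```python
import psycopg2

# Hypothetical connection settings; "customers" is the distributed table.
conn = psycopg2.connect("dbname=app host=citus-coordinator user=app")
customer_ids = ["1b4e28ba-2fa1-11d2-883f-0016d3cca427"]  # example UUIDs known to the app

shard_for_customer = {}
with conn, conn.cursor() as cur:
    for cid in customer_ids:
        # Maps a distribution-column value to the shard that stores it.
        cur.execute(
            "SELECT get_shard_id_for_distribution_column('customers', %s)", (cid,)
        )
        shard_for_customer[cid] = cur.fetchone()[0]

# Sub-tasks can then each be assigned one shard_id and only process
# the customer_ids that map to their shard.
print(shard_for_customer)
```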
One hacky solution that you might consider: add a dummy column (I will name it shard_id) and make it the distribution column, so that your application knows which rows should be fetched/updated by which sub-task. In other words, each sub-task fetches/updates the rows with a particular value of the shard_id column, and all of those rows are located on the same shard because they share the same distribution-column value. In this case, you can control which customer_ids end up on the same shard and which ones form a separate shard, by assigning them the shard_id you want.
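A rough sketch of that dummy-column idea, assuming psycopg2 and made-up table/column names (how shard_id values are assigned would be up to your application):

```python
import psycopg2

conn = psycopg2.connect("dbname=app host=citus-coordinator user=app")
with conn, conn.cursor() as cur:
    # Distribute on the dummy shard_id column instead of customer_id.
    cur.execute("""
        CREATE TABLE customers (
            customer_id uuid NOT NULL,
            shard_id    int  NOT NULL,  -- dummy column assigned by the application
            name        text
        )
    """)
    cur.execute("SELECT create_distributed_table('customers', 'shard_id')")

    # Each sub-task is handed one shard_id value; all rows with that value
    # hash to the same shard, so the sub-task touches only that shard.
    my_shard_id = 7  # hypothetical value assigned to this sub-task
    cur.execute(
        "UPDATE customers SET name = upper(name) WHERE shard_id = %s",
        (my_shard_id,),
    )
```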
I would also suggest you take a look at "tenant isolation", which is described in the latest blog post: https://www.citusdata.com/blog/2022/09/19/citus-11-1-shards-postgres-tables-without-interruption/#isolate-tenant
It basically isolates a tenant (all data with the same customer_id, in your case) into a single shard. Maybe it works for you at some point.
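For reference, tenant isolation is exposed as a function call; a minimal sketch, assuming Citus 11.1+ and the same hypothetical connection and table as above (colocated tables may additionally require the CASCADE option):

```python
import psycopg2

conn = psycopg2.connect("dbname=app host=citus-coordinator user=app")
with conn, conn.cursor() as cur:
    # Move all rows for one customer_id into a shard of their own.
    cur.execute(
        "SELECT isolate_tenant_to_new_shard('customers', %s)",
        ("1b4e28ba-2fa1-11d2-883f-0016d3cca427",),
    )
```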
If I trigger two updates which are just 1 nanosecond apart, is it possible that the updates could be applied out of order? Say, if the first update were more complex than the second.
I understand that MongoDB is eventually consistent, but I'm just not clear on whether or not the order of writes is preserved.
NB: I am using a legacy system with an old version of MongoDB that doesn't have the newer transaction support.
In MongoDB, write operations are atomic at the document level, as every document in a collection is independent and individual on its own. So when one operation is writing to a document, a second operation has to wait until the first one finishes writing to that document.
From their docs:
a write operation is atomic on the level of a single document, even if the operation modifies multiple embedded documents within a single document.
Ref: atomicity-in-mongodb
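To illustrate the point (my sketch, not part of the original answer; the collection and field names are invented), a single update that touches both a top-level field and an embedded document is applied atomically:

```python
from pymongo import MongoClient

# Hypothetical "orders" collection containing embedded documents.
orders = MongoClient("mongodb://localhost:27017").shop.orders

orders.insert_one({"_id": 1, "status": "new", "shipping": {"carrier": None, "eta": None}})

# The top-level field and the embedded document are changed by one
# update_one call, so no reader ever sees a half-applied document.
orders.update_one(
    {"_id": 1},
    {"$set": {"status": "shipped", "shipping.carrier": "UPS", "shipping.eta": "2d"}},
)
```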
So when can this be an issue? On reads, and especially if your application is very read-heavy. Reads can happen during updates: if a read happens before an update finishes, your app will see the old data, and reading from a secondary can also return inconsistent data.
In general, MongoDB is usually hosted as a replica set (a set of at least 3 nodes), in which writes must be targeted to the primary. By default reads are targeted to the primary as well, but if you override the read preference to read from secondaries in order to keep the primary free (for example for reporting), then you might see occasional issues.
But why? In the background, data gets synced from the primary to the secondaries; if this is delayed or hasn't completed by the time your application reads, you will see stale data, though the chances are low. All of this applies up to MongoDB 4.0 - from 4.0 onward, apps with a secondary read preference read from a WiredTiger snapshot of the data.
Ref: replica-set-data-synchronization
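A minimal pymongo sketch of that read-preference trade-off (my illustration; the connection string and collection name are assumptions):

```python
from pymongo import MongoClient, ReadPreference

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")

# Default: reads go to the primary and always see acknowledged writes.
primary_events = client.app.events

# Overriding the read preference offloads reads, but a secondary may lag
# behind the primary, so a just-written document may not be visible yet.
secondary_events = client.app.get_collection(
    "events", read_preference=ReadPreference.SECONDARY_PREFERRED
)

primary_events.insert_one({"_id": "e1", "state": "done"})
print(primary_events.find_one({"_id": "e1"}))    # always sees the write
print(secondary_events.find_one({"_id": "e1"}))  # may be None until replication catches up
```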
When executing an update query that triggers a lot of (e.g. a million) records to be updated, the underlying index system, as I understand it, needs to re-ingest each document. So for this kind of "heavy" job, is there a way to control its workload, i.e. run the update at a fixed rate until it finishes?
Currently it is not possible to throttle update queries.
It would probably help your use case to split the update into parts by adding specific filters.
For example, if you have a timestamp field you could update each month separately by adjusting the query accordingly.
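A minimal pymongo sketch of that month-by-month splitting (my example; the collection, field names, date range, and the pause between batches are assumptions):

```python
import time
from datetime import datetime, timezone

from pymongo import MongoClient

# Hypothetical "events" collection with a "created_at" timestamp field.
events = MongoClient("mongodb://localhost:27017").app.events

for month in range(1, 13):
    start = datetime(2023, month, 1, tzinfo=timezone.utc)
    end = (datetime(2024, 1, 1, tzinfo=timezone.utc) if month == 12
           else datetime(2023, month + 1, 1, tzinfo=timezone.utc))
    # One bounded batch per month instead of a single huge update.
    events.update_many(
        {"created_at": {"$gte": start, "$lt": end}},
        {"$set": {"processed": True}},
    )
    time.sleep(5)  # crude pacing between batches to keep the load down
```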
I'm doing two consecutive writes to MongoDB (no shards, no replicas):
1. insert data into the db
2. find and modify the data inserted in step 1
When performing step 2, is it guaranteed that the command sees the data inserted in step 1? What is the minimal write concern I should use in step 1 to ensure this?
As for my use case: I know I could merge steps 1 and 2 into one simple step; however, my real use case is much more complicated and cannot be solved that easily.
Your use case will work as long as you are using a write concern of Acknowledged. This is the default write concern in MongoDB 2.2 or later, provided you are using a recent driver (see the link below for the minimum driver version required).
http://docs.mongodb.org/manual/release-notes/drivers-write-concern/
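A minimal pymongo sketch of the two steps with an acknowledged write concern (my illustration; the collection name and document are made up):

```python
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017")

# w=1 is the "Acknowledged" write concern: the insert returns only after
# the server has applied it, so the follow-up query will see the document.
items = client.app.get_collection("items", write_concern=WriteConcern(w=1))

items.insert_one({"_id": 42, "state": "new"})          # step 1
doc = items.find_one_and_update(                       # step 2
    {"_id": 42}, {"$set": {"state": "processed"}}
)
print(doc)
```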
I'm using Map Reduce with MongoDB. Simplified scenario: There are users, items and things. Items include any number of things. Each user can rate things. Map reduce is used to calculate the aggregate rating for each user on each item. It's a complex formula using the ratings for each thing in the item and the time of day - it's not something you could ever index on and thus map-reduce is an ideal approach to calculating it.
The question is: having calculated the results using Map Reduce what strategies do people use to maintain these per-user results collections in their NOSQL databases?
1) On demand with automatic deletion: Keep them around for some set period of time and then delete them; regenerate them as necessary when the user makes a new request?
2) On demand, never delete: Keep them around indefinitely. When the user makes a request and the collection is past its use-by date, regenerate it.
3) Scheduled: Regular process running to update all results collections for all users?
4) Other?
The best strategy depends on the nature of your map-reduce job.
If you're using a separate map-reduce call for each individual user, I would go with the first or second strategy. The advantage of the second over the first is that you always have a result ready, so when the user makes a request and the result is outdated, you can still present the old result while running a new map-reduce in the background to generate a fresh result for subsequent requests. This has the following advantages:
The user doesn't have to wait for the map-reduce to complete, which is important if the map-reduce may take a while to complete. The exception is of course the very first map-reduce call; at this point there is no old result available.
You're automatically running map-reduce only for the active users, reducing the load on the database.
If you're using a single, application-wide map-reduce call for all users, the third strategy is the best approach. You can easily achieve this by specifying an output collection (a sketch follows the list below). The advantages of this approach:
You can easily control the freshness of the result. If you need more up-to-date results, or need to reduce the load on the database, you only have to adjust the schedule.
Your application code isn't responsible for managing the map-reduce calls, which simplifies your application.
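A rough sketch of such a scheduled run using the mapReduce database command with a named output collection (my illustration; the collection names and the map/reduce bodies are placeholders, not the rating formula from the question):

```python
from bson.code import Code
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017").app

# Placeholder map/reduce functions; the real ones would implement the
# per-user, per-item rating formula.
mapper = Code("function () { emit({user: this.user_id, item: this.item_id}, this.rating); }")
reducer = Code("function (key, values) { return Array.sum(values); }")

# Writing to a named output collection replaces its contents, so the
# application always reads a complete, consistent result set.
db.command(
    "mapReduce",
    "ratings",               # source collection (hypothetical name)
    map=mapper,
    reduce=reducer,
    out="user_item_scores",  # output collection read by the application
)
```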
If a user can only see his or her own ratings, I'd go with strategy one or two, or include a lastActivity timestamp in user profiles and run an application-wide scheduled map-reduce job on the active subset of the users (strategy 3). If a user can see any other user's ratings, I'd go with strategy 3 as well, as this greatly reduces the complexity of the application.