I would like to have an action triggered every time an item is created or updated in a DynamoDB table. I have been going through the documentation, but cannot find anything like this. Is it possible?
Thanks.
This is not possible. DynamoDB doesn't let you run any code server-side. The only thing that might count as a server-side action as part of an update is a conditional update, but conditional updates can't trigger changes to other items.
The new update supports triggers.
https://aws.amazon.com/blogs/aws/dynamodb-update-triggers-streams-lambda-cross-region-replication-app/
Now you can use DynamoDB Streams.
A stream consists of stream records. Each stream record represents a single data modification in the DynamoDB table to which the stream belongs. Each stream record is assigned a sequence number, reflecting the order in which the record was published to the stream.
Stream records are organized into groups, or shards. Each shard acts as a container for multiple stream records, and contains information required for accessing and iterating through these records. The stream records within a shard are removed automatically after 24 hours.
The relative ordering of a sequence of changes made to a single primary key will be preserved within a shard. Further, a given key will be present in at most one of a set of sibling shards that are active at a given point in time. As a result, your code can simply process the stream records within a shard in order to accurately track changes to an item.
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html
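For example, here is a minimal sketch of a Lambda function subscribed to a DynamoDB stream (the handler name and print statements are illustrative; the event layout follows the documented stream record format):

```python
# Minimal sketch: a Lambda handler invoked with batches of DynamoDB stream records.
# Assumes the table's stream is configured with NEW_AND_OLD_IMAGES.

def lambda_handler(event, context):
    for record in event.get("Records", []):
        event_name = record["eventName"]            # INSERT, MODIFY or REMOVE
        keys = record["dynamodb"].get("Keys", {})

        if event_name == "INSERT":
            new_image = record["dynamodb"].get("NewImage", {})
            print(f"Item created: {keys} -> {new_image}")
        elif event_name == "MODIFY":
            old_image = record["dynamodb"].get("OldImage", {})
            new_image = record["dynamodb"].get("NewImage", {})
            print(f"Item updated: {keys}: {old_image} -> {new_image}")
        elif event_name == "REMOVE":
            print(f"Item deleted: {keys}")
```

The images arrive in DynamoDB's attribute-value format (e.g. {"S": "value"}), so you may want to unmarshal them before acting on them.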
Check out http://zapier.com/help/dynamodb; it might be what you are looking for.
We are evaluating Citus data for the large-scale data use cases in our organization. While analyzing, I am trying to see if there is a way to achieve the following with Citus data:
We want to create a distributed table (customers) with customer_id being the shard/distribution key (customer_id is a UUID generated at the application end)
While we can use regular SQL queries for all the CRUD operations on these entities, we also need to query the table periodically (a periodic task) to select multiple entries based on some filter criteria, fetch the result set into the application, update a few columns, and write them back (a read-and-update operation).
Our application is a horizontally scalable microservice with multiple instances of the service running in parallel.
So we want to split the periodic task into multiple sub-tasks that run on multiple instances of the service and execute in parallel.
So I am looking for a way to query results from a specific shard within a sub-task, so that each sub-task is responsible for fetching and updating the data on one shard only. This will let us run the periodic task in parallel without worrying about conflicts, as each sub-task operates on a single shard.
I am not able to find anything in the documentation on how we can achieve this. Is this possible with Citus?
Citus (by default) distributes data across the shards using the hash value of the distribution column, which is customer_id in your case.
To achieve this, you might need to store a (customer_id, shard_id) mapping in your application, assign sub-tasks to shards, and issue queries from each sub-task using this mapping.
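One possible way to build that mapping, sketched with psycopg2 and Citus' get_shard_id_for_distribution_column() UDF (treat the table name, connection string and usage as assumptions to verify against your Citus version):

```python
# Rough sketch: group customer_ids by the shard that stores them, so each
# sub-task can be assigned all the customers living on a single shard.
from collections import defaultdict

import psycopg2

def group_customers_by_shard(conn, customer_ids):
    shard_to_customers = defaultdict(list)
    with conn.cursor() as cur:
        for customer_id in customer_ids:
            cur.execute(
                "SELECT get_shard_id_for_distribution_column('customers', %s)",
                (customer_id,),
            )
            shard_id = cur.fetchone()[0]
            shard_to_customers[shard_id].append(customer_id)
    return shard_to_customers

# Illustrative usage: each sub-task then processes one shard's customer list.
# conn = psycopg2.connect("dbname=app host=coordinator")
# assignments = group_customers_by_shard(conn, all_customer_ids)
```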
One hacky solution you might consider: add a dummy column (I will call it shard_id) and make it the distribution column, so that your application knows which rows should be fetched/updated by which sub-task. In other words, each sub-task fetches/updates the rows with a particular value of the shard_id column, and all of those rows are located on the same shard because they share the same distribution column value. In this case, you can control which customer_ids end up on the same shard, and which ones form a separate shard, by assigning them the shard_id you want.
I would also suggest taking a look at "tenant isolation", which is mentioned in the latest blog post: https://www.citusdata.com/blog/2022/09/19/citus-11-1-shards-postgres-tables-without-interruption/#isolate-tenant
It basically isolates a tenant (all data with the same customer_id in your case) into a single shard. Maybe it works for you at some point.
I am wondering if there is a possibility of the Firestore ServerTimestamp being exactly the same for 2 or more documents in a given collection, considering that multiple clients will be writing to the collection. I am asking this because Firestore does not provide an auto-incrementing sequential number for created documents, and we have to rely on the ServerTimestamp to assume serial writes. My use case requires that the documents are numbered, or at least have a semblance of a "linear write" model. My app is mobile and web based.
(There are other ways to have an incremental number, such as a Firebase Cloud Function using the FieldValue.Increment() method, which I am already doing, but this adds one more level of complexity and latency)
Is it safe to assume that every document created in a given collection will have a unique timestamp and there would be no collision? Does Firestore queue up the writes for a collection or are the writes executed in parallel?
Thanks in advance.
Is it safe to assume that every document created in a given collection will have a unique timestamp and there would be no collision?
No, it's not safe to assume that. But it's also extremely unlikely that there will be a collision, depending on how the writes actually occur. If you need a guaranteed order, add another random piece of data to the document in a separate field, and use its sort order to break any ties in a deterministic fashion (see the sketch below). You will have to decide for yourself whether this is worthwhile for your use case.
Does Firestore queue up the writes for a collection or are the writes executed in parallel?
You should consider all writes to be in parallel. No guarantees are made about the order of writes, as that does not scale well at all.
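As an illustration of the tie-breaker suggestion above, here is a small sketch with the Python Admin SDK (the collection and the field names createdAt/tieBreaker are made up for this example):

```python
# Sketch: store a server timestamp plus a random tie-breaker, then order by
# both fields so equal timestamps still sort deterministically.
import uuid

import firebase_admin
from firebase_admin import firestore

firebase_admin.initialize_app()
db = firestore.client()

def create_doc(data):
    doc = dict(data)
    doc["createdAt"] = firestore.SERVER_TIMESTAMP
    doc["tieBreaker"] = uuid.uuid4().hex  # random value used only to break ties
    return db.collection("events").add(doc)

def list_in_order():
    # Ordering on two fields may require a composite index.
    return (
        db.collection("events")
        .order_by("createdAt")
        .order_by("tieBreaker")
        .stream()
    )
```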
We have a large amount of data and need to update documents based on a status. We will be writing in chunks of 500 and are wondering how many records, at most, we can commit within a single trigger invocation.
Our trigger is a Pub/Sub trigger in Firebase Cloud Functions.
We see there is a limit of 540 seconds per invocation, so we would like to know how many documents, at most, we can write in batches.
Update: adding the use case.
I have an event collection (Events) where users can subscribe to each event happening in a country.
Users have an API to see how many events they have subscribed to. They have query flags such as whether the event is Live/Finished/Upcoming.
As I can't store the list of subscribed users as an array in the event document (assuming the subscriber list can grow beyond the document size limit), I maintain a separate sub-collection under the users collection.
Ex: users/user-id/subscribedevents
I update each event document's status (Live/Finished/Upcoming) from a cron job that runs every minute, because I can't apply filters on two different fields (startDate & endDate) in a single query.
Whenever an event's status changes, I need to update the entries in the subscribedevents sub-collection (under the users collection).
As I will be updating all the subscribedevents sub-collection entries, I want to do it in batches.
Hope the use case gives some clarity on where this is applied. As Firestore is designed for scale, I am wondering how others handle this scenario, as it is very common.
Each transaction or batch of writes can write to a maximum of 500 documents.
And 20 maximum batched writes.
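A sketch of how a single invocation could chunk a large update into batches of up to 500 writes, using the Python Admin SDK (the collection, query and status field are assumptions for illustration):

```python
# Sketch: update many documents in committed chunks of up to 500 writes each.
import firebase_admin
from firebase_admin import firestore

firebase_admin.initialize_app()
db = firestore.client()

BATCH_LIMIT = 500  # maximum writes per batched commit

def update_status_in_batches(new_status):
    batch = db.batch()
    count = 0
    for snap in db.collection("events").where("status", "==", "Upcoming").stream():
        batch.update(snap.reference, {"status": new_status})
        count += 1
        if count % BATCH_LIMIT == 0:
            batch.commit()      # flush a full batch
            batch = db.batch()  # start a fresh one
    if count % BATCH_LIMIT != 0:
        batch.commit()          # commit the remainder
    return count
```

Within one invocation, the practical ceiling is the 540-second timeout rather than a fixed document count, so for very large collections you may need to fan the work out over multiple Pub/Sub messages.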
References
Firestore Write Documents
Transaction and batches in firestore
I have a requirement that when a record's date attribute has passed, we would like to trigger two things:
to move the record to be deleted to another table.
to call a function to do other actions.
I understand TTL only deletes a record when the date field expires. Can I hook extra logic into it?
Thanks!
Depending on the requirements there could be quite a few ways to do this.
One way is to execute a script periodically that runs a query to filter documents whose date value has passed. For each of those documents, perform the migration to another collection along with the extra actions.
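A rough sketch of such a periodic job with pymongo (the database, collection and field names, e.g. expireAt, are assumptions):

```python
# Rough sketch: move documents whose expireAt has passed into an archive
# collection, run the extra actions, then delete the originals.
from datetime import datetime, timezone

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["app"]

def handle_extra_actions(doc):
    # placeholder for "call a function to do other actions"
    print("processing", doc["_id"])

def archive_expired():
    now = datetime.now(timezone.utc)
    for doc in db.records.find({"expireAt": {"$lte": now}}):
        db.records_archive.insert_one(doc)          # copy to the other collection
        handle_extra_actions(doc)                   # extra logic
        db.records.delete_one({"_id": doc["_id"]})  # remove the original
```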
Alternatively, you can use MongoDB Change Streams. The catch, however, is that delete events from a change stream do not return the document itself (because it has already been deleted).
Instead, if you update a field on documents whose date value has passed (for example, setting expired:true), you can listen for the update events.
It's worth mentioning that if you're going down the route of change stream update events, you could utilise MongoDB Stitch Triggers (which rely on change streams). MongoDB Stitch database triggers allow you to automatically execute Stitch functions in response to changes in your MongoDB database.
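A sketch of the change-stream variant: some other process flags expired documents with expired:true, and this listener reacts to those update events (collection and field names are assumptions; change streams require a replica set):

```python
# Sketch: listen for update events that set expired:true and react to them.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
records = client["app"]["records"]

pipeline = [
    {"$match": {
        "operationType": "update",
        "updateDescription.updatedFields.expired": True,
    }}
]

# full_document="updateLookup" returns the current document with each event,
# which a delete event could not give us.
with records.watch(pipeline, full_document="updateLookup") as stream:
    for change in stream:
        doc = change["fullDocument"]
        print("expired document:", doc["_id"])
        # migrate the document and run the extra actions here
```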
I suggest writing a function and calling it via a scheduler. That would be the better option.
I have a use case where a set of records in a collection needs to be deleted after a specified interval of time.
For example: records older than 10 hours should be deleted every 10th hour.
We have tried deletion based on _id but found it to be slow.
Is there a way to partition the records in a collection and drop a partition as and when required in MongoDB?
MongoDB does not currently support partitions, there is a JIRA ticket to add this as a feature (SERVER-2097).
One solution is to leverage multiple time-based collections, cycling collections in a similar way as you would partitions. This typically works well when you usually only query one or a few of these time-based collections at a time. If you often need to read across multiple collections, you could add some wrapper code to simplify that.
There are also TTL Indexes, which leverage a background thread in the mongod server to handle the deletes for you.
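For the TTL route, a minimal pymongo sketch (collection and field names are assumptions); note that the TTL monitor runs roughly once a minute, so expiry is not exact to the second:

```python
# Minimal sketch: TTL index that removes documents 10 hours after createdAt.
from datetime import datetime, timezone

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
records = client["app"]["records"]

# The background TTL monitor deletes documents once createdAt is older than
# 10 hours (36000 seconds).
records.create_index("createdAt", expireAfterSeconds=10 * 60 * 60)

records.insert_one({"createdAt": datetime.now(timezone.utc), "payload": "example"})
```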
Your deletes by _id may have been slow for a number of reasons, and probably warrants more investigation beyond your original question.