Guarantee creation order in DynamoDB?

Is there a way to have my items automatically ordered by creation order in DynamoDB?
I've tried using an ISO timestamp in my sort-keys but I soon noticed that items created in the same second have no guaranteed order.
Another major issue is that my sort key is composite, for example:
Sort_Key : someRandomUuidHere|Created:someTimeStampHere
I need to generate UUIDs to guarantee the key is unique, but adding the timestamp at the end of the UUID doesn't order items by the timestamp; they are ordered by the UUID instead. And if I add the timestamp at the start, I can't use things like begins_with, so it breaks my access patterns.
The only way I could think of was maintaining a "last-key" object and always asking it for the last item index before writing, but that would require an extra request and some ACID logic.
Or maybe just order it on the client side; there's always that.

Sort keys are sorted left to right, so yes, the UUID would be the only component that affects the order.
Why not place the timestamp first?
Alternatively, consider using a time-based UUID. DynamoDB doesn't offer one natively, although some NoSQL databases, such as Cassandra, do.
Here's an article that discusses creating one with .NET:
Creating a Time UUID (GUID) in .NET
It includes a link to the following source:
https://github.com/fluentcassandra/fluentcassandra/blob/master/src/GuidGenerator.cs
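A minimal Python sketch of a time-ordered identifier, in the spirit of the time-based UUID suggested above. It is neither Cassandra's TimeUUID layout nor a DynamoDB feature; the names `time_ordered_id` and `make_sort_key` and the `ITEM#` prefix are hypothetical. The only idea it illustrates is that a fixed-width timestamp placed before the random part makes string order follow creation time, while a constant prefix can still be matched with begins_with.

```python
import os
import time


def time_ordered_id(now_ms=None):
    """Return a 32-character hex id whose string order follows creation time."""
    if now_ms is None:
        now_ms = time.time_ns() // 1_000_000  # milliseconds since the epoch
    # 12 zero-padded hex digits (48 bits) of timestamp, then 80 random bits
    # so that ids created in the same millisecond remain unique.
    return f"{now_ms:012x}{os.urandom(10).hex()}"


def make_sort_key(prefix="ITEM#"):
    # Hypothetical composite sort key: constant prefix, then the time-ordered id.
    return prefix + time_ordered_id()
```

Ids created in the same millisecond stay unique, but their relative order is random. If strict ordering within a millisecond matters, some kind of counter is still needed.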


Swift: Implementing custom merge policy

I'm building an OS X app using Swift, with Core Data as my data layer. As part of this, I have a table that lists a large number of files, with metadata associated with each. Each record can include a URI that points to one of three services it can be hosted on.
Columns: title, created_at, size, uuid, source_local, source_remote, source_cloud
I generate all the records using information pulled from the local source. These records all have a source_local string.
Later I import a number of records from the remote source. These records are all added and have a source_remote string.
A number of these records are hosted on both services, and have matching UUIDs. There is a unique constraint on the UUID field, and I want Swift to merge these two records' fields in some way when it has a constraint error.
I've tried:
NSMergeByPropertyStoreTrumpMergePolicy
and
NSMergeByPropertyObjectTrumpMergePolicy
But these policies result in one record completely trumping the other.
Currently I have to work around this limitation by checking if a record already exists with the UUID and updating the existing record with any missing fields in the new file.
However, this feels non-optimal. Is there a way to create a custom merge policy, in order to have Swift automatically handle conflicts in this way? At this stage I am not concerned with whether the Store or Memory record trumps the other, as long as I can correctly merge the source_* fields.
Thanks
First of all, thanks to @tom-harrington for his nod to extend NSMergePolicy. Huge oversight on my part that I hadn't even considered going down that route.
While exploring how NSMergeByPropertyStoreTrumpMergePolicy/NSMergeByPropertyObjectTrumpMergePolicy are implemented, however, I realised that this issue stems from a misunderstanding on my part. These policies already handle conflicts at a property level. Rather than discarding the entirety of one of the object states on conflict, they compare each property and only apply the policy to those properties which have both changed/exist.
NSOverwriteMergePolicy and NSRollbackMergePolicy are policies that will result in one of either object A or B being completely discarded on conflict.

Limit student to select only one job offer

I have a database using PostgreSQL, which holds data on students, applications and job offers.
Is there some kind of constraint that will mean a student can only accept one job offer? So, once they select 'yes' on the 'job accepted' attribute, they can no longer do this for any other job offers they may receive?
It is not exactly a "constraint". It is just a column. In the Student table have a column called AcceptedJobOffer. That solves the direct problem. In addition, you want the following:
AcceptedJobOfferId int references JobOffers(JobOfferid)
And, then create a unique index on Applications for StudentId, JobOfferId and include:
foreign key (StudentId, AcceptedJobOfferId) references Applications(StudentId, JobOfferId)
This ensures that the job offer is a valid job and that it references an application (assuming that an application is a requirement -- 100% of the time -- for acceptance).
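The column-plus-composite-foreign-key design above can be tried out with a small Python sketch. It uses SQLite through the standard `sqlite3` module rather than PostgreSQL so that it runs without a server; the snake_case table and column names and the `accept_offer` helper are assumptions made for illustration.

```python
import sqlite3


def create_schema(conn):
    # SQLite only enforces foreign keys when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE job_offers (
            job_offer_id INTEGER PRIMARY KEY
        );
        CREATE TABLE students (
            student_id INTEGER PRIMARY KEY,
            -- A single column, so at most one accepted offer per student.
            accepted_job_offer_id INTEGER,
            FOREIGN KEY (student_id, accepted_job_offer_id)
                REFERENCES applications (student_id, job_offer_id)
        );
        CREATE TABLE applications (
            student_id INTEGER NOT NULL REFERENCES students (student_id),
            job_offer_id INTEGER NOT NULL REFERENCES job_offers (job_offer_id),
            UNIQUE (student_id, job_offer_id)
        );
    """)


def accept_offer(conn, student_id, job_offer_id):
    # Overwrites any earlier acceptance; raises sqlite3.IntegrityError
    # unless the student has an application for this offer.
    with conn:
        conn.execute(
            "UPDATE students SET accepted_job_offer_id = ? WHERE student_id = ?",
            (job_offer_id, student_id),
        )
```

Because the accepted offer lives in one column of the student row, accepting a second offer replaces the first instead of adding to it.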
I imagine you've some kind of job applications table, which has a field called is_accepted in it or something to that order. You can add an exclude constraint on it. Example here.
An alternative is to add an accepted_job_id column (ideally a foreign key) to the students table, as already suggested by Gordon.
Side note: If this is going to be dealing with real data, rather than theoretical data in a database course, you probably do not want to enforce the constraint at all. Sometimes, people want or need multiple jobs, so limiting the system in such a way that they cannot apply to more than one job introduces an artificial limitation which may come back and bite you down the road.

Changing the primary key on a MongoDB collection

I took a shortcut earlier and made the primary key of my Mongo database by concatenating various fields to create a "unique id".
I would now like to change it to actually use the ObjectId. What's the best approach? I have a little over 3M documents and would like this to be as non-disruptive as possible.
A simple approach would be to bring down the site for a bit and then copy every document from one to the other one which is using ObjectIds but I'd like to keep the application running if I can. I imagine one way would be to write to both for a period of time while the migration happens but that would require me having two similar code bases so I wonder if there's a way to avoid all that.
To provide some additional information:
It's just one collection that's not referenced by any others. I have another MySQL database that contains some values that are used to create the queries that read from this MongoDB collection.
I'm using PyMongo/Mongoengine libraries to interact with MongoDB from Python and I don't know if it's possible to just change the primary key for a collection.
You shouldn't bring your site down for any reason if it does not go down itself. :)
No matter how many millions of records you have, the solution to the problem resides on how you use your ids.
If you cross-reference documents in different collections using these ids, then for every updated object, you will have to update all other objects that reference it.
As a first step, your system should be updated to stop creating new objects in the old way. If your system lets you easily do this, then you can update your database very easily. If this change is not easy to make, then your system has some architectural problems and you should first change this. If this is the situation, please update your question so I can update my answer.
Since I don't know anything about your applications and data, what I say will be too general. Let's call the collection you want to update coll_bad_id. Every item in this collection is referenced in other collections like coll_poor_guy and coll_wisdom_searcher. How I would do this is to run over coll_bad_id one item at a time like this:
1. read one item
2. update _id with new style of _id
3. insert item back to collection
-- now we have two copies of the same item one with old-style id, one with new
4. update each item referencing this to use new style id
5. remove the duplicate item with old-style id from collection
One thing you should keep in mind is that BSON ObjectIds contain date/time data that can be very useful. Since you rebuild all these objects in one day, your ObjectIds will not reflect the correct creation times for these items. For newly added items, they will. You can note the first newly added item as the milestone for items whose ids carry correct creation times.
UPDATE: Code sample to run on Mongo shell.
This is not the most efficient way to do this, but it is safe to run since we do not remove anything before adding it back with a new _id. It would be better to do this in small batches by adding a query to the find() call.
var cursor = db.testcoll.find();
cursor.forEach(function (item) {
    var oldId = item._id;                // Save the old _id to use for removal below.
    delete item._id;                     // Without an _id, Mongo generates a new unique _id on insert.
    db.testcoll.insert(item);            // Insert the copy first, so nothing is lost if the script stops here.
    db.testcoll.remove({ _id: oldId });  // Then delete the original, matching explicitly on the old _id.
});

Using an RDBMS as event sourcing storage

If I were using an RDBMS (e.g. SQL Server) to store event sourcing data, what might the schema look like?
I've seen a few variations talked about in an abstract sense, but nothing concrete.
For example, say one has a "Product" entity, and changes to that product could come in the form of: Price, Cost and Description. I'm confused about whether I'd:
1. Have a "ProductEvent" table that has all the fields for a product, where each change means a new record in that table, plus "who, what, where, why, when and how" (WWWWWH) as appropriate. When cost, price or description are changed, a whole new row is added to represent the Product.
2. Store product Cost, Price and Description in separate tables joined to the Product table with a foreign key relationship. When changes to those properties occur, write new rows with WWWWWH as appropriate.
3. Store WWWWWH, plus a serialised object representing the event, in a "ProductEvent" table, meaning the event itself must be loaded, de-serialised and re-played in my application code in order to re-build the application state for a given Product.
Particularly I worry about option 2 above. Taken to the extreme, the product table would be almost one-table-per-property, where to load the Application State for a given product would require loading all events for that product from each product event table. This table-explosion smells wrong to me.
I'm sure "it depends", and while there's no single "correct answer", I'm trying to get a feel for what is acceptable, and what is totally not acceptable. I'm also aware that NoSQL can help here, where events could be stored against an aggregate root, meaning only a single request to the database to get the events to rebuild the object from, but we're not using a NoSQL db at the moment so I'm feeling around for alternatives.
The event store should not need to know about the specific fields or properties of events. Otherwise every modification of your model would result in having to migrate your database (just as in good old-fashioned state-based persistence). Therefore I wouldn't recommend options 1 and 2 at all.
Below is the schema as used in Ncqrs. As you can see, the table "Events" stores the related data as a CLOB (i.e. JSON or XML). This corresponds to your option 3, except that there is no "ProductEvents" table, because you only need one generic "Events" table. In Ncqrs, the mapping to your Aggregate Roots happens through the "EventSources" table, where each EventSource corresponds to an actual Aggregate Root.
Table Events:
Id [uniqueidentifier] NOT NULL,
TimeStamp [datetime] NOT NULL,
Name [varchar](max) NOT NULL,
Version [varchar](max) NOT NULL,
EventSourceId [uniqueidentifier] NOT NULL,
Sequence [bigint],
Data [nvarchar](max) NOT NULL
Table EventSources:
Id [uniqueidentifier] NOT NULL,
Type [nvarchar](255) NOT NULL,
Version [int] NOT NULL
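To make option 3 concrete, here is a rough Python sketch of a single generic events table with replay. It uses SQLite through the standard `sqlite3` module and a trimmed-down set of columns; it is not the Ncqrs schema or API, and the function names, the event names and the "merge the payload into a dict" replay rule are all assumptions made for illustration.

```python
import json
import sqlite3


def create_event_store():
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE events (
            event_source_id TEXT NOT NULL,
            sequence INTEGER NOT NULL,
            name TEXT NOT NULL,
            data TEXT NOT NULL,  -- serialised event payload (JSON here)
            PRIMARY KEY (event_source_id, sequence)
        )
    """)
    return conn


def append_event(conn, source_id, sequence, name, payload):
    # The store never looks inside the payload; it only serialises it.
    with conn:
        conn.execute(
            "INSERT INTO events VALUES (?, ?, ?, ?)",
            (source_id, sequence, name, json.dumps(payload)),
        )


def rebuild_state(conn, source_id):
    """Replay the events of one event source in order (a left fold)."""
    state = {}
    rows = conn.execute(
        "SELECT data FROM events WHERE event_source_id = ? ORDER BY sequence",
        (source_id,),
    )
    for (data,) in rows:
        state.update(json.loads(data))  # each event overwrites the fields it changed
    return state
```

Adding a new kind of event, or a new property to an existing one, changes only the serialised payload, so the table itself does not need a migration.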
The SQL persistence mechanism of Jonathan Oliver's Event Store implementation consists basically of one table called "Commits" with a BLOB field "Payload". This is pretty much the same as in Ncqrs, only that it serializes the event's properties in binary format (which, for instance, adds encryption support).
Greg Young recommends a similar approach, as extensively documented on Greg's website.
The schema of his prototypical "Events" table reads:
Table Events
AggregateId [Guid],
Data [Blob],
SequenceNumber [Long],
Version [Int]
The GitHub project CQRS.NET has a few concrete examples of how you could do EventStores in a few different technologies. At time of writing there is an implementation in SQL using Linq2SQL and a SQL schema to go with it, there's one for MongoDB, one for DocumentDB (CosmosDB if you're in Azure) and one using EventStore (as mentioned above). There's more in Azure like Table Storage and Blob storage which is very similar to flat file storage.
I guess the main point here is that they all conform to the same principle/contract. They all store information in a single place/container/table, they use metadata to distinguish one event from another, and they 'just' store the whole event as it was: serialised in some cases, or as-is in technologies that support it. So whether you pick a document database, a relational database or even flat files, there are several different ways to reach the same intent of an event store (which is useful if you change your mind at any point and find you need to migrate or support more than one storage technology).
As a developer on the project I can share some insights on some of the choices we made.
Firstly, we found that (even with UUIDs/GUIDs instead of integers) sequential IDs still occur for many, often strategic, reasons, so an ID on its own wasn't unique enough for a key. We therefore merged our main ID key column with the data/object type to create what should be a truly unique key (in the sense of your application). I know some people say you don't need to store it, but that will depend on whether you are greenfield or have to co-exist with existing systems.
We stuck with a single container/table/collection for maintainability reasons, but we did play around with a separate table per entity/object. We found in practice that this meant either the application needed "CREATE" permissions (which generally speaking is not a good idea... generally, there are always exceptions/exclusions) or each time a new entity/object came into existence or was deployed, new storage containers/tables/collections needed to be made. We found this was painfully slow for local development and problematic for production deployments. You may not, but that was our real-world experience.
Another thing to remember is that asking for action X to happen may result in many different events occurring, so knowing all the events generated by a command/event/whatever is useful. They may also span different object types, e.g. pushing "buy" in a shopping cart may trigger account and warehousing events to fire. A consuming application may want to know all of this, so we added a CorrelationId. This meant a consumer could ask for all events raised as a result of their request. You'll see that in the schema.
Specifically with SQL, we found that performance really became a bottleneck if indexes and partitions weren't adequately used. Remember that events will need to be streamed in reverse order if you are using snapshots. We tried a few different indexes and found that, in practice, some additional indexes were needed for debugging real-world applications in production. Again, you'll see that in the schema.
Other in-production metadata was useful during production-based investigations: timestamps gave us insight into the order in which events were persisted vs. raised. That gave us some assistance on a particularly heavily event-driven system that raised vast quantities of events, giving us information about the performance of things like networks and the system's distribution across the network.
Well, you might want to take a look at Datomic.
Datomic is a database of flexible, time-based facts, supporting queries and joins, with elastic scalability, and ACID transactions.
I wrote a detailed answer here
You can watch a talk from Stuart Halloway explaining the design of Datomic here
Since Datomic stores facts in time, you can use it for event sourcing use cases, and so much more.
I think solutions 1 and 2 can become a problem very quickly as your domain model evolves. New fields are created, some change meaning, and some are no longer used. Eventually your table will have dozens of nullable fields, and loading the events will be a mess.
Also, remember that the event store should be used only for writes; you query it only to load the events, not the properties of the aggregate. They are separate things (that is the essence of CQRS).
Solution 3 is what people usually do, and there are many ways to accomplish it.
As an example, EventFlow CQRS, when used with SQL Server, creates a table with this schema:
CREATE TABLE [dbo].[EventFlow](
    [GlobalSequenceNumber] [bigint] IDENTITY(1,1) NOT NULL,
    [BatchId] [uniqueidentifier] NOT NULL,
    [AggregateId] [nvarchar](255) NOT NULL,
    [AggregateName] [nvarchar](255) NOT NULL,
    [Data] [nvarchar](max) NOT NULL,
    [Metadata] [nvarchar](max) NOT NULL,
    [AggregateSequenceNumber] [int] NOT NULL,
    CONSTRAINT [PK_EventFlow] PRIMARY KEY CLUSTERED
    (
        [GlobalSequenceNumber] ASC
    )
)
where:
GlobalSequenceNumber: Simple global identification, may be used for ordering or identifying the missing events when you create your projection (readmodel).
BatchId: An identification of the group of events that were inserted atomically (TBH, I have no idea why this would be useful)
AggregateId: Identification of the aggregate
Data: Serialized event
Metadata: Other useful information from the event (e.g. event type used for deserialization, timestamp, originator id from the command, etc.)
AggregateSequenceNumber: Sequence number within the same aggregate (this is useful if you cannot have writes happening out of order, so you use this field for optimistic concurrency)
However, if you are creating one from scratch, I would recommend following the YAGNI principle and creating it with the minimal required fields for your use case.
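How a per-aggregate sequence number can give optimistic concurrency is sketched below in Python with SQLite. This is not EventFlow's implementation: the UNIQUE constraint on the (aggregate id, sequence number) pair is an assumption that does not appear in the schema shown above, and `ConcurrencyError`, `create_store` and `append_event` are hypothetical names.

```python
import sqlite3


class ConcurrencyError(Exception):
    """Another writer already appended the event we were about to write."""


def create_store():
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE event_flow (
            global_sequence_number INTEGER PRIMARY KEY AUTOINCREMENT,
            aggregate_id TEXT NOT NULL,
            aggregate_sequence_number INTEGER NOT NULL,
            data TEXT NOT NULL,
            -- Assumed constraint: one event per sequence number per aggregate.
            UNIQUE (aggregate_id, aggregate_sequence_number)
        )
    """)
    return conn


def append_event(conn, aggregate_id, expected_version, data):
    # The writer states which version it last saw; the next event gets
    # expected_version + 1. If someone else already wrote that number,
    # the UNIQUE constraint rejects the insert.
    try:
        with conn:
            conn.execute(
                "INSERT INTO event_flow "
                "(aggregate_id, aggregate_sequence_number, data) VALUES (?, ?, ?)",
                (aggregate_id, expected_version + 1, data),
            )
    except sqlite3.IntegrityError as exc:
        raise ConcurrencyError(
            f"{aggregate_id} is already past version {expected_version}"
        ) from exc
```

A caller that gets `ConcurrencyError` would typically reload the aggregate's events, re-apply its command, and try again.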
One possible hint: the design used for a "Slowly Changing Dimension" (type 2) should help you cover:
order of events occurring (via surrogate key)
durability of each state (valid from - valid to)
Left fold function should be also okay to implement, but you need to think of future query complexity.
I reckon this is a late answer, but I would like to point out that using an RDBMS as event sourcing storage is totally possible if your throughput requirement is not high. I will just show you examples of an event-sourcing ledger I built to illustrate.
https://github.com/andrewkkchan/client-ledger-service
The above is an event sourcing ledger web service.
https://github.com/andrewkkchan/client-ledger-core-db
In the above, I use an RDBMS to compute states, so you can enjoy all the advantages that come with an RDBMS, like transaction support.
https://github.com/andrewkkchan/client-ledger-core-memory
And I have another consumer that processes in memory to handle bursts.
One could argue that the actual event store above still lives in Kafka, as an RDBMS is slow for inserts, especially when the inserts are always appends.
I hope the code helps to give you an illustration, apart from the very good theoretical answers already provided for this question.

Options for handling a frequently changing data form

What are some possible designs to deal with frequently changing data forms?
I have a basic CRUD web application where the main data entry form changes yearly. So each record should be tied to a specific version of the form. This requirement is kind of new, so the existing application was not built with this in mind.
I'm looking for different ways of handling this, hoping to avoid future technical debt. Here are some options I've come up with:
1. Create a new object, UI and set of tables for each version. This is obviously the most naive approach.
2. Keep adding all the fields to the same object and DB tables, but show/hide them based on the form version. This will become a mess after a few changes.
3. Build form definitions, then dynamically build the UI and store the data in some dictionary-like format (e.g. JSON/XML or maybe a document-oriented database). I think this is going to be too complex for the scope of this app, especially for the UI.
What other possibilities are there? Does anyone have experience doing this? I'm looking for some design patterns to help deal with the complexity.
First, I will speak to your solutions above and then I will give my answer.
Creating a new table for each version is going to require new programming every year since you will not be able to dynamically join to the new table and include the new columns easily. That seems pretty obvious and really makes this a bad choice.
The issues you mentioned with adding the columns to the same form are correct. Also, whatever database you are using has a max on how many columns it can handle and how many bytes it can have in a row. That could become another concern.
The third option I think is the closest to what you want. I would not store the new column data in a JSON/XML unless it is for duplication to increase speed. I think this is your best option.
The only option you didn't mention was storing all of the data in 1 database field and using XML to parse. This option would make it tough to query and write reports against.
If I had to do this:
The first table would have the columns ID (seeded), Name, InputType, CreateDate, ExpirationDate, and CssClass. I would call it tbInputs.
The second table would have 5 columns: ID, Input_ID (with FK to tbInputs.ID), Entry_ID (with FK to the main/original table), Value, and CreateDate. The FK to the main/original table would allow you to find what items were attached to what form entry. I would call this table tbInputValues.
If you don't plan on having that base table, then I would use a simple table that tracks the creation date, creator ID, and the form_id.
Once you have those you will just need to create a dynamic form that pulls back all of the inputs that are currently active and display them. I would put all of the dynamic controls inside of some kind of container like a <div> since it will allow you to loop through them without knowing the name of every element. Then insert into tbInputValues the ID of the input and its value.
Create a form to add or remove an input. This would mean you would not have much, if any, maintenance work to do each year.
I think this solution may not seem like the most elegant, but if executed correctly I do think it is your most flexible solution and the one that incurs the least technical debt.
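The two-table design described in this answer can be sketched in Python with SQLite. The table and column names follow the answer; everything else is an assumption made for illustration: the `active_inputs` and `save_value` helpers, dates stored as ISO strings, a NULL ExpirationDate meaning "still active", and Entry_ID left as a plain integer because the main/original table is not shown.

```python
import sqlite3


def create_schema(conn):
    conn.executescript("""
        CREATE TABLE tbInputs (
            ID INTEGER PRIMARY KEY,
            Name TEXT NOT NULL,
            InputType TEXT NOT NULL,
            CreateDate TEXT NOT NULL,   -- ISO date, e.g. 2020-01-01
            ExpirationDate TEXT,        -- NULL means the input never expires
            CssClass TEXT
        );
        CREATE TABLE tbInputValues (
            ID INTEGER PRIMARY KEY,
            Input_ID INTEGER NOT NULL REFERENCES tbInputs (ID),
            Entry_ID INTEGER NOT NULL,  -- would reference the main/original table
            Value TEXT,
            CreateDate TEXT NOT NULL
        );
    """)


def active_inputs(conn, on_date):
    """Return (ID, Name, InputType) for inputs that are active on the given date."""
    rows = conn.execute(
        "SELECT ID, Name, InputType FROM tbInputs "
        "WHERE CreateDate <= ? AND (ExpirationDate IS NULL OR ExpirationDate > ?) "
        "ORDER BY ID",
        (on_date, on_date),
    )
    return rows.fetchall()


def save_value(conn, input_id, entry_id, value, on_date):
    # One row per answered input, tied to the form entry it belongs to.
    with conn:
        conn.execute(
            "INSERT INTO tbInputValues (Input_ID, Entry_ID, Value, CreateDate) "
            "VALUES (?, ?, ?, ?)",
            (input_id, entry_id, value, on_date),
        )
```

The dynamic form would render whatever `active_inputs` returns for today's date, so retiring a field is a matter of setting its ExpirationDate.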
I think the third approach (XML) is the most flexible. A simple XML structure is generated very fast and can be easily versioned and validated against an XSD.
You'd have a table holding the XML in one column and the year/version this xml applies to.
Generating UI code based on the schema is basically a bad idea. If you do not require extensive validation, you can opt for a simple editable table.
If you need a custom form every year, I'd look at it as kind of a job guarantee :-) It's important to make the versioning mechanism and extension transparent and explicit though.
For this particular app, we decided to deal with the problem as if there was one form that continuously grows. Due to the nature of the form this seemed more natural than more explicit separation. We will have a mapping of year->field for parts of the application that do need to know which data is for which year.
For the UI, we will be creating a new page for each year's form. Dynamic form creation is far too complex in this situation.
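As a rough illustration of the year->field mapping mentioned above, here is a small Python sketch. The years and field names are made up, and it assumes the form only ever gains fields, as the "continuously grows" description suggests.

```python
# Hypothetical mapping from the year a field was introduced to the field names.
FIELDS_BY_YEAR = {
    2018: ["name", "address"],
    2019: ["household_size"],
    2021: ["remote_work"],
}


def fields_available_in(year):
    """Return every field that exists on the form for the given year."""
    return [
        field
        for introduced in sorted(FIELDS_BY_YEAR)
        if introduced <= year
        for field in FIELDS_BY_YEAR[introduced]
    ]
```

Each year's page can then ask for its own field list without the underlying record structure changing.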