I was wondering if anyone could share best practices around the design of an e-commerce discount engine that has to serve many concurrent users.
In my system, users can be allocated purchase credits that allow them to purchase things for free. So, for example, the user selects a basket of products that is passed to a discount engine, where rules are applied based on the credits assigned to the user's account. Say the user has 5 credits: how do I ensure that a credit can be used once and only once? Will I need to introduce some form of database locking? Would I store a count of credits in a single table, or create distinct records to model each credit?
I suppose this is analogous to a ticket booking system where it is imperative that a single ticket can't be sold to more than one customer at a time. It seems to be about ensuring that, even in a highly concurrent environment, no purchase credit can be used twice.
Hopefully I'm making at least a little bit of sense!
Just use an SQL database that doesn't suck at transactions, stick the operation of using up a credit into a single transaction (possibly having the DB constrain the number of credits to be non-negative or something), and it should not be possible for two concurrent transactions to use the same credit. Databases are REALLY good at that sort of thing; it's exactly what they are for.
Basically, just shove everything shared into a database and wrap operations that belong together in a transaction, and your front-end code can pretend there's no concurrency at all. Which is, of course, the entire reason RDBMSs exist.
EDIT: Your schema won't affect the correctness of this approach (although where you begin/end transactions will); it will affect your performance, as will how the DB is implemented. I'm only bringing this up because you tagged the question with 'database-schema' and don't seem to be aware that an ACID DB will just make what you want happen if you write your queries correctly.
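For illustration, here is a minimal sketch of that idea using Python's built-in sqlite3 module; any ACID RDBMS works the same way, and the table, column and function names are made up for the example rather than taken from the question.

```python
import sqlite3

# Hypothetical schema: one row per user, and the database itself refuses
# to let the balance go negative.
conn = sqlite3.connect("shop.db", isolation_level=None)   # autocommit; we manage txns
conn.execute("""
    CREATE TABLE IF NOT EXISTS user_credits (
        user_id INTEGER PRIMARY KEY,
        credits INTEGER NOT NULL CHECK (credits >= 0)
    )
""")

def use_credit(user_id: int) -> bool:
    """Spend exactly one credit; return False if the user has none left."""
    conn.execute("BEGIN IMMEDIATE")                 # take the write lock up front
    try:
        cur = conn.execute(
            "UPDATE user_credits SET credits = credits - 1 "
            "WHERE user_id = ? AND credits > 0",
            (user_id,),
        )
        if cur.rowcount == 0:                       # no credits left (or unknown user)
            conn.execute("ROLLBACK")
            return False
        # ... record the discounted order here, inside the same transaction ...
        conn.execute("COMMIT")
        return True
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Two concurrent callers cannot both spend the last credit: the conditional UPDATE (backed by the CHECK constraint) means whichever transaction commits second sees zero remaining credits and updates no rows.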
I'm designing a microservice-based architecture for an existing monolithic app. The database has a User table, and then a separate Subscription database storing subscription info for premium, subscribed users.
My initial idea was to do something like User Service <-> Subscription Service <-> Payment Service. But a colleague believes that the subscription service is unnecessary and can just be merged with the User Service. What is the best practice for this?
The problem is analogous to choosing bounded-context boundaries in DDD. Choosing bounded contexts can be done with context mapping: https://www.infoq.com/articles/ddd-contextmapping/ As far as I remember, another approach is to imagine an organization doing manually the same job you automate with your services, and to check what departments it would need. The usual example for this is a webshop, where for example security, shopping, storage, invoicing, delivery and support can go into different services. All of them have a user, and some of them have a product, but they look at these things from different viewpoints: in the shopping context the product is something you have in the catalogue and that customers choose to add to their basket; for invoicing it is an item in the order with a certain price at the moment of ordering; for storage it is something that must be found in the physical world and probably has a container number; for delivery it is part of a package that must be delivered to an address. Often a lot of properties are just copy-paste with different names across these bounded contexts, and the DRY principle is not met. So if the reason your colleague wants to merge them is meeting the DRY principle, then that does not matter. If they belong together logically, and in a real company a single department would handle both, then they can be merged.

There is not much detail in your question, but I think I would separate the subscription service too, and I would probably rename the user service to "security" if all it does is registration, login handling and authorization. If it contains something else as well, then it is better to check the boundaries: maybe some of the properties belong to a different service, or maybe there is another service which does not exist yet.
Another practical aspect is scalability. You can wait until certain parts of a single service are used a lot more than other parts, and then think about splitting for scalability reasons.
Another analogy is choosing class size in OOP. Some people claim the smaller the better, but everybody is comfortable with different class sizes. Of course there are classes that are too small (over-engineering) and classes that are too big (god objects), but there is a range where the class size is acceptable, and within that range the choice comes down to personal taste or convenience. I think the same happens with microservices, so maybe your colleague is just comfortable with bigger services. Going by the name of the technology, the service size should be micro, though that is relative too: micro compared to what?
In my view a single deployable service should be no bigger than a bounded context, but no smaller than an aggregate. I would also suggest that it’s better to start with relatively broad service boundaries to begin with, refactoring to smaller ones as time goes on. Perhaps “micro” is a misleading prefix here. These are not necessarily “small” as in “little”. These are services in the classic SOA mould, except created with more of an agile mind-set including lightweight infrastructure, decentralized governance and greater emphasis on automation. Perhaps it’s these more agile aspects of microservice design that distinguishes them as opposed to size.
https://www.ben-morris.com/how-big-is-a-microservice/
A Microservice shall be no larger than that which allows a two-pizza team to release a single complete, appropriately sized user story to production within a single day.
https://kylegenebrown.medium.com/whats-the-right-size-for-a-microservice-bf1740370d47
Microservice developers can start wherever it “feels right”, and over time split or merge microservices to improve the architecture. Although I agree that microservice architectures shouldn’t be carved in stone and that there should always be the option of changing microservice decomposition, it’d be nice to have more clarity as to where a good starting point would be.

Identify critical business transactions that require high performance and/or solid data integrity and trace those transactions through the microservices in your architecture. If you see problems, consider changing microservice boundaries and merging a few microservices together, as an option.
https://www.pipservices.org/blog/whats-the-right-size-for-a-microservice
I am trying to figure out which is the best option for storing individual user logging information and general meta profiling data for each user on our system.
The original idea was to have a "profiler" collection and each document would represent a user. The problem with this design is that a power user could rack up so much meta data and history over the course of a year or less that it exceeds the document size limit. It also would force the documents to have deeper and more complex structures, which could result in slower queries.
The alternative design idea is to create a collection for each user, where each document would hold a specific type of profiling or history data. There are several benefits to this, namely speed, yet it also presents querying challenges when needing to run comparisons against other users (solvable through other tracking DBs). I can't find a definitive answer to the question of how many collections a single Mongo database can contain.
If it can handle millions upon millions of collections per database then fantastic; otherwise I need to find better options for modeling this data. Am I going about this the right way?
The goal is to maintain a history of a user's interactions, reputation tracking, their interests over time, features they use regularly, etc., which can allow for a richer experience.
Create 2 collections: Users & User interactions.
There are certain things that make complete sense to store inside a User's document:
Reputation tracking
Interests -- common tags (similar to stack overflow) that a user frequents
Features -- this should be a finite list of items. You could key each feature and use $inc to bump its counter as it is used
User interactions, on the other hand, are more of a log-type structure that you may want to store with a back reference to the user and process later.
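As a rough illustration, here is what those two collections might look like with PyMongo; the database, collection and field names are placeholders, not anything prescribed:

```python
from datetime import datetime, timezone
from pymongo import MongoClient

db = MongoClient()["profiler"]      # hypothetical database name
user_id = "user-123"

# Users: one small, bounded document per user.
db.users.update_one(
    {"_id": user_id},
    {
        "$inc": {"reputation": 5, "features.bulk_export": 1},  # counters stay tiny
        "$addToSet": {"interests": "mongodb"},                  # tags the user frequents
    },
    upsert=True,
)

# User interactions: one document per event, back-referencing the user,
# to be aggregated or processed later.
db.user_interactions.insert_one({
    "user_id": user_id,
    "type": "search",
    "query": "red shoes",
    "ts": datetime.now(timezone.utc),
})
```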
Also check out Apache Kafka -- It's a distributed queuing technology that LinkedIn uses to do something similar to what you are describing.
I know this question has been asked before at PostgreSQL to Data-Warehouse: Best approach for near-real-time ETL / extraction of data.
But I want to rephrase this question.
I am attempting a real-time data warehouse. The difference between real-time and near real-time is huge.
I find a real-time data warehouse to be event-driven and transactional in approach, while a near real-time one would do the same thing as a batch application but poll the data more frequently. That polling would put so much extra load on the production server that it would certainly kill the production system: being a batch approach, it would scan through all the tables for changes and take the rows which have changed since a cut-off timestamp.
By event-driven I mean being specific to the tables which have undergone changes, and focusing only on the transactions which are happening right now.
But the source system is an elephant of a system, SAP, which has something like 25,000 tables. It is not easy to model that, and not easy to write database triggers on each table to capture every change. I want the impact on the production server to be minimal.
Is there any database-level trigger with which I could capture all changes happening in the database in one place? Is there any way to put that trigger on a different database server so that the production server goes untouched?
I have not been keeping pace with changes in database technology, and I am sure some nice new technologies have come along to capture these changes easily.
I know of log miners and change data capture, but it would be difficult to filter out the information I need from the redo logs.
Are there alternative ways to capture database write operations as they happen?
Just for completeness' sake, let us assume the databases are a heterogeneous mix of Oracle, SQL Server and DB2. But my concern is the concepts we want to develop.
This is a universal problem; every company is looking for an easy-to-implement solution, so a good discussion would benefit all.
Don't ever try to access SAP directly. Use the APIs of SAP Data Services (http://help.sap.com/bods). Look for the words "Integrator Guide" on that page for documentation.
This document should give you a good hint about where to look for your data sources (http://wiki.scn.sap.com/wiki/display/EIM/What+Extractors+to+use). Extractors are kind-of-somewhat like views in a DBMS; they abstract all the SAP internals into something human readable.
As far as near-real-time, think in terms of micro-batches. Run your extract jobs every 5 (?) minutes, or longer if necessary.
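This is not the SAP Data Services API, but just to make the micro-batch idea concrete, here is a sketch of a timestamp-watermark extract loop against a generic SQL source; the table and column names (source_orders, dw_orders, changed_at) are invented for the example.

```python
import sqlite3
import time

# Stand-ins for the real connections; in practice the source would be a
# replica or extractor endpoint, never the production system itself.
source = sqlite3.connect("source_replica.db")
warehouse = sqlite3.connect("warehouse.db")
source.execute("CREATE TABLE IF NOT EXISTS source_orders (id INTEGER PRIMARY KEY, payload TEXT, changed_at TEXT)")
warehouse.execute("CREATE TABLE IF NOT EXISTS dw_orders (id INTEGER PRIMARY KEY, payload TEXT, changed_at TEXT)")

def run_micro_batch(watermark: str) -> str:
    """Pull only the rows changed since the last run; return the new watermark."""
    rows = source.execute(
        "SELECT id, payload, changed_at FROM source_orders "
        "WHERE changed_at > ? ORDER BY changed_at",
        (watermark,),
    ).fetchall()
    with warehouse:                                   # one transaction per batch
        warehouse.executemany(
            "INSERT OR REPLACE INTO dw_orders (id, payload, changed_at) VALUES (?, ?, ?)",
            rows,
        )
    return rows[-1][2] if rows else watermark

watermark = "1970-01-01T00:00:00"
while True:
    watermark = run_micro_batch(watermark)
    time.sleep(300)                                   # every 5 minutes, or longer
```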
Check out the Lambda Architecture from Nathan Marz (I provide no link, but you'll find the resources easily). The reference implementation is all Java and NoSQL, but the main ideas are applicable to classical relational databases as well. In a nutshell: you have two implementations, one real-time but responsible for only a limited time interval, while the "long tail" is maintained with a classical, best-practice batch implementation.
The real-time part is always discarded after the batch refresh, effectively preventing problems in the real-time processing from propagating into the historical data.
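A toy sketch of the serving side of that idea (all names invented): queries merge the batch view with the short real-time tail, and the tail is thrown away whenever the batch view catches up.

```python
from datetime import datetime, timezone

# Toy in-memory stand-ins: the batch view would really live in the warehouse,
# the real-time tail in a fast store fed by the event stream.
batch_view = {"total_sales": 0.0,
              "high_water": datetime.min.replace(tzinfo=timezone.utc)}
realtime_events: list[tuple[datetime, float]] = []   # (timestamp, amount) since last batch

def query_total_sales() -> float:
    """Serve queries by merging the batch view with the short real-time tail."""
    tail = sum(amount for ts, amount in realtime_events if ts > batch_view["high_water"])
    return batch_view["total_sales"] + tail

def on_batch_refresh(new_total: float, high_water: datetime) -> None:
    """After the batch recomputes everything up to high_water, drop the tail."""
    batch_view.update(total_sales=new_total, high_water=high_water)
    realtime_events.clear()      # real-time glitches never survive into history
```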
As of now I can see only two solutions:
Write services on the source systems. If the source is COBOL, wrap it in services. Put all the services on a service bus and somehow trap changes as they happen to the database; how that trap would work still needs to be explored. But from the outset it appears to be a very expensive and uncertain proposition, and convincing management to accept a three-year lead time would be difficult. Services are not easy.
Log shippers: this is a trusted database solution. Logs would be available on another server, so the production server need not be burdened, and there are a good number of tools available. But the spirit does not match: the event-driven aspect is missing, so the action is not captured at the moment things happen. I will settle for this.
As Ron pointed out, NEVER TOUCH SAP TABLES DIRECTLY. There are adapters upon adapters to access SAP tables; this builds another layer in between, but it is unavoidable. One piece of good news I want to share: a customer did a study of SAP tables and found that only 14% of the tables are actually populated or touched by the SAP system. Even then, 14% of 25,000 tables still comes to a huge data model of 2,000+ entities. Again, micro-batches amount to dividing the system into Purchasing, Receivables, Payables etc., which heads towards a data mart and not an EDW. I want to focus on an Enterprise Data Warehouse.
Event sourcing and CQRS are great because they save developers from being stuck with one pre-modelled database which they have to work with for the lifetime of the application, unless there is a big data-migration project. CQRS and ES also have other benefits, like a scalable event store, an audit log, etc., that are already covered all over the internet.
But what are the disadvantages?
Here are some disadvantages that I can think of after researching the topic and writing small demo apps:
Complex: Some people say ES is complex. But I'd say having a complex application is better than a complex database model on which you can only run very restricted queries using a query language (multiple joins, indexes, etc.). I mean, some programming languages like Scala have very rich collection libraries that are flexible enough to produce some seriously complex aggregations, and there is also Apache Spark, which makes it easy to query distributed collections. But databases will always be restricted to their query language capabilities, and distributing databases is harder than distributing application code (just deploy another instance on another machine!).
High disk space usage: The event store might end up using a lot of disk space to store events. But we could schedule a clean-up every few weeks, create snapshots, and maybe store historical events on an external drive in case we need the old events in the future?
High memory usage: The state of every domain object is stored in memory, which might increase RAM usage, and we all know how expensive RAM is. BIG PROBLEM!! because I'm poor! Any solution to this? Maybe use SQLite instead of storing state in memory? Am I making things more complex by introducing multiple SQLite instances into my application?
Longer boot-up time: On failure or software upgrade, boot-up is slow, depending on the number of events. But we can use snapshots to solve this?
Eventual consistency: A problem for some applications. Imagine if Facebook used event sourcing with CQRS for storing posts; considering how busy Facebook's system is, if I wrote a post I might only see it the next day :)
Serialized events in the event store: Event stores keep events as serialized objects, which means we can't query the content of events in the event store (which is discouraged anyway), and we won't be able to add another attribute to the event in the future. Would the solution be to store events as JSON objects instead of serialized ones? But is that a good idea? Or should we add new event types to support changes to the original event object?
Can someone please comment on the disadvantages I brought up here, correct me if I am wrong, and suggest any others I may have missed?
Here is my take on this.
CQRS + ES can make things a lot simpler in complex software systems by giving you rich domain objects, simple data models, history tracking, more visibility into concurrency problems, scalability and much more. It does require a different way of thinking about systems, so it can be difficult to find qualified developers. But CQRS makes it simpler to separate responsibilities across developers. For example, a junior developer can work purely on the read side without having to touch the business logic.
Copies of data will require more disk space, for sure. But storage is relatively cheap these days. It may require the IT support team to do more backups and to plan how to restore the system in case things go wrong; however, server virtualization these days makes that a more streamlined workflow. Also, it is much easier to create redundancy in the system without a monolithic database.
I do not consider higher memory usage a problem. Business object hydration should be done on demand. Objects should not keep references to events that have already been persisted, and event hydration should happen only when persisting data. On the read side you do not have the Entity -> DTO -> ViewModel conversions that usually happen in tiered systems, and you would not have any kind of object change tracking that full-featured ORMs usually do. Most systems perform significantly more reads than writes.
Longer boot up time can be a slight problem if you are using multiple heterogeneous databases due to initialization of various data contexts. However, if you are using something simple like ADO .NET to interact with the event store and a micro-ORM for the read side, the system will "cold start" faster than any full featured ORM. The important thing here is not to over-complicate how you access the data. That is actually a problem CQRS is supposed to solve. And as I said before, the read side should be modeled for the views and not have any overhead of re-mapping data.
Two-phase commit can work well for systems that do not need to scale for thousands of users in my experience. You would need to choose databases that would work well with the distributed transaction coordinator. PostgreSQL can work well for read and write separate models, for example. If the system needs to scale for a high number of concurrent users, it would have to be designed with eventual consistency in mind. There are cases where you would have aggregate roots or context boundaries that do not use CQRS to avoid eventual consistency. It makes sense for non-collaborative parts of the domain.
You can query events in a serialized format like JSON or XML if you choose the right database for the event store, and that should only be done for analytics. Nothing inside the system should query the event store by anything other than the aggregate root id and the event type. That data would be indexed and live outside the serialized event.
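To illustrate that last point, here is a minimal sketch of an event store table where only the aggregate id, version and event type are queryable columns while the payload stays opaque. It uses SQLite from Python, and the table layout and names are invented for the example.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE events (
        aggregate_id TEXT NOT NULL,
        version      INTEGER NOT NULL,   -- per-aggregate sequence number
        event_type   TEXT NOT NULL,
        payload      TEXT NOT NULL,      -- serialized body, opaque to the store
        PRIMARY KEY (aggregate_id, version)
    )
""")

def append_event(aggregate_id: str, version: int, event_type: str, body: dict) -> None:
    # The primary key makes a concurrent writer using the same expected version fail.
    db.execute(
        "INSERT INTO events VALUES (?, ?, ?, ?)",
        (aggregate_id, version, event_type, json.dumps(body)),
    )

def load_stream(aggregate_id: str) -> list[dict]:
    # The only query the system itself needs: by aggregate id, in version order.
    rows = db.execute(
        "SELECT event_type, payload FROM events WHERE aggregate_id = ? ORDER BY version",
        (aggregate_id,),
    )
    return [{"type": t, **json.loads(p)} for t, p in rows]

append_event("order-42", 1, "OrderPlaced", {"total": 99.0})
append_event("order-42", 2, "OrderPaid", {"amount": 99.0})
print(load_stream("order-42"))
```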
Just to comment on point 5. I've been told that Facebook does use ES with Eventual Consistency, which is why you can sometimes see a post disappear and reappear after you've posted it.
Usually the read-model your browser is accessing is located 'close' to you, but after you make a post the SPA switches over to a read-model that is close to your write-model. The close proximity between the write-model (events) and the read-model means you get to see your own post.
However, 15 minutes later your SPA switches back to the first, closer, read-model. If the event containing your post hasn't yet propagated to that read-model you'll see your own post disappear only to reappear sometime later.
I know it's been almost 3 years since this question was asked, but still this article may be useful for someone. Key points are
Scaling with snapshots
Visibility of data
Schema changing
Dealing with complex domains
Need to explain it to most new team members
Event sourcing and CQRS are great because they save developers from being stuck with one pre-modeled database which the developer has to work with for the lifetime of the application, unless there is a big data migration project.
This is a big misconception. Relational databases were invented exactly for the evolution of the model (thanks to simple two-dimensional tables, as opposed to pre-defined hierarchical structures). With views and procedures ensuring the encapsulation of data access, the logical and physical models can evolve independently. This is also why SQL defines DDL and DML in the same language. Some RDBMSs even allow those evolutions to be versioned and deployed online (continuous delivery), such as Oracle Edition-Based Redefinition.
Big-data structures, on the other hand, are predefined and can be read only with the code developed for that structure. That is fine when they are consumed immediately, but you will have a hard time reading them 10 years later without the exact version of the code and its language compiler or interpreter.
I hope I'm not too late to give an answer. In recent months I've done a lot of research on this topic, with the goal of implementing a production-grade solution for some parts of my architecture where ES can make sense.
Complex: Actually, it should not be complex; its mission is to be dead simple. How? By pushing all the complexity from business-logic code into infrastructure code. Data access should be done by frameworks that are not mature enough yet. There is still no clear winner in the ES/CQRS race, maybe because it is still a niche approach, so some teams roll their own solution while others adopt a ready-made technology such as Axon.
High disk space usage: I would say more; I would say *potentially unbounded* disk usage. But if you go towards ES, you also have very good reasons to tolerate this apparent drawback. Let's give some of them:
Audit logs: the datastore is an event log; we already know that. Financial apps, or any mission- or safety-critical system, may need a centralized audit log that makes it possible to state who did what at which moment. ES provides this capability out of the box... you can also decorate your event entries with business-meaningful metadata (e.g. a transaction id correlated with an API consumer identity, a severity level of the operation, ...).
High concurrency: there are systems where logical resource states are mutated by many clients concurrently: games, IoT platforms, and so on. Logging events instead of mutating a state representation can be a smart way to obtain a total order of events. The alternative is to delegate the synchronization work to the DB, but that is not what you want if you're into ES.
Analytics: let's say you have a lot of data with a lot of business value, but you don't yet know which. For years we extracted knowledge from application data by translating its organization into different information models (OLAP cubes). The event store again provides something similar out of the box: event logs are the rawest form of representation of information, and you have many ways to process them, in batch or by reacting to the stored events.
High memory usage: I think it should be about the same once you have built your projections.
Longer bootup time: If the read side caches its projections and "remembers" the last event it applied, it does not need to re-apply the entire event sequence. Snapshots are a mitigation, but if you need a lot of snapshots maybe you made a bad choice with ES. I think this problem is minor in microservice ecosystems, where the boot time can be masked without service interruption. In fact, you get the most out of ES/CQRS when you apply it to microservices. (A small sketch of a checkpointed projection follows after this list.)
Eventual consistency: Blame the CAP theorem for this, not ES. Many non-ES/CQRS systems have to deal with it, but there are a lot of scenarios where it is not a real problem; those are the scenarios where ES fits well. And you can mix ES and non-ES services in the same platform.
Serialized events in the event store: if it's important to have a non-serialized event representation, you could use a document-oriented DB, but if you do this to run queries over event payloads, you are missing the point of ES/CQRS. ES means moving all data manipulation from the DB side to the application tier, where every piece changes quickly and everything is stateless. This enhances scalability and fault tolerance, and gives you the means to shape the organization of your team, for example letting the frontend developer easily write their BFF in JavaScript.
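Picking up the boot-up point above, here is a toy sketch (all names invented) of a read-side projection that keeps a checkpoint and, on restart, replays only the events it has not yet applied:

```python
# A toy read-side projection with a persisted checkpoint: on restart it only
# replays events newer than the checkpoint instead of the whole history.
checkpoint = 0                         # would live in durable storage in reality
order_totals: dict[str, float] = {}    # the projection itself

def apply(event: dict) -> None:
    if event["type"] == "OrderPlaced":
        order_totals[event["order_id"]] = event["total"]

def catch_up(event_log: list[dict]) -> None:
    global checkpoint
    for event in event_log[checkpoint:]:   # skip everything already applied
        apply(event)
        checkpoint += 1

events = [
    {"type": "OrderPlaced", "order_id": "o1", "total": 10.0},
    {"type": "OrderPlaced", "order_id": "o2", "total": 25.0},
]
catch_up(events)          # initial build
events.append({"type": "OrderPlaced", "order_id": "o3", "total": 5.0})
catch_up(events)          # after a "restart": only the new event is applied
```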
I hope to put these principles into practice with good results and draw on the benefits of this exciting approach.
I will be constructing an e-commerce site and would like to use a NoSQL database, which will fit well with the plans for the app. But when it comes to which database would fit the job, I'm not sure. After comparing various DBs, the ones that seem best might be MongoDB, CouchDB, or even OrientDB. I have seen arguments for all of them to be used or not used compared to something like MySQL. But among themselves (the NoSQL databases), which one would fit well with an e-commerce solution?
Note: for this use case I won't be having thousands of transactions a second, or similarly high write rates. They will be moderate, sure, but at a level that any established database could handle.
CouchDB: has master-to-master replication, which I could really use; if not, I will still have to implement the same functionality in code anyway. I need to be able to have a user's database sync with the mothership (users will have their own, potentially localhost, database that can sync with the main domain's server). Couch is also fast once your queries have been stored in the DB, and I will probably have a somewhat higher need for read performance, though not by a lot.
MongoDB: queries are very easy and user-friendly. Also, given that end users may need to query for things I can't account for ahead of time, this seems like it may be a better fit: I don't have to pre-store my queries in the DB. It does support atomic writes, though only when writing to a single document at a time.
OrientDB: a graph database, much different from what most people are used to, but given my needs it could fit very well too. Orient has the benefit of being schemaless, as well as support for ACID transactions. There are a lot of customer and product relationships that a graph database could be great for. Orient also supports master-to-master replication, similar to CouchDB.
Don't get me wrong, I can see how to build this traditionally with something like MySQL, but the ease and simplicity of a NoSQL solution is very attractive. Also, in my case, a schemaless design would be much easier in NoSQL than in MySQL: a given product may have more or fewer attributes than another, and avoiding altering a table whenever a new field is added is preferable.
So between these three (or even others you think may be better), what features in each could potentially work for or against me with regard to an e-commerce site, when dealing with customer transactions?
Edit: The reason I am not using an existing solution is that, with the integrated features I need, there are no solutions available out there. We are also aiming to use this as a full product for our company. There will be a handful of other integrations beyond just sales, and it is also going to work with a store's POS system.
Since e-commerce can encompass everything from shopping carts through to membership and recurring subscriptions, it is hard to guess exactly what requirements and complexity you are envisioning.
When constructing an e-commerce site, one of the early considerations should be investigating whether there is already an established e-commerce product or toolkit that could meet your requirements. There are many subtleties to processes like ordering, invoicing, payments, products, and customer relationships even when your use case appears to be straightforward. It may also be possible to separate your application into the catalog management aspects (possibly more custom) versus the billing (potentially third party, perhaps even via a hosted billing/payment API).
Another consideration should be who you are developing the e-commerce site for: is this to scratch your own itch, or for a client? Time, budget, and features for a custom build can be difficult to estimate and schedule .. and a niche choice of technology may make it difficult to find/hire additional development expertise.
A third consideration is what your language(s) of choice are for developing your application. Some languages will have more complete/mature/documented drivers and/or framework abstractions for the different databases.
That said, writing an e-commerce system appears to be a rite of passage for many developers ;-).
Edit: a lot has changed since this answer was originally posted in 2012 and you should definitely refer to current product information. For example, MongoDB has had support for Decimal128 values since MongoDB 3.4 (2016) and multi-document transactions since MongoDB 3.6 (2017).
Check the comparison of the different NoSQL databases available here, and pick whichever suits your requirements.
MongoDB 4 now supports multi-document ACID transactions! That makes it suitable for e-commerce!
Check out: https://www.mongodb.com/transactions
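As a rough sketch of what that looks like in practice with PyMongo (collection and field names are placeholders, and transactions require a replica set or sharded cluster):

```python
from pymongo import MongoClient

client = MongoClient()                 # must point at a replica set / mongos
db = client["shop"]

def place_order(user_id: str, item_id: str, price: float) -> None:
    # Decrement stock and record the order atomically: either both writes
    # become visible or neither does.
    with client.start_session() as session:
        with session.start_transaction():
            result = db.inventory.update_one(
                {"_id": item_id, "stock": {"$gte": 1}},
                {"$inc": {"stock": -1}},
                session=session,
            )
            if result.modified_count == 0:
                raise ValueError("out of stock")     # raising aborts the transaction
            db.orders.insert_one(
                {"user_id": user_id, "item_id": item_id, "price": price},
                session=session,
            )
```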