I am trying to figure out which is the best option for storing individual user logging information and general meta profiling data for each user on our system.
The original idea was to have a "profiler" collection and each document would represent a user. The problem with this design is that a power user could rack up so much meta data and history over the course of a year or less that it exceeds the document size limit. It also would force the documents to have deeper and more complex structures, which could result in slower queries.
The alternative design is to create a collection for each user, where each document holds a specific type of profiling or history data. There are several benefits to this, namely speed. Yet it also presents querying challenges when we need to run comparisons against other users (solvable through other tracking DBs). I can't find a definitive answer to the question of how many collections a single Mongo database can contain.
If it can handle millions upon millions of collections per database, then fantastic; otherwise I need to find better options for modeling this data. Am I going about this the right way?
The goal is to maintain a history of a user's interactions, reputation tracking, their interests over time, features they use regularly etc. which can allow for a more rich experience.
Create 2 collections: Users & User interactions.
There are certain things that make complete sense to store inside a User's document:
Reputation tracking
Interests -- common tags (similar to stack overflow) that a user frequents
Features -- this should be a finite list of items. You could key them by name and $inc them as they are used
User interactions on the other hand is more of a log type structure that you may want to store with a back reference and process later.
Also check out Apache Kafka -- It's a distributed queuing technology that LinkedIn uses to do something similar to what you are describing.
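A minimal sketch of the two-collection layout described above, assuming hypothetical field names (`features`, `kind`, `payload`, `at`) that are not from the original answer. The functions only build the filter, update and log documents, so the shape of the data is visible without a running server.

```python
from datetime import datetime, timezone


def feature_usage_update(user_id, feature):
    """Build the (filter, update) pair that bumps one feature counter
    inside the user's own document."""
    return {"_id": user_id}, {"$inc": {f"features.{feature}": 1}}


def interaction_document(user_id, kind, payload, when=None):
    """Build one log entry for the separate interactions collection.
    user_id is the back reference to the Users collection."""
    return {
        "user_id": user_id,
        "kind": kind,
        "payload": payload,
        "at": when or datetime.now(timezone.utc),
    }


flt, upd = feature_usage_update("u42", "search")
entry = interaction_document("u42", "page_view", {"path": "/home"})
print(flt, upd)
print(entry["user_id"], entry["kind"])
```

With pymongo these would typically be passed to `users.update_one(flt, upd, upsert=True)` and `interactions.insert_one(entry)`. The user document stays small because it only holds counters, while the log grows in its own collection.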
So, I don't have a specific issue here other than a general lack of knowledge, and I am hoping that the big brains on here can give me a nudge in the right direction or maybe refer me to an online resource that could help...here is the general problem I am trying to solve.
I have a mongo database that holds a handful of collections where I store data retrieved from some software that we use for our day to day operations. We are grabbing this data from the API and storing it in Mongo to build up a historical source of data (the API is limited in the timeframe of data that can be retrieved.)
The window for the historical data from the API is 7 days.
The data in question has a unique id for each item we pull from the API, so this allows us to grab a record, store it, and modify it as required if it changes over time. This has been working just fine for our needs, until we started to notice a few discrepancies between the data we stored in Mongo and what we would get out of our software when we ran "reports." After digging into it, it turns out there are a few edge cases where a record is deleted from the software, but if we had already grabbed that record through the API, it remains in our Mongo database.
I'm looking for any advice on how to handle this situation. Ideally, I suppose, we would like to remove these deleted records from our MongoDB in order to match what is in the software...but I'm having trouble dreaming up the process to make this happen. Apparently this is one of many gaping holes in my entirely self-taught knowledge of this stuff.
Thanks for any assistance.
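One possible way to approach this, offered only as a sketch: on each sync, pull the whole 7-day window again and compare ids. Anything stored locally that falls inside the window but is missing from the fresh pull was presumably deleted upstream. The function name and the assumption that each stored record carries the timestamp the API filters on are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def find_deleted_ids(stored, api_ids, now, window=timedelta(days=7)):
    """Return ids that we stored, that still fall inside the API window,
    but that the API no longer returns.

    stored  -- mapping of record id -> record timestamp (our copy)
    api_ids -- set of ids returned by a fresh pull of the whole window
    """
    cutoff = now - window
    in_window = {rid for rid, ts in stored.items() if ts >= cutoff}
    return in_window - set(api_ids)


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
stored = {
    "a": now - timedelta(days=1),   # still in the API
    "b": now - timedelta(days=2),   # deleted upstream
    "c": now - timedelta(days=30),  # too old to verify, left alone
}
deleted = find_deleted_ids(stored, {"a"}, now)
print(deleted)
```

Records older than the window cannot be verified this way, so they are left untouched. Flagging the returned ids as deleted, rather than removing them outright, may be the safer choice if the history matters.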
I wish to design a schema for a chat server. The schema needs to support delivery and reading of messages. Each message needs to have an option of being a private or group message.
I was trying to work out where the data about whether a message has been read and delivered should be stored.
In a relational database this could be set in another table. In MongoDB I could set this either in the user or the actual message json document.
If the message isn't for a specific user but is a broadcast message, then I presume it would be better to store the IDs of the users that have seen it as part of the JSON document of the message.
Does anyone know of some good example schemas that are available? I don't fully understand the best way of attacking this issue.
(Too long for a comment. And it kinda answers the question)
Yeah, it's a challenging design. Also it's something we can't do for you, I'm afraid, because we don't know all your requirements, you do. However you design it, you should respect the usual mongodb guidelines. Unfortunately, they conflict with each other:
Don't put too much stuff into one document.
In the classic blog schema exercise, one might be tempted to embed comments into the post document, each comment embedding its user too. This can easily lead to overflowing mongodb's max document size. Also it leads to write contention. Doesn't matter much for MMAPv1 engine, but matters for WiredTiger engine (which has document-level locking).
Do not build overly normalized schemas.
Normalized schemas are encouraged in relational databases. In mongodb they're useless (because of the lack of joins). What you need to do is careful duplication of some data. For example, in blog/comments example, one might embed author's id/email in a comment, but not the rest of author's data (sign up date, membership status, etc.)
When I decide a place or shape of the data, I generally ask myself these two questions:
How am I going to query this?
Isn't this too much duplication?
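As an illustration of applying those two questions to the chat case, here is a sketch that keeps delivery and read state on the message document itself. All field names (`recipient_ids`, `delivered_to`, `read_by`) are assumptions, and this is only one of several reasonable shapes.

```python
def new_message(message_id, sender_id, recipient_ids, text):
    """One document per message; a private message simply has a single
    recipient. Delivery/read state lives on the message itself."""
    return {
        "_id": message_id,
        "sender_id": sender_id,
        "recipient_ids": list(recipient_ids),
        "text": text,
        "delivered_to": [],
        "read_by": [],
    }


def mark_update(message_id, user_id, field):
    """Build the (filter, update) pair recording that user_id has
    received ("delivered_to") or read ("read_by") the message."""
    if field not in ("delivered_to", "read_by"):
        raise ValueError(f"unknown state field: {field}")
    # The filter only matches if the user is an actual recipient.
    flt = {"_id": message_id, "recipient_ids": user_id}
    return flt, {"$addToSet": {field: user_id}}


def unread_filter(user_id):
    """Filter for messages addressed to user_id and not yet read."""
    return {"recipient_ids": user_id, "read_by": {"$ne": user_id}}


msg = new_message("m1", "alice", ["bob", "carol"], "hi")
flt, upd = mark_update("m1", "bob", "read_by")
print(msg)
print(flt, upd)
print(unread_filter("bob"))
```

This answers "how am I going to query this?" with a single filter per user. For very large groups the `read_by` array could grow a lot, which runs into the first guideline, so a separate receipts collection may fit better there.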
Event sourcing and CQRS are great because they rid developers of being stuck with one pre-modelled database which the developer has to work with for the lifetime of the application unless there is a big data migration project. CQRS and ES also have other benefits, like scaling the event store, audit logs, etc., that are already covered all over the internet.
But what are the disadvantages ?
Here are some disadvantages that I can think of after researching and writing small demo apps
Complex: Some people say ES is complex. But I'd say having a complex application is better than a complex database model on which you can only run very restricted queries using a query language (multiple joins, indexes etc). I mean, some programming languages like Scala have a very rich collection library that is flexible enough to produce some seriously complex aggregations, and there is also Apache Spark, which makes it easy to query distributed collections. But databases will always be restricted to their query language capabilities, and distributing databases is harder than distributing application code (just deploy another instance on another machine!).
High disk space usage: The event store might end up using a lot of disk space to store events. But we can schedule a clean-up every few weeks and create a snapshot, and maybe we can store historical events locally on an external HD just in case we need old events in the future?
High memory usage: The state of every domain object is stored in memory, which might increase RAM usage, and we all know how expensive RAM is. BIG PROBLEM!! because I'm poor! Any solution to this? Maybe use SQLite instead of storing state in memory? Am I making things more complex by introducing multiple SQLite instances in my application?
Longer bootup time: On failure or software upgrade bootup is slow depending on the number of events. But we can use snapshots to solve this ?
Eventual consistency: A problem for some applications. Imagine if Facebook used event sourcing with CQRS for storing posts: considering how busy Facebook's system is, if I posted something I might not see my own post until the next day :)
Serialized events in the event store: Event stores store events as serialized objects, which means we can't query the content of events in the event store (which is discouraged anyway). And we won't be able to add another attribute to the event in the future. Would the solution be to store events as JSON objects instead of serialized events? Is that a good idea? Or should we add more events to support the change to the original event object?
Can someone please comment on the disadvantages I brought up here and correct me if I am wrong and suggest any other I may have missed out ?
Here is my take on this.
CQRS + ES can make things a lot simpler in complex software systems by having rich domain objects, simple data models, history tracking, more visibility into concurrency problems, scalability and much more. It does require a different way of thinking about systems, so it could be difficult to find qualified developers. But CQRS makes it simpler to separate responsibilities across developers. For example, a junior developer can work purely with the read side without having to touch business logic.
Copies of data will require more disk space for sure. But storage is relatively cheap these days. It may require the IT support team to do more backups and to plan how to restore the system in case things go wrong. However, server virtualization these days makes that a more streamlined workflow. Also, it is much easier to create redundancy in the system without a monolithic database.
I do not consider higher memory usage a problem. Business object hydration should be done on demand. Objects should not keep references to events that have already been persisted. And event hydration should happen only when persisting data. On the read side you do not have Entity -> DTO -> ViewModel conversions that usually happened in tiered systems, and you would not have any kind of object change tracking that full featured ORMs usually do. Most systems perform significantly more reads than writes.
Longer boot up time can be a slight problem if you are using multiple heterogeneous databases due to initialization of various data contexts. However, if you are using something simple like ADO .NET to interact with the event store and a micro-ORM for the read side, the system will "cold start" faster than any full featured ORM. The important thing here is not to over-complicate how you access the data. That is actually a problem CQRS is supposed to solve. And as I said before, the read side should be modeled for the views and not have any overhead of re-mapping data.
Two-phase commit can work well for systems that do not need to scale for thousands of users in my experience. You would need to choose databases that would work well with the distributed transaction coordinator. PostgreSQL can work well for read and write separate models, for example. If the system needs to scale for a high number of concurrent users, it would have to be designed with eventual consistency in mind. There are cases where you would have aggregate roots or context boundaries that do not use CQRS to avoid eventual consistency. It makes sense for non-collaborative parts of the domain.
You can query events in a serialized format like JSON or XML, if you choose the right database for the event store. And that should only be done for purposes of analytics. Nothing inside the system should query the event store by anything other than the aggregate root id and the event type. That data would be indexed and live outside the serialized event.
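To make the last point concrete, here is a small sketch using Python's built-in `sqlite3` as a stand-in event store. The table layout, column names and the sample account events are assumptions for illustration. The aggregate id and event type sit in indexed columns, and the event body stays an opaque JSON string.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE events (
        aggregate_id TEXT NOT NULL,
        seq          INTEGER NOT NULL,
        event_type   TEXT NOT NULL,
        payload      TEXT NOT NULL,      -- serialized event body (JSON)
        PRIMARY KEY (aggregate_id, seq)
    )
    """
)
conn.execute("CREATE INDEX idx_events_type ON events (aggregate_id, event_type)")


def append(aggregate_id, seq, event_type, body):
    """Store one event; the body is serialized and never inspected by SQL."""
    conn.execute(
        "INSERT INTO events VALUES (?, ?, ?, ?)",
        (aggregate_id, seq, event_type, json.dumps(body)),
    )


def load(aggregate_id, event_type=None):
    """Read events by aggregate id, optionally narrowed by event type."""
    sql = "SELECT event_type, payload FROM events WHERE aggregate_id = ?"
    args = [aggregate_id]
    if event_type is not None:
        sql += " AND event_type = ?"
        args.append(event_type)
    sql += " ORDER BY seq"
    return [(t, json.loads(p)) for t, p in conn.execute(sql, args)]


append("account-1", 1, "Opened", {"owner": "ann"})
append("account-1", 2, "Deposited", {"amount": 50})
append("account-2", 1, "Opened", {"owner": "bob"})

print(load("account-1"))
print(load("account-1", "Deposited"))
```

The composite primary key also rejects a second event with the same sequence number for an aggregate, which is one simple way to surface concurrent writes.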
Just to comment on point 5. I've been told that Facebook does use ES with Eventual Consistency, which is why you can sometimes see a post disappear and reappear after you've posted it.
Usually the read-model your browser is accessing is located 'close' to you, but after you make a post the SPA switches over to a read-model that is close to your write-model. The close proximity between the write-model (events) and the read-model mean you get to see your own post.
However, 15 minutes later your SPA switches back to the first, closer, read-model. If the event containing your post hasn't yet propagated to that read-model you'll see your own post disappear only to reappear sometime later.
I know it's been almost 3 years since this question was asked, but still this article may be useful for someone. Key points are
Scaling with snapshots
Visibility of data
Schema changing
Dealing with complex domains
Need to explain it to most new team members
Event sourcing and CQRS are great because they rid developers of being stuck with one pre-modeled database which the developer has to work with for the lifetime of the application unless there is a big data migration project.
This is a big misconception. Relational databases were invented exactly for the evolution of the model (thanks to simple two-dimensional tables as opposed to pre-defined hierarchical structures). With views and procedures ensuring the encapsulation of data access, the logical and physical model can evolve independently. This is also why SQL defines DDL and DML in the same language. Some RDBMSs also allow all those evolutions to be versioned and deployed online (continuous delivery), such as Oracle's Edition-Based Redefinition.
Big data structures are predefined and can be read only with the code developed for that structure. That is OK when the data is consumed immediately, but you will have a hard time reading it 10 years later without the exact version of the code and the language compiler or interpreter.
I hope I am not too late to try to give an answer. Over the past months I've done a lot of research on this topic, with the goal of implementing a production-grade solution for some parts of my architecture where ES can make sense.
Complex: Actually, it should not be complex; its mission is to be dead simple. How? By pushing all the complexity from business logic code to infrastructure code. The data access should be done by frameworks, which are not mature enough yet. There is still no clear winner in the ES/CQRS race, maybe because it is still a niche/hipster approach (?). So some teams roll their own solution, and others adopt a ready-made technology such as Axon.
High disk space usage: I would go further and say potentially infinite disk usage. But if you go towards ES, you also have very good reasons to tolerate this apparent drawback. Here are some of them:
Audit logs: The datastore is an event log, as we already know. Financial apps, or any mission/safety-critical app, could need a centralized audit log that makes it possible to state who did what at which moment. ES provides this capability out of the box...you can also decorate your event entries with some business-meaningful metadata (e.g. a transaction id correlated with some API consumer identity, a severity level of the operation...)
High concurrency: There are systems where logical resource states are mutated by many clients concurrently. These are games, IoT platforms, and so on. Logging events instead of changing a state representation could be a smart way to provide a total order of events. The other way is to delegate the synchronization to the DB. But this is not what you want if you're into ES.
Analytics: Let's say you have a lot of data with a lot of business value, but you still don't know which parts are valuable. For years we extracted knowledge from application data by reorganizing it into different information models (OLAP cubes). The event store, again, provides something similar out of the box. Event logs are the rawest form of representation of information, and you can process them in many ways, in batch or by reacting to events as they are stored.
High memory usage: I think it should be the same once you have built your projection
Longer bootup time: If the read side caches its projections and "remembers" the last update event, it should not need to re-apply the entire event sequence. Snapshots are a mitigation, but if you take a lot of snapshots, maybe you made a bad choice with ES. I think that this problem is minor in microservices ecosystems, where the boot time can be masked without service interruption. In fact, you get the most out of ES/CQRS when you apply it to microservices.
Eventual consistency: Blame CAP theorem for this, not ES. Many non ES/CQRS have to deal with this, but there are a lot of scenarios where it is not a real problem. These are the scenarios where ES fits well. And you can mix ES and non ES services into the same platform
Serialized events in the event store: If it's important to have a non-serialized event representation, you could use a document-oriented DB, but if you do this to run queries over event payloads, you are missing the point of ES/CQRS. ES means moving all data manipulation from the DB side to the application tier, where every piece can change quickly and everything is stateless. This enhances scalability and fault tolerance and provides the means to shape the organization of your team, for example by letting the frontend developer easily write his/her BFF in JavaScript.
I hope to put these principles into practice with good results and draw on the benefits of this exciting approach.
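A minimal sketch of the boot-time point above: a projection that remembers the sequence number of the last event it applied only needs to replay the tail of the log after a restart. The `BalanceProjection` class and the deposit/withdraw events are hypothetical examples, not part of any particular framework.

```python
from dataclasses import dataclass


@dataclass
class BalanceProjection:
    balance: int = 0
    last_seq: int = 0   # sequence number of the last event applied

    def apply(self, seq, event):
        kind, amount = event
        if kind == "deposited":
            self.balance += amount
        elif kind == "withdrawn":
            self.balance -= amount
        self.last_seq = seq


def catch_up(projection, log):
    """Apply only the events the projection has not seen yet.
    log is a list of events; event n has sequence number n (1-based).
    Returns how many events were applied."""
    applied = 0
    for seq in range(projection.last_seq + 1, len(log) + 1):
        projection.apply(seq, log[seq - 1])
        applied += 1
    return applied


log = [("deposited", 100), ("withdrawn", 30), ("deposited", 5)]

# Cold start: nothing cached, so the whole log is replayed.
cold = BalanceProjection()
print(catch_up(cold, log), cold.balance)

# After a restart, a persisted projection resumes from its checkpoint.
warm = BalanceProjection(balance=70, last_seq=2)
print(catch_up(warm, log), warm.balance)
```

Both projections end up with the same balance, but the warm one applies a single event instead of three. Persisting `balance` and `last_seq` together is what makes the checkpoint trustworthy.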
I am working on a product with which a user can create his/her mobile site. As this is a mobile site creation platform, lots of sites have been created in the application. I need to keep all the visitor data in the database so that the product can show each user the analytics for his/her site.
When there were fewer sites, everything worked fine. But now the data is growing fast, as there are lots of requests on the server. I use Mongo as the NoSQL DBMS to keep all the data. In a collection named "analytics", I insert a row with the site id for each visit so that it can be shown to the user. Because the data is large, showing user analytics is slow. Disk usage is also growing steadily.
What would be the best way to model this type of BIG data?
Should I create a collection per site and store each site's data separately?
Should I also split collections by date?
What should the cleanup procedure for the data be? What are the best practices adopted by other leaders in the industry?
Please help
I would strongly suggest reading through MongoDB Optimization strategies at http://docs.mongodb.org/manual/administration/optimization/ . You will find various ways to identify slow performing queries / ops and suggestions to improve them at the mentioned page. Hopefully that should help you resolve slow queries / performance issues.
If you haven't already seen, I would also suggest taking a look at various use cases at http://docs.mongodb.org/ecosystem/use-cases/ , how they are modeled for those scenarios and if there is any that resembles what you are trying to achieve.
After following through the optimization strategies and making appropriate changes, if you still have performance issues, I would suggest posting following information for further suggestions:
What is your current state in terms of performance and what is the planned target state?
How does your system look i.e. hardware / software characteristics?
Once you have needed performance characteristics, following questions may help you achieve your targets:
What are the common query patterns and which ones are slow?
Potentially look for adding indexes that can enhance query performance
Potentially look for schema refactoring based on access patterns
Potentially look for schema refactoring for rolling-up / aggregating analytics data based on how it will be used.
Are writes also slow and is that a concern as well?
Potentially plan for Sharding which would provide write as well as read scaling. Sharding is entirely a topic in itself and I would suggest reading about it at http://docs.mongodb.org/manual/sharding/
How big is the data, and how is it growing or expected to grow?
Potentially this would give further insights into what could be suggested
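To illustrate the rolling-up suggestion above, here is a sketch of one pre-aggregation pattern: instead of one document per visit, each page view increments counters in a per-site, per-day bucket. The `_id` format and field names (`total`, `hours`) are assumptions, and the function only builds the update so it can be inspected without a server.

```python
from datetime import datetime, timezone


def hourly_rollup_update(site_id, when):
    """Build the (filter, update) pair that counts one page view in a
    per-site, per-day bucket document with one counter per hour."""
    day = when.strftime("%Y-%m-%d")
    flt = {"_id": f"{site_id}:{day}"}
    update = {
        # Written only when the bucket document is first created.
        "$setOnInsert": {"site_id": site_id, "day": day},
        # Applied on every page view.
        "$inc": {"total": 1, f"hours.{when.hour}": 1},
    }
    return flt, update


visit = datetime(2024, 3, 5, 14, 30, tzinfo=timezone.utc)
flt, upd = hourly_rollup_update("site-7", visit)
print(flt)
print(upd)
```

With pymongo this would be applied as `analytics.update_one(flt, upd, upsert=True)`. A dashboard then reads one small document per site per day, and old buckets can be dropped by date, which also addresses the disk growth in the question.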
I have a project I've been working on recently using MongoDB. Basically, the core element of my website is "projects". Each project will contain a nested "strings" object, which will contain a large amount of strings... So, each project is quite massive and can approach a megabyte or more. As such, I want to have the projects in their own collection.
The problem I'm having is that I'm not sure how to assign users to a project. Should a user contain a list of the projects they're enrolled in, or should the project contain a list of users? When should I choose one or the other? And is there a better way? (I haven't touched MongoDB in quite a while, so I'm a bit rusty)
I cannot immediately think of any serious stopping differences between the two. I suppose maintaining it user side might be easier for:
User unsubscribing from projects since you just modify that user row which will likely have only that atomic lock on it anyway, unlike the project row which multiple users could be trying to unsubscribe from at the same time. So doing it user side might make better updates and concurrency in general.
The previous point can apply to users subscribing to projects as well.
User deletions, you just delete the user row...simple
So considering these immediate things that come to mind I would probably choose to embed projects within the user.
Have a look at http://www.mongodb.org/display/DOCS/Schema+Design. The guideline that would apply in your case is
Generally, for "contains" relationships between entities, embedding should be chosen. Use linking when not using linking would result in duplication of data.
If a user can be enrolled in multiple projects, and a project can enroll multiple users, it seems best to have separate collections for Projects and Users.
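A sketch of that many-to-many layout, assuming the link is kept on the user side as a hypothetical `project_ids` array of references (the large project documents stay in their own collection and are never copied). The functions only build the filter and update documents.

```python
def enroll_update(user_id, project_id):
    """(filter, update) that adds a project reference to one user,
    without creating duplicates."""
    return {"_id": user_id}, {"$addToSet": {"project_ids": project_id}}


def unenroll_update(user_id, project_id):
    """(filter, update) that removes a project reference from one user."""
    return {"_id": user_id}, {"$pull": {"project_ids": project_id}}


def members_filter(project_id):
    """Filter matching every user enrolled in the given project."""
    return {"project_ids": project_id}


print(enroll_update("u1", "p9"))
print(unenroll_update("u1", "p9"))
print(members_filter("p9"))
```

Each enroll or unenroll touches only one user document, which matches the concurrency argument made earlier in the thread. An index on `project_ids` would probably be needed for `members_filter` to stay fast as the number of users grows.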