Meteor publishing and subscribing to a large Collection

So let's take this scenario: in an e-commerce application, a user searches for "wrist watches".
Is it advisable for me to publish and subscribe to the entire Products collection? That collection may grow a lot in size. Is it possible for me to fetch from a collection without subscribing to it?
Also, in Meteor 1.3, where is the best place to define collections? From what I read, it has to be in /imports/api, but some light on it would be helpful.
Thanks,

When you want to get data to your meteor client, you have three options - choose your own adventure.
option 1: publish the whole collection
pros: easy to implement, fast to use/filter on the client once the data has arrived, publication can be reused on the server for all clients
cons: doesn't scale well / doesn't work past a couple of thousand documents, may be a lot to transmit to the client
use when: you have a small size-bounded collection and the client needs all of it for filtering / searching / selecting
option 2: use a method
You can have a meteor method deliver the filtered documents to the client instead of publishing them. For example, when the user searches for "wrist watches", the method delivers only the matching documents. See this section of the guide for more details. You can stuff the documents into a local collection if you like, but it isn't required.
pros: performance, scalability, data isolation (you don't have to worry that some subset of the documents were added by another subscription)
cons: it's more work to set up and manage than a subscription
use when: you have an unbounded collection and you need a subset in the most performant way
option 3: use a reactive subscription
This is very similar to (2) except you'll be re-subscribing in an autorun after changing your search parameters. See this section for more details.
pros: easier to implement than (2)
cons: more computationally expensive and a bit slower than (2), with the possible exception that publications could be reused on the server (unlikely in the case of a search)
use when: you have an unbounded collection and you need a subset with the least amount of effort/code
Without knowing more about your particular use case, I'd go with (2).
As for where to define your collections, see this section and the todos app for examples. The recommendation is to use imports/api as you mentioned. For an explanation of why, see this question. If you need more detail, I'd recommend opening a separate question.
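As a rough model of option (2), here is a sketch of a server-side search that returns only the matching documents, capped at a limit. It is written in plain Python rather than Meteor's JavaScript so that it runs on its own; PRODUCTS and search_products are hypothetical names, and the substring matching is an assumption rather than how a real search would be done.

```python
# Hypothetical in-memory stand-in for the server-side Products collection.
PRODUCTS = [
    {"_id": 1, "name": "Steel wrist watch", "price": 120},
    {"_id": 2, "name": "Leather wallet", "price": 40},
    {"_id": 3, "name": "Sport wrist watch", "price": 80},
    {"_id": 4, "name": "Wall clock", "price": 25},
]


def search_products(query, limit=20):
    """Model of a server method: return only matching documents, capped."""
    terms = query.lower().split()
    matches = [
        product for product in PRODUCTS
        if all(term in product["name"].lower() for term in terms)
    ]
    # Only this slice would travel to the client, never the whole collection.
    return matches[:limit]


results = search_products("wrist watch")
```

The point of the design is that the amount of data sent depends on the query and the limit, not on the size of the collection.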

Generally speaking, we don't put all fetched data onto a page at once. That is too lengthy for customers in terms of user experience. The common advice is pagination plus sorting.
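A minimal sketch of pagination plus sorting, in plain Python over an in-memory list. The get_page helper and the sample products are hypothetical; against a real MongoDB collection the same idea is usually expressed with the cursor's sort, skip and limit.

```python
def get_page(docs, page, page_size, sort_key):
    """Return one sorted page (1-based page number) and the total page count."""
    ordered = sorted(docs, key=lambda doc: doc[sort_key])
    total_pages = (len(ordered) + page_size - 1) // page_size
    start = (page - 1) * page_size
    return ordered[start:start + page_size], total_pages


# Hypothetical sample data: 25 products priced from 100 down to 76.
products = [{"name": f"watch-{i}", "price": 100 - i} for i in range(25)]
page_items, total = get_page(products, page=2, page_size=10, sort_key="price")
```

Only page_items needs to reach the client for a given view; the page count is enough to render the navigation.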
As for Meteor, collections on the server are different from collections on the client. In short, a collection on the client is a subset of the server collection. The data in that subset is determined by Meteor's publication-subscription mechanism: data is published on the server and you subscribe to it on the client. This is how you derive the subset. Moreover, you can define filtering, sorting, count, etc. to shape the derived subset according to how it will be used on the client. The documentation contains a pretty decent guide and details about Meteor collections.
The place to define collections is really flexible in Meteor. It doesn't have to be /imports/api. It can be any location that is accessible to both the server and the client, because in the general case the server needs to see the data and define methods for manipulating the collection, and the client needs to see it as well for rendering data on web pages. But, as said, it is flexible and depends on how you implement and structure your application. The location can be accessible to both the server and the client, but it need not be. In some cases the collections are defined on the server only, and the client fetches the data indirectly. A Meteor method is one way to do that and a RESTful API is another, to name a few. It's case by case and you do what you feel is best. That is where the fun comes from. Subscription is common and convenient, but it is not the only option.
Meteor defines special rules for folder access on the server and the client respectively, and Meteor 1.3 adds new rules for modules. I enjoy reading the Meteor documentation and find it really useful; this one, for example, helps build solid knowledge of the aforementioned rules.

Related

Is it ok to use Meteor publish composite for dozens of subscriptions?

Currently, our system is not entirely normalized, and we use meteor-publish-composite to obtain the normalized data in mongodb. Some models have very few dependencies, but others have arrays of objects (i.e. sub-documents) with few foreign keys that we are subscribing to when fetching each model.
An example would be a Post containing a list of Comment sub-documents, where each comment has a userId field.
My question is, while I know it would be faster to use collection hooks and update the collection with data denormalization, how does Meteor handle multiple subscriptions on the same collection?
Do a hundred subscriptions on the same collection affect the application speed (significantly)? What about a thousand? etc.

This may not fully answer your question, however after spending countless hours tuning the performance of a large meteor app, I thought I would share some of the things that I have learned.
In Meteor, when you define a publication, you are setting up a reactive query that continues to push data to subscribed clients whenever changes to the underlying mongo data cause the result of the query to change. In other words, it sets up a query that will continually push data to clients as data is inserted, updated, or removed. It does this by creating an observer on the query.
When an observer is initialized (e.g. when publication is subscribed to), it will query mongodb for the initial dataset to send down and then use the oplog to detect changes going forward. Fortunately, meteor is able to re-use an existing observer for a new subscription if the query is for the same collection, same selectors, and same options.
This means that you could create hundreds of subscriptions against many different publications, but if they are hitting the same collection and using the same query selectors and options, then you effectively only have one observer in play. For more details, I highly recommend reading this article from kadira.io (from which I acquired the information I used in this answer).
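The reuse rule described above can be pictured with a toy model: subscriptions are keyed by collection, selector and options, and identical keys share one observer. This is only an illustrative Python sketch, not Meteor's actual implementation; ObserverPool and its methods are hypothetical names.

```python
import json


class ObserverPool:
    """Toy model: one observer per distinct (collection, selector, options)."""

    def __init__(self):
        self.observers = {}  # key -> number of subscriptions sharing it

    def subscribe(self, collection, selector, options=None):
        # Serialize with sorted keys so that equal queries produce equal keys.
        key = json.dumps([collection, selector, options or {}], sort_keys=True)
        self.observers[key] = self.observers.get(key, 0) + 1
        return key


pool = ObserverPool()
for _ in range(100):
    pool.subscribe("users", {"active": True}, {"fields": {"name": 1}})
pool.subscribe("users", {"active": False})
```

In this model a hundred identical subscriptions cost one observer, while a single subscription with a different selector adds a second one.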
In addition to this, Meteor is also able to deal with multiple publications publishing the same document, and when this occurs, the documents will be merged into one. See this for more detail.
Lastly, because of Meteor's MergeBox component, it will minimize the data being sent over the wire across all your subscriptions by keeping track of what data changed vs. what is already on the client.
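To make the idea of "what changed vs. what is already on the client" concrete, here is a small Python sketch that keeps a per-client view of each document and sends only the fields that differ. It is an assumption-laden illustration of the concept, not the MergeBox code; fields_to_send and client_view are hypothetical names.

```python
def fields_to_send(client_view, doc_id, new_fields):
    """Return only the fields that differ from what the client already holds,
    and record them in the client's view."""
    known = client_view.setdefault(doc_id, {})
    changed = {
        name: value for name, value in new_fields.items()
        if name not in known or known[name] != value
    }
    known.update(changed)
    return changed


client_view = {}
first = fields_to_send(client_view, "post1", {"title": "Hi", "likes": 0})
second = fields_to_send(client_view, "post1", {"title": "Hi", "likes": 1})
```

The first call sends the whole document, while the second sends only the likes field because the title is already known to the client.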
Therefore, in your specific example, it sounds like you will be running several different subscriptions on effectively the same query (since you are just trying to de-normalize your data) and dataset. Because of all the optimizations that I described above, I would guess that you won't be plagued by performance issues by taking this approach.
I have done similar things in one of my apps and have never had an issue.

MongoDB Schema Suggestion

I am trying to pick MongoDB as my preferred database. I need help on the design of my table.
App background - analytics app where contacts push their own events and related custom data. Contact can have many events. Eg: contact did this, did that etc.
event_type, custom_data (json), epoch_time
eg:
event 1: event_type: page_visited, custom_data: {url: pricing, referrer: google}, current_time
event 2: event_type: video_watched, custom_data: {url: video_link}, current_time
event 3: event_type: paid, custom_data: {plan:lite, price:35}
These events are custom and are defined by the user. Scalability is a concern.
These are the common use cases:
give me a list of users who have come to pricing page in the last 7 days
give me a list of users who watched the video and paid more than 50
give me a list of users who have visited pricing, watched video but NOT paid at least 20
What's the best way to design my table?
Is it a good idea to use embedded events in this case?

In Mongo they are called collections and not tables, since the data is not rows/columns :)
(1) I'd make an Events collection and a Users collection
(2) I'd do 1 document per Event which has a userId in it.
(3) If you need realtime data you will want an index on what you want to query by (i.e. never do a scan over the whole collection).
(4) if there are things which are needed for reporting only, I'd recommend making a reporting node (i.e. a different mongo instance) and using replication to copy data to that mongo instance. You can put additional indexes for reporting on that node. That way the additional indexes and any expensive queries will not affect production performance.
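As a sketch of point (2), here is what one document per event with a userId could look like, together with two of the use cases from the question, evaluated in plain Python over an in-memory list. The sample users, the "pro" plan and the users_where helper are hypothetical; the field names follow the question.

```python
import time

DAY = 24 * 60 * 60
now = int(time.time())

# One document per event, each carrying the userId of the contact.
events = [
    {"userId": "u1", "event_type": "page_visited",
     "custom_data": {"url": "pricing", "referrer": "google"},
     "epoch_time": now - 2 * DAY},
    {"userId": "u1", "event_type": "video_watched",
     "custom_data": {"url": "video_link"}, "epoch_time": now - 1 * DAY},
    {"userId": "u1", "event_type": "paid",
     "custom_data": {"plan": "lite", "price": 35}, "epoch_time": now - 1 * DAY},
    {"userId": "u2", "event_type": "page_visited",
     "custom_data": {"url": "pricing"}, "epoch_time": now - 30 * DAY},
    {"userId": "u2", "event_type": "video_watched",
     "custom_data": {"url": "video_link"}, "epoch_time": now - 3 * DAY},
    {"userId": "u2", "event_type": "paid",
     "custom_data": {"plan": "pro", "price": 60}, "epoch_time": now - 3 * DAY},
]


def users_where(predicate):
    """Collect the distinct userIds of all events matching the predicate."""
    return {event["userId"] for event in events if predicate(event)}


# Users who came to the pricing page in the last 7 days.
visited_pricing_7d = users_where(
    lambda e: e["event_type"] == "page_visited"
    and e["custom_data"].get("url") == "pricing"
    and e["epoch_time"] >= now - 7 * DAY
)

# Users who watched the video and paid more than 50.
watched = users_where(lambda e: e["event_type"] == "video_watched")
paid_over_50 = users_where(
    lambda e: e["event_type"] == "paid" and e["custom_data"].get("price", 0) > 50
)
watched_and_paid_over_50 = watched & paid_over_50
```

In a real deployment each predicate would become a query filter backed by an index, as point (3) recommends, and the set intersection would happen in the application or in an aggregation.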
Notes on sharding
If your events collection is going to become large, you may need to consider sharding, perhaps by user id. However, I'd treat that as a longer-term solution and not dive into it until you need it.
One thing to note is that mongo currently (2.6) has a database-level write locking implementation, which means you can only perform one write at a time per database, while many reads are allowed. So if you want a high-write system AND have a lot of users, you will need to look into sharding at some point. However, in my experience so far, one primary node with a secondary (and a reporting node) is administratively easier to set up. We can currently handle around 10,000 operations per second with that setup.
However, we have had issues with spikes in users coming to the system. You'll want to make sure you have enough memory for your indexes. SSDs are recommended too, as a surge in users can result in cache misses (i.e. an index not being in memory), which causes it to be read from disk.
One final note - there are a lot of NoSQL DBs and they all have their pros and cons. I personally found that high write, low read, and realtime analysis of lots of data is not really mongo's strength. So it does depend on what you are doing. It sounds like you are still learning the fundamentals. It might be worth reading up on all the available types to pick the right tool for the job.

MongoDB database design: is redundancy important?

I have a question about MongoDB database design.
As far as I know (I'm not sure I'm correct), there is no need to use relationships between collections. For example I have collection for Users with their emails and there are email templates that I want to send them to the users.
Should I use my old paradigm of avoiding redundancies and design 3 collections like this:
Users: ID,Name,Email
Templates: ID,Contents
EmailSent: UserID,TemplateID
Or should I use the NoSQL paradigm, like this:
Users: ID,Name,Email
Templates: ID,Contents
EmailSent: UserID,Contents
Difference is only in Email sent collection. I'm looking for a clear answer according to MongoDB design architecture, not personal opinions

In this special case, I would not reference the used template from the sent emails, because a sent email is sent and can not be changed anymore. When you change the template after sending an email, the email already in the inbox of the receiver would not change. But when you look at the email in your application, it would appear with the new template even though that's not the template which was active when the email was generated. That would provide your users with misleading information.
In the more general case, there is no by-the-book solution for the question embedding vs. referencing. While MongoDB generally prefers embedding over referencing because of the lack of on-database JOINs, embedding causes problems when many documents embed copies of the same data and that data changes. In that case you either have to leave the data as-is (which can make sense in some cases, like here for example) or update all documents when you update the embedded data. This would be an expensive operation.
You won't have that costly mass-update operation with referencing instead of embedding. However, it would make retrieval of the complete documents more expensive, because you would have to perform multiple subsequent queries.
Which option you choose depends on your expected usual use-case:
When you expect that requesting the document together with its sub-document is a frequent operation and updating the sub-document is a rare operation, you would choose embedding.
When the sub-document changes very frequently and requests are rare, referencing would be the smarter strategy.
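The sent-email case from the start of this answer can be sketched in a few lines of Python. The dictionaries stand in for the Users, Templates and EmailSent collections from the question; the sample user, the template text and the send_email helper are hypothetical.

```python
templates = {"t1": "Welcome, {name}!"}
users = {"u1": {"name": "Ann", "email": "ann@example.com"}}


def send_email(user_id, template_id):
    """Record a sent email both ways so the two designs can be compared."""
    contents = templates[template_id].format(name=users[user_id]["name"])
    referenced = {"userId": user_id, "templateId": template_id}
    embedded = {"userId": user_id, "contents": contents}
    return referenced, embedded


referenced, embedded = send_email("u1", "t1")

# The template is edited after the email has gone out.
templates["t1"] = "Hello again, {name}!"

shown_via_reference = templates[referenced["templateId"]].format(name="Ann")
shown_via_embedding = embedded["contents"]
```

After the edit, the referenced record renders the new template text, which the recipient never received, while the embedded copy still shows what was actually sent.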

Are there any advantages to using a custom _id for documents in MongoDB?

Let's say I have a collection called Articles. If I were to insert a new document into that collection without providing a value for the _id field, MongoDB will generate one for me that is specific to the machine and the time of the operation (e.g. sdf4sd89fds78hj).
However, I do have the ability to pass a value for MongoDB to use as the value of the _id key (e.g. 1).
My question is, are there any advantages to using my own custom _ids, or is it best to just let Mongo do its thing? In what scenarios would I need to assign a custom _id?
Update
For anyone else that may find this. The general idea (as I understand it) is that there's nothing wrong with assigning your own _ids, but it forces you to maintain unique values within your application layer, which is a PITA, and requires an extra query before every insert to make sure you don't accidentally duplicate a value.
Sammaye provides an excellent answer here:
Is it bad to change _id type in MongoDB to integer?

Advantages with generating your own _ids:
You can make them more human-friendly, by assigning incrementing numbers: 1, 2, 3, ...
Or you can make them more human-friendly, using random strings: t3oSKd9q
(That doesn't take up too much space on screen, could be picked out from a list, and could potentially be copied manually if needed. However you do need to make it long enough to prevent collisions.)
If you use randomly generated strings they will have an approximately even sharding distribution, unlike the standard mongo ObjectIds, which tend to group records created around the same time onto the same shard. (Whether that is helpful or not really depends on your sharding strategy.)
Or you may like to generate your own custom _ids that will group related objects onto one shard, e.g. by owner, or geographical region, or a combination. (Again, whether that is desirable or not depends on how you intend to query the data, and/or how rapidly you are producing and storing it. You can also do this by specifying a shard key, rather than the _id itself. See the discussion below.)
Advantages to using ObjectIds:
ObjectIds are very good at avoiding collisions. If you generate your own _ids randomly or concurrently, then you need to manage the collision risk yourself.
ObjectIds contain their creation time within them. That can be a cheap and easy way to retain the creation date of a document, and to sort documents chronologically. (On the other hand, if you don't want to expose/leak the creation date of a document, then you must not expose its ObjectId!)
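To show what "contain their creation time" means in practice: the first 4 bytes of an ObjectId are a Unix timestamp in seconds, so it can be decoded from the hex string without any driver. The sketch below is plain Python; objectid_creation_time is a hypothetical helper name and the id is just a sample value.

```python
from datetime import datetime, timezone


def objectid_creation_time(object_id_hex):
    """Decode the creation time from a 24-character hex ObjectId string.

    The first 4 bytes (8 hex characters) are seconds since the Unix epoch.
    """
    if len(object_id_hex) != 24:
        raise ValueError("an ObjectId is 12 bytes, i.e. 24 hex characters")
    seconds = int(object_id_hex[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


created = objectid_creation_time("507f1f77bcf86cd799439011")
print(created.isoformat())  # 2012-10-17T21:13:27+00:00
```

This is also why exposing an ObjectId exposes the creation date: anyone holding the id can do the same decoding.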
The nanoid module can help you to generate short random ids. They also provide a calculator which can help you choose a good id length, depending on how many documents/ids you are generating each hour.
Alternatively, I wrote mongoose-generate-unique-key for generating very short random ids (provided you are using the mongoose library).
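For a feel of how id length relates to collision risk, here is a small Python sketch using the standard birthday-bound approximation. It does not reproduce the nanoid calculator mentioned above; the 62-character alphabet, random_id and collision_probability are assumptions made for the illustration.

```python
import math
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # 62 symbols


def random_id(length=8):
    """Generate a random id from a cryptographically strong source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def collision_probability(num_ids, length, alphabet_size=len(ALPHABET)):
    """Approximate chance of at least one collision among num_ids random ids.

    Birthday-bound approximation: 1 - exp(-n(n-1) / (2N)), N = id space size.
    """
    space = alphabet_size ** length
    return -math.expm1(-num_ids * (num_ids - 1) / (2 * space))


p_short = collision_probability(1_000_000, 8)
p_long = collision_probability(1_000_000, 12)
```

With these assumptions, a million 8-character ids give a collision chance of roughly 0.2%, and lengthening the id shrinks that quickly. A unique index on _id still catches any collision that does occur, so the application only needs to retry with a fresh id.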
Sharding strategies
Note: Sharding is only needed if you have a huge number of documents (or very heavy documents) that cannot be managed by one server. It takes quite a bit of effort to set up, so I would not recommend worrying about it until you are sure you actually need it.
I won't claim to be an expert on how best to shard data, but here are some situations we might consider:
An astronomical observatory or particle accelerator handles gigabytes of data per second. When an interesting event is detected, they may want to store a huge amount of data in only a few seconds. In this case, they probably want an even distribution of documents across the shards, so that each shard will be working equally hard to store the data, and no one shard will be overwhelmed.
You have a huge amount of data and you sometimes need to process all of it at once. In this case (but depending on the algorithm) an even distribution might again be desirable, so that all shards can work equally hard on processing their chunk of the data, before combining the results at the end. (Although in this scenario, we may be able to rely on MongoDB's balancer, rather than our shard key, for the even distribution. The balancer runs in the background after data has been stored. After collecting a lot of data, you may need to leave it to redistribute the chunks overnight.)
You have a social media app with a large amount of data, but this time many different users are making many light queries related mainly to their own data, or their specific friends or topics. In this case, it doesn't make sense to involve every shard whenever a user makes a little query. It might make sense to shard by userId (or by topic or by geographical region) so that all documents belonging to one user will be stored on one shard, and when that user makes a query, only one shard needs to do work. This should leave the other shards free to process queries for other users, so many users can be served at once.
Sharding documents by creation time (which the default ObjectIds will give you) might be desirable if you have lots of light queries looking at data for similar time periods. For example many different users querying different historical charts.
But it might not be so desirable if most of your users are querying only the most recent documents (a common situation on social media platforms) because that would mean one or two shards would be getting most of the work. Distributing by topic or perhaps by region might provide a flatter overall distribution, whilst also allowing related documents to clump together on a single shard.
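The difference between clustering by creation time and spreading by hash can be seen in a toy Python model with four shards. Real MongoDB chunking and balancing are more involved than this; range_shard, hashed_shard and the integer timestamps are hypothetical simplifications.

```python
import hashlib

NUM_SHARDS = 4


def range_shard(timestamp, start, end):
    """Range partitioning: split [start, end) into equal consecutive slices."""
    width = (end - start) / NUM_SHARDS
    return min(int((timestamp - start) / width), NUM_SHARDS - 1)


def hashed_shard(key):
    """Hash partitioning: spread keys by a stable hash of their value."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS


timestamps = list(range(1000))  # one document per tick, oldest first
recent = timestamps[-100:]      # a "most recent documents" query

shards_hit_by_range = {range_shard(t, 0, 1000) for t in recent}
shards_hit_by_hash = {hashed_shard(t) for t in recent}
```

In this model the recent documents all sit on the last shard under range partitioning, so that shard does all the work for "latest" queries, whereas hashing spreads the same documents over several shards.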
You may like to read the official docs on this subject:
https://docs.mongodb.com/manual/sharding/#shard-key-strategy
https://docs.mongodb.com/manual/core/sharding-choose-a-shard-key/

I can think of one good reason to generate your own ID up front: idempotency. For example, it makes it possible to tell whether something worked or not after a crash. This method works well with re-try logic.
Let me explain. The reason people might consider re-try logic:
Inter-app communication can sometimes fail for different reasons (especially in a microservice architecture). The app becomes more resilient and self-healing if it is coded to re-try rather than give up right away. This rides over odd blips that might occur, without the consumer ever being affected.
For example when dealing with mongo, a request is sent to the DB to store some object, the DB saves it, but just as it is trying to respond to the client to say everything worked fine, there is a network blip for whatever reason and the “OK” is never received. The app assumes it didn't work and so the app may end up re-trying the same data and storing it twice, or worse it just blows up.
Creating the ID up front is an easy, low overhead way to help deal with re-try logic. Of course one could think of other schemes too.
Although this sort of resiliency may be overkill in some types of projects, it really just depends.
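The lost-acknowledgement scenario above can be simulated in plain Python. FlakyStore is a hypothetical stand-in for the database that saves the document but drops the first reply, and DuplicateKey stands in for the driver's duplicate-key error. Treating a duplicate as success assumes the _id was generated for this one logical write only.

```python
import uuid


class DuplicateKey(Exception):
    """Raised when a document with the same _id already exists."""


class FlakyStore:
    """Toy store: saves the document, but loses the first acknowledgement."""

    def __init__(self):
        self.docs = {}
        self.drop_next_ack = True

    def insert(self, doc):
        if doc["_id"] in self.docs:
            raise DuplicateKey(doc["_id"])
        self.docs[doc["_id"]] = doc
        if self.drop_next_ack:
            self.drop_next_ack = False
            # The write succeeded, but the reply never reaches the caller.
            raise ConnectionError("ack lost")


def insert_with_retry(store, doc, attempts=3):
    """Retry with the same _id; a duplicate means an earlier try succeeded."""
    for _ in range(attempts):
        try:
            store.insert(doc)
            return True
        except DuplicateKey:
            return True
        except ConnectionError:
            continue  # outcome unknown: try again with the same _id
    return False


store = FlakyStore()
order = {"_id": str(uuid.uuid4()), "item": "watch"}  # id chosen before sending
ok = insert_with_retry(store, order)
```

Because the id is fixed before the first attempt, the retry cannot create a second copy: the store ends up with exactly one document.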

I have used custom ids a couple of times and it was quite useful.
In particular, I had a collection where I would store stats by date, so the _id was actually a date in a specific format. I did that mostly because I would always query by date. Keep in mind that this approach can simplify your indexes, as no extra index is needed: the default index on _id is sufficient.

Sometimes the ID is something more meaningful than a randomly generated one. For example, a user collection may use the email address as the _id instead. In my project I generate IDs that are much shorter than the ones MongoDB uses, so that the ID shown in the URL is much shorter.

I'll use an example. I created a property management tool that had multiple collections. For simplicity, some records were duplicated across collections, for example a payment. When I needed to update such a record, the change had to happen in every collection it appeared in, so I assigned each payment a custom payment id. That way, a delete or query by that id reaches all instances of the record database-wide.

Port From Entity Framework to MongoDB

I'm planning to port from Entity Framework 4.0 to MongoDB. What are the best practices to minimize the impact? The project has social networking functionality and hence maintains a complex relational database, so performance is a concern if we keep using a relational database.
We have used a domain layer (using POCOs), the repository pattern and DTO mapping in the project.
Also, what are the advantages and disadvantages of this decision? And how would it affect my domain layer implementation?

If you want to 'minimize impact' you'll want to create a database in MongoDB that mirrors the one you have in SQL. Since there are no joins in the database, you'll need to do multiple reads to complete your query. In itself that's not too bad because MongoDB is really fast, but obviously it has other issues (concurrency, etc.).
If, however, you want to move over fully to the NOSQL-way of doing things you'll likely not be able to 'minimize impact', you'll need to make substantial changes to the way you store content, the way you access it and the way you update it.
Storage: You'll likely create documents in your database that are denormalized and much closer to 'ViewModels' than 'Models'. You might for example store a count of child records in a parent record so that you can display it without having to load them or count them.
Access: You might end up using Map-Reduce for some queries to your database which is a very different mind-set from a traditional query.
Updates: In all likelihood your approach to updating will be different in order to take advantage of the many fine-grained MongoDB update features like $inc. Instead of posting back some large view model and then applying it to your model and then updating the database you might instead provide a much finer-grained Ajax call back that updates a single value. Take a look at CQRS for more ideas on how to think about models for updates vs queries.
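The storage and update points can be combined in one small sketch: a parent document keeps a denormalized count of its children, and adding a child bumps that count with a fine-grained $inc instead of rewriting the whole document. The apply_update function below is a hypothetical Python model of a tiny subset of the update operators (top-level fields only), not a driver call; the post and comment documents are sample data.

```python
def apply_update(doc, update):
    """Apply a tiny subset of MongoDB-style update operators to a dict."""
    for field, amount in update.get("$inc", {}).items():
        doc[field] = doc.get(field, 0) + amount
    for field, value in update.get("$set", {}).items():
        doc[field] = value
    return doc


post = {"_id": "p1", "title": "Hello", "commentCount": 2}
comments = [
    {"postId": "p1", "text": "first"},
    {"postId": "p1", "text": "second"},
]

# Adding a comment: insert the child, then bump the counter on the parent.
comments.append({"postId": "p1", "text": "third"})
apply_update(post, {"$inc": {"commentCount": 1}})
```

The list view can now show the comment count from the post document alone, without loading or counting the comments.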
