Meteor - Why not just publish all the collection data? - mongodb

This may be an easy question to answer, and it may just be my lack of understanding, but if you have to run the query twice, once on the server and once on the client, why not just publish all the collection data and then run a single query on the client?
Obviously I don't mean doing this for the users collection, but if you have a blog Posts collection, wouldn't this be beneficial?
Publish all the post data, then subscribe to it and run whatever query is necessary on the client to get the data you need.

Publishing everything is fine in a development environment, which is why Meteor adds the autopublish package by default, but it has real drawbacks in production. I find these two points to be the most important:
Security: The idea is to supply only as much data to the client as it requires. You can never trust the client, and you don't know what the client may use the data for. For your use case of simple blog posts this may not be a serious risk, but it can be critical for an e-commerce application. The last thing you want is for an attacker to take that data and leverage a bug in your code to do nasty stuff.
Data overhead: Subscriptions are generally used with waitOn, so the templates are not rendered until all the data has been made available to the client. If you have a very large amount of data, rendering will take considerable time. So it is advisable to publish only what you need in order to keep this time down too.

Related

Microservices - Storing user data in separate database

I am building a system that has two separate microservices: a user service and a comments service. The user service stores the user details such as email, first/last name, job title, etc., and the comments service stores all comments made by the user.
In the UI, I need to populate the comments (via a REST API) and show the first/last name, email, and job title of the user.
Is it recommended that we store all these user details in the comments database?
If yes, then every time a user changes their details (first/last name or job title) I will have to update those details in all of their comments (I don't think this is a good idea).
If no, then if I store just the userid in the comments DB, how am I supposed to get the user details for each comment? Let's say we want to show 20 comments per page in the UI.
First, challenge the architecture. Let's assume that both services in the question are part of a larger ecosystem of microservices that all make use of the user information; otherwise the separation is almost certainly overengineered. From the word "comments" we can at least guess that there is one other class of objects, namely the things being commented on. So let's assume a "user service" is a meaningful piece to break out into a microservice, because at least some other pieces carry enough weight to justify the breakup.
In that case I suggest the following strategy:
Second, implement an abstraction layer into your comments service right away so that most of the code will not have to care about where the user comes from (i.e. don't join or $lookup). This is also a great opportunity for local testing, because you can just create a collection with the data you need and run service level integration tests against it.
Third, for integration with the user service, fetch the data from there via its API (which should support bulk selection in any case) every time you need it. Because you have the abstraction layer, you can add caching, cache timeouts, eviction strategies and whatever else you may need below this abstraction without the main portion of the code having to care. Add these on an as-needed basis. Keep it simple.
Fourth, when things really get heavyweight and you have to deal with tens of thousands of users, tons of comments and many requests per second, the comments service could, still below the abstraction, implement an upfront replication pattern to hold the full user database locally. This is usually based on an asynchronous message that the user service sends to all subscribers when something changes in the user base. When it suits the subscribers (i.e. the comments service), they can trigger a full or (from time to time) delta replication of the changes. Suitable collections will already be in place from what you did for caching. And the comments service will probably need considerably less information than is stored in the user service (let alone the hashed password, other login options or accounting information).
Fifth, should you still hit performance challenges, you can break the abstraction for the few cases you need to and do the join or $lookup.
Follow the steps in order, and stop as soon as the overall assembly works fine. Every step adds considerable complexity, and when you don't need it, don't implement it.
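As a rough illustration of the second and third steps, here is a minimal Python sketch of such an abstraction layer. Everything in it is an assumption made for illustration: the names UserDirectory, get_many, attach_authors and fetch_bulk are hypothetical, and fetch_bulk stands in for whatever bulk endpoint the real user service exposes. The rest of the comments service only talks to UserDirectory, so the cache (and later, replication) can change underneath it.

```python
import time
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical shape of a user record as the comments service sees it.
User = Dict[str, str]


class UserDirectory:
    """Abstraction layer: callers ask for users and never care
    whether they come from a cache or from the user service."""

    def __init__(self,
                 fetch_bulk: Callable[[List[str]], Dict[str, User]],
                 ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch_bulk = fetch_bulk  # assumed: one bulk call to the user service
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[float, User]] = {}  # id -> (expires_at, user)

    def get_many(self, user_ids: Iterable[str]) -> Dict[str, User]:
        now = self._clock()
        wanted = list(dict.fromkeys(user_ids))  # de-duplicate, keep order
        found: Dict[str, User] = {}
        missing: List[str] = []
        for uid in wanted:
            entry = self._cache.get(uid)
            if entry is not None and entry[0] > now:
                found[uid] = entry[1]
            else:
                missing.append(uid)
        if missing:
            # A single bulk request covers everything not cached or expired.
            for uid, user in self._fetch_bulk(missing).items():
                self._cache[uid] = (now + self._ttl, user)
                found[uid] = user
        return found


def attach_authors(comments: List[dict], directory: UserDirectory) -> List[dict]:
    """Decorate one page of comments with the author details for display."""
    users = directory.get_many(c["user_id"] for c in comments)
    return [{**c, "author": users.get(c["user_id"])} for c in comments]
```

For a page of 20 comments this results in at most one bulk request to the user service, and none at all while the cached entries are still fresh.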

Share a MongoDB instance between Meteor apps without lag in reactivity?

This question has been asked multiple times, here and here, and the answer to get this working is fairly straightforward: add an environment variable to your .bash_profile and all Meteor instances on your localhost will share that MONGO_URL.
What I've noticed however is that while this may be the case, there's quite a bit of latency in the "reactivity" of Meteor. I've tested this with two very lean Meteor apps, with empty collections. Inserting a document to a collection from one Meteor app, where my second app is querying that same collection and printing out a field from the documents does work, but there's a noticeable lag before it updates. I've ruled out the possibility of the collection insertion being the source of the lag (simple console.log callback on the client of the first app, logging the id of the newly inserted document).
My purpose for having multiple apps (two to be precise) sharing the same MongoDB is to separate an admin panel from a mobile app without going crazy regarding name-spacing and bloat. This configuration works, but I'm not sure it's the "proper" way of accomplishing the task, and it certainly seems to be causing a performance hit.
Any insight into this matter would be appreciated. Thank you!
EDIT: To clarify, the db URL I'm using is on my localhost, and isn't something hosted online.
When you use an external database, Meteor will by default use periodic polling (every few seconds) to observe any changes. The delay you are experiencing is a result of this polling process. You can remove the delay and reduce your app's CPU usage by taking advantage of Meteor's oplog tailing feature. In order to use it you will need to:
Get access to a mongodb instance with the oplog turned on.
Set the environment variable MONGO_OPLOG_URL so your app(s) can read the oplog.
Personally, I'd recommend compose.io for this. They provide exactly this as part of their basic elastic deployment. See this post for detailed instructions.
For users who wish to connect to the oplog created locally for you, you can obtain the URL via:
MongoInternals.defaultRemoteCollectionDriver().mongo._oplogHandle._oplogUrl
It should end up looking something like mongodb://127.0.0.1:3001/local

Caching repeating query results in MongoDB

I am going to build a page that is designed to be "viewed" a lot, while far fewer users will "write" to the database. For example, only 1 in 100 users may post news on my site, and the rest will just read the news.
In the above case, 100 SAME QUERIES will be performed when they visit my homepage while the actual database change is little. Actually 99 of those queries are a waste of computer power. Are there any methods that can cache the results of the first query, and when they detect the same query in a short time, can deliver the cached result?
I use MongoDB and Tornado. However, some posts say that the MongoDB does not do caching.
Making a static, cached HTML with something like Nginx is not preferred, because I want to render a personalized page by Tornado each time.
"I use MongoDB and Tornado. However, some posts say that the MongoDB does not do caching."
I don't know who said that, but MongoDB does have a way to cache queries: because it does not do memory management itself, it relies on the operating system's LRU cache.
So long as your working set fits into the LRU without the OS having to page it out or swap constantly you should be reading this query from memory at most times. So, yes, MongoDB can cache but technically it doesn't; the OS does.
"Actually 99 of those queries are a waste of computer power."
The caching mechanisms for solving this kind of problem are the same across most technologies, whether MongoDB or SQL. Of course, this only matters if it actually is a problem; if you ask me, you are probably micro-optimising unless you get Facebook, Google or YouTube levels of traffic.
Caching is a huge subject, ranging from caching query results in pre-aggregated MongoDB collections, Memcached, Redis and so on, to caching HTML and other web resources so that the server does as little work as possible.
As I said, your scenario sounds as though you are thinking about the wasted computing power in the wrong way. Even if you cached this query in another collection or technology, retrieving the result from there would probably cost about the same power and resources as just running the query. That assumption does, however, depend on you having the right indexes, schema, set-up, etc.
I recommend you read some links on good schema design and index creation:
http://docs.mongodb.org/manual/core/indexes/
https://docs.mongodb.com/manual/core/data-model-operations/#large-number-of-collections
"Making a static, cached HTML with something like Nginx is not preferred, because I want to render a personalized page by Tornado each time."
Yes, I think that by worrying about query caching you are prematurely optimising, especially if you don't want to remove what would be 90% of the load on your server each time: rendering the page itself.
I would focus on your schema and indexes and then worry about caching if you really need it.
The author of the Motor (MOngo + TORnado) package gives an example of caching his list of categories here: http://emptysquare.net/blog/refactoring-tornado-code-with-gen-engine/
Basically, he defines a global list of categories and queries the database to fill it in; then, whenever he needs the categories in his pages, he checks the list: if it exists, he uses it; if not, he queries again and fills it in. He has it set up to invalidate the list whenever he inserts into the database, but depending on your usage you could create a global timeout variable to keep track of when you next need to re-query. If you're doing something complicated, this could get out of hand, but if it's just a list of the most recent posts or something, I think it would be fine.
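To make that pattern concrete, here is a minimal sketch of it in plain Python. It is not the code from the linked post: CachedQuery, run_query and max_age_seconds are hypothetical names, and the sketch is synchronous, whereas a real Tornado/Motor handler would fetch asynchronously. It combines both ideas above, explicit invalidation on insert and a timeout.

```python
import time
from typing import Any, Callable


class CachedQuery:
    """Keep the result of one repeated query in process memory."""

    def __init__(self,
                 run_query: Callable[[], Any],
                 max_age_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._run_query = run_query      # assumed: function that hits the database
        self._max_age = max_age_seconds
        self._clock = clock
        self._loaded = False
        self._loaded_at = 0.0
        self._value: Any = None

    def get(self) -> Any:
        now = self._clock()
        expired = now - self._loaded_at >= self._max_age
        if not self._loaded or expired:
            # Cache is empty, invalidated or too old: query again.
            self._value = self._run_query()
            self._loaded_at = now
            self._loaded = True
        return self._value

    def invalidate(self) -> None:
        """Call this after every insert so readers see fresh data."""
        self._loaded = False
```

With this in place, 100 page views within the timeout window trigger one database query instead of 100, and a write can call invalidate() so the next reader sees the new post immediately. Note that the cache lives in one process only; with several Tornado processes each would hold its own copy.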

Is there any value in using core data for iPhone apps?

Can people give me examples of why they would use coreData in an application?
I ask this because most apps are just clients to a central server where an API of some sort gives you the information you need.
In my case I'm writing a timesheet application for a web app which has an API, and I'm debating whether there is any value in replicating my server's data structure in Core Data (SQLite),
e.g.
Project has many timesheets
employee has many timesheets
It seems to me that I can just connect to the API on every call for lists of projects or existing timesheets for example.
I realize that for some kind of offline mode you could store data locally in Core Data, but this creates far more problems, because you now have to sync that data back to the web server when you get a connection again, e.g. when the project selected for a timesheet no longer exists.
Can any experienced developer shed some light on their experiences of when Core Data is the best-practice approach?
EDIT
I realise of course that there is value in local persistence, but the key-value storage of user defaults seems to cover most applications I can think of.
You shouldn't think of CoreData simply as an SQLite database. It's not just an SQLite database. Sure, SQLite is an option, but there are other options as well, such as in-memory stores and, as of iOS 5, a whole slew of custom data stores. The biggest benefit of CoreData is persistence, obviously. But even if you are using an in-memory data store, you get the benefits of a very well structured object graph, and all of the heavy lifting of pulling information out of or putting information into the data store is handled by CoreData for you, without you necessarily needing to concern yourself with what is backing that data store.
Sure, today you don't care too much about persistence, so you could use an in-memory data store. What happens if tomorrow, or in a month, or in a year, you decide to add a feature that would really benefit from persistence? With CoreData, you simply change or add a persistent data store, and all of your methods for getting information in or out remain unchanged. The overhead of that sort of addition is minimal compared to accessing SQLite or some other data store directly. IMHO, that's the biggest benefit: abstraction. And, in essence, abstraction is one of the most powerful things behind OOP.
Granted, building the data model just for in-memory storage could be overkill for your app, depending on how involved the app is. But, as a side note, you may want to consider which is faster: requesting information from your web service every time you want to perform some action, or requesting the information once, storing it in memory, and acting on that stored value for the remainder of the session. An in-memory data store wouldn't persist beyond that particular session.
Additionally, with CoreData you get a lot of other great features like saving, fetching, and undo-redo.
There are basically two kinds of apps. Those that provide you with local functionality (games, professional applications, navigation systems...) and those that grant access to a remote service.
Your app seems to be in the second category. If you access remote services, your users will want new or real-time data (you don't want to read two-week-old Facebook posts), but in some cases local caching makes sense (e.g. reading your mail on the train with an unstable network).
I assume that the value of accessing cached entries when not connected to a network is pretty low for your customers (internal or external) compared to the importance of accessing real-time data. So local storage might not be necessary at all.
If you don't have hundreds of entries in your timesheet, "normal" serialization (the NSCoding protocol) might be enough. If you only access some "dashboard data", you will be able to get along with simple request/response caching (NSURLCache can do a lot of things...).
Core Data does make more sense if you have complex data structures which should be synchronized with a server. This adds a lot of synchronization logic to your project as well as complexity from Core Data integration (concurrency, thread-safety, in-app-conflicts...).
If you want to create a "client"-app with a server driven user experience, local storage is not necessary at all so my suggestion is: Keep it as simple as possible unless there is a real need for offline storage.
It's ideal if you want to store data locally on the phone.
Seriously though, if you can't see a need for it for your timesheet app, then don't worry about it and don't use it.
The sync problems that you would have with an "offline" mode should be solved in the design of your app. For example, don't allow projects to be deleted. Why would you? Wouldn't you want to go back in time and look at previous data for particular projects? Instead, just put a marker on the project to show it as inactive, along with the date/time at which it was made inactive. If the data being synced from the device is for that project and is from before the date/time at which it was marked as inactive, then it's fine to sync. Otherwise display a message and the user will have to sort it out.
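The rule above is small enough to write down directly. The following is a sketch in Python rather than in the app's own language, and the names can_sync, entry_time and inactive_since are hypothetical; it assumes each project carries an optional timestamp that is empty while the project is still active.

```python
from datetime import datetime
from typing import Optional


def can_sync(entry_time: datetime, inactive_since: Optional[datetime]) -> bool:
    """Decide whether a timesheet entry recorded offline may be synced.

    inactive_since is None while the project is still active; otherwise it
    is the moment the project was marked inactive (projects are never
    deleted, only marked).
    """
    if inactive_since is None:
        return True
    # Entries made before the project was deactivated are still valid.
    return entry_time < inactive_since
```

When can_sync returns False, the app would show a message and leave the entry for the user to resolve.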
Whether you need to store some data locally depends purely on your application's design: whether it solves a real problem of its own or is a thin GUI client around your web service. Apart from an "offline" mode, the other reason to cache server data on the client side might be to take traffic load off your server. Just think about what it means for your server to send the whole timesheet data to the client every time, rather than just the changes. Yes, it means more implementation work on both sides, but in some cases it has serious advantages.
EDIT: example added
Say you have 1000 records per user in your timesheet application and one record is approximately 1 KB. In this case, every time a user starts your application, it has to fetch about 1 MB of data from your server. If you cache the data locally, the server can tell you that, let's say, two records were updated since your last update, so you only have to download 2 KB. Now scale this up to several tens of thousands of users and you will immediately notice the difference in server bandwidth and CPU usage.
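The arithmetic can be checked with a few lines of Python. The record count, record size and number of changed records are taken from the example above; the figure of 20,000 users is only an assumed stand-in for "several tens of thousands", and the function names are made up for this sketch.

```python
def full_sync_kb(records_per_user: int, record_kb: int) -> int:
    """Data fetched when the client downloads everything on each start."""
    return records_per_user * record_kb


def delta_sync_kb(changed_records: int, record_kb: int) -> int:
    """Data fetched when the client only downloads changed records."""
    return changed_records * record_kb


def fleet_total_kb(per_user_kb: int, users: int) -> int:
    """Total transfer if every user starts the app once."""
    return per_user_kb * users


if __name__ == "__main__":
    users = 20_000  # assumption standing in for "several tens of thousands"
    full = full_sync_kb(1000, 1)   # 1000 KB, roughly 1 MB per start
    delta = delta_sync_kb(2, 1)    # 2 KB per start
    print("full sync, all users (KB):", fleet_total_kb(full, users))
    print("delta sync, all users (KB):", fleet_total_kb(delta, users))
    print("reduction factor:", full // delta)
```

Per start that is 1000 KB against 2 KB, a factor of 500, and the factor carries over unchanged to the whole user base.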

High Volume MongoDB with Twitter Streaming API, Ruby on Rails, Heroku setup

I'm looking to re-code an application to better handle spikes in tweets. I'm moving to Heroku and MongoDB (either MongoLab or MongoHQ) for the database solution.
During certain news events, tweet volume might spike to 15,000 / second. Typically with each tweet, I parse the tweet and store various pieces of data such as user data, etc. My idea is to store the raw tweets in a separate collection, and have a separate process grab raw tweets and parse them. The goal here is when there is a massive spike in tweets, my application isn't trying to parse all of these, but is essentially backlogging the raw tweets in another collection. As the volume slows, the process can take care of the backlog over time.
My question is threefold:
Can MongoDB handle this type of volume with regards to inserts into a collection at a rate of 15,000 tweets per second?
Any idea on the better setup: MongoHQ or MongoLab?
Any feedback on the overall setup?
Thanks!
The write volume it can handle depends on lots of factors: hardware, indexes, the size of each document, etc. Your best bet is to test it in the environment you're planning to use. If the write load exceeds the capacity of a single mongo server, you can always shard across multiple servers.
MongoHQ and MongoLab are very similar, but there are some differences in pricing, and the site design differs in a number of ways. There's a thread of discussion about it here: https://webmasters.stackexchange.com/questions/20782/mongodb-hosting-mongolab-vs-mongohq-vs-mongomachine
Overall it seems to make sense. It sounds like you will want to flesh out some details about how the backlog gets processed. Will you poll it by querying periodically, delete tweets from the backlog as they are processed, and so on?
Completely agree on the need to test this. In general, mongo can handle that many writes, but in practice it depends on the size of your setup, other operations, indexes, etc.
I had to do a similar approach for collecting tons of metrics data. I used a lightweight event-machine process to accept incoming requests in parallel, and store them in a simple format, then another process would take those requests and send them up to a central server. The main goal was to make sure no data was lost if the central server was down, but it also allowed me to put in some throttling logic so that the spikes in data wouldn't overwhelm the system.
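The buffer-then-process idea discussed in this thread can be sketched as follows. This is a Python illustration, not the Ruby/Heroku setup from the question: TweetBacklog, ingest and drain are hypothetical names, and the in-memory deque merely stands in for the raw-tweet collection, so unlike a real database-backed backlog it would lose data if the process died.

```python
import json
from collections import deque
from typing import Callable, Deque


class TweetBacklog:
    """Accept raw tweets cheaply; parse them later at a throttled rate."""

    def __init__(self) -> None:
        # Stand-in for the "raw tweets" collection in the database.
        self._raw: Deque[str] = deque()

    def ingest(self, raw_tweet: str) -> None:
        """Hot path during a spike: store the payload, do no parsing."""
        self._raw.append(raw_tweet)

    def drain(self, handle: Callable[[dict], None], max_batch: int) -> int:
        """Parse and hand over at most max_batch tweets, oldest first.

        Returns how many tweets were processed in this pass.
        """
        processed = 0
        while self._raw and processed < max_batch:
            handle(json.loads(self._raw.popleft()))
            processed += 1
        return processed

    def __len__(self) -> int:
        return len(self._raw)
```

A separate worker would call drain in a loop with a batch size it can sustain; during a spike the backlog simply grows, and it shrinks again once the volume slows.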
I'd be interested to see how this works out for you price-wise vs. a VPS like Linode. (I'm a huge Heroku fan, but with certain architectures it can get pricey quickly.)