mongodb performance for large document - mongodb

I have a document that holds a big data structure in certain fields inside an array, it is slowing down my application due to frequent hits to read such data. am thinking on few solutions to implement but I need advice before i proceed and possibly even a better solution, here are my thoughts/questions:
would it help to cache data?
should I use memcached or redis as a caching engine and why?
would it help to read single fields from this document instead of reading it all every time?
should I do something else?!

Caching will help because it would avoid your db to be hit too often
Memcache or redis it's up to you. I prefere redis but if you already have a memcache it's fine.
If you have a cluster of servers, think if you need a centralized cache or not
Caching a full document won't help for getting a single field because you cache the result of a query without knowing what it contains.

your question need more clarification. for example how big is the data that you are speaking of is it couple of megabytes or gigabytes. All these factors change the solution. But if we consider that you have couple of megabytes and you want to prevent to call database every time the best solution is cache. How to choose a cache is also completely depends on what is your situation. If your web application runs on one server you can use the in-memory cache like ASP.Net cache which is very quick and fast for in-memory cache. this cache is stored in your heap so you can put all your object in the cache without serialization.But consider that whenever your application is restarted like most of deployments. your heap will be deleted and all the cache is cleared inside the heap.
if you have more than one server then you can start to think about an out-of-memory cache because two servers are not sharing heap memory and using all in-memory cache are useless because it duplicate the data and invalidating is nightmare. However, this is more reliable cache while it is not in the heap and in term of persistence is more than in-memory cache. But whatever you want to put in this kind of cache should be serializable while you are transferring the object over network connection. So you cannot put all your object in cache. Both Redis and memcached can be used for this purpose. Redis is more complicated with more functionality than Memcached but for your purpose memcached is quite good.
Whatever caching system you choose, approach it in a wide perspective. Design a caching system in your application while over time you need to put more things in cache. so its better to prepare everything for that time from now.
another things which is very important in cache is that whenever you set something in cache you have to consider when you are going to invalidate it.

Whether or not caching will help depends on the accession of the document. If the document is being accessed multiple times then caching will not help due to how MongoDB to memory caching actually works.
First, you need to understand your data accession patterns.

Related

is memcache relative to database

I have been browsing through a lot of websites. I need experts advice on this one.
can anyone please explain me what exactly is memcache ?
From what I understand that it is a distributed memory caching system used for dynamic web apps but my main question is do we need a database when we say 'memcache' or the term 'memcache' doesnt need a database ?
please answer. Thank you
No, you don't require a traditional database when you say memcache, it's an in memory hash table(dictionary) with key,value storage and so it resides in RAM as a lookup table.For this fact, it is not persistent, so whenever you restart your server, memcache gets reset.
memcached is a specific program that runs a server that other programs can use to keep things in memory. It's something like an in-memory database, depending on your definition of database.
Caching something in memory can also be done generically, without memcached (pronounced memcache dee).

Caching repeating query results in MongoDB

I am going to build a page that is designed to be "viewed" alot, but much fewer users will "write" into the database. For example, only 1 in 100 users may post his news on my site, and the rest will just read the news.
In the above case, 100 SAME QUERIES will be performed when they visit my homepage while the actual database change is little. Actually 99 of those queries are a waste of computer power. Are there any methods that can cache the results of the first query, and when they detect the same query in a short time, can deliver the cached result?
I use MongoDB and Tornado. However, some posts say that the MongoDB does not do caching.
Making a static, cached HTML with something like Nginx is not preferred, because I want to render a personalized page by Tornado each time.
I use MongoDB and Tornado. However, some posts say that the MongoDB does not do caching.
I dunno who said that but MongoDB does have a way to cache queries, in fact it uses the OS' LRU to cache since it does not do memory management itself.
So long as your working set fits into the LRU without the OS having to page it out or swap constantly you should be reading this query from memory at most times. So, yes, MongoDB can cache but technically it doesn't; the OS does.
Actually 99 of those queries are a waste of computer power.
Caching mechanisms to solve these kind of problems is the same across most techs whether they by MongoDB or SQL. Of course, this only matters if it is a problem, you are probably micro-optimising if you ask me; unless you get Facebook or Google or Youtube type traffic.
The caching subject goes onto a huge subject that ranges from caching queries in either pre-aggregated MongoDB/Memcache/Redis etc to caching HTML and other web resources to make as little work as possible on the server end.
Your scenario, personally as I said, sounds as though you are thinking wrong about the wasted computer power. Even if you were to cache this query in another collection/tech you would probably use the same amount of power and resources retrieving the result from that tech than if you just didn't bother. However that assumption comes down to you having the right indexes, schema, set-up etc.
I recommend you read some links on good schema design and index creation:
http://docs.mongodb.org/manual/core/indexes/
https://docs.mongodb.com/manual/core/data-model-operations/#large-number-of-collections
Making a static, cached HTML with something like Nginx is not preferred, because I want to render a personalized page by Tornado each time.
Yea I think by trying to worry about query caching you are pre-maturely optimising, especially if you don't want to take off, what would be 90% of the load on your server each time; loading the page itself.
I would focus on your schema and indexes and then worry about caching if you really need it.
The author of the Motor (MOngo + TORnado) package gives an example of caching his list of categories here: http://emptysquare.net/blog/refactoring-tornado-code-with-gen-engine/
Basically, he defines a global list of categories and queries the database to fill it in; then, whenever he need the categories in his pages, he checks the list: if it exists, he uses it, if not, he queries again and fills it in. He has it set up to invalidate the list whenever he inserts to the database, but depending on your usage you could create a global timeout variable to keep track of when you need to re-query next. If you're doing something complicated, this could get out of hand, but if it's just a list of the most recent posts or something, I think it would be fine.

Caching strategy to reduce load on web application server

What is a good tool for applying a layer of caching between a webserver and an application server.
Basic Requirements:
The application server needs a way to remove items from the cache and put items in the cache with an expiration date.
The webserver needs a way to pull items out of the cache in a very light-weight, fast manner without requiring thread allocation on the application server.
It does not neccessarily need to be a distributed cache (accessible from multiple machines), but it wouldn't hurt.
Strategies I have considered:
Static file caching. Request comes in, gets hashed, if a file exists we serve it, if not we route the request to the app server. Is high I/O a problem or file locking problems due to concurrency? Is it accurate that the file system is actually very fast due to kernel level caching in memory.
Using a key-value DB like mongodb, or redis. This would store the finished HTML/JSON fragments in db. The webserver would be equipped to read from the DB and route to the app server if needed. The app server would be equipped to insert/remove from the DB.
A memory cache like memcached or Varnish (don't know much about Varnish). My only concern with memcached is that I'm going to want to cache 3 - 10 gigabytes of data at any given time, which is more than I can safely allocate in memory. Does memcached have a method to spill to the filesystem?
Any thoughts on some techniques and pitfalls when trying this type of caching layer?
You can also use GigaSpaces XAP in memory data grid for caching and even hosting your web application. You can choose just the caching option or combine the power of two and gain single management of your environment along other things.
Unlike the key value pair approach you suggested, using GigaSpaces XAP you'll be able to have complex queries such as SQL, object based temples and much more. In your caching scenario you should check out more specifically the local cache related features.
Local Cache
Web Container
Disclaimer, I am a developer in GigaSpaces.
Eitan
Just to answer this from the POV of using Coherence (http://coherence.oracle.com/):
1. The application server needs a way to remove items from the cache and put items in the cache with an expiration date.
// remove one item from cache
cache.remove(key);
// remove multiple items from cache
cache.keySet().removeAll(keylist);
2. The webserver needs a way to pull items out of the cache in a very light-weight, fast manner without requiring thread allocation on the application server.
// access one item from cache
Object value = cache.get(key);
// access multiple items from cache
Map mapKV = cache.getAll(keylist);
3. It does not neccessarily need to be a distributed cache (accessible from multiple machines), but it wouldn't hurt.
Elastic. Just add nodes. Auto-discovery. Auto-load-balancing. No data loss. No interruption. Every time you add a node, you get more data capacity and more throughput.
Automatic high availability (HA). Kill a process, no data loss. Kill a server, no data loss.
A memory cache like memcached or Varnish (don't know much about Varnish). My only concern with memcached is that I'm going to want to cache 3 - 10 gigabytes of data at any given time, which is more than I can safely allocate in memory. Does memcached have a method to spill to the filesystem?
Use both RAM and flash. Transparently. Easily handle 10s or even 100s of gigabytes per Coherence node (e.g. up to a TB or more per physical server).
For the sake of full disclosure, I work at Oracle. The opinions and views expressed in this post are my own, and do not necessarily reflect the opinions or views of my employer.

Apple "Avoid writing cache files to disk." - where should I save cache?

In Apple's Performance Tuning Guide there is a writing:
Avoid writing cache files to disk. The only exception to this rule is
when your app quits and you need to write state information that can
be used to put your app back into the same state when it is next
launched.
I'm saving a lot of cache files in Library/Cache directory, because my app deals with web services, and nobody likes the white screen. What does this statement mean? I shouldn't do this or what?
Thank you!
Well, "avoid" means "avoid if possible, because writing/reading is relatively slow". If by caching a small amount of data (I assume the definitions of the web services retrieved from somewhere?) you can improve the performance of your app's startup, by all means do it. If you are only using this data for one run of your application, and the next run will re-fetch this anyway, use an in-memory cache.
Library\Caches is basically designed to store data you fetched from somewhere which provides performance boosts when stored locally.
The text from Apple feels like more a general guideline against overusing storage if you don't need data to persist from one run of your application to another.

How to link MemCached server together?

I'm looking into using MemCached for a web application I am developing and after researching MemCached over the past few days, I have come across a question I could not find the answer to.
How do you link Memcached server together or how do you replicate data between MemCached server?
Additionally: Is this functionality controlled by the servers or the clients and how?
when you set several servers, the client libraries use a first hash to pick one where to store each key/data pair. that means that there's no replication, and also that every client has to use the same set of servers.
pros:
almost zero overhead, storage and bandwidth grow linearly.
server code is kept simple and reliable.
cons:
any change in the set of servers (one goes down, or you add a new one) suddenly invalidates (almost) the whole cache.
you have to be sure to use the same algorithm on every client.
if you have control to the client's code, you can simply store each key/data pair twice on two servers. just be sure to search on the same places when reading from a different client.
I've used BeITMemcached and in that you create an instance of MemcacheClient and set the servers you want to use, just as strings.
At that point the client itself determines which of the servers it has available to put different items into. You never know which an item will be in.
Check here to see how the servers handle failover.
The easiest thing is to have a repopulate mechanism. In my case, I store several hundred objects in memcache which come out of a database. I can just call repopulate and put them all back in there. Whenever I add, update or delete them to the database, I make those same calls to memcache.
http://repcached.lab.klab.org/
Also, the PHP PECL memcache client can replicate data to multiple servers, see memcache.redundancy.
It sounds like you wish to have caches that can cope with machines rebooting etc if so…
In a lot of case (assuming you are not writing Facebook) a RDMS is fast enough for caching. Just create a table that has a key and a blob column. If the RDBS server has enough ram, all the data will be in RAM and just saved to disk so as to allow recovery.
Remember this could be a separate server(s) from your main database server.
If you wish to get more fancy and are using a high-end RDMS, you may be able to set up change notifications on the queries that are used to build the “cached data” that delete out-of-date rows from the cache.
Someone you can set up triggers to clear invalid rows from the cache, however this can be very complex very quickly.
Memcached does not provide replication property. To do that, you need to add the server to memcached client server list and then hit the DB for the data to be stored in that particular server.
You should seriously consider CouchBase. It uses the memcached protocol, provides nearly the same speed, and delivers the automatic replication you're looking for. It also persists to disk so your cache will never be cold.