Perl: which is the best module to cache data for a web application?

I need some method to cache data fetched (or pre-fetched) from the internet in a web service written in Perl.
The collected data is parsed into XML.
My goal is to decrease response time of my service to user requests.
I see there's a bunch of modules available: which performs best in terms of usability and efficiency (in that order of importance)?
UPDATE:
I'm evaluating CHI, Cache, and Cache::Cache (which I suppose is quite outmoded...)
I'm in a mod_perl2 Linux environment: in this case, does a memory cache make any sense?

CHI is a good base and can handle most of the caching you are likely to need, with the back-end chosen to suit your setup. We used to use Cache::Cache, and on occasion even Memoize, but CHI turned out to be the best and has certainly superseded Cache::Cache.
Now for the caching back-end. This will depend a lot on your web setup, and even on your operating system and hardware. If you have plenty of memory and/or are running a small number of server processes, a memory-based cache is solid and easy. If you are running many processes and still have decent memory, you probably want a shared memory cache (this works fine on Linux and UNIX systems, but may not be supported on Windows). If you have less memory, a file-based cache will work, but it will be slower.
However, if you are only prefetching, you might just as well use a database. Even if the caching is mainly there to avoid network requests, an SQL database will still provide the performance improvement you need; there is a CHI driver for that too: CHI::Driver::DBI. Most databases do very decent caching of their own anyway.
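For illustration only, here is a minimal CHI sketch, assuming a made-up cache directory and a made-up fetch_and_parse_to_xml() routine; the point is that the driver can be swapped (Memory, File, FastMmap, DBI, ...) without touching the calling code:
    use strict;
    use warnings;
    use CHI;

    # File-backed cache; only the driver line changes if you switch back-ends.
    my $cache = CHI->new(
        driver   => 'File',
        root_dir => '/tmp/myapp-cache',    # hypothetical path
    );

    my $url = 'http://example.com/feed';   # hypothetical source
    # compute() returns the cached XML if present; otherwise it runs the
    # code ref, stores the result for 10 minutes, and returns it.
    my $xml = $cache->compute( $url, '10 minutes', sub {
        fetch_and_parse_to_xml($url);      # hypothetical fetch/parse routine
    });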
UPDATE:
Cache and Cache::Cache are both pretty old. CHI has been designed to replace Cache::Cache. Of these, CHI is definitely the best.
Yes, even under mod_perl a memory cache can be useful, but you can't afford to let it grow too large. mod_perl embeds Perl in the server processes, so each server process will have its own copy of the cache. This reduces its effectiveness, but if you have a small number of things to cache and they are used often, it is still worth doing.
Assuming you're on a UNIX-type system, a shared memory cache or a database cache will likely be your best bet, depending on how many things you are likely to need to cache, and their size. If your cache will become huge, and if network latency is the main part of the problem, a database cache backend will be fine. If the cache will always be relatively small, memory or shared memory should suit.
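For the shared-memory case, a hedged sketch using CHI's FastMmap driver (one mmap'ed file shared by all mod_perl children; the path and size are illustrative):
    use strict;
    use warnings;
    use CHI;

    # All Apache children map the same cache file, so they share entries
    # instead of each keeping a private in-process copy.
    my $cache = CHI->new(
        driver     => 'FastMmap',
        root_dir   => '/var/cache/myapp',   # hypothetical path
        cache_size => '16m',                # passed through to Cache::FastMmap
    );

    my $xml = '<weather>...</weather>';     # whatever you fetched
    $cache->set( 'feed:weather', $xml, '5 minutes' );
    my $cached = $cache->get('feed:weather');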

Related

Does concatenation improve I/O performance?

HTTP/2 was a very nice improvement: we no longer pay the network overhead of opening multiple connections to download multiple files, and the recommendation now is to stop concatenating JS and CSS files so that they cache better. But I'm not sure whether, from an I/O perspective, having to read multiple files from the hard drive will hurt performance.
I don't know whether web servers like Apache or nginx cache frequently requested files in RAM; if they do, that would avoid the I/O overhead and I could stop concatenating and stop worrying about this.
No, having to read multiple files from the hard disk won't worsen performance, especially if you are using the current crop of SSDs. Their performance is so good that network bandwidth, SSL and everything else become the bottlenecks.
Also, operating systems automatically cache recently accessed files in RAM, so the feature you are asking for doesn't need to be built into the web server: it is already built into the operating system. This works even if you are using magnetic drives.
There is one thing, though. We (at shimmercat.com) have noticed that compression sometimes suffers when you serve many small files instead of one concatenated file. In that sense, it would have been nice if HTTP/2 had included a compression format like SDCH ... But this should matter more for particularly small files than for big ones. All in all, I think the caching benefits for small files are worth it.

Optimized environment for mongo

I have a RHEL Linux server (VM) with a 4-core processor and 8 GB of RAM running the applications below:
- an Apache Karaf container
- an Apache tomcat server
- an ActiveMQ server
- and the mongod server (either primary or secondary).
I often see mongod consuming nearly 80% of the CPU. CPU and memory usage are overshooting most of the time, and this has made me doubt whether my hardware configuration is too small for running this many components.
Please let me know if it is OK to run MongoDB like this on a shared server.
The question is too broad and the answer depends on too many variables, but I'll try to give you an overall sense of it.
Can you use all these services together on the same machine at minimal load? For sure. It's not clear where the other shards reside, but it will work either way. You didn't provide your disk specs, which matter a lot for a database server, but again it will work at minimal load.
Can you use this setup under heavy load? Not the best idea. It's probably better to have separate servers handling these services.
Monitor overall server load: CPU, memory, I/O. Check the mongo logs for slow queries. If your queries are supposed to run fast and they don't, you'll need more hardware.
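As a hedged illustration (the database name and thresholds are invented), you can switch on MongoDB's profiler and read back the slowest operations from Perl:
    use strict;
    use warnings;
    use MongoDB;

    my $client = MongoDB->connect('mongodb://localhost');
    my $db     = $client->get_database('mydb');    # hypothetical database name

    # Profiling level 1: record every operation slower than 100 ms.
    $db->run_command( [ profile => 1, slowms => 100 ] );

    # Later: inspect the ten slowest recorded operations.
    my @slow = $db->get_collection('system.profile')
                  ->find( {}, { sort => { millis => -1 }, limit => 10 } )
                  ->all;
    printf "%s took %d ms\n", $_->{ns}, $_->{millis} for @slow;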
Nobody can really tell you how much load a specific server configuration can handle. You need at least 512 MB of RAM and one CPU to get going these days, but you will hit the limits very soon. It all depends on how many users you have, what kinds of queries they run, and how much data those queries touch.
Can you run MongoDB alongside other applications on a single server? Well, if you are having memory or CPU issues in your current configuration, then you will likely need to address something. But as to "can you?": if it is not going to affect you, then of course you can.
Should you do this? Most people would firmly agree that you should not, and the same goes for most of the other applications you are running on that one machine.
There are various reasons: process isolation, resource allocation, security, and far too many more for a short answer to go into. And certainly, where it becomes a problem, you should address the issue by seeking a new configuration.
As for Mongo specifically: most people would not think twice about running their SQL database on dedicated hardware, and the choice for Mongo should be no different.
I have also suggested this be moved to Server Fault, as it is not a programming question suited to Stack Overflow.

CouchDB vs MongoDB (memory utilization)

Which has better performance in low-memory environments (sub 1 GB)?
I've used MongoDB in the past, and it seems to struggle memory-wise with a 250 MB database on a 512 MB box. Would the same be true of CouchDB?
CouchDB uses very little memory. It has been embedded in iOS and Android more-or-less unmodified—Erlang and all.
CouchDB works entirely through file I/O, delegating caching to the operating system (the filesystem cache). A typical CouchDB server shows a very small amount of "used" memory but a very large amount used for "cache". On a dedicated CouchDB server, that cached memory is basically CouchDB's data; managing and reallocating those resources is left to the OS, which is where it belongs.
In other words, CouchDB performs excellently in low-memory environments. Even embedded environments (e.g. mobile) are still very fast, because the low memory is somewhat offset by low-latency storage (solid-state disks).

The benefits of deploying multiple instances for serving/data/cache

Although I have a lot of experience writing code, I don't have much experience deploying things. I am writing a project that uses MongoDB for persistence, Redis for meta-caching, and Play for serving pages. I am deciding whether to buy a dedicated server or multiple small/medium instances from Amazon/Linode (one each for Mongo, Redis, and Play). I have thought about the trade-offs below; I wonder if anyone can add to the list or provide further insight. I am leaning toward (B), buying two sets of instances from Linode and Amazon, so that if one provider has an outage the system can fail over to the other. Also, if anyone has tips for deploying a Scala/Maven cluster, or tools to do so, that would be much appreciated.
A. put everything in one instance
Pros:
faster speed between database and page servlet (same host).
cheaper.
fewer endpoints to secure.
Cons:
harder to manage (in my opinion).
harder to upgrade a single module; if there are installation issues, they might bring down the whole system.
B. put each module (mongo,redis,play) in different instances
Pros:
sharding is easier.
easier to create cluster for a single purpose. (i.e. cluster of redis)
easier to allocate resources between modules.
less likely everything will fail at once.
Cons:
bandwidth between modules -> $
each connection and endpoint must be secured.
I can only comment on the technical aspects (not cost, serviceability, etc.).
It is not mentioned whether the dedicated instance is a physical box, or simply a large VM. If the application generates a lot of roundtrips to MongoDB or Redis, then the difference will be quite significant.
With a VM, the cost of I/Os, OS scheduling and system calls is higher. These elements tend to represent an important part in the performance cost of efficient remote data stores like MongoDB or Redis, and the virtualization toll is higher for them.
From a system point of view, I would not put MongoDB and Redis/Play on the same box if the MongoDB database is expected to be larger than the available memory. MongoDB maps data files in memory, and relies on the OS to perform memory swapping. It is designed for this. The other processes are not. Swapping induced by MongoDB will have catastrophic consequences on Redis and Play response time if they are all on the same box. So I would at least separate MongoDB from Redis/Play.
If you plan to use Redis for caching, it makes sense to keep it on the same box as the Play server. Redis will use memory but little CPU; Play will use CPU but not much memory, so they seem a good fit. Also, I'm not sure whether it is possible from Play, but if you use a unix domain socket to connect to Redis instead of the TCP loopback, you can achieve about 50% more throughput for free.
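I can't say how Play exposes this, but as an illustration of the idea in Perl (the socket path should match whatever unixsocket line you put in redis.conf):
    use strict;
    use warnings;
    use Redis;

    # TCP loopback: still goes through the network stack on the same host.
    my $tcp = Redis->new( server => '127.0.0.1:6379' );

    # Unix domain socket: skips TCP entirely; needs
    # "unixsocket /var/run/redis/redis.sock" in redis.conf.
    my $redis = Redis->new( sock => '/var/run/redis/redis.sock' );

    my $html = '<html>...</html>';                  # whatever you are caching
    $redis->set( 'page:home', $html, 'EX', 120 );   # expire after 2 minutes
    my $cached = $redis->get('page:home');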

REST service with load balancing

I've been considering the advantages of REST services, the whole statelessness and session affinity "stuff". What strikes me is that if you have multiple deployed versions of your service on a number of machines in your infrastructure, and they all act on a given resource, where is the state of that resource stored?
Would it make sense to have a single host in the infrastructure that provides a distributed cache, so that any state changed inside a service is simply fetched from and written to the cache? This would allow any number of services, deployed for load-balancing reasons, to all see the same view of each resource's state.
If you're designing a system for high load (which usually implies high reliability), having a single point of failure is never a good idea. If the service providing the consistent view goes down, at best your performance decreases drastically as the database is queried for everything and at worst, your whole application stops working.
In your question, you seem to be worried about consistency. If there's something to be learned about eBay's architecture, it's that there is a trade-off to be made between availability/redundancy/performance vs consistency. You may find 100% consistency is not required and you can get away with a little "chaos".
A distributed cache (like memcached) can be used as the backing for a distributed hash table, an approach that has been used extensively to create scalable infrastructures. If implemented correctly, caches can be redundant, and cache nodes can join and leave the ring dynamically.
REST is also inherently cacheable, as the HTTP layer can be cached with the appropriate use of headers (ETags) and software (e.g. Squid as a reverse proxy). The one drawback of specifying caching through headers is that it relies on the client interpreting and respecting them.
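As a rough sketch of the header side (the resource lookup and values are invented), a plain PSGI app can emit those caching hints itself:
    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);

    my $app = sub {
        my $env  = shift;
        my $body = fetch_resource( $env->{PATH_INFO} );   # hypothetical lookup
        my $etag = '"' . md5_hex($body) . '"';

        # Let clients and reverse proxies (e.g. Squid) revalidate cheaply.
        if ( ( $env->{HTTP_IF_NONE_MATCH} // '' ) eq $etag ) {
            return [ 304, [ 'ETag' => $etag ], [] ];
        }
        return [ 200,
                 [ 'Content-Type'  => 'application/xml',
                   'ETag'          => $etag,
                   'Cache-Control' => 'max-age=120' ],    # fresh for 2 minutes
                 [ $body ] ];
    };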
However, to paraphrase Phil Karlton, caching is hard. You really have to be selective about the data you cache, when you cache it, and how you invalidate that cache. Invalidation can be done in the following ways:
Through timer-based expiry (cache for 2 minutes, then reload)
When an update comes in, invalidating all caches containing the relevant data.
I'm partial to the timer-based approach, as it's simpler to implement and you can say with relative certainty how long stale data will live in the system (e.g. company details will be updated within 2 hours, stock prices within 10 seconds).
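Tying that back to the distributed cache mentioned above, a hedged sketch using CHI over memcached with a blanket time-to-live (the addresses, namespace and TTL are illustrative, and lookup_price() is hypothetical):
    use strict;
    use warnings;
    use CHI;

    # Every service node talks to the same memcached pool; the 10-second
    # TTL bounds how long stale data can live.
    my $cache = CHI->new(
        driver     => 'Memcached',          # CHI::Driver::Memcached
        servers    => [ '10.0.0.1:11211', '10.0.0.2:11211' ],
        namespace  => 'stock-prices',
        expires_in => '10 seconds',
    );

    my $price = $cache->get('ACME');
    unless ( defined $price ) {
        $price = lookup_price('ACME');      # hypothetical backend call
        $cache->set( 'ACME', $price );      # inherits the 10-second TTL
    }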
Finally, high load also depends on your use case, and depending on the volume of transactions none of this may apply. A methodology (if you will) might be the following:
Make sure the system is functional without caching (Does it work)
Does it meet performance criteria (e.g. requests/sec, uptime goals)
Optimize the bottlenecks
Implement caching where required
After all, you may not have a performance problem in the first place, and you may be able to get away with a single database and a good backup strategy.
I think the more traditional view of load balancing web applications is that you would have your REST service on multiple application servers and they would retrieve resource data from single database server.
However, with the use of hypermedia, REST services can easily partition the application vertically, so that some resources come from one service and some from another service on a different server. This would allow you to scale to some extent, depending on your domain, without having a single data store. Obviously with REST you would not be able to do transactional updates across these services, but there are definitely scenarios where this partitioning is valuable.
If you are looking at architectures that need to really scale then I would suggest looking at Greg Young's stuff on CQS Architecture (video) before attempting to tackle the problems of a distributed cache.