Memcaching sphinx results - good idea or bad idea? - memcached

We maintain a fairly large sphinx store. about 3.3 million records. we also maintain a fairly well distributed memcached base set over 4 servers.
We were just wondering if it is advisable to store sphinx results for various queries in memcached, which would be fairly easy to implement.
While I understand this can be a somewhat broad question, but just any general ideas?
Also worth mentioning, the memcached connection is always made in the script that accesses sphinx. total connection times (sphinx + memcached vs just memcached) could be improved. then again, all queries that do not result in a memcached hit would end up having to send a write to memcached.
So, would it be a good idea to store sphinx results in memcached for future use?
Thanks!

It is depends on your situation and on your needs. In your projects we had the same architecture, where memcache caches sphinx search results. But, in general, in our projects, it was minor chance that needed search query results are already in cache. That was about 10% of all queries because of large variety of queries and not guaranteed long-time storing data in memcache. Further more, sphinx usually searches very fast. So, we decided not to use cache in search.
So, you need to do tests. They will tell you.

Related

Memcache Vs MongoDB for auto-complete

We have RDBMS which also runs our auto-complete queries. I'm planning to reduce the load on RDBMS by re-directing auto-complete queries to MongoDB. The other option I have is use Memcache. The SQL queries are of nature "where lastName like 'abc%'. Can I query Memcache with Like clause? Also, my data will also be updated frequently so if I use Memcache, it needs to stay updated accordingly. Can anyone suggest if Memcache or any other cache is better over NoSQL? What are the advantages, if any, of using cache in this case and which one is preferred approach?
We had the same issue. Memcache is not the correct tool to do this as the data on Memcache is not persistent. And the number of permutations eventually become so huge that it does not make sense storing so much of data to Memcache.
We are using elasticsearch to handle auto suggest queries. It is extremely fast. It gives us most of the results in under 5ms.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-completion.html
You can Refer this link.
Let me know if you have any questions.

Which NoSQL database for extremely high volumes of data

I'm looking at NoSQL for extremely high volumes of data. We're storing cached versions of web page text in MySQL at the moment, but it seems like the database will get huge very quickly.
My requirements are:
Durability, must not lose data on flushes/writes
Very fast read, reasonably fast write
Fully consistent replication
Preferably, in-memory plus an eventual disk write
I'm looking at: MongoDB, Redis, Raik, and Cassandra right now.
Which best fits my requirements?
I have experience with Redis and MongoDB, but would not recommend either for your use case. Redis is awesome in every regard, but since it's RAM-only and has no clustering features (yet, they are in development), it doesn't scale very well. MongoDB I wouldn't ever use again for anything that needs anything but a small replica set.
Basically, MongoDB is immature and completely unsuitable for any kind of high volume, high performance requirements. It has a global write lock which is held during disk flushes, which means that performance can vary wildly depending on what you do. In practice it makes updates that grow documents impossible, and you need to be very careful with deletes, too. Speaking of deletes, they fragment the database severely, so if you do a lot of deletes your performance is going to suffer.
Sharding in 1.8.0 through 1.8.1 was a disaster. There were complete show stopper bugs that should never have made it into a stable release. Configuration wasn't flushed properly and it was very easy to get your database into a bad state so that chunks never moved off of the primary shard. 1.8.2 solves most of them and seems more stable, but I don't trust the sharding implementation one bit. Add to this that sharding is hard even when everything works, it's not always easy to select a natural shard key, and if you don't sharding will cause you much grief.
MongoDB is really easy to work with and the feature set is really nice. The documentation, the drivers and the community are all great. MongoDB works super as a replacement for MySQL, but don't use it for anything that needs to scale out.
We're currently looking at moving to Cassandra. I find the dynamo model (e.g. no master nodes; write and read anywhere; simply add nodes to grow the cluster) compelling and the features are more or less right for us. The data model is schema less just like MongoDB, although a little more limited (you can choose between one or two level hashes, basically). I'm sure the community is good once you get into it, but so far I find it hard to find good information on how to solve common problems, and the documentation is lacking. Most of the information you find on blogs is a year old, and a lot of things have happened since then (0.7 and 0.8 seem to be really significant updates both, but most things you find are about 0.6). The drivers are also not very mature or well documented, from what I've seen so far, and everyone seems to be squabbling about whether Thrift, Avro or CQL is what should be used (and that has changed from 0.6 to 0.7 to 0.8).
Riak is interesting, for the same reasons as Cassandra, but for us a pure key-value-store is not enough, we need to be able to update without first doing a read. With Riak this isn't possible since the values are just blobs. This sounds like it wouldn't be an issue for you though.
HBase is another contender. It seems like a pain to set up and run because of the many different pieces, ZooKeeper, HDFS, etc. But the data model is similar to Cassandra (columnar, i.e. one level hashes), which works well for us, but may not be important for you. It seems tried and true, but as with MongoDB you have to watch out for sharding issues, you must put some thought into your keys or you get into trouble.
There is also CouchDB, Project Voldemort and countless other possible choices. I think that if you are serious about "extremely high volumes of data" then it's between Cassandra, Riak and HBase. Strike Riak if pure key-value-storage isn't enough. Depending on what you mean by "fully consistent replication" then Cassandra and Riak are out, because there is a possibility (not necessarily big, and tunable) of reading a stale value.
In the end you obviously have to try it out on your particular use case, so all you really should take home from this answer is: don't bother with MongoDB.
Store the cached versions in MemCache instead of MySQL. It will eliminate most writes. Writing to MySQL is bad, because it kills the query cache. When you cache the pages in MemCache, you will have far less writes to the database, and you'll have less reading pressure too. You can cache the result of complex queries, or cache entire pages as you like.
Maybe it won't be as fast as Cassandra, but it will give you an enormous boost compared to your current situation with only MySQL. And you won't have to rewrite your entire application.
memcachedb - memcached protocol, BDB storage, replication etc
Handlersocket - MySql InnoDB plugin.
Oracle memcached InnoDB plugin
RavenDB can store up to 16TB of data per node, and you can have several nodes per machine acting as one database using its built-in sharding support. Thats as huge as it gets.
Durability, fastness, replication is all there, and running in memory is supported too (but not recommended if you want to scale to 16TB per node).
For extremely high volume data, it's clear that Cassandra and hadoop/hbase are far superior than all others for this task. Cassandra proved itself on large clusters like 400 nodes. rdms dbs cannot scale easily, also mongo has some problems when node counts start to increase http://www.nosqlbenchmarking.com/2011/05/paper-on-elasticity-and-scalability-for-acm-socc-2011/
Serdar

Is Memcache recommended when using MongoDB?

I would like to know if Memcache is recommended when using a NoSQL database like mongoDB.
The concept of using memcache stems from the idea that you have "extra RAM" sitting around somewhere. Both MongoDB and MySQL (and most DBs) will take every meg of RAM that they can get.
In the case of the very common MySQL / Memcache, it is very well documented that using Memcache is more about reducing query load on the server than it is about speeding up queries. A good memcache implementation basically just tries to keep the most common data in memory so that the database server can churn away on bigger stuff.
In fact, it's been my experience that use of memcache generally becomes a reliance on memcache to maintain system performance.
So back to the original question, where do you have extra RAM?
If you have extra RAM on web servers, you may be able to use Memcache. Of course, you could also run Mongo locally on the web server. Just slave the data you need from the master.
If you have extra RAM on other computers, then there's not really a point in using memcache. Just add more nodes to your MongoDB replica set or shard. This is where MongoDB actually shines. Because of sharding / replication, you can add more RAM to Mongo Horizontally to increase performance. With SQL it's very difficult to "just add more servers" because joins don't scale very well. But with Mongo, it's quite possible to simply "add more nodes" to a problem.
MongoDB stores everything in memory anyway and works in a similar vein, being a key-value based system, however I believe MongoDB is more flexible, as it allows for storing BSON objects within themselves.
(Just for clarification, MongoDB uses BSON, a specialised form of JSON, for storing all its data, which includes objects within objects.)
At first no. If you run into performance problems later add a caching layer (memcache). But you won't gain anything if you're going to use Redis for example, as Redis already stores everything in memory.
The answer would depend on your use cases.
In general, accessing RAM is orders of magnitude faster than accessing disk.
Even the fastest SSD drives are about 100 times slower to access than RAM.
Now, I don't know if Mongo has a caching system in place (most likely it does), or what the eviction policy is, but as a programmer i would prefer a cache where i can store/retrieve and delete items at will. Therefore i would prefer using a caching solution even with Mongo.
In summary, it really depends what you are using these solutions for. There is no one answer to cover all possible uses.

MongoDB vs. Redis vs. Cassandra for a fast-write, temporary row storage solution

I'm building a system that tracks and verifies ad impressions and clicks. This means that there are a lot of insert commands (about 90/second average, peaking at 250) and some read operations, but the focus is on performance and making it blazing-fast.
The system is currently on MongoDB, but I've been introduced to Cassandra and Redis since then. Would it be a good idea to go to one of these two solutions, rather than stay on MongoDB? Why or why not?
Thank you
For a harvesting solution like this, I would recommend a multi-stage approach. Redis is good at real time communication. Redis is designed as an in-memory key/value store and inherits some very nice benefits of being a memory database: O(1) list operations. For as long as there is RAM to use on a server, Redis will not slow down pushing to the end of your lists which is good when you need to insert items at such an extreme rate. Unfortunately, Redis can't operate with data sets larger than the amount of RAM you have (it only writes to disk, reading is for restarting the server or in case of a system crash) and scaling has to be done by you and your application. (A common way is to spread keys across numerous servers, which is implemented by some Redis drivers especially those for Ruby on Rails.) Redis also has support for simple publish/subscribe messenging, which can be useful at times as well.
In this scenario, Redis is "stage one." For each specific type of event you create a list in Redis with a unique name; for example we have "page viewed" and "link clicked." For simplicity we want to make sure the data in each list is the same structure; link clicked may have a user token, link name and URL, while the page viewed may only have the user token and URL. Your first concern is just getting the fact it happened and whatever absolutely neccesary data you need is pushed.
Next we have some simple processing workers that take this frantically inserted information off of Redis' hands, by asking it to take an item off the end of the list and hand it over. The worker can make any adjustments/deduplication/ID lookups needed to properly file the data and hand it off to a more permanent storage site. Fire up as many of these workers as you need to keep Redis' memory load bearable. You could write the workers in anything you wish (Node.js, C#, Java, ...) as long as it has a Redis driver (most web languages do now) and one for your desired storage (SQL, Mongo, etc.)
MongoDB is good at document storage. Unlike Redis it is able to deal with databases larger than RAM and it supports sharding/replication on it's own. An advantage of MongoDB over SQL-based options is that you don't have to have a predetermined schema, you're free to change the way data is stored however you want at any time.
I would, however, suggest Redis or Mongo for the "step one" phase of holding data for processing and use a traditional SQL setup (Postgres or MSSQL, perhaps) to store post-processed data. Tracking client behavior sounds like relational data to me, since you may want to go "Show me everyone who views this page" or "How many pages did this person view on this given day" or "What day had the most viewers in total?". There may be even more complex joins or queries for analytic purposes you come up with, and mature SQL solutions can do a lot of this filtering for you; NoSQL (Mongo or Redis specifically) can't do joins or complex queries across varied sets of data.
I currently work for a very large ad network and we write to flat files :)
I'm personally a Mongo fan, but frankly, Redis and Cassandra are unlikely to perform either better or worse. I mean, all you're doing is throwing stuff into memory and then flushing to disk in the background (both Mongo and Redis do this).
If you're looking for blazing fast speed, the other option is to keep several impressions in local memory and then flush them disk every minute or so. Of course, this is basically what Mongo and Redis do for you. Not a real compelling reason to move.
All three solutions (four if you count flat-files) will give you blazing fast writes. The non-relational (nosql) solutions will give you tunable fault-tolerance as well for the purposes of disaster recovery.
In terms of scale, our test environment, with only three MongoDB nodes, can handle 2-3k mixed transactions per second. At 8 nodes, we can handle 12k-15k mixed transactions per second. Cassandra can scale even higher. 250 reads is (or should be) no problem.
The more important question is, what do you want to do with this data? Operational reporting? Time-series analysis? Ad-hoc pattern analysis? real-time reporting?
MongoDB is a good option if you want the ability to do ad-hoc analysis based on multiple attributes within a collection. You can put up to 40 indexes on a collection, though the indexes will be stored in-memory, so watch for size. But the result is a flexible analytical solution.
Cassandra is a key-value store. You define a static column or set of columns that will act as your primary index right up front. All queries run against Cassandra should be tuned to this index. You can put a secondary on it, but that's about as far as it goes. You can, of course, use MapReduce to scan the store for non-key attribution, but it will be just that: a serial scan through the store. Cassandra also doesn't have the notion of "like" or regex operations on the server nodes. If you want to find all customers where the first name starts with "Alex", you'll have to scan through the entire collection, pull the first name out for each entry and run it through a client-side regex.
I'm not familiar enough with Redis to speak intelligently about it. Sorry.
If you are evaluating non-relational platforms, you might also want to consider CouchDB and Riak.
Hope this helps.
Just found this: http://blog.axant.it/archives/236
Quoting the most interesting part:
This second graph is about Redis RPUSH vs Mongo $PUSH vs Mongo insert, and I find this graph to be really interesting. Up to 5000 entries mongodb $push is faster even when compared to Redis RPUSH, then it becames incredibly slow, probably the mongodb array type has linear insertion time and so it becomes slower and slower. mongodb might gain a bit of performances by exposing a constant time insertion list type, but even with the linear time array type (which can guarantee constant time look-up) it has its applications for small sets of data.
I guess everything depends at least on data type and volume. Best advice probably would be to benchmark on your typical dataset and see yourself.
According to the Benchmarking Top NoSQL Databases (download here)
I recommend Cassandra.
If you have the choice (and need to move away from flat fies) I would go with Redis. Its blazingly fast, will comfortably handle the load you're talking about, but more importantly you won't have to manage the flushing/IO code. I understand its pretty straight forward but less code to manage is better than more.
You will also get horizontal scaling options with Redis that you may not get with file based caching.
I can get around 30k inserts/sec with MongoDB on a simple $350 Dell. If you only need around 2k inserts/sec, I would stick with MongoDB and shard it for scalability. Maybe also look into doing something with Node.js or something similar to make things more asynchronous.
The problem with inserts into databases is that they usually require writing to a random block on disk for each insert. What you want is something that only writes to disk every 10 inserts or so, ideally to sequential blocks.
Flat files are good. Summary statistics (eg total hits per page) can be obtained from flat files in a scalable manner using merge-sorty map-reducy type algorithms. It's not too hard to roll your own.
SQLite now supports Write Ahead Logging, which may also provide adequate performance.
I have hand-on experience with mongodb, couchdb and cassandra. I converted a lot of files to base64 string and insert these string into nosql.
mongodb is the fastest. cassandra is slowest. couchdb is slow too.
I think mysql would be much faster than all of them, but I didn't try mysql for my test case yet.

Has anyone used an object database with a large amount of data?

Object databases like MongoDB and db4o are getting lots of publicity lately. Everyone that plays with them seems to love it. I'm guessing that they are dealing with about 640K of data in their sample apps.
Has anyone tried to use an object database with a large amount of data (say, 50GB or more)? Are you able to still execute complex queries against it (like from a search screen)? How does it compare to your usual relational database of choice?
I'm just curious. I want to take the object database plunge, but I need to know if it'll work on something more than a sample app.
Someone just went into production with a 12 terabytes of data in MongoDB. The largest I knew of before that was 1 TB. Lots of people are keeping really large amounts of data in Mongo.
It's important to remember that Mongo works a lot like a relational database: you need the right indexes to get good performance. You can use explain() on queries and contact the user list for help with this.
When I started db4o back in 2000 I didn't have huge databases in mind. The key goal was to store any complex object very simply with one line of code and to do that good and fast with low ressource consumption, so it can run embedded and on mobile devices.
Over time we had many users that used db4o for webapps and with quite large amounts of data, going close to todays maximum database file size of 256GB (with a configured block size of 127 bytes). So to answer your question: Yes, db4o will work with 50GB, but you shouldn't plan to use it for terabytes of data (unless you can nicely split your data over multiple db4o databases, the setup costs for a single database are negligible, you can just call #openFile() )
db4o was acquired by Versant in 2008, because it's capabilites (embedded, low ressource-consumption, lightweight) make it a great complimentary product to Versant's high-end object database VOD. VOD scales for huge amounts of data and it does so much better than relational databases. I think it will merely chuckle over 50GB.
MongoDB powers SourceForge, The New York Times, and several other large databases...
You should read the MongoDB use cases. People who are just playing with technology are often just looking at how does this work and are not at the point where they can understand the limitations. For the right sorts of datasets and access patterns 50GB is nothing for MongoDB running on the right hardware.
These non-relational systems look at the trade-offs which RDBMs made, and changed them a bit. Consistency is not as important as other things in some situations so these solutions let you trade that off for something else. The trade-off is still relatively minor ms or maybe secs in some situations.
It is worth reading about the CAP theorem too.
I was looking at moving the API I have for sure with the stack overflow iphone app I wrote a while back to MongoDB from where it currently sits in a MySQL database. In raw form the SO CC dump is in the multi-gigabyte range and the way I constructed the documents for MongoDB resulted in a 10G+ database. It is arguable that I didn't construct the documents well but I didn't want to spend a ton of time doing this.
One of the very first things you will run into if you start down this path is the lack of 32 bit support. Of course everything is moving to 64 bit now but just something to keep in mind. I don't think any of the major document databases support paging in 32 bit mode and that is understandable from a code complexity standpoint.
To test what I wanted to do I used a 64 bit instance EC2 node. The second thing I ran into is that even though this machine had 7G of memory when the physical memory was exhausted things went from fast to not so fast. I'm not sure I didn't have something set up incorrectly at this point because the non-support of 32 bit system killed what I wanted to use it for but I still wanted to see what it looked like. Loading the same data dump into MySQL takes about 2 minutes on a much less powerful box but the script I used to load the two database works differently so I can't make a good comparison. Running only a subset of the data into MongoDB was much faster as long as it resulted in a database that was less than 7G.
I think my take away from it was that large databases will work just fine but you may have to think about how the data is structured more than you would with a traditional database if you want to maintain the high performance. I see a lot of people using MongoDB for logging and I can imagine that a lot of those databases are massive but at the same time they may not be doing a lot of random access so that may mask what performance would look like for more traditional applications.
A recent resource that might be helpful is the visual guide to nosql systems. There are a decent number of choices outside of MongoDB. I have used Redis as well although not with as large of a database.
Here's some benchmarks on db4o:
http://www.db4o.com/about/productinformation/benchmarks/
I think it ultimately depends on a lot of factors, including the complexity of the data, but db4o seems to certainly hang with the best of them.
Perhaps worth a mention.
The European Space Agency's Planck mission is running on the Versant Object Database.
http://sci.esa.int/science-e/www/object/index.cfm?fobjectid=46951
It is a satelite with 74 onboard sensors launched last year which is mapping the infrarred spectrum of the universe and storing the information in a map segment model. It has been getting a ton of hype these days because of it's producing some of the coolest images ever seen of the universe.
Anyway, it has generated 25T of information stored in Versant and replicated across 3 continents. When the mission is complete next year, it will be a total of 50T
Probably also worth noting, object databases tend to be a lot smaller to hold the same information. It is because they are truly normalized, no data duplication for joins, no empty wasted column space and few indexes rather than 100's of them. You can find public information about testing ESA did to consider storage in multi-column relational database format -vs- using a proper object model and storing in the Versant object database. THey found they could save 75% disk space by using Versant.
Here is the implementation:
http://www.planck.fr/Piodoc/PIOlib_Overview_V1.0.pdf
Here they talk about 3T -vs- 12T found in the testing
http://newscenter.lbl.gov/feature-stories/2008/12/10/cosmic-data/
Also ... there are benchmarks which show Versant orders of magnitude faster on the analysis side of the mission.
CHeers,
-Robert