Deciding on a suitable key-value store: Voldemort vs Cassandra vs Memcached vs Redis

I am using a triple store database for one of my projects (a semantic search engine for healthcare), and it works quite well. I am considering giving it a performance boost by putting a key-value store layer in front of the triple store. Querying the triple store is slow because we do deep semantic processing.
This is how I am planning to improve performance:
1) Run a Hadoop job every day that queries the triple store for all query terms.
2) Cache these results in a clustered key-value store.
3) When a user searches for a query term, search the key-value store first instead of the triple store. The triple store is searched only when the query term is not found in the key-value store.
The key-value pair I plan to save is a mapping from a "String" to a "List of POJOs". I can save it as a BLOB.
I am unsure which key-value store to use. I am looking mainly for failover and load-balancing support. All I need is a simple key-value store that provides these features. I do not need to sort or search within values, or any other functionality.
Please correct me if I am wrong. I am assuming memcached and Redis will be faster since they are in-memory. But I do not know whether the Java clients for Redis (Jredis) or memcached (Spymemcached) support failover. I am not sure whether to go with in-memory or persistent storage. I am also considering Voldemort, Cassandra and HBase. Overall, the key-value data will be around 2 GB to 4 GB in size. Any pointers on this would be really helpful.
I am very new to NoSQL and key-value stores. Please let me know if you need any more details.

Have you gone over the memcached tutorial articles below? They explain the load-balancing aspects (load is spread across memcached instances based on a hash of your key) and discuss how spymemcached handles connectivity failures:
Use Memcached for Java enterprise performance, Part 1: Architecture and setup http://www.javaworld.com/javaworld/jw-04-2012/120418-memcached-for-java-enterprise-performance.html
Use Memcached for Java enterprise performance, Part 2: Database-driven web apps http://www.javaworld.com/javaworld/jw-05-2012/120515-memcached-for-java-enterprise-performance-2.html
For enterprise-grade failover and cross-data-center replication support with memcached, you should use Couchbase, which offers these features. The product evolved from a memcached base.

Before you build infrastructure to load your cache, you might just try adding memcached on top of your existing system. First, measure your current performance well. I suggest JMeter or similar tools. Here's the workflow in your application: Check memcached, if it's there, you're done. If not, run the query against the triple store and save the results in memcached. This will improve performance if you have queries that are repeated. Memcached will use the memory you give it efficiently, throwing away things that don't get used very often. Failover is handled by your application (if it's not in memcached, you use your existing infrastructure).
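The workflow described above can be sketched roughly as follows. This is a minimal illustration, not a production setup: a plain dict stands in for the memcached client, and `fake_triple_store_query` is a hypothetical placeholder for the real (slow) triple store query.

```python
from typing import Callable, Dict, List


class CacheAside:
    """Check the cache first; on a miss, run the slow query and store the result."""

    def __init__(self, slow_query: Callable[[str], List[str]]) -> None:
        self._slow_query = slow_query            # assumed stand-in for the triple store
        self._cache: Dict[str, List[str]] = {}   # assumed stand-in for a memcached client
        self.hits = 0
        self.misses = 0

    def search(self, term: str) -> List[str]:
        cached = self._cache.get(term)
        if cached is not None:
            self.hits += 1
            return cached
        # Cache miss: fall back to the existing infrastructure, then remember the answer.
        self.misses += 1
        result = self._slow_query(term)
        self._cache[term] = result
        return result


def fake_triple_store_query(term: str) -> List[str]:
    # Hypothetical placeholder for the expensive semantic query.
    return [f"{term}-result-1", f"{term}-result-2"]


store = CacheAside(fake_triple_store_query)
first = store.search("diabetes")    # miss: goes to the triple store
second = store.search("diabetes")   # hit: served from the cache
```

Counting hits and misses like this also gives you a cheap way to measure whether queries repeat often enough for the cache to pay off.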

We use a triple store and cache data in the memcache service provided by Google App Engine, and it works fine. It reduced the overhead of SPARQL queries against the triple store.

Of the options listed, only Cassandra has the features you mention along with full CQL support, which helps with maintenance. Otherwise, you may want to look in another direction:
Write heavy, replicated, bigger-than-memory key-value store

Since you want just to cache data in front of your triple store, going with disk-based, or replicated/distributed key-value stores seems to be pointless. All you need is essentially to cache data in front of your queries right on the machines where those queries are done. No "key-value stores", just vanilla Java caching solutions.
In 2016 the best cache for Java is Caffeine.

Related

MongoDB vs Cassandra for extensive read and write operations

I have used MongoDB but am new to Cassandra. The applications I have worked on that use MongoDB are not very large, and their read and write operations are not very intensive. MongoDB worked well for me in that scenario. Now I am building a new application (with some features like Stack Overflow's: voting, total views, suggestions, comments, etc.) that will, in the future, have lots of concurrent write operations on the same item in the database. According to the information I gathered online, MongoDB is not the best choice for this (but Cassandra is). The problem I am finding with Cassandra is picking the right data model:
Construct models around your queries, not around relations and objects.
I also looked at the solution of using Mongo + Redis. Is it efficient to update the Mongo database first and then update Redis for each of multiple write requests to the same data item?
I want to verify which will best solve this issue: Mongo + Redis, or Cassandra?
Any help would be highly appreciated.
Picking a database is very subjective. I'd say that modern MongoDB 3.2+ using the new WiredTiger Storage Engine handles concurrency pretty well.
When selecting a distributed NoSQL (or SQL) datastore, you can generally only pick two of these three:
Consistency (all nodes see the same data at the same time)
Availability (every request receives a response about whether it succeeded or failed)
Partition tolerance (the system continues to operate despite arbitrary partitioning due to network failures)
This is called the CAP Theorem.
MongoDB has C and P, Cassandra has A and P. Cassandra is also a Column-Oriented Database, and will take a bit of a different approach to storing and retrieving data than, say, MongoDB does (which is a Document-Oriented Database). The reality is that either database should be able to scale to your needs easily. I would worry about how well the data storage and retrieval semantics fit your application's data model, and how useful the features provided are.
Deciding which database is best for your app is highly subjective, and borders on an "opinion-based question" on Stack Overflow.
Using Redis as an LRU cache is definitely a component of an effective scaling strategy. The typical model is, when reading cacheable data, to first check if the data exists in the cache (Redis), and if it does not, to query it from the database, store the result in the cache, and return it. While maybe appropriate in some cases, it's not common to just write everything to both Redis and the database. You need to figure out what's cacheable and how long each cached item should live, and either cache it at read time as I explained above, or at write time.
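As a rough illustration of caching at write time, here is a minimal sketch in which the database is updated first and the cache is refreshed immediately afterwards. The two dicts are assumed stand-ins for MongoDB and Redis, and the class and key names are hypothetical.

```python
from typing import Dict, Optional


class WriteThroughStore:
    """Write to the system of record first, then refresh the cache entry."""

    def __init__(self) -> None:
        self.database: Dict[str, int] = {}   # assumed stand-in for MongoDB
        self.cache: Dict[str, int] = {}      # assumed stand-in for Redis

    def write(self, key: str, value: int) -> None:
        self.database[key] = value   # durable write comes first
        self.cache[key] = value      # then the cache is brought up to date

    def read(self, key: str) -> Optional[int]:
        if key in self.cache:
            return self.cache[key]
        # Entry was never cached or has been evicted: fall back to the database.
        value = self.database.get(key)
        if value is not None:
            self.cache[key] = value
        return value


votes = WriteThroughStore()
votes.write("post:42:votes", 10)
cached_read = votes.read("post:42:votes")      # served from the cache
votes.cache.clear()                            # simulate eviction of the cached entry
fallback_read = votes.read("post:42:votes")    # served from the database, then re-cached
```

Whether this is worth doing for every write depends on how often the written item is read back before it changes again.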
It all depends on what your application is for. For write-heavy applications, it is much better to go with Cassandra.

Suitable db solution for high read rate

I'll explain the use cases first.
High read rates (10,000+ per second) over a large dataset: lots of string codes (think promo codes), 10-20 characters each, that need to be looked up for matches. It needs a fast response time.
My first thought was memcached. However, to avoid downtime if memcached goes down and has to start repopulating the cache from a database like MySQL, I was thinking of Redis for automatic repopulation of the cache.
Is it true that Redis does not persist to disk, and that a flush needs to be called for it to be backed up?
My hope is to use the code string as the key, making lookups very quick. The value will be an ID linking it to a database record that is not needed by the API.
If I had to guess how many unique strings will be stored: 10M+ after a few months.
I've also looked briefly at Cassandra and MongoDB. I'm thinking MongoDB will not be enough because it does not keep the entire dataset in memory?
Any insight into these systems would be very helpful. I feel like I'm going around in circles.
The API is written in Node.js (if it matters).
10K/s is definitely not a high rate for a database like Cassandra, provided that your schema is designed wisely. I bet the same is true for the others.
10M unique strings per month is peanuts for modern big data systems.
Whatever big data solution you choose, you will have to design the schema according to the type of data and your operational needs.
IMO, the two important questions are the following:
What do you mean by "looking for matches"?
If you need indexing and search using substrings or regexps, you need a search engine: Elasticsearch or Solr are great. Be warned that Elasticsearch does replication and sharding, but its distribution model is still not 100% safe.
None of the systems you mentioned will provide the responsiveness you seem to be looking for.
If you will query using static strings, a key-value store or a column-oriented database like Cassandra will be a perfect fit. In that case, all of them are a good fit.
What is a fast response time?
With the right technology and appropriate schemas, all of those systems will give you great response times, under hundreds of milliseconds, but will that be fast enough for you?
Redis and Memcached, being in-memory, will provide the fastest responses.
In conclusion, the API being written in Node.js is irrelevant to the choice of storage and indexing technology, unless you want to stick with JavaScript for everything and find MongoDB friendlier, in which case it can be a decent candidate depending on your search use cases.

Difference between Memcached and Hadoop?

What is the basic difference between Memcached and Hadoop? Microsoft seems to do memcached-style caching with Windows Server AppFabric.
I know memcached is a giant key-value hash spread across multiple servers. What is Hadoop, and how is it different from memcached? Is it used to store data? Objects? I need to save giant in-memory objects, but it seems I need some way of splitting these giant objects into "chunks", as people are talking about. When I look into splitting an object into bytes, Hadoop keeps popping up.
I have a giant class instance that takes upwards of 100 MB in memory. I need to replicate this object and cache it in some fashion. When I look into caching this monster object, it seems I need to split it the way Google does. How is Google doing this? How can Hadoop help me in this regard? My objects are not simple structured data; they have references up and down the classes inside, etc.
Any ideas, pointers, thoughts or guesses would be helpful.
Thanks.
memcached [ http://en.wikipedia.org/wiki/Memcached ] is a single-purpose distributed caching technology.
Apache Hadoop [ http://hadoop.apache.org/ ] is a framework for distributed data processing, targeted at Google/Amazon-scale workloads of many terabytes of data. It includes sub-projects for the different areas of this problem: a distributed database, an algorithm for distributed processing, reporting/querying, and a data-flow language.
The two technologies tackle different problems. One is for caching (small or large items) across a cluster, and the other is for processing large items across a cluster. From your question, it sounds like memcached is better suited to your problem.
Memcached won't work because of its limit on the size of a stored value (see the memcache FAQ). I read somewhere that this limit can be increased to 10 MB, but I am unable to find the link.
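One common workaround for a per-value size limit is to split a large serialized object into chunks stored under separate keys. The sketch below assumes a limit of about 1 MB per item (memcached's default), uses a plain dict as a stand-in for the cache client, and invents the `key:index` / `key:count` naming scheme purely for illustration.

```python
from typing import Dict, Optional

# Stay a little under the assumed 1 MB per-item limit to leave room for overhead.
CHUNK_SIZE = 1024 * 1024 - 1024


def put_chunked(cache: Dict[str, bytes], key: str, blob: bytes,
                chunk_size: int = CHUNK_SIZE) -> int:
    """Store blob as numbered chunks plus a count entry; return the chunk count."""
    chunks = [blob[i:i + chunk_size] for i in range(0, len(blob), chunk_size)]
    for index, chunk in enumerate(chunks):
        cache[f"{key}:{index}"] = chunk
    cache[f"{key}:count"] = str(len(chunks)).encode("ascii")
    return len(chunks)


def get_chunked(cache: Dict[str, bytes], key: str) -> Optional[bytes]:
    """Reassemble the blob, or return None if any piece is missing."""
    count_raw = cache.get(f"{key}:count")
    if count_raw is None:
        return None
    parts = []
    for index in range(int(count_raw)):
        chunk = cache.get(f"{key}:{index}")
        if chunk is None:
            return None   # one chunk was evicted, so treat the whole object as a miss
        parts.append(chunk)
    return b"".join(parts)


cache: Dict[str, bytes] = {}
blob = bytes(range(256)) * 10000          # 2,560,000 bytes of sample data
chunk_count = put_chunked(cache, "big-object", blob)
restored = get_chunked(cache, "big-object")
```

Because a cache may evict any single chunk independently, the reader has to treat a partially present object as a miss and rebuild it.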
For your use case I suggest giving MongoDB a try.
See the MongoDB FAQ. MongoDB can be used as an alternative to memcached. It provides GridFS for storing large files in the database.
You need to use pure Hadoop for what you need (no HBase, Hive, etc.). The MapReduce mechanism will split your object into many chunks and store it in Hadoop. The tutorial for MapReduce is here. However, don't forget that Hadoop is, first and foremost, a solution for massive compute and storage. In your case I would also recommend checking out Membase, which is an implementation of memcached with additional storage capabilities. You will not be able to run MapReduce with memcached/Membase, but they are still distributed, and your object can be cached in a cloud fashion.
Picking a good solution depends on requirements of the intended use, say the difference between storing legal documents forever to a free music service. For example, can the objects be recreated or are they uniquely special? Would they be requiring further processing steps (i.e., MapReduce)? How quickly does an object (or a slice of it) need to be retrieved? Answers to these questions would affect the solution set widely.
If objects can be recreated quickly enough, a simple solution might be to use Memcached, as you mentioned, across many machines totaling sufficient RAM. For adding persistence later, Couchbase (formerly Membase) is worth a look; it is used in production for very large game platforms.
If objects CANNOT be recreated, determine whether S3 or other cloud file providers would meet your requirements for now. For high-throughput access, consider one of the several distributed, parallel, fault-tolerant filesystem solutions: DDN (has GPFS and Lustre gear), Panasas (pNFS). I've used DDN gear and it had a better price point than Panasas. Both provide good solutions that are much more supportable than a DIY BackBlaze.
There are some mostly free implementations of distributed, parallel filesystems such as GlusterFS and Ceph that are gaining traction. Ceph touts an S3-compatible gateway and can use BTRFS (future replacement for Lustre; getting closer to production ready). Ceph architecture and presentations. Gluster's advantage is the option for commercial support, although there could be a vendor supporting Ceph deployments. Hadoop's HDFS may be comparable but I have not evaluated it recently.

Key-value store for Ruby & Java

I need a recommendation for a key-value store. Here are my criteria:
1) Doesn't have to be persistent, but needs to support lots of records (records are small, 100-1000 bytes)
2) Inserts (put) will happen only occasionally, always in large datasets (bulk)
3) Gets will be random and need to be fast
4) Clients will be in Ruby and, perhaps, Java
5) It should be relatively easy to set up and need as little maintenance as possible
Redis sounds like the right thing to use here. It's all in memory so it's very fast (The GET and SET operations are both O(1)) and it supports both Ruby and Java clients.
Aerospike would be a perfect fit for the following reasons:
Key Value based with clients available in Java and Ruby.
Throughput: Better than Redis/Mongo/Couchbase or any other NoSQL solution. See this http://www.aerospike.com/blog/use-1-aerospike-server-not-12-redis-shards/. Have personally seen it work fine with more than 300k read TPS and 100k Write TPS concurrently.
Automatic and efficient data sharding, data re-balancing and data distribution using RIPEMD160.
Highly Available system in case of Failover and/or Network Partitions.
Open sourced from 3.0 version.
Can be used in Caching mode with no persistence.
Supports LRU and TTL.
Little or No maintenance.
An AVL-Tree will give you O(log n) on insert, remove, search and most everything else.
1 and 3 both scream a database engine.
If your number of records isn't insane and you only have one client using this thing at a time, I would personally recommend SQLite, which works with both Java and Ruby (and would also pass #5). Otherwise go with a real database system, like MySQL (since you're not on the Microsoft stack).

why memcached instead of hashmap

I am trying to understand why one would need a solution like memcached. It may seem like a silly question, but what does it bring to the table if all I need is to cache objects? Won't a simple hashmap do?
Quoting from the memcache web site, memcache is…
Free & open source, high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.
Memcached is an in-memory key-value store for small chunks of arbitrary data (strings, objects) from results of database calls, API calls, or page rendering. Memcached is simple yet powerful. Its simple design promotes quick deployment, ease of development, and solves many problems facing large data caches. Its API is available for most popular languages.
At heart it is a simple Key/Value store
A key word here is distributed. In general, quoting from the memcache site again,
Memcached servers are generally unaware of each other. There is no crosstalk, no synchronization, no broadcasting. The lack of interconnections means adding more servers will usually add more capacity as you expect. There might be exceptions to this rule, but they are exceptions and carefully regarded.
I would highly recommend reading the detailed description of memcache.
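Because the servers do not talk to each other, it is the client that decides which server holds a given key, typically by hashing the key. The sketch below uses simple modulo hashing as an assumption for illustration; real clients often use consistent hashing so that adding or removing a server remaps fewer keys. The host names are hypothetical.

```python
import hashlib
from typing import Dict, List


def pick_server(key: str, servers: List[str]) -> str:
    """Map a key to one server using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(servers)
    return servers[index]


# Hypothetical server list; every client must use the same list in the same order.
servers = ["cache-a:11211", "cache-b:11211", "cache-c:11211"]

keys = ["user:1", "user:2", "user:3", "user:4"]
assignments: Dict[str, str] = {key: pick_server(key, servers) for key in keys}
```

The same key always lands on the same server, so no coordination between servers is needed for a client to find a previously stored value.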
Where are you going to put this hashmap? That's what memcached is doing for you. Any structure you build in PHP exists only until the request ends. If you put data in a persistent cache, you can fetch it back out for other requests instead of rebuilding it.
I know this question is rather old, but in addition to being able to share a cache across multiple servers, there is another aspect not mentioned in the other answers: value expiration.
If you store the values in a HashMap, and that HashMap is bound to the application context, it will keep growing in size unless you expire items in some way. Memcached expires objects lazily for maximum performance.
When an item is added to memcached, it can have an expiration time, for instance 600 seconds. After the object has expired it just remains there, but if a client asks for it, memcached purges it and returns null.
Similarly, when memcached's memory is full, it looks for the first expired item of adequate size and removes it to make room for the new item. Lastly, it can also happen that the cache is full and there isn't any item to expire, in which case it replaces the least recently used items.
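The behaviour described above (lazy expiration on read, then falling back to least-recently-used eviction when the cache is full) can be sketched with a small in-process class. This is a simplified illustration and not how memcached is actually implemented: it counts items rather than bytes, and the class name and injectable clock are assumptions made for the example.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional, Tuple


class LazyExpiringCache:
    """Tiny cache with per-item TTL, lazy purging, and LRU eviction."""

    def __init__(self, max_items: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_items = max_items
        self._clock = clock
        # Oldest (least recently used) entries sit at the front.
        self._items: "OrderedDict[str, Tuple[Any, float]]" = OrderedDict()

    def set(self, key: str, value: Any, ttl_seconds: float) -> None:
        if key in self._items:
            del self._items[key]
        elif len(self._items) >= self._max_items:
            self._make_room()
        self._items[key] = (value, self._clock() + ttl_seconds)

    def get(self, key: str) -> Optional[Any]:
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._items[key]       # purged lazily, only when someone asks for it
            return None
        self._items.move_to_end(key)   # mark as most recently used
        return value

    def _make_room(self) -> None:
        now = self._clock()
        expired = next((k for k, (_, exp) in self._items.items() if now >= exp), None)
        if expired is not None:
            del self._items[expired]           # prefer dropping an expired item
        else:
            self._items.popitem(last=False)    # otherwise evict the least recently used


now = [0.0]                                    # fake clock so the example is deterministic
cache = LazyExpiringCache(max_items=2, clock=lambda: now[0])
cache.set("a", 1, ttl_seconds=600)
cache.set("b", 2, ttl_seconds=600)
cache.get("a")                                 # "a" becomes the most recently used item
cache.set("c", 3, ttl_seconds=600)             # cache is full, nothing expired: "b" is evicted
```

A plain HashMap gives you none of this bookkeeping, which is why an unbounded map bound to the application context keeps growing.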
A fully fledged cache system usually lets you replicate the cache across many servers, or scale out to many servers to handle a large number of parallel requests, all while keeping response times acceptably fast.
There is an (old) article that compares different caching systems used by PHP:
https://www.percona.com/blog/2006/08/09/cache-performance-comparison/
In that benchmark, file caching was faster than memcached over TCP/IP.
So, to answer the question, I believe you would get better performance using a file-based cache system.
Here are the results from the tests in the article:
Cache Type                            Cache Gets/sec
Array Cache                           365000
APC Cache                             98000
File Cache                            27000
Memcached Cache (TCP/IP)              12200
MySQL Query Cache (TCP/IP)            9900
MySQL Query Cache (Unix Socket)       13500
Selecting from table (TCP/IP)         5100
Selecting from table (Unix Socket)    7400