Replacing HBase caching with a separate Caching product - memcached

Can the caching mechanism used in HBase be replaced with a different caching product such as Memcached, GemFire, GridGain, etc.?

HBase is very different from something like GridGain. Columnar mini-database (no transactions, etc.) vs. fully transactional in-memory data platform. Different goals, different approaches.
Replacing it is probably the right approach IF you have a proper use case.

Related

Ecommerce/Shopping Cart (and checkout process) : use Relational or NoSQL

For the use case of a shopping cart (and checkout process) in an e-commerce web application, which is better: a relational DB (RDBMS) or a NoSQL DB such as MongoDB/Cassandra/others?
From the catalog perspective, NoSQL is an ideal fit, with its flexible schema and horizontal scaling of data/nodes.
What are the pros/cons of each approach for the shopping-cart use case?
There are many differences between SQL and NoSQL databases. Those differences are what give each storage type its pros and cons in different situations.
Since both database types would work in the end, it all really depends on the context or on your implementation.
In this specific case (a shopping cart), the pros and cons are probably all related to the consistency of your data and to scalability.
NoSQL databases are better suited (pros) to more "dynamic" applications (data analysis, IoT, multimedia, etc.). Such applications use data that usually doesn't have a rigid structure and comes in very large volumes. This means that there's no need to develop a complex database model, and it's cheaper to store large amounts of data across separate "nodes". This also makes NoSQL databases easier to expand and scale. The main problem (cons) is the lack of structure, which makes it harder to run analysis and to keep track of every detail of the database.
Meanwhile, SQL databases are useful (pros) when your data is well structured and mostly consistent. As you know, SQL stores data in columns and rows; this gives SQL an advantage if you want to generate detailed statistics from your data, and also if you want to keep an organized record of everything that happens in your app. The main downside (cons) is that the design of an SQL database takes more time, and it's probably more expensive to maintain (scalability and physical storage require more hardware).
Performance-wise, I would argue that in this use case there wouldn't be any major difference.
If you think about all of what I just wrote, I would say that in the context of a shopping cart, the SQL model is the way to go. A shopping cart won't require lots of upgrades and changes (scalability), its data is always structured (name of item, price, etc.), and you might want to keep track of every transaction a user makes in your e-commerce application (for accountability and safety reasons).
tl;dr: use SQL, because the data in a shopping-cart use case is structured and consistent.
good luck!
The general pros/cons of something like Cassandra vs. Postgres/MySQL look like this:
Cassandra handles multi-DC HA much better.
Cassandra handles high write volume much better.
Cassandra allows you to reboot hosts without downtime because you'll have multiple replicas (and you won't have to worry about WAL replay or binlog replay or weird master-master replication problems, though some RDBMS add-ons make this easier for MySQL and Postgres than it used to be).
Cassandra allows you to scale better (linear scaling with the number of instances, up to ~1200 or so instances).
MySQL/Postgres allow you to build queries as your business requirements evolve by adding indices to existing tables; Cassandra expects you to know the queries in advance and do data modeling before you start writing data (see the sketch after this list).
MySQL/Postgres tend to be easier to use, and you'll find a ton of libraries/UIs/etc. to help you get started.
MySQL/Postgres offer real transactions / MVCC; Cassandra has lightweight transactions limited to operations on a single key, with much weaker isolation/atomicity guarantees.
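To make that data-modeling point concrete, here is a minimal sketch of query-first modeling with the DataStax Java driver 4.x; it assumes a local Cassandra node and an already-existing keyspace, and the keyspace, table and column names are made up for a shopping-cart example.

```java
import com.datastax.oss.driver.api.core.CqlSession;

public class CartModelSketch {
    public static void main(String[] args) {
        // Assumes a Cassandra node on localhost:9042 and that the "shop" keyspace exists.
        try (CqlSession session = CqlSession.builder().withKeyspace("shop").build()) {
            // In Cassandra you design one table per query, denormalizing as needed.
            // Query 1: "load a user's cart" -> partition by user_id.
            session.execute(
                "CREATE TABLE IF NOT EXISTS cart_items_by_user ("
              + " user_id uuid, item_id uuid, name text, price decimal, qty int,"
              + " PRIMARY KEY (user_id, item_id))");

            // Query 2: "list a user's orders, newest first" -> partition by user_id,
            // cluster by order_time descending.
            session.execute(
                "CREATE TABLE IF NOT EXISTS orders_by_user ("
              + " user_id uuid, order_time timestamp, order_id uuid, total decimal,"
              + " PRIMARY KEY (user_id, order_time))"
              + " WITH CLUSTERING ORDER BY (order_time DESC)");
        }
    }
}
```

Each table is shaped around exactly one query; supporting a new access pattern later usually means adding (and backfilling) another table, which is the opposite of just adding an index to an existing MySQL/Postgres table.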
Ultimately, though, unless you believe your shopping cart is going to handle thousands of concurrent users, it probably doesn't matter (as long as you use something with real data durability guarantees): use what you're most comfortable using. I'd use Cassandra because I know Cassandra very well, but if you're not great with Cassandra (or whatever), use what you know best.

NoSQL-agnostic persistence layer

It seems to me that, at the end of the day, most NoSQL databases are at their core key/value stores, which means one should be able to build a layer which could be NoSQL database agnostic.
That layer would only use CRUD operations (put, set, delete), but would expose more advanced features, and you'd be able to switch the underlying DB with minimal effort, whether it's Mongo, Redis, Cassandra, etc.
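For example, I picture the layer's core contract as something like the sketch below (interface and class names are hypothetical, just to illustrate the idea):

```java
import java.util.Optional;

// A deliberately tiny, store-agnostic contract; concrete adapters would wrap
// the Mongo, Redis or Cassandra client behind it.
public interface KeyValueStore {
    void put(String key, byte[] value);
    Optional<byte[]> get(String key);
    void delete(String key);
}

// Example adapter skeleton; the actual calls depend on the chosen driver.
class RedisKeyValueStore implements KeyValueStore {
    @Override public void put(String key, byte[] value) { /* e.g. SET key value */ }
    @Override public Optional<byte[]> get(String key)   { /* e.g. GET key */ return Optional.empty(); }
    @Override public void delete(String key)            { /* e.g. DEL key */ }
}
```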
Would building something like this have value to many people, and does it already exist?
Thanks
NuoDB is an elastically scalable SQL/ACID database that uses a key/value model for storage. It runs on top of Amazon S3 today (as well as standard file systems) and could support any KV store in principle. For the moment its access method is SQL, but the system could readily support other data access languages and methods if that is a common requirement.
Barry Morris, NuoDB Inc.
There are Kundera and DataNucleus.
UnQL means Unstructured Query Language. It's an open query language for JSON, semi-structured and document databases.
It's next to impossible to build such a thing.
As a thought experiment, I suggest that you take, for example, Redis, MongoDB and Cassandra, and design an API of such layer.
These NoSQL solutions have drastically different characteristics and they serve different purposes. Trying to build a common API for them is like building a common API for a SQL database, a spreadsheet document, a plain text file and Gmail.
While you can certainly come up with something, it will be completely pointless.
Different needs call for different tools.
PlayOrm is another solution that is built on Cassandra but has a pluggable interface for HBase, MongoDB, etc. 20/30 years ago they said the same thing about RDBMSs, but more and more their feature sets converged. I suspect you will see a lot of that in NoSQL databases as well, as they adopt each other's feature sets.
Currently, they have vastly different feature sets, but at the core there is a set of operations that is very, very similar.
PlayOrm actually builds its own query language which works on any NoSQL provider as well, so its S-SQL (Scalable SQL) can work with Cassandra, Hadoop, etc.
later,
Dean

Difference between Memcached and Hadoop?

What is the basic difference between Memcached and Hadoop? Microsoft seems to do memcached with the Windows Server AppFabric.
I know memcached is a giant key-value hashing function using multiple servers. What is Hadoop, and how is Hadoop different from memcached? Is it used to store data? Objects? I need to save giant in-memory objects, but it seems like I need some way of splitting these giant objects into "chunks" like people are talking about. When I look into splitting the object into bytes, Hadoop keeps popping up.
I have a giant class in memory, upwards of 100 MB. I need to replicate this object and cache it in some fashion. When I look into caching this monster object, it seems like I need to split it the way Google does. How does Google do this, and how can Hadoop help me in this regard? My objects are not simple structured data; they have references up and down the classes inside, etc.
Any idea, pointers, thoughts, guesses are helpful.
Thanks.
memcached [ http://en.wikipedia.org/wiki/Memcached ] is a single-purpose distributed caching technology.
Apache Hadoop [ http://hadoop.apache.org/ ] is a framework for distributed data processing, targeted at Google/Amazon scale: many terabytes of data. It includes sub-projects for the different areas of this problem: a distributed database, an algorithm for distributed processing, reporting/querying, and a data-flow language.
The two technologies tackle different problems. One is for caching (small or large items) across a cluster. The other is for processing large volumes of data across a cluster. From your question it sounds like memcached is more suited to your problem.
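For reference, basic memcached usage from Java looks roughly like the sketch below; it uses the spymemcached client and assumes a memcached server on localhost:11211 (the key and value are made up).

```java
import java.net.InetSocketAddress;

import net.spy.memcached.MemcachedClient;

public class MemcachedSketch {
    public static void main(String[] args) throws Exception {
        // Assumes a memcached instance on localhost:11211.
        MemcachedClient client = new MemcachedClient(new InetSocketAddress("localhost", 11211));

        // Cache a serializable value for one hour; note the default 1 MB item-size limit.
        client.set("user:42:profile", 3600, "some serializable value").get();

        // Read it back; null means a cache miss.
        Object cached = client.get("user:42:profile");
        System.out.println(cached);

        client.shutdown();
    }
}
```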
Memcached won't work due to its limit on the size of a stored value.
memcached FAQ. I read someplace that this limit can be increased to 10 MB, but I am unable to find the link.
For your use case I suggest giving MongoDB a try.
MongoDB FAQ. MongoDB can be used as an alternative to memcached. It provides GridFS for storing large files in the DB.
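A rough sketch of the GridFS route with the MongoDB Java driver (the database, bucket and file names are made up, and it assumes a local mongod):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.gridfs.GridFSBucket;
import com.mongodb.client.gridfs.GridFSBuckets;
import org.bson.types.ObjectId;

public class GridFsSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("cache");
            GridFSBucket bucket = GridFSBuckets.create(db, "bigObjects");

            byte[] serialized = new byte[100 * 1024 * 1024]; // your ~100 MB object, serialized
            // GridFS chunks the stream (255 KB pieces by default), so the usual
            // 16 MB document limit does not apply.
            ObjectId id = bucket.uploadFromStream("monster-object",
                    new ByteArrayInputStream(serialized));

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            bucket.downloadToStream(id, out);
        }
    }
}
```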
You need to use plain Hadoop for what you need (no HBase, Hive, etc.). HDFS will split your object into many chunks (blocks) and store them across the cluster; MapReduce is then the mechanism for processing them. The tutorial for MapReduce is here. However, don't forget that Hadoop is, in the first place, a solution for massive compute and storage. In your case I would also recommend checking out Membase, which is an implementation of memcached with additional storage capabilities. You will not be able to map-reduce with memcached/Membase, but they are still distributed, and your object may be cached in a cloud fashion.
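If you do go the plain-Hadoop route, simply writing the serialized object to HDFS is enough to get it split into blocks and replicated; here is a minimal sketch (the path is hypothetical, and it assumes HDFS is configured as the default filesystem):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSketch {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml; assumes an HDFS cluster is reachable.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        byte[] serialized = new byte[100 * 1024 * 1024]; // the serialized ~100 MB object
        Path path = new Path("/cache/monster-object.bin");

        // HDFS splits the file into blocks and replicates them across datanodes;
        // MapReduce would then be the way to process those blocks in parallel.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write(serialized);
        }

        // Reading it back streams the blocks from the datanodes.
        try (FSDataInputStream in = fs.open(path)) {
            in.readFully(0, serialized);
        }
    }
}
```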
Picking a good solution depends on the requirements of the intended use; think of the difference between storing legal documents forever and running a free music service. For example, can the objects be recreated, or are they uniquely special? Will they require further processing steps (i.e., MapReduce)? How quickly does an object (or a slice of it) need to be retrieved? The answers to these questions affect the solution set widely.
If objects can be recreated quickly enough, a simple solution might be to use memcached, as you mentioned, across many machines totaling sufficient RAM. For adding persistence later, Couchbase (formerly Membase) is worth a look and is used in production for very large game platforms.
If objects CANNOT be recreated, determine whether S3 and other cloud file providers would meet your requirements for now. For high-throughput access, consider one of the several distributed, parallel, fault-tolerant filesystem solutions: DDN (has GPFS and Lustre gear) or Panasas (pNFS). I've used DDN gear and it had a better price point than Panasas. Both provide good solutions that are much more supportable than a DIY Backblaze.
There are some mostly free implementations of distributed, parallel filesystems, such as GlusterFS and Ceph, that are gaining traction. Ceph touts an S3-compatible gateway and can use BTRFS (getting closer to production-ready); see the Ceph architecture and presentations. Gluster's advantage is the option of commercial support, although there could be a vendor supporting Ceph deployments. Hadoop's HDFS may be comparable, but I have not evaluated it recently.

Using Multiple Database Types to Model Data in a single application

Does it make sense to break up the data model of an application across different database systems? For example, the application stores all user data and relationships in a graph database (ideal for storing relationships), while storing other data in a document database such as CouchDB or MongoDB. This would require the graph database to reference unique IDs in the document databases and vice versa.
Is this over complicating the data model and application? Or is this using the best uses of both types of database systems for scaling your application?
It definitely can make sense; it depends fully on the requirements of your application. Use other database systems for the things they are really good at.
Take, for example, full-text search. Of course you can do more or less complex full-text searches with a relational database like MySQL. But there are systems such as Lucene/Solr which are optimized for exactly that and can search millions of documents fast. So you could use these systems for their special task (here: a nifty full-text search), then take the returned identifiers and load the relationally structured data from the RDBMS.
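A rough sketch of that pattern with SolrJ and plain JDBC; the Solr core, field names, table and connection details are made up, and it assumes a MySQL JDBC driver on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SearchThenLoadSketch {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/products").build();
             Connection db = DriverManager.getConnection("jdbc:mysql://localhost/shop", "user", "pw")) {

            // 1) Full-text search in Solr, returning the matching document identifiers.
            QueryResponse rsp = solr.query(new SolrQuery("description:leather AND boots"));

            // 2) Load the structured rows from the RDBMS by primary key.
            PreparedStatement ps = db.prepareStatement("SELECT * FROM product WHERE id = ?");
            for (SolrDocument doc : rsp.getResults()) {
                ps.setLong(1, Long.parseLong(String.valueOf(doc.getFieldValue("id"))));
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("name") + " " + rs.getBigDecimal("price"));
                    }
                }
            }
        }
    }
}
```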
Or CouchDB. I use CouchDB in some projects as a caching system, in combination with a relational database. Of course I need to take care of consistency, but it's definitely worth the effort. It pushed performance in those projects a lot and decreased the load on the server from 2 to 0.2, for example. :)
Something like this is, for instance, called cross-store persistence. As you mentioned, you would store certain data in your relational database, social relationships in a graph DB, user-generated data (documents) in a document DB, and user-provided multimedia files (pictures, audio, video) in a blob store like S3.
It is mainly about looking at the use cases and making sure that from wherever you need it, you can access the "primary" or index key of each store (back and forth). You can encapsulate the actual lookup in your domain or DAO layer.
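A sketch of what that encapsulation might look like; the repository interfaces and types below are hypothetical placeholders for the concrete graph-DB and document-DB clients:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical per-store interfaces; each would wrap the concrete client
// (a graph DB for relationships, a document DB for user documents).
interface UserGraphRepository {
    List<String> friendIdsOf(String userId);   // graph traversal, returns shared IDs
}

interface UserDocumentRepository {
    UserProfile findById(String userId);       // document lookup by the same ID
}

class UserProfile {
    final String id;
    final String displayName;
    UserProfile(String id, String displayName) { this.id = id; this.displayName = displayName; }
}

// The DAO hides the fact that two stores are involved: both sides agree on the
// same user ID, so callers see one coherent operation.
class UserDao {
    private final UserGraphRepository graph;
    private final UserDocumentRepository documents;

    UserDao(UserGraphRepository graph, UserDocumentRepository documents) {
        this.graph = graph;
        this.documents = documents;
    }

    List<UserProfile> friendsOf(String userId) {
        return graph.friendIdsOf(userId).stream()
                    .map(documents::findById)
                    .collect(Collectors.toList());
    }
}
```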
Some frameworks, like the Spring Data projects, provide an initial kind of cross-store persistence out of the box, mostly integrating JPA with a different NoSQL datastore. For instance, Spring Data Graph allows you to store your entities in JPA and add social graphs or other highly interconnected data as a secondary concern, leveraging a graph DB for the typical traversals and other graph operations (e.g. ranking, suggestions, etc.).
Another term for this is polyglot persistence.
Here are two contrary positions on the question:
Pro:
"Contrary to that, I’m a big fan of polyglot persistence. This simply means using the right storage backend for each of your usecases. For example file storages, SQL, graph databases, data ware houses, in-memory databases, network caches, NoSQL. Today there are mostly two storages used, files and SQL databases. Both are not optimal for every usecase."
http://codemonkeyism.com/nosql-polyglott-persistence/
Con:
"I don’t think I need to say that I’m a proponent of polyglot persistence. And that I believe in Unix tools philosophy. But while adding more components to your system, you should realize that such a system complexity is “exploding” and so will operational costs grow too (nb: do you remember why Twitter started to into using Cassandra?) . Not to mention that the more components your system has the more attention and care must be invested figuring out critical aspects like overall system availability, latency, throughput, and consistency."
http://nosql.mypopescu.com/post/1529816758/why-redis-and-memcached-cassandra-lucene

Best NoSQL approach to handle 100+ million records

I am working on a project where we are batch-loading and storing a huge volume of data in an Oracle database which is constantly getting queried via Hibernate against this 100+ million record table (the reads are much more frequent than the writes).
To speed things up we are using Lucene for some of the queries (especially geo bounding-box queries) and the Hibernate second-level cache, but that's still not enough. We still have a bottleneck in Hibernate queries against Oracle (we don't cache the 100+ million table entities in the Hibernate second-level cache due to the lack of that much memory).
What additional NoSQL solutions (apart from Lucene) can I leverage in this situation?
Some options I am thinking of are:
Use distributed Ehcache (Terracotta) for the Hibernate second-level cache to leverage more memory across machines and reduce duplicate caches (right now each VM has its own cache); see the configuration sketch after this list.
Completely switch to an in-memory SQL database like H2, but unfortunately those solutions require loading the 100+ million row table into a single VM.
Use Lucene for querying and BigTable (or distributed hashmap) for entity lookup by id.
What BigTable implementation will be suitable for this? I was considering HBase.
Use MongoDB for storing data and for querying and lookup by id.
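Regarding the first option above (distributed Ehcache as the Hibernate second-level cache), the wiring is roughly as sketched below; this assumes Hibernate with the hibernate-ehcache module on the classpath, the entity and property names are illustrative, and clustering the cache via Terracotta would then be configured in ehcache.xml rather than in Java code:

```java
import javax.persistence.Entity;
import javax.persistence.Id;

import org.hibernate.annotations.Cache;
import org.hibernate.annotations.CacheConcurrencyStrategy;
import org.hibernate.cfg.Configuration;

public class SecondLevelCacheSketch {

    // Illustrative entity; marking it cacheable lets the second-level cache apply.
    @Entity
    @Cache(usage = CacheConcurrencyStrategy.READ_WRITE)
    public static class Reading {
        @Id Long id;
        Double lat;
        Double lon;
    }

    public static void main(String[] args) {
        Configuration cfg = new Configuration()
            // Turn on the second-level cache and the query cache.
            .setProperty("hibernate.cache.use_second_level_cache", "true")
            .setProperty("hibernate.cache.use_query_cache", "true")
            // Ehcache-backed region factory from the hibernate-ehcache module.
            .setProperty("hibernate.cache.region.factory_class",
                         "org.hibernate.cache.ehcache.EhCacheRegionFactory")
            .addAnnotatedClass(Reading.class);
        // Build the SessionFactory as usual, with the existing Oracle connection settings.
    }
}
```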
I'd recommend Cassandra with Elasticsearch for a scalable system (100 million records is nothing for them). Use Cassandra for all your data and ES for ad-hoc and geo queries. Then you can kill your entire legacy stack. You may need an MQ system like RabbitMQ for data sync between Cassandra and ES.
It really depends on your data sets. The number-one rule of NoSQL design is to define your query scenarios first. Once you really understand how you want to query the data, you can look into the various NoSQL solutions out there. The default unit of distribution is the key. Therefore you need to remember that you must be able to split your data between your node machines effectively, otherwise you will end up with a horizontally scalable system where all the work is still being done on one node (albeit with better queries, depending on the case).
You also need to think back to the CAP theorem: most NoSQL databases are eventually consistent (CP or AP) while traditional relational DBMSs are CA. This will impact the way you handle data and the creation of certain things; for example, key generation can become tricky.
Also remember that in some systems, such as HBase, there is no secondary-index concept. All your indexes will need to be built by your application logic, and any updates and deletes will need to be managed as such. With Mongo you can actually create indexes on fields and query them relatively quickly; there is also the possibility of integrating Solr with Mongo. You aren't limited to querying by ID in Mongo as you are in HBase, which is a column-family (aka Google BigTable-style) database where you essentially have nested key-value pairs.
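For example, with the MongoDB Java driver a secondary index and a legacy geo index are one call each, and the bounding-box query is a built-in filter; the collection and field names are hypothetical, and "location" is assumed to be stored as a legacy [lon, lat] pair:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

public class MongoIndexSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> places =
                client.getDatabase("demo").getCollection("places");

            // Secondary index on a regular field, plus a 2d index for bounding-box queries.
            places.createIndex(Indexes.ascending("ownerId"));
            places.createIndex(Indexes.geo2d("location"));

            // Query by a non-key field...
            for (Document d : places.find(Filters.eq("ownerId", "user-42"))) {
                System.out.println(d.toJson());
            }

            // ...or by a geo bounding box (lower-left x/y, upper-right x/y).
            for (Document d : places.find(
                    Filters.geoWithinBox("location", -74.1, 40.6, -73.7, 40.9))) {
                System.out.println(d.toJson());
            }
        }
    }
}
```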
So once again it comes down to your data: what you want to store, how you plan to store it, and most importantly how you want to access it. The Lily project looks very promising. In the work I am involved with, we take a large amount of data from the web and we store it, analyse it, strip it down, parse it, analyse it again, stream it, update it, etc. We don't just use one system but many, each best suited to the job at hand. For this process we use different systems at different stages, as that gives us fast access where we need it, provides the ability to stream and analyse data in real time and, importantly, lets us keep track of everything as we go (as data loss in a prod system is a big deal). I am using Hadoop, HBase, Hive, MongoDB, Solr, MySQL and even good old text files. Remember that productionizing a system using these technologies is a bit harder than installing Oracle on a server; some releases are not as stable, and you really need to do your testing first. At the end of the day it really depends on the level of business resistance and the mission-critical nature of your system.
Another path that no one has mentioned thus far is NewSQL, i.e. horizontally scalable RDBMSs. There are a few out there, like MySQL Cluster (I think) and VoltDB, which may suit your cause.
Again it comes down to understanding your data and the access patterns. NoSQL systems are also non-relational and are therefore better suited to non-relational data sets. If your data is inherently relational and you need SQL query features that really need to do things like Cartesian products (aka joins), then you may well be better off sticking with Oracle and investing some time in indexing, sharding and performance tuning.
My advice would be to actually play around with a few different systems. Look at:
MongoDB - Document - CP
CouchDB - Document - AP
Redis - In memory key-value (not column family) - CP
Cassandra - Column Family - Available & Partition Tolerant (AP)
HBase - Column Family - Consistent & Partition Tolerant (CP)
Hadoop/Hive
VoltDB - A really good-looking product: a relational database that is distributed and might work for your case (it may be an easier move). They also seem to provide enterprise support, which may be more suited to a prod env (i.e. it gives business users a sense of security).
Anyway, that's my 2c. Playing around with the systems is really the only way you're going to find out what really works for your case.
As you suggest, MongoDB (or any similar NoSQL persistence solution) is an appropriate fit for you. We've run tests on MongoDB with significantly larger data sets than the one you're describing, and it works fine. Especially if you're read-heavy, MongoDB's sharding and/or distributing reads across replica-set members will allow you to speed up your queries significantly. If your use case allows for keeping your indexes right-balanced, your goal of getting close to 20 ms queries should become feasible without further caching.
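As a hedged sketch of the read-scaling part with the MongoDB Java driver (the connection string, database, collection and field names are illustrative):

```java
import com.mongodb.ReadPreference;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

public class ReadPreferenceSketch {
    public static void main(String[] args) {
        // The connection string lists the replica-set members (hosts are made up).
        try (MongoClient client = MongoClients.create(
                "mongodb://db1:27017,db2:27017,db3:27017/?replicaSet=rs0")) {

            // Route this collection's queries to secondaries when available,
            // spreading a read-heavy load across the replica set.
            MongoCollection<Document> records = client.getDatabase("warehouse")
                    .getCollection("records")
                    .withReadPreference(ReadPreference.secondaryPreferred());

            Document first = records.find(Filters.eq("regionId", 17)).first();
            System.out.println(first);
        }
    }
}
```

Keep in mind that reads from secondaries can be slightly stale, so this trade-off only works if your use case tolerates it.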
You should also check out the Lily project (lilyproject.org). They have integrated HBase with Solr. Internally they use message queues to keep Solr in sync with HBase. This allows them to have the speed of Solr indexing (sharding and replication), backed by a highly reliable data-storage system.
You could group requests and split them by specific sets of data, and have a single server (or a group of servers) process them; the relevant data can then be kept in the cache to improve performance.
e.g.,
Say employee and availability data are handled using 10 tables; these can be handled by a small group of server(s) when you configure the Hibernate cache to load and handle those requests.
For this to work you need a load balancer (one which balances load by business scenario).
Not sure how much of this can be implemented here.
At 100M records, your bottleneck is likely Hibernate, not Oracle. Our customers routinely have billions of records in the individual fact tables of our Oracle-based data warehouse, and it handles them fine.
What kind of queries do you execute on your table?