Can Solr be used as an alternative to cache? - postgresql

We are using Postgresql for persistence, ehcache as our cache. We have recently introduced Solr for enabling faster searches (for fuzzy and exact searches).
So my question is : Can Solr be tuned in such a way that it can replace ehcache? (say by running in cloud-mode or so)
Just to add some context to the question:
We have a bunch of tables to store contact information. Ehcache is currently being used to get these contacts for a given ID. Solr will be used extensively for search related operations. Since Solr is already doing the search... why not replace Ehcache (as in some way it is like : searching with a given ID) provided the performance is not compromised.

In additions to other reasons why No would be an answer, is also the granularity of changes. Lucene (underlying library) stores data in a read-only form. Solr adds updatable documents on top of that, but making them visible is still a heavy operation. Recent versions of Solr made it easier and faster with soft-commits, but the price of making a change visible is still non-trivial.
So, it is really not optimized for updating/caching a single value. The data structures are optimized for a multiple document update and then fast search with caching over that temporarily read-only state.

I'll take a shot, but it's unlikely anyone will have a definitive answer to such a vague question. https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/ is four years old now but still relevant. The answers will depend entirely on what you need to do.
So, some generic statements:
SolrCloud or not is unlikely to be an issue that effects your decision. Use it if you want Solr to handle replication and index managment. Don't, if you'd rather do it yourself.
Solr is fast, (given enough memory) so it's certainly possible you could get rid of a caching layer. Only you know your requirements though.
Read through https://wiki.apache.org/solr/SolrCaching, particularly you might be interested in the QueryResultCache.

The simple answer is: No
Reason:
cache is in memory, but the index of solr is on disk (except the part been cached).
Reading memory is over thousands of times quicker than reading disk.
So, solr can't be used as a general purpose cache, in that case ehcache or memcached or redis would be a better choice.
What solr good at is its search ability, analyzer & tokenizer & filter, but not cache.

Related

Suitable db solution for high read rate

I'll explain the use cases first.
High read rates (10000+ p/s), large dataset (lots of string codes(think promocodes) looking for matchs, strings 10 - 20chars). Needs fast response time.
First thought was memcached. However to combat downtime if memcache goes down and starts repopulating the cache from a db like mysql.... i was thinking redis for auto repopulation of cache.
Is it true that redis does not persist to the hdd but instead a flush needs to be called for it to be backed up?
My hope is to use the code string as the key making lookup super quick. Value will be an id linking it to a db record thats not needed by the api.
If i had to guess how many unique strings will be stored..... 10M + after a few months.
Iv also looked at Cassandra briefly and mongodb. Im thinking mongodb will not be enough due to it not storing entire list in memory?
Any insight into these systems is very helpful. Feel like im going around in circles.
The api is made in nodejs. (If it matters)
10K/s is definitely not a high rate for a DB like Cassandra, according that your schema is done wisely. I bet it's the same for the others.
10M unique strings per months is peanuts for modern big data systems.
Whatever big data solution you retain, you will have to design the schema acording to the type of data and operational needs.
IMO, the important ones are the following 2 questions :
What you mean by "looking for matchs"?
If you need indexing and search using substrings or regexps, you need a search engine: ElasticSearch or SOLR are great. Warning that E/S does replication and sharding but it's distribution model is still not 100% safe.
None of the systems you mentionned will provide the reactivity you seem to look for.
If you will query using static strings: a key-value store or column oriented database like Cassandra will be just the perfect fit. So all are good fit.
What is a fast response time?
With selecting the right technology and appropriate schemas all those systems will give you great response time under hundreds of milliseconds, but will it be fast enough for you?
REDIS and MemCached being in-memory will provide the faster responses.
And as a conclusion, the API being in node.js is irrelevant for the choice of your storage and indexing technology, unless you want to stick with Javascript for everything and MongoDB is more friendly for you, it can be a decent candidate depending on your search use cases.

MongoDB + Elasticsearch or only Elasticsearch?

We have a new project there for index a large amount of data and for provide real time. I have also complexe search with facets, full text, geospatial...
The first prototype is to index in MongoDB and next, into Elasticsearch, because I had read that Elasticsearch does not apply a checksum on stored files and the index can't be fully trusted.
But since last versions (in the version 1.5), there is now a checksum and I'm guessing if we can use Elasticsearch as primary data store ? And what is the benefit to use MongoDB in addition to Elasticsearch ?
I can't find up to date answer about thoses features in Elasticsearch
Thanks a lot
Talking about arguments to use Mongo instead of/together with ES:
User/role management.
Built-in in MongoDB. May not fit all your needs, may be clumsy somewhere, but it exists and it was implemented pretty long time ago.
The only thing for security in ES is shield. But it ships only for Gold/Platinum subscription for production use.
Schema
ES is schemaless, but its built on top of Lucene and written in Java. The core idea of this tool - index and search documents, and working this way requires index consistency. At back end, all documents should be fitted in flat lucene index, which requires some understanding about how ES should deal with your nested documents and values, and how you should organize your indexes to maintain balance between speed and data completeness/consistency. Working with ES requires you to keep some things about schema in mind constantly. I.e: as you can index almost anything to ES without putting corresponding mapping in advance, ES can "guess" mapping on the fly but sometimes do it wrong and sometimes implicit mapping is evil, because once it put, it can't be changed w/o reindexing whole index. So, its better to not treat ES as schemaless store, because you can step on a rake some time (and this will be pain :) ), but rather treat it as schema-intensive, at least when you work with documents, that can be sliced to concrete fields.
Mongo, on the other hand, can "chew and leave no crumbs" out of almost anything you put in it. And most your queries will work fine, `til you remember how Mongo will deal with your data from JavaScript perspective. And as JS is weakly typed, you can work with really schemaless workflow (for sure, if you need such)
Handling non-table-like data.
ES is limited to handle data without putting it to search index. And this solution is good enough, when you need to store and retrieve some extra data (comparing to data you want to search against).
MongoDB supports gridFS. This gives you ability to handle large chunks of data behind the same interface. I.e., you can store binary data in Mongo and retrieve it within the same interface, from your code perspective.
Well, choose the right tool for the right job. If you require searching capabilities such as full text search, faceting etc, then nothing can beat a full fledged search engine. ElasticSearch(ES) or Solr is just a matter of choice.
You can actually feed(index) documents into ES for searching and then fetch the complete details for a particular entry from MongoDB or any other database.
I can make your task easier, do take a look at my open source work that's using MongoDB, ES, Redis and RabbitMQ, all integrated at one place, here on github
Please note that the application is built in .Net C#.
After having used Elasticsearch on production, I can add up to this thread few notes :
We securized our Elasticsearch clustering via a reverse proxy which check client certificate authenticity at request time before letting the query in : it proves that there is multiple way to add authentication anyway. (If you need more accuracy in security, like by using roles, there is few plugins that can be added to manage permissions)
Elasticsearch mapping and settings (tuning) are really important concepts to fully understand before going on production with it, and that's no that easy to get how everything works quickly.
Clustering and horizontal scaling is very flexible and easy to set up
The suite tools (Kibana, beats, etc ..) are a very convinient way to gather logs, expose key data, etc ...
Search features are extremely advanced, you can really do amazing things when you master a bit how full text search works (fuzzyness, boosting, scoring, stemming, tokenizer, analyzers, and so on ...).
API's are a bit scattered and there is not unique ways to achieve something. And some API are really WTF to use, like the bulk insert API: you need to pass binary data, with JSON format (ofc don't forget end of line characters) and repeating some fields multiple times. This is very verbose and I guess it's legacy code like we all have in our projects ;).
Last thing : if you develop a Java project, do not use Hibernate Search to duplicate data from a datasource to your ES cluster, we had so much issues with Hibernate Search, if we had to do that again, we'd do that manually.
Now about the real question :
To my mind, using only Elasticsearch is sufficient and may reduce complexity of having a multiple NoSQL storage systems.
I think it's worthy when you are doing a duo Relational and Transactional database + NoSQL search engine, but having two system which roughly serves the same purposes is a bit overkilled
I have recently developed a feature in my company,
we wanted to perform some searches and rank the result according to its relevance on multiple factors and conditions.
So in my application, we were already using MongoDB as Db,
So on ElasticSearch index, I exported some of the fields from MongoDB that I want to perform search and filters on.
So according to required conditions I prepared my mongo query and elasticsearch query also and performed the search. Then I filtered and sorted the result according to my need.
The whole flow will was designed in such a way that,
even if there is an error from ES, mongo will fetch the records.
If I get the result from ES then, mongo result will depend on ES result.
This is how I used mongo and ES in combination.
Also, don't forget to properly handle all updates, deletes and new record insertions.
And Just to Know, results for me were Really Good.

When to use CouchDB over MongoDB and vice versa

I am stuck between these two NoSQL databases.
In my project, I will be creating a database within a database. For example, I need a solution to create dynamic tables.
So users can create tables with columns and rows. I think either MongoDB or CouchDB will be good for this, but I am not sure which one. I will also need efficient paging as well.
Of C, A & P (Consistency, Availability & Partition tolerance) which 2 are more important to you? Quick reference, the Visual Guide To NoSQL Systems
MongodB : Consistency and Partition Tolerance
CouchDB : Availability and Partition Tolerance
A blog post, Cassandra vs MongoDB vs CouchDB vs Redis vs Riak vs HBase vs Membase vs Neo4j comparison has 'Best used' scenarios for each NoSQL database compared. Quoting the link,
MongoDB: If you need dynamic queries. If you prefer to define indexes, not map/reduce functions. If you need good performance on a big DB. If you wanted CouchDB, but your data changes too much, filling up disks.
CouchDB : For accumulating, occasionally changing data, on which pre-defined queries are to be run. Places where versioning is important.
A recent (Feb 2012) and more comprehensive comparison by Riyad Kalla,
MongoDB : Master-Slave Replication ONLY
CouchDB : Master-Master Replication
A blog post (Oct 2011) by someone who tried both, A MongoDB Guy Learns CouchDB commented on the CouchDB's paging being not as useful.
A dated (Jun 2009) benchmark by Kristina Chodorow (part of team behind MongoDB),
I'd go for MongoDB.
The answers above all overcomplicate the story.
If you plan to have a mobile component, or need desktop users to work offline and then sync their work to a server you need CouchDB.
If your code will run only on the server then go with MongoDB
That's it. Unless you need CouchDB's (awesome) ability to replicate to mobile and desktop devices, MongoDB has the performance, community and tooling advantage at present.
Very old question but it's on top of Google and I don't quite like the answers I see so here's my own.
There's much more to Couchdb than the ability to develop CouchApps. Most people use CouchDb in a classical 3-tiers web architecture.
In practice the deciding factor for most people will be the fact that MongoDb allows ad-hoc querying with a SQL like syntax while CouchDb doesn't (you've got to create map/reduce views which turns some people off even though creating these views is Rapid Application Development friendly - they have nothing to do with stored procedures).
To address points raised in the accepted answer : CouchDb has a great versionning system, but it doesn't mean that it is only suited (or more suited) for places where versionning is important. Also, couchdb is heavy-write friendly thanks to its append-only nature (writes operations return in no time while guaranteeing that no data will ever be lost).
One very important thing that is not mentioned by anyone is the fact that CouchDb relies on b-tree indexes. This means that whether you have 1 "row" or 20 billions, the querying time will always remain below 10ms. This is a game changer which makes CouchDb a low-latency and read-friendly database, and this really shouldn't be overlooked.
To be fair and exhaustive the advantage MongoDb has over CouchDb is tooling and marketing. They have first-class citizen tools for all major languages and platforms making the on-boarding easy and this added to their adhoc querying makes the transition from SQL even easier.
CouchDb doesn't have this level of tooling - even though there are many libraries available today - but CouchDb is exposed as an HTTP API and it is therefore quite easy to create a wrapper in your favorite language to talk with it. I personally like this approach as it avoids bloat and allows you to only take what you want (interface segregation principle).
So I'd say using one or the other is largely a matter of comfort and preference with their paradigms. CouchDb approach "just fits", for certain people, but if after learning about the database features (in the exhaustive official guide) you don't have your "hell yeah" moment, you should probably move on.
I'd discourage using CouchDb if you just want to use "the right tool for the right job". because you'll find out that you can't just use it that way and you'll end up being pissed and writing blog posts such as "Where are joins in CouchDb ?" and "Where is transaction management ?". Indeed Couchdb is - paradoxically - very transparent but at the same time requires a paradigm shift and a change in the way you approach problems to really shine (and really work).
But once you've done that it really pays off. I'd personally need very strong reasons or a major deal breaker on a project to choose another database, but so far I haven't met any.
Update December 2022:
Since this post is still getting a lot of views, I felt important to inform people that I have recently moved to using MongoDB as my daily driver, while keeping CouchDB in my toolbelt for specialized cases where this database makes more sense (namely cases where views are not needed). There were multiple reasons for this choice, the most important ones were:
Performance: While precomputed indexes are a powerful asset, the main limitation of CouchDB is its QueryServer architecture. Every time a document is updated, it has to be serialized and processed by every view (even though this happens in a deferred manner, namely when the view is accessed). But more importantly, every time a view is updated (for example to add filtering logic for a new field added as part of the implementation of a new feature), ALL documents of the database must be sent to the view. This becomes a big deal when you have millions of documents in the database. You start worrying about the impact of updating your views and it becomes a distraction. Should you decide to create one database per data type to bypass this limitation, you'd then lose the ability to map/reduce across all your documents since views are scoped per database. MongoDB avoids this by segmenting documents into collections (ie. data types) so that when an index is updated only a subset of the data of the database is impacted. Moreover, MongoDB uses a binary format making these operations way more performant (while CouchDB uses JSON sent to the view server in plain text). This point may not be important if you do not design products needing to operate at large scale (hundreds of thousands of daily users or more).
the tooling available with MongoDB is comprehensive and mature, whether we are talking about the drivers officially supported for various programming languages, or integration with IDEs.
Advanced querying: A wide range of data types and advanced query capabilities are available out of the box (geo types, GridFS allowing one to store files of arbitrary size directly in the DB etc...). Having easy access to powerful query aggregation capabilities made me realize how much CouchDB had been inhibiting my productivity.
Seamless support for resharding: resharding is easy with MongoDB, while it is a dangerous operation involving moving files by hands with CouchDB.
Many other small items that improve quality of life and really add up.
I have been a big CouchDB fan but I have to admit that moving to MongoDB as a daily driver felt a lot like moving back to civilization in terms of productivity and quality of life improvement. Now I only consider CouchDB for key-value store scenarios (in which no map-reduce views are required and all that is needed is getting a document by key - CouchDB shines quite a lot for this), and advanced situations in which having per-user like databases is needed (for example to support advanced synchronization between devices).
The only drawback I see with MongoDB is that it consumes a lot of memory to the point that I cannot install it on development machines having low specs (while by comparison couchdb is launched at startup without me noticing and consumes almost no resource). However I feel this is worth it considering the time saved and the features provided.
As a long-time CouchDB user, the value I see in MongoDB is quite different from the items highlighted in the other answers promoting MongoDB so I felt it was important for me to provide this update (and also out of intellectual honestly when I remembered this post). CouchDB gave me quite a boost in productivity back in the days compared to the SQL products and ORMs I had been using, and at that time there were a lot of horror stories circulating regarding the reliability of MongoDB.
However, as of now, the few concerns I could have (and that were probably given disproportionate importance by internet folks - they essentially all boiled down to defaults whose reliability tradeoffs may surprise new users in a number of scenarios) no longer stand.
At this point, as a long-time CouchDB user in a great position to compare both products, I would recommend MongoDB to people needing a productive and scalable software development experience for their web app and advise to only pick CouchDB for specific needs.
CouchDB had momentum back in the days which probably influenced my perception, but development has stalled, no meaningful features have been introduced for a long-time, otherwise it would probably have caught up with MongoDB in terms of quality of life. I see two possible reasons for this: the way a now aborted rewrite of CouchDB has diverted resources for a long-time, and maybe early architectural decisions (such as the Query Server architecture) that may very well have restricted its future from the start. None of these aspects seem to be the priority of the core team.
I do not totally regret choosing CouchDB because it has been massively helpful and the mindset it has taught me is extremely helpful to allow me to write performant code in MongoDB (writing performant code in MongoDB is a breeze compared to the discipline one has to observe to solve business problems using CouchDB). However if I had to do it again today, I would have transitioned to MongoDB as my daily driver MUCH sooner. I'm usually quite good at picking the winning horse when technologies popup, but this time it seems I haven't played the game that well. Hope this helps.
Ask this questions yourself? And you will decide your DB selection.
Do you need master-master? Then CouchDB. Mainly CouchDB supports master-master replication which anticipates nodes being disconnected for long periods of time. MongoDB would not do well in that environment.
Do you need MAXIMUM R/W throughput? Then MongoDB
Do you need ultimate single-server durability because you are only going to have a single DB server? Then CouchDB.
Are you storing a MASSIVE data set that needs sharding while maintaining insane throughput? Then MongoDB.
Do you need strong consistency of data? Then MongoDB.
Do you need high availability of database? Then CouchDB.
Are you hoping multi databases and multi tables/ collections? Then MongoDB
You have a mobile app offline users and want to sync their activity data to a server? Then you need CouchDB.
Do you need large variety of querying engine? Then MongoDB
Do you need large community to be using DB? Then MongoDB
I summarize the answers found in that article:
http://www.quora.com/How-does-MongoDB-compare-to-CouchDB-What-are-the-advantages-and-disadvantages-of-each
MongoDB: Better querying, data storage in BSON (faster access), better data consistency, multiple collections
CouchDB: Better replication, with master to master replication and conflict resolution, data storage in JSON (human-readable, better access through REST services), querying through map-reduce.
So in conclusion, MongoDB is faster, CouchDB is safer.
Also: http://nosql.mypopescu.com/post/298557551/couchdb-vs-mongodb
Be aware of an issue with sparse unique indexes in MongoDB. I've hit it and it is extremely cumbersome to workaround.
The problem is this - you have a field, which is unique if present and you wish to find all the objects where the field is absent. The way sparse unique indexes are implemented in Mongo is that objects where that field is missing are not in the index at all - they cannot be retrieved by a query on that field - {$exists: false} just does not work.
The only workaround I have come up with is having a special null family of values, where an empty value is translated to a special prefix (like null:) concatenated to a uuid. This is a real headache, because one has to take care of transforming to/from the empty values when writing/quering/reading. A major nuisance.
I have never used server side javascript execution in MongoDB (it is not advised anyway) and their map/reduce has awful performance when there is just one Mongo node. Because of all these reasons I am now considering to check out CouchDB, maybe it fits more to my particular scenario.
BTW, if anyone knows the link to the respective Mongo issue describing the sparse unique index problem - please share.
I'm sure you can with Mongo (more familiar with it), and pretty sure you can with couch too.
Both are documented oriented (JSON-based) so there would be no "columns" but rather fields in documents -- but they can be fully dynamic.
They both do it you may want to look at other factors on which to use: other features you care about, popularity, etc. Google insights and indeed.com job posts would be ways to look at popularity.
You could just try it I think you should be able to have mongo running in 5 minutes.

Disadvantages of CouchDB

I've very recently fallen in love with CouchDB. I'm pretty excited by its enormous benefits and by its beauty. Now I want to make sure that I haven't missed any show-stopping disadvantages.
What comes to your mind? Attached is a list of points that I have collected. Is there anything to add?
Blog posts from as late as 2010 claim "not mature enough" (whatever that's worth).
Slower than in-memory DBMS.
In-place updates require server-side logic (update handlers).
Trades disk vs. speed: Databases can become huge compared to other DBMS (compaction functionality exists, though).
"Only" eventual consistency.
Temporary views on large datasets are very slow.
Replication of large databases may fail.
Map/reduce paradigm requires rethinking (only for completeness).
The only point that worries me is #3 (in-place updates), because it's quite inconvenient.
The data is in JSON
Which means that documents are quite large (BigData, network bandwidth, speed), and having descriptive key names actually hurts, since they add up to the document size.
No built in full text search
Although there are ways: couchdb-lucene, elasticsearch
plus some more:
It doesn't support transactions
It means that enforcing uniqueness of one field across all documents is not safe, for example, enforcing that a username is unique. Another consequence of CouchDB's inability to support the typical notion of a transaction is that things like inc/decrementing a value and saving it back are also dangerous. There aren't many instances that we would want to simply inc/decrement some value where we couldn't just store the individual documents separately and aggregate them with a view.
Relational data
If the data makes a lot of sense to be in 3rd normal form, and we try to follow that form in CouchDB, we are going to run into a lot of trouble. A possible way to solve this problem is with view collations, but we might constantly going to be fighting with the system. If the data can be reformatted to be much more denormalized, then CouchDB will work fine.
Data warehouse
The problem with this is that temporary views in CouchDB on large datasets are really slow. Using CouchDB and permanent views could work quite well. However, in most of cases, a Column-Oriented Database of some sort is a much better tool for the data warehousing job.
But CouchDB Rocks!
But don't let it discorage you: NoSQL DBs that are written in Erlang (CouchDB, Riak) are the best, since Erlang is meant for distributed systems. Have fun with Couch!
2 more things, which make me cry when using CouchDB (though it's awesome):
It is not designed for frequently updated data
It doesn't have built-in fulltext search
Lack of reader ACLs (does exist for writers, however)
As an old Lotus Domino pro I was looking to CouchDB as an alternative for a new project I'm kicking off and found the limits on readers to be very weak in Couch vs. Domino. In my app security is an important consideration and Couch would require a middleware layer to handle reader security.
If you have database in which it's okay that all defined users can see all the documents, then Couch looks like an interesting platform.
If restricting reads is needed then you'll need to look to a middleware solution or consider another alternative.
Note to CouchDB developers: Improve the platform security options. I realize they will diminish performance when used but note that and make the option available.
Now back to determining which database to use...
currently no support for ad-hoc queries (might change with advent of UnQL)
lack of binary protocol support for faster communication
It's nothing to do with CouchDB itself, but being a relative newcomer on the scene means that most sysadmins are still unfamiliar with it and won't allow it anywhere near "their" data centers. If you're in a situation where you're deploying to an environment you don't control yourself, this can be quite the battle.
Lack of support for data archiving - No official support for data
archiving is provided with couch db open source distribution.
Deleting records from db is not straightforward
No option to set a expire (TTL) flag for documents

Frequent large, multi-record updates in MongoDB, Lucene, etc

I am working on the high-level design of a web application with the following characteristics:
Millions of records
Heavily indexed/searchable by various criteria
Variable document schema
Regular updates in blocks of 10K - 200K records at a time
Data needs to remain highly available during updates
Must scale horizontally effectively
Today, this application exists in MySQL and we suffer from a few huge problems, particularly that it is challenging to adapt to flexible schema, and that large bulk updates lock the data for 10 - 15 seconds at a time, which is unacceptable. Some of these things can be tackled by better database design within the context of MySQL, however, I am looking for a better "next generation" solution.
I have never used MongoDB, but its feature set seemed to most closely match what I am looking for, so that was my first area of interest. It has some things I am excited about, such as data sharding, the ability to find-update-return in a single statement, and of course the schema flexibility of NoSQL.
There are two things I am not sure about, though, with MongoDB:
I can't seem to find solid
information about the concurrency of
updates with large data sets (see my
use case above) so I have a hard
time understanding how it might
perform.
I do need open text search
That second requirement brought me to Lucene (or possibly to Solr if I kept it external) as a search store. I did read a few cases where Lucene was being used in place of a NoSQL database like MongoDB entirely, which made me wonder if I am over-complicating things by trying to use both in a single app -- perhaps I should just store everything directly in Lucene and run it like that?
Given the requirements above, does it seem like a combination of MongoDB and Lucene would make this work effectively? If not, might it be better to attempt to tackle it entirely in Lucene?
Currently with MongoDB, updates are locking at the server-level. There are a few JIRAs open that address this, planned for v1.9-2.0. I believe the current plan is to yield writes to allow reads to perform better.
With that said, there are plenty of great ways to scale MongoDB for super high concurrency - many of which are simiar for MySQL. One such example is to use RAID 10. Another is to use master-slave where you write to master and read from slave.
You also need to consider if your "written" data needs to be 1) durable and 2) accessible via slaves immediately. The mongodb drivers allow you to specify if you want the data to be written to disk immediately (or hang in memory for the next fsync) and allow you to specify how many slaves the data should be written to. Both of these will slow down MongoDB writing, which as noted above can affect read performance.
MongoDB also does not have nearly the capability for full-text search that Solr\Lucene have and you will likely want to use both together. I am currently using both Solr and MongoDB together and am happy with it.