How to implement Failover in Lucene? - lucene.net

I am currently running a single-server Lucene search engine for my platform and would like to explore the possibility of deploying another server (mostly for failover reasons).
I am using a Lucene.Net driver.
Any suggestions for best practices to do so with a Lucene index of about 100,000 documents?

Check out this article, it covers various scalability options and failover modes in great detail.
The URL mentioned above does not exist. Please correct it.

Related

Horizontally Scaling Database Guide

We want to horizontally scale our existing MongoDB database which is running on one server. Due to increased user base, we can't scale it vertically anymore. We need to scale it horizontally through sharding.
MongoDB provides a good tutorial on setting up sharding, but we need to do it quickly and we are not experts in this area.
It seems there are multiple options available, like Google Cloud and Amazon RDS. All we want is to keep using our database but have sharding handled by another service.
So my questions are:
1. Is it possible to build a fail-safe cluster architecture in less than a week using MongoDB sharding, with a team that has no prior experience in this?
2. If not, do these services like Google cloud SQL and Amazon RDS provide a mechanism to use our database with their Sharding service?
Can anyone with expertise in this just guide me in this direction?
I tried MongoDB Atlas and it looks pretty good https://www.mongodb.com/cloud/atlas
It creates a cluster for you by default
Maybe, you can give it a try:
MongoDB Atlas delivers the world’s leading database for modern
applications as a fully automated cloud service engineered and run by
the same team that builds the database. Proven operational and
security practices are built in, automating time-consuming
administration tasks such as infrastructure provisioning, database
setup, ensuring availability, global distribution, backups, and more.
The easy-to-use UI and API let you spend more time building your
applications and less time managing your database.

Cassandra vs Mongodb running costs?

We are planning to create a public website, and we're in the process of choosing a suitable database for it. After discussions it was suggested that we go with a NoSQL database, as it would be easier to scale in the future.
On our website we expect regular writes and a lot of reads, and it seems either Cassandra or MongoDB would suit it best.
Kindly suggest which of Cassandra and MongoDB would be easier to host and maintain, and cheaper in hosting charges.
Also, please suggest some providers of good, low-cost hosting for both Cassandra and MongoDB.

Advantages of using Jackrabbit Oak over MongoDB

We are building a news website similar to a blogging platform or a CMS. Users can write articles, post comments, like and share. We are newbies and are unable to decide between Jackrabbit Oak and MongoDB.
I went through the following thread
When to use JCR (content repository) over other options?. I understood that JCR allows to organize your content in a structure that closely matches your needs. I think this can be accomplished in MongoDB also. The answer compares JCR to RDBMS rather than NoSQL DBs like Mongo.
Also JCR Oak seems a bit complex so I would prefer to keep the stack simple and invest time on MongoDB - Unless Jackrabbit offers features which are extremely important and not present in MongoDB.
Can somebody explain whether JCR Oak has any killer feature over MongoDB?
We are finally going ahead with Cassandra.
Through my research I found out that JCR doesn't seem to have a large active community and the amount of tutorials is also limited. Mongo is far ahead of JCR and is being used in production at several companies. Could not find any killer feature in JCR over MongoDB.
I also read several blog posts saying that although Mongo is a great DB and easy to start developing with, if your website grows fast, scalability might create some challenges after a while and performance might also take a hit. See one of the blog posts here: http://patrickmcfadin.com/2014/02/11/mongodb-this-is-not-the-database-you-are-looking-for/
Although we are not worried about scalability right now, I found merit in the masterless architecture of Cassandra and in CQL being almost similar to SQL, and there are performance benchmarks posted on PlanetCassandra that show Cassandra scaling linearly.
JCR (Java Content Repository) is only an API specification. Apache Jackrabbit Oak is an implementation of JCR. Oak supports multiple underlying storage backends for content, such as NoSQL, RDBMS and the file system. So the interesting thing about Jackrabbit Oak is that it can work on top of MongoDB, which means you can have JCR and MongoDB at the same time.

What is the difference between CouchDB and Couchbase?

Are there any essential differences between CouchDB and Couchbase?
I think there are some essential differences between CouchDB and Couchbase Server that need to be pointed out.
I will not write about the advantages of switching from CouchDB to the Couchbase Server because those are described pretty much everywhere (see The Future of CouchDB by Damien Katz or Couchbase vs. Apache CouchDB by Couchbase). Instead, I will try to enumerate features of CouchDB that you will not find in the Couchbase Server.
All of the names relating to CouchDB and Couchbase can be really confusing, so I've updated this answer, to begin with a brief explanation of the most important ones.
Names and confusion
There is CouchDB, CouchIO, CouchOne, Couchbase, Couchbase Server, Couchbase Mobile, Couchbase Lite, CouchApps, BigCouch, Touchbase, Membase, Memcached, MemcacheDB... all different and yet related in a way not at all obvious from the names alone.
First, there was CouchDB, a database created by Damien Katz, a former IBM developer. Its official name was changed to Apache CouchDB after it became an Apache project.
A company named CouchIO was founded to work on Apache CouchDB and later changed its name to CouchOne (by "its name" I mean the company name - not the database name).
CouchOne (formerly CouchIO) merged with Membase (formerly NorthScale) to form a new company called Couchbase. Membase (the company) developed Membase (a product of the same name). Membase was created by several leaders of the Memcached project and it used the Memcached protocol. After the merger of CouchOne and Membase, Couchbase continued the development of the Membase software and later changed its name to Couchbase Server.
Today I think most people believe that Couchbase Server is a new version of CouchDB but it is, in fact, a new version of Membase. It still uses the Memcached protocol and not the RESTful API of CouchDB. Meanwhile, CouchDB is still CouchDB, actively maintained and enhanced as an Apache project.
Now to the relevant differences:
Licensing
The Couchbase Server is not entirely open-source/free software. There are two versions: Community Edition (free but no latest bug fixes) and Enterprise Edition (there are restrictions on usage, confidentiality provisions, audits by Couchbase Inc. that "will be conducted during regular business hours at Licensee's facilities" and other terms typical to proprietary software that many people may find unacceptable).
CouchDB is an open-source/free software (no strings attached) project of The Apache Software Foundation and is released under the Apache License, Version 2.0 (DFSG-compatible, FSF-approved, OSI-approved, GPL-compatible, non-copyleft, commercial-friendly).
Philosophy
I have never seen it directly pointed out but this may be actually the most important difference between those two databases because it is deeply about the underlying philosophy of distributed computing models and not only about certain features, APIs or licensing. CouchDB and the Couchbase Server completely differ in their philosophy of building distributed systems and databases.
According to the CAP theorem it is impossible for a distributed database to simultaneously provide consistency, availability and partition tolerance.
CouchDB is an AP type system (provides Availability and Partition tolerance).
Couchbase Server is EITHER a CP type system (according to Wikipedia) OR a CA type system (according to Couchbase technical update) - WHICH OF THESE IS CORRECT? PLEASE COMMENT.
Features
This is what I found to be a list of CouchDB features that are not supported by the Couchbase Server:
no RESTful API (only for views, not for CRUD operations)
no _changes feed
no peer-to-peer replication
no CouchApps
no Futon (there is a different administration interface available)
no document IDs
no notion of databases (there are only buckets)
no replication between a CouchDB database and Couchbase Server
no explicit attachments (you have to store additional files as new key/value pairs)
no HTTP API for everything (you need to use the Couchbase Server SDKs or one of the Experimental Client Libraries at Couchbase Develop so no experiments with curl and wget)
no CouchDB API (it uses the Memcached API instead)
you can't do everything from the browser (you have to write a server-side application)
no two-tier architecture for Web apps is possible (you have to write a server-side application to sit between the browser and the database, like with relational databases)
no eventual consistency
not entirely open-source/free software
not a drop-in replacement for CouchDB (seems like a drop-in replacement for Memcached instead)
Those features of CouchDB may or may not be important to you, so whether the lack of them is a disadvantage or not is strictly subjective, but I think that the decision whether to switch from CouchDB to Couchbase Server or not should be based on those differences and your dependence on those features in your current CouchDB deployments.
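To make the "HTTP API for everything" point concrete, here is a minimal sketch of what talking to CouchDB over plain HTTP looks like. It only builds the requests and does not send them. The base URL (a local CouchDB on its default port 5984), the helper names and the example database name are assumptions for illustration; the `PUT /{db}/{docid}` and `GET /{db}/_changes` endpoints are part of CouchDB's documented HTTP API.

```python
import json
import urllib.request
from urllib.parse import quote, urlencode

# Assumption: a local CouchDB instance listening on its default port.
BASE_URL = "http://localhost:5984"


def put_document(db: str, doc_id: str, doc: dict) -> urllib.request.Request:
    """Build (but do not send) a request that creates or updates a document."""
    url = f"{BASE_URL}/{quote(db, safe='')}/{quote(doc_id, safe='')}"
    return urllib.request.Request(
        url,
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def changes_feed(db: str, since: str = "0", feed: str = "longpoll") -> urllib.request.Request:
    """Build (but do not send) a request for the database's _changes feed."""
    query = urlencode({"since": since, "feed": feed})
    url = f"{BASE_URL}/{quote(db, safe='')}/_changes?{query}"
    return urllib.request.Request(url, method="GET")
```

Passing either request to `urllib.request.urlopen` would send it to a running server. The same calls can be made with curl or from a browser, which is what the two-tier architecture mentioned above relies on.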
For example if you've got interested in CouchDB after watching The CouchDB changes feed NodeCamp talk by Mikeal Rogers or one of the great CouchApp tutorials by J. Chris Anderson then you have to realize that if you want to switch to the Couchbase Server then you will have to forget about pretty much everything they were talking about.
Because of that, I would say that Couchbase Server looks like an evolution of Memcached and Membase (not an evolution of CouchDB) and as such it looks like a great product if you are currently using Memcached or Membase. If you are using CouchDB in the most basic way then you may consider using the Couchbase Server for the same things and it may or may not perform better (if you don't mind the license restrictions). But if you are actually using any of the features that are unique in CouchDB (like the changes feed, CouchApps, two-tier architecture, peer-to-peer replication etc.) then you can either forget about those features or stay with CouchDB.
In any case, make sure to read and understand the Migration to Couchbase for CouchDB Users tutorial before you think about switching.
People often get the wrong impression (maybe after reading things like "What's the future of CouchDB? It's Couchbase.") that CouchDB is somehow obsoleted by the Couchbase Server, or that it is an old, legacy version of Couchbase. Meanwhile CouchDB is an actively maintained open-source project, Couchbase server is a completely separate project (it is a newer project but it is not a newer version of CouchDB - they are not even compatible) and since even new tools for creating CouchApps still keep being developed (eg. see the Kanso project) then CouchDB is not going anywhere soon.
I hope it clarifies the confusion. Please correct me if I'm wrong on anything here.
Update:
Couchbase Server is actually a new name for the Membase Server (the Membase Server was renamed to Couchbase Server somewhere around version 1.8). See Couchbase 2011 Year in Review:
Unfortunately, we confused the heck out of many of our potential users. In addition to Membase Server and our new mobile products we also offered Couchbase Single Server which was a packaged “distribution” of Apache CouchDB. On top of that we began releasing developer previews of Couchbase Server 2.0, which incorporated CouchDB technology into Membase Server – but this product was not compatible with Couchbase Single Server (or CouchDB). [...] Membase Server will be renamed Couchbase Server 1.8 on its next release in January – a tiny step that simply alleviates “name” confusion. As has been planned from the beginning, the Couchbase Server 2.0 release (currently at Developer Preview 3) will add index and query functionality. While Couchbase Server 2.0 will incorporate substantial technology from the CouchDB project, it will not be upward compatible with CouchDB and it shouldn’t be viewed as a “version of CouchDB.” [emphasis added]
See also:
Comments to "The Future of CouchDB" by Damien Katz (removed in 2012 - available in the Web Archive)
Comments to "Why Couchbase?" by Damien Katz (removed in 2012 - available in the Web Archive)
Couchbase 2011 Year in Review
Membase Server is Now Couchbase Server
Couchbase technical update
Difference between Cloudant and CouchOne
They are different yet similar pieces of software. I've remixed the content from the top answer into a picture that might help clarify the "difference" as well as the common things:
A comment from Matt Ingenthron adds to this:
To add some context/corrections: NorthScale founders are Steve Yen and Dustin Sallings. I joined them shortly after founding. Also, Damien didn't later join Couchbase, he was part of CouchIO/Couch One prior to the merger. Citing a fun, historical source: https://youtube.com/watch?v=aZ_JOnU8tkI
I think Couchbase seems to be perceived as CouchDB's 'enterprise' alternative, which in a way seems to be true.
Apart from lacking the ability to attach files to records (documents) and the out-of-the-box REST endpoints that CouchDB has, Couchbase has an SQL-like language, N1QL (sometimes pronounced "nickel"; update: renamed to SQL++ in Couchbase 7.0).
This is one of the reasons why I don't really like or recommend using the term 'NoSQL'. I personally prefer the term 'non-relational'.

What are the challenges in embedding text search (Lucene/Solr/Hibernate Search) in applications that are hosted at client sites

We have an enterprise Java web app that our (external) customers deploy on their intranets. I am exploring different full-text search options (Lucene/Solr/Hibernate Search), and one common concern is the deployment/administration/tuning overhead.
This is particularly challenging in our case, since we do not host these applications. From what I have seen, most uses of these technologies have been in hosted applications. Our customers typically deploy our application in a clustered environment and do not have any experience with Lucene/Solr.
Does anyone have any experience with this? What challenges have you encountered with this approach? How did you overcome them? At this point I am trying to determine if this is feasible.
Thank you
It is very feasible to deploy applications that use Lucene (or Solr) onto client sites.
Some things to keep in mind:
Administration
* you need a way to version your index, so if/when you change the document structure in the index, it can be upgraded.
* you therefore need a good way to force a re-index of all existing data. Probably also a good idea to provide an Admin option to allow an Admin to trigger a re-index as well.
* you could also provide an Admin option to allow optimize() be called on your index, or have this scheduled. Best to test the actual impact this will have first, since it may not be needed depending on the shape of your index
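The versioning and forced re-index points above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Lucene API: the marker file name, the `INDEX_SCHEMA_VERSION` constant and the `rebuild` callback are hypothetical names, and the real re-indexing work would be done by your own code using the Lucene writer.

```python
import json
from pathlib import Path
from typing import Callable, Optional

# Hypothetical constant: bump it whenever the indexed document structure changes.
INDEX_SCHEMA_VERSION = 3
# Hypothetical marker file stored beside the index files.
MARKER_NAME = "index-version.json"


def read_index_version(index_dir: Path) -> Optional[int]:
    """Return the schema version recorded for this index, or None if absent."""
    marker = index_dir / MARKER_NAME
    if not marker.exists():
        return None
    return json.loads(marker.read_text())["schema_version"]


def needs_reindex(index_dir: Path) -> bool:
    """True when the index is missing or was built with another schema version."""
    return read_index_version(index_dir) != INDEX_SCHEMA_VERSION


def ensure_index(index_dir: Path, rebuild: Callable[[Path], None], force: bool = False) -> bool:
    """Rebuild when forced (e.g. by an Admin action) or when the version is stale.

    `rebuild` is a caller-supplied function that re-indexes all existing data.
    Returns True if a rebuild ran.
    """
    if not (force or needs_reindex(index_dir)):
        return False
    index_dir.mkdir(parents=True, exist_ok=True)
    rebuild(index_dir)
    marker = index_dir / MARKER_NAME
    marker.write_text(json.dumps({"schema_version": INDEX_SCHEMA_VERSION}))
    return True
```

Calling `ensure_index` at application startup covers upgrades, and wiring `force=True` to an Admin screen covers the manual re-index case.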
Deployment
If you are deploying into a clustered environment, the simplest (and fastest in terms of dev speed and runtime speed) solution could be to create the index on each node.
Tuning
* Do you have a reasonable approximation of the dataset you will be indexing? You will need to ensure you understand how your index scales (in both speed and size), since what you consider a reasonable dataset size, may not be the same as your clients... Therefore, you at least need to be able to let clients know what factors will lead to overly large index size, and possibly slower performance.
There are two advantages to embedding Lucene in your app over sending the queries to a separate Solr cluster: performance and ease of deployment/installation. Embedding Lucene means running Lucene in the same JVM, which means no additional server round trips. Commits should be batched in a separate thread. Embedding Lucene also means including a few more JAR files in your class path, so no separate Solr install is needed.
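A rough sketch of the "batch commits in a separate thread" idea, written language-neutrally in Python rather than against the Lucene API. The class name and the `apply_update` and `commit` callbacks are hypothetical stand-ins for whatever your index writer exposes; the batch size and wait time are arbitrary example values.

```python
import queue
import threading


class BatchingCommitter:
    """Applies index updates on a background thread and commits in batches.

    `apply_update` and `commit` are caller-supplied callables standing in for
    the real index writer operations (an assumption of this sketch).
    """

    _STOP = object()  # sentinel asking the worker to flush and exit
    _IDLE = object()  # marks a wait that timed out with no new update

    def __init__(self, apply_update, commit, max_batch=100, max_wait_s=1.0):
        self._apply_update = apply_update
        self._commit = commit
        self._max_batch = max_batch
        self._max_wait_s = max_wait_s
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, update):
        """Queue an update; returns immediately without committing."""
        self._queue.put(update)

    def close(self):
        """Flush any uncommitted updates and stop the worker thread."""
        self._queue.put(self._STOP)
        self._thread.join()

    def _run(self):
        pending = 0
        while True:
            try:
                item = self._queue.get(timeout=self._max_wait_s)
            except queue.Empty:
                item = self._IDLE
            if item is self._STOP:
                if pending:
                    self._commit()
                return
            if item is not self._IDLE:
                self._apply_update(item)
                pending += 1
            # Commit when the batch is full, or when the queue went quiet
            # while some updates are still uncommitted.
            if pending >= self._max_batch or (item is self._IDLE and pending):
                self._commit()
                pending = 0
```

Request threads only call `submit`, so they never pay the cost of a commit; the trade-off is that a new document becomes searchable only after the next batch commit.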
If your app is cluster aware, then the embedded Lucene option becomes highly problematic. An update to one node in the cluster needs to be searchable from any node in the cluster. Synchronizing the Lucene index on all nodes yields no better performance than using Solr. With Solr 4, you may find the administration to be less of a barrier to entry for your customers. Check out the literature on the grossly misnamed Solr Cloud.