Best NoSQL for filtering on multiple indexes/fields - mongodb

Because of the size of the data that needs to be queried and ability to scale as needed on multiple nodes, I am considering using some type of NoSQL db.
I have been researching numerous NoSQL offerings but can't yet decide on what would be the best option which would provide best performance, scalability and features for our data structure.
Data structure model is of a product catalog where each document/set contains certain properties and descriptions for the that individual product. Properties would vary from product to product which is why schema-less offering would work the best.
Sample structure would be like
[
{"name": "item name",
"cost": 563.34,
"category": "computer",
"manufacturer: "sony",
.
.
.
}
]
So requirement is that I need to be able to filter/query on many different data set fields/indexes in the record set, where I could filter on and exclude multiple indexes/fields in the same query. Queries will be mostly reads and there would not be much of a need for any joins or relationship type of linking.
I have looked into: Elastic Search, mongodb, OrientDB, Couchbase and Aerospike.
Elastic Search seems like an obvious choice, but I was wondering on the performance and it's stability?
Aerospike seems like it would be really fast since it does it all mostly in memory but it's filtering and searching capability didn't seem that capable
What do you think best option would be for my use case? or if there any other recommended DBs that I should look into.
I know that best way is to test the performance with the actual real life use case, but I am hoping to first narrow it down little bit.
Thank you

This is a variant on the popular question "what is the best product" :)
As always: this depends on your specific use case and goals. Database products (like all products) are always the result of trade-offs. So there does NOT exist a single product offering best performance, scalability and features. However there are many very good products for your use case.
Because your question is about Product Data and I am working with Product Data for more than 15 years, it will try to answer your question.
A document model is a perfect fit for Product Data. So for all use cases other than simple look up I would recommend a Document Store
If your use case concerns a single application and you are using the Java platform. I would recommend to use an embedded database. This makes things simpler and has a big performance advantage
If you need faceted search or other advance product search, i recommend you to use SOLR or Elastic Search
If you need a distributed system I recommend Elastic Search over SOLR
If you need Product recommendations based on reviews or other graph oriented algorithms, I recommend to use OrientDB or ArangoDB (or Neo4J, but in this case this would be my second choice)
Products we are using in Production or evaluated in depth for the use case you describe are
SOLR and ES. Both extremely well engineered products. Both (also ES) mature and stable products
Neo4J. Most mature graph database. Big advantage IMO is the awesome query language they use. Integrated Lucene engine. Very mature and well engineered product. Disadvantage is the fact that it is not a Document Graph but Property (key-value) Graph. Also it can be expensive
MongoDB. Our first experience with Document store. Very good product. Big advantage: excellent documentation, (by far) most popular NoSQL database
OrientDB and ArangoDB. Both support the Graph/Document paradigm. This are less known products, but very powerful. Because we are a Java based shop, our preference goes to OrientDB. OrientDB has a Lucene engine integrated (although the implementation is quite simple). ArangoDB on the other hand has very good documentation and a very smart and efficient storage format and finally the AQL is also very nice!
Performance: (tested with 11.43 mio Articles and 2.3 mio products). All products are very fast, especially SOLR and ES in this use case. Embedded OrientDB is also mind blowing fast for import and simple queries. For faceted search only the Search Servers provide real fast performance!
Bottom line: I would go for a Graph/Document store and/or Search Server (SOLR or ES). Because you mentioned "filtering" (I assume faceted search). The Search Server is the obvious first choice

OrientDB supports composite indexes on multiple fields. Example:
CREATE INDEX Product_idx ON Product (name, category, manufacturer) unique
SELECT FROM INDEX:Product_idx WHERE key = ["Donald Knuth", "computer"]
You could also create a FULL-TEXT index by using all the power of Lucene as engine.

Aerospike is a key-value store, not an document database. A document database would do such field-level indexing and deeper searching into a nested object better. The secondary indexes in Aerospike currently (version 3.4.x) work on string and integer 'bins' (a concept similar to a document's field or a SQL table's column).
That said, the list and map complex types of Aerospike are being augmented with those capabilities, in work being done in this quarter. Keep an eye out for those changes in the upcoming releases. You'll be able to index and query on bins of type list and map.

Related

Can DynamoDB or SimpleDB replace my MongoDB use-case?

I was wondering if DynamoDB or SimpleDB can replace my MongoDB use-case? Here is how I use MongoDB
15k entries, and I add 200 entries per hour
15 columns each of which is indexed using (ensureIndex)
Half of the columns are integers, the others are text fields (which basically have no more than 10 unique values)
I run about 10k DB reads per hour, and they are super fast with MongoDB right now. It's an online dating site. So the average Mongo query is doing a range search on 2 of the columns (e.g. age and height), and "IN" search for about 4 columns (e.g. ethnicity is A, B, or C... religion is A, B, ro C).
I use limit and skip very frequently (e.g. get me the first 40 entries, the next 40 entries, etc)
I use perl to read/write
I'm assuming you're asking because you want to migrate directly to an AWS hosted persistence solution? Both DynamoDB and SimpleDB are k/v stores and therefor will not be a "drop-in" replacement for a document store such as MongoDB.
With the exception of the limit/skip one (which require a more k/v compatible approach) all your functional requirements can easily be met by either of the two solutions you mentioned (although DynamoDB in my opinion is the better option) but that's not going to be the main issue. The main issue is going to be to move from a document store, especially one with extensive query capabilities, to a k/v store. You will need to rework your schema, rethink your indexes, work within the constraints of a k/v store, make use of the benefits of a k/v store, etc.
Frankly if your current solution works and if MongoDB feels like a good functional fit I'd certainly not migrate unless you have very strong non-technical reasons to do so (such as, say, your boss wants you to ;) )
What would you say is the reason you're considering this move or are you just exploring whether or not it's possible in the first place?
If you are planning to have your complete application on AWS then you might also consider using Amazon RDS (hosted managed MySQL). It's not clear from your description if you actually need MongoDB's document model so considering only the querying capabilities RDS might come close to what you need.
Going with either SimpleDB or DynamoDB will most probably require you to rethink some of the features based around aggregation queries. As regards choosing between SimpleDB and DynamoDB there are many important differences, but I'd say that the most interesting ones from your point of view are:
SimpleDB indexes all attributes
there're lots of tips and tricks that you'll need to learn about SimpleDB (see what the guys from Netflix learned while using SimpleDB)
DynamoDB pricing model is based on actual write/read operations (see my notes about DynamoDB)

What NoSQL database to use as replacement for MySQL?

When it comes to NoSQL, there are bewildering number of choices to select a specific NoSQL database as is clear in the NoSQL wiki.
In my application I want to replace mysql with NOSQL alternative. In my application I have user table which has one to many relation with large number of other tables. Some of these tables are in turn related to yet other tables. Also I have a user connected to another user if they are friends.
I do not have documents to store, so this eliminates document oriented NoSQL databases.
I want very high performance.
The NOSQL database should work very well with Play Framework and scala language.
It should be open source and free.
So given above, what NoSQL database I should use?
I think you may be misunderstanding the nature of "document databases". As such, I would recommend MongoDB, which is a document database, but I think you'll like it.
MongoDB stores "documents" which are basically JSON records. The cool part is it understands the internals of the documents it stores. So given a document like this:
{
"name": "Gregg",
"fave-lang": "Scala",
"fave-colors": ["red", "blue"]
}
You can query on "fave-lang" or "fave-colors". You can even index on either of those fields, even the array "fave-colors", which would necessitate a many-to-many in relational land.
Play offers a MongoDB plugin which I have not used. You can also use the Casbah driver for MongoDB, which I have used a great deal and is excellent. The Rogue query DSL for MongoDB, written by FourSquare is also worth looking at if you like MongoDB.
MongoDB is extremely fast. In addition you will save yourself the hassle of writing schemas because any record can have any fields you want, and they are still searchable and indexable. Your data model will probably look much like it does now, with a users "collection" (like a table) and other collections with records referencing a user ID as needed. But if you need to add a field to one of your collections, you can do so at any time without worrying about the older records or data migration. There is technically no schema to MongoDB records, but you do end up organizing similar records into collections.
MongoDB is one of the most fun technologies I have happened to come across in the past few years. In that one happy Saturday I decided to check it out and within 15 minutes was productive and felt like I "got it". I routinely give a demo at work where I show people how to get started with MongoDB and Scala in 15 minutes and that includes installing MongoDB. Shameless plug if you're into web services, here's my blog post on getting started with MongoDB and Scalatra using Casbah: http://janxspirit.blogspot.com/2011/01/quick-webb-app-with-scala-mongodb.html
You should at the very least go to http://try.mongodb.org
That's what got me started.
Good luck!
At this point the answer is none, I'm afraid.
You can't just convert your relational model with joins to a key-value store design and expect it to be a 1:1 mapping. From what you said it seems that you do have joins, some of them recursive, i.e. referencing another row from the same table.
You might start by denormalizing your existing relational schema to move it closer to a design you wish to achieve. Then, you could see more easily if what you are trying to do can be done in a practical way, and which technology to choose. You may even choose to continue using MySQL. Just because you can have joins doesn't mean that you have to, which makes it possible to have a non-relational design in a relational DBMS like MySQL.
Also, keep in mind - non-relational databases were designed for scalability - not performance! If you don't have thousands of users and a server farm a traditional relational database may actually work better for you.
Hmm, You want very high performance of traversal and you use the word "friends". The first thing that comes to mind is Graph Databases. They are specifically made for this exact case.
Try Neo4j http://neo4j.org/
It's is free, open source, but also has commercial support and commercial licensing, has excellent documentation and can be accessed from many languages (REST interface).
It is written in java, so you have native libraries or you can embedd it into your java/scala app.
Regarding MongoDB or Cassendra, you now (Dec. 2016, 5 years late) try longevityframework.org.
Build your domain model using standard Scala idioms such as case classes, companion objects, options, and immutable collections. Tell us about the types in your model, and we provide the persistence.
See "More Longevity Awesomeness with Macro Annotations! " from John Sullivan.
He provides an example on GitHub.
If you've looked at longevity before, you will be amazed at how easy it has become to start persisting your domain objects. And the best part is that everything persistence related is tucked away in the annotations. Your domain classes are completely free of persistence concerns, expressing your domain model perfectly, and ready for use in all portions of your application.

Extract to MongoDB for analysis

I have a relational database with about 300M customers and their attributes from several perspectives (360).
To perform some analytics I intent to make an extract to a MongoDB in order to have a 'flat' representation that is more suited to apply data mining techniques.
Would that make sense? Why?
Thanks!
No.
Its not storage that would be the concern here, its your flattening strategy.
How and where you store the flattened data is a secondary concern, note MongoDB is a document database and not inherently flat anyway.
Once you have your data in the shape that is suitable for your analytics, then, look at storage strategies, MongoDB might be suitable or you might find that something that allows easy Map Reduce type functionality would be better for analysis... (HBase for example)
It may make sense. One thing you can do is setup MongoDB in a horizontal scale-out setup. Then with the right data structures, you can run queries in parallel across the shards (which it can do for you automatically):
http://www.mongodb.org/display/DOCS/Sharding
This could make real-time analysis possible when it otherwise wouldn't have been.
If you choose your data models right, you can speed up your queries by avoiding any sorts of joins (again good across horizontal scale).
Finally, there is plenty you can do with map/reduce on your data too.
http://www.mongodb.org/display/DOCS/MapReduce
One caveat to be aware of is there is nothing like SQL Reporting Services for MongoDB AFAIK.
I find MongoDB's mapreduce to be slow (however they are working on improving it, see here: http://www.dbms2.com/2011/04/04/the-mongodb-story/ ).
Maybe you can use Infobright's community edition for analytics? See here: http://www.infobright.com/Community/
A relational db like Postgresql can do analytics too (afaik MySQL can't do a hash join but other relational db's can).

CouchDB and MongoDB really search over each document with JavaScript?

From what I understand about these two "Not only SQL" databases. They search over each record and pass it to a JavaScript function you write which calculates which results are to be returned by looking at each one.
Is that actually how it works? Sounds worse than using a plain RBMS without any indexed keys.
I built my schemas so they don't require join operations which leaves me with simple searches on indexed int columns. In other words, the columns are in RAM and a quick value check through them (WHERE user_id IN (12,43,5,2) or revision = 4) gives the database a simple list of ID's which it uses to find in the actual rows in the massive data collection.
So I'm trying to imagine how in the world looking through every single row in the database could be considered acceptable (if indeed this is how it works). Perhaps someone can correct me because I know I must be missing something.
#Xeoncross
I built my schemas so they don't require join operations which leaves me with simple searches on indexed int columns. In other words, the columns are in RAM and a quick value check through them (WHERE user_id IN (12,43,5,2) or revision = 4)
Well then, you'll love MongoDB. MongoDB support indexes so you can index user_id and revision and this query will be able to return relatively quickly.
However, please note that many NoSQL DBs only support Key lookups and don't necessarily support "secondary indexes" so you have to do you homework on this one.
So I'm trying to imagine how in the world looking through every single row in the database could be considered acceptable (if indeed this is how it works).
Well if you run a query in an SQL-based database and you don't have an index that database will perform a table scan (i.e.: looking through every row).
They search over each record and pass it to a JavaScript function you write which calculates which results are to be returned by looking at each one.
So in practice most NoSQL databases support this. But please never use it for real-time queries. This option is primarily for performing map-reduce operations that are used to summarize data.
Here's maybe a different take on NoSQL. SQL is really good at relational operations, however relational operations don't scale very well. Many of the NoSQL are focused on Key-Value / Document-oriented concepts instead.
SQL works on the premise that you want normalized non-repeated data and that you to grab that data in big sets. NoSQL works on the premise that you want fast queries for certain "chunks" of data, but that you're willing to wait for data dependent on "big sets" (running map-reduces in the background).
It's a big trade-off, but if makes a lot of sense on modern web apps. Most of the time is spent loading one page (blog post, wiki entry, SO question) and most of the data is really tied to or "hanging off" that element. So the concept of grabbing everything you need with one query horizontally-scalable query is really useful.
It's the not the solution for everything, but it is a really good option for lots of use cases.
In terms of CouchDB, the Map function can be Javascript, but it can also be Erlang. (or another language altogether, if you pull in a 3rd Party View Server)
Additionally, Views are calculated incrementally. In other words, the map function is run on all the documents in the database upon creation, but further updates to the database only affect the related portions of the view.
The contents of a view are, in some ways, similar to an indexed field in an RDBMS. The output is a set of key/value pairs that can be searched very quickly, as they are stored as b-trees, which some RDBMSs use to store their indexes.
Think CouchDB stores the docs in a btree according to the "index" (view) and just walks this tree.. so it's not searching..
see http://guide.couchdb.org/draft/btree.html
You should study them up a bit more. It's not "worse" than and RDMBS it's different ... in fact, given certain domains/functions the "NoSQL" paradigm works out to be much quicker than traditional and in some opinions, outdated, RDMBS implementations. Think Google's Big Table platform and you get what MongoDB, Riak, CouchDB, Cassandra (Facebook) and many, many others are trying to accomplish. The primary difference is that most of these NoSQL solutions focus on Key/Value stores (some call these "document" databases) and have limited to no concept of relationships (in the primary/foreign key respect) and joins. Join operations on tables can be very expensive. Also, let's not forget the object relational impedance mismatch issue... You don't need an ORM to access MongoDB. It can actually store your code object (or document) as it is in memory. Can you imagine the savings in lines of code and complexity!? db4o is another lightweight solution that does this.
I don't know what you mean when you say "Not only SQL" database? It's a NoSQL paradigm - wherein no SQL is used to query the underlying data store of the system. NoSQL also means not an RDBMS which SQL is generally built on top of. Although, MongoDB does has an SQL like syntax that can be used from .NET when retrieving data - it's called NoRM.
I will say I've only really worked with Riak and MongoDB... I'm by no means familiar with Cassandra or CouchDB past a reading level and feature set comprehension. I prefer to use MongoDB over them all. Riak was nice too but not for what I needed. You should download a few of these NoSQL solutions and you will get the concept. Check out db4o, MongoDB and Riak as I've found them to be the easiest with more support for .NET based languages. It will just make sense for certain applications. All in all, the NoSQL or Document databse or OODBMS ... whatever you want to call it is very appealing and gaining lots of movement.
I also forgot about your javascript question... MongoDB has JavaScript "bindings" that enable it to be used as one method of searching for data. Riak handles data via a JSON format. MongoDB uses BSON I believe and I can't remember what the others use. In any case, the point is instead of SQL (structured query language) to "ask" the database for information some of these (MongoDB being one) use Javascript and/or RESTful syntax to ask the NoSQL system for data. I believe CouchDB and Riak can be queried over HTTP to which makes them very accessible. Not to mention, that's pretty frickin cool.
Do your research.... download them, they are all free and OSS.
db4o: http://www.db4o.com/ (Java & .NET versions)
MongoDB: mongodb.org/
Riak: http://www.basho.com/Riak.html
NoRM: http://thechangelog.com/post/436955815/norm-bringing-mongodb-to-net-linq-and-mono

Which NoSQL solution for a dating search site?

I am highly interested in new NoSQL solutions to implement a search engine for a dating site. However because of having a lot of possibilities, I am little bid confused. My requirements,
1) 10 million people
2) More than 8 index (gender, online, city, name etc...)
3) Scalability
Thanks
You wanna go for either mangoDB or CouchDB.
CouchDB scales a little better while mangoDB syntax is a little more familiar.
also it depends what framework/language u use to create the dating site.
i personally would choose couchdb. (u should know javascript...a lot)
Apache Solr is a data store and fulltext search engine that might be useful to you. Solr is rarely mentioned as a NoSQL technology, but it shares many characteristics with document-oriented databases.
Keep in mind that you have to know what type of queries you're going to run before you can choose a NoSQL solution or design your database.
That's in contrast to a relational database, where you can design a general-purpose database based on the data relationships.
With that large of a dataset you would probably be well advised to look at search as separate from data store. As someone suggested, SOLR will index your data for you to search independently of your database. You have 2 problems, data store and search.
ElasticSearch http://www.elasticsearch.org/overview/
Can handle age difference, geographic location, tastes and dislikes, etc. Or a leaderboard system that depends on many variables.
You'd want something that has sophisticated search and aggregation support.
Elasticsearch is a good candidate. In addition to its ability to perform fuzzy, proximity searches (which is something you'd likely want), you'd also want to integrate some machine learning pipeline to constantly improve your matching 'accuracy'.