Self join (hierarchical data) in Hibernate Search

I have two fields in an entity:
id
parentId
I want a self-join to fetch the hierarchical data, i.e. the children of a parent id.
Something like Oracle's hierarchical queries (CONNECT BY).

At the moment, Hibernate Search does not expose runtime join capabilities.
If your goal is to order results "parents first", I think you may be able to create a getter that creates a string similar to "rootId.grandParentId.parentId.thisId", and index the result of that getter. Then you can sort on that string. That would clearly be a hack, but it may work.
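A minimal sketch of that hack, assuming Hibernate Search 5 annotations; the Category entity, the pathSort field name and the zero-padding are illustrative choices, not part of the original suggestion:

import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.ManyToOne;
import javax.persistence.Transient;

import org.hibernate.search.annotations.Analyze;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;
import org.hibernate.search.annotations.SortableField;

@Entity
@Indexed
public class Category {

    @Id
    private Long id;

    @ManyToOne
    private Category parent;

    // Index the ancestor chain as one sortable string, e.g. "0001.0017.0042".
    // Ids are zero-padded because the sort is lexicographic ("2" > "17").
    @Transient
    @Field(name = "pathSort", analyze = Analyze.NO)
    @SortableField(forField = "pathSort")
    public String getMaterializedPath() {
        StringBuilder path = new StringBuilder(String.format("%04d", id));
        for (Category p = parent; p != null; p = p.getParent()) {
            path.insert(0, String.format("%04d.", p.getId()));
        }
        return path.toString();
    }

    public Category getParent() {
        return parent;
    }
}

Sorting on pathSort then returns each parent immediately before its descendants, since a parent's path is a prefix of its children's. The usual caveat of this hack: when an ancestor chain changes, every descendant needs reindexing.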
Alternatively, you may be able to leverage the native join capabilities of Lucene or Elasticsearch within Hibernate Search, but that will require extensive knowledge of Lucene or Elasticsearch.
With Hibernate Search 5, you may be able to implement it for Lucene, but probably not for Elasticsearch. Unfortunately, documentation of Lucene features is sparse.
With Hibernate Search 6, you may be able to implement it in both cases.
You will need:
native fields (Lucene/ES)
native predicates
obviously a good deal of knowledge of advanced Lucene/Elasticsearch features. For Lucene, documentation is sparse. For Elasticsearch, here is a good place to start: https://www.elastic.co/guide/en/elasticsearch/reference/current/parent-join.html (a mapping sketch follows this list).
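For the Elasticsearch route, a hedged sketch of the kind of join-field mapping the linked page describes, sent through the low-level REST client; the hierarchy index and the category/item relation names are invented for illustration:

import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class ParentJoinMapping {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(
                new HttpHost("localhost", 9200, "http")).build()) {
            // Declare a "join" field establishing a parent/child relation
            // between hypothetical "category" (parent) and "item" (child) docs.
            Request request = new Request("PUT", "/hierarchy");
            request.setJsonEntity(
                "{\"mappings\": {\"properties\": {"
                + "\"node_relation\": {"
                + "\"type\": \"join\","
                + "\"relations\": {\"category\": \"item\"}"
                + "}}}}");
            client.performRequest(request);
        }
    }
}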

Related

Does Elasticsearch have the same index functionality that MongoDB has?

I know MongoDB has an index creation feature to speed up queries (https://docs.mongodb.org/v3.0/indexes/). What does Elasticsearch offer for this purpose? I googled it but was unable to find any suitable information. I used indexing in MongoDB on the most frequently used fields to speed up queries, and now I want to do the same in Elasticsearch. Is there anything Elasticsearch provides for this? Thanks.
Elasticsearch also has indices: https://www.elastic.co/blog/what-is-an-elasticsearch-index
They are also used as part of the database's key features to provide swift search capabilities.
It is annoying that "index" is used in a different context with ES and many other databases. I'm not as familiar with MongoDB so I'll resort to their documentation at v3.0/core/index-types.
Basically Elasticsearch was designed to serve efficient "filtering" (yes/no queries) and "scoring" (relevance ranking via tf-idf etc.), and it uses Lucene as the underlying inverted index.
MongoDB concepts and their ES counterparts:
Single Field Index: trivially supported, e.g. as not_analyzed fields for exact matching
Compound Index: Lucene applies AND filter conditions via efficient bitmaps and can merge any "single field" indexes ad hoc (see the sketch after this list)
Multikey Index: transparent support; there is no difference between a single value and an array of values
Geospatial Index: directly supported via geo-shapes
Text Index: in a sense ES was optimized for this use case, via analyzed field types
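To make the "Compound Index" row concrete, a hedged sketch with Elasticsearch's Java query builders; the field names are invented:

import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

public class CompoundFilterExample {
    public static void main(String[] args) {
        // No compound index is declared up front: each term filter uses
        // its own single-field inverted index, and Lucene intersects the
        // matching document sets (bitmaps) at query time.
        BoolQueryBuilder query = QueryBuilders.boolQuery()
                .filter(QueryBuilders.termQuery("category", "computer"))
                .filter(QueryBuilders.termQuery("manufacturer", "sony"));
        System.out.println(query); // prints the equivalent JSON query DSL
    }
}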
In my view, in search applications relevance matters more than plain filtering of results, since some words occur in almost every document and are therefore less relevant when searching.
Elasticsearch has other very useful concepts as well, such as aggregations, nested documents and child/parent relationships.

Best NoSQL for filtering on multiple indexes/fields

Because of the size of the data that needs to be queried and the ability to scale as needed on multiple nodes, I am considering using some type of NoSQL db.
I have been researching numerous NoSQL offerings but can't yet decide on what would be the best option which would provide best performance, scalability and features for our data structure.
The data structure models a product catalog where each document/set contains certain properties and descriptions for that individual product. Properties vary from product to product, which is why a schema-less offering would work best.
A sample structure would be like:
[
  {
    "name": "item name",
    "cost": 563.34,
    "category": "computer",
    "manufacturer": "sony",
    ...
  }
]
So the requirement is that I need to be able to filter/query on many different fields/indexes in the record set, including filtering on and excluding multiple fields in the same query. Queries will be mostly reads, and there will not be much need for joins or relationship-style linking.
I have looked into: Elastic Search, mongodb, OrientDB, Couchbase and Aerospike.
Elastic Search seems like an obvious choice, but I was wondering about its performance and stability?
Aerospike seems like it would be really fast since it does it all mostly in memory, but its filtering and searching capabilities did not seem as capable.
What do you think the best option would be for my use case? Or are there other recommended DBs I should look into?
I know the best way is to test the performance with the actual real-life use case, but I am hoping to first narrow it down a little bit.
Thank you
This is a variant on the popular question "what is the best product" :)
As always: this depends on your specific use case and goals. Database products (like all products) are always the result of trade-offs. So there does NOT exist a single product offering best performance, scalability and features. However there are many very good products for your use case.
Because your question is about Product Data, and I have been working with Product Data for more than 15 years, I will try to answer your question.
A document model is a perfect fit for Product Data, so for all use cases other than simple lookups I would recommend a document store.
If your use case concerns a single application and you are using the Java platform, I would recommend using an embedded database. This makes things simpler and has a big performance advantage.
If you need faceted search or other advanced product search, I recommend SOLR or Elastic Search.
If you need a distributed system, I recommend Elastic Search over SOLR.
If you need product recommendations based on reviews or other graph-oriented algorithms, I recommend OrientDB or ArangoDB (or Neo4J, but in this case that would be my second choice).
Products we are using in production or have evaluated in depth for the use case you describe:
SOLR and ES. Both extremely well-engineered products. Both (also ES) mature and stable products.
Neo4J. The most mature graph database. A big advantage IMO is the awesome query language they use, plus the integrated Lucene engine. Very mature and well-engineered product. The disadvantage is that it is not a document graph but a property (key-value) graph. It can also be expensive.
MongoDB. Our first experience with a document store. Very good product. Big advantages: excellent documentation and (by far) the most popular NoSQL database.
OrientDB and ArangoDB. Both support the graph/document paradigm. These are less-known products, but very powerful. Because we are a Java-based shop, our preference goes to OrientDB. OrientDB has a Lucene engine integrated (although the implementation is quite simple). ArangoDB on the other hand has very good documentation and a very smart and efficient storage format, and finally AQL is also very nice!
Performance (tested with 11.43 million articles and 2.3 million products): all products are very fast, especially SOLR and ES in this use case. Embedded OrientDB is also mind-blowingly fast for import and simple queries. For faceted search, only the search servers provide really fast performance!
Bottom line: I would go for a graph/document store and/or a search server (SOLR or ES). Because you mentioned "filtering" (I assume faceted search), the search server is the obvious first choice.
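To make that concrete, a hedged sketch of the question's "filter and exclude in one query" requirement plus a facet, using Elasticsearch's Java query builders; field names follow the sample document above, and the aggregation name is invented:

import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class ProductSearchExample {
    public static void main(String[] args) {
        SearchSourceBuilder source = new SearchSourceBuilder()
                // include category = computer, exclude manufacturer = sony
                .query(QueryBuilders.boolQuery()
                        .filter(QueryBuilders.termQuery("category", "computer"))
                        .mustNot(QueryBuilders.termQuery("manufacturer", "sony")))
                // facet: product counts per manufacturer among the matches
                .aggregation(AggregationBuilders.terms("by_manufacturer")
                        .field("manufacturer"));
        System.out.println(source); // the JSON request body
    }
}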
OrientDB supports composite indexes on multiple fields. Example:
CREATE INDEX Product_idx ON Product (name, category, manufacturer) unique
SELECT FROM INDEX:Product_idx WHERE key = ["Donald Knuth", "computer"]
You could also create a FULL-TEXT index that uses all the power of Lucene as its engine.
Aerospike is a key-value store, not a document database. A document database would handle such field-level indexing and deeper searching into a nested object better. The secondary indexes in Aerospike currently (version 3.4.x) work on string and integer 'bins' (a concept similar to a document's field or a SQL table's column).
That said, the list and map complex types of Aerospike are being augmented with those capabilities, in work being done in this quarter. Keep an eye out for those changes in the upcoming releases. You'll be able to index and query on bins of type list and map.

Why use ElasticSearch with Mongo?

I have read a few articles recently on the combination of mongodb for storage and elasticsearch for indexing/search. I feel like I'm missing something though. Why would you go this route as opposed to just using mongo to index the data? What benefits does elasticsearch bring and is it worth the added complexity?
ElasticSearch implements a lot more features, such as custom splitting of text into words, custom stemming, faceted search and a whole lot more. While MongoDB's (rather simple) text search does some of this, it is not nearly as powerful as ElasticSearch.
If all you ever do is look for a single string in a single field, then MongoDB's normal query system will work excellently for that. If you need to look for words in multiple fields, then MongoDB's text search will work. If you need anything more than that, ElasticSearch is the way to go.
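For the MongoDB side of that comparison, a hedged sketch with the MongoDB Java driver; the collection and field names are invented:

import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

public class MongoTextSearchExample {
    static void demo(MongoCollection<Document> articles) {
        // One text index may span several fields...
        articles.createIndex(Indexes.compoundIndex(
                Indexes.text("title"), Indexes.text("body")));
        // ...and a single $text query then searches across all of them.
        for (Document d : articles.find(Filters.text("elasticsearch"))) {
            System.out.println(d.getString("title"));
        }
    }
}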
A search engine and a database do some fundamentally different things. A good search engine (like ElasticSearch) supports far more elaborate and complex indexing, facets, highlighting etc. In the case of ElasticSearch, you also get your replies 'real-time'. On the other hand, a search engine doesn't return every single document that matches your query. Instead, it will score documents according to how much they match, and return the top scoring ones. When you query a database such as MongoDB, you should expect it to return everything that matches your query.
You can store the entire document in ElasticSearch, but it is usually not the optimal solution. Normally you will have it configured to return the document ids, which you use to fetch the document from a database. MongoDB is a database optimized for document-based storage. This is why you hear about people using them together.
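A hedged sketch of that split, assuming the Elasticsearch high-level REST client for search and the MongoDB Java driver for retrieval; the index, collection and field names are invented, and Mongo's _id is assumed to be mirrored as the Elasticsearch document id:

import java.util.ArrayList;
import java.util.List;

import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class SearchThenFetch {
    // Ask Elasticsearch for matching ids only, then load the full
    // documents from MongoDB, which remains the source of truth.
    static List<Document> search(RestHighLevelClient es,
                                 MongoCollection<Document> docs,
                                 String text) throws Exception {
        SearchRequest request = new SearchRequest("articles")
                .source(new SearchSourceBuilder()
                        .query(QueryBuilders.matchQuery("body", text))
                        .fetchSource(false)); // skip _source, ids suffice
        SearchResponse response = es.search(request, RequestOptions.DEFAULT);

        List<String> ids = new ArrayList<>();
        for (SearchHit hit : response.getHits()) {
            ids.add(hit.getId()); // assumes Mongo _id mirrored as ES _id
        }
        return docs.find(Filters.in("_id", ids)).into(new ArrayList<>());
    }
}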
Edit:
When this was posted, it matched the recommendations, but this may no longer be the case.
Derick's answer pretty much nails it. The question behind all this is:
What are the features you want to implement in your application?
If you rely on heavy searching capabilities in large chunks of text, ElasticSearch is probably a good thing to use. If you want to have a flexible datastore that can cope with complex ad-hoc queries, Mongo might be a good fit. If you have different requirements for a datastore, it is often a good thing to combine two tools instead of implementing all kind of workarounds to make it work with just one datastore.
Choose the right tool for the job.

Combining Neo4J and MongoDB : Consistency

I am experimenting a lot these days, and one of the things I wanted to do is combine two popular NoSQL databases, namely Neo4j and MongoDB, simply because I feel they complement each other perfectly. The first-class citizens in Neo4j, the relations, are IMO exactly what's missing in MongoDB, whereas MongoDB allows me to not put large amounts of data in my node properties.
So I am trying to combine the two in a Java application, using the Neo4j Java REST binding, and the MongoDB Java driver. All my domain entities have a unique identifier which I store in both databases. The other data is stored in MongoDB and the relations between entities are stored in Neo4J. For instance, both databases contain a userid, MongoDB contains the profile information, and Neo4J contains friendship relations. With the custom data access layer I have written, this works exactly like I want it to. And it's fast.
BUT... When I want to create a user, I need to create both a node in Neo4j and a document in MongoDB. Not necessarily a problem, except that Neo4j is transactional and MongoDB is not. If both were transactional, I would just roll back both transactions when one of them fails. But since MongoDB isn't transactional, I cannot do this.
How do I ensure that whenever I create a user, either both a node and a document are created, or neither? I don't want to end up with a bunch of documents that have no matching node.
On top of that, not only do I want my combined database interaction to be ACID-compliant, I also want it to be thread-safe. Both the GraphDatabaseService and the MongoClient / DB are provided from singletons.
I found something about creating "Transaction Documents" in MongoDB, but I really don't like that approach. I would like something nice and clean like the Neo4j beginTx, tx.success, tx.failure, tx.finish setup. Ideally, something I can implement in the same try/catch/finally block.
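For illustration, a rough sketch of the compensating-action workaround this alludes to (MongoDB Java driver plus the Neo4j GraphDatabaseService API; all names are invented). It is precisely not ACID, which is the problem:

import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;

public class UserCreator {
    private final GraphDatabaseService graphDb;
    private final MongoCollection<Document> users;

    UserCreator(GraphDatabaseService graphDb, MongoCollection<Document> users) {
        this.graphDb = graphDb;
        this.users = users;
    }

    // Insert the Mongo document first, then create the node transactionally;
    // on failure, compensate by deleting the document. Best effort only:
    // the compensating delete can itself fail and leave an orphan behind.
    void createUser(String userId, Document profile) {
        users.insertOne(profile.append("_id", userId));
        try (Transaction tx = graphDb.beginTx()) {
            Node node = graphDb.createNode();
            node.setProperty("userId", userId);
            tx.success();
        } catch (RuntimeException e) {
            users.deleteOne(Filters.eq("_id", userId));
            throw e;
        }
    }
}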
Should I perhaps make a switch to CouchDB, which does appear to be transactional?
Edit: after some more research, sparked by a comment, I came to realize that CouchDB is also not suitable for my specific needs. To clarify: the Neo4j part is set in stone; the document store is not, as long as it has a Java library.
Pieter-Jan,
if you are able to use Neo4j 2.0 you can implement a Schema-Index-Provider (which is really easy) that creates your documents transactionally in MongoDB.
Neo4j has made its index providers transactional since the beginning; we did that with Lucene, and there is one for Redis too (which needs to be updated). But it is much easier with Neo4j 2.0. If you want, you can check out my implementation for MapDB: https://github.com/jexp/neo4j-mapdb-index
Although I'm a huge fan of both technologies, I think a better option for you could be OrientDB. It's a graph (like Neo4j) and document (like MongoDB) database in one, and it supports ACID transactions. Sounds like a perfect match for your needs.
As posted at https://stackoverflow.com/questions/23465663/what-is-the-best-practice-to-combine-neo4j-and-mongodb?lq=1, you might have a look at Structr.
Its backend can be regarded as a Document database around Neo4j. It's fully transactional and open-source.

Scala integration with MongoDB

We're using MongoDB, and rewriting parts of our stack with Scala. I'm wondering if I should stick with Morphia, or use a Scala MongoDB library such as Subset.
The question is: what do I get out of Subset? E.g., with Morphia I don't have to manually define the MongoDB field names like I have to do in Subset...
Is Subset really worth using?
We use Casbah + Salat and it works well in almost all cases.
With Scala you should consider using Casbah, which is an officially supported interface for MongoDB that builds on the Java driver.
Casbah's approach is intended to add fluid, Scala-friendly syntax on top of MongoDB and handle conversions of common types. If you try to save a Scala List or Seq to MongoDB, we automatically convert it to a type the Java driver can serialize. If you read a Java type, we convert it to a comparable Scala type before it hits your code. All of this is intended to let you focus on writing the best possible Scala code using Scala idioms. A great deal of effort is put into providing you the functional and implicit conversion tools you’ve come to expect from Scala, with the power and flexibility of MongoDB.
Casbah provides improved interfaces to GridFS, Map/Reduce and the core Mongo APIs. It also provides a fluid query syntax which emulates an internal DSL and allows you to write code which looks like what you might write in the JS Shell. There is also support for easily adding new serialization/deserialization mechanisms for common data types.
In addition to the ORM mappers/client libraries, I would suggest you give Rogue a try. It will serve you with a nice query DSL for MongoDB. Rogue 1.x only supports Lift-MongoDB, but version 2.x (which will ship in the very near future) will work with a lot more MongoDB libraries.
A sample query would be (pure Scala code with compile-time type checking):
Venue where (_.mayor eqs 1234) and (_.categories contains "Thai") fetch(10)
which queries for 10 entries in the Venue collection where 1234 is the mayor and Thai is one of its categories.
I am the author of Subset. I would say Subset is not really a kind of ORM library. It has no methods for working with databases and collections, leaving that to the Java/Scala drivers. Instead it focuses on transformations of MongoDB documents. This transformation core is rather generic and suitable not only for reading/writing fields, but also for applications that need to perform e.g. document migrations. The query/update builders Subset provides are built on top of this core.
That said, if you need an ORM, there are indeed simpler alternatives. I never intended Subset to compete with true ORM libraries; I filled a gap I encountered in my own projects.