Should I use Elasticsearch for non-free-text searches? - postgresql

I use Postgres as a data warehouse. I need to do free-text search on many of the fields. My DBA recommends not using Postgres for free-text search, so I am now considering Elasticsearch. The question is what to do if the user filters by both free text and some structured dimension. Should I query both Elasticsearch and Postgres and take the intersection, or can I serve all queries from Elasticsearch? What if there is no free text in the filter: is Elasticsearch appropriate for my general-purpose querying?
EDIT: As requested, some more information. The database will contain a few million rows. I cannot give concrete details about the data except that a row will contain ~30 columns, half of them strings between one word and a few sentences long. The reasons to use Elasticsearch are not just the DBA's objection to a full-text index in Postgres; Elasticsearch also gives result ranking and specific text search semantics.

It is true that Elasticsearch is great for full-text search, since it uses Lucene under the covers, but it is also very good for structured search through filters. Another great thing you can do with it is data analytics: aggregations over your data, which you can then visualize.
That said, you don't need full-text search requirements to make good use of Elasticsearch. There are many use cases where Elasticsearch is used for only one of the three aspects I mentioned: full-text search, structured search, or data analytics. The next step is to combine them.
Your use case is quite common, and I would suggest running the structured queries against Elasticsearch too instead of querying two systems. The only obstacle I can foresee is document relations, which need to be properly represented and handled within Elasticsearch.
Have a look at the Elasticsearch query DSL, which is used to represent queries and lets you effectively combine structured and unstructured search.
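As a rough illustration of combining the two kinds of search, here is a minimal sketch that builds a query DSL request body as a plain Python dict. The `bool`, `must`, `filter`, `match` and `term` keys are part of the query DSL; the field names (`description`, `country`, `status`) and the helper `build_search_body` are hypothetical and only there for the example.

```python
import json

def build_search_body(text, filters):
    """Build a bool query: scored full-text part plus unscored exact-match filters."""
    bool_query = {
        # "filter" clauses only include or exclude documents; they do not affect the score.
        "filter": [{"term": {field: value}} for field, value in filters.items()],
    }
    if text:
        # "must" clauses are scored, so matching documents are ranked by relevance.
        # "description" is an assumed field name.
        bool_query["must"] = [{"match": {"description": text}}]
    return {"query": {"bool": bool_query}}

# Free text plus structured filters in one request.
body = build_search_body("wireless keyboard", {"country": "DE", "status": "active"})
print(json.dumps(body, indent=2))

# No free text at all: the same index can still serve a purely structured query.
structured_only = build_search_body(None, {"country": "DE"})
print(json.dumps(structured_only, indent=2))
```

The resulting dict would be sent as the body of a search request; when no free text is given, only the filter part remains, so there is nothing to intersect with a second system.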

Related

MongoDB vs. Elasticsearch query/aggregation performance comparison

This question is about choosing the type of database to run queries on for an application. Keeping other factors aside for the moment, and given that the choice is between mongodb and elastic, the key criterion is that the query should be resolved in near real time. The queries will be ad-hoc and as such can contain any of the fields in the JSON objects and will likely contain aggregations and subaggregations. Furthermore, there will not be nested objects and none of the fields will be containing 'descriptive' text (like movie reviews etc.), i.e., all the fields will be keyword type fields like State, Country, City, Name etc.
Now, I have read that elasticsearch performance is near real time and that elasticsearch uses inverted indices and creates them automatically for every field.
Given all the above, my questions are as follows.
(There is a similar question on Stack Overflow, "elasticsearch v.s. MongoDB for filtering application", but I do not think it answers my questions.)
1) Since the fields in the use case I mentioned do not contain descriptive text, and hence would not require the full-text search capability and other additional features that Elasticsearch provides (especially for text search), what would be the better choice between Elasticsearch and MongoDB? How would Elasticsearch and MongoDB query/aggregation performance compare if I were to create single-field indices on all the available fields in MongoDB?
2) I am not familiar with advanced indexing, so I am assuming that it would be possible to create indices on all available fields in mongo (either using multiple single field indices or maybe compound indices?). I understand that this will come with a cost for storage and write speed which is true for elastic as well.
3) Also, in elastic the user can trade off write speed (indexing rate) with the speed with which the written document becomes available (refresh_interval) for a query. Is there a similar feature in mongo?
I think the size of your data set is also a very important factor in choosing a DB engine. According to this benchmark (2015), if you have over 10 million documents, Elasticsearch could be a better choice. If your data set is small, there should be no obvious difference in performance between Elasticsearch and MongoDB.

Database for filtering XML documents

I would like to hear some suggestions on implementing a database solution for the problem below.
1) 100 million XML documents are saved to the database per day.
2) The database holds a maximum of 3 days of data.
3) 1 million query requests per day.
4) The values by which the documents are filtered are stored in a separate table and mapped to the corresponding XML document ID.
5) The documents are requested based on date range, documents matching a list of IDs, the top 10 new documents, and records that are new since the previous request.
Here is what I have done so far:
1) Checked whether I can use Redis. It is limited to a few data types and cannot apply multiple where conditions to filter a Redis hash. I need indexing based on date and lots of other fields, and I am unable to choose the right data structure to store this in a hash.
2) Investigated DynamoDB. It is again a key-value store where all the filter conditions would have to be stored as one value. I am not sure it would be efficient to query a JSON document to filter the right XML document.
3) Investigated Cassandra, and it looks like it may fit my requirements, but it has a limitation: read operations might be slow. Cassandra has the advantage of faster write operations over changing data. This looks like the best possible solution so far.
Currently we are using SQL Server, and there is a performance problem, so we are looking for a better solution.
Please suggest, thanks.
It's not that reads in Cassandra might be slow, but that it's hard to guarantee an SLA for reads (usually they will be fast, but some of them will be slow).
Cassandra doesn't have the search capabilities you may need in the future (ordering, searching by many fields, ranked searching). You can probably achieve that with Cassandra, but with obviously more effort than with a database suited to search operations.
I suggest you look at Lucene/Elasticsearch. Let me quote the features of Lucene from their main website:
Scalable, High-Performance Indexing:
  over 150GB/hour on modern hardware
  small RAM requirements -- only 1MB heap
  incremental indexing as fast as batch indexing
  index size roughly 20-30% the size of text indexed
Powerful, Accurate and Efficient Search Algorithms:
  ranked searching -- best results returned first
  many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
  fielded searching (e.g. title, author, contents)
  sorting by any field
  multiple-index searching with merged results
  allows simultaneous update and searching
  flexible faceting, highlighting, joins and result grouping
  fast, memory-efficient and typo-tolerant suggesters
  pluggable ranking models, including the Vector Space Model and Okapi BM25
  configurable storage engine (codecs)

Does Elasticsearch have the same index functionality that MongoDB has?

MongoDB has an index creation feature to speed up queries (https://docs.mongodb.org/v3.0/indexes/). What does Elasticsearch offer for this purpose? I googled it but was unable to find any suitable information. I used indexes in MongoDB on the most frequently used fields to speed up queries, and now I want to do the same in Elasticsearch. Is there anything that Elasticsearch provides for this? Thanks.
Elasticsearch also has indices: https://www.elastic.co/blog/what-is-an-elasticsearch-index
They are also used as part of the database's key features to provide swift search capabilities.
It is annoying that "index" is used in a different context with ES and many other databases. I'm not as familiar with MongoDB so I'll resort to their documentation at v3.0/core/index-types.
Basically Elasticsearch was designed to serve efficient "filtering" (yes/no queries) and "scoring" (relevance ranking via tf-idf etc.), and it uses Lucene as the underlying inverted index.
MongoDB concepts and their ES counterparts:
Single Field Index: trivially supported, perhaps as not_analyzed fields for exact matching
Compound Index: Lucene applies AND filter condition via efficient bitmaps, can ad-hoc merge any "single field" indexes
Multikey Index: Transparent support, no difference between a single value and an array of values
Geospatial Index: directly supported via geo-shapes
Text Index: In some way ES was optimized for this use-case as analyzed field type
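To make the "Compound Index" point more concrete, here is a toy sketch of ad-hoc AND filtering over per-field inverted indexes. It uses plain Python sets as a stand-in for posting lists; it is not how Lucene is actually implemented, and the function names and sample documents are invented for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs, fields):
    """Map each (field, value) pair to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for field in fields:
            index[(field, doc[field])].add(doc_id)
    return index

def filter_and(index, **conditions):
    """Intersect the posting sets of every field=value condition (logical AND)."""
    postings = [index.get((field, value), set()) for field, value in conditions.items()]
    return set.intersection(*postings) if postings else set()

docs = {
    1: {"country": "US", "city": "Austin"},
    2: {"country": "US", "city": "Boston"},
    3: {"country": "CA", "city": "Boston"},
}
index = build_inverted_index(docs, ["country", "city"])

# No compound index was declared up front; the two "single field" indexes
# are merged at query time.
matches = filter_and(index, country="US", city="Boston")
print(matches)
```

Because any combination of fields can be intersected this way, there is no need to decide in advance which field combinations deserve a compound index.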
In my view, in search applications relevance is more important than plain filtering of the results, as some words occur in almost every document and are thus less relevant when searching.
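The idea that very common words carry little relevance is usually captured by an inverse document frequency (IDF) weight. The sketch below uses one common smoothed variant of the formula; it is an assumption for illustration and not necessarily the exact formula Elasticsearch or Lucene applies. The sample documents are made up.

```python
import math

def inverse_document_frequency(term, documents):
    """Smoothed IDF: the more documents contain the term, the lower its weight."""
    containing = sum(1 for doc in documents if term in doc.lower().split())
    return math.log((1 + len(documents)) / (1 + containing)) + 1

documents = [
    "the quick search engine",
    "the database stores the rows",
    "the elasticsearch index",
]

idf_common = inverse_document_frequency("the", documents)
idf_rare = inverse_document_frequency("elasticsearch", documents)

# A word found in every document gets the minimum weight,
# while a rare word gets a higher one.
print(idf_common, idf_rare)
```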
Elasticsearch has other very useful concepts as well such as aggregations, nested documents and child/parent relationships.

Why use ElasticSearch with Mongo?

I have read a few articles recently on the combination of mongodb for storage and elasticsearch for indexing/search. I feel like I'm missing something though. Why would you go this route as opposed to just using mongo to index the data? What benefits does elasticsearch bring and is it worth the added complexity?
ElasticSearch implements a lot more features, such as custom splitting of text into words, custom stemming, faceted search and a whole lot more. While MongoDB's (rather simple) text search does some of this, it is not nearly as powerful as ElasticSearch.
If all you ever do is look for a single string in a single field, then MongoDB's normal query system will work excellently for that. If you need to look for words in multiple fields, then MongoDB's text search will work. If you need anything more than that, ElasticSearch is the way to go.
A search engine and a database do some fundamentally different things. A good search engine (like ElasticSearch) supports far more elaborate and complex indexing, facets, highlighting etc. In the case of ElasticSearch, you also get your replies 'real-time'. On the other hand, a search engine doesn't return every single document that matches your query. Instead, it will score documents according to how much they match, and return the top scoring ones. When you query a database such as MongoDB, you should expect it to return everything that matches your query.
You can store the entire document in ElasticSearch, but it is usually not the optimal solution. Normally you will have it configured to return the document IDs, which you use to fetch the documents from a database. MongoDB is a database optimized for document-based storage. This is why you hear about people using them together.
edit:
When this was posted, it matched the recommendations, but this may no longer be the case.
Derick's answer pretty much nails it. The question behind all this is:
What are the features you want to implement in your application?
If you rely on heavy searching capabilities in large chunks of text, ElasticSearch is probably a good thing to use. If you want to have a flexible datastore that can cope with complex ad-hoc queries, Mongo might be a good fit. If you have different requirements for a datastore, it is often a good thing to combine two tools instead of implementing all kinds of workarounds to make it work with just one datastore.
Choose the right tool for the job.

What is the fundamental difference between MongoDB / NoSQL which allows faster aggregation (MapReduce) compared to MySQL

Greeting!
I have the following problem. I have a table with a huge number of rows which I need to search and then group the search results by many parameters. Let's say the table is
id, big_text, price, country, field1, field2, ..., fieldX
And we run a request like this
SELECT .... WHERE
[use FULLTEXT index to MATCH() big_text] AND
[use some random clauses that anyway render indexes useless,
like: country IN (1,2,65,69) and price<100]
This will be displayed as search results, and then we need to take these search results and group them by a number of fields to generate search filters
(results) GROUP BY field1
(results) GROUP BY field2
(results) GROUP BY field3
(results) GROUP BY field4
This is a simplified case of what I need; the actual task at hand is even more problematic. For example, sometimes the first results query also does its own GROUP BY. An example of such functionality would be this site
http://www.indeed.com/q-sales-jobs.html
(search results plus filters on the left)
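For reference, what the repeated GROUP BY queries above compute is what search engines usually call faceting. A minimal sketch in Python, assuming the matched rows are already in memory as dicts (the function name and sample rows are hypothetical), shows that all filter counts can be produced in one pass over the search results:

```python
from collections import Counter

def facet_counts(results, facet_fields):
    """Count how many matched rows fall under each value of each facet field."""
    counts = {field: Counter() for field in facet_fields}
    for row in results:  # single pass over the search results
        for field in facet_fields:
            counts[field][row[field]] += 1
    return counts

# Stand-in for the rows returned by the first (search) query.
results = [
    {"id": 1, "field1": "full-time", "field2": "NY"},
    {"id": 2, "field1": "part-time", "field2": "NY"},
    {"id": 3, "field1": "full-time", "field2": "TX"},
]

facets = facet_counts(results, ["field1", "field2"])
print(facets["field1"])
print(facets["field2"])
```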
I've done, and am still doing, deep research on how MySQL functions, and at this point I don't see how this is possible in MySQL. Roughly speaking, a MySQL table is just a heap of rows lying on the HDD, and indexes are tiny versions of these tables sorted by the index field(s) and pointing to the actual rows. That's a super oversimplification of course, but the point is I don't see how it is possible to fix this at all, i.e. how to use more than one index and do fast GROUP BYs (by the time the query reaches GROUP BY, the index is completely useless because of range searches and other things). I know that MySQL (and similar databases) have various helpful things such as index merges, loose index scans and so on, but this is simply not adequate: the queries above will still take forever to execute.
I was told that the problem can be solved by NoSQL, which makes use of some radically new ways of storing and dealing with data, including aggregation tasks. What I want is a quick schematic explanation of how it does this. I just want a quick glimpse so that I can see that it really does that, because at the moment I can't understand how it is possible at all. Data is still data and has to be placed in memory, and indexes are still indexes with all their limitations. If this is indeed possible, I'll then start studying NoSQL in detail.
PS. Please don't tell me to go and read a big book on NoSQL. I've already done this for MySQL only to find out that it is not usable in my case :) So I wanted to have some preliminary understanding of the technology before getting a big book.
Thanks!
There are essentially 4 types of "NoSQL", but three of the four are actually similar enough that an SQL syntax could be written on top of them (including MongoDB and its crazy query syntax [and I say that even though JavaScript is one of my favorite languages]).
Key-Value Storage
These are simple NoSQL systems like Redis, that are basically a really fancy hash table. You have a value you want to get later, so you assign it a key and stuff it into the database, you can only query a single object at a time and only by a single key.
You definitely don't want this.
Document Storage
This is one step up above Key-Value Storage and is what most people talk about when they say NoSQL (such as MongoDB).
Basically, these are objects with a hierarchical structure (like XML files, JSON files, and any other sort of tree structure in computer science), but the values of different nodes on the tree can be indexed. They have a higher "speed" relative to traditional row-based SQL databases on lookup because they sacrifice performance on joining.
If you're looking up data in your MySQL database from a single table with tons of columns (assuming it's not a view/virtual table), and assuming you have it indexed properly for your query (that may be your real problem, here), Document Databases like MongoDB won't give you any Big-O benefit over MySQL, so you probably don't want to migrate over for just this reason.
Columnar Storage
These are the most like SQL databases. In fact, some (like Sybase) implement an SQL syntax while others (Cassandra) do not. They store the data in columns rather than rows, so adding and updating are expensive, but most queries are cheap because each column is essentially implicitly indexed.
But, if your query can't use an index, you're in no better shape with a Columnar Store than a regular SQL database.
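As a rough picture of the difference in layout, the sketch below stores the same made-up data once as rows and once as columns, using plain Python structures. Real columnar stores add compression and on-disk formats that are not modeled here.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "country": "US", "price": 80},
    {"id": 2, "country": "DE", "price": 120},
    {"id": 3, "country": "US", "price": 60},
]

# Column-oriented layout: one list per column, aligned by position.
columns = {name: [row[name] for row in rows] for name in rows[0]}

def average_price_row_store(rows):
    # Has to walk every full record even though only one field is needed.
    return sum(row["price"] for row in rows) / len(rows)

def average_price_column_store(columns):
    # Touches only the "price" column; the other columns are never read.
    prices = columns["price"]
    return sum(prices) / len(prices)

def append_column_store(columns, record):
    # Adding one record means writing to every column, which is why
    # inserts and updates are comparatively expensive in this layout.
    for name, value in record.items():
        columns[name].append(value)

print(average_price_row_store(rows), average_price_column_store(columns))
```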
Graph Storage
Graph Databases expand beyond SQL. Anything that can be represented by Graph theory, including Key-Value, Document Database, and SQL database can be represented by a Graph Database, like neo4j.
Graph Databases make joins as cheap as possible (as opposed to Document Databases) to do this, but they have to, because even a simple "row" query would require many joins to retrieve.
A table-scan type query would probably be slower than a standard SQL database because of all of the extra joins to retrieve the data (which is stored in a disjointed fashion).
So what's the solution?
You've probably noticed that I haven't answered your question, exactly. I'm not saying "you're finished," but the real problem is how the query is being performed.
Are you absolutely sure you can't better index your data? There are things such as Multiple Column Keys that could improve the performance of your particular query. Microsoft's SQL Server has a full text key type that would be applicable to the example you provided, and PostgreSQL can emulate it.
The real advantage most NoSQL databases have over SQL databases is Map-Reduce -- specifically, the integration of a full Turing-complete language that runs at high speed that query constraints can be written in. The querying function can be written to quickly "fail out" of non-matching queries or quickly return with a success on records that meet "priority" requirements, while doing the same in SQL is a bit more cumbersome.
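To give a feel for the Map-Reduce idea, here is a minimal in-memory sketch in plain Python. It is not the API of MongoDB or any other database; `map_reduce`, the mapper, the reducer and the sample records are all invented for illustration. The mapper is ordinary code, so it can "fail out" of a non-matching record immediately.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Run mapper on every record, group emitted values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

def cheap_listings_by_country(record):
    # Arbitrary logic can decide early that a record does not match.
    if record["price"] >= 100:
        return
    yield record["country"], 1

def count(key, values):
    return sum(values)

records = [
    {"country": "US", "price": 80},
    {"country": "DE", "price": 120},
    {"country": "US", "price": 60},
    {"country": "DE", "price": 90},
]

result = map_reduce(records, cheap_listings_by_country, count)
print(result)
```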
Finally, however, the exact problem you're trying to solve: text search with optional filtering parameters, is more generally known as a search engine, and there are very specialized engines to handle this particular problem. I'd recommend Apache Solr to perform these queries.
Basically, dump the text field, the "filter" fields, and the primary key of the table into Solr, let it index the text field, run the queries through it, and if you need the full record after that, query your SQL database for the specific primary keys you got from Solr. It uses some more memory and requires a second process, but will probably best suit your needs here.
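The second half of that flow (fetching full records by the keys the search engine returned) might look roughly like the sketch below. It uses an in-memory SQLite database so it can run on its own; the `listings` table, its columns, and the hard-coded `ranked_ids` list standing in for a Solr response are all assumptions for illustration.

```python
import sqlite3

def fetch_rows_by_ids(conn, ids):
    """Fetch full records for the given primary keys, preserving the ranked order."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, big_text, price FROM listings WHERE id IN ({placeholders})",
        list(ids),
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # SQL does not guarantee result order, so re-apply the search engine's ranking.
    return [by_id[i] for i in ids if i in by_id]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (id INTEGER PRIMARY KEY, big_text TEXT, price INTEGER)")
conn.executemany(
    "INSERT INTO listings VALUES (?, ?, ?)",
    [(1, "sales manager", 80), (2, "sales intern", 40), (3, "engineer", 120)],
)

# Stand-in for the primary keys returned by the search engine, best match first.
ranked_ids = [2, 1]
rows = fetch_rows_by_ids(conn, ranked_ids)
print(rows)
```

The lookup by primary key is the cheapest kind of query for the SQL database, which is what makes this split workable.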
Why all of this text to get to this answer?
Because the title of your question doesn't really have anything to do with the content of your question, so I answered both. :)