SQL(MSSQL/MariaDB) or NoSQL(MongoDB): XML search and processing

SQL(MSSQL/MariaDB) or NoSQL(MongoDB): XML search and processing - mongodb

Current project situation.
Get lot of XML from outside system whose size is less than 50KB and if it contains an attachment, the size would be around 5MB MAX. XML structure complexity is medium because of inner nested elements. There are ~70 first level element and then its child and child of child... Storing that XML in String column of MS SQL server.
While storing XML, we read Search criteria field from XML and maintain them in new columns to improve search queries.
Search functionality to display these messages data in the form of list. Search criteria fields(optional ~10 fields) are from XML elements. Parse that XML to show the elements(around 10 -15) in lists
There are chances of reporting functionality too in future.
Challenge with this design: If new functionality introduced to search the list based on new criteria, then need to add one more column in DB table and have to store that field value from XML which is not best part of this design.
Suggested Improvement: Instead of storing an XML in String format, plan is to store it as an XML column to get a rid of extra columns to keep value of search fields and use XML column query for search and fetch.
Question:
Which DB will give me optimum performance in case of search? I have to fetch only the XMLs which are fitting inside that search criteria.
SQL or NoSQL like MongoDB?
Is there any performance metrics available? Or any case study for same?
DB to manage Reporting load.

What client language are you using? Java / PHP / C# / ...? Most of them have XML libraries that do what you need. A database is a data repository, not a data manipulator.

Related

Does a Mongo full collection scan read every single word in a collection?

Let's say that you don't have something indexed for some legitimate reason (like maybe you maxed out the 64 allowable indexes) and you are searching for values within only certain fields.
To go extreme, let's say each object has an authorName field, bookTitles field, and bookFullText field (where the content of all their novels was collected.)
If there was no index and you looked for a list of authorNames, would it have to read through all the content of all the fields in the entire collection, or would it read just the authorName fields and the names but not content of the other fields?

Fields in a document are ordered. The server stores documents as lists of key-value pairs. Therefore, I would expect that, if the server is doing a collection scan and field comparison, that the server will:
Skip over all of the fields preceding the field in question, one field at a time (which requires the server to perform string comparisons over each field name), and
Skip over the fields after the field in question in a particular document (jump to next document in collection).
The above applies to comparisons. What about reads from disk?
The basic database design I am familiar with separates logical records (documents in case of MongoDB, table rows in a RDBMS) from physical pages. For performance reasons the database generally will not read documents from disk, but will read pages. As such, it seems unlikely to me that the database will skip over some of the fields when it maps documents to pages. I expect that when any field of a document is needed, the entire document will be read from disk.
Further supporting this hypothesis is MongoDB's 16 MB document limit. This is rather low, and I suspect is set such that the server can read documents into memory completely without worrying that they might be very large. Postgres for example distinguishes VARCHAR from TEXT types in where the data is stored - VARCHAR data is stored inline in the table row and TEXT data is stored separately, presumably to avoid this exact issue of having to read it from disk if any column value is needed.
I am not a MongoDB server engineer though so the above could be wrong.

BSON Documents are kept in the common case (wiredTiger snappy compressed) in 32KB blocks in 64MB(default size ) chunks on storage , in case your document compressed size is 48KB , two blocks 32KB each must be loaded in memory , to be uncompressed and searched for your non indexed field which is expensive operation , moreover if you search multiple documents usually they are not written in sequential blocks increasing the demands for IOPS to your backend storage , this is why it is best to do some initial analysis and index the fields you will search mostly and create indexes , indexes(B-tree) are very effective since they are kept most of the time in memory compressed ( prefix compression) and are very fast for field search.
There is text indexes in mongodb that are enough for some simple text searches or you can use regex expressions.
If you will do full text search most of the time you better have search engine like elasticsearch which support inverse indexes in front of the database since the inverse indexes have your full text results already calculated and can give you the results times faster than similar operation using standard B-tree indexes.
If you use ATLAS ( the mongodb cloud service ) there is already lucene engine(inverse index) integrated that can do the fulltext search for you.
I hope my answer throw some light on the subject ... :)

Can I store data that won't affect query performance in MongoDB?

We have an application which requires saving of data that should be in documents, for querying and sorting purposes. The data should be schema less, as some of the fields would be known only via usage. For this, MongoDB is a great solution and it works great for us.
Part of the data in each document, is for displaying purposes. Meaning the data can be objects (let's say json) that the client side uses in order to plot diagrams.
I tried to save this data using gridfs, but the use cases makes it not responsive enough. Also, the documents won't exceed the 16 MB limits even with the diagram data inside them. And in fact, while trying to save this data directly within the documents, we got better results.
This data is used only for client side responses, meaning we should never query it. So my question is, can I insert this data to MongoDB, and set it as a 'not for query' data? Meaning, can I insert this data without affecting Mongo's performance? The data is strict and once a document is inserted, there might be only updating of existing fields, not adding new ones.
I've noticed there is a Binary Data type in Mongo, and I am wondering if I should use this type for objects that are not binary. Can this give me what I'm looking for?
Also, I would love to know what is the advantage in using this type inside my documents. Can it save me disk space?

As at MongoDB 3.4, read and write operations are atomic on the level of a single document from the storage/memory point of view. If the MongoDB server needs to fetch a document from memory or disk (even when projecting a subset of fields to return) the full document generally has to be loaded into memory on a mongod. The only exception is if you can take advantage of covered queries where all of the fields returned are also included in the index used.
This data is used only for client side responses, meaning we should never query it.
Data fields which aren't queried directly do not need to be in any indexes. However, there is currently no concept like "not for query" fields in MongoDB. You can query or project any field (with or without an index).
Meaning, can I insert this data without affecting Mongo's performance?
Data with very different access or growth patterns (such as your infrequently requested client data) is a recommended candidate for storing separately from a parent document with frequently accessed data. This will improve the efficiency of memory usage for mongod by avoiding unnecessary retrieval of data when working with documents in the parent collection.
I've noticed there is a Binary Data type in Mongo, and I am wondering if I should use this type for objects that are not binary. Can this give me what I'm looking for? Also, I would love to know what is the advantage in using this type inside my documents. Can it save me disk space?
You should use a type that is most appropriate for the data that you are storing. Storing text data as binary will not gain you any obvious efficiencies in server storage. However, storing a complex object as a single value (for example, a JSON document serialized as a string) could save some serialization overhead if that object will only be interpreted via your client-side code. Binary data stored in MongoDB will be an opaque blob as far as indexing or querying, which sounds fine for your purposes.

Database for filtering XML documents

I would like to hear some suggestion on implementing database solution for below problem
1) There are 100 million XML documents saved to the database per
day.
2) The database hold maximum 3 days of data
3) 1 million query request per day
4) The value through which the documents are filtered are stored in
a seperate table and mapped with the corresponding XMl document ID.
5) The documents are requested based on date range, documents
matching a list of ID's, Top 10 new documents, records that are new
after the previous request
Here is what I have done so far
1) Checked if I can use Redis, it is limited to few datatypes and
also cannot use multiple where conditions to filter the Hash in
Redis. Indexing based on date and lots of there fields. I am unable
to choose a right datastructure to store it on a hash
2) Investigated DynamoDB, its again a key vaue store where all the
filter conditions should be stored as one value. I am not sure if it
will be efficient querying a json document to filter the right XML
documnent.
3) Investigated Cassandra and it looks like it may fit my
requirement but it has a limitation saying that the read operations
might be slow. Cassandra has an advantage of faster write operation
over changing data. This looks like the best possible solition used
so far.
Currently we are using SQL server and there is a performance problem and so looking for a better solution.
Please suggest, thanks.

It's not that reads in Cassandra might be slow, but it's hard to guarantee SLA for reads (usually they will be fast, but then, some of them slow).
Cassandra doesn't have search capabilities which you may need in the future (ordering, searching by many fields, ranked searching). You can probably achieve that with Cassandra, but with obviously greater amount of effort than with a database suited for searching operations.
I suggest you looking at Lucene/Elasticsearch. Let me quote the features of Lucene from their main website:
Scalable
High-Performance Indexing
over 150GB/hour on modern hardware
small RAM requirements -- only 1MB heap
incremental indexing as fast as batch indexing
index size roughly 20-30% the size of text indexed
Powerful, Accurate and Efficient Search Algorithms
ranked searching -- best results returned first
many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
fielded searching (e.g. title, author, contents)
sorting by any field
multiple-index searching with merged results
allows simultaneous update and searching
flexible faceting, highlighting, joins and result grouping
fast, memory-efficient and typo-tolerant suggesters
pluggable ranking models, including the Vector Space Model and Okapi BM25
configurable storage engine (codecs)

Zend: index generation and the pros and cons of Zend_Search_Lucene

I've never came across an app/class like Zend Search Lucene before, as I've always queried my database.
Zend_Search_Lucene operates with
documents as atomic objects for
indexing. A document is divided into
named fields, and fields have content
that can be searched.
A document is represented by the
Zend_Search_Lucene_Document class, and
this objects of this class contain
instances of Zend_Search_Lucene_Field
that represent the fields on the
document.
It is important to note that any
information can be added to the index.
Application-specific information or
metadata can be stored in the document
fields, and later retrieved with the
document during search.
So this is basically saying that I can apply this to anything including databases, the key thing here is making indexes for searching.
What I'm trying to grasp is where exactly should I store the indexes in my application, let's take for example we have phones stored in a database, a manufacturers, models - how should I categorize the indexes?
If I'm making indexes of users with say, addresses I obviously wouldn't want them to be publically viewable, I'm just confused on how it all works out together, if there are known disadvantages, any gotchas I should know while using it.

A Lucene index is stored outside the database. I'd store it in a "data" directory as a sister to your controllers, models, and views. But you can store it anywhere; you just need to specify the path when you open the index for querying.
It's basically a redundant copy of the documents stored in your database, and you have to keep them in sync yourself. That's one of the disadvantages: you have to write code to populate the Lucene index based on results of a query against your database. As you add data to the database, you have to update your Lucene index as well.
An advantage of using an external full-text index solution is that you can reduce the workload on your RDBMS. To find a document, you execute a search using the Lucene API. The result should include a field containing the primary key value (as part of the document but no need to make it analyzed for FT search). You get this field back when you do a Lucene search, so you can look up the respective row in the database.
Does that help answer your question?
I gave a presentation recently for MySQL University comparing full-text search solutions:
http://forge.mysql.com/wiki/Practical_Full-Text_Search_in_MySQL
I also publish my slides at http://www.SlideShare.net/billkarwin.

Best practices for combining Lucene.NET and a relational database?

I'm working on a project where I will have a LOT of data, and it will be searchable by several forms that are very efficiently expressed as SQL Queries, but it also needs to be searched via natural language processing.
My plan is to build an index using Lucene for this form of search.
My question is that if I do this, and perform a search, Lucene will then return the ID's of matching documents in the index, I then have to lookup these entities from the relational database.
This could be done in two ways (That I can think of so far):
N amount of queries (Horrible)
Pass all the ID's to a stored procedure at once (Perhaps as a comma delimited parameter). This has the downside of being limited to the max parameter size, and the slow performance of a UDF to split the string into a temporary table.
I'm almost tempted to mirror everything into lucenes index, so that I can periodicly generate the index from the backing store, but only need to access it for the frontend.
Advice?

I would store the 'frontend' data inside the index itself, avoiding any db interaction. The db would be queried only when you want more information on the specific record.

When I encountered this problem I went with a relational database that has full-text search capabilities (I used PostgreSQL 8.3, which has built in ft support, with stemming and thesaurus support). This way the database can query using both SQL and ft commands. The downside is that you need a DB that has full-text-search capabilities, and these capabilities might be inferior to what lucene can do.

I guess the answer depends on what you are going to do with the results, if you are going to display the results in a grid and let the user choose the exact document he wants to access then you may want to add to the index enough text to help the user identify the document, like a blurb of say 200 characters and then once the member selects a document hit the DB to retrieve the whole thing.
This will impact the size of your index for sure, so that is another consideration you need to keep in mind. I would also put a cache between the DB and the front end so that the most used items will not incur the full cost of a DB access every time.

Probably not an option depending on how much stuff is in your database, but what I have done is store the db id's in the search index along with the properties I wanted indexed. Then in my service classes I cache all the data needed to display search results for all the objects (e.g., name, db id, image url's, description blurbs, social media info). The service class returns a Dictionary that can look up objects by db id, and I use the id's returned by Lucene.NET to pull data from the in-memory cache.
You could also forego the in-memory cache and store all the necessary properties for displaying a search result in the search index. I didn't do this because the in-memory cache is also used in scenarios other than search.
The in-memory cache is always fresh to within a few hours, and the only time I have to hit the db is if I need to pull more detailed data for a single object (if the user clicks on the link for a specific object to go to the page for that object).