Can I store data that won't affect query performance in MongoDB? - mongodb

We have an application which requires saving of data that should be in documents, for querying and sorting purposes. The data should be schema less, as some of the fields would be known only via usage. For this, MongoDB is a great solution and it works great for us.
Part of the data in each document, is for displaying purposes. Meaning the data can be objects (let's say json) that the client side uses in order to plot diagrams.
I tried to save this data using gridfs, but the use cases makes it not responsive enough. Also, the documents won't exceed the 16 MB limits even with the diagram data inside them. And in fact, while trying to save this data directly within the documents, we got better results.
This data is used only for client side responses, meaning we should never query it. So my question is, can I insert this data to MongoDB, and set it as a 'not for query' data? Meaning, can I insert this data without affecting Mongo's performance? The data is strict and once a document is inserted, there might be only updating of existing fields, not adding new ones.
I've noticed there is a Binary Data type in Mongo, and I am wondering if I should use this type for objects that are not binary. Can this give me what I'm looking for?
Also, I would love to know what is the advantage in using this type inside my documents. Can it save me disk space?

As at MongoDB 3.4, read and write operations are atomic on the level of a single document from the storage/memory point of view. If the MongoDB server needs to fetch a document from memory or disk (even when projecting a subset of fields to return) the full document generally has to be loaded into memory on a mongod. The only exception is if you can take advantage of covered queries where all of the fields returned are also included in the index used.
This data is used only for client side responses, meaning we should never query it.
Data fields which aren't queried directly do not need to be in any indexes. However, there is currently no concept like "not for query" fields in MongoDB. You can query or project any field (with or without an index).
Meaning, can I insert this data without affecting Mongo's performance?
Data with very different access or growth patterns (such as your infrequently requested client data) is a recommended candidate for storing separately from a parent document with frequently accessed data. This will improve the efficiency of memory usage for mongod by avoiding unnecessary retrieval of data when working with documents in the parent collection.
I've noticed there is a Binary Data type in Mongo, and I am wondering if I should use this type for objects that are not binary. Can this give me what I'm looking for? Also, I would love to know what is the advantage in using this type inside my documents. Can it save me disk space?
You should use a type that is most appropriate for the data that you are storing. Storing text data as binary will not gain you any obvious efficiencies in server storage. However, storing a complex object as a single value (for example, a JSON document serialized as a string) could save some serialization overhead if that object will only be interpreted via your client-side code. Binary data stored in MongoDB will be an opaque blob as far as indexing or querying, which sounds fine for your purposes.

Related

What are the practical size limits of streaming a Firestore collection (in Flutter)

I have a collection that is about 2000 docs in size. I want to stream a List of just their ids.
As follows...
Stream<QuerySnapshot> stream = FirebaseFirestore.instance.collection('someCollection').snapshots();
return stream.map((querySnapshot) => querySnapshot.docs.map((doc) => doc.id).toList());
Will this be a significant performance issue on the client side? Is this an unrealistic approach?
When you query for documents in Firestore using the client SDKs, you are pulling down the entire document every time. There is no way to reduce the amount of data - all of the matching documents are sent across the wire in their entirety.
As such, you use of map() to extract only the document ID has no real effect on performance here, since it runs after the query is complete and you have all of that data in a snapshot on the client. All you are doing is trimming down the entire document down to a string, but you are not saving on the cost of transferring that entire document.
If you want to make this faster, you should make the query on a backend (such as Cloud Functions), ideally in the same region as your Firestore instance, and trim the documents in your backend code before you send the data to the frontend. That will save you the cost of unnecessarily trasferring the contents of the document you don't need.
Read: Should I query my database directly or use Cloud Functions?
Performance implications will mostly come from:
The bandwidth consumed for transferring the documents.
The memory used for keeping the DocumentSnapshot objects.
Since you're throwing the document.data() of each snapshot away, that is quite directly where the quickest gains will be. If your documents have few fields, the gains will be small. If they have many fields, the gains will larger.
If you have a lot to gain by not transferring the fields and keeping them in memory, the main options you have are:
Use a server-side SDK to get only the document IDs, and transfer only that back to the client.
Create secondary/proxy documents that contain only the ID, and no data.
While the first approach is tempting because it reduces data duplication, the second one is typically a lot simpler to implement (as you're only going to be impacting the code that handles data writes).

Strategies on storing raw scraped HTML in conjunction with extracted fields using MongoDB

Problem:
I scrape ~2.5-3gb worth of HTML data on a daily basis, and following the day's scrapes, I extract certain fields from each page and store them in a MongoDB collection, where the extracted fields are in the same document as the HTML that they were extracted from. I'm starting to consider that maybe I should store these in separate collections though. Here's my reasoning:
HTML field:
Never altered
Never used in any computationally expensive algorithms
Never queried from our database's dashboard
All other fields:
Some fields are changed consistently in computationally expensive programs
Used heavily in computationally expensive programs
Queried relatively frequently via the database's dashboard
Note: A couple minor details on "computationally expensive" - the main program that comes to mind is a matching algorithm, run daily, which runs 14-16 parallel processes that all read/write the collection heavily. One of the fields that gets updated consistently during this algorithm is an indexed field. I try to use projections and batch size increases where they seem useful, but I do not project on every query.
We're coming up on ~30 million documents in the collection, thus indexes are starting to get very large.
Questions:
Will removing the raw HTML from the collection increase performance on reads and/or writes?
Since I rarely query the HTML, should I even bother with a MongoDB collection and instead consider storing/compressing it in the regular file system?
I'm happy to provide additional details on request...just figured I would keep it relatively general to start with.
Thanks a ton!

When should I create a new collections in MongoDB?

So just a quick best practice question here. How do I know when I should create new collections in MongoDB?
I have an app that queries TV show data. Should each show have its own collection, or should they all be store within one collection with relevant data in the same document. Please explain why you chose the approach you did. (I'm still very new to MongoDB. I'm used to MySql.)
The Two Most Popular Approaches to Schema Design in MongoDB
Embed data into documents and store them in a single collection.
Normalize data across multiple collections.
Embedding Data
There are several reasons why MongoDB doesn't support joins across collections, and I won't get into all of them here. But the main reason why we don't need joins is because we can embed relevant data into a single hierarchical JSON document. We can think of it as pre-joining the data before we store it. In the relational database world, this amounts to denormalizing our data. In MongoDB, this is about the most routine thing we can do.
Normalizing Data
Even though MongoDB doesn't support joins, we can still store related data across multiple collections and still get to it all, albeit in a round about way. This requires us to store a reference to a key from one collection inside another collection. It sounds similar to relational databases, but MongoDB doesn't enforce any of key constraints for us like most relational databases do. Enforcing key constraints is left entirely up to us. We're good enough to manage it though, right?
Accessing all related data in this way means we're required to make at least one query for every collection the data is stored across. It's up to each of us to decide if we can live with that.
When to Embed Data
Embed data when that embedded data will be accessed at the same time as the rest of the document. Pre-joining data that is frequently used together reduces the amount of code we have to write to query across multiple collections. It also reduces the number of round trips to the server.
Embed data when that embedded data only pertains to that single document. Like most rules, we need to give this some thought before blindly following it. If we're storing an address for a user, we don't need to create a separate collection to store addresses just because the user might have a roommate with the same address. Remember, we're not normalizing here, so duplicating data to some degree is ok.
Embed data when you need "transaction-like" writes. Prior to v4.0, MongoDB did not support transactions, though it does guarantee that a single document write is atomic. It'll write the document or it won't. Writes across multiple collections could not be made atomic, and update anomalies could occur for how many ever number of scenarios we can imagine. This is no longer the case since v4.0, however it is still more typical to denormalize data to avoid the need for transactions.
When to Normalize Data
Normalize data when data that applies to many documents changes frequently. So here we're talking about "one to many" relationships. If we have a large number of documents that have a city field with the value "New York" and all of a sudden the city of New York decides to change its name to "New-New York", well then we have to update a lot of documents. Got anomalies? In cases like this where we suspect other cities will follow suit and change their name, then we'd be better off creating a cities collection containing a single document for each city.
Normalize data when data grows frequently. When documents grow, they have to be moved on disk. If we're embedding data that frequently grows beyond its allotted space, that document will have to be moved often. Since these documents are bigger each time they're moved, the process only grows more complex and won't get any better over time. By normalizing those embedded parts that grow frequently, we eliminate the need for the entire document to be moved.
Normalize data when the document is expected to grow larger than 16MB. Documents have a 16MB limit in MongoDB. That's just the way things are. We should start breaking them up into multiple collections if we ever approach that limit.
The Most Important Consideration to Schema Design in MongoDB is...
How our applications access and use data. This requires us to think? Uhg! What data is used together? What data is used mostly as read-only? What data is written to frequently? Let your applications data access patterns drive your schema, not the other way around.
The scope you've described is definitely not too much for "one collection". In fact, being able to store everything in a single place is the whole point of a MongoDB collection.
For the most part, you don't want to be thinking about querying across combined tables as you would in SQL. Unlike in SQL, MongoDB lets you avoid thinking in terms of "JOINs"--in fact MongoDB doesn't even support them natively.
See this slideshare:
http://www.slideshare.net/mongodb/migrating-from-rdbms-to-mongodb?related=1
Specifically look at slides 24 onward. Note how a MongoDB schema is meant to replace the multi-table schemas customary to SQL and RDBMS.
In MongoDB a single document holds all information regarding a record. All records are stored in a single collection.
Also see this question:
MongoDB query multiple collections at once

Can Lucene store more than 100Gb original's documents in index?

I'm writing application what will be manipulate with more than 100Gb text documents. The size of each document is 2Kb-100Kb.
At first I supposed to use DBMS such as MySQL or Firebird to store raw documents with storing index in lucene's index. This approach have some disadvantages. For example, database transactions know nothing about lucene index and vice versa. So I need to synchronize them.
Then I supposed what Lucene can store entire documents in index. So I need regulary create index's backups. But it so easy: I can copy entire catalog with index. I use some kind of No SQL storage (i.e. Lucene). And I may don't use DBMS.
What is the best practice: to store original documents in index or not? I'm really don't want use DBMS to such purpose. Is it possible?
You would not want to store the raw document in a Lucene index, especially the size that you are talking about. I have done this a couple ways, but both ONLY store the indexed fields in the Lucene index and you have an ID/pointer to the raw document. I have dealt with indexes well over 100 million records and they work fine on a single server.
The reason this is important is that the build time of the index and manageability of the index dramatically drops if you don't need to store an additional 100 gig of data.
Basically, you need to index all the fields you need for searching/satisfying search queries. If a user clicks on the item in a grid, I assume you want to show the raw text (the UI pattern is that most of the time you will access a lot of the Lucene fields, but RARELY need to pull down the full binary text file).
The raw access I have used in conjunction with Lucene is:
SQL Server FILESTREAM, which is optimized for large binary file storage. It is really fast too. Not sure if MySQL has this (never worked with it)
Azure Table Storage, which is a key-value NoSQL cloud database. That was used to store the binary blobs.
It really doesn't matter what the persisted storage is, as long as it is optimized for larger binary files that can be accessed/streamed fast based off of a key. You could use an in-memory cache like Redis too as long as Lucene has the ID pointer to access the binary text file.

Best practices for combining Lucene.NET and a relational database?

I'm working on a project where I will have a LOT of data, and it will be searchable by several forms that are very efficiently expressed as SQL Queries, but it also needs to be searched via natural language processing.
My plan is to build an index using Lucene for this form of search.
My question is that if I do this, and perform a search, Lucene will then return the ID's of matching documents in the index, I then have to lookup these entities from the relational database.
This could be done in two ways (That I can think of so far):
N amount of queries (Horrible)
Pass all the ID's to a stored procedure at once (Perhaps as a comma delimited parameter). This has the downside of being limited to the max parameter size, and the slow performance of a UDF to split the string into a temporary table.
I'm almost tempted to mirror everything into lucenes index, so that I can periodicly generate the index from the backing store, but only need to access it for the frontend.
Advice?
I would store the 'frontend' data inside the index itself, avoiding any db interaction. The db would be queried only when you want more information on the specific record.
When I encountered this problem I went with a relational database that has full-text search capabilities (I used PostgreSQL 8.3, which has built in ft support, with stemming and thesaurus support). This way the database can query using both SQL and ft commands. The downside is that you need a DB that has full-text-search capabilities, and these capabilities might be inferior to what lucene can do.
I guess the answer depends on what you are going to do with the results, if you are going to display the results in a grid and let the user choose the exact document he wants to access then you may want to add to the index enough text to help the user identify the document, like a blurb of say 200 characters and then once the member selects a document hit the DB to retrieve the whole thing.
This will impact the size of your index for sure, so that is another consideration you need to keep in mind. I would also put a cache between the DB and the front end so that the most used items will not incur the full cost of a DB access every time.
Probably not an option depending on how much stuff is in your database, but what I have done is store the db id's in the search index along with the properties I wanted indexed. Then in my service classes I cache all the data needed to display search results for all the objects (e.g., name, db id, image url's, description blurbs, social media info). The service class returns a Dictionary that can look up objects by db id, and I use the id's returned by Lucene.NET to pull data from the in-memory cache.
You could also forego the in-memory cache and store all the necessary properties for displaying a search result in the search index. I didn't do this because the in-memory cache is also used in scenarios other than search.
The in-memory cache is always fresh to within a few hours, and the only time I have to hit the db is if I need to pull more detailed data for a single object (if the user clicks on the link for a specific object to go to the page for that object).