How to search values in real time on a badly designed database? - mongodb

I have a collection named Company which has the following structure:
{
    "_id" : ObjectId("57336ea1a7454c0100d889e4"),
    "currentMonth" : 62,
    "variables1" : { ... },
    ...
    "variables61" : { ... },
    "variables62" : {
        "name" : "Test",
        "email" : "email@test.com",
        ...
    },
    "country" : "US",
}
My need is to be able to search for companies by name with up-to-date data. I don't have permission to change this data structure because many applications still use it. For the moment I haven't found a way to index these variables with this data structure, which makes the search slow.
Today each of these documents can be several megabytes in size and there are over 20,000 of them in this collection.
The system I want to implement uses a search engine to index the names of companies, but for that it needs to be able to detect changes in the collection.
MongoDB's change stream seems like a viable option but I'm not sure how to make it scalable and efficient.
Do you have any suggestions that would help me solve this problem? Any advice on the steps needed to set up the above system?

Usually with MongoDB you can add new fields to documents and existing applications would simply ignore the extra fields (though they naturally would not be populated by old code). Therefore:
1. Create a task that is executed regularly, goes through all documents in your collection, figures out the name for each document from its fields, and writes the name into a top-level field.
2. Add an index on that field.
3. In your search code, look up documents by the value of that field.
4. Compare the calculated name to the source-of-truth name; if they differ, discard the document.
If names don't change once set, step 1 only needs to go through documents that are missing the top-level name, and step 4 is not needed.
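A minimal sketch of those steps in the mongo shell, assuming the current month's variables object holds the name; the top-level field name companyName is invented for this example:

// Step 1: batch task, copying each company's current name
// into a top-level field (here, only documents still missing it)
db.Company.find({companyName: {$exists: false}}).forEach(function(doc) {
    var current = doc["variables" + doc.currentMonth];
    if (current && current.name) {
        db.Company.updateOne(
            {_id: doc._id},
            {$set: {companyName: current.name}}
        );
    }
});

// Step 2: index the new field
db.Company.createIndex({companyName: 1});

// Step 3: search code queries the indexed field directly
db.Company.find({companyName: "Test"});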

Using the change detection pattern with monstache, I was able to synchronise MongoDB with Elasticsearch in real time, filtering on the current month and then mapping the relevant variables to be indexed 🎊
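For anyone reproducing this: the Filter and Map steps can be expressed as monstache JavaScript middleware (embedded in its TOML config). This is a rough sketch under my assumptions about the field layout; returning false tells monstache to skip a document:

module.exports = function(doc) {
    // Filter: only index companies whose current month's variables hold a name
    var current = doc["variables" + doc.currentMonth];
    if (!current || !current.name) {
        return false; // not sent to Elasticsearch
    }
    // Map: index only the small set of searchable fields
    return {
        name: current.name,
        email: current.email,
        country: doc.country
    };
};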

Related

How to tag documents in MongoDB?

I need to tag documents in a collection, let's call it 'Contacts'.
The first idea I had was to create an attribute called "tags" for each document.
Well, in this case we have something like:
{
    _id: '1',
    contact_name: 'Asya Kamsky',
    tags: ['mongodb', 'maths', 'travels']
}
Now, let's suppose that we have users that want to tag any document in 'Contacts'.
If we keep the decision to save a tags attribute on each document, then since tags are personal, we need to record the userId with each set of tags.
So our document would look something like this (or not):
{
    _id: '1',
    contact_name: 'Asya Kamsky',
    tags: [
        {userId: 'alex', tags: ['mongodb', 'maths', 'travels']},
        {userId: 'eric', tags: ['databases', 'friends', 'japan']}
    ]
}
Now, let's complicate it a bit. Let's imagine that we have A LOT of users and each one wants to tag documents with his personal tags.
How to deal with that?
Ok, we could create thousands of tags for each document:
{
    _id: '1',
    contact_name: 'Asya Kamsky',
    tags: [
        {userId: 'alex', tags: ['mongodb', 'maths', 'travels']},
        {userId: 'eric', tags: ['databases', 'friends', 'japan']},
        {.....................................................},
        {.....................................................},
        {.....................................................}
    ]
}
But what if we have millions of users? In that case we run into the 16MB limit per document, as far as I know....
At this point, worrying about the future growth of my application, I decided to create a nice separate collection called 'tags' that would contain documents similar to:
{
    "contact_name" : "Asya Kamsky",
    "userId" : "alex",
    "tags" : ['mongodb', 'maths', 'travels'],
    "timestamp" : "2017-08-08 14:33:28"
},
{
    "contact_name" : "Asya Kamsky",
    "userId" : "eric",
    "tags" : ['databases', 'friends', 'japan'],
    "timestamp" : "2017-08-08 14:33:28"
}
That is, we have separate documents, each representing one user's tags on a contact.
Cool and clean, right?
Well, in this case, we face two problems:
Minor problem: we return to the SQL-style logic that I don't like anymore, though I accept it in some cases.
Big (for me) problem: how to search for a contact by PERSONAL tags? In this case we have a nice 'JOIN' problem that MongoDB resolves well using $lookup.
"Resolves well" for 10,000, 20,000, or even 500,000 documents. But as I want to ensure good performance in the future, I think about 10,000,000 contacts. As I researched recently, $lookup works well for a "small part" of the universe, and even with indexes this search would take a long time to execute.
How to resolve this challenge?
Thanks all
If your usage is such that the number of users times the number/size of tags per contact (plus whatever other data is in a contact document) is likely to bring you near the 16MB document size limit, then storing the tags in a separate collection seems valid. But before you go down that route, are you sure this is likely? Have you tried creating contact documents to see how many tags, and how many users per contact, would get you near the 16MB limit? If the answer implies a number of users and/or tags which you are unlikely ever to reach, then maybe your concerns are strictly theoretical and you could consider sticking with the simplest solution, which is to embed the user-specific tags inside contacts.
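A quick way to check this in the mongo shell (the counts below are placeholders to vary):

// Build an in-memory contact with 1,000 users' tags and measure its BSON size
var doc = {contact_name: "Asya Kamsky", tags: []};
for (var u = 0; u < 1000; u++) {
    doc.tags.push({userId: "user" + u, tags: ["mongodb", "maths", "travels"]});
}
// Compare against the 16MB (16,777,216 byte) document limit
print(Object.bsonsize(doc) + " bytes");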
The rest of this answer assumes that the size estimates and your knowledge about the likely number of tags and users per contact are such that the size constraints are valid. On this basis, you stated this specific concern about join performance ...
But as I want to ensure good performance in the future, I think about 10,000,000 contacts. As I researched recently, $lookup works well for a "small part" of the universe, and even with indexes this search would take a long time to execute.
Have you tried measuring this performance? Generate seed documents for contacts and tags, persist variations of these, and then run queries using $lookup and measure the performance. You could do this for a few benchmarks, for example:
1,000 contacts and 10,000 tags
100,000 contacts and 1,000,000 tags
1,000,000 contacts and 10,000,000 tags
10,000,000 contacts and 100,000,000 tags
When running your benchmark tests you can additionally use explain() to understand what's going on inside MongoDB.
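For example, a benchmark query with execution stats might look like this (the tags and contacts collection names are assumptions based on the documents above):

// Contacts that user "alex" tagged "mongodb", joined via $lookup
db.tags.explain("executionStats").aggregate([
    {$match: {userId: "alex", tags: "mongodb"}},
    {$lookup: {
        from: "contacts",
        localField: "contact_name",
        foreignField: "contact_name",
        as: "contact"
    }}
]);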
You might find that performance is acceptable; only you can know this, since you understand what expectations the users of your system have with respect to performance.
One last point: if the use case here is that a given user wants to find all of their contacts and tags, then this could be handled with a 'client-side join', i.e. two queries: (1) get the tags for "userId" : "...", and (2) find the contacts referenced by those tags. Depending on your use cases, this could be more performant than a server-side join (aka $lookup).
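For illustration, that client-side join could look like this in the shell, under the same assumed collection names:

// (1) get the contact names this user has tagged
var names = db.tags.find(
    {userId: "alex"},
    {contact_name: 1, _id: 0}
).toArray().map(function(t) { return t.contact_name; });

// (2) fetch the referenced contacts in a single query
db.contacts.find({contact_name: {$in: names}});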

Solr: Increase relevance of search result based on a map of word:value

Let's say we have a structure like this per entry that goes to Solr. The document is first amended and then saved. The way it is amended at the moment, we lose the connection between the number and the score. However, we could change that to something else if necessary.
"keywords" : [
{
"score" : 1,
"content" : "great finisher"
},
{
"score" : 1,
"content" : "project"
},
{
"score" : 1,
"content" : "staying"
},
{
"score" : 1,
"content" : "staying motivated"
}
]
What we want is to boost a document in the Solr query results using the "score" value whenever the query contains the word/collocation that score is associated with.
So each document has a different "map" of keywords and scores. Relevancy would be computed as Solr normally does, but with a boost according to this map and the words present in the query.
From what I saw, we can boost results according to some criteria, but this criterion is very dynamic and context-dependent. I'm not sure how to implement this or where to start.
At the moment there is no built-in support in Solr for anything like this. The ideal way would be to have each term in a multiValued field boosted separately, but this is currently not possible (the progress, although there is none, is tracked in SOLR-2499).
There are, however, ways of working around this; two are suggested in the issue tracker above. I can't say much about using payloads and a custom BoostingTermQuery, but using dynamic fields is a possibility. The drawbacks are managing your cache sizes if you have many different field names and query/sort by most of them. A small index with fewer terms will work, but a larger one (in the high five or six digits) with many dynamic fields will eat up your memory quickly (for each sort/query you get one lookup cache holding an int/long array the same size as your document count).
Another suggestion is to look at using function queries together with a boost. If you reference the field there instead, you might avoid the cache issue. Try it!
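To make the dynamic-field workaround concrete, one option is to flatten the per-document keyword scores into one field per term at index time; the *_score field names below are hypothetical:

{
    "keyword_great_finisher_score" : 1,
    "keyword_project_score" : 1,
    "keyword_staying_score" : 1,
    "keyword_staying_motivated_score" : 1
}

At query time, for each term of the query that matches a known keyword, you could add a boost function such as bf=field(keyword_great_finisher_score) with the dismax/edismax query parsers, which reads the boost value from the matching document's field.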

MongoDB - forcing stored value to uppercase and searching

In the SQL world I could do something to the effect of:
SELECT name FROM table WHERE UPPER(name) = UPPER('Smith');
and this would match a search for "Smith", "SMITH", "SmiTH", etc... because it forces the query and the value to be the same case.
However, MongoDB doesn't seem to have this capability without using a RegEx, which won't use indexes and would be slow for a large amount of data.
Is there a way to convert a stored value to a particular case before doing a search against it in MongoDB?
I've come across the $toUpper aggregate, but I can't figure out how that would be used in this particular case.
If there's no way to convert stored values before searching, is it possible to have MongoDB convert a value when it's created? So when I add a document to the collection, it would force the "name" attribute to a particular case, something like a callback in the Rails world.
It also looks like there's the ability to create stored JavaScript for MongoDB, similar to a stored procedure. Would that be a feasible solution?
Mostly looking for a push in the right direction; I can figure out the particular code once I know what I'm looking for, but so far I'm not even sure if my desired functionality is doable.
You have to normalize your data before storing it. There is no support for performing normalization as part of a query at runtime.
The simplest thing to do is probably to save both a case-normalized (i.e. all-uppercase) version and a display version of the field you want to search by. Suppose you are storing users and want to do a case-insensitive search on last name. You might store:
{
    _id: ObjectId(...),
    first_name: "Dan",
    last_name: "Crosta",
    last_name_upper: "CROSTA"
}
You can then create an index on last_name_upper, and query like:
> db.users.find({last_name_upper: "CROSTA"})
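For completeness, a minimal sketch of the index and the write path; the normalization would live in your application code (e.g. a Rails-style callback), since MongoDB itself won't transform the value on insert, and the example values are made up:

// Index the normalized field (ensureIndex in older shells)
db.users.createIndex({last_name_upper: 1});

// At write time, store both the display and the normalized value
var lastName = "Smith";
db.users.insert({
    first_name: "Jane",          // hypothetical example values
    last_name: lastName,
    last_name_upper: lastName.toUpperCase()
});

// Case-insensitive lookup: uppercase the input before querying
db.users.find({last_name_upper: "sMiTh".toUpperCase()});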

How to search in a Collection with unknown structure?

I spent several hours reading through docs and forums, trying to find a solution for the following problem:
In a Mongo database, I have a collection with some unstructured data:
{"data" : "some data" , "_id" : "497ce96f395f2f052a494fd4"}
{"more_data" : "more data here" ,"recursive_data": {"some_data": "even more data here", "_id" : "497ce96f395f2f052a4323"}
{"more_unknown_data" : "string or even dictionaries" , "_id" : "497ce96f395f2f052a494fsd2"}
...
The catch is that the elements in this collection don't have a predefined structure and can be nested to unlimited levels.
My goal is to create a query that searches through the collection and finds all the elements that match a regular expression (in both the keys and the values).
For example, given the regex '^even more', it should return all the elements that have the string "even more" somewhere in their structure; in this case, that will be the second one.
Simply add an array to each object and populate it with the strings you want to be able to search on. Typically I'd lowercase those values to make case-insensitive search easy.
e.g. Tags : ["copy of string 1", "copy of string 2", ...]
You can extend this technique to index every word of every element. Sometimes I also add the value with its field name in front of it, e.g. "genre:rock", which allows searching for values in specific fields (choose the ':' character carefully).
Add an index on this array and you can now search for any word or phrase in any document in the collection, and for "genre:rock" to match that value in a specific field.
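A rough sketch of how such an array could be built for arbitrarily nested documents; the collection name items and field name search_terms are made up:

// Recursively collect every key and string value, lowercased
function collectTerms(value, terms) {
    if (typeof value === "string") {
        terms.push(value.toLowerCase());
    } else if (Array.isArray(value)) {
        value.forEach(function(v) { collectTerms(v, terms); });
    } else if (value !== null && typeof value === "object") {
        for (var key in value) {
            terms.push(key.toLowerCase());
            collectTerms(value[key], terms);
        }
    }
    return terms;
}

db.items.find().forEach(function(doc) {
    db.items.updateOne(
        {_id: doc._id},
        {$set: {search_terms: collectTerms(doc, [])}}
    );
});
db.items.createIndex({search_terms: 1});

// An anchored, case-sensitive regex can then use the index
db.items.find({search_terms: /^even more/});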
Even if you find a way to do this, you will still face the problem of slow searches, because there are no indexes.
I had a similar problem, and the solution was to create an additional database (on the same engine, or any other engine more suitable for search) and fill it with the Mongo keys and the data combined into one text field, then update it whenever the MongoDB data updates.
If that suits you, you can also try going this way... At least search works very fast. (I used PostgreSQL as the search backend.)

MongoDB: Speed of field ("inside record") search in comparison with speed of search in "global scope"

My question may not be very well formulated because I haven't worked with MongoDB yet, so I'd like to know one thing.
I have an object (record/document/anything else) in my database, in global scope, and this object contains a really huge array of other objects.
So, what about the speed of search in global scope vs. search "inside" the object? Is it possible to index all the "inner" records?
Thanks beforehand.
So, like this
users: {
..
user_maria:
{
age: "18",
best_comments :
{
goodnight:"23rr",
sleeptired:"dsf3"
..
}
}
user_ben:
{
age: "18",
best_comments :
{
one:"23rr",
two:"dsf3"
..
}
}
So, how can I make it fast to find user_maria -> best_comments -> goodnight (i.e. index the contents of "best_comments")?
First of all, your example schema is very questionable. If you want to embed comments (which is a big if), you'd want to store them in an array for appropriate indexing. Also, post your schema in JSON format so we don't have to parse the whole name/value thing:
db.users {
    name: "maria",
    age: 18,
    best_comments: [
        {
            title: "goodnight",
            comment: "23rr"
        },
        {
            title: "sleeptired",
            comment: "dsf3"
        }
    ]
}
With that schema in mind, you can put an index on name and best_comments.title, for example like so:
db.users.ensureIndex({name: 1, 'best_comments.title': 1})
Then, when you want the query you mentioned, simply do
db.users.find({name: "maria", 'best_comments.title': "goodnight"})
And the database will hit the index and will return this document very fast.
Now, all that said: your schema is very questionable. You mention you want to query specific comments, but that requires either putting comments in a separate collection or filtering the comments array app-side. Additionally, having huge, ever-growing embedded arrays in documents can become a problem. Documents have a 16MB limit, and if documents increase in size all the time, Mongo will have to continuously move them on disk.
My advice :
Put comments in a separate collection.
Either do one document per comment or make comment bucket documents (say, 100 comments per document); see the sketch below.
Read up on Mongo/NoSQL schema design. You always query for root documents, so if you end up needing only a small part of a large embedded structure, you need to re-examine your schema or you'll be pumping huge documents over the connection and doing app-side filtering.
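If you go the bucket route, here is a sketch of what a bucket document and the append operation could look like; the comments collection name and the 100-per-bucket cutoff are arbitrary choices:

// One bucket holds up to 100 of a user's comments
{
    user_name: "maria",
    count: 2,
    comments: [
        {title: "goodnight", comment: "23rr"},
        {title: "sleeptired", comment: "dsf3"}
    ]
}

// Append to a non-full bucket; the upsert creates a fresh bucket
// automatically once every existing one has reached 100
db.comments.updateOne(
    {user_name: "maria", count: {$lt: 100}},
    {
        $push: {comments: {title: "goodnight", comment: "23rr"}},
        $inc: {count: 1}
    },
    {upsert: true}
);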
I'm not sure I understand your question but it sounds like you have one record with many attributes.
record = {'attr1':1, 'attr2':2, etc.}
You can create an index on any single attribute or any combination of attributes. Also, you can create any number of indices on a single collection (MongoDB collection == MySQL table), whether or not each record in the collection has the attributes being indexed on.
edit: I don't know what you mean by 'global scope' within MongoDB. To insert any data, you must define a database and collection to insert that data into.
Database 'Example':
Collection 'table1':
records:
    {a: 1, b: 1, c: 1}
    {a: 1, b: 2, d: 1}
    {a: 1, c: 1, d: 1}
indices:
    db.table1.ensureIndex({a: 1, d: 1}) <- this will index on a, then by d; the fact that record 1 doesn't have an attribute 'd' doesn't matter, and this will increase query performance
edit 2:
Well, first of all, in your table here you are assigning multiple values to the attributes "name" and "value". MongoDB will ignore/overwrite the earlier instantiations of them, so only the final ones will be included in the collection.
I think you need to reconsider your schema here. You're trying to use it as a series of key value pairs, and it is not specifically suited for this (if you really want key value pairs, check out Redis).
Check out: http://www.jonathanhui.com/mongodb-query