Is it OK to use MongoDB when we have no idea about the available keys?

We are scraping a huge products website.
We will fetch and persist a very large number of products, and almost every product has a different set of features/details.
Naturally, we are considering a NoSQL database (MongoDB) for this job. We would create a "products" collection, with one document per product, where each key/value pair maps to a detail_name/detail_description of the product.
Since products are quite different, we have almost no idea what the product details/features are. In other words, we have no knowledge of the available keys.
According to this link, MongoDB case insensitive key search, not having some idea of the available keys is a "gap" for MongoDB.
Is this true? If yes, what are the alternatives?

Your key problem isn't that much of an issue for MongoDB, provided you can live with a slightly different schema and big indexes.
Normally you would do something like:
{
    productId: ...,
    details: {
        detailName1: detailValue1,
        detailName2: detailValue2
    }
}
But if you restructure the details as an array of field/value documents instead, you can index the details field:
{
    productId: ...,
    details: [
        { field: detailName1, value: detailValue1 },
        { field: detailName2, value: detailValue2 }
    ]
}
Do note that this will result in a very big index. Not necessarily a problem but something to be aware of. The index will then be {details.field:1, details.value:1} (or just {details:1} if you're not adding additional fields per detail).
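As a rough illustration in pymongo (a sketch; the connection string, database name, and detail names are made up), creating that index and querying one detail would look something like this:
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
products = client["scraper"]["products"]

# Compound index over the field/value pairs in the details array.
products.create_index([("details.field", ASCENDING),
                       ("details.value", ASCENDING)])

# Find products with a given detail, whatever other keys they carry.
for product in products.find(
        {"details": {"$elemMatch": {"field": "color", "value": "red"}}}):
    print(product["productId"])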

Once you've scraped all of the data, you could examine it to see whether there is a field or set of fields you could index to improve performance.

Related

MongoDB - Using Index to get nested IDs is slow

I have a MongoDB collection with 8k+ documents, around 40GB. Inside it, the data follows this format:
{
    _id: ...,
    _session: {
        _id: ...
    },
    data: {...}
}
I need to get all the _session._id values for my application. The following approach (Python) takes too long to get them:
cursor = collection.find({}, projection={'_session._id': 1})
I have created an Index in MongoDB Compass, but I'm not sure if my query is making use of it at all.
Is there a way to speed this query such that I get all the _session._id very fast?
In the mongo shell you can hint() the query optimizer to use the available index as follows:
db.collection.find({},{_id:0,"_session._id":1}).hint({"_session._id":1})
The following test is confirmed to work via Python:
import pymongo

db = pymongo.MongoClient("mongodb://user:pass@localhost:12345")
mydb = db["test"]
docs = mydb.test2.find({}).hint([("x.y", pymongo.ASCENDING)])
for i in docs:
    print(i)
db.test2.createIndex({"x.y":1})
{
"v" : 2,
"key" : {
"x.y" : 1
},
"name" : "x.y_1"
}
Tested with: Python 3.7, pymongo 3.11.2, mongod 5.0.5.
In your case it seems to be a text index. By the way, it seems a bit strange that the session field has a text index, but for a text index something like this should work:
db.test2.find({}).hint("x.y_text").explain()
And here is a working example with a text index:
import pymongo

db = pymongo.MongoClient("mongodb://user:pass@localhost:12345")
print('Get first 10 docs from test.test:')
mydb = db["test"]
docs = mydb.test2.find({"x.y": "3"}).hint("x.y_text")
print("===start:====")
for i in docs:
    print(i)
db.test2.createIndex({"x.y":"text"}):
{
"v" : 2,
"key" : {
"_fts" : "text",
"_ftsx" : 1
},
"name" : "x.y_text",
"weights" : {
"x.y" : 1
},
"default_language" : "english",
"language_override" : "language",
"textIndexVersion" : 3
}
There are a few points of confusion in this question and the ensuing discussion which generally come down to:
What indexes are present in the environment (and why the attempts to hint it failed)
When using indexing is most appropriate
Current Indexes
I think there are at least 5 indexes that were mentioned so far:
A standard index of {"_session._id":1} mentioned originally in #R2D2's answer.
A text index on the _session._id field (mentioned in this comment)
A text index on the _ts_meta.session field (mentioned in this comment)
A standard index of {"x.y":1} mentioned second in #R2D2's answer.
A text index of {"x.y":"text"} mentioned at the end of #R2D2's answer.
Only the first of these is likely to be relevant to the original question. Note the difference: a text index is a specialized index meant for more advanced text searching. Such indexes are not required for simple string matching or value retrieval. A standard index such as { '_session._id': 1 } will also store string values and is the relevant kind here.
What Indexing is For
Indexes are typically useful for retrieving a small subset of results from the database. The larger that set of results becomes relative to the overall size of the collection, the less helpful using an index becomes. In your situation you are looking to retrieve data from all of the documents in the collection, which is why the database doesn't consider using any index at all.
Now it is still possible that an index could help in this situation, namely if we used it to perform a covered query, which means that the data can be retrieved from the index alone without looking at the documents themselves. In this case the database would still have to scan the full index, so it is not clear whether it would be faster. But you could certainly try. To do so you would need to follow @R2D2's instructions, specifically by creating the index and then hinting it in the query (while also projecting out the _id field):
db.collection.createIndex({"_session._id":1})
db.collection.find({},{_id:0,"_session._id":1}).hint({"_session._id":1})
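For completeness, a rough pymongo equivalent of that covered query (a sketch; the connection string and database/collection names are assumptions):
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")
collection = client["test"]["collection"]

collection.create_index([("_session._id", pymongo.ASCENDING)])

# Excluding _id lets the query be covered by the index alone.
cursor = collection.find(
    {}, {"_id": 0, "_session._id": 1}
).hint([("_session._id", pymongo.ASCENDING)])

session_ids = [doc["_session"]["_id"] for doc in cursor]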
Additional Questions
There were two other things mentioned in the question that are important to address.
I have created an Index in MongoDB Compass, but I'm not sure if my query is making use of it at all.
We talked about why this was the case above. To find out whether the database is using the index, you can navigate to the Explain tab in Compass and take a look. The explain plan visualization should indicate whether the index was used. Remember that you will need to hint the index based on your query.
Is there a way to speed this query such that I get all the _session._id very fast?
What is your definition of "very fast" here?
The general answer is that your operation requires scanning either all documents in the collection or a full index. There is no way to do this more efficiently based on the current schema. Therefore how fast it happens is largely going to come down to the hardware that the database is running on and it will slow down as the collection grows.
If this operation is something that you will be running frequently or have strict performance requirements around, then it may be important to think through your intended goals to see if there are other ways of achieving them. What will you or the application be doing with this list of session IDs?

How to design product category and products models in mongodb?

I am new to MongoDB database design.
I am currently designing a restaurant products system, and my design is similar to a simple eCommerce database design where each product category has products, so in a relational database system this would be a one (productcategory) to many (products) relationship.
From the research I have done, I understand that in document databases denormalization is acceptable and results in faster database reads.
Therefore, in a NoSQL document-based database I could design my models this way:
// product
{
    name: 'xxxx',
    price: xxxxx,
    productcategory: {
        productcategoryName: 'xxxxx'
    }
}
My question is this: instead of embedding the category inside each product, why don't we embed the products inside productcategory? Then we would get all the products once we query the category, resulting in this model:
// ProductCategory
{
    name: 'categoryName',
    // array of products
    products: [
        {
            name: 'xxxx',
            price: xxxxx
        },
        {
            name: 'xxxx',
            price: xxxxx
        }
    ]
}
I have researched this issue on http://www.slideshare.net/VishwasBhagath/product-catalog-using-mongodb and http://www.stackoverflow.com/questions/20090643/product-category-management-in-mongodb-and-mysql. Both examples I found use the first model I described (i.e. they embed the productCategory inside product rather than embedding an array of products inside productCategory). I do not understand why; please explain. Thanks.
For the DB design, you have to consider the cardinality (one-to-few, one-to-many & one-to-gazillions) and also your data access patterns (your frequent queries and updates) while designing your DB schema, to make sure you get optimum performance for your operations.
From your scenario, it looks like each Product has a category, but it also looks like you need to query for the Products in each category.
So, in this case, you could go with something like:
Product = { _id = "productId1", name : "SomeProduct", price : "10", category : ObjectId("111") }
ProductCategory = { _id = ObjectId("111"), name : "productCat1", products : ["productId1", productId2", productId3"]}
As I said about data access patterns: if you always read the category name, and the category name is updated very infrequently, then you can denormalize with this two-way referencing by adding the product-category name to the product:
Product = { _id = "productId1", name : "SomeProduct", price : "10", category : { ObjectId("111"), name:"productCat1" }
So, with embedded documents, specific queries are faster when no joins are required; however, other queries where you need to access the embedded details as stand-alone entities become difficult.
This link from MongoDB explains DB design for one-to-many scenarios like yours, with very nice examples showing that there are many more ways of doing it and many other things to think about:
http://blog.mongodb.org/post/88473035333/6-rules-of-thumb-for-mongodb-schema-design-part-3 (also has links for parts 1 & 2)
This link also describes the pros & cons of each scenario, helping you narrow down a DB schema design.
It all depends on what queries you have in mind.
Suppose a product can only belong to one category, and one category applies to many products. Then, if you expect to retrieve the category together with the product, it makes sense to store it directly:
// products
{_id:"aabbcc", name:"foo", category:"bar"}
and if you expect to query all the products in a given category then it makes sense to create a separate collection
// categories
{_id:"bar", products=["aabbcc"]}
Remember that you cannot atomically update both the products and the categories collections in a single operation, but you can occasionally run a batch job which makes sure all categories are up-to-date.
I would recommend thinking in terms of what kind of information you will often need, as opposed to how to normalize/denormalize the data, and making your collections reflect what you actually want.
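As an illustration of that kind of batch job (a hedged pymongo sketch; collection and field names are assumptions), rebuilding each category's product list could look like:
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["shop"]

# Recompute each category's product list from the products collection.
for cat in db.categories.find({}, {"_id": 1}):
    product_ids = [p["_id"] for p in
                   db.products.find({"category": cat["_id"]}, {"_id": 1})]
    db.categories.update_one({"_id": cat["_id"]},
                             {"$set": {"products": product_ids}})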

Solr Increase relevance of search result based on a map of word:value

Let's say we have a structure like this per entry that goes to Solr. The document is first amended and then saved. The way it is amended at the moment, we lose the connection between the number and the score; however, we could change that into something else if necessary.
"keywords" : [
{
"score" : 1,
"content" : "great finisher"
},
{
"score" : 1,
"content" : "project"
},
{
"score" : 1,
"content" : "staying"
},
{
"score" : 1,
"content" : "staying motivated"
}
]
What we want is to boost a document in the Solr query results using the "score" value whenever the query contains the word/collocation that the score is associated with.
So each document has a different "map" of keywords with scores. Relevancy would be computed as Solr normally does, but with a boost according to this map and the words present in the query.
From what I saw, we can boost results according to some criteria, but this criterion is very dynamic and context dependent. I am not sure how to implement this or where to start.
At the moment there is no built-in support in Solr for anything like this. The ideal way would be to have each term in a multiValued field boosted separately, but this is currently not possible (the progress, although there is none, is tracked in SOLR-2499).
There are, however, ways of working around this; two are suggested in the issue tracker above. I can't say much about using payloads and a custom BoostingTermQuery, but using dynamic fields is a possibility. The drawback is managing your cache sizes if you have many different field names and query/sort by most of them. A small index with few terms will work, but a larger one (in the high five or six digits) with many dynamic fields will eat up your memory quickly, as each sorted/queried field gets its own lookup cache holding an int/long array the size of your document count.
Another suggestion would be to look at using function queries together with a boost. If you reference the field that way instead, you might avoid the cache issue. Try it!
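To make the dynamic-field workaround concrete, here is a hedged sketch (the core name, field names, and the *_i dynamic-field convention of the default schema are all assumptions): each keyword becomes its own integer field, and an edismax boost function adds its value to the score.
import requests

SOLR = "http://localhost:8983/solr/products"  # assumed core name

# Index: one dynamic int field per keyword, e.g. kw_great_finisher_i = 1.
doc = {
    "id": "doc1",
    "kw_great_finisher_i": 1,
    "kw_staying_motivated_i": 1,
}
requests.post(f"{SOLR}/update?commit=true", json=[doc])

# Query: boost documents whose per-keyword score field is set.
params = {
    "q": "great finisher",
    "defType": "edismax",
    "bf": "kw_great_finisher_i",  # adds the stored score to relevancy
}
print(requests.get(f"{SOLR}/select", params=params).json())
The query side has to translate each word/collocation in the user query into the matching field name, which is exactly the management overhead warned about above.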

How to implement persisted sorted list which is often updated and you need maintain the order

I need to display the members of a community sorted by last visit. There are millions of communities, each of which can have millions of members. The list should be scrollable. Because of the sorting by last visit time, the order is updated very often.
In an RDBMS this functionality could simply be handled by an ordinary B-tree index. But how can I do it with a NoSQL approach?
My current thoughts are:
The standard NoSQL scrollable-list approach, which uses chained buckets of fixed length, doesn't help much because of the reordering requirement.
Cassandra keeps values ordered by column name. So in theory I could use last visit time as the column key, but for each update I would need to delete the existing column and insert a new one, which doesn't sound very efficient.
Apache Lucene is not NoSQL storage but is also an option because it builds a sorted index. But I'm not sure how well it scales for massive updates.
Redis Sorted Sets sounds really promising but I haven't had experience with it.
What other options do I have?
If you keep the last modification date in the object, you can sort at query time in many NoSQL DBs:
MongoDB (see docs on indexes):
db.collection.find({ ... spec ... }).sort({ key: 1 })
db.collection.ensureIndex( { "username" : 1, "timestamp" : -1 } )
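A pymongo version of that pattern, including a scrollable page (a sketch; the connection string, collection, and field names are assumptions):
import pymongo

members = pymongo.MongoClient("mongodb://localhost:27017")["app"]["members"]

# The index supports filtering by community and sorting by last visit.
members.create_index([("community_id", pymongo.ASCENDING),
                      ("last_visit", pymongo.DESCENDING)])

page = (members.find({"community_id": 42})
               .sort("last_visit", pymongo.DESCENDING)
               .skip(100)    # scroll offset
               .limit(50))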
Elasticsearch has sorting in queries too:
{
    "sort" : [
        { "date" : { "order" : "asc" } }
    ],
    "query" : {
        ...
    }
}
Some stores, like CouchDB, seem to lack a built-in sorting feature altogether, so it pays off to look closely at a particular solution before investing in it.
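The Redis Sorted Sets option mentioned in the question also maps naturally onto this access pattern; a minimal sketch (the key layout is an assumption):
import time
import redis

r = redis.Redis()

# One sorted set per community; the score is the last-visit timestamp,
# so re-sorting on every visit is a single ZADD.
def record_visit(community_id, member_id):
    r.zadd("community:%s:members" % community_id, {member_id: time.time()})

# A scrollable page of members, most recently seen first.
def members_page(community_id, offset, page_size=50):
    return r.zrevrange("community:%s:members" % community_id,
                       offset, offset + page_size - 1)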

Adding an index to a MongoDB collection hash field

I have a MongoDB collection that I would like to add an index to. For the purpose of this post, let's say the collection name is Cats. I have a hash key on the Cats collection, so if you do db.cats.findOne(); it'll look like the following:
> db.cats.findOne();
{
    "_id" : ObjectId("4f248f8ae4b0b775c9eb002d"),
    "metaData" : {
        "type" : "cute",
        "id" : "4ed3b6c599114b488be52bc3"
    },
    ....
}
I query very often (using Mongoid), with something like this:
Cat.first(:conditions => { "metaData.id" => an_id })
I'd really like to be able to take advantage of indexes here, but I'm not entirely sure whether I should index all of metaData or just metaData.id (I query against id specifically, and very often).
Would love any solution to this problem because I think I can dramatically speed up queries if I do the right thing here. Also, this is a unique index.
Also, metaData is not an embedded document from its own collection; it is simply a hash with a 1:1 mapping in each cats object.
You can just define an index on the embedded document. This is covered here:
http://www.mongodb.org/display/DOCS/Indexes#Indexes-UsingDocumentsasKeys
For your specific example, this would be:
db.Cats.ensureIndex({ "metaData.id" : 1}, {unique : true})
To compare your results, run some of your standard queries in the shell with .explain() to compare the speed with and without the index. If you are not running a lot of queries, you might need to hint the index to use so that the optimizer doesn't keep a cached "best" index (don't forget there is one on _id by default). More explain info here:
http://www.mongodb.org/display/DOCS/Explain
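A hedged pymongo sketch of that comparison (connection details are assumptions; names follow the question):
import pymongo

cats = pymongo.MongoClient("mongodb://localhost:27017")["mydb"]["cats"]

# Unique index on the embedded field that is queried most often.
cats.create_index([("metaData.id", pymongo.ASCENDING)], unique=True)

# explain() reports an IXSCAN stage when the index is used,
# versus a COLLSCAN for a full collection scan.
plan = cats.find({"metaData.id": "4ed3b6c599114b488be52bc3"}).explain()
print(plan["queryPlanner"]["winningPlan"])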