I have the following structure in my MongoDB database for a product:
product_1 = {
    'name': '...',
    'logo': '...',
    'nutrition': '...',
    'brand': {
        'name': '...',
        'logo': '...'
    },
    'stores': [{
        'store_id': ...,
        'url': '...',
        'prices': [{
            'date': ...,
            'price': ...
        }]
    }]
}
My pymongo script goes from store to store, and I try to do the following:
if the product is not in the database: add all the information about the product, with the current price and date for the current store_id.
if the product is in the database but I don't have any entries for the current store: add an entry in stores with the current price, date and store_id.
if the product is in the database and I have a price entry for the current store but the current price is not the same: add a new entry with the new date and price for the current store_id.
Is it possible to do all of this in one request? For now I have been trying the following, without really knowing how to handle the stores and prices case.
Maybe it is not the best way to construct my database; I am open to suggestions.
db.find_and_modify(
    query={'$and': [
        {'name': product['name']},
        {'stores': {'$in': product['store_id']}}
    ]},
    update={
        '$setOnInsert': {
            'name': product['product_name'],
            'logo': product['product_logo'],
            'brand': product['brand'],
            [something for stores and prices ?]
        },
    },
    upsert=True
)
There isn't presently (MongoDB 2.6) a way to do all of those things in one query. You can retrieve the document and do the updates, then save the updated version (or insert the new document):
oldDoc = collection.find_one({ "name" : product["name"] })
if oldDoc:
    # examine stores + prices to create an updated doc called newDoc
    newDoc = build_updated_doc(oldDoc, product)  # hypothetical helper
else:
    # make a new doc, newDoc, for the product
    newDoc = build_new_doc(product)  # hypothetical helper
collection.save(newDoc)  # you can use save if you put the same _id on newDoc as oldDoc
Alternatively, I think your nested array schema is the cause of this headache and may cause more headaches down the line (e.g. updating the price for a particular product for a particular date and store cannot be done with a single database call). I would make each document represent the lowest level of one of your nested arrays: a particular product for sale at a particular price at a particular store on a particular date:
{
    "product_name" : "bacon mayonnaise",
    "store_id" : "123ZZZ8",
    "price" : 99,
    "date" : ISODate("2014-12-31T17:18:53.944Z")
    // other stuff
}
You can duplicate a lot of the generic product information in each document or store it in a separate document; whether it's worth duplicating some information versus making another trip to the database to recall product details depends on how you're going to use the documents. Notice that the problem you're trying to solve just disappears with this structure: you just do an upsert for a given product, store, price, and date.
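For instance, the three cases from the question collapse into a single call (a minimal pymongo sketch; the collection name prices and the scraper-side variables product, store_id and current_price are assumptions):

from datetime import datetime
from pymongo import MongoClient

prices = MongoClient().shop.prices  # assumed database/collection names

# Treat (product_name, store_id, price) as the key: a document is inserted
# exactly when this store has no entry yet or when its price has changed.
prices.update_one(
    {
        'product_name': product['name'],  # `product`, `store_id`, `current_price`
        'store_id': store_id,             # are assumed to come from the scraper
        'price': current_price,
    },
    {'$setOnInsert': {'date': datetime.utcnow()}},
    upsert=True,
)

If a price can revert to an earlier value and that should be recorded as a fresh entry, include the scrape date in the query document instead of setting it only on insert.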
I have a collection named Company which has the following structure:
{
    "_id" : ObjectId("57336ea1a7454c0100d889e4"),
    "currentMonth" : 62,
    "variables1" : { ... },
    ...
    "variables61" : { ... },
    "variables62" : {
        "name" : "Test",
        "email" : "email#test.com",
        ...
    },
    "country" : "US"
}
My need is to be able to search for companies by name with up-to-date data. I don't have permission to change this data structure because many applications still use it. For the moment I haven't found a way to index these variables with this data structure, which makes the search slow.
Today each of these documents can be several megabytes in size and there are over 20,000 of them in this collection.
The system I want to implement uses a search engine to index the names of companies, but for that it needs to be able to detect changes in the collection.
MongoDB's change stream seems like a viable option but I'm not sure how to make it scalable and efficient.
Do you have any suggestions that would help me solve this problem? Any suggestion on the steps needed to set up the above system?
Usually with MongoDB you can add new fields to documents, and existing applications will simply ignore the extra fields (though they naturally would not be populated by old code). Therefore:
1) Create a regularly executed task that goes through all documents in your collection, figures out the name for each document from its fields, then writes the name into a top-level field.
2) Add an index on that field.
3) In your search code, look up documents by the values of that field.
4) Compare the calculated name to the source-of-truth name. If they differ, discard the document.
If names don't change once set, step 1 only needs to go through documents that are missing the top-level name and step 4 is not needed.
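A sketch of steps 1 and 2 in pymongo (the top-level field name company_name and the way the current name is located inside the variables are assumptions; adapt extract_name to your actual layout):

from pymongo import MongoClient, ASCENDING

companies = MongoClient().mydb.Company  # assumed connection details

def extract_name(doc):
    # Derive the company name from the current month's variables (assumed layout).
    return doc.get('variables%d' % doc['currentMonth'], {}).get('name')

# Step 1: backfill a top-level name field wherever it is missing.
for doc in companies.find({'company_name': {'$exists': False}}):
    name = extract_name(doc)
    if name:
        companies.update_one({'_id': doc['_id']}, {'$set': {'company_name': name}})

# Step 2: index the new field so the lookups in step 3 are fast.
companies.create_index([('company_name', ASCENDING)])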
Using the change detection pattern with monstache, I was able to synchronise MongoDB with Elasticsearch in real time, performing a Filter based on the current month and then a Map of the resulting variables to be indexed 🎊
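If you later need to roll the change-detection step yourself instead of using monstache, pymongo exposes change streams directly (a minimal sketch; connection details and the re-indexing step are assumptions):

from pymongo import MongoClient

companies = MongoClient().mydb.Company  # assumed connection details

# Follow inserts and updates on the collection; with updateLookup each event
# carries the full current document, ready to be pushed to the search index.
with companies.watch(full_document='updateLookup') as stream:
    for change in stream:
        doc = change.get('fullDocument')
        if doc:
            pass  # extract the company name and re-index it here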
I'm trying to find a way to create the db schema. Most operations on the database will be reads.
Say I'm selling books on the app, so the schema might look like this:
[
    {
        title: "Adventures of Huckleberry Finn",
        author: ["Mark Twain", "Thomas Becker", "Colin Barling"],
        pageCount: 366,
        genre: ["satire"],
        release: "1884"
    },
    {
        title: "The Great Gatsby",
        author: ["F.Scott Fitzgerald"],
        pageCount: 443,
        genre: ["Novel", "Historical drama"],
        release: "1924"
    },
    {
        title: "This Side of Paradise",
        author: ["F.Scott Fitzgerald"],
        pageCount: 233,
        genre: ["Novel"],
        release: "1920"
    }
]
So most operations would be something like:
1) Grab all books by "F.Scott Fitzgerald"
2) Grab books under genre "Novel"
3) Grab all books with page count less than 400
4) Grab books with page count more than 100 released no later than 1930
Should I create separate collections just for authors and genres and then reference them like in a relational database, or embed them like above? It seems like if I embed them, I have to manually type in an author name to store data in the db; I could misspell F.Scott Fitzgerald in one document and then I wouldn't get that book back in the results.
First of all, I would say: a nice DB choice.
As far as Mongo is concerned, the schema should be defined so that it serves your access patterns best. While designing the schema we must also remember that Mongo doesn't support joins and transactions the way SQL databases do. Considering all this, I would suggest that your choice of schema is best, since it serves your access patterns: whenever we pull any book's details, we need all of its information, like author, pages, genre, year, price, etc. It is just like object-oriented programming, where a class must have all its properties, and all non-class properties should be kept in another class.
Taking authors into a separate collection would just add an extra collection, and then you would have to take care of joins and transactions in your own code. As for your concern about manually typing the author name, I don't quite see the problem. Say a user wants to see books by author "xyz": he clicks on the author name "xyz" (like a tag), and you run a query to bring back all books having that selected name as one of the authors. If the user manually types the name, then it is still just finding documents by the entered string. I don't see anything manual here.
Just adding on: a price key would also fit into every document.
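To illustrate, the access patterns listed in the question map directly onto queries against the embedded schema (a pymongo sketch; the database/collection names are assumptions, and matching a scalar against an array field like author works out of the box):

from pymongo import MongoClient

books = MongoClient().bookstore.books  # assumed database/collection names

books.find({'author': 'F.Scott Fitzgerald'})  # 1) all books by an author
books.find({'genre': 'Novel'})                # 2) books in a genre
books.find({'pageCount': {'$lt': 400}})       # 3) fewer than 400 pages
books.find({'pageCount': {'$gt': 100}, 'release': {'$lte': '1930'}})  # 4)

Since release is stored as a string, the $lte comparison is lexicographic, which happens to work for four-digit years.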
I am collecting data from a streaming API and I want to create a real-time analytics dashboard. Every time a new record appears at the end of the stream, I update a counter in the document below.
From a design perspective, am I correct to use only one document, like in the example below?
{
    "_id" : ObjectId("5238beb4d4bed9e444c99978"),
    "counts" : {
        "hours" : {
            "1" : 835,
            "2" : 1007,
            "3" : 174,
            ...
        }
    }
}
The benefit of this approach is that only one document needs to be sent to the real-time analytics dashboard. Also, after a year this document would have only 365 * 24 fields, one for each hour of that year?
What about indexing? Can I create an index on counts.hours if I only have one document? Or do indexes only work across the documents of a collection in MongoDB? Do indexes help with finding documents faster, or also with finding fields inside documents?
If I could create an index on counts.hours, then the counter-increment process could find the correct hour to increment (for each new record at the end of the stream) much more efficiently.
You can create indexes on fields embedded in a document. In the case above:
yourCollection.ensureIndex({ 'counts.hours': 1 });
The index will help you optimize queries that return documents based on the 'counts.hours' field:
yourCollection.find({ 'counts.hours': 1 });
Your data structure design should depend on the kind of queries and updates you are planning to do. In the case you described, I imagine you will be adding members to the 'hours' object; updates like that might be expensive, since MongoDB pads each collection record, optimizing for the case where the record size is stable across updates.
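On the update side, the per-hour increment is a single atomic $inc on the embedded field, and $inc creates the field if it is missing, so the write itself never needs an index to locate the hour (a pymongo sketch; names are assumptions):

from datetime import datetime, timezone
from pymongo import MongoClient

counters = MongoClient().analytics.counters  # assumed database/collection names

def record_event(doc_id):
    # Bump the counter for the current hour; keys '1'..'24' as in the question.
    hour = datetime.now(timezone.utc).hour + 1
    counters.update_one(
        {'_id': doc_id},
        {'$inc': {'counts.hours.%d' % hour: 1}},
        upsert=True,
    )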
I am using MongoDB and I ended up with two Collections (unintentionally).
The first Collection (sample) has 100 million records (Tweets) with the following structure:
{
    "_id" : ObjectId("515af34297c2f607b822a54b"),
    "text" : "bla bla ",
    "id" : NumberLong("314965680476803072"),
    "user" : {
        "screen_name" : "TheFroooggie",
        "time_zone" : "Amsterdam"
    }
}
The second collection (users) has 30 million records of unique users from the tweet collection, and it looks like this:
{ "_id" : "000000_n", "target" : 1, "value" : { "count" : 5 } }
where the _id in the users collection is the user.screen_name from the tweets collection, the target is their status (spammer or not), and value.count is the number of times the user appeared in our first (sample) collection (i.e. the number of captured tweets).
Now I'd like to make the following query:
I'd like to return all the documents from the sample collection (tweets) where the user has the target value = 1
In other words, I want to return all the tweets of all the spammers for example.
As you receive the tweets, you could upsert them into a collection, using the author information as the key in the "query" document portion of the update. The update document could use the $addToSet operator to put the tweet into a tweets array. You'll end up with a collection that has the author and an array of tweets; you can then do your spammer classification for each author and have their associated tweets.
So, you would end up doing something like this:
db.samples.update({"author":"joe"},{$addToSet:{"tweets":{"tweet_id":2}}},{upsert:true})
This approach does have the likely drawback of growing the document past its initially allocated size, which means it would be moved and expanded on disk. You would likely incur some penalty for index updating as well.
You could also take an approach of storing a spam rating with each tweet document and later pulling those based on user id.
As others have pointed out, there is nothing wrong with setting up the appropriate indexes and using a cursor to loop through your users pulling their tweets.
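That cursor loop is short in pymongo (a sketch; the database name and the handler are assumptions, and an index on user.screen_name in sample makes the inner lookup cheap):

from pymongo import MongoClient

db = MongoClient().twitterdb  # assumed database name

for user in db.users.find({'target': 1}):
    for tweet in db.sample.find({'user.screen_name': user['_id']}):
        handle(tweet)  # hypothetical handler for each spammer's tweet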
The approach you choose should be based on your intended access pattern. It sounds like you are in a good place where you can experiment with several different possible solutions.
Suppose that I have a database with student information:
{'student_name' : 'Alen', 'subjects' : {'cse101' : 4, 'cse102' : 3, 'cse201' : 4}}
Suppose I need to store the aggregate information of the student as well. I can add the field 'aggregate' : 3.67 to the record. But the aggregate changes when another subject is added to the subjects list. Is there a way I can write a "dynamic field" which could calculate the aggregate whenever requested? Something like student['aggregate'] which is not persistent but available when needed?
P.S: Aggregate is just a simple example. I am dealing with something more complex involving various other fields of the element.
There are no dynamic or calculated fields in MongoDB at the moment (although there are some tickets for this in Jira). But you can always implement this functionality in the application code.
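For example, the aggregate can be computed on read instead of being stored (a sketch; the collection name and the averaging rule are assumptions):

from pymongo import MongoClient

students = MongoClient().school.students  # assumed database/collection names

def with_aggregate(student):
    # Attach a non-persistent 'aggregate' field computed from the subjects.
    grades = list(student.get('subjects', {}).values())
    if grades:
        student['aggregate'] = round(sum(grades) / len(grades), 2)
    return student

student = with_aggregate(students.find_one({'student_name': 'Alen'}))
print(student['aggregate'])  # 3.67 for grades 4, 3, 4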