I have two collections, raw_doc and unique_doc in Mongo.
raw_doc receives imports of a large amount of data on a regular basis (+500k rows). unique_doc has every unique instance of 3 fields found in raw_doc.
A shortened example of the data in each collection:
raw_doc
{Licence : "Free", Publisher : "Jeff's music", Name: "Music for all",Customer:"Dave", uniqueclip_id:12345},
{Licence : "Free", Publisher : "Jeff's music", Name: "Music for all",Customer:"Jim", uniqueclip_id:12345}
unique_doc
{_id:12345, Licence : "Free", Publisher : "Jeff's music", Name: "Music for all"}
I would like to add a reference to raw_doc, linking it to the appropriate unique_doc. I can't use the three fields in unique_doc for the key, as those fields will eventually be edited while the data in raw_doc stays the same (so the data would no longer match but still needs to be linked).
Is there a query I could run in Mongo that would pull in bulk the IDs from unique_doc and insert them into the appropriate raw_docs?
You can try updateMany. Please try this:
db.raw_doc.updateMany({uniqueclip_id:"12345"},{$set:{uniqueclip_id:"54321"}})
This will update all the documents in raw_doc that contain uniqueclip_id:"12345" and set it to "54321".
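If the intent is instead to stamp each raw_doc with the _id of its matching unique_doc while the three fields still agree, a minimal pymongo sketch could look like the following (the unique_ref field name, database name, and batch size are assumptions, not part of the original question):

from pymongo import MongoClient, UpdateMany

client = MongoClient()          # assumes a reachable MongoDB instance
db = client["mydb"]             # hypothetical database name

ops = []
for u in db.unique_doc.find({}, {"Licence": 1, "Publisher": 1, "Name": 1}):
    # Link every raw_doc that still matches the three fields to this unique_doc.
    ops.append(UpdateMany(
        {"Licence": u["Licence"], "Publisher": u["Publisher"], "Name": u["Name"]},
        {"$set": {"unique_ref": u["_id"]}},   # hypothetical reference field
    ))
    if len(ops) == 1000:                      # flush in batches to bound memory
        db.raw_doc.bulk_write(ops, ordered=False)
        ops = []
if ops:
    db.raw_doc.bulk_write(ops, ordered=False)

Once the reference is in place, later edits to the three fields in unique_doc no longer break the link.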
Generating my own id up front seems to be the way to go. I have managed to keep the processing time down to around 120s for 500k rows.
I have a collection named Company which has the following structure:
{
"_id" : ObjectId("57336ea1a7454c0100d889e4"),
"currentMonth" : 62,
"variables1": { ... },
...
"variables61": { ... },
"variables62" : {
"name" : "Test",
"email": "email#test.com",
...
},
"country" : "US",
}
My need is to be able to search for companies by name with up-to-date data. I don't have permission to change this data structure because many applications still use it. For the moment I haven't found a way to index these variables with this data structure, which makes the search slow.
Today each of these documents can be several megabytes in size and there are over 20,000 of them in this collection.
The system I want to implement uses a search engine to index the names of companies, but for that it needs to be able to detect changes in the collection.
MongoDB's change stream seems like a viable option but I'm not sure how to make it scalable and efficient.
Do you have any suggestions that would help me solve this problem? Any suggestion on the steps needed to set up the above system?
Usually with MongoDB you can add new fields to documents and existing applications would simply ignore the extra fields (though they naturally would not be populated by old code). Therefore:
1. Create a task that is regularly executed which goes through all documents in your collection, figures out the name for each document from its fields, then writes the name into a top-level field.
2. Add an index on that field.
3. In your search code, look up by the values of that field.
4. Compare the calculated name to the source-of-truth name. If different, discard the document.
If names don't change once set, step 1 only needs to go through documents that are missing the top-level name and step 4 is not needed.
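A minimal pymongo sketch of steps 1-3 above, assuming the authoritative name lives in the variables<currentMonth> sub-document (as in the sample) and that a new top-level companyName field is acceptable; both are assumptions:

from pymongo import ASCENDING, MongoClient

client = MongoClient()                      # assumes a reachable MongoDB instance
companies = client["mydb"]["Company"]       # hypothetical database name

# Step 1: periodically copy the current name into a top-level field.
# Two small projected reads per document avoid pulling the multi-megabyte documents.
for doc in companies.find({}, {"currentMonth": 1}):
    month = doc.get("currentMonth")
    if month is None:
        continue
    field = "variables{}".format(month)     # e.g. "variables62"
    current = companies.find_one({"_id": doc["_id"]}, {field + ".name": 1})
    name = (current or {}).get(field, {}).get("name")
    if name:
        companies.update_one({"_id": doc["_id"]}, {"$set": {"companyName": name}})

# Step 2: index the denormalised field (a no-op if it already exists).
companies.create_index([("companyName", ASCENDING)])

# Step 3: search against the indexed top-level field.
for company in companies.find({"companyName": "Test"}):
    print(company["_id"])

If names never change once set (the case mentioned above), the step 1 loop can be restricted with a filter such as {"companyName": {"$exists": False}}.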
Using the change-detection pattern with monstache, I was able to synchronise MongoDB with Elasticsearch in real time, filtering on the current month and then mapping the resulting variables to be indexed 🎊
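For anyone wiring up the change-detection side by hand instead of with monstache, a hedged pymongo change-stream sketch might look like this (the database name is hypothetical, the print stands in for the actual search-engine indexing call, and change streams require a replica set):

from pymongo import MongoClient

client = MongoClient()                      # must point at a replica set for change streams
companies = client["mydb"]["Company"]       # hypothetical database name

# React only to writes; ask MongoDB to look up the full document so the
# current name can be re-extracted even after partial updates.
pipeline = [{"$match": {"operationType": {"$in": ["insert", "update", "replace"]}}}]

with companies.watch(pipeline, full_document="updateLookup") as stream:
    for change in stream:
        doc = change.get("fullDocument") or {}
        month = doc.get("currentMonth")
        name = doc.get("variables{}".format(month), {}).get("name") if month else None
        if name:
            print(doc["_id"], name)         # replace with the search-engine indexing call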
From the following tutorial provided by Mongo:
MongoSpark.save(centenarians.write.option("collection", "hundredClub").mode("overwrite"))
am I correct in understanding that what is essentially happening is that Mongo is first dropping the collection, and then overwriting it with the new data?
My question, then, is: is it possible to use the MongoSpark connector to actually update records in Mongo?
Let's say I've got data that looks like
{"_id" : ObjectId(12345), "name" : "John" , "Occupation" : "Baker"}
What I would then like to do is merge in the record of that person from another file that has more details, i.e. that file looks like
{"name" : "John", "address" : "1800 some street"}
the goal is to update the record in Mongo so now the JSON looks like
{"_id" : ObjectId(12345) "name" : "John" , "address" : 1800 some street", "Occupation" : "Baker"}
Now here's the thing: let's assume that we just want to update John, and that there are millions of other records that we would like to leave as is.
There are a few questions here, I'll try to break them down.
What is essentially happening is that Mongo is first dropping the collection, and then overwriting it with the new data?
Correct. As of mongo-spark v2.x, if you specify mode overwrite, the MongoDB Connector for Spark will first drop the collection and then save the new result into it. See the source snippet for more information.
is it possible to use the MongoSpark connector to actually update records in Mongo?
The patch described in SPARK-66 (mongo-spark v1.1+) means that if a DataFrame contains an _id field, the data will be upserted: any existing documents with the same _id value will be updated, and new documents whose _id value is not already in the collection will be inserted.
What I would then like to do is merge in the record of that person from another file that has more details
As mentioned above, you need to know the _id value from your collection. Example steps:
1. Create a dataframe (A) by reading from your Person collection to retrieve John's _id value, i.e. ObjectId(12345).
2. Merge the _id value of ObjectId(12345) into your dataframe (B, built from the other file with more information), using a unique field value to join the two dataframes (A and B).
3. Save the merged dataframe (C) without specifying overwrite mode.
we just want to update John, and that there are millions of other records that we would like to leave as is.
In that case, before you merge the two dataframes, filter out any unwanted records from dataframe B (the one from the other file with more details). In addition, when you call save(), specify mode append.
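A rough sketch of that read-join-append flow, shown here with PySpark and the mongo-spark v2.x connector (the URIs, file name, and join key are assumptions; the Scala API follows the same shape, and the connector package must be on the classpath):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("merge-example")
         .config("spark.mongodb.input.uri", "mongodb://localhost/test.people")    # assumed URI
         .config("spark.mongodb.output.uri", "mongodb://localhost/test.people")
         .getOrCreate())

# Dataframe A: the existing people, including _id and fields like Occupation
# so they are carried into the merged rows.
people = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()

# Dataframe B: the extra details from the other file, e.g. {"name": "John", "address": "..."}.
details = spark.read.json("extra_details.json")                                   # assumed file

# Dataframe C: only the people present in the details file, enriched with the new fields.
merged = people.join(details, on="name", how="inner")

# Because each row carries an _id, the connector upserts rather than inserting duplicates.
(merged.write.format("com.mongodb.spark.sql.DefaultSource")
       .mode("append")
       .save())

The inner join doubles as the filter: only the people present in the details file end up in dataframe C, so the millions of other documents are left untouched.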
I have the following structure in my MongoDB database for a product :
product_1 = {
    'name': '...',
    'logo': '...',
    'nutrition': '...',
    'brand': {
        'name': '...',
        'logo': '...'
    },
    'stores': [{
        'store_id': ...,
        'url': '...',
        'prices': [{
            'date': ...,
            'price': ...
        }]
    }]
}
My pymongo script goes from store to store and I try to do the following thing :
if the product is not in the database: add all the information about the product, with the current price and date for the current store_id.
if the product is in the database but I don't have any entry for the current store: add an entry in stores with the current price, date and store_id.
if the product is in the database and I have a price entry for the current store but the current price is not the same: add a new entry with the new date and price for the current store_id.
Is it possible to do all of this in one request? For now I have been trying the following, without really knowing how to handle the stores and prices cases.
Maybe it is not the best way to construct my database; I am open to suggestions.
db.find_and_modify(
query={'$and':[
{'name':product['name']},
{'stores':{'$in':product['store_id']}}
]},
update={
'$setOnInsert':{
'name':product['product_name'],
'logo':product['product_logo'],
'brand':product['brand'],
[something for stores and prices ?]
},
},
upsert=True
)
There isn't presently (MongoDB 2.6) a way to do all of those things in one query. You can retrieve the document and do the updates, then save the updated version (or insert the new document):
oldDoc = collection.find_one({ "name" : product["name"] })
if oldDoc:
    # examine stores + prices to create an updated doc called newDoc
    newDoc = oldDoc
else:
    # make a new doc, newDoc, for the product
    newDoc = { "name": product["name"] }
collection.save(newDoc)  # you can use save if you put the same _id on newDoc as oldDoc
Alternatively, I think your nested array schema is the cause of this headache and may cause more headaches down the line (e.g. updating the price for a particular product on a particular date at a particular store cannot be done with a single database call). I would make each document represent the lowest level of one of your nested arrays: a particular product for sale at a particular price at a particular store on a particular date:
{
"product_name" : "bacon mayonnaise",
"store_id" : "123ZZZ8",
"price" : 99,
"date" : ISODate("2014-12-31T17:18:53.944Z")
// other stuff
}
You can duplicate a lot of the generic product information in each document or store it in a separate document; whether it's worth duplicating some information versus making another trip to the database to recall product details depends on how you're going to use the documents. Notice that the problem you're trying to solve simply disappears with this structure: you just do an upsert for a given product, store, price, and date.
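A hedged pymongo sketch of that upsert with the flattened schema (collection and field names follow the example document above; one price point per store per day and the recorded_at field are illustrative assumptions):

from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient()
prices = client["mydb"]["prices"]            # hypothetical flat collection

def record_price(product_name, store_id, price):
    # The filter pins product, store, day, and price; a new price on the same day
    # creates a new document, while an identical reading is a no-op.
    today = datetime.now(timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)
    prices.update_one(
        {"product_name": product_name, "store_id": store_id, "date": today, "price": price},
        {"$setOnInsert": {"recorded_at": datetime.now(timezone.utc)}},   # illustrative extra field
        upsert=True,
    )

record_price("bacon mayonnaise", "123ZZZ8", 99)

With this shape, all three cases from the question collapse into the single update_one above.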
I am using MongoDB and I ended up with two Collections (unintentionally).
The first Collection (sample) has 100 million records (Tweets) with the following structure:
{
"_id" : ObjectId("515af34297c2f607b822a54b"),
"text" : "bla bla ",
"id" : NumberLong("314965680476803072"),
"user" :
{
"screen_name" : "TheFroooggie",
"time_zone" : "Amsterdam",
},
}
The second collection (users) has 30 million records of unique users from the tweet collection, and it looks like this:
{ "_id" : "000000_n", "target" : 1, "value" : { "count" : 5 } }
where the _id in the users collection is the user.screen_name from the tweets collection, the target is their status (spammer or not), and value.count is the number of times the user appeared in our first (sample) collection (e.g. the number of captured tweets).
Now I'd like to make the following query:
I'd like to return all the documents from the sample collection (tweets) where the user has the target value = 1
In other words, I want to return all the tweets of all the spammers for example.
As you receive the tweets you could upsert them into a collection, using the author information as the key in the "query" document portion of the update. The update document could use the $addToSet operator to put the tweet into a tweets array. You'll end up with a collection that has the author and an array of tweets, and you can then do your spammer classification for each author and have their associated tweets.
So, you would end up doing something like this:
db.samples.update({"author":"joe"},{$addToSet:{"tweets":{"tweet_id":2}}},{upsert:true})
This approach does have the likely drawback of growing the document past its initially allocated size, which means it would be moved and expanded on disk. You would likely incur some penalty for index updating as well.
You could also take an approach of storing a spam rating with each tweet document and later pulling those based on user id.
As others have pointed out, there is nothing wrong with setting up the appropriate indexes and using a cursor to loop through your users pulling their tweets.
The approach you choose should be based on your intended access pattern. It sounds like you are in a good place where you can experiment with several different possible solutions.
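For completeness, a minimal pymongo sketch of the index-plus-cursor approach mentioned above (the database name, batch size, and the print placeholder are assumptions):

from pymongo import MongoClient

client = MongoClient()
db = client["mydb"]                              # hypothetical database name

db.sample.create_index("user.screen_name")       # one-time; lookups would otherwise scan 100M docs

def spammer_tweets(batch_size=1000):
    """Yield tweets whose author has target == 1, batching screen names into $in queries."""
    batch = []
    for user in db.users.find({"target": 1}, {"_id": 1}):
        batch.append(user["_id"])
        if len(batch) == batch_size:
            yield from db.sample.find({"user.screen_name": {"$in": batch}})
            batch = []
    if batch:
        yield from db.sample.find({"user.screen_name": {"$in": batch}})

for tweet in spammer_tweets():
    print(tweet["id"])                           # replace with real handling of each spam tweet

With 100 million tweets, the index on user.screen_name is what keeps each $in lookup from becoming a collection scan.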
I have two MongoDB collections
promo collection:
{
"_id" : ObjectId("5115bedc195dcf55d8740f1e"),
"curr" : "USD",
"desc" : "durable bags.",
"endDt" : "2012-08-29T16:04:34-04:00",
origPrice" : 1050.99,
"qtTotal" : 50,
"qtClaimd" : 30,
}
claimed collection:
{
"_id" : ObjectId("5117c749195d62a666171968"),
"proId" : ObjectId("5115bedc195dcf55d8740f1e"),
"claimT" : ISODate("2013-02-10T16:14:01.921Z")
}
Whenever someone claims a promo, a new document is created inside the "claimed" collection, where proId is a (virtual) foreign key to the first (promo) collection. Every claim should increment the counter "qtClaimd" in the "promo" collection. What's the best way to increment a value in another collection in a transactional fashion? I understand MongoDB doesn't have isolation across multiple documents.
Also, the reason why I went with the "non-embedded" approach is as follows:
A promo gets created and published to users, then claims happen in the hundreds of thousands. I didn't think it was logical to embed claims inside the promo collection given the number of writes that would hit a single document (Mongo resizes the promo document as it grows from thousands of claims). The non-embedded approach leaves the promo document unaffected and just inserts a new document into the "claims" collection. Later, while generating a report, I'll have to display the "promo" details along with the "claims" details for that promo, so with the non-embedded approach I'll have to query the "promo" collection first and then the "claims" collection with "proId". Also worth mentioning: there could be times when hundreds of "claims" happen simultaneously for the same "promo".
What's the best way to achieve a transactional effect with these two collections? I am using Scala, Casbah and Salat, all with Scala 2.10.
db.bar.update({ id: 1 }, { $inc: { 'some.counter': 1 } });
Just look at how to run this with SalatDAO; I'm not a Play user, so I wouldn't want to give you wrong advice about that. $inc is the Mongo way to increment.
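To illustrate the overall pattern rather than the Casbah/Salat specifics, a minimal pymongo sketch of the claim-then-increment flow might look like this (the database name and the userId field are assumptions, not part of the original schema):

from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient()
db = client["mydb"]                              # hypothetical database name

def claim_promo(promo_id, user_id):
    # No multi-document transaction here: each statement is atomic only on its
    # own document, so record the claim first and then bump the counter. If the
    # $inc fails, the claim document makes it possible to reconcile the count later.
    db.claimed.insert_one({
        "proId": promo_id,
        "userId": user_id,                       # hypothetical extra field
        "claimT": datetime.now(timezone.utc),
    })
    db.promo.update_one({"_id": promo_id}, {"$inc": {"qtClaimd": 1}})

If the two writes ever diverge, qtClaimd can always be recomputed from the claimed collection by counting documents with the same proId.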