merge rows in pentaho PDI kettle - mongodb

Is it possible to merge 2 or more rows in Pentaho?
Example: I have 2 rows:
case:'001',
owner:'Barack',
date:'2017-04-10'

case:'001',
owner:'Trump',
date:'2017-02-10'
Then I want to have this MongoDB output:
case:'001',
ownerHistory:[
    {
        owner:'Barack',
        date:'2017-04-10'
    },
    {
        owner:'Trump',
        date:'2017-02-10'
    }
]

The MongoDB Output step supports modifier updates that affect individual fields in a document, including the $push operation to add items to a list/array. Using those, you don't need to merge any rows; you can just send them all to the MongoDB Output step.
With the modifier update active, each incoming row will either be inserted (if the case doesn't exist yet) or have its owner added to the array (if it does).
Unfortunately, if you run this transformation a second time with the same incoming data, it will add more copies of the owners. I haven't found a way to fix that, so I hope you don't have that in your use case.
Perhaps if you split the data, insert/replace using the first record for each case, and then do $push updates for the second and later records, you can manage it.
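To make the mechanics concrete, here is a rough pymongo sketch of the kind of update the step performs per row; the database/collection names and rows are made up for illustration, and this shows the technique, not the step's actual internals:

from pymongo import MongoClient

cases = MongoClient()["testdb"]["cases"]  # hypothetical database/collection

rows = [
    {"case": "001", "owner": "Barack", "date": "2017-04-10"},
    {"case": "001", "owner": "Trump",  "date": "2017-02-10"},
]
for row in rows:
    # Upsert: the first row for a case creates the document,
    # later rows append to the ownerHistory array.
    cases.update_one(
        {"case": row["case"]},
        {"$push": {"ownerHistory": {"owner": row["owner"], "date": row["date"]}}},
        upsert=True,
    )
# Note: swapping $push for $addToSet would skip exact duplicate
# subdocuments, which helps with the re-run issue mentioned above.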

Related

mongodb multiple documents insert or update by unique key

I would like to get a list of items from an external resource periodically and save them into a collection.
There are several possible solutions but they are not optimal, for example:
Delete the entire collection and save the new list of items
Get all items from the collection using "find({})" and use it to filter out existing items and save those that do not exist.
But a better solution would be to set a unique key and just do a kind of "update or insert".
Right now, when saving an item whose unique key already exists, I get an error.
Is there a way to do this at all?
Upsert won't do the job, since it updates all matched items with the same value, so it's really only useful for a single document.
I have a feeling you can achieve what you want simply by using the "normal" insertMany with the ordered option set to false. The documentation states that:
Note that one document was inserted: The first document of _id: 13
will insert successfully, but the second insert will fail. This will
also stop additional documents left in the queue from being inserted.
With ordered to false, the insert operation would continue with any
remaining documents.
So you will get "duplicate key" exceptions which, however, you can simply ignore in your case.
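A minimal pymongo sketch of that approach, assuming a unique index on a hypothetical key field so duplicates are rejected while the rest still insert:

from pymongo import ASCENDING, MongoClient
from pymongo.errors import BulkWriteError

items = MongoClient()["testdb"]["items"]  # hypothetical names
items.create_index([("key", ASCENDING)], unique=True)

new_items = [{"key": 1, "v": "a"}, {"key": 2, "v": "b"}]
try:
    items.insert_many(new_items, ordered=False)
except BulkWriteError:
    # Only the duplicates were skipped; everything else went in.
    pass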

In update method, query parameter containing a list (pymongo)

I have a dictionary. I need to insert column 2 into MongoDB, matched to column 1 (the key).
Say this is the dictionary:
values = {'a': ['1', '2', '3'],
          'b': ['1', '2'],
          'c': ['3', '4']}
Right now I am doing this:
for k, v in values.items():
    col4.update({"name": k}, {"$set": {"fieldName": v}})
But this takes 3 round trips to the DB. Is it possible to do it in one go, the way $in works?
In your code you are finding each document by its name field and setting its fieldName to v. There is no single update operation in Mongo that can do such a thing in one shot for multiple documents.
However, there are bulk operations, which can be more efficient than multiple individual inserts or updates: http://docs.mongodb.org/manual/core/bulk-inserts/.
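In current pymongo, bulk_write batches everything into one round trip. A minimal sketch, assuming a local mongod and a hypothetical database name for the col4 collection from the question:

from pymongo import MongoClient, UpdateOne

col4 = MongoClient()["testdb"]["col4"]  # hypothetical database name

values = {'a': ['1', '2', '3'],
          'b': ['1', '2'],
          'c': ['3', '4']}

# One request carrying all three updates instead of three round trips.
ops = [UpdateOne({"name": k}, {"$set": {"fieldName": v}})
       for k, v in values.items()]
result = col4.bulk_write(ops, ordered=False)
print(result.modified_count)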
I think I previously didn't quite understand what you were asking and wrote the answer below, but I'm still not sure what you mean by $in. Perhaps you can provide an example of the data before and after the update in the DB; that way it will be absolutely clear what you are trying to achieve.
OLD answer ... (I'll edit it soon)
You need to restructure your loop. Build up a query (without running it) by adding {field: newValue} pairs to the $set clause. After the loop is done you will have something like {$set: {"a": 1, "b": 1, "c": 3}}. Then you update all fields in one shot; a sketch of this idea follows below.
Here is official documentation:
http://docs.mongodb.org/manual/reference/operator/update/set/
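A short sketch of the OLD answer's idea, reusing col4 and values from the sketch above; note it solves a different problem, namely setting several fields on the documents matched by a single query:

# Fold everything into one $set, then run a single update.
update = {"$set": {field: new_value for field, new_value in values.items()}}
# i.e. {"$set": {"a": [...], "b": [...], "c": [...]}}
col4.update_many({}, update)  # hypothetical selector: all documents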

Mongodb compare two big data collection

I want to compare two very big collections; the main goal of the operation is to know which elements changed and which were deleted.
Collections 1 and 2 have the same structure, and each holds more than 3 million records.
Example:
record 1 {id:'7865456465465', name:'tototo', info:'tototo'}
So I want to know: which elements changed, and which elements are not present in collection 2.
What is the best solution to do this?
1) Define what equality of 2 documents means. For me it would be: both documents should contain all fields with exactly the same values, given that their ids are unique. Note that Mongo does not guarantee field order, and if you update a field it might move to the end of the document, which is fine.
2) I would use some framework that can connect to Mongo and fetch data while converting it to a map-like data structure or even JSON. For instance I would go with Scala + Lift Record (db.coll.findAll()) + Lift JSON. The Lift JSON library has a Diff function that will give you a diff of 2 JSON docs.
3) Finally I would sort both collections by ids, open DB cursors, iterate and compare; a minimal sketch of that pass follows below.
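A minimal pymongo sketch of step 3, assuming a unique id field with an index on it so the server can sort 3 million documents; all names are hypothetical:

from pymongo import MongoClient

db = MongoClient()["testdb"]              # hypothetical database
c1 = db["coll1"].find().sort("id", 1)     # both sorts need an index on id
c2 = db["coll2"].find().sort("id", 1)

def normalized(doc):
    # Compare as plain dicts, ignoring _id and field order.
    return {k: v for k, v in doc.items() if k != "_id"}

doc1, doc2 = next(c1, None), next(c2, None)
while doc1 is not None:
    if doc2 is None or doc1["id"] < doc2["id"]:
        print("deleted:", doc1["id"])     # present in collection 1 only
        doc1 = next(c1, None)
    elif doc1["id"] > doc2["id"]:
        doc2 = next(c2, None)             # only in collection 2; skip
    else:
        if normalized(doc1) != normalized(doc2):
            print("changed:", doc1["id"])
        doc1, doc2 = next(c1, None), next(c2, None)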
If the schema is flat, which in your case it is, you can use a free tool (dataq.io) to compare the data in two tables.
Disclaimer: I am the founder of this product.

MongoDB - Combine filter with default filter

I'm hoping to do a very common task, which is to never delete items that I store, but instead just mark them with a deleted flag. However, for almost every request I will now have to specify deleted:false. Is there a way to have a "default" filter that you can add to, such that I can construct a live_items filter and run queries on top of it?
This was just one guess at a potential answer. In general, I'd just like to have deleted=False be the default search.
Thanks!
In SQL you would do this with a view, but unfortunately MongoDB doesn't support views.
But when queries that exclude deleted items are far more frequent than those that include them, you could remove the deleted items from the main items collection and put them in a separate items_deleted collection. This also has the nice side effect that the performance of the active items collection doesn't get impaired by a large number of deleted items. The downside is that indices can't be guaranteed to be unique across both collections.
I went ahead and just made a Python function that merges in the default filter:

def find_items(filt, single=False, live=True):
    # Merge the default "not deleted" filter into the caller's query.
    if live:
        filt = {**filt, 'deleted': False}
    if single:
        return db.Item.find_one(filt)
    return db.Item.find(filt)
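Calls then look like this (field names and values hypothetical):

active = find_items({'owner': 'Barack'})                  # implies deleted: False
first = find_items({'case': '001'}, single=True)          # find_one, still live-only
everything = find_items({'owner': 'Barack'}, live=False)  # includes deleted docs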

MongoDB update many documents with different timestamps in one update

I have 10000 documents in one MongoDB collection. I'd like to update all the documents with datetime values that are 1 second apart for each document (so all the datetime values are unique and spaced 1 second apart). Is there any way to do this with a single update, instead of updating each document in turn, which results in 10000 distinct update operations?
Thanks.
No, there is no way to do this with a single update statement. There are no expressions that run on the server to allow this type of update. There is a feature request for this, but it has not been implemented, so it cannot be used.
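The per-document updates can at least be batched into a single bulk_write round trip. A minimal pymongo sketch, where the collection name, timestamp field, and start time are assumptions:

from datetime import datetime, timedelta
from pymongo import MongoClient, UpdateOne

coll = MongoClient()["testdb"]["items"]  # hypothetical names
start = datetime(2017, 4, 10)            # hypothetical starting timestamp

# One UpdateOne per document, each 1 second apart, sent as one batch.
ops = [UpdateOne({"_id": doc["_id"]},
                 {"$set": {"ts": start + timedelta(seconds=i)}})
       for i, doc in enumerate(coll.find({}, {"_id": 1}))]
if ops:
    coll.bulk_write(ops, ordered=False)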