mongo-hadoop: how to not handle MongoDB document deletion

I want to synchronize MongoDB and Hadoop, but when I delete a document from MongoDB, that document must not be deleted in Hadoop.
I tried using mongo-hadoop and Hive. This is the Hive query:
CREATE EXTERNAL TABLE SubComponentSubmission
(
id STRING,
status INT,
providerId STRING,
dateCreated TIMESTAMP,
subComponentId STRING,
packageName STRING
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'=
'{"id":"_id", "status":"Status",
"providerId":"ProviderId",
"dateCreated":"DateCreated",
"subComponentId":"SubComponentPackage.SubComponentId",
"packageName":"SubComponentPackage.PackageName"}'
)
TBLPROPERTIES('mongo.uri'='mongodb://<host>:27017/<db name>.<collection name>');
This query creates a table that stays synchronized with the corresponding MongoDB collection, which means mongo-hadoop reflects document deletions too.
Does mongo-hadoop have any option not to handle document deletion?
Or is there any other tool that solves this problem?
Thanks in advance.

If you query directly against Mongo like you're doing, yes, you're going to see all the document mutations that happen in Mongo. That's the whole point of querying against Mongo like this. If you want snapshotted views of your Mongo data, you'll need to do something like a mongodump and put the BSON files on disk somewhere (like HDFS). Otherwise you'll always be querying against the live, mutating data.
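For example, the snapshot could be produced with mongodump and pushed to HDFS; a rough sketch, assuming the stock mongodump and hdfs CLI tools (host, db, collection, and paths are placeholders):
# dump a frozen copy of the collection as BSON
mongodump --host <host> --db <db name> --collection <collection name> --out /tmp/mongo-snapshot

# copy the BSON file onto HDFS so Hive/MapReduce jobs read the snapshot, not live data
hdfs dfs -put /tmp/mongo-snapshot/<db name>/<collection name>.bson /data/snapshots/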

Related

Mongoose schema definition

I am a beginner with MongoDB and I am trying to learn the MEAN stack, so I am using Mongoose as the ORM.
I read that MongoDB is a NoSQL database, but while using Mongoose I am asked to create a schema first. Why is that? Ideally there shouldn't be a schema, since MongoDB is a NoSQL database.
Thanks in advance.
Mongoose is an ODM (often loosely called an ORM) on top of MongoDB. If you use core MongoDB, you need not create any schema; you can just dump any data you want. With Mongoose you have a schema so that you have some basic key/value pairs for advanced searching and filtering, and you can update the schema at any time. If you want to go schemaless and dump whatever the response is, you can use a schema like var someSchema = {data: Object}, drop all your data into that data key, and then easily extract whatever JSON data is inside it.
var mongoose = require('mongoose');

// created_at is a real Date so it can be queried and indexed;
// dump holds the raw, schemaless API response (Mixed accepts any structure)
module.exports = mongoose.model('twitter', new mongoose.Schema({
    created_at: {
        type: Date
    },
    dump: {
        type: mongoose.Schema.Types.Mixed
    }
}));
In the above example, dump is used to save whatever JSON I get as a response from the Twitter API, and created_at contains only the creation date of the tweet. I still have the entire data, but if I want to search tweets of a particular date, I can search created_at with a find query. That query will be a lot faster, and I have a fixed structure and know what to expect from each find query I run. So this is one of the benefits of using Mongoose: I don't lose data, but I maximize my searching ability by creating appropriate keys.
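For example, a date-range query on created_at might look like this sketch (the require path and the dates are placeholders):
// fetch only the tweets created on a particular day, via the Date field
var Tweet = require('./twitter-model');

Tweet.find({
    created_at: { $gte: new Date('2016-01-01'), $lt: new Date('2016-01-02') }
}, function (err, tweets) {
    // tweets contains only documents whose created_at falls on 2016-01-01
});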
So basically Mongoose offers relational-style features on top of a JSON/BSON database. You can create something like a foreign key, not strictly a foreign key, but an id reference to another schema, and later populate that field with the referenced document when you fetch data with a find query. A relational-style schema is also easy to manage. You get the best of both worlds: you can easily introduce new keys, you don't need to worry about extracting and placing every piece of data by hand as long as your keys and values match, and you still keep flexibility in update operations while having a schema or table structure.
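A minimal sketch of such an id reference together with populate(), using made-up user/post schemas:
var mongoose = require('mongoose');

var userSchema = new mongoose.Schema({ name: String });
var postSchema = new mongoose.Schema({
    title: String,
    // not a strict foreign key, just an id reference that populate() can resolve
    author: { type: mongoose.Schema.Types.ObjectId, ref: 'user' }
});

var User = mongoose.model('user', userSchema);
var Post = mongoose.model('post', postSchema);

// later: fetch posts with the referenced user document filled in
Post.find({}).populate('author').exec(function (err, posts) {
    // each posts[i].author is now the full user document, not just an id
});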

How to find last update/insert/delete operation time on mongodb collection without objectid field

I have some unused collections in a MongoDB database, and I have to find out when CRUD operations were last performed against each collection. We have our own _id field instead of Mongo's default ObjectId, and we don't have any time field in the documents to find the modification time. Is there any way to find the modification time of a collection in MongoDB from metadata? Is there data dictionary information like in Oracle for this? Please suggest some ideas or workarounds.
To make a long story short: MongoDB has a flexible schema, so simply add a date field. Since older entries don't have it, they cannot be the most recently modified ones.
Let's call that field mtime.
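Of course, the field only helps if every write actually sets it. A minimal sketch using the $currentDate update operator (the filter and the $set payload here are made up):
// stamp mtime with the server's current date on every write
db.yourCollection.update(
    { _id: someId },
    { "$set": { status: "processed" }, "$currentDate": { mtime: true } }
)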
After adding the date field to your schema definition, create an index in descending order on it:
db.yourCollection.createIndex({mtime: -1})
Finding the last mtime for a collection now is easy:
db.yourCollection.find({"mtime":{"$exists":true}}).sort({"mtime":-1}).limit(1)
Do this for every collection. When the above query does not return an mtime within the timeframe you defined for purging a collection, simply drop it, since it has not been modified since you introduced the mtime field.
After your collections are cleaned up, you may remove the mtime field from your schema definition. To remove it from the documents, you can run a simple query:
db.yourCollection.update(
    { "mtime": { "$exists": true } },
    { "$unset": { "mtime": "" } },
    { multi: true }
)
There is no "data dictionary" to get this information in MongoDB.
If you enabled profiling in advance to log all operations (db.setProfilingLevel(2)), and there haven't been so many operations that the system.profile capped collection has already overwritten the entries you're interested in, you can get the information you need there; otherwise it's gone.
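For example, assuming profiling was enabled early enough, the most recent write against a collection could be found with a query like this sketch (the namespace is a placeholder):
// last write operation recorded by the profiler for one collection
db.system.profile.find(
    { ns: "mydb.yourCollection", op: { "$in": ["insert", "update", "remove"] } }
).sort({ ts: -1 }).limit(1)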

Insert all documents from one collection into another collection in MongoDB database

I have a Python script that collects data every day and inserts it into a MongoDB collection (~10M documents). Sometimes the job fails and I am left with partial data, which is not useful to me. I would like to insert the data into a staging collection first, and then copy or move all documents from the staging collection into the final collection only when the job finishes and the data is complete. I cannot seem to find a straightforward solution for doing this as a "bulk" type operation, but it seems there should be one.
In SQL it would be something like this:
INSERT INTO final_table
SELECT *
FROM staging_table
I thought that db.collection.copyTo() would work for this but it seems it makes the destination collection a clone of the source collection.
Additionally, I know from this question (mongodb move documents from one collection to another collection) that I can do something like the following:
var documentsToMove = db.collectionA.find({});
documentsToMove.forEach(function(doc) {
    db.collectionB.insert(doc);
});
But it seems like there should be a more efficient way.
So, how can I take all documents from one collection and insert them into another collection in the most efficient manner?
NOTE: the final collection already has data in it. The new documents that I want to move over would add to this data; e.g. if my staging collection has 2 documents and my final collection has 10 documents, I would have 12 documents in my final collection after I move the staging data over.
You can use db.cloneCollection(); see the MongoDB cloneCollection documentation.
If you no longer need the staging collection, you can simply rename it instead.
Switch to the admin db and run:
db.runCommand({renameCollection:"staging.CollectionA",to:"targetdb.CollectionB"})
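Note that renameCollection cannot merge into a collection that already has data, so if your final collection is non-empty (as in the question), a batched copy is usually more efficient than one insert() per document. A rough sketch using insertMany, which is available from MongoDB 3.2 (the collection names and batch size are arbitrary):
// copy staging into the existing final collection in batches of 1000
var batch = [];
db.stagingCollection.find({}).forEach(function (doc) {
    batch.push(doc);
    if (batch.length === 1000) {
        db.finalCollection.insertMany(batch);
        batch = [];
    }
});
if (batch.length > 0) {
    db.finalCollection.insertMany(batch);  // flush the remainder
}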

Which is the best way to insert data in mongodb

While writing data to MongoDB, we check whether the data is already present: if it is, we get the _id and update the document using save; otherwise we add it using insert. I read that save is the best way, since if you provide an _id it will update or insert depending on whether that _id is already in the db. Is save the best method, or is there another way?
If you have all data available to save, just run update() each time but use the upsert functionality. Only one query required:
// id and data come from your application
db.collection.update(
    { _id: id },
    data,
    { upsert: true }
)
If your _id is generated by Mongo, you always know there is a record in the database, and update is the one to use; but then again you could also save().
If you generate your ids yourself (and thus don't know whether a given id already exists in the collection), this approach always works without having to run an extra query.
From the documentation
db.collection.save()
Updates an existing document or inserts a new document, depending on its document parameter.
db.collection.insert()
Inserts a document or documents into a collection.
If you use db.collection.insert() in your case, you will get a duplicate key error, since it will try to insert a new document with the same _id as an existing document. But instead of using save, you should use the update method.
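To illustrate the difference, a small mongo shell sketch (the values are made up):
// save() upserts by _id; insert() refuses a duplicate _id
db.collection.save({ _id: 1, status: "first" })    // no such _id yet: inserts
db.collection.save({ _id: 1, status: "second" })   // _id exists: replaces the document
db.collection.insert({ _id: 1, status: "third" })  // fails with a duplicate key error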

Does _id field change in MongoDB when copying data from one collection to another?

We are planning on using MongoDB _id as a key that we would provide to the client. Therefore, the requirement is that this key should not change if we ever need to move the data from one collection to another. The copy will be performed using db.copyDatabase() or mongoimport.
One way to copy data from one collection to another is to iterate through the documents in the first collection (C1) and insert them into the second collection (C2). In this case the _id should remain the same in C2, because it would already be present in the documents of C1 being inserted (the same as when we provide an _id ourselves).
However, if there is an alternate way in which documents are copied, the _id might change, since an ObjectId depends on:
(1) the UNIX timestamp
(2) a machine identifier
(3) the process id
(This should only happen if MongoDB, while copying, removes the _id from documents in C1 and regenerates it while inserting into C2.)
We want the _id values to be the same irrespective of the location of the destination collection:
(1) within the same database
(2) in a different database on the same machine
(3) in a different database on a different machine
Thanks
No, the _id numbers will not change.
A new ObjectId is generated when a document without an _id field is inserted into the database. When you insert a document which already has an _id field, MongoDB won't touch it.
The timestamp, machine identifier and processID refer to those where the ObjectID was generated. This can be a database server, but it can also be generated by the MongoDB driver on the application server. In that case MongoDB will not change it on its own.
By the way: the _id can be an auto-generated ObjectId, but it doesn't have to be. You can use any other value as _id, as long as you can guarantee that it's unique. So when your data already has a natural key, you can use that as _id if you want to.
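A small sketch showing that a caller-supplied _id survives a copy unchanged (the names and values are made up):
// insert with a natural key as _id, then copy the document elsewhere
db.C1.insert({ _id: "client-key-42", payload: "example" })
db.C2.insert(db.C1.findOne({ _id: "client-key-42" }))
db.C2.findOne({ _id: "client-key-42" })   // same document, same _id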