Read data from huge Mongo DB - mongodb

Scenario:
Collection A has 40 million records and each record has almost 20 fields.
Get 5 defined fields from A, rename them, and populate them into collection B.
Example:
A ("_id" is the primary key here):
{
    "_id": 123,
    "id": 123,
    "title": "test",
    "summary": "test",
    "version": 1,
    "parentid": 12
}
B:
{
    "_id": 123,
    "p$id": 123,
    "p$parentid": 12,
    "p$title": "test"
}
Can someone please suggest a good way to write code for this scenario?
I wrote the code below, but it took 5 hours to complete.
My Code:
config.py:
It has all the MongoDB-related connection details.
Actual code:
from pymongo import MongoClient
import datetime
import config  # MongoDB related details (host, port, db/collection names)

print "Start time", datetime.datetime.now()

primary_dict = {}
primary_list = []
secondary_dict = {}
secondary_list = []
missing_id = []
mismatch_id = []
alias_dict = {
    "_id": "_id",
    "id": "p$id",
    "title": "p$title",
    "parentid": "p$parentid"
}

def mongo_connect(host, port, db, collection):
    client = MongoClient(host, port)
    db_obj = client[db]
    collection_obj = db_obj[collection]
    return collection_obj

def primary():
    global primary_list
    global secondary_list
    global missing_id
    primary_collection = mongo_connect(config.mongo_host, config.mongo_port, config.mongo_primary_db, config.mongo_primary_collection)
    secondary_collection = mongo_connect(config.mongo_host, config.mongo_port, config.mongo_secondary_db, config.mongo_secondary_collection)
    for dict1 in primary_collection.find({}, {"_id": 1, "title": 1}).batch_size(1000):
        primary_list = []
        secondary_list = []
        target_id = dict1['_id']
        primary_list.insert(0, dict1)
        # fetch the secondary document once instead of twice
        secondary_doc = secondary_collection.find_one({"_id": target_id})
        if secondary_doc is None:
            missing_id.append(target_id)
            continue
        else:
            secondary_list.insert(0, secondary_doc)
            compare(primary_list, secondary_list)

def compare(list1, list2):
    global alias_dict
    global mismatch_id
    # compare the lists passed in as parameters, not the globals
    for l1, l2 in zip(list1, list2):
        if len(l1) != len(l2):
            mismatch_id.append(l1['_id'])
            continue
        else:
            for key, value in l1.items():
                if value != l2[alias_dict[key]]:
                    mismatch_id.append(l1['_id'])

primary()
print "Mismatch id list", mismatch_id
print "Missing Id list", missing_id
print "End time", datetime.datetime.now()

Well you could do this:
db.eval(function(){
    db.primary_collection.find({},
        { id: 1, parentid: 1, title: 1 }).forEach(function(doc){
        var newDoc = {};
        Object.keys(doc).forEach(function(key) {
            var newKey = ( key == "_id" ) ? key : "p$" + key;
            newDoc[newKey] = doc[key];
        });
        db.secondary_collection.insert(newDoc);
    });
})
This uses db.eval() to execute the code on the server, which is about as fast as you will get.
But please read the documentation on this, as you will be "locking" the database while the operation takes place. And of course you cannot do this across servers, if that is your intent.
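For completeness: db.eval() was deprecated in MongoDB 3.0 and removed in 4.2, so on newer servers a batched client-side copy is the usual route. Below is a minimal pymongo sketch, not a definitive implementation; it assumes the connection details from config.py above, and the batch size of 1000 is arbitrary:

from pymongo import MongoClient, InsertOne
import config  # same connection details module as above

client = MongoClient(config.mongo_host, config.mongo_port)
primary = client[config.mongo_primary_db][config.mongo_primary_collection]
secondary = client[config.mongo_secondary_db][config.mongo_secondary_collection]

ops = []
# Project only the needed fields so far less data crosses the wire.
for doc in primary.find({}, {"id": 1, "title": 1, "parentid": 1}):
    ops.append(InsertOne({
        "_id": doc["_id"],
        "p$id": doc.get("id"),
        "p$title": doc.get("title"),
        "p$parentid": doc.get("parentid"),
    }))
    if len(ops) == 1000:
        secondary.bulk_write(ops, ordered=False)  # one round trip per 1000 docs
        ops = []
if ops:
    secondary.bulk_write(ops, ordered=False)  # flush the remainder

Batching the inserts avoids the per-document round trips that dominate a naive loop over 40 million records.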

Related

How to filter mongodb documents by array of objects

I have an array like the following structure
[{"id":1,"address":"UK"},{"id":2,"address":"US"}]
I want to fetch all entries from mongodb where id = 1 and address = "UK" OR id = 2 and address = "US"
Let's assume your collection is test_collection; this should do the work.
It works well in DataGrip; you may try it in your MongoDB client.
const arr = [{"id": 1, "address": "UK"}, {"id": 2, "address": "US"}, {"id": 2, "address": "CN"}];
arr.forEach((v) => {
    db.test_collection.find({id: v.id, address: v.address}).forEach(
        function (order) {
            let row = {};
            row.id = order.id;
            row.address = order.address;
            // row.other_field = order.other_field;
            print(row);
        });
});
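An alternative worth noting is a single $or query instead of one find() per array element, which matters once the array grows. A sketch with pymongo; the connection details are placeholders:

from pymongo import MongoClient

client = MongoClient("localhost", 27017)  # placeholder connection details
coll = client["test"]["test_collection"]

arr = [{"id": 1, "address": "UK"}, {"id": 2, "address": "US"}]
# Each array element becomes one AND-ed branch of the $or.
query = {"$or": [{"id": v["id"], "address": v["address"]} for v in arr]}
for doc in coll.find(query):
    print(doc)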

Mongo Document Increment Sequence is skipping numbers

I'm currently having issues when querying for one of my documents in a database through Meteor.
Using the code below I'm trying to retrieve the next sequence number from the DB, but it sometimes skips numbers, seemingly at random.
var col = MyCounters.findOne(type);
MyCounters.update(col._id, {$inc: {seq: 1}});
return col.seq;
Not getting any kind of errors server side.
Does anybody know what the issue might be?
I'm on Meteor 1.4+
====================
Update
I also update another Collection with the new value obtained from MyCounters collection, so it would be something like this:
var col = MyCounters.findOne(type);
MyCounters.update(col._id, {$inc: {seq: 1}});
var barId = col.seq;
// declare barObject + onInsertError
barObject.barId = barId;
// ...
FooCollection.insert(barObject, onInsertError);
And FooCollection ends up having skipped sequence numbers up to 5000 sometimes.
If you want to increment the seq field on that one document, you can use:
var col = MyCounters.findOne(type);
var valueOne = 1;
var nameItem = 'seq';
var inc = {};
inc[ nameItem ] = valueOne;
MyCounters.update({ _id: col._id }, { '$inc': inc } )
But if you want to set the value based on the maximum seq across all documents in the MyCounters collection (max seq + 1), you can use:
var count = MyCounters.findOne({}, {sort:{seq:-1}}).seq;
count = count + 1;
MyCounters.update({_id:col._id}, {$set:{seq:count}})
I hope it works for you. Thanks.
Refer to: https://stackoverflow.com/a/33968766/4287229
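For what it's worth, skipped (or duplicated) numbers are usually a symptom of the non-atomic findOne-then-update pair: two concurrent requests can read the same seq before either increment is written. The standard fix is an atomic read-and-increment. A sketch with pymongo's find_one_and_update, where the names and connection details are illustrative (Meteor would need the equivalent through its own collection API):

from pymongo import MongoClient, ReturnDocument

client = MongoClient("localhost", 27017)  # illustrative connection details
counters = client["meteor"]["MyCounters"]

def next_seq(counter_id):
    # $inc and the read happen as one atomic server-side operation,
    # so no two callers can observe the same value.
    doc = counters.find_one_and_update(
        {"_id": counter_id},
        {"$inc": {"seq": 1}},
        return_document=ReturnDocument.AFTER,
    )
    return doc["seq"]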

Mongo DB - map relational data to document structure

I have a dataset containing 30 million rows in a mongo collection. An example set of records would be:
{"_id" : ObjectId("568bc0f2f7cd2653e163a9e4"),
"EmailAddress" : "1234#ab.com",
"FlightNumber" : 1043,
"FlightTime" : "10:00"},
{"_id" : ObjectId("568bc0f2f7cd2653e163a9e5"),
"EmailAddress" : "1234#ab.com",
"FlightNumber" : 1045,
"FlightTime" : "12:00"},
{"_id" : ObjectId("568bc0f2f7cd2653e163a9e6"),
"EmailAddress" : "5678#ab.com",
"FlightNumber" : 1045,
"FlightTime" : "12:00"},
This has been imported directly from SQL Server, hence the relational-esque nature of the data.
How can I best map this data to another collection so that all the data is then grouped by EmailAddress with the FlightNumbers nested? An example of the output would then be:
{"_id" : ObjectId("can be new id"),
"EmailAddress" : "1234#ab.com",
"Flights" : [{"Number":1043, "Time":"10:00"},{"Number":1045, "Time":"12:00"}]},
{"_id" : ObjectId("can be new id"),
"EmailAddress" : "5678#ab.com",
"Flights" : [{"Number":1045, "Time":"12:00"}]},
I've been working on an import routine that iterates through each record in the source collection and then bulk inserts into the second collection. This works fine, but it doesn't allow me to group the data unless I post-process the records, which adds a huge time overhead to the import routine.
The code for this would be:
var sourceDb = db.getSiblingDB("collectionSource");
var destinationDb = db.getSiblingDB("collectionDestination");

var externalUsers = sourceDb.CRM.find();
var startDate = new Date();
var index = 0;
var contactArray = new Array();
var identifierArray = new Array();

externalUsers.forEach(function(doc) {
    //library code for NewGuid omitted
    var guid = NewGuid();
    //buildContact and buildIdentifier simply create 2 js objects based on the parameters
    contactArray.push(buildContact(guid, doc.EmailAddress, doc.FlightNumber));
    identifierArray.push(buildIdentifier(guid, doc.EmailAddress));
    index++;
    if (index % 1000 == 0) {
        var now = new Date();
        var dif = now.getTime() - startDate.getTime();
        var Seconds_Between_Dates = Math.abs(dif / 1000);
        print("Written " + index + " items (" + Seconds_Between_Dates + "s from start)");
    }
    //bulk insert in batches
    if (index % 5000 == 0) {
        destinationDb.Contacts.insert(contactArray);
        destinationDb.Identifiers.insert(identifierArray);
        contactArray = new Array();
        identifierArray = new Array();
    }
});

//insert any remainder that didn't fill a full batch
if (contactArray.length > 0) {
    destinationDb.Contacts.insert(contactArray);
    destinationDb.Identifiers.insert(identifierArray);
}
Many thanks in advance
Hey there and welcome to MongoDB. In this situation you may want to consider using two different Collections -- one for users and one for flights.
User:
{
    _id:
    email:
}
Flight:
{
    _id:
    userId:
    number: // if number is unique, you can actually specify _id as number
    time:
}
In your forEach loop, you would first check to see if a user document with that specific email address already exists. If it doesn't, create it. Then use the User document's unique identifier to insert a new document into the Flights collection, storing the identifier under the field userId (or maybe passengerId?).
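If the goal is strictly the nested output shown in the question, another option is to skip the client-side loop and let the server do the grouping with the aggregation pipeline, writing the result with $out. A pymongo sketch; the connection details and target collection name are assumptions:

from pymongo import MongoClient

client = MongoClient("localhost", 27017)  # assumed connection details
source = client["collectionSource"]["CRM"]

source.aggregate([
    # One output document per email, with the flights pushed into an array.
    {"$group": {
        "_id": "$EmailAddress",
        "Flights": {"$push": {"Number": "$FlightNumber", "Time": "$FlightTime"}},
    }},
    # Rename _id back to EmailAddress; $out assigns fresh _id values.
    {"$project": {"_id": 0, "EmailAddress": "$_id", "Flights": 1}},
    {"$out": "ContactsGrouped"},  # assumed target collection name
], allowDiskUse=True)  # $group over 30M rows needs to spill to disk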

Slow session: fastest way to check if a document exists

I am trying to write a simple session store (Haskell driver, if it matters) with a MongoDB backend. I may be wrong, but it seems a bit slow compared to when I run the benchmark without the session.
With the session it gives me 25 connections a second; 10596 without.
Once the session is set on the initial load, all it does is compare the SID from the cookie to the SID stored in the session document in MongoDB. So on every request it makes a single trip to the database server: I get the SID from the cookie and check if a document with that SID exists in MongoDB. That's all. I am learning, so my session logic could be off too.
At the moment, I use count to check if the document exists: I count documents with the relevant SID and test if it == 1. Is this a fast enough way to check if a document exists?
I found in this question, test if document exists, that testing with find and limit is faster. But it only compares it to findOne, not to count.
So my question is: what is the fastest way to check if a document exists?
Thanks.
As to your question, have a look at the source code of find/findOne/count
rs0:PRIMARY> db.geo.count
function ( x ){
    return this.find( x ).count();
}
rs0:PRIMARY> db.geo.findOne
function ( query , fields, options ){
    var cursor = this.find(query, fields, -1 /* limit */, 0 /* skip*/,
        0 /* batchSize */, options);
    if ( ! cursor.hasNext() )
        return null;
    var ret = cursor.next();
    if ( cursor.hasNext() ) throw "findOne has more than 1 result!";
    if ( ret.$err )
        throw "error " + tojson( ret );
    return ret;
}
rs0:PRIMARY> db.geo.find
function ( query , fields , limit , skip, batchSize, options ){
    var cursor = new DBQuery( this._mongo , this._db , this ,
        this._fullName , this._massageObject( query ) , fields , limit , skip , batchSize , options || this.getQueryOptions() );
    var connObj = this.getMongo();
    var readPrefMode = connObj.getReadPrefMode();
    if (readPrefMode != null) {
        cursor.readPref(readPrefMode, connObj.getReadPrefTagSet());
    }
    return cursor;
}
The difference is that findOne and count build on this.find and actually execute the query, while find by itself only constructs a DBQuery.
So I did a benchmark on the 3 ways:
function benchMark1() {
    var date = new Date();
    for (var i = 0; i < 100000; i++) {
        db.zips.find({
            "_id": "35004"
        }, {
            _id: 1
        });
    }
    print(new Date() - date);
}
function benchMark2() {
    var date = new Date();
    for (var i = 0; i < 100000; i++) {
        db.zips.findOne({
            "_id": "35004"
        }, {
            _id: 1
        });
    }
    print(new Date() - date);
}
function benchMark3() {
    var date = new Date();
    for (var i = 0; i < 100000; i++) {
        db.zips.count({
            "_id": "35004"
        }, {
            _id: 1
        });
    }
    print(new Date() - date);
}
It turns out benchMark1 takes 1046ms, 2 takes 37611ms, 3 takes 63306ms.
It seems you are using the worst one.
EDIT: The reason why it's slow is described here: https://dba.stackexchange.com/questions/7573/difference-between-mongodbs-find-and-findone-calls
Also, make sure you have a unique index on the SID field:
rs0:PRIMARY> db.system.indexes.find()
If no index exists on SID,
rs0:PRIMARY> db.session.ensureIndex({SID: 1}, {unique: true}) // change "session" to your collection name
Note that although _id is usually an ObjectId, it doesn't have to be, so you can use the SID as _id. There's always an index on _id, so you save an extra index and thus make insertion faster. To do this, just set the _id field to the SID when you insert a record.
{
    _id: [value of SID]
    ... // rest of record
}
And if this still doesn't meet your requirements, you need to analyse where the bottleneck is. That's another topic we can talk about if necessary.
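One caveat on the shell numbers above: benchMark1 never iterates the cursor, so the shell only constructs a DBQuery object and no query is actually sent, which flatters find(). In a driver, the practical existence check is a query that returns at most one document with a minimal projection. A pymongo sketch, assuming the SID is stored as _id as suggested above (db and collection names are placeholders):

from pymongo import MongoClient

client = MongoClient("localhost", 27017)  # assumed connection details
sessions = client["app"]["session"]       # assumed db/collection names

def session_exists(sid):
    # Matching on _id and projecting only _id makes this a single
    # index lookup that returns at most one tiny document.
    return sessions.find_one({"_id": sid}, {"_id": 1}) is not None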

Picking an array of hashes from a hash

I have a hash coming back from an XML datasource that looks like this:
{...,
    'records' :{
        'record' :[
            {'availability' :{'$t' :'available'}, ...},
            {'availability' :{'$t' :'available'}, ...}
        ]
    }
};
I'd like to get all the record hashes into an array so I can filter() it and do some other operations. However, when I have this statement in my pre block,
raw_records = raw.pick("$..record");
the array that gets returned is an array of two empty strings:
var raw_records = ['', ''];
The odd thing is that I can pick out just availability with expected results:
availability = raw.pick("$..availability.$t");
producing
var availability = ['available', 'available'];
What's wrong with my first pick()?
EDIT: Here is a more complete version that should help with reproducing the problem. It's slightly different, since I'm using the JSON version of the web service now:
global {
    datasource hbll <- "https://svc.lib.byu.edu/services/catalog/v1/search/?field=isbn&format=json&terms=";
}
rule new_rule {
    select when pageview "amazon.com/.*/?dp/(.*)/" setting (isbn)
    pre {
        //This is the array with two empty strings...
        raw = datasource:hbll(isbn);
        myfilter = function(x) { x.pick("availability") eq "available"; };
        records = raw.filter(myfilter);
        len = records.length();
        availability = records.pick("$..availability");
        middleman = len > 1 => availability[0] | availability;
        available = middleman eq "available" => true | false;
        url_list = records.pick("$..url");
        url = len > 1 => url_list[0] | url_list;
        msg = <<
            <p>This book is available for checkout at the BYU Library.</p>
            More information
        >>;
    }
    notify("BYU Harold B. Lee Library", msg) with sticky=true;
}
I'm going to need a more complete example. The test app and results I got are below:
ruleset a8x167 {
    meta {
        name "Pick - Array of Hashes"
        description <<
            Testing
        >>
        author "Sam Curren"
        logging on
    }
    dispatch {}
    global {
        raw = {
            'records' :{
                'record' :[
                    {'availability' :{'$t' :'available'}},
                    {'availability' :{'$t' :'available'}}
                ]
            }
        };
    }
    rule test {
        select when pageview ".*" setting ()
        pre {
            raw_records = raw.pick("$..record");
            availability = raw.pick("$..availability.$t");
        }
        notify("Hello World", "This is a sample rule.");
    }
}
And Results:
var raw_records = [{'availability' :{'$t' :'available'}}, {'availability' :{'$t' :'available'}}];
var availability = ['available', 'available'];