MongoDB Aggregates - mongodb

I'm stuck on an aggregation problem in MongoDB. The data structure I'm dealing with looks like this:
{
"_id" : ObjectId("4f16fe11d1e2d32371072aa0"),
"body" : " \nHi Kate, per our discussion on yesterday about the $15.00 flat fee on Tom's \nand Mark's deals, here is Bloomberg's response. Please pass this info to all \nof our traders. Please let me know what the response is from them.\n\nThanks\n\n\n---------------------- Forwarded by Evelyn Metoyer/Corp/Enron on 04/17/2001 \n02:34 PM ---------------------------\n\n\n\"PAUL CALLAHAN, BLOOMBERG/ NEW YORK\" <PCALLAHAN2#bloomberg.net> on 04/17/2001 \n02:28:57 PM\nTo: Evelyn.Metoyer#enron.com\ncc: \n\nSubject: Commission\n\n\nEvelyn, as of April 16, 2001 our charge for Spot trades is a flat fee of\n$15/trade.\n\n\n",
"filename" : "3272.",
"headers" : {
"Content-Transfer-Encoding" : "7bit",
"Content-Type" : "text/plain; charset=us-ascii",
"Date" : ISODate("2001-04-17T14:33:00Z"),
"From" : "evelyn.metoyer#enron.com",
"Message-ID" : "<33504483.1075841847839.JavaMail.evans#thyme>",
"Mime-Version" : "1.0",
"Subject" : "Commission for Bloomberg",
"To" : [
"kate.symes#enron.com"
],
"X-FileName" : "kate symes 6-27-02.nsf",
"X-Folder" : "\\kate symes 6-27-02\\Notes Folders\\Discussion threads",
"X-From" : "Evelyn Metoyer",
"X-Origin" : "SYMES-K",
"X-To" : "Kate Symes",
"X-bcc" : "",
"X-cc" : ""
},
"mailbox" : "symes-k",
"subFolder" : "discussion_threads"
}
There are 120477 records in the database. I'm supposed to find out the pair of people who tend to communicate most (and 2nd most) with each other. The query that I've written is as follows:
db.messages.aggregate([
{$project:{From:"$headers.From",To:"$headers.To",_id:0}},
{$unwind:"$headers.To"},
{$group:{_id:{From:"$From",To:"$To"},number:{$sum:1}}},
{$limit:3},
{$sort:{number:-1}}
]);
but it somehow does not work.
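Two things in the pipeline above look like likely culprits. After `$project` renames the fields, `headers.To` no longer exists in the documents flowing through, so `$unwind` probably needs to target `$To`. Also, `$limit` placed before `$sort` keeps three arbitrary groups and only then orders them. The sketch below applies both changes and nothing else. It treats pairs as directed (From to To) and assumes a shell where `aggregate` returns a cursor. The `typeof db` guard only lets the snippet load outside the mongo shell.

```javascript
// Sketch of a corrected pipeline; field names are taken from the sample document.
var pipeline = [
  // Keep only sender and recipients, under short names.
  { $project: { _id: 0, From: "$headers.From", To: "$headers.To" } },
  // One document per recipient; note the renamed field.
  { $unwind: "$To" },
  // Count messages per directed (From, To) pair.
  { $group: { _id: { From: "$From", To: "$To" }, number: { $sum: 1 } } },
  // Order first, then truncate.
  { $sort: { number: -1 } },
  { $limit: 3 }
];

// Only runs inside the mongo shell, where `db` is defined.
if (typeof db !== "undefined") {
  db.messages.aggregate(pipeline).forEach(printjson);
}
```

If a message can list the same recipient twice, or if A to B and B to A should count as one pair, the grouping key would need further normalisation.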

Related

Sum Emails Sizes From MongoDB Collection

I have been asked to perform a very basic query task in MongoDB; however, I am unable to work out how to query the collection with aggregate functions using the proper syntax.
I need to query the email collection for all email attachment sizes and sum them for today. The customer is asking me to group all the email attachments for just their account for a single day (today). How would I find this?
Below is the output of db.email.findOne():
{
"_id" : ObjectId("55893983e4b0ea8af5a61550"),
"customer_id" : "12345",
"Subject" : "test message",
"Date" : ISODate("2016-08-04T10:48:13Z"),
"headers" : [
"Date: Tue, 23 Jun 2015 12:48:13 +0200 (CEST)",
"From: user#domain.tld",
"Message-ID: <240354118.javamail.email.server.tld>",
"Subject: Cats",
"Content-Type: text/plain; charset=us-ascii",
"Content-Transfer-Encoding: 7bit",
"To: undisclosed-recipients:;",
"X-ClamAV: clean"
],
"text" : "feed the cats please",
"attachments" : [ ],
"langprob" : 0.8301894511454121,
"original_message_file_id" : "239863489r7637208",
"account_id" : "xxx",
"received_time" : ISODate("2015-06-23T10:48:35.097Z"),
"direction" : "inbound",
"state" : "CLOSED",
"encryption_key_id" : null,
"size" : 1651,
"routing_type" : "PUSH",
"priority" : 1,
"closed_time" : ISODate("2015-07-10T21:02:53.409Z")
}
Can anyone please assist me in properly creating a query in JSON syntax to extract the data I need from MongoDB based on my predicates?
Thank you for any help!
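A minimal sketch of one way to approach this, under stated assumptions: the figure wanted is the sum of the top-level `size` field, the account is identified by `customer_id`, and "today" is judged by `received_time`. The sample document's `attachments` array is empty, so per-attachment sizes cannot be inferred from it. `dayBounds` and `buildPipeline` are hypothetical helper names, not part of MongoDB.

```javascript
// Hypothetical helper: start (inclusive) and end (exclusive) of the local day containing `now`.
function dayBounds(now) {
  var start = new Date(now.getFullYear(), now.getMonth(), now.getDate());
  var end = new Date(now.getFullYear(), now.getMonth(), now.getDate() + 1);
  return { start: start, end: end };
}

// Hypothetical helper: builds the aggregation for one customer and one day.
function buildPipeline(customerId, now) {
  var bounds = dayBounds(now);
  return [
    // Restrict to this customer's emails received during the day.
    { $match: {
        customer_id: customerId,
        received_time: { $gte: bounds.start, $lt: bounds.end }
    } },
    // Sum the sizes and count the emails.
    { $group: {
        _id: "$customer_id",
        totalSize: { $sum: "$size" },
        emails: { $sum: 1 }
    } }
  ];
}

// Only runs inside the mongo shell, where `db` is defined.
if (typeof db !== "undefined") {
  db.email.aggregate(buildPipeline("12345", new Date())).forEach(printjson);
}
```

If the sizes actually live inside the `attachments` array, an `$unwind` on that array before the `$group` would be the likely adjustment.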

Mongodb Document size

A quick question please to be sure.
I have the following example of document in my collection guest
"_id" : "JM15061985",
"last_name" : "Michel",
"first_name" : "Justine",
"gender" : "Female",
"title" : "Mme",
"telephone" : 3375,
"mail" : "justine.michel#yahoo.com",
"language" : "French",
"birthday" : ISODate("1985-06-14T22:00:00Z"),
"status" : "VIP",
"company" : "Test",
"address" : [
{
"street" : "45 Avenue de Paris",
"city" : "Nice",
"zip_code" : "06072",
"country" : "France"
},
{
"street" : "12 square xvy",
"city" : "Toulon",
"zip_code" : "83072",
"country" : "France"
},
]
I know that one document in MongoDB can't exceed 16 MB.
So my basic questions are:
What does 16 MB really represent? (Any example, maybe?)
In my example, is each address considered a separate document, or is this only one document?
16MB is the maximum size of the BSON-document that represents your document. This includes nested objects, like your address example, and also key names (not just the values).
There's also some overhead per document property, as explained here.
To check BSON document size for a particular JS object, and if you happen to use Node.js, you can use the bson module:
var BSON = new (require('bson')).BSONPure.BSON();
var bson = BSON.serialize(obj, false, true, false);
console.log('bson size', bson.length);
There should be similar solutions for other programming languages.
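To make the point about key names and nesting concrete, here is a rough size estimator that follows the byte layout in the BSON specification. It is only a sketch for Node.js, not the `bson` module's implementation. It assumes every JavaScript number is stored as an 8-byte double and covers only strings, numbers, booleans, dates, null, arrays and plain objects. The function names are made up for illustration.

```javascript
// Rough BSON size estimate based on the layout in the BSON specification.
// Assumption: every JS number is stored as an 8-byte double.

// Keys are NUL-terminated UTF-8 strings.
function cstringSize(key) {
  return Buffer.byteLength(key, "utf8") + 1;
}

function valueSize(value) {
  if (value === null) return 0;                       // null has no payload
  if (typeof value === "string") {
    return 4 + Buffer.byteLength(value, "utf8") + 1;  // int32 length + bytes + NUL
  }
  if (typeof value === "number") return 8;            // double
  if (typeof value === "boolean") return 1;
  if (value instanceof Date) return 8;                // int64 milliseconds
  if (typeof value === "object") return documentSize(value); // array or embedded document
  throw new Error("type not covered by this sketch: " + typeof value);
}

function documentSize(doc) {
  var size = 4 + 1; // int32 total length + terminating NUL
  // For arrays, Object.keys yields "0", "1", ... which is how BSON keys array items.
  Object.keys(doc).forEach(function (key) {
    size += 1 + cstringSize(key) + valueSize(doc[key]); // type byte + key + value
  });
  return size;
}
```

For instance, `documentSize({ hello: "world" })` returns 22, the same figure as the example in the BSON specification. Each entry in the `address` array adds its own keys and values to the size of the single enclosing guest document.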

Convert a MongoDB with two collections in a neo4j graph

I have finished creating my Mongo database. It is made up of two collections:
1. team
2. coach
Below are examples of the documents contained in these collections.
Here is a team document:
{
"_id" : "Mil.74",
"official_name" : "Associazione Calcio Milan S.p.A",
"common_name" : "Milan",
"country" : "Italy",
"started_by" : {
"day" : 16,
"month" : 12,
"year" : 1899
},
"stadium" : {
"name" : "Giuseppe Meazza",
"capacity" : 81277
},
"palmarès" : {
"Serie A" : 18,
"Serie B" : 2,
"Coppa Italia" : 5,
"Supercoppa Italiana" : 6,
"UEFA Champions League" : 7,
"UEFA Super Cup" : 5,
"Cup Winners cup" : 2,
"UEFA Intercontinental cup" : 4
},
"uniform" : "black and red"
}
This is a coach document:
{
"_id" : ObjectId("556cec3b9262ab4f14165fcd"),
"name" : "Carlo",
"surname" : "Ancelotti",
"age" : 55,
"date_Of_birth" : {
"day" : 10,
"month" : 6,
"year" : 1959
},
"place_Of_birth" : "Reggiolo",
"nationality" : "Italian",
"preferred_formation" : "4-2-3-1",
"coached_Team" : [
{
"team_id" : "RMa.103",
"in_charge" : {
"from" : "26/june/2013",
"to" : "25/may/2015"
},
"matches" : 119
},
{
"team_id" : "PSG.00",
"in_charge" : {
"from" : "30/dec/2011",
"to" : "24/june/2013"
},
"matches" : 77
},
{
"team_id" : "Che.11",
"in_charge" : {
"from" : "01/july/2009",
"to" : "22/may/2011"
},
"matches" : 109
},
{
"team_id" : "Mil.74",
"in_charge" : {
"from" : "07/nov/2001",
"to" : "31/may/2009"
},
"matches" : 420
}
]
}
As you can see, I used a normalized model: every coach has an array of coached teams.
I want to convert this Mongo database into a graph database, specifically Neo4j. My goal is to show that in highly connected domains like this one, Neo4j performs better than Mongo. For example, the query "Find the palmarès of all teams coached by Carlo Ancelotti" requires two queries in Mongo, whereas in Neo4j it is enough to follow relationships.
I found this guide on the forum that uses Gremlin to convert a Mongo collection of documents into a Neo4j graph automatically. The problem is that the guide covers just one collection.
So, is it possible to automatically generate the Neo4j graph from my Mongo database (with two collections), or must I create the graph "by hand"?
Gremlin is a domain-specific language for working with graphs, but it is based on Groovy, so you effectively have all the flexibility you need to do whatever you want. In other words, what you can do with one MongoDB collection you can easily do with two (or however many collections you have). That was the point of the blog post referenced in one of the other answers:
http://thinkaurelius.com/2013/02/04/polyglot-persistence-and-query-with-gremlin/
Gremlin is a great language for transforming data into graph form, whatever its source format is. I would think that you would first load all of your teams as vertices then iterate through your coaches, creating coach vertices and edges to their related teams as you go.
I would also add that nothing is "automatic" about Gremlin. It's not as though you tell Gremlin that you have data in MongoDB and it turns it into a graph. You have to write Gremlin to tell it how you want your MongoDB data turned into a graph.
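The shape of that transformation (teams first as vertices, then coaches as vertices with an edge per coached team) can be sketched in plain JavaScript. This is not Gremlin and does not talk to Neo4j. The function name `toGraph`, the labels `team`, `coach` and `COACHED`, and the output structure are all made up for illustration. Only the field names come from the sample documents.

```javascript
// Plain-JavaScript illustration (not Gremlin) of turning the two collections
// into vertices and edges. Labels and output shape are hypothetical.
function toGraph(teams, coaches) {
  var vertices = [];
  var edges = [];

  // Step 1: every team document becomes a vertex keyed by its _id.
  teams.forEach(function (team) {
    vertices.push({ id: String(team._id), label: "team", properties: team });
  });

  // Step 2: every coach becomes a vertex, plus one edge per coached team.
  coaches.forEach(function (coach) {
    var coachId = String(coach._id);
    vertices.push({
      id: coachId,
      label: "coach",
      properties: { name: coach.name, surname: coach.surname }
    });
    (coach.coached_Team || []).forEach(function (stint) {
      edges.push({
        from: coachId,
        to: stint.team_id,
        label: "COACHED",
        properties: { matches: stint.matches, in_charge: stint.in_charge }
      });
    });
  });

  return { vertices: vertices, edges: edges };
}
```

With a structure like this, the "palmarès of all teams coached by Carlo Ancelotti" question becomes a matter of following the `COACHED` edges from one coach vertex to the team vertices.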

MongoDB shard key for emails collection

I'm using MongoDB 2.6.1
I have a collection that stores emails, project-wise. The documents are as follows (I haven't included the 'Raw Email Text' key, for readability):
{
"_id" : ObjectId("540d4ae7eea013be22f1f0d6"),
"Project_Id" : "E11593",
"Project_Name" : "National Hearing Care- Novo",
"Email_Id" : "E11593.monitor#lntinfotech.com",
"Date" : "Mon Sep 08 05:05:35 IST 2014",
"To" : "manisha.bhopate#infostretch.com; ",
"From" : "Shubhangi Thorat",
"CC" : "NO VALUES",
"Subject" : "RE: pics",
"Unique_Id" : "Mon-Sep-08-11:51:20-IST-2014"
}
{
"_id" : ObjectId("540d4ae7eea013be22f1f0d7"),
"Project_Id" : "E11593",
"Project_Name" : "National Hearing Care- Novo",
"Email_Id" : "E11593.monitor#lntinfotech.com",
"Date" : "Mon Sep 08 05:02:38 IST 2014",
"To" : "manisha.bhopate#infostretch.com; ",
"From" : "Shubhangi Thorat",
"CC" : "NO VALUES",
"Subject" : "FW: pics",
"Unique_Id" : "Mon-Sep-08-11:51:20-IST-2014"
}
{
"_id" : ObjectId("540d4ae7eea013be22f1f0d8"),
"Project_Id" : "E11593",
"Project_Name" : "National Hearing Care- Novo",
"Email_Id" : "E11593.monitor#lntinfotech.com",
"Date" : "Mon Sep 08 04:37:47 IST 2014",
"To" : "Prachi Sutrawe; ",
"From" : "Mahindra Shambharkar",
"CC" : "NO VALUES",
"Subject" : "Accepted: Show and tell -Sale",
"Unique_Id" : "Mon-Sep-08-11:51:20-IST-2014"
}
I had the following thoughts in mind when selecting the shard key:
A compound index {Project_Id, _id}, since Project_Id has low cardinality but _id has high cardinality
A hashed index on 'Date' / 'Unique_Id', which are both timestamps
A hashed index on the 'From' field, but its cardinality depends on the number of people involved in the project
'To' and 'CC' are multi-value keys and 'Subject' has high randomness, so I'm not sure whether these keys can be used at all
While not listed in the output, 'Raw_Text' will be read extensively by different applications, but I'm not sure whether an index should be built on it, or whether it should be used for sharding at all
What would be the optimal shard key in this case?
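For reference, the first two candidates in the list could be written as shard key documents like this. It is only a sketch of the syntax, not a recommendation of either key. The namespace `mydb.emails` is hypothetical, and the `typeof sh` guard only lets the snippet load outside the mongo shell. A collection can be sharded on one key only, so the snippet applies just the compound candidate.

```javascript
// Hypothetical database and collection names; substitute the real ones.
var database = "mydb";
var namespace = database + ".emails";

// Candidate shard keys from the list above, written as key documents.
var candidates = {
  compound: { Project_Id: 1, _id: 1 }, // low-cardinality prefix + high-cardinality suffix
  hashedDate: { Date: "hashed" }       // hashed single-field key
};

// Only runs inside a mongo shell connected to a mongos, where `sh` is defined.
// A non-empty collection needs an index matching the key before it can be sharded.
if (typeof sh !== "undefined") {
  sh.enableSharding(database);
  sh.shardCollection(namespace, candidates.compound);
}
```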

Want to merge two collections in MongoDB using map reduce

I have two collections, shown below; each product holds a reference to a user. I search for products by name, and in return I want the combined output of product and user, using the map reduce method.
user collection
{
"_id" : ObjectId("52ac5dd1fb670c2007000000"),
"company" : {
"about" : "This is textile machinery dealer",
"contactAddress" : [{
"address" : "abcd",
"city" : "52ac4bc6fb670c1007000000",
"zipcode" : "39as46as80"
},{
"address" : "abcd",
"city" : "52ac4bc6fb670c1007000000",
"zipcode" : "39as46as80"
}],
"fax" : "58784868",
"mainProducts" : "ads,asd,asd",
"mobileNumber" : "9537236588",
"name" : "krishna steels"
},
"user" : ObjectId("52ac4eb7fb670c0c07000000")
}
product collection
{
"_id" : ObjectId("52ac5722fb670cf806000002"),
"category" : "52a2a9cc48a508b80e00001d",
"deliveryTime" : "10 days after received the ",
"price" : {
"minPrice" : "2000",
"maxPrice" : "3000",
"perUnit" : "5288ac6f7c104203e0976851",
"currency" : "INR"
},
"productName" : "New Mobile Solar Charger with Carabiner",
"rejectReason" : "",
"status" : 1,
"user" : ObjectId("52ac4eb7fb670c0c07000000")
}
This cannot be done. Mongo supports Map Reduce on only one collection at a time. You could try fetching both and merging them in a Java collection. A couple of days ago I solved a similar problem using a Java collection.
Click to see similar response about joins and multi collection not supported in mongo.
This can be done using two map reduces.
You run your first MR, and then you run the second MR with its output reduced into the results of the first.
You shouldn't do this, though. JOINs are not designed to be done through MR; in fact, it sounds like you are trying to run this MR with inline output, which in itself is a very bad idea.
MRs are not designed to run inline to the application.
You would be better off doing the JOIN elsewhere.
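As an illustration of doing the JOIN elsewhere, here is a minimal sketch of an application-side merge. It relies on both sample documents carrying the owner's id in a field called `user`. The collection names `product` and `user`, the search pattern, and the function name `joinProductsWithUsers` are assumptions made for the example.

```javascript
// Pair each product with the user document that shares its `user` id.
function joinProductsWithUsers(products, users) {
  // Index the user documents by their `user` id for constant-time lookup.
  var byUser = {};
  users.forEach(function (u) {
    byUser[String(u.user)] = u;
  });
  // Products whose owner was not fetched get `user: null`.
  return products.map(function (p) {
    return { product: p, user: byUser[String(p.user)] || null };
  });
}

// Only runs inside the mongo shell, where `db` is defined.
// Collection names and the search pattern are assumptions.
if (typeof db !== "undefined") {
  var foundProducts = db.product.find({ productName: /solar charger/i }).toArray();
  var ownerIds = foundProducts.map(function (p) { return p.user; });
  var foundUsers = db.user.find({ user: { $in: ownerIds } }).toArray();
  printjson(joinProductsWithUsers(foundProducts, foundUsers));
}
```

This costs two queries per search, but keeps the merge out of map reduce entirely.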