MongoDB shard key for emails collection - mongodb

I'm using MongoDB 2.6.1.
I have a collection that stores emails, project-wise. The documents look as follows (I haven't included the 'Raw Email Text' key, for readability):
{
"_id" : ObjectId("540d4ae7eea013be22f1f0d6"),
"Project_Id" : "E11593",
"Project_Name" : "National Hearing Care- Novo",
"Email_Id" : "E11593.monitor#lntinfotech.com",
"Date" : "Mon Sep 08 05:05:35 IST 2014",
"To" : "manisha.bhopate#infostretch.com; ",
"From" : "Shubhangi Thorat",
"CC" : "NO VALUES",
"Subject" : "RE: pics",
"Unique_Id" : "Mon-Sep-08-11:51:20-IST-2014"
}
{
"_id" : ObjectId("540d4ae7eea013be22f1f0d7"),
"Project_Id" : "E11593",
"Project_Name" : "National Hearing Care- Novo",
"Email_Id" : "E11593.monitor#lntinfotech.com",
"Date" : "Mon Sep 08 05:02:38 IST 2014",
"To" : "manisha.bhopate#infostretch.com; ",
"From" : "Shubhangi Thorat",
"CC" : "NO VALUES",
"Subject" : "FW: pics",
"Unique_Id" : "Mon-Sep-08-11:51:20-IST-2014"
}
{
"_id" : ObjectId("540d4ae7eea013be22f1f0d8"),
"Project_Id" : "E11593",
"Project_Name" : "National Hearing Care- Novo",
"Email_Id" : "E11593.monitor#lntinfotech.com",
"Date" : "Mon Sep 08 04:37:47 IST 2014",
"To" : "Prachi Sutrawe; ",
"From" : "Mahindra Shambharkar",
"CC" : "NO VALUES",
"Subject" : "Accepted: Show and tell -Sale",
"Unique_Id" : "Mon-Sep-08-11:51:20-IST-2014"
}
I had the following thoughts in mind when selecting the shard key:
Build a compound index {Project_Id, _id}, since Project_Id has low cardinality but _id has high cardinality
A hashed index on 'Date' / 'Unique_Id', which are both timestamps
A hashed index on the 'From' field, but its cardinality depends on the number of people involved in the project
'To' and 'CC' are multi-value keys and 'Subject' has high randomness, so I'm not sure whether these keys can be used at all
While not listed in the output, 'Raw_Text' will be read extensively by different applications, but I'm not sure whether an index should be built on it, let alone used for sharding
What would be the optimal shard key in this case?
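As a rough sketch of how the first two options would be declared, the shell calls might look like the following. The namespace mydb.emails is a hypothetical name (the question does not give one), and this only illustrates the syntax under that assumption; it is not a recommendation for this particular workload. Note also that the sample 'Date' values are stored as strings, not as date objects.

```javascript
// Hypothetical namespace; the question does not name the database or collection.
var ns = "mydb.emails";

// Option 1 from the list: compound shard key. Project_Id keeps a project's
// mail together, and _id adds the cardinality needed to split large projects.
var compoundKey = { Project_Id: 1, _id: 1 };

// Option 2 from the list: hashed shard key on a single field. Writes are
// spread evenly, but range queries on that field go to every shard.
var hashedKey = { Date: "hashed" };

// Only runs inside a mongo shell connected to a mongos, where `sh` exists.
// For a non-empty collection, an index starting with the shard key
// must already exist before shardCollection is called.
if (typeof sh !== "undefined") {
  sh.enableSharding("mydb");
  sh.shardCollection(ns, compoundKey);
}
```

A collection has a single shard key, so only one of the two key documents would be passed to sh.shardCollection.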

Related

In MongoDB, after creating a text index, a query using the text filter shows no output

In Mongodb, I have created the collection as follows.
db.test.insertMany([
{CustomerKey : "11026", FirstName : "Harold", LastName : "Sai", BirthDate : new Date("1951-10-1"),MaritalStatus : "S", Gender : "M", EmailAddress : "harold3#adventure-works.com", YearlyIncome : 30000, TotalChildren : 2, NumberChildrenAtHome : 0, EnglishEducation : "Partial College", EnglishOccupation : "Clerical", NumberCarsOwned : 2, AddressLine1 : {House_No : 2596, Area_Name: "Franklin Canyon Road"}, Phone : "1 (11) 500 555-0131", DateFirstPurchase : new Date("2011-10-1"), CommuteDistance : "1-2 Miles"} ,
{CustomerKey : "11027", FirstName : "Jessie", LastName : "Zhao", BirthDate : new Date("1952-6-5"),MaritalStatus : "M", Gender : "M", EmailAddress : "jessie16#adventure-works.com", YearlyIncome : 30000, TotalChildren : 2, NumberChildrenAtHome : 0, EnglishEducation : "Partial College", EnglishOccupation : "Clerical", NumberCarsOwned : 2, AddressLine1 : {House_No : 8211, Area_Name: "Leeds Ct."}, Phone : "1 (11) 500 555-0184", DateFirstPurchase : new Date("2011-6-1"), CommuteDistance : "5-10 Miles"} ,
{CustomerKey : "11028", FirstName : "Jill", LastName : "Jimenez", BirthDate : new Date("1951-10-9"),MaritalStatus : "M", Gender : "F", EmailAddress : "jill13#adventure-works.com", YearlyIncome : 30000, TotalChildren : 2, NumberChildrenAtHome : 0, EnglishEducation : "Partial College", EnglishOccupation : "Clerical", NumberCarsOwned : 2, AddressLine1 : {House_No : 213, Area_Name: "Valencia Place"}, Phone : "1 (11) 500 555-0116", DateFirstPurchase : new Date("2011-10-1"), CommuteDistance : "1-2 Miles"} ,
]);
The following is the output of the query (an EmailAddress containing harold is available):
I have set the "EmailAddress" field as a text index.
db.test.createIndex({EmailAddress : "text"})
But when I query using the following code, the text filter returns no output.
db.test.find({$text:{$search:"harold"}})
What you are looking for is
db.test.find({"EmailAddress":{"$regex":"harold"}})
since you are looking for some sort of pattern match.
A text index stores the field in a tokenised form: it removes stop words, replaces words with their stems, etc.
You can read more about it here: https://docs.mongodb.com/manual/text-search/#-text-operator
The regex operator and its index use: https://docs.mongodb.com/manual/reference/operator/query/regex/
Text indexes do not support partial word matches. They are expected to find the whole word in a sentence. In your example, harold is treated as part of the word harold3#adventure-works.com, so you are attempting a partial word match. Consider the following document as a test case...
db.test.insert({
"CustomerKey" : "11026",
"FirstName" : "Harold",
"LastName" : "Sai",
"BirthDate" : ISODate("1951-10-01T00:00:00Z"),
"MaritalStatus" : "S",
"Gender" : "M",
"EmailAddress" : "harold is at harold3#adventure-works.com",
"YearlyIncome" : 30000,
"TotalChildren" : 2,
"NumberChildrenAtHome" : 0,
"EnglishEducation" : "Partial College",
"EnglishOccupation" : "Clerical",
"NumberCarsOwned" : 2,
"AddressLine1" : {
"House_No" : 2596,
"Area_Name" : "Franklin Canyon Road"
},
"Phone" : "1 (11) 500 555-0131",
"DateFirstPurchase" : ISODate("2011-10-01T00:00:00Z"),
"CommuteDistance" : "1-2 Miles"
})
... now, your original query will find it because the whole word harold is found in the field EmailAddress.
While text indexes do not support partial word matches, they do allow word stemming. For example, if you search on run, it will find running.
Another option is to use MongoDB Atlas. Atlas supports Apache Lucene based search indexes which provide partial and fuzzy match capabilities.
For another reference to a similar SO article see MongoDB Full and Partial Text Search.
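To make the difference concrete, here is a small sketch of the regex approach on the sample addresses. The plain JavaScript part only illustrates what the pattern matches; the guarded part shows the corresponding shell query and assumes the test collection from the question.

```javascript
// Sample values copied from the question.
const emails = [
  "harold3#adventure-works.com",
  "jessie16#adventure-works.com",
  "jill13#adventure-works.com",
];

// An anchored, case-sensitive prefix pattern. Unlike $text, it matches
// part of a word, and it can make use of an ordinary index on the field.
const prefix = /^harold/;

// `matched` ends up holding only the harold3 address.
const matched = emails.filter((e) => prefix.test(e));

// Only runs inside the mongo shell, where `db` exists.
if (typeof db !== "undefined") {
  db.test.createIndex({ EmailAddress: 1 }); // ordinary (non-text) index
  db.test.find({ EmailAddress: { $regex: prefix } }).forEach(printjson);
}
```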

Best way to create index for MongoDB

I have records stored in a MongoDB collection for customers and their transactions, in the format below:
{
"_id" : ObjectId("59b6992a0b54c9c4a5434088"),
"Results" : {
"id" : "2139623696",
"member_joined_date" : ISODate("2010-07-07T00:00:00.000+0000"),
"account_activation_date" : ISODate("2010-07-07T00:00:00.000+0000"),
"family_name" : "XYZ",
"given_name" : "KOKI HOI",
"gender" : "Female",
"dob" : ISODate("1967-07-20T00:00:00.000+0000"),
"preflanguage" : "en-GB",
"title" : "MR",
"contact_no" : "60193551626",
"email" : "abc123#xmail.com",
"street1" : "address line 1",
"street2" : "address line 2",
"street3" : "address line 3",
"zipcd" : "123456",
"city" : "xyz",
"countrycd" : "Malaysia",
"Transaction" : [
{
"txncd" : "411",
"txndate" : ISODate("2017-08-02 00:00:00.000000"),
"prcs_date" : ISODate("2017-08-02 00:00:00.000000"),
"txn_descp" : "Some MALL : SHOP & FLY FREE",
"merchant_id" : "6587867dsfd",
"orig_pts" : "0.00000",
"text" : "Some text"
}
]
}
}
I want to create indexes on the fields "txn_descp", "txndate", "member_joined_date", "gender" and "dob" for faster access. Can someone help me create the indexes for this document? I would appreciate any kind of help and suggestions.
When creating an index, there are a few things to keep in mind.
Always create indexes for the queries you actually run.
Prefer compound indexes whenever possible.
The first field in the index should be the one with the fewest distinct values. For example, for an index with gender and dob as keys, it is better to have {gender:1,dob:1}.
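As an illustration of these points, the fields from the question could be indexed roughly as follows. The collection name customers is hypothetical, and which compound indexes make sense depends on the actual queries, which the question does not show. Fields nested under Results are addressed with dot notation, and indexing a field inside the Transaction array produces a multikey index.

```javascript
// Index specifications built from the field names in the question.
const indexSpecs = [
  // Compound index for queries that filter on gender and then dob.
  { "Results.gender": 1, "Results.dob": 1 },
  // Single-field index on the join date.
  { "Results.member_joined_date": 1 },
  // Multikey indexes: Transaction is an array of subdocuments.
  { "Results.Transaction.txndate": 1 },
  { "Results.Transaction.txn_descp": 1 },
];

// Only runs inside the mongo shell, where `db` exists.
// `customers` is an assumed collection name.
if (typeof db !== "undefined") {
  indexSpecs.forEach((spec) => db.customers.createIndex(spec));
}
```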

Sum Emails Sizes From MongoDB Collection

I have been asked to perform a very basic query task in MongoDB; however, I am unable to work out how to query the collection with aggregate functions using the proper syntax.
I need to query the email collection for all email attachment sizes and sum them for today. The customer is asking me to group all the email attachments for just their account for a single day (today). How would I do this?
Below is the output of db.email.findOne():
{
"_id" : ObjectId("55893983e4b0ea8af5a61550"),
"customer_id" : "12345",
"Subject" : "test message",
"Date" : ISODate("2016-08-04T10:48:13Z"),
"headers" : [
"Date: Tue, 23 Jun 2015 12:48:13 +0200 (CEST)",
"From: user#domain.tld",
"Message-ID: <240354118.javamail.email.server.tld>",
"Subject: Cats",
"Content-Type: text/plain; charset=us-ascii",
"Content-Transfer-Encoding: 7bit",
"To: undisclosed-recipients:;",
"X-ClamAV: clean"
],
"text" : "feed the cats please",
"attachments" : [ ],
"langprob" : 0.8301894511454121,
"original_message_file_id" : "239863489r7637208",
"account_id" : "xxx",
"received_time" : ISODate("2015-06-23T10:48:35.097Z"),
"direction" : "inbound",
"state" : "CLOSED",
"encryption_key_id" : null,
"size" : 1651,
"routing_type" : "PUSH",
"priority" : 1,
"closed_time" : ISODate("2015-07-10T21:02:53.409Z")
}
Can anyone please assist me in properly creating a query in JSON syntax to extract the data I need from MongoDB based on my predicates?
Thank you for any help!
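One possible shape for such a query is sketched below. It assumes that the top-level size field is the value to be summed (the sample document has an empty attachments array, so the layout of individual attachment entries is unknown), that "today" is measured in UTC against the Date field, and that the customer is identified by customer_id. The helper name buildDailySizePipeline is made up for this example.

```javascript
// Builds an aggregation pipeline that sums `size` for one customer
// over the UTC day containing `now`.
function buildDailySizePipeline(customerId, now) {
  const start = new Date(
    Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate())
  );
  const end = new Date(start.getTime() + 24 * 60 * 60 * 1000);
  return [
    // Keep only this customer's emails dated within [start, end).
    { $match: { customer_id: customerId, Date: { $gte: start, $lt: end } } },
    // Sum the sizes and count the emails for that customer.
    {
      $group: {
        _id: "$customer_id",
        totalSize: { $sum: "$size" },
        emails: { $sum: 1 },
      },
    },
  ];
}

const pipeline = buildDailySizePipeline("12345", new Date());

// Only runs inside the mongo shell, where `db` exists.
if (typeof db !== "undefined") {
  db.email.aggregate(pipeline).forEach(printjson);
}
```

If the day should be based on arrival time instead, received_time could be substituted for Date in the $match stage.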

rename year field in $project in MongoDB

I am trying to rename my ID field in the project phase but I have an error message. The $match and $sort phases work fine. Here are the details:
db.complaints.aggregate([
{$match:{$text:{$search:"\"loan\""}}},
{$group:{"_id":{Year:{$substr: ["$received", 0, 4]}}, "loan":{$sum:1}}},
{$sort:{_id:-1}},
{$project:{_id:0, "Year":"_id.Year", "loan":1}}
])
Here is my schema:
> db.complaints.findOne()
{
"_id" : ObjectId("55e5990d991312e2c9b266e3"),
"complaintID" : 1388734,
"product" : "mortgage",
"subProduct" : "conventional adjustable mortgage (arm)",
"issue" : "loan servicing, payments, escrow account",
"subIssue" : "",
"state" : "va",
"ZIP" : 22204,
"submitted" : "web",
"received" : "2015-05-22",
"sent" : "2015-05-22",
"company" : "green tree servicing, llc",
"response" : "closed with explanation",
"timely" : "yes",
"disputed" : ""
}
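A likely cause is that "_id.Year" in the $project stage is a plain string rather than a field path; field paths need a leading $. A sketch of the pipeline with only that value changed, otherwise as in the question:

```javascript
const pipeline = [
  { $match: { $text: { $search: "\"loan\"" } } },
  {
    $group: {
      _id: { Year: { $substr: ["$received", 0, 4] } },
      loan: { $sum: 1 },
    },
  },
  { $sort: { _id: -1 } },
  // "$_id.Year" (with the leading $) refers to the value of the grouped field.
  { $project: { _id: 0, Year: "$_id.Year", loan: 1 } },
];

// Only runs inside the mongo shell, where `db` exists.
if (typeof db !== "undefined") {
  db.complaints.aggregate(pipeline).forEach(printjson);
}
```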

Mongo DB Aggregates

I'm stuck on an aggregation problem in MongoDB. The data structure that I'm dealing with looks like this:
{
"_id" : ObjectId("4f16fe11d1e2d32371072aa0"),
"body" : " \nHi Kate, per our discussion on yesterday about the $15.00 flat fee on Tom's \nand Mark's deals, here is Bloomberg's response. Please pass this info to all \nof our traders. Please let me know what the response is from them.\n\nThanks\n\n\n---------------------- Forwarded by Evelyn Metoyer/Corp/Enron on 04/17/2001 \n02:34 PM ---------------------------\n\n\n\"PAUL CALLAHAN, BLOOMBERG/ NEW YORK\" <PCALLAHAN2#bloomberg.net> on 04/17/2001 \n02:28:57 PM\nTo: Evelyn.Metoyer#enron.com\ncc: \n\nSubject: Commission\n\n\nEvelyn, as of April 16, 2001 our charge for Spot trades is a flat fee of\n$15/trade.\n\n\n",
"filename" : "3272.",
"headers" : {
"Content-Transfer-Encoding" : "7bit",
"Content-Type" : "text/plain; charset=us-ascii",
"Date" : ISODate("2001-04-17T14:33:00Z"),
"From" : "evelyn.metoyer#enron.com",
"Message-ID" : "<33504483.1075841847839.JavaMail.evans#thyme>",
"Mime-Version" : "1.0",
"Subject" : "Commission for Bloomberg",
"To" : [
"kate.symes#enron.com"
],
"X-FileName" : "kate symes 6-27-02.nsf",
"X-Folder" : "\\kate symes 6-27-02\\Notes Folders\\Discussion threads",
"X-From" : "Evelyn Metoyer",
"X-Origin" : "SYMES-K",
"X-To" : "Kate Symes",
"X-bcc" : "",
"X-cc" : ""
},
"mailbox" : "symes-k",
"subFolder" : "discussion_threads"
}
There are 120477 records in the database. I'm supposed to find out the pair of people who tend to communicate most (and 2nd most) with each other. The query that I've written is as follows:
db.messages.aggregate([
{$project:{From:"$headers.From",To:"$headers.To",_id:0}},
{$unwind:"$headers.To"},
{$group:{_id:{From:"$From",To:"$To"},number:{$sum:1}}},
{$limit:3},
{$sort:{number:-1}}
]);
but it somehow does not work.
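Two things in this pipeline look like probable causes. After the $project stage the recipient array is named To, so $unwind should reference "$To" rather than "$headers.To"; and $limit runs before $sort, so three arbitrary groups are sorted instead of the three largest being selected. A sketch with both points changed (it counts directed From/To pairs, as the original query does):

```javascript
const pipeline = [
  // Keep only sender and recipients, renamed to top-level fields.
  { $project: { From: "$headers.From", To: "$headers.To", _id: 0 } },
  // One document per recipient; the array is now called To.
  { $unwind: "$To" },
  // Count messages per (sender, recipient) pair.
  { $group: { _id: { From: "$From", To: "$To" }, number: { $sum: 1 } } },
  // Sort first, then keep the top three pairs.
  { $sort: { number: -1 } },
  { $limit: 3 },
];

// Only runs inside the mongo shell, where `db` exists.
if (typeof db !== "undefined") {
  db.messages.aggregate(pipeline).forEach(printjson);
}
```

Since this treats A to B and B to A as separate pairs, an extra step that orders the two addresses consistently would be needed if direction should not matter.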