I have a CSV file (name: members) that contains 20,000 IDs, and a MongoDB collection (name: Customers) that contains 40 million documents with ID, phone_Number, etc. fields.
What I want to do is search for those 20,000 IDs (field name: user id) in the 40-million-document collection (field name: id) and, for each match, return the ID and the phone number. Here is the MongoDB query I am currently using:
db.getCollection("Customers").aggregate(
    [
        { "$project" : { "_id" : NumberInt(0), "Customers" : "$$ROOT" } },
        {
            "$lookup" : {
                "from" : "members",
                "localField" : "Customers.id",
                "foreignField" : "user id",
                "as" : "members"
            }
        },
        { "$unwind" : { "path" : "$members", "preserveNullAndEmptyArrays" : false } },
        {
            "$project" : {
                "_id" : NumberInt(0),
                "Customers.id" : "$Customers.id",
                "Customers.phone" : "$Customers.phone"
            }
        }
    ],
    { "allowDiskUse" : true }
);
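For context, what this pipeline computes is essentially a hash join of the two collections on Customers.id = members."user id". A minimal Python sketch of that join, with hypothetical in-memory lists standing in for the two collections:

```python
# Sketch of what the $lookup + $unwind pipeline computes: an inner join
# of Customers.id against members."user id". The data here is made up.

members = [{"user id": 1}, {"user id": 3}]          # stands in for the 20,000 CSV IDs
customers = [
    {"id": 1, "phone": "555-0100"},
    {"id": 2, "phone": "555-0101"},
    {"id": 3, "phone": "555-0102"},
]

# Build a hash set of the small side once. An index on the join key is
# what lets MongoDB avoid rescanning the looked-up collection per document.
wanted = {m["user id"] for m in members}

result = [{"id": c["id"], "phone": c["phone"]}
          for c in customers if c["id"] in wanted]
print(result)  # [{'id': 1, 'phone': '555-0100'}, {'id': 3, 'phone': '555-0102'}]
```

This is only a sketch of the join semantics, not a drop-in replacement; the point is that the small side (20,000 IDs) should drive the lookup, and the join key needs an index.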
I am not very familiar with MongoDB, and the problem is that this query has been running for 5 hours so far and still has not produced any output.
Do you have any suggestions for getting the result as fast as possible?
Are there any ways to speed up the query's performance?
Would you suggest a different query for this job?
Thank you!
Problem
I have a collection hotelreviews_collection containing 1 million documents of reviews with various metadata. I would like to group by the Hotel_Name field and count the number of times each hotel has shown up, but also get the fields "lat", "lng" and "Average_Score" with my query. These three extra fields are the same for each Hotel_Name.
I am doing the queries in R using the mongolite library connected to a local MongoDB.
My Attempt
I have gotten as far as retrieving the Hotel_Names and counting their appearances using the code below, but cannot for the life of me get the other fields to work.
Current Code
overviewData <- M_CONNECTION$aggregate('[{"$group":{"_id":"$Hotel_Name", "count": {"$sum":1}, "average":{"$avg":"$distance"}}}]',
options = '{"allowDiskUse":true}')
I am completely lost on this, any and all help would be greatly appreciated.
I have solved my issue using the following code.
db.getCollection("hotelreviews_collection").aggregate(
[
{
"$group" : {
"_id" : {
"Hotel_Name" : "$Hotel_Name",
"lat" : "$lat",
"lng" : "$lng",
"Average_Score" : "$Average_Score"
},
"COUNT(Hotel_Name)" : {
"$sum" : NumberInt(1)
}
}
},
{
"$project" : {
"Hotel_Name" : "$_id.Hotel_Name",
"lat" : "$_id.lat",
"lng" : "$_id.lng",
"Average_Score" : "$_id.Average_Score",
"COUNT(Hotel_Name)" : "$COUNT(Hotel_Name)",
"_id" : NumberInt(0)
}
}
]
)
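The trick here is folding the constant per-hotel fields into the group key, so they survive the grouping without changing the counts. A minimal Python sketch of the same idea, with hypothetical data:

```python
from collections import Counter

# Hypothetical review documents; lat/lng/Average_Score repeat per hotel.
reviews = [
    {"Hotel_Name": "A", "lat": 1.0, "lng": 2.0, "Average_Score": 8.5},
    {"Hotel_Name": "A", "lat": 1.0, "lng": 2.0, "Average_Score": 8.5},
    {"Hotel_Name": "B", "lat": 3.0, "lng": 4.0, "Average_Score": 7.0},
]

# Grouping on the composite key (Hotel_Name, lat, lng, Average_Score) is
# safe because the extra fields are constant per hotel, so the counts are
# identical to grouping on Hotel_Name alone.
counts = Counter((r["Hotel_Name"], r["lat"], r["lng"], r["Average_Score"])
                 for r in reviews)
rows = [{"Hotel_Name": k[0], "lat": k[1], "lng": k[2],
         "Average_Score": k[3], "count": v}
        for k, v in counts.items()]
```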
I am trying to understand the best data model for a monitoring application.
I have a monitoring application which runs every 30 minutes to get stats from the target system and store the details in MongoDB.
Use case:
Products, Companies
There will be around 2,000 products. Products will be added and removed, but growth will not exceed 10% per month, so I don't expect more than 3,000 in the next year.
Companies are the consumers of each product. There will be 1 to 10 companies using each product, and the consumer count will also go up and down.
So, on each run, we will get a list of products along with the corresponding companies. Product details will look like:
Product:
Product name
Total number (this will give the current number available and will change on every poll)
Product weight
Durability days (might change once in a while)
Companies List - Who are using this product
Sample data for product:
{
"productName" : "Small Box",
"total" : NumberLong(1000),
"weight" : "1.5",
"durability" : "20",
"companies" : [
{
"name" : "Nike",
"taken" : NumberLong(10)
},
{
"name" : "Reebok",
"taken" : NumberLong(20)
}
]
}
Here, taken count will keep changing on each poll.
Web application:
There will be 3 screens to show the details.
Dashboard - shows high-level stats (number of products, number of companies, total size, ...)
Products - list view (to view the complete list); shows the details of a product when one is selected.
Here, I have to show the product details and list the companies consuming it.
Companies - list view (to view the complete list); shows the details of a company when one is selected.
Here, I have to show the company details and all the products it is consuming.
The way I am storing it currently:
Dashboard collection - to show the stats details like total products, total companies, ...
{
    "time" : ISODate("..."),
    "totalProducts" : NumberLong(1000),
    "totalCompanies" : "1.5"
}
Products collection - Will have the following details.
{
"productName" : "Small Box",
"total" : NumberLong(1000),
"weight" : "1.5",
"durability" : "20",
"companies" : [
{
"name" : "Nike",
"taken" : NumberLong(10)
},
{
"name" : "Reebok",
"taken" : NumberLong(20)
}
]
}
Companies collection - will have the following details
{
"companyName" : "Nike",
"products" : [
{
"name" : "Small Box",
"taken" : NumberLong(10)
},
{
"name" : "Medium Box",
"taken" : NumberLong(20)
}
]
}
So, on each run, I generate a unique ID and add it to all the data being stored. I keep only the last 2 weeks of data in these collections; data older than 2 weeks is cleaned out every day.
So, when the user comes to the Dashboard, I sort to get the latest record and show its details. There is only one record per run in the Dashboard collection, covering the last 2 weeks.
When the user comes to the Products screen, I still have to get the latest record from the Dashboard collection to obtain the unique ID, then go to the Products collection to get all the records for that unique ID, as there are around 2,000 records per run. The same goes for the Companies collection.
I always have to show the latest data, and I am hitting 2 different collections whenever the user goes to the Products or Companies screen.
Is there any better approach?
Please check this; it may help you prepare your schema.
Note : Mongo Version : 3.6.5
Products Collection
/* 1 */
{
"_id" : ObjectId("5bb1e270269004e06093e178"),
"productName" : "Small Box",
"total" : NumberLong(1000),
"weight" : "1.5",
"durability" : "20",
"companies" : [
ObjectId("5bb1e2d2269004e06093e17b"),
ObjectId("5bb1e2d8269004e06093e17c")
],
"date" : ISODate("2018-10-01T09:28:40.502Z")
}
/* 2 */
{
"_id" : ObjectId("5bb1e293269004e06093e179"),
"productName" : "Large Box",
"total" : 1000.0,
"weight" : "1.2",
"durability" : "20",
"companies" : [
ObjectId("5bb1e2d8269004e06093e17c"),
ObjectId("5bb1e2de269004e06093e17d")
],
"date" : ISODate("2018-10-01T09:28:40.502Z")
}
/* 3 */
{
"_id" : ObjectId("5bb1e29d269004e06093e17a"),
"productName" : "Medium Box",
"total" : 1000.0,
"weight" : "1.2",
"durability" : "20",
"companies" : [
ObjectId("5bb1e2d2269004e06093e17b"),
ObjectId("5bb1e2d8269004e06093e17c"),
ObjectId("5bb1e2de269004e06093e17d")
],
"date" : ISODate("2018-07-01T09:28:40.502Z")
}
Company collection
/* 1 */
{
"_id" : ObjectId("5bb1e2d2269004e06093e17b"),
"companyName" : "Nike"
}
/* 2 */
{
"_id" : ObjectId("5bb1e2d8269004e06093e17c"),
"companyName" : "Reebok"
}
/* 3 */
{
"_id" : ObjectId("5bb1e2de269004e06093e17d"),
"companyName" : "PUMA"
}
Get a single product with its companies
db.getCollection('products').aggregate([{
$match : { "_id" : ObjectId("5bb1e270269004e06093e178") } },
{ $lookup : {
from : 'company',
foreignField : '_id',
localField : 'companies',
as : 'companies'
}
}
])
All products with company list
db.getCollection('products').aggregate([
{ $lookup : {
from : 'company',
foreignField : '_id',
localField : 'companies',
as : 'companies'
}
}
])
Company by ID and the products it uses
db.getCollection('company').aggregate([{
$match : { "_id" : ObjectId("5bb1e2d2269004e06093e17b") } },
{ $lookup : {
from : 'products',
foreignField : 'companies',
localField : '_id',
as : 'products'
}
}
])
Also, by adding a date field to each product, you can get the last week's data:
db.getCollection('products').aggregate([
    { $match : {
        date : { $gte : new Date(new Date() - 7 * 24 * 60 * 60 * 1000) }
    } },
    { $lookup : {
        from : 'company',
        foreignField : '_id',
        localField : 'companies',
        as : 'companies'
    } }
])
Get latest product with company
db.getCollection('products').aggregate([
{ $sort : { date : -1} },
{ $limit : 1},
{ $lookup : {
from : 'company',
foreignField : '_id',
localField : 'companies',
as : 'companies'
}
}
])
So, On each run, I am generating unique Id and adding this id to all the data being stored. I am keeping only last 2 weeks of data in these collections. Data older than 2 weeks will be cleaned every day.
I wouldn't do it like this. I would use the _id that MongoDB automatically gives you, and make a run object that collects all those objects' IDs, because it's better to have 1 key per object than 672 keys (48 runs per day (every 30 minutes) x 14 days in 2 weeks).
I would make a run object that contains an array of company IDs, product IDs, and a timestamp. And I would make it possible that, if nothing has changed in 30 minutes, you only store the run object's _id (of the first run that was the same) and a timestamp, so you save a lot of space.
Hopefully I understood you correctly, because it was a tough read for me.
In the following model, a product is owned by a customer and cannot be ordered by other customers. So I know that an order by customer 1 can only contain products owned by customer 1.
To give you an idea, here is a simple version of the data model:
Orders:
{
    'customer' : 1,
    'products' : [
        {'productId' : 'a'},
        {'productId' : 'b'}
    ]
}
Products:
{
    'id' : 'a',
    'name' : 'somename',
    'customer' : 1
}
I need to find orders that contain certain products. I know the product ID and the customer ID, and I'm free to add or change indexes on my database.
Now my question is: is it faster to just add a single-field index on the product IDs and query using only that ID, or should I go for a compound index with customer and product ID?
I'm not sure if this matters, but in my real model the list of products is actually a list of objects, each with an amount and a DBRef to the product. The customer is also a DBRef.
Here is a full order object:
{
"_id" : 0,
"_class" : "nl.pfa.myprintforce.models.Order",
"orderNumber" : "e35f1fa8-b4c4-4d53-89c9-66abe94a3553",
"status" : "ERROR",
"created" : ISODate("2017-03-30T11:50:50.292Z"),
"finished" : false,
"orderTime" : ISODate("2017-01-12T12:50:50.292Z"),
"expectedDelivery" : ISODate("2017-03-30T11:50:50.292Z"),
"totalItems" : 19,
"orderItems" : [
{
"amount" : 4,
"product" : {
"$ref" : "product",
"$id" : NumberLong(16)
}
},
{
"amount" : 7,
"product" : {
"$ref" : "product",
"$id" : NumberLong(26)
}
},
{
"amount" : 8,
"product" : {
"$ref" : "product",
"$id" : NumberLong(7)
}
}
],
"stateList" : [
{
"timestamp" : ISODate("2017-03-28T11:50:50.074Z"),
"status" : "NEW",
"message" : ""
},
{
"timestamp" : ISODate("2017-03-29T11:50:50.075Z"),
"status" : "IN_PRODUCTION",
"message" : ""
},
{
"timestamp" : ISODate("2017-03-30T11:50:50.075Z"),
"status" : "ERROR",
"message" : "Something went wrong"
}
],
"customer" : {
"$ref" : "customer",
"$id" : ObjectId("58dcf11a71571a24c475c044")
}
}
When I have the following indexes:
1: {"customer" : 1, "orderItems.product" : 1}
2: {"orderItems.product" : 1}
both count queries (I use count to force all documents to be found without the network transfer):
a: db.getCollection('order').find({
'orderItems.product' : DBRef('product',113)
}).count()
b: db.getCollection('order').find({
'customer' : DBRef('customer',ObjectId("58de009671571a07540a51d5")),
'orderItems.product' : DBRef('product',113)
}).count()
Both run in about the same time of ~0.007 seconds on a set of 200k documents.
When I add 1,000k records for a different customer (and different products), it does not affect the time at all.
An extended explain shows that:
query a just uses index 2;
query b uses index 2 but also considered index 1. Perhaps index intersection is used here?
Because if I drop index 1, the results are:
Query a: 0.007 seconds
Query b: 0.035 seconds (5x as long!)
So my conclusion is that, with the right indexing, both methods work about equally fast. However, if you do not need the compound index for anything else, it's just a waste of space and write speed.
So: a single-field index is better in my case.
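The reasoning above can be sketched in plain Python, with hypothetical in-memory data standing in for the collection and its index: because each product belongs to a single customer, an index on the product ID alone already narrows the candidate set, and the extra customer check filters out almost nothing.

```python
# Hypothetical order documents; each product appears only in orders of
# its owning customer, mirroring the ownership rule in the question.
orders = [
    {"customer": 1, "products": ["a", "b"]},
    {"customer": 1, "products": ["c"]},
    {"customer": 2, "products": ["d"]},
]

# "Index" on product ID: product -> orders containing it.
by_product = {}
for o in orders:
    for p in o["products"]:
        by_product.setdefault(p, []).append(o)

# Query by product alone, then (cheaply) re-check the customer. The
# customer predicate removes nothing here, which is why the compound
# index adds no selectivity in this model.
candidates = by_product.get("a", [])
matches = [o for o in candidates if o["customer"] == 1]
```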
I need to be able to get a count of distinct 'transactions'. The problem I'm having is that .distinct() comes back with an error because the documents are too large.
I'm not familiar with aggregation either.
I need to be able to group by 'agencyID'; as you see below, there are 2 different agencyIDs.
I need to be able to count transactions where the agencyID is 01721487, etc.
db.myCollection.distinct("bookings.transactions").length
This doesn't work, as I need to group by agencyID, and if there are too many results I get an error saying the result is too large.
{
"_id" : ObjectId("5624a610a6e6b53b158b4744"),
"agencyID" : "01721487",
"paxID" : "-530189664",
"bookings" : [
{
"bookingID" : "24232",
"transactions" : [
{
"tranID" : "001",
"invoices" : [
{
"invNum" : "1312",
"type" : "r",
"inv_date" : "20150723",
"inv_time" : "0953",
"inv_val" : -300
}
],
"tranType" : "Fee",
"tranDate" : "20150723",
"tranTime" : "0952",
"opCode" : "admin",
"udf_1" : "j s"
}
],
"acctID" : "acct11",
"agt_id" : "xy"
}
],
"title" : "",
"firstname" : "",
"surname" : "f bar"
}
I've also tried this but it didn't work for me.
Thank you for the sample data.
This is something you could play with:
db.kieron.aggregate([
    { $unwind : "$bookings" },
    { $match : {
        "bookings.transactions" : { $exists : true, $not : { $size : 0 } }
    } },
    { $group : {
        _id : "$agencyID",
        count : { $sum : { $size : "$bookings.transactions" } }
    } }
])
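The unwind-then-group logic can be sanity-checked in plain Python against a cut-down version of the sample document (a hypothetical second document is added to show the grouping):

```python
# Re-implementation of the unwind + group logic, on trimmed-down data.
docs = [
    {"agencyID": "01721487",
     "bookings": [{"bookingID": "24232",
                   "transactions": [{"tranID": "001"}]}]},
    {"agencyID": "99999999",  # hypothetical second agency
     "bookings": [{"bookingID": "1", "transactions": []},
                  {"bookingID": "2",
                   "transactions": [{"tranID": "a"}, {"tranID": "b"}]}]},
]

counts = {}
for doc in docs:
    for booking in doc["bookings"]:              # $unwind : "$bookings"
        txns = booking.get("transactions", [])
        if not txns:                             # $match : non-empty arrays only
            continue
        # $group : sum the inner array sizes per agencyID
        counts[doc["agencyID"]] = counts.get(doc["agencyID"], 0) + len(txns)
print(counts)  # {'01721487': 1, '99999999': 2}
```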
As there is a nested array, we need to unwind it first; then we can check the size of the inner array.
Happy reporting!
I have a collection that stores information about devices, like the following:
/* 1 */
{
"_id" : {
"startDate" : "2012-12-20",
"endDate" : "2012-12-30",
"dimensions" : ["manufacturer", "model"],
"metrics" : ["deviceCount"]
},
"data" : {
"results" : "1"
}
}
/* 2 */
{
"_id" : {
"startDate" : "2012-12-20",
"endDate" : "2012-12-30",
"dimensions" : ["manufacturer", "model"],
"metrics" : ["deviceCount", "noOfUsers"]
},
"data" : {
"results" : "2"
}
}
/* 3 */
{
"_id" : {
"dimensions" : ["manufacturer", "model"],
"metrics" : ["deviceCount", "noOfUsers"]
},
"data" : {
"results" : "3"
}
}
And I am trying to query the documents using the _id field, which will be unique. The problem I am having is when I query for all the different attributes, as in:
db.collection.find({$and: [{"_id.dimensions":{ $all: ["manufacturer","model"], $size: 2}}, {"_id.metrics": { $all:["noOfUsers","deviceCount"], $size: 2}}]});
This matches documents 2 and 3 (I don't care about the order of the attribute values), but I would like to get only document 3 back. How can I say that _id should not have any attributes other than those I specify in the search query?
Please advise. Thanks.
Unfortunately, I think the closest you can get to narrowing your query results to just unordered _id.dimensions and unordered _id.metrics requires you to know the other possible fields in the _id subdocument, e.g. startDate and endDate.
db.collection.find({$and: [
{"_id.dimensions":{ $all: ["manufacturer","model"], $size: 2}},
{"_id.metrics": { $all:["noOfUsers","deviceCount"], $size: 2}},
{"_id.startDate":{$exists:false}},
{"_id.endDate":{$exists:false}}
]});
If you don't know the set of possible fields in _id, then the other possible solution would be to specify the exact _id that you want, eg.
db.collection.find({"_id" : {
"dimensions" : ["manufacturer", "model"],
"metrics" : ["deviceCount", "noOfUsers"]
}})
but this means that the order of _id.dimensions and _id.metrics is significant. This last query does a document match on the exact BSON representation of _id.
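To illustrate why the order matters in this exact-match form: array comparison is position by position, so the stored and queried arrays must agree element for element. A quick Python sketch using the metrics values from the sample documents:

```python
# Array equality is order-sensitive, like BSON array comparison in an
# exact _id document match.
stored  = ["deviceCount", "noOfUsers"]
queried = ["noOfUsers", "deviceCount"]

print(stored == queried)                  # False: exact match would fail
print(sorted(stored) == sorted(queried))  # True: same members, different order
```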