Get count of documents without loading the whole collection in DerbyJS 0.6 - derbyjs

How can I count the results of a query without loading the whole resultset into memory?
The easy way of counting documents returned by a query would be:
var q = model.query('mycollection', { date: today });
q.fetch(function() {
var length = q.get().length;
});
But this would load the whole result set into memory and "count" an array in JavaScript. When you have lots of data, you don't want to do this. I think.
Counting the underlying MongoDB collection directly is rather complicated, since LiveDB (I think it is LiveDB) creates several MongoDB documents for one DerbyJS document.
The internets point to this Google Groups thread from 2013, but the solution described there (putting $count: true into the query options) doesn't seem to work in DerbyJS 0.6 and current MongoDB:
query.extraRef is undefined.

It is done as described in the Google Groups thread, but query.extraRef is now called query.refExtra.
Example:
var q = model.query('mycollection', { $count: true, date: today });
q.refExtra('_page.docsOfToday');
q.fetch();

Related

Mongoose mongodb modify data before returning with pagination

I am fetching data with Mongoose and would like to modify the data, e.g. apply some date formats. Currently I have:
const count = await UserModel.countDocuments();
const rows = await UserModel.find({ name:{$regex: search, $options: 'i'}, status:10 })
.sort([["updated_at", -1]])
.skip(page * perPage)
.limit(perPage)
.exec();
res.json({ count, rows });
The above UserModel is a Mongoose model.
I would like to modify some of the objects, like applying date formats, before the data is returned, while still paginating as above.
Currently I have added the following, which works, but I have to loop through all rows, which will be a performance nightmare for large data:
res.json({ count, rows:rows.map(el=>({...el,created_at:'format date here'})) });
Is there a better option?
As far as I understood your question: if you need to apply some date formats before showing data on the frontend, you just need to pass the retrieved date through a date-formatting library before displaying it, like in JS:
const d = new Date("2015-03-25T12:00:00Z");
However, if you want to get the date in formatted form from the database, then you must format it before storing it. I hope that answers your question.
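For instance, a minimal sketch of client-side formatting using the built-in Intl.DateTimeFormat (the locale, options, and UTC time zone here are just examples):

```javascript
// Format an ISO timestamp for display on the client.
// Intl.DateTimeFormat is built into modern JavaScript runtimes.
function formatDate(isoString) {
  return new Intl.DateTimeFormat('en-US', {
    year: 'numeric',
    month: '2-digit',
    day: '2-digit',
    timeZone: 'UTC',
  }).format(new Date(isoString));
}

console.log(formatDate('2015-03-25T12:00:00Z')); // "03/25/2015"
```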
I think the warning from @Fabian Strathaus in the comments is an important consideration. I would strongly recommend making sure that the approach you are pursuing sets you up for success overall, as opposed to introducing new pain points elsewhere in your project.
Assuming that you want to do this, an alternative approach is to ask the database to do this directly. More specifically, the $dateToString operator sounds like it could be of use here. This playground example demonstrates the basic behavior by adding a formatted date field which will be returned directly from the database. It takes the following document:
{
_id: 1,
created_at: ISODate("2022-01-15T08:15:39.736Z")
}
We then execute this sample aggregation:
db.collection.aggregate([
{
"$addFields": {
created_at_formatted: {
$dateToString: {
format: "%m/%d/%Y",
date: "$created_at"
}
}
}
}
])
The document that gets returned is:
{
"_id": 1,
"created_at": ISODate("2022-01-15T08:15:39.736Z"),
"created_at_formatted": "01/15/2022"
}
You could make use of this in a variety of ways, such as by creating and querying a view which will automatically create and return this formatted field.
I also want to comment on this statement that you made:
Currently I have added the following, which works, but I have to loop through all rows, which will be a performance nightmare for large data.
It's good to hear that you're thinking about performance up front. That said, your query includes a predicate of name:{$regex: search, $options: 'i'}. Unanchored and/or case-insensitive regex filters cannot use indexes efficiently, so if your status predicate is not selective, you may need to look at alternative approaches to filtering on name to make sure the query stays performant.
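If you would rather get the formatted field straight from Mongoose than through a view, the same $dateToString stage fits into an aggregate() call. A sketch that builds such a pipeline, assuming the UserModel, field names, and pagination variables from the question:

```javascript
// Build an aggregation pipeline that filters, adds a formatted
// date field, sorts, and paginates entirely on the database side.
function buildPipeline(search, page, perPage) {
  return [
    { $match: { name: { $regex: search, $options: 'i' }, status: 10 } },
    {
      $addFields: {
        created_at_formatted: {
          $dateToString: { format: '%m/%d/%Y', date: '$created_at' },
        },
      },
    },
    { $sort: { updated_at: -1 } },
    { $skip: page * perPage },
    { $limit: perPage },
  ];
}

// Usage (assumes the UserModel from the question):
// const rows = await UserModel.aggregate(buildPipeline(search, page, perPage));
```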

MongoDB big collection aggregation is slow

I'm having a problem with the time of my MongoDB query, from a Node backend using Mongoose. I have a collection called people that has 10M records; every record is queried from the backend and inserted from another part of the system that's written in C++ and needs to be very fast.
This is my Mongoose schema:
{
_id: {type: String, index: {unique: true}}, // We generate our own _id! Might it be related to the slowness?
age: { type: Number },
id_num: { type: String },
friends: { type: Object }
}
schema.index({'id_num': 1}, { unique: true, collation: { locale: 'en_US', strength: 2 } })
schema.index({'age': 1})
schema.index({'id_num': 'text'});
Friends is an object that looks like this: {"Adam": true, "Eve": true, ... etc.}.
There's no meaning to the values; we use a dictionary to deduplicate quickly in C++.
Also, we didn't find a set/unique-list type of field in MongoDB.
The Problem:
We display people in a table with pagination. The table supports sorting, searching, and selecting the number of results.
At first, I queried all people and searched, sorted and paged them in JS. But when there are a lot of documents, that becomes problematic (memory problems).
The next thing I did was to try to push those manipulations (searching, sorting & paging) into the query itself.
I used Mongo's text search, but it doesn't match partial words. Is there any way to search for a partial, case-insensitive string? (I prefer not to use regex, to avoid unexpected problems.)
I have to sort before paging, so I tried to use Mongo's sort. The problem is that when the user wants to sort by "Friends", we want to return the people sorted by their number of friends (the number of entries in the object).
The only way I succeeded in pulling it off was using $addFields in an aggregation:
{$addFields: {friends_count: {$size: {$ifNull: [{$objectToArray: '$friends'}, [] ]}}}}
This addition takes forever! When sorting by friends, the query takes about 40s for 8M people; without this part it takes less than a second.
I used limit and skip for pagination. It works OK, but we have to wait until the user requests the second page and then run another very long query.
In the end, this is the interesting code part:
const { sortBy, sortDesc, search, page, itemsPerPage } = req.query
// Search never matches partial string
const match = search ? {$text: {$search: search}} : {}
const sortByInDB = ['age', 'id_num']
let sort = {$sort : {}}
const aggregate = [{$match: match}]
// if sortBy is on a simple type, we just use mongos sort
// else, we sortBy friends, and add a friends_count field.
if(sortByInDB.includes(sortBy)){
sort.$sort[sortBy] = sortDesc === 'true' ? -1 : 1
} else {
sort.$sort[sortBy+'_count'] = sortDesc === 'true' ? -1 : 1
// The problematic part of the query:
aggregate.push({$addFields: {friends_count: {$size: {
$ifNull: [{$objectToArray: '$friends'},[]]
}}}})
}
const numItems = parseInt(itemsPerPage)
const numPage = parseInt(page)
aggregate.push(sort, {$skip: (numPage - 1)*numItems}, {$limit: numItems})
// Takes a long time (when sorting by "friends")
let users = await User.aggregate(aggregate)
I tried indexing all simple fields, but the time is still too long.
The only other solution I could think of is making Mongo calculate a "friends_count" field every time a document is created or updated, but I have no idea how to do that without slowing down the C++ code that writes to the DB.
Do you have any creative idea to help me? I'm lost, and I have to shorten the time drastically.
Thank you!
P.S.: some useful information: the C++ side writes the people to the DB in bulk once in a while. We can sync once in a while and mostly rely on the data being accurate. So, if that gives any of you an idea for a performance boost, I'd love to hear it.
Thanks!
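One way to realize the precomputed "friends_count" idea without slowing the C++ writer is to derive the count from the friends object during the periodic bulk sync mentioned in the P.S., and then index it so sorting becomes an index walk. A minimal sketch (the helper name and sync flow are assumptions, not part of the question's code):

```javascript
// Derive friends_count from the friends object so that sorting
// can use a plain index instead of $objectToArray + $size per query.
function withFriendsCount(person) {
  return {
    ...person,
    friends_count: person.friends ? Object.keys(person.friends).length : 0,
  };
}

// During the periodic bulk write or a sync pass:
// const docs = batch.map(withFriendsCount);
// await User.insertMany(docs);
// ...and create the index once: schema.index({ friends_count: 1 })
```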

Iterating over MongoDB collection to duplicate all documents is painfully slow

I have a collection of 7,000,000 documents (each of perhaps 1-2 KB BSON) in a MongoDB collection that I would like to duplicate, modifying one field. The field is a string with a numeric value, and I would like to increment the field by 1.
Following this approach, I did the following from the Mongo shell:
> var all = db.my_collection.find()
> all.forEach(function(it) {
... it._id = 0; // to force mongo to create a new objectId
... it.field = (parseInt(it.field) + 1).toString();
... db.my_collection.insert(it);
... })
Executing the above code took an extremely long time; at first I thought the code was broken somehow, but from a separate terminal I checked the status of the collection about an hour later and found the process was still running, and there were now 7,000,001 documents! I checked to find that, sure enough, there was exactly one new document matching the incremented field.
For context, I'm running a 2015 MBP with 4 cores and 16 GB RAM. I see mongo near the top of my CPU usage, averaging about 85%.
1) Am I missing a bulk modify/update capability in MongoDB?
2) Any reason why the above operation would work, yet so slowly that it inserts documents at a rate of roughly one per hour?
Try the db.collection.mapReduce() way:
NB: A single emit can only hold half of MongoDB’s maximum BSON document size.
var mapFunction1 = function() {
emit(ObjectId(), (parseInt(this.field) + 1).toString());
};
MongoDB will not call the reduce function for a key that has only a single value.
var reduceFunction1 = function(id, field) {
return field;
};
Finally,
db.my_collection.mapReduce(
mapFunction1,
reduceFunction1,
{"out":"my_collection"} //Replaces the entire content; consider merge
)
I'm embarrassed to say that I was mistaken in believing that this line:
... it._id = 0; // to force mongo to create a new objectId
actually forces Mongo to create a new ObjectId. It does not. Instead, I needed to be explicit:
... it._id = ObjectId();
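On question 1), a batched alternative: instead of one insert() per cursor document, accumulate copies and write them with insertMany(). A sketch in Node.js driver style (the batch size is arbitrary, and the driver loop is shown commented since it needs a live connection):

```javascript
// Copy a document with its string field incremented, dropping _id
// so MongoDB generates a fresh ObjectId on insert.
function incrementField(doc) {
  const copy = { ...doc };
  delete copy._id;
  copy.field = (parseInt(copy.field, 10) + 1).toString();
  return copy;
}

// async function duplicateAll(coll, batchSize = 1000) {
//   let batch = [];
//   for await (const doc of coll.find()) {
//     batch.push(incrementField(doc));
//     if (batch.length >= batchSize) {
//       await coll.insertMany(batch);
//       batch = [];
//     }
//   }
//   if (batch.length) await coll.insertMany(batch);
// }
```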

How to get the count of multiple queries on one collection in only one query?

I have a collection called Students.
Documents have a field house which contains a string value, for example:
'Gryffindor', 'Hufflepuff', 'Ravenclaw', 'Slytherin'
I am displaying the count of the students in each house.
For that I'm doing something like this:
G = Students.find({ house: 'Gryffindor' }).count();
H = Students.find({ house: 'Hufflepuff' }).count();
R = Students.find({ house: 'Ravenclaw' }).count();
S = Students.find({ house: 'Slytherin' }).count();
and displaying them.
Is there a way this could be done in a single query?
You can do it in a single query using mongo aggregation.
Students.aggregate([
{
$group : {
_id : "$house",
count: { $sum: 1 }
}
}
])
If you already have all the data locally, you could do it with an aggregation, either the Mongo one (as Volodymyr pointed out) or underscore/lodash's countBy method.
On the server this always works because all the data is present, but for this to work on the client (displaying the actual count), you would need all the documents published, which might not be best practice and can even be impossible for large applications.
In that case, you should probably use a method (which will return non-reactive counts), create a separate counts collection and keep it up to date, or even use the low-level Meteor publication API to publish a reactive count.
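For display, note that the aggregation returns one document per house, e.g. { _id: 'Gryffindor', count: 120 }. A small sketch of collapsing that array into a lookup object (the sample numbers are made up):

```javascript
// Turn the $group output array into a { house: count } map.
function countsByHouse(groups) {
  const out = {};
  for (const g of groups) out[g._id] = g.count;
  return out;
}

// countsByHouse([
//   { _id: 'Gryffindor', count: 120 },
//   { _id: 'Slytherin', count: 95 },
// ]) → { Gryffindor: 120, Slytherin: 95 }
```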

how do I do 'not-in' operation in mongodb?

I have two collections: shoppers (everyone in the shop on a given day) and beach-goers (everyone on the beach on a given day). There are entries for each day, and a person can be on the beach, shopping, doing both, or doing neither on any day. I now want to query for all shoppers in the last 7 days who did not go to the beach.
I am new to Mongo, so it might be that my schema design is not appropriate for NoSQL DBs. I saw similar questions around joins, and in most cases it was suggested to denormalize. So one solution I could think of is to create a collection activity, indexed on date, embedding the actions of each user. Something like:
{
  user_id,
  date,
  actions: [action_type, ...]
}
Insertion now becomes costly, though, as I will have to query before inserting.
A few suggestions.
Figure out all the queries you'll be running, and all the types of data you will need to store. For example, do you expect to add activities in the future or will beach and shop be all?
Consider how many writes vs. reads you will have and which has to be faster.
Determine how your documents will grow over time to make sure your schema is scalable in the long term.
Here is one possible approach, if you will only have these two activities ever. One record per user per day.
{ user: "user1",
date: "2012-12-01",
shopped: 0,
beached: 1
}
Now your query becomes even simpler, whether you have two or ten activities.
When a new activity comes in, you always have to update the correct record for it.
If you were thinking you could just append a record to your collection indicating user, date, activity then your inserts are much faster but your queries now have to do a LOT of work querying for both users, dates and activities.
With proposed schema, here is the insert/update statement:
db.coll.update({"user":"username", "date": "somedate"}, {$inc: {"shopped": 1}}, true)
What that's saying is: "for username on somedate, increment their shopped attribute by 1, and create the record if it doesn't exist", aka "upsert" (that's the last true argument).
Here is the query for all users on a particular day who beached more than once but didn't shop at all:
db.coll.find({"date":"somedate","shopped":0,"beached":{$gt:1}})
Be wary of picking a schema where a single document can have continuous and unbounded growth.
For example, storing everything in a users collection where the array of dates and activities keeps growing will run into this problem. See the highlighted section here for explanation of this - and keep in mind that large documents will keep getting into your working data set and if they are huge and have a lot of useless (old) data in them, that will hurt the performance of your application, as will fragmentation of data on disk.
Remember, you don't have to put all the data into a single collection. It may be best to have a users collection with a fixed set of attributes, where you track how many friends they have or other semi-stable information about them, and also a user_activity collection where you add a record per user per day recording what activities they did. The amount of normalizing or denormalizing of your data is very tightly coupled to the types of queries you will run on it, which is why figuring out what those are is the first suggestion I made.
Insertion now becomes costly, as now I will have to query before insert.
Keep in mind that even with an RDBMS, insertion can be (relatively) costly when there are indices on the table (i.e., usually). I don't think using embedded documents in Mongo is much different in this respect.
For the query, as Asya Kamsky suggests, you can use the $nin operator to find everyone who didn't go to the beach. E.g.:
db.people.find({
actions: { $nin: ["beach"] }
});
Using embedded documents probably isn't the best approach in this case though. I think the best would be to have a "flat" activities collection with documents like this:
{
user_id
date
action
}
Then you could run a query like this:
var start = new Date(2012, 5, 27);
var end = new Date(2012, 6, 3);
db.activities.find({
date: {$gte: start, $lt: end },
action: { $in: ["beach", "shopping" ] }
});
The last step would be on your client driver, to find user ids where records exist for "shopping", but not for "beach" activities.
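That client-side step is a simple set difference over the returned records. A sketch, assuming the flat { user_id, date, action } documents above:

```javascript
// From flat activity records, collect user_ids that shopped
// but never appear in a "beach" record.
function shoppedNotBeached(activities) {
  const beach = new Set();
  const shopping = new Set();
  for (const a of activities) {
    if (a.action === 'beach') beach.add(a.user_id);
    if (a.action === 'shopping') shopping.add(a.user_id);
  }
  return [...shopping].filter((id) => !beach.has(id));
}
```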
One possible structure is to use an embedded array of documents (a users collection):
{
user_id: 1234,
actions: [
{ action_type: "beach", date: "6/1/2012" },
{ action_type: "shopping", date: "6/2/2012" }
]
},
{ another user }
Then you can do a query like this, using $elemMatch to find users matching certain criteria (in this case, people who went shopping in the last three days):
var start = new Date(2012, 6, 1);
db.people.find( {
actions : {
$elemMatch : {
action_type : { $in: ["shopping"] },
date : { $gt : start }
}
}
});
Expanding on this, you can use the $and operator to find all people who went shopping but did not go to the beach in the past three days:
var start = new Date(2012, 6, 1);
db.people.find( {
  $and: [
    { actions: {
        $elemMatch: {
          action_type: { $in: ["shopping"] },
          date: { $gt: start }
        }
      } },
    { actions: {
        $not: {
          $elemMatch: {
            action_type: { $in: ["beach"] },
            date: { $gt: start }
          }
        }
      } }
  ]
});