MongoDB: remove documents older than a period of time, but with no date attribute

We're trying to remove documents older than 3 months from a specific collection.
There's no TTL configured on this collection and no single date/time attribute on those documents.
How can I remove those old documents anyway? Is there a script I could run to do this automatically?
Thanks

Assuming you did not generate your own _id field, the ObjectId contains a timestamp. From the docs:
The 12-byte ObjectId value consists of: ...
a 4-byte timestamp value, representing the ObjectId's creation, measured in seconds since the Unix epoch
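You can see this directly in the shell, since getTimestamp reads those four timestamp bytes back out:
// any auto-generated _id carries its creation time
const id = new ObjectId();
id.getTimestamp()   // returns a Date close to "now"

// works the same on a stored document's _id:
// db.collection.findOne()._id.getTimestamp()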
So if you're using Mongo version 4.0+, you could use $toDate, match the documents, and overwrite the current collection using $out:
db.collection.aggregate([
  {
    $addFields: {
      shouldKeep: {
        $lt: [
          // subtracting two dates yields the difference in milliseconds
          { $subtract: ["$$NOW", { $toDate: "$_id" }] },
          7776000000 // 90 days in milliseconds
        ]
      }
    }
  },
  { $match: { shouldKeep: true } },
  { $project: { shouldKeep: 0 } },
  { $out: "curr_collection" }
])
Mind you, this is a POC example; it does not deal with many issues, such as timezones and exact month starts and endings (currently it just calculates 90 days back), and more.
Not to mention that using $out on a large collection carries a lot of overhead.
My recommendation is to paginate the results and do this in code.
For Node.js, for example, you can use ObjectId's getTimestamp method, like so (a minimal runnable sketch; `collection` is assumed to be a driver collection handle):
const cutoff = new Date();
cutoff.setMonth(cutoff.getMonth() - 3); // three months ago
// inside an async function, with `collection` from the driver:
for await (const doc of collection.find()) {
  if (doc._id.getTimestamp() < cutoff) {  // creation time embedded in _id
    await collection.deleteOne({ _id: doc._id });
  }
}
Now in code you can handle timezones, month start dates and scale issues relatively easily.
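For instance, here is a sketch of a more scalable variant in Node.js. Assumptions: the official mongodb driver, a `collection` handle, and a cutoff pinned to the start of the month three months back; ObjectId.createFromTime builds an _id whose embedded timestamp is the cutoff, so a single range delete replaces the per-document loop:
const { ObjectId } = require("mongodb");

async function removeOldDocuments(collection) {
  // cutoff: start of the month, three months back, in UTC
  const now = new Date();
  const cutoff = new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth() - 3, 1));

  // _id values sort by their leading timestamp bytes, so every
  // document created before the cutoff has an _id below cutoffId
  const cutoffId = ObjectId.createFromTime(Math.floor(cutoff.getTime() / 1000));
  return collection.deleteMany({ _id: { $lt: cutoffId } });
}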

Related

$match comparing one date field to another in MongoDB aggregate query

I have a field called file_date, and at the start of the query I've set a new field called start_date, which is 10 days back, using:
{ $set: { start_date: { $subtract: [ "$$NOW", 10*24*60*60*1000 ] } } }
I want to use $match to find all the results where file_date is bigger than start_date.
I've tried using $gte but I can't seem to get the right syntax.
I'm using Mongo 4.2, so I can't use $dateSubtract.
Thanks for any help
Change the dates to strings using $toString, then compare the strings using $gte.
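A minimal sketch of that approach (the collection name is a placeholder; $expr is what lets $match compare two document fields, and ISO-8601 date strings compare correctly with $gte):
db.collection.aggregate([
  { $set: { start_date: { $subtract: ["$$NOW", 10 * 24 * 60 * 60 * 1000] } } },
  {
    $match: {
      $expr: {
        $gte: [
          { $toString: "$file_date" },   // ISO string, e.g. "2021-03-01T..."
          { $toString: "$start_date" }
        ]
      }
    }
  }
])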

Mongo 4.2 query based on date

I need to query Mongo using the find function; I can't use the aggregate function.
My documents are like this:
{
  "name": "Tom",
  "priDate": ISODate("2010-04-11T00:00:00.000Z")
}
The query I would like to make is:
Find all documents where ("priDate" + 1 year) is lte today.
Is it possible to do this without using an aggregation query? I can't use the field value in find...
The query that I would need, I think, would be something like this one I made:
db.system.profile.find({
  "priDate": {
    $gte: new Date(ITSELF + 1 year??),
    $lt: new Date()
  }
})
Can you help me?
Many thanks, I'm going crazy :)
See if this works:
db.collection.find({
  $expr: {
    $lte: ['$priDate', { $subtract: ['$$NOW', 31536000000] }]
  }
})
https://mongoplayground.net/p/QJ3BbHTQlgh
Adding "1 year" can be difficult because of leap years or daylight saving time.
I suggest moment.js; then the solution would be:
db.system.profile.find(
  {
    priDate: {
      $gte: moment().add(1, "year").toDate(),
      $lt: moment().toDate()
    }
  }
)
However, priDate >= "today + 1 year" AND priDate < "today" is not possible; change the condition according to your needs.
MongoDB stores dates as milliseconds since epoch, so you can advance a date one year by adding the number of milliseconds in a year using $add inside $expr, then test with $lte:
db.system.profile.find({
  $expr: {
    $lte: [{ $add: ["$priDate", 31536000000] }, new ISODate()]
  }
})
Note that this will be off by a day around leap year unless you adjust the number of milliseconds for that.
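One way to sidestep that drift is to do the calendar arithmetic client-side, since "priDate + 1 year <= today" is equivalent to "priDate <= today - 1 year" (a sketch; no $expr is needed because the cutoff is a plain value):
// shift the year directly instead of adding a fixed millisecond count
var oneYearAgo = new Date();
oneYearAgo.setFullYear(oneYearAgo.getFullYear() - 1);

db.system.profile.find({ priDate: { $lte: oneYearAgo } })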
You expressed the constraint that you cannot use aggregate, but with $expr in Mongo 3.6 onwards, you can use any and all aggregation operators in a find query as well.
https://docs.mongodb.com/manual/reference/operator/query/expr/#definition

How to convert BSON Timestamp from Mongo changestream to a date?

I am getting started with change streams in Mongo. In my current setup, a Stitch function inserts the changelog events into a revision collection. However, when I read data from the collection, I can't convert the Timestamp fields. I have tried the following two attempts:
1) A pipeline
[
  {
    $match: { 'documentKey._id': _id },
  },
  {
    $sort: { _id: -1 },
  },
  {
    $addFields: {
      convertedDate: { $toDate: 'clusterTime' },
    },
  },
]
But it gives the error: Error parsing date string 'clusterTime'; 0: passing a time zone identifier as part of the string is not allowed 'c'; 6: Double timezone specification 'r'
2) The bson Timestamp class
import { Timestamp } from 'bson';
const asTimestampInstance = new Timestamp(v.clusterTime);
But here typescript gives me the error: Expected 2 arguments, but got 1.ts(2554)
index.d.ts(210, 30): An argument for 'high' was not provided.
In Atlas, the clusterTime correctly looks like a timestamp.
I hope that I am just missing something simple :)
Unfortunately, $toDate doesn't work with timestamps directly, at least not in v4.0.
The argument should be either a number, a string, or an ObjectId.
You need to convert the Timestamp to a string first:
$addFields: {
  convertedDate: { $toDate: { $dateToString: { date: "$clusterTime" } } }
}
2) The bson Timestamp class
You should take the high 32-bit value from BSON's Timestamp class instance, which is the epoch time in seconds, multiply it by 1000 to get milliseconds, and pass that to the JS Date constructor.
If v is a document from the change stream, v.clusterTime is a BSON Timestamp class object, so you should write:
const date = new Date(v.clusterTime.getHighBits() * 1000);
This example worked for me on MongoDB 4.0, ODM Mongoose 5.12, Node.js 12.

Compute Simple Moving Average in Mongo Shell

I am developing a financial application with Node.js. I wonder whether it would be possible to compute a simple moving average, i.e. the average price over the last N days, directly in the Mongo shell rather than reading the data and computing it in Node.js.
Document sample:
[
  { code: '0001', price: 0.10, date: '2014-07-04T00:00:00.000Z' },
  { code: '0001', price: 0.12, date: '2014-07-05T00:00:00.000Z' },
  { code: '0001', price: 0.13, date: '2014-07-06T00:00:00.000Z' },
  { code: '0001', price: 0.12, date: '2014-07-07T00:00:00.000Z' }
]
If you have more than a trivial number of documents, you should use the DB server to do the work rather than JS.
You don't say whether you are using Mongoose or the Node driver directly. I'll assume you are using Mongoose, as that is the way most people are headed.
So your model would be:
// models/stocks.js
const mongoose = require("mongoose");
const conn = mongoose.createConnection('mongodb://localhost/stocksdb');

const StockSchema = new mongoose.Schema(
  {
    price: Number,
    code: String,
    date: Date,
  },
  { timestamps: true }
);

module.exports = conn.model("Stock", StockSchema, "stocks");
You rightly suggested that the aggregation framework would be a good way to go here. First, though: if we are dealing with returning values between date ranges, the records in your database need to be date objects; from your example documents, it looks like you may have stored strings. An example of inserting documents with dates would be:
db.stocks.insertMany([
  { code: '0001', price: 0.10, date: ISODate('2014-07-04T00:00:00.000Z') },
  { code: '0001', price: 0.12, date: ISODate('2014-07-05T00:00:00.000Z') },
  { code: '0001', price: 0.13, date: ISODate('2014-07-06T00:00:00.000Z') },
  { code: '0001', price: 0.12, date: ISODate('2014-07-07T00:00:00.000Z') }
])
The aggregation pipeline function accepts an array with one or more pipeline stages.
The first pipeline stage we should use is $match (see the $match docs); this filters the documents down to only the records we are interested in, which is important for performance:
{
  $match: {
    date: {
      $gte: new Date('2014-07-03'),
      $lte: new Date('2014-07-07')
    }
  }
}
This stage will send only the documents dated the 3rd to the 7th of July 2014 inclusive to the next stage (in this case, all the example docs).
The next stage is where you can get an average. We need to group the values together based on one field, multiple fields, or all fields.
As you don't specify a field you want to average over, I'll give an example for all fields. For this we use $group (see the $group docs):
{
  $group: {
    _id: null,
    avg: { $avg: '$price' }
  }
}
This will take all the documents and display an average of all the prices.
In the case of your example documents this results in
{ _id: null, avg: 0.1175 }
Check the answer:
(0.10 + 0.12 + 0.12 + 0.13) / 4 = 0.1175
FYI: I wouldn't rely on calculations done in JavaScript for anything critical, as JavaScript Numbers use floating point. See https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html for more details if you are worried about that.
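For a quick illustration of that caveat:
// classic binary floating-point surprise:
0.1 + 0.2            // 0.30000000000000004
0.1 + 0.2 === 0.3    // false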
For completeness, here is the full aggregation query:
const Stock = require("./models/stocks");

Stock.aggregate([
  {
    $match: {
      date: {
        $gte: new Date('2014-07-03'),
        $lte: new Date('2014-07-07')
      }
    }
  },
  {
    $group: {
      _id: null,
      avg: { $avg: '$price' }
    }
  }
])
  .then(console.log)
  .catch(error => console.error(error))
Not sure about your moving average formula, but here is how I would do it:
var moving_average = null
db.test.find().forEach(function(doc) {
  if (moving_average == null) {
    moving_average = doc.price;
  } else {
    moving_average = (moving_average + doc.price) / 2;
  }
})
output:
> moving_average
0.3
And if you want to define the N days to average over, just modify the argument to find (note that both conditions must live under a single "date" key; otherwise the second overwrites the first):
db.test.find({ "date": { $gt: "2014-07-07T00:00:00.000Z", $lt: "2014-07-10T00:00:00.000Z" } })
And if you want to do the above shell code in one line, you can assume that moving_average is undefined and just check for that before assigning the first value.
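A sketch of that variant (the same running pairwise average, seeded from undefined):
var moving_average; // undefined until the first document
db.test.find().forEach(function(doc) {
  moving_average = (moving_average === undefined)
    ? doc.price
    : (moving_average + doc.price) / 2;
})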

MongoDB query slow on very large collection - retrieve distinct documents based on field

I am having some problems making my MongoDB queries more efficient. My documents have the following format:
{
  _id: UUID,
  p: 'some.path',
  t: 'Fri Dec 12 2014 09:26:18 GMT+0100 (CET)',
  v: 123.4
}
where path and timestamp are indexed. The path is an identifier in our system, of which we have around 15,000-16,000 distinct values. Our logging application is writing these values at around 3000 documents per second, and this is running fine without problems.
My problem is when retrieving a "snapshot in time" of these values: I want a query that tells me what all the signals are at a given time. That means, for each path, get the latest value whose timestamp is less than or equal to the requested timestamp, and return all of these documents.
So in SQL/pseudocode:
results = []
for each distinct path:
    find document where p == path and t <= timestamp, order by t desc, limit 1
    results.push(result)
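As a concrete sketch of that loop in the mongo shell (the collection name `values` and the `timestamp` variable are assumptions; note it issues one indexed query per path):
var results = [];
db.values.distinct("p").forEach(function(path) {
  // latest value for this path at or before the requested timestamp
  var cursor = db.values.find({ p: path, t: { $lte: timestamp } })
                        .sort({ t: -1 })
                        .limit(1);
  if (cursor.hasNext()) results.push(cursor.next());
});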
I have also gotten the same results by using aggregate with $match, $sort, and $group:
aggregate([
  { $match: { t: { $lte: timestamp } } },
  { $sort: { t: -1 } },
  {
    $group: {
      _id: { x: "$p" },
      p: { $last: "$p" },
      t: { $last: "$t" },
      v: { $last: "$v" }
    }
  }
])
However, with my collection currently clocking in at around 250 million records, both of these methods are very slow. I am really stuck on this and can't seem to get it to work. If I cannot find a solution to this problem, I will probably have to start looking at other databases that might support these types of queries better.