I am looking for an efficient solution for a case where I have to handle thousands of records and save them to a MongoDB database.
I can't lose data, but a delay in saving is acceptable.
From what I've read, I'm considering two options:
1. Use Redis to cache the records and bulk insert them with a cron job.
2. Use a task queue (for example RabbitMQ or Google's task queue) to buffer the records and bulk insert them with a cron job. Is there a way to make consumption conditional on the buffer reaching length x, or on the time passed since the last delivery (like a cron job)? (A sketch of this idea appears after the schema below.)
I am using Node.js and MongoDB.
tracker schema:
let trackSchema = {
  threadId: string,
  updated: number,  // timestamp
  created: number,  // timestamp
  count: number,    // number of opens
  trace: [
    {
      created: number,  // timestamp
      userAgent: string
    }
  ]
}
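A rough sketch of the size/time-based flush idea (not from the original question; MAX_BUFFER, FLUSH_INTERVAL_MS and the collection name are illustrative, and an in-memory buffer is lost if the process crashes, which is exactly why Redis or a queue is being considered):

const { MongoClient } = require('mongodb');

const MAX_BUFFER = 1000;         // flush once this many records are buffered
const FLUSH_INTERVAL_MS = 5000;  // ...or once this much time has passed

let buffer = [];

async function main() {
  const client = await MongoClient.connect(process.env.MONGO_URL);
  const tracks = client.db().collection('tracks');

  async function flush() {
    if (buffer.length === 0) return;
    const batch = buffer;
    buffer = [];
    // ordered:false keeps inserting the rest of the batch if one document fails
    await tracks.insertMany(batch, { ordered: false });
  }

  // call this wherever a record comes in; flushes early when the buffer is full
  function track(record) {
    buffer.push(record);
    if (buffer.length >= MAX_BUFFER) flush().catch(console.error);
  }

  // time-based flush, so no record waits longer than FLUSH_INTERVAL_MS
  setInterval(() => flush().catch(console.error), FLUSH_INTERVAL_MS);

  return track;
}

main().catch(console.error);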
I have a server running an IPython controller and 12 IPython engines. I connect to the controller from my laptop using SSH. I submitted some jobs to the controller using the load-balanced view interface (in non-blocking mode) and stored the message IDs in the AsyncResult object returned by the apply_async() method.
I accidentally lost the message IDs for the jobs and wanted to know if there's a way to retrieve the job IDs (or the results) from the Hub database. I use a SQLite database for the Hub, and I can get the rc.db_query() method to work, but I don't know what to look for.
Does anyone know how to query the Hub database only for message IDs of the jobs I submitted? What's the easiest way of retrieving the job results from the Hub, if I don't have access to the AsyncHubResult object (or their message IDs)?
Thanks!
Without the message IDs, you might have a pretty hard time finding the right tasks, unless not many tasks have been submitted.
The querying is based on MongoDB (it's a passthrough when you use MongoDB, and a subset of simple operators is implemented for SQLite).
Quick summary: a query is a dict. If you use literal values, they are equality tests, but you can use dict values for comparison operators.
You can search by date for any of the timestamps:
submitted: arrived at the controller
started: arrived on an engine
completed: finished on the engine
For instance, to find tasks submitted yesterday:
from datetime import date, time, timedelta, datetime
# round to midnight
today = datetime.combine(date.today(), time())
yesterday = today - timedelta(days=1)
rc.db_query({'submitted': {
    '$lt': today,      # less than midnight last night
    '$gt': yesterday,  # greater than midnight the night before
}})
or all tasks submitted 1-4 hours ago:
found = rc.db_query({'submitted': {
    '$lt': datetime.now() - timedelta(hours=1),
    '$gt': datetime.now() - timedelta(hours=4),
}})
With the results of that, you can look at keys like client_uuid to retrieve all messages submitted by a given client instance (e.g. a single notebook or script):
client_uuid = found[0]['client_uuid']
all_from_client = rc.db_query({'client_uuid': client_uuid})
Since you are only interested in the message IDs at this point, you can specify keys=['msg_id'] to retrieve only those. We can then use these msg_ids to get all the results produced by a single client session:
# construct list of msg_ids
msg_ids = [ r['msg_id'] for r in rc.db_query({'client_uuid': client_uuid}, keys=['msg_id']) ]
# use client.get_result to retrieve the actual results:
results = rc.get_result(msg_ids)
At this point, you have all of the results, but you have lost the association of which results came from which execution. There isn't a lot of info to help you out there, but you might be able to tell by type, timestamps, or perhaps select the 9 final items from a given session.
I have a MongoDB collection with tasks. Each task has an interval in seconds, a task identifier, and a payload that should be sent via HTTP POST to gather results and store them in another collection.
There may be thousands of tasks with different intervals, and I cannot figure out how to schedule them.
Currently I'm using simple polling by last execution time every 10 ms, but it produces a heavy load on the DB.
It looks like this:
mongo.MongoClient.connect(MONGO_URL, (err, db) ->
  handle_error(err)
  schedule = (collection) ->
    collection.find({isEnabled: true, '$where': '((new Date()).getTime() - this.timestamp) > (this.checkInterval * 60 * 1000)'}).toArray((err, docs) ->
      handle_error(err)
      for i, doc of docs
        collection.update({_id: doc._id}, {'$set': {timestamp: (new Date()).getTime()}}, {w: 1})
        task = prepare(doc)
        request.post({url: url, formData: {task: JSON.stringify(prepare(doc))}}, (err, httpResponse, body) ->
          result = JSON.parse(body)
          console.log(result)
          db.collection(MONGO_COLLECTION_RESULTS).save({
            task: result.id,
            type: result.type,
            data: result
          })
        )
      setTimeout((() -> schedule(collection)), 10)
    )
  setTimeout((() -> schedule(db.collection(MONGO_COLLECTION_TASKS))), 10)
)
Tasks can be added, updated, and deleted, and I have to handle that.
What about using Redis? But I have no clue how to keep the data in sync between Mongo and Redis when some tasks are waiting for results, an interval has changed, etc.
Please advise on the best strategy for this.
I don't think this is the right way to solve your use case.
I would suggest not storing the tasks in a database at all, but scheduling them directly when they come in and saving the result, with or without the original task information.
Why not use Quartz to schedule the task?
If you know the tasks to be run, you can schedule them with the unix crontab, which runs a script that connects to the DB or sends HTTP requests.
If each task is unique, and you cannot pre-schedule them that way, perhaps you can use your current db collections, but not poll the db that often.
If it is not critical that the tasks are executed at exactly the right time, I would do a DB lookup maybe once every 10 seconds to see what tasks should have been executed since the last lookup.
One way of reducing the DB load would be to make a query that fetches tasks ordered by when they should be executed, including all tasks due within the next minute or so. Then you have (hopefully) a small number of tasks in memory and can set a JavaScript timeout for when each should be run. If too many tasks are due at the same time, fetching them all from the DB at once could be problematic.
The essence is to batch several tasks from the db into memory, and handle some of the scheduling there.
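A rough sketch of that batching idea (not the asker's code): it assumes each task stores a precomputed nextRunAt date instead of only timestamp plus checkInterval, that tasks is the MongoDB collection of tasks, and that runTask is the existing HTTP POST and result-saving logic from the question.

const LOOKUP_INTERVAL_MS = 10 * 1000;  // DB lookup every 10 seconds
const WINDOW_MS = 60 * 1000;           // pre-load tasks due within the next minute

function scheduleBatch(tasks) {
  const now = Date.now();
  tasks.find({
    isEnabled: true,
    nextRunAt: { $lte: new Date(now + WINDOW_MS) }
  }).sort({ nextRunAt: 1 }).toArray(function (err, docs) {
    if (err) return console.error(err);
    docs.forEach(function (doc) {
      // claim the task right away (like the timestamp update in the question),
      // so the next lookup does not schedule it a second time
      tasks.update(
        { _id: doc._id },
        { $set: { nextRunAt: new Date(doc.nextRunAt.getTime() + doc.checkInterval * 60 * 1000) } },
        function (err) { if (err) console.error(err); }
      );
      // fine-grained timing is handled in memory, not by polling the DB
      const delay = Math.max(0, doc.nextRunAt.getTime() - now);
      setTimeout(function () { runTask(doc); }, delay);
    });
    setTimeout(function () { scheduleBatch(tasks); }, LOOKUP_INTERVAL_MS);
  });
}

This turns the 10 ms polling loop into one query every 10 seconds, while the per-task timing lives in memory.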
I have a collection containing around 100k documents. I want to add an auto incrementing "custom_id" field to my documents, and keep adding my documents by incrementing that field from now on.
What's the best approach for this? I've seen some examples in the official documentation (http://docs.mongodb.org/manual/tutorial/create-an-auto-incrementing-field/), but they only cover adding new documents, not updating an existing collection.
Example code I created based on the link above to increment my counter:
function incrementAndGetNext(counter, callback) {
  counters.findAndModify({
    name: counter
  }, [["_id", 1]], {
    $inc: {
      "count": 1
    }
  }, {
    "new": true
  }, function (err, doc) {
    if (err) return console.log(err);
    callback(doc.value);
  })
}
In the above code, counters is the db.counters collection, and I have this document there:
{_id:"...",name:"post",count:"0"}
Would love to know the best way to do this.
Thank you.
P.S. I'm using the native mongojs driver for JS.
Well, using the link you mentioned, I'd rather use the counters collection approach.
The counters collection approach has some drawbacks, including:
It always generates multiple requests (two): one to get the sequence number, another to do the insertion using the id you got via the sequence.
If you are using MongoDB's sharding features, the document responsible for storing a counter's state may be used a lot, and every request for it will reach the same server.
However, it should be appropriate for most uses.
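If you do go with it, here is a rough sketch (not part of the original answer) of backfilling the existing documents and handling new inserts, reusing incrementAndGetNext from the question and assuming posts is your 100k-document collection; note that $inc needs a numeric count field, so the "0" string in the example counter document would have to be stored as a number:

// backfill: give every existing document without a custom_id the next value
// (only the _id of each missing document is loaded into memory)
function backfillCustomIds(done) {
  posts.find({ custom_id: { $exists: false } }, { _id: 1 }).toArray(function (err, docs) {
    if (err) return done(err);
    (function next(i) {
      if (i >= docs.length) return done();
      incrementAndGetNext("post", function (counterDoc) {
        posts.update({ _id: docs[i]._id },
          { $set: { custom_id: counterDoc.count } },
          function (err) {
            if (err) return done(err);
            next(i + 1);
          });
      });
    })(0);
  });
}

// for every new document, fetch the next value first, then insert with it
function insertPost(post, done) {
  incrementAndGetNext("post", function (counterDoc) {
    post.custom_id = counterDoc.count;
    posts.insert(post, done);
  });
}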
The approach you mentioned ("the optimistic loop") should not break, IMO, and I don't see why you would have a problem with it. However, I would not recommend it. What happens if you execute the code on multiple mongo clients, and one has a lot of latency while the others keep taking IDs? I would not like to run into that kind of problem... Furthermore, there are at least two requests per successful operation, and no upper bound on the number of retries before a success...
This is a sample of my code:
var changesDB = new mongoose.Schema({
  eventId: String,
  date: Date
})
changesDB.index({ title: 1 }, { expireAfterSeconds: 60*60*24*30 });
It works fine, but I need to delete all the files connected to this collection, so I have to catch the expiration event with Node.js.
How can I do that?
As of MongoDB 3.0, there's no callback mechanism of any sort in MongoDB; in particular, there is no such mechanism for TTL indexes. The TTL enforcement is just a background thread that queries every minute for documents that have expired, then deletes them. If you have related data that you need to expire, I'd suggest just mimicking the TTL index's operation in your application, where you can perform whatever extra logic is necessary to clean up related data.
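A rough sketch of what that could look like (illustrative only: Changes is a model compiled from the schema in the question, Files stands in for wherever the connected files live, and expiry is assumed to be based on the date field):

var THIRTY_DAYS_MS = 1000 * 60 * 60 * 24 * 30;

setInterval(function () {
  var cutoff = new Date(Date.now() - THIRTY_DAYS_MS);
  Changes.find({ date: { $lt: cutoff } }, function (err, docs) {
    if (err) return console.error(err);
    docs.forEach(function (doc) {
      // clean up whatever is connected to this change first...
      Files.remove({ eventId: doc.eventId }, function (err) {
        if (err) return console.error(err);
        // ...then remove the expired document itself
        doc.remove(function (err) {
          if (err) console.error(err);
        });
      });
    });
  });
}, 60 * 1000); // roughly once a minute, like MongoDB's own TTL monitor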
Alternatively, you could make all related documents expire at the same time, so they will all be deleted at approximately the same time (within the same TTL pass).
I used to store counters inside the User document, but I often check the existence of Meteor.user() before running some code.
The counters update every second so the code reruns over and over again.
Is creating a separate Counters collection a way to solve this problem?
Counters example:
counters: {
  generatedDocs: {
    total: 482360
  },
  posts: {
    total: 23
  },
  comments: {
    total: 200
  }
}
Yes. If you want to store the counter variable across multiple sessions and/or have it visible for multiple users, you want to make a counter collection in the database and work with that. If you just want counters for one session, then you can store them in a variable on the window object.
If this is for one session, you can use Meteor's Session API. And as Goodword said, if this is across sessions and users, you can store the counters in their own collection. If what you are counting are your own collections, you can also use the cursor count() function, if that fits your use case.
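For illustration, a minimal sketch of the separate-collection approach (names are illustrative, and it assumes autopublish or an equivalent publication so the counter document is visible on the client):

// shared: a dedicated collection, so only code reading these documents reruns
Counters = new Mongo.Collection('counters');

// server: bump a counter whenever something is generated
Meteor.methods({
  'counters.increment': function (name) {
    check(name, String);
    Counters.upsert({ _id: name }, { $inc: { total: 1 } });
  }
});

// client: a helper that reruns only where the counter is actually displayed
Template.stats.helpers({
  generatedDocs: function () {
    var counter = Counters.findOne('generatedDocs');
    return counter ? counter.total : 0;
  }
});

This keeps the every-second updates out of the User document, so code that only checks Meteor.user() no longer reruns each time a counter changes.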