Iterating through database records in node.js - mongodb

I'm looking to learn node.js and MongoDB, which look suitable for something I'd like to make. As a little project to help me learn, I thought I'd copy the "posts" table from a phpBB3 forum I have into a MongoDB collection, so I did something like this, where db is the MongoDB database connection and client is a MySQL database connection:
db.collection('posts', function (err, data) {
    client.query('select * from phpbb_posts', function (err, rs) {
        data.insert(rs);
    });
});
This works OK when I do it on small tables, but my posts table has about 100,000 rows in it, and the query doesn't return even when I leave it running for an hour. I suspect it's trying to load the entire table into memory and then insert it.
So what I would like to do is read a chunk of rows at a time and insert them. However, I can't see how to read a subset of the rows in node.js, and, even more of a problem, I can't see how to iterate through the queries one at a time when I'm only notified via a callback that the previous one has finished.
Any ideas on how best to do this? (I'm looking for solutions using node.js, as I'd like to learn how to solve this kind of problem; I could no doubt do it easily some other way.)

You could try using the async library by caolan. The library implements some async flow-control methods to handle the caveats of the callback-oriented programming style used in node.js.
For your case, the whilst method could work, running LIMIT queries against MySQL and inserting the results into MongoDB.
Example (not tested, as I have no test data available, but I think you'll get the idea):
var async = require('async');

var insertCount = 0;
// set this to the overall record count from mysql
var recordCount = 0;

async.whilst(
    // test condition callback: keep going while rows are left
    function () { return insertCount < recordCount; },
    // actual worker callback: fetch and insert the next chunk of rows
    function (callback) {
        db.collection('posts', function (err, data) {
            client.query('select * from phpbb_posts LIMIT ' + insertCount + ',1000', function (err, rs) {
                data.insert(rs);
                // increment by the number of rows actually fetched
                insertCount += rs.length;
                // trigger the flow callback so whilst re-checks the condition
                callback();
            });
        });
    },
    // finished callback
    function (err) {
        // finished inserting data, maybe check the record count in mongodb here
    }
);
As I already mentioned, this code is just adapted from an example in the async library's README, but maybe it's an option for moving that amount of records from MySQL to MongoDB.
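One piece you would still need is recordCount; a quick, untested sketch of getting it from MySQL before kicking off the loop (reusing the same client connection as above) could look like this:
client.query('select count(*) as total from phpbb_posts', function (err, rows) {
    if (err) throw err;
    recordCount = rows[0].total;
    // start the async.whilst loop from here
});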

Related

Mongoose/Express - problems using findById()

I have a database of activities in MongoDB and had all the basic CRUD operations working fine, but now I'm at the point in developing the front end of the app where I need to do a GET request for a single activity in the database. I have working PUT and DELETE requests for single activities, but for some reason the GET one just isn't playing ball - it's returning an array of objects rather than a single object with that ID.
I'm currently using Postman to make the requests while I iron this problem out. Mongoose version is 5.12.13.
router.get('/:id', async (req, res) => {
try {
const activity = await Activities.findById(req.params.id)
res.json(activity).send()
} catch (error) {
res.send(error.message)
}
})
Then I'm making a request using Postman to http://localhost:5000/api/activities?id=60968e3369052d084cb6abbf (the id here is just one I've copied and pasted from an entry in the database for troubleshooting purposes).
I'm really stumped by this because I can't understand why it's not working! The response I get in Postman is an array of objects, like I said, which seems to be the entire contents of the database rather than just one with the queried ID...
Try calling exec on your findById: findById returns a query, and you need to call exec to execute it.
Without the call to the exec function, your activity variable is a Mongoose query object.
router.get('/:id', async (req, res) => {
try {
const activity = await Activities.findById(req.params.id).exec();
res.json(activity).send()
} catch (error) {
res.send(error.message)
}
});
Docs for findById
https://mongoosejs.com/docs/api.html#model_Model.findById
Edit:
As rightly spotted by Dang, given that your code is inspecting req.params, the URL you're calling needs updating to:
http://localhost:5000/api/activities/60968e3369052d084cb6abbf
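For what it's worth, the query-string URL was almost certainly being matched by a different route (whatever handles a plain GET /api/activities), since ?id=... lands in req.query rather than req.params. A rough, untested sketch of the difference between the two forms (the '/' handler here is hypothetical, not taken from the question):
// GET /api/activities/60968e3369052d084cb6abbf -> the id arrives in req.params.id
router.get('/:id', async (req, res) => {
    const activity = await Activities.findById(req.params.id).exec();
    res.json(activity);
});

// GET /api/activities?id=60968e3369052d084cb6abbf -> the id arrives in req.query.id,
// and the request is routed to the plain '/' handler, which is why the whole
// collection came back.
router.get('/', async (req, res) => {
    const result = req.query.id
        ? await Activities.findById(req.query.id).exec()
        : await Activities.find().exec();
    res.json(result);
});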

How to use Redis and MongoDb together

I need to build a web application to track user activity, and I'm having trouble understanding how I can use Redis to track online user activity and Mongo to store that data for analysis.
I could use just Mongo, but I'm worried about the large number of writes needed to follow what a user is doing. So I was thinking of writing the online data to Redis and moving it into Mongo once it becomes old - by old I mean once the data is no longer relevant to the online view.
I thought about putting a gateway between Redis and Mongo - could that be RabbitMQ?
Any suggestions?
Should I just use Mongo?
Just an example of code I wrote:
Front-end (Angular application / Socket.io):
setInterval(function () {
    socket.emit('visitor-data', {
        referringSite: document.referrer,
        browser: navigator.sayswho,
        os: navigator.platform,
        page: location.pathname
    });
}, 3000);
Back-end ( Node.js/ Socket.io)
socket.on('visitor-data', function (data) {
    visitorsData[socket.id] = data;
});
visitorsData is just an in-memory object, but I need to build a scalable application, so I can't keep storing data this way.
Then I have some functions like this for computing the data:
function computeRefererCounts() {
    var referrerCounts = {};
    for (var key in visitorsData) {
        var referringSite = visitorsData[key].referringSite || '(direct)';
        if (referringSite in referrerCounts) {
            referrerCounts[referringSite]++;
        } else {
            referrerCounts[referringSite] = 1;
        }
    }
    return referrerCounts;
}
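If the raw events did end up in Mongo, I suppose the same referrer count could be computed there instead of in process memory. A rough, untested sketch using the aggregation framework (the visits collection name is just for illustration):
// Group stored visit documents by referringSite, treating missing values as '(direct)'.
db.collection('visits').aggregate([
    { $group: {
        _id: { $ifNull: ['$referringSite', '(direct)'] },
        count: { $sum: 1 }
    }}
]).toArray(function (err, referrerCounts) {
    if (err) throw err;
    // referrerCounts looks like [{ _id: 'http://example.com', count: 42 }, ...]
});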
Just some numbers, I estimated something like:
1 million users per day
15 million activities per day
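To make the Redis idea concrete, this is roughly what I imagine the hot/cold split could look like on the back end (untested sketch; it assumes the ioredis client and a Mongo collection handle called events, neither of which exists in my current code):
var Redis = require('ioredis');
var redis = new Redis(); // assumes a local Redis instance

socket.on('visitor-data', function (data) {
    // Hot path: keep only the latest snapshot per socket in Redis and let it
    // expire after 60 seconds, so the "online" view ages out by itself.
    redis.set('visitor:' + socket.id, JSON.stringify(data), 'EX', 60);

    // Cold path: append the raw event to Mongo for later analysis.
    // `events` is an assumed MongoDB collection handle.
    events.insertOne({
        socketId: socket.id,
        at: new Date(),
        referringSite: data.referringSite,
        browser: data.browser,
        os: data.os,
        page: data.page
    });
});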

Meteor: Increment DB value server side when client views page

I'm trying to do something seemingly simple: update a views counter in MongoDB every time the value is fetched.
For example, I've tried it with this method:
Meteor.methods({
    'messages.get'(messageId) {
        check(messageId, String);
        if (Meteor.isServer) {
            var message = Messages.findOne(
                {_id: messageId}
            );
            var views = message.views;
            // Increment views value
            Messages.update(
                messageId,
                { $set: { views: views++ }}
            );
        }
        return Messages.findOne(
            {_id: messageId}
        );
    },
});
But I can't get it to work the way I intend. For example the if(Meteor.isServer) code is useless because it's not actually executed on the server.
Also the value doesn't seem to be available after findOne is called, so it's likely async but findOne has no callback feature.
I don't want clients to control this part, which is why I'm trying to do it server side, but it needs to execute every time the client fetches the value. That sounds hard, since the client has already subscribed to the data.
Edit: This is the updated method after reading the answers here.
'messages.get'(messageId) {
    check(messageId, String);
    Messages.update(
        messageId,
        { $inc: { views: 1 }}
    );
    return Messages.findOne(
        {_id: messageId}
    );
},
For example the if(Meteor.isServer) code is useless because it's not
actually executed on the server.
Meteor methods are always executed on the server. You can call them from the client (with callback) but the execution happens server side.
Also the value doesn't seem to be available after findOne is called,
so it's likely async but findOne has no callback feature.
You don't need to call it twice. See the code below:
Meteor.methods({
    'messages.get'(messageId) {
        check(messageId, String);
        var message = Messages.findOne({_id: messageId});
        if (message) {
            // Increment views value on current doc
            message.views++;
            // Update by current doc
            Messages.update(messageId, { $set: { views: message.views }});
        }
        // return current doc or null if not found
        return message;
    },
});
You can call that from your client like this:
Meteor.call('messages.get', 'myMessageId01234', function (err, res) {
    if (err || !res) {
        // handle err; if res is empty, there is no message found
    }
    console.log(res); // your message
});
Two additions here:
You may want to split messages and views into separate collections for the sake of scalability and encapsulation of data. If your publication does not restrict the published fields, then a client who asks for messages also receives the view count. This may work for now, but on a larger scale it may violate some (future) access rules. A sketch of that split follows below.
views++ means:
Use the current value of views, i.e. build the modifier with the current (unmodified) value.
Then increment the variable, which is not useful in your case because you do not use it for anything else afterwards.
Avoid the increment operators if you are not clear on exactly how they work.
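A minimal sketch of that split (the Views collection and its field names are assumptions, not part of the original code):
// Server side: a separate, non-published collection just for counters.
const Views = new Mongo.Collection('views');

Meteor.methods({
    'messages.get'(messageId) {
        check(messageId, String);
        // $inc is atomic, so there is no read-modify-write round trip;
        // upsert creates the counter document on the first view.
        Views.upsert({ messageId: messageId }, { $inc: { count: 1 } });
        return Messages.findOne({ _id: messageId });
    },
});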
Why not just use Mongo's $inc operator, which avoids having to retrieve the previous value at all?

Neo4j: Create nodes via CYPHER/REST slow

I'm trying to create/update nodes via the REST API with Cypher's MERGE statement. Each node has attributes of about 1 kB in total. I create/update 1 node per request. (I know there are other ways to create lots of nodes in a batch, but that is not the question here.)
I use Neo4j Community 2.1.6 on Windows Server 2008 R2 Enterprise (24 CPUs, 64 GB), and the database directory resides on a SAN drive. I get a rate of 4-6 nodes per second; in other words, a single create or update takes around 200 ms. That seems rather slow to me.
The query looks like this:
MERGE (a:TYP1 { name: {name}, version: {version} })
SET
a.ATTR1={param1},
a.ATTR2={param2},
a.ATTR3={param3},
a.ATTR4={param4},
a.ATTR5={param5}
return id(a)
There is an index on name, version and two of the attributes.
Why does it take so long? And what can I try to improve the situation?
I could imagine that one problem is that every request must create a new connection. Is there a way to keep the HTTP connection open for multiple requests?
I'm pretty sure you can only use one index per query per label, so depending on your data the index usage might not be efficient.
As far as a persistent connection, that is possible, though I think it would depend on the library you're using to connect to the REST API. In the ruby neo4j gem we use the Faraday gem which has a NetHttpPersistent adapter.
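From Node.js specifically, one way to avoid opening a new TCP connection per request is a keep-alive agent. A small, untested sketch using the request module (the same module the batch example further down uses):
var request = require('request');

// forever: true makes request reuse sockets across calls instead of
// opening a fresh connection for every MERGE.
var neo4j = request.defaults({ forever: true, json: true });

neo4j.post({
    uri: 'http://localhost:7474/db/data/transaction/commit',
    body: { statements: [ /* your MERGE statement + parameters here */ ] }
}, function (err, res, body) {
    // handle err, inspect body.results / body.errors
});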
The index is only used when you MERGE on ONE attribute.
If you need to merge on both, create a compound property, index it (or better, use a constraint) and merge on that compound property.
Use ON CREATE SET, otherwise you (over)write the attributes every time, even if you didn't actually create the node.
Adapted Statement
MERGE (a:TYP1 { name_version: {name_version} })
ON CREATE SET
a.version = {version},
a.name = {name},
a.ATTR1 = {param1},
a.ATTR2 = {param2},
a.ATTR3 = {param3},
a.ATTR4 = {param4},
a.ATTR5 = {param5}
RETURN id(a)
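If the compound property is not constrained yet, the corresponding uniqueness constraint (Neo4j 2.x Cypher, run once, e.g. from the Neo4j shell or browser) would look something like:
CREATE CONSTRAINT ON (a:TYP1) ASSERT a.name_version IS UNIQUE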
This is an example of how you can execute a batch of Cypher queries from node.js in a single round trip to Neo4j.
To run it,
get nodejs installed (if you don't have it already)
get a token from https://developers.facebook.com/tools/explorer giving you access to user_groups
run it as > node {yourFileName}.js {yourToken}
prerequisites:
var request = require("request");
var graph = require('fbgraph');
graph.setAccessToken(process.argv[2]);

function now() {
    var instant = new Date();
    return instant.getHours()
        + ':' + instant.getMinutes()
        + ':' + instant.getSeconds()
        + '.' + instant.getMilliseconds();
}
Get facebook data:
graph.get('me?fields=groups,friends', function(err,res) {
if (err) {
console.log(err);
throw now() +' Could not get groups from faceBook';
}
Create cypher statements
var batchCypher = [];
res.groups.data.forEach(function(group) {
var singleCypher = {
"statement" : "CREATE (n:group{group}) RETURN n, id(n)",
"parameters" : { "group" : group }
}
batchCypher.push(singleCypher);
Run them one by one
var fromNow = now();
request.post({
uri:"http://localhost:7474/db/data/transaction/commit",
json:{statements:[singleCypher]}
}, function(err,res) {
if (err) {
console.log('Could not commit '+ group.name);
throw err;
}
console.log('Used '+ fromNow +' - '+ now() +' to commit '+ group.name);
res.body.results.forEach(function(cypherRes) {
console.log(cypherRes.data[0].row);
});
})
});
Run them in batch
var fromNow = now();
request.post({
uri:"http://localhost:7474/db/data/transaction/commit",
json:{statements:batchCypher}
}, function(err,res) {
if (err) {
console.log('Could not commit the batch');
throw err;
}
console.log('Used '+ fromNow +' - '+ now() +' to commit the batch');
})
});
The log shows that a transaction for 5 groups is significantly slower than a transaction for 1 group, but significantly faster than 5 transactions of 1 group each.
Used 20:38:16.19 - 20:38:16.77 to commit Voiture occasion Belgique
Used 20:38:16.29 - 20:38:16.82 to commit Marches & Randonnées
Used 20:38:16.31 - 20:38:16.86 to commit Vlazarus
Used 20:38:16.34 - 20:38:16.87 to commit Wijk voor de fiets
Used 20:38:16.33 - 20:38:16.91 to commit Niet de bestemming maar de route maakt de tocht goed.
Used 20:38:16.35 - 20:38:16.150 to commit the batch
I just read your comment, Andreas, so it is not directly applicable for you, but you might use it to find out whether the time is spent in the communication or in the updates.

Mongo - new vs processed approach

I am new to Mongo and have gotten close to where I want to be after 3 days of banging my head against the keyboard, but now I think I may just be misunderstanding certain key concepts:
What I am trying to do:
I have a node script that is pulling in feed items from various sources very frequently and storing them (title, link, origin, processed:false)
I have another script pulling out records at random, one at a time, using them, and updating processed:true
End Goal: Items should be unique by title - if it's been seen before it should not be written to DB, and once it's been processed one time, it should never be processed again.
INSERT SCRIPT:
key = {'title':title};
data = {'origin':origin, 'title':title, 'original_link':original_url, 'processed':false};
collection.update(key, data, {upsert:true}, function(err, doc) { ...
READ SCRIPT:
collection.findOne({processed: false}, function (err, doc) {
    if (err) throw err;
    logger.info("Read out the following item from mongodb:...");
    console.dir(doc);
    var thisId = doc._id;
    markProcessed(thisId);
});
var markProcessed = function (id) {
    collection.update({ _id: id },
        {
            $set: {'processed': true},
        }, function (err, doc) {
            if (err) throw err;
            logger.info("Marked record:" + id + " as processed");
            console.dir(doc);
        }
    );
};
I've tried using collection.ensureIndex({'title':1}, {unique:true}) with no success either.
As the two scripts run in parallel, the read script ends up repeating work on already-processed records, and although the markProcessed function was working all day yesterday, it miraculously does not today :)
I would very much appreciate any guidance.
There is a problem with your insert script. When you use collection.update and you already have a document with the same key in the database, that document will be overwritten with the new one. A unique index doesn't prevent this, because there are never two documents with the same title in the collection at the same time.
When you don't want to overwrite an existing record, use collection.insert, which will fail when the inserted document violates a unique index.
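If you would rather keep the upsert, another option (a sketch, not tested against your setup) is MongoDB's $setOnInsert operator: it only writes the fields when the document is first created, so a title that has been seen before never gets its processed flag reset to false:
collection.update(
    { title: title },                 // title is taken from the query on insert
    {
        $setOnInsert: {
            origin: origin,
            original_link: original_url,
            processed: false
        }
    },
    { upsert: true },
    function (err, result) {
        if (err) throw err;
    }
);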